WO2025007034A1 - Methods for determining surveillance and therapy for diseases - Google Patents

Methods for determining surveillance and therapy for diseases

Info

Publication number
WO2025007034A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
individuals
information
individual
computing system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/036222
Other languages
French (fr)
Inventor
Helmy Eltoukhy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guardant Health Inc
Original Assignee
Guardant Health Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guardant Health Inc
Publication of WO2025007034A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57484 Immunoassay; Biospecific binding assay; Materials therefor for cancer involving compounds serving as markers for tumor, cancer, neoplasia, e.g. cellular determinants, receptors, heat shock/stress proteins, A-protein, oligosaccharides, metabolites
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2800/00 Detection or diagnosis of diseases
    • G01N2800/52 Predicting or monitoring the response to treatment, e.g. for selection of therapy based on assay results in personalised medicine; Prognosis
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2800/00 Detection or diagnosis of diseases
    • G01N2800/54 Determining the risk of relapse

Definitions

  • Cancer is a major cause of disease worldwide. Each year, tens of millions of people are diagnosed with cancer around the world, and more than half of the patients eventually die from it. In many countries, cancer ranks as the second most common cause of death, after cardiovascular diseases. Early detection is associated with improved outcomes for many cancers.
  • Laboratory tests are another type of screening test and may include medical procedures to procure samples of tissue, blood, urine, or other substances in the body before laboratory testing is conducted. Imaging procedures screen for cancer by generating visual representations of areas inside the body. Genetic tests detect certain deleterious gene mutations linked to some types of cancer. Genetic testing is particularly useful for a number of diagnostic methods.
  • Described herein is a method, including: determining a state of biological molecules obtained from a sample derived from a human subject, testing for minimal residual disease (MRD), determining the likelihood of recurrence based on the MRD test, and generating a schedule for one or more additional MRD tests based on the determination of the likelihood of recurrence.
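The scheduling step can be pictured with a minimal sketch, assuming the likelihood of recurrence is available as a number between 0 and 1. The risk tiers, testing intervals, and the `mrd_schedule` function are hypothetical illustrations and are not taken from the source; the source does not specify how the schedule is derived.

```python
from datetime import date, timedelta

# Hypothetical risk tiers: (minimum likelihood, days between MRD tests).
# These cut-offs are illustrative assumptions, not values from the source.
RISK_TIERS = [(0.50, 30), (0.20, 90), (0.0, 180)]


def mrd_schedule(likelihood, start, n_tests=4):
    """Return dates for follow-up MRD tests given a recurrence likelihood in [0, 1]."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be between 0 and 1")
    # Pick the first tier whose floor the likelihood reaches (tiers are sorted high to low).
    interval = next(days for floor, days in RISK_TIERS if likelihood >= floor)
    return [start + timedelta(days=interval * i) for i in range(1, n_tests + 1)]


schedule = mrd_schedule(0.6, date(2025, 1, 1), n_tests=3)
```

In this sketch a higher likelihood maps to a shorter interval, so surveillance becomes more frequent as estimated risk rises.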
  • the biological molecules are one or more of DNA, methylated DNA, RNA, methylated RNA, proteins, and peptides.
  • testing for MRD includes combining a plurality of nucleic acid molecules derived from a subject with a solution including an amount of methyl binding domain (MBD) proteins to produce a nucleic acid-MBD protein solution; and performing a plurality of washes of the nucleic acid-MBD protein solution with a salt solution to produce a number of nucleic acid fractions, individual nucleic acid fractions having a threshold number of methylated cytosines in regions of the plurality of nucleic acids having at least a threshold cytosine-guanine content.
  • a wash of the plurality of washes is performed with a solution having a concentration of sodium chloride (NaCl) and produces a nucleic acid fraction, of the number of nucleic acid fractions, having a range of binding strengths to MBD proteins.
  • the method includes determining that a first nucleic acid fraction is associated with a first partition of a plurality of partitions of nucleic acids, the first partition corresponding to a first range of binding strengths to MBD proteins, attaching a first molecular barcode to nucleic acids of the first nucleic acid fraction, the first molecular barcode being included in a first set of molecular barcodes associated with the first partition, determining that a second nucleic acid fraction is associated with a second partition of the plurality of partitions of nucleic acids, the second partition corresponding to a second range of binding strengths to MBD proteins different from the first range of binding strengths to MBD proteins, and attaching a second molecular barcode to nucleic acids of the second nucleic acid fraction, the second molecular barcode being included in a second set of molecular barcodes associated with the second partition.
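As a rough illustration of partition-specific barcoding, the sketch below assigns each fraction a barcode from a set reserved for its partition, so the partition can be recovered later from the barcode alone. The partition names, barcode sequences, and helper functions are hypothetical; the source only states that each partition has its own set of molecular barcodes.

```python
# Hypothetical partitions keyed by MBD binding strength, each with its own barcode set.
PARTITION_BARCODES = {
    "hyper": ["AAGT", "ACCA"],
    "intermediate": ["CGTA", "CTAG"],
    "hypo": ["GATC", "GTCA"],
}
# Reverse lookup used to decode a partition from a barcode read back by sequencing.
BARCODE_TO_PARTITION = {
    bc: partition for partition, bcs in PARTITION_BARCODES.items() for bc in bcs
}


def tag_fraction(molecules, partition):
    """Prefix each molecule with a barcode drawn round-robin from the partition's set."""
    barcodes = PARTITION_BARCODES[partition]
    return [barcodes[i % len(barcodes)] + seq for i, seq in enumerate(molecules)]


def decode_partition(tagged, barcode_len=4):
    """Recover the partition of a tagged molecule from its leading barcode."""
    return BARCODE_TO_PARTITION[tagged[:barcode_len]]


tagged = tag_fraction(["TTCGCGAA", "GGCGCC", "ACGT"], "hyper")
```

Because the barcode sets do not overlap, fractions can be pooled after tagging and still be traced back to their partition.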
  • the method includes combining at least a portion of the number of nucleic acid fractions with an amount of restriction enzyme that cleaves molecules with one or more unmethylated cytosines to produce at least a portion of the plurality of samples used to produce the sequencing reads, wherein the threshold amount of methylated cytosines corresponds to a minimum frequency of methylated cytosines within a region having at least the threshold cytosine-guanine content.
  • the method includes combining at least a portion of the number of nucleic acid fractions with an amount of a restriction enzyme that cleaves molecules with one or more methylated cytosines to produce at least a portion of the plurality of samples used to produce the sequencing reads, wherein the threshold amount of unmethylated cytosines corresponds to a maximum frequency of methylated cytosines that are not cleaved within a region having at least the threshold cytosine-guanine content.
  • testing for MRD includes sequencing nucleic acid molecules derived from a sample obtained from a subject, analyzing sequence reads derived from the sequencing to identify one or more driver mutations in the nucleic acid molecules, and using information about the presence, absence, or amount of the one or more driver mutations in the nucleic acid molecules to identify a tumor in the subject.
  • the nucleic acid molecules are cell-free DNA.
  • the sample is at least one of blood, serum, plasma or tissue.
  • the method includes determination of treatment for the subject.
  • a limit of detection for the model to determine tumor fraction of samples is no greater than 0.05%.
  • the one or more driver mutations include a somatic variant detected at a mutant allele frequency (MAF) of no more than 0.05%. In other embodiments, the one or more driver mutations include a fusion detected at a mutant allele frequency (MAF) of no more than 0.1%. In other embodiments, the method includes detecting mutation distributions for each of one or more driver mutations, wherein the mutation distribution for each of the one or more driver mutations is detected with a correlation of at least 0.99 to a mutation distribution of the driver mutation detected in a cohort of the subject by tissue genotyping. In other embodiments, the method detects the tumor in the subject with a sensitivity of at least 85%, a specificity of at least 99%, and a diagnostic accuracy of at least 99%.
  • the method includes identifying circulating tumor DNA (ctDNA) and one or more driver mutations in the ctDNA.
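For readers unfamiliar with these figures, the sketch below shows the standard arithmetic behind a mutant allele frequency and behind sensitivity and specificity. The read counts and confusion-matrix counts are made-up example values, and the function names are illustrative.

```python
def mutant_allele_frequency(mutant_reads, total_reads):
    """Fraction of reads at a locus that carry the mutant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mutant_reads / total_reads


def sensitivity(true_pos, false_neg):
    """Share of subjects with a tumor that the test detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of tumor-free subjects that the test reports as negative."""
    return true_neg / (true_neg + false_pos)


# 5 mutant reads out of 10,000 corresponds to a MAF of 0.05%.
maf = mutant_allele_frequency(5, 10_000)
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=995, false_pos=5)
```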
  • the method includes obtaining, by a computing system having one or more hardware processors and memory, testing sequence data from a subject, the testing sequence data including testing sequencing reads derived from a sample of the subject, analyzing, by the computing system, the testing sequencing reads to determine a first quantitative measure derived from the testing sequencing reads to genomic regions of a reference genome, analyzing, by the computing system, the testing sequencing reads to determine a second quantitative measure derived from the testing sequencing reads to genomic regions of a reference genome, determining, by the computing system, a metric based on the first quantitative measure and the second quantitative measure, generating, by the computing system, an input vector that includes the metric, and determining, by the computing system, an indication of cancer status in the subject by providing the input vector to a model that implements one or more machine learning techniques to generate indications of cancer status in subjects, the model including weights for individual classification regions of a
  • the individual testing sequencing reads include a nucleotide sequence corresponding to a fragment of a nucleic acid included in the sample and the individual testing sequencing reads correspond to molecules having a threshold amount of methylated cytosines included in regions of the nucleotide sequence having at least the threshold cytosine-guanine content, the first quantitative measure derived from the testing sequencing reads that correspond to individual classification regions of a plurality of classification regions, at least a portion of the individual classification regions of the plurality of classification regions corresponding to genomic regions of a reference genome that have the threshold amount of methylated cytosines in subjects in which cancer is detected and that have at least the threshold cytosine-guanine content, the second quantitative measure derived from the testing sequencing reads that correspond to individual control regions of a plurality of control regions, individual control regions of the plurality of control regions corresponding to additional genomic regions of the reference genome that have at least the threshold cytosine-guanine content and that have at least the threshold amount of methylated cytosines in
  • the method includes obtaining, by the computing system having one or more hardware processors and memory, training sequence data including training sequencing reads derived from a plurality of samples of a plurality of training subjects, individual training sequencing reads including a nucleotide sequence corresponding to a fragment of a nucleic acid included in a sample of the plurality of samples and individual training sequencing reads corresponding to molecules having a threshold amount of methylated cytosines included in regions of the nucleotide sequence having at least a threshold cytosine-guanine content, analyzing, by the computing system, the training sequencing reads to determine an additional first quantitative measure derived from the training sequencing reads that corresponds to individual classification regions of the plurality of classification regions, analyzing, by the computing system, the training sequencing reads to determine an additional second quantitative measure derived from the training sequencing reads that correspond to a plurality of control regions, determining, by the computing system, an additional metric for the individual classification regions of the plurality of classification regions based on the additional first quantitative measure for the
  • the one or more machine learning algorithms include one or more classification algorithms. In other embodiments, the one or more machine learning algorithms include one or more regression algorithms, and the indication corresponds to an estimate of tumor fraction of the sample.
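One way to picture the metric, input vector, and weighted model described above is the sketch below. It assumes the metric is each classification region's read count divided by the mean control-region count, and it uses a plain logistic function as the classifier; both choices, along with the weights and counts, are illustrative assumptions rather than the method in the source.

```python
import math


def region_metrics(class_counts, control_counts):
    """Normalize each classification-region read count by the mean control-region count."""
    control_mean = sum(control_counts) / len(control_counts)
    if control_mean == 0:
        raise ValueError("control regions have no reads")
    return [count / control_mean for count in class_counts]


def cancer_score(input_vector, weights, bias):
    """Logistic model with one weight per classification region (assumed form)."""
    z = bias + sum(w * x for w, x in zip(weights, input_vector))
    return 1.0 / (1.0 + math.exp(-z))


# Example counts for three classification regions and three control regions.
vector = region_metrics([30, 0, 10], [10, 10, 10])
score = cancer_score(vector, weights=[1.0, 0.5, -1.0], bias=-2.0)
```

A regression variant would replace the logistic output with a continuous estimate such as tumor fraction.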
  • the training sequencing reads comprise a first portion of the training sequence data and additional training sequencing reads comprise a second portion of the training sequence data, wherein the additional training sequencing reads are different from the training sequencing reads; and the method including analyzing, by the computing system, at least one of the first portion of the training sequence data or the second portion of the training sequence data to determine an individual frequency of a plurality of variants present in an individual sample of the plurality of samples, determining, by the computing system and for the individual sample, a variant of the plurality of variants having a maximum frequency that corresponds to the individual frequency having a greatest value among individual frequencies derived from an individual sample, and determining, by the computing system, individual measures of tumor fraction for an individual sample based on the greatest value of the individual frequencies derived from the individual sample.
  • the training data includes the individual measures of tumor fraction for the individual samples of the plurality of samples, and the model is generated based on the individual measures of tumor fraction for the individual samples of the plurality of samples.
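The labeling step for training can be sketched as follows, assuming the measure of tumor fraction for a sample is taken directly as its greatest variant frequency. The source says only that the measure is based on that greatest value, so the direct use here is an assumption, and the variant names and frequencies are made-up example data.

```python
def max_variant_frequency(variant_freqs):
    """Return (variant, frequency) for the variant with the greatest frequency in one sample."""
    variant = max(variant_freqs, key=variant_freqs.get)
    return variant, variant_freqs[variant]


def tumor_fraction_labels(samples):
    """Map each sample ID to its assumed tumor-fraction label (the maximum variant frequency)."""
    return {sample_id: max_variant_frequency(freqs)[1] for sample_id, freqs in samples.items()}


# Made-up example data: per-sample variant frequencies.
samples = {
    "S1": {"KRAS G12D": 0.012, "TP53 R175H": 0.034},
    "S2": {"EGFR L858R": 0.002},
}
labels = tumor_fraction_labels(samples)
```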
  • the method includes generating, by a computing system including processing circuitry and memory, a data file including first tokens generated using a first hash function, individual first tokens corresponding to a respective individual of a group of individuals having data stored by a molecular data repository, sending, by the computing system, the data file to a health insurance claims data management system, obtaining, by the computing system and from the health insurance claims data management system, in response to the data file, health data corresponding to the group of individuals, generating, by the computing system, a number of identifiers using a second hash function that is different from the first hash function, each identifier corresponds to one or more tokens related to each individual of the group of individuals, obtaining, by the computing system and using the number of identifiers, second data from the molecular data repository
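The two-hash arrangement can be illustrated with Python's standard `hashlib`. The sketch assumes SHA-256 for the first hash function and BLAKE2b for the second, and assumes a token is derived from a normalized name and date of birth; the source does not name the hash functions or the token inputs, so all of these are hypothetical.

```python
import hashlib


def first_token(name, date_of_birth):
    """First hash function (assumed SHA-256) over normalized personal fields."""
    payload = f"{name.strip().lower()}|{date_of_birth}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def identifier_from_tokens(tokens):
    """Second, different hash function (assumed BLAKE2b) over an individual's tokens."""
    joined = "|".join(sorted(tokens)).encode("utf-8")
    return hashlib.blake2b(joined, digest_size=16).hexdigest()


token = first_token("Jane Doe", "1970-01-01")
identifier = identifier_from_tokens([token])
```

Only tokens leave the system in the data file, and the identifier used internally cannot be matched to a token without recomputing the second hash.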
  • the method includes determining, by the computing system, a first set of data processing instructions that are executable in relation to first data stored by the integrated data repository, causing, by the computing system, the first set of data processing instructions to be executed to analyze first health insurance claims codes included in the first data to determine a first subset of the group of individuals in which a biological condition is present and generating, by the computing system, a first dataset indicating the subset of the group of individuals in which the biological condition is present.
  • the method includes determining, by the computing system, a second set of data processing instructions that are executable in relation to second data stored by the integrated data repository, causing, by the computing system, the second set of data processing instructions to be executed to analyze the second health insurance claims codes included in the second data to determine one or more treatments provided to a second subset of the group of individuals, and generating, by the computing system, a second dataset indicating the one or more treatments provided to the second subset of the group of individuals.
  • the method includes determining, by the computing system, a third subset of the group of individuals that includes a portion of the first subset of the group of individuals that overlaps with a portion of the second subset of the group of individuals, receiving, by the computing system, a request to perform an analysis of the first dataset and the second dataset in relation to the third subset of the group of individuals, and analyzing, by the computing system and in response to the request, the first dataset and the second dataset with respect to the third subset of the group of individuals to determine a measure of significance of a characteristic of the third subset of the group of individuals with respect to the biological condition.
  • the method includes determining, by the computing system, one or more genomic mutations present in the third subset of the group of individuals, determining, by the computing system, a plurality of treatments provided to the third subset of the group of individuals, and determining, by the computing system, respective survival rates for the third subset of the group of individuals.
  • the measure of significance corresponds to survival rate with respect to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.
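The subset logic in the preceding paragraphs reduces to set operations. In this sketch the third subset is the intersection of the individuals with the biological condition and the individuals who received a treatment, and the survival rate is computed over that intersection. The identifiers are made-up, and a real analysis would add a statistical test that the source does not specify.

```python
def overlap(first_subset, second_subset):
    """Individuals present in both subsets (the 'third subset')."""
    return first_subset & second_subset


def survival_rate(cohort, survived):
    """Fraction of the cohort recorded as surviving."""
    if not cohort:
        raise ValueError("cohort is empty")
    return len(cohort & survived) / len(cohort)


# Made-up de-identified IDs.
with_condition = {"id1", "id2", "id3", "id4"}
treated = {"id2", "id3", "id4", "id5"}
survived = {"id2", "id3", "id5"}

third_subset = overlap(with_condition, treated)
rate = survival_rate(third_subset, survived)
```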
  • the method includes determining, by the computing system and based on the measure of significance, an effectiveness of the treatment for the third subset of the group of individuals.
  • the method includes determining, by the computing system, individuals in the third subset of the group of individuals that have not received the treatment. In other embodiments, the method includes administering one or more therapeutically effective amounts of the treatment to the individuals in the third subset that have not received the treatment.
  • the integrated data repository is arranged according to a data repository schema that includes a plurality of data tables and a plurality of logical links between the plurality of data tables, individual logical links of the plurality of logical links indicating one or more rows of a data table of the plurality of data tables that correspond to one or more additional rows of an additional data table of the plurality of data tables.
  • the plurality of data tables include a first data table that stores genomics data of the group of individuals, a second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers, a third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table, a fourth data table that stores personal information of the group of individuals, a fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals, a sixth data table storing information corresponding to health insurance coverage information for the group of individuals, and a seventh data table that stores information related to pharmaceutical treatments obtained by the group of individuals.
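A reduced version of such a schema can be sketched with the standard `sqlite3` module. Only four of the seven tables are shown, the logical links are modeled as foreign-key references, and every table name, column name, and row value is a hypothetical example rather than the schema in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (person_id TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE genomics (
        person_id TEXT REFERENCES person(person_id), gene TEXT, variant TEXT);
    CREATE TABLE visit (
        visit_id INTEGER PRIMARY KEY,
        person_id TEXT REFERENCES person(person_id), visit_date TEXT);
    CREATE TABLE service (
        visit_id INTEGER REFERENCES visit(visit_id), claim_code TEXT);
    """
)
# Example rows keyed by a de-identified person identifier.
conn.execute("INSERT INTO person VALUES ('a1', 1960)")
conn.execute("INSERT INTO genomics VALUES ('a1', 'EGFR', 'L858R')")
conn.execute("INSERT INTO visit VALUES (1, 'a1', '2024-01-15')")
conn.execute("INSERT INTO service VALUES (1, 'C34.90')")

# Follow the logical links from genomics data to claims codes for the same person.
rows = conn.execute(
    """
    SELECT g.gene, s.claim_code
    FROM genomics g
    JOIN visit v ON v.person_id = g.person_id
    JOIN service s ON s.visit_id = v.visit_id
    """
).fetchall()
```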
  • the number of identifiers generated using the second hash function comprise intermediate identifiers; and the method includes applying, by the computing system, a salt function to the intermediate identifiers to generate a final set of identifiers.
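The salting step might look like the sketch below, which treats the "salt function" as a keyed HMAC-SHA-256 over each intermediate identifier. That interpretation, the salt value, and the function name are assumptions; the source does not describe how the salt is applied.

```python
import hashlib
import hmac


def salt_identifier(intermediate_id, salt):
    """Derive a final identifier from an intermediate one using a secret salt (assumed HMAC)."""
    return hmac.new(salt, intermediate_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical salt; in practice this would be a secret kept outside the repository.
final_id = salt_identifier("3f2a9c", b"example-salt")
```

Without the salt, the final identifiers cannot be regenerated from the intermediate ones, which limits re-identification by anyone holding only the intermediate values.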
  • the method includes obtaining, by the computing system, information from an additional data repository that includes electronic medical records of an additional group of individuals, determining, by the computing system, a subset of the additional group of individuals that corresponds to the group of individuals having data stored by the genomics data repository, and modifying, by the computing system, the integrated data repository to store at least a portion of the information of the medical records of the subset of the additional group of individuals in relation to the number of identifiers.
  • the method includes performing, by the computing system, one or more optical character recognition operations with respect to the additional information, analyzing, by the computing system, the additional information obtained from the additional data repository to determine one or more portions of the additional information to remove to produce a corpus of information.
  • the method includes analyzing, by the computing system, the corpus of information to determine a portion of the subset of the additional group of individuals that correspond to one or more biomarkers, and generating, by the computing system, one or more data structures that store identifiers of the portion of the subset of the additional group of individuals and that store an indication that the portion of the subset of the additional group of individuals corresponds to the one or more biomarkers.
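A very simple stand-in for the corpus analysis is a keyword scan that records which individuals have a biomarker mentioned in their cleaned records. The biomarker list, record text, and `biomarker_index` function are hypothetical, and a pattern match of this kind is only a sketch of the idea, not the analysis used in the source.

```python
import re

# Hypothetical biomarker gene names to look for in cleaned medical-record text.
BIOMARKER_PATTERN = re.compile(r"\b(EGFR|KRAS|BRAF)\b", re.IGNORECASE)


def biomarker_index(corpus):
    """Map each individual's identifier to the sorted biomarkers mentioned in their text."""
    index = {}
    for person_id, text in corpus.items():
        found = {match.upper() for match in BIOMARKER_PATTERN.findall(text)}
        if found:
            index[person_id] = sorted(found)
    return index


corpus = {
    "p1": "Pathology: EGFR exon 19 deletion detected.",
    "p2": "No actionable findings.",
    "p3": "kras G12C; BRAF wild type",
}
index = biomarker_index(corpus)
```

Note that a plain keyword match also flags negative mentions such as "wild type", so a practical pipeline would need context handling beyond this sketch.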
  • the method includes storing, by the computing system, the one or more data structures in an intermediate data repository, performing, by the computing system, one or more de-identification operations with respect to the identifiers of the portion of the subset of the additional group of individuals before modifying the integrated data repository to store at least a portion of the additional information of the medical records of the portion of the subset of the additional group of individuals in relation to the number of identifiers.
  • the molecular data repository stores at least one or more of genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, or proteomic information.
  • determining the likelihood of recurrence includes using an MRD test, real world evidence (RWE), or both.
  • Described herein is a method, including: determining a state of biological molecules obtained from a sample derived from a human subject, testing for minimal residual disease (MRD), determining the likelihood of recurrence based on the MRD test, recommending and/or administering treatment.
  • the methods described herein determine an assessment including comprehensive evaluation, diagnostic testing, molecular and genetic profiling and/or risk assessment.
  • the methods described herein determine a treatment plan including patient consultation, treatment strategy and/or tailored treatment.
  • the methods described herein determine a treatment, including pre-treatment steps and/or administration of the treatment.
  • the methods described herein determine monitoring and/or adjustment, including one or more follow-ups and/or response assessment.
  • the methods described herein determine long-term management and/or survivorship, which can include post-treatment surveillance and/or recurrence management.
  • Figure 1 illustrates an example architecture to generate an integrated data repository that includes multiple types of healthcare data, according to one or more implementations.
  • Figure 2 illustrates an example framework corresponding to an arrangement of data tables in an integrated data repository, according to one or more implementations.
  • Figure 3 illustrates an architecture to generate one or more datasets from information retrieved from a data repository that integrates health related data from a number of sources, according to one or more implementations.
  • Figure 4 illustrates an architecture to generate an integrated data repository that includes de-identified health insurance claims data and de-identified genomics data, according to one or more implementations.
  • Figure 5 illustrates a framework to generate a dataset, by a data pipeline system, based on data stored by an integrated data repository, according to one or more implementations.
  • Figure 6 is a schematic diagram of an architecture to incorporate medical records data into an integrated data repository.
  • Figure 7 is a data flow diagram of an example process to generate an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations.
  • Figure 8 is a data flow diagram of an example process to generate a number of datasets used to analyze information stored by an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations.
  • Figure 9 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, according to one or more implementations.
  • Figure 10 illustrates a diagrammatic representation for tailoring the aggressiveness of patient surveillance based on a likelihood of tumor recurrence obtained from MRD testing outcomes using testing data and real world evidence.
  • Figure 11 illustrates a diagrammatic representation for treatment planning.
  • Figure 12 illustrates a diagrammatic representation for treatment implementation.
  • Figure 13 illustrates a diagrammatic representation for monitoring and adjustment.
  • Figure 14 illustrates a diagrammatic representation for long term management and planning.
  • the term “about” and its grammatical equivalents in relation to a reference numerical value can include a range of values up to plus or minus 10% from that value.
  • the amount “about 10” can include amounts from 9 to 11.
  • the term “about” in relation to a reference numerical value can include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value.
  • the term “at least” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and greater than that value.
  • the amount “at least 10” can include the value 10 and any numerical value above 10, such as 11, 100, and 1,000.
  • the term “at most” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and less than that value.
  • the amount “at most 10” can include the value 10 and any numerical value under 10, such as 9, 8, 5, 1, 0.5, and 0.1.
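These three definitions translate directly into range checks. The sketch below uses the default plus or minus 10% reading of "about"; the function names are illustrative.

```python
def about_range(value, pct=10.0):
    """Return the (low, high) bounds covered by 'about value' at plus or minus pct percent."""
    delta = abs(value) * pct / 100.0
    return value - delta, value + delta


def is_about(x, value, pct=10.0):
    """True if x falls within 'about value'."""
    low, high = about_range(value, pct)
    return low <= x <= high


def is_at_least(x, value):
    """'At least value' includes the value itself and anything greater."""
    return x >= value


def is_at_most(x, value):
    """'At most value' includes the value itself and anything smaller."""
    return x <= value


bounds = about_range(10)
```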
  • DNA methylation profiling can be used to detect regions with different extents of methylation (“differentially methylated regions” or “DMRs”) of the genome that are altered during development or that are perturbed by disease, for example, cancer or any cancer-associated disease.
  • the genomes of cancer cells harbor an imbalance in the above DNA methylation patterns, and therefore in the functional packaging of the DNA.
  • the abnormalities of chromatin organization are therefore coupled with methylation changes and may contribute to enhanced cancer profiling when analyzed jointly.
  • Combining MBD-partitioning with fragmentomic data, such as mapped fragment start and stop positions (correlated with nucleosome positions), fragment length, and associated nucleosome occupancy, can be used for chromatin structure analysis in hypermethylation studies, with the aim of improving the biomarker detection rate.
  • Methylation profiling can involve determining methylation patterns across different regions of the genome. For example, after partitioning molecules based on extent of methylation (e.g., relative number of methylated sites per molecule) and sequencing, the sequences of molecules in the different partitions can be mapped to a reference genome. This can show regions of the genome that, compared with other regions, are more highly methylated or are less highly methylated. In this way, genomic regions, in contrast to individual molecules, may differ in their extent of methylation.
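The mapping step can be illustrated by tallying, for each genomic region, how many mapped molecules came from each partition. The partition labels, region names, and the use of the hypermethylated fraction as the per-region summary are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def region_methylation(reads):
    """reads: iterable of (region, partition). Returns the hyper-partition fraction per region."""
    counts = defaultdict(Counter)
    for region, partition in reads:
        counts[region][partition] += 1
    return {region: c["hyper"] / sum(c.values()) for region, c in counts.items()}


# Made-up mapped reads, each labeled with the partition it was sequenced from.
reads = [
    ("chr1:1000-2000", "hyper"),
    ("chr1:1000-2000", "hyper"),
    ("chr1:1000-2000", "hypo"),
    ("chr2:500-900", "hypo"),
]
profile = region_methylation(reads)
```

Regions with a high fraction are more highly methylated relative to regions with a low fraction, which is the comparison the paragraph above describes.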
  • a characteristic of nucleic acid molecules may be a modification, which may include various chemical or protein modifications (i.e., epigenetic modifications).
  • chemical modifications may include, but are not limited to, covalent DNA modifications, including DNA methylation.
  • DNA methylation includes addition of a methyl group to a cytosine at a CpG site (a cytosine followed by a guanine in a nucleic acid sequence).
  • DNA methylation includes addition of a methyl group to adenine, such as in N6-methyladenine.
  • DNA methylation is 5- methylation (modification of the 5th carbon of the 6 carbon ring of cytosine).
  • 5-methylation includes addition of a methyl group to the 5C position of the cytosine to create 5 -methylcytosine (m5c).
  • methylation includes a derivative of m5c.
  • Derivatives of m5c include, but are not limited to, 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC), and 5-carboxylcytosine (5-caC).
  • DNA methylation is 3C methylation (modification of the 3rd carbon of the 6 carbon ring of cytosine).
  • 3C methylation includes addition of a methyl group to the 3C position of the cytosine to generate 3-methylcytosine (3mC).
  • Other examples include N6-methyladenine or glycosylation.
  • DNA methylation includes addition of methyl groups to DNA (e.g., at CpG sites) and can change the expression of the methylated DNA region. Methylation can also occur at non-CpG sites, for example, at a CpA, CpT, or CpC site. DNA methylation can change the activity of the methylated DNA region. For example, when DNA in a promoter region is methylated, transcription of the gene may be repressed.
  • a CpG dyad is the dinucleotide CpG (cytosine-phosphate-guanine, i.e. a cytosine followed by a guanine in a 5’ - 3’ direction of the nucleic acid sequence) on the sense strand and its complementary CpG on the antisense strand of a double-stranded DNA molecule.
  • CpG dyads can be either fully methylated or hemi-methylated (methylated on one strand only).
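As a small illustration of the dyad states described above, the sketch below locates CpG dinucleotides on a sense-strand sequence and classifies a dyad from the methylation status of its two strands. The function names and the "unmethylated" label for the remaining case are assumptions made for this example.

```python
def cpg_positions(sense_seq: str) -> list[int]:
    """Return 0-based indices of the C in every CpG on the sense strand."""
    seq = sense_seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

def classify_dyad(sense_methylated: bool, antisense_methylated: bool) -> str:
    """Classify a CpG dyad from the methylation status of each strand."""
    if sense_methylated and antisense_methylated:
        return "fully methylated"
    if sense_methylated or antisense_methylated:
        return "hemi-methylated"  # methylated on one strand only
    return "unmethylated"

positions = cpg_positions("ACGTCGA")
state = classify_dyad(sense_methylated=True, antisense_methylated=False)
```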
  • the CpG dinucleotide is underrepresented in the normal human genome, with the majority of CpG dinucleotide sequences being transcriptionally inert (e.g. DNA heterochromatic regions in pericentromeric parts of the chromosome and in repeat elements) and methylated. However, many CpG islands are protected from such methylation especially around transcription start sites (TSS).
  • Protein modifications include binding to components of chromatin, particularly histones including modified forms thereof, and binding to other proteins, such as proteins involved in replication or transcription.
  • the disclosure provides methods of processing and analyzing nucleic acids with different extents of modification, such that the nature of their original modification is correlated with a nucleic acid tag and can be decoded by sequencing the tag when nucleic acids are analyzed. Genetic variation of sample nucleic acid modifications can then be associated with the extent of modification (epigenetic variation) of that nucleic acid in the original sample. The nucleic acids can include single stranded (e.g., ssDNA or RNA) or double stranded molecules (e.g., dsDNA).
  • the loss of DNA can reduce the amount of one or more types of DNA, such as cfDNA, to the point that their presence is difficult to detect.
  • existing methods to measure DNA methylation, such as enrichment or depletion methods, can have a resolution limited to about 100 base pairs (bp) to about 200 bp, which can make it difficult to accurately determine an amount of methylation of DNA.
  • the accuracy with which DNA methylation is determined can impact the accuracy of estimates of tumor fraction for samples. Since tumor fraction can be used to determine whether a sample is derived from a subject in which a tumor is present or not, the accuracy of determinations of tumor fraction estimates can impact diagnosis and/or treatment decisions for individuals.
  • a healthcare provider may refer to an entity, individual, or group of individuals involved in providing care to individuals in relation to at least one of the treatment or prevention of one or more biological conditions.
  • a biological condition can refer to an abnormality of function and/or structure in an individual to such a degree as to produce or threaten to produce a detectable feature of the abnormality.
  • a biological condition can be characterized by external and/or internal characteristics, signs, and/or symptoms that indicate a deviation from a biological norm in one or more populations.
  • a biological condition can include one or more molecular phenotypes.
  • a biological condition may correspond to genetic or epigenetic lesions.
  • a biological condition can include at least one of one or more diseases, one or more disorders, one or more injuries, one or more syndromes, one or more disabilities, one or more infections, one or more isolated symptoms, or other atypical variations of biological structure and/or function of individuals.
  • a treatment can refer to a substance, procedure, routine, device, and/or other intervention that can be administered or performed with the intent of treating one or more effects of a biological condition in an individual.
  • a treatment may include a substance that is metabolized by the individual.
  • the substance may include a composition of matter, such as a pharmaceutical composition.
  • the substance may be delivered to the individual via a number of methods, such as ingestion, injection, absorption, or inhalation.
  • a treatment may also include physical interventions, such as one or more surgeries.
  • the treatment can include a therapeutically meaningful intervention.
  • the healthcare data typically analyzed by existing systems includes unstructured data.
  • Unstructured data can include data that is not organized according to a pre-defined or standardized format.
  • unstructured data may include notes made by a healthcare provider that consist of free text. That is, the manner in which the notes are captured does not include pre-defined inputs that are selectable by the healthcare provider, such as via a dropdown menu or via a list. Rather, the notes include text entered by a healthcare provider that may include sentences, sentence fragments, words, letters, symbols, abbreviations, one or more combinations thereof, and so forth.
  • unstructured data may be partially structured. For example, a provider could select an insurance billing code from a predefined list of insurance billing codes, and add unstructured notes to data associated with that billing code.
  • Existing systems typically devote a large amount of computing resources to analyzing unstructured data in order to extract information that may be relevant to analyses being performed by the existing systems.
  • existing systems may analyze unstructured data and transform the unstructured data to a structured format in order to facilitate the analysis of the previously unstructured data.
  • the analysis of unstructured data by existing systems can be inefficient as well as inaccurate.
  • when the unstructured data is obtained from healthcare data, the importance of accurately analyzing the information is high because the analysis may be related to at least one of the treatment or diagnosis of a number of individuals with respect to one or more biological conditions. Thus, inaccurate analyses of healthcare data may result in suboptimal treatment of individuals.
  • the implementations of techniques, architectures, frameworks, systems, processes, and computer-readable instructions described herein are directed to analyzing health insurance claims data to derive information about at least one of the health or treatment of individuals.
  • health insurance claims data is structured according to one or more formats and stored by a number of data tables.
  • the data tables may include codes or other alphanumeric information indicating treatments received by individuals, dates of treatments, dosage information, diagnoses of individuals with respect to one or more biological conditions, information related to visits to healthcare providers, dates of visits to healthcare providers, billing information, and the like.
  • the implementations described herein may be used to accurately analyze health insurance claims data for hundreds, up to thousands, up to tens of thousands of individuals or more in which one or more biological conditions are present. In various examples, tens of thousands, hundreds of thousands, up to millions of rows and/or columns of health insurance claims data may be analyzed to determine health-related information for individuals in which one or more biological conditions are present.
  • the implementations described herein can integrate molecular data with health insurance claims data.
  • the molecular data may include information derived from tissue samples extracted from a number of individuals.
  • the molecular data may also include information derived from blood samples extracted from a number of individuals.
  • the molecular data may include genomics data.
  • the health insurance claims data may be integrated with germline genetic information for a number of individuals.
  • An integrated data repository may be created that combines the health insurance claims data for individuals with the molecular data of the individuals.
  • an identifier may be generated for an individual that is associated with both the health insurance claims data of the individual and the molecular data of the individual. Both the molecular data and the health insurance claims data stored by the integrated data repository may be accessible using a single identifier of the individual.
  • the identifier for an individual may include an encrypted security key.
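One way to obtain a single identifier that resolves to both the claims record and the molecular record is a keyed hash over normalized personal fields. The sketch below uses HMAC-SHA256 from the Python standard library; this is a technique substituted here for illustration, since the disclosure states only that the identifier may include an encrypted security key. The field choice, normalization, and function name `make_link_id` are assumptions.

```python
import hashlib
import hmac

def make_link_id(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Derive a stable pseudonymous identifier from normalized personal fields."""
    # Normalize so that formatting differences between sources do not matter.
    message = "|".join(s.strip().lower() for s in (first, last, dob)).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

key = b"example-key-not-for-production"  # hypothetical secret held by the linking party
claims_id = make_link_id(key, "Ada", "Lovelace", "1815-12-10")
molecular_id = make_link_id(key, " ada ", "LOVELACE", "1815-12-10")
```

Because both sources yield the same digest for the same person, records can be joined on the identifier without storing the underlying personal fields alongside them.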
  • the integrated data repository may include a number of data tables corresponding to different aspects of the data stored within the data repository.
  • a first data table may be generated that includes summary data of individuals included in the integrated data repository, such as personal information.
  • a second data table may be generated that includes data corresponding to visits to healthcare providers.
  • a third data table may be generated indicating medical procedures provided to individuals and a fourth data table may be generated indicating information related to prescriptions obtained by individuals.
  • a fifth data table may be generated that includes multiomics profiling of individuals. Multiomics profiles may include at least one of genomic profiles, transcriptomic profiles, epigenetic profiles, or proteomic profiles.
  • the data tables included in the integrated data repository may be linked via logical links. In this way, a query to retrieve information from one data table may cause information from one or more additional data tables to be retrieved.
  • Information stored by the linked data tables may be accessed to generate a number of different datasets that may be used to analyze the information stored by the integrated data repository.
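The logically linked tables described above can be sketched with an in-memory SQLite database, in which a shared identifier column lets one query pull related rows from several tables. The table names, column names, and every row of data below are invented for illustration and do not come from the disclosure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE individuals (link_id TEXT PRIMARY KEY, birth_year INTEGER);
CREATE TABLE prescriptions (
    link_id TEXT REFERENCES individuals(link_id),
    drug_code TEXT, fill_date TEXT);
CREATE TABLE multiomics (
    link_id TEXT REFERENCES individuals(link_id),
    gene TEXT, variant TEXT);
INSERT INTO individuals VALUES ('id-001', 1960), ('id-002', 1975);
INSERT INTO prescriptions VALUES
    ('id-001', 'DRUG-A', '2023-01-15'), ('id-002', 'DRUG-B', '2023-02-01');
INSERT INTO multiomics VALUES
    ('id-001', 'EGFR', 'L858R'), ('id-002', 'KRAS', 'G12C');
""")

# One query follows the link_id across the summary, prescription,
# and multiomics tables to assemble a small analysis dataset.
rows = conn.execute("""
SELECT i.link_id, p.drug_code, m.gene, m.variant
FROM individuals AS i
JOIN prescriptions AS p ON p.link_id = i.link_id
JOIN multiomics AS m ON m.link_id = i.link_id
WHERE m.gene = 'EGFR'
""").fetchall()
```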
  • the information stored by the integrated data repository may be analyzed by one or more algorithms to generate datasets that are organized according to one or more schemas.
  • the datasets may indicate treatment received by an individual over a period of time with respect to a biological condition.
  • the datasets may also indicate cohorts of individuals included in the integrated data repository having a number of common characteristics.
  • the datasets may consolidate and arrange information from a number of different data sources, including the integrated data repository.
  • the datasets may be analyzed with respect to a number of queries to indicate information that may be of interest to at least one of healthcare providers, patients, or providers of treatments of biological conditions.
  • one or more datasets may be analyzed to more accurately determine a survival rate of individuals in which a biological condition is present and having a specified genomic profile in response to receiving a specified treatment.
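The disclosure does not name a survival estimator. As one common choice, the sketch below computes a Kaplan-Meier curve for a cohort that has already been filtered to a given condition, genomic profile, and treatment. The function name and the example follow-up data are assumptions.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per individual (e.g., months).
    events: True if death was observed, False if the record was censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Invented cohort: follow-up in months and whether death was observed.
durations = [5, 8, 8, 12, 15]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
```

Censored individuals (for example, those whose claims history ends before an outcome is observed) reduce the at-risk count without lowering the survival estimate.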
  • the implementations described herein may provide a platform to integrate health insurance claims data and molecular data for individuals that is not found in existing systems that typically rely on electronic medical records that include an amount of unstructured data.
  • the implementations described herein may provide more accurate characterizations of the integrated data in relation to existing systems that rely on relatively inaccurate, unstructured electronic medical records data.
  • implementations described herein generate analytics ready datasets that enable the analysis of health information about individuals in a confidential and anonymized manner.
  • a sample can be any biological sample isolated from a subject.
  • a sample can be a bodily sample.
  • Samples can include body tissues, such as known or suspected solid tumors, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells or leucocytes, endothelial cells, tissue biopsies, cerebrospinal fluid, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, pleural effusions, saliva, mucus, sputum, semen, sweat, and urine. Samples are preferably body fluids, particularly blood and fractions thereof, and urine.
  • a sample can be in the form originally isolated from a subject or can have been subjected to further processing to remove or add components, such as cells, or enrich for one component relative to another.
  • a preferred body fluid for analysis is plasma or serum containing cell-free nucleic acids.
  • a sample can be isolated or obtained from a subject and transported to a site of sample analysis. The sample may be preserved and shipped at a desirable temperature, e.g., room temperature, 4°C, -20°C, and/or -80°C.
  • a sample can be isolated or obtained from a subject at the site of the sample analysis.
  • the subject can be a human, a mammal, an animal, a companion animal, a service animal, or a pet.
  • the subject may have a cancer.
  • the subject may not have cancer or a detectable cancer symptom.
  • the subject may have been treated with one or more cancer therapies, e.g., any one or more of chemotherapies, antibodies, vaccines or biologics.
  • the subject may be in remission.
  • the subject may or may not be diagnosed as being susceptible to cancer or any cancer-associated genetic mutations/disorders.
  • the volume of plasma can depend on the desired read depth for sequenced regions. Exemplary volumes are 0.4-40 ml, 5-20 ml, 10-20 ml.
  • the volume can be 0.5 mL, 1 mL, 5 mL, 10 mL, 20 mL, 30 mL, or 40 mL.
  • a volume of sampled plasma may be 5 to 20 mL.
  • a sample can comprise various amounts of nucleic acid that contains genome equivalents.
  • a sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cfDNA, about 200 billion (2×10^11) individual polynucleotide molecules.
  • a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
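The figures above can be checked with a short calculation. The constants below are assumed approximations, not values given in the disclosure: about 3.3 pg of DNA per haploid human genome, a haploid genome length of about 3.1 billion base pairs, and a typical cfDNA fragment length of about 168 nucleotides.

```python
PG_PER_HAPLOID_GENOME = 3.3   # assumed approximate mass of one haploid human genome, in pg
HAPLOID_GENOME_BP = 3.1e9     # assumed approximate haploid genome length, in bp
TYPICAL_FRAGMENT_BP = 168     # assumed typical cfDNA fragment length, in bp

def haploid_genome_equivalents(mass_ng: float) -> float:
    """Estimate haploid genome equivalents in a DNA mass given in nanograms."""
    return mass_ng * 1000 / PG_PER_HAPLOID_GENOME  # 1 ng = 1000 pg

def cfdna_molecules(mass_ng: float) -> float:
    """Estimate the number of cfDNA fragments in a DNA mass given in nanograms."""
    fragments_per_genome = HAPLOID_GENOME_BP / TYPICAL_FRAGMENT_BP
    return haploid_genome_equivalents(mass_ng) * fragments_per_genome

ge_30 = haploid_genome_equivalents(30)
mol_30 = cfdna_molecules(30)
ge_100 = haploid_genome_equivalents(100)
mol_100 = cfdna_molecules(100)
```

Under these assumptions, 30 ng gives roughly 9,000 genome equivalents and on the order of 10^11 fragments, and 100 ng gives roughly 30,000 genome equivalents, in line with the approximate values stated above.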
  • a sample can comprise nucleic acids from different sources, e.g., cellular and cell-free nucleic acids of the same subject, or cellular and cell-free nucleic acids of different subjects.
  • a sample can comprise nucleic acids carrying mutations.
  • a sample can comprise DNA carrying germline mutations and/or somatic mutations.
  • Germline mutations refer to mutations existing in germline DNA of a subject.
  • Somatic mutations refer to mutations originating in somatic cells of a subject, e.g., cancer cells.
  • a sample can comprise DNA carrying cancer-associated mutations (e.g., cancer-associated somatic mutations).
  • a sample can comprise an epigenetic variant (i.e.
  • the sample includes an epigenetic variant associated with the presence of a genetic variant, wherein the sample does not comprise the genetic variant.
  • Exemplary amounts of cell-free nucleic acids in a sample before amplification range from about 1 fg to about 1 μg, e.g., 1 pg to 200 ng, 1 ng to 100 ng, 10 ng to 1000 ng.
  • the amount can be up to about 600 ng, up to about 500 ng, up to about 400 ng, up to about 300 ng, up to about 200 ng, up to about 100 ng, up to about 50 ng, or up to about 20 ng of cell-free nucleic acid molecules.
  • the amount can be at least 1 fg, at least 10 fg, at least 100 fg, at least 1 pg, at least 10 pg, at least 100 pg, at least 1 ng, at least 10 ng, at least 100 ng, at least 150 ng, or at least 200 ng of cell-free nucleic acid molecules.
  • the amount can be up to 1 femtogram (fg), 10 fg, 100 fg, 1 picogram (pg), 10 pg, 100 pg, 1 ng, 10 ng, 100 ng, 150 ng, or 200 ng of cell-free nucleic acid molecules.
  • the method can comprise obtaining 1 femtogram (fg) to 200 ng-
  • Cell-free nucleic acids are nucleic acids not contained within or otherwise bound to a cell or in other words nucleic acids remaining in a sample after removing intact cells.
  • Cell-free nucleic acids include DNA, RNA, and hybrids thereof, including genomic DNA, mitochondrial DNA, siRNA, miRNA, circulating RNA (cRNA), tRNA, rRNA, small nucleolar RNA (snoRNA), Piwi-interacting RNA (piRNA), long non-coding RNA (long ncRNA), or fragments of any of these.
  • Cell-free nucleic acids can be double-stranded, single-stranded, or a hybrid thereof.
  • a cell-free nucleic acid can be released into bodily fluid through secretion or cell death processes, e.g., cellular necrosis and apoptosis.
  • Some cell-free nucleic acids are released into bodily fluid from cancer cells, e.g., circulating tumor DNA (ctDNA). Others are released from healthy cells.
  • cfDNA is cell-free fetal DNA (cffDNA)
  • cell free nucleic acids are produced by tumor cells.
  • cell free nucleic acids are produced by a mixture of tumor cells and non-tumor cells.
  • Cell-free nucleic acids have an exemplary size distribution of about 100-500 nucleotides, with molecules of 110 to about 230 nucleotides representing about 90% of molecules, with a mode of about 168 nucleotides and a second minor peak in a range between 240 to 440 nucleotides.
  • Cell-free nucleic acids can be isolated from bodily fluids through a fractionation or partitioning step in which cell-free nucleic acids, as found in solution, are separated from intact cells and other non-soluble components of the bodily fluid. Partitioning may include techniques such as centrifugation or filtration. Alternatively, cells in bodily fluids can be lysed and cell-free and cellular nucleic acids processed together.
  • nucleic acids can be precipitated with an alcohol. Further clean up steps may be used such as silica based columns to remove contaminants or salts.
  • Non-specific bulk carrier nucleic acids such as Cot-1 DNA, DNA or protein for bisulfite sequencing, hybridization, and/or ligation, may be added throughout the reaction to optimize certain aspects of the procedure such as yield.
  • samples can include various forms of nucleic acid including double stranded DNA, single stranded DNA and single stranded RNA.
  • single stranded DNA and RNA can be converted to double stranded forms so they are included in subsequent processing and analysis steps.
  • Analytes can include nucleic acid analytes, and non-nucleic acid analytes.
  • the disclosure provides for detecting genetic variations in biological samples from a subject.
  • Biological samples may include polynucleotides from cancer cells. Polynucleotides may be DNA (e.g., genomic DNA, cDNA), RNA (e.g., mRNA, small RNAs), or any combination thereof.
  • Biological samples may include tumor tissue, e.g., from a biopsy. In some cases, biological samples may include blood or saliva. In particular cases, biological samples may comprise cell free DNA (“cfDNA”) or circulating tumor DNA (“ctDNA”). Cell free DNA can be present in, e.g., blood.
  • non-nucleic acid analytes include, but are not limited to, lipids, carbohydrates, peptides, proteins, glycoproteins (N-linked or O-linked), lipoproteins, phosphoproteins, specific phosphorylated or acetylated variants of proteins, amidation variants of proteins, hydroxylation variants of proteins, methylation variants of proteins, ubiquitylation variants of proteins, sulfation variants of proteins, viral proteins (e.g., viral capsid, viral envelope, viral coat, viral accessory, viral glycoproteins, viral spike, etc.), extracellular and intracellular proteins, antibodies, and antigen binding fragments.
  • a posttranslational modification e.g., phosphorylation, glycosylation, ubiquitination, nitrosylation, methylation, acetylation or lipidation
  • the systems, apparatus, methods, and compositions can be used to analyze any number of analytes, further including both nucleic acid analytes and non-nucleic acid analytes.
  • the number of analytes that are analyzed can be at least about 2, at least about 3, at least about 4, at least about 5, at least about 6, at least about 7, at least about 8, at least about 9, at least about 10, at least about 11, at least about 12, at least about 13, at least about 14, at least about 15, at least about 20, at least about 25, at least about 30, at least about 40, at least about 50, at least about 100, at least about 1,000, at least about 10,000, at least about 100,000 or more different analytes present in a region of the sample or within an individual feature of the substrate. Methods for performing multiplexed assays to analyze two or more different analytes will be discussed in a subsequent section of this disclosure.
  • nucleic acid analytes and/or non-nucleic acid analytes constitute a set of molecular interactions in a biological system under study (e.g., cells), which may be regarded as “interactome” - the molecular interactions that occur between molecules belonging to different biochemical families (proteins, nucleic acids, lipids, carbohydrates, etc.) and also within a given family.
  • an interactome is a protein-DNA interactome (a network formed by transcription factors, and DNA or chromatin regulatory proteins, and their target genes).
  • interactome refers to a protein-protein interaction network (PPI), or protein interaction network (PIN).
  • the present methods can be used to diagnose the presence of conditions, particularly cancer, in a subject, to characterize conditions (e.g., staging cancer or determining heterogeneity of a cancer), to monitor response to treatment of a condition, and to effect a prognosis of the risk of developing a condition or of the subsequent course of a condition.
  • the present disclosure can also be useful in determining the efficacy of a particular treatment option.
  • Successful treatment options may increase the amount of copy number variation or rare mutations detected in a subject's blood, as more cancer cells may die and shed DNA. In other examples, this may not occur.
  • certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy.
  • the present methods can be used to monitor residual disease or recurrence of disease.
  • the types and number of cancers that may be detected may include blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogenous tumors and the like.
  • Type and/or stage of cancer can be detected from genetic variations including mutations, rare mutations, indels, copy number variations, transversions, translocations, inversion, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, and abnormal changes in nucleic acid 5-methylcytosine.
  • Genetic and other analyte data can also be used for characterizing a specific form of cancer. Cancers are often heterogeneous in both composition and staging. Genetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer and allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. Some cancers can progress to become more aggressive and genetically unstable. Other cancers may remain benign, inactive or dormant. The system and methods of this disclosure may be useful in determining disease progression.
  • the present analyses are also useful in determining the efficacy of a particular treatment option.
  • the present methods can also be used for detecting genetic variations in conditions other than cancer.
  • Immune cells such as B cells
  • Clonal expansions may be monitored using copy number variation detection and certain immune states may be monitored.
  • copy number variation analysis may be performed over time to produce a profile of how a particular disease may be progressing.
  • Copy number variation or even rare mutation detection may be used to determine how a population of pathogens is changing during the course of infection. This may be particularly important during chronic infections, such as HIV/AIDS or Hepatitis infections, whereby viruses may change life cycle state and/or mutate into more virulent forms during the course of infection.
  • the present methods may be used to determine or profile rejection activities of the host body, as immune cells attempt to destroy transplanted tissue, to monitor the status of transplanted tissue, and to alter the course of treatment or prevention of rejection.
  • an abnormal condition is cancer.
  • the abnormal condition may be one resulting in a heterogeneous genomic population.
  • some tumors are known to comprise tumor cells in different stages of the cancer.
  • heterogeneity may comprise multiple foci of disease.
  • the present methods can be used to generate a profile, fingerprint, or set of data that is a summation of genetic information derived from different cells in a heterogeneous disease. This set of data may comprise copy number variation and mutation analyses alone or in combination.
  • the present methods can be used to diagnose, prognose, monitor or observe cancers, or other diseases. In some embodiments, the methods herein do not involve the diagnosing, prognosing or monitoring a fetus and as such are not directed to non-invasive prenatal testing. In other embodiments, these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.
  • determining the methylation pattern includes distinguishing 5-methylcytosine (5mC) from non-methylated cytosine. In some embodiments, determining the methylation pattern includes distinguishing N6-methyladenine from non-methylated adenine. In some embodiments, determining the methylation pattern includes distinguishing 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC) from non-methylated cytosine.
  • bisulfite sequencing examples include, but are not limited to oxidative bisulfite sequencing (OX-BS-seq), Tet-assisted bisulfite sequencing (TAB-seq), and reduced bisulfite sequencing (redBS-seq).
  • Oxidative bisulfite sequencing (OX-BS-seq) is used to distinguish between 5mC and 5hmC, by first converting the 5hmC to 5fC, and then proceeding with bisulfite sequencing as previously described.
  • Tet-assisted bisulfite sequencing (TAB-seq) can also be used to distinguish 5mC and 5hmC.
  • In TAB-seq, 5hmC is protected by glucosylation.
  • a Tet enzyme is then used to convert 5mC to 5caC before proceeding with bisulfite sequencing, as previously described.
  • Reduced bisulfite sequencing (redBS-seq) is used to distinguish 5fC from other modified cytosines.
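The chemistries described above differ in which cytosine states are read as C and which are read as T after conversion and amplification. The sketch below tabulates that readout for unmodified C, 5mC, and 5hmC under standard bisulfite sequencing (BS), OX-BS-seq, and TAB-seq, and infers the state from a BS and OX-BS-seq pair. The table layout and function name are assumptions, and the sketch ignores incomplete conversion and sequencing error.

```python
# Base read at a cytosine position after each chemistry:
# "C" if the base is protected from conversion, "T" if it is converted.
READOUT = {
    # state:  (BS,  OX-BS, TAB)
    "C":    ("T", "T", "T"),  # unmodified cytosine is always converted
    "5mC":  ("C", "C", "T"),  # TAB-seq: Tet converts 5mC to 5caC, which is converted
    "5hmC": ("C", "T", "C"),  # OX-BS: 5hmC is oxidized to 5fC, which is converted
}

def call_from_bs_and_oxbs(bs_base: str, oxbs_base: str) -> str:
    """Infer the cytosine state from paired BS and OX-BS-seq reads."""
    for state, (bs, oxbs, _tab) in READOUT.items():
        if (bs, oxbs) == (bs_base, oxbs_base):
            return state
    return "inconsistent"  # e.g., T in BS but C in OX-BS

calls = [call_from_bs_and_oxbs(b, o) for b, o in [("T", "T"), ("C", "C"), ("C", "T")]]
```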
  • In cytosine sequencing, a nucleic acid sample is divided into two aliquots and one aliquot is treated with bisulfite.
  • the bisulfite converts native cytosine and certain modified cytosine nucleotides (e.g., 5-formylcytosine or 5-carboxylcytosine) to uracil, whereas other modified cytosines (e.g., 5-methylcytosine, 5-hydroxymethylcytosine) are not converted.
  • Comparison of nucleic acid sequences of molecules from the two aliquots indicates which cytosines were and were not converted to uracils. Consequently, cytosines which were and were not modified can be determined.
  • the initial splitting of the sample into two aliquots is disadvantageous for samples containing only small amounts of nucleic acids, and/or composed of heterogeneous cell/tissue origins such as bodily
  • the present disclosure provides methods allowing bisulfite sequencing and variants thereof. These methods work by linking nucleic acids in a population to a capture moiety, i.e., a label that can be captured or immobilized.
  • Capture moieties include, without limitation, biotin, avidin, streptavidin, a nucleic acid including a particular nucleotide sequence, a hapten recognized by an antibody, and magnetically attractable particles.
  • the extraction moiety can be a member of a binding pair, such as biotin/streptavidin or hapten/antibody.
  • a capture moiety that is attached to an analyte is captured by its binding pair which is attached to an isolatable moiety, such as a magnetically attractable particle or a large particle that can be sedimented through centrifugation.
  • the capture moiety can be any type of molecule that allows affinity separation of nucleic acids bearing the capture moiety from nucleic acids lacking the capture moiety.
  • Exemplary capture moieties are biotin which allows affinity separation by binding to streptavidin linked or linkable to a solid phase or an oligonucleotide, which allows affinity separation through binding to a complementary oligonucleotide linked or linkable to a solid phase.
  • the capture moiety can be linked to sample nucleic acids as a component of an adapter, which may also provide amplification and/or sequencing primer binding sites.
  • sample nucleic acids are linked to adapters at both ends, with both adapters bearing a capture moiety.
  • any cytosine residues in the adapters are modified, such as to 5-methylcytosine, to protect against the action of bisulfite.
  • the capture moieties are linked to the original templates by a cleavable linkage (e.g., photocleavable desthiobiotin-TEG or uracil residues cleavable with USER™ enzyme, Chem. Commun. (Camb). 2015 Feb 21; 51(15): 3266-3269), in which case the capture moieties can, if desired, be removed.
  • the amplicons are denatured and contacted with an affinity reagent for the capture tag.
  • Original templates bind to the affinity reagent whereas nucleic acid molecules resulting from amplification do not.
  • the original templates can be separated from nucleic acid molecules resulting from amplification.
  • the respective populations of nucleic acids can be subjected to bisulfite treatment with the original template population receiving bisulfite treatment and the amplification products not.
  • the amplification products can be subjected to bisulfite treatment and the original template population not.
  • the respective populations can be amplified (which in the case of the original template population converts uracils to thymines).
  • the populations can also be subjected to biotin probe hybridization for enrichment. The respective populations are then analyzed and sequences compared to determine which cytosines were 5-methylated (or 5-hydroxymethylated) in the original.
  • Detection of a T nucleotide in the template population indicates an unmodified C.
  • the presence of C's at corresponding positions of the original template and amplified populations indicates a modified C in the original sample.
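The comparison rule in the preceding bullets can be sketched as follows. This is a minimal illustration, assuming the scheme in which the original template population is bisulfite treated and the amplification products are not, and assuming all reads are already aligned to a common reference of equal length. The function name and the "ambiguous" label are hypothetical and do not come from the disclosure.

```python
def call_cytosine_states(reference, template_read, amplified_read):
    """Classify each reference cytosine from two aligned reads.

    template_read:  read from the bisulfite-treated original template
                    (converted uracils appear as T after amplification).
    amplified_read: read from the untreated amplification products.
    """
    if not (len(reference) == len(template_read) == len(amplified_read)):
        raise ValueError("sequences must be aligned to the same length")

    calls = {}
    for pos, (ref, tmpl, amp) in enumerate(
        zip(reference, template_read, amplified_read)
    ):
        if ref != "C":
            continue  # only cytosine positions are informative here
        if tmpl == "C" and amp == "C":
            calls[pos] = "modified"    # C survived bisulfite in the template
        elif tmpl == "T" and amp == "C":
            calls[pos] = "unmodified"  # C was converted in the template
        else:
            calls[pos] = "ambiguous"   # e.g., a sequence variant or an error
    return calls


calls = call_cytosine_states("ACGTCC", "ACGTTC", "ACGTCC")
```

In this toy input, positions 1 and 5 read as C in both populations and are called modified, while position 4 reads as T only in the treated template and is called unmodified.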
  • a method uses sequential DNA-seq and bisulfite-seq (BIS-seq) NGS library preparation of molecular tagged DNA libraries. This process is performed by labeling of adapters (e.g., biotin), DNA-seq amplification of whole library, parent molecule recovery (e.g. streptavidin bead pull down), bisulfite conversion and BIS-seq.
  • the method identifies 5-methylcytosine with single-base resolution, through sequential NGS-preparative amplification of parent library molecules with and without bisulfite treatment.
  • sample DNA molecules are adapter ligated, and amplified (e.g., by PCR). As only the parent molecules will have a labeled adapter end, they can be selectively recovered from their amplified progeny by label-specific capture methods (e.g., streptavidin-magnetic beads).
  • the bisulfite treated library can be combined with a non-treated library prior to enrichment/NGS by addition of a sample tag DNA sequence in standard multiplexed NGS workflow.
  • bioinformatics analysis can be carried out for genomic alignment and 5-methylated base identification. In sum, this method provides the ability to selectively recover the parent, ligated molecules, carrying 5-methylcytosine marks, after library amplification, thereby allowing for parallel processing for bisulfite converted DNA.
  • the disclosure provides alternative methods for analyzing modified nucleic acids (e.g., methylated, linked to histones and other modifications discussed above).
  • a population of nucleic acids bearing the modification to different extents e.g., 0, 1, 2, 3, 4, 5 or more methyl groups per nucleic acid molecule
  • Adapters attach to either one end or both ends of nucleic acid molecules in the population.
  • the adapters include different tags of sufficient numbers that the number of combinations of tags results in a high probability (e.g., 95, 99 or 99.9%) that two nucleic acids with the same start and stop points do not receive the same combination of tags.
  • the nucleic acids are amplified from primers binding to the primer binding sites within the adapters.
  • Adapters whether bearing the same or different tags, can include the same or different primer binding sites, but preferably adapters include the same primer binding site.
  • the nucleic acids are contacted with an agent that preferably binds to nucleic acids bearing the modification (such as the previously described such agents).
  • the nucleic acids are separated into at least two partitions differing in the extent to which the nucleic acids bear the modification from binding to the agents.
  • nucleic acids overrepresented in the modification preferentially bind to the agent, whereas nucleic acids underrepresented for the modification do not bind or are more easily eluted from the agent.
  • the different partitions can then be subject to further processing steps, which typically include further amplification, and sequence analysis, in parallel but separately. Sequence data from the different partitions can then be compared.
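As a rough illustration of the tag-diversity requirement described above, the sketch below estimates the probability that several molecules sharing the same start and stop points all receive distinct tag combinations. It assumes tags are drawn uniformly and independently for each adapter end, which the disclosure does not state. The function and parameter names are hypothetical.

```python
def prob_all_distinct(num_tags, molecules, ends=2):
    """Probability that `molecules` co-located molecules all get
    different tag combinations.

    Assumes each of the `ends` adapters carries one of `num_tags` tags,
    chosen uniformly and independently (an illustrative assumption).
    """
    combos = num_tags ** ends  # number of possible tag combinations
    p = 1.0
    for i in range(molecules):
        # the i-th molecule must avoid the i combinations already taken
        p *= max(0.0, 1 - i / combos)
    return p


# Two molecules, 10 tags per end: 100 combinations, so a 1% collision chance.
p_two = prob_all_distinct(num_tags=10, molecules=2)
```

Under these assumptions, raising the number of tags per end or tagging both ends rather than one increases the number of combinations and so lowers the collision probability.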
  • Nucleic acids can be linked at both ends to Y-shaped adapters including primer binding sites and tags.
  • the molecules are amplified.
  • the amplified molecules are then fractionated by contact with an antibody preferentially binding to 5-methylcytosine to produce two partitions.
  • One partition includes original molecules lacking methylation and amplification copies having lost methylation.
  • the other partition includes original DNA molecules with methylation.
  • the two partitions are then processed and sequenced separately with further amplification of the methylated partition.
  • the sequence data of the two partitions can then be compared.
  • tags are not used to distinguish between methylated and unmethylated DNA but rather to distinguish between different molecules within these partitions so that one can determine whether reads with the same start and stop points are based on the same or different molecules.
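The role of tags described above, distinguishing molecules rather than methylation states, can be sketched as a grouping of reads by start point, stop point and tag within each partition. This is a minimal sketch in which the read tuple layout and function name are assumptions made for illustration.

```python
from collections import Counter


def molecules_per_partition(reads):
    """Count distinct original molecules in each partition.

    reads: iterable of (partition, start, stop, tag) tuples. Reads that
    agree on all four fields are treated as copies of one molecule.
    """
    families = set(reads)  # collapse amplification duplicates
    return dict(Counter(partition for partition, _, _, _ in families))


reads = [
    ("methylated", 100, 260, "AC"),
    ("methylated", 100, 260, "AC"),    # duplicate of the read above
    ("methylated", 100, 260, "GT"),    # same coordinates, different molecule
    ("unmethylated", 100, 260, "AC"),
]
counts = molecules_per_partition(reads)
```

Here the two methylated reads with tag "AC" collapse into one molecule, while the read with tag "GT" is counted separately even though it has the same start and stop points.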
  • the disclosure provides further methods for analyzing a population of nucleic acid in which at least some of the nucleic acids include one or more modified cytosine residues, such as 5-methylcytosine and any of the other modifications described previously.
  • the population of nucleic acids is contacted with adapters including one or more cytosine residues modified at the 5C position, such as 5-methylcytosine.
  • cytosine residues in such adapters are also modified, or all such cytosines in a primer binding region of the adapters are modified.
  • Adapters attach to both ends of nucleic acid molecules in the population.
  • the adapters include different tags of sufficient numbers that the number of combinations of tags results in a high probability (e.g., 95, 99 or 99.9%) that two nucleic acids with the same start and stop points do not receive the same combination of tags.
  • the primer binding sites in such adapters can be the same or different, but are preferably the same.
  • the nucleic acids are amplified from primers binding to the primer binding sites of the adapters.
  • the amplified nucleic acids are split into first and second aliquots.
  • the first aliquot is assayed for sequence data with or without further processing.
  • the sequence data on molecules in the first aliquot is thus determined irrespective of the initial methylation state of the nucleic acid molecules.
  • the nucleic acid molecules in the second aliquot are treated with bisulfite. This treatment converts unmodified cytosines to uracils.
  • the bisulfite treated nucleic acids are then subjected to amplification primed by primers to the original primer binding sites of the adapters linked to nucleic acid. Only the nucleic acid molecules originally linked to adapters (as distinct from amplification products thereof) are now amplifiable because these nucleic acids retain cytosines in the primer binding sites of the adapters, whereas amplification products have lost the methylation of these cytosine residues, which have undergone conversion to uracils in the bisulfite treatment. Thus, only original molecules in the populations, at least some of which are methylated, undergo amplification. After amplification, these nucleic acids are subject to sequence analysis. Comparison of sequences determined from the first and second aliquots can indicate among other things, which cytosines in the nucleic acid population were subject to methylation.
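The reason only original molecules remain amplifiable can be illustrated with a toy model of a primer binding site. The sketch below assumes that the original adapter carries 5-methylated cytosines at every C, that amplification copies carry unmodified cytosines, and that a primer binds only if the site is unchanged after treatment. The site sequence and function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Convert unmodified C to U; C at a methylated position is protected."""
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def still_primable(site_after_treatment, original_site):
    """Toy criterion: the primer binds only an unchanged binding site."""
    return site_after_treatment == original_site


site = "GACCTA"                                   # hypothetical primer binding site
original = bisulfite_convert(site, {2, 3})        # adapter cytosines methylated
copy = bisulfite_convert(site, set())             # methylation lost on copying

original_ok = still_primable(original, site)
copy_ok = still_primable(copy, site)
```

In this model the original molecule keeps its binding site intact, whereas the copy has its cytosines converted to uracils and no longer matches the primer.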
  • a population of different forms of nucleic acids can be physically partitioned based on one or more characteristics of the nucleic acids prior to further analysis, e.g., differentially modifying or isolating a nucleobase, tagging, and/or sequencing. This approach can be used to determine, for example, whether certain sequences are hypermethylated or hypomethylated.
  • hypermethylation variable epigenetic target regions are analyzed to determine whether they show hypermethylation characteristic of tumor cells and/or hypomethylation variable epigenetic target regions are analyzed to determine whether they show hypomethylation characteristic of tumor cells.
  • by partitioning a heterogeneous nucleic acid population, one may increase rare signals, e.g., by enriching rare nucleic acid molecules that are more prevalent in one fraction (or partition) of the population. For example, a genetic variation present in hypermethylated DNA but less (or not) in hypomethylated DNA can be more easily detected by partitioning a sample into hypermethylated and hypomethylated nucleic acid molecules.
  • a multi-dimensional analysis of a single locus of a genome or species of nucleic acid can be performed and hence, greater sensitivity can be achieved.
  • a heterogeneous nucleic acid sample is partitioned into two or more partitions (e.g., at least 3, 4, 5, 6 or 7 partitions).
  • each partition is differentially tagged.
  • Tagged partitions can then be pooled together for collective sample prep and/or sequencing. The partitioning-tagging-pooling steps can occur more than once, with each round of partitioning occurring based on a different characteristic (examples provided herein), and tagged using differential tags that are distinguished from other partitions and partitioning means.
  • partitioning examples include sequence length, methylation level, nucleosome binding, sequence mismatch, immunoprecipitation, and/or proteins that bind to DNA.
  • Resulting partitions can include one or more of the following nucleic acid forms: single-stranded DNA (ssDNA), double-stranded DNA (dsDNA), shorter DNA fragments and longer DNA fragments.
  • partitioning based on a cytosine modification (e.g., cytosine methylation) or methylation generally is performed and is optionally combined with at least one additional partitioning step, which may be based on any of the foregoing characteristics or forms of DNA.
  • a heterogeneous population of nucleic acids is partitioned into nucleic acids with one or more epigenetic modifications and without the one or more epigenetic modifications.
  • epigenetic modifications include presence or absence of methylation; level of methylation; type of methylation (e.g., 5-methylcytosine versus other types of methylation, such as adenine methylation and/or cytosine hydroxymethylation); and association and level of association with one or more proteins, such as histones.
  • a heterogeneous population of nucleic acids can be partitioned into nucleic acid molecules associated with nucleosomes and nucleic acid molecules devoid of nucleosomes.
  • a heterogeneous population of nucleic acids may be partitioned into single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA).
  • a heterogeneous population of nucleic acids may be partitioned based on nucleic acid length (e.g., molecules of up to 160 bp and molecules having a length of greater than 160 bp).
  • each partition (representative of a different nucleic acid form) is differentially labelled, and the partitions are pooled together prior to sequencing. In other instances, the different forms are separately sequenced.
  • a population of different nucleic acids is partitioned into two or more different partitions. Each partition is representative of a different nucleic acid form, and a first partition (also referred to as a subsample) includes DNA with a cytosine modification in a greater proportion than a second subsample. Each partition is distinctly tagged.
  • the first subsample is subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity.
  • the tagged nucleic acids are pooled together prior to sequencing. Sequence reads are obtained and analyzed, including to distinguish the first nucleobase from the second nucleobase in the DNA of the first subsample, in silico. Tags are used to sort reads from different partitions.
  • Analysis to detect genetic variants can be performed on a partition-by-partition level, as well as whole nucleic acid population level.
  • analysis can include in silico analysis to determine genetic variants, such as CNV, SNV, indel, fusion in nucleic acids in each partition.
  • in silico analysis can include determining chromatin structure.
  • coverage of sequence reads can be used to determine nucleosome positioning in chromatin. Higher coverage can correlate with higher nucleosome occupancy in genomic region while lower coverage can correlate with lower nucleosome occupancy or nucleosome depleted region (NDR).
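The coverage-based reasoning above can be sketched as a per-position read depth calculation followed by a search for low-coverage runs, which would be candidates for nucleosome depleted regions. The threshold and minimum run length are illustrative parameters and do not come from the disclosure. Reads are assumed to be half-open (start, end) intervals on a single region.

```python
def coverage(reads, region_length):
    """Per-position read depth over [0, region_length)."""
    diff = [0] * (region_length + 1)
    for start, end in reads:
        s, e = max(0, start), min(region_length, end)
        if s < e:
            diff[s] += 1  # depth rises where a read begins
            diff[e] -= 1  # and falls where it ends
    depth, running = [], 0
    for d in diff[:region_length]:
        running += d
        depth.append(running)
    return depth


def low_coverage_runs(depth, threshold, min_len):
    """Return (start, end) runs where depth <= threshold for >= min_len bases."""
    runs, start = [], None
    # a sentinel above the threshold closes any run reaching the region end
    for i, c in enumerate(depth + [threshold + 1]):
        if c <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


depth = coverage([(0, 4), (0, 4), (8, 12)], region_length=12)
candidates = low_coverage_runs(depth, threshold=0, min_len=3)
```

In this toy region, positions 4 through 7 have no coverage and are reported as a single candidate run, while the flanking covered positions would correspond to higher nucleosome occupancy.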
  • Samples can include nucleic acids varying in modifications including post-replication modifications to nucleotides and binding, usually noncovalently, to one or more proteins.
  • the population of nucleic acids is one obtained from a serum, plasma or blood sample from a subject suspected of having neoplasia, a tumor, or cancer or previously diagnosed with neoplasia, a tumor, or cancer.
  • the population of nucleic acids includes nucleic acids having varying levels of methylation. Methylation can occur from any one or more post-replication or transcriptional modifications. Post-replication modifications include modifications of the nucleotide cytosine, particularly at the 5-position of the nucleobase, e.g., 5-methylcytosine, 5-hydroxymethylcytosine, 5-formylcytosine and 5-carboxylcytosine.
  • the affinity agents can be antibodies with the desired specificity, natural binding partners or variants thereof (Bock et al., Nat Biotech 28: 1106-1114 (2010); Song et al., Nat Biotech 29: 68-72 (2011)), or artificial peptides selected e.g., by phage display to have specificity to a given target.
  • capture moieties contemplated herein include methyl binding domains (MBDs) and methyl binding proteins (MBPs) as described herein, including proteins such as MeCP2 and antibodies preferentially binding to 5-methylcytosine.
  • partitioning of different forms of nucleic acids can be performed using histone binding proteins which can separate nucleic acids bound to histones from free or unbound nucleic acids.
  • examples of histone binding proteins include RBBP4, RbAp48 and SANT domain peptides.
  • although binding to the agent may occur in an essentially all or none manner depending on whether a nucleic acid bears a modification, the separation may be one of degree.
  • nucleic acids overrepresented in a modification bind to the agent to a greater extent than nucleic acids underrepresented in the modification.
  • nucleic acids having modifications may bind in an all or nothing manner. But then, various levels of modifications may be sequentially eluted from the binding agent.
  • partitioning can be binary or based on degree/level of modifications.
  • all methylated fragments can be partitioned from unmethylated fragments using methyl-binding domain proteins (e.g., MethylMiner Methylated DNA Enrichment Kit (ThermoFisher Scientific)).
  • additional partitioning may involve eluting fragments having different levels of methylation by adjusting the salt concentration in a solution with the methyl-binding domain and bound fragments. As salt concentration increases, fragments having greater methylation levels are eluted.
  • the final partitions are representative of nucleic acids having different extents of modifications (overrepresentative or underrepresentative of modifications).
  • Overrepresentation and underrepresentation can be defined by the number of modifications borne by a nucleic acid relative to the median number of modifications per strand in a population. For example, if the median number of 5-methylcytosine residues in nucleic acid in a sample is 2, a nucleic acid including more than two 5-methylcytosine residues is overrepresented in this modification and a nucleic acid with 1 or zero 5-methylcytosine residues is underrepresented.
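The median-based definition above can be written out directly. The sketch below labels each molecule by comparing its count of 5-methylcytosine residues with the population median. The text does not say how to label a molecule whose count equals the median, so the "at median" label is an assumption, as is the function name.

```python
from statistics import median


def classify_by_median(counts):
    """Label each modification count relative to the population median."""
    m = median(counts)
    labels = []
    for n in counts:
        if n > m:
            labels.append("overrepresented")
        elif n < m:
            labels.append("underrepresented")
        else:
            labels.append("at median")  # case not specified in the text
    return labels


# Counts of 5-methylcytosine residues per molecule; the median here is 2.
labels = classify_by_median([0, 1, 2, 2, 3, 5])
```

With a median of 2, molecules carrying 3 or 5 residues are overrepresented and molecules carrying 0 or 1 are underrepresented, matching the worked example in the text.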
  • the effect of the affinity separation is to enrich for nucleic acids overrepresented in a modification in a bound phase and for nucleic acids underrepresented in a modification in an unbound phase (i.e. in solution). The nucleic acids in the bound phase can be eluted before subsequent processing.
  • When using the MethylMiner Methylated DNA Enrichment Kit (ThermoFisher Scientific), various levels of methylation can be partitioned using sequential elutions. For example, a hypomethylated partition (e.g., no methylation) can be separated from a methylated partition by contacting the nucleic acid population with the MBD from the kit, which is attached to magnetic beads. The beads are used to separate out the methylated nucleic acids from the non-methylated nucleic acids. Subsequently, one or more elution steps are performed sequentially to elute nucleic acids having different levels of methylation.
  • a first set of methylated nucleic acids can be eluted at a salt concentration of 160 mM or higher, e.g., at least 150 mM, at least 200 mM, at least 300 mM, at least 400 mM, at least 500 mM, at least 600 mM, at least 700 mM, at least 800 mM, at least 900 mM, at least 1000 mM, or at least 2000 mM.
  • magnetic separation is once again used to separate higher level of methylated nucleic acids from those with lower level of methylation.
  • the elution and magnetic separation steps can repeat themselves to create various partitions such as a hypomethylated partition (representative of no methylation), a methylated partition (representative of low level of methylation), and a hyper methylated partition (representative of high level of methylation).
  • nucleic acids bound to an agent used for affinity separation are subjected to a wash step.
  • the wash step washes off nucleic acids weakly bound to the affinity agent.
  • nucleic acids can be enriched in nucleic acids having the modification to an extent close to the mean or median (i.e., intermediate between nucleic acids remaining bound to the solid phase and nucleic acids not binding to the solid phase on initial contacting of the sample with the agent).
  • the affinity separation results in at least two, and sometimes three or more partitions of nucleic acids with different extents of a modification.
  • the nucleic acids of at least one partition, and usually two or three (or more) partitions are linked to nucleic acid tags, usually provided as components of adapters, with the nucleic acids in different partitions receiving different tags that distinguish members of one partition from another.
  • the tags linked to nucleic acid molecules of the same partition can be the same or different from one another. But if different from one another, the tags may have part of their code in common so as to identify the molecules to which they are attached as being of a particular partition.
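One way to picture tags that share part of their code within a partition is a composite tag made of a partition prefix followed by a molecule-specific barcode. The sketch below is an illustration only. The two-base prefixes, the partition names in `PARTITION_CODES`, and the function names are hypothetical and are not taken from the disclosure.

```python
# Hypothetical mapping from a shared tag prefix to a partition name.
PARTITION_CODES = {"AA": "hypo", "CC": "intermediate", "GG": "hyper"}


def split_tag(tag, prefix_len=2):
    """Split a composite tag into (partition, molecular barcode).

    The partition is None when the prefix is not a known partition code.
    """
    prefix, barcode = tag[:prefix_len], tag[prefix_len:]
    return PARTITION_CODES.get(prefix), barcode


def sort_reads(tagged_reads):
    """Bin pooled (tag, sequence) reads by the partition encoded in the tag."""
    bins = {}
    for tag, seq in tagged_reads:
        partition, barcode = split_tag(tag)
        if partition is None:
            continue  # unrecognized prefix: leave the read unassigned
        bins.setdefault(partition, []).append((barcode, seq))
    return bins


pooled = [
    ("AAACGT", "seq1"),
    ("GGACGT", "seq2"),
    ("TTACGT", "seq3"),  # unknown prefix, skipped
    ("AATTTT", "seq4"),
]
bins = sort_reads(pooled)
```

Because the partition is encoded in the shared prefix, reads from pooled partitions can be sorted in silico while the remainder of each tag still distinguishes molecules within a partition.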
  • for further details on partitioning nucleic acid samples based on characteristics such as methylation, see WO2018/119452, which is incorporated herein by reference.
  • the nucleic acid molecules can be fractionated into different partitions based on the nucleic acid molecules that are bound to a specific protein or a fragment thereof and those that are not bound to that specific protein or fragment thereof.
  • Nucleic acid molecules can be fractionated based on DNA-protein binding.
  • Protein-DNA complexes can be fractionated based on a specific property of a protein. Examples of such properties include various epitopes, modifications (e.g., histone methylation or acetylation) or enzymatic activity. Examples of proteins which may bind to DNA and serve as a basis for fractionation may include, but are not limited to, protein A and protein G. Any suitable method can be used to fractionate the nucleic acid molecules based on protein bound regions.
  • Examples of methods used to fractionate nucleic acid molecules based on protein bound regions include, but are not limited to, SDS-PAGE, chromatin-immuno-precipitation (ChIP), heparin chromatography, and asymmetrical field flow fractionation (AF4).
  • partitioning of the nucleic acids is performed by contacting the nucleic acids with a methylation binding domain (“MBD”) of a methylation binding protein (“MBP”).
  • MBD binds to 5-methylcytosine (5mC).
  • MBD is coupled to paramagnetic beads, such as Dynabeads® M-280 Streptavidin via a biotin linker. Partitioning into fractions with different extents of methylation can be performed by eluting fractions by increasing the NaCl concentration.
  • genomic regions of interest e.g., cancer-specific genetic variants and differentially methylated regions.
  • Re-amplification of the enriched total DNA library appending a sample tag. Different samples are pooled, and assayed in multiplex on an NGS instrument.
  • MBPs contemplated herein include, but are not limited to:
  • MeCP2 is a protein preferentially binding to 5-methyl-cytosine over unmodified cytosine.
  • FOXK1, FOXK2, FOXP1, FOXP4 and FOXI3 preferably bind to 5-formyl-cytosine over unmodified cytosine (Iurlaro et al., Genome Biol. 14: R119 (2013)).
  • elution is a function of number of methylated sites per molecule, with molecules having more methylation eluting under increased salt concentrations.
  • salt concentration can range from about 100 nM to about 2500 mM NaCl.
  • the process results in three (3) partitions. Molecules are contacted with a solution at a first salt concentration and including a molecule including a methyl binding domain, which molecule can be attached to a capture moiety, such as streptavidin.
  • a population of molecules will bind to the MBD and a population will remain unbound.
  • the unbound population can be separated as a “hypomethylated” population.
  • a first partition representative of the hypomethylated form of DNA is that which remains unbound at a low salt concentration, e.g., 100 mM or 160 mM.
  • a second partition representative of intermediate methylated DNA is eluted using an intermediate salt concentration, e.g., between 100 mM and 2000 mM concentration. This is also separated from the sample.
  • a third partition representative of hypermethylated form of DNA is eluted using a high salt concentration, e.g., at least about 2000 mM.
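The stepwise salt elution described above can be modeled as follows. This is a minimal sketch that assumes each molecule has a known NaCl concentration at or above which it leaves the MBD, which in practice would depend on its number of methylated sites. The release values, molecule names and function name are hypothetical. The 160 mM and 2000 mM steps echo the example concentrations in the text.

```python
def stepwise_elution(release_mM, steps_mM):
    """Simulate sequential elution at increasing salt concentrations.

    release_mM: dict of molecule id -> lowest NaCl concentration (mM) at
                which that molecule is no longer retained (assumed known).
    steps_mM:   salt concentrations applied in order; sorted here so the
                steps are always increasing.
    Returns (fractions, still_bound), where fractions is a list of
    (concentration, [molecule ids collected at that step]).
    """
    remaining = dict(release_mM)
    fractions = []
    for step in sorted(steps_mM):
        eluted = sorted(m for m, r in remaining.items() if r <= step)
        for m in eluted:
            del remaining[m]
        fractions.append((step, eluted))
    return fractions, sorted(remaining)


release = {"unmethylated": 0, "low": 500, "high": 2000, "very_high": 3000}
fractions, still_bound = stepwise_elution(release, [160, 1000, 2000])
```

In this toy run, the first fraction corresponds to the hypomethylated partition that does not bind at low salt, the second to the intermediate partition, and the third to the hypermethylated partition.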
  • the disclosure provides further methods for analyzing a population of nucleic acids in which at least some of the nucleic acids include one or more modified cytosine residues, such as 5-methylcytosine and any of the other modifications described previously.
  • the subsamples of nucleic acids are contacted with adapters including one or more cytosine residues modified at the 5C position, such as 5-methylcytosine.
  • cytosine residues in such adapters are also modified, or all such cytosines in a primer binding region of the adapters are modified.
  • Adapters attach to both ends of nucleic acid molecules in the population.
  • the adapters include different tags of sufficient numbers that the number of combinations of tags results in a high probability (e.g., 95, 99 or 99.9%) that two nucleic acids with the same start and stop points do not receive the same combination of tags.
  • the primer binding sites in such adapters can be the same or different, but are preferably the same.
  • the nucleic acids are amplified from primers binding to the primer binding sites of the adapters.
  • the amplified nucleic acids are split into first and second aliquots.
  • the first aliquot is assayed for sequence data with or without further processing.
  • the sequence data on molecules in the first aliquot is thus determined irrespective of the initial methylation state of the nucleic acid molecules.
  • the nucleic acid molecules in the second aliquot are subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA, wherein the first nucleobase includes a cytosine modified at the 5 position, and the second nucleobase includes unmodified cytosine.
  • This procedure may be bisulfite treatment or another procedure that converts unmodified cytosines to uracils.
  • the nucleic acids subjected to the procedure are then amplified with primers to the original primer binding sites of the adapters linked to nucleic acid.
  • nucleic acid molecules originally linked to adapters are now amplifiable because these nucleic acids retain cytosines in the primer binding sites of the adapters, whereas amplification products have lost the methylation of these cytosine residues, which have undergone conversion to uracils in the bisulfite treatment.
  • only original molecules in the populations, at least some of which are methylated, undergo amplification.
  • these nucleic acids are subject to sequence analysis. Comparison of sequences determined from the first and second aliquots can indicate among other things, which cytos
  • methylated DNA is linked to Y-shaped adapters at both ends including primer binding sites and tags.
  • the cytosines in the adapters are modified at the 5 position (e.g., 5-methylated).
  • the modification of the adapters serves to protect the primer binding sites in a subsequent conversion step (e.g., bisulfite treatment, TAP conversion, or any other conversion that does not affect the modified cytosine but affects unmodified cytosine).
  • the DNA molecules are amplified.
  • the amplification product is split into two aliquots for sequencing with and without conversion. The aliquot not subjected to conversion can be subjected to sequence analysis with or without further processing.
  • the other aliquot is subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA, wherein the first nucleobase includes a cytosine modified at the 5 position, and the second nucleobase includes unmodified cytosine.
  • This procedure may be bisulfite treatment or another procedure that converts unmodified cytosines to uracils. Only primer binding sites protected by modification of cytosines can support amplification when contacted with primers specific for original primer binding sites. Thus, only original molecules and not copies from the first amplification are subjected to further amplification. The further amplified molecules are then subjected to sequence analysis. Sequences can then be compared from the two aliquots. As in the separation scheme discussed above, nucleic acid tags in adapters are not used to distinguish between methylated and unmethylated DNA but to distinguish nucleic acid molecules within the same partition.
  • Methods disclosed herein comprise a step of subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity.
  • if the first nucleobase is a modified or unmodified adenine, then the second nucleobase is a modified or unmodified adenine; if the first nucleobase is a modified or unmodified cytosine, then the second nucleobase is a modified or unmodified cytosine; if the first nucleobase is a modified or unmodified guanine, then the second nucleobase is a modified or unmodified guanine; and if the first nucleobase is a modified or unmodified thymine, then the second nucleobase is a modified or unmodified thymine (where modified and unmodified uracil are encompassed within modified thymine for the purpose of this step).
  • the first nucleobase is a modified or unmodified cytosine
  • the second nucleobase is a modified or unmodified cytosine.
  • the first nucleobase may comprise unmodified cytosine (C) and the second nucleobase may comprise one or more of 5-methylcytosine (mC) and 5-hydroxymethylcytosine (hmC).
  • the second nucleobase may comprise C and the first nucleobase may comprise one or more of mC and hmC.
  • Other combinations are also possible, as indicated, e.g., in the Summary above and the following discussion, such as where one of the first and second nucleobases includes mC and the other includes hmC.
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes bisulfite conversion.
  • Treatment with bisulfite converts unmodified cytosine and certain modified cytosine nucleotides (e.g., 5-formyl cytosine (fC) or 5-carboxylcytosine (caC)) to uracil whereas other modified cytosines (e.g., 5-methylcytosine, 5-hydroxymethylcytosine) are not converted.
  • the first nucleobase includes one or more of unmodified cytosine, 5-formyl cytosine, 5-carboxylcytosine, or other cytosine forms affected by bisulfite
  • the second nucleobase may comprise one or more of mC and hmC, such as mC and optionally hmC.
  • Sequencing of bisulfite-treated DNA identifies positions that are read as cytosine as being mC or hmC positions. Meanwhile, positions that are read as T are identified as being T or a bisulfite-susceptible form of C, such as unmodified cytosine, 5-formyl cytosine, or 5-carboxylcytosine.
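The read interpretation rules above can be sketched as a small conversion model. Bases are represented as tokens so that modified cytosines can be written explicitly. The token names follow the abbreviations in the text (C, mC, hmC, fC, caC), while the function names and the list-of-tokens representation are assumptions made for illustration.

```python
# Cytosine forms converted to uracil by bisulfite (read as T after amplification).
BISULFITE_SUSCEPTIBLE = {"C", "fC", "caC"}
# Cytosine forms that bisulfite does not convert (still read as C).
BISULFITE_RESISTANT = {"mC", "hmC"}


def bisulfite_read(bases):
    """Return the read obtained after bisulfite treatment and amplification."""
    out = []
    for base in bases:
        if base in BISULFITE_SUSCEPTIBLE:
            out.append("T")
        elif base in BISULFITE_RESISTANT:
            out.append("C")
        else:
            out.append(base)  # A, G and T are unaffected
    return "".join(out)


def possible_sources(read_base):
    """List the original bases consistent with one base of a treated read."""
    if read_base == "C":
        return set(BISULFITE_RESISTANT)
    if read_base == "T":
        return {"T"} | BISULFITE_SUSCEPTIBLE
    return {read_base}


read = bisulfite_read(["A", "mC", "C", "hmC", "fC", "T", "G"])
```

A C in the treated read therefore points to mC or hmC, whereas a T is ambiguous between a true T and any bisulfite-susceptible form of cytosine, which is why comparison with an untreated read is needed to resolve it.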
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes oxidative bisulfite (Ox-BS) conversion. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes Tet-assisted bisulfite (TAB) conversion.
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes Tet-assisted conversion with a substituted borane reducing agent, optionally wherein the substituted borane reducing agent is 2-picoline borane, borane pyridine, tert-butylamine borane, or ammonia borane.
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes chemical-assisted conversion with a substituted borane reducing agent, optionally wherein the substituted borane reducing agent is 2-picoline borane, borane pyridine, tert-butylamine borane, or ammonia borane.
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes APOBEC-coupled epigenetic (ACE) conversion.
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes enzymatic conversion of the first nucleobase, e.g., as in EM-Seq. See, e.g., Vaisvila R, et al. (2019) EM-seq: Detection of DNA methylation at single base resolution from picograms of DNA. bioRxiv; DOI: 10.1101/2019.12.20.884692, available at www.biorxiv.org/content/10.1101/2019.12.20.884692v1.
  • TET2 and T4-PGT can be used to convert 5mC and 5hmC into substrates that cannot be deaminated by a deaminase (e.g., APOBEC3A), and then a deaminase (e.g., APOBEC3A) can be used to deaminate unmodified cytosines, converting them to uracils.
  • the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes separating DNA originally including the first nucleobase from DNA not originally including the first nucleobase.
  • the first nucleobase is a modified or unmodified adenine
  • the second nucleobase is a modified or unmodified adenine.
  • the modified adenine is N6-methyladenine (mA).
  • the modified adenine is one or more of N6-methyladenine (mA), N6-hydroxymethyladenine (hmA), or N6-formyladenine (fA).
  • MeDIP methylated DNA immunoprecipitation
  • methods disclosed herein comprise a step of capturing one or more sets of target regions of DNA, such as cfDNA. Capture may be performed using any suitable approach known in the art. In some embodiments, capturing includes contacting the DNA to be captured with a set of target-specific probes.
  • the set of target-specific probes may have any of the features described herein for sets of target-specific probes, including but not limited to in the embodiments set forth above and the sections relating to probes below. Capturing may be performed on one or more subsamples prepared during methods disclosed herein.
  • DNA is captured from at least the first subsample or the second subsample, e.g., at least the first subsample and the second subsample.
  • a separation step e.g., separating DNA originally including the first nucleobase (e.g., hmC) from DNA not originally including the first nucleobase, such as hmC-seal
  • capturing may be performed on any, any two, or all of the DNA originally including the first nucleobase (e.g., hmC), the DNA not originally including the first nucleobase, and the second subsample.
  • the subsamples are differentially tagged (e.g., as described herein) and then pooled before undergoing capture.
  • the capturing step may be performed using conditions suitable for specific nucleic acid hybridization, which generally depend to some extent on features of the probes such as length, base composition, etc. Those skilled in the art will be familiar with appropriate conditions given general knowledge in the art regarding nucleic acid hybridization. In some embodiments, complexes of target-specific probes and DNA are formed.
  • a method described herein includes capturing cfDNA obtained from a test subject for a plurality of sets of target regions.
  • the target regions comprise epigenetic target regions, which may show differences in methylation levels and/or fragmentation patterns depending on whether they originated from a tumor or from healthy cells.
  • the target regions also comprise sequence-variable target regions, which may show differences in sequence depending on whether they originated from a tumor or from healthy cells.
  • the capturing step produces a captured set of cfDNA molecules, and the cfDNA molecules corresponding to the sequence-variable target region set are captured at a greater capture yield in the captured set of cfDNA molecules than cfDNA molecules corresponding to the epigenetic target region set.
  • a method described herein includes contacting cfDNA obtained from a test subject with a set of target-specific probes, wherein the set of target-specific probes is configured to capture cfDNA corresponding to the sequence-variable target region set at a greater capture yield than cfDNA corresponding to the epigenetic target region set.
  • the volume of data needed to determine fragmentation patterns (e.g., to test for perturbation of transcription start sites or CTCF binding sites) or fragment abundance (e.g., in hypermethylated and hypomethylated partitions) is generally less than the volume of data needed to determine the presence or absence of cancer-related sequence mutations.
  • Capturing the target region sets at different yields can facilitate sequencing the target regions to different depths of sequencing in the same sequencing run (e.g., using a pooled mixture and/or in the same sequencing cell).
  • the methods further comprise sequencing the captured cfDNA, e.g., to different degrees of sequencing depth for the epigenetic and sequence-variable target region sets, consistent with the discussion herein.
  • complexes of target-specific probes and DNA are separated from DNA not bound to target-specific probes.
  • a washing or aspiration step can be used to separate unbound material.
  • the complexes have chromatographic properties distinct from unbound material (e.g., where the probes comprise a ligand that binds a chromatographic resin), chromatography can be used.
  • the set of target-specific probes may comprise a plurality of sets such as probes for a sequence-variable target region set and probes for an epigenetic target region set.
  • the capturing step is performed with the probes for the sequence-variable target region set and the probes for the epigenetic target region set in the same vessel at the same time, e.g., the probes for the sequence-variable and epigenetic target region sets are in the same composition.
  • the concentration of the probes for the sequence-variable target region set is greater than the concentration of the probes for the epigenetic target region set.
  • the capturing step is performed with the sequence-variable target region probe set in a first vessel and with the epigenetic target region probe set in a second vessel, or the contacting step is performed with the sequence-variable target region probe set at a first time and a first vessel and the epigenetic target region probe set at a second time before or after the first time.
  • This approach allows for preparation of separate first and second compositions including captured DNA corresponding to the sequence-variable target region set and captured DNA corresponding to the epigenetic target region set.
  • the compositions can be processed separately as desired (e.g., to fractionate based on methylation as described elsewhere herein) and recombined in appropriate proportions to provide material for further processing and analysis such as sequencing.
  • the DNA is amplified. In some embodiments, amplification is performed before the capturing step. In some embodiments, amplification is performed after the capturing step.
  • adapters are included in the DNA. This may be done concurrently with an amplification procedure, e.g., by providing the adapters in a 5’ portion of a primer, e.g., as described above. Alternatively, adapters can be added by other approaches, such as ligation.
  • tags which may be or include barcodes, are included in the DNA. Tags can facilitate identification of the origin of a nucleic acid. For example, barcodes can be used to allow the origin (e.g., subject) whence the DNA came to be identified following pooling of a plurality of samples for parallel sequencing.
  • amplification procedure e.g., by providing the barcodes in a 5’ portion of a primer, e.g., as described above.
  • adapters and tags/barcodes are provided by the same primer or primer set.
  • the barcode may be located 3’ of the adapter and 5’ of the target-hybridizing portion of the primer.
  • barcodes can be added by other approaches, such as ligation, optionally together with adapters in the same ligation substrate.
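The primer layout described above (barcode located 3’ of the adapter and 5’ of the target-hybridizing portion) can be sketched as simple string assembly. This is a minimal illustration; the function names and any sequences passed to them are hypothetical placeholders, not sequences from the disclosure.

```python
# Illustrative sketch only: assembles a primer written 5'->3' as
# adapter + barcode + target-hybridizing portion, and splits it back apart.

VALID_BASES = set("ACGT")


def build_tagged_primer(adapter, barcode, target_hybridizing):
    """Return the primer sequence (5'->3') with the barcode between the
    adapter and the target-hybridizing portion."""
    parts = [adapter.upper(), barcode.upper(), target_hybridizing.upper()]
    for part in parts:
        if not part or set(part) - VALID_BASES:
            raise ValueError("each portion must be a non-empty A/C/G/T string")
    return "".join(parts)


def split_tagged_primer(primer, adapter_len, barcode_len):
    """Recover (adapter, barcode, target-hybridizing portion) by position."""
    barcode_end = adapter_len + barcode_len
    return primer[:adapter_len], primer[adapter_len:barcode_end], primer[barcode_end:]
```

Because the barcode sits at a fixed offset after the adapter in this layout, it can be recovered by position alone, which is one way pooled samples could later be attributed to their origin.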
  • a captured set of DNA (e.g., cfDNA) is provided.
  • the captured set of DNA may be provided, e.g., by performing a capturing step after a partitioning step as described herein.
  • the captured set may comprise DNA corresponding to a sequence-variable target region set, an epigenetic target region set, or a combination thereof.
  • the quantity of captured sequence-variable target region DNA is greater than the quantity of the captured epigenetic target region DNA, when normalized for the difference in the size of the targeted regions (footprint size).
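The footprint normalization mentioned above can be made concrete with a short sketch: each captured quantity is divided by the size of its targeted region before the two are compared. The function name, the units, and the example values are assumptions for illustration only.

```python
# Illustrative sketch only: compares captured quantities after normalizing
# each by the footprint (size in kb) of its target region set.

def footprint_normalized_fold(seq_var_quantity, seq_var_footprint_kb,
                              epi_quantity, epi_footprint_kb):
    """Return the fold difference of sequence-variable over epigenetic
    captured DNA, each expressed per kb of targeted footprint."""
    if seq_var_footprint_kb <= 0 or epi_footprint_kb <= 0:
        raise ValueError("footprints must be positive")
    if epi_quantity <= 0:
        raise ValueError("epigenetic quantity must be positive")
    seq_var_per_kb = seq_var_quantity / seq_var_footprint_kb
    epi_per_kb = epi_quantity / epi_footprint_kb
    return seq_var_per_kb / epi_per_kb
```

With hypothetical values of 100 units over a 50 kb sequence-variable footprint and 400 units over a 500 kb epigenetic footprint, the raw epigenetic quantity is larger, yet the normalized sequence-variable quantity is 2.5-fold greater.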
  • first and second captured sets may be provided, including, respectively, DNA corresponding to a sequence-variable target region set and DNA corresponding to an epigenetic target region set.
  • the first and second captured sets may be combined to provide a combined captured set.
  • the DNA corresponding to the sequence-variable target region set may be present at a greater concentration than the DNA corresponding to the epigenetic target region set, e.g., a 1.1- to 1.2-fold greater concentration, a 1.2- to 1.4-fold greater concentration, a 1.4- to 1.6-fold greater concentration, a 1.6- to 1.8-fold greater concentration, a 1.8- to 2.0-fold greater concentration, a 2.0- to 2.2-fold greater concentration, a 2.2- to 2.4-fold greater concentration, a 2.4- to 2.6-fold greater concentration, a 2.6- to 2.8-fold greater concentration, a 2.8- to 3.0-fold greater concentration, a 3.0- to 3.5-fold greater concentration, a 3.5- to 4.0-fold greater concentration, a 4.0- to 4.5-fold greater concentration, a 4.5- to 5.0-fold
  • the epigenetic target region set may comprise one or more types of target regions likely to differentiate DNA from neoplastic (e.g., tumor or cancer) cells and from healthy cells, e.g., non-neoplastic circulating cells. Exemplary types of such regions are discussed in detail herein.
  • the epigenetic target region set may also comprise one or more control regions, e.g., as described herein. In some embodiments, the epigenetic target region set has a footprint of at least 100 kb, e.g., at least 200 kb, at least 300 kb, or at least 400 kb.
  • the epigenetic target region set has a footprint in the range of 100-1000 kb, e.g., 100-200 kb, 200-300 kb, 300-400 kb, 400-500 kb, 500-600 kb, 600-700 kb, 700-800 kb, 800-900 kb, and 900-1,000 kb.
  • the epigenetic target region set includes one or more hypermethylation variable target regions.
  • hypermethylation variable target regions refer to regions where an increase in the level of observed methylation, e.g., in a cfDNA sample, indicates an increased likelihood that a sample (e.g., of cfDNA) contains DNA produced by neoplastic cells, such as tumor or cancer cells.
  • hypermethylation of promoters of tumor suppressor genes has been observed repeatedly. See, e.g., Kang et al., Genome Biol. 18:53 (2017) and references cited therein.
  • hypermethylation variable target regions can include regions that do not necessarily differ in methylation in cancerous tissue relative to DNA from healthy tissue of the same type, but do differ in methylation (e.g., have more methylation) relative to cfDNA that is typical in healthy subjects.
  • the presence of a cancer results in increased cell death such as apoptosis of cells of the tissue type corresponding to the cancer
  • such a cancer can be detected at least in part using such hypermethylation variable target regions.
  • hypermethylation variable target regions include one or more genomic regions, where the cfDNA molecules in those regions do not differ in methylation state in cancer subjects relative to cfDNA from healthy subjects, but the presence/increased quantity of hypermethylated cfDNA in those regions is indicative of a particular tissue type (e.g., cancer origin) and is presented as cfDNA with increased apoptosis (e.g., tumor shedding) into circulation.
  • Hypermethylation target regions may be obtained, e.g., from the Cancer Genome Atlas. Kang et al., Genome Biology 18:53 (2017), describe construction of a probabilistic method called CancerLocator using hypermethylation target regions from breast, colon, kidney, liver, and lung.
  • the hypermethylation target regions can be specific to one or more types of cancer.
  • the hypermethylation target regions include one, two, three, four, or five subsets of hypermethylation target regions that collectively show hypermethylation in one, two, three, four, or five of breast, colon, kidney, liver, and lung cancers.
  • the probes for the epigenetic target region set comprise probes specific for one or more hypermethylation variable target regions.
  • the hypermethylation variable target regions may be any of those set forth above.
  • the probes specific for hypermethylation variable target regions comprise probes specific for a plurality of loci listed in Table 1, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the loci listed in Table 1.
  • the probes specific for hypermethylation variable target regions comprise probes specific for a plurality of loci listed in Table 2, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the loci listed in Table 2.
  • the probes specific for hypermethylation variable target regions comprise probes specific for a plurality of loci listed in Table 1 or Table 2, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the loci listed in Table 1 or Table 2.
  • the one or more probes bind within 300 bp of the listed position, e.g., within 200 or 100 bp.
  • a probe has a hybridization site overlapping the position listed above.
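The proximity and coverage criteria above (a probe binding within 300, 200, or 100 bp of a listed position, and probes covering at least a given percentage of listed loci) can be sketched as follows. The sketch assumes probes are inclusive (start, end) intervals on the same chromosome as the loci; all names and coordinates are hypothetical and do not come from Table 1 or Table 2.

```python
# Illustrative sketch only: checks probe proximity to listed positions and
# computes the fraction of loci that have at least one nearby probe.

def probe_within_window(probe_start, probe_end, locus_position, window_bp=300):
    """True if the probe interval lies within window_bp of the locus.
    A probe overlapping the position has distance 0."""
    if probe_start > probe_end:
        raise ValueError("probe_start must not exceed probe_end")
    if probe_start <= locus_position <= probe_end:
        distance = 0
    elif locus_position < probe_start:
        distance = probe_start - locus_position
    else:
        distance = locus_position - probe_end
    return distance <= window_bp


def fraction_of_loci_covered(probes, loci, window_bp=300):
    """Fraction of loci with at least one probe within window_bp."""
    if not loci:
        raise ValueError("loci must not be empty")
    covered = sum(
        any(probe_within_window(start, end, locus, window_bp) for start, end in probes)
        for locus in loci
    )
    return covered / len(loci)
```

A probe set could then be compared against a threshold such as 0.5 to check whether it covers at least 50% of a list of loci.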
  • the probes specific for the hypermethylation target regions include probes specific for one, two, three, four, or five subsets of hypermethylation target regions that collectively show hypermethylation in one, two, three, four, or five of breast, colon, kidney, liver, and lung cancers.
  • the epigenetic target region set includes hypomethylation variable target regions, where a decrease in the level of observed methylation indicates an increased likelihood that a sample (e.g., of cfDNA) contains DNA produced by neoplastic cells, such as tumor or cancer cells.
  • hypomethylation variable target regions can include regions that do not necessarily differ in methylation state in cancerous tissue relative to DNA from healthy tissue of the same type, but do differ in methylation (e.g., are less methylated) relative to cfDNA that is typical in healthy subjects.
  • hypomethylation variable target regions include one or more genomic regions, where the cfDNA molecules in those regions do not differ in methylation state in cancer subjects relative to cfDNA from healthy subjects, but the presence/increased quantity of hypomethylated cfDNA in those regions is indicative of a particular tissue type (e.g., cancer origin) and is presented as cfDNA with increased apoptosis (e.g., tumor shedding) into circulation.
  • hypomethylation variable target regions include repeated elements and/or intergenic regions.
  • repeated elements include one, two, three, four, or five of LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and/or satellite DNA.
  • Exemplary specific genomic regions that show cancer-associated hypomethylation include nucleotides 8403565-8953708 and 151104701-151106035 of human chromosome 1.
  • the hypomethylation variable target regions overlap or comprise one or both of these regions.
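Whether a candidate target region overlaps either exemplary chromosome 1 region can be checked with a standard interval-overlap test. The sketch below uses the two coordinate ranges stated above; treating the coordinates as inclusive is an assumption (the genome build is not stated here), and the function names are hypothetical.

```python
# Illustrative sketch only: interval-overlap test against the two exemplary
# chromosome 1 regions. Inclusive coordinates are an assumption.

EXEMPLARY_HYPOMETHYLATION_REGIONS_CHR1 = [
    (8403565, 8953708),
    (151104701, 151106035),
]


def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def overlapping_exemplary_regions(start, end):
    """Return the exemplary regions overlapped by the candidate [start, end]."""
    return [
        region
        for region in EXEMPLARY_HYPOMETHYLATION_REGIONS_CHR1
        if intervals_overlap(start, end, region[0], region[1])
    ]
```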
  • the probes for the epigenetic target region set comprise probes specific for one or more hypomethylation variable target regions.
  • the hypomethylation variable target regions may be any of those set forth above.
  • the probes specific for one or more hypomethylation variable target regions may include probes for regions such as repeated elements (e.g., LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and satellite DNA) and intergenic regions. Such regions are ordinarily methylated in healthy cells and may show reduced methylation in tumor cells.
  • probes specific for hypomethylation variable target regions include probes specific for repeated elements and/or intergenic regions.
  • probes specific for repeated elements include probes specific for one, two, three, four, or five of LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and/or satellite DNA.
  • Exemplary probes specific for genomic regions that show cancer-associated hypomethylation include probes specific for nucleotides 8403565-8953708 and/or 151104701-151106035 of human chromosome 1.
  • the probes specific for hypomethylation variable target regions include probes specific for regions overlapping or including nucleotides 8403565-8953708 and/or 151104701-151106035 of human chromosome 1.
  • Probes for detecting the panel of regions can include those for detecting genomic regions of interest (hotspot regions) as well as nucleosome-aware probes (e.g., KRAS codons 12 and 13) and may be designed to optimize capture based on analysis of cfDNA coverage and fragment size variation impacted by nucleosome binding patterns and GC sequence composition. Regions used herein can also include non-hotspot regions optimized based on nucleosome positions and GC models.
Subjects
  • the DNA is obtained from a subject having a cancer. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject suspected of having a cancer. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject having a tumor. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject suspected of having a tumor. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject having neoplasia. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject suspected of having neoplasia.
  • the DNA (e.g., cfDNA) is obtained from a subject in remission from a tumor, cancer, or neoplasia (e.g., following chemotherapy, surgical resection, radiation, or a combination thereof).
  • the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia may be of the lung, colon, rectum, kidney, breast, prostate, or liver.
  • the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the lung.
  • the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the colon or rectum. In some embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the breast. In some embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the prostate. In any of the foregoing embodiments, the subject may be a human subject.
  • the sequence-variable target region probe set has a footprint of at least 0.5 kb, e.g., at least 1 kb, at least 2 kb, at least 5 kb, at least 10 kb, at least 20 kb, at least 30 kb, or at least 40 kb.
  • the epigenetic target region probe set has a footprint in the range of 0.5-100 kb, e.g., 0.5-2 kb, 2-10 kb, 10-20 kb, 20-30 kb, 30-40 kb, 40-50 kb, 50-60 kb, 60-70 kb, 70-80 kb, 80-90 kb, and 90-100 kb.
  • the probes specific for the sequence-variable target region set comprise probes specific for target regions from at least 10, 20, 30, or 35 cancer-related genes, such as AKT1, ALK, BRAF, CCND1, CDK2A, CTNNB1, EGFR, ERBB2, ESR1, FGFR1, FGFR2, FGFR3, FOXL2, GATA3, GNA11, GNAQ, GNAS, HRAS, IDH1, IDH2, KIT, KRAS, MED12, MET, MYC, NFE2L2, NRAS, PDGFRA, PIK3CA, PPP2R1A, PTEN, RET, STK11, TP53, and U2AF1.
  • the first population may comprise or be derived from DNA with a cytosine modification in a greater proportion than the second population.
  • the first population may comprise a form of a first nucleobase originally present in the DNA with altered base pairing specificity and a second nucleobase without altered base pairing specificity, wherein the form of the first nucleobase originally present in the DNA prior to alteration of base pairing specificity is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the form of the first nucleobase originally present in the DNA prior to alteration of base pairing specificity and the second nucleobase have the same base pairing specificity.
  • the second population does not comprise the form of the first nucleobase originally present in the DNA with altered base pairing specificity.
  • the cytosine modification is cytosine methylation.
  • the first nucleobase is a modified or unmodified cytosine and the second nucleobase is a modified or unmodified cytosine.
  • the first and second nucleobase may be any of those discussed herein in the Summary or with respect to subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample.
  • the first population includes a sequence tag selected from a first set of one or more sequence tags and the second population includes a sequence tag selected from a second set of one or more sequence tags, and the second set of sequence tags is different from the first set of sequence tags.
  • the sequence tags may comprise barcodes.
  • the first population includes protected hmC, such as glucosylated hmC.
  • the first population was subjected to any of the conversion procedures discussed herein, such as bisulfite conversion, Ox-BS conversion, TAB conversion, ACE conversion, TAP conversion, TAPSP conversion, or CAP conversion.
  • the first population was subjected to protection of hmC followed by deamination of mC and/or C.
  • the first population includes or was derived from DNA with a cytosine modification in a greater proportion than the second population and the first population includes first and second subpopulations
  • the first nucleobase is a modified or unmodified nucleobase
  • the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase
  • the first nucleobase and the second nucleobase have the same base pairing specificity.
  • the second population does not comprise the first nucleobase.
  • the first nucleobase is a modified or unmodified cytosine
  • the second nucleobase is a modified or unmodified cytosine, optionally wherein the modified cytosine is mC or hmC.
  • the first nucleobase is a modified or unmodified adenine
  • the second nucleobase is a modified or unmodified adenine, optionally wherein the modified adenine is mA.
  • the first nucleobase (e.g., a modified cytosine) is biotinylated.
  • the captured DNA may comprise cfDNA.
  • the captured DNA may have any of the features described herein concerning captured sets, including, e.g., a greater concentration of the DNA corresponding to the sequence-variable target region set (normalized for footprint size as discussed above) than of the DNA corresponding to the epigenetic target region set.
  • the DNA of the captured set includes sequence tags, which may be added to the DNA as described herein. In general, the inclusion of sequence tags results in the DNA molecules differing from their naturally occurring, untagged form.
  • the combination may further comprise a probe set described herein or sequencing primers, each of which may differ from naturally occurring nucleic acid molecules.
  • a probe set described herein may comprise a capture moiety
  • sequencing primers may comprise a non-naturally occurring label.
  • Methods of the present disclosure can be implemented using, or with the aid of, computer systems.
  • such methods may comprise: partitioning the sample into a plurality of subsamples, including a first subsample and a second subsample, wherein the first subsample includes DNA with a cytosine modification in a greater proportion than the second subsample; subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity; and sequencing DNA in the first subsample and DNA in the second subsample in a manner that distinguishes the first nucleobase from the second nucleobase in the DNA of
  • the present disclosure provides a non-transitory computer-readable medium including computer-executable instructions which, when executed by at least one electronic processor, perform at least a portion of a method including: collecting cfDNA from a test subject; capturing a plurality of sets of target regions from the cfDNA, wherein the plurality of target region sets includes a sequence-variable target region set and an epigenetic target region set, whereby a captured set of cfDNA molecules is produced; sequencing the captured cfDNA molecules, wherein the captured cfDNA molecules of the sequence-variable target region set are sequenced to a greater depth of sequencing than the captured cfDNA molecules of the epigenetic target region set; obtaining a plurality of sequence reads generated by a nucleic acid sequencer from sequencing the captured cfDNA molecules; mapping the plurality of sequence reads to one or more reference sequences to generate mapped sequence reads; and processing the mapped sequence reads corresponding to the sequence-variable target region set and to the epi
  • the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code or can be compiled during runtime.
  • the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
  • FIG. 1 illustrates an example architecture 100 to generate an integrated data repository that includes multiple types of healthcare data, according to one or more implementations.
  • the architecture 100 may include a data integration and analysis system 102.
  • the data integration and analysis system 102 may obtain data from a number of data sources and integrate the data from the data sources into an integrated data repository 104.
  • the data integration and analysis system 102 may obtain data from a health insurance claims data repository 106.
  • the data integration and analysis system 102 and the health insurance claims data repository 106 may be created and maintained by different entities.
  • the data integration and analysis system 102 and the health insurance claims data repository 106 may be created and maintained by a same entity.
  • the data integration and analysis system 102 may be implemented by one or more computing devices.
  • the one or more computing devices may include one or more server computing devices, one or more desktop computing devices, one or more laptop computing devices, one or more tablet computing devices, one or more mobile computing devices, or combinations thereof.
  • at least a portion of the one or more computing devices may be implemented in a distributed computing environment.
  • at least a portion of the one or more computing devices may be implemented in a cloud computing architecture.
  • processing operations may be performed concurrently by multiple virtual machines.
  • the data integration and analysis system 102 may implement multithreading techniques. The implementation of a distributed computing architecture and multithreading techniques causes the data integration and analysis system 102 to utilize fewer computing resources in relation to computing architectures that do not implement these techniques.
  • the health insurance claims data repository 106 may store information obtained from one or more health insurance companies that corresponds to insurance claims made by subscribers of the one or more health insurance companies.
  • the health insurance claims data repository 106 may be arranged (e.g., sorted) by patient identifier.
  • the patient identifier may be based on the patient’s first name, last name, date of birth, social security number, address, employer, and the like.
  • the data stored by the health insurance claims data repository 106 may include structured data that is arranged in one or more data tables.
  • the one or more data tables storing the structured data may include a number of rows and a number of columns that indicate information about health insurance claims made by subscribers of one or more health insurance companies in relation to procedures and/or treatments received by the subscribers from healthcare providers.
  • At least a portion of the rows and columns of the data tables stored by the health insurance claims data repository 106 may include health insurance codes that may indicate diagnoses of biological conditions, and treatments and/or procedures obtained by subscribers of the one or more health insurance companies.
  • the health insurance codes may also indicate diagnostic procedures obtained by individuals that are related to one or more biological conditions that may be present in the individuals.
  • a diagnostic procedure may provide information used in the detection of the presence of a biological condition.
  • a diagnostic procedure may also provide information used to determine a progression of a biological condition.
  • a diagnostic procedure may include one or more imaging procedures, one or more assays, one or more laboratory procedures, one or more combinations thereof, and the like.
  • the data integration and analysis system 102 may also obtain information from a molecular data repository 108.
  • the molecular data repository 108 may store data of a number of individuals related to genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, and/or proteomic information.
  • the data integration and analysis system 102 and the molecular data repository 108 may be created and maintained by different entities.
  • the data integration and analysis system 102 and the molecular data repository 108 may be created and maintained by a same entity.
  • the genomic information may indicate one or more mutations corresponding to genes of the individuals.
  • a mutation to a gene of individuals may correspond to differences between a sequence of nucleic acids of the individuals and one or more reference genomes.
  • the reference genome may include a known reference genome, such as hg19.
  • a mutation of a gene of an individual may correspond to a difference in a germline gene of an individual in relation to the reference genome.
  • the reference genome may include a germline genome of an individual.
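The idea of a mutation as a difference between an individual's sequence and a reference can be illustrated with a minimal sketch. It assumes the two sequences are already aligned, equal in length, and differ only by substitutions; real variant calling also handles insertions, deletions, and sequencing error. The function name and coordinates are hypothetical.

```python
# Illustrative sketch only: reports single-base differences between an
# aligned sample sequence and a reference sequence of equal length.

def find_substitutions(reference, sample, reference_start=1):
    """Return (position, reference_base, sample_base) for each mismatch.
    Positions are 1-based by default and offset by reference_start."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    for offset, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base:
            differences.append((reference_start + offset, ref_base, sample_base))
    return differences
```

The same comparison applies whether the reference is a known reference genome or the individual's own germline sequence; only the choice of `reference` changes.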
  • genomic information stored by the molecular data repository 108 may include genomic profiles of tumor cells present within individuals.
  • the genomic information may be derived from an analysis of genetic material, such as deoxyribonucleic acid (DNA) and/or ribonucleic acid (RNA) from a sample, including, but not limited to, a tissue sample or tumor biopsy, circulating tumor cells (CTCs), exosomes or efferosomes, or from circulating nucleic acids (e.g., cell-free DNA) found in blood samples of individuals that are present due to the degradation of tumor cells present in the individuals.
  • the genomic information of tumor cells of individuals may correspond to one or more target regions.
  • One or more mutations present with respect to the one or more target regions may indicate the presence of tumor cells in individuals.
  • the genomic information stored by the molecular data repository 108 may be generated in relation to an assay or other diagnostic test that may determine one or more mutations with respect to one or more target regions of the reference genome.
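One way to picture the integration of claims data and molecular data is to group records from both repositories under a shared patient identifier. The sketch below is an assumption-laden illustration: the record fields (`patient_id`, `code`, `gene`) and the use of a common identifier across both repositories are hypothetical, and the text above does not specify how records are actually linked.

```python
# Illustrative sketch only: groups claims rows and molecular rows by a shared
# patient identifier. Field names are hypothetical.
from collections import defaultdict


def integrate_by_patient(claims_rows, molecular_rows):
    """Return {patient_id: {"claims": [...], "molecular": [...]}}."""
    integrated = defaultdict(lambda: {"claims": [], "molecular": []})
    for row in claims_rows:
        integrated[row["patient_id"]]["claims"].append(row)
    for row in molecular_rows:
        integrated[row["patient_id"]]["molecular"].append(row)
    return dict(integrated)


def patients_in_both(integrated):
    """Sorted identifiers of patients with data from both repositories."""
    return sorted(
        patient_id
        for patient_id, record in integrated.items()
        if record["claims"] and record["molecular"]
    )
```

Patients who appear in only one repository are retained with an empty list for the other source, so downstream analyses can decide how to handle incomplete records.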
  • the data integration and analysis system 102 may obtain information from one or more additional data repositories 110.
  • the one or more additional data repositories 110 may store data related to electronic medical records of individuals for which data is present in at least one of the health insurance claims data repository 106 or the molecular data repository 108. Further, the one or more additional data repositories 110 may store data related to pathology reports of individuals for which data is present in at least one of the health insurance claims data repository 106 or the molecular data repository 108. In various examples, the one or more additional data repositories 110 may store data related to biological conditions and/or treatments for biological conditions.
  • the data integration and analysis system 102 and at least a portion of the one or more additional data repositories 110 may be created and maintained by different entities. In one or more further examples, the data integration and analysis system 102 and at least a portion of the one or more additional data repositories 110 may be created and maintained by a same entity.
  • the data integration and analysis system 102 may obtain information from one or more reference information data repositories 112.
  • the one or more reference information data repositories 112 may store information that includes definitions, standards, protocols, vocabularies, one or more combinations thereof, and the like.
  • the information stored by the one or more reference information data repositories 112 may correspond to biological conditions and/or treatments for biological conditions.
  • the one or more reference information data repositories 112 may include RxNorm.
  • the data integration and analysis system 102 and at least a portion of the one or more reference information data repositories 112 may be created and maintained by different entities. In one or more further examples, the data integration and analysis system 102 and at least a portion of the one or more reference information data repositories 112 may be created and maintained by a same entity.
  • the data integration and analysis system 102 may obtain data from at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112 via one or more communication networks accessible to the data integration and analysis system 102 and accessible to at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112.
  • the data integration and analysis system 102 may also obtain data from at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112 via one or more secure communication channels.
  • the data integration and analysis system 102 may obtain data from at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112 via one or more calls of an application programming interface (API).
  • the data integration and analysis system 102 may include a data integration system 114.
  • the data integration system 114 may obtain data from the health insurance claims data repository 106 and the molecular data repository 108 to generate the integrated data repository 104.
  • the data integration system 114 may also obtain data from the one or more additional data repositories 110 to generate the integrated data repository 104.
  • the data integration system 114 may implement one or more natural language processing techniques to integrate data from the one or more additional data repositories 110 into the integrated data repository 104.
  • the data integration system 114 may generate one or more tokens to identify individuals that have data stored in the health insurance claims data repository 106 and that have data stored in the molecular data repository 108.
  • the data integration system 114 may generate one or more tokens by implementing one or more hash functions.
  • the data integration system 114 may implement the one or more hash functions to generate the one or more tokens based on information stored by at least one of the health insurance claims data repository 106 or the molecular data repository 108.
  • the information used by the data integration system 114 to generate individual tokens by implementing a hash function may include at least one of an identifier of respective individuals, a date of birth of the respective individuals, a postal code of the respective individuals, or a gender of the respective individuals.
  • the identifiers of the respective individuals may include a combination of at least a portion of a first name of the respective individuals and at least a portion of the last name of the respective individuals.
  • Tokens generated using data from different data repositories may correspond to the same or similar information or the same or similar type stored by the different data repositories.
  • tokens may be generated using a portion of names of individuals, date of birth, at least a portion of a postal code, and gender obtained from the health insurance claims data repository 106 and the molecular data repository 108.
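The token generation described above can be sketched as follows. This is a minimal illustration, not the method used by the data integration system 114: the choice of SHA-256, the field normalization, the prefix lengths, and the function name `make_token` are all assumptions made for the example.

```python
import hashlib


def make_token(first_name, last_name, date_of_birth, postal_code, gender):
    """Build a linkage token from normalized demographic fields.

    Hypothetical scheme: the prefix lengths and separator are assumptions.
    """
    parts = [
        first_name.strip().lower()[:3],  # portion of the first name
        last_name.strip().lower()[:5],   # portion of the last name
        date_of_birth.strip(),           # assumed ISO format, e.g. "1970-01-31"
        postal_code.strip()[:3],         # portion of the postal code
        gender.strip().lower()[:1],      # first letter of the gender field
    ]
    # Hash the joined fields so the token does not expose the raw values.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# The same individual recorded slightly differently in two repositories
# yields the same token once the fields are normalized.
claims_token = make_token("Alice", "Johnson", "1970-01-31", "94301", "F")
molecular_token = make_token(" alice ", "JOHNSON", "1970-01-31", "94305", "Female")
```

Because both repositories apply the same function to the same type of information, tokens can be compared without exchanging the underlying names or dates.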
  • the data integration system 114 may integrate data from a number of different data sources by analyzing tokens generated by implementing one or more hash functions using data obtained from the number of different data sources. For example, the data integration system 114 may obtain one or more first tokens generated from data stored by the health insurance claims data repository 106 and one or more second tokens generated from data stored by the molecular data repository 108. The data integration system 114 may analyze the one or more first tokens with respect to the one or more second tokens to determine individual first tokens that correspond to individual second tokens. In one or more illustrative examples, the data integration system 114 may identify individual first tokens that match individual second tokens.
  • a first token may match a second token when the data of the first token has at least a threshold amount of similarity with respect to the data of the second token.
  • a first token may match a second token when the data of the first token is the same as the data of the second token.
  • a first token may match a second token when an alphanumeric string of the first token is the same as an alphanumeric string of the second token.
  • the data integration system 114 may identify an individual having data that is stored in both the health insurance claims data repository 106 and in the molecular data repository 108. In this way, the data integration system 114 may obtain data from the health insurance claims data repository 106 from a number of individuals and data from the molecular data repository 108 from the same number of individuals and store the health insurance claims data and the molecular data for the number of individuals in the integrated data repository 104.
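The exact-match case of token comparison can be sketched as a key intersection. This is an illustrative sketch under assumptions: tokens are plain strings, each repository is represented as a dictionary keyed by token, and `link_by_token` is a hypothetical name.

```python
def link_by_token(claims_by_token, molecular_by_token):
    """Return combined records for tokens present in both repositories."""
    # A first token matches a second token when the strings are identical.
    shared = claims_by_token.keys() & molecular_by_token.keys()
    return {
        token: {
            "claims": claims_by_token[token],
            "molecular": molecular_by_token[token],
        }
        for token in sorted(shared)
    }


# Placeholder tokens and records, for illustration only.
claims = {"tok-a": ["claim 1"], "tok-b": ["claim 2", "claim 3"]}
molecular = {"tok-b": ["assay result 1"], "tok-c": ["assay result 2"]}
integrated = link_by_token(claims, molecular)
```

Only `tok-b` appears in both inputs, so only that individual's claims data and molecular data are stored together in the integrated result.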
  • the data integration system 114 may also integrate data stored by the one or more additional data repositories 110 with data from the health insurance claims data repository 106 and the molecular data repository 108 to generate the integrated data repository 104.
  • the data integration system 114 may obtain one or more third tokens generated from data stored by an additional data repository 110, such as a data repository storing data corresponding to pathology reports.
  • the data integration system 114 may analyze the one or more third tokens with respect to the first tokens generated using information stored by the health insurance claims data repository 106 and the second tokens generated using information stored by the molecular data repository 108 to determine respective third tokens that correspond to individual first tokens and individual second tokens.
  • the data integration system 114 may identify third tokens generated using one or more hash functions and a common set of information obtained from the health insurance claims data repository 106, the molecular data repository 108, and the additional data repository 110.
  • the data integration system 114 may identify an individual having data that is stored in the health insurance claims data repository 106, the molecular data repository 108, and in an additional data repository 110. In this way, the data integration system 114 may obtain data from the health insurance claims data repository 106 from a number of individuals and data from the molecular data repository 108 and an additional data repository 110 from the same number of individuals and store the health insurance claims data, the molecular data, and the additional data for the number of individuals in the integrated data repository 104.
  • the data stored by the integrated data repository 104 for the number of individuals may be accessible using respective identifiers of individuals.
  • the data integration system 114 may implement a number of techniques as part of a de-identification process with respect to storing and retrieving information of individuals in the integrated data repository 104.
  • the identifiers of individuals may correspond to keys that are generated using at least one hash function.
  • the identifiers of the individuals may also be generated by implementing one or more salting processes with respect to the keys generated using the at least one hash function, the tokens generated using one or more hash functions and a common set of information obtained from the health insurance claims data repository 106, the molecular data repository 108, and/or the additional data repository 110.
  • the identifiers generated by the data integration system 114 to access information for respective individuals that is stored by the integrated data repository 104 may be unique for each individual. In one or more examples, the identifiers of the individuals may be generated using at least a portion of the information used to generate the tokens related to the individuals. In one or more additional examples, the identifiers of the individuals may be generated using different information from the information used to generate the tokens related to the individuals.
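One way to picture a salting process applied to a token is shown below. This is a sketch under assumptions, not the disclosed de-identification process: the salt length, the use of SHA-256, and the name `make_identifier` are hypothetical.

```python
import hashlib
import secrets


def make_identifier(token: str, salt: bytes) -> str:
    """Derive a repository identifier from a linkage token and a secret salt."""
    # Prepending a secret salt means the identifier cannot be recomputed
    # from the token alone.
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


# A fresh random salt, assumed to be kept secret by the integrating system.
salt = secrets.token_bytes(16)
identifier = make_identifier("tok-b", salt)
```

The same token and salt always give the same identifier, while a different salt gives an unrelated identifier for the same individual.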
  • the data integration system 114 may also generate the integrated data repository 104 from a number of different combinations of data repositories in a similar manner. For example, the data integration system 114 may obtain tokens generated from information stored by the health insurance claims data repository 106 and additional tokens generated from information stored by one or more additional data repositories 110. The data integration system 114 may determine individual tokens generated from information stored by the health insurance claims data repository 106 that correspond to individual additional tokens generated from information stored by the one or more additional data repositories 110.
  • the data integration system 114 may identify individuals having data that is stored in both the health insurance claims data repository 106 and in the additional data repository 110. In this way, the data integration system 114 may obtain data from the health insurance claims data repository 106 from a number of individuals and data from the additional data repository 110 from the same number of individuals and store the health insurance claims data and the additional data for the number of individuals in the integrated data repository 104.
  • the health insurance claims data and the additional data stored by the integrated data repository 104 for the number of individuals may be accessible using respective identifiers of individuals.
  • the data integration system 114 may obtain tokens generated from information stored by the molecular data repository 108 and additional tokens generated from information stored by one or more additional data repositories 110.
  • the data integration system 114 may determine individual tokens generated from information stored by the molecular data repository 108 that correspond to individual additional tokens generated from information stored by the one or more additional data repositories 110. By determining tokens generated using data stored by the molecular data repository 108 that correspond to additional tokens generated using data stored by an additional data repository 110, the data integration system 114 may identify individuals having data that is stored in both the molecular data repository 108 and in the additional data repository 110.
  • the data integration system 114 may obtain data from the molecular data repository 108 from a number of individuals and data from the additional data repository 110 from the same number of individuals and store the molecular data and the additional data for the number of individuals in the integrated data repository 104.
  • the molecular data and the additional data stored by the integrated data repository 104 for the number of individuals may be accessible using respective identifiers of individuals.
  • the data stored by the integrated data repository 104 may be stored according to one or more regulatory frameworks that protect the privacy and ensure the security of medical records, health information, and insurance information of individuals.
  • data may be stored by the integrated data repository 104 in accordance with one or more governmental regulatory frameworks directed to protecting personal information, such as the Health Insurance Portability and Accountability Act (HIPAA) and/or the General Data Protection Regulation (GDPR).
  • the integrated data repository 104 also stores data in an anonymized and de-identified manner to ensure protection of the privacy of individuals that have data stored by the integrated data repository 104.
  • the data integration system 114 may re-generate the integrated data repository 104 periodically.
  • the data integration system 114 may create the integrated data repository 104 once per quarter.
  • the data integration system 114 may generate the integrated data repository 104 on a monthly basis, on a weekly basis, or once every two weeks.
  • periodically re-generating the integrated data repository 104, rather than simply refreshing it with new data, enhances privacy protection with respect to data stored by the integrated data repository 104. That is, in situations where data repositories are refreshed simply with new data, it may be possible to more easily track individuals associated with data that has been newly added to a data repository because the number of new individuals added at a given time is typically smaller than an existing number of individuals that already have data stored by the data repository.
  • data stored by the integrated data repository 104 may be accessed via a database management system.
  • the integrated data repository 104 may store data according to one or more database models.
  • the integrated data repository 104 may store data according to one or more relational database technologies.
  • the integrated data repository 104 may store data according to a relational database model.
  • the integrated data repository 104 may store data according to an object-oriented database model.
  • the integrated data repository 104 may store data according to an extensible markup language (XML) database model.
  • the integrated data repository 104 may store data according to a structured query language (SQL) database model.
  • the integrated data repository may store data according to an image database model.
  • the data integration system 114 may generate the integrated data repository 104 by generating a number of data tables and creating links between the data tables.
  • the links may indicate logical couplings between the data tables.
  • the data integration system 114 may generate the data tables by extracting specified sets of data from the information obtained from the data repositories 106, 108, 110, 112 and storing the data in rows and columns of respective data tables.
  • the logical couplings between data tables may include at least one of a one-to-one link where a row of information in one data table corresponds to a row of information in another data table, a one-to-many link where a row of information in one data table corresponds to multiple rows of information in another data table, or a many-to-many link where multiple rows of information of one data table correspond to multiple rows of information in another data table.
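A one-to-many link between two data tables can be illustrated with an in-memory relational database. This is a minimal sketch: the table names, column names, identifier, and codes are placeholders invented for the example and are not taken from the data repository schema 116.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE individual (
        individual_id TEXT PRIMARY KEY,
        birth_year    INTEGER
    );
    CREATE TABLE claim (
        claim_id      INTEGER PRIMARY KEY,
        individual_id TEXT REFERENCES individual(individual_id),
        code          TEXT
    );
    """
)

# One row in `individual` corresponds to multiple rows in `claim`.
conn.execute("INSERT INTO individual VALUES ('id-1', 1960)")
conn.executemany(
    "INSERT INTO claim (individual_id, code) VALUES (?, ?)",
    [("id-1", "DX-001"), ("id-1", "RX-001")],
)

# Following the link retrieves the related rows from the second table.
rows = conn.execute(
    """
    SELECT i.individual_id, c.code
    FROM individual AS i
    JOIN claim AS c ON c.individual_id = i.individual_id
    ORDER BY c.claim_id
    """
).fetchall()
```

A one-to-one link would add a uniqueness constraint on the referencing column, and a many-to-many link would typically use a third table that holds pairs of keys.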
  • the number of data tables may be arranged according to a data repository schema 116.
  • the data repository schema 116 includes a first data table 118, a second data table 120, a third data table 122, a fourth data table 124, and a fifth data table 126.
  • the data repository schema 116 may include more data tables or fewer data tables.
  • the data repository schema 116 may also include links between the data tables 118, 120, 122, 124, 126.
  • the links between the data tables 118, 120, 122, 124, 126 may indicate that retrieving information from one of the data tables 118, 120, 122, 124, 126 causes additional information stored by one or more additional data tables 118, 120, 122, 124, 126 to be retrieved. Additionally, not all the data tables 118, 120, 122, 124, 126 may be linked to each of the other data tables 118, 120, 122, 124, 126. In the illustrative example of Figure 1, the first data table 118 is logically coupled to the second data table 120 by a first link 128 and the first data table 118 is logically coupled to the fourth data table 124 by a second link 130.
  • the second data table 120 is logically coupled to the third data table 122 via a third link 132 and the fourth data table 124 is logically coupled to the fifth data table 126 via a fourth link 134. Further, the third data table 122 is logically coupled to the fifth data table 126 via a fifth link 136.
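The links of Figure 1 can be written down as a small graph to show which data tables are directly coupled and which are only reachable through other tables. This sketch assumes the first link 128 couples data tables 118 and 120, and the function names are hypothetical.

```python
from collections import deque

# Link reference numeral -> pair of data tables it couples (per Figure 1).
LINKS = {
    128: (118, 120),
    130: (118, 124),
    132: (120, 122),
    134: (124, 126),
    136: (122, 126),
}


def build_adjacency(links):
    """Turn the link list into an undirected adjacency map."""
    adjacency = {}
    for table_a, table_b in links.values():
        adjacency.setdefault(table_a, set()).add(table_b)
        adjacency.setdefault(table_b, set()).add(table_a)
    return adjacency


def reachable_tables(start, links=LINKS):
    """Breadth-first walk: tables whose data may be pulled in from `start`."""
    adjacency = build_adjacency(links)
    seen = [start]
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for neighbor in sorted(adjacency.get(table, ())):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


direct = build_adjacency(LINKS)[118]  # tables directly coupled to table 118
order = reachable_tables(118)         # every table reachable from table 118
```

Here table 118 is directly coupled only to tables 120 and 124, which matches the statement that not every data table is linked to each of the others, yet all five tables are reachable by following links.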
  • the integrated data repository 104 may store data tables according to the data repository schema 116 for at least a portion of the individuals for which the data integration system 114 obtained information from a combination of at least two of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, and the one or more reference information data repositories 112.
  • the integrated data repository 104 may store respective instances of the data tables 118, 120, 122, 124, 126 according to the data repository schema 116 for thousands, tens of thousands, up to hundreds of thousands or more individuals.
  • the data integration and analysis system 102 may also include a data pipeline system 138.
  • the data pipeline system 138 may include a number of algorithms, software code, scripts, macros, or other bundles of computer-executable instructions that process information stored by the integrated data repository 104 to generate additional datasets.
  • the additional datasets may include information obtained from one or more of the data tables 118, 120, 122, 124, 126.
  • the additional datasets may also include information that is derived from data obtained from one or more of the data tables 118, 120, 122, 124, 126.
  • the components of the data pipeline system 138 implemented to generate a first additional dataset may be different from the components of the data pipeline system 138 used to generate a second additional dataset.
  • the data pipeline system 138 may generate a dataset that indicates pharmacy treatments received by a number of individuals.
  • the data pipeline system 138 may analyze information stored in at least one of the data tables 118, 120, 122, 124, 126 to determine health insurance codes corresponding to pharmaceutical treatments received by a number of individuals.
  • the data pipeline system 138 may analyze the health insurance codes corresponding to pharmaceutical treatments with respect to a library of data that indicates specified pharmaceutical treatments that correspond to one or more health insurance codes to determine names of pharmaceutical treatments that have been received by the individuals.
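The lookup of treatment names from health insurance codes can be sketched as a dictionary-based mapping. The codes, drug names, and the function name below are placeholders invented for illustration; a real library of data would come from a reference source.

```python
# Hypothetical library mapping health insurance codes to treatment names.
CODE_LIBRARY = {
    "RX-001": "drug A",
    "RX-002": "drug B",
}


def treatments_by_individual(claims, library):
    """Map each individual to the named treatments found in their claims.

    `claims` is an iterable of (individual_id, code) pairs.
    """
    result = {}
    for individual_id, code in claims:
        name = library.get(code)
        if name is None:
            continue  # code is not a known pharmaceutical treatment
        names = result.setdefault(individual_id, [])
        if name not in names:
            names.append(name)
    return result


claims = [
    ("id-1", "RX-001"),
    ("id-1", "DX-001"),  # not in the library, so it is skipped
    ("id-1", "RX-001"),  # duplicate claim, recorded once
    ("id-2", "RX-002"),
]
pharmacy_dataset = treatments_by_individual(claims, CODE_LIBRARY)
```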
  • the data pipeline system 138 may analyze information stored by the integrated data repository 104 to determine medical procedures received by a number of individuals.
  • the data pipeline system 138 may analyze information stored by one of the data tables 118, 120, 122, 124, 126 to determine treatments received by individuals via at least one of injection or intravenously.
  • the data pipeline system 138 may analyze information stored by the integrated data repository 104 to determine episodes of care for individuals, lines of therapy received by individuals, progression of a biological condition, or time to next treatment.
  • the datasets generated by the data pipeline system 138 may be different for different biological conditions.
  • the data pipeline system 138 may generate a first number of datasets with respect to a first type of cancer, such as lung cancer, and a second number of datasets with respect to a second type of cancer, such as colorectal cancer.
  • the data pipeline system 138 may also determine one or more confidence levels to assign to information associated with individuals having data stored by the integrated data repository 104.
  • the respective confidence levels may correspond to different measures of accuracy for information associated with individuals having data stored by the integrated data repository 104.
  • the information associated with the respective confidence levels may correspond to one or more characteristics of individuals derived from data stored by the integrated data repository 104. Values of confidence levels for the one or more characteristics may be generated by the data pipeline system 138 in conjunction with generating one or more datasets from the integrated data repository 104.
  • a first confidence level may correspond to a first range of measures of accuracy
  • a second confidence level may correspond to a second range of measures of accuracy
  • a third confidence level may correspond to a third range of measures of accuracy.
  • the second range of measures of accuracy may include values that are less than values of the first range of measures of accuracy and the third range of measures of accuracy may include values that are less than values of the second range of measures of accuracy.
  • information corresponding to the first confidence level may be referred to as Gold standard information
  • information corresponding to the second confidence level may be referred to as Silver standard information
  • information corresponding to the third confidence level may be referred to as Bronze standard information.
  • the data pipeline system 138 may determine values for the confidence levels of characteristics of individuals based on a number of factors. For example, a respective set of information may be used to determine characteristics of individuals. The data pipeline system 138 may determine the confidence levels of characteristics of individuals based on an amount of completeness of the respective set of information used to determine a characteristic for an individual. In situations where one or more pieces of information are missing from the set of information associated with a first number of individuals, the confidence levels for a characteristic may be lower than for a second number of individuals where information is not missing from the set of information. In one or more examples, an amount of missing information may be used by the data pipeline system 138 to determine confidence levels of characteristics of individuals.
  • a greater amount of missing information used to determine a characteristic of an individual may cause confidence levels for the characteristic to be lower than in situations where the amount of missing information used to determine the characteristic is lower.
  • different types of information may correspond to various confidence levels for a characteristic.
  • the presence of a first piece of information used to determine a characteristic of an individual may result in confidence levels for the characteristic being higher than the presence of a second piece of information used to determine the characteristic.
  • the data pipeline system 138 may determine a number of individuals included in a cohort with a primary diagnosis of lung cancer (or other biological condition).
  • the data pipeline system 138 may determine confidence levels for respective individuals with respect to being classified as having a primary diagnosis of lung cancer.
  • the data pipeline system 138 may use information from a number of columns included in the data tables 118, 120, 122, 124, 126 to determine a confidence level for the inclusion of individuals within a lung cancer cohort.
  • the number of columns may include health insurance codes related to diagnosis of biological conditions and/or treatments of biological conditions. Additionally, the number of columns may correspond to dates of diagnosis and/or treatment for biological conditions.
  • the data pipeline system 138 may determine that a confidence level of an individual being characterized as being part of the lung cancer cohort is higher in scenarios where information is available for each of the number of columns or at least a threshold number of columns than in instances where information is available for less than a threshold number of columns. Further, the data pipeline system 138 may determine confidence levels for individuals included in a lung cancer cohort based on the type of information and availability of information associated with one or more columns.
  • where the diagnosis codes used to determine whether a group of individuals is included in the lung cancer cohort are present, the data pipeline system 138 may determine that the confidence level of including the group of individuals in the lung cancer cohort is greater than in situations where at least one of the diagnosis codes is absent and the treatment codes used to determine whether individuals are included in the lung cancer cohort are present.
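The assignment of Gold, Silver, and Bronze confidence levels from the completeness and type of available information can be sketched as a simple rule. The column names, the threshold of three columns, and the rule that a missing diagnosis code caps the level at Bronze are assumptions chosen for illustration, not values from the source.

```python
from enum import IntEnum


class Confidence(IntEnum):
    BRONZE = 1
    SILVER = 2
    GOLD = 3


# Hypothetical columns consulted for lung cancer cohort membership.
COHORT_COLUMNS = (
    "diagnosis_code",
    "diagnosis_date",
    "treatment_code",
    "treatment_date",
)


def cohort_confidence(record, threshold=3):
    """Assign a confidence level based on which columns hold a value."""
    present = [c for c in COHORT_COLUMNS if record.get(c) not in (None, "")]
    if len(present) == len(COHORT_COLUMNS):
        return Confidence.GOLD  # nothing is missing
    if len(present) >= threshold and "diagnosis_code" in present:
        return Confidence.SILVER  # enough columns, diagnosis code available
    return Confidence.BRONZE  # too little information, or no diagnosis code


complete = {c: "x" for c in COHORT_COLUMNS}
no_treatment_date = dict(complete, treatment_date=None)
no_diagnosis_code = dict(complete, diagnosis_code="")
```

With these inputs, the complete record is Gold, the record missing only a treatment date is Silver, and the record missing its diagnosis code is Bronze even though three columns are present, reflecting that the type of missing information matters as well as the amount.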
  • the data integration and analysis system 102 may include a data analysis system 140.
  • the data analysis system 140 may receive integrated data repository requests 142 from one or more computing devices, such as an example computing device 144.
  • the one or more integrated data repository requests 142 may cause data to be retrieved from the integrated data repository 104.
  • the one or more integrated data repository requests 142 may cause data to be retrieved from one or more datasets generated by the data pipeline system 138.
  • the integrated data repository requests 142 may specify the data to be retrieved from the integrated data repository 104 and/or the one or more datasets generated by the data pipeline system 138.
  • the integrated data repository requests 142 may include one or more prebuilt queries that correspond to computer-executable instructions that retrieve a specified set of data from the integrated data repository 104 and/or one or more datasets generated by the data pipeline system 138.
  • the data analysis system 140 may analyze data retrieved from at least one of the integrated data repository 104 or one or more datasets generated by the data pipeline system 138 to generate data analysis results 146.
  • the data analysis results 146 may be sent to one or more computing devices, such as example computing device 148.
  • the data analysis results 146 may be received by a same computing device that sent the one or more integrated data repository requests 142.
  • the data analysis results 146 may be displayed by one or more user interfaces rendered by the computing device 144 or the computing device 148.
  • the data analysis system 140 may implement at least one of one or more machine learning techniques or one or more statistical techniques to analyze data retrieved in response to one or more integrated data repository requests 142.
  • the data analysis system 140 may implement one or more artificial neural networks to analyze data retrieved in response to one or more integrated data repository requests 142.
  • the data analysis system 140 may implement at least one of one or more convolutional neural networks or one or more residual neural networks to analyze data retrieved from the integrated data repository 104 in response to one or more integrated data repository requests 142.
  • the data analysis system 140 may implement one or more random forests techniques, one or more support vector machines, or one or more Hidden Markov models to analyze data retrieved in response to one or more integrated data repository requests 142.
  • One or more statistical models may also be implemented to analyze data retrieved in response to one or more integrated data repository requests 142 to identify at least one of correlations or measures of significance between characteristics of individuals. For example, log rank tests may be applied to data retrieved in response to one or more integrated data repository requests 142.
  • Cox proportional hazards models may be implemented with respect to data retrieved in response to one or more integrated data repository requests 142.
  • Wilcoxon signed rank tests may be applied to data retrieved in response to one or more integrated data repository requests 142.
  • a z-score analysis may be performed with respect to data retrieved in response to one or more integrated data repository requests 142.
  • a Kaplan Meier analysis may be performed with respect to data retrieved in response to one or more integrated data repository requests 142.
  • one or more machine learning techniques may be implemented in combination with one or more statistical techniques to analyze data retrieved in response to one or more integrated data repository requests 142.
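As one concrete instance of the statistical techniques listed above, a Kaplan Meier survival estimate can be computed in a few lines. This is a textbook product-limit estimator written for illustration; the function name and the sample data are hypothetical and do not come from the integrated data repository 104.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each individual.
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        time = data[i][0]
        observed = 0
        leaving = 0
        # Group all individuals who share this follow-up time.
        while i < len(data) and data[i][0] == time:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Product-limit step: multiply by the fraction surviving this time.
            survival *= 1 - observed / at_risk
            curve.append((time, survival))
        at_risk -= leaving  # events and censored individuals leave the risk set
    return curve


# Five individuals: events observed at times 2, 3, and 5; censored at 3 and 8.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

For this sample the estimated survival probability falls to about 0.8, 0.6, and 0.3 at times 2, 3, and 5. Curves estimated this way for different treatment groups are what a log rank test would compare.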
  • the data analysis system 140 may determine a rate of survival of individuals in which lung cancer is present in response to one or more treatments. In one or more additional illustrative examples, the data analysis system 140 may determine a rate of survival of individuals having one or more genomic region mutations in which lung cancer is present in response to one or more treatments. In various examples, the data analysis system 140 may generate the data analysis results 146 in situations where the data retrieved from at least one of the integrated data repository 104 or the one or more datasets generated by the data pipeline system 138 satisfies one or more criteria.
  • the data analysis system 140 may determine whether at least a portion of the data retrieved in response to one or more integrated data repository requests 142 satisfies a threshold confidence level. In situations where the confidence level for at least a portion of the data retrieved in response to one or more integrated data repository requests 142 is less than a threshold confidence level, the data analysis system 140 may refrain from generating at least a portion of data analysis results 146. In scenarios where the confidence level for at least a portion of the data retrieved in response to one or more integrated data repository requests 142 is at least a threshold confidence level, the data analysis system 140 may generate at least a portion of the data analysis results 146. In various examples, the threshold confidence level may be related to the type of data analysis results 146 being generated by the data analysis system 140.
  • the data analysis system 140 may receive an integrated data repository request 142 to generate data analysis results 146 that indicate a rate of survival of one or more individuals. In these instances, the data analysis system 140 may determine whether the data stored by the integrated data repository 104 and/or by one or more datasets generated by the data pipeline system 138 satisfies a threshold confidence level, such as a Gold standard confidence level. In one or more additional examples, the data analysis system 140 may receive an integrated data repository request 142 to generate data analysis results 146 that indicate a treatment received by one or more individuals. In these implementations, the data analysis system 140 may determine whether the data stored by the integrated data repository 104 and/or by one or more datasets generated by the data pipeline system 138 satisfies a lower threshold confidence level, such as a Bronze standard confidence level.
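The dependence of the threshold confidence level on the type of analysis can be sketched as a lookup followed by a filter. The analysis type names, the row layout, and the function name are assumptions made for this example; only the pairing of survival analysis with Gold and treatment analysis with Bronze follows the text above.

```python
from enum import IntEnum


class Confidence(IntEnum):
    BRONZE = 1
    SILVER = 2
    GOLD = 3


# Threshold confidence level required for each type of analysis result.
REQUIRED_CONFIDENCE = {
    "survival_rate": Confidence.GOLD,
    "treatments_received": Confidence.BRONZE,
}


def run_analysis(rows, analysis_type):
    """Use only rows meeting the threshold; return None if none qualify."""
    required = REQUIRED_CONFIDENCE[analysis_type]
    usable = [row for row in rows if row["confidence"] >= required]
    if not usable:
        return None  # refrain from generating results
    return {"analysis": analysis_type, "individuals": len(usable)}


rows = [
    {"id": "id-1", "confidence": Confidence.GOLD},
    {"id": "id-2", "confidence": Confidence.SILVER},
    {"id": "id-3", "confidence": Confidence.BRONZE},
]
survival = run_analysis(rows, "survival_rate")
treatments = run_analysis(rows, "treatments_received")
```

In this example the survival analysis uses one individual and the treatment analysis uses all three, since the lower Bronze threshold admits every row.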
  • the data analysis system 140 may receive an integrated data repository request 142 to determine individuals having one or more genomic mutations and that have received one or more treatments for a biological condition. Continuing with this example, the data analysis system 140 can determine a survival rate of individuals with the one or more genomic mutations in relation to the one or more treatments received by the individuals. The data analysis system 140 can then identify based on the survival rate of individuals an effectiveness of treatments for the individuals in relation to genomic mutations that may be present in the individuals. In this way, health outcomes of individuals may be improved by identifying prospective treatments that may be more effective for populations of individuals having one or more genomic mutations than current treatments being provided to the individuals.
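  • one way to picture the survival rate comparison is the sketch below. The record layout, the placeholder mutation and treatment names, and the helper functions `survival_rates` and `best_treatment` are all hypothetical, and the rows are made-up illustration data rather than results.

```python
from collections import defaultdict

# Made-up illustration rows: (mutation, treatment, survived)
records = [
    ("mutation_x", "treatment_a", True),
    ("mutation_x", "treatment_a", True),
    ("mutation_x", "treatment_b", False),
    ("mutation_x", "treatment_b", True),
    ("mutation_y", "treatment_a", False),
]


def survival_rates(rows):
    """Return {(mutation, treatment): fraction of individuals who survived}."""
    counts = defaultdict(lambda: [0, 0])  # [survivors, total]
    for mutation, treatment, survived in rows:
        counts[(mutation, treatment)][0] += int(survived)
        counts[(mutation, treatment)][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


def best_treatment(rows, mutation):
    """Return the treatment with the highest survival rate for a mutation."""
    rates = {t: r for (m, t), r in survival_rates(rows).items() if m == mutation}
    return max(rates, key=rates.get)
```

  • a real analysis would also need to account for cohort size, follow-up time, and confounding factors, none of which this sketch attempts.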
  • Figure 2 illustrates an example framework 200 corresponding to an arrangement of data tables in an integrated data repository, according to one or more implementations.
  • the framework 200 includes a data repository schema 202 that includes a first data table 204, a second data table 206, a third data table 208, a fourth data table 210, a fifth data table 212, a sixth data table 214, and a seventh data table 216.
  • the data repository schema 202 may include more data tables or fewer data tables.
  • the data repository schema 202 may also include links between the data tables 204, 206, 208, 210, 212, 214, 216.
  • the links between the data tables 204, 206, 208, 210, 212, 214, 216 may indicate that retrieving information from one of the data tables 204, 206, 208, 210, 212, 214, 216 causes additional information stored by one or more additional data tables 204, 206, 208, 210, 212, 214, 216 to be retrieved. Additionally, not all of the data tables 204, 206, 208, 210, 212, 214, 216 may be linked to each of the other data tables 204, 206, 208, 210, 212, 214, 216.
  • the first data table 204 is logically coupled to the second data table 206 by a first link 218 and the third data table 208 is logically coupled to the second data table 206 by a second link 220.
  • the second data table 206 is also logically coupled to the fourth data table 210 by a third link 222, the second data table 206 is logically coupled to the fifth data table 212 by a fourth link 224, and the second data table 206 is logically coupled to the sixth data table 214 by a fifth link 226.
  • the fifth data table 212 is logically coupled to the sixth data table 214 by a sixth link 228 and the sixth data table 214 is logically coupled to the seventh data table 216 by a seventh link 230. Further, the seventh data table 216 is logically coupled to the fourth data table 210 by an eighth link 232.
  • additional links between data tables may be added to or removed from the data repository schema 202.
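  • the links 218 through 232 described above form a small graph over the data tables. The sketch below records them using the reference numerals from the description as keys; the `LINKS` mapping and the `linked_tables` helper are hypothetical names introduced only for illustration.

```python
# Links from the description, keyed by link reference numeral.
# Each value is the pair of data tables the link logically couples.
LINKS = {
    218: (204, 206),
    220: (208, 206),
    222: (206, 210),
    224: (206, 212),
    226: (206, 214),
    228: (212, 214),
    230: (214, 216),
    232: (216, 210),
}


def linked_tables(table):
    """Return the set of tables directly linked to the given table."""
    neighbors = set()
    for a, b in LINKS.values():
        if a == table:
            neighbors.add(b)
        elif b == table:
            neighbors.add(a)
    return neighbors
```

  • in this representation the second data table 206 is the hub of the schema, while the first data table 204 is reachable only through it.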
  • the integrated data repository 104 may store data tables according to the data repository schema 202 for at least a portion of the individuals for which the data integration system 114 obtained information from a combination of at least two of the health insurance claims data repository 106, the molecular data repository 108, and the one or more additional data repositories 110.
  • the integrated data repository 104 may store respective instances of the data tables 204, 206, 208, 210, 212, 214, 216 according to the data repository schema 202 for thousands, tens of thousands, up to hundreds of thousands or more individuals.
  • the first data table 204 may store data corresponding to genomics and genomics testing for individuals.
  • the first data table 204 may include columns that include information corresponding to a panel used to generate genomics data, mutations of genomic regions, types of mutations, copy numbers of genomic regions, coverage data indicating numbers of nucleic acid molecules identified in a sample having one or more mutations, testing dates, and patient information.
  • the first data table 204 may also include one or more columns that include health insurance data codes that may correspond to one or more diagnosis codes.
  • the information in first data table 204 may include at least one identifier for an individual that is associated with an instance of the first data table 204.
  • the second data table 206 may store data related to one or more patient visits by individuals to one or more healthcare providers.
  • the third data table 208 may store information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table 206.
  • an individual may visit a healthcare provider and multiple services may be performed with respect to the individual at the visit.
  • a second data table 206 may include columns indicating information for each of the multiple services performed during the patient visit.
  • Multiple third data tables 208 may be generated with respect to the patient visit that include columns indicating information on a more granular level for a respective service provided during the patient visit than the information stored by the second data table 206 related to the patient visit.
  • the second data table 206 may include multiple columns indicating a health insurance code for different services provided to an individual during a patient visit and a third data table 208 related to one of the services may include multiple columns for additional health insurance codes that correspond to additional information related to the respective services.
  • the second data table 206 and the third data table(s) 208 for a patient visit may indicate one or more dates of service corresponding to the patient visit.
  • the fourth data table 210 may include columns that indicate information about individuals for which information is stored by the integrated data repository 104.
  • the fourth data table 210 may include columns that indicate information related to at least one of a location of an individual, a gender of an individual, a date of birth of an individual, a date of death of an individual (if applicable), or one or more keys associated with the individual.
  • the fourth data table 210 may include one or more columns related to whether erroneous data has been identified for an individual.
  • a single fourth data table 210 may be generated for respective individuals.
  • the data repository schema 202 may include multiple instances of the fourth data table 210, such as thousands, tens of thousands, up to hundreds of thousands or more.
  • the fifth data table 212 may include columns that indicate information related to a health insurance company or governmental entity that made payment for one or more services provided to respective individuals.
  • the fifth data table 212 may include one or more payer identifiers.
  • the sixth data table 214 may include columns that include information corresponding to health insurance coverage information for respective individuals.
  • the sixth data table 214 may include columns indicating the presence of medical coverage for an individual, the presence of pharmacy coverage for an individual, and a type of health insurance plan related to the individual, such as health maintenance organization (HMO), preferred provider organization (PPO), and the like.
  • the seventh data table 216 may include columns that indicate information related to pharmaceutical treatments obtained by a respective individual.
  • the seventh data table 216 may include one or more columns indicating health insurance codes corresponding to pharmaceutical treatments that are available via a pharmacy.
  • the health insurance codes may correspond to individual pharmaceutical treatments. Additionally, the health insurance codes may indicate a diagnosis of a biological condition with respect to an individual.
  • the seventh data table 216 may also include additional information, such as at least one of dosage amounts, number of days’ supply, quantity dispensed, number of refills authorized, dates of service, or information related to the individual receiving the pharmaceutical treatment.
  • the data repository schema 202 may provide results of analysis of the information stored by the data tables 204, 206, 208, 210, 212, 214, 216 in a more efficient manner than typical data repository schemas.
  • the logical connections between the data tables 204, 206, 208, 210, 212, 214, 216 are arranged to efficiently retrieve data that is related across the different data tables 204, 206, 208, 210, 212, 214, 216.
  • Figure 3 illustrates an architecture 300 to generate one or more datasets from information retrieved from a data repository that integrates health related data from a number of sources, according to one or more implementations.
  • the architecture 300 may include the data integration and analysis system 102 and the integrated data repository 104. Additionally, the data integration and analysis system 102 may include at least the data pipeline system 138 and the data analysis system 140.
  • the data pipeline system 138 may include a number of sets of data processing instructions that are executable to generate respective datasets that may be analyzed by the data analysis system 140 in response to an integrated data repository request 142 to generate data analysis results 146.
  • the data pipeline system 138 may include first data processing instructions 302, second data processing instructions 304, up to Nth data processing instructions 306.
  • the data processing instructions 302, 304, 306 may be executable by one or more processing units to perform a number of operations to generate respective datasets using information obtained from the integrated data repository 104.
  • the data processing instructions 302, 304, 306 may include at least one of software code, scripts, API calls, macros, and so forth.
  • the first data processing instructions 302 may be executable to generate a first dataset 308.
  • the second data processing instructions 304 may be executable to generate a second dataset 310.
  • the Nth data processing instructions 306 may be executable to generate an Nth dataset 312.
  • the data pipeline system 138 may cause the data processing instructions 302, 304, 306 to be executed to generate the datasets 308, 310, 312.
  • the datasets 308, 310, 312 may be stored by the integrated data repository 104 or by an additional data repository that is accessible to the data integration and analysis system 102.
  • At least a portion of the data processing instructions 302, 304, 306 may analyze health insurance codes to generate at least a portion of the datasets 308, 310, 312.
  • at least a portion of the data processing instructions 302, 304, 306 may analyze genomics data to generate at least a portion of the datasets 308, 310, 312.
  • the first data processing instructions 302 may be executable to retrieve data from one or more first data tables stored by the integrated data repository 104. The first data processing instructions 302 may also be executable to retrieve data from one or more specified columns of the one or more first data tables. In various examples, the first data processing instructions 302 may be executable to identify individuals that have a health insurance code stored in one or more column and row combinations that correspond to one or more diagnosis codes. The first data processing instructions 302 may then be executable to analyze the one or more diagnosis codes to determine a biological condition for which the individuals have been diagnosed.
  • the first data processing instructions 302 may be executable to analyze the one or more diagnosis codes with respect to a library of diagnosis codes that indicates one or more biological conditions that correspond to respective diagnosis codes.
  • the library of diagnosis codes may include hundreds up to thousands of diagnosis codes.
  • the first data processing instructions 302 may also be executable to determine individuals diagnosed with a biological condition by analyzing timing information of the individuals, such as dates of treatment, dates of diagnosis, dates of death, one or more combinations thereof, and the like.
  • the second data processing instructions 304 may be executable to retrieve data from one or more second data tables stored by the integrated data repository 104.
  • the second data processing instructions 304 may also be executable to retrieve data from one or more specified columns of the one or more second data tables.
  • the second data processing instructions 304 may be executable to identify individuals that have a health insurance code stored in one or more column and row combinations that correspond to one or more treatment codes.
  • the one or more treatment codes may correspond to treatments obtained from a pharmacy.
  • the one or more treatment codes may correspond to treatments received via a medical procedure, such as by injection or intravenously.
  • the second data processing instructions 304 may be executable to determine one or more treatments that correspond to the respective health insurance codes included in the one or more second data tables by analyzing the health insurance code in relation to a predetermined set of information.
  • the predetermined set of information may include a data library that indicates one or more treatments that correspond to one out of hundreds up to thousands of health insurance codes.
  • the second data processing instructions 304 may generate the second dataset 310 to indicate respective treatments received by a group of individuals.
  • the group of individuals may correspond to the individuals included in the first dataset 308.
  • the second dataset 310 may be arranged in rows and columns with one or more rows corresponding to a single individual and one or more columns indicating the treatments received by the respective individual.
  • the Nth processing instructions 306 may be executable to generate the Nth dataset 312 by combining information from a number of previously generated datasets, such as the first dataset 308 and the second dataset 310.
  • the Nth processing instructions 306 may be executable to generate the Nth dataset 312 by retrieving additional information from one or more additional columns of the integrated data repository 104 and incorporating the additional information from the integrated data repository 104 with information obtained from the first dataset 308 and the second dataset 310.
  • the Nth processing instructions 306 may be executable to identify individuals included in the first dataset 308 that are diagnosed with a biological condition and analyze specified columns of one or more additional data tables of the integrated data repository 104 to determine dates of the treatments indicated in the second dataset 310 that correspond to the individuals included in the first dataset 308. In one or more further examples, the Nth processing instructions 306 may be executable to analyze columns of one or more additional data tables of the integrated data repository 104 to determine dosages of treatments indicated in the second dataset 310 received by the individuals included in the first dataset 308. In this way, the Nth processing instructions 306 may be executable to generate an episodes of care dataset based on information included in a cohort dataset and a treatments dataset.
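  • the chain from a cohort dataset to a treatments dataset to an episodes of care dataset can be sketched as three small functions. Everything named here is hypothetical: the code libraries, the placeholder codes such as `DX1` and `TX1`, the claim rows, and the function names are illustrative assumptions rather than the data processing instructions 302, 304, 306 themselves.

```python
# Hypothetical libraries mapping health insurance codes to their meaning.
DIAGNOSIS_LIBRARY = {"DX1": "condition_a", "DX2": "condition_b"}
TREATMENT_LIBRARY = {"TX1": "treatment_a", "TX2": "treatment_b"}

# Made-up claim rows: (individual_id, code, date_of_service)
CLAIMS = [
    ("p1", "DX1", "2024-01-05"),
    ("p1", "TX1", "2024-01-20"),
    ("p2", "DX2", "2024-02-01"),
    ("p2", "TX2", "2024-02-10"),
    ("p1", "TX2", "2024-03-15"),
]


def build_cohort(claims, condition):
    """First dataset: individuals with a diagnosis code for the condition."""
    return {pid for pid, code, _ in claims
            if DIAGNOSIS_LIBRARY.get(code) == condition}


def build_treatments(claims, cohort):
    """Second dataset: treatments received by each cohort member."""
    treatments = {pid: [] for pid in cohort}
    for pid, code, _ in claims:
        if pid in cohort and code in TREATMENT_LIBRARY:
            treatments[pid].append(TREATMENT_LIBRARY[code])
    return treatments


def build_episodes(claims, cohort, treatments):
    """Nth dataset: combine cohort and treatments with dates of service."""
    episodes = []
    for pid, code, date in claims:
        name = TREATMENT_LIBRARY.get(code)
        if pid in cohort and name in treatments.get(pid, []):
            episodes.append({"individual": pid, "treatment": name, "date": date})
    return episodes
```

  • because each stage consumes the output of the previous one, the later stages never re-scan diagnosis codes, which mirrors the reuse of previously generated datasets described above.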
  • the data analysis system 140 may determine one or more datasets that correspond to the features of the query related to the integrated data repository request 142. For example, the data analysis system 140 may determine that information included in the first dataset 308 and the second dataset 310 is applicable to responding to the integrated data repository request 142. In these scenarios, the data analysis system 140 may analyze at least a portion of the data included in the first dataset 308 and the second dataset 310 to generate the data analysis results 146. In one or more additional examples, the data analysis system 140 may determine different datasets to respond to different queries included in the integrated data repository request 142 in order to generate the data analysis results 146.
  • the use of specific sets of data processing instructions to generate respective data sets may reduce the number of inputs from users of the data integration and analysis system 102 as well as reduce the computational load, such as the amount of processing resources and memory, utilized to process integrated data repository requests 142.
  • the data utilized to respond to the integrated data repository request 142 is assembled from the data repository 104.
  • by using the data processing instructions 302, 304, 306 to generate the datasets 308, 310, 312, the data needed to respond to various integrated data repository requests 142 has already been assembled and may be accessed by the data analysis system 140 to respond to the integrated data repository request 142.
  • the computing resources used to respond to the integrated data repository request 142 by implementing the data pipeline system 138 to generate the datasets 308, 310, 312 are less than those used by typical systems that perform an information parsing and collecting process for each integrated data repository request 142.
  • users of the data integration and analysis system 102 may need to submit multiple integrated data repository requests 142 in order to analyze the information that the users are intending to have analyzed, either because the ad hoc collection of data to respond to an integrated data repository request 142 in typical systems is inaccurate or because the data analysis system 140 is called upon multiple times in typical systems to perform an analysis of information that may be performed using a single integrated data repository request 142 when the data pipeline system 138 is implemented.
  • FIG. 4 illustrates an architecture 400 to generate an integrated data repository that includes de-identified health insurance claims data and de-identified genomics data, according to one or more implementations.
  • the architecture 400 may include the data integration and analysis system 102, the health insurance claims data repository 106, and the molecular data repository 108.
  • the data integration and analysis system 102 may obtain patient information 402 from the molecular data repository 108.
  • the patient information 402 may include genomics data 404 for individuals having data stored by the molecular data repository 108.
  • the genomics data 404 may indicate results of one or more nucleic acid sequencing operations that analyze sequences of nucleic acid molecules included in a sample obtained from the individuals with respect to one or more target genomic regions.
  • the sample may be obtained from tissue of one or more individuals. In one or more additional examples, the sample may be obtained from fluid of one or more individuals, such as blood or plasma.
  • the one or more target genomic regions may correspond to genomic regions that correspond to the presence of one or more biological conditions.
  • the target regions may correspond to genomic regions of a reference genome having mutations that are present in individuals in which a biological condition is present.
  • the target regions may correspond to genomic regions of a reference human genome in which one or more mutations are present in individuals in which one or more forms of cancer are present.
  • the patient information 402 may also include information indicating personal information about individuals with data stored by the molecular data repository 108 and information corresponding to the testing and analysis performed on samples provided by individuals.
  • the data integration and analysis system 102 may perform a de-identification process 406 that anonymizes personal information obtained from the molecular data repository 108.
  • the data integration and analysis system 102 may implement one or more computational techniques as part of the de-identification process to anonymize data related to individuals stored by the molecular data repository 108 such that the de-identified data protects the privacy of the individuals and is in compliance with one or more privacy regulation frameworks.
  • the de-identification process 406 may include, at 408, accessing tokens.
  • the tokens may comprise an alphanumeric string of characters.
  • the tokens may be generated by the data integration and analysis system 102.
  • the tokens may be generated by a third-party and obtained by the data integration and analysis system 102.
  • the tokens may be generated using one or more hash functions in relation to a subset 410 of the patient information 402.
  • the tokens may be generated using a combination of at least a portion of a first name of the respective individuals, at least a portion of the last name of the respective individuals, at least a portion of a date of birth of the respective individuals, a gender of the individuals, and at least a portion of a location identifier of the respective individuals.
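  • a token of this kind can be sketched with a standard hash function. The description does not name a hash function, a normalization scheme, or how much of each field is used, so SHA-256, the lowercasing, the three-character prefixes, and the function name `make_token` are all assumptions made for illustration.

```python
import hashlib


def make_token(first_name, last_name, date_of_birth, gender, location_id):
    """Hash a normalized subset of patient fields into an alphanumeric token."""
    parts = [
        first_name.strip().lower()[:3],  # assumed: first three letters only
        last_name.strip().lower(),       # assumed: full last name
        date_of_birth,                   # assumed format, e.g. "1970-01-31"
        gender.strip().lower(),
        location_id.strip()[:3],         # assumed: first three characters only
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

  • normalizing the fields before hashing matters because two repositories can only produce matching tokens for the same individual if they hash byte-identical input.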
  • the de-identification process 406 may also include, at 412, generating identifiers for individuals that have data stored by the molecular data repository 108.
  • the identifiers may be generated by the data integration and analysis system 102 using one or more hash functions that are different from the one or more hash functions used to generate the tokens.
  • the data integration and analysis system 102 may generate an intermediate version of respective identifiers using one or more hash functions and then apply one or more salting techniques to the intermediate versions of the identifiers to generate final versions of the identifiers.
  • a salt function includes a function configured to add at least one random bit to each intermediate identifier to generate a respective final identifier.
  • the data integration and analysis system 102 may generate the identifiers at 412 using at least a portion of the information for respective individuals stored by the molecular data repository 108.
  • the identifiers may be generated based on a patient identifier included in the patient information 402.
  • the identifiers generated by the data integration and analysis system 102 may be unique for respective individuals having data stored by the molecular data repository 108.
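  • the two-stage identifier generation (hash, then salt) can be sketched as below. SHA-512, the 16-byte salt, the way the salt is combined with the intermediate value, and the function name `make_identifier` are assumptions; the description only states that the hash functions differ from those used for the tokens and that salting adds at least one random bit.

```python
import hashlib
import secrets


def make_identifier(patient_id, salt=None):
    """Return (final_identifier, salt) for a source patient identifier."""
    # Intermediate version: a hash of the patient identifier.
    intermediate = hashlib.sha512(patient_id.encode("utf-8")).hexdigest()
    # Salting: mix random bits into the intermediate value. The salt is
    # returned so the caller can keep it and reproduce the identifier later.
    if salt is None:
        salt = secrets.token_hex(16)
    final = hashlib.sha512((salt + intermediate).encode("utf-8")).hexdigest()
    return final, salt
```

  • whether one salt is shared across all individuals or drawn per individual is a design choice the description leaves open; this sketch supports either by accepting the salt as a parameter.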
  • the data integration and analysis system 102 may generate modified patient information 416 based on the identifiers.
  • the modified patient information 416 may include genomics data 404 related to individuals associated with the molecular data repository 108 and the identifiers of the respective individuals.
  • the modified patient information 416 may have a data structure 418.
  • the data structure 418 may include a column that includes respective identifiers of individuals associated with the molecular data repository 108 and a number of columns that include genomics data 404 related to the individuals, such as identifiers of one or more genes, alterations to the one or more genes, type of alteration to the genes, and so forth.
  • the data integration and analysis system 102 may generate a token file 420.
  • the token file 420 may include first tokens 422 accessed at operation 408 for respective individuals having data stored by the molecular data repository 108.
  • the token file 420 may have a data structure 424 that includes a number of columns that include information for respective individuals.
  • the data structure 424 may include a column indicating respective identifiers generated by the data integration and analysis system 102 and columns indicating one or more first tokens 422 associated with the respective identifiers.
  • the data integration and analysis system 102 may send the token file 420 to a health insurance claims data management system 426 that is coupled to the health insurance claims data repository 106.
  • the health insurance claims data management system 426 may analyze the first tokens 422 with respect to corresponding second tokens 428.
  • the second tokens 428 may be accessed by or generated by the health insurance claims data management system 426.
  • the second tokens 428 may be generated using a same or similar subset of information for individuals having data stored in the health insurance claims data repository 106 as the subset 410 of the patient information 402.
  • the second tokens 428 may be generated using a combination of at least a portion of a first name of the respective individuals, at least a portion of the last name of the respective individuals, at least a portion of a date of birth of the respective individuals, a gender of the individuals, and at least a portion of a location identifier of the respective individuals.
  • the health insurance claims data management system 426 may retrieve health insurance claims data from the health insurance claims data repository 106 for individuals associated with respective second tokens 428 that match corresponding first tokens 422.
  • a first token 422 may match a second token 428 when the data of the first token 422 has at least a threshold amount of similarity with respect to the data of the second token 428.
  • a first token 422 may match a second token 428 when the data of the first token 422 is the same as the data of the second token 428.
  • the health insurance claims data management system 426 may generate modified health insurance claims data 430.
  • the health insurance claims data management system 426 may send the modified health insurance claims data 430 to the data integration and analysis system 102.
  • the modified health insurance claims data 430 may be formatted according to a data structure 432.
  • the data structure 432 may include a column that includes a subset of the second tokens 428 that correspond to the first tokens 422 and a number of columns that include the health insurance claims data.
  • the data integration and analysis system 102 may integrate genomics data and health insurance claims data of individuals that are common to both the molecular data repository 108 and the health insurance claims data repository 106.
  • the data integration and analysis system 102 may determine individuals that are common to both the molecular data repository 108 and the health insurance claims data repository 106 by determining genomics data and health insurance claims data corresponding to common tokens.
  • the data integration and analysis system 102 may determine that a first token 422 related to a portion of the genomics data 404 corresponds to a second token 428 related to a portion of the health insurance claims data by determining a measure of similarity between the first token 422 and the second token 428.
  • the data integration and analysis system 102 may store the corresponding portion of the genomics data 404 and the corresponding portion of the health insurance claims data in relation to the identifier of the individual in an integrated data repository, such as the integrated data repository 104 of Figure 1, Figure 2, and Figure 3.
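  • the token-based integration step can be sketched as a join on matching tokens. The dictionary layouts and the function name `integrate` are hypothetical, and the sketch uses only the exact-match rule; the similarity-threshold rule also described above is not shown.

```python
def integrate(token_file, genomics_by_id, claims_by_token):
    """Join genomics and claims records that share a token.

    token_file:      {identifier: first_token}
    genomics_by_id:  {identifier: genomics record}
    claims_by_token: {second_token: claims record}
    """
    integrated = {}
    for identifier, token in token_file.items():
        claims = claims_by_token.get(token)  # exact-match rule only
        if claims is not None and identifier in genomics_by_id:
            integrated[identifier] = {
                "genomics": genomics_by_id[identifier],
                "claims": claims,
            }
    return integrated
```

  • individuals present in only one repository drop out of the result, so the integrated output covers only individuals common to both sources and is keyed by the de-identified identifier rather than by any personal information.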
  • the implementation of the architecture 400 may implement a cryptographic protocol that enables de-identified information from disparate data repositories to be integrated into a single data repository. In this way, the security of the data stored by the integrated data repository 104 is increased.
  • the cryptographic protocol implemented by the architecture 400 may enable more efficient retrieval and accurate analysis of information stored by the integrated data repository 104 than in situations where the cryptographic protocol of the architecture 400 is not utilized. For example, by generating a token file 420 that includes first tokens 422 using a cryptographic technique based on a specified set of information stored by the molecular data repository 108 and utilizing second tokens 428 generated using a same or similar cryptographic technique with respect to the similar or same set of information stored by the health insurance claims data repository 106, the data integration and analysis system 102 may match information stored by disparate data repositories that correspond to a same individual.
  • absent the cryptographic protocol of the architecture 400, the probability of incorrectly attributing information from one data repository to one or more individuals increases, which decreases the accuracy of results provided by the data integration and analysis system 102 in response to integrated data repository requests 142 sent to the data integration and analysis system 102.
  • FIG. 5 illustrates a framework 500 to generate a dataset, by a data pipeline system 138, based on data stored by an integrated data repository 104, according to one or more implementations.
  • the integrated data repository 104 may store health insurance claims data and genomics data for a group of individuals 502.
  • the integrated data repository 104 may store information obtained from health insurance claims records 504 of the group of individuals 502.
  • the integrated data repository 104 may store information obtained from multiple health insurance claim records 504.
  • the information stored by the integrated data repository 104 may include and/or be derived from thousands, tens of thousands, hundreds of thousands, up to millions of health insurance claims records 504 for a number of individuals.
  • each health insurance claim record may include multiple columns.
  • the integrated data repository 104 may be generated through the analysis of millions of columns of health insurance claims data.
  • health insurance claims data may be organized according to a structured data format.
  • health insurance claims data is typically arranged to be viewed by health insurance providers, patients, and healthcare providers in order to show financial information and insurance code information related to services provided to individuals by healthcare providers.
  • health insurance claims data is not easily analyzed to gain insights that may be available in relation to characteristics of individuals in which a biological condition is present and that may aid in the treatment of the individuals with respect to the biological condition.
  • the integrated data repository 104 may be generated and organized by analyzing and modifying raw health insurance claims data in a manner that enables the data stored by the integrated data repository 104 to be further analyzed to determine trends, characteristics, features, and/or insights with respect to individuals in which one or more biological conditions may be present.
  • health insurance codes may be stored in the integrated data repository 104 in such a way that at least one of medical procedures, biological conditions, treatments, dosages, manufacturers of medications, distributors of medications, or diagnoses may be determined for a given individual based on health insurance claims data for the individual.
  • the data integration and analysis system 102 may generate and implement one or more tables that indicate correlations between health insurance claims data and various treatments, symptoms, or biological conditions that correspond to the health insurance claims data.
  • the integrated data repository 104 may be generated using genomics data records 506 of the group of individuals 502. In various examples, the large amounts of health insurance claims data may be matched with genomics data for the group of individuals 502 to generate the integrated data repository 104.
  • the data integration and analysis system 102 may determine correlations between one or more biomarkers that are present in the genomics data records 506 and other characteristics of individuals that are indicated by the health insurance claims data records 504, correlations that existing systems are typically unable to determine. For example, the data integration and analysis system 102 may determine one or more genomic characteristics of individuals that correspond to treatments received by individuals, timing of treatments, dosages of treatments, diagnoses of individuals, smoking status, presence of one or more biological conditions, presence of one or more symptoms of a biological condition, one or more combinations thereof, and the like.
  • the processes and techniques implemented to integrate the health insurance claims records 504 and the genomics data records 506 in order to generate the integrated data repository 104 may be complex and may implement efficiency-enhancing techniques, systems, and processes in order to minimize the amount of computing resources used to generate the integrated data repository 104.
  • the data pipeline system 138 may access information stored by the integrated data repository 104 to generate datasets that include a number of additional data records 508 that include information related to at least a portion of the group of individuals 502.
  • the additional data record 508 includes information indicating whether individuals are included in a cohort of individuals in which lung cancer is present.
  • the data pipeline system 138 may execute a plurality of different sets of data processing instructions to determine a cohort of the group of individuals 502 in which lung cancer is present.
  • the additional data record 508 may indicate information used to determine a status of an individual 502 with respect to lung cancer, such as one or more insurance transaction identifiers, one or more international classification of diseases (ICD) codes, and one or more health insurance transaction dates.
  • the additional data record 508 may include a column indicating a confidence level of the status of the individual 502 with respect to the presence of lung cancer.
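As a non-limiting illustration, the cohort determination described above might be sketched as follows. The row layout, the token names, the use of ICD-10 codes beginning with C34 (malignant neoplasm of bronchus and lung), and the rule that maps the number of distinct transaction dates to a confidence level are all assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import defaultdict

# Hypothetical claim rows: (individual token, transaction id, ICD-10 code, transaction date).
CLAIMS = [
    ("tok_a", "tx1", "C34.90", "2021-03-01"),
    ("tok_a", "tx2", "C34.11", "2021-04-15"),
    ("tok_b", "tx3", "J45.909", "2021-02-10"),
    ("tok_c", "tx4", "C34.90", "2021-06-30"),
]


def lung_cancer_cohort(claims, prefix="C34"):
    """Return one record per individual having at least one claim whose code starts with prefix."""
    dates = defaultdict(set)
    codes = defaultdict(set)
    transactions = defaultdict(list)
    for token, tx_id, icd, date in claims:
        if icd.startswith(prefix):
            dates[token].add(date)
            codes[token].add(icd)
            transactions[token].append(tx_id)
    records = {}
    for token in dates:
        # Assumed rule: matching codes on two or more distinct dates give "high" confidence.
        confidence = "high" if len(dates[token]) >= 2 else "low"
        records[token] = {
            "transaction_ids": transactions[token],
            "icd_codes": sorted(codes[token]),
            "dates": sorted(dates[token]),
            "confidence": confidence,
        }
    return records


cohort = lung_cancer_cohort(CLAIMS)
```

Each value in `cohort` plays the role of one row of an additional data record, with the `confidence` entry standing in for the confidence-level column.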
  • Figure 6 is a schematic diagram of a computing architecture 600 to incorporate medical records data into an integrated data repository 104.
  • the operations of the computing architecture 600 may be performed by the data integration and analysis system 102 of Figures 1, 3, and 4.
  • at least a portion of the operations of the computing architecture 600 may be performed by one or more additional computing systems that are at least one of controlled, maintained, or implemented by a service provider that also at least one of controls, maintains, or implements the data integration and analysis system 102.
  • at least a portion of the operations of the computing architecture 600 may be performed by a number of servers in a distributed computing environment.
  • the computing architecture 600 may include a medical records data repository 602.
  • the medical records data repository 602 may store medical records data from a number of individuals.
  • the medical records data may include imaging information, laboratory test results, diagnostic test information, clinical observations, dental health information, notes of healthcare practitioners, medical history forms, diagnostic request forms, medical procedure order forms, medical information charts, one or more combinations thereof, and so forth.
  • the medical records data repository 602 may store information obtained from one or more healthcare practitioners that is related to the individual.
  • the computing architecture 600 may perform operation 604 that includes obtaining data packages from the medical records data repository 602.
  • the data packages may be obtained in response to one or more requests sent to the medical records data repository 602 for medical records that correspond to one or more individuals.
  • the data packages may be obtained by the computing architecture 600 using one or more application programming interface (API) calls.
  • a first data package 606, a second data package 608, up to an Nth data package 610 may be obtained using the computing architecture 600.
  • the individual data packages 606, 608, 610 may correspond to medical records of a respective individual.
  • the first data package 606 may include medical records of a first individual
  • the second data package 608 may include medical records of a second individual
  • the Nth data package 610 may include medical records of a third individual.
  • Individual data packages 606, 608, 610 may include a number of components. In one or more examples, individual data packages 606, 608, 610 may include individual components that correspond to medical records from different healthcare providers. In one or more additional examples, the individual data packages 606, 608, 610 may include individual components that correspond to different parts of medical records that correspond to one or more healthcare providers. In the illustrative example of Figure 6, the second data package 608 may include a first component 612, a second component 614, up to an Nth component 616.
  • the first component 612 may include a first portion of medical records of an individual
  • the second component 614 may include a second portion of medical records of an individual
  • the Nth component 616 may include a third portion of medical records of an individual.
  • the first component 612 may correspond to medical records of a first healthcare provider for the individual
  • the second component 614 may correspond to medical records of a second healthcare provider for the individual
  • the Nth component 616 may correspond to medical records of a third healthcare provider for the individual.
  • the first component 612 may include a first section of medical records of the individual, such as one or more forms related to a diagnostic test or procedure
  • the second component 614 may include a second section of medical records of the individual, such as a pathology report of the individual.
  • the computing architecture 600 may preprocess individual data packages to identify a corpus of information 620 to be analyzed.
  • the preprocessing of data packages obtained from the medical records data repository 602 may include transforming the data included in the data packages.
  • preprocessing the data packages may include transforming at least a portion of the data obtained from the medical records data repository 602 to machine encoded information.
  • preprocessing the data packages may include performing one or more optical character recognition (OCR) operations with respect to at least a portion of the data packages obtained from the medical records data repository 602.
  • the data packages may then be subjected to a number of operations, such as one or more parsing operations to identify one or more characters or strings of characters or one or more editing operations, that are unable to be performed with respect to at least a portion of the data packages as obtained from the medical records data repository 602.
  • the preprocessing of individual data packages may include determining information included in individual data packages that is to be excluded from further analysis by the computing architecture 600.
  • one or more components of individual data packages may be excluded from a corpus of information 620 to be analyzed.
  • the computing architecture 600 may determine that the first component 612 is to be excluded from further analysis by the computing architecture 600.
  • the computing architecture 600 may analyze the components 612, 614, and/or 616 with respect to one or more keywords to identify at least one of the components 612, 614, and/or 616 to exclude from further analysis by the computing architecture 600.
  • the computing architecture 600 may parse the components 612, 614, and/or 616 to identify one or more keywords and in response to identifying the one or more keywords in a component 612, 614, and/or 616, the computing architecture 600 may determine to exclude the respective component 612, 614, and/or 616 from further analysis by the computing architecture 600. For example, the computing architecture 600 may determine that the first component 612 of the second data package 608 is a test requisition form for one or more diagnostic procedures or tests. In these scenarios, the computing architecture 600 may determine that the first component 612 is to be excluded from further analysis by the computing architecture 600.
  • the computing architecture 600 may determine that at least one of the second component 614 or the Nth component 616 corresponds to one or more pathology reports for an individual based on one or more keywords included in at least one of the second component 614 or the Nth component 616. In these instances, the computing architecture 600 may determine that at least a portion of the second component 614 and/or at least a portion of the Nth component 616 is to be included in the corpus of information 620 to be further analyzed by the computing architecture 600.
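As a non-limiting illustration, the keyword-based inclusion and exclusion of components might be sketched as follows. The keyword lists and the sample component text are hypothetical, and the rule that an exclusion keyword takes precedence over an inclusion keyword is an assumption of this sketch.

```python
# Hypothetical keyword lists; the disclosure does not enumerate specific keywords.
EXCLUDE_KEYWORDS = ("test requisition", "order form")
INCLUDE_KEYWORDS = ("pathology report", "final diagnosis")


def triage_components(components):
    """Return the components of one data package that are kept for the corpus."""
    corpus = []
    for text in components:
        lowered = text.lower()
        # Assumed precedence: any exclusion keyword removes the component outright.
        if any(keyword in lowered for keyword in EXCLUDE_KEYWORDS):
            continue
        if any(keyword in lowered for keyword in INCLUDE_KEYWORDS):
            corpus.append(text)
    return corpus


package = [
    "TEST REQUISITION FORM: order for tissue biopsy",
    "Pathology Report: adenocarcinoma, EGFR positive",
    "Billing statement for office visit",
]
kept = triage_components(package)
```

In this sketch a component that matches neither list is also left out of the corpus; an implementation could equally route such components to a manual review queue.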
  • a subset of the components of individual data packages obtained from the medical records data repository 602 may be included in the corpus of information 620.
  • one or more additional operations may be performed to narrow the corpus of information 620.
  • one or more queries may be applied to a subset of information obtained from the medical records data repository 602.
  • the one or more queries may extract information from the one or more data packages that satisfy the one or more queries.
  • the one or more queries may be a group of queries that are applied to individual components of a data package.
  • the group of queries may determine information to be included in the corpus of information 620 and additional information that is to be excluded from the corpus of information 620.
  • one or more sections of at least one component of a data package may be excluded from the corpus of information 620.
  • the computing architecture 600 may then cause one or more queries to be implemented with respect to at least one of the second component 614 or the Nth component 616.
  • the one or more queries may determine that a section of the second component 614, such as a section that indicates family history for one or more biological conditions, is to be excluded from the corpus of information 620.
  • the one or more queries may be directed to identifying a number of keywords and/or combinations of keywords included in at least one of the second component 614 or the Nth component 616.
  • the computing architecture 600 may exclude from the corpus of information 620 one or more portions of the individual components of the data packages that include one or more keywords or combinations of keywords. In one or more additional examples, the computing architecture 600 may exclude from the corpus of information 620 a number of words, a number of characters, and/or a number of symbols following one or more keywords that are included in one or more portions of the individual components of the data packages.
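As a non-limiting illustration, excluding a number of words that follow a keyword might be sketched with a regular expression. The function name, the sample text, and the choice to count whitespace-delimited words are assumptions of this sketch.

```python
import re


def drop_after_keyword(text, keyword, n_words):
    """Remove keyword and up to n_words whitespace-delimited words that follow it."""
    pattern = re.compile(
        re.escape(keyword) + r"(?:\s+\S+){0," + str(n_words) + r"}",
        re.IGNORECASE,
    )
    stripped = pattern.sub("", text)
    # Collapse the gap left behind by the removed span.
    return re.sub(r"\s{2,}", " ", stripped).strip()


note = "Diagnosis: adenocarcinoma. Family history: mother had breast cancer. Stage II."
cleaned = drop_after_keyword(note, "Family history:", 4)
```

Counting characters or symbols instead of words would only change the quantified part of the pattern.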
  • the computing architecture 600 may analyze the corpus of information to determine characteristics of individuals.
  • the computing architecture 600 may analyze the corpus of information 620 to determine individuals that have one or more phenotypes.
  • the computing architecture 600 may analyze the corpus of information 620 to determine one or more biomarkers that are indicative of a biological condition.
  • the computing architecture 600 may analyze the corpus of information 620 to determine individuals having one or more genetic characteristics.
  • the one or more genetic characteristics may include one or more variants of a genomic region that correspond to a biological condition.
  • the one or more genetic characteristics may correspond to one or more variants of a genomic region that correspond to a type of cancer.
  • the one or more biomarkers may correspond to levels of an analyte being outside of a specified range.
  • the computing architecture 600 may analyze the corpus of information 620 to determine individuals having levels of one or more proteins and/or levels of one or more small molecules present that are indicative of a biological condition. In these scenarios, the computing architecture 600 may analyze results of laboratory tests to determine levels of analytes of individuals. In one or more additional examples, the computing architecture 600 may analyze the corpus of information 620 to determine individuals in which one or more symptoms are present that are indicative of a biological condition. In one or more further examples, the computing architecture 600 may analyze imaging information included in the corpus of information 620 to determine individuals in which one or more biomarkers are present.
  • the computing architecture 600 may implement one or more machine learning techniques to analyze the corpus of information 620.
  • the computing architecture 600 may implement one or more artificial neural networks, such as at least one of one or more convolutional neural networks or one or more residual neural networks to analyze the corpus of information 620.
  • the computing architecture 600 may also implement at least one of one or more random forests techniques, one or more hidden Markov models, or one or more support vector machines to analyze the corpus of information 620.
  • the computing architecture 600 may analyze the corpus of information 620 by performing one or more queries with respect to the corpus of information 620.
  • the one or more queries may correspond to one or more keywords and/or combinations of keywords.
  • the one or more keywords and/or combinations of keywords may correspond to at least one of characters or symbols that correspond to one or more biological conditions.
  • a keyword may correspond to characters related to a mutation of a genomic region, such as HER2.
  • one or more criteria may be associated with combinations of keywords.
  • a criterion that corresponds to a combination of keywords may include a number of words being present within a specified distance of one another in a portion of the corpus of information 620 for an individual, such as the words fatigue, blood pressure, and swelling occurring within 100 characters of one another.
  • the computing architecture 600 may parse the corpus of information 620 for the one or more keywords and/or combinations of keywords.
  • the computing architecture 600 may determine that a biological condition is present with respect to a given individual.
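As a non-limiting illustration, a criterion requiring several keywords to occur within a specified distance of one another might be sketched as follows. Interpreting the distance as the span from the start of the earliest keyword to the end of the latest keyword is an assumption of this sketch, as are the function name and the sample notes.

```python
import re
from itertools import product


def keywords_within(text, keywords, max_span=100):
    """Return True if one occurrence of every keyword fits in a window of max_span characters."""
    lowered = text.lower()
    positions = []
    for keyword in keywords:
        hits = [
            (match.start(), match.end())
            for match in re.finditer(re.escape(keyword.lower()), lowered)
        ]
        if not hits:
            return False  # a required keyword is missing entirely
        positions.append(hits)
    # Try every combination of one occurrence per keyword.
    for combo in product(*positions):
        start = min(s for s, _ in combo)
        end = max(e for _, e in combo)
        if end - start <= max_span:
            return True
    return False


KEYWORDS = ["fatigue", "blood pressure", "swelling"]
near = "Patient reports fatigue; blood pressure elevated, ankle swelling noted."
far = "fatigue" + " " * 200 + "blood pressure and swelling"
near_hit = keywords_within(near, KEYWORDS)
far_hit = keywords_within(far, KEYWORDS)
```

The exhaustive combination search is adequate for short notes; a sliding-window scan would scale better for long documents.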
  • the one or more queries may be image-based and the computing architecture 600 may analyze images included in the corpus of information 620 with respect to template images.
  • the template images may be generated based on analyzing a number of images in which a biological condition is present and aggregating the number of images into a template image.
  • the computing architecture 600 may analyze images included in the corpus of information 620 with respect to one or more template images to determine a measure of similarity between the images included in the corpus of information 620 and the template images. In situations where the measure of similarity for an individual is at least a threshold value, the computing architecture 600 may determine that a characteristic of a biological condition is present in the individual.
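As a non-limiting illustration, the template-based comparison might be sketched with images represented as flat lists of pixel intensities. Aggregating by per-pixel mean, using cosine similarity as the measure of similarity, and the threshold value of 0.9 are assumptions of this sketch; the disclosure does not specify them.

```python
import math


def build_template(images):
    """Aggregate equally sized images into a template by per-pixel mean."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_template(image, template, threshold=0.9):
    """Return True when the similarity is at least the threshold value."""
    return cosine_similarity(image, template) >= threshold


# Hypothetical 2x2 images flattened to four intensities each.
positive_images = [[0, 10, 10, 0], [0, 8, 12, 0]]
template = build_template(positive_images)
similar = matches_template([0, 9, 10, 0], template)
dissimilar = matches_template([10, 0, 0, 10], template)
```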
  • the computing architecture 600 may, at operation 624, generate data structures that store data for individuals having the one or more characteristics.
  • the computing architecture 600 may generate data tables that indicate individuals having an individual characteristic and/or individuals having a group of characteristics.
  • the computing architecture 600 may generate a first data table 626 and a second data table 628.
  • the first data table 626 may indicate individuals having one or more first characteristics and the second data table 628 may indicate individuals having one or more second characteristics.
  • the first data table 626 may indicate individuals having one or more first biomarkers for a biological condition and the second data table 628 may indicate individuals having one or more second biomarkers for the biological condition.
  • the one or more first biomarkers may correspond to one or more first genomic variants that are associated with the biological condition and the one or more second biomarkers may correspond to one or more second genomic variants that are associated with the biological condition.
  • the data tables 626, 628 may indicate whether or not the one or more characteristics associated with the individual data tables 626, 628 are present with respect to individual individuals.
  • the first data table 626 may include a first indication for individuals in which one or more first genomic variants are present and a second indication for individuals in which the one or more first genomic variants are not present.
  • the first data table 626 may indicate smoking status of individuals and the second data table 628 may indicate whether or not individual individuals have received one or more treatments for a biological condition.
  • the first data table 626 and the second data table 628 may have rows that correspond to individual individuals.
  • an individual identifier may be present in individual rows.
  • the individual identifier may include at least one of alphanumeric characters or symbols that correspond to an individual.
  • the individual identifier may be present in a data package that corresponds to an individual.
  • Columns of the first data table 626 and the second data table 628 may indicate a status of individual individuals with respect to one or more characteristics.
  • the columns of the data tables 626, 628 may include an identifier that includes at least one of alphanumeric characters or symbols that indicate the presence or absence of one or more characteristics for a given individual.
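As a non-limiting illustration, data tables of this kind might be sketched as lists of rows, one row per individual, with a status column. The column names, the "Y"/"N" indicators, and the identifiers are hypothetical.

```python
def build_status_table(individual_ids, individuals_with_characteristic, column):
    """Build one row per individual with a presence ("Y") or absence ("N") indicator."""
    return [
        {
            "individual_id": ident,
            column: "Y" if ident in individuals_with_characteristic else "N",
        }
        for ident in individual_ids
    ]


ids = ["id1", "id2", "id3"]
# Stand-in for the first data table: presence of a genomic variant.
table_626 = build_status_table(ids, {"id1", "id3"}, "variant_present")
# Stand-in for the second data table: whether a treatment was received.
table_628 = build_status_table(ids, {"id2"}, "treatment_received")
```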
  • the computing architecture 600 may generate more data tables or fewer data tables.
  • the computing architecture 600 may store the data structures in an additional data repository.
  • the computing architecture 600 may store at least the first data table 626 and/or the second data table 628 in an intermediate data repository 632.
  • the first data table 626 and the second data table 628 may be temporarily stored in the intermediate data repository 632.
  • the first data table 626 and the second data table 628 may be stored in the intermediate data repository 632 before being added to the integrated data repository 104.
  • the integrated data repository 104 may be periodically generated and/or updated. In these scenarios, data structures generated by the computing architecture 600 based on analyzing the corpus of information 620 may be stored in the intermediate data repository 632 until a time when the integrated data repository 104 is to be at least one of generated or updated.
  • the computing architecture 600 may perform one or more de-identification processes at operation 634.
  • the data structures stored by the intermediate data repository 632 may be de-identified in order to preserve the privacy of individuals.
  • the one or more de-identification processes may include applying one or more electronically implemented cryptographic techniques to information of individuals included in the data structures stored by the intermediate data repository 632.
  • the computing architecture 600 may generate tokens that correspond to individual individuals that have information stored in data structures of the intermediate data repository 632.
  • the tokens may be generated by applying one or more hash functions to information related to individual individuals.
  • the one or more de-identification processes may include applying a salt function to information corresponding to individual individuals to generate tokens for the individual individuals.
  • the one or more cryptographic techniques applied to de-identify the data structures stored by the intermediate data repository 632 may be the same or similar to those applied to information obtained from the health insurance claims data repository 106 of Figures 1 and 4.
  • the computing architecture 600 may store the de-identified data structures in conjunction with the integrated data repository 104.
  • the information stored in the intermediate data repository 632 for a given individual may be stored in conjunction with additional information about the given individual in the integrated data repository 104.
  • the integrated data repository 104 may store information for a given individual obtained from at least two of the molecular data repository 108, obtained from the health insurance claims data repository 106, and obtained from the intermediate data repository 632. In this way, information about a given individual obtained from a number of disparate data repositories may be stored in the integrated data repository 104. As a result, information about individuals that is obtained from the different data repositories may be analyzed together rather than analyzed separately as with many existing systems.
  • the information stored by the intermediate data repository 632 may be used to validate one or more determinations made by the data integration and analysis system 102.
  • the data integration and analysis system 102 may analyze information obtained from the health insurance claims data repository 106 and the molecular data repository 108 to determine characteristics of individuals. The data integration and analysis system 102 may then analyze information obtained from the intermediate data repository 632 to determine whether the predicted characteristics identified from the information obtained from the health insurance claims data repository 106 and from the molecular data repository 108 correspond to the characteristics for the same individuals with respect to information stored by the intermediate data repository 632.
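As a non-limiting illustration, the validation step might be sketched as an agreement rate between characteristics predicted from claims and molecular data and characteristics derived from medical records. The dictionary layout, the token names, and the use of a simple agreement fraction are assumptions of this sketch.

```python
def agreement_rate(predicted, observed):
    """Fraction of individuals present in both mappings whose values agree."""
    shared = predicted.keys() & observed.keys()
    if not shared:
        return None  # nothing to validate against
    matches = sum(predicted[token] == observed[token] for token in shared)
    return matches / len(shared)


# Hypothetical presence flags keyed by de-identified token.
predicted = {"tok1": True, "tok2": False, "tok3": True}
observed = {"tok1": True, "tok2": True, "tok4": False}
rate = agreement_rate(predicted, observed)
```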
  • the one or more cryptographic techniques applied to de-identify the data structures stored by the intermediate data repository 632 may utilize the same or similar information that was used to generate at least one of the first tokens 422 or the second tokens 428 of Figure 4.
  • the operation 634 may implement one or more cryptographic techniques using a combination of at least a portion of a first name of the respective individuals, at least a portion of the last name of the respective individuals, at least a portion of a date of birth of the respective individuals, a gender of the individuals, and at least a portion of a location identifier of the respective individuals to de-identify the data structures of the intermediate data repository 632.
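As a non-limiting illustration, token generation from the fields listed above might be sketched as follows. Keyed HMAC-SHA-256 is used here as a stand-in for the hash and salt functions, which the disclosure does not name; the normalization steps, the field portions used, and the sample values are likewise assumptions of this sketch.

```python
import hashlib
import hmac


def make_token(first_name, last_name, date_of_birth, gender, location, salt):
    """Derive a de-identified token from normalized portions of personal fields."""
    parts = [
        first_name.strip().lower()[:3],  # assumed: first three letters of the first name
        last_name.strip().lower(),
        date_of_birth,                   # assumed format: YYYY-MM-DD
        gender.strip().lower()[:1],
        location.strip()[:3],            # assumed: leading digits of a postal code
    ]
    message = "|".join(parts).encode("utf-8")
    # The secret salt acts as the HMAC key, so tokens cannot be recomputed without it.
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


token_a = make_token("Maria", "Lopez", "1970-05-17", "F", "94063", b"secret-a")
token_b = make_token(" maria ", "LOPEZ", "1970-05-17", "f", "94063", b"secret-a")
token_c = make_token("Maria", "Lopez", "1970-05-17", "F", "94063", b"secret-b")
```

Because both repositories would apply the same normalization and the same salt, records for the same individual yield the same token, which is what allows the information to be retrieved together.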
  • the information stored by the intermediate data repository 632 may be synchronized with information for the same individuals that have information stored in the integrated data repository 104.
  • Both the integrated data repository 104 and the intermediate data repository 632 may store information for thousands, tens of thousands, up to millions of individuals.
  • without a specified cryptographic protocol as described herein, the data structures of the integrated data repository 104 and the data structures of the intermediate data repository 632 that are associated with a same individual may not be stored in a manner such that the information stored by the integrated data repository 104 and the information stored by the intermediate data repository 632 may be retrieved together for a given individual, which may lead to inaccurate information being provided by the data integration and analysis system 102.
  • the absence of a specified cryptographic protocol as described herein may also lead to the use of more computing resources to determine the information stored in the integrated data repository 104 from other data sources and the information stored by the intermediate data repository 632 that correspond to a given individual.
  • Figures 7 and 8 illustrate example processes to generate an integrated data repository and generate datasets used in the analysis of information stored by the integrated data repository.
  • the example processes are illustrated as collections of blocks in logical flow graphs, which represent sequences of operations that may be implemented in hardware, software, or a combination thereof.
  • the blocks are referenced by numbers.
  • the blocks represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processing units (such as hardware microprocessors), perform the recited operations.
  • computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types.
  • the order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks may be combined in any order and/or in parallel to implement the process.
  • Figure 7 is a data flow diagram of an example process 700 to generate an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations.
  • the process 700 may include generating a data file that includes tokens generated using a first hash function.
  • Individual tokens may correspond to a respective individual of a group of individuals having data stored by a molecular data repository.
  • an individual having data stored by the molecular data repository may be associated with one or more tokens.
  • the tokens may be generated by applying one or more first hash functions to a subset of information corresponding to the group of individuals stored by the genomics data repository.
  • individual tokens may be generated by applying one or more first hash functions to one or more combinations of at least a portion of a first name of a respective individual of the group of individuals, at least a portion of a second name of a respective individual of the group of individuals, a location identifier of a respective individual of the group of individuals, a gender of a respective individual of the group of individuals, and a date of birth of a respective individual of the group of individuals.
  • the tokens may be generated by a data integration and analysis system that is coupled to the genomics data repository.
  • the tokens may be generated by a third-party system and accessed by a data integration and analysis system coupled to the molecular data repository.
  • the process 700 may also include, at operation 704, sending the data file to a health insurance claims data management system.
  • the health insurance claims data management system may match the tokens included in the data file with second tokens accessed by the health insurance data management system and generated based on information stored by a health insurance claims data repository.
  • the process 700 may include obtaining, from the health insurance claims data management system, in response to the data file, first data corresponding to the group of individuals, where the first data includes health insurance claims data.
  • affirmative consent is obtained from the members of the group of individuals for their data to be transferred from the health insurance claims data management system.
  • the data is transferred in an anonymized format, such that the data may not be traced back to an individual member.
  • the health insurance claims data management system may be coupled to a health insurance claims data repository that stores health insurance claims information for a number of individuals.
  • the health insurance claims data management system may analyze the tokens of the data file with respect to additional tokens generated by the health insurance claims data management system.
  • the additional tokens may be generated based on a same set of information used to generate the tokens included in the data file. However, an individual’s identity may not be determined based on a token.
  • the health insurance claims data management system may match tokens included in the data file with additional tokens generated based on information stored by the health insurance claims data repository to determine individuals having information stored by the health insurance claims data repository that also have information stored by the genomics data repository.
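As a non-limiting illustration, the token matching might be sketched as follows, with both systems deriving tokens from the same set of information. The salted SHA-256 construction, the shared salt, the field names, and the sample records are assumptions of this sketch.

```python
import hashlib


def token(record, salt=b"shared-secret"):
    """Hypothetical token: salted SHA-256 over normalized name and date of birth."""
    key = "|".join([record["first"].lower(), record["last"].lower(), record["dob"]])
    return hashlib.sha256(salt + key.encode("utf-8")).hexdigest()


# Individuals known to the molecular (genomics) data repository.
molecular_people = [
    {"first": "Maria", "last": "Lopez", "dob": "1970-05-17"},
    {"first": "Sam", "last": "Okafor", "dob": "1982-11-02"},
]
# Individuals known to the health insurance claims data repository.
claims_people = [
    {"first": "maria", "last": "lopez", "dob": "1970-05-17", "claims": ["C34.90"]},
    {"first": "Ana", "last": "Reyes", "dob": "1990-01-30", "claims": ["J45.909"]},
]

data_file = [token(person) for person in molecular_people]
claims_index = {token(person): person["claims"] for person in claims_people}

# Only tokens leave each system; matching is done on tokens alone.
matched = {t: claims_index[t] for t in data_file if t in claims_index}
```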
  • the technology disclosed herein complies with legal and best practice privacy standards, such as HIPAA and GDPR.
  • the process 700 may include generating a number of identifiers using a second hash function that is different from the first hash function.
  • individual identifiers may correspond to one or more tokens related to a respective individual of the group of individuals.
  • the identifiers may be unique with respect to a given individual of the group of individuals and are de-identified.
  • the identifiers may be generated using information stored by the genomics data repository for the group of individuals that is different from the information stored by the genomics data repository used to generate the tokens.
  • intermediate identifiers may be generated by applying the second hash function to information of the respective groups of individuals and final versions of the identifiers may be generated by applying one or more salting techniques to the intermediate identifiers.
  • Information stored by the genomics data repository for respective individuals may be stored in association with the identifiers such that at least a portion of the information for given individuals stored by the genomics data repository may be accessed using respective identifiers of the given individuals.
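As a non-limiting illustration, generating identifiers with a second hash function followed by salting might be sketched as follows. The use of SHA-512 for the intermediate identifier, SHA-256 over the salt and the intermediate value for the final identifier, and a sample accession string as the input information are assumptions of this sketch.

```python
import hashlib


def make_identifier(info, salt):
    """Second hash yields an intermediate identifier; salting yields the final one."""
    intermediate = hashlib.sha512(info.encode("utf-8")).digest()
    return hashlib.sha256(salt + intermediate).hexdigest()


SALT = b"identifier-salt"  # hypothetical secret held by the data integration system
id_one = make_identifier("accession-0001", SALT)
id_two = make_identifier("accession-0002", SALT)

# Molecular data stored in association with the identifiers.
molecular_by_identifier = {
    id_one: {"gene": "EGFR"},
    id_two: {"gene": "KRAS"},
}
```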
  • the process 700 may include, at operation 710, obtaining, using the number of identifiers, second data from the molecular data repository for the group of individuals, and, at operation 712, the process 700 may include determining respective portions of the first data that correspond to respective portions of the second data for the group of individuals. For example, for a given individual, first data corresponding to health insurance claims data for the given individual may be identified in addition to second data corresponding to molecular data of the given individual, such as genomics data. In this way, for a given individual, both health insurance claims data and molecular data may be identified.
  • the process 700 may include, at operation 714, generating an integrated data repository that stores the respective portions of the first data and the respective portions of the second data in relation to respective identifiers of the number of identifiers.
  • the integrated data repository may store health insurance claims data and genomics data for a given individual in association with an identifier that may be used to access the health insurance claims data and the genomics data for the given individual.
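As a non-limiting illustration, combining the first data and the second data under a common identifier might be sketched as follows. The mapping from tokens to identifiers, the dictionary layout, and the sample values are assumptions of this sketch.

```python
def integrate(token_to_identifier, claims_by_token, molecular_by_identifier):
    """Store claims data and molecular data together, keyed by identifier."""
    repository = {}
    for tok, identifier in token_to_identifier.items():
        # Keep only individuals for whom both kinds of data were found.
        if tok in claims_by_token and identifier in molecular_by_identifier:
            repository[identifier] = {
                "claims": claims_by_token[tok],
                "molecular": molecular_by_identifier[identifier],
            }
    return repository


token_to_identifier = {"tok1": "id_a", "tok2": "id_b"}
claims_by_token = {"tok1": ["C34.90"]}
molecular_by_identifier = {"id_a": {"gene": "EGFR"}, "id_b": {"gene": "KRAS"}}
integrated = integrate(token_to_identifier, claims_by_token, molecular_by_identifier)
```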
  • the information stored by the integrated data repository may be organized according to a data repository schema.
  • the integrated data repository may store health insurance claims data and genomics data for the group of individuals in a number of data tables. In one or more examples, information stored by the number of data tables may be linked.
  • the data repository schema may include a first data table that stores genomics data of the group of individuals.
  • the first data table may store information corresponding to a panel used to generate genomics data, mutations of genomic regions, types of mutations, copy numbers of genomic regions, coverage data indicating numbers of nucleic acid molecules identified in a sample having one or more mutations, testing dates, and patient information.
  • the data repository schema may also include a second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers and a third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table. Additionally, the data repository schema may include a fourth data table that stores personal information of the group of individuals and a fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals. Further, the data repository schema may include a sixth data table storing information corresponding to health insurance coverage information for the group of individuals, such as a type of health insurance plan related to the group of individuals. The data repository schema may also include a seventh data table that stores information related to pharmaceutical treatments obtained by the group of individuals.
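  • As a minimal sketch of how the seven linked data tables described above might be laid out, the following uses an in-memory SQLite database. All table names, column names, and sample values are hypothetical and synthetic; the disclosure describes only the kinds of information each table stores, not a concrete schema.

```python
import sqlite3

# Hypothetical table and column names; only the table contents are described in the text.
SCHEMA = """
CREATE TABLE person (identifier TEXT PRIMARY KEY, birth_year INTEGER, sex TEXT);
CREATE TABLE genomics (
    identifier TEXT REFERENCES person(identifier),
    panel TEXT, gene TEXT, mutation_type TEXT, copy_number REAL,
    coverage_count INTEGER, test_date TEXT);
CREATE TABLE visit (
    visit_id INTEGER PRIMARY KEY,
    identifier TEXT REFERENCES person(identifier),
    provider TEXT, visit_date TEXT);
CREATE TABLE service (
    visit_id INTEGER REFERENCES visit(visit_id),
    service_code TEXT);
CREATE TABLE payer (payer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE insurance_coverage (
    identifier TEXT REFERENCES person(identifier),
    payer_id INTEGER REFERENCES payer(payer_id),
    plan_type TEXT);
CREATE TABLE pharmacy (
    identifier TEXT REFERENCES person(identifier),
    drug_code TEXT, fill_date TEXT);
"""

def build_repository() -> sqlite3.Connection:
    """Create an empty in-memory repository with the seven linked tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

conn = build_repository()

# Synthetic rows for one de-identified individual.
conn.execute("INSERT INTO person VALUES ('id-001', 1960, 'F')")
conn.execute(
    "INSERT INTO genomics VALUES ('id-001', 'panel-A', 'EGFR', 'SNV', 2.0, 1500, '2023-01-15')"
)
conn.execute("INSERT INTO visit VALUES (1, 'id-001', 'clinic-X', '2023-01-10')")
conn.execute("INSERT INTO service VALUES (1, 'C34.90')")

# One identifier reaches both the genomics data and the claims-derived data.
rows = conn.execute(
    """SELECT g.gene, s.service_code
       FROM genomics AS g
       JOIN visit AS v ON v.identifier = g.identifier
       JOIN service AS s ON s.visit_id = v.visit_id
       WHERE g.identifier = ?""",
    ("id-001",),
).fetchall()
```

  • In this sketch the shared identifier column is what links the tables, so a single join retrieves genomics data and visit-level services for the same individual.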
  • the integrated data repository may also store medical records that correspond to at least a portion of the group of individuals.
  • the medical records may be obtained from one or more data repositories storing the medical records.
  • One or more optical character recognition (OCR) operations may be performed with respect to the medical records.
  • the medical records may be analyzed to determine one or more portions of the additional information to remove to produce a corpus of information.
  • the corpus of information may be analyzed to determine a portion of the subset of the additional group of individuals that correspond to one or more biomarkers.
  • One or more data structures may be generated from the corpus of information that store identifiers of the portion of the subset of the additional group of individuals and that store an indication that the portion of the subset of the additional group of individuals corresponds to the one or more biomarkers.
  • the one or more data structures may be stored by an intermediate data repository.
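  • The corpus and biomarker steps above may be illustrated with a short sketch. It assumes the medical records have already been converted to text (for example, by OCR), removes a few boilerplate lines to produce the corpus, and then records which identifiers mention a biomarker. The biomarker vocabulary, the boilerplate pattern, and the sample records are hypothetical.

```python
import re

# Hypothetical biomarker vocabulary and boilerplate pattern; neither comes from the source.
BIOMARKER_TERMS = ("EGFR", "KRAS", "PD-L1")
BOILERPLATE = re.compile(r"^(page \d+|fax cover sheet|confidential)\b", re.IGNORECASE)

def build_corpus(records: dict) -> dict:
    """Drop blank and boilerplate lines from each record's text."""
    corpus = {}
    for identifier, text in records.items():
        kept = [
            line for line in text.splitlines()
            if line.strip() and not BOILERPLATE.match(line.strip())
        ]
        corpus[identifier] = "\n".join(kept)
    return corpus

def flag_biomarkers(corpus: dict, terms=BIOMARKER_TERMS) -> dict:
    """Map each identifier to the sorted biomarker terms found in its text."""
    flagged = {}
    for identifier, text in corpus.items():
        found = sorted(
            term for term in terms
            if re.search(
                rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])",
                text,
                re.IGNORECASE,
            )
        )
        if found:
            flagged[identifier] = found
    return flagged

# Synthetic records keyed by a record identifier.
records = {
    "rec-1": "Page 1\nPathology: EGFR exon 19 deletion detected.",
    "rec-2": "CONFIDENTIAL\nRoutine follow-up, no testing performed.",
}
corpus = build_corpus(records)
biomarker_index = flag_biomarkers(corpus)
```

  • The resulting dictionary plays the role of the data structure that stores identifiers together with an indication of the corresponding biomarkers; a production system would likely need richer text analysis than keyword matching.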
  • One or more de-identification operations may be performed with respect to the identifiers of the portion of the subset of the additional group of individuals before modifying the integrated data repository to store at least a portion of the additional information of the medical records of the portion of the subset of the additional group of individuals in relation to the number of identifiers.
  • the information stored by the intermediate data repository may be added to the integrated data repository.
  • the de-identified medical records information may be added to the integrated data repository in addition to or in lieu of the health insurance claims data.
  • the one or more data structures storing the de-identified medical records information with respect to the biomarker data may have one or more logical connections with other data structures stored in the integrated data repository.
  • the one or more data structures storing the de-identified medical records information with respect to the biomarker data may have one or more logical connections with at least one of: the first data table that stores information corresponding to a panel used to generate genomics data, mutations of genomic regions, types of mutations, copy numbers of genomic regions, coverage data indicating numbers of nucleic acid molecules identified in a sample having one or more mutations, testing dates, and patient information; the second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers; the third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table; the fourth data table that stores personal information of the group of individuals; the fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals; the sixth data table storing information corresponding to health insurance coverage information for the group of individuals, such as a type of health insurance plan related to the group of individuals; or the seventh data table that stores information corresponding to pharmaceutical treatments obtained by the group of individuals.
  • the medical records data may be added to the integrated data repository by generating a data file including first tokens generated using a first hash function.
  • Individual first tokens may correspond to a respective individual of a group of individuals having data stored by a molecular data repository.
  • the data file may be sent to a medical records data management system and medical records data corresponding to the group of individuals may be obtained from the medical records data management system in response to the data file.
  • a number of identifiers may be generated using a second hash function that is different from the first hash function. Each identifier may correspond to one or more tokens related to each individual of the group of individuals. Using the number of identifiers, second data may be obtained from the molecular data repository for the group of individuals.
  • respective portions of the first data may be determined that correspond to respective portions of the second data for the group of individuals.
  • the integrated data repository may be generated that stores the respective portions of the first data and the respective portions of the second data in relation to respective identifiers of the number of identifiers.
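  • The token and identifier steps above may be sketched as follows. The sketch assumes HMAC-SHA256 as the first hash function and keyed BLAKE2b as the second; the disclosure does not name particular hash functions, and the keys, field names, and sample values here are hypothetical and synthetic.

```python
import hashlib
import hmac

# Hypothetical secrets; in practice each party would manage its own key material.
TOKEN_KEY = b"hypothetical-token-key"
IDENTIFIER_KEY = b"hypothetical-identifier-key"

def first_token(name: str, birth_date: str, key: bytes = TOKEN_KEY) -> str:
    """First hash function (assumed here to be HMAC-SHA256) over normalized fields."""
    message = f"{name.strip().lower()}|{birth_date}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def identifier_for(token: str, key: bytes = IDENTIFIER_KEY) -> str:
    """Second, different hash function (assumed here to be keyed BLAKE2b)."""
    return hashlib.blake2b(token.encode("utf-8"), key=key, digest_size=16).hexdigest()

# Second data: molecular records keyed by token (synthetic example values).
token = first_token("Ada Example", "1960-02-03")
molecular_by_token = {token: {"gene": "EGFR", "mutation_type": "SNV"}}

# The data file sent to the medical records data management system holds only tokens.
data_file = sorted(molecular_by_token)

# Stand-in for the response: first data returned for the tokens that matched.
first_data_by_token = {token: {"diagnosis_codes": ["C34.90"]}}

# Integrated repository: both portions stored in relation to the identifier.
integrated = {
    identifier_for(tok): {
        "first_data": first_data_by_token[tok],
        "second_data": molecular_by_token[tok],
    }
    for tok in data_file
    if tok in first_data_by_token
}
```

  • Deriving the identifier with a different function than the token means the integrated repository is not keyed by the values exchanged with the external system, which is one plausible reason for using two hash functions.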
  • a request may be received to determine data with respect to a number of individuals having data stored in the integrated data repository.
  • the request may include one or more search criteria.
  • a subset of the number of individuals having one or more characteristics that correspond to the one or more search criteria may be determined and information of the subset of the number of individuals may be analyzed to determine a measure of significance of a characteristic of the one or more characteristics with respect to a biological condition.
  • one or more genomic mutations may be determined to be present in the subset of the number of individuals and a plurality of treatments provided to the subset of the number of individuals may also be determined.
  • respective survival rates for the subset of the number of individuals may be determined, such as real world survival rates.
  • the measure of significance may correspond to a survival rate with respect to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations. Based on the measure of significance, an effectiveness of the treatment for the subset of the number of individuals may be determined.
  • individuals in the subset of the number of individuals that have not received the treatment may be determined.
  • One or more therapeutically effective amounts of the treatment may be administered to the individuals in the subset of the number of individuals that have not received the treatment.
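  • The request handling described above may be sketched with plain Python data structures. The sketch filters individuals by search criteria, computes a survival rate per combination of genomic mutation and treatment, and lists the individuals in the subset who have not received a given treatment. The records are synthetic, and the field names (including the 24-month survival flag) are assumptions made for illustration.

```python
from collections import defaultdict

# Synthetic records for illustration only; field names are assumptions.
RECORDS = [
    {"id": "a", "condition": "lung cancer", "mutation": "EGFR",
     "treatments": {"drug-1"}, "alive_at_24m": True},
    {"id": "b", "condition": "lung cancer", "mutation": "EGFR",
     "treatments": {"drug-1"}, "alive_at_24m": True},
    {"id": "c", "condition": "lung cancer", "mutation": "EGFR",
     "treatments": {"drug-2"}, "alive_at_24m": False},
    {"id": "d", "condition": "lung cancer", "mutation": "EGFR",
     "treatments": set(), "alive_at_24m": False},
    {"id": "e", "condition": "colon cancer", "mutation": "KRAS",
     "treatments": {"drug-1"}, "alive_at_24m": True},
]

def search(records, **criteria):
    """Return the records whose fields equal every search criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

def survival_rates(records):
    """Survival rate for each (mutation, treatment) pair seen in the records."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [survivors, total]
    for r in records:
        for treatment in r["treatments"]:
            entry = counts[(r["mutation"], treatment)]
            entry[0] += int(r["alive_at_24m"])
            entry[1] += 1
    return {pair: survivors / total for pair, (survivors, total) in counts.items()}

def not_yet_treated(records, treatment):
    """Identifiers of individuals in the subset who have not received the treatment."""
    return [r["id"] for r in records if treatment not in r["treatments"]]

subset = search(RECORDS, condition="lung cancer", mutation="EGFR")
rates = survival_rates(subset)
candidates = not_yet_treated(subset, "drug-1")
```

  • A raw survival rate on a handful of records is not by itself a measure of significance; in practice the comparison would be paired with a statistical test and adjusted for confounding factors.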
  • FIG. 8 is a data flow diagram of an example process 800 to generate a number of datasets used to analyze information stored by an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations.
  • the process 800 may include, at operation 802, determining a first set of data processing instructions that are executable in relation to first data stored by an integrated data repository.
  • the integrated data repository may store health insurance claims data and molecular data for a common group of individuals.
  • the first set of data processing instructions may be included in a plurality of sets of data processing instructions that are part of a data processing pipeline. Each of the sets of data processing instructions of the data processing pipeline may be executed to generate a respective analytics ready dataset.
  • individual sets of data processing instructions of the data processing pipeline may be executable to generate datasets that include specified portions of information and/or combinations of information stored by the integrated data repository.
  • individual sets of data processing instructions of the data processing pipeline may be executable to analyze and modify portions of information stored by the integrated data repository to generate respective datasets.
  • individual sets of data processing instructions may be executable with respect to individual subsets of information stored by the integrated data repository.
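  • One way to picture a data processing pipeline made of individual sets of data processing instructions is as a mapping from dataset names to functions, each of which reads part of the repository and emits one analytics-ready dataset. The step names, row fields, and sample rows below are hypothetical.

```python
from typing import Callable, Dict, List

Row = dict
Step = Callable[[List[Row]], List[Row]]

def diagnosis_dataset(rows: List[Row]) -> List[Row]:
    """Select only the columns needed to study diagnoses."""
    return [{"id": r["id"], "dx_code": r["dx_code"]} for r in rows if r.get("dx_code")]

def treatment_dataset(rows: List[Row]) -> List[Row]:
    """Select treatment columns and normalize drug names (a modification step)."""
    return [{"id": r["id"], "drug": r["drug"].lower()} for r in rows if r.get("drug")]

# Each entry stands for one set of data processing instructions in the pipeline.
PIPELINE: Dict[str, Step] = {
    "diagnoses": diagnosis_dataset,
    "treatments": treatment_dataset,
}

def run_pipeline(rows: List[Row], pipeline: Dict[str, Step] = PIPELINE) -> Dict[str, List[Row]]:
    """Execute every step and return one dataset per step name."""
    return {name: step(rows) for name, step in pipeline.items()}

# Synthetic rows standing in for information stored by the integrated data repository.
repository_rows = [
    {"id": "a", "dx_code": "C34.90", "drug": None},
    {"id": "a", "dx_code": None, "drug": "Drug-1"},
]
datasets = run_pipeline(repository_rows)
```

  • Keeping each step independent means a new analytics-ready dataset can be added by registering another function, without touching the existing steps.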
  • the process 800 may also include, at operation 804, causing the first set of data processing instructions to be executed to generate a first dataset.
  • the first dataset may indicate a subset of the group of individuals in which a biological condition is present.
  • the first set of data processing instructions may be executed to analyze data stored by the integrated data repository to identify a cohort of individuals in which the biological condition is present.
  • the biological condition may include a cancer.
  • the first set of data processing instructions may be executed to analyze data stored by the integrated data repository to identify a cohort of individuals in which lung cancer is present.
  • the data processing pipeline may include multiple sets of data processing instructions to identify cohorts of individuals in which different biological conditions are present.
  • the first set of data processing instructions may be executed to analyze at least one of health insurance claims data or molecular data to determine a cohort of individuals in which the biological condition is present. For example, the first set of data processing instructions may be executed to identify individuals having one or more health insurance codes present in health insurance claims data to determine a group of individuals in which the biological condition is present. Additionally, the first set of data processing instructions may be executed to identify individuals in which one or more mutations are present in a genomic region of nucleic acid molecules derived from samples obtained from the individuals to determine a group of individuals in which the biological condition is present.
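  • The cohort identification described above may be sketched as a union of two lookups: individuals whose claims carry a relevant code, and individuals whose samples show a mutation in a relevant genomic region. The sketch uses the ICD-10-CM category C34 (malignant neoplasm of bronchus and lung) as the claims criterion. The choice of marker genes, the field names, and the records are assumptions for illustration, and the presence of a mutation alone would not ordinarily establish a diagnosis.

```python
# ICD-10-CM category C34 covers malignant neoplasm of bronchus and lung.
LUNG_CANCER_CODE_PREFIXES = ("C34",)
# Hypothetical marker list; which mutations indicate a condition is an assumption.
MARKER_GENES = {"EGFR"}

def condition_cohort(claims, variants,
                     code_prefixes=LUNG_CANCER_CODE_PREFIXES,
                     marker_genes=MARKER_GENES):
    """Return identifiers flagged by claims codes or by mutations."""
    by_code = {c["id"] for c in claims if c["code"].startswith(code_prefixes)}
    by_mutation = {v["id"] for v in variants if v["gene"] in marker_genes}
    return by_code | by_mutation

# Synthetic claims and variant records.
claims = [
    {"id": "a", "code": "C34.90"},
    {"id": "b", "code": "I10"},
]
variants = [
    {"id": "b", "gene": "EGFR"},
    {"id": "c", "gene": "TP53"},
]
cohort = condition_cohort(claims, variants)
```

  • Returning a set of identifiers makes it straightforward to combine this cohort with cohorts produced by other sets of data processing instructions.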
  • the process 800 may include, at operation 806, determining a second set of data processing instructions that are executable in relation to second data stored by the integrated data repository.
  • the second data stored by the integrated data repository may be different from the first data stored by the integrated data repository and analyzed in relation to the first set of data processing instructions.
  • the first data may correspond to first columns of one or more first data tables stored by the integrated data repository and the second data may correspond to second columns of one or more second data tables stored by the integrated data repository.
  • the process 800 may include causing the second set of data processing instructions to be executed to generate a second dataset indicating one or more treatments provided to a second subset of the group of individuals.
  • the second dataset may indicate a subset of the group of individuals that have received one or more treatments.
  • the one or more treatments may be provided to individuals in which one or more biological conditions are present.
  • the second set of data processing instructions may be executed to analyze data stored by the integrated data repository to identify a cohort of individuals that received the one or more treatments.
  • the second set of data processing instructions may be executed to analyze at least one of health insurance claims data or genomics data to determine a cohort of individuals that received the one or more treatments.
  • the second set of data processing instructions may be executed to identify individuals having one or more health insurance codes present in health insurance claims data to determine a group of individuals that received the one or more treatments.
  • the process 800 may include, at operation 810, determining a third subset of the group of individuals that includes a portion of the first subset of the group of individuals that overlaps with a portion of the second subset of the group of individuals.
  • the third subset of the group of individuals corresponds to individuals in which both the biological condition is present and the one or more treatments are provided.
  • the process 800 may include analyzing the first dataset and the second dataset with respect to the third subset of the group of individuals to determine a measure of significance of a characteristic of the third subset of the group of individuals.
  • one or more machine learning techniques or statistical techniques may be applied to information included in at least one of the first dataset and the second dataset with respect to the third subset of the group of individuals.
  • the measure of significance may correspond to a statistical measure of significance with respect to the characteristic.
  • the measure of significance may correspond to a probability of the characteristic being present in individuals in which the biological condition is present.
  • the characteristic may include one or more treatments provided to the individuals in which the biological condition is present.
  • the characteristic may include the presence of a mutation of a genomic region of nucleic acid molecules derived from samples obtained from individuals in which the biological condition is present.
  • information included in at least one of the first dataset or the second dataset may be analyzed to determine an impact of the characteristic with respect to one or more metrics.
  • information included in at least one of the first dataset or the second dataset may be analyzed to determine an amount of influence of a treatment on a survival rate of individuals in which the biological condition is present.
  • information included in at least one of the first dataset or the second dataset may be analyzed to determine an amount of influence of a mutation of a genomic region on a survival rate of individuals in which the biological condition is present. Additionally, information included in the first dataset and the second dataset may be analyzed to determine an amount of impact of one or more treatments with respect to individuals in which the biological condition is present and in which one or more genomic mutations are also present.
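  • As one possible statistical technique for the analysis described above, the following sketch forms the third subset as the overlap of the condition cohort and the treated cohort, tabulates mutation status against survival within that subset, and applies a two-sided Fisher's exact test. Fisher's exact test is chosen here only for illustration; the disclosure does not specify a particular test. The identifiers, field names, and records are synthetic.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (prob(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-9)
    )

def contingency(records, cohort):
    """Rows: mutated / not mutated. Columns: survived / did not survive."""
    table = [[0, 0], [0, 0]]
    for r in records:
        if r["id"] in cohort:
            table[0 if r["mutated"] else 1][0 if r["survived"] else 1] += 1
    return table

# First and second datasets reduced to identifier sets (synthetic).
condition_ids = {"p1", "p2", "p3", "p4", "p5"}
treated_ids = {"p2", "p3", "p4", "p5", "p6"}
third_subset = condition_ids & treated_ids

records = [
    {"id": "p1", "mutated": True, "survived": False},
    {"id": "p2", "mutated": True, "survived": True},
    {"id": "p3", "mutated": True, "survived": True},
    {"id": "p4", "mutated": False, "survived": False},
    {"id": "p5", "mutated": False, "survived": False},
    {"id": "p6", "mutated": False, "survived": True},
]
table = contingency(records, third_subset)
p_value = fisher_exact_two_sided(table[0][0], table[0][1], table[1][0], table[1][1])
```

  • With so few individuals the resulting p-value carries little weight; the sketch is meant to show the shape of the computation, not a realistic sample size.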
  • Figure 9 illustrates a diagrammatic representation of a machine 900 in the form of a computer system within which a set of instructions may be executed for causing the machine 900 to perform any one or more of the methodologies discussed herein, according to an example implementation.
  • Figure 9 shows a diagrammatic representation of the machine 900 in the example form of a computer system, within which instructions 902 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 900 to perform any one or more of the methodologies discussed herein may be executed.
  • the instructions 902 may cause the machine 900 to implement the architectures and frameworks 100, 200, 300, 400, 500, 600 described with respect to Figures 1, 2, 3, 4, 5, and 6, respectively, and to execute the methods 700, 800 described with respect to Figures 7 and 8, respectively.
  • the instructions 902 transform the general, non-programmed machine 900 into a particular machine 900 programmed to carry out the described and illustrated functions in the manner described.
  • the machine 900 operates as a standalone device or may be coupled (e.g., networked) to other machines.
  • the machine 900 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • the machine 900 may comprise, but not be limited to, a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), an entertainment media system, a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 902, sequentially or otherwise, that specify actions to be taken by the machine 900.
  • the term “machine” shall also be taken to include a collection of machines 900 that individually or jointly execute the instructions 902 to perform any one or more of the methodologies discussed herein.
  • Examples of computing device 900 may include logic, one or more components, circuits (e.g., modules), or mechanisms. Circuits are tangible entities configured to perform certain operations. In an example, circuits may be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner. In an example, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware processors (processors) may be configured by software (e.g., instructions, an application portion, or an application) as a circuit that operates to perform certain operations as described herein. In an example, the software may reside (1) on a non-transitory machine readable medium or (2) in a transmission signal. In an example, the software, when executed by the underlying hardware of the circuit, causes the circuit to perform the certain operations.
  • a circuit may be implemented mechanically or electronically.
  • a circuit may comprise dedicated circuitry or logic that is specifically configured to perform one or more techniques such as discussed above, such as including a special-purpose processor, a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • a circuit may comprise programmable logic (e.g., circuitry, as encompassed within a general-purpose processor or other programmable processor) that may be temporarily configured (e.g., by software) to perform the certain operations. It will be appreciated that the decision to implement a circuit mechanically (e.g., in dedicated and permanently configured circuitry), or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
  • circuit is understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily (e.g., transitorily) configured (e.g., programmed) to operate in a specified manner or to perform specified operations.
  • each of the circuits need not be configured or instantiated at any one instance in time.
  • where the circuits comprise a general-purpose processor configured via software, the general-purpose processor may be configured as respective different circuits at different times.
  • Software may accordingly configure a processor, for example, to constitute a particular circuit at one instance of time and to constitute a different circuit at a different instance of time.
  • circuits may provide information to, and receive information from, other circuits.
  • the circuits may be regarded as being communicatively coupled to one or more other circuits.
  • communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the circuits.
  • communications between such circuits may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple circuits have access.
  • one circuit may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled.
  • a further circuit may then, at a later time, access the memory device to retrieve and process the stored output.
  • circuits may be configured to initiate or receive communications with input or output devices and may operate on a resource (e.g., a collection of information).
  • processors may be temporarily configured (e.g., by software) or permanently configured to perform the relevant operations.
  • processors may constitute processor-implemented circuits that operate to perform one or more operations or functions.
  • the circuits referred to herein may comprise processor-implemented circuits.
  • the methods described herein may be at least partially processor implemented. For example, at least some or all of the operations of a method may be performed by one or more processors or processor-implemented circuits. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In an example, the processor or processors may be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other examples the processors may be distributed across a number of locations.
  • the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS).
  • at least some of the operations may be performed by a group of computers (as examples of machines including processors) and may be accessible via one or more interfaces (e.g., Application Program Interfaces (APIs)).
  • Example implementations may be implemented in digital electronic circuitry, in computer hardware, in firmware, in software, or in any combination thereof.
  • Example implementations may be implemented using a computer program product (e.g., a computer program, tangibly embodied in an information carrier or in a machine readable medium, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers).
  • a computer program may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a standalone program or as a software module, subroutine, or other unit suitable for use in a computing environment.
  • a computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
  • operations may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output.
  • Examples of method operations may also be performed by, and example apparatus may be implemented as, special purpose logic circuitry (e.g., a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)).
  • the computing system may include clients and servers.
  • a client and server are generally remote from each other and generally interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • both hardware and software architectures require consideration. Specifically, it will be appreciated that the choice of whether to implement certain functionality in permanently configured hardware (e.g., an ASIC), in temporarily configured hardware (e.g., a combination of software and a programmable processor), or a combination of permanently and temporarily configured hardware may be a design choice.
  • hardware (e.g., computing device 900) and software architectures that may be deployed in example implementations are set out below.
  • the computing device 900 may operate as a standalone device or the computing device 900 may be connected (e.g., networked) to other machines.
  • the computing device 900 may operate in the capacity of either a server or a client machine in server-client network environments.
  • computing device 900 may act as a peer machine in peer-to-peer (or other distributed) network environments.
  • the computing device 900 may be a personal computer (PC), a tablet PC, a set-top box (STB), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) specifying actions to be taken (e.g., performed) by the computing device 900.
  • the term “computing device” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein
  • Example computing device 900 may include a processor 904 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 906 and a static memory 908, some or all of which may communicate with each other via a bus 910.
  • the computing device 900 may further include a display unit 912, an alphanumeric input device 914 (e.g., a keyboard), and a user interface (UI) navigation device 916 (e.g., a mouse).
  • the display unit 912, input device 914 and UI navigation device 916 may be a touch screen display.
  • the computing device 900 may additionally include a storage device (e.g., drive unit) 918, a signal generation device 920 (e.g., a speaker), a network interface device 922, and one or more sensors 924, such as a global positioning system (GPS) sensor, compass, accelerometer, or another sensor.
  • the storage device 918 may include a machine readable medium 926 on which is stored one or more sets of data structures or instructions 902 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein.
  • the instructions 902 may also reside, completely or at least partially, within the main memory 906, within static memory 908, or within the processor 904 during execution thereof by the computing device 900.
  • one or any combination of the processor 904, the main memory 906, the static memory 908, or the storage device 918 may constitute machine readable media.
  • while the machine readable medium 926 is illustrated as a single medium, the term “machine readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that are configured to store the one or more instructions 902.
  • the term “machine readable medium” may also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions.
  • the term “machine readable medium” may accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
  • machine-readable media may include non-volatile memory, including, by way of example, semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the instructions 902 may further be transmitted or received over a communications network 828 using a transmission medium via the network interface device 922 utilizing any one of a number of transfer protocols (e.g., frame relay, IP, TCP, UDP, HTTP, etc.).
  • Example communication networks may include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, wireless data networks (e.g., the IEEE 802.11 standards family known as Wi-Fi®, the IEEE 802.16 standards family known as WiMax®), and peer-to-peer (P2P) networks, among others.
  • the term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.
  • a component may refer to a device, physical entity, or logic having boundaries defined by function or subroutine calls, branch points, APIs, or other technologies that provide for the partitioning or modularization of particular processing or control functions. Components may be combined via their interfaces with other components to carry out a machine process.
  • a component may be a packaged functional hardware unit designed for use with other components and a part of a program that usually performs a particular function of related functions. Components may constitute either software components (e.g., code embodied on a machine-readable medium) or hardware components.
  • a “hardware component” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner.
  • one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware components of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware component that operates to perform certain operations as described herein.
  • the present methods can be used to diagnose the presence of conditions, particularly cancer, in a subject, to characterize conditions (e.g., staging cancer or determining heterogeneity of a cancer), to monitor response to treatment of a condition, and to effect a prognosis of the risk of developing a condition or of the subsequent course of a condition.
  • the present disclosure can also be useful in determining the efficacy of a particular treatment option.
  • Successful treatment options may increase the amount of copy number variation or rare mutations detected in a subject's blood, as more cancer cells may die and shed DNA. In other examples, this may not occur.
  • certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy.
  • the present methods can be used to monitor residual disease or recurrence of disease.
  • the methods and systems disclosed herein may be used to identify customized or targeted therapies to treat a given disease or condition in patients based on the classification of a nucleic acid variant as being of somatic or germline origin.
  • the disease under consideration is a type of cancer.
  • Non-limiting examples of such cancers include biliary tract cancer, bladder cancer, transitional cell carcinoma, urothelial carcinoma, brain cancer, gliomas, astrocytomas, breast carcinoma, metaplastic carcinoma, cervical cancer, cervical squamous cell carcinoma, rectal cancer, colorectal carcinoma, colon cancer, hereditary nonpolyposis colorectal cancer, colorectal adenocarcinomas, gastrointestinal stromal tumors (GISTs), endometrial carcinoma, endometrial stromal sarcomas, esophageal cancer, esophageal squamous cell carcinoma, esophageal adenocarcinoma, ocular melanoma, uveal melanoma, gallbladder carcinomas, gallbladder adenocarcinoma, renal cell carcinoma, clear cell renal cell carcinoma, transitional cell carcinoma, urothelial carcinomas, Wilms tumor, leukemia, acute lymphocytic leukemia (ALL
  • prostate cancer, prostate adenocarcinoma, skin cancer, melanoma, malignant melanoma, cutaneous melanoma, small intestine carcinomas, stomach cancer, gastric carcinoma, gastrointestinal stromal tumor (GIST), uterine cancer, or uterine sarcoma.
  • Type and/or stage of cancer can be detected from genetic variations including mutations, rare mutations, indels, copy number variations, transversions, translocations, inversions, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, and abnormal changes in nucleic acid 5-methylcytosine.
  • Genetic data can also be used for characterizing a specific form of cancer. Cancers are often heterogeneous in both composition and staging. Genetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer and allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. Some cancers can progress to become more aggressive and genetically unstable. Other cancers may remain benign, inactive or dormant. The system and methods of this disclosure may be useful in determining disease progression.
  • an abnormal condition is cancer.
  • the abnormal condition may be one resulting in a heterogeneous genomic population.
  • some tumors are known to comprise tumor cells in different stages of the cancer.
  • heterogeneity may comprise multiple foci of disease. Again, in the example of cancer, there may be multiple tumor foci, perhaps where one or more foci are the result of metastases that have spread from a primary site.
  • the present methods can be used to generate a profile, fingerprint, or set of data that is a summation of genetic information derived from different cells in a heterogeneous disease.
  • This set of data may comprise copy number variation, epigenetic variation, and mutation analyses alone or in combination.
  • the present methods can be used to diagnose, prognose, monitor or observe cancers, or other diseases.
  • the methods herein do not involve diagnosing, prognosing, or monitoring a fetus and as such are not directed to non-invasive prenatal testing.
  • these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.
  • Non-limiting examples of other genetic-based diseases, disorders, or conditions that are optionally evaluated using the methods and systems disclosed herein include achondroplasia, alpha-1 antitrypsin deficiency, antiphospholipid syndrome, autism, autosomal dominant polycystic kidney disease, Charcot-Marie-Tooth (CMT), cri du chat, Crohn's disease, cystic fibrosis, Dercum disease, Down syndrome, Duane syndrome, Duchenne muscular dystrophy, Factor V Leiden thrombophilia, familial hypercholesterolemia, familial Mediterranean fever, fragile X syndrome, Gaucher disease, hemochromatosis, hemophilia, holoprosencephaly, Huntington's disease, Klinefelter syndrome, Marfan syndrome, myotonic dystrophy, neurofibromatosis, Noonan syndrome, osteogenesis imperfecta, Parkinson's disease, phenylketonuria, Poland anomaly, porphyria, progeria, retinitis pigmentosa,
  • a method described herein includes detecting a presence or absence of DNA originating or derived from a tumor cell at a preselected timepoint following a previous cancer treatment of a subject previously diagnosed with cancer using a set of sequence information obtained as described herein.
  • the method may further comprise determining a cancer recurrence score that is indicative of the presence or absence of the DNA originating or derived from the tumor cell for the test subject. Where a cancer recurrence score is determined, it may further be used to determine a cancer recurrence status.
  • the cancer recurrence status may be at risk for cancer recurrence, e.g., when the cancer recurrence score is above a predetermined threshold.
  • the cancer recurrence status may be at low or lower risk for cancer recurrence, e.g., when the cancer recurrence score is below a predetermined threshold.
  • a cancer recurrence score equal to the predetermined threshold may result in a cancer recurrence status of either at risk for cancer recurrence or at low or lower risk for cancer recurrence.
  • a cancer recurrence score is compared with a predetermined cancer recurrence threshold, and the test subject is classified as a candidate for a subsequent cancer treatment when the cancer recurrence score is above the cancer recurrence threshold or not a candidate for therapy when the cancer recurrence score is below the cancer recurrence threshold.
  • a cancer recurrence score equal to the cancer recurrence threshold may result in classification as either a candidate for a subsequent cancer treatment or not a candidate for therapy.
  • the methods discussed above may further comprise any compatible feature or features set forth elsewhere herein, including in the section regarding methods of determining a risk of cancer recurrence in a test subject and/or classifying a test subject as being a candidate for a subsequent cancer treatment.
  • a method provided herein is a method of determining a risk of cancer recurrence in a test subject. In some embodiments, a method provided herein is a method of classifying a test subject as being a candidate for a subsequent cancer treatment.
  • Any of such methods may comprise collecting DNA (e.g., originating or derived from a tumor cell) from the test subject diagnosed with the cancer at one or more preselected timepoints following one or more previous cancer treatments to the test subject.
  • the subject may be any of the subjects described herein.
  • the DNA may be cfDNA.
  • the DNA may be obtained from a tissue sample.
  • Any of such methods may comprise capturing a plurality of sets of target regions from DNA from the subject, wherein the plurality of target region sets includes a sequence-variable target region set and an epigenetic target region set, whereby a captured set of DNA molecules is produced.
  • the capturing step may be performed according to any of the embodiments described elsewhere herein.
  • the previous cancer treatment may comprise surgery, administration of a therapeutic composition, and/or chemotherapy.
  • Any of such methods may comprise sequencing the captured DNA molecules, whereby a set of sequence information is produced.
  • the captured DNA molecules of the sequence-variable target region set may be sequenced to a greater depth of sequencing than the captured DNA molecules of the epigenetic target region set.
  • Any of such methods may comprise detecting a presence or absence of DNA originating or derived from a tumor cell at a preselected timepoint using the set of sequence information.
  • the detection of the presence or absence of DNA originating or derived from a tumor cell may be performed according to any of the embodiments thereof described elsewhere herein.
  • Methods of determining a risk of cancer recurrence in a test subject may comprise determining a cancer recurrence score that is indicative of the presence or absence, or amount, of the DNA originating or derived from the tumor cell for the test subject.
  • the cancer recurrence score may further be used to determine a cancer recurrence status.
  • the cancer recurrence status may be at risk for cancer recurrence, e.g., when the cancer recurrence score is above a predetermined threshold.
  • the cancer recurrence status may be at low or lower risk for cancer recurrence, e.g., when the cancer recurrence score is below a predetermined threshold.
  • a cancer recurrence score equal to the predetermined threshold may result in a cancer recurrence status of either at risk for cancer recurrence or at low or lower risk for cancer recurrence.
  • Methods of classifying a test subject as being a candidate for a subsequent cancer treatment may comprise comparing the cancer recurrence score of the test subject with a predetermined cancer recurrence threshold, thereby classifying the test subject as a candidate for the subsequent cancer treatment when the cancer recurrence score is above the cancer recurrence threshold or not a candidate for therapy when the cancer recurrence score is below the cancer recurrence threshold.
  • a cancer recurrence score equal to the cancer recurrence threshold may result in classification as either a candidate for a subsequent cancer treatment or not a candidate for therapy.
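  • The threshold comparison described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the names `RecurrenceStatus`, `classify_recurrence`, and `is_treatment_candidate` are hypothetical, and the sketch assumes that a score below the threshold maps to the low or lower risk status, consistent with the candidate classification rule. Because a score equal to the threshold may resolve either way, the tie policy is an explicit parameter.

```python
from enum import Enum


class RecurrenceStatus(Enum):
    AT_RISK = "at risk for cancer recurrence"
    LOWER_RISK = "at low or lower risk for cancer recurrence"


def classify_recurrence(score: float, threshold: float,
                        tie_is_at_risk: bool = True) -> RecurrenceStatus:
    """Map a cancer recurrence score to a status by threshold comparison.

    Assumption: scores above the threshold are 'at risk' and scores below
    it are 'low or lower risk'. A score equal to the threshold may go
    either way, so the caller chooses the policy via tie_is_at_risk.
    """
    if score > threshold:
        return RecurrenceStatus.AT_RISK
    if score < threshold:
        return RecurrenceStatus.LOWER_RISK
    return RecurrenceStatus.AT_RISK if tie_is_at_risk else RecurrenceStatus.LOWER_RISK


def is_treatment_candidate(score: float, threshold: float,
                           tie_is_candidate: bool = True) -> bool:
    """Classify a subject as a candidate for a subsequent cancer treatment."""
    status = classify_recurrence(score, threshold, tie_is_at_risk=tie_is_candidate)
    return status is RecurrenceStatus.AT_RISK


if __name__ == "__main__":
    # Hypothetical score and threshold, chosen only for illustration.
    print(classify_recurrence(0.8, 0.5).value)
    print(is_treatment_candidate(0.2, 0.5))
```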
  • the subsequent cancer treatment includes chemotherapy or administration of a therapeutic composition.
  • Any of such methods may comprise determining a disease-free survival (DFS) period for the test subject based on the cancer recurrence score; for example, the DFS period may be 1 year, 2 years, 3 years, 4 years, 5 years, or 10 years.
  • the set of sequence information includes sequence-variable target region sequences
  • determining the cancer recurrence score may comprise determining at least a first subscore indicative of the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences.
  • a number of mutations in the sequence-variable target regions chosen from 1, 2, 3, 4, or 5 is sufficient for the first subscore to result in a cancer recurrence score classified as positive for cancer recurrence. In some embodiments, the number of mutations is chosen from 1, 2, or 3.
  • the set of sequence information includes epigenetic target region sequences
  • determining the cancer recurrence score includes determining a second subscore indicative of the amount of molecules (obtained from the epigenetic target region sequences) that represent an epigenetic state different from DNA found in a corresponding sample from a healthy subject (e.g., cfDNA found in a blood sample from a healthy subject, or DNA found in a tissue sample from a healthy subject where the tissue sample is of the same type of tissue as was obtained from the test subject).
  • abnormal molecules (i.e., molecules with an epigenetic state different from DNA found in a corresponding sample from a healthy subject)
  • epigenetic changes associated with cancer (e.g., methylation of hypermethylation variable target regions and/or perturbed fragmentation of fragmentation variable target regions, where “perturbed” means different from DNA found in a corresponding sample from a healthy subject)
  • a proportion of molecules corresponding to the hypermethylation variable target region set and/or fragmentation variable target region set that indicate hypermethylation in the hypermethylation variable target region set and/or abnormal fragmentation in the fragmentation variable target region set greater than or equal to a value in the range of 0.001%-10% is sufficient for the second subscore to be classified as positive for cancer recurrence.
  • the range may be 0.001%-1%, 0.005%-1%, 0.01%-5%, 0.01%-2%, or 0.01%-1%.
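  • As an illustration of the second subscore criterion, the sketch below computes the proportion of molecules called abnormal (hypermethylated and/or abnormally fragmented) and compares it with a cutoff chosen from the 0.001%-10% range. The function name, parameter names, and default cutoff are assumptions made for this example; how individual molecules are called abnormal is outside its scope.

```python
def second_subscore_positive(abnormal_molecules: int,
                             total_molecules: int,
                             cutoff_percent: float = 0.01) -> bool:
    """Return True when the abnormal-molecule proportion meets the cutoff.

    cutoff_percent is expressed in percent and is assumed to be chosen
    from the 0.001%-10% range; the default of 0.01% is illustrative only.
    """
    if total_molecules <= 0:
        raise ValueError("total_molecules must be positive")
    if not 0 <= abnormal_molecules <= total_molecules:
        raise ValueError("abnormal_molecules must be between 0 and total_molecules")
    if not 0.001 <= cutoff_percent <= 10.0:
        raise ValueError("cutoff_percent must lie in the 0.001%-10% range")

    proportion_percent = 100.0 * abnormal_molecules / total_molecules
    # The criterion is 'greater than or equal to' the selected value.
    return proportion_percent >= cutoff_percent


if __name__ == "__main__":
    # Hypothetical counts: 5 abnormal molecules out of 10,000 is 0.05%.
    print(second_subscore_positive(5, 10_000, cutoff_percent=0.01))
```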
  • any of such methods may comprise determining a fraction of tumor DNA from the fraction of molecules in the set of sequence information that indicate one or more features indicative of origination from a tumor cell. This may be done for molecules corresponding to some or all of the epigenetic target regions, e.g., including one or both of hypermethylation variable target regions and fragmentation variable target regions (hypermethylation of a hypermethylation variable target region and/or abnormal fragmentation of a fragmentation variable target region may be considered indicative of origination from a tumor cell). This may be done for molecules corresponding to sequence variable target regions, e.g., molecules including alterations consistent with cancer, such as SNVs, indels, CNVs, and/or fusions. The fraction of tumor DNA may be determined based on a combination of molecules corresponding to epigenetic target regions and molecules corresponding to sequence variable target regions.
  • Determination of a cancer recurrence score may be based at least in part on the fraction of tumor DNA, wherein a fraction of tumor DNA greater than a threshold in the range of 10⁻¹¹ to 1 or 10⁻¹⁰ to 1 is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence.
  • a fraction of tumor DNA greater than or equal to a threshold in the range of 10⁻¹⁰ to 10⁻⁹, 10⁻⁹ to 10⁻⁸, 10⁻⁸ to 10⁻⁷, 10⁻⁷ to 10⁻⁶, 10⁻⁶ to 10⁻⁵, 10⁻⁵ to 10⁻⁴, 10⁻⁴ to 10⁻³, 10⁻³ to 10⁻², or 10⁻² to 10⁻¹ is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence.
  • the fraction of tumor DNA greater than a threshold of at least 10⁻⁷ is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence.
  • a determination that a fraction of tumor DNA is greater than a threshold may be made based on a cumulative probability. For example, a sample may be considered positive if the cumulative probability that the tumor fraction is greater than a threshold in any of the foregoing ranges exceeds a probability threshold of at least 0.5, 0.75, 0.9, 0.95, 0.98, 0.99, 0.995, or 0.999.
  • the probability threshold is at least 0.95, such as 0.99.
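  • The cumulative probability test can be sketched as follows, under a stated assumption: the tumor fraction estimate is available as a discrete distribution of (tumor fraction, probability weight) pairs. The disclosure does not specify how such a distribution is obtained, so the input format, the function names, and the default thresholds below are hypothetical.

```python
from typing import Iterable, Tuple


def probability_fraction_exceeds(distribution: Iterable[Tuple[float, float]],
                                 fraction_threshold: float) -> float:
    """Cumulative probability that the tumor fraction exceeds a threshold.

    distribution holds (tumor_fraction, weight) pairs; weights are
    normalized here, so they need not sum to 1 on input.
    """
    pairs = list(distribution)
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        raise ValueError("distribution must carry positive total weight")
    mass_above = sum(weight for fraction, weight in pairs
                     if fraction > fraction_threshold)
    return mass_above / total


def sample_is_positive(distribution: Iterable[Tuple[float, float]],
                       fraction_threshold: float = 1e-7,
                       probability_threshold: float = 0.95) -> bool:
    """Call a sample positive when the cumulative probability exceeds the cutoff."""
    probability = probability_fraction_exceeds(distribution, fraction_threshold)
    return probability > probability_threshold


if __name__ == "__main__":
    # Hypothetical distribution: 2% of the weight sits at a tumor fraction of 0.
    example = [(0.0, 0.02), (1e-6, 0.48), (1e-4, 0.50)]
    print(probability_fraction_exceeds(example, 1e-7))
    print(sample_is_positive(example, probability_threshold=0.95))
```

  • Keeping the distribution as an input separates the decision rule from the statistical model, so the same comparison applies whichever of the listed probability thresholds is selected.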
  • the set of sequence information includes sequence-variable target region sequences and epigenetic target region sequences
  • determining the cancer recurrence score includes determining a first subscore indicative of the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences and a second subscore indicative of the amount of abnormal molecules in epigenetic target region sequences, and combining the first and second subscores to provide the cancer recurrence score.
  • first and second subscores may be combined by applying a threshold to each subscore independently (e.g., greater than a predetermined number of mutations (e.g., > 1) in sequence-variable target regions, and greater than a predetermined fraction of abnormal molecules (i.e., molecules with an epigenetic state different from the DNA found in a corresponding sample from a healthy subject; e.g., tumor) in epigenetic target regions), or training a machine learning classifier to determine status based on a plurality of positive and negative training samples.
  • a value for the combined score in the range of -4 to 2 or -3 to 1 is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence.
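  • One of the two combination strategies mentioned above, applying a threshold to each subscore independently, can be sketched as follows. The class and function names and the cutoff values used in the usage lines are hypothetical. Whether both subscores or either one must pass is left as a required argument rather than assumed, and the machine learning classifier alternative is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubscoreCall:
    first_positive: bool      # sequence-variable target regions
    second_positive: bool     # epigenetic target regions
    combined_positive: bool


def combine_subscores(mutation_count: int,
                      abnormal_fraction: float,
                      *,
                      mutation_cutoff: int,
                      abnormal_fraction_cutoff: float,
                      require_both: bool) -> SubscoreCall:
    """Threshold each subscore independently, then combine the two calls.

    mutation_count: number of SNVs, indels, CNVs, and/or fusions detected.
    abnormal_fraction: fraction of molecules whose epigenetic state differs
    from that of a corresponding healthy sample.
    Both comparisons are strict ('greater than'), matching the example
    wording of the text; the cutoffs themselves are caller-supplied.
    """
    first = mutation_count > mutation_cutoff
    second = abnormal_fraction > abnormal_fraction_cutoff
    combined = (first and second) if require_both else (first or second)
    return SubscoreCall(first, second, combined)


if __name__ == "__main__":
    # Hypothetical inputs and cutoffs, for illustration only.
    call = combine_subscores(2, 0.002,
                             mutation_cutoff=1,
                             abnormal_fraction_cutoff=0.001,
                             require_both=True)
    print(call)
```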
  • the cancer recurrence status of the subject may be at risk for cancer recurrence and/or the subject may be classified as a candidate for a subsequent cancer treatment.
  • the cancer is any one of the types of cancer described elsewhere herein, e.g., colorectal cancer.
  • As shown in FIG. 10, comprehensive evaluation, diagnostic testing, molecular and genetic profiling, and/or risk assessment can be utilized in combination for an assessment.
  • patient consultation, treatment strategy and/or tailored treatment can be utilized in combination for treatment planning.
  • treatment implementation can include pre-treatment and/or treatment execution.
  • regular follow-ups and/or response assessment can constitute mechanisms for determining monitoring and adjustment.
  • post-treatment surveillance and/or recurrence management can support long term management and/or survivorship.
  • the methods disclosed herein relate to identifying and administering customized therapies to patients given the status of a nucleic acid variant as being of somatic or germline origin.
  • essentially any cancer therapy (e.g., surgical therapy, radiation therapy, chemotherapy, and/or the like)
  • customized therapies include at least one immunotherapy (or an immunotherapeutic agent).
  • Immunotherapy refers generally to methods of enhancing an immune response against a given cancer type.
  • immunotherapy refers to methods of enhancing a T cell response against a tumor or cancer.
  • the status of a nucleic acid variant from a sample from a subject as being of somatic or germline origin may be compared with a database of comparator results from a reference population to identify customized or targeted therapies for that subject.
  • the reference population includes patients with the same cancer or disease type as the test subject and/or patients who are receiving, or who have received, the same therapy as the test subject.
  • a customized or targeted therapy (or therapies) may be identified when the nucleic acid variant and the comparator results satisfy certain classification criteria (e.g., are a substantial or an approximate match).
  • the customized therapies described herein are typically administered parenterally (e.g., intravenously or subcutaneously).
  • Pharmaceutical compositions containing an immunotherapeutic agent are typically administered intravenously.
  • Certain therapeutic agents are administered orally.
  • customized therapies (e.g., immunotherapeutic agents, etc.)
  • kits including the compositions as described herein.
  • the kits can be useful in performing the methods as described herein.
  • a kit includes a first reagent for partitioning a sample into a plurality of subsamples as described herein, such as any of the partitioning reagents described elsewhere herein.
  • a kit includes a second reagent for subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity (e.g., any of the reagents described elsewhere herein for converting a nucleobase such as cytosine or methylated cytosine to a different nucleobase).
  • the kit may comprise the first and second reagents and additional elements as discussed below and/or elsewhere herein.
  • Kits may further comprise a plurality of oligonucleotide probes that selectively hybridize to at least 5, 6, 7, 8, 9, 10, 20, 30, 40 or all genes selected from the group consisting of ALK, APC, BRAF, CDKN2A, EGFR, ERBB2, FBXW7, KRAS, MYC, NOTCH1, NRAS, PIK3CA, PTEN, RB1, TP53, MET, AR, ABL1, AKT1, ATM, CDH1, CSF1R, CTNNB1, ERBB4, EZH2, FGFR1, FGFR2, FGFR3, FLT3, GNA11, GNAQ, GNAS, HNF1A, HRAS, IDH1, IDH2, JAK2, JAK3, KDR, KIT, MLH1, MPL, NPM1, PDGFRA, PROC, PTPN11, RET, SMAD4, SMARCB1, SMO, SRC, STK11, VHL, TERT, CCND1, CDK4, CDKN2B
  • the number of genes to which the oligonucleotide probes can selectively hybridize can vary.
  • the number of genes can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, or 54.
  • the kit can include a container that includes the plurality of oligonucleotide probes and instructions for performing any of the methods described herein.
  • the oligonucleotide probes can selectively hybridize to exon regions of the genes, e.g., of the at least 5 genes. In some cases, the oligonucleotide probes can selectively hybridize to at least 30 exons of the genes, e.g., of the at least 5 genes. In some cases, the multiple probes can selectively hybridize to each of the at least 30 exons. The probes that hybridize to each exon can have sequences that overlap with at least 1 other probe. In some embodiments, the oligoprobes can selectively hybridize to non-coding regions of genes disclosed herein, for example, intronic regions of the genes. The oligoprobes can also selectively hybridize to regions of genes including both exonic and intronic regions of the genes disclosed herein.
  • any number of exons can be targeted by the oligonucleotide probes. For example, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 400, 500, 600, 700, 800, 900, 1,000, or more, exons can be targeted.
  • the kit can comprise at least 4, 5, 6, 7, or 8 different library adaptors having distinct molecular barcodes and identical sample barcodes.
  • the library adaptors may not be sequencing adaptors.
  • the library adaptors do not include flow cell sequences or sequences that permit the formation of hairpin loops for sequencing.
  • the different variations and combinations of molecular barcodes and sample barcodes are described throughout, and are applicable to the kit.
  • the adaptors are not sequencing adaptors.
  • the adaptors provided with the kit can also comprise sequencing adaptors.
  • a sequencing adaptor can comprise a sequence hybridizing to one or more sequencing primers.
  • a sequencing adaptor can further comprise a sequence hybridizing to a solid support, e.g., a flow cell sequence.
  • a sequencing adaptor can be a flow cell adaptor.
  • the sequencing adaptors can be attached to one or both ends of a polynucleotide fragment.
  • the kit can comprise at least 8 different library adaptors having distinct molecular barcodes and identical sample barcodes.
  • the library adaptors may not be sequencing adaptors.
  • the kit can further include a sequencing adaptor having a first sequence that selectively hybridizes to the library adaptors and a second sequence that selectively hybridizes to a flow cell sequence.
  • a sequencing adaptor can be hairpin shaped.
  • the hairpin shaped adaptor can comprise a complementary double stranded portion and a loop portion, where the double stranded portion can be attached (e.g.
  • a sequencing adaptor can be up to 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
  • the sequencing adaptor can comprise 20-30, 20-
  • a sequencing adaptor can comprise one or more barcodes.
  • a sequencing adaptor can comprise a sample barcode.
  • the sample barcode can comprise a pre-determined sequence.
  • the sample barcodes can be used to identify the source of the polynucleotides.
  • the sample barcode can be at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, or more (or any length as described throughout) nucleic acid bases, e.g., at least 8 bases.
  • the barcode can be contiguous or non-contiguous sequences, as described above.
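  • Using sample barcodes to identify the source of polynucleotides can be illustrated with a simple demultiplexing sketch. It assumes, purely for illustration, a contiguous barcode of fixed length (8 bases here) at the start of each read and exact matching against a known barcode table; the function name, the table, and the "unassigned" label are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str],
                barcode_to_sample: Dict[str, str],
                barcode_length: int = 8) -> Dict[str, List[str]]:
    """Group read inserts by the sample that their leading barcode identifies.

    Assumptions: the sample barcode is contiguous, sits at the start of the
    read, and must match a table entry exactly. Reads whose barcode is not
    in the table are collected under 'unassigned'.
    """
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]          # sequence with barcode removed
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


if __name__ == "__main__":
    # Hypothetical barcodes and reads.
    table = {"ACGTACGT": "sample_A", "GGGGCCCC": "sample_B"}
    reads = ["ACGTACGTTTTT", "GGGGCCCCAAAA", "NNNNNNNNCC"]
    print(demultiplex(reads, table))
```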
  • the library adaptors can be blunt ended and Y-shaped and can be less than or equal to 40 nucleic acid bases in length. Other variations of the library adaptors can be found throughout and are applicable to the kit.
  • a biomarker may be any gene or variant of a gene whose presence, mutation, deletion, substitution, copy number, or translation (i.e., to a protein) is an indicator of a disease state.
  • Biomarkers of the present disclosure may include the presence, mutation, deletion, substitution, copy number, or translation in any one or more of EGFR, KRAS, MET, BRAF, MYC, NRAS, ERBB2, ALK, Notch, PIK3CA, APC, and SMO.
  • a biomarker is a genetic variant associated with one or more cancers. Biomarkers may be determined using any of several resources or methods. A biomarker may have been previously discovered or may be discovered de novo using experimental or epidemiological techniques. Detection of a biomarker may be indicative of cancer when the biomarker is highly correlated with a cancer. Detection of a biomarker may be indicative of cancer when a biomarker in a region or gene occurs with a frequency that is greater than a frequency for a given background population or dataset.
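  • One way to make the comparison against a background frequency concrete is a one-sided exact binomial test, sketched below. The choice of test, the function names, and the significance level are assumptions made for this example; the text itself only requires that the observed frequency be greater than the background frequency.

```python
from math import comb


def enrichment_p_value(carriers: int, cohort_size: int,
                       background_frequency: float) -> float:
    """P(X >= carriers) for X ~ Binomial(cohort_size, background_frequency).

    Exact summation; intended for modest cohort sizes.
    """
    if cohort_size <= 0 or not 0 <= carriers <= cohort_size:
        raise ValueError("need 0 <= carriers <= cohort_size and cohort_size > 0")
    if not 0.0 <= background_frequency <= 1.0:
        raise ValueError("background_frequency must be a probability")
    p = background_frequency
    return sum(comb(cohort_size, i) * p**i * (1.0 - p)**(cohort_size - i)
               for i in range(carriers, cohort_size + 1))


def exceeds_background(carriers: int, cohort_size: int,
                       background_frequency: float,
                       alpha: float = 0.05) -> bool:
    """True when the observed frequency is above background and the excess
    is unlikely under the background rate (alpha is an assumed cutoff)."""
    observed = carriers / cohort_size
    if observed <= background_frequency:
        return False
    return enrichment_p_value(carriers, cohort_size, background_frequency) < alpha


if __name__ == "__main__":
    # Hypothetical cohort: 12 of 100 subjects carry the variant; background is 1%.
    print(exceeds_background(12, 100, 0.01))
```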
  • Non-limiting examples of databases include FANTOM, GTEx, GEO, Body Atlas, INSiGHT, OMIM (Online Mendelian Inheritance in Man, omim.org), cBioPortal (cbioportal.org), CIViC (Clinical Interpretations of Variants in Cancer, civic.genome.wustl.edu), DoCM (Database of Curated Mutations, docm.genome.wustl.edu), and ICGC Data Portal (dcc.icgc.org).
  • COSMIC (Catalogue of Somatic Mutations in Cancer)
  • Biomarkers may also be determined de novo by conducting experiments such as case-control or association studies (e.g., genome-wide association studies).
  • Biomarkers may be detected in the sequencing panel.
  • a biomarker may be one or more genetic variants associated with cancer.
  • Biomarkers can be selected from single nucleotide variants (SNVs), copy number variants (CNVs), insertions or deletions (e.g., indels), gene fusions and inversions.
  • Biomarkers may affect the level of a protein. Biomarkers may be in a promoter or enhancer, and may alter the transcription of a gene. The biomarkers may affect the transcription and/or translation efficacy of a gene. The biomarkers may affect the stability of a transcribed mRNA. The biomarker may result in a change to the amino acid sequence of a translated protein.
  • the biomarker may affect splicing, may change the amino acid coded by a particular codon, may result in a frameshift, or may result in a premature stop codon.
  • the biomarker may result in a conservative substitution of an amino acid.
  • One or more biomarkers may result in a conservative substitution of an amino acid.
  • One or more biomarkers may result in a nonconservative substitution of an amino acid.
  • One or more of the biomarkers may be a driver mutation.
  • a driver mutation is a mutation that gives a selective advantage to a tumor cell in its microenvironment, through either increasing its survival or reproduction. None of the biomarkers may be a driver mutation.
  • One or more of the biomarkers may be a passenger mutation.
  • a passenger mutation is a mutation that has no effect on the fitness of a tumor cell but may be associated with a clonal expansion because it occurs in the same genome with a driver mutation.
  • the frequency of a biomarker may be as low as 0.001%.
  • the frequency of a biomarker may be as low as 0.005%.
  • the frequency of a biomarker may be as low as 0.01%.
  • the frequency of a biomarker may be as low as 0.02%.
  • the frequency of a biomarker may be as low as 0.03%.
  • the frequency of a biomarker may be as low as 0.05%.
  • the frequency of a biomarker may be as low as 0.1%.
  • the frequency of a biomarker may be as low as 1%.
  • No single biomarker may be present in more than 50% of subjects having the cancer.
  • No single biomarker may be present in more than 40% of subjects having the cancer.
  • No single biomarker may be present in more than 30% of subjects having the cancer.
  • No single biomarker may be present in more than 20% of subjects having the cancer.
  • No single biomarker may be present in more than 10% of subjects having the cancer.
  • No single biomarker may be present in more than 5% of subjects having the cancer.
  • a single biomarker may be present in 0.001% to 50% of subjects having cancer.
  • a single biomarker may be present in 0.01% to 50% of subjects having cancer.
  • a single biomarker may be present in 0.01% to 30% of subjects having cancer.
  • a single biomarker may be present in 0.01% to 20% of subjects having cancer.
  • a single biomarker may be present in 0.01% to 10% of subjects having cancer.
  • a single biomarker may be present in 0.1% to 10% of subjects having cancer.
  • a single biomarker may be present in 0.1% to 5% of subjects having cancer.
  • Detection of a biomarker may indicate the presence of one or more cancers. Detection may indicate presence of a cancer selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer, non-small cell lung carcinoma (e.g., squamous cell carcinoma, or adenocarcinoma) or any other cancer. Detection may indicate the presence of any cancer selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer, non-small cell lung carcinoma (squamous cell or adenocarcinoma) or any other cancer.
  • Detection may indicate the presence of any of a plurality of cancers selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer and non-small cell lung carcinoma (squamous cell or adenocarcinoma), or any other cancer. Detection may indicate presence of one or more of any of the cancers mentioned in this application.
  • One or more cancers may exhibit a biomarker in at least one exon in the panel.
  • One or more cancers selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer, non-small cell lung carcinoma (squamous cell or adenocarcinoma), or any other cancer each exhibit a biomarker in at least one exon in the panel.
  • Each of at least 3 of the cancers may exhibit a biomarker in at least one exon in the panel.
  • Each of at least 4 of the cancers may exhibit a biomarker in at least one exon in the panel.
  • Each of at least 5 of the cancers may exhibit a biomarker in at least one exon in the panel.
  • Each of at least 8 of the cancers may exhibit a biomarker in at least one exon in the panel.
  • Each of at least 10 of the cancers may exhibit a biomarker in at least one exon in the panel.
  • All of the cancers may exhibit a biomarker in at least one exon in the panel.
  • a subject may exhibit a biomarker in at least one exon or gene in the panel. At least 85% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 90% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 92% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 95% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 96% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel.
  • At least 97% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 98% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 99% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 99.5% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel.
  • a subject may exhibit a biomarker in at least one region in the panel. At least 85% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 90% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 92% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 95% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 96% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 97% of subjects having a cancer may exhibit a biomarker in at least one region in the panel.
  • At least 98% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 99% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 99.5% of subjects having a cancer may exhibit a biomarker in at least one region in the panel.
  • Detection may be performed with a high sensitivity and/or a high specificity.
  • Sensitivity can refer to a measure of the proportion of positives that are correctly identified as such.
  • sensitivity refers to the percentage of all existing biomarkers that are detected.
  • sensitivity refers to the percentage of sick people who are correctly identified as having certain disease.
  • Specificity can refer to a measure of the proportion of negatives that are correctly identified as such.
  • specificity refers to the proportion of unaltered bases which are correctly identified.
  • specificity refers to the percentage of healthy people who are correctly identified as not having certain disease.
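The definitions of sensitivity and specificity above can be illustrated with a minimal sketch. The function names and the cohort counts below are hypothetical and chosen only for illustration; they do not come from the application.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of actual positives that are correctly identified."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of actual negatives that are correctly identified."""
    return true_neg / (true_neg + false_pos)


# Hypothetical cohort: 100 subjects with the disease, 100 without.
print(sensitivity(true_pos=95, false_neg=5))   # fraction of sick subjects detected
print(specificity(true_neg=90, false_pos=10))  # fraction of healthy subjects cleared
```

The same two functions apply whether the unit being counted is a subject (sick versus healthy) or a base position (altered versus unaltered), which matches the alternative readings given in the text.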
  • Detection may be performed with a sensitivity of at least 95%, 97%, 98%, 99%, 99.5%, or 99.9% and/or a specificity of at least 80%, 90%, 95%, 97%, 98% or 99%. Detection may be performed with a sensitivity of at least 90%, 95%, 97%, 98%, 99%, 99.5%, 99.6%, 99.98%, 99.9% or 99.95%.
  • Detection may be performed with a specificity of at least 90%, 95%, 97%, 98%, 99%, 99.5%, 99.6%, 99.98%, 99.9% or 99.95%. Detection may be performed with a specificity of at least 70% and a sensitivity of at least 70%, a specificity of at least 75% and a sensitivity of at least 75%, a specificity of at least 80% and a sensitivity of at least 80%, a specificity of at least 85% and a sensitivity of at least 85%, a specificity of at least 90% and a sensitivity of at least 90%, a specificity of at least 95% and a sensitivity of at least 95%, a specificity of at least 96% and a sensitivity of at least 96%, a specificity of at least 97% and a sensitivity of at least 97%, a specificity of at least 98% and a sensitivity of at least 98%, a specificity of at least 99% and a sensitivity of at least 99%,
  • the methods can detect a biomarker at a sensitivity of about 80% or greater. In some cases, the methods can detect a biomarker at a sensitivity of about 95% or greater. In some cases, the methods can detect a biomarker at a sensitivity of about 80% or greater, and a sensitivity of about 95% or greater.
  • Detection may be highly accurate. Accuracy may apply to the identification of biomarkers in cell free DNA, and/or to the diagnosis of cancer. Statistical tools, such as covariate analysis described above, may be used to increase and/or measure accuracy.
  • the methods can detect a biomarker at an accuracy of at least 80%, 90%, 95%, 97%, 98% or 99%, 99.5%, 99.6%, 99.98%, 99.9%, or 99.95%. In some cases, the methods can detect a biomarker at an accuracy of at least 95% or greater.
  • the cancer treatment includes, without limitation, imatinib, gefitinib, afatinib, dacomitinib, sunitinib, sorafenib, vandetanib, brivanib, cabozantinib, neratinib, tivantinib, bevacizumab, cixutumumab, dalotuzumab, figitumumab, rilotumumab, onartuzumab, ganitumab, ramucirumab, ridaforolimus, temsirolimus, everolimus, BMS-690514, BMS-754807, EMD 525797, GDC-0973, GDC-0941, MK-2206, AZD6244, GSK1120212, PX-866, XL821, IMC-A12, MM-121, PF-02341066, RG7160, and Sym004.
  • Antibodies suitable for use as anti-EGFR therapy include cetuximab (Trade Name: Erbitux) and panitumumab (Trade Name: Vectibix).
  • the cancer treatment includes EGFR tyrosine kinase inhibitors such as gefitinib (Trade Name: Iressa), erlotinib (Trade Name: Tarceva), lapatinib, canertinib, and cetuximab.
  • therapies may be used in combination, such as an anti-EGFR therapy and an anti-EGFR therapy.
  • Anti-EGFR therapy may be used in combination with any combination of chemotherapeutic agents or chemotherapeutic regimens, for example, FOLFOX (fluorouracil [5-FU]/leucovorin/oxaliplatin), FOLFIRI (5-FU/leucovorin/irinotecan), and the like.
  • a cancer treatment is administered to a subject.
  • the cancer treatment is administered in combination with another therapy, such as a non-anti-EGFR therapy with anti-EGFR therapy.
  • the region of DNA sequenced may comprise a panel of genes or genomic regions. Selection of a limited region for sequencing (e.g., a limited panel) can reduce the total sequencing needed (e.g., a total amount of nucleotides sequenced).
  • a sequencing panel can target a plurality of different genes or regions to detect a single cancer, a set of cancers, or all cancers.
  • a panel targeting a plurality of different genes or genomic regions is selected such that a determined proportion of subjects having a cancer exhibits a genetic variant or biomarker in one or more different genes or genomic regions in the panel.
  • the panel may be selected to limit a region for sequencing to a fixed number of base pairs.
  • the panel may be selected to sequence a desired amount of DNA.
  • the panel may be further selected to achieve a desired sequence read depth.
  • the panel may be selected to achieve a desired sequence read depth or sequence read coverage for an amount of sequenced base pairs.
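The relationship between panel size and read depth can be sketched with simple arithmetic: for a fixed amount of sequencing, a smaller panel concentrates the same bases on fewer positions. The function name, the `on_target_fraction` parameter, and the numbers below are illustrative assumptions and are not taken from the application.

```python
def mean_depth(total_sequenced_bases: float,
               panel_size_bp: int,
               on_target_fraction: float = 1.0) -> float:
    """Approximate mean read depth over a panel.

    Assumes sequenced bases are spread evenly over the panel and that
    `on_target_fraction` of them map to panel regions (both simplifications).
    """
    return total_sequenced_bases * on_target_fraction / panel_size_bp


# Hypothetical run producing 1e9 bases of sequence.
for size_kb in (50, 150):
    depth = mean_depth(1e9, size_kb * 1000)
    print(f"{size_kb} kb panel: ~{depth:,.0f}x mean depth")
```

Under these assumptions, tripling the panel size cuts the mean depth to one third, which is the trade-off the panel-selection statements above describe.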
  • the panel may be selected to achieve a theoretical sensitivity, a theoretical specificity and/or a theoretical accuracy for detecting one or more genetic variants in a sample.
  • Probes for detecting the panel of regions can include those for detecting hotspot regions as well as nucleosome-aware probes (e.g., KRAS codons 12 and 13) and may be designed to optimize capture based on analysis of cfDNA coverage and fragment size variation impacted by nucleosome binding patterns and GC sequence composition. Regions used herein can also include non-hotspot regions optimized based on nucleosome positions and GC models.
  • the panel can comprise a plurality of subpanels, including subpanels for identifying tissue of origin (e.g., use of published literature to define 50-100 baits representing genes with the most diverse transcription profile across tissues (not necessarily promoters)), whole genome scaffold (e.g., for identifying ultra-conservative genomic content and tiling sparsely across chromosomes with a handful of probes for copy number baselining purposes), and transcription start site (TSS)/CpG islands (e.g., for capturing differentially methylated regions (DMRs), for example in promoters of tumor suppressor genes (e.g., SEPT9/VIM in colorectal cancer)).
  • the one or more regions in the panel can comprise one or more loci from one or a plurality of genes.
  • the plurality of genes may be selected for sequencing and biomarker detection. Genes included in the region to be sequenced may be selected from genes known to be involved in cancer, or from genes not involved in cancer.
  • the plurality of genes in the panel may be oncogenes, tumor suppressors, growth factors, DNA repair genes, signaling genes, transcription factors, receptors or metabolic genes.
  • genes that may be in the panel include, but are not limited to: APC, AR, ARID1A, BRAF, BRCA1, BRCA2, CCND1, CCND2, CCNE1, CDK4, CDK6, CDKN2A, CDKN2B, EGFR, ERBB2, FGFR1, FGFR2, HRAS, KIT, KRAS, MET, MYC, NF1, NRAS, PDGFRA, PIK3CA, PTEN, RAF1, TP53, AKT1, ALK, ARAF, ATM, CDH1, CTNNB1, ESR1, EZH2, FBXW7, FGFR3, GATA3, GNA11, GNAQ, GNAS, HNF1A, IDH1, IDH2, JAK2, JAK3, MAP2K1, MAP2K2, MLH1, MPL, NFE2L2, NOTCH1, NPM1, NTRK1, PTPN11, RET, RHEB, RHOA, RIT1, ROS1, SMAD4, SMO,
  • the one or more regions in the panel can comprise one or more loci from one or a plurality of genes, including one or more of AKT1, ALK, APC, ATM, BRAF, CTNNB1, EGFR, ERBB2, ESR1, FGFR2, GATA3, GNAS, IDH1, IDH2, KIT, KRAS, MET, NRAS, PDGFRA, PIK3CA, PTEN, RB1, SMAD4, STK11, and TP53.
  • the one or more regions in a panel for colorectal cancer can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, four of, or five of TP53, APC, BRAF, KRAS, and NRAS.
  • the one or more regions in a panel for ovarian cancer can comprise one or more loci from one or a plurality of genes, including TP53.
  • the one or more regions in a panel for pancreatic cancer can comprise one or more loci from one or a plurality of genes, including one or both of TP53 and KRAS.
  • the one or more regions in a panel for lung adenocarcinoma can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, four of, five of, six of, seven of, or eight of TP53, BRAF, KRAS, EGFR, ERBB2, MET, STK11, and ALK.
  • the one or more regions in a panel for lung squamous cell carcinoma can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, four of, or five of TP53, BRAF,
  • the one or more regions in a panel for breast cancer can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, or four of TP53, GATA3, PIK3CA, and ESR1.
  • one or more regions in a panel can comprise one or more loci from a combination of any of the above genes, for example, to detect a combination of cancer types.
  • one or more regions in a panel can comprise one or more loci from each of the preceding genes, for example, in a pan-cancer panel.
  • the one or more regions in a panel for lung cancer can comprise one or more loci from a plurality of genes, including one of, two of, three of, four of, five of, six of, seven of, eight of, nine of, 10 of, 11 of, 12 of, 13 of, 14 of, 15 of, 16 of, 17 of, 18 of, 19 of, or 20 of EGFR, KRAS, TP53, CDKN2A, STK11, BRAF, PIK3CA, RB1, ERBB2, PTEN, NFE2L2, MET, CTNNB1, NRAS, MUC16, NF1, BAB, SMARCA4, ATM, NTRK3, and ERBB4.
  • Such a panel also may include, or have substituted for any or all of the above, any or all of an EGFR Exon 19 deletion, EGFR L858R, EGFR C797S, EGFR T790M, EGFR S645C, ARAF S214C and S214F, ERBB2 S418T, MET exon 14 skipping, SNVs and indels.
  • Many of these genes may be clinically actionable, such that an observed anomaly in MAF (e.g., significantly higher or lower than in normal control subjects) may be indicative of a clinical state relevant to lung cancer, such as diagnosis, prognosis, risk stratification, treatment selection, tumor resistance to treatment, tumor burden, etc.
  • a lung cancer targeted panel may comprise a relatively small number of these lung cancer associated genes.
  • the one or more regions in a panel for breast cancer can comprise one or more loci from a plurality of genes, including any one of, or any combination of, ACVRL1, AFF2, AGMO, AGTR2, AHNAK, AHNAK2, AKAP9, AKT1, AKT2, ALK, APC, ARID1A, ARID1B, ARID2, ARID5B, ASXL1, ASXL2, ATR, BAP1, BCAS3, BIRC6, BRAF, BRCA1, BRCA2, BRIP1, CACNA2D3, CASP8, CBFB, CCND3, CDH1, CDKN1B, CDKN2A, CHD1, CHEK2, CLK3, CLRN2, COL12A1, COL22A1, COL6A3, CTCF, CTNNA1, CTNNA3, DCAF4L2, DNAH11, DNAH2, DNAH5, DTWD2, EGFR, EP300, ERBB2, ERBB3, ERBB4, FAM20C, FANCA, FANCA, FANCA,
  • genes may be clinically actionable, such that an observed anomaly in MAF (e.g., significantly higher or lower than in normal control subjects) may be indicative of a clinical state relevant to breast cancer, such as diagnosis, prognosis, risk stratification, treatment selection, tumor resistance to treatment, tumor burden, etc.
  • a breast cancer targeted panel may comprise a relatively small number of these breast cancer associated genes.
  • the one or more regions in a panel for colorectal cancer can comprise one or more loci from a plurality of genes, including one of, two of, three of, four of, five of, or six of TP53, BRAF, KRAS, APC, TGFBR, and PIK3CA.
  • Many of these genes may be clinically actionable, such that an observed anomaly in MAF (e.g., significantly higher or lower than in normal control subjects) may be indicative of a clinical state relevant to colorectal cancer, such as diagnosis, prognosis, risk stratification, treatment selection, tumor resistance to treatment, tumor burden, etc.
  • Such a colorectal cancer targeted panel may comprise a relatively small number of these colorectal cancer associated genes.
  • the one or more regions in the panel comprise one or more loci from one or a plurality of genes for detecting residual cancer after surgery. This detection can be earlier than is possible for existing methods of cancer detection.
  • the one or more regions in the panel comprise one or more loci from one or a plurality of genes for detecting cancer in a high-risk patient population. For example, smokers have much higher rates of lung cancer than the general population. Moreover, smokers can develop other lung conditions that make cancer detection more difficult, such as the development of irregular nodules in the lungs.
  • the methods described herein detect cancer in high risk patients earlier than is possible for existing methods of cancer detection.
  • a region may be selected for inclusion in a sequencing panel based on a number of subjects with a cancer that have a biomarker in that gene or region.
  • a region may be selected for inclusion in a sequencing panel based on prevalence of subjects with a cancer and a biomarker present in that gene. Presence of a biomarker in a region may be indicative of a subject having cancer.
  • the panel may be selected using information from one or more databases.
  • the information regarding a cancer may be derived from cancer tumor biopsies or cfDNA assays.
  • a database may comprise information describing a population of sequenced tumor samples.
  • a database may comprise information about mRNA expression in tumor samples.
  • a database may comprise information about regulatory elements in tumor samples.
  • the information relating to the sequenced tumor samples may include the frequency of various genetic variants and describe the genes or regions in which the genetic variants occur.
  • the genetic variants may be biomarkers.
  • a non-limiting example of such a database is COSMIC.
  • COSMIC is a catalogue of somatic mutations found in various cancers. For a particular cancer, COSMIC ranks genes based on frequency of mutation.
  • a gene may be selected for inclusion in a panel by having a high frequency of mutation within a given gene. For instance, COSMIC indicates that 33% of a population of sequenced breast cancer samples have a mutation in TP53 and 22% of a population of sampled breast cancers have a mutation in KRAS. Other ranked genes, including APC, have mutations found only in about 4% of a population of sequenced breast cancer samples.
  • TP53 and KRAS may be included in a sequencing panel based on having relatively high frequency among sampled breast cancers (compared to APC, for example, which occurs at a frequency of about 4%).
  • COSMIC is provided as a non-limiting example; however, any database or set of information may be used that associates a cancer with a biomarker located in a gene or genetic region.
  • In COSMIC, of 1156 biliary tract cancer samples, 380 samples (33%) carried mutations in TP53.
  • TP53 may be selected for inclusion in the panel based on a relatively high frequency in a population of biliary tract cancer samples.
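Frequency-based gene selection as described above can be sketched as follows. The sample counts and percentages are the ones quoted in the text; the function names and the 10% cutoff are hypothetical choices made only to illustrate the idea, since the text does not specify a threshold.

```python
from typing import Dict, List


def mutation_frequency(mutated_samples: int, total_samples: int) -> float:
    """Fraction of sequenced samples carrying a mutation in a gene."""
    return mutated_samples / total_samples


def select_genes(frequencies: Dict[str, float], min_frequency: float) -> List[str]:
    """Return genes at or above the cutoff, most frequently mutated first."""
    kept = [gene for gene, freq in frequencies.items() if freq >= min_frequency]
    return sorted(kept, key=lambda gene: frequencies[gene], reverse=True)


# Biliary tract example from the text: 380 of 1156 samples mutated in TP53.
print(f"TP53 in biliary tract samples: {mutation_frequency(380, 1156):.0%}")

# Breast cancer frequencies as quoted in the text.
breast = {"TP53": 0.33, "KRAS": 0.22, "APC": 0.04}
print(select_genes(breast, min_frequency=0.10))  # assumed cutoff of 10%
```

With the assumed cutoff, TP53 and KRAS are retained and APC is excluded, mirroring the comparison made in the text.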
  • a gene or region may be selected for a panel where the frequency of a biomarker is significantly greater in sampled tumor tissue or circulating tumor DNA than found in a given background population.
  • a combination of regions may be selected for inclusion in a panel such that at least a majority of subjects having a cancer will have a biomarker present in at least one of the regions or genes in the panel.
  • the combination of regions may be selected based on data indicating that, for a particular cancer or set of cancers, a majority of subjects have one or more biomarkers in one or more of the selected regions.
  • a panel including regions A, B, C, and/or D may be selected based on data indicating that 90% of subjects with cancer 1 have a biomarker in regions A, B, C, and/or D of the panel.
  • biomarkers may be shown to occur independently in two or more regions in subjects having a cancer such that, combined, a biomarker in the two or more regions is present in a majority of a population of subjects having a cancer.
  • a panel including regions X, Y, and Z may be selected based on data indicating that 90% of subjects have a biomarker in one or more regions, and in 30% of such subjects a biomarker is detected only in region X, while biomarkers are detected only in regions Y and/or Z for the remainder of the subjects for whom a biomarker was detected.
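One way to choose a combination of regions that together cover a target proportion of subjects is a greedy coverage heuristic. This is a sketch of one possible approach and is not a method stated in the application; the function name and the ten-subject cohort are hypothetical, constructed so that 30% of subjects carry a biomarker only in region X and 90% are covered overall, as in the example above.

```python
from typing import Dict, List, Set, Tuple


def greedy_panel(subject_regions: Dict[str, Set[str]],
                 target_coverage: float) -> Tuple[List[str], float]:
    """Add the region covering the most still-uncovered subjects until the target is met."""
    total = len(subject_regions)
    if total == 0:
        return [], 0.0
    uncovered = set(subject_regions)
    candidates = set().union(*subject_regions.values())
    panel: List[str] = []
    while candidates and (total - len(uncovered)) / total < target_coverage:
        # Sort first so that ties are broken deterministically by region name.
        best = max(sorted(candidates),
                   key=lambda r: sum(r in subject_regions[s] for s in uncovered))
        gained = {s for s in uncovered if best in subject_regions[s]}
        if not gained:
            break  # no remaining region adds coverage
        panel.append(best)
        uncovered -= gained
        candidates.remove(best)
    return panel, (total - len(uncovered)) / total


# Hypothetical cohort: 3 subjects with X only, 4 with Y only, 2 with Z only, 1 with none.
cohort = {
    "s1": {"X"}, "s2": {"X"}, "s3": {"X"},
    "s4": {"Y"}, "s5": {"Y"}, "s6": {"Y"}, "s7": {"Y"},
    "s8": {"Z"}, "s9": {"Z"},
    "s10": set(),
}
print(greedy_panel(cohort, target_coverage=0.9))
```

A greedy heuristic does not guarantee the smallest possible panel, but it gives a simple, reproducible ordering of regions by the coverage each one adds.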
  • Biomarkers present in one or more regions previously shown to be associated with one or more cancers may be indicative of or predictive of a subject having cancer if a biomarker is detected in one or more of those regions 50% or more of the time.
  • Computational approaches such as models employing conditional probabilities of detecting cancer given a known cancer frequency for a set of biomarkers within one or more regions may be used to predict which regions, alone or in combination, may be predictive of cancer.
  • Other approaches for panel selection involve the use of databases describing information from studies employing comprehensive genomic profiling of tumors with large panels and/or whole genome sequencing (WGS, RNA-seq, ChIP-seq, bisulfite sequencing, ATAC-seq, and others). Information gleaned from literature may also describe pathways commonly affected and mutated in certain cancers. Panel selection may be further informed by the use of ontologies describing genetic information.
  • Genes included in the panel for sequencing can include the fully transcribed region, the promoter region, enhancer regions, regulatory elements, and/or downstream sequence. To further increase the likelihood of detecting tumor-indicating mutations, only exons may be included in the panel.
  • the panel can comprise all exons of a selected gene, or only one or more of the exons of a selected gene.
  • the panel may comprise exons from each of a plurality of different genes.
  • the panel may comprise at least one exon from each of the plurality of different genes.
  • a panel of exons from each of a plurality of different genes is selected such that a determined proportion of subjects having a cancer exhibit a genetic variant in at least one exon in the panel of exons.
  • At least one full exon from each different gene in a panel of genes may be sequenced.
  • the sequenced panel may comprise exons from a plurality of genes.
  • the panel may comprise exons from 2 to 100 different genes, from 2 to 70 genes, from 2 to 50 genes, from 2 to 30 genes, from 2 to 15 genes, or from 2 to 10 genes.
  • a selected panel may comprise a varying number of exons.
  • the panel may comprise from 2 to 3000 exons.
  • the panel may comprise from 2 to 1000 exons.
  • the panel may comprise from 2 to 500 exons.
  • the panel may comprise from 2 to 100 exons.
  • the panel may comprise from 2 to 50 exons.
  • the panel may comprise no more than 300 exons.
  • the panel may comprise no more than 200 exons.
  • the panel may comprise no more than 100 exons.
  • the panel may comprise no more than 50 exons.
  • the panel may comprise no more than 40 exons.
  • the panel may comprise no more than 30 exons.
  • the panel may comprise no more than 25 exons.
  • the panel may comprise no more than 20 exons.
  • the panel may comprise no more than 15 exons.
  • the panel may comprise no more than 10 exons.
  • the panel may comprise no more than 9 exons.
  • the panel may comprise no more than 8 exons.
  • the panel may comprise no more than 7 exons.
  • the panel may comprise one or more exons from a plurality of different genes.
  • the panel may comprise one or more exons from each of a proportion of the plurality of different genes.
  • the panel may comprise at least two exons from each of at least 25%, 50%, 75% or 90% of the different genes.
  • the panel may comprise at least three exons from each of at least 25%, 50%, 75% or 90% of the different genes.
  • the panel may comprise at least four exons from each of at least 25%, 50%, 75% or 90% of the different genes.
  • the sizes of the sequencing panel may vary.
  • a sequencing panel may be made larger or smaller (in terms of nucleotide size) depending on several factors including, for example, the total amount of nucleotides sequenced or a number of unique molecules sequenced for a particular region in the panel.
  • the sequencing panel can be sized 5 kb to 50 kb.
  • the sequencing panel can be 10 kb to 30 kb in size.
  • the sequencing panel can be 12 kb to 20 kb in size.
  • the sequencing panel can be 12 kb to 60 kb in size.
  • the sequencing panel can be at least 10 kb, 12 kb, 15 kb, 20 kb, 25 kb, 30 kb, 35 kb, 40 kb, 45 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, 110 kb, 120 kb, 130 kb, 140 kb, or 150 kb in size.
  • the sequencing panel may be less than 100 kb, 90 kb, 80 kb, 70 kb, 60 kb, or 50 kb in size.
  • the panel selected for sequencing can comprise at least 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 80, or 100 regions.
  • the regions in the panel are selected such that the size of the regions is relatively small.
  • the regions in the panel have a size of about 10 kb or less, about 8 kb or less, about 6 kb or less, about 5 kb or less, about 4 kb or less, about 3 kb or less, about 2.5 kb or less, about 2 kb or less, about 1.5 kb or less, or about 1 kb or less.
  • the regions in the panel have a size from about 0.5 kb to about 10 kb, from about 0.5 kb to about 6 kb, from about 1 kb to about 11 kb, from about 1 kb to about 15 kb, from about 1 kb to about 20 kb, from about 0.1 kb to about 10 kb, or from about 0.2 kb to about 1 kb.
  • the regions in the panel can have a size from about 0.1 kb to about 5 kb.
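The total size of a panel can be computed from its region coordinates. The sketch below assumes half-open, BED-style intervals (chromosome, start, end) and merges overlapping regions so that shared bases are counted once; the function name and the coordinates are hypothetical.

```python
from typing import Iterable, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open


def panel_size_bp(regions: Iterable[Region]) -> int:
    """Total number of distinct bases covered by the panel regions."""
    total = 0
    current_chrom, current_end = None, 0
    for chrom, start, end in sorted(regions):
        if chrom != current_chrom or start >= current_end:
            # Region does not overlap the previous one: count it in full.
            total += end - start
            current_chrom, current_end = chrom, end
        elif end > current_end:
            # Partial overlap: count only the newly covered bases.
            total += end - current_end
            current_end = end
    return total


# Hypothetical coordinates for illustration only.
regions = [("chr1", 1000, 3000), ("chr1", 2500, 4000), ("chr2", 0, 9000)]
size = panel_size_bp(regions)
print(f"{size / 1000:.1f} kb; within 5 kb to 50 kb: {5_000 <= size <= 50_000}")
```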
  • the panel selected herein can allow for deep sequencing that is sufficient to detect low-frequency genetic variants (e.g., in cell-free nucleic acid molecules obtained from a sample).
  • An amount of genetic variants in a sample may be referred to in terms of the minor allele frequency for a given genetic variant.
  • the minor allele frequency may refer to the frequency at which minor alleles (e.g., not the most common allele) occur in a given population of nucleic acids, such as a sample. Genetic variants at a low minor allele frequency may have a relatively low frequency of presence in a sample.
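As a minimal illustration of minor allele frequency in a sample, the sketch below divides the number of molecules supporting the minor allele by the total number of molecules observed at that position. The function name and the counts are hypothetical.

```python
def minor_allele_frequency(variant_molecules: int, total_molecules: int) -> float:
    """Fraction of molecules at a locus that carry the minor allele."""
    if total_molecules <= 0:
        raise ValueError("total_molecules must be positive")
    return variant_molecules / total_molecules


# Hypothetical locus: 5 variant-supporting molecules out of 10,000 observed.
maf = minor_allele_frequency(5, 10_000)
print(f"MAF = {maf:.4%}")
```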
  • the panel allows for detection of genetic variants at a minor allele frequency of at least 0.0001%, 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, or 0.5%.
  • the panel can allow for detection of genetic variants at a minor allele frequency of 0.001% or greater.
  • the panel can allow for detection of genetic variants at a minor allele frequency of 0.01% or greater.
  • the panel can allow for detection of a genetic variant present in a sample at a frequency of as low as 0.0001%, 0.001%, 0.005%, 0.01%, 0.025%, 0.05%, 0.075%, 0.1%, 0.25%, 0.5%, 0.75%, or 1.0%.
  • the panel can allow for detection of biomarkers present in a sample at a frequency of at least 0.0001%, 0.001%, 0.005%, 0.01%, 0.025%, 0.05%, 0.075%, 0.1%, 0.25%, 0.5%, 0.75%, or 1.0%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 1.0%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.75%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.5%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.25%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.1%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.075%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.05%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.025%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.01%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.005%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.001%.
  • the panel can allow for detection of biomarkers at a frequency in a sample as low as 0.0001%.
  • the panel can allow for detection of biomarkers in sequenced cfDNA at a frequency in a sample as low as 1.0% to 0.0001%.
  • the panel can allow for detection of biomarkers in sequenced cfDNA at a frequency in a sample as low as 0.01% to 0.0001%.
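Why low-frequency detection depends on sequencing depth can be sketched with a simple binomial model. This is an idealized assumption and not a model given in the application: each unique molecule at a locus is treated as independently carrying the variant with probability equal to the allele fraction, and sequencing error is ignored. The function and parameter names are hypothetical.

```python
from math import comb


def detection_probability(unique_molecules: int,
                          allele_fraction: float,
                          min_variant_molecules: int = 1) -> float:
    """P(at least `min_variant_molecules` variant molecules are sampled), binomial model."""
    p_too_few = sum(
        comb(unique_molecules, i)
        * allele_fraction ** i
        * (1 - allele_fraction) ** (unique_molecules - i)
        for i in range(min_variant_molecules)
    )
    return 1.0 - p_too_few


# Hypothetical depth of 10,000 unique molecules at a locus.
for percent in (1.0, 0.1, 0.01, 0.0001):
    p = detection_probability(10_000, percent / 100)
    print(f"variant at {percent}%: P(sampled at least once) = {p:.4f}")
```

Under this model the chance of sampling even one variant molecule falls sharply once the allele fraction drops well below one over the number of unique molecules, which is why the smallest frequencies listed above call for very deep sequencing.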
  • a genetic variant can be exhibited in a percentage of a population of subjects who have a disease (e.g., cancer). In some cases, at least 1%, 2%, 3%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of a population having the cancer exhibit one or more genetic variants in at least one of the regions in the panel. For example, at least 80% of a population having the cancer may exhibit one or more genetic variants in at least one of the regions in the panel.
  • the panel can comprise one or more regions from each of one or more genes. In some cases, the panel can comprise one or more regions from each of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or 80 genes. In some cases, the panel can comprise one or more regions from each of at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or 80 genes. In some cases, the panel can comprise one or more regions from each of from about 1 to about 80, from 1 to about 50, from about 3 to about 40, from 5 to about 30, or from 10 to about 20 different genes.
  • the regions in the panel can be selected so that one or more epigenetically modified regions are detected.
  • the one or more epigenetically modified regions can be acetylated, methylated, ubiquitylated, phosphorylated, sumoylated, ribosylated, and/or citrullinated.
  • the regions in the panel can be selected so that one or more methylated regions are detected.
  • the regions in the panel can be selected so that they comprise sequences differentially transcribed across one or more tissues.
  • the regions can comprise sequences transcribed in certain tissues at a higher level compared to other tissues.
  • the regions can comprise sequences transcribed in certain tissues but not in other tissues.
  • the regions in the panel can comprise coding and/or non-coding sequences.
  • the regions in the panel can comprise one or more sequences in exons, introns, promoters, 3’ untranslated regions, 5’ untranslated regions, regulatory elements, transcription start sites, and/or splice sites.
  • the regions in the panel can comprise other non-coding sequences, including pseudogenes, repeat sequences, transposons, viral elements, and telomeres.
  • the regions in the panel can comprise sequences in non-coding RNA, e.g., ribosomal RNA, transfer RNA, Piwi-interacting RNA, and microRNA.
  • the regions in the panel can be selected to detect (diagnose) a cancer with a desired level of sensitivity (e.g., through the detection of one or more genetic variants).
  • the regions in the panel can be selected to detect the cancer (e.g., through the detection of one or more genetic variants) with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • the regions in the panel can be selected to detect the cancer with a sensitivity of 100%.
  • the regions in the panel can be selected to detect (diagnose) a cancer with a desired level of specificity (e.g., through the detection of one or more genetic variants).
  • the regions in the panel can be selected to detect cancer (e.g., through the detection of one or more genetic variants) with a specificity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • the regions in the panel can be selected to detect the one or more genetic variant with a specificity of 100%.
  • the regions in the panel can be selected to detect (diagnose) a cancer with a desired positive predictive value.
  • Positive predictive value can be increased by increasing sensitivity (e.g., chance of an actual positive being detected) and/or specificity (e.g., chance of not mistaking an actual negative for a positive).
  • regions in the panel can be selected to detect the one or more genetic variant with a positive predictive value of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • the regions in the panel can be selected to detect the one or more genetic variant with a positive predictive value of 100%.
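The dependence of positive predictive value on sensitivity and specificity can be made concrete with Bayes' rule. Prevalence, the fraction of tested subjects who actually have the condition, is an additional input that the text does not specify; the value used below is a hypothetical assumption, as is the function name.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """P(condition present | positive test), by Bayes' rule."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical screening setting with 1% prevalence.
for spec in (0.95, 0.99, 0.999):
    ppv = positive_predictive_value(sensitivity=0.90, specificity=spec, prevalence=0.01)
    print(f"specificity {spec:.1%}: PPV = {ppv:.1%}")
```

At low prevalence, small gains in specificity move the positive predictive value far more than comparable gains in sensitivity, because false positives from the large unaffected group dominate the denominator.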
  • the regions in the panel can be selected to detect (diagnose) a cancer with a desired accuracy.
  • accuracy may refer to the ability of a test to discriminate between a disease condition (e.g., cancer) and health.
  • Accuracy can be quantified using measures such as sensitivity and specificity, predictive values, likelihood ratios, the area under the ROC curve, Youden’s index and/or diagnostic odds ratio.
  • Accuracy may be presented as a percentage, which refers to a ratio between the number of tests giving a correct result and the total number of tests performed.
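Several of the accuracy measures named above can be computed directly from the four cells of a confusion matrix. The sketch below uses the standard definitions of accuracy, Youden's index, the positive likelihood ratio, and the diagnostic odds ratio; the function name and the counts are hypothetical.

```python
from typing import Dict


def accuracy_measures(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard summary measures from true/false positive and negative counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        # Correct results divided by all tests performed.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Youden's index J = sensitivity + specificity - 1.
        "youden_index": sens + spec - 1.0,
        # How much more likely a positive result is in affected than unaffected subjects.
        "positive_likelihood_ratio": sens / (1.0 - spec),
        # Odds of a positive test in affected versus unaffected subjects.
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
    }


# Hypothetical counts; fp and fn must be nonzero for the ratio measures.
for name, value in accuracy_measures(tp=95, fp=10, tn=90, fn=5).items():
    print(f"{name}: {value:.3f}")
```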
  • the regions in the panel can be selected to detect cancer with an accuracy of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • the regions in the panel can be selected to detect cancer with an accuracy of 100%.
  • a panel may be selected such that when one or more regions or genes in the panel are removed, specificity is appreciably decreased. Removal of one region from the panel may result in a decrease in specificity of at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
  • a panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the specificity of the panel, e.g., does not increase the specificity by more than 1%, 2%, 5%, 10%, 15%, or 20%.
  • a panel may be of a size such that when one or more regions or genes in the panel are removed, this appreciably decreases sensitivity, e.g., sensitivity is decreased by at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
  • a panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the sensitivity of the panel, e.g., does not increase the sensitivity by more than 1%, 2%, 5%, 10%, 15%, or 20%.
  • a panel may be of a size such that when one or more regions or genes in the panel are removed, accuracy is appreciably decreased, e.g., accuracy is decreased by at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
  • a panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the accuracy of the panel, e.g., does not increase the accuracy by more than 1%, 2%, 5%, 10%, 15%, or 20%.
  • a panel may be of a size such that when one or more regions or genes in the panel are removed, positive predictive value is appreciably decreased, e.g., positive predictive value is decreased by at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
  • a panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the positive predictive value of the panel, e.g., does not increase the positive predictive value by more than 1%, 2%, 5%, 10%, 15%, or 20%.
  • a panel may be selected to be highly sensitive and detect low frequency genetic variants. For instance, a panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Regions in a panel may be selected to detect a biomarker present at a frequency of 1% or less in a sample with a sensitivity of 70% or greater.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.1% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.01% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.001% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to be highly specific and detect low frequency genetic variants. For instance, a panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at a specificity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Regions in a panel may be selected to detect a biomarker present at a frequency of 1% or less in a sample with a specificity of 70% or greater.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.1% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.01% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.001% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to be highly accurate and detect low frequency genetic variants.
  • a panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • Regions in a panel may be selected to detect a biomarker present at a frequency of 1% or less in a sample with an accuracy of 70% or greater.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.1% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.01% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to detect a biomarker at a frequency in a sample as low as 0.001% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • a panel may be selected to be highly predictive and detect low frequency genetic variants.
  • a panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may have a positive predictive value of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
  • the concentration of probes or baits used in the panel may be increased (2 to 6 ng/µL) to capture more nucleic acid molecules within a sample.
  • the concentration of probes or baits used in the panel may be at least 2 ng/µL, 3 ng/µL, 4 ng/µL, 5 ng/µL, 6 ng/µL, or greater.
  • the concentration of probes may be about 2 ng/µL to about 3 ng/µL, about 2 ng/µL to about 4 ng/µL, about 2 ng/µL to about 5 ng/µL, or about 2 ng/µL to about 6 ng/µL.
  • the concentration of probes or baits used in the panel may be 2 ng/µL or more to 6 ng/µL or less. In some instances, this may allow more molecules within a biological sample to be analyzed, thereby enabling lower-frequency alleles to be detected.
  • Genetic analysis includes detection of nucleotide sequence variants and copy number variations. Genetic variants can be determined by sequencing.
  • the sequencing method can be massively parallel sequencing, that is, simultaneously (or in rapid succession) sequencing any of at least 100,000, 1 million, 10 million, 100 million, or 1 billion polynucleotide molecules.
  • Sequencing methods may include, but are not limited to: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next-generation sequencing, Single Molecule Sequencing by Synthesis (SMSS)(Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert or Sanger sequencing, primer walking, sequencing using PacBio, SOLiD, Ion Torrent, or Nanopore platforms and any other sequencing methods known in the art.
  • Sequencing can be made more efficient by performing sequence capture, that is, the enrichment of a sample for target sequences of interest, e.g., sequences including the KRAS and/or EGFR genes or portions of them containing sequence variant biomarkers. Sequence capture can be performed using immobilized probes that hybridize to the targets of interest.
  • Cell free DNA can include small amounts of tumor DNA mixed with germline DNA. Sequencing methods that increase sensitivity and specificity of detecting tumor DNA, and, in particular, genetic sequence variants and copy number variation, can be useful in the methods of this invention. Such methods are described, for example, in WO 2014/039556.
  • One method includes high efficiency tagging of DNA molecules in the sample, e.g., tagging at least any of 50%, 75% or 90% of the polynucleotides in a sample. This increases the likelihood that a low-abundance target molecule in a sample will be tagged and subsequently sequenced, and significantly increases sensitivity of detection of target molecules.
  • Another method involves molecular tracking, which identifies sequence reads that have been redundantly generated from an original parent molecule, and assigns the most likely identity of a base at each locus or position in the parent molecule. This significantly increases specificity of detection by reducing noise generated by amplification and sequencing errors, which reduces frequency of false positives.
  • Methods of the present disclosure can be used to detect genetic variation in non-uniquely tagged initial starting genetic material (e.g., rare DNA) at a concentration that is less than 5%, 1%, 0.5%, 0.1%, 0.05%, or 0.01%, at a specificity of at least 99%, 99.9%, 99.99%, 99.999%, 99.9999%, or 99.99999%.
  • Sequence reads of tagged polynucleotides can be subsequently tracked to generate consensus sequences for polynucleotides with an error rate of no more than 2%, 1%, 0.1%, or 0.01%.
  • Copy number variation determination can involve determining a quantitative measure of polynucleotides in a sample mapping to a genetic locus, such as the EGFR gene or KRAS gene.
  • the quantitative measure can be a number. Once the total number of polynucleotides mapping to a locus is determined, this number can be used in standard methods of determining Copy Number Variation at the locus.
  • a quantitative measure can be normalized against a standard. In one method, a quantitative measure at a test locus can be standardized against a quantitative measure of polynucleotides mapping to a control locus in the genome, such as gene of known copy number. In another method, the quantitative measure can be compared against the amount of nucleic acid in the original sample.
  • the quantitative measure can be compared against an expected measure for diploidy.
  • the quantitative measure can be normalized against a measure from a control sample, and normalized measures at different loci can be compared.
  • quantifying involves quantifying parent or original molecules in a sample mapping to a locus, rather than number of sequence reads.
  • a copy number variation may be an amplification or a deletion or truncation of a gene.
  • An amplification may be 3, 4, 5, 6, 7, 8, 9, 10, or 10 or more copies of a gene.
  • a deletion or truncation may be 0 or 1 copies of a gene.
  • An example of a method for detecting copy number variation may include an array.
  • the array may comprise a plurality of capture probes.
  • the capture probes can be oligonucleotides that are bound to the surface of the array.
  • the capture probes may hybridize to at least one of the genes as set forth in Table 1.
  • the capture probes may bind to at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 genes as set forth in Table 1.
  • DNA derived from the subject may be labeled (e.g., with a fluorophore) prior to hybridization for detection.
  • a gene of interest may be amplified using primers that recognize the gene of interest.
  • the primers may hybridize to a gene upstream and/or downstream of a particular region of interest (e.g., upstream of a mutation site).
  • a detection probe may be hybridized to the amplification product. Detection probes may specifically hybridize to a wildtype sequence or to a mutated/variant sequence. Detection probes may be labeled with a detectable label (e.g., with a fluorophore). Detection of a wild-type or mutant sequence may be performed by detecting the detectable label (e.g., fluorescence imaging).
  • a gene of interest may be compared with a reference gene. Differences in copy number between the gene of interest and the reference gene may indicate amplification or deletion/truncation of a gene.
  • platforms suitable to perform the methods described herein include digital PCR platforms such as, e.g., the Fluidigm Digital Array.
  • a method for monitoring for relapse of a tumor in an individual treated for cancer can include providing a cell free DNA (cfDNA) sample, from liquid or tissue.
  • the reference samples can include information in a database, including real world evidence. Determining a methylation difference between the test genomic DNA and the reference genomic DNA, including at CpG sites, can provide a normalized methylation difference, which may include weighting the normalized methylation difference based on coverage at each of the CpG sites, thereby determining an aggregate coverage-weighted normalized methylation difference score.
  • sample testing can include one or more biological molecules such as genomic DNA, RNA, peptides, proteins, etc. in addition to epigenomic DNA.
  • a sample of tumor tissue is analyzed for one or more of genomic, epigenomic, or gene expression to establish a profile of the tumor tissue including testing of cfDNA, including ctDNA on a methylome panel.
  • testing for minimal residual disease is performed in accordance with the methods described herein after resection. Based on the original biopsy, recurrence outcomes generated from testing and real-world evidence are utilized as predictive of likelihood of recurrence.
  • a sample of tumor tissue is analyzed, as described to establish a profile of the tumor tissue, including testing of cfDNA, including ctDNA on a methylome panel.
  • therapeutic intervention is modified based on testing results and likelihood of recurrence.
  • a liquid-informed minimal residual disease (MRD) test includes testing of cfDNA, including ctDNA, on a methylome panel. After differentially methylated regions are identified, they can be utilized as prior information for defining post-operative recurrence. This would improve MRD detection without the logistics associated with tissue testing.
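
The sensitivity, specificity, positive predictive value, accuracy, and Youden's index named in the panel-selection items above follow standard definitions and can be computed from a table of test outcomes. The following is a minimal sketch only; the class name, function names, and the cohort counts are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of test results compared against known disease status."""
    true_pos: int   # cancer present, test positive
    false_pos: int  # cancer absent, test positive
    true_neg: int   # cancer absent, test negative
    false_neg: int  # cancer present, test negative


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of actual positives that the test detects."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of actual negatives that the test reports as negative."""
    return c.true_neg / (c.true_neg + c.false_pos)


def positive_predictive_value(c: ConfusionCounts) -> float:
    """Fraction of positive test results that are true positives."""
    return c.true_pos / (c.true_pos + c.false_pos)


def accuracy(c: ConfusionCounts) -> float:
    """Correct results divided by the total number of tests performed."""
    total = c.true_pos + c.false_pos + c.true_neg + c.false_neg
    return (c.true_pos + c.true_neg) / total


def youden_index(c: ConfusionCounts) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity(c) + specificity(c) - 1.0


# Hypothetical cohort: 100 individuals with cancer, 900 without.
counts = ConfusionCounts(true_pos=90, false_pos=9, true_neg=891, false_neg=10)

for name, fn in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("positive predictive value", positive_predictive_value),
    ("accuracy", accuracy),
    ("Youden's index", youden_index),
]:
    print(f"{name}: {fn(counts):.3f}")
```

Because positive predictive value depends on the number of false positives, it rises when specificity rises for a fixed cohort, which is consistent with the statement above that positive predictive value can be increased by increasing sensitivity and/or specificity.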

Abstract

Disclosed herein are methods, compositions, and devices for use in early detection of cancer. The methods include sequencing a panel of regions in cell-free nucleic acid molecules and detecting one or more biomarkers that are indicative of a cancer.

Description

METHODS FOR DETERMINING SURVEILLANCE AND THERAPY FOR DISEASES
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. provisional patent application no. 63/511,082 filed June 29, 2023, which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] Cancer is a major cause of disease worldwide. Each year, tens of millions of people are diagnosed with cancer around the world, and more than half of the patients eventually die from it. In many countries, cancer ranks as the second most common cause of death, following cardiovascular diseases. Early detection is associated with improved outcomes for many cancers.
[0003] To detect cancer, several screening tests are available. A physical exam and history survey general signs of health, including checking for signs of disease, such as lumps or other unusual physical symptoms. A history of a patient’s health habits and past illnesses and treatments will also be taken. Laboratory tests are another type of screening test and may include medical procedures to procure samples of tissue, blood, urine, or other substances in the body before conducting laboratory testing. Imaging procedures screen for cancer by generating visual representations of areas inside the body. Genetic tests detect certain deleterious gene mutations linked to some types of cancer. Genetic testing is particularly useful for a number of diagnostic methods.
SUMMARY OF INVENTION
[0004] Described herein is a method, including: determining a state of biological molecules obtained from a sample derived from a human subject, testing for minimal residual disease (MRD), determining the likelihood of recurrence based on the MRD test, and generating a schedule for one or more additional MRD tests based on the determination of the likelihood of recurrence. In other embodiments, the biological molecules are one or more of DNA, methylated DNA, RNA, methylated RNA, proteins, and peptides. In other embodiments, testing for MRD includes combining a plurality of nucleic acid molecules derived from a subject with a solution including an amount of methyl binding domain (MBD) proteins to produce a nucleic acid-MBD protein solution; and performing a plurality of washes of the nucleic acid-MBD protein solution with a salt solution to produce a number of nucleic acid fractions, individual nucleic acid fractions having a threshold number of methylated cytosines in regions of the plurality of nucleic acids having at least the threshold cytosine-guanine content. In other embodiments, a wash of the plurality of washes is performed with a solution having a concentration of sodium chloride (NaCl) and produces a nucleic acid fraction of the number of nucleic acid fractions having a range of binding strengths to MBD proteins.
In other embodiments, the method includes determining that a first nucleic acid fraction is associated with a first partition of a plurality of partitions of nucleic acids, the first partition corresponding to a first range of binding strengths to MBD proteins, attaching a first molecular barcode to nucleic acids of the first nucleic acid fraction, the first molecular barcode being included in a first set of molecular barcodes associated with the first partition, determining that a second nucleic acid fraction is associated with a second partition of the plurality of partitions of nucleic acids, the second partition corresponding to a second range of binding energies to MBD proteins different from the first range of binding strengths to MBD proteins, and attaching a second molecular barcode to nucleic acids of the second nucleic acid fraction, the second molecular barcode being included in a second set of molecular barcodes associated with the second partition. In other embodiments, the method includes combining at least a portion of the number of nucleic acid fractions with an amount of restriction enzyme that cleaves molecules with one or more unmethylated cytosines to produce at least a portion of the plurality of samples used to produce the sequencing reads, wherein the threshold amount of methylated cytosines corresponds to a minimum frequency of methylated cytosines within a region having at least the threshold cytosine-guanine content.
[0005] In other embodiments, the method includes combining at least a portion of the number of nucleic acid fractions with an amount of a restriction enzyme that cleaves molecules with one or more methylated cytosines to produce at least a portion of the plurality of samples used to produce the sequencing reads, wherein the threshold amount of unmethylated cytosines corresponds to a maximum frequency of methylated cytosines that are not cleaved within a region having at least the threshold cytosine-guanine content. In other embodiments, testing for MRD includes sequencing nucleic acid molecules derived from a sample obtained from a subject, analyzing sequence reads derived from the sequencing to identify one or more driver mutations in the nucleic acid molecules, and using information about the presence, absence, or amount of the one or more driver mutations in the nucleic acid molecules to identify a tumor in the subject. In other embodiments, the nucleic acid molecules are cell-free DNA. In other embodiments, the sample is at least one of blood, serum, plasma, or tissue. In other embodiments, the method includes determination of treatment for the subject. In other embodiments, a limit of detection for the model to determine tumor fraction of samples is no greater than 0.05%. In other embodiments, the one or more driver mutations include a somatic variant detected at a mutant allele frequency (MAF) of no more than 0.05%. In other embodiments, the one or more driver mutations include a fusion detected at a mutant allele frequency (MAF) of no more than 0.1%. In other embodiments, the method includes detecting mutation distributions for each of one or more driver mutations, wherein the mutation distribution for each of the one or more driver mutations is detected with a correlation of at least 0.99 to a mutation distribution of the driver mutation detected in a cohort of the subject by tissue genotyping.
In other embodiments, the method detects the tumor in the subject with a sensitivity of at least 85%, a specificity of at least 99%, and a diagnostic accuracy of at least 99%. In other embodiments, the method includes identifying circulating tumor DNA (ctDNA) and one or more driver mutations in the ctDNA. In other embodiments, the method includes obtaining, by a computing system having one or more hardware processors and memory, testing sequence data from a subject, the testing sequence data including testing sequencing reads derived from a sample of the subject, analyzing, by the computing system, the testing sequencing reads to determine a first quantitative measure derived from the testing sequencing reads to genomic regions of a reference genome, analyzing, by the computing system, the testing sequencing reads to determine a second quantitative measure derived from the testing sequencing reads to genomic regions of a reference genome, determining, by the computing system, a metric based on the first quantitative measure and the second quantitative measure, generating, by the computing system, an input vector that includes the metric, and determining, by the computing system, an indication of cancer status in the subject by providing the input vector to a model that implements one or more machine learning techniques to generate indications of cancer status in subjects, the model including weights for individual classification regions of a plurality of classification regions and at least a portion of the weights of the individual classification regions being different from one another.
In other embodiments, the individual testing sequencing reads include a nucleotide sequence corresponding to a fragment of a nucleic acid included in the sample and the individual testing sequencing reads correspond to molecules having a threshold amount of methylated cytosines included in regions of the nucleotide sequence having at least the threshold cytosine-guanine content, the first quantitative measure derived from the testing sequencing reads that correspond to individual classification regions of a plurality of classification regions, at least a portion of the individual classification regions of the plurality of classification regions corresponding to genomic regions of a reference genome that have the threshold amount of methylated cytosines in subjects in which cancer is detected and that have at least the threshold cytosine-guanine content, the second quantitative measure derived from the testing sequencing reads that correspond to individual control regions of a plurality of control regions, individual control regions of the plurality of control regions corresponding to additional genomic regions of the reference genome that have at least the threshold cytosine-guanine content and that have at least the threshold amount of methylated cytosines in subjects in which cancer is detected and in additional subjects in which cancer is not detected.
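The computation described above (a metric per classification region derived from a first and a second quantitative measure, an input vector of those metrics, and a model with per-region weights) might be sketched as follows. This is an illustrative sketch under stated assumptions: the disclosure does not specify the form of the metric or the model, so the ratio of classification-region read counts to total control-region read counts and the logistic scoring function are assumptions, and all counts, weights, and names are hypothetical.

```python
import math
from typing import List, Sequence


def region_metrics(
    classification_counts: Sequence[int],
    control_counts: Sequence[int],
) -> List[float]:
    """Build the input vector: one metric per classification region.

    Assumption: the metric is the read count for a classification region
    (first quantitative measure) divided by the total read count over the
    control regions (second quantitative measure).
    """
    control_total = sum(control_counts)
    if control_total == 0:
        raise ValueError("control regions have no reads; cannot normalize")
    return [count / control_total for count in classification_counts]


def cancer_score(
    input_vector: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Logistic score in (0, 1) using one weight per classification region."""
    if len(input_vector) != len(weights):
        raise ValueError("input vector and weights must have equal length")
    z = bias + sum(w * x for w, x in zip(weights, input_vector))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical read counts for three classification regions and three
# control regions from a single test sample.
classification_counts = [30, 5, 12]
control_counts = [100, 200, 100]

# Hypothetical trained parameters; the weights differ from one another.
weights = [20.0, 5.0, 10.0]
bias = -1.0

vector = region_metrics(classification_counts, control_counts)
score = cancer_score(vector, weights, bias)
print(vector, score)
```

Normalizing against control regions that are methylated in both cancer and non-cancer subjects is one way to make the metric less sensitive to differences in total input or capture efficiency between samples.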
In other embodiments, the method includes obtaining, by the computing system having one or more hardware processors and memory, training sequence data including training sequencing reads derived from a plurality of samples of a plurality of training subjects, individual training sequencing reads including a nucleotide sequence corresponding to a fragment of a nucleic acid included in a sample of the plurality of samples and individual training sequencing reads corresponding to molecules having a threshold amount of methylated cytosines included in regions of the nucleotide sequence having at least a threshold cytosine-guanine content, analyzing, by the computing system, the training sequencing reads to determine an additional first quantitative measure derived from the training sequencing reads that corresponds to individual classification regions of the plurality of classification regions, analyzing, by the computing system, the training sequencing reads to determine an additional second quantitative measure derived from the training sequencing reads that correspond to a plurality of control regions, determining, by the computing system, an additional metric for the individual classification regions of the plurality of classification regions based on the additional first quantitative measure for the individual classification regions and the additional second quantitative measure for the plurality of control regions, generating, by the computing device, training data that includes the additional metric for the individual classification regions of the plurality of classification regions for the training sequence reads from samples of the plurality of training subjects, implementing, by the computing system and using the training data, one or more machine learning algorithms to generate the model to determine the indications of cancer status in subjects based on amounts of methylated cytosines in at least a portion of the plurality of classification regions. 
In other embodiments, the one or more machine learning algorithms include one or more classification algorithms. In other embodiments, the one or more machine learning algorithms include one or more regression algorithms, and the indication corresponds to an estimate of tumor fraction of the sample. In other embodiments, the training sequencing reads comprise a first portion of the training sequence data and additional training sequencing reads comprise a second portion of the training sequence data, wherein the additional training sequencing reads are different from the training sequencing reads; and the method including analyzing, by the computing system, at least one of the first portion of the training sequence data or the second portion of the training sequence data to determine an individual frequency of a plurality of variants present in an individual sample of the plurality of samples, determining, by the computing system and for the individual sample, a variant of the plurality of variants having a maximum frequency that corresponds to the individual frequency having a greatest value among individual frequencies derived from an individual sample, and determining, by the computing system, individual measures of tumor fraction for an individual sample based on the greatest value of the individual frequencies derived from the individual sample. In other embodiments, the training data includes the individual measures of tumor fraction for the individual samples of the plurality of samples, and the model is generated based on the individual measures of tumor fraction for the individual samples of the plurality of samples. 
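The tumor-fraction labeling step described above (determine the frequency of each variant in a sample, select the variant with the maximum frequency, and use that value as the sample's measure of tumor fraction) can be sketched as follows. This is a minimal sketch; the variant labels and read counts are hypothetical, and using the maximum frequency directly as the measure of tumor fraction is an assumption about one simple way to apply the description.

```python
from typing import Dict, Tuple

# Maps a variant label to (reads supporting the variant, total reads at the locus).
VariantCounts = Dict[str, Tuple[int, int]]


def variant_frequencies(counts: VariantCounts) -> Dict[str, float]:
    """Frequency of each variant: supporting reads divided by total reads."""
    return {
        variant: supporting / total
        for variant, (supporting, total) in counts.items()
        if total > 0  # skip loci with no coverage
    }


def max_frequency_variant(frequencies: Dict[str, float]) -> Tuple[str, float]:
    """Return the variant with the greatest frequency and that frequency."""
    if not frequencies:
        raise ValueError("no variants with coverage in this sample")
    variant = max(frequencies, key=frequencies.get)
    return variant, frequencies[variant]


# Hypothetical per-sample variant counts.
samples: Dict[str, VariantCounts] = {
    "sample_A": {
        "variant_1": (12, 4000),
        "variant_2": (3, 3000),
        "variant_3": (20, 5000),
    },
    "sample_B": {
        "variant_1": (1, 5000),
        "variant_4": (50, 2500),
    },
}

# One measure of tumor fraction per sample, usable as a training label.
tumor_fraction = {
    sample_id: max_frequency_variant(variant_frequencies(counts))[1]
    for sample_id, counts in samples.items()
}
print(tumor_fraction)
```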
In other embodiments, the method includes generating, by a computing system including processing circuitry and memory, a data file including first tokens generated using a first hash function, individual first tokens corresponding to a respective individual of a group of individuals having data stored by a molecular data repository, sending, by the computing system, the data file to a health insurance claims data management system, obtaining, by the computing system and from the health insurance claims data management system, in response to the data file, health data corresponding to the group of individuals, generating, by the computing system, a number of identifiers using a second hash function that is different from the first hash function, each identifier corresponds to one or more tokens related to each individual of the group of individuals, obtaining, by the computing system and using the number of identifiers, second data from the molecular data repository for the group of individuals, determining, by the computing system, respective portions of the first data that correspond to respective portions of the second data for the group of individuals, and generating, by the computing system, an integrated data repository that stores the respective portions of the first data and the respective portions of the second data in relation to respective identifiers of the number of identifiers. 
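The token-based linkage described above can be sketched as follows. This is a simplified illustration only: SHA-256 as the first hash function and BLAKE2b as the second are assumptions (the disclosure only requires that the two hash functions differ), and the person keys, the claims response, and the repository contents are hypothetical in-memory stand-ins for the external systems.

```python
import hashlib
from typing import Dict


def first_token(person_key: str) -> str:
    """Token sent to the claims system (first hash function, assumed SHA-256)."""
    return hashlib.sha256(person_key.encode("utf-8")).hexdigest()


def identifier_from_token(token: str) -> str:
    """Internal identifier derived from a token (second hash, assumed BLAKE2b)."""
    return hashlib.blake2b(token.encode("utf-8"), digest_size=16).hexdigest()


def build_integrated_repository(
    claims_by_token: Dict[str, dict],
    molecular_by_identifier: Dict[str, dict],
) -> Dict[str, dict]:
    """Store matching health and molecular data under each identifier."""
    integrated: Dict[str, dict] = {}
    for token, health_data in claims_by_token.items():
        identifier = identifier_from_token(token)
        molecular = molecular_by_identifier.get(identifier)
        if molecular is not None:
            integrated[identifier] = {"health": health_data, "molecular": molecular}
    return integrated


# Hypothetical individuals with data in the molecular data repository.
person_keys = ["person-001", "person-002"]
tokens = [first_token(key) for key in person_keys]

# Hypothetical response from the claims system, keyed by the tokens it received.
claims_by_token = {
    tokens[0]: {"claims_codes": ["code-A"]},
    tokens[1]: {"claims_codes": ["code-B", "code-C"]},
}

# Hypothetical molecular data keyed by identifier; only one individual has data.
molecular_by_identifier = {
    identifier_from_token(tokens[0]): {"variants": ["variant_1"]},
}

integrated = build_integrated_repository(claims_by_token, molecular_by_identifier)
print(len(integrated))
```

Using a second, different hash for the stored identifiers means the tokens exchanged with the claims system are not themselves the keys of the integrated repository.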
In other embodiments, the method includes determining, by the computing system, a first set of data processing instructions that are executable in relation to first data stored by the integrated data repository, causing, by the computing system, the first set of data processing instructions to be executed to analyze first health insurance claims codes included in the first data to determine a first subset of the group of individuals in which a biological condition is present and generating, by the computing system, a first dataset indicating the subset of the group of individuals in which the biological condition is present. In other embodiments, the method includes determining, by the computing system, a second set of data processing instructions that are executable in relation to second data stored by the integrated data repository, causing, by the computing system, the second set of data processing instructions to be executed to analyze the second health insurance claims codes included in the second data to determine one or more treatments provided to a second subset of the group of individuals, and generating, by the computing system, a second dataset indicating the one or more treatments provided to the second subset of the group of individuals. 
In other embodiments, the method includes determining, by the computing system, a third subset of the group of individuals that includes a portion of the first subset of the group of individuals that overlaps with a portion of the second subset of the group of individuals, receiving, by the computing system, a request to perform an analysis of the first dataset and the second dataset in relation to the third subset of the group of individuals, and analyzing, by the computing system and in response to the request, the first dataset and the second dataset with respect to the third subset of the group of individuals to determine a measure of significance of a characteristic of the third subset of the group of individuals with respect to the biological condition.
[0006] In other embodiments, the method includes determining, by the computing system, one or more genomic mutations present in the third subset of the group of individuals, determining, by the computing system, a plurality of treatments provided to the third subset of the group of individuals, and determining, by the computing system, respective survival rates for the third subset of the group of individuals. In other embodiments, the measure of significance corresponds to survival rate with respect to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations. In other embodiments, the method includes determining, by the computing system and based on the measure of significance, an effectiveness of the treatment for the third subset of the group of individuals. In other embodiments, the method includes determining, by the computing system, individuals in the third subset of the group of individuals that have not received the treatment. In other embodiments, the method includes administering one or more therapeutically effective amounts of the treatment to the individuals in the third subset that have not received the treatment. In other embodiments, the integrated data repository is arranged according to a data repository schema that includes a plurality of data tables and a plurality of logical links between the plurality of data tables, individual logical links of the plurality of logical links indicating one or more rows of a data table of the plurality of data tables that correspond to one or more additional rows of an additional data table of the plurality of data tables.
[0007] In other embodiments, the plurality of data tables includes a first data table that stores genomics data of the group of individuals, a second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers, a third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table, a fourth data table that stores personal information of the group of individuals, a fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals, a sixth data table storing information corresponding to health insurance coverage information for the group of individuals, and a seventh data table that stores information related to pharmaceutical treatments obtained by the group of individuals. In other embodiments, the number of identifiers generated using the second hash function comprise intermediate identifiers; and the method includes applying, by the computing system, a salt function to the intermediate identifiers to generate a final set of identifiers. In other embodiments, the method includes obtaining, by the computing system, information from an additional data repository that includes electronic medical records of an additional group of individuals, determining, by the computing system, a subset of the additional group of individuals that corresponds to the group of individuals having data stored by the genomics data repository, and modifying, by the computing system, the integrated data repository to store at least a portion of the information of the medical records of the subset of the additional group of individuals in relation to the number of identifiers.
In other embodiments, the method includes performing, by the computing system, one or more optical character recognition operations with respect to the additional information, analyzing, by the computing system, the additional information obtained from the additional data repository to determine one or more portions of the additional information to remove to produce a corpus of information. In other embodiments, the method includes analyzing, by the computing system, the corpus of information to determine a portion of the subset of the additional group of individuals that correspond to one or more biomarkers, and generating, by the computing system, one or more data structures that store identifiers of the portion of the subset of the additional group of individuals and that store an indication that the portion of the subset of the additional group of individuals corresponds to the one or more biomarkers.
[0008] In other embodiments, the method includes storing, by the computing system, the one or more data structures in an intermediate data repository, and performing, by the computing system, one or more de-identification operations with respect to the identifiers of the portion of the subset of the additional group of individuals before modifying the integrated data repository to store at least a portion of the additional information of the medical records of the portion of the subset of the additional group of individuals in relation to the number of identifiers. In other embodiments, the molecular data repository stores at least one or more of genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, or proteomic information. In other embodiments, determining the likelihood of recurrence includes an MRD test, real world evidence (RWE), or both.
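The hash-then-salt identifier scheme and the de-identification operations mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the disclosure does not name specific hash or salt functions, so SHA-256 and a keyed HMAC are used here purely for illustration, and the function names and sample identifiers are hypothetical.

```python
import hashlib
import hmac


def intermediate_identifier(raw_id: str) -> str:
    """Hash a raw identifier (assumption: SHA-256 stands in for the hash function)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


def final_identifier(intermediate: str, salt: bytes) -> str:
    """Apply a secret salt to an intermediate identifier (assumption: keyed HMAC)."""
    return hmac.new(salt, intermediate.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(raw_ids, salt: bytes) -> dict:
    """Map each raw identifier to its final de-identified identifier."""
    return {
        raw_id: final_identifier(intermediate_identifier(raw_id), salt)
        for raw_id in raw_ids
    }


# Hypothetical inputs; a real salt would be a managed secret, not a literal.
salt = b"example-secret-salt"
mapping = deidentify(["patient-001", "patient-002"], salt)
```

Because the mapping is deterministic for a fixed salt, records for the same individual arriving from different sources resolve to the same final identifier, which is what allows them to be joined without exposing the raw identifier.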
[0009] Described herein is a method, including: determining a state of biological molecules obtained from a sample derived from a human subject, testing for minimal residual disease (MRD), determining the likelihood of recurrence based on the MRD test, and recommending and/or administering treatment. In various embodiments, the methods described herein determine an assessment including comprehensive evaluation, diagnostic testing, molecular and genetic profiling and/or risk assessment. In various embodiments, the methods described herein determine a treatment plan including patient consultation, treatment strategy and/or tailored treatment. In various embodiments, the methods described herein determine a treatment including pre-treatment and/or administration of treatment execution. In various embodiments, the methods described herein determine monitoring and/or adjustment, including one or more follow-ups and/or response assessment. In various embodiments, the methods described herein determine long-term management and/or survivorship, which can include post-treatment surveillance and/or recurrence management.
[0010] Described herein is a system configured to perform the method of any of the preceding claims.
[0011] Further described herein is a computer-readable medium including the method of any of the preceding claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings (also “fig.” and “FIG.” herein), of which:
[0013] Figure 1 illustrates an example architecture to generate an integrated data repository that includes multiple types of healthcare data, according to one or more implementations.
[0014] Figure 2 illustrates an example framework corresponding to an arrangement of data tables in an integrated data repository, according to one or more implementations.
[0015] Figure 3 illustrates an architecture to generate one or more datasets from information retrieved from a data repository that integrates health related data from a number of sources, according to one or more implementations.
[0016] Figure 4 illustrates an architecture to generate an integrated data repository that includes de-identified health insurance claims data and de-identified genomics data, according to one or more implementations.
[0017] Figure 5 illustrates a framework to generate a dataset, by a data pipeline system, based on data stored by an integrated data repository, according to one or more implementations.
[0018] Figure 6 is a schematic diagram of an architecture to incorporate medical records data into an integrated data repository.
[0019] Figure 7 is a data flow diagram of an example process to generate an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations.
[0020] Figure 8 is a data flow diagram of an example process to generate a number of datasets used to analyze information stored by an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations.
[0021] Figure 9 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, according to one or more implementations.
[0022] Figure 10 illustrates a diagrammatic representation for tailoring the aggressiveness of patient surveillance based on a likelihood of tumor recurrence obtained from MRD testing outcomes using testing data and real world evidence.
[0023] Figure 11 illustrates a diagrammatic representation for treatment planning.
[0024] Figure 12 illustrates a diagrammatic representation for treatment implementation.
[0025] Figure 13 illustrates a diagrammatic representation for monitoring and adjustment.
[0026] Figure 14 illustrates a diagrammatic representation for long term management and planning.
DETAILED DESCRIPTION
[0027] While various embodiments of the disclosure have been shown and described herein, those skilled in the art will understand that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed.
[0028] The term “about” and its grammatical equivalents in relation to a reference numerical value can include a range of values up to plus or minus 10% from that value. For example, the amount “about 10” can include amounts from 9 to 11. The term “about” in relation to a reference numerical value can include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value.
[0029] The term “at least” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and greater than that value. For example, the amount “at least 10” can include the value 10 and any numerical value above 10, such as 11, 100, and 1,000.
[0030] The term “at most” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and less than that value. For example, the amount “at most 10” can include the value 10 and any numerical value under 10, such as 9, 8, 5, 1, 0.5, and 0.1.
[0031] As used herein the singular forms “a”, “an”, and “the” can include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a cell” can include a plurality of such cells and reference to “the culture” can include reference to one or more cultures and equivalents thereof known to those skilled in the art, and so forth. All technical and scientific terms used herein can have the same meaning as commonly understood to one of ordinary skill in the art to which this disclosure belongs unless clearly indicated otherwise.
[0032] Current approaches are to omit testing both genomic and epigenomic attributes of the patient sample or to perform multiple tests separately. Omitting genomic or epigenomic information can result in prescription of cancer therapies that could be known to be ineffective or withholding cancer therapies that could be known to be effective, had both genomic and epigenomic information been available. Cancer can be indicated by epigenetic variations, such as methylation. Examples of methylation changes in cancer include local gains of DNA methylation in the CpG islands at the transcription start site (TSS) of genes involved in normal growth control, DNA repair, cell cycle regulation, and/or cell differentiation. This hypermethylation can be associated with an aberrant loss of transcriptional capacity of involved genes and occurs at least as frequently as point mutations and deletions as a cause of altered gene expression. DNA methylation profiling can be used to detect regions with different extents of methylation (“differentially methylated regions” or “DMRs”) of the genome that are altered during development or that are perturbed by disease, for example, cancer or any cancer-associated disease. The genome of cancer cells harbors an imbalance in the above DNA methylation patterns, and therefore in functional packaging of the DNA. The abnormalities of chromatin organization are therefore coupled with methylation changes and may contribute to enhanced cancer profiling when analyzed jointly. Combining MBD-partitioning with fragmentomic data, such as fragment mapped start and stop positions (correlated with nucleosome positions), fragment length and associated nucleosome occupancy, can be used for chromatin structure analysis in hypermethylation studies with the aim to improve biomarker detection rate.
[0033] Methylation profiling can involve determining methylation patterns across different regions of the genome. For example, after partitioning molecules based on extent of methylation (e.g., relative number of methylated sites per molecule) and sequencing, the sequences of molecules in the different partitions can be mapped to a reference genome. This can show regions of the genome that, compared with other regions, are more highly methylated or are less highly methylated. In this way, genomic regions, in contrast to individual molecules, may differ in their extent of methylation.
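The region-level comparison described above can be sketched as follows. This is a toy illustration that assumes each sequenced molecule has already been mapped to a reference position and labeled with the partition it came from; the partition labels ("hyper", "hypo"), the fixed 1,000 bp bin size, and the sample reads are hypothetical choices, not parameters taken from the disclosure.

```python
from collections import Counter


def hyper_fraction_by_region(reads, bin_size=1000):
    """For each genomic bin, return the fraction of mapped molecules that came
    from the highly methylated ("hyper") partition."""
    hyper = Counter()
    total = Counter()
    for chrom, start, partition in reads:
        region = (chrom, start // bin_size * bin_size)  # bin start coordinate
        total[region] += 1
        if partition == "hyper":
            hyper[region] += 1
    return {region: hyper[region] / total[region] for region in total}


# Toy mapped molecules: (chromosome, mapped start position, partition label).
reads = [
    ("chr1", 1050, "hyper"),
    ("chr1", 1200, "hyper"),
    ("chr1", 1900, "hypo"),
    ("chr1", 5100, "hypo"),
    ("chr1", 5300, "hypo"),
]

fractions = hyper_fraction_by_region(reads)
```

Bins with a high fraction are candidates for more highly methylated regions relative to bins with a low fraction; a real analysis would also need to account for coverage, partition efficiency, and control samples.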
[0034] A characteristic of nucleic acid molecules may be a modification, which may include various chemical or protein modifications (i.e. epigenetic modifications). Non-limiting examples of chemical modification may include, but are not limited to, covalent DNA modifications, including DNA methylation. In some embodiments, DNA methylation includes addition of a methyl group to a cytosine at a CpG site (a cytosine followed by a guanine in a nucleic acid sequence). In some embodiments, DNA methylation includes addition of a methyl group to adenine, such as in N6-methyladenine. In some embodiments, DNA methylation is 5-methylation (modification of the 5th carbon of the 6 carbon ring of cytosine). In some embodiments, 5-methylation includes addition of a methyl group to the 5C position of the cytosine to create 5-methylcytosine (m5c). In some embodiments, methylation includes a derivative of m5c. Derivatives of m5c include, but are not limited to, 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC), and 5-carboxylcytosine (5-caC). In some embodiments, DNA methylation is 3C methylation (modification of the 3rd carbon of the 6 carbon ring of cytosine). In some embodiments, 3C methylation includes addition of a methyl group to the 3C position of the cytosine to generate 3-methylcytosine (3mC). Other examples include N6-methyladenine or glycosylation. DNA methylation includes addition of methyl groups to DNA (e.g. CpG) and can change the expression of the methylated DNA region. Methylation can also occur at non-CpG sites, for example, methylation can occur at a CpA, CpT, or CpC site. DNA methylation can change the activity of the methylated DNA region. For example, when DNA in a promoter region is methylated, transcription of the gene may be repressed. DNA methylation is critical for normal development and abnormality in methylation may disrupt epigenetic regulation. The disruption, e.g., repression, in epigenetic regulation may cause diseases, such as cancer.
Promoter methylation in DNA may be indicative of cancer.
[0035] A CpG dyad is the dinucleotide CpG (cytosine-phosphate-guanine, i.e. a cytosine followed by a guanine in a 5’ - 3’ direction of the nucleic acid sequence) on the sense strand and its complementary CpG on the antisense strand of a double-stranded DNA molecule. CpG dyads can be either fully methylated or hemi-methylated (methylated on one strand only).
[0036] The CpG dinucleotide is underrepresented in the normal human genome, with the majority of CpG dinucleotide sequences being transcriptionally inert (e.g. DNA heterochromatic regions in pericentromeric parts of the chromosome and in repeat elements) and methylated. However, many CpG islands are protected from such methylation especially around transcription start sites (TSS).
[0037] Protein modifications include binding to components of chromatin, particularly histones including modified forms thereof, and binding to other proteins, such as proteins involved in replication or transcription. The disclosure provides methods of processing and analyzing nucleic acids with different extents of modification, such that the nature of their original modification is correlated with a nucleic acid tag and can be decoded by sequencing the tag when nucleic acids are analyzed. Genetic variation of sample nucleic acid modifications can then be associated with the extent of modification (epigenetic variation) of that nucleic acid in the original sample, include single stranded (e.g., ssDNA or RNA) or double stranded molecules (e.g., dsDNA).
[0038] The loss of DNA can reduce the presence of one or more types of DNA such that the presence of the one or more types of DNA such as cfDNA, is difficult to detect. In one or more additional scenarios, existing methods to measure DNA methylation, such as enrichment or depletion methods, can have a relatively high level of resolution, such as about 100 base pairs (bp) to about 200 bp that can make accurately determining an amount of methylation of DNA difficult. The accuracy with which DNA methylation is determined can impact the accuracy of estimates of tumor fraction for samples. Since tumor fraction can be used to determine whether a sample is derived from a subject in which a tumor is present or not, the accuracy of determinations of tumor fraction estimates can impact diagnosis and/or treatment decisions for individuals.
[0039] In addition to these detection schemes, more data is needed to understand the behavior of tumors and performance of treatments and guidelines outside the highly selective confines of randomized controlled trials, often designed and conducted by entities with a commercial interest in their success. Real-world evidence (RWE), specifically the use of databases featuring integrated clinical and molecular data, plays an increasingly important role in precision oncology research. However, most of these databases feature genomic information from tumors limited to a single time point, generally at diagnosis, due in part to the practical challenges of genomic profiling of serial tumor specimens in real-world clinical practice. Genomic data for tumors is often limited to those naive to systemic treatment, despite evidence that treatments can significantly alter the tumor genomic landscape and lead to drug resistance. Combining data from a liquid biopsy assay with rich clinical information can overcome these challenges and help improve understanding of tumor evolution and the emergence of biomarkers that confer resistance to guide the development of novel therapeutics addressing areas of unmet need.
[0040] The analysis of healthcare data using existing systems and techniques is typically performed with respect to medical records generated by healthcare providers. As used herein, a healthcare provider may refer to an entity, individual, or group of individuals involved in providing care to individuals in relation to at least one of the treatment or prevention of one or more biological conditions. In addition, as used herein, a biological condition can refer to an abnormality of function and/or structure in an individual to such a degree as to produce or threaten to produce a detectable feature of the abnormality.
A biological condition can be characterized by external and/or internal characteristics, signs, and/or symptoms that indicate a deviation from a biological norm in one or more populations. In various examples, a biological condition can include one or more molecular phenotypes. For example, a biological condition may correspond to genetic or epigenetic lesions. In one or more additional examples, a biological condition can include at least one of one or more diseases, one or more disorders, one or more injuries, one or more syndromes, one or more disabilities, one or more infections, one or more isolated symptoms, or other atypical variations of biological structure and/or function of individuals. Additionally, a treatment, as used herein, can refer to a substance, procedure, routine, device, and/or other intervention that can be administered or performed with the intent of treating one or more effects of a biological condition in an individual. In one or more examples, a treatment may include a substance that is metabolized by the individual. The substance may include a composition of matter, such as a pharmaceutical composition. The substance may be delivered to the individual via a number of methods, such as ingestion, injection, absorption, or inhalation. A treatment may also include physical interventions, such as one or more surgeries. In at least some examples, the treatment can include a therapeutically meaningful intervention.
[0041] The healthcare data typically analyzed by existing systems includes unstructured data. Unstructured data can include data that is not organized according to a pre-defined or standardized format. For example, unstructured data may include notes made by a healthcare provider that are composed of free text.
That is, the manner in which the notes are captured does not include pre-defined inputs that are selectable by the healthcare provider, such as via a dropdown menu or via a list. Rather, the notes include text entered by a healthcare provider that may include sentences, sentence fragments, words, letters, symbols, abbreviations, one or more combinations thereof, and so forth. In some cases, unstructured data may be partially structured. For example, a provider could select an insurance billing code from a predefined list of insurance billing codes, and add unstructured notes to data associated with that billing code.
[0042] Existing systems typically devote a large amount of computing resources to analyzing unstructured data in order to extract information that may be relevant to analyses being performed by the existing systems. In some cases, existing systems may analyze unstructured data and transform the unstructured data to a structured format in order to facilitate the analysis of the previously unstructured data. The analysis of unstructured data by existing systems can be inefficient as well as inaccurate. In scenarios where the unstructured data is obtained from healthcare data, the importance of accurately analyzing the information is high because the analysis may be related to at least one of the treatment or diagnosis of a number of individuals with respect to one or more biological conditions. Thus, inaccurate analyses of healthcare data may result in suboptimal treatment of individuals.
[0043] The implementations of techniques, architectures, frameworks, systems, processes, and computer-readable instructions described herein are directed to analyzing health insurance claims data to derive information about at least one of the health or treatment of individuals. In contrast to existing systems, health insurance claims data is structured according to one or more formats and stored by a number of data tables. The data tables may include codes or other alphanumeric information indicating treatments received by individuals, dates of treatments, dosage information, diagnoses of individuals with respect to one or more biological conditions, information related to visits to healthcare providers, dates of visits to healthcare providers, billing information, and the like. The implementations described herein may be used to accurately analyze health insurance claims data for hundreds, up to thousands, up to tens of thousands of individuals or more in which one or more biological conditions are present. In various examples, tens of thousands, hundreds of thousands, up to millions of rows and/or columns of health insurance claims data may be analyzed to determine health-related information for individuals in which one or more biological conditions are present.
[0044] In various examples, the implementations described herein can integrate molecular data with health insurance claims data. The molecular data may include information derived from tissue samples extracted from a number of individuals. The molecular data may also include information derived from blood samples extracted from a number of individuals. In one or more illustrative examples, the molecular data may include genomics data. Further, in one or more examples, the health insurance claims data may be integrated with germline genetic information for a number of individuals.
[0045] An integrated data repository may be created that combines the health insurance claims data for individuals with the molecular data of the individuals. In one or more examples, an identifier may be generated for an individual that is associated with both the health insurance claims data of the individual and the molecular data of the individual. Both the molecular data and the health insurance claims data stored by the integrated data repository may be accessible using a single identifier of the individual. In one or more illustrative examples, the identifier for an individual may include an encrypted security key. In various examples, the integrated data repository may include a number of data tables corresponding to different aspects of the data stored within the data repository. For example, a first data table may be generated that includes summary data of individuals included in the integrated data repository, such as personal information, and a second data table may be generated that includes data corresponding to visits to healthcare providers. Additionally, a third data table may be generated indicating medical procedures provided to individuals and a fourth data table may be generated indicating information related to prescriptions obtained by individuals. Further, a fifth data table may be generated that includes multiomics profiling of individuals. Multiomics profiles may include at least one of genomic profiles, transcriptomic profiles, epigenetic profiles, or proteomic profiles.
[0046] The data tables included in the integrated data repository may be linked via logical links. In this way, a query to retrieve information from one data table may cause information from one or more additional data tables to be retrieved. Information stored by the linked data tables may be accessed to generate a number of different datasets that may be used to analyze the information stored by the integrated data repository.
For example, the information stored by the integrated data repository may be analyzed by one or more algorithms to generate datasets that are organized according to one or more schemas. The datasets may indicate treatment received by an individual over a period of time with respect to a biological condition. The datasets may also indicate cohorts of individuals included in the integrated data repository having a number of common characteristics. In various examples, the datasets may consolidate and arrange information from a number of different data sources, including the integrated data repository. The datasets may be analyzed with respect to a number of queries to indicate information that may be of interest to at least one of healthcare providers, patients, or providers of treatments of biological conditions. For example, one or more datasets may be analyzed to more accurately determine a survival rate of individuals in which a biological condition is present and having a specified genomic profile in response to receiving a specified treatment.
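As a rough sketch of how linked data tables keyed by a single de-identified identifier might be queried together, the following uses an in-memory SQLite database. The table names, column names, and all values (id_a, GENE1, DX1, and so on) are hypothetical placeholders chosen for illustration; the disclosure does not specify a database engine or this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three illustrative tables linked through a shared person_id column.
conn.executescript(
    """
    CREATE TABLE individuals (
        person_id  TEXT PRIMARY KEY,
        birth_year INTEGER
    );
    CREATE TABLE visits (
        visit_id       INTEGER PRIMARY KEY,
        person_id      TEXT REFERENCES individuals(person_id),
        visit_date     TEXT,
        diagnosis_code TEXT
    );
    CREATE TABLE genomics (
        person_id TEXT REFERENCES individuals(person_id),
        gene      TEXT,
        variant   TEXT
    );
    """
)

conn.executemany(
    "INSERT INTO individuals VALUES (?, ?)",
    [("id_a", 1960), ("id_b", 1975)],
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, "id_a", "2024-01-05", "DX1"),
        (2, "id_b", "2024-02-10", "DX1"),
        (3, "id_a", "2024-03-01", "DX2"),
    ],
)
conn.executemany(
    "INSERT INTO genomics VALUES (?, ?, ?)",
    [("id_a", "GENE1", "V1"), ("id_b", "GENE2", "V2")],
)

# One query follows the logical link from genomics rows to visit rows.
rows = conn.execute(
    """
    SELECT v.person_id, v.visit_date, g.gene
    FROM visits AS v
    JOIN genomics AS g ON g.person_id = v.person_id
    WHERE g.gene = 'GENE1'
    ORDER BY v.visit_date
    """
).fetchall()
```

Here `rows` holds the visits of individuals carrying a variant in the placeholder gene, which is the kind of cohort-level dataset the paragraph above describes deriving from the linked tables.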
[0047] The implementations described herein may provide a platform to integrate health insurance claims data and molecular data for individuals that is not found in existing systems that typically rely on electronic medical records that include an amount of unstructured data. By generating and analyzing structured health insurance claims data that has been integrated with molecular data, the implementations described herein may provide more accurate characterizations of the integrated data in relation to existing systems that rely on relatively inaccurate, unstructured electronic medical records data. Additionally, implementations described herein generate analytics ready datasets that enable the analysis of health information about individuals in a confidential and anonymized manner.
Samples
[0048] A sample can be any biological sample isolated from a subject. A sample can be a bodily sample. Samples can include body tissues, such as known or suspected solid tumors, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells or leucocytes, endothelial cells, tissue biopsies, cerebrospinal fluid, synovial fluid, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, pleural effusions, saliva, mucous, sputum, semen, sweat, and urine. Samples are preferably body fluids, particularly blood and fractions thereof, and urine. A sample can be in the form originally isolated from a subject or can have been subjected to further processing to remove or add components, such as cells, or enrich for one component relative to another. Thus, a preferred body fluid for analysis is plasma or serum containing cell-free nucleic acids. A sample can be isolated or obtained from a subject and transported to a site of sample analysis. The sample may be preserved and shipped at a desirable temperature, e.g., room temperature, 4°C, -20°C, and/or -80°C. A sample can be isolated or obtained from a subject at the site of the sample analysis. The subject can be a human, a mammal, an animal, a companion animal, a service animal, or a pet. The subject may have a cancer. The subject may not have cancer or a detectable cancer symptom. The subject may have been treated with one or more cancer therapies, e.g., any one or more of chemotherapies, antibodies, vaccines or biologics. The subject may be in remission. The subject may or may not be diagnosed as being susceptible to cancer or any cancer-associated genetic mutations/disorders.
[0049] The volume of plasma can depend on the desired read depth for sequenced regions. Exemplary volumes are 0.4-40 mL, 5-20 mL, and 10-20 mL. For example, the volume can be 0.5 mL, 1 mL, 5 mL, 10 mL, 20 mL, 30 mL, or 40 mL.
A volume of sampled plasma may be 5 to 20 mL.
[0050] A sample can comprise various amounts of nucleic acid that contains genome equivalents. For example, a sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cfDNA, about 200 billion (2×10^11) individual polynucleotide molecules. Similarly, a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
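The order-of-magnitude figures above can be checked with a short calculation. The constants below are assumptions introduced for this sketch and are not stated in the disclosure: roughly 3.3 pg of DNA per haploid human genome, a haploid genome length of roughly 3.1×10^9 bp, and a typical cfDNA fragment length of about 168 bp.

```python
# Assumed constants (approximate, for illustration only).
HAPLOID_GENOME_MASS_PG = 3.3    # mass of one haploid human genome, in picograms
HAPLOID_GENOME_SIZE_BP = 3.1e9  # length of one haploid human genome, in base pairs
MEAN_FRAGMENT_BP = 168          # typical cfDNA fragment length, in base pairs


def genome_equivalents(dna_ng: float) -> float:
    """Approximate haploid genome equivalents in a given DNA mass (nanograms)."""
    return dna_ng * 1000 / HAPLOID_GENOME_MASS_PG  # 1 ng = 1000 pg


def cfdna_molecules(dna_ng: float) -> float:
    """Approximate number of cfDNA fragments, assuming each genome equivalent
    is broken into fragments of MEAN_FRAGMENT_BP base pairs."""
    fragments_per_genome = HAPLOID_GENOME_SIZE_BP / MEAN_FRAGMENT_BP
    return genome_equivalents(dna_ng) * fragments_per_genome


ge_30 = genome_equivalents(30)     # on the order of 10^4
mol_30 = cfdna_molecules(30)       # on the order of 2 x 10^11
ge_100 = genome_equivalents(100)   # on the order of 3 x 10^4
mol_100 = cfdna_molecules(100)     # on the order of 6 x 10^11
```

Under these assumptions the results land within the same order of magnitude as the "about" values quoted in the paragraph, which is the level of agreement such an estimate can be expected to give.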
[0051] A sample can comprise nucleic acids from different sources, e.g., from cells and cell-free of the same subject, from cells and cell-free of different subjects. A sample can comprise nucleic acids carrying mutations. For example, a sample can comprise DNA carrying germline mutations and/or somatic mutations. Germline mutations refer to mutations existing in germline DNA of a subject. Somatic mutations refer to mutations originating in somatic cells of a subject, e.g., cancer cells. A sample can comprise DNA carrying cancer-associated mutations (e.g., cancer-associated somatic mutations). A sample can comprise an epigenetic variant (i.e. a chemical or protein modification), wherein the epigenetic variant is associated with the presence of a genetic variant such as a cancer-associated mutation. In some embodiments, the sample includes an epigenetic variant associated with the presence of a genetic variant, wherein the sample does not comprise the genetic variant.
[0052] Exemplary amounts of cell-free nucleic acids in a sample before amplification range from about 1 fg to about 1 µg, e.g., 1 pg to 200 ng, 1 ng to 100 ng, 10 ng to 1000 ng. For example, the amount can be up to about 600 ng, up to about 500 ng, up to about 400 ng, up to about 300 ng, up to about 200 ng, up to about 100 ng, up to about 50 ng, or up to about 20 ng of cell-free nucleic acid molecules. The amount can be at least 1 fg, at least 10 fg, at least 100 fg, at least 1 pg, at least 10 pg, at least 100 pg, at least 1 ng, at least 10 ng, at least 100 ng, at least 150 ng, or at least 200 ng of cell-free nucleic acid molecules. The amount can be up to 1 femtogram (fg), 10 fg, 100 fg, 1 picogram (pg), 10 pg, 100 pg, 1 ng, 10 ng, 100 ng, 150 ng, or 200 ng of cell-free nucleic acid molecules. The method can comprise obtaining 1 femtogram (fg) to 200 ng.
[0053] Cell-free nucleic acids are nucleic acids not contained within or otherwise bound to a cell or, in other words, nucleic acids remaining in a sample after removing intact cells. Cell-free nucleic acids include DNA, RNA, and hybrids thereof, including genomic DNA, mitochondrial DNA, siRNA, miRNA, circulating RNA (cRNA), tRNA, rRNA, small nucleolar RNA (snoRNA), Piwi-interacting RNA (piRNA), long non-coding RNA (long ncRNA), or fragments of any of these. Cell-free nucleic acids can be double-stranded, single-stranded, or a hybrid thereof. A cell-free nucleic acid can be released into bodily fluid through secretion or cell death processes, e.g., cellular necrosis and apoptosis. Some cell-free nucleic acids are released into bodily fluid from cancer cells, e.g., circulating tumor DNA (ctDNA). Others are released from healthy cells. In some embodiments, cfDNA is cell-free fetal DNA (cffDNA). In some embodiments, cell-free nucleic acids are produced by tumor cells. In some embodiments, cell-free nucleic acids are produced by a mixture of tumor cells and non-tumor cells.
[0054] Cell-free nucleic acids have an exemplary size distribution of about 100-500 nucleotides, with molecules of 110 to about 230 nucleotides representing about 90% of molecules, with a mode of about 168 nucleotides and a second minor peak in a range between 240 to 440 nucleotides. Cell-free nucleic acids can be isolated from bodily fluids through a fractionation or partitioning step in which cell-free nucleic acids, as found in solution, are separated from intact cells and other non-soluble components of the bodily fluid. Partitioning may include techniques such as centrifugation or filtration. Alternatively, cells in bodily fluids can be lysed and cell-free and cellular nucleic acids processed together. Generally, after addition of buffers and wash steps, nucleic acids can be precipitated with an alcohol. Further clean up steps may be used such as silica based columns to remove contaminants or salts. Non-specific bulk carrier nucleic acids, such as Cot-1 DNA, DNA or protein for bisulfite sequencing, hybridization, and/or ligation, may be added throughout the reaction to optimize certain aspects of the procedure such as yield.
[0055] After such processing, samples can include various forms of nucleic acid including double stranded DNA, single stranded DNA and single stranded RNA. In some embodiments, single stranded DNA and RNA can be converted to double stranded forms so they are included in subsequent processing and analysis steps.
Analytes
[0056] Analytes can include nucleic acid analytes and non-nucleic acid analytes. The disclosure provides for detecting genetic variations in biological samples from a subject. Biological samples may include polynucleotides from cancer cells. Polynucleotides may be DNA (e.g., genomic DNA, cDNA), RNA (e.g., mRNA, small RNAs), or any combination thereof. Biological samples may include tumor tissue, e.g., from a biopsy. In some cases, biological samples may include blood or saliva. In particular cases, biological samples may comprise cell-free DNA (“cfDNA”) or circulating tumor DNA (“ctDNA”). Cell-free DNA can be present in, e.g., blood.
[0057] Examples of non-nucleic acid analytes include, but are not limited to, lipids, carbohydrates, peptides, proteins, glycoproteins (N-linked or O-linked), lipoproteins, phosphoproteins, specific phosphorylated or acetylated variants of proteins, amidation variants of proteins, hydroxylation variants of proteins, methylation variants of proteins, ubiquitylation variants of proteins, sulfation variants of proteins, viral proteins (e.g., viral capsid, viral envelope, viral coat, viral accessory, viral glycoproteins, viral spike, etc.), extracellular and intracellular proteins, antibodies, and antigen binding fragments. These further include a receptor, an antigen, a surface protein, a transmembrane protein, a cluster of differentiation protein, a protein channel, a protein pump, a carrier protein, a phospholipid, a glycoprotein, a glycolipid, a cell-cell interaction protein complex, an antigen-presenting complex, a major histocompatibility complex, an engineered T-cell receptor, a T-cell receptor, a B-cell receptor, a chimeric antigen receptor, an extracellular matrix protein, a posttranslational modification (e.g., phosphorylation, glycosylation, ubiquitination, nitrosylation, methylation, acetylation or lipidation) state of a cell surface protein, a gap junction, and an adherens junction.
[0058] In general, the systems, apparatus, methods, and compositions can be used to analyze any number of analytes, further including both nucleic acid analytes and non-nucleic acid analytes. For example, the number of analytes that are analyzed can be at least about 2, at least about 3, at least about 4, at least about 5, at least about 6, at least about 7, at least about 8, at least about 9, at least about 10, at least about 11, at least about 12, at least about 13, at least about 14, at least about 15, at least about 20, at least about 25, at least about 30, at least about 40, at least about 50, at least about 100, at least about 1,000, at least about 10,000, at least about 100,000 or more different analytes present in a region of the sample or within an individual feature of the substrate. Methods for performing multiplexed assays to analyze two or more different analytes will be discussed in a subsequent section of this disclosure.
[0059] One or more nucleic acid analytes and/or non-nucleic acid analytes constitute a set of molecular interactions in a biological system under study (e.g., cells), which may be regarded as an “interactome”: the molecular interactions that occur between molecules belonging to different biochemical families (proteins, nucleic acids, lipids, carbohydrates, etc.) and also within a given family. In various embodiments, an interactome is a protein-DNA interactome (a network formed by transcription factors, and DNA or chromatin regulatory proteins, and their target genes). In other embodiments, interactome refers to a protein-protein interaction (PPI) network, or protein interaction network (PIN). The methods described herein allow for study and analysis of the interactome. Techniques such as proteogenomics (whole genome sequencing, whole exome sequencing, RNA-seq, and mass spectrometry as examples) can support study of the interactome.
Analysis
[0060] The present methods can be used to diagnose the presence of conditions, particularly cancer, in a subject; to characterize conditions (e.g., staging cancer or determining heterogeneity of a cancer); to monitor response to treatment of a condition; and to prognose the risk of developing a condition or the subsequent course of a condition. The present disclosure can also be useful in determining the efficacy of a particular treatment option. Successful treatment options may increase the amount of copy number variation or rare mutations detected in a subject's blood, as more cancer cells may die and shed DNA. In other examples, this may not occur. In another example, certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy. Additionally, if a cancer is observed to be in remission after treatment, the present methods can be used to monitor residual disease or recurrence of disease.
[0061] The types and number of cancers that may be detected may include blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogenous tumors and the like. Type and/or stage of cancer can be detected from genetic variations including mutations, rare mutations, indels, copy number variations, transversions, translocations, inversions, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, and abnormal changes in nucleic acid 5-methylcytosine.
[0062] Genetic and other analyte data can also be used for characterizing a specific form of cancer. Cancers are often heterogeneous in both composition and staging. Genetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer and allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. Some cancers can progress to become more aggressive and genetically unstable. Other cancers may remain benign, inactive or dormant. The system and methods of this disclosure may be useful in determining disease progression.
[0063] The present analyses are also useful in determining the efficacy of a particular treatment option. Successful treatment options may increase the amount of copy number variation or rare mutations detected in a subject's blood, as more cancer cells may die and shed DNA. In other examples, this may not occur. In another example, certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy. Additionally, if a cancer is observed to be in remission after treatment, the present methods can be used to monitor residual disease or recurrence of disease.
[0064] The present methods can also be used for detecting genetic variations in conditions other than cancer. Immune cells, such as B cells, may undergo rapid clonal expansion in the presence of certain diseases. Clonal expansions may be monitored using copy number variation detection, and certain immune states may be monitored. In this example, copy number variation analysis may be performed over time to produce a profile of how a particular disease may be progressing. Copy number variation or even rare mutation detection may be used to determine how a population of pathogens is changing during the course of infection. This may be particularly important during chronic infections, such as HIV/AIDS or hepatitis infections, whereby viruses may change life cycle state and/or mutate into more virulent forms during the course of infection. The present methods may be used to determine or profile rejection activities of the host body, as immune cells attempt to destroy transplanted tissue, in order to monitor the status of the transplanted tissue and to alter the course of treatment or prevention of rejection.
[0065] Further, the methods of the disclosure may be used to characterize the heterogeneity of an abnormal condition in a subject. Such methods can include, e.g., generating a genetic profile of extracellular polynucleotides derived from the subject, wherein the genetic profile includes a plurality of data resulting from copy number variation and rare mutation analyses. In some embodiments, an abnormal condition is cancer. In some embodiments, the abnormal condition may be one resulting in a heterogeneous genomic population. In the example of cancer, some tumors are known to comprise tumor cells in different stages of the cancer. In other examples, heterogeneity may comprise multiple foci of disease. Again, in the example of cancer, there may be multiple tumor foci, perhaps where one or more foci are the result of metastases that have spread from a primary site.
[0066] The present methods can be used to generate a profile, fingerprint or set of data that is a summation of genetic information derived from different cells in a heterogeneous disease. This set of data may comprise copy number variation and mutation analyses alone or in combination.
[0067] The present methods can be used to diagnose, prognose, monitor or observe cancers or other diseases. In some embodiments, the methods herein do not involve diagnosing, prognosing or monitoring a fetus and as such are not directed to non-invasive prenatal testing. In other embodiments, these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.
Determination of 5-methylcytosine pattern of nucleic acids
[0068] Bisulfite-based sequencing and variants thereof provide a means of determining the methylation pattern of a nucleic acid. In some embodiments, determining the methylation pattern includes distinguishing 5-methylcytosine (5mC) from non-methylated cytosine. In some embodiments, determining the methylation pattern includes distinguishing N6-methyladenine from non-methylated adenine. In some embodiments, determining the methylation pattern includes distinguishing 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC) from non-methylated cytosine. Examples of bisulfite sequencing include, but are not limited to, oxidative bisulfite sequencing (OX-BS-seq), Tet-assisted bisulfite sequencing (TAB-seq), and reduced bisulfite sequencing (redBS-seq).
[0069] Oxidative bisulfite sequencing (OX-BS-seq) is used to distinguish between 5mC and 5hmC, by first converting the 5hmC to 5fC, and then proceeding with bisulfite sequencing. Tet-assisted bisulfite sequencing (TAB-seq) can also be used to distinguish 5mC and 5hmC. In TAB-seq, 5hmC is protected by glucosylation. A Tet enzyme is then used to convert 5mC to 5caC before proceeding with bisulfite sequencing. Reduced bisulfite sequencing (redBS-seq) is used to distinguish 5fC from other modified cytosines.
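The way these variants separate cytosine states can be summarized as a lookup of the base expected in the final read. The table below is a simplified sketch assuming idealized, complete conversion; the dictionary and function names are hypothetical, and the entries for 5fC and 5caC follow the general behavior of bisulfite described in this disclosure (both are converted unless first reduced or protected).

```python
# Expected read-out per chemistry: "C" = read as cytosine, "T" = converted.
# Idealized assumption: every chemical step goes to completion.
READOUT = {
    "BS-seq":    {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "T"},
    # 5hmC is first oxidized to 5fC, which bisulfite then converts.
    "OX-BS-seq": {"C": "T", "5mC": "C", "5hmC": "T", "5fC": "T", "5caC": "T"},
    # 5hmC is glucosylated (protected); Tet oxidizes 5mC to 5caC.
    "TAB-seq":   {"C": "T", "5mC": "T", "5hmC": "C", "5fC": "T", "5caC": "T"},
    # 5fC is first reduced to 5hmC, which bisulfite does not convert.
    "redBS-seq": {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "C", "5caC": "T"},
}


def distinguishable(state_a: str, state_b: str, methods: list) -> bool:
    """True if at least one listed method reads the two states differently."""
    return any(READOUT[m][state_a] != READOUT[m][state_b] for m in methods)


if __name__ == "__main__":
    print(distinguishable("5mC", "5hmC", ["BS-seq"]))               # False
    print(distinguishable("5mC", "5hmC", ["BS-seq", "OX-BS-seq"]))  # True
```

Conventional bisulfite sequencing alone cannot separate 5mC from 5hmC; pairing it with an oxidative or Tet-assisted run is what resolves the two.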
[0070] Generally, in bisulfite sequencing, a nucleic acid sample is divided into two aliquots and one aliquot is treated with bisulfite. The bisulfite converts native cytosine and certain modified cytosine nucleotides (e.g., 5-formylcytosine or 5-carboxylcytosine) to uracil, whereas other modified cytosines (e.g., 5-methylcytosine, 5-hydroxymethylcytosine) are not converted. Comparison of nucleic acid sequences of molecules from the two aliquots indicates which cytosines were and were not converted to uracils. Consequently, cytosines which were and were not modified can be determined. The initial splitting of the sample into two aliquots is disadvantageous for samples containing only small amounts of nucleic acids, and/or composed of heterogeneous cell/tissue origins such as bodily fluids containing cell-free DNA.
[0071] The present disclosure provides methods allowing bisulfite sequencing and variants thereof. These methods work by linking nucleic acids in a population to a capture moiety, i.e., a label that can be captured or immobilized. Capture moieties include, without limitation, biotin, avidin, streptavidin, a nucleic acid including a particular nucleotide sequence, a hapten recognized by an antibody, and magnetically attractable particles. The capture moiety can be a member of a binding pair, such as biotin/streptavidin or hapten/antibody. In some embodiments, a capture moiety that is attached to an analyte is captured by its binding pair, which is attached to an isolatable moiety, such as a magnetically attractable particle or a large particle that can be sedimented through centrifugation. The capture moiety can be any type of molecule that allows affinity separation of nucleic acids bearing the capture moiety from nucleic acids lacking the capture moiety. Exemplary capture moieties are biotin, which allows affinity separation by binding to streptavidin linked or linkable to a solid phase, and an oligonucleotide, which allows affinity separation through binding to a complementary oligonucleotide linked or linkable to a solid phase. Following linking of capture moieties to sample nucleic acids, the sample nucleic acids serve as templates for amplification. Following amplification, the original templates remain linked to the capture moieties but amplicons are not linked to capture moieties.
[0072] The capture moiety can be linked to sample nucleic acids as a component of an adapter, which may also provide amplification and/or sequencing primer binding sites. In some methods, sample nucleic acids are linked to adapters at both ends, with both adapters bearing a capture moiety. Preferably any cytosine residues in the adapters are modified, such as by 5-methylcytosine, to protect against the action of bisulfite. In some instances, the capture moieties are linked to the original templates by a cleavable linkage (e.g., photocleavable desthiobiotin-TEG or uracil residues cleavable with USER™ enzyme, Chem. Commun. (Camb). 2015 Feb 21; 51(15): 3266-3269), in which case the capture moieties can, if desired, be removed.
[0073] The amplicons are denatured and contacted with an affinity reagent for the capture tag. Original templates bind to the affinity reagent whereas nucleic acid molecules resulting from amplification do not. Thus, the original templates can be separated from nucleic acid molecules resulting from amplification.
[0074] Following separation or partition, the respective populations of nucleic acids (i.e., original templates and amplification products) can be subjected to bisulfite treatment, with the original template population receiving bisulfite treatment and the amplification products not. Alternatively, the amplification products can be subjected to bisulfite treatment and the original template population not. Following such treatment, the respective populations can be amplified (which in the case of the original template population converts uracils to thymines). The populations can also be subjected to biotin probe hybridization for enrichment. The respective populations are then analyzed and sequences compared to determine which cytosines were 5-methylated (or 5-hydroxymethylated) in the original. Detection of a T nucleotide in the template population (corresponding to an unmethylated cytosine converted to uracil) and a C nucleotide at the corresponding position of the amplified population indicates an unmodified C. The presence of C's at corresponding positions of the original template and amplified populations indicates a modified C in the original sample.
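The position-by-position comparison described in this paragraph can be sketched as below. This is a minimal illustration, assuming the first arrangement (bisulfite applied to the original templates, amplification products left untreated), reads already aligned to the same coordinates and strand, and no sequencing errors; the function name `call_positions` is hypothetical.

```python
def call_positions(template_read: str, amplified_read: str):
    """Classify each cytosine seen in the untreated amplified population.

    template_read:  base calls from the bisulfite-treated original templates
    amplified_read: base calls from the untreated amplification products
    Returns a list of (position, call) tuples for informative positions.
    """
    if len(template_read) != len(amplified_read):
        raise ValueError("reads must be aligned to the same length")
    calls = []
    for i, (t, a) in enumerate(zip(template_read, amplified_read)):
        if a != "C":
            continue  # only cytosines in the untreated copy are informative
        if t == "T":
            calls.append((i, "unmodified C"))   # converted by bisulfite
        elif t == "C":
            calls.append((i, "modified C"))     # protected from conversion
        else:
            calls.append((i, "ambiguous"))      # unexpected base; left uncalled
    return calls


if __name__ == "__main__":
    print(call_positions("ATGTCT", "ACGTCC"))
```

In practice a call would be aggregated over many reads per position rather than made from a single pair, which this sketch does not attempt.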
[0075] In some embodiments, a method uses sequential DNA-seq and bisulfite-seq (BIS-seq) NGS library preparation of molecular tagged DNA libraries. This process is performed by labeling of adapters (e.g., biotin), DNA-seq amplification of the whole library, parent molecule recovery (e.g., streptavidin bead pull down), bisulfite conversion and BIS-seq. In some embodiments, the method identifies 5-methylcytosine with single-base resolution, through sequential NGS-preparative amplification of parent library molecules with and without bisulfite treatment. This can be achieved by modifying the 5-methylated NGS adapters (directional adapters; Y-shaped/forked, with 5-methylcytosine replacing cytosine) used in BIS-seq with a label (e.g., biotin) on one of the two adapter strands. Sample DNA molecules are adapter ligated and amplified (e.g., by PCR). As only the parent molecules will have a labeled adapter end, they can be selectively recovered from their amplified progeny by label-specific capture methods (e.g., streptavidin-magnetic beads). As the parent molecules retain 5-methylation marks, bisulfite conversion on the captured library will yield single-base resolution 5-methylation status upon BIS-seq, retaining molecular information corresponding to the DNA-seq. In some embodiments, the bisulfite treated library can be combined with a non-treated library prior to enrichment/NGS by addition of a sample tag DNA sequence in a standard multiplexed NGS workflow. As with BIS-seq workflows, bioinformatics analysis can be carried out for genomic alignment and 5-methylated base identification. In sum, this method provides the ability to selectively recover the parent, ligated molecules, carrying 5-methylcytosine marks, after library amplification, thereby allowing for parallel processing for bisulfite converted DNA. This overcomes the destructive nature of bisulfite treatment on the quality/sensitivity of the DNA-seq information extracted from a workflow. With this method, the recovered ligated, parent DNA molecules (via labeled adapters) allow amplification of the complete DNA library and parallel application of treatments that elicit epigenetic DNA modifications. The present disclosure discusses the use of BIS-seq methods to identify cytosine 5-methylation (5-methylcytosine), but this is not limiting. Variants of BIS-seq have been developed to identify hydroxymethylated cytosines (5hmC; OX-BS-seq, TAB-seq), formylcytosine (5fC; redBS-seq) and carboxylcytosines. These methodologies can be implemented with the sequential/parallel library preparation described herein.
Alternative Methods of Modified Nucleic Acid Analysis
[0076] The disclosure provides alternative methods for analyzing modified nucleic acids (e.g., methylated, linked to histones and other modifications discussed above). In some such methods, a population of nucleic acids bearing the modification to different extents (e.g., 0, 1, 2, 3, 4, 5 or more methyl groups per nucleic acid molecule) is contacted with adapters before fractionation of the population depending on the extent of the modification. Adapters attach to either one end or both ends of nucleic acid molecules in the population. Preferably, the adapters include different tags of sufficient numbers that the number of combinations of tags results in a low probability of two nucleic acids with the same start and stop points receiving the same combination of tags (e.g., a 95, 99 or 99.9% probability that such nucleic acids receive different combinations). Following attachment of adapters, the nucleic acids are amplified from primers binding to the primer binding sites within the adapters. Adapters, whether bearing the same or different tags, can include the same or different primer binding sites, but preferably adapters include the same primer binding site. Following amplification, the nucleic acids are contacted with an agent that preferentially binds to nucleic acids bearing the modification (such as the agents previously described). The nucleic acids are separated into at least two partitions differing in the extent to which the nucleic acids bear the modification from binding to the agents. For example, if the agent has affinity for nucleic acids bearing the modification, nucleic acids overrepresented in the modification (compared with median representation in the population) preferentially bind to the agent, whereas nucleic acids underrepresented for the modification do not bind or are more easily eluted from the agent. Following separation, the different partitions can then be subject to further processing steps, which typically include further amplification, and sequence analysis, in parallel but separately. 
Sequence data from the different partitions can then be compared.
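The requirement that enough distinct tags be used to keep tag collisions rare can be illustrated with a simple probability estimate. The sketch below assumes tags are drawn uniformly and independently, and that a molecule tagged at both ends receives an ordered pair of tags; these modeling choices and the function names are assumptions for illustration, not part of the disclosure.

```python
from math import prod


def p_all_unique(num_tags: int, molecules: int, both_ends: bool = True) -> float:
    """Probability that molecules sharing start/stop points all receive
    different tag combinations (uniform, independent tagging assumed)."""
    combos = num_tags ** 2 if both_ends else num_tags
    if molecules > combos:
        return 0.0  # more molecules than combinations: a collision is certain
    return prod(1 - i / combos for i in range(molecules))


def min_tags_for(target: float, molecules: int, both_ends: bool = True) -> int:
    """Smallest number of distinct tags reaching the target uniqueness probability."""
    n = 1
    while p_all_unique(n, molecules, both_ends) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Two molecules with the same start and stop points, adapters at both ends.
    for target in (0.95, 0.99, 0.999):
        print(target, min_tags_for(target, molecules=2))
```

The number of tags needed grows quickly as more molecules share the same start and stop points, so the expected depth at a locus drives the size of the tag set.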
[0077] Nucleic acids can be linked at both ends to Y-shaped adapters including primer binding sites and tags. The molecules are amplified. The amplified molecules are then fractionated by contact with an antibody preferentially binding to 5-methylcytosine to produce two partitions. One partition includes original molecules lacking methylation and amplification copies having lost methylation. The other partition includes original DNA molecules with methylation. The two partitions are then processed and sequenced separately with further amplification of the methylated partition. The sequence data of the two partitions can then be compared. In this example, tags are not used to distinguish between methylated and unmethylated DNA but rather to distinguish between different molecules within these partitions so that one can determine whether reads with the same start and stop points are based on the same or different molecules.
[0078] The disclosure provides further methods for analyzing a population of nucleic acids in which at least some of the nucleic acids include one or more modified cytosine residues, such as 5-methylcytosine and any of the other modifications described previously. In these methods, the population of nucleic acids is contacted with adapters including one or more cytosine residues modified at the 5C position, such as 5-methylcytosine. Preferably all cytosine residues in such adapters are also modified, or all such cytosines in a primer binding region of the adapters are modified. Adapters attach to both ends of nucleic acid molecules in the population. Preferably, the adapters include different tags of sufficient numbers that the number of combinations of tags results in a low probability of two nucleic acids with the same start and stop points receiving the same combination of tags (e.g., a 95, 99 or 99.9% probability that such nucleic acids receive different combinations). The primer binding sites in such adapters can be the same or different, but are preferably the same. After attachment of adapters, the nucleic acids are amplified from primers binding to the primer binding sites of the adapters. The amplified nucleic acids are split into first and second aliquots. The first aliquot is assayed for sequence data with or without further processing. The sequence data on molecules in the first aliquot is thus determined irrespective of the initial methylation state of the nucleic acid molecules. The nucleic acid molecules in the second aliquot are treated with bisulfite. This treatment converts unmodified cytosines to uracils. The bisulfite treated nucleic acids are then subjected to amplification primed by primers to the original primer binding sites of the adapters linked to nucleic acid. Only the nucleic acid molecules originally linked to adapters (as distinct from amplification products thereof) are now amplifiable because these nucleic acids retain cytosines in the primer binding sites of the adapters, whereas amplification products have lost the methylation of these cytosine residues, which have undergone conversion to uracils in the bisulfite treatment. Thus, only original molecules in the populations, at least some of which are methylated, undergo amplification. After amplification, these nucleic acids are subject to sequence analysis. Comparison of sequences determined from the first and second aliquots can indicate, among other things, which cytosines in the nucleic acid population were subject to methylation.
Partitioning the Sample into a Plurality of Subsamples; Aspects of Samples; Analysis of Epigenetic Characteristics
[0079] In certain embodiments described herein, a population of different forms of nucleic acids (e.g., hypermethylated and hypomethylated DNA in a sample, such as a captured set of cfDNA as described herein) can be physically partitioned based on one or more characteristics of the nucleic acids prior to further analysis, e.g., differentially modifying or isolating a nucleobase, tagging, and/or sequencing. This approach can be used to determine, for example, whether certain sequences are hypermethylated or hypomethylated. In some embodiments, hypermethylation variable epigenetic target regions are analyzed to determine whether they show hypermethylation characteristic of tumor cells and/or hypomethylation variable epigenetic target regions are analyzed to determine whether they show hypomethylation characteristic of tumor cells. Additionally, by partitioning a heterogeneous nucleic acid population, one may increase rare signals, e.g., by enriching rare nucleic acid molecules that are more prevalent in one fraction (or partition) of the population. For example, a genetic variation present in hypermethylated DNA but less (or not) in hypomethylated DNA can be more easily detected by partitioning a sample into hypermethylated and hypomethylated nucleic acid molecules. By analyzing multiple fractions of a sample, a multi-dimensional analysis of a single locus of a genome or species of nucleic acid can be performed and hence, greater sensitivity can be achieved.
[0080] In some instances, a heterogeneous nucleic acid sample is partitioned into two or more partitions (e.g., at least 3, 4, 5, 6 or 7 partitions). In some embodiments, each partition is differentially tagged. Tagged partitions can then be pooled together for collective sample prep and/or sequencing. The partitioning-tagging-pooling steps can occur more than once, with each round of partitioning occurring based on a different characteristic (examples provided herein), and tagged using differential tags that are distinguished from other partitions and partitioning means.
[0081] Examples of characteristics that can be used for partitioning include sequence length, methylation level, nucleosome binding, sequence mismatch, immunoprecipitation, and/or proteins that bind to DNA. Resulting partitions can include one or more of the following nucleic acid forms: single-stranded DNA (ssDNA), double-stranded DNA (dsDNA), shorter DNA fragments and longer DNA fragments. In some embodiments, partitioning based on a cytosine modification (e.g., cytosine methylation) or methylation generally is performed and is optionally combined with at least one additional partitioning step, which may be based on any of the foregoing characteristics or forms of DNA. In some embodiments, a heterogeneous population of nucleic acids is partitioned into nucleic acids with one or more epigenetic modifications and without the one or more epigenetic modifications. Examples of epigenetic modifications include presence or absence of methylation; level of methylation; type of methylation (e.g., 5-methylcytosine versus other types of methylation, such as adenine methylation and/or cytosine hydroxymethylation); and association and level of association with one or more proteins, such as histones. Alternatively or additionally, a heterogeneous population of nucleic acids can be partitioned into nucleic acid molecules associated with nucleosomes and nucleic acid molecules devoid of nucleosomes. Alternatively or additionally, a heterogeneous population of nucleic acids may be partitioned into single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA). Alternatively or additionally, a heterogeneous population of nucleic acids may be partitioned based on nucleic acid length (e.g., molecules of up to 160 bp and molecules having a length of greater than 160 bp).
[0082] In some instances, each partition (representative of a different nucleic acid form) is differentially labelled, and the partitions are pooled together prior to sequencing. In other instances, the different forms are separately sequenced. In some embodiments, a population of different nucleic acids is partitioned into two or more different partitions. Each partition is representative of a different nucleic acid form, and a first partition (also referred to as a subsample) includes DNA with a cytosine modification in a greater proportion than a second subsample. Each partition is distinctly tagged. The first subsample is subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity. The tagged nucleic acids are pooled together prior to sequencing. Sequence reads are obtained and analyzed in silico, including to distinguish the first nucleobase from the second nucleobase in the DNA of the first subsample. Tags are used to sort reads from different partitions. Analysis to detect genetic variants can be performed on a partition-by-partition level, as well as on a whole nucleic acid population level. For example, analysis can include in silico analysis to determine genetic variants, such as CNV, SNV, indel, or fusion, in nucleic acids in each partition. In some instances, in silico analysis can include determining chromatin structure. For example, coverage of sequence reads can be used to determine nucleosome positioning in chromatin. Higher coverage can correlate with higher nucleosome occupancy in a genomic region, while lower coverage can correlate with lower nucleosome occupancy or a nucleosome depleted region (NDR).
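Sorting pooled reads by partition tag and computing per-position coverage can be sketched as follows. The tag sequences, partition labels, and the `(tag, start, end)` read representation are all hypothetical placeholders chosen for illustration; the disclosure does not specify them. Intervals are treated as half-open.

```python
from collections import defaultdict

# Hypothetical partition tags; real tag sequences are not given in the text.
PARTITION_TAGS = {"AAGG": "hypermethylated", "CCTT": "hypomethylated"}


def sort_reads_by_partition(reads):
    """Group (tag, start, end) reads into {partition: [(start, end), ...]}."""
    by_partition = defaultdict(list)
    for tag, start, end in reads:
        partition = PARTITION_TAGS.get(tag, "unassigned")
        by_partition[partition].append((start, end))
    return dict(by_partition)


def coverage(intervals, region_start, region_end):
    """Per-position read depth over [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in intervals:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


if __name__ == "__main__":
    reads = [("AAGG", 0, 5), ("AAGG", 3, 8), ("CCTT", 2, 4), ("GGGG", 0, 1)]
    parts = sort_reads_by_partition(reads)
    print(coverage(parts["hypermethylated"], 0, 10))
```

Stretches of low or zero depth in such a profile are the kind of signal that could flag a candidate nucleosome depleted region, subject to normalization that this sketch omits.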
[0083] Samples can include nucleic acids varying in modifications including post-replication modifications to nucleotides and binding, usually noncovalently, to one or more proteins.
[0084] In an embodiment, the population of nucleic acids is one obtained from a serum, plasma or blood sample from a subject suspected of having neoplasia, a tumor, or cancer or previously diagnosed with neoplasia, a tumor, or cancer. The population of nucleic acids includes nucleic acids having varying levels of methylation. Methylation can occur from any one or more post-replication or transcriptional modifications. Post-replication modifications include modifications of the nucleotide cytosine, particularly at the 5-position of the nucleobase, e.g., 5-methylcytosine, 5-hydroxymethylcytosine, 5-formylcytosine and 5-carboxylcytosine. The affinity agents can be antibodies with the desired specificity, natural binding partners or variants thereof (Bock et al., Nat Biotech 28: 1106-1114 (2010); Song et al., Nat Biotech 29: 68-72 (2011)), or artificial peptides selected e.g., by phage display to have specificity to a given target.
[0085] Examples of capture moieties contemplated herein include methyl binding domains (MBDs) and methyl binding proteins (MBPs) as described herein, including proteins such as MeCP2 and antibodies preferentially binding to 5-methylcytosine. Likewise, partitioning of different forms of nucleic acids can be performed using histone binding proteins which can separate nucleic acids bound to histones from free or unbound nucleic acids. Examples of histone binding proteins that can be used in the methods disclosed herein include RBBP4, RbAp48 and SANT domain peptides. Although for some affinity agents and modifications, binding to the agent may occur in an essentially all or none manner depending on whether a nucleic acid bears a modification, the separation may be one of degree. In such instances, nucleic acids overrepresented in a modification bind to the agent to a greater extent than nucleic acids underrepresented in the modification. Alternatively, nucleic acids having modifications may bind in an all or nothing manner. But then, various levels of modifications may be sequentially eluted from the binding agent.
[0086] For example, in some embodiments, partitioning can be binary or based on degree/level of modifications. For example, all methylated fragments can be partitioned from unmethylated fragments using methyl-binding domain proteins (e.g., MethylMiner Methylated DNA Enrichment Kit (ThermoFisher Scientific)). Subsequently, additional partitioning may involve eluting fragments having different levels of methylation by adjusting the salt concentration in a solution with the methyl-binding domain and bound fragments. As salt concentration increases, fragments having greater methylation levels are eluted. In some instances, the final partitions are representative of nucleic acids having different extents of modifications (overrepresentative or underrepresentative of modifications). Overrepresentation and underrepresentation can be defined by the number of modifications borne by a nucleic acid relative to the median number of modifications per strand in a population. For example, if the median number of 5-methylcytosine residues in nucleic acid in a sample is 2, a nucleic acid including more than two 5-methylcytosine residues is overrepresented in this modification and a nucleic acid with 1 or zero 5-methylcytosine residues is underrepresented. The effect of the affinity separation is to enrich for nucleic acids overrepresented in a modification in a bound phase and for nucleic acids underrepresented in a modification in an unbound phase (i.e., in solution). The nucleic acids in the bound phase can be eluted before subsequent processing.
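The median-based definition of overrepresentation and underrepresentation can be illustrated with a short sketch. The function name, the example counts, and the "at_median" label are assumptions made for illustration; the text defines only the overrepresented and underrepresented cases and does not say how a molecule exactly at the median is labeled.

```python
from statistics import median


def classify_by_modification_count(counts):
    """Label each molecule relative to the sample median number of modifications."""
    med = median(counts)
    labels = []
    for n in counts:
        if n > med:
            labels.append("overrepresented")
        elif n < med:
            labels.append("underrepresented")
        else:
            # Not defined in the text; kept as a separate label (assumption).
            labels.append("at_median")
    return med, labels


# Hypothetical per-molecule 5-methylcytosine counts whose median is 2.
counts = [0, 1, 2, 2, 3, 5, 2]
med, labels = classify_by_modification_count(counts)
```

With these counts, molecules carrying 3 or 5 residues are labeled overrepresented and those carrying 0 or 1 are labeled underrepresented, matching the worked example in the paragraph above.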
[0087] When using MethylMiner Methylated DNA Enrichment Kit (ThermoFisher Scientific), various levels of methylation can be partitioned using sequential elutions. For example, a hypomethylated partition (e.g., no methylation) can be separated from a methylated partition by contacting the nucleic acid population with the MBD from the kit, which is attached to magnetic beads. The beads are used to separate out the methylated nucleic acids from the non-methylated nucleic acids. Subsequently, one or more elution steps are performed sequentially to elute nucleic acids having different levels of methylation. For example, a first set of methylated nucleic acids can be eluted at a salt concentration of 160 mM or higher, e.g., at least 150 mM, at least 200 mM, at least 300 mM, at least 400 mM, at least 500 mM, at least 600 mM, at least 700 mM, at least 800 mM, at least 900 mM, at least 1000 mM, or at least 2000 mM. After such methylated nucleic acids are eluted, magnetic separation is once again used to separate higher level of methylated nucleic acids from those with lower level of methylation. The elution and magnetic separation steps can repeat themselves to create various partitions such as a hypomethylated partition (representative of no methylation), a methylated partition (representative of low level of methylation), and a hypermethylated partition (representative of high level of methylation).
[0088] In some methods, nucleic acids bound to an agent used for affinity separation are subjected to a wash step. The wash step washes off nucleic acids weakly bound to the affinity agent. Such nucleic acids can be enriched in nucleic acids having the modification to an extent close to the mean or median (i.e., intermediate between nucleic acids remaining bound to the solid phase and nucleic acids not binding to the solid phase on initial contacting of the sample with the agent). The affinity separation results in at least two, and sometimes three or more partitions of nucleic acids with different extents of a modification. While the partitions are still separate, the nucleic acids of at least one partition, and usually two or three (or more) partitions are linked to nucleic acid tags, usually provided as components of adapters, with the nucleic acids in different partitions receiving different tags that distinguish members of one partition from another. The tags linked to nucleic acid molecules of the same partition can be the same or different from one another. But if different from one another, the tags may have part of their code in common so as to identify the molecules to which they are attached as being of a particular partition. For further details regarding partitioning nucleic acid samples based on characteristics such as methylation, see WO2018/119452, which is incorporated herein by reference. In some embodiments, the nucleic acid molecules can be fractionated into different partitions based on the nucleic acid molecules that are bound to a specific protein or a fragment thereof and those that are not bound to that specific protein or fragment thereof.
[0089] Nucleic acid molecules can be fractionated based on DNA-protein binding. Protein-DNA complexes can be fractionated based on a specific property of a protein. Examples of such properties include various epitopes, modifications (e.g., histone methylation or acetylation) or enzymatic activity. Examples of proteins which may bind to DNA and serve as a basis for fractionation may include, but are not limited to, protein A and protein G. Any suitable method can be used to fractionate the nucleic acid molecules based on protein bound regions. Examples of methods used to fractionate nucleic acid molecules based on protein bound regions include, but are not limited to, SDS-PAGE, chromatin-immuno-precipitation (ChIP), heparin chromatography, and asymmetrical field flow fractionation (AF4).
[0090] In some embodiments, partitioning of the nucleic acids is performed by contacting the nucleic acids with a methylation binding domain (“MBD”) of a methylation binding protein (“MBP”). MBD binds to 5-methylcytosine (5mC). MBD is coupled to paramagnetic beads, such as Dynabeads® M-280 Streptavidin via a biotin linker. Partitioning into fractions with different extents of methylation can be performed by eluting fractions by increasing the NaCl concentration.
[0091] An exemplary method for molecular tag identification of MBD-bead partitioned libraries through NGS is as follows:
[0092] Physical partitioning of an extracted DNA sample (e.g., extracted blood plasma DNA from a human sample) using a methyl-binding domain protein-bead purification kit, saving all elutions from the process for downstream processing.
[0093] Parallel application of differential molecular tags and NGS-enabling adapter sequences to each partition. For example, the hypermethylated, residual methylation ('wash'), and hypomethylated partitions are ligated with NGS-adapters with molecular tags.
[0094] Re-combining all molecular tagged partitions, and subsequent amplification using adapter-specific DNA primer sequences.
[0095] Enrichment/hybridization of re-combined and amplified total library, targeting genomic regions of interest (e.g., cancer-specific genetic variants and differentially methylated regions).
[0096] Re-amplification of the enriched total DNA library, appending a sample tag. Different samples are pooled, and assayed in multiplex on an NGS instrument.
[0097] Bioinformatics analysis of NGS data, with the molecular tags being used to identify unique molecules, as well as deconvolution of the sample into molecules that were differentially MBD-partitioned. This analysis can yield information on relative 5-methylcytosine for genomic regions, concurrent with standard genetic sequencing/variant detection.
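The deconvolution step of the exemplary method, in which molecular tags are used to sort pooled reads back into their MBD partitions, can be sketched as follows. The tag sequences, the tag length, and the assumption that the tag sits at the 5' end of each read are hypothetical choices made for illustration; actual tag designs are assay-specific.

```python
from collections import defaultdict

# Hypothetical partition tags; real tag sequences and lengths are assay-specific.
PARTITION_TAGS = {
    "ACGT": "hypermethylated",
    "TGCA": "residual_wash",
    "GATC": "hypomethylated",
}
TAG_LEN = 4


def sort_reads_by_partition(reads):
    """Group reads by the partition tag at the 5' end and strip the tag."""
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:TAG_LEN], read[TAG_LEN:]
        # Reads with an unrecognized tag are kept aside rather than guessed.
        partition = PARTITION_TAGS.get(tag, "unassigned")
        bins[partition].append(insert)
    return dict(bins)


# Hypothetical pooled reads from one sample.
reads = ["ACGTTTGACC", "GATCTTGACC", "ACGTCCGGAA", "NNNNAAAA"]
bins = sort_reads_by_partition(reads)
```

The same insert sequence appearing under both the hypermethylated and hypomethylated tags, as in this toy input, is the kind of signal that lets relative methylation be estimated per genomic region. This sketch does exact tag matching only and does not attempt error correction of tags.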
[0098] Examples of MBPs contemplated herein include, but are not limited to:
[0099] (a) MeCP2 is a protein preferentially binding to 5-methyl-cytosine over unmodified cytosine.
[0100] (b) RPL26, PRP8 and the DNA mismatch repair protein MSH6 preferentially bind to 5-hydroxymethyl-cytosine over unmodified cytosine.
[0101] (c) FOXK1, FOXK2, FOXP1, FOXP4 and FOXI3 preferably bind to 5-formyl-cytosine over unmodified cytosine (Iurlaro et al., Genome Biol. 14: R119 (2013)).
[0102] (d) Antibodies specific to one or more methylated nucleotide bases.
[0103] In general, elution is a function of number of methylated sites per molecule, with molecules having more methylation eluting under increased salt concentrations. To elute the DNA into distinct populations based on the extent of methylation, one can use a series of elution buffers of increasing NaCl concentration. Salt concentration can range from about 100 nM to about 2500 mM NaCl. In one embodiment, the process results in three (3) partitions. Molecules are contacted with a solution at a first salt concentration and including a molecule including a methyl binding domain, which molecule can be attached to a capture moiety, such as streptavidin. At the first salt concentration a population of molecules will bind to the MBD and a population will remain unbound. The unbound population can be separated as a “hypomethylated” population. For example, a first partition representative of the hypomethylated form of DNA is that which remains unbound at a low salt concentration, e.g., 100 mM or 160 mM. A second partition representative of intermediate methylated DNA is eluted using an intermediate salt concentration, e.g., between 100 mM and 2000 mM concentration. This is also separated from the sample. A third partition representative of hypermethylated form of DNA is eluted using a high salt concentration, e.g., at least about 2000 mM.
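The three-partition scheme can be summarized as a simple mapping from the NaCl concentration at which a molecule leaves the MBD to a partition label. This is a sketch under stated assumptions: the default thresholds of 160 mM and 2000 mM are taken from the example values in the paragraph above, the function name is hypothetical, and modeling each molecule by a single release concentration is a simplification.

```python
def assign_partition(elution_mM, low_mM=160.0, high_mM=2000.0):
    """Map a molecule's release concentration (mM NaCl) to one of three partitions.

    elution_mM is None for molecules that never bound at the low-salt step.
    The default thresholds are illustrative values, not fixed by the method.
    """
    if elution_mM is None or elution_mM <= low_mM:
        return "hypomethylated"
    if elution_mM < high_mM:
        return "intermediate"
    return "hypermethylated"


# Hypothetical release concentrations for four molecules.
observed = [None, 160.0, 500.0, 2000.0]
partitions = [assign_partition(x) for x in observed]
```

Passing different `low_mM` and `high_mM` values models other elution series, for example a low-salt step at 100 mM instead of 160 mM.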
[0104] The disclosure provides further methods for analyzing a population of nucleic acids in which at least some of the nucleic acids include one or more modified cytosine residues, such as 5-methylcytosine and any of the other modifications described previously. In these methods, after partitioning, the subsamples of nucleic acids are contacted with adapters including one or more cytosine residues modified at the 5C position, such as 5-methylcytosine. Preferably all cytosine residues in such adapters are also modified, or all such cytosines in a primer binding region of the adapters are modified. Adapters attach to both ends of nucleic acid molecules in the population. Preferably, the adapters include different tags of sufficient numbers that the number of combinations of tags results in a high probability (e.g., 95, 99 or 99.9%) that two nucleic acids with the same start and stop points receive different combinations of tags. The primer binding sites in such adapters can be the same or different, but are preferably the same. After attachment of adapters, the nucleic acids are amplified from primers binding to the primer binding sites of the adapters. The amplified nucleic acids are split into first and second aliquots. The first aliquot is assayed for sequence data with or without further processing. The sequence data on molecules in the first aliquot is thus determined irrespective of the initial methylation state of the nucleic acid molecules. The nucleic acid molecules in the second aliquot are subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA, wherein the first nucleobase includes a cytosine modified at the 5 position, and the second nucleobase includes unmodified cytosine. This procedure may be bisulfite treatment or another procedure that converts unmodified cytosines to uracils.
The nucleic acids subjected to the procedure are then amplified with primers to the original primer binding sites of the adapters linked to nucleic acid. Only the nucleic acid molecules originally linked to adapters (as distinct from amplification products thereof) are now amplifiable because these nucleic acids retain cytosines in the primer binding sites of the adapters, whereas amplification products have lost the methylation of these cytosine residues, which have undergone conversion to uracils in the bisulfite treatment. Thus, only original molecules in the populations, at least some of which are methylated, undergo amplification. After amplification, these nucleic acids are subject to sequence analysis. Comparison of sequences determined from the first and second aliquots can indicate among other things, which cytosines in the nucleic acid population were subject to methylation.
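The requirement that adapters carry enough different tags for molecules with the same start and stop points to be told apart can be checked numerically. The sketch below is an illustration under explicit assumptions: each end of a molecule independently receives one of `n_tags` tags uniformly at random, so there are `n_tags ** 2` ordered tag combinations. Neither the function names nor this tagging model come from the text.

```python
from fractions import Fraction


def prob_all_distinct(n_tags, n_molecules=2):
    """Probability that molecules sharing start/stop points all get distinct tag pairs.

    Assumes one of n_tags tags is attached independently and uniformly to each end.
    """
    combos = n_tags ** 2
    p = Fraction(1)
    for i in range(n_molecules):
        p *= Fraction(combos - i, combos)
    return p


def min_tags_for(target, n_molecules=2):
    """Smallest tag count whose distinctness probability meets the target."""
    target = Fraction(target)  # accepts strings such as "0.999"
    n = 1
    while prob_all_distinct(n, n_molecules) < target:
        n += 1
    return n


# Tags needed so two co-located molecules differ with 95%, 99%, 99.9% probability.
needed = [min_tags_for(t) for t in ("0.95", "0.99", "0.999")]
```

Under this model, 5, 10, and 32 tags per end suffice for the 95%, 99%, and 99.9% levels when only two molecules share the same endpoints. Increasing `n_molecules` shows how quickly the required tag count grows at deeply covered positions.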
[0105] Such an analysis can be performed using the following exemplary procedure. After partitioning, methylated DNA is linked to Y-shaped adapters at both ends including primer binding sites and tags. The cytosines in the adapters are modified at the 5 position (e.g., 5-methylated). The modification of the adapters serves to protect the primer binding sites in a subsequent conversion step (e.g., bisulfite treatment, TAP conversion, or any other conversion that does not affect the modified cytosine but affects unmodified cytosine). After attachment of adapters, the DNA molecules are amplified. The amplification product is split into two aliquots for sequencing with and without conversion. The aliquot not subjected to conversion can be subjected to sequence analysis with or without further processing. The other aliquot is subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA, wherein the first nucleobase includes a cytosine modified at the 5 position, and the second nucleobase includes unmodified cytosine. This procedure may be bisulfite treatment or another procedure that converts unmodified cytosines to uracils. Only primer binding sites protected by modification of cytosines can support amplification when contacted with primers specific for original primer binding sites. Thus, only original molecules and not copies from the first amplification are subjected to further amplification. The further amplified molecules are then subjected to sequence analysis. Sequences can then be compared from the two aliquots. As in the separation scheme discussed above, nucleic acid tags in adapters are not used to distinguish between methylated and unmethylated DNA but to distinguish nucleic acid molecules within the same partition.
Subjecting the First Subsample to a Procedure that Affects a First Nucleobase in the DNA Differently from a Second Nucleobase in the DNA of the First Subsample
[0106] Methods disclosed herein comprise a step of subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity. In some embodiments, if the first nucleobase is a modified or unmodified adenine, then the second nucleobase is a modified or unmodified adenine; if the first nucleobase is a modified or unmodified cytosine, then the second nucleobase is a modified or unmodified cytosine; if the first nucleobase is a modified or unmodified guanine, then the second nucleobase is a modified or unmodified guanine; and if the first nucleobase is a modified or unmodified thymine, then the second nucleobase is a modified or unmodified thymine (where modified and unmodified uracil are encompassed within modified thymine for the purpose of this step).
[0107] In some embodiments, the first nucleobase is a modified or unmodified cytosine, and the second nucleobase is a modified or unmodified cytosine. For example, the first nucleobase may comprise unmodified cytosine (C) and the second nucleobase may comprise one or more of 5-methylcytosine (mC) and 5-hydroxymethylcytosine (hmC). Alternatively, the second nucleobase may comprise C and the first nucleobase may comprise one or more of mC and hmC. Other combinations are also possible, as indicated, e.g., in the Summary above and the following discussion, such as where one of the first and second nucleobases includes mC and the other includes hmC.
[0108] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes bisulfite conversion. Treatment with bisulfite converts unmodified cytosine and certain modified cytosine nucleotides (e.g., 5-formylcytosine (fC) or 5-carboxylcytosine (caC)) to uracil whereas other modified cytosines (e.g., 5-methylcytosine, 5-hydroxymethylcytosine) are not converted. Thus, where bisulfite conversion is used, the first nucleobase includes one or more of unmodified cytosine, 5-formylcytosine, 5-carboxylcytosine, or other cytosine forms affected by bisulfite, and the second nucleobase may comprise one or more of mC and hmC, such as mC and optionally hmC. Sequencing of bisulfite-treated DNA identifies positions that are read as cytosine as being mC or hmC positions. Meanwhile, positions that are read as T are identified as being T or a bisulfite-susceptible form of C, such as unmodified cytosine, 5-formylcytosine, or 5-carboxylcytosine. Performing bisulfite conversion on a first subsample as described herein thus facilitates identifying positions containing mC or hmC using the sequence reads obtained from the first subsample. For an exemplary description of bisulfite conversion, see, e.g., Moss et al., Nat Commun. 2018; 9: 5068.
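The read-out logic of bisulfite conversion can be illustrated in silico. The sketch below is a toy model, not a description of any particular pipeline: the molecule is represented as a list of annotated bases, the function names are hypothetical, and the comparison against an unconverted reference sequence is an assumption about how retained cytosines would be located.

```python
# Bisulfite-susceptible cytosine forms are read as T after conversion and
# amplification; mC and hmC are protected and are still read as C.
SUSCEPTIBLE = {"C", "fC", "caC"}
PROTECTED = {"mC", "hmC"}


def bisulfite_read(bases):
    """Return the sequence read after bisulfite conversion of annotated bases."""
    out = []
    for b in bases:
        if b in SUSCEPTIBLE:
            out.append("T")
        elif b in PROTECTED:
            out.append("C")
        else:
            out.append(b)  # A, G, and T are unchanged
    return "".join(out)


def call_protected_cytosines(reference, converted_read):
    """Positions that are C in the reference and still read as C (mC or hmC)."""
    return [
        i
        for i, (ref, obs) in enumerate(zip(reference, converted_read))
        if ref == "C" and obs == "C"
    ]


# Hypothetical molecule with mixed cytosine forms and its unconverted sequence.
molecule = ["A", "mC", "G", "C", "T", "hmC", "fC", "G"]
reference = "ACGCTCCG"
read = bisulfite_read(molecule)
protected_positions = call_protected_cytosines(reference, read)
```

As the paragraph above notes, a retained C cannot distinguish mC from hmC, and a T at a reference cytosine cannot distinguish unmodified cytosine from fC or caC; the sketch reproduces exactly that ambiguity.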
[0109] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes oxidative bisulfite (Ox-BS) conversion. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes Tet-assisted bisulfite (TAB) conversion. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes Tet-assisted conversion with a substituted borane reducing agent, optionally wherein the substituted borane reducing agent is 2-picoline borane, borane pyridine, tert-butylamine borane, or ammonia borane. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes chemical-assisted conversion with a substituted borane reducing agent, optionally wherein the substituted borane reducing agent is 2-picoline borane, borane pyridine, tert-butylamine borane, or ammonia borane. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes APOBEC-coupled epigenetic (ACE) conversion.
[0110] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes enzymatic conversion of the first nucleobase, e.g., as in EM-Seq. See, e.g., Vaisvila R, et al. (2019) EM-seq: Detection of DNA methylation at single base resolution from picograms of DNA. bioRxiv; DOI: 10.1101/2019.12.20.884692, available at www.biorxiv.org/content/10.1101/2019.12.20.884692v1. For example, TET2 and T4-βGT can be used to convert 5mC and 5hmC into substrates that cannot be deaminated by a deaminase (e.g., APOBEC3A), and then a deaminase (e.g., APOBEC3A) can be used to deaminate unmodified cytosines, converting them to uracils.
[0111] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample includes separating DNA originally including the first nucleobase from DNA not originally including the first nucleobase.
[0112] In some embodiments, the first nucleobase is a modified or unmodified adenine, and the second nucleobase is a modified or unmodified adenine. In some embodiments, the modified adenine is N6-methyladenine (mA). In some embodiments, the modified adenine is one or more of N6-methyladenine (mA), N6-hydroxymethyladenine (hmA), or N6-formyladenine (fA).
[0113] Techniques including methylated DNA immunoprecipitation (MeDIP) can be used to separate DNA containing modified bases such as mA from other DNA. See, e.g., Kumar et al., Frontiers Genet. 2018; 9: 640; Greer et al., Cell 2015; 161: 868-878. An antibody specific for mA is described in Sun et al., Bioessays 2015; 37: 1155-62. Antibodies for various modified nucleobases, such as forms of thymine/uracil including halogenated forms such as 5-bromouracil, are commercially available. Various modified bases can also be detected based on alterations in their base-pairing specificity. For example, hypoxanthine is a modified form of adenine that can result from deamination and is read in sequencing as a G. See, e.g., US Patent 8,486,630; Brown, Genomes, 2nd Ed., John Wiley & Sons, Inc., New York, N.Y., 2002, chapter 14, “Mutation, Repair, and Recombination.”
Enriching/Capturing Step, Amplification, Adaptors, Barcodes
[0114] In some embodiments, methods disclosed herein comprise a step of capturing one or more sets of target regions of DNA, such as cfDNA. Capture may be performed using any suitable approach known in the art. In some embodiments, capturing includes contacting the DNA to be captured with a set of target-specific probes. The set of target-specific probes may have any of the features described herein for sets of target-specific probes, including but not limited to in the embodiments set forth above and the sections relating to probes below. Capturing may be performed on one or more subsamples prepared during methods disclosed herein. In some embodiments, DNA is captured from at least the first subsample or the second subsample, e.g., at least the first subsample and the second subsample. Where the first subsample undergoes a separation step (e.g., separating DNA originally including the first nucleobase (e.g., hmC) from DNA not originally including the first nucleobase, such as hmC-seal), capturing may be performed on any, any two, or all of the DNA originally including the first nucleobase (e.g., hmC), the DNA not originally including the first nucleobase, and the second subsample. In some embodiments, the subsamples are differentially tagged (e.g., as described herein) and then pooled before undergoing capture.
[0115] The capturing step may be performed using conditions suitable for specific nucleic acid hybridization, which generally depend to some extent on features of the probes such as length, base composition, etc. Those skilled in the art will be familiar with appropriate conditions given general knowledge in the art regarding nucleic acid hybridization. In some embodiments, complexes of target-specific probes and DNA are formed.
[0116] In some embodiments, a method described herein includes capturing cfDNA obtained from a test subject for a plurality of sets of target regions. The target regions comprise epigenetic target regions, which may show differences in methylation levels and/or fragmentation patterns depending on whether they originated from a tumor or from healthy cells. The target regions also comprise sequence-variable target regions, which may show differences in sequence depending on whether they originated from a tumor or from healthy cells. The capturing step produces a captured set of cfDNA molecules, and the cfDNA molecules corresponding to the sequence-variable target region set are captured at a greater capture yield in the captured set of cfDNA molecules than cfDNA molecules corresponding to the epigenetic target region set. For additional discussion of capturing steps, capture yields, and related aspects, see WO2020/160414, which is incorporated herein by reference for all purposes.
[0117] In some embodiments, a method described herein includes contacting cfDNA obtained from a test subject with a set of target-specific probes, wherein the set of target-specific probes is configured to capture cfDNA corresponding to the sequence-variable target region set at a greater capture yield than cfDNA corresponding to the epigenetic target region set.
[0118] It can be beneficial to capture cfDNA corresponding to the sequence-variable target region set at a greater capture yield than cfDNA corresponding to the epigenetic target region set because a greater depth of sequencing may be necessary to analyze the sequence-variable target regions with sufficient confidence or accuracy than may be necessary to analyze the epigenetic target regions. The volume of data needed to determine fragmentation patterns (e.g., to test for perturbation of transcription start sites or CTCF binding sites) or fragment abundance (e.g., in hypermethylated and hypomethylated partitions) is generally less than the volume of data needed to determine the presence or absence of cancer-related sequence mutations. Capturing the target region sets at different yields can facilitate sequencing the target regions to different depths of sequencing in the same sequencing run (e.g., using a pooled mixture and/or in the same sequencing cell).
[0119] In various embodiments, the methods further comprise sequencing the captured cfDNA, e.g., to different degrees of sequencing depth for the epigenetic and sequence-variable target region sets, consistent with the discussion herein. In some embodiments, complexes of target-specific probes and DNA are separated from DNA not bound to target-specific probes. For example, where target-specific probes are bound covalently or noncovalently to a solid support, a washing or aspiration step can be used to separate unbound material. Alternatively, where the complexes have chromatographic properties distinct from unbound material (e.g., where the probes comprise a ligand that binds a chromatographic resin), chromatography can be used.
[0120] As discussed in detail elsewhere herein, the set of target-specific probes may comprise a plurality of sets such as probes for a sequence-variable target region set and probes for an epigenetic target region set. In some such embodiments, the capturing step is performed with the probes for the sequence-variable target region set and the probes for the epigenetic target region set in the same vessel at the same time, e.g., the probes for the sequence-variable and epigenetic target region sets are in the same composition. This approach provides a relatively streamlined workflow. In some embodiments, the concentration of the probes for the sequence-variable target region set is greater than the concentration of the probes for the epigenetic target region set.
[0121] Alternatively, the capturing step is performed with the sequence-variable target region probe set in a first vessel and with the epigenetic target region probe set in a second vessel, or the contacting step is performed with the sequence-variable target region probe set at a first time and a first vessel and the epigenetic target region probe set at a second time before or after the first time. This approach allows for preparation of separate first and second compositions including captured DNA corresponding to the sequence-variable target region set and captured DNA corresponding to the epigenetic target region set. The compositions can be processed separately as desired (e.g., to fractionate based on methylation as described elsewhere herein) and recombined in appropriate proportions to provide material for further processing and analysis such as sequencing.
[0122] In some embodiments, the DNA is amplified. In some embodiments, amplification is performed before the capturing step. In some embodiments, amplification is performed after the capturing step.
[0123] In some embodiments, adapters are included in the DNA. This may be done concurrently with an amplification procedure, e.g., by providing the adapters in a 5’ portion of a primer, e.g., as described above. Alternatively, adapters can be added by other approaches, such as ligation.
[0124] In some embodiments, tags, which may be or include barcodes, are included in the DNA. Tags can facilitate identification of the origin of a nucleic acid. For example, barcodes can be used to allow the origin (e.g., subject) whence the DNA came to be identified following pooling of a plurality of samples for parallel sequencing. This may be done concurrently with an amplification procedure, e.g., by providing the barcodes in a 5’ portion of a primer, e.g., as described above. In some embodiments, adapters and tags/barcodes are provided by the same primer or primer set. For example, the barcode may be located 3’ of the adapter and 5’ of the target-hybridizing portion of the primer. Alternatively, barcodes can be added by other approaches, such as ligation, optionally together with adapters in the same ligation substrate.
[0125] Additional details regarding amplification, tags, and barcodes are discussed in the “General Features of the Methods” section below, which can be combined to the extent practicable with any of the foregoing embodiments and the embodiments set forth in the introduction and summary section.
Captured Set
[0126] In some embodiments, a captured set of DNA (e.g., cfDNA) is provided. With respect to the disclosed methods, the captured set of DNA may be provided, e.g., by performing a capturing step after a partitioning step as described herein. The captured set may comprise DNA corresponding to a sequence-variable target region set, an epigenetic target region set, or a combination thereof. In some embodiments the quantity of captured sequence-variable target region DNA is greater than the quantity of the captured epigenetic target region DNA, when normalized for the difference in the size of the targeted regions (footprint size).
[0127] Alternatively, first and second captured sets may be provided, including, respectively, DNA corresponding to a sequence-variable target region set and DNA corresponding to an epigenetic target region set. The first and second captured sets may be combined to provide a combined captured set.
[0128] In some embodiments in which a captured set including DNA corresponding to the sequence-variable target region set and the epigenetic target region set includes a combined captured set as discussed above, the DNA corresponding to the sequence-variable target region set may be present at a greater concentration than the DNA corresponding to the epigenetic target region set, e.g., a 1.1- to 1.2-fold greater concentration, a 1.2- to 1.4-fold greater concentration, a 1.4- to 1.6-fold greater concentration, a 1.6- to 1.8-fold greater concentration, a 1.8- to 2.0-fold greater concentration, a 2.0- to 2.2-fold greater concentration, a 2.2- to 2.4-fold greater concentration, a 2.4- to 2.6-fold greater concentration, a 2.6- to 2.8-fold greater concentration, a 2.8- to 3.0-fold greater concentration, a 3.0- to 3.5-fold greater concentration, a 3.5- to 4.0-fold greater concentration, a 4.0- to 4.5-fold greater concentration, a 4.5- to 5.0-fold greater concentration, a 5.0- to 5.5-fold greater concentration, a 5.5- to 6.0-fold greater concentration, a 6.0- to 6.5-fold greater concentration, a 6.5- to 7.0-fold greater concentration, a 7.0- to 7.5-fold greater concentration, a 7.5- to 8.0-fold greater concentration, an 8.0- to 8.5-fold greater concentration, an 8.5- to 9.0-fold greater concentration, a 9.0- to 9.5-fold greater concentration, a 9.5- to 10.0-fold greater concentration, a 10- to 11-fold greater concentration, an 11- to 12-fold greater concentration, a 12- to 13-fold greater concentration, a 13- to 14-fold greater concentration, a 14- to 15-fold greater concentration, a 15- to 16-fold greater concentration, a 16- to 17-fold greater concentration, a 17- to 18-fold greater concentration, an 18- to 19-fold greater concentration, a 19- to 20-fold greater concentration, a 20- to 30-fold greater concentration, a 30- to 40-fold greater concentration, a 40- to 50-fold greater concentration, a 50- to 60-fold greater concentration, a 60- to 70-fold greater concentration, a 70- to 80-fold greater concentration, an 80- to 90-fold greater concentration, a 90- to 100-fold greater concentration, a 10- to 20-fold greater concentration, a 10- to 40-fold greater concentration, a 10- to 50-fold greater concentration, a 10- to 70-fold greater concentration, or a 10- to 100-fold greater concentration. The degree of difference in concentrations accounts for normalization for the footprint sizes of the target regions, as discussed in the definition section.
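By way of a non-limiting illustration, a footprint-normalized fold difference may be computed by dividing each captured quantity by the footprint of its target region set before taking the ratio. The following Python sketch assumes that reading of the normalization; the function name, the units (nanograms and kilobases), and the example values are hypothetical, and the definition section governs the normalization actually intended.

```python
def normalized_fold_difference(seq_var_quantity_ng, seq_var_footprint_kb,
                               epi_quantity_ng, epi_footprint_kb):
    """Return the per-kb quantity ratio (sequence-variable / epigenetic)."""
    if seq_var_footprint_kb <= 0 or epi_footprint_kb <= 0:
        raise ValueError("footprints must be positive")
    if epi_quantity_ng <= 0:
        raise ValueError("epigenetic quantity must be positive")
    # Quantity of captured DNA per kb of targeted footprint.
    seq_var_density = seq_var_quantity_ng / seq_var_footprint_kb
    epi_density = epi_quantity_ng / epi_footprint_kb
    return seq_var_density / epi_density


# Hypothetical values: 2 ng captured over a 50 kb sequence-variable
# footprint versus 4 ng captured over a 500 kb epigenetic footprint.
fold = normalized_fold_difference(2.0, 50.0, 4.0, 500.0)
```

With these hypothetical inputs, the epigenetic set holds more DNA in absolute terms, yet the sequence-variable set is present at roughly a 5-fold greater concentration once footprint size is taken into account.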
Epigenetic Target Region Set
[0129] The epigenetic target region set may comprise one or more types of target regions likely to differentiate DNA from neoplastic (e.g., tumor or cancer) cells and from healthy cells, e.g., non-neoplastic circulating cells. Exemplary types of such regions are discussed in detail herein. The epigenetic target region set may also comprise one or more control regions, e.g., as described herein. In some embodiments, the epigenetic target region set has a footprint of at least 100 kb, e.g., at least 200 kb, at least 300 kb, or at least 400 kb. In some embodiments, the epigenetic target region set has a footprint in the range of 100-1000 kb, e.g., 100-200 kb, 200-300 kb, 300-400 kb, 400-500 kb, 500-600 kb, 600-700 kb, 700-800 kb, 800-900 kb, and 900-1,000 kb.
Hypermethylation Variable Target Regions
[0130] In some embodiments, the epigenetic target region set includes one or more hypermethylation variable target regions. In general, hypermethylation variable target regions refer to regions where an increase in the level of observed methylation, e.g., in a cfDNA sample, indicates an increased likelihood that a sample (e.g., of cfDNA) contains DNA produced by neoplastic cells, such as tumor or cancer cells. For example, hypermethylation of promoters of tumor suppressor genes has been observed repeatedly. See, e.g., Kang et al., Genome Biol. 18:53 (2017) and references cited therein. In an example, hypermethylation variable target regions can include regions that do not necessarily differ in methylation in cancerous tissue relative to DNA from healthy tissue of the same type, but do differ in methylation (e.g., have more methylation) relative to cfDNA that is typical in healthy subjects. Where, for example, the presence of a cancer results in increased cell death such as apoptosis of cells of the tissue type corresponding to the cancer, such a cancer can be detected at least in part using such hypermethylation variable target regions. In some embodiments, hypermethylation variable target regions include one or more genomic regions, where the cfDNA molecules in those regions do not differ in methylation state in cancer subjects relative to cfDNA from healthy subjects, but the presence/increased quantity of hypermethylated cfDNA in those regions is indicative of a particular tissue type (e.g., cancer origin) and is presented as cfDNA with increased apoptosis (e.g. tumor shedding) into circulation.
[0131] Hypermethylation target regions may be obtained, e.g., from the Cancer Genome Atlas. Kang et al., Genome Biology 18:53 (2017), describe construction of a probabilistic method called CancerLocator using hypermethylation target regions from breast, colon, kidney, liver, and lung. In some embodiments, the hypermethylation target regions can be specific to one or more types of cancer. Accordingly, in some embodiments, the hypermethylation target regions include one, two, three, four, or five subsets of hypermethylation target regions that collectively show hypermethylation in one, two, three, four, or five of breast, colon, kidney, liver, and lung cancers.
[0132] In some embodiments, the probes for the epigenetic target region set comprise probes specific for one or more hypermethylation variable target regions. The hypermethylation variable target regions may be any of those set forth above. For example, in some embodiments, the probes specific for hypermethylation variable target regions comprise probes specific for a plurality of loci listed in Table 1, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the loci listed in Table 1. In some embodiments, the probes specific for hypermethylation variable target regions comprise probes specific for a plurality of loci listed in Table 2, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the loci listed in Table 2. In some embodiments, the probes specific for hypermethylation variable target regions comprise probes specific for a plurality of loci listed in Table 1 or Table 2, e.g., at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the loci listed in Table 1 or Table 2. In some embodiments, for each locus included as a target region, there may be one or more probes with a hybridization site that binds between the transcription start site and the stop codon (the last stop codon for genes that are alternatively spliced) of the gene. In some embodiments, the one or more probes bind within 300 bp of the listed position, e.g., within 200 or 100 bp. In some embodiments, a probe has a hybridization site overlapping the position listed above. In some embodiments, the probes specific for the hypermethylation target regions include probes specific for one, two, three, four, or five subsets of hypermethylation target regions that collectively show hypermethylation in one, two, three, four, or five of breast, colon, kidney, liver, and lung cancers.
Hypomethylation Variable Target Regions
[0133] Global hypomethylation is a commonly observed phenomenon in various cancers. See, e.g., Hon et al., Genome Res. 22:246-258 (2012) (breast cancer); Ehrlich, Epigenomics 1:239-259 (2009) (review article noting observations of hypomethylation in colon, ovarian, prostate, leukemia, hepatocellular, and cervical cancers). For example, regions such as repeated elements, e.g., LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and satellite DNA, and intergenic regions that are ordinarily methylated in healthy cells may show reduced methylation in tumor cells. Accordingly, in some embodiments, the epigenetic target region set includes hypomethylation variable target regions, where a decrease in the level of observed methylation indicates an increased likelihood that a sample (e.g., of cfDNA) contains DNA produced by neoplastic cells, such as tumor or cancer cells. In an example, hypomethylation variable target regions can include regions that do not necessarily differ in methylation state in cancerous tissue relative to DNA from healthy tissue of the same type, but do differ in methylation (e.g., are less methylated) relative to cfDNA that is typical in healthy subjects. Where, for example, the presence of a cancer results in increased cell death such as apoptosis of cells of the tissue type corresponding to the cancer, such a cancer can be detected at least in part using such hypomethylation variable target regions. In some embodiments, hypomethylation variable target regions include one or more genomic regions, where the cfDNA molecules in those regions do not differ in methylation state in cancer subjects relative to cfDNA from healthy subjects, but the presence/increased quantity of hypomethylated cfDNA in those regions is indicative of a particular tissue type (e.g., cancer origin) and is presented as cfDNA with increased apoptosis (e.g. tumor shedding) into circulation.
[0134] In some embodiments, hypomethylation variable target regions include repeated elements and/or intergenic regions. In some embodiments, repeated elements include one, two, three, four, or five of LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and/or satellite DNA.
[0135] Exemplary specific genomic regions that show cancer-associated hypomethylation include nucleotides 8403565-8953708 and 151104701-151106035 of human chromosome 1. In some embodiments, the hypomethylation variable target regions overlap or comprise one or both of these regions.
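As an illustrative sketch only, the following Python example checks whether a candidate target region overlaps either of the two exemplary chromosome 1 intervals recited above. The coordinates are taken from the preceding paragraph; the treatment of the coordinates as 1-based and inclusive, the accepted chromosome labels, and the function name overlapping_regions are assumptions made for illustration.

```python
# Exemplary cancer-associated hypomethylation intervals on human chromosome 1
# (coordinates from the text; assumed here to be 1-based and inclusive).
HYPO_REGIONS_CHR1 = [(8403565, 8953708), (151104701, 151106035)]


def overlapping_regions(chrom, start, end, regions=HYPO_REGIONS_CHR1):
    """Return the exemplary intervals that overlap [start, end] on chrom."""
    if start > end:
        raise ValueError("start must not exceed end")
    if chrom not in ("1", "chr1"):
        return []
    # Two closed intervals overlap when each starts before the other ends.
    return [(s, e) for (s, e) in regions if start <= e and end >= s]


# A hypothetical candidate region straddling the end of the first interval.
hits = overlapping_regions("chr1", 8953000, 8954000)
```

A region that comprises an entire exemplary interval also satisfies the same test, so the sketch covers both the "overlap" and "comprise" cases.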
[0136] In some embodiments, the probes for the epigenetic target region set comprise probes specific for one or more hypomethylation variable target regions. The hypomethylation variable target regions may be any of those set forth above. For example, the probes specific for one or more hypomethylation variable target regions may include probes for regions such as repeated elements, e.g., LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and satellite DNA, and intergenic regions that are ordinarily methylated in healthy cells but may show reduced methylation in tumor cells.
[0137] In some embodiments, probes specific for hypomethylation variable target regions include probes specific for repeated elements and/or intergenic regions. In some embodiments, probes specific for repeated elements include probes specific for one, two, three, four, or five of LINE1 elements, Alu elements, centromeric tandem repeats, pericentromeric tandem repeats, and/or satellite DNA.
[0138] Exemplary probes specific for genomic regions that show cancer-associated hypomethylation include probes specific for nucleotides 8403565-8953708 and/or 151104701-151106035 of human chromosome 1. In some embodiments, the probes specific for hypomethylation variable target regions include probes specific for regions overlapping or including nucleotides 8403565-8953708 and/or 151104701-151106035 of human chromosome 1.
[0139] Probes for detecting the panel of regions can include those for detecting genomic regions of interest (hotspot regions) as well as nucleosome-aware probes (e.g., KRAS codons 12 and 13) and may be designed to optimize capture based on analysis of cfDNA coverage and fragment size variation impacted by nucleosome binding patterns and GC sequence composition. Regions used herein can also include non-hotspot regions optimized based on nucleosome positions and GC models.
Subjects
[0140] In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject having a cancer. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject suspected of having a cancer. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject having a tumor. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject suspected of having a tumor. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject having neoplasia. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject suspected of having neoplasia. In some embodiments, the DNA (e.g., cfDNA) is obtained from a subject in remission from a tumor, cancer, or neoplasia (e.g., following chemotherapy, surgical resection, radiation, or a combination thereof). In any of the foregoing embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia may be of the lung, colon, rectum, kidney, breast, prostate, or liver. In some embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the lung. In some embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the colon or rectum. In some embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the breast. In some embodiments, the cancer, tumor, or neoplasia or suspected cancer, tumor, or neoplasia is of the prostate. In any of the foregoing embodiments, the subject may be a human subject.
[0141] In some embodiments, the sequence-variable target region probe set has a footprint of at least 0.5 kb, e.g., at least 1 kb, at least 2 kb, at least 5 kb, at least 10 kb, at least 20 kb, at least 30 kb, or at least 40 kb. In some embodiments, the epigenetic target region probe set has a footprint in the range of 0.5-100 kb, e.g., 0.5-2 kb, 2-10 kb, 10-20 kb, 20-30 kb, 30-40 kb, 40-50 kb, 50-60 kb, 60-70 kb, 70-80 kb, 80-90 kb, and 90-100 kb.
[0142] In some embodiments, the probes specific for the sequence-variable target region set comprise probes specific for target regions from at least 10, 20, 30, or 35 cancer-related genes, such as AKT1, ALK, BRAF, CCND1, CDK2A, CTNNB1, EGFR, ERBB2, ESR1, FGFR1, FGFR2, FGFR3, FOXL2, GATA3, GNA11, GNAQ, GNAS, HRAS, IDH1, IDH2, KIT, KRAS, MED12, MET, MYC, NFE2L2, NRAS, PDGFRA, PIK3CA, PPP2R1A, PTEN, RET, STK11, TP53, and U2AF1.
Compositions Including Captured DNA
[0143] Provided herein is a combination including first and second populations of captured DNA. The first population may comprise or be derived from DNA with a cytosine modification in a greater proportion than the second population. The first population may comprise a form of a first nucleobase originally present in the DNA with altered base pairing specificity and a second nucleobase without altered base pairing specificity, wherein the form of the first nucleobase originally present in the DNA prior to alteration of base pairing specificity is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the form of the first nucleobase originally present in the DNA prior to alteration of base pairing specificity and the second nucleobase have the same base pairing specificity. The second population does not comprise the form of the first nucleobase originally present in the DNA with altered base pairing specificity. In some embodiments, the cytosine modification is cytosine methylation. In some embodiments, the first nucleobase is a modified or unmodified cytosine and the second nucleobase is a modified or unmodified cytosine. The first and second nucleobase may be any of those discussed herein in the Summary or with respect to subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample.
[0144] In some embodiments, the first population includes a sequence tag selected from a first set of one or more sequence tags and the second population includes a sequence tag selected from a second set of one or more sequence tags, and the second set of sequence tags is different from the first set of sequence tags. The sequence tags may comprise barcodes.
[0145] In some embodiments, the first population includes protected hmC, such as glucosylated hmC. In some embodiments, the first population was subjected to any of the conversion procedures discussed herein, such as bisulfite conversion, Ox-BS conversion, TAB conversion, ACE conversion, TAP conversion, TAPSP conversion, or CAP conversion. In some embodiments, the first population was subjected to protection of hmC followed by deamination of mC and/or C. In some embodiments of the combination, the first population includes or was derived from DNA with a cytosine modification in a greater proportion than the second population and the first population includes first and second subpopulations, and the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity. In some embodiments, the second population does not comprise the first nucleobase. In some embodiments, the first nucleobase is a modified or unmodified cytosine, and the second nucleobase is a modified or unmodified cytosine, optionally wherein the modified cytosine is mC or hmC. In some embodiments, the first nucleobase is a modified or unmodified adenine, and the second nucleobase is a modified or unmodified adenine, optionally wherein the modified adenine is mA.
[0146] In some embodiments, the first nucleobase (e.g., a modified cytosine) is biotinylated. In some embodiments, the first nucleobase (e.g., a modified cytosine) is a product of a Huisgen cycloaddition to β-6-azide-glucosyl-5-hydroxymethylcytosine that includes an affinity label (e.g., biotin).
[0147] In any of the combinations described herein, the captured DNA may comprise cfDNA. The captured DNA may have any of the features described herein concerning captured sets, including, e.g., a greater concentration of the DNA corresponding to the sequence-variable target region set (normalized for footprint size as discussed above) than of the DNA corresponding to the epigenetic target region set. In some embodiments, the DNA of the captured set includes sequence tags, which may be added to the DNA as described herein. In general, the inclusion of sequence tags results in the DNA molecules differing from their naturally occurring, untagged form.
[0148] The combination may further comprise a probe set described herein or sequencing primers, each of which may differ from naturally occurring nucleic acid molecules. For example, a probe set described herein may comprise a capture moiety, and sequencing primers may comprise a non-naturally occurring label.
Computer Systems, Processing of Real World Evidence (RWE)
[0149] Methods of the present disclosure can be implemented using, or with the aid of, computer systems. For example, such methods may comprise: partitioning the sample into a plurality of subsamples, including a first subsample and a second subsample, wherein the first subsample includes DNA with a cytosine modification in a greater proportion than the second subsample; subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity; and sequencing DNA in the first subsample and DNA in the second subsample in a manner that distinguishes the first nucleobase from the second nucleobase in the DNA of the first subsample.
[0150] In an aspect, the present disclosure provides a non-transitory computer-readable medium including computer-executable instructions which, when executed by at least one electronic processor, perform at least a portion of a method including: collecting cfDNA from a test subject; capturing a plurality of sets of target regions from the cfDNA, wherein the plurality of target region sets includes a sequence-variable target region set and an epigenetic target region set, whereby a captured set of cfDNA molecules is produced; sequencing the captured cfDNA molecules, wherein the captured cfDNA molecules of the sequence-variable target region set are sequenced to a greater depth of sequencing than the captured cfDNA molecules of the epigenetic target region set; obtaining a plurality of sequence reads generated by a nucleic acid sequencer from sequencing the captured cfDNA molecules; mapping the plurality of sequence reads to one or more reference sequences to generate mapped sequence reads; and processing the mapped sequence reads corresponding to the sequence-variable target region set and to the epigenetic target region set to determine the likelihood that the subject has cancer.
[0151] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
[0152] Additional details relating to computer systems and networks, databases, and computer program products are also provided in, for example, Peterson, Computer Networks: A Systems Approach, Morgan Kaufmann, 5th Ed. (2011), Kurose, Computer Networking: A Top-Down Approach, Pearson, 7th Ed. (2016), Elmasri, Fundamentals of Database Systems, Addison Wesley, 6th Ed. (2010), Coronel, Database Systems: Design, Implementation, & Management, Cengage Learning, 11th Ed. (2014), Tucker, Programming Languages, McGraw-Hill Science/Engineering/Math, 2nd Ed. (2006), and Rhoton, Cloud Computing Architected: Solution Design Handbook, Recursive Press (2011), each of which is hereby incorporated by reference in its entirety.
[0153] Figure 1 illustrates an example architecture 100 to generate an integrated data repository that includes multiple types of healthcare data, according to one or more implementations. The architecture 100 may include a data integration and analysis system 102. The data integration and analysis system 102 may obtain data from a number of data sources and integrate the data from the data sources into an integrated data repository 104. For example, the data integration and analysis system 102 may obtain data from a health insurance claims data repository 106. In various examples, the data integration and analysis system 102 and the health insurance claims data repository 106 may be created and maintained by different entities. In one or more additional examples, the data integration and analysis system 102 and the health insurance claims data repository 106 may be created and maintained by a same entity.
[0154] The data integration and analysis system 102 may be implemented by one or more computing devices. The one or more computing devices may include one or more server computing devices, one or more desktop computing devices, one or more laptop computing devices, one or more tablet computing devices, one or more mobile computing devices, or combinations thereof. In certain implementations, at least a portion of the one or more computing devices may be implemented in a distributed computing environment. For example, at least a portion of the one or more computing devices may be implemented in a cloud computing architecture. In scenarios where the computing systems used to implement the data integration and analysis system 102 are configured in a distributed computing architecture, processing operations may be performed concurrently by multiple virtual machines. In various examples, the data integration and analysis system 102 may implement multithreading techniques. The implementation of a distributed computing architecture and multithreading techniques causes the data integration and analysis system 102 to utilize fewer computing resources in relation to computing architectures that do not implement these techniques.
[0155] The health insurance claims data repository 106 may store information obtained from one or more health insurance companies that corresponds to insurance claims made by subscribers of the one or more health insurance companies. The health insurance claims data repository 106 may be arranged (e.g., sorted) by patient identifier. The patient identifier may be based on the patient’s first name, last name, date of birth, social security number, address, employer, and the like. The data stored by the health insurance claims data repository 106 may include structured data that is arranged in one or more data tables. The one or more data tables storing the structured data may include a number of rows and a number of columns that indicate information about health insurance claims made by subscribers of one or more health insurance companies in relation to procedures and/or treatments received by the subscribers from healthcare providers. At least a portion of the rows and columns of the data tables stored by the health insurance claims data repository 106 may include health insurance codes that may indicate diagnoses of biological conditions, and treatments and/or procedures obtained by subscribers of the one or more health insurance companies. In various examples, the health insurance codes may also indicate diagnostic procedures obtained by individuals that are related to one or more biological conditions that may be present in the individuals. In one or more examples, a diagnostic procedure may provide information used in the detection of the presence of a biological condition. A diagnostic procedure may also provide information used to determine a progression of a biological condition. In one or more illustrative examples, a diagnostic procedure may include one or more imaging procedures, one or more assays, one or more laboratory procedures, one or more combinations thereof, and the like.
[0156] The data integration and analysis system 102 may also obtain information from a molecular data repository 108. The molecular data repository 108 may store data of a number of individuals related to genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, and/or proteomic information. In one or more examples, the data integration and analysis system 102 and the molecular data repository 108 may be created and maintained by different entities. In one or more additional examples, the data integration and analysis system 102 and the molecular data repository 108 may be created and maintained by a same entity.
[0157] The genomic information may indicate one or more mutations corresponding to genes of the individuals. A mutation to a gene of individuals may correspond to differences between a sequence of nucleic acids of the individuals and one or more reference genomes. The reference genome may include a known reference genome, such as hg19. In various examples, a mutation of a gene of an individual may correspond to a difference in a germline gene of an individual in relation to the reference genome. In one or more additional examples, the reference genome may include a germline genome of an individual. In one or more further examples, a mutation to a gene of an individual may include a somatic mutation. Mutations to genes of individuals may be related to insertions, deletions, single nucleotide variants, loss of heterozygosity, duplication, amplification, translocation, fusion genes, or one or more combinations thereof.
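To illustrate, in a deliberately simplified form, how a mutation may correspond to a difference between a sequence of an individual and a reference sequence, the following Python sketch reports single nucleotide differences between two already-aligned sequences of equal length. The sequences, the 1-based position convention, and the function name are hypothetical; insertions, deletions, and the other mutation types listed above are outside the scope of this sketch and would require an alignment step.

```python
def single_nucleotide_variants(reference, sample, first_position=1):
    """List (position, reference_base, sample_base) for each mismatch.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + first_position, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical reference segment and aligned sample segment.
variants = single_nucleotide_variants("ACGTAC", "ACCTAT")
```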
[0158] In one or more illustrative examples, genomic information stored by the molecular data repository 108 may include genomic profiles of tumor cells present within individuals. In these situations, the genomic information may be derived from an analysis of genetic material, such as deoxyribonucleic acid (DNA) and/or ribonucleic acid (RNA) from a sample, including, but not limited to, a tissue sample or tumor biopsy, circulating tumor cells (CTCs), exosomes or efferosomes, or from circulating nucleic acids (e.g., cell-free DNA) found in blood samples of individuals that is present due to the degradation of tumor cells present in the individuals. In one or more examples, the genomic information of tumor cells of individuals may correspond to one or more target regions. One or more mutations present with respect to the one or more target regions may indicate the presence of tumor cells in individuals. The genomic information stored by the molecular data repository 108 may be generated in relation to an assay or other diagnostic test that may determine one or more mutations with respect to one or more target regions of the reference genome.
[0159] In one or more additional examples, the data integration and analysis system 102 may obtain information from one or more additional data repositories 110. The one or more additional data repositories 110 may store data related to electronic medical records of individuals for which data is present in at least one of the health insurance claims data repository 106 or the molecular data repository 108. Further, the one or more additional data repositories 110 may store data related to pathology reports of individuals for which data is present in at least one of the health insurance claims data repository 106 or the molecular data repository 108. In various examples, the one or more additional data repositories 110 may store data related to biological conditions and/or treatments for biological conditions. In one or more examples, the data integration and analysis system 102 and at least a portion of the one or more additional data repositories 110 may be created and maintained by different entities. In one or more further examples, the data integration and analysis system 102 and at least a portion of the one or more additional data repositories 110 may be created and maintained by a same entity.
[0160] In one or more further implementations, the data integration and analysis system 102 may obtain information from one or more reference information data repositories 112. The one or more reference information data repositories 112 may store information that includes definitions, standards, protocols, vocabularies, one or more combinations thereof, and the like. In various examples, the information stored by the one or more reference information data repositories may correspond to biological conditions and/or treatments for biological conditions. In one or more illustrative examples, the one or more reference information data repositories 112 may include RxNorm. (RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies used in pharmacy management and drug interaction software.) In one or more examples, the data integration and analysis system 102 and at least a portion of the one or more reference information data repositories 112 may be created and maintained by different entities. In one or more further examples, the data integration and analysis system 102 and at least a portion of the one or more reference information data repositories 112 may be created and maintained by a same entity.
[0161] The data integration and analysis system 102 may obtain data from at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112 via one or more communication networks accessible to the data integration and analysis system 102 and accessible to at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112. The data integration and analysis system 102 may also obtain data from at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112 via one or more secure communication channels. In addition, the data integration and analysis system 102 may obtain data from at least one of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, or the reference information data repositories 112 via one or more calls of an application programming interface (API).
[0162] The data integration and analysis system 102 may include a data integration system 114. The data integration system 114 may obtain data from the health insurance claims data repository 106 and the molecular data repository 108 to generate the integrated data repository 104. The data integration system 114 may also obtain data from the one or more additional data repositories 110 to generate the integrated data repository 104. In various examples, the data integration system 114 may implement one or more natural language processing techniques to integrate data from the one or more additional data repositories 110 into the integrated data repository 104.
[0163] In one or more examples, the data integration system 114 may generate one or more tokens to identify individuals that have data stored in the health insurance claims data repository 106 and that have data stored in the molecular data repository 108. In various examples, the data integration system 114 may generate one or more tokens by implementing one or more hash functions. The data integration system 114 may implement the one or more hash functions to generate the one or more tokens based on information stored by at least one of the health insurance claims data repository 106 or the molecular data repository 108. For example, the information used by the data integration system 114 to generate individual tokens by implementing a hash function may include at least one of an identifier of respective individuals, a date of birth of the respective individuals, a postal code of the respective individuals, or a gender of the respective individuals. In one or more illustrative examples, the identifiers of the respective individuals may include a combination of at least a portion of a first name of the respective individuals and at least a portion of a last name of the respective individuals. Tokens generated using data from different data repositories may correspond to the same or similar information, or to information of the same or similar type, stored by the different data repositories. To illustrate, tokens may be generated using a portion of names of individuals, date of birth, at least a portion of a postal code, and gender obtained from the health insurance claims data repository 106 and the molecular data repository 108.
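As a non-limiting sketch of the token generation described above, the following Python fragment hashes a normalized combination of name portions, date of birth, a postal code portion, and gender. The function name make_token, the choice of SHA-256, the field lengths, the separator, and the sample values are all assumptions made for illustration and are not taken from the disclosure.

```python
import hashlib


def make_token(first_name, last_name, date_of_birth, postal_code, gender):
    """Build a token from a common set of fields (hypothetical layout)."""
    # Normalize each field so the same individual yields the same token
    # regardless of which repository the raw values came from.
    parts = [
        first_name.strip().lower()[:3],  # assumed: first 3 letters of first name
        last_name.strip().lower()[:5],   # assumed: first 5 letters of last name
        date_of_birth.strip(),           # assumed: ISO format YYYY-MM-DD
        postal_code.strip()[:3],         # assumed: first 3 characters of postal code
        gender.strip().lower(),
    ]
    # SHA-256 is an assumed choice of hash function.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# Hypothetical records for the same individual, formatted differently
# in a claims repository and in a molecular data repository.
claims_token = make_token("Alice", "Example", "1970-01-31", "12345", "F")
molecular_token = make_token(" alice ", "EXAMPLE", "1970-01-31", "12345-6789", "f")
print(claims_token == molecular_token)
```

Normalizing before hashing is one possible design choice; without it, small formatting differences between repositories would yield different tokens for the same individual.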
[0164] The data integration system 114 may integrate data from a number of different data sources by analyzing tokens generated by implementing one or more hash functions using data obtained from the number of different data sources. For example, the data integration system 114 may obtain one or more first tokens generated from data stored by the health insurance claims data repository 106 and one or more second tokens generated from data stored by the molecular data repository 108. The data integration system 114 may analyze the one or more first tokens with respect to the one or more second tokens to determine individual first tokens that correspond to individual second tokens. In one or more illustrative examples, the data integration system 114 may identify individual first tokens that match individual second tokens. A first token may match a second token when the data of the first token has at least a threshold amount of similarity with respect to the data of the second token. In one or more examples, a first token may match a second token when the data of the first token is the same as the data of the second token. To illustrate, a first token may match a second token when an alphanumeric string of the first token is the same as an alphanumeric string of the second token.
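As a non-limiting sketch of the exact-match case described above, the following Python fragment pairs records whose first token and second token are the same alphanumeric string. The function name integrate, the dictionary layout, and the placeholder token and record values are assumptions for illustration; matching based on a threshold amount of similarity is not shown.

```python
def integrate(claims_by_token, molecular_by_token):
    """Combine records whose tokens are identical strings in both sources."""
    # Dictionary key views support set intersection directly.
    shared = claims_by_token.keys() & molecular_by_token.keys()
    return {
        token: {
            "claims": claims_by_token[token],
            "molecular": molecular_by_token[token],
        }
        for token in sorted(shared)
    }


# Placeholder tokens and record contents (hypothetical).
claims_by_token = {"tok-a": {"claim": "claim-1"}, "tok-b": {"claim": "claim-2"}}
molecular_by_token = {"tok-b": {"assay": "assay-7"}, "tok-c": {"assay": "assay-9"}}

integrated = integrate(claims_by_token, molecular_by_token)
print(integrated)
```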
[0165] By determining a first token generated using data stored by the health insurance claims data repository 106 that corresponds to a second token generated using data stored by the molecular data repository 108, the data integration system 114 may identify an individual having data that is stored in both the health insurance claims data repository 106 and in the molecular data repository 108. In this way, the data integration system 114 may obtain data from the health insurance claims data repository 106 from a number of individuals and data from the molecular data repository 108 from the same number of individuals and store the health insurance claims data and the molecular data for the number of individuals in the integrated data repository 104.
[0166] The data integration system 114 may also integrate data stored by the one or more additional data repositories 110 with data from the health insurance claims data repository 106 and the molecular data repository 108 to generate the integrated data repository 104. To illustrate, the data integration system 114 may obtain one or more third tokens generated from data stored by an additional data repository 110, such as a data repository storing data corresponding to pathology reports. The data integration system 114 may analyze the one or more third tokens with respect to the first tokens generated using information stored by the health insurance claims data repository 106 and the second tokens generated using information stored by the molecular data repository 108 to determine respective third tokens that correspond to individual first tokens and individual second tokens. In one or more illustrative examples, the data integration system 114 may identify third tokens generated using one or more hash functions and a common set of information obtained from the health insurance claims data repository 106, the molecular data repository 108, and the additional data repository 110.
[0167] By determining a third token generated using data stored by an additional data repository 110 that corresponds to a first token generated using data stored by the health insurance claims data repository 106 and a second token generated using data stored by the molecular data repository 108, the data integration system 114 may identify an individual having data that is stored in the health insurance claims data repository 106, the molecular data repository 108, and in an additional data repository 110. In this way, the data integration system 114 may obtain data from the health insurance claims data repository 106 from a number of individuals and data from the molecular data repository 108 and an additional data repository 110 from the same number of individuals and store the health insurance claims data, the molecular data, and the additional data for the number of individuals in the integrated data repository 104.
[0168] The data stored by the integrated data repository 104 for the number of individuals may be accessible using respective identifiers of individuals. The data integration system 114 may implement a number of techniques as part of a de-identification process with respect to storing and retrieving information of individuals in the integrated data repository 104. The identifiers of individuals may correspond to keys that are generated using at least one hash function. The identifiers of the individuals may also be generated by implementing one or more salting processes with respect to the keys generated using the at least one hash function and/or with respect to the tokens generated using one or more hash functions and a common set of information obtained from the health insurance claims data repository 106, the molecular data repository 108, and/or the additional data repository 110. In one or more illustrative examples, the identifiers generated by the data integration system 114 to access information for respective individuals that is stored by the integrated data repository 104 may be unique for each individual. In one or more examples, the identifiers of the individuals may be generated using at least a portion of the information used to generate the tokens related to the individuals. In one or more additional examples, the identifiers of the individuals may be generated using different information from the information used to generate the tokens related to the individuals.
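As a non-limiting sketch of a salting process of the kind described above, the following Python fragment derives an identifier by hashing a salt together with a token. The function name make_identifier, the use of SHA-256, the 16-byte salt length, and the placeholder token are assumptions for illustration.

```python
import hashlib
import secrets


def make_identifier(token: str, salt: bytes) -> str:
    """Derive a repository identifier from a token and a salt (assumed scheme)."""
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


# A random salt; 16 bytes is an assumed length.
salt = secrets.token_bytes(16)

identifier = make_identifier("tok-b", salt)
# The same token and salt always give the same identifier, while a
# different salt gives an identifier unrelated to the first one.
print(identifier == make_identifier("tok-b", salt))
print(identifier == make_identifier("tok-b", secrets.token_bytes(16)))
```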
[0169] The data integration system 114 may also generate the integrated data repository 104 from a number of different combinations of data repositories in a similar manner. For example, the data integration system 114 may obtain tokens generated from information stored by the health insurance claims data repository 106 and additional tokens generated from information stored by the one or more additional data repositories 110. The data integration system 114 may determine individual tokens generated from information stored by the health insurance claims data repository 106 that correspond to individual additional tokens generated from information stored by the one or more additional data repositories 110. By determining tokens generated using data stored by the health insurance claims data repository 106 that correspond to additional tokens generated using data stored by an additional data repository 110, the data integration system 114 may identify individuals having data that is stored in both the health insurance claims data repository 106 and in the additional data repository 110. In this way, the data integration system 114 may obtain data from the health insurance claims data repository 106 from a number of individuals and data from the additional data repository 110 from the same number of individuals and store the health insurance claims data and the additional data for the number of individuals in the integrated data repository 104. The health insurance claims data and the additional data stored by the integrated data repository 104 for the number of individuals may be accessible using respective identifiers of individuals.
[0170] In one or more further examples, the data integration system 114 may obtain tokens generated from information stored by the molecular data repository 108 and additional tokens generated from information stored by the one or more additional data repositories 110. The data integration system 114 may determine individual tokens generated from information stored by the molecular data repository 108 that correspond to individual additional tokens generated from information stored by the one or more additional data repositories 110. By determining tokens generated using data stored by the molecular data repository 108 that correspond to additional tokens generated using data stored by an additional data repository 110, the data integration system 114 may identify individuals having data that is stored in both the molecular data repository 108 and in the additional data repository 110. In this way, the data integration system 114 may obtain data from the molecular data repository 108 from a number of individuals and data from the additional data repository 110 from the same number of individuals and store the molecular data and the additional data for the number of individuals in the integrated data repository 104. The molecular data and the additional data stored by the integrated data repository 104 for the number of individuals may be accessible using respective identifiers of individuals.
[0171] The data stored by the integrated data repository 104 may be stored according to one or more regulatory frameworks that protect the privacy and ensure the security of medical records, health information, and insurance information of individuals. For example, data may be stored by the integrated data repository 104 in accordance with one or more governmental regulatory frameworks directed to protecting personal information, such as the Health Insurance Portability and Accountability Act (HIPAA) and/or the General Data Protection Regulation (GDPR). The integrated data repository 104 also stores data in an anonymized and de-identified manner to ensure protection of the privacy of individuals that have data stored by the integrated data repository 104. To further ensure the privacy of individuals that have data stored by the integrated data repository 104, the data integration system 114 may re-generate the integrated data repository 104 periodically. For example, the data integration system 114 may create the integrated data repository 104 once per quarter. In one or more additional examples, the data integration system 114 may generate the integrated data repository 104 on a monthly basis, on a weekly basis, or once every two weeks. By re-generating the integrated data repository 104 on a periodic basis, rather than simply refreshing the integrated data repository 104 when new data is available, the data integration system 114 enhances privacy protection with respect to data stored by the integrated data repository 104. That is, in situations where data repositories are simply refreshed with new data, it may be easier to track individuals associated with data that has been newly added to a data repository because the number of new individuals added at a given time is typically smaller than the number of individuals that already have data stored by the data repository.
[0172] In various examples, data stored by the integrated data repository 104 may be accessed via a database management system. In addition, the integrated data repository 104 may store data according to one or more database models. In one or more examples, the integrated data repository 104 may store data according to one or more relational database technologies. For example, the integrated data repository 104 may store data according to a relational database model. In one or more additional examples, the integrated data repository 104 may store data according to an object-oriented database model. In one or more further examples, the integrated data repository 104 may store data according to an extensible markup language (XML) database model. In still additional examples, the integrated data repository 104 may store data according to a structured query language (SQL) database model. In still further examples, the integrated data repository may store data according to an image database model.
[0173] The data integration system 114 may generate the integrated data repository 104 by generating a number of data tables and creating links between the data tables. The links may indicate logical couplings between the data tables. The data integration system 114 may generate the data tables by extracting specified sets of data from the information obtained from the data repositories 106, 108, 110, 112 and storing the data in rows and columns of respective data tables. In various examples, the logical couplings between data tables may include at least one of a one-to-one link where a row of information in one data table corresponds to a row of information in another data table, a one-to-many link where a row of information in one data table corresponds to multiple rows of information in another data table, or a many-to-many link where multiple rows of information of one data table correspond to multiple rows of information in another data table.
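As a non-limiting sketch of linked data tables, the following Python fragment uses the standard sqlite3 module to create two tables joined by a one-to-many link, where one row for an individual corresponds to multiple rows of claims. The table names, column names, and inserted values are hypothetical, and a relational store is only one of the database models contemplated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables; claim.individual_id is the logical coupling
# (one individual row corresponds to many claim rows).
conn.executescript(
    """
    CREATE TABLE individual (
        individual_id TEXT PRIMARY KEY,
        birth_year INTEGER
    );
    CREATE TABLE claim (
        claim_id INTEGER PRIMARY KEY,
        individual_id TEXT NOT NULL REFERENCES individual(individual_id),
        code TEXT
    );
    """
)
conn.execute("INSERT INTO individual VALUES (?, ?)", ("id-001", 1970))
conn.executemany(
    "INSERT INTO claim (individual_id, code) VALUES (?, ?)",
    [("id-001", "code-A"), ("id-001", "code-B")],
)

# Following the link: retrieving an individual also retrieves the
# associated rows of the linked table.
rows = conn.execute(
    """
    SELECT i.individual_id, c.code
    FROM individual AS i
    JOIN claim AS c ON c.individual_id = i.individual_id
    ORDER BY c.claim_id
    """
).fetchall()
print(rows)
```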
[0174] The number of data tables may be arranged according to a data repository schema 116. In the illustrative example of Figure 1, the data repository schema 116 includes a first data table 118, a second data table 120, a third data table 122, a fourth data table 124, and a fifth data table 126. Although the illustrative example of Figure 1 includes five data tables, in additional implementations, the data repository schema 116 may include more data tables or fewer data tables. The data repository schema 116 may also include links between the data tables 118, 120, 122, 124, 126. The links between the data tables 118, 120, 122, 124, 126 may indicate that retrieving information from one of the data tables 118, 120, 122, 124, 126 causes additional information stored by one or more additional data tables 118, 120, 122, 124, 126 to be retrieved. Additionally, not all the data tables 118, 120, 122, 124, 126 may be linked to each of the other data tables 118, 120, 122, 124, 126. In the illustrative example of Figure 1, the first data table 118 is logically coupled to the second data table 120 by a first link 128 and the first data table 118 is logically coupled to the fourth data table 124 by a second link 130. In addition, the second data table 120 is logically coupled to the third data table 122 via a third link 132 and the fourth data table 124 is logically coupled to the fifth data table 126 via a fourth link 134. Further, the third data table 122 is logically coupled to the fifth data table 126 via a fifth link 136.
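As a non-limiting sketch, the links of the illustrative example of Figure 1 can be represented as a mapping from link reference numerals to pairs of data table reference numerals, which makes it straightforward to check which data tables are directly coupled. The names LINKS and linked_tables are assumptions for illustration.

```python
# Link reference numeral -> pair of data table reference numerals (Figure 1).
LINKS = {
    128: (118, 120),
    130: (118, 124),
    132: (120, 122),
    134: (124, 126),
    136: (122, 126),
}


def linked_tables(table):
    """Return the data tables directly coupled to the given data table."""
    neighbors = set()
    for a, b in LINKS.values():
        if a == table:
            neighbors.add(b)
        elif b == table:
            neighbors.add(a)
    return sorted(neighbors)


print(linked_tables(118))
print(linked_tables(126))
```

In this arrangement the first data table 118 is not directly linked to the fifth data table 126, consistent with the statement that not every data table is linked to each of the others.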
[0175] In various examples, as data tables are added to and/or removed from the data repository schema 116, additional links between data tables may be added to or removed from the data repository schema 116. In one or more illustrative examples, the integrated data repository 104 may store data tables according to the data repository schema 116 for at least a portion of the individuals for which the data integration system 114 obtained information from a combination of at least two of the health insurance claims data repository 106, the molecular data repository 108, the one or more additional data repositories 110, and the one or more reference information data repositories 112. As a result, the integrated data repository 104 may store respective instances of the data tables 118, 120, 122, 124, 126 according to the data repository schema 116 for thousands, tens of thousands, up to hundreds of thousands or more individuals.
[0176] The data integration and analysis system 102 may also include a data pipeline system 138. The data pipeline system 138 may include a number of algorithms, software code, scripts, macros, or other bundles of computer-executable instructions that process information stored by the integrated data repository 104 to generate additional datasets. The additional datasets may include information obtained from one or more of the data tables 118, 120, 122, 124, 126. The additional datasets may also include information that is derived from data obtained from one or more of the data tables 118, 120, 122, 124, 126. The components of the data pipeline system 138 implemented to generate a first additional dataset may be different from the components of the data pipeline system 138 used to generate a second additional dataset.
[0177] In one or more examples, the data pipeline system 138 may generate a dataset that indicates pharmacy treatments received by a number of individuals. In one or more illustrative examples, the data pipeline system 138 may analyze information stored in at least one of the data tables 118, 120, 122, 124, 126 to determine health insurance codes corresponding to pharmaceutical treatments received by a number of individuals. The data pipeline system 138 may analyze the health insurance codes corresponding to pharmaceutical treatments with respect to a library of data that indicates specified pharmaceutical treatments that correspond to one or more health insurance codes to determine names of pharmaceutical treatments that have been received by the individuals. In one or more additional examples, the data pipeline system 138 may analyze information stored by the integrated data repository 104 to determine medical procedures received by a number of individuals. To illustrate, the data pipeline system 138 may analyze information stored by one of the data tables 118, 120, 122, 124, 126 to determine treatments received by individuals via at least one of injection or intravenously. In one or more further examples, the data pipeline system 138 may analyze information stored by the integrated data repository 104 to determine episodes of care for individuals, lines of therapy received by individuals, progression of a biological condition, or time to next treatment. In various examples, the datasets generated by the data pipeline system 138 may be different for different biological conditions. For example, the data pipeline system 138 may generate a first number of datasets with respect to a first type of cancer, such as lung cancer, and a second number of datasets with respect to a second type of cancer, such as colorectal cancer.
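As a non-limiting sketch of the code lookup described above, the following Python fragment maps health insurance codes found in claim rows to treatment names using a library of code-to-name associations. The codes, treatment names, and the names CODE_LIBRARY and treatments_by_individual are placeholders invented for illustration and do not correspond to any real coding system.

```python
# Hypothetical library mapping health insurance codes to treatment names.
CODE_LIBRARY = {
    "code-0001": "treatment A",
    "code-0002": "treatment B",
}


def treatments_by_individual(claim_rows, library=CODE_LIBRARY):
    """Group named treatments by individual, skipping unrecognized codes."""
    result = {}
    for individual_id, code in claim_rows:
        name = library.get(code)
        if name is not None:
            result.setdefault(individual_id, []).append(name)
    return result


claim_rows = [
    ("id-001", "code-0001"),
    ("id-001", "code-9999"),  # not in the library, so it is ignored
    ("id-002", "code-0002"),
]
print(treatments_by_individual(claim_rows))
```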
[0178] The data pipeline system 138 may also determine one or more confidence levels to assign to information associated with individuals having data stored by the integrated data repository 104. The respective confidence levels may correspond to different measures of accuracy for information associated with individuals having data stored by the integrated data repository 104. The information associated with the respective confidence levels may correspond to one or more characteristics of individuals derived from data stored by the integrated data repository 104. Values of confidence levels for the one or more characteristics may be generated by the data pipeline system 138 in conjunction with generating one or more datasets from the integrated data repository 104. In one or more examples, a first confidence level may correspond to a first range of measures of accuracy, a second confidence level may correspond to a second range of measures of accuracy, and a third confidence level may correspond to a third range of measures of accuracy. In one or more additional examples, the second range of measures of accuracy may include values that are less than values of the first range of measures of accuracy and the third range of measures of accuracy may include values that are less than values of the second range of measures of accuracy. In one or more illustrative examples, information corresponding to the first confidence level may be referred to as Gold standard information, information corresponding to the second confidence level may be referred to as Silver standard information, and information corresponding to the third confidence level may be referred to as Bronze standard information.
[0179] The data pipeline system 138 may determine values for the confidence levels of characteristics of individuals based on a number of factors. For example, a respective set of information may be used to determine characteristics of individuals. The data pipeline system 138 may determine the confidence levels of characteristics of individuals based on an amount of completeness of the respective set of information used to determine a characteristic for an individual. In situations where one or more pieces of information are missing from the set of information associated with a first number of individuals, the confidence levels for a characteristic may be lower than for a second number of individuals where information is not missing from the set of information. In one or more examples, an amount of missing information may be used by the data pipeline system 138 to determine confidence levels of characteristics of individuals. To illustrate, a greater amount of missing information used to determine a characteristic of an individual may cause confidence levels for the characteristic to be lower than in situations where the amount of missing information used to determine the characteristic is lower. Further, different types of information may correspond to various confidence levels for a characteristic. In one or more examples, the presence of a first piece of information used to determine a characteristic of an individual may result in confidence levels for the characteristic being higher than the presence of a second piece of information used to determine the characteristic.
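As a non-limiting sketch of assigning a confidence level based on completeness, the following Python fragment computes the fraction of required fields that are present for an individual and maps that fraction to a Gold, Silver, or Bronze level. The function name confidence_level, the field names, and the threshold values of 1.0 and 0.75 are assumptions for illustration; the disclosure does not specify particular thresholds.

```python
def confidence_level(record, required_fields, gold_min=1.0, silver_min=0.75):
    """Map the completeness of a record to a confidence level (assumed thresholds)."""
    present = sum(1 for field in required_fields if record.get(field) not in (None, ""))
    completeness = present / len(required_fields)
    if completeness >= gold_min:
        return "Gold"
    if completeness >= silver_min:
        return "Silver"
    return "Bronze"


# Hypothetical fields used to determine a characteristic of an individual.
FIELDS = ["diagnosis_code", "diagnosis_date", "treatment_code", "treatment_date"]

complete = {f: "x" for f in FIELDS}
one_missing = dict(complete, treatment_date=None)
two_missing = dict(one_missing, treatment_code="")

print(confidence_level(complete, FIELDS))
print(confidence_level(one_missing, FIELDS))
print(confidence_level(two_missing, FIELDS))
```

A weighting per field could be added to reflect that the presence of some pieces of information raises confidence more than others; that refinement is omitted here.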
[0180] In one or more illustrative examples, the data pipeline system 138 may determine a number of individuals included in a cohort with a primary diagnosis of lung cancer (or other biological condition). The data pipeline system 138 may determine confidence levels for respective individuals with respect to being classified as having a primary diagnosis of lung cancer. The data pipeline system 138 may use information from a number of columns included in the data tables 118, 120, 122, 124, 126 to determine a confidence level for the inclusion of individuals within a lung cancer cohort. The number of columns may include health insurance codes related to diagnosis of biological conditions and/or treatments of biological conditions. Additionally, the number of columns may correspond to dates of diagnosis and/or treatment for biological conditions. The data pipeline system 138 may determine that a confidence level of an individual being characterized as being part of the lung cancer cohort is higher in scenarios where information is available for each of the number of columns or at least a threshold number of columns than in instances where information is available for less than a threshold number of columns. Further, the data pipeline system 138 may determine confidence levels for individuals included in a lung cancer cohort based on the type of information and availability of information associated with one or more columns. To illustrate, in situations where one or more diagnosis codes are present in relation to one or more periods of time for a group of individuals and one or more treatment codes are absent, the data pipeline system 138 may determine that the confidence level of including the group of individuals in the lung cancer cohort is greater than in situations where at least one of the diagnosis codes is absent and the treatment codes used to determine whether individuals are included in the lung cancer cohort are present.
[0181] The data integration and analysis system 102 may include a data analysis system 140. The data analysis system 140 may receive integrated data repository requests 142 from one or more computing devices, such as an example computing device 144. The one or more integrated data repository requests 142 may cause data to be retrieved from the integrated data repository 104. In various examples, the one or more integrated data repository requests 142 may cause data to be retrieved from one or more datasets generated by the data pipeline system 138. The integrated data repository requests 142 may specify the data to be retrieved from the integrated data repository 104 and/or the one or more datasets generated by the data pipeline system 138. In one or more additional examples, the integrated data repository requests 142 may include one or more prebuilt queries that correspond to computer-executable instructions that retrieve a specified set of data from the integrated data repository 104 and/or one or more datasets generated by the data pipeline system 138.
[0182] In response to one or more integrated data repository requests 142, the data analysis system 140 may analyze data retrieved from at least one of the integrated data repository 104 or one or more datasets generated by the data pipeline system 138 to generate data analysis results 146. The data analysis results 146 may be sent to one or more computing devices, such as example computing device 148. Although the illustrative example of Figure 1 shows the one or more integrated data repository requests 142 being sent from one computing device 144 and the data analysis results 146 being sent to another computing device 148, in one or more additional implementations, the data analysis results 146 may be received by a same computing device that sent the one or more integrated data repository requests 142. The data analysis results 146 may be displayed by one or more user interfaces rendered by the computing device 144 or the computing device 148.
[0183] In one or more examples, the data analysis system 140 may implement at least one of one or more machine learning techniques or one or more statistical techniques to analyze data retrieved in response to one or more integrated data repository requests 142. In one or more examples, the data analysis system 140 may implement one or more artificial neural networks to analyze data retrieved in response to one or more integrated data repository requests 142. To illustrate, the data analysis system 140 may implement at least one of one or more convolutional neural networks or one or more residual neural networks to analyze data retrieved from the integrated data repository 104 in response to one or more integrated data repository requests 142. In at least some examples, the data analysis system 140 may implement one or more random forest techniques, one or more support vector machines, or one or more hidden Markov models to analyze data retrieved in response to one or more integrated data repository requests 142. One or more statistical models may also be implemented to analyze data retrieved in response to one or more integrated data repository requests 142 to identify at least one of correlations or measures of significance between characteristics of individuals. For example, log-rank tests may be applied to data retrieved in response to one or more integrated data repository requests 142. In addition, Cox proportional hazards models may be implemented with respect to data retrieved in response to one or more integrated data repository requests 142. Further, Wilcoxon signed-rank tests may be applied to data retrieved in response to one or more integrated data repository requests 142. In still other examples, a z-score analysis may be performed with respect to data retrieved in response to one or more integrated data repository requests 142. In still additional examples, a Kaplan-Meier analysis may be performed with respect to data retrieved in response to one or more integrated data repository requests 142. In at least some examples, one or more machine learning techniques may be implemented in combination with one or more statistical techniques to analyze data retrieved in response to one or more integrated data repository requests 142.
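As a non-limiting sketch of one of the statistical techniques named above, the following Python fragment computes a Kaplan-Meier estimate of the survival function from follow-up durations and event indicators. The function name kaplan_meier and the sample data are assumptions for illustration; a production analysis would more likely rely on an established statistics library.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each individual.
    events: 1 if the event (e.g., death) was observed, 0 if censored.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        removed = 0
        # Group all individuals sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed > 0:
            # Product-limit step: multiply by the fraction surviving at t.
            survival *= 1 - observed / n_at_risk
            curve.append((t, survival))
        # Both events and censored individuals leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical data: four individuals, the second is censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)
```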
[0184] In one or more illustrative examples, the data analysis system 140 may determine a rate of survival of individuals in which lung cancer is present in response to one or more treatments. In one or more additional illustrative examples, the data analysis system 140 may determine a rate of survival of individuals having one or more genomic region mutations in which lung cancer is present in response to one or more treatments. In various examples, the data analysis system 140 may generate the data analysis results 146 in situations where the data retrieved from at least one of the integrated data repository 104 or the one or more datasets generated by the data pipeline system 138 satisfies one or more criteria. For example, the data analysis system 140 may determine whether at least a portion of the data retrieved in response to one or more integrated data repository requests 142 satisfies a threshold confidence level. In situations where the confidence level for at least a portion of the data retrieved in response to one or more integrated data repository requests 142 is less than a threshold confidence level, the data analysis system 140 may refrain from generating at least a portion of data analysis results 146. In scenarios where the confidence level for at least a portion of the data retrieved in response to one or more integrated data repository requests 142 is at least a threshold confidence level, the data analysis system 140 may generate at least a portion of the data analysis results 146. In various examples, the threshold confidence level may be related to the type of data analysis results 146 being generated by the data analysis system 140.
[0185] In one or more illustrative examples, the data analysis system 140 may receive an integrated data repository request 142 to generate data analysis results 146 that indicate a rate of survival of one or more individuals. In these instances, the data analysis system 140 may determine whether the data stored by the integrated data repository 104 and/or by one or more datasets generated by the data pipeline system 138 satisfies a threshold confidence level, such as a Gold standard confidence level. In one or more additional examples, the data analysis system 140 may receive an integrated data repository request 142 to generate data analysis results 146 that indicate a treatment received by one or more individuals. In these implementations, the data analysis system 140 may determine whether the data stored by the integrated data repository 104 and/or by one or more datasets generated by the data pipeline system 138 satisfies a lower threshold confidence level, such as a Bronze standard confidence level.
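As a non-limiting sketch of applying a threshold confidence level that depends on the type of analysis requested, the following Python fragment keeps only records whose confidence level meets the level required for the requested analysis, using a Gold standard threshold for survival analyses and a Bronze standard threshold for treatment analyses as in the examples above. The names LEVEL_RANK, REQUIRED_LEVEL, and records_meeting_threshold, the numeric ranks, and the record layout are assumptions for illustration.

```python
# Assumed ordering of the confidence levels, lowest to highest.
LEVEL_RANK = {"Bronze": 1, "Silver": 2, "Gold": 3}

# Threshold confidence level per analysis type, following the examples above.
REQUIRED_LEVEL = {
    "survival_rate": "Gold",
    "treatment_received": "Bronze",
}


def records_meeting_threshold(records, analysis_type):
    """Return records whose confidence level is at least the required level."""
    minimum = LEVEL_RANK[REQUIRED_LEVEL[analysis_type]]
    return [r for r in records if LEVEL_RANK[r["confidence"]] >= minimum]


# Hypothetical records retrieved in response to a request.
records = [
    {"individual_id": "id-001", "confidence": "Gold"},
    {"individual_id": "id-002", "confidence": "Silver"},
    {"individual_id": "id-003", "confidence": "Bronze"},
]

print(len(records_meeting_threshold(records, "survival_rate")))
print(len(records_meeting_threshold(records, "treatment_received")))
```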
[0186] In one or more additional illustrative examples, the data analysis system 140 may receive an integrated data repository request 142 to determine individuals having one or more genomic mutations and that have received one or more treatments for a biological condition. Continuing with this example, the data analysis system 140 can determine a survival rate of individuals with the one or more genomic mutations in relation to the one or more treatments received by the individuals. The data analysis system 140 can then identify based on the survival rate of individuals an effectiveness of treatments for the individuals in relation to genomic mutations that may be present in the individuals. In this way, health outcomes of individuals may be improved by identifying prospective treatments that may be more effective for populations of individuals having one or more genomic mutations than current treatments being provided to the individuals.
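As a rough sketch of the analysis in the preceding paragraph, the following groups individuals having a given mutation by the treatment received and computes a survival rate per treatment. The record layout, the field names, the placeholder labels `MUT1`, `MUT2`, `A`, and `B`, and the toy records themselves are all invented for illustration and do not come from the disclosure.

```python
from collections import defaultdict


def survival_rate_by_treatment(records, mutation):
    """Return {treatment: fraction surviving} for individuals carrying `mutation`."""
    counts = defaultdict(lambda: [0, 0])  # treatment -> [survivors, total]
    for record in records:
        if mutation not in record["mutations"]:
            continue  # only individuals having the mutation of interest
        tally = counts[record["treatment"]]
        tally[1] += 1
        if record["survived"]:
            tally[0] += 1
    return {treatment: s / n for treatment, (s, n) in counts.items()}


# Made-up toy records, for illustration only.
records = [
    {"mutations": {"MUT1"}, "treatment": "A", "survived": True},
    {"mutations": {"MUT1"}, "treatment": "A", "survived": False},
    {"mutations": {"MUT1"}, "treatment": "B", "survived": True},
    {"mutations": {"MUT2"}, "treatment": "A", "survived": False},
]
rates = survival_rate_by_treatment(records, "MUT1")
```

Comparing the per-treatment rates is one simple way the relative effectiveness of treatments for a mutation-defined population might be surfaced; an actual analysis would likely also account for follow-up time and censoring.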
[0187] Figure 2 illustrates an example framework 200 corresponding to an arrangement of data tables in an integrated data repository, according to one or more implementations. In the illustrative example of Figure 2, the framework 200 includes a data repository schema 202 that includes a first data table 204, a second data table 206, a third data table 208, a fourth data table 210, a fifth data table 212, a sixth data table 214, and a seventh data table 216. Although the illustrative example of Figure 2 includes seven data tables, in additional implementations, the data repository schema 202 may include more data tables or fewer data tables. The data repository schema 202 may also include links between the data tables 204, 206, 208, 210, 212, 214, 216. The links between the data tables 204, 206, 208, 210, 212, 214, 216 may indicate that retrieving information from one of the data tables 204, 206, 208, 210, 212, 214, 216 causes additional information stored by one or more additional data tables 204, 206, 208, 210, 212, 214, 216 to be retrieved. Additionally, not all the data tables 204, 206, 208, 210, 212, 214, 216 may be linked to each of the other data tables 204, 206, 208, 210, 212, 214, 216. In the illustrative example of Figure 2, the first data table 204 is logically coupled to the second data table 206 by a first link 218 and the third data table 208 is logically coupled to the second data table 206 by a second link 220. The second data table 206 is also logically coupled to the fourth data table 210 by a third link 222, the second data table 206 is logically coupled to the fifth data table 212 by a fourth link 224, and the second data table 206 is logically coupled to the sixth data table 214 by a fifth link 226. In addition, the fifth data table 212 is logically coupled to the sixth data table 214 by a sixth link 228 and the sixth data table 214 is logically coupled to the seventh data table 216 by a seventh link 230. Further, the seventh data table 216 is logically coupled to the fourth data table 210 by an eighth link 232. In various examples, as data tables are added to and/or removed from the data repository schema 202, additional links between data tables may be added to or removed from the data repository schema 202. In one or more illustrative examples, the integrated data repository 104 may store data tables according to the data repository schema 202 for at least a portion of the individuals for which the data integration system 114 obtained information from a combination of at least two of the health insurance claims data repository 106, the molecular data repository 108, and the one or more additional data repositories 110. As a result, the integrated data repository 104 may store respective instances of the data tables 204, 206, 208, 210, 212, 214, 216 according to the data repository schema 202 for thousands, tens of thousands, up to hundreds of thousands or more individuals.
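The eight links of Figure 2 can be written down as a small graph, which makes it easy to see which tables are directly coupled. The sketch below uses the reference numerals from the text as node and link labels; treating the links as undirected, and the helper names `build_adjacency` and `hops`, are assumptions made here for illustration.

```python
from collections import deque

# Link numeral -> (table, table), as enumerated for Figure 2.
LINKS = {
    218: (204, 206), 220: (208, 206), 222: (206, 210), 224: (206, 212),
    226: (206, 214), 228: (212, 214), 230: (214, 216), 232: (216, 210),
}


def build_adjacency(links):
    """Build an undirected adjacency map from the link table."""
    adjacency = {}
    for a, b in links.values():
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def hops(adjacency, start, goal):
    """Breadth-first search: fewest links to traverse between two tables."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adjacency[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # no path between the two tables


ADJ = build_adjacency(LINKS)
```

In this representation the second data table 206 is the hub, directly coupled to five other tables, while the first data table 204 reaches the seventh data table 216 only through intermediate tables.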
[0188] In one or more examples, the first data table 204 may store data corresponding to genomics and genomics testing for individuals. For example, the first data table 204 may include columns that include information corresponding to a panel used to generate genomics data, mutations of genomic regions, types of mutations, copy numbers of genomic regions, coverage data indicating numbers of nucleic acid molecules identified in a sample having one or more mutations, testing dates, and patient information. The first data table 204 may also include one or more columns that include health insurance data codes that may correspond to one or more diagnosis codes. Additionally, the information in first data table 204 may include at least one identifier for an individual that is associated with an instance of the first data table 204.
[0189] The second data table 206 may store data related to one or more patient visits by individuals to one or more healthcare providers. The third data table 208 may store information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table 206. To illustrate, an individual may visit a healthcare provider and multiple services may be performed with respect to the individual at the visit. A second data table 206 may include columns indicating information for each of the multiple services performed during the patient visit. Multiple third data tables 208 may be generated with respect to the patient visit that include columns indicating information on a more granular level for a respective service provided during the patient visit than the information stored by the second data table 206 related to the patient visit. For example, the second data table 206 may include multiple columns indicating a health insurance code for different services provided to an individual during a patient visit and a third data table 208 related to one of the services may include multiple columns for additional health insurance codes that correspond to additional information related to the respective services. The second data table 206 and the third data table(s) 208 for a patient visit may indicate one or more dates of service corresponding to the patient visit.
[0190] The fourth data table 210 may include columns that indicate information about individuals for which information is stored by the integrated data repository 104. For example, the fourth data table 210 may include columns that indicate information related to at least one of a location of an individual, a gender of an individual, a date of birth of an individual, a date of death of an individual (if applicable), or one or more keys associated with the individual. In one or more examples, the fourth data table 210 may include one or more columns related to whether erroneous data has been identified for an individual. In various examples, a single fourth data table 210 may be generated for respective individuals. Thus, the data repository schema 202 may include multiple instances of the fourth data table 210, such as thousands, tens of thousands, up to hundreds of thousands or more.
[0191] The fifth data table 212 may include columns that indicate information related to a health insurance company or governmental entity that made payment for one or more services provided to respective individuals. For example, the fifth data table 212 may include one or more payer identifiers. The sixth data table 214 may include columns that include information corresponding to health insurance coverage information for respective individuals. In one or more examples, the sixth data table 214 may include columns indicating the presence of medical coverage for an individual, the presence of pharmacy coverage for an individual, and a type of health insurance plan related to the individual, such as health maintenance organization (HMO), preferred provider organization (PPO), and the like.
[0192] The seventh data table 216 may include columns that indicate information related to pharmaceutical treatments obtained by a respective individual. In one or more examples, the seventh data table 216 may include one or more columns indicating health insurance codes corresponding to pharmaceutical treatments that are available via a pharmacy. The health insurance codes may correspond to individual pharmaceutical treatments. Additionally, the health insurance codes may indicate a diagnosis of a biological condition with respect to an individual. The seventh data table 216 may also include additional information, such as at least one of dosage amounts, number of days’ supply, quantity dispensed, number of refills authorized, dates of service, or information related to the individual receiving the pharmaceutical treatment.
[0193] In various examples, the data repository schema 202 may provide results of analysis of the information stored by the data tables 204, 206, 208, 210, 212, 214, 216 in a more efficient manner than typical data repository schemas. For example, the logical connections between the data tables 204, 206, 208, 210, 212, 214, 216 are arranged to efficiently retrieve data that is related across the different data tables 204, 206, 208, 210, 212, 214, 216. In situations where the data tables 204, 206, 208, 210, 212, 214, 216 are arranged in a serial manner and/or in situations where a greater number of the data tables 204, 206, 208, 210, 212, 214, 216 are logically connected, retrieving data from one or more of the data tables 204, 206, 208, 210, 212, 214, 216 to respond to a request for information from the integrated data repository 104 will be less efficient than in situations where the data repository schema 202 is implemented.
[0194] Figure 3 illustrates an architecture 300 to generate one or more datasets from information retrieved from a data repository that integrates health related data from a number of sources, according to one or more implementations. The architecture 300 may include the data integration and analysis system 102 and the integrated data repository 104. Additionally, the data integration and analysis system 102 may include at least the data pipeline system 138 and the data analysis system 140. The data pipeline system 138 may include a number of sets of data processing instructions that are executable to generate respective datasets that may be analyzed by the data analysis system 140 in response to an integrated data repository request 142 to generate data analysis results 146.
[0195] The data pipeline system 138 may include first data processing instructions 302, second data processing instructions 304, up to Nth data processing instructions 306. The data processing instructions 302, 304, 306 may be executable by one or more processing units to perform a number of operations to generate respective datasets using information obtained from the integrated data repository 104. In one or more illustrative examples, the data processing instructions 302, 304, 306 may include at least one of software code, scripts, API calls, macros, and so forth. The first data processing instructions 302 may be executable to generate a first dataset 308. In addition, the second data processing instructions 304 may be executable to generate a second dataset 310. Further, the Nth data processing instructions 306 may be executable to generate an Nth dataset 312. In various examples, after the data integration and analysis system 102 generates the integrated data repository 104, the data pipeline system 138 may cause the data processing instructions 302, 304, 306 to be executed to generate the datasets 308, 310, 312. In one or more examples, the datasets 308, 310, 312 may be stored by the integrated data repository 104 or by an additional data repository that is accessible to the data integration and analysis system 102. At least a portion of the data processing instructions 302, 304, 306 may analyze health insurance codes to generate at least a portion of the datasets 308, 310, 312. Additionally, at least a portion of the data processing instructions 302, 304, 306 may analyze genomics data to generate at least a portion of the datasets 308, 310, 312.
[0196] In one or more examples, the first data processing instructions 302 may be executable to retrieve data from one or more first data tables stored by the integrated data repository 104. The first data processing instructions 302 may also be executable to retrieve data from one or more specified columns of the one or more first data tables. In various examples, the first data processing instructions 302 may be executable to identify individuals that have a health insurance code stored in one or more column and row combinations that correspond to one or more diagnosis codes. The first data processing instructions 302 may then be executable to analyze the one or more diagnosis codes to determine a biological condition for which the individuals have been diagnosed. In one or more illustrative examples, the first data processing instructions 302 may be executable to analyze the one or more diagnosis codes with respect to a library of diagnosis codes that indicates one or more biological conditions that correspond to respective diagnosis codes. The library of diagnosis codes may include hundreds up to thousands of diagnosis codes. The first data processing instructions 302 may also be executable to determine individuals diagnosed with a biological condition by analyzing timing information of the individuals, such as dates of treatment, dates of diagnosis, dates of death, one or more combinations thereof, and the like.
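A minimal sketch of the cohort-building step described above: scan specified code columns, look each code up in a diagnosis library, and collect individuals diagnosed with a target condition. The codes `DX-001` and `DX-002`, the column names, and the function name `build_cohort` are hypothetical placeholders, not real health insurance codes or part of the disclosure.

```python
# Hypothetical library mapping diagnosis codes to biological conditions.
DIAGNOSIS_LIBRARY = {
    "DX-001": "lung cancer",
    "DX-002": "breast cancer",
}


def build_cohort(rows, code_columns, condition, library=DIAGNOSIS_LIBRARY):
    """Return the set of individual identifiers diagnosed with `condition`."""
    cohort = set()
    for row in rows:
        for column in code_columns:
            # Unknown or missing codes map to None and never match.
            if library.get(row.get(column)) == condition:
                cohort.add(row["individual_id"])
                break
    return cohort


# Made-up rows, for illustration only.
rows = [
    {"individual_id": "p1", "code_1": "DX-001", "code_2": None},
    {"individual_id": "p2", "code_1": "ZZ-999", "code_2": "DX-002"},
    {"individual_id": "p3", "code_1": "ZZ-999", "code_2": "DX-001"},
]
lung_cohort = build_cohort(rows, ["code_1", "code_2"], "lung cancer")
```

A fuller version would also apply the timing checks the paragraph mentions, such as dates of diagnosis and treatment, before admitting an individual to the cohort.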
[0197] The second data processing instructions 304 may be executable to retrieve data from one or more second data tables stored by the integrated data repository 104. The second data processing instructions 304 may also be executable to retrieve data from one or more specified columns of the one or more second data tables. In various examples, the second data processing instructions 304 may be executable to identify individuals that have a health insurance code stored in one or more column and row combinations that correspond to one or more treatment codes. The one or more treatment codes may correspond to treatments obtained from a pharmacy. In one or more additional examples, the one or more treatment codes may correspond to treatments received by a medical procedure, such as an injection or intravenously. The second data processing instructions 304 may be executable to determine one or more treatments that correspond to the respective health insurance codes included in the one or more second data tables by analyzing the health insurance code in relation to a predetermined set of information. The predetermined set of information may include a data library that indicates one or more treatments that correspond to one out of hundreds up to thousands of health insurance codes. The second data processing instructions 304 may generate the second dataset 310 to indicate respective treatments received by a group of individuals. In one or more illustrative examples, the group of individuals may correspond to the individuals included in the first dataset 308. The second dataset 310 may be arranged in rows and columns with one or more rows corresponding to a single individual and one or more columns indicating the treatments received by the respective individual.
[0198] The Nth data processing instructions 306 (where N may be any positive integer) may be executable to generate the Nth dataset 312 by combining information from a number of previously generated datasets, such as the first dataset 308 and the second dataset 310. In addition, the Nth data processing instructions 306 may be executable to generate the Nth dataset 312 by retrieving additional information from one or more additional columns of the integrated data repository 104 and incorporating the additional information from the integrated data repository 104 with information obtained from the first dataset 308 and the second dataset 310. For example, the Nth data processing instructions 306 may be executable to identify individuals included in the first dataset 308 that are diagnosed with a biological condition and analyze specified columns of one or more additional data tables of the integrated data repository 104 to determine dates of the treatments indicated in the second dataset 310 that correspond to the individuals included in the first dataset 308. In one or more further examples, the Nth data processing instructions 306 may be executable to analyze columns of one or more additional data tables of the integrated data repository 104 to determine dosages of treatments indicated in the second dataset 310 received by the individuals included in the first dataset 308. In this way, the Nth data processing instructions 306 may be executable to generate an episodes of care dataset based on information included in a cohort dataset and a treatments dataset.
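One way to picture the combination step is a join of a cohort dataset and a treatments dataset against additional service rows that carry dates and dosages. The sketch below is illustrative only; the data shapes, the field names, and the function name `build_episodes_of_care` are assumptions, and dates are assumed to be ISO 8601 strings so that they sort chronologically.

```python
def build_episodes_of_care(cohort, treatments, service_rows):
    """Combine cohort membership, known treatments, and dated service rows.

    cohort: set of individual identifiers (the first dataset).
    treatments: {individual_id: [treatment, ...]} (the second dataset).
    service_rows: dicts with individual_id, treatment, date, optional dosage.
    """
    episodes = []
    for row in service_rows:
        pid, treatment = row["individual_id"], row["treatment"]
        if pid in cohort and treatment in treatments.get(pid, ()):
            episodes.append({
                "individual_id": pid,
                "treatment": treatment,
                "date": row["date"],
                "dosage": row.get("dosage"),
            })
    # ISO 8601 date strings sort chronologically as plain strings.
    episodes.sort(key=lambda e: (e["individual_id"], e["date"]))
    return episodes


# Made-up inputs, for illustration only.
cohort = {"p1", "p3"}
treatments = {"p1": ["A"], "p3": ["A", "B"]}
service_rows = [
    {"individual_id": "p3", "treatment": "B", "date": "2021-05-01", "dosage": "10 mg"},
    {"individual_id": "p1", "treatment": "A", "date": "2021-02-10"},
    {"individual_id": "p3", "treatment": "A", "date": "2021-01-15", "dosage": "5 mg"},
    {"individual_id": "p2", "treatment": "A", "date": "2021-03-03"},
]
episodes = build_episodes_of_care(cohort, treatments, service_rows)
```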
[0199] In one or more illustrative examples, in response to receiving an integrated data repository request 142, the data analysis system 140 may determine one or more datasets that correspond to the features of the query related to the integrated data repository request 142. For example, the data analysis system 140 may determine that information included in the first dataset 308 and the second dataset 310 is applicable to responding to the integrated data repository request 142. In these scenarios, the data analysis system 140 may analyze at least a portion of the data included in the first dataset 308 and the second dataset 310 to generate the data analysis results 146. In one or more additional examples, the data analysis system 140 may determine different datasets to respond to different queries included in the integrated data repository request 142 in order to generate the data analysis results 146.
[0200] The use of specific sets of data processing instructions to generate respective datasets may reduce the number of inputs from users of the data integration and analysis system 102 as well as reduce the computational load, such as the amount of processing resources and memory, utilized to process integrated data repository requests 142. For example, without the specific architecture of the data pipeline system 138, each time an integrated data repository request 142 is received, the data utilized to respond to the integrated data repository request 142 is assembled from the integrated data repository 104. In contrast, by implementing the data pipeline system 138 to execute the data processing instructions 302, 304, 306 to generate the datasets 308, 310, 312, the data needed to respond to various integrated data repository requests 142 has already been assembled and may be accessed by the data analysis system 140 to respond to the integrated data repository request 142. Thus, the computing resources used to respond to the integrated data repository request 142 by implementing the data pipeline system 138 to generate the datasets 308, 310, 312 are less than those used by typical systems that perform an information parsing and collecting process for each integrated data repository request 142. Further, in situations where the data pipeline system 138 has not been implemented, users of the data integration and analysis system 102 may need to submit multiple integrated data repository requests 142 in order to analyze the information that the users are intending to have analyzed, either because the ad hoc collection of data to respond to an integrated data repository request 142 in typical systems is inaccurate or because the data analysis system 140 is called upon multiple times in typical systems to perform an analysis that may be performed using a single integrated data repository request 142 when the data pipeline system 138 is implemented.
[0201] Figure 4 illustrates an architecture 400 to generate an integrated data repository that includes de-identified health insurance claims data and de-identified genomics data, according to one or more implementations. The architecture 400 may include the data integration and analysis system 102, the health insurance claims data repository 106, and the molecular data repository 108. The data integration and analysis system 102 may obtain patient information 402 from the molecular data repository 108. The patient information 402 may include genomics data 404 for individuals having data stored by the molecular data repository 108. The genomics data 404 may indicate results of one or more nucleic acid sequencing operations that analyze sequences of nucleic acid molecules included in a sample obtained from the individuals with respect to one or more target genomic regions. In one or more examples, the sample may be obtained from tissue of one or more individuals. In one or more additional examples, the sample may be obtained from fluid of one or more individuals, such as blood or plasma. The one or more target genomic regions may correspond to genomic regions that correspond to the presence of one or more biological conditions. For example, the target regions may correspond to genomic regions of a reference genome having mutations that are present in individuals in which a biological condition is present. In one or more illustrative examples, the target regions may correspond to genomic regions of a reference human genome in which one or more mutations are present in individuals in which one or more forms of cancer are present. The patient information 402 may also include information indicating personal information about individuals with data stored by the molecular data repository 108 and information corresponding to the testing and analysis performed on samples provided by individuals.
[0202] The data integration and analysis system 102 may perform a de-identification process 406 that anonymizes personal information obtained from the molecular data repository 108. The data integration and analysis system 102 may implement one or more computational techniques as part of the de-identification process to anonymize data related to individuals stored by the molecular data repository 108 such that the de-identified data protects the privacy of the individuals and is in compliance with one or more privacy regulation frameworks. The de-identification process 406 may include, at 408, accessing tokens. In various examples, the tokens may comprise an alphanumeric string of characters. In one or more examples, the tokens may be generated by the data integration and analysis system 102. In one or more additional examples, the tokens may be generated by a third-party and obtained by the data integration and analysis system 102.
[0203] The tokens may be generated using one or more hash functions in relation to a subset 410 of the patient information 402. To illustrate, for individuals that have information stored by the molecular data repository 108, the tokens may be generated using a combination of at least a portion of a first name of the respective individuals, at least a portion of the last name of the respective individuals, at least a portion of a date of birth of the respective individuals, a gender of the individuals, and at least a portion of a location identifier of the respective individuals. The de-identification process 406 may also include, at 412, generating identifiers for individuals that have data stored by the molecular data repository 108. The identifiers may be generated by the data integration and analysis system 102 using one or more hash functions that are different from the one or more hash functions used to generate the tokens. In one or more illustrative examples, the data integration and analysis system 102 may generate an intermediate version of respective identifiers using one or more hash functions and then apply one or more salting techniques to the intermediate versions of the identifiers to generate final versions of the identifiers. A salt function includes a function configured to add at least one random bit to each intermediate identifier to generate a respective final identifier. In various examples, the data integration and analysis system 102 may generate the identifiers at 412 using at least a portion of the information for respective individuals stored by the molecular data repository 108. In one or more illustrative examples, the identifiers may be generated based on a patient identifier included in the patient information 402. The identifiers generated by the data integration and analysis system 102 may be unique for respective individuals having data stored by the molecular data repository 108.
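The token and identifier generation described above can be sketched with standard hash functions. The disclosure does not name specific hash functions, field portions, or a salting scheme, so everything concrete here is an assumption made for illustration: SHA-256 for tokens, SHA-512 for the intermediate identifier, the particular name and location prefixes, and combining a random salt with the intermediate value by hashing them together.

```python
import hashlib
import secrets


def make_token(first_name, last_name, date_of_birth, gender, location_id):
    """Hash a subset of patient information into an alphanumeric token.

    The portions used (first initial, three letters of the last name, three
    characters of the location identifier) are illustrative assumptions.
    """
    subset = "|".join([
        first_name[:1].upper(),
        last_name[:3].upper(),
        date_of_birth,
        gender.upper(),
        location_id[:3],
    ])
    return hashlib.sha256(subset.encode("utf-8")).hexdigest()


def make_identifier(patient_id, salt):
    """Hash a patient identifier, then mix random salt bits into the result.

    A different hash function (SHA-512) is used for the intermediate value
    than for tokens, mirroring the text; hashing salt + intermediate is one
    possible salting technique, chosen here as an assumption.
    """
    intermediate = hashlib.sha512(patient_id.encode("utf-8")).digest()
    return hashlib.sha256(salt + intermediate).hexdigest()


salt = secrets.token_bytes(16)  # random bits kept private by the integrating system
token = make_token("Jane", "Doe", "1970-01-01", "F", "94105")
identifier = make_identifier("patient-0001", salt)
```

Because the token depends only on the shared subset of fields, a second party that applies the same procedure to its own records can produce comparable tokens, while the salted identifier cannot be reproduced without the salt.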
[0204] At operation 414, the data integration and analysis system 102 may generate modified patient information 416 based on the identifiers. The modified patient information 416 may include genomics data 404 related to individuals associated with the molecular data repository 108 and the identifiers of the respective individuals. The modified patient information 416 may have a data structure 418. The data structure 418 may include a column that includes respective identifiers of individuals associated with the molecular data repository 108 and a number of columns that include genomics data 404 related to the individuals, such as identifiers of one or more genes, alterations to the one or more genes, type of alteration to the genes, and so forth.
[0205] The data integration and analysis system 102 may generate a token file 420. The token file 420 may include first tokens 422 accessed at operation 408 for respective individuals having data stored by the molecular data repository 108. The token file 420 may have a data structure 424 that includes a number of columns that include information for respective individuals. The data structure 424 may include a column indicating respective identifiers generated by the data integration and analysis system 102 and columns indicating one or more first tokens 422 associated with the respective identifiers. The data integration and analysis system 102 may send the token file 420 to a health insurance claims data management system 426 that is coupled to the health insurance claims data repository 106. The health insurance claims data management system 426 may analyze the first tokens 422 with respect to corresponding second tokens 428. The second tokens 428 may be accessed by or generated by the health insurance claims data management system 426. The second tokens 428 may be generated using a same or similar subset of information for individuals having data stored in the health insurance claims data repository 106 as the subset 410 of the patient information 402. For example, the second tokens 428 may be generated using a combination of at least a portion of a first name of the respective individuals, at least a portion of the last name of the respective individuals, at least a portion of a date of birth of the respective individuals, a gender of the individuals, and at least a portion of a location identifier of the respective individuals.
[0206] In various examples, the health insurance claims data management system 426 may retrieve health insurance claims data from the health insurance claims data repository 106 for individuals associated with respective second tokens 428 that match corresponding first tokens 422. A first token 422 may match a second token 428 when the data of the first token 422 has at least a threshold amount of similarity with respect to the data of the second token 428. In one or more examples, a first token 422 may match a second token 428 when the data of the first token 422 is the same as the data of the second token 428.
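The disclosure does not specify how similarity between tokens is measured. One plausible reading, since an individual may be associated with more than one token, is to compare the tokens position by position and take the fraction that agree; exact matching then corresponds to a threshold of 1.0. The sketch below follows that assumption, and the function names and the sample token strings are hypothetical.

```python
def token_similarity(first_tokens, second_tokens):
    """Fraction of token positions at which the two sequences agree exactly."""
    if not first_tokens or len(first_tokens) != len(second_tokens):
        return 0.0
    same = sum(a == b for a, b in zip(first_tokens, second_tokens))
    return same / len(first_tokens)


def tokens_match(first_tokens, second_tokens, threshold=1.0):
    """True when similarity is at least the threshold (1.0 means identical)."""
    return token_similarity(first_tokens, second_tokens) >= threshold


# Made-up token strings, for illustration only.
first = ["a1f3", "9c0e", "77b2"]
second = ["a1f3", "9c0e", "0000"]
```

With these inputs, two of three token positions agree, so the pair would match under a threshold of 0.6 but not under the exact-match threshold of 1.0.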
[0207] In response to identifying health insurance claims data for individuals having respective second tokens 428 that correspond to a respective first token 422, the health insurance claims data management system 426 may generate modified health insurance claims data 430. The health insurance claims data management system 426 may send the modified health insurance claims data 430 to the data integration and analysis system 102. In one or more examples, the modified health insurance claims data 430 may be formatted according to a data structure 432. The data structure 432 may include a column that includes a subset of the second tokens 428 that correspond to the first tokens 422 and a number of columns that include the health insurance claims data.
[0208] At operation 434, the data integration and analysis system 102 may integrate genomics data and health insurance claims data of individuals that are common to both the molecular data repository 108 and the health insurance claims data repository 106. The data integration and analysis system 102 may determine individuals that are common to both the molecular data repository 108 and the health insurance claims data repository 106 by determining genomics data and health insurance claims data corresponding to common tokens. The data integration and analysis system 102 may determine that a first token 422 related to a portion of the genomics data 404 corresponds to a second token 428 related to a portion of the health insurance claims data by determining a measure of similarity between the first token 422 and the second token 428. In scenarios where the first token 422 has at least a threshold amount of similarity with respect to the second token 428, the data integration and analysis system 102 may store the corresponding portion of the genomics data 404 and the corresponding portion of the health insurance claims data in relation to the identifier of the individual in an integrated data repository, such as the integrated data repository 104 of Figure 1, Figure 2, and Figure 3.
[0209] The architecture 400 may implement a cryptographic protocol that enables de-identified information from disparate data repositories to be integrated into a single data repository. In this way, the security of the data stored by the integrated data repository 104 is increased. Additionally, the cryptographic protocol implemented by the architecture 400 may enable more efficient retrieval and accurate analysis of information stored by the integrated data repository 104 than in situations where the cryptographic protocol of the architecture 400 is not utilized. For example, by generating a token file 420 that includes first tokens 422 using a cryptographic technique based on a specified set of information stored by the molecular data repository 108 and utilizing second tokens 428 generated using a same or similar cryptographic technique with respect to the similar or same set of information stored by the health insurance claims data repository 106, the data integration and analysis system 102 may match information stored by disparate data repositories that correspond to a same individual. Without implementing the cryptographic protocol of the architecture 400, the probability of incorrectly attributing information from one data repository to one or more individuals increases, which decreases the accuracy of results provided by the data integration and analysis system 102 in response to integrated data repository requests 142 sent to the data integration and analysis system 102.
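The integration step can be pictured as a join keyed on tokens: the token file ties each generated identifier to a first token, the returned claims data is keyed by the matching second token, and the genomics data is keyed by identifier. The sketch below uses exact token equality for simplicity; the dictionary shapes, the field contents, and the function name `integrate` are assumptions for illustration.

```python
def integrate(token_file, genomics_by_id, claims_by_token):
    """Join genomics and claims data for individuals present in both sources.

    token_file: {identifier: first_token}
    genomics_by_id: {identifier: genomics record}
    claims_by_token: {second_token: claims record}
    """
    integrated = {}
    for identifier, token in token_file.items():
        if identifier in genomics_by_id and token in claims_by_token:
            integrated[identifier] = {
                "genomics": genomics_by_id[identifier],
                "claims": claims_by_token[token],
            }
    return integrated


# Made-up inputs, for illustration only.
token_file = {"id-1": "tok-a", "id-2": "tok-b", "id-3": "tok-c"}
genomics_by_id = {"id-1": {"gene": "G1"}, "id-2": {"gene": "G2"}, "id-3": {"gene": "G3"}}
claims_by_token = {"tok-a": {"codes": ["X1"]}, "tok-c": {"codes": ["X2", "X3"]}}
repository = integrate(token_file, genomics_by_id, claims_by_token)
```

Only individuals whose tokens appear on both sides end up in the result, and the stored rows carry the generated identifier rather than any personal information.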
[0210] Figure 5 illustrates a framework 500 to generate a dataset, by a data pipeline system 138, based on data stored by an integrated data repository 104, according to one or more implementations. The integrated data repository 104 may store health insurance claims data and genomics data for a group of individuals 502. For example, the integrated data repository 104 may store information obtained from health insurance claims records 504 of the group of individuals 502. For each individual included in the group of individuals 502, the integrated data repository 104 may store information obtained from multiple health insurance claim records 504. In various examples, the information stored by the integrated data repository 104 may include and/or be derived from thousands, tens of thousands, hundreds of thousands, up to millions of health insurance claims records 504 for a number of individuals. Additionally, each health insurance claim record may include multiple columns. As a result, the integrated data repository 104 may be generated through the analysis of millions of columns of health insurance claims data.
[0211] Further, although the health insurance claims data may be organized according to a structured data format, health insurance claims data is typically arranged to be viewed by health insurance providers, patients, and healthcare providers in order to show financial information and insurance code information related to services provided to individuals by healthcare providers. Thus, health insurance claims data is not easily analyzed to gain insights that may be available in relation to characteristics of individuals in which a biological condition is present and that may aid in the treatment of the individuals with respect to the biological condition. The integrated data repository 104 may be generated and organized by analyzing and modifying raw health insurance claims data in a manner that enables the data stored by the integrated data repository 104 to be further analyzed to determine trends, characteristics, features, and/or insights with respect to individuals in which one or more biological conditions may be present. For example, health insurance codes may be stored in the integrated data repository 104 in such a way that at least one of medical procedures, biological conditions, treatments, dosages, manufacturers of medications, distributors of medications, or diagnoses may be determined for a given individual based on health insurance claims data for the individual. In various examples, the data integration and analysis system 102 may generate and implement one or more tables that indicate correlations between health insurance claims data and various treatments, symptoms, or biological conditions that correspond to the health insurance claims data. Further, the integrated data repository 104 may be generated using genomics data records 506 of the group of individuals 502. In various examples, the large amounts of health insurance claims data may be matched with genomics data for the group of individuals 502 to generate the integrated data repository 104.
[0212] By integrating the genomics data records 506 for the group of individuals 502 with the health insurance claims records 504, the data integration and analysis system 102 may determine correlations, which existing systems are typically unable to determine, between the presence of one or more biomarkers in the genomics data records 506 and other characteristics of individuals that are indicated by the health insurance claims records 504. For example, the data integration and analysis system 102 may determine one or more genomic characteristics of individuals that correspond to treatments received by individuals, timing of treatments, dosages of treatments, diagnoses of individuals, smoking status, presence of one or more biological conditions, presence of one or more symptoms of a biological condition, one or more combinations thereof, and the like. Based on the correlations determined by the data integration and analysis system 102 using the integrated data repository 104, cohorts of individuals that may benefit from one or more treatments may be identified that would not have been identified in existing systems. In one or more examples, the processes and techniques implemented to integrate the health insurance claims records 504 and the genomics data records 506 in order to generate the integrated data repository 104 may be complex and implement efficiency-enhancing techniques, systems, and processes in order to minimize the amount of computing resources used to generate the integrated data repository 104.
[0213] In one or more illustrative examples, the data pipeline system 138 may access information stored by the integrated data repository 104 to generate datasets that include a number of additional data records 508 that include information related to at least a portion of the group of individuals 502. In the illustrative example of Figure 5, the additional data record 508 includes information indicating whether individuals are included in a cohort of individuals in which lung cancer is present. The data pipeline system 138 may execute a plurality of different sets of data processing instructions to determine a cohort of the group of individuals 502 in which lung cancer is present. In various examples, the additional data record 508 may indicate information used to determine a status of an individual 502 with respect to lung cancer, such as one or more transaction insurance identifiers, one or more international classification of diseases (ICD) codes, and one or more health insurance transaction dates. In addition to including a column that indicates whether an individual 502 is included in the lung cancer cohort, the additional data record 508 may include a column indicating a confidence level of the status of the individual 502 with respect to the presence of lung cancer.
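As a non-limiting illustration, the generation of an additional data record that carries a cohort column and a confidence column might be sketched as follows. The sketch assumes that ICD-10-CM codes beginning with "C34" (malignant neoplasm of bronchus and lung) indicate lung cancer. The names `Claim` and `build_cohort_records`, and the rule that two or more distinct transaction dates yield a "high" confidence level, are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Claim:
    individual_id: str       # de-identified identifier of the individual
    icd_code: str            # ICD code recorded on the claim
    transaction_date: date   # health insurance transaction date


# Assumption: ICD-10-CM codes starting with "C34" denote lung cancer.
LUNG_CANCER_PREFIX = "C34"


def build_cohort_records(claims):
    """Return one record per individual with a cohort flag and a confidence level."""
    individuals = set()
    matching_dates = {}
    for claim in claims:
        individuals.add(claim.individual_id)
        if claim.icd_code.upper().startswith(LUNG_CANCER_PREFIX):
            matching_dates.setdefault(claim.individual_id, set()).add(claim.transaction_date)

    records = []
    for individual_id in sorted(individuals):
        dates = matching_dates.get(individual_id, set())
        # Hypothetical rule: repeated coding on distinct dates raises confidence.
        if len(dates) >= 2:
            confidence = "high"
        elif len(dates) == 1:
            confidence = "low"
        else:
            confidence = "n/a"
        records.append({
            "individual_id": individual_id,
            "in_lung_cancer_cohort": bool(dates),
            "confidence": confidence,
        })
    return records
```

In practice, the confidence rule would likely depend on additional criteria, such as the type of claim or the interval between transaction dates.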
[0214] Figure 6 is a schematic diagram of a computing architecture 600 to incorporate medical records data into an integrated data repository 104. In various examples, at least a portion of the operations of the computing architecture 600 may be performed by the data integration and analysis system 102 of Figures 1, 3, and 4. In one or more examples, at least a portion of the operations of the computing architecture 600 may be performed by one or more additional computing systems that are at least one of controlled, maintained, or implemented by a service provider that also at least one of controls, maintains, or implements the data integration and analysis system 102. In one or more additional examples, at least a portion of the operations of the computing architecture 600 may be performed by a number of servers in a distributed computing environment.
[0215] The computing architecture 600 may include a medical records data repository 602. The medical records data repository 602 may store medical records data from a number of individuals. The medical records data may include imaging information, laboratory test results, diagnostic test information, clinical observations, dental health information, notes of healthcare practitioners, medical history forms, diagnostic request forms, medical procedure order forms, medical information charts, one or more combinations thereof, and so forth. In various examples, for a given individual, the medical records data repository 602 may store information obtained from one or more healthcare practitioners that is related to the individual.
[0216] The computing architecture 600 may perform operation 604 that includes obtaining data packages from the medical records data repository 602. In one or more examples, the data packages may be obtained in response to one or more requests sent to the medical records data repository 602 for medical records that correspond to one or more individuals. In one or more additional examples, the data packages may be obtained by the computing architecture 600 using one or more application programming interface (API) calls. In one or more illustrative examples, a first data package 606, a second data package 608, up to an Nth data package 610 may be obtained using the computing architecture 600. The individual data packages 606, 608, 610 may correspond to medical records of a respective individual. For example, the first data package 606 may include medical records of a first individual, the second data package 608 may include medical records of a second individual, and the Nth data package 610 may include medical records of an Nth individual.
[0217] Individual data packages 606, 608, 610 may include a number of components. In one or more examples, individual data packages 606, 608, 610 may include individual components that correspond to medical records from different healthcare providers. In one or more additional examples, the individual data packages 606, 608, 610 may include individual components that correspond to different parts of medical records that correspond to one or more healthcare providers. In the illustrative example of Figure 6, the second data package 608 may include a first component 612, a second component 614, up to an Nth component 616. In one or more illustrative examples, the first component 612 may include a first portion of medical records of an individual, the second component 614 may include a second portion of medical records of the individual, and the Nth component 616 may include an Nth portion of medical records of the individual. In various examples, the first component 612 may correspond to medical records of a first healthcare provider for the individual, the second component 614 may correspond to medical records of a second healthcare provider for the individual, and the Nth component 616 may correspond to medical records of an Nth healthcare provider for the individual. In one or more additional illustrative examples, the first component 612 may include a first section of medical records of the individual, such as one or more forms related to a diagnostic test or procedure, and the second component 614 may include a second section of medical records of the individual, such as a pathology report of the individual.
[0218] At operation 618, the computing architecture 600 may preprocess individual data packages to identify a corpus of information 620 to be analyzed. In one or more examples, the preprocessing of data packages obtained from the medical records data repository 602 may include transforming the data included in the data packages. For example, preprocessing the data packages may include transforming at least a portion of the data obtained from the medical records data repository 602 to machine encoded information. To illustrate, preprocessing the data packages may include performing one or more optical character recognition (OCR) operations with respect to at least a portion of the data packages obtained from the medical records data repository 602. By converting at least a portion of the data packages obtained from the medical records data repository 602 to machine encoded information, the data packages may be subjected to a number of operations, such as one or more parsing operations to identify one or more characters or strings of characters, or one or more editing operations, that cannot be performed on at least a portion of the data packages in the form in which they were obtained from the medical records data repository 602.
[0219] In one or more examples, the preprocessing of individual data packages may include determining information included in individual data packages that is to be excluded from further analysis by the computing architecture 600. In various examples, one or more components of individual data packages may be excluded from a corpus of information 620 to be analyzed. For example, with respect to the second data package 608, the computing architecture 600 may determine that the first component 612 is to be excluded from further analysis by the computing architecture 600. In one or more examples, the computing architecture 600 may analyze the components 612, 614, and/or 616 with respect to one or more keywords to identify at least one of the components 612, 614, and/or 616 to exclude from further analysis by the computing architecture 600. In one or more illustrative examples, the computing architecture 600 may parse the components 612, 614, and/or 616 to identify one or more keywords and, in response to identifying the one or more keywords in a component 612, 614, and/or 616, the computing architecture 600 may determine to exclude the respective component 612, 614, and/or 616 from further analysis by the computing architecture 600. For example, the computing architecture 600 may determine that the first component 612 of the second data package 608 is a test requisition form for one or more diagnostic procedures or tests. In these scenarios, the computing architecture 600 may determine that the first component 612 is to be excluded from further analysis by the computing architecture 600. Additionally, the computing architecture 600 may determine that at least one of the second component 614 or the Nth component 616 corresponds to one or more pathology reports for an individual based on one or more keywords included in at least one of the second component 614 or the Nth component 616. In these instances, the computing architecture 600 may determine that at least a portion of the second component 614 and/or at least a portion of the Nth component 616 is to be included in the corpus of information 620 to be further analyzed by the computing architecture 600.
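By way of a minimal sketch, keyword-based selection of components for the corpus of information might resemble the following. The keyword lists and the function names `classify_component` and `build_corpus` are hypothetical, and the sketch assumes that each component has already been converted to machine encoded text.

```python
# Hypothetical keyword lists; an actual deployment would curate these.
EXCLUDE_KEYWORDS = ("test requisition", "requisition form")
INCLUDE_KEYWORDS = ("pathology report", "histologic", "specimen")


def classify_component(text):
    """Label a component as 'exclude', 'include', or 'undetermined' by keyword."""
    lowered = text.lower()
    # Exclusion keywords take precedence over inclusion keywords.
    if any(keyword in lowered for keyword in EXCLUDE_KEYWORDS):
        return "exclude"
    if any(keyword in lowered for keyword in INCLUDE_KEYWORDS):
        return "include"
    return "undetermined"


def build_corpus(components):
    """Keep only the components that are labeled for inclusion."""
    return [text for text in components if classify_component(text) == "include"]
```

Giving exclusion keywords precedence is a design choice of the sketch, so that a requisition form that merely mentions a pathology report is still excluded.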
[0220] In addition, a subset of the components of individual data packages obtained from the medical records data repository 602 may be included in the corpus of information 620. In various examples, one or more additional operations may be performed to narrow the corpus of information 620. For example, one or more queries may be applied to a subset of information obtained from the medical records data repository 602. The one or more queries may extract information from the one or more data packages that satisfy the one or more queries. In at least some examples, the one or more queries may be a group of queries that are applied to individual components of a data package. In one or more illustrative examples, the group of queries may determine information to be included in the corpus of information 620 and additional information that is to be excluded from the corpus of information 620. In one or more additional examples, one or more sections of at least one component of a data package may be excluded from the corpus of information 620.
[0221] In one or more additional illustrative examples, after determining that the first component 612 is to be excluded from further analysis by the computing architecture 600, the computing architecture 600 may then cause one or more queries to be implemented with respect to at least one of the second component 614 or the Nth component 616. In these scenarios, the one or more queries may determine that a section of the second component 614, such as a section that indicates family history for one or more biological conditions, is to be excluded from the corpus of information 620. In various examples, the one or more queries may be directed to identifying a number of keywords and/or combinations of keywords included in at least one of the second component 614 or the Nth component 616. In these instances, the computing architecture 600 may exclude from the corpus of information 620 one or more portions of the individual components of the data packages that include one or more keywords or combinations of keywords. In one or more additional examples, the computing architecture 600 may exclude from the corpus of information 620 a number of words, a number of characters, and/or a number of symbols following one or more keywords that are included in one or more portions of the individual components of the data packages.
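The two kinds of exclusion described above, namely removing a section such as family history and removing a number of words following a keyword, might be sketched as follows. The sketch assumes that a section begins at a line ending in a colon and runs to the next such line, which is a simplification. The function names `drop_sections` and `drop_words_after` are hypothetical.

```python
import re


def drop_sections(text, excluded_headings):
    """Remove sections whose heading (lowercased, without ':') is in excluded_headings."""
    kept = []
    skipping = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # A new heading decides whether the lines that follow are skipped.
            skipping = stripped[:-1].lower() in excluded_headings
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


def drop_words_after(text, keyword, word_count):
    """Keep the keyword but remove up to word_count words that follow it."""
    pattern = re.compile(
        "(" + re.escape(keyword) + ")" + r"(?:\s+\S+){0,%d}" % word_count,
        re.IGNORECASE,
    )
    return pattern.sub(r"\1", text)
```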
[0222] Further, at operation 622, the computing architecture 600 may analyze the corpus of information 620 to determine characteristics of individuals. In one or more examples, the computing architecture 600 may analyze the corpus of information 620 to determine individuals that have one or more phenotypes. In various examples, the computing architecture 600 may analyze the corpus of information 620 to determine one or more biomarkers that are indicative of a biological condition. For example, the computing architecture 600 may analyze the corpus of information 620 to determine individuals having one or more genetic characteristics. The one or more genetic characteristics may include one or more variants of a genomic region that correspond to a biological condition. In one or more illustrative examples, the one or more genetic characteristics may correspond to one or more variants of a genomic region that correspond to a type of cancer. In one or more additional illustrative examples, the one or more biomarkers may correspond to levels of an analyte being outside of a specified range. To illustrate, the computing architecture 600 may analyze the corpus of information 620 to determine individuals having levels of one or more proteins and/or levels of one or more small molecules present that are indicative of a biological condition. In these scenarios, the computing architecture 600 may analyze results of laboratory tests to determine levels of analytes of individuals. In one or more additional examples, the computing architecture 600 may analyze the corpus of information 620 to determine individuals in which one or more symptoms are present that are indicative of a biological condition. In one or more further examples, the computing architecture 600 may analyze imaging information included in the corpus of information 620 to determine individuals in which one or more biomarkers are present.
[0223] In one or more examples, the computing architecture 600 may implement one or more machine learning techniques to analyze the corpus of information 620. For example, the computing architecture 600 may implement one or more artificial neural networks, such as at least one of one or more convolutional neural networks or one or more residual neural networks to analyze the corpus of information 620. The computing architecture 600 may also implement at least one of one or more random forests techniques, one or more hidden Markov models, or one or more support vector machines to analyze the corpus of information 620.
[0224] In at least some implementations, the computing architecture 600 may analyze the corpus of information 620 by performing one or more queries with respect to the corpus of information 620. The one or more queries may correspond to one or more keywords and/or combinations of keywords. The one or more keywords and/or combinations of keywords may correspond to at least one of characters or symbols that correspond to one or more biological conditions. To illustrate, a keyword may correspond to characters related to a mutation of a genomic region, such as HER2. In one or more additional illustrative examples, one or more criteria may be associated with combinations of keywords. To illustrate, a criterion that corresponds to a combination of keywords may include a number of words being present within a specified distance of one another in a portion of the corpus of information 620 for an individual, such as the words fatigue, blood pressure, and swelling occurring within 100 characters of one another. In these instances, the computing architecture 600 may parse the corpus of information 620 for the one or more keywords and/or combinations of keywords. In various examples, in response to determining that the one or more keywords and/or combinations of keywords are present in accordance with one or more criteria, the computing architecture 600 may determine that a biological condition is present with respect to a given individual.
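A proximity criterion of the kind described above, such as the words fatigue, blood pressure, and swelling occurring within 100 characters of one another, might be checked as in the following sketch. The sketch assumes that distance is measured between the starting offsets of the keyword occurrences; the description does not specify how the distance is measured. The function names are hypothetical.

```python
import re
from itertools import product


def keyword_positions(text, keyword):
    """Return the starting offset of every case-insensitive occurrence of keyword."""
    return [match.start() for match in re.finditer(re.escape(keyword), text, re.IGNORECASE)]


def keywords_within_distance(text, keywords, max_span=100):
    """Return True if one occurrence of every keyword fits within max_span characters."""
    positions = [keyword_positions(text, keyword) for keyword in keywords]
    if any(not found for found in positions):
        return False  # at least one keyword is absent
    # Try each combination of one occurrence per keyword.
    for combination in product(*positions):
        if max(combination) - min(combination) <= max_span:
            return True
    return False
```

Enumerating every combination of occurrences is adequate for short passages; a sliding window over the sorted occurrences would be more efficient for long documents.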
[0225] In one or more additional examples, the one or more queries may be image-based and the computing architecture 600 may analyze images included in the corpus of information 620 with respect to template images. The template images may be generated based on analyzing a number of images in which a biological condition is present and aggregating the number of images into a template image. In these scenarios, the computing architecture 600 may analyze images included in the corpus of information 620 with respect to one or more template images to determine a measure of similarity between the images included in the corpus of information 620 and the template images. In situations where the measure of similarity for an individual is at least a threshold value, the computing architecture 600 may determine that a characteristic of a biological condition is present in the individual.
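One simple way to realize the template comparison, offered only as a sketch, is to average a set of equally sized grayscale images pixel by pixel to form the template and to use cosine similarity as the measure of similarity. Both choices, the default threshold of 0.9, and the function names below are assumptions; the description does not identify the aggregation method or the similarity measure.

```python
import math


def build_template(images):
    """Aggregate equally sized images (lists of pixel rows) into a pixel-wise mean image."""
    count = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [sum(image[r][c] for image in images) / count for c in range(cols)]
        for r in range(rows)
    ]


def cosine_similarity(image_a, image_b):
    """Cosine similarity between two images of the same size."""
    flat_a = [pixel for row in image_a for pixel in row]
    flat_b = [pixel for row in image_b for pixel in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm = math.sqrt(sum(a * a for a in flat_a)) * math.sqrt(sum(b * b for b in flat_b))
    return dot / norm if norm else 0.0


def matches_template(image, template, threshold=0.9):
    """Return True when the similarity is at least the threshold value."""
    return cosine_similarity(image, template) >= threshold
```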
[0226] After determining individuals having one or more characteristics, the computing architecture 600 may, at operation 624, generate data structures that store data for individuals having the one or more characteristics. In one or more examples, the computing architecture 600 may generate data tables that indicate individuals having an individual characteristic and/or individuals having a group of characteristics. For example, the computing architecture 600 may generate a first data table 626 and a second data table 628. The first data table 626 may indicate individuals having one or more first characteristics and the second data table 628 may indicate individuals having one or more second characteristics. In one or more illustrative examples, the first data table 626 may indicate individuals having one or more first biomarkers for a biological condition and the second data table 628 may indicate individuals having one or more second biomarkers for the biological condition. The one or more first biomarkers may correspond to one or more first genomic variants that are associated with the biological condition and the one or more second biomarkers may correspond to one or more second genomic variants that are associated with the biological condition. In various examples, the data tables 626, 628 may indicate whether or not the one or more characteristics associated with the individual data tables 626, 628 are present with respect to individual individuals. To illustrate, the first data table 626 may include a first indication for individuals in which one or more first genomic variants are present and a second indication for individuals in which the one or more first genomic variants are not present. In one or more additional examples, the first data table 626 may indicate smoking status of individuals and the second data table 628 may indicate whether or not individual individuals have received one or more treatments for a biological condition.
[0227] In one or more illustrative examples, the first data table 626 and the second data table 628 may have rows that correspond to individual individuals. In at least some examples, an individual identifier may be present in individual rows. The individual identifier may include at least one of alphanumeric characters or symbols that correspond to an individual. In various examples, the individual identifier may be present in a data package that corresponds to an individual. Columns of the first data table 626 and the second data table 628 may indicate a status of individual individuals with respect to one or more characteristics. For example, the columns of the data tables 626, 628 may include an identifier that includes at least one of alphanumeric characters or symbols that indicate the presence or absence of one or more characteristics for a given individual. Further, although the illustrative example of Figure 6 includes a first data table 626 and a second data table 628, the computing architecture 600 may generate more data tables or fewer data tables.
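The row and column layout described above might be sketched as follows, with one row per individual identifier and one column holding a presence or absence indication. The function name `build_characteristic_table`, the column names, and the indication strings "present" and "absent" are hypothetical.

```python
def build_characteristic_table(individual_ids, individuals_with_characteristic, column_name):
    """Return one row per individual with a presence or absence indication."""
    table = []
    for individual_id in individual_ids:
        present = individual_id in individuals_with_characteristic
        table.append({
            "individual_id": individual_id,
            column_name: "present" if present else "absent",
        })
    return table


# Hypothetical usage: two tables built over the same individuals.
ids = ["ID-001", "ID-002", "ID-003"]
first_table = build_characteristic_table(ids, {"ID-001"}, "first_genomic_variant")
second_table = build_characteristic_table(ids, {"ID-001", "ID-003"}, "received_treatment")
```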
[0228] At operation 630, the computing architecture 600 may store the data structures in an additional data repository. For example, the computing architecture 600 may store at least the first data table 626 and/or the second data table 628 in an intermediate data repository 632. In various examples, the first data table 626 and the second data table 628 may be temporarily stored in the intermediate data repository 632. In one or more illustrative examples, the first data table 626 and the second data table 628 may be stored in the intermediate data repository 632 before being added to the integrated data repository 104. In one or more examples, the integrated data repository 104 may be periodically generated and/or updated. In these scenarios, data structures generated by the computing architecture 600 based on analyzing the corpus of information 620 may be stored in the intermediate data repository 632 until a time when the integrated data repository 104 is to be at least one of generated or updated.
[0229] Prior to adding data structures stored by the intermediate data repository 632 to the integrated data repository 104, the computing architecture 600 may perform one or more de-identification processes at operation 634. The data structures stored by the intermediate data repository 632 may be de-identified in order to preserve the privacy of individuals. The one or more de-identification processes may include applying one or more electronically implemented cryptographic techniques to information of individuals included in the data structures stored by the intermediate data repository 632. In one or more examples, the computing architecture 600 may generate tokens that correspond to individual individuals that have information stored in data structures of the intermediate data repository 632. The tokens may be generated by applying one or more hash functions to information related to individual individuals. In one or more examples, the one or more de-identification processes may include applying a salt function to information corresponding to individual individuals to generate tokens for the individual individuals. In various examples, the one or more cryptographic techniques applied to de-identify the data structures stored by the intermediate data repository 632 may be the same or similar to those applied to information obtained from the health insurance claims data repository 106 of Figures 1 and 4.
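A minimal sketch of one possible de-identification step is shown below. It assumes SHA-256 as the hash function, a shared secret string as the salt, and three-character portions of the name and location fields. The field names and the functions `make_token` and `deidentify` are hypothetical, and the sketch is not a statement of the cryptographic protocol actually used.

```python
import hashlib

# Hypothetical names of the identifying fields in each row.
IDENTIFYING_FIELDS = ("first_name", "last_name", "date_of_birth", "gender", "location_id")


def make_token(row, salt):
    """Derive a token from portions of the identifying fields of one row."""
    parts = [
        row["first_name"].strip().lower()[:3],   # portion of the first name
        row["last_name"].strip().lower()[:3],    # portion of the last name
        row["date_of_birth"].strip(),
        row["gender"].strip().lower(),
        row["location_id"].strip()[:3],          # portion of the location identifier
    ]
    payload = salt + "|".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deidentify(rows, salt):
    """Replace the identifying fields of each row with a token."""
    output = []
    for row in rows:
        cleaned = {key: value for key, value in row.items() if key not in IDENTIFYING_FIELDS}
        cleaned["token"] = make_token(row, salt)
        output.append(cleaned)
    return output
```

Normalizing case and whitespace before hashing means that the same individual yields the same token across repositories whose records differ only in formatting.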
[0230] At operation 636, the computing architecture 600 may store the de-identified data structures in conjunction with the integrated data repository 104. For example, the information stored in the intermediate data repository 632 for a given individual may be stored in conjunction with additional information about the given individual in the integrated data repository 104. To illustrate, the integrated data repository 104 may store information for a given individual obtained from at least two of the molecular data repository 108, the health insurance claims data repository 106, and the intermediate data repository 632. In this way, information about a given individual obtained from a number of disparate data repositories may be stored in the integrated data repository 104. As a result, information about individuals that is obtained from the different data repositories may be analyzed together rather than analyzed separately as with many existing systems.
[0231] In various examples, the information stored by the intermediate data repository 632 may be used to validate one or more determinations made by the data integration and analysis system 102. For example, the data integration and analysis system 102 may analyze information obtained from the health insurance claims data repository 106 and the molecular data repository 108 to determine characteristics of individuals. The data integration and analysis system 102 may then analyze information obtained from the intermediate data repository 632 to determine whether the predicted characteristics identified from the information obtained from the health insurance claims data repository 106 and from the molecular data repository 108 correspond to the characteristics for the same individuals with respect to information stored by the intermediate data repository 632.
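The validation described above might be sketched as a comparison, keyed by de-identified token, between characteristics predicted from claims and molecular data and characteristics derived from medical records. The function name `validate_predictions` and the use of a simple agreement rate are assumptions made for illustration.

```python
def validate_predictions(predicted, observed):
    """Compare predicted and observed characteristics for individuals present in both.

    Both arguments map a de-identified token to a characteristic value.
    """
    shared = predicted.keys() & observed.keys()
    disagreements = sorted(token for token in shared if predicted[token] != observed[token])
    agreement_rate = (len(shared) - len(disagreements)) / len(shared) if shared else None
    return {
        "compared": len(shared),
        "agreement_rate": agreement_rate,
        "disagreements": disagreements,
    }
```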
[0232] The one or more cryptographic techniques applied to de-identify the data structures stored by the intermediate data repository 632 may utilize the same or similar information that was used to generate at least one of the first tokens 422 or the second tokens 428 of Figure 4. For example, the operation 634 may implement one or more cryptographic techniques using a combination of at least a portion of a first name of the respective individuals, at least a portion of the last name of the respective individuals, at least a portion of a date of birth of the respective individuals, a gender of the individuals, and at least a portion of a location identifier of the respective individuals to de-identify the data structures of the intermediate data repository. By utilizing the same or similar cryptographic techniques and the same or similar subset of information to de-identify the data structures stored by the intermediate data repository 632 as were used to generate at least one of the first tokens 422 or the second tokens 428, the information stored by the intermediate data repository 632 may be synchronized with information for the same individuals that have information stored in the integrated data repository 104. Both the integrated data repository 104 and the intermediate data repository 632 may store information for thousands, tens of thousands, up to millions of individuals. 
Thus, without the ability to synchronize the individuals having records stored by the integrated data repository 104 and the intermediate data repository 632 through the use of a specified cryptographic protocol as described herein, the data structures of the integrated data repository 104 and the data structures of the intermediate data repository 632 that are associated with a same individual may not be stored in a manner such that the information stored by the integrated data repository 104 and the information stored by the intermediate data repository 632 may be retrieved together for a given individual, which may lead to inaccurate information being provided by the data integration and analysis system 102. The absence of a specified cryptographic protocol as described herein may also lead to the use of more computing resources to determine the information stored in the integrated data repository 104 from other data sources and the information stored by the intermediate data repository 632 that correspond to a given individual.
Figures 7 and 8 illustrate example processes to generate an integrated data repository and generate datasets used in the analysis of information stored by the integrated data repository. The example processes are illustrated as collections of blocks in logical flow graphs, which represent sequences of operations that may be implemented in hardware, software, or a combination thereof. The blocks are referenced by numbers. In the context of software, the blocks represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processing units (such as hardware microprocessors), perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks may be combined in any order and/or in parallel to implement the process.
[0233] Figure 7 is a data flow diagram of an example process 700 to generate an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations. At operation 702, the process 700 may include generating a data file that includes tokens generated using a first hash function. Individual tokens may correspond to a respective individual of a group of individuals having data stored by a molecular data repository. In one or more examples, an individual having data stored by the molecular data repository may be associated with one or more tokens. The tokens may be generated by applying one or more first hash functions to a subset of information corresponding to the group of individuals stored by the genomics data repository. In various examples, individual tokens may be generated by applying one or more first hash functions to one or more combinations of at least a portion of a first name of a respective individual of the group of individuals, at least a portion of a second name of a respective individual of the group of individuals, a location identifier of a respective individual of the group of individuals, a gender of a respective individual of the group of individuals, and a date of birth of a respective individual of the group of individuals. In one or more illustrative examples, the tokens may be generated by a data integration and analysis system that is coupled to the genomics data repository. In one or more additional illustrative examples, the tokens may be generated by a third-party system and accessed by a data integration and analysis system coupled to the molecular data repository. The process 700 may also include, at operation 704, sending the data file to a health insurance claims data management system. 
The health insurance claims data management system may match the tokens included in the data file with second tokens accessed by the health insurance data management system and generated based on information stored by a health insurance claims data repository.
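As an illustrative sketch only, the token generation of operation 702 and the token matching performed by the health insurance claims data management system might look like the following in Python. The use of keyed SHA-256 (HMAC), the field normalization, the shared key, the sample records, and the names `make_token` and `match_tokens` are assumptions made for illustration; the disclosure does not specify a particular hash function or field encoding.

```python
import hashlib
import hmac

# Hypothetical secret shared by the two systems (an assumption for this sketch).
SHARED_KEY = b"example-shared-key"

def make_token(first_name, last_name, zip_code, gender, dob, key=SHARED_KEY):
    """Derive a de-identified token from a subset of demographic fields."""
    # Normalize each field so both systems derive the same token for the
    # same individual; only the first three letters of the first name are used.
    parts = [first_name.strip()[:3], last_name, zip_code, gender, dob]
    message = "|".join(p.strip().lower() for p in parts).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def match_tokens(file_tokens, claims_tokens):
    """Return the tokens present in both the data file and the claims system."""
    return set(file_tokens) & set(claims_tokens)

# Fabricated example records, not real individuals.
molecular_people = [
    ("Alice", "Smith", "94063", "F", "1970-01-31"),
    ("Bob", "Jones", "10001", "M", "1965-07-04"),
]
claims_people = [
    ("ALICE ", "smith", "94063", "f", "1970-01-31"),
    ("Carol", "Diaz", "60601", "F", "1980-12-12"),
]

file_tokens = [make_token(*p) for p in molecular_people]
claims_tokens = [make_token(*p) for p in claims_people]
matched = match_tokens(file_tokens, claims_tokens)
```

In this sketch only the token derived from the first record appears in both lists, so `matched` holds a single token, and neither system needs to exchange the underlying names or dates of birth.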
[0234] In addition, at operation 706, the process 700 may include obtaining, from the health insurance claims data management system, in response to the data file, first data corresponding to the group of individuals, where the first data includes health insurance claims data. In some implementations, affirmative consent is obtained from the members of the group of individuals for their data to be transferred from the health insurance claims data management system. In one or more examples, the data is transferred in an anonymized format, such that the data may not be traced back to an individual member. The health insurance claims data management system may be coupled to a health insurance claims data repository that stores health insurance claims information for a number of individuals. In one or more examples, the health insurance claims data management system may analyze the tokens of the data file with respect to additional tokens generated by the health insurance claims data management system. The additional tokens may be generated based on a same set of information used to generate the tokens included in the data file. However, an individual’s identity may not be determined based on a token. In various examples, the health insurance claims data management system may match tokens included in the data file with additional tokens generated based on information stored by the health insurance claims data repository to determine individuals having information stored by the health insurance claims data repository that also have information stored by the genomics data repository. The technology disclosed herein complies with legal and best practice privacy standards, such as HIPAA and GDPR.
[0235] At operation 708, the process 700 may include generating a number of identifiers using a second hash function that is different from the first hash function. In one or more examples, individual identifiers may correspond to one or more tokens related to a respective individual of the group of individuals. The identifiers may be unique with respect to a given individual of the group of individuals and are de-identified. Additionally, the identifiers may be generated using information stored by the genomics data repository for the group of individuals that is different from the information stored by the genomics data repository used to generate the tokens. In various examples, intermediate identifiers may be generated by applying the second hash function to information of the respective groups of individuals and final versions of the identifiers may be generated by applying one or more salting techniques to the intermediate identifiers. Information stored by the genomics data repository for respective individuals may be stored in association with the identifiers such that at least a portion of the information for given individuals stored by the genomics data repository may be accessed using respective identifiers of the given individuals.
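A minimal sketch of operation 708 follows, assuming SHA-512 as the second hash function and a randomly generated salt. The field values and the function names `intermediate_identifier` and `salted_identifier` are hypothetical; the disclosure does not name a specific hash function or salting technique.

```python
import hashlib
import secrets

def intermediate_identifier(record_fields):
    """Apply the second hash function (SHA-512 here, an assumption)."""
    message = "|".join(record_fields).encode("utf-8")
    return hashlib.sha512(message).hexdigest()

def salted_identifier(intermediate, salt):
    """Produce the final identifier by salting the intermediate identifier."""
    digest = hashlib.sha512(salt + intermediate.encode("utf-8")).hexdigest()
    return digest[:32]  # truncated for readability in this sketch

# The salt would be kept secret by the system that generates the identifiers.
salt = secrets.token_bytes(16)

# Hypothetical repository fields that differ from those used for the tokens.
fields = ["accession-0001", "sample-42"]

inter = intermediate_identifier(fields)
identifier = salted_identifier(inter, salt)
```

Because the salt is not shared, a party that knows the input fields and the hash function still cannot reproduce the final identifier.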
[0236] Further, the process 700 may include, at operation 710, obtaining, using the number of identifiers, second data from the molecular data repository for the group of individuals, and, at operation 712, the process 700 may include determining respective portions of the first data that correspond to respective portions of the second data for the group of individuals. For example, for a given individual, first data corresponding to health insurance claims data for the given individual may be identified in addition to second data corresponding to molecular data of the given individual, such as genomics data. In this way, for a given individual, both health insurance claims data and molecular data may be identified.
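Operations 710 and 712 can be pictured as a keyed lookup followed by a join. The sketch below assumes simple in-memory dictionaries; the mapping names, record layouts, and the function `link_records` are hypothetical and stand in for whatever storage the repositories actually use.

```python
def link_records(token_to_identifier, claims_by_token, molecular_by_identifier):
    """Pair claims data with molecular data for each individual found in both."""
    integrated = {}
    for token, identifier in token_to_identifier.items():
        claims = claims_by_token.get(token)
        molecular = molecular_by_identifier.get(identifier)
        # Keep only individuals that have both kinds of data.
        if claims is not None and molecular is not None:
            integrated[identifier] = {"claims": claims, "molecular": molecular}
    return integrated

# Fabricated example data.
token_to_identifier = {"tokA": "id1", "tokB": "id2"}
claims_by_token = {
    "tokA": [{"service_date": "2021-03-15", "code": "SVC-001"}],
    "tokC": [{"service_date": "2020-11-02", "code": "SVC-002"}],
}
molecular_by_identifier = {
    "id1": {"gene": "EGFR", "mutation_type": "SNV"},
    "id2": {"gene": "KRAS", "mutation_type": "SNV"},
}

integrated = link_records(token_to_identifier, claims_by_token, molecular_by_identifier)
```

Here only `id1` has both claims data and molecular data, so it is the only individual in `integrated`.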
[0237] The process 700 may include, at operation 714, generating an integrated data repository that stores the respective portions of the first data and the respective portions of the second data in relation to respective identifiers of the number of identifiers. For example, the integrated data repository may store health insurance claims data and genomics data for a given individual in association with an identifier that may be used to access the health insurance claims data and the genomics data for the given individual. The information stored by the integrated data repository may be organized according to a data repository schema. For example, the integrated data repository may store health insurance claims data and genomics data for the group of individuals in a number of data tables. In one or more examples, information stored by the number of data tables may be linked. To illustrate, information related to a given individual stored by a first data table of the data repository schema may be linked to additional information related to the given individual stored by a second data table of the data repository schema. In this way, information accessed in one data table of the data repository schema may result in accessing additional information stored in another data table of the data repository schema.
[0238] In one or more illustrative examples, the data repository schema may include a first data table that stores genomics data of the group of individuals. For example, the first data table may store information corresponding to a panel used to generate genomics data, mutations of genomic regions, types of mutations, copy numbers of genomic regions, coverage data indicating numbers of nucleic acid molecules identified in a sample having one or more mutations, testing dates, and patient information. The data repository schema may also include a second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers and a third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table. Additionally, the data repository schema may include a fourth data table that stores personal information of the group of individuals and a fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals. Further, the data repository schema may include a sixth data table storing information corresponding to health insurance coverage information for the group of individuals, such as a type of health insurance plan related to the group of individuals. The data repository schema may also include a seventh data table that stores information related to pharmaceutical treatments obtained by the group of individuals.
[0239] In one or more examples, the integrated data repository may also store medical records that correspond to at least a portion of the group of individuals. In these examples, the medical records may be obtained from one or more data repositories storing the medical records. One or more optical character recognition (OCR) operations may be performed with respect to the medical records. Additionally, the medical records may be analyzed to determine one or more portions of the additional information to remove to produce a corpus of information. In various examples, the corpus of information may be analyzed to determine a portion of the subset of the additional group of individuals that correspond to one or more biomarkers.
[0240] One or more data structures may be generated from the corpus of information that store identifiers of the portion of the subset of the additional group of individuals and that store an indication that the portion of the subset of the additional group of individuals corresponds to the one or more biomarkers. The one or more data structures may be stored by an intermediate data repository. One or more de-identification operations may be performed with respect to the identifiers of the portion of the subset of the additional group of individuals before modifying the integrated data repository to store at least a portion of the additional information of the medical records of the portion of the subset of the additional group of individuals in relation to the number of identifiers. After de-identification of the information stored by the one or more data structures, the information stored by the one or more data structures may be added to the integrated data repository. In at least some examples, the de-identified medical records information may be added to the integrated data repository in addition to or in lieu of the health insurance claims data. In various examples, the one or more data structures storing the de-identified medical records information with respect to the biomarker data may have one or more logical connections with other data structures stored in the integrated data repository.
To illustrate, the one or more data structures storing the de-identified medical records information with respect to the biomarker data may have one or more logical connections with at least one of: the first data table that stores information corresponding to a panel used to generate genomics data, mutations of genomic regions, types of mutations, copy numbers of genomic regions, coverage data indicating numbers of nucleic acid molecules identified in a sample having one or more mutations, testing dates, and patient information; the second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers; the third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table; the fourth data table that stores personal information of the group of individuals; the fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals; the sixth data table storing information corresponding to health insurance coverage information for the group of individuals, such as a type of health insurance plan related to the group of individuals; or the seventh data table that stores information related to pharmaceutical treatments obtained by the group of individuals.
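The linked data tables of the data repository schema can be sketched with a small relational example. The following uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and sample rows are hypothetical and cover only four of the tables described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (
    patient_id TEXT PRIMARY KEY,   -- de-identified identifier
    birth_year INTEGER
);
CREATE TABLE genomics (
    patient_id TEXT REFERENCES patients(patient_id),
    gene TEXT,
    mutation_type TEXT,
    test_date TEXT
);
CREATE TABLE visits (
    visit_id INTEGER PRIMARY KEY,
    patient_id TEXT REFERENCES patients(patient_id),
    visit_date TEXT
);
CREATE TABLE services (
    visit_id INTEGER REFERENCES visits(visit_id),
    service_code TEXT
);
""")

# Fabricated example rows.
conn.execute("INSERT INTO patients VALUES ('id1', 1960)")
conn.execute("INSERT INTO genomics VALUES ('id1', 'EGFR', 'SNV', '2021-03-01')")
conn.execute("INSERT INTO visits VALUES (1, 'id1', '2021-03-15')")
conn.execute("INSERT INTO services VALUES (1, 'SVC-001')")

# Following the links: genomics -> visits (by patient) -> services (by visit).
rows = conn.execute(
    """
    SELECT g.gene, s.service_code
    FROM genomics AS g
    JOIN visits AS v ON v.patient_id = g.patient_id
    JOIN services AS s ON s.visit_id = v.visit_id
    WHERE g.patient_id = ?
    """,
    ("id1",),
).fetchall()
```

Accessing the genomics row for `id1` leads, through the shared identifier and the visit key, to the services recorded for that individual.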
[0241] In various examples, the medical records data may be added to the integrated data repository by generating a data file including first tokens generated using a first hash function. Individual first tokens may correspond to a respective individual of a group of individuals having data stored by a molecular data repository. Additionally, the data file may be sent to a medical records data management system and first data, including medical records data corresponding to the group of individuals, may be obtained from the medical records data management system in response to the data file. Further, a number of identifiers may be generated using a second hash function that is different from the first hash function. Each identifier may correspond to one or more tokens related to each individual of the group of individuals. Using the number of identifiers, second data may be obtained from the molecular data repository for the group of individuals. In various examples, respective portions of the first data may be determined that correspond to respective portions of the second data for the group of individuals. In this way, the integrated data repository may be generated that stores the respective portions of the first data and the respective portions of the second data in relation to respective identifiers of the number of identifiers.
[0242] After the integrated data repository storing medical records data is generated, a request may be received to determine data with respect to a number of individuals having data stored in the integrated data repository. The request may include one or more search criteria. In one or more examples, a subset of the number of individuals having one or more characteristics that correspond to the one or more search criteria may be determined and information of the subset of the number of individuals may be analyzed to determine a measure of significance of a characteristic of the one or more characteristics with respect to a biological condition.
[0243] In one or more illustrative examples, one or more genomic mutations may be determined to be present in the subset of the number of individuals and a plurality of treatments provided to the subset of the number of individuals may also be determined. In various examples, respective survival rates for the subset of the number of individuals may be determined, such as real-world survival rates. In at least some examples, the measure of significance may correspond to survival rate with respect to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations. Based on the measure of significance, an effectiveness of the treatment for the subset of the number of individuals may be determined. In one or more examples, individuals in the subset of the number of individuals that have not received the treatment may be determined. One or more therapeutically effective amounts of the treatment may be administered to the individuals in the subset of the number of individuals that have not received the treatment.
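As a hedged illustration, survival rates grouped by genomic mutation and treatment, and the individuals with a given mutation who have not received a given treatment, could be computed as follows. The record layout, the 24-month survival flag, the drug labels, and the function names are assumptions for this sketch, and each individual is simplified to a single treatment.

```python
from collections import defaultdict

# Fabricated example records; one treatment per individual for simplicity.
records = [
    {"id": "id1", "mutation": "EGFR", "treatment": "drug_a", "alive_at_24_months": True},
    {"id": "id2", "mutation": "EGFR", "treatment": "drug_a", "alive_at_24_months": True},
    {"id": "id3", "mutation": "EGFR", "treatment": "drug_b", "alive_at_24_months": False},
    {"id": "id4", "mutation": "EGFR", "treatment": "drug_b", "alive_at_24_months": True},
    {"id": "id5", "mutation": "KRAS", "treatment": "drug_a", "alive_at_24_months": False},
]

def survival_rates(records):
    """Return the fraction of survivors for each (mutation, treatment) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [survivors, total]
    for r in records:
        key = (r["mutation"], r["treatment"])
        counts[key][1] += 1
        if r["alive_at_24_months"]:
            counts[key][0] += 1
    return {key: survivors / total for key, (survivors, total) in counts.items()}

def not_yet_treated(records, mutation, treatment):
    """List individuals with the mutation who have not received the treatment."""
    return [r["id"] for r in records
            if r["mutation"] == mutation and r["treatment"] != treatment]

rates = survival_rates(records)
candidates = not_yet_treated(records, "EGFR", "drug_a")
```

A raw rate such as this is only a starting point; a real analysis would also account for sample size, follow-up time, and confounding factors before any conclusion about effectiveness is drawn.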
[0244] Figure 8 is a data flow diagram of an example process 800 to generate a number of datasets used to analyze information stored by an integrated data repository that stores health insurance claims data and genomics data, according to one or more implementations. The process 800 may include, at operation 802, determining a first set of data processing instructions that are executable in relation to first data stored by an integrated data repository. The integrated data repository may store health insurance claims data and molecular data for a common group of individuals. In one or more examples, the first set of data processing instructions may be included in a plurality of sets of data processing instructions that are part of a data processing pipeline. Each of the sets of data processing instructions of the data processing pipeline may be executed to generate a respective analytics ready dataset. For example, individual sets of data processing instructions of the data processing pipeline may be executable to generate datasets that include specified portions of information and/or combinations of information stored by the integrated data repository. In one or more additional examples, individual sets of data processing instructions of the data processing pipeline may be executable to analyze and modify portions of information stored by the integrated data repository to generate respective datasets. Additionally, individual sets of data processing instructions may be executable with respect to individual subsets of information stored by the integrated data repository.
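One way to picture a data processing pipeline made of individually executable sets of instructions is a registry of named steps, each of which reads a specified portion of the repository and returns a dataset. The sketch below is an assumption-laden illustration: the repository is a plain dictionary, and `select_columns`, `PIPELINE`, and `run_pipeline` are hypothetical names.

```python
# Hypothetical in-memory stand-in for the integrated data repository.
repository = {
    "claims": [
        {"patient_id": "id1", "diagnosis_code": "DX-1", "drug": "drug_a"},
        {"patient_id": "id2", "diagnosis_code": "DX-2", "drug": None},
    ],
    "genomics": [
        {"patient_id": "id1", "gene": "EGFR", "copy_number": 2},
    ],
}

def select_columns(table, columns):
    """Build a step that extracts the given columns from one table."""
    def step(repo):
        return [{c: row[c] for c in columns} for row in repo[table]]
    return step

# Each entry plays the role of one set of data processing instructions.
PIPELINE = {
    "diagnoses": select_columns("claims", ["patient_id", "diagnosis_code"]),
    "variants": select_columns("genomics", ["patient_id", "gene"]),
}

def run_pipeline(pipeline, repo):
    """Execute every step and collect the resulting datasets by name."""
    return {name: step(repo) for name, step in pipeline.items()}

datasets = run_pipeline(PIPELINE, repository)
```

Each step touches only its own subset of the stored information, so steps can be added, removed, or rerun independently.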
[0245] The process 800 may also include, at operation 804, causing the first set of data processing instructions to be executed to generate a first dataset. The first dataset may indicate a subset of the group of individuals in which a biological condition is present. The first set of data processing instructions may be executed to analyze data stored by the integrated data repository to identify a cohort of individuals in which the biological condition is present. In one or more illustrative examples, the biological condition may include a cancer. To illustrate, the first set of data processing instructions may be executed to analyze data stored by the integrated data repository to identify a cohort of individuals in which lung cancer is present. In various examples, the data processing pipeline may include multiple sets of data processing instructions to identify cohorts of individuals in which different biological conditions are present.
[0246] In one or more examples, the first set of data processing instructions may be executed to analyze at least one of health insurance claims data or molecular data to determine a cohort of individuals in which the biological condition is present. For example, the first set of data processing instructions may be executed to identify individuals having one or more health insurance codes present in health insurance claims data to determine a group of individuals in which the biological condition is present. Additionally, the first set of data processing instructions may be executed to identify individuals in which one or more mutations are present in a genomic region of nucleic acid molecules derived from samples obtained from the individuals to determine a group of individuals in which the biological condition is present.
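A minimal sketch of cohort identification from either data source is shown below. It assumes claims rows carry an ICD-10 style diagnosis code and uses the category prefix C34 (malignant neoplasm of bronchus and lung) purely as an example; the gene list, record layouts, and function names are likewise illustrative assumptions.

```python
def cohort_from_claims(claims, code_prefixes):
    """Individuals with at least one claim whose code starts with a given prefix."""
    prefixes = tuple(code_prefixes)
    return {c["patient_id"] for c in claims
            if c["diagnosis_code"].startswith(prefixes)}

def cohort_from_mutations(variants, genes):
    """Individuals with a reported mutation in any of the given genes."""
    return {v["patient_id"] for v in variants if v["gene"] in genes}

# Fabricated example data.
claims = [
    {"patient_id": "id1", "diagnosis_code": "C34.90"},
    {"patient_id": "id2", "diagnosis_code": "I10"},
    {"patient_id": "id3", "diagnosis_code": "C34.11"},
]
variants = [
    {"patient_id": "id2", "gene": "EGFR"},
    {"patient_id": "id3", "gene": "EGFR"},
    {"patient_id": "id4", "gene": "TP53"},
]

by_claims = cohort_from_claims(claims, ["C34"])
by_mutation = cohort_from_mutations(variants, {"EGFR"})

# "At least one of" claims data or molecular data: take the union.
cohort = by_claims | by_mutation
```

Whether the two criteria are combined by union, as here, or by intersection is a design choice that depends on how strictly the cohort should be defined.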
[0247] In addition, the process 800 may include, at operation 806, determining a second set of data processing instructions that are executable in relation to second data stored by the integrated data repository. The second set of data stored by the integrated data repository may be different from the first set of data stored by the integrated data repository and analyzed in relation to the first set of data processing instructions. For example, the first data may correspond to first columns of one or more first data tables stored by the integrated data repository and the second data may correspond to second columns of one or more second data tables stored by the integrated data repository.
[0248] At operation 808, the process 800 may include causing the second set of data processing instructions to be executed to generate a second dataset indicating one or more treatments provided to a second subset of the group of individuals. The second dataset may indicate a subset of the group of individuals that have received one or more treatments. The one or more treatments may be provided to individuals in which one or more biological conditions are present. In one or more examples, the second set of data processing instructions may be executed to analyze data stored by the integrated data repository to identify a cohort of individuals that received the one or more treatments. To illustrate, the second set of data processing instructions may be executed to analyze at least one of health insurance claims data or genomics data to determine a cohort of individuals that received the one or more treatments. In one or more illustrative examples, the second set of data processing instructions may be executed to identify individuals having one or more health insurance codes present in health insurance claims data to determine a group of individuals that received the one or more treatments.
[0249] Further, the process 800 may include, at operation 810, determining a third subset of the group of individuals that includes a portion of the first subset of the group of individuals that overlaps with a portion of the second subset of the group of individuals. As a result, the third subset of the group of individuals corresponds to individuals in which both the biological condition is present and the one or more treatments are provided. At operation 812, the process 800 may include analyzing the first dataset and the second dataset with respect to the third subset of the group of individuals to determine a measure of significance of a characteristic of the third subset of the group of individuals. In one or more examples, one or more machine learning techniques or statistical techniques may be applied to information included in at least one of the first dataset and the second dataset with respect to the third subset of the group of individuals. The measure of significance may correspond to a statistical measure of significance with respect to the characteristic. In one or more additional examples, the measure of significance may correspond to a probability of the characteristic being present in individuals in which the biological condition is present.
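The overlap and significance steps can be sketched as a set intersection followed by a statistical test. The example below uses a one-sided Fisher's exact test, which is only one of many statistical techniques that could serve as the measure of significance and is not stated in the disclosure; the patient labels and the function name are hypothetical.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """P-value for observing a count >= a in the top-left cell of a 2x2 table.

    Table layout:            outcome   no outcome
        characteristic          a          b
        no characteristic       c          d
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    # Sum hypergeometric probabilities for tables at least as extreme.
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / total

# Fabricated example subsets.
first_subset = {f"p{i}" for i in range(1, 10)}              # condition present
second_subset = {f"p{i}" for i in range(1, 9)} | {"p10"}    # treatment received
third_subset = first_subset & second_subset                 # both

mutated = {"p1", "p2", "p3", "p4"}     # characteristic of interest
survived = {"p1", "p2", "p3", "p5"}    # outcome of interest

a = len(third_subset & mutated & survived)
b = len((third_subset & mutated) - survived)
c = len((third_subset & survived) - mutated)
d = len(third_subset - mutated - survived)

p_value = fisher_exact_one_sided(a, b, c, d)
```

With these fabricated counts the table is 3, 1, 1, 3 and the p-value is 17/70, roughly 0.24, so this toy sample would not support a claim of association.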
[0250] In one or more illustrative examples, the characteristic may include one or more treatments provided to the individuals in which the biological condition is present. In one or more additional illustrative examples, the characteristic may include the presence of a mutation of a genomic region of nucleic acid molecules derived from samples obtained from individuals in which the biological condition is present. In various examples, information included in at least one of the first dataset or the second dataset may be analyzed to determine an impact of the characteristic with respect to one or more metrics. In one or more examples, information included in at least one of the first dataset or the second dataset may be analyzed to determine an amount of influence of a treatment on a survival rate of individuals in which the biological condition is present. In one or more further examples, information included in at least one of the first dataset or the second dataset may be analyzed to determine an amount of influence of a mutation of a genomic region on a survival rate of individuals in which the biological condition is present. Additionally, information included in the first dataset and the second dataset may be analyzed to determine an amount of impact of one or more treatments with respect to individuals in which the biological condition is present and in which one or more genomic mutations are also present.
[0251] Figure 9 illustrates a diagrammatic representation of a machine 900 in the form of a computer system within which a set of instructions may be executed for causing the machine 900 to perform any one or more of the methodologies discussed herein, according to an example implementation. Specifically, Figure 9 shows a diagrammatic representation of the machine 900 in the example form of a computer system, within which instructions 902 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 900 to perform any one or more of the methodologies discussed herein may be executed. For example, the instructions 902 may cause the machine 900 to implement the architectures and frameworks 100, 200, 300, 400, 500, 600 described with respect to Figures 1, 2, 3, 4, 5, and 6, respectively, and to execute the methods 700, 800 described with respect to Figures 7 and 8, respectively.
[0252] The instructions 902 transform the general, non-programmed machine 900 into a particular machine 900 programmed to carry out the described and illustrated functions in the manner described. In alternative implementations, the machine 900 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 900 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 900 may comprise, but not be limited to, a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), an entertainment media system, a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 902, sequentially or otherwise, that specify actions to be taken by the machine 900. Further, while only a single machine 900 is illustrated, the term “machine” shall also be taken to include a collection of machines 900 that individually or jointly execute the instructions 902 to perform any one or more of the methodologies discussed herein.
[0253] Examples of computing device 900 may include logic, one or more components, circuits (e.g., modules), or mechanisms. Circuits are tangible entities configured to perform certain operations. In an example, circuits may be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner. In an example, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware processors (processors) may be configured by software (e.g., instructions, an application portion, or an application) as a circuit that operates to perform certain operations as described herein. In an example, the software may reside (1) on a non-transitory machine readable medium or (2) in a transmission signal. In an example, the software, when executed by the underlying hardware of the circuit, causes the circuit to perform the certain operations.
[0254] In an example, a circuit may be implemented mechanically or electronically. For example, a circuit may comprise dedicated circuitry or logic that is specifically configured to perform one or more techniques such as discussed above, such as including a special-purpose processor, a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). In an example, a circuit may comprise programmable logic (e.g., circuitry, as encompassed within a general-purpose processor or other programmable processor) that may be temporarily configured (e.g., by software) to perform the certain operations. It will be appreciated that the decision to implement a circuit mechanically (e.g., in dedicated and permanently configured circuitry), or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
[0255] Accordingly, the term “circuit” is understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily (e.g., transitorily) configured (e.g., programmed) to operate in a specified manner or to perform specified operations. In an example, given a plurality of temporarily configured circuits, each of the circuits need not be configured or instantiated at any one instance in time. For example, where the circuits comprise a general-purpose processor configured via software, the general-purpose processor may be configured as respective different circuits at different times. Software may accordingly configure a processor, for example, to constitute a particular circuit at one instance of time and to constitute a different circuit at a different instance of time.
[0256] In an example, circuits may provide information to, and receive information from, other circuits. In this example, the circuits may be regarded as being communicatively coupled to one or more other circuits. Where multiple of such circuits exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the circuits. In implementations in which multiple circuits are configured or instantiated at different times, communications between such circuits may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple circuits have access. For example, one circuit may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further circuit may then, at a later time, access the memory device to retrieve and process the stored output. In an example, circuits may be configured to initiate or receive communications with input or output devices and may operate on a resource (e.g., a collection of information).
[0257] The various operations of method examples described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented circuits that operate to perform one or more operations or functions. In an example, the circuits referred to herein may comprise processor-implemented circuits.
[0258] Similarly, the methods described herein may be at least partially processor implemented. For example, at least some or all of the operations of a method may be performed by one or more processors or processor-implemented circuits. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In an example, the processor or processors may be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other examples the processors may be distributed across a number of locations.
[0259] The one or more processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service" (SaaS).
[0260] For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., Application Program Interfaces (APIs)).
[0261] Example implementations (e.g., apparatus, systems, or methods) may be implemented in digital electronic circuitry, in computer hardware, in firmware, in software, or in any combination thereof. Example implementations may be implemented using a computer program product (e.g., a computer program, tangibly embodied in an information carrier or in a machine readable medium, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers).
[0262] A computer program may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a standalone program or as a software module, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
[0263] In an example, operations may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Examples of method operations may also be performed by, and example apparatus may be implemented as, special purpose logic circuitry (e.g., a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)).
[0264] The computing system may include clients and servers. A client and server are generally remote from each other and generally interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In implementations deploying a programmable computing system, it will be appreciated that both hardware and software architectures require consideration. Specifically, it will be appreciated that the choice of whether to implement certain functionality in permanently configured hardware (e.g., an ASIC), in temporarily configured hardware (e.g., a combination of software and a programmable processor), or a combination of permanently and temporarily configured hardware may be a design choice. Below are set out hardware (e.g., computing device 900) and software architectures that may be deployed in example implementations.
[0265] In an example, the computing device 900 may operate as a standalone device or the computing device 900 may be connected (e.g., networked) to other machines.
[0266] In a networked deployment, the computing device 900 may operate in the capacity of either a server or a client machine in server-client network environments. In an example, computing device 900 may act as a peer machine in peer-to-peer (or other distributed) network environments. The computing device 900 may be a personal computer (PC), a tablet PC, a set-top box (STB), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) specifying actions to be taken (e.g., performed) by the computing device 900. Further, while only a single computing device 900 is illustrated, the term “computing device” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0267] Example computing device 900 may include a processor 904 (e.g., a central processing unit CPU), a graphics processing unit (GPU) or both), a main memory 906 and a static memory 908, some or all of which may communicate with each other via a bus 910. The computing device 900 may further include a display unit 912, an alphanumeric input device 914 (e.g., a keyboard), and a user interface (UI) navigation device 916 (e.g., a mouse). In an example, the display unit 912, input device 914 and UI navigation device 916 may be a touch screen display. The computing device 900 may additionally include a storage device (e.g., drive unit) 918, a signal generation device 920 (e.g., a speaker), a network interface device 922, and one or more sensors 924, such as a global positioning system (GPS) sensor, compass, accelerometer, or another sensor.
[0268] The storage device 918 may include a machine readable medium 926 on which is stored one or more sets of data structures or instructions 902 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. The instructions 902 may also reside, completely or at least partially, within the main memory 906, within static memory 908, or within the processor 904 during execution thereof by the computing device 900. In an example, one or any combination of the processor 904, the main memory 906, the static memory 908, or the storage device 918 may constitute machine readable media.
[0269] While the machine readable medium 926 is illustrated as a single medium, the term "machine readable medium" may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that are configured to store the one or more instructions 902. The term “machine readable medium” may also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. The term “machine readable medium” may accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Specific examples of machine-readable media may include non-volatile memory, including, by way of example, semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0271] The instructions 902 may further be transmitted or received over a communications network 828 using a transmission medium via the network interface device 922 utilizing any one of a number of transfer protocols (e.g., frame relay, IP, TCP, UDP, HTTP, etc.). Example communication networks may include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., IEEE 802.11 standards family known as Wi-Fi®, IEEE 802.16 standards family known as WiMax®), peer-to-peer (P2P) networks, among others. The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.
[0272] As used herein, a component may refer to a device, physical entity, or logic having boundaries defined by function or subroutine calls, branch points, APIs, or other technologies that provide for the partitioning or modularization of particular processing or control functions. Components may be combined via their interfaces with other components to carry out a machine process. A component may be a packaged functional hardware unit designed for use with other components and a part of a program that usually performs a particular function of related functions. Components may constitute either software components (e.g., code embodied on a machine-readable medium) or hardware components. A "hardware component" is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example implementations, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware components of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware component that operates to perform certain operations as described herein.
Cancer and Other Diseases
[0273] The present methods can be used to diagnose the presence of conditions, particularly cancer, in a subject; to characterize conditions (e.g., staging cancer or determining heterogeneity of a cancer); to monitor response to treatment of a condition; and to provide a prognosis of the risk of developing a condition or of the subsequent course of a condition. The present disclosure can also be useful in determining the efficacy of a particular treatment option. Successful treatment options may increase the amount of copy number variation or rare mutations detected in a subject's blood, as more cancer cells may die and shed DNA. In other examples, this may not occur. In another example, certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy.
[0274] Additionally, if a cancer is observed to be in remission after treatment, the present methods can be used to monitor residual disease or recurrence of disease.
[0275] In some embodiments, the methods and systems disclosed herein may be used to identify customized or targeted therapies to treat a given disease or condition in patients based on the classification of a nucleic acid variant as being of somatic or germline origin. Typically, the disease under consideration is a type of cancer. Non-limiting examples of such cancers include biliary tract cancer, bladder cancer, transitional cell carcinoma, urothelial carcinoma, brain cancer, gliomas, astrocytomas, breast carcinoma, metaplastic carcinoma, cervical cancer, cervical squamous cell carcinoma, rectal cancer, colorectal carcinoma, colon cancer, hereditary nonpolyposis colorectal cancer, colorectal adenocarcinomas, gastrointestinal stromal tumors (GISTs), endometrial carcinoma, endometrial stromal sarcomas, esophageal cancer, esophageal squamous cell carcinoma, esophageal adenocarcinoma, ocular melanoma, uveal melanoma, gallbladder carcinomas, gallbladder adenocarcinoma, renal cell carcinoma, clear cell renal cell carcinoma, transitional cell carcinoma, urothelial carcinomas, Wilms tumor, leukemia, acute lymphocytic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myeloid leukemia (CML), chronic myelomonocytic leukemia (CMML), liver cancer, liver carcinoma, hepatoma, hepatocellular carcinoma, cholangiocarcinoma, hepatoblastoma, lung cancer, non-small cell lung cancer (NSCLC), mesothelioma, B-cell lymphomas, non-Hodgkin lymphoma, diffuse large B-cell lymphoma, mantle cell lymphoma, T cell lymphomas, non-Hodgkin lymphoma, precursor T-lymphoblastic lymphoma/leukemia, peripheral T cell lymphomas, multiple myeloma, nasopharyngeal carcinoma (NPC), neuroblastoma, oropharyngeal cancer, oral cavity squamous cell carcinomas, osteosarcoma, ovarian carcinoma, pancreatic cancer, pancreatic ductal adenocarcinoma, pseudopapillary neoplasms, acinar cell carcinomas, prostate cancer, prostate adenocarcinoma, skin cancer, melanoma, malignant melanoma, cutaneous melanoma, small intestine carcinomas, stomach cancer, gastric carcinoma, gastrointestinal stromal tumor (GIST), uterine cancer, or uterine sarcoma. Type and/or stage of cancer can be detected from genetic variations including mutations, rare mutations, indels, copy number variations, transversions, translocations, inversion, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, and abnormal changes in nucleic acid 5-methylcytosine.
[0276] Genetic data can also be used for characterizing a specific form of cancer. Cancers are often heterogeneous in both composition and staging. Genetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer and allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. Some cancers can progress to become more aggressive and genetically unstable. Other cancers may remain benign, inactive or dormant. The system and methods of this disclosure may be useful in determining disease progression.
[0277] Further, the methods of the disclosure may be used to characterize the heterogeneity of an abnormal condition in a subject. Such methods can include, e.g., generating a genetic profile of extracellular polynucleotides derived from the subject, wherein the genetic profile includes a plurality of data resulting from copy number variation and rare mutation analyses. In some embodiments, an abnormal condition is cancer. In some embodiments, the abnormal condition may be one resulting in a heterogeneous genomic population. In the example of cancer, some tumors are known to comprise tumor cells in different stages of the cancer. In other examples, heterogeneity may comprise multiple foci of disease. Again, in the example of cancer, there may be multiple tumor foci, perhaps where one or more foci are the result of metastases that have spread from a primary site.
[0278] The present methods can be used to generate a profile, fingerprint, or set of data that is a summation of genetic information derived from different cells in a heterogeneous disease. This set of data may comprise copy number variation, epigenetic variation, and mutation analyses alone or in combination.
[0279] The present methods can be used to diagnose, prognose, monitor or observe cancers, or other diseases. In some embodiments, the methods herein do not involve the diagnosing, prognosing or monitoring of a fetus and as such are not directed to non-invasive prenatal testing. In other embodiments, these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.
[0280] Non-limiting examples of other genetic-based diseases, disorders, or conditions that are optionally evaluated using the methods and systems disclosed herein include achondroplasia, alpha-1 antitrypsin deficiency, antiphospholipid syndrome, autism, autosomal dominant polycystic kidney disease, Charcot-Marie-Tooth (CMT), cri du chat, Crohn's disease, cystic fibrosis, Dercum disease, Down syndrome, Duane syndrome, Duchenne muscular dystrophy, Factor V Leiden thrombophilia, familial hypercholesterolemia, familial Mediterranean fever, fragile X syndrome, Gaucher disease, hemochromatosis, hemophilia, holoprosencephaly, Huntington's disease, Klinefelter syndrome, Marfan syndrome, myotonic dystrophy, neurofibromatosis, Noonan syndrome, osteogenesis imperfecta, Parkinson's disease, phenylketonuria, Poland anomaly, porphyria, progeria, retinitis pigmentosa, severe combined immunodeficiency (SCID), sickle cell disease, spinal muscular atrophy, Tay-Sachs, thalassemia, trimethylaminuria, Turner syndrome, velocardiofacial syndrome, WAGR syndrome, Wilson disease, or the like.
[0281] In some embodiments, a method described herein includes detecting a presence or absence of DNA originating or derived from a tumor cell at a preselected timepoint following a previous cancer treatment of a subject previously diagnosed with cancer using a set of sequence information obtained as described herein. The method may further comprise determining a cancer recurrence score that is indicative of the presence or absence of the DNA originating or derived from the tumor cell for the test subject. Where a cancer recurrence score is determined, it may further be used to determine a cancer recurrence status. The cancer recurrence status may be at risk for cancer recurrence, e.g., when the cancer recurrence score is above a predetermined threshold. The cancer recurrence status may be at low or lower risk for cancer recurrence, e.g., when the cancer recurrence score is below a predetermined threshold. In particular embodiments, a cancer recurrence score equal to the predetermined threshold may result in a cancer recurrence status of either at risk for cancer recurrence or at low or lower risk for cancer recurrence.
[0282] In some embodiments, a cancer recurrence score is compared with a predetermined cancer recurrence threshold, and the test subject is classified as a candidate for a subsequent cancer treatment when the cancer recurrence score is above the cancer recurrence threshold or not a candidate for therapy when the cancer recurrence score is below the cancer recurrence threshold. In particular embodiments, a cancer recurrence score equal to the cancer recurrence threshold may result in classification as either a candidate for a subsequent cancer treatment or not a candidate for therapy.
[0283] The methods discussed above may further comprise any compatible feature or features set forth elsewhere herein, including in the section regarding methods of determining a risk of cancer recurrence in a test subject and/or classifying a test subject as being a candidate for a subsequent cancer treatment.
Methods of Determining a Risk of Cancer Recurrence in a Test Subject and/or Classifying a Test Subject as Being a Candidate for a Subsequent Cancer Treatment.
[0284] In some embodiments, a method provided herein is a method of determining a risk of cancer recurrence in a test subject. In some embodiments, a method provided herein is a method of classifying a test subject as being a candidate for a subsequent cancer treatment.
[0285] Any of such methods may comprise collecting DNA (e.g., originating or derived from a tumor cell) from the test subject diagnosed with the cancer at one or more preselected timepoints following one or more previous cancer treatments to the test subject. The subject may be any of the subjects described herein. The DNA may be cfDNA. The DNA may be obtained from a tissue sample.
[0286] Any of such methods may comprise capturing a plurality of sets of target regions from DNA from the subject, wherein the plurality of target region sets includes a sequence-variable target region set and an epigenetic target region set, whereby a captured set of DNA molecules is produced. The capturing step may be performed according to any of the embodiments described elsewhere herein. In any of such methods, the previous cancer treatment may comprise surgery, administration of a therapeutic composition, and/or chemotherapy.
[0287] Any of such methods may comprise sequencing the captured DNA molecules, whereby a set of sequence information is produced. The captured DNA molecules of the sequence-variable target region set may be sequenced to a greater depth of sequencing than the captured DNA molecules of the epigenetic target region set.
[0288] Any of such methods may comprise detecting a presence or absence of DNA originating or derived from a tumor cell at a preselected timepoint using the set of sequence information. The detection of the presence or absence of DNA originating or derived from a tumor cell may be performed according to any of the embodiments thereof described elsewhere herein.
[0289] Methods of determining a risk of cancer recurrence in a test subject may comprise determining a cancer recurrence score that is indicative of the presence or absence, or amount, of the DNA originating or derived from the tumor cell for the test subject. The cancer recurrence score may further be used to determine a cancer recurrence status. The cancer recurrence status may be at risk for cancer recurrence, e.g., when the cancer recurrence score is above a predetermined threshold. The cancer recurrence status may be at low or lower risk for cancer recurrence, e.g., when the cancer recurrence score is below a predetermined threshold. In particular embodiments, a cancer recurrence score equal to the predetermined threshold may result in a cancer recurrence status of either at risk for cancer recurrence or at low or lower risk for cancer recurrence.
[0290] Methods of classifying a test subject as being a candidate for a subsequent cancer treatment may comprise comparing the cancer recurrence score of the test subject with a predetermined cancer recurrence threshold, thereby classifying the test subject as a candidate for the subsequent cancer treatment when the cancer recurrence score is above the cancer recurrence threshold or not a candidate for therapy when the cancer recurrence score is below the cancer recurrence threshold. In particular embodiments, a cancer recurrence score equal to the cancer recurrence threshold may result in classification as either a candidate for a subsequent cancer treatment or not a candidate for therapy. In some embodiments, the subsequent cancer treatment includes chemotherapy or administration of a therapeutic composition.
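The threshold comparison described above can be illustrated with a minimal Python sketch. It assumes that a score below the threshold corresponds to low or lower risk; the names `RecurrenceStatus`, `recurrence_status`, `is_treatment_candidate`, and the tie-handling flag are hypothetical and are not taken from the disclosure, which leaves the treatment of a score equal to the threshold open.

```python
from enum import Enum


class RecurrenceStatus(Enum):
    AT_RISK = "at risk for cancer recurrence"
    LOWER_RISK = "at low or lower risk for cancer recurrence"


def recurrence_status(score: float, threshold: float,
                      tie_at_risk: bool = True) -> RecurrenceStatus:
    """Map a cancer recurrence score to a status by comparing it to a threshold.

    A score equal to the threshold may fall on either side; the hypothetical
    `tie_at_risk` flag makes that choice explicit.
    """
    if score > threshold or (score == threshold and tie_at_risk):
        return RecurrenceStatus.AT_RISK
    return RecurrenceStatus.LOWER_RISK


def is_treatment_candidate(score: float, threshold: float,
                           tie_is_candidate: bool = True) -> bool:
    """Classify the subject as a candidate for a subsequent cancer treatment."""
    status = recurrence_status(score, threshold, tie_at_risk=tie_is_candidate)
    return status is RecurrenceStatus.AT_RISK
```

For instance, `is_treatment_candidate(0.8, 0.5)` evaluates to `True` under these assumptions, while `is_treatment_candidate(0.5, 0.5, tie_is_candidate=False)` evaluates to `False`.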
[0291] Any of such methods may comprise determining a disease-free survival (DFS) period for the test subject based on the cancer recurrence score; for example, the DFS period may be 1 year, 2 years, 3 years, 4 years, 5 years, or 10 years.
[0292] In some embodiments, the set of sequence information includes sequence-variable target region sequences, and determining the cancer recurrence score may comprise determining at least a first subscore indicative of the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences.
[0293] In some embodiments, a number of mutations in the sequence-variable target regions chosen from 1, 2, 3, 4, or 5 is sufficient for the first subscore to result in a cancer recurrence score classified as positive for cancer recurrence. In some embodiments, the number of mutations is chosen from 1, 2, or 3.
[0294] In some embodiments, the set of sequence information includes epigenetic target region sequences, and determining the cancer recurrence score includes determining a second subscore indicative of the amount of molecules (obtained from the epigenetic target region sequences) that represent an epigenetic state different from DNA found in a corresponding sample from a healthy subject (e.g., cfDNA found in a blood sample from a healthy subject, or DNA found in a tissue sample from a healthy subject where the tissue sample is of the same type of tissue as was obtained from the test subject). These abnormal molecules (i.e., molecules with an epigenetic state different from DNA found in a corresponding sample from a healthy subject) may be consistent with epigenetic changes associated with cancer, e.g., methylation of hypermethylation variable target regions and/or perturbed fragmentation of fragmentation variable target regions, where “perturbed” means different from DNA found in a corresponding sample from a healthy subject.
[0295] In some embodiments, a proportion of molecules corresponding to the hypermethylation variable target region set and/or fragmentation variable target region set that indicate hypermethylation in the hypermethylation variable target region set and/or abnormal fragmentation in the fragmentation variable target region set greater than or equal to a value in the range of 0.001%-10% is sufficient for the second subscore to be classified as positive for cancer recurrence. The range may be 0.001%-1%, 0.005%-1%, 0.01%-5%, 0.01%-2%, or 0.01%-1%.
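As an illustration only, the proportion test for the second subscore might be sketched as follows in Python. The function name, its parameters, and the default cutoff of 0.01% are assumptions chosen from within the stated 0.001%-10% range; how molecules are counted as abnormal is outside the scope of this sketch.

```python
def epigenetic_subscore_positive(abnormal_molecules: int,
                                 total_molecules: int,
                                 cutoff_percent: float = 0.01) -> bool:
    """Return True when the percentage of abnormal molecules meets the cutoff.

    `cutoff_percent` is a hypothetical parameter that must lie in the
    0.001%-10% range given in the text.
    """
    if total_molecules <= 0:
        raise ValueError("total_molecules must be positive")
    if not 0 <= abnormal_molecules <= total_molecules:
        raise ValueError("abnormal_molecules must be between 0 and total_molecules")
    if not 0.001 <= cutoff_percent <= 10:
        raise ValueError("cutoff_percent must be within 0.001 to 10")
    proportion_percent = 100.0 * abnormal_molecules / total_molecules
    # The text specifies "greater than or equal to" the chosen value.
    return proportion_percent >= cutoff_percent
```

With these assumptions, 5 abnormal molecules out of 10,000 (0.05%) would be classified as positive at a 0.01% cutoff, and the same count would be negative at a 1% cutoff.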
[0296] In some embodiments, any of such methods may comprise determining a fraction of tumor DNA from the fraction of molecules in the set of sequence information that indicate one or more features indicative of origination from a tumor cell. This may be done for molecules corresponding to some or all of the epigenetic target regions, e.g., including one or both of hypermethylation variable target regions and fragmentation variable target regions (hypermethylation of a hypermethylation variable target region and/or abnormal fragmentation of a fragmentation variable target region may be considered indicative of origination from a tumor cell). This may be done for molecules corresponding to sequence variable target regions, e.g., molecules including alterations consistent with cancer, such as SNVs, indels, CNVs, and/or fusions. The fraction of tumor DNA may be determined based on a combination of molecules corresponding to epigenetic target regions and molecules corresponding to sequence variable target regions.
[0297] Determination of a cancer recurrence score may be based at least in part on the fraction of tumor DNA, wherein a fraction of tumor DNA greater than a threshold in the range of 10⁻¹¹ to 1 or 10⁻¹⁰ to 1 is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence. In some embodiments, a fraction of tumor DNA greater than or equal to a threshold in the range of 10⁻¹⁰ to 10⁻⁹, 10⁻⁹ to 10⁻⁸, 10⁻⁸ to 10⁻⁷, 10⁻⁷ to 10⁻⁶, 10⁻⁶ to 10⁻⁵, 10⁻⁵ to 10⁻⁴, 10⁻⁴ to 10⁻³, 10⁻³ to 10⁻², or 10⁻² to 10⁻¹ is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence. In some embodiments, the fraction of tumor DNA greater than a threshold of at least 10⁻⁷ is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence. A determination that a fraction of tumor DNA is greater than a threshold, such as a threshold corresponding to any of the foregoing embodiments, may be made based on a cumulative probability. For example, the sample may be considered positive if the cumulative probability that the tumor fraction is greater than a threshold in any of the foregoing ranges exceeds a probability threshold of at least 0.5, 0.75, 0.9, 0.95, 0.98, 0.99, 0.995, or 0.999. In some embodiments, the probability threshold is at least 0.95, such as 0.99.
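The cumulative-probability decision can be sketched in Python as below. This is a sketch under stated assumptions: the disclosure does not say how the probability distribution over tumor fractions is obtained, so the example simply takes a discrete distribution (candidate tumor fraction mapped to probability weight) as a hypothetical input. The function name and default thresholds are likewise illustrative.

```python
from typing import Mapping


def tumor_fraction_positive(posterior: Mapping[float, float],
                            fraction_threshold: float = 1e-7,
                            probability_threshold: float = 0.95) -> bool:
    """Decide positivity from a discrete distribution over tumor fractions.

    `posterior` maps candidate tumor fractions to non-negative weights
    (a hypothetical input; its estimation is not covered here). The sample
    is called positive when the cumulative probability of fractions above
    `fraction_threshold` exceeds `probability_threshold`.
    """
    total = sum(posterior.values())
    if total <= 0:
        raise ValueError("posterior weights must sum to a positive value")
    mass_above = sum(weight for fraction, weight in posterior.items()
                     if fraction > fraction_threshold)
    return mass_above / total > probability_threshold
```

For example, a distribution placing weight 0.02 on a fraction of 1e-9 and the remaining 0.98 on fractions above 1e-7 would be called positive at a 0.95 probability threshold and negative at 0.99.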
[0298] In some embodiments, the set of sequence information includes sequence-variable target region sequences and epigenetic target region sequences, and determining the cancer recurrence score includes determining a first subscore indicative of the amount of SNVs, insertions/deletions, CNVs and/or fusions present in sequence-variable target region sequences and a second subscore indicative of the amount of abnormal molecules in epigenetic target region sequences, and combining the first and second subscores to provide the cancer recurrence score. Where the first and second subscores are combined, they may be combined by applying a threshold to each subscore independently (e.g., greater than a predetermined number of mutations (e.g., > 1) in sequence-variable target regions, and greater than a predetermined fraction of abnormal molecules (i.e., molecules with an epigenetic state different from the DNA found in a corresponding sample from a healthy subject; e.g., tumor) in epigenetic target regions), or training a machine learning classifier to determine status based on a plurality of positive and negative training samples.
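One way to picture the independent-threshold option is the following Python sketch. The names `Subscores` and `combined_positive`, the default cutoffs, and the `require_both` flag are assumptions for illustration; the text does not state whether one or both subscores must pass, so the sketch exposes that choice rather than fixing it. The machine learning alternative is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscores:
    mutation_count: int        # first subscore: alterations in sequence-variable regions
    abnormal_fraction: float   # second subscore: fraction of abnormal epigenetic molecules


def combined_positive(subscores: Subscores,
                      min_mutations: int = 1,
                      min_abnormal_fraction: float = 1e-4,
                      require_both: bool = False) -> bool:
    """Apply a threshold to each subscore independently, then combine the calls.

    All default values are hypothetical. `require_both=True` demands that
    both subscores pass; otherwise either one passing is sufficient.
    """
    sequence_positive = subscores.mutation_count > min_mutations
    epigenetic_positive = subscores.abnormal_fraction > min_abnormal_fraction
    if require_both:
        return sequence_positive and epigenetic_positive
    return sequence_positive or epigenetic_positive
```

Under these assumptions, a sample with two mutations and no abnormal molecules is positive when either subscore suffices, and negative when both are required.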
[0299] In some embodiments, a value for the combined score in the range of -4 to 2 or -3 to 1 is sufficient for the cancer recurrence score to be classified as positive for cancer recurrence.
[0300] In any embodiment where a cancer recurrence score is classified as positive for cancer recurrence, the cancer recurrence status of the subject may be at risk for cancer recurrence and/or the subject may be classified as a candidate for a subsequent cancer treatment.
[0301] In some embodiments, the cancer is any one of the types of cancer described elsewhere herein, e.g., colorectal cancer.
[0302] For example, as shown in Figure 10, comprehensive evaluation, diagnostic testing, molecular and genetic profiling and/or risk assessment, can be utilized in combination for an assessment. As shown in Figure 11, patient consultation, treatment strategy and/or tailored treatment can be utilized in combination for treatment planning. For Figure 12, treatment implementation can include pre-treatment and/or treatment execution. For Figure 13, regular follow-ups and/or response assessment can constitute mechanisms for determining monitoring and adjustment. For Figure 14, post-treatment surveillance and/or recurrence management can support long term management and/or survivorship.
Therapies and Related Administration
[0303] In certain embodiments, the methods disclosed herein relate to identifying and administering customized therapies to patients given the status of a nucleic acid variant as being of somatic or germline origin. In some embodiments, essentially any cancer therapy (e.g., surgical therapy, radiation therapy, chemotherapy, and/or the like) may be included as part of these methods. Typically, customized therapies include at least one immunotherapy (or an immunotherapeutic agent). Immunotherapy refers generally to methods of enhancing an immune response against a given cancer type. In certain embodiments, immunotherapy refers to methods of enhancing a T cell response against a tumor or cancer.
[0304] In certain embodiments, the status of a nucleic acid variant from a sample from a subject as being of somatic or germline origin may be compared with a database of comparator results from a reference population to identify customized or targeted therapies for that subject. Typically, the reference population includes patients with the same cancer or disease type as the test subject and/or patients who are receiving, or who have received, the same therapy as the test subject. A customized or targeted therapy (or therapies) may be identified when the nucleic acid variant and the comparator results satisfy certain classification criteria (e.g., are a substantial or an approximate match).
[0305] In certain embodiments, the customized therapies described herein are typically administered parenterally (e.g., intravenously or subcutaneously). Pharmaceutical compositions containing an immunotherapeutic agent are typically administered intravenously. Certain therapeutic agents are administered orally. However, customized therapies (e.g., immunotherapeutic agents, etc.) may also be administered by methods such as, for example, buccal, sublingual, rectal, vaginal, intraurethral, topical, intraocular, intranasal, and/or intraauricular, which administration may include tablets, capsules, granules, aqueous suspensions, gels, sprays, suppositories, salves, ointments, or the like.
[0306] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the invention. It is therefore contemplated that the disclosure shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
[0307] While the foregoing disclosure has been described in some detail by way of illustration and example for purposes of clarity and understanding, it will be clear to one of ordinary skill in the art from a reading of this disclosure that various changes in form and detail can be made without departing from the true scope of the disclosure and may be practiced within the scope of the appended claims. For example, all the methods, systems, computer readable media, and/or component features, steps, elements, or other aspects thereof can be used in various combinations.
Kits
[0308] Also provided are kits including the compositions as described herein. The kits can be useful in performing the methods as described herein. In some embodiments, a kit includes a first reagent for partitioning a sample into a plurality of subsamples as described herein, such as any of the partitioning reagents described elsewhere herein. In some embodiments, a kit includes a second reagent for subjecting the first subsample to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity (e.g., any of the reagents described elsewhere herein for converting a nucleobase such as cytosine or methylated cytosine to a different nucleobase). The kit may comprise the first and second reagents and additional elements as discussed below and/or elsewhere herein.
[0309] Kits may further comprise a plurality of oligonucleotide probes that selectively hybridize to at least 5, 6, 7, 8, 9, 10, 20, 30, 40 or all genes selected from the group consisting of ALK, APC, BRAF, CDKN2A, EGFR, ERBB2, FBXW7, KRAS, MYC, NOTCH1, NRAS, PIK3CA, PTEN, RB1, TP53, MET, AR, ABL1, AKT1, ATM, CDH1, CSF1R, CTNNB1, ERBB4, EZH2, FGFR1, FGFR2, FGFR3, FLT3, GNA11, GNAQ, GNAS, HNF1A, HRAS, IDH1, IDH2, JAK2, JAK3, KDR, KIT, MLH1, MPL, NPM1, PDGFRA, PROC, PTPN11, RET, SMAD4, SMARCB1, SMO, SRC, STK11, VHL, TERT, CCND1, CDK4, CDKN2B, RAF1, BRCA1, CCND2, CDK6, NF1, TP53, ARID1A, BRCA2, CCNE1, ESR1, RIT1, GATA3, MAP2K1, RHEB, ROS1, ARAF, MAP2K2, NFE2L2, RHOA, and NTRK1. The number of genes to which the oligonucleotide probes can selectively hybridize can vary. For example, the number of genes can comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, or 54. The kit can include a container that includes the plurality of oligonucleotide probes and instructions for performing any of the methods described herein.
[0310] The oligonucleotide probes can selectively hybridize to exon regions of the genes, e.g., of the at least 5 genes. In some cases, the oligonucleotide probes can selectively hybridize to at least 30 exons of the genes, e.g., of the at least 5 genes. In some cases, the multiple probes can selectively hybridize to each of the at least 30 exons. The probes that hybridize to each exon can have sequences that overlap with at least 1 other probe. In some embodiments, the oligoprobes can selectively hybridize to non-coding regions of genes disclosed herein, for example, intronic regions of the genes. The oligoprobes can also selectively hybridize to regions of genes including both exonic and intronic regions of the genes disclosed herein.
[0311] Any number of exons can be targeted by the oligonucleotide probes. For example, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 400, 500, 600, 700, 800, 900, 1,000, or more exons can be targeted.
[0312] The kit can comprise at least 4, 5, 6, 7, or 8 different library adaptors having distinct molecular barcodes and identical sample barcodes. The library adaptors may not be sequencing adaptors. For example, the library adaptors do not include flow cell sequences or sequences that permit the formation of hairpin loops for sequencing. The different variations and combinations of molecular barcodes and sample barcodes are described throughout, and are applicable to the kit. Further, in some cases, the adaptors are not sequencing adaptors. Additionally, the adaptors provided with the kit can also comprise sequencing adaptors. A sequencing adaptor can comprise a sequence hybridizing to one or more sequencing primers. A sequencing adaptor can further comprise a sequence hybridizing to a solid support, e.g., a flow cell sequence. For example, a sequencing adaptor can be a flow cell adaptor. The sequencing adaptors can be attached to one or both ends of a polynucleotide fragment. In some cases, the kit can comprise at least 8 different library adaptors having distinct molecular barcodes and identical sample barcodes. The library adaptors may not be sequencing adaptors. The kit can further include a sequencing adaptor having a first sequence that selectively hybridizes to the library adaptors and a second sequence that selectively hybridizes to a flow cell sequence. In another example, a sequencing adaptor can be hairpin shaped. For example, the hairpin shaped adaptor can comprise a complementary double stranded portion and a loop portion, where the double stranded portion can be attached (e.g., ligated) to a double-stranded polynucleotide. Hairpin shaped sequencing adaptors can be attached to both ends of a polynucleotide fragment to generate a circular molecule, which can be sequenced multiple times.
A sequencing adaptor can be up to 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, or more bases from end to end. The sequencing adaptor can comprise 20-30, 20-40, 30-50, 30-60, 40-60, 40-70, 50-60, or 50-70 bases from end to end. In a particular example, the sequencing adaptor can comprise 20-30 bases from end to end. In another example, the sequencing adaptor can comprise 50-60 bases from end to end. A sequencing adaptor can comprise one or more barcodes. For example, a sequencing adaptor can comprise a sample barcode. The sample barcode can comprise a pre-determined sequence. The sample barcodes can be used to identify the source of the polynucleotides. The sample barcode can be at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, or more (or any length as described throughout) nucleic acid bases, e.g., at least 8 bases. The barcode can be contiguous or non-contiguous sequences, as described above.
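As a non-limiting illustration of how a pre-determined sample barcode can identify the source of a polynucleotide, the following minimal sketch looks up the leading bases of a sequence read in a barcode table. The 8-base barcode sequences, the sample names, the position of the barcode at the start of the read, and the exact-match rule are all assumptions made for illustration and are not taken from the disclosure.

```python
def assign_source(read, barcode_to_sample, barcode_length=8):
    """Return the sample whose barcode matches the start of the read, or None.

    Assumes a contiguous barcode at the start of the read and exact matching;
    both are simplifications chosen for this sketch.
    """
    return barcode_to_sample.get(read[:barcode_length])


# Hypothetical 8-base sample barcodes and sample names
barcodes = {
    "ACGTACGT": "sample_1",
    "TTGCAAGC": "sample_2",
}

source_a = assign_source("ACGTACGTGGATCCTTAGC", barcodes)   # matches sample_1
source_b = assign_source("GGGGGGGGGGATCCTTAGC", barcodes)   # no matching barcode
```

A practical implementation would typically also tolerate sequencing errors in the barcode; that is omitted here.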
[0313] The library adaptors can be blunt ended and Y-shaped and can be less than or equal to 40 nucleic acid bases in length. Other variations of the library adaptors can be found throughout and are applicable to the kit.
Biomarkers
[0314] The disclosure provides methods of using biomarkers for the diagnosis, prognosis, and therapy selection of a subject suffering from, e.g., cancer. A biomarker may be any gene or variant of a gene whose presence, mutation, deletion, substitution, copy number, or translation (i.e., to a protein) is an indicator of a disease state. Biomarkers of the present disclosure may include the presence, mutation, deletion, substitution, copy number, or translation in any one or more of EGFR, KRAS, MET, BRAF, MYC, NRAS, ERBB2, ALK, Notch, PIK3CA, APC, and SMO.
[0315] A biomarker is a genetic variant associated with one or more cancers. Biomarkers may be determined using any of several resources or methods. A biomarker may have been previously discovered or may be discovered de novo using experimental or epidemiological techniques. Detection of a biomarker may be indicative of cancer when the biomarker is highly correlated with a cancer. Detection of a biomarker may be indicative of cancer when a biomarker in a region or gene occurs with a frequency that is greater than a frequency for a given background population or dataset.
[0316] Publicly available resources such as scientific literature and databases may describe in detail genetic variants found to be associated with cancer. Scientific literature may describe experiments or genome-wide association studies (GWAS) associating one or more genetic variants with cancer. Databases may aggregate information gleaned from sources such as scientific literature to provide a more comprehensive resource for determining one or more biomarkers. Non-limiting examples of databases include FANTOM, GTEx, GEO, Body Atlas, INSiGHT, OMIM (Online Mendelian Inheritance in Man, omim.org), cBioPortal (cbioportal.org), CIViC (Clinical Interpretations of Variants in Cancer, civic.genome.wustl.edu), DoCM (Database of Curated Mutations, docm.genome.wustl.edu), and ICGC Data Portal (dcc.icgc.org). In a further example, the COSMIC (Catalogue of Somatic Mutations in Cancer) database allows for searching of biomarkers by cancer, gene, or mutation type. Biomarkers may also be determined de novo by conducting experiments such as case-control or association studies (e.g., genome-wide association studies).
[0317] One or more biomarkers may be detected in the sequencing panel. A biomarker may be one or more genetic variants associated with cancer. Biomarkers can be selected from single nucleotide variants (SNVs), copy number variants (CNVs), insertions or deletions (e.g., indels), gene fusions and inversions. Biomarkers may affect the level of a protein. Biomarkers may be in a promoter or enhancer, and may alter the transcription of a gene. The biomarkers may affect the transcription and/or translation efficacy of a gene. The biomarkers may affect the stability of a transcribed mRNA. The biomarker may result in a change to the amino acid sequence of a translated protein. The biomarker may affect splicing, may change the amino acid coded by a particular codon, may result in a frameshift, or may result in a premature stop codon. The biomarker may result in a conservative substitution of an amino acid. One or more biomarkers may result in a conservative substitution of an amino acid. One or more biomarkers may result in a nonconservative substitution of an amino acid.
[0318] One or more of the biomarkers may be a driver mutation. A driver mutation is a mutation that gives a selective advantage to a tumor cell in its microenvironment, through either increasing its survival or reproduction. None of the biomarkers may be a driver mutation. One or more of the biomarkers may be a passenger mutation. A passenger mutation is a mutation that has no effect on the fitness of a tumor cell but may be associated with a clonal expansion because it occurs in the same genome with a driver mutation.
[0319] The frequency of a biomarker may be as low as 0.001%. The frequency of a biomarker may be as low as 0.005%. The frequency of a biomarker may be as low as 0.01%. The frequency of a biomarker may be as low as 0.02%. The frequency of a biomarker may be as low as 0.03%. The frequency of a biomarker may be as low as 0.05%. The frequency of a biomarker may be as low as 0.1%. The frequency of a biomarker may be as low as 1%.
[0320] No single biomarker may be present in more than 50% of subjects having the cancer. No single biomarker may be present in more than 40% of subjects having the cancer. No single biomarker may be present in more than 30% of subjects having the cancer. No single biomarker may be present in more than 20% of subjects having the cancer. No single biomarker may be present in more than 10% of subjects having the cancer. No single biomarker may be present in more than 5% of subjects having the cancer. A single biomarker may be present in 0.001% to 50% of subjects having cancer. A single biomarker may be present in 0.01% to 50% of subjects having cancer. A single biomarker may be present in 0.01% to 30% of subjects having cancer. A single biomarker may be present in 0.01% to 20% of subjects having cancer. A single biomarker may be present in 0.01% to 10% of subjects having cancer. A single biomarker may be present in 0.1% to 10% of subjects having cancer. A single biomarker may be present in 0.1% to 5% of subjects having cancer.
[0321] Detection of a biomarker may indicate the presence of one or more cancers. Detection may indicate presence of a cancer selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer, non-small cell lung carcinoma (e.g., squamous cell carcinoma, or adenocarcinoma) or any other cancer. Detection may indicate the presence of any cancer selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer, non-small cell lung carcinoma (squamous cell or adenocarcinoma) or any other cancer. Detection may indicate the presence of any of a plurality of cancers selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer and non-small cell lung carcinoma (squamous cell or adenocarcinoma), or any other cancer. Detection may indicate presence of one or more of any of the cancers mentioned in this application.
[0322] One or more cancers may exhibit a biomarker in at least one exon in the panel. One or more cancers selected from the group including ovarian cancer, pancreatic cancer, breast cancer, colorectal cancer, non-small cell lung carcinoma (squamous cell or adenocarcinoma), or any other cancer, each exhibit a biomarker in at least one exon in the panel. Each of at least 3 of the cancers may exhibit a biomarker in at least one exon in the panel. Each of at least 4 of the cancers may exhibit a biomarker in at least one exon in the panel. Each of at least 5 of the cancers may exhibit a biomarker in at least one exon in the panel. Each of at least 8 of the cancers may exhibit a biomarker in at least one exon in the panel. Each of at least 10 of the cancers may exhibit a biomarker in at least one exon in the panel. All of the cancers may exhibit a biomarker in at least one exon in the panel.
[0323] If a subject has a cancer, the subject may exhibit a biomarker in at least one exon or gene in the panel. At least 85% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 90% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 92% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 95% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 96% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 97% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 98% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 99% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel. At least 99.5% of subjects having a cancer may exhibit a biomarker in at least one exon or gene in the panel.
[0324] If a subject has a cancer, the subject may exhibit a biomarker in at least one region in the panel. At least 85% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 90% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 92% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 95% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 96% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 97% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 98% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 99% of subjects having a cancer may exhibit a biomarker in at least one region in the panel. At least 99.5% of subjects having a cancer may exhibit a biomarker in at least one region in the panel.
[0325] Detection may be performed with a high sensitivity and/or a high specificity. Sensitivity can refer to a measure of the proportion of positives that are correctly identified as such. In some cases, sensitivity refers to the percentage of all existing biomarkers that are detected. In some cases, sensitivity refers to the percentage of sick people who are correctly identified as having certain disease. Specificity can refer to a measure of the proportion of negatives that are correctly identified as such. In some cases, specificity refers to the proportion of unaltered bases which are correctly identified. In some cases, specificity refers to the percentage of healthy people who are correctly identified as not having certain disease. The non-unique tagging method described previously significantly increases specificity of detection by reducing noise generated by amplification and sequencing errors, which reduces frequency of false positives. Detection may be performed with a sensitivity of at least 95%, 97%, 98%, 99%, 99.5%, or 99.9% and/or a specificity of at least 80%, 90%, 95%, 97%, 98% or 99%. Detection may be performed with a sensitivity of at least 90%, 95%, 97%, 98%, 99%, 99.5%, 99.6%, 99.98%, 99.9% or 99.95%. Detection may be performed with a specificity of at least 90%, 95%, 97%, 98%, 99%, 99.5%, 99.6%, 99.98%, 99.9% or 99.95%. 
Detection may be performed with a specificity of at least 70% and a sensitivity of at least 70%, a specificity of at least 75% and a sensitivity of at least 75%, a specificity of at least 80% and a sensitivity of at least 80%, a specificity of at least 85% and a sensitivity of at least 85%, a specificity of at least 90% and a sensitivity of at least 90%, a specificity of at least 95% and a sensitivity of at least 95%, a specificity of at least 96% and a sensitivity of at least 96%, a specificity of at least 97% and a sensitivity of at least 97%, a specificity of at least 98% and a sensitivity of at least 98%, a specificity of at least 99% and a sensitivity of at least 99%, or a specificity of 100% and a sensitivity of 100%. In some cases, the methods can detect a biomarker at a sensitivity of about 80% or greater. In some cases, the methods can detect a biomarker at a sensitivity of about 95% or greater. In some cases, the methods can detect a biomarker at a sensitivity of about 80% or greater, and a sensitivity of about 95% or greater.
[0326] Detection may be highly accurate. Accuracy may apply to the identification of biomarkers in cell free DNA, and/or to the diagnosis of cancer. Statistical tools, such as covariate analysis described above, may be used to increase and/or measure accuracy. The methods can detect a biomarker at an accuracy of at least 80%, 90%, 95%, 97%, 98%, 99%, 99.5%, 99.6%, 99.98%, 99.9%, or 99.95%. In some cases, the methods can detect a biomarker at an accuracy of at least 95% or greater.
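By way of illustration only, sensitivity (the proportion of positives correctly identified), specificity (the proportion of negatives correctly identified), and accuracy can be computed from counts of true and false calls as in the following minimal sketch. The function name and the example counts are hypothetical and are not results of any embodiment; accuracy is taken here in its common sense as the fraction of all calls that are correct, which is an assumption.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: positives that were detected / missed.
    tn/fp: negatives that were correctly called / falsely called positive.
    """
    sensitivity = tp / (tp + fn)          # proportion of positives identified
    specificity = tn / (tn + fp)          # proportion of negatives identified
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts chosen only for illustration
sens, spec, acc = detection_metrics(tp=95, fp=2, tn=98, fn=5)
```

The same calculation applies whether the unit counted is a base, a biomarker, or a subject, depending on which of the meanings of sensitivity and specificity given above is intended.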
Cancer Treatments, Therapies
[0327] In some cases, the cancer treatment includes, without limitation, imatinib, gefitinib, afatinib, dacomitinib, sunitinib, sorafenib, vandetanib, brivanib, cabozantinib, neratinib, tivantinib, bevacizumab, cixutumumab, dalotuzumab, figitumumab, rilotumumab, onartuzumab, ganitumab, ramucirumab, ridaforolimus, temsirolimus, everolimus, BMS-690514, BMS-754807, EMD 525797, GDC-0973, GDC-0941, MK-2206, AZD6244, GSK1120212, PX-866, XL821, IMC-A12, MM-121, PF-02341066, RG7160, and Sym004. Antibodies suitable for use as anti-EGFR therapy include cetuximab (Trade Name: Erbitux) and panitumumab (Trade Name: Vectibix). In some cases, the cancer treatment includes EGFR tyrosine kinase inhibitors such as gefitinib (Trade Name: Iressa), erlotinib (Trade Name: Tarceva), lapatinib, canertinib, and cetuximab.
[0328] In some instances, therapies may be used in combination, such as an anti-EGFR therapy and an anti-EGFR therapy. Anti-EGFR therapy may be used in combination with any combination of chemotherapeutic agents or chemotherapeutic regimens, for example, FOLFOX (fluorouracil [5-FU]/leucovorin/oxaliplatin), FOLFIRI (5-FU/leucovorin/irinotecan), and the like.
[0329] In some aspects, a cancer treatment is administered to a subject. In some cases, the cancer treatment is administered in combination with another therapy, such as a non-anti-EGFR therapy with anti-EGFR therapy.
Sequencing panel
[0330] To improve the likelihood of detecting tumor indicating mutations, the region of DNA sequenced may comprise a panel of genes or genomic regions. Selection of a limited region for sequencing (e.g., a limited panel) can reduce the total sequencing needed (e.g., a total amount of nucleotides sequenced). A sequencing panel can target a plurality of different genes or regions to detect a single cancer, a set of cancers, or all cancers.
[0331] In some aspects, a panel that targets a plurality of different genes or genomic regions is selected such that a determined proportion of subjects having a cancer exhibits a genetic variant or biomarker in one or more different genes or genomic regions in the panel. The panel may be selected to limit a region for sequencing to a fixed number of base pairs. The panel may be selected to sequence a desired amount of DNA. The panel may be further selected to achieve a desired sequence read depth. The panel may be selected to achieve a desired sequence read depth or sequence read coverage for an amount of sequenced base pairs. The panel may be selected to achieve a theoretical sensitivity, a theoretical specificity and/or a theoretical accuracy for detecting one or more genetic variants in a sample.
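The relationship between panel size and read depth for a fixed amount of sequenced base pairs can be sketched with simple arithmetic. The following is a minimal illustration that assumes mean depth is approximately the number of on-target sequenced bases divided by the panel size in base pairs; the function name, the on-target fraction parameter, and the numerical values are hypothetical.

```python
def mean_read_depth(total_sequenced_bases, panel_size_bp, on_target_fraction=1.0):
    """Approximate mean read depth over a panel.

    Assumes sequenced bases are spread evenly over the panel, which is an
    idealization; real coverage varies by region.
    """
    return total_sequenced_bases * on_target_fraction / panel_size_bp


# Hypothetical budget of 3 billion sequenced bases
depth_small_panel = mean_read_depth(3_000_000_000, panel_size_bp=100_000)
depth_large_panel = mean_read_depth(3_000_000_000, panel_size_bp=1_000_000)
```

Under these assumptions, a tenfold smaller panel yields a tenfold greater mean depth for the same amount of sequencing, which is one reason a limited panel may be selected.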
[0332] Probes for detecting the panel of regions can include those for detecting hotspot regions as well as nucleosome-aware probes (e.g., KRAS codons 12 and 13) and may be designed to optimize capture based on analysis of cfDNA coverage and fragment size variation impacted by nucleosome binding patterns and GC sequence composition. Regions used herein can also include non-hotspot regions optimized based on nucleosome positions and GC models. The panel can comprise a plurality of subpanels, including subpanels for identifying tissue of origin (e.g., use of published literature to define 50-100 baits representing genes with the most diverse transcription profile across tissues (not necessarily promoters)), whole genome scaffold (e.g., for identifying ultra-conservative genomic content and tiling sparsely across chromosomes with a handful of probes for copy number baselining purposes), and transcription start site (TSS)/CpG islands (e.g., for capturing differentially methylated regions (DMRs) in, for example, promoters of tumor suppressor genes (e.g., SEPT9/VIM in colorectal cancer)). In some embodiments, markers for a tissue of origin are tissue-specific epigenetic markers.
[0333] The one or more regions in the panel can comprise one or more loci from one or a plurality of genes. The plurality of genes may be selected for sequencing and biomarker detection. Genes included in the region to be sequenced may be selected from genes known to be involved in cancer, or from genes not involved in cancer. For example, the plurality of genes in the panel may be oncogenes, tumor suppressors, growth factors, DNA repair genes, signaling genes, transcription factors, receptors or metabolic genes. Examples of genes that may be in the panel include, but are not limited to: APC, AR, ARID1A, BRAF, BRCA1, BRCA2, CCND1, CCND2, CCNE1, CDK4, CDK6, CDKN2A, CDKN2B, EGFR, ERBB2, FGFR1, FGFR2, HRAS, KIT, KRAS, MET, MYC, NF1, NRAS, PDGFRA, PIK3CA, PTEN, RAF1, TP53, AKT1, ALK, ARAF, ATM, CDH1, CTNNB1, ESR1, EZH2, FBXW7, FGFR3, GATA3, GNA11, GNAQ, GNAS, HNF1A, IDH1, IDH2, JAK2, JAK3, MAP2K1, MAP2K2, MLH1, MPL, NFE2L2, NOTCH1, NPM1, NTRK1, PTPN11, RET, RHEB, RHOA, RIT1, ROS1, SMAD4, SMO, SRC, STK11, TERT, VHL.
[0334] In some cases, the one or more regions in the panel can comprise one or more loci from one or a plurality of genes, including one or more of AKT1, ALK, APC, ATM, BRAF, CTNNB1, EGFR, ERBB2, ESR1, FGFR2, GATA3, GNAS, IDH1, IDH2, KIT, KRAS, MET, NRAS, PDGFRA, PIK3CA, PTEN, RB1, SMAD4, STK11, and TP53.
[0335] In some cases, the one or more regions in a panel for colorectal cancer can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, four of, or five of TP53, APC, BRAF, KRAS, and NRAS. In some cases, the one or more regions in a panel for ovarian cancer can comprise one or more loci from one or a plurality of genes, including TP53. In some cases, the one or more regions in a panel for pancreatic cancer can comprise one or more loci from one or a plurality of genes, including one or both of TP53 and KRAS. In some cases, the one or more regions in a panel for lung adenocarcinoma can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, four of, five of, six of, seven of, or eight of TP53, BRAF, KRAS, EGFR, ERBB2, MET, STK11, and ALK. In some cases, the one or more regions in a panel for lung squamous cell carcinoma can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, four of, or five of TP53, BRAF,
KRAS, MET, and ALK. In some cases, the one or more regions in a panel for breast cancer can comprise one or more loci from one or a plurality of genes, including one of, two of, three of, or four of TP53, GATA3, PIK3CA, and ESR1. In some cases, one or more regions in a panel can comprise one or more loci from a combination of any of the above genes, for example, to detect a combination of cancer types. In some cases, one or more regions in a panel can comprise one or more loci from each of the preceding genes, for example, in a pan-cancer panel.
[0336] In some cases, the one or more regions in a panel for lung cancer can comprise one or more loci from a plurality of genes, including one of, two of, three of, four of, five of, six of, seven of, eight of, nine of, 10 of, 11 of, 12 of, 13 of, 14 of, 15 of, 16 of, 17 of, 18 of, 19 of, or 20 of EGFR, KRAS, TP53, CDKN2A, STK11, BRAF, PIK3CA, RB1, ERBB2, PTEN, NFE2L2, MET, CTNNB1, NRAS, MUC16, NF1, BAB, SMARCA4, ATM, NTRK3, and ERBB4. Such a panel also may include, or have substituted for any or all of the above, any or all of an EGFR Exon 19 deletion, EGFR L858R, EGFR C797S, EGFR T790M, EGFR S645C, ARAF S214C and S214F, ERBB2 S418T, MET exon 14 skipping, SNVs and indels. Many of these genes may be clinically actionable, such that an observed anomaly in MAF (e.g., significantly higher or lower than in normal control subjects) may be indicative of a clinical state relevant to lung cancer, such as diagnosis, prognosis, risk stratification, treatment selection, tumor resistance to treatment, tumor burden, etc. Such a lung cancer targeted panel may comprise a relatively small number of these lung cancer associated genes.
[0337] In some cases, the one or more regions in a panel for breast cancer can comprise one or more loci from a plurality of genes, including any one of, or any combination of, ACVRL1, AFF2, AGMO, AGTR2, AHNAK, AHNAK2, AKAP9, AKT1, AKT2, ALK, APC, ARID1A, ARID1B, ARID2, ARID5B, ASXL1, ASXL2, ATR, BAP1, BCAS3, BIRC6, BRAF, BRCA1, BRCA2, BRIP1, CACNA2D3, CASP8, CBFB, CCND3, CDH1, CDKN1B, CDKN2A, CHD1, CHEK2, CLK3, CLRN2, COL12A1, COL22A1, COL6A3, CTCF, CTNNA1, CTNNA3, DCAF4L2, DNAH11, DNAH2, DNAH5, DTWD2, EGFR, EP300, ERBB2, ERBB3, ERBB4, FAM20C, FANCA, FANCD2, FBXW7, FLT3, FOXO1, FOXO3, FOXP1, FRMD3, GATA3, GH1, GLDC, GPR124, GPR32, GPS2, HDAC9, HERC2, HIST1H2BC, HRAS, JAK1, KDM3A, KDM6A, KLRG1, KMT2C, KRAS, L1CAM, LAMA2, LAMB3, LARGE, LDLRAP1, LIFR, LIPI, MAGEA8, MAP2K4, MAP3K1, MAP3K10, MAP3K13, MBL2, MEN1, LL2, MLLT4, MTAP, MUC16, MYH9, MYO1A, MYO3A, NCOA3, NCOR1, NCOR2, NDFIP1, NEK1, NF1, NF2, NOTCH1, NPNT, NR2F1, NR3C1, NRAS, NRG3, NT5E, OR6A2, PALLD, PBRM1, PDE4DIP, PIK3CA, PIK3R1, PPP2CB, PPP2R2A, PRKACG, PRKCE, PRKCQ, PRKCZ, PRKG1, PRPS2, PRR16, PTEN, PTPN22, PTPRD, PTPRM, RASGEF1B, RB1, ROS1, RPGR, RUNX1, RYR2, SBNO1, SETD1A, SETD2, SETDB1, SF3B1, SGCD, SHANK2, SIAH1, SIK1, SIK2, SMAD2, SMAD4, SMARCB1, SMARCC1, SMARCC2, SMARCD1, SPACA1, STAB2, STK11, STMN2, SYNE1, TAF1, TAF4B, TBL1XR1, TBX3, TG, THADA, THSD7A, TP53, TTYH1, UBR5, USH2A, USP28, USP9X, UTRN, and ZFP36L1. Many of these genes may be clinically actionable, such that an observed anomaly in MAF (e.g., significantly higher or lower than in normal control subjects) may be indicative of a clinical state relevant to breast cancer, such as diagnosis, prognosis, risk stratification, treatment selection, tumor resistance to treatment, tumor burden, etc. Such a breast cancer targeted panel may comprise a relatively small number of these breast cancer associated genes.
[0338] In some cases, the one or more regions in a panel for colorectal cancer can comprise one or more loci from a plurality of genes, including one of, two of, three of, four of, five of, or six of TP53, BRAF, KRAS, APC, TGFBR, and PIK3CA. Many of these genes may be clinically actionable, such that an observed anomaly in MAF (e.g., significantly higher or lower than in normal control subjects) may be indicative of a clinical state relevant to colorectal cancer, such as diagnosis, prognosis, risk stratification, treatment selection, tumor resistance to treatment, tumor burden, etc. Such a colorectal cancer targeted panel may comprise a relatively small number of these colorectal cancer associated genes.
[0339] In some embodiments, the one or more regions in the panel comprise one or more loci from one or a plurality of genes for detecting residual cancer after surgery. This detection can be earlier than is possible for existing methods of cancer detection. In some embodiments, the one or more regions in the panel comprise one or more loci from one or a plurality of genes for detecting cancer in a high-risk patient population. For example, smokers have much higher rates of lung cancer than the general population. Moreover, smokers can develop other lung conditions that make cancer detection more difficult, such as the development of irregular nodules in the lungs. In some embodiments, the methods described herein detect cancer in high risk patients earlier than is possible for existing methods of cancer detection.
[0340] A region may be selected for inclusion in a sequencing panel based on a number of subjects with a cancer that have a biomarker in that gene or region. A region may be selected for inclusion in a sequencing panel based on prevalence of subjects with a cancer and a biomarker present in that gene. Presence of a biomarker in a region may be indicative of a subject having cancer.
[0341] In some instances, the panel may be selected using information from one or more databases. The information regarding a cancer may be derived from cancer tumor biopsies or cfDNA assays. A database may comprise information describing a population of sequenced tumor samples. A database may comprise information about mRNA expression in tumor samples. A database may comprise information about regulatory elements in tumor samples. The information relating to the sequenced tumor samples may include the frequency of various genetic variants and describe the genes or regions in which the genetic variants occur. The genetic variants may be biomarkers. A non-limiting example of such a database is COSMIC. COSMIC is a catalogue of somatic mutations found in various cancers. For a particular cancer, COSMIC ranks genes based on frequency of mutation. A gene may be selected for inclusion in a panel by having a high frequency of mutation within a given gene. For instance, COSMIC indicates that 33% of a population of sequenced breast cancer samples have a mutation in TP53 and 22% of a population of sampled breast cancers have a mutation in KRAS. Other ranked genes, including APC, have mutations found only in about 4% of a population of sequenced breast cancer samples. TP53 and KRAS may be included in a sequencing panel based on having relatively high frequency among sampled breast cancers (compared to APC, for example, which occurs at a frequency of about 4%). COSMIC is provided as a non-limiting example; however, any database or set of information may be used that associates a cancer with a biomarker located in a gene or genetic region. In another example, as provided by COSMIC, of 1156 biliary tract cancer samples, 380 samples (33%) carried mutations in TP53. Several other genes, such as APC, have mutations in 4-8% of all samples. Thus, TP53 may be selected for inclusion in the panel based on a relatively high frequency in a population of biliary tract cancer samples.
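The frequency-based ranking described above can be illustrated with a minimal sketch that converts per-gene counts of mutated samples into frequencies, ranks the genes, and keeps those above a cutoff. The TP53 count (380 of 1156 samples) is the figure given above; the APC count and the 20% cutoff are hypothetical values chosen for illustration, and the function names are not part of any database interface.

```python
def rank_genes_by_frequency(mutated_counts, total_samples):
    """Return (gene, frequency) pairs sorted from most to least frequently mutated."""
    freqs = {gene: count / total_samples for gene, count in mutated_counts.items()}
    return sorted(freqs.items(), key=lambda item: item[1], reverse=True)


def select_genes(ranked, min_frequency):
    """Keep genes whose mutation frequency meets the (hypothetical) cutoff."""
    return [gene for gene, freq in ranked if freq >= min_frequency]


# TP53: 380 of 1156 samples, as stated in the text.
# APC: hypothetical count chosen to fall within the stated 4-8% range.
counts = {"APC": 58, "TP53": 380}
ranked = rank_genes_by_frequency(counts, total_samples=1156)
panel = select_genes(ranked, min_frequency=0.20)
```

With these inputs, TP53 (about 33%) ranks first and is retained, while APC (about 5%) falls below the illustrative cutoff.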
[0342] A gene or region may be selected for a panel where the frequency of a biomarker is significantly greater in sampled tumor tissue or circulating tumor DNA than found in a given background population. A combination of regions may be selected for inclusion in a panel such that at least a majority of subjects having a cancer will have a biomarker present in at least one of the regions or genes in the panel. The combination of regions may be selected based on data indicating that, for a particular cancer or set of cancers, a majority of subjects have one or more biomarkers in one or more of the selected regions. For example, to detect cancer 1, a panel including regions A, B, C, and/or D may be selected based on data indicating that 90% of subjects with cancer 1 have a biomarker in regions A, B, C, and/or D of the panel. Alternately, biomarkers may be shown to occur independently in two or more regions in subjects having a cancer such that, combined, a biomarker in the two or more regions is present in a majority of a population of subjects having a cancer. For example, to detect cancer 2, a panel including regions X, Y, and Z may be selected based on data indicating that 90% of subjects have a biomarker in one or more regions, and in 30% of such subjects a biomarker is detected only in region X, while biomarkers are detected only in regions Y and/or Z for the remainder of the subjects for whom a biomarker was detected. Biomarkers present in one or more regions previously shown to be associated with one or more cancers may be indicative of or predictive of a subject having cancer if a biomarker is detected in one or more of those regions 50% or more of the time. Computational approaches such as models employing conditional probabilities of detecting cancer given a known cancer frequency for a set of biomarkers within one or more regions may be used to predict which regions, alone or in combination, may be predictive of cancer.
Other approaches for panel selection involve the use of databases describing information from studies employing comprehensive genomic profiling of tumors with large panels and/or whole genome sequencing (WGS, RNA-seq, ChIP-seq, bisulfite sequencing, ATAC-seq, and others). Information gleaned from literature may also describe pathways commonly affected and mutated in certain cancers. Panel selection may be further informed by the use of ontologies describing genetic information.
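One way to evaluate whether a combination of regions covers a desired proportion of subjects is sketched below. The first function computes the fraction of subjects having a biomarker in at least one region of a candidate panel; the second uses a simple greedy heuristic to add regions until a target fraction is reached. The greedy heuristic is an illustrative choice and is not the conditional-probability modeling mentioned above. The subject data are hypothetical and are loosely patterned on the cancer 2 example (regions X, Y, and Z).

```python
def panel_coverage(subject_regions, panel):
    """Fraction of subjects with a biomarker in at least one panel region."""
    panel = set(panel)
    covered = sum(1 for regions in subject_regions if panel & set(regions))
    return covered / len(subject_regions)


def greedy_panel(subject_regions, candidates, target):
    """Add the region that most improves coverage until the target is met."""
    panel = []
    remaining = list(candidates)
    while remaining and panel_coverage(subject_regions, panel) < target:
        best = max(remaining,
                   key=lambda r: panel_coverage(subject_regions, panel + [r]))
        gain = (panel_coverage(subject_regions, panel + [best])
                - panel_coverage(subject_regions, panel))
        if gain <= 0:
            break  # no remaining region covers any additional subject
        panel.append(best)
        remaining.remove(best)
    return panel


# Hypothetical cohort of 10 subjects: regions in which each has a biomarker
subjects = ([{"X"}] * 3 + [{"Y"}] * 3 + [{"Y", "Z"}] + [{"Z"}] * 2 + [set()])

coverage_all = panel_coverage(subjects, ["X", "Y", "Z"])
selected = greedy_panel(subjects, ["X", "Y", "Z"], target=0.9)
```

In this hypothetical cohort, no single region covers a majority of subjects, but the three regions together cover 9 of 10 subjects.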
[0343] Genes included in the panel for sequencing can include the fully transcribed region, the promoter region, enhancer regions, regulatory elements, and/or downstream sequence. To further increase the likelihood of detecting tumor-indicating mutations, only exons may be included in the panel. The panel can comprise all exons of a selected gene, or only one or more of the exons of a selected gene. The panel may comprise exons from each of a plurality of different genes. The panel may comprise at least one exon from each of the plurality of different genes.
[0344] In some aspects, a panel of exons from each of a plurality of different genes is selected such that a determined proportion of subjects having a cancer exhibit a genetic variant in at least one exon in the panel of exons.
[0345] At least one full exon from each different gene in a panel of genes may be sequenced. The sequenced panel may comprise exons from a plurality of genes. The panel may comprise exons from 2 to 100 different genes, from 2 to 70 genes, from 2 to 50 genes, from 2 to 30 genes, from 2 to 15 genes, or from 2 to 10 genes.
[0346] A selected panel may comprise a varying number of exons. The panel may comprise from 2 to 3000 exons. The panel may comprise from 2 to 1000 exons. The panel may comprise from 2 to 500 exons. The panel may comprise from 2 to 100 exons. The panel may comprise from 2 to 50 exons. The panel may comprise no more than 300 exons. The panel may comprise no more than 200 exons. The panel may comprise no more than 100 exons. The panel may comprise no more than 50 exons. The panel may comprise no more than 40 exons. The panel may comprise no more than 30 exons. The panel may comprise no more than 25 exons. The panel may comprise no more than 20 exons. The panel may comprise no more than 15 exons. The panel may comprise no more than 10 exons. The panel may comprise no more than 9 exons. The panel may comprise no more than 8 exons. The panel may comprise no more than 7 exons.
[0347] The panel may comprise one or more exons from a plurality of different genes. The panel may comprise one or more exons from each of a proportion of the plurality of different genes. The panel may comprise at least two exons from each of at least 25%, 50%, 75% or 90% of the different genes. The panel may comprise at least three exons from each of at least 25%, 50%, 75% or 90% of the different genes. The panel may comprise at least four exons from each of at least 25%, 50%, 75% or 90% of the different genes.
[0348] The sizes of the sequencing panel may vary. A sequencing panel may be made larger or smaller (in terms of nucleotide size) depending on several factors including, for example, the total amount of nucleotides sequenced or a number of unique molecules sequenced for a particular region in the panel. The sequencing panel can be sized 5 kb to 50 kb. The sequencing panel can be 10 kb to 30 kb in size. The sequencing panel can be 12 kb to 20 kb in size. The sequencing panel can be 12 kb to 60 kb in size. The sequencing panel can be at least 10 kb, 12 kb, 15 kb, 20 kb, 25 kb, 30 kb, 35 kb, 40 kb, 45 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, 110 kb, 120 kb, 130 kb, 140 kb, or 150 kb in size. The sequencing panel may be less than 100 kb, 90 kb, 80 kb, 70 kb, 60 kb, or 50 kb in size.
[0349] The panel selected for sequencing can comprise at least 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 80, or 100 regions. In some cases, the regions in the panel are selected such that the size of the regions is relatively small. In some cases, the regions in the panel have a size of about 10 kb or less, about 8 kb or less, about 6 kb or less, about 5 kb or less, about 4 kb or less, about 3 kb or less, about 2.5 kb or less, about 2 kb or less, about 1.5 kb or less, or about 1 kb or less. In some cases, the regions in the panel have a size from about 0.5 kb to about 10 kb, from about 0.5 kb to about 6 kb, from about 1 kb to about 11 kb, from about 1 kb to about 15 kb, from about 1 kb to about 20 kb, from about 0.1 kb to about 10 kb, or from about 0.2 kb to about 1 kb. For example, the regions in the panel can have a size from about 0.1 kb to about 5 kb.
[0350] The panel selected herein can allow for deep sequencing that is sufficient to detect low-frequency genetic variants (e.g., in cell-free nucleic acid molecules obtained from a sample). An amount of genetic variants in a sample may be referred to in terms of the minor allele frequency for a given genetic variant. The minor allele frequency may refer to the frequency at which minor alleles (e.g., not the most common allele) occur in a given population of nucleic acids, such as a sample. Genetic variants at a low minor allele frequency may have a relatively low frequency of presence in a sample. In some cases, the panel allows for detection of genetic variants at a minor allele frequency of at least 0.0001%, 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, or 0.5%. The panel can allow for detection of genetic variants at a minor allele frequency of 0.001% or greater. The panel can allow for detection of genetic variants at a minor allele frequency of 0.01% or greater. The panel can allow for detection of a genetic variant present in a sample at a frequency of as low as 0.0001%, 0.001%, 0.005%, 0.01%, 0.025%, 0.05%, 0.075%, 0.1%, 0.25%, 0.5%, 0.75%, or 1.0%. The panel can allow for detection of biomarkers present in a sample at a frequency of at least 0.0001%, 0.001%, 0.005%, 0.01%, 0.025%, 0.05%, 0.075%, 0.1%, 0.25%, 0.5%, 0.75%, or 1.0%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 1.0%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.75%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.5%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.25%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.1%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.075%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.05%.
The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.025%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.01%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.005%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.001%. The panel can allow for detection of biomarkers at a frequency in a sample as low as 0.0001%. The panel can allow for detection of biomarkers in sequenced cfDNA at a frequency in a sample as low as 1.0% to 0.0001%. The panel can allow for detection of biomarkers in sequenced cfDNA at a frequency in a sample as low as 0.01% to 0.0001%.
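The relationship between sequencing depth and the lowest detectable allele frequency in paragraph [0350] can be illustrated with a simple binomial model. This is a sketch under stated assumptions: each unique molecule at a locus independently carries the variant with probability equal to the minor allele frequency, sequencing error is ignored, and a variant is considered detected when at least `min_molecules` variant molecules are observed. The function name and parameters are hypothetical and are not taken from the disclosure.

```python
from math import comb


def detection_probability(depth: int, maf: float, min_molecules: int = 1) -> float:
    """Probability of observing at least min_molecules variant molecules.

    depth         -- number of unique molecules sequenced at the locus
    maf           -- minor allele frequency as a fraction (0.001 == 0.1%)
    min_molecules -- variant molecules required to make a call (assumed rule)
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be a fraction between 0 and 1")
    # Sum the binomial probabilities of seeing fewer than min_molecules
    # variant molecules, then take the complement.
    p_miss = sum(
        comb(depth, k) * maf ** k * (1.0 - maf) ** (depth - k)
        for k in range(min_molecules)
    )
    return 1.0 - p_miss


# Compare two depths for a variant present at 0.1% of molecules.
for depth in (1_000, 10_000):
    print(depth, round(detection_probability(depth, 0.001), 4))
```

Under this simplified model, detection at lower allele frequencies requires proportionally more unique molecules per region, which is consistent with the use of relatively small regions and deep sequencing described above.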
[0351] A genetic variant can be exhibited in a percentage of a population of subjects who have a disease (e.g., cancer). In some cases, at least 1%, 2%, 3%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of a population having the cancer exhibit one or more genetic variants in at least one of the regions in the panel. For example, at least 80% of a population having the cancer may exhibit one or more genetic variants in at least one of the regions in the panel.
[0352] The panel can comprise one or more regions from each of one or more genes. In some cases, the panel can comprise one or more regions from each of at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or 80 genes. In some cases, the panel can comprise one or more regions from each of at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or 80 genes. In some cases, the panel can comprise one or more regions from each of from about 1 to about 80, from 1 to about 50, from about 3 to about 40, from 5 to about 30, or from 10 to about 20 different genes.
[0353] The regions in the panel can be selected so that one or more epigenetically modified regions are detected. The one or more epigenetically modified regions can be acetylated, methylated, ubiquitylated, phosphorylated, sumoylated, ribosylated, and/or citrullinated. For example, the regions in the panel can be selected so that one or more methylated regions are detected.
[0354] The regions in the panel can be selected so that they comprise sequences differentially transcribed across one or more tissues. In some cases, the regions can comprise sequences transcribed in certain tissues at a higher level compared to other tissues. For example, the regions can comprise sequences transcribed in certain tissues but not in other tissues.
[0355] The regions in the panel can comprise coding and/or non-coding sequences. For example, the regions in the panel can comprise one or more sequences in exons, introns, promoters, 3’ untranslated regions, 5’ untranslated regions, regulatory elements, transcription start sites, and/or splice sites. In some cases, the regions in the panel can comprise other non-coding sequences, including pseudogenes, repeat sequences, transposons, viral elements, and telomeres. In some cases, the regions in the panel can comprise sequences in non-coding RNA, e.g., ribosomal RNA, transfer RNA, Piwi-interacting RNA, and microRNA.
[0356] The regions in the panel can be selected to detect (diagnose) a cancer with a desired level of sensitivity (e.g., through the detection of one or more genetic variants). For example, the regions in the panel can be selected to detect the cancer (e.g., through the detection of one or more genetic variants) with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The regions in the panel can be selected to detect the cancer with a sensitivity of 100%.
[0357] The regions in the panel can be selected to detect (diagnose) a cancer with a desired level of specificity (e.g., through the detection of one or more genetic variants). For example, the regions in the panel can be selected to detect cancer (e.g., through the detection of one or more genetic variants) with a specificity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The regions in the panel can be selected to detect the one or more genetic variants with a specificity of 100%.
[0358] The regions in the panel can be selected to detect (diagnose) a cancer with a desired positive predictive value. Positive predictive value can be increased by increasing sensitivity (e.g., chance of an actual positive being detected) and/or specificity (e.g., chance of not mistaking an actual negative for a positive). As a non-limiting example, regions in the panel can be selected to detect the one or more genetic variants with a positive predictive value of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The regions in the panel can be selected to detect the one or more genetic variants with a positive predictive value of 100%.
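The dependence of positive predictive value on sensitivity and specificity noted in paragraph [0358] follows from Bayes' rule. The sketch below also takes the prevalence of the disease in the tested population as an input, since positive predictive value cannot be computed without it; the prevalence value and the function name are illustrative assumptions, not values from the disclosure.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """PPV = P(disease | positive test), computed via Bayes' rule."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    total_positive_rate = true_positive_rate + false_positive_rate
    if total_positive_rate == 0.0:
        raise ValueError("no positive results are possible with these inputs")
    return true_positive_rate / total_positive_rate


# Raising specificity at a fixed sensitivity and an assumed 1% prevalence.
for spec in (0.95, 0.99, 0.999):
    print(spec, round(positive_predictive_value(0.85, spec, 0.01), 3))
```

At low prevalence, small gains in specificity change the positive predictive value far more than comparable gains in sensitivity, because false positives from the large unaffected population dominate the denominator.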
[0359] The regions in the panel can be selected to detect (diagnose) a cancer with a desired accuracy. As used herein, the term “accuracy” may refer to the ability of a test to discriminate between a disease condition (e.g., cancer) and health. Accuracy can be quantified using measures such as sensitivity and specificity, predictive values, likelihood ratios, the area under the ROC curve, Youden’s index and/or diagnostic odds ratio.
[0360] Accuracy may be presented as a percentage, which refers to a ratio between the number of tests giving a correct result and the total number of tests performed. The regions in the panel can be selected to detect cancer with an accuracy of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. The regions in the panel can be selected to detect cancer with an accuracy of 100%.
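Several of the accuracy measures named in paragraphs [0359] and [0360] can be computed directly from the four counts of a confusion matrix. The following sketch uses the standard definitions of these measures; the class name, function name, and example counts are hypothetical. The area under the ROC curve is omitted because it requires scores across multiple thresholds rather than a single set of counts.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # diseased subjects with a positive test
    fp: int  # healthy subjects with a positive test
    tn: int  # healthy subjects with a negative test
    fn: int  # diseased subjects with a negative test


def diagnostic_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Standard accuracy measures; assumes every count is non-zero."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return {
        # Correct results divided by all tests performed.
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_index": sensitivity + specificity - 1.0,
        "lr_positive": sensitivity / (1.0 - specificity),
        "lr_negative": (1.0 - sensitivity) / specificity,
        "diagnostic_odds_ratio": (c.tp * c.tn) / (c.fp * c.fn),
    }


# Hypothetical cohort: 100 subjects with cancer, 1000 without.
metrics = diagnostic_metrics(ConfusionCounts(tp=85, fp=10, tn=990, fn=15))
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```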
[0361] A panel may be selected such that when one or more regions or genes in the panel are removed, specificity is appreciably decreased. Removal of one region from the panel may result in a decrease in specificity of at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
[0362] A panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the specificity of the panel, e.g., does not increase the specificity by more than 1%, 2%, 5%, 10%, 15%, or 20%.
[0363] A panel may be of a size such that when one or more regions or genes in the panel are removed, this appreciably decreases sensitivity, e.g., sensitivity is decreased by at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
[0364] A panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the sensitivity of the panel, e.g., does not increase the sensitivity by more than 1%, 2%, 5%, 10%, 15%, or 20%.
[0365] A panel may be of a size such that when one or more regions or genes in the panel are removed, accuracy is appreciably decreased, e.g., accuracy is decreased by at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
[0366] A panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the accuracy of the panel, e.g., does not increase the accuracy by more than 1%, 2%, 5%, 10%, 15%, or 20%.
[0367] A panel may be of a size such that when one or more regions or genes in the panel are removed, positive predictive value is appreciably decreased, e.g., positive predictive value is decreased by at least 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or more.
[0368] A panel may be selected such that the addition of one or more regions or genes to the panel does not appreciably increase the positive predictive value of the panel, e.g., does not increase the positive predictive value by more than 1%, 2%, 5%, 10%, 15%, or 20%.
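The removal criteria in paragraphs [0361] through [0368] can be illustrated with a leave-one-out calculation, shown here for sensitivity only. This is a minimal sketch assuming that a subject counts as detected when a biomarker is observed in at least one panel region; the function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def panel_sensitivity(panel: List[str],
                      hits: Dict[str, Set[str]],
                      cases: Set[str]) -> float:
    """Fraction of subjects with cancer detected by at least one region."""
    detected: Set[str] = set()
    for region in panel:
        detected |= hits.get(region, set())
    return len(detected & cases) / len(cases)


def removal_effects(panel: List[str],
                    hits: Dict[str, Set[str]],
                    cases: Set[str]) -> Dict[str, float]:
    """Drop in sensitivity when each region is removed on its own."""
    baseline = panel_sensitivity(panel, hits, cases)
    return {
        region: baseline - panel_sensitivity(
            [r for r in panel if r != region], hits, cases)
        for region in panel
    }


cases = {f"s{i}" for i in range(10)}
hits = {
    "X": {"s0", "s1", "s2"},
    "Y": {"s3", "s4", "s5", "s6"},
    "Z": {"s6", "s7", "s8"},
}
for region, drop in removal_effects(["X", "Y", "Z"], hits, cases).items():
    print(region, round(drop, 2))
```

A region whose removal produces a drop below a chosen threshold would be a candidate for exclusion, while a candidate region whose addition raises the metric by less than a chosen threshold would not be added. The same pattern applies to specificity, accuracy, or positive predictive value by substituting the metric function.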
[0369] A panel may be selected to be highly sensitive and detect low frequency genetic variants. For instance, a panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Regions in a panel may be selected to detect a biomarker present at a frequency of 1% or less in a sample with a sensitivity of 70% or greater. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.1% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.01% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.001% with a sensitivity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0370] A panel may be selected to be highly specific and detect low frequency genetic variants. For instance, a panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at a specificity of at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Regions in a panel may be selected to detect a biomarker present at a frequency of 1% or less in a sample with a specificity of 70% or greater. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.1% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.01% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.001% with a specificity of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0371] A panel may be selected to be highly accurate and detect low frequency genetic variants. A panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may be detected at an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. Regions in a panel may be selected to detect a biomarker present at a frequency of 1% or less in a sample with an accuracy of 70% or greater. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.1% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.01% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%. A panel may be selected to detect a biomarker at a frequency in a sample as low as 0.001% with an accuracy of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0372] A panel may be selected to be highly predictive and detect low frequency genetic variants. A panel may be selected such that a genetic variant or biomarker present in a sample at a frequency as low as 0.01%, 0.05%, or 0.001% may have a positive predictive value of at least 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, or 99.9%.
[0373] The concentration of probes or baits used in the panel may be increased (2 to 6 ng/μL) to capture more nucleic acid molecules within a sample. The concentration of probes or baits used in the panel may be at least 2 ng/μL, 3 ng/μL, 4 ng/μL, 5 ng/μL, 6 ng/μL, or greater. The concentration of probes may be about 2 ng/μL to about 3 ng/μL, about 2 ng/μL to about 4 ng/μL, about 2 ng/μL to about 5 ng/μL, or about 2 ng/μL to about 6 ng/μL. The concentration of probes or baits used in the panel may be 2 ng/μL or more to 6 ng/μL or less. In some instances, this may allow for more molecules within a biological sample to be analyzed, thereby enabling lower frequency alleles to be detected.
Genetic Analysis
[0374] Genetic analysis includes detection of nucleotide sequence variants and copy number variations. Genetic variants can be determined by sequencing. The sequencing method can be massively parallel sequencing, that is, simultaneously (or in rapid succession) sequencing any of at least 100,000, 1 million, 10 million, 100 million, or 1 billion polynucleotide molecules. Sequencing methods may include, but are not limited to: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next-generation sequencing, Single Molecule Sequencing by Synthesis (SMSS)(Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert or Sanger sequencing, primer walking, sequencing using PacBio, SOLiD, Ion Torrent, or Nanopore platforms and any other sequencing methods known in the art.
[0375] Sequencing can be made more efficient by performing sequence capture, that is, the enrichment of a sample for target sequences of interest, e.g., sequences including the KRAS and/or EGFR genes or portions of them containing sequence variant biomarkers. Sequence capture can be performed using immobilized probes that hybridize to the targets of interest.
[0376] Cell free DNA can include small amounts of tumor DNA mixed with germline DNA. Sequencing methods that increase sensitivity and specificity of detecting tumor DNA, and, in particular, genetic sequence variants and copy number variation, can be useful in the methods of this invention. Such methods are described, for example, in WO 2014/039556. These methods not only can detect molecules with a sensitivity of up to or greater than 0.1%, but also can distinguish these signals from noise typical in current sequencing methods. Increases in sensitivity and specificity from blood-based samples of cfDNA can be achieved using various methods. One method includes high efficiency tagging of DNA molecules in the sample, e.g., tagging at least any of 50%, 75% or 90% of the polynucleotides in a sample. This increases the likelihood that a low-abundance target molecule in a sample will be tagged and subsequently sequenced, and significantly increases sensitivity of detection of target molecules.
[0377] Another method involves molecular tracking, which identifies sequence reads that have been redundantly generated from an original parent molecule, and assigns the most likely identity of a base at each locus or position in the parent molecule. This significantly increases specificity of detection by reducing noise generated by amplification and sequencing errors, which reduces frequency of false positives.
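The molecular tracking approach of paragraph [0377] can be sketched as grouping reads that share a molecular tag and taking a per-position majority vote. This minimal example assumes reads within a family are already aligned to the same start position and orientation, and it uses a simple majority vote in place of a probabilistic base-calling model; the function name and the tag and read values are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def consensus_by_tag(reads: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse reads sharing a molecular tag into one consensus sequence.

    reads -- (tag, sequence) pairs; reads with the same tag are assumed to
             derive from the same original parent molecule.
    """
    families: Dict[str, List[str]] = defaultdict(list)
    for tag, sequence in reads:
        families[tag].append(sequence)

    consensus: Dict[str, str] = {}
    for tag, sequences in families.items():
        # Only positions covered by every read in the family are called.
        length = min(len(s) for s in sequences)
        consensus[tag] = "".join(
            Counter(s[i] for s in sequences).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


reads = [
    ("AAT", "ACGT"),
    ("AAT", "ACGT"),
    ("AAT", "ACTT"),  # third base differs, e.g., an amplification error
    ("GGC", "TTGA"),
]
print(consensus_by_tag(reads))
```

Because the discordant base appears in only one of three reads in its family, it is outvoted and does not appear in the consensus, which illustrates how redundancy reduces false positives arising from amplification and sequencing errors.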
[0378] Methods of the present disclosure can be used to detect genetic variation in non-uniquely tagged initial starting genetic material (e.g., rare DNA) at a concentration that is less than 5%, 1%, 0.5%, 0.1%, 0.05%, or 0.01%, at a specificity of at least 99%, 99.9%, 99.99%, 99.999%, 99.9999%, or 99.99999%. Sequence reads of tagged polynucleotides can be subsequently tracked to generate consensus sequences for polynucleotides with an error rate of no more than 2%, 1%, 0.1%, or 0.01%.
[0379] Copy number variation determination can involve determining a quantitative measure of polynucleotides in a sample mapping to a genetic locus, such as the EGFR gene or KRAS gene. The quantitative measure can be a number. Once the total number of polynucleotides mapping to a locus is determined, this number can be used in standard methods of determining Copy Number Variation at the locus. A quantitative measure can be normalized against a standard. In one method, a quantitative measure at a test locus can be standardized against a quantitative measure of polynucleotides mapping to a control locus in the genome, such as a gene of known copy number. In another method, the quantitative measure can be compared against the amount of nucleic acid in the original sample. For example, the quantitative measure can be compared against an expected measure for diploidy. In another method, the quantitative measure can be normalized against a measure from a control sample, and normalized measures at different loci can be compared. In another method, quantifying involves quantifying parent or original molecules in a sample mapping to a locus, rather than number of sequence reads. A copy number variation may be an amplification or a deletion or truncation of a gene. An amplification may be 3, 4, 5, 6, 7, 8, 9, 10, or 10 or more copies of a gene. A deletion or truncation may be 0 or 1 copies of a gene.
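The control-locus normalization described in paragraph [0379] can be sketched as a ratio of molecule counts scaled by the known copy number of the control locus. The sketch assumes the counts are already corrected for locus length and capture efficiency and that the sample is treated as a single genome, so it ignores the dilution of tumor DNA by non-tumor DNA in cfDNA. The function names are hypothetical; the classification thresholds follow the copy counts stated above.

```python
def estimate_copy_number(test_count: float,
                         control_count: float,
                         control_copies: float = 2.0) -> float:
    """Scale the test/control molecule ratio by the control copy number."""
    if control_count <= 0:
        raise ValueError("control_count must be positive")
    return control_copies * test_count / control_count


def classify_copy_number(copies: float) -> str:
    """Label a locus using the copy counts given in the description."""
    nearest = round(copies)
    if nearest >= 3:
        return "amplification"
    if nearest <= 1:
        return "deletion_or_truncation"
    return "neutral"


# Hypothetical molecule counts at a test locus against a diploid control.
for test, control in ((300, 100), (100, 100), (50, 100)):
    copies = estimate_copy_number(test, control)
    print(test, control, copies, classify_copy_number(copies))
```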
[0380] An example of a method for detecting copy number variation may include an array. The array may comprise a plurality of capture probes. The capture probes can be oligonucleotides that are bound to the surface of the array. The capture probes may hybridize to at least one of the genes as set forth in Table 1. The capture probes may bind to at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 genes as set forth in Table 1. DNA derived from the subject may be labeled (e.g., with a fluorophore) prior to hybridization for detection.
[0381] In other examples, a gene of interest may be amplified using primers that recognize the gene of interest. The primers may hybridize to a gene upstream and/or downstream of a particular region of interest (e.g., upstream of a mutation site). A detection probe may be hybridized to the amplification product. Detection probes may specifically hybridize to a wild-type sequence or to a mutated/variant sequence. Detection probes may be labeled with a detectable label (e.g., with a fluorophore). Detection of a wild-type or mutant sequence may be performed by detecting the detectable label (e.g., fluorescence imaging). In examples of copy number variation, a gene of interest may be compared with a reference gene. Differences in copy number between the gene of interest and the reference gene may indicate amplification or deletion/truncation of a gene. Examples of platforms suitable to perform the methods described herein include digital PCR platforms such as, e.g., the Fluidigm Digital Array.
EXAMPLES
Example 1:
[0382] In one example, a method for monitoring for relapse of a tumor in an individual treated for cancer can include providing a cell-free DNA (cfDNA) sample from liquid or tissue. When applicable, the sample includes a mixture of circulating tumor DNA (ctDNA) and non-tumor DNA obtained from the individual at a first time point before initiating cancer treatment, which provides a test sample.
[0383] After detecting methylation states for a plurality of regions, including CpG sites, in the test samples, a comparison can be made to methylation states for the plurality of CpG sites in reference samples to identify differentially methylated regions (DMRs). The reference samples can include information in a database, including real-world evidence. Determining the methylation difference between the test genomic DNA and the reference genomic DNA, including at the CpG sites, can provide a normalized methylation difference. This may include weighting the normalized methylation difference based on coverage at each of the CpG sites, thereby determining an aggregate coverage-weighted normalized methylation difference score.
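One way the aggregate coverage-weighted normalized methylation difference score of paragraph [0383] might be computed is sketched below. The disclosure does not define the normalization, so this example makes two explicit assumptions: the per-site difference is normalized as a z-score against the mean and standard deviation of the reference samples, and the aggregate is the coverage-weighted mean of those z-scores. All names and values are hypothetical.

```python
from typing import NamedTuple, Sequence


class CpGSite(NamedTuple):
    test_fraction: float  # methylated fraction observed in the test sample
    ref_mean: float       # mean methylated fraction across reference samples
    ref_sd: float         # standard deviation across reference samples (> 0)
    coverage: int         # number of molecules covering the site in the test


def aggregate_methylation_score(sites: Sequence[CpGSite]) -> float:
    """Coverage-weighted mean of per-site normalized differences.

    Normalization here is an assumed z-score against the reference;
    other normalizations could be substituted.
    """
    total_coverage = sum(site.coverage for site in sites)
    if total_coverage == 0:
        raise ValueError("no covered CpG sites")
    weighted_sum = sum(
        site.coverage * (site.test_fraction - site.ref_mean) / site.ref_sd
        for site in sites
    )
    return weighted_sum / total_coverage


sites = [
    CpGSite(test_fraction=0.8, ref_mean=0.2, ref_sd=0.1, coverage=100),
    CpGSite(test_fraction=0.3, ref_mean=0.2, ref_sd=0.1, coverage=50),
]
print(round(aggregate_methylation_score(sites), 3))
```

Weighting by coverage gives well-sampled sites more influence on the score than sites observed in only a few molecules, where the methylated fraction is a noisier estimate.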
[0384] After treatment, including resection, additional test samples from the treated individual are collected at second, third, etc. later time points, and the detection and comparison steps are performed on each. By determining changes that have occurred in the DMRs between the test genomic DNA and the second, third, etc. test genomic DNA, and in comparison to their respective reference samples in the database, including real-world evidence, changes indicative of recurrence in the individual can be identified. Thereafter, the individual is treated with a therapy to reduce the relapse, or the frequency of surveillance is adjusted. It is readily appreciated that sample testing can include one or more biological molecules such as genomic DNA, RNA, peptides, proteins, etc. in addition to epigenomic DNA.
Example 2:
[0385] A sample of tumor tissue is analyzed for one or more of genomic, epigenomic, or gene expression to establish a profile of the tumor tissue including testing of cfDNA, including ctDNA on a methylome panel.
[0386] Thereafter, testing for minimal residual disease is performed in accordance with the methods described herein after resection. Based on the original biopsy, recurrence outcomes generated from testing and real-world evidence are utilized as predictive of likelihood of recurrence.
Example 3:
[0387] A sample of tumor tissue is analyzed, as described, to establish a profile of the tumor tissue, including testing of cfDNA, including ctDNA, on a methylome panel. One then tests for minimal residual disease, including after resection. Based on the original tumor tissue profile, one can then generate recurrence outcomes utilizing both the aforementioned testing and real-world evidence. In some embodiments, therapeutic intervention is modified based on testing results and likelihood of recurrence.
Example 4:
[0388] Similarly, liquid-informed minimal residual disease testing includes testing of cfDNA, including ctDNA, on a methylome panel. After differentially methylated regions are identified, they can be utilized as prior information for post-operation recurrence definition. This would improve MRD detection without the logistics associated with tissue testing.
[0389] While embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the disclosure be limited by the specific examples provided within the specification. While the disclosure has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure. Furthermore, it shall be understood that all aspects of the disclosure are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is therefore contemplated that the disclosure shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

THE CLAIMS
1. A method, comprising: determining a state of biological molecules obtained from a sample derived from a human subject; testing for minimal residual disease (MRD); determining the likelihood of recurrence based on the MRD test; and generating a schedule for one or more additional MRD tests based on the determination of the likelihood of recurrence.
2. The method of claim 1, wherein the biological molecules are one or more of: DNA, methylated DNA, RNA, methylated RNA, proteins, and peptides.
3. The method of claim 1, wherein testing for MRD comprises combining a plurality of nucleic acid molecules derived from a subject with a solution including an amount of methyl binding domain (MBD) proteins to produce a nucleic acid-MBD protein solution; and performing a plurality of washes of the nucleic acid-MBD protein solution with a salt solution to produce a number of nucleic acid fractions, individual nucleic acid fractions having a threshold number of methylated cytosines in regions of the plurality of nucleic acids having at least the threshold cytosine-guanine content.
4. The method of claim 3, wherein a wash of the plurality of washes is performed with a solution having a concentration of sodium chloride (NaCl) and produces a nucleic acid fraction of the number of nucleic acid fractions having a range of binding strengths to MBD proteins.
5. The method of claim 3, comprising: determining that a first nucleic acid fraction is associated with a first partition of a plurality of partitions of nucleic acids, the first partition corresponding to a first range of binding strengths to MBD proteins; attaching a first molecular barcode to nucleic acids of the first nucleic acid fraction, the first molecular barcode being included in a first set of molecular barcodes associated with the first partition; determining that a second nucleic acid fraction is associated with a second partition of the plurality of partitions of nucleic acids, the second partition corresponding to a second range of binding energies to MBD proteins different from the first range of binding strengths to MBD proteins; and attaching a second molecular barcode to nucleic acids of the second nucleic acid fraction, the second molecular barcode being included in a second set of molecular barcodes associated with the second partition.
6. The method of claim 3, comprising: combining at least a portion of the number of nucleic acid fractions with an amount of restriction enzyme that cleaves molecules with one or more unmethylated cytosines to produce at least a portion of the plurality of samples used to produce the sequencing reads; wherein the threshold amount of methylated cytosines corresponds to a minimum frequency of methylated cytosines within a region having at least the threshold cytosine-guanine content.
7. The method of claim 3, comprising: combining at least a portion of the number of nucleic acid fractions with an amount of a restriction enzyme that cleaves molecules with one or more methylated cytosines to produce at least a portion of the plurality of samples used to produce the sequencing reads; wherein the threshold amount of unmethylated cytosines corresponds to a maximum frequency of methylated cytosines that are not cleaved within a region having at least the threshold cytosine-guanine content.
8. The method of claim 1, wherein testing for MRD comprises sequencing nucleic acid molecules derived from a sample obtained from a subject; analyzing sequence reads derived from the sequencing to identify one or more driver mutations in the nucleic acid molecules; and using information about the presence, absence, or amount of the one or more driver mutations in the nucleic acid molecules to identify a tumor in the subject.
9. The method of any one of claims 3-8, wherein the nucleic acid molecules comprise cell-free DNA.
10. The method of any of the preceding claims, wherein the sample is at least one of blood, serum, plasma or tissue.
11. The method of any of the preceding claims, comprising determination of treatment for the subject.
12. The method of any of the preceding claims, wherein a limit of detection for the model to determine tumor fraction of samples is no greater than 0.05%.
13. The method of any of the preceding claims, wherein the one or more driver mutations comprises a somatic variant detected at a mutant allele frequency (MAF) of no more than 0.05%.
14. The method of any of the preceding claims, wherein the one or more driver mutations comprises a fusion detected at a mutant allele frequency (MAF) of no more than 0.1%.
15. The method of any of the preceding claims, further comprising detecting mutation distributions for each of one or more driver mutations, wherein the mutation distribution for each of the one or more driver mutations is detected with a correlation of at least 0.99 to a mutation distribution of the driver mutation detected in a cohort of the subject by tissue genotyping.
16. The method of any of the preceding claims, wherein the method detects the tumor in the subject with a sensitivity of at least 85%, a specificity of at least 99%, and a diagnostic accuracy of at least 99%.
17. The method of any of the preceding claims, comprising identifying circulating tumor DNA (ctDNA) and one or more driver mutations in the ctDNA.
18. A method comprising: obtaining, by a computing system having one or more hardware processors and memory, testing sequence data from a subject, the testing sequence data including testing sequencing reads derived from a sample of the subject; analyzing, by the computing system, the testing sequencing reads to determine a first quantitative measure derived from the testing sequencing reads that correspond to genomic regions of a reference genome; analyzing, by the computing system, the testing sequencing reads to determine a second quantitative measure derived from the testing sequencing reads that correspond to genomic regions of the reference genome; determining, by the computing system, a metric based on the first quantitative measure and the second quantitative measure; generating, by the computing system, an input vector that includes the metric; and determining, by the computing system, an indication of cancer status in the subject by providing the input vector to a model that implements one or more machine learning techniques to generate indications of cancer status in subjects, the model including weights for individual classification regions of a plurality of classification regions and at least a portion of the weights of the individual classification regions being different from one another.
19. The method of claim 18, wherein the individual testing sequencing reads include a nucleotide sequence corresponding to a fragment of a nucleic acid included in the sample and the individual testing sequencing reads correspond to molecules having a threshold amount of methylated cytosines included in regions of the nucleotide sequence having at least the threshold cytosine-guanine content; the first quantitative measure is derived from the testing sequencing reads that correspond to individual classification regions of a plurality of classification regions, at least a portion of the individual classification regions of the plurality of classification regions corresponding to genomic regions of a reference genome that have the threshold amount of methylated cytosines in subjects in which cancer is detected and that have at least the threshold cytosine-guanine content; and the second quantitative measure is derived from the testing sequencing reads that correspond to individual control regions of a plurality of control regions, individual control regions of the plurality of control regions corresponding to additional genomic regions of the reference genome that have at least the threshold cytosine-guanine content and that have at least the threshold amount of methylated cytosines in subjects in which cancer is detected and in additional subjects in which cancer is not detected.
20. The method of claim 18 or 19, comprising: obtaining, by the computing system having one or more hardware processors and memory, training sequence data including training sequencing reads derived from a plurality of samples of a plurality of training subjects, individual training sequencing reads including a nucleotide sequence corresponding to a fragment of a nucleic acid included in a sample of the plurality of samples and individual training sequencing reads corresponding to molecules having a threshold amount of methylated cytosines included in regions of the nucleotide sequence having at least a threshold cytosine-guanine content; analyzing, by the computing system, the training sequencing reads to determine an additional first quantitative measure derived from the training sequencing reads that correspond to individual classification regions of the plurality of classification regions; analyzing, by the computing system, the training sequencing reads to determine an additional second quantitative measure derived from the training sequencing reads that correspond to a plurality of control regions; determining, by the computing system, an additional metric for the individual classification regions of the plurality of classification regions based on the additional first quantitative measure for the individual classification regions and the additional second quantitative measure for the plurality of control regions; generating, by the computing system, training data that includes the additional metric for the individual classification regions of the plurality of classification regions for the training sequencing reads from samples of the plurality of training subjects; and implementing, by the computing system and using the training data, one or more machine learning algorithms to generate the model to determine the indications of cancer status in subjects based on amounts of methylated cytosines in at least a portion of the plurality of classification regions.
21. The method of claim 20, wherein one or more machine learning algorithms include one or more classification algorithms.
22. The method of claim 20, wherein the one or more machine learning algorithms include one or more regression algorithms; and the indication corresponds to an estimate of tumor fraction of the sample.
23. The method of claim 18, wherein the training sequencing reads comprise a first portion of the training sequence data and additional training sequencing reads comprise a second portion of the training sequence data, wherein the additional training sequencing reads are different from the training sequencing reads; and the method comprising: analyzing, by the computing system, at least one of the first portion of the training sequence data or the second portion of the training sequence data to determine an individual frequency of a plurality of variants present in an individual sample of the plurality of samples; determining, by the computing system and for the individual sample, a variant of the plurality of variants having a maximum frequency that corresponds to the individual frequency having a greatest value among individual frequencies derived from an individual sample; and determining, by the computing system, individual measures of tumor fraction for an individual sample based on the greatest value of the individual frequencies derived from the individual sample.
24. The method of claim 20, wherein the training data includes the individual measures of tumor fraction for the individual samples of the plurality of samples; and the model is generated based on the individual measures of tumor fraction for the individual samples of the plurality of samples.
25. The method of claim 1, comprising: generating, by a computing system including processing circuitry and memory, a data file including first tokens generated using a first hash function, individual first tokens corresponding to a respective individual of a group of individuals having data stored by a molecular data repository; sending, by the computing system, the data file to a health insurance claims data management system; obtaining, by the computing system and from the health insurance claims data management system, in response to the data file, health data corresponding to the group of individuals; generating, by the computing system, a number of identifiers using a second hash function that is different from the first hash function, each identifier corresponding to one or more tokens related to each individual of the group of individuals; obtaining, by the computing system and using the number of identifiers, second data from the molecular data repository for the group of individuals; determining, by the computing system, respective portions of the first data that correspond to respective portions of the second data for the group of individuals; and generating, by the computing system, an integrated data repository that stores the respective portions of the first data and the respective portions of the second data in relation to respective identifiers of the number of identifiers.
26. The method of claim 25, comprising: determining, by the computing system, a first set of data processing instructions that are executable in relation to first data stored by the integrated data repository; causing, by the computing system, the first set of data processing instructions to be executed to analyze first health insurance claims codes included in the first data to determine a first subset of the group of individuals in which a biological condition is present; and generating, by the computing system, a first dataset indicating the subset of the group of individuals in which the biological condition is present.
27. The method of claim 26, comprising: determining, by the computing system, a second set of data processing instructions that are executable in relation to second data stored by the integrated data repository; causing, by the computing system, the second set of data processing instructions to be executed to analyze the second health insurance claims codes included in the second data to determine one or more treatments provided to a second subset of the group of individuals; and generating, by the computing system, a second dataset indicating the one or more treatments provided to the second subset of the group of individuals.
28. The method of claim 27, comprising: determining, by the computing system, a third subset of the group of individuals that includes a portion of the first subset of the group of individuals that overlaps with a portion of the second subset of the group of individuals; receiving, by the computing system, a request to perform an analysis of the first dataset and the second dataset in relation to the third subset of the group of individuals; and analyzing, by the computing system and in response to the request, the first dataset and the second dataset with respect to the third subset of the group of individuals to determine a measure of significance of a characteristic of the third subset of the group of individuals with respect to the biological condition.
29. The method of claim 28, comprising: determining, by the computing system, one or more genomic mutations present in the third subset of the group of individuals; determining, by the computing system, a plurality of treatments provided to the third subset of the group of individuals; and determining, by the computing system, respective survival rates for the third subset of the group of individuals.
30. The method of claim 29, wherein the measure of significance corresponds to survival rate with respect to a treatment of the plurality of treatments and a genomic mutation of the one or more genomic mutations.
31. The method of claim 30, comprising determining, by the computing system and based on the measure of significance, an effectiveness of the treatment for the third subset of the group of individuals.
32. The method of claim 31, comprising determining, by the computing system, individuals in the third subset of the group of individuals that have not received the treatment.
33. The method of claim 32, comprising administering one or more therapeutically effective amounts of the treatment to the individuals in the third subset that have not received the treatment.
34. The method of any one of claims 25-33, wherein: the integrated data repository is arranged according to a data repository schema that includes a plurality of data tables and a plurality of logical links between the plurality of data tables; individual logical links of the plurality of logical links indicating one or more rows of a data table of the plurality of data tables that correspond to one or more additional rows of an additional data table of the plurality of data tables.
35. The method of claim 34, wherein the plurality of data tables include: a first data table that stores genomics data of the group of individuals; a second data table that stores data related to one or more patient visits by individuals to one or more healthcare providers; a third data table that stores information corresponding to respective services provided to individuals with respect to one or more patient visits to one or more healthcare providers indicated by the second data table; a fourth data table that stores personal information of the group of individuals; a fifth data table that stores information related to a health insurance company or governmental entity that made payment for services provided to the group of individuals; a sixth data table storing information corresponding to health insurance coverage information for the group of individuals; and a seventh data table that stores information related to pharmaceutical treatments obtained by the group of individuals.
36. The method of any one of claims 25-35, wherein the number of identifiers generated using the second hash function comprise intermediate identifiers; and the method comprises: applying, by the computing system, a salt function to the intermediate identifiers to generate a final set of identifiers.
37. The method of any one of claims 25-36, comprising: obtaining, by the computing system, information from an additional data repository that includes electronic medical records of an additional group of individuals; determining, by the computing system, a subset of the additional group of individuals that corresponds to the group of individuals having data stored by the molecular data repository; and modifying, by the computing system, the integrated data repository to store at least a portion of the information of the medical records of the subset of the additional group of individuals in relation to the number of identifiers.
38. The method of claim 37, comprising: performing, by the computing system, one or more optical character recognition operations with respect to the additional information; analyzing, by the computing system, the additional information obtained from the additional data repository to determine one or more portions of the additional information to remove to produce a corpus of information.
39. The method of claim 38, comprising: analyzing, by the computing system, the corpus of information to determine a portion of the subset of the additional group of individuals that correspond to one or more biomarkers; and generating, by the computing system, one or more data structures that store identifiers of the portion of the subset of the additional group of individuals and that store an indication that the portion of the subset of the additional group of individuals corresponds to the one or more biomarkers.
40. The method of claim 39, comprising: storing, by the computing system, the one or more data structures in an intermediate data repository; performing, by the computing system, one or more de-identification operations with respect to the identifiers of the portion of the subset of the additional group of individuals before modifying the integrated data repository to store at least a portion of the additional information of the medical records of the portion of the subset of the additional group of individuals in relation to the number of identifiers.
41. The method of any one of claims 25-40, wherein the molecular data repository stores at least one or more of genomic information, genetic information, metabolomic information, transcriptomic information, fragmentomic information, immune receptor information, methylation information, epigenomic information, or proteomic information.
42. The method of any preceding claim, wherein determining the likelihood of recurrence comprises an MRD test, real world evidence (RWE), or both.
43. A system configured to perform the method of any of the preceding claims.
44. A computer readable medium comprising the method of any of the preceding claims.
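As an illustration of the computation recited in claim 18, the following minimal Python sketch derives one metric per classification region from two quantitative measures and feeds the resulting input vector to a model with per-region weights. Several choices here are assumptions that the claims do not specify: the measures are taken to be read counts, the metric is taken to be a ratio of classification-region counts to total control-region counts, and the model is taken to be a logistic function. The function names `region_metrics` and `cancer_score` are hypothetical.

```python
import math
from typing import List, Sequence


def region_metrics(classification_counts: Sequence[float],
                   control_counts: Sequence[float],
                   pseudocount: float = 1.0) -> List[float]:
    """Normalize each classification-region count by the total control-region count.

    Assumption: a simple ratio is used as the metric; the claims only require
    that the metric be based on both quantitative measures.
    """
    control_total = sum(control_counts) + pseudocount
    return [(count + pseudocount) / control_total for count in classification_counts]


def cancer_score(metrics: Sequence[float],
                 weights: Sequence[float],
                 bias: float = 0.0) -> float:
    """Return a score in (0, 1) from a weighted sum of the per-region metrics.

    Assumption: a logistic model stands in for the unspecified machine
    learning model; each classification region has its own weight.
    """
    if len(metrics) != len(weights):
        raise ValueError("one weight is required per classification region")
    z = bias + sum(w * m for w, m in zip(weights, metrics))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical read counts for three classification regions and two control regions.
    vector = region_metrics([12, 0, 30], [150, 250])
    print(cancer_score(vector, weights=[4.0, 1.5, 6.0], bias=-1.0))
```

In this sketch the weights differ from region to region, mirroring the claim language that at least a portion of the weights differ from one another; in practice the weights would come from training as described in claim 20.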
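Claim 23 describes selecting, for each sample, the variant with the greatest frequency and basing a tumor fraction measure on that value. A minimal Python sketch of that selection follows. It assumes the variant frequencies are already available as a mapping from a variant label to its allele frequency, and it uses the maximum frequency directly as the tumor fraction estimate, which is only one possible reading of "based on". The function name and the variant labels are hypothetical.

```python
from typing import Dict, Tuple


def max_frequency_tumor_fraction(variant_frequencies: Dict[str, float]) -> Tuple[str, float]:
    """Return the variant with the greatest frequency and that frequency.

    Assumption: the greatest frequency is used as-is as the per-sample
    tumor fraction measure; a real pipeline might rescale or correct it.
    """
    if not variant_frequencies:
        raise ValueError("at least one variant frequency is required")
    variant = max(variant_frequencies, key=variant_frequencies.get)
    return variant, variant_frequencies[variant]


if __name__ == "__main__":
    # Hypothetical per-sample frequencies keyed by an illustrative variant label.
    sample = {"variant_a": 0.012, "variant_b": 0.034, "variant_c": 0.002}
    print(max_frequency_tumor_fraction(sample))
```

Such per-sample values could then serve as training labels for a regression model of the kind mentioned in claims 22 and 24.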
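The tokenization and linkage steps of claims 25 and 36 can be sketched as follows. This is a simplified illustration under stated assumptions, not the applicant's implementation: SHA-256 stands in for the first hash function, SHA-512 for the second hash function, and the "salt function" is modeled as hashing a secret salt together with the intermediate identifier. The record fields, code labels, and function names are hypothetical.

```python
import hashlib
from typing import Any, Dict


def make_token(name: str, birth_date: str) -> str:
    """First hash (assumed SHA-256) over normalized personal fields."""
    normalized = f"{name.strip().lower()}|{birth_date}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def make_identifier(token: str, salt: bytes) -> str:
    """Second hash (assumed SHA-512) gives an intermediate identifier,
    which is then salted and hashed again to give the final identifier."""
    intermediate = hashlib.sha512(token.encode("utf-8")).hexdigest()
    return hashlib.sha256(salt + intermediate.encode("utf-8")).hexdigest()


def integrate(claims_by_token: Dict[str, Any],
              molecular_by_identifier: Dict[str, Any],
              salt: bytes) -> Dict[str, Dict[str, Any]]:
    """Store claims data and molecular data side by side, keyed by identifier."""
    integrated: Dict[str, Dict[str, Any]] = {}
    for token, claims in claims_by_token.items():
        identifier = make_identifier(token, salt)
        molecular = molecular_by_identifier.get(identifier)
        if molecular is not None:  # keep only individuals present in both sources
            integrated[identifier] = {"claims": claims, "molecular": molecular}
    return integrated


if __name__ == "__main__":
    salt = b"example-salt"  # hypothetical secret value
    token = make_token("Jane Doe", "1970-01-01")
    identifier = make_identifier(token, salt)
    print(integrate({token: ["DX_A"]}, {identifier: {"gene_x": "mutated"}}, salt))
```

Keeping the outward-facing token and the internal identifier on different hash functions means the claims data provider never sees the key used inside the integrated repository.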
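Claims 26 to 28 describe deriving one subset of individuals from diagnosis-related claims codes, a second subset from treatment-related claims codes, and a third subset where the two overlap. The Python sketch below shows that set logic with plain dictionaries. The code values (`DX_A`, `TX_1`) are placeholders rather than real claims codes, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def individuals_with_codes(codes_by_id: Dict[str, List[str]],
                           targets: Iterable[str]) -> Set[str]:
    """Return identifiers of individuals having at least one of the target codes."""
    wanted = set(targets)
    return {pid for pid, codes in codes_by_id.items() if wanted.intersection(codes)}


def overlapping_cohort(diagnosis_codes: Dict[str, List[str]],
                       treatment_codes: Dict[str, List[str]],
                       condition_targets: Iterable[str],
                       therapy_targets: Iterable[str]) -> Set[str]:
    """Intersect the condition-positive subset with the treated subset."""
    first_subset = individuals_with_codes(diagnosis_codes, condition_targets)
    second_subset = individuals_with_codes(treatment_codes, therapy_targets)
    return first_subset & second_subset


if __name__ == "__main__":
    # Placeholder codes; real claims data would use standardized code sets.
    diagnoses = {"id1": ["DX_A"], "id2": ["DX_B"], "id3": ["DX_A"]}
    treatments = {"id1": ["TX_1"], "id2": ["TX_1"]}
    print(overlapping_cohort(diagnoses, treatments, ["DX_A"], ["TX_1"]))
```

A downstream analysis, such as the survival comparison of claims 29 and 30, would then be restricted to the identifiers in the returned set.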
PCT/US2024/036222 2023-06-29 2024-06-28 Methods for determining surveillance and therapy for diseases Pending WO2025007034A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363511082P 2023-06-29 2023-06-29
US63/511,082 2023-06-29

Publications (1)

Publication Number Publication Date
WO2025007034A1 (en) 2025-01-02

Family

ID=91966232

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/036222 Pending WO2025007034A1 (en) 2023-06-29 2024-06-28 Methods for determining surveillance and therapy for diseases

Country Status (1)

Country Link
WO (1) WO2025007034A1 (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8486630B2 (en) 2008-11-07 2013-07-16 Industrial Technology Research Institute Methods for accurate sequence data and modified base position determination
WO2014039556A1 (en) 2012-09-04 2014-03-13 Guardant Health, Inc. Systems and methods to detect rare mutations and copy number variation
US20170213008A1 (en) * 2016-01-22 2017-07-27 Grail, Inc. Variant based disease diagnostics and tracking
WO2018119452A2 (en) 2016-12-22 2018-06-28 Guardant Health, Inc. Methods and systems for analyzing nucleic acid molecules
WO2020160414A1 (en) 2019-01-31 2020-08-06 Guardant Health, Inc. Compositions and methods for isolating cell-free dna
AU2019251504A1 (en) * 2018-04-14 2020-08-13 Natera, Inc. Methods for cancer detection and monitoring by means of personalized detection of circulating tumor DNA
US20210025011A1 (en) * 2018-04-02 2021-01-28 GRAIL, Inc Methylation markers and targeted methylation probe panel
US20210125683A1 (en) * 2017-09-15 2021-04-29 The Regents Of The University Of California Detecting somatic single nucleotide variants from cell-free nucleic acid with application to minimal residual disease monitoring
US20220017891A1 (en) * 2018-11-23 2022-01-20 Cancer Research Technology Limited Improvements in variant detection
US20220290252A1 (en) * 2019-08-27 2022-09-15 Belgian Volition Srl Method of isolating circulating nucleosomes
CN115786459A (en) * 2022-11-10 2023-03-14 江苏先声医疗器械有限公司 Method for detecting solid tumor minimal residual disease by high-throughput sequencing

Non-Patent Citations (18)

* Cited by examiner, † Cited by third party
Title
BOCK ET AL., NAT BIOTECH, vol. 28, 2010, pages 1106 - 1114
BROWN: "Genomes", 2002, JOHN WILEY & SONS, INC., article "Mutation, Repair, and Recombination"
CORONEL: "Database Systems: Design, Implementation, & Management, Cengage Learning", 2014
EHRLICH, EPIGENOMICS, vol. 1, 2009, pages 239 - 259
ELMASRI: "Fundamentals of Database Systems", 2010
GREER ET AL., CELL, vol. 161, 2015, pages 868 - 878
HON ET AL., GENOME RES., vol. 22, 2012, pages 246 - 258
IURLARO ET AL., GENOME BIOL, vol. 14, 2013, pages R119
KANG ET AL., GENOME BIOL, vol. 18, 2017, pages 53
KANG ET AL., GENOME BIOLOGY, vol. 18, 2017, pages 53
KUMAR ET AL., FRONTIERS GENET, vol. 9, 2018, pages 640
KUROSE: "Computer Networking: A Top-Down Approach", 2016
MOSS ET AL., NAT COMMUN, vol. 9, 2018, pages 5068
PETERSON: "Cloud Computing Architected: Solution Design Handbook", 2011, RECURSIVE PRESS
SONG ET AL., NAT BIOTECH, vol. 29, 2011, pages 68 - 72
SUN ET AL., BIOESSAYS, vol. 37, 2015, pages 1155 - 62
TUCKER: "Programming Languages", 2006, MCGRAW-HILL SCIENCE/ENGINEERING/MATH
VAISVILA R ET AL.: "EM-seq: Detection of DNA methylation at single base resolution from picograms of DNA", BIORXIV, 2019

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 24746515; Country of ref document: EP; Kind code of ref document: A1