HK40058434A - Cell-free dna end characteristics - Google Patents
- Publication number: HK40058434A (HK62022047071.5A)
- Authority: HK (Hong Kong)
- Prior art keywords
- dna
- sequence
- motifs
- dna fragments
- clinically relevant
Description
Cross Reference to Related Applications
This application is a PCT application and claims the benefit of U.S. provisional patent application No. 62/782,316, entitled "CELL-FREE DNA END CHARACTERISTICS," filed December 19, 2018, which is incorporated herein by reference in its entirety for all purposes.
Background
Plasma DNA is thought to consist of cell-free DNA shed from various tissues in the body including, but not limited to, hematopoietic tissues, brain, liver, lung, colon, pancreas, etc. (Sun et al, Proc Natl Acad Sci USA. 2015;112:E5503-12; Lehmann-Werman et al, Proc Natl Acad Sci USA. 2016;113:E1826-34; Moss et al, Nat Commun. 2018;9:5068). Plasma DNA molecules (a type of cell-free DNA molecule) have been shown to be generated by non-random processes; e.g., their size distribution shows a main peak at 166 bp and a 10-bp periodicity in the smaller peaks (Lo et al, Sci Transl Med. 2010;2:61ra91; Jiang et al, Proc Natl Acad Sci USA. 2015;112:E1317-25).
Recently, it has been reported that a subset of positions in the human genome (e.g., positions in a reference genome) are preferentially cleaved, generating plasma DNA fragments whose end positions are related to the tissue of origin (Chan et al, Proc Natl Acad Sci USA. 2016;113:E8159-8168; Jiang et al, Proc Natl Acad Sci USA. 2018; doi: 10.1073/pnas.1814616115). Chandrananda et al (BMC Med Genomics. 2015;8:29) used the de novo motif discovery software DREME (Bailey, Bioinformatics. 2011;27:1653-9) to mine cell-free DNA data for motifs associated with nuclease cleavage, without regard to tissue type.
Disclosure of Invention
The present disclosure describes techniques for measuring the amount (e.g., relative frequency) of sequence end motifs of cell-free DNA fragments in a biological sample of an organism to measure a characteristic of the sample (e.g., concentration fraction of clinically relevant DNA) and/or to determine a condition of the organism based on such measurements. Different tissue types exhibit different patterns of relative frequencies of sequence end motifs. The present disclosure provides various uses for measuring the relative frequency of sequence end motifs of cell-free DNA, for example, in a mixture of cell-free DNA from various tissues. DNA from one of such tissues may be referred to as clinically relevant DNA.
Various examples can quantify the amount of sequence motifs (terminal motifs) representing the terminal sequences of DNA fragments. For example, embodiments can determine the relative frequencies of a set of sequence motifs for the terminal sequences of DNA fragments. In various embodiments, preferred sets of terminal sequences and/or terminal motif patterns can be determined using genotypic methods (e.g., using tissue-specific alleles) or phenotypic methods (e.g., using samples having the same condition). The preferred set, or relative frequencies having a particular pattern, can be used to measure a characteristic of a new sample (e.g., concentration fraction of clinically relevant DNA) or to classify a condition of the organism (e.g., gestational age of a fetus or a pathological level). Thus, embodiments may provide measurements that inform of physiological changes, including cancer, autoimmune disease, transplantation, and pregnancy.
As a further example, sequence end motifs can be used for physical and/or in silico enrichment of clinically relevant cell-free DNA fragments in biological samples. Enrichment may use sequence end motifs that are preferred for clinically relevant tissues (e.g., fetal, tumor, or transplanted tissue). Physical enrichment can employ one or more probe molecules that detect a specific set of sequence end motifs, allowing the biological sample to be enriched for clinically relevant DNA fragments. For in silico enrichment, a set of sequence reads of cell-free DNA fragments can be identified, the fragments having an end sequence from a set of preferred end sequences of clinically relevant DNA. Certain sequence reads can be stored based on their likelihood of corresponding to clinically relevant DNA, wherein the likelihood accounts for whether the sequence read includes a preferred sequence end motif. The stored sequence reads can be analyzed to determine characteristics of clinically relevant DNA in the biological sample.
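The in silico enrichment described above can be sketched as follows. This is only an illustration under stated assumptions: each read is taken to be a plain string whose first bases are the fragment end, the function name enrich_reads is hypothetical, the motif set is a placeholder, and a simple yes/no match stands in for the likelihood mentioned in the text.

```python
def enrich_reads(reads, preferred_motifs, k=4):
    """Keep reads whose first k bases match a preferred end motif.

    Assumption: a hard match/no-match filter replaces the likelihood
    described in the text; a real pipeline could weight reads instead.
    """
    preferred = {m.upper() for m in preferred_motifs}
    return [r for r in reads if len(r) >= k and r[:k].upper() in preferred]


# Hypothetical reads and motif set, for illustration only.
reads = ["CCCAGTTA", "TTGACCGA", "cccatgca", "ACG"]
kept = enrich_reads(reads, {"CCCA"})
```

The retained reads can then be passed to whatever downstream analysis is applied to the clinically relevant DNA.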
These and other embodiments of the present disclosure are described in detail below. For example, other embodiments are directed to systems, devices, and computer-readable media associated with the methods described herein.
The nature and advantages of embodiments of the present disclosure may be better understood with reference to the following detailed description and accompanying drawings.
Drawings
FIG. 1 shows an example of a terminal motif according to an embodiment of the present disclosure.
Fig. 2 shows a schematic of a genotype-based method for analyzing differential end motif patterns between fetal and maternal DNA molecules, according to an embodiment of the present disclosure.
Fig. 3 shows a bar graph of the frequencies of terminal motifs between fetal and maternal DNA molecules according to an embodiment of the present disclosure.
Fig. 4 shows the top 10 terminal motifs from Fig. 3 for fetal and shared (i.e., fetal plus maternal) sequences according to an embodiment of the present disclosure.
Fig. 5A and 5B show box plots of entropy between fetal DNA molecules and maternal DNA molecules in a pregnant woman according to an embodiment of the present invention.
Fig. 6A and 6B illustrate hierarchical clustering analysis of fetal and maternal DNA molecules according to embodiments of the present disclosure.
Fig. 7A and 7B show entropy distributions across different trimesters using all motifs for pregnant women, according to embodiments of the present disclosure. Fig. 7C and 7D show entropy distributions across different trimesters using 10 motifs for pregnant women, according to embodiments of the present disclosure.
Fig. 8A shows the entropy of all fragments across different gestational ages. The entropy of plasma DNA fragments in subjects in the 3rd trimester was found to be lower (p-value 0.06) than the entropy of plasma DNA fragments in subjects in the 1st and 2nd trimesters. Fig. 8B shows the entropy of fragments derived from the Y chromosome across different gestational ages. The entropy of Y chromosome-derived fragments in subjects in the 3rd trimester was found to be lower (p-value = 0.01) than the entropy of Y chromosome-derived fragments in subjects in the 1st and 2nd trimesters.
Fig. 9 and 10 show the distribution of the top 10-ranked terminal motifs between fetal and maternal DNA molecules across different trimesters according to embodiments of the present disclosure.
Fig. 11 shows the combined frequency of the top 10-ranked motifs between fetal molecules and shared molecules across different trimesters according to an embodiment of the present disclosure.
Figure 12 shows a schematic of a genotype-based method for analyzing differential end motif patterns between mutant and shared molecules in plasma DNA of a cancer patient, according to an embodiment of the present disclosure.
Figure 13 shows a landscape of plasma DNA terminal motifs of cancer-associated mutant and shared molecules in hepatocellular carcinoma according to embodiments of the present disclosure.
Figure 14 shows a radial landscape of plasma DNA terminal motifs of cancer-associated mutant and shared molecules in hepatocellular carcinoma, in accordance with embodiments of the present disclosure.
Figure 15A shows the top 10 terminal motifs ranked by the difference in terminal motif frequency between mutant and shared sequences in plasma DNA of HCC patients, according to an embodiment of the present disclosure.
Figure 15B shows the combined frequency of 8 terminal motifs for HCC patients and pregnant women, according to embodiments of the present disclosure.
Fig. 16A and 16B illustrate entropy values of shared and mutant fragments for different sets of terminal motifs in HCC cases, according to embodiments of the present disclosure.
Figure 17 is a graph of motif diversity score (entropy) versus measured circulating tumor DNA fraction, according to embodiments of the present disclosure.
Fig. 18A illustrates an entropy analysis using donor-specific fragments, according to embodiments of the present disclosure. Figure 18B shows hierarchical clustering analysis using donor-specific fragments.
Fig. 19 is a flow chart illustrating a method of estimating a concentration fraction of clinically relevant DNA in a biological sample of a subject according to an embodiment of the present disclosure.
Fig. 20 is a flow chart illustrating a method of determining the gestational age of a fetus by analyzing a biological sample from a female subject pregnant with the fetus, according to an embodiment of the present disclosure.
Figure 21 shows a schematic of a phenotypic method for plasma DNA end motif analysis, according to embodiments of the present disclosure.
Fig. 22 shows an example of a frequency distribution of 4-mer terminal motifs between HCC subjects and HBV subjects with all plasma DNA molecules, according to an embodiment of the present disclosure.
Figure 23A shows a box plot of the combined frequency of the top 10 plasma DNA 4-mer terminal motifs for various subjects with different cancer levels, according to an embodiment of the present disclosure. The groups are: Control: healthy control subjects; HBV: chronic hepatitis B carriers; Cirr: cirrhosis subjects; eHCC: early HCC; iHCC: intermediate HCC; and aHCC: advanced HCC. Figure 23B shows Receiver Operating Characteristic (ROC) curves for the combined frequency of the top 10 plasma DNA 4-mer terminal motifs between HCC subjects and non-cancer subjects, according to embodiments of the present disclosure.
Fig. 24A shows a boxplot of CCA motif frequencies across different groups according to an embodiment of the present disclosure. Fig. 24B shows a ROC curve between a non-HCC group and an HCC group using the most frequent 3-mer motif (CCA) present in non-HCC subjects, according to an embodiment of the present disclosure.
Fig. 25A shows a boxplot of entropy values using 256 4-mer end motifs across different groups, according to an embodiment of the present disclosure. Fig. 25B shows a boxplot of entropy values using 10 4-mer end motifs across different groups, according to an embodiment of the present disclosure.
Fig. 26A shows a box plot of entropy values using 3-mer motifs across different groups according to an embodiment of the present disclosure. The entropy of HCC subjects using 3-mer motifs (64 motifs in total) was found to be significantly higher (p-value <0.0001) than that of non-HCC subjects. Fig. 26B shows a ROC curve using the entropy of 64 3-mer motifs between a non-HCC group and an HCC group, according to an embodiment of the present disclosure. The AUC was found to be 0.872.
Fig. 27A and 27B show boxplots of motif diversity (entropy) scores using 4-mers across different groups according to embodiments of the present disclosure.
Fig. 28 shows receiver operating characteristic (ROC) curves for various techniques to differentiate healthy controls from cancer, in accordance with embodiments of the present disclosure.
Figure 29 illustrates ROC curves for motif diversity score (MDS) analysis using various k-mers, in accordance with embodiments of the present disclosure.
Figure 30 illustrates the performance of MDS-based cancer detection for various tumor DNA fractions according to embodiments of the present disclosure.
Figure 31 illustrates ROC curves for MDS, SVM and logistic regression analysis according to embodiments of the present disclosure.
Figure 32 illustrates hierarchical clustering analysis for the top 10-ranked terminal motifs across different groups with different cancer levels, according to an embodiment of the present disclosure. The groups are: Control: healthy control subjects; HBV: chronic hepatitis B carriers; Cirr: cirrhosis subjects; eHCC: early HCC; iHCC: intermediate HCC; and aHCC: advanced HCC.
Fig. 33A-33C illustrate hierarchical clustering analysis using all plasma DNA molecules across different groups with different cancer levels, according to embodiments of the present disclosure.
Figure 34 illustrates hierarchical clustering analysis based on 3-mer motifs using all plasma DNA molecules across different groups with different cancer levels, according to an embodiment of the present disclosure.
Figure 35A shows an entropy analysis using all plasma DNA molecules between healthy control subjects and SLE patients according to embodiments of the present disclosure. Figure 35B shows hierarchical clustering analysis using all plasma DNA molecules between healthy control subjects and SLE patients according to embodiments of the present disclosure.
Figure 36 shows an entropy analysis using plasma DNA molecules with 10 selected terminal motifs between healthy control subjects and SLE patients, according to embodiments of the present disclosure.
Figure 37 shows a ROC curve including a combination analysis of terminal motifs and copy number or methylation according to embodiments of the present disclosure.
Figure 38A shows entropy analysis based on 4-mers co-constructed from the ends of sequenced plasma DNA fragments and their adjacent genomic sequences in HCC subjects and non-HCC subjects, according to embodiments of the disclosure. Figure 38B shows a 4-mer based clustering analysis according to embodiments of the present disclosure, the 4-mer being constructed jointly from the ends of sequenced plasma DNA fragments and their adjacent genomic sequences in HCC subjects and non-HCC subjects.
Fig. 39 shows a ROC comparison of the techniques 140 and 160 of fig. 1 for defining terminal motifs of plasma DNA, according to embodiments of the present disclosure.
Figure 40 shows an accuracy comparison showing that tissue-specific open chromatin regions improve discriminatory power of plasma DNA terminal motifs, according to embodiments of the present disclosure.
Figure 41 illustrates a size band-based plasma DNA end motif analysis, according to embodiments of the present disclosure.
Fig. 42 is a flow chart illustrating a method of classifying a pathology level in a biological sample of a subject according to an embodiment of the present disclosure.
Fig. 43 is a flow diagram illustrating a method of enriching a biological sample for clinically relevant DNA according to an embodiment of the present disclosure.
Fig. 44 is a flow diagram illustrating a method 3700 of enriching clinically relevant DNA of a biological sample according to an embodiment of the present disclosure.
Fig. 45 shows an exemplary graph illustrating an increase in fetal DNA fraction using a CCCA end motif according to an embodiment of the disclosure.
FIG. 46 shows a measurement system according to an embodiment of the invention.
FIG. 47 illustrates a block diagram of an exemplary computer system that may be used with the systems and methods according to embodiments of the invention.
Terms
A "tissue" corresponds to a group of cells that are grouped together as a functional unit. More than one type of cell may be present in a single tissue. Different types of tissue may consist of different types of cells (e.g., hepatocytes, alveolar cells, or blood cells), but may also correspond to tissue from different organisms (mother versus fetus) or to healthy cells versus tumor cells. A "reference tissue" may correspond to the tissue used to determine a tissue-specific methylation level. Multiple samples of the same tissue type from different individuals can be used to determine the tissue-specific methylation level of that tissue type.
A "biological sample" refers to any sample that is taken from a subject (e.g., a human or other animal, such as a pregnant woman, a person with or suspected of having cancer, an organ transplant recipient, or a subject suspected of having a disease process involving an organ, e.g., the heart in myocardial infarction, the brain in stroke, or the hematopoietic system in anemia) and that contains one or more nucleic acid molecules of interest. The biological sample can be a bodily fluid, such as blood, plasma, serum, urine, vaginal fluid, cyst fluid (e.g., testicular), vaginal irrigation fluid, pleural fluid, ascites fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, nipple discharge, aspirates from various parts of the body (e.g., thyroid, breast), intraocular fluid (e.g., aqueous humor), and the like. Fecal samples may also be used. In various embodiments, a majority of the DNA in a biological sample that has been enriched for cell-free DNA (e.g., a plasma sample obtained by a centrifugation protocol) may be cell-free, e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA may be cell-free. The centrifugation protocol may comprise, for example, centrifuging at 3,000 g for 10 minutes, obtaining the fluid fraction, and centrifuging again at, for example, 30,000 g for an additional 10 minutes to remove residual cells. As part of the analysis of the biological sample, at least 1,000 cell-free DNA molecules may be analyzed. As other examples, at least 10,000 or 50,000 or 100,000 or 500,000 or 1,000,000 or 5,000,000 cell-free DNA molecules or more may be analyzed.
"Clinically relevant DNA" may refer to DNA of a particular tissue origin that is to be measured, for example, to determine the concentration fraction of such DNA or to classify the phenotype of a sample (e.g., plasma). Examples of clinically relevant DNA are fetal DNA in maternal plasma, or tumor DNA in a patient's plasma or other sample containing cell-free DNA. Another example includes measurement of the amount of graft-associated DNA in the plasma, serum, or urine of a transplant patient. Another example includes measuring the concentration fractions of hematopoietic DNA and non-hematopoietic DNA in the plasma of a subject, or the concentration fraction of liver DNA fragments (or DNA fragments of other tissues) in a sample, or the concentration fraction of brain DNA fragments in cerebrospinal fluid.
A "sequence read" refers to a string of nucleotides sequenced from any part or all of a nucleic acid molecule. For example, a sequence read can be a short string of nucleotides (e.g., 20-150 nucleotides) sequenced from a nucleic acid fragment, a short string of nucleotides at one or both ends of a nucleic acid fragment, or the sequence of an entire nucleic acid fragment present in the biological sample. Sequence reads can be obtained in a variety of ways, for example using sequencing techniques or using probes, for example by hybridization arrays or capture probes, or amplification techniques such as Polymerase Chain Reaction (PCR) or linear amplification or isothermal amplification using a single primer. As part of the biological sample analysis, at least 1,000 sequence reads can be analyzed. As other examples, at least 10,000, or 50,000, or 100,000, or 500,000, or 1,000,000, or 5,000,000 sequence reads may be analyzed.
A sequence read can include an "end sequence" (also referred to as a terminal sequence) associated with an end of a fragment. The end sequence may correspond to the outermost N bases of the fragment, e.g., 2-30 bases at the end of the fragment. If a sequence read corresponds to an entire fragment, the sequence read may comprise two end sequences. When paired-end sequencing provides two sequence reads corresponding to the ends of a fragment, each sequence read may comprise one end sequence.
A "sequence motif" can refer to a short recurring pattern of bases in a DNA fragment (e.g., a cell-free DNA fragment). Sequence motifs may be present at the ends of fragments and thus be part of, or comprise, an end sequence. A "terminal motif" (also referred to as an end motif) may refer to a sequence motif of an end sequence that occurs preferentially at the ends of DNA fragments, possibly for a particular type of tissue. A terminal motif may also occur just before or just after the end of the fragment and thus still correspond to an end sequence.
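As a minimal sketch of taking the outermost N bases at each end of a fragment, the following assumes the whole fragment sequence is available as a string on one strand. Reading the second end on the opposite strand (as a reverse complement, so that both ends read 5' to 3') is an assumption made for illustration, and the helper names are hypothetical.

```python
# Complement table for the bases handled in this sketch.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def end_sequences(fragment, n=4):
    """Return the outermost n bases at both ends of a fragment.

    Assumption: the second end is reported on the opposite strand,
    so each returned sequence starts at a fragment end.
    """
    fragment = fragment.upper()
    if len(fragment) < n:
        raise ValueError("fragment is shorter than the requested end length")
    first_end = fragment[:n]
    second_end = reverse_complement(fragment[-n:])
    return first_end, second_end


# Hypothetical fragment, for illustration only.
ends = end_sequences("CCCAGGTTAACG", n=4)
```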
The term "allele" refers to an alternative DNA sequence at the same physical genomic locus, which may or may not result in a different phenotypic trait. In any particular diploid organism, there are two copies of each chromosome (except for the sex chromosomes in a male human subject), and the genotype for each gene includes the pair of alleles present at that locus, which are identical in homozygotes and different in heterozygotes. A population or species of organisms typically includes multiple alleles at each locus among its individuals. Genomic loci where more than one allele is found in a population are referred to as polymorphic sites. Allelic variation at a locus can be measured as the number of alleles present in a population (i.e., the degree of polymorphism) or the proportion of heterozygotes (i.e., the heterozygosity rate). As used herein, the term "polymorphism" refers to any inter-individual variation in the human genome, regardless of the frequency of the variation. Examples of such variations include, but are not limited to, single nucleotide polymorphisms, simple tandem repeat polymorphisms, indel polymorphisms, mutations (which may cause disease), and copy number variations. The term "haplotype" as used herein refers to a combination of alleles at multiple loci that are transmitted together on the same chromosome or chromosomal region. A haplotype can refer to as few as a pair of loci, or to a chromosomal region, or to an entire chromosome or chromosomal arm.
The term "fetal DNA concentration fraction" is used interchangeably with the terms "fetal DNA proportion" and "fetal DNA fraction" and refers to the proportion of DNA molecules present in a biological sample (e.g., a maternal plasma or serum sample) that are derived from the fetus (Lo et al, Am J Hum Genet. 1998;62:768-775; Lun et al, Clin Chem. 2008;54:1664-1672). Similarly, the tumor fraction or tumor DNA fraction may refer to the concentration fraction of tumor DNA in a biological sample.
A "relative frequency" may refer to a ratio (e.g., a percentage, fraction, or concentration). In particular, the relative frequency of a particular terminal motif (e.g., CCGA) can provide the proportion of cell-free DNA fragments associated with that terminal motif, e.g., by having an end sequence of CCGA.
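The relative frequency of each terminal motif can be computed by counting, as in the sketch below. It assumes the end sequences have already been extracted as strings; the function name and the example inputs are hypothetical.

```python
from collections import Counter


def motif_relative_frequencies(end_seqs, k=4):
    """Map each k-mer end motif to its share of all counted fragment ends."""
    counts = Counter(s[:k].upper() for s in end_seqs if len(s) >= k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: count / total for motif, count in counts.items()}


# Hypothetical end sequences, for illustration only.
ends = ["CCGA", "CCGA", "CCCA", "TTGA"]
freqs = motif_relative_frequencies(ends)
```

Here the motif CCGA accounts for two of the four ends, so its relative frequency is 0.5.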
An "aggregate value" can refer to a collective property, e.g., of the relative frequencies of a set of terminal motifs. Examples include an average, a median, a sum of relative frequencies, a variation among the relative frequencies (e.g., entropy, Standard Deviation (SD), Coefficient of Variation (CV), interquartile range (IQR), or a certain percentile cutoff (e.g., 95th percentile or 99th percentile) among the different relative frequencies), or a difference (e.g., a distance) from a reference pattern of relative frequencies, as may be implemented in clustering.
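Two of the collective properties listed above, the sum of relative frequencies over a motif set and the entropy of the frequencies, can be sketched as follows. Shannon entropy with the natural logarithm is used, and the optional normalization by the logarithm of the number of motifs is an assumption for illustration; the function names are hypothetical.

```python
import math


def summed_frequency(freqs_by_motif, motif_set):
    """Sum the relative frequencies of the motifs in motif_set."""
    return sum(freqs_by_motif.get(m, 0.0) for m in motif_set)


def shannon_entropy(freqs, normalize=False):
    """Shannon entropy of a list of relative frequencies.

    Zero frequencies contribute nothing. With normalize=True the result
    is divided by log(number of motifs), so a uniform distribution
    scores 1 (this scaling is an assumption of the sketch).
    """
    h = -sum(p * math.log(p) for p in freqs if p > 0)
    if normalize:
        n = len(freqs)
        return h / math.log(n) if n > 1 else 0.0
    return h


# Hypothetical frequencies, for illustration only.
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25], normalize=True)
skewed = shannon_entropy([0.97, 0.01, 0.01, 0.01], normalize=True)
```

A sample whose ends are spread evenly over the motifs scores higher than one dominated by a single motif.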
A "calibration sample" may correspond to a biological sample whose concentration fraction of clinically relevant DNA (e.g., tissue-specific DNA fraction) is known or is determined by a calibration method, e.g., using alleles specific to a tissue, such as in transplantation, where alleles present in the donor genome but absent from the recipient genome can be used as markers for the transplanted organ. As another example, a calibration sample may correspond to a sample from which terminal motifs can be determined. A calibration sample may be used for both purposes.
A "calibration data point" includes a "calibration value" and a measured or known concentration fraction of clinically relevant DNA (e.g., DNA of a particular tissue type). The calibration value can be determined from the relative frequencies (e.g., an aggregate value) determined for a calibration sample for which the concentration fraction of clinically relevant DNA is known. The calibration data points may be defined in various ways, for example as discrete points or as a calibration function (also referred to as a calibration curve or calibration surface). The calibration function may be derived from an additional mathematical transformation of the calibration data points.
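One way to turn calibration data points into a calibration function is an ordinary least-squares line, sketched below. The linear form is an assumption chosen for simplicity (the text allows any calibration curve or surface), and the data points and function names are hypothetical.

```python
def fit_linear_calibration(points):
    """Fit fraction = slope * value + intercept by least squares.

    points: list of (calibration_value, known_fraction) pairs.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two calibration data points are needed")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("calibration values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_fraction(value, slope, intercept):
    """Apply the fitted calibration function to a new sample's value."""
    return slope * value + intercept


# Hypothetical calibration data: (aggregate value, known fraction).
points = [(0.90, 0.05), (0.92, 0.10), (0.94, 0.15)]
slope, intercept = fit_linear_calibration(points)
estimate = estimate_fraction(0.93, slope, intercept)
```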
A "site" (also referred to as a "genomic site") corresponds to a single site, which can be a single base position or a group of related base positions, e.g., a CpG site or a larger group of related base positions. A "locus" can correspond to a region that includes multiple sites. A locus may include just one site, which would make the locus equivalent to a site in that context.
The "methylation index" of each genomic site (e.g., CpG site) can refer to the proportion of DNA fragments (e.g., as determined from sequence reads or probes) showing methylation at the site out of the total number of reads covering that site. A "read" can correspond to information obtained from a DNA fragment (e.g., the methylation state at a site). Reads can be obtained using reagents (e.g., primers or probes) that preferentially hybridize to DNA fragments of a particular methylation state. Typically, such reagents are applied after treatment with methods that differentially modify or differentially recognize DNA molecules according to their methylation state (e.g., bisulfite conversion, or methylation-sensitive restriction enzymes, or methylation-binding proteins, or anti-methylcytosine antibodies, or single molecule sequencing techniques that recognize, for example, methylcytosine and hydroxymethylcytosine).
The "methylation density" of a region can refer to the number of reads at sites within the region showing methylation divided by the total number of reads covering the sites in the region. The sites may have specific characteristics, e.g., being CpG sites. Thus, the "CpG methylation density" of a region may refer to the number of reads showing CpG methylation divided by the total number of reads covering CpG sites in the region (e.g., a particular CpG site, CpG sites within a CpG island, or a larger region). For example, the methylation density for each 100-kb bin in the human genome can be determined from the total number of cytosines not converted after bisulfite treatment (corresponding to methylated cytosines) at CpG sites, as a proportion of all CpG sites covered by sequence reads mapped to the 100-kb region. This analysis may also be performed for other bin sizes (e.g., 500 bp, 5 kb, 10 kb, 50 kb, or 1 Mb, etc.). A region may be the entire genome or a chromosome or a portion of a chromosome (e.g., a chromosomal arm). When a region contains only a single CpG site, the methylation index of that CpG site is the same as the methylation density of the region. The "proportion of methylated cytosines" can refer to the number of cytosine sites "C" in a region that show methylation (e.g., unconverted after bisulfite conversion) over the total number of cytosine residues analyzed (i.e., including cytosines outside of the CpG context). Methylation index, methylation density, and proportion of methylated cytosines are examples of "methylation levels".
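The relationship between methylation index and methylation density can be sketched as follows, assuming per-site read counts are already available as (methylated reads, total reads) pairs. The function names and counts are hypothetical.

```python
def methylation_index(methylated_reads, total_reads):
    """Share of reads covering one site that show methylation."""
    if total_reads == 0:
        raise ValueError("site is not covered by any read")
    return methylated_reads / total_reads


def methylation_density(site_counts):
    """Methylated reads over all reads covering the sites of a region.

    site_counts: list of (methylated_reads, total_reads), one per site.
    """
    site_counts = list(site_counts)
    methylated = sum(m for m, _ in site_counts)
    total = sum(t for _, t in site_counts)
    if total == 0:
        raise ValueError("region is not covered by any read")
    return methylated / total


# Hypothetical counts for a region with two CpG sites.
density = methylation_density([(8, 10), (2, 10)])
# A region with a single site: density equals that site's index.
single = methylation_density([(3, 4)])
```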
In addition to bisulfite conversion, other methods known to those skilled in the art may also be used to interrogate the methylation state of a DNA molecule, including, but not limited to, enzymes sensitive to methylation state (e.g., methylation-sensitive restriction enzymes), methylation-binding proteins, single molecule sequencing using a platform sensitive to methylation state (e.g., nanopore sequencing (Schreiber et al, Proc Natl Acad Sci USA.2013; 110: 18910-. The methylation metric of a DNA molecule can correspond to the percentage of sites that are methylated (e.g., CpG sites). The methylation metric can be specified as an absolute number or percentage, which can be referred to as the methylation density of the molecule.
The term "sequencing depth" refers to the number of times a locus is covered by sequence reads that are aligned to the locus. The locus may be as small as a nucleotide, or as large as a chromosomal arm, or as large as the entire genome. The sequencing depth can be expressed as 50x, 100x, etc., where "x" refers to the number of times a locus is covered by a sequence read. The sequencing depth may also be applied to multiple loci or to the entire genome, in which case x may refer to the average number of times a locus or haploid genome or entire genome, respectively, is sequenced. Ultra-deep sequencing may refer to sequencing at a depth of at least 100 x.
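For a locus spanning more than one base, the average depth can be computed as the number of aligned bases falling inside the locus divided by the locus length. The sketch below assumes half-open coordinate intervals and hypothetical alignments; it is an illustration, not a replacement for a real coverage tool.

```python
def mean_depth(read_intervals, locus_start, locus_end):
    """Average number of times each base of [locus_start, locus_end) is covered.

    read_intervals: list of half-open (start, end) alignment intervals.
    """
    length = locus_end - locus_start
    if length <= 0:
        raise ValueError("locus must have positive length")
    covered = sum(
        max(0, min(end, locus_end) - max(start, locus_start))
        for start, end in read_intervals
    )
    return covered / length


# Hypothetical alignments over a 100-base locus.
depth = mean_depth([(0, 100), (50, 150), (90, 110)], 0, 100)
```

In this example 160 aligned bases fall within the 100-base locus, giving a depth of 1.6x.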
A "separation value" corresponds to a difference or a ratio involving two values (e.g., two fractional contributions or two methylation levels). The separation value may be a simple difference or ratio. As examples, the direct ratio x/y and the ratio x/(x + y) are both separation values. The separation value may include other factors, e.g., multiplicative factors. As other examples, a difference or ratio of functions of the values may be used, such as a difference or ratio of the natural logarithms (ln) of the two values. A separation value may include both a difference and a ratio.
A "separation value" and an "aggregate value" (e.g., of relative frequencies) are two examples of a parameter (also referred to as a metric) that provides a measure of a sample that varies between different classifications (states) and thus can be used to determine different classifications. An aggregate value may be a separation value, for example, when a difference is taken between the set of relative frequencies of a sample and a reference set of relative frequencies, as may be done in clustering.
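The forms named above (direct ratio, x/(x + y), and a difference of natural logarithms) can be written out as in the following sketch. The function names are hypothetical and the inputs are arbitrary illustrative numbers.

```python
import math


def direct_ratio(x, y):
    """x / y."""
    return x / y


def proportion(x, y):
    """x / (x + y)."""
    return x / (x + y)


def log_difference(x, y):
    """ln(x) - ln(y), which equals ln(x / y) for positive x and y."""
    return math.log(x) - math.log(y)


# Arbitrary illustrative values.
x, y = 3.0, 1.0
values = (direct_ratio(x, y), proportion(x, y), log_difference(x, y))
```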
The term "classification" as used herein refers to any one or more numbers or other one or more characters associated with a particular property of a sample. For example, the symbol "+" (or the word "positive") may indicate that the sample is classified as having a deletion or an amplification. The classification may be binary (e.g., positive or negative) or have more classification levels (e.g., a scale from 1 to 10 or 0 to 1).
The terms "cutoff value" and "threshold value" refer to a predetermined number used in operation. For example, a cutoff size may refer to a size that does not contain a fragment beyond a certain size. The threshold may be a value above or below the value applied for a particular classification. Any of these terms may be used in any of these contexts. The cutoff or threshold value may be a "reference value" or may be derived from a reference value that represents a particular category or distinguishes two or more categories. As will be appreciated by those skilled in the art, such reference values may be determined in various ways. For example, a metric may be determined for two different subjects with different known classifications, and a reference value may be selected as a representative (e.g., average) of one classification or a value between two clusters of metrics (e.g., selected to obtain a desired sensitivity and specificity). As another example, the reference value may be determined based on a statistical simulation of the sample.
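Selecting a reference value between two clusters of a metric can be sketched as below. Taking the midpoint of the two group means is only one simple choice made here for illustration; in practice the value might instead be tuned for a desired sensitivity and specificity. The metric values and function names are hypothetical.

```python
from statistics import mean


def midpoint_cutoff(metrics_a, metrics_b):
    """Reference value halfway between the means of two groups."""
    return (mean(metrics_a) + mean(metrics_b)) / 2


def classify(metric, cutoff, positive_above=True):
    """Binary classification of a sample's metric against the cutoff."""
    is_positive = metric > cutoff if positive_above else metric < cutoff
    return "positive" if is_positive else "negative"


# Hypothetical metric values for two groups with known classifications.
cutoff = midpoint_cutoff([0.90, 0.92], [0.96, 0.98])
label_high = classify(0.97, cutoff)
label_low = classify(0.91, cutoff)
```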
The term "cancer level" can refer to the presence or absence of cancer (i.e., presence or absence), the stage of cancer, the size of the tumor, the presence or absence of metastasis, the total tumor burden of the body, the response of the cancer to treatment, and/or other measures of cancer severity (e.g., cancer recurrence). The cancer level may be a number or other indicia such as symbols, letters, and color. The level may be zero. The cancer level may also comprise premalignant or precancerous conditions. Cancer levels can be used in various ways. For example, screening can examine whether a cancer is present in a person who was not previously known to have cancer. Assessment may investigate a person diagnosed with cancer, to monitor the progression of the cancer over time, to study the effectiveness of a treatment, or to determine prognosis. In one embodiment, prognosis may be expressed as the likelihood that a patient dies from cancer, or the likelihood that cancer progresses after a particular duration or time, or the likelihood or extent of cancer metastasis. Detection may mean "screening" or may mean checking whether a person with an implied characteristic of cancer (e.g., symptoms or other positive test) has cancer.
"Pathological level" may refer to the amount, extent, or severity of a pathology associated with an organism, wherein the level may be as described above for cancer. Another example of a pathology is rejection of a transplanted organ. Other example pathologies may include autoimmune attacks (e.g., lupus nephritis damaging the kidney, or multiple sclerosis), inflammatory diseases (e.g., hepatitis), fibrotic processes (e.g., liver cirrhosis), fatty infiltration (e.g., fatty liver disease), degenerative processes (e.g., Alzheimer's disease), and ischemic tissue damage (e.g., myocardial infarction or stroke). A healthy state of a subject may be considered a classification of no pathology.
The term "about" or "approximately" can mean within an acceptable deviation of a particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, "about" can mean within 1 or greater than 1 standard deviation, according to practice in the art. Alternatively, "about" may mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Alternatively, particularly for biological systems or processes, the term "about" or "approximately" may mean within an order of magnitude, within 5 times the value, and more preferably, within 2 times the value. When particular values are described in the present application and claims, unless otherwise indicated, the term "about" shall be assumed to indicate that the particular values are within an acceptable error range. The term "about" can have the meaning commonly understood by one of ordinary skill in the art. The term "about" may mean ± 10%. The term "about" may mean ± 5%.
Detailed Description
The present disclosure describes techniques for measuring the amount (e.g., relative frequency) of terminal motifs of free DNA fragments in a biological sample of an organism, in order to measure characteristics of the sample and/or to determine a condition of the organism based on such measurements. Different tissue types exhibit different patterns of relative frequencies of sequence motifs. The present disclosure provides various uses for measuring the relative frequencies of terminal motifs of free DNA, for example in a mixture of free DNA from various tissues. DNA from one of such tissues may be referred to as clinically relevant DNA.
Clinically relevant DNA of a particular tissue (e.g., a fetus, tumor, or transplanted organ) exhibits a particular pattern of relative frequencies that can be measured as an aggregate value. Other DNA in the sample may exhibit different patterns, allowing the amount of clinically relevant DNA in the sample to be measured. Thus, in one example, the concentration fraction (e.g., percentage) of clinically relevant DNA can be determined based on the relative frequencies of the terminal motifs. The concentration fraction may be a number, a range of values, or another classification, such as high, medium, or low, or whether the concentration fraction exceeds a threshold. In various embodiments, the aggregate value can be the sum of the relative frequencies of a set of terminal motifs, a variance of the relative frequencies of all terminal motifs or of a set of terminal motifs (e.g., entropy, also referred to as a motif diversity score), or a difference (e.g., a total distance) relative to a reference pattern (e.g., an array (vector) of relative frequencies of a calibration sample with a known concentration fraction). Such an array may be considered a reference set of relative frequencies. Such a difference may be used in classifiers, examples of which include hierarchical clustering, support vector machines, and logistic regression. By way of example, the clinically relevant DNA may be fetal, tumor, transplanted organ, or other tissue (e.g., hematopoietic or liver) DNA.
In another example, the relative frequencies of motifs can be used to determine the level of a pathology. Organisms with different phenotypes may exhibit different patterns of relative frequencies of terminal motifs of free DNA fragments. An aggregate value of the relative frequencies of the terminal motifs can be compared to a reference value to classify the phenotype. In various embodiments, the aggregate value may be a sum of relative frequencies, a variance of relative frequencies, or a difference relative to a reference set of relative frequencies. Exemplary pathologies include cancer and autoimmune diseases, such as systemic lupus erythematosus (SLE).
In another example, the relative frequencies of motifs can be used to determine the gestational age of a fetus. The aggregate value of the relative frequencies of the terminal motifs in a maternal sample varies with the gestational age of the fetus. Such an aggregate value may be determined as described above and elsewhere herein.
Given that free DNA fragments from a particular tissue have a particular set of preferred end motifs, the preferred end motifs can be used to enrich the sample for DNA from the particular tissue (clinically relevant DNA). Such enrichment may be performed by physical manipulation to enrich the physical sample. Some embodiments may capture and/or amplify free DNA fragments having terminal sequences that match a preferred set of terminal motifs, e.g., using primers or adapters. Other examples are described herein.
In some embodiments, enrichment may be performed in silico. For example, the system can receive sequence reads and then filter the reads based on the terminal motifs to obtain a subset of sequence reads with a higher concentration of corresponding DNA fragments from clinically relevant DNA. If the terminal sequence of a DNA fragment comprises a preferred terminal motif, said DNA fragment can be identified as having a higher probability of being derived from a tissue of interest. The likelihood can be further determined based on methylation and size of the DNA fragments, as described herein.
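The in silico filtering described above can be sketched as follows. This is a minimal illustration only: the function name `enrich_reads`, the preferred motif set, and the toy reads are hypothetical, and only the first k bases of each read are examined. An actual filter could also weigh methylation and fragment size, as noted above.

```python
from typing import Iterable, List, Set


def enrich_reads(reads: Iterable[str], preferred: Set[str], k: int = 4) -> List[str]:
    """Keep reads whose first k bases match a preferred terminal motif."""
    kept = []
    for read in reads:
        motif = read[:k].upper()
        # Reads shorter than k cannot carry a full k-mer end motif.
        if len(motif) == k and motif in preferred:
            kept.append(read)
    return kept


# Hypothetical preferred set and reads, for illustration only.
preferred_motifs = {"CCCA", "CCAG"}
reads = ["CCCATTGACG", "AAAGTTCCGA", "CCAGGGTTAC", "CC"]
subset = enrich_reads(reads, preferred_motifs)
print(subset)
```

The retained subset would be expected to have a higher proportion of fragments from the tissue of interest only if the preferred set was chosen from reference data for that tissue.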
Such use of terminal motifs may eliminate the need for a reference genome, as may be required when using terminal positions (Chan et al, Proc Natl Acad Sci USA.2016; 113: E8159-8168; Jiang et al, Proc Natl Acad Sci USA.2018; doi: 10.1073/pnas.1814616115). Furthermore, since the number of terminal motifs may be less than the number of preferred terminal positions in the reference genome, more statistics per terminal motif can be collected, potentially increasing accuracy.
This ability to use terminal motifs in the manner described above is surprising. For example, Chandrananda et al found a high similarity between maternal and fetal fragments with respect to position-specific nucleotide patterns involving the single-nucleotide frequencies in the 51bp (upstream/downstream 20bp) region surrounding the fragment start sites (Chandrananda et al, BMC Med genomics.2015; 8:29), which means that their method, based on single-nucleotide frequencies surrounding the fragment ends, cannot inform the tissue of origin of a free DNA fragment.
I. Free DNA end motifs
The terminal motif relates to the terminal sequence of an isolated DNA fragment, e.g., a sequence of k bases at either end of the fragment. The terminal sequence can be a k-mer with various numbers of bases (e.g., 1, 2, 3, 4, 5, 6, 7, etc.). The terminal motif (or "sequence motif") refers to the sequence itself, rather than to a particular location in the reference genome. Thus, the same terminal motif may occur at many positions throughout the reference genome. The reference genome can be used to determine the terminal motif, e.g., to identify the bases just before the start position or just after the end position. Such bases will still correspond to the ends of the free DNA fragments, for example because they are identified based on the terminal sequences of the fragments.
FIG. 1 shows an example of a terminal motif according to an embodiment of the present disclosure. FIG. 1 depicts two ways of defining the 4-mer end motif to be analyzed. In technique 140, 4-mer end motifs are constructed directly from the first 4bp sequence on each end of the plasma DNA molecule. For example, the first 4 nucleotides or the last 4 nucleotides of a sequenced fragment may be used. In technique 160, a 4-mer end motif is co-constructed by using a 2-mer sequence from the sequenced end of a fragment and another 2-mer sequence from a genomic region adjacent to the end of the fragment. In other embodiments, other types of motifs may be used, such as 1-mer, 2-mer, 3-mer, 5-mer, 6-mer, and 7-mer end motifs.
As shown in fig. 1, free DNA fragments 110 are obtained, for example, using a purification process of a blood sample, for example, by centrifugation. In addition to plasma DNA fragments, other types of free DNA molecules may be used, such as from serum, urine, saliva, and other such free samples mentioned herein. In one embodiment, the DNA fragment may be blunt-ended.
At block 120, paired-end sequencing is performed on the DNA fragments. In some embodiments, paired-end sequencing can generate two sequence reads from both ends of a DNA fragment, e.g., 30-120 bases each. The two sequence reads may form a pair of reads of a DNA fragment (molecule), wherein each sequence read comprises the terminal sequence of the corresponding end of the DNA fragment. In other embodiments, the entire DNA fragment may be sequenced, thereby providing a single sequence read that comprises the terminal sequences of both ends of the DNA fragment.
At block 130, the sequence reads may be aligned to a reference genome. This alignment is used to illustrate different ways of defining sequence motifs, and may not be used in some embodiments. The alignment procedure can be performed using various software packages, such as BLAST, FASTA, Bowtie, BWA, BFAST, SHRiMP, SSAHA2, NovoAlign, and SOAP.
For technique 140, the sequence reads of the sequenced fragment 141 are shown aligned with genome 145. With the 5' end taken as the start, a first terminal motif 142 (CCCA) is located at the start of the sequenced fragment 141. A second terminal motif 144 (TCGA) is at the tail of the sequenced fragment 141. In one embodiment, such a terminal motif may occur when an enzyme recognizes CCCA and then cleaves just before the first C. If this is the case, CCCA will preferentially be at the ends of plasma DNA fragments. For TCGA, the enzyme can recognize it and then cleave after the A.
The sequence reads of the sequenced fragment 161 are shown by technique 160 and aligned with genome 165. With the 5' end as the start, the first end motif 162(CGCC) has a first portion (CG) that occurs just before the start of the sequenced fragment 161 and a second portion (CC) that is part of the end sequence of the start of the sequenced fragment 161. The second terminal motif 164(CCGA) has a first part (GA) that appears just after the tail of the sequenced fragment 161 and a second part (CC) that is part of the terminal sequence of the tail of the sequenced fragment 161. In one embodiment, such a terminal motif may occur when the enzyme recognizes CGCC and then cleaves between G and C. If this is the case, the CC will preferentially be at the end of the plasma DNA fragment, and the CG will occur just before the CC, providing the end motif of CGCC. With respect to the second terminal motif 164(CCGA), the enzyme can cleave between C and G. If this is the case, CC will preferentially be at the ends of the plasma DNA fragments. For the technique 160, the number of bases from adjacent genomic regions and sequenced plasma DNA fragments may vary and need not be limited to a fixed ratio, e.g., instead of 2:2, which may be 2:3, 3:2, 4:4, 2:4, etc.
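The two ways of defining a terminal motif can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the toy sequences were constructed only to reproduce the motifs named in the text (CCCA/TCGA and CGCC/CCGA), both ends are read on the same strand as drawn in FIG. 1, and the fragment is assumed to align to `reference[start:end]` with 0-based, end-exclusive coordinates.

```python
from typing import Tuple


def motifs_from_fragment_ends(fragment: str, k: int = 4) -> Tuple[str, str]:
    """Technique 140 style: first k and last k bases of the sequenced fragment."""
    if len(fragment) < k:
        raise ValueError("fragment shorter than motif length")
    return fragment[:k], fragment[-k:]


def motifs_spanning_ends(reference: str, start: int, end: int,
                         outside: int = 2, inside: int = 2) -> Tuple[str, str]:
    """Technique 160 style: flanking reference bases joined with fragment bases.

    `outside` bases come from the adjacent genomic region and `inside` bases
    from the fragment itself (2:2 by default; other ratios are possible).
    """
    if start - outside < 0 or end + outside > len(reference):
        raise ValueError("not enough flanking sequence in the reference")
    head = reference[start - outside:start + inside]
    tail = reference[end - inside:end + outside]
    return head, tail


# Toy sequences chosen to reproduce the motifs named in the text.
print(motifs_from_fragment_ends("CCCAGGTTTCGA"))             # ('CCCA', 'TCGA')
toy_reference = "TTCGCCAATTGGCCGATT"
print(motifs_spanning_ends(toy_reference, start=4, end=14))  # ('CGCC', 'CCGA')
```

Whichever definition is chosen, the same one would need to be applied to both calibration and test samples.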
The more nucleotides included in the free DNA end motif, the higher the specificity of the motif, since the probability of having 6 bases arranged in an exact configuration in the genome is lower than the probability of having 2 bases arranged in an exact configuration in the genome. Thus, the choice of the length of the terminal motif may be dictated by the desired sensitivity and/or specificity of the intended application.
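As a rough worked illustration of this point, assume (as a simplification not made in the source) that the four bases are equally likely and independent at each position. The probability that a given position matches one specific k-mer is then:

```latex
P(k) = 4^{-k}, \qquad
P(2) = \frac{1}{16} \approx 6.3 \times 10^{-2}, \qquad
P(6) = \frac{1}{4096} \approx 2.4 \times 10^{-4}
```

Real genomes deviate from this uniform model, so these values indicate only the order of magnitude of the difference between a 2-mer and a 6-mer.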
Since the sequence reads are aligned to the reference genome using the terminal sequences, any sequence motif determined from the terminal sequence itself, or from the bases just before or after it, is still determined by the terminal sequence. Thus, technique 160 associates the terminal sequence with other bases, where the reference is used as a mechanism to make the association. The difference between techniques 140 and 160 lies in which two terminal motifs a specific DNA fragment is assigned to, which affects the specific values of the relative frequencies. However, the overall outcome (e.g., concentration fraction of clinically relevant DNA, classification of pathological level, etc.) will not be affected by how DNA fragments are assigned to terminal motifs, as long as the technique used for the training data is consistent with the technique used in production.
The number of DNA fragments having a terminal sequence corresponding to a particular terminal motif can be counted (e.g., stored in an array in memory) to determine a relative frequency. As described in more detail below, the relative frequencies of the terminal motifs of the free DNA fragments can be analyzed. Differences in the relative frequencies of terminal motifs have been detected for different types of tissues and different phenotypes (e.g., different levels of pathology). The difference can be quantified by the amount of DNA fragments with a particular terminal motif or by an overall pattern, e.g., a variance (e.g., entropy, also referred to as a motif diversity score) across a set of terminal motifs (e.g., all possible k-mers of the length used).
II. Methods based on genotype differences
We have determined that different tissue types have different terminal motifs. Herein, we describe how a terminal motif can be used to determine the concentration fraction of clinically relevant DNA (e.g., fetal DNA, tumor DNA, DNA from a transplanted organ, or DNA from a specific organ).
In order to identify terminal motifs that are preferential for a particular type of clinically relevant DNA, genotypic differences can be used to identify DNA fragments from the clinically relevant tissue. Once a DNA fragment is identified as being from the clinically relevant tissue, the terminal motif of the DNA fragment can be determined. Our analysis shows that the relative frequencies of the terminal motifs vary from tissue to tissue. As described below, a quantification of the differences in relative frequencies may be used in conjunction with calibration samples whose concentration fractions of clinically relevant DNA are known (e.g., measured by separate techniques such as tissue-specific alleles) to determine a classification of the concentration fraction of clinically relevant DNA in a biological sample.
Although the concentration fraction of clinically relevant DNA may need to be measured in the calibration samples, the resulting calibration values (e.g., as part of a calibration function) can be used to determine the concentration fraction of a new sample without having to identify alleles specific to the clinically relevant DNA. In this way, the concentration fraction can be determined in a more robust manner.
A. Pregnancy
The genotypic differences between the maternal and fetal genomes can be used to distinguish fetal from maternal DNA molecules. For example, one can use informative Single Nucleotide Polymorphism (SNP) sites where the mother is homozygous (AA) and the fetus is heterozygous (AB).
Fig. 2 shows a schematic of a genotype-based method for analyzing differential end motif patterns between fetal and maternal DNA molecules, according to an embodiment of the present disclosure. As shown in fig. 2, fetal-specific molecules 205 carrying a fetal-specific allele (B) can be determined. On the other hand, shared molecules 207 carrying a shared allele (A) can be determined, which will represent DNA molecules that are predominantly maternal in origin, since fetal DNA molecules are usually the minority in the maternal plasma DNA pool. Thus, any molecular property derived from the shared molecules will reflect the characteristics of the maternal background DNA molecules (i.e., DNA molecules of hematopoietic origin). In addition to alleles, other fetal-specific markers (e.g., epigenetic markers) can be used.
We analyzed the 4-mer end motifs using technique 140 in FIG. 1. All 256 possible 4-mer end motifs were analyzed. We calculated the proportion of each 4-mer motif and compared the frequencies of the 256 motifs using a bar graph (depicted as bar graph 220). Such a bar graph provides the relative frequency (%) with which each 4-mer occurs as a terminal motif. For ease of illustration, only a few 4-mers are shown. The relative frequency (sometimes referred to simply as the "frequency") may be determined as: (number of DNA fragments having the terminal motif)/(total number of DNA fragments analyzed), possibly with a factor of 2 in the denominator to account for both ends. Such a percentage may be considered a relative frequency, as it relates an amount (e.g., a count) of a first terminal motif to an amount of one or more other motifs (which may include the first terminal motif). As can be seen, terminal motif 222 shows a significant difference in relative frequency between DNA fragments of different tissue types. Such differences may be used for various purposes, for example to enrich for fetal DNA in a sample or to determine the concentration of fetal DNA.
The values of the relative frequency shown in the bar graph 220 may be stored values in an array having 256 values. There may be a counter for each terminal motif of the set of terminal motifs, wherein the counter is incremented each time a new DNA fragment has a terminal motif corresponding to the counter for a particular terminal motif. The motif set can be selected in various ways, e.g., as all terminal motifs or a smaller set, e.g., the motif that appears most in the reference sample or the motif that exhibits the greatest separation in the reference sample.
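The per-motif counters and the resulting 256-value array of relative frequencies can be sketched as follows. This is a minimal illustration: the function name and the toy input are hypothetical, each fragment is assumed to contribute two end motifs (so the denominator is the number of ends counted), and ends containing a non-ACGT base are simply skipped, which is a choice made here rather than one stated in the source.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple

ALL_4MERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 motifs


def motif_relative_frequencies(end_pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Count the end motifs of each fragment and normalize by the number of ends."""
    counts = Counter({motif: 0 for motif in ALL_4MERS})  # one counter per motif
    total_ends = 0
    for first_end, second_end in end_pairs:
        for motif in (first_end, second_end):
            if motif in counts:  # skips ends with ambiguous bases such as N
                counts[motif] += 1
                total_ends += 1
    if total_ends == 0:
        raise ValueError("no valid end motifs found")
    return {motif: counts[motif] / total_ends for motif in ALL_4MERS}


# Hypothetical end motifs of three fragments, for illustration only.
pairs = [("CCCA", "TCGA"), ("CCCA", "CCAG"), ("NCCA", "CCCA")]
freqs = motif_relative_frequencies(pairs)
print(freqs["CCCA"])  # 3 of the 5 valid ends
```

Restricting the dictionary to a smaller motif set (e.g., the motifs showing the greatest separation in reference samples) would only require replacing `ALL_4MERS` with that set.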
Various quantification techniques can be used to provide a measure of the relative frequencies for a sample, and such a measure can be used to classify the amount of free DNA from the clinically relevant tissue. One exemplary quantification technique is the sum of the relative frequencies of a set of terminal motifs, also referred to herein as a combined frequency. For example, such a set may consist of the terminal motifs that occur most frequently in a particular tissue type or that are identified as having the greatest separation between two tissue types. A weighted sum may also be used. The weights may be predetermined or variable; for example, the weight for a given frequency may depend on the frequency itself. Entropy is an example of this.
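A combined frequency and its weighted variant can be sketched as follows. This is an illustrative sketch only: the function name, the motif set, the weights, and the frequency values are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def combined_frequency(freqs: Dict[str, float], motif_set: Sequence[str],
                       weights: Optional[Dict[str, float]] = None) -> float:
    """Sum (optionally weighted) of the relative frequencies of a motif set."""
    if weights is None:
        return sum(freqs.get(m, 0.0) for m in motif_set)
    return sum(weights[m] * freqs.get(m, 0.0) for m in motif_set)


# Hypothetical relative frequencies, for illustration only.
example_freqs = {"CCCA": 0.020, "CCAG": 0.015, "AAAA": 0.003}
print(combined_frequency(example_freqs, ["CCCA", "CCAG"]))
print(combined_frequency(example_freqs, ["CCCA", "CCAG"], {"CCCA": 2.0, "CCAG": 1.0}))
```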
In another embodiment, to capture the landscape differences of the terminal motifs between fetal and maternal DNA molecules, entropy-based analysis 230 may be used. Entropy is an example of variance/diversity. To analyze the frequency distribution of motifs (e.g., for a total of 256 motifs), one definition of entropy uses the following equation:
$$\text{Entropy} = -\sum_{i=1}^{256} P_i \log(P_i)$$
where $P_i$ is the frequency of a particular motif $i$ and $\log$ denotes the natural logarithm; a higher entropy value indicates a higher diversity (i.e., a higher degree of randomness).
In this example, the entropy reaches its maximum (i.e., 5.55) when the frequencies of the 256 motifs are equal. In contrast, when the frequencies of the 256 motifs have a skewed distribution, the entropy decreases. For example, if one particular motif accounts for 99% and the other motifs account for the remaining 1%, the entropy under this formula decreases to 0.11 (other formulas may be used, e.g., one without the logarithm or one using only the logarithm). Thus, a decrease in the entropy of motif frequencies implies an increase in the skewness of the frequency distribution across the terminal motifs. An increase in the entropy of motif frequencies indicates that the frequencies across motifs are trending toward equal probability for these motifs. Thus, the entropy of motif frequencies measures how uniform the abundances of the terminal motifs present in plasma DNA are. The higher the uniformity of the motif frequencies, the higher the entropy value that can be expected.
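The two numerical values quoted above can be checked with a short sketch. It assumes the entropy is computed as the negative sum of P_i times the natural logarithm of P_i over the motifs, and that in the skewed case the remaining 1% is split equally among the other 255 motifs; both assumptions reproduce the stated values of 5.55 and 0.11. The function name is hypothetical.

```python
import math
from typing import Sequence


def motif_entropy(freqs: Sequence[float]) -> float:
    """Entropy -sum(p * ln p) of motif frequencies, normalized to sum to 1."""
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    # Motifs with zero frequency contribute nothing (p * ln p tends to 0).
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)


uniform = [1 / 256] * 256                 # all 256 motifs equally frequent
skewed = [0.99] + [0.01 / 255] * 255      # one motif at 99%, rest share 1%

print(round(motif_entropy(uniform), 2))   # 5.55
print(round(motif_entropy(skewed), 2))    # 0.11
```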
In various other examples, the standard deviation (SD), coefficient of variation (CV), interquartile range (IQR), or a certain percentile cutoff (e.g., the 95th or 99th percentile) across the different motif frequencies may be used to assess the landscape change of the terminal motif patterns between fetal and maternal DNA molecules. Such examples provide a measure of the variance/diversity of the relative frequencies of a set of terminal motifs. Given the definition of entropy in FIG. 2, if only one terminal motif has a non-zero count, the entropy has its minimum value. If other terminal motifs occur in some DNA fragments, the entropy increases. If there is no selectivity (a random distribution across all terminal motifs, e.g., a hypothetical case in which all have the same frequency), the entropy reaches its maximum. In this way, entropy quantifies the overall selectivity of the terminal sequences of the free DNA fragments for particular terminal motifs.
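The alternative dispersion measures named above can be sketched with the Python standard library. This is illustrative only: the function name is hypothetical, the population standard deviation is used, and the quartiles follow the default method of `statistics.quantiles`; the source does not specify either convention.

```python
import statistics
from typing import Dict, Sequence


def dispersion_metrics(freqs: Sequence[float]) -> Dict[str, float]:
    """SD, CV, and IQR across a set of motif frequencies."""
    mean = statistics.fmean(freqs)
    sd = statistics.pstdev(freqs)                    # population SD
    q1, _median, q3 = statistics.quantiles(freqs, n=4)
    cv = sd / mean if mean else float("nan")
    return {"sd": sd, "cv": cv, "iqr": q3 - q1}


uniform = [1 / 256] * 256                 # no selectivity: zero dispersion
skewed = [0.99] + [0.01 / 255] * 255      # strong selectivity for one motif
print(dispersion_metrics(uniform))
print(dispersion_metrics(skewed))
```

Note that these measures move in the opposite direction to entropy: a uniform distribution gives zero SD and CV but maximal entropy.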
Curve 235 shows the entropy values of the shared (mainly maternal) and fetal sequences. The shared sequences contain less fetal DNA (e.g., perhaps about 5% fetal DNA if the original sample has 10% fetal DNA) than the fetal sequences, which will have nearly 100% fetal DNA, within the error tolerance of the genotyping measurement. Given this separation, the greater the concentration of fetal DNA in the sample, the greater the difference in entropy values will be. This relationship between fetal DNA concentration and entropy can be used to determine fetal DNA concentration, e.g., as measured using one or more calibration values. For example, the concentration of clinically relevant DNA in a calibration sample may be measured by another technique (yielding a calibration value) that may not be universally applicable, for example using Y-chromosome DNA of a male fetus or previously identified mutations of tumor tissue. Given an entropy measurement of the calibration sample and the measured concentration of the calibration sample, a comparison of two entropy values (one for the test sample and one for the calibration sample) can provide a concentration fraction for the test sample. Further details of such use of calibration values and calibration functions are described later.
In another embodiment, a cluster-based analysis 240 may be employed. The vertical axis corresponds to the 4-mer motifs and the horizontal axis corresponds to different samples, e.g., having different classifications for fetal DNA concentration. The color corresponds to the relative frequency of a particular 4-mer motif for a particular sample, e.g., where red calibration sample 242 has a higher concentration than green calibration sample 244, which has a lower value.
The cluster-based analysis may utilize the following assumption: the frequency distributions of the 256 4-mer end motifs within fetal DNA molecules or within maternal DNA molecules (i.e., the intragroup molecular properties) will be more similar to one another than the distributions between fetal and maternal DNA molecules (i.e., the intergroup molecular properties). Thus, calibration samples from individuals characterized by end motifs derived from shared sequences (e.g., a higher concentration of shared sequences) are expected to differ from calibration samples from individuals characterized by end motifs derived from fetal-specific sequences (e.g., a lower concentration of shared sequences, and thus a higher concentration of fetal sequences). Each individual corresponds to a vector comprising the 256 terminal motifs and their corresponding frequencies (i.e., a 256-dimensional vector). Exemplary clustering techniques include, but are not limited to, hierarchical clustering, centroid-based clustering, distribution-based clustering, and density-based clustering. Different clusters may correspond to different amounts of fetal DNA in the sample, as these clusters will have different patterns of relative frequencies due to the difference in the frequencies of the terminal motifs between maternal and fetal DNA fragments.
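A bare-bones version of hierarchical clustering on motif-frequency vectors can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the figures: Euclidean distance and average linkage are choices made here, the function names are hypothetical, and the toy vectors have 3 dimensions instead of 256 to keep the example readable.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors: List[Sequence[float]], n_clusters: int = 2) -> List[List[int]]:
    """Average-linkage agglomerative clustering; returns clusters of row indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sum = sum(euclidean(vectors[i], vectors[j])
                               for i in clusters[a] for j in clusters[b])
                dist = pair_sum / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical frequency vectors: rows 0-1 resemble one group, rows 2-3 another.
toy_vectors = [[0.50, 0.30, 0.20], [0.52, 0.28, 0.20],
               [0.20, 0.30, 0.50], [0.18, 0.32, 0.50]]
print(agglomerate(toy_vectors, n_clusters=2))  # [[0, 1], [2, 3]]
```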
To assess the difference in terminal motifs between fetal and maternal DNA molecules, we genotyped the maternal buffy coat and fetal samples separately using a microarray platform (HumanOmni2.5, Illumina) and sequenced the matching plasma DNA samples. Peripheral blood samples were obtained from 10 pregnant women in each of the first trimester (12-14 weeks), second trimester (20-23 weeks), and third trimester (38-40 weeks), and plasma and maternal buffy coat samples were harvested for each case. We obtained a median of 195,331 informative SNPs (range: 146,428-202,800) for which the mother was homozygous and the fetus was heterozygous. Plasma DNA molecules carrying a fetal-specific allele were identified as fetal-specific DNA molecules. Plasma DNA molecules carrying shared alleles were identified and considered to be primarily maternal DNA molecules. The median fetal DNA fraction among those samples was 17.1% (range: 7.0%-46.8%). For each case, a median of 103 million (range: 52 million to 186 million) mapped paired-end reads was obtained. The terminal motifs of each plasma DNA molecule were determined bioinformatically from the 4-mer sequence closest to each end of the fragment. The results from the analysis of this sample set are provided below.
1. Rank order differences in relative frequency
We reasoned that the top-ranked terminal motifs, when motifs are ranked by their frequency difference between fetal and maternal DNA molecules, would be useful for detecting or enriching fetal or maternal DNA molecules. Thus, we ranked the terminal motifs based on their frequency difference between fetal and maternal DNA molecules in one pregnant woman, with a sequencing depth of 270x. Fetal sequences and shared sequences were identified based on informative SNPs in a manner similar to that described above.
Fig. 3 shows a bar graph of the frequencies of terminal motifs between fetal and maternal DNA molecules according to an embodiment of the present disclosure. Data were obtained from one pregnant woman with a sequencing depth of 270 x. The vertical axis corresponds to the percentage of frequency for a given 4-mer motif, determined from the number of DNA fragments with the given 4-mer motif (as determined from the sequence reads) divided by the total number of terminal sequences of the DNA fragments analyzed (e.g., twice the number of DNA fragments). The horizontal axis corresponds to 256 different 4-mers. The 4-mers were sorted with decreasing frequency of shared sequences, with FIG. 3 being divided into two parts with different scales for the vertical axis. The frequency difference of the terminal motifs can be observed between fetal DNA molecules (fetal DNA molecules with a fetal-specific allele) and maternal DNA molecules (maternal DNA molecules with a shared allele).
Fig. 4 shows the top 10 terminal motifs from fig. 3 for fetal and shared (i.e., fetal plus maternal) sequences according to an embodiment of the present disclosure. The vertical axis is offset and starts at a frequency of 1%. The top 10 terminal motifs are CCCA, CCAG, CCTG, CCAA, CCCT, CCTT, CCAT, CAAA, CCTC and CCAC. As can be seen, some terminal motifs show a greater difference between the shared sequences and the fetal-specific sequences than other terminal motifs. Thus, in order to distinguish between maternal and fetal DNA, one may want to use the terminal motifs with the greatest difference, rather than just the terminal motifs with the highest frequency.
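Ranking motifs by their frequency difference rather than by raw frequency can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and the frequency values are invented for the example and are not taken from FIG. 3 or FIG. 4.

```python
from typing import Dict, List, Tuple


def rank_by_difference(fetal: Dict[str, float], shared: Dict[str, float],
                       top_n: int = 10) -> List[Tuple[str, float]]:
    """Rank motifs by absolute frequency difference (fetal minus shared)."""
    motifs = set(fetal) | set(shared)
    diffs = {m: fetal.get(m, 0.0) - shared.get(m, 0.0) for m in motifs}
    # Largest absolute difference first; ties broken alphabetically.
    ranked = sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top_n]


# Hypothetical frequencies, for illustration only.
fetal_freqs = {"CCCA": 0.0200, "CCAG": 0.0160, "AAAA": 0.0040}
shared_freqs = {"CCCA": 0.0150, "CCAG": 0.0155, "AAAA": 0.0042}
for motif, diff in rank_by_difference(fetal_freqs, shared_freqs):
    print(motif, round(diff, 4))
```

In this toy input, CCAG has a higher raw frequency than AAAA in both groups, yet both rank well below CCCA once the difference between groups is the criterion.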
2. Use of entropy
Then, for each sample, the entropy of the DNA molecules with shared alleles and the entropy of the DNA molecules with fetal-specific alleles were analyzed. The former is identified as maternal and the latter as fetal. For each sample, two data points were obtained: entropy of fetal DNA molecules and entropy of shared DNA molecules (labeled "maternal").
FIG. 5A shows that the entropy of the terminal motifs in fetal DNA molecules is lower than the entropy of the terminal motifs in maternal DNA molecules (p value <0.0001), indicating that there is a higher skewness in the distribution of terminal motifs derived from fetal DNA molecules. The entropy in fig. 5A was determined using all 256 motifs for a given sample and a given pool of fetal or maternal DNA molecules, since 4-mers were used in these examples.
Similar to curve 235 of fig. 2, the difference in entropy between the two tissue types indicates that entropy can be used to determine the concentration fraction of fetal DNA in a mixture of free DNA fragments (e.g., plasma or serum). As described above, the pool identified as fetal DNA has a higher percentage (e.g., close to 100%) of fetal DNA than the maternal pool. The entropy values determined for the two pool types are different. Thus, there is a relationship between entropy and fetal DNA concentration. The relationship may be determined as a calibration function based on measured values of the fetal DNA concentration of calibration samples (calibration values) and corresponding entropy values (an example of an aggregate value of the relative frequencies), wherein a calibration value and its corresponding aggregate value may form a calibration data point. Calibration samples with different fetal DNA concentrations will have different entropy values. A calibration function can be fitted to the calibration data points such that a newly measured aggregate value (e.g., entropy) can be input into the calibration function to provide an output of fetal DNA concentration.
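Fitting and applying such a calibration function can be sketched as follows. This is illustrative only: the calibration data points are invented, a straight line is assumed as the functional form (the source does not specify one), and the function names are hypothetical. The toy data follow the direction described above for all 256 motifs, with entropy decreasing as the fetal DNA fraction increases.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matched calibration data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("calibration entropies must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical calibration data points: (entropy, measured fetal DNA fraction).
calib_entropy = [5.30, 5.28, 5.26, 5.24]
calib_fraction = [0.05, 0.10, 0.15, 0.20]
slope, intercept = fit_line(calib_entropy, calib_fraction)


def estimate_fraction(entropy: float) -> float:
    """Apply the fitted calibration function to a new entropy measurement."""
    return slope * entropy + intercept


print(round(estimate_fraction(5.27), 3))  # 0.125
```

The same motif set and motif definition would need to be used for the calibration samples and the test sample, since a different set can change or even reverse the relationship.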
FIG. 5B shows entropy when relative frequencies of 10 motifs from FIG. 4 are used. As shown, for this given set of 10 terminal motifs, the relationship changes with fetal sequences having a higher entropy. The concentration fraction of fetal DNA can still be determined, but a different calibration function will be used. Thus, the motif set used for calibration should be the same as the motif set used later, i.e., when the concentration score is measured based on entropy or other aggregate value of the relative frequencies of the set.
3. Clustering
We further performed hierarchical clustering analysis on the pregnant women, each of whom was characterized by a 256-dimensional vector containing all 4-mer end motif frequencies. Indeed, individuals characterized by terminal motifs derived from fetal-specific sequences and individuals characterized by terminal motifs derived from maternal DNA molecules clustered into two groups.
Fig. 6A and 6B illustrate hierarchical clustering analysis of fetal and maternal DNA molecules from first-trimester pregnancies according to embodiments of the present disclosure. FIG. 6A shows a hierarchical clustering analysis based on the 256 4-mer end motif frequencies. The vertical axis corresponds to the 4-mer motifs and the horizontal axis corresponds to different portions of the various samples (i.e., fetal-specific 620 (yellow) and shared 610 (blue) sequences). The color corresponds to the relative frequency of a particular 4-mer motif for a particular portion of a sample.
The different portions (fetal-specific and shared) have different fetal DNA concentrations and will therefore have different classifications for the concentration of fetal DNA. When performing such clustering using calibration samples, the fetal DNA concentration may be measured, for example, as described in the entropy section above. Each calibration sample will have a corresponding vector whose length is equal to the number of motifs used (e.g., 256 for all 4-mers, or possibly just a subset of the 4-mers, such as those having the largest difference between fetal and shared sequences; other k-mers may also be used).
FIG. 6B shows a magnified visualization for hierarchical clustering analysis based on 256 4-mer end motif frequencies. Each row represents one type of terminal motif (i.e., a different terminal motif). Each column represents a pregnant subject. Gradient color indicates the frequency of the terminal motif. Red represents the highest frequency and green represents the lowest frequency. As can be seen, the two parts (fetal and shared) representing samples with different fetal DNA concentrations were clustered cleanly into two separate clusters, showing good accuracy in being able to distinguish samples with different fetal DNA concentration levels.
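The clustering step described above can be illustrated with a minimal sketch. The disclosure does not state the distance metric or linkage rule, so Euclidean distance and average linkage are assumptions here, and the toy vectors hold 4 hypothetical motif frequencies per sample portion rather than 256.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two motif-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], vectors),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Hypothetical toy data: 4 motif frequencies per portion instead of 256.
labels = ["fetal_1", "fetal_2", "fetal_3", "shared_1", "shared_2", "shared_3"]
vectors = [
    [0.30, 0.26, 0.24, 0.20],
    [0.31, 0.25, 0.23, 0.21],
    [0.29, 0.27, 0.24, 0.20],
    [0.40, 0.30, 0.18, 0.12],
    [0.41, 0.29, 0.18, 0.12],
    [0.39, 0.31, 0.17, 0.13],
]
groups = agglomerate(vectors, n_clusters=2)
named_groups = sorted([labels[i] for i in g] for g in groups)
print(named_groups)
```

With these illustrative values the fetal-specific and shared portions fall into separate clusters, mirroring the separation described for FIG. 6B.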
4. Samples from different trimesters
In addition to being able to distinguish samples having different concentration fractions, some embodiments may distinguish samples from pregnant subjects at different gestational ages (e.g., which trimester, or simply whether or not the subject is in the third trimester).
Fig. 7A and 7B show entropy distributions across different trimesters using all motifs for pregnant women, according to embodiments of the present disclosure. Interestingly, the entropy of the terminal motifs determined using the fetal-specific fragments appeared to correlate with gestational age (p-value: 0.024, 1st-trimester data vs. pooled 2nd- and 3rd-trimester data), but the entropy of the terminal motifs from shared fragments (mainly maternal DNA) did not appear to correlate with gestational age (p-value: 1, 1st-trimester data vs. pooled 2nd- and 3rd-trimester data). Late gestation usually has a higher fetal DNA concentration. Thus, there may be a correlation between concentration and gestational age.
For fetal-specific fragments, the second and third trimesters have reduced entropy compared to the first trimester. Thus, the fetal fragments may convey the age of the fetus. Also, since the shared fragments have a substantially constant entropy (e.g., because the fragments are predominantly maternal and/or because changes in the terminal motifs associated with maternal physiology counteract the fetal signal), the entropy change of all fragments will reflect the gestational age due to changes in the fetal fragments. This relationship between entropy and trimester will show less variation because of the presence of maternal fragments, but it will still exist. However, when a fetal-specific allele can be identified (e.g., for a male fetus, by identifying alleles that occur at a percentage similar to the expected fetal DNA concentration, or by using parental genotype information), a more pronounced relationship will exist (e.g., as shown in fig. 7B).
Fig. 7C and 7D show entropy distributions across different trimesters using 10 motifs for pregnant women, according to embodiments of the present disclosure. The 10 motifs were selected by a ranking determined from the shared fragments. These figures show that, due to the specific selection of motifs, the entropy of the fetal-specific fragments still changes across trimesters, even though the relationship may be decreasing (as opposed to increasing in fig. 7B).
Fig. 8A shows the entropy of all fragments across different gestational ages, according to an embodiment of the disclosure. The entropy was determined using all 256 4-mer end motifs. The entropy of plasma DNA fragments in 3rd-trimester subjects was found to be lower (p-value: 0.06) than that in 1st- and 2nd-trimester subjects. In addition, the average value for the 2nd trimester is lower than the average value for the 1st trimester. Thus, entropy does convey gestational age when all fragments, including fetal fragments, are analyzed (as opposed to only the shared fragments in fig. 7A).
Fig. 8B shows the entropy of fragments of Y chromosome origin across different gestational ages. The entropy of Y chromosome fragments in 3rd-trimester subjects was found to be lower (p-value = 0.01) than that in 1st- and 2nd-trimester subjects. These samples, filtered for fetal molecules (using fetal-specific sequences from the Y chromosome), showed a large separation between the 3rd trimester and the 2nd trimester.
Fig. 9 and 10 show the distribution of the top 10 ranked terminal motifs between fetal and maternal DNA molecules across different trimesters according to embodiments of the present disclosure. The top 10 terminal motifs, ranked by the difference in motif frequency between fetal and maternal DNA molecules, were mined from a single deeply sequenced maternal case. These top 10 terminal motifs were then used to analyze each of the samples.
The proportion of fetal and shared DNA molecules carrying these terminal motifs of interest was calculated in separate cohorts comprising 10 pregnant women from each of the first (12-14 weeks), second (20-23 weeks) and third (38-40 weeks) trimesters. A number of terminal motifs were found at higher frequency in the fetal DNA molecules than in the shared molecules, indicating that these terminal motifs have some relationship to the tissue of origin. For example, the median of CAAA% in fetal DNA molecules was consistently higher than the median of CAAA% in shared molecules (mainly maternal) across the first (1.26% versus 1.11%), second (1.24% versus 1.11%) and third (1.24% versus 1.15%) trimesters. Thus, the terminal motif CAAA can be identified as a marker indicating an increased likelihood that a particular DNA fragment having the terminal sequence CAAA is from the fetus.
Some terminal motifs show a more pronounced relationship to gestational age. For example, fetal DNA molecules with the terminal motif CCCA show a continuous (monotonic) increase with gestational age, as do CCAG, CCTG, CCAA, CCCT, and CCAC. However, CCTT does not show a continuous increase, as its median dips for the 2nd trimester and then rises for the 3rd trimester.
In another embodiment, the top 10 ranked terminal motifs may be combined to see the difference between fetal and maternal DNA molecules across different trimesters.
Fig. 11 shows the combined frequency of the top 10 ranked motifs for fetal molecules and shared molecules across different trimesters according to an embodiment of the present disclosure. As shown in FIG. 11, the difference in the combined frequency of the top 10 ranked terminal motifs between fetal DNA molecules and maternal DNA molecules was relatively large in both the 2nd trimester (p-value: 0.013) and the 3rd trimester (p-value: 0.0019) compared to the 1st trimester (p-value: 0.92). The combined frequency for fetal molecules increases from the 1st trimester to the 2nd trimester to the 3rd trimester, while shared molecules do not show this consistent relationship. This suggests that different physiological conditions (e.g., gestational age) will affect terminal motifs derived from different source tissues.
B. Oncology
The genotype-based approach designed for the pregnancy setting can also be applied in the oncology setting.
Figure 12 shows a schematic of a genotype-based method for analyzing differential end motif patterns between mutant and shared molecules in plasma DNA of a cancer patient, according to an embodiment of the present disclosure. As shown in FIG. 12, a tumor-specific molecule 1205 carrying a tumor-specific allele (B) can be identified. On the other hand, shared molecules 1207 carrying shared alleles (A) can be determined, which will represent DNA molecules predominantly of healthy origin, since tumor DNA molecules will typically be a minority of the plasma DNA pool.
By way of example, mutant sequences (i.e., plasma DNA carrying cancer-associated mutations) and shared sequences (primarily DNA of hematopoietic origin) can be identified. A cancer-associated mutation may be defined as a variant that is present in tumor tissue (e.g., hepatocellular carcinoma, HCC) but not in normal cells (e.g., buffy coat). For example, in an HCC patient, assuming the genotype of the tumor tissue at a particular genomic locus is "AG" and the genotype of the buffy coat cells is "AA", the "G" that is specifically present in the tumor tissue will be considered a cancer-associated mutation, while the "A" will be considered a shared wild-type allele. In various embodiments, the mutant sequence can be obtained by sequencing a tissue biopsy from a tumor or by analyzing a cell-free sample (e.g., plasma or serum), for example, as described in U.S. patent publication 2014/0100121.
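The genotype rule in the "AG" versus "AA" example can be sketched as follows. This is only an illustration: the function names are hypothetical, genotypes are represented as two-letter strings, and sites where the background (buffy coat) genotype is heterozygous are simply treated as uninformative, which is an assumption of this sketch.

```python
def specific_alleles(background_genotype, tissue_genotype):
    """Alleles present in the tissue of interest but absent from a
    homozygous background genotype (e.g., buffy coat)."""
    background = set(background_genotype)
    if len(background) != 1:
        return set()  # heterozygous background: not used in this sketch
    return set(tissue_genotype) - background

def classify_fragment(allele, background_genotype, tissue_genotype):
    """Label a plasma DNA fragment by the allele it carries at the locus."""
    specific = specific_alleles(background_genotype, tissue_genotype)
    if not specific:
        return "uninformative"
    if allele in specific:
        return "mutant"
    if allele in set(background_genotype):
        return "shared"
    return "other"  # e.g., a sequencing error

# Tumor genotype "AG", buffy coat genotype "AA", as in the example above.
print(classify_fragment("G", "AA", "AG"))
print(classify_fragment("A", "AA", "AG"))
```

The same comparison applies when the tissue of interest is fetal or donor-derived, with the maternal or recipient genotype as the background.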
In HCC patients whose plasma DNA was sequenced at a depth of 220x, the frequency distribution of terminal motifs between the mutant and shared sequences was determined. Bar graph 1220 provides the relative frequency (%) of the appearance of each 4-mer as a terminal motif of the mutant and shared sequences. Such relative frequencies may be determined as described above for bar graph 220 of fig. 2. As one can see, terminal motif 1222 has significant differences in relative frequency between DNA fragments of different tissue types. Such differences can be used for various purposes, for example to enrich for tumor DNA in a sample or to determine the concentration of tumor DNA.
In another embodiment, to capture the landscape differences in terminal motifs between tumor DNA molecules and shared DNA molecules, entropy-based analysis 1230 can be used, similar to fig. 2. Curve 1235 shows the entropy values of the shared sequences and the tumor sequences. The difference in entropy or another measure of variance may provide a tumor DNA concentration fraction, for example, using a calibration function.
In another embodiment, similar to the fetal analysis in fig. 2, a cluster-based analysis 1240 may be performed. A classification for the amount of tumor sequences in a sample may be determined based on new samples belonging to a reference cluster for which the classification of tumor scores is known.
1. Rank order differences in relative frequency
Figure 13 shows a landscape of plasma DNA terminal motifs for cancer-associated mutant molecules and shared molecules in hepatocellular carcinoma according to embodiments of the present disclosure. A number of terminal motifs were observed to differ in frequency between mutant and shared sequences, such as, but not limited to, the CCCA, CCAG, CCAA, CCTG, CCTT, CCCT, CAAA, CCAT, TAAA, and AAAA motifs. FIG. 13 shows information similar to FIG. 3, but the clinically relevant DNA is tumor DNA rather than fetal DNA.
Figure 14 shows a radial landscape of plasma DNA terminal motifs of cancer-associated mutant and shared molecules in hepatocellular carcinoma, in accordance with embodiments of the present disclosure. The different terminal motifs are listed on the periphery and the frequencies of the terminal motifs are shown at different radial lengths. The terminal motifs are sorted by the frequency of the wild-type (wt) allele of non-tumor (e.g., healthy) cells. Frequency values 1410 correspond to the wt allele, while frequency values 1420 correspond to the mutation (mut) allele. This radial view shows that there is a significant difference in the relative frequency of the terminal motifs of the mutated sequences compared to the wild-type (shared) sequence.
Figure 15A shows the top 10 terminal motifs in the ranked difference in terminal motif frequency between mutant and shared sequences in plasma DNA of HCC patients, according to an embodiment of the present disclosure. The top-ranked terminal motifs of the shared sequences in the reference sample were determined. As shown, the top-ranked terminal motifs are CCCA, CCAG, CCAA, CCTG, CCTT, CCCT, CAAA, CCAT, TAAA, and AAAA. The difference in relative frequency varies among the terminal motifs. For example, the motif showing the greatest difference between the mutant and shared sequences (CCCA) had frequencies of 1.9% and 1.6%, indicating a reduction of about 15% for this motif in the mutant sequences relative to the shared sequences (mainly wild-type sequences of blood cell origin).
Figure 15B shows the combined frequency of 8 terminal motifs for HCC patients and pregnant women, according to embodiments of the present disclosure. The combined frequency is an exemplary aggregate value, e.g., as the sum of the relative frequencies of the set of terminal motifs. As can be seen, there is a separation in the combined frequency of the two types of sequences in each of the two cases: between Wild Type (WT) and mutant, and between maternal and fetal sequences. The separation in the combined frequency between Wild Type (WT) and mutant is greater than the separation between maternal and fetal sequences.
The combined frequency showed similar behavior to the entropy curve of the fetal analysis. Thus, fig. 15B shows another example of a sum of relative frequencies that can be used to determine concentration fractions of clinically relevant DNA. Also, the wt versus mutant relationship in fig. 15B shows that concentration fractions of other clinically relevant DNA (e.g., tumor DNA) can also be determined.
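As a minimal sketch of the combined frequency described above, the relative frequencies of a chosen motif set can simply be summed for each class of sequences. The motif subset is taken from the motifs named in this section; the frequency values are hypothetical and chosen only for illustration.

```python
def combined_frequency(relative_freqs, motif_set):
    """Sum the relative frequencies (in %) of the selected terminal motifs."""
    return sum(relative_freqs.get(motif, 0.0) for motif in motif_set)

motif_set = ["CCCA", "CCAG", "CCAA", "CCTG"]

# Hypothetical relative frequencies (%), not values from the disclosure.
shared = {"CCCA": 2.0, "CCAG": 1.7, "CCAA": 1.5, "CCTG": 1.4, "AAAA": 0.6}
mutant = {"CCCA": 1.7, "CCAG": 1.5, "CCAA": 1.4, "CCTG": 1.3, "AAAA": 0.8}

shared_total = combined_frequency(shared, motif_set)
mutant_total = combined_frequency(mutant, motif_set)
print(round(shared_total, 2), round(mutant_total, 2))
```

The separation between the two totals is the quantity that a calibration function would relate to the concentration fraction.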
2. Use of entropy
Fig. 16A and 16B illustrate entropy values of shared and mutant fragments for different sets of terminal motifs in HCC cases, according to embodiments of the present disclosure. As with fetal sequences, the relationship between the entropy of the two types of sequences may vary depending on the set of terminal motifs used. FIG. 16A uses all 256 4-mer terminal motifs. The entropy of the mutant fragments is higher because their frequency distribution is more uniform (e.g., flatter). Likewise, the entropy of the shared fragments is lower because their frequency distribution is more skewed.
Figure 16B uses the top 10 4-mer terminal motifs of the shared fragments that appear in HCC subjects. For the top 10 motifs, the relationship of the entropy is reversed. Fig. 16A and 16B show that the calibration analysis for determining fetal DNA concentration can also be used to determine tumor DNA concentration.
As described above, higher entropy values indicate higher diversity in the terminal motifs. A motif diversity score (MDS) can be used to estimate the concentration fraction of clinically relevant DNA (e.g., fetal, transplant, or tumor DNA) in a biological sample containing circulating free DNA.
Figure 17 is a graph of motif diversity score versus measured circulating tumor DNA fraction according to embodiments of the present disclosure. For each of a plurality of calibration samples, a calibration data point 1705 is measured. Each calibration data point includes the motif diversity score of the sample and the concentration fraction of clinically relevant DNA, which in this case is the tumor DNA fraction. The tumor DNA fraction was estimated using the software package ichorCNA, which measures the tumor DNA fraction in plasma DNA by exploiting cancer-associated copy number aberrations (Adalsteinsson et al., 2017).
A given sample may be a healthy control sample without tumor DNA, or a sample from a patient with a tumor, where the tumor DNA fraction is non-zero, i.e., tumor DNA and other (e.g., healthy) DNA are present. MDS values of plasma DNA of patients with HCC were found to correlate positively with the tumor DNA fraction (Spearman ρ: 0.597; p-value: 0.0002). This is shown using a calibration function 1710 (a linear function in this example).
Calibration function 1710 can be used to determine the tumor DNA fraction in a new test sample whose motif diversity score has been measured. The calibration function 1710 may be determined by functionally fitting the calibration data points 1705, for example using regression.
In some examples, a calculated MDS value X for a new sample may be used as the input to a function F(X), where F is the calibration function (curve). The output of F(X) is the concentration fraction. An error range may be provided, which may differ for each value of X, so that a range of values is provided as the output of F(X). In other examples, the concentration fraction corresponding to a measured MDS of 0.95 in the new sample may be determined as the average concentration calculated from the calibration data points at an MDS of 0.95. As another example, the calibration data points 1705 can be used to provide a range of DNA concentration fractions for a particular calibration value, where the range can be used to determine whether the concentration fraction is above a threshold amount.
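One way to picture the calibration function F(X) is an ordinary least-squares line fitted to the calibration data points and then evaluated at the MDS of a new sample. The sketch below is an assumption-laden illustration: the calibration points are hypothetical, the helper names are invented for this example, and a linear fit is only one of the regression choices the text allows.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical calibration data points: (MDS, measured tumor DNA fraction in %).
calibration_points = [
    (0.940, 0.0), (0.945, 5.0), (0.950, 10.0), (0.955, 15.0), (0.960, 20.0),
]
mds_values = [point[0] for point in calibration_points]
fractions = [point[1] for point in calibration_points]
slope, intercept = fit_line(mds_values, fractions)

def calibration_function(mds):
    """F(X): map a motif diversity score to an estimated fraction (%)."""
    return slope * mds + intercept

estimate = calibration_function(0.952)
print(round(estimate, 2))
```

An error range could be attached to each output, for example from the residuals of the fit, but that is not shown here.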
C. Transplantation
Genotype-based techniques may also be applied to monitor transplantation, such as liver transplantation. SNP sites where the recipient is homozygous and the donor is heterozygous allow the determination of donor-specific DNA molecules and of shared molecules (predominantly hematopoietic DNA) in the plasma of the transplant patient.
Fig. 18A illustrates an entropy analysis using donor-specific fragments, according to embodiments of the present disclosure. Figure 18B shows hierarchical clustering analysis using donor-specific fragments. As shown in fig. 18A and 18B, in the case of liver transplantation, it was observed that the liver-specific DNA molecules have different characteristics from the shared sequence (mainly, blood-derived DNA). The entropy of the plasma DNA terminal motifs was generally found to be lower in donor-specific DNA molecules (liver DNA) compared to shared sequences (fig. 18A). Individuals characterized by terminal motifs derived from liver-specific DNA molecules are clustered together, while individuals characterized by terminal motifs derived from shared DNA molecules are clustered into another group.
D. Classifying concentration scores
As described above, the relative frequency of the collection of one or more terminal motifs can be used to determine the classification of the concentration score of clinically relevant DNA.
Fig. 19 is a flow chart illustrating a method 1900 of estimating a concentration fraction of clinically relevant DNA in a biological sample of a subject according to an embodiment of the disclosure. Biological samples may include clinically relevant DNA and other free DNA. In other examples, the biological sample may not contain clinically relevant DNA, and the estimated concentration score may indicate zero or a low percentage of clinically relevant DNA. Method 1900 and aspects of any other methods described herein can be performed by a computer system.
At block 1910, a plurality of free DNA fragments from a biological sample are analyzed to obtain sequence reads. The sequence reads can include terminal sequences corresponding to the ends of the plurality of free DNA fragments. By way of example, sequence reads may be obtained using sequencing or probe-based techniques, both of which may include enrichment via amplification or capture probes, for example.
Sequencing can be performed in a variety of ways, such as using massively parallel sequencing or next generation sequencing, using single molecule sequencing, and/or using double-stranded or single-stranded DNA sequencing library preparation protocols. One of skill in the art will appreciate the various sequencing techniques that can be used. As part of sequencing, some of the sequence reads may correspond to cellular nucleic acids.
The sequencing may be targeted sequencing as described herein. For example, a biological sample can be enriched for DNA fragments from a particular region. Enrichment may include the use of capture probes that bind to a portion of a genome or the entire genome, for example, as defined by a reference genome.
Statistically significant numbers of free DNA molecules can be analyzed to provide accurate determination of concentration fractions. In some embodiments, at least 1,000 free DNA molecules are analyzed. In other embodiments, at least 10,000, or 50,000, or 100,000, or 500,000, or 1,000,000, or 5,000,000 free DNA molecules or more may be analyzed.
At block 1920, for each free DNA fragment of the plurality of free DNA fragments, a sequence motif is determined for each of the one or more terminal sequences of the free DNA fragment. A sequence motif can include N base positions (e.g., 1, 2, 3, 4, 5, 6, etc.). By way of example, the sequence motif can be determined by: analyzing the portions of the sequence reads that correspond to the ends of the DNA fragments, correlating signals to specific motifs (e.g., when probes are used), and/or aligning the sequence reads to a reference genome, e.g., as described in fig. 1.
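The motif determination at block 1920 can be sketched for the case where the full fragment sequence is known. The sketch assumes, as a convention not spelled out in this section, that the motif is the first N bases at the 5' end of each strand, so the motif for the second strand is the reverse complement of the last N bases of the first. The fragment sequence is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def end_motifs(fragment, n=4):
    """Return the N-base motifs at the 5' ends of both strands (assumed convention)."""
    if len(fragment) < n:
        raise ValueError("fragment is shorter than the motif length")
    first_strand_end = fragment[:n]
    second_strand_end = reverse_complement(fragment[-n:])
    return first_strand_end, second_strand_end

fragment = "CCCAGTTTACGGATCCTTGG"  # hypothetical fragment sequence
motifs = end_motifs(fragment, n=4)
print(motifs)
```

When only paired-end reads are available, the first N bases of each read play the same role, since each read starts at a fragment end.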
For example, after sequencing by the sequencing apparatus, the sequence reads may be received by a computer system, which may be communicatively coupled with the sequencing device performing the sequencing, such as by wired or wireless communication or by a removable memory device. In some embodiments, one or more sequence reads comprising both ends of a nucleic acid fragment can be received. The location of the DNA molecule can be determined by mapping (aligning) one or more sequence reads of the DNA molecule to a corresponding portion of the human genome, e.g., a particular region. In other embodiments, a particular probe (e.g., after PCR or other amplification) may indicate a position or a particular terminal motif, e.g., by a particular fluorescent color. The identification may be that the free DNA molecule corresponds to one of a collection of sequence motifs.
At block 1930, the relative frequencies of a set of one or more sequence motifs corresponding to the terminal sequences of a plurality of free DNA fragments are determined. The relative frequency of the sequence motif can provide a proportion of the plurality of free DNA fragments having terminal sequences corresponding to the sequence motif. A reference set of one or more reference samples can be used to identify a set of one or more sequence motifs. For the reference sample, the concentration fraction of clinically relevant DNA need not be known, but genotype differences can be determined such that differences between the clinically relevant DNA and the terminal motifs of other DNA (e.g., healthy DNA, maternal DNA, or DNA of a subject receiving a transplanted organ) can be identified. The particular terminal motif may be selected based on the difference (e.g., to select the terminal motif with the highest absolute or percent difference). Examples of relative frequencies are described throughout this disclosure.
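The relative frequency computation at block 1930 amounts to counting how often each motif occurs among the fragment ends and dividing by the total. The following minimal sketch assumes the terminal sequences are already available as strings read from the fragment end inward; the input list is hypothetical, and ends containing ambiguous bases are skipped as a design choice of this sketch.

```python
from collections import Counter
from itertools import product

VALID_BASES = set("ACGT")

def relative_frequencies(end_sequences, n=4):
    """Proportion of fragment ends carrying each of the 4**n possible motifs."""
    counts = Counter(
        seq[:n] for seq in end_sequences
        if len(seq) >= n and set(seq[:n]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable terminal sequences")
    all_motifs = ("".join(bases) for bases in product("ACGT", repeat=n))
    return {motif: counts.get(motif, 0) / total for motif in all_motifs}

# Hypothetical terminal sequences; the one containing N is ignored.
ends = ["CCCAGT", "CCCATT", "CCAGAA", "AAAATT", "CCNATT"]
freqs = relative_frequencies(ends, n=4)
print(freqs["CCCA"], freqs["CCAG"], freqs["AAAA"])
```

Returning an entry for every possible motif, including those with zero count, keeps the resulting vector the same length for every sample, which is convenient for clustering or entropy calculations.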
In some implementations, the sequence motif includes N base positions, wherein the set of one or more sequence motifs includes all combinations of N bases. In some examples, N may be an integer equal to or greater than two or three. The set of one or more sequence motifs may be the top M (e.g., top 10) most frequent sequence motifs that occur in one or more calibration samples or other reference samples that are not used to calibrate concentration scores.
At block 1940, an aggregate value of the relative frequencies of the set of one or more sequence motifs is determined. Exemplary aggregate values are described throughout the disclosure, for example, including entropy values (motif diversity scores), sums of relative frequencies, and multidimensional data points corresponding to a vector of counts for a set of motifs (e.g., a vector of 256 counts for the 256 possible 4-mer motifs or 64 counts for the 64 possible 3-mer motifs).
By way of example, when a collection of one or more sequence motifs comprises a plurality of sequence motifs, the aggregate value may comprise the sum of the relative frequencies of the collection. As another example, the aggregate value may correspond to a variance of the relative frequency. For example, the aggregate value may include an entropy term. The entropy terms may include the sum of terms, each term comprising the relative frequency multiplied by the logarithm of the relative frequency. As another example, the aggregate value may include a final or intermediate output of a machine learning model (e.g., a clustering model).
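The entropy term described above, a sum of terms each multiplying a relative frequency by its logarithm, can be sketched as below. Dividing by the logarithm of the number of motifs so that the score lies between 0 and 1 is an assumption of this sketch rather than something stated in this passage, and the function name is hypothetical.

```python
import math

def motif_diversity_score(relative_freqs):
    """Entropy of the relative frequencies, normalized to [0, 1] (assumed scaling)."""
    n_motifs = len(relative_freqs)
    if n_motifs < 2:
        raise ValueError("at least two motifs are needed")
    # Motifs with zero frequency contribute nothing to the sum.
    entropy = -sum(p * math.log(p) for p in relative_freqs if p > 0)
    return entropy / math.log(n_motifs)

uniform = [1 / 256] * 256                  # every 4-mer equally frequent
single = [1.0] + [0.0] * 255               # a single 4-mer at every end
skewed = [0.5] + [0.5 / 255] * 255         # one dominant 4-mer

print(round(motif_diversity_score(uniform), 3))
print(round(motif_diversity_score(skewed), 3))
```

A flatter distribution gives a higher score and a more skewed distribution gives a lower one, which matches the interpretation of entropy used throughout this section.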
At block 1950, a classification of the concentration fraction of clinically relevant DNA in the biological sample is determined by comparing the aggregate value to one or more calibration values. The one or more calibration values may be determined from one or more calibration samples for which the concentration fraction of clinically relevant DNA is known (e.g., measured). The comparison may be a comparison with a plurality of calibration values. The comparison may be performed by inputting the aggregate value into a calibration function fitted to the calibration data, the calibration function providing the change in the aggregate value relative to the change in the concentration fraction of clinically relevant DNA in the sample. As another example, the one or more calibration values can correspond to one or more aggregate values of the relative frequencies of the set of one or more sequence motifs measured using the free DNA fragments in the one or more calibration samples.
A calibration value may be calculated as the aggregate value for each calibration sample. A calibration data point for each sample can be determined, where the calibration data point includes the calibration value for the sample and the measured concentration fraction. These calibration data points may be used in method 1900 or may be used to determine final calibration data points (e.g., as defined by a function fit). For example, a linear function may be fitted to the calibration values as a function of the concentration fraction. The linear function may define the calibration data points to be used in method 1900. As part of the comparison, the new aggregate value for the new sample may be used as an input to the function to provide an output concentration fraction. Thus, the one or more calibration values may be a plurality of calibration values of a calibration function determined using the concentration fractions of clinically relevant DNA of a plurality of calibration samples.
As another example, the new aggregate value may be compared to the average aggregate value of samples having the same concentration fraction classification (e.g., within the same range). If the new aggregate value is closer to this average than to the calibration value of any other classification, the new sample may be determined to have the same classification as the closest calibration value. Such techniques may be used when performing clustering. For example, the calibration value may be a representative value of a cluster corresponding to a particular classification of the concentration fraction.
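A minimal sketch of this nearest-average comparison follows. The classification labels, the calibration aggregate values, and the function name are all hypothetical; the representative value of each classification is taken to be the mean of its calibration samples, which is one of several possible choices.

```python
from statistics import mean

def classify_by_nearest_mean(new_value, calibration):
    """Return the classification whose mean aggregate value is closest.

    calibration maps a classification label to the aggregate values
    measured in the calibration samples of that classification.
    """
    class_means = {label: mean(values) for label, values in calibration.items()}
    return min(class_means, key=lambda label: abs(new_value - class_means[label]))

# Hypothetical aggregate values (e.g., MDS) grouped by concentration range.
calibration = {
    "below 5%": [0.940, 0.942, 0.941],
    "5% to 15%": [0.950, 0.951, 0.949],
    "above 15%": [0.960, 0.958, 0.962],
}
classification = classify_by_nearest_mean(0.952, calibration)
print(classification)
```

When clustering is used, the class means here would be replaced by the representative values (e.g., centroids) of the reference clusters.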
The determination of the calibration data points may include, for example, measuring concentration fractions as follows. For each of the one or more calibration samples, the concentration fraction of clinically relevant DNA in the calibration sample may be measured. An aggregate value of the relative frequencies of the set of one or more sequence motifs can be determined by analyzing free DNA fragments from the calibration sample as part of obtaining the calibration data point, thereby determining one or more aggregate values. Each calibration data point may specify the concentration fraction of clinically relevant DNA measured in the calibration sample and the aggregate value determined for the calibration sample. The one or more calibration values may be the one or more aggregate values or may be determined using the one or more aggregate values (e.g., when using a calibration function). Measurement of the concentration fraction can be performed in various ways as described herein, for example by using alleles specific to the clinically relevant DNA.
In various embodiments, the concentration fraction of clinically relevant DNA can be measured using tissue-specific alleles or epigenetic markers, or using the size of the DNA fragments, for example as described in U.S. patent publication 2013/0237431, which is incorporated by reference in its entirety. The tissue-specific epigenetic marker may comprise a DNA sequence in the sample that exhibits a tissue-specific DNA methylation pattern.
In various embodiments, the clinically relevant DNA may be selected from the group consisting of: fetal DNA, tumor DNA, DNA from a transplanted organ, and a particular tissue type (e.g., from a particular organ). The clinically relevant DNA may be of a particular tissue type, for example, of the liver or hematopoietic system. When the subject is a pregnant female, the clinically relevant DNA may be placental tissue, which corresponds to fetal DNA. As another example, the clinically relevant DNA may be tumor DNA derived from an organ with cancer.
In general, it is preferred to use assays similar to those used to measure concentration fractions of biological (test) samples to generate one or more calibration values determined from one or more calibration samples. For example, a sequencing library can be generated in the same manner. Two exemplary processing techniques are GeneRead (www.qiagen.com/us/shop/sequencing/genetic-size-selection-kit/# ordering information) and SPRI (solid phase reversible immobilization, AMPure magnetic beads, www.beckman.hk/reagents _ depr/genetic _ depr/clean-and-size-selection/pcr). GeneRead can remove short DNA, mainly tumor fragments, which can affect the relative frequency of wild-type and mutant fragments as well as terminal motifs in fetal and transplant cases.
E. Determining gestational age
As described above in fig. 7A, 7B, and 8-10, fetal-specific fragment motifs can be used to infer gestational age.
Fig. 20 is a flow chart illustrating a method 2000 of determining the gestational age of a fetus by analyzing a biological sample from a female subject pregnant with the fetus, according to an embodiment of the present disclosure. The biological sample includes free DNA fragments from a female subject and a fetus.
At block 2010, a plurality of free DNA fragments from a biological sample are analyzed to obtain sequence reads. The sequence reads can include terminal sequences corresponding to the ends of the plurality of free DNA fragments. Block 2010 may be performed in a similar manner as block 1910 of fig. 19.
Prior to, after, or as part of the analysis, a plurality of free DNA fragments may be identified as being fetal-derived, for example as described above with respect to fig. 2 and 5A. This can filter DNA fragments that are fetal or most likely to be fetal. As an example, multiple episomal DNA fragments can be identified using a fetal-specific allele or a fetal-specific epigenetic marker. As another example, for each of the sequence reads, a likelihood that the sequence read corresponds to a fetus can be determined based on an end sequence of the sequence read that includes one of the set of one or more sequence motifs. Other criteria may also be used, for example as described in section ii.e. The likelihood can be compared to a threshold and when the likelihood exceeds the threshold, the sequence reads can be identified as originating from the fetus. More detailed information on clinically relevant DNA in enriched samples can be found in section IV.
At block 2020, for each of the plurality of episomal DNA fragments, a sequence motif is determined for each of the one or more terminal sequences of the episomal DNA fragment. Block 2020 may be performed in a similar manner as block 2020 of fig. 19.
At block 2030, the relative frequencies of a set of one or more sequence motifs corresponding to the terminal sequences of the plurality of free DNA fragments are determined. The relative frequency of the sequence motif can provide a proportion of the plurality of free DNA fragments having terminal sequences corresponding to the sequence motif. Block 1930 may be performed in a similar manner as block 2030 of fig. 19.
At block 2040, a total value of the relative frequencies of the set of one or more sequence motifs is determined. Block 1940 may be performed in a similar manner as block 2040 of fig. 19.
At block 2050, one or more calibration data points are obtained. Each calibration data point may specify a gestational age (e.g., a trimester, as described above in the figures) corresponding to the aggregate value. As described above, the one or more calibration data points may be determined from a plurality of calibration samples having a known gestational age and comprising free DNA molecules. In some embodiments, the one or more calibration data points can be a plurality of calibration data points that form a calibration function that approximates measured aggregate values determined from the free DNA molecules in the plurality of calibration samples with known gestational age.
At block 2060, the aggregate value is compared to a calibration value of at least one calibration data point. For example, the new aggregate value for the new sample may be compared to the average value for the third trimester as determined in fig. 8A. As another example, the calibration value of the at least one calibration data point may correspond to an aggregate value measured using free DNA molecules in at least one of the plurality of calibration samples. The aggregate value may be compared with a plurality of calibration values, e.g., each calibration value corresponding to one of a plurality of calibration samples. The comparison may be made by inputting the aggregate value into a function (calibration function) fitted to the calibration data, the calibration function providing the change in aggregate value with respect to gestational age. The comparison may be performed in a similar manner as described for method 1900 (e.g., with respect to block 1950).
At block 2070, the gestational age of the fetus is estimated based on the comparison. For example, if the new aggregate value is closest to the average value for the third trimester (or other calibration value used), then it may be determined that the new sample is in the third trimester. As another example, the new aggregate value may be compared to a calibration function (e.g., a linear function) fitted to the data in fig. 8A or other similar graphs. The function may output the gestational age, e.g., as the Y value of a linear function. Other examples provided herein using calibration functions may also be used in the context of determining gestational age.
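The two comparisons of blocks 2060 and 2070 (nearest calibration value and a fitted linear calibration function) can be sketched as follows. This is only an illustration: the calibration values, the gestational ages in weeks, and the function names are hypothetical, and an ordinary least-squares line is assumed as the calibration function.

```python
def fit_calibration_line(aggregate_values, gestational_ages):
    """Least-squares line mapping an aggregate value to a gestational age."""
    n = len(aggregate_values)
    mean_x = sum(aggregate_values) / n
    mean_y = sum(gestational_ages) / n
    sxx = sum((x - mean_x) ** 2 for x in aggregate_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(aggregate_values, gestational_ages))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_gestational_age(aggregate_value, slope, intercept):
    """Evaluate the fitted calibration function (the Y value of the line)."""
    return slope * aggregate_value + intercept


def nearest_trimester(aggregate_value, trimester_means):
    """Pick the trimester whose calibration value is closest to the new value."""
    return min(trimester_means,
               key=lambda t: abs(trimester_means[t] - aggregate_value))


# Hypothetical calibration samples: aggregate value and known age in weeks.
calibration_values = [0.10, 0.12, 0.14, 0.16]
calibration_weeks = [12, 20, 28, 36]
slope, intercept = fit_calibration_line(calibration_values, calibration_weeks)
estimated_weeks = estimate_gestational_age(0.13, slope, intercept)

# Hypothetical per-trimester average aggregate values.
trimester = nearest_trimester(0.155, {"first": 0.10, "second": 0.13, "third": 0.16})
```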
Phenotypic methods
Using genotype-based analysis of pregnant subjects, cancer subjects, and liver transplant subjects, the profile of plasma DNA terminal motifs was found to correlate with the tissue of origin. We believe the reason is that, in cancer patients, tumor DNA is released into the blood circulation, altering the normal landscape of the plasma DNA terminal motifs. However, we do not exclude the possibility that other aspects of cancer pathobiology, such as the tumor microenvironment (infiltrating T cells, B cells, neutrophils, etc.), generate different terminal motifs, thereby affecting the terminal motif landscape. Thus, analysis of plasma DNA terminal motifs in cancer subjects and non-cancer control subjects can be used to distinguish HCC subjects from control subjects.
Figure 21 shows a schematic of a phenotypic method for plasma DNA end motif analysis, according to embodiments of the present disclosure. Fig. 21 has similarities to fig. 2 and 12, for example, relative frequencies can be plotted, variance values (e.g., entropy) can be determined, and clustering can be performed.
In fig. 21, terminal motifs (e.g., 4-mers) deduced from plasma DNA molecules were used and compared between cancer subjects and control subjects, thereby eliminating the restriction of genotypic markers and making them widely applicable for many clinical situations, such as detection of autoimmune diseases (e.g., systemic lupus erythematosus, SLE) and transplantation. Using the phenotypic approach and using all sequenced plasma DNA fragments, entropy and cluster analysis can be performed following an analysis procedure very similar to the genotype difference based approach. In this case, the entropy analysis and cluster analysis will be compared between control subjects and diseased subjects.
The diseased molecules 2105 are from one or more subjects determined to have the disease. The control molecules 2107 are from one or more subjects who do not have the disease. The relative frequencies of the sets of terminal sequences of the two pools of molecules are determined. Bar graph 2120 provides the relative frequency (%) of the appearance of each 4-mer as the terminal motif of the control and diseased sequences. Such relative frequencies may be determined as described above for bar graph 220 of fig. 2. As one can see, terminal motif 2122 has significant differences in relative frequency between DNA fragments of different tissue types. This difference can be used for various purposes, for example to classify a new sample as diseased or not, or as having some other level of disease.
To capture the landscape differences in the terminal motifs between tumor DNA molecules and shared DNA molecules, entropy-based analysis 2130 can be used, similar to fig. 2. Curve 2135 shows entropy values for control subjects and diseased subjects. Differences in entropy or other measures of variance can provide a classification of the level of pathology associated with a disease.
In another embodiment, similar to the fetal analysis in fig. 2 and the tumor analysis in fig. 12, a cluster-based analysis 2140 may be performed. The classification of the level of pathology may be determined based on new samples belonging to a reference cluster for which the classification is known.
Thus, in one example of an aggregate value of relative frequencies, each individual may be characterized by a vector of 256 frequencies (i.e., a 256-dimensional vector) for 4-mer end motifs. In other examples, the standard deviation (SD), coefficient of variation (CV), interquartile range (IQR), or a certain percentile cut-off across different motif frequencies (e.g., 95th or 99th percentile) may be used to assess the landscape change of terminal motif patterns between the disease group and the control group. Other examples of the aggregate value are provided in other sections and are applicable here.
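As an illustration of such dispersion-based aggregate values, the following sketch computes the SD, CV, IQR, and 95th percentile of a vector of motif frequencies. The choice of the population SD and of the "inclusive" quantile method is an assumption made for the example, and the four-value input is hypothetical (a real 4-mer profile would have 256 values).

```python
import statistics


def dispersion_metrics(frequencies):
    """Summarize the spread of motif frequencies for one sample."""
    values = list(frequencies)
    mean = statistics.fmean(values)
    # Assumption: population SD over the motif frequencies of one sample.
    sd = statistics.pstdev(values)
    quartiles = statistics.quantiles(values, n=4, method="inclusive")
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "sd": sd,
        "cv": sd / mean,
        "iqr": quartiles[2] - quartiles[0],
        "p95": percentiles[94],  # 95th percentile cut-off
    }


# Hypothetical frequencies, kept short for readability.
metrics = dispersion_metrics([0.1, 0.2, 0.3, 0.4])
```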
A. Oncology
In some embodiments, the disease (pathology) may be cancer. Thus, some embodiments may classify the level of cancer.
1. Rank order differences in relative frequency
Fig. 22 shows an example of a frequency distribution of a 4-mer terminal motif between hepatocellular carcinoma (HCC) subjects and Hepatitis B Virus (HBV) subjects using all plasma DNA molecules, according to an embodiment of the present disclosure. FIG. 22 compares the frequency of 256 terminal motifs in HCC patients and one HBV patient. As with similar curves, the vertical axis is the motif frequency and the horizontal axis corresponds to the corresponding terminal motif. In fig. 22, we rank motifs in ascending order based on the mean of their frequency in non-HCC subjects. The bottom curve is continuous with the top curve, but is drawn at a different scale for ease of illustration.
There are many terminal motifs that exhibit aberrations in HCC patients. For example, the top 10 terminal motifs (TGGG, TAAA, AAAA, GAAA, GGAG, TAGA, GCAG, TGGT, GCTG and GAGA) that show an increased frequency in HCC patients compared to HBV subjects have an average 1.22-fold change, with a range of 1.12-1.35-fold change; and the top 10 terminal motifs (CCCA, CCAG, CCAA, CCCT, CCTG, CCAC, CCAT, CCCC, CCTC, and CCTT) that show a decreased frequency in HCC patients have an average 1.23-fold change, with a range of 1.16-1.29-fold change. Such a collection of top-ranked motifs, which show an increase in frequency (or a decrease, as a separate collection) in the HCC group relative to the non-cancer group, can be used to classify new subjects with respect to cancer. As another example, the ranking process may select all those motifs that show elevation in HCC, and then rank those motifs in descending order according to the AUC between HCC subjects and non-HCC subjects. The top 10 motifs are then selected based on AUC values.
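The fold-change ranking described above can be sketched as follows. This is a simplified illustration, assuming that each sample is a dictionary of motif frequencies and that the fold change is the ratio of group means; the function names and the three-motif example values are hypothetical.

```python
def mean_frequency(samples, motif):
    """Mean frequency of one motif across a group of samples."""
    return sum(s[motif] for s in samples) / len(samples)


def rank_by_fold_change(case_samples, control_samples, top_n=10):
    """Return the motifs most increased and most decreased in the case group."""
    fold = {}
    for motif in control_samples[0]:
        control_mean = mean_frequency(control_samples, motif)
        if control_mean > 0:
            # Fold change expressed as case mean divided by control mean.
            fold[motif] = mean_frequency(case_samples, motif) / control_mean
    ordered = sorted(fold, key=fold.get, reverse=True)
    increased = [m for m in ordered if fold[m] > 1][:top_n]
    decreased = [m for m in reversed(ordered) if fold[m] < 1][:top_n]
    return increased, decreased, fold


# Hypothetical frequencies for one HCC sample and one control sample.
hcc_samples = [{"CCCA": 0.010, "TGGG": 0.006, "AAAA": 0.004}]
control_samples = [{"CCCA": 0.0125, "TGGG": 0.005, "AAAA": 0.004}]
increased, decreased, fold = rank_by_fold_change(hcc_samples, control_samples)
```

For a decreased motif, the reciprocal of the ratio gives the fold decrease; in this example the ratio of 0.8 corresponds to a 1.25-fold decrease.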
To test the diagnostic potential of the plasma DNA end motifs, we sequenced 20 healthy control subjects (control), 22 chronic hepatitis B carriers (HBV), 12 cirrhosis subjects (Cirr), 24 early HCC (eHCC), 11 intermediate HCC (iHCC), and 7 advanced HCC (aHCC) subjects, with a median of paired reads of 2.15 hundred million (range: 0.97-0.1681 million).
Figure 23A shows a box plot of the combined frequency of the top 10 plasma DNA 4-mer terminal motifs for various subjects with different cancer levels, according to an embodiment of the present disclosure. The top 10 ranked plasma DNA 4-mer end motifs were selected based on the data in fig. 22, i.e., based on frequency in HBV subjects. The combined frequency is the sum of the frequencies of the 10 terminal motifs for a given subject. We found that the combined frequency of the top 10 ranked terminal motifs was significantly reduced in HCC patients compared to non-cancer subjects (p-value < 0.0001). Importantly, using this terminal motif assay, 58.3% of patients with eHCC can be identified at a specificity of 95%. In addition, different stages of cancer can be detected. For example, the value for advanced HCC is significantly lower than that for eHCC and iHCC.
Figure 23B shows Receiver Operating Characteristic (ROC) curves for combined frequencies of the first 10 plasma DNA 4-mer end motifs between HCC subjects and non-cancer subjects, according to embodiments of the present disclosure. The area under the curve (AUC) of the ROC curve was found to be 0.91, indicating that the plasma DNA end motif does have clinical potential to distinguish HCC from non-cancer subjects. In another embodiment, the combined frequency of seven terminal motifs with the greatest separation between HCC and non-HCC subjects provides an AUC of 0.92.
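For reference, an AUC and a sensitivity at a fixed specificity can be computed from per-subject scores as in the following sketch. It assumes that a higher score indicates cancer, so a score that decreases in HCC (such as the combined frequency above) would be negated first; the function names and the example scores are hypothetical.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def cutoff_at_specificity(control_scores, specificity=0.95):
    """Cutoff such that at most (1 - specificity) of controls score above it."""
    ordered = sorted(control_scores, reverse=True)
    allowed = int((1 - specificity) * len(ordered) + 1e-9)
    allowed = min(allowed, len(ordered) - 1)
    return ordered[allowed]


def sensitivity(case_scores, cutoff):
    """Fraction of cases called positive, i.e., scoring above the cutoff."""
    return sum(score > cutoff for score in case_scores) / len(case_scores)


# Hypothetical scores.
auc = roc_auc([3, 4], [1, 2, 3])
controls = list(range(1, 21))
cutoff = cutoff_at_specificity(controls, specificity=0.95)
sens = sensitivity([18, 19.5, 25, 30], cutoff)
```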
Fig. 24A shows a boxplot of the frequency of the CCA motif across different groups, according to an embodiment of the present disclosure. The frequency of the most frequent 3-mer motif in the non-HCC group (CCA) was significantly lower in the HCC group (p-value < 0.0001). Fig. 24B shows a ROC curve between the non-HCC group and the HCC group using the most frequent 3-mer motif (CCA) present in non-HCC subjects, according to an embodiment of the present disclosure. The AUC was found to be 0.915. The most frequent 4-mer (CCCA) also provided a similar AUC of 0.91.
2. Use of entropy (Motif diversity score)
Fig. 25A shows a boxplot of entropy values using 256 4-mer end motifs across different groups, according to an embodiment of the present disclosure. All 256 4-mer motifs were used. As shown in fig. 25A, entropy values were significantly increased in HCC patients (mean: 5.242; range: 5.164-5.29) compared to non-HCC subjects (mean: 5.203; range: 5.124-5.253) (p value <0.0001). Importantly, using this terminal motif assay, 41.7% of patients with eHCC can be identified at a specificity of 95%. The entropy of the eHCC, iHCC, and aHCC groups was generally increased compared to the non-HCC group. In addition, different stages of cancer can be detected. For example, the value for advanced HCC is significantly higher than that for eHCC and iHCC.
Fig. 25B shows a boxplot of entropy values using 10 4-mer end motifs across different groups, according to an embodiment of the present disclosure. Here, the entropy of HCC subjects is reduced relative to non-HCC subjects. Thus, the set of terminal sequences used can change the relationship from increasing to decreasing. For example, using the top 10 ranked motifs, entropy in the HCC groups decreased. Either way, there is diagnostic ability to distinguish between HCC and non-HCC groups, as well as between early-stage HCC and later-stage HCC.
Fig. 26A shows a box plot of entropy values for 3-mer motifs used across different groups according to an embodiment of the present disclosure. The entropy of HCC subjects using 3-mer motifs (64 motifs in total) was found to be significantly higher (p-value <0.0001) than non-HCC subjects. Fig. 26B shows a ROC curve using entropy of 64 3-mer motifs between a non-HCC group and an HCC group, according to an embodiment of the present disclosure. AUC was found to be 0.872.
As described above, higher entropy values indicate higher diversity in the terminal motifs. As a further illustration of the ability of embodiments to use motif diversity scores to distinguish between various cancer type samples and control (e.g., healthy) samples, data from published studies was used.
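A minimal sketch of an entropy-based diversity measure follows. It assumes Shannon entropy with the natural logarithm and assumes that the motif diversity score is that entropy normalized by the logarithm of the number of possible motifs, so that a uniform distribution scores 1; this normalization is consistent with the value ranges reported in this section but is stated here as an assumption, and the function names are hypothetical.

```python
import math


def shannon_entropy(frequencies):
    """Shannon entropy (natural log) of a set of motif frequencies."""
    return -sum(p * math.log(p) for p in frequencies if p > 0)


def motif_diversity_score(frequencies):
    """Entropy normalized to [0, 1] by the log of the number of motifs."""
    frequencies = list(frequencies)
    return shannon_entropy(frequencies) / math.log(len(frequencies))


# Two extreme hypothetical 4-mer profiles: all motifs equally likely,
# and every fragment ending in the same motif.
uniform = [1 / 256] * 256
single = [1.0] + [0.0] * 255

uniform_entropy = shannon_entropy(uniform)
uniform_mds = motif_diversity_score(uniform)
single_mds = motif_diversity_score(single)
```

Under these assumptions, the maximum entropy for 256 motifs is ln(256), about 5.545, which bounds the entropy values of about 5.1 to 5.3 reported above.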
Fig. 27A and 27B show box plot diagrams using motif diversity scores for 4-mers across different groups according to embodiments of the present disclosure. Motif diversity scores were determined using all 256 4-mers. When we performed MDS analysis using sequencing results of plasma DNA downloaded from published studies, an increase in terminal diversity of plasma DNA can often be observed in various cancer types (Song et al, 2017), which may reflect the following fact: different tumor cells from different anatomical sites will release their DNA into the blood circulation (Bettegowda et al, 2014). The cancers analyzed were: hepatocellular carcinoma (HCC), Lung Cancer (LC), Breast Cancer (BC), Gastric Cancer (GC), glioblastoma multiforme (GBM), Pancreatic Cancer (PC) and colorectal cancer (CRC).
To further test the prevalence of MDS changes across different cancer types, we sequenced an independent cohort of plasma DNA samples from 40 patients with other cancer types, including patients with colorectal cancer (n = 10), lung cancer (n = 10), nasopharyngeal carcinoma (n = 10), and head and neck squamous cell carcinoma (n = 10), with a median paired end reading of 4200 ten thousand (range: 1900-. As shown in fig. 27B, the MDS values (median: 0.943; range: 0.939-0.949) were significantly higher in the group of patients with cancer than in the control group without cancer (median: 0.941; range: 0.933-0.946; p value <0.0001, Wilcoxon rank-sum test).
Fig. 28 shows receiver operating characteristic curves for various techniques to differentiate healthy controls from cancer, in accordance with embodiments of the present disclosure. We had a total of 129 samples, including healthy controls (n = 38), HBV carriers (n = 17), hepatocellular carcinoma patients (n = 34), colorectal cancer patients (n = 10), lung cancer patients (n = 10), nasopharyngeal carcinoma patients (n = 10), and head and neck squamous cell carcinoma patients (n = 10). Interestingly, the MDS-based method 2801 (AUC: 0.85) appeared to show the best performance compared to other fragmentation metrics, including fragment size 2803 (AUC: 0.74, p-value 0.0040; DeLong test) (Yu et al, 2017b), fragment preferred ends 2804 (AUC: 0.52, p-value <0.0001) (Jiang et al, 2018), and the orientation-aware plasma free DNA fragmentation signal (OCF) 2802 (AUC: 0.68, p-value 0.0013) (Sun et al, 2019). The combined analysis 2805 identifies the subject as having cancer if any of the techniques classifies the subject as having cancer.
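The rule used for the combined analysis (a subject is called positive if any individual technique calls it positive) can be sketched as follows. The metric names, scores, cutoffs, and the direction of each cutoff are hypothetical placeholders, not values from the study.

```python
def combined_call(scores, cutoffs, higher_means_cancer):
    """Call cancer if any single metric crosses its own cutoff (OR rule)."""
    calls = {}
    for name, score in scores.items():
        if higher_means_cancer[name]:
            calls[name] = score > cutoffs[name]
        else:
            calls[name] = score < cutoffs[name]
    return any(calls.values()), calls


# Hypothetical cutoffs and cutoff directions for two metrics.
cutoffs = {"mds": 0.944, "size": 0.30}
direction = {"mds": True, "size": True}

is_cancer, per_metric = combined_call({"mds": 0.950, "size": 0.20}, cutoffs, direction)
is_cancer_2, _ = combined_call({"mds": 0.940, "size": 0.20}, cutoffs, direction)
```

An OR rule of this kind tends to raise sensitivity at the expense of specificity, since a false positive from any one technique yields a positive combined call.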
The accuracy of MDS analysis to distinguish between cancer and non-cancer remains relatively good for motifs of different lengths. Assays were performed on 1-mer to 5-mer using MDS.
Figure 29 illustrates receiver operating characteristic curves for MDS analysis using various k-mers, in accordance with embodiments of the present disclosure. MDS values derived from 1-mer to 5-mer motifs also have the ability to distinguish patients with cancer from those without cancer. The 1-mer analysis 2901 provided an AUC of 0.81. The 2-mer analysis 2902 provided an AUC of 0.85. The 3-mer analysis 2903 provided an AUC of 0.85. The 4-mer analysis 2904 provided an AUC of 0.85. The 5-mer analysis 2905 provided an AUC of 0.81.
We also explored the effect of the tumor DNA fraction on the performance of MDS-based cancer detection using computer simulations.
Figure 30 shows performance of MDS-based cancer detection on various tumor DNA fractions, according to embodiments of the present disclosure. As shown in fig. 30, the performance of cancer detection gradually improved with the increase of the fraction of tumor DNA in plasma DNA. For example, the area under the ROC curve (AUC) for patients with a tumor DNA fraction of 0.1% was only 0.52, while the AUC for patients with a tumor DNA fraction of 3% increased to 0.9 and further increased at higher concentrations, but already approached a maximum at a tumor fraction of 5%.
3. Machine learning (SVM, regression and clustering)
To further explore whether a classifier for detecting cancer patients could be built using plasma DNA end motifs, we used the 256 plasma DNA end motifs to construct classifiers that differentiate patients with cancer (n = 55) from those without cancer (n = 74), using Support Vector Machines (SVMs) and logistic regression, respectively, each of which considers the magnitude and direction of each end motif. SVM analysis identified a hyperplane that best separated cancer patients from non-cancer patients in the 256-dimensional space, where each training data point comprised the frequencies of the 256 4-mer motifs. Logistic regression determines a coefficient by which to multiply each of the 256 frequencies and also determines a cutoff value for the resulting output, which may be the weighted sum of the frequencies or the output of a logistic function that receives the weighted sum as an input. Such a logistic function may be a sigmoid function or other activation function, as will be familiar to those skilled in the art.
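The logistic regression scoring step can be sketched as follows, assuming the coefficients, intercept, and cutoff have already been learned; training is not shown. The function names and the numeric values are hypothetical, and the example uses only three motif frequencies instead of 256 for brevity.

```python
import math


def logistic_score(frequencies, coefficients, intercept=0.0):
    """Sigmoid of the weighted sum of motif frequencies."""
    weighted_sum = intercept + sum(c * f for c, f in zip(coefficients, frequencies))
    return 1.0 / (1.0 + math.exp(-weighted_sum))


def classify(frequencies, coefficients, intercept=0.0, cutoff=0.5):
    """Return True (cancer) when the logistic output reaches the cutoff."""
    return logistic_score(frequencies, coefficients, intercept) >= cutoff


# Hypothetical learned coefficients; a positive coefficient means the motif
# is more frequent in cancer, a negative one that it is less frequent.
coefficients = [-200.0, 150.0, 0.0]
sample = [0.010, 0.006, 0.004]
score = logistic_score(sample, coefficients, intercept=0.5)
call = classify(sample, coefficients, intercept=0.5)
```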
To minimize the over-fitting problem, we employed a leave-one-out procedure and evaluated performance using Receiver Operating Characteristic (ROC) curve analysis. The leave-one-out procedure is performed according to the following steps. For a sample size of N, we set aside one sample as the test sample and then used the remaining N-1 samples to train the classifier on the 256 plasma DNA end motifs, based on SVM and logistic regression. We then used the trained classifier to determine whether to classify the left-out sample as taken from a subject with or without cancer. We systematically set aside each sample in turn as the test sample to test the classifier trained on the remaining samples. Therefore, a prediction result is obtained for each sample, and the accuracy can be calculated from the prediction results.
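The leave-one-out procedure can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is used as a stand-in for the SVM and logistic regression classifiers described above; any pair of train and predict functions could be passed in instead. The data are hypothetical two-dimensional vectors rather than 256-dimensional motif profiles.

```python
def train_nearest_centroid(vectors, labels):
    """Stand-in classifier: store the mean vector of each class."""
    centroids = {}
    for label in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_nearest_centroid(centroids, vector):
    """Assign the class whose centroid is closest in Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vector))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def leave_one_out(vectors, labels, train, predict):
    """Train on N-1 samples and test on the left-out sample, for every sample."""
    predictions = []
    for i in range(len(vectors)):
        train_x = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, vectors[i]))
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return predictions, accuracy


# Hypothetical motif-frequency vectors and labels.
vectors = [[0.20, 0.80], [0.25, 0.75], [0.30, 0.70],
           [0.70, 0.30], [0.75, 0.25], [0.80, 0.20]]
labels = ["cancer"] * 3 + ["non-cancer"] * 3
predictions, accuracy = leave_one_out(
    vectors, labels, train_nearest_centroid, predict_nearest_centroid)
```

Because each prediction is made by a model that never saw the left-out sample, the resulting accuracy is less prone to over-fitting than accuracy measured on the training data.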
Figure 31 illustrates receiver operating characteristic curves for MDS, SVM and logistic regression analysis according to embodiments of the present disclosure. Compared to MDS-based analysis (AUC = 0.85), we observed a small increase in AUC with the classifiers using 256 terminal motifs (AUC = 0.89 for both SVM and logistic regression).
As another machine learning technique, we used clustering based on terminal motif frequency.
Figure 32 illustrates hierarchical clustering analysis for the top 10 ranked terminal motifs across different groups with different cancer levels, according to an embodiment of the present disclosure. As shown, HCC subjects (eHCC: early HCC 3205; iHCC: intermediate HCC 3230; and aHCC: advanced HCC 3225) were generally clustered together, while non-HCC subjects (healthy control subjects; HBV: chronic hepatitis B carriers) were generally clustered together. For example, the cluster on the right is the early HCC 3205 (yellow). The leftmost part is control 3210, HBV 3215 and cirrhosis 3220. The different clustering patterns between HCC and non-HCC groups indicate that the preferences among plasma DNA terminal motifs are associated with disease, and indicate the potential diagnostic ability of the plasma DNA terminal motifs. In addition to connectivity-based hierarchical clustering as a statistical method, other clustering techniques may be used, such as centroid-based clustering, distribution-based clustering, and density-based clustering.
Figs. 33A-33C illustrate hierarchical clustering analysis using all plasma DNA molecules across different groups with different cancer levels, according to embodiments of the present disclosure. Fig. 33A shows hierarchical clustering analysis based on 256 4-mer end motif frequencies. Fig. 33B shows a magnified visualization of the hierarchical clustering analysis based on 256 4-mer end motif frequencies. Each row represents one type of terminal motif. Each column represents a separate plasma DNA sample. The gradient color indicates the frequency of the terminal motif. Red represents the highest frequency and green represents the lowest frequency. Fig. 33C shows Principal Component Analysis (PCA) using end motifs for HCC subjects and non-HCC subjects. A principal component is a linear combination of the 256 motif frequencies that provides the greatest variance, e.g., in the resulting weighted sum of frequencies.
Since HCC subjects and non-HCC subjects appear to form two distinct clusters, the terminal motifs derived from all plasma DNA molecules would be an important metric for distinguishing HCC subjects from non-HCC subjects. Fig. 33A and 33B show that HCC subjects 3305 (red) tended to cluster into one group, and non-HCC subjects 3310 (blue) tended to cluster into another group. In fig. 33C, PCA analysis also showed that HCC subjects and non-HCC subjects tended to cluster into two different groups. PC1 and PC2 correspond to different linear combinations of relative frequencies (e.g., weighted averages) that may represent the pattern of a given histogram of relative frequencies. Fig. 33C shows that linear combination (or other transformations) can be performed before clustering is performed or cutoff values or cutoff planes are used. Thus, the transformed relative frequencies may be used to determine a sum value.
Figure 34 illustrates hierarchical clustering analysis based on 3-mer motifs using all plasma DNA molecules across different groups with different cancer levels, according to an embodiment of the present disclosure. For ease of illustration, only the top portion of the heatmap is shown. As shown, HCC subjects (eHCC: early HCC 3405; iHCC: intermediate HCC 3430; and aHCC: advanced HCC 3425) were typically clustered together, while non-HCC subjects (healthy control subject 3410; HBV 3415: chronic hepatitis B carrier; and liver cirrhosis 3420) were typically clustered together.
Based on these findings, machine learning (e.g., deep learning) models can be used to train cancer classifiers on 256-dimensional vectors of plasma DNA end motif frequencies. Such models include, but are not limited to, Support Vector Machines (SVMs), decision trees, naive Bayes classification, logistic regression, clustering algorithms, PCA, Singular Value Decomposition (SVD), t-distributed stochastic neighbor embedding (t-SNE), artificial neural networks, and ensemble methods that construct a set of classifiers and then classify new data points by weighted voting on their predictions. Once the cancer classifier is trained on a matrix of 256-dimensional vectors from a series of cancer patients and non-cancer patients, the likelihood that a new patient has cancer can be predicted.
In such use of machine learning algorithms, the aggregate value may correspond to a probability or distance (e.g., when using SVM) that may be compared to a reference value. In other embodiments, the aggregate value may correspond to an earlier output in the model (e.g., an earlier layer in the neural network) that is compared to a cutoff value between two classes or to a representative value for a given class.
B. Immune disease monitoring
Figure 35A shows an entropy analysis using all plasma DNA molecules for healthy control subjects and SLE patients, according to embodiments of the present disclosure. Figure 35B shows a hierarchical clustering analysis using all plasma DNA molecules for healthy control subjects and SLE patients, according to embodiments of the present disclosure.
Global landscape aberration analysis of the plasma DNA terminal motifs, including entropy (fig. 35A, p-value: 0.00014) and clustering analysis (fig. 35B), indicated that SLE patients could be distinguished from healthy control subjects. For example, entropy was increased for subjects with SLE (fig. 35A). Also, two clusters were generally formed, on the left (SLE 3510) and on the right (control/normal 3505). Thus, autoimmune disease alters the plasma DNA fragmentation pattern, demonstrating the ability of plasma DNA terminal motifs to distinguish between SLE subjects and control subjects.
Figure 36 shows an entropy analysis using plasma DNA molecules with 10 selected terminal motifs for healthy control subjects and SLE patients, according to embodiments of the present disclosure. The 10 motifs with the highest relative frequencies in the control subjects were used. As with other phenotypes, the motif set can influence whether the SLE entropy is higher or lower. Given that the 10 motifs were selected as having the highest values in the controls, the control entropy is higher because these values are similar to each other (i.e., due to the ranking). Conversely, the SLE entropy is lower because the values vary more, e.g., because the motifs were not ranked using the SLE subjects. The opposite relationship may exist if the top 10 motifs were selected using the SLE samples. Thus, the aggregate value of relative frequencies can be used to determine the level of autoimmune disease (e.g., SLE).
C. Synergistic analysis of terminal motifs and conventional metrics
We tested whether combined analysis of plasma DNA end motifs and other metrics (copy number aberrations (CNA), hypomethylation and hypermethylation) would improve the performance of noninvasive cancer detection. For example, decision tree based classification may be used for combinatorial analysis.
Figure 37 shows ROC curves for combined analyses including terminal motifs and copy number or methylation for HCC subjects and non-HCC subjects, according to embodiments of the present disclosure. For the terminal motif analysis, motif diversity scores were determined using all 256 4-mer motifs. The combined analysis identifies cancer if either analysis results in a classification of cancer. The combined analysis of terminal motifs and methylation (AUC: 0.94) or of terminal motifs and CNA (AUC: 0.93) outperformed the analysis using terminal motifs alone (AUC: 0.86). The methylation analysis used the number of 1 Mb bins that were hypomethylated relative to normal controls (defined as a methylation density z-score < -3), with a cutoff number of aberrant bins distinguishing between cancer and non-cancer. The CNA analysis used the number of 1 Mb bins having z-scores greater than 3 or less than -3, with a cutoff number of aberrant bins distinguishing between cancer and non-cancer. More details of the methylation analysis can be found in U.S. patent publication 2014/0080715, while more details of the CNA analysis can be found in U.S. patent publication 2013/0040824.
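The counting of aberrant 1 Mb bins by z-score can be sketched as follows. This is a simplified illustration in which each bin of the test sample is compared with the mean and sample standard deviation of the same bin across control samples; the function name and the three-bin example values are hypothetical, and the cited publications should be consulted for the actual procedures.

```python
import statistics


def count_aberrant_bins(sample_bins, control_bins_by_sample,
                        z_cutoff=3.0, direction="both"):
    """Count bins whose z-score against controls exceeds the cutoff.

    direction: "low" (z < -cutoff, e.g., hypomethylation),
               "high" (z > cutoff), or "both" (|z| > cutoff, e.g., CNA).
    """
    aberrant = 0
    for i, value in enumerate(sample_bins):
        reference = [control[i] for control in control_bins_by_sample]
        mean = statistics.fmean(reference)
        sd = statistics.stdev(reference)
        if sd == 0:
            continue  # bin is uninformative in the controls; skip it
        z = (value - mean) / sd
        if direction == "low":
            hit = z < -z_cutoff
        elif direction == "high":
            hit = z > z_cutoff
        else:
            hit = abs(z) > z_cutoff
        aberrant += hit
    return aberrant


# Hypothetical per-bin values for three control samples and one test sample.
controls = [[1.0, 0.80, 1.0], [1.1, 0.82, 1.0], [0.9, 0.78, 1.0]]
test_sample = [1.5, 0.60, 2.0]
n_both = count_aberrant_bins(test_sample, controls, direction="both")
n_low = count_aberrant_bins(test_sample, controls, direction="low")
```

The resulting count would then be compared with a cutoff number of aberrant bins to distinguish between cancer and non-cancer.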
An exemplary decision tree based classification is now described. For example, we can use a random forest algorithm to derive cutoff values for each metric, including CNA, hypomethylation, hypermethylation, size (e.g., as described in U.S. patent publication 2013/0237431), end motifs, and fragmentation patterns (e.g., as described in U.S. patent publications 2017/0024513 and 2019/0341127 and U.S. patent application 16/519,912). Each metric has a specific cutoff value. Taking one metric (hypomethylation) as an example, a case may be classified as cancer or non-cancer depending on whether the metric is below or above the cutoff value. One metric represents one node in the decision tree. For example, after a sample traverses all nodes in the entire tree, a majority vote (e.g., the number of nodes indicating cancer being greater than the number of nodes indicating non-cancer) may provide the final classification.
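The final voting step can be sketched as follows, assuming the per-metric cutoffs have already been derived (the derivation by a random forest is not shown). The metric names, cutoff values, and cutoff directions are hypothetical.

```python
def majority_vote(sample_metrics, rules):
    """Classify by majority vote over per-metric cutoffs.

    rules maps a metric name to (cutoff, cancer_if), where cancer_if is
    "above" or "below" and states which side of the cutoff indicates cancer.
    """
    cancer_votes = 0
    for metric, (cutoff, cancer_if) in rules.items():
        value = sample_metrics[metric]
        if cancer_if == "above":
            vote = value > cutoff
        else:
            vote = value < cutoff
        cancer_votes += vote
    non_cancer_votes = len(rules) - cancer_votes
    return "cancer" if cancer_votes > non_cancer_votes else "non-cancer"


# Hypothetical cutoffs: counts of aberrant bins and a motif diversity score.
rules = {
    "hypomethylation": (50, "above"),
    "cna": (20, "above"),
    "mds": (0.944, "above"),
}
call_a = majority_vote({"hypomethylation": 120, "cna": 5, "mds": 0.950}, rules)
call_b = majority_vote({"hypomethylation": 10, "cna": 5, "mds": 0.950}, rules)
```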
D. Examples of alternative methods for defining terminal motifs of plasma DNA
To demonstrate the feasibility of using an alternative approach to defining terminal motifs of plasma DNA, technique 160 in fig. 1 was employed to analyze sequenced HCC and non-HCC subjects, including 20 healthy control subjects (control), 22 chronic hepatitis B carriers (HBV), 12 cirrhosis subjects (Cirr), 24 early HCC (eHCC), 11 intermediate HCC (iHCC), and 7 advanced HCC (aHCC) subjects.
Figure 38A shows an entropy analysis based on 4-mers co-constructed from the ends of sequenced plasma DNA fragments and their adjacent genomic sequences in HCC subjects and non-HCC subjects, according to embodiments of the disclosure. Entropy was determined using all 256 terminal motifs. As with the analysis using technique 140 of fig. 1 to define motifs, the entropy of HCC subjects differs from that of non-cancer subjects. Also, advanced HCC differs greatly from eHCC and iHCC. Figure 38B shows a clustering analysis based on 4-mers co-constructed from the ends of the sequenced plasma DNA fragments and their adjacent genomic sequences in HCC subjects 3810 and non-HCC subjects 3805, according to embodiments of the present disclosure.
Fig. 39 shows a ROC comparison of the techniques 140 and 160 of fig. 1 for defining terminal motifs of plasma DNA, according to embodiments of the present disclosure. The same subjects as in fig. 38A were used, and entropy analysis using 4-mers was performed for classification. Method (i) corresponds to technique 140 and method (ii) corresponds to technique 160. Slightly poorer performance was observed using the technique 160 in FIG. 1 (AUC: 0.815 versus 0.856) compared to the technique 140 in FIG. 1.
E. Filtering to improve discrimination
Certain criteria may be used to filter specific DNA fragments (in addition to the terminal motifs) to provide greater accuracy, e.g., sensitivity and specificity. As an example, terminal motif analysis may be limited to DNA fragments derived from open chromatin regions of a particular tissue, e.g., as determined by a read aligning entirely within, or partially overlapping, one of a plurality of open chromatin regions. For example, any read having at least one nucleotide overlapping an open chromatin region may be defined as a read within the open chromatin region. A typical open chromatin region is about 300 bp, based on DNase I hypersensitivity sites. The size of an open chromatin region can vary depending on the technique used to define it, e.g., ATAC-Seq (assay for transposase-accessible chromatin using sequencing) or DNase I-Seq.
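The "at least one overlapping nucleotide" criterion can be sketched as follows. The sketch assumes half-open, zero-based coordinates and assumes that the open chromatin regions on each chromosome do not overlap one another; the function names and coordinates are hypothetical.

```python
import bisect


def build_region_index(regions):
    """Group (chrom, start, end) regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def overlaps_open_chromatin(index, chrom, read_start, read_end):
    """True if the read shares at least one nucleotide with any region."""
    intervals = index.get(chrom, [])
    # Position of the first region that starts at or after the read end;
    # with non-overlapping regions, only the region just before it can overlap.
    pos = bisect.bisect_left(intervals, (read_end,))
    return pos > 0 and intervals[pos - 1][1] > read_start


# Hypothetical open chromatin regions of about 300 bp each.
index = build_region_index([("chr1", 100, 400), ("chr1", 1000, 1300)])
keep_a = overlaps_open_chromatin(index, "chr1", 399, 500)  # one base overlaps
keep_b = overlaps_open_chromatin(index, "chr1", 400, 500)  # adjacent, no overlap
```

Reads for which the function returns True would be retained for the terminal motif analysis, and the others would be filtered out.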
As another example, a DNA fragment of a particular size may be selected to perform the terminal motif analysis. As shown below, this can increase the separation of the sum of the relative frequencies of the terminal motifs, thereby improving accuracy.
Another example may use the methylation characteristics of DNA fragments. Fetal and tumor DNA is typically hypomethylated. Embodiments can determine a methylation metric (e.g., density) of a DNA fragment (e.g., as a proportion or absolute number of sites methylated on the DNA fragment). Also, DNA fragments can be selected for terminal motif analysis based on the measured methylation density. For example, a DNA fragment may be used only when the methylation density is above a threshold.
Whether a DNA fragment includes sequence variation (e.g., base substitution, insertion, or deletion) relative to a reference genome may also be used for filtering.
Various filtering criteria may be used in combination. For example, each criterion may need to be met, or at least a certain number of criteria may need to be met. In another implementation, a probability that a fragment corresponds to clinically relevant DNA (e.g., fetal, tumor, or transplant DNA) can be determined, and a threshold can be set that the probability must satisfy before the DNA fragment is used in the end motif analysis. As another example, the contribution of a DNA fragment to the frequency counter of a particular terminal motif can be weighted based on the probability (e.g., adding the probability, which has a value less than one, rather than adding one). Thus, DNA fragments with a particular terminal motif will be weighted higher and/or have a higher probability. This enrichment is described further below.
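Both options (a probability threshold and probability-weighted counting) can be sketched together as follows. How the per-fragment probability is obtained is outside the scope of the sketch; the function name and the example motif and probability pairs are hypothetical.

```python
from collections import defaultdict


def weighted_motif_frequencies(fragments, min_probability=0.0):
    """Relative motif frequencies with each fragment weighted by probability.

    fragments: iterable of (end_motif, probability) pairs, where probability
    is the chance that the fragment is clinically relevant DNA.
    """
    counts = defaultdict(float)
    for motif, probability in fragments:
        if probability < min_probability:
            continue  # fragment fails the threshold and is filtered out
        counts[motif] += probability  # add the probability rather than one
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: count / total for motif, count in counts.items()}


# Hypothetical fragments.
fragments = [("CCCA", 0.9), ("CCCA", 0.1), ("TGGG", 0.5), ("AAAA", 0.5)]
weighted = weighted_motif_frequencies(fragments)
thresholded = weighted_motif_frequencies(fragments, min_probability=0.5)
```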
1. Terminal motifs across tissue-specific chromatin regions
Since different tissues have preferred fragmentation patterns during apoptosis (Chan et al, Proc Natl Acad Sci USA.2016; 113: E8159-8168; Jiang et al, Proc Natl Acad Sci USA.2018; doi:10.1073/pnas.1814616115), we further reasoned that selecting certain genomic regions for plasma DNA terminal motif analysis would further improve the ability to discriminate between diseased and control subjects. As an example for detecting HCC patients, open chromatin regions of blood and liver were used.
Figure 40 shows an accuracy comparison in which tissue-specific open chromatin regions improve the ability of plasma DNA terminal motifs to differentiate between HCC and non-cancer patients, according to embodiments of the present disclosure. The analysis was performed with 4-mers, using the entropy of all 256 motifs and the combined frequency of the top 10 motifs. For the liver open chromatin results, sequence reads are retained (i.e., not filtered out) if the reads have at least one nucleotide overlapping one of the liver open chromatin regions.
Terminal motifs derived from plasma DNA molecules that overlap the liver open chromatin regions yielded the best performance when the combined frequency of the top 10 ranked motifs was used, with an AUC of 0.918. In contrast, terminal motifs derived from plasma DNA molecules without any selection, using all 256 motifs, yielded the lowest performance, with an AUC of 0.855.
Thus, if a particular tissue is to be screened for cancer, analysis may be performed using DNA fragments from the open chromatin of that particular tissue (or at least where the terminal sequences are located in regions of open chromatin), without using DNA fragments that are not in those identified regions. Liver is used here because cancer is HCC. The location of the DNA fragments can be determined by aligning the sequence reads to a reference genome, where the open chromatin regions can be identified from literature or databases.
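The region-based selection can be illustrated with a short sketch. It assumes reads are already aligned and represented as 0-based half-open intervals on one chromosome, and that the open chromatin regions are a sorted, non-overlapping list of such intervals. The function names and coordinates are hypothetical.

```python
from bisect import bisect_right


def make_overlap_checker(regions):
    """Build a checker for a sorted, non-overlapping list of (start, end) regions.

    Coordinates are 0-based and half-open, all on the same chromosome.
    """
    starts = [start for start, _ in regions]

    def overlaps(read_start, read_end):
        # Find the last region that starts at or before the read's final base.
        i = bisect_right(starts, read_end - 1) - 1
        # At least one shared nucleotide means the region ends after read_start.
        return i >= 0 and regions[i][1] > read_start

    return overlaps


def filter_reads(reads, regions):
    """Retain reads having at least one nucleotide inside an open chromatin region."""
    overlaps = make_overlap_checker(regions)
    return [(s, e) for s, e in reads if overlaps(s, e)]


# Hypothetical liver open chromatin regions of about 300 bp each.
liver_regions = [(100, 400), (1000, 1300)]
reads = [(50, 100), (50, 101), (399, 450), (400, 500), (1299, 1400)]
print(filter_reads(reads, liver_regions))
```

A binary search over region starts keeps the lookup fast when many reads are tested against a large region list. An interval tree or a genomic interval library would serve the same purpose.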
2. Size band based analysis of terminal motifs
The frequency of certain of the terminal motifs was shown to vary according to the size range (size band) analyzed, e.g., the percentage of CCCA showed this behavior. This means that size band-based end motif analysis can affect the ability to use plasma DNA end motifs to distinguish cancer patients from non-cancer subjects. To illustrate this possibility, we tested a range of sizes, including but not limited to 50-80bp, 81-110bp, 111-140bp, 141-170bp, 171-200bp, 201-230bp, to investigate how the analyzed size bands would affect the overall diagnostic performance.
Figure 41 illustrates a size band-based plasma DNA end motif analysis, according to embodiments of the present disclosure. The motif diversity score (entropy) used for the classification was determined using the 256 motifs of a 4-mer. Various ranges are listed in fig. 41, but other ranges may be used. The 50-80 analysis 4101 provided an AUC of 0.826. The 81-110 analysis 4102 provided an AUC of 0.537. The 111-140 analysis 4103 provided an AUC of 0.551. The 141-170 analysis 4104 provided an AUC of 0.716. The 171-200 analysis 4105 provided an AUC of [...]. The 201-230 analysis 4106 provided an AUC of 0.756.
Such size ranges are useful in techniques for enriching clinically relevant DNA. For example, selecting DNA molecules of 50-80 bases will enrich for tumor DNA in the sample. Multiple disjoint size ranges may be used instead of a single size range. Such enrichment may be the reason that a better AUC occurred for the size ranges of 50-80 bases and 81-110 bases.
Terminal motifs derived from plasma DNA molecules in the range of 50 bp to 80 bp appear to have the best ability to discriminate HCC subjects from non-HCC subjects (AUC: 0.83). Thus, embodiments may filter DNA fragments to select DNA fragments within a particular size range, and then use the selected DNA fragments (reads) to determine the relative frequencies and perform subsequent operations. By way of example, size filtering can be performed by physical separation or by determining sizes from the sequence reads (e.g., from the read length if the entire fragment is sequenced, or by aligning paired-end reads to a reference). Examples of physical enrichment of short DNA include cutting bands after gel electrophoresis, collecting eluents at certain retention times during capillary electrophoresis or liquid chromatography, or using microfluidic techniques.
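In silico size filtering of this kind can be sketched as follows, assuming each fragment's size has already been determined from its reads. The function name and the fragment tuples are hypothetical; the band boundaries are the ones listed above.

```python
# Size bands (in bp, inclusive) taken from the ranges tested in the text.
SIZE_BANDS = [(50, 80), (81, 110), (111, 140), (141, 170), (171, 200), (201, 230)]


def select_by_size(fragments, bands):
    """Keep fragments whose size falls within any of the given inclusive bands.

    fragments: iterable of (size_bp, end_motif) pairs.
    bands: list of (low, high) pairs; several disjoint bands may be supplied.
    """
    return [
        (size, motif)
        for size, motif in fragments
        if any(low <= size <= high for low, high in bands)
    ]


# Hypothetical fragments as (size, 4-mer end motif).
fragments = [(60, "CCCA"), (95, "AAAA"), (166, "CCAG"), (80, "CCTG")]
print(select_by_size(fragments, [SIZE_BANDS[0]]))          # 50-80 bp only
print(select_by_size(fragments, [(50, 80), (141, 170)]))   # two disjoint bands
```

The selected fragments would then feed the relative frequency calculation in place of the full set of fragments.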
F. Classifying levels of pathology
Fig. 42 is a flow diagram illustrating a method 4200 of classifying a pathology level in a biological sample of a subject according to an embodiment of the present disclosure. The biological sample includes free DNA. Aspects of method 4200 may be performed in a manner similar to method 1900 of fig. 19 and method 2000 of fig. 20.
At block 4210, a plurality of free DNA fragments from the biological sample are analyzed to obtain sequence reads. The sequence reads include terminal sequences corresponding to the ends of the plurality of free DNA fragments. Block 4210 may be performed in a similar manner as block 1910 of fig. 19.
At block 4220, for each of the plurality of free DNA fragments, a sequence motif is determined for each of the one or more terminal sequences of the free DNA fragment. Block 4220 may be performed in a similar manner as block 1920 of fig. 19.
At block 4230, a relative frequency of a set of one or more sequence motifs corresponding to the terminal sequences of the plurality of free DNA fragments is determined. The relative frequency of the sequence motif can provide a proportion of the plurality of free DNA fragments having terminal sequences corresponding to the sequence motif. Block 4230 may be performed in a similar manner as block 1930 of fig. 19. For example, the collection of one or more sequence motifs can include N base positions. The collection of one or more sequence motifs may include all combinations of N bases. N may be an integer equal to or greater than 3, or any other integer.
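The relative frequency calculation of block 4230 can be illustrated with a minimal sketch. It assumes the terminal sequences are given as strings read from the fragment end inward and takes the first N bases as the motif. The function name is hypothetical, and skipping ends that contain ambiguous bases is an assumption of this sketch.

```python
from collections import Counter
from itertools import product


def end_motif_frequencies(end_sequences, n=4):
    """Relative frequency of every N-base motif among fragment ends.

    end_sequences: iterable of strings, each starting at a fragment end.
    Ends shorter than n or containing non-ACGT bases are ignored.
    Returns a dict over all 4**n motifs (256 motifs when n is 4).
    """
    motifs = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = Counter(seq[:n] for seq in end_sequences)
    total = sum(counts[m] for m in motifs)
    return {m: (counts[m] / total if total else 0.0) for m in motifs}


# Hypothetical terminal sequences; the last one is skipped (ambiguous bases).
ends = ["CCCAGT", "CCCATT", "ACGTAA", "NNNNAA"]
freqs = end_motif_frequencies(ends)
print(len(freqs), freqs["CCCA"])
```

Each value is the proportion of usable fragment ends carrying that motif, so the values sum to one whenever at least one usable end is present.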
As another example, the set of one or more sequence motifs can be the top M sequence motifs having the greatest difference between the two types of DNA identified in the one or more reference samples, e.g., the motifs all exhibiting the greatest positive difference (e.g., the top 10 or other number) or all of the motifs having the greatest negative difference. M may be an integer equal to or greater than 1. For methods 1900 and 2000, the two types of DNA can be clinically relevant DNA and another DNA. For method 4200, the two types of DNA may be from two reference samples with different classifications of pathology levels. As another example, the set of one or more sequence motifs can be the top M most frequent sequence motifs that occur in one or more reference samples, e.g., as shown in fig. 22, wherein the reference sample is a non-cancer sample, e.g., an HBV sample.
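Selecting the top M motifs by difference between two types of DNA can be sketched as follows. The sketch assumes motif frequencies for the two reference types are already available as dictionaries. The function name, the tie-breaking rule, and the example frequencies are hypothetical.

```python
def top_m_motifs(freq_a, freq_b, m=10, positive=True):
    """Motifs with the largest frequency difference (type A minus type B).

    freq_a, freq_b: dicts mapping motif to relative frequency in each type.
    positive=True returns the largest positive differences; positive=False
    returns the largest negative ones. Ties are broken alphabetically.
    """
    motifs = set(freq_a) | set(freq_b)
    diffs = {k: freq_a.get(k, 0.0) - freq_b.get(k, 0.0) for k in motifs}
    sign = -1 if positive else 1
    return sorted(diffs, key=lambda k: (sign * diffs[k], k))[:m]


# Hypothetical frequencies in two reference samples.
type_a = {"CCCA": 0.030, "CCAG": 0.020, "AAAA": 0.010}
type_b = {"CCCA": 0.010, "CCAG": 0.015, "AAAA": 0.020}
print(top_m_motifs(type_a, type_b, m=2))
print(top_m_motifs(type_a, type_b, m=1, positive=False))
```

Ranking by raw frequency in a single reference sample, the other option mentioned above, would only need a sort on one dictionary.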
At block 4240, an aggregate value of the relative frequencies of the set of one or more sequence motifs is determined. Block 4240 may be performed in a similar manner as block 1940 of fig. 19. Examples of the aggregate value are described throughout this disclosure and include entropy, a combined frequency, a difference (e.g., distance) from a reference pattern of relative frequencies (as may be implemented in clustering or using SVMs), a value determined from the difference (e.g., a probability), or an output of a machine learning model (e.g., an intermediate or final layer in a neural network) that is compared to a cutoff value between two classes or to a representative value for a given class.
When the set of one or more sequence motifs comprises a plurality of sequence motifs, the aggregate value may comprise a sum of the relative frequencies of the set. The sum may be a weighted sum. For example, the aggregate value may include an entropy term, where the entropy term includes a sum of terms that constitutes the weighted sum, and each term comprises a relative frequency multiplied by the logarithm of the relative frequency. As another example, the aggregate value may correspond to a variance of the relative frequencies.
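The entropy form of the aggregate value can be illustrated with a short sketch. Each term is a relative frequency multiplied by its logarithm, as described above. Dividing by the logarithm of the number of motifs, so that the score lies between 0 and 1, is an assumption of this sketch and is not stated in the text. The function name is hypothetical.

```python
import math


def motif_entropy(frequencies, normalize=True):
    """Shannon entropy of a list of relative motif frequencies.

    Each term is p * log(p); zero frequencies contribute nothing.
    With normalize=True the result is divided by log(number of motifs),
    which is an assumed convention giving a score in [0, 1].
    """
    entropy = -sum(p * math.log(p) for p in frequencies if p > 0)
    if normalize and len(frequencies) > 1:
        entropy /= math.log(len(frequencies))
    return entropy


# Uniform use of all 256 4-mer motifs gives the maximum score.
print(motif_entropy([1 / 256] * 256))
# All ends sharing a single motif gives the minimum score.
print(motif_entropy([1.0] + [0.0] * 255))
```

A combined frequency, by contrast, would simply be the sum of the relative frequencies of the selected motifs.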
In another example, the aggregate value includes a final or intermediate output of the machine learning model. In various implementations, the machine learning model uses clustering, support vector machines, or logistic regression.
At block 4250, a classification of the level of pathology of the subject may be determined based on a comparison of the aggregate value to a reference value. As an example, the pathology may be cancer or an autoimmune disorder. As an example, the levels may be no cancer, early stage, intermediate stage, or advanced stage. The classification may then select one of the levels. Thus, the classification may be determined from a plurality of cancer levels including a plurality of cancer stages. For example, the cancer can be hepatocellular carcinoma, lung cancer, breast cancer, gastric cancer, glioblastoma multiforme, pancreatic cancer, colorectal cancer, nasopharyngeal carcinoma, or head and neck squamous cell carcinoma. As an example, the autoimmune disorder can be systemic lupus erythematosus.
In other examples, the pathology level corresponds to a concentration fraction of clinically relevant DNA associated with the pathology. For example, the pathology may be cancer and the clinically relevant DNA may be tumor DNA. The reference value may be a calibration value determined from a calibration sample, as described with respect to method 1900.
In some embodiments, the free DNA is filtered to identify the plurality of free DNA fragments. Examples of filtering are provided in the sections above. For example, filtering can be based on methylation (density or whether a particular site is methylated), size, or the region from which the DNA fragment originates. The free DNA can be filtered to obtain DNA fragments from open chromatin regions of a particular tissue.
IV. Enrichment of clinically relevant DNA
The preference of DNA fragments from a particular tissue for a particular set of terminal sequences can be used to enrich for DNA from that particular tissue in a sample. Thus, embodiments can enrich for clinically relevant DNA in a sample. For example, an assay may be used to sequence, amplify, and/or capture only DNA fragments having a particular terminal sequence. As another example, filtering of sequence reads can be performed, e.g., in a manner similar to that described in section III.E.
A. Physical enrichment
Physical enrichment can be performed in a variety of ways, for example by targeted sequencing or PCR, as can be performed using specific primers or adapters. Adapters may be added to the ends of the fragments if specific end motifs of the end sequences are detected. Then, when sequencing is performed, only the DNA fragments with the adaptors are sequenced (or at least predominantly sequenced), thereby providing targeted sequencing.
As another example, primers that hybridize to a particular set of terminal sequences can be used. Sequencing or amplification can then be performed using these primers. Capture probes corresponding to specific terminal motifs can also be used to capture DNA molecules having those terminal motifs for further analysis. Some embodiments may ligate short oligonucleotides to the ends of plasma DNA molecules. A probe may then be designed such that it recognizes only a sequence that is partially the terminal motif and partially the ligated oligonucleotide.
Some embodiments can use CRISPR-based diagnostic techniques, e.g., using guide RNAs to locate sites corresponding to preferred terminal motifs of clinically relevant DNA, and then using a nuclease to cleave the DNA fragment, as can be done using Cas9 or Cas12. For example, an adapter can be used to recognize a terminal motif, and then CRISPR/Cas9 or Cas12 is used to cleave the terminal motif/adapter hybrid and create a universally recognizable end for further enriching the molecules having the desired ends.
Fig. 43 is a flow diagram illustrating a method 4300 of enriching a biological sample for clinically relevant DNA, according to an embodiment of the present disclosure. Biological samples contain clinically relevant DNA molecules and other free DNA molecules. The method 4300 may perform enrichment using a particular assay.
At block 4310, a plurality of free DNA fragments from a biological sample is received. Clinically relevant DNA fragments (e.g., fetal or tumor DNA fragments) have terminal sequences that include sequence motifs occurring at a greater relative frequency than in other DNA (e.g., maternal DNA, healthy DNA, or DNA from blood cells). As an example, the data from figs. 3 and 13 may be used. Thus, the sequence motifs can be used to enrich for clinically relevant DNA.
At block 4320, the plurality of free DNA fragments are subjected to one or more probe molecules that detect the sequence motifs in the terminal sequences of the plurality of free DNA fragments. Such use of probe molecules yields detected DNA fragments. In one example, the one or more probe molecules may comprise one or more enzymes that interrogate the plurality of free DNA fragments and append new sequences that are used to amplify the detected DNA fragments. In another example, the one or more probe molecules may be attached to a surface for detecting the sequence motifs in the terminal sequences by hybridization.
At block 4330, the detected DNA fragments are used to enrich for clinically relevant DNA fragments in the biological sample. For example, enriching clinically relevant DNA fragments in a biological sample using the detected DNA fragments can include amplifying the detected DNA fragments. As another example, detected DNA fragments may be captured and undetected DNA fragments may be discarded.
B. In silico enrichment
In silico enrichment can use various criteria to select or discard certain DNA fragments. Such criteria may include terminal motifs, open chromatin regions, size, sequence variation, methylation, and other epigenetic features. Epigenetic characteristics include all modifications of the genome that do not involve changes in the DNA sequence. A criterion may specify a cutoff, for example, requiring certain characteristics, such as a particular size range, a methylation metric above or below a certain amount, a combination of methylation states of more than one CpG site (e.g., a methylation haplotype (Guo et al, Nat Genet. 2017; 49:635-42)), etc., or having a combined probability above a threshold. Such enrichment may also involve weighting the DNA fragments based on such probabilities.
As an example, the enriched sample can be used to classify pathology (as described above), as well as to identify tumor or fetal mutations or for marker enumeration for amplification/deletion detection of chromosomes or chromosomal regions. For example, if a particular terminal motif or set of terminal motifs is associated with liver cancer (i.e., has a higher relative frequency than non-cancer or other cancer), embodiments for performing cancer screening may weight such DNA fragments higher than DNA fragments that do not have the preferred terminal motif or set of terminal motifs.
Fig. 44 is a flow diagram illustrating a method 4400 of enriching a biological sample for clinically relevant DNA, according to an embodiment of the present disclosure. Biological samples contain clinically relevant DNA molecules and other free DNA molecules. The method 4400 can perform enrichment using specific criteria for sequence reads.
At block 4410, a plurality of free DNA fragments from the biological sample are analyzed to obtain sequence reads. The sequence reads include terminal sequences corresponding to the ends of the plurality of free DNA fragments. Block 4410 may be performed in a similar manner as block 1910 of fig. 19.
At block 4420, for each of the plurality of free DNA fragments, a sequence motif is determined for each of the one or more terminal sequences of the free DNA fragment. Block 4420 may be performed in a similar manner as block 1920 of fig. 19.
At block 4430, a set of one or more sequence motifs that occur with greater relative frequency in clinically relevant DNA than in other DNA is identified. The set of sequence motifs can be identified by the genotypic or phenotypic techniques described herein. Calibration or reference samples can be used to rank and select sequence motifs that are selective for clinically relevant DNA.
At block 4440, a set of sequence reads having a set of one or more sequence motifs in the terminal sequence is identified. This can be seen as the first stage of filtration.
At block 4450, sequence reads having a likelihood of corresponding to clinically relevant DNA that exceeds a threshold may be stored. This likelihood can be determined using the set of terminal motifs. For example, for each sequence read in the set of sequence reads, a likelihood that the sequence read corresponds to clinically relevant DNA can be determined based on the terminal sequence of the sequence read including a sequence motif of the set of one or more sequence motifs. The likelihood may be compared to a threshold. As an example, the threshold may be determined empirically. For example, various thresholds can be tested on samples for which the concentration of clinically relevant DNA in a set of sequence reads can be measured. An optimal threshold may maximize the concentration while retaining a certain percentage of the total number of sequence reads. The threshold may be determined from one or more given percentiles (e.g., 5th, 10th, 90th, or 95th) of the concentration of one or more terminal motifs present in healthy controls or in control groups exposed to similar etiological risk factors but without disease. The threshold may be a regression or probability score.
When the likelihood exceeds the threshold, the sequence read can be stored in memory (e.g., in a file, table, or other data structure) to obtain stored sequence reads. Sequence reads with a likelihood below the threshold may be discarded or not stored in the storage location of retained reads, or a field of a database may include a flag indicating that the read has a likelihood lower than the threshold so that later analysis can exclude such reads. By way of example, the likelihood may be determined using various techniques, such as odds ratios, z-scores, or probability distributions.
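The thresholding at block 4450 can be illustrated with a minimal sketch that uses an odds ratio as the likelihood score, one of the techniques named above. The function names, motif frequencies, and threshold are hypothetical. In practice, the frequencies would come from calibration or reference samples.

```python
def odds_ratio(p_relevant, p_other):
    """Odds of a motif in clinically relevant DNA relative to other DNA.

    Both arguments are relative frequencies strictly between 0 and 1.
    """
    return (p_relevant / (1 - p_relevant)) / (p_other / (1 - p_other))


def enrich_in_silico(reads, motif_likelihood, threshold):
    """Split reads into stored reads and reads flagged as below threshold.

    reads: iterable of (read_id, end_motif) pairs.
    motif_likelihood: dict mapping motif to a likelihood score; motifs
    outside the selected set score 0.0 in this sketch.
    """
    stored, flagged = [], []
    for read_id, end_motif in reads:
        if motif_likelihood.get(end_motif, 0.0) > threshold:
            stored.append(read_id)
        else:
            flagged.append(read_id)
    return stored, flagged


# Hypothetical motif frequencies in clinically relevant vs. other DNA.
scores = {"CCCA": odds_ratio(0.02, 0.01), "AAAA": odds_ratio(0.01, 0.01)}
reads = [("r1", "CCCA"), ("r2", "AAAA"), ("r3", "GGGG")]
print(enrich_in_silico(reads, scores, threshold=1.5))
```

Returning the flagged reads separately mirrors the option of marking low-likelihood reads rather than deleting them, so that a later analysis can still choose to exclude or inspect them.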
At block 4460, the stored sequence reads may be analyzed to determine characteristics of clinically relevant DNA in the biological sample, e.g., as described herein, e.g., in other flow charts. Methods 1900, 2000, and 4200 are such examples. For example, the characteristic of clinically relevant DNA in the biological sample can be the concentration fraction of clinically relevant DNA. As another example, the characteristic may be a level of pathology in a subject from which the biological sample is obtained, wherein the level of pathology is associated with clinically relevant DNA. As another example, the characteristic may be the gestational age of a fetus of the pregnant female from which the biological sample was obtained.
Other criteria may be used to determine the likelihood. Sequence reads can be used to measure the size of multiple free DNA fragments. The likelihood that a particular sequence read corresponds to clinically relevant DNA can be further based on the size of the free DNA fragment corresponding to the particular sequence read.
Methylation may also be used. Thus, embodiments can measure one or more methylation states at one or more sites of the free DNA fragment corresponding to a particular sequence read. The likelihood that the particular sequence read corresponds to clinically relevant DNA can be further based on the one or more methylation states. As a further example, whether the read is within a set of identified open chromatin regions can be used as a filter.
Fig. 45 shows an exemplary graph illustrating an increase in fetal DNA fraction using the CCCA terminal motif, according to an embodiment of the disclosure. The vertical axis is the fetal DNA fraction of the sample tested. The two sets of data are for (1) all fragments that overlap an informative SNP (i.e., a site at which fragments carrying a fetal-specific allele can be identified) and (2) fragments that have a CCCA terminal motif and overlap an informative SNP. Thus, the data on the left provide the actual fetal DNA fraction in the entire sample, and the data on the right provide the fetal DNA fraction of the in silico enriched sample. In this example, when the terminal motif is CCCA, the likelihood can be determined to be above the threshold. More motifs can be used in a similar manner, e.g., as a group indicating a likelihood above the threshold.
The median relative increase in fetal DNA fraction was 3.2% (IQR: 1.3-6.4%). The relative increase in fetal DNA fraction is defined by (b-a)/a x 100, where a is the original fetal DNA fraction calculated over all fragments overlapping the informative SNP, where the mother is homozygous and the fetus is heterozygous, and b is the fetal DNA fraction calculated over fragments labeled with the CCCA motif enriched in fetal DNA molecules.
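The relative increase defined above can be worked through with a short sketch. The function name and the example fractions are hypothetical and chosen only to reproduce a 3.2% relative increase. They are not measured values.

```python
def relative_increase(a, b):
    """Relative increase in fetal DNA fraction, in percent: (b - a) / a * 100.

    a: fetal DNA fraction over all fragments overlapping informative SNPs.
    b: fetal DNA fraction over the subset of those fragments with the motif.
    """
    return (b - a) / a * 100


# Hypothetical example: 10.00% before enrichment, 10.32% after.
print(round(relative_increase(0.1000, 0.1032), 1))
```

Because the measure is relative to the original fraction, a 3.2% relative increase on a 10% fetal fraction corresponds to an absolute change of about 0.32 percentage points.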
For any of the methods described herein, determining the sequence motif of each of the one or more terminal sequences of the free DNA fragment can be performed using a reference genome (e.g., via technique 160 of fig. 1). Such techniques may include: aligning one or more sequence reads corresponding to the free DNA fragment with the reference genome, identifying one or more bases in the reference genome that are adjacent to the terminal sequence, and determining the sequence motif using the terminal sequence and the one or more bases.
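The reference-based motif determination can be sketched as follows. The sketch assumes the fragment has already been aligned to the forward strand at 0-based half-open coordinates and uses a toy reference string. The function names and the choice of two bases on each side of the end are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def end_motifs_with_reference(reference, read_seq, start, end, inside=2, outside=2):
    """Motifs spanning each fragment end and the adjacent reference bases.

    reference: reference sequence (forward strand).
    read_seq: fragment sequence as aligned to reference[start:end].
    Returns (five_prime_motif, three_prime_motif), each read 5' to 3' on its
    own strand, or None when the flanking reference bases are unavailable.
    """
    if start - outside < 0 or end + outside > len(reference):
        return None
    # 5' end: reference bases just before the fragment, then fragment bases.
    five_prime = reference[start - outside:start] + read_seq[:inside]
    # 3' end: take the same span on the forward strand, then flip strands.
    three_prime = reverse_complement(read_seq[-inside:] + reference[end:end + outside])
    return five_prime, three_prime


toy_reference = "ACGTTGCAAGGCTA"
print(end_motifs_with_reference(toy_reference, "TGCAAG", start=4, end=10))
```

Taking the inner bases from the read rather than from the reference keeps any sequence variation present in the fragment itself. The outer bases can only come from the reference because they are not part of the fragment.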
V. Exemplary system
FIG. 46 shows a measurement system 4600 according to an embodiment of the invention. The system comprises a sample 4605, such as free DNA molecules within a sample holder 4610, wherein the sample 4605 can be contacted with an assay 4608 to provide a physical characteristic signal 4615. An example of a sample holder may be a flow cell containing probes and/or primers for an assay or a tube through which a droplet moves (where the droplet contains the assay). Detector 4620 detects a physical characteristic 4615 (e.g., fluorescence intensity, voltage, or current) from the sample. Detector 4620 may make measurements at intervals (e.g., periodic intervals) to obtain data points that constitute a data signal. In one embodiment, the analog-to-digital converter converts the analog signal from the detector to digital form multiple times. The sample holder 4610 and detector 4620 may form an assay device, for example, a sequencing device that performs sequencing according to embodiments described herein. Data signal 4625 is sent from detector 4620 to logic system 4630. Data signals 4625 may be stored in local memory 4635, external memory 4640, or storage device 4645.
Logic system 4630 may be or include a computer system, ASIC, microprocessor, or the like. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.). The logic system 4630 and other components may be part of a stand-alone or network-connected computer system, or the logic system may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes the detector 4620 and/or the sample holder 4610. The logic system 4630 may also include software executing in the processor 4650. Logic system 4630 may include a computer readable medium storing instructions for controlling measurement system 4600 to perform any of the methods described herein. For example, the logic system 4630 may provide commands to a system including the sample holder 4610 so that sequencing or other physical operations are performed. Such physical operations may be performed in a particular order, for example, adding and removing reagents in a particular order. Such physical manipulations can be performed by a robotic system (e.g., a robotic system comprising a robotic arm), as can be used to obtain a sample and perform an assay.
Any of the computer systems mentioned herein may utilize any suitable number of subsystems. In fig. 47, an example of such a subsystem is shown in computer system 10. In some embodiments, the computer system comprises a single computer device, wherein the subsystem may be a component of the computer device. In other embodiments, a computer system may include multiple computer devices with internal components, each computer device being a subsystem. Computer systems may include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
The subsystems shown in fig. 47 are interconnected by a system bus 75. Additional subsystems such as a printer 74, a keyboard 78, one or more storage devices 79, a monitor 76 (e.g., a display screen such as an LED display) coupled to a display adapter 82, etc. are shown. Peripheral devices and input/output (I/O) devices, which couple to I/O controller 71, can be connected to the computer system by any number of means known in the art, such as an input/output (I/O) port 77 (e.g., USB). For example, the I/O port 77 or external interface 81 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect the computer system 10 to a wide area network (e.g., the Internet), a mouse input device, or a scanner. The interconnection via system bus 75 allows central processor 73 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 72 or one or more storage devices 79 (e.g., a fixed disk such as a hard drive or optical disk), as well as the exchange of information between subsystems. System memory 72 and/or one or more storage devices 79 may be embodied as computer-readable media. Another subsystem is a data collection device 85 such as a camera, microphone, accelerometer, etc. Any of the data mentioned herein may be output from one component to another component and may be output to a user.
The computer system may include multiple identical components or subsystems connected together, for example, through external interface 81, through an internal interface, or via a removable storage device that may be connected and removed from one component to another. In some embodiments, computer systems, subsystems, or devices may communicate over a network. In such cases, one computer may be considered a client and another computer a server, where each computer may be part of the same computer system. The client and server may each include multiple systems, subsystems, or components.
Aspects of the embodiments may be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or a field programmable gate array) and/or in modular or integrated form using computer software with a general-purpose programmable processor. As used herein, a processor may include a single-core processor, a multi-core processor on the same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.
Any of the software components or functions described herein may be implemented as software code executed by a processor using any suitable computer language (e.g., Java, C++, C#, Objective-C, Swift, or a scripting language such as Perl or Python), for example, using conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. Suitable non-transitory computer readable media may include Random Access Memory (RAM), Read Only Memory (ROM), magnetic media such as a hard drive or floppy disk, or optical media such as a Compact Disc (CD), DVD (digital versatile disc), or Blu-ray disc, flash memory, and the like. A computer readable medium may be any combination of such storage or transmission devices.
Such programs may also be encoded and transmitted using carrier wave signals adapted for transmission over wired, optical, and/or wireless networks conforming to various protocols, including the internet. Thus, a computer readable medium may be produced using a data signal encoded with such a program. The computer readable medium encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via internet download). Any such computer-readable media may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may exist on or within different computer products within a system or network. The computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
Any of the methods described herein may be performed in whole or in part with a computer system comprising one or more processors, which may be configured to perform the steps. Accordingly, embodiments may be directed to a computer system configured to perform the steps of any of the methods described herein, possibly with different components performing the respective steps or respective groups of steps. Although presented as numbered steps, the method steps herein may be performed simultaneously or at different times or in a different order. In addition, portions of these steps may be used with portions of other steps from other methods. Also, all or part of the steps may be optional. Additionally, any of the steps of any of the methods may be performed by a module, unit, circuit, or other device of a system for performing such steps.
The specific details of the particular embodiments may be combined in any suitable manner without departing from the spirit and scope of the embodiments of the invention. However, other embodiments of the invention may be directed to specific embodiments directed to each individual aspect, or to specific combinations of these individual aspects.
The foregoing description of the exemplary embodiments of the present disclosure has been presented for the purposes of illustration and description. The foregoing description is not intended to be exhaustive or to limit the disclosure to the precise form described, and many modifications and variations are possible in light of the above teaching.
Recitation of "a" or "an" or "the" is intended to mean "one or more" unless specifically indicated to the contrary. The use of "or" is intended to mean "an inclusive or" rather than an "exclusive or" unless specifically indicated to the contrary. Reference to a "first" component does not necessarily require that a second component be provided. Furthermore, unless explicitly stated otherwise, reference to "a first" or "a second" component does not limit the referenced component to a particular position. The term "based on" is intended to mean "based at least in part on".
All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes. None is admitted to be prior art.
Claims (53)
1. A method of classifying a level of pathology in a biological sample of a subject, the biological sample comprising free DNA, the method comprising:
analyzing a plurality of free DNA fragments from the biological sample to obtain sequence reads, wherein the sequence reads comprise terminal sequences corresponding to the ends of the plurality of free DNA fragments;
for each of the plurality of free DNA fragments, determining a sequence motif for each of one or more terminal sequences of the free DNA fragment;
determining the relative frequency of a set of one or more sequence motifs corresponding to the terminal sequences of the plurality of free DNA fragments, wherein the relative frequency of sequence motifs provides a proportion of the plurality of free DNA fragments having terminal sequences corresponding to the sequence motifs;
determining a sum of the relative frequencies of the set of one or more sequence motifs; and
determining a classification of the subject's level of pathology based on a comparison of the summed value to a reference value.
2. The method of claim 1, further comprising:
filtering the free DNA to identify the plurality of free DNA fragments.
3. The method of claim 2, wherein the filtering is based on the size of the DNA fragments or the region from which the DNA fragments originate.
4. The method of claim 3, wherein the free DNA is filtered to obtain DNA fragments from open chromatin regions of a particular tissue.
5. The method of claim 1, wherein the pathology is cancer.
6. The method of claim 5, wherein the cancer is hepatocellular carcinoma, lung cancer, breast cancer, gastric cancer, glioblastoma multiforme, pancreatic cancer, colorectal cancer, nasopharyngeal carcinoma, or head and neck squamous cell carcinoma.
7. The method of claim 5, wherein the classification is determined from a plurality of cancer levels comprising a plurality of cancer stages.
8. The method of claim 1, wherein the pathology is an autoimmune disorder.
9. The method of claim 8, wherein the autoimmune disorder is systemic lupus erythematosus.
10. The method of claim 1, wherein the level of pathology corresponds to a concentration fraction of clinically relevant DNA associated with the pathology.
11. A method of estimating the concentration fraction of clinically relevant DNA in a biological sample of a subject, the biological sample comprising clinically relevant DNA and other free DNA, the method comprising:
analyzing a plurality of free DNA fragments from the biological sample to obtain sequence reads, wherein the sequence reads comprise terminal sequences corresponding to the ends of the plurality of free DNA fragments;
for each of the plurality of free DNA fragments, determining a sequence motif for each of one or more terminal sequences of the free DNA fragment;
determining the relative frequencies of a set of one or more sequence motifs corresponding to the terminal sequences of the plurality of free DNA fragments, wherein the relative frequency of a sequence motif provides a proportion of the plurality of free DNA fragments having terminal sequences corresponding to the sequence motif;
determining an aggregate value of the relative frequencies of the set of one or more sequence motifs; and
determining a classification of the clinically relevant DNA concentration fraction in the biological sample by comparing the aggregate value to one or more calibration values determined from one or more calibration samples for which clinically relevant DNA concentration fractions are known.
12. The method of claim 11, wherein the clinically relevant DNA is selected from the group consisting of: fetal DNA, tumor DNA, DNA from a transplanted organ, and DNA of a specific tissue type.
13. The method of claim 11, wherein the clinically relevant DNA is of a specific tissue type.
14. The method of claim 13, wherein the specific tissue type is of the liver or hematopoietic system.
15. The method of claim 11, wherein the subject is a pregnant female, and wherein the clinically relevant DNA is from placental tissue.
16. The method of claim 11, wherein the clinically relevant DNA is tumor DNA derived from an organ with cancer.
17. The method of claim 11, wherein the one or more calibration values are a plurality of calibration values of a calibration function determined using concentration fractions of clinically relevant DNA of a plurality of calibration samples.
18. The method of claim 11, wherein the one or more calibration values correspond to one or more aggregate values of the relative frequencies of the set of one or more sequence motifs measured using free DNA fragments in the one or more calibration samples.
19. The method of claim 11, further comprising:
for each of the one or more calibration samples:
measuring the clinically relevant DNA concentration fraction in the calibration sample; and
determining the aggregate value of the relative frequencies of the set of one or more sequence motifs by analyzing free DNA fragments from the calibration sample as part of obtaining calibration data points, thereby determining one or more aggregate values, wherein each calibration data point specifies the measured clinically relevant DNA concentration fraction in the calibration sample and an aggregate value determined for the calibration sample, and wherein the one or more calibration values are the one or more aggregate values or are determined using the one or more aggregate values.
20. The method of claim 19, wherein measuring the clinically relevant DNA concentration fraction in the calibration sample is performed using alleles specific for the clinically relevant DNA.
21. A method of determining the gestational age of a fetus by analyzing a biological sample from a female subject pregnant with a fetus, the biological sample comprising free DNA molecules from the female subject and the fetus, the method comprising:
analyzing a plurality of free DNA fragments from the biological sample to obtain sequence reads, wherein the sequence reads comprise terminal sequences corresponding to the ends of the plurality of free DNA fragments;
for each of the plurality of free DNA fragments, determining a sequence motif for each of one or more terminal sequences of the free DNA fragment;
determining the relative frequencies of a set of one or more sequence motifs corresponding to the terminal sequences of the plurality of free DNA fragments, wherein the relative frequency of a sequence motif provides a proportion of the plurality of free DNA fragments having terminal sequences corresponding to the sequence motif;
determining an aggregate value of the relative frequencies of the set of one or more sequence motifs;
obtaining one or more calibration data points, wherein each calibration data point specifies a gestational age corresponding to an aggregate value, and wherein the one or more calibration data points are determined from a plurality of calibration samples having a known gestational age and comprising free DNA molecules;
comparing the aggregate value to a calibration value of at least one calibration data point; and
estimating the gestational age of the fetus based on the comparison.
22. The method of claim 21, wherein the one or more calibration data points are a plurality of calibration data points that form a calibration function approximating the measured aggregate values determined from the free DNA molecules in the plurality of calibration samples with known gestational age.
23. The method of claim 21, wherein the aggregate value is compared to a plurality of calibration values, each calibration value corresponding to one of the plurality of calibration samples.
24. The method of claim 21, wherein the calibration value of the at least one calibration data point corresponds to an aggregate value measured using the free DNA molecules in at least one calibration sample of the plurality of calibration samples.
25. The method of claim 21, further comprising:
identifying the plurality of free DNA fragments as being derived from the fetus.
26. The method of claim 25, wherein the plurality of free DNA fragments are identified using a fetal-specific allele or a fetal-specific epigenetic marker.
27. The method of claim 25, wherein the plurality of free DNA fragments are identified by:
for each of the sequence reads:
determining a likelihood that the sequence read corresponds to the fetus based on an end sequence of the sequence read that includes one of the set of one or more sequence motifs;
comparing the likelihood to a threshold; and
identifying the sequence read as originating from the fetus when the likelihood exceeds the threshold.
28. The method of any one of claims 1-27, wherein the set of one or more sequence motifs comprises N base positions, wherein the set of one or more sequence motifs comprises all combinations of N bases and wherein N is an integer equal to or greater than 3.
29. The method of any one of claims 1-27, wherein the set of one or more sequence motifs is the top M sequence motifs with the largest difference between two types of DNA as determined in one or more reference samples, M being an integer equal to or greater than 1.
30. The method of claim 29, wherein the two types of DNA are the clinically relevant DNA and another DNA.
31. The method of claim 29, wherein the two types of DNA are from two reference samples with different classifications of the level of pathology.
32. The method of any one of claims 1-27, wherein the set of one or more sequence motifs is the top M most frequent sequence motifs occurring in one or more reference samples, M being an integer equal to or greater than 1.
33. The method of any one of claims 28-32, wherein the set of one or more sequence motifs comprises a plurality of sequence motifs and wherein the aggregate value comprises a sum of the relative frequencies of the set.
34. The method of claim 33, wherein the sum is a weighted sum.
35. The method of claim 34, wherein the aggregate value comprises an entropy term, and wherein the entropy term comprises a sum of terms comprising a weighted sum, each term comprising a relative frequency multiplied by a logarithm of the relative frequency.
36. The method of any one of claims 1-35, wherein the aggregate value corresponds to a variance of the relative frequency.
37. The method of any of claims 1-35, wherein the aggregate value comprises a final or intermediate output of a machine learning model.
38. The method of claim 37, wherein the machine learning model uses clustering, support vector machines, or logistic regression.
39. A method of enriching a biological sample for clinically relevant DNA, the biological sample comprising the clinically relevant DNA and other free DNA, the method comprising:
analyzing a plurality of free DNA fragments from the biological sample to obtain sequence reads, wherein the sequence reads comprise terminal sequences corresponding to the ends of the plurality of free DNA fragments;
for each of the plurality of free DNA fragments, determining a sequence motif for each of one or more terminal sequences of the free DNA fragment;
identifying a set of one or more sequence motifs that occur with greater relative frequency in the clinically relevant DNA than in the other DNA;
identifying a set of sequence reads having one of the set of one or more sequence motifs in a terminal sequence;
for each sequence read in the set of sequence reads:
determining a likelihood that the sequence read corresponds to the clinically relevant DNA based on an end sequence of the sequence read that includes one of the set of one or more sequence motifs;
comparing the likelihood to a threshold; and
storing the sequence read when the likelihood exceeds the threshold, thereby obtaining stored sequence reads; and
analyzing the stored sequence reads to determine a characteristic of the clinically relevant DNA in the biological sample.
40. The method of claim 39, wherein the characteristic of the clinically relevant DNA in the biological sample is (1) a concentration fraction of the clinically relevant DNA, (2) a level of pathology in a subject from which the biological sample is obtained, the level of pathology being associated with the clinically relevant DNA, or (3) a gestational age of a fetus of a pregnant female from which the biological sample is obtained.
41. The method of claim 39, further comprising:
measuring sizes of the plurality of free DNA fragments using the sequence reads, wherein determining the likelihood that a particular sequence read corresponds to the clinically relevant DNA is further based on the size of the free DNA fragment that corresponds to the particular sequence read.
42. The method of claim 39, further comprising:
measuring one or more methylation states at one or more sites of a free DNA fragment that corresponds to a particular sequence read, wherein determining the likelihood that the particular sequence read corresponds to the clinically relevant DNA is further based on the one or more methylation states.
43. The method of any one of claims 1-42, wherein determining the sequence motif for each of the one or more terminal sequences of the free DNA fragment comprises:
aligning one or more sequence reads corresponding to the free DNA fragment to a reference genome;
identifying one or more bases in the reference genome that are adjacent to the terminal sequence; and
determining the sequence motif using the terminal sequence and the one or more bases.
44. A method of enriching a biological sample for clinically relevant DNA, the biological sample comprising the clinically relevant DNA and other free DNA, the method comprising:
receiving a plurality of free DNA fragments from the biological sample, wherein the clinically relevant DNA fragments have terminal sequences that include sequence motifs that occur at a greater relative frequency than in the other DNA;
subjecting the plurality of free DNA fragments to one or more probe molecules that detect the sequence motifs in the terminal sequences of the plurality of free DNA fragments, thereby obtaining detected DNA fragments; and
enriching the clinically relevant DNA fragments in the biological sample using the detected DNA fragments.
45. The method of claim 44, wherein enriching the clinically relevant DNA fragments in the biological sample using the detected DNA fragments comprises:
amplifying the detected DNA fragments.
46. The method of claim 45, wherein the one or more probe molecules comprise one or more enzymes that interrogate the plurality of free DNA fragments and append new sequences for amplifying the detected DNA fragments.
47. The method of claim 44, wherein enriching the clinically relevant DNA fragments in the biological sample using the detected DNA fragments comprises:
capturing the detected DNA fragments; and
discarding the undetected DNA fragments.
48. The method of claim 47, wherein one or more probe molecules are attached to a surface and the sequence motif in the terminal sequence is detected by hybridization.
49. A computer product comprising a computer-readable medium storing a plurality of instructions for controlling a computer system to perform the method of any one of claims 1-48.
50. A system, the system comprising:
the computer product of claim 49; and
one or more processors configured to execute instructions stored on the computer-readable medium.
51. A system comprising means for performing the method of any of claims 1-48.
52. A system comprising one or more processors configured to perform the method of any one of claims 1-48.
53. A system comprising means for performing the steps of the method according to any one of claims 1-48, respectively.
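By way of non-limiting illustration only, the motif-frequency and aggregate-value steps recited in claims 1, 33 and 35 might be sketched in Python as follows. The function names, the choice of 4-base motifs, and the use of the first bases of each read as the terminal sequence are assumptions made for this sketch and are not taken from the claims; in particular, the sketch reads the motif directly from the read rather than from an aligned reference genome as in claim 43, and it scales the entropy by the logarithm of the number of possible motifs, which is one possible normalization among others.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"


def end_motifs(reads, k=4):
    """Collect the k-base motif at the start of each read (assumed 5' fragment end).

    Reads shorter than k or containing non-ACGT bases in the motif are skipped.
    """
    motifs = []
    for read in reads:
        motif = read[:k].upper()
        if len(motif) == k and set(motif) <= set(BASES):
            motifs.append(motif)
    return motifs


def relative_frequencies(motifs, k=4):
    """Proportion of fragment ends carrying each of the 4**k possible motifs."""
    counts = Counter(motifs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable fragment ends")
    all_motifs = ("".join(p) for p in product(BASES, repeat=k))
    return {m: counts[m] / total for m in all_motifs}


def summed_frequency(freqs, motif_set):
    """Aggregate value as a plain sum of relative frequencies over a motif set."""
    return sum(freqs[m] for m in motif_set)


def normalized_entropy(freqs):
    """Aggregate value as an entropy term: -sum(p * log p), scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return h / math.log(len(freqs))


# Toy input for illustration only; these are not real sequencing reads.
reads = ["CCCAGTTA", "CCCATTGA", "AAAACGTA", "CCTGAAAC"]
freqs = relative_frequencies(end_motifs(reads))
print(summed_frequency(freqs, {"CCCA", "AAAA"}))
print(normalized_entropy(freqs))
```

Either aggregate value could then be compared against a reference value or against calibration values obtained from calibration samples, as the claims recite; the direction and size of any threshold would have to be determined from such samples and are not assumed here.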
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US62/782,316 | 2018-12-19 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40058434A (en) | 2022-04-22 |
| HK40058434B (en) | 2024-06-07 |