
HK1251263B - Accurate quantification of fetal dna fraction by shallow-depth sequencing of maternal plasma dna - Google Patents


Info

Publication number
HK1251263B
Authority
HK
Hong Kong
Prior art keywords
reads
biological sample
loci
dna molecules
sites
Prior art date
Application number
HK18110675.1A
Other languages
Chinese (zh)
Other versions
HK1251263A1 (en)
Inventor
卢煜明
江培勇
陈君赐
赵慧君
Original Assignee
香港中文大学
Priority date
Filing date
Publication date
Application filed by 香港中文大学
Priority claimed from PCT/CN2016/099682 (WO2017050244A1)
Publication of HK1251263A1
Publication of HK1251263B


Description

Accurate quantification of fetal DNA fraction by shallow depth sequencing of maternal plasma DNA
Cross Reference to Related Applications
This application claims priority to U.S. application 62/222,157, filed on September 22, 2015, the contents of which are incorporated herein by reference for all purposes.
Background
The discovery of circulating cell-free fetal DNA in maternal plasma [Lo YM et al (1997), Lancet, 350: 485-487] has catalyzed a series of novel approaches to noninvasive prenatal diagnosis, including fetal RhD blood group genotyping [Lo YM et al (1998), N Engl J Med, 339: 1734-1738; Finning K et al (2008), BMJ, 336: 816-818], fetal sex determination for sex-linked disorders [Costa JM, Benachi A, Gautier E (2002), N Engl J Med, 346: 1502], chromosomal aneuploidy detection [Lo YM et al (2007), Proc Natl Acad Sci U S A, 104: 13116-13121; Chiu RW et al (2008), Proc Natl Acad Sci U S A, 105: 20458-; Chiu RW, Cantor CR, Lo YM (2009), Trends Genet, 25: 324-; Fan HC et al (2008), Proc Natl Acad Sci U S A, 105: 16266-; Chiu RW et al (2011), BMJ, 342: c7401; Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-8588], and the detection of monogenic diseases [Lo YMD et al (2010), Sci Transl Med, 2: 61ra91; Lam KW et al (2012), Clin Chem; New MI et al (2014), J Clin Endocrinol Metab, 99: E1022-E1030; Yoo S-K et al (2015), Clin Chem; Ma D et al (2014), Gene, 544: 252-258; Tsui NB et al (2011), Blood, 117: 3684-].
In the aforementioned applications, accurate determination of the fetal DNA fraction (also referred to as fractional fetal DNA concentration or percent fetal DNA) is important for the accurate statistical interpretation of the results of noninvasive prenatal diagnosis using plasma DNA, especially where a statistical model that depends on this parameter is used to detect chromosomal aneuploidies [Sparks AB et al (2012), Am J Obstet Gynecol, 206: 319 e311-319] and where the inheritance of monogenic diseases is to be determined [Lo YM et al (2007), Proc Natl Acad Sci U S A, 104: 13116-13121; Lo YMD et al (2010), Sci Transl Med, 2: 61ra91; Lam KW et al (2012), Clin Chem; New MI et al (2014), J Clin Endocrinol Metab, 99: E1022-E1030; Yoo S-K et al (2015), Clin Chem; Tsui NB et al (2011), Blood, 117: 3684-]. For example, the fetal DNA fraction is a central parameter in relative haplotype dosage (RHDO) analysis, which determines which maternal haplotype has been transmitted to the fetus [Lo YMD et al (2010), Sci Transl Med, 2: 61ra91; Lam KW et al (2012), Clin Chem; New MI et al (2014), J Clin Endocrinol Metab, 99: E1022-E1030]. The rationale of this diagnostic method is that the maternal haplotype transmitted to the fetus will be slightly over-represented relative to the untransmitted haplotype, and the fetal DNA fraction is used to determine the statistical significance of the over-representation.
To date, a number of methods have been developed to estimate the fractional fetal DNA concentration in the maternal plasma of a pregnant woman. For example, specific signals derived from the Y chromosome are used to infer the fetal DNA fraction in pregnant women carrying male fetuses [Chiu RW et al (2011), BMJ, 342: c7401; Lo YM et al (1998), Am J Hum Genet, 62: 768-775; Lun FM et al (2008), Clin Chem, 54: 1664-; Hudecova I et al (2014), PLoS One, 9: e88484]. However, methods based on Y-chromosome-specific signals are not applicable to pregnant women carrying female fetuses. Another approach uses single nucleotide polymorphisms (SNPs): the ratio of fetal-specific alleles to shared alleles is calculated to infer the fetal DNA fraction. In this approach, the genotype information must be known and should fall into one of the following categories: (a) the mother is homozygous and the fetus is heterozygous; or (b) both the father and the mother are homozygous, but for different alleles [Lo YMD et al (2010), Sci Transl Med, 2: 61ra91; Liao GJ et al (2011), Clin Chem, 57: 92-101]. However, on the one hand, in the actual clinical setting of noninvasive prenatal diagnosis the fetal genotype is not available beforehand. On the other hand, the rate of paternal discrepancy can be as high as 30%, as shown by epidemiological studies of paternal discrepancy around the world [Bellis MA, Hughes K, Hughes S, Ashton JR (2005), J Epidemiol Community Health, 59: 749-]. Algorithms independent of parental genotypes have been developed to avoid the prerequisite of additional genotype information, but they rely on high-depth sequencing of maternal plasma DNA spanning different SNP sites (e.g., targeted sequencing of maternal plasma DNA) [Jiang P et al (2012), Bioinformatics, 28: 2883-; Liao GJ et al (2011), Clin Chem, 57: 92-101].
In addition to SNP-dependent methods, methods that do not depend on SNPs are being explored. For example, the fragment size of maternal plasma DNA can be used to estimate the fetal DNA fraction [Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-8588; Kim SK et al (2015), Prenat Diagn: n/a-n/a], because fetal-derived DNA is usually shorter than maternal DNA [Lo YMD et al (2010), Sci Transl Med, 2: 61ra91]. However, other conditions, such as systemic lupus erythematosus, can affect the accuracy of size-based estimation of the fetal DNA fraction [Chan RW et al (2014), Proc Natl Acad Sci U S A, 111: E5302-5311]. Alternatively, fetal-specific epigenetic changes, such as methylated RASSF1A and unmethylated SERPINB5 sequences, have been shown to serve as fetal markers for predicting the fetal DNA fraction regardless of genotype information [Chan KC et al (2006), Clin Chem, 52: 2211-; Chim SS et al (2005), Proc Natl Acad Sci U S A, 102: 14753-]. However, the analytical steps used to quantify these epigenetic markers involve bisulfite conversion or digestion with methylation-sensitive restriction enzymes, which may affect the accuracy of these methods.
Therefore, new techniques are needed to provide fetal DNA fraction information from maternal plasma.
Disclosure of Invention
Embodiments of the present invention provide methods, systems, and apparatus for deriving a fraction of fetal DNA in maternal plasma. The fetal DNA fraction can be determined without the need to specifically determine the paternal or fetal genotype. Individual parameters may be determined and the calibration curve may be used to determine the actual fetal DNA fraction. For example, a ratio of the amount of reads having an allele nominally identified as a non-maternal allele to the amount of reads having an allele nominally identified as a maternal allele can be determined. As another example, a ratio of the amount of a locus showing a nominal non-maternal allele to the amount of a homozygous maternal locus determined from a separate data set can be determined. Differences in read size may also be used. The loci (sites) can be limited to known heterozygous loci in the population.
Maternal genotype information can be obtained from samples of maternal DNA only, or can be assumed from sequencing (e.g., at shallow depths) of biological samples with both maternal and fetal DNA molecules. The actual or hypothetical maternal genotype information can be combined with the sequencing of DNA molecules from a biological sample. Although it may not be explicitly known that a mother is homozygous at a particular locus or that a fetus is heterozygous, embodiments may use readings at such loci to determine individual parameters, which is a distinction from prior art techniques. Any errors are verified to be consistent and thus can be compensated by a calibration curve that can be generated once using a separate technique to determine the fetal DNA fraction.
Because sequencing can be at shallow depths, even if non-maternal alleles are present, the locus may have only few reads and may not show non-maternal alleles. However, the normalized parameters characterizing the sequenced non-maternal alleles can be used to provide an accurate estimate of the fetal DNA fraction, even if the amount of non-maternal alleles at one locus or all loci is not representative of the fetal DNA fraction. These normalization parameters may include the amount of sequence reads with non-maternal alleles or the amount of loci with non-maternal alleles. The methods described herein may not require high depth sequencing or enrichment of specific regions. Thus, these methods can be incorporated into widely used noninvasive prenatal testing and other diagnostics.
Some embodiments relate to systems and computer-readable media associated with the methods described herein.
A better understanding of the nature and advantages of embodiments of the present invention may be gained with reference to the following detailed description and the accompanying drawings.
Drawings
FIG. 1 is a schematic of fractional fetal DNA concentration measurement using maternal genotype, according to an embodiment of the invention.
Fig. 2A shows a linear regression model of actual fetal DNA fraction and non-maternal allele fraction constructed from a training data set from a first data set, according to an embodiment of the invention.
FIG. 2B shows a validation of the regression model of FIG. 2A using independent data sets, according to an embodiment of the invention.
Fig. 3A shows a linear regression model of actual fetal DNA fraction and non-maternal allele fraction constructed from a training data set from a second data set, according to an embodiment of the invention.
FIG. 3B shows a validation of the regression model of FIG. 3A using independent data sets, according to an embodiment of the invention.
FIG. 4A shows a deviation between an actual fetal DNA fraction and an estimated fetal DNA fraction of a first data set according to an embodiment of the invention.
FIG. 4B shows a deviation between the actual fetal DNA fraction and the estimated fetal DNA fraction of the second data set according to an embodiment of the invention.
FIG. 5 shows a graph of relative prediction error versus actual fetal DNA fraction, according to an embodiment of the invention.
Fig. 6A, 6B, 6C, and 6D show the accuracy of fetal DNA fraction predictions at various sequencing depths according to embodiments of the invention.
Fig. 7 shows a method of measuring a fetal DNA fraction in a biological sample of a pregnant woman carrying a fetus using amounts of reads, according to an embodiment of the invention.
Fig. 8 shows a graphical representation of measuring fetal DNA fraction without obtaining maternal genotype, paternal genotype, or a biological sample containing only maternal DNA molecules, according to an embodiment of the invention.
Fig. 9 shows a method of measuring a fetal DNA fraction in a biological sample of a pregnant woman carrying a fetus using various amounts of loci, according to an embodiment of the invention.
FIG. 10A shows a calibration curve for a linear regression model from fetal DNA fraction and Apparent Allele Difference (AAD) values, according to an embodiment of the invention.
FIG. 10B shows a linear regression plot based on fetal DNA fraction and short DNA molecule ratio, according to an embodiment of the invention.
Fig. 10C shows a graph of the fetal DNA fraction determined from AAD values versus the fetal DNA fraction based on the proportion of reads derived from the Y chromosome, according to an embodiment of the invention.
Fig. 11 shows a method of measuring a fetal DNA fraction in a biological sample of a pregnant woman carrying a fetus according to an embodiment of the invention.
Figs. 12A, 12B, 12C, and 12D illustrate the relationship between DNA molecule size of maternal and non-maternal alleles according to an embodiment of the invention.
Fig. 13 shows a method of measuring a fraction of fetal DNA in a biological sample of a pregnant woman carrying a fetus using size values, according to an embodiment of the invention.
FIG. 14 is a table of fetal DNA fractions calculated for six different groups of twins, according to an embodiment of the invention.
FIG. 15 is a graph of fetal DNA fraction versus loci showing size differences according to an embodiment of the invention.
Fig. 16 shows a method of measuring a fetal DNA fraction in a biological sample of a pregnant woman carrying a fetus using loci of various amounts of DNA molecules having a specific size, according to an embodiment of the invention.
FIG. 17 shows a block diagram of an example computer system that may be used with the systems and methods according to embodiments of the invention.
FIG. 18 shows a sequencing system according to an embodiment of the invention.
FIG. 19 shows a computer system according to an embodiment of the invention.
Terms
As used herein, the term "locus" or its plural "loci" is a location or address of a nucleotide (or base pair) sequence of any length that varies between genomes. "Sequence read" refers to a string of nucleotides sequenced from any part or all of a nucleic acid molecule. For example, a sequence read can be a short string of nucleotides (e.g., 20-150) sequenced from a nucleic acid fragment, a short string of nucleotides at one or both ends of a nucleic acid fragment, or the sequence of an entire nucleic acid fragment present in the biological sample. Sequence reads may be obtained in a variety of ways, for example using sequencing techniques, using probes (e.g., in a hybridization array or as capture probes), or using amplification techniques such as polymerase chain reaction (PCR), linear amplification using a single primer, or isothermal amplification.
"Biological sample" refers to any sample obtained from a subject (e.g., a human, such as a pregnant woman, a cancer patient, a person suspected of having cancer, an organ transplant recipient, or a subject suspected of having a disease process involving an organ, e.g., the heart in myocardial infarction, the brain in stroke, or the hematopoietic system in anemia). The biological sample can be a bodily fluid, such as blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluid, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc. Stool samples can also be used. Most of the DNA in some biological samples (e.g., a plasma sample obtained by a centrifugation protocol) may be cell-free, e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA may be cell-free. The centrifugation protocol may include, for example, centrifuging at 3,000g for 10 minutes, obtaining the fluid fraction, and then centrifuging again at, for example, 30,000g for an additional 10 minutes to remove residual cells. The cell-free DNA in the sample may be derived from cells of various tissues, and thus the sample may comprise a mixture of cell-free DNA.
"nucleic acid" may refer to deoxyribonucleotides or ribonucleotides and polymers thereof in either single-or double-stranded form. The term may encompass nucleic acids containing known nucleotide analogs or modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, which have similar binding properties as the reference nucleic acid, and which are metabolized in a manner similar to the reference nucleotides. Examples of such analogs can include, but are not limited to, phosphorothioate, phosphoramidite, methyl phosphonate, chiral methyl phosphonate, 2-O-methyl ribonucleotide, peptide-nucleic acid (PNA).
Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences as well as the sequence explicitly indicated. In particular, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed base and/or deoxyinosine residues (Batzer et al, Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al, J.Biol.Chem.) -260: 2605-2608 (1985); Rossolini et al, molecular and cellular probes (mol.cell.Probes)8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, mRNA, oligonucleotide, and polynucleotide.
In addition to referring to naturally occurring ribonucleotide or deoxyribonucleotide monomers, the term "nucleotide" can be understood to refer to related structural variants thereof, including derivatives and analogs, that are functionally equivalent with respect to the particular context in which the nucleotide is being used (e.g., hybridization to a complementary base), unless the context clearly dictates otherwise.
"Sequence read" refers to a string of nucleotides sequenced from any part or all of a nucleic acid molecule. For example, a sequence read may be the entire nucleic acid fragment present in the biological sample. Sequence reads can be obtained from single-molecule sequencing.
"Classification" refers to any numerical value or other character associated with a particular characteristic of a sample. For example, a "+" symbol (or the word "positive") may indicate that the sample is classified as having deletions or amplifications. The classification may be binary (e.g., positive or negative) or have more levels (e.g., a scale of 1 to 10 or 0 to 1). The terms "cutoff" and "threshold" refer to predetermined values used in an operation. For example, a cutoff size may refer to a size above which fragments are excluded. A threshold may be a value above or below which a particular classification applies. Either of these terms may be used in either of these contexts.
The term "size profile" generally relates to the sizes of DNA fragments in a biological sample. The size profile may be a histogram that provides the distribution of the amount of DNA fragments at various sizes. Various statistical parameters (also referred to as size parameters or just parameters) may be used to distinguish one size profile from another. One parameter is the percentage of DNA fragments of a particular size or size range relative to all DNA fragments or relative to DNA fragments of another size or range.
Detailed Description
Noninvasive prenatal testing (NIPT) using massively parallel sequencing of maternal plasma DNA is increasingly recognized as an important component in modern prenatal diagnosis and has been rapidly used in clinical applications worldwide. To ensure accurate interpretation of such non-invasive prenatal diagnosis, fetal DNA fraction becomes a key parameter to be measured. Although various methods have been developed to estimate this parameter, few are generally and widely applicable.
Some embodiments allow for accurate estimation of the fetal DNA fraction using actual or assumed maternal genotypes and random massively parallel sequencing of maternal plasma. The fetal DNA fraction may be related to a parameter that characterizes the amount of non-maternal material in the biological sample. The amount of non-maternal material can be calculated as the fraction of sequence reads having non-maternal alleles, or as the proportion of loci with non-maternal alleles. In either calculation, the parameter does not exactly represent the actual fetal DNA fraction. Sequencing may be performed at shallow depth, so that not all of the non-maternal alleles present are sequenced. In addition, the calculation of the non-maternal fraction of sequence reads may include reads at sites without non-maternal alleles. Including these reads means that the calculation covers potentially homozygous sites that conventional methods of determining the fetal DNA fraction would not normally use. Similarly, calculating the proportion of loci with non-maternal alleles can involve many potentially homozygous loci, which are not generally considered informative in conventional methods of calculating the fetal DNA fraction.
However, the fetal DNA fraction was found to correlate with the fraction of non-maternal alleles at maternal homozygous loci in the plasma of pregnant women, even though the fraction of non-maternal alleles did not account for all non-maternal alleles. Furthermore, the fetal DNA fraction was found to correlate with the proportion of loci with non-maternal alleles, even when the sequence reads did not reveal all loci with non-maternal alleles. These methods were validated with experimental data. Because they use shallow-depth sequencing, the methods may be more efficient and economical than traditional methods. Furthermore, these methods do not rely on the paternal genotype or on any particular genetic trait of the fetus, and are therefore widely applicable to any pregnant woman. These methods may further enhance the clinical interpretation of noninvasive prenatal testing.
I. Quantitative analysis of DNA Using sequence reads
Maternal DNA alone can be sequenced and compared to DNA in a sample containing both maternal and fetal DNA to estimate the fetal DNA fraction. DNA from the mother only can be sequenced to identify homozygous sites. A sample (e.g., maternal plasma or serum) containing a mixture of maternal and fetal DNA can then be sequenced. In the mixture, some identified homozygous sites may have sequence reads with non-maternal alleles, while other identified homozygous sites may have only sequence reads with alleles identical to the maternal allele. The reads with the non-maternal allele and the reads with the maternal allele can be used to calculate the non-maternal fraction. Such a non-maternal fraction, whose denominator may include sites showing only the maternal allele, may not equal the actual fetal DNA fraction. However, the non-maternal fraction may be related to the fetal DNA fraction: a higher fetal DNA fraction may result in a higher non-maternal fraction. A calibration curve relating fetal DNA fraction to non-maternal fraction can be used to obtain an estimate of the fetal DNA fraction from the non-maternal fraction calculated for a sample.
However, samples containing both maternal and fetal DNA may be sequenced at shallow depth, and a locus may have only one or two reads. Even if all reads at a locus in the mixed maternal and fetal sample show the maternal allele, it cannot be determined with high statistical confidence that the locus is homozygous in the fetus, because a non-maternal fetal allele may be present but not captured in only a few reads. Shallow-depth sequencing may therefore underestimate the actual number of non-maternal alleles in fetal DNA.
Even though the measured fraction of non-maternal alleles may not be the actual fraction of non-maternal alleles, it may be used with a calibration curve to obtain an accurate fetal DNA fraction. The fetal DNA fraction was found to correlate with the non-maternal fraction even when the non-maternal fraction was underestimated or otherwise did not include an accurate count of non-maternal alleles. A higher fetal DNA fraction increases the likelihood that non-maternal alleles will be sequenced, thereby increasing the non-maternal fraction. Thus, even at shallow depth, the relationship between the non-maternal fraction and the fetal DNA fraction can be represented in a calibration curve and used to estimate the fetal DNA fraction.
A. Non-maternal fraction and calibration curve
The non-maternal fraction of a sample containing maternal and fetal DNA is the ratio of a first amount of reads, which have the non-maternal allele, to a second amount of reads. Both amounts may be counted at certain sites in the mother's genome, including sites known to have a high likelihood of being heterozygous in the population (i.e., sites with SNPs). The second amount can include reads from the DNA mixture sample that have the maternal allele. In some embodiments, the second amount can be the total amount of reads at the sites, where the total amount is the sum of the first amount and the amount of reads with the maternal allele.
This non-maternal fraction may not equal the actual fraction of non-maternal alleles present in the biological sample. Rather, the non-maternal fraction reflects the sequencing reads in maternal plasma that show non-maternal alleles. Thus, the non-maternal fraction may depend on sequencing errors, genotyping errors, the underlying number of sites where the mother is homozygous and the fetus is heterozygous (informative SNP sites), and the fetal DNA fraction. The results show that sequencing errors, genotyping errors, and the underlying number of informative sites are relatively constant. Thus, the fetal DNA fraction can be determined from the non-maternal allele fraction.
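The dependence described above can be illustrated with a deliberately simplified model. The sketch below is an assumption made for illustration, not a formula from this description. It supposes that at an informative site (mother AA, fetus AB) half of the fetal fragments carry the non-maternal allele, that a fixed share of the selected maternal homozygous sites is informative, and that sequencing and genotyping errors add a constant background. The function name, parameter names, and example values are hypothetical.

```python
def expected_non_maternal_fraction(fetal_fraction, informative_share, error_rate):
    """Expected share of reads showing a non-maternal allele at maternal homozygous sites.

    Simplified illustrative model (an assumption, not the patent's formula):
    - at an informative site (mother AA, fetus AB), half of the fetal fragments
      carry the B allele, so a read shows B with probability fetal_fraction / 2;
    - informative_share is the share of reads that fall on informative sites;
    - error_rate is a constant background from sequencing and genotyping errors.
    """
    signal = informative_share * fetal_fraction / 2.0
    return signal + error_rate


if __name__ == "__main__":
    # Hypothetical values chosen only to show the linear trend.
    for f in (0.05, 0.10, 0.20):
        x = expected_non_maternal_fraction(f, informative_share=0.1, error_rate=0.002)
        print(f"fetal fraction {f:.2f} -> expected non-maternal fraction {x:.4f}")
```

Under these assumptions the expected non-maternal fraction is a linear function of the fetal DNA fraction, with the slope set by the share of informative sites and the intercept set by the error background. This is consistent with using a calibration curve when those two quantities are roughly constant across samples.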
Figure 1 shows the use of the non-maternal fraction for determining the fetal DNA fraction. Homozygous sites are identified in the maternal DNA. Maternal plasma is sequenced, and reads of alleles are counted at the identified homozygous sites. Even though non-maternal alleles are not sequenced at some sites, the non-maternal fraction is calculated from the sums of the allele reads at these sites. The resulting non-maternal fraction can then be compared to a calibration curve built from previously measured fractional fetal DNA concentrations and previously calculated non-maternal fractions. An estimated fractional fetal DNA concentration can thus be obtained.
At section 110, the maternal genotype is obtained from maternal tissue, for example by analyzing buffy coat or buccal swab samples using microarray-based genotyping techniques. In other embodiments, maternal genotyping may also be performed using a sample comprising a mixture of fetal and maternal DNA.
Section 110 shows the homozygous sites in the maternal genotype. Each site has two A alleles, as shown in the diagram. The A allele may or may not correspond to a SNP with an A nucleotide. Although section 110 does not show heterozygous sites in the mother's genome, heterozygous sites may be located between homozygous sites. Homozygous sites may be limited to sites known to have single nucleotide polymorphisms (SNPs), which can be identified in databases such as dbSNP or HapMap. Maternal sites that appear homozygous can be identified from the genotyping information and aligned to the reference genome. Genotyping may be performed using any suitable genotyping technique, for example sequencing (which may include alignment to a reference genome), targeted sequencing, amplicon-based sequencing, mass spectrometry, droplet digital PCR, hybridization arrays, or microarrays.
The number of homozygous sites used may depend on the microarray platform used. For example, there are about 700,000 homozygous sites for Affymetrix and about 2 million for the BeadChip. Thus, there are enough sites for the examples to focus on SNP sites rather than on arbitrary sites across the entire genome, although the latter is also possible.
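The selection of maternal homozygous sites can be sketched as a simple filter over genotype calls. The example below is a minimal illustration. The site identifiers, the two-letter call format, and the function name are hypothetical and do not correspond to the output format of any particular genotyping platform.

```python
def homozygous_maternal_sites(genotype_calls):
    """Return {site_id: maternal_allele} for sites where both called alleles match.

    genotype_calls maps a site identifier to a two-letter genotype call such as
    "AA" or "AG". Identifiers and call format are hypothetical.
    """
    selected = {}
    for site_id, call in genotype_calls.items():
        # Keep only well-formed calls whose two alleles are identical.
        if len(call) == 2 and call[0] == call[1]:
            selected[site_id] = call[0]
    return selected


if __name__ == "__main__":
    calls = {"site1": "AA", "site2": "AG", "site3": "CC", "site4": "TC"}
    print(homozygous_maternal_sites(calls))
```

Keeping the maternal allele alongside each site makes it straightforward to classify a later plasma read at that site as maternal (same allele) or non-maternal (any other allele).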
At section 115, maternal plasma DNA is sequenced. Maternal plasma includes a number of DNA fragments, which may include alleles at the identified homozygous sites. Section 115 shows fragments with the A allele (the maternal allele present at the homozygous site) and fragments with the B allele (a non-maternal allele not present in the mother at the homozygous site). Maternal plasma DNA can be sequenced using massively parallel sequencing, and can be sequenced at shallow depth. For example, the number of sequencing reads can correspond to less than 0.1x, 0.2x, 0.3x, 0.4x, 0.5x, 0.8x, 1x, 1.5x, 2x, 3x, 4x, 5x, or 10x coverage of a haploid human genome. The number of reads can be less than or equal to 50 million, including less than or equal to 30 million, 20 million, 15 million, 10 million, 8 million, 5 million, 4 million, 2 million, or 1 million reads. The sequence reads obtained in section 110 may also be obtained at shallow depth. In that case, the homozygous genotypes may be inaccurate (i.e., the woman may have a B allele at one of the loci identified as homozygous), but the results show that such inaccuracies are consistent between samples, allowing the calibration curve to provide fetal DNA fractions with the required accuracy.
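To relate read counts to the coverage figures above, the following sketch converts a number of reads into approximate haploid genome coverage. The genome size of 3.1 billion base pairs and the 50 bp read length are assumptions for illustration; the description does not specify a read length. The second function assumes reads fall on the genome at random (a Poisson model), which is also an assumption.

```python
import math

# Approximate haploid human genome size in base pairs (assumption for illustration).
HAPLOID_GENOME_BP = 3.1e9


def haploid_coverage(num_reads, read_length_bp, genome_bp=HAPLOID_GENOME_BP):
    """Average number of times each base is covered: reads * length / genome size."""
    return num_reads * read_length_bp / genome_bp


def probability_site_uncovered(coverage):
    """Chance that a given site has no read, assuming Poisson-distributed coverage."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical run: 10 million reads of 50 bp each.
    cov = haploid_coverage(10_000_000, 50)
    print(f"coverage: {cov:.3f}x")
    print(f"share of sites with no read: {probability_site_uncovered(cov):.3f}")
```

At coverage well below 1x, most sites receive no read at all and covered sites typically have a single read, which is why per-site genotypes cannot be called and the reads are pooled across sites instead.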
At section 120, sequence reads from maternal plasma (or another sample with a mixture of maternal and fetal DNA) are mapped to a reference genome. Mapping can be restricted to the identified homozygous sites in the mother's genome. Alignment to the homozygous sites distinguishes B non-maternal alleles, which are typically contributed by the father but may also result from sequencing errors, de novo mutations, and other causes mentioned herein. As described above, where the genotyping at section 110 is performed at shallow depth, the B allele may also come from the mother.
The sequence reads of the maternal plasma are then summed separately for the A maternal allele and the B non-maternal allele. The sequence reads with the B non-maternal allele at the identified homozygous sites are summed. The sequence reads with the A maternal allele at the identified sites are also summed, including at sites where no B non-maternal allele was sequenced.
At section 130, the non-maternal allele fraction is determined. To calculate the non-maternal allele fraction, the total number of sequence reads with the B non-maternal allele at the homozygous sites, ΣB, is obtained from the reads in section 120. The total number of sequence reads with either the B non-maternal allele or the A maternal allele, Σ(A + B), is also obtained from the reads in section 120. The non-maternal allele fraction is calculated as the ratio of the total number of sequence reads having the B non-maternal allele to the total number of sequence reads having either the A maternal allele or the B non-maternal allele, with the ratio converted to a percentage:
\[ \text{non-maternal allele fraction (\%)} = \frac{\sum B}{\sum (A + B)} \times 100\% \qquad (1) \]
Other related fractions or percentages may also be used. For example, the total number of sequence reads with B non-maternal alleles can be divided by the total number of sequence reads with only A maternal alleles. The inverse of any of the fractions described herein may also be used.
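The pooled calculation described at sections 120 and 130 can be sketched as follows: reads are summed over all selected sites, including sites where no B allele was seen, and the ratio is expressed as a percentage. The function names and the per-site counts are hypothetical toy values, chosen only to make the arithmetic easy to follow.

```python
def non_maternal_allele_fraction(site_counts):
    """Percentage of reads with the B allele: 100 * sum(B) / sum(A + B).

    site_counts is a list of (a_reads, b_reads) pairs, one per maternal
    homozygous site. Sites with zero B reads still contribute their A reads.
    """
    total_a = sum(a for a, _ in site_counts)
    total_b = sum(b for _, b in site_counts)
    total = total_a + total_b
    if total == 0:
        raise ValueError("no reads at the selected sites")
    return 100.0 * total_b / total


def non_maternal_to_maternal_ratio(site_counts):
    """Alternative parameter: sum(B) / sum(A)."""
    total_a = sum(a for a, _ in site_counts)
    total_b = sum(b for _, b in site_counts)
    if total_a == 0:
        raise ValueError("no maternal-allele reads at the selected sites")
    return total_b / total_a


if __name__ == "__main__":
    # Toy counts: most sites have one or two reads, as at shallow depth.
    counts = [(2, 0), (1, 0), (0, 1), (3, 0), (1, 1)]
    print(non_maternal_allele_fraction(counts))
    print(non_maternal_to_maternal_ratio(counts))
```

Pooling before dividing, rather than averaging per-site ratios, keeps sites with a single read from dominating the result.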
In fact, the fraction of non-maternal alleles is controlled by fetal DNA fraction as well as sequencing and genotyping errors. Assuming that the error from the genotyping and sequencing platform is a systematic error that is relatively constant under certain circumstances, the fractional fetal DNA concentration is proportional to the fraction of non-maternal alleles measured in maternal plasma. The fraction of fetal DNA can be predicted by analyzing the fraction of non-maternal alleles.
At section 140, a calibration curve for obtaining the fractional fetal DNA concentration from the non-maternal allele fraction is shown. The calibration curve may have various functional forms, such as linear, quadratic, or any polynomial. Section 140 shows a linear calibration curve, Y = αX + β, where X is the non-maternal allele fraction calculated by equation (1), Y is the fractional fetal DNA concentration, α is the slope of the line, and β is the Y-intercept of the line.
To establish a calibration curve, embodiments may use a series of samples with known fetal DNA fractions (e.g., estimated from the Y chromosome, based on informative SNP sites, etc.). For each sample with a known fetal DNA fraction, the non-maternal allele fraction is measured. A functional fit of known fetal DNA fraction values to the measured non-maternal fraction can be determined and used as a calibration curve. These samples may be referred to as calibration samples.
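As an illustration of how such a functional fit might be obtained for the linear case, the following Python sketch performs an ordinary least-squares fit of known fetal DNA fractions against measured non-maternal allele fractions. The function name and the calibration values are hypothetical, and ordinary least squares is an assumption here; the text only requires some functional fit.

```python
def fit_linear_calibration(x_values, y_values):
    """Ordinary least-squares fit of Y = alpha * X + beta.

    x_values: measured non-maternal allele fractions of calibration samples.
    y_values: known fetal DNA fractions of the same samples.
    Returns (alpha, beta), the slope and Y-intercept.
    """
    n = len(x_values)
    if n != len(y_values) or n < 2:
        raise ValueError("need at least two paired calibration points")
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    if sxx == 0:
        raise ValueError("calibration X values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Made-up calibration samples lying exactly on Y = 12X - 2.
calib_x = [0.5, 1.0, 1.5, 2.0]
calib_y = [4.0, 10.0, 16.0, 22.0]
alpha, beta = fit_linear_calibration(calib_x, calib_y)
```

A quadratic or other polynomial form would need a different fitting routine; the linear case is shown only because it is the form illustrated at section 140.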
In various embodiments, the calibration value may correspond to a calibration value for a calibration data point determined from a calibration sample or any calibration value determined therefrom, such as a calibration function that approximates the calibration data point. The one or more calibration samples may or may not include any additional samples for determining preferred end sites.
For each of the one or more calibration samples, the corresponding proportional contribution of the first tissue type can be measured, e.g., using tissue-specific alleles. The respective relative abundances can be determined using a respective number of free DNA molecules that terminate within a plurality of windows corresponding to the first set of genomic positions. The measured proportional contribution and relative abundance may provide a calibration data point. The one or more calibration data points may be a plurality of calibration data points that form a calibration function that approximates the plurality of calibration data points. More details of the use of calibration values can be found in U.S. patent publication 2013/0237431.
In determining the non-maternal allele fraction, every read showing a non-maternal allele at a site can be counted, even if it is not known whether the fetus actually carries the non-maternal allele or whether the non-maternal allele is an error. In some implementations, no minimum number of reads with a non-maternal allele is required before a site is used, although such a minimum could otherwise serve as a test that the allele is not an error. In addition, sites that have no sequence reads with non-maternal alleles can still be used to determine the non-maternal allele fraction. For example, even if some sites in maternal plasma DNA have only sequence reads of the maternal allele, these reads may still be counted in the denominator of equation (1). Because the calculation includes loci that may not have non-maternal alleles, the measured non-maternal allele fraction may not equal the actual non-maternal allele fraction.
To improve accuracy, embodiments can filter out reads carrying alleles that are not annotated in the dbSNP database, e.g., on the assumption that all SNPs used are biallelic. For example, if a SNP site is annotated as A/C in the dbSNP database, reads carrying "G" at that site in plasma are filtered out, but the site can still be used for the other reads being analyzed. This may reduce the effect of sequencing errors. In addition, all reads at sites not annotated as SNP sites are filtered out.
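The filtering rule can be illustrated with a short Python sketch. It assumes the SNP annotation has already been loaded into a dictionary mapping each site to its two annotated alleles; the site identifiers, the data layout, and the function name are hypothetical and do not represent an actual dbSNP interface.

```python
def filter_reads_by_annotation(reads, annotated_alleles):
    """Keep only reads whose allele is annotated for the site they cover.

    reads: iterable of (site, allele) pairs observed in plasma.
    annotated_alleles: dict mapping a site to the set of its two annotated
    alleles (biallelic SNPs assumed). Reads at sites missing from the
    annotation are dropped as well.
    """
    kept = []
    for site, allele in reads:
        alleles = annotated_alleles.get(site)
        if alleles is not None and allele in alleles:
            kept.append((site, allele))
    return kept


# Hypothetical annotation: one SNP site annotated as A/C.
annotation = {"snp_site_1": {"A", "C"}}
plasma_reads = [
    ("snp_site_1", "A"),
    ("snp_site_1", "G"),       # allele not annotated: filtered out
    ("snp_site_1", "C"),
    ("unannotated_site", "T"),  # site not annotated as a SNP: filtered out
]
kept_reads = filter_reads_by_annotation(plasma_reads, annotation)
```

Note that the A and C reads at the annotated site survive the filter, so the site remains usable for the other reads.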
B. Training and validation of calibration curves
Maternal plasma samples were used to verify the use of non-maternal allele fractions in estimating fetal DNA fractions. Some samples were used as a training data set to generate calibration curves relating fetal DNA fraction to non-maternal allele fraction. For the remaining samples, the non-maternal allele fraction of each sample was determined, and the fetal DNA fraction was then estimated from the calibration curve generated from the training samples. The estimated fetal DNA fraction of each remaining sample was then compared to its actual fetal DNA fraction to verify the accuracy of using the non-maternal allele fraction.
1. Data set
Two data sets were used to test the hypothesis that the fetal DNA concentration can be determined from the fraction of non-maternal alleles measured in maternal plasma. For the first data set, a total of 35 samples were genotyped by Affymetrix genotyping microarray (Affymetrix Genome-Wide Human SNP Array 6.0 system) and sequenced by Genome Analyzer IIx (Illumina) in 36-cycle paired-end mode as described in [ Lo YMD et al (2010), Science Translational Medicine (Sci Transl Med), 2:61ra91] and [ Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-. On average 671,206 (range 635,378-682,501) homozygous sites were obtained among 906,600 SNPs interrogated by the Affymetrix genotyping platform. Meanwhile, after mapping paired-end sequencing reads to the reference human genome using SOAP2 [ Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-; Li RQ et al (2009), Bioinformatics, 25: 1966-. The median of nearly 13 million reads corresponded to about 0.3x coverage.
The second data set has more samples and more reads per sample than the first data set. For the second data set, a total of 70 samples were genotyped by a BeadChip array (Illumina) and sequenced by a HiSeq 2000 sequencer (Illumina) (50 bp x 2) [ Stephanie C et al (2013), Clinical Chemistry ]. On average 1,940,577 (range 1,925,282-1,949,532) homozygous loci were obtained among 2,351,072 SNPs interrogated by the BeadChip array (Illumina). After alignment, a median of 69,959,574 (range 26,036,386-94,089,417) alignable, non-duplicated reads was obtained per sample. Approximately 70 million reads correspond to approximately 2.3x coverage. To evaluate the performance of fetal DNA fraction prediction, the estimated fetal DNA fraction is compared to a fetal DNA fraction determined using the fetal genotype as a standard (referred to as the actual fetal DNA fraction).
2. Non-maternal allele fraction calculation
The non-maternal allele fraction was calculated for each sample using equation (1). For homozygous sites identified from genotyping only the maternal samples, the number of reads from the corresponding maternal plasma samples was counted. The sum of the number of reads with non-maternal alleles at the identified homozygous sites divided by the total number of reads at the homozygous sites (i.e., reads with non-maternal or maternal alleles) is then converted to a percentage.
3. Fractional fetal DNA concentration estimation
To confirm that fractional fetal DNA concentration is proportional to the fraction of non-maternal alleles in maternal plasma, each data set was randomly divided, with some samples in the training set and the rest in the independent validation set. Linear regression was used to model the relationship between the actual fraction of fetal DNA in maternal plasma (dependent variable Y) and the fraction of non-maternal alleles (independent variable X, calculated from equation (1)) by analyzing 12 and 23 samples in the training sets of the first and second data sets, respectively. The actual fetal DNA fraction (F) was derived from the following formula by analyzing reads that overlap with SNPs where the mother is homozygous and the fetus is heterozygous [ Lo YMD et al (2010), Science Translational Medicine (Sci Transl Med), 2:61ra91 ].
Where p is the number of sequencing reads for the fetal-specific allele and q is the read count for the shared allele. Equation (2) differs from equation (1) in that equation (2) includes reads only from sites where the mother is homozygous and the fetus is heterozygous, while equation (1) may also include reads from sites where the mother and fetus both appear homozygous. In other embodiments, F may be scaled by a factor of 2 to correspond to the total fetal fraction of all fetal DNA. Other ratios, such as p/q, may also be used.
Thus, F is assumed to be the actual fetal DNA fraction and is estimated from the site where the mother is homozygous and the fetus is heterozygous. Heterozygosity can be determined by genotyping the placental tissue at the corresponding site. Samples with the actual fetal DNA fraction determined were used to show that the F deduced using the embodiments of the invention is accurate.
4. Results
Fig. 2A shows a linear model constructed using the training set of the first data set (Y = 11.9X - 1.4). The actual fetal DNA fraction determined using the previously obtained fetal genotype is shown on the y-axis and the non-maternal allele fraction is shown on the x-axis. The adjusted R-squared was 0.97 (p-value < 0.0001).
Fig. 2B shows that the estimated fetal DNA fraction is highly similar to the actual fetal DNA fraction in the first data set. The fetal DNA fraction estimated using the linear model of fig. 2A is shown on the y-axis. The actual fetal DNA fraction determined using the previously obtained fetal genotype is shown on the x-axis. A linear regression fitted to the data had an adjusted R-squared of 0.99 (p-value < 0.0001).
Fig. 3A shows a linear model constructed from 24 samples in the training set of the second data set (Y = 18.9X - 6.6, adjusted R-squared of 0.99, p-value < 0.0001). The actual fetal DNA fraction determined using the previously obtained fetal genotype is shown on the y-axis and the non-maternal allele fraction is shown on the x-axis.
Fig. 3B shows that the estimated fetal DNA fraction is highly similar to the actual fetal DNA fraction in the second data set. The fetal DNA fraction estimated using the linear model of fig. 3A is shown on the y-axis. The actual fetal DNA fraction determined using the previously obtained fetal genotype is shown on the x-axis. A linear regression fitted to the data had an adjusted R-squared of 0.99 (p-value < 0.0001).
The validation sets in fig. 2B and fig. 3B show that the fetal DNA fraction estimated from the calibration curve, which relates the non-maternal allele fraction to the actual fetal DNA fraction, is highly correlated with the actual fetal DNA fraction. The linear fits for the validation sets in fig. 2B and fig. 3B both have an R-squared of 0.99 (p-value < 0.0001). A high R-squared value indicates that the technique is accurate. The points in fig. 2B and 3B are also close to the line y = x, which would indicate a perfect estimate of the actual fetal DNA fraction.
Fig. 4A and 4B show the deviation from the actual fetal DNA fraction. The x-axis in fig. 4A and 4B is the actual fetal DNA fraction. The y-axis is the percentage deviation between the estimated fetal DNA fraction and the actual fetal DNA fraction for each sample in the validation data set. A positive value on the y-axis corresponds to an estimated fetal DNA fraction that is greater than the actual fetal DNA fraction. A negative value on the y-axis corresponds to an estimated fetal DNA fraction that is less than the actual fetal DNA fraction. Fig. 4A shows that for the validation set of the first data set, the median deviation was -0.14%, with deviations ranging from -0.7% to 1.7%. Fig. 4B shows that for the second data set, the median deviation was -0.22%, with deviations ranging from -1.5% to 0.98%. The difference between the results for the two data sets, each with its own calibration curve, can be attributed to the different platforms used. Fig. 4A and 4B show that a maximum deviation below 2% and a median deviation between -0.22% and -0.14% are achievable when the fetal DNA fraction is estimated from the non-maternal allele fraction.
The accuracy of the model on the validation data set was further measured using the relative prediction error (E%), which is defined as:
\[ E\% = \frac{|\hat{F} - F|}{F} \times 100\% \]
where \(\hat{F}\) represents the estimated fractional fetal DNA concentration and F represents the actual fetal DNA concentration. For example, E% = 5% indicates that if the actual fetal DNA fraction is 10%, the reading will be between 9.5% and 10.5% (10% ± 0.5%). For the first data set and the second data set, the average values of E% were found to be 1.7% (range: 0.7-2.9%) and 3.8% (range: 1.3-14.9%), respectively.
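For illustration, the relative prediction error can be computed as in the following Python sketch. It assumes the usual definition of a relative error, namely the absolute difference between the estimated and actual fetal DNA fractions divided by the actual fraction, expressed as a percentage; that definition and the function name are assumptions of this sketch.

```python
def relative_prediction_error(estimated, actual):
    """Relative prediction error E%, in percent.

    estimated: estimated fetal DNA fraction (e.g., 10.5 for 10.5%).
    actual: actual fetal DNA fraction in the same units; must be positive.
    Assumed definition: |estimated - actual| / actual * 100.
    """
    if actual <= 0:
        raise ValueError("actual fetal DNA fraction must be positive")
    return 100.0 * abs(estimated - actual) / actual


# An estimate of 10.5% against an actual fraction of 10% is off by 5% of
# the actual value; so is an estimate of 9.5%.
error_high = relative_prediction_error(10.5, 10.0)
error_low = relative_prediction_error(9.5, 10.0)
```

Because the error is relative to the actual fraction, the same absolute deviation produces a larger E% when the actual fetal DNA fraction is low.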
Experimental results demonstrate that non-maternal allele fraction can be used to accurately and precisely measure fetal DNA fraction. The accuracy and precision of the estimated fetal DNA fraction is within the range typically required for NIPT testing.
C. The accuracy of the fetal DNA fraction estimation depends on the actual fractional fetal DNA concentration
The accuracy of the prediction depends on the actual fetal DNA fraction being analyzed: the higher the fetal DNA fraction, the more accurate the estimate. Because the second data set contains more data points with fetal DNA fractions of less than 5%, collected after delivery (fig. 3B), it was used to investigate the relationship between actual fetal DNA fraction and relative prediction error.
FIG. 5 shows a scatter plot of relative prediction error versus actual fetal DNA concentration. The relative prediction error, expressed as a percentage, is shown on the y-axis and the fraction of non-maternal alleles in maternal plasma is shown as a percentage on the x-axis. The scatter plot shows a very clear "L" shape, with cases with high fetal DNA levels showing low prediction errors, and cases with low fetal DNA levels showing relatively high prediction errors. Even for an actual fetal DNA fraction of 5%, E% will approach 5% (fig. 5).
D. Relationship between sequencing depth and accuracy of fetal DNA fraction estimation
To further demonstrate how sequencing depth affects fetal DNA fraction estimation, the second data set was downsampled, because its samples have a higher sequencing depth than those of the first data set, which allows multiple downsampling analyses. For each of 20 samples in the second data set, 1, 2, 4, 6, and 8 million paired-end reads were randomly selected, and the fetal DNA fraction prediction analysis described above was repeated. Results are shown for 1, 2, 4, and 8 million randomly selected reads.
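A downsampling step of this kind might look like the following Python sketch, which draws a fixed number of reads at random without replacement and recomputes the non-maternal allele fraction at each depth. The simulated read list, the depths, and the function names are illustrative assumptions; real data would consist of aligned read records rather than bare allele labels.

```python
import random


def downsample_reads(reads, n_reads, seed=0):
    """Randomly select n_reads reads without replacement."""
    if n_reads > len(reads):
        raise ValueError("cannot select more reads than are available")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(reads, n_reads)


def allele_fraction(reads):
    """Percentage of reads labeled "B" (non-maternal allele)."""
    b_reads = sum(1 for allele in reads if allele == "B")
    return 100.0 * b_reads / len(reads)


# Simulated reads at homozygous sites: 2% carry the non-maternal allele.
full_set = ["B"] * 200 + ["A"] * 9800
depths = (1000, 2000, 4000, 8000)
subsets = {n: downsample_reads(full_set, n, seed=1) for n in depths}
fractions = {n: allele_fraction(subset) for n, subset in subsets.items()}
```

Since random selection removes A and B reads in roughly equal proportion, the fraction computed at each depth is expected to stay close to that of the full set, with more sampling noise at the smaller depths.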
Figure 6A shows the estimated fetal DNA fraction versus the actual fetal DNA fraction at 1 million reads. A linear regression fit to the data had an R-squared of 0.9946 and a p-value of less than 0.001.
Figure 6B shows the estimated fetal DNA fraction versus the actual fetal DNA fraction at 2 million reads. A linear regression fit to the data had an R-squared of 0.9918 and a p-value of less than 0.001.
FIG. 6C shows the relationship at 4 million reads. A linear regression fit to the data had an R-squared of 0.9927 and a p-value of less than 0.001.
FIG. 6D shows the relationship at 8 million reads. A linear regression fit to the data had an R-squared of 0.9924 and a p-value of less than 0.001.
The R-squared value is greater than 0.99 and the p-value remains less than 0.001 regardless of the number of reads. The results show that using as few as 1 million reads gives predictions as good as those obtained using 2, 4, 6, or 8 million reads.
E. Applicability of the method
The fraction of non-maternal alleles present in the maternal plasma of a pregnant woman can be used to estimate the fetal DNA fraction. There is a linear relationship, with high R-squared, between the fractional fetal DNA concentration and the non-maternal allele fraction in maternal plasma, indicating that genotyping and sequencing errors are relatively constant, provided a consistent platform is applied within the same data set. The predictive power of this approach has been validated in an independent data set. For a different sequencing or genotyping platform, an updated calibration curve can be used to improve accuracy. The improved R-squared in the second data set may be attributed to the improved accuracy of the genotyping and sequencing system [ Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-; Lo YMD et al (2010), Science Translational Medicine (Sci Transl Med), 2:61ra91 ]. However, the different relative errors (E%) observed between the two data sets may be due to more samples exhibiting a relatively lower fraction of fetal DNA in the second data set.
Notably, sequencing depth is not a critical factor affecting the accuracy of fetal DNA fraction estimation, as demonstrated by the downsampling analysis. The method can therefore be generalized to samples with different sequencing depths. The underlying reason may be that the number of genomic loci showing the non-maternal allele in maternal plasma increases or decreases in proportion to the sequencing depth, so that the fraction of non-maternal alleles in maternal plasma may remain constant across sequencing depths. Thus, this approach does not require a high sequencing depth and can be readily applied in real clinical practice, since about 15 million sequencing reads [ Kim SK et al (2015), Prenatal Diagnosis: n/a-n/a ] are routinely obtained in noninvasive prenatal diagnosis.
The accuracy of fetal DNA fraction prediction should be higher than that of two previous non-polymorphism-based methods [ Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-; Kim SK et al (2015), Prenatal Diagnosis: n/a-n/a ], because the R-squared values previously reported for those methods were 0.83 and 0.93, respectively [ Yu SC et al (2014), Proc Natl Acad Sci U S A, 111: 8583-; Kim SK et al (2015), Prenatal Diagnosis: n/a-n/a ], which are lower than the corresponding value in this study (the R-squared value in the second data set is 0.99). In addition, the algorithm was able to accurately determine a low fetal DNA fraction of 5% (fig. 5). This ability to measure low fetal DNA fractions is particularly important because a significant proportion (about 5%) of maternal plasma samples have fractional fetal DNA concentrations of less than 5% [ Chiu RW et al (2011), BMJ, 342: c7401; Palomaki GE et al (2011), Genet Med, 13: 913-. Accurate estimation of fetal DNA fraction may allow accurate filtering of samples with low fetal DNA fraction in a quality control step [ Palomaki GE et al (2011), Genet Med, 13: 913-. In addition, the degree of variation in the amount of maternal plasma DNA from chromosomes involved in fetal aneuploidy shows a correlation with fetal DNA fraction. Samples whose data fall outside the correlation curve are considered more likely to be false positives. Embodiments for estimating fetal DNA fraction may help identify false results. On the other hand, certain pregnancy-associated disorders (e.g., preeclampsia and trisomy 18) are associated with a perturbed fraction of fetal DNA in maternal plasma. Thus, a better estimate of fetal DNA fraction will allow more sensitive detection of conditions associated with perturbed fractional fetal DNA concentrations.
Because massively parallel sequencing-based clinical diagnosis is increasingly being recognized and applied to clinical practice, personalized genotypes are available for each individual. Thus, maternal genotype-assisted fetal DNA fraction estimation can be easily integrated into currently existing methods for noninvasive prenatal diagnosis. Embodiments using sequence reads of alleles provide a general method for accurately estimating fractional fetal DNA concentrations. Since there are few methods for accurately estimating fractional fetal DNA concentration in noninvasive prenatal diagnosis based on random sequencing, this method will provide a useful tool among the fastest available clinical utilities to enhance noninvasive prenatal detection of fetal chromosomal aneuploidies by performing more accurate statistical interpretation of the sequencing results of maternal plasma DNA [ Agarwal A et al (2013), prenatal diagnosis (Prenat Diagn),33: 521-.
F. Exemplary method for measuring fetal DNA fraction using reads
Fig. 7 shows an exemplary method 700 of measuring a fraction of fetal DNA in a biological sample of a pregnant woman carrying a fetus. The biological sample includes maternal DNA molecules and fetal DNA molecules. Method 700 may be performed using a computer system.
At block 702, method 700 includes identifying a plurality of loci having sequence information indicating that the woman is homozygous for a first allele at each of the plurality of loci. As an example, sequence information may be determined from the same sample (e.g., different read sets or different aliquots of the sample), which may be plasma or other mixtures of fetal and maternal DNA, or a different sample from a woman (e.g., a sample from a buffy coat, an oral swab, or a different plasma sample). Regardless of the source of the sample, the sequence information may include separate datasets of DNA molecule reads, e.g., other reads from the same sample or different samples. In some embodiments, an indication that a woman is homozygous may be made based only on the first allele detected at a particular locus. In other embodiments, the indication may allow for some reads with different alleles, but the number of reads with another allele at a locus is below a threshold (e.g., a threshold that calls for the locus to be homozygous to be within a certain statistical accuracy). This embodiment may be performed when only the maternal sample (e.g., buffy coat) is used to obtain sequence information. As described herein, sequence information may be obtained by any suitable technique, such as sequencing or validation.
The woman may in fact be homozygous at the plurality of sites. However, in some embodiments, the woman's DNA may be sequenced at shallow depth, such that only a few reads (e.g., one or two) cover each site, and the woman may appear homozygous at a site even if she is heterozygous there. Identifying the plurality of sites may include obtaining a plurality of reads from the DNA molecules of the biological sample. In other embodiments, identifying a plurality of loci at which the woman appears homozygous may comprise identifying the loci from a plurality of reads of another biological sample (i.e., a second biological sample) that does not contain fetal DNA. For example, the second biological sample may be a maternal buffy coat or a buccal swab. Identifying a plurality of loci at which the woman is homozygous may comprise genotyping cells in a second biological sample from the woman. In some embodiments, the maternal genotype analysis need not be highly accurate and can be obtained from shallow-depth sequencing of the maternal buffy coat, such as, but not limited to, less than 0.1x, 0.2x, 0.3x, 0.4x, 0.5x, 0.8x, 1x, 1.5x, 2x, 3x, 4x, 5x, or 10x coverage of the haploid human genome. In some embodiments, the plurality of reads may be limited to reads located at a second plurality of sites that a reference database lists as known SNP sites.
At block 704, the method 700 includes obtaining a plurality of reads from DNA molecules of a biological sample. The plurality of reads may be obtained from a sequencing device or from a data storage device. The method 700 may also include receiving the biological sample before obtaining the reads. These reads may be limited to sites identified in a database as biallelic sites, i.e., sites with SNPs. A plurality of DNA molecules in the biological sample can be sequenced to obtain the reads. In other embodiments, a probe microarray may be used to analyze a plurality of DNA molecules in the biological sample to obtain the reads.
At block 705, the method 700 includes identifying locations of the plurality of reads in a reference genome. For example, cell-free DNA molecules can be sequenced to obtain sequence reads, and the sequence reads can be mapped (aligned) to the reference genome. If the organism is a human, then the reference genome would be a reference human genome, potentially from a particular subpopulation. As another example, the DNA molecules can be analyzed with different probes (e.g., after PCR or other amplification), where each probe corresponds to a genomic location.
At block 706, the method 700 includes determining a first read amount. Each of the first reads is located at one of a plurality of sites, and each read shows a second allele that is different from the first allele of the woman at that site. In some cases, the first read is a read of a non-maternal allele. The second allele may be limited to the alleles identified in the database as corresponding to biallelic loci. Not all of the plurality of loci may include a read showing the second allele. In fact, a portion of the plurality of loci may not include reads that reveal the first allele.
At block 708, a second amount of reads at the plurality of sites may be determined. Each read in the second amount is located at one of the plurality of sites and shows the first allele at that site. In some embodiments, the second amount may also include the reads of the first amount, which show an allele different from the woman's allele. In other words, the second amount may be the sum of A + B, as shown in fig. 1, or the second amount may be the sum of A. The second amount may be determined implicitly from the total number of reads; the total number of reads may itself be the second amount.
At block 710, a non-maternal allele fraction may be determined from the first amount and the second amount. The non-maternal allele fraction may include the first amount divided by the second amount. The non-maternal allele fraction may include a numerical value converted to a percentage. In some embodiments, the non-maternal allele fraction may include the second amount divided by the first amount.
At block 712, a calibration point determined using another sample with a known fetal DNA fraction and a measured non-maternal allele fraction may be obtained. The calibration point may be one of a plurality of calibration points, and the plurality of calibration points may constitute a calibration curve. The calibration curve may be calculated by determining the fetal DNA fractions of a plurality of other samples from a plurality of pregnant women. Determining the fetal DNA fraction of each of the plurality of other samples may include identifying a second plurality of loci at each of which the fetus is heterozygous and the pregnant woman is homozygous. In some embodiments, the fetal DNA fraction may be determined using the Y chromosome of a male fetus. A plurality of reads of DNA molecules from the other sample may be obtained. The number of these reads may be equal or similar to the number of reads of DNA molecules from the first biological sample. A third amount of reads having fetal-specific alleles at the second plurality of loci can be determined. A fourth amount of reads having a shared allele at the second plurality of loci can be determined. The fetal DNA fraction may be determined using the third amount and the fourth amount. The non-maternal allele fraction of each of the plurality of other samples may be calculated. The fetal DNA fractions and the non-maternal allele fractions may be fitted to a linear or other function, which may describe the calibration curve.
At block 714, the fetal DNA fraction may be calculated using the calibration points and the non-maternal allele fraction. The non-maternal allele fraction can be compared to the calibration points of the calibration curve. The calculated fetal DNA fraction may be equal to the fraction of fetal DNA on the calibration curve that corresponds to the same or similar fraction of non-maternal alleles. If the calibration curve is represented by an equation, the fetal DNA fraction may be the result of a calculation substituting the non-maternal allele fraction into the equation.
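Blocks 706 through 714 can be sketched end to end in Python as follows. The sketch assumes that reads have already been placed on the reference genome and reduced to (locus, allele) pairs, that the maternal homozygous calls are available as a dictionary, and that the calibration curve is linear with slope `alpha` and intercept `beta`. All names and the coefficient values are hypothetical; they are not the calibration values reported for the data sets above.

```python
def estimate_fetal_fraction(reads, maternal_alleles, alpha, beta):
    """Estimate the fetal DNA fraction from reads at homozygous loci.

    reads: iterable of (locus, allele) pairs for located reads.
    maternal_alleles: dict mapping each identified locus to the first
    (maternal) allele for which the woman appears homozygous.
    alpha, beta: slope and intercept of an assumed linear calibration curve.
    """
    first_amount = 0   # reads showing a second, non-maternal allele
    second_amount = 0  # all reads at the identified loci (A + B)
    for locus, allele in reads:
        first_allele = maternal_alleles.get(locus)
        if first_allele is None:
            continue  # read is not at an identified homozygous locus
        second_amount += 1
        if allele != first_allele:
            first_amount += 1
    if second_amount == 0:
        raise ValueError("no reads at the identified loci")
    non_maternal_fraction = 100.0 * first_amount / second_amount
    return alpha * non_maternal_fraction + beta


# Made-up example: 10 reads at identified loci, one of them non-maternal.
maternal = {"locus_1": "A", "locus_2": "C", "locus_3": "G"}
located_reads = (
    [("locus_1", "A")] * 4
    + [("locus_2", "C")] * 3
    + [("locus_2", "T")]
    + [("locus_3", "G")] * 2
    + [("locus_9", "T")]  # ignored: locus not identified as homozygous
)
estimate = estimate_fetal_fraction(located_reads, maternal, alpha=2.0, beta=1.0)
```

Here the second amount is taken as the sum of A + B; an embodiment using the sum of A alone would need a calibration curve built with that same definition.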
G. Exemplary methods for measuring fetal DNA fraction Using characteristics of DNA molecules
Fig. 11 shows an exemplary method 1100 of measuring a fraction of fetal DNA in a biological sample of a pregnant woman carrying a fetus. The method 1100 may use values that define characteristics of a set of DNA molecules. The characteristic may include a size parameter of the set of molecules or an amount of sequence reads of the set of molecules.
At block 1102, method 1100 includes identifying a plurality of loci having sequence information indicating that a female is homozygous for a first allele at each locus of the plurality of loci. Identifying multiple sites can be performed by any of the procedures described herein, including the procedures described in method 700.
At block 1104, the method 1100 includes obtaining a plurality of reads from the DNA molecule. Obtaining the plurality of reads may be performed by any of the operations described herein, including the operations described in method 700.
At block 1105, the method 1100 includes identifying locations of the plurality of reads in a reference genome. The reference genome may be a human genome. Identifying the locations of the reads may include aligning the reads to the reference genome or using probes. Identifying a location may be performed by any of the operations described herein, including the operations described in method 700 for block 705.
At block 1106, the method 1100 includes determining a first value for a first set of DNA molecules. Each DNA molecule of the first set of DNA molecules may include a read at one of a plurality of sites. Each read may show a second allele that is different from the first allele at that site. The first value may define a property of the first set of DNA molecules. For example, as with method 700, the first value may be the number of reads located at multiple loci and having a second allele. Determining the first value may further comprise measuring a size of the first set of DNA molecules, wherein the first size value has a first size distribution of the first set of DNA molecules. In an embodiment, the first value may be a size parameter. The size parameter may be the number of molecules with a size in a range or the cumulative frequency of molecules at a size, e.g. the first cumulative frequency of DNA molecules with the largest size in the first set of DNA molecules.
At block 1108, the method 1100 includes determining a second value for a second set of DNA molecules. Each DNA molecule of the second set of DNA molecules may include a read at one of a plurality of sites. Each read can show the first allele at that site. The second set of DNA molecules may also be from the same biological sample as the first set of DNA molecules, or may be from another biological sample (e.g., a sample of maternal DNA only, such as a buffy coat or oral swab). Determining the second value may further comprise measuring the size of the second set of DNA molecules, wherein the second size value has a first size distribution of the second set of DNA molecules. The second value may define a property of the second set of DNA molecules. For example, if the first value is a size parameter, the second value may also be a size parameter. The second value may be a second number of reads at multiple sites and having the first allele.
At block 1110, a parameter value for a parameter may be determined from the first value and the second value. The parameter may comprise a ratio of the first value divided by the second value.
At block 1112, the method 1100 may include comparing the parameter value to a calibration point determined using at least one other sample (e.g., a calibration sample) having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample. The calibration point may be one of a plurality of calibration points, and the plurality of calibration points may constitute a calibration curve. The calibration curve may be calculated by determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women, similar to the operations in block 712 of method 700. The parameter values for a number of other samples may be calculated. The fetal DNA fraction and the parameters may be fitted to a linear or other function. A linear or other function may describe the calibration curve.
At block 1114, a fetal DNA fraction may be calculated based on the comparison. The calculated fetal DNA fraction may be equal to the fraction of fetal DNA on the calibration curve corresponding to the same or similar parameter value. If the calibration curve is represented by an equation, the fetal DNA fraction may be a result of a calculation that substitutes parameter values into the equation.
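The calibration steps of blocks 1110 to 1114 can be illustrated with a minimal sketch. It assumes an ordinary least-squares line as the fitted function; the function names, the calibration points, and the read counts are hypothetical and are not taken from the specification.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def parameter_value(first_value, second_value):
    """Block 1110: ratio of the first value to the second value."""
    return first_value / second_value


# Hypothetical calibration samples: (parameter value, known fetal DNA fraction in %).
calibration = [(0.02, 5.0), (0.04, 10.0), (0.06, 15.0), (0.08, 20.0)]
slope, intercept = fit_line([p for p, _ in calibration],
                            [f for _, f in calibration])

# Hypothetical test sample: 30 reads showing the second allele,
# 600 reads showing the first allele.
param = parameter_value(30, 600)

# Block 1114: substitute the parameter value into the fitted equation.
fetal_fraction = slope * param + intercept
print(f"estimated fetal DNA fraction: {fetal_fraction:.1f}%")
```

A non-linear function could be fitted in the same way if the calibration data call for it.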
H. Measurement of fetal DNA fraction using size parameters
The size of the DNA molecules with non-maternal alleles and/or the size of the DNA molecules with maternal alleles can be used to estimate the fetal DNA fraction. Fetal DNA has been found to be shorter than maternal DNA in maternal plasma (Lo YMD et al, Science Translational Medicine (Sci Transl Med) 2010; 2:61ra91). Thus, DNA molecules with the non-maternal allele should on average be shorter than DNA molecules with the maternal allele in maternal plasma.
As an example, maternal plasma with a 20% fetal DNA fraction was genotyped by a microarray method (Illumina). Sites were identified where the maternal DNA was homozygous and a non-maternal allele was present in the plasma. The sizes of the DNA molecules carrying the maternal and non-maternal alleles at these sites were compared.
Figure 12A shows the size distribution of DNA molecules with maternal and non-maternal alleles. The x-axis is the size of the DNA molecule in base pairs. The y-axis is the frequency of a given size in percent. Line 1202 is the size distribution of DNA molecules with maternal alleles, while line 1204 is the size distribution of DNA molecules with non-maternal alleles. Line 1204 is generally to the left of line 1202, indicating that DNA molecules with non-maternal alleles are generally shorter than DNA molecules with maternal alleles.
Figure 12B shows the cumulative frequency of the sizes of the DNA molecules from figure 12A. The x-axis is the size of the DNA molecule in base pairs. The y-axis is the cumulative frequency expressed in percent. Line 1206 is the cumulative frequency curve of the size of the DNA molecules with maternal alleles. Line 1208 is the cumulative frequency curve of the size of the DNA molecules with non-maternal alleles. Line 1208 is above line 1206, indicating that DNA molecules with non-maternal alleles are shorter than DNA molecules with maternal alleles.
Fig. 12C shows ΔS, the difference between the two cumulative frequency curves (line 1206 and line 1208). The x-axis is the size of the DNA molecule in base pairs. The y-axis is ΔS. The maximum value of ΔS occurs at about 150 bp. Thus, DNA molecules with non-maternal alleles are relatively enriched for sizes less than or equal to 150 bp. ΔS at 150 bp (denoted ΔS150) was quantified for 32 samples with 8 million paired-end sequence reads to test its suitability for estimating fetal DNA fraction.
Fig. 12D shows the relationship between ΔS150 and fetal DNA fraction for the 32 samples. The x-axis is the fetal DNA fraction expressed in percent. The y-axis is ΔS150, the difference at a size of 150 bp between the cumulative frequency curve for DNA molecules with non-maternal alleles and the cumulative frequency curve for DNA molecules with maternal alleles. ΔS150 is positively correlated with fetal DNA fraction. In other words, a higher amount of short DNA molecules carrying non-maternal alleles indicates a higher fetal DNA fraction. A linear regression was fitted to the data. The R-squared value of the linear fit was 0.81 (p < 0.01). DNA molecules with maternal alleles include DNA molecules that are fetal DNA but still carry maternal alleles. Thus, ΔS150 is not expected to reflect the actual size difference between maternal and fetal DNA.
In some embodiments, ΔS may be determined at a size other than 150 bp. For example, ΔS may be determined at 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 160, 170, 190, 200, or 210 bp. Other size parameters may also be used. The size difference may be between any statistical values of the size distributions of the two sets. For example, the difference in median size between the first set of DNA molecules and the second set of DNA molecules may be used. Another example is the maximum difference in cumulative frequency by size between the first and second sets. Any of the size values described in U.S. patent publications 2011/0276277 and 2013/0237431 may be used.
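As an illustration of the ΔS150 quantity described above, the following minimal sketch computes the difference in cumulative size frequency at a cutoff between molecules carrying non-maternal alleles and molecules carrying maternal alleles. The function names and the fragment sizes are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cumulative_frequency(sizes, cutoff):
    """Percentage of molecules whose size is <= cutoff (in bp)."""
    return 100.0 * sum(1 for s in sizes if s <= cutoff) / len(sizes)


def delta_s(non_maternal_sizes, maternal_sizes, cutoff=150):
    """Cumulative frequency of non-maternal-allele molecules minus that of
    maternal-allele molecules, both evaluated at the cutoff size."""
    return (cumulative_frequency(non_maternal_sizes, cutoff)
            - cumulative_frequency(maternal_sizes, cutoff))


# Hypothetical fragment sizes in bp.
non_maternal = [120, 135, 143, 150, 166, 170]
maternal = [140, 155, 166, 166, 170, 180, 185, 190]

ds150 = delta_s(non_maternal, maternal)
print(f"delta S at 150 bp: {ds150:.1f} percentage points")
```

Changing the `cutoff` argument gives ΔS at other sizes.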
A calibration curve between the size parameter and the fraction of fetal DNA may be used. The calibration curve may correlate the fetal DNA fraction of other samples with the size parameter. The fetal DNA fraction of the other samples can be determined by any of the methods described herein. Size parameters of other samples can then be measured and plotted against fetal DNA fraction. A linear or other regression is fitted to the data to determine a calibration curve. The size parameter of the biological sample with an unknown fetal DNA fraction may then be compared to a calibration curve to estimate the fetal DNA fraction.
In these embodiments, size parameters based on the size of the non-maternal and maternal allele DNA may be used to estimate the fetal DNA fraction, even if the parameters do not reflect the size of the fetal and maternal DNA.
I. Exemplary methods for measuring fetal DNA fraction Using size parameters
Fig. 13 shows an exemplary method 1300 of measuring a fraction of fetal DNA in a biological sample of a pregnant woman carrying a fetus. The method 1300 may use values that define the size of a set of DNA molecules. The value may be a value of a size parameter.
At block 1302, the method 1300 includes identifying a plurality of loci based on sequence information indicating that a female is homozygous for a first allele at each locus of the plurality of loci. Identifying multiple sites can be performed by any of the procedures described herein, including as in method 700.
At block 1304, the method 1300 includes obtaining a plurality of reads from DNA molecules of a biological sample. Obtaining multiple reads may be performed by any of the operations described herein, including as in method 700.
At block 1305, the method 1300 includes identifying the locations of the plurality of reads in the reference genome and determining the sizes of the DNA molecules of the biological sample. Identifying a location may be performed by any of the operations described herein, including the operations described in method 700 for block 705. The size measurement may be performed by electrophoresis or in silico.
At block 1306, method 1300 includes determining a first size value for a first set of DNA molecules. Each DNA molecule in the first set of DNA molecules may include a read at one of a plurality of sites. Each read may show a second allele that is different from the first allele at that site. The first size value may correspond to a statistic of a first size distribution of the first set of DNA molecules. The first size value may be a size parameter. The size parameter may be the number of molecules with a size in a range or the cumulative frequency of molecules at a size. As a further example, the size value may be a median size, a mode of size distribution, or an average size of the first set of DNA molecules.
At block 1308, method 1300 includes determining a second value for a second set of DNA molecules. Each DNA molecule in the second set of DNA molecules may include a read at one of a plurality of sites. Each read can show the first allele at that site. The second set of DNA molecules may also be from a biological sample or may be from another biological sample (e.g., a maternal DNA-only sample such as a buffy coat or buccal swab). As other examples, the size value may be a median size, a mode of size distribution, or an average size of the second set of DNA molecules. The second size value may correspond to a statistic of a second size distribution of the second group of DNA molecules. For example, if the first value is a size parameter, the second value may also be a size parameter.
At block 1310, a parameter value may be determined from the first value and the second value. The parameter may comprise a ratio of the first value divided by the second value.
At block 1312, the method 1300 may include comparing the parameter value to a calibration point determined using at least one other sample (e.g., a calibration sample) having a known fetal DNA fraction and a calibration value corresponding to individual measured values of the parameter in the at least one other sample. The calibration point may be one of a plurality of calibration points, and the plurality of calibration points may constitute a calibration curve. The calibration curve may be calculated by determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women, similar to that described in method 700. The parameter values for a number of other samples may be calculated. The fetal DNA fraction and the parameters may be fitted to a linear or other function. A linear or other function may describe the calibration curve.
At block 1314, a fetal DNA fraction may be calculated based on the comparison. The parameter values may be compared to calibration points of a calibration curve. The calculated fetal DNA fraction may be equal to the fraction of fetal DNA on the calibration curve corresponding to the same or similar parameter value. If the calibration curve is represented by an equation, the fetal DNA fraction may be a result of a calculation that substitutes parameter values into the equation.
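The size values of blocks 1306 to 1310 can be sketched as follows, assuming the size value is one of the statistics named in the text (median, mean, or mode) and the parameter is the ratio of the two size values. The helper name and the fragment sizes are hypothetical.

```python
import statistics


def size_value(sizes, statistic="median"):
    """Return a summary statistic of a size distribution (sizes in bp)."""
    if statistic == "median":
        return statistics.median(sizes)
    if statistic == "mean":
        return statistics.mean(sizes)
    if statistic == "mode":
        return statistics.mode(sizes)
    raise ValueError(f"unknown statistic: {statistic}")


# Hypothetical sizes of molecules showing the second (non-maternal) allele
# and of molecules showing the first (maternal) allele.
first_set = [130, 140, 150, 160, 166]
second_set = [150, 160, 166, 166, 170, 180]

first_size_value = size_value(first_set)     # block 1306
second_size_value = size_value(second_set)   # block 1308
size_ratio = first_size_value / second_size_value  # block 1310
print(f"size parameter (ratio of medians): {size_ratio:.3f}")
```

The resulting parameter value would then be compared to calibration data as in blocks 1312 and 1314.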
Analysis of DNA by locus quantification
To determine maternal or fetal genotype, some embodiments do not require analysis of reads of samples from maternal DNA only, fetal DNA only, or any DNA from only one subject. Indeed, some embodiments need not include highly accurate information about the maternal genotype. For example, which loci are homozygous in the maternal genotype does not need to be known with high statistical confidence or even any statistical confidence. Instead, the method may assume that certain loci are homozygous based on the presence of only one allele in one or a few reads in a sample containing both maternal and fetal DNA. These methods typically have a shallow sequencing depth, which is not sufficient to confidently assess the alleles present at the locus. For example, determining that a locus is homozygous can be based on only one or two reads at the locus. Thus, a locus identified as homozygous may only appear homozygous because the locus is not sequenced at sufficient depth.
In addition, embodiments of analyzing DNA may include analyzing an apparently homozygous locus for an alternative allele (e.g., a non-maternal allele) in a sample containing both maternal and fetal DNA. Analysis of samples for alternative alleles can also be performed at shallow sequencing depths. Shallow sequencing depths may result in few reads at one locus, sometimes only one or two reads. A low number of reads at one locus may result in not sequencing any alternative alleles actually present at the locus, or in undercounting the proportion of alternative alleles present at the locus. Because of these possible errors, techniques using shallow sequencing depths are not expected to accurately measure fetal DNA fractions or other characteristics of biological samples.
Furthermore, identifying alternate alleles at a locus as a means of determining fetal DNA fraction is not expected to be effective for any single locus. For any single locus, the alternative allele will be present or absent. Such binary results do not provide sufficient information to measure fetal DNA fraction or other characteristics of the biological sample.
However, the methods described herein surprisingly can accurately measure fetal DNA fraction or other characteristics of a biological sample when performing shallow sequencing. These methods can provide useful information about biological samples by using multiple loci, averaging the results to minimize sequencing and other errors, and using calibration data. These methods are an improvement over traditional methods, which may only be effective for male fetuses, may require genotype information for both parents, or may require high sequencing depth.
A. General procedure
Fig. 8 shows a schematic representation of a method 800 of measuring fetal DNA fraction without obtaining a maternal genotype, paternal genotype, or a biological sample containing only maternal DNA molecules.
Block 802 begins with a biological sample or biological samples. The biological sample may be plasma, serum, blood, saliva, sweat, urine, tears, or other fluid from a pregnant woman carrying a fetus. The biological sample may have a minimum fetal DNA molecule fraction of 1%, 2%, 3%, 4%, or 5%. The biological sample contains both maternal and fetal DNA molecules. The biological sample may be obtained from a needle administered by a medical professional. Biological samples may also be obtained non-invasively as part of a routine medical appointment.
Block 804 shows obtaining sequencing reads from DNA molecules from a biological sample. Sequencing reads of any data set may be shallow depth or low depth. For example, the number of sequencing reads can be less than 0.1x, 0.2x, 0.3x, 0.4x, 0.5x, 0.8x, 1x, 1.5x, 2x, 3x, 4x, 5x, or 10x coverage of a haploid human genome. Sequencing of the DNA molecules may be performed by any suitable sequencing technique or system. Sequencing or reads can be limited to sites with known and common SNPs, including SNPs in reference databases (e.g., dbSNP or HapMap).
Blocks 806 and 808 show two data sets of sequencing reads obtained from one or more biological samples. The two data sets may be data from two biological replicates of plasma DNA (i.e., two different blood draws from the same patient at approximately the same time); one plasma sample divided into two; a plasma sample and a constitutional genomic DNA sample (e.g., maternal buffy coat DNA, buccal swab DNA); or one plasma/serum sequencing data set randomly split into two sequencing data sets in silico. Thus, two samples may be obtained in block 802, with sequence reads for each sample being obtained separately.
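One of the options above, randomly splitting a single sequencing data set into two in silico, can be sketched as follows. This is a minimal illustration that assumes reads are held in memory as a list; the function name and the fixed seed are hypothetical choices made so that the split is reproducible.

```python
import random


def split_reads(reads, seed=0):
    """Randomly partition reads into two data sets of (nearly) equal size."""
    rng = random.Random(seed)   # local generator, leaves global state untouched
    shuffled = list(reads)      # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical read identifiers.
all_reads = [f"read_{i}" for i in range(11)]
first_data_set, second_data_set = split_reads(all_reads)
print(len(first_data_set), len(second_data_set))
```

Every read ends up in exactly one of the two data sets, so no read contributes to both.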
Block 810 depicts identifying exclusive alleles in each locus of the first set of loci. For illustrative purposes, reads 812 show a first set of loci 814 characterized by an exclusive single allele at each of loci a-h. Alleles are represented by white or black squares in figure 8. In read 812, the first set of loci 814 includes loci a-h. Loci a-h may not be contiguous loci. These loci are apparently homozygous because no locus shows the presence of two different alleles. Characterization of any locus as homozygous cannot be done with high statistical confidence, considering that there are only one or two reads at any position in the reads 812. In fact, for loci that have only a single read, the loci are generally not considered to be characterized as homozygous with any confidence. These loci may be limited to loci with known and common SNPs.
Block 816 shows identifying a second set of loci that display alternative alleles. The second set of loci is identified from within the first set of loci 814. Reads 818 show the alleles sequenced from the second data set. For the same loci, loci a, c, f, and g show reads with alleles different from the alleles in the first set of loci 814. These loci show alternative alleles, as the alleles differ from the alleles in the first data set. Loci b, d, e, and h show reads with the same alleles as in the first set of loci 814. Thus, the second set of loci is identified as loci a, c, f, and g.
Block 820 determines a first amount of loci. A first amount of loci can be determined from the first set of loci. In other embodiments, the first amount of loci can be determined from the second data set because the first set of loci is analyzed in the second data set. The first amount can be the number of loci or the number of reads with alleles. If the first amount is the number of loci, then the number of loci shown is 8 for read 812. The first amount may be limited to reads from DNA molecules having a particular size or particular properties. For example, the first amount may be the number of loci of a DNA molecule having a particular absolute size or a particular size relative to other DNA molecules. The number of reads with an allele can be a count of allele reads. In reads 812, the number of reads with alleles was 11. In certain embodiments, if each locus averages about one allele read, the average number of reads with an allele can be equal to the number of loci.
Block 822 determines a second amount of loci in the second set of loci. The second amount can be the number of loci or the number of reads with alleles. The second amount should be comparable to the first amount and have the same units, but in some embodiments the first amount and the second amount may have different units. If the second amount is the number of loci, then the second amount determined by the second set of reads 818 is 4. If the second amount is the number of reads with alleles, then the second amount in reads 818 is 6. Because the first set of loci in the second data set was analyzed for the second set of loci, in some cases, the first amount of loci can be considered to be also determined by the second data set in addition to the first data set.
Block 824 determines an Apparent Allelic Difference (AAD) from the first amount and the second amount. AAD is a parameter that quantifies the proportion of loci that show alternative alleles in the second data set and that are not present in the first data set. As indicated at block 824, the AAD may be calculated by dividing the second amount by the first amount. In other embodiments, AAD may be calculated from the second amount divided by the amount of maternal-only allele (i.e., the difference between the second amount and the first amount). Calculating the AAD may include multiplying factors and/or inverses of the described calculations. AAD may be considered as a normalized parameter for the second quantity.
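The AAD calculation of blocks 810 to 824 can be sketched as follows, using the number of loci as both amounts. The sketch assumes each read is reduced to a (locus, allele) pair; the function names and the allele labels "W" and "B" (standing in for the white and black squares of Fig. 8) are hypothetical. The locus counts mirror the Fig. 8 example: 8 apparently homozygous loci, 4 of which show an alternative allele in the second data set.

```python
from collections import defaultdict


def alleles_by_locus(reads):
    """Map each locus to the set of alleles observed in the reads."""
    observed = defaultdict(set)
    for locus, allele in reads:
        observed[locus].add(allele)
    return observed


def apparent_allelic_difference(first_reads, second_reads):
    first = alleles_by_locus(first_reads)
    second = alleles_by_locus(second_reads)
    # First set: loci showing exactly one allele in the first data set.
    first_set = {locus: next(iter(alleles))
                 for locus, alleles in first.items() if len(alleles) == 1}
    # Second set: loci of the first set at which the second data set shows
    # at least one allele different from the one seen in the first data set.
    second_set = [locus for locus, allele in first_set.items()
                  if second.get(locus, set()) - {allele}]
    # Second amount divided by first amount (both counted as loci).
    return len(second_set) / len(first_set)


first_reads = [("a", "W"), ("a", "W"), ("b", "B"), ("c", "W"), ("c", "W"),
               ("d", "B"), ("e", "W"), ("f", "B"), ("f", "B"), ("g", "W"),
               ("h", "B")]
second_reads = [("a", "B"), ("b", "B"), ("c", "B"), ("d", "B"),
                ("e", "W"), ("f", "W"), ("g", "B"), ("h", "B")]

aad = apparent_allelic_difference(first_reads, second_reads)
print(f"AAD = {aad:.2f}")
```

Counting reads instead of loci, or dividing by a different denominator, would follow the same pattern with the two amounts redefined.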
Block 826 shows analyzing the biological sample using AAD. Analyzing the biological sample can include calculating a fetal DNA fraction from the AAD using a calibration curve. The calibration curve describes the relationship between fetal DNA fraction and AAD, as shown in graph 828. The calibration curve may be determined based on the actual fetal DNA fraction and AAD values from other biological samples. The number of sequencing reads for data points in the calibration curve may be similar to the number of sequencing reads in a biological sample with an unknown fetal DNA fraction. In other words, AAD data from samples with known fetal DNA fractions should be at a similar or the same sequencing depth as biological samples with unknown fetal DNA fractions. For example, the calibration curve may be at a sequencing depth within 1x, 5x, 10x, 15x, or 20x of the sequencing depth of the DNA molecules from the biological sample. In some embodiments, the calibration curve may be limited to samples having a similar genetic background as the mother, father, or fetus. For example, the calibration curve may be limited to AAD data from human samples of the same or similar ethnic groups. The calibration curve may also be limited to a particular haplotype or haplotype block. Thus, several calibration curves, each for one of several genomic regions (including haplotypes), can be used for the same sample. In some embodiments, AAD may be used to non-invasively test twins for zygosity.
As the sequencing depth increases, the proportion of loci identified as having non-maternal alleles will increase as more and more non-maternal alleles at loci are sequenced. Thus, at high sequencing depths, the proportion of loci with non-maternal alleles and AAD values is not expected to vary with fetal DNA fraction. The sequencing depth may be limited to a maximum of 5x, 10x, 15x, 20x, or 25x coverage to avoid regions where the AAD value does not depend on fetal DNA fraction. The biological sample can still be sequenced at a high sequencing depth above this maximum, but the resulting data can then be randomly down-sampled to generate a sequencing read dataset having a sequencing depth below the maximum.
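The random down-sampling mentioned above can be sketched as follows. This is a minimal illustration that assumes the depth cap has already been converted to a maximum number of reads; the function name, the cap, and the seed are hypothetical.

```python
import random


def downsample_reads(reads, max_reads, seed=0):
    """Return at most max_reads reads, drawn uniformly without replacement."""
    if len(reads) <= max_reads:
        return list(reads)          # already at or below the cap
    return random.Random(seed).sample(reads, max_reads)


# Hypothetical deep data set reduced to a shallow one.
deep_reads = list(range(1000))
shallow_reads = downsample_reads(deep_reads, max_reads=200)
print(len(shallow_reads))
```

Sampling without replacement keeps each original read at most once in the reduced data set.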
B. Exemplary methods for measuring fetal DNA fraction
Fig. 9 shows a method 900 of measuring a fraction of fetal DNA in a biological sample of a pregnant woman carrying a fetus. The biological sample includes maternal DNA molecules and fetal DNA molecules. The biological sample can be any biological sample described herein.
In block 902, the method 900 includes receiving a first data set of a first plurality of reads of DNA molecules from a first sample of a pregnant woman carrying a fetus. The data set may be received by the computer system from a sequencing device or a data storage device. The first sample may or may not be the biological sample. The first sample may have only maternal DNA and no fetal DNA, such as a buffy coat or buccal swab.
In block 904, the method 900 includes identifying a location of a first plurality of reads in a reference genome. Identifying a location may be performed by any of the procedures described herein, including the procedures described for identifying a site in block 704 of method 700.
In block 906, the method 900 includes identifying a first set of loci based on the first data set and the identified locations. Loci in the first set of loci do not display more than one allele. In other words, each locus in the first set of loci is uniallelic and appears homozygous. The first set of loci can be selected from a set of loci in a reference database. In other words, the first set of loci can be a subset of the set of loci from the reference database, each locus of the first set of loci being in the set of loci from the reference database. It may be known that this set of loci includes Single Nucleotide Polymorphisms (SNPs) or examples of high heterozygosity. The reference database may comprise a short genetic variation database (dbSNP) or a HapMap database. The set of loci can be narrowed down to certain loci known to have a high probability of heterozygosity in certain ethnicities or other genetic groups similar to that of the mother or fetus.
The plurality of reads may be at shallow depth. For example, the sequencing depth may be less than or equal to 10x, less than or equal to 5x, less than or equal to 4x, less than or equal to 3x, less than or equal to 2x, less than or equal to 1x, or less than or equal to 0.5x in embodiments. For a haploid human genome, 1x coverage is approximately 50 million reads for a read size of 50 bp. The number of reads can be less than or equal to 50 million reads, including less than or equal to 30 million reads, 20 million reads, 15 million reads, 10 million reads, 8 million reads, 5 million reads, 4 million reads, 2 million reads, or 1 million reads. A locus may have a total of one or two reads. A plurality of loci in the first set of loci, including more than 10%, more than 20%, more than 30%, more than 40%, more than 50%, more than 60%, more than 70%, more than 80%, or more than 90% of all loci in the first set of loci, can have a maximum of one or two reads. The maximum number of reads in any locus of the first set of loci can be 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10.
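The relationship between read count and depth stated above can be turned into a short calculation. The sketch below simply takes the text's approximation (about 50 million 50-bp reads per 1x coverage of a haploid human genome) as a constant; the function names and the example read count are hypothetical.

```python
# Approximation taken from the text: ~50 million reads of 50 bp per 1x coverage.
READS_PER_1X = 50_000_000


def approximate_depth(n_reads, reads_per_1x=READS_PER_1X):
    """Approximate haploid-genome coverage for a given number of reads."""
    return n_reads / reads_per_1x


def is_shallow(n_reads, max_depth=1.0):
    """True if the approximate depth does not exceed max_depth."""
    return approximate_depth(n_reads) <= max_depth


depth = approximate_depth(8_000_000)
print(f"8 million reads is roughly {depth:.2f}x coverage")
```

At such depths most loci receive at most one or two reads, which is the regime the method is designed for.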
In block 908, the method 900 includes determining a first amount of loci. The first amount may be the number of loci in the first set of loci from the first dataset or may be the total number of allele reads in the first set of loci. In other embodiments, the first amount may be determined by the number of loci in the first set of loci from the second data set.
In block 910, a second data set of a plurality of reads of DNA molecules from a biological sample may be received. The second data set may be received by the computer system from the sequencing device or the data storage device.
In block 912, the method 900 includes identifying a location of a second plurality of reads in the reference genome. Identifying a location may be performed by any of the procedures described herein, including the procedures described for identifying a site in block 704 of method 700.
In block 914, a second set of loci are identified based on the second data set and the identified locations. Each locus of the second set of loci displays an allele that is different from the allele displayed in the first set of loci. In other words, each locus of the second set of loci can display a non-maternal allele, while each corresponding locus of the first set of loci can display only a maternal allele. Each read in the second data set may be different from each read in the first data set. In some embodiments, the first data set may be one half of the plurality of reads and the second data set may be the other half of the plurality of reads.
In block 916, the method 900 includes determining a second amount of loci in the second set of loci from the second data set. The second amount can be the number of loci in the second set of loci, or can be the total number of allele reads in the second set of loci. The second amount may be limited to reads from DNA molecules having a certain size. For example, the reads for the second amount can be limited to reads from a first plurality of DNA molecules having a minimum average size difference from a second plurality of DNA molecules. The second plurality of DNA molecules may include DNA molecules having sequence reads for the first set of loci. The second plurality of DNA molecules may include all DNA molecules sequenced in the biological sample. In embodiments, the minimum size difference may be 5 bp, 10 bp, 20 bp, 30 bp, or 40 bp. The sizes of the first plurality or the second plurality of DNA molecules may be measured, or the sizes may be received.
In block 918, normalization parameters for the first quantity and the second quantity may be determined. In some embodiments, the normalization parameter may include the second amount divided by the first amount. The normalization parameter can be a ratio of the apparent non-maternal allele number to the maternal allele number. In other embodiments, the normalization parameter may include the second quantity divided by the sum of the first quantity and the second quantity. In these embodiments, the normalization parameter can be a ratio of the number of apparently non-maternal alleles to the total number of alleles. The normalization parameter may also be the inverse of any of these calculations. AAD is an example of a normalization parameter.
In block 920, the method 900 includes comparing the value of the normalization parameter to a calibration point determined using at least one other sample (e.g., a calibration sample) having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample. The calibration point may be one of a plurality of calibration points. The plurality of calibration points may constitute a calibration curve. The calibration curve may be a curve fitted to data points of known fetal DNA fraction and normalization parameters determined for different biological samples. Graph 828 of Fig. 8 is an example of a calibration curve. The calibration curve may be a linear regression of the data points. The calibration curve may have a slope that is not equal to 1 and may be less than 1.
The calibration curve may be determined using the known fetal DNA fraction and a normalization parameter from another biological sample (i.e., a second normalization parameter) determined by a method similar to the normalization parameter from the biological sample currently being analyzed (i.e., a first normalization parameter). The second normalization parameter may also be determined by operations similar to blocks 902 through 918. The number of reads associated with a locus in a dataset from another biological sample may be approximately equal to the number of reads in the current biological sample. The number of reads may be within 1x, 5x, or 10x of each other.
In block 922, a fetal DNA fraction is calculated based on the comparison. The fetal DNA fraction may be the fetal DNA fraction in the calibration curve corresponding to the same value of the normalization parameter. In some embodiments, the fetal DNA fraction may be interpolated between the two fetal DNA fractions corresponding to two values of the normalization parameter. In other embodiments, the calibration curve may be a linear equation of the form y = mx + b, where y is the fetal DNA fraction, x is the normalization parameter, and m and b are parameters fitted to the calibration curve.
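The interpolation option of block 922 can be sketched as follows, assuming simple linear interpolation between the two calibration points that bracket the measured parameter value. The function name and the calibration points are hypothetical.

```python
def interpolate_fetal_fraction(param, calibration_points):
    """Linearly interpolate the fetal DNA fraction for a parameter value.

    calibration_points: (parameter value, known fetal DNA fraction) pairs
    with distinct parameter values.
    """
    points = sorted(calibration_points)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= param <= x1:
            return y0 + (y1 - y0) * (param - x0) / (x1 - x0)
    raise ValueError("parameter value lies outside the calibration range")


# Hypothetical calibration points: (normalization parameter in %, fetal fraction in %).
points = [(10.0, 6.0), (11.0, 18.0), (12.0, 29.0)]
estimate = interpolate_fetal_fraction(10.5, points)
print(f"interpolated fetal DNA fraction: {estimate:.1f}%")
```

The sketch deliberately refuses to extrapolate beyond the calibrated range rather than guess.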
C. Experimental results on fetal DNA fraction
Fetal DNA fractions were measured using AAD using 24 plasma samples from 24 pregnant women carrying male fetuses, each sample having an average of 810 ten thousand sequence reads (range: 710-. Of the 24 samples, 14 samples were used to establish a calibration curve that models the relationship between actual fetal DNA fraction and AAD value. The actual fetal DNA fraction was determined by the proportion of reads derived from the Y chromosome (Hudecova I et al, public science library journal (PLoS One). To calculate the AAD values, each of the 14 samples was randomly divided into two data sets. In the first data set, a first set of loci displaying one and only one type of allele is identified. In the second dataset, each locus in the first set of loci is analyzed to determine whether a substitute allele is present. Loci with alternative alleles constitute a second set of loci. AAD is calculated as the number of loci in the second set of loci divided by the number of loci in the first set of loci, multiplied by 100%.
Fig. 10A shows a calibration curve from a linear regression model of fetal DNA fraction and AAD values. The Y-axis represents the fraction of fetal DNA derived from the Y chromosome, while the x-axis represents AAD values. The linear regression slope was 11.61 and the y-intercept was-109.93. The R-squared value is 0.8795.
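The regression reported for Fig. 10A can be applied directly to a new AAD value. The sketch below uses the slope and intercept stated above and assumes that both AAD and fetal DNA fraction are expressed in percent; the function name and the example AAD value are hypothetical.

```python
# Coefficients reported for the Fig. 10A linear regression.
SLOPE = 11.61
INTERCEPT = -109.93


def fetal_fraction_from_aad(aad_percent, slope=SLOPE, intercept=INTERCEPT):
    """Fetal DNA fraction (assumed to be in %) predicted from AAD (in %)."""
    return slope * aad_percent + intercept


example_aad = 10.5  # hypothetical AAD value in percent
predicted = fetal_fraction_from_aad(example_aad)
print(f"AAD {example_aad}% -> fetal DNA fraction {predicted:.2f}%")
```

These coefficients apply only to data generated at a sequencing depth similar to that of the calibration samples, as noted for block 826.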
FIG. 10B shows linear regression based on fetal DNA fraction and short DNA molecule ratio. The Y-axis shows the fraction of fetal DNA derived from the Y chromosome, while the x-axis shows the percentage of DNA molecules in the sample that are less than 150bp in size. Fetal DNA fraction has been estimated based on the size of the DNA molecule (Yu SC et al, Proc Natl Acad Sci U S A2014; 111: 8583-8). The slope of the linear regression is 1.9247, and the y-intercept is-3.7911. The R-squared value is 0.3593.
For this data set, fetal DNA fraction was determined from AAD values, giving a higher correlation than the fraction of shorter DNA molecules represented by the R-squared value. With higher R-squared values, AAD-based fetal DNA fraction estimation will be more accurate than size-profile-based methods.
To test how well the AAD-based calibration curve of FIG. 10A generalizes, the remaining 10 samples from 10 pregnant women were analyzed. The reads of each of the 10 samples were randomly divided into two data sets. In the first data set, a first set of loci displaying one and only one type of allele is identified. In the second data set, each locus in the first set of loci is analyzed to determine whether an alternative allele is present. Loci with alternative alleles constitute a second set of loci. AAD is calculated as the number of loci in the second set of loci divided by the number of loci in the first set of loci, multiplied by 100%.
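The AAD calculation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the representation of each read as a `(locus, allele)` pair, the 50/50 random split, and the function names are all assumptions.

```python
import random
from collections import defaultdict


def split_reads(observations, seed=0):
    """Randomly divide (locus, allele) observations into two data sets.

    Assumption: each observation is assigned to either data set with
    equal probability; the text only says the split is random.
    """
    rng = random.Random(seed)
    first, second = [], []
    for obs in observations:
        (first if rng.random() < 0.5 else second).append(obs)
    return first, second


def alleles_by_locus(observations):
    """Collect the set of alleles observed at each locus."""
    seen = defaultdict(set)
    for locus, allele in observations:
        seen[locus].add(allele)
    return seen


def compute_aad(first, second):
    """Return AAD in percent for two data sets of (locus, allele) pairs."""
    first_alleles = alleles_by_locus(first)
    second_alleles = alleles_by_locus(second)
    # First set: loci showing one and only one allele in the first data set.
    first_set = {
        locus: next(iter(alleles))
        for locus, alleles in first_alleles.items()
        if len(alleles) == 1
    }
    if not first_set:
        raise ValueError("no single-allele loci in the first data set")
    # Second set: first-set loci where the second data set shows
    # an allele different from the one seen in the first data set.
    second_set = [
        locus
        for locus, allele in first_set.items()
        if second_alleles.get(locus, set()) - {allele}
    ]
    return 100.0 * len(second_set) / len(first_set)
```

Loci of the first set that are not covered in the second data set still count in the denominator here, which follows the definition given in the text.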
The fetal DNA fraction corresponding to the AAD value of each of the 10 samples was determined from the calibration curve in FIG. 10A. In addition, the fetal DNA fraction of each of the 10 samples was determined from the proportion of reads derived from the Y chromosome.
FIG. 10C shows the fetal DNA fractions determined from AAD values on the y-axis against the fetal DNA fractions based on the proportion of reads derived from the Y chromosome. The fetal DNA fraction estimated from the AAD value correlated well with the actual fetal DNA fraction, with an R-squared of 0.896. The median deviation from the actual fetal DNA fraction was only 0.8%, indicating that the fetal DNA fraction was predicted with high accuracy. The AAD-based calibration curve was therefore observed to generalize to a new set of samples.
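The validation step above can be sketched as follows. The slope and intercept are the values reported for FIG. 10A; everything else is an assumption made for illustration, including the function names, the reading of AAD and fetal DNA fraction as percentages, and the reading of "median deviation" as the median absolute difference. The sample values in the usage block are hypothetical, not data from the study.

```python
from statistics import median

SLOPE = 11.61        # reported slope of the FIG. 10A regression
INTERCEPT = -109.93  # reported y-intercept of the FIG. 10A regression


def fetal_fraction_from_aad(aad_percent):
    """Map an AAD value to an estimated fetal DNA fraction.

    Assumption: both AAD and the returned fraction are in percent.
    """
    return SLOPE * aad_percent + INTERCEPT


def median_absolute_deviation(estimated, reference):
    """Median of |estimate - reference| over paired samples (assumed metric)."""
    if len(estimated) != len(reference):
        raise ValueError("estimated and reference must have the same length")
    return median(abs(e - r) for e, r in zip(estimated, reference))


if __name__ == "__main__":
    # Hypothetical AAD values and Y-chromosome-based fractions.
    aad_values = [10.0, 10.5, 11.0]
    chr_y_fractions = [6.0, 12.5, 18.0]
    estimates = [fetal_fraction_from_aad(a) for a in aad_values]
    print(median_absolute_deviation(estimates, chr_y_fractions))
```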
The accuracy of AAD-based fetal DNA fraction estimation may increase with higher fetal DNA fractions in the sample, decreased sequencing error rates, and the use of calibration curves based on samples from individuals with similar genetic profiles.
D. Zygosity of twins with AAD
AAD can be used to classify twins as monozygotic or dizygotic. Dizygotic twins have different genotypes. At a locus where the fetuses have different genotypes, at least one fetus has a non-maternal allele. The proportion of loci with non-maternal alleles in plasma samples from dizygotic twin pregnancies will therefore be higher than in plasma samples from monozygotic twin pregnancies. For monozygotic twins, the proportion of loci with non-maternal alleles is not expected to be higher than in plasma samples from a pregnancy with a single fetus, because the genotypes of the fetuses are the same. Thus, the AAD calculated from the proportion of loci with non-maternal alleles, and the fetal DNA fraction calculated from it, are expected to be higher for dizygotic twins than for monozygotic twins.
FIG. 14 shows the fetal DNA fraction calculated for six different sets of twins. Three sets of twins were monozygotic and three sets were dizygotic. The fetal DNA fraction is estimated by two methods. In the first method, the fetal DNA fraction is estimated based on the size of DNA molecules (Yu SC et al., Proc Natl Acad Sci U S A 2014; 111: 8583-8). The size of DNA molecules is not expected to vary with the zygosity of the fetuses. In the second method, the fetal DNA fraction is estimated from an amount of loci, e.g., from AAD values as described in the examples above. The AAD value is expected to vary with the zygosity of the fetuses. FIG. 14 shows that the difference between the AAD-based fetal DNA fraction and the size-based fetal DNA fraction is greater for dizygotic twins than for monozygotic twins. This difference between the fetal DNA fraction estimates can be used to classify the fetuses as monozygotic or dizygotic.
To classify the zygosity of multiple fetuses, the AAD value can be used to estimate a first fetal DNA fraction of a biological sample, as described herein. The first fetal DNA fraction may then be compared to a cutoff value. The cutoff value may be set to a value greater than a second fetal DNA fraction of the biological sample. The second fetal DNA fraction may be estimated by a method whose estimate does not vary with the zygosity of the fetuses contributing DNA to the sample. For example, the second fetal DNA fraction may be based on a size profile of DNA molecules in the biological sample. The cutoff value may exceed the second fetal DNA fraction by an absolute percentage, a relative percentage, or a multiple of the standard deviation. For example, in FIG. 14, the cutoff value may be 2 to 4 absolute percentage points greater than the size-based fetal DNA fraction.
If the calculated fetal DNA fraction is greater than the cutoff value, the fetuses may be classified as dizygotic. If the calculated fetal DNA fraction is less than the cutoff value, the fetuses may be classified as monozygotic. In some embodiments, two cutoff values may be used, with the first cutoff value being greater than the second cutoff value. If the calculated fetal DNA fraction is greater than or equal to the first cutoff value, the fetuses may be classified as dizygotic. If the calculated fetal DNA fraction is less than or equal to the second cutoff value, the fetuses may be classified as monozygotic. If the calculated fetal DNA fraction is between the two cutoff values, the zygosity may be classified as indeterminate. The fetuses may then undergo further zygosity testing.
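The two-cutoff classification can be sketched as follows. This is an illustrative sketch only: the function name, the labels, and the choice of 4 and 2 absolute percentage points as the two margins (taken from the 2 to 4 range mentioned for FIG. 14) are assumptions, not values prescribed by the text.

```python
def classify_zygosity(aad_fraction, size_fraction,
                      upper_margin=4.0, lower_margin=2.0):
    """Classify twins from two fetal DNA fraction estimates (in percent).

    aad_fraction:  fetal DNA fraction estimated from the AAD value.
    size_fraction: fetal DNA fraction estimated from the size profile,
                   assumed not to depend on zygosity.
    The margins are hypothetical absolute offsets above size_fraction.
    """
    if upper_margin < lower_margin:
        raise ValueError("upper_margin must not be below lower_margin")
    first_cutoff = size_fraction + upper_margin
    second_cutoff = size_fraction + lower_margin
    if aad_fraction >= first_cutoff:
        return "dizygotic"
    if aad_fraction <= second_cutoff:
        return "monozygotic"
    # Between the two cutoffs: further zygosity testing may follow.
    return "indeterminate"
```

Setting both margins to the same value reduces this to the single-cutoff case, except that a value exactly equal to the cutoff is then classified as dizygotic.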
E. AAD with sized loci
The calculation of AAD can be based on identifying non-maternal alleles by features other than the sequence itself. For example, as described above, fetal DNA is shorter than maternal DNA. Thus, a long DNA molecule may carry a maternal allele, while a short DNA molecule may carry a non-maternal allele. The feature indicating a non-maternal allele may therefore be a size parameter of the DNA molecules at the locus. The size parameter may be an absolute size or a size relative to other DNA molecules.
Loci with non-maternal alleles can be identified based on size differences from maternal alleles. A larger fetal DNA fraction may be correlated with a larger proportion of loci at which molecules in one data set show at least a certain size difference from molecules in another data set.
As an example, shallow depth sequence data of one aliquot of maternal DNA from a pregnant woman was analyzed and a first set of loci with DNA molecules greater than 166bp in length were identified. The second aliquot, with maternal and fetal DNA from the same pregnant woman, was sequenced at a shallow depth. In the data from the second aliquot, a second set of loci of DNA molecules with size parameters (size values) shorter than 143bp were identified. In other words, the difference between the size parameters of two aliquots of DNA molecules at a given locus is at least 23 bp. The number of loci in the second set of loci divided by the number of loci in the first set of loci gives a ratio of loci having a size difference of at least 23 bp. The fetal DNA fraction of the pregnant woman is also determined. The procedure was repeated for another 23 pregnant women and the results plotted. The calculation may also be accomplished by first determining loci with size values below a size threshold, and then determining the proportion of those loci in different aliquots that have size values above a second threshold.
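The proportion of loci showing a size difference, as described in the example above, can be sketched as follows. The 166 bp and 143 bp thresholds come from the text; the input layout (a mapping from locus to a list of fragment sizes) and the use of the median as the per-locus size value are assumptions made for illustration.

```python
from statistics import median


def size_difference_proportion(first_sizes, second_sizes,
                               long_threshold=166, short_threshold=143):
    """Proportion of long-fragment loci that show short fragments.

    first_sizes / second_sizes: dict mapping a locus to the list of
    fragment sizes (bp) of the molecules covering it in each aliquot.
    Assumption: the size value of a locus is the median fragment size.
    """
    # First set: loci whose size value in the first aliquot exceeds 166 bp.
    first_set = {
        locus for locus, sizes in first_sizes.items()
        if sizes and median(sizes) > long_threshold
    }
    if not first_set:
        raise ValueError("no loci above the long-fragment threshold")
    # Second set: first-set loci whose size value in the second aliquot
    # is below 143 bp, i.e. a difference of at least 23 bp.
    second_set = {
        locus for locus in first_set
        if second_sizes.get(locus)
        and median(second_sizes[locus]) < short_threshold
    }
    return len(second_set) / len(first_set)
```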
FIG. 15 shows the relationship between fetal DNA fraction and the proportion of loci showing size differences. The x-axis is the proportion of loci showing a size difference of at least 23 bp between the two aliquots. The y-axis is the fetal DNA fraction. There is a positive correlation between fetal DNA fraction and the proportion of loci showing size differences. The R-squared value is 0.62 (p = 0.0011).
The correlation between the proportion of loci showing a size difference and the fetal DNA fraction shows that the fetal DNA fraction can be estimated using that proportion as a parameter (similar to AAD). The size difference need not be 23 bp. In other embodiments, the size difference may be at least 10, 20, 30, 40, or 50 bp. The data for the two sets of loci need not come from two different aliquots; the data may be obtained from the same biological sample.
A minimum size difference can be used as an additional factor in identifying non-maternal alleles. With shallow sequencing, if an allele found in the second data set differs from the maternal allele in the first data set, the allele in the second data set may be a non-maternal allele. However, it may instead be a second maternal allele that, because of the shallow depth, was not sequenced in the first data set. If the molecules carrying the allele in the second data set are similar in size to those carrying the maternal allele, the allele may be a second maternal allele. Thus, considering the size difference of the alleles in the second data set can improve the identification of loci with non-maternal alleles.
F. Exemplary method of measuring fetal DNA fraction using loci having DNA molecules exhibiting size differences
Fig. 16 shows a method 1600 of measuring a fraction of fetal DNA in a biological sample of a pregnant woman carrying a fetus. The biological sample includes maternal DNA molecules and fetal DNA molecules. The biological sample can be any biological sample described herein.
In block 1602, the method 1600 includes receiving a first data set of a first plurality of reads from a first plurality of DNA molecules. The data set may be received by the computer system from a sequencing device or a data storage device. The first plurality of DNA molecules may or may not be from the biological sample. The first plurality of DNA molecules may be from a biological sample that does not contain fetal DNA.
In block 1603, the method 1600 includes identifying a location of a first plurality of reads in a reference genome and determining a size of a DNA molecule corresponding to the first plurality of reads.
In block 1604, the method 1600 includes identifying a first set of loci in the first data set. For each locus in the first set of loci, the DNA molecules of the first plurality that include reads at the locus have a first size distribution, and a first size value of the first size distribution exceeds a first size threshold. In some embodiments, all DNA molecules having reads at the first set of loci exceed the first size threshold. The first set of loci can be selected from a set of loci in a reference database, as described in method 900, or other factors can be considered. The plurality of reads may be at shallow depth.
In block 1606, method 1600 includes determining a first amount of loci. The first amount can be a number of loci in the first set of loci from the first dataset.
In block 1608, a second data set of a second plurality of reads of a second plurality of DNA molecules from the biological sample may be received. The second data set may be received by the computer system from the sequencing device or the data storage device. The method 1600 may include measuring the size of the second plurality of DNA molecules, or receiving size information of the second plurality of DNA molecules.
In block 1609, the method 1600 includes identifying a location of a second plurality of reads in the reference genome and determining a size of the DNA molecule corresponding to the second plurality of reads.
In block 1610, a second set of loci is identified in the second data set from the second plurality of reads. Each locus of the second set of loci is a locus of the first set of loci. For each locus in the second set of loci, the DNA molecules that include reads at the locus have a second size distribution, and a second size value of the second size distribution exceeds a second size threshold in the direction opposite to that in which the first size value exceeds the first size threshold.
The first size value may be greater than a first size threshold, and the second size value may be less than a second size threshold, and the second size threshold may be less than the first size threshold. In other embodiments, the first size value may be less than the first size threshold, the second size value may be greater than the second size threshold, and the second size threshold is greater than the first size threshold.
In block 1612, the method 1600 includes determining a second amount of loci in the second set of loci in the second data set. The second amount can be the number of loci in the second set of loci.
In block 1614, a normalization parameter may be determined from the first amount and the second amount. In some embodiments, the normalization parameter may include the second amount divided by the first amount. The normalization parameter may be a ratio of the number of loci having DNA molecules smaller than a certain size to the number of loci having DNA molecules larger than a certain size. In other embodiments, the normalization parameter may include the second amount divided by the sum of the first amount and the second amount. In these embodiments, the normalization parameter can be a ratio of the number of loci having DNA molecules of the smaller size to the total number of loci. The normalization parameter may also be the inverse of any of these calculations. The normalization parameter may be a type of AAD.
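The alternative forms of the normalization parameter described for block 1614 can be sketched as follows. The function name and the `form` and `inverse` arguments are hypothetical and serve only to make the variants explicit.

```python
def normalization_parameter(first_amount, second_amount,
                            form="ratio", inverse=False):
    """Combine the two locus counts into a normalization parameter.

    form="ratio":      second_amount / first_amount
    form="proportion": second_amount / (first_amount + second_amount)
    inverse=True returns the reciprocal of the selected form.
    """
    if form == "ratio":
        numerator, denominator = second_amount, first_amount
    elif form == "proportion":
        numerator, denominator = second_amount, first_amount + second_amount
    else:
        raise ValueError(f"unknown form: {form!r}")
    if inverse:
        numerator, denominator = denominator, numerator
    if denominator == 0:
        raise ZeroDivisionError("normalization parameter is undefined")
    return numerator / denominator
```

Whichever form is chosen, the same form would have to be used for the calibration samples so that the values are comparable.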
In block 1616, the method 1600 may include comparing the normalization parameter to a calibration point determined using at least one other sample (e.g., a calibration sample) having a known fetal DNA fraction and a calibration value corresponding to a separate measurement of the parameter in the at least one other sample. The calibration point may be one of a plurality of calibration points. The plurality of calibration points may constitute a calibration curve. The calibration curve may be a curve fitted to data points of known fetal DNA fractions and normalization parameters determined for different biological samples. The calibration curve may be a linear regression of the data points. The calibration curve may have a slope different from 1.
The calibration curve may be determined using the known fetal DNA fraction and a normalization parameter from another biological sample (i.e., a second normalization parameter) determined by a method similar to the normalization parameter from the biological sample currently being analyzed (i.e., a first normalization parameter). The second normalization parameter may also be determined by operations similar to blocks 1602 through 1614. The number of reads associated with a locus in a dataset from another biological sample may be approximately equal to the number of reads in the current biological sample. The number of reads may be within 1x, 5x, or 10x of each other.
In block 1618, a fetal DNA fraction is calculated based on the comparison. The fetal DNA fraction may be the fetal DNA fraction on the calibration curve corresponding to the same value of the normalization parameter. In some embodiments, the fetal DNA fraction may be interpolated between the two fetal DNA fractions corresponding to two values of the normalization parameter. In other embodiments, the calibration curve may be a linear equation of the form y = mx + b, where y is the fetal DNA fraction, x is the normalization parameter, and m and b are parameters fitted to the calibration data.
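The interpolation between calibration points mentioned for block 1618 can be sketched as follows. Linear interpolation is one possible choice and is an assumption here, as are the function name and the representation of calibration points as `(parameter, fetal_fraction)` pairs.

```python
from bisect import bisect_left


def interpolate_fetal_fraction(calibration_points, parameter_value):
    """Linearly interpolate a fetal DNA fraction between calibration points.

    calibration_points: iterable of (parameter, fetal_fraction) pairs.
    Values outside the calibrated range are rejected, not extrapolated.
    """
    points = sorted(calibration_points)
    if not points:
        raise ValueError("at least one calibration point is required")
    xs = [x for x, _ in points]
    if not xs[0] <= parameter_value <= xs[-1]:
        raise ValueError("parameter value outside the calibrated range")
    i = bisect_left(xs, parameter_value)
    if xs[i] == parameter_value:
        # Exact match with a calibration point.
        return points[i][1]
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (parameter_value - x0) / (x1 - x0)
```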
III. Further embodiments
Embodiment 1 includes a method of measuring a fetal DNA fraction in a biological sample of a pregnant woman carrying a fetus, the biological sample including maternal DNA molecules and fetal DNA molecules, the method comprising: obtaining a plurality of reads from DNA molecules of the biological sample; identifying a plurality of loci at which the woman is homozygous; determining a first amount of reads that show non-maternal alleles at the plurality of loci; determining a total amount of reads at the plurality of loci; determining a non-maternal allele fraction from the first amount and the total amount; obtaining a calibration curve determined using known fetal DNA fractions and measured non-maternal allele fractions; and calculating the fetal DNA fraction using the calibration curve and the non-maternal allele fraction.
Embodiment 2 includes the method of embodiment 1, further comprising calculating the calibration curve by: determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women; calculating non-maternal allele fractions for the plurality of other samples; and fitting the fetal DNA fractions and the non-maternal allele fractions to a linear function.
Embodiment 3 includes the method of embodiment 2, wherein determining the fetal DNA fraction of another sample comprises: identifying a second plurality of loci at which the fetus is heterozygous and the pregnant woman is homozygous; obtaining a plurality of reads from DNA molecules of the other sample; determining a second amount of reads having fetal-specific alleles at the second plurality of loci; determining a third amount of reads having shared alleles at the second plurality of loci; and determining the fetal DNA fraction using the second amount and the third amount.
Embodiment 4 includes the method of embodiment 1, wherein the non-maternal alleles are limited to alleles identified in a database as corresponding to biallelic loci.
Embodiment 5 includes the method of embodiment 1, wherein identifying the plurality of loci at which the woman is homozygous comprises genotyping a cell sample from the woman.
Embodiment 6 includes the method of embodiment 1, further comprising: receiving the biological sample; and sequencing a plurality of DNA molecules in the biological sample to obtain the reads.
Embodiment 7 includes the method of embodiment 1, further comprising: receiving the biological sample; and analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the reads.
Embodiment 8 includes a computer product comprising a computer-readable medium storing a plurality of instructions for controlling a computer system to perform the operations of any of the methods of embodiments 1-7.
Embodiment 9 includes a system comprising: the computer product of embodiment 8; and one or more processors configured to execute instructions stored on the computer-readable medium.
Embodiment 10 includes a system comprising means for performing any of the methods of embodiments 1-7.
Embodiment 11 includes a system configured to perform any of the methods of embodiments 1-7.
Embodiment 12 includes a system comprising means for performing the steps of any of the methods of embodiments 1-7, respectively.
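The calibration described in the first two embodiments above (a non-maternal allele fraction fitted to known fetal DNA fractions by a linear function) can be sketched as follows. Ordinary least squares is assumed as the fitting method, and the function names are hypothetical.

```python
def non_maternal_allele_fraction(non_maternal_reads, total_reads):
    """Fraction of reads at the homozygous loci showing a non-maternal allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return non_maternal_reads / total_reads


def fit_calibration_line(allele_fractions, fetal_fractions):
    """Fit fetal_fraction = slope * allele_fraction + intercept.

    Assumption: ordinary least squares over the calibration samples.
    """
    n = len(allele_fractions)
    if n < 2 or n != len(fetal_fractions):
        raise ValueError("need two or more paired calibration samples")
    mean_x = sum(allele_fractions) / n
    mean_y = sum(fetal_fractions) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_fractions)
    if sxx == 0:
        raise ValueError("allele fractions must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_fractions, fetal_fractions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_fetal_fraction(allele_fraction, slope, intercept):
    """Apply the fitted calibration line to a new sample."""
    return slope * allele_fraction + intercept
```

A new sample would be processed by computing its non-maternal allele fraction and passing it, with the fitted slope and intercept, to `estimate_fetal_fraction`.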
IV. Computer system
Any computer system mentioned herein may use any suitable number of subsystems. Examples of such subsystems are shown in computer system 10 in FIG. 17. In some embodiments, the computer system comprises a single computer device, wherein the subsystems may be components of the computer device. In other embodiments, a computer system may include multiple computer devices with internal components, each computer device being a subsystem. Computer systems may include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
The subsystems shown in FIG. 17 are interconnected by a system bus 75. Additional subsystems such as a printer 74, keyboard 78, storage device 79, and monitor 76 coupled to display adapter 82 are shown. Peripheral devices and input/output (I/O) devices coupled to I/O controller 71 may be connected to the computer system by any number of means known in the art, such as input/output (I/O) ports 77 (e.g., USB). For example, the I/O ports 77 or an external interface 81 (e.g., Ethernet, Wi-Fi, etc.) may be used to connect the computer system 10 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus 75 allows central processor 73 to communicate with each subsystem and to control the execution of instructions from system memory 72 or storage device 79 (e.g., a fixed disk such as a hard drive, or an optical disk) and the exchange of information between subsystems. System memory 72 and/or storage device 79 may be embodied as computer-readable media. Another subsystem is a data collection device 85, such as a camera, microphone, accelerometer, and the like. Any data mentioned herein may be output from one component to another component and may be output to a user.
The computer system may include a number of identical components or subsystems, connected together, for example, through an external interface 81 or through an internal interface. In some embodiments, computer systems, subsystems, or devices may communicate over a network. In this case, one computer may be considered a client and another computer a server, each of which may be part of the same computer system. A client and server may each comprise a plurality of systems, subsystems, or components.
It should be appreciated that any of the embodiments of the invention may be implemented in hardware (e.g., an application specific integrated circuit or a field programmable gate array) in the form of control logic and/or in a modular or integrated manner using computer software with a generally programmable processor. As used herein, a processor includes a single-core processor, a multi-core processor on the same integrated chip, or multiple processing units or networked processing units on a single circuit board. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.
Any of the software components or functions described herein may be implemented as software code to be executed by a processor using any suitable computer language, such as, for example, Java, C++, C#, Objective-C, Swift, or scripting languages such as Perl or Python, using, for example, conventional or object-oriented techniques.
Such programs may also be encoded and transmitted using carrier wave signals suitable for transmission over wired, optical, and/or wireless networks conforming to various protocols including the internet. Thus, a computer-readable medium according to one embodiment of the invention may be created using a data signal encoded with such a program. The computer readable medium encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via internet download). Any such computer-readable media may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may exist on or within different computer products within a system or network. The computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
FIG. 18 shows an exemplary sequencing system. The system depicted in FIG. 18 includes a sequencing device 1802 and an intelligence module 1804 that is part of a computer system 1806. The sequencing device 1802 can include any of the sequencing devices described herein. The computer system 1806 may include a portion or all of the computer system 10. The data sets (sequencing read data sets) are transmitted from the sequencing device 1802 to the intelligence module 1804, or vice versa, via a network connection or a direct connection. The data set may, for example, be processed to identify certain loci. The identifying and determining steps may be implemented by software stored on the hardware of computer system 1806. The data set may be processed by computer code running on the processor and stored on a storage device of the intelligence module and, after processing, transferred back to a storage device of the analysis module, where the modified data may be displayed on a display device. In some embodiments, the intelligence module may also be implemented in the sequencing device.
FIG. 19 shows that computer system 1900 can include receiving means 1910, which can, for example, receive sequencing data obtained from a sequencing device. Computer system 1900 may further include identifying means 1920 for identifying a first set of loci in a first data set from a plurality of reads from DNA molecules. Computer system 1900 may also include determining means 1930 for determining a first amount of the first set of loci in the first data set. Computer system 1900 may also include identifying means 1940 for identifying a second set of loci in a second data set from a plurality of reads. Computer system 1900 may also include determining means 1950 for determining a second amount of loci in the second set of loci in the second data set. Computer system 1900 may further include determining means 1960 for determining a normalization parameter from the first amount and the second amount. Computer system 1900 may additionally include obtaining means 1970 for obtaining calibration points determined using known fetal DNA fractions. Computer system 1900 may also include calculating means 1980 for calculating a fetal DNA fraction using the calibration points and the normalization parameter.
Any of the methods described herein may be performed in whole or in part with a computer system comprising one or more processors, which may be configured to perform the steps. Thus, embodiments may relate to a computer system configured to perform the steps of any of the methods described herein, potentially with different components performing the respective steps or respective groups of steps. Although presented in numbered steps, the steps of the methods herein may be performed simultaneously or in a different order. Additionally, some of these steps may be used with some of the other steps from other methods. Also, all or portions of a step may be optional. Further, any steps of any method may be performed by modules, units, circuits or other means for performing the steps.
The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of the embodiments of the invention. However, other embodiments of the invention may relate to specific embodiments relating to each individual aspect, or specific embodiments combining specific combinations of these individual aspects.
The foregoing description of example embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the above teaching.
In the previous description, for purposes of explanation, numerous details have been set forth in order to provide an understanding of various embodiments of the present technology. It will be apparent, however, to one skilled in the art that certain embodiments may be practiced without some of these details or with additional details.
Having described several embodiments, it will be recognized by those of skill in the art that various modifications, alternative constructions, and equivalents may be used without departing from the spirit of the invention. In addition, many well known processes and elements have not been described in order to avoid unnecessarily obscuring the present invention. In addition, details of any particular embodiment may not always be present in variations of that embodiment, or may be added to other embodiments.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in that range and any other stated or intervening value in that range is encompassed. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those limits are also included.
Unless specifically stated to the contrary, a recitation of "a," "an," or "the" is intended to mean "one or more." The use of "or" is intended to mean an "inclusive or," and not an "exclusive or," unless specifically indicated to the contrary.
All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes. None is admitted to be prior art.

Claims (360)

1. A method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus for non-diagnostic purposes, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing by a computer system the steps of:
identifying a plurality of loci based on sequence information indicating that the woman is homozygous for the first allele at each of the plurality of loci;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first value for a first set of DNA molecules based on the plurality of reads, wherein:
each DNA molecule of the first set of DNA molecules comprises a read located at a locus of the plurality of loci and exhibiting a second allele at the locus that is different from the first allele,
the first value defines a property of the first set of DNA molecules,
the property comprises a size parameter or an amount of reads, and
a portion of the plurality of loci does not comprise reads that exhibit a second allele different from the first allele;
determining a second value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read located at a locus of the plurality of loci and exhibiting the first allele at the locus, and
the second value defines a property of the second set of DNA molecules;
determining a parameter value for the parameter from the first value and the second value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
2. The method of claim 1, wherein:
the first value is a first number of reads located at the plurality of loci and having the second allele,
the second value is a second number of reads located at the plurality of loci and having the first allele, and
the parameter is the non-maternal allele fraction.
3. The method of claim 1, wherein determining the first value and the second value further comprises:
measuring the sizes of the first set of DNA molecules and the second set of DNA molecules, wherein the first value is a first size value of a first size distribution of the first set of DNA molecules, and wherein the second value is a second size value of a second size distribution of the second set of DNA molecules.
4. A method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus for non-diagnostic purposes, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing by a computer system the steps of:
identifying a plurality of loci based on sequence information indicating that the woman is homozygous for the first allele at each of the plurality of loci;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first amount of reads, wherein:
each read of the first quantity of reads is located at a site of the plurality of sites,
each read exhibits a second allele at the locus that is different from the first allele, and
a portion of the plurality of loci does not comprise reads that exhibit a second allele different from the first allele;
determining a second quantity of reads at the plurality of sites, wherein:
each read of the second quantity of reads is located at a site of the plurality of sites, and
each read shows the first allele at the site;
determining a non-maternal allele fraction from the first amount and the second amount;
obtaining a calibration point determined using another sample having a known fetal DNA fraction and a measured non-maternal allele fraction; and
calculating the fetal DNA fraction using the calibration points and the non-maternal allele fraction.
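The read-counting steps of claim 4 can be illustrated with a minimal Python sketch. The function names, the choice of denominator (non-maternal reads divided by all reads at the selected sites), and the linear form of the calibration are illustrative assumptions; the claim itself only requires that the fraction be determined from the two amounts and used together with a calibration point.

```python
def non_maternal_allele_fraction(non_maternal_reads: int, maternal_reads: int) -> float:
    """Fraction of reads at maternal-homozygous sites that show a non-maternal allele.

    Assumption: fraction = first amount / (first amount + second amount).
    The claim does not fix this formula.
    """
    total = non_maternal_reads + maternal_reads
    if total == 0:
        raise ValueError("no reads cover the selected sites")
    return non_maternal_reads / total


def fetal_fraction_from_calibration(parameter: float, slope: float, intercept: float) -> float:
    """Invert an assumed linear calibration: parameter = slope * fetal_fraction + intercept."""
    if slope == 0:
        raise ValueError("calibration slope must be non-zero")
    return (parameter - intercept) / slope
```

In this sketch the slope and intercept stand in for calibration points obtained from other samples with known fetal DNA fractions; the values passed in are hypothetical.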
5. The method of claim 4, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve,
the method further comprises the following steps:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
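The linear fit recited in claim 5 can be sketched as an ordinary least-squares line through (fetal DNA fraction, non-maternal allele fraction) pairs from the other samples. Ordinary least squares and the function name are assumptions made for illustration; the claim only requires fitting to a linear function.

```python
def fit_calibration_line(fetal_fractions, allele_fractions):
    """Least-squares fit of allele_fraction = slope * fetal_fraction + intercept.

    Returns (slope, intercept). Ordinary least squares is an assumed choice.
    """
    n = len(fetal_fractions)
    if n < 2 or n != len(allele_fractions):
        raise ValueError("need at least two paired calibration samples")
    mean_x = sum(fetal_fractions) / n
    mean_y = sum(allele_fractions) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in fetal_fractions)
    if sxx == 0:
        raise ValueError("calibration samples must differ in fetal DNA fraction")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(fetal_fractions, allele_fractions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A fetal DNA fraction for a new sample would then be read off the fitted line by inverting it at the sample's measured non-maternal allele fraction.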
6. The method of claim 4, wherein determining the second amount of reads at the plurality of sites comprises determining a total number of reads of the plurality of reads, wherein the total number of reads of the plurality of reads is the second amount of reads.
7. A method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus for non-diagnostic purposes, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing by a computer system the steps of:
identifying a plurality of loci based on sequence information indicating that the woman is homozygous for the first allele at each of the plurality of loci;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining the size of the DNA molecules of the biological sample;
determining a first size value for a first set of DNA molecules, wherein:
each DNA molecule of the first set of DNA molecules comprises reads at a site of the plurality of sites and exhibits a second allele at the site that is different from the first allele, and
the first size value corresponds to a statistic of a first size distribution of the first set of DNA molecules;
determining a second size value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and displays the first allele at the site,
the second size value corresponds to a statistic of a second size distribution of the second set of DNA molecules, and
a portion of the plurality of loci does not comprise reads that exhibit a second allele different from the first allele;
determining a parameter value from the first size value and the second size value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter for the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
8. The method of claim 7, wherein:
the first size value is a first cumulative frequency of DNA molecules in the first set of DNA molecules having a maximum size, and
the second size value is a second cumulative frequency of DNA molecules in the second set of DNA molecules having the maximum size.
9. The method of claim 7, wherein:
the first size value is a median size, an average size, or a mode of the sizes of the first set of DNA molecules, and
the second size value is a median size, an average size, or a mode of the sizes of the second set of DNA molecules.
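The size statistics named in claims 8 and 9 can be illustrated with a short Python sketch. The function names are hypothetical, sizes are assumed to be fragment lengths in base pairs, and the cumulative frequency is assumed to mean the proportion of molecules no longer than a chosen maximum size.

```python
import statistics


def cumulative_frequency(sizes, max_size):
    """Proportion of DNA molecules whose size is at most max_size (assumed definition)."""
    if not sizes:
        raise ValueError("no DNA molecule sizes supplied")
    return sum(1 for s in sizes if s <= max_size) / len(sizes)


def size_value(sizes, statistic="median"):
    """Summarise a size distribution by its median, mean, or mode."""
    if statistic == "median":
        return statistics.median(sizes)
    if statistic == "mean":
        return statistics.mean(sizes)
    if statistic == "mode":
        return statistics.mode(sizes)
    raise ValueError(f"unsupported statistic: {statistic}")
```

The same function would be applied separately to the molecules carrying the second allele and to those carrying the first allele, giving the first and second size values.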
10. The method of claim 7, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve,
the method further comprises the following steps:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating parameter values for the plurality of other samples;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
11. The method of claim 5, wherein determining the fetal DNA fraction of each of the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a consensus allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
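The reference measurement in claim 11 can be sketched as follows. The claim does not state a formula; the sketch assumes the commonly used relation for loci where the mother is homozygous and the fetus is heterozygous, in which the fetal-specific allele represents half of the fetal contribution, so the fraction is 2p / (p + q) with p the fetal-specific reads and q the shared-allele reads. The function name is hypothetical.

```python
def fetal_fraction_from_informative_loci(fetal_specific_reads: int, shared_reads: int) -> float:
    """Estimate fetal DNA fraction at loci where the mother is homozygous and the fetus heterozygous.

    Assumed formula: 2 * p / (p + q), where p counts reads with the fetal-specific
    allele and q counts reads with the allele shared by mother and fetus.
    """
    total = fetal_specific_reads + shared_reads
    if total == 0:
        raise ValueError("no reads cover the informative loci")
    return 2 * fetal_specific_reads / total
```

For example, 50 fetal-specific reads and 950 shared-allele reads would give an estimated fetal DNA fraction of 10% under this assumption.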
12. The method of claim 10, wherein determining the fetal DNA fraction of each of the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a consensus allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
13. The method of claim 1, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
14. The method of claim 4, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
15. The method of claim 7, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
16. The method of claim 1, wherein:
the plurality of sites is a first plurality of sites,
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in the database as corresponding to biallelic loci.
17. The method of claim 4, wherein:
the plurality of sites is a first plurality of sites,
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in the database as corresponding to biallelic loci.
18. The method of claim 7, wherein:
the plurality of sites is a first plurality of sites,
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in the database as corresponding to biallelic loci.
19. The method of claim 1, wherein the second allele is limited to alleles identified in the database as corresponding to biallelic loci.
20. The method of claim 4, wherein the second allele is limited to alleles identified in the database as corresponding to biallelic loci.
21. The method of claim 7, wherein the second allele is limited to alleles identified in the database as corresponding to biallelic loci.
22. The method of claim 1, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
23. The method of claim 4, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
24. The method of claim 7, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
25. The method of claim 1, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
26. The method of claim 4, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
27. The method of claim 7, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
28. The method of claim 1, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
29. The method of claim 4, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
30. The method of claim 7, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
31. The method of claim 1, wherein:
the plurality of sites is a first plurality of sites,
the method further comprises the following steps:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
32. The method of claim 4, wherein:
the plurality of sites is a first plurality of sites,
the method further comprises the following steps:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
33. The method of claim 7, wherein:
the plurality of sites is a first plurality of sites,
the method further comprises the following steps:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
34. The method of claim 1, further comprising:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
35. The method of claim 4, further comprising:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
36. The method of claim 7, further comprising:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
37. The method of claim 1, further comprising:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
38. The method of claim 4, further comprising:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
39. The method of claim 7, further comprising:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
40. The method of claim 1, wherein the plurality of reads comprises less than or equal to 50 million reads.
41. The method of claim 4, wherein the plurality of reads comprises less than or equal to 50 million reads.
42. The method of claim 7, wherein the plurality of reads comprises less than or equal to 50 million reads.
43. The method of claim 1, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
44. The method of claim 4, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
45. The method of claim 7, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
46. A method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus for non-diagnostic purposes, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set comprising a first plurality of reads of DNA molecules from a first sample from a female pregnant with the fetus;
identifying a location of the first plurality of reads in a reference genome;
identifying a first set of loci based on the first dataset and the identified locations, wherein no loci in the first set of loci display more than one allele;
determining a first amount of a locus in the first set of loci;
receiving a second data set comprising a second plurality of reads of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
identifying a second set of loci based on the second data set and the identified locations, wherein:
each of the second set of loci is one of the first set of loci,
each locus in the second set of loci displays an allele that is different from the allele displayed in the first set of loci, and
a portion of the first set of loci does not comprise reads in the second plurality of reads that display alleles different from the alleles displayed in the first plurality of reads;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value for the first quantity and the second quantity;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
47. The method of claim 46, wherein the first sample is the biological sample.
48. The method of claim 46, wherein the first sample is not the biological sample, and wherein the first sample does not comprise fetal DNA molecules.
49. The method of claim 46, wherein the alleles that differ from the alleles displayed in the first set of loci are limited to alleles identified in a database as corresponding to a bi-allelic locus.
50. The method of claim 46, wherein the first amount is the number of loci and the second amount is the number of loci in the second data set.
51. The method of claim 46, wherein determining the first quantity comprises determining the first quantity in the second data set.
52. The method of claim 46, further comprising:
measuring the size of a first plurality of DNA molecules corresponding to the first plurality of reads, and
measuring the size of a second plurality of DNA molecules corresponding to the second plurality of reads,
wherein:
the first plurality of DNA molecules has a first size value,
the first size value corresponds to a statistic of a first size distribution of the first plurality of DNA molecules,
the second plurality of DNA molecules has a second size value,
the second size value corresponds to a statistic of a second size distribution of the second plurality of DNA molecules, and
the first size value is greater than the second size value by a minimum difference.
53. The method of claim 46, wherein:
a first plurality of DNA molecules comprising reads at the first set of loci are larger than a first size,
a second plurality of DNA molecules comprising reads at the second set of loci are smaller than a second size, and
the difference between the first size and the second size is greater than a minimum difference.
54. The method of claim 52, wherein the minimum difference is 5 bp.
55. The method of claim 53, wherein the minimum difference is 5 bp.
56. The method of claim 46, wherein the female is carrying a plurality of fetuses, the method further comprising:
comparing the fetal DNA fraction to a cut-off value, and:
classifying the plurality of fetuses as monozygotic if the calculated fetal DNA fraction is below the cutoff value, or
classifying the plurality of fetuses as dizygotic if the calculated fetal DNA fraction is above the cutoff value.
57. The method of claim 56, wherein:
the fetal DNA fraction is the first fetal DNA fraction,
the cutoff value is determined as a value greater than a second fetal DNA fraction of the biological sample, and
the second fetal DNA fraction is determined without using the normalized parameter value.
58. The method of claim 57, wherein the second fetal DNA fraction is estimated using a size profile of DNA molecules in the biological sample.
59. A method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus for non-diagnostic purposes, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set from a first plurality of reads of a first plurality of DNA molecules;
identifying a location of the first plurality of reads in a reference genome;
determining the size of the DNA molecules corresponding to the first plurality of reads;
identifying a first set of loci in the first data set, wherein the DNA molecules comprising reads at each locus of the first set of loci have a first size distribution and have a first size value of the first size distribution that exceeds a first size threshold;
determining a first amount of a locus in the first set of loci;
receiving a second data set of a second plurality of reads of a second plurality of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
determining the size of the DNA molecules corresponding to the second plurality of reads;
identifying a second set of loci in the second data set, wherein:
each of the second set of loci is one of the first set of loci, and
the DNA molecules comprising reads at each locus of the second set of loci have a second size distribution and have a second size value of the second size distribution that exceeds a second size threshold in a direction opposite to that in which the first size value exceeds the first size threshold;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value for the first quantity and the second quantity;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
60. The method of claim 59, wherein the first size value is greater than the first size threshold and the second size value is less than the second size threshold, and wherein the second size threshold is less than the first size threshold.
61. The method of claim 59, wherein the first size value is less than the first size threshold and the second size value is greater than the second size threshold, and wherein the second size threshold is greater than the first size threshold.
62. The method of claim 59, wherein the difference between the first size value and the second size value is 10 bp.
63. The method of claim 46, wherein the first data set comprises reads of DNA molecules from the biological sample.
64. The method of claim 59, wherein the first data set comprises reads of DNA molecules from the biological sample.
65. The method of claim 46, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
66. The method of claim 59, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
67. The method of claim 46, further comprising:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
68. The method of claim 59, further comprising:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
69. The method of claim 46, further comprising:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
70. The method of claim 59, further comprising:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
71. The method of claim 46, wherein the second plurality of reads comprises less than or equal to 50 million reads.
72. The method of claim 59, wherein the second plurality of reads comprises less than or equal to 50 million reads.
73. The method of claim 46, wherein the second plurality of reads is at or less than 1x coverage of a haploid human genome.
74. The method of claim 59, wherein the second plurality of reads is at or less than 1x coverage of a haploid human genome.
75. The method of claim 46, wherein the normalized parameter value comprises the second quantity divided by the first quantity, or wherein the normalized parameter value comprises the second quantity divided by a sum of the first quantity and the second quantity.
76. The method of claim 59, wherein the normalized parameter value comprises the second quantity divided by the first quantity, or wherein the normalized parameter value comprises the second quantity divided by a sum of the first quantity and the second quantity.
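The two normalization options recited in claims 75 and 76 can be written as a short Python sketch. The function name and the `use_sum` flag are hypothetical; the amounts are assumed to be counts of loci in the first and second sets.

```python
def normalized_parameter(first_amount: float, second_amount: float, use_sum: bool = False) -> float:
    """Normalize the second amount by the first amount, or by the sum of both.

    use_sum=False: second / first
    use_sum=True:  second / (first + second)
    """
    denominator = first_amount + second_amount if use_sum else first_amount
    if denominator == 0:
        raise ValueError("denominator is zero; no loci to normalize against")
    return second_amount / denominator
```

Whichever form is chosen, the calibration samples would need to be processed with the same form so that the resulting values remain comparable.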
77. The method of claim 46, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
78. The method of claim 59, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
79. The method of claim 77, wherein the calibration curve is linear.
80. The method of claim 78, wherein the calibration curve is linear.
81. The method of claim 77, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
determining a calibration point using the known fetal DNA fraction and a second normalization parameter value determined from the set of loci in the data set of the third plurality of reads of DNA molecules in the second biological sample according to the corresponding method for the first normalization parameter value, and
the second plurality of reads comprises a first coverage of a haploid human genome that is within 10x coverage of a second coverage for reads of the third plurality of reads.
82. The method of claim 78, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
determining a calibration point using the known fetal DNA fraction and a second normalization parameter value determined from the set of loci in the data set of the third plurality of reads of DNA molecules in the second biological sample according to the corresponding method for the first normalization parameter value, and
the second plurality of reads comprises a first coverage of a haploid human genome that is within 10x coverage of a second coverage for reads of the third plurality of reads.
83. The method of claim 46, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to a corresponding method for the first normalized parameter value.
84. The method of claim 59, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to a corresponding method for the first normalized parameter value.
85. The method of claim 46, further comprising:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
86. The method of claim 59, further comprising:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
87. The method of any of claims 1-86, wherein a depth of coverage of a sequence read of a dataset is less than 5x coverage.
88. The method of any of claims 1-86, wherein a depth of coverage of a sequence read of a dataset is less than 1x coverage.
89. The method of any one of claims 1 to 86, wherein the number of sequence reads in a dataset is less than 50 million.
90. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a computer system to perform the operations of the method of any of claims 1 to 86.
91. A system, comprising:
the computer product of claim 90; and
one or more processors configured to execute instructions stored on the computer-readable medium.
92. A system comprising means for performing the steps of the method of any one of claims 1-86, respectively.
93. A computer readable medium storing a plurality of instructions, wherein the instructions, when executed by a processor, control a computer system to perform the steps comprising the method of any one of claims 1-86.
94. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a computer system to perform a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing the steps of:
identifying a plurality of loci based on sequence information indicating that the woman is homozygous for the first allele at each of the plurality of loci;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first value for a first set of DNA molecules based on the plurality of reads, wherein:
each DNA molecule of the first set of DNA molecules comprises reads at a site of the plurality of sites and exhibits a second allele at the site that is different from the first allele,
the first value defines a property of the first set of DNA molecules,
the property includes a size parameter or an amount of reads, and
a portion of the plurality of loci does not comprise reads that exhibit a second allele different from the first allele;
determining a second value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and displays the first allele at the site, and
the second value defines a property of the second set of DNA molecules;
determining a parameter value for the parameter from the first value and the second value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
95. The computer product of claim 94, wherein:
the first value is a first number of reads located at the plurality of loci and having the second allele,
the second value is a second number of reads located at the plurality of sites and having the first allele, and
the parameter is the non-maternal allele fraction.
96. The computer product of claim 94, wherein determining the first value and the second value further comprises:
measuring the sizes of the first set of DNA molecules and the second set of DNA molecules, wherein the first value is a first size value of a first size distribution of the first set of DNA molecules, and wherein the second value is a second size value of a second size distribution of the second set of DNA molecules.
97. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a computer system to perform a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing the steps of:
identifying a plurality of loci based on sequence information indicating that the woman is homozygous for the first allele at each of the plurality of loci;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first amount of reads, wherein:
each read of the first quantity of reads is located at a site of the plurality of sites,
each read exhibits a second allele at the locus that is different from the first allele, and
a portion of the plurality of loci does not comprise reads that exhibit a second allele different from the first allele;
determining a second quantity of reads at the plurality of sites, wherein:
each read of the second quantity of reads is located at a site of the plurality of sites, and
each read shows the first allele at the site;
determining a non-maternal allele fraction from the first amount and the second amount;
obtaining a calibration point determined using another sample having a known fetal DNA fraction and a measured non-maternal allele fraction; and
calculating the fetal DNA fraction using the calibration points and the non-maternal allele fraction.
98. The computer product of claim 97, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve,
the method further comprises the following steps:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
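The fitting step of claim 98 can be illustrated with an ordinary least-squares line. The sketch below assumes the fetal DNA fraction is modelled as `slope * non_maternal_allele_fraction + intercept`; the function names and the choice of least squares are assumptions, and the numbers in the usage note are invented for illustration only.

```python
from typing import Sequence, Tuple


def fit_calibration_line(
    nmaf: Sequence[float], fetal_fraction: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of fetal_fraction = slope * nmaf + intercept."""
    n = len(nmaf)
    if n != len(fetal_fraction) or n < 2:
        raise ValueError("need at least two paired calibration samples")
    mean_x = sum(nmaf) / n
    mean_y = sum(fetal_fraction) / n
    sxx = sum((x - mean_x) ** 2 for x in nmaf)
    if sxx == 0:
        raise ValueError("calibration samples must differ in allele fraction")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nmaf, fetal_fraction))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_fetal_fraction(nmaf: float, slope: float, intercept: float) -> float:
    """Read a fetal DNA fraction off the fitted calibration line."""
    return slope * nmaf + intercept
```

For example, calling `fit_calibration_line([0.004, 0.006, 0.008], [0.02, 0.04, 0.06])` on these made-up calibration samples gives a line that `estimate_fetal_fraction` can then apply to the non-maternal allele fraction of a test sample.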
99. The computer product of claim 97, wherein determining the second amount of reads at the plurality of sites comprises determining a total number of reads of the plurality of reads, wherein the total number of reads of the plurality of reads is the second amount of reads.
100. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a computer system to perform a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing the steps of:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each site of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining the size of the DNA molecules of the biological sample;
determining a first size value for a first set of DNA molecules, wherein:
each DNA molecule of the first set of DNA molecules comprises a read at a site of the plurality of sites and exhibits a second allele at the site that is different from the first allele, and
the first size value corresponds to a statistic of a first size distribution of the first set of DNA molecules;
determining a second size value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and exhibits the first allele at the site,
the second size value corresponds to a statistic of a second size distribution of the second set of DNA molecules, and
a portion of the plurality of sites does not comprise reads that exhibit a second allele different from the first allele;
determining a parameter value from the first size value and the second size value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter for the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
101. The computer product of claim 100, wherein:
the first size value is a first cumulative frequency of DNA molecules in the first set of DNA molecules having a maximum size, and
the second size value is a second cumulative frequency of DNA molecules in the second set of DNA molecules having the maximum size.
102. The computer product of claim 100, wherein:
the first size value is a median size, a mean size, or a mode of the sizes of the first set of DNA molecules, and
the second size value is a median size, a mean size, or a mode of the sizes of the second set of DNA molecules.
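The size values named in claims 100 to 102 can be sketched as below. The claims do not fix how the two size values are combined into a parameter, so the difference of medians used here is an assumption, as are the function names; a cumulative frequency up to a chosen maximum size is shown as the alternative statistic.

```python
from statistics import median
from typing import Sequence


def cumulative_frequency(sizes: Sequence[int], max_size: int) -> float:
    """Fraction of DNA molecules whose size (bp) does not exceed max_size."""
    if not sizes:
        raise ValueError("no DNA molecules to measure")
    return sum(1 for s in sizes if s <= max_size) / len(sizes)


def size_parameter(
    non_maternal_sizes: Sequence[int], maternal_sizes: Sequence[int]
) -> float:
    """Assumed parameter: median size of the second set minus that of the first.

    non_maternal_sizes: sizes of molecules exhibiting the second allele.
    maternal_sizes: sizes of molecules exhibiting the first (maternal) allele.
    """
    return median(maternal_sizes) - median(non_maternal_sizes)
```

The resulting parameter value would then be compared with a calibration point obtained in the same way from a sample of known fetal DNA fraction.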
103. The computer product of claim 100, wherein:
the calibration point is one of a plurality of calibration points,
the plurality of calibration points constitute a calibration curve, and
the method further comprises:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating parameter values for the plurality of other samples;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
104. The computer product of claim 98, wherein determining the fetal DNA fraction of each other sample in the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a shared allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
105. The computer product of claim 103, wherein determining the fetal DNA fraction of each other sample in the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a shared allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
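Claims 104 and 105 leave open how the third and fourth amounts are combined. One commonly used relation, assumed here rather than taken from the claims, is F = 2p / (p + q), where p is the number of reads showing the fetal-specific allele and q is the number showing the shared allele at loci where the mother is homozygous and the fetus is heterozygous. The factor of two reflects that the fetus carries the fetal-specific allele on only one of its two chromosome copies.

```python
def fetal_fraction_from_informative_loci(
    fetal_specific_reads: int, shared_reads: int
) -> float:
    """Assumed relation F = 2p / (p + q) at mother-homozygous, fetus-heterozygous loci.

    fetal_specific_reads: p, the third amount of reads (fetal-specific allele).
    shared_reads: q, the fourth amount of reads (allele shared by mother and fetus).
    """
    if fetal_specific_reads < 0 or shared_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = fetal_specific_reads + shared_reads
    if total == 0:
        raise ValueError("no reads at the informative loci")
    return 2 * fetal_specific_reads / total
```

With 50 fetal-specific reads and 950 shared reads, this relation yields a fetal DNA fraction of 0.1.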
106. The computer product of claim 94, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
107. The computer product of claim 97, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
108. The computer product of claim 100, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
109. The computer product of claim 94, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
110. The computer product of claim 97, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
111. The computer product of claim 100, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
112. The computer product of claim 94, wherein said second allele is limited to alleles identified in a database as corresponding to biallelic loci.
113. The computer product of claim 97, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
114. The computer product of claim 100, wherein said second allele is limited to alleles identified in a database as corresponding to biallelic loci.
115. The computer product of claim 94, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
116. The computer product of claim 97, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
117. The computer product of claim 100, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
118. The computer product of claim 94, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
119. The computer product of claim 97, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
120. The computer product of claim 100, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
121. The computer product of claim 94, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
122. The computer product of claim 97, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
123. The computer product of claim 100, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
124. The computer product of claim 94, wherein:
the plurality of sites is a first plurality of sites, and
the method further comprises:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
125. The computer product of claim 97, wherein:
the plurality of sites is a first plurality of sites, and
the method further comprises:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
126. The computer product of claim 100, wherein:
the plurality of sites is a first plurality of sites, and
the method further comprises:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
127. The computer product of claim 94, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
128. The computer product of claim 97, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
129. The computer product of claim 100, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
130. The computer product of claim 94, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
131. The computer product of claim 97, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
132. The computer product of claim 100, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
133. The computer product of claim 94, wherein the plurality of reads comprises less than or equal to 50 million reads.
134. The computer product of claim 97, wherein the plurality of reads comprises less than or equal to 50 million reads.
135. The computer product of claim 100, wherein the plurality of reads comprises less than or equal to 50 million reads.
136. The computer product of claim 94, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
137. The computer product of claim 97, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
138. The computer product of claim 100, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
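The read-count and coverage limits in claims 133 to 138 are related by simple arithmetic: coverage is roughly the number of reads times the read length divided by the genome size. The sketch below assumes a haploid human genome of about 3.1 billion base pairs and single-end reads of uniform length; both figures and the function name are assumptions for illustration.

```python
HAPLOID_GENOME_BP = 3_100_000_000  # approximate haploid human genome size (assumption)


def haploid_coverage(
    n_reads: int, read_length_bp: int, genome_bp: int = HAPLOID_GENOME_BP
) -> float:
    """Approximate fold coverage: total sequenced bases divided by genome size."""
    if genome_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_reads * read_length_bp / genome_bp
```

Under these assumptions, 50 million reads of 50 bp each give about 0.8x coverage, which stays within the 1x limit.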
139. A computer product comprising a computer-readable medium storing a plurality of instructions for controlling a computer system to perform operations of a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set comprising a first plurality of reads of DNA molecules from a first sample from a female pregnant with the fetus;
identifying a location of the first plurality of reads in a reference genome;
identifying a first set of loci based on the first dataset and the identified locations, wherein no loci in the first set of loci display more than one allele;
determining a first amount of loci in the first set of loci;
receiving a second data set comprising a second plurality of reads of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
identifying a second set of loci based on the second data set and the identified locations, wherein:
each of the second set of loci is one of the first set of loci,
each locus in the second set of loci displays an allele that is different from the allele displayed in the first set of loci, and
a portion of the first set of loci does not comprise reads in the second plurality of reads that display alleles different from the alleles displayed in the first plurality of reads;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value from the first amount and the second amount;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
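The locus-counting variant of claim 139 can be sketched as follows. The data layout (pairs of a locus key and an observed base), the function names, and the choice of the second amount divided by the first amount as the normalized parameter are assumptions made for illustration; the sketch is not the claimed implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Locus = Tuple[str, int]  # hypothetical locus key: (chromosome, position)


def alleles_by_locus(reads: Iterable[Tuple[Locus, str]]) -> Dict[Locus, Set[str]]:
    """Collect the set of alleles observed at each locus."""
    seen: Dict[Locus, Set[str]] = defaultdict(set)
    for locus, base in reads:
        seen[locus].add(base)
    return dict(seen)


def normalized_locus_parameter(
    first_reads: Iterable[Tuple[Locus, str]],
    second_reads: Iterable[Tuple[Locus, str]],
) -> float:
    """Return (loci showing a new allele in data set 2) / (single-allele loci in data set 1)."""
    first = alleles_by_locus(first_reads)
    second = alleles_by_locus(second_reads)
    # First set: loci at which the first data set displays exactly one allele.
    first_set = {loc: next(iter(a)) for loc, a in first.items() if len(a) == 1}
    if not first_set:
        raise ValueError("no single-allele loci in the first data set")
    # Second set: loci of the first set at which the second data set
    # displays an allele different from the one seen in the first data set.
    second_set = {
        loc for loc, allele in first_set.items()
        if second.get(loc, set()) - {allele}
    }
    return len(second_set) / len(first_set)
```

Loci of the first set that are not covered in the second data set, or that show only the same allele there, stay out of the second set, so the parameter is always between 0 and 1.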
140. The computer product of claim 139, wherein said first sample is said biological sample.
141. The computer product of claim 139, wherein the first sample is not the biological sample, and wherein the first sample does not comprise fetal DNA molecules.
142. The computer product of claim 139, wherein the alleles that differ from the alleles displayed in the first set of loci are limited to alleles identified in a database as corresponding to biallelic loci.
143. The computer product of claim 139, wherein said first amount is the number of loci and said second amount is the number of loci in said second data set.
144. The computer product of claim 139, wherein determining the first amount comprises determining the first amount in the second data set.
145. The computer product of claim 139, wherein said method further comprises:
measuring the size of a first plurality of DNA molecules corresponding to the first plurality of reads, and
measuring the size of a second plurality of DNA molecules corresponding to the second plurality of reads,
wherein:
the first plurality of DNA molecules has a first size value,
the first size value corresponds to a statistic of a first size distribution of the first plurality of DNA molecules,
the second plurality of DNA molecules has a second size value,
the second size value corresponds to a statistic of a second size distribution of the second plurality of DNA molecules, and
the first size value is greater than the second size value by a minimum difference.
146. The computer product of claim 139, wherein:
a first plurality of DNA molecules comprising reads at the first set of loci are larger than a first size,
a second plurality of DNA molecules comprising reads at the second set of loci are smaller than a second size, and
the difference between the first size and the second size is greater than a minimum difference.
147. The computer product of claim 145, wherein the minimum difference is 5 bp.
148. The computer product of claim 146, wherein the minimum difference is 5 bp.
149. The computer product of claim 139, wherein said female is pregnant with a plurality of fetuses, the method further comprising:
comparing the fetal DNA fraction to a cutoff value, and:
classifying the plurality of fetuses as monozygotic if the calculated fetal DNA fraction is below the cutoff value, or
classifying the plurality of fetuses as dizygotic if the calculated fetal DNA fraction is above the cutoff value.
150. The computer product of claim 149, wherein:
the fetal DNA fraction is the first fetal DNA fraction,
the cutoff value is determined as a value greater than a second fetal DNA fraction of the biological sample, and
the second fetal DNA fraction is determined without using the normalized parameter value.
151. The computer product of claim 150, wherein the second fetal DNA fraction is estimated using a size profile of DNA molecules in the biological sample.
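The zygosity classification of claims 149 to 151 reduces to a threshold comparison. In the sketch below the cutoff is set above an independently estimated fetal DNA fraction by a multiplicative `margin`; that margin, its default value, and the "indeterminate" result for a fraction exactly equal to the cutoff are assumptions, since the claims only require the cutoff to be greater than the second fetal DNA fraction.

```python
def classify_zygosity(
    allele_based_fraction: float,
    independent_fraction: float,
    margin: float = 1.5,  # hypothetical factor placing the cutoff above the reference
) -> str:
    """Classify a multiple pregnancy by comparing two fetal DNA fraction estimates.

    allele_based_fraction: first fetal DNA fraction, from the normalized parameter.
    independent_fraction: second fetal DNA fraction, estimated another way
        (for example from a size profile of the DNA molecules).
    """
    if margin <= 1.0:
        raise ValueError("margin must place the cutoff above the reference fraction")
    cutoff = independent_fraction * margin
    if allele_based_fraction < cutoff:
        return "monozygotic"
    if allele_based_fraction > cutoff:
        return "dizygotic"
    return "indeterminate"  # equality is not addressed by the claims
```

The intuition is that dizygotic fetuses together contribute more distinct non-maternal alleles, which raises the allele-based estimate relative to an estimate that does not depend on those alleles.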
152. A computer product comprising a computer-readable medium storing a plurality of instructions for controlling a computer system to perform operations of a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set from a first plurality of reads of a first plurality of DNA molecules;
identifying a location of the first plurality of reads in a reference genome;
determining the size of the DNA molecules corresponding to the first plurality of reads;
identifying a first set of loci in the first data set, wherein DNA molecules comprising reads at each locus of the first set of loci have a first size distribution and have a first size value of the first size distribution that exceeds a first size threshold;
determining a first amount of loci in the first set of loci;
receiving a second data set of a second plurality of reads of a second plurality of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
determining the size of the DNA molecules corresponding to the second plurality of reads;
identifying a second set of loci in the second data set, wherein:
each of the second set of loci is one of the first set of loci, and
DNA molecules comprising reads at each locus of the second set of loci have a second size distribution and have a second size value of the second size distribution that exceeds a second size threshold in a direction opposite to that in which the first size value exceeds the first size threshold;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value from the first amount and the second amount;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
153. The computer product of claim 152, wherein the first size value is greater than the first size threshold and the second size value is less than the second size threshold, and wherein the second size threshold is less than the first size threshold.
154. The computer product of claim 152, wherein the first size value is less than the first size threshold and the second size value is greater than the second size threshold, and wherein the second size threshold is greater than the first size threshold.
155. The computer product of claim 152, wherein the difference between the first size value and the second size value is 10 bp.
156. The computer product of claim 139, wherein said first data set comprises reads of DNA molecules from said biological sample.
157. The computer product of claim 152, wherein said first data set comprises reads of DNA molecules from said biological sample.
158. The computer product of claim 139, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
159. The computer product of claim 152, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
160. The computer product of claim 139, wherein said method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
161. The computer product of claim 152, wherein said method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
162. The computer product of claim 139, wherein said method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
163. The computer product of claim 152, wherein said method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
164. The computer product of claim 139, wherein the second plurality of reads comprises less than or equal to 50 million reads.
165. The computer product of claim 152, wherein the second plurality of reads comprises less than or equal to 50 million reads.
166. The computer product of claim 139, wherein said second plurality of reads is at or less than 1x coverage of a haploid human genome.
167. The computer product of claim 152, wherein said second plurality of reads is at or less than 1x coverage of a haploid human genome.
168. The computer product of claim 139, wherein the normalized parameter value comprises the second amount divided by the first amount, or wherein the normalized parameter value comprises the second amount divided by a sum of the first amount and the second amount.
169. The computer product of claim 152, wherein the normalized parameter value comprises the second amount divided by the first amount, or wherein the normalized parameter value comprises the second amount divided by a sum of the first amount and the second amount.
170. The computer product of claim 139, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
171. The computer product of claim 152, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
172. The computer product of claim 170, wherein the calibration curve is linear.
173. The computer product of claim 171, wherein the calibration curve is linear.
174. The computer product of claim 170, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
the calibration point is determined using the known fetal DNA fraction and a second normalized parameter value, the second normalized parameter value being determined from a set of loci in a data set of a third plurality of reads of DNA molecules in a second biological sample according to the method corresponding to that used for the first normalized parameter value, and
the second plurality of reads provides a first coverage of the haploid human genome that is within 10x coverage of a second coverage provided by the third plurality of reads.
175. The computer product of claim 171, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
the calibration point is determined using the known fetal DNA fraction and a second normalized parameter value, the second normalized parameter value being determined from a set of loci in a data set of a third plurality of reads of DNA molecules in a second biological sample according to the method corresponding to that used for the first normalized parameter value, and
the second plurality of reads provides a first coverage of the haploid human genome that is within 10x coverage of a second coverage provided by the third plurality of reads.
176. The computer product of claim 139, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to the method corresponding to that used for the first normalized parameter value.
177. The computer product of claim 152, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to the method corresponding to that used for the first normalized parameter value.
178. The computer product of claim 139, wherein said method further comprises:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
179. The computer product of claim 152, wherein the method further comprises:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
180. The computer product of any of claims 94 to 179, wherein a depth of coverage of a sequence read of a dataset is less than 5x coverage.
181. The computer product of any of claims 94 to 179, wherein a depth of coverage of a sequence read of a dataset is less than 1x coverage.
182. The computer product of any one of claims 94 to 179, wherein the number of sequence reads in a dataset is less than 50 million.
183. A system comprising modules for respectively performing the steps of a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing the steps of:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each site of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first value for a first set of DNA molecules based on the plurality of reads, wherein:
each DNA molecule of the first set of DNA molecules comprises a read at a site of the plurality of sites and exhibits a second allele at the site that is different from the first allele,
the first value defines a characteristic of the first set of DNA molecules,
the characteristic comprises a size parameter or an amount of reads, and
a portion of the plurality of sites does not comprise reads that exhibit a second allele different from the first allele;
determining a second value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and exhibits the first allele at the site, and
the second value defines a characteristic of the second set of DNA molecules;
determining a parameter value of a parameter from the first value and the second value;
comparing the parameter value to calibration points determined using at least one other sample having a known fetal DNA fraction and calibration values corresponding to individual measurements of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
184. The system of claim 183, wherein:
the first value is a first number of reads located at the plurality of sites and having the second allele,
the second value is a second number of reads located at the plurality of sites and having the first allele, and
the parameter is a non-maternal allele fraction.
185. The system of claim 183, wherein determining the first value and the second value further comprises:
measuring the sizes of the first set of DNA molecules and the second set of DNA molecules, wherein the first value is a first size value of a first size distribution of the first set of DNA molecules, and wherein the second value is a second size value of a second size distribution of the second set of DNA molecules.
186. A system comprising modules for respectively performing the steps of a method for measuring a fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising performing the steps of:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each site of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first amount of reads, wherein:
each read of the first amount of reads is located at a site of the plurality of sites,
each read of the first amount of reads exhibits a second allele at the site that is different from the first allele, and
a portion of the plurality of sites does not comprise reads that exhibit a second allele different from the first allele;
determining a second amount of reads at the plurality of sites, wherein:
each read of the second amount of reads is located at a site of the plurality of sites, and
each read of the second amount of reads exhibits the first allele at the site;
determining a non-maternal allele fraction from the first amount and the second amount;
obtaining a calibration point determined using another sample having a known fetal DNA fraction and a measured non-maternal allele fraction; and
calculating the fetal DNA fraction using the calibration point and the non-maternal allele fraction.
187. The system of claim 186, wherein:
the calibration point is one of a plurality of calibration points,
the plurality of calibration points constitute a calibration curve, and
the method further comprises:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
188. The system of claim 186, wherein determining the second amount of reads at the plurality of sites comprises determining a total number of reads of the plurality of reads, wherein the total number of reads of the plurality of reads is the second amount of reads.
189. A system comprising modules for respectively performing the steps of a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining the size of the DNA molecules of the biological sample;
determining a first size value for a first set of DNA molecules, wherein:
each DNA molecule of the first set of DNA molecules comprises a read at a site of the plurality of sites and exhibits, at the site, a second allele different from the first allele, and
the first size value corresponds to a statistic of a first size distribution of the first set of DNA molecules;
determining a second size value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and exhibits the first allele at the site,
the second size value corresponds to a statistic of a second size distribution of the second set of DNA molecules, and
a portion of the plurality of sites does not comprise reads that exhibit a second allele different from the first allele;
determining a parameter value from the first size value and the second size value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter for the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
190. The system of claim 189, wherein:
the first size value is a first cumulative frequency of DNA molecules having a maximum size in the first set of DNA molecules, and
the second size value is a second cumulative frequency of DNA molecules having a maximum size in the second set of DNA molecules.
191. The system of claim 189, wherein:
the first size value is a median size, a mean size, or a mode of the sizes of the first set of DNA molecules, and
the second size value is a median size, a mean size, or a mode of the sizes of the second set of DNA molecules.
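The size statistics named in the two preceding claims can be computed with Python's standard library. The sketch below is only an illustration: the function names are hypothetical, and reading "cumulative frequency" as the proportion of molecules at or below a chosen maximum size is an assumption.

```python
import statistics
from typing import Sequence

def size_value(sizes_bp: Sequence[int], statistic: str = "median") -> float:
    """Summarize a size distribution by its median, mean, or mode (in bp)."""
    if not sizes_bp:
        raise ValueError("no DNA molecule sizes supplied")
    if statistic == "median":
        return statistics.median(sizes_bp)
    if statistic == "mean":
        return statistics.mean(sizes_bp)
    if statistic == "mode":
        return statistics.mode(sizes_bp)
    raise ValueError(f"unsupported statistic: {statistic}")

def cumulative_frequency(sizes_bp: Sequence[int], max_size_bp: int) -> float:
    """Proportion of molecules whose size is at or below max_size_bp."""
    if not sizes_bp:
        raise ValueError("no DNA molecule sizes supplied")
    return sum(1 for size in sizes_bp if size <= max_size_bp) / len(sizes_bp)
```

A parameter value could then be formed, for example, as the difference between the size values of the two sets of molecules; the specific combination is left open here.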
192. The system of claim 189, wherein:
the calibration point is one of a plurality of calibration points,
the plurality of calibration points constitute a calibration curve, and
the method further comprises:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating parameter values for the plurality of other samples;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
193. The system of claim 187, wherein determining the fetal DNA fraction of each other sample of the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a shared allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
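The claim does not state how the third and fourth amounts are combined. A commonly used estimate at loci where the mother is homozygous and the fetus is heterozygous is f = 2p / (p + q), where p counts reads with the fetal-specific allele and q counts reads with the allele shared by mother and fetus. The sketch below assumes that formula; the function name is hypothetical.

```python
def fetal_fraction_from_informative_loci(fetal_specific_reads: int, shared_reads: int) -> float:
    """Assumed formula: f = 2p / (p + q).

    The factor of 2 reflects that a heterozygous fetus contributes the
    fetal-specific allele from only one of its two chromosome copies.
    """
    total = fetal_specific_reads + shared_reads
    if total == 0:
        raise ValueError("no reads at the informative loci")
    return 2 * fetal_specific_reads / total
```

For example, 50 fetal-specific reads and 950 shared reads would give an estimated fetal DNA fraction of 0.1 under this assumption.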
194. The system of claim 192, wherein determining the fetal DNA fraction of each other sample of the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a shared allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
195. The system of claim 183, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
196. The system of claim 186, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
197. The system of claim 189, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
198. The system of claim 183, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
199. The system of claim 186, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
200. The system of claim 189, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
201. The system of claim 183, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
202. The system of claim 186, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
203. The system of claim 189, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
204. The system of claim 183, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
205. The system of claim 186, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
206. The system of claim 189, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
207. The system of claim 183, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
208. The system of claim 186, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
209. The system of claim 189, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
210. The system of claim 183, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
211. The system of claim 186, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
212. The system of claim 189, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
213. The system of claim 183, wherein:
the plurality of sites is a first plurality of sites, and
the method further comprises:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
214. The system of claim 186, wherein:
the plurality of sites is a first plurality of sites, and
the method further comprises:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
215. The system of claim 189, wherein:
the plurality of sites is a first plurality of sites, and
the method further comprises:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
216. The system of claim 183, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
217. The system of claim 186, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
218. The system of claim 189, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
219. The system of claim 183, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
220. The system of claim 186, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
221. The system of claim 189, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
222. The system of claim 183, wherein the plurality of reads comprises less than or equal to 50 million reads.
223. The system of claim 186, wherein the plurality of reads comprises less than or equal to 50 million reads.
224. The system of claim 189, wherein the plurality of reads comprises less than or equal to 50 million reads.
225. The system of claim 183, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
226. The system of claim 186, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
227. The system of claim 189, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
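To relate a read count to coverage of a haploid human genome, a rough estimate is the number of reads multiplied by the read length and divided by the genome size. The sketch below is illustrative only: the function name is hypothetical, the default genome size of about 3.1 billion base pairs is an approximation, and the estimate ignores unmapped, duplicate, and overlapping reads.

```python
def haploid_coverage(
    read_count: int,
    read_length_bp: int,
    genome_size_bp: int = 3_100_000_000,  # approximate haploid human genome size (assumption)
) -> float:
    """Rough sequencing depth: total sequenced bases divided by genome size."""
    if read_count < 0 or read_length_bp <= 0 or genome_size_bp <= 0:
        raise ValueError("counts and lengths must be positive")
    return read_count * read_length_bp / genome_size_bp
```

Under these assumptions, 50 million reads of 62 bp each correspond to about 1x coverage, and shorter reads or fewer reads give proportionally less.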
228. A system comprising modules for respectively performing the steps of a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set comprising a first plurality of reads of DNA molecules from a first sample from the female pregnant with the fetus;
identifying a location of the first plurality of reads in a reference genome;
identifying a first set of loci based on the first dataset and the identified locations, wherein no loci in the first set of loci display more than one allele;
determining a first amount of loci in the first set of loci;
receiving a second data set comprising a second plurality of reads of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
identifying a second set of loci based on the second data set and the identified locations, wherein:
each of the second set of loci is one of the first set of loci,
each locus in the second set of loci displays an allele that is different from the allele displayed in the first set of loci, and
a portion of the first set of loci does not comprise reads in the second plurality of reads that display alleles different from the alleles displayed in the first plurality of reads;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value from the first amount and the second amount;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
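The locus-counting steps of the claim above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: reads are represented as hypothetical `(locus, allele)` pairs that have already been aligned, and the function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Read = Tuple[Hashable, str]  # (locus, allele shown by the read); hypothetical representation

def alleles_by_locus(reads: Iterable[Read]) -> Dict[Hashable, Set[str]]:
    """Collect the set of alleles observed at each locus."""
    observed: Dict[Hashable, Set[str]] = defaultdict(set)
    for locus, allele in reads:
        observed[locus].add(allele)
    return observed

def count_loci_sets(first_reads: Iterable[Read], second_reads: Iterable[Read]) -> Tuple[int, int]:
    """Return (first amount, second amount) of loci.

    First set: loci showing exactly one allele in the first data set.
    Second set: loci of the first set at which the second data set shows
    an allele different from the one seen in the first data set.
    """
    first_observed = alleles_by_locus(first_reads)
    first_set = {
        locus: next(iter(alleles))
        for locus, alleles in first_observed.items()
        if len(alleles) == 1
    }
    second_observed = alleles_by_locus(second_reads)
    second_set = {
        locus
        for locus, allele in first_set.items()
        if second_observed.get(locus, set()) - {allele}
    }
    return len(first_set), len(second_set)
```

Loci of the first set that are not covered by any read of the second data set, or that show only the same allele, simply do not enter the second set.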
229. The system of claim 228, wherein the first sample is the biological sample.
230. The system of claim 228, wherein the first sample is not the biological sample, and wherein the first sample does not comprise fetal DNA molecules.
231. The system of claim 228, wherein the alleles that differ from the alleles displayed in the first set of loci are limited to alleles identified in a database as corresponding to biallelic loci.
232. The system of claim 228, wherein the first amount is a number of loci and the second amount is a number of loci in the second data set.
233. The system of claim 228, wherein determining the first amount comprises determining the first amount in the second data set.
234. The system of claim 228, wherein the method further comprises:
measuring sizes of a first plurality of DNA molecules corresponding to the first plurality of reads, and
measuring sizes of a second plurality of DNA molecules corresponding to the second plurality of reads,
wherein:
the first plurality of DNA molecules has a first size value,
the first size value corresponds to a statistic of a first size distribution of the first plurality of DNA molecules,
the second plurality of DNA molecules has a second size value,
the second size value corresponds to a statistic of a second size distribution of the second plurality of DNA molecules, and
the first size value is greater than the second size value by a minimum difference.
235. The system of claim 228, wherein:
a first plurality of DNA molecules comprising reads at the first set of loci are larger than a first size,
a second plurality of DNA molecules comprising reads at the second set of loci are smaller than a second size, and
the difference between the first size and the second size is greater than a minimum difference.
236. The system of claim 234, wherein the minimum difference is 5 bp.
237. The system of claim 235, wherein the minimum difference is 5 bp.
238. The system of claim 228, wherein the female is pregnant with a plurality of fetuses, the method further comprising:
comparing the fetal DNA fraction to a cutoff value, and:
classifying the plurality of fetuses as monozygotic if the calculated fetal DNA fraction is below the cutoff value, or
classifying the plurality of fetuses as dizygotic if the calculated fetal DNA fraction is above the cutoff value.
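The cutoff comparison in this claim can be sketched as a simple threshold test. The function name is hypothetical, and because the claim does not say what happens when the fraction equals the cutoff, the sketch reports that case as undetermined, which is an assumption.

```python
def classify_zygosity(fetal_fraction: float, cutoff: float) -> str:
    """Classify a multiple pregnancy from the apparent fetal DNA fraction.

    Below the cutoff: monozygotic. Above the cutoff: dizygotic.
    Exactly at the cutoff: undetermined (behavior assumed, not from the claim).
    """
    if fetal_fraction < cutoff:
        return "monozygotic"
    if fetal_fraction > cutoff:
        return "dizygotic"
    return "undetermined"
```

In practice the cutoff would be chosen relative to an independent estimate of the fetal DNA fraction for the same sample.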
239. The system of claim 238, wherein:
the fetal DNA fraction is a first fetal DNA fraction,
the cutoff value is determined as a value greater than a second fetal DNA fraction of the biological sample, and
the second fetal DNA fraction is determined without using the normalized parameter value.
240. The system of claim 239, wherein the second fetal DNA fraction is estimated using a size profile of DNA molecules in the biological sample.
241. A system comprising modules for respectively performing the steps of a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample comprising maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set comprising a first plurality of reads of a first plurality of DNA molecules;
identifying a location of the first plurality of reads in a reference genome;
determining the size of the DNA molecules corresponding to the first plurality of reads;
identifying a first set of loci in the first data set, wherein the DNA molecules comprising reads at each locus of the first set of loci have a first size distribution and have a first size value of the first size distribution that exceeds a first size threshold;
determining a first amount of loci in the first set of loci;
receiving a second data set of a second plurality of reads of a second plurality of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
determining the size of the DNA molecules corresponding to the second plurality of reads;
identifying a second set of loci in the second data set, wherein:
each of the second set of loci is one of the first set of loci, and
the DNA molecules comprising reads at each locus of the second set of loci have a second size distribution and have a second size value of the second size distribution that exceeds a second size threshold in a direction opposite to that in which the first size value exceeds the first size threshold;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value from the first amount and the second amount;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
242. The system of claim 241, wherein the first size value is greater than the first size threshold and the second size value is less than the second size threshold, and wherein the second size threshold is less than the first size threshold.
243. The system of claim 241, wherein the first size value is less than the first size threshold and the second size value is greater than the second size threshold, and wherein the second size threshold is greater than the first size threshold.
244. The system of claim 241, wherein a difference between the first size value and the second size value is 10 bp.
245. The system of claim 228, wherein the first data set comprises reads of DNA molecules from the biological sample.
246. The system of claim 241, wherein the first dataset includes reads of DNA molecules from the biological sample.
247. The system of claim 228, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
248. The system of claim 241, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
249. The system of claim 228, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
250. The system of claim 241, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
251. The system of claim 228, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
252. The system of claim 241, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
253. The system of claim 228, wherein the second plurality of reads comprises less than or equal to 50 million reads.
254. The system of claim 241, wherein the second plurality of reads comprises less than or equal to 50 million reads.
255. The system of claim 228, wherein the second plurality of reads is at or less than 1x coverage of a haploid human genome.
256. The system of claim 241, wherein the second plurality of reads is at or less than 1x coverage of a haploid human genome.
257. The system of claim 228, wherein the normalized parameter value comprises the second amount divided by the first amount, or wherein the normalized parameter value comprises the second amount divided by a sum of the first amount and the second amount.
258. The system of claim 241, wherein the normalized parameter value comprises the second amount divided by the first amount, or wherein the normalized parameter value comprises the second amount divided by a sum of the first amount and the second amount.
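The two normalizations recited in the preceding claims can be written out directly. The function name and the `form` argument are hypothetical and serve only to show the two alternatives side by side.

```python
def normalized_parameter(first_amount: int, second_amount: int, form: str = "ratio") -> float:
    """Normalize the second amount of loci by the first amount.

    form="ratio":      second / first
    form="proportion": second / (first + second)
    """
    if form == "ratio":
        denominator = first_amount
    elif form == "proportion":
        denominator = first_amount + second_amount
    else:
        raise ValueError(f"unsupported form: {form}")
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return second_amount / denominator
```

Whichever form is used for the test sample would also need to be used for the calibration samples so that the comparison with the calibration point is like for like.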
259. The system of claim 228, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
260. The system of claim 241, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
261. The system of claim 259, wherein the calibration curve is linear.
262. The system of claim 260, wherein the calibration curve is linear.
263. The system of claim 259, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
the calibration point is determined using the known fetal DNA fraction and a second normalized parameter value determined from a set of loci in a data set of a third plurality of reads of DNA molecules in a second biological sample, according to the method used for the first normalized parameter value, and
the second plurality of reads comprises a first coverage of haploid human genome within 10x coverage of a second coverage for reads of the second plurality of reads.
264. The system of claim 260, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
the calibration point is determined using the known fetal DNA fraction and a second normalized parameter value determined from a set of loci in a data set of a third plurality of reads of DNA molecules in a second biological sample, according to the method used for the first normalized parameter value, and
the second plurality of reads comprises a first coverage of haploid human genome within 10x coverage of a second coverage for reads of the second plurality of reads.
265. The system of claim 228, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to a corresponding method for the first normalized parameter value.
266. The system of claim 241, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to a corresponding method for the first normalized parameter value.
267. The system of claim 228, wherein the method further comprises:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
268. The system of claim 241, wherein the method further comprises:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
269. The system of any one of claims 183-268, wherein a depth of coverage of a sequence read of a dataset is less than 5x coverage.
270. The system of any one of claims 183-268, wherein a depth of coverage of a sequence read of a dataset is less than 1x coverage.
271. The system of any one of claims 183-268, wherein the number of sequence reads in a dataset is less than 50 million.
272. A computer readable medium storing a plurality of instructions, wherein the instructions, when executed by a processor, control a computer system to implement the steps of a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample including maternal DNA molecules and fetal DNA molecules, the method comprising:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first value for a first set of DNA molecules based on the plurality of reads, wherein:
each DNA molecule of the first set of DNA molecules comprises a read at a site of the plurality of sites and exhibits, at the site, a second allele different from the first allele,
the first value defines a property of the first set of DNA molecules,
the property includes a size parameter or an amount of reads, and
a portion of the plurality of sites does not comprise reads that exhibit a second allele different from the first allele;
determining a second value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and exhibits the first allele at the site, and
the second value defines a property of the second set of DNA molecules;
determining a parameter value of a parameter from the first value and the second value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
273. The computer readable medium of claim 272, wherein:
the first value is a first number of reads located at the plurality of sites and having the second allele,
the second value is a second number of reads located at the plurality of sites and having the first allele, and
the parameter is the non-maternal allele fraction.
274. The computer readable medium of claim 272, wherein determining the first value and the second value further comprises:
measuring the sizes of the first set of DNA molecules and the second set of DNA molecules, wherein the first value is a first size value of a first size distribution of the first set of DNA molecules, and wherein the second value is a second size value of a second size distribution of the second set of DNA molecules.
275. A computer readable medium storing a plurality of instructions that when executed by a processor control a computer system to implement the steps included in a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample including maternal DNA molecules and fetal DNA molecules, the method comprising performing the steps of:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining a first amount of reads, wherein:
each read of the first amount of reads is located at a site of the plurality of sites,
each read of the first amount of reads exhibits, at the site, a second allele different from the first allele, and
a portion of the plurality of sites does not comprise reads that exhibit a second allele different from the first allele;
determining a second amount of reads at the plurality of sites, wherein:
each read of the second amount of reads is located at a site of the plurality of sites, and
each read of the second amount of reads exhibits the first allele at the site;
determining a non-maternal allele fraction from the first amount and the second amount;
obtaining a calibration point determined using another sample having a known fetal DNA fraction and a measured non-maternal allele fraction; and
calculating the fetal DNA fraction using the calibration point and the non-maternal allele fraction.
276. The computer readable medium of claim 275, wherein:
the calibration point is one of a plurality of calibration points,
the plurality of calibration points constitute a calibration curve, and
the method further comprises:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
277. The computer readable medium of claim 275, wherein determining the second amount of reads at the plurality of sites comprises determining a total number of reads of the plurality of reads, wherein the total number of reads of the plurality of reads is the second amount of reads.
278. A computer readable medium storing a plurality of instructions, wherein the instructions, when executed by a processor, control a computer system to implement the steps of a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample including maternal DNA molecules and fetal DNA molecules, the method comprising:
identifying a plurality of sites based on sequence information indicating that the female is homozygous for a first allele at each of the plurality of sites;
obtaining a plurality of reads from the DNA molecules of the biological sample;
identifying the location of the plurality of reads in a reference genome;
determining the size of the DNA molecules of the biological sample;
determining a first size value for a first set of DNA molecules, wherein:
each DNA molecule of the first set of DNA molecules comprises a read at a site of the plurality of sites and displays a second allele, different from the first allele, at the site, and
the first size value corresponds to a statistic of a first size distribution of the first set of DNA molecules;
determining a second size value for a second set of DNA molecules, wherein:
each DNA molecule of the second set of DNA molecules comprises a read at a site of the plurality of sites and displays the first allele at the site,
the second size value corresponds to a statistic of a second size distribution of the second set of DNA molecules, and
a portion of the plurality of loci does not comprise reads that exhibit a second allele different from the first allele;
determining a parameter value from the first size value and the second size value;
comparing the parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measured value of the parameter for the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
279. The computer readable medium of claim 278, wherein:
the first size value is a first cumulative frequency of DNA molecules having a largest size among the first set of DNA molecules, and
the second size value is a second cumulative frequency of DNA molecules having the largest size among the second set of DNA molecules.
280. The computer readable medium of claim 278, wherein:
the first size value is a median size, a mean size, or a mode of the sizes of the first set of DNA molecules, and
the second size value is a median size, a mean size, or a mode of the sizes of the second set of DNA molecules.
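The size values of claims 278 and 280 can be sketched as below. The claims do not fix how the parameter value is derived from the two size values; the difference used here is an assumption for illustration, as are the function names.

```python
import statistics


def size_value(sizes_bp, statistic="median"):
    """Summary statistic of a fragment-size distribution (sizes in bp)."""
    if not sizes_bp:
        raise ValueError("sizes_bp must not be empty")
    if statistic == "median":
        return statistics.median(sizes_bp)
    if statistic == "mean":
        return statistics.mean(sizes_bp)
    if statistic == "mode":
        return statistics.mode(sizes_bp)
    raise ValueError("statistic must be 'median', 'mean', or 'mode'")


def size_parameter(second_allele_sizes, first_allele_sizes, statistic="median"):
    """Assumed parameter: size value of molecules showing the first (maternal)
    allele minus size value of molecules showing the second allele."""
    first_value = size_value(second_allele_sizes, statistic)
    second_value = size_value(first_allele_sizes, statistic)
    return second_value - first_value
```

The resulting parameter value would then be compared with calibration points from samples of known fetal DNA fraction, as recited in claim 278.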
281. The computer readable medium of claim 278, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve,
the method further comprises the following steps:
calculating the calibration curve by:
determining fetal DNA fractions of a plurality of other samples from a plurality of pregnant women;
calculating parameter values for the plurality of other samples;
calculating a non-maternal allele fraction for the plurality of other samples; and
fitting the fetal DNA fraction and the non-maternal allele fraction to a linear function, wherein the linear function describes the calibration curve.
282. The computer readable medium of claim 276, wherein determining the fetal DNA fraction of each other sample in the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a shared allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
283. The computer readable medium of claim 281, wherein determining the fetal DNA fraction of each other sample in the plurality of other samples comprises:
identifying a second plurality of loci, wherein at each locus of the second plurality of loci a pregnant woman of the plurality of pregnant women is homozygous and the fetus of the pregnant woman is heterozygous;
obtaining a plurality of reads from the DNA molecules of the other sample;
determining a third amount of reads that show a fetal-specific allele at the second plurality of loci;
determining a fourth amount of reads that show a shared allele at the second plurality of loci; and
determining the fetal DNA fraction using the third amount and the fourth amount.
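For the calibration samples of claims 282 and 283, one way to combine the third and fourth amounts is sketched below. The claims do not state the formula; the expression 2p / (p + q) is an assumption based on the fact that, at loci where the mother is homozygous and the fetus heterozygous, only half of the fetal molecules carry the fetal-specific allele.

```python
def fetal_fraction_from_allele_counts(fetal_specific_reads, shared_reads):
    """Fetal DNA fraction from read counts at loci where the mother is
    homozygous and the fetus is heterozygous.

    fetal_specific_reads: reads showing the fetal-specific allele (third amount)
    shared_reads: reads showing the shared allele (fourth amount)
    Assumed formula: f = 2p / (p + q).
    """
    total = fetal_specific_reads + shared_reads
    if total <= 0:
        raise ValueError("at least one read is required")
    return 2 * fetal_specific_reads / total
```

For example, 50 fetal-specific reads and 950 shared reads give a fetal DNA fraction of 0.1 under this formula.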
284. The computer readable medium of claim 272, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
285. The computer readable medium of claim 275, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
286. The computer readable medium of claim 278, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample,
identifying the plurality of sites comprises identifying the plurality of sites from a second plurality of reads of DNA molecules from a second biological sample, and
the second biological sample does not comprise fetal DNA molecules.
287. The computer readable medium of claim 272, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
288. The computer readable medium of claim 275, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of loci, wherein the second plurality of loci are loci identified in a database as corresponding to biallelic loci.
289. The computer readable medium of claim 278, wherein:
the plurality of sites is a first plurality of sites, and
the plurality of reads is limited to reads from a second plurality of sites, wherein the second plurality of sites are sites identified in a database as corresponding to biallelic sites.
290. The computer-readable medium of claim 272, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
291. The computer-readable medium of claim 275, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
292. The computer-readable medium of claim 278, wherein the second allele is limited to alleles identified in a database as corresponding to biallelic loci.
293. The computer readable medium of claim 272, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
294. The computer readable medium of claim 275, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
295. The computer readable medium of claim 278, wherein identifying the plurality of sites comprises genotyping a cell sample from the female.
296. The computer readable medium of claim 272, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
297. The computer readable medium of claim 275, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
298. The computer readable medium of claim 278, wherein:
the plurality of reads is a first plurality of reads,
the biological sample is a first biological sample, and
identifying the plurality of sites comprises obtaining a second plurality of reads from DNA molecules of a second biological sample different from the first biological sample.
299. The computer readable medium of claim 272, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
300. The computer readable medium of claim 275, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
301. The computer readable medium of claim 278, wherein:
the plurality of reads is a first plurality of reads, and
identifying the plurality of sites comprises obtaining a second plurality of reads from the DNA molecules of the biological sample.
302. The computer readable medium of claim 272, wherein:
the plurality of sites is a first plurality of sites,
the method further comprises the following steps:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
303. The computer readable medium of claim 275, wherein:
the plurality of sites is a first plurality of sites,
the method further comprises the following steps:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
304. The computer readable medium of claim 278, wherein:
the plurality of sites is a first plurality of sites,
the method further comprises the following steps:
providing a second plurality of sites from a reference database, wherein:
the second plurality of sites comprises sites having known single nucleotide polymorphisms, and
each of the first plurality of sites is one of the second plurality of sites.
305. The computer readable medium of claim 272, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
306. The computer readable medium of claim 275, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
307. The computer readable medium of claim 278, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain reads.
308. The computer readable medium of claim 272, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
309. The computer readable medium of claim 275, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
310. The computer readable medium of claim 278, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain reads.
311. The computer-readable medium of claim 272, wherein the plurality of reads comprises less than or equal to 50 million reads.
312. The computer-readable medium of claim 275, wherein the plurality of reads comprises less than or equal to 50 million reads.
313. The computer-readable medium of claim 278, wherein the plurality of reads comprises less than or equal to 50 million reads.
314. The computer-readable medium of claim 272, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
315. The computer-readable medium of claim 275, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
316. The computer-readable medium of claim 278, wherein the plurality of reads is equal to or less than 1x coverage of a haploid human genome.
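The read-count and coverage limits of claims 311 to 316 can be related with a short calculation. This sketch assumes single-end reads of uniform length and a haploid human genome size of roughly 3 billion bases; both values, and the function name, are illustrative assumptions rather than figures taken from the claims.

```python
# Approximate haploid human genome size in bases (assumed round figure).
HAPLOID_GENOME_BP = 3_000_000_000


def haploid_coverage(num_reads, read_length_bp, genome_bp=HAPLOID_GENOME_BP):
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return num_reads * read_length_bp / genome_bp


def is_shallow(num_reads, read_length_bp, max_coverage=1.0):
    """True when the data set is at or below the given coverage limit."""
    return haploid_coverage(num_reads, read_length_bp) <= max_coverage
```

Under these assumptions, 50 million reads of 50 bp correspond to about 0.83x coverage, which falls within the 1x limit.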
317. A computer readable medium storing a plurality of instructions, wherein the instructions when executed by a processor control a computer system to implement the steps included in a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample including maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set comprising a first plurality of reads of DNA molecules from a first sample from a female pregnant with the fetus;
identifying a location of the first plurality of reads in a reference genome;
identifying a first set of loci based on the first dataset and the identified locations, wherein no loci in the first set of loci display more than one allele;
determining a first amount of a locus in the first set of loci;
receiving a second data set comprising a second plurality of reads of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
identifying a second set of loci based on the second data set and the identified locations, wherein:
each of the second set of loci is one of the first set of loci,
each locus in the second set of loci displays an allele that is different from the allele displayed in the first set of loci, and
a portion of the first set of loci does not comprise reads in the second plurality of reads that display alleles different from the alleles displayed in the first plurality of reads;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value from the first amount and the second amount;
comparing the normalized parameter value to a calibration point determined using at least one other sample having a known fetal DNA fraction and a calibration value corresponding to an individual measurement of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
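The locus-counting steps of claim 317 can be sketched as follows. The input layout (a dictionary mapping a locus identifier to the list of alleles observed in the reads covering it) and all function names are assumptions for illustration; the two normalization options follow the alternatives recited in claim 346.

```python
def single_allele_loci(first_calls):
    """Loci in the first data set that display exactly one allele.
    Returns a dict: locus -> the single allele observed."""
    return {
        locus: alleles[0]
        for locus, alleles in first_calls.items()
        if alleles and len(set(alleles)) == 1
    }


def loci_with_different_allele(first_set, second_calls):
    """Subset of the first set of loci at which the second data set
    contains at least one read showing a different allele."""
    return {
        locus
        for locus, allele in first_set.items()
        if any(a != allele for a in second_calls.get(locus, []))
    }


def normalized_parameter(first_amount, second_amount, use_sum=False):
    """second / first, or second / (first + second) when use_sum is True."""
    denominator = first_amount + second_amount if use_sum else first_amount
    if denominator == 0:
        raise ValueError("denominator is zero")
    return second_amount / denominator
```

The first and second amounts are the sizes of the two sets, and the normalized parameter value is then compared with a calibration point as in the final steps of the claim.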
318. The computer readable medium of claim 317, wherein the first sample is the biological sample.
319. The computer readable medium of claim 317, wherein the first sample is not the biological sample, and wherein the first sample does not comprise fetal DNA molecules.
320. The computer-readable medium of claim 317, wherein the alleles that differ from the alleles displayed in the first set of loci are limited to alleles identified in a database as corresponding to biallelic loci.
321. The computer-readable medium of claim 317, wherein the first amount is a number of loci and the second amount is a number of loci in the second dataset.
322. The computer-readable medium of claim 317, wherein determining the first amount comprises determining the first amount in the second dataset.
323. The computer readable medium of claim 317, wherein the method further comprises:
measuring the size of a first plurality of DNA molecules corresponding to the first plurality of reads, and
measuring the size of a second plurality of DNA molecules corresponding to the second plurality of reads,
wherein:
the first plurality of DNA molecules has a first size value,
the first size value corresponds to a statistic of a first size distribution of the first plurality of DNA molecules,
the second plurality of DNA molecules has a second size value,
the second size value corresponds to a statistic of a second size distribution of the second plurality of DNA molecules, and
the first size value is greater than the second size value by at least a minimum difference.
324. The computer readable medium of claim 317, wherein:
a first plurality of DNA molecules comprising reads at the first set of loci are larger than a first size,
a second plurality of DNA molecules comprising reads at the second set of loci are smaller than a second size, and
the difference between the first size and the second size is greater than a minimum difference.
325. The computer readable medium of claim 323, wherein the minimum difference is 5 bp.
326. The computer readable medium of claim 324, wherein the minimum difference is 5 bp.
327. The computer-readable medium of claim 317, wherein the female is pregnant with a plurality of fetuses, the method further comprising:
comparing the fetal DNA fraction to a cut-off value, and:
classifying the plurality of fetuses as monozygotic if the calculated fetal DNA fraction is below the cutoff value, or
classifying the plurality of fetuses as dizygotic if the calculated fetal DNA fraction is above the cutoff value.
328. The computer readable medium of claim 327, wherein:
the fetal DNA fraction is the first fetal DNA fraction,
the cutoff value is determined as a value greater than a second fetal DNA fraction of the biological sample, and
the second fetal DNA fraction is determined without using the normalized parameter value.
329. The computer readable medium of claim 328, wherein the second fetal DNA fraction is estimated using a size profile of DNA molecules in the biological sample.
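The classification of claims 327 to 329 reduces to a threshold comparison, sketched below. The claims only require the cutoff to be greater than the second fetal DNA fraction; the multiplicative margin used to derive it here is a hypothetical choice, as are the function names and the handling of a fraction exactly equal to the cutoff.

```python
def cutoff_from_second_estimate(second_fetal_fraction, margin=1.2):
    """Hypothetical cutoff: the second (e.g. size-based) fetal DNA fraction
    scaled by a margin greater than 1 so the cutoff exceeds that estimate."""
    if margin <= 1.0:
        raise ValueError("margin must be greater than 1")
    return second_fetal_fraction * margin


def classify_zygosity(first_fetal_fraction, cutoff):
    """'dizygotic' above the cutoff, otherwise 'monozygotic'.
    A value equal to the cutoff is treated as monozygotic (assumption)."""
    return "dizygotic" if first_fetal_fraction > cutoff else "monozygotic"
```

Here `first_fetal_fraction` is the value calculated from the normalized parameter, and the second estimate comes from an independent measurement such as the size profile mentioned in claim 329.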
330. A computer readable medium storing a plurality of instructions, wherein the instructions when executed by a processor control a computer system to implement the steps included in a method for measuring the fraction of fetal DNA in a biological sample of a female pregnant with a fetus, the biological sample including maternal DNA molecules and fetal DNA molecules, the method comprising:
receiving a first data set from a first plurality of reads of a first plurality of DNA molecules;
identifying a location of the first plurality of reads in a reference genome;
determining the size of the DNA molecules corresponding to the first plurality of reads;
identifying a first set of loci in the first data set, wherein the DNA molecules comprising reads at each locus of the first set of loci have a first size distribution and a first size value of the first size distribution that exceeds a first size threshold;
determining a first amount of loci in the first set of loci;
receiving a second data set of a second plurality of reads of a second plurality of DNA molecules from the biological sample;
identifying a location of the second plurality of reads in the reference genome;
determining the size of the DNA molecules corresponding to the second plurality of reads;
identifying a second set of loci in the second data set, wherein:
each of the second set of loci is one of the first set of loci, and
the DNA molecules comprising reads at each locus of the second set of loci have a second size distribution and a second size value of the second size distribution that exceeds a second size threshold in a direction opposite to that in which the first size value exceeds the first size threshold;
determining a second amount of loci in the second set of loci of the second data set;
determining a normalized parameter value from the first amount and the second amount;
comparing the normalized parameter value to calibration points determined using at least one other sample having a known fetal DNA fraction and calibration values corresponding to individual measurements of the parameter in the at least one other sample; and
calculating the fetal DNA fraction based on the comparison.
331. The computer-readable medium of claim 330, wherein the first size value is greater than the first size threshold and the second size value is less than the second size threshold, and wherein the second size threshold is less than the first size threshold.
332. The computer-readable medium of claim 330, wherein the first size value is less than the first size threshold and the second size value is greater than the second size threshold, and wherein the second size threshold is greater than the first size threshold.
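The size-threshold selection of claims 330 to 332 can be sketched as follows for the arrangement of claim 331, in which the first size value lies above the first threshold and the second size value lies below the second threshold. The input layout (locus identifier mapped to a list of fragment sizes in bp), the use of the median as the size value, and the function names are assumptions for illustration.

```python
import statistics


def loci_beyond_threshold(sizes_by_locus, threshold, direction):
    """Loci whose median fragment size is above or below the threshold."""
    if direction not in ("above", "below"):
        raise ValueError("direction must be 'above' or 'below'")
    selected = set()
    for locus, sizes in sizes_by_locus.items():
        if not sizes:
            continue  # no molecules cover this locus
        value = statistics.median(sizes)
        if (direction == "above" and value > threshold) or (
            direction == "below" and value < threshold
        ):
            selected.add(locus)
    return selected


def size_based_loci_amounts(first_sizes, second_sizes,
                            first_threshold, second_threshold):
    """Return (first amount, second amount) of loci for claim 331's case."""
    first_set = loci_beyond_threshold(first_sizes, first_threshold, "above")
    # The second set is restricted to loci already in the first set.
    candidates = {l: s for l, s in second_sizes.items() if l in first_set}
    second_set = loci_beyond_threshold(candidates, second_threshold, "below")
    return len(first_set), len(second_set)
```

Swapping the two directions gives the mirrored arrangement of claim 332.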
333. The computer readable medium of claim 330, wherein the difference between the first size value and the second size value is 10 bp.
334. The computer readable medium of claim 317, wherein the first dataset comprises reads of DNA molecules from the biological sample.
335. The computer readable medium of claim 330, wherein the first data set comprises reads of DNA molecules from the biological sample.
336. The computer readable medium of claim 317, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
337. The computer readable medium of claim 330, wherein:
the biological sample is a first biological sample, and
the first data set includes reads of DNA molecules from a second biological sample that does not contain fetal DNA.
338. The computer readable medium of claim 317, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
339. The computer readable medium of claim 330, wherein the method further comprises:
receiving the biological sample; and
sequencing a plurality of DNA molecules in the biological sample to obtain the second plurality of reads.
340. The computer readable medium of claim 317, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
341. The computer readable medium of claim 330, wherein the method further comprises:
receiving the biological sample; and
analyzing a plurality of DNA molecules in the biological sample using a probe microarray to obtain the second plurality of reads.
342. The computer-readable medium of claim 317, wherein the second plurality of reads comprises less than or equal to 50 million reads.
343. The computer-readable medium of claim 330, wherein the second plurality of reads comprises less than or equal to 50 million reads.
344. The computer-readable medium of claim 317, wherein the second plurality of reads is at or less than 1x coverage of a haploid human genome.
345. The computer-readable medium of claim 330, wherein the second plurality of reads is at or less than 1x coverage of a haploid human genome.
346. The computer-readable medium of claim 317, wherein the normalized parameter value comprises the second amount divided by the first amount, or wherein the normalized parameter value comprises the second amount divided by a sum of the first amount and the second amount.
347. The computer-readable medium of claim 330, wherein the normalized parameter value comprises the second amount divided by the first amount, or wherein the normalized parameter value comprises the second amount divided by a sum of the first amount and the second amount.
348. The computer readable medium of claim 317, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
349. The computer readable medium of claim 330, wherein:
the calibration point is one of a plurality of calibration points, and
the plurality of calibration points constitute a calibration curve.
350. The computer readable medium of claim 348, wherein the calibration curve is linear.
351. The computer readable medium of claim 349, wherein the calibration curve is linear.
352. The computer readable medium of claim 348, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
the calibration point is determined using the known fetal DNA fraction and a second normalized parameter value, the second normalized parameter value being determined from a set of loci in a data set of a third plurality of reads of DNA molecules from a second biological sample by a method corresponding to that used for the first normalized parameter value, and
the second plurality of reads provides a first coverage of the haploid human genome that is within 10x of a second coverage provided by the third plurality of reads.
353. The computer readable medium of claim 349, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample,
the calibration point is determined using the known fetal DNA fraction and a second normalized parameter value, the second normalized parameter value being determined from a set of loci in a data set of a third plurality of reads of DNA molecules from a second biological sample by a method corresponding to that used for the first normalized parameter value, and
the second plurality of reads provides a first coverage of the haploid human genome that is within 10x of a second coverage provided by the third plurality of reads.
354. The computer readable medium of claim 317, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to a corresponding method for the first normalized parameter value.
355. The computer readable medium of claim 330, wherein:
the normalized parameter value is a first normalized parameter value,
the biological sample is a first biological sample, and
obtaining the calibration point includes using the known fetal DNA fraction and a second normalized parameter value determined from a second biological sample according to a corresponding method for the first normalized parameter value.
356. The computer-readable medium of claim 317, wherein the method further comprises:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
357. The computer readable medium of claim 330, wherein the method further comprises:
providing a fourth set of loci from a reference database, wherein:
the fourth set of loci comprises loci having known single nucleotide polymorphisms, and
each of the first set of loci is one of the fourth set of loci.
358. The computer-readable medium of any one of claims 272-357, wherein a depth of coverage of sequence reads of the dataset is less than 5x coverage.
359. The computer-readable medium of any one of claims 272-357, wherein a depth of coverage of a sequence read of the dataset is less than 1x coverage.
360. The computer readable medium of any one of claims 272-357, wherein the number of sequence reads in the dataset is less than 50 million.
HK18110675.1A 2015-09-22 2016-09-22 Accurate quantification of fetal dna fraction by shallow-depth sequencing of maternal plasma dna HK1251263B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201562222157P 2015-09-22 2015-09-22
US62/222,157 2015-09-22
PCT/CN2016/099682 WO2017050244A1 (en) 2015-09-22 2016-09-22 Accurate quantification of fetal dna fraction by shallow-depth sequencing of maternal plasma dna

Publications (2)

Publication Number Publication Date
HK1251263A1 HK1251263A1 (en) 2019-01-25
HK1251263B HK1251263B (en) 2023-02-17
