WO2025250794A1 - Two-copy allele detection - Google Patents
Two-copy allele detection
- Publication number
- WO2025250794A1 (PCT/US2025/031425)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- smn1
- copy
- allele
- reads
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Copy number variant detection techniques are described that improve precision and sensitivity for detection of two-copy alleles. In an embodiment, a novel technique for classifying two-copy alleles of SMN1 is provided that uses a two-copy allele-associated haplogroup. In an embodiment, a novel technique for classifying two-copy alleles of SMN1 is provided that uses machine-learning identified variants. The techniques may be used in conjunction with SMN1 copy number detection techniques that distinguish between SMN1 and SMN2 or that otherwise estimate copy number for SMN1 considering paralog-aligned sequences.
Description
TWO-COPY ALLELE DETECTION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to and the benefit of U.S. Provisional Application No. 63/654,604, filed on May 31, 2024, and incorporated by reference in its entirety herein for all purposes.
REFERENCE TO ELECTRONIC SEQUENCE LISTING
[0002] The application contains a Sequence Listing which has been submitted electronically in .XML format and is hereby incorporated by reference in its entirety. Said .XML copy, created on May 26, 2025, is named “ILUM0149PCT.xml” and is 60,373 bytes in size. The sequence listing contained in this .XML file is part of the specification and is hereby incorporated by reference herein in its entirety.
BACKGROUND
[0003] The disclosed technology relates generally to sequence data assessment techniques to identify two-copy alleles of a genomic region, such as a gene or genes of interest.
[0004] Motor neuron diseases (MNDs) are a group of progressive neurological disorders that destroy motor neurons, the cells that control essential voluntary muscle activity such as speaking, walking, breathing, and swallowing. Normally, messages from motor nerve cells in the brain (called upper motor neurons) are transmitted to motor nerve cells in the brain stem and spinal cord (called lower motor neurons), and messages from the lower motor neurons are transmitted to particular muscles. Upper motor neurons direct the lower motor neurons to produce movements such as walking or chewing. Lower motor neurons control movement in the arms, legs, chest, face, throat, and tongue. Spinal motor neurons are also called anterior horn cells.
[0005] Spinal muscular atrophy (SMA) is an autosomal recessive, neuromuscular disorder characterized by loss of motor neurons and progressive muscle wasting, often leading to early death. The disorder is caused by a genetic defect in the SMN1 gene, which encodes survival of motor neuron (SMN) protein, a protein expressed in all eukaryotic cells and necessary for the survival of motor neurons. Lower levels of the protein result in loss of function of neuronal cells in the anterior horn of the spinal cord and subsequent system-wide muscle wasting (atrophy).
[0006] A small amount of SMN protein can be produced from a gene similar to SMN1 called SMN2. Several different versions of the SMN protein are produced from the SMN2 gene, but only one version (called isoform d) is full size and fully functional. The other versions are smaller and may be easily broken down. The full-size protein made from the SMN2 gene is identical to the protein made from SMN1; however, much less full-size SMN protein is produced from the SMN2 gene compared with the SMN1 gene. SMN1 and SMN2 genes are nearly identical and encode the same protein. A sequence difference between the two is a single nucleotide in exon 7, which is thought to be an exon splice enhancer. It is thought that gene conversion events may involve the two genes, leading to exchanges of sequence between SMN1 and SMN2.
[0007] A person is affected with SMA if the person only has defective copies of the SMN1 gene. A person is a carrier of SMA if the person has one chromosome containing at least one normal copy of the SMN1 gene and at least one chromosome containing no normal copies of the SMN1 gene (i.e., either no copies of SMN1 or only defective copies of SMN1). An individual may be considered a silent carrier if one chromosome has two copies of the SMN1 gene while the other chromosome has no copies. This genotype is difficult to distinguish from a normal genotype with each chromosome having one copy, because the total copy number is the same although the chromosomal distribution is different. In addition, it is challenging to resolve the copy number of SMN1 when some sequence reads that align to SMN1 may be derived from the SMN2 sequence.
BRIEF DESCRIPTION
[0008] In one embodiment, the present disclosure provides a method for identifying a two-copy allele of survival of motor neuron 1 (SMN1) gene. The method includes receiving sequence data of a subject comprising nucleotide identities for a plurality of sequence reads, the plurality of sequence reads comprising at least ten thousand sequence reads; performing an alignment to a reference sequence for the plurality of sequence reads, wherein the reference sequence comprises an SMN1 reference sequence and an SMN2 reference sequence; identifying a subset of sequence reads of the plurality of sequence reads that are aligned to one or both of the SMN1 reference sequence or the SMN2 reference sequence; determining a normalized number of the sequence reads in the subset; identifying sequence differences in the subset, the sequence differences comprising first sequence differences relative to the SMN1 reference sequence for the sequence reads and second sequence differences relative to the SMN2 reference sequence, wherein the first sequence differences and the second sequence differences distinguish between SMN1 and SMN2; determining an SMN1 copy number based on the identified sequence differences in the subset and the normalized number; identifying a two-copy allele-associated S1-8 and S1-9d haplogroup in the sequence differences; and classifying the subject as carrying a two-copy allele of SMN1 based on an estimated SMN1 copy number of two copies and a presence of the two-copy allele-associated S1-8 and S1-9d haplogroup in the sequence differences.
[0009] In one embodiment, the present disclosure provides a multivariate computer-implemented method for identifying a two-copy allele of a gene of interest. The method includes receiving onto a memory a set of weighting factors of a machine learning model related to a preselected set of positions in a reference sequence; receiving onto the memory whole genome sequence data of a subject comprising nucleotide identities for a plurality of sequence reads, the plurality of sequence reads comprising at least a million sequence reads; receiving onto the memory alignments based on aligning sequence data in the whole genome sequence data to the reference sequence to generate the alignments; receiving onto the memory sequence differences relative to the reference sequence in the alignments to a gene of interest; determining total copy number for the gene of interest based on the alignments; determining copy numbers for variant alleles at the preselected set of positions based on the sequence differences; and generating an output of the machine learning model using the total copy number, the weights and the copy numbers, wherein the output characterizes the subject as carrying a two-copy allele for the gene of interest.
[0010] In one embodiment, the present disclosure provides a system for identifying a two-copy allele of survival of motor neuron 1 (SMN1) gene. The system includes a sequence device configured to generate whole genome sequence data of a subject comprising base call information for a plurality of sequence reads, the plurality of sequence reads comprising at least a million sequence reads. The system also includes processing circuitry configured to receive the whole genome sequence data. The processing circuitry is configured to execute instructions to align survival of motor neuron 1 (SMN1) sequence data and survival of motor neuron 2 (SMN2) sequence data in the whole genome sequence data to an SMN1 reference sequence (SEQ ID NO: 1) and an SMN2 reference sequence (SEQ ID NO: 2) to generate alignments; determine that an SMN1 copy number of the whole genome sequence data is two copies using the alignments; use the alignments to generate variant site copy numbers for a plurality of variants; provide the variant site copy numbers to a machine learning model; and characterize the subject as carrying a two-copy allele of SMN1 based on an output of the machine learning model.
[0011] In one embodiment, the present disclosure provides a multivariate computer-implemented method for identifying a two-copy allele of a gene of interest. The method includes receiving first group sequence data from a plurality of individuals having three copies of a survival of motor neuron 1 (SMN1) gene; receiving second group sequence data from a plurality of individuals having only two copies of the SMN1 gene; aligning the first group sequence data and the second group sequence data to an SMN1 reference sequence; identifying sequence variants relative to the SMN1 reference sequence in the first group sequence data that are not present in the second group sequence data; generating features for a machine learning model based on the identified sequence variants; and using the machine learning model to classify sequence data from a subject as having a two-copy allele of SMN1.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] These and other features, aspects, and advantages of the disclosed embodiments will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
[0013] FIG. 1 is a schematic illustration of different SMN1 genotypes;
[0014] FIG. 2 is a schematic illustration of an SMN1 two-copy allele-associated haplogroup;
[0015] FIG. 3 is an example two-copy allele classifying method, according to an embodiment;
[0016] FIG. 4 is an example two-copy allele classifying workflow, according to an embodiment;
[0017] FIG. 5 shows concordance with another two-copy allele calling technique for S1-8;
[0018] FIG. 6 shows concordance with another two-copy allele calling technique for S1-9d;
[0019] FIG. 7 shows haplogroup caller performance relative to a g.27134T>G variant classifier;
[0020] FIG. 8 is a schematic illustration of a three-copy allele genotype vs. a two-copy allele genotype;
[0021] FIG. 9 shows posterior probabilities for different variants;
[0022] FIG. 10 is an example multivariate two-copy allele classifying method using a machine learning model, according to an embodiment;
[0023] FIG. 11 is an example multivariate two-copy allele classifying workflow using a machine learning model, according to an embodiment;
[0024] FIG. 12 shows performance of different two-copy allele classifiers;
[0025] FIG. 13 shows performance of the multivariate two-copy allele classifying method relative to an SMN1 copy number caller;
[0026] FIG. 14 shows performance of the multivariate two-copy allele classifying method relative to the haplogroup caller; and
[0027] FIG. 15 is a block diagram of a sequence analysis device, according to an embodiment.
DETAILED DESCRIPTION
[0028] The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
[0029] Copy number variation (CNV) refers to a circumstance in which the number of copies of a specific segment of DNA varies among different individuals’ genomes. The individual variants may be short or include thousands of bases. These structural differences may have come about through duplications, deletions or other changes and can affect long stretches of DNA. Such regions may or may not contain a gene(s). In an embodiment, the present techniques may be used to identify individuals having two-copy alleles for duplicated genes or genomic regions. That is, certain individuals may carry two or more copies of a particular gene on an individual chromosome. In certain embodiments, the systems and methods can be used to determine affected or carrier status of a subject for SMA using sequencing data, such as whole genome sequencing (WGS) data. A subject is affected with SMA if the subject has only defective copies of the SMN1 gene. A subject is a carrier of SMA if the subject has at least one chromosome containing at least one normal copy of the SMN1 gene and at least one chromosome containing no normal copies of SMN1 (i.e., either no copies of SMN1 or only defective copies of SMN1). In certain embodiments, the systems and methods can be used to determine affected or carrier status of a subject for alpha-thalassemia (HBA) using sequencing data, such as whole genome sequencing (WGS) data. In certain embodiments, the systems and methods can be used to determine affected or carrier status of a subject for a silent carrier genotype using sequencing data, such as partial or whole genome sequencing (WGS) data. These genes are examples of genes for which identification of a two-copy allele may be clinically significant. However, identification of two-copy alleles may also be beneficial for accurate genome assembly and assessment of genomic structural variation.
[0030] Turning to the example of SMN1, FIG. 1 is a schematic diagram of carrier frequency for SMN1 in different populations. As illustrated, an individual may have a normal genotype (each chromosome carrying a single copy of SMN1), e.g., an SMN1 copy number of two distributed across the maternal and paternal chromosomes. In some cases, with an incidence of about 1 in 6000-10000 depending on the population, an individual has the SMA genotype, in which neither chromosome has a functional copy of the SMN1 gene, e.g., an SMN1 copy number of zero. Also illustrated are two different SMA carrier genotypes. Individuals who are SMA carriers have increased risk of having children with SMA. One example of a carrier genotype has one chromosome with a functional SMN1 gene and the other chromosome has a null copy or no functional SMN1 such that the individual has a total copy number of 1.
[0031] Another carrier genotype is a silent carrier. In this genotype, the copy number is two copies of SMN1, and both of these copies are on a single chromosome. That is, one chromosome has an SMN1 copy number of two, and the other chromosome has no copies of SMN1. Carrier genotypes may occur in about 1 in 40-80 individuals. For non-African populations, estimates are that the silent carrier genotype occurs in 3%-9% of carriers. In African populations, silent carriers are estimated to be around 30% of carriers. However, the silent carrier genotype and the normal genotype both carry two copies, albeit with different chromosomal distributions. Therefore, conventional copy number estimation techniques cannot differentiate between these genotypes. Further, the presence of the SMN1 paralog, SMN2, provides additional complexity in distinguishing the normal genotype and the silent carrier genotype.
[0032] Certain screening methods rely on identification of a presence of a particular variant, g.27134T>G, to distinguish between the silent carrier genotype and the normal genotype. However, this variant is rare in non-African populations and would be a low sensitivity screen in these populations. Further, this variant is relatively common in African populations in both two-copy allele individuals as well as single-copy allele individuals (about 20% prevalence) and would be a low precision screen in these populations.
[0033] Provided herein are classifiers to classify individuals as either having a two-copy allele of a gene of interest or not having the two-copy allele of the gene of interest. In certain embodiments, the gene of interest is SMN1. In one embodiment, classification is based on the presence of a two-copy allele (e.g., a two-SMN1-copy allele)-associated haplogroup or haplogroups. Two haplogroups, S1-8 and S1-9d, co-occur in two-copy alleles and are infrequent in one-copy or singleton alleles. Therefore, the present techniques include identification of S1-8 and S1-9d haplogroups as part of two-copy allele classification in a targeted two-copy allele caller. In embodiments, read support for individual S1-8 and S1-9d marker variants may be provided to the targeted two-copy allele caller to predict S1-8 and S1-9d copy number as part of the workflow as generally discussed with respect to FIG. 5.
[0034] Sequencing data analysis using long sequence reads, e.g., 5 kb or greater, has been used to reconstruct haplotype sequences for SMN1 and SMN2 to characterize sequence variants for two-copy alleles. FIG. 2 shows haplogroup information from long read sequencing data. Phasing of haplotypes into alleles was done by comparing the haplotypes/haplogroups in parents and probands. Haplotypes were directly assigned haplogroups by Paraphase (Pacific Biosciences of California, Inc.) in samples with >20X HiFi WGS coverage. For parents with either Illumina short read data or low coverage HiFi data, i.e., where phasing is not possible or accurate, representative variants for each haplogroup were queried in the parent data to identify the haplogroups in the parent. In ambiguous cases, i.e., where both parents have haplotypes of the same haplogroup, manual examination of data in IGV was conducted to find unique SNPs that distinguish these haplotypes and phase them into alleles.
[0035] For example, haplogroups S1-8 and S1-9d were associated with two-thirds of African two-copy SMN1 alleles and are rarely present as singleton alleles. An SMN1 copy number call of 2 with detection of both S1-8 and S1-9d would lead to a silent carrier probability of 88.5%. Testing positive for these two haplotypes in an individual with two copies of SMN1 gives a silent carrier risk of 88.5%, which is significantly higher than the risk associated with the g.27134T>G variant (1.7%-3.0%). Multiple unique variants associated with the S1-8 haplogroup can be detected with short-read data to call haplogroup copy numbers as provided herein.
[0036] The targeted two-copy allele caller may be provided as part of SMN1 copy number calling. FIG. 3 is a flow diagram of a two-copy allele classifier method 100. The method 100 may be implemented by the disclosed systems and/or devices discussed herein. In an embodiment, the method may be at least in part carried out by hardware elements discussed with respect to FIG. 15. It should be understood that the method 100 may include intervening steps, and that certain steps of the method 100 may be combined or exchanged. The method 100 may operate on sequence data generated from a sample of interest. Generation of the sequence data, storing of the generated sequence data to memory, communicating the stored sequence data, and/or accessing the stored sequence data may, in embodiments, be encompassed by embodiments of the method 100.
[0037] The method 100 includes identifying sequence reads aligned to one or both of an SMN1 or SMN2 reference sequence at block 102. The sequence data may be in the form of a plurality of sequence reads. For example, sequence reads may be about 100 base pairs to about 1000 base pairs in length each. The computing system aligns the plurality of sequence reads to a reference sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to the SMN1 and/or the SMN2 reference sequence. In an embodiment, the method 100 aligns the sequence reads (e.g., WGS reads) to a reference genome sequence. The reference genome sequence of a human subject can be a human reference genome sequence such as the hg16, hg17, hg18, hg19, or hg38 reference human genome sequence. In an embodiment, the alignment is performed as a genome-wide alignment of the sequencing data to the reference genome, and those reads that are aligned to the SMN1 or SMN2 reference are identified from the entire pool of aligned reads. However, in an embodiment, certain workflows may be a targeted alignment to only a portion of the genome. The alignment may be a separate SMN1/SMN2 alignment.
[0038] At block 104, a copy number of SMN1 for the sample is determined. The copy number may be normalized to a genome-wide average as disclosed herein. In one example, normalization may occur as generally discussed in U.S. Publication No. 20210166781, U.S. Publication No. US20200087723, or U.S. Publication No. 20200381079, which are hereby incorporated by reference in their entireties herein for all purposes. Normalization may involve counting all reads aligned under the SMN1 and SMN2 reference sequences. At this point, reads aligned under the SMN1 reference sequence can be from either SMN1 or SMN2. The same applies to reads aligned under the SMN2 reference sequence. The process can then normalize the read count into a total copy number. For WGS applications, pre-selected normalization regions are used. For example, the read count may be normalized by the length of the region and against a set of 3000 genomic regions of 2000 bp expected to be consistently diploid across populations. Once the total copy number is determined, the SMN1 copy number is determined based on the differentiating sites between SMN1 and SMN2. In some cases, the SMN1 copy number may be based on a set of differentiating sites in exon 7.
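The read-count normalization described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the caller's actual implementation: the function names are hypothetical, and the use of the median depth of the diploid regions as the baseline is an assumption, since the text does not name the summary statistic.

```python
from statistics import median


def depth_per_base(read_count, region_length_bp):
    """Normalize a raw read count by the length of the region."""
    return read_count / region_length_bp


def total_copy_number(target_reads, target_length_bp,
                      baseline_reads, baseline_length_bp=2000):
    """Estimate the combined SMN1 + SMN2 copy number.

    baseline_reads holds read counts for regions assumed to be diploid
    (e.g., the 3000 regions of 2000 bp mentioned in the text).
    Taking the median as the diploid baseline is an assumption.
    """
    baseline_depths = [depth_per_base(c, baseline_length_bp)
                       for c in baseline_reads]
    diploid_depth = median(baseline_depths)
    # A diploid region corresponds to two copies, so scale by 2.
    return 2.0 * depth_per_base(target_reads, target_length_bp) / diploid_depth


# Hypothetical counts: target depth is twice the diploid depth.
cn = total_copy_number(11200, 28000, [400] * 3000)
```

With these made-up counts the target region has twice the per-base depth of the diploid baseline, so the estimate corresponds to four total copies.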
[0039]
[0040] The differentiating sites may refer to a nucleotide identity at a particular position of the sequence read being different than a nucleotide identity at the corresponding position in the reference sequence after alignment. For a particular sequence data set including thousands to millions of sequence reads, a subset will align to SMN1. Each of these aligned sequence reads of the subset has the potential to have 100% identity with the reference sequence or to have one or more nucleotide differences relative to the reference sequence at corresponding positions. The differences may be due to polymorphisms of the SMN genes or due to nucleotide differences between corresponding positions of the SMN1 and SMN2 genes. An SMN gene refers to the SMN1 gene or the SMN2 gene, and the differences may be due to polymorphisms of either the SMN1 gene or the SMN2 gene. For example, there is a single-base difference between functional SMN1 and SMN2 that falls in exon 7 of the canonical transcript of SMN1. The sizeable majority, approximately 95%, of SMA cases, and of carrier haplotypes, are due to one of two types of change that can be detected as the loss (total absence or quantitative depletion for affected and carrier respectively) of the SMN1 version of exon 7. One change is a deletion of all or part of SMN1 that includes exon 7. The second change is a gene conversion replacing a region including exon 7 of SMN1 with the homologous sequence from SMN2.
[0041] Affected status for most affected individuals can thus be detected as the absence, or near absence (to allow for one or more sequencing errors) of the allele (sequence difference) matching the SMN1 reference base at specific positions as provided herein. This can be determined by the counts of reads supporting the relevant sequence differences. In some embodiments, performing the test on the counts of reads supporting the relevant alleles can include: if fewer than X reads matching the reference SMN1 sequence are detected, the sample is labeled as “affected.” If more than Y reads matching the reference SMN1 sequence are observed, the sample can be labeled as “unaffected.” The thresholds X and Y can be determined empirically. The thresholds X and Y can depend on the depth of coverage. In an embodiment, a sequencing system may be programmed to operate with an operating mode or setting of 30x depth of coverage for a particular sample, where the depth of coverage refers to a number of unique sequence reads that align to each nucleotide of the reference genome. Alternatively or in addition, the thresholds X and Y can be adjusted based on the desired or acceptable accuracy. In some embodiments, performing the test on the counts of reads supporting the relevant alleles can be based on probabilistic models as discussed in U.S. Patent Publication No. US20200087723A1. The probabilistic models can be generated based on one or more sequencing errors or haplotype sampling. In some embodiments, population- or family-based priors could be incorporated into these processes.
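The threshold test on read counts can be illustrated with a short sketch. The function name and the threshold values in the usage lines are hypothetical, and the "indeterminate" label for counts between X and Y is an assumption, since the text does not say how such samples are handled.

```python
def classify_affected_status(smn1_supporting_reads, x_threshold, y_threshold):
    """Label a sample from the count of reads matching the SMN1 reference base.

    Fewer than X reads -> "affected"; more than Y reads -> "unaffected".
    Counts from X to Y inclusive are returned as "indeterminate",
    which is an assumption not stated in the text.
    """
    if x_threshold > y_threshold:
        raise ValueError("x_threshold must not exceed y_threshold")
    if smn1_supporting_reads < x_threshold:
        return "affected"
    if smn1_supporting_reads > y_threshold:
        return "unaffected"
    return "indeterminate"


# Hypothetical thresholds; in practice X and Y would be set empirically
# and would depend on the depth of coverage.
status_low = classify_affected_status(1, x_threshold=3, y_threshold=10)
status_high = classify_affected_status(25, x_threshold=3, y_threshold=10)
```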
[0042] Determining the SMN1 copy number may be as generally discussed in U.S. Patent Publication No. US20200087723A1 or US20210166781A1, which are incorporated by reference in their entireties herein. In one embodiment, the SMN1 copy number determination technique uses the group of SMN1 and SMN2-aligned sequence reads. In one method, the pool may be force-aligned to the SMN1 reference sequence, and not the SMN2 reference sequence, to generate a set of SMN1 realignments. These forced realignments are assessed for sequence differences relative to the SMN1 reference sequence, and different reads are assigned to different sets based on a presence of these individual sequence differences. The normalized count of reads in each different set is used to determine the SMN1 copy number.
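One simplified way to turn read support at differentiating sites into an SMN1 copy number is sketched below. This is an illustration only: the cited publications describe probabilistic models, whereas this sketch assumes a plain proportional split of the total copy number followed by rounding, and all names are hypothetical.

```python
from statistics import median


def smn1_copy_number(total_cn, site_counts):
    """Estimate SMN1 copy number from differentiating sites.

    site_counts: list of (smn1_reads, smn2_reads) tuples, one per site,
    giving reads that support the SMN1 base and the SMN2 base.
    The proportional split and the median across sites are assumptions.
    """
    estimates = []
    for smn1_reads, smn2_reads in site_counts:
        depth = smn1_reads + smn2_reads
        if depth == 0:
            continue  # Skip sites with no coverage.
        estimates.append(total_cn * smn1_reads / depth)
    if not estimates:
        raise ValueError("no informative differentiating sites")
    return round(median(estimates))


# Hypothetical counts: roughly half of the reads support SMN1
# with a total copy number of 4.
smn1_cn = smn1_copy_number(4, [(30, 30), (28, 32), (31, 29)])
```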
[0043] Disclosed herein are techniques that permit identification of two-copy alleles, including two-copy alleles of gene paralogs, such as two-SMN1-copy alleles. Certain analysis steps of the disclosed techniques use sequence data prepared from a biological sample derived from a subject. Sample preparation, sample quality, and acquisition of the sequence data can be variable. In certain cases, this variability is reflected in the generated sequence data. The variability is difficult to predict and may be tied to operating parameters of the sequence device (see FIG. 15), variability in sequence library preparation, or sample handling and storage. In an embodiment, a sequencing library generated from an individual sample is a non-naturally occurring composition that is prepared by modifying the sample. Further, because of unpredictability in fragment generation on a sample-level basis as part of sample preparation steps during NGS, each sequencing library is a unique composition that in turn generates a unique set of sequence reads (e.g., at least thousands or at least millions of sequence reads) associated with that sample. Normalization may occur using benchmark or stored sequence data of a pool of normals acquired under the same conditions or normalization may be self-normalization to address this variability. Thus, the present techniques include steps of normalization of the acquired whole genome sequence data. In an embodiment, sequence data acquisition from a sequencing library with reduced amplification yields may be reflected in variable sequencing depth (sequence read counts per region or bin) across the genome. Normalization to a genome-wide depth average addresses such sample-dependent variability. Accordingly, the sequence reads in the method 100 per allele can be normalized to a genome-wide average or using other normalization techniques. In some embodiments, the method 100 may adjust for coverage by normalizing coverage depth (i.e., read count) by the genome-wide or chromosome-wide average for the sample being analyzed. Thus, the coverage is normalized against other regions of the genome for the same sample.
[0044] Other methods for improving copy number calls include GC correction. In an embodiment, a GC bias metric is computed as follows:
1. Calculate GC content using a 100 bp wide, per-base rolling window over all chromosomes in the reference genome, excluding any decoys and alternate contigs. Windows containing more than four masked (N) bases in the reference are discarded.
2. Calculate the average coverage for each window, excluding any non-PF, duplicate, secondary, and supplementary reads.
3. Calculate the average global coverage across the whole genome.
4. Group valid windows based on the percentage of GC content, both at individual percentages and five 20% ranges as summary.
5. Calculate the normalized coverage for each group by dividing the average coverage for the bin by the global average coverage across the genome. Values below 1.0 indicate a lower than expected coverage at the given GC percent or range. Coverages significantly deviating from 1.0 at greater GC values are an expected result.
6. Calculate dropout metrics as the sum of all positive values of (percentage of windows at GC X - percentage of aligned reads at GC X) for each GC < 50% and > 50% for AT and GC dropout, respectively.
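The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, read filtering (non-PF, duplicate, secondary, supplementary) is assumed to have happened upstream, GC percent is computed over the full window length, and the share of total coverage stands in for "percentage of aligned reads".

```python
from collections import defaultdict


def gc_windows(sequence, window=100, max_n=4):
    """Yield (start, gc_percent) for each per-base rolling window.

    Windows with more than max_n masked (N) bases are discarded.
    """
    seq = sequence.upper()
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        if chunk.count("N") > max_n:
            continue
        gc = chunk.count("G") + chunk.count("C")
        yield start, round(100 * gc / window)


def gc_bias_metrics(window_gc, window_coverage):
    """Return (normalized coverage per GC percent, AT dropout, GC dropout).

    window_gc and window_coverage are parallel lists for valid windows.
    """
    total_cov = sum(window_coverage)
    global_mean = total_cov / len(window_coverage)
    by_gc = defaultdict(list)
    for gc, cov in zip(window_gc, window_coverage):
        by_gc[gc].append(cov)
    normalized = {gc: (sum(c) / len(c)) / global_mean for gc, c in by_gc.items()}
    at_dropout = gc_dropout = 0.0
    for gc, covs in by_gc.items():
        pct_windows = 100 * len(covs) / len(window_gc)
        pct_reads = 100 * sum(covs) / total_cov  # proxy for aligned-read share
        diff = pct_windows - pct_reads
        if diff > 0:
            # As the text is written, windows at exactly 50% GC
            # contribute to neither dropout value.
            if gc < 50:
                at_dropout += diff
            elif gc > 50:
                gc_dropout += diff
    return normalized, at_dropout, gc_dropout


# Hypothetical data: low-GC windows are under-covered.
norm, at_drop, gc_drop = gc_bias_metrics([30, 30, 70, 70], [10, 10, 30, 30])
```

In the hypothetical data, the 30% GC windows have normalized coverage below 1.0, which shows up as AT dropout.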
[0045] GC correction has been described in Benjamini, Y, et al., Summarizing and correcting the GC content bias in high-throughput sequencing, Nucl. Acids Res., 2012, 40 (10): e72, doi: 10.1093/nar/gks001, and Miller, C A, et al., ReadDepth: A Parallel R Package for Detecting Copy Number Alterations from Short Sequencing Reads, PLoS One., 2011, 6: e16327, doi: 10.1371/journal.pone.0016327; the content of each is incorporated by reference in its entirety. In another embodiment, the GC correction process involves both normalization and GC correction, and the output of this process is the normalized and GC corrected depth. The caller first counts the number of reads that are uniquely aligned to each of the target regions (for example, SMN1 and SMN2 regions) and to each of the 3000 distinct 2 kb regions in the human genome that have diploid copy number in a population with diverse genetic backgrounds. The read count is then normalized by the length of the region. Next, the normalized read count for each query region is pooled together with the normalized read counts for the 3000 distinct 2 kb regions to correct for bias in sequencing coverage due to variable GC content among different regions. A LOWESS regression model is used to model the relationship between GC and normalized depth. Then, the correction is applied based on the LOWESS regression prediction value for the target region and the observed target region value.
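A GC correction of this kind can be sketched with a small local regression. This is a minimal illustration, assuming a single LOWESS pass (tricube-weighted local linear fit, no robustness iterations) and assuming the correction rescales the observed depth by the ratio of the baseline median to the predicted depth; the text does not give the exact formula, and all names are hypothetical.

```python
from statistics import median


def lowess_predict(x, y, x0, frac=0.5):
    """Predict y at x0 with a tricube-weighted local linear fit."""
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))
    dists = sorted(abs(xi - x0) for xi in x)
    # Widen the bandwidth slightly so the k-th neighbor keeps a nonzero weight.
    bandwidth = dists[k - 1] * (1 + 1e-9) + 1e-12
    w = [max(0.0, 1 - (abs(xi - x0) / bandwidth) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        return my  # All weighted points share one GC value.
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return my + (sxy / sxx) * (x0 - mx)


def gc_corrected_depth(observed_depth, target_gc, baseline_gc, baseline_depth,
                       frac=0.5):
    """Rescale the observed target depth by the GC-predicted depth.

    The ratio-based correction is an assumption made for illustration.
    """
    predicted = lowess_predict(baseline_gc, baseline_depth, target_gc, frac)
    return observed_depth * median(baseline_depth) / predicted


# Hypothetical baseline: depth rises linearly with GC percent.
baseline_gc = list(range(30, 61))
baseline_depth = [0.5 + 0.01 * g for g in baseline_gc]
corrected = gc_corrected_depth(1.9, 45, baseline_gc, baseline_depth)
```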
[0046] As discussed, copy number determination may include classifying aligned sequence reads according to sequence differences relative to a reference SMN1 sequence and SMN2 sequence. An example SMN1 reference sequence is SEQ ID NO: 1 (MIM:600354). An example SMN2 reference sequence is SEQ ID NO: 2 (MIM:601627). The alignments to SMN1 or SMN2 are identified from a set of alignments performed on thousands to millions of sequence reads that include other sequences for other genes or genome regions. Within the group of aligned sequence reads, sequence differences associated with the two-copy allele may be identified at block 106. That is, the sequence differences relative to a reference sequence may be used to distinguish between reads that align to SMN1 or SMN2 for normalized copy number determination of 0, 1, 2, 3, etc., and, for SMN1/SMN2-aligned reads (and, in embodiments, ambiguously aligned reads that have a probability of being aligned to both of SMN1 or SMN2), may be used to identify whether, for a copy number value of 2, the determined two copies are likely to be on different chromosomes or on the same chromosome. Thus, a computational alignment may be performed to generate aligned reads using a programmed computer with alignment parameters that are set according to a particular alignment technique. However, even reads that are aligned to a reference sequence may not be identical. In one example, the sequence differences represent a nucleotide difference at corresponding or aligned positions in the alignments between a sequence read of a sample of interest and the reference sequence.
[0047] At block 108, a subject is classified as having a two-copy allele based on (1) a presence of a haplogroup or groups associated with the two-copy allele for SMN1 in the sequence data and (2) a normalized SMN1 copy number of 2. The method 100 also encompasses classifying a subject as not having the two-copy allele when both of the following conditions are true: (1) no identification of a haplogroup or groups associated with the two-copy allele for SMN1 and (2) a normalized SMN1 copy number of 2. It should be understood that individuals with a determined SMN1 copy number of 1 or zero are also classified as not having the two-copy allele. Individuals having a determined SMN1 copy number of 3 or more are likely to have the two-copy allele, regardless of haplogroup identification, as discussed herein.
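The classification logic of block 108 can be summarized in a short sketch. The function name and the returned labels are hypothetical and are used only for illustration; the rules themselves follow the conditions stated above.

```python
def classify_two_copy_allele(smn1_copy_number, haplogroup_present):
    """Illustrative decision rule combining SMN1 copy number and haplogroup status."""
    if smn1_copy_number >= 3:
        # Three or more copies: a two-copy allele is likely regardless of haplogroup.
        return "likely two-copy allele"
    if smn1_copy_number == 2 and haplogroup_present:
        return "two-copy allele"
    # Copy number 0 or 1, or copy number 2 without the associated haplogroup.
    return "no two-copy allele"
```

For example, `classify_two_copy_allele(2, True)` returns `"two-copy allele"`, while the same copy number without the haplogroup returns `"no two-copy allele"`.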
[0048] FIG. 4 is an example workflow 150 showing identification of particular SMN markers as part of two-copy allele detection. As discussed, the workflow may incorporate two-copy allele detection into existing SMN1/2 copy number calling workflows by providing an enhanced set of information using the alignments 152 provided as input for SMN1/2 copy number calling. In the illustrated workflow, the alignments 152 or sequence reads aligned to SMN1 and/or SMN2 reference sequences are used as inputs to both the SMN1/2 copy number calling 154 to generate copy numbers 155 and the haplogroup variant detection. The S1-8 markers 156 and S1-9d markers 158 are detected at block 160 based on nucleotide sequence differences at corresponding positions between the aligned sequence reads and the reference sequences, and are used to make separate S1-8 calls 162 and S1-9d calls 164. For example, the call may be a binary call indicative of a presence of both S1-8 and S1-9d variants in the alignments. The S1-8 and S1-9d variants may be present in the SMN1-aligned reads. Accordingly, in an embodiment, the S1-8 and S1-9d variants may be detected in only SMN1-aligned reads or in a pool of SMN1-aligned and SMN2-aligned reads.
[0049] The S1-8 and S1-9d variants may be, in an embodiment, one or more variants in Table 1.
Table 1: SMN haplogroup variants
[0050] An identified haplogroup may include identification, in the group of aligned sequence reads from the sample of interest, of at least one S1-8 variant and at least one S1-9d variant. The workflow 150 uses the S1-8 calls 162, the S1-9d calls 164, and the SMN1/2 calls 155 to make final calls 166. In an embodiment, two-copy allele classifying requires a copy number of two and both an S1-8 variant and an S1-9d variant to be present. When the SMN1 copy number is two and neither variant is present, or only one of the S1-8 variant or the S1-9d variant is present, the classifier may return a call of no two-copy allele. It should be understood that more than one S1-8 and S1-9d variant may be present in the alignments. In one example, the copy number of a particular S1-8 variant and a particular S1-9d variant may be determined as part of the workflow. The workflow 150 counts the number of reads at each variant site that support the presence of an S1-8 or S1-9d haplotype or, alternatively, the reference SMN1 allele. For each site, the posterior probability of the variant copy number may be calculated using a Poisson model with the number of reads supporting the variant and the expected number of reads given a specific copy number and the depth at each site. A consensus-based approach is then used to determine a final copy number call for the S1-8 or S1-9d haplogroup. The copy number, as discussed herein, may be a normalized copy number based on a normalized read count supporting each variant. In an embodiment, the variant copy number is used as an input to the classifier. If the S1-8 variant copy number is 1 and the S1-9d variant copy number is also 1 in conjunction with a total SMN1 copy number of 2, the classifier returns a call of a two-copy allele classification. The background ploidy, which is the total copy number for SMN1 + SMN2, is used in determining the copy number of each S1-8 and S1-9d specific variant site.
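The per-site Poisson model and the consensus step can be illustrated with a brief sketch. Several details here are assumptions that the text does not specify: the expected number of supporting reads is taken as the site depth multiplied by the candidate copy number over the background ploidy, a small error rate stands in for copy number zero, the prior over copy numbers is uniform, and the consensus is a simple majority vote over confidently called sites. All names are hypothetical.

```python
import math
from collections import Counter

def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing k events with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def variant_cn_posterior(variant_reads, depth, total_cn, error_rate=0.01):
    """Posterior over variant copy numbers 0..total_cn under a uniform prior."""
    log_liks = []
    for cn in range(total_cn + 1):
        # Expected supporting reads; error_rate keeps the mean positive at cn = 0.
        lam = depth * max(cn / total_cn, error_rate)
        log_liks.append(poisson_log_pmf(variant_reads, lam))
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]

def consensus_cn(site_counts, total_cn, min_posterior=0.9):
    """Majority vote over sites with a confident call; None if no consensus."""
    calls = []
    for variant_reads, depth in site_counts:
        post = variant_cn_posterior(variant_reads, depth, total_cn)
        best = max(range(len(post)), key=post.__getitem__)
        if post[best] >= min_posterior:
            calls.append(best)
    if not calls:
        return None
    cn, votes = Counter(calls).most_common(1)[0]
    return cn if votes > len(calls) / 2 else None

# Three hypothetical variant sites given as (supporting reads, depth), background ploidy 4.
sites = [(10, 40), (11, 42), (9, 38)]
haplogroup_cn = consensus_cn(sites, total_cn=4)
```

Returning `None` when no consensus is reached mirrors the null output described for a failed copy number consensus.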
[0051] The workflow may operate in parallel in some cases, whereby, once alignments are generated, the SMN1/2 copy number determination and the haplogroup variant calling operate independently of one another. However, in certain embodiments, the haplogroup variant calling may operate in series and subsequent to SMN1/2 copy number calling. In an embodiment, the two-copy allele detection may only be triggered when the SMN1 copy number is determined to be 2. Thus, for other SMN1 copy numbers (e.g., zero, 1, 3, etc.), the two-copy allele detection operations may be deactivated or bypassed. In this manner, the efficiency of the computing system and distribution of computing resources may be improved by eliminating certain operations for subjects who are unlikely to carry the two-copy allele.
[0052] The classifier may return an output of the classification, such as a determined copy number, a carrier status, a disease status, as well as a determination of a presence of the two-copy allele. Because the two-copy allele may be associated with a silent carrier status, the classifier output may provide additional instructions related to the determined silent carrier status.
[0053] In an embodiment, the classifier may not return an output or may return an error message if a copy number consensus fails for the variants that are present. In such an example, the classifier may generate an SMN1 copy number but may provide a null output for the two-copy allele detection.
[0054] FIG. 5 is a comparison of the S1-8 call from the prototype against the S1-8 call from the long read, which is indicative of the presence and copy number of the haplogroup in a given sample, and FIG. 6 is a comparison of the S1-9d call from the prototype against the S1-9d call from the long read, which is indicative of the presence and copy number of the haplogroup in a given sample.
[0055] FIG. 7 shows improved precision and lower recall of the disclosed S1-8 variant and S1-9d variant haplogroup caller for identification of individuals with a two-copy allele relative to using a single-location g.27134T>G variant. The S1-8 variant and S1-9d variant haplogroup caller had high precision (0.99) and moderate recall (0.49). It was demonstrated that two-copy allele detection based on S1-8 and S1-9d haplogroups using short-read data has high concordance with an orthogonal long-read technology. Therefore, two-copy allele detection can be achieved using more cost-effective short-read technology. This is in contrast to previous techniques, which required generating long-read data from nucleotide fragments containing duplicate regions to identify the presence of a two-copy allele in an individual. For example, long-read technology may read 5000 to 30000 base pairs in one read. This length can encompass at least a portion of a two-copy allele region, allowing direct detection of this allele variant. However, long-read technology can be more expensive. Further, certain types of long-read technologies may be less accurate, which may introduce errors that mask variants. The disclosed techniques permit use of widespread short-read NGS technology to accurately achieve two-copy allele detection.
[0056] In certain embodiments, the disclosed two-copy allele identification techniques include a multivariate two-copy allele caller that generates a two-copy allele classification using a machine learning model to impute the presence of two-copy alleles using various markers (sequence differences relative to the reference sequence) as features of a model trained on sequence data from individuals having different allele profiles as discussed herein. As used herein, the term “machine learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of data. For example, a machine learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine learning models include various types of decision trees, support vector machines, Bayesian networks, or neural networks. In some cases, the call-recalibration-machine-learning model is a series of gradient boosted decision trees, while in other cases the call-recalibration-machine-learning model is a random forest model, a multilayer perceptron, a linear regression, or a logistic regression.
[0057] The disclosed machine learning model improvement is at least in part based on model training using a particular set of sequence data likely to be enriched for two-copy alleles and without requiring long-read sequence data to confirm a presence of the two-copy alleles. The training data included sequence data from individuals identified as likely having an SMN1 copy number of 3 based on copy number variant calling techniques as disclosed herein or previously described. As shown in the schematic illustration of FIG. 8, individuals with an SMN1 copy number of 3 are most likely to have a two-copy allele on one chromosome and a singleton allele on the other chromosome, rather than the much less likely possibility of all three copies being on a same chromosome. This is because, based on population estimates, a 3+0 configuration, with three SMN1 copies on one chromosome and zero SMN1 copies on the other, is rare. Thus, identification of a subject with an SMN1 copy number of 3 can be used as a proxy for identification of a two-copy allele. SMN1 alignments from these copy number 3 samples were used to train the machine learning model classifier as the two-copy allele group.
[0058] These alignments were alignments of the sequence reads to a reference genome as discussed herein and, for certain data sets, included differentiating sites with different nucleotides at corresponding positions in the aligned sequence read versus the reference genome. In one example, the input data of alignments was analyzed by receiving all potentially informative positions with nucleotide sequence differences in the alignments as features for the model and training the weights of the model to be applied to the sample of interest. Alignments from individuals identified as having an SMN1 copy number of 2 were used as the group not likely to have the two-copy allele. However, the SMN1 copy number 2 group would include some subjects with the two-copy allele. An estimate for the frequency of the two-copy allele presence in this group may be roughly 10 samples in 1kG.
[0059] Variants present only in the SMN1 copy number 3 group, and not present in the SMN1 copy number 2 group, were used as candidate two-copy allele markers. Given that the two-copy allele is far more likely to be present in the SMN1 copy number 3 group than in the SMN1 copy number 2 group, variants that are much more likely to occur in the SMN1 copy number 3 group than in the SMN1 copy number 2 group were used as candidate two-copy SMN1 allele markers.
[0060] FIG. 9 shows different identified variants strongly associated with the SMN1 copy number 3 group. Further, many of these variants had not been previously identified by other SMN callers.
[0061] FIG. 10 is a flow diagram of a two-copy allele classifier method 200 using a trained model. The method 200 may be implemented by the disclosed systems and/or devices discussed herein. In an embodiment, the method may be at least in part carried out by hardware elements discussed with respect to FIG. 15. It should be understood that the method 200 may include intervening steps, and that certain steps of the method 200 may be combined or exchanged. The method 200 may operate on sequence data generated from a sample of interest. Generation of the sequence data, storing of the generated sequence data to memory, communicating the stored sequence data, and/or accessing the stored sequence data may, in embodiments, be encompassed by embodiments of the method 200.
[0062] The method 200 operates on sequence reads that are from a first group and a second group. The first group represents a pool of sequence data from individuals identified as SMN1 copy number 3. The second group represents a pool of sequence data from individuals identified as SMN1 copy number 2. In an embodiment, the sequence data is sequence reads from the first group and second group aligned to the SMN1 and SMN2 reference sequences. The sequence reads from each group are aligned to SMN1 at block 202, and each group’s respective SMN1 alignments are used to identify variants present in the first group alignments and not the second group alignments at block 204. These variants are used as model features for a machine learning model that is trained on the data from the first group and the second group at block 206. Once trained, the trained model can be used to classify a new or uncharacterized sample as either having the two-copy allele or not having the two-copy allele at block 208. After training, a subset of variants that are informative for model prediction can be selected by investigating the model weights (for example, a variant that has no weight is not informative for prediction). Those selected variants can be used as final machine learning features to predict whether a new sample has an SMN1 two-copy allele.
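The training and weight-based feature selection of blocks 206 and 208 can be sketched with a small logistic regression fitted by plain gradient descent. This is an illustrative toy under stated assumptions, not the disclosed model: the variant names, the six samples, the learning rate, and the zero-weight tolerance are all hypothetical.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / len(X)
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
    return w, b

def informative_features(names, weights, tol=1e-6):
    """Keep only features whose trained weight is not (numerically) zero."""
    return [name for name, wt in zip(names, weights) if abs(wt) > tol]

# Hypothetical variant features: presence (1) or absence (0) per sample.
names = ["varA", "varB", "varC"]
X = [[1, 0, 1], [1, 0, 0], [1, 0, 1],   # first group (SMN1 copy number 3), label 1
     [0, 0, 1], [0, 0, 0], [0, 0, 1]]   # second group (SMN1 copy number 2), label 0
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
selected = informative_features(names, weights)
```

In this toy data `varB` is never observed, so its weight stays at zero and it is dropped, while `varA`, seen only in the first group, receives a positive weight.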
[0063] In an example, SMN1 copy number 3 group alignments and SMN1 copy number 2 group alignments were provided as input to a variant caller, Freebayes, with background ploidy based on the total copy number of SMN1+SMN2 (which can be extracted from the SMN1/2 copy number determination as provided herein) to account for the presence of SMN1-derived and SMN2-derived reads in the alignments. For the output of the variant caller, higher quality variants may be selected, such as variants with at least 0.05 allele frequency to remove variants that have minor allele frequency in the population. The variant call format files (VCF, a text file format), or variant caller outputs, for copy number 3 (704) and, separately, copy number 2 (2498) were merged. Potentially informative variant positions were used as features for a machine learning model, which was trained using the groups of FIG. 8 to determine feature weights for new or uncharacterized samples. The selection of a copy number 3 group and a copy number 2 group data to identify variants that were in turn used as model features to train a machine learning model exploits a natural phenomenon (individuals with a three-copy allele) and uses this natural phenomenon as a novel basis for feature selection of a machine learning model. A logistic regression model was used and trained with 50000 samples to identify additional features.
[0064] A workflow may include using the whole genome Binary Alignment Map (BAM) file as an input and, for every variant in the merged VCF of the BAM file, selecting those that pass the threshold. The machine learning model uses model fitting to the training data to get final weights for the features. Each feature (variant site) has a particular weight. In an embodiment, the machine learning model has a dynamic configuration that deletes zero-weight variants. Each feature may be a SNP or INDEL (multiple nucleotides). In an embodiment, 90% of the data is used for training, and 10% is reserved to run. To train the final model, the posterior probability threshold for the variants is increased, and only the features that have positive weights are used. The two-copy allele variants are identified in the alignments to make a preliminary call. Initially, a lower or less stringent threshold is used, such that a variant with a posterior probability of greater than 0.3 is considered to have a high association with the presence of a two-copy allele. In the final trained model, the posterior probability threshold may be increased, e.g., to 0.5 or greater. In an embodiment, for each variant, the posterior probability of copy number 3 (CN3) given that the variant is present is calculated:
P(CN3 | var) = (P(var | CN3) * P(CN3)) / (P(var | CN3) * P(CN3) + P(var | CN2) * P(CN2))
Variants that pass a threshold (posterior_prob >= 0.3) indicate reasonable likelihood that the SMN1 copy number is 3 if a particular variant is present.
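As a worked illustration of the posterior probability above, the following sketch estimates each term from sample counts. Estimating P(CN3) and P(CN2) from the relative group sizes is an assumption of this sketch, and the counts are hypothetical values chosen only to show the arithmetic.

```python
def posterior_cn3_given_variant(n_var_cn3, n_cn3, n_var_cn2, n_cn2):
    """P(CN3 | var) by Bayes' rule, with all terms estimated from sample counts."""
    p_cn3 = n_cn3 / (n_cn3 + n_cn2)   # prior from group sizes (assumption)
    p_cn2 = 1.0 - p_cn3
    p_var_cn3 = n_var_cn3 / n_cn3     # P(var | CN3)
    p_var_cn2 = n_var_cn2 / n_cn2     # P(var | CN2)
    numerator = p_var_cn3 * p_cn3
    denominator = numerator + p_var_cn2 * p_cn2
    return numerator / denominator if denominator > 0 else 0.0

def passes_threshold(posterior_prob, threshold=0.3):
    """Apply the initial, less stringent posterior probability threshold."""
    return posterior_prob >= threshold

# Hypothetical counts: 100 CN3 samples (30 carry the variant), 400 CN2 samples (20 carry it).
enriched = posterior_cn3_given_variant(30, 100, 20, 400)
# Hypothetical counts for a variant that is more common in the CN2 group.
depleted = posterior_cn3_given_variant(5, 100, 40, 400)
```

With these counts the first variant has a posterior of 0.6 and passes both the 0.3 and 0.5 thresholds, while the second has a posterior of about 0.11 and is filtered out.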
[0065] FIG. 11 shows an example workflow 200 for two-copy allele detection and copy number determination using a machine learning model. The workflow 200 may incorporate two-copy allele detection into existing SMN1/2 copy number calling workflows 202 as discussed herein to generate SMN1 copy number calls 204. In the illustrated workflow, the alignments 206 or sequence reads aligned to SMN1 and/or SMN2 reference sequences are used as inputs to both the SMN1/2 copy number calling 204 and the machine learning model pathway. The alignments 206 may be from an uncharacterized sample. The alignments 206
are used to determine a copy number 210 for each individual variant feature, e.g., the copy number is based on a number of reads supporting each nucleotide variant 208. For example, a single alternative for the variant is considered (G to T only, for example, not G to A). The copy number 210 for each variant may be used to generate vectors that are provided as input to the machine learning model 212, which, based on the input, generates a classifier output and final calls 216 related to a presence of a two-copy allele, a disease and/or carrier status if applicable. As discussed, the variants may be a predetermined set of positions in the reference sequence that are associated with variants identified as discussed with respect to FIG. 10.
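The step of turning per-variant read support into a feature vector and scoring it can be sketched as follows. The rounding-based copy number estimate, the position labels, the weights, and the bias are hypothetical placeholders for illustration; a trained model would supply its own weights.

```python
import math

def variant_copy_number(supporting_reads, depth, total_cn):
    """Rough copy number of a variant: supporting-read fraction scaled by background ploidy."""
    if depth == 0:
        return 0
    return round(total_cn * supporting_reads / depth)

def feature_vector(pileup, positions, total_cn):
    """Per-variant copy numbers in a fixed position order; unobserved positions count as 0."""
    vector = []
    for pos in positions:
        reads, depth = pileup.get(pos, (0, 0))
        vector.append(variant_copy_number(reads, depth, total_cn))
    return vector

def predict_two_copy_allele(vector, weights, bias, threshold=0.5):
    """Logistic score for the feature vector; True means a two-copy allele is called."""
    z = bias + sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-z)) >= threshold

# Hypothetical predetermined positions and read support: (supporting reads, depth).
positions = ["pos_a", "pos_b", "pos_c"]
pileup = {"pos_a": (10, 40), "pos_b": (0, 38)}
vec = feature_vector(pileup, positions, total_cn=4)
call = predict_two_copy_allele(vec, weights=[2.0, 1.0, 1.0], bias=-1.0)
```

Keeping the position order fixed ensures that each element of the vector lines up with the weight learned for that variant.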
[0066] It should be understood that features of the S1-8 and S1-9d detection may be incorporated into the workflow of FIG. 11. For example, one or both of S1-8 or S1-9d variant detection may be performed as a separate analysis pathway. The two-copy allele detection technique may arbitrate between results provided from these pathways, or may base a decision on agreement between the S1-8 or S1-9d variant detection pathway and the machine learning model output.
[0067] FIGS. 12-14 show results from evaluating the concordance of the machine-learning based and haplogroup-based two-SMN1-copy-allele detection results. “3+” represents cases in which the sample has at least 3 SMN1 copies and at least one two-SMN1-copy-allele. “2-” represents cases in which the sample has 2 or fewer SMN1 copies and no two-SMN1-copy-allele. FIG. 12 shows that the multivariate caller using the trained model outperforms other methods for two-copy allele detection. The multivariate caller (logistic regression or XGBoost) outperformed techniques using identification of the g.27134T>G single nucleotide variant as well as the haplogroup caller. FIG. 13 shows that the multivariate caller using the trained model achieves high precision and improves recall. FIG. 14 shows that the multivariate caller using the trained model captures the information in the haplogroup caller. The multivariate two-copy allele caller shows substantial improvements in precision (0.95) and recall (0.74) over the currently used g.27134T>G variant. The results show that the machine learning model can cover 343/344 of the two-SMN1-copy-allele cases detected by the haplogroup-based calling approach, whereas the haplogroup-based calling approach can cover only 343/553 of the two-SMN1-copy-allele cases detected by the machine-learning based approach. This indicates that the machine-learning based approach covers nearly all positive cases that can be detected by the haplogroup-based approach, and can cover many positive cases that the haplogroup-based approach cannot.
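The overlap figures reported above can be expressed as fractions with a few lines of arithmetic. The helper name is hypothetical; the counts are those given in the text.

```python
def overlap_fraction(shared_calls, total_calls):
    """Fraction of one caller's positive calls that the other caller also makes."""
    return shared_calls / total_calls

# Haplogroup-caller positives also found by the machine learning model.
ml_covers_haplogroup = overlap_fraction(343, 344)
# Machine-learning positives also found by the haplogroup caller.
haplogroup_covers_ml = overlap_fraction(343, 553)

print(f"{ml_covers_haplogroup:.1%}")   # 99.7%
print(f"{haplogroup_covers_ml:.1%}")   # 62.0%
```

The asymmetry between the two fractions is what supports the statement that the machine-learning based approach captures nearly all haplogroup-based calls while adding many others.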
[0068] In some embodiments, the system comprises non-transitory computer-readable memory configured to store executable instructions. The non-transitory memory can be configured to store a reference sequence and sequence data of a sample. The system can comprise a hardware processor in communication with the non-transitory memory. The hardware processor can be programmed by the executable instructions to perform receiving a plurality of sequence reads generated from a sample obtained from a subject. The hardware processor can be programmed by the executable instructions to perform aligning the plurality of sequence reads to a reference sequence to generate alignments that include a plurality of aligned sequence reads aligned to the reference sequence. The hardware processor can be programmed by the executable instructions to perform determining a number of copies of each region of the plurality of regions based on a number of the sequence reads aligned to the region and using the techniques disclosed herein.
[0069] The various illustrative algorithms described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a processing unit or processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can include electrical circuitry configured to process computer-executable instructions. In another embodiment, a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor can also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of
microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry. A computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
[0070] The hardware components may be implemented in a device or devices of the system. FIG. 15 is a schematic diagram of a sequencing device 500 that may be used in conjunction with the disclosed embodiments for acquiring or generating sequence data of identification sequences and/or index sequences as generally discussed herein. The sequencing device 500 may be implemented according to any sequencing technique, such as those incorporating sequencing-by-synthesis methods described in U.S. Patent Publication Nos. 2007/0166705; 2006/0188901; 2006/0240439; 2006/0281109; 2005/0100900; U.S. Pat. No. 7,057,026; WO 05/065814; WO 06/064199; WO 07/010,251, the disclosures of which are incorporated herein by reference in their entireties. Alternatively, sequencing by ligation techniques may be used in the sequencing device 500. Such techniques use DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides and are described in U.S. Pat. No. 6,969,488; U.S. Pat. No. 6,172,218; and U.S. Pat. No. 6,306,597; the disclosures of which are incorporated herein by reference in their entireties. Some embodiments can utilize nanopore sequencing, whereby target nucleic acid strands, or nucleotides exonucleolytically removed from target nucleic acids, pass through a nanopore. As the target nucleic acids or nucleotides pass through the nanopore, each type of base can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Patent No. 7,001,792; Soni & Meller, Clin. Chem. 53, 1996-2001 (2007); Healy, Nanomed 2, 459-481 (2007); and Cockroft, et al. J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Yet other embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on
detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference in its entirety. Particular embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides, or with zero-mode waveguides as described, for example, in Levene et al. Science 299, 682-686 (2003); Lundquist et al. Opt. Lett. 33, 1026-1028 (2008); Korlach et al. Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties. Other suitable alternative techniques include, for example, fluorescent in situ sequencing (FISSEQ), and Massively Parallel Signature Sequencing (MPSS). In particular embodiments, the sequencing device 500 may be a HiSeq, MiSeq, or HiScanSQ from Illumina (La Jolla, CA). In other embodiments, the sequencing device 500 may be configured to operate using a CMOS sensor with nanowells fabricated over photodiodes such that DNA deposition is aligned one-to-one with each photodiode.
[0071] The sequencing device 500 may be a “one-channel” detection device, in which only two of four nucleotides are labeled and detectable for any given image. For example, thymine may have a permanent fluorescent label, while adenine uses the same fluorescent label in a detachable form. Guanine may be permanently dark, and cytosine may be initially dark but capable of having a label added during the cycle. Accordingly, each cycle may involve an initial image and a second image in which dye is cleaved from any adenines and added to any cytosines such that only thymine and adenine are detectable in the initial image but only thymine and cytosine are detectable in the second image. Any base that is dark through both images is guanine, and any base that is detectable through both images is thymine. A base that is detectable in the first image but not the second is adenine, and a base that is not detectable in the first image but detectable in the second image is cytosine. By combining the information from the initial image and the second image, all four bases are able to be discriminated using one channel.
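The two-image decoding described above amounts to a four-entry lookup table. The sketch below is illustrative only; the function names and the boolean signal representation are assumptions, not part of the sequencing device's software.

```python
# (detectable in initial image, detectable in second image) -> base
ONE_CHANNEL_TABLE = {
    (True, True): "T",    # permanently labeled
    (True, False): "A",   # label cleaved before the second image
    (False, True): "C",   # label added before the second image
    (False, False): "G",  # permanently dark
}

def call_base(initial_on, second_on):
    """Decode one sequencing cycle from its two image signals."""
    return ONE_CHANNEL_TABLE[(initial_on, second_on)]

def call_read(cycles):
    """Decode a read from a list of (initial_on, second_on) signal pairs."""
    return "".join(call_base(first, second) for first, second in cycles)

read = call_read([(True, True), (True, False), (False, True), (False, False)])
```

Here the four example cycles decode to the read `TACG`.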
[0072] In the depicted embodiment, the sequencing device 500 includes a separate sample processing device 502 (e.g., for sequencing library preparation) and an associated computer 504. However, as noted, these may be implemented as a single device. Further, the associated computer 504 may be local to or networked or otherwise in communication with the sample processing device 502. In the depicted embodiment, the biological sample may be loaded into the sample processing device 502 on a sample substrate 510, e.g., a flow cell or slide, that is imaged to generate sequence data. For example, reagents that interact with the biological sample fluoresce at particular wavelengths in response to an excitation beam generated by an imager 512 and thereby return radiation for imaging. For instance, the fluorescent components may be generated by fluorescently tagged nucleic acids that hybridize to complementary molecules of the components or to fluorescently tagged nucleotides that are incorporated into an oligonucleotide using a polymerase. As will be appreciated by those skilled in the art, the wavelength at which the dyes of the sample are excited and the wavelength at which they fluoresce will depend upon the absorption and emission spectra of the specific dyes. Such returned radiation may propagate back through the directing optics. This retrobeam may generally be directed toward detection optics of the imager 512.
[0073] The imager detection optics may be based upon any suitable technology, and may be, for example, a charged coupled device (CCD) sensor that generates pixilated image data based upon photons impacting locations in the device. However, it will be understood that any of a variety of other detectors may also be used including, but not limited to, a detector array configured for time delay integration (TDI) operation, a complementary metal oxide semiconductor (CMOS) detector, an avalanche photodiode (APD) detector, a Geiger-mode photon counter, or any other suitable detector. TDI mode detection can be coupled with line scanning as described in U.S. Patent No. 7,329,860, which is incorporated herein by reference. Other useful detectors are described, for example, in the references provided previously herein in the context of various nucleic acid sequencing methodologies.
[0074] The imager 512 may be under processor control, e.g., via a processor 514, and the sample processing device 502 may also include I/O controls 516, an internal bus 518, non-volatile memory 520, RAM 522 and any other memory structure such that the memory is capable of storing executable instructions, and other suitable hardware components. Further, the associated computer 504 may also include a processor 524, I/O controls 526, communications circuitry 527, and a memory architecture including RAM 528 and non-volatile memory 530, such that the memory architecture is capable of storing executable instructions 532. The hardware components may be linked by an internal bus, which may also link to the display 534. In embodiments in which the sequencing device 500 is implemented as an all-in-one device, certain redundant hardware elements may be eliminated.
[0075] The processor 514, 524 may be programmed to assign individual sequence reads to a sample based on the associated index sequence or sequences according to the techniques provided herein. In particular embodiments, based on the image data acquired by the imager 512, the sequencing device 500 may be configured to generate sequence data that includes base calls (nucleotide identities) for each base of a sequence read. Further, based on the image data, even for sequence reads that are performed in series, the individual reads may be linked to the same location via the image data and, therefore, to the same template strand. In this manner, index sequence reads may be associated with a sequence read of an insert sequence before being assigned to a sample of origin. The processor 514, 524 may also be programmed to perform downstream analysis on the sequences corresponding to the inserts for a particular sample subsequent to assignment of sequence reads to the sample.
[0076] In certain embodiments, the I/O controls 516, 526 may be configured to receive user inputs that automatically select sequencing parameters. For example, the user inputs may permit selection of bin sizes for alignments. In an embodiment, the bin size selection may apply in operating modes outside of the disclosed joint depth distribution techniques that rely on variable bin sizes or borders based on a presence of differentiating sites between the regions of interest. Therefore, a user input of bin size may be overridden or not applied in similar regions having differentiating sites.
[0077] In an embodiment, notifications related to CNV calling are provided on the display 534 or communicated via the communications circuitry 527 to a remote device or a cloud server.
[0078] The term copy number variation, copy number variant, or CNV may refer to variation in the number of copies of a nucleic acid sequence present in a test sample in comparison with the copy number of the nucleic acid sequence present in a reference sample. In certain embodiments, the nucleic acid sequence is 1 kb or larger. In some cases, the nucleic acid sequence is a whole chromosome or significant portion thereof. A copy number variant may refer to a sequence in which copy-number differences are found by comparison of a nucleic acid sequence of interest in a test sample with an expected level of the nucleic acid sequence of interest. For example, the level of the nucleic acid sequence of interest in the test sample is compared to that present in a qualified sample. Copy number variants/variations include deletions, including microdeletions, insertions, including microinsertions, duplications, multiplications, and translocations. CNVs encompass chromosomal aneuploidies and partial aneuploidies.
[0079] A CNV call may be based on a variation from a threshold. A threshold or threshold value may refer to a number that is used as a cutoff to characterize a sample such as a test sample containing a nucleic acid from an organism suspected of having a medical condition. The threshold may be compared to a parameter value to determine whether a sample giving rise to such parameter value suggests that the organism has the medical condition. In certain embodiments, a qualified threshold value is calculated using a qualifying data set and serves as a limit of diagnosis of an SNV or CNV. If a threshold is exceeded by results obtained from methods disclosed herein, a subject can be diagnosed with an SNV or CNV.
[0080] A sequencing coverage refers to a percentage of bases in a reference sequence covered by the mapped or aligned sequence reads. A set of sequence reads are mapped to a reference genome at the various genomic regions. The total percentage of target bases within the reference genome to which sequenced reads are mapped is quantified as the coverage of the genome. The average depth of sequencing coverage is the ratio of the number of reads, e.g., scaled by read length, to the total referenced genome length. The read depth may be normalized to a mean depth across a genome. Coverage may be determined using the Lander/Waterman equation:
$$C = \frac{LN}{G}$$
C stands for coverage
G is the haploid genome length
L is the read length
N is the number of reads
A Poisson distribution can be used to model any discrete occurrence given an average number of occurrences. The probability function is the following:
$$P(Y = y) = \frac{C^{y} e^{-C}}{y!}$$
y is the number of times a base is read
C stands for coverage
The Poisson distribution is used to compute the probability of a base being sequenced a certain number of times. SNP callers may require at least four calls at a base position to call SNPs.
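The two formulas above can be illustrated with a short Python sketch. It computes the Lander/Waterman coverage, the Poisson probability that a base is read exactly y times, and the probability that a base is read at least k times (for example, the four calls a SNP caller may require). The function names and the read length, read count, and genome length are illustrative assumptions rather than values from the disclosure.

```python
import math


def coverage(read_length, num_reads, genome_length):
    """Lander/Waterman coverage C = LN / G."""
    return read_length * num_reads / genome_length


def poisson_pmf(y, c):
    """Probability that a base is read exactly y times at coverage c."""
    return (c ** y) * math.exp(-c) / math.factorial(y)


def prob_at_least(k, c):
    """Probability that a base is read at least k times at coverage c."""
    return 1.0 - sum(poisson_pmf(y, c) for y in range(k))


# Illustrative numbers only: 600 million reads of 150 bases over a
# haploid genome of 3 billion bases.
c = coverage(read_length=150, num_reads=600_000_000, genome_length=3_000_000_000)
p_callable = prob_at_least(4, c)
```

At the assumed depth, almost every base is expected to reach the four-call minimum, whereas the same calculation at low coverage leaves a visible fraction of positions uncallable.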
[0081] A sequence read may refer to a sequence obtained from a fragment of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in A, T, C, or G) of the sample portion. It may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, a sequence read is a contiguous DNA sequence of sufficient length (e.g., at least about 25 bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome or genomic region or gene. In some embodiments, the plurality of sequence reads comprises sequence reads that are about 100 base pairs to about 1000 base pairs in length each. The plurality of sequence reads can comprise paired-end sequence reads and/or single-end sequence reads. The plurality of sequence reads may be generated by whole genome sequencing (WGS), such as clinical WGS (cWGS). In some embodiments, the sequence reads may be generated by targeted sequencing techniques. Sequence data may include at least a thousand or at least a million individual sequence reads.
[0082] An alignment may refer to comparing a sequence read to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read sequence, the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. Aligned reads or tags are one or more sequences that are identified as a match in terms of the order of their nucleic acid molecules to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, as it would be impossible to align reads manually in a reasonable time period for implementing the methods disclosed herein. Non-limiting examples of alignment methods include global alignments (such as Needleman-Wunsch algorithm), local alignments, dynamic programming (such as Smith-Waterman algorithm), heuristic algorithms or probabilistic methods, progressive methods, iterative methods, motif finding or profile analysis, genetic algorithms, simulated annealing, pairwise alignments, and multiple sequence alignments. In some cases, the alignment may generate an alignment score. The alignment method may include Burrows-Wheeler Aligner (BWA), iSAAC, BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, CASHX, Cloudburst, CUDA-EC, CUSHAW, CUSHAW2, CUSHAW2-GPU, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP and GSNAP, Geneious Assembler, LAST, MAQ, mrFAST and mrsFAST, MOM, MOSAIK, MPscan, Novoalign & NovoalignCS, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PRIMEX, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RT Investigator, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SOAP, SOAP2, SOAP3 and SOAP3-dp, SOCS, SSAHA and SSAHA2, Stampy, SToRM, Subread and Subjunc, Taipan, UGENE, VelociMapper, XpressAlign, and/or ZOOM.
[0083] An alignment score is a score indicating a similarity of two sequences determined using an alignment method. In some implementations, an alignment score accounts for number of edits (e.g., deletions, insertions, and substitutions of characters in the string). In some implementations, an alignment score accounts for a number of matches. In some implementations, an alignment score accounts for both the number of matches and a number of edits. In some implementations, the number of matches and edits are equally weighted for the alignment score. As provided herein, highly similar regions may have alignment scores above a threshold, e.g., above 90% similarity. In one embodiment, a threshold percent identity or identity score (e.g., above 85% identity, above 90% identity based on alignment) is used to select two regions as being similar and/or as having corresponding bins. In one embodiment, two highly similar regions may be similar over a minimum length (e.g., at least 500 or at least 1000 bases).
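One way to make the alignment score description concrete is the following Python sketch, which counts matches and edits between two sequences that have already been aligned (with "-" marking gaps) and derives a percent identity that can be compared to a similarity threshold. The function names, the "matches minus edits" form of equal weighting, and the example sequences are assumptions chosen for illustration; the disclosure does not prescribe this particular formula.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Count matches and edits between two pre-aligned sequences.

    An edit is any column holding a substitution, an insertion, or a
    deletion. Columns in which both sequences carry a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if not (a == gap and b == gap)]
    matches = sum(1 for a, b in pairs if a == b)
    edits = len(pairs) - matches
    identity = matches / len(pairs) if pairs else 0.0
    # Equal weighting of matches and edits, expressed here (as an
    # assumption) as +1 per match and -1 per edit.
    return {"matches": matches, "edits": edits,
            "score": matches - edits, "identity": identity}


def is_highly_similar(identity, threshold=0.90):
    """Apply a percent-identity cutoff such as the 90% example above."""
    return identity > threshold


# Hypothetical aligned pair with one gap and one substitution.
stats = alignment_stats("ACGT-ACGTA", "ACGTTACGAA")
similar = is_highly_similar(stats["identity"])
```

In practice the identity would be computed over a minimum length (e.g., at least 500 bases) before two regions are treated as similar; the ten-base example is only meant to show the arithmetic.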
[0084] The output from the alignment module is a Binary Alignment Map (BAM, e.g., binary version of a Sequence Alignment Map (SAM)) file along with a mapping quality score (MAPQ), which quality score reflects the confidence that the predicted and aligned location of the read to the reference is actually where the read is derived. BAM files contain a header section and an alignment section. The header contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section. The alignments contain read name, read sequence, read quality, alignment information, and custom tags. The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor string. The alignments section includes one or more of Read group, which indicates the number of reads for a specific sample, Barcode tag, which indicates the demultiplexed sample ID associated with the read, Single-end alignment quality, Paired-end alignment quality, Edit distance tag, which records the Levenshtein distance between the read and the reference, and Amplicon name tag, which records the amplicon tile ID associated with the read.
[0085] Output from variant calling is a Variant Call Format (VCF) file. The VCF file header includes the VCF file format version and the variant caller version and lists the annotations used in the remainder of the file. The VCF header also includes the reference genome file and BAM file. The last line in the header contains the column headings for the data lines. Each of the VCF file data lines contains information about one variant. VCF files may include the following information:
The chromosome of the reference genome.
The single-base position of the variant in the reference chromosome. For SNVs, this position is the reference base with the variant. For indels, this position is the reference base immediately preceding the variant.
The rs number for the SNP obtained from dbSNP.txt, if applicable. If multiple rs numbers exist at this location, the list is delimited by semicolons. If a dbSNP entry does not exist at this position, a missing value marker ('.') is used.
The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T.
The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T.
A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is $10^{-Q/10}$. For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high in relation to the error rate observed.
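The fields listed above, and the conversion from a Phred-scaled quality score to an error probability, can be sketched in Python as follows. The function names and the example data line (including its chromosome and position) are hypothetical; only the tab-separated layout of the leading fixed columns, the semicolon-delimited identifiers, and the '.' missing-value marker follow the VCF conventions described in the text.

```python
def phred_to_error_probability(q):
    """Estimated error probability 10^(-Q/10) for a Phred-scaled score Q."""
    return 10 ** (-q / 10)


def parse_vcf_data_line(line):
    """Parse the first six tab-separated fields of one VCF data line."""
    chrom, pos, var_id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    return {
        "chrom": chrom,
        "pos": int(pos),
        # '.' marks a missing identifier; multiple rs numbers are
        # delimited by semicolons.
        "ids": [] if var_id == "." else var_id.split(";"),
        "ref": ref,
        # Multiple alternate alleles are comma-separated.
        "alts": alt.split(","),
        "qual": None if qual == "." else float(qual),
    }


# Hypothetical data line: an A to T single nucleotide variant with Q30.
line = "chr5\t12345\t.\tA\tT\t30\tPASS\t."
record = parse_vcf_data_line(line)
error_probability = phred_to_error_probability(record["qual"])
```

For the Q30 example the computed error probability is about 0.001, matching the 0.1% error rate stated above.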
[0086] A reference sequence refers to any particular known genome sequence, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject. In one example, the reference sequence is that of a full length human genome. Such sequences may be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome such as chromosome 13. In some embodiments, a reference Y chromosome is the Y chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species. In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.
[0087] A sample, target sample, or biological sample may refer to a sample typically derived from a biological fluid, cell, tissue, organ, or organism, comprising a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence that is to be screened for SNVs or CNVs. In certain embodiments the sample comprises at least one nucleic acid sequence whose copy number is suspected of having undergone variation. Such samples include, but are not limited to sputum/oral fluid, amniotic fluid, blood, a blood fraction, or fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, and the like. Although the sample is often taken from a human subject (e.g., patient), the assays can be used to detect SNVs or CNVs in samples from any mammal, including, but not limited to dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc. If such methods of pretreatment are employed with respect to the sample, such pretreatment methods are typically such that the nucleic acid(s) of interest remain in the test sample, sometimes at a concentration proportional to that in an untreated test sample (e.g., namely, a sample that is not subjected to any such pretreatment method(s)). Such “treated” or “processed” samples are still considered to be biological “test” samples with respect to the methods described herein. The sample can comprise cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof. The sample can be obtained directly from a subject. The sample can be generated from another sample obtained from a subject. The other sample can be obtained directly from the subject or the other sample can be generated from another sample obtained from the subject. The computing system can store the plurality of sequence reads in its memory.
[0088] While certain embodiments of the disclosure are discussed in the context of SMN1 copy number, it should be understood that other silent carrier or two-copy allele conditions may be identified by the present techniques. For example, alpha globin is coded for by two genes (α-globin genes, HBA1 and HBA2) on chromosome 16. Each person needs four functional HBA genes (two from each parent) to make enough α-globin for the body's hemoglobin to work normally. Different forms of α-thalassemia occur if one or more of these genes are defective. If one gene is defective, then a person is a “silent” carrier of the α-thalassemia trait and usually has no signs or symptoms. If two genes are defective, then a person has α-thalassemia trait (also called alpha thalassemia minor) and may have mild anemia. If three genes are defective, then a person has hemoglobin H disease. This can cause moderate to severe anemia. If all four genes are missing, then a person has α-thalassemia major (also called hemoglobin Bart's or hydrops fetalis). This is the most severe type of α-thalassemia. A fetus with this disorder will usually die in the womb or the baby will die soon after birth because the child is unable to make normal hemoglobin to carry oxygen throughout the body.
[0089] More than 90% of α-thalassemia results from the deletion of two or more copies of the α-globin genes (HBA1 and HBA2) on chromosome 16. The HBA1 and HBA2 genes are located within an ~30 kb α-globin gene cluster on chromosome 16, that includes the following alpha globin genes and (pseudogenes) from telomere to centromere in this order: HBZ, (HBZP1), HBM, (HBAP1), HBA2, HBA1, HBQ1 (see, e.g., FIG. 1). The coding sequences of HBA1 and HBA2 are identical with divergent sequences located in the introns and 5'- and 3'-untranslated regions. In addition, the deletion of the HS-40 major hypersensitive site, which is located 40 kb upstream of the HBZ gene in the promoter region, affects RNA expression of both HBA1 and HBA2, thereby causing an α-thalassemia trait in heterozygotes. The disclosed techniques may be used to distinguish between a normal genotype (αα/αα) and silent carrier (-α/αα) or minor trait (--/αα or -α/-α) genotypes based on copy number.
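The relationship between the number of functional alpha-globin gene copies and the categories described above can be summarized in a small Python sketch. The lookup table restates the text; the function name and the decision to reject copy numbers outside 0 to 4 are assumptions made for illustration. Copy number alone does not separate the two minor-trait arrangements (both copies missing from one chromosome versus one copy missing from each), which is why they share a single entry here.

```python
# Categories keyed by the number of functional HBA1/HBA2 copies,
# following the description in the text.
ALPHA_GLOBIN_STATUS = {
    4: "normal",
    3: "silent carrier",
    2: "alpha-thalassemia trait (minor)",
    1: "hemoglobin H disease",
    0: "alpha-thalassemia major",
}


def classify_alpha_globin(functional_copies):
    """Map a functional alpha-globin copy number to a category."""
    if functional_copies not in ALPHA_GLOBIN_STATUS:
        # Other copy numbers (e.g., gains) are outside this sketch.
        raise ValueError(f"unsupported copy number: {functional_copies}")
    return ALPHA_GLOBIN_STATUS[functional_copies]


status_three = classify_alpha_globin(3)
status_two = classify_alpha_globin(2)
```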
[0090] In some embodiments, two-copy alleles may be identified for one or more of GBA, CYP21A2, ABCC6, ABCD1, ACTB, ACTG1, ACTN4, ADAMTSL2, ADIPOR1, AFG3L2, AGK, ALG1, ALMS1, ANKRD11, ANOS1, AP4S1, ARMC4, ARSE, ASNS, ATAD3A, B3GAT3, BCAP31, BDP1, BMPR1A, BRAF, BRCA1, C2, CACNA1C, CALM1, CD46, CEP290, CFH, CFH, CFH, CHEK2, CISD2, CLCNKA, CLCNKB, CORO1A, COX10, CP, CRYBB2, CSF2RA, CUBN, CUBN, CYCS, CYP11B1, CYP21A2, DCLRE1C, DHFR, DICER1, DIS3L2, DNAH11, DNAH11, DNM1, DSE, DUOX2, EGLN1, ELK1, ELMO2, ERCC6, ESPN, EYS, F8, FANCD2, FANCD2, FAR1, FHL1, FLG, FLNC, FOXD4, FXN, GBA, GH1, GJA1, GK, GLUD1, GLUD1, GOSR2, GUSB, HBA1, HBA2, HNRNPA1, HPS1, HSPD1, HYDIN, IDS, IFT122, IGLL1, KANSL1, KCTD1, KIF1C, KRAS, KRT14, KRT16, KRT17, KRT6A, KRT6B, KRT6C, LEFTY2, LRP5, LRP5, MAT2A, MID1, MOCS1, MSN, MSX2, MYO5B, NCF1, NEB, NECAP1, NEFH, NF1, NF1, NF1, NOTCH2, NXF5, OCLN, OTOA, PARN, PBX1, PIGA, PIGN, PIK3CA, PIK3CD, PKD1, PKP2, PMS2, PMS2, PMS2, PNPT1, POLH, PRODH, PRODH, PROS1, PRPS1, PRSS1, PTEN, RAD21, RBM8A, RBPJ, RDX, RMND1, RNF216, RNF216, RPL15, SALL1, SBDS, SDHA, SHOX, SLC25A15, SLC25A15, SLC33A1, SLC6A8, SMN1, SMN2, SOX2, SPTLC1, SRD5A3, SRP72, STAT5B, STRC, SYT14, TARDBP, TBL1XR1, TBX20, TIMM8A, TPM3, TPMT, TRAPPC2, TRIP11, TTN, TUBA1A, TUBB2A, TUBB2B, TUBB3, TUBB4A, TUBG1, TYR, UBA5, UBE3A, UNC93B1, USP18, VPS35, VWF, WRN, XIAP, ZEB2, or ZNF341.
[0091] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. For example, “a processor” can include distributed processing between multiple processors.
[0092] As will be understood by one skilled in the art, for any and all purposes, such as in terms of providing a written description, all ranges disclosed herein also encompass any and all possible sub-ranges and combinations of sub-ranges thereof. Any listed range can be easily recognized as sufficiently describing and enabling the same range being broken down into at least equal halves, thirds, quarters, fifths, tenths, etc. As a non-limiting example, each range discussed herein can be readily broken down into a lower third, middle third and upper third, etc. As will also be understood by one skilled in the art all language such as “up to,” “at least,” “greater than,” “less than,” and the like include the number recited and refer to ranges which can be subsequently broken down into sub-ranges as discussed above. Finally, as will be understood by one skilled in the art, a range includes each individual member. Thus, for example, a group having 1-3 articles refers to groups having 1, 2, or 3 articles. Similarly, a group having 1-5 articles refers to groups having 1, 2, 3, 4, or 5 articles, and so forth.
[0093] This written description uses examples to enable any person skilled in the art to practice the disclosed embodiments, including making and using any devices or systems and performing any incorporated methods. The patentable scope is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal languages of the claims.
Claims
1. A method for identifying a two-copy allele of survival of motor neuron 1 (SMN1) gene comprising: receiving sequence data of a subject comprising nucleotide identities for a plurality of sequence reads, the plurality of sequence reads comprising at least ten thousand sequence reads; performing an alignment to a reference sequence for the plurality of sequence reads, wherein the reference sequence comprises an SMN1 reference sequence and an SMN2 reference sequence; identifying a subset of sequence reads of the plurality of sequence reads that are aligned to one or both of the SMN1 reference sequence or the SMN2 reference sequence; determining a normalized number of the sequence reads in the subset; identifying sequence differences in the subset, the sequence differences comprising first sequence differences relative to the SMN1 reference sequence for the sequence reads and second sequence differences relative to the SMN2 reference sequence, wherein the first sequence differences and the second sequence differences are distinguishing between SMN1 and SMN2; determining an SMN1 copy number based on the identified sequence differences in the subset and the normalized number; identifying a two-copy allele-associated S1-8 and S1-9d haplogroup in the sequence differences; and classifying the subject as carrying a two-copy allele of SMN1 based on an estimated SMN1 copy number of two copies and a presence of two-copy allele-associated S1-8 and S1-9d haplogroup in the sequence differences.
2. The method of claim 1, wherein the SMN1 reference sequence is SEQ ID NO: 1.
3. The method of claim 1, wherein the SMN2 reference sequence is SEQ ID NO: 2.
4. The method of claim 1, wherein the sequence differences are in exon 7 of SMN1 or SMN2.
5. The method of claim 1, wherein identification of the two-copy allele-associated S1-8 and S1-9d haplogroup comprises providing the sequence data to a trained model.
6. The method of claim 1, wherein the sequence data is short-read data such that individual sequence reads of the plurality of sequence reads are 1000 bases or less in length.
7. The method of claim 1, comprising classifying the subject as not carrying the two-copy allele based on an estimated SMN1 copy number of two copies and an absence of the S1-8 and S1-9d haplogroups in the sequence differences.
8. The method of claim 1, wherein identifying the two-copy allele-associated S1-8 and S1-9d haplogroup in the sequence differences comprises determining a copy number at each variant site of a plurality of variant sites.
9. The method of claim 1, wherein the S1-8 and S1-9d haplogroups comprise one or more of SEQ ID NO: 3, SEQ ID NO: 4, a G to T change at position 70931231 relative to SEQ ID NO: 1, a C to T change at position 70934752 relative to SEQ ID NO: 1, a T to C change at position 70935496 relative to SEQ ID NO: 1, a T to G change at position 70940327 relative to SEQ ID NO: 1, a C to T change at position 70954415 relative to SEQ ID NO: 1, a G to A change at position 70954423 relative to SEQ ID NO: 1; a CG to CA change at position 70954725 relative to SEQ ID NO: 1; a C to T change at position 70955672 relative to SEQ ID NO: 1; a GTTA to GTA change at position 70957303 relative to SEQ ID NO: 1; a G to A change at position 70957520 relative to SEQ ID NO: 1; a C to T change at position 70957914 relative to SEQ ID NO: 1; a G to C change at position 70921404 relative to SEQ ID NO: 1; and a G to C change at position 70923922 relative to SEQ ID NO: 1.
10. A multivariate computer-implemented method for identifying a two-copy allele of a gene of interest comprising: receiving onto a memory a set of weighting factors of a machine learning model related to a preselected set of positions in a reference sequence; receiving onto the memory whole genome sequence data of a subject comprising nucleotide identities for a plurality of sequence reads, the plurality of sequence reads comprising at least a million sequence reads; receiving onto the memory alignments based on aligning sequence data in the whole genome sequence data to the reference sequence to generate the alignments; receiving onto the memory sequence differences relative to the reference sequence in the alignments to a gene of interest; determining total copy number for the gene of interest based on the alignments; determining copy numbers for variant alleles at the preselected set of positions based on the sequence differences; and generating an output of the machine learning model using the total copy number, the weights and the copy numbers, wherein the output characterizes the subject as carrying a two-copy allele for the gene of interest.
11. The method of claim 10, wherein the machine learning model is trained on a first group of sequence data from individuals with an estimated three copies of the gene of interest and a second group of sequence data from individuals with an estimated two copies of the gene of interest.
12. The method of claim 11, wherein the preselected set of positions is generated based on variants relative to the reference sequence for the gene of interest present in the first group and not present in the second group.
13. The method of claim 10, wherein the gene of interest is a survival of motor neuron 1 (SMN1) gene.
14. The method of claim 10, wherein the set of weights are assigned to different copy numbers.
15. A system for identifying a two-copy allele of survival of motor neuron 1 (SMN1) gene comprising: a sequence device configured to generate whole genome sequence data of a subject comprising base call information for a plurality of sequence reads, the plurality of sequence reads comprising at least a million sequence reads; processing circuitry configured to receive the whole genome sequence data and to execute instructions to: align survival of motor neuron 1 (SMN1) sequence data and survival of motor neuron 2 (SMN2) sequence data in the whole genome sequence data to an SMN1 reference sequence (SEQ ID NO: 1) and an SMN2 reference sequence (SEQ ID NO:2) to generate alignments; determine that an SMN1 copy number of the whole genome sequence data is two copies using the alignments; use the alignments to generate variant site copy numbers for a plurality of variants; provide the variant site copy numbers to a machine learning model; and characterize the subject as carrying a two-copy allele of SMN1 based on an output of the machine learning model.
16. The system of claim 15, wherein the plurality of variants are relative to the SMN1 reference sequence.
17. The system of claim 15, wherein the variant site copy numbers are based on sequence reads in the alignments that support each individual variant site.
18. The system of claim 15, wherein the processing circuitry is configured to execute instructions to generate a vector based on each variant site copy number of the variant site copy numbers, and wherein each vector is provided as an input to the machine learning model.
19. A multivariate computer-implemented method for identifying a two-copy allele of a gene of interest comprising: receiving first group sequence data from a plurality of individuals having three copies of a survival of motor neuron 1 (SMN1) gene; receiving second group sequence data from a plurality of individuals having only two copies of the SMN1 gene; aligning the first group sequence data and the second group sequence data to an SMN1 reference sequence; identifying sequence variants relative to the SMN1 reference sequence in the first group sequence data that are not present in the second group sequence data; generating features for a machine learning model based on the identified sequence variants; and using the machine learning model to classify sequence data from a subject as having a two-copy allele of SMN1.
20. The method of claim 19, comprising training the machine learning model using the first group sequence data and the second group sequence data.
21. The method of claim 20, comprising determining weights for the features based on the training.
22. The method of claim 21, comprising eliminating features without positive weights.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/654,604 | 2024-05-31 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025250794A1 (en) | 2025-12-04 |