WO2025217057A1

WO2025217057A1 - Variant detection using improved sequence data alignments

Info

Publication number: WO2025217057A1
Application number: PCT/US2025/023480
Authority: WO
Inventors: Vitor Ferreira ONUCHIC; Christine Amalachukwu ANYANSI
Original assignee: Illumina Inc
Current assignee: Illumina Inc
Priority date: 2024-04-08
Filing date: 2025-04-07
Publication date: 2025-10-16
Anticipated expiration: 2026-10-08
Also published as: WO2025217057A9

Abstract

Copy number variant detection techniques are described that improve detection outcomes by using improved data alignment that preserves analysis and variant calling using sequence reads that may align to a first region or a second region of high similarity, such as a segmental duplication.

Description

VARIANT DETECTION USING IMPROVED SEQUENCE DATA ALIGNMENTS

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims the benefit of and priority to U.S. Provisional Application No. 63/631,303, titled “VARIANT DETECTION USING IMPROVED SEQUENCE DATA ALIGNMENTS,” filed April 8, 2024.

BACKGROUND

[0002] The disclosed technology relates generally to gene variant detection using previously excluded or discarded data. In particular, the technology disclosed relates to techniques for improving identification of variants, such as copy number variants, using sequence data corresponding to regions having high similarity.

[0003] Segmental duplications are hotspots for structural variants and gene recombinant variants (for example, gene conversion). Studies have noted a significant association between the location of segmental duplications and regions of chromosomal instability or evolutionary rearrangement. High sequence similarity in segmental duplications can lead to poor read alignments and exclusion of these regions in copy number variant determination. There is a need to informatically identify copy number variants of genes that lie within segmental duplications.

BRIEF DESCRIPTION

[0004] In one embodiment, the present disclosure provides a method for identifying copy number variants in genomic sequence data. The method includes receiving genomic sequence data from a sample of interest, the genomic sequence data comprising millions of base calls corresponding to nucleotides of the sample of interest and processing the genomic sequence data to identify sequence reads that align to one or both of a first region or a second region of a reference genome.
The method also includes determining a joint sequencing depth of corresponding bins of the first region and the second region using the identified sequence reads by combining identified sequence reads that align to corresponding bins of the first region or the second region, wherein each bin within the first region corresponds to one other bin in the second region, wherein the correspondence is based on sequence similarity, and such that the joint sequencing depth for an individual set of corresponding bins comprises identified sequence reads that are aligned to a first region bin and/or a corresponding second region bin of the individual set, and normalizing the joint sequencing depth for each set of corresponding bins. The method also includes using a ratio of a first count of sequence reads assigned to the first region to a second count of sequence reads assigned to the second region based on differentiating sites having different nucleotide identities between the first region and the second region to split the normalized joint sequencing depths of each set of corresponding bins between the first region bin and the corresponding second region bin to generate first region normalized depth values and second region normalized depth values; combining the first region normalized depth values and the second region normalized depth values with normalized depth values from other regions of the reference genome to generate a combined set of normalized depth values of the genomic sequence data; and identifying one or more copy number variants using the combined set of normalized depth values of the genomic sequence data.

[0005] In one embodiment, the present disclosure provides a system for identifying copy number variants in genomic sequence data.
The system includes a sequencing device configured to generate whole genome sequence data comprising a plurality of sequence reads obtained from a sample of a subject and non-transitory memory configured to store executable instructions, the whole genome sequence data, and a reference sequence. The system also includes a hardware processor in communication with the non-transitory memory. The hardware processor is programmed by the executable instructions to perform generating alignments of the whole genome sequence data to the reference sequence, wherein the alignments comprise sequence reads aligned to one or both of a first region or a second region, wherein the second region is a segmental duplication of the first region; generating normalization data for the whole genome sequence data, wherein the normalization data is based on variability in the sample and sample-dependent operations of the sequencing device; determining a joint sequencing depth of corresponding bins of the first region and the second region using the aligned sequence reads by combining identified sequence reads that align to the corresponding bins of the first region or the second region, wherein each bin within the first region corresponds to one other bin in the second region, wherein the correspondence is based on sequence similarity, and such that the joint sequencing depth for an individual set of corresponding bins comprises identified sequence reads that are aligned to one or both of a first region bin or a corresponding second region bin of the individual set; normalizing the joint sequencing depth for each set of corresponding bins using the normalization data; determining a ratio, for sequence reads aligned to each set of corresponding bins, based on a number of the sequence reads having a nucleotide difference at an individual differentiating site associated with the first region relative to a number of the sequence reads having the nucleotide difference at the individual differentiating
site associated with the second region; using the ratio to split the normalized joint sequencing depths of each set of corresponding bins between the first region and the second region to generate first region normalized depth values for each first region bin of the first region and second region normalized depth values for each second region bin of the second region; combining the first region normalized depth values and the second region normalized depth values with normalized depth values from other regions of the whole genome sequence data to generate a combined set of normalized depth values of the genomic sequence data; and identifying one or more copy number variants using the combined set of normalized depth values of the whole genome sequence data.

[0006] In one embodiment, the present disclosure provides non-transitory computer-readable memory storing executable instructions to perform receiving alignments of whole genome sequence data to a reference sequence, the whole genome sequence data comprising a plurality of sequence reads obtained from a sample of a subject, and wherein the alignments comprise sequence reads aligned to one or both of a first region or a second region, wherein the second region is a segmental duplication of the first region; determining a joint sequencing depth of corresponding bins of the first region and the second region using the aligned sequence reads by combining identified sequence reads that align to the corresponding bins of the first region or the second region, wherein each bin within the first region corresponds to one other bin in the second region, wherein the correspondence is based on sequence similarity, and such that the joint sequencing depth for an individual set of corresponding bins comprises identified sequence reads that are aligned to one or both of a first region bin or a corresponding second region bin; normalizing the joint sequencing depth for each set of corresponding bins; determining a ratio, for
sequence reads aligned to each set of corresponding bins, based on a number of the sequence reads having a nucleotide difference at an individual differentiating site associated with the first region relative to a number of the sequence reads having the nucleotide difference at the individual differentiating site associated with the second region; using the ratio to split the normalized joint sequencing depths of each set of corresponding bins between the first region and the second region to generate first region normalized depth values for each first region bin of the first region and second region normalized depth values for each second region bin of the second region; combining the first region normalized depth values and the second region normalized depth values with normalized depth values from other regions of the reference genome to generate a combined set of normalized depth values of the genomic sequence data; and identifying one or more copy number variants using the combined set of normalized depth values of the genomic sequence data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] These and other features, aspects, and advantages of the disclosed embodiments will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:

[0008] FIG. 1 is a schematic illustration of example exclusion parameters based on unique kmers;

[0009] FIG. 2 is a schematic illustration of potential for missed copy number variants in excluded regions;

[0010] FIG. 3 shows excluded sequence data for a region with a segmental duplication that is rescued using embodiments of the disclosed techniques;

[0011] FIG. 4 shows example read counts and excluded segmental duplication regions that are rescued using embodiments of the disclosed techniques;

[0012] FIG. 5 shows normalized example read counts and excluded segmental duplication regions that are rescued using embodiments of the disclosed techniques;

[0013] FIG. 6 is a schematic illustration of example reads that are ambiguously aligned when a segmental duplication exists vs. uniquely mapping regions with no or limited ambiguous alignment;

[0014] FIG. 7 is a schematic illustration of an example workflow step of creating bins based on differentiating sites between duplicated regions, according to an embodiment;

[0015] FIG. 8 is a schematic illustration of an example workflow step of determining joint bin depth, according to an embodiment;

[0016] FIG. 9 shows normalized joint bin depth of FIG. 8, according to an embodiment;

[0017] FIG. 10 shows normalized joint bin depth of FIG. 8, according to an embodiment;

[0018] FIG. 11 is a flow diagram of a method of normalizing sequence depth data for two regions, according to an embodiment;

[0019] FIG. 12 is a schematic illustration of rescued segments, according to an embodiment;

[0020] FIG. 13 is a schematic illustration of rescued segments showing normalized counts, according to an embodiment;

[0021] FIG. 14 shows CYP2A6 and CYP2A7 non-unique regions;

[0022] FIG. 15 is a schematic illustration of CYP2A6 structural variants;

[0023] FIG. 16 shows results of heterozygous deletion detection using the disclosed techniques;

[0024] FIG. 17 shows results of duplication detection using the disclosed techniques;

[0025] FIG. 18 shows results of homozygous deletion detection using the disclosed techniques;

[0026] FIG. 19 shows results of hybrid deletion detection using the disclosed techniques;

[0027] FIG. 20 shows results of STRC detection using the disclosed techniques;

[0028] FIG. 21 is an example improved workflow of a copy number variant caller, according to an embodiment;

[0029] FIG. 22 is an example improved workflow of a copy number variant caller, according to an embodiment; and

[0030] FIG. 23 is a block diagram of a sequence analysis device, according to an embodiment.

DETAILED DESCRIPTION

[0031] The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

[0032] Copy number variation (CNV) refers to a circumstance in which the number of copies of a specific segment of DNA varies among different individuals’ genomes. The individual variants may be short or include thousands of bases. These structural differences may have come about through duplications, deletions or other changes and can affect long stretches of DNA. Such regions may or may not contain one or more genes. Segmental duplications are hotspots for structural variants (for example, deletions or duplications) and gene recombinant variants (for example, gene conversion). A gene recombinant variant can result from a sequence of a gene being copied into a paralog of the gene or vice versa.
The paralog of the gene can be a gene or a pseudogene. Segmental duplications can occur for genes with highly homologous gene family members or pseudogenes. Many clinically relevant genes have highly homologous gene family members or pseudogenes and can be affected by segmental duplications. Such clinically relevant genes include genes important for rare diseases, cancer, immunology and pharmacogenetics.

[0033] Provided herein are gene variant detection techniques that yield improved characterization of copy number variants (CNVs) and that rescue previously excluded regions of a genome from copy number variant analysis. The excluded portion may represent genome regions having segmental duplications. Segmental duplications are segments of DNA sequences that have similar copies in other regions of the genome. Segmental duplications may present challenges in the sequence alignment to a reference genome because of the ambiguity the duplicative segments cause in determining the precise location of a sequence result. In an embodiment, segmental duplications, segmental duplicate/s, or segmental repeat/s may refer to genomic DNA regions that, in embodiments, range from about 500 bases to 400 kilobases in length that occur at more than one site within the genome and typically share a high level (greater than 90%) of sequence identity. Because it is difficult to align sequence reads having high similarity to more than one genome region, such reads may be excluded from further analysis. This in turn results in poor coverage for the regions of the genome corresponding to the excluded sequence data.

[0034] Copy number variant (CNV) callers may refer to genome analysis techniques that detect large copy number variation events such as deletions and duplications using next-generation sequencing (NGS) sequencing depth data.
To ensure accurate and reliable CNV calling, CNVs in regions of the genome that share high levels of similarity to other regions of the genome are excluded, for example excluding approximately 407Mbp in an hg38 human reference genome of assembled chromosomes (1-22, X, Y). Many of these excluded and problematic regions overlap with segmental duplications (approximately 100Mbp). Targeted calling methods that incorporate excluded regions using characterized differentiating sites to estimate sequence depth have demonstrated success in the past. However, such methods cannot be directly applied to novel or uncharacterized regions, are dependent on prior knowledge about targeted CNV event boundaries (cannot discover novel events) and cannot easily handle border CNVs that span both duplicated and unique genomic regions. The present techniques augment or enhance CNV calling methods to include regions overlapping with segmental duplications. In an embodiment, the disclosed techniques bin segmental duplication regions and identify corresponding bins between the different segmental duplication copies containing at least one differentiating site. The techniques also may include computing and normalizing the aggregate depth for sequence reads across all copies of each bin. That is, a sequence read can be aligned to one or both corresponding bins for a duplicated region, thus permitting inclusion of ambiguously aligned reads. The aggregate or joint normalized depth signal is split according to differentiating sites within each bin and the normalized and split depth signal is segmented over segmental duplication bins along with unique region bins to identify CNV events. This method allows for novel CNV boundary detection within segmental duplications and also handles CNVs spanning both unique and duplicated regions seamlessly. 
Provided herein is also a novel technique for data normalization to generate normalized counts of sequence reads using aggregated portions (bins) of sequence reads for similar regions or genes.

[0035] FIG. 1 is a schematic illustration of example sequence or region exclusion parameters that may, in operation, result in exclusion of certain genome regions in a reference genome from analysis. In the illustrated example, sequences of the reference genome are divided into nonoverlapping bins, with each bin corresponding to a genome region or contiguous length of the genome. However, in certain cases, only bins in the reference genome with a sufficient number of unique kmers may be considered. In the illustrated example, certain bins are denoted as not having sufficiently unique kmers. Whole regions associated with the bins are excluded, which may also capture adjacent or intervening bins that otherwise have sufficient unique kmers to be considered. Thus, exclusion of regions in a reference genome results in exclusion of corresponding regions in sample data from analysis. The disclosed techniques permit consideration of previously excluded regions in CNV calling, specifically excluded regions that encompass segmental duplications that do not have unique kmers, e.g., regions with high similarity to other regions.

[0036] FIG. 2 is a schematic illustration of potential for missed copy number variants in excluded regions of the genome. In the illustrated example, certain genome regions are masked. However, the masked region includes a copy number variant. By unmasking or otherwise permitting inclusion of at least a portion of previously masked regions, copy number variant calling is improved. Of the regions excluded from alignments, many lie within segmental duplications (e.g., 20% of excluded regions). Segmental duplications account for 5-8% of the genome, and 20% of potentially excluded regions lie within segmental duplications.
Therefore, ~3% of the genome may be excluded from CNV callers but lie within a segmental duplication.

[0037] FIGS. 3-5 show example excluded segmental duplications in a CNV calling workflow that does not rescue or does not at least partially rescue duplicated regions, such that CNVs within or across the border of the excluded regions are not identified. In FIG. 3, target sample sequence counts are shown across segments of the genome that include a CNV event. However, as shown in FIG. 4, there is no data provided across excluded regions. Therefore, the read counts and normalized read counts (FIG. 5) are not provided as inputs for CNV calling. The disclosed techniques, as discussed herein, rescue the excluded regions to permit identification of the CNV event.

[0038] FIG. 6 is a schematic illustration showing clean or unambiguous alignment between Region 1 and Region 2, which are uniquely mapping. In contrast, Region 1 and its segmental duplication, indicated as Region 3 in FIG. 6, have reads that are ambiguously aligned. The present techniques, rather than treating ambiguous alignments as unresolved data that is excluded, instead utilize ambiguous alignments. CNV calling methods may rely on the normalized read-depth signal for detection of CNVs. For segmentally duplicated areas of the genome, these signals are skewed due to ambiguous read alignments. The disclosed techniques rescue the read-depth signal for segmental duplication regions via a multiple step process that first determines the total (i.e., aggregate) read-depth of the two regions and then uses differentiating sites within the region, as shown schematically in FIG. 7, to split the total read-depth to its corresponding duplication.

[0039] FIG. 7 is a schematic illustration of differentiating sites between a first region 100 and a second region 102. The first region 100 and the second region 102 are segmental duplications with high similarity.
The first region 100 and the second region 102 have differentiating sites, which may be nucleotide sites that consistently differ between the different copies of a given segmental duplication. These sites may be empirically identified from population datasets. However, in embodiments, the differentiating sites may be identified de novo in analyzed samples.

[0040] In an embodiment, the regions 100, 102 can be mapped to one another using the sequence similarity between them to identify corresponding positions between the regions 100, 102. For example, a particular differentiating site 110 of the first region 100 corresponds to or is mapped to another differentiating site 122 in the second region 102. Thus, if a nucleotide identity of the differentiating site 110 is a G, then the nucleotide identity of the differentiating site 122 is different (e.g., T, C, or A). In an embodiment, the differentiating site 110 is defined as being a particular nucleotide (e.g., G) in a first region 100, and the differentiating site 122 is a different particular nucleotide (e.g., T) in the second region 102. Sequence reads having a C or A at the defined position corresponding to the differentiating sites may be marked as being ambiguously mapped. The number and positions of the differentiating sites depend on the characteristics of a particular segmental duplication. Further, differentiating sites may be single nucleotides or two or more consecutive nucleotides. It should be understood that at intervening sites between the differentiating sites, the regions 100, 102 are highly similar and, in embodiments, have stretches of identical sequences. The differentiating sites 110, 112, 114, 116, 118, 120 of the first region 100 and the differentiating sites 122, 124, 126, 128, 130, 132 of the second region 102 are by way of example. It should be understood that the regions 100, 102 may have more or fewer differentiating sites.
In embodiments, the differentiating sites may represent a range of possible differentiating sites across a population or pooled samples, and an individual sample may include only a subset of the differentiating sites.

[0041] Bins are created, e.g., automatically created, based on the differentiating sites. In contrast to other analysis workflows (e.g., general alignment), bin sizes in the regions 100, 102 may be non-uniform relative to one another. The selection of bins and bin sizes for corresponding regions having high similarity may be rules-based, such that each bin has at least one differentiating site. In an embodiment, the bin size is constrained to be of sufficient size to ensure stable depth signal while maintaining sufficient granularity (e.g., 900-1200 bases). Depth signal variance is reduced as the bin size increases, but granularity of CNV boundary detection is also reduced with increased bin size. The bin size may have a minimum size threshold (e.g., 300 bases) and a maximum size threshold (e.g., 10000 bases). Each bin in the first region 100 has a corresponding bin in the second region 102 based on an alignment or correspondence in sequence between them. Thus, corresponding bins may have a same size or may be within 5-10 bases in length of one another. In the illustrated example, a bin 150 of the first region 100 corresponds to one bin 160 of the second region 102. Corresponding bins may include bins 152, 162, bins 154, 164, and bins 156, 166 by way of example. Further, the illustrated example is a two-region case. However, the disclosed techniques may be applied to three-duplicate or more cases having a sufficient distribution of differentiating sites.

[0042] If the bins are created using a reference population, the differentiating sites may be stored in memory and accessed during subsequent analysis steps. The workflow may include an initial step of identifying or setting differentiating sites that is not repeated for each sample.

[0043] As shown in FIG.
8, for a segmental duplication with identified differentiating sites and established bins, the workflow includes using alignments to one or both of the first region 100 or the second region 102 (see FIG. 6). Thus, the workflow takes in any read aligned to sequences in the first region 100 and/or the second region 102. The alignments may be alignments that were previously generated from an initial alignment or may be a separate alignment step. The joint depth per bin is calculated by counting all reads aligning to either copy of that bin (e.g., bins of the first region 100 and the second region 102), including ambiguously mapped reads. A joint bin may refer to a set that includes corresponding bins between the first region and the second region. Thus, a set 180 includes bins 150, 160. Another set is bins 152, 162, and so on. An individual bin of the first region may correspond to only one other bin in the second region. Further, each individual bin of the first region may be in only one set in an embodiment.

[0044] In the illustrated example, the joint depth 200 corresponding to the individual set 180 of corresponding bins 150, 160 between the regions 100, 102 is, by way of example, 59, which is a count of the aligned sequence reads aligned to one or both of the bins 150, 160. Further, in the illustrated example, the coverage is sufficient, when aggregated, to pass quality thresholds for 30x sequencing depth. In this example, certain of the bins, when not aggregated, would not have sufficient depth to be considered. Aggregation as disclosed in the present techniques permits inclusion of the sequence data associated with the first region 100 and the second region 102. As discussed herein, de-aggregation to reflect the likely distribution of the reads to a particular bin assigned to the first region 100 or the second region 102 is accomplished by distributing the reads proportionally using the presence of differentiating sites in the reads.
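
By way of illustration only, the joint depth count described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Bin` and `Read` records, the region labels, the set identifiers, and the choice to count a read once per set by read name when it has alignments to both copies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bin:
    region: str  # hypothetical label, e.g. "region_100" or "region_102"
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


@dataclass(frozen=True)
class Read:
    name: str    # read identifier shared by all alignments of one read
    region: str
    start: int
    end: int


def overlaps(read: Read, b: Bin) -> bool:
    """True if the alignment overlaps the bin on the same region."""
    return read.region == b.region and read.start < b.end and b.start < read.end


def joint_depth(alignments, bin_sets):
    """Count, per set of corresponding bins, the reads aligned to either copy.

    Assumption: a read aligned to both copies (ambiguously mapped) is counted
    once per set, keyed by read name.
    """
    depth = {}
    for set_id, bins in bin_sets.items():
        names = {r.name for r in alignments if any(overlaps(r, b) for b in bins)}
        depth[set_id] = len(names)
    return depth


# Hypothetical example data: two sets of corresponding bins, three reads.
bin_sets = {
    "set_180": (Bin("region_100", 0, 1000), Bin("region_102", 0, 1000)),
    "set_2": (Bin("region_100", 1000, 2000), Bin("region_102", 1000, 2000)),
}
alignments = [
    Read("r1", "region_100", 100, 250),
    Read("r2", "region_100", 400, 550),   # ambiguous read, first copy
    Read("r2", "region_102", 400, 550),   # same read, second copy
    Read("r3", "region_102", 1500, 1650),
]
depths = joint_depth(alignments, bin_sets)
```

Because the count is taken over the set rather than over each bin separately, a read that cannot be placed confidently in either copy still contributes to the aggregate signal.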
[0045] The joint depth per bin is then normalized according to its GC content (GC correction), the length of the bin (length normalization), and the depth over a set of 3000 genomic regions of 2000 bp expected to be consistently diploid across populations (mean sample depth normalization), as shown in FIG. 9. Normalized depth values may be calculated as each position’s coverage normalized by median autosomal coverage, so that a value of 1 corresponds to the median autosomal coverage depth. The normalized depths can be averaged across all samples for a cohort depth. Normalization may be a self-normalization using statistics within the case sample itself to determine the baseline from which to make a CNV call.

[0046] FIG. 10 shows the use of differentiating sites to proportionally split the normalized joint depth between the two regions. In the illustrated example, there are four bins present, and each bin has an associated normalized joint depth based on its (normalized) aligned reads. The duplication-specific single nucleotide polymorphism (SNP) ratio is calculated for each differentiating site present in each of the bins according to the proportion of reads supporting the base (A, C, G, or T) corresponding to either copy of the segmental duplication at that site. All reads overlapping either copy of that site, regardless of mapping quality, are used when calculating the SNP ratios. In the illustrated example, the determination of the proportions is based on how many reads support each differentiating site. For the first bin, there are two differentiating sites, and the read distributions are shown in box 220. To the extent that the proportions differ between a first site and a second site (e.g., sites 110, 112 of the bin 150 and sites 122, 124 of bin 160) within a bin, the proportions can be averaged to yield a bin average proportion.

[0047] FIG. 11 is a flow diagram of a method 300.
The method 300 may be implemented by the disclosed systems and/or devices discussed herein. In an embodiment, the method may be at least in part carried out by hardware elements discussed with respect to FIG.21. It should be understood that the method 300 may include intervening steps. The method 300 may be applied to analysis of a particular first region and second region, whereby the first region and the second region have similarity to one another. In an embodiment, the second region is a segmental duplication of the first region. In an embodiment, the second region is a paralog of the first region. For a genome having multiple regions of high similarity, the method may be applied in parallel to each pair or group of similar regions. In an embodiment, the method may be used for characterized regions that have identified differentiating sites. In an embodiment, the method may be used to identify novel differentiating sites.

[0048] The method 300 begins by receiving sequence data from a sample of interest at block 302. The sequence data may be in the form of a plurality of sequence reads. For example, sequence reads may be about 100 base pairs to about 1000 base pairs in length each. The method 300 proceeds to block 304, where the computing system aligns the plurality of sequence reads to a reference sequence to obtain a plurality of aligned sequence reads comprising sequence reads aligned to a first region and/or a second region in the reference sequence. A normalized joint sequencing depth is determined based on the group of the aligned sequence reads at block 306. In an embodiment, the sequencing depth is on a per-bin basis, whereby the first region includes a plurality of bins and the second region includes a plurality of corresponding bins. Each individual bin of the first region corresponds to an individual bin of the second region. In an embodiment, each bin of the first region corresponds to only one bin of the second region.
The correspondence is based on sequence similarity or identity. Using the alignments, an aggregate or joint depth of the combined reads aligned to each set of corresponding bins is determined, and the joint depth is normalized. The normalized joint depth is split or distributed between the first region bin and the second region bin based on a distribution of differentiating sites in the aggregated sequence reads for the set of bins at block 308. For example, if, out of 50 sequence reads, 40 reads have a “G” at position 1, while 10 reads have an “A” at position 1 (where position 1 is a differentiating site with G being associated with the first region and A being associated with the second region), the distribution ratio between the first region and the second region is 4:1, i.e., 80% of the reads support the first region and 20% support the second region. This can be used as a multiplication factor for the normalized joint depth to determine the split depth. In an embodiment, any reads having a “T” or “C” at position 1 may be excluded from the ratio if those nucleotides are not associated with the differentiating site.

[0049] FIG. 12 is a schematic illustration of Region 1 and Region 2 rescue using the disclosed techniques, and FIG. 13 shows the rescued segments with normalized counts. In the illustrated example, the normalized depth is split using the SNP ratios (differentiating site ratios) calculated in each bin, which quantify the proportion of the depth signal over that bin originating from either copy of the segmental duplication. This is done by multiplying the average ratio for the differentiating sites within the bin by the normalized joint depth for that bin. In doing so, each bin now has a region-specific normalized depth signal. In the illustrated example, the scale is the same scale as those generated by the pre-segmentation stages of a DRAGEN CNV caller (Illumina).
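
By way of illustration only, the split of a normalized joint depth using differentiating site ratios can be sketched as follows. This is a hedged sketch, not the disclosed implementation: the function names, the input layout (a list of observed bases per site), and the normalized joint depth value of 2.0 are assumptions made for the example. Reads carrying a base associated with neither copy are left out of the ratio, and per-site fractions are averaged within the bin.

```python
def site_fraction(bases, region1_base, region2_base):
    """Fraction of informative reads supporting the first region at one site.

    Reads whose base matches neither copy are ignored. Returns None when no
    read at this site is informative.
    """
    n1 = sum(1 for b in bases if b == region1_base)
    n2 = sum(1 for b in bases if b == region2_base)
    if n1 + n2 == 0:
        return None
    return n1 / (n1 + n2)


def split_joint_depth(normalized_joint_depth, sites):
    """Split one bin set's normalized joint depth between the two copies.

    `sites` is a list of (bases, region1_base, region2_base) tuples, one per
    differentiating site in the bin (hypothetical layout).
    """
    fractions = [
        f for f in (site_fraction(*site) for site in sites) if f is not None
    ]
    if not fractions:
        raise ValueError("no informative reads at any differentiating site")
    frac1 = sum(fractions) / len(fractions)  # bin average proportion
    return normalized_joint_depth * frac1, normalized_joint_depth * (1.0 - frac1)


# 40 reads with G (first region), 10 with A (second region), 2 uninformative.
bases_at_site = ["G"] * 40 + ["A"] * 10 + ["T"] * 2
depth_region1, depth_region2 = split_joint_depth(2.0, [(bases_at_site, "G", "A")])
# depth_region1 is approximately 1.6 and depth_region2 approximately 0.4
```

Averaging per-site fractions, rather than pooling all reads in the bin, gives each differentiating site equal weight; pooling would be an alternative design that weights sites by coverage.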
The region-specific normalized depth signal for segmental duplication bins is added back to the DRAGEN CNV normalized depth per bin file along with the existing normalized depth signal from unique region bins. The DRAGEN CNV segmentation algorithm then runs on the combined set of unique and rescued segmental duplication bins to identify CNV events potentially spanning both duplicated and unique regions. Such segments are then scored and genotyped using the default CNV caller infrastructure. [0050] As a proof of concept, this method has been applied to the CYP2A6 gene, which has a significant portion of the gene excluded from CNV calling due to overlap with a segmental duplication. CYP2A6 is responsible for metabolization of nicotine and cotinine and is involved in metabolism of several drugs and carcinogens. Therefore, CNVs for this gene may have clinical significance. However, CYP2A6 shares high homology with its neighboring pseudogene CYP2A7. FIG. 14 shows the non-unique kmers that are conventionally excluded from CNV calling, and FIG. 15 shows CYP2A6 structural variants. [0051] In the analysis, no false positive CNV events were detected as a result of the segmental duplication bin rescue process across the 10 negative samples investigated. The noise in the normalized depth signal over rescued bins seemed to be similar to that observed over the original bins from unique regions, which is consistent with the lack of false positive events. FIG. 16 shows successful heterozygous deletion detection. FIG. 17 shows successful duplication detection. FIG. 18 shows successful homozygous deletion detection. FIG. 19 shows successful hybrid deletion detection. All homozygous and heterozygous deletion events expected in the CYP2A6 region across the 7 positive samples analyzed were detected with the correct boundaries. [0052] STRC deletion was also assessed, as shown in FIG. 20. 
In the analyzed genomic data, an STRC deletion incidence rate of 1.3% was observed, which is consistent with an incidence of 1.5% previously observed. [0053] Disclosed herein are techniques that permit inclusion of previously masked or discarded regions of a genome for subsequent analysis, which may include haplotyping, copy number variant detection, SNP detection, structural variant detection, etc. The included regions may be regions that have high similarity to other regions, such as paralogs or segmental duplications. Certain analysis steps of the disclosed techniques use sequence data prepared from a sample. Sample preparation, sample quality, and acquisition of the sequence data can be variable. In certain cases, this sample and/or variability is reflected in the generated sequence data. The variability is difficult to predict and may be tied to operating parameters of the sequence device (see FIG. 23), variability in sequence library preparation, or sample handling and storage. In an embodiment, a sequencing library generated from a modified individual sample is a non-naturally occurring composition. Further, because of unpredictability in fragment generation on a sample-level basis as part of sample preparation steps, each sample may yield a sequencing library that is a unique composition that in turn generates a unique set of sequence reads (e.g., at least thousands or at least millions of sequence reads) associated with that sample. Normalization to sequence data of a pool of normals acquired under the same conditions or self-normalization may be used to address this variability. Thus, the present techniques include steps of normalization of sequence data. In an embodiment, sequence data acquisition from a sequencing library with reduced amplification yields may be reflected in variable sequencing depth (sequence read counts per region or bin) across the genome. Normalization to a genome-wide depth average addresses such sample-dependent variability. 
In addition, as disclosed herein, the present techniques provide a novel normalization step in which sequence reads are normalized using joint bins. [0054] In some embodiments, the disclosed techniques may be incorporated into other CNV workflows for non-duplicated regions or unique regions of the genome to create an enhanced CNV calling result that includes more of the genome in the analysis. That is, the disclosed techniques rescue regions conventionally excluded using a novel data handling technique for generating depth values. Once generated, these values can be normalized and combined with normalized depth values from other regions. FIG.21 is a CNV calling workflow that shows incorporation of the rescued normalized counts from segmental duplications. In the workflow, the sequence data is used to generate target counts that are normalized. The normalization may be self-referential or using a panel of normal (PoN), which may be based on operator input. For regions of segmental duplication, the normalization may occur using normalization techniques as generally disclosed herein. These segmental duplication regions are rescued and analyzed separately, with the alignments for these duplicated regions being pulled out of the conventional analysis workflow to generate the normalized depths. However, the normalized depth values for these duplicated regions are provided into the general workflow prior to segmentation and CNV calling. Thus, relative to other workflows, the disclosed techniques provide additional depth values, or a combined set of depth values that include depth values from rescued regions that were previously not considered using conventional techniques. [0055] In embodiments, the activation of the workflow for including segmental duplications in CNV calling may be user-initiated. A user may provide an input via a graphical user interface to consider segmental duplications in CNV calling. 
When activated, the analysis steps as provided herein are initiated. In an embodiment, the workflow for including segmental duplications in CNV calling is a default setting. In an embodiment, the report generated for CNV calling may include a notation or indication that segmental duplications were included or excluded. The workflow may include only those segmental duplications, or similar regions, with sequence data of sufficiently high quality. For example, if the aggregate read counts fall below a threshold (e.g., 30 reads per bin) for an individual bin, the individual bin of the region(s) or the entire region may be excluded. [0056] FIG. 22 shows the workflow for generating the rescued depth counts of FIG. 21 according to the techniques disclosed herein. Using alignments for the regions of interest, the joint depths are determined for each bin and GC corrected/normalized. Based on the differentiating site ratio (a distribution of differences between the aggregated sequence reads in the joint bins), the depths are split into each bin. This in turn is used to update a stored file of normalized counts for other regions of the genome that is fed into a CNV caller. Thus, the normalized counts for the segmental duplications are generated and written to memory. [0057] In some embodiments, the system comprises non-transitory computer-readable memory configured to store executable instructions. The non-transitory memory can be configured to store a reference sequence and sequence data of a sample. The system can comprise a hardware processor in communication with the non-transitory memory. The hardware processor can be programmed by the executable instructions to perform receiving a plurality of sequence reads generated from a sample obtained from a subject. 
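By way of a hedged illustration of the bin-level quality filter of paragraph [0055] and the update step of paragraph [0056], the following Python sketch drops joint bins whose aggregate read count falls below a threshold and merges the remaining rescued depths with depths from unique-region bins. The data layout (dictionaries keyed by chromosome, start, and end), the field names, the placeholder chromosome name "chrN", and all numeric values are assumptions made for illustration; the sketch does not reproduce any DRAGEN file format.

```python
MIN_READS_PER_BIN = 30  # example threshold mentioned in the text


def filter_low_count_bins(joint_bins, min_reads=MIN_READS_PER_BIN):
    """Keep joint bins whose aggregate read count meets the threshold."""
    return {key: b for key, b in joint_bins.items() if b["reads"] >= min_reads}


def merge_normalized_depths(unique_depths, rescued_depths):
    """Combine unique-region and rescued bins, ordered by genomic position."""
    combined = dict(unique_depths)
    combined.update(rescued_depths)
    return sorted(combined.items())


# Hypothetical joint bins for the first region of a segmental duplication.
joint_bins = {
    ("chrN", 1000, 2000): {"reads": 120, "first_depth": 1.02},
    ("chrN", 2000, 3000): {"reads": 12, "first_depth": 0.40},  # too few reads
}
kept = filter_low_count_bins(joint_bins)
rescued = {key: b["first_depth"] for key, b in kept.items()}

# Hypothetical normalized depths already available for unique-region bins.
unique = {("chrN", 0, 1000): 0.98, ("chrN", 3000, 4000): 1.01}
combined = merge_normalized_depths(unique, rescued)
for (chrom, start, end), depth in combined:
    print(chrom, start, end, depth)
```

The combined, position-sorted list stands in for the input that a segmentation step would consume; excluding the low-count bin leaves a gap rather than a noisy value.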
The hardware processor can be programmed by the executable instructions to perform aligning the plurality of sequence reads to a reference sequence to generate alignments that include a plurality of aligned sequence reads aligned to the reference sequence. The hardware processor can be programmed by the executable instructions to perform determining a number of copies of each region of the plurality of regions based on a number of the sequence reads aligned to the region and using the techniques disclosed herein. [0058] The various illustrative algorithms described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a processing unit or processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can include electrical circuitry configured to process computer-executable instructions. In another embodiment, a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor can also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry. 
A computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few. [0059] The hardware components may be implemented in a device or devices of the system. FIG. 23 is a schematic diagram of a sequencing device 500 that may be used in conjunction with the disclosed embodiments for acquiring or generating sequence data of identification sequences and/or index sequences as generally discussed herein. The sequencing device 500 may be implemented according to any sequencing technique, such as those incorporating sequencing-by-synthesis methods described in U.S. Patent Publication Nos. 2007/0166705; 2006/0188901; 2006/0240439; 2006/0281109; 2005/0100900; U.S. Pat. No. 7,057,026; WO 05/065814; WO 06/064199; WO 07/010,251, the disclosures of which are incorporated herein by reference in their entireties. Alternatively, sequencing by ligation techniques may be used in the sequencing device 500. Such techniques use DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides and are described in U.S. Pat. No. 6,969,488; U.S. Pat. No. 6,172,218; and U.S. Pat. No. 6,306,597; the disclosures of which are incorporated herein by reference in their entireties. Some embodiments can utilize nanopore sequencing, whereby target nucleic acid strands, or nucleotides exonucleolytically removed from target nucleic acids, pass through a nanopore. As the target nucleic acids or nucleotides pass through the nanopore, each type of base can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Patent No. 7,001,792; Soni & Meller, Clin. Chem. 53, 1996–2001 (2007); Healy, Nanomed. 2, 459–481 (2007); and Cockroft, et al. J. Am. Chem. Soc. 
130, 818–820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Yet other embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference in its entirety. Particular embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides, or with zero-mode waveguides as described, for example, in Levene et al. Science 299, 682–686 (2003); Lundquist et al. Opt. Lett. 33, 1026–1028 (2008); Korlach et al. Proc. Natl. Acad. Sci. USA 105, 1176–1181 (2008), the disclosures of which are incorporated herein by reference in their entireties. Other suitable alternative techniques include, for example, fluorescent in situ sequencing (FISSEQ), and Massively Parallel Signature Sequencing (MPSS). In particular embodiments, the sequencing device 500 may be a HiSeq, MiSeq, or HiScanSQ from Illumina (La Jolla, CA). In other embodiments, the sequencing device 500 may be configured to operate using a CMOS sensor with nanowells fabricated over photodiodes such that DNA deposition is aligned one-to-one with each photodiode. [0060] The sequencing device 500 may be a “one-channel” detection device, in which only two of four nucleotides are labeled and detectable for any given image. 
For example, thymine may have a permanent fluorescent label, while adenine uses the same fluorescent label in a detachable form. Guanine may be permanently dark, and cytosine may be initially dark but capable of having a label added during the cycle. Accordingly, each cycle may involve an initial image and a second image in which dye is cleaved from any adenines and added to any cytosines such that only thymine and adenine are detectable in the initial image but only thymine and cytosine are detectable in the second image. Any base that is dark through both images is guanine and any base that is detectable through both images is thymine. A base that is detectable in the first image but not the second is adenine, and a base that is not detectable in the first image but detectable in the second image is cytosine. By combining the information from the initial image and the second image, all four bases are able to be discriminated using one channel. [0061] In the depicted embodiment, the sequencing device 500 includes a separate sample processing device 502 (e.g., for sequencing library preparation) and an associated computer 504. However, as noted, these may be implemented as a single device. Further, the associated computer 504 may be local to or networked or otherwise in communication with the sample processing device 502. In the depicted embodiment, the biological sample may be loaded into the sample processing device 502 on a sample substrate 510, e.g., a flow cell or slide, that is imaged to generate sequence data. For example, reagents that interact with the biological sample fluoresce at particular wavelengths in response to an excitation beam generated by an imager 512 and thereby return radiation for imaging. 
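Returning briefly to the one-channel scheme of paragraph [0060], the two-image decoding rule can be sketched in Python as a lookup keyed on whether a cluster is detectable in each image. The function and table names are hypothetical, and the sketch assumes that signal detection has already been reduced to a boolean per image; it is an illustration of the stated rule rather than an actual base-calling implementation.

```python
# (detectable in initial image, detectable in second image) -> base call,
# following the rule stated in the text for a one-channel device.
ONE_CHANNEL_CALLS = {
    (True, True): "T",    # permanently labeled, visible in both images
    (True, False): "A",   # label cleaved before the second image
    (False, True): "C",   # label added before the second image
    (False, False): "G",  # dark in both images
}


def call_base(first_image_on: bool, second_image_on: bool) -> str:
    """Return the base implied by the on/off pattern across the two images."""
    return ONE_CHANNEL_CALLS[(first_image_on, second_image_on)]


# Hypothetical on/off patterns for four consecutive cycles of one cluster.
cycles = [(True, True), (False, False), (True, False), (False, True)]
read = "".join(call_base(first, second) for first, second in cycles)
print(read)
```

Because two binary observations give four distinct patterns, a single detection channel suffices to distinguish all four bases.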
For instance, the fluorescent components may be generated by fluorescently tagged nucleic acids that hybridize to complementary molecules of the components or to fluorescently tagged nucleotides that are incorporated into an oligonucleotide using a polymerase. As will be appreciated by those skilled in the art, the wavelength at which the dyes of the sample are excited and the wavelength at which they fluoresce will depend upon the absorption and emission spectra of the specific dyes. Such returned radiation may propagate back through the directing optics. This retrobeam may generally be directed toward detection optics of the imager 512. [0062] The imager detection optics may be based upon any suitable technology, and may be, for example, a charge-coupled device (CCD) sensor that generates pixelated image data based upon photons impacting locations in the device. However, it will be understood that any of a variety of other detectors may also be used including, but not limited to, a detector array configured for time delay integration (TDI) operation, a complementary metal oxide semiconductor (CMOS) detector, an avalanche photodiode (APD) detector, a Geiger-mode photon counter, or any other suitable detector. TDI mode detection can be coupled with line scanning as described in U.S. Patent No. 7,329,860, which is incorporated herein by reference. Other useful detectors are described, for example, in the references provided previously herein in the context of various nucleic acid sequencing methodologies. [0063] The imager 512 may be under processor control, e.g., via a processor 514, and the sample processing device 502 may also include I/O controls 516, an internal bus 518, non-volatile memory 520, RAM 522 and any other memory structure such that the memory is capable of storing executable instructions, and other suitable hardware components that may be similar to those described with regard to FIG. 31. 
Further, the associated computer 504 may also include a processor 524, I/O controls 526, communications circuitry 527, and a memory architecture including RAM 528 and non-volatile memory 530, such that the memory architecture is capable of storing executable instructions 532. The hardware components may be linked by an internal bus, which may also link to the display 534. In embodiments in which the sequencing device 500 is implemented as an all-in-one device, certain redundant hardware elements may be eliminated. [0064] The processor 514, 524 may be programmed to assign individual sequence reads to a sample based on the associated index sequence or sequences according to the techniques provided herein. In particular embodiments, based on the image data acquired by the imager 512, the sequencing device 500 may be configured to generate sequence data that includes base calls (nucleotide identities) for each base of a sequence read. Further, based on the image data, even for sequence reads that are performed in series, the individual reads may be linked to the same location via the image data and, therefore, to the same template strand. In this manner, index sequence reads may be associated with a sequence read of an insert sequence before being assigned to a sample of origin. The processor 514, 524 may also be programmed to perform downstream analysis on the sequences corresponding to the inserts for a particular sample subsequent to assignment of sequence reads to the sample. [0065] In certain embodiments, the I/O controls 516, 526 may be configured to receive user inputs that automatically select sequencing parameters. For example, the user inputs may permit selection of bin sizes for alignments. In an embodiment, the bin size selection may apply in operating modes outside of the disclosed joint depth distribution techniques that rely on variable bin sizes or borders based on a presence of differentiating sites between the regions of interest. 
Therefore, a user input of bin size may be overridden or not applied in similar regions having differentiating sites. [0066] In an embodiment, notifications related to CNV calling are provided on the display 534 or communicated via the communications circuitry 527 to a remote device or a cloud server. [0067] The term copy number variation, copy number variant, or CNV may refer to variation in the number of copies of a nucleic acid sequence present in a test sample in comparison with the copy number of the nucleic acid sequence present in a reference sample. In certain embodiments, the nucleic acid sequence is 1 kb or larger. In some cases, the nucleic acid sequence is a whole chromosome or significant portion thereof. A copy number variant may refer to a sequence in which copy-number differences are found by comparison of a nucleic acid sequence of interest in test sample with an expected level of the nucleic acid sequence of interest. For example, the level of the nucleic acid sequence of interest in the test sample is compared to that present in a qualified sample. Copy number variants/variations include deletions, including microdeletions, insertions, including microinsertions, duplications, multiplications, and translocations. CNVs encompass chromosomal aneuploidies and partial aneuploidies. [0068] A CNV call may be based on a variation from a threshold. A threshold or threshold value may refer to a number that is used as a cutoff to characterize a sample such as a test sample containing a nucleic acid from an organism suspected of having a medical condition. The threshold may be compared to a parameter value to determine whether a sample giving rise to such parameter value suggests that the organism has the medical condition. In certain embodiments, a qualified threshold value is calculated using a qualifying data set and serves as a limit of diagnosis of a SNV or CNV. 
If a threshold is exceeded by results obtained from methods disclosed herein, a subject can be diagnosed with a SNV or CNV. [0069] A sequencing coverage refers to a percentage of bases in a reference sequence covered by the mapped or aligned sequence reads. A set of sequence reads are mapped to a reference genome at the various genomic regions. The total percentage of target bases within the reference genome to which sequenced reads are mapped is quantified as the coverage of the genome. The average depth of sequencing coverage is the ratio of the number of reads, e.g., scaled by read length, to the total reference genome length. The read depth may be normalized to a mean depth across a genome. [0070] A sequence read may refer to a sequence obtained from a fragment of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in A, T, C, or G) of the sample portion. It may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome or genomic region or gene. In some embodiments, the plurality of sequence reads comprises sequence reads that are about 100 base pairs to about 1000 base pairs in length each. The plurality of sequence reads can comprise paired-end sequence reads and/or single-end sequence reads. The plurality of sequence reads may be generated by whole genome sequencing (WGS), such as clinical WGS (cWGS). 
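The depth definitions of paragraph [0069] can be illustrated with a short Python sketch that computes an average depth as the number of reads scaled by read length over the reference length, and normalizes per-bin depths to their mean. The function names and all numeric inputs (read count, read length, genome length, bin depths) are hypothetical round numbers chosen for illustration, and the mean over a handful of bins stands in for a genome-wide mean.

```python
def average_depth(num_reads, read_length, reference_length):
    """Average depth: total sequenced bases divided by reference length."""
    return num_reads * read_length / reference_length


def normalize_to_mean(bin_depths):
    """Scale per-bin depths so that their mean becomes 1.0."""
    mean_depth = sum(bin_depths) / len(bin_depths)
    if mean_depth == 0:
        raise ValueError("cannot normalize bins with zero mean depth")
    return [depth / mean_depth for depth in bin_depths]


# Hypothetical run: 600 million reads of 150 bases over a 3 Gb reference.
depth = average_depth(600_000_000, 150, 3_000_000_000)

# Hypothetical raw depths for three bins, normalized to their mean.
normalized = normalize_to_mean([20.0, 30.0, 40.0])
print(depth, normalized)
```

After normalization, a bin at the mean has a value of 1.0, so deviations from the expected level are directly comparable between samples sequenced to different depths.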
In some embodiments, the sequence reads may be generated by targeted sequencing techniques. [0071] An alignment may refer to comparing a sequence read to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read sequence, the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. Aligned reads or tags are one or more sequences that are identified as a match in terms of the order of their nucleic acid molecules to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, as it would be impossible to align reads in a reasonable time period for implementing the methods disclosed herein. Non limiting examples of alignment methods include global alignments (such as Needleman-Wunsch algorithm), local alignments, dynamic programming (such as Smith-Waterman algorithm), heuristic algorithms or probabilistic methods, progressive methods, iterative methods, motif finding or profile analysis, genetic algorithms, simulated annealing, pairwise alignments, multiple sequence alignments. In some cases, the alignment may generate an alignment score. An alignment score is a score indicating a similarity of two sequences determined using an alignment method. In some implementations, an alignment score accounts for number of edits (e.g., deletions, insertions, and substitutions of characters in the string). In some implementations, an alignment score accounts for a number of matches. In some implementations, an alignment score accounts for both the number of matches and a number of edits. In some implementations, the number of matches and edits are equally weighted for the alignment score. As provided herein, highly similar regions may have alignment scores above a threshold, e.g., above 90% similarity. 
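As one hedged illustration of the alignment scores discussed above, the following Python sketch takes two already-aligned sequences of equal length (with "-" marking a gap) and reports the number of matches, the number of edits, an equally weighted score, and a percent identity that can be compared against a similarity threshold. The function name, the choice of matches minus edits as the equally weighted score, and the example sequences are assumptions for illustration; the disclosure does not prescribe this particular formula.

```python
def alignment_stats(aligned_a, aligned_b):
    """Summarize a pairwise alignment given as two gapped, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    # A column is a match when both sequences carry the same non-gap base.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    # Every other column is an edit: substitution, insertion, or deletion.
    edits = len(aligned_a) - matches
    return {
        "matches": matches,
        "edits": edits,
        "score": matches - edits,               # matches and edits weighted equally
        "identity": matches / len(aligned_a),   # fraction of matching columns
    }


# Hypothetical 10-column alignment with a single substitution.
stats = alignment_stats("ACGTACGTAC", "ACGTTCGTAC")
is_highly_similar = stats["identity"] > 0.90  # example threshold from the text
print(stats, is_highly_similar)
```

In this example the identity equals the 90% threshold rather than exceeding it, so whether a threshold is applied as strictly greater than or as at least is a design choice that should be stated explicitly.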
In one embodiment, a threshold percent identity or identity score (e.g., above 85% identity, above 90% identity based on alignment) is used to select two regions as being similar and/or as having corresponding bins. In one embodiment, two highly similar regions may be similar over a minimum length (e.g., at least 500 or at least 1000 bases). In an embodiment, the regions may be segmental duplications. The segmental duplications might be located on the same chromosome (intrachromosomal) or different chromosome (interchromosomal). [0072] A reference sequence refers to any particular known genome sequence, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject. In one example, the reference sequence is that of a full length human genome. Such sequences may be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome such as chromosome 13. In some embodiments, a reference Y chromosome is the Y chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species. In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual. [0073] A sample, target sample, or biological sample may refer to a sample typically derived from a biological fluid, cell, tissue, organ, or organism, comprising a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence that is to be screened for SNVs or CNVs. In certain embodiments the sample comprises at least one nucleic acid sequence whose copy number is suspected of having undergone variation. 
Such samples include, but are not limited to sputum/oral fluid, amniotic fluid, blood, a blood fraction, or fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, and the like. Although the sample is often taken from a human subject (e.g., patient), the assays can be used to detect SNVs or CNVs in samples from any mammal, including, but not limited to dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc. If such methods of pretreatment are employed with respect to the sample, such pretreatment methods are typically such that the nucleic acid(s) of interest remain in the test sample, sometimes at a concentration proportional to that in an untreated test sample (e.g., namely, a sample that is not subjected to any such pretreatment method(s)). Such “treated” or “processed” samples are still considered to be biological “test” samples with respect to the methods described herein. The sample can comprise cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof. The sample can be obtained directly from a subject. The sample can be generated from another sample obtained from a subject. The other sample can be obtained directly from the subject or the other sample can be generated from another sample obtained from the subject. The computing system can store the plurality of sequence reads in its memory. 
[0074] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. For example, “a processor” can include distributed processing between multiple processors. [0075] As will be understood by one skilled in the art, for any and all purposes, such as in terms of providing a written description, all ranges disclosed herein also encompass any and all possible sub-ranges and combinations of sub-ranges thereof. Any listed range can be easily recognized as sufficiently describing and enabling the same range being broken down into at least equal halves, thirds, quarters, fifths, tenths, etc. As a non-limiting example, each range discussed herein can be readily broken down into a lower third, middle third and upper third, etc. As will also be understood by one skilled in the art all language such as “up to,” “at least,” “greater than,” “less than,” and the like include the number recited and refer to ranges which can be subsequently broken down into sub-ranges as discussed above. Finally, as will be understood by one skilled in the art, a range includes each individual member. Thus, for example, a group having 1-3 articles refers to groups having 1, 2, or 3 articles. Similarly, a group having 1-5 articles refers to groups having 1, 2, 3, 4, or 5 articles, and so forth. [0076] This written description uses examples to enable any person skilled in the art to practice the disclosed embodiments, including making and using any devices or systems and performing any incorporated methods. 
The patentable scope is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal languages of the claims.

Claims

CLAIMS What is claimed is: 1. A method for identifying copy number variants in genomic sequence data, the method comprising: receiving genomic sequence data from a sample of interest, the genomic sequence data comprising millions of base calls corresponding to nucleotides of the sample of interest; processing the genomic sequence data to identify sequence reads that align to one or both of a first region or a second region of a reference genome; determining a joint sequencing depth of corresponding bins of the first region and the second region using the identified sequence reads by combining identified sequence reads that align to corresponding bins of the first region or the second region, wherein each bin within the first region corresponds to one other bin in the second region, wherein the correspondence is based on sequence similarity, and such that the joint sequencing depth for an individual set of corresponding bins comprises identified sequence reads that are aligned to a first region bin and/or a corresponding second region bin of the individual set; normalizing the joint sequencing depth for each set of corresponding bins; using a ratio of a first count of sequence reads assigned to the first region to a second count of sequence reads assigned to the second region based on differentiating sites having different nucleotide identities between the first region and the second region to split the normalized joint sequencing depths of each set of corresponding bins between the first region bin and the corresponding second region bin to generate first region normalized depth values and second region normalized depth values; combining the first region normalized depth values and the second region normalized depth values with normalized depth values from other regions of the reference genome to generate a combined set of normalized depth values of the genomic sequence data; and identifying one or more copy number variants using the combined set of normalized depth 
values of the genomic sequence data. 2. The method of claim 1, wherein the one or more copy number variants are identified based on a normalized depth value from the first region or the second region deviating from an expected normalized depth value. 3. The method of claim 2, wherein the expected normalized depth value is based on a diploid organism. 4. The method of claim 1, wherein at least some bins within the first region have different sizes relative to one another. 5. The method of claim 4, wherein corresponding bins between the first region and the second region have a same size. 6. The method of claim 4, wherein the normalized depth values from other regions of the genome are generated using bins of a predefined size. 7. The method of claim 1, wherein boundaries of the bins within the first region and the second region are determined based on the genomic sequence data. 8. The method of claim 7, wherein the boundaries are determined based on identifying the differentiating sites in the genomic sequence data. 9. The method of claim 1, wherein boundaries of the bins within the first region and the second region are determined based on predetermined differentiating sites from population data. 10. The method of claim 1, wherein an individual differentiating site has a fixed nucleotide identity associated with the first region or the second region. 11. The method of claim 1, wherein the first region and the second region are located on different chromosomes. 12. The method of claim 1, wherein the sequence reads that align to one or both of the first region or the second region comprise ambiguously aligned sequence reads that align to both the first region and the second region. 13. The method of claim 1, wherein the first region and the second region are segmental duplications having at least 90% sequence identity with one another. 14. 
A system for identifying copy number variants in genomic sequence data, the system comprising: a sequencing device configured to generate whole genome sequence data comprising a plurality of sequence reads obtained from a sample of a subject; non-transitory memory configured to store executable instructions and the whole genome sequence data and a reference sequence; a hardware processor in communication with the non-transitory memory, the hardware processor programmed by the executable instructions to perform: generating alignments of the whole genome sequence data to the reference sequence, wherein the alignments comprise sequence reads aligned to one or both of a first region or a second region, wherein the second region is a segmental duplication of the first region; generating normalization data for the whole genome sequence data, wherein the normalization data is based on variability in the sample and sample-dependent operations of the sequencing device; determining a joint sequencing depth of corresponding bins of the first region and the second region using the aligned sequence reads by combining identified sequence reads that align to the corresponding bins of the first region or the second region, wherein each bin within the first region corresponds to one other bin in the second region, wherein the correspondence is based on sequence similarity, and such that the joint sequencing depth for an individual set of corresponding bins comprises identified sequence reads that are aligned to one or both of a first region bin or a corresponding second region bin; normalizing the joint sequencing depth for each set of corresponding bins using the normalization data; determining a ratio, for sequence reads aligned to each set of corresponding bins, based on a number of the sequence reads having a nucleotide difference at an individual differentiating site associated with the first region relative to a number of the sequence reads having the nucleotide difference at 
the individual differentiating site associated with the second region; using the ratio to split the normalized joint sequencing depths of each set of corresponding bins between the first region and the second region to generate first region normalized depth values for each first region bin of the first region and second region normalized depth values for each second region bin of the second region; combining the first region normalized depth values and the second region normalized depth values with normalized depth values from other regions of the whole genome sequence data to generate a combined set of normalized depth values of the genomic sequence data; and identifying one or more copy number variants using the combined set of normalized depth values of the whole genome sequence data. 15. The system of claim 14, wherein the individual differentiating site is at a corresponding or aligned position between the first region and the second region based on the sequence similarity. 16. The system of claim 14, wherein the normalization data is self-referential to the sample. 17. The system of claim 14, wherein the normalization data is based at least in part on mean sequencing depth of the whole genome sequence data. 18. The system of claim 14, comprising a graphical user interface configured to receive a user input to enable the executable instructions based on a quality metric of the whole genome sequence data being above a threshold. 19. The system of claim 18, wherein the quality metric is an average sequencing depth of 30 across the genome for the sample. 20. The system of claim 18, wherein the graphical user interface masks or disables the user input responsive to a determination that the quality metric is below the threshold. 21. The system of claim 14, wherein the combined set of normalized depth values are written to the non-transitory memory. 22. 
A non-transitory computer-readable memory storing executable instructions to perform: receiving alignments of whole genome sequence data to a reference sequence, the whole genome sequence data comprising a plurality of sequence reads obtained from a sample of a subject, and wherein the alignments comprise sequence reads aligned to one or both of a first region or a second region, wherein the second region is a segmental duplication of the first region; determining a joint sequencing depth of corresponding bins of the first region and the second region using the aligned sequence reads by combining identified sequence reads that align to the corresponding bins of the first region or the second region, wherein each bin within the first region corresponds to one other bin in the second region, wherein the correspondence is based on sequence similarity, and such that the joint sequencing depth for an individual set of corresponding bins comprises identified sequence reads that are aligned to one or both of a first region bin or a corresponding second region bin the individual set; normalizing the joint sequencing depth for each set of corresponding bins; determining a ratio, for sequence reads aligned to each set of corresponding bins, based on a number of the sequence reads having a nucleotide difference at an individual differentiating site associated with the first region relative to a number of the sequence reads having the nucleotide difference at the individual differentiating site associated with the second region; using the ratio to split the normalized joint sequencing depths of each set of corresponding bins between the first region and the second region to generate first region normalized depth values for each first region bin of the first region and second region normalized depth values for each second region bin of the second region; combining the first region normalized depth values and the second region normalized depth values with normalized depth values 
from other regions of the reference genome to generate a combined set of normalized depth values of the genomic sequence data; and identifying one or more copy number variants using the combined set of normalized depth values of the genomic sequence data.
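
For illustration only, the depth-splitting steps recited in the independent claims can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `BinPair`, `split_joint_depth`, and `flag_deviations` are hypothetical, normalization is assumed to be division by a single mean per-bin read count, the claimed ratio is applied as the equivalent fraction of differentiating-site reads supporting the first region, and the 50/50 fallback and the deviation tolerance are assumptions that do not appear in the claims.

```python
from dataclasses import dataclass


@dataclass
class BinPair:
    """One set of corresponding bins (hypothetical container)."""
    reads_region1: int       # reads aligned to the first region bin
    reads_region2: int       # reads aligned to the corresponding second region bin
    site_reads_region1: int  # reads with the first-region base at differentiating sites
    site_reads_region2: int  # reads with the second-region base at differentiating sites


def split_joint_depth(pairs, mean_depth):
    """Return (region1, region2) normalized depth values, one per bin pair."""
    region1, region2 = [], []
    for p in pairs:
        # Joint depth: reads aligned to either bin of the pair are pooled.
        joint = p.reads_region1 + p.reads_region2
        # Assumed normalization: divide by the mean per-bin read count.
        normalized = joint / mean_depth
        informative = p.site_reads_region1 + p.site_reads_region2
        # Assumption: split evenly when no read covers a differentiating site.
        frac1 = p.site_reads_region1 / informative if informative else 0.5
        region1.append(normalized * frac1)
        region2.append(normalized * (1.0 - frac1))
    return region1, region2


def flag_deviations(depths, expected=1.0, tolerance=0.25):
    """Flag bins whose normalized depth deviates from the expected value."""
    calls = []
    for i, d in enumerate(depths):
        if d < expected - tolerance:
            calls.append((i, "loss"))
        elif d > expected + tolerance:
            calls.append((i, "gain"))
    return calls


pairs = [
    BinPair(120, 80, 30, 30),  # both regions at the expected copy number
    BinPair(90, 60, 10, 20),   # fewer reads support the first region
]
region1, region2 = split_joint_depth(pairs, mean_depth=100)
other_regions = [1.0, 1.0]  # placeholder normalized depths from other regions
combined = other_regions + region1 + region2
calls = flag_deviations(combined)
```

In this sketch an expected value of 1.0 stands for the diploid state, so the second bin pair yields a first region value near 0.5 and a second region value near 1.0, and only the first region bin is flagged. Pooling reads before splitting means that a read placed in either copy by the aligner still contributes to the joint depth.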