
WO2025160089A1 - Custom multigenome reference construction for improved sequencing analysis of genomic samples

Custom multigenome reference construction for improved sequencing analysis of genomic samples

Info

Publication number
WO2025160089A1
WO2025160089A1 PCT/US2025/012463 US2025012463W WO2025160089A1 WO 2025160089 A1 WO2025160089 A1 WO 2025160089A1 US 2025012463 W US2025012463 W US 2025012463W WO 2025160089 A1 WO2025160089 A1 WO 2025160089A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotide
variant
assembly
haplotype
multigenome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/012463
Other languages
French (fr)
Inventor
Massimiliano Rossi
Severine Catreux
Rami Mehio
Michael Ruehle
John Cooper Roddey
Jennifer DEL GIUDICE
Guillaume Rizk
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Illumina Inc
Original Assignee
Illumina Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Illumina Inc

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • nucleobase sequencing platforms determine individual nucleobases within sequences from genomic samples’ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods.
  • SBS sequencing-by-synthesis
  • existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset.
  • a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls.
  • existing sequencing platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer. For instance, such software (i) maps and aligns nucleotide reads determined by the sequencing platform for a genomic sample with (ii) a reference genome comprising at least a primary contiguous sequence. Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify genotype and/or variants within a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), or structural variants.
  • SNPs single nucleotide polymorphisms
  • indels insertions or deletions
  • existing sequencing systems often utilize reference genomes that misrepresent certain populations and foment inaccurate read mapping and alignment and mistaken variant calling.
  • some existing sequencing systems use a linear reference genome that purportedly represents a consensus or example of genes and other nucleotide sequences of an organism.
  • GRCh38 from the Genome Reference Consortium
  • pangenome reference representing a more diverse group of individuals.
  • various organizations, such as the Human Pangenome Reference Consortium (HPRC), the Chinese Pangenome Consortium (CPC), and the Arab Pangenome Reference (APR), have produced assemblies of diverse populations of genomic samples for use in establishing inclusive pangenome references. Many such assemblies cover nucleotide sequences that span entire chromosomes of the respective genomic sample, from one telomere to another telomere, many with separate complete sequences for both maternal and paternal alleles.
  • pangenome reference from a set of assemblies presents additional challenges.
  • many existing sequencing systems utilize structural variant detection methods that align assemblies with a linear reference genome on a whole contiguous sequence basis (e.g., aligning entire chromosome sequences from telomere to telomere) and filter the aligned sequences to identify structural variants. Due in part to their focus on whole sequence alignment for structural variant detection, such existing systems often exhibit relatively poor performance in improving variant calls for smaller variants, such as SNPs and indels, relative to conventional methods of aligning reads to a linear reference genome without augmentation by a pangenome reference created by existing reference-assembly systems.
  • This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that (i) generate a multigenome reference comprising haplotype variants identified within a set of nucleotide assemblies and/or (ii) utilize the generated multigenome reference to determine alignments and/or genotype calls.
  • the disclosed systems can generate a multi-sample phased variant dataset representing a multigenome reference by comparing simulated reads sampled from a set of nucleotide assemblies with a linear reference genome to identify genotype calls for each nucleotide assembly of the set of nucleotide assemblies.
  • the disclosed systems can generate a custom multigenome reference from a selected set of nucleotide assemblies representing a target population corresponding to a target genomic sample.
  • the disclosed systems can utilize the custom multigenome reference to determine read alignments and genotype calls relative to a linear reference genome for the target genomic sample.
  • the disclosed systems can utilize various pipelines to (a) process each assembly haplotype independently to identify variants relative to a reference genome, (b) combine resulting variant calls for respective assembly haplotypes (e.g., maternal and paternal haplotypes) of each nucleotide assembly to generate phased variant data for each nucleotide assembly, and (c) aggregate the phased variant data of multiple nucleotide assemblies to compile a multi-sample phased variant dataset for the given set of haplotype-resolved nucleotide assemblies.
  • assembly haplotypes e.g., maternal and paternal haplotypes
  • the disclosed methods can process a given assembly haplotype by (a) generating a set of simulated reads from chromosome sequences within the given assembly haplotype, (b) aligning the simulated reads to a linear reference genome, and (c) generating variant calls relative to the linear reference genome according to the read alignments to generate the variant data for the assembly haplotype.
  • the disclosed systems can generate a customized multigenome reference with improved accuracy and efficiency over existing systems.
  • FIG. 1 illustrates an environment in which a custom multigenome construction system can operate in accordance with one or more embodiments of the present disclosure.
  • FIG. 2 illustrates an overview of the custom multigenome construction system generating a multigenome reference from a plurality of nucleotide assemblies in accordance with one or more embodiments of the present disclosure.
  • FIG. 3A illustrates the custom multigenome construction system generating haplotype variant data for an assembly haplotype in accordance with one or more embodiments of the present disclosure.
  • FIG. 3B illustrates the custom multigenome construction system generating phased variant data for a diploid haplotype-resolved nucleotide assembly in accordance with one or more embodiments of the present disclosure.
  • FIG. 3C illustrates the custom multigenome construction system generating a multi-sample phased variant dataset from a plurality of haplotype-resolved nucleotide assemblies in accordance with one or more embodiments of the present disclosure.
  • FIG. 3D illustrates the custom multigenome construction system generating phased variant data for a haplotype-resolved assembly of any ploidy level in accordance with one or more embodiments of the present disclosure.
  • FIG. 3E illustrates the custom multigenome construction system generating phased variant data for a diploid haplotype-resolved assembly utilizing a simulated-read variant calling pipeline and a whole-contiguous sequence structural variant calling pipeline in accordance with one or more embodiments of the present disclosure.
  • FIGS. 4A-4B further illustrate the custom multigenome construction system utilizing two respective multigenome construction pipelines to generate phased multi-sample variant datasets in accordance with one or more embodiments of the present disclosure.
  • FIG. 5 illustrates a graphical user interface comprising selectable options for generating a custom multigenome reference in accordance with one or more embodiments of the present disclosure.
  • FIG. 6 illustrates two end-to-end analysis pipelines for generating and utilizing a multigenome reference to determine genotype calls for a target genomic sample utilizing (i) an existing sequencing system and a graph pangenome reference from an existing reference-assembly system and (ii) the custom multigenome construction system in accordance with one or more embodiments of the present disclosure.
  • FIG. 7 illustrates a table of comparative experimental results of determining variant calls from nucleotide reads that are mapped and aligned to a reference genome utilizing (i) a pangenome reference generated by an existing reference-assembly system and (ii) a multigenome reference generated by the disclosed custom multigenome construction system in accordance with one or more embodiments of the present disclosure.
  • FIG. 8 illustrates a graph of comparative experimental results of determining variant calls from nucleotide reads that are mapped and aligned to a reference genome utilizing (i) pangenome references generated by various existing reference-assembly systems, and (ii) a multigenome reference generated by the disclosed custom multigenome construction system in accordance with one or more embodiments of the present disclosure.
  • FIG. 9A illustrates a flowchart of a series of acts for generating a multi-sample phased variant dataset representing a multigenome reference in accordance with one or more embodiments of the present disclosure.
  • FIG. 9B illustrates a flowchart of a series of acts for generating genotype calls for a target genomic sample utilizing a multigenome reference generated according to one or more embodiments of the present disclosure.
  • FIG. 10 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
  • This disclosure describes embodiments of a custom multigenome construction system that can (i) generate a multigenome reference identifying haplotype variants from a set of nucleotide assemblies for a target population and/or (ii) utilize the multigenome reference to determine read alignments and genotype calls for a target genomic sample with improved accuracy over existing sequencing systems.
  • the custom multigenome construction system utilizes long read simulation and variant calling to generate a multi-sample variant dataset from a collection of nucleotide assemblies to establish a multigenome reference for a target population.
  • the custom multigenome construction system can determine read alignments and genotype calls relative to a linear reference genome for a target genomic sample.
  • the custom multigenome construction system processes haplotype-resolved nucleotide assemblies to generate phased variant data for each nucleotide assembly and compiles the phased variant data to generate a multi-sample phased variant dataset representing a multigenome reference.
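
The combining and compiling steps described above can be illustrated with a minimal sketch. It assumes that variant calling has already produced one set of variants per assembly haplotype; the `Variant` tuple layout, the function name `build_multisample`, and the VCF-style `1|0` genotype strings are illustrative assumptions, not structures defined by the disclosure. An allele absent from a haplotype's call set is treated here as a reference allele (`0`), which is a simplification.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical variant record: (contig, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def build_multisample(
    assemblies: Dict[str, List[Set[Variant]]]
) -> Dict[Variant, Dict[str, str]]:
    """Compile per-haplotype variant sets into a multi-sample phased table.

    `assemblies` maps a sample name to its ordered list of haplotype call
    sets (two sets for a diploid assembly, more for higher ploidy).
    The result maps each variant site to a phased genotype per sample.
    """
    # Union of every variant seen in any haplotype of any assembly.
    sites = sorted(
        set().union(*(calls for haps in assemblies.values() for calls in haps))
    )
    table: Dict[Variant, Dict[str, str]] = {}
    for site in sites:
        table[site] = {
            # One allele index per haplotype, joined with the phased separator.
            sample: "|".join("1" if site in calls else "0" for calls in haps)
            for sample, haps in assemblies.items()
        }
    return table


if __name__ == "__main__":
    maternal = {("chr1", 100, "A", "G"), ("chr1", 250, "T", "TC")}
    paternal = {("chr1", 100, "A", "G")}
    for site, genotypes in build_multisample({"S1": [maternal, paternal]}).items():
        print(site, genotypes)
```

Because each sample carries one list entry per haplotype, the same sketch covers haploid, diploid, and polyploid assemblies; multi-allelic sites are kept as separate records rather than merged.
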
  • the custom multigenome construction system processes each assembly haplotype within a haplotype-resolved nucleotide assembly by (a) generating a set of simulated reads from chromosome sequences within the given assembly haplotype, (b) aligning the simulated reads to a linear reference genome, and (c) generating variant calls relative to the linear reference genome according to the read alignments to generate the phased variant data for the given assembly haplotype.
  • the custom multigenome construction system utilizes one or more structural variant detection models to further identify structural variants within the given assembly haplotype and includes the identified structural variants in the phased variant data for the given assembly haplotype.
  • the custom multigenome construction system samples (or segments on an overlapping-basis) assembly sequences corresponding to individual chromosomes within a given nucleotide assembly to generate simulated nucleotide reads for use in determining variant calls to include within a multigenome reference.
  • the custom multigenome construction system samples, from a given nucleotide assembly, a plurality of overlapping nucleotide sequences in a progressive series of segments to generate the simulated nucleotide reads with a predetermined read length per simulated nucleotide read and/or at a target read depth across the simulated nucleotide reads.
  • the custom multigenome construction system utilizes a modified read length and/or a modified read depth, greater than or lesser than the respective predetermined read length and/or target read depth, at one or more regions of the given nucleotide assembly. Accordingly, the custom multigenome construction system can generate simulated nucleotide reads from nucleotide assemblies for use in generating a multigenome reference, as described in further detail below in relation to the various figures.
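
The sampling of overlapping segments at a predetermined read length and target read depth can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `simulate_reads` and the rule that consecutive reads start `read_length // depth` bases apart are hypothetical choices made here, and the region-specific modified read lengths or depths mentioned above are not modeled.

```python
from typing import Iterator, Tuple


def simulate_reads(
    sequence: str, read_length: int, depth: int
) -> Iterator[Tuple[int, str]]:
    """Yield (start, read) pairs tiled across an assembly sequence.

    Reads are unaltered substrings of `sequence`. Consecutive reads start
    `read_length // depth` bases apart (an assumed spacing rule), so interior
    positions are covered by roughly `depth` overlapping reads.
    """
    if read_length <= 0 or depth <= 0:
        raise ValueError("read_length and depth must be positive")
    n = len(sequence)
    if n <= read_length:
        # Sequence shorter than one read: emit it whole (if non-empty).
        if n:
            yield 0, sequence
        return
    step = max(1, read_length // depth)
    start = 0
    while start + read_length < n:
        yield start, sequence[start:start + read_length]
        start += step
    # Anchor the final read at the end so the last bases are not dropped.
    yield n - read_length, sequence[n - read_length:]


if __name__ == "__main__":
    for start, read in simulate_reads("ACGTACGTAC", read_length=4, depth=2):
        print(start, read)
```

In practice the read length would be on the order of tens or hundreds of thousands of nucleobases, and each chromosome sequence of an assembly haplotype would be tiled independently, which also makes the work easy to parallelize.
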
  • the custom multigenome construction system provides several technical advantages, benefits, and/or improvements over existing sequencing systems and methods.
  • the custom multigenome construction system improves the accuracy of read alignments and subsequent genomic analysis by utilizing a multigenome reference generated according to the disclosed embodiments.
  • the custom multigenome construction system generates a custom multigenome reference including phased variant data identified using simulated nucleotide reads sampled from a set of nucleotide assemblies.
  • the set of nucleotide assemblies is selected for a target population corresponding to a target genomic sample.
  • the custom multigenome construction system can more accurately align nucleotide reads with a corresponding linear reference genome — especially in more complex or “difficult-to-call” genomic regions (e.g., regions comprising lower confidence base calls in general) — than existing sequencing systems that utilize reference genomes with poor representation of population diversity.
  • the custom multigenome construction system can also determine more accurate genotype calls and/or variant calls with a higher confidence that such calls match or differ from the reference base of a reference genome compared to existing sequencing systems that use pangenome references generated by existing reference-assembly systems.
  • the custom multigenome construction system can generate simulated reads by sampling overlapping sequences from a given nucleotide assembly to provide reliably accurate inputs for identifying variants within the given nucleotide assembly using the pipelines described herein (e.g., in relation to FIGS. 3A-3E below).
  • simulated reads comprising discrete and accurate sequences from nucleotide assemblies and utilizing the simulated reads to identify haplotype variants
  • embodiments of the custom multigenome construction system can further implement improved alignment and variant calling for target genomic samples, relative to existing systems utilizing pangenome references generated by other methods.
  • the custom multigenome construction system can utilize an additional analysis pipeline to identify structural variant calls based on a whole-contig alignment of contiguous sequences from a haplotype assembly.
  • the custom multigenome construction system 106 can further improve the fidelity of resulting multigenome references for further improved alignment and variant calling of target genomic samples, particularly for structural variants.
  • This disclosure describes and depicts examples of such improved genotype and/or variant calls below in relation to FIGS. 7-8.
  • the custom multigenome construction system improves computational efficiency in an alignment process when generating a multigenome reference relative to existing reference-assembly systems.
  • Some existing reference-assembly systems consume excessive processing resources and time to align lengthy alternate contiguous sequences (e.g., contiguous sequences representing whole chromosomes or millions of base pairs from whole chromosomes) with a reference genome.
  • the custom multigenome system can generate simulated reads sampled from nucleotide assemblies (e.g., encoding individual chromosomes) to identify haplotype variants relative to a linear reference genome and implement parallel processing to align such simulated reads with respective regions of a reference genome.
  • telomere-to-telomere alignment utilized by aligning lengthy alternate contiguous sequences (e.g., whole chromosomes) performed by many existing reference-assembly systems.
  • a custom multigenome reference generated by the custom multigenome construction system consumes less computer memory compared to the augmented graph multigenome references used by many existing sequencing systems.
  • the custom multigenome construction system can generate a multigenome reference with more flexibility than conventional genome references.
  • the custom multigenome construction system can generate a custom multigenome reference indicating haplotype variants identified within a set of nucleotide assemblies specifically selected to represent a target population corresponding to a particular target genomic sample.
  • the custom multigenome construction system 106 can flexibly generate a custom multigenome reference incorporating haploid, diploid, and/or polyploid genome information in organisms and/or genomic regions of any ploidy level (e.g., as described below in relation to FIG. 3D).
  • the custom multigenome construction system can determine read alignments and subsequent variant calls for the particular target genomic sample with further increased accuracy and improved efficiency compared to methods implementing a linear reference genome or a non-specific or less specific multigenome reference.
  • the custom multigenome construction system utilizes a custom multigenome reference comprising haplotype variants derived from a discrete subset of nucleotide assemblies particularly selected for a given genomic sample, resulting in improved alignment accuracy, reduced memory consumption, and increased processing speeds.
  • target genomic sample refers to a target genome or portion of a genome undergoing an assay or sequencing.
  • a target genomic sample includes one or more sequences of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence).
  • a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases.
  • a target genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below.
  • the target genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
  • a known genomic sample refers to a well-characterized genomic sample that serves as a reference material for benchmarking and improving sequencing analysis models.
  • a known genomic sample can include a human genome sample (e.g., HG001, HG004) commonly used or made available in public repositories, such as Genome-in-a-Bottle (GIAB).
  • GIAB Genome-in-a-Bottle
  • HPRC Human Pangenome Reference Consortium
  • CPC Chinese Pangenome Consortium
  • APR Arab Pangenome Reference
  • nucleotide read refers to an inferred or predicted sequence of one or more nucleotide bases (or nucleobase pairs) from all or part of a sample genomic sequence (e.g., a sample genomic sequence, complementary DNA).
  • a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a genomic sample.
  • the custom multigenome construction system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell.
  • a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads).
  • nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
  • CCS circular consensus sequencing
  • simulated nucleotide read refers to a sequence of nucleotide bases sampled from a nucleotide assembly.
  • simulated nucleotide reads comprise overlapping nucleotide sequences sampled from a haplotype-resolved nucleotide assembly in a progressive series of segments.
  • a simulated nucleotide read spans tens of thousands of nucleobases in length (e.g., 20,000 or 30,000 nucleobases) to hundreds of thousands of nucleobases in length (e.g., 200,000 or 300,000 nucleobases).
  • a simulated nucleotide read can include a sequence of successive nucleotides indicated within a respective nucleotide assembly without alteration. Because simulated nucleotide reads can overlap with one another when aligned with a linear reference genome, simulated nucleotide reads simulate nucleotide reads for a genomic sample with respect to alignment but nevertheless differ from the nucleotide reads described in the paragraph above in part because simulated nucleotide reads do not represent sequences determined from oligonucleotides extracted from a genomic sample and immobilized on a nucleotide-sample slide (e.g., using SBS and a flow cell), from a nanopore-based process, or from other similar processes.
  • the custom multigenome construction system generates simulated nucleotide reads comprising discrete segments of individual chromosomes within a given nucleotide assembly for use in identifying haplotype variants to include within a multigenome reference.
  • nucleotide assembly refers to an assembled set of nucleotide sequences of a sample genome.
  • a nucleotide assembly can include a comprehensive assembly of an entire genome of an organism, including the specific arrangement of nucleotides for each respective chromosome of the sample genome.
  • a nucleotide assembly can include assembled nucleotides within a limited span of a respective sample genome, such as an individual chromosome or a particular region thereof (e.g., the Major Histocompatibility Complex (MHC) located on the sixth chromosome of a human genome).
  • MHC Major Histocompatibility Complex
  • each nucleotide assembly comprises a complete set of nucleotide sequences assembled for a known genomic sample (e.g., a well-characterized genomic sample, as described above).
  • haplotype-resolved nucleotide assembly refers to a genome reconstruction — in whole or in part (e.g., one or more chromosomes or other genomic regions) — in which the allelic nucleotides on homologous chromosomes are explicitly resolved. Accordingly, in a haplotype-resolved nucleotide assembly, a genome reconstruction includes allelic nucleotides that have been phased (e.g., organized or labeled) according to a parental haplotype.
  • a diploid haplotype-resolved nucleotide assembly includes allelic nucleotides, from a pair of homologous chromosomes, that are resolved for two parental haplotypes.
  • a polyploid haplotype-resolved nucleotide assembly includes allelic nucleotides for haplotypes corresponding to three or more homologous chromosomes.
  • a haplotype-resolved nucleotide assembly, whether diploid or polyploid, includes assembled nucleotide sequences in regions of a different ploidy level (e.g., a haploid region within a diploid genome or regions of local ploidy variation within a polyploid genome).
  • phasing refers to a process of separating nucleotide reads (e.g., simulated nucleotide reads) or continuous sequences into respective parental haplotypes. For instance, phasing can occur by identifying unique alleles or variants (e.g., SNPs, indels) on nucleotide reads in a genomic region, organizing such nucleotide reads in the genomic region according to the unique alleles or variants, and identifying subsets of nucleotide reads according to a maternal haplotype or paternal haplotype based on the organization or grouping.
  • unique alleles or variants e.g., SNPs, indels
  • haplotype phasing model that uses a hidden Markov model (HMM) or another algorithm that can be used to perform haplotype phasing, such as Segmented HAPlotype Estimation and Imputation Tool (SHAPEIT), BEAGLE, Eagle2, or WhatsHap.
  • HMM hidden Markov model
  • SHAPEIT Segmented HAPlotype Estimation and Imputation Tool
  • assembly haplotype refers to an allele-specific sequence of nucleotides from a nucleotide assembly.
  • an assembly haplotype can include assembled nucleotides from an allele of a single chromosome or, in some cases, can include all of the allelic nucleotides corresponding to a particular individual haplotype or a particular parent (e.g., as indicated within a haplotype-resolved genome sample).
  • haplotype refers to nucleotide sequences that are present in an organism (or present in organisms from a population) and inherited from one or more ancestors.
  • haplotype can include alleles or other nucleotide sequences present in organisms of a population and inherited together by such organisms respectively from a single parent.
  • haplotypes include a set of SNPs and/or other variants (e.g., insertions or deletions (indels), structural variants, and/or microsatellites) on the same chromosome that tend to be inherited together.
  • data representing a haplotype or a set of different haplotypes are stored or otherwise accessible on a haplotype database.
  • genomic coordinate refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome).
  • a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome.
  • a genomic coordinate or coordinates may include a number, name, or other identifier for a somatic or sex chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870).
  • a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY).
  • a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001).
  • a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
  • genomic region refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome.
  • a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome.
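
The coordinate and region notations described above (a chromosome or source identifier, a colon, and a position or position range, or a bare position) can be parsed with a short sketch such as the following. The `GenomicRegion` type and `parse_region` function are hypothetical names introduced for illustration; a single coordinate is represented here as a region whose start and end are equal.

```python
import re
from typing import NamedTuple, Optional


class GenomicRegion(NamedTuple):
    contig: Optional[str]  # chromosome or source identifier; None if absent
    start: int
    end: int


# Optional "<contig>:" prefix, then a position or a "start-end" range.
_REGION = re.compile(r"^(?:(?P<contig>.+):)?(?P<start>\d+)(?:-(?P<end>\d+))?$")


def parse_region(text: str) -> GenomicRegion:
    """Parse strings such as 'chr1:1234570-1234870', 'mt:16568', or '29727'."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not a genomic coordinate or region: {text!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] is not None else start
    if end < start:
        raise ValueError(f"region end precedes start: {text!r}")
    return GenomicRegion(match["contig"], start, end)


if __name__ == "__main__":
    for example in ("chr1:1234570-1234870", "SARS-CoV-2:29001", "29727"):
        print(parse_region(example))
```

The greedy contig group lets identifiers that themselves contain hyphens, such as SARS-CoV-2, parse correctly, because only the text after the last colon is treated as the position.
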
  • the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species.
  • a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium.
  • a reference genome includes multi-base codes.
  • a reference genome may include a graph reference genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
  • multigenome reference refers to a reference dataset or assembly that includes genetic information from multiple sample genomes to represent genetic diversity within a population.
  • a multigenome reference includes such a dataset or assembly that includes diverse genetic information from multiple sample genomes of a target population and, therefore, constitutes a customized multigenome reference.
  • a customized multigenome reference can be generated.
  • the term “pangenome reference” refers to a multigenome reference representing collective information of a species, or a subset thereof. In some cases, for example, a pangenome reference includes the shared genetic elements of the corresponding population, as well as a representation of the genetic diversity of the corresponding population.
  • reference haplotype database refers to a database encoding variant data for population haplotypes of a multigenome or pangenome.
  • reference haplotype database includes complete or partially complete nucleotide sequences (e.g., alternate contiguous sequences) for population haplotypes of a multigenome or pangenome.
  • a reference haplotype database encodes variant data for population haplotypes having allele-variant differences from locally distinct population haplotypes within respective genomic regions of a corresponding reference genome.
  • the reference haplotype database comprises a haplotype data structure comprising a hierarchical partitioning of different genomic regions of a reference genome into a collection of bins covering respective spans of a linear reference genome (e.g., as represented by a primary contiguous sequence), the bins comprising variant data for the respective spans.
  • nucleobase call refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a genomic sample.
  • a nucleobase call can indicate a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls).
  • a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell).
  • a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call.
  • the term “variant” refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome.
  • a variant includes a SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence or a reference genome.
  • haplotype variant refers to a variant associated with a particular haplotype, such as indicated within a multigenome reference as described herein.
  • a structural variant refers to a variation (e.g., deletion, insertion, translocation, inversion) in a structure of an organism’s chromosome or a variation to nucleotide sequences of the organism’s chromosome (e.g., a sample genomic sequence).
  • a structural variant includes a variation to a threshold number of base pairs (e.g., > 35 or > 50 base pairs) within an organism’s chromosome.
  • a structural variant includes an insertion or deletion exceeding a threshold number of base pairs, a duplication exceeding a threshold number of base pairs, an inversion, a translocation, or a copy number variation (CNV).
  • the threshold number of base pairs for a structural variant may be different, such as, but not limited to, 16, 25, 32, 45, 100, or 1,000 base pairs.
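The size threshold that separates a structural variant from a smaller indel can be sketched as follows. This is only an illustrative classifier based on allele lengths: the function name `classify_variant` and the returned labels are hypothetical, the default of 50 base pairs is one of the example thresholds mentioned above, and inversions, translocations, and copy number variations are not handled.

```python
def classify_variant(ref: str, alt: str, sv_threshold: int = 50) -> str:
    """Classify a ref/alt allele pair by length difference.

    A length change exceeding `sv_threshold` base pairs is labeled a
    structural variant; smaller length changes are labeled indels.
    """
    if ref == alt:
        return "reference"
    if len(ref) == len(alt):
        # Same length, different bases: single- or multi-nucleotide change.
        return "SNP" if len(ref) == 1 else "MNP"
    size = abs(len(alt) - len(ref))
    return "structural_variant" if size > sv_threshold else "indel"


print(classify_variant("A", "G"))             # SNP
print(classify_variant("A", "ATT"))           # indel
print(classify_variant("A", "A" + "T" * 60))  # structural_variant
```

Passing a different `sv_threshold` (for example 35 or 1,000) changes where the boundary falls without changing the rest of the logic.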
  • a “variant call” refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference.
  • a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
  • a “reference call” refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference.
  • a non-variant or reference call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
  • genotype call refers to a determination or prediction of a particular genotype of a genomic sample or a sample nucleotide sequence at a genomic locus.
  • a genotype call can include a prediction of a particular genotype of a genomic sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region.
  • a genotype call includes a determination or prediction that a genomic sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0
  • a genotype call can include a prediction of a variant or reference base for one or more alleles of a genomic sample and indicate zygosity with respect to a variant or reference base.
  • a genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
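To make the notion of zygosity concrete, the sketch below interprets a diploid genotype string. It assumes the allele-index notation used in VCF genotype fields (0 for the reference allele, positive integers for alternate alleles, `/` for unphased and `|` for phased calls); that notation and the function name `zygosity` are assumptions for illustration, not something this passage prescribes.

```python
import re


def zygosity(genotype: str) -> str:
    """Describe a diploid genotype string such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", genotype)
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {genotype!r}")
    if "." in alleles:
        return "missing"          # at least one allele was not called
    if alleles[0] != alleles[1]:
        return "heterozygous"     # two different alleles
    if alleles[0] == "0":
        return "homozygous_reference"
    return "homozygous_variant"


for gt in ("0/0", "0|1", "1/1", "1/2"):
    print(gt, zygosity(gt))
```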
  • the custom multigenome construction system identifies and/or stores nucleotide sequences, corresponding sequencing metrics, and/or other sequencing data within one or more sequencing data files.
  • sequencing data file refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, and so forth.
  • sequence file refers to a digital file that indicates one or more nucleotide sequences.
  • a sequence file can include a digital file comprising a standard text-based format, such as FASTA files (indicating one or more nucleotide sequences) or FASTQ files (indicating quality scores and/or other metrics in association with one or more nucleotide sequences) storing nucleotide sequences expressed in single-letter code (e.g., A, T, C, G).
  • the custom multigenome construction system 106 receives a FASTA file comprising nucleotide sequences spanning an entire sample genome or a portion thereof (e.g., an individual chromosome or genomic region).
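For readers unfamiliar with the FASTA layout, here is a minimal sketch of reading such a file into named sequences (for example, one entry per chromosome or per allele-specific contig). The helper `read_fasta` is hypothetical and deliberately simple: it holds every sequence in memory, which would not be practical for whole-genome assemblies.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into {sequence name: nucleotide string}."""
    sequences: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record; store the previous one.
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before any '>' header")
            chunks.append(line.upper())
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


example = [">chr1 maternal", "ACGTAC", "gtac", ">chr2", "GGCC"]
print(read_fasta(example))
```

The same function accepts an open file handle, since a text file iterates line by line.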
  • an alignment data file refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence.
  • an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence.
  • an alignment data file can include further information regarding nucleotide reads, mapping and alignment results, population haplotype data, and so forth.
  • genotype-call data file refers to a digital file that indicates one or more genotype calls (e.g., including reference and/or variant calls) compared to a reference genome along with other information pertaining to the genotype calls (e.g., variant calls).
  • a genotype-call data file can include a variant call file, such as but not limited to a Variant Call Format (VCF) file (including a multi-sample Variant Call Format (msVCF) file as described below in relation to FIG. 3C).
  • a genotype-call data file can include a General Feature Format (GFF) file, a Genome Variant Format (GVF), or other suitable data file comprising genotype calls for a sample nucleotide sequence (such as a nucleotide assembly or a sequence thereof).
  • variant call file refers to a particular genotype-call data file that comprises a text file format that contains information about variants at specific genomic coordinates.
  • a variant call file can include metainformation lines, a header line, and data lines where each data line contains information about a single genotype call (e.g., a single variant).
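The three kinds of lines in a variant call file can be told apart by their prefixes. The sketch below assumes the usual VCF text conventions (`##` for metainformation lines, a single `#CHROM` header line, tab-separated data lines) and uses a hypothetical helper `parse_vcf`; it does not decode INFO or FORMAT fields.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], Optional[List[str]], List[Dict[str, object]]]:
    """Split VCF text into metainformation lines, header columns, and data records."""
    meta: List[str] = []
    header: Optional[List[str]] = None
    records: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])             # metainformation line
        elif line.startswith("#"):
            header = line[1:].split("\t")      # column names, e.g. CHROM, POS, ...
        else:
            if header is None:
                raise ValueError("data line found before the header line")
            record: Dict[str, object] = dict(zip(header, line.split("\t")))
            record["POS"] = int(record["POS"])  # positions are integers
            records.append(record)
    return meta, header, records


vcf_text = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t1234570\t.\tA\tG\t50\tPASS\t.\tGT\t0|1",
]
meta, header, records = parse_vcf(vcf_text)
print(records[0]["CHROM"], records[0]["POS"], records[0]["sample1"])
```

In a multi-sample file, each additional sample simply adds one more column after `FORMAT`, so the same parser yields one genotype entry per sample in each record.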
  • a “probabilistic model” refers to a type of statistics-based prediction model.
  • a probabilistic variant call model refers to a Bayesian probability model that generates variant calls based on nucleotide reads (or simulated nucleotide reads).
  • Such a model can process or analyze sequencing metrics corresponding to read pileups (e.g., multiple nucleotide reads corresponding to a single genomic coordinate), including mapping quality, base quality, and various hypotheses including foreign reads, missing reads, joint detection, and more.
  • Probabilistic models may likewise include multiple components, including, but not limited to, different software applications or components for mapping and aligning, sorting, duplicate marking, computing read pileup depths, and variant calling.
  • a probabilistic variant call model refers to an ILLUMINA DRAGEN model for variant calling functions and mapping and alignment functions (e.g., a DRAGEN variant caller or “DRAGEN VC”).
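To give a sense of how a Bayesian model can turn a read pileup into a genotype call, the following is a textbook-style sketch, not the DRAGEN model or any model described here. It considers only base qualities at one biallelic site; the function name, the prior values, and the uniform error model (a miscalled base is equally likely to be any of the other three) are all assumptions made for illustration. Mapping quality and the foreign-read and joint-detection hypotheses mentioned above are ignored.

```python
import math
from typing import Dict, List, Optional, Tuple


def genotype_posteriors(
    pileup: List[Tuple[str, int]],
    ref: str,
    alt: str,
    priors: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Posterior probability of 0/0, 0/1, and 1/1 given (base, Phred quality) pairs."""
    if priors is None:
        # Hypothetical priors that strongly favor the reference genotype.
        priors = {"0/0": 0.9985, "0/1": 0.001, "1/1": 0.0005}
    genotypes = {"0/0": (ref, ref), "0/1": (ref, alt), "1/1": (alt, alt)}
    log_post: Dict[str, float] = {}
    for gt, (allele1, allele2) in genotypes.items():
        total = math.log(priors[gt])
        for base, qual in pileup:
            # Phred quality to error probability, capped so a base is never
            # treated as worse than a uniformly random call.
            err = min(10 ** (-qual / 10), 0.75)
            p1 = 1 - err if base == allele1 else err / 3
            p2 = 1 - err if base == allele2 else err / 3
            # Each read is assumed to come from either allele with equal chance.
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_post[gt] = total
    # Normalize in log space to avoid underflow on deep pileups.
    peak = max(log_post.values())
    norm = peak + math.log(sum(math.exp(v - peak) for v in log_post.values()))
    return {gt: math.exp(v - norm) for gt, v in log_post.items()}


pileup = [("A", 30)] * 10 + [("G", 30)] * 10
posteriors = genotype_posteriors(pileup, ref="A", alt="G")
print(max(posteriors, key=posteriors.get))
```

With ten high-quality reads supporting each allele, the heterozygous genotype receives nearly all of the posterior mass despite its small prior.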
  • machine-learning model refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of ground truth data.
  • a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness.
  • Example machine-learning models include various types of decision trees, support vector machines, Bayesian networks, or neural networks.
  • neural network refers to a machine-learning model that can be trained and/or tuned based on inputs to determine classification or approximate unknown functions.
  • a neural network can include a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., classifications, predictions, or digital content) based on a plurality of inputs provided to the neural network.
  • a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions of data.
  • a neural network can include a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a self-attention transformer neural network, or a generative adversarial neural network.
  • haplotype-aware alignment model refers to a model configured to determine alignments of nucleotide reads with a reference genome while considering the underlying haplotype structure of the corresponding genomic sample.
  • a haplotype-aware alignment model can consider allelic coordinates of nucleotide reads when determining alignments between the nucleotide reads and respective genomic regions of a linear reference genome.
  • the term “read-based phasing model” refers to a model configured to leverage sequencing information to infer the haplotypes of a genomic sample.
  • a read-based phasing model can consider metrics and information from sequencing procedures, from mapping and alignment of nucleotide reads, and/or from metrics related to variant calling procedures to determine the specific divisions of nucleotides between alleles along chromosomes of the corresponding genomic sample.
  • the term “whole-contig alignment” refers to a process for aligning contiguous sequences (often referred to as “contigs”) from a genome assembly to a reference genome sequence or to individual primary or alternative contiguous sequences thereof.
  • each contiguous sequence of a genome assembly can represent an entire chromosome (e.g., from telomere to telomere) or an entire genome (e.g., spanning all chromosomes).
  • the term “whole-genome alignment” is a type of whole-contig alignment that refers to a process of aligning contiguous sequences of an entire genome assembly to a reference genome. For whole-genome alignment, accordingly, each contiguous sequence from each chromosome of an entire genome assembly is aligned with a reference genome.
  • FIG. 1 illustrates a schematic diagram of a computing system 100 in which a custom multigenome construction system 106 operates in accordance with one or more embodiments.
  • the computing system 100 includes a sequencing device 102 connected to a local device 108 (e.g., a local server device), one or more server device(s) 110, and a client device 114.
  • the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114 can communicate with each other via a network 118.
  • the network 118 comprises any suitable network over which computing devices can communicate.
  • the sequencing device 102 comprises a computing device and a sequencing device system 104 for sequencing a genomic sample or other nucleic-acid polymer.
  • the sequencing device 102 analyzes nucleotide fragments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems either directly or indirectly on the sequencing device 102.
  • the sequencing device 102 receives nucleotide-sample slides (e.g., flow cells) comprising nucleotide fragments extracted from samples and further copies and determines the nucleobase sequence of such extracted nucleotide fragments.
  • the sequencing device 102 utilizes sequencing-by-synthesis (SBS) techniques to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads.
  • the sequencing device 102 bypasses the network 118 and communicates directly with the local device 108 or the client device 114.
  • the sequencing device 102 can further store the nucleobase calls as part of base-call data that is formatted as a binary base call (BCL) file and send the BCL file to the local device 108 and/or the server device(s) 110.
  • the local device 108 is located at or near a same physical location of the sequencing device 102. Indeed, in some embodiments, the local device 108 and the sequencing device 102 are integrated into a same computing device.
  • the local device 108 may run the custom multigenome construction system 106 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data in conjunction with a multigenome reference (e.g., a pangenome reference) provided by the custom multigenome construction system 106 or accessed within a database 120. As shown in FIG.
  • the sequencing device 102 may send (and the local device 108 may receive) base-call data generated during a sequencing run of the sequencing device 102.
  • the local device 108 may align nucleotide reads with a reference genome utilizing a multigenome reference generated according to one or more embodiments and/or stored within the database 120 and determine genetic variants based on the aligned nucleotide reads.
  • the local device 108 may also communicate with the client device 114.
  • the local device 108 can send data to the client device 114, including a binary alignment map (BAM) file, a variant call format (VCF) file, a multi-sample phased VCF representing a multigenome reference, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics.
  • the server device(s) 110 are located remotely from the local device 108 and the sequencing device 102. Similar to the local device 108, in some embodiments, the server device(s) 110 include a version of (or are otherwise able to access or implement) the custom multigenome construction system 106, accessible via a sequencing system 112 of the server device(s) 110.
  • the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data, or by generating a multigenome reference utilizing a collection of nucleotide assemblies stored within the database 120 or otherwise accessed by the custom multigenome construction system 106.
  • the sequencing device 102 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 102.
  • the server device(s) 110 may also communicate with the client device 114.
  • the server device(s) 110 can send data to the client device 114, including BAM files, VCF files, or other sequencing related information.
  • the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. Moreover, as shown in FIG. 1, the server device(s) 110 are in communication, either directly or via the network 118, with a database 120 storing nucleotide assemblies to be utilized by the custom multigenome construction system 106 when generating a multigenome reference (e.g., a pangenome reference) for a target population or a target genomic sample. Further, the database 120 can additionally store multigenome references generated by the custom multigenome construction system 106 for subsequent use in sequencing analysis of target genomic sequences.
  • the custom multigenome construction system 106 can generate, encode, and/or implement the aforementioned multigenome reference(s) to determine alignments of nucleotide reads from a target genomic sample with a reference genome.
  • the custom multigenome construction system 106 can generate a multigenome reference (e.g.
  • the client device 114 can generate, store, receive, and send digital data.
  • the client device 114 can receive sequencing data from the local device 108 or receive call files (e.g., BCL) and sequencing metrics from the sequencing device 102.
  • the client device 114 may communicate with the local device 108 or the server device(s) 110 to receive a VCF comprising genotype or variant calls and/or other metrics, such as base-call-quality metrics or pass-filter metrics.
  • the client device 114 can accordingly present or display information pertaining to variant calls or other genotype calls within a graphical user interface of the sequencing application 116 to a user associated with the client device 114.
  • the client device 114 can present genotype calls, variant calls, and/or sequencing metrics for a sequenced genomic sample within a graphical user interface of the sequencing application 116.
  • the client device 114 via the sequencing application 116, can present or display a user interface comprising selectable options for generating a custom multigenome reference (e.g., by selecting a custom set of nucleotide assemblies representing a target population and/or a target genomic sample), as further described below in relation to FIG. 5.
  • Although FIG. 1 depicts the client device 114 as a desktop or laptop computer, the client device 114 may comprise various types of client devices.
  • the client device 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices.
  • the client device 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 114 are discussed below with respect to FIG. 10.
  • the client device 114 includes the sequencing application 116.
  • the sequencing application 116 may be a web application or a native application stored and executed on the client device 114 (e.g., a mobile application, desktop application).
  • the sequencing application 116 can include instructions that (when executed) cause the client device 114 to receive data from the custom multigenome construction system 106 and present, for display at the client device 114, base-call data or data from an alignment data file or VCF.
  • the sequencing application 116 can instruct the client device 114 to display summaries for multiple sequencing runs.
  • a version of the custom multigenome construction system 106 may be located and/or implemented (e.g., entirely or in part) on the client device 114 or the sequencing device 102.
  • the custom multigenome construction system 106 is implemented by one or more other components of the computing system 100, such as the local device 108.
  • the custom multigenome construction system 106 can be implemented in a variety of different ways across the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114.
  • the custom multigenome construction system 106 can be downloaded from the server device(s) 110 to the client device 114 and/or the local device 108 where all or part of the functionality of the custom multigenome construction system 106 is performed at each respective device within the computing system 100.
  • the custom multigenome construction system 106 generates a multigenome reference (e.g., a pangenome reference) comprising variant data identified using simulated reads sampled from a set of nucleotide assemblies.
  • FIG. 2 depicts the custom multigenome construction system 106 generating a multigenome reference 208 comprising variant data 206 identified using simulated nucleotide reads 204 sampled from a set of nucleotide assemblies 202.
  • the custom multigenome construction system 106 identifies (or receives) the set of nucleotide assemblies 202 as input for constructing the multigenome reference 208.
  • the set of nucleotide assemblies 202 include complete nucleotide sequences for each chromosome of a respective set of known genomic samples.
  • each nucleotide assembly of the set of nucleotide assemblies 202 comprises a haplotype-resolved nucleotide assembly denoting allelic nucleotide sequences for each respective chromosome of a corresponding sample genome.
  • the set of nucleotide assemblies 202 can be provided in a suitable text format, such as in the form of a FASTA digital file denoting nucleotide sequences divided by individual chromosomes and, in some cases, by specific alleles (e.g., maternal or paternal alleles) of each individual chromosome.
  • the custom multigenome construction system 106 generates the simulated nucleotide reads 204 to include nucleotide sequences from each nucleotide assembly of the set of nucleotide assemblies 202.
  • the custom multigenome construction system 106 generates the simulated nucleotide reads 204 by sampling segments of nucleotide sequences of individual chromosomes within each nucleotide assembly of the set of nucleotide assemblies 202 (e.g., as further described below in relation to FIG. 3A).
  • the custom multigenome construction system 106 utilizes the simulated nucleotide reads 204 to generate the variant data 206 comprising variant calls for each respective nucleotide assembly of the set of nucleotide assemblies 202.
  • the custom multigenome construction system 106 can utilize various analysis pipelines comprising a variety of sequence analysis models to (i) determine read alignments between the simulated nucleotide reads 204 and respective genomic regions of a reference genome (e.g., a linear reference genome) and (ii) generate variant calls (e.g., the variant data 206) relative to the reference genome for the simulated nucleotide reads 204 according to the read alignments.
  • the custom multigenome construction system 106 utilizes the variant data 206 for the set of nucleotide assemblies 202 to construct the multigenome reference 208 (e.g., as further described below in relation to FIGS. 3C and 6).
  • the custom multigenome construction system 106 utilizes various pipelines to construct a multigenome reference (e.g., a pangenome reference) from a set of nucleotide assemblies for genomic samples representing a target population.
  • FIGS. 3A, 3B, and 3C respectively illustrate the custom multigenome construction system 106 (a) generating haplotype variant data 312 for an individual assembly haplotype 304 of a nucleotide assembly 302, (b) generating phased variant data 328 for a given diploid haplotype-resolved nucleotide assembly 322, and (c) generating a multi-sample phased variant dataset 338 for a set of haplotype-resolved nucleotide assemblies 332. Furthermore, FIG. 3D illustrates the custom multigenome construction system 106 generating phased variant data 348 for a given haplotype-resolved nucleotide assembly 342 of any ploidy level (e.g., a haploid assembly, a diploid assembly, or a polyploid assembly).
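One way to picture how per-haplotype variant data could be combined into phased variant data for a diploid assembly is sketched below. This is an assumption made for illustration rather than the procedure of FIG. 3B: each haplotype is represented as a set of hypothetical `(chrom, pos, ref, alt)` tuples, and the phased genotype records which haplotype carries each variant using VCF-style `|` notation.

```python
from typing import List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def phase_diploid(hap1: Set[Variant], hap2: Set[Variant]) -> List[Tuple[str, int, str, str, str]]:
    """Merge two haplotype variant sets into records with a phased genotype.

    '1|0' means only the first haplotype carries the variant, '0|1' only
    the second, and '1|1' both.
    """
    records = []
    for variant in sorted(hap1 | hap2):
        genotype = f"{int(variant in hap1)}|{int(variant in hap2)}"
        records.append((*variant, genotype))
    return records


maternal = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}
paternal = {("chr1", 250, "C", "T"), ("chr2", 40, "G", "GA")}
for record in phase_diploid(maternal, paternal):
    print(record)
```

In this simplified form, two different alternate alleles at the same position produce two separate records rather than a single multi-allelic record, and plain string sorting does not order chromosome names naturally (for example, chr10 sorts before chr2).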
  • the custom multigenome construction system 106 performs a haplotype reference construction 300 to generate the haplotype variant data 312 for the individual assembly haplotype 304 of the nucleotide assembly 302.
  • the nucleotide assembly 302 includes haplotype-resolved sequences for respective alleles of each chromosome represented by the nucleotide assembly 302.
  • the custom multigenome construction system 106 processes a nucleotide assembly that lacks an indication of allele-specific sequences for individual chromosomes.
  • the custom multigenome construction system 106 performs an additional phasing analysis to determine the haplotype variant data 312 for individual assembly haplotypes of the nucleotide assembly 302 (e.g., as discussed below in relation to FIG. 4A).
  • the custom multigenome construction system 106 identifies (or receives) the assembly haplotype 304 for processing per the haplotype reference construction 300 pipeline.
  • the assembly haplotype 304 includes allelic sequences for every chromosome of the nucleotide assembly 302.
  • the custom multigenome construction system 106 considers an allelic sequence from a particular chromosome or genomic region of the nucleotide assembly 302 (e.g., to generate a multigenome limited to the particular chromosome or genomic region).
  • the custom multigenome construction system 106 generates a plurality of simulated nucleotide reads 306 from the assembly haplotype 304 (e.g., as described above in relation to FIG. 2).
  • the custom multigenome construction system 106 generates the simulated nucleotide reads 306 by sampling a plurality of nucleotide sequences from the nucleotide assembly 302 (e.g., by sampling segments of the assembly haplotype 304).
  • the custom multigenome construction system 106 samples the nucleotide sequences with an initial read length of up to 300,000 nucleobases.
  • the initial read length comprises 30,000 nucleobases.
  • the custom multigenome construction system 106 samples overlapping segments to simulate a target read depth among the simulated nucleotide reads 306 for the assembly haplotype 304. Also, in one or more embodiments, the custom multigenome construction system 106 samples the segments with one or more of (a) a modified read depth lesser than or greater than the target read depth or (b) a modified read length greater than or lesser than the initial read length in one or more regions of the respective nucleotide assembly (e.g., to cover relatively ambiguous regions with increased resolution).
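The sampling of overlapping segments to reach a target read depth can be sketched as a sliding window over an assembly haplotype. The function name `simulate_reads`, its return shape, and the choice of a fixed step are assumptions for illustration; the 30,000-nucleobase default mirrors the initial read length mentioned above, and region-specific depth or length adjustments are omitted.

```python
from typing import List, Tuple


def simulate_reads(sequence: str, read_length: int = 30_000, target_depth: int = 2) -> List[Tuple[int, str]]:
    """Tile a haplotype sequence with overlapping segments.

    Returns (assembly_offset, read_sequence) pairs. The step between read
    starts is read_length // target_depth, so positions away from the
    sequence ends are covered by about `target_depth` reads.
    """
    if read_length <= 0 or target_depth <= 0:
        raise ValueError("read_length and target_depth must be positive")
    step = max(1, read_length // target_depth)
    reads = []
    start = 0
    while start < len(sequence):
        end = min(start + read_length, len(sequence))
        reads.append((start, sequence[start:end]))
        if end == len(sequence):
            break  # the final read reaches the end of the sequence
        start += step
    return reads


haplotype = "ACGT" * 25  # toy 100-base haplotype
reads = simulate_reads(haplotype, read_length=20, target_depth=4)
print(len(reads), reads[0][0], reads[-1][0])
```

Keeping each read's offset within the assembly is what later allows alignments to be checked against the read's known origin.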
  • the custom multigenome construction system 106 performs a mapping and alignment 308 to determine read alignments between the simulated nucleotide reads 306 and respective genomic regions of a reference genome (e.g., a linear reference genome).
  • the custom multigenome construction system 106 utilizes an alignment model, such as a haplotype-aware alignment model, to identify read alignments for the simulated nucleotide reads 306.
  • the custom multigenome construction system 106 identifies, within the read alignments identified for the simulated nucleotide reads 306, at least one mis-mapped read based on comparing the respective genomic regions of the read alignments with relative coordinates of the simulated nucleotide reads 306 within the nucleotide assembly 302. Accordingly, in some embodiments, the custom multigenome construction system 106 utilizes relative positions of nucleotide sequences corresponding to the simulated nucleotide reads 306 to ensure accurate mapping and alignment 308 of the simulated nucleotide reads 306.
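Because each simulated read has a known position within the assembly, its alignment can be checked for consistency with the alignments of the other reads from the same contig. The sketch below shows one hypothetical heuristic and is not the comparison the system necessarily performs: a read is flagged when it maps to a different reference chromosome than most reads from its contig, or when the difference between its reference position and its assembly position departs from the median difference by more than a tolerance. The record keys and the tolerance value are assumptions.

```python
from collections import Counter
from statistics import median
from typing import Dict, List


def find_mismapped(alignments: List[Dict[str, object]], tolerance: int = 10_000) -> List[str]:
    """Flag reads from one assembly contig whose alignment looks out of place.

    Each alignment is a dict with keys 'read_id', 'asm_pos' (offset in the
    assembly contig), 'ref_chrom', and 'ref_pos' (aligned reference position).
    """
    if not alignments:
        return []
    majority_chrom = Counter(a["ref_chrom"] for a in alignments).most_common(1)[0][0]
    offsets = [a["ref_pos"] - a["asm_pos"] for a in alignments if a["ref_chrom"] == majority_chrom]
    typical_offset = median(offsets)
    flagged = []
    for a in alignments:
        offset = a["ref_pos"] - a["asm_pos"]
        if a["ref_chrom"] != majority_chrom or abs(offset - typical_offset) > tolerance:
            flagged.append(a["read_id"])
    return flagged


alignments = [
    {"read_id": "r0", "asm_pos": 0, "ref_chrom": "chr1", "ref_pos": 1000},
    {"read_id": "r1", "asm_pos": 100, "ref_chrom": "chr1", "ref_pos": 1100},
    {"read_id": "r2", "asm_pos": 200, "ref_chrom": "chr5", "ref_pos": 300},
    {"read_id": "r3", "asm_pos": 300, "ref_chrom": "chr1", "ref_pos": 900000},
    {"read_id": "r4", "asm_pos": 400, "ref_chrom": "chr1", "ref_pos": 1402},
]
print(find_mismapped(alignments))
```

A single contig-wide median is a coarse simplification, since large insertions or deletions legitimately shift the offset for all downstream reads; comparing each read with its neighbors in assembly order would tolerate such shifts better.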
  • the custom multigenome construction system 106 Based on the read alignments identified by the mapping and alignment 308, the custom multigenome construction system 106 performs a variant calling 310 to generate variant calls within the assembly haplotype 304, as represented by the simulated nucleotide reads 306.
  • the custom multigenome construction system 106 utilizes a probabilistic variant caller model to perform the variant calling 310.
  • the custom multigenome construction system 106 can utilize a variant caller machine-learning model to perform the variant calling 310.
  • the custom multigenome construction system 106 generates the haplotype variant data 312 for the assembly haplotype 304 by generating variant calls relative to the reference genome for the simulated nucleotide reads 306, according to the read alignments determined by the mapping and alignment 308. Accordingly, the custom multigenome construction system 106 can utilize the haplotype reference construction 300 to generate haplotype variant data for a given assembly haplotype of a given nucleotide assembly. In some embodiments, the custom multigenome construction system 106 outputs haplotype variant data in the form of an assembly haplotype-specific variant call file (e.g., a VCF file).
  • the custom multigenome construction system 106 utilizes an additional pipeline to identify structural variants within a nucleotide assembly or an assembly haplotype thereof. As shown in FIG. 3A, for example, the custom multigenome construction system 106 performs a whole-genome or whole-contig alignment 307 of the assembly haplotype 304 with a reference genome. In some embodiments, for example, the custom multigenome construction system 106 aligns one or more entire contiguous sequences (e.g., representing an entire genomic sample from end to end or one or more entire chromosomes from telomere to telomere) of the assembly haplotype 304 with the reference genome.
  • the custom multigenome construction system 106 performs the whole-genome/contig alignment 307 for multiple assembly haplotypes of a diploid or polyploid haplotype-resolved nucleotide assembly to a reference genome using a haplotype-aware alignment model.
  • the custom multigenome construction system 106 performs the whole-genome/contig alignment 307 for contiguous sequence(s) representing a haploid genome of an organism (or, for short reference, a haploid sample genome), where, for example, such an alignment for a haploid genome facilitates identifying structural variants within a haploid nucleotide assembly.
  • the custom multigenome construction system 106 performs the whole-genome/contig alignment 307 for contiguous sequence(s) representing a haploid genomic region of a diploid genome or polyploid genome of an organism, where, for example, such alignments facilitate identifying structural variants within a haploid genomic region of a diploid haplotype-resolved nucleotide assembly or polyploid haplotype-resolved nucleotide assembly.
  • the custom multigenome construction system 106 performs a structural variant detection 309 to identify structural variations relative to the reference genome and accordingly generates haplotype structural variant (SV) data 311 for the assembly haplotype 304.
  • the custom multigenome construction system 106 generates a variant call format (VCF) file comprising the haplotype SV data 311.
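One way to picture the structural variant detection 309 is as a scan over the CIGAR string of a whole-contig alignment that reports large insertions and deletions. The following is a minimal sketch, not the disclosed method: the function name, the 50 bp size cutoff, and the tuple output are assumptions made for illustration, and real assembly-based callers handle many more cases (inversions, split alignments, and so forth).

```python
import re

# Assumption: 50 bp is a commonly used minimum size for structural variants;
# the source does not specify a threshold.
MIN_SV_LEN = 50

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def structural_variants_from_cigar(ref_start, cigar, min_len=MIN_SV_LEN):
    """Return (type, ref_pos, length) for large insertions and deletions.

    ref_start is the 0-based reference coordinate where the contig
    alignment begins (hypothetical input for this sketch).
    """
    svs = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= min_len:
            svs.append(("INS", ref_pos, length))
        elif op == "D" and length >= min_len:
            svs.append(("DEL", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return svs
```

Each reported tuple could then be written as one record of a haplotype-specific VCF; small insertions and deletions below the cutoff are left to the simulated-read pipeline.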
  • the custom multigenome construction system 106 merges the haplotype SV data 311 with the haplotype variant data 312 (generated utilizing the simulated nucleotide reads 306 as described above).
  • the custom multigenome construction system 106 adds each structural variant identified by the structural variant detection 309 to the haplotype variant data 312. Alternatively, in some embodiments, the custom multigenome construction system 106 identifies a subset of structural variants from the haplotype SV data to merge with the haplotype variant data 312. In one or more embodiments, for example, the custom multigenome construction system 106 utilizes a structural variant comparison model (e.g., Truvari) to identify duplicate, nearly duplicate, or inconsistent structural variants between the haplotype SV data 311 and the haplotype variant data 312 (e.g., prior to the illustrated merging of variant data).
  • the custom multigenome construction system 106 retains a given duplicate, near duplicate, or inconsistent structural variant identified within the simulated nucleotide reads 306 while filtering (e.g., deleting or not including) a corresponding structural variant identified within the haplotype SV data 311. Moreover, in some embodiments, the custom multigenome construction system 106 filters (e.g., deletes or does not include) one or more structural variants from the haplotype SV data 311 (e.g., based on determining that one or more quality metrics associated with the one or more structural variants fall below a predetermined quality threshold).
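The merge logic described above (keep the read-derived call when a contig-derived structural variant duplicates it, and drop low-quality contig-derived calls) can be sketched as follows. This is an illustrative approximation only: `SVCall`, the distance and size-ratio thresholds, and the quality cutoff are hypothetical and are not taken from the source or from Truvari's actual matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    pos: int
    sv_type: str
    length: int  # assumed to be a positive number of bases
    qual: float


def is_near_duplicate(a, b, max_dist=500, min_size_ratio=0.7):
    """Treat two calls as the same event if type, position and size agree closely."""
    if a.chrom != b.chrom or a.sv_type != b.sv_type:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.length, b.length) / max(a.length, b.length) >= min_size_ratio


def merge_calls(read_calls, contig_sv_calls, min_qual=20.0):
    """Merge contig-derived SV calls into read-derived calls.

    Read-derived calls are always retained; a contig-derived call is
    skipped if it falls below min_qual or near-duplicates a read-derived call.
    """
    merged = list(read_calls)
    for sv in contig_sv_calls:
        if sv.qual < min_qual:
            continue
        if any(is_near_duplicate(sv, kept) for kept in read_calls):
            continue
        merged.append(sv)
    return sorted(merged, key=lambda c: (c.chrom, c.pos))
```

Preferring the read-derived call on conflict mirrors the retention behavior described in the text; a different policy would only require changing which list seeds `merged`.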
  • the custom multigenome construction system 106 further performs a diplotype reference construction 320 to generate the phased variant data 328 for the diploid haplotype-resolved nucleotide assembly 322.
  • the diploid haplotype-resolved nucleotide assembly 322 comprises allele-specific sequences for individual chromosomes (e.g., maternal and paternal chromosomes).
  • separate alleles within a haplotype-resolved nucleotide assembly are designated without express indications of parental lineage (e.g., allelic sequences for individual chromosomes provided separately without ‘maternal’ or ‘paternal’ labeling).
  • the custom multigenome construction system 106 can (i) identify a first assembly haplotype and a second assembly haplotype within a given haplotype-resolved nucleotide assembly — without identifying whether the first or second assembly haplotype is maternal or paternal — and (ii) generate a first set of simulated nucleotide reads for the first assembly haplotype and a second set of simulated nucleotide reads for the second assembly haplotype.
  • the custom multigenome construction system 106 extracts (or otherwise identifies or receives) respective assembly haplotypes 324a and 324b from the diploid haplotype-resolved nucleotide assembly 322 for the diplotype reference construction 320 of the phased variant data 328.
  • the assembly haplotype 324a comprises maternal allelic sequences from individual chromosomes and the assembly haplotype 324b comprises paternal allelic sequences from individual chromosomes.
  • each assembly haplotype extracted from a haplotype-resolved nucleotide assembly for analysis via the diplotype reference construction 320 includes an allelic sequence for an individual chromosome.
  • the custom multigenome construction system 106 can generate the phased variant data 328 for the whole genome sequence of the diploid haplotype-resolved nucleotide assembly 322, for individual chromosomes included therein, or for any discrete genomic region within the diploid haplotype-resolved nucleotide assembly 322.
  • the custom multigenome construction system 106 utilizes haplotype reference constructions 300a and 300b to generate haplotype variant data 326a and 326b for the assembly haplotypes 324a and 324b, respectively. Accordingly, each assembly haplotype is respectively processed per the haplotype reference construction 300 pipeline, as described above in relation to FIG. 3A, to generate corresponding haplotype reference data comprising variant calls for the respective assembly haplotypes.
  • the custom multigenome construction system 106 compiles (e.g., combines) the haplotype variant data 326a for the assembly haplotype 324a with the haplotype variant data 326b for the assembly haplotype 324b into the phased variant data 328 to include diploid genotype calls relative to the reference genome.
  • the custom multigenome construction system 106 combines two variant call files (e.g., VCF files) comprising the haplotype variant data 326a and 326b, respectively, to generate a phased variant call file (e.g., a sample-specific phased VCF file) comprising the phased variant data 328.
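Combining two haplotype-specific call sets into diploid genotype calls can be illustrated with a small sketch. It assumes each haplotype's variants are given as a set of `(chrom, pos, ref, alt)` tuples called against the same reference; the function name and this representation are hypothetical. As a simplification, two different alternate alleles at the same position are emitted as separate records rather than as a single multi-allelic record.

```python
def phase_diploid(hap1_variants, hap2_variants):
    """Build phased genotype records from two haplotype call sets.

    Each input is a set of (chrom, pos, ref, alt) tuples. The returned
    records carry a VCF-style phased genotype, e.g. "1|0" when only the
    first haplotype has the alternate allele.
    """
    records = []
    for key in sorted(hap1_variants | hap2_variants):
        gt = f"{int(key in hap1_variants)}|{int(key in hap2_variants)}"
        records.append((*key, gt))
    return records
```

Because each call set comes from a single assembly haplotype, the phase is known by construction and no statistical phasing step is needed in this sketch.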
  • the custom multigenome construction system 106 compiles phased variant data 336a, 336b through 336n (e.g., sample-specific VCF files) corresponding to multiple respective haplotype-resolved nucleotide assemblies 334a, 334b through 334n of the set of haplotype-resolved nucleotide assemblies 332 to generate the multi-sample phased variant dataset 338.
  • generating the multi-sample phased variant dataset 338 further comprises aligning the phased variant data of each respective haplotype-resolved nucleotide assembly according to genomic coordinates within the reference genome and inserting placeholders (e.g., within rows of a multi-sample VCF for respective nucleotide assemblies) representing deletions within respective nucleotide assemblies.
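The coordinate alignment step can be pictured as building one row per reference site with one genotype column per sample, filling a placeholder where a sample has no record at that site. The sketch below is an assumption-laden illustration: the function name, the dictionary layout, and the `"."` placeholder are hypothetical, and a fuller implementation would distinguish a sample that matches the reference from one whose allele is missing because of an overlapping deletion.

```python
def build_multisample_rows(per_sample, placeholder="."):
    """Align per-sample phased calls on shared reference coordinates.

    per_sample maps a sample name to a dict of
    {(chrom, pos, ref, alt): phased_genotype}. Returns the sorted sample
    names and a list of (site, genotypes) rows, where genotypes follows
    the sample order and uses the placeholder for missing records.
    """
    samples = sorted(per_sample)
    sites = sorted({site for calls in per_sample.values() for site in calls})
    rows = []
    for site in sites:
        genotypes = [per_sample[s].get(site, placeholder) for s in samples]
        rows.append((site, genotypes))
    return samples, rows
```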
  • the custom multigenome construction system 106 generates the multi-sample phased variant dataset 338, which in turn can be utilized to generate a multigenome reference for mapping and alignment of target genomic samples.
  • the custom multigenome construction system 106 generates a graph multigenome reference or a reference haplotype database based on the genotype data included within the multi-sample phased variant dataset 338.
  • the custom multigenome construction system 106 identifies (or receives) the set of haplotype-resolved nucleotide assemblies 332 comprising the haplotype-resolved nucleotide assemblies 334a-334n for constructing a multigenome reference (e.g., a pangenome reference) for a target population.
  • the custom multigenome construction system 106 processes each of the haplotype-resolved nucleotide assemblies 334a, 334b through 334n individually via diplotype reference constructions 320a, 320b through 320n, respectively, to determine the respective phased variant data 336a-336n.
  • each assembly of the set of haplotype-resolved nucleotide assemblies is respectively processed per the diplotype reference construction 320 pipeline, as described above in relation to FIGS. 3A-3B, to generate corresponding phased variant data comprising genotype calls (e.g., phased variant calls) for the respective nucleotide assemblies.
  • the custom multigenome construction system 106 compiles the phased variant data 336a-336n for the respective haplotype-resolved nucleotide assemblies 334a-334n to generate the multi-sample phased variant dataset 338 for the set of haplotype-resolved nucleotide assemblies 332.
  • the custom multigenome construction system 106 combines respective variant call format (VCF) files comprising the phased variant data 336a-336n to generate a multi-sample Variant Call Format (msVCF) file comprising the multi-sample phased variant dataset 338.
  • the custom multigenome construction system 106 performs a ploidy reference construction 340 to generate the phased variant data 348 for the haplotype-resolved nucleotide assembly 342 (e.g., a haplotype assembly for an organism or genomic region of any ploidy level).
  • the haplotype-resolved nucleotide assembly 342 comprises allele-specific sequences for individual chromosomes (e.g., haplotype sequences corresponding to one or multiple chromosomes, depending on the ploidy level of the respective organism or the respective genomic region).
  • the custom multigenome construction system 106 can generate phased variant data 348 for any number of assembly haplotypes.
  • one assembly haplotype is shown as an assembly haplotype 344a for a haploid organism or genomic region; two assembly haplotypes are shown as assembly haplotypes 344a-344b for a diploid organism or genomic region; or three or more assembly haplotypes are shown as assembly haplotypes 344a-344n for a polyploid organism or genomic region.
  • the custom multigenome construction system 106 extracts (or otherwise identifies or receives) the assembly haplotype 344a for haploid reference construction, the assembly haplotypes 344a-344b for diploid reference construction, or the assembly haplotypes 344a-344n for polyploid reference construction from the haplotype-resolved nucleotide assembly 342 for the ploidy reference construction 340 of the phased variant data 348.
  • the haplotype-resolved nucleotide assembly 342 comprises a haploid nucleotide assembly limited to the assembly haplotype 344a, such as an individual assembly haplotype from a diploid or polyploid assembly or an individual assembly from a haploid genome or haploid region of a genome.
  • in embodiments directed to a haploid nucleotide assembly, the custom multigenome construction system 106 generates haplotype variant data 346a for the assembly haplotype 344a according to the haplotype reference construction 300a (e.g., as described above in relation to FIG. 3A).
  • the haplotype-resolved nucleotide assembly 342 comprises a diploid nucleotide assembly or polyploid nucleotide assembly with two assembly haplotypes shown as assembly haplotypes 344a-344b or more than two assembly haplotypes shown as assembly haplotypes 344a-344n, respectively.
  • the assembly haplotype 344a comprises allelic sequences from individual chromosomes of a first haplotype (“hap1”)
  • the assembly haplotype 344b comprises allelic sequences from individual chromosomes of a second haplotype (“hap2”)
  • the assembly haplotype 344n comprises allelic sequences from individual chromosomes of a final haplotype (“hapN”) within the haplotype-resolved nucleotide assembly 342.
  • each assembly haplotype extracted from a haplotype-resolved nucleotide assembly for analysis via the ploidy reference construction 340 includes an allelic sequence for an individual chromosome.
  • the custom multigenome construction system 106 can generate the phased variant data 348 for the whole genome sequence of the haplotype-resolved nucleotide assembly 342, for individual chromosomes included therein, or for any discrete genomic region within the haplotype-resolved nucleotide assembly 342, including genomic regions of local ploidy variation within a diploid or polyploid genome.
  • the custom multigenome construction system 106 utilizes haplotype reference construction 300a for haploid reference construction, haplotype reference construction 300a and 300b for diploid reference construction, or haplotype reference construction 300a through 300n for polyploid reference construction to generate haplotype variant data 346a, 346a-346b, or 346a-346n for the assembly haplotypes 344a, 344a-344b, or 344a-344n, respectively.
  • each assembly haplotype is respectively processed per the haplotype reference construction 300 pipeline, as described above in relation to FIG. 3A, to generate corresponding haplotype reference data comprising variant calls for the respective assembly haplotypes.
  • the custom multigenome construction system 106 compiles (e.g., combines) the haplotype variant data 346a for the assembly haplotype 344a with (i) the haplotype variant data 346b for the assembly haplotype 344b or (ii) each of the variant data 346a-346n for the respective assembly haplotypes 344a-344n into the phased variant data 348 to include phased genotype calls (e.g., specified by haplotype) relative to the reference genome.
  • the custom multigenome construction system 106 combines two or more variant call files (e.g., VCF files) comprising the haplotype variant data 346a-346b or 346a-346n, respectively, to generate a phased variant call file (e.g., a sample-specific phased VCF file) comprising the phased variant data 348.
  • the custom multigenome construction system 106 can further generate a multi-sample phased variant dataset by combining phased variant data (e.g., the phased variant data 348) for multiple haplotype-resolved nucleotide assemblies of any ploidy level (e.g., the haplotype-resolved nucleotide assembly 342).
  • the custom multigenome construction system 106 combines a plurality of phased variant call format (VCF) files (e.g., sample-specific VCF files) comprising phased variant data for the respective haplotype-resolved nucleotide assemblies to generate a multi-sample Variant Call Format (msVCF) file comprising the multi-sample phased variant dataset.
  • FIG. 3E shows an exemplary schematic of the custom multigenome construction system 106 generating phased variant data 362 for a diploid haplotype-resolved nucleotide assembly 352 utilizing a simulated-read variant calling pipeline and a whole-contiguous sequence structural variant calling pipeline in accordance with one or more embodiments.
  • the custom multigenome construction system 106 uses a haplotype-assembly-to-VCF model (e.g., such as DRAGEN hap-asm2vcf) to determine variant calls based on simulated nucleotide reads derived from a haplotype-resolved nucleotide assembly.
  • the custom multigenome construction system 106 uses an assembly-based structural variant (SV) caller (e.g., Dipcall, Smartie-SV, Structural Variant Identification using Mapped long read Assembly (SVIM-ASM), or Phased Assembly Variant (PAV)) to determine structural variant calls based on contiguous sequences derived from the haplotype-resolved nucleotide assembly.
  • the custom multigenome construction system 106 further merges (i) variant calls derived from simulated nucleotide reads and (ii) structural variant calls derived from contiguous sequences of a haplotype-resolved nucleotide assembly into respective datasets for each parental haplotype and further merges the consolidated variant-and-structural-variant-call data into a phased genotype-call data file (e.g., merged VCF).
  • the custom multigenome construction system 106 identifies (or receives) the diploid haplotype-resolved nucleotide assembly 352 comprising allele-specific sequences for individual chromosomes (e.g., maternal and paternal chromosomes).
  • the diploid haplotype-resolved nucleotide assembly 352 is designated without express indications of parental lineage (e.g., allelic sequences for individual chromosomes provided separately without “maternal” or “paternal” labeling).
  • the custom multigenome construction system 106 extracts, receives, or otherwise identifies a first assembly haplotype (shown as an assembly haplotype 354a) and a second assembly haplotype (shown as an assembly haplotype 354b) within the diploid haplotype-resolved nucleotide assembly 352, with or without identifying whether the assembly haplotype 354a or the assembly haplotype 354b is maternal or paternal.
  • the custom multigenome construction system 106 extracts, receives, or otherwise identifies the respective assembly haplotypes 354a and 354b from the diploid haplotype-resolved nucleotide assembly 352 for the diplotype reference construction of the phased variant data 362.
  • the assembly haplotype 354a comprises maternal allelic sequences from individual chromosomes
  • the assembly haplotype 354b comprises paternal allelic sequences from individual chromosomes.
  • the custom multigenome construction system 106 utilizes simulated-read variant calling 355a and 355b to generate variant calls for the assembly haplotype 354a and the assembly haplotype 354b, respectively, based on simulated nucleotide reads respectively generated from the assembly haplotype 354a and the assembly haplotype 354b (e.g., as described above in relation to FIG. 3A).
  • the custom multigenome construction system 106 can use a variant call model (e.g., DRAGEN) and a corresponding haplotype-assembly-to-VCF model (e.g., hap-asm2vcf) to process simulated nucleotide reads into variant calls within a genotype-call data file (e.g., VCF). Accordingly, as shown in FIG.
  • the custom multigenome construction system 106 generates maternal haplotype variant data 357a (e.g., a first maternal VCF or other genotype-call data file) and paternal haplotype variant data 357b (e.g., a first paternal VCF or other genotype-call data file) with the variant calls generated by the simulated-read variant calling 355a for the assembly haplotype 354a and the simulated-read variant calling 355b for the assembly haplotype 354b, respectively.
  • the custom multigenome construction system 106 utilizes a pipeline for whole-contig structural variant calling 356a and 356b to generate structural variant calls for the assembly haplotype 354a and the assembly haplotype 354b, respectively, based on respective whole-contig alignments between a reference genome and the assembly haplotype 354a or the assembly haplotype 354b (e.g., as described above in relation to FIG. 3A).
  • the custom multigenome construction system 106 can use an assembly-based SV caller (e.g., Dipcall, Smartie-SV, SVIM-ASM, or PAV) to process contiguous sequences into structural variant calls within a genotype-call data file (e.g., VCF). Accordingly, as shown in FIG.
  • the custom multigenome construction system 106 generates maternal haplotype SV data 358a (e.g., a second maternal VCF or other genotype-call data file) and paternal haplotype SV data 358b (e.g., a second paternal VCF or other genotype-call data file) with the structural variant calls generated by the whole-contig structural variant calling 356a for the assembly haplotype 354a and the whole-contig structural variant calling 356b for the assembly haplotype 354b, respectively.
  • the custom multigenome construction system 106 merges the maternal haplotype variant data 357a and the maternal haplotype SV data 358a to generate merged maternal data 360a (e.g., a merged maternal VCF or other genotype-call data file) for the assembly haplotype 354a. Also, the custom multigenome construction system 106 merges the paternal haplotype variant data 357b and the paternal haplotype SV data 358b to generate merged paternal data 360b (e.g., a merged paternal VCF or other genotype-call data file) for the assembly haplotype 354b.
  • the custom multigenome construction system 106 utilizes a structural variant comparison model to identify duplicate structural variants, near-duplicate structural variants, or inconsistent structural variants between variant calls generated utilizing the simulated-read variant calling 355a or 355b and the structural variant calls generated utilizing the whole-contig structural variant calling 356a or 356b, respectively (e.g., consistent with the description above in relation to FIG. 3A).
  • the custom multigenome construction system 106 compiles (e.g., combines) the merged maternal data 360a for the assembly haplotype 354a with the merged paternal data 360b for the assembly haplotype 354b into the phased variant data 362 to include diploid genotype calls relative to the reference genome.
  • the custom multigenome construction system 106 combines two variant call or other genotype-call data files (e.g., VCF files) comprising the merged maternal data 360a and the merged paternal data 360b, respectively, to generate a phased genotype-call data file (e.g., a sample-specific phased VCF or other genotype-call data file) comprising the phased variant data 362. Furthermore, in some embodiments, the custom multigenome construction system 106 generates a multi-sample phased variant dataset by compiling (e.g., combining) phased variant data for multiple diploid haplotype-resolved nucleotide assemblies (e.g., as described above in relation to FIG. 3C).
  • the custom multigenome construction system 106 utilizes different pipelines for generating a multigenome reference from respectively different sequence data for known genomic samples representing a target population.
  • FIGS. 4A and 4B respectively illustrate the custom multigenome construction system 106 (a) utilizing a read-based pipeline 400a to generate a partially phased multi-sample variant dataset 412 from a set of known genomic samples 402 and (b) utilizing an assembly-based pipeline 400b to generate a phased multi-sample variant dataset 432 from a set of haplotype-resolved nucleotide assemblies 422.
  • the read-based pipeline 400a takes as input a set of nucleotide reads 404 corresponding to the set of known genomic samples 402.
  • the custom multigenome construction system 106 receives a respective set of FASTQ sequence files comprising the nucleotide reads 404 from the respective set of known genomic samples 402.
  • the read-based pipeline 400a processes nucleotide sequences that are not organized according to separate assembly haplotypes.
  • the nucleotide reads 404 comprise simulated nucleotide reads that are not preemptively phased for generating haplotype-specific variant data.
  • the custom multigenome construction system 106 performs mapping and alignment 406 of the set of nucleotide reads 404 to determine alignments with respect to a reference genome (e.g., a linear reference genome).
  • the custom multigenome construction system 106 generates an alignment data file (e.g., a BAM file) comprising read alignments between the nucleotide reads 404 and the reference genome.
  • the custom multigenome construction system 106 outputs or stores the results of the mapping and alignment 406 in a different format and/or medium (e.g., storing the alignment data in cache for use in subsequent analysis).
  • based on the read alignments determined by the mapping and alignment 406 for the nucleotide reads 404, the custom multigenome construction system 106 performs a variant calling 408 to generate variant calls for the nucleotide reads 404 relative to the reference genome.
  • the custom multigenome construction system 106 utilizes a variant caller machine-learning model to perform the variant calling 408 and, in some cases, to subsequently perform the mapping and alignment 406.
  • the custom multigenome construction system 106 utilizes alternative models for generating variant calls for the nucleotide reads 404 relative to the reference genome, such as but not limited to a probabilistic variant caller model.
  • the custom multigenome construction system 106 performs a variant phasing 410 of the variant calls generated by the variant calling 408 for the nucleotide reads 404.
  • the custom multigenome construction system 106 determines phasing of the variant calls utilizing a read-based phasing model, such as a probabilistic model, to predict the phasing of variant calls for the nucleotide reads 404.
  • the assembly-based pipeline 400b takes as input the set of haplotype-resolved nucleotide assemblies 422 to generate the phased multi-sample variant dataset 432 comprising a set of haplotype variant data 430a and a set of haplotype variant data 430b for respective assembly haplotypes indicated by the set of haplotype-resolved nucleotide assemblies 422.
  • the custom multigenome construction system 106 generates multiple sets of simulated nucleotide reads 424a and 424b for the respective assembly haplotypes of the set of haplotype-resolved nucleotide assemblies 422.
  • the custom multigenome construction system 106 receives a set of FASTA sequence files comprising the respective set of haplotype-resolved nucleotide assemblies 422 and generates a corresponding set of FASTQ files comprising the sets of simulated nucleotide reads 424a and 424b.
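As a rough illustration of turning an assembly haplotype sequence into simulated reads, the sketch below tiles fixed-length, error-free reads across one contig and formats them as FASTQ records. The read length, step size, constant quality string, read naming scheme, and function name are all assumptions for this example; the source does not specify how the simulated reads are generated.

```python
def simulate_reads(contig_name, sequence, read_len=150, step=50):
    """Yield FASTQ records (four lines each) tiled across one haplotype contig.

    Reads are exact substrings of the contig (no sequencing errors are
    simulated), so every variant call traces back to the assembly itself.
    """
    last_start = max(len(sequence) - read_len, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the contig end is covered
    for start in starts:
        read = sequence[start:start + read_len]
        # 'I' is a high Phred+33 quality character, used here as a constant.
        yield f"@{contig_name}_{start}\n{read}\n+\n{'I' * len(read)}\n"
```

Writing the yielded records for each assembly haplotype to its own file would give one FASTQ per haplotype, which keeps downstream variant calls haplotype-specific.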
  • the custom multigenome construction system 106 receives, generates, and/or stores the haplotype-resolved nucleotide assemblies 422 and the sets of simulated nucleotide reads 424a and 424b in a different format and/or medium (e.g., cached within memory for subsequent procedures).
  • the custom multigenome construction system 106 performs mapping and alignment 426a and 426b of the respective sets of simulated nucleotide reads 424a and 424b to determine alignments with respect to a reference genome (e.g., a linear reference genome).
  • the custom multigenome construction system 106 generates an alignment data file (e.g., a BAM file) comprising read alignments between the simulated nucleotide reads 424a-424b and the reference genome.
  • the custom multigenome construction system 106 outputs or stores the results of the mapping and alignment 426a and 426b in a different format and/or medium (e.g., storing the alignment data in cache for use in subsequent analysis).
  • based on the read alignments determined by the respective mapping and alignment 426a and 426b for the sets of simulated nucleotide reads 424a and 424b, the custom multigenome construction system 106 performs respective variant calling 428a and 428b to generate variant calls for the respective sets of simulated nucleotide reads 424a and 424b relative to the reference genome.
  • the custom multigenome construction system 106 utilizes a probabilistic variant caller model and/or a variant caller machine-learning model to perform the variant calling 428a and 428b.
  • the custom multigenome construction system 106 utilizes alternative models for generating read alignments and/or variant calls for the simulated nucleotide reads 424a-424b relative to the reference genome, such as but not limited to a machine-learning model configured for either or both variant calling and read alignment. Accordingly, the custom multigenome construction system 106 generates the sets of haplotype variant data 430a and 430b comprising variant calls identified within the respective sets of simulated nucleotide reads 424a and 424b.
  • the custom multigenome construction system 106 outputs respective sets of VCF files comprising the respective sets of haplotype variant data 430a and 430b and compiles the variant data to generate a multi-sample VCF file comprising the phased multi-sample variant dataset 432.
  • the custom multigenome construction system 106 includes the haplotype variant data 430a and 430b within the phased multi-sample variant dataset 432 (e.g., as a multi-sample VCF file) without outputting the aforementioned haplotype-specific VCF files.
  • the custom multigenome construction system 106 generates a custom multigenome reference (e.g., a custom pangenome reference) from a customized selection of nucleotide assemblies for a target population and/or a target genomic sample.
  • FIG. 5 depicts a graphical user interface 500 for selecting a customized set of nucleotide assemblies for implementation within a custom multigenome reference according to one or more embodiments.
  • the graphical user interface 500 includes a list of selectable known genomic samples (e.g., “HG002,” “HIFI03268,” “Sample1,” and so forth) for which nucleotide assemblies are available for construction of a custom multigenome reference.
  • the graphical user interface 500 includes summary information for the corresponding nucleotide assembly, such as an organizational source (e.g., “HPRC”), a respective population/continental origin (e.g., “European” or “South Asia”), a respective sub-population/ethnicity (e.g., “Ashkenazi”), a respective karyotype (e.g., “XY” or “XX”), and various statistical data.
  • a limited set of nucleotide assemblies is selected for inclusion within a custom multigenome reference.
  • the graphical user interface 500 includes additional selectable options for multigenome construction and/or evaluation, such as selection of a reference genome (e.g., “hg38”), optional evaluation of the resultant multigenome reference with respect to various truth sets (e.g., truth sets for human genome samples HG001-HG007), a search function for identifying nucleotide assemblies for potential inclusion, and a selectable option to upload a nucleotide assembly not yet included within the available nucleotide references (e.g., “Bring Your Own Genomes”).
  • the custom multigenome construction system 106 generates and utilizes a custom multigenome reference to determine genotype calls for a target genomic sample with improved accuracy relative to existing sequencing systems.
  • FIG. 6 depicts respective pipelines for generating genotype calls 614 and 634 for a target genomic sample 610 utilizing (i) a graph pangenome reference 608 generated from a set of haplotype-resolved nucleotide assemblies 602 by an existing sequencing system and (ii) a custom multigenome reference 628 generated from the set of haplotype-resolved nucleotide assemblies 602 by the custom multigenome construction system 106 according to one or more embodiments. [0115] As shown in FIG.
  • the portrayed existing sequencing system performs an assembly alignment and variant calling 604 of the set of haplotype-resolved nucleotide assemblies 602 to align whole contiguous sequences of assembly chromosomes with a reference genome and filter or otherwise identify variants within the respective whole contiguous sequences relative to the reference genome. Further, the existing sequencing system performs a graph fragment assembly 606 to generate a Graphical Fragment Assembly (GFA) file representing an assembly graph of the set of haplotype-resolved nucleotide assemblies 602, including nodes, edges, and associated attributes typical of graph-based genomes.
  • the existing sequencing system further utilizes the results (e.g., the aforementioned GFA file) of the graph fragment assembly 606, in combination with the reference genome used for the assembly alignment and variant calling 604, to generate the graph pangenome reference 608.
  • existing sequencing systems often produce a graph pangenome reference (such as the graph pangenome reference 608) comprising a graph-based structure capturing genetic variations, alternative alleles, haplotypes, and structural variations in relation to a linear reference genome.
  • the portrayed existing sequencing system can utilize the graph pangenome reference 608 to perform a read alignment and variant calling 612 of nucleotide reads from a target genomic sample 610 to generate the genotype calls 614 for the target genomic sample 610 relative to the linear reference genome.
  • existing sequencing systems utilize a machine-learning model to align and/or call variant calls for sample nucleotide reads.
  • the portrayed sequencing system utilizes significantly different models to perform the read alignment and variant calling 612 for the sample nucleotide reads of the target genomic sample 610 than that implemented for the assembly alignment and variant calling 604 of the whole chromosome sequences of the set of haplotype-resolved nucleotide assemblies 602.
  • the custom multigenome construction system 106 performs a simulated read alignment and variant calling 624 (e.g., as discussed above in relation to FIGS. 3A-3D) to determine variant calls from respective simulated nucleotide reads sampled from the set of haplotype-resolved nucleotide assemblies 602 and outputs a phased multi-sample variant dataset 626 (e.g., a multi-sample VCF file). Further, the custom multigenome construction system 106 utilizes the phased multi-sample variant dataset 626 to generate the custom multigenome reference 628 corresponding to the set of haplotype-resolved nucleotide assemblies 602.
  • the custom multigenome reference 628 comprises an augmented linear reference comprising a linear reference genome augmented by annotations indicating variants, alternative alleles, and so forth, as identified by the simulated read alignment and variant calling 624.
  • the custom multigenome reference 628 comprises a reference haplotype database indicating allele-variant differences between the linear reference genome and haplotype variants provided within the phased multi-sample variant dataset 626.
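  • The following is a minimal sketch of how haplotype-specific variant calls might be combined into phased genotypes and then compiled across assemblies into a multi-sample table. The function names (`phase_diploid`, `compile_multisample`), the tuple layout for a variant, and the choice to fill an absent call with `0|0` are assumptions made for illustration; they are not taken from the disclosure, and multi-allelic sites and deletion placeholders are not handled.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def phase_diploid(hap1: Set[Variant], hap2: Set[Variant]) -> Dict[Variant, str]:
    """Combine two haplotype-specific call sets into phased diploid genotypes.

    The left side of the '|' reflects the first haplotype and the right side
    reflects the second haplotype.
    """
    phased: Dict[Variant, str] = {}
    for variant in hap1 | hap2:
        phased[variant] = f"{int(variant in hap1)}|{int(variant in hap2)}"
    return phased


def compile_multisample(
    samples: Dict[str, Dict[Variant, str]]
) -> List[Tuple[Variant, List[str]]]:
    """Align per-sample phased calls by reference coordinate.

    Returns one row per site with one genotype per sample (samples in sorted
    name order). Assumption: a sample with no call at a site is reported as
    homozygous reference ('0|0').
    """
    names = sorted(samples)
    sites = sorted({v for calls in samples.values() for v in calls})
    return [(site, [samples[n].get(site, "0|0") for n in names]) for site in sites]


# Example with two hypothetical assemblies.
snp_a: Variant = ("chr1", 100, "A", "G")
snp_b: Variant = ("chr1", 200, "C", "T")
sample_1 = phase_diploid({snp_a, snp_b}, {snp_b})
sample_2 = phase_diploid(set(), {snp_a})
table = compile_multisample({"S1": sample_1, "S2": sample_2})
```

  • In this sketch each row of `table` plays the role of one record in a multi-sample variant call file, with the per-sample genotype strings carrying the phase.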
  • the custom multigenome construction system 106 utilizes the custom multigenome reference 628 to perform a read alignment and variant calling 632 of nucleotide reads from the target genomic sample 610 to generate the genotype calls 634 for the target genomic sample 610.
  • the custom multigenome construction system 106 utilizes a probabilistic variant caller model and/or a machine-learning-based variant-call model to identify the genotype calls 634 for the target genomic sample relative to the linear reference genome.
  • the custom multigenome construction system 106 generates and utilizes a multigenome reference to implement mapping and alignment of nucleotide reads from a target genomic sample with genomic regions of a reference genome with increased accuracy.
  • FIGS. 7-8 show experimental results of the custom multigenome construction system 106 generating and utilizing a custom multigenome reference, in accordance with some of the disclosed embodiments, to determine alignments of nucleotide reads.
  • FIGS. 7-8 illustrate comparative results of identifying single nucleotide polymorphisms (SNPs) and insertions or deletions (indels) based on read alignments generated according to one or more embodiments.
  • FIG. 7 includes a table of experimental results of identifying SNPs and indels in nucleotide reads from the whole genome sequencing of a known genomic sample (the Genome-in-a-Bottle Human Genome sample HG002) aligned with a reference genome utilizing (i) a pangenome reference generated by an existing sequencing system (see row indicated as “prior”) and (ii) a custom multigenome reference generated by the custom multigenome construction system 106 (see row indicated as “OURS”).
  • the numerical values depicted in association with each column correspond to resulting false negative (FN) variant calls, false positive (FP) variant calls, and the combined false positive and false negative (FP+FN) variant calls for SNPs and indels, respectively, as identified per a U.S. National Institute of Standards and Technology (NIST) truth set for the HG002 sample.
  • the table in FIG. 7 includes variant calling results generated by the existing sequencing system and the custom multigenome construction system 106 when considering (i) a subset of the HG002 assembly generally understood as exhibiting relatively high-confidence results when using the existing sequencing system (“high-confidence bed only”) and (ii) the entire HG002 assembly (“whole assembly”).
  • read alignments determined utilizing a custom multigenome reference generated by the custom multigenome construction system 106 exhibit significantly improved accuracy in identifying variant calls over the results of the existing sequencing system.
  • FIG. 8 includes a bar graph of experimental results of identifying SNPs and indels in nucleotide reads from the whole genome sequencing of several known genomic samples (the Genome-in-a-Bottle Human Genome samples HG001-HG007, respectively) aligned utilizing (i) a linear reference genome (“Linear Reference”), (ii) two different pangenome references generated by existing reference-assembly systems (“Pangenome A” and “Pangenome B”; see, e.g., the graph pangenome reference 608 depicted in FIG. 6), and (iii) a custom multigenome reference generated by the custom multigenome construction system 106 (“Custom Multigenome”; see, e.g., the custom multigenome reference 628 depicted in FIG. 6).
  • the numerical values depicted in association with each bar column correspond to resulting false negative (FN) variant calls, false positive (FP) variant calls, and the combined false positive and false negative (FP+FN) variant calls, respectively.
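  • As an illustration of the FN, FP, and FP+FN tallies reported in FIGS. 7-8, the sketch below scores a call set against a truth set. It assumes variants are compared by exact match on a hypothetical `(chrom, pos, ref, alt)` key; benchmarking tools used in practice typically also normalize variant representation and compare genotypes, which this simplification omits.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def score_calls(calls: Iterable[Variant], truth: Iterable[Variant]) -> Dict[str, int]:
    """Count false positives, false negatives, and their sum."""
    call_set, truth_set = set(calls), set(truth)
    fp = len(call_set - truth_set)  # called but absent from the truth set
    fn = len(truth_set - call_set)  # in the truth set but not called
    return {"FP": fp, "FN": fn, "FP+FN": fp + fn}


truth_variants = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")]
called_variants = [("chr1", 100, "A", "G"), ("chr2", 75, "T", "C")]
scores = score_calls(called_variants, truth_variants)
```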
  • FIGS. 9A-9B illustrate respective example flowcharts of two series of acts for (a) generating a multi-sample phased variant dataset representing a multigenome reference and (b) utilizing a multigenome reference to generate genotype calls for a target genomic sample in accordance with one or more embodiments. While FIGS. 9A-9B illustrate acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIGS. 9A-9B. The acts of FIGS. 9A-9B can be performed as part of one or more methods.
  • a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform some or all of the acts depicted in FIGS. 9A-9B.
  • a system comprising at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by one or more processors, cause the system to perform some or all of the acts of FIGS. 9A-9B.
  • the series of acts 900 includes an act 902 of generating simulated reads from each nucleotide assembly of a set of nucleotide assemblies, an act 904 of determining read alignments between the simulated nucleotide reads and respective genomic regions of a reference genome, an act 906 of generating variant calls relative to the reference genome for the simulated nucleotide reads, an act 908 of generating phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies, and an act 910 of compiling the phased variant data to generate a multi-sample phased variant dataset representing a multigenome reference (e.g., a pangenome reference).
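  • One simple way to picture the generation of simulated reads in act 902 is to tile overlapping windows along an assembly sequence so that interior positions are covered by roughly a target number of reads. The sketch below assumes a fixed read length and a step of `read_length // target_depth`; the function name and parameters are hypothetical, and the disclosure also contemplates varying read length and depth by region, which is not shown.

```python
from typing import List, Tuple


def simulate_reads(sequence: str, read_length: int, target_depth: int) -> List[Tuple[int, str]]:
    """Tile overlapping reads along a sequence.

    Returns (start_offset, read) pairs so that each read keeps its source
    coordinate within the assembly. Positions near the sequence ends are
    covered by fewer reads than the target depth.
    """
    if read_length <= 0 or target_depth <= 0:
        raise ValueError("read_length and target_depth must be positive")
    step = max(1, read_length // target_depth)
    last_start = max(1, len(sequence) - read_length + 1)
    return [
        (start, sequence[start:start + read_length])
        for start in range(0, last_start, step)
    ]


contig = "ACGTACGTACGTACGTACGT"  # toy 20-base sequence
reads = simulate_reads(contig, read_length=8, target_depth=4)

# Depth at an interior position: number of reads overlapping position 10.
depth_at_10 = sum(1 for start, read in reads if start <= 10 < start + len(read))
```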
  • the series of acts 920 includes an act 922 of identifying nucleotide reads corresponding to a target genomic sample, an act 924 of determining read alignments between the nucleotide reads and respective genomic regions of a linear reference genome, an act 926 of comparing the nucleotide reads with a multigenome reference (e.g., a pangenome reference) comprising haplotype variants identified using simulated nucleotide reads sampled from a set of nucleotide assemblies, and an act 928 of generating genotype calls for the target genomic sample relative to the linear reference genome based on the read alignments.
  • the series of acts 900 and/or the series of acts 920 can include acts to perform any of the operations described in the following clauses:
  • CLAUSE 1 A method comprising: generating simulated nucleotide reads comprising nucleotide sequences from each nucleotide assembly of a set of nucleotide assemblies; determining read alignments between the simulated nucleotide reads and respective genomic regions of a reference genome; generating variant calls relative to the reference genome for the simulated nucleotide reads according to the read alignments; generating phased variant data comprising the variant calls for each respective nucleotide assembly of the set of nucleotide assemblies; and compiling the phased variant data for the set of nucleotide assemblies to generate a multi-sample phased variant dataset representing a multigenome reference.
  • CLAUSE 2 The method according to clause 1, further comprising: identifying, for a given haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, a first assembly haplotype and a second assembly haplotype; generating, from the given haplotype-resolved nucleotide assembly, a first set of the simulated nucleotide reads for the first assembly haplotype and a second set of the simulated nucleotide reads for the second assembly haplotype; generating the variant calls relative to the reference genome, including variant calls for the first set of the simulated nucleotide reads and variant calls for the second set of the simulated nucleotide reads; and generating, for the given haplotype-resolved nucleotide assembly, the phased variant data comprising diploid genotype calls by combining the variant calls for the first set of the simulated nucleotide reads with the variant calls for the second set of the simulated nucleotide reads.
  • CLAUSE 3 The method according to clause 2, further comprising: identifying the first assembly haplotype by identifying, for the given haplotype-resolved nucleotide assembly, a first nucleotide sequence comprising a first set of alleles in a chromosome from a first parent; and identifying the second assembly haplotype by identifying, for the given haplotype-resolved nucleotide assembly, a second nucleotide sequence comprising a second set of alleles in the chromosome from a second parent.
  • CLAUSE 4 The method according to clause 1, further comprising: identifying, for a given polyploid haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, a plurality of assembly haplotypes; generating, from the given polyploid haplotype-resolved nucleotide assembly, respective sets of the simulated nucleotide reads for the plurality of assembly haplotypes; generating the variant calls relative to the reference genome, including respective (e.g., haplotype-specific) variant calls for each of the respective sets of the simulated nucleotide reads; and generating, for the given polyploid haplotype-resolved nucleotide assembly, the phased variant data comprising polyploid genotype calls by compiling the variant calls for each of the respective sets of the simulated nucleotide reads.
  • CLAUSE 5 The method according to clause 1, further comprising: generating, for a given haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, one or more variant calls corresponding to a haploid genomic region of the reference genome; and generating, for the given haplotype-resolved nucleotide assembly, the phased variant data comprising haploid genotype calls including the one or more variant calls corresponding to the haploid genomic region.
  • CLAUSE 6 The method according to clause 1, further comprising generating the phased variant data utilizing a read-based phasing model to determine phasing of the variant calls for each nucleotide assembly of the set of nucleotide assemblies.
  • CLAUSE 7 The method of any of clauses 1-6, further comprising generating the simulated nucleotide reads comprising nucleotide sequences by sampling assembly sequences corresponding to individual chromosomes within each nucleotide assembly of the set of nucleotide assemblies.
  • CLAUSE 8 The method according to any of clauses 1-7, further comprising: generating the phased variant data by generating sample-specific variant call files comprising the phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies; and compiling the phased variant data by compiling the phased variant data of the sample-specific variant call files to generate a multi-sample variant call file comprising the multi-sample phased variant dataset.
  • CLAUSE 9 The method according to clause 8, further comprising generating the multi-sample variant call file comprising the multi-sample phased variant dataset by: aligning the sample-specific variant call files according to genomic coordinates within the reference genome; and inserting, within rows for respective nucleotide assemblies within the multi-sample variant call file, data placeholders representing deletions within the respective nucleotide assemblies.
  • CLAUSE 10 The method according to any of clauses 8-9, further comprising generating the multigenome reference by generating a reference haplotype database or a graph multigenome reference utilizing the multi-sample variant call file.
  • CLAUSE 11 The method according to any of clauses 1-10, further comprising generating the simulated nucleotide reads by sampling a plurality of overlapping nucleotide sequences in a progressive series of segments of each nucleotide assembly of the set of nucleotide assemblies.
  • CLAUSE 12 The method according to clause 11, further comprising generating the simulated nucleotide reads by sampling the plurality of overlapping nucleotide sequences to simulate a target read depth of nucleotides within the simulated nucleotide reads.
  • CLAUSE 13 The method according to clause 12, further comprising generating the simulated nucleotide reads with a modified read depth greater than or lesser than the target read depth at one or more regions of a given nucleotide assembly of the set of nucleotide assemblies.
  • CLAUSE 14 The method according to any of clauses 11-13, further comprising generating the simulated nucleotide reads by sampling the plurality of overlapping nucleotide sequences with an initial read length of up to 300,000 nucleobases.
  • CLAUSE 15 The method according to clause 14, further comprising sampling the plurality of overlapping nucleotide sequences with a modified read length greater than or lesser than the initial read length at one or more regions of a given nucleotide assembly of the set of nucleotide assemblies.
  • CLAUSE 16 The method according to any of clauses 1-15, further comprising generating the read alignments between the simulated nucleotide reads and the respective genomic regions of the reference genome utilizing a haplotype-aware alignment model.
  • CLAUSE 17 The method according to any of clauses 1-16, further comprising generating the variant calls relative to the reference genome for the simulated nucleotide reads utilizing a probabilistic variant call model or a machine-learning-based variant-call model.
  • CLAUSE 18 The method according to any of clauses 1-17, further comprising: identifying, within the read alignments between the simulated nucleotide reads and the respective genomic regions of the reference genome, at least one mis-mapped read based on comparing the respective genomic regions with relative coordinates of the simulated nucleotide reads within a respective nucleotide assembly of the set of nucleotide assemblies; and excluding one or more variant calls corresponding to the at least one mis-mapped read from the phased variant data for the respective nucleotide assembly.
  • CLAUSE 19 The method of any of clauses 1-18, further comprising: generating, for a given nucleotide assembly of the set of nucleotide assemblies, respective variant calls relative to the reference genome based on respective simulated nucleotide reads generated from the given nucleotide assembly; determining a whole-contig alignment between contiguous sequences of the given nucleotide assembly and corresponding genomic regions of the reference genome; generating one or more structural variant calls for the given nucleotide assembly based on the whole-genome alignment or the whole-contig alignment; and generating, for the given nucleotide assembly, the phased variant data comprising the respective variant calls based on the respective simulated nucleotide reads and the one or more structural variant calls based on the whole-contig alignment.
  • generating the respective variant calls comprises generating, utilizing a variant call model, a set of variant calls in a first maternal genotype-call data file and a set of variant calls in a first paternal genotype-call data file;
  • CLAUSE 21 The method according to any of clauses 1-20, further comprising: determining one or more read alignments between one or more nucleotide reads corresponding to a target genomic sample and one or more respective genomic regions of the reference genome based on comparing the one or more nucleotide reads with the multigenome reference; and generating genotype calls for a target genomic sample relative to the reference genome based on the one or more read alignments.
  • CLAUSE 22 The method according to any of clauses 1-21, further comprising: receiving at least one indication of a user interaction selecting, from a database of nucleotide assemblies, a target population or the set of nucleotide assemblies for the target population; and based on the selection of the target population or the set of nucleotide assemblies for the target population, generating the multi-sample phased variant dataset representing the multigenome reference for the target population.
  • CLAUSE 23 A method comprising: identifying one or more nucleotide reads corresponding to a target genomic sample; determining one or more read alignments between the one or more nucleotide reads and one or more respective genomic regions of a linear reference genome based on comparing the one or more nucleotide reads with a multigenome reference comprising haplotype variants identified using simulated nucleotide reads sampled from a set of nucleotide assemblies; and generating genotype calls for the target genomic sample relative to the linear reference genome based on the one or more read alignments.
  • CLAUSE 24 The method according to clause 23, wherein the multigenome reference comprises phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies.
  • CLAUSE 25 The method according to clause 23, wherein the set of nucleotide assemblies comprises a plurality of haploid nucleotide assemblies of (i) a plurality of haploid genomes of one or more organisms or (ii) haploid regions of a plurality of diploid genomes or polyploid genomes of one or more organisms.
  • CLAUSE 26 The method according to any of clauses 23-24, further comprising generating the genotype calls by generating a genotype call identifying a genotype for the target genomic sample at a genomic coordinate relative to a known genomic sample represented by a segment of phased variant data within the multigenome reference.
  • CLAUSE 27 The method according to any of clauses 23-24 or 26, further comprising: receiving, for a target population corresponding to the target genomic sample, the set of nucleotide assemblies corresponding to a set of known genomic samples; and generating a multi-sample phased variant dataset representing the multigenome reference for the target population.
  • CLAUSE 28 The method according to clause 27, further comprising: receiving at least one indication of a user interaction selecting, from a database of nucleotide assemblies, the target population or the set of nucleotide assemblies for the target population; and based on the selection of the target population or the set of nucleotide assemblies for the target population, generating the multi-sample phased variant dataset representing the multigenome reference for the target population.
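  • The mis-mapped read exclusion recited in clause 18 can be sketched as follows. The sketch assumes that each simulated read carries an expected reference position derived from its coordinates within the source assembly (how that expectation is derived is not specified here), and that a read counts as mis-mapped when it aligns to a different chromosome or farther than a tolerance from the expected position. The `AlignedRead` fields, the `tolerance` default, and the per-call `read_id` key are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlignedRead:
    read_id: str
    expected_chrom: str  # assumption: derived from the read's assembly coordinates
    expected_pos: int
    aligned_chrom: str   # where the aligner actually placed the read
    aligned_pos: int


def is_mismapped(read: AlignedRead, tolerance: int = 1000) -> bool:
    """Flag a read whose alignment disagrees with its expected location."""
    if read.aligned_chrom != read.expected_chrom:
        return True
    return abs(read.aligned_pos - read.expected_pos) > tolerance


def exclude_mismapped_calls(
    calls: List[Dict], reads: List[AlignedRead], tolerance: int = 1000
) -> List[Dict]:
    """Drop variant calls that originate from mis-mapped reads."""
    bad_ids = {r.read_id for r in reads if is_mismapped(r, tolerance)}
    return [call for call in calls if call["read_id"] not in bad_ids]


example_reads = [
    AlignedRead("r1", "chr1", 10_000, "chr1", 10_050),  # consistent placement
    AlignedRead("r2", "chr1", 20_000, "chr5", 20_000),  # wrong chromosome
    AlignedRead("r3", "chr1", 30_000, "chr1", 90_000),  # too far away
]
example_calls = [
    {"read_id": "r1", "pos": 10_060, "alt": "G"},
    {"read_id": "r2", "pos": 20_010, "alt": "T"},
    {"read_id": "r3", "pos": 90_005, "alt": "C"},
]
kept_calls = exclude_mismapped_calls(example_calls, example_reads)
```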
  • The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable.
  • the process to determine the nucleotide sequence of a target nucleic acid i.e., a nucleic acid polymer
  • Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
  • SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand.
  • a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery.
  • more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
  • SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties.
  • Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below.
  • the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery.
  • the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
  • SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like.
  • the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used.
  • the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by
  • Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P.
  • the nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array.
  • An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images.
  • the images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
  • cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference.
  • This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference.
  • the availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing.
  • Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
  • the labels do not substantially inhibit extension under SBS reaction conditions.
  • the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features.
  • each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step.
  • each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
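  • To make the four-image scheme concrete, the sketch below assigns one nucleotide type per array feature for a single cycle by picking the detection channel with the strongest signal. The channel-to-base order, the use of a simple maximum, and the function name are assumptions for illustration only; real base calling also corrects for crosstalk, phasing, and background, none of which is modeled here.

```python
from typing import List, Sequence

# Assumption: channel index 0..3 corresponds to the labels for A, C, G, T.
CHANNEL_TO_BASE = ("A", "C", "G", "T")


def call_cycle(feature_intensities: Sequence[Sequence[float]]) -> List[str]:
    """Call one base per feature from four per-channel intensities.

    feature_intensities[i][ch] is the signal of feature i in channel ch.
    Because feature positions do not change between images, index i refers
    to the same feature in all four channels.
    """
    calls = []
    for intensities in feature_intensities:
        if len(intensities) != 4:
            raise ValueError("expected four channel intensities per feature")
        brightest = max(range(4), key=lambda ch: intensities[ch])
        calls.append(CHANNEL_TO_BASE[brightest])
    return calls


cycle_calls = call_cycle([
    [0.90, 0.10, 0.05, 0.10],  # feature 0: brightest in the A channel
    [0.10, 0.20, 0.10, 0.80],  # feature 1: brightest in the T channel
    [0.05, 0.70, 0.15, 0.10],  # feature 2: brightest in the C channel
])
```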
  • nucleotide monomers can include reversible terminators.
  • reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference).
  • Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety).
  • Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst.
  • the fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light.
  • disulfide reduction or photocleavage can be used as a cleavable linker.
  • Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP.
  • the presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance.
  • Some embodiments can utilize detection of four different nucleotides using fewer than four different labels.
  • SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232.
  • a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair.
  • nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal.
  • one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels.
  • An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g., dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g., dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g., dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength), and a fourth nucleotide type that lacks a label and is therefore not, or only minimally, detected in either channel (e.g., dGTP having no label).
  • sequencing data can be obtained using a single channel.
  • the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated.
  • the third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
  • Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides.
  • the oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize.
  • images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images.
  • Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. “Nanopores and nucleic acids: prospects for ultrarapid sequencing.” Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, “Characterization of nucleic acids by nanopore analysis.” Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, “DNA molecules and configurations in a solid-state nanopore microscope.” Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties).
  • the target nucleic acid passes through a nanopore.
  • the nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin.
  • each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
  • Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity.
  • Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No.
  • the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. “Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al.
  • Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product.
  • sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference.
  • Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
  • the above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously.
  • different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner.
  • the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner.
  • the target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface.
  • the array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as, bridge amplification or emulsion PCR as described in further detail below.
  • the methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
  • an advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above.
  • an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like.
  • a flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No.
  • one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method.
  • one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above.
  • an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods.
  • Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
  • the sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device.
  • the term “sample,” and its derivatives, is used in its broadest sense and includes any specimen, culture and the like that is suspected of including a target.
  • the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids.
  • the sample can include any biological, clinical, surgical, agricultural, atmospheric or aquatic-based specimen containing one or more nucleic acids.
  • the term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen. It is also envisioned that the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA.
  • the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
  • the nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA).
  • the sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples.
  • low molecular weight material includes enzymatically or mechanically fragmented DNA.
  • the sample can include cell-free circulating DNA.
  • the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples.
  • the sample can be an epidemiological, agricultural, forensic or pathogenic sample.
  • the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source.
  • the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus or fungus.
  • the source of the nucleic acid molecules may be an archived or extinct sample or species.
  • forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel.
  • the nucleic acid sample may be a purified sample or a crude DNA containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids.
  • the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA.
  • target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum.
  • target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim.
  • nucleic acids including one or more target sequences can be obtained from a deceased animal or human.
  • target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA.
  • target sequences or amplified target sequences are directed to purposes of human identification.
  • the disclosure relates generally to methods for identifying characteristics of a forensic sample.
  • the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein.
  • a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
  • the components of the custom multigenome construction system 106 can include software, hardware, or both.
  • the components of the custom multigenome construction system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 114, the local device 108, or the server device(s) 110).
  • the computer-executable instructions of the custom multigenome construction system 106 can cause the computing devices to perform the bubble detection methods described herein.
  • the components of the custom multigenome construction system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions.
  • the components of the custom multigenome construction system 106 can include a combination of computer-executable instructions and hardware.
  • the components of the custom multigenome construction system 106 performing the functions described herein with respect to the custom multigenome construction system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model.
  • components of the custom multigenome construction system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device.
  • the components of the custom multigenome construction system 106 may be implemented in any application that provides sequencing services including, but not limited to Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
  • Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below.
  • Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
  • one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein).
  • a processor receives instructions from a non-transitory computer-readable medium (e.g., a memory, etc.) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
  • Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
  • Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices).
  • Computer-readable media that carry computer-executable instructions are transmission media.
  • embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
  • Non-transitory computer-readable storage media includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
  • a “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
  • a network or another communications connection can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
  • program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa).
  • computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system.
  • non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
  • Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
  • computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure.
  • the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
  • the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
  • the disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks.
  • program modules may be located in both local and remote memory storage devices.
  • Embodiments of the present disclosure can also be implemented in cloud computing environments.
  • “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources.
  • cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources.
  • the shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
  • a cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth.
  • a cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
  • a cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
  • a “cloud-computing environment” is an environment in which cloud computing is employed.
  • FIG. 10 illustrates a block diagram of a computing device 1000 that may be configured to perform one or more of the processes described above.
  • the computing device 1000 may implement the custom multigenome construction system 106 and the sequencing device system 104.
  • the computing device 1000 can comprise a processor 1002, a memory 1004, a storage device 1006, an I/O interface 1008, and a communication interface 1010, which may be communicatively coupled by way of a communication infrastructure 1012.
  • the computing device 1000 can include fewer or more components than those shown in FIG. 10. The following paragraphs describe components of the computing device 1000 shown in FIG. 10 in additional detail.
  • the processor 1002 includes hardware for executing instructions, such as those making up a computer program.
  • the processor 1002 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1004, or the storage device 1006 and decode and execute them.
  • the memory 1004 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s).
  • the storage device 1006 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
  • the I/O interface 1008 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1000.
  • the I/O interface 1008 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces.
  • the I/O interface 1008 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers.
  • the I/O interface 1008 is configured to provide graphical data to a display for presentation to a user.
  • the graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
  • the communication interface 1010 can include hardware, software, or both. In any event, the communication interface 1010 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1000 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1010 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
  • the communication interface 1010 may facilitate communications with various types of wired or wireless networks.
  • the communication interface 1010 may also facilitate communications using various communication protocols.
  • the communication infrastructure 1012 may also include hardware, software, or both that couples components of the computing device 1000 to each other.
  • the communication interface 1010 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein.
  • the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
  • the methods described herein may be performed with fewer or more steps/acts or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts.
  • the scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

This disclosure describes methods, non-transitory computer readable media, and systems that implement improved mapping and alignment of nucleotide reads with genomic regions of a reference genome. For instance, the disclosed systems can generate and utilize a custom multigenome reference comprising haplotype variants identified using simulated nucleotide reads sampled from a set of nucleotide assemblies for a respective set of known genomic samples. To generate a custom multigenome reference, for example, the disclosed systems can generate simulated nucleotide reads comprising nucleotide sequences from a set of nucleotide assemblies, determine read alignments and variant calls for the simulated reads with respect to a reference genome, and compile resultant variant data for the set of nucleotide assemblies to generate a multi-sample phased variant dataset representing a multigenome reference.

Description

CUSTOM MULTIGENOME REFERENCE CONSTRUCTION FOR IMPROVED SEQUENCING ANALYSIS OF GENOMIC SAMPLES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/625,580, entitled, “CUSTOM MULTIGENOME REFERENCE CONSTRUCTION FOR IMPROVED SEQUENCING ANALYSIS OF GENOMIC SAMPLES,” filed on January 26, 2024 (IP-2756-PRV), the entirety of which is hereby incorporated by reference.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining variant calls for genomic samples. For instance, some existing nucleobase sequencing platforms determine individual nucleobases within sequences from genomic samples’ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods. When using SBS, existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset. For instance, a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls. After capturing such images, existing sequencing platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer. For instance, such software (i) maps and aligns nucleotide reads determined by the sequencing platform for a genomic sample with (ii) a reference genome comprising at least a primary contiguous sequence. Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify genotype and/or variants within a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), or structural variants.
[0003] Despite these recent advances, existing nucleobase sequencing platforms and sequencing data analysis software (together and hereinafter, “existing sequencing systems”) often utilize reference genomes that misrepresent certain populations and foment inaccurate read mapping and alignment and mistaken variant calling. For example, some existing sequencing systems use a linear reference genome that purportedly represents a consensus or example of genes and other nucleotide sequences of an organism. But about 93% of the primary assembly for the most common linear human reference genome, GRCh38 from the Genome Reference Consortium, is based on libraries from only 11 individuals, with 70% of the linear human reference genome coming from 1 individual. Accordingly, many existing systems use a linear reference genome that does not represent certain populations, common variants, or common population haplotypes. Consequently, existing sequencing systems frequently align nucleotide reads and determine variant calls that are negatively influenced by a reference bias in favor of alignment of such reads with alleles represented by the linear reference genome and to the detriment of alternative alleles — despite allele-variant differences between the target genomic sample and the linear reference genome. Such a reference bias often leads to false positives (FPs) and false negatives (FNs) in variant calls, thus reducing the accuracy of existing sequencing systems in generating variant calls for a diploid genomic sample.
[0004] To address this lack of genetic representation in linear reference genomes, some existing reference-assembly systems generate and use a pangenome reference representing a more diverse group of individuals. To that end, various organizations — such as the Human Pangenome Reference Consortium (HPRC), the Chinese Pangenome Consortium (CPC), and the Arab Pangenome Reference (APR) — have produced assemblies of diverse populations of genomic samples for use in establishing inclusive, pangenome references. Many such assemblies cover nucleotide sequences that span entire chromosomes of the respective genomic sample, from one telomere to another telomere, many with separate complete sequences for both maternal and paternal alleles.
[0005] Construction of a pangenome reference from a set of assemblies, however, presents additional challenges. For instance, many existing sequencing systems utilize structural variant detection methods that align assemblies with a linear reference genome on a whole contiguous sequence basis (e.g., aligning entire chromosome sequences from telomere to telomere) and filter the aligned sequences to identify structural variants. Due in part to their focus on whole sequence alignment for structural variant detection, such existing systems often exhibit relatively poor performance in improving variant calls for smaller variants, such as SNPs and indels, relative to conventional methods of aligning reads to a linear reference genome without augmentation by a pangenome reference created by existing reference-assembly systems.
[0006] These, along with additional problems and issues, exist in existing sequencing systems and pangenome references created by existing reference-assembly systems.
SUMMARY
[0007] This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that (i) generate a multigenome reference comprising haplotype variants identified within a set of nucleotide assemblies and/or (ii) utilize the generated multigenome reference to determine alignments and/or genotype calls. In particular, the disclosed systems can generate a multi-sample phased variant dataset representing a multigenome reference by comparing simulated reads sampled from a set of nucleotide assemblies with a linear reference genome to identify genotype calls for each nucleotide assembly of the set of nucleotide assemblies. In certain implementations, for example, the disclosed systems can generate a custom multigenome reference from a selected set of nucleotide assemblies representing a target population corresponding to a target genomic sample. The disclosed systems can utilize the custom multigenome reference to determine read alignments and genotype calls relative to a linear reference genome for the target genomic sample.
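As an illustration only, sampling simulated reads from an assembly sequence can be sketched as a fixed-length sliding window over one contig. The function name, the read length, the step size, and the read-naming scheme below are all hypothetical choices made for this sketch; the disclosure does not specify them, and an actual implementation may sample reads quite differently.

```python
from typing import Iterator, NamedTuple


class SimulatedRead(NamedTuple):
    name: str       # hypothetical naming scheme: "<contig>:<start>"
    start: int      # 0-based offset of the read within the source contig
    sequence: str   # nucleotides copied verbatim from the assembly


def simulate_reads(
    contig_name: str,
    contig_seq: str,
    read_length: int = 150,  # assumed value, not taken from the disclosure
    step: int = 50,          # assumed value, not taken from the disclosure
) -> Iterator[SimulatedRead]:
    """Yield error-free reads by sliding a window along one assembly contig."""
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    # Last window start that still yields a full-length read; a contig shorter
    # than read_length yields a single, shorter read starting at offset 0.
    last_start = max(len(contig_seq) - read_length, 0)
    for start in range(0, last_start + 1, step):
        yield SimulatedRead(
            name=f"{contig_name}:{start}",
            start=start,
            sequence=contig_seq[start:start + read_length],
        )


# Usage: tile a toy 12-base contig with 4-base reads every 2 bases.
reads = list(simulate_reads("hapA_chr1", "ACGTACGTACGT", read_length=4, step=2))
```

Because each read is copied directly from the assembly, any mismatch observed after alignment to the linear reference genome would reflect a difference between the assembly and the reference rather than a sequencing error. Note that when the step does not evenly divide the contig, this simple window leaves the final few bases unsampled.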
[0008] To generate a multi-sample variant dataset for a given set of haplotype-resolved nucleotide assemblies, for example, the disclosed systems can utilize various pipelines to (a) process each assembly haplotype independently to identify variants relative to a reference genome, (b) combine resulting variant calls for respective assembly haplotypes (e.g., maternal and paternal haplotypes) of each nucleotide assembly to generate phased variant data for each nucleotide assembly, and (c) aggregate the phased variant data of multiple nucleotide assemblies to compile a multi-sample phased variant dataset for the given set of haplotype-resolved nucleotide assemblies. For instance, the disclosed methods can process a given assembly haplotype by (a) generating a set of simulated reads from chromosome sequences within the given assembly haplotype, (b) aligning the simulated reads to a linear reference genome, and (c) generating variant calls relative to the linear reference genome according to the read alignments to generate the variant data for the assembly haplotype. By utilizing the foregoing pipelines to (i) generate variant data for assembly haplotypes based on simulated reads, (ii) generate phased variant data for each nucleotide assembly by combining the variant data for the respective assembly haplotypes, and (iii) aggregate the phased variant data for multiple nucleotide assemblies into a multi-sample phased variant dataset, the disclosed systems can generate a customized multigenome reference with improved accuracy and efficiency over existing systems.
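The combining and aggregating steps described above can be sketched as follows, under several stated assumptions. The per-haplotype variant calls are taken as already computed and are modeled as simple sets of (chromosome, position, reference allele, alternate allele) tuples. The VCF-style "maternal|paternal" genotype strings, the treatment of each variant as its own biallelic record, and the choice to record "0|0" for a sample that has no call at a site are simplifications introduced here for illustration, not details from the disclosure. All function names are hypothetical.

```python
from typing import Dict, Set, Tuple

# (chromosome, 1-based position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def phase_assembly(maternal: Set[Variant], paternal: Set[Variant]) -> Dict[Variant, str]:
    """Combine per-haplotype variant calls into phased genotypes.

    Assumption: genotypes are written "maternal|paternal", where 1 means the
    haplotype carries the alternate allele and 0 means it matches the reference.
    """
    return {
        variant: f"{int(variant in maternal)}|{int(variant in paternal)}"
        for variant in maternal | paternal
    }


def aggregate(samples: Dict[str, Dict[Variant, str]]) -> Dict[Variant, Dict[str, str]]:
    """Merge phased calls from several assemblies into one multi-sample table.

    Assumption: a sample with no call at a site is recorded as "0|0"; a real
    pipeline might instead need to distinguish "reference" from "not covered".
    """
    all_variants = set().union(*samples.values())
    return {
        variant: {name: calls.get(variant, "0|0") for name, calls in samples.items()}
        for variant in sorted(all_variants)
    }


# Usage with two toy assemblies.
snp_a: Variant = ("chr1", 100, "A", "G")
snp_b: Variant = ("chr1", 200, "C", "T")
snp_c: Variant = ("chr1", 300, "G", "A")

sample1 = phase_assembly(maternal={snp_a, snp_b}, paternal={snp_a})
sample2 = phase_assembly(maternal=set(), paternal={snp_c})
dataset = aggregate({"sample1": sample1, "sample2": sample2})
```

Keeping each haplotype's calls separate until the combining step means the phase of every variant comes directly from the assembly haplotype it was called on, so no statistical phasing is required in this sketch.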
[0009] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The detailed description refers to the drawings briefly described below.
[0011] FIG. 1 illustrates an environment in which a custom multigenome construction system can operate in accordance with one or more embodiments of the present disclosure.
[0012] FIG. 2 illustrates an overview of the custom multigenome construction system generating a multigenome reference from a plurality of nucleotide assemblies in accordance with one or more embodiments of the present disclosure.
[0013] FIG. 3A illustrates the custom multigenome construction system generating haplotype variant data for an assembly haplotype in accordance with one or more embodiments of the present disclosure.
[0014] FIG. 3B illustrates the custom multigenome construction system generating phased variant data for a diploid haplotype-resolved nucleotide assembly in accordance with one or more embodiments of the present disclosure.
[0015] FIG. 3C illustrates the custom multigenome construction system generating a multi-sample phased variant dataset from a plurality of haplotype-resolved nucleotide assemblies in accordance with one or more embodiments of the present disclosure.
[0016] FIG. 3D illustrates the custom multigenome construction system generating phased variant data for a haplotype-resolved assembly of any ploidy level in accordance with one or more embodiments of the present disclosure.
[0017] FIG. 3E illustrates the custom multigenome construction system generating phased variant data for a diploid haplotype-resolved assembly utilizing a simulated-read variant calling pipeline and a whole-contiguous sequence structural variant calling pipeline in accordance with one or more embodiments of the present disclosure.
[0018] FIGS. 4A-4B further illustrate the custom multigenome construction system utilizing two respective multigenome construction pipelines to generate phased multi-sample variant datasets in accordance with one or more embodiments of the present disclosure.
[0019] FIG. 5 illustrates a graphical user interface comprising selectable options for generating a custom multigenome reference in accordance with one or more embodiments of the present disclosure.
[0020] FIG. 6 illustrates two end-to-end analysis pipelines for generating and utilizing a multigenome reference to determine genotype calls for a target genomic sample utilizing (i) an existing sequencing system and a graph pangenome reference from an existing reference-assembly system and (ii) the custom multigenome construction system in accordance with one or more embodiments of the present disclosure.
[0021] FIG. 7 illustrates a table of comparative experimental results of determining variant calls from nucleotide reads that are mapped and aligned to a reference genome utilizing (i) a pangenome reference generated by an existing reference-assembly system and (ii) a multigenome reference generated by the disclosed custom multigenome construction system in accordance with one or more embodiments of the present disclosure.
[0022] FIG. 8 illustrates a graph of comparative experimental results of determining variant calls from nucleotide reads that are mapped and aligned to a reference genome utilizing (i) pangenome references generated by various existing reference-assembly systems, and (ii) a multigenome reference generated by the disclosed custom multigenome construction system in accordance with one or more embodiments of the present disclosure.
[0023] FIG. 9A illustrates a flowchart of a series of acts for generating a multi-sample phased variant dataset representing a multigenome reference in accordance with one or more embodiments of the present disclosure.
[0024] FIG. 9B illustrates a flowchart of a series of acts for generating genotype calls for a target genomic sample utilizing a multigenome reference generated according to one or more embodiments of the present disclosure.
[0025] FIG. 10 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
DETAILED DESCRIPTION
[0026] This disclosure describes embodiments of a custom multigenome construction system that can (i) generate a multigenome reference identifying haplotype variants from a set of nucleotide assemblies for a target population and/or (ii) utilize the multigenome reference to determine read alignments and genotype calls for a target genomic sample with improved accuracy over existing sequencing systems. In one or more embodiments, for example, the custom multigenome construction system utilizes long read simulation and variant calling to generate a multi-sample variant dataset from a collection of nucleotide assemblies to establish a multigenome reference for a target population. Using a multigenome reference generated according to one or more embodiments, the custom multigenome construction system can determine read alignments and genotype calls relative to a linear reference genome for a target genomic sample.
[0027] In some embodiments, for instance, the custom multigenome construction system processes haplotype-resolved nucleotide assemblies to generate phased variant data for each nucleotide assembly and compiles the phased variant data to generate a multi-sample phased variant dataset representing a multigenome reference. For instance, in one or more embodiments, the custom multigenome construction system processes each assembly haplotype within a haplotype-resolved nucleotide assembly by (a) generating a set of simulated reads from chromosome sequences within the given assembly haplotype, (b) aligning the simulated reads to a linear reference genome, and (c) generating variant calls relative to the linear reference genome according to the read alignments to generate the phased variant data for the given assembly haplotype. Additionally, in one or more embodiments, the custom multigenome construction system utilizes one or more structural variant detection models to further identify structural variants within the given assembly haplotype and includes the identified structural variants in the phased variant data for the given assembly haplotype.
[0028] To further illustrate, in some embodiments, the custom multigenome construction system samples (or segments on an overlapping-basis) assembly sequences corresponding to individual chromosomes within a given nucleotide assembly to generate simulated nucleotide reads for use in determining variant calls to include within a multigenome reference. In some embodiments, for example, the custom multigenome construction system samples, from a given nucleotide assembly, a plurality of overlapping nucleotide sequences in a progressive series of segments to generate the simulated nucleotide reads with a predetermined read length per simulated nucleotide read and/or at a target read depth across the simulated nucleotide reads. Further, in some embodiments, the custom multigenome construction system utilizes a modified read length and/or a modified read depth, greater than or lesser than the respective predetermined read length and/or target read depth, at one or more regions of the given nucleotide assembly. Accordingly, the custom multigenome construction system can generate simulated nucleotide reads from nucleotide assemblies for use in generating a multigenome reference, as described in further detail below in relation to the various figures.
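As a non-limiting illustration of the sampling described above, the following sketch tiles a single assembly sequence with overlapping segments of a predetermined read length at an approximate target read depth. The function name `simulate_reads`, the fixed step of `read_length // depth`, and the handling of the final segment are assumptions made for this example; the disclosure does not specify them, and region-specific modified read lengths or depths are not modeled.

```python
# Illustrative sketch only; the tiling rule below is an assumption.

def simulate_reads(sequence, read_length, depth):
    """Tile a sequence with overlapping windows at roughly `depth`-fold coverage.

    Returns a list of (start_offset, read_sequence) tuples. Reads are exact,
    unaltered substrings of the input sequence.
    """
    if read_length <= 0 or depth <= 0:
        raise ValueError("read_length and depth must be positive")
    # Advancing by read_length / depth means each interior base is
    # covered by about `depth` reads.
    step = max(1, read_length // depth)
    reads = []
    start = 0
    while start < len(sequence):
        reads.append((start, sequence[start:start + read_length]))
        if start + read_length >= len(sequence):
            break  # the current read already reaches the end of the sequence
        start += step
    return reads


if __name__ == "__main__":
    chromosome = "ACGT" * 25  # toy 100-base "chromosome"
    for offset, read in simulate_reads(chromosome, read_length=20, depth=4)[:3]:
        print(offset, read)
```

Because each simulated read in this sketch is copied from the assembly without alteration, any mismatch observed after alignment reflects a difference between the assembly and the linear reference genome rather than a sequencing error.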
[0029] As suggested above, the custom multigenome construction system provides several technical advantages, benefits, and/or improvements over existing sequencing systems and methods. For example, the custom multigenome construction system improves the accuracy of read alignments and subsequent genomic analysis by utilizing a multigenome reference generated according to the disclosed embodiments. More specifically, in some embodiments, the custom multigenome construction system generates a custom multigenome reference including phased variant data identified using simulated nucleotide reads sampled from a set of nucleotide assemblies. In some such cases, the set of nucleotide assemblies is selected for a target population corresponding to a target genomic sample. By utilizing the custom multigenome reference in determining alignments of nucleotide reads from a target genomic sample, the custom multigenome construction system can more accurately align nucleotide reads with a corresponding linear reference genome — especially in more complex or “difficult-to-call” genomic regions (e.g., regions comprising lower confidence base calls in general) — than existing sequencing systems that utilize reference genomes with poor representation of population diversity. Due to the improved alignment with the reference genome, whilst avoiding reference bias by providing a diverse variant dataset to augment the reference genome, the custom multigenome construction system can also determine more accurate genotype calls and/or variant calls with a higher confidence that such calls match or differ from the reference base of a reference genome compared to existing sequencing systems that use pangenome references generated by existing reference-assembly systems.
[0030] In addition, as mentioned above, the custom multigenome construction system can generate simulated reads by sampling overlapping sequences from a given nucleotide assembly to provide reliably accurate inputs for identifying variants within the given nucleotide assembly using the pipelines described herein (e.g., in relation to FIGS. 3A-3E below). By generating simulated reads comprising discrete and accurate sequences from nucleotide assemblies and utilizing the simulated reads to identify haplotype variants, embodiments of the custom multigenome construction system can further implement improved alignment and variant calling for target genomic samples, relative to existing systems utilizing pangenome references generated by other methods. Furthermore, in certain implementations, the custom multigenome construction system can utilize an additional analysis pipeline to identify structural variant calls based on a whole-contig alignment of contiguous sequences from a haplotype assembly. By merging into one or more sequencing data files both (i) structural variants based on whole-contig alignments and (ii) variants (e.g., single nucleotide variants, insertions, deletions, structural variants, and so forth) identified utilizing the aforementioned simulated nucleotide reads, the custom multigenome construction system 106 can further improve the fidelity of resulting multigenome references for further improved alignment and variant calling of target genomic samples, particularly for structural variants. This disclosure describes and depicts examples of such improved genotype and/or variant calls below in relation to FIGS. 7-8.
[0031] In addition to improving the accuracy of alignment and related sequencing analysis, the custom multigenome construction system improves computational efficiency in an alignment process when generating a multigenome reference relative to existing reference-assembly systems. Some existing reference-assembly systems consume excessive processing resources and time to align lengthy alternate contiguous sequences (e.g., contiguous sequences representing whole chromosomes or millions of base pairs from whole chromosomes) with a reference genome. By contrast, the custom multigenome construction system can generate simulated reads sampled from nucleotide assemblies (e.g., encoding individual chromosomes) to identify haplotype variants relative to a linear reference genome and implement parallel processing to align such simulated reads with respective regions of a reference genome. Such parallel simulated-read alignments increase computing speeds for the alignment process relative to the single-processor analysis required by the telomere-to-telomere alignment of lengthy alternate contiguous sequences (e.g., whole chromosomes) performed by many existing reference-assembly systems. Furthermore, in at least some implementations, a custom multigenome reference generated by the custom multigenome construction system consumes less computer memory compared to the augmented graph multigenome references used by many existing sequencing systems.
[0032] In addition to improved accuracy and computational efficiency, the custom multigenome construction system can generate a multigenome reference with more flexibility than conventional genome references. To illustrate, in some embodiments, the custom multigenome construction system can generate a custom multigenome reference indicating haplotype variants identified within a set of nucleotide assemblies specifically selected to represent a target population corresponding to a particular target genomic sample. Additionally, in certain implementations, the custom multigenome construction system 106 can flexibly generate a custom multigenome reference incorporating haploid, diploid, and/or polyploid genome information in organisms and/or genomic regions of any ploidy level (e.g., as described below in relation to FIG. 3D). By utilizing such a custom multigenome reference targeted to a particular target genomic sample, the custom multigenome construction system can determine read alignments and subsequent variant calls for the particular target genomic sample with further increased accuracy and improved efficiency compared to methods implementing a linear reference genome or a non-specific or less specific multigenome reference. Indeed, in some embodiments, the custom multigenome construction system utilizes a custom multigenome reference comprising haplotype variants derived from a discrete subset of nucleotide assemblies particularly selected for a given genomic sample, resulting in improved alignment accuracy, reduced memory consumption, and increased processing speeds.
[0033] As suggested by the foregoing discussion, this disclosure utilizes a variety of terms to describe features and benefits of the custom multigenome construction system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used in this disclosure, for instance, the term “target genomic sample” refers to a target genome or portion of a genome undergoing an assay or sequencing. For example, a target genomic sample includes one or more sequences of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence). In particular, a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases. A target genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below. In some cases, the target genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
[0034] In contrast to a target genomic sample, the term “known genomic sample,” as used herein, refers to a well-characterized genomic sample that serves as a reference material for benchmarking and improving sequencing analysis models. For example, a known genomic sample can include a human genome sample (e.g., HG001, HG004) commonly used or made available in public repositories, such as Genome-in-a-Bottle (GIAB). As also mentioned above, various organizations — such as the Human Pangenome Reference Consortium (HPRC), the Chinese Pangenome Consortium (CPC), and the Arab Pangenome Reference (APR) — are producing haplotype-resolved nucleotide assemblies to provide diverse populations of known genomic samples. To further illustrate, several exemplary known genomic samples are listed in relation to their respective nucleotide assemblies in FIG. 5.
[0035] Also, as used herein, the term “nucleotide read” (or simply “read”) refers to an inferred or predicted sequence of one or more nucleotide bases (or nucleobase pairs) from all or part of a sample genomic sequence (e.g., a sample genomic sequence, complementary DNA). In particular, a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a genomic sample. For example, in some embodiments, the custom multigenome construction system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell. In some cases, a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads). In these or other cases, another type of nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
[0036] Relatedly, as used herein, the term “simulated nucleotide read” (or simply “simulated read”) refers to a sequence of nucleotide bases sampled from a nucleotide assembly. In particular, in some cases, simulated nucleotide reads comprise overlapping nucleotide sequences sampled from a haplotype-resolved nucleotide assembly in a progressive series of segments. In certain embodiments, a simulated nucleotide read spans tens of thousands of nucleobases in length (e.g., 20,000 or 30,000 nucleobases) to hundreds of thousands of nucleobases in length (e.g., 200,000 or 300,000 nucleobases). For example, a simulated nucleotide read can include a sequence of successive nucleotides indicated within a respective nucleotide assembly without alteration. Because simulated nucleotide reads can overlap with one another when aligned with a linear reference genome, simulated nucleotide reads simulate nucleotide reads for a genomic sample with respect to alignment but nevertheless differ from the nucleotide reads described in the paragraph above in part because simulated nucleotide reads do not represent sequences determined from oligonucleotides extracted from a genomic sample and immobilized on a nucleotide-sample slide (e.g., using SBS and a flow cell), from a nanopore-based process, or from other similar processes. As described in further detail below, the custom multigenome construction system generates simulated nucleotide reads comprising discrete segments of individual chromosomes within a given nucleotide assembly for use in identifying haplotype variants to include within a multigenome reference.
[0037] Also, as used herein, the term “nucleotide assembly” refers to an assembled set of nucleotide sequences of a sample genome. For example, a nucleotide assembly can include a comprehensive assembly of an entire genome of an organism, including the specific arrangement of nucleotides for each respective chromosome of the sample genome. In some cases, a nucleotide assembly can include assembled nucleotides within a limited span of a respective sample genome, such as an individual chromosome or a particular region thereof (e.g., the Major Histocompatibility Complex (MHC) located on the sixth chromosome of a human genome). In particular, as used herein, each nucleotide assembly comprises a complete set of nucleotide sequences assembled for a known genomic sample (e.g., a well-characterized genomic sample, as described above).
[0038] Relatedly, as used herein, the term “haplotype-resolved nucleotide assembly” refers to a genome reconstruction — in whole or in part (e.g., one or more chromosomes or other genomic regions) — in which the allelic nucleotides on homologous chromosomes are explicitly resolved. Accordingly, in a haplotype-resolved nucleotide assembly, a genome reconstruction includes allelic nucleotides that have been phased (e.g., organized or labeled) according to a parental haplotype. For instance, a diploid haplotype-resolved nucleotide assembly includes allelic nucleotides, from a pair of homologous chromosomes, that are resolved for two parental haplotypes. By contrast, a polyploid haplotype-resolved nucleotide assembly includes allelic nucleotides for haplotypes corresponding to three or more homologous chromosomes. Also, in some cases, a haplotype-resolved nucleotide assembly, whether diploid or polyploid, includes assembled nucleotide sequences in regions of a different ploidy level (e.g., a haploid region within a diploid genome or regions of local ploidy variation within a polyploid genome).
[0039] As further used herein, the term “phasing” (or “haplotype phasing”) refers to a process of separating nucleotide reads (e.g., simulated nucleotide reads) or continuous sequences into respective parental haplotypes. For instance, phasing can occur by identifying unique alleles or variants (e.g., SNPs, indels) on nucleotide reads in a genomic region, organizing such nucleotide reads in the genomic region according to the unique alleles or variants, and identifying subsets of nucleotide reads according to a maternal haplotype or paternal haplotype based on the organization or grouping. In some cases, phasing is performed by a haplotype phasing model that uses a hidden Markov model (HMM) or another algorithm suited to haplotype phasing, such as Segmented HAPlotype Estimation and Imputation Tool (SHAPEIT), BEAGLE, Eagle2, or WhatsHap.
[0040] Also, as used herein, the term “assembly haplotype” refers to an allele-specific sequence of nucleotides from a nucleotide assembly. For example, an assembly haplotype can include assembled nucleotides from an allele of a single chromosome or, in some cases, can include all of the allelic nucleotides corresponding to a particular individual haplotype or a particular parent (e.g., as indicated within a haplotype-resolved genome sample). Relatedly, as used herein, the term “haplotype” refers to nucleotide sequences that are present in an organism (or present in organisms from a population) and inherited from one or more ancestors. In particular, a haplotype can include alleles or other nucleotide sequences present in organisms of a population and inherited together by such organisms respectively from a single parent. In one or more embodiments, haplotypes include a set of SNPs and/or other variants (e.g., insertions or deletions (indels), structural variants, and/or microsatellites) on the same chromosome that tend to be inherited together. In some cases, data representing a haplotype or a set of different haplotypes are stored or otherwise accessible on a haplotype database.
[0041] As further used herein, the term “genomic coordinate” (or sometimes simply “coordinate”) refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome). In some cases, a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome. For instance, a genomic coordinate or coordinates may include a number, name, or other identifier for a somatic or sex chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870). In some cases, a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY). Consequently, the custom multigenome construction system can determine genotype probabilities for a genotype call (e.g., a variant call) for a genomic coordinate on a sex chromosome. Further, in certain implementations, a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001). By contrast, in certain cases, a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
[0042] As used herein, a “genomic region” refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome.
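For illustration, the coordinate and region notations described above (a chromosome or source identifier followed by a position or a range, or a bare position) can be parsed with a short routine such as the following. The `Region` type, the `parse_region` function, and the regular expression are hypothetical helpers introduced for this example and are not part of the disclosed system.

```python
# Illustrative sketch only; Region and parse_region are hypothetical names.
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    contig: Optional[str]  # chromosome or source identifier; None if absent
    start: int
    end: int               # equals start for a single coordinate


# Optional "<identifier>:" prefix, then a position, then an optional "-<end>".
_PATTERN = re.compile(r"^(?:(?P<contig>.+):)?(?P<start>\d+)(?:-(?P<end>\d+))?$")


def parse_region(text):
    """Parse strings such as 'chr1:1234570-1234870', 'mt:16568', or '29727'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {text!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"region end precedes start: {text!r}")
    return Region(match["contig"], start, end)


if __name__ == "__main__":
    for example in ("chr1:1234570-1234870", "SARS-CoV-2:29001", "29727"):
        print(parse_region(example))
```

The identifier is matched greedily up to the final colon, so source names that themselves contain hyphens, such as SARS-CoV-2, are kept intact.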
[0043] As noted above, a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome. As used herein, the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species. For example, a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium. As noted above, in some cases, a reference genome includes multi-base codes. As a further example, a reference genome may include a graph reference genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
[0044] Further, as used herein, the term “multigenome reference” refers to a reference dataset or assembly that includes genetic information from multiple sample genomes to represent genetic diversity within a population. In some cases, a multigenome reference includes such a dataset or assembly that includes diverse genetic information from multiple sample genomes of a target population and, therefore, constitutes a customized multigenome reference. By generating a multigenome reference including phased variant data identified using simulated nucleotide reads sampled from a set of nucleotide assemblies for a target population, such a customized multigenome reference can be generated. Relatedly, as used herein, the term “pangenome reference” refers to a multigenome reference representing collective information of a species, or a subset thereof. In some cases, for example, a pangenome reference includes the shared genetic elements of the corresponding population, as well as a representation of the genetic diversity of the corresponding population.
[0045] Relatedly, as used herein, the term “reference haplotype database” refers to a database encoding variant data for population haplotypes of a multigenome or pangenome. In one or more embodiments, a reference haplotype database includes complete or partially complete nucleotide sequences (e.g., alternate contiguous sequences) for population haplotypes of a multigenome or pangenome. Alternatively, in some embodiments, a reference haplotype database encodes variant data for population haplotypes having allele-variant differences from locally distinct population haplotypes within respective genomic regions of a corresponding reference genome. For example, in some embodiments, the reference haplotype database comprises a haplotype data structure comprising a hierarchical partitioning of different genomic regions of a reference genome into a collection of bins covering respective spans of a linear reference genome (e.g., as represented by a primary contiguous sequence), the bins comprising variant data for the respective spans.
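One possible way to picture a hierarchical partitioning of a linear reference into bins is sketched below. This is an assumption-laden illustration rather than the disclosed haplotype data structure: the class name `BinnedVariantIndex`, the three bin sizes, the rule of storing each variant in the finest bin that fully contains it, and the use of half-open intervals are all choices made for this example only. One index instance is assumed per contiguous sequence.

```python
# Illustrative sketch only; the binning scheme is an assumption, not the
# disclosed data structure.
from collections import defaultdict


class BinnedVariantIndex:
    """Multi-level bins over one contig; intervals are half-open [start, end)."""

    def __init__(self, bin_sizes=(1_000, 10_000, 100_000)):
        self.bin_sizes = tuple(sorted(bin_sizes))
        self.bins = defaultdict(list)  # (level, bin_index) -> [(start, end, variant)]

    def _locate(self, start, end):
        # Pick the finest level at which a single bin contains the whole span.
        for level, size in enumerate(self.bin_sizes):
            if start // size == (end - 1) // size:
                return level, start // size
        return len(self.bin_sizes), 0  # catch-all bin for very long spans

    def add(self, start, end, variant):
        self.bins[self._locate(start, end)].append((start, end, variant))

    def query(self, start, end):
        """Return variants whose span overlaps [start, end)."""
        hits = []
        for level, size in enumerate(self.bin_sizes):
            for index in range(start // size, (end - 1) // size + 1):
                hits.extend(self.bins.get((level, index), []))
        hits.extend(self.bins.get((len(self.bin_sizes), 0), []))
        return [v for s, e, v in hits if s < end and e > start]


if __name__ == "__main__":
    index = BinnedVariantIndex()
    index.add(1_500, 1_501, "snv")
    index.add(9_990, 10_020, "deletion")
    print(index.query(1_000, 2_000))
    print(index.query(10_000, 10_010))
```

With such a layout, a lookup for a span of the reference inspects only the bins at each level that intersect the span instead of scanning every stored variant.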
[0046] As further used herein, the term “nucleobase call” (or simply “base call”) refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a genomic sample. In particular, a nucleobase call can indicate a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls). In some cases, for a nucleotide read, a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell). As suggested above, a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call.
[0047] As used herein, the term “variant” refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome. For example, a variant includes a SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence or a reference genome. Relatedly, as used herein, the term “haplotype variant” refers to a variant associated with a particular haplotype, such as indicated within a multigenome reference as described herein.
[0048] Further, as used herein, the term “structural variant” refers to a variation (e.g., deletion, insertion, translocation, inversion) in a structure of an organism’s chromosome or a variation to nucleotide sequences of the organism’s chromosome (e.g., a sample genomic sequence). In some cases, a structural variant includes a variation to a threshold number of base pairs (e.g., > 35 or > 50 base pairs) within an organism’s chromosome. Accordingly, in certain implementations, a structural variant includes an insertion or deletion exceeding a threshold number of base pairs, a duplication exceeding a threshold number of base pairs, an inversion, a translocation, or a copy number variation (CNV). While some examples of structural variants use 35 base pairs or 50 base pairs as a threshold number of base pairs, in some embodiments, the threshold number of base pairs for a structural variant may be different, such as, but not limited to, 16, 25, 32, 45, 100, or 1,000 base pairs.
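To illustrate how a configurable base-pair threshold can separate small insertions and deletions from structural variants, consider the following sketch. The function name `classify_variant`, the returned labels, and the default threshold of 50 are assumptions for this example; the sketch covers only sequence-resolved insertions and deletions and does not address inversions, translocations, duplications, or copy number variations.

```python
# Illustrative sketch only; labels and default threshold are assumptions.

def classify_variant(ref, alt, sv_threshold=50):
    """Label a sequence-resolved variant by the length difference of its alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "substitution"  # same-length, multi-base replacement
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    # A variant is treated as structural when its size exceeds the threshold.
    return f"structural {kind}" if size > sv_threshold else f"small {kind}"


if __name__ == "__main__":
    print(classify_variant("A", "G"))
    print(classify_variant("A", "A" + "T" * 60))
    print(classify_variant("A" + "C" * 10, "A"))
```

Passing a different `sv_threshold` (for instance 35) reflects the point above that the threshold number of base pairs may differ between embodiments.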
[0049] Along these lines, a “variant call” (or “variant nucleobase call”) refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference. In particular, a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome. Conversely, a “reference call” (or “non-variant nucleobase call” or “non-variant call”) refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference. In particular, a non-variant or reference call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
[0050] As further used herein, the term “genotype call” refers to a determination or prediction of a particular genotype of a genomic sample or a sample nucleotide sequence at a genomic locus. In particular, a genotype call can include a prediction of a particular genotype of a genomic sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region. For instance, in some cases, a genotype call includes a determination or prediction that a genomic sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0|0 or heterozygous for a variant on a particular strand represented as 0|1). Accordingly, a genotype call can include a prediction of a variant or reference base for one or more alleles of a genomic sample and indicate zygosity with respect to a variant or reference base. A genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
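As a minimal sketch of the genotype notation mentioned above, the following example interprets a genotype string such as 0|0 or 0|1. The function name `describe_genotype` and the returned dictionary keys are hypothetical. The sketch assumes the common convention in which `|` separates phased alleles, `/` separates unphased alleles, and `0` denotes the reference allele; it does not handle missing alleles.

```python
def describe_genotype(gt: str) -> dict:
    """Interpret a genotype string such as '0|1' (hypothetical helper)."""
    phased = "|" in gt
    # Allele indices: 0 is the reference allele, 1 or higher is a variant allele.
    alleles = [int(a) for a in gt.replace("|", "/").split("/")]
    if len(set(alleles)) > 1:
        zygosity = "heterozygous"
    elif alleles[0] == 0:
        zygosity = "homozygous reference"
    else:
        zygosity = "homozygous variant"
    return {"alleles": alleles, "phased": phased, "zygosity": zygosity}


print(describe_genotype("0|0"))
print(describe_genotype("0|1"))
```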
[0051] In one or more embodiments, the custom multigenome construction system identifies and/or stores nucleotide sequences, corresponding sequencing metrics, and/or other sequencing data within one or more sequencing data files. As used herein, the term “sequencing data file” refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, and so forth.
[0052] For instance, as used herein, the term “sequence file” refers to a digital file that indicates one or more nucleotide sequences. For example, a sequence file can include a digital file comprising a standard text-based format, such as FASTA files (indicating one or more nucleotide sequences) or FASTQ files (indicating quality scores and/or other metrics in association with one or more nucleotide sequences) storing nucleotide sequences expressed in single-letter code (e.g., A, T, C, G). Accordingly, in some implementations, the custom multigenome construction system 106 receives a FASTA file comprising nucleotide sequences spanning an entire sample genome or a portion thereof (e.g., an individual chromosome or genomic region).
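For illustration only, a sequence file in FASTA format can be read into a mapping from sequence name to nucleotide sequence roughly as follows. The function name `parse_fasta` is hypothetical; the sketch assumes the conventional FASTA layout in which a header line begins with `>` and is followed by one or more lines of single-letter nucleotide codes.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA record name to its concatenated sequence (sketch)."""
    sequences: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


example = [">chr1 maternal", "ACGT", "acgt", ">chr1_alt", "TTGA"]
print(parse_fasta(example))
```

The same function can be applied to an open file handle, since a text file iterates line by line.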
[0053] As used herein, the term “alignment data file” refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence. For example, an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence. Also, an alignment data file can include further information regarding nucleotide reads, mapping and alignment results, population haplotype data, and so forth.
[0054] Also, as used herein, the term “genotype-call data file” refers to a digital file that indicates one or more genotype calls (e.g., including reference and/or variant calls) compared to a reference genome along with other information pertaining to the genotype calls (e.g., variant calls). For example, a genotype-call data file can include a variant call file, such as but not limited to a Variant Call Format (VCF) file (including a multi-sample Variant Call Format (msVCF) file as described below in relation to FIG. 3C). Alternatively, as a further example, a genotype-call data file can include a General Feature Format (GFF) file, a Genome Variant Format (GVF), or other suitable data file comprising genotype calls for a sample nucleotide sequence (such as a nucleotide assembly or a sequence thereof). Relatedly, as used herein, the term “variant call file” refers to a particular genotype-call data file that comprises a text file format that contains information about variants at specific genomic coordinates. For instance, a variant call file can include metainformation lines, a header line, and data lines where each data line contains information about a single genotype call (e.g., a single variant).
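To make the three line types of a variant call file concrete, the following sketch separates metainformation lines, the header line, and data lines. The function name `parse_vcf` and the record keys are hypothetical; the sketch assumes tab-separated VCF text with the eight fixed columns followed by an optional FORMAT column and per-sample columns, and it ignores the QUAL, FILTER, and INFO fields for brevity.

```python
def parse_vcf(lines):
    """Split VCF text into meta lines, sample names, and records (sketch)."""
    meta, samples, records = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)  # metainformation line
        elif line.startswith("#"):
            # Header line: sample names follow the nine fixed column names.
            samples = line.lstrip("#").split("\t")[9:]
        else:
            fields = line.split("\t")
            record = {
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4].split(","),
                "genotypes": {},
            }
            if len(fields) > 9:
                keys = fields[8].split(":")
                for sample, value in zip(samples, fields[9:]):
                    record["genotypes"][sample] = dict(zip(keys, value.split(":")))
            records.append(record)
    return meta, samples, records


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0|1\t1|1",
]
meta, samples, records = parse_vcf(example)
print(samples, records[0]["genotypes"])
```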
[0055] As used herein, a “probabilistic model” refers to a type of statistics-based prediction model. For example, in some cases, a probabilistic variant call model refers to a Bayesian probability model that generates variant calls based on nucleotide reads (or simulated nucleotide reads). Such a model can process or analyze sequencing metrics corresponding to read pileups (e.g., multiple nucleotide reads corresponding to a single genomic coordinate), including mapping quality, base quality, and various hypotheses including foreign reads, missing reads, joint detection, and more. Probabilistic models may likewise include multiple components, including, but not limited to, different software applications or components for mapping and aligning, sorting, duplicate marking, computing read pileup depths, and variant calling. In some cases, a probabilistic variant call model refers to an ILLUMINA DRAGEN model for variant calling functions and mapping and alignment functions (e.g., a DRAGEN variant caller or “DRAGEN VC”).
[0056] As used herein, the term “machine-learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of ground truth data. For example, a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine-learning models include various types of decision trees, support vector machines, Bayesian networks, or neural networks. Further, as used herein, the term “neural network” refers to a machine-learning model that can be trained and/or tuned based on inputs to determine classification or approximate unknown functions. For example, a neural network can include a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., classifications, predictions, or digital content) based on a plurality of inputs provided to the neural network. In some cases, a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions of data. For example, a neural network can include a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a self-attention transformer neural network, or a generative adversarial neural network.
[0057] As used herein, the term “haplotype-aware alignment model” refers to a model configured to determine alignments of nucleotide reads with a reference genome while considering the underlying haplotype structure of the corresponding genomic sample. For example, a haplotype-aware alignment model can consider allelic coordinates of nucleotide reads when determining alignments between the nucleotide reads and respective genomic regions of a linear reference genome.
[0058] Moreover, as used herein, the term “read-based phasing model” refers to a model configured to leverage sequencing information to infer the haplotypes of a genomic sample. For example, a read-based phasing model can consider metrics and information from sequencing procedures, from mapping and alignment of nucleotide reads, and/or from metrics related to variant calling procedures to determine the specific divisions of nucleotides between alleles along chromosomes of the corresponding genomic sample.
[0059] Further, as used herein, the term “whole-contig alignment” refers to a process for aligning contiguous sequences (often referred to as “contigs”) from a genome assembly to a reference genome sequence or to individual primary or alternative contiguous sequences thereof. In relation to whole-contig alignment, for example, each contiguous sequence of a genome assembly can represent an entire chromosome (e.g., from telomere to telomere) or an entire genome (e.g., spanning all chromosomes). Relatedly and accordingly, as used herein, the term “whole-genome alignment” is a type of whole-contig alignment that refers to a process of aligning contiguous sequences of an entire genome assembly to a reference genome. For whole-genome alignment, accordingly, each contiguous sequence from each chromosome of an entire genome assembly is aligned with a reference genome.
[0060] The following paragraphs describe the custom multigenome construction system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a computing system 100 in which a custom multigenome construction system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes a sequencing device 102 connected to a local device 108 (e.g., a local server device), one or more server device(s) 110, and a client device 114. As shown in FIG. 1, the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114 can communicate with each other via a network 118. The network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 10. While FIG. 1 shows an embodiment of the custom multigenome construction system 106, this disclosure describes alternative embodiments and configurations below.
[0061] As indicated by FIG. 1, the sequencing device 102 comprises a computing device and a sequencing device system 104 for sequencing a genomic sample or other nucleic-acid polymer. In some embodiments, by executing the sequencing device system 104 using a processor, the sequencing device 102 analyzes nucleotide fragments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems either directly or indirectly on the sequencing device 102. More particularly, the sequencing device 102 receives nucleotide-sample slides (e.g., flow cells) comprising nucleotide fragments extracted from samples and further copies and determines the nucleobase sequence of such extracted nucleotide fragments.
[0062] In one or more embodiments, the sequencing device 102 utilizes sequencing-by-synthesis (SBS) techniques to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads. In addition or in the alternative to communicating across the network 118, in some embodiments, the sequencing device 102 bypasses the network 118 and communicates directly with the local device 108 or the client device 114. By executing the sequencing device system 104, the sequencing device 102 can further store the nucleobase calls as part of base-call data that is formatted as a binary base call (BCL) file and send the BCL file to the local device 108 and/or the server device(s) 110.
[0063] As further indicated by FIG. 1, the local device 108 is located at or near a same physical location of the sequencing device 102. Indeed, in some embodiments, the local device 108 and the sequencing device 102 are integrated into a same computing device. The local device 108 may run the custom multigenome construction system 106 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data in conjunction with a multigenome reference (e.g., a pangenome reference) provided by the custom multigenome construction system 106 or accessed within a database 120. As shown in FIG. 1, the sequencing device 102 may send (and the local device 108 may receive) base-call data generated during a sequencing run of the sequencing device 102. By executing software in the form of the custom multigenome construction system 106, the local device 108 may align nucleotide reads with a reference genome utilizing a multigenome reference generated according to one or more embodiments and/or stored within the database 120 and determine genetic variants based on the aligned nucleotide reads. The local device 108 may also communicate with the client device 114. In particular, the local device 108 can send data to the client device 114, including a binary alignment map (BAM) file, a variant call format (VCF) file, a multi-sample phased VCF representing a multigenome reference, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics.
[0064] As further indicated by FIG. 1, the server device(s) 110 are located remotely from the local device 108 and the sequencing device 102. Similar to the local device 108, in some embodiments, the server device(s) 110 include a version of (or are otherwise able to access or implement) the custom multigenome construction system 106, accessible via a sequencing system 112 of the server device(s) 110.
Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data, or by generating a multigenome reference utilizing a collection of nucleotide assemblies stored within the database 120 or otherwise accessed by the custom multigenome construction system 106. As indicated above, the sequencing device 102 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 102. The server device(s) 110 may also communicate with the client device 114. In particular, the server device(s) 110 can send data to the client device 114, including BAM files, VCF files, or other sequencing related information.
[0065] In some embodiments, the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. Moreover, as shown in FIG. 1, the server device(s) 110 are in communication, either directly or via the network 118, with a database 120 storing nucleotide assemblies to be utilized by the custom multigenome construction system 106 when generating a multigenome reference (e.g., a pangenome reference) for a target population or a target genomic sample. Further, the database 120 can additionally store multigenome references generated by the custom multigenome construction system 106 for subsequent use in sequencing analysis of target genomic sequences.
[0066] As indicated above, as part of the server device(s) 110 or the local device 108, the custom multigenome construction system 106 can generate, encode, and/or implement the aforementioned multigenome reference(s) to determine alignments of nucleotide reads from a target genomic sample with a reference genome. In some embodiments, for example, the custom multigenome construction system 106 can generate a multigenome reference (e.g., a pangenome reference) for a target population or a target genomic sample (e.g., by utilizing simulated nucleotide reads sampled from a set of nucleotide assemblies stored within the database 120 or otherwise provided to the custom multigenome construction system 106) and utilize the generated multigenome reference to determine one or more alignments of nucleotide reads from a target genomic sample, as described in greater detail below in relation to the subsequent figures.
[0067] As further illustrated and indicated in FIG. 1, by executing a sequencing application 116, the client device 114 can generate, store, receive, and send digital data. In particular, the client device 114 can receive sequencing data from the local device 108 or receive call files (e.g., BCL) and sequencing metrics from the sequencing device 102. Furthermore, the client device 114 may communicate with the local device 108 or the server device(s) 110 to receive a VCF comprising genotype or variant calls and/or other metrics, such as base-call-quality metrics or pass-filter metrics. The client device 114 can accordingly present or display information pertaining to variant calls or other genotype calls within a graphical user interface of the sequencing application 116 to a user associated with the client device 114. For example, the client device 114 can present genotype calls, variant calls, and/or sequencing metrics for a sequenced genomic sample within a graphical user interface of the sequencing application 116.
Moreover, the client device 114, via the sequencing application 116, can present or display a user interface comprising selectable options for generating a custom multigenome reference (e.g., by selecting a custom set of nucleotide assemblies representing a target population and/or a target genomic sample), as further described below in relation to FIG. 5.
[0068] Although FIG. 1 depicts the client device 114 as a desktop or laptop computer, the client device 114 may comprise various types of client devices. For example, in some embodiments, the client device 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 114 are discussed below with respect to FIG. 10.
[0069] As mentioned, the client device 114 includes the sequencing application 116. The sequencing application 116 may be a web application or a native application stored and executed on the client device 114 (e.g., a mobile application, desktop application). The sequencing application 116 can include instructions that (when executed) cause the client device 114 to receive data from the custom multigenome construction system 106 and present, for display at the client device 114, base-call data or data from an alignment data file or VCF. Furthermore, the sequencing application 116 can instruct the client device 114 to display summaries for multiple sequencing runs.
[0070] As further illustrated in FIG. 1, a version of the custom multigenome construction system 106 may be located and/or implemented (e.g., entirely or in part) on the client device 114 or the sequencing device 102. In yet other embodiments, the custom multigenome construction system 106 is implemented by one or more other components of the computing system 100, such as the local device 108. In particular, the custom multigenome construction system 106 can be implemented in a variety of different ways across the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114. For example, the custom multigenome construction system 106 can be downloaded from the server device(s) 110 to the client device 114 and/or the local device 108 where all or part of the functionality of the custom multigenome construction system 106 is performed at each respective device within the computing system 100.
[0071] As previously mentioned, in some embodiments, the custom multigenome construction system 106 generates a multigenome reference (e.g., a pangenome reference) comprising variant data identified using simulated reads sampled from a set of nucleotide assemblies. To illustrate, FIG. 2 depicts the custom multigenome construction system 106 generating a multigenome reference 208 comprising variant data 206 identified using simulated nucleotide reads 204 sampled from a set of nucleotide assemblies 202.
[0072] As shown in FIG. 2, the custom multigenome construction system 106 identifies (or receives) the set of nucleotide assemblies 202 as input for constructing the multigenome reference 208. In one or more embodiments, for example, the set of nucleotide assemblies 202 include complete nucleotide sequences for each chromosome of a respective set of known genomic samples. Furthermore, in some embodiments, each nucleotide assembly of the set of nucleotide assemblies 202 comprises a haplotype-resolved nucleotide assembly denoting allelic nucleotide sequences for each respective chromosome of a corresponding sample genome. As shown, the set of nucleotide assemblies 202 can be provided in a suitable text format, such as in the form of a FASTA digital file denoting nucleotide sequences divided by individual chromosomes and, in some cases, by specific alleles (e.g., maternal or paternal alleles) of each individual chromosome.
[0073] As also shown in FIG. 2, the custom multigenome construction system 106 generates the simulated nucleotide reads 204 to include nucleotide sequences from each nucleotide assembly of the set of nucleotide assemblies 202. In some embodiments, for example, the custom multigenome construction system 106 generates the simulated nucleotide reads 204 by sampling segments of nucleotide sequences of individual chromosomes within each nucleotide assembly of the set of nucleotide assemblies 202 (e.g., as further described below in relation to FIG. 3A).
[0074] As illustrated, the custom multigenome construction system 106 utilizes the simulated nucleotide reads 204 to generate the variant data 206 comprising variant calls for each respective nucleotide assembly of the set of nucleotide assemblies 202. As described in further detail below, for example, the custom multigenome construction system 106 can utilize various analysis pipelines comprising a variety of sequence analysis models to (i) determine read alignments between the simulated nucleotide reads 204 and respective genomic regions of a reference genome (e.g., a linear reference genome) and (ii) generate variant calls (e.g., the variant data 206) relative to the reference genome for the simulated nucleotide reads 204 according to the read alignments. Accordingly, as shown in FIG. 2, the custom multigenome construction system 106 utilizes the variant data 206 for the set of nucleotide assemblies 202 to construct the multigenome reference 208 (e.g., as further described below in relation to FIGS. 3C and 6).
[0075] As previously mentioned, in some embodiments, the custom multigenome construction system 106 utilizes various pipelines to construct a multigenome reference (e.g., a pangenome reference) from a set of nucleotide assemblies for genomic samples representing a target population. For example, FIGS. 3A, 3B, and 3C respectively illustrate the custom multigenome construction system 106 (a) generating haplotype variant data 312 for an individual assembly haplotype 304 of a nucleotide assembly 302, (b) generating phased variant data 328 for a given diploid haplotype-resolved nucleotide assembly 322, and (c) generating a multi-sample phased variant dataset 338 for a set of haplotype-resolved nucleotide assemblies 332. Furthermore, FIG. 3D illustrates the custom multigenome construction system 106 generating phased variant data 348 for a given haplotype-resolved nucleotide assembly 342 of any ploidy level (e.g., a haploid assembly, a diploid assembly, or a polyploid assembly).
[0076] As shown in FIG. 3A, for instance, the custom multigenome construction system 106 performs a haplotype reference construction 300 to generate the haplotype variant data 312 for the individual assembly haplotype 304 of the nucleotide assembly 302. As mentioned above, in some embodiments, the nucleotide assembly 302 includes haplotype-resolved sequences for respective alleles of each chromosome represented by the nucleotide assembly 302. Alternatively, in one or more embodiments, the custom multigenome construction system 106 processes a nucleotide assembly that lacks an indication of allele-specific sequences for individual chromosomes. In some such embodiments, the custom multigenome construction system 106 performs an additional phasing analysis to determine the haplotype variant data 312 for individual assembly haplotypes of the nucleotide assembly 302 (e.g., as discussed below in relation to FIG. 4A).
[0077] As further shown in FIG. 3A, in implementations wherein the nucleotide assembly 302 includes haplotype-resolved sequences for individual chromosomes (e.g., a haplotype-resolved nucleotide assembly), the custom multigenome construction system 106 identifies (or receives) the assembly haplotype 304 for processing per the haplotype reference construction 300 pipeline. In some embodiments, the assembly haplotype 304 includes allelic sequences for every chromosome of the nucleotide assembly 302. Alternatively, in some embodiments, the custom multigenome construction system 106 considers an allelic sequence from a particular chromosome or genomic region of the nucleotide assembly 302 (e.g., to generate a multigenome limited to the particular chromosome or genomic region).
[0078] As illustrated, the custom multigenome construction system 106 generates a plurality of simulated nucleotide reads 306 from the assembly haplotype 304 (e.g., as described above in relation to FIG. 2). In some embodiments, for example, the custom multigenome construction system 106 generates the simulated nucleotide reads 306 by sampling a plurality of nucleotide sequences from the nucleotide assembly 302 (e.g., by sampling segments of the assembly haplotype 304). Furthermore, in one or more embodiments, the custom multigenome construction system 106 samples the nucleotide sequences with an initial read length of up to 300,000 nucleobases. In some embodiments, for example, the initial read length comprises 30,000 nucleobases. Moreover, in one or more embodiments, the custom multigenome construction system 106 samples overlapping segments to simulate a target read depth among the simulated nucleotide reads 306 for the assembly haplotype 304. Also, in one or more embodiments, the custom multigenome construction system 106 samples the segments with one or more of (a) a modified read depth lesser than or greater than the target read depth or (b) a modified read length greater than or lesser than the initial read length in one or more regions of the respective nucleotide assembly (e.g., to cover relatively ambiguous regions with increased resolution).
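As a non-limiting sketch of the sampling described above, overlapping segments of an assembly haplotype can be tiled so that interior positions are covered by approximately a target number of simulated reads. The function name `simulate_reads` and its parameters are hypothetical, and the deterministic fixed-stride tiling shown here is an assumption made for illustration; the disclosure does not specify how segment start positions are chosen. Each simulated read retains its start coordinate within the haplotype so that the origin of the read remains known.

```python
from typing import List, Tuple


def simulate_reads(
    haplotype: str, read_length: int = 30_000, target_depth: int = 4
) -> List[Tuple[int, str]]:
    """Tile overlapping segments of a haplotype sequence (sketch).

    Returns (start, sequence) pairs. The stride between consecutive reads
    is read_length // target_depth, so interior positions are covered by
    roughly target_depth reads.
    """
    if read_length <= 0 or target_depth <= 0:
        raise ValueError("read_length and target_depth must be positive")
    if not haplotype:
        return []
    step = max(1, read_length // target_depth)
    reads = []
    start = 0
    while True:
        reads.append((start, haplotype[start:start + read_length]))
        if start + read_length >= len(haplotype):
            break  # the last read reaches the end of the haplotype
        start += step
    return reads


# Toy example with a 100-base haplotype, 20-base reads, and depth 4.
toy = "ACGT" * 25
reads = simulate_reads(toy, read_length=20, target_depth=4)
depth_at_50 = sum(1 for s, seq in reads if s <= 50 < s + len(seq))
print(len(reads), depth_at_50)
```

Regions sampled with a modified read depth or read length, as described above, could be handled by calling the same function on a sub-region with different arguments.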
[0079] Furthermore, as shown in FIG. 3A, the custom multigenome construction system 106 performs a mapping and alignment 308 to determine read alignments between the simulated nucleotide reads 306 and respective genomic regions of a reference genome (e.g., a linear reference genome). In some embodiments, for example, the custom multigenome construction system 106 utilizes an alignment model, such as a haplotype-aware alignment model, to identify read alignments for the simulated nucleotide reads 306. Additionally, in one or more embodiments, the custom multigenome construction system 106 identifies, within the read alignments identified for the simulated nucleotide reads 306, at least one mis-mapped read based on comparing the respective genomic regions of the read alignments with relative coordinates of the simulated nucleotide reads 306 within the nucleotide assembly 302. Accordingly, in some embodiments, the custom multigenome construction system 106 utilizes relative positions of nucleotide sequences corresponding to the simulated nucleotide reads 306 to ensure accurate mapping and alignment 308 of the simulated nucleotide reads 306.
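One possible way to picture the mis-mapped read check described above is sketched below. The class names, the function name `find_mismapped`, and the `max_offset` tolerance are all hypothetical. The sketch further assumes, purely for illustration, that a read sampled from a given chromosome of the assembly should align to the same chromosome of the reference genome at a roughly similar coordinate; an actual implementation may compare coordinates in a different manner.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SimulatedRead:
    name: str
    source_chrom: str  # chromosome of the assembly the read was sampled from
    source_start: int  # start coordinate within that assembly chromosome


@dataclass
class Alignment:
    read_name: str
    ref_chrom: str  # reference chromosome the read was aligned to
    ref_start: int  # aligned start coordinate on the reference


def find_mismapped(
    reads: Iterable[SimulatedRead],
    alignments: Iterable[Alignment],
    max_offset: int = 1_000_000,
) -> List[str]:
    """Flag reads whose alignment disagrees with their known origin (sketch)."""
    origin = {read.name: read for read in reads}
    flagged = []
    for aln in alignments:
        src = origin.get(aln.read_name)
        if src is None:
            continue  # no known origin to compare against
        wrong_chrom = aln.ref_chrom != src.source_chrom
        too_far = abs(aln.ref_start - src.source_start) > max_offset
        if wrong_chrom or too_far:
            flagged.append(aln.read_name)
    return flagged


reads = [SimulatedRead("r1", "chr1", 10_000), SimulatedRead("r2", "chr1", 50_000)]
alns = [Alignment("r1", "chr1", 10_250), Alignment("r2", "chr7", 48_000)]
print(find_mismapped(reads, alns))
```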
[0080] Based on the read alignments identified by the mapping and alignment 308, the custom multigenome construction system 106 performs a variant calling 310 to generate variant calls within the assembly haplotype 304, as represented by the simulated nucleotide reads 306. In some embodiments, for example, the custom multigenome construction system 106 utilizes a probabilistic variant caller model to perform the variant calling 310. Additionally or alternatively, the custom multigenome construction system 106 can utilize a variant caller machine-learning model to perform the variant calling 310. As shown, the custom multigenome construction system 106 generates the haplotype variant data 312 for the assembly haplotype 304 by generating variant calls relative to the reference genome for the simulated nucleotide reads 306, according to the read alignments determined by the mapping and alignment 308. Accordingly, the custom multigenome construction system 106 can utilize the haplotype reference construction 300 to generate haplotype variant data for a given assembly haplotype of a given nucleotide assembly. In some embodiments, the custom multigenome construction system 106 outputs haplotype variant data in the form of an assembly haplotype-specific variant call file (e.g., a VCF file).
[0081] Additionally, in one or more implementations, the custom multigenome construction system 106 utilizes an additional pipeline to identify structural variants within a nucleotide assembly or an assembly haplotype thereof. As shown in FIG. 3A, for example, the custom multigenome construction system 106 performs a whole-genome or whole-contig alignment 307 of the assembly haplotype 304 with a reference genome. In some embodiments, for example, the custom multigenome construction system 106 aligns one or more entire contiguous sequences (e.g., representing an entire genomic sample from end to end or one or more entire chromosomes from telomere to telomere) of the assembly haplotype 304 with the reference genome. Further, in some embodiments, the custom multigenome construction system 106 performs the whole-genome/contig alignment 307 for multiple assembly haplotypes of a diploid or polyploid haplotype-resolved nucleotide assembly to a reference genome using a haplotype-aware alignment model. In one or more embodiments, the custom multigenome construction system 106 performs the whole-genome/contig alignment 307 for contiguous sequence(s) representing a haploid genome of an organism (or, for short reference, a haploid sample genome), where, for example, such an alignment for a haploid genome facilitates identifying structural variants within a haploid nucleotide assembly. Further, in some cases, the custom multigenome construction system 106 performs the whole-genome/contig alignment 307 for contiguous sequence(s) representing a haploid genomic region of a diploid genome or polyploid genome of an organism, where, for example, such alignments facilitate identifying structural variants within a haploid genomic region of a diploid haplotype-resolved nucleotide assembly or polyploid haplotype-resolved nucleotide assembly.
[0082] As also shown in FIG. 3A, having determined a whole-genome or whole-contig alignment between the assembly haplotype 304 and a reference genome, the custom multigenome construction system 106 performs a structural variant detection 309 to identify structural variations relative to the reference genome and accordingly generates haplotype structural variant (SV) data 311 for the assembly haplotype 304. In some embodiments, for example, the custom multigenome construction system 106 generates a variant call format (VCF) file comprising the haplotype SV data 311. Also, in some embodiments, the custom multigenome construction system 106 merges the haplotype SV data 311 with the haplotype variant data 312 (generated utilizing the simulated nucleotide reads 306 as described above).
[0083] In some embodiments, for example, the custom multigenome construction system 106 adds each structural variant identified by the structural variant detection 309 to the haplotype variant data 312. Alternatively, in some embodiments, the custom multigenome construction system 106 identifies a subset of structural variants from the haplotype SV data to merge with the haplotype variant data 312. In one or more embodiments, for example, the custom multigenome construction system 106 utilizes a structural variant comparison model (e.g., Truvari) to identify duplicate, nearly duplicate, or inconsistent structural variants between the haplotype SV data 311 and the haplotype variant data 312 (e.g., prior to the illustrated merging of variant data). In some cases, for instance, the custom multigenome construction system 106 retains a given duplicate, near duplicate, or inconsistent structural variant identified within the simulated nucleotide reads 306 while filtering (e.g., deleting or not including) a corresponding structural variant identified within the haplotype SV data 311.
Moreover, in some embodiments, the custom multigenome construction system 106 filters (e.g., deletes or does not include) one or more structural variants from the haplotype SV data 311 (e.g., based on determining that one or more quality metrics associated with the one or more structural variants fall below a predetermined quality threshold).
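By way of non-limiting illustration, the merging and filtering described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation and not the Truvari interface: the `SVCall` record, the `is_near_duplicate` matching rule, and the `max_dist`, `min_size_ratio`, and `min_qual` parameters are hypothetical stand-ins for a structural variant comparison model and a predetermined quality threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    pos: int       # 1-based start position on the reference
    svtype: str    # e.g. "DEL" or "INS"
    svlen: int     # signed length; negative for deletions
    qual: float    # hypothetical quality metric


def is_near_duplicate(a, b, max_dist=500, min_size_ratio=0.7):
    """Hypothetical rule: same contig and type, nearby start, similar size."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    small, large = sorted((abs(a.svlen), abs(b.svlen)))
    return large > 0 and small / large >= min_size_ratio


def merge_sv_calls(read_calls, contig_calls, min_qual=20.0):
    """Keep every read-derived call; add contig-derived calls that pass
    the quality threshold and do not duplicate a read-derived call."""
    merged = list(read_calls)
    for sv in contig_calls:
        if sv.qual < min_qual:
            continue  # filtered: below the quality threshold
        if any(is_near_duplicate(sv, rc) for rc in read_calls):
            continue  # filtered: the read-derived call is retained instead
        merged.append(sv)
    return sorted(merged, key=lambda c: (c.chrom, c.pos))
```

In this sketch, the call derived from simulated nucleotide reads always takes precedence over a matching call from the whole-contig alignment, mirroring the retention behavior described for some cases above.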
[0084] As shown in FIG. 3B, in some implementations, the custom multigenome construction system 106 further performs a diplotype reference construction 320 to generate the phased variant data 328 for the diploid haplotype-resolved nucleotide assembly 322. As illustrated, the diploid haplotype-resolved nucleotide assembly 322 comprises allele-specific sequences for individual chromosomes (e.g., maternal and paternal chromosomes). In some embodiments, separate alleles within a haplotype-resolved nucleotide assembly are designated without express indications of parental lineage (e.g., allelic sequences for individual chromosomes provided separately without ‘maternal’ or ‘paternal’ labeling). Accordingly, the custom multigenome construction system 106 can (i) identify a first assembly haplotype and a second assembly haplotype within a given haplotype-resolved nucleotide assembly — without identifying whether the first or second assembly haplotype is maternal or paternal — and (ii) generate a first set of simulated nucleotide reads for the first assembly haplotype and a second set of simulated nucleotide reads for the second assembly haplotype.
[0085] As illustrated, the custom multigenome construction system 106 extracts (or otherwise identifies or receives) respective assembly haplotypes 324a and 324b from the diploid haplotype-resolved nucleotide assembly 322 for the diplotype reference construction 320 of the phased variant data 328. As shown, for example, the assembly haplotype 324a comprises maternal allelic sequences from individual chromosomes and the assembly haplotype 324b comprises paternal allelic sequences from individual chromosomes. Alternatively, in some embodiments, each assembly haplotype extracted from a haplotype-resolved nucleotide assembly for analysis via the diplotype reference construction 320 includes an allelic sequence for an individual chromosome. Accordingly, the custom multigenome construction system 106 can generate the phased variant data 328 for the whole genome sequence of the diploid haplotype-resolved nucleotide assembly 322, for individual chromosomes included therein, or for any discrete genomic region within the diploid haplotype-resolved nucleotide assembly 322.
[0086] As further shown in FIG. 3B, the custom multigenome construction system 106 utilizes haplotype reference constructions 300a and 300b to generate haplotype variant data 326a and 326b for the assembly haplotypes 324a and 324b, respectively. Accordingly, each assembly haplotype is respectively processed per the haplotype reference construction 300 pipeline, as described above in relation to FIG. 3A, to generate corresponding haplotype reference data comprising variant calls for the respective assembly haplotypes. To generate the phased variant data 328 for the diploid haplotype-resolved nucleotide assembly 322, the custom multigenome construction system 106 compiles (e.g., combines) the haplotype variant data 326a for the assembly haplotype 324a with the haplotype variant data 326b for the assembly haplotype 324b into the phased variant data 328 to include diploid genotype calls relative to the reference genome. In some embodiments, for example, the custom multigenome construction system 106 combines two variant call files (e.g., VCF files) comprising the haplotype variant data 326a and 326b, respectively, to generate a phased variant call file (e.g., a sample-specific phased VCF file) comprising the phased variant data 328.
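For illustration only, combining two sets of haplotype variant data into phased diploid genotype calls can be sketched as below. The in-memory layout (a dictionary keyed by contig, position, and reference allele) and the function name `phase_diploid` are assumptions made for this sketch; an actual implementation would typically read and write VCF files. The genotype strings follow the VCF convention in which `|` separates phased alleles and `0` denotes the reference allele.

```python
def phase_diploid(hap1_calls, hap2_calls):
    """Combine two haplotype call sets into phased diploid records.

    Each input maps (chrom, pos, ref) -> alt allele for one assembly
    haplotype. A site absent from a haplotype is treated as reference.
    Returns tuples of (chrom, pos, ref, alts, genotype).
    """
    records = []
    for key in sorted(set(hap1_calls) | set(hap2_calls)):
        chrom, pos, ref = key
        alts = []      # distinct alternate alleles seen at this site
        genotype = []  # one allele index per haplotype, in haplotype order
        for calls in (hap1_calls, hap2_calls):
            alt = calls.get(key)
            if alt is None:
                genotype.append("0")
                continue
            if alt not in alts:
                alts.append(alt)
            genotype.append(str(alts.index(alt) + 1))
        records.append((chrom, pos, ref, ",".join(alts), "|".join(genotype)))
    return records
```

Under these assumptions, a variant present only on the first haplotype yields `1|0`, a variant shared by both yields `1|1`, and two different alternate alleles at the same site yield `1|2`.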
[0087] As shown in FIG. 3C, the custom multigenome construction system 106 compiles phased variant data 336a, 336b through 336n (e.g., sample-specific VCF files) corresponding to multiple respective haplotype-resolved nucleotide assemblies 334a, 334b through 334n of the set of haplotype-resolved nucleotide assemblies 332 to generate the multi-sample phased variant dataset 338. In some embodiments, for example, generating the multi-sample phased variant dataset 338 further comprises aligning the phased variant data of each respective haplotype-resolved nucleotide assembly according to genomic coordinates within the reference genome and inserting placeholders (e.g., within rows of a multi-sample VCF for respective nucleotide assemblies) representing deletions within respective nucleotide assemblies.
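As a further non-limiting sketch, compiling sample-specific phased variant data into a multi-sample table aligned on reference coordinates might look like the following. The function name `merge_samples`, the per-sample dictionary layout, and the `.|.` placeholder string are assumptions of this sketch; the placeholder used in a given embodiment may differ. For brevity, sites are keyed by the full (contig, position, reference allele, alternate allele) tuple, so multi-allelic sites are not collapsed into one row.

```python
def merge_samples(sample_calls, placeholder=".|."):
    """Align per-sample phased genotypes on shared reference coordinates.

    sample_calls maps sample name -> {(chrom, pos, ref, alt): genotype}.
    Returns (sample_names, rows), where each row is
    (chrom, pos, ref, alt, genotype_for_sample_1, ...).
    """
    names = sorted(sample_calls)
    # Note: contigs sort lexicographically here (e.g. "chr10" < "chr2").
    sites = sorted({site for calls in sample_calls.values() for site in calls})
    rows = []
    for site in sites:
        genotypes = [sample_calls[n].get(site, placeholder) for n in names]
        rows.append((*site, *genotypes))
    return names, rows
```

Each row then carries one genotype column per nucleotide assembly, with the placeholder filling positions where an assembly contributes no call.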
[0088] Accordingly, in one or more embodiments, the custom multigenome construction system 106 generates the multi-sample phased variant dataset 338, which in turn can be utilized to generate a multigenome reference for mapping and alignment of target genomic samples. In some embodiments, for example, the custom multigenome construction system 106 generates a graph multigenome reference or a reference haplotype database based on the genotype data included within the multi-sample phased variant dataset 338.
[0089] As also shown in FIG. 3C, the custom multigenome construction system 106 identifies (or receives) the set of haplotype-resolved nucleotide assemblies 332 comprising the haplotype-resolved nucleotide assemblies 334a-334n for constructing a multigenome reference (e.g., a pangenome reference) for a target population. As illustrated, the custom multigenome construction system 106 processes each of the haplotype-resolved nucleotide assemblies 334a, 334b through 334n individually via diplotype reference constructions 320a, 320b through 320n, respectively, to determine the respective phased variant data 336a-336n. Accordingly, each assembly of the set of haplotype-resolved nucleotide assemblies is respectively processed per the diplotype reference construction 320 pipeline, as described above in relation to FIGS. 3A-3B, to generate corresponding phased variant data comprising genotype calls (e.g., phased variant calls) for the respective nucleotide assemblies. Having generated variant calls and output phased variant data for the respective assemblies, the custom multigenome construction system 106 compiles the phased variant data 336a-336n for the respective haplotype-resolved nucleotide assemblies 334a-334n to generate the multi-sample phased variant dataset 338 for the set of haplotype-resolved nucleotide assemblies 332. In some embodiments, for example, the custom multigenome construction system 106 combines respective variant call format (VCF) files comprising the phased variant data 336a-336n to generate a multi-sample Variant Call Format (msVCF) file comprising the multi-sample phased variant dataset 338.
[0090] As shown in FIG. 3D, in some implementations, the custom multigenome construction system 106 performs a ploidy reference construction 340 to generate the phased variant data 348 for the haplotype-resolved nucleotide assembly 342 (e.g., a haplotype assembly for an organism or genomic region of any ploidy level). As illustrated, the haplotype-resolved nucleotide assembly 342 comprises allele-specific sequences for individual chromosomes (e.g., haplotype sequences corresponding to one or multiple chromosomes, depending on the ploidy level of the respective organism or the respective genomic region). Accordingly, the custom multigenome construction system 106 can generate phased variant data 348 for any number of assembly haplotypes. As depicted in FIG. 3D, for instance, one assembly haplotype is shown as an assembly haplotype 344a for a haploid organism or genomic region; two assembly haplotypes are shown as assembly haplotypes 344a-344b for a diploid organism or genomic region; or three or more assembly haplotypes are shown as assembly haplotypes 344a-344n for a polyploid organism or genomic region.
[0091] As also illustrated in FIG. 3D, in some cases, the custom multigenome construction system 106 extracts (or otherwise identifies or receives) the assembly haplotype 344a for haploid reference construction, the assembly haplotypes 344a-344b for diploid reference construction, or the assembly haplotypes 344a-344n for polyploid reference construction from the haplotype-resolved nucleotide assembly 342 for the ploidy reference construction 340 of the phased variant data 348. In some implementations, for example, the haplotype-resolved nucleotide assembly 342 comprises a haploid nucleotide assembly limited to the assembly haplotype 344a, such as an individual assembly haplotype from a diploid or polyploid assembly or an individual assembly from a haploid genome or haploid region of a genome. In such implementations directed to a haploid nucleotide assembly, the custom multigenome construction system 106 generates haplotype variant data 346a for the assembly haplotype 344a according to the haplotype reference construction 300a (e.g., as described above in relation to FIG. 3A).
[0092] As also shown in FIG. 3D, in some implementations, the haplotype-resolved nucleotide assembly 342 comprises a diploid nucleotide assembly or polyploid nucleotide assembly with two assembly haplotypes shown as assembly haplotypes 344a-344b or more than two assembly haplotypes shown as assembly haplotypes 344a-344n, respectively. As illustrated, the assembly haplotype 344a comprises allelic sequences from individual chromosomes of a first haplotype (“hap1”), the assembly haplotype 344b comprises allelic sequences from individual chromosomes of a second haplotype (“hap2”), and the assembly haplotype 344n comprises allelic sequences from individual chromosomes of a final haplotype (“hapN”) within the haplotype-resolved nucleotide assembly 342. Alternatively, in some embodiments, each assembly haplotype extracted from a haplotype-resolved nucleotide assembly for analysis via the ploidy reference construction 340 includes an allelic sequence for an individual chromosome. Accordingly, the custom multigenome construction system 106 can generate the phased variant data 348 for the whole genome sequence of the haplotype-resolved nucleotide assembly 342, for individual chromosomes included therein, or for any discrete genomic region within the haplotype-resolved nucleotide assembly 342, including genomic regions of local ploidy variation within a diploid or polyploid genome.
[0093] As further shown in FIG. 3D, the custom multigenome construction system 106 utilizes haplotype reference construction 300a for haploid reference construction, haplotype reference construction 300a and 300b for diploid reference construction, or haplotype reference construction 300a through 300n for polyploid reference construction to generate haplotype variant data 346a, 346a-346b, or 346a-346n for the assembly haplotypes 344a, 344a-344b, or 344a-344n, respectively. Accordingly, each assembly haplotype is respectively processed per the haplotype reference construction 300 pipeline, as described above in relation to FIG. 3A, to generate corresponding haplotype reference data comprising variant calls for the respective assembly haplotypes. To generate the phased variant data 348 for the haplotype-resolved nucleotide assembly 342 for (i) diploid reference construction or (ii) polyploid reference construction, the custom multigenome construction system 106 compiles (e.g., combines) the haplotype variant data 346a for the assembly haplotype 344a with (i) the haplotype variant data 346b for the assembly haplotype 344b or (ii) each of the variant data 346a-346n for the respective assembly haplotypes 344a-344n into the phased variant data 348 to include phased genotype calls (e.g., specified by haplotype) relative to the reference genome. In some embodiments, for example, the custom multigenome construction system 106 combines two or more variant call files (e.g., VCF files) comprising the haplotype variant data 346a-346b or 346a-346n, respectively, to generate a phased variant call file (e.g., a sample-specific phased VCF file) comprising the phased variant data 348.
[0094] Moreover, as similarly described in relation to FIG. 3C, the custom multigenome construction system 106 can further generate a multi-sample phased variant dataset by combining phased variant data (e.g., the phased variant data 348) for multiple haplotype-resolved nucleotide assemblies of any ploidy level (e.g., the haplotype-resolved nucleotide assembly 342). In some embodiments, for example, the custom multigenome construction system 106 combines a plurality of phased variant call format (VCF) files (e.g., sample-specific VCF files) comprising phased variant data for the respective haplotype-resolved nucleotide assemblies to generate a multi-sample Variant Call Format (msVCF) file comprising the multi-sample phased variant dataset.
[0095] To further illustrate, FIG. 3E shows an exemplary schematic of the custom multigenome construction system 106 generating phased variant data 362 for a diploid haplotype-resolved nucleotide assembly 352 utilizing a simulated-read variant calling pipeline and a whole-contiguous sequence structural variant calling pipeline in accordance with one or more embodiments. As explained below, in some cases, the custom multigenome construction system 106 uses a haplotype-assembly-to-VCF model (e.g., such as DRAGEN hap-asm2vcf) to determine variant calls based on simulated nucleotide reads derived from a haplotype-resolved nucleotide assembly. By contrast, in certain embodiments, the custom multigenome construction system 106 uses an assembly-based structural variant (SV) caller (e.g., Dipcall, Smartie-SV, Structural Variant Identification using Mapped long read Assembly (SVIM-ASM), or Phased Assembly Variant (PAV)) to determine structural variant calls based on contiguous sequences derived from the haplotype-resolved nucleotide assembly. The custom multigenome construction system 106 further merges (i) variant calls derived from simulated nucleotide reads and (ii) structural variant calls derived from contiguous sequences of a haplotype-resolved nucleotide assembly into respective datasets for each parental haplotype and further merges the consolidated variant-and-structural-variant-call data into a phased genotype-call data file (e.g., merged VCF).
[0096] As shown in FIG. 3E, for instance, the custom multigenome construction system 106 identifies (or receives) the diploid haplotype-resolved nucleotide assembly 352 comprising allele-specific sequences for individual chromosomes (e.g., maternal and paternal chromosomes). As mentioned above, in some embodiments, separate alleles within a haplotype-resolved nucleotide assembly, such as the diploid haplotype-resolved nucleotide assembly 352, are designated without express indications of parental lineage (e.g., allelic sequences for individual chromosomes provided separately without “maternal” or “paternal” labeling). Accordingly, the custom multigenome construction system 106 extracts, receives, or otherwise identifies a first assembly haplotype (shown as an assembly haplotype 354a) and a second assembly haplotype (shown as an assembly haplotype 354b) within the diploid haplotype-resolved nucleotide assembly 352, with or without identifying whether the assembly haplotype 354a or the assembly haplotype 354b is maternal or paternal.
[0097] As further illustrated, the custom multigenome construction system 106 extracts, receives, or otherwise identifies the respective assembly haplotypes 354a and 354b from the diploid haplotype-resolved nucleotide assembly 352 for the diplotype reference construction of the phased variant data 362. As shown, for example, the assembly haplotype 354a comprises maternal allelic sequences from individual chromosomes and the assembly haplotype 354b comprises paternal allelic sequences from individual chromosomes.
[0098] As also shown in FIG. 3E, the custom multigenome construction system 106 utilizes simulated-read variant calling 355a and 355b to generate variant calls for the assembly haplotype 354a and the assembly haplotype 354b, respectively, based on simulated nucleotide reads respectively generated from the assembly haplotype 354a and the assembly haplotype 354b (e.g., as described above in relation to FIG. 3A). To generate such variant calls (e.g., SNPs, indels, SVs), the custom multigenome construction system 106 can use a variant call model (e.g., DRAGEN) and a corresponding haplotype-assembly-to-VCF model (e.g., hap-asm2vcf) to process simulated nucleotide reads into variant calls within a genotype-call data file (e.g., VCF). Accordingly, as shown in FIG. 3E, the custom multigenome construction system 106 generates maternal haplotype variant data 357a (e.g., a first maternal VCF or other genotype-call data file) and paternal haplotype variant data 357b (e.g., a first paternal VCF or other genotype-call data file) with the variant calls generated by the simulated-read variant calling 355a for the assembly haplotype 354a and the simulated-read variant calling 355b for the assembly haplotype 354b, respectively.
[0099] As further shown in FIG. 3E, in addition to utilizing a pipeline for the simulated-read variant calling 355a and 355b to generate variant calls based on simulated nucleotide reads, the custom multigenome construction system 106 utilizes a pipeline for whole-contig structural variant calling 356a and 356b to generate structural variant calls for the assembly haplotype 354a and the assembly haplotype 354b, respectively, based on respective whole-contig alignments between a reference genome and the assembly haplotype 354a or the assembly haplotype 354b (e.g., as described above in relation to FIG. 3A). To generate such structural variant calls, the custom multigenome construction system 106 can use an assembly-based SV caller (e.g., Dipcall, Smartie-SV, SVIM-ASM, or PAV) to process contiguous sequences into structural variant calls within a genotype-call data file (e.g., VCF). Accordingly, as shown in FIG. 3E, the custom multigenome construction system 106 generates maternal haplotype SV data 358a (e.g., a second maternal VCF or other genotype-call data file) and paternal haplotype SV data 358b (e.g., a second paternal VCF or other genotype-call data file) with the structural variant calls generated by the whole-contig structural variant calling 356a for the assembly haplotype 354a and the whole-contig structural variant calling 356b for the assembly haplotype 354b, respectively.
[0100] After generating the foregoing maternal and/or paternal haplotype variant data, as shown in FIG. 3E, the custom multigenome construction system 106 merges the maternal haplotype variant data 357a and the maternal haplotype SV data 358a to generate merged maternal data 360a (e.g., a merged maternal VCF or other genotype-call data file) for the assembly haplotype 354a. Also, the custom multigenome construction system 106 merges the paternal haplotype variant data 357b and the paternal haplotype SV data 358b to generate merged paternal data 360b (e.g., a merged paternal VCF or other genotype-call data file) for the assembly haplotype 354b. In one or more embodiments, for example, the custom multigenome construction system 106 utilizes a structural variant comparison model to identify duplicate structural variants, near-duplicate structural variants, or inconsistent structural variants between variant calls generated utilizing the simulated-read variant calling 355a or 355b and the structural variant calls generated utilizing the whole-contig structural variant calling 356a or 356b, respectively (e.g., consistent with the description above in relation to FIG. 3A).
[0101] To generate the phased variant data 362 for the diploid haplotype-resolved nucleotide assembly 352, the custom multigenome construction system 106 compiles (e.g., combines) the merged maternal data 360a for the assembly haplotype 354a with the merged paternal data 360b for the assembly haplotype 354b into the phased variant data 362 to include diploid genotype calls relative to the reference genome. In some embodiments, for example, the custom multigenome construction system 106 combines two variant call or other genotype-call data files (e.g., VCF files) comprising the merged maternal data 360a and the merged paternal data 360b, respectively, to generate a phased genotype-call data file (e.g., a sample-specific phased VCF or other genotype-call data file) comprising the phased variant data 362. Furthermore, in some embodiments, the custom multigenome construction system 106 generates a multi-sample phased variant dataset by compiling (e.g., combining) multiple diploid haplotype-resolved assemblies to generate the multi-sample phased variant dataset (e.g., as described above in relation to FIG. 3C).
[0102] As mentioned previously, in some embodiments, the custom multigenome construction system 106 utilizes different pipelines for generating a multigenome reference from respectively different sequence data for known genomic samples representing a target population. For example, FIGS. 4A and 4B respectively illustrate the custom multigenome construction system 106 (a) utilizing a read-based pipeline 400a to generate a partially phased multi-sample variant dataset 412 from a set of known genomic samples 402 and (b) utilizing an assembly-based pipeline 400b to generate a phased multi-sample variant dataset 432 from a set of haplotype-resolved nucleotide assemblies 422.
[0103] As shown in FIG. 4A, for instance, the read-based pipeline 400a takes as input a set of nucleotide reads 404 corresponding to the set of known genomic samples 402. In some embodiments, for example, the custom multigenome construction system 106 receives a respective set of FASTQ sequence files comprising the nucleotide reads 404 from the respective set of known genomic samples 402. In contrast to the set of haplotype-resolved nucleotide assemblies 422 processed by the assembly-based pipeline 400b to construct a multigenome reference, the read-based pipeline 400a processes nucleotide sequences that are not organized according to separate assembly haplotypes. In some embodiments, for example, the nucleotide reads 404 comprise simulated nucleotide reads that are not preemptively phased for generating haplotype-specific variant data.
[0104] As further shown in FIG. 4A, the custom multigenome construction system 106 performs mapping and alignment 406 of the set of nucleotide reads 404 to determine alignments with respect to a reference genome (e.g., a linear reference genome). In some embodiments, for example, the custom multigenome construction system 106 generates an alignment data file (e.g., a BAM file) comprising read alignments between the nucleotide reads 404 and the reference genome. Alternatively, in one or more embodiments, the custom multigenome construction system 106 outputs or stores the results of the mapping and alignment 406 in a different format and/or medium (e.g., storing the alignment data in cache for use in subsequent analysis).
[0105] Based on the read alignments determined by the mapping and alignment 406 for the nucleotide reads 404, the custom multigenome construction system 106 performs a variant calling 408 to generate variant calls for the nucleotide reads 404 relative to the reference genome. In some embodiments, for example, the custom multigenome construction system 106 utilizes a variant caller machine-learning model to perform the variant calling 408 and, in some cases, to subsequently perform the mapping and alignment 406. Alternatively, the custom multigenome construction system 106 utilizes alternative models for generating variant calls for the nucleotide reads 404 relative to the reference genome, such as but not limited to a probabilistic variant caller model.
[0106] Furthermore, the custom multigenome construction system 106 performs a variant phasing 410 of the variant calls generated by the variant calling 408 for the nucleotide reads 404. In some embodiments, for example, the custom multigenome construction system 106 determines phasing of the variant calls utilizing a read-based phasing model, such as a probabilistic model, to predict the phasing of variant calls for the nucleotide reads 404. Accordingly, when utilizing the read-based pipeline 400a, the custom multigenome construction system 106 generates the partially phased multi-sample variant dataset 412 comprising partially phased variant data (e.g., genotype calls) for the nucleotide reads 404 of the known genomic samples 402.
[0107] As shown in FIG. 4B, the assembly-based pipeline 400b takes as input the set of haplotype-resolved nucleotide assemblies 422 to generate the phased multi-sample variant dataset 432 comprising a set of haplotype variant data 430a and a set of haplotype variant data 430b for respective assembly haplotypes indicated by the set of haplotype-resolved nucleotide assemblies 422. For instance, the custom multigenome construction system 106 generates multiple sets of simulated nucleotide reads 424a and 424b for the respective assembly haplotypes of the set of haplotype-resolved nucleotide assemblies 422. In some embodiments, for example, the custom multigenome construction system 106 receives a set of FASTA sequence files comprising the respective set of haplotype-resolved nucleotide assemblies 422 and generates a corresponding set of FASTQ files comprising the sets of simulated nucleotide reads 424a and 424b. Alternatively, in some embodiments, the custom multigenome construction system 106 receives, generates, and/or stores the haplotype-resolved nucleotide assemblies 422 and the sets of simulated nucleotide reads 424a and 424b in a different format and/or medium (e.g., cached within memory for subsequent procedures).
[0108] As further shown in FIG. 4B, the custom multigenome construction system 106 performs mapping and alignment 426a and 426b of the respective sets of simulated nucleotide reads 424a and 424b to determine alignments with respect to a reference genome (e.g., a linear reference genome). In some embodiments, for example, the custom multigenome construction system 106 generates an alignment data file (e.g., a BAM file) comprising read alignments between the simulated nucleotide reads 424a-424b and the reference genome. Alternatively, in one or more embodiments, the custom multigenome construction system 106 outputs or stores the results of the mapping and alignment 426a and 426b in a different format and/or medium (e.g., storing the alignment data in cache for use in subsequent analysis).
[0109] Based on the read alignments determined by the respective mapping and alignment 426a and 426b for the sets of simulated nucleotide reads 424a and 424b, the custom multigenome construction system 106 performs respective variant calling 428a and 428b to generate variant calls for the respective sets of simulated nucleotide reads 424a and 424b relative to the reference genome. In some embodiments, for example, the custom multigenome construction system 106 utilizes a probabilistic variant caller model and/or a variant caller machine-learning model to perform the variant calling 428a and 428b. Alternatively, the custom multigenome construction system 106 utilizes alternative models for generating read alignments and/or variant calls for the simulated nucleotide reads 424a-424b relative to the reference genome, such as but not limited to a machine-learning model configured for either or both variant calling and read alignment.
[0110] Accordingly, the custom multigenome construction system 106 generates the sets of haplotype variant data 430a and 430b comprising variant calls identified within the respective sets of simulated nucleotide reads 424a and 424b. In some embodiments, for example, the custom multigenome construction system 106 outputs respective sets of VCF files comprising the respective sets of haplotype variant data 430a and 430b and compiles the variant data to generate a multi-sample VCF file comprising the phased multi-sample variant dataset 432. Alternatively, in some embodiments, the custom multigenome construction system 106 includes the haplotype variant data 430a and 430b within the phased multi-sample variant dataset 432 (e.g., as a multi-sample VCF file) without outputting the aforementioned haplotype-specific VCF files.
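By way of illustration, generating simulated nucleotide reads in FASTQ form from a FASTA assembly haplotype can be sketched as follows. This is a minimal sketch under stated assumptions rather than the described system's read simulator: the fixed-length tiling scheme, the `read_len` and `step` parameters, the read naming, and the constant placeholder quality character are all hypothetical choices, and no sequencing errors are modeled.

```python
def parse_fasta(text):
    """Parse FASTA text into {contig_name: sequence}."""
    contigs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID, drop the description
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in contigs.items()}


def simulate_reads(name, seq, read_len=150, step=50):
    """Tile error-free reads across a contig at a fixed stride."""
    last_start = max(len(seq) - read_len, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the contig end is covered
    return [(f"{name}:{s}", seq[s:s + read_len]) for s in starts]


def to_fastq(reads, qual_char="I"):
    """Render reads as FASTQ text with a constant placeholder quality."""
    entries = [f"@{rid}\n{seq}\n+\n{qual_char * len(seq)}" for rid, seq in reads]
    return "\n".join(entries) + "\n"
```

Because each read is drawn from a single assembly haplotype, every read in a given output file is phase-consistent by construction, which is what distinguishes this input from the unphased reads of the read-based pipeline 400a.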
[0111] As previously mentioned, in some embodiments, the custom multigenome construction system 106 generates a custom multigenome reference (e.g., a custom pangenome reference) from a customized selection of nucleotide assemblies for a target population and/or a target genomic sample. For example, FIG. 5 depicts a graphical user interface 500 for selecting a customized set of nucleotide assemblies for implementation within a custom multigenome reference according to one or more embodiments.
[0112] As shown in FIG. 5, the graphical user interface 500 includes a list of selectable known genomic samples (e.g., “HG002,” “HIFI03268,” “Sample1,” and so forth) for which nucleotide assemblies are available for construction of a custom multigenome reference. For each selectable genomic sample, the graphical user interface 500 includes summary information for the corresponding nucleotide assembly, such as an organizational source (e.g., “HPRC”), a respective population/continental origin (e.g., “European” or “South Asia”), a respective sub-population/ethnicity (e.g., “Ashkenazi”), a respective karyotype (e.g., “XY” or “XX”), and various statistical data.
[0113] As illustrated, a limited set of nucleotide assemblies is selected for inclusion within a custom multigenome reference. Moreover, the graphical user interface 500 includes additional selectable options for multigenome construction and/or evaluation, such as selection of a reference genome (e.g., “hg38”), optional evaluation of the resultant multigenome reference with respect to various truth sets (e.g., truth sets for human genome samples HG001-HG007), a search function for identifying nucleotide assemblies for potential inclusion, and a selectable option to upload a nucleotide assembly not yet included within the available nucleotide references (e.g., “Bring Your Own Genomes”).
[0114] As previously mentioned, in some embodiments, the custom multigenome construction system 106 generates and utilizes a custom multigenome reference to determine genotype calls for a target genomic sample with improved accuracy relative to existing sequencing systems. For example, FIG. 6 depicts respective pipelines for generating genotype calls 614 and 634 for a target genomic sample 610 utilizing (i) a graph pangenome reference 608 generated from a set of haplotype-resolved nucleotide assemblies 602 by an existing sequencing system and (ii) a custom multigenome reference 628 generated from the set of haplotype-resolved nucleotide assemblies 602 by the custom multigenome construction system 106 according to one or more embodiments.
[0115] As shown in FIG. 6, the portrayed existing sequencing system performs an assembly alignment and variant calling 604 of the set of haplotype-resolved nucleotide assemblies 602 to align whole contiguous sequences of assembly chromosomes with a reference genome and filter or otherwise identify variants within the respective whole contiguous sequences relative to the reference genome. Further, the existing sequencing system performs a graph fragment assembly 606 to generate a Graphical Fragment Assembly (GFA) file representing an assembly graph of the set of haplotype-resolved nucleotide assemblies 602, including nodes, edges, and associated attributes typical of graph-based genomes.
[0116] Accordingly, the existing sequencing system further utilizes the results (e.g., the aforementioned GFA file) of the graph fragment assembly 606, in combination with the reference genome used for the assembly alignment and variant calling 604, to generate the graph pangenome reference 608. For example, existing sequencing systems often produce a graph pangenome reference (such as the graph pangenome reference 608) comprising a graph-based structure capturing genetic variations, alternative alleles, haplotypes, and structural variations in relation to a linear reference genome.
[0117] As further illustrated by FIG. 6, the portrayed existing sequencing system can utilize the graph pangenome reference 608 to perform a read alignment and variant calling 612 of nucleotide reads from a target genomic sample 610 to generate the genotype calls 614 for the target genomic sample 610 relative to the linear reference genome. In some cases, for example, existing sequencing systems utilize a machine-learning model to align sample nucleotide reads and/or determine variant calls for them. In particular, the portrayed sequencing system utilizes significantly different models to perform the read alignment and variant calling 612 for the sample nucleotide reads of the target genomic sample 610 than those implemented for the assembly alignment and variant calling 604 of the whole chromosome sequences of the set of haplotype-resolved nucleotide assemblies 602.
[0118] Moreover, as shown in FIG. 6, the custom multigenome construction system 106 performs a simulated read alignment and variant calling 624 (e.g., as discussed above in relation to FIGS. 3A-3D) to determine variant calls from respective simulated nucleotide reads sampled from the set of haplotype-resolved nucleotide assemblies 602 and outputs a phased multi-sample variant dataset 626 (e.g., a multi-sample VCF file). Further, the custom multigenome construction system 106 utilizes the phased multi-sample variant dataset 626 to generate the custom multigenome reference 628 corresponding to the set of haplotype-resolved nucleotide assemblies 602. In some embodiments, for example, the custom multigenome reference 628 comprises an augmented linear reference comprising a linear reference genome augmented by annotations indicating variants, alternative alleles, and so forth, as identified by the simulated read alignment and variant calling 624. Alternatively, in some embodiments, the custom multigenome reference 628 comprises a reference haplotype database indicating allele-variant differences between the linear reference genome and haplotype variants provided within the phased multi-sample variant dataset 626.
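As a purely illustrative sketch of how haplotype-specific variant calls might be combined into phased genotypes and then compiled across assemblies, the following Python example can be considered. The function names (phase_diploid, compile_multisample), the tuple representation of a variant, and the choice to fill absent variants with the reference genotype "0|0" are assumptions made for illustration and are not taken from the disclosure; each (chromosome, position, reference allele, alternate allele) tuple is treated as a separate biallelic record.

```python
def phase_diploid(hap1_calls, hap2_calls):
    """Combine two haplotype-specific call sets into phased diploid genotypes.

    Each call set is a set of (chrom, pos, ref, alt) tuples. The result maps
    every variant seen on either haplotype to a phased genotype "h1|h2".
    """
    genotypes = {}
    for variant in hap1_calls | hap2_calls:
        h1 = 1 if variant in hap1_calls else 0
        h2 = 1 if variant in hap2_calls else 0
        genotypes[variant] = f"{h1}|{h2}"
    return genotypes


def compile_multisample(per_sample):
    """Merge per-assembly phased genotypes into one multi-sample table.

    per_sample maps an assembly name to the dict returned by phase_diploid.
    Assemblies that lack a variant receive the reference genotype "0|0"
    (an illustrative assumption).
    """
    samples = sorted(per_sample)
    sites = sorted({variant for calls in per_sample.values() for variant in calls})
    rows = [
        (site, [per_sample[name].get(site, "0|0") for name in samples])
        for site in sites
    ]
    return samples, rows


asm_a = phase_diploid(
    {("chr1", 100, "A", "G")},
    {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
)
asm_b = phase_diploid({("chr1", 50, "T", "C")}, set())
samples, rows = compile_multisample({"asmA": asm_a, "asmB": asm_b})
```

In this sketch, rows are sorted by genomic coordinate so that each row resembles one record of a multi-sample variant call file, with one phased genotype column per assembly.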
[0119] Furthermore, as illustrated, the custom multigenome construction system 106 utilizes the custom multigenome reference 628 to perform a read alignment and variant calling 632 of nucleotide reads from the target genomic sample 610 to generate the genotype calls 634 for the target genomic sample 610. In some embodiments, for example, the custom multigenome construction system 106 utilizes a probabilistic variant caller model and/or a machine-learning-based variant-call model to identify the genotype calls 634 for the target genomic sample relative to the linear reference genome.
[0120] As mentioned above, in certain described embodiments, the custom multigenome construction system 106 generates and utilizes a multigenome reference to implement mapping and alignment of nucleotide reads from a target genomic sample with genomic regions of a reference genome with increased accuracy. To illustrate, FIGS. 7-8 show experimental results of the custom multigenome construction system 106 generating and utilizing a custom multigenome reference, in accordance with some of the disclosed embodiments, to determine alignments of nucleotide reads. In particular, FIGS. 7-8 illustrate comparative results of identifying single nucleotide polymorphisms (SNPs) and insertions or deletions (indels) based on read alignments generated according to one or more embodiments.
[0121] Specifically, FIG. 7 includes a table of experimental results of identifying SNPs and indels in nucleotide reads from the whole genome sequencing of a known genomic sample (the Genome-in-a-Bottle Human Genome sample HG002) aligned with a reference genome utilizing (i) a pangenome reference generated by an existing sequencing system (see row indicated as “prior”) and (ii) a custom multigenome reference generated by the custom multigenome construction system 106 (see row indicated as “OURS”). As shown, the numerical values depicted in association with each column correspond to resulting false negative (FN) variant calls, false positive (FP) variant calls, and the combined false positive and false negative (FP+FN) variant calls for SNPs and indels, respectively, as identified per a U.S. National Institute of Standards and Technology (NIST) truth set for the HG002 sample. Further, the table in FIG. 7 includes variant calling results generated by the existing sequencing system and the custom multigenome construction system 106 when considering (i) a subset of the HG002 assembly generally understood as exhibiting relatively high-confidence results when using the existing sequencing system (“high-confidence bed only”) and (ii) the entire HG002 assembly (“whole assembly”). Indeed, as indicated by the relative quantities of false negatives and false positives depicted by FIG. 7, read alignments determined utilizing a custom multigenome reference generated by the custom multigenome construction system 106, according to one or more embodiments, exhibit significantly improved accuracy in identifying variant calls over the results of the existing sequencing system.
[0122] Furthermore, FIG. 8 includes a bar graph of experimental results of identifying SNPs and indels in nucleotide reads from the whole genome sequencing of several known genomic samples (the Genome-in-a-Bottle Human Genome samples HG001-HG007, respectively) aligned utilizing (i) a linear reference genome (“Linear Reference”), (ii) two different pangenome references generated by existing reference-assembly systems (“Pangenome A” and “Pangenome B”; see, e.g., the graph pangenome reference 608 depicted in FIG. 6), and (iii) a custom multigenome reference generated by the custom multigenome construction system 106 according to one or more embodiments (“Custom Multigenome”; see, e.g., the custom multigenome reference 628 depicted in FIG. 6). Further, the numerical values depicted in association with each bar column correspond to resulting false negative (FN) variant calls, false positive (FP) variant calls, and the combined false positive and false negative (FP+FN) variant calls, respectively. Indeed, as further indicated by the relative quantities of false negatives and false positives depicted by FIG. 8, read alignments determined utilizing a custom multigenome reference generated by the custom multigenome construction system 106, according to one or more embodiments, exhibit significantly improved accuracy in identifying variant calls over the results of various existing sequencing systems.
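For illustration only, the FN, FP, and FP+FN quantities discussed in relation to FIGS. 7-8 can be understood through the following minimal Python sketch, which compares a set of variant calls with a truth set by exact match. The function name score_calls and the tuple representation of a variant are hypothetical, and exact matching is a simplifying assumption; practical benchmarking typically normalizes variant representations before comparison.

```python
def score_calls(truth, calls):
    """Count false negatives and false positives against a truth set.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    A false negative is a truth variant that was not called; a false
    positive is a called variant that is absent from the truth set.
    """
    truth, calls = set(truth), set(calls)
    fn = len(truth - calls)
    fp = len(calls - truth)
    return {"FN": fn, "FP": fp, "FP+FN": fn + fp}


truth_set = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 5, "G", "GA")]
called = [("chr1", 100, "A", "G"), ("chr2", 9, "T", "C")]
summary = score_calls(truth_set, called)
```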
[0123] Turning now to FIGS. 9A-9B, these figures illustrate respective example flowcharts of two series of acts for (a) generating a multi-sample phased variant dataset representing a multigenome reference and (b) utilizing a multigenome reference to generate genotype calls for a target genomic sample in accordance with one or more embodiments. While FIGS. 9A-9B illustrate acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIGS. 9A-9B. The acts of FIGS. 9A-9B can be performed as part of one or more methods. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform some or all of the acts depicted in FIGS. 9A-9B. In still further embodiments, a system comprises at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by one or more processors, cause the system to perform some or all of the acts of FIGS. 9A-9B.
[0124] As shown in FIG. 9A, the series of acts 900 includes an act 902 of generating simulated reads from each nucleotide assembly of a set of nucleotide assemblies, an act 904 of determining read alignments between the simulated nucleotide reads and respective genomic regions of a reference genome, an act 906 of generating variant calls relative to the reference genome for the simulated nucleotide reads, an act 908 of generating phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies, and an act 910 of compiling the phased variant data to generate a multi-sample phased variant dataset representing a multigenome reference (e.g., a pangenome reference).
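By way of a non-limiting sketch, the act 902 of generating simulated reads could be approximated by tiling overlapping, fixed-length windows along an assembly sequence so that interior positions are covered at roughly a target read depth. The function names, the fixed step size of read length divided by target depth, and the extra final read that covers the sequence tail are assumptions made for illustration rather than the disclosed implementation.

```python
def simulate_reads(assembly_seq, read_length, target_depth):
    """Sample overlapping reads in a progressive series along a sequence.

    Returns a list of (start, read) pairs. Consecutive reads are offset by
    read_length // target_depth, so each interior position is covered by
    about target_depth reads.
    """
    if read_length <= 0 or target_depth <= 0:
        raise ValueError("read_length and target_depth must be positive")
    if not assembly_seq:
        return []
    step = max(1, read_length // target_depth)
    last_start = max(0, len(assembly_seq) - read_length)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the sequence is covered
    return [(s, assembly_seq[s:s + read_length]) for s in starts]


def depth_at(reads, position):
    """Number of simulated reads that overlap a given assembly position."""
    return sum(1 for start, seq in reads if start <= position < start + len(seq))


sequence = "ACGT" * 25  # 100-base toy assembly sequence
reads = simulate_reads(sequence, read_length=20, target_depth=4)
```

Because each read keeps its start coordinate within the assembly, the origin of every simulated read remains known after alignment.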
[0125] As shown in FIG. 9B, the series of acts 920 includes an act 922 of identifying nucleotide reads corresponding to a target genomic sample, an act 924 of determining read alignments between the nucleotide reads and respective genomic regions of a linear reference genome, an act 926 of comparing the nucleotide reads with a multigenome reference (e.g., a pangenome reference) comprising haplotype variants identified using simulated nucleotide reads sampled from a set of nucleotide assemblies, and an act 928 of generating genotype calls for the target genomic sample relative to the linear reference genome based on the read alignments.
[0126] For example, the series of acts 900 and/or the series of acts 920 can include acts to perform any of the operations described in the following clauses:
CLAUSE 1. A method comprising: generating simulated nucleotide reads comprising nucleotide sequences from each nucleotide assembly of a set of nucleotide assemblies; determining read alignments between the simulated nucleotide reads and respective genomic regions of a reference genome; generating variant calls relative to the reference genome for the simulated nucleotide reads according to the read alignments; generating phased variant data comprising the variant calls for each respective nucleotide assembly of the set of nucleotide assemblies; and compiling the phased variant data for the set of nucleotide assemblies to generate a multi-sample phased variant dataset representing a multigenome reference.
CLAUSE 2. The method according to clause 1, further comprising: identifying, for a given haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, a first assembly haplotype and a second assembly haplotype; generating, from the given haplotype-resolved nucleotide assembly, a first set of the simulated nucleotide reads for the first assembly haplotype and a second set of the simulated nucleotide reads for the second assembly haplotype; generating the variant calls relative to the reference genome, including variant calls for the first set of the simulated nucleotide reads and variant calls for the second set of the simulated nucleotide reads; and generating, for the given haplotype-resolved nucleotide assembly, the phased variant data comprising diploid genotype calls by combining the variant calls for the first set of the simulated nucleotide reads with the variant calls for the second set of the simulated nucleotide reads.
CLAUSE 3. The method according to clause 2, further comprising: identifying the first assembly haplotype by identifying, for the given haplotype-resolved nucleotide assembly, a first nucleotide sequence comprising a first set of alleles in a chromosome from a first parent; and identifying the second assembly haplotype by identifying, for the given haplotype-resolved nucleotide assembly, a second nucleotide sequence comprising a second set of alleles in the chromosome from a second parent.
CLAUSE 4. The method according to clause 1, further comprising: identifying, for a given polyploid haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, a plurality of assembly haplotypes; generating, from the given polyploid haplotype-resolved nucleotide assembly, respective sets of the simulated nucleotide reads for the plurality of assembly haplotypes; generating the variant calls relative to the reference genome, including respective (e.g., haplotype-specific) variant calls for each of the respective sets of the simulated nucleotide reads; and generating, for the given polyploid haplotype-resolved nucleotide assembly, the phased variant data comprising polyploid genotype calls by compiling the variant calls for each of the respective sets of the simulated nucleotide reads.
CLAUSE 5. The method according to clause 1, further comprising: generating, for a given haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, one or more variant calls corresponding to a haploid genomic region of the reference genome; and generating, for the given haplotype-resolved nucleotide assembly, the phased variant data comprising haploid genotype calls including the one or more variant calls corresponding to the haploid genomic region.
CLAUSE 6. The method according to clause 1, further comprising generating the phased variant data utilizing a read-based phasing model to determine phasing of the variant calls for each nucleotide assembly of the set of nucleotide assemblies.
CLAUSE 7. The method of any of clauses 1-6, further comprising generating the simulated nucleotide reads comprising nucleotide sequences by sampling assembly sequences corresponding to individual chromosomes within each nucleotide assembly of the set of nucleotide assemblies.
CLAUSE 8. The method according to any of clauses 1-7, further comprising: generating the phased variant data by generating sample-specific variant call files comprising the phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies; and compiling the phased variant data by compiling the phased variant data of the sample-specific variant call files to generate a multi-sample variant call file comprising the multi-sample phased variant dataset.
CLAUSE 9. The method according to clause 8, further comprising generating the multi-sample variant call file comprising the multi-sample phased variant dataset by: aligning the sample-specific variant call files according to genomic coordinates within the reference genome; and inserting, within rows for respective nucleotide assemblies within the multi-sample variant call file, data placeholders representing deletions within the respective nucleotide assemblies.
CLAUSE 10. The method according to any of clauses 8-9, further comprising generating the multigenome reference by generating a reference haplotype database or a graph multigenome reference utilizing the multi-sample variant call file.
CLAUSE 11. The method according to any of clauses 1-10, further comprising generating the simulated nucleotide reads by sampling a plurality of overlapping nucleotide sequences in a progressive series of segments of each nucleotide assembly of the set of nucleotide assemblies.
CLAUSE 12. The method according to clause 11, further comprising generating the simulated nucleotide reads by sampling the plurality of overlapping nucleotide sequences to simulate a target read depth of nucleotides within the simulated nucleotide reads.
CLAUSE 13. The method according to clause 12, further comprising generating the simulated nucleotide reads with a modified read depth greater than or lesser than the target read depth at one or more regions of a given nucleotide assembly of the set of nucleotide assemblies.
CLAUSE 14. The method according to any of clauses 11-13, further comprising generating the simulated nucleotide reads by sampling the plurality of overlapping nucleotide sequences with an initial read length of up to 300,000 nucleobases.
CLAUSE 15. The method according to clause 14, further comprising sampling the plurality of overlapping nucleotide sequences with a modified read length greater than or lesser than the initial read length at one or more regions of a given nucleotide assembly of the set of nucleotide assemblies.
CLAUSE 16. The method according to any of clauses 1-15, further comprising generating the read alignments between the simulated nucleotide reads and the respective genomic regions of the reference genome utilizing a haplotype-aware alignment model.
CLAUSE 17. The method according to any of clauses 1-16, further comprising generating the variant calls relative to the reference genome for the simulated nucleotide reads utilizing a probabilistic variant call model or a machine-learning-based variant-call model.
CLAUSE 18. The method according to any of clauses 1-17, further comprising: identifying, within the read alignments between the simulated nucleotide reads and the respective genomic regions of the reference genome, at least one mis-mapped read based on comparing the respective genomic regions with relative coordinates of the simulated nucleotide reads within a respective nucleotide assembly of the set of nucleotide assemblies; and excluding one or more variant calls corresponding to the at least one mis-mapped read from the phased variant data for the respective nucleotide assembly.
CLAUSE 19. The method of any of clauses 1-18, further comprising: generating, for a given nucleotide assembly of the set of nucleotide assemblies, respective variant calls relative to the reference genome based on respective simulated nucleotide reads generated from the given nucleotide assembly; determining a whole-contig alignment between contiguous sequences of the given nucleotide assembly and corresponding genomic regions of the reference genome; generating one or more structural variant calls for the given nucleotide assembly based on the whole-genome alignment or the whole-contig alignment; and generating, for the given nucleotide assembly, the phased variant data comprising the respective variant calls based on the respective simulated nucleotide reads and the one or more structural variant calls based on the whole-contig alignment.
CLAUSE 20. The method of any of clauses 1-19, wherein: generating the respective variant calls comprises generating, utilizing a variant call model, a set of variant calls in a first maternal genotype-call data file and a set of variant calls in a first paternal genotype-call data file; generating the one or more structural variant calls comprises generating, utilizing an assembly-based structural variant (SV) caller, a set of structural variant calls in a second maternal genotype-call data file and a set of structural variant calls in a second paternal genotype-call data file; merging (i) the set of variant calls and the set of structural variant calls from the first maternal genotype-call data file and the second maternal genotype-call data file into a merged maternal genotype-call data file and (ii) the set of variant calls and the set of structural variant calls from the first paternal genotype-call data file and the second paternal genotype-call data file into a merged paternal genotype-call data file; and generating the phased variant data comprises generating the phased variant data in a phased genotype-call data file from the merged maternal genotype-call data file and the merged paternal genotype-call data file.
CLAUSE 21. The method according to any of clauses 1-20, further comprising: determining one or more read alignments between one or more nucleotide reads corresponding to a target genomic sample and one or more respective genomic regions of the reference genome based on comparing the one or more nucleotide reads with the multigenome reference; and generating genotype calls for a target genomic sample relative to the reference genome based on the one or more read alignments.
CLAUSE 22. The method according to any of clauses 1-21, further comprising: receiving at least one indication of a user interaction selecting, from a database of nucleotide assemblies, a target population or the set of nucleotide assemblies for the target population; and based on the selection of the target population or the set of nucleotide assemblies for the target population, generating the multi-sample phased variant dataset representing the multigenome reference for the target population.
CLAUSE 23. A method comprising: identifying one or more nucleotide reads corresponding to a target genomic sample; determining one or more read alignments between the one or more nucleotide reads and one or more respective genomic regions of a linear reference genome based on comparing the one or more nucleotide reads with a multigenome reference comprising haplotype variants identified using simulated nucleotide reads sampled from a set of nucleotide assemblies; and generating genotype calls for the target genomic sample relative to the linear reference genome based on the one or more read alignments.
CLAUSE 24. The method according to clause 23, wherein the multigenome reference comprises phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies.
CLAUSE 25. The method according to clause 23, wherein the set of nucleotide assemblies comprises a plurality of haploid nucleotide assemblies of (i) a plurality of haploid genomes of one or more organisms or (ii) haploid regions of a plurality of diploid genomes or polyploid genomes of one or more organisms.
CLAUSE 26. The method according to any of clauses 23-24, further comprising generating the genotype calls by generating a genotype call identifying a genotype for the target genomic sample at a genomic coordinate relative to a known genomic sample represented by a segment of phased variant data within the multigenome reference.
CLAUSE 27. The method according to any of clauses 23-24 or 26, further comprising: receiving, for a target population corresponding to the target genomic sample, the set of nucleotide assemblies corresponding to a set of known genomic samples; and generating a multi-sample phased variant dataset representing the multigenome reference for the target population.
CLAUSE 28. The method according to clause 27, further comprising: receiving at least one indication of a user interaction selecting, from a database of nucleotide assemblies, the target population or the set of nucleotide assemblies for the target population; and based on the selection of the target population or the set of nucleotide assemblies for the target population, generating the multi-sample phased variant dataset representing the multigenome reference for the target population.
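As one hypothetical illustration of the mis-mapped read handling recited in clause 18 above, the following Python sketch flags simulated reads whose aligned reference position disagrees with the position expected from their known coordinates within the source assembly, and then drops variant calls supported only by flagged reads. The SimRead fields, the availability of an expected reference position for each read, the tolerance value, and the rule for excluding calls are all assumptions introduced for illustration.

```python
from typing import NamedTuple


class SimRead(NamedTuple):
    name: str
    expected_chrom: str   # reference chromosome implied by the read's assembly origin
    expected_pos: int     # reference position implied by the read's assembly origin
    aligned_chrom: str    # chromosome reported by the aligner
    aligned_pos: int      # position reported by the aligner


def find_mismapped(reads, tolerance=1000):
    """Return names of reads aligned far from where their origin implies."""
    return {
        r.name
        for r in reads
        if r.aligned_chrom != r.expected_chrom
        or abs(r.aligned_pos - r.expected_pos) > tolerance
    }


def filter_calls(calls, mismapped):
    """Keep only variant calls with at least one well-mapped supporting read.

    calls maps a variant tuple to the names of the reads supporting it.
    """
    return {
        variant: names
        for variant, names in calls.items()
        if set(names) - mismapped
    }


reads = [
    SimRead("r1", "chr1", 10_000, "chr1", 10_020),
    SimRead("r2", "chr1", 50_000, "chr7", 81_300),    # wrong chromosome
    SimRead("r3", "chr2", 70_000, "chr2", 250_000),   # far from expected position
]
bad = find_mismapped(reads)
calls = {
    ("chr1", 10_050, "A", "G"): ["r1"],
    ("chr7", 81_350, "C", "T"): ["r2"],
    ("chr2", 250_010, "G", "A"): ["r3", "r1"],
}
kept = filter_calls(calls, bad)
```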
[0127] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
[0128] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
[0129] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used, as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible, as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
[0130] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
[0131] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of a nucleotide at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
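To illustrate how per-nucleotide signals of this kind might be turned into a sequence for a single array feature, the following simplified Python sketch decodes a series of nucleotide deliveries and their measured signals. It assumes, for illustration only, that each signal has been normalized so that a single incorporation yields a value near 1.0 and that the signal scales with the number of nucleotides incorporated in that delivery; the function name decode_flowgram and the example values are hypothetical.

```python
def decode_flowgram(flows):
    """Decode (nucleotide, normalized_signal) pairs into a nascent-strand sequence.

    Each signal is rounded to the nearest whole number of incorporations,
    so a signal near 2.0 is read as the delivered nucleotide added twice
    and a signal near 0.0 as no incorporation.
    """
    bases = []
    for nucleotide, signal in flows:
        count = max(0, round(signal))
        bases.append(nucleotide * count)
    return "".join(bases)


# One feature observed across four successive nucleotide deliveries.
flows = [("T", 1.02), ("A", 0.05), ("C", 2.10), ("G", 0.90)]
sequence = decode_flowgram(flows)
```

The decoded string corresponds to the nascent strand; the template sequence would be its reverse complement.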
[0132] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
[0133] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
[0134] In particular embodiments, some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include a fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.
[0135] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.
[0136] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label, such that it is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
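By way of illustration only, the exemplary two-channel configuration above can be sketched as a lookup from per-channel detection to a nucleotide type. The function names, the normalized intensity values, and the 0.5 threshold are hypothetical and are not taken from the disclosure; only the dATP/dCTP/dTTP/dGTP channel assignment follows the example in the text.

```python
# Minimal sketch of two-channel base calling under assumed, normalized intensities.
# CHANNEL_CODE mirrors the exemplary embodiment: A = channel 1 only, C = channel 2 only,
# T = both channels, G = no (or minimal) signal. Names and threshold are hypothetical.
CHANNEL_CODE = {
    (True, False): "A",   # detected in the first channel only
    (False, True): "C",   # detected in the second channel only
    (True, True): "T",    # detected in both channels
    (False, False): "G",  # unlabeled: not, or only minimally, detected
}


def call_base(ch1: float, ch2: float, threshold: float = 0.5) -> str:
    """Map one feature's two channel intensities to a base for a single cycle."""
    return CHANNEL_CODE[(ch1 >= threshold, ch2 >= threshold)]


def call_read(intensities, threshold: float = 0.5) -> str:
    """Call one base per cycle from a sequence of (channel 1, channel 2) pairs."""
    return "".join(call_base(a, b, threshold) for a, b in intensities)


# Four cycles for one feature: channel-1 only, channel-2 only, both, neither.
read = call_read([(0.9, 0.1), (0.1, 0.8), (0.95, 0.9), (0.05, 0.02)])
print(read)  # ACTG
```

A fixed threshold is used here only to keep the sketch short; it stands in for whatever intensity-based discrimination a particular embodiment applies.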
[0137] Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
[0138] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0139] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis". Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope" Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.
[0140] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.
[0141] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
[0142] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as, bridge amplification or emulsion PCR as described in further detail below.
[0143] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
[0144] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like. A flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No. 13/273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
[0145] The sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device. As defined herein, "sample" and its derivatives, is used in its broadest sense and includes any specimen, culture and the like that is suspected of including a target. In some embodiments, the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids. The sample can include any biological, clinical, surgical, agricultural, atmospheric or aquatic-based specimen containing one or more nucleic acids. The term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen. It is also envisioned that the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
[0146] The nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another embodiment, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some embodiments, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some embodiments, the sample can be an epidemiological, agricultural, forensic or pathogenic sample. In some embodiments, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another embodiment, the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus or fungus. In some embodiments, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0147] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and/or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
[0148] The components of the custom multigenome construction system 106 can include software, hardware, or both. For example, the components of the custom multigenome construction system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 114, the local device 108, or the server device(s) 110). When executed by the one or more processors, the computer-executable instructions of the custom multigenome construction system 106 can cause the computing devices to perform the bubble detection methods described herein. Alternatively, the components of the custom multigenome construction system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the custom multigenome construction system 106 can include a combination of computer-executable instructions and hardware.
[0149] Furthermore, the components of the custom multigenome construction system 106 performing the functions described herein with respect to the custom multigenome construction system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model. Thus, components of the custom multigenome construction system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the custom multigenome construction system 106 may be implemented in any application that provides sequencing services including, but not limited to Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
[0150] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium, (e.g., a memory, etc.), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
[0151] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
[0152] Non-transitory computer-readable storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
[0153] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[0154] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
[0155] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[0156] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[0157] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
[0158] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[0159] FIG. 10 illustrates a block diagram of a computing device 1000 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1000 may implement the custom multigenome construction system 106 and the sequencing device system 104. As shown by FIG. 10, the computing device 1000 can comprise a processor 1002, a memory 1004, a storage device 1006, an I/O interface 1008, and a communication interface 1010, which may be communicatively coupled by way of a communication infrastructure 1012. In certain embodiments, the computing device 1000 can include fewer or more components than those shown in FIG. 10. The following paragraphs describe components of the computing device 1000 shown in FIG. 10 in additional detail.
[0160] In one or more embodiments, the processor 1002 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1002 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1004, or the storage device 1006 and decode and execute them. The memory 1004 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1006 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
[0161] The I/O interface 1008 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1000. The I/O interface 1008 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces. The I/O interface 1008 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O interface 1008 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
[0162] The communication interface 1010 can include hardware, software, or both. In any event, the communication interface 1010 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1000 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1010 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI.
[0163] Additionally, the communication interface 1010 may facilitate communications with various types of wired or wireless networks. The communication interface 1010 may also facilitate communications using various communication protocols. The communication infrastructure 1012 may also include hardware, software, or both that couples components of the computing device 1000 to each other. For example, the communication interface 1010 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
[0164] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.
[0165] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with less or more steps/acts or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

CLAIMS We Claim:
1. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: generate simulated nucleotide reads comprising nucleotide sequences from each nucleotide assembly of a set of nucleotide assemblies; determine read alignments between the simulated nucleotide reads and respective genomic regions of a reference genome; generate variant calls relative to the reference genome for the simulated nucleotide reads according to the read alignments; generate phased variant data comprising the variant calls for each respective nucleotide assembly of the set of nucleotide assemblies; and compile the phased variant data for the set of nucleotide assemblies to generate a multi-sample phased variant dataset representing a multigenome reference.
2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify, for a given haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, a first assembly haplotype and a second assembly haplotype; generate, from the given haplotype-resolved nucleotide assembly, a first set of the simulated nucleotide reads for the first assembly haplotype and a second set of the simulated nucleotide reads for the second assembly haplotype; generate the variant calls relative to the reference genome, including variant calls for the first set of the simulated nucleotide reads and variant calls for the second set of the simulated nucleotide reads; and generate, for the given haplotype-resolved nucleotide assembly, the phased variant data comprising diploid genotype calls by combining the variant calls for the first set of the simulated nucleotide reads with the variant calls for the second set of the simulated nucleotide reads.
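For illustration only, and without limiting the claim, combining the variant calls of two assembly haplotypes into phased diploid genotype calls might be sketched as follows. The function name, the tuple layout, and the restriction to biallelic sites are assumptions made for this sketch; the `a|b` genotype strings follow the common VCF convention for phased genotypes.

```python
# Minimal sketch, assuming each haplotype's variant calls are already available as a
# set of (chrom, pos, ref, alt) tuples relative to the same reference genome.
# All names are hypothetical; multi-allelic sites are deliberately not handled.
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, 1-based position, ref allele, alt allele)


def combine_haplotype_calls(hap1: Set[Variant], hap2: Set[Variant]) -> Dict[Variant, str]:
    """Return a phased genotype string 'a|b' for every variant seen on either haplotype."""
    genotypes = {}
    for variant in sorted(hap1 | hap2):
        first = 1 if variant in hap1 else 0    # allele carried by the first assembly haplotype
        second = 1 if variant in hap2 else 0   # allele carried by the second assembly haplotype
        genotypes[variant] = f"{first}|{second}"
    return genotypes


hap1_calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}
hap2_calls = {("chr1", 250, "C", "T"), ("chr1", 400, "G", "GA")}
phased = combine_haplotype_calls(hap1_calls, hap2_calls)
for variant, genotype in phased.items():
    print(variant, genotype)
```

Because each set of calls originates from a single assembly haplotype, the order of the two alleles in each genotype string carries the phase directly in this sketch.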
3. The system of claim 2, further comprising instructions that, when executed by the at least one processor, cause the system to: identify the first assembly haplotype by identifying, for the given haplotype-resolved nucleotide assembly, a first nucleotide sequence comprising a first set of alleles in a chromosome from a first parent; and identify the second assembly haplotype by identifying, for the given haplotype-resolved nucleotide assembly, a second nucleotide sequence comprising a second set of alleles in the chromosome from a second parent.
4. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify, for a given polyploid haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, a plurality of assembly haplotypes; generate, from the given polyploid haplotype-resolved nucleotide assembly, respective sets of the simulated nucleotide reads for the plurality of assembly haplotypes; generate the variant calls relative to the reference genome, including respective variant calls for each of the respective sets of the simulated nucleotide reads; and generate, for the given polyploid haplotype-resolved nucleotide assembly, the phased variant data comprising polyploid genotype calls by compiling the variant calls for each of the respective sets of the simulated nucleotide reads.
5. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, for a given haplotype-resolved nucleotide assembly of the set of nucleotide assemblies, one or more variant calls corresponding to a haploid genomic region of the reference genome; and generate, for the given haplotype-resolved nucleotide assembly, the phased variant data comprising haploid genotype calls including the one or more variant calls corresponding to the haploid genomic region.
6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the phased variant data utilizing a read-based phasing model to determine phasing of the variant calls for each nucleotide assembly of the set of nucleotide assemblies.
7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the simulated nucleotide reads comprising nucleotide sequences by sampling assembly sequences corresponding to individual chromosomes within each nucleotide assembly of the set of nucleotide assemblies.
8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate the phased variant data by generating sample-specific variant call files comprising the phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies; and compile the phased variant data by compiling the phased variant data of the sample-specific variant call files to generate a multi-sample variant call file comprising the multi-sample phased variant dataset.
9. The system of claim 8, further comprising instructions that, when executed by the at least one processor, cause the system to generate the multi-sample variant call file comprising the multi-sample phased variant dataset by: aligning the sample-specific variant call files according to genomic coordinates within the reference genome; and inserting, within rows for respective nucleotide assemblies within the multi-sample variant call file, data placeholders representing deletions within the respective nucleotide assemblies.
10. The system of claim 8, further comprising instructions that, when executed by the at least one processor, cause the system to generate the multigenome reference by generating a reference haplotype database or a graph multigenome reference utilizing the multi-sample variant call file.
11. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the simulated nucleotide reads by sampling a plurality of overlapping nucleotide sequences in a progressive series of segments of each nucleotide assembly of the set of nucleotide assemblies.
12. The system of claim 11, further comprising instructions that, when executed by the at least one processor, cause the system to generate the simulated nucleotide reads by sampling the plurality of overlapping nucleotide sequences to simulate a target read depth of nucleotides within the simulated nucleotide reads.
13. The system of claim 12, further comprising instructions that, when executed by the at least one processor, cause the system to generate the simulated nucleotide reads with a modified read depth greater than or lesser than the target read depth at one or more regions of a given nucleotide assembly of the set of nucleotide assemblies.
14. The system of claim 11, further comprising instructions that, when executed by the at least one processor, cause the system to generate the simulated nucleotide reads by sampling the plurality of overlapping nucleotide sequences with an initial read length of up to 300,000 nucleobases.
15. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to sample the plurality of overlapping nucleotide sequences with a modified read length greater than or lesser than the initial read length at one or more regions of a given nucleotide assembly of the set of nucleotide assemblies.
16. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the read alignments between the simulated nucleotide reads and the respective genomic regions of the reference genome utilizing a haplotype-aware alignment model.
17. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the variant calls relative to the reference genome for the simulated nucleotide reads utilizing a probabilistic variant call model or a machine-learning-based variant-call model.
18. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify, within the read alignments between the simulated nucleotide reads and the respective genomic regions of the reference genome, at least one mis-mapped read based on comparing the respective genomic regions with relative coordinates of the simulated nucleotide reads within a respective nucleotide assembly of the set of nucleotide assemblies; and exclude one or more variant calls corresponding to the at least one mis-mapped read from the phased variant data for the respective nucleotide assembly.
19. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, for a given nucleotide assembly of the set of nucleotide assemblies, respective variant calls relative to the reference genome based on respective simulated nucleotide reads generated from the given nucleotide assembly; determine a whole-contig alignment between contiguous sequences of the given nucleotide assembly and corresponding genomic regions of the reference genome; generate one or more structural variant calls for the given nucleotide assembly based on the whole-contig alignment; and generate, for the given nucleotide assembly, the phased variant data comprising the respective variant calls based on the respective simulated nucleotide reads and the one or more structural variant calls based on the whole-contig alignment.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to: generate the respective variant calls by generating, utilizing a variant call model, a set of variant calls in a first maternal genotype-call data file and a set of variant calls in a first paternal genotype-call data file; generate the one or more structural variant calls by generating, utilizing an assembly-based structural variant (SV) caller, a set of structural variant calls in a second maternal genotype-call data file and a set of structural variant calls in a second paternal genotype-call data file; merge (i) the set of variant calls and the set of structural variant calls from the first maternal genotype-call data file and the second maternal genotype-call data file into a merged maternal genotype-call data file and (ii) the set of variant calls and the set of structural variant calls from the first paternal genotype-call data file and the second paternal genotype-call data file into a merged paternal genotype-call data file; and generate the phased variant data by generating the phased variant data in a phased genotype-call data file from the merged maternal genotype-call data file and the merged paternal genotype-call data file.
21. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine one or more read alignments between one or more nucleotide reads corresponding to a target genomic sample and one or more respective genomic regions of the reference genome based on comparing the one or more nucleotide reads with the multigenome reference; and generate genotype calls for a target genomic sample relative to the reference genome based on the one or more read alignments.
22. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: receive at least one indication of a user interaction selecting, from a database of nucleotide assemblies, a target population or the set of nucleotide assemblies for the target population; and based on the selection of the target population or the set of nucleotide assemblies for the target population, generate the multi-sample phased variant dataset representing the multigenome reference for the target population.
23. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: identify one or more nucleotide reads corresponding to a target genomic sample; determine one or more read alignments between the one or more nucleotide reads and one or more respective genomic regions of a linear reference genome based on comparing the one or more nucleotide reads with a multigenome reference comprising haplotype variants identified using simulated nucleotide reads sampled from a set of nucleotide assemblies; and generate genotype calls for the target genomic sample relative to the linear reference genome based on the one or more read alignments.
24. The system of claim 23, wherein the multigenome reference comprises phased variant data for each respective nucleotide assembly of the set of nucleotide assemblies.
25. The system of claim 23, wherein the set of nucleotide assemblies comprises a plurality of haploid nucleotide assemblies of (i) a plurality of haploid genomes of one or more organisms or (ii) haploid regions of a plurality of diploid genomes or polyploid genomes of one or more organisms.
26. The system of claim 23, further comprising instructions that, when executed by the at least one processor, cause the system to generate genotype calls by generating a genotype call identifying a genotype for the target genomic sample at a genomic coordinate relative to a known genomic sample represented by a segment of phased variant data within the multigenome reference.
27. The system of claim 23, further comprising instructions that, when executed by the at least one processor, cause the system to: receive, for a target population corresponding to the target genomic sample, the set of nucleotide assemblies corresponding to a set of known genomic samples; and generate a multi-sample phased variant dataset representing the multigenome reference for the target population.
28. The system of claim 27, further comprising instructions that, when executed by the at least one processor, cause the system to: receive at least one indication of a user interaction selecting, from a database of nucleotide assemblies, the target population or the set of nucleotide assemblies for the target population; and based on the selection of the target population or the set of nucleotide assemblies for the target population, generate the multi-sample phased variant dataset representing the multigenome reference for the target population.
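As an illustration only, the read-simulation step of claims 11 and 12 and the haplotype-combination step of claim 2 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `tile_reads` and `combine_haplotype_calls`, the fixed step size of `read_length // target_depth`, and the `(chrom, pos, ref, alt)` tuple representation of a variant call are all hypothetical, and multi-allelic sites are not merged into a single record.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation of one haploid variant call: (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


def tile_reads(sequence: str, read_length: int, target_depth: int) -> List[Tuple[int, str]]:
    """Sample overlapping reads in a progressive series of segments.

    Assumption: a fixed step of read_length // target_depth, so interior
    positions are covered by roughly target_depth reads. Each read keeps its
    start coordinate within the assembly sequence.
    """
    if read_length <= 0 or target_depth <= 0:
        raise ValueError("read_length and target_depth must be positive")
    step = max(1, read_length // target_depth)
    last_start = max(0, len(sequence) - read_length)
    return [
        (start, sequence[start:start + read_length])
        for start in range(0, last_start + 1, step)
    ]


def combine_haplotype_calls(hap1: Set[Variant], hap2: Set[Variant]) -> Dict[Variant, str]:
    """Combine two sets of haploid calls into phased diploid genotypes.

    The genotype string uses the '|' separator for phased alleles:
    the left allele comes from the first haplotype, the right from the second.
    """
    genotypes: Dict[Variant, str] = {}
    for variant in sorted(hap1 | hap2):
        genotypes[variant] = f"{int(variant in hap1)}|{int(variant in hap2)}"
    return genotypes


if __name__ == "__main__":
    for start, read in tile_reads("ACGTACGTAC", read_length=4, target_depth=2):
        print(start, read)
    first = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
    second = {("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
    for variant, genotype in combine_haplotype_calls(first, second).items():
        print(variant, genotype)
```

Retaining each read's start coordinate within the assembly is a design choice that would also allow a later comparison against the aligned reference position, in the spirit of the mis-mapped read check in claim 18.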
PCT/US2025/012463 2024-01-26 2025-01-21 Custom multigenome reference construction for improved sequencing analysis of genomic samples Pending WO2025160089A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463625580P 2024-01-26 2024-01-26
US63/625,580 2024-01-26

Publications (1)

Publication Number Publication Date
WO2025160089A1 true WO2025160089A1 (en) 2025-07-31

Family

ID=94688015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/012463 Pending WO2025160089A1 (en) 2024-01-26 2025-01-21 Custom multigenome reference construction for improved sequencing analysis of genomic samples

Country Status (1)

Country Link
WO (1) WO2025160089A1 (en)

Patent Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1991006678A1 (en) 1989-10-26 1991-05-16 Sri International Dna sequencing
US6172218B1 (en) 1994-10-13 2001-01-09 Lynx Therapeutics, Inc. Oligonucleotide tags for sorting and identification
US6306597B1 (en) 1995-04-17 2001-10-23 Lynx Therapeutics, Inc. DNA sequencing by parallel oligonucleotide extensions
US6210891B1 (en) 1996-09-27 2001-04-03 Pyrosequencing Ab Method of sequencing DNA
US6258568B1 (en) 1996-12-23 2001-07-10 Pyrosequencing Ab Method of sequencing DNA based on the detection of the release of pyrophosphate and enzymatic nucleotide degradation
US20050100900A1 (en) 1997-04-01 2005-05-12 Manteia Sa Method of nucleic acid amplification
US6969488B2 (en) 1998-05-22 2005-11-29 Solexa, Inc. System and apparatus for sequential processing of analytes
US6274320B1 (en) 1999-09-16 2001-08-14 Curagen Corporation Method of sequencing a nucleic acid
US7001792B2 (en) 2000-04-24 2006-02-21 Eagle Research & Development, Llc Ultra-fast nucleic acid sequencing device and a method for making and using the same
US7329492B2 (en) 2000-07-07 2008-02-12 Visigen Biotechnologies, Inc. Methods for real-time single molecule sequence determination
US7211414B2 (en) 2000-12-01 2007-05-01 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
US7427673B2 (en) 2001-12-04 2008-09-23 Illumina Cambridge Limited Labelled nucleotides
US20060188901A1 (en) 2001-12-04 2006-08-24 Solexa Limited Labelled nucleotides
WO2004018497A2 (en) 2002-08-23 2004-03-04 Solexa Limited Modified nucleotides for polynucleotide sequencing
US20070166705A1 (en) 2002-08-23 2007-07-19 John Milton Modified nucleotides
US20060240439A1 (en) 2003-09-11 2006-10-26 Smith Geoffrey P Modified polymerases for improved incorporation of nucleotide analogues
WO2005065814A1 (en) 2004-01-07 2005-07-21 Solexa Limited Modified molecular arrays
US7315019B2 (en) 2004-09-17 2008-01-01 Pacific Biosciences Of California, Inc. Arrays of optical confinements and uses thereof
WO2006064199A1 (en) 2004-12-13 2006-06-22 Solexa Limited Improved method of nucleotide detection
US20060281109A1 (en) 2005-05-10 2006-12-14 Barr Ost Tobias W Polymerases
WO2007010251A2 (en) 2005-07-20 2007-01-25 Solexa Limited Preparation of templates for nucleic acid sequencing
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
WO2007123744A2 (en) 2006-03-31 2007-11-01 Solexa, Inc. Systems and devices for sequence by synthesis analysis
US20100111768A1 (en) 2006-03-31 2010-05-06 Solexa, Inc. Systems and devices for sequence by synthesis analysis
US20080108082A1 (en) 2006-10-23 2008-05-08 Pacific Biosciences Of California, Inc. Polymerase enzymes and reagents for enhanced nucleic acid sequencing
US20090127589A1 (en) 2006-12-14 2009-05-21 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes using large scale FET arrays
US20090026082A1 (en) 2006-12-14 2009-01-29 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes using large scale FET arrays
US20100282617A1 (en) 2006-12-14 2010-11-11 Ion Torrent Systems Incorporated Methods and apparatus for detecting molecular interactions using fet arrays
US20100137143A1 (en) 2008-10-22 2010-06-03 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes
US20120270305A1 (en) 2011-01-10 2012-10-25 Illumina Inc. Systems, methods, and apparatuses to image a sample for biological or chemical analysis
US20130079232A1 (en) 2011-09-23 2013-03-28 Illumina, Inc. Methods and compositions for nucleic acid sequencing
US20130260372A1 (en) 2012-04-03 2013-10-03 Illumina, Inc. Integrated optoelectronic read head and fluidic cartridge useful for nucleic acid sequencing
US20220301655A1 (en) * 2021-03-17 2022-09-22 Seven Bridges Genomics Inc. Systems and methods for generating graph references

Non-Patent Citations (16)

* Cited by examiner, † Cited by third party
Title
COCKROFT, S. L.; CHU, J.; AMORIN, M.; GHADIRI, M. R.: "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution.", J. AM. CHEM. SOC., vol. 130, 2008, pages 818 - 820, XP055097434, DOI: 10.1021/ja077082c
DEAMER, D. W.; AKESON, M.: "Nanopores and nucleic acids: prospects for ultrarapid sequencing.", TRENDS BIOTECHNOL, vol. 18, 2000, pages 147 - 151, XP004194002, DOI: 10.1016/S0167-7799(00)01426-8
DEAMER, D.; BRANTON, D.: "Characterization of nucleic acids by nanopore analysis", ACC. CHEM. RES, vol. 35, 2002, pages 817 - 825, XP002226144, DOI: 10.1021/ar000138m
HEALY, K: "Nanopore-based single-molecule DNA analysis.", NANOMED, vol. 2, 2007, pages 459 - 481, XP009111262, DOI: 10.2217/17435889.2.4.459
KORLACH, J ET AL.: "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nano structures", PROC. NATL. ACAD. SCI. USA, vol. 105, 2008, pages 1176 - 1181
LEVENE, M. J ET AL.: "Zero-mode waveguides for single-molecule analysis at high concentrations.", SCIENCE, vol. 299, 2003, pages 682 - 686, XP002341055, DOI: 10.1126/science.1079700
LI, J.; GERSHOW, M.; STEIN, D.; BRANDIN, E.; GOLOVCHENKO, J. A.: "DNA molecules and configurations in a solid-state nanopore microscope", NAT. MATER, vol. 2, 2003, pages 611 - 615, XP009039572, DOI: 10.1038/nmat965
LIAO WEN-WEI ET AL: "A draft human pangenome reference", NATURE, vol. 617, no. 7960, 10 May 2023 (2023-05-10), London, pages 312 - 324, XP093271640, ISSN: 0028-0836, DOI: 10.1038/s41586-023-05896-x *
LUNDQUIST, P. M ET AL.: "Parallel confocal detection of single molecules in real time.", OPT. LETT, vol. 33, 2008, pages 1026 - 1028, XP001522593, DOI: 10.1364/OL.33.001026
MARSCHALL TOBIAS: "Computational pan-genomics: status, promises and challenges", BRIEFINGS IN BIOINFORMATICS, 21 October 2016 (2016-10-21), GB, pages bbw089, XP093271638, ISSN: 1467-5463, DOI: 10.1093/bib/bbw089 *
METZKER, GENOME RES, vol. 15, 2005, pages 1767 - 1776
RONAGHI, M.; KARAMOHAMED, S.; PETTERSSON, B.; UHLEN, M.; NYREN, P.: "Real-time DNA sequencing using detection of pyrophosphate release.", ANALYTICAL BIOCHEMISTRY, vol. 242, 1996, pages 84 - 9, XP002388725, DOI: 10.1006/abio.1996.0432
RONAGHI, M.; UHLEN, M.; NYREN, P.: "A sequencing method based on real-time pyrophosphate.", SCIENCE, vol. 281, no. 5375, 1998, pages 363, XP002135869, DOI: 10.1126/science.281.5375.363
RONAGHI, M: "Pyrosequencing sheds light on DNA sequencing.", GENOME RES, vol. 11, no. 1, 2001, pages 3 - 11, XP000980886, DOI: 10.1101/gr.11.1.3
RUPAREL ET AL., PROC NATL ACAD SCI USA, vol. 102, 2005, pages 5932 - 7
SONI, G. V.; MELLER, A.: "Progress toward ultrafast DNA sequencing using solid-state nanopores.", CLIN. CHEM, vol. 53, 2007, pages 1996 - 2001, XP055076185, DOI: 10.1373/clinchem.2007.091231

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25706895

Country of ref document: EP

Kind code of ref document: A1