
WO2025250996A2 - Call generation and recalibration models for implementing personalized diploid reference haplotypes in genotype calling - Google Patents

Call generation and recalibration models for implementing personalized diploid reference haplotypes in genotype calling

Info

Publication number
WO2025250996A2
Authority
WO
WIPO (PCT)
Prior art keywords
genotype
personalized
call
sample
diploid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/031740
Other languages
French (fr)
Inventor
Konrad SCHEFFLER
Gavin Parnaby
Faezeh RAHBAR
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Illumina Inc
Original Assignee
Illumina Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Illumina Inc
Publication of WO2025250996A2


Abstract

This disclosure describes methods, non-transitory computer readable media, and systems that can utilize personalized reference haplotypes to generate genotype calls in relation to a reference genome. For instance, the disclosed systems can generate, utilizing a call-generation model, genotype probabilities of candidate genotypes in relation to a personalized subset of diploid reference haplotypes for a sample at a genomic coordinate and, based on the genotype probabilities, generate a genotype call for the sample at the genomic coordinate. Furthermore, the disclosed systems can determine, for a set of nucleotide reads at a genomic coordinate of a sample, personalized sequencing metrics in relation to a personalized subset of diploid reference haplotypes. Based on the personalized sequencing metrics, the disclosed systems can generate a genotype call for the sample at the genomic coordinate—in some cases by confirming, recalibrating, or modifying the candidate variant call generated by a call-generation model.

Description

CALL GENERATION AND RECALIBRATION MODELS FOR IMPLEMENTING PERSONALIZED DIPLOID REFERENCE HAPLOTYPES IN GENOTYPE CALLING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/654,806, entitled, “CALL GENERATION AND RECALIBRATION MODELS FOR IMPLEMENTING PERSONALIZED DIPLOID REFERENCE HAPLOTYPES IN GENOTYPE CALLING,” filed on May 31, 2024 (IP-2766-PRV), which is incorporated herein by reference in its entirety.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining variant calls for samples. For instance, some existing nucleobase sequencing platforms determine individual nucleobases within sequences from samples by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods. When using SBS, existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset. For instance, a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls. After capturing such images, existing SBS platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a sample or other nucleic acid polymer. Based on differences between the aligned nucleotide reads and a reference genome, existing data analysis software can further utilize a variant caller to identify genotypes and/or variants within a sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), or structural variants in relation to the reference genome.
[0003] Despite these recent advances in sequencing and variant calling, existing nucleotide sequencing platforms and sequencing data analysis software (together and hereinafter, “existing sequencing systems”) often implement models for determining variant calls that require considerable computing resources or execute genotype callers that inaccurately determine nucleobase calls and/or corresponding variant calls. For instance, some existing sequencing systems inefficiently expend considerable computing resources executing overly complex models — often requiring considerable computer processing runtime — to accurately determine base calls or variant calls. To illustrate, some existing sequencing systems utilize genotype callers with a deep learning architecture or some other neural network architecture that require extensive computation resources (e.g., computing time, processing power, and memory) to train and implement. For example, some existing sequencing systems utilize deep learning architectures that, even after training, take many hours across multiple computing devices to generate genotype calls for a single sample sequence.
[0004] In contrast to some deep learning architectures, some existing genotype callers have been developed that determine variant calls with increased accuracy by processing various features for both nucleotide reads and a reference genome. But many such genotype callers rely exclusively on features related to either (i) a linear reference genome based on a limited set of haplotypes (e.g., haplotypes from approximately 11 people) that fails to adequately represent certain populations or (ii) a graph reference genome that better accounts for some populations’ genetics but obfuscates sequencing analysis with an exorbitant number of alternate paths representing an extensive database of population haplotypes. Because such alternate paths can be similar to, but not match, many sample genomes’ alleles, generic graph genomes frequently cause existing sequencing systems to misalign reads or miscall variants for a large number of samples. Consequently, by relying on such generic reference genomes for model inputs and features, many existing sequencing systems generate variant calls that include excessive numbers of false positive calls and/or false negative calls that could otherwise be reduced with a more sample-specific approach. In addition to such shortcomings in accuracy, existing graph reference genomes are often bulky and consume considerable memory and computing resources in addition to the aforementioned model resources required by many existing sequencing systems.
[0005] Moreover, in addition to the foregoing shortcomings, existing sequencing systems often suffer from inaccuracies resulting from an incidental bias towards the reference genome utilized for alignment and genotype calling. Many existing sequencing systems, for example, utilize generic prior probabilities estimated in relation to a particular reference genome when determining genotype calls via a probabilistic approach. Relying on such generic or flat prior probabilities, however, often favors alleles or haplotypes found more generally in a population (e.g., as reflected in the corresponding reference genome) but ignores alleles or haplotypes that may be more relevant to a particular sample. Consequently, existing sequencing systems that utilize generic or flat prior probabilities can increase the likelihood of generating false negative or false positive variant calls because such generic or flat prior probabilities reflect a bias towards haplotypes of the reference genome.
[0006] In an attempt to compensate for inaccuracies in alignment and/or genotype calling for difficult-to-call regions, some existing sequencing systems perform additional sequencing cycles during a sequencing run to produce more nucleotide-read data and avoid under-sequencing of some samples. Such additional sequencing cycles require additional computing time, memory, and reagents on a sequencing device and require processing multiple base-call data files (e.g., base call files or FASTQ files) for a single sample to support genotype calling. As a result of performing additional sequencing cycles within a sequencing run, existing sequencing systems can over-sequence samples. While the addition of sequencing cycles results in fewer under-sequenced samples, existing sequencing systems cannot wholly eliminate under-sequenced samples. Thus, in addition to devoting excessive computing time and memory to over-sequencing samples, some existing sequencing systems must also expend additional computing time to perform one or more additional sequencing runs for the under-sequenced samples of a previous sequencing run.
[0007] These, along with additional problems and issues, exist in existing sequencing systems.
SUMMARY
[0008] This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that can utilize personalized subsets of reference haplotypes to generate genotype calls in relation to a linear reference genome. For example, the disclosed systems can receive or otherwise identify a personalized subset of diploid reference haplotypes (e.g., from a population haplotype database) at a given genomic coordinate for a sample and leverage such haplotypes in genotype calling utilizing a call-generation model (e.g., a variant-call model) and/or a call-recalibration machine-learning model to analyze nucleotide reads corresponding to the given genomic coordinate. By implementing personalized diploid reference haplotypes in one or more genotype calling models, the disclosed systems generate genotype calls with respect to a linear reference genome with increased efficiency and accuracy relative to existing sequencing systems.
[0009] To illustrate, the disclosed systems can identify nucleotide reads for a given sample that align at a given genomic coordinate and determine, based on such nucleotide reads, a personalized subset of diploid reference haplotypes at the given genomic coordinate for the given sample. Based on the personalized subset of diploid reference haplotypes, the disclosed systems can utilize a call-generation model to generate genotype probabilities of candidate genotypes at the genomic coordinate of the given sample and determine a genotype call for the given sample at the genomic coordinate.
[0010] In combination with or independent of the foregoing call-generation model, the disclosed systems can identify nucleotide reads for a given sample that align at a given genomic coordinate and determine personalized sequencing metrics for the nucleotide reads in relation to a personalized subset of diploid reference haplotypes at the given genomic coordinate for the given sample. Based on the personalized sequencing metrics, the disclosed systems can utilize a call-recalibration machine-learning model to generate one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate and determine a genotype call for the given sample at the genomic coordinate in part from the one or more personalized genotype call classifications. Accordingly, the disclosed systems can determine a genotype call for the given sample at the genomic coordinate based on either or both of the personalized genotype probabilities and the personalized genotype call classifications derived from a personalized subset of diploid reference haplotypes for a given sample.
[0011] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The detailed description refers to the drawings briefly described below.
[0013] FIG. 1 illustrates an environment in which a personalized sequencing system can operate in accordance with one or more embodiments of the present disclosure.
[0014] FIG. 2 illustrates an overview of the personalized sequencing system utilizing a personalized call-generation model and a personalized call-recalibration machine-learning model to generate genotype calls in accordance with one or more embodiments of the present disclosure.
[0015] FIG. 3 illustrates the personalized sequencing system identifying nucleotide reads corresponding to genomic coordinates of a sample in accordance with one or more embodiments of the present disclosure.
[0016] FIG. 4 illustrates the personalized sequencing system generating genotype probabilities utilizing a personalized call-generation model in accordance with one or more embodiments of the present disclosure.
[0017] FIG. 5 illustrates phylogenetic diagrams representing various implementations for determining genotype probabilities in relation to a personalized subset of diploid reference haplotypes in accordance with one or more embodiments of the present disclosure.
[0018] FIG. 6 illustrates the personalized sequencing system generating personalized genotype call classifications utilizing a personalized call-recalibration machine-learning model in accordance with one or more embodiments of the present disclosure.
[0019] FIG. 7 illustrates an example process for the personalized sequencing system training a personalized call-recalibration machine-learning model in accordance with one or more embodiments of the present disclosure.
[0020] FIG. 8A illustrates the personalized sequencing system generating genotype calls utilizing personalized sequencing metrics determined in relation to personalized diploid reference haplotypes in accordance with one or more embodiments of the present disclosure.
[0021] FIG. 8B illustrates the personalized sequencing system filtering candidate variant calls in accordance with one or more embodiments of the present disclosure.
[0022] FIG. 9 illustrates a graph of experimental results of utilizing the personalized sequencing system to determine genotype calls for a sample in accordance with one or more embodiments of the present disclosure.
[0023] FIG. 10 illustrates a flowchart of a series of acts for determining a genotype call utilizing a call-recalibration machine-learning model in accordance with one or more embodiments of the present disclosure.
[0024] FIG. 11 illustrates a flowchart of a series of acts for determining a genotype call utilizing a call-generation model in accordance with one or more embodiments of the present disclosure.
[0025] FIG. 12 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
DETAILED DESCRIPTION
[0026] This disclosure describes embodiments of a personalized sequencing system that uses various models to implement personalized subsets of reference haplotypes to generate, recalibrate, or confirm genotype calls (e.g., variant calls) for a sample. In particular, the personalized sequencing system can receive or identify a personalized subset of diploid reference haplotypes for a given genomic coordinate of a sample and leverage such haplotypes to increase sensitivity in genotype calling with adjustments to inputs and features of a call-generation model configured to generate genotype probabilities for candidate variants and/or a call-recalibration machine-learning model configured to recalibrate or confirm genotype calls based on one or more personalized sequencing metrics. By utilizing a personalized subset of diploid reference haplotypes to determine one or both of (i) the aforementioned genotype probabilities using a call-generation model and (ii) the personalized sequencing metrics processed by a call-recalibration machine-learning model according to the embodiments described herein, the personalized sequencing system can generate genotype calls with respect to a linear reference genome with improved accuracy and efficiency relative to existing sequencing systems.
[0027] In some embodiments, for example, the personalized sequencing system identifies a set of nucleotide reads of a sample corresponding to a genomic coordinate according to a respective set of read alignments and, based on the set of nucleotide reads, determines a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate. In one or more implementations, the personalized subset of diploid reference haplotypes comprises a first population haplotype and a second population haplotype selected for the genomic coordinate of the sample from a database of population haplotypes. Having determined, received, or otherwise identified the personalized subset of diploid reference haplotypes, the personalized sequencing system utilizes a call-generation model to generate, based on the personalized subset, genotype probabilities of candidate genotypes for the sample at the genomic coordinate. In certain embodiments, for example, the personalized sequencing system determines prior genotype probabilities for the candidate genotypes in relation to the personalized subset of diploid reference haplotypes and utilizes the call-generation model to generate posterior genotype probabilities of the candidate genotypes based on the prior genotype probabilities. Based on the genotype probabilities generated by the call-generation model, the personalized sequencing system can determine a genotype call for the sample at the genomic coordinate.
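The prior-to-posterior step described in this paragraph can be sketched as a simple Bayesian update. The following Python sketch is illustrative only and is not the call-generation model of the disclosure: the function names, the `match_weight` prior mass, the single-coordinate biallelic setup, and the uniform per-base error model are all assumptions introduced here.

```python
import math
from itertools import combinations_with_replacement


def personalized_priors(candidate_alleles, haplotype_pair, match_weight=0.9):
    """Hypothetical prior: most mass on the genotype implied by the haplotype pair."""
    genotypes = list(combinations_with_replacement(sorted(candidate_alleles), 2))
    favored = tuple(sorted(haplotype_pair))
    n_other = max(sum(g != favored for g in genotypes), 1)
    weights = {
        g: match_weight if g == favored else (1.0 - match_weight) / n_other
        for g in genotypes
    }
    total = sum(weights.values())  # normalize in case the favored genotype is absent
    return {g: w / total for g, w in weights.items()}


def read_log_likelihood(bases, genotype, error_rate=0.01):
    """log P(observed bases | diploid genotype); each read comes from either allele."""
    def p_base(base, allele):
        return 1.0 - error_rate if base == allele else error_rate / 3.0
    return sum(
        math.log(0.5 * p_base(b, genotype[0]) + 0.5 * p_base(b, genotype[1]))
        for b in bases
    )


def posterior_genotypes(bases, priors):
    """Combine priors and read likelihoods in log space, then normalize."""
    log_post = {g: math.log(p) + read_log_likelihood(bases, g) for g, p in priors.items()}
    peak = max(log_post.values())
    unnorm = {g: math.exp(v - peak) for g, v in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


# Bases observed at one genomic coordinate (assumed already aligned).
bases = ["A", "G", "A", "G", "G", "A"]
priors = personalized_priors({"A", "G"}, ("A", "G"))
posterior = posterior_genotypes(bases, priors)
genotype_call = max(posterior, key=posterior.get)
```

In this sketch, only the prior changes when a personalized haplotype pair is supplied; the likelihood computation is untouched.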
[0028] In combination with or independent of the foregoing call-generation model, in some embodiments, the personalized sequencing system utilizes a call-recalibration machine-learning model to determine, modify, or confirm one or more genotype calls for a sample. For instance, having identified a set of nucleotide reads corresponding to a genomic coordinate of a sample, the personalized sequencing system can determine personalized sequencing metrics for the set of nucleotide reads in relation to a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate. Such personalized sequencing metrics may include, but are not limited to, values for a genotype metric (e.g., GT field), genotype-probability metric (e.g., GP field), genomic coordinate (e.g., POS field), inferred allele (e.g., REF or ALT field), variant-call-quality metric (e.g., QUAL field), or inferred genotype-quality metric (e.g., GQ field) related to the haplotypes of the personalized diploid reference. Using the call-recalibration machine-learning model to process the personalized sequencing metrics, the personalized sequencing system can generate one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate and determine a genotype call for the sample at the genomic coordinate based on the personalized genotype call classifications. As described further below, in some cases, the personalized sequencing system trains the call-recalibration machine-learning model to process such personalized sequencing metrics and/or non-personalized sequencing metrics.
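As a rough illustration of how VCF-style fields might be combined with a personalized haplotype pair into model inputs, consider the following Python sketch. The `CandidateCall` class, the `personalized_metrics` function, the feature names, and the biallelic assumption (GP ordered as 0/0, 0/1, 1/1) are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class CandidateCall:
    """Hypothetical container for VCF-style fields of one biallelic candidate call."""
    pos: int       # POS field
    ref: str       # REF field
    alt: str       # ALT field
    gt: tuple      # GT field as allele indices, e.g. (0, 1)
    gp: tuple      # GP field: probabilities for 0/0, 0/1, 1/1
    qual: float    # QUAL field
    gq: float      # GQ field


def personalized_metrics(call, haplotype_alleles):
    """Derive illustrative features relating a call to the personalized haplotype pair.

    haplotype_alleles: the two alleles carried by the personalized diploid
    reference haplotypes at call.pos (an assumed input).
    """
    called_alleles = sorted(call.alt if i == 1 else call.ref for i in call.gt)
    hap_alt_count = sum(a == call.alt for a in haplotype_alleles)
    return {
        "pos": call.pos,
        "qual": call.qual,
        "gq": call.gq,
        # Probability the caller assigned to the genotype implied by the haplotypes.
        "gp_of_haplotype_genotype": call.gp[hap_alt_count],
        "haplotype_alt_count": hap_alt_count,
        "matches_personalized": called_alleles == sorted(haplotype_alleles),
    }


call = CandidateCall(pos=1042, ref="A", alt="G", gt=(0, 1),
                     gp=(0.01, 0.98, 0.01), qual=48.0, gq=40.0)
features = personalized_metrics(call, ("A", "G"))
```

A feature dictionary like this could then be passed to any tabular classifier; the choice of classifier is left open here.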
[0029] In addition to training on non-personalized sequencing metrics, the personalized sequencing system can train a personalized call-recalibration machine-learning model to recalibrate genotype calls based on personalized sequencing metrics derived from personalized diploid reference haplotypes. In one or more embodiments, to prevent overfitting and increase data diversity during training, the personalized sequencing system excludes population haplotypes corresponding to a training sample from a personalized haplotype database when generating personalized diploid reference haplotypes for use in training the personalized call-recalibration machine-learning model.
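The exclusion step described in this paragraph amounts to a leave-one-out filter over the haplotype database. A minimal Python sketch follows, assuming a hypothetical in-memory database that maps a haplotype identifier to a (donor sample identifier, allele sequence) pair; the data layout and names are illustrative.

```python
def haplotypes_for_training(database, training_sample_id):
    """Return the haplotypes that did not originate from the training sample.

    database: dict mapping haplotype id -> (donor sample id, allele sequence).
    This layout is an assumption made for illustration.
    """
    return {
        hap_id: sequence
        for hap_id, (donor_id, sequence) in database.items()
        if donor_id != training_sample_id
    }


population_db = {
    "hap1": ("sampleA", "ACGT"),
    "hap2": ("sampleA", "ACGA"),
    "hap3": ("sampleB", "ACGT"),
}
# When training on sampleA, its own haplotypes are withheld from selection.
usable = haplotypes_for_training(population_db, "sampleA")
```

Withholding the sample's own haplotypes keeps the training-time personalized reference from matching the truth set trivially.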
[0030] As indicated above, the personalized sequencing system can use (a) genotype probabilities from a call-generation model or a personalized call-generation model or (b) the personalized sequencing metrics processed by a personalized call-recalibration machine-learning model in combination or separately for determining genotype calls at diploid or haploid genomic coordinates. For instance, in some cases, if a call-generation model (e.g., variant calling model) generates genotype probabilities or other outputs indicating a genotype call different than the subset of personalized diploid reference haplotypes for a sample at a given genomic coordinate, the call-recalibration machine-learning model neither processes the personalized sequencing metrics nor generates personalized genotype call classifications for the given genomic coordinate. When a personalized version of the call-recalibration machine-learning model neither processes personalized inputs nor generates personalized outputs, however, the personalized sequencing system can still report reference and/or variant calls in a variant call file (VCF) based on the posterior genotype probabilities from non-personalized versions of a call-generation model and/or a call-recalibration machine-learning model.
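The gating behavior in this paragraph can be sketched as a small routing function: the personalized recalibration path is used only when the candidate genotype agrees with the personalized haplotype pair. The function name and return labels below are hypothetical.

```python
def route_call(candidate_alleles, personalized_alleles):
    """Choose which recalibration path handles a candidate genotype call.

    candidate_alleles: the two alleles of the call from the call-generation model.
    personalized_alleles: the two alleles of the personalized haplotype pair,
    or None when no personalized haplotypes are available at the coordinate.
    """
    if personalized_alleles is not None and sorted(candidate_alleles) == sorted(
        personalized_alleles
    ):
        return "personalized"
    # Mismatch or no personalized data: fall back to the non-personalized path.
    return "non_personalized"


agree = route_call(("A", "G"), ("G", "A"))
disagree = route_call(("A", "A"), ("A", "G"))
missing = route_call(("A", "A"), None)
```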
[0031] In one or both embodiments of implementing a personalized version of a call-generation model or a personalized version of a call-recalibration machine-learning model, the personalized sequencing system determines, utilizing a call-generation model and/or a call-recalibration machine-learning model, a genotype call for a sample at a haploid genomic coordinate based on at least one personalized reference haplotype selected for the sample at the haploid genomic coordinate. For instance, the personalized sequencing system can identify a set of nucleotide reads corresponding to a haploid genomic coordinate and determine either or both (i) genotype probabilities of haploid genotypes or (ii) personalized sequencing metrics for the set of nucleotide reads in relation to a personalized reference haplotype for the sample at the haploid genomic coordinate. Based on the genotype probabilities and/or the personalized sequencing metrics, respectively, the personalized sequencing system can utilize the call-generation model and/or the call-recalibration machine-learning model to determine a genotype call for the sample at the haploid genomic coordinate. Indeed, embodiments of the personalized sequencing system can implement personalized subsets of reference haplotypes at haploid and diploid genomic coordinates of a sample to generate, modify, or confirm genotype calls as further described below.
[0032] As described below and depicted in the corresponding figures, in certain embodiments, the personalized sequencing system can also leverage or selectively use personalized subsets of reference haplotypes to: (i) reprocess genotype calls indicated as reference genotypes (i.e., 0/0) by the personalized call-recalibration machine-learning model using a non-personalized call-recalibration machine-learning model to prevent false negative variant calls, (ii) implement personalized sequencing metrics derived from (or modified by) personalized diploid reference haplotypes to increase variant-calling accuracy at indel positions of a sample, and/or (iii) utilize information from a population haplotype database to provide joint information on variants on the same haplotypes and jointly generate genotype calls.
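Item (i) above can be sketched as a fallback wrapper around two recalibration models. In the Python sketch below, both models are represented as plain callables that map a metrics dictionary to a genotype string; these stand-ins, and the `alt_fraction` feature, are assumptions and do not reflect the trained models of the disclosure.

```python
def recalibrate_with_fallback(metrics, personalized_model, generic_model):
    """Re-run reference (0/0) outcomes through the non-personalized model.

    Both models are hypothetical callables: metrics dict -> genotype string.
    """
    genotype = personalized_model(metrics)
    if genotype == "0/0":
        # A personalized 0/0 may be a false negative; get a second opinion.
        return generic_model(metrics)
    return genotype


# Stand-in models for illustration only.
def personalized_stub(metrics):
    return "0/1" if metrics["matches_personalized"] else "0/0"


def generic_stub(metrics):
    return "0/1" if metrics["alt_fraction"] > 0.2 else "0/0"


rescued = recalibrate_with_fallback(
    {"matches_personalized": False, "alt_fraction": 0.45},
    personalized_stub,
    generic_stub,
)
```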
[0033] As suggested above, the personalized sequencing system provides several technical advantages, benefits, and/or improvements over existing sequencing systems, including variant callers and other sequencing data analysis software. In some embodiments, for instance, the personalized sequencing system increases the flexibility with which a sequencing system can determine, modify, or confirm genotype calls at genomic coordinates for which personalized subsets of reference haplotypes (e.g., diploid reference haplotypes) are provided or otherwise identified. As indicated above, many existing variant caller models, for instance, rely on inputs (e.g., initial estimates or model features) related specifically to either (i) a linear reference genome comprising an individual contiguous sequence derived from one or more representative samples or (ii) a generic graph reference genome augmented by an unfiltered or unnecessarily large set of population haplotypes (e.g., 15-20 alternate haplotypes for each genomic region of a linear reference genome). In combination with either a linear or graph reference genome, some such existing sequencing systems utilize flat or generic prior genotype probabilities derived from generic alleles or haplotypes that may differ in genotype at target genomic regions from a particular sample of interest.
[0034] In contrast to such existing systems that utilize generic or impersonalized reference genomes and flat prior genotype probabilities, and that lack data accounting for the sample’s likely inherited haplotypes in a given genomic region, the personalized sequencing system can flexibly utilize personalized subsets of reference haplotypes to accurately determine, modify, or confirm genotype calls for the particular sample. Such personalized subsets of reference haplotypes include, but are not limited to, pairs of diploid reference haplotypes selected for a particular sample from a population haplotype database. For instance, the personalized sequencing system can utilize a call-generation model to generate genotype probabilities for candidate genotypes at a genomic coordinate of a sample based on a personalized subset of diploid reference haplotypes selected for the sample. In some implementations, for example, the personalized sequencing system generates the genotype probabilities by determining prior genotype probabilities of the candidate genotypes in relation to the personalized subset of diploid reference haplotypes and utilizes the call-generation model to generate posterior genotype probabilities based on the prior genotype probabilities. Independently or in combination with such a call-generation model, the personalized sequencing system can determine personalized sequencing metrics in relation to a personalized subset of diploid reference haplotypes and utilize a call-recalibration machine-learning model to generate personalized genotype call classifications based on the personalized sequencing metrics. Rather than processing a sample’s nucleotide-read data based on generic reference genomes or generalized prior genotype probabilities, therefore, the personalized sequencing system can flexibly and intelligently adapt its model for genotype calling based on personalized diploid reference haplotypes for a given sample.
[0035] Indeed, as disclosed herein, the personalized sequencing system is uniquely configured to expressly consider personalized subsets of reference haplotypes (e.g., population haplotypes selected for a particular sample) at genomic coordinates where such personalized information is applicable and attainable. For example, in certain embodiments, the personalized sequencing system flexibly determines and utilizes personalized sequencing metrics in determining genotype calls for a sample where a candidate variant call with respect to a linear reference genome (e.g., a variant call initially generated by a call-generation model) matches the alleles indicated by a respective personalized subset of diploid reference haplotypes, as described in additional detail below (e.g., in relation to FIG. 8B).
[0036] In addition to improved flexibility in determining genotype calls based on personalized haplotype information, the personalized sequencing system provides improved genotype-calling accuracy over existing sequencing systems. By utilizing either or both a call-generation model or a call-recalibration machine-learning model to generate personalized genotype probabilities or personalized genotype call classifications, respectively, the personalized sequencing system can determine more accurate genotype calls (e.g., variant calls) with a higher confidence that such calls match or differ from the reference allele of a reference genome compared to existing sequencing systems. Such personalized and more accurate genotype calls go beyond mere probabilities and leverage personalized reference haplotypes for given genomic coordinates or regions to more accurately detect the physical nucleobases present at given genomic coordinates or regions of a genome for a particular sample isolated for genetic sequencing. Furthermore, in certain embodiments, the personalized sequencing system can significantly reduce the number of false negative genotype calls by comparing results of both a personalized version of a call-recalibration machine-learning model and a call-recalibration machine-learning model trained without personalization at certain genomic coordinates, as described in detail below (e.g., in relation to FIG. 8B). To further illustrate, this disclosure describes and depicts results of such a personalized sequencing system generating improved genotype and/or variant calls below in relation to FIG. 9.
[0037] Beyond improved flexibility and accuracy, the personalized sequencing system improves the computing efficiency with which a sequencing system identifies genotypes (e.g., variants) within genomic sequences. As noted above, some existing sequencing systems utilize computationally expensive, slow neural network architectures (e.g., deep learning architectures such as convolutional neural networks) that require many hours (e.g., hundreds of hours) across multiple high-end processors to implement for processing read data to generate variant calls for a sample. Such deep learning architectures can further require several days (or weeks) to train. Conversely, in some embodiments, the personalized sequencing system utilizes a comparatively lightweight, fast architecture for the call-generation model and the call-recalibration machine-learning model leveraging personalized diploid reference haplotypes as described herein. In contrast to the many hours across multiple processors required by existing sequencing systems, the personalized sequencing system requires under an hour (for both the call-generation model and the call-recalibration machine-learning model together) of runtime (e.g., on a single field programmable gate array or a single processor) to generate genotype calls (and/or variant calls) for a sample. Thus, the personalized sequencing system is significantly faster and less computationally expensive than many deep learning approaches to genotype/variant calling. Indeed, not only are the models of the personalized sequencing system faster and less computationally expensive to implement, but the call-recalibration machine-learning model is also much faster and less computationally expensive to train than many existing deep learning systems.
[0038] Moreover, by determining personalized prior probabilities (e.g., as described below in relation to FIG. 4) from which the call-generation model can generate posterior genotype probabilities for candidate genotypes (e.g., candidate variant calls), the personalized sequencing system can accurately and efficiently determine genotype calls for a sample without extensive reconfiguration of the call-generation model for implementation of personalized reference haplotypes. Similarly, by implementing personalized sequencing metrics in relation to personalized reference haplotypes, the personalized sequencing system can efficiently train and/or implement a personalized call-recalibration machine-learning model to generate personalized genotype call classifications at genomic coordinates where such personalized information is available for a particular sample. Indeed, according to the embodiments disclosed herein, either or both the call-generation model and the call-recalibration machine-learning model can be configured and implemented for personalization in genotype calling.
[0039] Furthermore, in some implementations, the personalized sequencing system conserves sequencing cycles, imaging, consumables, and other physical resources (and reduces overuse of fluidics devices and other hardware within a sequencing device) relative to existing sequencing systems. To compensate for read-data-coverage uncertainty, existing sequencing systems often duplicate sequencing cycles and sometimes perform additional sequencing runs. Such excessive sequencing cycles or runs can require additional run time and consume sequencing reagents, processing materials, and sample materials. In contrast to such existing sequencing systems, the personalized sequencing system can reduce one or both of (i) the number of sequencing cycles and (ii) the number of flow cell regions imaged in a given sequencing run to satisfy a target read-coverage level by providing for genotype calling results with increased accuracy without requiring additional read data and sequencing runs or cycles on a sequencing device.
[0040] As suggested by the foregoing discussion, this disclosure utilizes a variety of terms to describe features and benefits of the personalized sequencing system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used in this disclosure, for instance, the term “sample” refers to a specimen, culture, or the like that is suspected of including a target nucleic acid. In some embodiments, the sample comprises DNA, ribonucleic acid (RNA), peptide nucleic acid (PNA), locked nucleic acid (LNA), or chimeric or hybrid forms of nucleic acids as targets. The sample can likewise include any biological, clinical, surgical, agricultural-atmospheric, or aquatic-based specimen containing one or more nucleic acids. A sample also includes any isolated or extracted nucleic acid sample from an organism, such as a genomic DNA, fresh-frozen, or formalin-fixed paraffin-embedded nucleic acid specimen. In some cases, accordingly, a sample can include a full genome or partial genome that is isolated or extracted (e.g., in whole or in part by a kit) from an organism and that is prepared to undergo sequencing or an assay in a sequencing device. A sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or a sample from a single source that contains two distinct forms of genetic material, such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
[0041] The sample can include high molecular weight material, such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another implementation, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some implementations, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some implementations, the sample can be an epidemiological, agricultural, forensic, or pathogenic sample. In some implementations, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another implementation, the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus, or fungus. In some implementations, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0042] As used herein, the term “nucleotide read” refers to an inferred sequence of one or more nucleobases (or nucleobase pairs) from all or part of a sample nucleotide sequence (e.g., a sample genomic sequence, complementary DNA). In particular, a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a sample. For example, in some embodiments, the personalized sequencing system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell. In some cases, a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads). In these or other cases, another type of nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
[0043] As further used herein, the term “genomic coordinate” (or sometimes simply “coordinate”) refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome). In some cases, a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome. For instance, a genomic coordinate or coordinates may include a number, name, or other identifier for a chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870). In some cases, a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY). Consequently, the personalized sequencing system can determine genotype probabilities for a genotype call (e.g., a variant call) for a genomic coordinate on a sex chromosome. Further, in certain implementations, a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001). By contrast, in certain cases, a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
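As an illustration of the coordinate notations listed in the preceding paragraph, the following is a minimal Python sketch that parses strings such as chr1:1234570, chr1:1234570-1234870, mt:16568, or a bare position such as 29727. The names GenomicCoordinate and parse_coordinate are hypothetical and are not part of this disclosure; the sketch also assumes that a bare number carries no chromosome or source identifier.

```python
import re
from typing import NamedTuple, Optional


class GenomicCoordinate(NamedTuple):
    """Hypothetical container for a parsed coordinate or coordinate range."""
    source: Optional[str]  # chromosome or reference source (e.g., "chr1", "mt"); None if absent
    start: int             # first nucleobase position
    end: int               # last nucleobase position (equals start for a single position)


# Optional "<source>:" prefix, a start position, and an optional "-<end>" suffix.
_COORD = re.compile(r"^(?:(?P<source>[^:\s]+):)?(?P<start>\d+)(?:-(?P<end>\d+))?$")


def parse_coordinate(text: str) -> GenomicCoordinate:
    """Parse a coordinate string into its source identifier and position(s)."""
    match = _COORD.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {text!r}")
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    if end < start:
        raise ValueError(f"range end precedes range start: {text!r}")
    return GenomicCoordinate(match.group("source"), start, end)
```

For example, parse_coordinate("chr1:1234570-1234870") yields a source of "chr1" with start 1234570 and end 1234870, while parse_coordinate("29727") yields a source of None.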
[0044] As used herein, a “genomic region” refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome. Relatedly, as used herein, the term “reference span” refers to a span of nucleobase positions within a linear reference genome. Accordingly, a reference span includes a span of nucleobases between two respective genomic coordinates of the linear reference genome. In some cases, a reference span can be described as a bin of nucleobases spanning positions from one genomic coordinate to another genomic coordinate.
[0045] As indicated above, genomic coordinates within a nucleotide sequence can exhibit different genotypes. For example, a “homozygous reference genotype” refers to a genotype where both nucleobases at a given coordinate of a sample nucleotide sequence match a reference nucleobase of a reference sequence or a reference genome (represented as 0/0). As another example, a “homozygous alternate genotype” refers to a genotype at a given coordinate where both nucleobases differ from a reference nucleobase of a reference sequence or a reference genome (represented as 1/1). As a further example, a “heterozygous genotype” refers to a genotype where the nucleobases at a given coordinate are not the same. In some cases, a heterozygous genotype includes a genotype in which one nucleobase matches a reference nucleobase and the other nucleobase differs from the reference nucleobase (represented as 0/1 or 1/0).
For multiallelic genomic coordinates, genotypes can exhibit nucleobases from more than one alternate nucleobase differing from a reference nucleobase of a reference genome. For instance, a multiallelic heterozygous genotype can be represented as 1/2, where one nucleobase call matches a first alternate nucleobase differing from a reference nucleobase and the other nucleobase call matches a second alternate nucleobase differing from the reference nucleobase.
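The genotype notations described above (0/0, 0/1 or 1/0, 1/1, and 1/2) can be mapped to the named genotype categories with a short Python sketch. The function name classify_genotype is hypothetical, and the sketch assumes a diploid genotype string whose two allele indices are separated by "/" or "|", with 0 denoting the reference allele.

```python
import re


def classify_genotype(gt: str) -> str:
    """Map a diploid genotype string such as "0/1" to a genotype category."""
    alleles = re.split(r"[/|]", gt.strip())
    if len(alleles) != 2 or not all(a.isdigit() for a in alleles):
        raise ValueError(f"expected a diploid genotype such as '0/1', got {gt!r}")
    first, second = (int(a) for a in alleles)
    if first == second:
        # Both alleles agree: reference if index 0, otherwise the same alternate allele.
        return "homozygous reference" if first == 0 else "homozygous alternate"
    if 0 in (first, second):
        # One reference allele and one alternate allele.
        return "heterozygous"
    # Two different alternate alleles, e.g., 1/2.
    return "multiallelic heterozygous"
```

Under these assumptions, classify_genotype("1/2") falls into the multiallelic heterozygous category, whereas classify_genotype("1/0") and classify_genotype("0/1") are both heterozygous.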
[0046] As noted above, a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome. As used herein, the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species. For example, a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium. As a further example, a reference genome may include a reference graph genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
[0047] As used herein, the term “population haplotypes” refers to nucleotide sequences that are present in organisms from a population as inherited from one or more ancestors. In particular, a population haplotype can include alleles or other nucleotide sequences that are present in organisms of a population and inherited together by such organisms respectively from a single parent. In one or more embodiments, population haplotypes include a set of SNPs or other variants on the same chromosome that tend to be inherited together. In some cases, data representing a population haplotype, or a set of different population haplotypes, are stored or otherwise accessible on a population haplotype database. As mentioned, in some embodiments, the personalized sequencing system also generates a personalized haplotype database comprising a customized selection of the population haplotypes imported from a particular population haplotype database.
[0048] Relatedly, as used herein, the term “population haplotype database” refers to a database encoding variant data for population haplotypes of a sample organism. In particular, a population haplotype database refers to an unfiltered (or only partially filtered) compilation of population haplotypes or, in other words, a complete compilation of population haplotypes prior to selection of personalized subsets of haplotypes for a particular sample as described herein. In one or more embodiments, a population haplotype database includes complete or partially complete nucleotide sequences (e.g., alternate contiguous sequences) for population haplotypes of a sample organism. Alternatively, in some embodiments, a population haplotype database encodes variant data for population haplotypes having allele-variant differences from locally distinct population haplotypes within respective genomic regions of a corresponding reference genome.
[0049] Relatedly, as used herein, the term “reference haplotype” refers to a population haplotype associated with a particular sample or otherwise selected as a reference for the particular sample. For instance, as used herein, the term “personalized subset of diploid reference haplotypes,” sometimes used interchangeably with “personalized diploid reference haplotypes” for brevity, refers to one or more pairs of population haplotypes selected for a genomic coordinate or genomic region of a particular sample. In some embodiments, for example, a personalized subset of diploid reference haplotypes for a genomic coordinate of a given sample comprises a first population haplotype and a second population haplotype selected from a population haplotype database (or other compilation of population haplotypes) for the given sample at the genomic coordinate. Additionally or alternatively, in some embodiments, a personalized subset of diploid reference haplotypes for a genomic coordinate of a given sample comprises first, second, third, and fourth population haplotypes selected from a population haplotype database (or other compilation of population haplotypes) for the given sample at a genomic coordinate or region. Accordingly, while this disclosure sometimes describes a personalized subset of diploid reference haplotypes in terms of a single pair of population haplotypes, multiple pairs of population haplotypes (e.g., two pairs, three pairs) at a genomic coordinate or genomic region may be selected for a given sample and constitute a personalized subset of diploid reference haplotypes.
Further, (i) a single pair of population haplotypes may be selected and constitute a personalized subset of diploid reference haplotypes for one genomic coordinate or region of a given sample and (ii) multiple pairs of population haplotypes may be selected and constitute another personalized subset of diploid reference haplotypes for another genomic coordinate or region of the given sample.
[0050] As further used herein, the term “genotype call” refers to a determination or prediction of a particular genotype of a sample or a sample nucleotide sequence at a genomic locus. In particular, a genotype call can include a prediction of a particular genotype of a sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region. For instance, in some cases, a genotype call includes a determination or prediction that a sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0|0 or heterozygous for a variant on a particular strand represented as 0|1). Accordingly, a genotype call can include a prediction of a variant or reference base for one or more alleles of a sample and indicate zygosity with respect to a variant or reference base. A genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
[0051] As used herein, the term “variant” refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome. For example, a variant includes a SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence or a reference genome.
[0052] Along these lines, a “variant call” (or “variant nucleobase call”) refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference. In particular, a variant call includes a determination or prediction that a sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome. Conversely, a “non-variant call” (or “non-variant nucleobase call” or “reference call”) refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference. In particular, a non-variant or reference call includes a determination or prediction that a sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
[0053] As noted, in some embodiments, the personalized sequencing system utilizes a call generation model to determine genotype calls (e.g., variant calls with respect to a reference genome) or, in some cases, candidate variant calls for further evaluation. As used herein, the term “call generation model” refers to a probabilistic model that generates sequencing data from nucleotide reads of a sample, including variant calls, and/or genotype calls along with associated metrics. Accordingly, in some cases, a call generation model may be a variant call generation model. For example, in some cases, a call-generation model refers to a Bayesian probability model that generates variant calls based on nucleotide reads of a sample nucleotide sequence. Such a model can process or analyze sequencing metrics corresponding to read pileups (e.g., multiple nucleotide reads corresponding to a single genomic coordinate), including mapping quality, base quality, and various hypotheses including foreign reads, missing reads, joint detection, and more. A call generation model may likewise include multiple components, including, but not limited to, different software applications or components for mapping and aligning, sorting, duplicate marking, computing read pileup depths, and variant calling. In some cases, a call generation model refers to an ILLUMINA DRAGEN model for variant calling functions and mapping and alignment functions (e.g., a DRAGEN variant caller or “DRAGEN VC”).
[0054] Relatedly, as used herein, the term “personalized call-generation model” refers to a call-generation model configured to generate genotype probabilities or other sequencing data in relation to a personalized subset of reference haplotypes. Such personalized subsets of reference haplotypes include personalized diploid reference haplotypes selected for a diploid genomic region or a personalized reference haplotype selected for a haploid genomic region. In some cases, a personalized call-generation model includes a call-generation model configured to generate genotype probabilities or other sequencing data for selected genomic coordinates or regions, such as genomic coordinates or regions for which a non-personalized or standard call-generation model determines a variant with respect to a linear reference genome.
[0055] Accordingly, as further used herein, the term “genotype probability” refers to a likelihood, probability, or score of a particular genotype (e.g., a candidate genotype) at a genomic coordinate or genomic region. For instance, a genotype probability includes a likelihood of a homozygous reference genotype, a likelihood of a heterozygous variant genotype, or a likelihood of a homozygous variant genotype at one or more genomic coordinates. In some cases, a genotype probability can refer to a prior genotype probability, or a posterior genotype probability generated by a call-generation model starting with a prior genotype probability. Accordingly, in some cases, a prior genotype probability determined for a call-generation model can be presented in a prior genotype probability (PRI) field of a VCF or other sequencing data file. Similarly, a posterior genotype probability generated by a call-generation model can be presented in a posterior genotype probability (GP) field of a VCF or other sequencing data file. A genotype probability can include a specialized prediction depending on the application of a call-generation model, such as for predicting SNPs or indels.
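To illustrate the relationship between a prior genotype probability and a posterior genotype probability, the following Python sketch applies a generic Bayes update in which each posterior is proportional to the prior multiplied by a likelihood of the observed nucleotide reads under that genotype. This is only a sketch of the general Bayesian relationship and is not presented as the computation performed by the call-generation model; the function name posterior_genotype_probabilities and all numeric values are hypothetical.

```python
from typing import Dict


def posterior_genotype_probabilities(
    priors: Dict[str, float], likelihoods: Dict[str, float]
) -> Dict[str, float]:
    """Generic Bayes update: posterior(g) is proportional to prior(g) * likelihood(reads | g)."""
    if priors.keys() != likelihoods.keys():
        raise ValueError("priors and likelihoods must cover the same candidate genotypes")
    unnormalized = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("at least one candidate genotype needs nonzero prior and likelihood")
    # Normalize so the posterior probabilities over candidate genotypes sum to one.
    return {g: value / total for g, value in unnormalized.items()}


# Hypothetical prior genotype probabilities (e.g., values that could populate a PRI field).
priors = {"0/0": 0.90, "0/1": 0.08, "1/1": 0.02}
# Hypothetical read likelihoods for a pileup in which about half the reads carry the alternate allele.
likelihoods = {"0/0": 1e-6, "0/1": 1e-2, "1/1": 1e-4}

# Posterior genotype probabilities (e.g., values that could populate a GP field).
posteriors = posterior_genotype_probabilities(priors, likelihoods)
best_genotype = max(posteriors, key=posteriors.get)
```

With these hypothetical inputs, the heterozygous genotype 0/1 receives the largest posterior even though its prior is smaller than the prior for 0/0, because the read likelihood dominates.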
[0056] Relatedly, as used herein, the term “prior genotype probability” refers to a genotype probability that is estimated before new data or additional information is collected and/or analyzed (e.g., a newly determined metric or newly observed event). For example, a prior genotype probability can refer to an estimated genotype probability prior to utilizing an imputation model and/or prior to utilizing a probabilistic model to determine posterior genotype probabilities from aligning and mapping nucleotide reads of a sample to a reference genome and/or personalized diploid reference haplotypes. Accordingly, in some cases, prior genotype probabilities account for (e.g., are determined based on) inferred population haplotype information (e.g., personalized diploid reference haplotypes) prior to directly accounting for the nucleobase content and/or alignment of nucleotide reads from a target sample. As set forth below, FIG. 4 depicts and the corresponding paragraphs below describe an embodiment of such prior genotype probabilities. However, as indicated by either or both of FIGS. 2 and 4, the personalized sequencing system can determine personalized diploid reference haplotypes based on a sample’s nucleotide reads (e.g., an initial read processing and mapping) and further determine prior genotype probabilities based on the personalized diploid reference haplotypes, thereby indirectly accounting for the sample’s nucleotide reads in such prior genotype probabilities. In terms of a given allele and coordinate, a prior genotype probability comprises a quantification of an expected rate per locus of a respective genotype on a given allele at a genomic coordinate of a sample.
[0057] By contrast, the term “posterior genotype probability” refers to a genotype probability that is determined to account for or reflect new data or information (e.g., a newly determined metric or newly observed event, such as nucleotide reads or read alignments). For instance, a posterior genotype probability can include an estimated genotype probability produced from utilizing an imputation model and/or from utilizing a probabilistic model to determine posterior genotype probabilities from aligning and mapping nucleotide reads of a sample to a reference genome and/or personalized diploid reference haplotypes. Accordingly, in some cases, posterior genotype probabilities account for (e.g., are determined based on) both inferred population haplotype information (e.g., personalized diploid reference haplotypes) and the nucleobase content and/or alignment of nucleotide reads from a target sample. As set forth below, FIG. 4 depicts and the corresponding paragraphs below describe an embodiment of such posterior genotype probabilities.
[0058] As noted above, in some embodiments, the personalized sequencing system determines sequencing metrics for nucleobase calls of nucleotide reads. As used herein, the term “sequencing metric” refers to a quantitative measurement or score indicating a degree to which an individual nucleobase call (or a sequence of nucleobase calls) aligns, compares, or quantifies with respect to a genomic coordinate or genomic region of a reference genome, with respect to nucleobase calls from nucleotide reads, or with respect to external genomic sequencing or genomic structure.
For instance, a sequencing metric includes a quantitative measurement or score indicating a degree to which (i) individual nucleobase calls align, map, or cover a genomic coordinate or reference base of a reference genome; (ii) nucleobase calls compare to reference or alternative nucleotide reads in terms of mapping, mismatch, base call quality, or other raw sequencing metrics; or (iii) genomic coordinates or regions corresponding to nucleobase calls demonstrate mappability, repetitive base call content, DNA structure, or other generalized metrics.
[0059] Relatedly, as used herein, the term “personalized sequencing metrics” refers to sequencing metrics determined in relation to (i) a personalized subset of diploid reference haplotypes for a particular sample or (ii) for haploid genomic coordinates, at least one personalized reference haplotype for the particular sample. Accordingly, in some embodiments, personalized sequencing metrics can include sequencing metrics that have been determined based on a personalized subset of diploid reference haplotypes for a particular sample but exclude sequencing metrics that have been determined based on generic population haplotypes. In some embodiments, for example, personalized sequencing metrics can include one or more of a personalized genotype metric indicating a diploid genotype at a genomic coordinate of a sample, a personalized genotype-probability metric indicating a probability of the diploid genotype occurring at the genomic coordinate, a personalized genotype-quality metric indicating a probability that the diploid genotype at the genomic coordinate is correct or incorrect, or a personalized variant-call-quality metric indicating a quality score for a variant call generated by a call-generation model utilizing a personalized subset of diploid reference haplotypes for the sample.
[0060] As suggested above, the personalized sequencing system can utilize a machine-learning model to modify sequencing metrics and update a genotype call. As used herein, the term “machine-learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of data. For example, a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine-learning models include various types of decision trees, support vector machines, or neural networks. In some cases, the call-recalibration machine-learning model is a series of gradient boosted decision trees (e.g., XGBoost algorithm), while in other cases the call-recalibration machine-learning model is a random forest model, a multilayer perceptron, a linear regression, a support vector machine, a deep tabular learning architecture, a deep learning transformer (e.g., self-attention-based-tabular transformer), or a logistic regression.
[0061] In some cases, the personalized sequencing system utilizes a call-recalibration machine-learning model to generate outputs for confirming, modifying, or updating a genotype call based on sequencing metrics. As used herein, the term “call-recalibration machine-learning model” refers to a machine-learning model that generates genotype call classifications (e.g., genotype probabilities). For example, in some cases, the variant-call-recalibration machine-learning model is trained to generate genotype call classifications indicating various probabilities or predictions for genotype calls (e.g., variant calls) based on the aforementioned sequencing metrics. Accordingly, in some cases, a call-recalibration machine-learning model is a variant-call-recalibration machine-learning model.
In some cases, the call-recalibration machine-learning model is a series of gradient boosted decision trees (e.g., XGBoost algorithm or treelite algorithm for an ensemble of decision trees), while in other cases the call-recalibration machine-learning model is a random forest model, a multilayer perceptron, a linear regression, a support vector machine, a deep tabular learning architecture, a deep learning transformer (e.g., self-attention-based-tabular transformer), or a logistic regression. In certain embodiments, a call-recalibration machine-learning model includes multiple sub-models or operates in tandem with another call-recalibration machine-learning model. For instance, a first call-recalibration machine-learning model (e.g., an ensemble of gradient boosted trees) generates a first set of genotype call classifications and a second call-recalibration machine-learning model (e.g., a random forest) generates a second set of genotype call classifications.
[0062] As mentioned, in some embodiments, the call-recalibration machine-learning model can be a neural network. The term “neural network” refers to a machine-learning model that can be trained and/or tuned based on inputs to determine classifications or approximate unknown functions. For example, a neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., generated digital images) based on a plurality of inputs provided to the neural network. In some cases, a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions in data. For example, a neural network can include a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a self-attention transformer neural network, or a generative adversarial neural network.
[0063] As also used herein, the term “genotype call classification” refers to a predicted classification from a call-recalibration machine-learning model that indicates a probability, score, or other quantitative measurement indicating an accuracy of identifying a genotype at a genomic coordinate of a sample. A genotype call classification can include a specialized prediction depending on the application of a call-recalibration machine-learning model. In some cases, genotype call classifications for a biallelic genomic coordinate include (i) a false-positive probability that a genotype call is a false positive, (ii) a genotype-error probability that a genotype for the genotype call is incorrect, and (iii) a true-positive probability that the genotype call is a true positive. As a further example, in embodiments for generating genotype calls for multiallelic genomic coordinates, genotype call classifications can include: (i) a reference probability that a genotype call comprises a homozygous reference genotype at a multiallelic genomic coordinate, (ii) a zygosity-error probability that the genotype call comprises a genotype-zygosity error at a multiallelic genomic coordinate, and (iii) a true-positive variant probability that the genotype call constitutes a true positive variant at a multiallelic genomic coordinate.
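As one way to picture the three genotype call classifications described above for a biallelic genomic coordinate, the following Python sketch stores the false-positive, genotype-error, and true-positive probabilities in a small record and reports which classification scores highest. The class name BiallelicCallClassification, its field names, and the numeric values are hypothetical. Selecting the highest-scoring classification is an illustrative assumption only and is not stated in this disclosure as the manner in which the classifications are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiallelicCallClassification:
    """Hypothetical record of genotype call classifications at a biallelic coordinate."""
    false_positive: float  # probability that the genotype call is a false positive
    genotype_error: float  # probability that the genotype for the call is incorrect
    true_positive: float   # probability that the genotype call is a true positive

    def __post_init__(self) -> None:
        # Each classification is treated as a probability in the closed interval [0, 1].
        for value in (self.false_positive, self.genotype_error, self.true_positive):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"probability out of range: {value}")

    def most_likely(self) -> str:
        """Return the name of the highest-scoring classification (illustrative only)."""
        scores = {
            "false_positive": self.false_positive,
            "genotype_error": self.genotype_error,
            "true_positive": self.true_positive,
        }
        return max(scores, key=scores.get)


# Hypothetical classifications for a single candidate genotype call.
example = BiallelicCallClassification(
    false_positive=0.05, genotype_error=0.15, true_positive=0.80
)
label = example.most_likely()
```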
[0064] In embodiments for generating genotype calls for a haploid genomic coordinate, genotype call classifications can include: (i) a first genotype probability of a first genotype at the genomic coordinate and (ii) a second genotype probability of a second genotype at the genomic coordinate. As suggested above, the first genotype probability can be a probability that a genotype at a genomic coordinate is a haploid reference genotype, and the second genotype probability can be a probability that a genotype at the genomic coordinate is a haploid alternate genotype. In these or other embodiments, such as embodiments for generating genotype calls for genomic coordinates indicated to exhibit homozygous reference genotypes, genotype call classifications can include: (i) a false-positive probability or a homozygous reference classification indicating a probability that a genotype call is a false positive or a homozygous reference genotype, respectively; (ii) a zygosity-error probability or a heterozygous genotype classification indicating a probability that a genotype (e.g., an indication of a heterozygous or homozygous genotype for a variant call at a particular location) is incorrect or a heterozygous genotype, respectively; and/or (iii) a true-positive classification or a homozygous alternate classification indicating a probability that a genotype call is a true positive or a homozygous alternate genotype, respectively. Accordingly, in some cases, the genotype call classifications represent intermediate scoring metrics and/or a predicted probability that a genotype for a genotype call is accurate.
[0065] Relatedly, as used herein, the term “personalized genotype call classification” refers to a genotype call classification generated based on personalized sequencing metrics by a personalized call-recalibration machine-learning model. Accordingly, in some embodiments, personalized genotype call classifications include genotype probabilities or other genotype call classifications that have been determined based on processing personalized sequencing metrics derived from a personalized subset of diploid reference haplotypes for a particular sample but exclude genotype call classifications that have been determined based on generic population haplotypes.
[0066] Accordingly, as used herein, the term “personalized call-recalibration machine-learning model” refers to a call-recalibration machine-learning model configured and trained to process personalized sequencing metrics and generate personalized genotype call classifications based on the personalized sequencing metrics. In some cases, a personalized call-recalibration machine-learning model includes a call-recalibration machine-learning model configured to generate personalized genotype call classifications for selected genomic coordinates or regions, such as genomic coordinates or regions for which a non-personalized or standard call-generation model determines a variant with respect to a linear reference genome.
[0067] In one or more embodiments, the personalized sequencing system identifies and/or stores sequencing metrics within one or more sequencing data files. As used herein, the term “sequencing data file” refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, and so forth.
[0068] As also mentioned, in one or more embodiments, one or more sequencing data files in which the personalized sequencing system identifies or stores sequencing metrics include an alignment data file containing information from a read processing and mapping procedure. As used herein, the term “alignment data file” refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence. For example, an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence.
[0069] In some embodiments, the personalized sequencing system modifies data fields corresponding to a genotype-call data file, such as a variant call file. As used herein, the term “genotype-call data file” refers to a digital file that indicates or represents one or more genotype calls (e.g., including reference and/or variant calls) compared to a reference genome along with other information pertaining to the genotype calls (e.g., variant calls). For example, a genotype-call data file can include a variant call file, such as but not limited to a variant call format (VCF) file (as well as a genomic variant call format (gVCF) file). Alternatively, as a further example, a genotype-call data file can include a General Feature Format (GFF) file, a Genome Variation Format (GVF) file, or other suitable data file comprising genotype calls for a sample nucleotide sequence.
[0070] As further used herein, a “variant call file” refers to a particular genotype-call data file that comprises a text file format that contains information about variants at specific genomic coordinates. For instance, a variant call file can include meta-information lines, a header line, and data lines where each data line contains information about a single genotype call (e.g., a single variant). As described further below, the personalized sequencing system can generate different versions of genotype-call data files, including a pre-filter variant call file comprising variant genotype calls that either pass or fail a quality filter for base-call-quality metrics or a post-filter variant call file comprising variant genotype calls that pass the quality filter but excluding variant genotype calls that fail the quality filter.
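The layout of a variant call file described above (meta-information lines, a header line, and data lines) and the distinction between pre-filter and post-filter files can be illustrated with a minimal Python sketch. The sample records, the `LowQual` filter name, and the function names `parse_vcf` and `post_filter` are hypothetical and chosen only for illustration; a production system would typically rely on a dedicated VCF library.

```python
from typing import Dict, List, Tuple

# Hypothetical two-record VCF text; the values are illustrative, not real data.
VCF_TEXT = """\
##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Hypothetical low-quality filter">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
chr1\t1000\t.\tA\tG\t48.5\tPASS\t.\tGT:GQ\t0/1:45
chr1\t2000\t.\tC\tT\t3.1\tLowQual\t.\tGT:GQ\t0/1:4
"""


def parse_vcf(text: str) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Split VCF text into meta-information lines, header columns, and data records."""
    meta: List[str] = []
    header: List[str] = []
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("##"):        # meta-information line
            meta.append(line)
        elif line.startswith("#"):       # single header line naming the columns
            header = line.lstrip("#").split("\t")
        elif line:                       # data line: one genotype call per line
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


def post_filter(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only calls whose FILTER column is PASS (a post-filter view)."""
    return [r for r in records if r["FILTER"] == "PASS"]


meta, header, pre_filter_records = parse_vcf(VCF_TEXT)
post_filter_records = post_filter(pre_filter_records)
```

In this sketch, `pre_filter_records` plays the role of a pre-filter variant call file (passing and failing calls together), while `post_filter_records` keeps only the passing calls.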
[0071] In some embodiments, the personalized sequencing system modifies data fields corresponding to metrics of a genotype call associated with a variant call file, such as fields for call quality, genotype, and genotype quality. As used herein, the term “call quality” when used with respect to a data field in a variant call file refers to a measure or an indication of a likelihood or a probability that a variant exists at a given location. Accordingly, a call quality field (or QUAL field) corresponding to a VCF file may include a base-call-quality metric, such as a PHRED-scaled quality or Q score, representing a probability that a genomic coordinate of a sample genome includes a variant. Similarly, a “genotype quality” when used with respect to a field refers to a likelihood or a probability that a particular predicted genotype for a nucleobase call is correct.
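As a minimal sketch of the PHRED scaling referenced above, the following Python converts between an error probability and a Q score using the standard relation Q = -10 log10(p). The function names and the cap of 99 are assumptions made for illustration and are not taken from this disclosure.

```python
import math


def phred_from_error_prob(p_error: float, max_q: float = 99.0) -> float:
    """Convert an error probability to a PHRED-scaled quality, Q = -10 * log10(p)."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be in [0, 1]")
    if p_error == 0.0:
        return max_q  # assumed cap, because log10(0) is undefined
    return min(max_q, -10.0 * math.log10(p_error))


def error_prob_from_phred(q: float) -> float:
    """Invert the PHRED scale: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)
```

Under this scale, a call-quality value of 30 corresponds to an estimated 1-in-1,000 chance that the call is wrong, and a value of 20 corresponds to 1 in 100.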
[0072] The following paragraphs describe the personalized sequencing system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a computing system 100 in which a personalized sequencing system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes a sequencing device 102 connected to a local device 108 (e.g., a local server device), one or more server device(s) 110, and a client device 114. As shown in FIG. 1, the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114 can communicate with each other via a network 118. The network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 12. While FIG. 1 shows an embodiment of a personalized sequencing system 106, this disclosure describes alternative embodiments and configurations below.
[0073] As indicated by FIG. 1, the sequencing device 102 comprises a computing device and a sequencing device system 104 for sequencing a sample or other nucleic-acid polymer. In some embodiments, by executing the sequencing device system 104 using a processor, the sequencing device 102 analyzes nucleotide fragments or oligonucleotides extracted from samples to generate nucleotide reads or other data utilizing computer-implemented methods and systems either directly or indirectly on the sequencing device 102. More particularly, the sequencing device 102 receives nucleotide-sample slides (e.g., flow cells) comprising nucleotide fragments extracted from samples and further copies and determines the nucleobase sequence of such extracted nucleotide fragments.
[0074] In one or more embodiments, the sequencing device 102 utilizes sequencing-by-synthesis (SBS) techniques to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads. In addition or in the alternative to communicating across the network 118, in some embodiments, the sequencing device 102 bypasses the network 118 and communicates directly with the local device 108 or the client device 114. By executing the sequencing device system 104, the sequencing device 102 can further store the nucleobase calls as part of base-call data that is formatted as a binary base call (BCL) file and send the BCL file to the local device 108 and/or the server device(s) 110.
[0075] As further indicated by FIG. 1, the local device 108 is located at or near the same physical location as the sequencing device 102. Indeed, in some embodiments, the local device 108 and the sequencing device 102 are integrated into the same computing device. The local device 108 may run the personalized sequencing system 106 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As shown in FIG. 1, the sequencing device 102 may send (and the local device 108 may receive) base-call data generated during a sequencing run of the sequencing device 102. By executing software in the form of the personalized sequencing system 106, the local device 108 may utilize a personalized subset of diploid reference haplotypes (e.g., personalized diploid reference haplotypes 112) selected from a population haplotype database 120 to generate genotype calls at one or more genomic coordinates of a sample. The local device 108 may also communicate with the client device 114. In particular, the local device 108 can send data to the client device 114, including a binary alignment map (BAM) file, a genotype-call data file (e.g., a variant call format (VCF) file), or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics.
[0076] As further indicated by FIG. 1, the server device(s) 110 are located remotely from the local device 108 and the sequencing device 102. Similar to the local device 108, in some embodiments, the server device(s) 110 include a version of (or are otherwise able to access or implement) the personalized sequencing system 106. Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining genotype calls based on analyzing such base-call data in relation to the personalized diploid reference haplotypes 112. As indicated above, the sequencing device 102 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 102. The server device(s) 110 may also communicate with the client device 114. In particular, the server device(s) 110 can send data to the client device 114, including BAM files, VCF files, or other sequencing-related information.
[0077] In some embodiments, the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. Moreover, as shown in FIG. 1, the server device(s) 110 include or are otherwise in communication with the population haplotype database 120 storing population haplotypes from which the personalized diploid reference haplotypes 112 are selected for a genomic coordinate of a sample.
[0078] As indicated above, as part of the server device(s) 110 or the local device 108, the personalized sequencing system 106 can generate, encode, and/or implement the population haplotype database 120 to determine the personalized diploid reference haplotypes 112. In some embodiments, for example, the personalized sequencing system 106 can determine the personalized diploid reference haplotypes 112 for a sample (e.g., based on comparing a set of nucleotide reads from a sample with haplotypes of the population haplotype database 120 at a genomic coordinate) and utilize the personalized diploid reference haplotypes 112 to determine one or more genotype calls for nucleotide reads from a sample at the genomic coordinate, as described in greater detail below in relation to the subsequent figures. In some embodiments, the personalized sequencing system 106 can receive the personalized diploid reference haplotypes 112 from another system.
[0079] As further illustrated and indicated in FIG. 1, by executing a sequencing application 116, the client device 114 can generate, store, receive, and send digital data. In particular, the client device 114 can receive sequencing data from the local device 108 or receive call files (e.g., BCL files) and sequencing metrics from the sequencing device 102. Furthermore, the client device 114 may communicate with the local device 108 or the server device(s) 110 to receive a VCF file comprising genotype or variant calls and/or other metrics, such as base-call-quality metrics or personalized genotype probabilities. The client device 114 can accordingly present or display information pertaining to variant calls or other genotype calls within a graphical user interface of the sequencing application 116 to a user associated with the client device 114. For example, the client device 114 can present genotype calls, variant calls, and/or sequencing metrics for a sequenced sample within a graphical user interface of the sequencing application 116.
[0080] Although FIG. 1 depicts the client device 114 as a desktop or laptop computer, the client device 114 may comprise various types of client devices. For example, in some embodiments, the client device 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 114 are discussed below with respect to FIG. 12.
[0081] As further illustrated in FIG. 1, the client device 114 includes the sequencing application 116. The sequencing application 116 may be a web application or a native application stored and executed on the client device 114 (e.g., a mobile application, desktop application). The sequencing application 116 can include instructions that (when executed) cause the client device 114 to receive data from the personalized sequencing system 106 and present, for display at the client device 114, base-call data or data from an alignment data file or VCF file. Furthermore, the sequencing application 116 can instruct the client device 114 to display summaries for multiple sequencing runs.
[0082] As further illustrated in FIG. 1, a version of the personalized sequencing system 106 may be located and/or implemented (e.g., entirely or in part) on the client device 114 or the sequencing device 102. In yet other embodiments, the personalized sequencing system 106 is implemented by one or more other components of the computing system 100, such as the local device 108. In particular, the personalized sequencing system 106 can be implemented in a variety of different ways across the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114. For example, the personalized sequencing system 106 can be downloaded from the server device(s) 110 to the personalized sequencing system 106 and/or the local device 108 where all or part of the functionality of the personalized sequencing system 106 is performed at each respective device within the computing system 100.
[0083] As mentioned previously, in some embodiments, the personalized sequencing system 106 implements a personalized subset of diploid reference haplotypes utilizing a personalized call-generation model and/or a personalized call-recalibration machine-learning model to generate genotype calls. To illustrate, FIG. 2 depicts the personalized sequencing system 106 generating one or more genotype call(s) 218 utilizing a personalized call-generation model 210 and a personalized call-recalibration machine-learning model 214 with input from personalized diploid reference haplotypes 204.
[0084] As shown in FIG. 2, the personalized sequencing system 106 identifies nucleotide reads 202 corresponding to one or more genomic coordinates of a sample and determines or receives a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate(s) (e.g., personalized diploid reference haplotypes 204). In one or more embodiments, for example, the personalized diploid reference haplotypes comprise one or more pairs of population haplotypes selected from a database of population haplotypes for the sample at the one or more genomic coordinates. Similarly, for nucleotide reads corresponding to a haploid genomic coordinate, the personalized sequencing system 106 can determine or receive at least one personalized reference haplotype selected from the database of population haplotypes.
[0085] In some embodiments, for example, the personalized sequencing system 106 performs an initial read processing and mapping 222 to determine initial read alignments of the nucleotide reads with respect to a linear reference genome. Based on the initial read alignments determined by the initial read processing and mapping 222, the personalized sequencing system 106 can determine the personalized diploid reference haplotypes 204 for the sample by comparing the nucleotide reads 202 with population haplotypes at genomic regions of the sample, as indicated by the initial read alignments.
[0086] In one or more embodiments, as shown in FIG. 2, the personalized sequencing system 106 performs a personalized read processing and mapping 206 of the nucleotide reads 202 to determine read alignments of the nucleotide reads 202 with respect to a linear reference genome by comparing the nucleotide reads 202 with the linear reference genome and the personalized diploid reference haplotypes 204. Indeed, in one or more implementations, the personalized sequencing system 106 determines read alignments with increased accuracy relative to existing sequencing systems by utilizing personalized subsets of diploid reference haplotypes in read mapping and alignment processing, as opposed to read processing and mapping performed exclusively in relation to a linear reference genome and/or a graph reference genome augmented by an unfiltered collection of population haplotypes.
[0087] As further illustrated in FIG. 2, the personalized sequencing system 106 determines, for input to a personalized call-generation model 210, personalized prior genotype probabilities 208 of candidate genotypes for the sample at the one or more genomic coordinates. In particular, the personalized sequencing system 106 determines the personalized prior genotype probabilities 208 of the candidate genotypes in relation to the personalized diploid reference haplotypes 204 (e.g., as described below in relation to FIG. 5) and processes the personalized prior genotype probabilities 208 with the personalized call-generation model 210 to determine genotype probabilities (e.g., personalized posterior genotype probabilities) for the candidate genotypes at one or more genomic coordinates (e.g., as described below in relation to FIG. 4). Indeed, in one or more embodiments, the call-generation model is configured to generate genotype probabilities for a genomic coordinate of a given sample in relation to a personalized subset of diploid population haplotypes selected for the given sample at the genomic coordinate. Based on the genotype probabilities output by the personalized call-generation model 210, the personalized sequencing system 106 can determine one or more genotype call(s) 218 and, in some cases, output the one or more genotype call(s) 218 to a genotype-call data file 220 (e.g., a VCF file).
[0088] Furthermore, as shown in FIG. 2, the personalized sequencing system 106 determines personalized sequencing metrics 212 for the nucleotide reads 202 in relation to the personalized diploid reference haplotypes for the sample at the one or more genomic coordinates (e.g., as described below in relation to FIG. 6). As illustrated, the personalized sequencing system 106 utilizes a personalized call-recalibration machine-learning model 214 to generate, based on the personalized sequencing metrics 212, one or more personalized genotype call classifications 216 indicating an accuracy of identifying a genotype at the genomic coordinate (e.g., as described below in relation to FIGS. 6 and 8A-8B). Based on the one or more personalized genotype call classifications 216, the personalized sequencing system 106 can determine, modify, and/or recalibrate the one or more genotype call(s) 218 and write the one or more genotype call(s) 218 to, or modify them within, the genotype-call data file 220.
[0089] Moreover, as shown in FIG. 2, the personalized sequencing system 106 can also generate genotype calls for the sample at genomic coordinates for which no corresponding personalized diploid reference haplotypes are provided. As illustrated, for example, the personalized sequencing system 106 can utilize a non-personalized version of a call-generation model 224 to generate one or more genotype call(s) 230 without the personalized diploid reference haplotypes 204 (e.g., at a genomic coordinate not covered by the personalized diploid reference haplotypes 204) based on prior genotype probabilities determined in relation to a linear reference genome (e.g., flat prior probabilities for each candidate genotype at the genomic coordinate).
[0090] In addition to utilizing a non-personalized version of the call-generation model 224, in some embodiments, the personalized sequencing system 106 can utilize a non-personalized version of a call-recalibration machine-learning model 226 to generate genotype call classifications 228 based on sequencing metrics determined without reference to the personalized diploid reference haplotypes 204 (e.g., at a genomic coordinate not covered by the personalized diploid reference haplotypes 204). Accordingly, based on the genotype call classifications 228, the personalized sequencing system 106 can confirm, modify, or recalibrate one or more of the genotype call(s) 230. Additional features and implementations of the personalized call-generation model 210, the call-recalibration machine-learning model 226, and the personalized call-recalibration machine-learning model 214, according to one or more embodiments, are provided below in relation to FIG. 8B.
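The fallback to flat prior probabilities at genomic coordinates not covered by personalized diploid reference haplotypes can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keyed by `(chromosome, position)` and the function names `flat_prior` and `select_prior` are hypothetical, and the personalized prior values shown are placeholders rather than values produced by the disclosed system.

```python
from typing import Dict, List, Tuple

Coordinate = Tuple[str, int]      # hypothetical key: (chromosome, position)
Prior = Dict[str, float]          # candidate genotype -> prior probability


def flat_prior(candidate_genotypes: List[str]) -> Prior:
    """Assign the same prior probability to every candidate genotype."""
    p = 1.0 / len(candidate_genotypes)
    return {genotype: p for genotype in candidate_genotypes}


def select_prior(
    coordinate: Coordinate,
    candidate_genotypes: List[str],
    personalized_priors: Dict[Coordinate, Prior],
) -> Tuple[Prior, bool]:
    """Return (prior, is_personalized) for a coordinate.

    A personalized prior is used when the coordinate is covered by the sample's
    personalized haplotypes; otherwise a flat prior is used.
    """
    prior = personalized_priors.get(coordinate)
    if prior is not None:
        return prior, True
    return flat_prior(candidate_genotypes), False


# Placeholder personalized prior for one covered coordinate.
covered = {("chr1", 1000): {"0/0": 0.05, "0/1": 0.90, "1/1": 0.05}}
genotypes = ["0/0", "0/1", "1/1"]
prior_a, personalized_a = select_prior(("chr1", 1000), genotypes, covered)
prior_b, personalized_b = select_prior(("chr1", 5000), genotypes, covered)
```

Here the first coordinate would be routed to a personalized call-generation model and the second to a non-personalized one.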
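One way to picture how genotype call classifications could drive the decision to confirm, modify, or filter a call is sketched below for the biallelic case. The class name, the assumption that the three probabilities sum to one, and the simple highest-probability rule are illustrative assumptions only; the disclosure does not specify this decision rule here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BiallelicCallClassification:
    """Hypothetical container for the three biallelic classification probabilities."""
    false_positive: float   # probability the genotype call is a false positive
    genotype_error: float   # probability the called genotype is incorrect
    true_positive: float    # probability the genotype call is a true positive

    def __post_init__(self) -> None:
        total = self.false_positive + self.genotype_error + self.true_positive
        # Assumption for this sketch: the three classes are exhaustive.
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            raise ValueError("the three probabilities are assumed to sum to 1")


def decide_action(classification: BiallelicCallClassification) -> str:
    """Pick an action from the most probable class (an illustrative rule only)."""
    scores = {
        "filter": classification.false_positive,
        "modify_genotype": classification.genotype_error,
        "confirm": classification.true_positive,
    }
    return max(scores, key=scores.get)
```

For instance, `decide_action(BiallelicCallClassification(0.02, 0.08, 0.90))` selects `"confirm"` under this rule.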
[0091] As previously mentioned, in some embodiments, the personalized sequencing system 106 identifies a set of nucleotide reads of a sample corresponding to a genomic coordinate and determines, receives, or otherwise identifies a personalized subset of diploid reference haplotypes for a sample at a genomic coordinate. To illustrate, FIG. 3 depicts an example of a set of nucleotide reads 302 of a sample spanning at least two genomic regions of a linear reference genome 304 having personalized diploid reference haplotypes 308 specifically associated with the sample.
[0092] In one or more embodiments, for example, the personalized sequencing system 106 compares nucleotide reads, such as the set of nucleotide reads 302, with the linear reference genome 304 and multiple candidate haplotypes from a population haplotype database to select pairs of the candidate haplotypes to include in the personalized diploid reference haplotypes 308 for corresponding genomic regions (e.g., a first genomic region 306a and a second genomic region 306b) of the linear reference genome 304 (and of the sample). Note that FIG. 3 does not depict nucleotide reads of the set of nucleotide reads 302 to scale relative to reference spans or the first genomic region 306a or the second genomic region 306b, but rather depicts such nucleotide reads and other elements in FIG. 3 at an un-scaled size for explanatory purposes only.
[0093] To illustrate candidate-haplotype selection, in some embodiments, the personalized sequencing system 106 can generate a personalized haplotype database that is customized or personalized for a specific sample based on a comparison of nucleotide reads from a sample with candidate population haplotypes from a population haplotype database. In certain embodiments, for example, the personalized sequencing system 106 can generate a personalized diploid reference database with a customized set of haplotype pairs (e.g., personalized sets of diploid reference haplotypes) for diploid genomic regions of a reference genome. Also, in some embodiments, the personalized sequencing system 106 can utilize such a personalized haplotype database comprising personalized sets of diploid reference haplotypes to determine personalized alignments of nucleotide reads from the corresponding sample with respective genomic regions of the reference genome.
[0094] In some embodiments, for example, the personalized sequencing system 106 can identify, within a set of reference spans of a reference genome, a set of nucleotide reads from a sample and candidate population haplotypes from a population haplotype database. As indicated above, a reference span (e.g., a bin) includes a span of nucleobase positions within a linear reference genome, such as nucleobases between two respective genomic coordinates within the linear reference genome. Such a reference span may include some or all of a genomic region (e.g., the first genomic region 306a and the second genomic region 306b of the linear reference genome 304). For each reference span of the set of reference spans, the personalized sequencing system 106 can generate haplotype set scores for respective sets of the candidate population haplotypes based on comparing the nucleotide reads and the candidate population haplotypes. Based on the haplotype set scores, the personalized sequencing system 106 can generate a personalized haplotype database comprising a subset of population haplotypes (e.g., personalized sets of diploid reference haplotypes) from the population haplotype database.
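The per-span scoring and selection described above can be illustrated with a deliberately simple Python sketch. The scoring rule used here (the negative count of mismatches between each read and whichever haplotype of a pair it matches better) is an assumption made for illustration and is not the scoring method of the disclosure; the haplotype names, sequences, and reads are likewise invented.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

Read = Tuple[int, str]  # (offset within the reference span, read sequence)


def mismatches(read_seq: str, offset: int, haplotype: str) -> int:
    """Count mismatching bases between a read and a haplotype at a given offset."""
    segment = haplotype[offset:offset + len(read_seq)]
    return sum(1 for a, b in zip(read_seq, segment) if a != b)


def haplotype_set_score(reads: List[Read], hap_a: str, hap_b: str) -> int:
    """Score a haplotype pair; higher is better (fewer total mismatches).

    Each read is charged against whichever haplotype of the pair it matches better.
    """
    return -sum(
        min(mismatches(seq, off, hap_a), mismatches(seq, off, hap_b))
        for off, seq in reads
    )


def select_personalized_pair(
    reads: List[Read], candidates: Dict[str, str]
) -> Tuple[str, str]:
    """Return the names of the highest-scoring pair (homozygous pairs allowed)."""
    return max(
        combinations_with_replacement(sorted(candidates), 2),
        key=lambda pair: haplotype_set_score(
            reads, candidates[pair[0]], candidates[pair[1]]
        ),
    )


# Invented candidate haplotypes for one 10-base reference span.
candidates = {
    "h1": "ACGTACGTAC",
    "h2": "ACGTTCGTAC",
    "h3": "ACCTACGTAC",
}
# Invented reads: two support h1 and two support h2.
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TACGTAC"), (1, "CGTTCG")]
best_pair = select_personalized_pair(reads, candidates)
```

Repeating this selection for every reference span would yield one selected pair per span, which together form a sample-specific subset of the population haplotypes.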
[0095] In one or more embodiments, for example, the personalized sequencing system 106 determines (e.g., selects) personalized subsets of diploid and/or haploid reference haplotypes for a given sample, or uses a population haplotype database with reference spans, according to one or more of the methods described in U.S. Provisional Patent Application No. 63/558,754, filed February 28, 2024, and entitled “A Personalized Haplotype Database For Improved Mapping And Alignment Of Nucleotide Reads And Improved Genotype Calling,” (IP-2677-PRV), or International Patent Application No. PCT/US2025/017424, filed February 26, 2025, and entitled “A Personalized Haplotype Database For Improved Mapping And Alignment Of Nucleotide Reads And Improved Genotype Calling,” (IP-2677-PCT), both of which are incorporated herein by reference in their entirety.
[0096] As shown in FIG. 3, for example, the personalized diploid reference haplotypes 308 include a first population haplotype 310a and a second population haplotype 310b selected for the first genomic region 306a of the sample (e.g., shown in FIG. 3 in relation to the linear reference genome 304). As also shown in FIG. 3, the personalized diploid reference haplotypes 308 further include a third population haplotype 310c and a fourth population haplotype 310d selected for the second genomic region 306b (e.g., shown in FIG. 3 in relation to the linear reference genome 304). Indeed, as shown in FIG. 3, the personalized diploid reference haplotypes 308 for a given sample can include population haplotypes selected for multiple genomic regions of the given sample. Moreover, while not shown in FIG. 3, the personalized sequencing system 106 can process a personalized subset of reference haplotypes (e.g., haploid reference haplotypes or polyploid reference haplotypes) comprising more or fewer than two population haplotypes associated with a particular genomic region or genomic coordinate of a sample.
[0097] As previously mentioned, in some embodiments, the personalized sequencing system 106 utilizes a call-generation model to generate genotype probabilities of candidate genotypes for a sample at a genomic coordinate based on a personalized subset of diploid reference haplotypes selected for the sample at the genomic coordinate. For example, FIG. 4 illustrates an overview of the personalized sequencing system 106 identifying personalized diploid reference haplotypes 408 for a sample based on a set of nucleotide reads 402 of the sample and generating genotype probabilities 412 utilizing a personalized call-generation model 410.
[0098] As shown in FIG. 4, the personalized sequencing system 106 receives (or identifies) the set of nucleotide reads 402 of a sample corresponding to a genomic coordinate (e.g., according to a respective set of read alignments in relation to a linear reference genome). Based on the set of nucleotide reads 402, the personalized sequencing system 106 performs a haplotype personalization process 404 to determine the personalized diploid reference haplotypes 408 for the sample at the genomic coordinate (e.g., as described above in relation to FIG. 3). As shown, the personalized sequencing system 106 selects the personalized diploid reference haplotypes 408 from a database of population haplotypes 406 based on comparing the nucleotide reads 402 with the population haplotypes 406 (or a subset thereof) at the genomic coordinate.
[0099] In some embodiments, for example, the personalized sequencing system 106 utilizes an imputation model to determine which of the population haplotypes 406 most closely corresponds to the nucleotide reads 402 at the genomic coordinate (e.g., as described above in relation to FIG. 3 for a population haplotype database). Alternatively (or additionally), the personalized sequencing system 106 can utilize external information (e.g., parental genomic data) for a sample to identify a personalized subset of diploid reference haplotypes from the database of population haplotypes 406.
[0100] As further illustrated in FIG. 4, having selected, received, or otherwise identified the personalized diploid reference haplotypes 408 for the genomic coordinate corresponding to the nucleotide reads 402 of the sample, the personalized sequencing system 106 utilizes the personalized call-generation model 410 to generate the genotype probabilities 412 of candidate genotypes for the sample at the genomic coordinate. In some embodiments, for example, the personalized call-generation model 410 comprises a probabilistic model configured to process one or more prior genotype probabilities 414 to generate (e.g., by imputation or Bayesian inference) one or more posterior genotype probabilities 416 of candidate genotypes at the genomic coordinate based on the nucleotide reads 402.
[0101] Accordingly, as shown in FIG. 4, the personalized sequencing system 106 determines the one or more prior genotype probabilities 414 of the candidate genotypes in relation to the personalized diploid reference haplotypes 408 (e.g., as further described below in relation to FIG. 5). In some such embodiments, accordingly, the one or more prior genotype probabilities 414 account for the personalized diploid reference haplotypes 408 but do not directly account for the nucleobase content and/or alignment of the nucleotide reads 402. As noted above, however, the one or more prior genotype probabilities 414 can be derived from (or otherwise based on) the personalized diploid reference haplotypes 408 that the personalized sequencing system 106 selects based on a comparison of the nucleotide reads 402 and the population haplotypes 406. Based on the one or more prior genotype probabilities 414 and the nucleotide reads 402, the personalized sequencing system 106 generates the one or more posterior genotype probabilities 416 of the candidate genotypes at the genomic coordinate of the sample. Because the one or more prior genotype probabilities 414 can be derived from (or otherwise based on) the personalized diploid reference haplotypes 408, the one or more posterior genotype probabilities 416 account for the personalized diploid reference haplotypes 408. Accordingly, the one or more posterior genotype probabilities 416 account for both the personalized diploid reference haplotypes 408 and the nucleobase content and/or alignment of the nucleotide reads 402. Having generated the genotype probabilities 412 (e.g., the one or more posterior genotype probabilities 416), the personalized sequencing system 106 determines a genotype call 418 for the genomic coordinate of the sample.
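The combination of prior genotype probabilities with read evidence described above follows the general pattern of Bayes' rule, in which the posterior is proportional to the prior multiplied by the likelihood of the reads. The sketch below illustrates that pattern for a biallelic diploid site under simplifying assumptions that are not taken from the disclosure: each read is reduced to an observed allele (0 for reference, 1 for alternate), reads are treated as independent, and a single fixed per-read error rate is used. The prior values are placeholders.

```python
import math
from typing import Dict, List

# Candidate diploid genotypes mapped to their alternate-allele counts.
GENOTYPES: Dict[str, int] = {"0/0": 0, "0/1": 1, "1/1": 2}


def read_log_likelihood(
    observed_alleles: List[int], alt_count: int, error_rate: float = 0.01
) -> float:
    """Log-likelihood of the observed alleles given a genotype's alt-allele count."""
    alt_fraction = alt_count / 2.0
    total = 0.0
    for allele in observed_alleles:
        p_if_alt = 1.0 - error_rate if allele == 1 else error_rate
        p_if_ref = 1.0 - error_rate if allele == 0 else error_rate
        # A read is drawn from either chromosome copy with equal probability.
        total += math.log(alt_fraction * p_if_alt + (1.0 - alt_fraction) * p_if_ref)
    return total


def posterior_genotype_probs(
    priors: Dict[str, float], observed_alleles: List[int], error_rate: float = 0.01
) -> Dict[str, float]:
    """Posterior over genotypes: normalize prior * likelihood (computed in log space)."""
    log_post = {
        g: math.log(priors[g]) + read_log_likelihood(observed_alleles, k, error_rate)
        for g, k in GENOTYPES.items()
    }
    peak = max(log_post.values())  # subtract the maximum for numerical stability
    unnormalized = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(unnormalized.values())
    return {g: v / norm for g, v in unnormalized.items()}


# Placeholder prior favoring homozygous reference; reads split evenly between alleles.
priors = {"0/0": 0.90, "0/1": 0.08, "1/1": 0.02}
reads = [0, 1, 0, 1, 1, 0, 1, 0]
posteriors = posterior_genotype_probs(priors, reads)
genotype_call = max(posteriors, key=posteriors.get)
```

In this toy case the balanced read evidence outweighs a prior that favors the homozygous reference genotype, so the heterozygous genotype has the highest posterior probability. With fewer or noisier reads, the prior would carry more weight, which is where a prior derived from personalized haplotypes would matter most.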
[0102] As mentioned, in some embodiments, the personalized sequencing system 106 determines prior genotype probabilities of candidate genotypes at a genomic coordinate of a sample in relation to a personalized subset of diploid reference haplotypes. To further illustrate, FIG. 5 shows multiple phylogenetic diagrams utilized by the personalized sequencing system 106 in certain embodiments to determine prior genotype probabilities of candidate genotypes in relation to a personalized subset of diploid reference haplotypes. Accordingly, in some embodiments, the personalized sequencing system 106 utilizes a call-generation model to generate posterior genotype probabilities of candidate genotypes based on the prior genotype probabilities determined in relation to a personalized subset of diploid reference haplotypes for the respective genomic coordinate.
[0103] As suggested above, in one or more embodiments, a prior genotype probability includes an estimated genotype probability prior to imputation and/or prior to utilizing a probabilistic model to determine posterior genotype probabilities from aligning and mapping nucleotide reads of a sample to a reference genome and/or personalized diploid reference haplotypes. In one or more embodiments, for example, a prior genotype probability comprises a quantification of an expected rate per locus of a respective genotype on a given allele at a genomic coordinate of a sample. In particular, in one or more embodiments, a prior genotype probability includes an initial genotype probability for a candidate genotype as conditioned upon information available prior to sequencing of a given sample. In some embodiments, for example, the personalized sequencing system 106 determines prior genotype probabilities (e.g., one or more prior genotype probabilities 414) for candidate genotypes at a genomic coordinate of a given sample based on phylogenetic branch lengths between population haplotypes corresponding to the given sample (e.g., those selected as personalized diploid reference haplotypes for the given sample) and the most recent common ancestor (MRCA) of those population haplotypes. To illustrate, FIG. 5 shows three exemplary phylogenetic diagrams — including a first phylogenetic diagram 500a, a second phylogenetic diagram 500b, and a third phylogenetic diagram 500c for determining prior genotype probabilities from a personalized subset of diploid reference haplotypes at a genomic coordinate of a sample, according to one or more embodiments.
[0104] As shown in FIG. 5, for instance, the first phylogenetic diagram 500a represents an implementation wherein a first population haplotype and a second population haplotype corresponding to a genomic coordinate of a sample (e.g., selected as personalized diploid reference haplotypes for the sample at the genomic coordinate) are identical, so that both population haplotypes are represented by R. In particular, the first phylogenetic diagram 500a comprises a 3-node phylogeny with a uniform branch length θ/2 between the population haplotypes R and the sample haplotypes H1 and H2 at the genomic coordinate. Indeed, in some embodiments, the personalized sequencing system 106 determines prior genotype probabilities for candidate genotypes based on a uniform branch length between the first population haplotype and the second population haplotype of a personalized subset of diploid reference haplotypes for a sample at a genomic coordinate.
[0105] As also shown in FIG. 5, the second phylogenetic diagram 500b represents an implementation wherein a first population haplotype R1 and a second population haplotype R2 are distinct from one another (e.g., comprised of a different nucleobase) at the genomic coordinate of the sample. Thus, the second phylogenetic diagram 500b comprises a 4-node phylogeny with a uniform branch length θ/2 between the population haplotypes R1 and R2 and the respective sample haplotypes H1 and H2 at the genomic coordinate.
[0106] Relatedly, as shown in FIG. 5, the third phylogenetic diagram 500c represents an independent analysis of the first and second population haplotypes, R1 and R2, and the respective first and second sample haplotypes, H1 and H2, wherein the first and second population haplotypes are also distinct from one another. Thus, the third phylogenetic diagram 500c comprises two independent 2-node phylogenies with the uniform branch length θ/2 between the population haplotypes R1 and R2 and the respective sample haplotypes H1 and H2 at the genomic coordinate.
[0107] As an alternative to implementation of a uniform branch length based on the examples above, in some embodiments, the personalized sequencing system 106 can utilize an inferred variable branch length based on an inferred phylogeny relating to two known population haplotypes R1 and R2 (e.g., a personalized subset of diploid reference haplotypes). In one or more embodiments, for example, an inferred variable branch length is determined for a particular genomic coordinate or region by prior analysis of a particular sample to determine the phylogeny of the two known population haplotypes R1 and R2 and thereby infer the branch lengths separating the sample haplotypes H1 and H2 associated with the particular genomic coordinate or region from the known population haplotypes R1 and R2. In some embodiments, for instance, the personalized sequencing system 106 can infer variable branch length for a particular genomic coordinate or region of a particular sample and separately for each reference span of a reference genome based on an assumption that each segment is recombination-free and can be represented by a single phylogenetic tree. Accordingly, for a given sample, the personalized sequencing system 106 can generate prior genotype probabilities for candidate variants based on an inferred variable branch length expressly associated with the given sample. In some embodiments in which no recombination event occurs between segments, the personalized sequencing system 106 can co-infer branch length for reference spans of a reference genome.
[0108] Regardless of whether a uniform branch length or an inferred variable branch length is used to model haplotypes along phylogenetic branches, the personalized sequencing system 106 can determine a pair or other subset of personalized reference haplotypes that include two distinct reference haplotypes. To illustrate, in implementations wherein a personalized subset of diploid reference haplotypes for a genomic coordinate of a sample comprises two distinct population haplotypes R1 and R2, for example, the personalized sequencing system 106 determines, in one or more embodiments, prior genotype probabilities for candidate variants in relation to a personalized subset of diploid reference haplotypes according to the following:
$$P(H_1, H_2 \mid R_1, R_2) = P(H_1 \mid R_1)\,P(H_2 \mid R_2)$$
where P(H1, H2 | R1, R2) represents the prior genotype probability of a diploid genotype H1, H2 in relation to a first reference haplotype R1 and a second reference haplotype R2 of a personalized subset of diploid reference haplotypes.
[0109] Based on the above equation, in one or more embodiments, the personalized sequencing system 106 can determine prior probabilities for candidate single nucleotide variants (SNVs or SNPs) according to a function that assigns a prior probability to each of the het, hetalt, and homref genotype cases, where θ represents the branch length as described above (e.g., a uniform branch length or an inferred variable branch length). Relatedly, in one or more embodiments, the personalized sequencing system 106 determines prior genotype probabilities for candidate insertions or deletions (indels) at a genomic coordinate (e.g., a single candidate indel allele at the genomic coordinate) according to a function over the same het, hetalt, and homref genotype cases, where θi represents an estimated indel rate (e.g., estimated on a per-sample basis as a function of repeat period and length) for the sample at the genomic coordinate.
[0110] As indicated above, in implementations wherein the first reference haplotype (e.g., the first population haplotype selected for the sample at a diploid genomic coordinate) differs from the second reference haplotype (e.g., the second population haplotype selected for the sample at a diploid genomic coordinate) at the genomic coordinate, the personalized sequencing system 106 can determine prior genotype probabilities for three distinct diploid genotypes in relation to the first and second reference haplotypes of the personalized subset of diploid reference haplotypes at the genomic coordinate.
[0111] Specifically, the personalized sequencing system 106 can determine a first probability of a heterozygous diploid genotype, indicated as “het” in the equations above, corresponding to (e.g., matching) one of the first reference haplotype or the second reference haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate (e.g., = Rlr H2 =
[0112] In addition to determining a first probability of a heterozygous diploid genotype for a genomic coordinate of a sample, the personalized sequencing system 106 can determine a second probability of a heterozygous alternate diploid genotype, indicated as “hetalt” in the equations above, corresponding to neither the first population haplotype nor the second population haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate (e.g., H1 =
[0113] In addition to determining such a first probability and a second probability for a genomic coordinate of a sample, the personalized sequencing system 106 can determine a third probability of a homozygous reference diploid genotype, indicated as “homref” in the equations above, corresponding to both the first population haplotype and the second population haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate (e.g., H1 = R1, H2 = R2).
[0114] As indicated above, in implementations wherein two reference haplotypes of a personalized subset of diploid reference haplotypes are identical for a diploid genomic coordinate, the personalized sequencing system 106 can determine prior genotype probabilities, in relation to a singular reference R, for candidate single nucleotide variants (SNVs or SNPs) according to a function over the het, hom, hetalt, and homref genotype cases, where P(G | R) represents the prior genotype probability of a genotype H1, H2 in relation to a singular reference haplotype R (e.g., R = R1 = R2), and θ represents the branch length as described above (e.g., a uniform branch length or an inferred variable branch length).
[0115] Moreover, in similar implementations wherein two reference haplotypes of the personalized subset of diploid reference haplotypes are identical at the genomic coordinate, the personalized sequencing system 106 can determine prior genotype probabilities for candidate insertions or deletions (indels) at the genomic coordinate (e.g., a single candidate indel allele at the genomic coordinate) according to a function over the same genotype cases, where P(G | R) represents the prior genotype probability of a genotype H1, H2 in relation to a singular reference haplotype R (e.g., R = R1 = R2), and θi represents an estimated indel rate (e.g., estimated on a per-sample basis as a function of repeat period and length) for the sample at the genomic coordinate. The estimated indel rate can accordingly be considered a biological indel rate.
[0116] As indicated above, in implementations wherein the first reference haplotype (e.g., the first population haplotype selected for the sample at a diploid genomic coordinate) is identical to the second reference haplotype (e.g., the second population haplotype selected for the sample at a diploid genomic coordinate) at the genomic coordinate, the personalized sequencing system 106 can determine prior genotype probabilities for four distinct diploid genotypes in relation to the first and second reference haplotypes of the personalized subset of diploid reference haplotypes at the genomic coordinate.
[0117] Specifically, the personalized sequencing system 106 can determine a first probability of a heterozygous diploid genotype, indicated as “het” in the equations above, with one allele of the sample corresponding to the first and second reference haplotypes of the personalized subset of diploid reference haplotypes at the genomic coordinate (e.g., H1 ≠ R, H2 = R).
[0118] In addition to determining a first probability of a heterozygous diploid genotype for a genomic coordinate of a sample, the personalized sequencing system 106 can determine a second probability of a homozygous diploid genotype, indicated as “hom” in the equations above, with identical sample alleles corresponding to neither the first nor the second reference haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate (e.g., H1 = H2 ≠ R).
[0119] In addition to determining such a first probability and a second probability for a genomic coordinate of a sample, the personalized sequencing system 106 can determine a third genotype probability of a heterozygous alternate diploid genotype, indicated as “hetalt” in the equations above, with differing sample alleles, each corresponding to neither the first nor the second reference haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate (e.g., H1 ≠ R, H2 ≠ R, H1 ≠ H2).
[0120] In addition to determining such a first, second, and third probability for a genomic coordinate of a sample, the personalized sequencing system 106 can determine a fourth probability of a homozygous reference diploid genotype, indicated as “homref” in the equations above, with identical sample alleles corresponding to the first and second reference haplotypes of the personalized diploid reference haplotypes at the genomic coordinate (e.g., H1 = H2 = R).
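By way of illustration only, the four genotype cases described above for a singular reference haplotype R can be sketched as a small classification routine. The function name `classify_genotype` and the single-character allele encoding are assumptions made for this sketch and do not appear in the disclosure.

```python
def classify_genotype(h1: str, h2: str, ref: str) -> str:
    """Label a diploid genotype (h1, h2) relative to a single reference allele.

    Hypothetical helper: the labels follow the het / hom / hetalt / homref
    cases described in the text; the allele encoding is an assumption.
    """
    if h1 == ref and h2 == ref:
        return "homref"   # H1 = H2 = R
    if h1 == ref or h2 == ref:
        return "het"      # exactly one sample allele matches R
    if h1 == h2:
        return "hom"      # H1 = H2, neither matches R
    return "hetalt"       # H1 != H2, neither matches R


if __name__ == "__main__":
    for pair in [("A", "A"), ("C", "A"), ("C", "C"), ("C", "G")]:
        print(pair, classify_genotype(*pair, ref="A"))
```

The order of the checks matters in this sketch: the homozygous reference case is tested first so that the later "one allele matches" test only captures heterozygous genotypes.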
[0121] As indicated above, in some cases, the personalized sequencing system 106 receives (or identifies) multiple personalized subsets of diploid reference haplotypes for a genomic coordinate. When more than two population haplotypes are identified as candidate reference haplotypes at a genomic coordinate or more than one pair of personalized diploid reference haplotypes is identified at a genomic coordinate, the personalized sequencing system 106 can treat or utilize prior genotype probabilities for a combination of different pairs of personalized diploid reference haplotypes as weights in a weighted sum. Such a weighted sum can account for the different pairs of personalized diploid reference haplotypes. In such implementations, the personalized sequencing system 106 can determine prior genotype probabilities for each respective personalized subset of diploid reference haplotypes and combine the resulting probabilities to determine an overall prior genotype probability P(G) of a candidate genotype G according to the following function representing a weighted sum:
$$P(G) = \sum_{R} P(G \mid R)\,P(R)$$
where P(R) represents the individual prior genotype probability of each personalized subset of diploid reference haplotypes R = {R1, R2}. Because each personalized subset of diploid reference haplotypes R has been assigned a prior genotype probability P(R) in the foregoing function, the personalized sequencing system 106 can utilize the foregoing function as a weighted sum to generate the overall prior genotype probability P(G) of a candidate genotype G.
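As a non-limiting sketch of the weighted sum described above, the following routine combines per-pair prior genotype probabilities using the individual pair probabilities as weights. The function name `overall_prior`, the dictionary layout, and all numeric values are hypothetical and chosen only for illustration.

```python
def overall_prior(pair_weights, conditional_priors, genotype):
    """Weighted sum over reference pairs R of P(R) * P(G | R).

    pair_weights:       {pair: P(R)} for each candidate pair of haplotypes
    conditional_priors: {pair: {genotype: P(G | R)}}
    Both structures are assumptions made for this sketch.
    """
    return sum(
        weight * conditional_priors[pair].get(genotype, 0.0)
        for pair, weight in pair_weights.items()
    )


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the disclosure.
    weights = {("A", "A"): 0.75, ("A", "C"): 0.25}
    priors = {
        ("A", "A"): {"A/A": 0.5, "A/C": 0.5},
        ("A", "C"): {"A/A": 0.0, "A/C": 1.0},
    }
    print(overall_prior(weights, priors, "A/C"))
```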
[0122] In some embodiments, for example, the personalized sequencing system 106 determines individual prior genotype probabilities P(R) according to a Hardy-Weinberg equation. Such a Hardy-Weinberg equation takes a product of population allele frequencies of the underlying haplotypes and corrects for duplicates because the order of the allele frequencies does not matter. For example, the personalized sequencing system 106 can determine the individual prior genotype probability P(R) of each personalized subset of diploid reference haplotypes R = {R1, R2} according to the following function:
$$P(R) = \begin{cases} 2\,p(R_1)\,p(R_2) & \text{when } R_1 \neq R_2 \\ p(R_1)^2 & \text{when } R_1 = R_2 \end{cases}$$
where, in one or more embodiments, the respective probability p(Ri) of each population haplotype within a corresponding personalized subset of diploid reference haplotypes is based on a population allele frequency within the respective population. The foregoing equation accordingly represents an application of Hardy-Weinberg in which the personalized sequencing system 106 can either (i) use the individual prior genotype probability P(R) directly for genotyping (e.g., genotype calling at a genomic coordinate of a sample) if personalized diploid reference haplotypes are unavailable or (ii) use the individual prior genotype probability P(R) as an input for personalized haplotyping and personalized genotype calling (e.g., determining a genotype call for a sample at a genomic coordinate based on genotype probabilities derived from a personalized subset of diploid reference haplotypes).
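For illustration, the Hardy-Weinberg computation described above can be sketched as follows. The function name `pair_prior`, the haplotype labels, and the frequency values are assumptions introduced for this example only.

```python
from itertools import combinations_with_replacement


def pair_prior(freq, r1, r2):
    """Hardy-Weinberg probability of an unordered haplotype pair {r1, r2}.

    freq maps each population haplotype to its population frequency
    (hypothetical values in the example below).
    """
    if r1 == r2:
        return freq[r1] ** 2
    # The factor of 2 accounts for the two orderings of a distinct pair.
    return 2.0 * freq[r1] * freq[r2]


if __name__ == "__main__":
    frequencies = {"hapA": 0.5, "hapB": 0.25, "hapC": 0.25}
    total = 0.0
    for r1, r2 in combinations_with_replacement(frequencies, 2):
        p = pair_prior(frequencies, r1, r2)
        total += p
        print(r1, r2, p)
    print("sum over unordered pairs:", total)
```

When the haplotype frequencies sum to one, the pair probabilities over all unordered pairs also sum to one, which is a convenient sanity check on the weights.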
[0123] As discussed above (e.g., in relation to FIG. 4), having determined prior genotype probabilities for candidate genotypes in relation to a personalized subset of diploid reference haplotypes, the personalized sequencing system 106 can generate posterior genotype probabilities based on the prior genotype probabilities using a call-generation model. In some embodiments, for example, the personalized sequencing system 106 determines a genotype likelihood represented by P(Ri | Hj) for each observed nucleobase at a given genomic coordinate, where Ri represents an observed nucleobase and Hj represents a haplotype base. Such a genotype likelihood may be considered a posterior genotype likelihood because this genotype likelihood for an observed base at a given genomic coordinate accounts for the nucleobase content and/or alignment of nucleotide reads at the given genomic coordinate. Having determined such a likelihood, the personalized sequencing system 106 can determine several values for the whole read pileup. The personalized sequencing system 106 further aggregates the values into a likelihood of an observed base given a genotype at the genomic coordinate. For example, the genotype likelihood of an observed nucleobase given a genotype can be expressed as P(Ri | Gk), where Ri represents an observed nucleobase and Gk represents the genotype at the genomic coordinate. Similarly, therefore, such a genotype likelihood for an observed nucleobase given a genotype may be considered a posterior genotype likelihood because the genotype likelihood for the observed nucleobase at a given genomic coordinate accounts for the nucleobase content and/or alignment of nucleotide reads at the given genomic coordinate.
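The aggregation of per-base likelihoods into a genotype likelihood for a read pileup can be sketched as follows. The disclosure does not specify the per-base error model or the aggregation rule, so this sketch assumes a uniform substitution error rate and an equal-weight mixture of the two haplotypes of a diploid genotype; the names `base_likelihood` and `pileup_log_likelihood` are hypothetical.

```python
import math


def base_likelihood(observed: str, hap_base: str, error_rate: float) -> float:
    """P(Ri | Hj) under an assumed uniform substitution error model."""
    return 1.0 - error_rate if observed == hap_base else error_rate / 3.0


def pileup_log_likelihood(pileup, genotype, error_rate: float = 0.01) -> float:
    """log P(D | Gk) for a pileup of observed bases and a diploid genotype.

    Assumes each read is drawn from either haplotype with probability 0.5
    and that observed bases are independent (both are assumptions).
    """
    h1, h2 = genotype
    total = 0.0
    for base in pileup:
        per_base = 0.5 * base_likelihood(base, h1, error_rate) \
                 + 0.5 * base_likelihood(base, h2, error_rate)
        total += math.log(per_base)  # sum logs to avoid numeric underflow
    return total


if __name__ == "__main__":
    pileup = "AAACCC"
    for genotype in [("A", "A"), ("A", "C"), ("C", "C")]:
        print(genotype, pileup_log_likelihood(pileup, genotype))
```

Working in log space is a design choice for this sketch: a deep pileup multiplies many small probabilities, which would otherwise underflow.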
[0124] Having determined likelihoods for individual genotypes or having determined personalized prior genotype probabilities as described above, the personalized sequencing system 106 can input one or both of the genotype likelihoods and the personalized prior genotype probabilities into a call-generation model (e.g., variant caller). For example, the personalized sequencing system 106 can input, into a call-generation model, the genotype likelihoods represented by P(Ri | Hj) or P(Ri | Gk). In some implementations, the personalized sequencing system 106 collapses the likelihoods into an aggregate likelihood of the observed read pileup, represented as P(D | Gk). The personalized sequencing system 106 can further utilize the call-generation model to invert the aggregate likelihood of the observed read pileup to generate a posterior genotype probability P(Gk | D), where D represents the data or the observed read pileup. In addition or in the alternative to inputting such genotype likelihoods, the personalized sequencing system 106 can input, into a call-generation model, any of the personalized prior genotype probabilities described above, including a prior genotype probability P(G) of a candidate genotype G based on an individual prior genotype probability of each personalized subset of diploid reference haplotypes R = {R1, R2}. Given an observed read pileup D and personalized prior genotype probabilities P(G) for each or a subset of candidate genotypes G, a call-generation model (e.g., DRAGEN VC) can determine a posterior genotype probability P(G | D) using, for example, a Bayesian probability model. As also mentioned, in some embodiments, the personalized sequencing system 106 determines a genotype call for the genomic coordinate based on the posterior genotype probabilities generated by the call-generation model.
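A minimal sketch of the Bayesian inversion described above follows, in which prior genotype probabilities P(G) and pileup likelihoods P(D | G) are combined and normalized into posterior probabilities P(G | D). The function names and the numeric values are hypothetical, and selecting the highest-posterior genotype as the call is an assumption; the disclosure does not limit how the call is derived from the posteriors.

```python
def posterior_genotype_probabilities(priors, likelihoods):
    """Bayes rule: P(G | D) = P(D | G) P(G) / sum over G' of P(D | G') P(G')."""
    unnormalized = {g: priors[g] * likelihoods[g] for g in priors}
    evidence = sum(unnormalized.values())
    if evidence == 0.0:
        raise ValueError("all candidate genotypes have zero probability")
    return {g: value / evidence for g, value in unnormalized.items()}


def call_genotype(posteriors):
    """Pick the candidate genotype with the highest posterior (assumed rule)."""
    return max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the disclosure.
    priors = {"0/0": 0.5, "0/1": 0.25, "1/1": 0.25}
    likelihoods = {"0/0": 0.1, "0/1": 0.8, "1/1": 0.1}
    posteriors = posterior_genotype_probabilities(priors, likelihoods)
    print(posteriors, call_genotype(posteriors))
```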
[0125] While in some embodiments, such as described in relation to FIG. 5, the personalized sequencing system 106 determines prior genotype probabilities in relation to a personalized set of diploid reference haplotypes, the personalized sequencing system 106, in one or more embodiments, determines genotype calls with respect to a linear reference genome (e.g., for recording in a genotype-call data file, such as a VCF).
[0126] As mentioned previously, in some embodiments, the personalized sequencing system 106 utilizes a personalized call-recalibration machine-learning model to determine, recalibrate, modify, and/or confirm genotype calls for one or more genomic coordinates of a sample based on personalized sequencing metrics determined in relation to a personalized subset of diploid reference haplotypes. For example, FIG. 6 illustrates the personalized sequencing system 106 utilizing a personalized call-recalibration machine-learning model 608 to generate one or more genotype calls 612 based on personalized sequencing metrics 606 determined in relation to personalized diploid reference haplotypes 604.
[0127] As shown in FIG. 6, for instance, the personalized sequencing system 106 identifies (or receives) a set of nucleotide reads 602 corresponding to one or more genomic coordinates of a sample. In some embodiments, for example, the personalized sequencing system 106 identifies and/or selects a genomic coordinate based on identifying a variant call generated with respect to a linear reference genome at the genomic coordinate by a call-generation model (e.g., the call-generation model 224, the personalized call-generation model 210, or the personalized call-generation model 410) based on the set of nucleotide reads 602. Also, in one or more embodiments, the personalized sequencing system 106 identifies and/or selects the genomic coordinate based on determining that the variant call generated by the call-generation model matches a variant determined by an imputation model within the personalized subset of diploid reference haplotypes at the genomic coordinate. To identify such candidate genomic coordinates with (i) variant calls generated by the call-generation model that match (ii) a variant determined by an imputation model within the personalized subset of diploid reference haplotypes at the genomic coordinate, the personalized sequencing system 106 can process sequencing metrics including one or more of a position metric (POS) indicating a genomic coordinate within a reference genome, a reference allele metric (REF) indicating a nucleobase of the reference genome at the genomic coordinate, or an alternate allele metric (ALT) indicating an alternate nucleobase to the reference nucleobase at the genomic position. Alternatively or additionally, in some embodiments, the personalized sequencing system 106 identifies and/or selects the genomic coordinate from a predetermined set of genomic coordinates corresponding to variants with respect to a linear reference genome.
[0128] As further illustrated in FIG. 6, the personalized sequencing system 106 determines, for the sample at the one or more genomic coordinates of the set of nucleotide reads 602, the personalized sequencing metrics 606 in relation to the personalized diploid reference haplotypes 604. In some embodiments, for example, the personalized sequencing metrics 606 for each genomic coordinate of the one or more genomic coordinates include one or more of a personalized genotype metric (GT) indicating a diploid genotype at the genomic coordinate, a personalized genotype-probability metric (GP) indicating a probability of the diploid genotype occurring at the genomic coordinate, a personalized genotype-quality metric (GQ) indicating a probability that the diploid genotype at the genomic coordinate is correct or incorrect, or a personalized variant-call-quality metric (QUAL) indicating a quality score for a variant call generated by the call-generation model utilizing the personalized diploid reference haplotypes 604 for the genomic coordinate.
[0129] As indicated above, the personalized sequencing system 106 inputs a single genotype-probability metric or multiple genotype-probability metrics (GP) into the personalized call-recalibration machine-learning model 608. In some embodiments, for instance, the personalized sequencing metrics 606 include multiple genotype-probability metrics (GP) indicating respective probabilities of various candidate diploid genotypes occurring at a given genomic coordinate. Such multiple genotype-probability metrics may include, for example, (i) a probability that a genotype call is a homozygous reference genotype (e.g., 0/0); (ii) a probability that a genotype call is a heterozygous genotype (e.g., 0/1); and/or (iii) a probability that a genotype call is a homozygous alternate genotype (e.g., 1/1). As described above, however, such multiple genotype-probability metrics may take the form of different probabilities to, for example, account for multi-allelic genomic coordinates, including a probability for each of the genotypes encoded as 0/0, 0/1, 0/2, 1/1, 1/2, and 2/2, where 0 represents a reference nucleobase, 1 represents a first alternate/variant nucleobase, and 2 represents a second alternate/variant nucleobase.
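As an illustrative sketch, the candidate diploid genotype encodings listed above can be enumerated for any number of alleles at a genomic coordinate. The function name `diploid_genotypes` is hypothetical, and the ordering simply follows the listing in the preceding paragraph; a particular genotype-call file format may order the same genotypes differently.

```python
from itertools import combinations_with_replacement


def diploid_genotypes(num_alleles: int):
    """Enumerate unordered diploid genotypes over allele indices 0..num_alleles-1.

    Index 0 denotes the reference allele and higher indices denote
    alternate alleles, following the encoding described in the text.
    """
    return [
        f"{a}/{b}"
        for a, b in combinations_with_replacement(range(num_alleles), 2)
    ]


if __name__ == "__main__":
    print(diploid_genotypes(2))  # biallelic coordinate
    print(diploid_genotypes(3))  # coordinate with two alternate alleles
```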
[0130] As also shown in FIG. 6, the personalized sequencing system 106 utilizes the personalized call-recalibration machine-learning model 608 to generate, based on the personalized sequencing metrics 606, personalized genotype call classifications 610 indicating an accuracy of identifying a respective genotype at the one or more genomic coordinates corresponding to the set of nucleotide reads 602. Based on the personalized genotype call classifications 610, the personalized sequencing system 106 determines one or more genotype calls 612 for the respective one or more genomic coordinates. In some embodiments, for example, the personalized sequencing system 106 determines the genotype calls 612 by confirming, recalibrating, or modifying a genotype call generated by a call-generation model (e.g., as further described below in relation to FIG. 8A).
[0131] To illustrate, as shown in FIG. 6, the personalized sequencing system 106 utilizes the personalized call-recalibration machine-learning model 608 to generate the personalized genotype call classifications 610 for a candidate genomic coordinate (e.g., “Chr5:4”), including: (i) a first genotype call classification indicating an accuracy of identifying a homozygous reference genotype (e.g., “L(0/0) @ Chr5:4”) with respect to a reference genome at the candidate genomic coordinate, (ii) a second genotype call classification indicating an accuracy of identifying a heterozygous variant genotype (e.g., “L(0/1) @ Chr5:4”) with respect to the reference genome at the candidate genomic coordinate, and (iii) a third genotype call classification indicating an accuracy of identifying a homozygous variant genotype (e.g., “L(1/1) @ Chr5:4”) with respect to the reference genome at the candidate genomic coordinate. Specifically, in the example illustrated, the first genotype call classification indicates a likelihood of 0.10 that the genotype call is a homozygous reference genotype, the second genotype call classification indicates a likelihood of 0.76 that the genotype call is a heterozygous variant genotype, and the third genotype call classification indicates a likelihood of 0.14 that the genotype call is a homozygous variant genotype.
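One possible way to derive a genotype call from such classifications is sketched below, using the example values of 0.10, 0.76, and 0.14 from FIG. 6. The function name `recalibrate_call`, the rule of selecting the highest-scoring classification, and the "confirmed"/"modified" labels are assumptions made for this sketch rather than the disclosed decision logic.

```python
def recalibrate_call(initial_call: str, classifications: dict):
    """Compare an initial genotype call with model classifications.

    Returns the highest-scoring genotype and whether the initial call
    was confirmed or modified (assumed decision rule).
    """
    best = max(classifications, key=classifications.get)
    action = "confirmed" if best == initial_call else "modified"
    return best, action


if __name__ == "__main__":
    # Classification values from the FIG. 6 example at Chr5:4.
    classifications = {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}
    print(recalibrate_call("0/1", classifications))
    print(recalibrate_call("1/1", classifications))
```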
[0132] As indicated above, in some embodiments, the personalized sequencing system 106 trains or tunes a personalized call-recalibration machine-learning model (e.g., the personalized call-recalibration machine-learning model 608 or one or more sub-models of the call-recalibration machine-learning model 812). In particular, the personalized sequencing system 106 utilizes an iterative training process to fit a call-recalibration machine-learning model by adjusting or adding decision trees or learning parameters that result in accurate genotype call classifications (e.g., the personalized genotype call classifications 610). To illustrate, FIG. 7 shows the personalized sequencing system 106 training a personalized call-recalibration machine-learning model 714 in accordance with one or more embodiments.
[0133] As illustrated in FIG. 7, the personalized sequencing system 106 accesses sample sequence data 702 and corresponding personalized diploid reference haplotypes 708. In some embodiments, the personalized sequencing system 106 selects, as part of a haplotype personalization process 706, the personalized diploid reference haplotypes 708 from a database of population haplotypes 712 based on the sample sequence data 702. Furthermore, in one or more embodiments, the personalized sequencing system 106 reduces overfitting of the sample sequence data 702 to known haplotypes within the database of population haplotypes 712 by omitting the respective known haplotypes from the haplotype personalization process 706. Accordingly, in certain embodiments, the personalized diploid reference haplotypes 708 selected for the sample sequence data 702 comprise population haplotypes other than the population haplotypes known to correspond to the sample sequence data 702.
[0134] As also shown in FIG. 7, the personalized sequencing system 106, at an act 710, determines or extracts sequencing metrics 711a and/or personalized sequencing metrics 711b for the sample sequence data 702. For instance, the personalized sequencing system 106 can extract the sequencing metrics 711a in the form of sample read-based metrics, sample externally sourced sequencing metrics, and/or sample call-model-generated sequencing metrics for the sample sequence data 702. In some embodiments, for example, the sample sequence data 702 includes various ground truth metrics that result from the sequencing metrics 711a. Furthermore, the personalized sequencing system 106 can extract the personalized sequencing metrics 711b for the sample sequence data 702 based on the personalized diploid reference haplotypes 708 (e.g., as selected for the sample sequence data 702 during the haplotype personalization process 706).
[0135] In addition to the sample sequence data 702, the personalized sequencing system 106 also accesses or extracts ground truth genotype data 704 corresponding to the sample sequence data 702. For example, in some embodiments, the personalized sequencing system 106 utilizes ground truth data from a training dataset from the Food and Drug Administration, called the PrecisionFDA dataset, for ground truth data comprising well-characterized samples with ground truth genotype calls (e.g., variant calls). Accordingly, the personalized sequencing system 106 can utilize the ground truth genotype data 704 to train or tune the personalized call-recalibration machine-learning model 714 to generate personalized genotype call classifications based on the extracted personalized sequencing metrics 711b for the sample sequence data 702.
[0136] As further illustrated in FIG. 7, the personalized sequencing system 106 generates personalized genotype call classifications 716 based on the sequencing metrics 711a and the personalized sequencing metrics 711b. Specifically, the personalized sequencing system 106 utilizes the personalized call-recalibration machine-learning model 714 to generate the personalized genotype call classifications 716. Indeed, in some embodiments, the personalized call-recalibration machine-learning model 714 generates a set of three personalized genotype call classifications, as described above (e.g., indicating an accuracy of identifying a homozygous reference call, a homozygous variant call, or a heterozygous call at a given genomic coordinate). This disclosure depicts merely one example of such personalized genotype call classifications in FIG. 6, including (i) a first genotype call classification indicating an accuracy of identifying a homozygous reference genotype (e.g., “L(0/0) @ Chr5:4”) with respect to a reference genome at the candidate genomic coordinate, (ii) a second genotype call classification indicating an accuracy of identifying a heterozygous variant genotype (e.g., “L(0/1) @ Chr5:4”) with respect to the reference genome at the candidate genomic coordinate, and (iii) a third genotype call classification indicating an accuracy of identifying a homozygous variant genotype (e.g., “L(1/1) @ Chr5:4”) with respect to the reference genome at the candidate genomic coordinate. As described above, however, the personalized genotype call classifications 716 can take the form of any of the genotype call classifications described herein.
[0137] Having determined the personalized genotype call classifications 716, the personalized sequencing system 106 performs a comparison 720 between (i) the personalized genotype call classifications 716 and/or corresponding data fields output by the personalized call-recalibration machine-learning model 714 (e.g., as further described below in relation to FIG. 8A) and (ii) genotype calls and/or corresponding data fields indicated within the ground truth genotype data 704 (e.g., to determine an error or a measure of loss between them). For example, for a given genomic coordinate, the personalized sequencing system 106 can utilize a loss function 722 (e.g., log loss, mean squared error) to determine a difference between values (e.g., probabilities) from the personalized genotype call classifications 716 and/or the corresponding data fields, on the one hand, and a probability distribution (genotype-probability distribution) in the ground truth genotype data 704, on the other hand.
[0138] For instance, in cases where the personalized call-recalibration machine-learning model 714 comprises an ensemble of gradient boosted trees, the personalized sequencing system 106 utilizes a mean squared error loss function (e.g., for regression) and/or a logarithmic loss function (e.g., for classification) as the loss function 722 based on the comparison 720. By contrast, in embodiments where the personalized call-recalibration machine-learning model 714 is a neural network, the personalized sequencing system 106 can utilize a cross-entropy loss function, an L1 loss function, or a mean squared error loss function as the loss function 722.
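As a minimal sketch of the comparison 720, the following Python computes a logarithmic loss and a mean squared error between one set of predicted genotype call classifications and a one-hot ground truth genotype-probability distribution. The function names, the genotype labels, and the numeric values are illustrative assumptions and are not taken from the disclosure.

```python
import math

GENOTYPES = ("0/0", "0/1", "1/1")

def log_loss(predicted, truth, eps=1e-12):
    """Cross-entropy of predicted probabilities against a truth distribution."""
    return -sum(truth[g] * math.log(max(predicted[g], eps)) for g in GENOTYPES)

def mean_squared_error(predicted, truth):
    """Average squared difference across the candidate genotypes."""
    return sum((predicted[g] - truth[g]) ** 2 for g in GENOTYPES) / len(GENOTYPES)

# Hypothetical model output and a one-hot ground truth (true genotype is 0/1).
predicted = {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}
truth = {"0/0": 0.0, "0/1": 1.0, "1/1": 0.0}

print(log_loss(predicted, truth))
print(mean_squared_error(predicted, truth))
```

With a one-hot truth distribution, the logarithmic loss reduces to the negative log of the probability assigned to the true genotype, so it penalizes confident wrong predictions more sharply than the mean squared error does.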
[0139] As further illustrated in FIG. 7, the personalized sequencing system 106 performs a model fitting 724. In particular, the personalized sequencing system 106 fits the personalized call-recalibration machine-learning model 714 based on the comparison 720. For instance, the personalized sequencing system 106 performs modifications or adjustments to the personalized call-recalibration machine-learning model 714 to reduce a measure of loss indicated by the loss function 722 for a subsequent training iteration.
[0140] For gradient boosted trees or treelite, for example, the personalized sequencing system 106 trains the personalized call-recalibration machine-learning model 714 on the gradients of the errors determined by the loss function 722. For instance, the personalized sequencing system 106 solves a convex optimization problem (e.g., of infinite dimensions) while regularizing the objective to avoid overfitting. In certain implementations, the personalized sequencing system 106 scales the gradients to emphasize corrections to under-represented classes (e.g., where there are significantly more true positive variant calls than false positive variant calls). In some embodiments, the personalized sequencing system 106 adds a new weak learner (e.g., a boosted tree) to the personalized call-recalibration machine-learning model 714 for each successive training iteration as part of solving the aforementioned optimization problem. For example, the personalized sequencing system 106 finds a feature (e.g., a sequencing metric) that minimizes a loss from the loss function 722 and either adds the feature to the current iteration’s tree or initiates a new tree with the added feature. In addition or in the alternative to gradient boosted decision trees, in some embodiments, the personalized sequencing system 106 trains a logistic regression to learn parameters for generating one or more genotype call classifications. To avoid overfitting, the personalized sequencing system 106 further regularizes based on hyperparameters such as the learning rate, stochastic gradient boosting, the number of trees, the tree-depth(s), complexity penalization, and L1/L2 regularization.
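The boosting procedure described above can be illustrated with a deliberately small sketch: each iteration fits a one-split tree (a stump) to class-weighted gradients of a logistic loss and adds it to the ensemble, with a learning rate acting as shrinkage. The two features (mean alternate-allele MAPQ and read depth), the class weights, the toy data, and all function names are assumptions made for illustration; a production system would more likely rely on an established gradient boosting library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(rows, targets):
    """Pick the (feature, threshold) split that best fits targets by least squares."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [g for row, g in zip(rows, targets) if row[j] <= t]
            right = [g for row, g in zip(rows, targets) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((g - lv) ** 2 for g in left) + sum((g - rv) ** 2 for g in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]

def margin(stumps, row, learning_rate):
    """Sum of shrunken stump outputs for one row of sequencing metrics."""
    return sum(learning_rate * (lv if row[j] <= t else rv) for j, t, lv, rv in stumps)

def train(rows, labels, class_weights, n_trees=20, learning_rate=0.3):
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of the weighted log loss with respect to the margin.
        gradients = [
            class_weights[y] * (y - sigmoid(margin(stumps, row, learning_rate)))
            for row, y in zip(rows, labels)
        ]
        stumps.append(fit_stump(rows, gradients))  # one new weak learner per iteration
    return stumps

# Toy rows: (mean alt MAPQ, depth); label 1 = true variant, 0 = false positive.
rows = [(60, 30), (58, 28), (55, 35), (59, 31), (12, 90), (20, 85)]
labels = [1, 1, 1, 1, 0, 0]
class_weights = {1: 1.0, 0: 2.0}  # up-weight the under-represented class
stumps = train(rows, labels, class_weights)
probabilities = [sigmoid(margin(stumps, row, 0.3)) for row in rows]
```

The learning rate and the number of trees shown here correspond to two of the regularizing hyperparameters named in the paragraph above; tree depth is fixed at one split only to keep the sketch short.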
[0141] In some embodiments wherein the personalized call-recalibration machine-learning model 714 comprises a neural network, the personalized sequencing system 106 performs the model fitting 724 by modifying internal parameters (e.g., weights) of the personalized call-recalibration machine-learning model 714 to reduce a measure of loss indicated by the loss function 722. In such embodiments, the personalized sequencing system 106 modifies how the personalized call-recalibration machine-learning model 714 analyzes and passes data between layers and neurons by modifying the internal network parameters. Accordingly, over multiple iterations of training, the personalized sequencing system 106 improves the accuracy of the genotype call classifications and/or genotype calls generated by the personalized call-recalibration machine-learning model 714.
[0142] Indeed, in some cases, the personalized sequencing system 106 repeats the training process illustrated in FIG. 7 for multiple iterations. For example, the personalized sequencing system 106 can repeat the iterative training by selecting a new set of sequencing metrics for each genotype call along with a corresponding ground truth genotype call in corresponding ground truth data. The personalized sequencing system 106 further generates a new set of predicted genotype call classifications for each iteration. As described above, the personalized sequencing system 106 also compares genotype calls and/or corresponding data fields at each iteration with the corresponding genotype calls and/or data fields from the corresponding ground truth data and performs model fitting 724. The personalized sequencing system 106 repeats this process until the personalized call-recalibration machine-learning model 714 generates predicted genotype call classifications that result in genotype calls that satisfy a threshold measure of loss.
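The iterate-until-threshold behavior can be sketched with a single-feature logistic regression trained by gradient descent, stopping once the mean logarithmic loss satisfies a threshold. The feature values, the threshold of 0.1, the learning rate, and the L2 penalty are hypothetical choices for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, b, xs, ys, eps=1e-12):
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def fit_until_threshold(xs, ys, loss_threshold=0.1, learning_rate=0.5,
                        l2=0.001, max_iterations=1000):
    """Repeat predict / compare / fit until the loss threshold is satisfied."""
    w, b = 0.0, 0.0
    loss, iteration = float("inf"), 0
    for iteration in range(1, max_iterations + 1):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs) + l2 * w
        grad_b = sum(errors) / len(xs)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        loss = mean_log_loss(w, b, xs, ys)
        if loss <= loss_threshold:
            break
    return w, b, loss, iteration

# Hypothetical standardized metric; label 1 = call agrees with ground truth.
xs = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
ys = [1, 1, 1, 0, 0, 0]
w, b, loss, iterations = fit_until_threshold(xs, ys)
```

The `max_iterations` cap is a safeguard added for the sketch so that the loop terminates even if the threshold is never reached.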
[0143] FIG. 8A illustrates the personalized sequencing system generating genotype calls utilizing personalized sequencing metrics determined in relation to personalized diploid reference haplotypes in accordance with one or more embodiments of the present disclosure.
[0144] In accordance with one or more embodiments, FIGS. 8A-8B illustrate the personalized sequencing system 106 implementing personalized subsets of diploid reference haplotypes to generate genotype call classifications and determine output genotype calls. In certain described embodiments, the personalized sequencing system 106 utilizes a personalized call-recalibration machine-learning model together with a personalized call-generation model to generate genotype calls in genomic regions having respectively selected personalized subsets of diploid reference haplotypes. To illustrate, FIGS. 8A-8B illustrate the personalized sequencing system 106 utilizing a call-recalibration machine-learning model 812 to modify data fields corresponding to a genotype-call data file 818 representing one or more genotype calls.
[0145] As illustrated in FIG. 8A, the personalized sequencing system 106 accesses a sequencing information database 802, a linear reference genome 803 (in some cases stored within the sequencing information database 802), sequence data 804 extrapolated from one or more nucleotide reads of a target sample, and at least one personalized subset of diploid reference haplotypes (e.g., personalized diploid reference haplotypes 805) for the target sample. Indeed, the personalized sequencing system 106 performs sequencing-metric extraction 810 to extract or determine sequencing metrics 811a and personalized sequencing metrics 811b as described above (e.g., in relation to FIGS. 6 and 7). For example, the personalized sequencing system 106 determines the sequencing metrics 811a (such as read-based sequencing metrics, externally sourced sequencing metrics, and/or call model generated sequencing metrics) from the sequencing information database 802 and/or by utilizing either or both of mapping-and-alignment components 806 and genotype-caller components 808 of the personalized call-generation model 820.
Such read-based sequencing metrics (e.g., alignment metrics, mapping quality metrics, read-depth metrics, read quality scores) include sequencing metrics derived from nucleotide reads of a sample; externally sourced sequencing metrics (e.g., a mappability metric indicating an ease or difficulty of mapping a particular nucleotide sequence, a guanine-cytosine-content metric indicating a count (or a dropout or a mean) of guanine-cytosine content in a reference nucleotide sequence (e.g., reference genome)) include sequencing metrics identified or obtained from one or more external databases; and call model generated sequencing metrics (e.g., base-call-quality metrics, average read depths, call model generated-base-quality-dropoff metric, hidden Markov model (HMM) statistics, a number of unique reads) include sequencing metrics that are internal, model-specific sequencing metrics generated or extracted by a call generation model. Furthermore, the personalized sequencing system 106 can determine the personalized sequencing metrics 811b in relation to the personalized diploid reference haplotypes from the mapping-and-alignment components 806 and/or the genotype-caller components 808 of the personalized call-generation model 820.
[0146] As further illustrated in FIG. 8A, the personalized sequencing system 106 generates personalized genotype call classifications 814. More specifically, the personalized sequencing system 106 utilizes the call-recalibration machine-learning model 812 to generate the personalized genotype call classifications 814 based on the sequencing metrics 811a and/or the personalized sequencing metrics 811b extracted via the sequencing-metric extraction 810. For example, the call-recalibration machine-learning model 812 generates personalized genotype call classifications indicating an accuracy of identifying one or more genotypes at one or more respective genomic coordinates.
[0147] While not shown in FIG. 8A, in some embodiments, the call-recalibration machine-learning model 812 comprises (i) a personalized call-recalibration machine-learning model 812a (see FIG. 8B) configured and trained to process personalized sequencing metrics (e.g., the personalized sequencing metrics 811b) for generating personalized genotype call classifications (e.g., as described above in relation to FIG. 7) and (ii) a call-recalibration machine-learning model 812b (see FIG. 8B) configured and trained to generate genotype call classifications based on sequencing metrics (e.g., the sequencing metrics 811a) without consideration of personalized sequencing metrics. In some embodiments, the personalized call-recalibration machine-learning model 812a is configured and trained to process both the sequencing metrics 811a and the personalized sequencing metrics 811b to generate personalized genotype call classifications.
[0148] As mentioned, in some embodiments, the personalized sequencing system 106 generates, based on personalized sequencing metrics, genotype call classifications indicating an accuracy of identifying a genotype at a genomic coordinate of a sample. For example, as shown in FIG. 8A, the personalized sequencing system 106 utilizes the call-recalibration machine-learning model 812 to generate the personalized genotype call classifications 814 for a candidate genomic coordinate (e.g., “Chr5:4”), including: (i) a first genotype call classification indicating an accuracy of identifying a homozygous reference genotype (e.g., “L(0/0) @ Chr5:4”) with respect to the linear reference genome 803 at the candidate genomic coordinate, (ii) a second genotype call classification indicating an accuracy of identifying a heterozygous variant genotype (e.g., “L(0/1) @ Chr5:4”) with respect to the linear reference genome 803 at the candidate genomic coordinate, and (iii) a third genotype call classification indicating an accuracy of identifying a homozygous variant genotype (e.g., “L(1/1) @ Chr5:4”) with respect to the linear reference genome 803 at the candidate genomic coordinate. Specifically, in the example illustrated, the first genotype call classification indicates a likelihood of 0.10 that the genotype call is a homozygous reference genotype, the second genotype call classification indicates a likelihood of 0.76 that the genotype call is a heterozygous variant genotype, and the third genotype call classification indicates a likelihood of 0.14 that the genotype call is a homozygous variant genotype.
[0149] As indicated above, in some embodiments, the personalized sequencing system 106 generates a genotype call for a haploid genomic coordinate. In some such embodiments, the personalized sequencing system 106 determines personalized sequencing metrics in relation to at least one personalized reference haplotype selected for the haploid genomic coordinate and generates one or more personalized haploid genotype call classifications. For example, the personalized sequencing system 106 utilizes the call-recalibration machine-learning model 812 to generate the personalized genotype call classifications 814 for a haploid genomic coordinate as follows: (i) a first genotype call classification indicating an accuracy of a first genotype at the haploid genomic coordinate and (ii) a second genotype call classification indicating an accuracy of a second genotype at the haploid genomic coordinate. Accordingly, in some embodiments, the first genotype probability can be a probability that a genotype at a genomic coordinate is a haploid reference genotype, and the second genotype probability can be a probability that a genotype at the genomic coordinate is a haploid alternate genotype.
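One way to hold diploid and haploid genotype call classifications in a common structure is sketched below. The class name, the field names, and the haploid coordinate and values are hypothetical; only the diploid values for “Chr5:4” follow the example given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenotypeCallClassifications:
    """Classification values for the candidate genotypes at one coordinate."""
    coordinate: str
    likelihoods: dict  # genotype label -> classification value

    def best(self):
        """Return the genotype label with the highest classification value."""
        return max(self.likelihoods, key=self.likelihoods.get)

# Diploid coordinate: three candidate genotypes (values from the example above).
diploid = GenotypeCallClassifications(
    "Chr5:4", {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}
)

# Haploid coordinate: reference versus alternate (hypothetical values).
haploid = GenotypeCallClassifications("ChrX:100", {"0": 0.93, "1": 0.07})

print(diploid.best(), haploid.best())
```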
[0150] As mentioned, in some cases, the call-recalibration machine-learning model 812 comprises an ensemble of gradient boosted trees that processes the sequencing metrics 811a and/or the personalized sequencing metrics 811b to generate the personalized genotype call classifications 814. For instance, the call-recalibration machine-learning model 812 can include a series of weak learners such as non-linear decision trees that are trained in a logistic regression to generate the personalized genotype call classifications 814. In some cases, the call-recalibration machine-learning model 812 includes metrics within various trees that define how the call-recalibration machine-learning model 812 processes the sequencing metrics 811a and/or the personalized sequencing metrics 811b to generate the personalized genotype call classifications 814. Additional detail regarding the training of the call-recalibration machine-learning model 812 (in particular, the personalized sub-model thereof) is provided above with reference to FIG. 7.
[0151] In certain embodiments, the call-recalibration machine-learning model 812 is a different type of machine-learning model, such as a neural network, a support vector machine, or a random forest. For example, in cases where the call-recalibration machine-learning model 812 is a neural network, the call-recalibration machine-learning model 812 includes one or more layers each with neurons that make up the layer for processing the sequencing metrics 811a and/or the personalized sequencing metrics 811b. In some cases, the call-recalibration machine-learning model 812 generates the personalized genotype call classifications 814 by extracting latent vectors from the sequencing metrics 811a and/or the personalized sequencing metrics 811b, passing the latent vectors from layer to layer (or neuron to neuron) to manipulate the vectors until utilizing an output layer (e.g., one or more fully connected layers) to generate the personalized genotype call classifications 814.
[0152] As an example of generating the personalized genotype call classifications 814, in some embodiments, the personalized sequencing system 106 utilizes statistics to summarize a mapping quality distribution of reference supporting reads and alternative supporting reads (e.g., for a comparative-mapping-quality-distribution metric). The personalized sequencing system 106 can determine and utilize the mean of the MAPQ for reads supporting an alternative allele from SBS reads and from assembled nucleotide reads. In these or other embodiments, the call-recalibration machine-learning model 812 learns from the data that, when the MAPQ of an alternative allele (indicated by SBS reads or assembled nucleotide reads) is low and a depth metric is high relative to other MAPQ and depth metrics in distributions, a resultant genotype call is more likely to be a false positive. Indeed, as the probability of a false positive increases, the MAPQ metrics would likely decrease.
[0153] As a further example, in some cases, the personalized sequencing system 106 compares a mapping quality (e.g., MAPQ) associated with an SBS read and/or an assembled nucleotide read with a mapping-quality threshold. For instance, the personalized sequencing system 106 utilizes a mapping-quality threshold such as a threshold difference between best and second-best alignment scores. Upon determining that one or more of the mapping qualities for the different read types does not satisfy the threshold, the personalized sequencing system 106 adjusts one or more of the personalized genotype call classifications 814 accordingly (e.g., to select a read with a higher MAPQ).
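A minimal sketch of summarizing the mapping quality distribution follows. It computes read depth and the mean MAPQ of reference supporting and alternative supporting reads, then selects reads that satisfy a threshold. The tuple layout, the function names, and the threshold of 20 are assumptions; in particular, the sketch thresholds MAPQ directly, which simplifies the alignment-score-difference threshold mentioned above.

```python
from statistics import mean

def mapq_summary(reads):
    """Summarize (allele, mapq) pairs into depth and per-allele mean MAPQ."""
    ref = [q for allele, q in reads if allele == "ref"]
    alt = [q for allele, q in reads if allele == "alt"]
    return {
        "depth": len(reads),
        "mean_ref_mapq": mean(ref) if ref else None,
        "mean_alt_mapq": mean(alt) if alt else None,
    }

def reads_passing_mapq(reads, threshold):
    """Keep only reads whose MAPQ satisfies the threshold."""
    return [(allele, q) for allele, q in reads if q >= threshold]

# Hypothetical pileup at one coordinate: alt-supporting reads map poorly.
reads = [("ref", 60), ("ref", 58), ("alt", 12), ("alt", 8), ("alt", 60)]
summary = mapq_summary(reads)
confident_reads = reads_passing_mapq(reads, threshold=20)
print(summary, confident_reads)
```

Values such as `mean_alt_mapq` and `depth` are the kind of sequencing metrics a recalibration model could take as input features, which lets the model, rather than a fixed rule, learn how low MAPQ and high depth relate to false positives.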
[0154] In addition (or in the alternative), the personalized sequencing system 106 can determine the personalized genotype call classifications 814 by utilizing an accumulation of statistical analyses over complex functions (depending on the architecture of the call-recalibration machine-learning model 812) to determine how to best fit the data. For example, as described above, the personalized sequencing system 106 trains the call-recalibration machine-learning model 812 to minimize a loss generated from a number of (different types of) sequencing metrics to determine weights and biases that best fit the data (e.g., that result in a reduced or minimized loss).
[0155] As further illustrated in FIG. 8A, in addition to generating the personalized genotype call classifications 814, the personalized sequencing system 106 performs data field generation 816. More specifically, the personalized sequencing system 106 generates or modifies data fields for a genotype-call data file 818. To generate (or modify) the genotype-call data file 818, the personalized sequencing system 106 utilizes the genotype-caller components 808 of the personalized call-generation model 820 and modifies or maintains values for such data fields based on the genotype call classifications generated by the call-recalibration machine-learning model 812. For instance, the personalized sequencing system 106 modifies various metrics such as quality metrics, mapping metrics, or other metrics associated with the genotype call. As mentioned, in some cases, the personalized sequencing system 106 selects metrics associated with nucleotide reads, the personalized diploid reference haplotypes 805, and/or the personalized genotype call classifications 814. In other cases, the personalized sequencing system 106 generates new metrics from the data generated by the personalized call-generation model 820 and/or the call-recalibration machine-learning model 812.
[0156] As indicated in FIG. 8A, for example, the personalized sequencing system 106 can utilize (i) candidate genotype calls (e.g., candidate variant calls) generated by the personalized call-generation model 820 and (ii) the personalized genotype call classifications 814 generated by the call-recalibration machine-learning model 812 to modify data fields corresponding to a genotype-call data file 818. Such modified or recalibrated values are output by the call-recalibration machine-learning model 812 via data field generation 816. For example, the personalized sequencing system 106 determines recalibrated values for particular metrics corresponding to the genotype calls emitted to the genotype-call data file 818, such as a base-call-quality metric (QUAL), a genotype metric (GT), a genotype-quality metric (GQ), and so forth. Furthermore, in some embodiments, the personalized sequencing system 106 outputs one or more of the personalized sequencing metrics 811b to data fields of a genotype-call data file for a given sample (e.g., for the sequence data 804), such as a personalized prior genotype probability metric (PRI), a personalized genotype likelihood metric (GL), a personalized genotype probability metric (GP), and so forth.
[0157] In one or more embodiments, the personalized sequencing system 106 recalibrates or modifies a genotype call (or generates a new genotype call) using the personalized genotype call classifications 814 from the call-recalibration machine-learning model 812. As described, the personalized sequencing system 106 modifies the genotype call by modifying or recalibrating data fields for one or more of the metrics associated with the genotype call (e.g., as included within the genotype-call data file 818).
[0158] To update or recalibrate the call-quality metric (QUAL) associated with a genotype call, for instance, the personalized sequencing system 106 determines how each of the personalized genotype call classifications 814 affects the call-quality metric. For example, the personalized sequencing system 106 determines that a high probability for a genotype error results in a lower overall genotype quality and possibly a different overall call quality. As another example, the personalized sequencing system 106 determines that a high probability for a false-positive variant results in a lower overall call quality. As yet another example, the personalized sequencing system 106 determines that a high probability for a true-positive variant results in a higher overall (variant) call quality. The personalized sequencing system 106 accordingly updates the genotype along with the genotype quality and the call quality associated with the genotype call.
[0159] In one or more implementations, the personalized sequencing system 106 generates a combination (e.g., a weighted combination or an average) of the personalized genotype call classifications 814 to recalibrate the call-quality metric. In particular, the personalized sequencing system 106 weights the various predictions of the personalized genotype call classifications 814 according to their respective impact on (variant) call quality. In some cases, the personalized sequencing system 106 weights each genotype probability equally, while in other cases the personalized sequencing system 106 determines different weights for each. In any event, the personalized sequencing system 106 determines a weighted combination or a weighted average of the personalized genotype call classifications 814 to recalibrate (increase or decrease) a call-quality metric for a genotype call (e.g., an initial variant call).
[0160] To update or recalibrate the genotype metric (e.g., within the GT field of the genotype-call data file 818) associated with a genotype call, the personalized sequencing system 106 utilizes one or more of the personalized genotype call classifications 814. For example, the personalized sequencing system 106 compares the various constituent predictions of each to determine which of the personalized genotype call classifications 814 has a highest probability. In some cases, the personalized sequencing system 106 utilizes the genotype call classification indicating the highest accuracy to recalibrate the genotype metric (e.g., from 0 as corresponding to the reference base to 1 as corresponding to a first alternative supporting read).
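The two recalibration steps above can be sketched as a weighted sum of the classifications and a selection of the highest classification for the GT field. The specific weights (zero for the homozygous reference genotype, one for each variant genotype) are an illustrative assumption chosen so that the combination reads as support for a variant; the disclosure does not specify particular weights.

```python
def weighted_combination(classifications, weights):
    """Weighted sum of genotype call classifications."""
    return sum(weights[g] * p for g, p in classifications.items())

def recalibrated_gt(classifications):
    """Genotype with the highest classification, used to update the GT field."""
    return max(classifications, key=classifications.get)

classifications = {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}

# Hypothetical weights: only variant genotypes contribute to variant call quality.
weights = {"0/0": 0.0, "0/1": 1.0, "1/1": 1.0}

variant_support = weighted_combination(classifications, weights)
initial_gt = "0/0"
updated_gt = recalibrated_gt(classifications)
print(variant_support, initial_gt, "->", updated_gt)
```

A higher `variant_support` would argue for increasing the call-quality metric of a variant call, and a lower value for decreasing it.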
[0161] To update or recalibrate the genotype-quality metric (e.g., within the GQ field of the genotype-call data file 818) associated with a genotype call, the personalized sequencing system 106 utilizes one or more of the personalized genotype call classifications 814. More specifically, the personalized sequencing system 106 determines how each of the personalized genotype call classifications 814 affects the genotype-quality metric. The personalized sequencing system 106 recalibrates the genotype-quality metric accordingly (e.g., by increasing or decreasing the quality score on a scale from 0 to 10, from 0 to 100, or on some other scale). For example, the personalized sequencing system 106 determines that a higher genotype error probability (generally) indicates a lower genotype-quality metric, and the personalized sequencing system 106 reduces the metric accordingly.
[0162] In some cases, the personalized sequencing system 106 determines a combination (e.g., a weighted combination or a weighted average) of the personalized genotype call classifications 814 to modify the genotype-quality metric. For example, the personalized sequencing system 106 determines a combined effect that the personalized genotype call classifications 814 have on the genotype-quality metric. As another example, the personalized sequencing system 106 determines individual impacts that each constituent prediction of the personalized genotype call classifications 814 has on the genotype-quality metric and weights each accordingly. The personalized sequencing system 106 further recalibrates the genotype-quality metric by increasing or decreasing its value based on the indicated classifications.
[0163] As described, the personalized sequencing system 106 generates an output genotype call from the same set of the sequencing metrics 811a and/or the personalized sequencing metrics 811b (or a subset of the sequencing metrics that are shared between the call-recalibration machine-learning model 812 and the personalized call-generation model 820). Indeed, the personalized sequencing system 106 can operate the call-recalibration machine-learning model 812 in parallel with the personalized call-generation model 820 to generate metrics for an output genotype call and the personalized genotype call classifications 814 for recalibrating the generated metrics.
[0164] In one or more implementations, the personalized sequencing system 106 updates or otherwise modifies the data fields for the genotype-call data file 818 according to particular algorithms. After modifying such data fields, the personalized sequencing system 106 can generate the genotype-call data file 818 (e.g., a post-filter variant call file) to include metrics reflecting the updated data fields. For instance, in some cases, the personalized sequencing system 106 updates the QUAL field for every variant based on the probability of a false positive variant. As indicated above, in some cases, QUAL indicates the probability that there is some kind of variant (or other nucleobase call) at a given location, measured in PHRED scale.
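Because QUAL is expressed in PHRED scale, a recalibrated value can be sketched as -10·log10 of an error probability. In the sketch below, the probability of a false positive variant is taken to be the homozygous reference classification, and a genotype quality is derived the same way from the probability that the selected genotype is wrong. Treating the classifications as probabilities in this way, the cap of 99, and the probability floor are assumptions for illustration.

```python
import math

def phred(error_probability, cap=99.0, floor=1e-10):
    """Convert an error probability to a capped PHRED-scaled score."""
    return min(cap, -10.0 * math.log10(max(error_probability, floor)))

classifications = {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}

# QUAL: PHRED-scaled probability that there is no variant at the site.
qual = phred(classifications["0/0"])

# GQ: PHRED-scaled probability that the selected genotype is wrong.
best = max(classifications, key=classifications.get)
gq = phred(1.0 - classifications[best])

print(round(qual, 2), round(gq, 2))
```

On this scale, a score of 20 corresponds to an error probability of 0.01 and a score of 10 to an error probability of 0.1.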
[0165] As suggested above, in some embodiments, the personalized sequencing system 106 increases or decreases a base-call-quality metric (e.g., Q score) for a genotype call. Based on the personalized genotype call classifications 814, for example, the personalized sequencing system 106 increases base-call-quality metrics for genotype calls that would not have previously passed a quality filter and determines that the increased base-call-quality metrics now pass the quality filter. In some such cases, the personalized sequencing system 106 includes genotype calls with such increased base-call-quality metrics (passing the quality filter) in a post-filter variant call file. By contrast, in other cases, the personalized sequencing system 106 decreases base-call-quality metrics for genotype calls that previously would have passed a quality filter and determines that the decreased base-call-quality metrics now fail the quality filter. In some such cases, the personalized sequencing system 106 excludes genotype calls with decreased base-call-quality metrics (failing the quality filter) from a post-filter variant call file but includes the genotype calls with such decreased base-call-quality metrics in a pre-filter variant call file.
[0166] For example, the personalized sequencing system 106 can remove false positive variant calls and recover false negative variant calls by changing corresponding base-call-quality metrics. To remove a false positive, in some cases, the personalized sequencing system 106 decreases the base-call-quality metric of a genotype call that initially passed a quality filter — based on the personalized genotype call classifications 814 from the call-recalibration machine-learning model 812. Based on determining the decreased base-call-quality metric falls below a threshold metric (e.g., a Q score of 3.0 or 10.0), the personalized sequencing system 106 determines that the genotype call no longer passes the quality filter. The personalized sequencing system 106 thus filters out, or removes, the false positive-genotype call that initially passed the filter by changing its base-call-quality metric.
[0167] In addition to removing false positive variant calls based on changes to base-call-quality metrics, the personalized sequencing system 106 can remove false positive variant calls based on changes to genotype. To remove a false positive, in some cases, the personalized sequencing system 106 changes a genotype of an initial genotype call indicating a different nucleobase than a reference base (e.g., GT = 1 or 2) to a genotype of an updated genotype call indicating a same nucleobase as the reference base (e.g., GT = 0). Based on the genotype being the same as the reference base, the personalized sequencing system 106 does not identify the genotype call as a variant and, in some cases, excludes data for the genotype call from the genotype-call data file 818. For instance, the personalized sequencing system 106 can use a null-data indicator for a genotype call (or a particular field) of the genotype-call data file 818. In some cases, the personalized sequencing system 106 uses a null-data indicator in cases where a certain sequencing metric does not apply to a particular variant call or VCF field (e.g., where SBS-based calls use different metrics than assembled-nucleotide-read-based calls).
[0168] To recover a false negative, the personalized sequencing system 106 increases the base-call-quality metric of a genotype call that initially failed a quality filter. Based on determining the increased base-call-quality metric exceeds a threshold metric, the personalized sequencing system 106 determines that the genotype call passes the quality filter. The personalized sequencing system 106 thus recovers a false-negative-genotype call that was initially filtered out by changing its base-call-quality metric.
[0169] In addition to recovering false negative variant calls based on changes to base-call-quality metrics, the personalized sequencing system 106 can recover false negative variant calls based on changes to genotype. To recover a false negative, in some cases, the personalized sequencing system 106 changes a genotype of an initial genotype call indicating the same nucleobase as a reference base (e.g., GT = 0) to a different genotype of an updated genotype call indicating a different nucleobase than the reference base (e.g., GT = 1 or 2). Based on the differing genotype of the updated genotype call and a passing base-call-quality metric, the personalized sequencing system 106 identifies the genotype call as a variant and includes the genotype call within the genotype-call data file 818.
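The four cases described in the preceding paragraphs (removing a false positive or recovering a false negative, through either a quality change or a genotype change) can be sketched by comparing whether a call is a passing variant before and after recalibration. The dictionary layout, the function names, the labels, and the threshold of 20 are hypothetical.

```python
def is_variant(gt):
    """True if any allele in a GT string such as '0/1' differs from the reference."""
    alleles = gt.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)

def passes(call, threshold):
    """A call is emitted only if it is a variant and satisfies the quality filter."""
    return is_variant(call["GT"]) and call["QUAL"] >= threshold

def filter_transition(initial, recalibrated, threshold=20.0):
    before, after = passes(initial, threshold), passes(recalibrated, threshold)
    if before and not after:
        return "removed"      # treated as a false positive
    if after and not before:
        return "recovered"    # treated as a false negative
    return "kept" if after else "still_filtered"

# Quality-driven removal and recovery.
print(filter_transition({"GT": "0/1", "QUAL": 32.0}, {"GT": "0/1", "QUAL": 8.0}))
print(filter_transition({"GT": "0/1", "QUAL": 12.0}, {"GT": "0/1", "QUAL": 41.0}))
# Genotype-driven removal and recovery.
print(filter_transition({"GT": "0/1", "QUAL": 32.0}, {"GT": "0/0", "QUAL": 32.0}))
print(filter_transition({"GT": "0/0", "QUAL": 30.0}, {"GT": "1/1", "QUAL": 30.0}))
```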
[0170] Indeed, in some implementations, the personalized sequencing system 106 operates in a specific sequential order utilizing the personalized call-generation model 820 and the call-recalibration machine-learning model 812. For example, the personalized sequencing system 106 generates a FASTQ file by converting a BCL file to FASTQ. In addition, the personalized sequencing system 106 (subsequently) utilizes the mapping-and-alignment components 806 of the personalized call-generation model 820 to map and align nucleobases from a sample nucleotide sequence. In some cases, the personalized sequencing system 106 maps and aligns the nucleobases of the sample sequence in relation to the linear reference genome 803 and/or various alternative supporting reads.
[0171] After mapping and aligning, as described herein, the personalized sequencing system 106 utilizes the genotype-caller components 808 of the personalized call-generation model 820 to generate an initial genotype call for the sample sequence corresponding to a particular genomic coordinate — based on various sequencing metrics. After or at the same time, the personalized sequencing system 106 also applies the call-recalibration machine-learning model 812 to generate the personalized genotype call classifications 814 from the sequencing metrics 811a and/or the personalized sequencing metrics 811b. Based on the personalized genotype call classifications 814, the personalized sequencing system 106 recalibrates the genotype call (e.g., by modifying various data fields corresponding to specific metrics of the nucleobase call, such as QU AL, GT, and/or GQ), as described above.
[0172] In some cases, the personalized sequencing system 106 further applies a quality filter to the genotype call to determine whether the genotype call passes the quality filter (e.g., a hard pass filter of Q20 or other Q score). The personalized sequencing system 106 subsequently identifies a subset of genotype calls that represent variants from reference bases and pass the quality filter. The personalized sequencing system 106 further generates a modified or updated genotypecall data file (e.g., the genotype-call data file 818) that includes the subset of genotype calls and recalibrated metrics for the subset of genotype calls, such as updated QU AL metrics, updated GT metrics, and/or updated GQ metrics.
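As a further non-limiting illustration, and assuming the Q score of the quality filter is Phred-scaled (as the QUAL and GQ fields are in the VCF specification), the relationship between a Q score and an estimated error probability can be sketched as follows. The function names are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)
```

Under this assumption, a Q20 filter corresponds to accepting calls with an estimated error probability of at most about 1 in 100.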
[0173] As suggested above, in some embodiments, the personalized sequencing system 106 utilizes a call-recalibration machine-learning model comprising at least two sub-models: (i) the personalized call-recalibration machine-learning model 812a (or sub-model) configured and trained to process personalized sequencing metrics (e.g., the personalized sequencing metrics 811b) for generating personalized genotype call classifications (e.g., as described above in relation to FIG. 7) and (ii) the call-recalibration machine-learning model 812b configured and trained to generate genotype call classifications based on sequencing metrics (e.g., the sequencing metrics 811a) without consideration of personalized sequencing metrics. In particular, FIG. 8B illustrates the personalized sequencing system utilizing the different sub-models of the call-recalibration machine-learning model 812 to filter candidate variant calls to reduce false negative variant genotype calls.
[0174] As shown in FIG. 8B, the personalized sequencing system 106 identifies (or receives) a candidate variant call 821 (e.g., with respect to the linear reference genome 803) generated by the personalized call-generation model 820 (e.g., as described in FIGS. 4-5). Having identified the candidate variant call 821 at a respective genomic coordinate, the personalized sequencing system 106 determines, at an act 822, whether the candidate variant call 821 corresponds to (e.g., matches) the alleles indicated by the personalized diploid reference haplotypes 805 at the respective genomic coordinate.
[0175] In cases where, at the act 822, the personalized sequencing system 106 determines that the candidate variant call 821 does not match the personalized diploid reference haplotypes 805 at the respective genomic coordinate, the personalized sequencing system 106 generates genotype call classifications utilizing the call-recalibration machine-learning model 812b (e.g., the sub-model of the call-recalibration machine-learning model 812 trained to generate genotype call classifications based on the sequencing metrics 811a without the personalized sequencing metrics 811b). Based on the genotype call classifications generated by the call-recalibration machine-learning model 812b, the personalized sequencing system 106 determines, at an act 830, whether to confirm or modify the candidate variant call 821. For confirmed variant calls, the personalized sequencing system 106 determines, at an act 834, whether the candidate variant call 821 meets a predetermined threshold, such as a threshold quality score, a threshold confidence score, or a threshold probability. Based on the candidate satisfying the predetermined threshold at the act 834, the personalized sequencing system 106 emits, at an act 828, the candidate variant call 821 to an output file for the respective genomic coordinate, such as a VCF or other genotype-call data file (e.g., the genotype-call data file 818). Otherwise, the personalized sequencing system 106, at an act 832, discards the candidate variant call 821 and, in some cases, records a reference genotype call or other genotype call for the respective genomic coordinate.
[0176] In cases where, at the act 822, the personalized sequencing system 106 determines that the candidate variant call 821 matches the personalized diploid reference haplotypes 805 at the respective genomic coordinate, the personalized sequencing system 106 generates genotype call classifications utilizing the personalized call-recalibration machine-learning model 812a (e.g., the sub-model of the call-recalibration machine-learning model 812 trained, such as described above in relation to FIG. 7, to generate genotype call classifications based on the personalized sequencing metrics 811b and/or the sequencing metrics 811a). Based on the genotype call classifications generated by the personalized call-recalibration machine-learning model 812a, the personalized sequencing system 106 determines, at an act 824, whether to confirm or modify the candidate variant call 821. For confirmed variant calls, the personalized sequencing system 106 determines, at the act 826, whether the candidate variant call 821 meets the predetermined threshold for acceptance and, in cases where the candidate variant call 821 satisfies the predetermined threshold, emits, at the act 828, the candidate variant call 821 as output for the respective genomic coordinate.
[0177] As further illustrated in FIG. 8B, for variant calls that are either (i) not confirmed at the act 824 or (ii) not found to meet the predetermined threshold at the act 826, the personalized sequencing system 106 re-processes the candidate variant call 821 to generate additional genotype call classifications for the respective genomic coordinate utilizing the call-recalibration machine-learning model 812b (e.g., the sub-model of the call-recalibration machine-learning model 812 trained to generate genotype call classifications based on the sequencing metrics 811a without the personalized sequencing metrics 811b). Based on the additional genotype call classifications generated by the call-recalibration machine-learning model 812b, the personalized sequencing system 106 determines, at the act 830, whether to confirm or modify the candidate variant call 821. For confirmed variant calls, the personalized sequencing system 106 determines, at the act 834, whether the candidate variant call 821 meets the predetermined threshold and, based on the candidate satisfying the predetermined threshold at the act 834, emits the candidate variant call 821 to an output file for the respective genomic coordinate at the act 828. Otherwise, the personalized sequencing system 106, at the act 832, discards the candidate variant call 821 and, in some cases, records a reference genotype call or other genotype call for the respective genomic coordinate.
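By way of non-limiting illustration, the decision flow of FIG. 8B described above can be sketched as follows. The sketch is a simplification under stated assumptions: each sub-model is represented by a hypothetical callable returning a confirmation flag and a score, the same threshold is used at both threshold checks, and modification of a candidate call is not modeled. The names `Classification`, `route_candidate`, and the parameter names are hypothetical.

```python
from typing import Callable, NamedTuple


class Classification(NamedTuple):
    confirmed: bool  # whether the sub-model confirms the candidate variant call
    score: float     # quality, confidence, or probability compared to the threshold


# A sub-model is represented here as any callable mapping a candidate to a classification.
Model = Callable[[dict], Classification]


def route_candidate(candidate: dict,
                    matches_personalized_haplotypes: bool,
                    personalized_model: Model,
                    standard_model: Model,
                    threshold: float) -> str:
    """Return "emit" or "discard" for a candidate variant call."""
    if matches_personalized_haplotypes:                      # act 822
        result = personalized_model(candidate)               # sub-model 812a
        if result.confirmed and result.score >= threshold:   # acts 824 and 826
            return "emit"                                    # act 828
        # Not confirmed or below threshold: re-process with sub-model 812b.
    result = standard_model(candidate)                       # sub-model 812b
    if result.confirmed and result.score >= threshold:       # acts 830 and 834
        return "emit"                                        # act 828
    return "discard"                                         # act 832
```

In this sketch, a candidate that matches the personalized haplotypes gets two chances to be emitted, which reflects how the described flow reduces false negative variant calls.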
[0178] As mentioned above, in certain described embodiments, the personalized sequencing system 106 provides improvements in genotype-calling accuracy over existing sequencing systems. To illustrate, FIG. 9 provides a bar graph depicting accuracy improvements associated with the personalized sequencing system 106 in accordance with one or more embodiments. More specifically, FIG. 9 illustrates comparative experimental results of various sequencing systems to determine genotype calls for a known sample.
[0179] For instance, FIG. 9 illustrates a bar graph comparing performance in the identification of single nucleotide polymorphisms (SNPs) within a well-characterized genomic dataset (specifically, the HG002 Human Genome dataset) comprising known variants. The illustrated bar graph depicts results of an existing sequencing system without implementing personalized reference haplotypes, labeled "Baseline"; a sequencing system executing mapping and alignment using personalized reference haplotypes, labeled "Personalized M/A"; an embodiment of the personalized sequencing system 106 implementing personalized mapping and alignment and a personalized call-generation model, labeled "Personalized M/A + VC"; and two embodiments of the personalized sequencing system 106 implementing embodiments of a personalized call-generation model and a personalized call-recalibration machine-learning model, labeled "Personalized VC + ML V1" and "Personalized VC + ML V2," respectively.
[0180] Indeed, as shown in FIG. 9, the personalized sequencing system 106 outperforms the existing sequencing system, resulting in fewer false positive (FP) variant calls and fewer false negative (FN) variant calls when identifying SNPs within the genomic dataset. By utilizing either or both a call-generation model or a call-recalibration machine-learning model to generate personalized genotype probabilities or personalized genotype call classifications, respectively, FIG. 9 shows that the personalized sequencing system 106 determines more accurate genotype calls (e.g., variant calls) with fewer false positive or false negative variant calls compared to existing sequencing systems.
[0181] Turning now to FIGS. 10 and 11, these figures illustrate example flowcharts of two respective series of acts 1000 and 1100 for implementing a personalized subset of diploid reference haplotypes to generate genotype calls in accordance with one or more embodiments. While FIGS. 10 and 11 illustrate acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIGS. 10 and/or 11. The acts of FIGS. 10 and/or 11 can be performed as part of a method. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIGS. 10 and/or 11. In still further embodiments, a system can comprise at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIGS. 10 and/or 11.
[0182] As shown in FIG. 10, the series of acts 1000 includes an act 1002 of identifying a set of nucleotide reads corresponding to a genomic coordinate of a sample, an act 1004 of determining personalized sequencing metrics for the set of nucleotide reads in relation to a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate, an act 1006 of generating, utilizing a call-recalibration machine-learning model and based on the personalized sequencing metrics, one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate, and an act 1008 of determining a genotype call for the sample at the genomic coordinate based on the one or more personalized genotype call classifications.
[0183] As shown in FIG. 11, the series of acts 1100 includes an act 1102 of identifying a set of nucleotide reads of a sample corresponding to a genomic coordinate according to a respective set of read alignments, an act 1104 of determining, based on the set of nucleotide reads, a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate, an act 1106 of generating, based on the personalized subset of diploid reference haplotypes and utilizing a call-generation model, genotype probabilities of candidate genotypes for the sample at the genomic coordinate, and an act 1108 of determining a genotype call for the sample at the genomic coordinate based on the genotype probabilities.
[0184] For example, the series of acts 1000 and/or the series of acts 1100 can include acts to perform any of the operations described in the following clauses:
CLAUSE 1. A computer-implemented method comprising: identifying a set of nucleotide reads corresponding to a genomic coordinate of a sample; determining personalized sequencing metrics for the set of nucleotide reads in relation to a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate; generating, utilizing a call-recalibration machine-learning model and based on the personalized sequencing metrics, one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate; and determining a genotype call for the sample at the genomic coordinate based on the one or more personalized genotype call classifications.
CLAUSE 2. The computer-implemented method of clause 1, wherein the personalized subset of diploid reference haplotypes comprise a first population haplotype selected for the sample at the genomic coordinate and a second population haplotype selected for the sample at the genomic coordinate.
CLAUSE 3. The computer-implemented method of any of clauses 1-2, further comprising determining the personalized sequencing metrics by determining one or more of: a personalized genotype metric indicating a diploid genotype at the genomic coordinate; a personalized genotype-probability metric indicating a probability of the diploid genotype occurring at the genomic coordinate; a personalized genotype-quality metric indicating a probability that the diploid genotype at the genomic coordinate is correct or incorrect; or a personalized variant-call-quality metric indicating a quality score for a variant call generated by a call-generation model utilizing the personalized subset of diploid reference haplotypes.
CLAUSE 4. The computer-implemented method of any of clauses 1-3, further comprising selecting the genomic coordinate based on identifying a variant call generated with respect to a linear reference genome at the genomic coordinate by a call-generation model based on the set of nucleotide reads.
CLAUSE 5. The computer-implemented method of clause 4, further comprising selecting the genomic coordinate further based on determining that the variant call generated by the call-generation model matches a variant determined by an imputation model within the personalized subset of diploid reference haplotypes at the genomic coordinate.
CLAUSE 6. The computer-implemented method of any of clauses 1-5, further comprising selecting the genomic coordinate from a predetermined set of genomic coordinates corresponding to variants with respect to a linear reference genome.
CLAUSE 7. The computer-implemented method of any of clauses 1-6, further comprising generating a genotype-call data file comprising the genotype call and one or more of: a personalized genotype metric indicating a diploid genotype at the genomic coordinate; a personalized prior-probability metric indicating a prior probability of the diploid genotype from the personalized subset of diploid reference haplotypes at the genomic coordinate; a personalized genotype-likelihood metric indicating a likelihood of the diploid genotype at the genomic coordinate based on the personalized prior-probability metric; or a personalized genotype-probability metric indicating a posterior probability of the diploid genotype at the genomic coordinate.
CLAUSE 8. The computer-implemented method of any of clauses 1-7, further comprising generating a genotype-call data file comprising a variant call as part of the genotype call for the sample at the genomic coordinate by: determining that the variant call from the genotype call corresponds to a variant-call-quality score satisfying a quality threshold; and based on the variant-call-quality score satisfying the quality threshold, reporting the variant call as part of the genotype-call data file.
CLAUSE 9. The computer-implemented method of any of clauses 1-8, further comprising generating a genotype-call data file comprising a variant call as part of the genotype call for the sample at the genomic coordinate by: determining that the genotype call comprises a reference call based on the one or more personalized genotype call classifications; determining an alternate genotype call comprising the variant call based on different genotype call classifications generated by a different call-recalibration machine-learning model; determining that the variant call from the alternate genotype call corresponds to a variant-call-quality score satisfying a quality threshold; and based on the variant-call-quality score satisfying the quality threshold, reporting the variant call as part of the genotype-call data file.
CLAUSE 10. The computer-implemented method of any of clauses 1-9, further comprising generating the one or more personalized genotype call classifications utilizing a first machine-learning model of the call-recalibration machine-learning model, the first machine-learning model trained to determine personalized genotype call classifications in relation to personalized subsets of diploid reference haplotypes for respective samples without reference to other candidate population haplotypes for the respective samples.
CLAUSE 11. The computer-implemented method of clause 10, further comprising: determining that the one or more personalized genotype call classifications generated by the first machine-learning model correspond to a reference genotype call at the genomic coordinate; based on the one or more personalized genotype call classifications corresponding to the reference genotype call, generating one or more additional genotype call classifications utilizing a second machine-learning model of the call-recalibration machine-learning model, the second machine-learning model trained to determine genotype call classifications for samples in relation to a linear reference genome; and determining the genotype call for the genomic coordinate based on the one or more personalized genotype call classifications and the one or more additional genotype call classifications.
CLAUSE 12. The computer-implemented method of any of clauses 1-11, further comprising generating the one or more personalized genotype call classifications by determining a posterior probability of a diploid genotype with respect to a linear reference genome at the genomic coordinate based on a prior probability that the sample comprises the personalized subset of diploid reference haplotypes for the genomic coordinate.
CLAUSE 13. The computer-implemented method of any of clauses 1-12, further comprising: identifying an additional set of nucleotide reads corresponding to a haploid genomic coordinate of the sample; determining additional personalized sequencing metrics for the additional set of nucleotide reads in relation to at least one personalized reference haplotype for the sample at the haploid genomic coordinate; generating, utilizing the call-recalibration machine-learning model and based on the additional personalized sequencing metrics, one or more personalized haploid genotype call classifications indicating an accuracy of identifying a genotype at the haploid genomic coordinate; and determining an additional genotype call for the sample at the haploid genomic coordinate based on the one or more personalized haploid genotype call classifications.
CLAUSE 14. A computer-implemented method comprising: identifying a set of nucleotide reads of a sample corresponding to a genomic coordinate according to a respective set of read alignments; determining, based on the set of nucleotide reads, a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate; generating, based on the personalized subset of diploid reference haplotypes and utilizing a call-generation model, genotype probabilities of candidate genotypes for the sample at the genomic coordinate; and determining a genotype call for the sample at the genomic coordinate based on the genotype probabilities.
CLAUSE 15. The computer-implemented method of clause 14, further comprising generating the genotype probabilities by determining prior genotype probabilities of the candidate genotypes in relation to the personalized subset of diploid reference haplotypes.
CLAUSE 16. The computer-implemented method of clause 15, further comprising: generating, utilizing the call-generation model, posterior genotype probabilities of the candidate genotypes for the sample at the genomic coordinate based on the prior genotype probabilities; and determining the genotype call for the sample at the genomic coordinate based on the posterior genotype probabilities.
CLAUSE 17. The computer-implemented method of any of clauses 14-16, further comprising determining the respective set of read alignments with respect to a linear reference genome by comparing the set of nucleotide reads with the personalized subset of diploid reference haplotypes at the genomic coordinate.
CLAUSE 18. The computer-implemented method of any of clauses 14-17, further comprising determining the genotype probabilities for the candidate genotypes comprising diploid genotypes indicating:
(i) a first nucleobase call for a first allele of the sample in relation to a first population haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate; and
(ii) a second nucleobase call for a second allele of the sample in relation to a second population haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate.
CLAUSE 19. The computer-implemented method of any of clauses 14-18, wherein the personalized subset of diploid reference haplotypes comprise a first population haplotype selected for the sample at the genomic coordinate and a second population haplotype selected for the sample at the genomic coordinate.
CLAUSE 20. The computer-implemented method of clause 19, further comprising generating, based on determining that the first population haplotype differs from the second population haplotype, the genotype probabilities in relation to the personalized subset of diploid reference haplotypes by generating one or more of: a first probability of a heterozygous diploid genotype corresponding to one of the first population haplotype or the second population haplotype at the genomic coordinate; a second probability of a heterozygous alternate diploid genotype corresponding to neither the first population haplotype nor the second population haplotype; or a third probability of a homozygous reference diploid genotype corresponding to both the first population haplotype and the second population haplotype.
CLAUSE 21. The computer-implemented method of clause 19, further comprising generating, based on determining that the first population haplotype and the second population haplotype are identical at the genomic coordinate, the genotype probabilities in relation to the personalized subset of diploid reference haplotypes by generating one or more of: a first probability of a heterozygous diploid genotype with an allele of the sample corresponding to the first population haplotype and the second population haplotype at the genomic coordinate; a second probability of a homozygous diploid genotype with identical alleles corresponding to neither the first population haplotype nor the second population haplotype; a third probability of a heterozygous alternate diploid genotype with differing alleles each corresponding to neither the first population haplotype nor the second population haplotype; or a fourth probability of a homozygous reference diploid genotype with identical alleles corresponding to the first population haplotype and the second population haplotype.
CLAUSE 22. The computer-implemented method of any of clauses 19-21, further comprising generating the genotype probabilities for the candidate genotypes comprising single nucleotide variants based on a uniform branch length or an inferred variable branch length between the sample and each of the first population haplotype and the second population haplotype.
CLAUSE 23. The computer-implemented method of any of clauses 19-21, further comprising generating the genotype probabilities for the candidate genotypes comprising insertions or deletions (indels) based on an estimated indel rate for the sample in consideration of a single candidate indel allele at the genomic coordinate.
CLAUSE 24. The computer-implemented method of any of clauses 14-23, further comprising: determining personalized sequencing metrics for the set of nucleotide reads based on the genotype probabilities generated by the call-generation model; generating, utilizing a call-recalibration machine-learning model and based on the personalized sequencing metrics, one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate; and determining the genotype call for the sample at the genomic coordinate based on the one or more personalized genotype call classifications.
CLAUSE 25. The computer-implemented method of any of clauses 14-24, further comprising determining the genotype call with respect to a linear reference genome.
CLAUSE 26. The computer-implemented method of any of clauses 14-25, further comprising: identifying an additional set of nucleotide reads of the sample corresponding to a haploid genomic coordinate; determining, based on the additional set of nucleotide reads, at least one personalized reference haplotype for the sample at the haploid genomic coordinate; generating, based on the at least one personalized reference haplotype and utilizing the call-generation model, additional genotype probabilities of candidate haploid genotypes for the sample at the haploid genomic coordinate; and determining an additional genotype call for the sample at the haploid genomic coordinate based on the additional genotype probabilities.
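By way of non-limiting illustration of clauses 15 and 16 above, posterior genotype probabilities can be obtained from prior genotype probabilities by a textbook application of Bayes' rule, in which each prior is multiplied by a likelihood of the observed reads and the products are normalized. The sketch below assumes that per-genotype read likelihoods are already available; the function names, genotype labels, and numeric values are hypothetical and do not represent the disclosed call-generation model.

```python
from typing import Dict


def posterior_genotype_probabilities(priors: Dict[str, float],
                                     likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Combine prior genotype probabilities with read likelihoods via Bayes' rule."""
    unnormalized = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("at least one genotype must have nonzero prior and likelihood")
    return {g: value / total for g, value in unnormalized.items()}


def call_genotype(posteriors: Dict[str, float]) -> str:
    """Select the candidate genotype with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)


# Hypothetical priors derived from a personalized subset of reference haplotypes,
# and hypothetical likelihoods of the observed reads under each candidate genotype.
example_priors = {"0/0": 0.90, "0/1": 0.09, "1/1": 0.01}
example_likelihoods = {"0/0": 0.001, "0/1": 0.5, "1/1": 0.01}
example_posteriors = posterior_genotype_probabilities(example_priors, example_likelihoods)
example_call = call_genotype(example_posteriors)
```

In this made-up example, strong read evidence for a heterozygous genotype outweighs a prior that favors the homozygous reference genotype.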
[0185] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
[0186] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
[0187] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using y-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.). [0188] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments, where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be the indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
[0189] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on realtime pyrophosphate.” Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of a nucleotides at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminatorbased sequencing methods.
[0190] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed, and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be coengineered to efficiently incorporate and extend from these modified nucleotides.
[0191] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
[0192] In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include a fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.
[0193] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.
[0194] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label or is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
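As a non-limiting illustration of the exemplary two-channel configuration just described (dATP in the first channel, dCTP in the second channel, dTTP in both channels, and dGTP in neither), per-cycle intensities could be mapped to nucleotide types roughly as follows. The sketch is hypothetical: the name `call_bases`, the normalized intensity scale, and the fixed threshold of 0.5 are assumptions chosen for illustration and are not taken from the incorporated materials.

```python
from typing import Iterable, Tuple

# (detected in channel 1, detected in channel 2) -> nucleotide type,
# following the exemplary dATP / dCTP / dTTP / dGTP assignment in the text.
TWO_CHANNEL_CODE = {
    (True, False): "A",
    (False, True): "C",
    (True, True): "T",
    (False, False): "G",
}


def call_bases(cycles: Iterable[Tuple[float, float]], threshold: float = 0.5) -> str:
    """Call one base per cycle from a pair of normalized channel intensities.

    threshold is an assumed cutoff separating signal from background.
    """
    calls = []
    for channel_1, channel_2 in cycles:
        key = (channel_1 >= threshold, channel_2 >= threshold)
        calls.append(TWO_CHANNEL_CODE[key])
    return "".join(calls)


if __name__ == "__main__":
    # Four cycles for one feature: channel-1 and channel-2 intensities.
    print(call_bases([(0.9, 0.1), (0.1, 0.8), (0.95, 0.9), (0.05, 0.02)]))
```

A fixed threshold is the simplest possible decision rule; a practical system would likely account for crosstalk and per-feature intensity variation, which this sketch ignores.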
[0195] Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
[0196] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0197] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis." Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope" Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.
[0198] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.
[0199] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
[0200] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can typically be bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
[0201] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
[0202] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like. A flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No. 13/273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference. The sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device, as described further above.
[0203] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and/or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA-containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
[0204] The components of the personalized sequencing system 106 can include software, hardware, or both. For example, the components of the personalized sequencing system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 114). When executed by the one or more processors, the computer-executable instructions of the personalized sequencing system 106 can cause the computing devices to perform the bubble detection methods described herein. Alternatively, the components of the personalized sequencing system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the personalized sequencing system 106 can include a combination of computer-executable instructions and hardware.
[0205] Furthermore, the components of the personalized sequencing system 106 performing the functions described herein with respect to the personalized sequencing system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model. Thus, components of the personalized sequencing system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the personalized sequencing system 106 may be implemented in any application that provides sequencing services including, but not limited to Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
[0206] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., a memory) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
[0207] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
[0208] Non-transitory computer-readable storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
[0209] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[0210] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
[0211] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[0212] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[0213] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
[0214] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[0215] FIG. 12 illustrates a block diagram of a computing device 1200 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1200 may implement the personalized sequencing system 106 and the sequencing device system 104. As shown by FIG. 12, the computing device 1200 can comprise a processor 1202, a memory 1204, a storage device 1206, an I/O interface 1208, and a communication interface 1210, which may be communicatively coupled by way of a communication infrastructure 1212. In certain embodiments, the computing device 1200 can include fewer or more components than those shown in FIG. 12. The following paragraphs describe components of the computing device 1200 shown in FIG. 12 in additional detail.
[0216] In one or more embodiments, the processor 1202 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1202 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1204, or the storage device 1206 and decode and execute them. The memory 1204 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1206 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
[0217] The I/O interface 1208 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1200. The I/O interface 1208 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces. The I/O interface 1208 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O interface 1208 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
[0218] The communication interface 1210 can include hardware, software, or both. In any event, the communication interface 1210 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1200 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1210 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as WI-FI.
[0219] Additionally, the communication interface 1210 may facilitate communications with various types of wired or wireless networks. The communication interface 1210 may also facilitate communications using various communication protocols. The communication infrastructure 1212 may also include hardware, software, or both that couples components of the computing device 1200 to each other. For example, the communication interface 1210 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
[0220] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.
[0221] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

We claim:
1. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: identify a set of nucleotide reads corresponding to a genomic coordinate of a sample; determine personalized sequencing metrics for the set of nucleotide reads in relation to a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate; generate, utilizing a call-recalibration machine-learning model and based on the personalized sequencing metrics, one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate; and determine a genotype call for the sample at the genomic coordinate based on the one or more personalized genotype call classifications.
2. The system of claim 1, wherein the personalized subset of diploid reference haplotypes comprises a first population haplotype selected for the sample at the genomic coordinate and a second population haplotype selected for the sample at the genomic coordinate.
3. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the personalized sequencing metrics by determining one or more of: a personalized genotype metric indicating a diploid genotype at the genomic coordinate; a personalized genotype-probability metric indicating a probability of the diploid genotype occurring at the genomic coordinate; a personalized genotype-quality metric indicating a probability that the diploid genotype at the genomic coordinate is correct or incorrect; or a personalized variant-call-quality metric indicating a quality score for a variant call generated by a call-generation model utilizing the personalized subset of diploid reference haplotypes.
4. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to select the genomic coordinate based on identifying a variant call generated with respect to a linear reference genome at the genomic coordinate by a call-generation model based on the set of nucleotide reads.
5. The system of claim 4, further comprising instructions that, when executed by the at least one processor, cause the system to select the genomic coordinate further based on determining that the variant call generated by the call-generation model matches a variant determined by an imputation model within the personalized subset of diploid reference haplotypes at the genomic coordinate.
6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to select the genomic coordinate from a predetermined set of genomic coordinates corresponding to variants with respect to a linear reference genome.
7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate a genotype-call data file comprising the genotype call and one or more of: a personalized genotype metric indicating a diploid genotype at the genomic coordinate; a personalized prior-probability metric indicating a prior probability of the diploid genotype from the personalized subset of diploid reference haplotypes at the genomic coordinate; a personalized genotype-likelihood metric indicating a likelihood of the diploid genotype at the genomic coordinate based on the personalized prior-probability metric; or a personalized genotype-probability metric indicating a posterior probability of the diploid genotype at the genomic coordinate.
8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate a genotype-call data file comprising a variant call as part of the genotype call for the sample at the genomic coordinate by: determining that the variant call from the genotype call corresponds to a variant-call-quality score satisfying a quality threshold; and based on the variant-call-quality score satisfying the quality threshold, reporting the variant call as part of the genotype-call data file.
9. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate a genotype-call data file comprising a variant call as part of the genotype call for the sample at the genomic coordinate by: determining that the genotype call comprises a reference call based on the one or more personalized genotype call classifications; determining an alternate genotype call comprising the variant call based on different genotype call classifications generated by a different call-recalibration machine-learning model; determining that the variant call from the alternate genotype call corresponds to a variant-call-quality score satisfying a quality threshold; and based on the variant-call-quality score satisfying the quality threshold, reporting the variant call as part of the genotype-call data file.
10. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the one or more personalized genotype call classifications utilizing a first machine-learning model of the call-recalibration machine-learning model, the first machine-learning model trained to determine personalized genotype call classifications in relation to personalized subsets of diploid reference haplotypes for respective samples without reference to other candidate population haplotypes for the respective samples.
11. The system of claim 10, further comprising instructions that, when executed by the at least one processor, cause the system to: determine that the one or more personalized genotype call classifications generated by the first machine-learning model correspond to a reference genotype call at the genomic coordinate; based on the one or more personalized genotype call classifications corresponding to the reference genotype call, generate one or more additional genotype call classifications utilizing a second machine-learning model of the call-recalibration machine-learning model, the second machine-learning model trained to determine genotype call classifications for samples in relation to a linear reference genome; and determine the genotype call for the genomic coordinate based on the one or more personalized genotype call classifications and the one or more additional genotype call classifications.
12. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate the one or more personalized genotype call classifications by determining a posterior probability of a diploid genotype with respect to a linear reference genome at the genomic coordinate based on a prior probability that the sample comprises the personalized subset of diploid reference haplotypes for the genomic coordinate.
13. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify an additional set of nucleotide reads corresponding to a haploid genomic coordinate of the sample; determine additional personalized sequencing metrics for the additional set of nucleotide reads in relation to at least one personalized reference haplotype for the sample at the haploid genomic coordinate; generate, utilizing the call-recalibration machine-learning model and based on the additional personalized sequencing metrics, one or more personalized haploid genotype call classifications indicating an accuracy of identifying a genotype at the haploid genomic coordinate; and determine an additional genotype call for the sample at the haploid genomic coordinate based on the one or more personalized haploid genotype call classifications.
14. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: identify a set of nucleotide reads of a sample corresponding to a genomic coordinate according to a respective set of read alignments; determine, based on the set of nucleotide reads, a personalized subset of diploid reference haplotypes for the sample at the genomic coordinate; generate, based on the personalized subset of diploid reference haplotypes and utilizing a call-generation model, genotype probabilities of candidate genotypes for the sample at the genomic coordinate; and determine a genotype call for the sample at the genomic coordinate based on the genotype probabilities.
15. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to generate the genotype probabilities by determining prior genotype probabilities of the candidate genotypes in relation to the personalized subset of diploid reference haplotypes.
16. The system of claim 15, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, utilizing the call-generation model, posterior genotype probabilities of the candidate genotypes for the sample at the genomic coordinate based on the prior genotype probabilities; and determine the genotype call for the sample at the genomic coordinate based on the posterior genotype probabilities.
17. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to determine the respective set of read alignments with respect to a linear reference genome by comparing the set of nucleotide reads with the personalized subset of diploid reference haplotypes at the genomic coordinate.
18. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to determine the genotype probabilities for the candidate genotypes comprising diploid genotypes indicating:
(i) a first nucleobase call for a first allele of the sample in relation to a first population haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate; and
(ii) a second nucleobase call for a second allele of the sample in relation to a second population haplotype of the personalized subset of diploid reference haplotypes at the genomic coordinate.
19. The system of claim 14, wherein the personalized subset of diploid reference haplotypes comprises a first population haplotype selected for the sample at the genomic coordinate and a second population haplotype selected for the sample at the genomic coordinate.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to generate, based on determining that the first population haplotype differs from the second population haplotype, the genotype probabilities in relation to the personalized subset of diploid reference haplotypes by generating one or more of: a first probability of a heterozygous diploid genotype corresponding to one of the first population haplotype or the second population haplotype at the genomic coordinate; a second probability of a heterozygous alternate diploid genotype corresponding to neither the first population haplotype nor the second population haplotype; or a third probability of a homozygous reference diploid genotype corresponding to both the first population haplotype and the second population haplotype.
21. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to generate, based on determining that the first population haplotype and the second population haplotype are identical at the genomic coordinate, the genotype probabilities in relation to the personalized subset of diploid reference haplotypes by generating one or more of: a first probability of a heterozygous diploid genotype with an allele of the sample corresponding to the first population haplotype and the second population haplotype at the genomic coordinate; a second probability of a homozygous diploid genotype with identical alleles corresponding to neither the first population haplotype nor the second population haplotype; a third probability of a heterozygous alternate diploid genotype with differing alleles each corresponding to neither the first population haplotype nor the second population haplotype; or a fourth probability of a homozygous reference diploid genotype with identical alleles corresponding to the first population haplotype and the second population haplotype.
22. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to generate the genotype probabilities for the candidate genotypes comprising single nucleotide variants based on a uniform branch length or an inferred variable branch length between the sample and each of the first population haplotype and the second population haplotype.
23. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to generate the genotype probabilities for the candidate genotypes comprising insertions or deletions (indels) based on an estimated indel rate for the sample in consideration of a single candidate indel allele at the genomic coordinate.
24. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to: determine personalized sequencing metrics for the set of nucleotide reads based on the genotype probabilities generated by the call-generation model; generate, utilizing a call-recalibration machine-learning model and based on the personalized sequencing metrics, one or more personalized genotype call classifications indicating an accuracy of identifying a genotype at the genomic coordinate; and determine the genotype call for the sample at the genomic coordinate based on the one or more personalized genotype call classifications.
25. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to determine the genotype call with respect to a linear reference genome.
26. The system of claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to: identify an additional set of nucleotide reads of the sample corresponding to a haploid genomic coordinate; determine, based on the additional set of nucleotide reads, at least one personalized reference haplotype for the sample at the haploid genomic coordinate; generate, based on the at least one personalized reference haplotype and utilizing the call-generation model, additional genotype probabilities of candidate haploid genotypes for the sample at the haploid genomic coordinate; and determine an additional genotype call for the sample at the haploid genomic coordinate based on the additional genotype probabilities.
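
Illustrative sketch (not part of the claims). The flow recited in claims 14 through 16, with the quality threshold of claim 8, can be pictured as a small Bayesian calculation: prior genotype probabilities are derived from the two personalized haplotype alleles at a coordinate, combined with read evidence into posterior genotype probabilities, and the most probable genotype is reported with a Phred-scaled quality. The Python below is a minimal sketch under stated assumptions and is not the disclosed call-generation model. Every name in it (genotype_priors, genotype_likelihood, call_genotype) and every parameter (eps, a uniform per-allele mismatch rate loosely analogous to a uniform branch length; base_error; min_quality) is hypothetical, and the simple independent-read error model is an assumption chosen only for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def allele_prior(allele, hap_allele, eps):
    """P(sample allele | haplotype allele), assuming a uniform mismatch rate eps."""
    return 1.0 - eps if allele == hap_allele else eps / 3.0


def genotype_priors(hap1, hap2, eps=0.001):
    """Prior over unordered diploid genotypes given two personalized haplotype alleles."""
    priors = {}
    for a, b in combinations_with_replacement(BASES, 2):
        p = allele_prior(a, hap1, eps) * allele_prior(b, hap2, eps)
        if a != b:
            # Add the other phasing of the same unordered genotype.
            p += allele_prior(b, hap1, eps) * allele_prior(a, hap2, eps)
        priors[(a, b)] = p
    return priors


def genotype_likelihood(genotype, read_bases, base_error=0.01):
    """P(read bases | genotype), assuming independent reads sampled evenly from both alleles."""
    a, b = genotype
    likelihood = 1.0
    for base in read_bases:
        p_a = 1.0 - base_error if base == a else base_error / 3.0
        p_b = 1.0 - base_error if base == b else base_error / 3.0
        likelihood *= 0.5 * (p_a + p_b)
    return likelihood


def call_genotype(hap1, hap2, read_bases, eps=0.001, base_error=0.01, min_quality=20.0):
    """Return (genotype, posteriors, phred_quality, passes_threshold) for one coordinate."""
    priors = genotype_priors(hap1, hap2, eps)
    unnormalized = {
        g: priors[g] * genotype_likelihood(g, read_bases, base_error) for g in priors
    }
    total = sum(unnormalized.values())
    posteriors = {g: v / total for g, v in unnormalized.items()}
    best = max(posteriors, key=posteriors.get)
    # Probability that the chosen genotype is wrong, clamped to avoid log10(0).
    p_wrong = max(sum(v for g, v in posteriors.items() if g != best), 1e-300)
    quality = -10.0 * math.log10(p_wrong)
    return best, posteriors, quality, quality >= min_quality


if __name__ == "__main__":
    genotype, _, quality, passes = call_genotype("A", "G", "AAGGAGAG")
    print(genotype, round(quality, 1), passes)
```

In this sketch the personalized haplotypes enter only through the prior, so a genotype that agrees with both haplotypes needs less read evidence to reach a given quality than one that contradicts them. A call-recalibration machine-learning model, as recited in claims 1 and 24, would operate on metrics derived from such probabilities and is not modeled here.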
PCT/US2025/031740 2024-05-31 2025-05-30 Call generation and recalibration models for implementing personalized diploid reference haplotypes in genotype calling Pending WO2025250996A2 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US63/654,806 2024-05-31

Publications (1)

Publication Number Publication Date
WO2025250996A2 (en) 2025-12-04

