WO2025221998A1 - Systems and methods for variant calling - Google Patents

Systems and methods for variant calling

Info

Publication number
WO2025221998A1
Authority
WO
WIPO (PCT)
Prior art keywords
variant
count
indel
reads
consensus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/025161
Other languages
French (fr)
Inventor
Ke Tang
James Han
Md Abid HASAN
Zhongyun HUANG
Tieming JI
Badri Kothandaraman PADHUKASAHASRAM
Seyed Hamid MIREBRAHIM
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roche Sequencing Solutions Inc
Original Assignee
Roche Sequencing Solutions Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • BACKGROUND [0002] The development of affordable and rapid DNA sequencing technologies has enabled the development of targeted therapeutics that rely on the use of DNA biomarkers to identify patients that are suitable for receiving the targeted therapy. For example, mutations in certain genes, such as genes involved in cell proliferation, are known to lead to certain types of cancers that can be treated very effectively with specific types of drugs. Other mutations are known to confer resistance to certain therapies. Therefore, there is a need for improved systems and methods to identify variants from sequencing data.
  • SUMMARY [0003] The embodiments described herein relate to systems and methods for performing variant calling on sequencing data. More particularly, the embodiments described herein relate to calling single nucleotide variants, insertions, and deletions in the sequencing data.
  • a method for calling insertions and deletions (InDels) from sequencing data generated from a multiplexed sample includes receiving a computer file comprising a plurality of paired sequence reads; aligning the plurality of paired sequence reads to a reference sequence; sorting the aligned, paired sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups according to a unique molecular identifier (UMI), a sequence read start position, and a sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate InDel variants based on a comparison of the polished consensus sequence reads with the reference sequence; for each candidate InDel variant, calculating a weighte
  • UMI unique molecular identifier
  • the method further includes generating a computer file comprising the paired consensus sequence reads.
  • the step of polishing the parsed consensus sequence reads includes: comparing the parsed consensus sequence reads to the reference sequence to determine a set of preliminary InDel variants, including a count of each preliminary InDel variant; and adjusting the count of each preliminary InDel variant based on a comparison of the count of each preliminary InDel variant to a corresponding background variant count that is determined from a control sample.
  • the comparison comprises determining whether the count for each preliminary InDel variant is background noise.
  • the step of determining whether the count for each preliminary InDel variant is background noise comprises calculating a probability that the preliminary InDel variant is background noise.
  • the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
  • the threshold for the linear model is sample specific.
  • the threshold for the linear model is determined for all the candidate InDel variants in the sample by fitting a regression to a plot of a cumulative candidate InDel variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative InDel variant count.
  • PATENT Client Reference No.: P39266-WO-1 the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
  • the non-linear tree-based ensemble machine learning classifier comprises a singleton molecule count, a consensus molecule count, a duplex molecule count, a ratio of the singleton molecule count to the duplex molecule count, a median distance of the potential variant from a 5’ end of the sequence reads, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a raw strand of alignment bias for consensus molecules, a raw strand of origin bias for singleton molecules, a raw strand of alignment bias for non-duplex molecules, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of singleton in variant molecules versus background, a relative frequency of duplex in variant molecules versus background, and a 2 x 2
  • the machine learning classifier threshold is calibrated using a set of healthy samples to target a predetermined false positive rate.
  • a method for calling single nucleotide variants (SNV) from sequencing data generated from a multiplexed sample is provided.
  • the method includes: receiving a computer file comprising a plurality of paired sequence reads; aligning the sequence reads to a reference sequence; sorting by pair the aligned sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups by unique molecular identifier (UMI), sequence read start position, and sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate variants based on a comparison of the polished consensus reads; for each candidate SNV variant, calculating a weighted linear score using a linear model, wherein the candidate SNV variant is called a potential SNV variant when the weighted linear score is greater than a linear model threshold; and for each potential SNV variant, calculating a weight
  • the method further includes generating a computer file comprising paired consensus sequence reads.
  • the step of polishing the parsed consensus sequence reads includes: comparing the parsed consensus reads to the reference sequence to determine a set of preliminary SNV variants, including a count of each preliminary SNV variant; and adjusting the count of each preliminary SNV variant based on a comparison of the count of each preliminary SNV variant to a corresponding background variant count that is determined from a control sample.
  • the comparison comprises determining whether the count for each preliminary SNV variant is background noise.
  • the step of determining whether the count for each preliminary SNV variant is background noise comprises calculating a probability that the preliminary SNV variant is background noise.
  • the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
  • the threshold for the linear model is sample specific.
  • the threshold for the linear model is determined for all the candidate SNV variants in the sample by fitting a regression to a plot of a cumulative candidate SNV variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative SNV variant count.
  • the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
  • the logistic regression model comprises a duplex variant count, a singleton variant count, and a consensus variant count. [0025] In some embodiments of the second aspect, the logistic regression model further comprises a ratio of the singleton variant count to the duplex molecule count, an adjustment for depth of coverage, an adjustment for errors in a gene, a median distance of the variant from 5’ end of the sequence read, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a relative spread of insert ends in variant molecules versus background, a raw strand of alignment bias, a raw strand of origin bias, a substitution type of variant, a two base pair context around a variant position, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of consensus in
  • the logistic regression model comprises a plurality of features, wherein each feature is multiplied by a weight, wherein the weights are determined from a training set of data with known true positive variants and known false positive variants.
  • a system for generating sequencing data.
  • the system may include an assay device and/or a logic system.
  • the logic system may include a processor coupled to a memory storing instructions executable by the processor.
  • the processor upon execution of the instructions, is configured to perform the method of one or more embodiments of the first aspect.
  • the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one true InDel variant.
  • the system further includes a reporting device for displaying information relating to at least one true InDel variant.
  • a non-transitory computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the first aspect.
  • a system for generating sequencing data.
  • the system may include an assay device and/or a logic system.
  • the logic system may include a processor coupled to a memory storing instructions executable by the processor.
  • the processor upon execution of the instructions, is configured to perform the method of one or more embodiments of the second aspect.
  • the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one true variant.
  • the system further includes a reporting device for displaying information relating to at least one true variant.
  • a non-transitory computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the second aspect.
  • FIG.1 is a flow chart of a method for an InDel caller algorithm workflow, in accordance with some embodiments.
  • FIG.2 shows an overview of a barcode deduplication algorithm, in accordance with some embodiments.
  • FIG.3 is a flow chart of a method for the barcode deduplication algorithm, in accordance with some embodiments.
  • FIG.4 is another flow chart of the method for the barcode deduplication algorithm, which illustrates the creation of barcode families, in accordance with some embodiments.
  • FIG.5 illustrates how a threshold is set for a core linear model, in accordance with some embodiments.
  • FIG.6 is a flow chart of a method for a single nucleotide variant (SNV) caller algorithm, in accordance with some embodiments.
  • FIG.7 illustrates the improved performance of a two-stage approach for the SNV caller algorithm, in accordance with some embodiments.
  • FIG.8 illustrates a sequencing system, in accordance with some embodiments.
  • FIG.9 illustrates an exemplary computer apparatus, in accordance with some embodiments.
  • DETAILED DESCRIPTION [0045] The present disclosure provides a number of techniques for performing variant calling, as part of a secondary analysis workflow of sequencing data produced by today’s next generation sequencing devices.
  • an InDel caller algorithm is provided.
  • the InDel caller algorithm can include a hotspot module and an adaptive module.
  • the InDel caller algorithm uses a number of heuristics to identify InDel variants in hotspot regions.
  • the adaptive module uses a two-stage approach to identify variants with low allele frequency (AF) that might not be strongly supported in a plasma sample.
  • In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold.
  • In the second stage, a machine learning classifier such as a gradient boosting machine (GBM) model is used to score each variant that passes the first stage, and the classifier score is then compared to a threshold to determine the final list of called variants.
  • the called variants can be further filtered by a blocklist filter.
  • GBM gradient boosting machine
  • an SNV caller algorithm is provided that is similar in many aspects to the InDel caller algorithm described above.
  • the SNV caller algorithm can include a hotspot module and an adaptive module.
  • the hotspot module is similar to the hotspot module of the InDel caller algorithm, with the exception that one or more additional heuristics may be included in addition to or in lieu of the heuristics used in the InDel caller algorithm.
  • the adaptive module of the SNV caller algorithm also uses a two stage approach to identify variants. In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold. In the second stage, an extended linear model that incorporates a number of additional weighted features is used to filter out variants that passed the first stage based on a comparison of each candidate variant’s score with a threshold.
  • sequences of interest can be sequenced using a sequencing assay as part of the procedure for determining sequencing reads for a plurality of microsatellite loci. Any of a number of sequencing technologies or sequencing assays can be utilized.
  • Next Generation Sequencing refers to sequencing methods that allow for massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules (or of nucleic acid analogues).
  • NGS Next Generation Sequencing
  • Non-limiting examples of sequencing assays that are suitable for use with the methods disclosed herein include nanopore sequencing (see, e.g., U.S. Patent Application Publication Nos. 2013/0244340, 2013/0264207, 2014/0134616, 2015/0119259, and 2015/0337366), Sanger sequencing, capillary array sequencing, thermal cycle sequencing (see, e.g., Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (see, e.g., Zimmerman et al., Methods Mol.
  • sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (see, e.g., Fu et al., Nature Biotech., 16:381-384 (1998)), sequencing by hybridization (see, e.g., Drmanac et al., Nature Biotech., 16:54-58 (1998)), and NGS methods, including but not limited to sequencing by synthesis (see, e.g., HiSeq™, MiSeq™, or Genome Analyzer, each available from Illumina, Inc.
  • sequencing by ligation (see, e.g., SOLiD™, available from Thermo Fisher Scientific, Inc. (Waltham, MA))
  • ion semiconductor sequencing (see, e.g., Ion Torrent™, available from Thermo Fisher Scientific, Inc. (Waltham, MA))
  • SMRT® sequencing, available from Pacific Biosciences of California, Inc. (Menlo Park, CA).
  • Commercially available sequencing technologies include: sequencing-by-hybridization platforms from Affymetrix, Inc. (Sunnyvale, CA), sequencing-by-synthesis platforms from Illumina, Inc. (San Diego, CA), and sequencing-by-ligation platforms from Thermo Fisher Scientific, Inc.
  • Bioinformatics Workflow Overview [0052] The output of an NGS sequencer is generally processed by a bioinformatics pipeline that processes the raw signal from the NGS sequencer and translates the raw signal into base calls, often referred to as raw reads, which are typically stored in a FASTQ file that combines the raw reads with associated quality data.
  • This portion of the bioinformatics pipeline is often referred to as primary analysis.
  • the next section of the bioinformatics pipeline is called secondary analysis, and it takes the raw reads generated by the primary analysis, and performs several tasks, including alignment and variant calling.
  • Tertiary analysis is the final portion of the bioinformatics pipeline and uses the variant calling information to generate medical insights that health care practitioners can use to improve treatments for their patients.
  • Secondary Analysis [0055] New sequencing technologies, such as nanopore-based sequencers, generate sequencing data with different characteristics than sequencing data generated by the current market leading sequencers, such as Illumina sequencers. For example, these differences can include differences in raw read accuracy and differences in the error profiles.
  • FIG.1 illustrates a secondary analysis workflow for detecting insertions and deletions (InDels).
  • InDel mutations refer to one of the most common classes of short variants that involve insertion(s) and/or deletion(s) of nucleotides in the genomic DNA.
  • Somatic InDels refer to insertion and deletion events in non-germline cells and may contribute to cancer. Such variants constitute a substantial part of the genetic variation in the cancer genome. InDels in a coding region may disrupt protein coding and often lead to changes in protein function. Given that InDels have not been studied as much as single nucleotide variants (SNVs), it is important to refine and optimize tools for InDel calling.
  • SNVs single nucleotide variants
  • the InDel algorithm workflow can be broken down into five main steps or modules: a barcode deduplication step or module 100, a pre-processing step or module 102, a multi-stage InDel calling step or module 104, a filtering step or module 106, and an output step or module 108.
  • the barcode deduplication step is further illustrated in FIGS.2-4.
  • FIG.2 provides an overview of the barcode deduplication process, which is designed to collapse properly paired reads, including duplex reads, with the same fragment start, fragment end, and unique molecular identifier (UMI) into the same barcode family.
  • paired reads from the positive strand 200 and the negative strand 202 are collapsed by the barcode deduplication step into a single barcode family 204 because they share the same UMI, fragment start, and fragment end.
  • Consensus calling can be performed within the barcode family to identify candidate variants (SNVs and InDels).
  • the output of the barcode deduplication step is a file (i.e., a BAM file) with the consensus read pairs of each barcode family.
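The family collapse described for FIG.2 can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the `Read` record, the function names, and the per-position majority vote used for the consensus are all assumptions made for the example, and the sketch assumes every read in a family has the same length.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record used only for this sketch."""
    umi: str      # unique molecular identifier
    start: int    # fragment start position
    end: int      # fragment end position
    seq: str      # base calls


def group_into_families(reads):
    """Collapse reads that share UMI, fragment start, and fragment end."""
    families = defaultdict(list)
    for read in reads:
        families[(read.umi, read.start, read.end)].append(read)
    return dict(families)


def consensus(seqs):
    """Per-position majority vote (an assumed consensus rule).

    Assumes all sequences have equal length; ties resolve to the base
    seen first at that position.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


reads = [
    Read("AAC", 100, 250, "ACGT"),
    Read("AAC", 100, 250, "ACGT"),
    Read("AAC", 100, 250, "ACTT"),   # one read carries a likely error
    Read("GGT", 100, 250, "ACGT"),   # different UMI, so a different family
]
families = group_into_families(reads)
family_consensus = {
    key: consensus([r.seq for r in members]) for key, members in families.items()
}
```

Here the three `AAC` reads form one family whose consensus masks the single discordant base, while the `GGT` read forms its own singleton family.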
  • FIG.3 illustrates a workflow for the barcode deduplication process, starting with the output from the sequencer, which is often provided in a FASTQ file 300.
  • a paired sort FASTQ file is used.
  • the FASTQ file contains raw sequencing data that can be aligned 302, using SWA or another alignment technique, for example.
  • the output of the alignment step 302 is typically a BAM file 304, which can be sorted by pair to yield a paired BAM file 306. For each read pair, low quality reads and off-target reads can be filtered out here or at an earlier stage before the barcode families are grouped together 308.
  • FIG.4 illustrates one embodiment for the barcode family grouping step.
  • a barcode family key 402 is built or extracted from the records 400, 401.
  • the barcode family key can be the UMIs associated with the paired reads.
  • the barcode family key 402 is then checked to see whether the barcode family key 402 already exists in the current barcode family linked list 404.
  • If the barcode family key 402 already exists, then the barcode family information can be added to the records 406, 407. If the barcode family key does not exist in the current barcode family linked list 404, then the fragment start position is compared with the previous fragment start position 408. If the fragment start position is the same as the previous start position, then a new barcode family is created 410, and the new barcode family information is added to the records 412, 413 to start a new barcode family. If the fragment start position is not the same as the previous start position, then the current linked list of the barcode family is processed 414 to output a barcode family mutation 416 and to output the barcode family to the dedup output file.
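The streaming behaviour of FIG.4 can be approximated with a generator that keeps only the families sharing the current fragment start in memory and flushes them when the start position changes. This is a sketch under stated assumptions: records are plain `(umi, start, end, seq)` tuples sorted by start position, the family key is taken to be `(umi, start, end)`, and a dictionary stands in for the linked list mentioned in the text.

```python
def stream_families(records):
    """Yield (family_key, members) pairs from start-sorted records.

    records: iterable of (umi, start, end, seq) tuples sorted by start.
    Families for the current start position are buffered and emitted
    once a record with a different start position is seen.
    """
    current = {}          # family key -> records at the current start
    prev_start = None
    for rec in records:
        umi, start, end, _seq = rec
        if prev_start is not None and start != prev_start:
            # Start position changed: process and output the buffered families.
            yield from current.items()
            current = {}
        # Existing key: add to that family; otherwise a new family is created.
        current.setdefault((umi, start, end), []).append(rec)
        prev_start = start
    # Flush whatever remains after the last record.
    yield from current.items()


records = [
    ("AAA", 10, 50, "ACGT"),
    ("AAA", 10, 50, "ACGT"),
    ("CCC", 10, 60, "ACGA"),
    ("AAA", 20, 70, "TTGA"),
]
flushed = list(stream_families(records))
```

Because the input is sorted, no family can gain members after its start position has passed, which is what makes the early flush safe.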
  • the barcode deduplication step 100 can also optionally exclude any reads that are soft clipped because soft clipped reads may be associated with higher error rates.
  • a multiplexed sample which contains more than one sample all mixed together, can be de-multiplexed into separate samples so that the analysis described herein can be performed on each sample.
  • pre-processing can be performed 102. Pre-processing of the dedup output can include background polishing, in which the consensus sequences from the dedup output can be compared with a reference sequence from a healthy sample.
  • the comparison with the healthy sample allows the filtering out of variants called in both the healthy sample and the actual samples as likely artifacts or errors introduced at some point in the sequencing process, such as recurrent library preparation errors, for example.
  • background polishing is only performed on plasma samples and not on tissue samples. In other embodiments, background polishing can be performed on both plasma samples and tissue samples.
  • the background polishing step can be performed by capturing errors (low allele frequency (AF) variants) as systematic errors from a set of healthy normal control cfDNA samples, which are not expected to have low AF variants.
  • a probabilistic model is built from the observed error at every position across the control reference set to yield a background error distribution.
  • the number of supporting reads for all non-reference positions is evaluated against the derived background error distribution for each of those positions. If a variant appears to belong to the background noise distribution, it is “polished” away (i.e., filtered out) from the base-counts file. That is, the number of supporting reads for that variant is set to 0.
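One way to picture the polishing decision is a one-sided test of the observed support against a per-position background error rate. The text does not specify the probabilistic model, so the binomial model, the `alpha` cutoff, and all function names below are assumptions chosen only to make the idea concrete.

```python
def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed from the lower tail.

    Terms are built iteratively in floating point so that deep coverage
    (large n) does not overflow. Assumes 0 <= p < 1.
    """
    if k <= 0:
        return 1.0
    term = (1.0 - p) ** n        # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= (n - i) / (i + 1) * p / (1.0 - p)   # P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def polish(support, depth, background_rate, alpha=0.01):
    """Return the polished supporting-read count for one variant.

    If the observed support is plausible under the background error
    rate (tail probability above alpha), the count is set to 0.
    """
    if support == 0:
        return 0
    p_noise = binom_sf(support, depth, background_rate)
    return 0 if p_noise > alpha else support


# Background rate of 0.1% at a position covered 1000 times (illustrative values).
polished_weak = polish(support=1, depth=1000, background_rate=0.001)
polished_strong = polish(support=10, depth=1000, background_rate=0.001)
```

A single supporting read is expected from noise alone at this depth and is polished away, whereas ten supporting reads are very unlikely under the background rate and are kept.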
  • preprocessing can include the parsing of the dedup outputs, which can include the computation of features used in subsequent steps of the InDel caller algorithm. These features include, but are not limited to, the features listed in Table 1, listed below.
  • the InDel calling algorithm 104 can include multiple components, such as a hotspot module and an adaptive module.
  • the hotspot module can be used on predetermined loci of interest, which can include cancer “hotspots” from targeted panels, including known actionable mutations (e.g., BRAF V600E), known biomarkers (e.g., toxicity biomarker DPYD or resistance biomarker), and recurrent cancer mutations (e.g., top 1% most mutated in COSMIC).
  • the variant caller can be designed to be more sensitive for mutations on the hotspot list.
  • the hotspot module can use heuristic rules for identifying variants.
  • a variant is called when either the duplex rule or the depth rule is met.
  • the duplex rule is met when the AF is greater than or equal to 1.5% and the duplex support is greater than or equal to 5 and total support is greater than or equal to duplex support, where support is the number of reads with that variant.
  • the depth rule is met when the AF is greater than 1.5% and the variant depth is greater than 5.
  • the duplex rule is met when (1) duplex support is greater than or equal to the minimum duplex support, and (2) the total support is greater than or equal to the duplex support.
  • the depth rule is met when the variant depth is greater than 5.
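The first set of hotspot heuristics above (AF-gated duplex and depth rules) translates directly into two predicates. The function and parameter names are illustrative assumptions; the numeric defaults are the values quoted in the text, with AF expressed as a fraction.

```python
def duplex_rule(af, duplex_support, total_support, min_af=0.015, min_duplex=5):
    """Duplex rule: AF >= 1.5%, duplex support >= 5, total support >= duplex support."""
    return (
        af >= min_af
        and duplex_support >= min_duplex
        and total_support >= duplex_support
    )


def depth_rule(af, variant_depth, min_af=0.015, min_depth=5):
    """Depth rule: AF > 1.5% and variant depth > 5 (both strict)."""
    return af > min_af and variant_depth > min_depth


def call_hotspot(af, duplex_support, total_support, variant_depth):
    """A hotspot variant is called when either rule is met."""
    return duplex_rule(af, duplex_support, total_support) or depth_rule(af, variant_depth)


called_by_duplex = call_hotspot(af=0.015, duplex_support=5, total_support=5, variant_depth=5)
called_by_depth = call_hotspot(af=0.02, duplex_support=0, total_support=6, variant_depth=6)
not_called = call_hotspot(af=0.01, duplex_support=6, total_support=8, variant_depth=8)
```

Note the asymmetry in the quoted thresholds: the duplex rule uses inclusive comparisons while the depth rule uses strict ones, so a variant at exactly 1.5% AF can pass only through the duplex rule.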
  • an adaptive module can be used.
  • the adaptive module can include a two-stage approach.
  • the first stage uses a core linear model to compute weighted variant molecule counts and learns sample-specific thresholds for calling.
  • the second stage applies an additional filter based on a LightGBM (gradient boosted decision trees) classifier that uses both variant molecule counts and additional technical and biological features, as listed in Table 1. This enables the caller to detect lower allele fraction variants with limited support in cfDNA samples as compared to other standard methods.
  • the core linear model in the first stage is described in more detail as follows.
  • The model uses the duplex ($D$), consensus ($C$), and singleton ($S$) variant reads.
  • The goal of the model is to determine the best coefficients alpha ($\alpha$), beta ($\beta$), and gamma ($\gamma$) that minimize the total errors (false positives and false negatives).
  • The weighted linear model is as follows: $\text{score} = \alpha D + \beta C + \gamma S$, (Eq. 2) where $D$ is the duplex variant count, $C$ is the consensus variant count, and $S$ is the singleton variant count. Theoretically, when the weighted linear score is larger than a certain cutoff, the variant will be called. Therefore, each variant will receive a weighted linear score according to this linear model.
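As a worked illustration of the first-stage score, the sketch below combines the duplex, consensus, and singleton variant counts with three coefficients and compares the result to a cutoff. The pairing of alpha, beta, and gamma with the duplex, consensus, and singleton counts follows the order in which the text lists them and is an assumption, as are the coefficient and threshold values, which are placeholders rather than fitted parameters.

```python
def weighted_linear_score(duplex, consensus, singleton, alpha, beta, gamma):
    """Weighted linear combination of the three variant counts."""
    return alpha * duplex + beta * consensus + gamma * singleton


def passes_first_stage(score, threshold):
    """A candidate is carried forward when its score exceeds the cutoff."""
    return score > threshold


# Placeholder coefficients: duplex evidence weighted most, singletons least.
score = weighted_linear_score(
    duplex=2, consensus=1, singleton=4, alpha=3.0, beta=2.0, gamma=1.0
)
is_candidate = passes_first_stage(score, threshold=10.0)
```

With these placeholder values the score is 3.0·2 + 2.0·1 + 1.0·4 = 12.0, which exceeds the illustrative cutoff of 10.0.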
  • All the candidate variants identified by the core linear model in the first stage will be scored by a machine learning classifier, such as a non-linear tree-based ensemble learning classifier (e.g., LightGBM).
  • Table 1 above describes the full list of features used by the LightGBM classifier.
  • Variants from the first stage with a classifier score above a certain threshold will be retained as the final calls.
  • the threshold for the classifier is calibrated using a set of healthy samples to target the desired or predetermined false positive levels. Other types of classifiers can be used, with preference for classifiers having interpretable features.
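Calibrating a score threshold on healthy samples can be sketched as picking a cutoff from the sorted scores, under the assumption that every candidate scored in a healthy sample would be a false positive if called. The function name and the rule for converting the target rate into an allowed count are illustrative choices, not the disclosed procedure.

```python
def calibrate_threshold(healthy_scores, target_fpr):
    """Pick a cutoff so that at most target_fpr of healthy scores exceed it.

    Calls are assumed to require score > threshold (strict).
    """
    scores = sorted(healthy_scores, reverse=True)
    allowed = int(target_fpr * len(scores))   # false positives tolerated
    if allowed >= len(scores):
        return float("-inf")                  # everything may pass
    # The (allowed + 1)-th highest score: at most `allowed` scores lie above it.
    return scores[allowed]


healthy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
threshold = calibrate_threshold(healthy_scores, target_fpr=0.2)
false_positives = sum(s > threshold for s in healthy_scores)
```

Rounding the allowed count down keeps the calibration conservative: the realized false positive fraction on the calibration set never exceeds the target.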
  • the output from the InDel caller can then be filtered 106 using a blocklist filter and other filter tags.
  • the blocklist filter can be generated to block repetitive false positive variants detected in multiple cohorts.
  • the variants included in the blocklist filter can be limited to somatic AF (AF less than 30%) level false positive variants only, where the false positive variant has to be found in greater than 2 samples in the same cohort, and found in greater than or equal to 2 independent cohorts.
  • Filter tags that can be applied include a low AF tag, which is applied when the variant AF is less than a minimum AF threshold, a low support tag, which is applied when the variant support is less than a minimum support threshold, and a low depth tag, which is applied when the variant depth is less than a minimum depth threshold.
  • Application of these tags or the blocklist filter to a variant will remove the variant from the list of called variants.
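The filtering step can be sketched as a pass over the called variants that drops anything on the blocklist or carrying a filter tag. The dictionary layout of a variant, the tag strings, and the threshold values are hypothetical and chosen only for the example.

```python
def filter_tags(af, support, depth, min_af, min_support, min_depth):
    """Return the filter tags that apply to one variant."""
    tags = []
    if af < min_af:
        tags.append("low_af")
    if support < min_support:
        tags.append("low_support")
    if depth < min_depth:
        tags.append("low_depth")
    return tags


def apply_filters(variants, blocklist, min_af, min_support, min_depth):
    """Keep variants that are not blocklisted and carry no filter tag."""
    kept = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        if key in blocklist:
            continue
        if filter_tags(v["af"], v["support"], v["depth"], min_af, min_support, min_depth):
            continue
        kept.append(v)
    return kept


variants = [
    {"chrom": "chr7", "pos": 100, "ref": "A", "alt": "AT", "af": 0.02, "support": 8, "depth": 400},
    {"chrom": "chr7", "pos": 200, "ref": "CG", "alt": "C", "af": 0.001, "support": 8, "depth": 8000},
    {"chrom": "chr1", "pos": 300, "ref": "G", "alt": "GA", "af": 0.05, "support": 20, "depth": 400},
]
blocklist = {("chr1", 300, "G", "GA")}   # recurrent false positive (illustrative)
kept = apply_filters(variants, blocklist, min_af=0.005, min_support=3, min_depth=100)
```

Only the first variant survives: the second is tagged for low AF and the third is on the blocklist.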
  • the output is a list of called variants that can be written as a VCF file, for example.
  • SNV Caller Single nucleotide variants (SNVs) refer to mutations with a single change of an A/G/C/T base.
  • Somatic SNVs occur only in somatic (non-germline) cells. Cancer mutations are of this type. In comparison, another type of SNV is germline, or inherited, because it occurs in sperm or egg cells. This type of mutation is referred to as a SNP (single nucleotide polymorphism).
  • SNPs single nucleotide polymorphisms.
  • the SNV caller algorithm described herein was developed for calling SNVs in cfDNA and tissue samples, and the main goal is to call any variants observed in the data and distinguish them from technical artifacts. Further classification of variants by type (i.e., somatic or germline) can be performed in downstream modules.
  • the SNV caller module described herein and shown in FIG.6 includes three major components, described below.
  • pre-processing module 602 performs calculations and operations on the barcode-deduplication output files 600 before the main body of SNV caller 604, including parsing the variant file and the depth file, where the depth refers to the number of times a locus of the genome was sequenced.
  • the SNV caller also incorporates background polishing.
  • pre-processing includes calculating the features that are used in the SNV calling algorithms.
  • SNV caller 604 main body includes several components: a core linear model, an adaptive caller, a hotspot caller, and an extended linear model based final filtration.
  • the adaptive caller and the hotspot caller are two parallel calling modules.
  • the adaptive caller detects variants in a selector-wide fashion.
  • the error distribution of each of the 12 substitution types or 192 tri-nucleotide substitution types is modeled in the sample where mutations are being called, and sample-specific, substitution-specific supporting score thresholds are set.
  • in the hotspot caller, simpler heuristic rules are applied to call the variants listed in the loci of interest (which usually are known genomic mutations of clinical impact in cancers).
  • a supporting score for each variant is calculated using a weighted linear combination of singleton, consensus, and duplex reads.
  • calling occurs in two stages: (1) a core linear model and (2) an extended linear model.
  • the core linear model based adaptive variant calls are further filtered based on probability scores according to the extended linear model.
  • the extended linear model is a logistic regression classification model that uses both molecule counts as well as multiple technical and biological variant features such as sequence context, distance to read end, strand bias, and relative spread of variant families with respect to background.
  • the barcode deduplication module 600 is the same as described above with respect to FIG.1 for the InDel caller.
  • the pre-processing module 602 is also largely the same as described above with respect to FIG.1 for the InDel caller except that different features are calculated for the SNV caller module.
  • the background polishing step is the same.
  • the SNV calling module 604 includes a hotspot module that is essentially the same as described above for the InDel caller, with some additional rules that apply to SNV type mutations.
  • one additional optional heuristic rule covers the special case of C>T or G>A substitutions: a single duplex supporting read is required, or optionally two duplex supporting reads for sites determined to be high noise based on the background error distribution computed in the background polishing step.
  • the core linear model of the SNV calling module 604 is also essentially the same as described above for the InDel caller.
  • a tri-nucleotide module models the error distribution of each of the 12 substitution types or 192 tri-nucleotide substitution types in the sample where mutations are being called, and sample-specific substitution-specific supporting score thresholds are set.
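The counts of 12 and 192 follow from simple enumeration: 4 reference bases times 3 alternate bases gives 12 substitution types, and adding one flanking base on each side gives 4 x 12 x 4 = 192 tri-nucleotide types. The sketch below enumerates both sets. The label format (for example `A[C>T]G`) and the function names are illustrative assumptions, not taken from the disclosure.

```python
from itertools import product

BASES = "ACGT"


def substitution_types():
    """All ref>alt base changes, e.g. 'C>T'."""
    return [f"{ref}>{alt}" for ref, alt in product(BASES, repeat=2) if ref != alt]


def trinucleotide_substitution_types():
    """Each substitution flanked by one 5' and one 3' base, e.g. 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for five, sub, three in product(BASES, substitution_types(), BASES)
    ]


def classify(five_prime, ref, alt, three_prime):
    """Label one observed SNV with its tri-nucleotide substitution type."""
    return f"{five_prime}[{ref}>{alt}]{three_prime}"
```

A per-sample error model could then keep one score threshold per label, which is one way to read "sample-specific substitution-specific" thresholds.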
  • Variant score is based on counts of singleton (s), consensus (c), and duplex (d) barcode families of variants as well as a variety of additional technical and biological variant features. These features are listed in Table 2, below.
  • the final variant score for k additional features (f) and weights (w) is given by an expression (not reproduced in this text) in which the logarithm denotes the natural log.
  • the filtering module 606 can include a duplex ratio filter, a blocklist filter, and other tags that can be used to exclude certain variant calls.
  • Duplex ratio filter: for every variant called, the filtering module 606 computes the number of barcode-deduplicated reads (i.e., the number of barcode families) that support the variant and the number that support the reference allele. With this barcode scheme, it is also possible to identify barcode families that have duplex support (meaning that the other strand in the original duplex can also be assigned). This gives counts of duplex deduplicated reads (i.e., the number of barcode families with duplex evidence) that support the variant and the reference allele. The percentage of duplex families should be around 10-20%, so a significant depletion of duplex reads against this expectation can indicate contamination by single-stranded molecules and an artifactual variant.
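One way to make "significant depletion" concrete is a one-sided binomial test of the variant's duplex count against the duplex fraction seen among reference-supporting families. The sketch below assumes that test, the significance level `alpha`, and all function names; the disclosure does not state which statistic is used.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def duplex_depleted(var_families, var_duplex, ref_families, ref_duplex, alpha=0.01):
    """Flag a variant whose duplex support is unexpectedly low.

    The expected duplex fraction is estimated from reference-supporting
    families (an assumption); `alpha` is a hypothetical cutoff.
    """
    if var_families == 0 or ref_families == 0:
        return False
    expected_fraction = ref_duplex / ref_families
    p_value = binomial_cdf(var_duplex, var_families, expected_fraction)
    return p_value < alpha
```

With 60 variant families, none of them duplex, and a 15% duplex fraction at the reference allele, the variant would be flagged; with 9 duplex families out of 60 it would not.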
  • FIG.8 illustrates a sequencing system 800 according to an embodiment of the present disclosure.
  • the system as shown includes a sample 805, such as Xpandomers within an assay device 810, where an assay 808 can be performed on sample 805.
  • sample 805 can be contacted with reagents of assay 808 to provide a signal (e.g., an intensity signal) of a physical characteristic 815 (e.g., sequence information of a cell-free nucleic acid molecule).
  • Assay 808 may include sequencing by expansion with an assay device 810, such as a nanopore sequencing device as discussed above.
  • Physical characteristic 815 can be, e.g., a fluorescence intensity, a voltage, or a current.
  • detector 820 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal.
  • an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
  • Assay device 810 and detector 820 can form an assay system, e.g., a sequencing system 800 that performs sequencing according to embodiments described herein.
  • a data signal 825 is sent from detector 820 to logic system 830.
  • data signal 825 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA).
  • Data signal 825 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 805, and thus data signal 825 can correspond to multiple signals.
  • Data signal 825 may be stored in a local memory 835, an external memory 840, or a storage device 845.
  • the sequencing system 800 can be comprised of multiple assay devices 810 and detectors 820.
  • Logic system 830 may be, or may include, a computer system, ASIC, processor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.).
  • Logic system 830 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 820 and/or assay device 810.
  • Logic system 830 may also include software that executes in a processor 850.
  • Logic system 830 may include a computer readable medium storing instructions for controlling sequencing system 800 to perform any of the methods described herein.
  • logic system 830 can provide commands to a system that includes assay device 810 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order.
  • Sequencing system 800 may also include a treatment device 860, which can provide a treatment to the subject. Treatment device 860 can determine a treatment and/or be used to perform a treatment.
  • Sequencing system 800 may also include a reporting device 855, which can present results of any of the methods described herein, e.g., as determined using the sequencing system 800. Reporting device 855 can be in communication with a reporting module within logic system 830 that can aggregate, format, and send a report to reporting device 855.
  • the reporting module can present information determined using any of the methods described herein.
  • the information can be presented by reporting device 855 in any format that can be recognized and interpreted by a user of the sequencing system 800.
  • the information can be presented by reporting device 855 in a displayed, printed, or transmitted format, or any combination thereof.
  • Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG.9 in computer system 900.
  • a computer system 900 includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
  • a computer system 900 can include multiple computer apparatuses, each being a subsystem, with internal components.
  • a computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
  • the subsystems shown in FIG.9 are interconnected via a system bus 975. Additional subsystems such as a printer 974, keyboard 978, storage device(s) 979, 982, monitor 976 (e.g., a display screen, such as an LED), which is coupled to display adapter 982, and others are shown.
  • Peripherals and input/output (I/O) devices, which couple to I/O controller 971, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 977 (e.g., USB, FireWire®).
  • I/O port 977 or external interface 981 can be used to connect computer system 900 to a wide area network such as the Internet, a mouse input device, or a scanner.
  • the interconnection via system bus 975 allows the central processor 973 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 972 or the storage device(s) 979 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
  • the system memory 972 and/or the storage device(s) 979 may embody a computer readable medium.
  • a computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 981, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component.
  • computer systems, subsystems, or apparatuses can communicate over a network.
  • one computer can be considered a client and another computer a server, where each can be part of a same computer system.
  • a client and a server can each include multiple systems, subsystems, or components.
  • methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices.
  • Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages.
  • Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
  • aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC.
  • a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware.
  • other ways and/or methods can be used to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
  • Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques.
  • the software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission.
  • a suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like.
  • the computer readable medium may be any combination of such devices.
  • the order of operations may be re-arranged.
  • a process can be terminated when its operations are completed but could have additional steps not included in a figure.
  • a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
  • Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
  • a computer readable medium may be created using a data signal encoded with such programs.
  • Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network.
  • a computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
  • Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time.
  • the term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days.
  • embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps.
  • steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
  • the term “and/or” includes any and all combinations of one or more of the associated listed items and may be abbreviated as “/”.
  • Spatially relative terms such as “under”, “below”, “lower”, “over”, “upper” and the like, may be used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as “under” or “beneath” other elements or features would then be oriented “over” the other elements or features.
  • the exemplary term “under” can encompass both an orientation of over and under.
  • the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
  • the terms “upwardly”, “downwardly”, “vertical”, “horizontal” and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.
  • although the terms “first” and “second” may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise. These terms may be used to distinguish one feature/element from another feature/element.
  • a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc.
  • Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein.
  • inventive subject matter may be referred to herein individually or collectively by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept, if more than one is, in fact, disclosed.
  • any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown.
  • This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The variant caller algorithms described herein can be used to identify various mutations in sequencing data. In accordance with various embodiments, an InDel caller and/or an SNV caller can use a multi-stage filtering approach involving a core linear model as well as an additional model to provide additional filtering of the identified variants to improve sensitivity and/or accuracy. A barcode deduplication module can also be used to de-multiplex a multiplexed sample so that the variant caller algorithms can be applied separately to each sample. Additional filtering steps can be used to further reduce false positives and/or false negatives.

Description

PATENT Client Reference No.: P39266-WO-1
INTERNATIONAL PATENT APPLICATION
Title: SYSTEMS AND METHODS FOR VARIANT CALLING
Inventors:
James Han, a U.S. citizen, resident of San Carlos, CA
Md Abid Hasan, a citizen of Bangladesh, resident of Pleasanton, CA
Zhongyun Huang, a citizen of China, resident of Fremont, CA
Tieming Ji, a U.S. citizen, resident of Foster City, CA
Seyed Hamid Mirebrahim, a citizen of Iran, resident of Mission Viejo, CA
Badri Padhukasahasram, a U.S. citizen, resident of Bangaluru, India
Ke Tang, a citizen of China, resident of Dublin, CA
Assignee: Roche Sequencing Solutions, Inc., 4300 Hacienda Drive, Pleasanton, CA 94588, United States of America
Entity: Large
SYSTEMS AND METHODS FOR VARIANT CALLING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This Application claims the benefit of United States Provisional Patent Application No. 63/635,576, filed on April 17, 2024, which is hereby incorporated by reference in its entirety.
BACKGROUND
[0002] The development of affordable and rapid DNA sequencing technologies has enabled the development of targeted therapeutics that rely on the use of DNA biomarkers to identify patients that are suitable for receiving the targeted therapy. For example, mutations in certain genes, such as genes involved in cell proliferation, are known to lead to certain types of cancers that can be treated very effectively with specific types of drugs. Other mutations are known to confer resistance to certain therapies. Therefore, there is a need for improved systems and methods to identify variants from sequencing data.
SUMMARY
[0003] The embodiments described herein relate to systems and methods for performing variant calling on sequencing data. More particularly, the embodiments described herein relate to calling single nucleotide variants, insertions, and deletions in the sequencing data.
[0004] In accordance with a first aspect of the present disclosure, a method for calling insertions and deletions (InDels) from sequencing data generated from a multiplexed sample is provided. The method includes receiving a computer file comprising a plurality of paired sequence reads; aligning the plurality of paired sequence reads to a reference sequence; sorting the aligned, paired sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups according to a unique molecular identifier (UMI), a sequence read start position, and a sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate InDel variants based on a comparison of the polished consensus sequence reads with the reference sequence; for each candidate InDel variant, calculating a weighted linear score using a linear model, wherein the candidate InDel variant is called a potential InDel variant when the weighted linear score is greater than a linear model threshold; and for each potential InDel variant, calculating a score using a non-linear tree-based ensemble machine learning classifier, wherein the potential InDel variant is called a true InDel variant when the score is greater than a machine learning classifier threshold.
[0005] In some embodiments of the first aspect, the method further includes generating a computer file comprising the paired consensus sequence reads.
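The grouping and consensus steps of paragraph [0004] can be illustrated with a short sketch. It assumes a per-position majority vote and equal-length sequences within a group; the disclosure does not specify the consensus rule, and the `Read` type and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    umi: str
    start: int
    end: int
    seq: str


def group_reads(reads):
    """Group reads that share the same UMI, start position, and end position."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.umi, read.start, read.end)].append(read.seq)
    return dict(groups)


def consensus(seqs):
    """Per-position majority vote (assumed rule; sequences assumed equal length)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def consensus_reads(reads):
    """Map each (UMI, start, end) key to (consensus sequence, family size)."""
    return {
        key: (consensus(seqs), len(seqs))
        for key, seqs in group_reads(reads).items()
    }
```

The family size returned alongside each consensus is kept because downstream scoring distinguishes molecules by how many reads support them.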
[0006] In some embodiments of the first aspect, the step of polishing the parsed consensus sequence reads includes: comparing the parsed consensus sequence reads to the reference sequence to determine a set of preliminary InDel variants, including a count of each preliminary InDel variant; and adjusting the count of each preliminary InDel variant based on a comparison of the count of each preliminary InDel variant to a corresponding background variant count that is determined from a control sample.
[0007] In some embodiments of the first aspect, the comparison comprises determining whether the count for each preliminary InDel variant is background noise.
[0008] In some embodiments of the first aspect, the step of determining whether the count for each preliminary InDel variant is background noise comprises calculating a probability that the preliminary InDel variant is background noise.
[0009] In some embodiments of the first aspect, the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
[0010] In some embodiments of the first aspect, the threshold for the linear model is sample specific.
[0011] In some embodiments of the first aspect, the threshold for the linear model is determined for all the candidate InDel variants in the sample by fitting a regression to a plot of a cumulative candidate InDel variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative InDel variant count.
[0012] In some embodiments of the first aspect, the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
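The training-data approach of paragraph [0012] can be sketched as below. The sketch assumes the cumulative count is taken over false positives scoring strictly above a candidate threshold, which matches the "greater than" calling rule of paragraph [0004]; the function names and the handling of ties are illustrative choices.

```python
def false_positives_above(fp_scores, score):
    """Cumulative count of false positive calls scoring above `score`."""
    return sum(1 for s in fp_scores if s > score)


def threshold_for_target(fp_scores, target_fp_count):
    """Lowest observed score that leaves at most `target_fp_count`
    false positives above it."""
    ranked = sorted(fp_scores, reverse=True)
    if target_fp_count >= len(ranked):
        return float("-inf")  # every false positive can be tolerated
    return ranked[target_fp_count]
```

For false positive scores of 1.0, 2.5, 0.5, 3.0, and 2.0 with a target of two, the selected threshold is 2.0, which leaves exactly the two highest-scoring false positives above it.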
[0013] In some embodiments of the first aspect, the non-linear tree-based ensemble machine learning classifier comprises a singleton molecule count, a consensus molecule count, a duplex molecule count, a ratio of the singleton molecule count to the duplex molecule count, a median distance of the potential variant from a 5’ end of the sequence reads, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a raw strand of alignment bias for consensus molecules, a raw strand of origin bias for singleton molecules, a raw strand of alignment bias for non-duplex molecules, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of singleton in variant molecules versus background, a relative frequency of duplex in variant molecules versus background, and a 2 x 2 Fisher exact test p-value for the ratio of singletons to duplex in variant molecules versus background.
[0014] In some embodiments of the first aspect, the machine learning classifier threshold is calibrated using a set of healthy samples to target a predetermined false positive rate.
[0015] In accordance with a second aspect of the present disclosure, a method for calling single nucleotide variants (SNV) from sequencing data generated from a multiplexed sample is provided. The method includes: receiving a computer file comprising a plurality of paired sequence reads; aligning the sequence reads to a reference sequence; sorting by pair the aligned sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups by unique molecular identifier (UMI), sequence read start position, and sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate variants based on a comparison of the polished consensus reads; for each candidate SNV variant, calculating a weighted linear score using a linear model, wherein the candidate SNV variant is called a potential SNV variant when the weighted linear score is greater than a linear model threshold; and for each potential SNV variant, calculating a weighted logistic score using a logistic regression model, wherein the potential SNV variant is called a true variant when the weighted logistic score is greater than a logistic regression model threshold.
[0016] In some embodiments of the second aspect, the method further includes generating a computer file comprising paired consensus sequence reads.
[0017] In some embodiments of the second aspect, the step of polishing the parsed consensus sequence reads includes: comparing the parsed consensus reads to the reference sequence to determine a set of preliminary SNV variants, including a count of each preliminary SNV variant; and adjusting the count of each preliminary SNV variant based on a comparison of the count of each preliminary SNV variant to a corresponding background variant count that is determined from a control sample.
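The count adjustment of paragraph [0017] can be illustrated with a small sketch. It assumes a Poisson noise model whose rate is estimated from the control sample, and it zeroes any count that the noise model explains; the model, the cutoff `alpha`, and the function names are all assumptions, since the disclosure only states that counts are adjusted by comparison with a background count.

```python
from math import exp, factorial


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))


def polish_count(count, depth, bg_count, bg_depth, alpha=0.05):
    """Return the adjusted variant count.

    The expected noise count is the control-sample error rate scaled to the
    sample depth. Counts that are plausible under that noise level are set
    to zero; `alpha` is a hypothetical cutoff.
    """
    if bg_depth == 0:
        return count  # no background information for this variant
    expected_noise = depth * bg_count / bg_depth
    p_noise = poisson_sf(count, expected_noise)
    return 0 if p_noise > alpha else count
```

With a control error rate of 2 per 1,000 molecules and a sample depth of 1,000, a count of 3 is treated as noise, while a count of 12 is retained.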
[0018] In some embodiments of the second aspect, the comparison comprises determining whether the count for each preliminary SNV variant is background noise.
[0019] In some embodiments of the second aspect, the step of determining whether the count for each preliminary SNV variant is background noise comprises calculating a probability that the preliminary SNV variant is background noise.
[0020] In some embodiments of the second aspect, the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
[0021] In some embodiments of the second aspect, the threshold for the linear model is sample specific.
[0022] In some embodiments of the second aspect, the threshold for the linear model is determined for all the candidate SNV variants in the sample by fitting a regression to a plot of a cumulative candidate SNV variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative SNV variant count.
[0023] In some embodiments of the second aspect, the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
[0024] In some embodiments of the second aspect, the logistic regression model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
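The linear model of paragraph [0020] combines the three molecule counts into one score. A minimal sketch follows; the weight values are placeholders chosen only so that duplex evidence counts most, and the names are hypothetical. The disclosure does not state the actual weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeCounts:
    singleton: int
    consensus: int
    duplex: int


# Placeholder weights (assumed): the disclosure gives no numeric values.
W_SINGLETON = 1.0
W_CONSENSUS = 2.0
W_DUPLEX = 4.0


def weighted_linear_score(counts):
    """Weighted linear combination of singleton, consensus, and duplex counts."""
    return (
        W_SINGLETON * counts.singleton
        + W_CONSENSUS * counts.consensus
        + W_DUPLEX * counts.duplex
    )


def is_potential_variant(counts, threshold):
    """A candidate becomes a potential variant when its score exceeds the threshold."""
    return weighted_linear_score(counts) > threshold
```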
[0025] In some embodiments of the second aspect, the logistic regression model further comprises a ratio of the singleton variant count to the duplex molecule count, an adjustment for depth of coverage, an adjustment for errors in a gene, a median distance of the variant from 5’ end of the sequence read, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a relative spread of insert ends in variant molecules versus background, a raw strand of alignment bias, a raw strand of origin bias, a substitution type of variant, a two base pair context around a variant position, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of consensus in variant molecules versus background, and a relative frequency of duplex in variant molecules versus background.
[0026] In some embodiments of the second aspect, the logistic regression model comprises a plurality of features, wherein each feature is multiplied by a weight, wherein the weights are determined from a training set of data with known true positive variants and known false positive variants.
[0027] In accordance with a third aspect of the present disclosure, a system is provided for generating sequencing data. The system may include an assay device and/or a logic system. The logic system may include a processor coupled to a memory storing instructions executable by the processor. The processor, upon execution of the instructions, is configured to perform the method of one or more embodiments of the first aspect.
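The weighted-feature scoring of paragraphs [0025] and [0026] corresponds to evaluating a fitted logistic regression. The sketch below shows only the scoring step, with features and weights passed as dictionaries; the feature names, the intercept, and any numeric weights are placeholders, since the trained values are not given in the text.

```python
from math import exp


def logistic_score(features, weights, intercept=0.0):
    """Probability-like score from weighted features.

    `features` and `weights` map hypothetical feature names to values;
    a feature without a weight raises KeyError so mismatches are visible.
    """
    z = intercept + sum(weights[name] * value for name, value in features.items())
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def is_true_variant(features, weights, threshold, intercept=0.0):
    """A potential variant is called when its score exceeds the model threshold."""
    return logistic_score(features, weights, intercept) > threshold
```

Fitting the weights from labeled true positive and false positive variants would be done separately, for example with any standard logistic regression routine.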
PATENT Client Reference No.: P39266-WO-1 [0028] In some embodiments of the third aspect, the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one true InDel variant. [0029] In some embodiments of the third aspect, the system further includes a reporting device for displaying information relating to at least one true InDel variant. [0030] In accordance with a fourth aspect of the present disclosure, a non-transitory computer-readable medium is provided. The computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the first aspect. [0031] In accordance with a fifth aspect of the present disclosure, a system is provided for generating sequencing data. The system may include an assay device and/or a logic system. The logic system may include a processor coupled to a memory storing instructions executable by the processor. The processor, upon execution of the instructions, is configured to perform the method of one or more embodiments of the second aspect. [0032] In some embodiments of the fifth aspect, the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one true variant. [0033] In some embodiments of the fifth aspect, the system further includes a reporting device for displaying information relating to at least one true variant. [0034] In accordance with a sixth aspect of the present disclosure, a non-transitory computer-readable medium is provided. The computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the second aspect. BRIEF DESCRIPTION OF THE DRAWINGS [0035] The features of the invention are set forth with particularity in the claims that follow. 
A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings thereof.

[0036] FIG.1 is a flow chart of a method for an InDel caller algorithm workflow, in accordance with some embodiments.

[0037] FIG.2 shows an overview of a barcode deduplication algorithm, in accordance with some embodiments.

[0038] FIG.3 is a flow chart of a method for the barcode deduplication algorithm, in accordance with some embodiments.

[0039] FIG.4 is another flow chart of the method for the barcode deduplication algorithm, which illustrates the creation of barcode families, in accordance with some embodiments.

[0040] FIG.5 illustrates how a threshold is set for a core linear model, in accordance with some embodiments.

[0041] FIG.6 is a flow chart of a method for a single nucleotide variant (SNV) caller algorithm, in accordance with some embodiments.

[0042] FIG.7 illustrates the improved performance of a two-stage approach for the SNV caller algorithm, in accordance with some embodiments.

[0043] FIG.8 illustrates a sequencing system, in accordance with some embodiments.

[0044] FIG.9 illustrates an exemplary computer apparatus, in accordance with some embodiments.

DETAILED DESCRIPTION

[0045] The present disclosure provides a number of techniques for performing variant calling, as part of a secondary analysis workflow of sequencing data produced by today’s next generation sequencing devices. More particularly, a variant calling algorithm is provided for identifying single nucleotide variants (SNVs) and/or insertions/deletions (InDels) from sequencing data, such as nanopore-based sequencing data.

[0046] In at least one aspect of the secondary analysis workflow, an InDel caller algorithm is provided.
The InDel caller algorithm can include a hotspot module and an adaptive module. In the hotspot module, the InDel caller algorithm uses a number of heuristics to identify InDel variants in hotspot regions. In parallel, the adaptive module uses a two-stage approach to identify variants with low allele frequency (AF) that might not be strongly supported in a plasma sample. In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold. In the second stage, a machine learning classifier, such as a gradient boosting machine (GBM) model, is used to score each variant that passes the first stage, and then the classifier score is compared to a threshold to determine the final list of called variants. Optionally, the called variants can be further filtered by a blocklist filter.

[0047] In another aspect of the secondary analysis workflow, an SNV caller algorithm is provided that is similar in many aspects to the InDel caller algorithm described above. The SNV caller algorithm can include a hotspot module and an adaptive module. The hotspot module is similar to the hotspot module of the InDel caller algorithm, with the exception that one or more additional heuristics may be included in addition to or in lieu of the heuristics used in the InDel caller algorithm. The adaptive module of the SNV caller algorithm also uses a two-stage approach to identify variants. In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold. In the second stage, an extended linear model that incorporates a number of additional weighted features is used to filter out variants that passed the first stage based on a comparison of each candidate variant’s score with a threshold.
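The two-stage flow shared by both adaptive modules can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Candidate` record, the function names, the weights (3, 2, 1), the cutoffs, and the toy second-stage scorer are all hypothetical placeholders chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate variant with its supporting molecule counts."""
    name: str
    duplex: int
    consensus: int
    singleton: int


def two_stage_call(
    candidates: List[Candidate],
    stage1_score: Callable[[Candidate], float],
    stage1_cutoff: float,
    stage2_score: Callable[[Candidate], float],
    stage2_cutoff: float,
    blocklist: FrozenSet[str] = frozenset(),
) -> List[Candidate]:
    """Keep variants whose score exceeds the cutoff in both stages, then drop blocklisted ones."""
    # Stage 1: coarse filter on the core linear score.
    passed_stage1 = [v for v in candidates if stage1_score(v) > stage1_cutoff]
    # Stage 2: only the survivors are scored by the second model.
    passed_stage2 = [v for v in passed_stage1 if stage2_score(v) > stage2_cutoff]
    # Optional blocklist filter on the final calls.
    return [v for v in passed_stage2 if v.name not in blocklist]


def core_linear_score(v: Candidate) -> float:
    # Hypothetical weights; the disclosure learns them from data.
    return 3.0 * v.duplex + 2.0 * v.consensus + 1.0 * v.singleton


def toy_second_stage_score(v: Candidate) -> float:
    # Stand-in for the GBM classifier or extended linear model: duplex fraction.
    total = v.duplex + v.consensus + v.singleton
    return v.duplex / total if total else 0.0


candidates = [
    Candidate("chr1:100:delA", duplex=3, consensus=2, singleton=1),
    Candidate("chr2:200:insT", duplex=0, consensus=0, singleton=2),
    Candidate("chr3:300:delG", duplex=2, consensus=1, singleton=9),
    Candidate("chr4:400:insC", duplex=4, consensus=3, singleton=0),
]
calls = two_stage_call(
    candidates,
    core_linear_score, 5.0,
    toy_second_stage_score, 0.3,
    blocklist=frozenset({"chr4:400:insC"}),
)
print([v.name for v in calls])
```

In this sketch the second-stage scorer never sees candidates rejected by the first stage, which mirrors the ordering described above.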
[0048] These secondary analysis workflows can be used to analyze sequencing data from a number of next generation sequencing devices, including nanopore-based sequencing systems. Correctly identifying variants from the raw sequencing data is an important step in the bioinformatics pipeline used to diagnose patients and/or create therapies tailored to a particular disease.

Sequencing

[0049] Prepared nucleic acid molecules of interest (e.g., a sequencing library) can be sequenced using a sequencing assay as part of the procedure for determining sequencing reads for a plurality of microsatellite loci. Any of a number of sequencing technologies or sequencing assays can be utilized. The term "Next Generation Sequencing (NGS)" as used herein refers to sequencing methods that allow for massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules (or of nucleic acid analogues).

[0050] Non-limiting examples of sequencing assays that are suitable for use with the methods disclosed herein include nanopore sequencing (see, e.g., U.S. Patent Application Publication Nos. 2013/0244340, 2013/0264207, 2014/0134616, 2015/0119259, and 2015/0337366), Sanger sequencing, capillary array sequencing, thermal cycle sequencing (see, e.g., Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (see, e.g., Zimmerman et al., Methods Mol. Cell Biol., 3:39-42 (1992)), sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (see, e.g., Fu et al., Nature Biotech., 16:381-384 (1998)), sequencing by hybridization (see, e.g., Drmanac et al., Nature Biotech., 16:54-58 (1998)), and NGS methods, including but not limited to sequencing by synthesis (see, e.g., HiSeq™, MiSeq™, or Genome Analyzer, each available from Illumina, Inc. (San Diego, CA)), sequencing by ligation (see, e.g., SOLiD™, available from Thermo Fisher Scientific, Inc.
(Waltham, MA)), ion semiconductor sequencing (see, e.g., Ion Torrent™, available from Thermo Fisher Scientific, Inc. (Waltham, MA)), and SMRT® sequencing, available from Pacific Biosciences of California, Inc. (Menlo Park, CA).

[0051] Commercially available sequencing technologies include: sequencing-by-hybridization platforms from Affymetrix, Inc. (Sunnyvale, CA), sequencing-by-synthesis platforms from Illumina, Inc. (San Diego, CA), and sequencing-by-ligation platforms from Thermo Fisher Scientific, Inc. (Waltham, MA). Other sequencing technologies include, but are not limited to, the Ion Torrent technology by Thermo Fisher Scientific, Inc. (Waltham, MA), and nanopore sequencing by Roche Sequencing Solutions, Inc. (Santa Clara, CA) and/or Oxford Nanopore Technologies, plc (Oxford, United Kingdom).

Bioinformatics Workflow Overview

[0052] The output of an NGS sequencer is generally processed by a bioinformatics pipeline that processes the raw signal from the NGS sequencer and translates the raw signal into base calls, often referred to as raw reads, which are typically stored in a FASTQ file that combines the raw reads with associated quality data. This portion of the bioinformatics pipeline is often referred to as primary analysis.

[0053] The next section of the bioinformatics pipeline is called secondary analysis. It takes the raw reads generated by the primary analysis and performs several tasks, including alignment and variant calling.

[0054] Tertiary analysis is the final portion of the bioinformatics pipeline and uses the variant calling information to generate medical insights that health care practitioners can use to improve treatments for their patients.

Secondary Analysis

[0055] New sequencing technologies, such as nanopore-based sequencers, generate sequencing data with different characteristics than sequencing data generated by the current market leading sequencers, such as Illumina sequencers.
For example, these differences can include differences in raw read accuracy and differences in the error profiles. Because Illumina sequencers currently dominate the market, the vast majority of the secondary analysis software tools that have been developed are custom tailored to process the type of data that is generated by the Illumina sequencers. These software tools, which typically work very well with data from Illumina sequencers, may not work well with data generated by new next generation sequencing technologies, such as nanopore sequencers. Consequently, there is a need to develop new secondary analysis tools that work well with the new sequencing technologies that are currently being developed. In addition, although the variant calling methods described herein may be particularly effective with nanopore sequencing data, the methods can also be used with other types of sequencing data, such as data from an Illumina sequencer.

InDel Caller

[0056] FIG.1 illustrates a secondary analysis workflow for detecting insertions and deletions (InDels). InDel mutations refer to one of the most common classes of short variants that involve insertion(s) and/or deletion(s) of nucleotides in the genomic DNA. Somatic InDels refer to insertion and deletion events in non-germline cells and may contribute to cancer. Such variants constitute a substantial part of the genetic variation in the cancer genome. InDels in a coding region may disrupt protein coding and often lead to changes in protein function. Given that InDels have not been studied as much as single nucleotide variants (SNVs), it is important to refine and optimize tools for InDel calling. Described herein are improved systems and methods for calling InDels with improved accuracy and/or specificity that result in fewer false positives and/or fewer false negatives.
[0057] The InDel algorithm workflow can be broken down into five main steps or modules: a barcode deduplication step or module 100, a pre-processing step or module 102, a multi-stage InDel calling step or module 104, a filtering step or module 106, and an output step or module 108.

[0058] The barcode deduplication step is further illustrated in FIGS.2-4. FIG.2 provides an overview of the barcode deduplication process, which is designed to collapse properly paired reads, including duplex reads, with the same fragment start, fragment end, and unique molecular identifier (UMI) into the same barcode family. As shown in FIG.2, paired reads from the positive strand 200 and the negative strand 202 are collapsed by the barcode deduplication step into a single barcode family 204 because they share the same UMI, fragment start, and fragment end. Consensus calling can be performed within the barcode family to identify candidate variants (SNVs and InDels). The output of the barcode deduplication step is a file (i.e., a BAM file) with the consensus read pairs of each barcode family.

[0059] FIG.3 illustrates a workflow for the barcode deduplication process, starting with the output from the sequencer, which is often provided in a FASTQ file 300. In this process, a paired sort FASTQ file is used. The FASTQ file contains raw sequencing data that can be aligned 302, using SWA or another alignment technique, for example. The output of the alignment step 302 is typically a BAM file 304, which can be sorted by pair to yield a paired BAM file 306. For each read pair, low quality reads and off-target reads can be filtered out here or at an earlier stage before the barcode families are grouped together 308. After the barcode families are grouped together 308, consensus calling 310 can be performed on each barcode family group, which results in a deduped BAM file 314. The deduped BAM file 314 can also be accompanied by a variant file, a basecount file, and/or a metrics file 312.
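As a rough illustration of the grouping and consensus steps described above, the following sketch collapses reads that share a UMI, fragment start, and fragment end into one barcode family and takes a per-position majority vote. The `Read` record, the function names, the tie rule (emit `N`), and the singleton/consensus/duplex labelling are assumptions made for this example; the disclosure does not specify them. Sequences are assumed to be equal-length and already expressed in reference orientation.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical simplified read record (a real pipeline would parse BAM records)."""
    umi: str
    frag_start: int
    frag_end: int
    strand: str      # "+" or "-" strand of origin
    sequence: str    # bases in reference orientation


FamilyKey = Tuple[str, int, int]


def group_barcode_families(reads: List[Read]) -> Dict[FamilyKey, List[Read]]:
    """Group reads sharing the same UMI, fragment start, and fragment end."""
    families: Dict[FamilyKey, List[Read]] = defaultdict(list)
    for read in reads:
        families[(read.umi, read.frag_start, read.frag_end)].append(read)
    return dict(families)


def consensus_sequence(family: List[Read]) -> str:
    """Per-position majority vote; a tie for the top base yields 'N' (assumed rule)."""
    bases = []
    for column in zip(*(r.sequence for r in family)):
        ranked = Counter(column).most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        bases.append("N" if tied else ranked[0][0])
    return "".join(bases)


def family_type(family: List[Read]) -> str:
    """Assumed labelling: both strands -> duplex, one read -> singleton, else consensus."""
    if {r.strand for r in family} == {"+", "-"}:
        return "duplex"
    return "singleton" if len(family) == 1 else "consensus"


reads = [
    Read("ACGT", 100, 104, "+", "ACGTA"),
    Read("ACGT", 100, 104, "+", "ACGTA"),
    Read("ACGT", 100, 104, "-", "ACCTA"),   # one discordant base, outvoted
    Read("TTAG", 100, 104, "+", "ACGTT"),   # different UMI -> separate family
    Read("ACGT", 250, 254, "+", "GGGCC"),
    Read("ACGT", 250, 254, "+", "GGGCC"),
]
families = group_barcode_families(reads)
for key, members in families.items():
    print(key, family_type(members), consensus_sequence(members))
```

A dictionary keyed on the full (UMI, start, end) tuple is used here for brevity; the position-sorted linked-list traversal of FIG.4 would produce the same families while streaming through the sorted input.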
[0060] FIG.4 illustrates one embodiment for the barcode family grouping step. For each record 400, 401 a barcode family key 402 is built or extracted from the records 400, 401. The barcode family key can be the UMIs associated with the paired reads. The barcode family key 402 is then checked to see whether the barcode family key 402 already exists in the current barcode family linked list 404. If the barcode family key 402 already exists, then the barcode family information can be added to the records 406, 407. If the barcode family key does not exist in the current barcode family linked list 404, then the fragment start position is compared with the previous fragment start position 408. If the fragment start position is the same as the previous start position, then a new barcode family is created 410, and the new barcode family information is added to the records 412, 413 to start a new barcode family. If the fragment start position is not the same as the previous start position, then the current linked list of the barcode family is processed 414 to output a barcode family mutation 416 and to output the barcode family to the dedup output file.

[0061] In some embodiments, the barcode deduplication step 100 can also optionally exclude any reads that are soft clipped because soft clipped reads may be associated with higher error rates.

[0062] As a result of the barcode deduplication step, a multiplexed sample, which contains more than one sample all mixed together, can be de-multiplexed into separate samples so that the analysis described herein can be performed on each sample.

[0063] Referring back to FIG.1, after the barcode deduplication step 100, pre-processing can be performed 102. Pre-processing of the dedup output can include background polishing, in which the consensus sequences from the dedup output can be compared with a reference sequence from a healthy sample.
This can be an independent sample not connected with the patient samples, or it can be a matched sample where healthy tissue is taken from the patient along with the tumor or unhealthy sample from the patient. The comparison with the healthy sample allows the filtering out of variants called in both the healthy sample and the actual samples as likely artifacts or errors introduced at some point in the sequencing process, such as recurrent library preparation errors, for example. In some embodiments, background polishing is only performed on plasma samples and not on tissue samples. In other embodiments, background polishing can be performed on both plasma samples and tissue samples.

[0064] In some embodiments, the background polishing step can be performed by capturing errors (low allele frequency (AF) variants) as systematic errors from a set of healthy normal control cfDNA samples, which are not expected to have low AF variants. A probabilistic model is built from the observed error at every position across the control reference set to yield a background error distribution. Then, for a test sample, the number of supporting reads for all non-reference positions is evaluated against the derived background error distribution for each of those positions. If a variant appears to belong to the background noise distribution, it is “polished” away (i.e., filtered out) from the base-counts file. That is, the number of supporting reads for that variant is set to 0. If the variant in the test sample appears to be significantly different from the background error distribution, it is not polished away.

[0065] In addition, preprocessing can include the parsing of the dedup outputs, which can include the computation of features used in subsequent steps of the InDel caller algorithm. These features include, but are not limited to, the features listed in Table 1 below.
Table 1

Feature            Description
s                  Singleton Molecule Counts
c                  Consensus Molecule Counts
d                  Duplex Molecule Counts
sdrat              Ratio of singletons to duplex molecule counts for total molecules
sdratint           Interaction between sdrat and s
binsdrat           Is 0 if sdrat < 1 and 1 if sdrat >= 1
binsdratint        Interaction term between binsdrat and s
disttrans          Median distance of variant from 5’ end of reads
relvdistmad        Relative spread of variant distance in variant molecules vs background
relvdiststartmad   Relative spread of insert starts in variant molecules vs background
relvdistendmad     Relative spread of insert ends in variant molecules vs background
cintn              Raw strand of alignment bias for consensus molecules
cmintn             Raw strand of origin bias for singleton molecules
scintn             Raw strand of alignment bias for non-duplex molecules
scmintn            Raw strand of alignment bias for non-duplex molecules
d2frac             Fraction of duplex variant molecules with 2 reads
d3frac             Fraction of duplex variant molecules with <=3 reads
c2frac             Fraction of consensus variant molecules with 2 reads
c3frac             Fraction of consensus variant molecules with <=3 reads
SF                 Relative frequency of singleton in variant molecules vs background
DF                 Relative frequency of duplex in variant molecules vs background
P value            2 x 2 Fisher exact test p value for the ratio of singletons to duplex in variant molecules vs background

[0066] These features can be passed on to the InDel calling algorithm 104. As shown in FIG.1, the InDel calling algorithm 104 can include multiple components, such as a hotspot module and an adaptive module.

[0067] The hotspot module can be used on predetermined loci of interest, which can include cancer “hotspots” from targeted panels, including known actionable mutations (e.g., BRAF V600E, etc.), known biomarkers (e.g., toxicity biomarker DPYD or resistance biomarker), and recurrent cancer mutations (e.g., top 1% most mutated in COSMIC).
The variant caller can be designed to be more sensitive for mutations on the hotspot list.

[0068] The hotspot module can use heuristic rules for identifying variants. For example, for tissue samples, a variant is called when either the duplex rule or the depth rule is met. The duplex rule is met when the AF is greater than or equal to 1.5% and the duplex support is greater than or equal to 5 and total support is greater than or equal to duplex support, where support is the number of reads with that variant. The depth rule is met when the AF is greater than 1.5% and the variant depth is greater than 5.

[0069] For cfDNA (i.e., plasma samples), the duplex rule is met when (1) duplex support is greater than or equal to the minimum duplex support, and (2) the total support is greater than or equal to the duplex support. The depth rule is met when the variant depth is greater than 5.

[0070] In parallel with the heuristic hotspot module, an adaptive module can be used. The adaptive module can include a two-stage approach. The first stage uses a core linear model to compute weighted variant molecule counts and learns sample-specific thresholds for calling. The second stage applies an additional filter based on a lightgbm (gradient boosted decision trees) classifier that uses both variant molecule counts as well as additional technical and biological features, as listed in Table 1. This enables the caller to detect lower allele fraction variants with limited support in cfDNA samples as compared to other standard methods.

[0071] The core linear model in the first stage is described in more detail as follows. For a particular candidate variant identified by the dedup step, let $d$, $c$, and $s$, respectively, denote the duplex ($d$), consensus ($c$), and singleton ($s$) variant reads. The goal of the model is to determine the best coefficients alpha ($\alpha$), beta ($\beta$), and gamma ($\gamma$) that minimize the total errors (false positives and false negatives). The weighted linear model is as follows:

$F(d, c, s) = \alpha d + \beta c + \gamma s$, (Eq.2)

where $d$ is the duplex variant count, $c$ is the consensus variant count, and $s$ is the singleton variant count. Each variant therefore receives a weighted linear score according to this linear model, and a variant is called when its weighted linear score is larger than a certain cutoff.

[0072] The cutoff for each sample is learned in a sample-specific manner by fitting a regression for all the candidate variants in that sample. This consists of the following steps:
1. Cumulative_Counts(X) = Total variants with weighted linear score (F(d, c, s)) >= X
2. Fit a regression log(Cumulative_Counts(X)) ~ log(X)
3. Set threshold where Fitted_Cumulative_Counts/panelsize drops to a low value, e.g., 1 per 10000 bp.

[0073] FIG.5 illustrates the threshold as a vertical line on the graph.

[0074] The machine learning classifier in the second stage is described in more detail as follows. All the candidate variants identified by the core linear model in the first stage will be scored on the basis of a machine learning classifier, such as a non-linear tree-based ensemble learning classifier (e.g., lightgbm). Table 1 above describes the full list of features used by the lightgbm classifier. Variants from the first stage with a classifier score above a certain threshold will be retained as the final calls. The threshold for the classifier is calibrated using a set of healthy samples to target the desired or predetermined false positive levels. Other types of classifiers can be used, with preference for classifiers having interpretable features.

[0075] Returning to FIG.1, the output from the InDel caller can then be filtered 106 using a blocklist filter and other filter tags. The blocklist filter can be generated to block repetitive false positive variants detected in multiple cohorts.
The variants included in the blocklist filter can be limited to somatic AF (AF less than 30%) level false positive variants only, where the false positive variant has to be found in greater than 2 samples in the same cohort, and found in greater than or equal to 2 independent cohorts. Other filter tags that can be applied include a low AF tag, which is applied when the variant AF is less than a minimum AF threshold, a low support tag, which is applied when the variant support is less than a minimum support threshold, and a low depth tag, which is applied when the variant depth is less than a minimum depth threshold. Application of these tags or the blocklist filter to a variant will remove the variant from the list of called variants.

[0076] After filtering 106, the output is a list of called variants that can be written as a VCF file, for example.

SNV Caller

[0077] Single nucleotide variants (SNVs) refer to mutations with a single change of an A/G/C/T base. Somatic SNVs occur only in somatic (non-germline) cells. Cancer mutations are of this type. In comparison, another type of SNV is germline or inherited, as it occurs in sperm/egg. This type of mutation is referred to as a SNP (single nucleotide polymorphism). The SNV caller algorithm described herein was developed for calling SNVs in cfDNA and tissue samples, and the main goal is to call any variants observed in the data and distinguish them from technical artifacts. Further classification of variants by type (i.e., somatic or germline) can be performed in downstream modules. The SNV caller module described herein and shown in FIG.6 includes three major components, described below.

[0078] (1) pre-processing module 602, performs calculations and operations on the barcode-deduplication output files 600 before the main body of SNV caller 604, including parsing the variant file, and the depth file, where the depth refers to the number of times a locus of the genome was sequenced.
The SNV caller also incorporates background polishing. As above for the InDel caller, pre-processing includes calculating the features that are used in the SNV calling algorithms.

[0079] (2) SNV caller 604 main body includes several components: a core linear model, an adaptive caller, a hotspot caller, and an extended linear model based final filtration.

[0080] The adaptive caller and the hotspot caller are two parallel calling modules. The adaptive caller detects variants in a selector-wide fashion. In the adaptive caller, the error distribution of each of the 12 substitution types or 192 tri-nucleotide substitution types is modeled in the sample where mutations are being called, and sample-specific substitution-specific supporting score thresholds are set. In the hotspot caller, simpler heuristic rules are applied to call the variants listed in the loci of interest (which usually are known genomic mutations of clinical impact in cancers).

[0081] A supporting score for each variant is calculated using a weighted linear combination of singleton, consensus, and duplex reads. For non-hotspot variants, calling occurs in two stages: (1) a core linear model and (2) an extended linear model. The core linear model based adaptive variant calls are further filtered based on probability scores according to the extended linear model. The extended linear model is a logistic regression classification model that uses both molecule counts as well as multiple technical and biological variant features such as sequence context, distance to read end, strand bias, and relative spread of variant families with respect to background.

[0082] (3) filters and tagging module 606, filters potential candidate SNV variants based on several situations not covered by the core linear model, such as the ratio of duplex and optimization on the blocklist filtering, which are systematic experimental artifacts seen repeatedly in healthy samples.
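The supporting score and the sample-specific cutoff of the core linear model (the weighted combination of duplex, consensus, and singleton counts, and the three-step log-log regression described for the InDel caller) might be sketched as below. The weights, the use of ordinary least squares, the choice to evaluate the cumulative counts at each distinct observed score, and all function names are assumptions made for illustration; the disclosure does not fix these details.

```python
import math
from typing import List, Sequence, Tuple


def weighted_linear_score(d: int, c: int, s: int,
                          alpha: float, beta: float, gamma: float) -> float:
    """Core linear score: alpha*duplex + beta*consensus + gamma*singleton."""
    return alpha * d + beta * c + gamma * s


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def learn_threshold(scores: List[float], panel_size_bp: int,
                    target_rate: float = 1e-4) -> float:
    """Score cutoff at which the fitted cumulative count per bp equals target_rate."""
    # Step 1: cumulative counts, evaluated at each distinct positive score (assumed grid).
    grid = sorted({x for x in scores if x > 0})
    if len(grid) < 2:
        raise ValueError("need at least two distinct positive scores to fit a line")
    cumulative = [sum(1 for v in scores if v >= x) for x in grid]
    # Step 2: regress log(cumulative count) on log(score).
    intercept, slope = fit_line([math.log(x) for x in grid],
                                [math.log(n) for n in cumulative])
    if slope >= 0:
        raise ValueError("cumulative counts do not decrease with score")
    # Step 3: solve intercept + slope*log(X) = log(target_rate * panel_size) for X.
    target_count = target_rate * panel_size_bp
    return math.exp((math.log(target_count) - intercept) / slope)


# Hypothetical weights and a toy sample whose cumulative counts are 8, 4, 2, 1
# at scores 1, 2, 4, 8 (an exact power law, so the fit is exact).
print(weighted_linear_score(d=2, c=1, s=3, alpha=3.0, beta=2.0, gamma=1.0))
toy_scores = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 4.0, 8.0]
print(learn_threshold(toy_scores, panel_size_bp=10_000))
print(learn_threshold(toy_scores, panel_size_bp=20_000))
```

For the SNV adaptive caller, the same routine could plausibly be run once per substitution type on the scores of low-AF positions, yielding the substitution-specific thresholds mentioned above; that per-type loop is omitted here.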
[0083] Referring to FIG.6, the barcode deduplication module 600 is the same as described above with respect to FIG.1 for the InDel caller. In addition, the pre-processing module 602 is also largely the same as described above with respect to FIG.1 for the InDel caller except that different features are calculated for the SNV caller module. The background polishing step is the same.

[0084] The SNV calling module 604 includes a hotspot module that is essentially the same as described above for the InDel caller, with some additional rules that apply to SNV type mutations. For example, in some embodiments, one additional optional special heuristic rule can include special cases for C>T or G>A substitutions, where a single duplex supporting read is required, or optionally, requiring two duplex supporting reads for sites determined to be high noise based on the background error distribution performed in the background polishing step.

[0085] The core linear model of the SNV calling module 604 is also essentially the same as described above for the InDel caller.

[0086] A tri-nucleotide module models the error distribution of each of the 12 substitution types or 192 tri-nucleotide substitution types in the sample where mutations are being called, and sample-specific substitution-specific supporting score thresholds are set. For each ref>var substitution type (for example: A>T), gather the weighted linear score for variant supporting depths, for positions where AF < 0.5%. The assumption is that the vast majority of these positions are noise.

[0087] The result is a cumulative distribution for each substitution type, where the value at position $x$ indicates the number of times that substitution was observed at support depth greater than or equal to $x$.

[0088] Minimum support threshold (weighted linear score) for each substitution type is determined separately.
This minimum support threshold serves as the raw threshold value and is subject to adjustment by other factors, such as gene weight and depth weight.

[0089] Extended Linear Model: The extended linear model can be a logistic regression model for binary classification that will be used to score variants. Variant score (VS) is based on counts of singleton (s), consensus (c) and duplex (d) barcode families of variants as well as a variety of additional technical and biological variant features. These features are listed in Table 2, below.

Table 2

Feature            Description
s                  Singleton Molecule Counts
c                  Consensus Molecule Counts
d                  Duplex Molecule Counts
sdrat              Ratio of singletons to duplex molecule counts for total molecules
sdratint           Interaction between sdrat and s
binsdrat           Is 0 if sdrat < 1 and 1 if sdrat >= 1
binsdratint        Interaction term between binsdrat and s
depthweight        Adjustment for depth of coverage
geneweight         Adjustment for errors in gene
disttrans          Median distance of variant from 5’ end of reads
relvdistmad        Relative spread of variant distance in variant molecules vs background
relvdiststartmad   Relative spread of insert starts in variant molecules vs background
relvdistendmad     Relative spread of insert ends in variant molecules vs background
indelvariants5     Variable is 1 if SNV occurs within 5 bp of indel and 0 otherwise
cintn              Raw strand of alignment bias
cmintn             Raw strand of origin bias
substype           Substitution type of variant
varcontext         2 bp context around the variant position
d2frac             Fraction of duplex variant molecules with 2 reads
d3frac             Fraction of duplex variant molecules with <=3 reads
c2frac             Fraction of consensus variant molecules with 2 reads
c3frac             Fraction of consensus variant molecules with <=3 reads
CF                 Relative frequency of consensus in variant molecules vs background
DF                 Relative frequency of duplex in variant molecules vs background

[0090] The final variant score for k additional features (f) and weights (w) is:

$VS = \ln\left(\frac{p}{1-p}\right) = w_d d + w_c c + w_s s + \sum_{i=1}^{k} w_i f_i$

where $\ln$ denotes the natural logarithm, and $p$ is a probability taking values between 0 and 1. The coefficients or weights for each feature will be learned from a training set of known true positives and false positive variants. Implementation entails calculation of additional metrics corresponding to these features using the files output by the barcode deduplication tool. Table 2 above describes the full list of features in the extended linear model.

[0091] Adaptive Calling and Two-Stage Scoring of Variants: First the linear model-based sample-specific adaptive model is used to generate initial calls. Here a linear weighted count of variant molecule counts is used. Then, an extended linear model is applied to these calls to filter out false positives and improve specificity. The rationale behind using a 2-stage model is illustrated in FIG.7.
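To make the extended linear model concrete, the sketch below scores one variant with a logistic regression: a weighted sum of features gives the log-odds, and the logistic function maps it to a probability between 0 and 1. The feature values, the weights, and the intercept term are hypothetical (the disclosure states only that weights are learned from known true and false positives), and the function names are placeholders.

```python
import math
from typing import Dict


def sigmoid(z: float) -> float:
    """Numerically stable logistic function (avoids overflow for very negative z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def variant_log_odds(features: Dict[str, float], weights: Dict[str, float],
                     intercept: float = 0.0) -> float:
    """Log-odds ln(p / (1 - p)) as a weighted sum of features (intercept is an assumption)."""
    return intercept + sum(weights[name] * features[name] for name in weights)


def variant_probability(features: Dict[str, float], weights: Dict[str, float],
                        intercept: float = 0.0) -> float:
    """Probability p in [0, 1] that the candidate is a true variant under the model."""
    return sigmoid(variant_log_odds(features, weights, intercept))


# Hypothetical weights for a handful of Table 2 features.
weights = {"d": 0.8, "c": 0.4, "s": 0.1, "disttrans": 0.01}
features = {"d": 3, "c": 2, "s": 1, "disttrans": 40}

log_odds = variant_log_odds(features, weights, intercept=-3.0)
probability = variant_probability(features, weights, intercept=-3.0)
print(round(log_odds, 3), round(probability, 3))
```

A stable `sigmoid` is used deliberately: a strongly negative feature value drives the probability to 0 without raising an overflow error.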
[0092] In the 2-stage approach, filtering based on extended linear score greatly improves specificity with a relatively small drop in sensitivity over only using the core linear model. In our tests, 1-stage scoring based on extended linear model score did not work well in practice. Features such as median variant distance, relative MAD, and strand bias depend on the number of molecule counts/families supporting a variant. When variant molecule counts are low, additional variant features can take extreme values by chance and cannot be reliably estimated. The core linear model in the adaptive stage filters out low molecule count variants and thus improves accuracy of estimating additional features for scoring variants in the second stage.

[0093] The scoring function for the second stage is further modified to filter out variants which take extreme values for certain variant features. In these cases, the feature values are transformed such that the score becomes close to 0. The following transformations are used:
1. If disttrans < 20, disttrans = -100000000
2. If relvdistmad < 0.05, relvdistmad = -100000000
3. If relvstartmad < 0.05, relvstartmad = -100000000
4. If relvendmad < 0.05, relvendmad = -100000000
5. If relvstartmad < 0.15 & relvendmad < 0.15, relvstartmad = -100000000, relvendmad = -100000000
6. If relvstartmad < 0.10 & relvdistmad < 0.15, relvstartmad = -100000000, relvdistmad = -100000000
7. If relvdistmad < 0.15 & relvendmad < 0.10, relvdistmad = -100000000, relvendmad = -100000000

[0094] These transformations are motivated by observations in empirical data that includes both true positive and true negative sites.

[0095] Referring to FIG.6, after the SNV calling module 604 generates a list of potential variants, additional filtering can be done with a filtering module 606. The filtering module 606 can include a duplex ratio filter, a blocklist filter, and other tags that can be used to exclude certain variant calls.
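The seven extreme-value transformations listed in paragraph [0093] can be written as a small function. This is a sketch under two assumptions that the text does not settle: the rules are applied in the listed order to a working copy of the features (so a value already replaced by the sentinel is what later rules see), and the affected features carry positive weights, which is what makes a large negative sentinel push the probability toward 0. The function name and the dictionary layout are placeholders; the feature names follow the list above.

```python
from typing import Dict

SENTINEL = -100000000  # value given in the listed transformations


def apply_extreme_value_transforms(features: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of the features with extreme values replaced by the sentinel."""
    f = dict(features)  # leave the caller's dictionary untouched

    # Rules 1-4: single-feature cutoffs.
    if f["disttrans"] < 20:
        f["disttrans"] = SENTINEL
    if f["relvdistmad"] < 0.05:
        f["relvdistmad"] = SENTINEL
    if f["relvstartmad"] < 0.05:
        f["relvstartmad"] = SENTINEL
    if f["relvendmad"] < 0.05:
        f["relvendmad"] = SENTINEL

    # Rules 5-7: pairwise cutoffs, applied in order (assumed sequential semantics).
    if f["relvstartmad"] < 0.15 and f["relvendmad"] < 0.15:
        f["relvstartmad"] = SENTINEL
        f["relvendmad"] = SENTINEL
    if f["relvstartmad"] < 0.10 and f["relvdistmad"] < 0.15:
        f["relvstartmad"] = SENTINEL
        f["relvdistmad"] = SENTINEL
    if f["relvdistmad"] < 0.15 and f["relvendmad"] < 0.10:
        f["relvdistmad"] = SENTINEL
        f["relvendmad"] = SENTINEL
    return f


clean = {"disttrans": 45, "relvdistmad": 0.6, "relvstartmad": 0.5, "relvendmad": 0.7}
suspicious = {"disttrans": 12, "relvdistmad": 0.6, "relvstartmad": 0.12, "relvendmad": 0.12}

print(apply_extreme_value_transforms(clean))
print(apply_extreme_value_transforms(suspicious))
```

In the second example the variant sits close to the read end and its insert starts and ends are tightly clustered, so rules 1 and 5 fire while `relvdistmad` is left unchanged.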
[0096] Duplex ratio filter: For every variant called, the filtering module 606 computes the number of barcode deduplicated reads (i.e., the number of barcode families) that support the variant and the number that support the reference allele. With this barcode scheme, it is also possible to identify barcode families that have duplex support (meaning that the other strand in the original duplex can also be assigned). This gives counts of duplex deduplicated reads (i.e., the number of barcode families with duplex evidence) that support the variant and the reference allele. The percentage of duplex families should be around 10-20%, so a significant depletion of duplex reads against this expectation can indicate contamination by single-stranded molecules and an artifactual variant. Therefore, a statistic is defined from a Fisher exact test, and multiple testing correction (across all variants in the sample) is performed. The significant variants are removed.
[0097] Blocklist filter: The blocklist filter can be generated in a similar manner as described above for the InDel caller, and can exclude variants with AF greater than 10%.
[0098] Tags: The tags that can be applied are essentially the same as those described above for the InDel caller.
Example Systems
[0099] FIG. 8 illustrates a sequencing system 800 according to an embodiment of the present disclosure. The system as shown includes a sample 805, such as Xpandomers within an assay device 810, where an assay 808 can be performed on sample 805. For example, sample 805 can be contacted with reagents of assay 808 to provide a signal (e.g., an intensity signal) of a physical characteristic 815 (e.g., sequence information of a cell-free nucleic acid molecule). Assay 808 may include sequencing by expansion with an assay device 810, such as a nanopore sequencing device as discussed above.
Physical characteristic 815 (e.g., a fluorescence intensity, a voltage, or a current) from the sample is detected by detector 820. Detector 820 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal. In one embodiment, an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
[0100] Assay device 810 and detector 820 can form an assay system, e.g., a sequencing system 800 that performs sequencing according to embodiments described herein. A data signal 825 is sent from detector 820 to logic system 830. As an example, data signal 825 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA). Data signal 825 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 805, and thus data signal 825 can correspond to multiple signals. Data signal 825 may be stored in a local memory 835, an external memory 840, or a storage device 845. The sequencing system 800 can comprise multiple assay devices 810 and detectors 820.
[0101] Logic system 830 may be, or may include, a computer system, ASIC, processor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.). Logic system 830 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 820 and/or assay device 810. Logic system 830 may also include software that executes in a processor 850.
Logic system 830 may include a computer readable medium storing instructions for controlling sequencing system 800 to perform any of the methods described herein. For example, logic system 830 can provide commands to a system that includes assay device 810 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order. Such physical operations may be performed by a robotics system, e.g., including a robotic arm, as may be used to obtain a sample and perform an assay 808. Logic system 830 can also perform any steps of methods described herein that perform computer processing, such as, but not limited to, base calling, alignment, variant calling, the InDel caller algorithm shown in FIG. 1, the barcode deduplication algorithm shown in FIG. 2, and/or the SNV caller shown in FIG. 6.
[0102] Sequencing system 800 may also include a treatment device 860, which can provide a treatment to the subject. Treatment device 860 can determine a treatment and/or be used to perform a treatment. Examples of such treatment can include surgery, radiation therapy, chemotherapy, immunotherapy, targeted therapy, hormone therapy, and stem cell transplant. Logic system 830 may be connected to treatment device 860, e.g., to provide results of a method described herein. The treatment device may receive inputs from other devices, such as an imaging device and user inputs (e.g., to control the treatment, such as controls over a robotic system).
[0103] Sequencing system 800 may also include a reporting device 855, which can present results of any of the methods described herein, e.g., as determined using the sequencing system 800. Reporting device 855 can be in communication with a reporting module within logic system 830 that can aggregate, format, and send a report to reporting device 855.
The reporting module can present information determined using any of the methods described herein. The information can be presented by reporting device 855 in any format that can be recognized and interpreted by a user of the sequencing system 800. For example, the information can be presented by reporting device 855 in a displayed, printed, or transmitted format, or any combination thereof.
[0104] Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG. 9 in computer system 900. In some embodiments, a computer system 900 includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system 900 can include multiple computer apparatuses, each being a subsystem, with internal components. A computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
[0105] The subsystems shown in FIG. 9 are interconnected via a system bus 975. Additional subsystems such as a printer 974, keyboard 978, storage device(s) 979, 982, monitor 976 (e.g., a display screen, such as an LED), which is coupled to display adapter 982, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 971, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 977 (e.g., USB, FireWire®). For example, I/O port 977 or external interface 981 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system 900 to a wide area network such as the Internet, a mouse input device, or a scanner.
The interconnection via system bus 975 allows the central processor 973 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 972 or the storage device(s) 979 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems. The system memory 972 and/or the storage device(s) 979 may embody a computer readable medium. Another subsystem is a data collection device 985, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
[0106] A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 981, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components. In various embodiments, methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices. Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages. Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
[0107] Aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC. As used herein, a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
[0108] Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like. The computer readable medium may be any combination of such devices. In addition, the order of operations may be re-arranged.
A process can be terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
[0109] Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium may be created using a data signal encoded with such programs. Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
[0110] Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time. The term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps.
Although presented as numbered steps, steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
[0111] The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the disclosure. However, other embodiments of the disclosure may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.
[0112] The above description of example embodiments of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form described, and many modifications and variations are possible in light of the teaching above.
[0113] When a feature or element is herein referred to as being “on” another feature or element, it can be directly on the other feature or element or intervening features and/or elements may also be present. In contrast, when a feature or element is referred to as being “directly on” another feature or element, there are no intervening features or elements present. It will also be understood that, when a feature or element is referred to as being “connected”, “attached” or “coupled” to another feature or element, it can be directly connected, attached or coupled to the other feature or element or intervening features or elements may be present. In contrast, when a feature or element is referred to as being “directly connected”, “directly attached” or “directly coupled” to another feature or element, there are no intervening features or elements present.
Although described or shown with respect to one embodiment, the features and elements so described or shown can apply to other embodiments. It will also be appreciated by those of skill in the art that references to a structure or feature that is disposed “adjacent” another feature may have portions that overlap or underlie the adjacent feature.
[0114] Terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. For example, as used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items and may be abbreviated as “/”.
[0115] Spatially relative terms, such as “under”, “below”, “lower”, “over”, “upper” and the like, may be used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as “under” or “beneath” other elements or features would then be oriented “over” the other elements or features. Thus, the exemplary term “under” can encompass both an orientation of over and under.
The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly. Similarly, the terms “upwardly”, “downwardly”, “vertical”, “horizontal” and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.
[0116] Although the terms “first” and “second” may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise. These terms may be used to distinguish one feature/element from another feature/element. Thus, a first feature/element discussed below could be termed a second feature/element, and similarly, a second feature/element discussed below could be termed a first feature/element without departing from the teachings of the present invention.
[0117] Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, mean that various components can be co-jointly employed in the methods and articles (e.g., compositions and apparatuses including device and methods). For example, the term “comprising” will be understood to imply the inclusion of any stated elements or steps but not the exclusion of any other elements or steps.
[0118] As used herein in the specification and claims, including as used in the examples and unless otherwise expressly specified, all numbers may be read as if prefaced by the word “about” or “approximately,” even if the term does not expressly appear. The phrase “about” or “approximately” may be used when describing magnitude and/or position to indicate that the value and/or position described is within a reasonable expected range of values and/or positions.
For example, a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc. Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein. It is also understood that when a value is disclosed, “less than or equal to” the value, “greater than or equal to” the value, and possible ranges between values are also disclosed, as appropriately understood by the skilled artisan. For example, if the value “X” is disclosed, then “less than or equal to X” as well as “greater than or equal to X” (e.g., where X is a numerical value) is also disclosed. It is also understood that throughout the application, data is provided in a number of different formats, and that this data represents endpoints and starting points, and ranges for any combination of the data points. For example, if a particular data point “10” and a particular data point “15” are disclosed, it is understood that greater than, greater than or equal to, less than, less than or equal to, and equal to 10 and 15 are considered disclosed as well as between 10 and 15. It is also understood that each unit between two particular units is also disclosed. For example, if 10 and 15 are disclosed, then 11, 12, 13, and 14 are also disclosed.
[0119] Although various illustrative embodiments are described above, any of a number of changes may be made to various embodiments without departing from the scope of the invention as described by the claims.
For example, the order in which various described method steps are performed may often be changed in alternative embodiments, and in other alternative embodiments one or more method steps may be skipped altogether. Optional features of various device and system embodiments may be included in some embodiments and not in others. Therefore, the foregoing description is provided primarily for exemplary purposes and should not be interpreted to limit the scope of the invention as it is set forth in the claims.
[0120] The examples and illustrations included herein show, by way of illustration and not of limitation, specific embodiments in which the subject matter may be practiced. As mentioned, other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Such embodiments of the inventive subject matter may be referred to herein individually or collectively by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept, if more than one is, in fact, disclosed. Thus, although specific embodiments have been illustrated and described herein, any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

Claims

1. A method for calling insertions and deletions (InDel) from sequencing data generated from a multiplexed sample, the method comprising: receiving a computer file comprising a plurality of paired sequence reads; aligning the plurality of paired sequence reads to a reference sequence; sorting the aligned, paired sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups according to a unique molecular identifier (UMI), a sequence read start position, and a sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate InDel variants based on a comparison of the polished consensus sequence reads with the reference sequence; for each candidate InDel variant, calculating a weighted linear score using a linear model, wherein the candidate InDel variant is called a potential InDel variant when the weighted linear score is greater than a linear model threshold; and for each potential InDel variant, calculating a score using a non-linear tree-based ensemble machine learning classifier, wherein the potential InDel variant is called a true InDel variant when the score is greater than a machine learning classifier threshold.
2. The method of claim 1, further comprising generating a computer file comprising the paired consensus sequence reads.
3.
The method of claim 1, wherein the step of polishing the parsed consensus sequence reads comprises: comparing the parsed consensus sequence reads to the reference sequence to determine a set of preliminary InDel variants, including a count of each preliminary InDel variant; and adjusting the count of each preliminary InDel variant based on a comparison of the count of each preliminary InDel variant to a corresponding background variant count that is determined from a control sample.
4. The method of claim 3, wherein the comparison comprises determining whether the count for each preliminary InDel variant is background noise.
5. The method of claim 4, wherein the step of determining whether the count for each preliminary InDel variant is background noise comprises calculating a probability that the preliminary InDel variant is background noise.
6. The method of claim 1, wherein the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
7. The method of claim 1, wherein the threshold for the linear model is sample specific.
8. The method of claim 7, wherein the threshold for the linear model is determined for all the candidate InDel variants in the sample by fitting a regression to a plot of a cumulative candidate InDel variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative InDel variant count.
9. The method of claim 1, wherein the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
10.
The method of claim 1, wherein the non-linear tree-based ensemble machine learning classifier comprises a singleton molecule count, a consensus molecule count, a duplex molecule count, a ratio of the singleton molecule count to the duplex molecule count, a median distance of the potential variant from a 5’ end of the sequence reads, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a raw strand of alignment bias for consensus molecules, a raw strand of origin bias for singleton molecules, a raw strand of alignment bias for non-duplex molecules, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of singleton in variant molecules versus background, a relative frequency of duplex in variant molecules versus background, and a 2 x 2 Fisher exact test p-value for the ratio of singletons to duplex in variant molecules versus background.
11. The method of claim 1, wherein the machine learning classifier threshold is calibrated using a set of healthy samples to target a predetermined false positive rate.
12.
A method for calling single nucleotide variants (SNV) from sequencing data generated from a multiplexed sample, the method comprising: receiving a computer file comprising a plurality of paired sequence reads; aligning the sequence reads to a reference sequence; sorting by pair the aligned sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups by unique molecular identifier (UMI), sequence read start position, and sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate variants based on a comparison of the polished consensus reads; for each candidate SNV variant, calculating a weighted linear score using a linear model, wherein the candidate SNV variant is called a potential SNV variant when the weighted linear score is greater than a linear model threshold; and for each potential SNV variant, calculating a weighted logistic score using a logistic regression model, wherein the potential SNV variant is called a true variant when the weighted logistic score is greater than a logistic regression model threshold.
13. The method of claim 12, the method further comprising generating a computer file comprising paired consensus sequence reads.
14. The method of claim 12, wherein the step of polishing the parsed consensus sequence reads comprises: comparing the parsed consensus reads to the reference sequence to determine a set of preliminary SNV variants, including a count of each preliminary SNV variant; and adjusting the count of each preliminary SNV variant based on a comparison of the count of each preliminary SNV variant to a corresponding background variant count that is determined from a control sample.
15.
The method of claim 14, wherein the comparison comprises determining whether the count for each preliminary SNV variant is background noise.
16. The method of claim 15, wherein the step of determining whether the count for each preliminary SNV variant is background noise comprises calculating a probability that the preliminary SNV variant is background noise.
17. The method of claim 12, wherein the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
18. The method of claim 12, wherein the threshold for the linear model is sample specific.
19. The method of claim 18, wherein the threshold for the linear model is determined for all the candidate SNV variants in the sample by fitting a regression to a plot of a cumulative candidate SNV variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative SNV variant count.
20. The method of claim 12, wherein the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
21. The method of claim 12, wherein the logistic regression model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
22.
The method of claim 21, wherein the logistic regression model further comprises a ratio of the singleton variant count to the duplex molecule count, an adjustment for depth of coverage, an adjustment for errors in a gene, a median distance of the variant from 5’ end of the sequence read, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a relative spread of insert ends in variant molecules versus background, a raw strand of alignment bias, a raw strand of origin bias, a substitution type of variant, a two base pair context around a variant position, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of consensus in variant molecules versus background, and a relative frequency of duplex in variant molecules versus background.
23. The method of claim 12, wherein the logistic regression model comprises a plurality of features, wherein each feature is multiplied by a weight, wherein the weights are determined from a training set of data with known true positive variants and known false positive variants.
24. A system, comprising: an assay device; and a logic system including at least a processor and a memory storing a set of instructions, wherein the processor, upon executing the instructions, is configured to perform the method of any of claims 1 to 11.
25. The system of claim 24, further comprising a treatment device for determining or administering a treatment to the patient based on at least one true InDel variant.
26. The system of claim 24, further comprising a reporting device for displaying information about at least one true InDel variant.
27.
A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any of claims 1 to 11. 28. A system, comprising: an assay device; and a logic system including at least a processor and a memory storing a set of instructions, wherein the processor, upon executing the instructions, is configured to perform the method of any of claims 12 to 23. 29. The system of claim 28, further comprising a treatment device for deteremining or administering a treatment to the patient based on at least one true variant. 30. The system of claim 28, further comprising a reporting device for displaying information about at least one true variant. 31. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any of claims 12 to 23.
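
As an illustration only, the scoring and thresholding recited in claims 17, 20, and 23 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in the application: the names `VariantCounts`, `weighted_linear_score`, `threshold_from_false_positives`, `passes_linear_model`, and `logistic_probability` are hypothetical, the example weights are invented, and the convention that a candidate passes when its score is strictly greater than the threshold is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VariantCounts:
    """Supporting-molecule counts for one candidate variant (see claim 17)."""
    duplex: int
    singleton: int
    consensus: int


# Hypothetical weights, chosen only for illustration; the claims give no values.
EXAMPLE_WEIGHTS = (3.0, 1.0, 2.0)


def weighted_linear_score(counts: VariantCounts,
                          weights: Sequence[float] = EXAMPLE_WEIGHTS) -> float:
    """Weighted sum of the duplex, singleton, and consensus variant counts."""
    w_duplex, w_singleton, w_consensus = weights
    return (w_duplex * counts.duplex
            + w_singleton * counts.singleton
            + w_consensus * counts.consensus)


def threshold_from_false_positives(fp_scores: Sequence[float],
                                   target_fp_count: int) -> float:
    """Pick a threshold from training false positives (see claim 20).

    Returns a score t such that at most `target_fp_count` of the known
    false positives have a score strictly greater than t.
    """
    if target_fp_count < 0:
        raise ValueError("target_fp_count must be non-negative")
    ranked = sorted(fp_scores, reverse=True)
    if len(ranked) <= target_fp_count:
        # Fewer false positives than the target: no filtering is needed.
        return float("-inf")
    # The (target + 1)-th largest score; only scores above it pass.
    return ranked[target_fp_count]


def passes_linear_model(score: float, threshold: float) -> bool:
    """Assumed convention: a candidate passes when score > threshold."""
    return score > threshold


def logistic_probability(features: Sequence[float],
                         weights: Sequence[float],
                         bias: float = 0.0) -> float:
    """Logistic model: each feature is multiplied by a weight (see claim 23)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(f * w for f, w in zip(features, weights))
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

In this sketch the weights for both models would come from training data with known true positive and false positive variants, as claim 23 states for the logistic regression model; the fitting step itself is not shown.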
PCT/US2025/025161 2024-04-17 2025-04-17 Systems and methods for variant calling Pending WO2025221998A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463635576P 2024-04-17 2024-04-17
US63/635,576 2024-04-17

Publications (1)

Publication Number Publication Date
WO2025221998A1 (en) 2025-10-23

Family

ID=95743631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/025161 Pending WO2025221998A1 (en) 2024-04-17 2025-04-17 Systems and methods for variant calling

Country Status (1)

Country Link
WO (1) WO2025221998A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130244340A1 (en) 2012-01-20 2013-09-19 Genia Technologies, Inc. Nanopore Based Molecular Detection and Sequencing
US20130264207A1 (en) 2010-12-17 2013-10-10 Jingyue Ju Dna sequencing by synthesis using modified nucleotides and nanopore detection
US20140134616A1 (en) 2012-11-09 2014-05-15 Genia Technologies, Inc. Nucleic acid sequencing using tags
US20150119259A1 (en) 2012-06-20 2015-04-30 Jingyue Ju Nucleic acid sequencing by nanopore detection of tag molecules
US20150337366A1 (en) 2012-02-16 2015-11-26 Genia Technologies, Inc. Methods for creating bilayers for use with nanopore sensors

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHANG XU ET AL: "smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers", BIORXIV, 14 March 2018 (2018-03-14), XP055585298, Retrieved from the Internet <URL:https://www.biorxiv.org/content/biorxiv/early/2018/03/14/281659.full.pdf> [retrieved on 20250725], DOI: 10.1101/281659 *
DRMANAC ET AL., NATURE BIOTECH., vol. 16, 1998, pages 381 - 384
LI TAI FANG ET AL: "An ensemble approach to accurately detect somatic mutations using SomaticSeq", GENOME BIOLOGY, vol. 16, no. 1, 17 September 2015 (2015-09-17), XP055531382, DOI: 10.1186/s13059-015-0758-2 *
MIKHAIL SHUGAY ET AL: "MAGERI: Computational pipeline for molecular-barcoded targeted resequencing", PLOS COMPUTATIONAL BIOLOGY, vol. 13, no. 5, 5 May 2017 (2017-05-05), pages e1005480, XP055496652, DOI: 10.1371/journal.pcbi.1005480 *
SATER VINCENT ET AL: "UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries", BIOINFORMATICS, vol. 36, no. 9, 27 January 2020 (2020-01-27), GB, pages 2718 - 2724, XP093299216, ISSN: 1367-4803, Retrieved from the Internet <URL:https://academic.oup.com/bioinformatics/article-pdf/36/9/2718/48985410/bioinformatics_36_9_2718.pdf> [retrieved on 20250725], DOI: 10.1093/bioinformatics/btaa053 *
SEARS ET AL., BIOTECHNIQUES, vol. 13, 1992, pages 626 - 633
XU CHANG: "A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data", COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, vol. 16, 1 January 2018 (2018-01-01), Sweden, pages 15 - 24, XP055781134, ISSN: 2001-0370, DOI: 10.1016/j.csbj.2018.01.003 *
ZIMMERMAN ET AL., METHODS MOL. CELL BIOL., vol. 3, 1992, pages 39 - 42

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 25725617; Country of ref document: EP; Kind code of ref document: A1