
WO2025221998A1 - Systems and methods for variant calling - Google Patents

Systems and methods for variant calling

Info

Publication number
WO2025221998A1
WO2025221998A1 PCT/US2025/025161 US2025025161W WO2025221998A1 WO 2025221998 A1 WO2025221998 A1 WO 2025221998A1 US 2025025161 W US2025025161 W US 2025025161W WO 2025221998 A1 WO2025221998 A1 WO 2025221998A1
Authority
WO
WIPO (PCT)
Prior art keywords
variant
count
indel
reads
consensus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/025161
Other languages
English (en)
Inventor
Ke Tang
James Han
Md Abid HASAN
Zhongyun HUANG
Tieming JI
Badri Kothandaraman PADHUKASAHASRAM
Seyed Hamid MIREBRAHIM
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roche Sequencing Solutions Inc
Original Assignee
Roche Sequencing Solutions Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis

Definitions

  • BACKGROUND [0002] The development of affordable and rapid DNA sequencing technologies has enabled the development of targeted therapeutics that rely on the use of DNA biomarkers to identify patients that are suitable for receiving the targeted therapy. For example, mutations in certain genes, such as genes involved in cell proliferation, are known to lead to certain types of cancers that can be treated very effectively with specific types of drugs. Other mutations are known to confer resistance to certain therapies. Therefore, there is a need for improved systems and methods to identify variants from sequencing data.
  • SUMMARY [0003] The embodiments described herein relate to systems and methods for performing variant calling on sequencing data. More particularly, the embodiments described herein relate to calling single nucleotide variants, insertions, and deletions in the sequencing data.
  • a method for calling insertions and deletions (InDels) from sequencing data generated from a multiplexed sample includes receiving a computer file comprising a plurality of paired sequence reads; aligning the plurality of paired sequence reads to a reference sequence; sorting the aligned, paired sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups according to a unique molecular identifier (UMI), a sequence read start position, and a sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate InDel variants based on a comparison of the polished consensus sequence reads with the reference sequence; for each candidate InDel variant, calculating a weighte
  • UMI unique molecular identifier
  • the method further includes generating a computer file comprising the paired consensus sequence reads.
  • the step of polishing the parsed consensus sequence reads includes: comparing the parsed consensus sequence reads to the reference sequence to determine a set of preliminary InDel variants, including a count of each preliminary InDel variant; and adjusting the count of each preliminary InDel variant based on a comparison of the count of each preliminary InDel variant to a corresponding background variant count that is determined from a control sample.
  • the comparison comprises determining whether the count for each preliminary InDel variant is background noise.
  • the step of determining whether the count for each preliminary InDel variant is background noise comprises calculating a probability that the preliminary InDel variant is background noise.
  • the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
  • the threshold for the linear model is sample specific.
  • the threshold for the linear model is determined for all the candidate InDel variants in the sample by fitting a regression to a plot of a cumulative candidate InDel variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative InDel variant count.
  • PATENT Client Reference No.: P39266-WO-1 the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
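The training-data approach to setting the linear model threshold can be illustrated with a minimal sketch. It assumes that a candidate is called when its score is strictly greater than the threshold and that the cumulative false positive count is taken from the highest score downward; the function name `threshold_for_target_fp` and the example scores are hypothetical and do not come from the disclosure.

```python
def threshold_for_target_fp(fp_scores, target_fp):
    """Return a score threshold such that at most `target_fp` known false
    positives have a weighted linear score strictly above it.

    fp_scores: weighted linear scores of known false positive calls
               in the training data (hypothetical input).
    """
    ranked = sorted(fp_scores, reverse=True)
    if target_fp >= len(ranked):
        # Every false positive may be tolerated, so no cutoff is needed.
        return float("-inf")
    # The (target_fp + 1)-th highest score: only the scores strictly
    # above it remain, and there are at most target_fp of those.
    return ranked[target_fp]


# Illustrative values only.
fp_scores = [1.0, 2.5, 4.0, 7.5, 9.0]
threshold = threshold_for_target_fp(fp_scores, target_fp=2)
remaining_fp = sum(score > threshold for score in fp_scores)
```

Ties at the selected score are excluded by the strict comparison, so the number of remaining false positives can fall below the target but never above it.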
  • the non-linear tree-based ensemble machine learning classifier comprises a singleton molecule count, a consensus molecule count, a duplex molecule count, a ratio of the singleton molecule count to the duplex molecule count, a median distance of the potential variant from a 5’ end of the sequence reads, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a raw strand of alignment bias for consensus molecules, a raw strand of origin bias for singleton molecules, a raw strand of alignment bias for non-duplex molecules, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of singleton in variant molecules versus background, a relative frequency of duplex in variant molecules versus background, and a 2 x 2
  • the machine learning classifier threshold is calibrated using a set of healthy samples to target a predetermined false positive rate.
  • a method for calling single nucleotide variants (SNV) from sequencing data generated from a multiplexed sample is provided.
  • the method includes: receiving a computer file comprising a plurality of paired sequence reads; aligning the sequence reads to a reference sequence; sorting by pair the aligned sequence reads; grouping the sorted, aligned, and paired sequence reads into a plurality of groups by unique molecular identifier (UMI), sequence read start position, and sequence read end position, wherein each member of a group has the same UMI, sequence read start position, and sequence read end position; determining a consensus sequence from the members of each group; parsing the paired consensus sequence reads; polishing the parsed consensus sequence reads; determining a set of candidate variants based on a comparison of the polished consensus reads; for each candidate SNV variant, calculating a weighted linear score using a linear model, wherein the candidate SNV variant is called a potential SNV variant when the weighted linear score is greater than a linear model threshold; and for each potential SNV variant, calculating a weight
  • the method further includes generating a computer file comprising paired consensus sequence reads.
  • the step of polishing the parsed consensus sequence reads includes: comparing the parsed consensus reads to the reference sequence to determine a set of preliminary SNV variants, including a count of each preliminary SNV variant; and adjusting the count of each preliminary SNV variant based on a comparison of the count of each preliminary SNV variant to a corresponding background variant count that is determined from a control sample.
  • the comparison comprises determining whether the count for each preliminary SNV variant is background noise.
  • the step of determining whether the count for each preliminary SNV variant is background noise comprises calculating a probability that the preliminary SNV variant is background noise.
  • the linear model comprises a duplex variant count, a singleton variant count, and a consensus variant count.
  • the threshold for the linear model is sample specific.
  • the threshold for the linear model is determined for all the candidate SNV variants in the sample by fitting a regression to a plot of a cumulative candidate SNV variant count having at least a certain weighted linear score as a function of the weighted linear score, and setting the threshold at a weighted linear score that results in a target cumulative SNV variant count.
  • the threshold for the linear model is determined from a set of training data by determining a cumulative count of false positive variant calls as a function of weighted linear score, and selecting the weighted linear score that corresponds to a target cumulative false positive count as the threshold.
  • the logistic regression model comprises a duplex variant count, a singleton variant count, and a consensus variant count. [0025] In some embodiments of the second aspect, the logistic regression model further comprises a ratio of the singleton variant count to the duplex molecule count, an adjustment for depth of coverage, an adjustment for errors in a gene, a median distance of the variant from 5’ end of the sequence read, a relative spread of variant distance in variant molecules versus background, a relative spread of insert starts in variant molecules versus background, a relative spread of insert ends in variant molecules versus background, a raw strand of alignment bias, a raw strand of origin bias, a substitution type of variant, a two base pair context around a variant position, a fraction of duplex variant molecules with two reads, a fraction of duplex variant molecules with less than or equal to three reads, a fraction of consensus variant molecules with two reads, a fraction of consensus variant molecules with less than or equal to three reads, a relative frequency of consensus in
  • the logistic regression model comprises a plurality of features, wherein each feature is multiplied by a weight, wherein the weights are determined from a training set of data with known true positive variants and known false positive variants.
  • a system for generating sequencing data.
  • the system may include an assay device and/or a logic system.
  • the logic system may include a processor coupled to a memory storing instructions executable by the processor.
  • the processor upon execution of the instructions, is configured to perform the method of one or more embodiments of the first aspect.
  • the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one true InDel variant.
  • the system further includes a reporting device for displaying information relating to at least one true InDel variant.
  • a non-transitory computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the first aspect.
  • a system for generating sequencing data.
  • the system may include an assay device and/or a logic system.
  • the logic system may include a processor coupled to a memory storing instructions executable by the processor.
  • the processor upon execution of the instructions, is configured to perform the method of one or more embodiments of the second aspect.
  • the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one true variant.
  • the system further includes a reporting device for displaying information relating to at least one true variant.
  • a non-transitory computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the second aspect.
  • FIG.1 is a flow chart of a method for an InDel caller algorithm workflow, in accordance with some embodiments.
  • FIG.2 shows an overview of a barcode deduplication algorithm, in accordance with some embodiments.
  • FIG.3 is a flow chart of a method for the barcode deduplication algorithm, in accordance with some embodiments.
  • FIG.4 is another flow chart of the method for the barcode deduplication algorithm, which illustrates the creation of barcode families, in accordance with some embodiments.
  • FIG.5 illustrates how a threshold is set for a core linear model, in accordance with some embodiments.
  • FIG.6 is a flow chart of a method for a single nucleotide variant (SNV) caller algorithm, in accordance with some embodiments.
  • FIG.7 illustrates the improved performance of a two-stage approach for the SNV caller algorithm, in accordance with some embodiments.
  • FIG.8 illustrates a sequencing system, in accordance with some embodiments.
  • FIG.9 illustrates an exemplary computer apparatus, in accordance with some embodiments. DETAILED DESCRIPTION [0045]
  • the present disclosure provides a number of techniques for performing variant calling, as part of a secondary analysis workflow of sequencing data produced by today’s next generation sequencing devices.
  • an InDel caller algorithm is provided.
  • the InDel caller algorithm can include a hotspot module and an adaptive module.
  • the InDel caller algorithm uses a number of heuristics to identify InDel variants in hotspot regions.
  • the adaptive module uses a two stage approach to identify variants with low AF that might not be strongly supported in a plasma sample.
  • a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold.
  • a machine learning classifier such as a gradient boosting machine (GBM) model is used to score each variant that passes the first stage and then the classifier score is compared to a threshold to determine the final list of called variants.
  • the called variants can be further filtered by a blocklist filter.
  • GBM gradient boosting machine
  • an SNV caller algorithm is provided that is similar in many aspects to the InDel caller algorithm described above.
  • the SNV caller algorithm can include a hotspot module and an adaptive module.
  • the hotspot module is similar to the hotspot module of the InDel caller algorithm, with the exception that one or more additional heuristics may be included in addition to or in lieu of the heuristics used in the InDel caller algorithm.
  • the adaptive module of the SNV caller algorithm also uses a two stage approach to identify variants. In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold. In the second stage, an extended linear model that incorporates a number of additional weighted features is used to filter out variants that passed the first stage based on a comparison of each candidate variant’s score with a threshold.
  • sequences of interest can be sequenced using a sequencing assay as part of the procedure for determining sequencing reads for a plurality of microsatellite loci. Any of a number of sequencing technologies or sequencing assays can be utilized.
  • next Generation Sequencing refers to sequencing methods that allow for massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules (or of nucleic acid analogues).
  • NGS Next Generation Sequencing
  • Non-limiting examples of sequencing assays that are suitable for use with the methods disclosed herein include nanopore sequencing (see, e.g., U.S. Patent Application Publication Nos. 2013/0244340, 2013/0264207, 2014/0134616, 2015/0119259, and 2015/0337366), Sanger sequencing, capillary array sequencing, thermal cycle sequencing (see, e.g., Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (see, e.g., Zimmerman et al., Methods Mol.
  • sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (see, e.g., Fu et al., Nature Biotech., 16:381-384 (1998)), sequencing by hybridization (see, e.g., Drmanac et al., Nature Biotech., 16:54-58 (1998)), and NGS methods, including but not limited to sequencing by synthesis (see, e.g., HiSeq™, MiSeq™, or Genome Analyzer, each available from Illumina, Inc.
  • sequencing by ligation (see, e.g., SOLiD™, available from Thermo Fisher Scientific, Inc. (Waltham, MA))
  • ion semiconductor sequencing (see, e.g., Ion Torrent™, available from Thermo Fisher Scientific, Inc. (Waltham, MA))
  • SMRT® sequencing, available from Pacific Biosciences of California, Inc. (Menlo Park, CA).
  • Commercially available sequencing technologies include: sequencing-by-hybridization platforms from Affymetrix, Inc. (Sunnyvale, CA), sequencing-by-synthesis platforms from Illumina, Inc. (San Diego, CA), and sequencing-by-ligation platforms from Thermo Fisher Scientific, Inc.
  • Bioinformatics Workflow Overview [0052] The output of an NGS sequencer is generally processed by a bioinformatics pipeline that processes the raw signal from the NGS sequencer and translates the raw signal into base calls, often referred to as raw reads, which are typically stored in a FASTQ file that combines the raw reads with associated quality data.
  • This portion of the bioinformatics pipeline is often referred to as primary analysis.
  • the next section of the bioinformatics pipeline is called secondary analysis, and it takes the raw reads generated by the primary analysis, and performs several tasks, including alignment and variant calling.
  • Tertiary analysis is the final portion of the bioinformatics pipeline and uses the variant calling information to generate medical insights that health care practitioners can use to improve treatments for their patients.
  • Secondary Analysis [0055] New sequencing technologies, such as nanopore-based sequencers, generate sequencing data with different characteristics than sequencing data generated by the current market leading sequencers, such as Illumina sequencers. For example, these differences can include differences in raw read accuracy and differences in the error profiles.
  • FIG.1 illustrates a secondary analysis workflow for detecting insertions and deletions (InDels).
  • InDel mutations refer to one of the most common classes of short variants that involve insertion(s) and/or deletion(s) of nucleotides in the genomic DNA.
  • Somatic InDels refer to insertion and deletion events in non-germline cells and may contribute to cancer. Such variants constitute a substantial part of the genetic variation in the cancer genome. InDels in a coding region may disrupt protein coding and often lead to changes in protein function. Given that InDels have not been studied as much as single nucleotide variants (SNVs), it is important to refine and optimize tools for InDel calling.
  • SNVs single nucleotide variants
  • the InDel algorithm workflow can be broken down into five main steps or modules: a barcode deduplication step or module 100, a pre-processing step or module 102, a multi-stage InDel calling step or module 104, a filtering step or module 106, and an output step or module 108.
  • the barcode deduplication step is further illustrated in FIGS.2-4.
  • FIG.2 provides an overview of the barcode deduplication process, which is designed to collapse properly paired reads, including duplex reads, with the same fragment start, fragment end, and unique molecular identifier (UMI) into the same barcode family.
  • paired reads from the positive strand 200 and the negative strand 202 are collapsed by the barcode deduplication step into a single barcode family 204 because they share the same UMI, fragment start, and fragment end.
  • Consensus calling can be performed within the barcode family to identify candidate variants (SNVs and InDels).
  • the output of the barcode deduplication step is a file (i.e., a BAM file) with the consensus read pairs of each barcode family.
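The collapsing of reads into barcode families and the subsequent consensus calling can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes reads are plain dictionaries with hypothetical keys (`umi`, `start`, `end`, `seq`), that all reads in a family have the same length, and that the consensus is a simple per-position majority vote. The disclosure does not specify the consensus rule.

```python
from collections import Counter, defaultdict


def group_into_families(reads):
    """Group reads that share the same UMI, fragment start, and fragment end."""
    families = defaultdict(list)
    for read in reads:
        key = (read["umi"], read["start"], read["end"])
        families[key].append(read["seq"])
    return families


def consensus(seqs):
    """Per-position majority vote (assumed rule; equal-length sequences)."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*seqs)
    )


# Illustrative reads: three share a family, one carries a sequencing error.
reads = [
    {"umi": "ACGT", "start": 100, "end": 104, "seq": "AACCG"},
    {"umi": "ACGT", "start": 100, "end": 104, "seq": "AACCG"},
    {"umi": "ACGT", "start": 100, "end": 104, "seq": "AATCG"},
    {"umi": "TTGA", "start": 100, "end": 104, "seq": "AACCG"},
]
families = group_into_families(reads)
consensus_by_family = {key: consensus(seqs) for key, seqs in families.items()}
```

In this example the single discordant base in the third read is outvoted, so the family collapses to one consensus sequence.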
  • FIG.3 illustrates a workflow for the barcode deduplication process, starting with the output from the sequencer, which is often provided in a FASTQ file 300.
  • a paired sort FASTQ file is used.
  • the FASTQ file contains raw sequencing data that can be aligned 302, using SWA or another alignment technique, for example.
  • the output of the alignment step 302 is typically a BAM file 304, which can be sorted by pair to yield a paired BAM file 306. For each read pair, low quality reads and off-target reads can be filtered out here or at an earlier stage before the barcode families are grouped together 308.
  • FIG.4 illustrates one embodiment for the barcode family grouping step.
  • a barcode family key 402 is built or extracted from the records 400, 401.
  • the barcode family key can be the UMIs associated with the paired reads.
  • the barcode family key 402 is then checked to see whether the barcode family key 402 already exists in the current barcode family linked list 404.
  • If the barcode family key 402 already exists, then the barcode family information can be added to the records 406, 407. If the barcode family key does not exist in the current barcode family linked list 404, then the fragment start position is compared with the previous fragment start position 408. If the fragment start position is the same as the previous start position, then a new barcode family is created 410, and the new barcode family information is added to the records 412, 413 to start a new barcode family. If the fragment start position is not the same as the previous start position, then the current linked list of the barcode family is processed 414 to output a barcode family mutation 416 and to output the barcode family to the dedup output file.
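The decision flow of FIG. 4 can be sketched as a streaming pass over position-sorted records. The sketch below is an assumption-laden simplification: a dictionary stands in for the barcode family linked list, the family key is taken to be (UMI, fragment start, fragment end), records are hypothetical dictionaries, and "processing" the current list is reduced to yielding it.

```python
def stream_families(records):
    """Yield batches of barcode families from position-sorted records.

    A batch is flushed whenever a record has a new key and a fragment
    start that differs from the previous record's start.
    """
    current = {}           # stands in for the barcode family linked list
    previous_start = None
    for record in records:
        key = (record["umi"], record["start"], record["end"])
        if key in current:
            # Key already exists: add the record to its family.
            current[key].append(record["name"])
        elif previous_start is None or record["start"] == previous_start:
            # Same fragment start as before: start a new family.
            current[key] = [record["name"]]
        else:
            # Fragment start changed: process the current list, then
            # begin a fresh one with this record.
            yield current
            current = {key: [record["name"]]}
        previous_start = record["start"]
    if current:
        yield current


records = [
    {"name": "r1", "umi": "AAC", "start": 100, "end": 250},
    {"name": "r2", "umi": "AAC", "start": 100, "end": 250},
    {"name": "r3", "umi": "GGT", "start": 100, "end": 250},
    {"name": "r4", "umi": "AAC", "start": 300, "end": 450},
]
batches = list(stream_families(records))
```

Because the input is sorted, only the families sharing the current fragment start need to be held in memory at any time.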
  • the barcode deduplication step 100 can also optionally exclude any reads that are soft clipped because soft clipped reads may be associated with higher error rates.
  • a multiplexed sample which contains more than one sample all mixed together, can be de-multiplexed into separate samples so that the analysis described herein can be performed on each sample.
  • pre-processing can be performed 102. Pre-processing of the dedup output can include background polishing, in which the consensus sequences from the dedup output can be compared with a reference sequence from a healthy sample.
  • the comparison with the healthy sample allows the filtering out of variants called in both the healthy sample and the actual samples as likely artifacts or errors introduced at some point in the sequencing process, such as recurrent library preparation errors, for example.
  • In some embodiments, background polishing is only performed on plasma samples and not on tissue samples. In other embodiments, background polishing can be performed on both plasma samples and tissue samples.
  • the background polishing step can be performed by capturing errors (low allele frequency (AF) variants) as systematic errors from a set of healthy normal control cfDNA samples, which are not expected to have low AF variants.
  • a probabilistic model is built from the observed error at every position across the control reference set to yield a background error distribution.
  • the number of supporting reads for all non-reference positions is evaluated against the derived background error distribution for each of those positions. If a variant appears to belong to the background noise distribution, it is “polished” away (i.e., filtered out) from the base-counts file. That is, the number of supporting reads for that variant is set to 0.
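One way to picture the polishing decision is as a significance test of the observed support against a per-position background error rate. The sketch below assumes a binomial background model and a fixed significance cutoff; both are illustrative assumptions, since the disclosure states only that a probabilistic model is built from the control samples. The names `prob_at_least`, `polish`, and `alpha` are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail."""
    lower_tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower_tail


def polish(support, depth, background_rate, alpha=0.01):
    """Return the support count, or 0 if it is consistent with background noise.

    background_rate: per-position error rate estimated from healthy
                     control samples (assumed input).
    alpha:           assumed significance cutoff.
    """
    if prob_at_least(support, depth, background_rate) < alpha:
        return support   # unlikely to be background noise, keep the count
    return 0             # looks like background noise, polish away


# Illustrative values: depth 1000, background error rate 0.1%.
kept = polish(support=8, depth=1000, background_rate=0.001)
removed = polish(support=1, depth=1000, background_rate=0.001)
```

With an expected background count of about one read at this depth, a single supporting read is unremarkable and is set to 0, while eight supporting reads are retained.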
  • preprocessing can include the parsing of the dedup outputs, which can include the computation of features used in subsequent steps of the InDel caller algorithm. These features include, but are not limited to, the features listed in Table 1, listed below.
  • the InDel calling algorithm 104 can include multiple components, such as a hotspot module and an adaptive module.
  • the hotspot module can be used on predetermined loci of interest, which can include cancer “hotspots” from targeted panels, including known actionable mutations (e.g., BRAF V600E, etc.), known biomarkers (e.g., toxicity biomarker DPYD or resistance biomarker), and recurrent cancer mutations (e.g., top 1% most mutated in COSMIC).
  • the variant caller can be designed to be more sensitive for mutations on the hotspot list.
  • the hotspot module can use heuristic rules for identifying variants.
  • a variant is called when either the duplex rule or the depth rule is met.
  • the duplex rule is met when the AF is greater than or equal to 1.5% and the duplex support is greater than or equal to 5 and total support is greater than or equal to duplex support, where support is the number of reads with that variant.
  • the depth rule is met when the AF is greater than 1.5% and the variant depth is greater than 5.
  • the duplex rule is met when (1) duplex support is greater than or equal to the minimum duplex support, and (2) the total support is greater than or equal to the duplex support.
  • the depth rule is met when the variant depth is greater than 5.
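The first variant of the hotspot rules above, which includes the allele frequency condition, can be written directly as boolean checks. This is a minimal sketch: the function and parameter names are hypothetical, allele frequency is expressed as a fraction, and the default values mirror the 1.5% and 5 figures given in the text.

```python
def duplex_rule(af, duplex_support, total_support, min_af=0.015, min_duplex=5):
    """AF >= 1.5%, duplex support >= 5, and total support >= duplex support."""
    return (
        af >= min_af
        and duplex_support >= min_duplex
        and total_support >= duplex_support
    )


def depth_rule(af, variant_depth, min_af=0.015, min_depth=5):
    """AF > 1.5% and variant depth > 5 (both strict, as stated in the text)."""
    return af > min_af and variant_depth > min_depth


def hotspot_call(af, duplex_support, total_support, variant_depth):
    """A hotspot variant is called when either rule is met."""
    return duplex_rule(af, duplex_support, total_support) or depth_rule(
        af, variant_depth
    )


called_by_duplex = hotspot_call(0.02, 5, 7, 3)
called_by_depth = hotspot_call(0.02, 1, 6, 6)
not_called = hotspot_call(0.015, 1, 6, 6)
```

The second variant described above drops the allele frequency condition and replaces the fixed duplex cutoff with a configurable minimum, which corresponds to calling the same functions with `min_af` set to a value below any observed allele frequency.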
  • an adaptive module can be used.
  • the adaptive module can include a two-stage approach.
  • the first stage uses a core linear model to compute weighted variant molecule counts and learns sample-specific thresholds for calling.
  • the second stage applies an additional filter based on a LightGBM (gradient boosted decision trees) classifier that uses both variant molecule counts as well as additional technical and biological features, as listed in Table 1. This enables the caller to detect lower allele fraction variants with limited support in cfDNA samples as compared to other standard methods.
  • the core linear model in the first stage is described in more detail as follows.
  • the duplex ($D$), consensus ($C$), and singleton ($S$) variant reads.
  • the goal of the model is to determine the best coefficients alpha ($\alpha$), beta ($\beta$), and gamma ($\gamma$) that minimize the total errors (false positives and false negatives).
  • the weighted linear model is as follows: $\text{score} = \alpha D + \beta C + \gamma S$, (Eq.2) where $D$ is the duplex variant count, $C$ is the consensus variant count, and $S$ is the singleton variant count. Theoretically, when the weighted linear score is larger than a certain cutoff, this variant will be called. Therefore, each variant will receive a weighted linear score according to this linear model.
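The first-stage scoring can be sketched as a weighted sum of the three molecule counts followed by a cutoff comparison. The coefficient values and the threshold below are purely illustrative and are not taken from the disclosure; the pairing of alpha, beta, and gamma with the duplex, consensus, and singleton counts follows the order in which they are listed and is an assumption.

```python
def weighted_linear_score(duplex, consensus, singleton, alpha, beta, gamma):
    """Weighted linear combination of duplex, consensus, and singleton counts."""
    return alpha * duplex + beta * consensus + gamma * singleton


def passes_first_stage(score, threshold):
    """A candidate moves on when its score is greater than the threshold."""
    return score > threshold


# Hypothetical coefficients and threshold, for illustration only.
ALPHA, BETA, GAMMA = 3.0, 2.0, 1.0
THRESHOLD = 6.0

strong = weighted_linear_score(1, 1, 2, ALPHA, BETA, GAMMA)
weak = weighted_linear_score(0, 1, 1, ALPHA, BETA, GAMMA)
```

Giving duplex support the largest weight in this example reflects the intuition that a variant seen on both strands of the original molecule is stronger evidence than a singleton read.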
  • All the candidate variants identified by the core linear model in the first stage will be scored on the basis of a machine learning classifier, such as a non-linear tree-based ensemble learning classifier (e.g., LightGBM).
  • Table 1 above describes the full list of features used by the LightGBM classifier.
  • Variants from the first stage with a classifier score above a certain threshold will be retained as the final calls.
  • the threshold for the classifier is calibrated using a set of healthy samples to target the desired or predetermined false positive levels. Other types of classifiers can be used, with preference for classifiers having interpretable features.
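Calibrating the classifier threshold on healthy samples can be sketched as choosing a cutoff from the ranked classifier scores of those samples. The sketch assumes that every candidate scored in a healthy sample is a false positive and that a variant is retained when its score is strictly above the threshold; the function name and the example scores are hypothetical.

```python
def calibrate_threshold(healthy_scores, target_fpr):
    """Pick a classifier threshold so that at most `target_fpr` of the
    candidates from healthy samples score strictly above it."""
    ranked = sorted(healthy_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # tolerated false positives
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


# Illustrative classifier scores: 0.01, 0.02, ..., 1.00.
healthy_scores = [i / 100 for i in range(1, 101)]
threshold = calibrate_threshold(healthy_scores, target_fpr=0.05)
false_positives = sum(score > threshold for score in healthy_scores)
```

The resulting threshold would then be applied unchanged to patient samples, so the expected false positive level is fixed in advance rather than tuned per sample.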
  • the output from the InDel caller can then be filtered 106 using a blocklist filter and other filter tags.
  • the blocklist filter can be generated to block repetitive false positive variants detected in multiple cohorts.
  • the variants included in the blocklist filter can be limited to somatic AF (AF less than 30%) level false positive variants only, where the false positive variant has to be found in greater than 2 samples in the same cohort, and found in greater than or equal to 2 independent cohorts.
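The blocklist criteria above can be sketched as a two-level count over false positive observations. The sketch rests on one reading of the criteria, namely that a variant must recur in more than 2 samples within a cohort, and must do so in at least 2 independent cohorts; the record layout and all names are hypothetical.

```python
from collections import defaultdict


def build_blocklist(false_positives, max_af=0.30, min_samples=3, min_cohorts=2):
    """Build a blocklist from (variant, cohort, sample, af) observations.

    min_samples=3 encodes "greater than 2 samples in the same cohort".
    """
    samples = defaultdict(set)  # (variant, cohort) -> sample identifiers
    for variant, cohort, sample, af in false_positives:
        if af < max_af:  # somatic-level allele frequency only
            samples[(variant, cohort)].add(sample)

    cohorts = defaultdict(set)  # variant -> cohorts where it recurs
    for (variant, cohort), sample_ids in samples.items():
        if len(sample_ids) >= min_samples:
            cohorts[variant].add(cohort)

    return {v for v, found in cohorts.items() if len(found) >= min_cohorts}


# Illustrative observations.
observations = (
    [("chr1:100:A>AT", "cohortA", f"a{i}", 0.01) for i in range(3)]
    + [("chr1:100:A>AT", "cohortB", f"b{i}", 0.02) for i in range(3)]
    + [("chr2:200:GT>G", "cohortA", f"a{i}", 0.01) for i in range(3)]
    + [("chr3:300:C>CA", "cohortA", f"a{i}", 0.45) for i in range(3)]
    + [("chr3:300:C>CA", "cohortB", f"b{i}", 0.45) for i in range(3)]
)
blocklist = build_blocklist(observations)
```

Only the first variant is blocklisted here: the second recurs in a single cohort, and the third has an allele frequency above the somatic range.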
  • Filter tags that can be applied include a low AF tag, which is applied when the variant AF is less than a minimum AF threshold, a low support tag, which is applied when the variant support is less than a minimum support threshold, and a low depth tag, which is applied when the variant depth is less than a minimum depth threshold.
  • Application of these tags or the blocklist filter to a variant will remove the variant from the list of called variants.
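The tagging and removal logic can be sketched as follows. The tag names, the dictionary layout of a variant, and the threshold values are hypothetical; the disclosure specifies only that each tag is applied when the corresponding quantity falls below a minimum threshold.

```python
def filter_tags(variant, min_af, min_support, min_depth):
    """Return the list of filter tags that apply to a variant."""
    tags = []
    if variant["af"] < min_af:
        tags.append("low_af")
    if variant["support"] < min_support:
        tags.append("low_support")
    if variant["depth"] < min_depth:
        tags.append("low_depth")
    return tags


def apply_filters(variants, blocklist, **thresholds):
    """Drop variants that are blocklisted or that carry any filter tag."""
    return [
        v
        for v in variants
        if v["id"] not in blocklist and not filter_tags(v, **thresholds)
    ]


# Illustrative variants and thresholds.
variants = [
    {"id": "v1", "af": 0.02, "support": 6, "depth": 900},
    {"id": "v2", "af": 0.0005, "support": 6, "depth": 900},
    {"id": "v3", "af": 0.02, "support": 6, "depth": 900},
]
kept = apply_filters(
    variants, blocklist={"v3"}, min_af=0.001, min_support=2, min_depth=100
)
kept_ids = [v["id"] for v in kept]
```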
  • the output is a list of called variants that can be written as a VCF file, for example.
  • SNV Caller Single nucleotide variants (SNVs) refer to mutations involving a change of a single A/G/C/T base.
  • Somatic SNVs occur only in somatic (non-germline) cells. Cancer mutations are of this type. In comparison, germline or inherited SNVs occur in sperm or egg cells. This type of mutation is referred to as a single nucleotide polymorphism (SNP).
  • the SNV caller algorithm described herein was developed for calling SNVs in cfDNA and tissue samples, and the main goal is to call any variants observed in the data and distinguish them from technical artifacts. Further classification of variants by type (i.e., somatic or germline) can be performed in downstream modules.
  • the SNV caller module described herein and shown in FIG.6 includes three major components, described below.
  • pre-processing module 602 performs calculations and operations on the barcode-deduplication output files 600 before the main body of SNV caller 604, including parsing the variant file, and the depth file, where the depth refers to the number of times a locus of the genome was sequenced.
  • the SNV caller also incorporates background polishing.
  • pre-processing includes calculating the features that are used in the SNV calling algorithms.
  • SNV caller 604 main body includes several components: a core linear model, an adaptive caller, a hotspot caller, and an extended linear model based final filtration.
  • the adaptive caller and the hotspot caller are two parallel calling modules.
  • the adaptive caller detects variants in a selector-wide fashion.
  • The error distribution of each of the 12 substitution types or 192 tri-nucleotide substitution types is modeled in the sample where mutations are being called, and sample-specific, substitution-specific supporting score thresholds are set.
  • In the hotspot caller, simpler heuristic rules are applied to call the variants listed in the loci of interest (usually known genomic mutations with clinical impact in cancer).
  • a supporting score for each variant is calculated using a weighted linear combination of singleton, consensus, and duplex reads.
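  • A weighted linear combination of this kind can be sketched in a few lines. The weight values below are hypothetical placeholders (the source does not give them); they only illustrate the idea that duplex evidence would typically count for more than consensus evidence, which in turn counts for more than singletons.

```python
def supporting_score(singleton, consensus, duplex,
                     w_singleton=1.0, w_consensus=2.0, w_duplex=4.0):
    """Weighted linear combination of read-family counts.

    The three weights are illustrative assumptions, not values from the source.
    """
    return (w_singleton * singleton
            + w_consensus * consensus
            + w_duplex * duplex)
```

  • For example, under these assumed weights, a variant supported by three singleton, two consensus, and one duplex family would score 3 + 4 + 4 = 11.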
  • calling occurs in two stages: (1) a core linear model and (2) an extended linear model.
  • the core linear model based adaptive variant calls are further filtered based on probability scores according to the extended linear model.
  • the extended linear model is a logistic regression classification model that uses both molecule counts as well as multiple technical and biological variant features such as sequence context, distance to read end, strand bias, and relative spread of variant families with respect to background.
  • The barcode deduplication module 600 is the same as described above with respect to FIG.1 for the InDel caller.
  • the pre-processing module 602 is also largely the same as described above with respect to FIG.1 for the InDel caller except that different features are calculated for the SNV caller module.
  • the background polishing step is the same.
  • the SNV calling module 604 includes a hotspot module that is essentially the same as described above for the InDel caller, with some additional rules that apply to SNV type mutations.
  • One optional additional heuristic rule covers C>T or G>A substitutions: a single duplex supporting read is required or, optionally, two duplex supporting reads are required for sites determined to be high noise based on the background error distribution computed in the background polishing step.
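  • The duplex requirement for C>T and G>A substitutions can be expressed as a small predicate. The function names are hypothetical, and treating all other substitution types as having no duplex requirement under this particular rule is an assumption made for the sketch.

```python
# Substitution types covered by the special-case rule.
DUPLEX_RULE_SUBSTITUTIONS = {("C", "T"), ("G", "A")}


def required_duplex_reads(ref, alt, high_noise_site=False):
    """Number of duplex supporting reads the rule demands for this call."""
    if (ref, alt) in DUPLEX_RULE_SUBSTITUTIONS:
        return 2 if high_noise_site else 1
    # Assumption: this rule places no duplex requirement on other substitutions.
    return 0


def passes_duplex_rule(ref, alt, duplex_support, high_noise_site=False):
    """True when the variant has enough duplex support under the rule."""
    return duplex_support >= required_duplex_reads(ref, alt, high_noise_site)
```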
  • the core linear model of the SNV calling module 604 is also essentially the same as described above for the InDel caller.
  • a tri-nucleotide module models the error distribution of each of the 12 substitution types or 192 tri-nucleotide substitution types in the sample where mutations are being called, and sample-specific substitution-specific supporting score thresholds are set.
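  • The counts of 12 and 192 follow from simple enumeration: 4 reference bases times 3 alternate bases gives 12 substitution types, and adding one flanking base on each side multiplies this by 16. The sketch below enumerates both sets and then derives one threshold per substitution type from background scores observed in the sample. Using an empirical quantile as the threshold is an assumption for illustration; the source does not state how the thresholds are computed, and the `5'[ref>alt]3'` label format is likewise hypothetical.

```python
from itertools import product

BASES = "ACGT"

# 4 reference bases x 3 alternate bases = 12 substitution types.
SUBSTITUTIONS = [f"{r}>{a}" for r in BASES for a in BASES if r != a]

# One flanking base on each side: 4 x 12 x 4 = 192 tri-nucleotide types.
TRINUC_SUBSTITUTIONS = [
    f"{five}[{sub}]{three}"
    for five, sub, three in product(BASES, SUBSTITUTIONS, BASES)
]


def thresholds_by_type(background_scores, quantile=0.99, default=float("inf")):
    """Set one supporting-score threshold per substitution type.

    background_scores maps a substitution type to the supporting scores of
    presumed-error variants of that type in the current sample. Types with
    no background observations get `default` (no calls allowed).
    """
    thresholds = {}
    for sub in SUBSTITUTIONS:
        scores = sorted(background_scores.get(sub, []))
        if not scores:
            thresholds[sub] = default
            continue
        idx = min(len(scores) - 1, int(quantile * len(scores)))
        thresholds[sub] = scores[idx]
    return thresholds
```

  • Because the thresholds are derived from the sample being called, a noisy substitution type in one sample raises the bar only for that type in that sample.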
  • The variant score is based on counts of singleton (s), consensus (c), and duplex (d) barcode families of variants as well as a variety of additional technical and biological variant features. These features are listed in Table 2, below.
  • the final variant score for k additional features (f) and weights (w) is: where denotes the natural log
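  • As a rough illustration of how a logistic regression score could combine molecule counts with additional features, the sketch below assumes a generic form: a log-transformed term for each of the s, c, and d counts plus a weighted sum of k features, passed through a sigmoid. The functional form, the use of log(1 + count), the intercept, and every weight are assumptions made for illustration; they are not the formula or fitted values of the described model.

```python
import math


def extended_score(s, c, d, features, weights,
                   count_weights=(1.0, 1.0, 1.0), intercept=0.0):
    """Probability-like score from family counts and k extra features.

    Assumed form: sigmoid(intercept + sum_j u_j * ln(1 + n_j) + sum_i w_i * f_i),
    where n_j are the singleton, consensus, and duplex family counts.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = intercept
    for u, n in zip(count_weights, (s, c, d)):
        z += u * math.log1p(n)  # natural log of (1 + count)
    z += sum(w * f for w, f in zip(weights, features))
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

  • A call would then be kept when the score exceeds a chosen probability cutoff; the cutoff itself is another parameter the source does not specify.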
  • the filtering module 606 can include a duplex ratio filter, a blocklist filter, and other tags that can be used to exclude certain variant calls.
  • Duplex ratio filter: For every variant called, the filtering module 606 computes the number of barcode-deduplicated reads (i.e., the number of barcode families) that support the variant and the number that support the reference allele. With this barcode scheme, it is also possible to identify barcode families that have duplex support (meaning that the other strand in the original duplex can also be assigned). This gives counts of duplex deduplicated reads (i.e., the number of barcode families with duplex evidence) that support the variant and the reference allele. The percentage of duplex families should be around 10-20%, so a significant depletion of duplex reads relative to this expectation can indicate contamination by single-stranded molecules and therefore an artifactual variant.
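  • One way to test for a significant depletion of duplex families is a one-sided binomial test against the expected duplex fraction. This is a sketch under stated assumptions: the choice of a binomial test, the 10% expected fraction (the lower end of the 10-20% range above), and the significance level are illustrative and are not taken from the source.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def duplex_depleted(duplex_families, total_families,
                    expected_fraction=0.10, alpha=0.01):
    """True when duplex support is significantly below expectation.

    duplex_families: variant-supporting families with duplex evidence.
    total_families:  all variant-supporting families.
    """
    if total_families == 0:
        return False
    p_value = binom_cdf(duplex_families, total_families, expected_fraction)
    return p_value < alpha
```

  • With few supporting families the test has little power, so a variant with, say, five families and no duplex support is not flagged. An alternative would be to compare the variant's duplex fraction with that of the reference allele at the same locus.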
  • FIG.8 illustrates a sequencing system 800 according to an embodiment of the present disclosure.
  • the system as shown includes a sample 805, such as Xpandomers within an assay device 810, where an assay 808 can be performed on sample 805.
  • sample 805 can be contacted with reagents of assay 808 to provide a signal (e.g., an intensity signal) of a physical characteristic 815 (e.g., sequence information of a cell-free nucleic acid molecule).
  • Assay 808 may include sequencing by expansion with an assay device 810, such as a nanopore sequencing device as discussed above.
  • Physical characteristic 815 (e.g., a fluorescence intensity, a voltage, or a current) is detected by detector 820.
  • detector 820 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal.
  • an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
  • Assay device 810 and detector 820 can form an assay system, e.g., a sequencing system 800 that performs sequencing according to embodiments described herein.
  • a data signal 825 is sent from detector 820 to logic system 830.
  • data signal 825 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA).
  • Data signal 825 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 805, and thus data signal 825 can correspond to multiple signals.
  • Data signal 825 may be stored in a local memory 835, an external memory 840, or a storage device 845.
  • The sequencing system 800 can comprise multiple assay devices 810 and detectors 820.
  • Logic system 830 may be, or may include, a computer system, ASIC, processor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.).
  • Logic system 830 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 820 and/or assay device 810.
  • Logic system 830 may also include software that executes in a processor 850.
  • Logic system 830 may include a computer readable medium storing instructions for controlling sequencing system 800 to perform any of the methods described herein.
  • logic system 830 can provide commands to a system that includes assay device 810 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order.
  • Sequencing system 800 may also include a treatment device 860, which can provide a treatment to the subject. Treatment device 860 can determine a treatment and/or be used to perform a treatment.
  • Sequencing system 800 may also include a reporting device 855, which can present results of any of the methods described herein, e.g., as determined using the sequencing system 800. Reporting device 855 can be in communication with a reporting module within logic system 830 that can aggregate, format, and send a report to reporting device 855.
  • the reporting module can present information determined using any of the methods described herein.
  • the information can be presented by reporting device 855 in any format that can be recognized and interpreted by a user of the sequencing system 800.
  • the information can be presented by reporting device 855 in a displayed, printed, or transmitted format, or any combination thereof.
  • Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG.9 in computer system 900.
  • a computer system 900 includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
  • a computer system 900 can include multiple computer apparatuses, each being a subsystem, with internal components.
  • a computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
  • the subsystems shown in FIG.9 are interconnected via a system bus 975. Additional subsystems such as a printer 974, keyboard 978, storage device(s) 979, 982, monitor 976 (e.g., a display screen, such as an LED), which is coupled to display adapter 982, and others are shown.
  • Peripherals and input/output (I/O) devices, which couple to I/O controller 971, can be connected to the computer system by any number of means known in the art, such as input/output (I/O) port 977 (e.g., USB, FireWire®).
  • I/O port 977 or external interface 981 can be used to connect computer system 900 to a wide area network such as the Internet, a mouse input device, or a scanner.
  • the interconnection via system bus 975 allows the central processor 973 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 972 or the storage device(s) 979 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
  • the system memory 972 and/or the storage device(s) 979 may embody a computer readable medium.
  • a computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 981, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component.
  • Computer systems, subsystems, or apparatuses can communicate over a network.
  • one computer can be considered a client and another computer a server, where each can be part of a same computer system.
  • a client and a server can each include multiple systems, subsystems, or components.
  • methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices.
  • Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages.
  • Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
  • aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC.
  • a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware.
  • A person of ordinary skill in the art will appreciate other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
  • Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques.
  • the software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission.
  • a suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like.
  • the computer readable medium may be any combination of such devices.
  • the order of operations may be re-arranged.
  • a process can be terminated when its operations are completed but could have additional steps not included in a figure.
  • a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
  • Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
  • a computer readable medium may be created using a data signal encoded with such programs.
  • Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network.
  • a computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
  • Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time.
  • The term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days.
  • embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps.
  • steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
  • The term “and/or” includes any and all combinations of one or more of the associated listed items and may be abbreviated as “/”.
  • Spatially relative terms such as “under”, “below”, “lower”, “over”, “upper” and the like, may be used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as “under” or “beneath” other elements or features would then be oriented “over” the other elements or features.
  • the exemplary term “under” can encompass both an orientation of over and under.
  • the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
  • the terms “upwardly”, “downwardly”, “vertical”, “horizontal” and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.
  • Although the terms “first” and “second” may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise. These terms may be used to distinguish one feature/element from another feature/element.
  • a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc.
  • Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein.
  • inventive subject matter may be referred to herein individually or collectively by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept, if more than one is, in fact, disclosed.
  • Any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown.
  • This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

Abstract

The invention relates to variant calling algorithms that can be used to identify various mutations in sequencing data. According to various embodiments, an InDel caller and/or an SNV caller can use a multi-stage filtering approach involving a core linear model as well as an additional model to provide further filtering of the identified variants to improve sensitivity and/or accuracy. A barcode deduplication module can also be used to demultiplex a multiplexed sample so that the variant calling algorithms can be applied separately to each sample. Additional filtering steps can be used to further reduce false positives and/or false negatives.
PCT/US2025/025161 2024-04-17 2025-04-17 Systems and methods for variant calling Pending WO2025221998A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463635576P 2024-04-17 2024-04-17
US63/635,576 2024-04-17

Publications (1)

Publication Number Publication Date
WO2025221998A1 (fr) 2025-10-23

Family

ID=95743631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/025161 Pending WO2025221998A1 (fr) 2024-04-17 2025-04-17 Systems and methods for variant calling

Country Status (1)

Country Link
WO (1) WO2025221998A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130244340A1 (en) 2012-01-20 2013-09-19 Genia Technologies, Inc. Nanopore Based Molecular Detection and Sequencing
US20130264207A1 (en) 2010-12-17 2013-10-10 Jingyue Ju Dna sequencing by synthesis using modified nucleotides and nanopore detection
US20140134616A1 (en) 2012-11-09 2014-05-15 Genia Technologies, Inc. Nucleic acid sequencing using tags
US20150119259A1 (en) 2012-06-20 2015-04-30 Jingyue Ju Nucleic acid sequencing by nanopore detection of tag molecules
US20150337366A1 (en) 2012-02-16 2015-11-26 Genia Technologies, Inc. Methods for creating bilayers for use with nanopore sensors

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHANG XU ET AL: "smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers", BIORXIV, 14 March 2018 (2018-03-14), XP055585298, Retrieved from the Internet <URL:https://www.biorxiv.org/content/biorxiv/early/2018/03/14/281659.full.pdf> [retrieved on 20250725], DOI: 10.1101/281659 *
DRMANAC ET AL., NATURE BIOTECH., vol. 16, 1998, pages 381 - 384
LI TAI FANG ET AL: "An ensemble approach to accurately detect somatic mutations using SomaticSeq", GENOME BIOLOGY, vol. 16, no. 1, 17 September 2015 (2015-09-17), XP055531382, DOI: 10.1186/s13059-015-0758-2 *
MIKHAIL SHUGAY ET AL: "MAGERI: Computational pipeline for molecular-barcoded targeted resequencing", PLOS COMPUTATIONAL BIOLOGY, vol. 13, no. 5, 5 May 2017 (2017-05-05), pages e1005480, XP055496652, DOI: 10.1371/journal.pcbi.1005480 *
SATER VINCENT ET AL: "UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries", BIOINFORMATICS, vol. 36, no. 9, 27 January 2020 (2020-01-27), GB, pages 2718 - 2724, XP093299216, ISSN: 1367-4803, Retrieved from the Internet <URL:https://academic.oup.com/bioinformatics/article-pdf/36/9/2718/48985410/bioinformatics_36_9_2718.pdf> [retrieved on 20250725], DOI: 10.1093/bioinformatics/btaa053 *
SEARS ET AL., BIOTECHNIQUES, vol. 13, 1992, pages 626 - 633
XU CHANG: "A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data", COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, vol. 16, 1 January 2018 (2018-01-01), Sweden, pages 15 - 24, XP055781134, ISSN: 2001-0370, DOI: 10.1016/j.csbj.2018.01.003 *
ZIMMERMAN ET AL., METHODS MOL. CELL BIOL., vol. 3, 1992, pages 39 - 42

Similar Documents

Publication Publication Date Title
JP7684708B2 (ja) Noninvasive prenatal molecular karyotyping from maternal plasma
EP3143537B1 (fr) Identification of rare variants in ultra-deep sequencing
JP2025085645A (ja) Systems and methods for automating RNA expression calls in a cancer prediction pipeline
US20190338349A1 (en) Methods and systems for high fidelity sequencing
US20160154930A1 (en) Methods for identification of individuals
US20200013484A1 (en) Machine learning variant source assignment
JPWO2019132010A1 (ja) Method, apparatus, and program for estimating base species in a base sequence
WO2025221998A1 (fr) Systems and methods for variant calling
WO2025221988A1 (fr) Systems and methods for somatic small variant calling
Veeramachaneni Data analysis in rare disease diagnostics
이선호 New Methods for SNV/InDel Calling and Haplotyping from Next Generation Sequencing Data
HK40080479A (en) Noninvasive prenatal molecular karyotyping from maternal plasma
HK40074981A (en) Noninvasive prenatal molecular karyotyping from maternal plasma
KR20250092241A (ko) Nucleic acid error suppression
HK40100599A (en) Noninvasive prenatal molecular karyotyping from maternal plasma
Null Advancement of Understudied Genetic Variants Within Statistical Genetics: A Copy Number Variants Analysis and Development of a Rare Variant Simulation Algorithm
Wang High-Throughput Sequencing And Natural Selection: Studies Of Recent Sweep Inferences And A New Computational Approach For Transcription Identification
Schaibley Understanding the Patterns and Consequences of Single-Nucleotide Mutations in the Human Genome Using High-Throughput Sequencing.
Lorenzo Salazar Bioinformatics Pipeline for Next Generation Sequencing Analysis in Association Studies of Idiopathic Pulmonary Fibrosis
Corbett Assessment of Alignment Algorithms, Variant Discovery and Genotype Calling Strategies in Exome Sequencing Data
HK1210811B (en) Noninvasive prenatal molecular karyotyping from maternal plasma

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 25725617

Country of ref document: EP

Kind code of ref document: A1