WO2025221988A1 - Systems and methods for calling small somatic variants - Google Patents
Systems and methods for calling small somatic variants
- Publication number
- WO2025221988A1 (PCT/US2025/025147)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variant
- sequencing
- variants
- threshold
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
Definitions
- the embodiments described herein relate to systems and methods for performing variant calling on sequencing data. More particularly, the embodiments described herein relate to calling single nucleotide variants, insertions, and deletions in the sequencing data. In accordance with a first aspect of the present disclosure, a method for somatic variant calling is provided.
- the method includes: receiving a computer file comprising a plurality of consensus reads generated from a patient sample; aligning the plurality of consensus reads; determining a callable region from the aligned consensus reads; determining an active region from the callable region based on a comparison of the callable region to a reference sequence and identifying one or more differences in the active region compared to the reference sequence; generating an assembly graph from the reference sequence and the active region, wherein the assembly graph comprises a plurality of paths, wherein each path comprises a plurality of path segments, and wherein each path segment comprises a count of the consensus reads that support the path segment; determining a plurality of haplotype sequences from a plurality of paths of the assembly graph; determining a likelihood score for each haplotype sequence based on the count of consensus reads that support the path segments included in each haplotype sequence; determining a plurality of candidate variants by comparing a weighted count for each variant to
- PATENT Client Reference No.: P39265-WO-1
- the callable region is based on a minimum coverage requirement.
- the assembly graph is pruned to remove at least one path in the plurality of paths. In other embodiments, the assembly graph is not pruned.
- the first threshold is dynamically determined by fitting a sample specific regression model to a background distribution of weighted counts for the candidate variant.
- the machine learning classification model includes one or more features selected from the group consisting of: nonduplex counts, duplex counts, weighted counts, distance of the candidate variant to a 5’ end of the consensus read, substitution type for single nucleotide variants, sequence context, strand bias, mapping quality, base quality, cluster size, and duplex fraction.
- the machine learning classification model comprises a gradient boosted decision tree algorithm.
- the gradient boosted decision tree algorithm is selected from the group consisting of: LightGBM and XGBoost.
- a system is provided for generating sequencing data.
- the system may include an assay device and/or a logic system.
- the logic system may include a processor coupled to a memory storing instructions executable by the processor.
- the processor, upon execution of the instructions, is configured to perform the method of one or more embodiments of the first aspect.
- the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one filtered variant in the plurality of filtered variants.
- the system further includes a reporting device for displaying information relating to at least one filtered variant in the plurality of filtered variants.
- a non-transitory computer-readable medium is provided.
- FIG. 1 is a flow chart of a method for a variant caller workflow, in accordance with some embodiments.
- FIG. 2 is a graph illustrating how a threshold is set by fitting a sample specific regression model to the background distribution of weighted variant counts for all the positions with evidence of alternate alleles (i.e., candidate variant sites) within the sample, in accordance with some embodiments.
- FIG. 3 illustrates the relationship between false positives and sensitivity of the variant caller, in accordance with some embodiments.
- FIG. 4 illustrates a flow chart of a three-stage filtering process that can be used to improve variant caller sensitivity and/or accuracy, in accordance with some embodiments.
- FIG. 5 illustrates a sequencing system, in accordance with some embodiments.
- the present disclosure provides a number of techniques for performing variant calling as part of a secondary analysis workflow of sequencing data produced by today’s next generation sequencing devices. More particularly, a variant calling algorithm is provided for identifying single nucleotide variants (SNVs) and/or insertions/deletions (InDels) from sequencing data, such as nanopore-based sequencing data.
- a variant calling algorithm is provided that generates an assembly graph using a reference sequence for one or more active regions of sequencing data (e.g., a plurality of reads generated by sequencing a sample).
- Weights for each edge of the graph are then calculated based on the sequencing data.
- Candidate variants are then identified by traversing the graph.
- the candidate variants are then filtered according to a two-stage process. In the first stage, candidate variants are removed based on a comparison of the candidate variant weight with a threshold.
- a machine learning model is used to score the candidate variants that passed the first stage according to a number of features. The score is then compared against a threshold to determine a final list of called variants for the active region.
- an InDel caller algorithm is provided.
- the InDel caller algorithm can include a hotspot module and an adaptive module.
- the InDel caller algorithm uses a number of heuristics to identify InDel variants in hotspot regions.
- the adaptive module uses a two-stage approach to identify variants with low allele fraction (AF) that might not be strongly supported in a plasma sample.
- a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold.
- a machine learning classifier such as a gradient boosting machine (GBM) model is used to score each variant that passes the first stage and then the classifier score is compared to a threshold to determine the final list of called variants.
- the called variants can be further filtered by a blocklist filter.
- an SNV caller algorithm is provided that is similar in many aspects to the InDel caller algorithm described above.
- the SNV caller algorithm can include a hotspot module and an adaptive module.
- the hotspot module is similar to the hotspot module of the InDel caller algorithm, with the exception that one or more additional heuristics may be included in addition to or in lieu of the heuristics used in the InDel caller algorithm.
- the adaptive module of the SNV caller algorithm also uses a two-stage approach to identify variants.
- a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold.
- an extended linear model that incorporates a number of additional weighted features is used to filter out variants that passed the first stage based on a comparison of each candidate variant’s score with a threshold.
- NGS: Next Generation Sequencing
- Patent Application Publication Nos. 2013/0244340, 2013/0264207, 2014/0134616, 2015/0119259, and 2015/0337366), Sanger sequencing, capillary array sequencing, thermal cycle sequencing (see, e.g., Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (see, e.g., Zimmerman et al., Methods Mol.
- sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (see, e.g., Fu et al., Nature Biotech., 16:381-384 (1998)), sequencing by hybridization (see, e.g., Drmanac et al., Nature Biotech., 16:54-58 (1998)), and NGS methods, including but not limited to sequencing by synthesis (see, e.g., HiSeq™, MiSeq™, or Genome Analyzer, each available from Illumina, Inc.
- sequencing by ligation (see, e.g., SOLiD™, available from Thermo Fisher Scientific, Inc. (Waltham, MA))
- ion semiconductor sequencing (see, e.g., Ion Torrent™, available from Thermo Fisher Scientific, Inc. (Waltham, MA))
- SMRT® sequencing, available from Pacific Biosciences of California, Inc. (Menlo Park, CA).
- Commercially available sequencing technologies include: sequencing-by-hybridization platforms from Affymetrix, Inc. (Sunnyvale, CA), sequencing-by-synthesis platforms from Illumina, Inc. (San Diego, CA), and sequencing-by-ligation platforms from Thermo Fisher Scientific, Inc.
- Bioinformatics Workflow Overview
- The output of an NGS sequencer is generally processed by a bioinformatics pipeline that translates the raw signal from the sequencer into base calls, often referred to as raw reads, which are typically stored in a FASTQ file that combines the raw reads with associated quality data. This portion of the bioinformatics pipeline is often referred to as primary analysis.
- the next section of the bioinformatics pipeline is called secondary analysis. It takes the raw reads generated by the primary analysis and performs several tasks, including alignment and variant calling.
- Tertiary analysis is the final portion of the bioinformatics pipeline and uses the variant calling information to generate medical insights that health care practitioners can use to improve treatments for their patients.
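- As a rough illustration of the hand-off from primary to secondary analysis, the sketch below reads FASTQ records and decodes their quality strings. It assumes the common four-line record layout and Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are hypothetical and do not come from the disclosure.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text: header, bases, '+', qualities.

    Assumption: qualities use Phred+33 encoding (offset=33).
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it).strip()
        next(it)  # '+' separator line
        quality = next(it).strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


records = list(parse_fastq(["@read1", "ACGT", "+", "II5!"]))
print(records[0].qualities)  # [40, 40, 20, 0]
```

- Real pipelines usually rely on an established parser and handle compressed input; this sketch only shows how bases and quality data travel together in one record.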
- Secondary Analysis
- New sequencing technologies, such as nanopore-based sequencers, generate sequencing data with different characteristics than sequencing data generated by the current market-leading sequencers, such as Illumina sequencers. For example, these differences can include differences in raw read accuracy and differences in the error profiles.
- FIG. 1 illustrates a secondary analysis workflow for a variant calling algorithm, in accordance with some embodiments.
- the variant calling algorithm leverages a portion of the approach used by Mutect2, a haplotype variant caller with somatic-specific genotyping and filtering that is available as part of the Genome Analysis Toolkit (GATK) maintained by the Broad Institute, with several changes or additions to that algorithm.
- the variant calling algorithm starts by receiving consensus reads 100 from, for example, a nanopore sequencer or another type of sequencer. A typical format for receiving these reads is a file in the FASTQ format. The consensus reads stored in the file can be accessed and processed by the pipeline to perform an alignment of the consensus reads against a reference sequence. A callable region 104 can then be identified as a region with a sufficient depth of read coverage to allow for variant calling.
- the callable region 104 can have a minimum depth of read coverage of at least 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50.
- the required depth of coverage can vary depending on a variety of factors, such as the types of bases in the region (e.g., a GC-rich region or another type of motif that may have a higher error rate during sequencing) and the quality scores of the bases in the region.
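- A minimal sketch of the minimum-coverage rule follows, assuming per-position read depth has already been computed from the aligned consensus reads. The function name `callable_regions` and the default of 20 are illustrative choices; the disclosure lists several possible minimum depths and does not prescribe this code.

```python
from typing import List, Tuple


def callable_regions(depth: List[int], min_depth: int = 20) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals whose depth is >= min_depth everywhere."""
    regions: List[Tuple[int, int]] = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos  # a covered run begins here
        elif d < min_depth and start is not None:
            regions.append((start, pos))  # the run ended at the previous position
            start = None
    if start is not None:
        regions.append((start, len(depth)))  # run extends to the end
    return regions


depth = [3, 25, 30, 22, 4, 40, 41]
print(callable_regions(depth, min_depth=20))  # [(1, 4), (5, 7)]
```

- A region-dependent requirement, such as a higher depth for GC-rich motifs, could be modeled by passing a per-position list of minimum depths instead of a single value.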
- mismatches or gaps between the base calls of the reads and the reference sequence are used to identify active regions 108 that contain potential SNVs (single nucleotide variants) and/or InDels (insertions and deletions) 106.
- an assembly graph 110, such as a de Bruijn graph, is generated by starting with the reference sequence, which is decomposed into a series of kmers (short sequences k bases long), with each successive kmer overlapping the previous kmer by k-1 bases.
- the kmers can be represented as nodes that can be joined by lines called edges.
- the edges can be weighted to keep track of the number of kmers found in the sample, with the weights initially set to zero. This forms a reference graph.
- each sequence read can similarly be decomposed into a series of kmers, and can be matched to the reference graph. Each time two successive kmers are matched to the graph, the weight of the edge joining the two kmer nodes is incremented by one.
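- The graph construction described above can be sketched as follows. This is a simplified illustration, not the disclosed implementation: it assumes successive kmers overlap by k-1 bases, stores the graph as a dictionary of weighted edges, and adds read kmer pairs that are absent from the reference as new edges. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source kmer, destination kmer)


def kmers(seq: str, k: int) -> List[str]:
    """Decompose a sequence into successive kmers that overlap by k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_reference_graph(reference: str, k: int) -> Dict[Edge, int]:
    """Create one edge per pair of successive reference kmers, with weight 0."""
    ks = kmers(reference, k)
    return {(a, b): 0 for a, b in zip(ks, ks[1:])}


def thread_reads(weights: Dict[Edge, int], reads: List[str], k: int) -> None:
    """Increment the edge weight for every pair of successive read kmers.

    Assumption: pairs not present in the reference graph become new edges,
    which is how variant paths (bubbles) appear in the graph.
    """
    for read in reads:
        ks = kmers(read, k)
        for edge in zip(ks, ks[1:]):
            weights[edge] = weights.get(edge, 0) + 1


graph = build_reference_graph("ACGTAC", k=3)
thread_reads(graph, ["ACGTAC", "ACGTAC", "ACGAAC"], k=3)
print(graph[("ACG", "CGT")], graph[("ACG", "CGA")])  # 2 1
```

- Repeated kmers in the reference create cycles in such a graph; production assemblers typically respond by increasing k, which this sketch does not attempt.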
- the assembly graph can then be optionally pruned by removing sections of the graph whose edge weight is less than a threshold value, such as 2.
- a weight of 2, for example, means that 2 reads in the sample support that segment of the graph.
- the threshold can be increased, for more aggressive pruning, which will result in faster processing and higher specificity, but with lower sensitivity.
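- Pruning can be sketched as a filter over the weighted edges. The example assumes that reference edges are always kept so the reference path stays intact; that choice, the edge-dictionary representation, and the name `prune_graph` are illustrative assumptions rather than details given in the disclosure.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def prune_graph(weights: Dict[Edge, int],
                reference_edges: Set[Edge],
                min_weight: int = 2) -> Dict[Edge, int]:
    """Drop edges supported by fewer than min_weight reads.

    Assumption: edges on the reference path are never removed.
    """
    return {edge: w for edge, w in weights.items()
            if w >= min_weight or edge in reference_edges}


weights = {
    ("ACG", "CGT"): 12, ("CGT", "GTA"): 12,  # reference path
    ("ACG", "CGA"): 1, ("CGA", "GAT"): 1,    # supported by a single read
    ("ACG", "CGC"): 3,                       # supported by three reads
}
reference_edges = {("ACG", "CGT"), ("CGT", "GTA")}
pruned = prune_graph(weights, reference_edges, min_weight=2)
print(sorted(pruned))
```

- Raising `min_weight` removes more weakly supported branches, which matches the trade-off stated above: faster processing and higher specificity at the cost of sensitivity. A value of 0 keeps every edge.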
- haplotype sequences can be generated from the graph by traversing all paths in the graph, with a likelihood score calculated for each haplotype sequence as the product of the transition probabilities of the path edges.
- the probability of an edge can be calculated as the weight of the edge divided by the sum of the weights of all the edges that share the same source node.
- the haplotypes with the highest likelihood scores can be used for candidate variant detection 112.
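- The scoring rule above (edge probability equals edge weight divided by the total weight leaving the same source node, and haplotype score equals the product of edge probabilities along the path) can be sketched as below. The depth-first enumeration, the cycle guard, and all names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def transition_probabilities(weights: Dict[Edge, int]) -> Dict[Edge, float]:
    """Edge weight divided by the summed weight of edges sharing its source node."""
    totals: Dict[str, int] = defaultdict(int)
    for (src, _dst), w in weights.items():
        totals[src] += w
    return {(src, dst): (w / totals[src] if totals[src] else 0.0)
            for (src, dst), w in weights.items()}


def score_haplotypes(weights: Dict[Edge, int], source: str, sink: str) -> List[Tuple[str, float]]:
    """Enumerate source-to-sink paths and score each as a product of edge probabilities."""
    probs = transition_probabilities(weights)
    successors: Dict[str, List[str]] = defaultdict(list)
    for src, dst in weights:
        successors[src].append(dst)

    found: List[Tuple[str, float]] = []

    def walk(node: str, path: List[str], score: float) -> None:
        if node == sink:
            # Spell the haplotype: first kmer plus the last base of each later kmer.
            sequence = path[0] + "".join(kmer[-1] for kmer in path[1:])
            found.append((sequence, score))
            return
        for nxt in successors[node]:
            if nxt not in path:  # skip cycles (an assumption of this sketch)
                walk(nxt, path + [nxt], score * probs[(node, nxt)])

    walk(source, [source], 1.0)
    return sorted(found, key=lambda item: item[1], reverse=True)


weights = {
    ("AAC", "ACG"): 9, ("ACG", "CGT"): 9, ("CGT", "GTC"): 9, ("GTC", "TCC"): 9,
    ("AAC", "ACA"): 1, ("ACA", "CAT"): 1, ("CAT", "ATC"): 1, ("ATC", "TCC"): 1,
}
haplotypes = score_haplotypes(weights, source="AAC", sink="TCC")
for sequence, score in haplotypes:
    print(sequence, round(score, 3))
```

- In this toy graph the reference-like path scores 0.9 and the alternate path 0.1, because nine of the ten reads leaving the first node follow the reference branch.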
- These haplotypes are then aligned to the reference sequence in order to identify candidate variants.
- the alignment can be done using a Smith-Waterman alignment (SWA), for example, and can be stored in a SAM or BAM file.
- SWA: Smith-Waterman alignment
- the candidate variants can be determined by comparing the aligned sequence to the reference sequence and can be stored in a .vcf file.
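- Once a haplotype has been aligned to the reference, reading off candidate variants is a column-by-column comparison. The sketch below assumes the alignment is already available as two equal-length gapped strings with `-` marking gaps; it does not perform the Smith-Waterman alignment itself, and the tuple layout is a hypothetical simplification of a VCF record.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str, str]  # (0-based reference position, type, ref, alt)


def variants_from_alignment(ref_aln: str, hap_aln: str, ref_start: int = 0) -> List[Variant]:
    """Compare a gapped haplotype alignment to the reference, column by column."""
    if len(ref_aln) != len(hap_aln):
        raise ValueError("aligned strings must have equal length")
    variants: List[Variant] = []
    ref_pos = ref_start
    i, n = 0, len(ref_aln)
    while i < n:
        if ref_aln[i] == "-":  # insertion relative to the reference
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            variants.append((ref_pos, "INS", "", hap_aln[i:j]))
            i = j  # inserted bases consume no reference positions
        elif hap_aln[i] == "-":  # deletion relative to the reference
            j = i
            while j < n and hap_aln[j] == "-":
                j += 1
            variants.append((ref_pos, "DEL", ref_aln[i:j], ""))
            ref_pos += j - i
            i = j
        else:
            if ref_aln[i] != hap_aln[i]:
                variants.append((ref_pos, "SNV", ref_aln[i], hap_aln[i]))
            ref_pos += 1
            i += 1
    return variants


calls = variants_from_alignment("ACGT-ACGT", "ACTTGAC-T")
print(calls)  # [(2, 'SNV', 'G', 'T'), (4, 'INS', '', 'G'), (6, 'DEL', 'G', '')]
```

- Writing these calls to a .vcf file would additionally require 1-based positions and an anchor base for insertions and deletions, which this sketch leaves out.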
- the steps described above can be performed, for example, using optimized settings for GATK, an open source variant caller.
- the optimized settings include: initial-tumor-lod -5, tumor-lod-to-emit -5, pruning-lod-threshold -9, max-reads-per-alignment-start 0, active-probability-threshold 0.00005, and min-pruning 0. These parameters are changed from the defaults to more sensitively detect candidate variants at 0.1% AF from realignment.
- the optimal parameters can be identified by maximizing sensitivity based on a grid search.
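- For orientation, the settings listed above could be assembled into a Mutect2 command line as sketched below. The flag spellings simply prefix the parameter names from the text with `--`, and the file names are placeholders; both are assumptions to verify against the installed GATK version. The sketch only builds the argument list and does not run GATK.

```python
from typing import Dict, List

# Values are taken from the settings listed in the text.
SENSITIVE_SETTINGS: Dict[str, str] = {
    "--initial-tumor-lod": "-5",
    "--tumor-lod-to-emit": "-5",
    "--pruning-lod-threshold": "-9",
    "--max-reads-per-alignment-start": "0",
    "--active-probability-threshold": "0.00005",
    "--min-pruning": "0",
}


def mutect2_command(reference: str, bam: str, out_vcf: str) -> List[str]:
    """Build an argument list for a Mutect2 run with the sensitive settings."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", bam, "-O", out_vcf]
    for flag, value in SENSITIVE_SETTINGS.items():
        cmd += [flag, value]
    return cmd


# Placeholder file names, not taken from the source.
cmd = mutect2_command("reference.fasta", "consensus_reads.bam", "candidates.vcf.gz")
print(" ".join(cmd))
```

- A grid search such as the one mentioned above would loop over alternative values for these entries and keep the combination with the highest measured sensitivity on a truth set.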
- the candidate variants can be filtered using a two-stage process, which is designed to reduce false positives and improve precision. As shown in FIG. 1, in the first stage 114, the weighted counts of variant molecules are compared against a threshold that is dynamically learned from the sample.
- This threshold is computed by fitting a sample specific regression model to the background distribution of weighted variant counts for all the positions with evidence of alternate alleles (i.e., candidate variant sites) within the sample.
- smooth splines can optionally be used instead of a regression model to fit more general curves. All the variants with weighted counts above the sample threshold are retained for the second stage. The sample-specific regression is illustrated in FIG. 2.
- the vertical line in FIG. 2, which shows the threshold, can be set to achieve a particular desired or predetermined value of the cumulative counts for the y-axis, such as 1/100, 1/200, 1/300, 1/400, 1/500, 1/600, 1/700, 1/800, 1/900, or 1/1000 cumulative counts per base pair of the panel.
- This can be learned from a training dataset, where the cumulative counts can be matched to or set based on a known number of variants, depending on a desired sensitivity and precision profile.
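- One way to picture the first-stage threshold is sketched below. The source does not specify the form of the regression, so this example assumes a log-linear model: the logarithm of the cumulative number of candidate sites per panel base pair with weighted count at or above x is fitted as a straight line in x, and the threshold is the x at which the fitted line reaches the target rate. The function names, the `panel_bp` parameter, and the example data are hypothetical.

```python
import math
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sample_threshold(weighted_counts: List[float], panel_bp: int,
                     target_rate: float = 1 / 500) -> float:
    """Weighted count at which the fitted background falls to target_rate per bp.

    Assumption: log(cumulative sites per bp) is modeled as linear in the count.
    """
    values = sorted(set(weighted_counts))
    if len(values) < 2:
        raise ValueError("need at least two distinct weighted counts to fit a line")
    xs, ys = [], []
    for v in values:
        at_or_above = sum(1 for w in weighted_counts if w >= v)
        xs.append(float(v))
        ys.append(math.log(at_or_above / panel_bp))
    intercept, slope = fit_line(xs, ys)
    if slope >= 0:
        raise ValueError("background does not decay with weighted count")
    return (math.log(target_rate) - intercept) / slope


# Hypothetical background: many low-count sites, few high-count sites.
counts = [1] * 50 + [2] * 20 + [3] * 8 + [4] * 3 + [5] * 1
loose = sample_threshold(counts, panel_bp=1000, target_rate=1 / 100)
strict = sample_threshold(counts, panel_bp=1000, target_rate=1 / 1000)
retained = [w for w in counts if w > strict]
```

- A stricter target rate moves the threshold to a higher weighted count, so fewer candidates reach the second stage. Swapping the straight line for a smoothing spline, as the text allows, would change only the fitting step.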
- a machine learning classification model 116 (e.g., LightGBM or XGBoost) scores the variants retained by the first stage.
- This model uses the counts of molecules supporting variants as well as additional features such as distance of the variant with respect to the 5’ end of the fragment, substitution type (for SNVs), sequence context, cluster size, strand bias, simplex/duplex support, ref/alt, base quality score (e.g., base quality score used in GATK), MAPQ score, and strand information to produce a probability score for each variant.
- Sequence context is the base upstream and downstream of the variant position. Therefore, if the variant occurs at position 1000 in chromosome 1, the sequence context is the combination of bases at position 999 and position 1001.
- Strand bias is a measure of how unbalanced the distribution of plus and minus strands is in the data.
- When support from the plus and minus strands is balanced, strand bias is 0. If support from either the plus or the minus strand is much higher than the other, strand bias takes a higher value. The maximum value it can take is 0.25.
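- The two features just described can be illustrated with short helper functions. The source does not give the strand bias formula; the squared deviation of the plus-strand fraction from 0.5 is used here only because it is 0 for balanced support and reaches 0.25 when all support comes from one strand. That formula, the 0-based indexing, and the function names are assumptions.

```python
def sequence_context(reference: str, index: int) -> str:
    """Bases immediately upstream and downstream of a 0-based variant position."""
    if index <= 0 or index >= len(reference) - 1:
        raise ValueError("position needs a neighbouring base on both sides")
    return reference[index - 1] + reference[index + 1]


def strand_bias(plus: int, minus: int) -> float:
    """Assumed metric: (plus fraction - 0.5) squared, ranging from 0 to 0.25."""
    total = plus + minus
    if total == 0:
        return 0.0
    return (plus / total - 0.5) ** 2


print(sequence_context("ACGTA", 2))  # CT
print(strand_bias(10, 10))           # 0.0
print(strand_bias(20, 0))            # 0.25
```

- With 15 plus-strand and 5 minus-strand molecules the assumed metric gives 0.0625, between the two extremes.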
- Table 1 below lists all the features extracted from consensus reads that can be used by the machine learning classifier model 116.
- the classifier can use any combination of the features listed herein. For example, the classifier can use all the features in some embodiments or use a subset of the features in other embodiments.
- Table 1

| Feature | Type | Description |
| --- | --- | --- |
| Nonduplex counts | Numeric | Number of variant molecules with support from only the + or - strand |
| Duplex counts | Numeric | Number of variant molecules with support from both strands (+/-) |
| Weighted counts | Numeric | Linear weighted combination of duplex and nonduplex counts |
| Distance | Numeric | Median distance of variant from nearest fragment end |
| Duplex fraction | Numeric | Fraction of variant molecules that are duplex |
| Strand bias | Numeric | Metric that measures the imbalance of support from + or - strands |
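- Several Table 1 features can be derived from per-molecule support records, as in the sketch below. The record layout, the function name, and the duplex and nonduplex weights (1.0 and 0.5) are hypothetical; the source states only that the weighted count is a linear weighted combination of the two counts. Strand bias is left out because the source does not give its formula.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class VariantMolecule:
    plus_support: bool             # variant seen on the + strand
    minus_support: bool            # variant seen on the - strand
    distance_to_fragment_end: int  # distance to the nearest fragment end, in bases


def table1_features(molecules: List[VariantMolecule],
                    duplex_weight: float = 1.0,
                    nonduplex_weight: float = 0.5) -> Dict[str, float]:
    """Compute count-based features for one candidate variant.

    Assumption: the two weights are placeholders, not values from the source.
    """
    duplex = sum(1 for m in molecules if m.plus_support and m.minus_support)
    nonduplex = sum(1 for m in molecules if m.plus_support != m.minus_support)
    total = duplex + nonduplex
    distances = [m.distance_to_fragment_end for m in molecules]
    return {
        "nonduplex_counts": nonduplex,
        "duplex_counts": duplex,
        "weighted_counts": duplex_weight * duplex + nonduplex_weight * nonduplex,
        "distance": median(distances) if distances else 0.0,
        "duplex_fraction": duplex / total if total else 0.0,
    }


molecules = [
    VariantMolecule(True, True, 10),
    VariantMolecule(True, False, 30),
    VariantMolecule(False, True, 20),
]
features = table1_features(molecules)
print(features)
```

- A dictionary like this maps directly onto one row of the feature matrix passed to a gradient boosted classifier.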
- the machine learning classification threshold can be determined, for example, based on a training set with known somatic variant positive samples and negative/healthy samples which are not expected to contain somatic variants.
- the threshold is set or selected to maximize the F1 score for variants in the data.
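- Selecting the classification threshold that maximizes F1 can be sketched as a scan over candidate cut-offs on a labeled training set. The example treats a variant as called when its score is at or above the threshold and uses made-up scores and labels; both are assumptions for illustration.

```python
from typing import List, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return (threshold, F1) for the observed score that gives the highest F1."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = f1_score(tp, fp, fn)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical classifier scores; True marks a known somatic variant.
scores = [0.10, 0.40, 0.35, 0.80, 0.90]
labels = [False, False, True, True, True]
threshold, f1 = best_threshold(scores, labels)
print(threshold, round(f1, 3))  # 0.35 0.857
```

- At the selected cut-off all three true variants are called along with one false positive, giving precision 0.75 and recall 1.0.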
- the F1 score is a metric that combines precision and recall (it is their harmonic mean).
- As illustrated in FIG. 4, an optional third filtering step can be used to remove systemic noise that can be introduced during the sequencing process, such as errors introduced during sample preparation.
- Starting with the candidate variants 400 and assembled BAM 401 that is the output of a variant caller (corresponding to step 114 in FIG. 1), the first filtering stage 402 passes candidate variants with weighted counts greater than a sample-specific threshold, as described above with respect to step 114 in FIG. 1.
- If the candidate variant is below the threshold, it is filtered out as background noise 404.
- If the candidate variant has a weighted count that is greater than the first stage 402 threshold, it is passed to the second stage 406, the machine learning classifier filtering step (as described above with respect to step 116 in FIG. 1).
- If the variant has an ML score that is less than the ML threshold, then it is filtered out 408 for one or more reasons, such as having low quality, low MAPQ, low base quality, strand bias, low depth, etc.
- If the variant has an ML score that is greater than the ML threshold, then it is passed to an optional third stage 410 blocklist filter.
- the blocklist filter can be created from normal samples, starting with the VCF files from a variant caller such as GATK/Mutect2. Any variant that appears in multiple normal samples, which are expected to not have any variants, with a count greater than or equal to 3 can be added to the blocklist. If the variant passed to the third stage 410 is on the blocklist, it is filtered out as systemic noise 412. If the variant passed to the third stage 410 is not on the blocklist, it is passed as a called variant 414. In some embodiments, the third stage 410 can include an additional rescue step for variants on the blocklist.
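- The three filtering stages and the blocklist construction can be sketched together as follows. The example reads "a count greater than or equal to 3" as "seen in at least three normal samples", which is one possible interpretation; the data structures and names are hypothetical, and the optional rescue step is not modeled.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


class Candidate(NamedTuple):
    key: VariantKey
    weighted_count: float
    ml_score: float


def build_blocklist(normal_calls: Iterable[Set[VariantKey]], min_samples: int = 3) -> Set[VariantKey]:
    """Variants called in at least min_samples normal samples (assumed interpretation)."""
    seen: Counter = Counter()
    for calls in normal_calls:
        seen.update(calls)  # each sample contributes at most once per variant
    return {key for key, n in seen.items() if n >= min_samples}


def three_stage_filter(candidates: List[Candidate], count_threshold: float,
                       ml_threshold: float, blocklist: Set[VariantKey]) -> Dict[str, List[Candidate]]:
    """Route each candidate to the first stage that rejects it, or to 'called'."""
    out: Dict[str, List[Candidate]] = {
        "background_noise": [], "ml_filtered": [], "systemic_noise": [], "called": [],
    }
    for c in candidates:
        if c.weighted_count <= count_threshold:
            out["background_noise"].append(c)  # stage 1: sample-specific threshold
        elif c.ml_score <= ml_threshold:
            out["ml_filtered"].append(c)       # stage 2: classifier score
        elif c.key in blocklist:
            out["systemic_noise"].append(c)    # stage 3: blocklist
        else:
            out["called"].append(c)
    return out


recurrent = ("chr1", 1000, "C", "T")
blocklist = build_blocklist([{recurrent}, {recurrent}, {recurrent}, {("chr2", 5, "A", "G")}])
candidates = [
    Candidate(("chr7", 55, "G", "A"), weighted_count=1.0, ml_score=0.99),
    Candidate(("chr7", 90, "T", "C"), weighted_count=6.0, ml_score=0.20),
    Candidate(recurrent, weighted_count=6.0, ml_score=0.95),
    Candidate(("chr17", 77, "G", "T"), weighted_count=6.0, ml_score=0.95),
]
result = three_stage_filter(candidates, count_threshold=3.0, ml_threshold=0.5, blocklist=blocklist)
print([c.key for c in result["called"]])
```

- Keeping the rejected candidates grouped by stage, rather than discarding them, makes it easy to report why each candidate was filtered.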
- FIG. 5 illustrates a sequencing system 500 according to an embodiment of the present disclosure.
- the system as shown includes a sample 505, such as Xpandomers, within an assay device 510, where an assay 508 can be performed on sample 505.
- sample 505 can be contacted with reagents of assay 508 to provide a signal (e.g., an intensity signal) of a physical characteristic 515 (e.g., sequence information of a cell-free nucleic acid molecule).
- Assay 508 may include sequencing by expansion with an assay device 510, such as a nanopore sequencing device as discussed above.
- Physical characteristic 515 (e.g., a fluorescence intensity, a voltage, or a current)
- Detector 520 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal.
- an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
- Assay device 510 and detector 520 can form an assay system, e.g., a sequencing system 500 that performs sequencing according to embodiments described herein.
- a data signal 525 is sent from detector 520 to logic system 530.
- data signal 525 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA).
- Data signal 525 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 505, and thus data signal 525 can correspond to multiple signals.
- Data signal 525 may be stored in a local memory 535, an external memory 540, or a storage device 545.
- the sequencing system 500 can be comprised of multiple assay devices 510 and detectors 520.
- Logic system 530 may be, or may include, a computer system, ASIC, processor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.).
- Logic system 530 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 520 and/or assay device 510.
- Logic system 530 may also include software that executes in a processor 550.
- Logic system 530 may include a computer readable medium storing instructions for controlling sequencing system 500 to perform any of the methods described herein.
- logic system 530 can provide commands to a system that includes assay device 510 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order.
- Sequencing system 500 may also include a treatment device 560, which can provide a treatment to the subject.
- Treatment device 560 can determine a treatment and/or be used to perform a treatment. Examples of such treatment can include surgery, radiation therapy, chemotherapy, immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
- Sequencing system 500 may also include a reporting device 555, which can present results of any of the methods described herein, e.g., as determined using the sequencing system 500.
- Reporting device 555 can be in communication with a reporting module within logic system 530 that can aggregate, format, and send a report to reporting device 555.
- the reporting module can present information determined using any of the methods described herein.
- the information can be presented by reporting device 555 in any format that can be recognized and interpreted by a user of the sequencing system 500.
- the information can be presented by reporting device 555 in a displayed, printed, or transmitted format, or any combination thereof.
- Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG. 6 in computer system 600.
- a computer system 600 includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
- a computer system 600 can include multiple computer apparatuses, each being a subsystem, with internal components.
- a computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
- The subsystems shown in FIG. 6 are interconnected via a system bus 675. Additional subsystems such as a printer 674, keyboard 678, storage device(s) 679, 682, monitor 676 (e.g., a display screen, such as an LED), which is coupled to display adapter 682, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 671, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 677 (e.g., USB, FireWire®).
- I/O: input/output
- I/O port 677 or external interface 681 can be used to connect computer system 600 to a wide area network such as the Internet, a mouse input device, or a scanner.
- the interconnection via system bus 675 allows the central processor 673 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 672 or the storage device(s) 679 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
- the system memory 672 and/or the storage device(s) 679 may embody a computer readable medium.
- a computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 681, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component.
- computer systems, subsystems, or apparatuses can communicate over a network.
- one computer can be considered a client and another computer a server, where each can be part of a same computer system.
- a client and a server can each include multiple systems, subsystems, or components.
- methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices.
- Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages.
- Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
- aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC.
- a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware.
- other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
- Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques.
- the software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission.
- a suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like.
- the computer readable medium may be any combination of such devices.
- the order of operations may be re-arranged.
- a process can be terminated when its operations are completed but could have additional steps not included in a figure.
- a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
- Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
- a computer readable medium may be created using a data signal encoded with such programs.
- Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network.
- a computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
- Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time.
- the term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days.
- embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps.
- steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
- the term “and/or” includes any and all combinations of one or more of the associated listed items and may be abbreviated as “/”.
- Spatially relative terms such as “under”, “below”, “lower”, “over”, “upper” and the like, may be used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as “under” or “beneath” other elements or features would then be oriented “over” the other elements or features.
- the exemplary term “under” can encompass both an orientation of over and under.
- the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
- the terms “upwardly”, “downwardly”, “vertical”, “horizontal” and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.
- although the terms “first” and “second” may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise. These terms may be used to distinguish one feature/element from another feature/element.
- a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc.
- Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein.
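As an illustration of the tolerance language above, the check below tests whether a measured value falls within a stated percentage of a nominal value. It is a minimal sketch: the function name `is_about` and the default tolerance of 10% are assumptions chosen for the example, not terms defined in the disclosure.

```python
def is_about(value: float, stated: float, tolerance_pct: float = 10.0) -> bool:
    """Return True if `value` lies within +/- `tolerance_pct` percent of `stated`.

    `tolerance_pct` is hypothetical; the text lists 0.1%, 1%, 2%, 5%, and 10%
    as example tolerances without fixing one.
    """
    margin = abs(stated) * tolerance_pct / 100.0
    return abs(value - stated) <= margin
```

For a stated value of 10 and a 5% tolerance, 10.4 would count as "about 10" while 10.6 would not.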
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The variant caller algorithms described herein can be used to identify various mutations in sequencing data. The results of a variant caller, for example an open-source variant caller such as GATK/mutect2, can be further filtered to improve sensitivity and/or precision. A first filtering step can be used to remove background noise, and a second filtering step can use a machine learning classifier model to further filter the variants. An optional third filtering step can use a blocklist to filter out systemic noise.
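The three filtering stages summarized in the abstract can be sketched as a simple pipeline. This is an illustrative outline only: the `Variant` fields, the read-count and allele-fraction thresholds used as the background-noise filter, the `score_fn` callable standing in for the machine learning classifier, and all default values are assumptions made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple

VariantKey = Tuple[str, int, str, str]


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_reads: int  # reads supporting the alternate allele
    depth: int      # total reads covering the site

    @property
    def vaf(self) -> float:
        """Variant allele fraction; 0.0 when the site has no coverage."""
        return self.alt_reads / self.depth if self.depth else 0.0

    @property
    def key(self) -> VariantKey:
        return (self.chrom, self.pos, self.ref, self.alt)


def filter_variants(
    candidates: Iterable[Variant],
    score_fn: Callable[[Variant], float],
    blocklist: Optional[Set[VariantKey]] = None,
    min_alt_reads: int = 3,    # hypothetical background-noise threshold
    min_vaf: float = 0.001,    # hypothetical background-noise threshold
    min_score: float = 0.5,    # hypothetical classifier cutoff
) -> List[Variant]:
    """Apply the three filtering stages in order and return surviving variants."""
    kept = []
    for v in candidates:
        # Stage 1: drop calls that look like background noise.
        if v.alt_reads < min_alt_reads or v.vaf < min_vaf:
            continue
        # Stage 2: drop calls the classifier scores below the cutoff.
        if score_fn(v) < min_score:
            continue
        # Stage 3 (optional): drop calls at sites on the blocklist.
        if blocklist is not None and v.key in blocklist:
            continue
        kept.append(v)
    return kept
```

In practice `score_fn` would wrap a trained classifier's probability output, and the candidates would come from the upstream caller's output; passing `blocklist=None` skips the third stage, matching its optional status in the abstract.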
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463635568P | 2024-04-17 | 2024-04-17 | |
| US63/635,568 | 2024-04-17 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025221988A1 (fr) | 2025-10-23 |
Family
ID=95714702
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2025/025147 (pending; WO2025221988A1, fr) | 2024-04-17 | 2025-04-17 | Systems and methods for small somatic variant calling |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025221988A1 (fr) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20130244340A1 (en) | 2012-01-20 | 2013-09-19 | Genia Technologies, Inc. | Nanopore Based Molecular Detection and Sequencing |
| US20130264207A1 (en) | 2010-12-17 | 2013-10-10 | Jingyue Ju | Dna sequencing by synthesis using modified nucleotides and nanopore detection |
| US20140134616A1 (en) | 2012-11-09 | 2014-05-15 | Genia Technologies, Inc. | Nucleic acid sequencing using tags |
| US20150119259A1 (en) | 2012-06-20 | 2015-04-30 | Jingyue Ju | Nucleic acid sequencing by nanopore detection of tag molecules |
| US20150337366A1 (en) | 2012-02-16 | 2015-11-26 | Genia Technologies, Inc. | Methods for creating bilayers for use with nanopore sensors |
| WO2017004589A1 (fr) * | 2015-07-02 | 2017-01-05 | Edico Genome, Corp. | Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform |
| US20170270245A1 (en) * | 2016-01-11 | 2017-09-21 | Edico Genome, Corp. | Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing |
- 2025-04-17: WO application PCT/US2025/025147 (WO2025221988A1, fr), active, pending
Non-Patent Citations (3)
| Title |
|---|
| FU ET AL., NATURE BIOTECH., vol. 16, 1998, pages 381 - 384 |
| SEARS ET AL., BIOTECHNIQUES, vol. 13, 1992, pages 626 - 633 |
| ZIMMERMAN ET AL., METHODS MOL. CELL BIOL., vol. 3, 1992, pages 39 - 42 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7684708B2 (ja) | | Noninvasive prenatal molecular karyotyping from maternal plasma |
| Formenti et al. | | Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation |
| US10176294B2 (en) | | Accurate typing of HLA through exome sequencing |
| US20180039730A1 (en) | | Computer Method and System of Identifying Genomic Mutations Using Graph-Based Local Assembly |
| Chan et al. | | ArCH: improving the performance of clonal hematopoiesis variant calling and interpretation |
| JPWO2019132010A1 (ja) | | Method, device, and program for estimating base species in a base sequence |
| Heo | | Improving quality of high-throughput sequencing reads |
| WO2025221988A1 (fr) | | Systems and methods for small somatic variant calling |
| WO2025221998A1 (fr) | | Systems and methods for variant calling |
| Niehus et al. | | PopDel identifies medium-size deletions jointly in tens of thousands of genomes |
| Veeramachaneni | | Data analysis in rare disease diagnostics |
| US20240404627A1 (en) | | Systems and Methods for Correcting for Noise and Systemic Variations in Sequencing Data |
| Wang et al. | | A comparative study of methods for detecting small somatic variants in disease-normal paired next generation sequencing data |
| Valdez et al. | | scAllele: a versatile tool for the detection and analysis of variants in scRNA-seq |
| Jayasekera et al. | | A Bioinformatics pipeline for variant discovery from Targeted Next Generation Sequencing of the human mitochondrial genome |
| HK40074981A (en) | | Noninvasive prenatal molecular karyotyping from maternal plasma |
| HK40100599A (en) | | Noninvasive prenatal molecular karyotyping from maternal plasma |
| HK40080479A (en) | | Noninvasive prenatal molecular karyotyping from maternal plasma |
| KR20250092241A (ko) | | Nucleic acid error suppression |
| Null | | Advancement of Understudied Genetic Variants Within Statistical Genetics: A Copy Number Variants Analysis and Development of a Rare Variant Simulation Algorithm |
| Corbett | | Assessment of Alignment Algorithms, Variant Discovery and Genotype Calling Strategies in Exome Sequencing Data |
| Goldfeder | | Evaluating and Improving Clinical Genome Sequencing |
| Lorenzo Salazar | | Bioinformatics Pipeline for Next Generation Sequencing Analysis in Association Studies of Idiopathic Pulmonary Fibrosis |
| Inouye et al. | | Exploratory analysis and error modeling of a sequencing technology |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25725374; Country of ref document: EP; Kind code of ref document: A1 |