
WO2025049828A1 - Optimization of targeted sequencing panels - Google Patents

Optimization of targeted sequencing panels

Info

Publication number
WO2025049828A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
panel
variants
targeted sequencing
feature values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/044554
Other languages
English (en)
Inventor
Robert Abe Paine CALEF
Oliver Claude VENN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Grail Inc
Original Assignee
Grail Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Grail Inc
Publication of WO2025049828A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis

Definitions

  • This disclosure relates to improving sequencing panel assignments for samples from two or more individuals.
  • the methods described herein provide a method for designing an optimized targeted sequencing panel.
  • the methods provide a means for designing a targeted sequencing panel that produces a limit of detection below a threshold for the amplified cfDNA, thereby facilitating tumor fraction estimation or determining the presence or absence of disease.
  • the analysis begins with a group of samples and the aim of the method is to generate optimized targeted sequencing panels based on identified feature values.
  • the method includes selecting at least a subset of identified feature values and generating an optimized targeted sequencing panel based on the selected subset of identified feature values.
  • the cfDNA from samples is amplified using the targeted sequencing panel, thereby generating amplified cfDNA.
  • the amplified cfDNA is assessed for whether the optimized targeted sequencing panel produces a limit of detection below a threshold for the amplified cfDNA. Assessing whether the optimized targeted sequencing panel produces a limit of detection below a threshold can include aggregating information collected from the amplified cfDNA. In some instances, optimizing the targeted sequencing panel is done by assessing and re-assessing (i.e., seed and swap) a targeted sequencing panel based on the identified subset of feature values where one or more feature values are seeded and swapped in each iteration. Based on the assessing and re-assessing (e.g., seed and swap), an optimized targeted sequencing panel or panel(s) including an optimized targeted sequencing panel can be generated.
  • a feature value can be any characteristic of a genomic region that can be used to generate a targeted sequencing panel.
  • a feature value is a variant, and the variant is identified for inclusion in the identified subset of feature values because it is a variant (e.g., a SNP) that has been previously shown to indicate a disease presence and/or a disease type.
  • this disclosure provides a framework for using various feature values when selecting a targeted sequencing panel.
  • the framework described in this disclosure provides a general approach to produce various iterations of targeted sequencing panel assignments with different tradeoffs between the feature values and selecting the set of assignments most suitable to the needs and/or limitations of a given targeted sequencing panel (e.g., limit of detection, disease monitoring, cost, among other factors).
  • FIG. 1 is a flowchart of a method for preparing a nucleic acid sample for sequencing according to one embodiment.
  • FIG. 2 is a block diagram of a processing system for processing sequence reads according to one embodiment.
  • FIG. 3 is a flowchart of a method for determining variants of sequence reads according to one embodiment.
  • FIG. 4 is a flow chart of a workflow for improving sequencing panel assignment according to one embodiment.
  • FIG. 5 is a flow chart of a workflow for selecting samples for which the sequencing data is included in the at least first set of sequencing data (see 410 in FIG. 4).
  • FIG. 6 is a flow chart of a workflow for somatic variant calling.
  • FIG. 7 illustrates a plot showing number of somatic variant calls across a range of tissues.
  • FIG. 8 shows a plot of sequencing panels designed to include varying numbers of somatic variants and the corresponding cTAF for each panel. The line indicates the threshold for the limit of detection for the described assay.
  • FIG. 9 is a flow chart of a workflow for designing an optimized targeted sequencing panel for tumor fraction (TF) estimation.
  • FIG. 10A shows a bar graph for cfDNA yield (nanograms (ng)) for the various batches.
  • FIG. 10B shows a bar graph for collapsed mean target coverage
  • FIG. 10C shows a bar graph for fraction reads on target for the various batches.
  • FIG. 10D shows a bar graph for estimated error rate for the various batches.
  • FIG. 11 shows a chart illustrating the tumor fraction estimates.
  • FIG. 12 is a table showing results of TF estimation.
  • FIG. 13 is a plot showing estimated tumor fraction for titration samples.
  • FIG. 14 is a box plot showing estimated tumor fraction for each batch.
  • FIG. 15 is a plot showing detection status over tumor fraction.
  • FIG. 16 is a plot showing counts for estimated tumor fractions for each of the
  • FIG. 17A is a plot showing estimated tumor fraction for samples grouped by cancer stage.
  • FIG. 17B is a plot showing estimated tumor fraction for samples grouped by cancer stage for lung, colorectal, and breast cancer only.
  • FIG. 17C is a plot showing estimated tumor fraction for stage I breast cancer samples grouped by marker.
  • FIG. 18 is a plot showing detected variants versus tumor fraction merged for cfDNA data.
  • FIG. 19 is a plot showing allele fraction calculated using methylation analysis (methyl af) versus allele fraction calculated using somatic variants (sv af).
  • sequence reads refers to nucleobase sequences read from a sample obtained from an individual. Sequence reads can be obtained through various methods known in the art.
  • read segment refers to any nucleobase sequences including sequence reads obtained from an individual and/or nucleobase sequences derived from the initial sequence read from a sample obtained from an individual.
  • a read segment can refer to an aligned sequence read, a collapsed sequence read, or a stitched read.
  • a read segment can refer to an individual nucleobase, such as a single nucleobase variant.
  • single nucleobase variant refers to a substitution of one nucleobase to a different nucleobase at a position (e.g., site) of a nucleobase sequence, e.g., a sequence read from an individual.
  • a substitution from a first nucleobase X to a second nucleobase Y can be denoted as “X>Y.”
  • a cytosine to thymine SNV can be denoted as “C>T.”
  • the term “indel” refers to any insertion or deletion of one or more base pairs having a length and a position (which can also be referred to as an anchor position) in a sequence read.
  • An insertion corresponds to a positive length, while a deletion corresponds to a negative length.
  • mutation refers to one or more SNVs or indels.
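The notation above can be illustrated with a minimal Python sketch. The class names `SNV` and `Indel` and their field names are hypothetical, chosen only for this illustration; the disclosure defines the concepts, not a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SNV:
    """Single nucleobase variant: substitution of ref by alt at one position."""
    position: int  # site in the nucleobase sequence
    ref: str       # first nucleobase X
    alt: str       # second nucleobase Y

    def label(self) -> str:
        # "X>Y" notation, e.g. "C>T" for a cytosine to thymine SNV
        return f"{self.ref}>{self.alt}"


@dataclass(frozen=True)
class Indel:
    """Insertion or deletion with an anchor position and a signed length."""
    anchor: int  # anchor position in the sequence read
    length: int  # positive for an insertion, negative for a deletion

    def __post_init__(self) -> None:
        if self.length == 0:
            raise ValueError("an indel must have a non-zero length")

    @property
    def kind(self) -> str:
        return "insertion" if self.length > 0 else "deletion"


# A "mutation" is then one or more SNVs or indels.
mutation = [SNV(position=1042, ref="C", alt="T"), Indel(anchor=2210, length=-3)]
print(mutation[0].label(), mutation[1].kind)
```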
  • true positive refers to a mutation that indicates real biology, for example, presence of a potential cancer, disease, or germline mutation in an individual. True positives are not caused by mutations naturally occurring in healthy individuals (e.g., recurrent mutations) or other sources of artifacts such as process errors during assay preparation of nucleic acid samples.
  • false positive refers to a mutation incorrectly determined to be a true positive. Generally, false positives can be more likely to occur when processing sequence reads associated with greater mean noise rates or greater uncertainty in noise rates.
  • cell-free nucleic acid refers to nucleic acid fragments that circulate in an individual’s body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells.
  • cfDNA can be obtained from a blood sample.
  • circulating tumor DNA or “ctDNA” refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, which can be released into an individual’s bloodstream as a result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
  • ctDNA is DNA found in cfDNA.
  • genomic nucleic acid refers to nucleic acid including chromosomal DNA that originates from one or more healthy cells. In some cases, white blood cells are assumed to be healthy cells.
  • wbcDNA refers to nucleic acid including chromosomal DNA that originates from white blood cells. Generally, wbcDNA is gDNA and is assumed to be healthy DNA.
  • tissue nucleic acid refers to nucleic acid including chromosomal DNA from tumor cells or other types of cancer cells that are obtained from cancerous tissue or a tumor. In some cases, tDNA is obtained from a biopsy of a tumor.
  • ALT refers to an allele having one or more mutations relative to a reference allele, e.g., corresponding to a known gene.
  • sampling depth refers to a total number of read segments from a sample obtained from an individual.
  • alternate depth (AD)
  • alternate frequency (AF)
  • the AF can be determined by dividing the corresponding AD of a sample by the depth of the sample for the given ALT.
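As a small worked example of the ratio just described, the following sketch computes AF from AD and depth. The function name and the input checks are assumptions added for illustration.

```python
def alternate_frequency(alternate_depth: int, depth: int) -> float:
    """Return AF = AD / depth for a given ALT in a sample."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= alternate_depth <= depth:
        raise ValueError("alternate depth must lie between 0 and depth")
    return alternate_depth / depth


# Hypothetical numbers: 5 read segments support the ALT out of 1000 at the site.
print(alternate_frequency(5, 1000))
```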
  • feature value refers to a characteristic or set of characteristics used to define a sample.
  • a “feature value” can include a single characteristic at a single genomic locus or a combination of characteristics across a plurality of genomic loci.
  • sequencing panel refers to a combination of sequencing data from two or more samples (e.g., individuals).
  • mean target coverage (MTC)
  • FIG. 1 is a flowchart of a method for preparing a nucleic acid sample for sequencing according to one embodiment.
  • the workflow 100 includes, but is not limited to, the following steps.
  • any step of the workflow 100 can comprise a quantitation sub-step for quality control or other laboratory assay procedures known to one skilled in the art.
  • a nucleic acid sample (DNA or RNA) is extracted from a subject.
  • DNA and RNA can be used interchangeably unless otherwise indicated. That is, the following embodiments for using error source information in variant calling and quality control can be applicable to both DNA and RNA types of nucleic acid sequences.
  • the sample can be any subset of the human genome, including the whole genome.
  • the sample can be extracted from a subject known to have or suspected of having cancer.
  • the sample can include blood, plasma, serum, urine, feces, saliva, other types of bodily fluids, or any combination thereof. In some cases, the sample can include tissue or bodily fluids extracted from tissue.
  • methods for drawing a blood sample can be less invasive than procedures for obtaining a tissue biopsy, which can require surgery.
  • the extracted sample can include cfDNA and/or ctDNA.
  • the human body can naturally clear out cfDNA and other cellular debris. If a subject has a cancer or disease, ctDNA in an extracted sample can be present at a detectable level for diagnosis.
  • the extracted sample can include wbcDNA. Extracting the nucleic acid sample can further include separating the cfDNA and/or ctDNA from the wbcDNA. Extracting the wbcDNA from the cfDNA and/or ctDNA can occur when the DNA is separated from the sample.
  • the wbcDNA is obtained from a buffy coat fraction of the blood sample.
  • the wbcDNA can be sheared to obtain wbcDNA fragments less than 300 base pairs in length. Separating the wbcDNA from the cfDNA and/or ctDNA allows the wbcDNA to be sequenced independently from the cfDNA and/or ctDNA.
  • the sequencing process for wbcDNA is similar to the sequencing process for cfDNA and/or ctDNA.
  • a sequencing library is prepared.
  • unique molecular identifiers (UMIs)
  • the UMIs are short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments during adapter ligation.
  • UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment.
  • the UMIs are replicated along with the attached DNA fragment, which provides a way to identify sequence reads that came from the same original fragment in downstream analysis.
  • hybridization probes (also referred to herein as “probes”) are used to target, and pull down, nucleic acid fragments informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin).
  • the probes can be designed to anneal (or hybridize) to a target (complementary) strand of DNA or RNA.
  • the target strand can be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand.
  • the probes can range in length from tens to hundreds or thousands of base pairs.
  • the probes are designed based on a gene panel to analyze particular mutations or target regions of the genome (e.g., of the human or another organism) that are suspected to correspond to certain cancers or other types of diseases.
  • the probes can cover overlapping portions of a target region.
  • a targeted gene panel rather than sequencing all expressed genes of a genome, also known as “whole exome sequencing”
  • the workflow 100 can be used to increase sequencing depth of the target regions, where depth refers to the count of the number of times a given target sequence within the sample has been sequenced. Increasing sequencing depth reduces required input amounts of the nucleic acid sample.
  • the hybridized nucleic acid fragments are captured and can also be amplified using PCR.
  • in step 140, sequence reads are generated from the enriched DNA sequences.
  • Sequencing data can be acquired from the enriched DNA sequences by known means in the art.
  • the workflow 100 can include next generation sequencing (NGS) techniques including synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing.
  • massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.
  • the sequence reads can be aligned to a reference genome using known methods in the art to determine alignment position information.
  • the alignment position information can indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleobase and end nucleobase of a given sequence read.
  • Alignment position information can also include sequence read length, which can be determined from the beginning position and end position.
  • a region in the reference genome can be associated with a gene or a segment of a gene. As cfDNA and/or ctDNA and wbcDNA are sequenced independently, sequence reads for both cfDNA and/or ctDNA and wbcDNA are independently generated.
  • a sequence read is comprised of a read pair denoted as R1 and R2.
  • the first read R1 can be sequenced from a first end of a nucleic acid fragment whereas the second read R2 can be sequenced from the second end of the nucleic acid fragment. Therefore, base pairs of the first read R1 and second read R2 can be aligned consistently (e.g., in opposite orientations) with nucleobases of the reference genome.
  • Alignment position information derived from the read pair R1 and R2 can include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R2).
  • the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds.
  • An output file having SAM (sequence alignment map) format or BAM (binary) format can be generated and output for further analysis such as variant calling, as described below with respect to FIG. 2.
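To make the alignment position information concrete, the sketch below reads one record of a SAM file and derives the beginning and end positions on the reference. It relies only on the standard SAM column layout (RNAME, 1-based POS, CIGAR) and on the CIGAR operations that consume reference bases (M, D, N, =, X). The function name and the example record are hypothetical.

```python
import re

# One CIGAR element is a length followed by an operation code.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
_REFERENCE_CONSUMING = set("MDN=X")


def alignment_span(sam_line: str) -> tuple[str, int, int]:
    """Return (reference name, beginning position, end position), 1-based inclusive.

    For an unmapped record (CIGAR "*") the reference length is zero, so the
    returned end position is POS - 1; callers should filter such records.
    """
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    reference_length = sum(
        int(count)
        for count, op in _CIGAR_ELEMENT.findall(cigar)
        if op in _REFERENCE_CONSUMING
    )
    return rname, pos, pos + reference_length - 1


# Hypothetical record: 5 matches, 2 inserted bases, 3 matches, 1 deleted base, 4 matches.
record = "read1\t99\tchr1\t100\t60\t5M2I3M1D4M\t=\t250\t200\tACGTACGTACGTAC\t*"
name, begin, end = alignment_span(record)
print(name, begin, end, "read length on reference:", end - begin + 1)
```

Only matched and deleted segments advance the reference coordinate, so the 2 inserted bases in the example contribute to the read sequence but not to the span.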
  • FIG. 2 is block diagram of a processing system 200 for processing sequence reads and generating sequence panel assignments according to one embodiment.
  • the processing system 200 includes a sequence processor 205, sequence database 210, model database 215, machine learning engine 220, models 225 (for example, including a classifier model and one or more Bayesian hierarchical models or joint models), parameter database 230, score engine 235, variant caller 240, and a sequencing panel generator 250 (e.g., a targeted sequencing panel generator).
  • the sequencing panel generator 250 selects a feature value(s) from the sequencing data using various features, scores, sequences, or a combination thereof.
  • the feature value(s) are pre-determined (e.g., a pre-determined set of variants).
  • the feature value(s) are determined by feature value identifier 260.
  • a feature value indicates a presence or absence of a variant, a mean allele frequency, a total number of small variants, and an allele frequency of true variants.
  • the sequencing panel generator 250 can include a feature value identifier 260 that stores feature values.
  • the feature can be any one or more of, without limitation, a presence or absence of a variant, a mean allele frequency, a total number of small variants, and an allele frequency of true variants.
  • a feature value comprises a variant.
  • the variant comprises one or more of the following: a single nucleotide variant, an insertion, and a deletion.
  • a variant is associated with a genomic region described herein and/or as described in U.S. Pat. Pub. No. 2019/0073445 A1.
  • FIG. 3 is an example where the feature value is a variant and the sum of the information gathered for each variant can be considered a feature value and used according to the methods described herein.
  • FIG. 3 is a flowchart of a workflow for determining variants of sequence reads according to one embodiment.
  • the processing system 200 uses the workflow 300 to perform variant calling (e.g., for SNVs and/or indels) based on input sequencing data. Further, the processing system 200 can obtain the input sequencing data from an output file associated with nucleic acid sample prepared using the workflow 100 described above.
  • the workflow 300 includes, but is not limited to, the following steps, which are described with respect to the components of the processing system 200. In other embodiments, one or more steps of the workflow 300 can be replaced by a step of a different process for generating variant calls, e.g., using Variant Call Format (VCF), such as HaplotypeCaller, VarScan, Strelka, or SomaticSniper.
  • the sequence processor 205 collapses aligned sequence reads of the input sequencing data.
  • collapsing sequence reads includes using UMIs, and optionally alignment position information from sequencing data of an output file (e.g., from the workflow 100 shown in FIG. 1) to collapse multiple sequence reads into a consensus sequence for determining the most likely sequence of a nucleic acid fragment or a portion thereof. Since the UMIs are replicated with the ligated nucleic acid fragments through enrichment and PCR, the sequence processor 205 can determine that certain sequence reads originated from the same molecule in a nucleic acid sample.
  • sequence reads that have the same or similar alignment position information (e.g., beginning and end positions within a threshold offset) and include a common UMI are collapsed, and the sequence processor 205 generates a collapsed read (also referred to herein as a consensus read) to represent the nucleic acid fragment.
  • the sequence processor 205 designates a consensus read as “duplex” if the corresponding pair of collapsed reads have a common UMI, which indicates that both positive and negative strands of the originating nucleic acid molecule are captured; otherwise, the collapsed read is designated “non-duplex.”
  • the sequence processor 205 can perform other types of error correction on sequence reads as an alternate to, or in addition to, collapsing sequence reads.
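The collapsing step can be sketched as follows. This is a simplified illustration, not the sequence processor's implementation: it groups reads by an exact match of UMI and alignment positions (the text also allows positions within a threshold offset), assumes the reads in a group have equal length, and uses a per-position majority vote as the consensus rule. The `Read` type and `collapse_reads` function are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    umi: str    # unique molecular identifier attached during adapter ligation
    start: int  # beginning position in the reference genome
    end: int    # end position in the reference genome
    seq: str    # nucleobase sequence of the read


def collapse_reads(reads: list[Read]) -> dict[tuple[str, int, int], str]:
    """Collapse reads sharing a UMI and alignment positions into a consensus read."""
    groups: dict[tuple[str, int, int], list[str]] = defaultdict(list)
    for read in reads:
        groups[(read.umi, read.start, read.end)].append(read.seq)

    consensus = {}
    for key, seqs in groups.items():
        # Majority vote at each position; zip stops at the shortest sequence,
        # so equal-length reads are assumed within a group.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    Read("AACG", 100, 103, "ACGT"),
    Read("AACG", 100, 103, "ACGT"),
    Read("AACG", 100, 103, "ACTT"),  # likely a PCR or sequencing error at position 3
    Read("TTGC", 100, 103, "ACTT"),  # different UMI, so a different original fragment
]
for key, seq in collapse_reads(reads).items():
    print(key, seq)
```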
  • the sequence processor 205 stitches the collapsed reads based on the corresponding alignment position information.
  • the sequence processor 205 compares alignment position information between a first read and a second read to determine whether base pairs of the first and second reads overlap in the reference genome.
  • responsive to determining that an overlap (e.g., of a given number of nucleobases) between the first and second reads is greater than a threshold length (e.g., a threshold number of nucleobases), the sequence processor 205 designates the first and second reads as “stitched”; otherwise, the collapsed reads are designated “unstitched.” In some embodiments, a first and second read are stitched if the overlap is greater than the threshold length and if the overlap is not a sliding overlap.
  • a sliding overlap can include a homopolymer run (e.g., a single repeating nucleobase), a dinucleobase run (e.g., a two-nucleobase sequence), or a trinucleobase run (e.g., a three-nucleobase sequence), where the homopolymer run, dinucleobase run, or trinucleobase run has at least a threshold length of base pairs.
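The stitching decision described above can be sketched in a few lines. The interpretation of a sliding overlap used here, an overlap that consists entirely of a repeated one-, two-, or three-nucleobase unit of at least a threshold length, is an assumption made for illustration, as are all function and parameter names.

```python
def overlap_length(start1: int, end1: int, start2: int, end2: int) -> int:
    """Number of reference positions shared by two reads (inclusive coordinates)."""
    return max(0, min(end1, end2) - max(start1, start2) + 1)


def is_sliding(overlap_seq: str, min_run: int) -> bool:
    """True if the overlap is a homopolymer, di-, or tri-nucleobase run of >= min_run bases."""
    if len(overlap_seq) < min_run:
        return False
    for unit_len in (1, 2, 3):
        unit = overlap_seq[:unit_len]
        if len(unit) < unit_len:
            continue  # overlap shorter than the repeat unit being tested
        if all(base == unit[i % unit_len] for i, base in enumerate(overlap_seq)):
            return True
    return False


def should_stitch(overlap_seq: str, min_overlap: int, min_run: int) -> bool:
    """Stitch when the overlap exceeds the threshold length and is not a sliding overlap."""
    return len(overlap_seq) > min_overlap and not is_sliding(overlap_seq, min_run)


print(overlap_length(100, 150, 140, 200))  # positions 140..150 are shared
print(should_stitch("ACGTACGT", min_overlap=5, min_run=4))
print(should_stitch("AAAAAAAA", min_overlap=5, min_run=4))
```

A repetitive overlap is excluded because it can be placed at several offsets equally well, so it does not pin down how the two reads join.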
  • the sequence processor 205 assembles reads into paths.
  • the sequence processor 205 assembles reads to generate a directed graph, for example, a de Bruijn graph, for a target region (e.g., a gene).
  • Unidirectional edges of the directed graph represent sequences of k nucleobases (also referred to herein as “k-mers”) in the target region, and the edges are connected by vertices (or nodes).
  • the sequence processor 205 aligns collapsed reads to a directed graph such that any of the collapsed reads can be represented in order by a subset of the edges and corresponding vertices.
  • the sequence processor 205 determines sets of parameters describing directed graphs and processes directed graphs. Additionally, the set of parameters can include a count of successfully aligned k-mers from collapsed reads to a k-mer represented by a node or edge in the directed graph.
  • the sequence processor 205 stores, e.g., in the sequence database 210, directed graphs and corresponding sets of parameters, which can be retrieved to update graphs or generate new graphs. For instance, the sequence processor 205 can generate a compressed version of a directed graph (e.g., or modify an existing graph) based on the set of parameters.
  • the sequence processor 205 removes (e.g., “trims” or “prunes”) nodes or edges having a count less than a threshold value, and maintains nodes or edges having counts greater than or equal to the threshold value.
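A minimal sketch of the graph assembly and pruning steps follows, assuming the common de Bruijn construction in which each k-mer is an edge between its (k-1)-base prefix and suffix. The function names, the value of k, and the example reads are hypothetical.

```python
from collections import Counter


def count_kmer_edges(reads: list[str], k: int) -> Counter:
    """Count how many times each k-mer (edge) is observed across collapsed reads."""
    counts: Counter = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def prune_edges(counts: Counter, min_count: int) -> dict[str, int]:
    """Keep edges whose count is greater than or equal to the threshold value."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def to_adjacency(edges: dict[str, int]) -> dict[str, list[str]]:
    """Directed graph: each k-mer connects its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph: dict[str, list[str]] = {}
    for kmer in edges:
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
    return graph


reads = ["ACGTA", "ACGTA", "ACGGA"]  # the third read carries a low-support path
counts = count_kmer_edges(reads, k=3)
kept = prune_edges(counts, min_count=2)
print(kept)
print(to_adjacency(kept))
```

In this example the edges CGG and GGA are seen once and are trimmed, leaving the single well-supported path ACG, CGT, GTA.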
  • the processing system 200 can store sequencing data in a database 210 (e.g., variants and normals), which can be used to detect presence, absence, or level of feature values (e.g., a presence or absence of a variant, a mean allele frequency, a total number of small variants, and an allele frequency of true variants) in a sample from a subject, and/or otherwise predict cost associated with the variant (e.g., per-site cost values and relative costs).
  • the sequence database 210 can store sequencing data processed by the system 200, as well as sequencing data not processed by the system 200, such as sequencing data uploaded from an external source and/or otherwise retrieved from external or publicly available databases.
  • the variant caller 240 generates candidate variants from the paths assembled by the sequence processor 205.
  • the variant caller 240 generates the candidate variants by comparing a directed graph (which may have been compressed by pruning edges or nodes in step 310) to a reference sequence of a target region of a genome.
  • the variant caller 240 can align edges of the directed graph to the reference sequence and record the genomic positions of mismatched edges and mismatched nucleobases adjacent to the edges as the locations of candidate variants.
  • the variant caller 240 can generate candidate variants based on the sequencing depth of a target region.
  • the variant caller 240 can be more confident in identifying variants in target regions that have greater sequencing depth, for example, because a greater number of sequence reads help to resolve (e.g., using redundancies) mismatches or other base pair variations between sequences.
  • the variant caller 240 generates candidate variants using a variant model 225 to determine expected noise rates for sequence reads from a subject.
  • the variant model 225 can be a Bayesian hierarchical model, though in some embodiments, the processing system 200 uses one or more different types of models.
  • a Bayesian hierarchical model can be one of many possible model architectures that can be used to generate candidate variants and which are related to each other in that they all model position-specific noise information in order to improve the sensitivity/specificity of variant calling. More specifically, the machine learning engine 220 trains the variant model 225 using samples from healthy individuals to model the expected noise rates per position of sequence reads.
  • multiple different models can be stored in the model database 215 or retrieved for application post-training. For example, a first model is trained to model SNV noise rates and a second model is trained to model indel noise rates.
  • the score engine 235 scores the candidate variants based on the variant model 225 or corresponding likelihoods of true positives or quality scores.
  • the processing system 200 outputs the candidate variants.
  • the processing system 200 outputs some or all of the determined candidate variants along with the corresponding scores.
  • Downstream systems, e.g., external to the processing system 200 or other components of the processing system 200, can use the candidate variants and scores for various applications including, but not limited to, predicting presence of cancer, disease, or germline mutations.
  • candidate variants are outputted for both cfDNA and/or ctDNA and wbcDNA.
  • candidate variants for wbcDNA are “normals” while candidate variants for cfDNA and/or ctDNA are “variants.”
  • Various detection methods and models can compare variants to normals to determine if the variants include signatures of cancer or any other disease.
  • normals and variants can be generated using any other process, any number of samples (e.g., a tumor biopsy or blood sample), or accessed from a database storing candidate variants.
  • the outputted candidate variants are used in the methods described herein to generate an optimized sequencing panel assignment.
  • III.C GENERATING AN OPTIMIZED TARGETED SEQUENCING PANEL ASSIGNMENT
  • the sequencing panel generator 250 generates a targeted sequencing panel assignment using various features, scores, sequences, etc. determined by the processing system 200.
  • the targeted sequencing panel generator 250 employs a machine learning model (e.g., a classifier model) to determine the targeted sequencing panel assignment based on the feature values from the sequencing data.
  • the targeted sequencing panel includes about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 or more samples.
  • the sequencing panel includes 16 samples. In another embodiment, the sequencing panel has no more than 16 samples.
  • the machine learning model is selected from: a classifier model, a pre-specified algorithm, and a regression model.
  • the machine learning model is a classifier model.
  • the classifier model applies a greedy algorithm to add a next-highest ranked sample (or feature value) of the remaining ranked samples (or feature values) to a panel, wherein the panel to which the sample is sorted comprises the lowest value of feature values. This greedy algorithm is applied until a targeted sequencing panel produces a limit of detection below a threshold.
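The greedy selection can be sketched for the simplest case of a single panel. The limit-of-detection estimate is not specified in this section, so `toy_lod` below is a placeholder assumption (the estimate shrinks as the summed score of the panel grows) and should be replaced by a real estimator; the candidate identifiers and scores are likewise invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# (hypothetical feature-value identifier, hypothetical ranking score)
Candidate = Tuple[str, float]


def toy_lod(panel: Sequence[Candidate]) -> float:
    """Placeholder LoD estimate (an assumption, not the disclosed method)."""
    total = sum(score for _, score in panel)
    return float("inf") if total == 0 else 1e-3 / total


def greedy_panel(
    candidates: Sequence[Candidate],
    estimate_lod: Callable[[Sequence[Candidate]], float],
    lod_threshold: float,
) -> Tuple[List[Candidate], bool]:
    """Add the next-highest ranked candidate until the estimated LoD is below the threshold.

    Returns the panel and a flag saying whether the threshold was reached.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    panel: List[Candidate] = []
    for candidate in ranked:
        panel.append(candidate)
        if estimate_lod(panel) < lod_threshold:
            return panel, True
    return panel, False


candidates = [("v1", 2.0), ("v2", 5.0), ("v3", 1.0), ("v4", 3.0)]
panel, reached = greedy_panel(candidates, toy_lod, lod_threshold=1.2e-4)
print([name for name, _ in panel], reached)
```

Passing the estimator in as a callable keeps the selection loop independent of how the limit of detection is actually measured.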
  • applying the classifier model includes: seeding the targeted sequencing panel with the subset of identified feature values; swapping an identified feature value in the selected subset of identified feature values for an identified feature value not in the selected subset of identified feature values, measuring the limit of detection after swapping; and assessing whether the optimized targeted sequencing panel produces a limit of detection below the threshold; optionally, iterating until a targeted sequencing panel is selected that produces a limit of detection below the threshold.
  • the sequencing panel generator 250 generates the optimized targeted sequencing panel assignment by applying a seed and swap approach to compiling a targeted sequencing panel. In some embodiments, the sequencing panel generator 250 iterates through samples to assign the feature values (and corresponding primers for amplifying the feature value) to a panel, determines the mean of the feature values for each targeted sequencing panel, swaps two feature values between two different targeted sequencing panels, and measures the deviation of the mean feature value for each of the two different targeted sequencing panels following the swap. In one embodiment, if the swap decreases the deviation in mean feature value between the two different targeted sequencing panels, the swap is accepted.
  • the targeted sequencing panel generator includes repeating these steps for a pre-specified number of swaps, thereby generating a targeted sequencing panel assignment based on the feature values from the sequencing data. In such cases, the repeating step is performed until the targeted sequencing panel selected is capable of producing a limit of detection below a threshold.
  • the method includes measuring the limit of detection (e.g., estimated limit of detection); and comparing the targeted sequencing panel assignments and the feature values.
  • measuring the limit of detection includes a term to disallow undesired configurations (e.g., where two or more samples have similar one or more feature values and cannot be easily differentiated).
  • the method includes repeating for a pre-specified number of steps or until the limit of detection is below a threshold.
  • the sequencing panel generator 250 can determine and rank feature values after a sample is selected for a panel. For example, after selecting the sample with the highest ranked feature value after a first iteration, the sequencing panel generator 250 can apply the classification model to the remaining feature values to derive features and rank feature values in a second iteration. The sequencing panel generator 250 can then select feature values based on model coefficients determined in the second iteration. The iterative selection process can continue as needed.
  • the subset of identified feature values comprises at least 50, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 feature values.
  • the sequencing panel generator 250 generates a targeted sequencing panel assignment with the aim of reducing (or improving) the limit of detection of an assay.
  • FIG. 4 shows an example workflow for generating a sequencing panel assignment according to one embodiment.
  • the workflow 400 can be executed by the system 200 or another similar system 200.
  • the sequencing panel generator 250 obtains (retrieves) 410 sequencing data (e.g., test sequences) for a set of samples (e.g., here samples that meet a set of criteria described herein).
  • the first sequencing data can be the CCGA indicator set but could be another set of genomic regions to be analyzed.
  • the sequencing data is associated with a number of test sequences, and is associated with feature values (e.g., a presence or absence of a variant, a mean allele frequency, a total number of small variants, an allele frequency of true variants, a GC content, an error rate, and a sequencing depth count).
  • the sequencing panel generator 250 identifies 412 one or more feature values to be analyzed from the sequencing data (e.g., the first set of sequencing data).
  • the feature value can be a variant (e.g., a presence or absence of a variant).
  • the feature value can be GC content, sequencing depth, a mean allele frequency, a total number of small variants, and an allele frequency of true variants in the sequencing data.
  • Other feature values are also possible.
  • the sequencing panel generator 250 selects 414 at least a subset of identified feature values to be used for generating the optimized targeted sequencing panel.
  • the targeted sequencing panel generator 250 can access one or more additional sets of feature values and apply a machine learning model to the samples based on the additional sets of feature values. In doing so, the targeted sequencing panel generator 250 can identify one or more additional subsets of feature values for consideration when generating the optimized targeted sequencing panel.
  • the targeted sequencing panel generator 250 generates 416 the optimized targeted sequencing panel assignment, for example, by applying a machine learning model to determine the features included in the targeted sequencing panel. In one embodiment, the targeted sequencing panel generator 250 generates 416 the optimized targeted sequencing panel assignment using a seed and swap approach to compiling a panel. In some embodiments, the sequencing panel generator 250 iterates through feature values to assign the feature values to a targeted sequencing panel, determines the mean of the feature values for each targeted sequencing panel, swaps two feature values between two different panels, and measures the deviation of the mean feature value for each of the two different panels following the swap. The targeted sequencing panel generator repeats these steps for a pre-specified number of swaps, thereby generating a targeted sequencing panel assignment based on the feature values from the sequencing data. In such cases, the repeating step is performed until the reduction in the mean feature value is below a threshold.
  • the targeted sequencing panel generator can derive feature values only for genomic regions having variants in a threshold number of sequences in the sequencing data.
  • the targeted sequencing panel generator can duplicate a genomic region on a panel, or remove duplicates of a genomic region from a panel, to increase detection capability.
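  • The seed and swap steps above can be sketched as follows. This is an illustration under assumptions, not the generator's actual implementation: panels are seeded round-robin, "deviation of the mean feature value" is taken as the spread between the largest and smallest panel means, and the loop runs for a fixed number of swaps. The names `seed_and_swap` and `mean_spread` are hypothetical.

```python
import random
from statistics import mean


def mean_spread(panels):
    """Spread between the largest and smallest per-panel mean feature value."""
    means = [mean(p) for p in panels]
    return max(means) - min(means)


def seed_and_swap(values, n_panels, n_swaps=1000, seed=0):
    """Seed panels, then keep random pairwise swaps that reduce the mean spread.

    Assumes n_panels >= 2 and at least one value per panel.
    """
    rng = random.Random(seed)
    # Seed: deal the feature values round-robin across the panels.
    panels = [list(values[i::n_panels]) for i in range(n_panels)]
    spread = mean_spread(panels)
    for _ in range(n_swaps):
        a, b = rng.sample(range(n_panels), 2)   # two different panels
        i = rng.randrange(len(panels[a]))
        j = rng.randrange(len(panels[b]))
        panels[a][i], panels[b][j] = panels[b][j], panels[a][i]
        new_spread = mean_spread(panels)
        if new_spread < spread:
            spread = new_spread                  # accept the swap
        else:
            # Reject: undo the swap so the assignment is unchanged.
            panels[a][i], panels[b][j] = panels[b][j], panels[a][i]
    return panels, spread
```

  • Because a swap is kept only when it lowers the spread, the returned spread is never larger than that of the seeded assignment, and panel sizes stay fixed throughout.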
  • a system administrator can remove genomic regions from the analysis.
  • a system administrator can remove samples from the targeted sequencing panel.
  • the sequencing panel generator can remove feature values from the panel based on a feature value blacklist.
  • the feature value blacklist can include patented feature values, feature values known to cause false positives, or any other feature value that could decrease the detection capability of a panel.
  • the biological samples are amplified 418 using the targeted sequencing panel, thereby generating, for example, amplified cfDNA.
  • the amplified cfDNA, for example, can then be assessed for the presence (or absence) of the amplified cfDNA associated with the feature values in the targeted sequencing panel.
  • methods useful for detecting the presence or absence of the amplified cfDNA associated with the feature values in the targeted sequencing panel include: next generation sequencing, quantitative PCR, and hybrid capture.
  • the amplified cfDNA is assessed 420 to determine whether the optimized targeted sequencing panel produces a limit of detection below a threshold for the amplified cfDNA.
  • a threshold for a limit of detection includes a limit of detection of 1/10,000, 1/20,000, 1/30,000, 1/40,000, 1/50,000, 1/60,000, 1/70,000, 1/80,000, 1/90,000, or 1/100,000.
  • the methods described herein include an optional step of determining 422, based on whether the limit of detection is below a threshold for the amplified cfDNA, a significance score associated with the presence or absence of disease in the sample.
  • the significance score indicates the confidence with which one can rely on the determination of the presence or absence of disease in a sample.
  • target samples selected for testing met one or more of the following criteria: (i) undetected by past classifiers (i.e., detectability), (ii) available plasma tubes (i.e., available plasma), and (iii) current tumor fraction estimates less than 1%, if available (i.e., low estimated TF).
  • the pool of potential samples included samples from the Circulating Cell Genome Atlas 1 Study (CCGA1) and the Circulating Cell Genome Atlas 2 Study (CCGA2).
  • Additional donor criteria (“detectability filter”) include WGBS of tumor biopsy, WGS of WBC, available plasma tubes, current tumor fraction estimate < 1%, and at least 9 somatic variants called (see FIG. 5).
  • the methods include benchmarking (control) samples.
  • Benchmarking samples are contrived titrations with known tumor fractions (e.g., any of the tumor fraction values described herein).
  • the estimated tumor fraction (TF) of a benchmarking sample is at least 10%.
  • benchmarking samples can be used for benchmarking of variant calling.
  • benchmarking samples can also be used for TF estimation. Benchmarking samples were selected from CCGA1 participants that included WBC WGS data.
  • the number of participants (e.g., samples) per panel can be determined empirically. In some embodiments, determining the number of participants per panel includes weighing sequencing costs and panel ordering costs. When sequencing panel assignments include a fixed number of participants, where each participant is assigned to a panel, the number of participants per panel is a tradeoff between the cost of ordering more panels and the increased cost of “wasted” sequencing on participants (i.e., samples) where the requisite information has already been derived.
  • the number of participants (i.e., samples) per panel is chosen to minimize total cost as a function of the number of panels ordered (with a fixed number of participants, the number of panels determines the number of participants per panel). Therefore, minimizing total cost can be represented using the following quantities:
  • N: number of panels
  • cost_rp: cost per read pair
  • depth: target raw sequencing depth
  • sites: number of variants per participant
  • n_samples: total number of participants.
  • the samples are assumed to have the same number of genomic regions per participant.
  • per panel cost is assumed to be constant.
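  • One way to write a total cost consistent with the quantities and assumptions above is sketched below. The functional form is an assumption made for illustration and is not taken from the source: every participant is sequenced at all sites on its panel (sites × participants per panel), and each panel adds a constant ordering cost. The parameter `cost_panel` and the function names are hypothetical.

```python
def total_cost(n_panels, n_samples, sites, depth, cost_rp, cost_panel):
    """Assumed total cost: panel ordering cost plus sequencing cost.

    Each participant is sequenced across every site on its panel, so
    fewer panels means more "wasted" reads on other participants' sites.
    """
    participants_per_panel = n_samples / n_panels
    read_pairs_per_participant = depth * sites * participants_per_panel
    sequencing = cost_rp * read_pairs_per_participant * n_samples
    return n_panels * cost_panel + sequencing


def best_panel_count(n_samples, sites, depth, cost_rp, cost_panel):
    """Scan every possible panel count and return the cheapest one."""
    return min(
        range(1, n_samples + 1),
        key=lambda n: total_cost(n, n_samples, sites, depth, cost_rp, cost_panel),
    )
```

  • Under this form the cost is N × cost_panel plus a term proportional to 1/N, so the minimum lies near the square root of (cost_rp × depth × sites × n_samples²) / cost_panel; the scan above simply checks each integer N.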
  • a feature value is a variant (e.g., a somatic variant).
  • Somatic variants can be determined, for example, as described in Section III. B.
  • Another nonlimiting example of a method for determining somatic variants is as described in FIG. 6.
  • the workflow in FIG. 6 is a suggested method due in part to the “niche” nature of the set of datatypes, for example, WGBS of tumor and WGS of cfDNA as matched normal, for which there are no well-characterized tools for somatic variant calling.
  • FIG. 7 shows the number of somatic variant calls across a range of tissues when using the method described in FIG. 6.
  • the aim of the methods described herein is to design an optimized targeted sequencing panel for tumor fraction (TF) estimation by lowering the Limit of Detection (LoD).
  • the number of feature values (e.g., somatic variants) to include in the targeted sequencing panel can be determined empirically.
  • the number of feature values (e.g. somatic variants) to include in the targeted sequencing panel is determined by calculating the estimated cTAF for targeted sequencing panels designed with varying numbers of somatic variants and selecting the number of somatic variants based on which panels produced a cTAF below a threshold.
  • FIG. 8 shows a plot of targeted sequencing panels designed to include varying numbers of somatic variants and the corresponding cTAF for each panel. The targeted sequencing panels to the left of the line indicate those that have a cTAF below the desired LoD threshold.
  • determining the number of variants per participant includes (i) confidence of the variant call, (ii) error rate of the site, and (iii) ease of sequencing/availability of cfDNA. In one embodiment, the number of variants per participant is less than 500 variants. In one embodiment, the number of variants per participant is 500 variants or greater.
  • determining (i) confidence of the variant call includes a log-likelihood of a true call versus noise. This can be represented as:
  • determining (ii) “error rate of the site” depends, at least in part, on the conversion type of the variant (e.g., SNP). In one embodiment, greater error rates exist for conversion types including A>G, T>C; C>A, G>T; and C>T, G>A. The variants having the lowest error rates met the criteria for “error rate of the site.”
  • determining (iii) ease of sequencing/availability of cfDNA depends, at least in part, on the GC content.
  • a skilled artisan would appreciate that other factors contribute to ease of sequencing.
  • relative coverage of genomic regions is reproducible across samples.
  • the consistent (or inconsistent) coverage is used in determining ease of sequencing.
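  • The three selection criteria above (confidence of the call, error rate of the site, ease of sequencing) can be combined as in the sketch below. It is illustrative only: the hard exclusion of the higher-error conversion types, the GC window, and the `Candidate` fields are assumptions, while the cap of 500 variants per participant follows the text.

```python
from dataclasses import dataclass

# Conversion types the text associates with greater error rates.
HIGH_ERROR_CONVERSIONS = {"A>G", "T>C", "C>A", "G>T", "C>T", "G>A"}


@dataclass
class Candidate:
    site: str          # hypothetical site label, e.g. "chr1:100"
    conversion: str    # e.g. "A>C"
    llr: float         # log-likelihood of a true call versus noise
    gc: float          # GC content of the surrounding region, 0..1


def select_variants(candidates, max_variants=500, gc_range=(0.3, 0.7)):
    """Filter by conversion type and GC content, then rank by call confidence."""
    low, high = gc_range
    kept = [
        c for c in candidates
        if c.conversion not in HIGH_ERROR_CONVERSIONS and low <= c.gc <= high
    ]
    # Most confident calls first; truncate to the per-participant cap.
    kept.sort(key=lambda c: c.llr, reverse=True)
    return kept[:max_variants]
```

  • A softer design would weight each criterion into a single score rather than filtering outright; the hard filter is used here only to keep the example short.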
  • FIG. 9 is a flow chart of a workflow for designing an optimized targeted sequencing panel for tumor fraction (TF) estimation.
  • WGBS biopsy whole genome bisulfite sequencing
  • WGS biopsy whole genome sequencing
  • Tumor fraction estimation was determined using the targeted sequencing panels described herein.
  • Tumor fraction estimates were generated using a hierarchical Bayesian model.
  • the model treats the allele generating process as a mixture of true somatic variants, missed germline variants, and false positive variant calls.
  • the model takes in allele counts from targeted sequencing and tumor allele counts from WGBS to inform allele fractions.
  • Posterior distributions were estimated using Markov Chain Monte Carlo (slice sampling). Validation of the models was performed using in silico simulations.
  • Nonlimiting, exemplary data is presented in FIG. 11.
  • sequenced depth was 1800X (collapsed depth) on up to 500 putative somatic variants per participant, meaning an aggregate depth of about 1M X.
  • Assay noise rates were about 1E-6 (depending on SNP type and read quality). Sequencing data is provided in FIG. 12.
  • the samples analyzed here included some samples with clear evidence of tumor material and other samples with less, or no, tumor material. Based on these samples, the TF estimation model described herein will produce at least some estimate of TF, meaning the TF estimation may have a varied LoD depending on the makeup of the targeted sequencing panel.
  • the TF estimates were calibrated when measured on titration samples (see FIG. 13). Notably, the TF estimates were not subject to batch effects (see FIG. 14). In addition, the LoD of panel-based tumor fraction estimates was estimated via logistic regression to detection status, with LoD50 and LoD95 taken as the tumor fractions having 50% and 95% probability of detection, respectively (FIG. 15).
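  • As an illustration of reading LoD50 and LoD95 off a logistic regression, the sketch below inverts an already fitted model. It assumes the model has the form logit(p) = intercept + slope × log10(TF); that parameterization, the function name, and the coefficient values in the usage note are hypothetical and not taken from the source.

```python
import math


def lod_from_logistic(intercept, slope, prob):
    """Tumor fraction at which the fitted detection probability equals `prob`.

    Assumes logit(p) = intercept + slope * log10(TF), with slope > 0 so that
    detection becomes more likely as tumor fraction increases.
    """
    logit = math.log(prob / (1.0 - prob))
    return 10 ** ((logit - intercept) / slope)
```

  • For example, with hypothetical coefficients intercept = 8 and slope = 2, `lod_from_logistic(8, 2, 0.5)` gives the tumor fraction at 50% detection probability and `lod_from_logistic(8, 2, 0.95)` the tumor fraction at 95%.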
  • FIG. 17B shows estimated tumor fraction based on samples grouped by cancer stage for lung, colorectal, and breast cancer only.
  • FIG. 17C shows estimated TF for stage I breast cancer samples grouped by marker.
  • the methods described herein can be used for estimating classifier LoD.
  • FIG. 18 shows detected variants versus tumor fraction merged for plasma cfDNA data, which is one of the types of data that can be used as data to optimize a targeted sequence panel.
  • the data is tumor fraction data generated from a classification approach for detecting cancer signal in plasma cfDNA.
  • the data is used to identify feature values (e.g., somatic variants) that, when used to amplify cfDNA, enable a limit of detection below a threshold.
  • the methods described herein can be used for Mosaic variant allele frequencies (MVAF) comparison and/or calibration.
  • tumor fraction data generated using the approach described herein can be compared to an orthogonal method for estimating tumor fraction (e.g., methylation-based) for comparison or calibration purposes.
  • a non-limiting comparison is as shown in FIG. 19.
  • One major consideration when developing methods for improving sequencing panel assignments is to find but avoid going below the lowest tumor fraction (e.g., limit of detection) that can be detected using a feature value (e.g., SNPs/variants) for a set of genomic regions.
  • simulations can be run including the expected number of total fragments and alt-allele-containing fragments under a range of possible tumor fractions as well as pure noise (i.e., 0% TF).
  • the limit of detection refers to the lowest TF with good separation of the alt fraction distribution from noise.
  • the main parameters were: number of collapsed fragments per target and error rates of collapsed fragments.
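  • A simplified, analytic stand-in for such a simulation is sketched below. It is an assumption-laden illustration rather than the simulation used in the source: alt-allele fragment counts are modeled as Poisson, the alt-allele fraction is taken as half the tumor fraction (a heterozygous clonal variant), and "good separation" is defined as a 1% false-call rate under pure noise with 95% detection. All function names and default values are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i            # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def detection_threshold(noise_lam, alpha):
    """Smallest alt-fragment count that pure noise reaches with probability <= alpha."""
    k = 0
    while poisson_sf(k, noise_lam) > alpha:
        k += 1
    return k


def estimate_lod(tf_grid, n_targets, depth, error_rate,
                 alt_per_tf=0.5, alpha=0.01, power=0.95):
    """Lowest tumor fraction in tf_grid that is detected with probability >= power."""
    total_fragments = n_targets * depth            # collapsed fragments, all targets
    noise_lam = total_fragments * error_rate       # expected alt fragments at 0% TF
    k = detection_threshold(noise_lam, alpha)
    for tf in sorted(tf_grid):
        signal_lam = total_fragments * tf * alt_per_tf
        if poisson_sf(k, signal_lam + noise_lam) >= power:
            return tf
    return None                                    # nothing in the grid is detectable
```

  • The two main parameters named above enter through `depth` (collapsed fragments per target) and `error_rate` (error rate of collapsed fragments); aggregating over more targets raises `total_fragments` and lowers the estimated limit of detection.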
  • Embodiment 1. A method for designing an optimized targeted sequencing panel for tumor fraction (TF) estimation in a sample comprising cell free DNA (cfDNA), the method comprising: retrieving at least a first set of sequencing data; identifying feature values in the first set of sequencing data; selecting at least a subset of identified feature values; generating an optimized targeted sequencing panel based on the selected subset of identified feature values; amplifying cfDNA from the sample using the targeted sequencing panel, thereby generating amplified cfDNA; and assessing whether the optimized targeted sequencing panel produces a limit of detection below a threshold for the amplified cfDNA.
  • Embodiment 3 The method of embodiment 1 or 2, wherein assessing whether the optimized targeted sequencing panel produces a limit of detection below a threshold comprises aggregating information collected from the amplified cfDNA.
  • Embodiment 4 The method of any one of embodiments 1-3, wherein generating the optimized targeted sequencing panel comprises applying a machine learning model to determine the features included in the targeted sequencing panel.
  • Embodiment 5 The method of embodiment 4, wherein the machine learning model is selected from: a classifier model, a pre-specified algorithm, and a regression model.
  • Embodiment 6. The method of embodiment 5, wherein the machine learning model is a classifier model.
  • Embodiment 7. The method of embodiment 6, wherein applying the classifier model comprises: iterating through the steps of embodiment 1 until a limit of detection is below the threshold.
  • Embodiment 8 The method of embodiment 7, wherein applying the classifier model comprises: seeding the targeted sequencing panel with the subset of identified feature values; swapping an identified feature value in the selected subset of identified feature values for an identified feature value not in the selected subset of identified feature values; measuring the limit of detection after swapping; and assessing whether the optimized targeted sequencing panel produces a limit of detection below the threshold; optionally, iterating until a targeted sequencing panel is selected that produces a limit of detection below the threshold.
  • Embodiment 9 The method of any one of embodiments 1-8, wherein the subset of identified feature values comprises at least 50, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 feature values.
  • Embodiment 10. The method of any one of embodiments 1-9, wherein the threshold is a limit of detection of 1/10,000, 1/20,000, 1/30,000, 1/40,000, or 1/50,000.
  • Embodiment 11 The method of any one of embodiments 1-10, wherein the first set of sequencing data is obtained from sequencing cell-free DNA existing in one or more biological samples obtained from a plurality of individuals.
  • Embodiment 12 The method of any one of embodiments 1-11, wherein the first set of sequencing data comes from two or more biological samples.
  • Embodiment 13 The method of embodiment 12, wherein the first set of sequencing data comes from 16 biological samples.
  • Embodiment 14. The method of embodiment 13, wherein the 16 biological samples comprise a first control sample.
  • Embodiment 15. The method of embodiment 14, wherein the 16 biological samples comprise a second control sample.
  • Embodiment 16 The method of embodiment 15, wherein the first control sample, the second control sample, or both, comprises a cfDNA from a healthy patient.
  • Embodiment 17. The method of any one of embodiments 1-16, wherein the first sequencing data comes from two or more biological samples where at least one of the biological samples is a benchmarking sample.
  • Embodiment 18 The method of any one of embodiments 1-17, wherein the feature values correspond to a genomic region comprising one or more of the following: tumor-specific markers, mutation hotspots, and viral regions.
  • Embodiment 19 The method of any one of embodiments 1-18, wherein the sequencing data comprises genomic regions associated with a high signal cancer or a liquid cancer.
  • Embodiment 20 The method of any one of embodiments 1-19, wherein the feature values represent features corresponding to one or more of the following: a GC content, an error rate, a sequencing depth count, a presence or absence of a variant, a mean allele frequency, a total number of small variants, and an allele frequency of true variants.
  • Embodiment 21 The method of embodiment 20, wherein the feature value is a variant.
  • Embodiment 22 The method of any one of embodiments 1-21, wherein the subset of identified variants comprises at least 50, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 variants.
  • Embodiment 23 The method of any one of embodiments 1-22, wherein the variant comprises one or more of the following: a single nucleotide variant, an insertion, and a deletion.
  • Embodiment 24 The method of any one of embodiments 1-23, wherein the variants are, or are associated with, previously identified variants.
  • Embodiment 25 The method of any one of embodiments 1-24, wherein the variants comprise tumor specific nucleotide variants.
  • Embodiment 26. The method of any one of embodiments 1-25, wherein the tumor specific nucleotide variants are associated with previously identified tumor specific variants.
  • Embodiment 27. The method of any one of embodiments 1-26, wherein the previously identified tumor specific variants are selected from variants described in: Circulating Cell Genome Atlas 1 Study (CCGA1) and Circulating Cell Genome Atlas 2 Study (CCGA2).
  • Embodiment 28 The method of any one of embodiments 1-27, wherein the tumor specific variants are identified de novo.
  • Embodiment 29 The method of any one of embodiments 1-28, wherein the one or more biological samples are selected from plasma, urine, saliva, and tears.
  • Embodiment 30. A method for lowering the limit of detection of tumor fraction (TF) estimation in a sample comprising cell-free DNA (cfDNA), the method comprising: generating an optimized targeted sequencing panel according to any one of embodiments 1-21, wherein the targeted sequencing panel is based on a first set of sequencing data; amplifying cfDNA from the sample using the targeted sequencing panel; sequencing the amplified cfDNA, thereby generating a second set of sequencing data; aggregating the second set of sequencing data; and determining the limit of detection of TF estimation based on aggregating the second set of sequencing data, wherein the limit of detection of TF estimation is lowered compared to a method that does not use an optimized targeted sequencing panel, aggregate the second sequencing data, or use both steps.
  • Embodiment 31 The method of embodiment 30, wherein generating a targeted sequencing panel comprises applying a machine learning model to determine the somatic variants included in the targeted sequencing panel.
  • Embodiment 32 The method of embodiment 31, wherein the machine learning model is selected from: a classifier model, a pre-specified algorithm, and a regression model.
  • Embodiment 33 The method of embodiment 32, wherein the machine learning model is a classifier model.
  • Embodiment 34. The method of any one of embodiments 30-33, wherein the optimized targeted sequencing panel is capable of amplifying genomic regions associated with at least 50, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 somatic variants.
  • Embodiment 35. The method of any one of embodiments 30-34, wherein the threshold is a limit of detection of 1/10,000, 1/20,000, 1/30,000, 1/40,000, or 1/50,000.
  • Embodiment 36 A non-transitory computer-readable medium storing one or more programs, the one or more programs including instructions which, when executed by an electronic device including a processor, cause the device to perform any of the methods of the preceding embodiments.
  • Embodiment 37 An electronic device, comprising: one or more processors; memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods of the preceding embodiments.
  • a software module is implemented with a computer program product including a computer-readable non-transitory medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
  • Embodiments of the invention can also relate to a product that is produced by a computing process described herein.
  • a product can include information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and can include any embodiment of a computer program product or other data combination described herein.

Abstract

The present disclosure relates to a method for designing optimized targeted sequencing panels. In one aspect, the method includes assessing whether the optimized targeted sequencing panel produces a limit of detection below a threshold.
PCT/US2024/044554 2023-08-30 2024-08-29 Optimization of targeted sequencing panels Pending WO2025049828A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363535423P 2023-08-30 2023-08-30
US63/535,423 2023-08-30

Publications (1)

Publication Number Publication Date
WO2025049828A1 (fr) 2025-03-06

Family

ID=94820355

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/044554 Pending WO2025049828A1 (fr) 2023-08-30 2024-08-29 Optimization of targeted sequencing panels

Country Status (1)

Country Link
WO (1) WO2025049828A1 (fr)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24861090

Country of ref document: EP

Kind code of ref document: A1