
WO2025019779A1 - Systems and methods for primer design by degenerate incomplete multiplex primer list expansion (DIMPLE) - Google Patents

Systems and methods for primer design by degenerate incomplete multiplex primer list expansion (DIMPLE)

Info

Publication number
WO2025019779A1
Authority
WO
WIPO (PCT)
Prior art keywords
primer
implemented method
computer implemented
sequence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/038762
Other languages
English (en)
Inventor
David Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pupil Bio Inc
Original Assignee
Pupil Bio Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pupil Bio Inc
Publication of WO2025019779A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • This application relates generally to methods, algorithms, computer programs, computer-readable media, and computer systems for designing massively multiplex PCR primer sets.
  • this disclosure provides one or more embodiments of systems, methods, and non-transitory computer readable storage media that provide benefits and/or solve one or more of the above problems in the art.
  • the disclosed algorithm, model and systems can design massively multiplex PCR primer sets.
  • this disclosure provides a computer implemented method for designing a multiplex PCR primer set to amplify a plurality of nucleic acid (NA) template sequences, comprising the steps of: (a) generating a set of Semilocus sequences from the NA template sequences, (b) generating a set of primer candidates for each Semilocus, (c) evaluating a Fitness score for each primer candidate, and (d) performing the following steps (i) to (iv) iteratively for at least 5, 10, 20, 30, 50, 100, 200, 500, 1000, 2000, 3000, 4000 or 5000 cycles: (i) selecting a set of Semiloci, (ii) selecting a set of primer candidates for each of the selected Semiloci, (iii) evaluating a Loss score for each selected primer candidate against a pre-existing primer set to determine an FL Value for each selected primer candidate, and (iv) adding the selected primer candidate with the highest FL Value to the pre-existing primer set.
  • this disclosure provides a computer-readable medium comprising codes that, upon execution by one or more processors, cause the one or more processors to implement the primer design method as set out herein.
  • this disclosure provides a computer system for designing a multiplex PCR primer set, the system comprising: a non-transitory memory configured to store executable instructions; and one or more processors alone or in combination programmed to perform a primer design method as set out herein.
  • FIG. 1 Flowchart of one embodiment of the DIMPLE algorithm. Terms defined in this patent application are capitalized and underlined. The bracketed numbers are for illustration purposes.
  • Figure 2 Illustration of Semilocus sequence generation from context sequences and key sequences.
  • the DNA template sequence at the top is retrieved from a reference genome of interest, and the context sequence is generated through removal of nucleotides too distant from the key sequence of interest.
  • the subsequence of the context sequence to the 5’ of the key sequence becomes Semilocus 1, and the reverse complement of the subsequence of the context sequence to the 3’ of the key sequence becomes Semilocus 2. Primers are subsequences of their respective Semilocus sequences.
  • Figure 3 Illustration of primer candidate generation from a Semilocus sequence.
  • the first sequence has a ΔG° that is too weak to be considered a valid primer candidate, and the second sequence has a ΔG° that is too strong to be considered a valid primer candidate.
  • the primer candidates 1a through 1h all have ΔG° values between -11.5 kcal/mol and -13.0 kcal/mol.
  • Figure 4 Illustration of primer set and Semilocus coverage.
  • the dashed boxes illustrate the aligned positions of the primers in a primer set to each of 6 Semiloci.
  • the number of primers in the primer set that are derived from each Semilocus constitute that Semilocus’s coverage.
  • FIG. 5 Illustration of a primer candidate’s Fitness and Loss values. Fitness is a value that can be calculated based only on the primer candidate’s sequence itself, whereas Loss is a value that is dependent on the pairwise interactions between the primer candidate and each primer in a primer set.
  • Figure 6 Flowchart of the Main DIMPLE design loop.
  • the primer set Database is implemented as a hash table of k-mer words.
  • Figure 7 Illustration of Loss calculation via 6-mer words with a primer set comprising 1 primer and a single primer candidate. All 6-mer words of the primer set are enumerated on the left side, and the reverse complements of all 6-mer words of the primer candidate are enumerated on the right side. The reverse complements of all 6-mer primer candidate words can be rapidly matched for hits against the 6-mer words of the primer set through a Database (e.g. hash table, suffix tree) constructed on the latter. The same approach can be used for 7-mer and longer words.
  • a Database (e.g. hash table, suffix tree)
  • Figure 8 Illustration of impact of Redundancy Penalty values on the design of primer sets.
  • the left panel shows a hypothetical set of 2 primer candidates for each of 4 Semiloci and the Fitness values.
  • the arrows indicate the pairwise Loss values between primer candidates; where no arrows exist between a pair of primer candidates, the pairwise Loss is 0.
  • the right panel shows the Redundancy Penalties and resulting primer sets: When the Redundancy Penalty is 100, there is strong discouragement for adding a second primer to a Semilocus into the primer set, and the resulting primer set has 4 primers, 1 to each Semilocus.
  • When the Redundancy Penalty is 50, 50 (50 for the second primer to a Semilocus, cumulative 100 for the third primer to a Semilocus), DIMPLE will prioritize achieving at least 1 primer coverage for each Semilocus, and the resulting primer set has 6 primers, including at least 1 to each Semilocus. When the Redundancy Penalty is 0, 100, DIMPLE values a second primer to a Semilocus equivalently to having a first primer to another Semilocus, and the resulting primer set has 6 primers, but there are 0 primers to Semilocus 4.
  • FIG. 9 Example properties of DIMPLE designed primer sets using the embodiment described in Figure 1.
  • Figure 10 illustrates the performance of a DIMPLE-designed pan-cancer panel.
  • (A) Summary of the Ct observed with 100 subsets of 10 primer sets (20 total primers): In 70% of the cases, the Ct for the no-template control (NTC) is above 39 cycles, demonstrating the low propensity of DIMPLE-designed primers to form dimers.
  • (B) The 1000-plex pan-cancer panel: The designed panel yields a ΔCt of approximately 10 cycles between the positive control and the negative control.
  • (C) Post-amplification workflow: Upon amplification, the product undergoes a standard library preparation workflow, which includes the ligation of adapters and amplification with indexing primers.
  • FIG. 11 illustrates detection of fusion genes using the DIMPLE qPCR assay.
  • A Strategy for Designing the 274-Primer Assay Using DIMPLE: The DIMPLE strategy for the qPCR assay involves designing forward (FW) primers for each exon of the genes of interest (KMT2A and NUP98) and reverse (RV) primers for each exon of their most frequent partner genes. Specifically, forward primers are designed to target all relevant exons of KMT2A and NUP98. Reverse primers are designed to target the exons of the most common fusion partners. This comprehensive approach ensures broad coverage: by covering each exon of the target genes and their partners, this design allows for the detection of a wide variety of fusion breakpoints with high sensitivity.
  • Figure 12 shows a computer system implementing the methods, algorithms and computer programs described here.
  • an item is selected from a group consisting of A, B, C, and D
  • the inventors specifically envision each alternative individually (e.g., A alone, B alone, etc.), as well as combinations such as A, B, and D; A and C; B and C; etc.
  • the term “and/or” when used in a list of two or more items means any one of the listed items by itself or in combination with any one or more of the other listed items.
  • the expression “A and/or B” is intended to mean either or both of A and B - i.e., A alone, B alone, or A and B in combination.
  • the expression “A, B and/or C” is intended to mean A alone, B alone, C alone, A and B in combination, A and C in combination, B and C in combination, or A, B, and C in combination.
  • the term “substantially”, when used to modify a quality, generally allows a certain degree of variation without that quality being lost.
  • degree of variation can be less than 0.1%, about 0.1%, about 0.2%, about 0.3%, about 0.4%, about 0.5%, about 0.6%, about 0.7%, about 0.8%, about 0.9%, about 1%, between 1-2%, between 2-3%, between 3-4%, between 4-5%, or greater than 5% or 10%.
  • “about” can mean a variation of ±0.1%, ±0.5%, ±1%, ±2%, ±3%, ±4%, ±5%, ±6%, ±7%, ±8%, ±9% or ±10%.
  • the term “context sequence” refers to a subsequence of the DNA template to be amplified that DIMPLE will consider in the design of PCR primers. The length of the context sequence thus determines the maximum length of the amplicons generated by primers designed by the DIMPLE algorithm. The set of context sequences is the primary input of the DIMPLE algorithm.
  • a primer set is capable of amplifying context sequences having a length between 40 and 1000, between 40 and 900, between 40 and 800, between 40 and 700, between 40 and 600, between 40 and 500, between 40 and 400, between 40 and 300, between 40 and 200, between 40 and 100, between 40 and 90, between 40 and 80, between 40 and 70, between 40 and 60, between 50 and 1000, between 100 and 1000, between 200 and 1000, between 300 and 1000, between 400 and 1000, between 500 and 1000, between 600 and 1000, between 700 and 1000, between 800 and 1000, between 100 and 900, between 100 and 800, between 100 and 700, between 100 and 600, between 100 and 500, between 100 and 400, between 100 and 300, between 100 and 200, between 200 and 900, between 300 and 800, between 400 and 700, between 500 and 600, between 100 and 200, between 200 and 300, between 300 and 400, or between 400
  • the term “key sequence” refers to a sequence in each context sequence that must be included in the insert of the amplicon generated by the corresponding primers.
  • the key sequence is represented in the context sequence as uppercase letters, and all other DNA nucleotides in the context sequence are represented in lowercase letters.
  • the key sequence may be probed by a molecular probe (e.g. Taqman probe) or sequenced to determine its identity
  • the corresponding primer for the context sequence preferably does not include the key sequence (or its complement).
  • DIMPLE defaults to assuming that the middle 1-2 letters of the context sequence are the key sequence.
  • the term “Semilocus” (and its plural form “Semiloci”) refers to a portion of a context sequence, where each context sequence is used to generate two Semilocus sequences: all nucleotides to the 5’ of the key sequence are considered the first Semilocus sequence, and the reverse complement of all nucleotides to the 3’ of the key sequence is considered the second Semilocus sequence.
  • the Semilocus is the set of continuous nucleotides from which continuous subsequences are selected to generate primer candidates.
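The Semilocus definition above can be illustrated with a minimal Python sketch. It assumes the convention described earlier, in which the key sequence is written in uppercase and the flanking context in lowercase; the function names `reverse_complement` and `split_semiloci` are hypothetical and are not taken from the application.

```python
# Hypothetical helper names; only plain A/C/G/T letters are handled here.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, preserving letter case."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_semiloci(context: str) -> tuple[str, str]:
    """Split a context sequence into its two Semilocus sequences.

    Assumes the key sequence is the uppercase run and the flanks are lowercase.
    """
    upper = [i for i, ch in enumerate(context) if ch.isupper()]
    if not upper:
        raise ValueError("context sequence has no uppercase key sequence")
    start, end = upper[0], upper[-1] + 1
    semilocus_1 = context[:start]                     # everything 5' of the key
    semilocus_2 = reverse_complement(context[end:])   # reverse complement of the 3' flank
    return semilocus_1, semilocus_2


print(split_semiloci("acgtGGttca"))
```

Both returned sequences read 5’ to 3’ toward the key sequence, so a primer taken as a contiguous subsequence of either one extends toward the key sequence.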
  • the term “primer candidate” refers to a subsequence of the Semilocus that fulfills the basic criteria for being a valid primer to amplify its corresponding context sequence.
  • a continuous subsequence of a Semilocus qualifies as a primer candidate if the standard free energy of hybridization (ΔG°) of the subsequence to its reverse complement sequence at the temperature and salinity of the intended PCR reaction’s annealing step is roughly -12 kcal/mol.
  • DIMPLE defaults to assuming that the temperature of the PCR anneal step is about 60 °C, and the effective salinity of the PCR reaction is about 0.18 M Na+.
  • a continuous subsequence of a Semilocus qualifies as a primer candidate if the melting temperature of a two-stranded DNA molecule consisting of the subsequence and its reverse complement is between 1°C and 10°C above the PCR anneal temperature for the intended primer concentration.
  • a primer candidate comprises a template-complementary region ranging between 10 and 50, between 15 and 45, between 20 and 40, between 25 and 35, between 10 and 40, between 10 and 30, between 10 and 20, between 20 and 50, between 20 and 40, or between 20 and 30 nucleotides.
  • primer candidate comprises a GC content between 0.10 and 0.90, between 0.10 and 0.80, between 0.10 and 0.70, between 0.10 and 0.60, between 0.10 and 0.50, between 0.10 and 0.40, between 0.10 and 0.30, between 0.10 and 0.20, between 0.20 and 0.80, between 0.20 and 0.70, between 0.20 and 0.60, between 0.20 and 0.50, between 0.20 and 0.40, between 0.20 and 0.30, between 0.30 and 0.80, between 0.30 and 0.70, between 0.30 and 0.60, between 0.30 and 0.50, between 0.30 and 0.40, between 0.10 and 0.20, between 0.20 and 0.30, between 0.30 and 0.40, between 0.40 and 0.50, between 0.50 and 0.60, between 0.60 and 0.70, between 0.70 and 0.80, or between 0.80 and 0.90.
  • primer set refers to a set of primers that are intended to be used together in the same PCR reaction.
  • DIMPLE takes the approach of gradually expanding the primer set by adding one primer candidate at a time, in a way that maximizes the coverage of the Semiloci while minimizing the likelihood of forming primer dimers.
  • the term “Semilocus coverage” of a Semilocus refers to the number of primers in the primer set that amplify that Semilocus. For example, given a set of 10 context sequences as input, 20 Semilocus sequences are generated. A minimal primer set that amplifies all 10 context sequences would have 20 primers, corresponding to 1 primer per Semilocus. In some embodiments, Semilocus coverage is expressed as an array of integers, corresponding to the number of primers in the primer set that are selected from each Semilocus. When a primer set’s Semilocus coverage has at least one 0 value, the primer set is considered incomplete, as it cannot amplify all context sequences.
  • When the primer set’s Semilocus coverage has at least one value greater than or equal to 2, the primer set is considered Redundant, as there are multiple choices of primer for the corresponding Semilocus.
  • a primer set can be simultaneously incomplete and redundant for different Semiloci. In some instances, primer set redundancy is desired because it allows backup options during experimental optimization of multiplex PCR panels, and because the extra primers can be easily used as nested primers for a nested PCR process.
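As a small illustration of the coverage array and the incomplete/redundant labels described above, the following Python sketch counts primers per Semilocus. The representation of a primer set as a list of Semilocus indices and the function names are assumptions made for this example only.

```python
from collections import Counter


def semilocus_coverage(primer_semiloci: list[int], n_semiloci: int) -> list[int]:
    """Coverage array: number of primers in the set drawn from each Semilocus.

    `primer_semiloci` holds, for each primer in the set, the index of the
    Semilocus it was selected from (a hypothetical representation).
    """
    counts = Counter(primer_semiloci)
    return [counts.get(i, 0) for i in range(n_semiloci)]


def classify(coverage: list[int]) -> dict[str, bool]:
    """A set is incomplete if any coverage is 0 and redundant if any is >= 2."""
    return {
        "incomplete": any(c == 0 for c in coverage),
        "redundant": any(c >= 2 for c in coverage),
    }


coverage = semilocus_coverage([0, 0, 1, 3], n_semiloci=4)
print(coverage, classify(coverage))
```

In this example Semilocus 0 has two primers and Semilocus 2 has none, so the set is both redundant and incomplete at the same time.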
  • the term “Fitness” of a primer candidate is a numerical score indicating the intrinsic desirability of the primer candidate sequence, without consideration for potential interactions with other primer candidates or the existing primer set.
  • the Fitness is initially defined to be a fixed value (e.g. 100), and reduced subsequently based on the presence of undesirable features in a primer candidate.
  • undesirable features that penalize Fitness include but are not limited to (1) primer candidate sequence lengths that are longer than a threshold (e.g. 35 nt), (2) primer candidate sequence lengths that are shorter than a different threshold (e.g. 15 nt), (3) presence of homopolymers of at least a threshold length (e.g.
  • Fitness penalties include the predicted quantity and/or strength of nonspecific binding interactions to other regions of a background genome sequence.
  • a random variable (e.g. a normally distributed value with mean 0 and standard deviation 5) is added to each primer candidate’s Fitness to break ties and/or to provide stochasticity in the DIMPLE primer design process.
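A Fitness score of this kind might be sketched as follows in Python. The starting value of 100, the 35 nt and 15 nt length thresholds, and the Gaussian noise with standard deviation 5 follow the examples in the text; the penalty size (`penalty=10.0`) and the homopolymer threshold (`homopolymer_len=5`) are assumptions for illustration, since the text does not state them here.

```python
import random
import re


def fitness(seq, rng=None, noise_sd=5.0, base=100.0,
            max_len=35, min_len=15, homopolymer_len=5, penalty=10.0):
    """Start from a fixed value and subtract a penalty per undesirable feature.

    `penalty` and `homopolymer_len` are assumed values, not from the source.
    """
    score = base
    if len(seq) > max_len:
        score -= penalty
    if len(seq) < min_len:
        score -= penalty
    # A run of `homopolymer_len` or more identical letters.
    if re.search(r"(.)\1{%d,}" % (homopolymer_len - 1), seq):
        score -= penalty
    # Optional Gaussian noise for tie-breaking and stochasticity.
    if rng is not None and noise_sd > 0:
        score += rng.gauss(0.0, noise_sd)
    return score


print(fitness("ACGTACGTACGTACGTACGT"))   # no penalties, no noise
print(fitness("ACGTAAAAAACGTACGTACG"))   # homopolymer penalty applied
```

Passing a seeded `random.Random` instance as `rng` makes the added noise reproducible between runs.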
  • the term “Loss” for a primer candidate with a primer set is a numerical score indicating the likelihood of the primer candidate to form undesired interactions with the primer set.
  • Loss is initially defined to be a fixed value (e.g. 0), and increased subsequently based on the presence of sequence features in the primer candidate and primer set that could result in undesirable primer dimer formation.
  • the Loss is computed through a weighted sum of k-mer words that are reverse complementary between the primer candidate and any primer in the primer set. In some embodiments, only 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words are considered.
  • the weighting of the 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words differs depending on the position of these words in the primer candidate and in the corresponding primer in the primer set.
  • the Loss is computed through a hash table comprising the k-mer words in the primers of a primer set.
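The hash-table Loss computation described above can be sketched in Python with a plain dictionary as the Database. The per-k weights below are assumed values, and the position-dependent weighting mentioned in the text is deliberately left out to keep the example short; all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """All k-mer words of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_database(primers, ks=(6, 7, 8, 9, 10)):
    """Hash table mapping each k-mer word in the primer set to its count."""
    db = {}
    for primer in primers:
        for k in ks:
            for word in kmers(primer, k):
                db[word] = db.get(word, 0) + 1
    return db


def loss(candidate, db, weights):
    """Weighted count of candidate words whose reverse complement is in the set.

    `weights` maps k to a weight; the values used below are assumptions.
    """
    rc = reverse_complement(candidate)
    return sum(w * db.get(word, 0)
               for k, w in weights.items()
               for word in kmers(rc, k))


db = build_database(["GATTACAG"])
weights = {6: 1.0, 7: 2.0, 8: 4.0}
print(loss("CTGTAATC", db, weights))  # full reverse complement of the primer
print(loss("CCCCCCCC", db, weights))  # no complementary words
```

Because each lookup is a constant-time dictionary access, the cost of scoring one candidate depends on the candidate's length rather than on the size of the primer set.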
  • the term “Loss Weighting” refers to the relative weight of Loss vs. Fitness in the evaluation of primer candidates by the DIMPLE algorithm.
  • the Loss Weighting starts at a fixed value (e.g. 1) and is then reduced during the course of the DIMPLE algorithm.
  • the Loss Weighting is reduced when a large fraction of primer candidates evaluated have Loss Weighting multiplied by Loss exceeding the primer candidate’s Fitness. In some embodiments, the Loss Weighting is reduced based on the number of primers in the primer set and the number of context sequences, with larger reduction of Loss Weighting when the number of primers in the primer set divided by the number of context sequences is smaller.
  • the term “FL Value”, for a primer candidate, is calculated as the primer candidate’s Fitness minus the Loss Weighting multiplied by the primer candidate’s Loss.
  • DIMPLE evaluates the FL values of a number of primer candidates and selects the primer candidate with highest FL Value to be added to the primer set, assuming that the highest FL Value is positive. In some embodiments, if the FL Values are negative for all primer candidates evaluated, the Loss Weighting is reduced. In some embodiments, if the FL Values are negative for all primer candidates evaluated, the evaluated primer candidates are removed from consideration, and DIMPLE will select a new set of primer candidates to evaluate.
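The FL Value and the selection rule above amount to FL = Fitness - (Loss Weighting x Loss), with the best candidate accepted only if its FL Value is positive. A minimal Python sketch follows; the tuple layout `(name, fitness, loss)` and the function name `select_best` are assumptions for illustration.

```python
def select_best(candidates, loss_weighting):
    """Return (name, fl) of the highest-FL candidate, or (None, fl) if not positive.

    `candidates` is a list of (name, fitness, loss) tuples (hypothetical layout).
    """
    scored = [(fit - loss_weighting * loss, name) for name, fit, loss in candidates]
    best_fl, best_name = max(scored)
    if best_fl > 0:
        return best_name, best_fl
    return None, best_fl


pool = [("p1", 90.0, 120.0), ("p2", 80.0, 30.0)]
print(select_best(pool, loss_weighting=1.0))              # p2 wins with FL = 50
print(select_best([("p1", 50.0, 100.0)], 1.0))            # all FL negative: nothing added
print(select_best([("p1", 50.0, 100.0)], 0.25))           # reduced weighting: p1 accepted
```

The last two calls show the effect of one embodiment described above: when every FL Value is negative, reducing the Loss Weighting can make a previously rejected candidate acceptable.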
  • the term “Redundancy Penalty” refers to a penalty to the FL Value to encourage the DIMPLE algorithm to prioritize inclusion of primers into the primer set from Semilocus with low Semilocus coverage.
  • the Redundancy Penalty is implemented as dynamic updates to the Fitness of all remaining primer candidates to a specific Semilocus, after one primer candidate for a Semilocus is selected to be included into the primer set.
  • the Redundancy Penalty is implemented as a separate term in the FL value.
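For the embodiment in which the Redundancy Penalty is a separate term in the FL Value, a cumulative schedule such as "50, 50" (from the Figure 8 description) could be sketched as below. Treating coverage beyond the end of the schedule as incurring the full cumulative total is an assumption, as are the function names.

```python
def redundancy_penalty(coverage, schedule):
    """Cumulative penalty for adding another primer to a Semilocus.

    `coverage` is the number of primers already selected for that Semilocus.
    schedule=[50, 50] means 50 for the second primer and 100 for the third.
    Beyond the schedule's length the total simply stops growing (an assumption).
    """
    return sum(schedule[:coverage])


def fl_with_redundancy(fitness, loss, loss_weighting, coverage, schedule):
    """FL Value with the Redundancy Penalty as a separate subtracted term."""
    return fitness - loss_weighting * loss - redundancy_penalty(coverage, schedule)


for schedule in ([100], [50, 50], [0, 100]):
    print(schedule, [redundancy_penalty(c, schedule) for c in range(3)])
```

With the schedule `[0, 100]`, a second primer to an already covered Semilocus costs nothing, which matches the described behavior of valuing it the same as a first primer to another Semilocus.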
  • a “computer system” refers to a system of hardware, software, and data storage medium used to analyze information.
  • the minimum hardware of a patient computer-based system comprises a central processing unit (CPU), and hardware for data input, data output (e.g., display), and data storage.
  • CPU central processing unit
  • the data storage medium may comprise any manufacture comprising a recording of the present information as described above, or a memory access device that can access such a manufacture.
  • to “record” data programming or other information on a computer readable medium refers to a process for storing information, using any such methods as known in the art. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.
  • a “processor” or “computing means” references any hardware and/or software combination that will perform the functions required of it.
  • a suitable processor may be a programmable digital microprocessor such as available in the form of an electronic controller, mainframe, server or personal computer (desktop or portable).
  • suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based).
  • a magnetic medium or optical disk may carry the programming, and can be read by a suitable reader communicating with each processor at its corresponding station.
  • DNA extracted from biospecimen samples used for medical diagnostics or life sciences research often are low in volume and/or concentration (e.g. cell-free DNA, tissue biopsies), and must be amplified for downstream analysis.
  • the most common method for amplifying DNA from specific genes or loci of interest is the polymerase chain reaction (PCR), in which a pair of DNA primers is used to exponentially amplify a region of interest.
  • PCR primers are relatively easy to design for a single DNA template sequence, but designing highly multiplex PCR primers is extremely challenging due to the combinatorial number of possible unintended amplification reactions.
  • sensitivity refers to whether a genetic locus of interest is amplified by the corresponding PCR primers.
  • Specificity refers to whether amplification products from a primer set contain a large fraction of unintended amplicons, such as primer dimers and nonspecific genomic amplification.
  • a poorly designed multiplex PCR primer set may have low sensitivity and specificity because one or both primers for amplifying Locus 1 may be consumed through unproductive primer-dimer formation with other primers, or through unproductive nonspecific genomic amplification of other DNA sequences.
  • PCR primers are first used to preamplify the DNA loci of interest.
  • the outer PCR primer set is often designed with higher emphasis on sensitivity than specificity.
  • the products of the multiplexed PCR reaction using the outer PCR primers are then re-amplified using a set of inner PCR primers. Because nonspecific DNA amplification products and primer dimers are unlikely to have corresponding binding sites for the inner primers, this process greatly increases the sensitivity and specificity of the multiplex PCR amplification reaction.
  • nested PCR is more labor-intensive, as it typically involves multiple purification steps after the outer PCR amplification step to remove unreacted primers, pyrophosphate side products, and other chemicals. This labor-intensive protocol prevents nested PCR from being routinely used in qPCR settings, although it is used in NGS library preparation approaches.
  • DNA methylation patterns are increasingly being studied as disease biomarkers and predictive biomarkers for therapeutics.
  • breast cancer patients with BRCA1 promoter hypermethylation tend to benefit from PARP inhibitor drugs such as olaparib.
  • companies such as Grail use methylation patterns in cell-free DNA to perform early cancer detection screening tests, such as the Grail Galleri test.
  • Bisulfite conversion converts unmethylated C’s into uracils (U). These uracil bases are copied as thymine (T) bases during the PCR amplification process. In contrast, methylated C’s remain as methylated C’s, which are then copied as cytosine (C) bases during PCR.
  • enzymatic conversion methods, such as those based on the APOBEC enzyme, result in the same C to U conversion as bisulfite conversion.
  • sequences of the DNA molecules that result from bisulfite conversion treatment of the original DNA template have some properties that render them more difficult to design PCR primers for than typical genomic DNA templates.
  • Second, methylation states for CpG’s at the same locus are often heterogeneous, with a sample containing multiple cellular sources of DNA with different methylation patterns. Consequently, after conversion each CpG can become either CG or TG.
  • When a primer binding sequence has multiple CpG sites, exponentially many primer sequences may be needed.
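To make the exponential growth concrete: a primer binding sequence with n CpG sites, each of which can read as CG or TG after conversion, corresponds to 2^n concrete sequences. The Python sketch below enumerates them from a degenerate representation; the mapping covers only Y and R, and the function name `expand_degenerate` is hypothetical.

```python
from itertools import product

# Only the two degenerate letters used in this document are handled.
DEGENERATE = {"Y": "CT", "R": "AG"}


def expand_degenerate(primer):
    """List every concrete sequence represented by a degenerate primer."""
    choices = [DEGENERATE.get(base, base) for base in primer]
    return ["".join(combo) for combo in product(*choices)]


print(expand_degenerate("AYGY"))
print(len(expand_degenerate("YGYGYG")))  # three degenerate sites give 2**3 sequences
```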
  • the present disclosure provides a Degenerate Incomplete Multiplex Primer List Expansion (DIMPLE) method, its computer-implemented version, and the corresponding computer-readable medium and computer system, for designing massively multiplex PCR primer sets.
  • DIMPLE Degenerate Incomplete Multiplex Primer List Expansion
  • DIMPLE presents many advantages. For example, DIMPLE’s runtime is linear in the number of DNA templates to be designed, allowing it to scale to massive multiplexing. DIMPLE is applicable to designing PCR primers both for standard DNA templates and for post-conversion DNA templates with degenerate nucleotides. DIMPLE allows simultaneous design of multiple primers to each DNA template of interest, to facilitate experimental optimization and to enable nested PCR experimental methods. Finally, the inherent lack of template coverage completeness allows DIMPLE to avoid non-suitable templates that can result in primers with high experimental failure rates.
  • DIMPLE generates a multiplex primer set with a plexity between 100 and 200, between 200 and 300, between 300 and 500, between 500 and 750, between 750 and 1000, between 1000 and 2000, between 2000 and 3000, between 2000 and 4000, between 2000 and 5000, between 5000 and 7500, between 7500 and 10000, between 10000 and 15000, between 15000 and 20000, between 20000 and 25000, between 25000 and 30000.
  • DIMPLE Incomplete Multiplex Primer List Expansion
  • the caller script defines several key input variables and parameters, and then passes these into the main DIMPLE function.
  • the caller script defines the context sequences based on a reference genome sequence, a set of genomic coordinates of interest, and a set of amplicon length constraints.
  • the caller script defines the context sequences based on the plus (+) strand of the reference genome sequence, and replaces all cytosine (C) nucleotides not directly adjacent to the 5’ of a guanine (G) nucleotide with uracil (U) or thymine (T) nucleotides, and replaces all cytosine (C) nucleotides directly adjacent to the 5’ of a G nucleotide with a pyrimidine (Y) degenerate nucleotide.
  • C cytosine
  • the caller script defines the context sequences based on the minus (-) strand of the reference genome sequence, and replaces all cytosine (C) nucleotides not directly adjacent to the 5’ of a guanine (G) nucleotide with uracil (U) or thymine (T) nucleotides, and replaces all cytosine (C) nucleotides directly adjacent to the 5’ of a G nucleotide with a pyrimidine (Y) degenerate nucleotide.
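The post-conversion replacement rule above can be sketched in Python as a single pass over the strand. This version writes converted non-CpG cytosines as T (the text allows U or T) and preserves letter case so that an uppercase key sequence stays marked; the function name `convert_context` is hypothetical.

```python
def convert_context(seq):
    """Apply the C -> T / CpG -> YpG rule to one strand, read 5' to 3'.

    A C immediately 5' of a G becomes the degenerate Y; every other C
    becomes T. Letter case is preserved.
    """
    out = []
    for i, base in enumerate(seq):
        if base in "Cc":
            next_base = seq[i + 1:i + 2]          # empty string at the 3' end
            new = "Y" if next_base in ("G", "g") else "T"
            out.append(new if base.isupper() else new.lower())
        else:
            out.append(base)
    return "".join(out)


print(convert_context("ACGTCCA"))
print(convert_context("acgTCga"))
```

For the minus-strand embodiment, the same function would be applied to the reverse complement of the plus-strand sequence.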
  • a primer set designed herein provides substantially complete coverage of intended context sequences.
  • the caller script defines a Fitness Noise parameter.
  • the Fitness Noise parameter is the standard deviation of a Gaussian distributed random variable with mean 0 that is added to the calculated Fitness of each primer candidate.
  • the Fitness Noise parameter is the maximum absolute value of a uniformly distributed variable with mean 0 that is added to the calculated Fitness of each primer candidate.
  • the caller script defines a dynamic Loss Weighting ruleset.
  • the Loss Weighting is never changed through the course of a DIMPLE design run.
  • the Loss Weighting is divided by a fixed value each time the Loss Weighting adjustment is triggered.
  • the Loss Weighting is divided by a value inversely proportional to the number of primers in the primer set when the Loss Weighting adjustment is triggered.
  • the Loss Weighting is divided by a value proportional to the number of context sequences when the Loss Weighting adjustment is triggered.
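One way to picture a Loss Weighting ruleset is the combination of the trigger described earlier (a large fraction of evaluated candidates having Loss Weighting x Loss above their Fitness) with the fixed-divisor embodiment. In this Python sketch the trigger fraction of 0.5 and the divisor of 2.0 are assumed values, and the function name is hypothetical.

```python
def maybe_reduce_loss_weighting(loss_weighting, candidates,
                                trigger_fraction=0.5, divisor=2.0):
    """Divide the Loss Weighting by a fixed value when the trigger fires.

    `candidates` is a list of (fitness, loss) pairs for the evaluated primer
    candidates. `trigger_fraction` and `divisor` are assumptions.
    """
    if not candidates:
        return loss_weighting
    dominated = sum(1 for fit, loss in candidates if loss_weighting * loss > fit)
    if dominated / len(candidates) >= trigger_fraction:
        return loss_weighting / divisor
    return loss_weighting


print(maybe_reduce_loss_weighting(1.0, [(100, 150), (90, 200), (80, 10)]))
print(maybe_reduce_loss_weighting(1.0, [(100, 10), (90, 20)]))
```

The other embodiments listed above would replace the fixed `divisor` with a value computed from the number of primers in the primer set or the number of context sequences.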
  • the caller script defines a set of pre-existing primer sequences to be included initially in the primer set.
  • the set of pre-existing primer sequences are for primers intended to be present in the final multiplex PCR panel.
  • the set of pre-existing primer sequences are a set of virtual sequences used to guide the sequence design of primer set, wherein the pre-existing primer sequences are not intended to physically exist as synthetic oligonucleotides in the final multiplex PCR panel.
  • the set of pre-existing primer sequences is an empty set.
  • the caller script calls the main DIMPLE function multiple times on the same set of context sequences using different random number generator seed values, in order to generate multiple alternative primer sets.
  • the multiple alternative primer sets designed by DIMPLE are evaluated by a different algorithm to select a final primer set for wetlab optimization or usage.
  • An illustration of one embodiment of the DIMPLE main function implementation is shown in Fig. 1. Steps 1 through 5 of the Main DIMPLE function are called once, and the remaining steps are repeated many times until the primer set design is complete.
  • the numbers in brackets indicate example values of parameters that can affect the performance of DIMPLE. In some embodiments, adjusting these parameter values may cause the DIMPLE algorithm to produce better primer sets (with higher aggregate coverage) at the cost of a significantly longer runtime. For example, increasing the number of Semiloci considered and increasing the number of primer candidates considered for each Semilocus potentially increases the performance of DIMPLE at the cost of slower runtime.
  • the primer candidate generation function takes as input one or more Semilocus sequences or context sequences and a set of primer candidate design parameters, and returns as output a set of primer candidates for each Semilocus.
  • the primer candidate design parameters may comprise the intended PCR anneal cycle temperature, the intended PCR reaction effective monovalent cation salinity, the maximum allowed length of the primer candidate, and the maximum (least negative) allowable standard free energy (ΔG°) of hybridization between a primer candidate and its reverse complement.
  • the design parameters may have default values of approximately 60°C for PCR anneal cycle temperature, approximately 0.18 M for PCR effective monovalent cation concentration, approximately 40 nucleotides for maximum primer candidate length, and approximately -12.0 kcal/mol for the maximum ΔG° of hybridization to reverse complement.
  • the primer candidate design parameters further comprise a minimum (most negative) allowable standard free energy (ΔG°) of hybridization between a primer candidate and its reverse complement; in some embodiments this minimum ΔG° has a value of approximately -17 kcal/mol.
  • the primer candidate generation function begins by considering every allowable position in the Semilocus as a 5’-most nucleotide, and iteratively adds consecutive nucleotides from the Semilocus sequence until the growing sequence has a ΔG° of hybridization to its reverse complement sequence that satisfies the maximum allowable ΔG°, or the growing sequence exceeds the maximum allowable primer candidate length.
  • the primer candidate generation function evaluates and produces multiple primer candidates with the same 5’-most nucleotide position on each Semilocus; in other embodiments, the primer candidate generation function allows a maximum of 1 primer candidate per 5’-most nucleotide position on each Semilocus.
  • the primer candidate generation function handles degenerate nucleotides such as Y for pyrimidines (representing C or T) and R for purines (representing G or A).
  • the calculation of ΔG° of primer candidate hybridization to its perfect complement for primer candidates with degenerate nucleotides assumes the weakest possible binding, with Y nucleotides assumed to be T and R nucleotides assumed to be A.
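The candidate-growing procedure and the weakest-binding treatment of degenerate nucleotides can be sketched as below. This is a sketch under loud assumptions: `TOY_DG` is a placeholder per-nucleotide energy table invented for illustration and is not a thermodynamic model (a real implementation would compute ΔG° from temperature and salinity), and the function names are hypothetical. Only the thresholds (-12.0 kcal/mol, 40 nt) and the Y→T, R→A substitution come from the text.

```python
from typing import List, Optional

# ASSUMPTION: placeholder per-nucleotide energies (kcal/mol), not a real model.
TOY_DG = {"A": -0.5, "T": -0.5, "G": -1.0, "C": -1.0}
# Degenerate nucleotides are scored as their weakest-binding option.
WEAKEST = {"Y": "T", "R": "A"}


def toy_delta_g(seq: str) -> float:
    """Toy hybridization energy of seq to its perfect complement."""
    return sum(TOY_DG[WEAKEST.get(base, base)] for base in seq)


def grow_candidate(
    semilocus: str, start: int, max_dg: float = -12.0, max_len: int = 40
) -> Optional[str]:
    """Extend from a fixed 5'-most position until the energy threshold is met.

    Returns None if the threshold is not reached within max_len nucleotides
    or before the Semilocus ends.
    """
    last_end = min(len(semilocus), start + max_len)
    for end in range(start + 1, last_end + 1):
        candidate = semilocus[start:end]
        if toy_delta_g(candidate) <= max_dg:
            return candidate
    return None


def generate_candidates(semilocus: str, max_dg: float = -12.0, max_len: int = 40) -> List[str]:
    """At most one candidate per 5'-most position (one of the described variants)."""
    candidates = []
    for start in range(len(semilocus)):
        candidate = grow_candidate(semilocus, start, max_dg, max_len)
        if candidate is not None:
            candidates.append(candidate)
    return candidates
```

This sketch implements the variant that allows one candidate per 5'-most position; the minimum (most negative) ΔG° bound and the reference-genome filter are omitted.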
  • the primer candidate generation function inputs further comprise a reference genome sequence or description (e.g. "human GRCh38").
  • primer candidates are aligned to the reference genome, and the primer candidates with many perfect or near-perfect hits on the reference genome are removed from consideration. In some embodiments, primer candidates with more than approximately 100 perfect or near-perfect (>90% homology) hits on the reference genome are removed from consideration.
  • the Fitness evaluation function takes as input one or more primer candidate sequence(s) and a set of Fitness evaluation parameters, and returns as output the Fitness score for each primer candidate.
  • the Fitness evaluation parameters include the default Fitness of a perfect primer candidate with no penalties, the number of nucleotides from the 3' end below which degenerate nucleotides are penalized, the minimum length of homopolymer repeats above which homopolymers are penalized, the minimum length of primers below which short primer lengths are penalized, the maximum length of primers above which primer lengths are penalized, and the minimum primer candidate self-complementarity above which self-complementarity is penalized.
  • the default values of the Fitness evaluation parameters are: approximately 100 Fitness as the default value before penalties, approximately 6 nucleotides as the minimum from the 3’ end below which degenerate nucleotides are penalized, approximately 6 nucleotides as the minimum number of homopolymer nucleotides above which homopolymers are penalized, approximately 15 nucleotides as the minimum length primer below which primer length is penalized, approximately 35 nucleotides as the maximum length primer above which primer length is penalized.
  • the Fitness evaluation function begins by setting the Fitness of each primer candidate as the default Fitness (e.g. 100), and runs a number of tests on each primer candidate sequence to check for the presence of penalizing features.
  • each penalizing feature is independent and separately reduces the Fitness of the primer candidate by a certain amount.
  • some of the penalizing features are exponential in the Fitness penalty relative to the property of the undesirable feature.
  • Fitness penalties for short primers below a threshold length (e.g. 15 nt) may be exponentially penalized as $A^{(15-\mathrm{Length})}$, where A is a constant and Length is the length of the primer candidate.
  • This length exponential Fitness penalty could be justified in that shorter primer candidates are exponentially more likely to have nonspecific alignment and binding to unintended DNA subsequences of a background genome.
  • Fitness penalties for degenerate (e.g. Y or R) nucleotides in the primer candidate may be exponentially penalized as $A^{\mathrm{NumDegen}}$, where A>1 is a constant and NumDegen is the number of degenerate nucleotides.
  • for example, a primer candidate with sequence "...YGYGA-3'" will be synthesized as a mixture of 4 different sequences "...TGTGA-3'", "...CGTGA-3'", "...TGCGA-3'", and "...CGCGA-3'", each with approximately 25% of the total concentration of the primer.
  • Fitness penalties for self-complementarity of subsequences in the primer candidate may be exponentially penalized as $A^{(\mathrm{numGC} \cdot B + \mathrm{numAT})}$, where A>1 and B>1 are constants, numGC is the number of G or C nucleotides in the continuous self-complementary subsequence, and numAT is the number of A or T nucleotides in the continuous self-complementary subsequence.
  • some of the penalizing features are linear in the Fitness penalty relative to the property of the undesirable feature.
  • Fitness penalties for long primers above a threshold length (e.g. 35 nt) may be linearly penalized as $A \cdot (\mathrm{Length} - 35)$, where A is a constant and Length is the length of the primer candidate.
  • the linear Fitness penalty for long primers could be justified based on increased costs for primer synthesis and purification based on current oligonucleotide synthesis and purification chemistry and economics.
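The Fitness evaluation described above can be illustrated with a short sketch covering three of the penalties: short length (exponential), degenerate nucleotides (exponential), and long length (linear). The constants `a_short`, `a_degen`, and `a_long` are assumed values chosen only for illustration, since the text does not give them. The sketch also assumes a penalty applies only when the feature is present and counts all degenerate nucleotides. The homopolymer and self-complementarity penalties are omitted.

```python
DEGENERATE = set("YR")


def fitness(
    primer: str,
    default: float = 100.0,   # default Fitness before penalties (from the text)
    min_len: int = 15,        # short-primer threshold (from the text)
    max_len: int = 35,        # long-primer threshold (from the text)
    a_short: float = 2.0,     # ASSUMPTION: base of the short-length penalty
    a_degen: float = 2.0,     # ASSUMPTION: base of the degenerate penalty
    a_long: float = 1.0,      # ASSUMPTION: slope of the long-length penalty
) -> float:
    """Start from the default Fitness and subtract each independent penalty."""
    score = default
    length = len(primer)

    # Exponential penalty for short primers: A^(min_len - Length).
    if length < min_len:
        score -= a_short ** (min_len - length)

    # Exponential penalty for degenerate nucleotides: A^NumDegen.
    num_degen = sum(1 for base in primer if base in DEGENERATE)
    if num_degen > 0:
        score -= a_degen ** num_degen

    # Linear penalty for long primers: A * (Length - max_len).
    if length > max_len:
        score -= a_long * (length - max_len)

    return score
```

With these assumed constants, a 12-nt primer loses 2^3 = 8 Fitness points, whereas a 40-nt primer loses only 5, which reflects the exponential versus linear treatment in the text.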
  • the primer set Database update function takes as input the sequences of one or more newly selected primer candidates for inclusion into the primer set.
  • the primer set Database update function also takes as input a data structure or a pointer to a data structure representing the presence, quantity, and 3'-weighting of k-mer words present in the primer set before the inclusion of the newly selected primer candidate(s) (the Database), and produces as output a data structure or a pointer to a data structure representing the presence, quantity, and 3'-weighting of k-mer words present in the primer set after the inclusion of the newly selected primer candidates (the updated Database).
  • the primer set Database update function updates a global data structure representing the presence, quantity, and 3'-weighting of k-mer words present in the primer set.
  • the data structure described is a hash table.
  • the data structure described is a suffix tree.
  • k-mer words (subsequences) present in the primer set comprise 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words.
  • the Database contains a numerical score for the number of position-weighted occurrences of k-mer words, and each score is initialized as 0 for an empty primer set.
  • for all continuous subsequences (words) of the primer candidate of the desired lengths (e.g. 6 nt, 7 nt, 8 nt, 9 nt, and 10 nt), each word updates its corresponding entry in the Database by an update value.
  • the update value is a constant number that is identical for all words.
  • the update value is dependent on the number of G or C nucleotides in the word.
  • the update value is dependent on the position of the word in the primer candidate, with higher update values assigned to words closer to the 3' end of the primer candidate. In some embodiments, the dependence of the update value on word position is specified by the user, depending on whether the intended DNA polymerase has 3'→5' proofreading exonuclease activity. In some embodiments, the intended DNA polymerase lacks 3'→5' proofreading exonuclease activity, and update values of words not within A nucleotides of the 3' end are 0, where A is a constant with a value of 0, 1, or 2. In some embodiments, the intended DNA polymerase possesses 3'→5' proofreading exonuclease activity, and update values of all words are nonzero.
  • the primer candidate input comprises one or more degenerate nucleotides, such as Y (representing T or C) or R (representing A or G).
  • the k-mer words of the primer candidate that contain degenerate nucleotides are enumerated as all possible non-degenerate k-mer words consistent with the k-mer with degenerate nucleotides, and all non-degenerate k-mer words are used to update the Database.
  • each k-mer word of the primer candidate that contains degenerate nucleotides is converted into two k-mer words comprising non-degenerate nucleotides, one corresponding to fully-methylated (strong) nucleotides and the other corresponding to fully-unmethylated (weak) nucleotides, and these two k-mers are used to update the Database.
  • for example, the 6-mer word "TYGYGA" would be converted into two 6-mer words "TCGCGA" and "TTGTGA".
  • k-mer words of the primer candidate that contain degenerate nucleotides are ignored and not used to update the Database.
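Two of the degenerate-word strategies above (full enumeration and strong/weak conversion) can be sketched as follows. The function names are hypothetical. The mapping of R to G (strong) and A (weak) is an assumption that mirrors the Y example in the text and the weakest-binding convention stated earlier.

```python
from itertools import product
from typing import List, Tuple

# Y represents C or T; R represents A or G.
DEGENERATE = {"Y": "CT", "R": "AG"}


def expand_all(word: str) -> List[str]:
    """Enumerate every non-degenerate word consistent with a degenerate word."""
    choices = [DEGENERATE.get(base, base) for base in word]
    return ["".join(combo) for combo in product(*choices)]


def strong_weak(word: str) -> Tuple[str, str]:
    """Convert a degenerate word into its strong and weak forms.

    Strong: Y -> C, R -> G.  Weak: Y -> T, R -> A.
    """
    strong = word.replace("Y", "C").replace("R", "G")
    weak = word.replace("Y", "T").replace("R", "A")
    return strong, weak
```

Full enumeration grows as 2^n in the number of degenerate positions, whereas the strong/weak conversion always yields exactly two words, which may be why both options are described.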
  • the user may specify a non-zero update value multiplier for the reverse complements of k-mer words, in which case the reverse complements of the k-mer words present as subsequences in the primer candidate are also used to update the Database.
  • the update value multiplier for reverse complements is approximately 0.1.
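A hash-table version of the Database update can be sketched as below, assuming a non-degenerate primer and a constant update value. The function name and the `max_3p_offset` parameter are hypothetical. `max_3p_offset` models the embodiment in which words farther than a set number of nucleotides from the 3' end receive an update value of 0.

```python
from collections import defaultdict
from typing import DefaultDict, Optional, Sequence

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def update_database(
    db: DefaultDict[str, float],
    primer: str,
    k_values: Sequence[int] = (6, 7, 8, 9, 10),
    update_value: float = 1.0,
    rc_multiplier: float = 0.1,
    max_3p_offset: Optional[int] = None,
) -> None:
    """Add every k-mer word of primer (and optionally its reverse complement) to db."""
    n = len(primer)
    for k in k_values:
        for start in range(n - k + 1):
            # Distance between the word's 3' end and the primer's 3' end.
            offset = n - (start + k)
            if max_3p_offset is not None and offset > max_3p_offset:
                continue  # word too far from the 3' end: update value is 0
            word = primer[start:start + k]
            db[word] += update_value
            if rc_multiplier:
                db[reverse_complement(word)] += update_value * rc_multiplier
```

A typical call starts from an empty table, `db = defaultdict(float)`, which matches the statement that every score is initialized to 0 for an empty primer set.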
  • the primer candidate Loss evaluation function takes as input one or more primer candidate sequences, and returns the calculated Loss value for each primer candidate against the current primer set.
  • the primer candidate Loss evaluation function also takes as input a data structure or a pointer to a data structure representing the presence, quantity, and 3'-weighting of k-mer words present in the primer set before the inclusion of the newly selected primer candidate(s) (the Database).
  • the primer candidate Loss evaluation function accesses a global data structure representing the presence, quantity, and 3'-weighting of k-mer words present in the primer set.
  • the primer candidate Loss evaluation function calculates the Loss score for a primer candidate against the existing primer set by generating the reverse complements of all k-mer word subsequences of the primer candidate, looking up these k-mer words in the Database, and calculating the weighted sum of the Database scores, each multiplied by a weight constant based on the value of k.
  • the Loss due to a k-mer in the Database may be calculated as $2^{(k-6)} \cdot 3^{\mathrm{numGC}} \cdot \mathrm{EntryValue}$, where k is the length of the k-mer word, numGC is the number of G or C nucleotides in the word, and EntryValue is the score of the word in the Database.
  • Database score values for longer words yield exponentially larger contributions to Loss.
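The Loss calculation can be sketched as follows, using the per-word formula 2^(k-6) * 3^numGC * EntryValue from the text. The function names are hypothetical, and the Database is assumed to be a plain dictionary from k-mer word to score. A non-degenerate primer candidate is also assumed.

```python
from typing import Mapping, Sequence

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def kmer_loss(word: str, entry_value: float) -> float:
    """Loss contribution of one Database word: 2^(k-6) * 3^numGC * EntryValue."""
    k = len(word)
    num_gc = word.count("G") + word.count("C")
    return 2 ** (k - 6) * 3 ** num_gc * entry_value


def primer_loss(
    db: Mapping[str, float],
    primer: str,
    k_values: Sequence[int] = (6, 7, 8, 9, 10),
) -> float:
    """Sum the Loss over the reverse complements of all k-mer words of primer."""
    total = 0.0
    for k in k_values:
        for start in range(len(primer) - k + 1):
            word = reverse_complement(primer[start:start + k])
            entry = db.get(word, 0.0)
            if entry:
                total += kmer_loss(word, entry)
    return total
```

As a worked instance, a Database entry of 2.0 for the 6-mer "ACGTAC" (three G or C nucleotides) contributes 2^0 * 3^3 * 2.0 = 54 to the Loss of any candidate containing its reverse complement "GTACGT".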
  • the subject matter described herein can include a digital processing device or use of the same.
  • the digital processing device can include one or more hardware central processing units (CPU), graphics processing units (GPU), or tensor processing units (TPU) that perform the device’s functions.
  • the digital processing device can optionally be connected to a computer network. In some examples, the digital processing device may be optionally connected to the Internet. In some examples, the digital processing device may be optionally connected to a cloud computing infrastructure. In some examples, the digital processing device may be optionally connected to an intranet. In some examples, the digital processing device may be optionally connected to a data storage device.
  • Non-limiting examples of suitable digital processing devices include server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, handheld computers, Internet appliances, mobile smartphones, and tablet computers.
  • Suitable tablet computers can include, for example, those with booklet, slate, and convertible configurations.
  • the digital processing device can include an operating system configured to perform executable instructions.
  • the operating system can include software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
  • Non-limiting examples of operating systems include Ubuntu, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
  • Non-limiting examples of suitable personal computer operating systems include Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
  • the operating system may be provided by cloud computing, and cloud computing resources may be provided by one or more service providers.
  • the device can include a storage and/or memory device.
  • the storage and/or memory device may be one or more physical apparatuses used to store data or programs on a temporary or permanent basis.
  • the device may be volatile memory and require power to maintain stored information.
  • the device may be non-volatile memory and retain stored information when the digital processing device is not powered.
  • the non-volatile memory can include flash memory.
  • the volatile memory can include dynamic random-access memory (DRAM).
  • the non-volatile memory can include ferroelectric random access memory (FRAM).
  • the non-volatile memory can include phase-change random access memory (PRAM).
  • the device may be a storage device including, for example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage.
  • the storage and/or memory device may be a combination of devices such as those disclosed herein.
  • the digital processing device can include a display to send visual information to a user.
  • the display may be a cathode ray tube (CRT).
  • the display may be a liquid crystal display (LCD).
  • the display may be a thin film transistor liquid crystal display (TFT-LCD).
  • the display may be an organic light emitting diode (OLED) display.
  • an OLED display may be a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display.
  • the display may be a plasma display.
  • the display may be a video projector.
  • the display may be a combination of devices such as those disclosed herein.
  • the digital processing device can include an input device to receive information from a user.
  • the input device may be a keyboard.
  • the input device may be a pointing device including, for example, a mouse, trackball, track pad, joystick, game controller, or stylus.
  • the input device may be a touch screen or a multi-touch screen.
  • the input device may be a microphone to capture voice or other sound input.
  • the input device may be a video camera to capture motion or visual input.
  • the input device may be a combination of devices such as those disclosed herein.
  • the subject matter disclosed herein can include one or more non-transitory computer-readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device.
  • a computer-readable storage medium may be a tangible component of a digital processing device.
  • a computer-readable storage medium may be optionally removable from a digital processing device.
  • a computer-readable storage medium can include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like.
  • the program and instructions may be permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
  • Figure 10 shows a computer system 101 that is programmed or otherwise configured to store, process, identify, or interpret patient data, biological data, biological sequences, and reference sequences.
  • the computer system 101 can process various aspects of patient data, biological data, biological sequences, or reference sequences of the present disclosure.
  • the computer system 101 may be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
  • the electronic device may be a mobile electronic device.
  • the computer system 101 comprises a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which may be a single core or multi core processor, or a plurality of processors for parallel processing.
  • the computer system 101 also comprises memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and/or electronic display adapters.
  • the memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard.
  • the storage unit 115 may be a data storage unit (or data repository) for storing data.
  • the computer system 101 may be operatively coupled to a computer network (“network”) 130 with the aid of the communication interface 120.
  • the network 130 may be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
  • the network 130 in some examples is a telecommunication and/or data network.
  • the network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
  • the network 130, in some examples with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.
  • the CPU 105 can execute a sequence of machine-readable instructions, which may be embodied in a program or software.
  • the instructions may be stored in a memory location, such as the memory 110.
  • the instructions may be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.
  • the CPU 105 may be part of a circuit, such as an integrated circuit.
  • One or more other components of the system 101 may be included in the circuit.
  • the circuit is an application specific integrated circuit (ASIC).
  • the storage unit 115 can store files, such as drivers, libraries and saved programs.
  • the storage unit 115 can store user data, e.g., user preferences and user programs.
  • the computer system 101 in some examples can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.
  • the computer system 101 can communicate with one or more remote computer systems through the network 130.
  • the computer system 101 can communicate with a remote computer system of a user.
  • Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
  • the user can access the computer system 101 via the network 130.
  • Methods as described herein may be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115.
  • the machine-executable or machine-readable code may be provided in the form of software.
  • the code may be executed by the processor 105.
  • the code may be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105.
  • the electronic storage unit 115 may be precluded, and machine-executable instructions are stored on memory 110.
  • the code may be pre-compiled and configured for use with a machine having a processor adapted to execute the code or may be interpreted or compiled during runtime.
  • the code may be supplied in a programming language that may be selected to enable the code to execute in a precompiled, interpreted, or as-compiled fashion.
  • aspects of the systems and methods provided herein may be embodied in programming.
  • Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
  • Machine-executable code may be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
  • “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
  • another type of media that may bear the software elements comprises optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • a machine-readable medium, such as one bearing computer-executable code, may take many forms, including a tangible storage medium, a carrier-wave medium, or a physical transmission medium.
  • Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
  • Volatile storage media include dynamic memory, such as main memory of such a computer platform.
  • Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
  • Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
  • Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
  • the computer system 101 can include or be in communication with an electronic display 135 that comprises a user interface (UI) 140 for providing, for example, a nucleic acid sequence, an enriched nucleic acid sample, a methylation profile, an expression profile, and an analysis of a methylation or expression profile.
  • Examples of UI’s include, without limitation, a graphical user interface (GUI) and web-based user interface.
  • Methods and systems of the present disclosure may be implemented by way of one or more algorithms.
  • An algorithm may be implemented by way of software upon execution by the central processing unit 105.
  • the algorithm can, for example, store, process, identify, or interpret patient data, biological data, biological sequences, and reference sequences.
  • the subject matter disclosed herein can include at least one computer program or use of the same.
  • a computer program can include a sequence of instructions, executable in the digital processing device's CPU, GPU, or TPU, written to perform a specified task.
  • Computer-readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types.
  • a computer program may be written in various versions of various languages.
  • a computer program can include one sequence of instructions.
  • a computer program can include a plurality of sequences of instructions.
  • a computer program may be provided from one location.
  • a computer program may be provided from a plurality of locations.
  • a computer program can include one or more software modules.
  • a computer program can include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plugins, extensions, add-ins, or add-ons, or combinations thereof.
  • the computer processing may be a method of statistics, mathematics, biology, or any combination thereof.
  • the computer processing method comprises a dimension reduction method including, for example, logistic regression, principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, tree-based methods, random forest, gradient boost tree, matrix factorization, network clustering, and neural network.
  • the computer processing method is a supervised machine learning method including, for example, a regression, support vector machine, tree-based method, and network.
  • the computer processing method is an unsupervised machine learning method including, for example, clustering, network, principal component analysis, and matrix factorization.
  • the subject matter disclosed herein can include one or more databases, or use of the same to store patient data, biological data, biological sequences, or reference sequences. Reference sequences may be derived from a database.
  • suitable databases can include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases.
  • a database may be internet-based.
  • a database may be web-based.
  • a database may be cloud computing-based.
  • a database may be based on one or more local computer storage devices.
  • the present disclosure provides a non-transitory computer-readable medium comprising instructions that direct a processor to perform a method disclosed herein.
  • the present disclosure provides a computing device comprising the computer-readable medium.
  • a list of 1,000 targets comprising 23,336 mutations linked to pancreatic, non-small cell lung, ovarian, and breast cancer onset was identified from the AACR PROJECT GENIE database (The AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine Through An International Consortium, Cancer Discov. 2017 Aug;7(8):818-831).
  • the targets were selected to be at most 70 bp long and not to overlap one another within a 150 bp buffer on the 3' and 5' sides.
  • for NGS testing, the complete panel, consisting of 2,000 primers in total, was used at an individual concentration of 7.5 nM in a reaction containing iTaq (BioRad) and either 10 ng human genomic DNA or a no-template control (NTC).
  • the cycling conditions were as follows: 95°C for 5 minutes, followed by 50 cycles of 95°C for 30 seconds and 60°C for 2 minutes (Figure 10B).
  • SPRI purification with 1.8x Agencourt beads (Beckman Coulter)
  • the PCR product was used with the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (NEB) according to the manufacturer's instructions. Briefly, following an end-repair step, adapters were ligated.
  • the libraries were diluted 1:100 and amplified with TruSeq sequencing primers (Figure 10C). Finally, after a final purification, the library was QC checked on the Bioanalyzer (Agilent) and sequenced.
  • Fusion genes are typically identified by chromosomal coordinates with a resolution at the gene level. However, the exact fusion junction can vary across different subjects. Detection mechanisms such as FISH or even NGS, while capable of identifying alternative constructs, lack the sensitivity to detect low-frequency variants (below 1%) with high confidence. Additionally, fusions such as NTRK and FGR, despite being actionable (with FDA-approved treatments such as Larotrectinib (Loxo) and Entrectinib (Genentech) for all cancers with NTRK fusions), have only a 1% incidence. This makes the cost of NGS hard to justify from a cost-per-positive standpoint.
  • a highly multiplexed qPCR assay designed with DIMPLE enables the screening and monitoring of hundreds of fusion breakpoints simultaneously, with a limit of detection expected to be well below 1%.
  • DIMPLE was used to design a 274-primer assay covering major fusions of KMT2A and NUP98, focusing on the top 12 fusion partner genes for KMT2A (over 95% of actionable KMT2A fusions in leukemia) and 3 partner genes for NUP98, covering 90% of the actionable NUP98 fusions.
  • the assay is designed so that the forward primer targets the gene of interest across all relevant exons, and the reverse primer targets all exons in the most frequent gene partners in RNA samples, covering up to ~3,400 constructs (Figure 11A).
  • the assay was used to test RNA extracted from the cell line RS4;11 (CRL-1873) bearing the KMT2A::AFF1 fusion, serially diluted in RNA from leukocytes from healthy patients.
  • the assay was able to detect the KMT2A::AFF1 fusion at dilutions down to 0.01%.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed are methods, systems, and non-transitory computer-readable media for efficiently designing massively multiplexed PCR primer sets via Degenerate Incomplete Multiplex Primer List Expansion (DIMPLE). DIMPLE has a shorter runtime, which allows it to scale to massive multiplexing. The invention is applicable to PCR primer design for both standard DNA templates and post-conversion DNA templates with degenerate nucleotides. The invention allows the simultaneous design of multiple primers for each DNA template of interest, to facilitate experimental optimization and to enable nested PCR. Finally, DIMPLE can avoid unsuitable templates that may lead to primers with high experimental failure rates.
PCT/US2024/038762 2023-07-20 2024-07-19 Systems and methods for primer design by degenerate incomplete multiplex primer list expansion (DIMPLE) Pending WO2025019779A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363514659P 2023-07-20 2023-07-20
US63/514,659 2023-07-20

Publications (1)

Publication Number Publication Date
WO2025019779A1 (fr) 2025-01-23

Family

ID=94282704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/038762 Pending WO2025019779A1 (fr) 2023-07-20 2024-07-19 Systems and methods for primer design by degenerate incomplete multiplex primer list expansion (DIMPLE)

Country Status (1)

Country Link
WO (1) WO2025019779A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140094373A1 (en) * 2010-05-18 2014-04-03 Natera, Inc. Highly multiplex pcr methods and compositions
US20200080136A1 (en) * 2016-09-22 2020-03-12 William Marsh Rice University Molecular hybridization probes for complex sequence capture and analysis
US20230220456A1 (en) * 2020-05-01 2023-07-13 William Marsh Rice University Quantitative blocker displacement amplification (qbda) sequencing for calibration-free and multiplexed variant allele frequency quantitation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PANDEY RAM VINAY, PULVERER WALTER, KALLMEYER RAINER, BEIKIRCHER GABRIEL, PABINGER STEPHAN, KRIEGNER ALBERT, WEINHÄUSEL ANDREAS: "MSRE-HTPrimer: a high-throughput and genome-wide primer design pipeline optimized for epigenetic research", CLINICAL EPIGENETICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 8, no. 26, 1 December 2016 (2016-12-01), London, UK, XP055952868, ISSN: 1868-7075, DOI: 10.1186/s13148-016-0190-9 *
XIE NINA G., WANG MICHAEL X., SONG PING, MAO SHIQI, WANG YIFAN, YANG YUXIA, LUO JUNFENG, REN SHENGXIANG, ZHANG DAVID YU: "Designing highly multiplex PCR primer sets with Simulated Annealing Design using Dimer Likelihood Estimation (SADDLE)", NATURE COMMUNICATIONS, vol. 13, no. 1, 11 April 2022 (2022-04-11), XP055960639, DOI: 10.1038/s41467-022-29500-4 *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24844021

Country of ref document: EP

Kind code of ref document: A1