WO2025019779A1 - Systems and methods for primer design via degenerate incomplete multiplex primer list expansion (dimple)

Info

Publication number: WO2025019779A1
Application number: PCT/US2024/038762
Authority: WO
Inventors: David Zhang
Original assignee: Pupil Bio Inc
Current assignee: Pupil Bio Inc
Priority date: 2023-07-20
Filing date: 2024-07-19
Publication date: 2025-01-23
Anticipated expiration: 2026-01-20

Abstract

Methods, systems, and non-transitory computer readable media are disclosed for efficiently designing massively multiplex PCR primer sets via Degenerate Incomplete Multiplex Primer List Expansion (DIMPLE). DIMPLE has shorter runtime, allowing it to scale to massive multiplexing. DIMPLE is applicable to designing PCR primers both for standard DNA templates and for post-conversion DNA templates with degenerate nucleotides. DIMPLE allows simultaneous design of multiple primers to each DNA template of interest, to facilitate experimental optimization and to enable nested PCR. Finally, DIMPLE can avoid unsuitable templates that can result in primers with high experimental failure rates.

Description

SYSTEMS AND METHODS FOR PRIMER DESIGN VIA DEGENERATE INCOMPLETE MULTIPLEX PRIMER LIST EXPANSION (DIMPLE)

CROSS-REFERENCE TO RELATED APPLICATIONS

[001] This application claims the benefit of U.S. Provisional Application No. 63/514,659, filed July 20, 2023, which is hereby incorporated by reference in its entirety.

FIELD

[002] This application relates generally to methods, algorithms, computer programs, computer-readable medium, and computer systems for designing massively multiplex PCR primer sets.

BACKGROUND

[003] The design of highly multiplex PCR primer sets is challenging because there are quadratically many unintended pairwise interactions in a set of N primers. Multiplex PCR primer design is even more challenging when the DNA template sequence to be amplified is the post-conversion product of a bisulfite conversion reaction, for two reasons: (1) there are C nucleotides in a CpG dinucleotide sequence in the pre-converted template that may be either a C or a U in the post-converted template, and (2) there are significantly fewer C nucleotides in the post-converted template, resulting in weaker DNA binding and more repetitive DNA sequences. Here, a Degenerate Incomplete Multiplex Primer List Expansion (DIMPLE) method, its computer-implemented algorithm, the corresponding computer-readable medium and computer system are described for designing massively multiplex PCR primer sets.

SUMMARY

[004] In an aspect, this disclosure provides one or more embodiments of systems, methods, and non-transitory computer readable storage media that provide benefits and/or solve one or more of the above problems in the art. For example, the disclosed algorithm, model and systems can design massively multiplex PCR primer sets.

[005] In another aspect, this disclosure provides a computer implemented method for designing a multiplex PCR primer set to amplify a plurality of nucleic acid (NA) template sequences, comprising the steps of: (a) generating a set of Semilocus sequences from the NA template sequences, (b) generating a set of primer candidates for each Semilocus, (c) evaluating a Fitness score for each primer candidate, and (d) performing the following steps (i) to (iv) iteratively for at least 5, 10, 20, 30, 50, 100, 200, 500, 1000, 2000, 3000, 4000 or 5000 cycles: (i) selecting a set of Semiloci, (ii) selecting a set of primer candidates for each of the selected Semiloci, (iii) evaluating a Loss score for each selected primer candidate against a pre-existing primer set to determine an FL Value for each selected primer candidate, and (iv) adding the selected primer candidate with the highest FL Value to the pre-existing primer set.
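The iterative steps (i) to (iv) can be illustrated with a minimal Python sketch. It assumes that steps (a) and (b) have already produced a mapping from each Semilocus to its primer candidates, and that the Fitness and Loss functions are supplied by the caller. The function name dimple_loop, its parameters, and the use of uniform random sampling in steps (i) and (ii) are hypothetical choices for illustration, not details taken from this disclosure.

```python
import random
from typing import Callable


def dimple_loop(
    candidates_by_semilocus: dict[str, list[str]],
    fitness: Callable[[str], float],
    loss: Callable[[str, list[str]], float],
    cycles: int = 5,
    semiloci_per_cycle: int = 2,
    candidates_per_semilocus: int = 3,
    loss_weighting: float = 1.0,
    seed: int = 0,
) -> list[str]:
    """Greedy expansion of a primer set, one primer candidate per cycle."""
    rng = random.Random(seed)
    primer_set: list[str] = []
    # Step (c): Fitness depends only on the candidate itself, so compute it once.
    fitness_scores = {
        cand: fitness(cand)
        for cands in candidates_by_semilocus.values()
        for cand in cands
    }
    names = sorted(candidates_by_semilocus)
    for _ in range(cycles):
        # Step (i): select a set of Semiloci (uniform sampling is an assumption).
        chosen = rng.sample(names, k=min(semiloci_per_cycle, len(names)))
        # Step (ii): select primer candidates for each chosen Semilocus.
        pool: list[str] = []
        for name in chosen:
            available = [c for c in candidates_by_semilocus[name] if c not in primer_set]
            k = min(candidates_per_semilocus, len(available))
            pool.extend(rng.sample(available, k=k))
        if not pool:
            continue
        # Step (iii): Loss against the existing primer set gives the FL Value.
        scored = [
            (fitness_scores[c] - loss_weighting * loss(c, primer_set), c) for c in pool
        ]
        # Step (iv): add the candidate with the highest FL Value.
        _, best = max(scored)
        primer_set.append(best)
    return primer_set
```

In this sketch the primer set starts empty and grows by at most one primer per cycle, so the number of cycles bounds the size of the designed primer set.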

[006] In an aspect, this disclosure provides a computer-readable medium comprising codes that, upon execution by one or more processors, cause the one or more processors to implement the primer design method as set out herein.

[007] In another aspect, this disclosure provides a computer system for designing a multiplex PCR primer set, the system comprising: a non-transitory memory configured to store executable instructions; and one or more processors, alone or in combination, programmed to perform a primer design method as set out herein.

[008] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

[009] The following drawings and provided descriptions illustrate the apparent features, advantages, and uses of the invention(s). The incorporated drawings and descriptions included herein serve to identify specifications that will further explain the concept of the invention(s) and allow the production of art that allows a trained professional to make and use the invention(s). The drawings are not illustrated to scale.

[0010] Figure 1. Flowchart of one embodiment of the DIMPLE algorithm. Terms defined in this patent application are capitalized and underlined. The bracketed numbers are for illustration purposes.

[0011] Figure 2. Illustration of Semilocus sequence generation from context sequences and key sequences. The DNA template sequence at the top is retrieved from a reference genome of interest, and the context sequence is generated through removal of nucleotides too distant from the key sequence of interest. The subsequence of the context sequence to the 5’ of the key sequence becomes Semilocus 1, and the reverse complement of the subsequence of the context sequence to

subsequences of their respective Semilocus sequences.

[0012] Figure 3. Illustration of primer candidate generation from a Semilocus sequence. The first sequence has a ΔG° that is too weak to be considered a valid primer candidate, and the second sequence has a ΔG° that is too strong to be considered a valid primer candidate. The primer candidates 1a through 1h all have ΔG° values between -11.5 kcal/mol and -13.0 kcal/mol.

[0013] Figure 4. Illustration of primer set and Semilocus coverage. The dashed boxes illustrate the aligned positions of the primers in a primer set to each of 6 Semiloci. The number of primers in the primer set that are derived from each Semilocus constitute that Semilocus’s coverage.

[0014] Figure 5. Illustration of a primer candidate’s Fitness and Loss values. Fitness is a value that can be calculated based only on the primer candidate’s sequence itself, whereas Loss is a value that is dependent on the pairwise interactions between the primer candidate and each primer in a primer set.

[0015] Figure 6. Flowchart of the Main DIMPLE design loop. Here, the primer set Database is implemented as a hash table of k-mer words.

[0016] Figure 7. Illustration of Loss calculation via 6-mer words with a primer set comprising 1 primer and a single primer candidate. All 6-mer words of the primer set are enumerated on the left side, and the reverse complements of all 6-mer words of the primer candidate are enumerated on the right side. The reverse complements of all 6-mer primer candidate words can be rapidly matched for hits to the 6-mer words of the primer set through a Database (e.g. hash table, suffix tree) constructed on the latter. The same approach can be used for 7-mer words and longer words.

[0017] Figure 8. Illustration of impact of Redundancy Penalty values on the design of primer sets. The left panel shows a hypothetical set of 2 primer candidates for each of 4 Semiloci and the Fitness values. The arrows indicate the pairwise Loss values between primer candidates; where no arrows exist between a pair of primer candidates, the pairwise Loss is 0. The right panel shows the Redundancy Penalties and resulting primer sets: When the Redundancy Penalty is 100, there is strong discouragement for adding a second primer to a Semilocus into the primer set, and the resulting primer set has 4 primers, 1 to each Semilocus. When the Redundancy Penalty is 50, 50 (50 for the second primer to a Semilocus, cumulative 100 for the third primer to a Semilocus), DIMPLE will prioritize achieving at least 1 primer coverage for each Semilocus, and the resulting primer set has 6 primers, including at least 1 to each Semilocus. When the Redundancy Penalty is 0, 100, DIMPLE values a second primer to a Semilocus equivalently to having a first primer to another Semilocus, and the resulting primer set has 6 primers, but there are 0 primers to Semilocus 4.

[0018] Figure 9. Example properties of DIMPLE designed primer sets using the embodiment described in Figure 1. The top subfigure panels show the Fitness values, Loss values, and FL values of a primer set of P=788 primers for N=100 context sequences, each CL=201 nucleotides long with a 1nt key sequence in the middle of each context sequence. The bottom subfigure panels show the Fitness values, Loss values, and FL values of a primer set of P=1125 primers for the 200 post-bisulfite conversion products of the top and bottom strands for the same N=100 context sequences as in the top panels.

[0019] Figure 10 illustrates the performance of a DIMPLE-designed pan-cancer panel. (A) Summary of the Ct Observed with 100 Subsets of 10 Primer Sets (20 Total Primers): In 70% of the cases, the Ct for no-template control (NTC) is above 39 cycles, demonstrating the low propensity of DIMPLE-designed primers to form dimers. (B) The 1000-Plex Pan-Cancer Panel: The designed panel yields a ΔCt of approximately 10 cycles between the positive control and the negative control. (C) Post-Amplification Workflow: Upon amplification, the product undergoes a standard library preparation workflow, which includes the ligation of adapters and amplification with indexing primers. (D) Initial Library Performance: The initial library, without any optimization, shows that 60% of the reads align to the desired target and 30% are unaligned. (E) Target Analysis: Further analysis reveals that 30% of the targets are responsible for 90% of the by-products. (F) Optimization Results: After removing the problematic targets, the NTC in qPCR is delayed by over 8 cycles. More importantly, 80% of the amplicons are present with a high degree of uniformity, with the number of reads exceeding 20% of the median.

[0020] Figure 11 illustrates detection of fusion genes using the DIMPLE qPCR assay. (A) Strategy for Designing the 274-Primer Assay Using DIMPLE: The DIMPLE strategy for the qPCR assay involves designing forward (FW) primers for each exon of the genes of interest (KMT2A and NUP98) and reverse (RV) primers for each exon of their most frequent partner genes. Specifically, forward primers are designed to target all relevant exons of KMT2A and NUP98. Reverse primers are designed to target the exons of the most common fusion partners. This comprehensive approach ensures broad coverage: by covering each exon of the target genes and their partners, this design allows for the detection of a wide variety of fusion breakpoints with high sensitivity. (B) Example of Assay Performance: The KMT2A fusion assay, comprising 274 primers, was able to detect as few as 5 copies of KMT2A::MLLT1, down to 0.05% (5 KMT2A fusion copies in 10,000 total copies). (C) Summary of Assay Performance over 19 Constructs: All the synthetic spike-in mimicking fusion constructs were detected up to 1% variant allele frequency (VAF). Additionally, 17 out of 19 constructs (85%) were consistently detected up to 0.05% VAF. (D) Validation of Assay Performance in Blending Experiments: Performance of the DIMPLE qPCR assay on RNA extracted from the cell line RS4;11 (CRL-1873) bearing the KMT2A::AFF1 fusion. The assay demonstrated high sensitivity, detecting the fusion at a dilution level of 0.01% in RNA from leukocytes of healthy individuals.

[0021] Figure 12 shows a computer system implementing the methods, algorithms and computer programs described here.

DETAILED DESCRIPTION

[0022] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

[0023] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference in their entirety.

[0024] As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. By way of example, “an element” means at least one element and can include more than one element.

[0025] Where a range of values is provided, it is understood that each intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the disclosure. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure.

[0026] When a grouping of alternatives is presented, any and all combinations of the members that make up that grouping of alternatives are specifically envisioned. For example, if an item is selected from a group consisting of A, B, C, and D, the inventors specifically envision each alternative individually (e.g., A alone, B alone, etc.), as well as combinations such as A, B, and D; A and C; B and C; etc.

[0027] The term “and/or” when used in a list of two or more items means any one of the listed items by itself or in combination with any one or more of the other listed items. For example, the expression “A and/or B” is intended to mean either or both of A and B - i.e., A alone, B alone, or A and B in combination. The expression “A, B and/or C” is intended to mean A alone, B alone, C alone, A and B in combination, A and C in combination, B and C in combination, or A, B, and C in combination.

[0028] As used herein, the term “substantially”, when used to modify a quality, generally allows a certain degree of variation without that quality being lost. For example, in certain aspects such degree of variation can be less than 0.1%, about 0.1%, about 0.2%, about 0.3%, about 0.4%, about 0.5%, about 0.6%, about 0.7%, about 0.8%, about 0.9%, about 1%, between 1-2%, between 2-3%, between 3-4%, between 4-5%, or greater than 5% or 10%.

[0029] The term “about”, “around” or “approximately”, when modifying the quantity (e.g., mg) of a substance or composition, or the value of a parameter characterizing a step in a method, or the like, refers to variation in the numerical quantity that can occur, for example, through typical measuring, handling, and sampling procedures involved in the preparation, characterization and/or use of the substance or composition; through an inadvertent error in these procedures; through differences in the manufacture, source, or purity of the ingredients employed to make or use the compositions or carry out the procedures; and the like. In certain aspects, “about” can mean a variation of ± 0.1%, ± 0.5%, ± 1%, ± 2%, ± 3%, ± 4%, ± 5%, ± 6%, ± 7%, ± 8%, ± 9% or ± 10%.

[0030] As used herein, the term “context sequence” refers to a subsequence of the DNA template to be amplified that DIMPLE will consider in the design of PCR primers. The length of the context sequence thus determines the maximum length of the amplicons generated by primers designed by the DIMPLE algorithm. The set of context sequences is the primary input of the DIMPLE algorithm. Entry of longer context sequences results in more possible primer candidates, which generally will result in slower runtime but better Semilocus coverage and improved Fitness and Loss of designed primer sets. In an aspect, a primer set is capable of amplifying context sequences having a length between 40 and 1000, between 40 and 900, between 40 and 800, between 40 and 700, between 40 and 600, between 40 and 500, between 40 and 400, between 40 and 300, between 40 and 200, between 40 and 100, between 40 and 90, between 40 and 80, between 40 and 70, between 40 and 60, between 50 and 1000, between 100 and 1000, between 200 and 1000, between 300 and 1000, between 400 and 1000, between 500 and 1000, between 600 and 1000, between 700 and 1000, between 800 and 1000, between 100 and 900, between 100 and 800, between 100 and 700, between 100 and 600, between 100 and 500, between 100 and 400, between 100 and 300, between 100 and 200, between 200 and 900, between 300 and 800, between 400 and 700, between 500 and 600, between 100 and 200, between 200 and 300, between 300 and 400, or between 400 and 500 nucleotides.

[0031] As used herein, the term “key sequence” refers to a sequence in each context sequence that must be included in the insert of the amplicon generated by the corresponding primers. In some embodiments, the key sequence is represented in the context sequence as uppercase letters, and all other DNA nucleotides in the context sequence are represented in lowercase letters. Because the key sequence may be probed by a molecular probe (e.g. TaqMan probe) or sequenced to determine its identity, the corresponding primer for the context sequence preferably does not include the key sequence (or its complement). In some embodiments, when the key sequence is not specified by the user, DIMPLE defaults to assuming that the middle 1-2 letters of the context sequence are the key sequence.

[0032] As used herein, the term “Semilocus” (and its plural form “Semiloci”) refers to a portion of a context sequence, where each context sequence is used to generate two Semilocus sequences: all nucleotides to the 5’ of the key sequence are considered the first Semilocus sequence, and the reverse complement of all nucleotides to the 3’ of the key sequence is considered the second Semilocus sequence. The Semilocus is the set of continuous nucleotides from which continuous subsequences are selected to generate primer candidates.
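As an illustration of the Semilocus definition, the following Python sketch splits one context sequence into its two Semilocus sequences. It assumes the key sequence is a single contiguous run of uppercase letters within an otherwise lowercase context sequence, and falls back to the middle 1-2 letters when no key is marked. The function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (letter case is kept)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def split_semiloci(context: str) -> tuple[str, str]:
    """Split a context sequence into (Semilocus 1, Semilocus 2).

    Assumption: the key sequence is one contiguous uppercase run. If no
    uppercase letters are present, the middle 1 letter (odd length) or
    middle 2 letters (even length) are treated as the key sequence.
    """
    upper = [i for i, ch in enumerate(context) if ch.isupper()]
    if upper:
        start, end = upper[0], upper[-1] + 1
    else:
        n = len(context)
        start = (n - 1) // 2
        end = n // 2 + 1
    # Semilocus 1: everything 5' of the key sequence, as written.
    semilocus_1 = context[:start].upper()
    # Semilocus 2: reverse complement of everything 3' of the key sequence.
    semilocus_2 = reverse_complement(context[end:]).upper()
    return semilocus_1, semilocus_2
```

For example, split_semiloci("acgtTTggca") treats "TT" as the key sequence and returns ("ACGT", "TGCC").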

[0033] As used herein, the term “primer candidate” refers to a subsequence of the Semilocus that fulfills the basic criteria for being a valid primer to amplify its corresponding context sequence. In some embodiments, a continuous subsequence of a Semilocus qualifies as a primer candidate if the standard free energy of hybridization (ΔG°) of the subsequence to its reverse complement sequence at the temperature and salinity of the intended PCR reaction’s annealing step is roughly -12 kcal/mol. In some embodiments, DIMPLE defaults to assuming that the temperature of the PCR anneal step is about 60 °C, and the effective salinity of the PCR reaction is about 0.18 M Na+. In other embodiments, a continuous subsequence of a Semilocus qualifies as a primer candidate if the melting temperature of a two-stranded DNA molecule consisting of the subsequence and its reverse complement is between 1°C and 10°C above the PCR anneal temperature for the intended primer concentration. In an aspect, a primer candidate comprises a template-complementary region ranging between 10 and 50, between 15 and 45, between 20 and 40, between 25 and 35, between 10 and 40, between 10 and 30, between 10 and 20, between 20 and 50, between 20 and 40, or between 20 and 30 nucleotides. In another aspect, a primer candidate comprises a GC content between 0.10 and 0.90, between 0.10 and 0.80, between 0.10 and 0.70, between 0.10 and 0.60, between 0.10 and 0.50, between 0.10 and 0.40, between 0.10 and 0.30, between 0.10 and 0.20, between 0.20 and 0.80, between 0.20 and 0.70, between 0.20 and 0.60, between 0.20 and 0.50, between 0.20 and 0.40, between 0.20 and 0.30, between 0.30 and 0.80, between 0.30 and 0.70, between 0.30 and 0.60, between 0.30 and 0.50, between 0.30 and 0.40, between 0.10 and 0.20, between 0.20 and 0.30, between 0.30 and 0.40, between 0.40 and 0.50, between 0.50 and 0.60, between 0.60 and 0.70, between 0.70 and 0.80, or between 0.80 and 0.90.
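The enumeration of primer candidates from a Semilocus can be sketched as follows. The sketch keeps every continuous subsequence whose free energy falls inside a window around -12 kcal/mol. The energy model is passed in as a function; the default toy_delta_g is a crude placeholder (-1.0 per G/C, -0.5 per A/T) that is not thermodynamically accurate and stands in for a real nearest-neighbor calculation at the assumed anneal temperature and salinity. All names and the window defaults are assumptions made for illustration.

```python
from typing import Callable, Iterator


def toy_delta_g(seq: str) -> float:
    """Placeholder energy model in kcal/mol; NOT a real thermodynamic model."""
    return sum(-1.0 if base in "GC" else -0.5 for base in seq.upper())


def primer_candidates(
    semilocus: str,
    delta_g: Callable[[str], float] = toy_delta_g,
    dg_min: float = -13.0,
    dg_max: float = -11.5,
    min_len: int = 10,
    max_len: int = 50,
) -> Iterator[str]:
    """Yield continuous subsequences whose delta_g lies in [dg_min, dg_max]."""
    for start in range(len(semilocus)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(semilocus):
                break  # longer subsequences from this start also run off the end
            subsequence = semilocus[start:end]
            if dg_min <= delta_g(subsequence) <= dg_max:
                yield subsequence
```

A production implementation would replace toy_delta_g with a validated hybridization model; only the enumeration and windowing logic is the point of this sketch.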

[0034] As used herein, the term “primer set” refers to a set of primers that are intended to be used together in the same PCR reaction. DIMPLE takes the approach of gradually expanding the number of primers in the primer set by incrementally adding one primer candidate into the primer set at a time, in a way that maximizes the coverage of the Semiloci while minimizing the likelihood of forming primer dimers.

[0035] As used herein, the term “Semilocus coverage” of a Semilocus refers to the number of primers in the primer set that amplify that Semilocus. For example, given a set of 10 context sequences as input, 20 Semilocus sequences are generated. A minimal primer set that amplifies all 10 context sequences would have 20 primers in the primer set, corresponding to 1 primer per Semilocus. In some embodiments, Semilocus coverage is expressed as an array of integers, corresponding to the number of primers in the primer set that are selected from each Semilocus. When a primer set’s Semilocus coverage has at least one 0 value, then the primer set is considered incomplete, as it cannot amplify all context sequences. When the primer set’s Semilocus coverage has at least one value greater than or equal to 2, then the primer set is considered Redundant, as there are multiple choices of primer for the corresponding Semilocus. A primer set can be simultaneously incomplete and redundant for different Semiloci. In some instances, primer set redundancy is desired because it allows backup options during experimental optimization of multiplex PCR panels, and because the extra primers can be easily used as nested primers for a nested PCR process.
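The coverage array and the incomplete and Redundant conditions can be expressed in a few lines of Python. The representation of a primer set as a list of (Semilocus index, primer sequence) pairs is an assumption made for this sketch.

```python
def semilocus_coverage(primer_set: list[tuple[int, str]], n_semiloci: int) -> list[int]:
    """Count how many primers in the primer set were selected from each Semilocus."""
    coverage = [0] * n_semiloci
    for semilocus_index, _sequence in primer_set:
        coverage[semilocus_index] += 1
    return coverage


def is_incomplete(coverage: list[int]) -> bool:
    """A primer set is incomplete if any Semilocus has no primer."""
    return any(count == 0 for count in coverage)


def is_redundant(coverage: list[int]) -> bool:
    """A primer set is Redundant if any Semilocus has two or more primers."""
    return any(count >= 2 for count in coverage)
```

A primer set with coverage [2, 0, 1] is both incomplete (the second Semilocus has no primer) and Redundant (the first Semilocus has two).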

[0036] As used herein, the term “Fitness” of a primer candidate is a numerical score indicating the intrinsic desirability of the primer candidate sequence, without consideration for potential interactions with other primer candidates or the existing primer set. In some embodiments, the Fitness is initially defined to be a fixed value (e.g. 100), and reduced subsequently based on the presence of undesirable features in a primer candidate. In some embodiments, undesirable features that penalize Fitness include but are not limited to (1) primer candidate sequence lengths that are longer than a threshold (e.g. 35nt), (2) primer candidate sequence lengths that are shorter than a different threshold (e.g. 15nt), (3) presence of homopolymers of at least a threshold length (e.g. 6nt) in the primer candidate, (4) presence and/or quantity of degenerate nucleotides in the primer candidate, and (5) presence of self-complementary subsequences within the primer candidate. In some embodiments, Fitness penalties include the predicted quantity and/or strength of nonspecific binding interactions to other regions of a background genome sequence. In some embodiments, a random variable (e.g. a normally distributed value with mean 0 and standard deviation 5) is added to each primer candidate’s Fitness to break ties in Fitness and/or to provide stochasticity in the DIMPLE primer design process.

[0037] As used herein, the term “Loss” for a primer candidate with a primer set is a numerical score indicating the likelihood of the primer candidate to form undesired interactions with the primer set. In some embodiments, Loss is initially defined to be a fixed value (e.g. 0), and increased subsequently based on the presence of sequence features in the primer candidate and primer set that could result in undesirable primer dimer formation. In some embodiments, the Loss is computed through a weighted sum of k-mer words that are reverse complementary between the primer candidate and any primer in the primer set. In some embodiments, only 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words are considered. In some embodiments, the weighting of the presence of the 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words is different depending on the position of these words in the primer candidate and in the corresponding primer in the primer set. In some embodiments, the Loss is computed through a hash table comprising the k-mer words in the primers of a primer set.

[0038] As used herein, the term “Loss Weighting” refers to the relative weight of Loss vs. Fitness in the evaluation of primer candidates by the DIMPLE algorithm. In some embodiments, the Loss Weighting initially starts at a value (e.g. 1) and then is later reduced during the course of the DIMPLE algorithm. In some embodiments, the Loss Weighting is reduced when a large fraction of primer candidates evaluated have Loss Weighting multiplied by Loss exceeding the primer candidate’s Fitness. In some embodiments, the Loss Weighting is reduced based on the number of primers in the primer set and the number of context sequences, with larger reduction of Loss Weighting when the number of primers in the primer set divided by the number of context sequences is smaller.
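The hash-table computation of Loss from reverse complementary k-mer words can be sketched in Python as follows. A dictionary counts every 6-mer to 10-mer word in the primer set, and the words of the reverse complement of a primer candidate are looked up in it. The per-length weights (doubling with each extra nucleotide) are hypothetical, and position-dependent weighting is omitted for brevity.

```python
from collections import defaultdict


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq: str, k: int):
    """Yield all k-mer words of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_kmer_table(primer_set: list[str], ks=range(6, 11)) -> dict[str, int]:
    """Hash table mapping each k-mer word in the primer set to its count."""
    table: dict[str, int] = defaultdict(int)
    for primer in primer_set:
        for k in ks:
            for word in kmers(primer, k):
                table[word] += 1
    return table


def loss(candidate: str, table: dict[str, int], weights=None, ks=range(6, 11)) -> float:
    """Weighted count of candidate words that are reverse complementary to the set.

    The k-mers of reverse_complement(candidate) are exactly the reverse
    complements of the candidate's own k-mers, so one pass suffices.
    """
    weights = weights or {k: 2.0 ** (k - 6) for k in ks}  # hypothetical weights
    rc = reverse_complement(candidate)
    total = 0.0
    for k in ks:
        for word in kmers(rc, k):
            total += weights[k] * table.get(word, 0)
    return total
```

Because the table is built once per primer set and updated as primers are added, each Loss evaluation costs time proportional to the candidate length rather than to the size of the primer set.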

[0039] As used herein, the term “FL Value”, for a primer candidate, is calculated as the primer candidate’s Fitness minus the Loss Weighting multiplied by the primer candidate’s Loss. In each cycle of primer selection, DIMPLE evaluates the FL Values of a number of primer candidates and selects the primer candidate with the highest FL Value to be added to the primer set, assuming that the highest FL Value is positive. In some embodiments, if the FL Values are negative for all primer candidates evaluated, the Loss Weighting is reduced. In some embodiments, if the FL Values are negative for all primer candidates evaluated, the evaluated primer candidates are removed from consideration, and DIMPLE will select a new set of primer candidates to evaluate.

[0040] As used herein, the term “Redundancy Penalty” refers to a penalty to the FL Value to encourage the DIMPLE algorithm to prioritize inclusion of primers into the primer set from Semiloci with low Semilocus coverage. In some embodiments, the Redundancy Penalty is implemented as dynamic updates to the Fitness of all remaining primer candidates to a specific Semilocus, after one primer candidate for a Semilocus is selected to be included into the primer set. In other embodiments, the Redundancy Penalty is implemented as a separate term in the FL Value.
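The Redundancy Penalty as a separate term in the FL Value can be sketched as below. The penalty list follows the notation used in the description of Figure 8, where "50, 50" means 50 for the second primer to a Semilocus and a cumulative 100 for the third. How the penalty behaves beyond the end of the list is not specified in the text; this sketch assumes it stays at the cumulative total. Function names are hypothetical.

```python
def redundancy_penalty(current_coverage: int, penalties: list[float]) -> float:
    """Cumulative penalty for adding one more primer to a Semilocus.

    current_coverage is the number of primers already in the primer set for
    that Semilocus. Assumption: beyond the end of the list, the penalty stays
    at the cumulative total of all listed values.
    """
    return float(sum(penalties[:current_coverage]))


def penalized_fl_value(fitness: float, loss: float, loss_weighting: float,
                       current_coverage: int, penalties: list[float]) -> float:
    """FL Value with the Redundancy Penalty applied as a separate term."""
    return (fitness - loss_weighting * loss
            - redundancy_penalty(current_coverage, penalties))
```

With penalties [0, 100], a second primer to a Semilocus is penalized by 0 and therefore competes on equal terms with a first primer to another Semilocus, which matches the behavior described for that setting.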

[0041] A “computer system” refers to a system of hardware, software, and data storage medium used to analyze information. The minimum hardware of a patient computer-based system comprises a central processing unit (CPU), and hardware for data input, data output (e.g., display), and data storage. An ordinarily skilled artisan can readily appreciate that any currently available computer-based systems and/or components thereof are suitable for use in connection with the methods of the present disclosure. The data storage medium may comprise any manufacture comprising a recording of the present information as described above, or a memory access device that can access such a manufacture.

[0042] To “record” data, programming or other information on a computer readable medium refers to a process for storing information, using any such methods as known in the art. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.

[0043] A “processor” or “computing means” references any hardware and/or software combination that will perform the functions required of it. For example, a suitable processor may be a programmable digital microprocessor such as available in the form of an electronic controller, mainframe, server or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic medium or optical disk may carry the programming, and can be read by a suitable reader communicating with each processor at its corresponding station.

Multiplexed PCR Primer Design.

[0044] DNA extracted from biospecimen samples used for medical diagnostics or life sciences research is often low in volume and/or concentration (e.g. cell-free DNA, tissue biopsies), and must be amplified for downstream analysis. The most common method for amplifying DNA from specific genes or loci of interest is the polymerase chain reaction (PCR), in which a pair of DNA primers is used to exponentially amplify a region of interest. PCR primers are relatively easy to design for a single DNA template sequence, but designing highly multiplex PCR primers is extremely challenging due to the combinatorially many possible unintended amplification reactions. For example, to PCR amplify 1000 different regions of the genome simultaneously, 2000 PCR primers are needed, which can generate Combination(2000, 2) = 1,999,000 primer dimer products, and as many nonspecific genomic amplification products. For a typical DNA locus of interest, there are often more than 30 reasonable possible candidates for each of the forward primer and reverse primer, so a 1000-amplicon PCR reaction would have 30^2000 possible choices for primer set design, which is not feasible to comprehensively evaluate and order.

[0045] With modern advances in DNA sequencing, there is increasing interest and feasibility to study more different regions of the genome and transcriptome to understand their impact on phenotype, and use this information to guide a variety of biology-related applications, including medical diagnostics, therapeutics development, biomanufacturing, and agricultural yield optimization. However, despite decreasing costs of DNA sequencing, humanity is still far from being able to affordably perform unbiased whole genomic sequencing to sufficient depths (e.g. 30,000x) for screening, diagnosing, and monitoring a variety of diseases, particularly those based on cell-free DNA.
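The counts quoted in paragraph [0044] for a 1000-amplicon reaction can be checked with a short calculation. The sketch below computes the number of primer pairs among 2000 primers and, because 30^2000 is too large to print usefully, the number of decimal digits in that value. The function names are hypothetical.

```python
import math


def pairwise_interactions(n_primers: int) -> int:
    """Number of unordered primer pairs, Combination(n_primers, 2)."""
    return math.comb(n_primers, 2)


def design_space_digits(choices_per_primer: int, n_primers: int) -> int:
    """Number of decimal digits in choices_per_primer ** n_primers."""
    return math.floor(n_primers * math.log10(choices_per_primer)) + 1


n_primers = 2 * 1000  # one forward and one reverse primer per amplicon
print(pairwise_interactions(n_primers))       # 1999000
print(design_space_digits(30, n_primers))     # 2955
```

The pair count grows quadratically with the number of primers, while the number of possible primer sets grows exponentially, which is why exhaustive evaluation is not feasible.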

[0046] Consequently, there is a need for fast algorithms that design multiplex PCR primer sets with high specificity and sensitivity to the genetic loci of interest. Here, sensitivity refers to whether a genetic locus of interest is amplified by the corresponding PCR primers. Specificity refers to whether amplification products from a primer set contain a large fraction of unintended amplicons, such as primer dimers and nonspecific genomic amplification products. A poorly designed multiplex PCR primer set may have low sensitivity and specificity because one or both primers for amplifying Locus 1 may be consumed through unproductive primer-dimer formation with other primers, or through unproductive nonspecific genomic amplification of other DNA sequences.

Nested PCR.

[0047] To improve the specificity of PCR, particularly highly multiplex PCR, one common strategy is to perform nested PCR, wherein an outer set of PCR primers is first used to pre-amplify the DNA loci of interest. The outer PCR primer set is often designed with higher emphasis on sensitivity than specificity. The products of the multiplexed PCR reaction using the outer PCR primers are then re-amplified using a set of inner PCR primers. Because nonspecific DNA amplification products and primer dimers are unlikely to have corresponding binding sites for the inner primers, this process greatly increases the sensitivity and specificity of the multiplex PCR amplification reaction. On the other hand, nested PCR is more labor-intensive, as it typically involves multiple purification steps after the outer PCR amplification step to remove unreacted primers, pyrophosphate side products, and other chemicals. This labor-intensive protocol prevents nested PCR from being routinely used in qPCR settings, but nested PCR is used in NGS library preparation approaches.

DNA Methylation and Methylation Markers.

[0048] Recently, there has been high interest in profiling the methylation state of DNA for a variety of human disease-related applications. When a cytosine nucleotide is immediately followed by a guanine nucleotide in a DNA sequence, this dinucleotide pair is called a CpG site (the lowercase "p" indicating a phosphate). Some fraction of cytosines in CpG’s can possess a 5-methyl modification (5mC), whereas cytosines not followed by guanines generally will not have a 5-methyl modification. The methylation state of CpG’s can impact the expression levels of genes through regulating the transcription rates. Additionally, the methylation state of DNA distinguishes cells from different organs from each other. As a result, DNA methylation patterns are increasingly being studied as disease biomarkers and predictive biomarkers for therapeutics. For example, breast cancer patients with BRCA1 promoter hypermethylation tend to benefit from PARP inhibitor drugs such as olaparib. As another example, companies such as Grail use methylation patterns in cell-free DNA to perform early cancer detection screening tests, such as the Grail Galleri test.

[0049] One of the most common methods to profile the DNA methylation state of CpG sites of interest is bisulfite conversion. Bisulfite conversion converts unmethylated C’s into uracils (U). These uracil bases are copied as thymine (T) bases during the PCR amplification process. In contrast, methylated C’s remain as methylated C’s, which are then copied as cytosine (C) bases during PCR. There are more recent enzymatic conversion methods, such as those based on the APOBEC enzyme, that result in the same C to U conversion as bisulfite conversion.

[0050] The sequences of the DNA molecules that result from bisulfite conversion treatment of the original DNA template have some properties that render them more difficult to design PCR primers for than typical genomic DNA templates. First, because most occurrences of cytosines in the genome are not in CpG’s, most of the C’s are converted into U’s/T’s. Consequently, the post-conversion DNA template sequences are A/T rich, and contain a high number of A/T runs. Second, methylation states for CpG’s at the same locus are often heterogeneous, with a sample containing multiple cellular sources of DNA with different methylation patterns. Consequently, after conversion each CpG can become either CG or TG. When a primer binding sequence has multiple CpG sites, exponentially many primer sequences may be needed. For example, the sequence 5’-ACGCGTTCGTTCCA-3’ has 3 CpG sites, and any of the 2^3 = 8 following primer sequences could be perfectly matched to the post-conversion sequence: (1) 5’-TAAAACGAACGCGT-3’, (2) 5’-TAAAATGAACGCGT-3’, (3) 5’-TAAAACGAATGCGT-3’, (4) 5’-TAAAATGAATGCGT-3’, (5) 5’-TAAAACGAACGTGT-3’, (6) 5’-TAAAATGAACGTGT-3’, (7) 5’-TAAAACGAATGTGT-3’, (8) 5’-TAAAATGAATGTGT-3’.
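As a minimal sketch of the enumeration above, the eight primer sequences can be produced by writing the primer with a Y (C or T) placeholder at each CpG-derived position and expanding every combination. The template string `TAAAAYGAAYGYGT` and the function name `expand_degenerate` are illustrative assumptions inferred from the listed sequences; they are not part of the disclosure.

```python
from itertools import product

# Degenerate codes used in this document: Y = C or T, R = G or A.
DEGENERATE = {"Y": "CT", "R": "GA"}


def expand_degenerate(seq):
    """Return every non-degenerate sequence consistent with seq."""
    choices = [DEGENERATE.get(base, base) for base in seq]
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical degenerate template inferred from the eight primers listed above.
primers = expand_degenerate("TAAAAYGAAYGYGT")
for primer in primers:
    print(f"5'-{primer}-3'")
```

The number of sequences doubles with each additional degenerate position, which is why primer binding sites with many CpG sites quickly become impractical to cover with non-degenerate primers.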

[0051] In an aspect, the present disclosure provides a Degenerate Incomplete Multiplex Primer List Expansion (DIMPLE) method, its computer-implemented version, and the corresponding computer-readable medium and computer system, for designing massively multiplex PCR primer sets.

[0052] DIMPLE presents many advantages. For example, DIMPLE’s runtime is linear in the number of DNA templates to be designed, allowing it to scale to massive multiplexing. DIMPLE is applicable to designing PCR primers for both standard DNA templates and to post-conversion DNA templates with degenerate nucleotides. DIMPLE allows simultaneous design of multiple primers to each DNA template of interest, to facilitate experimental optimization and to enable nested PCR experimental methods. Finally, the inherent lack of template coverage completeness by DIMPLE allows DIMPLE to avoid non-suitable templates that can result in primers with high experimental failure rates.

[0053] In an aspect, DIMPLE generates a multiplex primer set with a plexity between 100 and 200, between 200 and 300, between 300 and 500, between 500 and 750, between 750 and 1000, between 1000 and 2000, between 2000 and 3000, between 2000 and 4000, between 2000 and 5000, between 5000 and 7500, between 7500 and 10000, between 10000 and 15000, between 15000 and 20000, between 20000 and 25000, or between 25000 and 30000.

[0054] A more detailed description of an exemplary DIMPLE algorithm is provided below.

[0055] The Degenerate Incomplete Multiplex Primer List Expansion (DIMPLE) algorithm designs massively multiplex PCR primer sets with very fast runtime, linear in the number of context sequences and primers designed. In the illustrative implementation of DIMPLE, DIMPLE is abstracted as 6 modules, listed below in the order in which they are called:

1. Caller script.

2. Main DIMPLE primer design function.

3. Primer candidate generation function.

4. Fitness evaluation function.

5. Primer set Database update function.

6. Primer candidate Loss evaluation function.

1. Caller script.

[0056] The caller script defines several key input variables and parameters, and then passes these into the main DIMPLE function. In some embodiments, the caller script defines the context sequences based on a reference genome sequence, a set of genomic coordinates of interest, and a set of amplicon length constraints.

[0057] In some embodiments, the caller script defines the context sequences based on the plus (+) strand of the reference genome sequence, and replaces all cytosine (C) nucleotides not directly adjacent to the 5’ of a guanine (G) nucleotide with uracil (U) or thymine (T) nucleotides, and replaces all cytosine (C) nucleotides directly adjacent to the 5’ of a G nucleotide with a pyrimidine (Y) degenerate nucleotide. In some embodiments, the caller script defines the context sequences based on the minus (-) strand of the reference genome sequence, and replaces all cytosine (C) nucleotides not directly adjacent to the 5’ of a guanine (G) nucleotide with uracil (U) or thymine (T) nucleotides, and replaces all cytosine (C) nucleotides directly adjacent to the 5’ of a G nucleotide with a pyrimidine (Y) degenerate nucleotide.
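The conversion rule described above can be sketched as follows. This is an illustrative sketch only: the function names `convert_context` and `reverse_complement` are hypothetical, T is used in place of U, and a C at the very end of the input (whose 3’ neighbor is unknown) is assumed to be a non-CpG cytosine.

```python
def convert_context(seq):
    """Simulate conversion of one strand, read 5' to 3'.

    A C immediately followed by G (a CpG site) becomes Y (C or T);
    every other C becomes T. All other bases are left unchanged.
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        if base != "C":
            converted.append(base)
        elif i + 1 < len(seq) and seq[i + 1] == "G":
            converted.append("Y")
        else:
            # Assumption: a C with no known 3' neighbor is treated as non-CpG.
            converted.append("T")
    return "".join(converted)


def reverse_complement(seq):
    """Reverse complement of a non-degenerate DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


plus_strand = "ACGCGTTCGTTCCA"
plus_context = convert_context(plus_strand)
# The minus strand is converted independently, after taking the reverse complement.
minus_context = convert_context(reverse_complement(plus_strand))
print(plus_context, minus_context)
```

Note that the two converted strands are no longer reverse complements of each other, so each strand yields its own context sequence.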

[0058] In some embodiments, the caller script defines the Redundancy Penalty values in a way that facilitates maximal primer coverage of the context sequences (e.g. first Penalty = 100). In an aspect, a primer set designed herein provides substantially complete coverage of intended context sequences. In other embodiments, the caller script defines the Redundancy Penalty values in a way that facilitates at least 2 primers to be designed for each Semilocus (e.g. first Penalty = 0, second Penalty = 100).

[0059] In some embodiments, the caller script defines a Fitness Noise parameter. In some embodiments, the Fitness Noise parameter is the standard deviation of a Gaussian distributed random variable with mean 0 that is added to the calculated Fitness of each primer candidate. In other embodiments, the Fitness Noise parameter is the maximum absolute value of a uniformly distributed variable with mean 0 that is added to the calculated Fitness of each primer candidate.
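A minimal sketch of the two Fitness Noise embodiments is shown below. The function name `noisy_fitness`, its argument names, and the use of a seeded `random.Random` instance are assumptions made for illustration.

```python
import random


def noisy_fitness(fitness, noise, distribution="gaussian", rng=random):
    """Add zero-mean noise to a calculated Fitness value.

    gaussian: `noise` is the standard deviation.
    uniform:  `noise` is the maximum absolute value of the perturbation.
    """
    if distribution == "gaussian":
        return fitness + rng.gauss(0.0, noise)
    if distribution == "uniform":
        return fitness + rng.uniform(-noise, noise)
    raise ValueError(f"unknown distribution: {distribution}")


# A seeded generator makes a design run reproducible; different seeds
# would give different alternative primer sets.
rng = random.Random(42)
print(noisy_fitness(100.0, 2.0, "gaussian", rng))
print(noisy_fitness(100.0, 2.0, "uniform", rng))
```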

[0060] In some embodiments, the caller script defines a dynamic Loss Weighting ruleset. In some embodiments, the Loss Weighting is never changed through the course of a DIMPLE design run. In other embodiments, the Loss Weighting is divided by a fixed value each time the Loss Weighting adjustment is triggered. In some embodiments, the Loss Weighting is divided by a value inversely proportional to the number of primers in the primer set when the Loss Weighting adjustment is triggered. In some embodiments, the Loss Weighting is divided by a value proportional to the number of context sequences when the Loss Weighting adjustment is triggered.

[0061] In some embodiments, the caller script defines a set of pre-existing primer sequences to be included initially in the primer set. In some embodiments, the set of pre-existing primer sequences are for primers intended to be present in the final multiplex PCR panel. In other embodiments, the set of pre-existing primer sequences are a set of virtual sequences used to guide the sequence design of the primer set, wherein the pre-existing primer sequences are not intended to physically exist as synthetic oligonucleotides in the final multiplex PCR panel. In other embodiments, the set of pre-existing primer sequences is an empty set.

[0062] In some embodiments, the caller script calls the main DIMPLE function multiple times on the same set of context sequences using different random number generator seed values, in order to generate multiple alternative primer sets. In some embodiments, the multiple alternative primer sets designed by DIMPLE are evaluated by a different algorithm to select a final primer set for wetlab optimization or usage.

2. Main DIMPLE primer design function.

[0063] An illustration of one embodiment of the DIMPLE main function implementation is shown in Fig. 1. Steps 1 through 5 of the Main DIMPLE function are called once, and the remaining steps are repeated many times until the primer set design is complete. The numbers in brackets indicate example values of parameters that can affect the performance of DIMPLE. In some embodiments, adjusting these parameter values may cause the DIMPLE algorithm to produce better primer sets (with higher aggregate coverage) but take significantly longer to run. For example, increasing the number of Semiloci considered and increasing the number of primer candidates considered for each Semilocus potentially increases the performance of DIMPLE at the cost of slower runtime.

[0064] The steps shown in Figure 1 are meant to be an example that is neither the bare minimum necessary nor fully featured, but an illustrative example of an implementation of DIMPLE. For example, other data structures and alignment algorithms could be used instead of hash tables in steps 5 and 10. Two exemplary features of DIMPLE are (1) to sort primer candidates by Fitness calculated independently without consideration for other primers, and (2) to remove primer candidates with FL Values below 0 (or some other threshold value). Both of these significantly and asymptotically reduce the number of computationally expensive pairwise Loss computations between primer candidates and the primer sets. DIMPLE’s linear asymptotic time complexity allows DIMPLE to scale to massively multiplexed design of thousands of primers that function in the same reaction.

3. Primer candidate generation function.

[0065] The primer candidate generation function takes as input one or more Semilocus sequences or context sequences and a set of primer candidate design parameters, and returns as output a set of primer candidates for each Semilocus. In some embodiments, the primer candidate design parameters may comprise the intended PCR anneal cycle temperature, the intended PCR reaction effective monovalent cation salinity, the maximum allowed length of the primer candidate, and the maximum (least negative) allowable standard free energy (ΔG°) of hybridization between a primer candidate and its reverse complement. In some embodiments, the design parameters may have default values of approximately 60°C for PCR anneal cycle temperature, approximately 0.18M for PCR effective monovalent cation concentration, approximately 40 nucleotides for maximum primer candidate length, and approximately -12.0 kcal/mol for the maximum ΔG° of hybridization to reverse complement. In some embodiments, the primer candidate design parameters further comprise a minimum (most negative) allowable standard free energy (ΔG°) of hybridization between a primer candidate and its reverse complement; in some embodiments this minimum ΔG° has a value of approximately -17 kcal/mol.

[0066] In some embodiments, the primer candidate generation function begins by considering every allowable position in the Semilocus as a 5’-most nucleotide, and iteratively adds consecutive nucleotides from the Semilocus sequence until the growing sequence has a ΔG° of hybridization to its reverse complement sequence at or below the maximum allowable ΔG°, or the growing sequence exceeds the maximum allowable primer candidate length. In some embodiments, the primer candidate generation function evaluates and produces multiple primer candidates with the same 5’-most nucleotide position on each Semilocus; in other embodiments, the primer candidate generation function allows a maximum of 1 primer candidate per 5’-most nucleotide position on each Semilocus.

[0067] In some embodiments, the primer candidate generation function handles degenerate nucleotides such as Y for pyrimidines (representing C or T) and R for purines (representing G or A). In some embodiments, the calculation of ΔG° of primer candidate hybridization to its perfect complement for primer candidates with degenerate nucleotides assumes the weakest possible binding, with Y nucleotides assumed to be T and R nucleotides assumed to be A.
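The growth procedure and the weakest-binding treatment of degenerate nucleotides can be sketched as below, for the embodiment that allows at most one primer candidate per 5’-most position. This sketch makes strong simplifying assumptions: the per-base energies in `TOY_DG` are placeholder values chosen for illustration, not nearest-neighbor thermodynamic parameters, and they ignore temperature and salinity. The names `TOY_DG`, `WEAKEST`, and `generate_candidates` are hypothetical.

```python
# Placeholder per-base contributions in kcal/mol. These are NOT real
# thermodynamic parameters; a real implementation would use a
# nearest-neighbor model at the intended temperature and salinity.
TOY_DG = {"A": -0.75, "T": -0.75, "G": -1.5, "C": -1.5}

# Weakest-binding assumption for degenerate nucleotides: Y -> T, R -> A.
WEAKEST = {"Y": "T", "R": "A"}


def generate_candidates(semilocus, max_dg=-12.0, max_len=40):
    """Return (start, sequence) pairs, at most one per 5'-most position.

    Each candidate is grown one nucleotide at a time until its toy
    free energy reaches max_dg; positions that cannot reach max_dg
    within max_len nucleotides yield no candidate.
    """
    candidates = []
    for start in range(len(semilocus)):
        dg = 0.0
        stop = min(start + max_len, len(semilocus))
        for end in range(start, stop):
            base = semilocus[end]
            dg += TOY_DG[WEAKEST.get(base, base)]
            if dg <= max_dg:
                candidates.append((start, semilocus[start:end + 1]))
                break
    return candidates


print(generate_candidates("AYGYGTTYGTTTTAGGCATCGATTAGCAT"))
```

Because Y and R are scored as their weaker A/T alternatives, candidates containing degenerate nucleotides tend to be grown longer than their fully methylated counterparts would require.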

[0068] In some embodiments, the primer candidate generation function inputs further comprise a reference genome sequence or description (e.g. "human GRCh38"). In some embodiments, primer candidates are aligned to the reference genome, and the primer candidates with many perfect or near-perfect hits on the reference genome are removed from consideration. In some embodiments, primer candidates with more than approximately 100 perfect or near-perfect (>90% homology) hits on the reference genome are removed from consideration.

4. Fitness evaluation function.

[0069] The Fitness evaluation function takes as input one or more primer candidate sequence(s) and a set of Fitness evaluation parameters, and returns as output the Fitness score for each primer candidate. In some embodiments, the Fitness evaluation parameters include the default Fitness of a perfect primer candidate with no penalties, the number of nucleotides from the 3’ end below which degenerate nucleotides are penalized, the minimum length of homopolymer repeats above which homopolymers are penalized, the minimum length of primers below which short primer lengths are penalized, the maximum length of primers above which primer lengths are penalized, and the minimum primer candidate self-complementarity above which self-complementarity is penalized. In some embodiments, the default values of the Fitness evaluation parameters are: approximately 100 Fitness as the default value before penalties, approximately 6 nucleotides as the minimum from the 3’ end below which degenerate nucleotides are penalized, approximately 6 nucleotides as the minimum number of homopolymer nucleotides above which homopolymers are penalized, approximately 15 nucleotides as the minimum length primer below which primer length is penalized, and approximately 35 nucleotides as the maximum length primer above which primer length is penalized.

[0070] In some embodiments, the Fitness evaluation function begins by setting the Fitness of each primer candidate as the default Fitness (e.g. 100), and runs a number of tests on each primer candidate sequence to check for the presence of penalizing features. In some embodiments, each penalizing feature is independent and separately reduces the Fitness of the primer candidate by a certain amount.

[0071] In some embodiments, some of the penalizing features are exponential in the Fitness penalty relative to the property of the undesirable feature. For example, Fitness penalties for short primers below a threshold length (e.g. 15nt) may be exponentially penalized as A^(15-Length), where A>1 is a constant and Length is the length of the primer candidate. This length exponential Fitness penalty could be justified in that shorter primer candidates are exponentially more likely to have nonspecific alignment and binding to unintended DNA subsequences of a background genome.

[0072] As another example, Fitness penalties for degenerate (e.g. Y or R) nucleotides in the primer candidate may be exponentially penalized as A^(NumDegen), where A>1 is a constant and NumDegen is the number of degenerate nucleotides. This degenerate nucleotide exponential Fitness penalty could be justified in that more degenerate nucleotides exponentially reduce the concentration of each specific nondegenerate component sequence when the total concentration of the primer candidate with degenerate nucleotides is conserved. As an illustration, a primer candidate with sequence "...YGYGA-3’" will be synthesized as a mixture of 4 different sequences "...TGTGA-3’", "...CGTGA-3’", "...TGCGA-3’", and "...CGCGA-3’", each with approximately 25% of the total concentration of the primer.

[0073] As another example, Fitness penalties for self-complementarity of subsequences in the primer candidate may be exponentially penalized as A^(numGC*B + numAT), where A>1 and B>1 are constants, numGC is the number of G or C nucleotides in the continuous self-complementary subsequence, and numAT is the number of A or T nucleotides in the continuous self-complementary subsequence. The longer the self-complementary sequence, the exponentially more likely the primer candidate is to adopt a secondary structure in solution that is inaccessible for binding to the corresponding template DNA sequence, resulting in a false negative (nonamplification) event.

[0074] In some embodiments, some of the penalizing features are linear in the Fitness penalty relative to the property of the undesirable feature. For example, Fitness penalties for long primers above a threshold length (e.g. 35nt) may be linearly penalized as A*(Length-35), where A is a constant and Length is the length of the primer candidate. The linear Fitness penalty for long primers could be justified based on increased costs for primer synthesis and purification based on current oligonucleotide synthesis and purification chemistry and economics.
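The length and degenerate-nucleotide penalties described above can be combined as in the following sketch. The constants `exp_base=2.0` and `long_slope=1.0` are assumed values for the constant A in each formula, the degenerate count is restricted to the 3’ window as one possible reading of the parameters, and the homopolymer and self-complementarity penalties are omitted for brevity. The function name `fitness` is hypothetical.

```python
def fitness(primer, base_fitness=100.0, min_len=15, max_len=35,
            three_prime_window=6, exp_base=2.0, long_slope=1.0):
    """Score one primer candidate independently of all other primers."""
    score = base_fitness
    length = len(primer)

    # Exponential penalty for short primers: A^(min_len - Length).
    if length < min_len:
        score -= exp_base ** (min_len - length)

    # Linear penalty for long primers: A * (Length - max_len).
    if length > max_len:
        score -= long_slope * (length - max_len)

    # Exponential penalty A^(NumDegen), counting Y/R near the 3' end.
    # Assumption: no penalty is applied when NumDegen is zero.
    num_degen = sum(base in "YR" for base in primer[-three_prime_window:])
    if num_degen > 0:
        score -= exp_base ** num_degen

    return score


print(fitness("ACGTACGTACGTACYGYGA"))
```

Because each penalty depends only on the candidate itself, Fitness can be computed once per candidate and used to sort candidates before any pairwise Loss computation is performed.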

5. Primer set Database update function.

[0075] The primer set Database update function takes as input the sequences of one or more newly selected primer candidates for inclusion into the primer set. In some embodiments, the primer set Database update function also takes as input a data structure or a pointer to a data structure representing the presence, quantity, and 3’-weighting of k-mer words present in the primer set before the inclusion of the newly selected primer candidate(s) (the Database), and produces as output a data structure or a pointer to a data structure representing the presence, quantity, and 3’-weighting of k-mer words present in the primer set after the inclusion of the newly selected primer candidates (the updated Database). In other embodiments, the primer set Database update function updates a global data structure representing the presence, quantity, and 3’-weighting of k-mer words present in the primer set. In some embodiments, the data structure described is a hash table. In other embodiments, the data structure described is a suffix tree. In some embodiments, k-mer words (subsequences) present in the primer set comprise 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words.

[0076] In some embodiments of the primer set Database update function, the Database contains a numerical score for the number of position-weighted occurrences of k-mer words, and each score is initialized as 0 for an empty primer set. In some embodiments of the primer set Database update function, all continuous subsequences of the primer candidate (words) of desired lengths (e.g. 6nt, 7nt, 8nt, 9nt, and 10nt) are enumerated, and each word updates its corresponding entry in the Database by an update value. In some embodiments, the update value is a constant number that is identical for all words. In some embodiments, the update value is dependent on the number of G or C nucleotides in the word. In some embodiments, the update value is dependent on the position of the word in the primer candidate, with higher update values assigned to words closer to the 3’ end of the primer candidate. In some embodiments, the dependence of the update value on the position of the word is specified by the user, depending on whether the intended DNA polymerase to be used has 3’>5’ proofreading exonuclease activity or not. In some embodiments, the intended DNA polymerase to be used lacks 3’>5’ proofreading exonuclease activity, and update values of words not within A nucleotides of the 3’ end are 0, where A is a constant with a value of 0, 1, or 2. In some embodiments, the intended DNA polymerase to be used possesses 3’>5’ proofreading exonuclease activity, and update values of all words are nonzero.
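A hash-table version of the Database update can be sketched as follows for non-degenerate primers. The position weighting `1 / (1 + distance)` is an assumed example of an update value that increases toward the 3’ end, and `three_prime_reach` plays the role of the constant A for a polymerase lacking proofreading activity. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def update_database(database, primer, k_values=(6, 7, 8, 9, 10),
                    three_prime_reach=None):
    """Add the k-mer words of one selected primer to the Database.

    database: dict-like mapping word -> position-weighted score.
    three_prime_reach: if set, words ending more than this many
        nucleotides from the 3' end receive an update value of 0.
    """
    n = len(primer)
    for k in k_values:
        for start in range(n - k + 1):
            word = primer[start:start + k]
            # 0 when the word ends exactly at the 3' end of the primer.
            distance = n - (start + k)
            if three_prime_reach is not None and distance > three_prime_reach:
                continue
            # Assumed weighting: larger update values closer to the 3' end.
            database[word] += 1.0 / (1 + distance)
    return database


database = defaultdict(float)  # every score starts at 0 for an empty set
update_database(database, "ACGTACGTTAGC")
print(len(database))
```

Each update touches a number of entries that depends only on the primer length and the chosen word lengths, not on how many primers are already in the set.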

[0077] In some embodiments of the primer set Database update function, the primer candidate input comprises one or more degenerate nucleotides, such as Y (representing T or C) or R (representing A or G). In some embodiments of the primer set Database update function, the k-mer words of the primer candidate that contain degenerate nucleotides are enumerated as all possible non-degenerate k-mer words consistent with the k-mer with degenerate nucleotides, and all non-degenerate k-mer words are used to update the Database. In other embodiments of the primer set Database update function, each k-mer word of the primer candidate that contains degenerate nucleotides is converted into two k-mer words comprising non-degenerate nucleotides, one corresponding to fully-methylated (strong) nucleotides and the other corresponding to fully-unmethylated (weak) nucleotides, and these two k-mers are used to update the Database. As an illustration, the 6-mer word "TYGYGA" would be converted into two 6-mer words "TCGCGA" and "TTGTGA". In other embodiments of the primer set Database update function, k-mer words of the primer candidate that contain degenerate nucleotides are ignored and not used to update the Database.
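The three ways of handling degenerate k-mer words described above can be sketched as a single helper. The function name `resolve_word` and the mode labels `"all"`, `"extremes"`, and `"ignore"` are hypothetical names introduced for illustration.

```python
from itertools import product

ALL_OPTIONS = {"Y": "CT", "R": "GA"}
STRONG = {"Y": "C", "R": "G"}  # fully-methylated interpretation
WEAK = {"Y": "T", "R": "A"}    # fully-unmethylated interpretation


def resolve_word(word, mode="extremes"):
    """Return the non-degenerate words used to update the Database."""
    if not any(base in ALL_OPTIONS for base in word):
        return [word]
    if mode == "all":
        choices = [ALL_OPTIONS.get(base, base) for base in word]
        return ["".join(combo) for combo in product(*choices)]
    if mode == "extremes":
        return ["".join(STRONG.get(base, base) for base in word),
                "".join(WEAK.get(base, base) for base in word)]
    if mode == "ignore":
        return []
    raise ValueError(f"unknown mode: {mode}")


print(resolve_word("TYGYGA"))         # strong and weak words only
print(resolve_word("TYGYGA", "all"))  # every consistent word
```

The `"extremes"` mode keeps the number of Database updates per word fixed at two, whereas `"all"` grows exponentially with the number of degenerate positions in the word.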

[0078] In some embodiments of the primer set Database update function, the user may specify a non-zero update value multiplier for reverse complement of k-mer words, and the reverse complement of the k-mer words present as subsequences in the primer candidate are also used to update the Database. In some embodiments, the update value multiplier for reverse complements is approximately 0.1.

6. Primer candidate Loss evaluation function.

[0079] The primer candidate Loss evaluation function takes as input one or more primer candidate sequences, and returns the calculated Loss value for each primer candidate against the current primer set. In some embodiments, the primer candidate Loss evaluation function also takes as input a data structure or a pointer to a data structure representing the presence, quantity, and 3’-weighting of k-mer words present in the primer set before the inclusion of the newly selected primer candidate(s) (the Database). In other embodiments, the primer candidate Loss evaluation function accesses a global data structure representing the presence, quantity, and 3’-weighting of k-mer words present in the primer set.

[0080] In some embodiments, the primer candidate Loss evaluation function calculates the Loss score for a primer candidate against the existing primer set by generating the reverse complements of all k-mer word subsequences of the primer candidate, looking up these k-mer words in the Database, and calculating the weighted sum of the Database scores, each multiplied by a weight constant based on the value of k. In some embodiments, the Loss due to a k-mer in the Database may be calculated as 2^(k-6) * 3^numGC * EntryValue, where k is the length of the k-mer word, numGC is the number of G or C nucleotides in the word, and EntryValue is the score of the word in the Database. Thus, in some embodiments, Database score values for longer words yield exponentially larger contributions to Loss.
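The Loss formula above can be sketched as a lookup over a hash-table Database. This is a minimal illustration that assumes a non-degenerate candidate and a plain dictionary mapping words to scores; the function names `reverse_complement` and `loss` are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a non-degenerate DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def loss(candidate, database, k_values=(6, 7, 8, 9, 10)):
    """Sum 2^(k-6) * 3^numGC * EntryValue over matching k-mer words.

    Each k-mer of the candidate is reverse complemented and looked up
    in the Database; words absent from the Database contribute 0.
    """
    total = 0.0
    for k in k_values:
        for start in range(len(candidate) - k + 1):
            word = reverse_complement(candidate[start:start + k])
            entry = database.get(word, 0.0)
            if entry:
                num_gc = sum(base in "GC" for base in word)
                total += 2 ** (k - 6) * 3 ** num_gc * entry
    return total


# Toy Database holding one 6-mer and one 7-mer from an existing primer set.
toy_database = {"GTACGT": 1.0, "TGTACGT": 0.5}
print(loss("AAACGTACAA", toy_database, k_values=(6, 7)))
```

The work per candidate depends on the candidate length and the number of word lengths, not on the size of the primer set, which is consistent with the linear runtime described for DIMPLE.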

Digital Processing Device.

[0081] In some examples, the subject matter described herein can include a digital processing device or use of the same. In some examples, the digital processing device can include one or more hardware central processing units (CPU), graphics processing units (GPU), or tensor processing units (TPU) that perform the device’s functions. In some examples, the digital processing device can include an operating system configured to perform executable instructions.

[0082] In some examples, the digital processing device can optionally be connected to a computer network. In some examples, the digital processing device may be optionally connected to the Internet. In some examples, the digital processing device may be optionally connected to a cloud computing infrastructure. In some examples, the digital processing device may be optionally connected to an intranet. In some examples, the digital processing device may be optionally connected to a data storage device.

[0083] Non-limiting examples of suitable digital processing devices include server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, handheld computers, Internet appliances, mobile smartphones, and tablet computers. Suitable tablet computers can include, for example, those with booklet, slate, and convertible configurations.

[0084] In some examples, the digital processing device can include an operating system configured to perform executable instructions. For example, the operating system can include software, including programs and data, which manages the device’s hardware and provides services for execution of applications. Non-limiting examples of operating systems include Ubuntu, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Non-limiting examples of suitable personal computer operating systems include Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some examples, the operating system may be provided by cloud computing, and cloud computing resources may be provided by one or more service providers.

[0085] In some examples, the device can include a storage and/or memory device. The storage and/or memory device may be one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In some examples, the device may be volatile memory and require power to maintain stored information. In some examples, the volatile memory can include dynamic random-access memory (DRAM). In some examples, the device may be non-volatile memory and retain stored information when the digital processing device is not powered. In some examples, the non-volatile memory can include flash memory. In some examples, the non-volatile memory can include ferroelectric random access memory (FRAM). In some examples, the non-volatile memory can include phase-change random access memory (PRAM).

[0086] In some examples, the device may be a storage device including, for example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage. In some examples, the storage and/or memory device may be a combination of devices such as those disclosed herein. In some examples, the digital processing device can include a display to send visual information to a user. In some examples, the display may be a cathode ray tube (CRT). In some examples, the display may be a liquid crystal display (LCD). In some examples, the display may be a thin film transistor liquid crystal display (TFT-LCD). In some examples, the display may be an organic light emitting diode (OLED) display. In some examples, an OLED display may be a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some examples, the display may be a plasma display. In some examples, the display may be a video projector. In some examples, the display may be a combination of devices such as those disclosed herein.

[0087] In some examples, the digital processing device can include an input device to receive information from a user. In some examples, the input device may be a keyboard. In some examples, the input device may be a pointing device including, for example, a mouse, trackball, track pad, joystick, game controller, or stylus. In some examples, the input device may be a touch screen or a multi-touch screen. In some examples, the input device may be a microphone to capture voice or other sound input. In some examples, the input device may be a video camera to capture motion or visual input. In some examples, the input device may be a combination of devices such as those disclosed herein.

Non-Transitory Computer-Readable Storage Medium.

[0088] In some examples, the subject matter disclosed herein can include one or more non-transitory computer-readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device. In some examples, a computer-readable storage medium may be a tangible component of a digital processing device. In some examples, a computer-readable storage medium may be optionally removable from a digital processing device. In some examples, a computer-readable storage medium can include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In some examples, the program and instructions may be permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.

Computer Systems.

[0089] The present disclosure provides computer systems that are programmed to implement methods described herein. Figure 10 shows a computer system 101 that is programmed or otherwise configured to store, process, identify, or interpret patient data, biological data, biological sequences, and reference sequences. The computer system 101 can process various aspects of patient data, biological data, biological sequences, or reference sequences of the present disclosure. The computer system 101 may be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device may be a mobile electronic device.

[0090] The computer system 101 comprises a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which may be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 101 also comprises memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and/or electronic display adapters. The memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard. The storage unit 115 may be a data storage unit (or data repository) for storing data. The computer system 101 may be operatively coupled to a computer network (“network”) 130 with the aid of the communication interface 120. The network 130 may be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 130 in some examples is a telecommunication and/or data network. The network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 130, in some examples with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.

[0091] The CPU 105 can execute a sequence of machine-readable instructions, which may be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 110. The instructions may be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.

[0092] The CPU 105 may be part of a circuit, such as an integrated circuit. One or more other components of the system 101 may be included in the circuit. In some examples, the circuit is an application specific integrated circuit (ASIC).

[0093] The storage unit 115 can store files, such as drivers, libraries and saved programs. The storage unit 115 can store user data, e.g., user preferences and user programs. The computer system 101 in some examples can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.

[0094] The computer system 101 can communicate with one or more remote computer systems through the network 130. For instance, the computer system 101 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 101 via the network 130.

[0095] Methods as described herein may be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115. The machine-executable or machine-readable code may be provided in the form of software. During use, the code may be executed by the processor 105. In some examples, the code may be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some examples, the electronic storage unit 115 may be precluded, and machine-executable instructions are stored on memory 110.

[0096] The code may be pre-compiled and configured for use with a machine having a processor adapted to execute the code or may be interpreted or compiled during runtime. The code may be supplied in a programming language that may be selected to enable the code to execute in a precompiled, interpreted, or as-compiled fashion.

[0097] Aspects of the systems and methods provided herein, such as the computer system 101, may be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code may be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements comprises optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[0098] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

[0099] The computer system 101 can include or be in communication with an electronic display 135 that comprises a user interface (UI) 140 for providing, for example, a nucleic acid sequence, an enriched nucleic acid sample, a methylation profile, an expression profile, and an analysis of a methylation or expression profile. Examples of UI’s include, without limitation, a graphical user interface (GUI) and web-based user interface.

[00100] Methods and systems of the present disclosure may be implemented by way of one or more algorithms. An algorithm may be implemented by way of software upon execution by the central processing unit 105. The algorithm can, for example, store, process, identify, or interpret patient data, biological data, biological sequences, and reference sequences.

[00101] While certain examples of methods and systems have been shown and described herein, one of skill in the art will realize that these are provided by way of example only and not intended to be limiting within the specification. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the scope described herein. Furthermore, it shall be understood that all aspects of the described methods and systems are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables and the description is intended to include such alternatives, modifications, variations or equivalents.

[00102] In some examples, the subject matter disclosed herein can include at least one computer program or use of the same. A computer program can include a sequence of instructions, executable in the digital processing device’s CPU, GPU, or TPU, written to perform a specified task. Computer-readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, a computer program may be written in various versions of various languages.

[00103] The functionality of the computer-readable instructions may be combined or distributed as desired in various environments. In some examples, a computer program can include one sequence of instructions. In some examples, a computer program can include a plurality of sequences of instructions. In some examples, a computer program may be provided from one location. In some examples, a computer program may be provided from a plurality of locations. In some examples, a computer program can include one or more software modules. In some examples, a computer program can include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plugins, extensions, add-ins, or add-ons, or combinations thereof.

[00104] In some examples, the computer processing may be a method of statistics, mathematics, biology, or any combination thereof. In some examples, the computer processing method comprises a dimension reduction method including, for example, logistic regression, dimension reduction, principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, tree-based methods, random forest, gradient boost tree, matrix factorization, network clustering, and neural network.

[00105] In some examples, the computer processing method is a supervised machine learning method including, for example, a regression, support vector machine, tree-based method, and network.

[00106] In some examples, the computer processing method is an unsupervised machine learning method including, for example, clustering, network, principal component analysis, and matrix factorization.

Databases.

[00107] In some examples, the subject matter disclosed herein can include one or more databases, or use of the same to store patient data, biological data, biological sequences, or reference sequences. Reference sequences may be derived from a database. In view of the disclosure provided herein, many databases may be suitable for storage and retrieval of the sequence information. In some examples, suitable databases can include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. In some examples, a database may be internet-based. In some examples, a database may be web-based. In some examples, a database may be cloud computing-based. In some examples, a database may be based on one or more local computer storage devices.

[00108] In an aspect, the present disclosure provides a non-transitory computer-readable medium comprising instructions that direct a processor to perform a method disclosed herein.

[00109] In an aspect, the present disclosure provides a computing device comprising the computer-readable medium.

[00110] It will be readily apparent to those skilled in the art that other suitable modifications and adaptations of the devices, systems and methods described herein may be made using suitable equivalents without departing from the scope of the aspects disclosed herein. Having now described certain aspects in detail, the same will be more clearly understood by reference to the following example, which is included for purposes of illustration only and is not intended to be limiting. All patents, patent applications, and references described herein are incorporated by reference in their entirety for all purposes.

EXAMPLES

Example 1: Development of a Pan cancer NGS Panel

[00111] A list of 1,000 targets comprising 23,336 mutations linked to pancreatic, non-small cell lung, ovarian, and breast cancer onset was identified from the AACR PROJECT GENIE database (The AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine Through An International Consortium, Cancer Discov. 2017 Aug;7(8):818-831). The targets were selected to be at most 70 bp long and not to overlap one another within a 150 bp buffer in the 3’ and 5’ regions.

Table 1: Listing of Gene Target Sequences

[00112] Consequently, DIMPLE was launched to design primers for each target with a default redundancy of 4. As a result, a 1000-plex panel was created with a median amplicon size of 140 bp and a median ΔG°_hybridization of -11.6 kcal/mol at 60°C with 0.180 M monovalent salt. A set comprising one and only one set of primers per target was synthesized and tested upon arrival. Corroborating the subtractive attribute of DIMPLE, which designs amplicons to work in the same reaction pot, the initial 1000 targets were tested as 100 10-plex primer sets (each set comprising 10 forward and 10 reverse primers) at a concentration of 15 nM. Reactions were amplified using a CFX96 Touch Real-Time PCR Detection System (Bio-Rad Laboratories) with PowerUP (ThermoFisher) at a final concentration of 1x and primer concentrations of 15 nM each, with 10 ng of template (3,000 genomic copies) as a positive control and a no-template control (NTC) as a negative control. The cycling conditions were as follows: 95°C for 5 minutes, followed by 50 cycles of 95°C for 30 seconds and 60°C for 2 minutes. The results showed a median NTC > 40 cycles, and the positive control had a median of 22 cycles (FIGURE 10A).

[00113] In NGS testing, the complete panel, consisting of 2,000 primers in total, was used at an individual concentration of 7.5 nM in a reaction containing iTaq (BioRad) and 10 ng human genomic DNA or NTC. The cycling conditions were as follows: 95°C for 5 minutes, followed by 50 cycles of 95°C for 30 seconds and 60°C for 2 minutes (FIGURE 10B). Following SPRI purification with 1.8x Agencourt beads (Beckman Coulter), the PCR product was used with the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (NEB) according to the manufacturer’s instructions. Briefly, following an end-repair step, adapters were ligated. Next, following SPRI purification, the libraries were diluted 1:100 and amplified with TruSeq sequencing primers (FIGURE 10C). Finally, after a final purification, the library was QC checked on the Bioanalyzer (Agilent) and sequenced.

[00114] The zero-shot sequencing results revealed that 90% of the amplicons were found, and 70% of the amplicons were above 20% of the median (good uniformity). In terms of on-target reads, approximately 60% of the reads were aligned, and 30% were unmapped or misaligned (FIGURE 10D). Of the 40% of reads that were not aligned, 90% were found to be dimers, 99% of which were produced by at least one of the primers involving 278 targets, resulting in an overall 70% design success rate (FIGURE 10E).

[00115] Removing the primers linked to these targets, that is, applying one round of optimization, yielded 84% aligned reads and 8% unmapped reads (FIGURE 10F). The resulting qPCR showed an NTC delayed to 37 cycles (FIGURE 10G).

Example 2: Development of a Fusion qPCR Assay

[00116] Fusion genes are typically identified by chromosomal coordinates with a resolution at the gene level. However, the exact fusion junction can vary across different subjects. Detection mechanisms such as FISH or even NGS, while capable of identifying alternative constructs, lack the sensitivity to detect low-frequency variants (below 1%) with high confidence. Additionally, fusions such as NTRK and FGR, although actionable (with FDA-approved treatments such as Larotrectinib (Loxo) and Entrectinib (Genentech) for all cancers with NTRK fusions), have only a 1% incidence. This makes the cost of NGS unjustifiable from a cost-per-positive standpoint.

[00117] A highly multiplexed qPCR assay designed with DIMPLE enables the screening and monitoring of hundreds of fusion breakpoints simultaneously, with a limit of detection expected to be well below 1%. As a proof of concept, DIMPLE was used to design a 274-primer assay covering major fusions of KMT2A and NUP98, focusing on the top 12 fusion partner genes for KMT2A (over 95% of actionable KMT2A fusions in leukemia) and 3 partner genes for NUP98, covering 90% of the actionable NUP98 fusions.

[00118] The assay is designed so that the forward primer targets the gene of interest across all relevant exons, and the reverse primer targets all exons in the most frequent gene partners in RNA samples, covering up to ~3,400 constructs (Figure 11A).

Table 2: Listing of Relevant Exons in Genes of Interest

[00119] Utilizing TaqMan probes as the detection moiety in qPCR, which allows further differentiation across different actionable fusions, the assay was tested with synthetic gene fragments serially diluted in cDNA from healthy individuals, down to 5 copies per reaction in 10,000 total copies of the KMT2A gene (Figures 11B and 11C).

Table 3: Listing of Detection Probes

[00120] Additionally, the assay was used to test RNA extracted from the cell line RS4;11 (CRL-1873) bearing the KMT2A::AFF1 fusion, serially diluted in RNA from leukocytes from healthy individuals. The assay was able to detect the KMT2A::AFF1 fusion down to 0.01% in the diluted sample.

Claims

1. A computer implemented method for designing a multiplex PCR primer set to amplify a plurality of nucleic acid (NA) template sequences, comprising the steps of: a. Generating a set of Semilocus sequences from the NA template sequences; b. Generating a set of primer candidates for each Semilocus; c. Evaluating a Fitness score for each primer candidate; and d. Performing the following steps (i) to (iv) iteratively for at least 5, 10, 20, 30, 50, 100, 200, 500, 1000, 2000, 3000, 4000 or 5000 cycles: i. selecting a set of Semiloci, ii. selecting a set of primer candidates for each of the selected Semiloci, iii. evaluating a Loss score for each selected primer candidate against a pre-existing primer set to determine an FL Value for each selected primer candidate, and iv. adding the selected primer candidate with the highest FL Value to the pre-existing primer set.
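By way of illustration only, and without limiting the claim, the iterative selection of step (d) can be sketched in Python as follows. The sketch assumes that the FL Value is the Fitness score minus a weighted Loss score, which the claim does not state explicitly; the function name, the sampling sizes, the `loss_weight` parameter, and the toy scoring callables are all hypothetical.

```python
import random
from typing import Callable, Dict, List


def greedy_primer_selection(
    candidates: Dict[str, List[str]],         # Semilocus name -> primer candidates
    fitness: Callable[[str], float],          # step (c): Fitness score per candidate
    loss: Callable[[str, List[str]], float],  # step (iii): Loss vs. the current set
    cycles: int,
    semiloci_per_cycle: int = 2,              # assumed sampling size
    candidates_per_semilocus: int = 3,        # assumed sampling size
    loss_weight: float = 1.0,                 # assumed weighting of the Loss score
    seed: int = 0,
) -> List[str]:
    """Add the candidate with the highest FL Value to the set once per cycle."""
    rng = random.Random(seed)
    primer_set: List[str] = []  # the pre-existing primer set starts empty
    names = sorted(candidates)
    for _ in range(cycles):
        best, best_fl = None, float("-inf")
        # (i) select a set of Semiloci
        for name in rng.sample(names, min(semiloci_per_cycle, len(names))):
            pool = [c for c in candidates[name] if c not in primer_set]
            # (ii) select a set of primer candidates for each selected Semilocus
            for cand in rng.sample(pool, min(candidates_per_semilocus, len(pool))):
                # (iii) assumed form: FL = Fitness - loss_weight * Loss
                fl = fitness(cand) - loss_weight * loss(cand, primer_set)
                if fl > best_fl:
                    best, best_fl = cand, fl
        # (iv) add the best candidate of this cycle to the primer set
        if best is not None:
            primer_set.append(best)
    return primer_set


# Toy usage: Fitness is the primer length and Loss is always zero.
toy = {"s1": ["AAAA", "AAAAAA"], "s2": ["CC"]}
print(greedy_primer_selection(toy, fitness=len, loss=lambda c, s: 0.0, cycles=3))
```

Passing the Fitness and Loss functions as callables keeps the loop independent of how those scores are actually computed.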

2. The computer implemented method of claim 1, wherein generating a set of Semilocus sequences comprises the steps of: a. Defining a plurality of context sequences, b. Specifying a key sequence within each context sequence, and c. Producing a pair of Semilocus sequences from each context sequence, wherein the nucleotides to the 5’ of the key sequence are considered the first Semilocus sequence, and the reverse complement of all nucleotides to the 3’ of the key sequence is considered the second Semilocus sequence.
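As a non-limiting illustration of step (c), the following Python sketch splits one context sequence into its pair of Semilocus sequences. The function names and the half-open `key_start`/`key_end` indexing are assumptions made for the example, and only the unambiguous bases A, C, G, and T are handled.

```python
from typing import Tuple

# Complement table for unambiguous DNA bases (degenerate codes are not handled).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def make_semiloci(context: str, key_start: int, key_end: int) -> Tuple[str, str]:
    """Split a context sequence around its key sequence.

    The key sequence is assumed to occupy context[key_start:key_end].
    The first Semilocus is everything 5' of the key sequence; the second
    is the reverse complement of everything 3' of the key sequence.
    """
    if not 0 <= key_start < key_end <= len(context):
        raise ValueError("key sequence must lie within the context sequence")
    first = context[:key_start]
    second = reverse_complement(context[key_end:])
    return first, second


# Toy context: 8 nt upstream, a 2 nt key sequence ("CG"), 8 nt downstream.
first, second = make_semiloci("ACGTTGCA" + "CG" + "GGATCCAA", 8, 10)
print(first, second)
```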

3. The computer implemented method of claim 2, wherein each context sequence is at least about 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, 300, 400, 500, 600, 700, 800, 900 or 1000 nucleotides long.

4. The computer implemented method of claim 2, wherein each context sequence is at most about 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, 300, 400, 500, 600, 700, 800, 900 or 1000 nucleotides long.

5. The computer implemented method of claim 2, wherein each context sequence has a length between 40 and 1000 nucleotides.

6. The computer implemented method of any of claims 2 to 5, wherein the key sequence corresponds to (i) the 1 or 2 nucleotides located at the midpoint of the context sequence, (ii) a polymorphic sequence site, and/or (iii) a sequence comprising 3 or more CpG sites.

7. The computer implemented method of claim 6, wherein the polymorphic sequence site corresponds to a CpG site or a DNA methylation site.

8. The computer implemented method of any of claims 2 to 5, wherein the key sequence is between 6 and 40 nucleotides long.

9. The computer implemented method of any of claims 1 to 8, wherein each primer candidate comprises a melting temperature ranging from 55°C to 72°C at a concentration between 10 pM and 5 μM.

10. The computer implemented method of any of claims 1 to 8, wherein each primer candidate comprises a standard free energy (ΔG°) from -10 kcal/mol to -15 kcal/mol at a temperature between 42°C and 72°C and a monovalent cationic salinity between 0.04 M Na+ and 0.45 M Na+.

11. The computer implemented method of claim 10, wherein the standard free energy (ΔG°) is between -10 and -11, between -10 and -12, between -10 and -13, between -10 and -14, between -11 and -12, between -11 and -13, between -11 and -14, between -11 and -15, between -12 and -13, between -12 and -14, between -12 and -15, between -13 and -14, between -13 and -15, or between -14 and -15 kcal/mol.

12. The computer implemented method of any of claims 1 to 8, wherein each primer candidate comprises a standard free energy (ΔG°) of about -12 kcal/mol at a temperature of about 60°C and a monovalent cationic salinity of about 0.18 M Na+.

13. The computer implemented method of any of claims 1 to 12, wherein each primer candidate comprises a template-complementary region ranging between 10 nt and 50 nt.

14. The computer implemented method of any of claims 1 to 13, wherein each primer candidate comprises a GC content between 0.10 and 0.90, wherein the GC content is defined as the number of Guanine or Cytosine nucleotides divided by the number of total nucleotides in the primer candidate.
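For illustration only, the GC content definition above can be computed as in the following Python sketch. The function names are hypothetical, and degenerate nucleotide codes are simply counted as non-G/C bases, which is an assumption of this example.

```python
def gc_content(primer: str) -> float:
    """Number of G or C nucleotides divided by the total number of nucleotides."""
    if not primer:
        raise ValueError("primer must not be empty")
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_in_range(primer: str, low: float = 0.10, high: float = 0.90) -> bool:
    """Check the 0.10 to 0.90 GC content window for a primer candidate."""
    return low <= gc_content(primer) <= high


print(gc_content("ACGTACGTAC"), gc_in_range("ACGTACGTAC"))
```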

15. The computer implemented method of any of claims 1 to 14, wherein each primer candidate comprises a 5’ overhang sequence not complementary to the corresponding context sequence.

16. The computer implemented method of any of claims 1 to 15, wherein the Fitness score is initially defined to be a fixed value and reduced subsequently based on the presence of one or more undesirable features in a primer candidate.

17. The computer implemented method of claim 13, wherein the undesirable features are selected from the group consisting of (1) sequence lengths longer than a first threshold (e.g. 35nt), (2) sequence lengths shorter than a second threshold (e.g. 15nt), (3) presence of homopolymers of at least a threshold length (e.g. 6nt), (4) presence and/or quantity of degenerate nucleotides in the primer candidate, (5) presence of self-complementary subsequences, and (6) predicted quantity and/or strength of nonspecific binding interactions to other regions of a background genome sequence.
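As a non-limiting sketch, a Fitness score that starts at a fixed value and is reduced for undesirable features might look as follows in Python. The starting value and every penalty magnitude are assumptions chosen for the example; the self-complementarity and nonspecific-binding features (5) and (6) are omitted because they require thermodynamic or genome-wide calculations.

```python
import re

BASE_FITNESS = 100.0         # assumed fixed starting value
LENGTH_PENALTY = 20.0        # assumed penalty for features (1) and (2)
HOMOPOLYMER_PENALTY = 15.0   # assumed penalty for feature (3)
DEGENERATE_PENALTY = 5.0     # assumed per-base penalty for feature (4)


def fitness(primer: str, max_len: int = 35, min_len: int = 15,
            homopolymer_len: int = 6) -> float:
    """Start from a fixed score and subtract a penalty per undesirable feature."""
    seq = primer.upper()
    score = BASE_FITNESS
    if len(seq) > max_len:   # (1) longer than the first threshold
        score -= LENGTH_PENALTY
    if len(seq) < min_len:   # (2) shorter than the second threshold
        score -= LENGTH_PENALTY
    # (3) a run of at least homopolymer_len identical nucleotides
    if re.search(r"(.)\1{%d,}" % (homopolymer_len - 1), seq):
        score -= HOMOPOLYMER_PENALTY
    # (4) each nucleotide that is not A, C, G, or T counts as degenerate
    score -= DEGENERATE_PENALTY * sum(1 for base in seq if base not in "ACGT")
    return score


print(fitness("ACGTACGTACGTACGTAC"))
```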

18. The computer implemented method of any of claims 1 to 17, wherein the pre-existing primer set in the first cycle of step (d) is an empty primer set.

19. The computer implemented method of any of claims 1 to 17, wherein the Loss score is initially defined to be a fixed value and increased subsequently based on the presence of sequence features in (i) the primer candidate and (ii) the pre-existing primer set that could result in undesirable primer dimer formation.

20. The computer implemented method of any of claims 1 to 19, wherein the Loss score is computed through a weighted sum of k-mer words that are reverse complementary between the primer candidate and any primer in the pre-existing primer set, and optionally, only 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words are considered.

21. The computer implemented method of claim 20, wherein the weighting of the presence of the 6-mer, 7-mer, 8-mer, 9-mer, and 10-mer words are different depending on the position of these words in the primer candidate and in the corresponding primer in the pre-existing primer set.

22. The computer implemented method of claim 20, wherein the Loss score is computed through a hash table comprising the k-mer words in the primers of the pre-existing primer set.
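By way of example only, a hash-table Loss computation over 6-mer to 10-mer words can be sketched in Python as follows. The per-length weights are assumptions, the position-dependent weighting is omitted, and the function names are hypothetical; only unambiguous upper-case A, C, G, and T are handled.

```python
from collections import Counter
from typing import Iterable, Iterator

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Assumed weights: longer complementary words are penalized more heavily.
K_WEIGHTS = {6: 1.0, 7: 2.0, 8: 4.0, 9: 8.0, 10: 16.0}


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer word of seq (nothing if seq is shorter than k)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_kmer_table(primers: Iterable[str]) -> Counter:
    """Hash table counting the k-mer words found in the pre-existing primer set."""
    table: Counter = Counter()
    for primer in primers:
        for k in K_WEIGHTS:
            table.update(kmers(primer, k))
    return table


def loss(candidate: str, table: Counter) -> float:
    """Weighted count of candidate words whose reverse complement is in the table."""
    total = 0.0
    for k, weight in K_WEIGHTS.items():
        for word in kmers(candidate, k):
            total += weight * table[reverse_complement(word)]
    return total


table = build_kmer_table(["ACCTGAC"])
print(loss("GTCAGGT", table), loss("AAAAAAA", table))
```

With the table built once, scoring a candidate costs time proportional to its own length rather than to the number of primers already selected, which is the practical reason to use a hash table here.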

23. The computer implemented method of any of claims 1 to 22, wherein a dynamic Loss Weighting adjustment is applied when determining the FL Value for each selected primer candidate.

24. The computer implemented method of claim 23, wherein the Loss Weighting starts at an initial value (e.g. 1) and is reduced in later cycles.

25. The computer implemented method of claim 24, wherein the Loss Weighting is reduced when a large fraction of primer candidates evaluated have Loss Weighting multiplied by Loss score exceeding the corresponding primer candidate’s Fitness score.

26. The computer implemented method of claim 24, wherein the Loss Weighting is reduced based on the number of primers in the pre-existing primer set and the number of context sequences, with the Loss Weighting being reduced more as the number of primers in the pre-existing primer set decreases more relative to the number of context sequences.

27. The computer implemented method of claim 24, wherein the Loss Weighting is reduced when the FL Values are negative for all primer candidates evaluated for a Semilocus.
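As an illustrative sketch only, one way to reduce the Loss Weighting when a large fraction of evaluated candidates have a weighted Loss exceeding their Fitness score is shown below in Python. The fraction threshold and the multiplicative decay factor are assumptions, as is the function name; the claims do not specify how much the weighting is reduced.

```python
from typing import Sequence


def adjust_loss_weight(weight: float,
                       fitness_scores: Sequence[float],
                       loss_scores: Sequence[float],
                       fraction_threshold: float = 0.5,  # assumed "large fraction"
                       decay: float = 0.5) -> float:     # assumed reduction factor
    """Return a reduced Loss Weighting if weighted Loss dominates Fitness too often."""
    n = len(fitness_scores)
    if n == 0 or n != len(loss_scores):
        raise ValueError("need one Fitness and one Loss score per candidate")
    # Count candidates whose weighted Loss exceeds their Fitness score.
    dominated = sum(1 for f, l in zip(fitness_scores, loss_scores) if weight * l > f)
    if dominated / n > fraction_threshold:
        return weight * decay
    return weight


print(adjust_loss_weight(1.0, [10.0, 10.0, 10.0], [20.0, 20.0, 1.0]))
```

Setting `fraction_threshold` just below 1 approximates the variant in which the weighting is reduced only when every candidate evaluated for a Semilocus has a negative FL Value, under the assumption that the FL Value is Fitness minus weighted Loss.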

28. The computer implemented method of any of claims 1 to 27, wherein the method generates a primer set that covers each of the plurality of nucleic acid (NA) template sequences.

29. The computer implemented method of any of claims 1 to 27, wherein the method generates a primer set that does not cover each of the plurality of nucleic acid (NA) template sequences.

30. The computer implemented method of any of claims 1 to 29, wherein a user-definable Redundancy Penalty is applied when determining the FL Value for each selected primer candidate, which optionally results in the designing of one or more backup primers or nested primers.

31. The computer implemented method of any of claims 1 to 29, wherein a Redundancy Penalty is applied as a dynamic update to the Fitness score of all remaining primer candidates to a specific Semilocus, after one primer candidate for that specific Semilocus is selected to be included into the pre-existing primer set.

32. The computer implemented method of any of claims 1 to 29, wherein a Redundancy Penalty is applied as a separate term in the FL value.
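For illustration only, the variant in which the Redundancy Penalty is applied as a dynamic update to the Fitness scores of the remaining candidates for a Semilocus can be sketched in Python as follows. The function name, the dictionary layout, and the penalty magnitude are hypothetical.

```python
from typing import Dict


def apply_redundancy_penalty(fitness_by_candidate: Dict[str, float],
                             selected: str,
                             penalty: float) -> Dict[str, float]:
    """Drop the selected candidate and lower the Fitness of the remaining ones.

    fitness_by_candidate holds the candidates of a single Semilocus.
    """
    if selected not in fitness_by_candidate:
        raise KeyError(selected)
    return {cand: score - penalty
            for cand, score in fitness_by_candidate.items()
            if cand != selected}


# After "p1" is added to the primer set, its siblings become less attractive.
remaining = apply_redundancy_penalty({"p1": 100.0, "p2": 90.0, "p3": 80.0}, "p1", 25.0)
print(remaining)
```

A small penalty leaves sibling candidates competitive, which tends to yield backup or nested primers for the same Semilocus, whereas a large penalty steers later cycles toward Semiloci that are not yet covered.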

33. The computer implemented method of any of claims 1 to 32, wherein the method generates a primer set that comprises nested primers for at least some of the nucleic acid (NA) template sequences.

34. The computer implemented method of any of claims 1 to 33, wherein the method generates a multiplex primer set of at least 100, 200, 300, 500, 750, 1000, 2000, 5000, 7500, 10000, 15000, 20000, 25000 or 30000 plexity.

35. The computer implemented method of any of claims 1 to 34, wherein a random noise is included when determining the Fitness score or the FL Value.

36. The computer implemented method of any of claims 1 to 34, wherein the selected primer candidates with negative FL values are removed from the primer candidate pool of a Semilocus.

37. The computer implemented method of any of claims 1 to 36, wherein the plurality of nucleic acid (NA) template sequences comprises methyl-converted DNA sequences.

38. The computer implemented method of any of claims 1 to 36, wherein the plurality of nucleic acid (NA) template sequences comprises sequences from cell-free DNA fragments.

39. The computer implemented method of any of claims 1 to 36, wherein at least some primer candidates comprise one or more degenerate nucleotides.

40. A computer-readable medium comprising codes that, upon execution by one or more processors, cause said one or more processors to implement the method as set out in any one of claims 1 to 39.

41. A computer system for designing a multiplex PCR primer set, the system comprising: a. a non-transitory memory configured to store executable instructions; and b. one or more processors alone or in combination programmed to perform a method according to any one of claims 1 to 39.