
WO2015164517A1 - Methods and systems for trained melting profile analysis for reliable genotyping of sequence variants - Google Patents

Info

Publication number
WO2015164517A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequences
sequence
primer
pcr
identifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2015/027120
Other languages
English (en)
Inventor
Vatsal AGARWAL
Pornpat ATHAMANOLAP
Stephanie FRALEY
Michael A. Jacobs
Vishwa PAREKH
Jeff Tza-Huei Wang
Samuel Yang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Johns Hopkins University
Original Assignee
Johns Hopkins University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Johns Hopkins University
Publication of WO2015164517A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844: Nucleic acid amplification reactions
    • C12Q1/686: Polymerase chain reaction [PCR]

Definitions

  • the present invention relates to the field of molecular biology. More specifically, the present invention provides genotyping tools having broad clinical, diagnostic and research applications including infectious diseases, oncology, inherited diseases, and epigenetics.
  • HRM High Resolution Melting
  • the present invention provides methods, kits, primers, probes, and systems for identifying genotype presence in a biological sample based on signature melting curves derived from nucleic acid dissociation.
  • the present invention further provides methods and systems for designing minimal conserved amplification primer sets and unlabeled probes for use in genotype discrimination for any given large sequence data set.
  • the present invention provides methods and systems for automated
  • the present invention comprises methods and systems for identifying an unknown genotype in a biological sample by training a machine to reliably classify the melting curve signature of the unknown genotype against a database of known genotypes.
  • the training is designed with learned tolerance for run-to-run or inter-operator reaction variability, such that the more data used for training, the higher the reliability.
  • High resolution melt is gaining considerable popularity as a simple and robust method for genotyping sequence variants.
  • accurate genotyping of an unknown sample for which a large number of possible variants may exist requires an automated HRM curve identification method capable of comparing unknowns against a large cohort of known sequence variants.
  • the present inventors provide a new method for automated HRM curve classification based on machine learning methods and learned tolerance for reaction condition deviations. The method was tested in silico through multiple cross-validations using curves generated from 9 different simulated experimental conditions to classify 92 known serotypes of Streptococcus pneumoniae and demonstrated over 99% accuracy with 8 training curves per serotype. In vitro verification of the algorithm was tested using sequence variants of a cancer-related gene and demonstrated 100% accuracy with 3 training curves per sequence variant.
  • the machine learning algorithm enabled reliable, scalable, and automated HRM genotyping analysis with broad potential clinical and epidemiological applications.
  • the present inventors have developed a genotyping approach using a linear-kernel, one-vs-one ensemble multiclass support vector machine (SVM) classification algorithm to recognize and identify shapes of sequence-specific DNA melt curves with high accuracy.
  • the method relies on PCR amplification of highly variable sequences flanked by conserved primer sites within the target gene.
  • Subsequent HRM generates a melt curve shape that is sequence specific but also susceptible to slight temperature variation from experiment to experiment.
  • Applying the primer-finding algorithm and SVM classifier of the present invention in silico for pneumococcal serotyping resulted in the use of only 7 primer pairs for the identification of all 92 pneumococcal serotypes with 99.9% accuracy and high reproducibility.
  • the SVM algorithm accuracy was tested experimentally using HRM curves generated by six RASSF1A DNA sequence variants, where an accuracy of 100% was achieved.
  • the method relied on PCR amplification of highly variable sequences flanked by conserved sequences within the target gene, followed by HRMA to generate sequence-specific melt curves. Curve shape and position, which are characteristic of sequence composition, can be used to discriminate sequence variants. Its simplicity, speed, low cost, ease of use, flexibility, and high sensitivity/specificity make HRM an attractive genotyping tool with broad potential clinical diagnostic and research applications.
  • methods of curve matching relied on either arbitrary visual inspection, subtraction
  • the present invention provides a primer selection/design algorithm for designing the minimal conserved PCR primer sets flanking hypervariable sequences for use in sequence/genotype discrimination based on melting curve profiles in any given sequence data set.
  • a method for unlabeled probe selection/design for use with conserved primer sets can be used to further genotype and discriminate down to single nucleotide variants based on combined melting curve profiles.
  • the present invention also provides methods for automated normalization of melting curves to improve comparison of similar or distinct curves.
  • the methods of the present invention further comprise utilizing calibration probes to improve melting curve normalization and analysis between multiple reactions/runs.
  • a method for automated genotype identification of an unknown nucleic acid in a biological sample comprises using a training machine to recognize and classify melt curve pattern from the unknown genotype against a known database of melt curves from known genotypes.
  • the method can further improve reliability of melt curve classification through repeat training with experimental datasets of verified known genotypes derived from same/different day or operator runs. Such training enables learned tolerance for run-to-run or inter-operator variability and improves reliability in melt curve classification.
  • the present invention provides methods for identifying an unknown genotype of a bacteria from a biological sample.
  • the methods comprise the steps of (a) performing a polymerase chain reaction (PCR) assay of bacterial nucleic acid isolated from a biological sample comprising the bacteria, wherein the PCR assay amplifies at least one highly variable sequence flanked by conserved sequences within a target gene of the bacteria; (b) performing high resolution melting analysis (HRMA) to generate a sequence-specific melting curve signature of the PCR amplicon; and (c) identifying the genotype of the bacteria using a one-versus-one ensemble support vector machine algorithm with a linear kernel that classifies the melting curve signature of the unknown genotype against a database of known genotypes.
  • PCR polymerase chain reaction
  • HRMA high resolution melting analysis
  • the database of genotypes of the bacteria comprises melting curve signatures of the same amplicons of the target gene of the bacteria.
  • the PCR assay uses primer sets that anneal to the conserved sequences within the target gene of the bacteria.
  • the bacteria is Streptococcus pneumoniae.
  • the target gene is the capsule polysaccharide synthesis gene.
  • a method for identifying an unknown genotype of a subject from a biological sample comprises the steps of (a) performing a polymerase chain reaction (PCR) assay of nucleic acid isolated from a biological sample obtained from the subject, wherein the PCR assay amplifies at least one highly variable sequence flanked by conserved sequences within a target gene; (b) performing high resolution melting analysis (HRMA) to generate a sequence specific melting curve signature of the PCR amplicon; and (c) identifying the genotype of the subject using a one-versus-one ensemble support vector machine algorithm with a linear kernel that classifies the melting curve signature of the unknown genotype against a database of known genotypes.
  • the database of genotypes comprises melting curve signatures of the same amplicons of the target gene.
  • the PCR assay uses primer sets that anneal to the conserved sequences within the target gene.
  • the subject is a human.
  • the genotype is a methylated genotype.
  • the present invention provides methods for designing a minimum set of conserved primers capable of amplifying known serotypes of a bacteria, with each primer set flanking regions of high sequence variability for serotype discrimination.
  • the method comprises the steps of (a) providing a sequence from a target gene of each of the known bacterial serotypes; (b) aligning the sequences using a multiple sequence alignment program; (c) identifying exact-matched primer pairs of at least about 18 nucleotides within about 500 base pairs of each other; (d) identifying regions between the primer pairs of step (c) to determine how many sequences could be discriminated by each primer set with single-nucleotide-difference sensitivity; (e) selecting the primer pair that provides the maximum number of distinguishable sequences; (f) creating a new sequence set from the remaining indistinguishable sequences; and (g) repeating steps (b)-(e) to create the minimum number of primer sets that can amplify all known bacterial serotypes.
  • a method for designing a minimum set of conserved primers capable of amplifying known genotypes of a subject, with each primer set flanking regions of high sequence variability for genotype discrimination, comprises the steps of (a) providing a sequence from a target gene of each of the known genotypes of the subject; (b) aligning the sequences using a multiple sequence alignment program; (c) identifying exact-matched primer pairs of at least about 18 nucleotides within about 500 base pairs of each other; (d) identifying regions between the primer pairs of step (c) to determine how many sequences could be discriminated by each primer set with single-nucleotide-difference sensitivity; (e) selecting the primer pair that provides the maximum number of distinguishable sequences; (f) creating a new sequence set from the remaining indistinguishable sequences; and (g) repeating steps (b)-(e) to create the minimum number of primer sets that can amplify all known genotypes of the subject.
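Steps (c) through (g) amount to a greedy selection loop. The following is a minimal Python sketch of that loop, not the inventors' implementation. It assumes the candidate primer pairs, and the amplicon each pair would produce from each input sequence, have already been enumerated; the `candidates` dictionary layout, the function names, and the toy sequences are illustrative assumptions, and the re-alignment of step (b) on each pass is omitted for brevity.

```python
from collections import Counter


def distinguishable(amplicons):
    """Return the IDs whose amplicon differs from every other amplicon given."""
    counts = Counter(amplicons.values())
    return {sid for sid, amp in amplicons.items() if counts[amp] == 1}


def select_primer_sets(candidates, sequence_ids):
    """Greedily pick primer pairs until no remaining sequence can be resolved.

    candidates: {primer_pair_name: {sequence_id: amplicon}} (hypothetical layout).
    Returns (chosen primer pairs in order, IDs that stayed indistinguishable).
    """
    remaining = set(sequence_ids)
    chosen = []
    while remaining:
        best_pair, best_resolved = None, set()
        for pair, amps in candidates.items():
            # Only compare amplicons among the sequences not yet resolved.
            subset = {sid: amp for sid, amp in amps.items() if sid in remaining}
            resolved = distinguishable(subset)
            if len(resolved) > len(best_resolved):
                best_pair, best_resolved = pair, resolved
        if not best_resolved:
            break  # no candidate separates what is left
        chosen.append(best_pair)
        remaining -= best_resolved
    return chosen, remaining


# Toy input: P1 separates s1 and s2, P2 separates s3 and s4.
toy_candidates = {
    "P1": {"s1": "AAAA", "s2": "AAAT", "s3": "GGGG", "s4": "GGGG"},
    "P2": {"s1": "CC", "s2": "CC", "s3": "CA", "s4": "CT"},
}
chosen, unresolved = select_primer_sets(toy_candidates, ["s1", "s2", "s3", "s4"])
```

Ties between candidate pairs are broken here by dictionary order, which is an arbitrary choice; a real implementation might prefer the pair with better primer properties.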
  • Polynucleotides, polypeptides, or other agents are purified and/or isolated.
  • an "isolated" or "purified" nucleic acid molecule, polynucleotide, polypeptide, or protein is substantially free of other cellular material, or culture medium when produced by recombinant techniques, or chemical precursors or other chemicals when chemically synthesized.
  • Purified compounds are at least 60% by weight (dry weight) the compound of interest.
  • the preparation is at least 75%, more preferably at least 90%, and most preferably at least 99%, by weight the compound of interest.
  • a purified compound is one that is at least 90%, 91%, 92%, 93%, 94%, 95%, 98%, 99%, or 100% (w/w) of the desired compound by weight. Purity is measured by any appropriate standard method, for example, by column chromatography, thin layer chromatography, or high-performance liquid chromatography (HPLC) analysis.
  • a purified or isolated polynucleotide ribonucleic acid (RNA) or deoxyribonucleic acid (DNA)
  • RNA ribonucleic acid
  • DNA deoxyribonucleic acid
  • a purified or isolated polypeptide is free of the amino acids or sequences that flank it in its naturally-occurring state. Purified also defines a degree of sterility that is safe for administration to a human subject, e.g., lacking infectious or toxic agents.
  • substantially pure is meant a nucleotide or polypeptide that has been separated from the components that naturally accompany it.
  • the nucleotides and polypeptides are substantially pure when they are at least 60%, 70%, 80%, 90%, 95%, or even 99%, by weight, free from the proteins and naturally-occurring organic molecules with which they are naturally associated.
  • "Conservatively modified variations" of a particular polynucleotide sequence refers to those polynucleotides that encode identical or essentially identical amino acid sequences, or where the polynucleotide does not encode an amino acid sequence, to essentially identical sequences. Because of the degeneracy of the genetic code, a large number of functionally identical nucleic acids encode any given polypeptide. For instance, the codons CGU, CGC, CGA, CGG, AGA, and AGG all encode the amino acid arginine. Thus, at every position where an arginine is specified by a codon, the codon can be altered to any of the
  • substitutions in which one or a few amino acids in an amino acid sequence are replaced with different amino acids having highly similar properties are also readily identified as being highly similar to a particular amino acid sequence, or to a particular nucleic acid sequence which encodes an amino acid. Such conservatively substituted variations of any particular sequence are a feature of the present invention.
  • isolated nucleic acid is meant a nucleic acid that is free of the genes which flank it in the naturally-occurring genome of the organism from which the nucleic acid is derived.
  • the term covers, for example: (a) a DNA which is part of a naturally occurring genomic DNA molecule, but is not flanked by both of the nucleic acid sequences that flank that part of the molecule in the genome of the organism in which it naturally occurs; (b) a nucleic acid incorporated into a vector or into the genomic DNA of a prokaryote or eukaryote in a manner, such that the resulting molecule is not identical to any naturally occurring vector or genomic DNA; (c) a separate molecule such as a synthetic complementary DNA (cDNA), a genomic fragment, a fragment produced by polymerase chain reaction (PCR), or a restriction fragment; and (d) a recombinant nucleotide sequence that is part of a hybrid gene, i.e., a gene encoding a
  • Isolated nucleic acid molecules according to the present invention further include molecules produced synthetically, as well as any nucleic acids that have been altered chemically and/or that have modified backbones.
  • the isolated nucleic acid is a purified cDNA or RNA polynucleotide.
  • Isolated nucleic acid molecules also include messenger ribonucleic acid (mRNA) molecules.
  • Non-transitory computer program products (i.e., physically embodied computer program products) store instructions which, when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform operations herein.
  • computer systems are also described that may include one or more data processors and memory coupled to the one or more data processors. The memory may temporarily or permanently store instructions that cause at least one processor to perform one or more of the operations described herein.
  • methods can be implemented by one or more data processors either within a single computing system or distributed among two or more computing systems.
  • Such computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g. the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.
  • FIG. 1 is a sequence alignment showing Gblocks output.
  • the blue-highlight underneath represents the region that passes the criteria according to the parameters and this region will be considered as a candidate to be a primer.
  • FIG. 2 is an illustration of the ensemble binary classifiers. Each classifier is used to differentiate two classes and the score is counted for each serotype. The result is based on the serotype that returns the highest score.
  • FIG. 3 is a line graph showing predicted melt curves of serotype 1 with the first primer set across 9 different conditions.
  • the predicted melt curves were generated using uMelt under 9 different conditions, namely all combinations of [Na+, K+]: 47 mM, 50 mM, and 53 mM and [Mg2+]: 1.4 mM, 1.5 mM, and 1.6 mM.
  • FIG. 4 is a line graph showing the accuracy of different classifiers under different conditions.
  • the horizontal axis shows the different Na+, K+ and Mg2+ concentrations, respectively, that were used to generate the predicted curves.
  • The vertical axis shows accuracy as a percentage. The curves, labeled in the legend, represent the performance of the different classifiers.
  • FIG. 5 is a three-dimensional bar chart showing the average accuracy of the classifier under different conditions.
  • the horizontal axis shows the different Na+, K+ and Mg2+ concentrations, respectively, that were used to generate the predicted curves.
  • Grey bars represent the accuracy of using one data set to train the classification model and using the remaining 8 data sets to test the model.
  • the left-most bar is the result of using the first condition (47 mM [Na+, K+] and 1.4 mM [Mg2+]) as testing data; colors show the results of varying the number of training data sets, as indicated in the legend.
  • FIG. 6 is a line graph showing the experimental melt curves of six synthetic DNA sequences with different numbers of 'CG' sites, from two duplicate experiments run on different days. Different colors represent different sequences as shown in the legend. The fully methylated sequence, with 10 'CG' sites, is shown in dark blue. Two 'CG' sites at a time were changed to 'TG' to give the next target (8 'CG' sites), and so on, until all 'CG' sites were changed to 'TG' (0 'CG' sites, non-methylated), shown in light blue.
  • the present inventors developed a method for broad-based classification of melt curves based on a one-versus-one ensemble SVM algorithm with a linear kernel. This enabled 97-100% identification accuracy of melt curves in the data set.
  • the SVM outperformed three different classification methods, Naive Bayes, PCA followed by LDA and k Nearest Neighbors. Only the newly developed PCA-LDA method and SVM yielded high accuracy.
  • the PCA-LDA model could be challenging since it requires a two-step procedure and the method is dependent on the eigenvectors selected from PCA to run with LDA.
  • the PCA-LDA algorithm needs to be trained for each new dataset being tested.
  • This additional training increases both the time and space complexity of the algorithm as compared to SVM.
  • the SVM requires training once, which reduces the computational complexity while also achieving increased accuracy.
  • the SVM classification model incorporates a machine learning algorithm that learns the unique characteristics of each melt curve by training multiple times with curves generated under slightly varying conditions. This training method not only enhanced the robustness and tolerance of the model against experimental variability but also increased the accuracy of identification as we concluded from k-fold cross validation test results, i.e., the more data used to train our model, the higher the accuracy achieved.
  • HRM should be capable of resolving a significant number of sequence variant alleles within a genetic locus.
  • the present inventors have developed a primer design algorithm which can generate the minimal set of PCR primers flanking hypervariable segments needed to discriminate all the input sequences of a target gene. It takes into account multiple amplicons which give multiple melting sites and optimal amplicon lengths to enhance the discriminatory power. Subsequent SVM-based analysis of unknown curves derived from these primer sets against a large data set of known controls would then allow for sequence, or microbial, identification.
  • the present inventors also demonstrated experimentally that the invention could be used for epigenetic research applications.
  • epigenetic analysis of DNA methylation patterns uses sodium bisulfite treatment to convert unmethylated cytosines to uracils while methylated cytosines remain unchanged. This leads to differences in sequence GC content and thus different melting profiles after PCR amplification.
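The effect of bisulfite treatment on GC content can be illustrated with a short Python sketch. It models a single strand after conversion and PCR, where each unmethylated cytosine ends up read as thymine (via uracil) and each methylated cytosine stays a cytosine. The function names and the 8-base toy sequence are assumptions made for illustration only.

```python
def cpg_positions(seq):
    """Indices of the cytosine in every 'CG' dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T after amplification; C at an index listed
    in methylated_positions is protected and remains C.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


toy = "ACGTCGAC"  # hypothetical sequence with two CpG sites
fully_methylated = bisulfite_convert(toy, set(cpg_positions(toy)))
unmethylated = bisulfite_convert(toy)
```

In this toy case the fully methylated template keeps a higher GC fraction than the unmethylated one, which is the property that shifts the melting profile.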
  • the sequence classification method was experimentally validated by using synthetic RASSF1A promoter sequences simulating six different methylation levels, which the SVM could automatically identify with 100% accuracy in the presence of both inter-assay and intra-assay variations.
  • digital PCR [33-36] integrated with HRM [37] can be used to allow for the discretization of heterogeneous samples into separate reactions which generate individually identifiable melt curves for each genotype present.
  • temperature calibrator probes with additional curve normalization are utilized to improve the accuracy of the training model [37,38].
  • the algorithm can incorporate classification scores/confidences from each binary classifier to enhance the model decision efficacy.
  • the primer finding algorithm enables the user to purposefully engineer groups of identical or distinguishable melt curves according to their specific detection needs. The present inventors have developed an approach for HRM curve identification using SVM to enable highly accurate and automated identification of melt curves based on comparison to an extensive reference library.
  • the present invention provides a powerful tool with broad applicability in microbiology, epigenetics, as well as other types of HRM studies.
  • the methods of the present invention can be used to identify any bacterial species including, but not limited to, the species listed below in Table 1.
  • the primer-finding algorithm, implemented in Python, was developed to enable the selection of primer pairs among conserved regions which flank variable regions that differentiate all desired sequences. Sequences were first aligned using the multiple sequence alignment tool Kalign [39] with default parameters. Then, the aligned sequences were analyzed using Gblocks [40] to find conserved regions as shown in FIG. 1. The parameters were set for DNA sequences, with a minimum block length of 18 nucleotides, no gaps or mismatches allowed, and default values for the remaining parameters.
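The conserved-region criterion described above (blocks of at least 18 columns with no gap and no mismatch) can be approximated in a few lines of Python. This is a simplified stand-in written for illustration, not Gblocks itself and not the inventors' code; the function name and the toy alignment are assumptions.

```python
def conserved_blocks(aligned, min_len=18):
    """Find fully conserved, gap-free column runs in an alignment.

    aligned: list of equal-length aligned sequences using '-' for gaps.
    Returns a list of (start, end) column ranges, end exclusive, whose
    length is at least min_len.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    blocks = []
    start = None
    # Iterate one column past the end so a trailing run is closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if conserved:
            if start is None:
                start = col
        else:
            if start is not None and col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


# Toy alignment, using a shorter minimum length so blocks are visible.
toy_alignment = ["ACGTAC-GGGTT", "ACGTTC-GGGTT", "ACGTACAGGGTT"]
toy_blocks = conserved_blocks(toy_alignment, min_len=4)
```

Each returned block is a candidate primer site; pairs of blocks that flank a variable region would then be evaluated for how many sequences they discriminate.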
  • BLASTClust would cluster the input DNA sequences based on nucleotide similarity. The sequences would be grouped together if they were identical. The melting temperature, GC content of each primer site, and the number of GC differences between primers were constrained while selecting a primer pair [27]. The primer pair that could give the maximum number of distinguishable sequences was selected. A new sequence set was then created from the remaining indistinguishable sequences, and the algorithm was applied again.
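A rough Python sketch of the two checks mentioned above: grouping identical amplicons, and screening a primer pair on melting temperature and GC content. The text does not give the thresholds or the melting temperature model that were used, so the Wallace rule (2 °C per A/T, 4 °C per G/C, a coarse estimate) and every numeric limit below are placeholder assumptions, as are the function names. Primers are assumed to contain only A, C, G, and T.

```python
from collections import defaultdict


def group_identical(amplicons):
    """Group sequence IDs whose amplicons are identical (case-insensitive)."""
    groups = defaultdict(list)
    for seq_id, amplicon in amplicons.items():
        groups[amplicon.upper()].append(seq_id)
    return sorted(sorted(ids) for ids in groups.values())


def gc_count(primer):
    """Number of G and C bases in the primer."""
    return sum(base in "GC" for base in primer.upper())


def wallace_tm(primer):
    """Coarse melting temperature estimate in degrees C (Wallace rule)."""
    gc = gc_count(primer)
    return 4 * gc + 2 * (len(primer) - gc)


def pair_ok(fwd, rev, gc_range=(0.4, 0.6), max_tm_diff=4, max_gc_diff=3):
    """Screen a primer pair; all thresholds are hypothetical placeholders."""
    low, high = gc_range
    for primer in (fwd, rev):
        if not low <= gc_count(primer) / len(primer) <= high:
            return False
    if abs(wallace_tm(fwd) - wallace_tm(rev)) > max_tm_diff:
        return False
    return abs(gc_count(fwd) - gc_count(rev)) <= max_gc_diff


toy_groups = group_identical({"a": "ACGT", "b": "acgt", "c": "ACGA"})
toy_fwd = "ACGTACGTACGTACGTAC"  # hypothetical 18-mer
toy_rev = "TGCATGCATGCATGCATG"  # hypothetical 18-mer
```

Sequences that land in the same group cannot be told apart by that amplicon and would be carried into the next round of primer selection.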
  • Sequences of the capsule polysaccharide synthesis (cps) gene locus, which is believed to influence antigenic diversity in the human immune system [42], from 92 published serotypes of S. pneumoniae were used, including 90 serotypes from the Wellcome Trust and two recently disclosed serotypes: 6C (GenBank accession code EF538714) and 6D (accession code HM171374) [42-44].
  • Predicted melt curves were generated from the optimized primer sets using uMelt web application [28].
  • the computer algorithm was used to find the list of all possible amplicons that could be flanked by each primer set and then all the amplicons were input into uMelt batch mode.
  • the parameters for uMelt were set as follows: temperature range 65°C - 95°C with 0.5°C resolution, and the default thermodynamic set, Unified-SantaLucia 1998. Data were simulated with all combinations of monovalent cation [Mono+]: 47, 50, and 53 mM and [Mg++]: 1.4, 1.5, and 1.6 mM, with 0% dimethylsulfoxide. The output temperature and fluorescence intensity data from each sequence were used for subsequent SVM analysis.
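The simulation grid described above can be enumerated directly. This Python sketch only builds the list of conditions and the temperature axis; it does not call uMelt, and the dictionary keys and constant names are illustrative assumptions.

```python
from itertools import product

MONOVALENT_MM = (47, 50, 53)      # [Mono+] in mM
MAGNESIUM_MM = (1.4, 1.5, 1.6)    # [Mg++] in mM
DMSO_PERCENT = 0
T_START_C, T_END_C, T_STEP_C = 65.0, 95.0, 0.5

# Every combination of monovalent cation and magnesium concentration.
conditions = [
    {"mono_mM": mono, "mg_mM": mg, "dmso_pct": DMSO_PERCENT}
    for mono, mg in product(MONOVALENT_MM, MAGNESIUM_MM)
]

# Temperature axis from 65 to 95 degrees C inclusive, at 0.5 degree steps.
n_steps = round((T_END_C - T_START_C) / T_STEP_C)
temperatures = [T_START_C + k * T_STEP_C for k in range(n_steps + 1)]
```

Three monovalent levels times three magnesium levels give the 9 conditions, and the stated range and resolution give 61 temperature points per curve.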
  • Pre-processing involves deriving a feature vector from a melt curve. Every normalized melt curve is a plot of helicity values corresponding to various temperature values, starting from helicity at 100% to helicity at 0%. The melt curves are further normalized to provide exactly 300 helicity value points between temperature values of 65 and 95 degrees Celsius. If the number of helicity value points generated from melt curve analysis is not 300, piecewise linear interpolation is used to ensure exactly 300 points. Because what the present inventors intend to capture is the variation in the helicity with temperature and not the exact values of helicity, a method is needed that would be oblivious to changes in the melting points.
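The resampling step can be sketched in plain Python as follows. This is one straightforward way to do piecewise linear interpolation onto 300 evenly spaced temperatures, assuming the input temperatures are strictly increasing; the function name and the synthetic straight-line curve are assumptions for illustration, and the later step that makes the features insensitive to melting point shifts is not covered.

```python
def resample_curve(temps, helicity, n_points=300, t_min=65.0, t_max=95.0):
    """Resample a melt curve onto n_points evenly spaced temperatures.

    temps must be strictly increasing and the same length as helicity.
    Values requested outside the measured range are clamped to the
    nearest measured end point.
    """
    if len(temps) != len(helicity) or len(temps) < 2:
        raise ValueError("need two or more matching temperature/helicity points")
    step = (t_max - t_min) / (n_points - 1)
    resampled = []
    j = 0  # index of the left end of the current segment
    for i in range(n_points):
        t = t_min + i * step
        # Advance to the segment [temps[j], temps[j + 1]] that contains t.
        while j < len(temps) - 2 and temps[j + 1] < t:
            j += 1
        t0, t1 = temps[j], temps[j + 1]
        h0, h1 = helicity[j], helicity[j + 1]
        frac = (t - t0) / (t1 - t0)
        frac = min(1.0, max(0.0, frac))  # clamp outside the measured range
        resampled.append(h0 + frac * (h1 - h0))
    return resampled


# Synthetic 61-point curve falling linearly from 100% to 0% helicity.
raw_temps = [65.0 + 0.5 * k for k in range(61)]
raw_helicity = [100.0 - 100.0 * k / 60 for k in range(61)]
feature_vector = resample_curve(raw_temps, raw_helicity)
```

With the 0.5°C grid produced by the simulation (61 points), this yields the fixed-length 300-point vector that the classifiers take as input.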
  • Naive Bayes based classifier: This classifier is based on Bayes' theorem, which requires a large number of estimated samples for accurate classification. However, Naive Bayes reduces the number of estimated parameters needed by using a conditional independence assumption. Conditional independence is defined as follows: given variables A, B, and C, A and B are conditionally independent given C if
  • Naive Bayes formulates the prior probabilities of each of the classes and, based on the likelihood estimate of the test melt curve, computes the posterior probability of the test curve belonging to each of the classes trained. The class with the maximum posterior probability is assigned to the test curve.
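The formula that followed the definition of conditional independence is not present in this text. For reference, the standard textbook statement, together with the resulting Naive Bayes decision rule, is given below. The notation (a 300-point feature vector x with components x_k, and class label c) is an assumption chosen to match the pre-processing description, not the patent's own notation.

```latex
% Conditional independence of A and B given C
P(A, B \mid C) = P(A \mid C)\, P(B \mid C)

% Naive Bayes decision rule for a feature vector x = (x_1, \dots, x_{300})
\hat{c} = \arg\max_{c} \; P(c) \prod_{k=1}^{300} P(x_k \mid c)
```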
  • KNN k Nearest Neighbor based classifier
  • PCA Principal component analysis
  • LDA Linear Discriminant Analysis
  • the input data of melting curves is a sparse dataset with a large number of training points
  • a dimensionality reduction technique is required to reduce the number of points in training sample.
  • the dimensionality of the training and test data was reduced to three eigenvectors, and LDA was then applied to this reduced-dimensional data for classification.
  • Support Vector Machine based classifier: Herein, a one-vs-one ensemble of linear kernel SVMs with least squares optimization was used. The SVM was trained with two groups of feature vectors. At each pixel location, i, the melt curve data was represented by a vector x. With this terminology a label, y, representing the melt curve type, was assigned to every possible feature vector x. By a statistical sampling of such feature vectors along with their labels, the SVM method derived a detection rule by taking a pairwise similarity index between these samples k(x,x') and computing the solution to the following set of equations:
  • the vector a corresponds to the hyperplane such that 1/ 1
  • b is the scalar bias term, which ensures that the hyperplane is not forced to go through the zero [23].
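The set of equations referred to above did not survive in this text. For orientation only, the standard least-squares SVM formulation determines the coefficient vector and the bias from a single linear system. It is shown here as an assumption and may differ in detail from the equations the inventors used; the symbols α (the coefficient vector), γ (a regularization constant), and K (the kernel matrix over the training curves) are introduced for illustration.

```latex
\begin{bmatrix} 0 & \mathbf{1}^{\top} \\ \mathbf{1} & K + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix},
\qquad
K_{mn} = k(x_m, x_n) = x_m^{\top} x_n \quad \text{(linear kernel)}

% Resulting binary decision rule for a new curve x
f(x) = \operatorname{sign}\!\left( \sum_{m} \alpha_m \, k(x_m, x) + b \right)
```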
  • N is the number of classes (e.g., 92 serotypes) into which the input data is grouped.
  • the value of C_ij is one when the curve is classified as i and zero when the curve is classified as j.
  • the number of ones in every row i of FIG. 6 indicates how many times the curve was recognized as i.
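The voting scheme can be sketched in Python as follows. With N classes a one-vs-one ensemble needs N(N-1)/2 binary classifiers, which is 4186 for 92 serotypes. The sketch fills a vote matrix C in which C[i][j] is 1 when the classifier for the pair (i, j) picks i, then returns the class with the most ones in its row. The binary classifiers here are toy nearest-prototype rules on a single number, standing in for the trained SVMs; the function names, class labels, and prototype values are assumptions.

```python
from itertools import combinations


def one_vs_one_predict(x, classes, binary_classifiers):
    """Combine pairwise decisions by majority vote.

    binary_classifiers[(i, j)] is a callable returning either i or j.
    Returns (winning class, vote matrix C) where C[a][b] == 1 means the
    classifier for that pair chose the class in row a.
    """
    index = {label: n for n, label in enumerate(classes)}
    n_classes = len(classes)
    C = [[0] * n_classes for _ in range(n_classes)]
    for i, j in combinations(classes, 2):
        winner = binary_classifiers[(i, j)](x)
        loser = j if winner == i else i
        C[index[winner]][index[loser]] = 1
    votes = [sum(row) for row in C]  # number of ones per row
    best = classes[votes.index(max(votes))]
    return best, C


def make_toy_classifier(i, j, prototypes):
    """Toy stand-in for a trained binary SVM: pick the nearer prototype."""
    def classify(x):
        return i if abs(x - prototypes[i]) <= abs(x - prototypes[j]) else j
    return classify


toy_classes = ["1", "6C", "19A"]
toy_prototypes = {"1": 80.0, "6C": 82.0, "19A": 85.0}  # hypothetical values
toy_classifiers = {
    (i, j): make_toy_classifier(i, j, toy_prototypes)
    for i, j in combinations(toy_classes, 2)
}
predicted, vote_matrix = one_vs_one_predict(81.8, toy_classes, toy_classifiers)
n_pairs_for_92 = 92 * 91 // 2
```

Ties in the vote count are resolved here in favor of the class listed first, which is an arbitrary choice for the sketch.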
  • the 20-fold diluted amplicons will be subsequently used for melt curve analysis by performing quantitative SYBR green-based PCR.
  • the 25 µL final-volume PCR reaction contains 2 µL of the diluted amplicon, 1X Advanced SYBR Green Supermix (2X stock, Bio-Rad), and 400 nmol/L each of the forward and reverse primers, which are the 35 nt at the beginning and the end of each sequence, respectively (Table 3).
  • the PCR program consisted of 95°C for 2 minutes, followed by 40 cycles of 95°C for 15 seconds, 60°C for 15 seconds, and 72°C for 45 seconds with another cycle for the melting step: 95°C for 15 seconds and ramping from 60°C to 95°C with ramping rate 0.2°C/sec.
  • the melting profiles were obtained from a Bio-Rad iCycler real-time PCR machine after endpoint PCR product detection.
  • Generating training data: simulated melt curves.
  • amplicon sequences derived from the seven primer pairs in Table 4 were used to calculate theoretical melt curves with the web-based tool uMelt [28].
  • the amplicons are provided in the sequence listing filed herewith: primer 1 (SEQ ID NOS: 27-83); primer 2 (SEQ ID NOS: 84-102); primer 3 (SEQ ID NOS: 103-134); primer 4 (SEQ ID NOS: 135-164); primer 5 (SEQ ID NOS: 165-189); primer 6 (SEQ ID NOS: 190-192); and primer 7 (SEQ ID NOS: 193-194).
  • the classifier was also tested on new conditions, i.e., conditions other than the original 9 mentioned above. To do this, two conditions were randomly chosen within the extremes of the data considered: 49 mM monovalent ions (Na+ and K+) with 1.6 mM magnesium, and 52 mM monovalent ions with 1.5 mM magnesium. The SVM classifier was able to correctly predict the serotype of all 92 samples in both of these conditions.
  • One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof.
  • These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • The programmable system or computing system may include clients and servers.
  • A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • The machine-readable medium can store such machine instructions non-transitorily, as would, for example, a non-transient solid-state memory, a magnetic hard drive, or any equivalent storage medium.
  • The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, as would, for example, a processor cache or other random access memory associated with one or more physical processor cores.
  • One or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT), a liquid crystal display (LCD), or a light emitting diode (LED) monitor, for displaying information to the user, and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer.
  • Feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including, but not limited to, acoustic, speech, or tactile input.
  • Other possible input devices include, but are not limited to, touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
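
For illustration only, the qPCR reaction setup and melt step described earlier in this section can be checked with a short calculation. The Python sketch below is not part of the disclosed method. The 25 µL final volume, 2 µL of template, 2X supermix stock, 400 nmol/L primer concentration, and cycling parameters are taken from the description; the 10 µmol/L primer working stock and all function names are assumptions introduced here.

```python
# Illustrative arithmetic for the SYBR Green qPCR setup and melt step.
# Values marked ASSUMPTION do not come from the source text.

FINAL_VOLUME_UL = 25.0            # final reaction volume
TEMPLATE_UL = 2.0                 # diluted amplicon per reaction
SUPERMIX_STOCK_X = 2.0            # supermix supplied as a 2X stock
SUPERMIX_FINAL_X = 1.0            # used at 1X
PRIMER_FINAL_NMOL_PER_L = 400.0   # each primer, final concentration
PRIMER_STOCK_UMOL_PER_L = 10.0    # ASSUMPTION: primer working stock


def reaction_mix(n_reactions=1):
    """Return component volumes in µL for n_reactions reactions."""
    supermix = FINAL_VOLUME_UL * SUPERMIX_FINAL_X / SUPERMIX_STOCK_X
    # C1 * V1 = C2 * V2, with the stock converted from µmol/L to nmol/L.
    primer_each = (FINAL_VOLUME_UL * PRIMER_FINAL_NMOL_PER_L
                   / (PRIMER_STOCK_UMOL_PER_L * 1000.0))
    water = FINAL_VOLUME_UL - supermix - 2 * primer_each - TEMPLATE_UL
    per_reaction = {
        "supermix_2x": supermix,
        "forward_primer": primer_each,
        "reverse_primer": primer_each,
        "template": TEMPLATE_UL,
        "water": water,
    }
    return {name: vol * n_reactions for name, vol in per_reaction.items()}


def primer_pmol_per_reaction():
    """Amount of each primer per reaction (nmol/L x µL = fmol; /1000 = pmol)."""
    return PRIMER_FINAL_NMOL_PER_L * FINAL_VOLUME_UL / 1000.0


def melt_ramp_seconds(start_c=60.0, end_c=95.0, rate_c_per_s=0.2):
    """Duration of the melt ramp at a constant ramp rate."""
    return (end_c - start_c) / rate_c_per_s


def programmed_hold_seconds(cycles=40):
    """Sum of programmed holds plus the melt ramp.

    Temperature transitions between cycling steps are not included.
    """
    initial_denaturation = 2 * 60
    per_cycle = 15 + 15 + 45
    pre_melt_denaturation = 15
    return (initial_denaturation + cycles * per_cycle
            + pre_melt_denaturation + melt_ramp_seconds())


if __name__ == "__main__":
    for name, volume in reaction_mix().items():
        print(f"{name}: {volume:.2f} µL")
    print(f"each primer: {primer_pmol_per_reaction():.1f} pmol")
    print(f"melt ramp: {melt_ramp_seconds():.0f} s")
    print(f"programmed time: {programmed_hold_seconds():.0f} s")
```

Under these assumptions each reaction takes 12.5 µL of supermix and 10 pmol of each primer, and the ramp from 60°C to 95°C lasts 175 seconds. A different primer stock concentration changes only the primer and water volumes.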

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides methods for identifying an unknown genotype in a biological sample, which use a machine learning algorithm based on a linear support vector machine (SVM) to classify melt curves with trained tolerance for variations in reaction conditions. In another embodiment, the present invention provides methods for identifying the minimum set of conserved primers flanking hypervariable regions capable of distinguishing between all sequence variants in a given dataset. The present inventors have demonstrated in silico the ability of our approach to identify all 92 known serotypes of Streptococcus pneumoniae based on their predicted melt curves. The methods of the present invention have been further verified experimentally using a synthetic DNA panel for various alleles of the human RASSF1A gene. Related apparatus, systems, techniques, and articles are also described.
PCT/US2015/027120 2014-04-22 2015-04-22 Methods and systems for trained melt profile analysis for reliable genotyping of sequence variants Ceased WO2015164517A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461982370P 2014-04-22 2014-04-22
US61/982,370 2014-04-22

Publications (1)

Publication Number Publication Date
WO2015164517A1 (fr) 2015-10-29

Family

ID=54333145

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/027120 Ceased WO2015164517A1 (fr) Methods and systems for trained melt profile analysis for reliable genotyping of sequence variants 2014-04-22 2015-04-22

Country Status (1)

Country Link
WO (1) WO2015164517A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017204771A3 (fr) * 2016-05-27 2018-04-05 Erciyes Universitesi System and method for identifying microorganisms
WO2018165080A1 (fr) * 2017-03-06 2018-09-13 The Johns Hopkins University Streamlined platform for bacterial identification and antibiotic susceptibility testing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100076690A1 (en) * 2008-09-19 2010-03-25 Corbett Research Pty Ltd METHOD AND SYSTEM FOR ANALYSIS OF MELT CURVES, PARTICULARLY dsDNA AND PROTEIN MELT CURVES
US20110045479A1 (en) * 2008-11-03 2011-02-24 Life Technologies Corporation Method For High Resolution Melt Genotyping
WO2011050050A2 (fr) * 2009-10-20 2011-04-28 The Johns Hopkins University Use of a high-resolution melt assay to measure genetic diversity
US20140017682A1 (en) * 2011-01-11 2014-01-16 Roche Diagnostics Operations, Inc. High resolution melting analysis as a prescreening tool

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ATHAMANOLAP ET AL.: "Trainable high resolution melt curve machine learning classifier for large-scale reliable genotyping of sequence variants", PLOS ONE, vol. 9, no. 10, October 2014 (2014-10-01), pages 1 - 10, XP055233770 *
PALAIS ET AL.: "Mathematical algorithms for high-resolution DNA melting analysis", METHODS IN ENZYMOLOGY, vol. 454, 2009, pages 323 - 343, XP009185077 *

Similar Documents

Publication Publication Date Title
Athamanolap et al. Trainable high resolution melt curve machine learning classifier for large-scale reliable genotyping of sequence variants
Vijayakumar et al. Accurate identification of clinically important Acinetobacter spp.: an update
Konigsberg et al. Host methylation predicts SARS-CoV-2 infection and clinical outcome
Meisel et al. Skin microbiome surveys are strongly influenced by experimental design
US11004537B2 (en) Methods and processes for non invasive assessment of a genetic variation
Christner et al. Rapid MALDI-TOF mass spectrometry strain typing during a large outbreak of Shiga-toxigenic Escherichia coli
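JP6318151B2 (ja) Methods and processes for non-invasive assessment of genetic variation
CN115667554A (zh) Methods and systems for detecting colorectal cancer by nucleic acid methylation analysis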
Gherardi et al. An overview of various typing methods for clinical epidemiology of the emerging pathogen Stenotrophomonas maltophilia
Fraley et al. Universal digital high-resolution melt: a novel approach to broad-based profiling of heterogeneous biological samples
Lu et al. Dynamic time warping assessment of high-resolution melt curves provides a robust metric for fungal identification
Pánek et al. A new method for identification of protein (sub) families in a set of proteins based on hydropathy distribution in proteins
Watts et al. Metagenomic next-generation sequencing in clinical microbiology
Chen et al. A gene signature based method for identifying subtypes and subtype-specific drivers in cancer with an application to medulloblastoma
WO2024125660A1 (fr) Techniques d'apprentissage automatique pour déterminer des méthylations de base
Touchon et al. From GC skews to wavelets: a gentle guide to the analysis of compositional asymmetries in genomic data
WO2024007971A1 (fr) Analyse de fragments microbiens dans le plasma
Yao et al. DeepSF-4mC: A deep learning model for predicting DNA cytosine 4mC methylation sites leveraging sequence features
WO2015164517A1 (fr) Methods and systems for trained melt profile analysis for reliable genotyping of sequence variants
AU2013243300A1 (en) Gene expression panel for breast cancer prognosis
Carter et al. A process for analysis of microarray comparative genomics hybridisation studies for bacterial genomes
Bohlin Genomic signatures in microbes—properties and applications
Laakso et al. Evaluation of high-throughput PCR and microarray-based assay in conjunction with automated DNA extraction instruments for diagnosis of sepsis
US20080172183A1 (en) Systems and methods for methylation prediction
Kotanidou et al. Methylation haplotypes of the insulin gene promoter in children and adolescents with type 1 diabetes: Can a dimensionality reduction approach predict the disease?

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15783247

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15783247

Country of ref document: EP

Kind code of ref document: A1