HK1259978A1 - Methods for obtaining and correcting biological sequence information - Google Patents
- Publication number
- HK1259978A1 (HK19119544.5A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- sequencing
- nucleotide
- sequence
- reaction
- signal
Description
Technical Field
In some aspects, the invention relates to a high-throughput sequencing method and belongs to the field of gene sequencing.
Background
High-throughput sequencing is a technology that has developed rapidly in recent years. Compared with traditional Sanger sequencing, its biggest advantage is that a large amount of sequence information can be read out simultaneously. Although its accuracy is lower than that of the traditional method, analysis of the large volume of data can yield information beyond the sequence itself, such as gene expression levels and copy number changes.
Currently, mainstream sequencers such as Solexa/Illumina, 454, and Ion Torrent use sequencing-by-synthesis (SBS) methods. These sequencers are structurally similar and include fluidic systems, optical systems, and chip systems. The sequencing reaction takes place on the chip. The sequencing processes are also very similar, and all include: introducing the reaction liquid into the chip to perform the SBS reaction, collecting signals, and washing. A new round of sequencing is then performed, making this a cyclic process. As the number of cycles increases, continuous single-base, non-degenerate sequence information (e.g., ACTGACTG) is measured. However, high-throughput sequencers do not completely eliminate sequencing errors. Sequencing errors may result from incidental or cumulative reaction errors, signal acquisition errors, errors introduced by signal correction, and the like. In existing sequencers, these chemical, optical, or software errors become noise that cannot be recognized at a single read site and can only be eliminated by deep sequencing, that is, by using multiple reads that cover the same sequence at different sites. More accurate read-out is an important direction in the development of high-throughput sequencing. However, prior-art efforts to improve accuracy have focused on optimizing the chemical reaction itself and the subsequent image signal processing, and have not fundamentally changed the sequencing logic. There is therefore a need for improved sequencing methods.
Disclosure of Invention
The present application claims priority from the following Chinese patent applications: Chinese patent application No. CN201510822361.9, filed 11/18/2015, entitled "Method for sequencing nucleotide molecules of a phosphate-modified fluorophore"; Chinese patent application No. CN201510815685.X, filed 11/18/2015, entitled "Method for sequencing using nucleotide substrate molecules of a fluorophore having fluorescence switching properties"; Chinese patent application No. CN201510944878.5, filed 12/11/2015, entitled "Method for detecting and correcting sequence data errors in sequencing results"; and Chinese patent application No. CN201610899880.X, filed 10/14/2016, entitled "Method for reading sequence information from an original signal for high throughput DNA sequencing", the entire contents of which are incorporated herein by reference.
This summary is not intended to be used to limit the scope of the claimed subject matter. Other features, details, utilities, and advantages of the claimed subject matter will be apparent from the detailed description including those aspects disclosed in the accompanying drawings and appended claims.
In one aspect, provided herein is a method for obtaining sequence information of a polynucleotide of interest, the method comprising: a) providing a first sequencing reagent to the target polynucleotide in the presence of a first polynucleotide replication catalyst, wherein the first sequencing reagent comprises at least two different nucleotide monomers each conjugated to a first label, and the nucleotide monomer/first label conjugate is substantially non-fluorescent until after the nucleotide monomers are incorporated into the target polynucleotide according to complementarity with the target polynucleotide, wherein the first labels of the at least two different nucleotide monomers are the same or different; and b) providing a second sequencing reagent to the target polynucleotide in the presence of a second polynucleotide replication catalyst, wherein the second sequencing reagent comprises one or more nucleotide monomers each conjugated to a second label, and the nucleotide monomer/second label conjugate is substantially non-fluorescent until after incorporation of the nucleotide monomer into the target polynucleotide according to complementarity with the target polynucleotide, at least one of the one or more nucleotide monomers being different from the nucleotide monomer present in the first sequencing reagent, and wherein the second sequencing reagent is provided subsequent to providing the first sequencing reagent, and c) obtaining sequence information of at least a portion of the target polynucleotide by detecting fluorescence emissions resulting from the first label and the second label after incorporation of the nucleotide monomer into the polynucleotide in steps a) and b).
In one embodiment, the method is used to obtain sequence information for at least a portion of a single polynucleotide of interest. In another embodiment, the method is used to obtain sequence information for at least a portion of a plurality of polynucleotides of interest simultaneously.
In any of the preceding embodiments, the first polynucleotide replication catalyst and the second polynucleotide replication catalyst may be the same polynucleotide replication catalyst or different polynucleotide replication catalysts.
In any of the preceding embodiments, the sequence information may be obtained by one or more sequencing reactions, optionally in one or more reaction volumes (e.g., reaction chambers), e.g., in about 1 × 10^6 to about 5 × 10^8 reaction volumes, about 1 × 10^6 to about 1 × 10^8 reaction volumes, or about 1 × 10^6 to about 5 × 10^7 reaction volumes, wherein optionally the reaction volumes are physically separated from each other and/or there is no or substantially no material exchange between the reaction volumes, wherein optionally the reaction volumes are located in an array, such as a chip, and wherein optionally the reaction volumes are closed and/or separated from each other by a liquid, such as an oil, that is immiscible with the liquid in the reaction volumes. "Substantially no material exchange" means that some material exchange between reaction volumes is allowed, provided that it does not cause cross-contamination that affects the sequencing results in any reaction volume.
In any of the preceding embodiments, the reaction volume may be provided in a reaction chamber and the target polynucleotide in each reaction chamber is immobilized on a solid support in the reaction chamber, wherein optionally the sequence information is obtained by high-throughput sequencing, e.g., wherein at least about 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, or 10^9 sequences are read in parallel. In any of the preceding embodiments, the first polynucleotide replication catalyst and/or the second polynucleotide replication catalyst is a polymerase, such as a DNA polymerase, an RNA polymerase or an RNA-dependent RNA polymerase, a ligase, a reverse transcriptase, or a terminal deoxynucleotidyl transferase.
In any of the preceding embodiments, the nucleotide monomers in the first and/or second sequencing reagents may be selected from the group consisting of: deoxyribonucleotides, modified deoxyribonucleotides, ribonucleotides, modified ribonucleotides, peptide nucleotides, modified sugar-phosphate backbone nucleotides, and mixtures thereof. In one embodiment, the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are both deoxyribonucleotides. In some embodiments, the nucleotide monomer is selected from the group consisting of: A, T/U, C and G deoxyribonucleotides, and analogs thereof. In another embodiment, the nucleotide monomers in both the first sequencing reagent and the second sequencing reagent are ribonucleotides. In particular embodiments, the nucleotide monomer is selected from the group consisting of: A, U/T, C and G ribonucleotides, and analogs thereof.
In any of the preceding embodiments, the first and/or second label is releasably conjugated to the nucleotide monomer. In one embodiment, the first and/or second label is conjugated to the terminal phosphate group of the nucleotide monomer. In particular embodiments, the nucleotide monomer/first label conjugate in the first sequencing reagent and/or the one or more nucleotide monomer/second label conjugates in the second sequencing reagent have the structure of formula I below:
wherein n is 0-6, R is a nucleobase, and X is H, OH or OMe, or a salt thereof. In some embodiments, the first and/or second labels are substantially non-fluorescent until after release from the terminal phosphate group of the nucleotide monomer. In another embodiment, the method further comprises releasing the first and/or second labels from the terminal phosphate groups of the nucleotide monomers using an activating enzyme. In one embodiment, the activating enzyme is an exonuclease, a phosphotransferase, or a phosphatase.
In any of the preceding embodiments, the nucleotide monomer/first label conjugate in the first sequencing reagent and/or the one or more nucleotide monomer/second label conjugates in the second sequencing reagent may have the structure of formula II below:
In any of the preceding embodiments, the first labels of at least two different nucleotide monomers may be the same or different from each other. In any of the preceding embodiments, the method may further comprise a washing step between steps a) and b).
In any of the preceding embodiments, the target polynucleotide is immobilized on a surface, such as a solid surface, a soft surface, a hydrogel surface, a microparticle surface, or a combination thereof. In one embodiment, the solid surface is part of a microreactor and steps a) and b) are performed in the microreactor. In any of the preceding embodiments, the method is carried out at a temperature in the range of from about 20 °C to about 70 °C.
In any of the preceding embodiments, multiple rounds of steps a) and b) can be performed using different combinations of the first sequencing reagent and the second sequencing reagent.
In any of the preceding embodiments, the sequence information obtained in step c) may be a degenerate sequence. In one embodiment, at least one additional round of steps a) and b) is performed using a combination of first and second sequencing reagents that is different from the combination of first and second sequencing reagents in the previous round or rounds of steps a) and b) to obtain at least one additional sequence, and the additional sequences are compared to the degenerate sequences to obtain non-degenerate sequences.
In any of the preceding embodiments, the initial sequence information obtained in step c) may comprise no errors, or one or more errors. In one embodiment, at least one additional round of steps a) and b) is performed using a combination of first and second sequencing reagents that is different from the combination of first and second sequencing reagents in the previous round or rounds of steps a) and b) to obtain at least one additional sequence, and the additional sequence is compared to the initial sequence to reduce or eliminate sequence errors.
In any of the preceding embodiments, the comparison is performed using a mathematical analysis, algorithm, or method. In one embodiment, the mathematical analysis, algorithm or method comprises a Markov model or a maximum likelihood method based on a Bayesian scheme.
In any of the preceding embodiments, the first sequencing reagent can include two different nucleotide monomer/first label conjugates, each nucleotide monomer/first label conjugate comprising a different nucleotide monomer. In any of the preceding embodiments, the second sequencing reagent can include two different nucleotide monomer/second label conjugates, each nucleotide monomer/second label conjugate comprising a different nucleotide monomer. In any of the preceding embodiments, the two nucleotide monomers in the first sequencing reagent may be different from the two nucleotide monomers in the second sequencing reagent.
In any of the preceding embodiments, the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent may be selected from the group consisting of: A, T/U, C and G deoxyribonucleotides, and analogs thereof. In one embodiment, the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent are selected from the group consisting of: 1) A and T/U deoxyribonucleotides in one sequencing reagent and C and G deoxyribonucleotides in another sequencing reagent; 2) A and G deoxyribonucleotides in one sequencing reagent and C and T/U deoxyribonucleotides in another sequencing reagent; and 3) A and C deoxyribonucleotides in one sequencing reagent and G and T/U deoxyribonucleotides in another sequencing reagent. In another embodiment, one round or at least two rounds of steps a) and b) are performed, wherein one of combinations 1) to 3) is used in one round of steps a) and b), and another of combinations 1) to 3), different from the combination used in the previous round, is used in another round of steps a) and b). In one aspect, three rounds of steps a) and b) are performed, each round using a different combination selected from combinations 1) to 3). In any of the preceding embodiments, the sequences obtained from multiple rounds of steps a) and b) can be compared to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
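By way of illustration only, the comparison of degenerate sequences obtained with different two-plus-two combinations may be sketched in Python as follows. The sketch assumes an idealized, error-free read in which each position of the newly synthesized strand is recorded as 0 (incorporated from the first reagent) or 1 (incorporated from the second reagent). The names COMBOS, encode and decode are hypothetical and do not appear in the application.

```python
# Hypothetical illustration; bases refer to the newly synthesized strand.
COMBOS = {
    "AT/CG": ("AT", "CG"),  # combination 1)
    "AG/CT": ("AG", "CT"),  # combination 2)
    "AC/GT": ("AC", "GT"),  # combination 3)
}


def encode(seq, combo):
    """Idealized degenerate read: 0 if the base is in the first reagent, else 1."""
    first, _second = COMBOS[combo]
    return [0 if base in first else 1 for base in seq]


def decode(reads):
    """Intersect the candidate bases implied by each degenerate read.

    `reads` maps a combination name to its list of 0/1 symbols.
    Positions that remain ambiguous are reported as "N".
    """
    length = len(next(iter(reads.values())))
    out = []
    for i in range(length):
        candidates = set("ACGT")
        for combo, bits in reads.items():
            candidates &= set(COMBOS[combo][bits[i]])
        out.append(candidates.pop() if len(candidates) == 1 else "N")
    return "".join(out)


seq = "ACTGACTG"
two_reads = {c: encode(seq, c) for c in ("AT/CG", "AG/CT")}
print(decode(two_reads))
```

Under these assumptions, any two of the three combinations resolve every position to a single base, whereas one combination alone leaves every position degenerate.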
In any of the preceding embodiments, the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent may be selected from the group consisting of: A, T/U, C and G ribonucleotides, and analogs thereof. In one embodiment, the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent are selected from the group consisting of: 1) A and T/U ribonucleotides in one sequencing reagent and C and G ribonucleotides in another sequencing reagent; 2) A and G ribonucleotides in one sequencing reagent and C and T/U ribonucleotides in another sequencing reagent; and 3) A and C ribonucleotides in one sequencing reagent and G and T/U ribonucleotides in another sequencing reagent. In one aspect, one round or at least two rounds of steps a) and b) are performed, wherein one of combinations 1) to 3) is used in one round of steps a) and b), and another of combinations 1) to 3), different from the combination used in the previous round, is used in another round of steps a) and b). In another aspect, at least three rounds of steps a) and b) are performed, each round using a different combination selected from combinations 1) to 3). In any of the preceding embodiments, the sequences obtained from multiple rounds of steps a) and b) can be compared to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
In any of the preceding embodiments, the first label of two different nucleotide monomers may be the same, and the second label may be the same as the first label.
In any of the preceding embodiments, the first label of the two different nucleotide monomers may be different, and the second label may be the same as the first label.
In any of the preceding embodiments, one of the first and second sequencing reagents may comprise three different nucleotide monomer/first label conjugates, each nucleotide monomer/first label conjugate comprising a different nucleotide monomer, and the other sequencing reagent may comprise one nucleotide monomer/second label conjugate, and the three nucleotide monomers in one sequencing reagent may be different from the nucleotide monomers in the other sequencing reagent.
In any of the preceding embodiments, the nucleotide monomers in the first and second sequencing reagents may be selected from the group consisting of: A, T/U, C and G deoxyribonucleotides, and analogs thereof. In specific embodiments, the nucleotide monomers in the first and second sequencing reagents are selected from the group consisting of: 1) C, G and T/U deoxyribonucleotides in one sequencing reagent and A deoxyribonucleotides in another sequencing reagent; 2) A, G and T/U deoxyribonucleotides in one sequencing reagent and C deoxyribonucleotides in another sequencing reagent; 3) A, C and T/U deoxyribonucleotides in one sequencing reagent and G deoxyribonucleotides in another sequencing reagent; and 4) A, C and G deoxyribonucleotides in one sequencing reagent and T/U deoxyribonucleotides in another sequencing reagent. In one embodiment, one round or at least two rounds of steps a) and b) are performed, wherein one of combinations 1) to 4) is used in one round of steps a) and b), and another of combinations 1) to 4), different from the combination used in the previous round, is used in another round of steps a) and b). In another embodiment, three rounds of steps a) and b) are performed, each round using a different combination selected from combinations 1) to 4). In yet another embodiment, four rounds of steps a) and b) are performed, each round using a different combination selected from combinations 1) to 4). In any of the preceding embodiments, the sequences obtained from multiple rounds of steps a) and b) can be compared to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
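As an illustrative sketch only, each three-plus-one combination can be viewed as marking the positions occupied by the single nucleotide that is supplied on its own. The Python below assumes an idealized, error-free read in which each position is recorded as 1 when the single nucleotide was incorporated and 0 otherwise; the function names are hypothetical and are not taken from the application.

```python
def encode_one_vs_three(seq, single):
    """Idealized read for the combination whose lone nucleotide is `single`."""
    return [1 if base == single else 0 for base in seq]


def decode_one_vs_three(reads):
    """Combine reads keyed by their lone nucleotide ("A", "C", "G" or "T").

    A position is called when exactly one read marks it. If exactly one
    nucleotide was never read on its own, unmarked positions are assigned
    to that nucleotide; otherwise they stay ambiguous ("N").
    """
    length = len(next(iter(reads.values())))
    unread = sorted(set("ACGT") - set(reads))
    out = []
    for i in range(length):
        hits = [base for base, bits in reads.items() if bits[i]]
        if len(hits) == 1:
            out.append(hits[0])
        elif not hits and len(unread) == 1:
            out.append(unread[0])
        else:
            out.append("N")
    return "".join(out)


seq = "GATTACA"
three_rounds = {b: encode_one_vs_three(seq, b) for b in "ACG"}
print(decode_one_vs_three(three_rounds))
```

In this simplified picture, three rounds with different combinations already determine every position, and a fourth round adds redundancy that can expose a disagreement between reads.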
In any of the preceding embodiments, the nucleotide monomers in the first and second sequencing reagents may be selected from the group consisting of: A, T/U, C and G ribonucleotides, and analogs thereof. In one embodiment, the nucleotide monomers in the first and second sequencing reagents are selected from the group consisting of: 1) C, G and T/U ribonucleotides in one sequencing reagent and A ribonucleotides in another sequencing reagent; 2) A, G and T/U ribonucleotides in one sequencing reagent and C ribonucleotides in another sequencing reagent; 3) A, C and T/U ribonucleotides in one sequencing reagent and G ribonucleotides in another sequencing reagent; and 4) A, C and G ribonucleotides in one sequencing reagent and T/U ribonucleotides in another sequencing reagent. In one embodiment, one round or at least two rounds of steps a) and b) are performed, wherein one of combinations 1) to 4) is used in one round of steps a) and b), and another of combinations 1) to 4), different from the combination used in the previous round, is used in another round of steps a) and b). In a particular embodiment, at least three rounds of steps a) and b) are performed, each round using a different combination selected from combinations 1) to 4). In another embodiment, at least four rounds of steps a) and b) are performed, each round using a different combination selected from combinations 1) to 4). In any of the preceding embodiments, the sequences obtained from multiple rounds of steps a) and b) can be compared to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
In any of the foregoing embodiments, read lengths of about 250bp, about 350bp, about 400bp, about 450bp, about 500bp, about 550bp, about 600bp, about 650bp, about 700bp, about 750bp, about 800bp, about 850bp, about 900bp, about 950bp, about 1000bp, about 1050bp, about 1100bp, about 1150bp, about 1200bp, about 1250bp, about 1300bp, about 1350bp, about 1400bp, about 1450bp, about 1500bp, about 1550bp, about 1600bp, about 1650bp, about 1700bp, about 1750bp, about 1800bp, about 1850bp, about 1900bp, about 1950bp, about 2000bp, about 2050bp, about 2100bp, about 2150bp, about 2200bp, about 2250bp, about 2300bp, about 2350bp, or about 2400 base pairs may be obtained.
In any of the foregoing embodiments, a code accuracy of at least about 95% may be obtained. In any of the preceding embodiments, the target polynucleotide may be a single stranded polynucleotide.
In another aspect, disclosed herein is a method for obtaining sequence information of a polynucleotide of interest, the method comprising: a) providing a first sequencing reagent to the target polynucleotide in the presence of a first polynucleotide replication catalyst, wherein the first sequencing reagent comprises two different nucleotide monomers each conjugated to a first label, and the nucleotide monomer/first label conjugate is substantially non-fluorescent until after the nucleotide monomer is incorporated into the target polynucleotide according to complementarity with the target polynucleotide; b) providing a second sequencing reagent to the target polynucleotide in the presence of a second polynucleotide replication catalyst, wherein the second sequencing reagent comprises two different nucleotide monomers each conjugated to a second label, and the nucleotide monomer/second label conjugate is substantially non-fluorescent until after incorporation of the nucleotide monomer into the target polynucleotide according to complementarity with the target polynucleotide, and wherein the second sequencing reagent is provided subsequent to providing the first sequencing reagent; and c) obtaining sequence information for at least a portion of the target polynucleotide by detecting fluorescent emissions caused by the first label and the second label after incorporation of the nucleotide monomer into the polynucleotide in steps a) and b), wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are selected from the group consisting of: 1) adenine (A) and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent and cytosine (C) and guanine (G) nucleotide monomers in another sequencing reagent; 2) adenine (A) and guanine (G) nucleotide monomers in one sequencing reagent and cytosine (C) and thymine (T)/uracil (U) nucleotide monomers in another sequencing reagent; and 3) adenine (A) and cytosine (C) nucleotide monomers in one sequencing reagent and guanine (G) and thymine (T)/uracil (U) nucleotide monomers in another sequencing reagent. In one embodiment, the first labels of the two different nucleotide monomers in step a) and the second labels of the two different nucleotide monomers in step b) are the same label. In another embodiment, the first label comprises two different labels, wherein one of the first labels is the same as one of the second labels and the other of the first labels is the same as the other of the second labels. In any of the preceding embodiments, multiple rounds of steps a) and b) are performed, each round using a combination selected from combinations 1) to 3). In another embodiment, at least two or three sets of sequence information are obtained in step c), the method comprising: performing multiple rounds of steps a) and b) in a first sequencing reaction volume using combination 1) to obtain a first set of sequence information, performing multiple rounds of steps a) and b) in a second sequencing reaction volume using combination 2) to obtain a second set of sequence information, and/or performing multiple rounds of steps a) and b) in a third sequencing reaction volume using combination 3) to obtain a third set of sequence information. In one embodiment, the first, second and third sets of sequence information are obtained in parallel from separate sequencing reaction volumes. In another embodiment, the first, second and third sets of sequence information are obtained sequentially from the same sequencing reaction volume, and the product of the previous sequencing reaction is excised before the next sequencing reaction is initiated. In any of the preceding embodiments, the method further comprises comparing at least two or three sets of sequence information to reduce or eliminate sequence errors.
In one embodiment, the comparison indicates that when at least two or three sets of sequence information are identical to each other, there are no errors in the obtained target polynucleotide sequence. In another embodiment, the comparison indicates that there is an error in the obtained target polynucleotide sequence when the at least two or three sets of sequence information comprise a difference in at least one nucleotide residue of the target polynucleotide sequence. In one embodiment, the method further comprises correcting at least one nucleotide residue in the obtained target polynucleotide sequence such that, after correction, at least two or three sets of sequence information are identical to each other.
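The comparison of three sets of sequence information can be illustrated with the following Python sketch, offered as a simplified example rather than as the method of the application. It assumes that the three degenerate reads are already aligned position by position and that each symbol is 0 (first reagent) or 1 (second reagent); the names PAIRS, encode and inconsistent_positions are hypothetical.

```python
# Hypothetical illustration; bases refer to the newly synthesized strand.
PAIRS = {
    "AT/CG": ("AT", "CG"),  # combination 1)
    "AG/CT": ("AG", "CT"),  # combination 2)
    "AC/GT": ("AC", "GT"),  # combination 3)
}


def encode(seq, combo):
    """Idealized degenerate read: 0 if the base is in the first reagent, else 1."""
    return [0 if base in PAIRS[combo][0] else 1 for base in seq]


def inconsistent_positions(reads):
    """Return positions at which the reads do not agree on exactly one base."""
    length = min(len(bits) for bits in reads.values())
    bad = []
    for i in range(length):
        candidates = set("ACGT")
        for combo, bits in reads.items():
            candidates &= set(PAIRS[combo][bits[i]])
        if len(candidates) != 1:
            bad.append(i)
    return bad


reads = {combo: encode("ACTGACTG", combo) for combo in PAIRS}
print(inconsistent_positions(reads))   # no disagreement for error-free reads
reads["AC/GT"][3] ^= 1                 # simulate one wrong symbol in one read
print(inconsistent_positions(reads))
```

Because only four of the eight possible symbol triples correspond to a base, a single wrong symbol in any one of the three reads always leaves an empty intersection at that position, which is what allows the error to be located in this idealized setting.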
In yet another aspect, disclosed herein is a method for obtaining sequence information of a polynucleotide of interest, the method comprising: a) providing a first sequencing reagent to the target polynucleotide in the presence of a first polynucleotide replication catalyst, wherein the first sequencing reagent comprises three different nucleotide monomers each conjugated to a first label, and the nucleotide monomer/first label conjugate is substantially non-fluorescent until after the nucleotide monomer is incorporated into the target polynucleotide according to complementarity with the target polynucleotide; b) providing a second sequencing reagent to the target polynucleotide in the presence of a second polynucleotide replication catalyst, wherein the second sequencing reagent comprises one nucleotide monomer conjugated to a second label, and the nucleotide monomer/second label conjugate is substantially non-fluorescent until after the nucleotide monomer is incorporated into the target polynucleotide according to complementarity with the target polynucleotide, and wherein the second sequencing reagent is provided before or after the first sequencing reagent is provided; and c) obtaining sequence information for at least a portion of the target polynucleotide by detecting fluorescent emissions resulting from the first label and the second label after the nucleotide monomer is incorporated into the polynucleotide in steps a) and b), wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are selected from the group consisting of: 1) cytosine (C), guanine (G), and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent and an adenine (A) nucleotide monomer in another sequencing reagent; 2) adenine (A), guanine (G), and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent and a cytosine (C) nucleotide monomer in another sequencing reagent; 3) adenine (A), cytosine (C), and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent and a guanine (G) nucleotide monomer in another sequencing reagent; and 4) adenine (A), cytosine (C), and guanine (G) nucleotide monomers in one sequencing reagent and a thymine (T)/uracil (U) nucleotide monomer in another sequencing reagent. In one embodiment, the first labels of the three different nucleotide monomers in step a) and the second label of the one nucleotide monomer in step b) are the same label. In any of the preceding embodiments, multiple rounds of steps a) and b) are performed, each round using a combination selected from combinations 1) to 4). In one embodiment, at least two, three or four sets of sequence information are obtained in step c), the method comprising: performing multiple rounds of steps a) and b) in a first sequencing reaction volume using combination 1) to obtain a first set of sequence information, performing multiple rounds of steps a) and b) in a second sequencing reaction volume using combination 2) to obtain a second set of sequence information, performing multiple rounds of steps a) and b) in a third sequencing reaction volume using combination 3) to obtain a third set of sequence information, and/or performing multiple rounds of steps a) and b) in a fourth sequencing reaction volume using combination 4) to obtain a fourth set of sequence information. In one embodiment, the first, second, third and fourth sets of sequence information are obtained in parallel from separate sequencing reaction volumes. In another embodiment, the first, second, third and fourth sets of sequence information are obtained sequentially from the same sequencing reaction volume, and the product of the previous sequencing reaction is excised before the next sequencing reaction is initiated. In any of the preceding embodiments, the method further comprises comparing at least two, three, or four sets of sequence information to reduce or eliminate sequence errors.
In one embodiment, the comparison indicates that when at least two, three or four sets of sequence information are identical to each other, there are no errors in the obtained target polynucleotide sequence. In one aspect, when a single-color sequencing method is used, at least three sets of sequence information are required to detect sequencing errors. In another aspect, when a two-color sequencing method is used, only two sets of sequence information are needed to detect sequencing errors, since the two fluorescent labels provide an additional piece of information for comparing sequences.
In another embodiment, when at least two, three or four sets of sequence information comprise a difference in at least one nucleotide residue of the target polynucleotide sequence, the comparison indicates that an error exists in the obtained target polynucleotide sequence. In one embodiment, the method further comprises correcting at least one nucleotide residue in the obtained target polynucleotide sequence such that, after correction, at least two, three or four sets of sequence information are identical to each other. In one aspect, the at least one nucleotide residue is corrected by a deletion or insertion at the position where the error occurred, to arrive at the correct sequence. In one aspect, each insertion at a position where an error occurs extends the sequence by at least one nucleotide, and sequence information from one or more other rounds of sequencing is compared to the extended sequence to arrive at a corrected sequence. In another aspect, each deletion at a position where an error occurs shortens the sequence by at least one nucleotide, and sequence information from one or more other rounds of sequencing is compared to the shortened sequence to arrive at a corrected sequence.
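Correction by insertion or deletion can be illustrated with the following Python sketch. It is a brute-force toy under stated assumptions, not the correction procedure of the application: three degenerate reads from the two-plus-two combinations are represented as lists of 0/1 symbols, exactly one read is assumed to contain a single inserted or missing symbol, and every candidate single edit is tried until the three reads agree again. All names are hypothetical.

```python
# Hypothetical illustration; bases refer to the newly synthesized strand.
PAIRS = {
    "AT/CG": ("AT", "CG"),  # combination 1)
    "AG/CT": ("AG", "CT"),  # combination 2)
    "AC/GT": ("AC", "GT"),  # combination 3)
}


def decode(reads):
    """Return the base sequence, or None if the reads disagree anywhere."""
    lengths = {len(bits) for bits in reads.values()}
    if len(lengths) != 1:
        return None
    out = []
    for i in range(lengths.pop()):
        candidates = set("ACGT")
        for combo, bits in reads.items():
            candidates &= set(PAIRS[combo][bits[i]])
        if len(candidates) != 1:
            return None
        out.append(candidates.pop())
    return "".join(out)


def single_edits(bits):
    """Yield every read reachable by one insertion or one deletion."""
    for pos in range(len(bits) + 1):
        for symbol in (0, 1):
            yield bits[:pos] + [symbol] + bits[pos:]
    for pos in range(len(bits)):
        yield bits[:pos] + bits[pos + 1:]


def repair(reads):
    """Try one indel in one read; return every sequence that restores agreement."""
    direct = decode(reads)
    if direct is not None:
        return [direct]
    fixes = set()
    for combo, bits in reads.items():
        for edited in single_edits(bits):
            seq = decode({**reads, combo: edited})
            if seq is not None:
                fixes.add(seq)
    return sorted(fixes)


reads = {
    "AT/CG": [0, 1, 0, 1, 0, 1, 0, 1],
    "AG/CT": [0, 1, 1, 0, 0, 1, 1, 0],
    "AC/GT": [0, 0, 1, 0, 0, 1, 1],  # one symbol missing from this read
}
print(repair(reads))
```

In this example the shortened read is extended by one symbol and the other two reads confirm the extended version, mirroring the comparison described above. An exhaustive search is used only for clarity; it would not scale to long reads or multiple errors.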
In yet another aspect, disclosed herein is a kit or system for obtaining sequence information of a polynucleotide of interest, the kit or system comprising: a) a first sequencing reagent comprising at least two different nucleotide monomer/first label conjugates that are substantially non-fluorescent until after incorporation of a nucleotide monomer into a target polynucleotide according to complementarity with the target polynucleotide; and b) a second sequencing reagent comprising one or more nucleotide monomer/second label conjugates that are substantially non-fluorescent until after incorporation of a nucleotide monomer into the polynucleotide according to complementarity with the target polynucleotide, at least one of the one or more nucleotide monomers being different from the nucleotide monomer present in the first sequencing reagent, and c) a detector for detecting fluorescence emissions resulting from the first label and the second label after incorporation of the nucleotide monomer into the polynucleotide. In one embodiment, the kit or system further comprises a first polynucleotide replication catalyst and/or a second polynucleotide replication catalyst. In any of the preceding embodiments, the first and/or second labels are conjugated to the terminal phosphate group of the nucleotide monomer. In one embodiment, the kit or system further comprises an activating enzyme for releasing the first and/or second label from the terminal phosphate group of the nucleotide monomer. In any of the preceding embodiments, the kit or system may further comprise a solid surface on which the target polynucleotide is configured to be immobilized. In one embodiment, the solid surface is part of a microreactor.
In any of the preceding embodiments, the kit or system further comprises means for obtaining sequence information of at least one target polynucleotide based on fluorescence emission resulting from the first label and the second label following incorporation of the nucleotide monomer into the polynucleotide. In one embodiment, the means comprises a computer readable medium containing executable instructions that when executed obtain sequence information for at least a portion of a polynucleotide of interest based on fluorescence emission resulting from a first label and a second label following incorporation of a nucleotide monomer into the polynucleotide.
In any of the preceding embodiments, the kit or system can further comprise means for comparing a plurality of sequences to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in a non-degenerate sequence. In one embodiment, the means comprises a computer-readable medium containing executable instructions that, when executed, compare sequences to obtain a non-degenerate sequence and/or reduce or eliminate sequence errors in the non-degenerate sequence.
In one aspect, provided herein is a method of correcting sequencing information errors, comprising: (a) performing parameter estimation based on sequencing signals obtained from one or more reference polynucleotides during a sequencing reaction and on the known nucleic acid sequences of the reference polynucleotides, and using the parameter estimation to obtain information on a lead and/or lag phase loss phenomenon of the sequencing reaction; (b) obtaining a sequencing signal from the target polynucleotide during a sequencing reaction; (c) calculating a secondary lead amount of the target polynucleotide based on the information obtained from step (a) and the sequencing signal obtained from step (b); (d) calculating a dephasing amount of the target polynucleotide based on the sequencing signal obtained from step (b) and the secondary lead amount of step (c); (e) correcting the sequencing signal obtained from step (b) using the dephasing amount to generate a predicted sequencing signal for the target polynucleotide; and (f) repeating steps (c) through (e) for one or more rounds, wherein the predicted sequencing signal from round i is used to calculate the secondary lead amount of the target polynucleotide in round i + 1, until the predicted sequencing signal for the target polynucleotide from round j is mathematically convergent, wherein i and j are integers and 1 ≦ i < i + 1 ≦ j. In one embodiment, the secondary lead phenomenon refers to an unintended nucleotide extension occurring at a residue of the target polynucleotide during sequencing, the unintended extension being further extended by a nucleotide other than the next residue. In another embodiment, the dephasing amount (amount of phase loss) comprises changes in sequencing results due to a lead and/or lag phase loss phenomenon during sequencing.
In any of the preceding embodiments, the parameter estimation in step (a) may comprise obtaining an attenuation coefficient. In any of the preceding embodiments, the parameter estimation in step (a) may further comprise obtaining an offset. In any of the preceding embodiments, the parameter estimation in step (a) may comprise obtaining unit signal information. In any of the preceding embodiments, the parameter estimation in step (a) may comprise obtaining lead coefficients and/or lag coefficients for each nucleotide or combination of nucleotides.
In any of the preceding embodiments, the method comprises obtaining information of lead and/or lag phase loss for each sequencing reaction when performing multiple sequencing reactions.
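The iterate-until-convergence structure of steps (c) through (f) can be illustrated with a minimal sketch. The mixing model below is an assumption made only for illustration: each cycle is taken to pick up a fraction `lead` of the next cycle's signal and a fraction `lag` of the previous cycle's signal. It does not model the secondary lead amount of step (c), and the function names `dephase` and `correct_signal` are hypothetical.

```python
def dephase(h, lead, lag):
    """Toy mixing model (an assumption): cycle i receives a fraction `lead`
    of the next cycle's signal and a fraction `lag` of the previous one."""
    n = len(h)
    out = []
    for i in range(n):
        nxt = h[i + 1] if i + 1 < n else 0.0
        prv = h[i - 1] if i > 0 else 0.0
        out.append((1.0 - lead - lag) * h[i] + lead * nxt + lag * prv)
    return out


def correct_signal(observed, lead, lag, tol=1e-9, max_rounds=100):
    """Repeat the correction until the predicted signal stops changing.

    Each round estimates a dephasing amount from the current prediction,
    removes it from the observed signal, and checks for convergence.
    Returns the predicted signal and the number of rounds used.
    """
    predicted = list(observed)  # start from the uncorrected signal
    n = len(observed)
    for round_no in range(1, max_rounds + 1):
        updated = []
        for i in range(n):
            nxt = predicted[i + 1] if i + 1 < n else 0.0
            prv = predicted[i - 1] if i > 0 else 0.0
            dephasing_amount = lead * nxt + lag * prv
            updated.append((observed[i] - dephasing_amount) / (1.0 - lead - lag))
        change = max(abs(a - b) for a, b in zip(updated, predicted))
        predicted = updated
        if change < tol:  # convergence criterion of this sketch
            return predicted, round_no
    return predicted, max_rounds


if __name__ == "__main__":
    ideal = [1.0, 0.0, 2.0, 1.0, 0.0, 3.0, 1.0, 1.0, 0.0, 2.0]
    observed = dephase(ideal, lead=0.02, lag=0.01)
    predicted, rounds = correct_signal(observed, lead=0.02, lag=0.01)
    print(rounds, [round(x, 3) for x in predicted])
```

In this toy model the loop converges quickly because the lead and lag fractions are small relative to the in-phase fraction; the convergence test on the largest per-cycle change is one possible reading of "mathematically convergent".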
In another aspect, provided herein is a method of correcting sequencing information errors, comprising: (a) performing parameter estimation based on sequencing signals obtained from one or more reference polynucleotides during a sequencing reaction and on the known nucleic acid sequences of the reference polynucleotides; (b) obtaining a sequencing signal from the target polynucleotide during a sequencing reaction; (c) calculating a secondary lead amount of the target polynucleotide based on the information on lead or lag phase loss obtained by the parameter estimation in step (a) and the sequencing signal obtained from step (b); (d) calculating a dephasing amount of the target polynucleotide based on the sequencing signal obtained from step (b) and the secondary lead amount of step (c); (e) correcting the sequencing signal obtained from step (b) using the dephasing amount to generate a predicted sequencing signal for the target polynucleotide; and (f) repeating steps (c) through (e) for one or more rounds, wherein the predicted sequencing signal from round i is used to calculate the secondary lead amount of the target polynucleotide in round i + 1, until the predicted sequencing signal for the target polynucleotide from round j is mathematically convergent, wherein i and j are integers and 1 ≦ i < i + 1 ≦ j. In one aspect, the parameter estimation comprises obtaining a lead amount, a lag amount, an attenuation coefficient, and/or an offset amount based on the sequencing signal from the reference polynucleotide and the known nucleic acid sequence of the reference polynucleotide. In another aspect, the secondary lead phenomenon refers to an unintended nucleotide extension occurring at a residue of the target polynucleotide during sequencing, the unintended extension being further extended by a nucleotide other than the next residue.
In yet another aspect, the dephasing amount (amount of phase loss) comprises a change in sequencing results due to a lead and/or lag phase loss phenomenon during sequencing.
In yet another aspect, disclosed herein is a method of correcting lead during sequencing, comprising: obtaining a sequencing signal from the target polynucleotide during a sequencing reaction, the sequencing signal corresponding to a sequence of the target polynucleotide; and optionally using parameter estimation to correct the sequencing signal from the target polynucleotide for a secondary lead amount caused by a secondary lead phenomenon. In one embodiment, the secondary lead phenomenon refers to an unintended nucleotide extension occurring at a residue of the target polynucleotide during sequencing, the unintended extension being further extended by a nucleotide other than the next residue.
In one aspect, the sequencing signal from the target polynucleotide comprises a primary lead due to a primary lead phenomenon, wherein the primary lead phenomenon refers to an unintended nucleotide extension occurring at a residue of the target polynucleotide during sequencing.
In any of the preceding embodiments, if the sequencing signal from a particular nucleotide residue of the target polynucleotide is close to a unit signal, the sequencing signal can be corrected using a secondary lead amount. In any of the preceding embodiments, the sequencing signal intensity may be within about 60%, within about 50%, within about 40%, within about 30%, within about 20%, within about 10%, or within about 5% of the unit signal intensity.
In any of the preceding embodiments, when obtaining the n-th sequencing signal, the method can comprise: comparing the sequencing signal of a reference polynucleotide to the known sequence of the reference polynucleotide to identify errors arising during sequencing and to derive a method of correcting the errors; applying the method of correcting errors to the sequencing signals of the target polynucleotide before n to obtain a corrected sequencing signal, e.g., by feeding the sequencing signals of the target polynucleotide before n back into the method of correcting errors; and determining whether a secondary lead is present at residue n by comparing the sequencing signal of the target polynucleotide at residue n to the corrected sequencing signal.
In any of the preceding embodiments, sequencing may comprise adding one or more sequencing reagents to the reaction solution, wherein the one or more sequencing reagents optionally comprise nucleotides and/or enzymes. In any of the preceding embodiments, one, two or three types of nucleotides may be added in each sequencing reaction in the sequencing. In any of the preceding embodiments, the sequencing reaction involves an open or unblocked 3' end of the polynucleotide. In any of the preceding embodiments, in sequencing, the added nucleotides may comprise one or more of A, G, C and T, or one or more of A, G, C and U. In any of the preceding embodiments, the detected sequencing signal can comprise an electrical signal, a bioluminescent signal, a chemiluminescent signal, or any combination thereof.
In any of the preceding embodiments, the parameter estimation may comprise: inferring the ideal signal h from the sequence of the reference polynucleotide, calculating the dephasing signal (or mismatch) s and the predicted raw sequencing signal p according to pre-set parameters, and calculating the correlation coefficient c between p and the actual raw sequencing signal f. In one aspect, the method further comprises finding a set of parameters using an optimization method such that the correlation coefficient c reaches an optimal value. In another aspect, the set of parameters includes a lead coefficient or quantity, a lag coefficient or quantity, a decay coefficient, an offset, a unit signal, or any combination thereof.
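The parameter estimation just described (ideal signal h, predicted raw signal p, correlation coefficient c against the actual raw signal f) can be sketched as follows. Everything specific here is an assumption for illustration: the two flow sets `("AC", "GT")`, the simple neighbour-mixing model in `predict`, and the exhaustive grid search standing in for the unspecified optimization method. Only the lead and lag coefficients are fitted; the function names are hypothetical.

```python
import math


def ideal_signal(nascent_seq, flow_sets, n_cycles):
    """Ideal signal h: bases of the nascent strand added in each cycle when
    flow sets are applied alternately (assumed 2+2 style scheme)."""
    h, pos = [], 0
    for cycle in range(n_cycles):
        allowed = flow_sets[cycle % len(flow_sets)]
        count = 0
        while pos < len(nascent_seq) and nascent_seq[pos] in allowed:
            count += 1
            pos += 1
        h.append(float(count))
    return h


def predict(h, lead, lag):
    """Predicted raw signal p under an assumed neighbour-mixing model."""
    n = len(h)
    p = []
    for i in range(n):
        nxt = h[i + 1] if i + 1 < n else 0.0
        prv = h[i - 1] if i > 0 else 0.0
        p.append((1.0 - lead - lag) * h[i] + lead * nxt + lag * prv)
    return p


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def estimate(h, f, grid):
    """Grid search for the (lead, lag) pair maximizing c between p and f."""
    best_c, best_lead, best_lag = -2.0, None, None
    for lead in grid:
        for lag in grid:
            c = pearson(predict(h, lead, lag), f)
            if c > best_c:
                best_c, best_lead, best_lag = c, lead, lag
    return best_lead, best_lag, best_c


if __name__ == "__main__":
    h = ideal_signal("AGTCTGGATCACAGTCTGAACG", ("AC", "GT"), 12)
    f = predict(h, 0.02, 0.01)  # stands in for a measured reference signal
    print(estimate(h, f, [i / 100 for i in range(6)]))
```

A real estimation would fit measured reference signals, which include noise, and would likely use a continuous optimizer rather than a coarse grid.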
In any of the preceding embodiments, during sequencing, two sets of reaction solutions may be provided, each set comprising one or more nucleotides that are different from those of the other set, and one reaction solution is provided in each sequencing reaction. In one aspect, the two sets of reaction solutions are used in an alternating fashion to perform the sequencing reactions. In any of the preceding embodiments, sequencing of the target polynucleotide and the reference polynucleotide is performed simultaneously.
In any of the preceding embodiments, the reference polynucleotide may be used for parameter estimation to obtain one or more of the following parameters of the sequencing reaction: a lead coefficient or amount, a lag coefficient or amount, a decay coefficient, an offset, and a unit signal. In any of the preceding embodiments, the signal of the target polynucleotide may be corrected using one or more parameters of the sequencing reaction obtained by parameter estimation. In any of the preceding embodiments, the target polynucleotide may comprise a tag comprising a known sequence and/or a known amount of nucleotides, and the known sequence and/or the known amount of nucleotides is used to generate a unit signal for the sequencing reaction. In any of the preceding embodiments, the unit signal at each sampling point, e.g., at each nucleotide residue of the target polynucleotide, can be different.
In yet another aspect, disclosed herein is a computer-readable medium comprising instructions for correcting sequencing information errors. In one aspect, the instructions include: a) receiving sequencing information for the target polynucleotide and the reference polynucleotide; and b) correcting the sequencing information of the target polynucleotide using any of the methods of correcting sequencing information disclosed herein.
In another aspect, a computer system for sequencing is provided, the system comprising a computer-readable medium disclosed herein.
Drawings
FIG. 1 shows a method for correcting sequence data errors.
Fig. 2 shows data distribution of group 1 to group 5 data shown in a violin chart and a box chart. Black represents encoding accuracy and grey represents decoding accuracy. Groups 1 to 5 are presented in the sequence from left to right.
Fig. 3 shows a frequency-distribution histogram of the number of signals modified in decoding for each of 5000 pieces of sequence data.
Fig. 4 shows the correlation between the number of signals in which an error occurs in encoding and the number of signals modified in decoding. The abscissa represents the number of signals in which an error occurs in encoding, the ordinate represents the number of signals modified in decoding, and the gray level represents the proportion of all sequences counted at that point.
Fig. 5A-C show the improvement of the fluorogenic performance of TPLFN obtained by changing the fluorophore structure.
FIG. 6 shows MALDI-TOF mass spectrum of purified TPLFN.
Fig. 7 shows excitation and emission spectra of TG (Tokyo Green).
FIG. 8 shows emission spectra of TG (Tokyo green), Me-FAM and Me-HCF under the same conditions (2. mu.M, pH 8.3, TE buffer, calculated by area normalization).
FIG. 9 shows absorption spectra of TPLFN (TG-dA4P) before and after enzymatic digestion.
FIG. 10 shows emission spectra of TPLFN (TG-dA4P) before and after enzymatic digestion.
Fig. 11 shows the kinetic pattern.
FIG. 12 shows the reaction rate differences between the four substrates.
FIG. 13 shows substrate competition.
FIG. 14 shows homopolymer length versus signal linearity.
Fig. 15A shows a homopolymer consisting of only T. Fig. 15B shows a homopolymer consisting of four repeating TCs.
Figure 16 shows the temperature dependent activity of Bst.
FIG. 17 shows the synthesis of N- (5- (2-bromoacetamido) pentyl) acrylamide.
FIG. 18 shows primer grafting.
Fig. 19 shows the difference in contact angle between glass and a BPAM coated surface.
FIG. 20 shows ECCS library design. FIG. 20a shows an ECCS library prior to solid phase PCR. FIG. 20b shows ECCS library prior to solid phase PCR. Figure 20c shows the ECCS library after annealing of the sequencing primer.
FIG. 21 shows a template preparation process.
FIG. 22 shows the results of gel electrophoresis of the PCR products. Lane 1 is the marker (Transgene, 100bp Plus II DNA Ladder); lanes 2 and 3 are two 200bp templates (L718-208 (330bp) and L10115-201 (323bp), respectively); lanes 4-6 are three 300bp templates (L718-308 (430bp), L4418-305 (427bp) and L10115-301 (423bp), respectively); lanes 7-9 are three 500bp templates (L501-500 (622bp), L30501-500 (622bp) and L46499-500 (622bp), respectively).
FIG. 23 shows a solid phase PCR process.
FIG. 24 shows a heat map of PCR product density for different lanes and locations (upper panel). The x-axis markers of each figure represent four different lanes of the chip; the y-axis markers of each plot represent five different imaging positions of the lane. The color from black to green indicates that the PCR product density goes from low to high. The lower panel shows the PCR product density for different templates. The x-axis markers are different experimental groups of solid phase PCR; the y-axis marker is the average density per lane of the chip.
FIG. 25 shows a sequencing instrument in the top panel, a typical fluorescence reaction kinetics curve in the bottom left panel, and a kinetics curve for each reaction cycle throughout the sequencing period, according to one embodiment.
Fig. 26 shows the phase loss process.
FIG. 27 shows the mock sequencing signal (left) and the DNA concentration profile at different positions (right). Color bar (grayscale bar): the DNA ratio. (FIGS. 27a and 27b) impurities: 0; reaction time: (fig. 27c and 27d) impurities: 0.003; reaction time: (fig. 27e and 27f) impurities: 0; reaction time: 100.
The upper panel of FIG. 28 shows the "One Pass, More Stop" principle. The lower panel shows the distribution matrix and the flux matrix and the relationship between the two. The lead coefficient ε and the lag coefficient λ are set to 2% and 1%, respectively. These values are deliberately large so that the effect of phase loss is clearly visible; they are not estimates from experimental data.
Fig. 29 shows a simplified flow chart of the correction algorithm.
Fig. 30 shows the application of the correction algorithm.
Fig. 31 shows the phase loss correction algorithm.
Fig. 32 shows the effect of the dephasing coefficients on the condition number of the flux matrix T.
Fig. 33 shows the effect of deviations in the dephasing coefficients on signal correction.
FIG. 34 shows that global white noise can reduce the accuracy of the corrected signal and make later cycles error-prone.
FIG. 35 shows the number of error-free cycles after phase loss correction for given dephasing coefficients and global white noise.
Fig. 36 shows the effect of signal anomalies in certain cycles.
Fig. 37A shows the trajectory of each coefficient in the dephasing coefficient estimation algorithm. Figure 37B summarizes the dephasing coefficients in multiple rounds of sequencing. FIG. 37C shows the relationship between the dephasing coefficients and sequencing reaction time.
FIG. 38 shows phase loss phenomenon in high throughput DNA sequencing. The squares represent the nucleotides of the template DNA and the circles represent the nucleotides that make up the nascent DNA strand. The pattern with diagonal lines represents the sequencing primer region, and the pattern filled in white or gray represents the different types of nucleotides.
Fig. 39 shows a primary lead phenomenon and a secondary lead phenomenon.
Fig. 40 shows that a lead of three or more steps does not occur.
Fig. 41 shows a basic procedure of parameter estimation.
Fig. 42 shows a basic procedure of signal correction.
FIG. 43 shows a monochromatic 2+2 raw sequencing signal.
FIG. 44 shows the variation trend of each parameter in the parameter estimation process of monochromatic 2+2 sequencing raw signals.
FIG. 45 shows the raw and dephasing signals for monochromatic 2+2 sequencing.
FIG. 46 shows the iterative steps in signal correction of a monochromatic 2+2 sequencing signal.
FIG. 47 shows the raw signal of one two-color 2+2 sequencing.
Figure 48 shows the trend of all parameters during parameter estimation for two-color 2+2 sequencing.
Figure 49 shows the raw signal and the dephasing signal of one two-color 2+2 sequencing.
Figure 50 shows the iterative steps in signal correction for two-color 2+2 sequencing.
Figure 51 shows the statistics of signal corrections for multiple single color 2+2 sequencing.
FIG. 52 shows the principle of degenerate-base fluorogenic sequencing, according to one aspect of the present invention.
FIG. 53 shows degenerate base recognition (base-trapping) results, according to an aspect of the present invention.
FIG. 54 shows an information communication model for ECC sequencing, according to an aspect of the present invention.
FIG. 55 shows sequence decoding results using dynamic programming, in accordance with an aspect of the present invention.
FIG. 56 shows that decoding improves ECC sequencing accuracy, according to an aspect of the present invention.
FIG. 57 shows a range distribution of cycles of three base combinations, according to an aspect of the invention.
FIG. 58 shows an example of the hierarchy and node traversal order of the scoring matrix structure.
Figure 59 illustrates a state transition network for a hidden markov model for ECC decoding, in accordance with an aspect of the present invention.
FIG. 60 shows simulated distributions of accuracy before and after decoding, in accordance with an aspect of the present invention.
Fig. 61 shows an exemplary decoding result.
Detailed Description
The following provides a detailed description of one or more embodiments of the claimed subject matter and the accompanying drawings that illustrate principles of the claimed subject matter. The claimed subject matter is described in connection with such embodiments, but is not limited to any particular embodiment. It is to be understood that the claimed subject matter may be embodied in various forms and encompasses numerous alternatives, modifications, and equivalents. Therefore, specific details disclosed herein are not to be interpreted as limiting, but rather as a basis for the claims and as a representative basis for teaching one skilled in the art to employ the claimed subject matter in virtually any appropriately detailed system, structure or manner. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example only and claimed subject matter may be practiced according to the claims without some or all of these specific details. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter. It is to be understood that the various features and functions described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment in which they are described. Rather, they may be applied, alone or in some combination, to one or more other embodiments of the disclosure, whether or not those embodiments are described, and whether or not those features are presented as part of a described embodiment. For the purpose of clarity, technical material that is known in the technical fields related to the claimed subject matter has not been described in detail so as not to unnecessarily obscure the claimed subject matter.
All technical terms, symbols, and other technical and scientific terms used herein are intended to have the same meaning as commonly understood by one of ordinary skill in the art to which the claimed subject matter belongs unless defined otherwise. In some instances, terms having commonly understood meanings are defined herein for clarity and/or ease of reference, and such definitions are incorporated herein, but should not necessarily be construed to represent substantial differences over what is commonly understood in the art. Many of the techniques and procedures described or referenced herein are known to those skilled in the art and are commonly employed when using conventional methods.
All publications, including patent documents, scientific articles, and databases, referred to in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication were individually incorporated by reference. If a definition set forth herein is contrary to or otherwise inconsistent with a definition set forth in a patent, patent application, published application or other publication that is incorporated herein by reference, the definition set forth herein prevails over the definition that is incorporated by reference. Citation of a publication or document is not intended as an admission that any of it is pertinent prior art, nor does it constitute any admission as to the contents or date of such publication or document.
Unless otherwise indicated, all headings are for the convenience of the reader and should not be used to limit the meaning of the words following the heading.
Unless otherwise indicated, practice of the provided embodiments will employ conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and sequencing technology, which are within the skill of those who practice in the art. Such conventional techniques include polypeptide and protein synthesis and modification, polynucleotide synthesis and modification, polymer array synthesis, hybridization and ligation of polynucleotides, and detection of hybridization using labels. Specific illustrations of suitable techniques can be obtained by reference to the examples herein. However, other equivalent conventional procedures may of course be used. Such conventional techniques and descriptions can be found in standard laboratory manuals, such as Green et al., eds., Genome Analysis: A Laboratory Manual Series (Vols. I-IV) (1999); Weiner, Gabriel, Stephens, eds., Genetic Variation: A Laboratory Manual (2007); Dieffenbach, Dveksler, eds., PCR Primer: A Laboratory Manual (2003); Bowtell and Sambrook, DNA Microarrays: A Molecular Cloning Manual (2003); Mount, Bioinformatics: Sequence and Genome Analysis (2004); Sambrook and Russell, Condensed Protocols from Molecular Cloning: A Laboratory Manual (2006); and Sambrook and Russell, Molecular Cloning: A Laboratory Manual (2002) (all from Cold Spring Harbor Laboratory Press); Ausubel et al., eds., Current Protocols in Molecular Biology (1987); Brown, ed., Essential Molecular Biology (1991), IRL Press; Goeddel, ed., Gene Expression Technology (1991), Academic Press; Bothwell et al., eds., Methods for Cloning and Analysis of Eukaryotic Genes (1990), Bartlett Publ.; Kriegler, Gene Transfer and Expression (1990), Stockton Press; Wu et al., eds., Recombinant DNA Methodology (1989), Academic Press; McPherson et al., PCR: A Practical Approach (1991), IRL Press at Oxford University Press; Stryer, Biochemistry (4th ed.) (1995), W.H. Freeman, New York, N.Y.; Gait, Oligonucleotide Synthesis: A Practical Approach (2002), IRL Press, London; Nelson and Cox, Lehninger, Principles of Biochemistry (2000), 3rd ed., W.H. Freeman Pub., New York, N.Y.; Berg et al., Biochemistry (2002), 5th ed., W.H. Freeman Pub., New York, N.Y.; Weir & C. Blackwell, eds., Handbook of Experimental Immunology (1996), Wiley-Blackwell; Abbas et al., Cellular and Molecular Immunology (W.B. Saunders Co. 1991, 1994); and Coligan et al., Current Protocols in Immunology (1991), all of which are incorporated by reference in their entirety for all purposes.
Throughout this disclosure, various aspects of the claimed subject matter are presented in a range format. It is to be understood that the description in range format is merely for convenience and brevity and should not be construed as a limitation on the scope of the claimed subject matter. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges, as well as individual numerical values within that range. For example, where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range, and any other stated or intervening value in that stated range, is encompassed within the claimed subject matter. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the claimed subject matter, subject to any specifically excluded limit in the stated range. Where a stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the claimed subject matter. This applies regardless of the breadth of the range. For example, a description of a range such as from 1 to 6 should be considered to specifically disclose sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual values within that range, e.g., 1, 2, 3, 4, 5, and 6.
I. Definitions
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, "a" or "an" means "at least one" or "one or more". It is understood that aspects and variations described herein include "consisting of" and/or "consisting essentially of" aspects and variations.
The term "about" as used herein refers to a common error range for corresponding values as readily known to those skilled in the art. Reference herein to "about" a value or parameter includes (and describes) embodiments that are directed to the value or parameter itself. For example, a description of "about X" includes a description of "X" itself.
The terms "polynucleotide", "oligonucleotide", "nucleic acid" and "nucleic acid molecule" are used interchangeably herein to refer to polymeric forms of nucleotides of any length, and include ribonucleotides, deoxyribonucleotides, and analogs or mixtures thereof. The terms include triple-stranded, double-stranded and single-stranded deoxyribonucleic acid ("DNA"), as well as triple-stranded, double-stranded and single-stranded ribonucleic acid ("RNA"). They also include polynucleotides modified, e.g., by alkylation and/or by capping, as well as unmodified forms thereof. More specifically, the terms "polynucleotide", "oligonucleotide", "nucleic acid" and "nucleic acid molecule" include polydeoxyribonucleotides (containing 2-deoxy-D-ribose); polyribonucleotides (containing D-ribose), including tRNA, rRNA, hRNA and mRNA, whether spliced or unspliced; any other type of polynucleotide that is an N- or C-glycoside of a purine or pyrimidine base; and other polymers that contain non-nucleotide backbones, such as polyamide (e.g., peptide nucleic acid ("PNA")) and polymorpholino polymers. The terms further include polynucleotides carrying known types of modification, such as substitution of one or more nucleotides with an analog and modified internucleotide linkages in place of the phosphodiester linkage, including positively charged linkages.
It is to be understood that the terms "nucleoside" and "nucleotide" as used herein include not only the known purine and pyrimidine bases, but also other heterocyclic bases which have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, or other heterocycles. Modified nucleosides or nucleotides can also include modifications on the sugar moiety, for example, wherein one or more hydroxyl groups are substituted with a halogen, an aliphatic group, or are functionalized as ethers, amines, or the like. The term "nucleotide unit" is intended to encompass nucleosides and nucleotides.
The terms "complementary" and "substantially complementary" include hybridization or base pairing, or the formation of a duplex, between nucleotides or nucleic acids (e.g., between the two strands of a double-stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single-stranded nucleic acid). Complementary nucleotides are typically A and T (or A and U), or C and G. Two single-stranded RNA or DNA molecules may be said to be substantially complementary when the nucleotides of one strand, optimally aligned and with appropriate nucleotide insertions or deletions, pair with at least about 80% of the nucleotides of the other strand, typically at least about 90% to about 95%, and even about 98% to about 100%. In one aspect, two complementary nucleotide sequences are capable of hybridizing with each other, preferably with less than 25% mismatches between opposing nucleotides, more preferably with less than 15% mismatches, even more preferably with less than 5% mismatches, and most preferably with no mismatches. Preferably, the two molecules will hybridize under conditions of high stringency.
As used herein, "hybridization" may refer to the process in which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded polynucleotide. In one aspect, the resulting double-stranded polynucleotide may be a "hybrid" or "duplex". Typical "hybridization conditions" include salt concentrations of less than about 1 M, usually less than about 500 mM, and possibly less than about 200 mM. A "hybridization buffer" comprises a buffered saline solution, such as 5% SSPE or other such buffers known in the art. Hybridization temperatures can be as low as 5 ℃, but are typically above 22 ℃, more typically above about 30 ℃, and often above 37 ℃. Hybridization is often performed under stringent conditions, i.e., conditions under which a sequence will hybridize to its target sequence but not to other, non-complementary sequences. Stringent conditions are sequence-dependent and differ in different circumstances. For example, for specific hybridization, longer fragments may require higher hybridization temperatures than shorter fragments. Because other factors, including the base composition and length of the complementary strands, the presence of organic solvents, and the extent of base mismatching, can affect the stringency of hybridization, the combination of parameters is more important than the absolute measure of any one parameter alone. Typically, stringent conditions are selected to be about 5 ℃ lower than the Tm for the specific sequence at a defined ionic strength and pH. The melting temperature Tm may be the temperature at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. A number of equations for calculating the Tm of nucleic acids are known in the art. 
As indicated in standard references, a simple estimate of Tm can be calculated by the equation Tm = 81.5 + 0.41(%G + C) when the nucleic acid is in aqueous solution at 1 M NaCl (see, e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization (1985)). Other references (e.g., Allawi and SantaLucia, Jr., Biochemistry, 36:10581-94 (1997)) include alternative methods of calculation that take structural and environmental as well as sequence characteristics into account when calculating Tm.
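The simple estimate above can be sketched in a few lines of Python. This is only an illustration of the stated formula at 1 M NaCl; the function names `gc_percent` and `simple_tm` are hypothetical and do not come from the cited references.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of G and C bases in a nucleic acid sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = sum(1 for base in seq if base in "GC")
    return 100.0 * gc_count / len(seq)


def simple_tm(seq: str) -> float:
    """Simple Tm estimate (in degrees C): Tm = 81.5 + 0.41 * (%G + C).

    Assumes aqueous solution at 1 M NaCl, as stated in the text; no
    length, mismatch, or solvent corrections are applied.
    """
    return 81.5 + 0.41 * gc_percent(seq)


if __name__ == "__main__":
    # Example call with an arbitrary illustrative sequence.
    print(simple_tm("ACGTACGTGGCC"))
```

Because the estimate depends only on base composition, two sequences with the same G + C percentage receive the same value regardless of length, which is one reason the text points to alternative calculation methods.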
Generally, the stability of hybrids is a function of ion concentration and temperature. Typically, hybridization reactions are performed under conditions of lower stringency, followed by washes of different, but higher, stringency. Exemplary stringent conditions include a salt concentration of at least 0.01 M to no more than 1 M sodium ion concentration (or other salt) at a pH of about 7.0 to about 8.3 and a temperature of at least 25 ℃. For example, 5× SSPE conditions (750 mM NaCl, 50 mM sodium phosphate, 5 mM EDTA at pH 7.4) and a temperature of about 30 ℃ are suitable for allele-specific hybridization, although suitable temperatures depend on the length and/or GC content of the hybridization region. In one aspect, "stringency of hybridization" in determining percent mismatch can be as follows: 1) high stringency: 0.1× SSPE, 0.1% SDS, 65 ℃; 2) moderate stringency: 0.2× SSPE, 0.1% SDS, 50 ℃ (also known as medium stringency); and 3) low stringency: 1.0× SSPE, 0.1% SDS, 50 ℃. It is understood that equivalent stringencies may be achieved using alternative buffers, salts and temperatures. For example, moderately stringent hybridization can refer to conditions that allow a nucleic acid molecule, such as a probe, to bind to a complementary nucleic acid molecule. The hybridizing nucleic acid molecules are typically at least 60% identical, including, for example, any of at least 70%, 75%, 80%, 85%, 90%, or 95% identical. Moderate stringency conditions can be conditions equivalent to: hybridization in 50% formamide, 5× Denhardt's solution, 5× SSPE, 0.2% SDS at 42 ℃, followed by washing in 0.2× SSPE, 0.2% SDS at 42 ℃. High stringency conditions can be provided, for example, by hybridization in 50% formamide, 5× Denhardt's solution, 5× SSPE, 0.2% SDS at 42 ℃, followed by washing in 0.1× SSPE and 0.1% SDS at 65 ℃.
Low stringency hybridization can refer to conditions equivalent to: hybridization in 10% formamide, 5× Denhardt's solution, 6× SSPE, 0.2% SDS at 22 ℃, followed by washing in 1× SSPE, 0.2% SDS at 37 ℃. Denhardt's solution contains 1% Ficoll, 1% polyvinylpyrrolidone and 1% bovine serum albumin (BSA). 20× SSPE (sodium chloride, sodium phosphate, EDTA) contains 3 M sodium chloride, 0.2 M sodium phosphate and 0.025 M EDTA. Other suitable moderate and high stringency hybridization buffers and conditions are known to those of skill in the art and are described, for example, in Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd edition, Cold Spring Harbor Press, Plainview, N.Y. (1989); and Ausubel et al., Short Protocols in Molecular Biology, 4th edition, John Wiley & Sons (1999).
Alternatively, substantial complementarity exists when an RNA or DNA strand will hybridize to its complement under selective hybridization conditions. Typically, selective hybridization will occur when there is at least about 65% complementarity, preferably at least about 75%, more preferably at least about 90% complementarity over a stretch of at least 14 to 25 nucleotides. See M.Kanehisa, Nucleic Acids Res.12:203 (1984).
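As a rough illustration of the percentage thresholds above, the following Python sketch counts Watson-Crick pairs between two equal-length strands written 5' to 3'. It is a simplification made for this example: it does not perform the optimal alignment or allow the insertions and deletions mentioned in the definition, and the names `percent_complementarity` and `meets_selective_threshold`, as well as the default threshold values, are illustrative choices.

```python
# Watson-Crick pairs, including A-U for RNA.
WC_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


def percent_complementarity(strand_a: str, strand_b: str) -> float:
    """Percentage of positions that pair when strand_b is read antiparallel.

    Both strands are given 5'->3' and must have the same length; gaps are
    not modeled (an assumption of this sketch).
    """
    a = strand_a.upper()
    b = strand_b.upper()[::-1]  # reverse so that position i faces position i
    if not a or len(a) != len(b):
        raise ValueError("strands must be non-empty and of equal length")
    paired = sum(1 for x, y in zip(a, b) if (x, y) in WC_PAIRS)
    return 100.0 * paired / len(a)


def meets_selective_threshold(strand_a: str, strand_b: str,
                              min_percent: float = 65.0,
                              min_length: int = 14) -> bool:
    """Check the 'at least about 65% over at least 14 nucleotides' criterion."""
    if len(strand_a) < min_length:
        return False
    return percent_complementarity(strand_a, strand_b) >= min_percent
```

For example, `percent_complementarity("AAAA", "TTTA")` evaluates three of four positions as paired, i.e. 75%.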
As used herein, a "primer" may be a natural or synthetic oligonucleotide capable of acting as a point of initiation of nucleic acid synthesis upon formation of a duplex with a polynucleotide template, and capable of being extended along the template from its 3' end, thereby forming an extended duplex. The nucleotide sequence added during extension is determined by the sequence of the template polynucleotide. Primers are typically extended by a polymerase, such as a DNA polymerase.
A "substantially non-fluorescent" moiety refers to a moiety that emits little or substantially no detectable fluorescence. For example, the ratio of detectable absolute fluorescent emission from the fluorescent moiety to detectable absolute fluorescent emission from the substantially non-fluorescent moiety is generally greater than or equal to about 500:1, more generally greater than or equal to about 1000:1, and even more generally greater than or equal to about 1500:1 (e.g., about 2000:1, about 2500:1, about 3000:1, about 3500:1, about 4000:1, about 4500:1, about 5000:1, about 10^4:1, about 10^5:1, about 10^6:1, about 10^7:1 or about 10^8:1) at about the same concentration of the fluorescent moiety and the substantially non-fluorescent moiety.
"sequencing" and the like, such as nucleotide sequencing methods, including the determination of information relating to the nucleotide base sequence of a nucleic acid. This information may include confirmation or determination of partial as well as complete sequence information for the nucleic acid. Different degrees of statistical reliability or confidence may be used to determine sequence information. In one aspect, the term includes determining the identity and ordering of a plurality of contiguous nucleotides in a nucleic acid. "high-throughput sequencing" or "next generation sequencing" includes sequence determination using methods that determine many (typically thousands to billions) of nucleic acid sequences in an inherently parallel manner, i.e., where DNA templates are prepared for sequencing not one at a time, but rather in a batch process, and preferably many sequences are read in parallel, or using ultra-high-throughput serial processes that can parallelize themselves. Such methods include, but are not limited to, pyrosequencing (e.g., as commercialized by 454Life Sciences, inc., Branford, CT); sequencing by ligation (e.g., as in SOLID)TMTechnology, Life Technologies, Inc., Carlsbad, Calif.); sequencing by use of modified nucleotide synthesis (such as in TruSeq)TMAnd HiSeqTMCommercialized in the art by Illumina, inc., San Diego, CA; in HeliScopeTMCommercially available from Helicos Biosciences Corporation, Cambridge, Mass; and commercialized by Pacific Biosciences of California, inc., menlopack, CA in PacBio RS) by Ion detection techniques (such as Ion TorrentTMTechniques, Life Technologies, Carlsbad, CA) sequencing; sequencing of DNA nanospheres (Complete Genomics, inc., Mountain View, CA); nanopore-based sequencing techniques (e.g., as developed by Oxford Nanopore Technologies, LTD, Oxford, UK), and sequencing methods such as high parallelization.
In any of the embodiments disclosed herein, the method of obtaining sequence information for a polynucleotide of interest can be performed in a multiplex assay. "multiplexing" or "multiplex assay" herein may refer to an assay or other analytical method in which the presence and/or amount of multiple targets (e.g., multiple nucleic acid sequences) may be determined simultaneously, wherein each target has at least one different detection characteristic, e.g., a fluorescence characteristic (e.g., excitation wavelength, emission intensity, FWHM (full width at half maximum) or fluorescence lifetime) or a unique nucleic acid or protein sequence characteristic.
In any of the embodiments disclosed herein, the sequencing reaction of the target polynucleotide can be performed on an array, such as a microchip. The array may comprise a plurality of reaction volumes created, for example, by a plurality of reaction chambers disposed on the array. The target nucleotide sequence or a fragment thereof may be attached or immobilized in the reaction volumes, such as by adsorption or by specific binding to a capture molecule on a solid support in each reaction volume. After the reaction solution has been provided and delivered to each reaction volume, each reaction volume may be enclosed and/or separated from the other reaction volumes on the array. Signals, such as fluorescence information, can then be detected and/or recorded from each reaction volume.
In any of the embodiments disclosed herein, the array can be addressable. In one aspect, addressability includes the ability of the microchip to direct substances such as nucleic acids, enzymes and other amplification components from one location to another on the microchip (the capture sites of the chip). In another aspect, addressability includes the ability to spatially encode the sequencing reaction and/or its sequencing products at each array spot, such that after sequence read-out, the sequencing reaction and/or its sequencing products can be mapped back to a particular spot on the array and associated with other identifying information from that particular spot. For example, a spatially encoded tag may be conjugated to a target polynucleotide such that when the conjugated target polynucleotide is sequenced, the tag sequence reveals where on the array the target is located.
Sequencing method
In one aspect, disclosed herein are methods of sequencing using nucleotide molecules whose phosphate is modified with a fluorophore. In another aspect, disclosed herein are methods of sequencing using nucleotide molecules modified with fluorophores having fluorescence-switching properties.
In one aspect, methods of sequencing mixed nucleotides are disclosed herein. In particular embodiments, disclosed herein are sequencing methods that use phosphate modification of mixed nucleotide molecules with fluorophores. Furthermore, the present disclosure also relates to sequencing methods based on fluorophores with fluorescence switching properties.
In one aspect, disclosed herein are sequencing methods using mixed nucleotide molecules. In particular embodiments, methods of sequencing using mixed nucleotide molecules modified with fluorophores are disclosed herein. Furthermore, the invention relates to sequencing methods based on fluorophores with fluorescence-switching properties. The invention combines fluorescence-switching sequencing with mixed-nucleotide sequencing and achieves unexpected technical results. Its particular signal acquisition method and its efficiency give it broad prospects in gene sequencing.
In one aspect, disclosed herein are sequencing methods using nucleotide substrate molecules, wherein sequencing is performed by modifying the 5' end or the middle phosphate of the nucleotide substrate molecule with a fluorophore; one set of reaction solution was used for each round of sequencing, each set of reaction solution comprising two reaction solutions, each reaction solution containing two nucleotides having different bases. In one embodiment, the nucleotides in one reaction are complementary to two bases on the test nucleotide sequence and the nucleotides in the other reaction are complementary to two other bases on the test nucleotide sequence. In one embodiment, the method comprises first providing a fragment of the nucleotide sequence to be tested (e.g., by immobilizing the nucleotide sequence on a solid support) and then providing a first reaction solution of a set of reaction solutions, thereby initiating a first round of sequencing. In one embodiment, the method comprises detecting and recording fluorescent signals from the first round of sequencing. In one embodiment, the method then includes providing a second reaction solution of the same set of reaction solutions to continue the first round of sequencing. The fluorescent signal is again detected and recorded. In one aspect, the above steps are repeated and the first and second reaction solutions can be provided sequentially in any suitable order to obtain information encoding the nucleotide sequence to be determined by analysis of the fluorescent signal.
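The alternating two-solution scheme described above can be illustrated with a small simulation. The Python sketch below is a hypothetical, idealized model rather than the disclosed implementation: it works directly on the strand being synthesized, assumes every extension runs to completion, and assumes that the recorded signal equals the number of nucleotides incorporated in each addition. The function name `degenerate_signals` and the example mixes are illustrative.

```python
def degenerate_signals(nascent: str, mix_a: str, mix_b: str) -> list[int]:
    """Simulate one round of sequencing with two alternating reaction solutions.

    nascent: sequence of the strand being synthesized (5'->3'), bases ACGT.
    mix_a, mix_b: the two bases in each reaction solution, e.g. "AC" and "GT".
    Returns the idealized signal (number of incorporated bases) per addition.
    """
    nascent = nascent.upper()
    mixes = (set(mix_a.upper()), set(mix_b.upper()))
    if mixes[0] & mixes[1] or (mixes[0] | mixes[1]) != set("ACGT"):
        raise ValueError("the two mixes must split A, C, G, T into two pairs")
    if set(nascent) - set("ACGT"):
        raise ValueError("sequence may contain only A, C, G, T")

    signals = []
    pos = 0
    addition = 0
    while pos < len(nascent):
        current = mixes[addition % 2]  # solutions are provided alternately
        run = 0
        # Extension continues until the next base is absent from the solution.
        while pos < len(nascent) and nascent[pos] in current:
            run += 1
            pos += 1
        signals.append(run)
        addition += 1
    return signals


if __name__ == "__main__":
    print(degenerate_signals("AACGTTAG", "AC", "GT"))
```

In this model each signal only tells how many consecutive bases belonged to the current pair, not which of the two bases was incorporated, which is why a single round yields degenerate rather than fully resolved sequence information.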
In one embodiment, each reaction solution contains two nucleotides with different bases, which may be labeled with two different or the same fluorophores.
In any of the preceding embodiments, sequencing may be performed using nucleotide substrate molecules whose 5'-terminal phosphate or intermediate phosphate is modified with a fluorophore having fluorescence-switching properties. In one aspect, the fluorescence-switching property means that the fluorescence signal after sequencing is significantly altered compared to the signal before the sequencing reaction.
In any of the foregoing embodiments, the fluorescence switching property can refer to a significant increase (or elevation) in fluorescence signal after sequencing compared to before the sequencing reaction.
In one aspect, sequencing methods using nucleotide substrate molecules bearing fluorophores with fluorescence-switching properties are also disclosed herein. In one aspect, sequencing is performed using nucleotide substrate molecules whose 5'-terminal phosphate or intermediate phosphate is modified with a fluorophore having fluorescence-switching properties. In one aspect, the fluorescence-switching property means that the fluorescence signal intensity after sequencing is significantly enhanced compared with that before the sequencing reaction. One set of reaction solutions is used for each round of sequencing, each set comprises two reaction solutions, and each reaction solution comprises nucleotide substrate molecules with two different bases. In one aspect, the nucleotide substrate molecules in one of the reaction solutions can be complementary to two bases on the nucleotide sequence to be tested, and the nucleotide substrate molecules in the other reaction solution can be complementary to the two other bases on the nucleotide sequence to be tested. In one aspect, the method includes immobilizing a fragment of the nucleotide sequence to be tested in a reaction chamber and then introducing a first reaction solution of a set of reaction solutions. In one aspect, the method comprises using an enzyme to release the fluorophore from the nucleotide substrate, thereby causing fluorescence switching. In one aspect, the method includes introducing a second reaction solution of the same set of reaction solutions. In one aspect, the method again comprises using an enzyme to release the fluorophore from the nucleotide substrate, thereby causing fluorescence switching. In one aspect, the method comprises adding the two reaction solutions in an alternating manner and obtaining coding information for the nucleotide sequence to be tested from the fluorescence information.
In another aspect, disclosed herein are methods of sequencing nucleotide substrate molecules using fluorophores with fluorescence switching properties. In one aspect, sequencing is performed by modifying the 5' end of the nucleotide substrate molecule or the middle phosphate of a fluorophore with fluorescence switching properties. On the one hand, the fluorescence switching property means that the fluorescence signal intensity after sequencing is obviously enhanced compared with the fluorescence signal intensity before sequencing reaction. In one aspect, each sequencing run uses a set of reaction solutions, each set of reaction solutions comprising at least two reaction solutions, each reaction solution comprising at least one of A, G, C or T nucleotide substrate molecules or at least one of A, G, C or U nucleotide substrate molecules. In one aspect, the nucleotide sequence fragment to be tested is first immobilized in a reaction chamber, and a reaction solution from a set of reaction solutions is added to the reaction chamber. The sequencing reaction can be initiated under appropriate conditions and the fluorescent signal recorded. Then, one additional reaction solution is provided at a time, so that other reaction solutions in the same set of reaction solutions are provided successively in the sequencing reaction. Simultaneously, one or more fluorescence signals from each reaction solution are recorded. In one aspect, there is at least one reaction solution in a set of reaction solutions comprising two or three nucleotide molecules.
In another aspect, disclosed herein is a method of sequencing a nucleotide substrate molecule using a fluorophore having fluorescence switching properties by modifying the 5' end or the middle phosphate of the nucleotide substrate molecule of the fluorophore having fluorescence switching properties. On the one hand, the fluorescence switching property means that the fluorescence signal intensity after sequencing is obviously enhanced compared with the situation before sequencing reaction. In one aspect, each sequencing run uses a set of reaction solutions, each set of reaction solutions comprising at least two reaction solutions, each reaction solution comprising either A, G, C, T nucleotide substrate molecules or A, G, C, U nucleotide substrate molecules. In one aspect, the method comprises first immobilizing a nucleotide sequence fragment to be detected in a reaction chamber and then introducing one of a plurality of reaction solutions. In one aspect, the method includes testing and recording fluorescence information. In one aspect, the method comprises adding one reaction solution at a time, followed by the sequential addition of the other reaction solutions in the same set of reaction solutions. Fluorescence information from each sequencing reaction was recorded.
In another aspect, disclosed herein are methods of sequencing nucleotide substrate molecules using fluorophores having fluorescence-switching properties by modifying the 5' end or the middle phosphate of the nucleotide substrate molecule of the fluorophore having fluorescence-switching properties, i.e., the intensity of the fluorescent signal is significantly enhanced after sequencing compared to before the sequencing reaction. In one aspect, a set of reaction solutions is used for each round of sequencing, the reaction solutions comprising A, G, C, T four nucleotide substrate molecules, or the reaction solutions comprising A, G, C, U four nucleotide substrate molecules. In one aspect, the method comprises immobilizing a nucleotide sequence fragment to be detected in a reaction chamber, introducing a reaction solution, and recording fluorescence information.
In any of the preceding embodiments, the method may further comprise removing residual reaction solution and fluorescent molecules with a wash solution, and then performing the next round of sequencing reaction. In any of the foregoing embodiments, the reaction solution may be added at a low temperature and then heated to the enzyme reaction temperature, at which the fluorescent signal is detected. In any of the preceding embodiments, after the reaction solution is added to the reaction chamber, the reaction chamber can be closed and the fluorescence information can be detected and/or recorded.
In any of the foregoing embodiments, after the reaction solution is added, the space outside the reaction chamber may be filled with oil, thereby isolating and closing the reaction chamber. In any of the preceding embodiments, a polyphosphate nucleotide substrate molecule can refer to a nucleotide having 4 to 8 phosphate groups. In any of the preceding embodiments, the nucleotide substrate molecules modified with a fluorophore can be labeled with a single fluorophore for monochromatic sequencing, or with different fluorophores for multicolor sequencing.
In any of the preceding embodiments, the method can include using an enzyme to release a fluorophore having fluorescent switching properties on a nucleotide substrate, wherein the enzyme can optionally include a DNA polymerase and/or an alkaline phosphatase.
In any of the preceding embodiments, the two bases on the nucleotide sequence to be tested may comprise any two of the A, G, C and T bases or of the A, G, C and U bases, wherein the base C is methylated C or unmethylated C.
In any of the foregoing embodiments, the reaction solution may comprise an enzyme, which releases a fluorophore on a nucleotide substrate of the fluorophore having a fluorescence-switching property when the reaction solution is introduced into the reaction region where the gene fragment to be detected is located.
In any of the preceding embodiments, the reaction solution and the enzyme may not be added at the same time; firstly, introducing a first reaction solution in a group of reaction solutions, and then introducing an enzyme solution; next, the second reaction solution in the same reaction solution set is introduced, followed by the enzyme solution.
In any of the preceding embodiments, one set of reaction solutions may be used to perform one round of sequencing, or two sets of reaction solutions may be used to perform two rounds of sequencing, or three sets of reaction solutions may be used for three rounds of sequencing.
In any of the preceding embodiments, the method can comprise performing a round of sequencing using a set of reaction solutions and obtaining degenerate code results.
In any of the preceding embodiments, the method can comprise performing two rounds of sequencing using two sets of reaction solutions to obtain base sequence information.
In any of the preceding embodiments, the method may comprise performing three rounds of sequencing with three sets of reaction solutions, and performing error checking and correction based on the mutual information between the results of any two of the three rounds of sequencing.
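The relationship between one, two and three rounds can be illustrated as follows. The Python sketch below is a simplified, hypothetical model: it assumes idealized integer signals (one count per reaction-solution addition) and the three possible ways of splitting the four bases into two pairs. It only detects an inconsistency in the third round; it does not implement any particular correction procedure. The names `ROUNDS`, `expand`, `decode` and `consistent` are introduced for this example only.

```python
# The three ways of splitting A, C, G, T into two reaction solutions.
ROUNDS = {
    "MK": ("AC", "GT"),
    "RY": ("AG", "CT"),
    "WS": ("AT", "CG"),
}


def expand(signals: list[int], mixes: tuple[str, str]) -> list[str]:
    """Turn per-addition run lengths into one degenerate label per position."""
    labels = []
    for i, count in enumerate(signals):
        labels.extend([mixes[i % 2]] * count)
    return labels


def decode(sig1, mixes1, sig2, mixes2) -> str:
    """Combine two degenerate rounds into an explicit base sequence."""
    labels1, labels2 = expand(sig1, mixes1), expand(sig2, mixes2)
    if len(labels1) != len(labels2):
        raise ValueError("the two rounds report different read lengths")
    bases = []
    for a, b in zip(labels1, labels2):
        common = set(a) & set(b)  # two different splits share exactly one base
        if len(common) != 1:
            raise ValueError("rounds must use different base pairings")
        bases.append(common.pop())
    return "".join(bases)


def consistent(seq: str, sig3, mixes3) -> bool:
    """Check a decoded sequence against the third, redundant round."""
    labels3 = expand(sig3, mixes3)
    return len(labels3) == len(seq) and all(
        base in label for base, label in zip(seq, labels3)
    )


if __name__ == "__main__":
    seq = decode([3, 3, 1, 1], ROUNDS["MK"], [2, 1, 1, 2, 2], ROUNDS["RY"])
    print(seq, consistent(seq, [2, 2, 3, 1], ROUNDS["WS"]))
```

In this model any two rounds already determine every base, so the third round carries no new sequence information; a disagreement with it therefore signals an error in at least one of the rounds.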
In any of the preceding embodiments, the fluorophore having fluorescence-switching properties may comprise a fluorophore having a methyl fluorescein, halogenated methyl fluorescein, DDAO, or resorufin type structure.
In any of the preceding embodiments, the method can comprise using an enzyme to release the fluorophore from the nucleotide substrate bearing the fluorophore having fluorescence-switching properties, wherein the release is optionally optimized by first releasing the polyphosphate-substituted fluorophore using a DNA polymerase, and then cleaving the polyphosphate using a phosphatase, thereby releasing the fluorophore.
In any of the preceding embodiments, the reaction solution may comprise two or more nucleotides having different bases, and may be divided into two or more reaction solutions such that each reaction solution comprises one or more nucleotides; in addition, at least one of the reaction solutions may contain two or three nucleotides having different bases.
Also disclosed herein is a high throughput sequencing method according to any of the preceding embodiments, wherein the sequencing reaction is performed on a chip having a plurality of reaction chambers. The method can optionally comprise immobilizing the fragment of the nucleotide sequence to be tested in a reaction chamber.
In another aspect, disclosed herein are sequencing methods using nucleotide substrate molecules bearing fluorophores with fluorescence-switching properties, in which the 5'-terminal polyphosphate of the nucleotide substrate molecule is modified with the fluorophore. In one aspect, the methods provided herein comprise first immobilizing a fragment of the nucleotide sequence to be tested and adding a reaction solution comprising a nucleotide substrate molecule. An enzyme may then be used to release the fluorophore from the nucleotide substrate, resulting in fluorescence switching.
In one embodiment, the sequencing method further comprises removing residual reaction solution and fluorescent molecules by using a washing solution, and then performing the next round of sequencing reaction. In any of the preceding embodiments, the sequencing method may comprise adding the reaction solution at a low temperature and then heating the reaction solution to the enzyme reaction temperature. The fluorescent signal can then be detected and/or recorded.
In any of the preceding embodiments, the nucleotide substrate molecule may comprise a nucleotide molecule comprising A, G, C and a T base or a nucleotide molecule comprising A, G, C and a U base; wherein C is methylated C or unmethylated C. In any of the preceding embodiments, the nucleotide substrate molecule may comprise a fluorophore with fluorescence switching properties modified with a 5' terminal polyphosphate. In any of the preceding embodiments, the nucleotide substrate molecule may comprise a fluorophore with fluorescence switching properties modified with a 5' terminal phosphate.
Also disclosed herein are methods according to any of the foregoing embodiments, wherein different nucleotide substrate molecules may be linked to one fluorophore for monochromatic sequencing, or to multiple fluorophores for polychromatic sequencing, depending on the base.
Disclosed herein is a method according to any of the preceding embodiments, wherein the fluorescence switching property means that after the sequencing reaction of each step, the fluorescence signal is significantly enhanced or significantly reduced compared to the signal before the sequencing reaction, or the frequency range of the emitted light is significantly altered.
Disclosed herein is a method according to any of the preceding embodiments, wherein the fluorescence switching property means that after the sequencing reaction of each step, the fluorescence signal is significantly enhanced compared to the situation before the sequencing reaction.
Disclosed herein is a method according to any preceding embodiment, wherein the reaction solution containing the nucleotide substrate molecules is used for sequencing. Nucleotide substrate molecules refer to mixtures of any two or three of the A, G, C, T nucleotide substrate molecules; or a mixture of any two or three of the A, G, C, U nucleotide substrate molecules.
Disclosed herein is a method according to any preceding embodiment, wherein the reaction solution containing the nucleotide substrate molecules is used for sequencing. Nucleotide substrate molecules refer to any of the A, G, C, T nucleotide substrate molecules; or A, G, C, U nucleotide substrate molecules.
Disclosed herein is a method of sequencing nucleotide substrate molecules using a fluorophore having fluorescence-switching properties according to any of the preceding embodiments, wherein each round of sequencing uses a set of reaction solutions, each set of reaction solutions comprising at least two reaction solutions, each reaction solution comprising A, G, C, T nucleotide substrate molecules or each reaction solution comprising A, G, C, U nucleotide substrate molecules. In one aspect, the method includes immobilizing a nucleotide sequence fragment to be detected, introducing a portion of a set of reaction solutions, and recording fluorescence information. In one aspect, the method comprises feeding one reaction solution at a time, and sequentially feeding another reaction solution from the same set of reaction solutions. In one aspect, the set of reaction solutions comprises at least one reaction solution comprising two or three nucleotide molecules.
Disclosed herein is a sequencing method using nucleotide substrate molecules bearing a fluorophore having fluorescence-switching properties according to any of the preceding embodiments, wherein each round of sequencing uses one set of reaction solutions, each set of reaction solutions comprising two reaction solutions, each reaction solution comprising two nucleotides having different bases. In one aspect, the nucleotides in one of the reaction solutions are complementary to two bases on the nucleotide sequence to be tested, and the nucleotides in the other reaction solution are complementary to the two other bases on the nucleotide sequence to be tested. In one aspect, the method includes immobilizing a fragment of the nucleotide sequence to be tested and introducing a first reaction solution of a set of reaction solutions. Then, the second reaction solution of the same set of reaction solutions is added. The two reaction solutions can be added alternately, one after the other, to obtain the coding information of the nucleotide sequence to be tested from the fluorescence information.
In any of the preceding embodiments, after the reaction solution is added to the sequencing reaction, the reaction chamber can be closed and the fluorescent signal recorded.
In any of the preceding embodiments, after the reaction solution is added to the sequencing reaction, the space outside the reaction chamber may be filled with an oil or oily substance capable of isolating and sealing the reaction chamber.
In any of the preceding embodiments, the polyphosphate nucleotide substrate may be a nucleotide having from about 4 to about 8 phosphate molecules.
In any of the preceding embodiments, one set of reaction solutions may be used to perform one round of sequencing, or two sets of reaction solutions may be used to perform two rounds of sequencing, or three sets of reaction solutions may be used for three rounds of sequencing.
In any of the preceding embodiments, the method may comprise using an enzyme to release the fluorophore on the nucleotide substrate of the fluorophore having fluorescence switching properties. The enzyme may comprise a DNA polymerase and/or an alkaline phosphatase.
In any of the preceding embodiments, the method can comprise performing a round of sequencing using a set of reaction solutions and obtaining degenerate code results.
In any of the preceding embodiments, the method can comprise performing two rounds of sequencing using the two sets of reaction solutions and obtaining base sequence information.
In any of the preceding embodiments, the method can include performing three rounds of sequencing using three sets of reaction solutions, and performing error checking and correction based on the mutual information between the results of any two of the three rounds of sequencing.
In any of the preceding embodiments, the reaction solution may comprise an enzyme. When the reaction solution is introduced into the reaction region where the gene fragment to be detected is located, the contained enzyme can release the fluorophore on the nucleotide substrate of the fluorophore having the fluorescence switching property.
In any of the foregoing embodiments, the reaction solution and the enzyme may be added at different times. In one aspect, the first reaction solution of a set of reaction solutions is added to the reaction first, followed by the enzyme solution. Next, a second reaction solution of the same set of reaction solutions is added, followed by the addition of the enzyme solution.
In any of the preceding embodiments, the fluorophore having fluorescence-switching properties may comprise a fluorophore comprising groups such as methyl fluorescein, halogenated methyl fluorescein, DDAO (7-hydroxy-9H-(1,3-dichloro-9,9-dimethylacridin-2-one)), and/or resorufin.
In any of the preceding embodiments, the release of the fluorophore on the nucleotide substrate of the fluorophore with fluorescence switching properties may be optimized, for example, using an enzyme. In one aspect, the optimization involves first releasing the fluorophore substituted with polyphosphate using a DNA polymerase, and then cleaving the substituted polyphosphate using a phosphatase to release the fluorophore.
In any of the preceding embodiments, the reaction solution may comprise two or more nucleotides having different bases. In one aspect, two or more reaction solutions may be used such that each reaction solution includes one or more nucleotides. The order of addition of the reaction solution in the reaction may be appropriately adjusted, and in one aspect, at least one portion of the reaction solution contains two or three nucleotides having different bases.
Also provided herein is a high throughput sequencing method according to any preceding embodiment, wherein the sequencing reaction is performed on a chip having a plurality of reaction chambers. In one aspect, the method comprises immobilizing a fragment of the nucleotide sequence to be tested in each reaction chamber.
In one aspect, the invention relates to sequencing methods, e.g., using mixed nucleotide molecules. More specifically, the sequencing method uses a modified (e.g., phosphate-modified) mixed nucleotide molecule having a fluorophore. Furthermore, the invention relates to sequencing methods based on fluorophores with fluorescence switching properties. Sequencing with such fluorophores uses nucleotide substrates labeled on a terminal phosphate. The substrate is a nucleotide whose 5'-terminal polyphosphate or an intermediate phosphate is modified with a fluorophore having fluorescence switching properties; that is, the fluorophore is attached to the terminal or an intermediate phosphate of a deoxyribonucleotide carrying 4, 5, 6 or more phosphates (including A, C, G, T, U and other nucleotides), and there is no label on the base or the 3'-hydroxyl group. The absorption spectrum and/or emission spectrum of the phosphate-modified fluorophore is different from that of the fluorophore without phosphate. Sequencing reactions typically involve successive and similar cycles. Each cycle may include steps such as sample injection/coating, reaction, signal acquisition, and washing away of unreacted reactant molecules. In the previously reported methods, when a substrate molecule carrying a base enters, no reaction occurs if it is not correctly paired; if it is correctly paired, the polymerase attaches the substrate molecule to the 3' end, releasing the polyphosphate-modified fluorescent molecule, and the fluorescence spectrum changes. If the substrate pairs consecutively with a homopolymer, the spectral change is multiplied accordingly. In practice, the modification labels of the substrate molecules are often fluorophores with fluorescence switching properties, such as methyl fluorescein, halogenated methyl fluorescein, DDAO, resorufin, and the fluorescent molecules referred to in CN104844674, which do not absorb while attached to the terminal phosphate and which fluoresce with a high quantum yield once released. The four substrate molecules can be labeled with different fluorescent molecules. The sequencing process is performed by injecting the substrates in the order ACGTACGT.
In one aspect, the invention relates to a method of sequencing a plurality of nucleotides. More specifically, the sequencing method uses phosphate to modify mixed nucleotide molecules with fluorophores. Sequencing by modifying the 5' end or the middle phosphate of the nucleotide substrate molecule with a fluorophore; using a group of reaction liquid for each round of sequencing, wherein each group of reaction liquid comprises two parts of reaction liquid, and each part of reaction liquid comprises two nucleotides containing different bases; wherein, the nucleotide in one part of reaction solution can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide in the other part of reaction solution can be complementary with the other two bases on the nucleic acid sequence to be detected; firstly, fixing a nucleotide sequence fragment to be detected, and introducing the nucleotide sequence fragment into a first reaction solution in a group of reaction solutions; testing and recording fluorescence information; then introducing a second part of reaction liquid in the same group of reaction liquid; testing and recording fluorescence information; and adding the two parts of reaction solution circularly, and obtaining the coding information of the nucleotide substrate to be detected through fluorescence information.
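As an illustrative sketch only, the alternating addition of the two reaction solutions can be simulated in Python. The function name, the idealized signal (exactly proportional to the number of bases extended) and the convention that the input is the strand as it is synthesized are assumptions made for the example, not part of the described method.

```python
def simulate_two_plus_two(incorporated, mixes=("AC", "GT"), max_cycles=1000):
    """Return the number of bases extended in each reaction cycle.

    `incorporated` is the strand as it is synthesized (hypothetical input).
    `mixes` holds the two reaction solutions of one set, added alternately.
    """
    signals = []
    pos = 0
    cycle = 0
    while pos < len(incorporated) and cycle < max_cycles:
        allowed = set(mixes[cycle % 2])
        run = 0
        # Extension continues while the next base is present in this solution.
        while pos < len(incorporated) and incorporated[pos] in allowed:
            run += 1
            pos += 1
        signals.append(run)  # idealized signal, proportional to run length
        cycle += 1
    return signals


print(simulate_two_plus_two("ACGTTGCA"))                      # [2, 4, 2]
print(simulate_two_plus_two("ACGTTGCA", mixes=("AG", "CT")))  # [1, 1, 1, 2, 1, 1, 1]
```

Each entry of the returned list corresponds to one recorded fluorescence measurement; a different set of reaction solutions yields a different degenerate encoding of the same fragment.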
In some embodiments, the reaction solution in the present invention refers to a sequencing reaction solution in the general sense. Auxiliary solutions, such as washing liquids, may be introduced between the reaction solutions. In one aspect, each reaction solution contains nucleotides of two different bases, which may be labeled with different or the same fluorophores. In one aspect, sequencing is performed by modifying the 5' end or the middle phosphate of a nucleotide substrate molecule with a fluorophore having fluorescence switching properties; the fluorescence switching property means that the fluorescence signal after the sequencing reaction is markedly changed compared with the signal before the reaction.
In one aspect, the fluorescence switching property means that the fluorescence signal after the sequencing reaction is markedly enhanced compared with the signal before the reaction. The frequency of the emitted light may change, but the overall intensity of the emitted light, or its intensity in certain frequency bands, increases significantly.
In one aspect, the invention relates to a method for sequencing using a nucleotide molecule having a fluorophore with fluorescence switching properties, wherein sequencing is performed by modifying the 5' end or the middle phosphate of a nucleotide substrate molecule having a fluorophore; the fluorescence switching property means that the fluorescence signal intensity after sequencing is obviously enhanced compared with the situation before sequencing reaction; using a group of reaction solution for each round of sequencing, wherein each group of reaction solution comprises two parts of reaction solution, and each part of reaction solution comprises nucleotide substrate molecules with two different bases; the nucleotide substrate molecules in one reaction solution can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotides in the other reaction solution are complementary with the other two bases on the nucleotide sequence to be detected. Firstly, fixing a nucleotide sequence fragment to be detected in a reaction chamber, and then introducing one reaction solution in a group of reaction solutions; then using an enzyme to release the fluorophore on the nucleotide substrate having the fluorophore with fluorescence-switching properties, thereby causing fluorescence switching; then introducing a second part of reaction liquid in the same group of reaction liquid; releasing the fluorophore on the nucleotide substrate of the fluorophore having the fluorescence switching property using an enzyme, thereby causing fluorescence switching; and adding the two parts of reaction solution circularly, and obtaining the coding information of the nucleotide substrate to be detected through fluorescence information.
In one aspect, the invention relates to a method for sequencing using a nucleotide molecule having a fluorophore with fluorescence switching properties, wherein sequencing is performed by modifying the 5' end or the middle phosphate of a nucleotide substrate molecule with a fluorophore; the fluorescence switching property means that the fluorescence signal intensity after the sequencing reaction is markedly enhanced compared with that before the reaction; one set of reaction solutions is used for each round of sequencing, each set comprising at least two reaction solutions, and each reaction solution comprising at least one of the A, G, C or T nucleotide substrate molecules, or at least one of the A, G, C or U nucleotide substrate molecules. In one aspect, the nucleotide sequence fragment to be detected is first immobilized in a reaction chamber, and one reaction solution of a set of reaction solutions is introduced; fluorescence information is tested and recorded; one reaction solution is introduced at a time, and the other reaction solutions of the same set are introduced successively. The fluorescence information may be tested and recorded after each reaction solution is introduced, wherein at least one reaction solution in the set contains two or three different nucleotide molecules.
In one aspect, the invention relates to a method for sequencing using a nucleotide molecule having a fluorophore with fluorescence switching properties, wherein sequencing is performed by modifying the 5' end or the middle phosphate of a nucleotide substrate molecule having a fluorophore; the fluorescence switching property means that the fluorescence signal intensity after sequencing is obviously enhanced compared with the situation before sequencing reaction; one set of reaction solutions was used for each round of sequencing, each set of reaction solutions comprising at least two reaction solutions, each reaction solution comprising either A, G, C or a T nucleotide substrate molecule, or either A, G, C or a U nucleotide substrate molecule. On one hand, the nucleotide sequence fragment to be detected can be firstly fixed in a reaction chamber, and one reaction solution in a group of reaction solutions is introduced; testing and recording fluorescence information; one reaction solution is introduced each time, and the other reaction solution in the same group of reaction solutions is introduced successively. Meanwhile, the fluorescence information may be tested and recorded after each reaction solution is introduced.
In one aspect, the invention relates to a method for sequencing using a nucleotide molecule having a fluorophore with fluorescence switching properties, wherein sequencing is performed by modifying the 5' end or the middle phosphate of a nucleotide substrate molecule having a fluorophore; the fluorescence switching property means that the fluorescence signal intensity after sequencing is obviously enhanced compared with the situation before sequencing reaction; each round of sequencing used a set of reaction solutions containing A, G, C and T nucleotide substrate molecules, or A, G, C and U nucleotide substrate molecules. In one aspect, the nucleotide sequence fragment to be tested can be immobilized in a reaction chamber, a reaction solution can be introduced, and then the fluorescence information can be tested and recorded.
In one aspect, the method further comprises removing residual reaction solution and fluorescent molecules with a cleaning solution, and then performing a next round of sequencing reaction. In one aspect, the method comprises delivering the reaction solution at a low temperature, then heating it to an enzymatic reaction temperature, and testing for a fluorescent signal. In one aspect, after the reaction solution is introduced, the method comprises sealing the reaction chamber and then testing and recording the fluorescence information.
In one aspect, after the reaction solution is introduced, the method includes filling the space outside the reaction chamber with oil, thereby isolating and closing the reaction chamber. In one aspect, a polyphosphate nucleotide substrate molecule refers to a nucleotide having 4-8 phosphate groups. In one aspect, the fluorophore-modified nucleotide substrate molecules with different bases can be labeled with the same fluorophore to perform monochromatic sequencing; multicolor sequencing can also be performed with different fluorophore labels.
In one aspect, the method comprises the steps of: an enzyme (e.g., DNA polymerase and/or alkaline phosphatase) is used to release the fluorophore on the nucleotide substrate of the fluorophore having the fluorescent switching property. In one aspect, the two bases on the test nucleotide sequence are any two of the A, G, C and T bases or any two of the A, G, C and U bases, wherein base C is either a methylated C or an unmethylated C. On the one hand, when the reaction solution is introduced into the reaction region where the gene fragment to be detected is located, the enzyme in the reaction solution can release the fluorophore on the nucleotide substrate of the fluorophore having the fluorescence switching property. In one aspect, the method comprises performing a round of sequencing using a set of reaction solutions and obtaining degenerate code results. In one aspect, the method comprises performing two rounds of sequencing using two sets of reaction solutions and obtaining base sequence information. In one aspect, the method comprises performing three rounds of sequencing using three reaction solutions and performing error checking and correction based on (any) two rounds of sequencing results in the mutual information between the three rounds of sequencing.
In one aspect, the invention relates to a method of sequencing mixed nucleotide molecules. More specifically, the sequencing method uses phosphate-modified mixed nucleotide molecules with fluorophores. Compared with a sequencing method using mixed nucleotides without phosphate modification, the label is easily removed by hydrolysis and no other group remains after the reaction is completed, which favors the extension sequencing reaction and keeps the sequencing reaction simple.
In one aspect, the invention relates to a method for sequencing a mixed nucleotide molecule of nucleotide substrate molecules by modifying a fluorophore with fluorescence switching properties with a 5' terminal polyphosphoric acid. In one aspect, the method comprises first immobilizing a nucleotide sequence fragment to be tested and introducing a reaction solution comprising a nucleotide substrate molecule. In one aspect, the method comprises using an enzyme to release a fluorophore on a nucleotide substrate having a fluorophore with fluorescence switching properties, thereby causing fluorescence switching. In one aspect, the method further comprises removing residual reaction solution and fluorescent molecules with a cleaning solution, and then performing a next round of sequencing reaction.
In another embodiment, the present invention combines fluorescence-switched sequencing with mixed nucleotide molecular sequencing to achieve unexpected results. For example, the use of fluorescence switching to provide data redundancy and check features for mixed nucleotide molecule sequencing improves the accuracy of the sequencing data. In addition, the 3' end closed sequencing also ensures that information does not need to be acquired in real time in the sequence reaction, thereby improving the accuracy of signals. Independently of the sequencing chemistry principle itself, it can be matched to different sequencing chemistries. Furthermore, the 2+2 mode of fluorescence switching properties (two bases at a time) is clearly superior to other mixed nucleotide molecule sequencing. For example, data parsing is relatively easy and also provides data redundancy and verification features. The special signal acquisition method and efficiency make it have a wide prospect in gene sequencing. Fluorescence-switching-based multi-base sequencing reduces the error rate and makes the reaction simpler compared to non-fluorescence-switching-based nucleotide molecule sequencing. The mixed nucleotide molecule sequencing method adopting the fluorescence switching method disclosed by the invention has the sequencing accuracy rate of 99.99%, the reading length of the mixed nucleotide molecule sequencing method exceeds that of Illumina sequencing, the mixed nucleotide molecule sequencing method can reach 300nt or more than 300nt, and the cost of raw materials is very low. The method adopts a method of scanning after reaction, and has no limitation of flux. The time required by single-round reaction is short, and the rapid test can be realized. The strategy of fluorescence switching and mixed sequencing of multiple nucleotide molecules is adopted, so that the sequence read length and the information content of each reaction cycle can be prolonged. 
For example, Illumina sequencing reads 1 nt (1 base) per reaction cycle and has an information content of 2 bits. 2+2 monochrome sequencing (two nucleotide molecules with different bases introduced at a time, using a total of two reaction solutions) reads 2 nt per reaction cycle, with 2 bits of information. In one aspect, 2+2 two-color sequencing has a read length of 2 nt and an information content of 3.4 bits per reaction cycle.
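The per-cycle figures quoted above can be approximated with a short calculation. The following is a sketch under a stated assumption that is not taken from the source: the template bases are independent and uniformly distributed, so that in 2+2 mode a run of k extended bases occurs with probability 2^-k (k >= 1), and in two-color mode each extended base carries either label with probability 1/2. All function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def run_length_stats(max_len=60):
    """Mean and entropy of the run length k >= 1, with P(k) = 2**-k."""
    lengths = range(1, max_len + 1)
    probs = [2.0 ** -k for k in lengths]
    mean = sum(k * p for k, p in zip(lengths, probs))
    return mean, entropy(probs)


def two_color_entropy(max_len=60):
    """Entropy of the pair (run length, split of the run between two colors)."""
    _, h_length = run_length_stats(max_len)
    h_split = 0.0
    for k in range(1, max_len + 1):
        # Number of bases carrying the first color is Binomial(k, 1/2).
        binom = [math.comb(k, j) / 2.0 ** k for j in range(k + 1)]
        h_split += 2.0 ** -k * entropy(binom)
    return h_length + h_split


mean_nt, mono_bits = run_length_stats()
print(math.log2(4))            # single-base cycle: 2.0 bits for 1 nt
print(mean_nt, mono_bits)      # 2+2 monochrome: close to 2 nt and 2 bits
print(two_color_entropy())     # 2+2 two-color: between 3.3 and 3.5 bits
```

Under this simplified model the monochrome values converge to 2 nt and 2 bits per cycle, and the two-color value comes out near the 3.4 bits stated in the text.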
In some aspects, provided herein are fluorogenic (fluorescence-generating) fluorophores. Some fluorophores have the property that when a substituent is changed, the fluorescence spectrum (absorption and emission spectrum) changes, which is known as fluorescence switching. In one aspect, an increase in the intensity of the signal collected under specific excitation and collection (emission) conditions is referred to as fluorescence generation.
In some aspects, provided herein are nucleotides and nucleotide labels. In one aspect, the nucleotide molecule is composed of a ribose skeleton, a base attached at the glycosidic position, and a polyphosphate chain linked to the hydroxyl group at the 5' position of the ribose skeleton. The 2' carbon of the ribose ring may carry a hydroxyl group (forming a ribonucleotide) or only H (forming a deoxyribonucleotide). The base may be one of the 4 major bases A, C, G and T, uracil, or a modified base such as a methylated or hydroxymethylated base. The number of phosphates in the chain can be 1-8. Molecular groups can be attached at multiple positions: there may be one or more modification sites on the phosphate chain, on the 3' hydroxyl group of the ribose backbone, or on the base. For example, the phosphate is modified with a fluorophore and the 3' position is modified with an ethynyl group.
In one aspect, a polyphosphate nucleotide substrate (more than 3 phosphates) that is unmodified at the 3' carbon retains an active 3' hydroxyl group when the polymerase reaction occurs. In one aspect, as long as the next base can still pair, the polymerase reaction continues until no pairing substrate is available or a nucleotide molecule lacking a 3' hydroxyl group is incorporated. In some aspects, provided herein are fluorogenic nucleotides. In one aspect, a nucleotide molecule labeled at the phosphate terminus with a fluorogenic fluorophore whose fluorescence can be switched by the phosphate hydrolysis process is referred to as a fluorogenic nucleotide. The length of the phosphate chain may be 4-8.
In one aspect, the labeled phosphate can be terminal or pendant. The number of labels may be one or more. The plurality of labels may be the same or different. More specifically, in one aspect, such a molecule is referred to as a polymerase fluorogenic nucleotide. Alternatively, fluorogenic nucleotides that are not labeled at the phosphate position and do not require the polymerase to generate fluorescence can be used. The nucleotide molecule may be a ribonucleotide, a deoxyribonucleotide or a (deoxy)ribonucleotide modified at the 3' carbon.
In some aspects, provided herein are fluorescently generated nucleotide polymerase reactions. In one aspect, the fluorogenic nucleotide polymerase reaction uses a fluorogenic nucleotide, a nucleic acid polymerase (DNA polymerase), a phosphatase, along with a nucleic acid substrate. In some embodiments, the DNA polymerase first polymerizes the fluorogenic nucleotide into the nucleic acid substrate to release the phosphorylated fluorogenic fluorophore, which is then further hydrolyzed by a phosphatase to remove the phosphate and release the fluorogenic fluorophore with a change in fluorescence state.
In some aspects, provided herein are fluorogenic sequencing methods. In one aspect, the method is directed to the use of a fluorogenic nucleotide polymerase reaction, and the fluorescence change (light intensity and spectrum) of the fluorogenic fluorophore is measured to obtain information about the polymerase reaction. In some aspects, provided herein are fluorogenic sequencing reactions that can comprise fluorogenic nucleotides, a nucleic acid polymerase (DNA polymerase), and a phosphatase.
A "fluorogenic nucleotide" as described herein may comprise one or more fluorogenic nucleotides. A "nucleotide" as described herein may comprise one or more nucleotides. In some embodiments, multiple nucleotides may be labeled with the same or different fluorogenic substrates. In some aspects, provided herein is a kit of fluorogenic sequencing reactions, which may comprise two or more fluorogenic sequencing reactions, e.g., A, C, G and T reactions at specific concentrations, or AC and GT reactions at specific concentrations.
In some aspects, provided herein are fluorescent sequencing reaction cycles that can include performing a fluorogenic polymerase reaction and testing for a fluorescent signal using one sequencing reaction solution. In some aspects, provided herein is a round of fluorogenic sequencing reactions that can include cycling sequencing reactions in a defined order using members of a set of fluorogenic sequencing reaction solutions. In some aspects, provided herein is a set of fluorogenic sequencing reactions, which may include one or more rounds of fluorogenic sequencing.
In some aspects, provided herein are single-base-resolved sequencing reactions. In one aspect (the 2+2 monochrome mode), the first reaction solution contains two bases (e.g., AC) and the second reaction solution contains the other two bases (e.g., GT), and the two solutions are used alternately for sequencing. In this case, the number of bases extended per cycle increases. After N reaction cycles, about 2N nt have been extended on average, and the information carried is 2N bits. There are 3 combinations that accomplish the above sequencing, namely AC/GT, AG/CT and AT/CG; written with the standard degenerate base (degenerate nucleotide) symbols, these are M/K, R/Y and W/S. The three combinations can be sequenced separately, or one can be used for re-sequencing after sequencing with another set has been completed. The ith base on the DNA sequence must undergo a pairing reaction and release a signal in a unique cycle in each of the two sets of sequencing. In each set of sequencing, the cycle in which a given base reacts can be one of two types, so there are 2 × 2 = 4 possible cases in total, corresponding exactly to the four bases. The order in which the combinations are sequenced does not affect the inference of bases.
TABLE 1
TABLE 2
TABLE 3
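The inference of a base from two sets of sequencing can be illustrated with a minimal Python sketch. It assumes that each set of flow signals has already been expanded into a per-base string of degenerate letters (for example, the AC/GT signals 2, 4, 2 become MMKKKKMM); the function name and this input convention are hypothetical.

```python
# Base sets for the three 2+2 combinations, in IUPAC notation.
DEGENERATE = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}


def decode_pair(seq_a, seq_b):
    """Infer the base sequence from two orthogonal degenerate sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("degenerate sequences differ in length")
    bases = []
    for x, y in zip(seq_a, seq_b):
        common = set(DEGENERATE[x]) & set(DEGENERATE[y])
        if len(common) != 1:
            raise ValueError(f"letters {x} and {y} do not define a single base")
        bases.append(common.pop())
    return "".join(bases)


print(decode_pair("MMKKKKMM", "RYRYYRYR"))  # ACGTTGCA
print(decode_pair("RYRYYRYR", "WSSWWSSW"))  # ACGTTGCA
```

Any two of the three combinations give the same result, which reflects the 2 × 2 = 4 correspondence between cycle types and bases.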
In further embodiments, the method further comprises performing sequencing using a third, different set of reaction solution combinations after completing the two different sets of sequencing. The ith base on the DNA sequence must undergo a pairing reaction and release a signal in a unique cycle in each of the three sets of sequencing. In each set of sequencing, the cycle in which a given base reacts can be one of two types, so there are 2 × 2 × 2 = 8 possible cases; only four of them are reasonable, and the other four are unreasonable. In fluorescence-switched sequencing, insertion or deletion errors are likely to occur. For a given base, if one of the three sets of sequencing contains a sequencing error, an unreasonable case arises and the base cannot be correctly deduced, from which it can be concluded that one or more of the three sets of sequencing contains a sequencing error.
TABLE 4
Such errors can be corrected because when a sequencing error in a single set of data is corrected, a subsequent large number of errors will be corrected together.
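The eight possible cases for three sets of sequencing can be enumerated directly. The sketch below is illustrative only; the names are hypothetical, and the letter-to-base mapping is the standard IUPAC assignment for M/K, R/Y and W/S.

```python
from itertools import product

# One dictionary per set of reaction solutions: degenerate letter -> bases.
ROUNDS = [
    {"M": "AC", "K": "GT"},
    {"R": "AG", "Y": "CT"},
    {"W": "AT", "S": "CG"},
]


def classify_triples():
    """Split the 8 letter triples into reasonable and unreasonable cases."""
    valid, invalid = {}, []
    for triple in product(*ROUNDS):  # iterating a dict yields its keys
        common = set.intersection(
            *(set(table[letter]) for table, letter in zip(ROUNDS, triple))
        )
        if common:
            valid["".join(triple)] = common.pop()
        else:
            invalid.append("".join(triple))
    return valid, invalid


valid, invalid = classify_triples()
print(valid)    # {'MRW': 'A', 'MYS': 'C', 'KRS': 'G', 'KYW': 'T'}
print(invalid)  # ['MRS', 'MYW', 'KRW', 'KYS']
```

Observing one of the four unreasonable triples at a position indicates that at least one of the three sets of sequencing contains an error at or before that position.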
Another embodiment is a 2+2 two-color, two-round mode. The first reaction solution contains a mixture of two bases carrying different fluorescent labels (e.g., AX/CY), and the second reaction solution contains a mixture of the other two bases (e.g., GX/TY). In this case, the number of bases extended per cycle increases, averaging 2 nt. The information carried is 2N bits.
Method for detecting and/or correcting sequencing errors
In one aspect, this document relates to methods of detecting and/or correcting one or more sequence data errors in sequencing results, and is in the field of nucleic acid sequencing.
In one aspect, provided herein are methods of detecting and/or correcting sequence data errors in sequencing results. In one aspect, the sequencing reaction comprises at least two types of nucleotide substrate molecules having different bases. In one aspect, degenerate gene coding information can be obtained. By comparing two or more degenerate coding information, it is possible to determine whether conflicting sequence information occurs in one or more nucleotide residues. Any minor improvement that can reduce the rate of sequencing errors in the original sequencing data, using the present method to correct sequence information, can result in a more significant reduction in the error rate of the corrected sequence information.
In one aspect, disclosed herein are methods of detecting and/or correcting sequence data errors in sequencing results. In one aspect, the method comprises sequencing a nucleic acid sequence to obtain sequence data of a sequence that is degenerate as three or more orthogonal nucleotides. In another aspect, the method further comprises detecting an error in the sequence by comparing three or more sequences that are degenerate in orthogonal nucleotides. In one aspect, the corrected sequence is obtained by modifying at least one sequence at the location where the alignment error occurred.
Also disclosed herein are methods of detecting and/or correcting sequence data errors in sequencing results, wherein the methods comprise performing a sequencing reaction on a nucleotide sequence to obtain three or more degenerate sequences represented by letters M, K, R, Y, W, S, B, D, H and V. In one aspect, the degenerate bases of the present invention are represented by letters in Table 5 according to the IUPAC nucleic acid notation. For example, M represents A and/or C bases.
Table 5: letters representing degenerate bases
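For reference, the standard IUPAC assignments for the ten degenerate letters named above can be written as a lookup table. This is a sketch based on the IUPAC nucleic acid notation cited in the text, not a reproduction of Table 5; the helper function and its name are hypothetical.

```python
# Standard IUPAC degenerate nucleotide codes (two- and three-base symbols).
IUPAC_DEGENERATE = {
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def to_degenerate(sequence, letters):
    """Encode a base sequence using the given degenerate letters (e.g. "MK")."""
    encoded = []
    for base in sequence:
        matches = [l for l in letters if base in IUPAC_DEGENERATE[l]]
        if len(matches) != 1:
            raise ValueError(f"base {base} is not covered exactly once by {letters}")
        encoded.append(matches[0])
    return "".join(encoded)


print(to_degenerate("ACGTTGCA", "MK"))  # MMKKKKMM
print(to_degenerate("ACGTTGCA", "RY"))  # RYRYYRYR
print(to_degenerate("ACGTTGCA", "WS"))  # WSSWWSSW
```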
In any of the preceding embodiments, sequence errors can be detected by comparing three or more degenerate sequences. In any of the preceding embodiments, where the wrong nucleotide position is identified during the alignment, a corrected sequence may be obtained by modifying at least one of the sequences; in any of the preceding embodiments, the location at which the error was identified during the comparison may be the location at which the sequencing error actually occurred.
In another aspect, disclosed herein is a method of detecting and/or correcting sequence data errors in sequencing results, wherein the method comprises sequencing the same nucleic acid sequence, obtaining two or more degenerate sequences represented by the letters M, K, R, Y, W, S, B, D, H and V, to obtain the sequence information represented in nucleic acid residues A, G, T and C or the sequence information represented in nucleic acid residues A, G, U and C. In another aspect, the method further comprises detecting sequence errors by using optical or electrical signals generated by one or more functional groups coupled to different bases in a sequencing reaction. For example, optical or electrical signals from different fluorophores coupled to different bases in a sequencing reaction can be used as "redundant" information that distinguishes one base from another at a particular position in a sequence. In any of the preceding embodiments, where the wrong nucleotide position is found during the comparison, a corrected sequence may be obtained by modifying at least one of the sequences; in any of the preceding embodiments, the location at which the error was identified during the comparison may be the location at which the sequencing error actually occurred.
In another aspect, disclosed herein are methods for detecting and/or correcting sequence errors in sequencing results using the memory of nucleic acid sequences. In one aspect, the method includes sequencing the same nucleic acid sequence to obtain data for three or more orthogonal degenerate sequences of the nucleic acid. In another aspect, the method further comprises comprehensively comparing degenerate sequences and using the memory of the nucleic acid sequences to detect sequence errors. In one aspect, a corrected sequence can be obtained by modifying at least one sequence at the location where the alignment error occurred. In some embodiments, each degenerate sequence represents only a portion of the sequence information of an actual polynucleotide template, and nucleotide identity at a position in one degenerate sequence does not indicate, or does not necessarily indicate, nucleotide identity at the same position in another degenerate sequence.
In one aspect, disclosed herein is a method of detecting and/or correcting sequence data errors in sequencing results, wherein the method comprises immobilizing a nucleic acid fragment to be sequenced on a support, and providing a reaction solution to initiate a sequencing reaction from which a degenerate nucleic acid sequence is obtained. The sequencing reaction can be repeated for multiple rounds such that a degenerate nucleic acid sequence is obtained from each round of sequencing. After N rounds of sequencing, N degenerate nucleic acid sequences can be obtained. In one aspect, by comparing the N degenerate sequences together, the position of a sequence error can be detected. In one aspect, the method may further comprise obtaining a corrected sequence by modifying at least one sequence at the location of the alignment error. In any of the preceding embodiments, the reaction solution may comprise two or more types of nucleotide substrate molecules having different bases. In any of the preceding embodiments, N may be a positive integer equal to or greater than 2.
In any of the preceding embodiments, the method can comprise comparing N-1 of the N degenerate nucleic acid sequences to obtain nucleic acid sequence information encoded with A, G, T and C or nucleic acid sequence information encoded with A, G, U and C. In one aspect, the method further comprises comparing the N degenerate nucleic acid sequences. In any of the preceding embodiments, N may be a positive integer equal to or greater than 3.
In any of the preceding embodiments, the method can comprise comparing the N degenerate nucleic acid sequences to obtain nucleic acid sequence information encoded by A, G, T and C or nucleic acid sequence information encoded by A, G, U and C. In one aspect, the method further comprises detecting the location of the error by using optical and/or electromagnetic information provided by two or more functional groups coupled to the nucleotide residue. In any of the preceding embodiments, N may be a positive integer equal to or greater than 2.
In another aspect, disclosed herein is a method of detecting and/or correcting sequence data errors in a sequencing result, wherein the method comprises immobilizing a nucleic acid fragment to be tested on a support. In one aspect, the method further comprises providing a reaction solution to initiate a sequencing reaction, wherein the reaction solution comprises nucleotide substrate molecules for sequencing and is divided into three sets according to the bases, each set comprising two different reaction solutions, each reaction solution comprising nucleotide substrate molecules having two different bases. In another aspect, the bases of the nucleotides in the two reaction solutions of the same set do not overlap. In one aspect, one set of reaction solutions is used for each round of sequencing, and the two reaction solutions of each set are provided to react sequentially with the nucleic acid template in any suitable order. In one aspect, three rounds of sequencing are performed using the three sets of reaction solutions to obtain three degenerate sequences. In another aspect, by comprehensively comparing the three degenerate sequences, the position of a sequence error can be detected. In one embodiment, a corrected sequence may be obtained by modifying at least one sequence at the location where the alignment error occurred.
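The comparison of three degenerate sequences can be sketched as follows. The example assumes per-base degenerate letter strings from the M/K, R/Y and W/S rounds and flags the first position at which the three letters share no common base; the function name and test data are hypothetical.

```python
SETS = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}


def first_conflict(seqs):
    """Return the first position where the sequences share no base, else None."""
    for i, letters in enumerate(zip(*seqs)):
        common = set.intersection(*(set(SETS[c]) for c in letters))
        if not common:
            return i
    return None


clean = ["MMKKKKMM", "RYRYYRYR", "WSSWWSSW"]   # all encode ACGTTGCA
faulty = ["MMKKKMM", "RYRYYRYR", "WSSWWSSW"]   # one K missing in the M/K round
print(first_conflict(clean))   # None
print(first_conflict(faulty))  # 5
```

In the faulty case the deletion lies in the run of K letters spanning positions 2 to 5, and the comparison reports the conflict at the end of that run.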
In any of the preceding embodiments, the sequencing reaction may be performed by using a nucleotide substrate molecule (such as a dNTP or ddNTP) modified with a fluorophore having fluorescence switching properties, wherein the modification is on the 5' -terminal polyphosphate group of the nucleotide substrate molecule. In one aspect, the fluorescence switching property can refer to a significant change in the fluorescence signal after sequencing compared to the signal before the sequencing reaction. In another aspect, fluorescence switching occurs upon catalytic incorporation of a nucleotide substrate into an extension primer via a polymerase. In one aspect, the nucleotide sequence fragment to be tested is immobilized on a support, and then a reaction solution containing a nucleotide substrate molecule is provided to react with the template nucleotide sequence fragment. In one aspect, an enzyme is then used to release a fluorophore from the nucleotide substrate incorporated into the extension primer (and duplex polymerase extension product) to cause a fluorescence switch.
In one aspect, after each sequencing reaction, the fluorescence signal may be significantly enhanced or reduced, or the frequency of emitted light may be significantly altered, as compared to that before the sequencing reaction.
In any of the preceding embodiments, the sequence error may comprise an insertion and/or deletion. In any of the preceding embodiments, a sequence data error can be considered to occur at a position when at least two degenerate nucleic acid sequences do not have a common base at that position.
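The detection rule can be sketched in a few lines of Python. This is a minimal illustration, not the claimed implementation. It assumes the three rounds are already written as equal-length strings of two-base IUPAC codes, and it takes "no common base" to mean an empty intersection across all supplied rounds. The names `IUPAC` and `error_positions` are hypothetical.

```python
# Two-base IUPAC degenerate codes produced by the three reagent groups.
IUPAC = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
}

def error_positions(*degenerate_seqs):
    """Return 0-based positions where the aligned codes share no base."""
    length = min(len(seq) for seq in degenerate_seqs)
    bad = []
    for i in range(length):
        common = set("ACGT")
        for seq in degenerate_seqs:
            common &= IUPAC[seq[i]]  # keep only bases allowed by every round
        if not common:
            bad.append(i)
    return bad

print(error_positions("MMK", "RYY", "WSW"))  # consistent rounds
print(error_positions("MMK", "RYY", "WWW"))  # third round conflicts at index 1
```

Any two codes from different groups always share exactly one base, so a conflict only becomes visible once a third round is compared.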
In any of the preceding embodiments, correcting the sequence error can comprise correcting a nucleotide residue of at least one sequence such that the corrected sequence has a correct nucleotide residue at at least one position after the corrected nucleotide residue. In one aspect, a nucleotide residue is considered correct if the nucleic acid sequence information from any two rounds of sequencing at that nucleotide residue position does not conflict with the nucleic acid sequence information from another round of sequencing.
In any of the preceding embodiments, correcting the sequence error can comprise correcting the error of at least one sequence such that a common nucleotide residue at least one position of the sequence can be obtained by comparing sequence information from multiple rounds of sequencing.
In any of the preceding embodiments, correcting a sequence error can include extending (e.g., by inserting nucleic acid residues at positions where errors are deemed to have occurred) and/or shortening (e.g., by deleting nucleic acid residues at positions where errors are deemed to have occurred) the sequence representing nucleic acid sequence information from multiple rounds of sequencing. In one aspect, by extending and/or shortening at least one sequence from multiple rounds of sequencing, the corrected sequence becomes consistent with the sequences from the other rounds at at least one nucleotide residue position.
In any of the preceding embodiments, the memorability of a nucleic acid sequence may refer to the information of the nucleic acid sequence at a particular position in the sequencing result not only relating to the nucleotide residues in its corresponding nucleic acid in the template, but also relating to the sequence information preceding the sequence information.
In any of the preceding embodiments, using the sequencing signals from the other two rounds of sequencing, the sequence in the sequencing signals can be extended (e.g., by inserting nucleic acid residues at positions deemed to have errors) by certain lengths to obtain corrected nucleic acid sequences. In any of the preceding embodiments, using sequencing signals from two additional rounds of sequencing, the sequence in the sequencing signal can be shortened (e.g., by deleting nucleic acid residues at positions deemed to have errors) by certain lengths to obtain a corrected nucleic acid sequence.
In any of the preceding embodiments, the reaction solution may be divided into three groups according to different bases, wherein the bases include A, G, C and T bases or A, G, C and U bases. In any of the preceding embodiments, the base may be methylated, hydroxymethylated or modified with an aldehyde or carboxyl group, or unmethylated, non-hydroxymethylated or unmodified with an aldehyde or carboxyl group.
In any of the preceding embodiments, the nucleotide substrate reaction solution may comprise different bases and may be divided into two reaction solutions according to the different bases, e.g., A + G in one reaction solution and C + T in the other; A + C in one reaction solution and G + T in the other; or A + T in one reaction solution and C + G in the other.
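The three groupings can be checked mechanically. The sketch below labels them with the IUPAC letters M/K, R/Y and W/S. It verifies that the two solutions in each group do not overlap and together cover all four bases, and that any solution from one group shares exactly one base with any solution from another group. The helper names are hypothetical and the code is illustrative only.

```python
from itertools import combinations

# Each group holds the base sets of its two reaction solutions.
GROUPS = {
    "M/K": ({"A", "C"}, {"G", "T"}),
    "R/Y": ({"A", "G"}, {"C", "T"}),
    "W/S": ({"A", "T"}, {"C", "G"}),
}

def is_valid_group(group):
    """The two solutions must be disjoint and together cover A, C, G, T."""
    first, second = group
    return first.isdisjoint(second) and (first | second) == set("ACGT")

def are_orthogonal(group_a, group_b):
    """Every cross-group pair of solutions must share exactly one base."""
    return all(len(x & y) == 1 for x in group_a for y in group_b)

for name, group in GROUPS.items():
    print(name, is_valid_group(group))
for (name_a, a), (name_b, b) in combinations(GROUPS.items(), 2):
    print(name_a, name_b, are_orthogonal(a, b))
```

Because every cross-group pair intersects in a single base, two rounds are enough to name a base and the third round supplies the redundancy used for error detection.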
In any of the preceding embodiments, the reaction solution may comprise a plurality of portions of the reaction solution, and one portion of the reaction solution may be used for the sequencing reaction. In one aspect, one or more reaction solutions are used for each round of sequencing. In another aspect, at least one reaction solution comprises two or more types of nucleotide substrate molecules having different bases. In any of the preceding embodiments, the reaction solution used in the different rounds of sequencing comprises different combinations of nucleotide substrate molecules.
In any of the preceding embodiments, the nucleotide substrate molecule may be labeled by fluorescence. In one aspect, a fluorophore (or a functional group that will have fluorescence switching properties by chemical reaction) is coupled to the base of a nucleotide residue. In one aspect, the nucleotide substrate molecule may be modified with one of a fluorophore or a functional group, or the nucleotide substrate molecule may be modified with different bases using multiple fluorophores or functional groups.
With the increasingly deep understanding of genes in recent years, gene sequencing has brought about great changes in pharmacology and biology. Conventional sequencing methods include Sanger sequencing, restriction fragment length polymorphism analysis, single-stranded conformation polymorphism analysis, and gene chip-based allele-specific oligonucleotide hybridization. Errors in sequencing results are inevitable because many factors affect the sequencing process, such as inaccurate CCD luminescence readings, fluid movement, ambient light, DNA contamination, errors in signal correction systems, or impure sequencing reaction solutions. Because DNA, as genetic material, stores the genetic information of an organism, it can also be used as a storage medium for general information. When DNA is used to store information, the information must be encoded into a DNA sequence and then read using a gene sequencing method. To avoid coding and/or reading errors, redundant information is often introduced in the coding process and used to correct the signal on reading. For example, George Church et al, "Next-Generation Digital Information Storage in DNA," Science, 2012, encode information into DNA sequences using Reed Solomon codes and read the information in the DNA sequences using the Illumina sequencing platform. DNA encoding-reading technology is also used in combinatorial chemistry and other fields. In previous DNA coding techniques, the type of each base was usually independent of other bases (memoryless coding), or related only to its neighboring bases. Memory-based, distributed, orthogonal DNA coding methods are provided herein, in which the type of each base is related to all bases at preceding positions. In addition, the method can be based on the comprehensive comparison of multiple groups of orthogonal codes, and effectively improves the accuracy of the whole process from encoding through reading to decoding.
In one aspect, the invention provides a method of detecting and/or correcting coding errors in sequencing results, wherein the method comprises sequencing the same nucleic acid sequence to obtain three or more orthogonal degenerate sequence data of nucleotides, wherein errors in the sequence can be detected by comparing the three or more orthogonal degenerate sequences of nucleic acids, and wherein a corrected sequence can be obtained by modifying at least one sequence at the positions where errors were found during the comparison.
In one aspect, the invention provides a method of detecting and/or correcting code errors in sequencing results, wherein the method comprises sequencing the same nucleic acid sequence to obtain three or more degenerate sequence data represented by the letters M, K, R, Y, W, S, B, D, H and V, wherein errors in the sequences can be detected by comparing the three or more degenerate sequences, and wherein a corrected sequence can be obtained by modifying at least one of the sequences by the location of the error found during the comparison. In one aspect, the methods are suitable for routine sequencing. In another aspect, three or more encoded results can be obtained by multiple rounds of sequencing, where redundancy of information can be used to detect and/or correct error codes, as long as the sequencing substrate is properly designed.
In one aspect, the present invention provides a method for detecting and/or correcting code errors using the memory of a genetic code, wherein the method comprises sequencing the same nucleic acid sequence to obtain two or more degenerate sequences represented by the letter M, K, R, Y, W, S, B, D, H, V, or to obtain nucleic acid sequence information encoded by A, G, T, C, or nucleic acid sequence information encoded by A, G, U, C, wherein optical or electrical signals resulting from different functional groups attached to different bases in the sequencing reaction are used as redundant information to detect sequence errors, wherein a corrected sequence can be obtained by modifying at least one sequence at the positions where errors are found during alignment.
In one aspect, the invention provides a method of detecting and/or correcting a code error using the memory of a genetic code, wherein the method comprises sequencing the same nucleic acid sequence to obtain three or more orthogonal degenerate sequence data of nucleotides, comprehensively comparing the degenerate sequences, and using the memory of the nucleic acid sequences to detect sequence errors, wherein corrected sequences are obtained by modifying at least one sequence at the positions where errors are found during the comparison, wherein in the degenerate sequences each sequence signal represents part of the sequence information of the gene, and wherein the signal at the same position on another degenerate sequence cannot be deduced from the signal on any single one of the degenerate sequences.
In any of the preceding embodiments, the method can include immobilizing a nucleic acid fragment to be tested on a support, providing a reaction solution to initiate a sequencing reaction such that each round of sequencing results in a degenerate nucleic acid sequence; obtaining N degenerate nucleic acid sequences through at least N rounds of sequencing, wherein a position of an error in a sequence can be detected by comprehensively comparing the N degenerate sequences, wherein a corrected sequence can be obtained by modifying at least one sequence at the position where the error is found during the comparison, wherein the reaction solution can contain two or more types of nucleotide substrate molecules having different bases, and wherein N is a positive integer equal to or greater than 2.
In one aspect, the nucleic acid sequence information encoded by A, G, T, C can be obtained by comparing N-1 degenerate nucleic acid sequences, or the nucleic acid sequence information encoded by A, G, U, C, and the location of a sequence error can be detected by comparing N degenerate sequences. N may be a positive integer equal to or greater than 3.
In one aspect, the nucleic acid sequence information encoded by A, G, T, C, or the nucleic acid sequence information encoded by A, G, U, C, can be obtained by comparing N degenerate nucleic acid sequences, and the location of a sequence error can be detected by comparing the N degenerate sequences. In one aspect, the location of the error can be detected using luminescence information provided by two or more functional groups attached to the base, and N is a positive integer equal to or greater than 2. In another aspect, the method includes using information other than that of the bases themselves, such as information on molecules released during the sequencing reaction (e.g., phosphate and hydrogen ions), as redundant information for correction.
In one aspect, the present invention provides a method for detecting and/or correcting code errors in sequencing results, wherein the method comprises immobilizing a nucleic acid fragment to be detected, providing a reaction solution to initiate a sequencing reaction, wherein the reaction solution for nucleotide substrate molecules for sequencing is divided into three groups according to different bases, each group comprising two different reaction solutions, each reaction solution comprising nucleotide substrate molecules having different bases. On the other hand, there was no intersection between the bases of the nucleotides in the two reaction solutions. In another aspect, one set of reaction solution is used for each round of sequencing, and two reaction solutions for each set are provided alternately. In one aspect, the method comprises three rounds of sequencing using three sets of reaction solutions to obtain three degenerate sequences, the location of an error can be detected by comparing the three degenerate sequences together, and a corrected sequence can be obtained by modifying at least one of the sequences at the location where the error was found during the comparison.
In one aspect, a reaction solution containing two different bases may be divided into two reaction solutions; the other steps of the method may be adapted accordingly.
In one aspect, the reaction solution may comprise a plurality of reaction solutions, one for each sequencing run, wherein one or more reaction solutions are used for each sequencing run, wherein at least one reaction solution comprises two or more types of nucleotide substrate molecules having different bases, and wherein the reaction solutions for different sequencing runs comprise different combinations of nucleotide substrate molecules.
In one aspect, the sequencing of the invention comprises sequencing by modifying a nucleotide substrate molecule with a fluorophore having fluorescence switching properties using 5' -terminal polyphosphoric acid, wherein the fluorescence switching properties refer to a significant change in the fluorescence signal after sequencing compared to that before the sequencing reaction, wherein a fragment of the nucleotide sequence to be tested is first immobilized on a support, then a reaction solution containing the nucleotide substrate molecule is provided, and then the fluorophore on the nucleotide substrate is released using an enzyme, thereby causing fluorescence switching.
In one aspect, the phrase "the fluorescence signal is significantly changed after the sequencing reaction compared with the fluorescence signal before the sequencing reaction" means that, after each step of the sequencing reaction, the fluorescence signal is significantly enhanced or significantly reduced compared with the fluorescence signal before the sequencing reaction, or that the frequency range of the emitted light is significantly changed.
In one aspect, sequence errors refer to insertion errors or deletion errors. On the other hand, a sequence data error is an error that is considered to occur when at least two pieces of nucleic acid sequence information do not represent the same base at the same position. In a further aspect, the method comprises correcting errors in at least one sequence such that a subsequent sequence at least one position is correct, wherein sequence is correct in that the nucleic acid sequence information determined at the same position in any two rounds of sequences is not in conflict with the nucleic acid sequence information in another round of sequences, or in that the nucleic acid sequence information represented at the same position in any two rounds of sequences is not in conflict with luminescence information provided by a functional group attached to a base or information from another sequencing process.
In one aspect, the method includes correcting the sequence by correcting errors in at least one of the sequences such that a common base is obtained at at least one position when the sequences are comprehensively aligned.
In one aspect, by modifying at least one sequence, a corrected sequence can be obtained by extending or shortening the sequence representing the information on the nucleic acid sequence at the position where the error occurred, wherein extending or shortening refers to an increase or decrease in the length of the same test sequence, wherein when the code causes the position to be shortened or extended, the information on the sequence represented by the code is not changed, and the result is the same code. For example, when the signal strength of the degenerate code M is 2, i.e. MM, it can be extended to 3, i.e. MMM.
In one aspect, the memorability of a nucleic acid sequence means that the information on the nucleic acid sequence at a certain position in the sequencing result is related not only to the sequence on the nucleic acid to be tested corresponding to it, but also to the information on the sequence preceding it.
In one aspect, a corrected nucleic acid sequence is obtained, with reference to the sequencing signals from the other two rounds, by extending or shortening the sequencing signal at a position and thereby extending or shortening the gene sequence it represents, wherein extending the sequencing signal comprises lengthening the gene sequence represented at that position by a particular length (adding or inserting residues), and wherein shortening the sequencing signal comprises shortening the gene sequence represented at that position by a particular length (deleting residues).
In one aspect, the reaction solution is divided into three groups according to the difference of bases, wherein the base refers to A, G, C, T bases or A, G, C, U bases, and wherein the base may be methylated, hydroxymethylated, base with aldehyde or carboxyl, or non-methylated, non-hydroxymethylated, base without aldehyde or carboxyl.
On the other hand, the reaction solution for a nucleotide substrate containing two different bases may be divided into two reaction solutions according to the difference in the bases.
In one aspect, the nucleotide substrate molecule can be labeled by fluorescence. In one aspect, the method comprises modifying a fluorophore or functional group with a fluorescent switch by a chemical reaction on a base of a nucleotide substrate molecule. In another aspect, the nucleotide substrate molecule may be modified with one of a fluorophore or a functional group, or the nucleotide substrate molecule may be modified with different bases using multiple fluorophores or functional groups.
In one aspect, a degenerate set of gene sequence information can be obtained by each round of sequencing. In one aspect, degenerate gene sequence information is meant to encompass information about possible gene sequences. For example, when the reaction solution contains nucleotide substrate molecules having A and G bases, the degenerate gene sequence information obtained by sequencing contains gene sequence information of C and/or T bases in the nucleotide sequence to be tested. When the reaction solution contains nucleotide substrate molecules having A and T bases, the degenerate gene sequence information obtained by sequencing contains gene sequence information of C and/or G bases in the nucleotide sequence to be tested. When the reaction solution contains nucleotide substrate molecules having A and C bases, the degenerate gene sequence information obtained by sequencing contains gene sequence information of G and/or T bases in the nucleotide sequence to be tested. When the reaction solution contains nucleotide substrate molecules having C and G bases, the gene sequence information obtained by sequencing contains the gene sequence information of A and/or T bases in the nucleotide sequence to be tested. When the reaction solution contains nucleotide substrate molecules having C and T bases, the gene sequence information obtained by sequencing contains the gene sequence information of A and/or G bases in the nucleotide sequence to be tested. And when the reaction solution contains nucleotide substrate molecules having T and G bases, the gene sequence information obtained by sequencing contains the gene sequence information of C and/or A bases in the nucleotide sequence to be tested.
In one aspect, in the comprehensive comparison of information from three rounds of sequencing, if the signal from one round of sequencing is erroneously large at a position, the gene sequence information represented by that signal can be shortened so that the comparison at at least one subsequent position is correct.
In one aspect, in the comprehensive comparison of information from three rounds of sequencing, if the signal from one round of sequencing is erroneously small at a position, the gene sequence information represented at that position can be extended (or a gap added) so that the comparison at at least one subsequent position is correct. For example, when the signal strength of the degenerate code M is 2, i.e. MM, it can be extended to 3, i.e. MMM.
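One way to picture the shorten-or-extend correction is a brute-force search. It tries lengthening or shortening each per-cycle count by one and keeps the first change that removes every conflict among the three rounds. This is only a sketch under strong assumptions: a single error of size one, reads covering the same number of positions, and the first cycle of each round delivering the M, R or W solution. It is not the procedure claimed in the text, and all names are hypothetical.

```python
IUPAC = {"M": {"A", "C"}, "K": {"G", "T"}, "R": {"A", "G"},
         "Y": {"C", "T"}, "W": {"A", "T"}, "S": {"C", "G"}}
LETTERS = ("MK", "RY", "WS")  # code letters for odd/even cycles of each round

def expand(counts, letters):
    """Write per-cycle counts as one degenerate letter per position."""
    return "".join(letters[i % 2] * n for i, n in enumerate(counts))

def conflicts(rounds):
    """Positions where the three rounds share no base."""
    rows = [expand(c, l) for c, l in zip(rounds, LETTERS)]
    return [i for i, codes in enumerate(zip(*rows))
            if not set.intersection(*(IUPAC[c] for c in codes))]

def is_consistent(rounds):
    return len({sum(c) for c in rounds}) == 1 and not conflicts(rounds)

def correct_single_error(rounds):
    """Return corrected counts, or None if no single +/-1 change fits."""
    if is_consistent(rounds):
        return [list(c) for c in rounds]
    for r, counts in enumerate(rounds):
        for i in range(len(counts)):
            for delta in (-1, 1):  # shorten or extend one run
                if counts[i] + delta < 1:
                    continue
                trial = [list(c) for c in rounds]
                trial[r][i] += delta
                if is_consistent(trial):
                    return trial
    return None

observed = [[3, 6, 1, 3, 2, 1], [2, 4, 3, 2, 1, 3], [2, 1, 3, 2, 3, 3, 1]]
print(correct_single_error(observed))
```

Here the second M/K count was read one too high, and the search restores it by shortening that run.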
In one aspect, provided herein are methods of detecting and/or correcting errors in gene sequencing coding results, particularly sequencing methods using one or more reaction solutions comprising nucleotide substrate molecules having two or more bases. In a particular aspect, the method is applicable to SBS (sequencing by synthesis) methods for sequencing.
In one aspect, degenerate gene sequence information herein includes information on the likely gene sequence of a given target (or template) sequence. For example, when the reaction solution contains nucleotide substrate molecules having A and G bases, the degenerate gene sequence information obtained by sequencing contains gene sequence information of C and/or T bases in the nucleotide sequence to be tested. Assuming that the intensity information obtained from the sequencing reaction is 3, it means that the gene to be tested may contain three Cs and/or Ts, such as three Cs or three Ts, or one C and two Ts, or one T and two Cs, and the exact relative positions of the Ts and/or Cs cannot be distinguished based on the degenerate sequence. Degenerate gene sequence information and degenerate codes are terms commonly used in the art.
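The ambiguity of a single degenerate signal can be made concrete by listing every sequence compatible with it. The sketch below enumerates the candidates for an intensity of 3 over the bases C and T. The function name `candidate_sequences` is hypothetical and the code is illustrative only.

```python
from itertools import product

def candidate_sequences(bases, intensity):
    """All sequences of the given length drawn from the degenerate base set."""
    return ["".join(p) for p in product(sorted(bases), repeat=intensity)]

candidates = candidate_sequences({"C", "T"}, 3)
print(len(candidates), candidates)
```

An intensity of n over a two-base code leaves 2**n candidates, which is why a single round cannot resolve the order of the bases.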
In one aspect, the methods described herein can detect and/or correct errors in sequencing, but the methods do not completely eliminate sequence errors. It is possible that the specific location in the sequence signal that is modified is not the actual location where the sequencing error occurred, but the probability is extremely low. The final accuracy can be further improved. For example, if the modified signals for MK, RY, and WS are put together, modified twice in N consecutive times, it is considered likely that an error has occurred, and the corresponding sequence should be discarded. Here, N may be a positive integer equal to or greater than 2. The larger the value of N, the higher the probability that the sequence should be discarded, as well as the final decoding ratio. In one aspect, the optimized value of N herein is 3.
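The discard rule admits more than one reading. The sketch below assumes one of them: the cycle indices at which any of the MK, RY or WS signals was modified are pooled, and the read is discarded if two modifications fall within N consecutive cycles. The function name and this interpretation are assumptions, not taken from the text.

```python
def should_discard(corrected_cycles, n=3):
    """True if any two pooled corrections lie within n consecutive cycles."""
    cycles = sorted(corrected_cycles)
    # Two corrections at cycles a <= b share a window of n cycles when b - a < n.
    return any(b - a < n for a, b in zip(cycles, cycles[1:]))

print(should_discard([4, 6], n=3))   # corrections two cycles apart
print(should_discard([4, 10], n=3))  # corrections far apart
```

Under this reading, a larger N widens the window and so discards more reads.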
A DNA sequence can be regarded as a copolymer; for example, a region of DNA may consist of two different deoxyribonucleotides, such as AAC or GGTG.
In one aspect, methods of detecting and/or correcting sequence data errors can detect where an error occurred and/or correct sequence errors.
In one aspect, in the actual sequencing process, the method includes first obtaining a relative intensity value of the optical or other signal, which can be expressed in a specific form, by a cycle sequencing reaction. For example, M represents information on the position and the amount of base at that position (more bases are acceptable), and may also represent the result of coding a degenerate gene. By decoding the relative intensity values of a sufficient amount of information, the gene sequence information to be measured can be obtained.
In one aspect, delivering or providing reagents or reaction solutions means adding the reagents or reaction solutions to a vessel, e.g., to a reaction mixture for a sequencing reaction. In one aspect, three or more rounds of sequencing can be used. Alternatively, two or more rounds of sequencing may be used. In one aspect, sequencing signals are recorded as per-cycle counts. The intensity information of the signal can be recorded at each sequencing cycle, and in some embodiments the intensity is exactly equal to the length of the corresponding copolymer.
Sequencing signals can be expressed either as horizontal counts (one degenerate letter per nucleotide position) or as per-cycle counts (the number of nucleotides detected in each cycle). For example, if the signal intensity in a cycle is n and the nucleotide added in the reaction solution is X, the sequencing result for that cycle is written as X repeated n times. For example, the per-cycle sequencing signals in FIG. 1 can be converted to the horizontally counted sequencing signal MMMKKKKKMKKKMMK, or written as (A/C, A/C, A/C, G/T, G/T, G/T, G/T, G/T, A/C, G/T, G/T, G/T, A/C, A/C and G/T).
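The two representations are related by run-length coding. As a minimal sketch, the per-cycle counts (3, 5, 1, 3, 2, 1) expand to the FIG. 1 signal quoted above, and the horizontal signal collapses back to the same counts. It assumes the first cycle delivers the solution named by the first letter. The function names are hypothetical.

```python
from itertools import groupby

def counts_to_horizontal(counts, letters):
    """Repeat the code letter of each cycle by that cycle's count."""
    return "".join(letters[i % 2] * n for i, n in enumerate(counts))

def horizontal_to_counts(horizontal):
    """Collapse runs of identical letters back into per-cycle counts."""
    return [len(list(run)) for _, run in groupby(horizontal)]

signal = counts_to_horizontal([3, 5, 1, 3, 2, 1], "MK")
print(signal)
print(horizontal_to_counts(signal))
```

The reverse conversion cannot recover a zero count, for instance when the first cycle finds no matching base.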
For example, a sequencing reaction solution containing dA4P and dC4P (nucleotides having 4 phosphate groups, with the terminal phosphate labeled with a fluorescent group) can be used in the odd-numbered cycles, and a sequencing reaction solution containing dG4P and dT4P can be used in the even-numbered cycles. A set of fluorescence signal values after multiple reactions can be seen in Table 6 below.
Other combinations of fluorescently labeled nucleotides can be used to obtain a fluorescence signal value associated with a target DNA sequence. Examples of possible combinations are shown below:
M/K mode: dA4P and dC4P in all odd-numbered cycles, dG4P and dT4P in all even-numbered cycles; or the reverse;
R/Y mode: dA4P and dG4P in all odd-numbered cycles, dC4P and dT4P in all even-numbered cycles; or the reverse; and
W/S mode: dA4P and dT4P in all odd-numbered cycles, dC4P and dG4P in all even-numbered cycles; or the reverse.
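The alternating delivery in the three modes can be simulated for an error-free reaction. In the sketch below, each cycle extends the nascent strand through every consecutive base that belongs to the delivered pair and records how many bases were added. The input is taken to be the nascent (synthesized) strand, and the ideal signal is assumed to equal the number of incorporated bases. `MODES` and `ideal_counts` are hypothetical names.

```python
# Bases delivered in odd-numbered and even-numbered cycles for each mode.
MODES = {"M/K": ("AC", "GT"), "R/Y": ("AG", "CT"), "W/S": ("AT", "CG")}

def ideal_counts(nascent, mode):
    """Per-cycle incorporation counts for an error-free 2+2 run."""
    if set(nascent) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    odd_bases, even_bases = MODES[mode]
    counts, pos, cycle = [], 0, 1
    while pos < len(nascent):
        allowed = odd_bases if cycle % 2 == 1 else even_bases
        added = 0
        # Extend while the next base belongs to the delivered pair.
        while pos < len(nascent) and nascent[pos] in allowed:
            pos += 1
            added += 1
        counts.append(added)
        cycle += 1
    return counts

for mode in MODES:
    print(mode, ideal_counts("AACGT", mode))
```

A sequence that does not begin with a base from the first solution simply yields a zero in the first cycle.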
TABLE 6
Sequencing data obtained for the three different nucleotide combinations can each be expressed as a horizontally counted signal. For each position, the next step is to resolve the intersection of the nucleotide types represented at that position by the three horizontally counted sequencing signals, to obtain the target DNA sequence. In one aspect, this is the basic principle of decoding a signal. For example, if the per-cycle counts of the sequencing signals for the M/K, R/Y and W/S combinations are (3, 5, 1, 3, 2, 1), (2, 4, 3, 2, 1, 3) and (2, 1, 3, 2, 3, 3, 1), respectively, the sequence can be resolved as AACTTTGGATTGCCT (SEQ ID NO: 1).
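The decoding step can be sketched as a position-by-position set intersection. In this illustrative Python example the W/S counts are taken as (2, 1, 3, 2, 3, 3, 1) so that all three rounds cover the same 15 positions. The first cycle of each round is assumed to deliver the M, R or W solution, and the helper names are hypothetical.

```python
IUPAC = {"M": {"A", "C"}, "K": {"G", "T"}, "R": {"A", "G"},
         "Y": {"C", "T"}, "W": {"A", "T"}, "S": {"C", "G"}}

def expand(counts, letters):
    """Write per-cycle counts as one degenerate letter per position."""
    return "".join(letters[i % 2] * n for i, n in enumerate(counts))

def decode(mk_counts, ry_counts, ws_counts):
    """Intersect the three codes at each position; 'N' marks a conflict."""
    rows = [expand(mk_counts, "MK"), expand(ry_counts, "RY"),
            expand(ws_counts, "WS")]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("the three rounds cover different lengths")
    bases = []
    for m, r, w in zip(*rows):
        common = IUPAC[m] & IUPAC[r] & IUPAC[w]
        bases.append(next(iter(common)) if len(common) == 1 else "N")
    return "".join(bases)

print(decode([3, 5, 1, 3, 2, 1], [2, 4, 3, 2, 1, 3], [2, 1, 3, 2, 3, 3, 1]))
```

With these counts the intersection at every position contains exactly one base, giving AACTTTGGATTGCCT.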
In one aspect, the integrated comparison of the results of the three rounds of sequencing reactions involves converting a chemiluminescent signal or other form of intensity signal into gene sequence information and then comparing the results of the three rounds of sequencing at the same base position. If the representation of the results obtained by the three rounds of sequencing is consistent, the sequencing of the position is considered to be correct; if the gene sequence information represented by the results obtained from the three rounds of sequencing is not identical, the sequencing result at the base position is considered to be erroneous.
In one aspect, if the per-cycle sequencing signal at a particular cycle is too large or too small because of inaccurate CCD luminescence readings, fluid movement, ambient light, DNA impurities, errors in the signal correction system, impure sequencing reaction solution, or the like, the horizontally counted sequencing signals will have empty intersections of the nucleotide types represented at the corresponding position or at subsequent positions, and the nucleotide types cannot be resolved. Clearly, an error in a per-cycle sequencing signal shifts the entire horizontally counted sequencing signal from the point where the error occurred. Thus, a horizontally counted sequencing signal is a signal with memory. Errors in the sequencing signal can be corrected on the basis of this memory property.
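The shift described above can be reproduced with a small example. The sketch below compares an error-free M/K signal with one in which the second cycle is read one base too high, and lists the positions where the three rounds no longer share a base. The R/Y and W/S counts are illustrative values chosen to be consistent with the error-free M/K counts, and the helper names are hypothetical.

```python
IUPAC = {"M": {"A", "C"}, "K": {"G", "T"}, "R": {"A", "G"},
         "Y": {"C", "T"}, "W": {"A", "T"}, "S": {"C", "G"}}

def expand(counts, letters):
    """Write per-cycle counts as one degenerate letter per position."""
    return "".join(letters[i % 2] * n for i, n in enumerate(counts))

def conflict_positions(mk, ry, ws):
    """0-based positions with an empty intersection across the rounds."""
    rows = (expand(mk, "MK"), expand(ry, "RY"), expand(ws, "WS"))
    return [i for i, (a, b, c) in enumerate(zip(*rows))
            if not (IUPAC[a] & IUPAC[b] & IUPAC[c])]

ry, ws = [2, 4, 3, 2, 1, 3], [2, 1, 3, 2, 3, 3, 1]
print(conflict_positions([3, 5, 1, 3, 2, 1], ry, ws))  # error-free counts
print(conflict_positions([3, 6, 1, 3, 2, 1], ry, ws))  # second count too high
```

No conflict appears before the over-called run ends. From index 8 onward the shifted M/K signal disagrees with the other rounds at several positions, and the first conflict localizes the error.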
In one aspect, the invention provides methods for detecting and/or correcting sequence data errors in sequencing results. Because the sequencing reaction solution contains at least two types of nucleotide substrate molecules with different bases, degenerate gene coding information is obtained. One skilled in the art can determine whether the codes conflict at a given position by comparing two or more sets of degenerate coding information. Compared with methods that test the same template multiple times, either with different primers or by directly repeating the test, the present method is simpler, and the test can be completed with a single test design. In one aspect, the methods provided herein are completely different from methods that test the same gene multiple times. In some aspects, if there are only two mutually orthogonal degenerate coding results, the methods provided herein have no basis for correction (except where redundant information such as color is added). In one aspect, the detection and correction of errors in this type of sequencing using three or more mutually orthogonal degenerate coding results is first proposed herein.
In one aspect, provided herein are methods of detecting and/or correcting sequence data errors in sequencing results. In particular, sequencing is performed using nucleotide substrate molecules in which a fluorophore having fluorescence switching properties is attached to the 5' terminal polyphosphate; this method is also known as fluorescence-switched sequencing. When fluorescence-switched sequencing is combined with the 2+2 sequencing method, the combination can bring many advantages, such as long reads of 300 bp and sequencing accuracies up to 99.99%, which cannot be achieved by using only the 2+2 sequencing method or only the fluorescence-switched sequencing method. The combined approach has other advantages as well, such as higher allowable throughput, simplicity of reaction, low error rate, and no need to obtain information in real time. Sequencing with other nucleotide substrate molecules having fluorescence switching properties has the same properties. For example, over three rounds of sequencing, the fluorescence-switched sequencing method and the 2+2 sequencing method provide redundant information (luminescence information or other detectable information) in addition to color information, which can be used for correction. The correction also extends the valid read length without changing accuracy. The corrected result depends on the accuracy of the sequencing method, and correction can greatly improve the overall accuracy of valid reads at a fixed sequencer accuracy; for example, the accuracy of sequencing a nucleic acid fragment 400 bp in length is up to 97.36%, and the corrected accuracy reaches 99.17%. Thus, if a sequencer employs this error detection and correction method, valid reads can be extended accordingly. When correcting with the methods provided herein, a clear rule emerges: any minor improvement in the sequencing method that reduces the raw error rate can significantly reduce the error rate of the corrected data.
Method for reading sequence information from raw signals of high-throughput DNA sequencing
In one aspect, the disclosure relates to methods of reading nucleic acid sequence information from the raw (original) signals of sequencing reactions, such as high throughput DNA sequencing reactions. In particular aspects, the invention relates to methods of reading and/or correcting sequence information from the raw signals of second generation sequencing technologies (e.g., for gene or genome sequencing). In one aspect, the many causes of deviation of the raw signal from the actual sequence information during nucleic acid sequencing are considered herein, to enable comprehensive correction of the detected sequence information and to read the exact DNA sequence from the raw sequencing signal. In one aspect, the methods disclosed herein do not affect the normal course of the sequencing reaction. In one aspect, the disclosure relates to processing of both monochromatic sequencing signals and polychromatic sequencing signals. In one aspect, the processing of each type of signal includes parameter estimation and signal correction.
In high throughput DNA sequencing, under ideal conditions, the intensity of the original signal released per sequencing reaction is directly proportional to the number of bases incorporated into the nascent DNA strand. In practice, however, this proportional relationship does not always exist for a number of reasons. For example, first, the intensity of the original signal will generally decay due to fluid erosion, hydrolysis of the DNA template, and/or base mismatches. Second, due to incomplete sequencing reactions, side (e.g., undesired) reactions, and/or base mismatches, the length of the nascent DNA strand gradually becomes desynchronized as the sequencing reaction progresses (e.g., the length of the nascent DNA strand is inconsistent due to a phase loss phenomenon). Desynchronized nascent DNA strand length in turn leads to deviation of the original signal intensity from the actual target DNA sequence. Third, the overall intensity of the raw signal will be higher due to spontaneous hydrolysis of nucleotides and/or background fluorescence from the sequencing chip or substrate. All these factors make it difficult, or sometimes impossible, to read the sequence of the target DNA directly from the intensity of the original sequencing signal based on the proportionality of the two under ideal conditions.
Existing methods of reading sequence information from raw sequencing signals consider only some of the causes mentioned above. For example, the 454 sequencing technique considers only the phase loss phenomenon and corrects the signal deviation caused by phase loss by means of a matrix transformation. If only phase loss is considered, or if phase loss is treated separately from other factors such as attenuation and overall signal elevation, the accuracy of the DNA sequence information read will be affected for the reasons stated above. In addition, the 454 sequencing technique considers only primary leads in the phase loss phenomenon and ignores secondary leads, which also affects the accuracy of the final result. Furthermore, the effectiveness of the 454 sequencing technique depends on many parameters set by human operators, which makes the technique inconvenient to use.
Ion Torrent sequencing technology attempts to mitigate signal deviation due to the above-mentioned causes by changing the order of addition of nucleotides in the sequencing reaction. However, this method only mitigates signal deviations and does not truly correct them. Moreover, changing the order of addition of nucleotides in a sequencing reaction reduces the average read length obtained per sequencing reaction.
In another aspect, disclosed herein are methods of sequencing using nucleotide substrate molecules bearing fluorophores with fluorescence switching properties. In one aspect, sequencing is performed using nucleotide substrate molecules modified at the 5' end or at an intermediate phosphate with a fluorophore having fluorescence switching properties. In one aspect, the fluorescence switching property means that the fluorescence signal intensity after the sequencing reaction is significantly enhanced compared with the fluorescence signal intensity before the sequencing reaction. In one aspect, each sequencing run uses a set of reaction solutions, each set of reaction solutions comprising at least two reaction solutions, each reaction solution comprising at least one of A, G, C or T nucleotide substrate molecules or at least one of A, G, C or U nucleotide substrate molecules. In one aspect, the nucleotide sequence fragment to be tested is first immobilized in a reaction chamber, and a reaction solution from a set of reaction solutions is added to the reaction chamber. The sequencing reaction can be initiated under appropriate conditions and the fluorescent signal recorded. Then, one additional reaction solution is provided at a time, so that the other reaction solutions in the same set of reaction solutions are provided successively in the sequencing reaction. At the same time, one or more fluorescence signals from each reaction solution are recorded. In one aspect, at least one of the set of reaction solutions comprises two or three nucleotide molecules.
In one aspect, high throughput sequencing is used to obtain sequence information about the DNA to be tested by performing a series of enzymatic reactions and detecting the signal released during the reactions. If some of the nascent DNA strands have been extended to the nth base, and the nucleotides added in the current enzymatic reaction are exactly complementary to the (n+1)th through (n+m)th bases of the DNA template to be tested, the nascent DNA strand in the enzymatic reaction will ideally extend to the (n+m)th base. A "lead" has occurred in the nascent DNA strand if it has actually extended beyond the (n+m)th base. A "lag" has occurred in the nascent DNA strand if it has not actually extended to the (n+m)th base. The "lead" and "lag" phenomena are collectively referred to as the phase loss phenomenon. It should be noted that by the time the nascent DNA strand has extended to the nth base, multiple "leads" and "lags" may already have occurred in any possible order.
As shown in fig. 38, all nascent DNA strands have the same length of 1 prior to the sequencing reaction. The hatched, white, and gray boxes each represent a nucleotide in the sequence to be tested. For example, if a hatched box represents A, a white box represents T, and a gray box represents C, the template sequence shown in fig. 38 is ATCCTT. After the sequencing reaction, DNA molecules 1, 3 and 5 have been extended normally and have a length of 2. In DNA molecule 2, for example, the "lead" phenomenon has occurred due to side (e.g., undesired) reactions, and its length is 3, since the extension has exceeded the expected length of 2 nucleotides. In DNA molecule 4, for example, the "lag" phenomenon has occurred due to an incomplete reaction, and its length is 1. In one aspect, the nascent DNA strands therefore differ in length after the sequencing reaction. The 5 DNA molecules shown in fig. 38 are schematic only; they do not imply that there are 5 DNA molecules in actual sequencing, where many more DNA molecules may be present.
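By way of non-limiting illustration, the classification of the five strands described for fig. 38 can be sketched as follows. The function name `classify_strand` and the label "in phase" are hypothetical names introduced only for this sketch; the strand lengths are those stated in the text.

```python
def classify_strand(length_after, expected_length):
    """Label one nascent strand relative to the expected length."""
    if length_after > expected_length:
        return "lead"      # extended beyond the expected position
    if length_after < expected_length:
        return "lag"       # did not reach the expected position
    return "in phase"      # normal extension (hypothetical label)

# Lengths of DNA molecules 1 to 5 after the reaction, as described for fig. 38.
lengths_after = {1: 2, 2: 3, 3: 2, 4: 1, 5: 2}
EXPECTED_LENGTH = 2  # length reached by a normal extension in this example

labels = {mol: classify_strand(n, EXPECTED_LENGTH)
          for mol, n in lengths_after.items()}
print(labels)
```

Molecule 2 is labeled as a lead and molecule 4 as a lag, matching the description; the remaining molecules are in phase.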
As shown in fig. 39, DNA template 1 may have the sequence ATCTTT and DNA template 2 may have the sequence ATCCTT. After polymer A is extended normally (DNA template 1, normal extension, showing that polymer A has the sequence AT), polymer A (i.e., AT) can be further extended by side reactions to generate polymer B in the same sequencing reaction (DNA template 1, primary extension, showing that polymer B has the sequence ATC). Since only nucleotide T was provided in the sequencing reaction and the polymer was expected to extend only to position 2 (i.e., with T at position 2), polymer B is in "primary lead", having extended to position 3 with the sequence ATC. It should be noted that in this sequencing reaction only nucleotide T is provided and nucleotide C is not, which means that the C at position 3 may be the result of contamination (e.g., from the previous sequencing reaction), side reactions, or polymerase errors. In this example, because nucleotide T is provided in the sequencing reaction, polymer B can be extended further to position 4 to generate polymer C (having the sequence ATCT), a phenomenon known as "secondary lead". Compare this with DNA template 2, which has a C instead of a T at position 4. When sequencing DNA template 2 with nucleotide T provided, a primary lead may occur due to side reactions, extending the polymer to position 3 (C). In one aspect, however, the probability of another side reaction occurring such that another C is added at position 4 is negligible. Therefore, the polymer on DNA template 2 does not extend to position 4, and the secondary lead phenomenon does not occur with DNA template 2.
Sequencing method
In some aspects, methods of DNA sequencing are employed herein. In some embodiments, the method comprises immobilizing the test DNA on a solid surface, hybridizing it to one or more sequencing primers, and/or performing sequencing reactions in series and detecting the signal released by the reactions. In one aspect, each reaction comprises the steps of: adding a reaction solution containing nucleotides, enzyme and other necessary reagents into a reactor (such as a chip) to initiate a specific biochemical reaction; detecting the signal released by the reaction; and/or cleaning the reactor. The nucleotide to be added may be a natural deoxynucleotide or a nucleotide having a chemical modification group but, in one aspect, should have a hydroxyl group at the 3' end. The number of nucleotide types added per reaction can be 1, 2 or 3, but not 4 (i.e., not all of A, C, G and T, or all of A, C, G and U). In one aspect, the union of the nucleotide types added in two consecutive reactions includes all four nucleotides. For example, if A and G are added in a first reaction, C and T will be added in a second reaction. In another example, if A, C and G are added in the first reaction, T will be added in the second reaction.
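As a minimal sketch of the two constraints just described (1 to 3 nucleotide types per reaction, and all four types covered by any two consecutive reactions), a reaction schedule could be checked as below. The function name `is_valid_schedule` and the representation of each reaction as a string of nucleotide letters are assumptions made for illustration; the second constraint is applied here although the text presents it as one aspect rather than a requirement.

```python
ALL_DNA = frozenset("ACGT")

def is_valid_schedule(reactions, alphabet=ALL_DNA):
    """Check a list of reactions, each given as the nucleotide types it adds."""
    sets = [frozenset(r) for r in reactions]
    for added in sets:
        # Each reaction adds 1, 2 or 3 nucleotide types, never all 4.
        if not (1 <= len(added) <= 3) or not added <= alphabet:
            return False
    # Any two consecutive reactions together cover all four nucleotide types.
    return all((a | b) == alphabet for a, b in zip(sets, sets[1:]))

print(is_valid_schedule(["AG", "CT", "AG", "CT"]))  # the A+G / C+T example
print(is_valid_schedule(["ACG", "T", "ACG", "T"]))  # the ACG / T example
print(is_valid_schedule(["AG", "CG"]))              # T is never added
```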
If two types of nucleotides are added to a reaction, these 2 types of nucleotides may release the same or different types of signals in the reaction. If 3 types of nucleotides are added to a reaction, the three types of nucleotides may release the same or different types of signals. Alternatively, two of them release the same signal and the other 1 a different signal. The type of signal herein refers to the form of the signal (e.g., electrical signal, bioluminescent signal, chemiluminescent signal, etc.), or the color of the optical signal (e.g., green fluorescent signal, red fluorescent signal, etc.), or a combination thereof. For simplicity, when all nucleotides in a reaction release the same type of signal, the signal is called a monochromatic signal; when the nucleotides in a reaction release more than one type of signal, the signal is called a polychromatic signal. The term "color" is used herein for simplicity only, and the type of signal is not limited to optical signals (e.g., wavelengths) of different colors.
In certain embodiments, there are four types of signals that differ in meaning, as follows:
1. the ideal signal h is a sequencing signal directly deduced under ideal conditions according to the sequence of the DNA to be detected and the sequence of the added nucleotide, and directly reflects the sequence information of the DNA;
2. the phase loss signal s is a signal formed by deviation after the ideal signal h suffers from a phase loss phenomenon;
3. the predicted raw sequencing signal p is the signal obtained from the phase loss signal (or phase mismatch) s after a plurality of factors are taken into account: the proportionality between the number of extended bases and the sequencing signal intensity, signal attenuation, and global bias. The predicted raw sequencing signal p is a prediction of the actual raw sequencing signal according to preset parameters;
4. the actual original sequencing signal f is a signal directly measured by an instrument in the high-throughput DNA sequencing.
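For illustration only, the deduction of the ideal signal h from a known sequence and the order of nucleotide addition can be sketched as follows. The sketch assumes that the sequence is given as the bases incorporated into the nascent strand in order, and that each reaction is represented by the set of nucleotide types it adds; the function name `ideal_signal` and the A+C / G+T grouping are hypothetical choices, not taken from the text.

```python
def ideal_signal(sequence, reactions):
    """Count the bases incorporated in each reaction under ideal conditions."""
    h = []
    pos = 0  # index of the next base to be incorporated
    for added in reactions:
        count = 0
        # Extend as long as the next base is one of the types just added.
        while pos < len(sequence) and sequence[pos] in added:
            count += 1
            pos += 1
        h.append(count)
    return h

# Two alternating reaction solutions (assumed grouping for this example).
reactions = [{"A", "C"}, {"G", "T"}] * 2
h = ideal_signal("ATCCTT", reactions)
print(h)  # [1, 1, 2, 2]
```

Each value of h is the number of bases extended in one reaction, so the ideal signal reflects the sequence only up to the grouping of nucleotide types within each reaction.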
Parameter estimation
The process of inferring the relevant parameters of a sequencing reaction from one or more reference DNA molecules of known sequence and the actual raw sequencing signal is called parameter estimation. The basic process of parameter estimation is shown in fig. 41. Parameter estimation involves a set of parameters describing relevant properties in a sequencing reaction, such as phase loss coefficients, unit signal intensity, attenuation coefficients, global bias coefficients, and the like.
First, the method involves inferring the ideal signal h from the sequence of the reference DNA molecule, and then calculating the phase loss signal (or phase mismatch) s and the predicted raw sequencing signal p according to preset parameters. In one aspect, the method includes calculating a correlation coefficient c between p and the actual raw sequencing signal f. In one aspect, the method includes using an optimization method to find a set of parameters such that the correlation coefficient c reaches an optimal value. The correlation coefficient c herein includes, but is not limited to, a Pearson correlation coefficient, a Spearman correlation coefficient, average mutual information, a Euclidean distance, a Hamming distance, a Chebyshev distance, a Mahalanobis distance, a Manhattan distance, a Minkowski distance, a maximum value or a minimum value of the absolute values of the corresponding signal differences, and the like. The optimization method herein includes, but is not limited to, a grid search method, an exhaustive method, a gradient descent method, Newton's method, a Hessian matrix method, a heuristic search, etc., wherein the heuristic search includes, but is not limited to, a genetic algorithm, a simulated annealing algorithm, an ant colony algorithm, a harmony search algorithm, a fireworks algorithm, a particle swarm algorithm, an immune algorithm, etc. The correlation coefficients and optimization methods mentioned here are conventional knowledge in mathematics.
In one aspect, based on the effect of lead, lag, and/or offset on the sequencing signal, the method allows for transformation between the ideal signal h and the actual raw sequencing signal f. In another aspect, these parameters (e.g., lead, lag, and/or offset) can also be obtained during parameter estimation, in the process of inferring the relationship between the ideal signal h and the actual raw sequencing signal f (e.g., based on signals measured from a reference sequence of known nucleotide sequence). In some aspects, the estimation process includes using a matrix (e.g., a transformation matrix T) and/or a function (e.g., a transformation function d).
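A minimal sketch of one such combination, a grid search that maximizes the Pearson correlation coefficient, is given below. It estimates a single attenuation parameter under the assumption that the predicted signal decays geometrically with the reaction index; that model, the names `predict` and `grid_search_decay`, and all numerical values are illustrative assumptions and are not specified in the text. Because the Pearson coefficient is unchanged by a positive scale factor and a constant offset, only the decay parameter needs to be searched in this simplified setting.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def predict(s, decay):
    """Assumed model: the k-th signal value is attenuated by decay**k."""
    return [v * decay ** k for k, v in enumerate(s)]

def grid_search_decay(s, f, candidates):
    """Return the candidate decay whose prediction correlates best with f."""
    return max(candidates, key=lambda b: pearson(predict(s, b), f))

# Synthetic reference data: a known signal, scaled, offset and attenuated.
s_known = [1, 2, 1, 3, 2, 1, 2, 1]
f_measured = [5.0 * v + 0.3 for v in predict(s_known, 0.97)]

candidates = [round(0.90 + 0.01 * i, 2) for i in range(11)]  # 0.90 .. 1.00
best_decay = grid_search_decay(s_known, f_measured, candidates)
print(best_decay)
```

With several parameters the same idea applies over a multi-dimensional grid, or with any of the other optimization methods listed above.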
If a monochromatic signal is collected during sequencing, the calculation is performed directly as described above. If a polychromatic signal is collected in the sequencing, each type of signal is separated from the polychromatic signal and calculated separately using the methods described above.
In one aspect, an implementation of calculating s using h includes constructing a transformation matrix T based on the characteristics of h and the related parameters, and transforming h into s using T. In one aspect, an implementation of calculating p using s includes constructing a transformation function d based on the related parameters, and transforming s into p using d. Specific embodiments will be described in detail below.
Signal correction
In one aspect, signal correction includes a process of inferring the sequence information of the test DNA from (1) parameters obtained from parameter estimation and (2) actual raw sequencing signals of the test DNA of unknown sequence. In one aspect, the basic process of signal correction is shown in fig. 42, and can essentially be regarded as the inverse process of parameter estimation.
In a first aspect, the process includes using the inverse function of the transformation function d, based on the parameters obtained from parameter estimation, to transform the actual raw sequencing signal f into a phase loss signal (or phase mismatch) s. In one aspect, the process includes treating s as a zero-order phase loss signal s_0, constructing a transformation matrix T_1 according to s_0 and the related parameters, and using the generalized inverse matrix of T_1 to transform s_0 into a first-order phase loss signal s_1. In another aspect, the process further includes constructing a transformation matrix T_2 according to s_1 and the related parameters, and using the generalized inverse matrix of T_2 to transform s_1 into a second-order phase loss signal s_2. In yet another aspect, the process further includes constructing a transformation matrix T_{i+1} according to s_i and the related parameters, and using the generalized inverse matrix of T_{i+1} to transform s_i into an (i+1)th-order phase loss signal s_{i+1}, wherein i is an integer of 2 or more. In one aspect, the process includes calculating a series of phase loss signals s_0, s_1, s_2, ..., s_{i+1}, ..., s_j. In one aspect, if two adjacent phase loss signals s_i and s_{i+1} are found to be equal during the calculation, the calculation is stopped and s_i is returned as the result of the signal correction.
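The stopping rule of this iteration can be sketched, under stated assumptions, as a loop that applies one correction step repeatedly until two successive results are equal. The step itself (construction of the transformation matrix and application of its generalized inverse) is passed in as a function and is not implemented here; the names `iterate_until_stable` and `toy_step`, the `max_rounds` guard, and the toy step used in the example are hypothetical and serve only to make the sketch runnable.

```python
def iterate_until_stable(s0, correct_once, max_rounds=100):
    """Apply correct_once until s_i equals s_{i+1}, then return s_i."""
    current = list(s0)
    for _ in range(max_rounds):
        following = correct_once(current)
        if following == current:
            return current          # two adjacent signals are equal: stop
        current = following
    # Guard added for this sketch only; the text does not specify a limit.
    raise RuntimeError("no convergence within max_rounds")

def toy_step(signal):
    """Stand-in for one correction step: simply rounds each value."""
    return [round(v) for v in signal]

result = iterate_until_stable([0.9, 2.1, 1.2], toy_step)
print(result)  # [1, 2, 1]
```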
In one aspect, the use of the generalized inverse matrix described above may also be replaced with Tikhonov regularization.
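For reference, Tikhonov regularization is commonly written as x = (T^T T + alpha I)^{-1} T^T s, which approaches the generalized-inverse solution as alpha approaches 0. The following is a minimal pure-Python sketch of that formula; the function names, the regularization weight alpha, and the 3 x 3 matrix values are assumptions chosen for illustration and are not taken from the text.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def tikhonov_solve(t, s, alpha):
    """Return (T^T T + alpha I)^-1 T^T s."""
    tt = transpose(t)
    gram = [matvec(tt, col) for col in tt]   # entries of T^T T
    for i in range(len(gram)):
        gram[i][i] += alpha                  # add the regularization term
    return solve(gram, matvec(tt, s))

# Illustrative matrix and signal (assumed values).
T = [[0.9, 0.0, 0.0], [0.09, 0.9, 0.0], [0.0, 0.09, 0.9]]
h_true = [1.0, 2.0, 1.0]
s = matvec(T, h_true)
recovered = tikhonov_solve(T, s, alpha=1e-9)
print([round(v, 6) for v in recovered])
```

A larger alpha trades exactness for stability when the transformation matrix is poorly conditioned.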
If a monochromatic signal is collected during sequencing, the calculation is performed directly as described above. If a polychromatic signal is collected in the sequencing, each type of signal is separated from the polychromatic signal and calculated separately using the methods described above.
The use of the inverse function of the transformation function d to transform f into s, and the use of the generalized inverse matrix of T to transform s_i into s_{i+1}, will be described in detail below.
Method for constructing transformation matrix T
In one aspect, the construction of the transformation matrix T depends on a sequencing-related signal x and the phase loss parameters. In parameter estimation, the signal x is the ideal signal h; in signal correction, the signal x is the phase loss signal s_i of each order. To improve the correction accuracy, the signal x can be extended by appending several 1s to it; in a preferred embodiment, 1 to 100 1s are generally appended. In a particular embodiment, 5 to 10 1s are appended. In one aspect, the phase loss parameters include a lead coefficient ε and a lag coefficient λ.
In one aspect, constructing the transformation matrix T further includes constructing an auxiliary matrix D. In one aspect, assuming that the signal x has m values and the sequencing reaction is actually performed n times, the transformation matrix T and the auxiliary matrix D each have n rows and m columns. In the first row of the auxiliary matrix D, only the element in the first column is 1, and all other elements are 0.
In one aspect, the method includes calculating a k-th row of the transformation matrix T using a k-th row of the auxiliary matrix D. For the 1 st element of the kth row of the transformation matrix T:
1. if k is odd, lag is taken into account and the element is designated as (1 - λ) D_{k,1};
2. if k is even, the element is designated as 0.
For the ith element (except for the 1 st element) of the kth row of the transformation matrix T:
1. if the parities of k and i are the same, lag is taken into account and the element is designated as (1 - λ) D_{k,i};
2. if the parities of k and i are different, the primary lead phenomenon is taken into account and the element is designated as ε (1 - λ) D_{k,i-1};
3. if the (i-1)th element of the signal x is smaller than 2, the secondary lead is also taken into account: the (i-1)th element T_{k,i-1} of the same row of the transformation matrix T is added to the result of steps 1 and 2 above.
In one aspect, the method includes calculating a k +1 th row of the auxiliary matrix using a k-th row of the transformation matrix T. In the 1 st row of the auxiliary matrix D, only the element in the 1 st column is 1, and the other elements are all 0. For the k-th row of the auxiliary matrix (except row 1):
1. the 1st element is the difference between the element D_{k-1,i} in the same column of the previous row of the auxiliary matrix and the corresponding element T_{k-1,i} in the same row and column of the transformation matrix T (here i = 1);
2. the ith element (except the 1st element) is the difference between the element D_{k-1,i} in the same column of the previous row of the auxiliary matrix and the corresponding element T_{k-1,i} of the transformation matrix T, plus the element T_{k-1,i-1} in the previous row and previous column of the transformation matrix T.
Thus, in one aspect, the values of row 1 of the auxiliary matrix D are first specified, and row 1 of the transformation matrix T is then calculated from row 1 of the auxiliary matrix D. In one aspect, the method further includes using row 1 of the transformation matrix T to calculate row 2 of the auxiliary matrix D; row 2 of the auxiliary matrix D is then used to calculate row 2 of the transformation matrix T. The values of all elements of the auxiliary matrix and the transformation matrix are obtained in the same way.
In one aspect, the auxiliary matrix D is introduced only for computational simplicity, and can be eliminated by conventional mathematical manipulation so that the transformation matrix T is computed directly.
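The update of the auxiliary matrix from one row to the next can be sketched as follows, reading the rule as D_{k,i} = D_{k-1,i} - T_{k-1,i} + T_{k-1,i-1}, with the last term omitted for the 1st element. The function name `next_auxiliary_row` and the numerical values of the previous rows are assumptions used only to exercise the rule; the sketch covers only the auxiliary-matrix step and does not implement the construction of the transformation matrix rows.

```python
def next_auxiliary_row(d_prev, t_prev):
    """Compute row k of D from row k-1 of D and row k-1 of T."""
    row = []
    for i in range(len(d_prev)):
        value = d_prev[i] - t_prev[i]   # D[k-1][i] - T[k-1][i]
        if i > 0:
            value += t_prev[i - 1]      # plus T[k-1][i-1] except in column 1
        row.append(value)
    return row

# Row 1 of D as specified in the text: 1 in the first column, 0 elsewhere.
d_row1 = [1.0, 0.0, 0.0]
# Illustrative row 1 of T (assumed values, not derived here).
t_row1 = [0.9, 0.09, 0.0]

d_row2 = next_auxiliary_row(d_row1, t_row1)
print([round(v, 6) for v in d_row2])  # [0.1, 0.81, 0.09]
```

In this example the entries of the new row still sum to 1, because what leaves one column through T enters the next column.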
In the above calculations, the dephasing parameter is related to the nucleotide type and also to the row number k and column number i where the element being calculated is located. In actual calculations, the dephasing coefficients ε and/or λ may be kept constant for simplicity, or varied depending on the type of nucleotide, row number k and/or column number i.
In one aspect, in parameter estimation, the transformation matrix T is obtained by the above calculation method from a preset phase loss coefficient and the ideal signal h. In one aspect, the phase loss signal (or phase mismatch) s is the product of the transformation matrix T and the ideal signal h. If the ideal signal h is represented as a column vector, s is T multiplied by h; if the ideal signal h is represented as a row vector, s is h multiplied by the transpose of T.
During signal correction, the transformation matrix T is obtained by the above calculation method from a preset phase loss coefficient and the ith-order phase loss signal s_i. In one aspect, the (i+1)th-order phase loss signal s_{i+1} is the product of the generalized inverse matrix T^+ of the transformation matrix T and the ith-order phase loss signal s_i. If s_i is expressed as a column vector, then s_{i+1} is T^+ multiplied by s_i; if s_i is expressed as a row vector, then s_{i+1} is s_i multiplied by the transpose of T^+. After the (i+1)th-order phase loss signal s_{i+1} has been calculated as described above, it may be further rounded. Rounding methods include, but are not limited to:
1. rounding to nearest: take the nearest integer value;
2. rounding up: take the smallest integer not less than s_{i+1};
3. rounding down: take the largest integer not greater than s_{i+1};
4. rounding toward 0: if s_{i+1} is greater than 0, round down; if s_{i+1} is less than 0, round up;
5. rounding with non-positive values replaced: round in any of the above ways and then change all non-positive numbers to 1.
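The rounding options listed above can be sketched as follows. The mode names and the function names `round_value` and `round_signal` are hypothetical, and the treatment of exact halves in nearest rounding (rounded up here) is an assumption, since the text does not specify it.

```python
import math

def round_value(v, mode):
    """Round one value using one of the first four listed methods."""
    if mode == "nearest":
        # Halves round up; Python's built-in round() rounds halves to even.
        return math.floor(v + 0.5)
    if mode == "up":
        return math.ceil(v)
    if mode == "down":
        return math.floor(v)
    if mode == "toward_zero":
        return math.trunc(v)
    raise ValueError("unknown rounding mode: " + mode)

def round_signal(signal, mode="nearest", replace_non_positive=False):
    """Round every value; optionally change non-positive results to 1."""
    rounded = [round_value(v, mode) for v in signal]
    if replace_non_positive:
        rounded = [v if v > 0 else 1 for v in rounded]
    return rounded

values = [0.4, 1.5, 2.6, -0.3]
print(round_signal(values, "nearest"))        # [0, 2, 3, 0]
print(round_signal(values, "up"))             # [1, 2, 3, 0]
print(round_signal(values, "down"))           # [0, 1, 2, -1]
print(round_signal(values, "toward_zero"))    # [0, 1, 2, 0]
print(round_signal(values, "nearest", True))  # [1, 2, 3, 1]
```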
Method for constructing transformation function
In one aspect, the transformation function d involves several parameters, including the unit signal a (the proportionality between the number of extended bases and the sequencing signal intensity), the attenuation coefficient b, the global bias c, etc. Each of the parameters a, b, c herein may be a single coefficient or a set of coefficients. For example, the unit signal a is related to the type of nucleotide and to the number of sequencing reactions that have occurred. In the calculation, a single value may be used for each of these parameters for simplicity; alternatively, for accuracy, the parameters may vary with the relevant factors; or some parameters may use a single value while other parameters vary with the relevant factors.
Forms of the transformation function d include, but are not limited to, the following (the formulas are not reproduced in this text):
In the above functions, the component functions (whose symbols are likewise not reproduced in this text) are mathematical functions related to a, b and c, including, but not limited to, a constant function, a power function, an exponential function, a logarithmic function, a trigonometric function, an inverse trigonometric function, a rounding function, a special function, and functions generated by the mutual operation, composition, iteration and piecewise combination of the above functions. In some embodiments, the special function includes, but is not limited to, an elliptic function, a gamma function, a Bessel function, a beta function, and the like.
In one aspect, the transformation function d transforms the phase loss signal (or phase mismatch) s into the predicted raw sequencing signal p, i.e., p = d(s). In one aspect, the inverse function d^{-1} of the transformation function d transforms the actual raw sequencing signal f into the phase loss signal (or phase mismatch) s, i.e., s = d^{-1}(f). The inverse function herein takes its conventional meaning in mathematics.
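The relationship between a transformation function and its inverse can be illustrated with one simple, assumed form: p_k = a * s_k * b^k + c, where k is the zero-based index of the sequencing reaction. This particular form, the function names, and the parameter values are hypothetical and chosen only for illustration; they are not taken from the forms referred to above.

```python
def transform(s, a, b, c):
    """Assumed form of d: scale by unit signal a, attenuate by b**k, add bias c."""
    return [a * v * b ** k + c for k, v in enumerate(s)]

def inverse_transform(f, a, b, c):
    """Inverse of the assumed form: remove the bias, then undo scale and decay."""
    return [(v - c) / (a * b ** k) for k, v in enumerate(f)]

s = [1.0, 2.0, 1.0, 3.0]
a, b, c = 100.0, 0.98, 5.0   # illustrative parameter values

p = transform(s, a, b, c)
s_back = inverse_transform(p, a, b, c)
print([round(v, 9) for v in s_back])  # recovers s up to floating-point error
```

In signal correction the inverse would be applied to the actual raw sequencing signal f using parameters obtained from parameter estimation.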
Compared with the existing method (for example, the 454 patent method, as disclosed in US 2011/0213563 A1, System and method to correct out of phase errors in DNA sequencing data by use of a recursive algorithm, published as US 8,364,417), the following four improvements are mainly made in this document. First, the methods herein include constructing a transformation matrix that simultaneously accounts for primary lead, secondary lead, and lag in the event of a phase loss, and using the transformation matrix to correct sequencing errors due to the phase loss. Second, the methods herein address signal deviations caused by attenuation, phase loss, and global offset as a whole. The approach herein neither corrects for signal deviations caused by only a single problem nor simply solves the problems one by one. Third, the method for correcting the signal is improved: parameter settings that require subjective human judgment are avoided, and the robustness and repeatability of the method are improved. Fourth, using the methods disclosed herein, both monochromatic and bi-color signals can be corrected.
In one aspect, tertiary leads are not considered herein (FIG. 40).
In one aspect, the method disclosed herein has the following effects and advantages compared to the methods mentioned in the background art:
1. In the 2+2 sequencing method, the secondary lead phenomenon is significant, and the resulting deviation cannot be corrected by the 454 patent method, which does not take the secondary lead phenomenon into account. Herein, in one aspect, the secondary lead phenomenon is taken into account, so that signal deviations caused by this phenomenon can be corrected well.
2. In practice, if only a simple linear fitting method is used to read out sequence information from the raw sequencing signal, reads are typically accurate only up to about 100 bp. If the method described herein is used on the same data, reads can be accurate up to about 350 bp, so that the sequencing read length and the sequencing accuracy are greatly improved. In some embodiments, reads can be accurate up to about 400bp, about 450bp, about 500bp, about 550bp, about 600bp, about 650bp, about 700bp, about 750bp, about 800bp, about 850bp, about 900bp, about 950bp, about 1000bp, about 1050bp, about 1100bp, about 1150bp, about 1200bp, about 1250bp, about 1300bp, about 1350bp, about 1400bp, about 1450bp, about 1500bp, about 1550bp, about 1600bp, about 1650bp, about 1700bp, about 1750bp, about 1800bp, about 1850bp, about 1900bp, about 1950bp, about 2000bp, about 2050bp, about 2100bp, about 2150bp, about 2200bp, about 2250bp, about 2300bp, about 2350bp, or about 2400 bp.
3. In one aspect, the methods herein are capable of correcting both monochromatic and bi-color signals.
4. In another aspect, the normal order of adding samples and/or reagents (e.g., dNTPs or ddNTPs) for sequencing is not affected herein, in contrast to certain state of the art methods, e.g., the Ion Torrent sequencing method (Alternative nucleotide flows in sequencing-by-synthesis methods) as disclosed in US 2014/0031238 A1 and U.S. patent No. 9,416,413.
In one aspect, disclosed herein is a method of feeding back iteratively generated errors in template molecular sequence data, comprising: a) detecting a plurality of signals corresponding to the nucleic acid sequence, the signals generated as a result of the introduction of a plurality of nucleotides into the sequencing reaction; b) using the detected signal to generate quantitative (normalized or digitized) information; c) obtaining a series of lead and/or lag information using the parameter estimates; d) obtaining phase mismatches using the amount of new nucleotides generated and accumulation of secondary leads; e) calculating the amount of new nucleotides generated in each reaction using the phase mismatches; and f) repeating steps d) and e) until the amount of new nucleotides generated in each reaction becomes convergent, wherein said parameter estimation refers to the inference of lead and/or lag from the reference sequence and its sequencing signal; wherein a secondary lead is the occurrence of an extension in a sequencing reaction that does not match a nucleotide substrate of the sequencing reaction, on the basis of which an extension that matches a nucleotide substrate of the sequencing reaction occurs; wherein the phase mismatch is due to a change in the sequencing result due to a lead and/or lag, and wherein the amount of new nucleotides is the extension length of the sequence after addition to the sequencing reaction.
In one aspect, in the parameter estimation, the method further comprises obtaining an attenuation coefficient. In another aspect, in the parameter estimation, the method further comprises obtaining an offset. In another aspect, in the parameter estimation, the method further comprises obtaining unit signal information.
In another aspect, disclosed herein is a method of feeding back iteratively generated errors in template molecular sequence data, comprising: a) detecting a plurality of signals corresponding to the nucleic acid sequence, the signals generated as a result of the introduction of a plurality of nucleotides into the sequencing reaction; b) using the detected signal to generate quantitative (normalized or digitized) information; c) obtaining a series of lead and/or lag, attenuation coefficient and offset using the parameter estimation; d) obtaining phase mismatches using the amount of new nucleotides generated and accumulation of secondary leads; e) calculating the amount of new nucleotides generated in each reaction using the phase mismatches; and f) repeating steps d) and e) until the amount of new nucleotides generated in each reaction becomes convergent, wherein parameter estimation refers to inferring lead and/or lag from the reference sequence and its sequencing signal; wherein a secondary lead is the occurrence of an extension in a sequencing reaction that does not match a nucleotide substrate of the sequencing reaction, on the basis of which an extension that matches a nucleotide substrate of the sequencing reaction occurs; wherein the phase mismatch is due to a change in the sequencing result due to a lead and/or lag, and wherein the amount of new nucleotides refers to the extension length of the sequence after addition to the sequencing reaction.
In one aspect, disclosed herein is a method of correcting lead in sequencing results using secondary lead, wherein in sequencing results, if a signal obtained by a particular reaction is similar to a unit signal, the method comprises correcting the signal using the secondary lead; wherein a secondary lead is the occurrence of an extension in a sequencing reaction that does not match a nucleotide substrate of the sequencing reaction, followed by an extension that matches the nucleotide substrate of the sequencing reaction.
In one aspect, a primary lead is included in the sequencing results, wherein the primary lead refers to a mismatch between the extension and the nucleotide substrate in the sequencing reaction.
In one aspect, the effect of the subsequent lead comprises a secondary lead effect, and the primary leads other than the first secondary lead are accumulated in subsequent sequencing reactions.
In any of the preceding embodiments, a signal obtained from a reaction is considered similar to the unit signal when its intensity is close to that of the unit signal: the deviation between the two is preferably less than about 60%, more preferably less than about 50%, 40%, 30%, 20% or 10%, and most preferably less than about 5%.
In one aspect, when the nth sequencing signal is obtained in a sequencing reaction, the method comprises obtaining a corrected sequencing signal from the sequencing signal n-ago by feeding back iteratively generated errors in the template molecular sequence data; it is then determined, according to the determination rule described above, whether a secondary lead is present at that position.
In any of the preceding embodiments, sequencing may be a process in which a reaction solution of sequencing reagents, such as nucleotides and enzymes, is added to the nucleic acid sequence to be tested.
In any of the preceding embodiments, one type or two types or three types or four types of nucleotides may be added to each reaction in the sequencing.
In any of the preceding embodiments, sequencing may be an open-ended (ends open) sequencing process. In the sequencing reaction, one type or two types or three types of nucleotides may be added. In any of the preceding embodiments, the nucleotides added in sequencing may be one or more of A, G, C and T, or one or more of A, G, C and U.
In any of the preceding embodiments, in sequencing, the detection signal may be an electrical signal, a bioluminescent signal, a chemiluminescent signal, or a combination thereof.
In any of the foregoing embodiments, in parameter estimation, the method may include first inferring an ideal signal h from a reference DNA molecule, then calculating a dephasing signal (or mismatching) s and a predicted raw sequencing signal p from pre-set parameters, and calculating a correlation coefficient c between p and the actual raw sequencing signal f.
In any of the preceding embodiments, the method may comprise using an optimization method to find a set of parameters such that the correlation coefficient c reaches an optimal value. The parameters found may include an amount of lead and/or lag and may also include one or more of an attenuation factor, an offset, and a unit signal.
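The parameter estimation described above can be illustrated with a minimal Python sketch. Everything beyond the outline in the text is an assumption made for illustration: the function names (`ideal_signal`, `predict`, `pearson`, `estimate`), the toy lag-only dephasing model inside `predict` (lead is not modeled), the grid search used as the optimization method, and the synthetic "measured" signal f generated from known parameters. It is not the disclosed implementation.

```python
def ideal_signal(reference, first="GT", second="AC"):
    """Ideal signal h: run lengths when reactions alternate between two base groups."""
    groups, h, pos = (first, second), [], 0
    while pos < len(reference):
        group = groups[len(h) % 2]
        run = 0
        while pos < len(reference) and reference[pos] in group:
            run += 1
            pos += 1
        h.append(run)
    return h


def predict(h, lag, attenuation, unit=1.0, offset=0.0):
    """Predicted raw signal p under a toy lag-only dephasing model (an assumption)."""
    state = [0.0] * (len(h) + 1)  # state[k]: fraction of strands that finished k runs
    state[0] = 1.0
    p = []
    for i in range(len(h)):
        s = 0.0  # dephased signal of reaction i, in units of one base
        # Only strands whose next run has the same parity as i match this reaction.
        for k in range(i % 2, i + 1, 2):
            stay = lag if h[k] > 0 else 0.0
            moved = state[k] * (1.0 - stay)
            s += moved * h[k]
            state[k] -= moved
            state[k + 1] += moved
        p.append(unit * attenuation ** i * s + offset)
    return p


def pearson(a, b):
    """Correlation coefficient c between two equally long signals."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def estimate(h, f, lags, attenuations):
    """Grid search for the (lag, attenuation) pair that maximizes c."""
    return max((pearson(predict(h, lag, att), f), lag, att)
               for lag in lags for att in attenuations)


reference = "TGAACTTTAGCCACGGAGTA"
h = ideal_signal(reference)
# Synthetic stand-in for a measured signal f, generated with known parameters.
f = predict(h, lag=0.04, attenuation=0.98, unit=100.0, offset=5.0)
c, lag, attenuation = estimate(h, f, lags=[0.0, 0.02, 0.04, 0.06],
                               attenuations=[0.96, 0.98, 1.0])
print(h)
print(round(c, 6), lag, attenuation)
```

Because a correlation coefficient is unchanged when a signal is scaled or shifted, this sketch recovers only the lag and the attenuation coefficient; the unit signal and the offset would need a separate fit, for example by least squares.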
In any of the preceding embodiments, the amount of lead and/or lag may refer to the degree of dephasing due to lead and/or lag in the sequencing reaction.
In any of the preceding embodiments, in sequencing, the nucleotides can be divided into two groups, and the method can include adding a sequencing reaction solution comprising a group of nucleotide molecules in each sequencing reaction.
Examples
Example 1: Sequencing by the "2+2 monochromatic" method
In order to further describe the present disclosure, specific examples are provided below. Unless otherwise indicated, specific parameters, steps, etc., are conventional in the art. The specific embodiments are not intended to limit the scope of the invention.
For sequencing by the "2+2 monochromatic" method, three sets of reaction solutions were prepared. Each set comprises two bottles, and each bottle contains two bases labeled with the same fluorophore X. Within each set, the two bottles together contain all four bases of the sequencing reaction, each exactly once. No two of the six bottles (two per set) have the same composition.
Table 7: Reaction solutions in the 2+2 monochromatic method
| | First bottle | Second bottle |
| First set | AX+CX | GX+TX |
| Second set | AX+GX | CX+TX |
| Third set | AX+TX | CX+GX |
The complete sequencing process includes three rounds of sequencing, performed sequentially in any suitable order. Each round uses one of the three sets of reaction solutions listed in Table 7. For example, the order of the three rounds may be first set → second set → third set, or second set → third set → first set, etc. Apart from the set of reaction solutions used, all conditions are identical between rounds (for example, the same sequencing primers and reaction conditions are used in each round). The two bottles of a set may also be used in any suitable order, e.g., the first bottle may be used before or after the second bottle.
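The three sets of Table 7 are exactly the three ways of splitting the four bases into two bottles of two bases each. A short Python sketch that enumerates them is given below; the function name `two_plus_two_sets` is illustrative and not part of the disclosure.

```python
BASES = "ACGT"


def two_plus_two_sets(bases=BASES):
    """Enumerate every split of four bases into two bottles of two bases each."""
    sets = []
    first = bases[0]
    # Fixing the first base in bottle 1 lists each split exactly once.
    for partner in bases[1:]:
        bottle1 = first + partner
        bottle2 = "".join(b for b in bases if b not in bottle1)
        sets.append((bottle1, bottle2))
    return sets


for number, (bottle1, bottle2) in enumerate(two_plus_two_sets(), start=1):
    print(f"Set {number}: {bottle1} | {bottle2}")
```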
Each round of sequencing included:
1. The sequencing primers are hybridized to the prepared DNA array.
2. The sequencing reaction is started. Steps 2.1-2.4 can be repeated multiple times.
2.1. One bottle of reaction solution (e.g., the first or the second bottle of the first set) is added to the sequencing reaction mixture (e.g., in a flow cell), the reaction is allowed to proceed, and the fluorescent signal from fluorophore X is collected.
2.2. All residual reaction solution and fluorescent molecules are washed out of the flow cell.
2.3. The other bottle of reaction solution from the same set is added to the sequencing reaction mixture, the reaction is allowed to proceed, and the fluorescent signal is collected.
2.4. All residual reaction solution and fluorescent molecules are washed out of the flow cell.
3. The extended sequencing primer is removed from the template by denaturation.
At this point, a new round of sequencing reaction can begin.
The solutions used in this example can be prepared as follows. The wash solution for the sequencing reaction comprises: 20 mM Tris-HCl, pH 8.8; 10 mM (NH4)2SO4; 50 mM KCl; 2 mM MgSO4; and 0.1% [...] 20. The main solution of the sequencing reaction contains: 20 mM Tris-HCl, pH 8.8; 10 mM (NH4)2SO4; 50 mM KCl; 2 mM MgSO4; 0.1% [...]; 8000 units/mL Bst polymerase; and 100 units/mL CIP (calf intestinal alkaline phosphatase).
Three sets of sequencing reaction solutions were prepared as follows:
Set 1 (bottles 1A and 1B):
Bottle 1A: main solution + 20 μM dA4P-TG + 20 μM dC4P-TG
Bottle 1B: main solution + 20 μM dG4P-TG + 20 μM dT4P-TG
Set 2 (bottles 2A and 2B):
Bottle 2A: main solution + 20 μM dA4P-TG + 20 μM dG4P-TG
Bottle 2B: main solution + 20 μM dC4P-TG + 20 μM dT4P-TG
Set 3 (bottles 3A and 3B):
Bottle 3A: main solution + 20 μM dA4P-TG + 20 μM dT4P-TG
Bottle 3B: main solution + 20 μM dC4P-TG + 20 μM dG4P-TG
The prepared reaction solutions and the main solution are kept in a refrigerator at 4 ℃ or on ice until use.
To hybridize the sequencing primers, the sequencing primer solution (10 μM primer in 1× SSC buffer) was injected into the sequencing chip, heated to 90 ℃, and then cooled to 40 ℃ at a rate of 5 ℃/min. The sequencing primer solution was then washed away with the wash solution.
To perform the sequencing reaction, the sequencing chip was placed on a sequencer. To perform sequencing using the first set of reaction solutions, the following steps were performed:
1. 10mL of wash solution was added to rinse the chip.
2. The chip was cooled to 4 ℃.
3. 100 μL of reaction solution 1A was added.
4. The chip was heated to 65 ℃.
5. Wait for 1 minute.
6. Fluorescence images were taken at 473nm excitation laser wavelength.
7. 10mL of wash solution was added to rinse the chip.
8. The chip was cooled to 4 ℃.
9. 100 μL of reaction solution 1B was added.
10. The chip was heated to 65 ℃.
11. Wait for 1 minute.
12. Fluorescence images were taken at 473nm excitation laser wavelength.
13. The steps 1-12 were repeated 50 times to obtain 100 fluorescence signals.
The second round of sequencing can be performed as follows. First, the chip was cooled to room temperature. Then 200 μL of 0.1 M NaOH solution was added to denature the double-stranded DNA extended in the first round of sequencing. Then 10 mL of wash solution was added to wash away the residual NaOH and the denatured single DNA strands.
The sequencing primers are then rehybridized to the DNA array, as described above. The sequencing reaction using the second set of reaction solutions was performed as follows:
1. 10mL of wash solution was added to rinse the chip.
2. The chip was cooled to 4 ℃.
3. 100 μL of reaction solution 2A was added.
4. The chip was heated to 65 ℃.
5. Wait for 1 minute.
6. Fluorescence images were taken at 473nm excitation laser wavelength.
7. 10mL of wash solution was added to rinse the chip.
8. The chip was cooled to 4 ℃.
9. 100 μL of reaction solution 2B was added.
10. The chip was heated to 65 ℃.
11. Wait for 1 minute.
12. Fluorescence images were taken at 473nm excitation laser wavelength.
13. The steps 1-12 were repeated 50 times to obtain 100 fluorescence signals.
The third round of sequencing can be performed as follows. First, the chip was cooled to room temperature. Then 200 μL of 0.1 M NaOH solution was added to denature the double-stranded DNA extended in the second round of sequencing. Then 10 mL of wash solution was added to wash away the residual NaOH and the denatured single DNA strands.
The sequencing primers are then rehybridized to the DNA array, as described above. The sequencing reaction using the third set of reaction solutions was performed as follows:
1. 10mL of wash solution was added to rinse the chip.
2. The chip was cooled to 4 ℃.
3. 100 μL of reaction solution 3A was added.
4. The chip was heated to 65 ℃.
5. Wait for 1 minute.
6. Fluorescence images were taken at 473nm excitation laser wavelength.
7. 10mL of wash solution was added to rinse the chip.
8. The chip was cooled to 4 ℃.
9. 100 μL of reaction solution 3B was added.
10. The chip was heated to 65 ℃.
11. Wait for 1 minute.
12. Fluorescence images were taken at 473nm excitation laser wavelength.
13. The steps 1-12 were repeated 50 times to obtain 100 fluorescence signals.
At this point, three rounds of sequencing were completed.
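The three-round protocol above can be summarized as a schedule of reactions. The following Python sketch only enumerates which solution is added in which reaction; temperatures, volumes, washing and imaging are not modeled, and the function name `reaction_schedule` is illustrative rather than part of the disclosure.

```python
def reaction_schedule(cycles=50, sets=("1", "2", "3"), bottles=("A", "B")):
    """Yield (round, cycle, solution) for every reaction of the three-round protocol."""
    for round_number, set_name in enumerate(sets, start=1):
        for cycle in range(1, cycles + 1):
            for bottle in bottles:
                yield round_number, cycle, set_name + bottle


schedule = list(reaction_schedule())
signals_per_round = sum(1 for round_number, _, _ in schedule if round_number == 1)
print(len(schedule))        # 3 rounds x 50 cycles x 2 bottles
print(signals_per_round)    # fluorescence signals collected in one round
print(schedule[:3])
```

Each round thus yields 100 fluorescence signals (50 cycles with two reactions each), in line with step 13 of each round.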
Example 2: Sequencing by the "2+2 two-color" method
In this example, three sets of reaction solutions were prepared. Each set has two bottles, and each bottle contains two nucleotide bases. The two bases in each bottle are labeled with two different fluorophores (with different emission wavelengths) so that the signals from the two bases can be distinguished.
In this example, the two fluorophores are X and Y. Within each set, the two bottles together contain all four bases of the sequencing reaction, each exactly once. No two of the six bottles (two per set) have the same composition.
Table 8: Reaction solutions in the "2+2 two-color" method
| | First bottle | Second bottle |
| First set | AX+CY | GX+TY |
| Second set | AX+GY | CX+TY |
| Third set | AX+TY | CX+GY |
The complete sequencing process includes three rounds of sequencing, performed sequentially in any suitable order. Each round uses one of the three sets of reaction solutions listed in Table 8. For example, the order of the three rounds may be first set → second set → third set, or second set → third set → first set, etc. Apart from the set of reaction solutions used, all conditions are identical between rounds (for example, the same sequencing primers and reaction conditions are used in each round). The two bottles of a set may also be used in any suitable order, e.g., the first bottle may be used before or after the second bottle.
Each round of sequencing included:
1. The sequencing primers are hybridized to the prepared DNA array.
2. The sequencing reaction is started. Steps 2.1-2.4 can be repeated multiple times.
2.1. One bottle of reaction solution (e.g., the first or the second bottle of the first set) is added to the sequencing reaction mixture (e.g., in a flow cell), the reaction is allowed to proceed, and fluorescent signals are collected from fluorophore X and fluorophore Y, respectively.
2.2. All residual reaction solution and fluorescent molecules are washed out of the flow cell.
2.3. The other bottle of reaction solution from the same set is added to the sequencing reaction mixture, the reaction is allowed to proceed, and fluorescent signals are collected from fluorophore X and fluorophore Y, respectively.
2.4. All residual reaction solution and fluorescent molecules are washed out of the flow cell.
3. The extended sequencing primer is removed from the template by denaturation.
At this point, a new round of sequencing reaction can begin.
Example 3: comparative examples
Comparative example 1
In this comparative example, four 3' -end-blocked nucleotide molecules were used. The 3' blocking group may prevent the polymerase molecule from continuing to extend using the nucleotide molecule as a substrate. The 3' blocking group can be cleaved under specific conditions to yield a terminal hydroxyl group. Each nucleotide molecule is labeled with a different fluorescent molecule group. The molecular groups used here are not fluorophores with fluorescence-switching properties and can be cleaved off under specific conditions. The fluorescent labels are W, X, Y and Z, respectively. The labeled nucleotide monomers are W-A, X-C, Y-G and Z-T, respectively.
Reagent 1 is the main sequencing reaction solution, comprising four kinds of fluorescently labeled, 3'-blocked nucleotide molecules and a polymerase that catalyzes extension using these labeled nucleotides. Reagent 2 is a wash solution. Reagent 3 is a deblocking solution containing a reagent that cleaves the 3' blocking group and the fluorophore.
When sequencing, a sequencing primer is hybridized on the template strand. Reagent 1 is mixed with the hybridized template to allow a polymerase reaction to occur. After the reaction, the unreacted sequencing solution was washed clean with reagent 2. The fluorescent signal is collected to determine the nucleotide base added to the sequencing primer in the polymerase extension reaction. Then, all 3' end blocking groups and fluorophores were cleaved off using reagent 3. The template polynucleotide may then be used for the next sequencing reaction after washing. This sequencing method does not have data redundancy and quality control properties.
Comparative example 2
In this comparative example, the sequencing reaction was performed using nucleotides without fluorescence-switching properties. The procedure is similar to Example 1, except that the fluorescent label is not on the phosphate group. Four nucleotide molecules are used, each of which can be freely extended by a polymerase under complementary pairing conditions. Each nucleotide molecule carries the same fluorescent group on its base; this group has no fluorescence-switching property and can be cleaved off under specific conditions. Three sets of reaction solutions were provided, with two bottles each. Within each set, the two bottles together contain all four bases of the sequencing reaction, each exactly once. No two of the six bottles (two per set) have the same composition.
Table 9: Reaction solutions in comparative example 2
| | First bottle | Second bottle |
| First set | AX+CX | GX+TX |
| Second set | AX+GX | CX+TX |
| Third set | AX+TX | CX+GX |
The complete sequencing process includes three rounds of sequencing, performed sequentially in any suitable order. Each round uses one of the three sets of reaction solutions above; apart from the set used, all conditions are identical between rounds (for example, the same sequencing primers and reaction conditions are used in each round).
Each round of sequencing included:
1. The sequencing primers are hybridized to the prepared DNA array.
2. The sequencing reaction is started. Steps 2.1-2.8 can be repeated multiple times.
2.1. The first bottle of reaction solution is added to the sequencing reaction mixture and the reaction is allowed to proceed.
2.2. All residual reaction solution and fluorescent molecules are washed out of the flow cell.
2.3. Fluorescent signals are collected from the fluorophores.
2.4. A reagent is added to cleave the fluorescent label group.
2.5. The second bottle of reaction solution is added to the sequencing reaction mixture and the reaction is allowed to proceed.
2.6. All residual reaction solution and fluorescent molecules are washed out of the flow cell.
2.7. Fluorescent signals are collected from the fluorophores.
2.8. A reagent is added to cleave the fluorescent label group.
3. The extended sequencing primer is removed from the template by denaturation.
A new round of sequencing can then be started. The three rounds of sequencing were completed.
In this comparative example, substrates (nucleotide molecules) without fluorescence-switching properties are used, so a cleavage reagent must be introduced during the sequencing step to cleave the fluorescent label, which makes the step longer. In addition, molecular scars are left on the resulting double-stranded DNA molecules, hindering further extension.
Example 4: detecting and/or correcting sequencing errors
In this example, a single stranded DNA molecule to be sequenced is immobilized on a solid surface. The method of immobilization may be chemical crosslinking, molecular adsorption, or the like. The 3 'end or 5' end of the DNA may be immobilized on a surface. The DNA to be tested comprises a fixed fragment with a known sequence, and can be complementarily hybridized with a sequencing primer. The sequence of the region from the 3 'end of the fragment to the 3' end of the test DNA is the region of the test sequence. In this example, the sequence to be sequenced was 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2).
First, a sequencing primer is hybridized to the fragment of known sequence in the target DNA. A functional group with fluorescence-switching properties is attached to the base of each nucleotide substrate molecule, and each substrate carries four phosphate groups.
dG4P and dT4P, along with the corresponding reaction buffer, enzyme and metal ions, were added to the reaction system to initiate a sequencing reaction that produced a fluorescent signal. The signal is acquired by a CCD (charge coupled device). The values of these fluorescence signals were recorded. The reaction was recorded as the first reaction.
The remaining dG4P and dT4P from the reaction were washed away. Then, dA4P and dC4P were added to the reaction system, the same sequencing reaction as described above was initiated, and the value of the fluorescence signal was recorded. This reaction was recorded as the second reaction. This method is also known as the monochromatic 2+2 sequencing method.
The above process was repeated: dG4P and dT4P were added for odd-numbered reactions and dA4P and dC4P for even-numbered reactions, giving a set of sequencing signal values: x = (2, 3, 3, 1, 1, 3, 2, 1, 2, 1).
The nascent DNA strand synthesized in the above sequencing reaction was then melted and washed away, for example using high temperature or a denaturant (e.g., urea or formamide). The sequencing primer was rehybridized to the template DNA. dC4P and dT4P were added for odd-numbered reactions and dA4P and dG4P for even-numbered reactions, giving a second set of sequencing signal values: y = (1, 4, 4, 2, 2, 1, 1, 4, 1, 1).
The nascent DNA strand was again melted and washed away in the same way, and the sequencing primer was rehybridized to the template DNA. dA4P and dT4P were added for odd-numbered reactions and dC4P and dG4P for even-numbered reactions, giving a third set of sequencing signal values: z = (1, 1, 2, 1, 4, 3, 1, 3, 1, 1, 2).
The sequencing signal values are then analyzed for the type of nucleotide base represented by the signal to obtain sequencing information. For each residue of the target DNA, the common base in the three signals was identified and listed in the following table as the nucleotide residue at that position.
Table 10: sequencing results before correction
When the three sets of signals are resolved into a common base at each position, several positions have no common base. This indicates that errors have occurred in the sequencing signals. In this example, changing the 2nd value of signal y from 4 to 3 and the 6th value of signal x from 3 to 4 changes the signals as shown in the following table.
Table 11: corrected sequencing results
In the above table, "the 2nd value of signal y is changed from 4 to 3" is shown as an R with a strikethrough, and "the 6th value of signal x is changed from 3 to 4" is shown as an added M (in italics and underlined). After these two modifications, the three sets of signals have a common base at every position, and the sequence formed by the common bases is the DNA sequence to be determined. This result indicates that by "encoding" the DNA with degenerate indicators (e.g., M, K, R, Y, W, S, B and D), the method can effectively detect errors that occur during sequencing, while "decoding" the sequence can effectively correct these errors. The short sequence of this example is sufficient to illustrate the error correction method provided by the present disclosure. The modification used in this example is the one with the minimum change, and it is also the one that gives the simplest match for the subsequent sequence. In practice, a mathematical model may be constructed to find this modification. In a practical algorithm, all possible modifications are evaluated statistically according to their probabilities; after correction of the probability parameters, the modification above is the one most likely to be correct. On the one hand, this calculation is a simple application of the maximum likelihood method in a Bayesian framework. On the other hand, the calculation uses conventional mathematical methods.
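The detection and correction steps of this example can be reproduced with a short Python sketch. It uses the three signal sets of this example for SEQ ID NO: 2; the helper names (`expand`, `common_bases`, `decode`) are illustrative, and the two corrections are entered by hand rather than found by an algorithm.

```python
# Degenerate codes used in the 2+2 scheme and the bases each one stands for.
CODES = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}


def expand(signal, first, second):
    """Turn run lengths into a string of degenerate letters, alternating first/second."""
    letters = (first, second)
    return "".join(letters[i % 2] * n for i, n in enumerate(signal))


def common_bases(*encoded):
    """Intersect the degenerate letters position by position; '?' marks a conflict."""
    length = max(len(e) for e in encoded)
    result = []
    for pos in range(length):
        candidates = set("ACGT")
        for e in encoded:
            # A signal that is too short contributes no base at this position.
            candidates &= set(CODES[e[pos]]) if pos < len(e) else set()
        result.append(candidates.pop() if len(candidates) == 1 else "?")
    return "".join(result)


def decode(x, y, z):
    return common_bases(expand(x, "K", "M"), expand(y, "Y", "R"), expand(z, "W", "S"))


x = [2, 3, 3, 1, 1, 3, 2, 1, 2, 1]       # odd reactions K (G/T), even reactions M (A/C)
y = [1, 4, 4, 2, 2, 1, 1, 4, 1, 1]       # odd reactions Y (C/T), even reactions R (A/G)
z = [1, 1, 2, 1, 4, 3, 1, 3, 1, 1, 2]    # odd reactions W (A/T), even reactions S (C/G)

before = decode(x, y, z)
x[5], y[1] = 4, 3                        # the two corrections described in the text
after = decode(x, y, z)
print(before)   # conflicts appear as '?'
print(after)
```

Before correction, the three signals have different total lengths (19, 21 and 20 bases), which is itself a sign that an error is present.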
By encoding and decoding the DNA sequence, the method effectively improves sequencing accuracy when applied to DNA sequencing signals. For decoding, the sequencing signal is represented as a weighted graph, as shown in FIG. 1. A weighted graph is denoted G(V, E, W), where V is the set of nodes of the graph, E is the set of edges, and W is the weight (e.g., a real number) of each edge. The encoding and decoding process is explained below, where the i-th sequencing signal is denoted a_i.
1) For each signal a_i, if the nucleotides provided in the i-th sequencing reaction are X, a_i nodes are drawn, each representing an X base.
2) These a_i nodes are connected in order: the 1st node points to the 2nd, the 2nd points to the 3rd, and so on.
3) The last of these nodes has a loop pointing to itself.
4) All nodes of the i-th reaction point to the first node of the (i+1)-th reaction.
5) Weights are assigned to all edges according to statistics from a large amount of sequencing data.
If a DNA sequence is sequenced once with M/K, R/Y, W/S combinations, 3 sequencing signals are obtained. These 3 sequencing signals were each mapped as described above, as shown in FIG. 1.
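Construction rules 1) to 4) can be sketched in Python as follows. The data layout (nodes as `(reaction index, position in run)` tuples, edges as a dictionary) and the function name `build_signal_graph` are illustrative choices. The uniform edge weight is a placeholder assumption: rule 5) derives the weights from sequencing statistics that are not given here.

```python
def build_signal_graph(signal, default_weight=1.0):
    """Build the directed weighted graph for one run-length signal.

    Nodes are (reaction index, position within the run); edges map
    (source, target) -> weight. Uniform weights are a placeholder assumption.
    """
    nodes, edges = [], {}
    for i, count in enumerate(signal):
        # Rule 1: a_i nodes for the i-th reaction.
        run = [(i, j) for j in range(count)]
        nodes.extend(run)
        # Rule 2: chain the nodes of one reaction in order.
        for a, b in zip(run, run[1:]):
            edges[(a, b)] = default_weight
        # Rule 3: the last node of the run has a loop pointing to itself.
        if run:
            edges[(run[-1], run[-1])] = default_weight
        # Rule 4: every node of reaction i points to the first node of reaction i+1.
        if i + 1 < len(signal) and signal[i + 1] > 0:
            for node in run:
                edges[(node, (i + 1, 0))] = default_weight
    return nodes, edges


nodes, edges = build_signal_graph([2, 3, 3, 1, 1, 3, 2, 1, 2, 1])  # M/K signal
print(len(nodes), len(edges))
```

In such a graph, leaving a run early through a rule 4) edge corresponds to a shorter run than measured, and taking the rule 3) loop corresponds to a longer one, so each path spells out one candidate sequence.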
The three sets of signals for the sequence 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2), including errors, are:
M/K: 2, 3, 3, 1, 1, 3, 2, 1, 2, 1
R/Y: 1, 4, 4, 2, 2, 1, 1, 4, 1, 1
W/S: 1, 1, 2, 1, 4, 3, 1, 3, 1, 1, 2
A path in a directed weighted graph is defined as a sequence of nodes v_1 v_2 ... v_n of the graph. The nodes may all be different, or some of them may be the same (e.g., v_1 and v_2 may represent the same node). For any two adjacent nodes v_i and v_(i+1) in the sequence, the graph contains a directed edge from v_i to v_(i+1). The weight of a path is defined as the sum of the weights of all edges in the path. If the sequencing signals are represented as weighted graphs, each path in a graph represents a possible DNA sequence. Signal decoding finds the largest common path among all the graphs. Specific implementation methods include exhaustive search, greedy methods, dynamic programming, heuristic search and the like.
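As an illustration of the exhaustive approach, the following Python sketch searches for the smallest set of single-unit changes to the three signals that yields a common base at every position. It is a simplified stand-in for the common-path search: every change of one unit is treated as equally costly (an assumption replacing the statistical edge weights), runs are not allowed to shrink to zero, and the names `decode` and `exhaustive_decode` are illustrative.

```python
from itertools import combinations

PAIRS = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}
ORDER = [("K", "M"), ("Y", "R"), ("W", "S")]  # codes of odd/even reactions per round


def decode(signals):
    """Return the common-base sequence, or None if any position conflicts."""
    rows = ["".join(codes[i % 2] * n for i, n in enumerate(sig))
            for sig, codes in zip(signals, ORDER)]
    if len({len(r) for r in rows}) != 1:
        return None
    seq = []
    for letters in zip(*rows):
        common = set.intersection(*(set(PAIRS[c]) for c in letters))
        if len(common) != 1:
            return None
        seq.append(common.pop())
    return "".join(seq)


def exhaustive_decode(signals, max_changes=2):
    """Try 0, 1, ..., max_changes unit edits; return all fixes of minimal size."""
    edits = [(s, i, d) for s, sig in enumerate(signals)
             for i in range(len(sig)) for d in (-1, 1)]
    for k in range(max_changes + 1):
        found = []
        for combo in combinations(edits, k):
            trial = [list(sig) for sig in signals]
            for s, i, d in combo:
                trial[s][i] += d
            if any(v < 1 for sig in trial for v in sig):
                continue  # a run cannot shrink to zero in this sketch
            seq = decode(trial)
            if seq is not None:
                found.append((combo, seq))
        if found:
            return found
    return []


signals = [
    [2, 3, 3, 1, 1, 3, 2, 1, 2, 1],       # M/K
    [1, 4, 4, 2, 2, 1, 1, 4, 1, 1],       # R/Y
    [1, 1, 2, 1, 4, 3, 1, 3, 1, 1, 2],    # W/S
]
result = exhaustive_decode(signals)
for combo, seq in result:
    print(combo, seq)
```

Each edit is reported as (signal index, value index, change) with zero-based indices, so `(0, 5, 1)` means that the 6th value of the M/K signal is increased by one. Exhaustive search grows combinatorially with the number of changes allowed, which is why dynamic programming or heuristic search would be preferred for longer reads.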
Example 5: detecting and/or correcting sequencing errors
5000 DNA sequences, each 400 bp long, were sequenced and decoded according to the method described in Example 4; the sequences were divided into 5 groups of 1000 each. The accuracy of the encoded signals and the accuracy after decoding with the correction method of Example 4 are summarized in the following table.
Table 12: accuracy of sequencing
| Group | Encoding accuracy | Post-decoding accuracy |
| 1 | 0.9736 | 0.9917 |
| 2 | 0.9813 | 0.9951 |
| 3 | 0.9878 | 0.9977 |
| 4 | 0.9953 | 0.9997 |
| 5 | 0.9973 | 0.9999 |
The encoding-decoding method provided herein clearly improves sequencing accuracy. For example, when the error rate before correction is 0.0264 (an accuracy of 0.9736), the error rate after correction is 0.0083 (an accuracy of 0.9917). When the error rate before correction is 0.0047, the error rate after correction is 0.0003. Comparing these two groups, a 5.62-fold reduction in the error rate before correction (0.0264 divided by 0.0047) corresponds to a 27.7-fold reduction in the error rate after correction (0.0083 divided by 0.0003). The data as a whole show a clear trend: the lower the raw sequencing error rate, the larger the relative reduction achieved by correction. In other words, any minor improvement to the sequencing method that lowers the raw error rate yields a more significant reduction in the error rate of the corrected data when the correction methods disclosed herein are used.
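The error rates and fold reductions can be recomputed directly from the accuracies in Table 12. The following Python sketch does this for groups 1 and 4; the function names are illustrative.

```python
# Accuracies from Table 12: (before decoding, after decoding) for each group.
table_12 = {
    1: (0.9736, 0.9917),
    2: (0.9813, 0.9951),
    3: (0.9878, 0.9977),
    4: (0.9953, 0.9997),
    5: (0.9973, 0.9999),
}


def error_rate(accuracy):
    return 1.0 - accuracy


def fold_reduction(group_a, group_b):
    """Ratio of the error rates of group_a to group_b, before and after decoding."""
    before = error_rate(table_12[group_a][0]) / error_rate(table_12[group_b][0])
    after = error_rate(table_12[group_a][1]) / error_rate(table_12[group_b][1])
    return before, after


before, after = fold_reduction(1, 4)
print(f"raw error falls {before:.2f}-fold, corrected error falls {after:.2f}-fold")

# Per-group gain of decoding: raw error rate divided by corrected error rate.
gains = {g: error_rate(raw) / error_rate(dec) for g, (raw, dec) in table_12.items()}
```

The per-group gain in `gains` increases from group 1 to group 5, which is the trend described above.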
The encoding accuracy and the decoded accuracy of each group are respectively counted and represented by a violin graph and a box graph, as shown in fig. 2.
According to the characteristics of the signals modified during decoding, sequences that are more likely to have been decoded correctly can be selected, further improving the decoding accuracy. For the data above, the number of signals modified during decoding was counted for each sequence; the frequency distribution histogram is shown in fig. 3. The histogram has a peak on the left and a long tail to the right of the peak. If the sequences in the long-tail region are discarded and only the sequences in the peak region are analyzed, the accuracy after decoding can be further improved by 2-10 times.
Fig. 4 shows the relationship between the number of signals in which an error occurred in encoding and the number of signals modified in decoding. The abscissa represents the number of signals in which an error occurred in encoding, and the ordinate represents the number of signals modified in decoding. The gray level indicates the proportion of all sequences that fall on each point. Fig. 4 shows that in most cases, even if an error occurs in decoding, the modified signal and the signal in which the error actually occurred are very close to each other. This feature can therefore be used to judge the quality of decoding: if a signal and its adjacent signals are not modified in decoding, the base type represented by that signal has very high confidence.
Example 6: detecting and/or correcting sequencing errors
In this example, a single stranded DNA molecule to be sequenced is immobilized on a solid surface. The method of immobilization may be chemical crosslinking, molecular adsorption, or the like. The 3 'end or 5' end of the DNA may be immobilized on a surface. The DNA to be tested comprises a fixed fragment with a known sequence, and can be complementarily hybridized with a sequencing primer. The sequence of the region from the 3 'end of the fragment to the 3' end of the test DNA is the region of the test sequence. In this example, the sequence to be sequenced was 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2).
First, a sequencing primer is hybridized to the fragment of known sequence in the target DNA. Four types of dNTPs are added to the reaction system along with the corresponding reaction buffer, enzyme and metal ions. The 3' end of each type of dNTP is blocked by a chemical group. In addition, dGTP and dTTP are labeled with a fluorescent group of one color, and dATP and dCTP are labeled with a fluorescent group of another color. In the reaction, the dNTP that pairs complementarily with the base at the position to be extended on the template DNA is incorporated into the nascent DNA strand by a DNA polymerase. After the reaction is complete, the residual dNTPs are washed off and the fluorescence signal is recorded using a CCD. The above reaction is repeated to obtain a set of sequencing signal values: x = KKMMMKKKMKMMMKKMKKM.
The nascent DNA strand synthesized in the above sequencing reaction is then melted and washed away, for example using high temperature or a denaturant (e.g., urea or formamide). The sequencing primers are rehybridized to the DNA template and the sequencing process is repeated, except that dCTP and dTTP are labeled with a fluorophore of one color, while dATP and dGTP are labeled with a fluorophore of another color. This gives a second set of sequencing signal values: y = YRRRRYYYYRRYYRYRRRRYR.
The nascent DNA strand is again melted and washed away in the same way. The sequencing primers are rehybridized to the DNA template and the sequencing process is repeated, except that dATP and dTTP are labeled with a fluorophore of one color, and dCTP and dGTP are labeled with a fluorophore of another color. This gives a third set of sequencing signal values: z = WSWWSWWWWSSSWSSSWSWW.
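For comparison, the error-free signals expected for SEQ ID NO: 2 under the three labeling schemes can be computed by encoding the known sequence. The Python sketch below does this and compares the result with the three signals reported in this example; the names `GROUPS` and `encode` are illustrative.

```python
# Labeling scheme of each round: degenerate letter -> bases sharing one color.
GROUPS = {
    "x": {"K": "GT", "M": "AC"},
    "y": {"Y": "CT", "R": "AG"},
    "z": {"W": "AT", "S": "CG"},
}


def encode(sequence, groups):
    """Replace each base by the degenerate letter of the color group it belongs to."""
    lookup = {base: letter for letter, bases in groups.items() for base in bases}
    return "".join(lookup[base] for base in sequence)


reference = "TGAACTTTAGCCACGGAGTA"  # SEQ ID NO: 2
observed = {
    "x": "KKMMMKKKMKMMMKKMKKM",
    "y": "YRRRRYYYYRRYYRYRRRRYR",
    "z": "WSWWSWWWWSSSWSSSWSWW",
}

for name, groups in GROUPS.items():
    expected = encode(reference, groups)
    status = "matches" if expected == observed[name] else "differs"
    print(name, expected, len(expected), len(observed[name]), status)
```

Only signal z matches its error-free encoding. Signal x is one letter short and signal y is one letter long, which is what produces the positions without a common base.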
The sequencing signal values are then analyzed for the type of nucleotide base represented by the signal to obtain sequencing information. For each residue of the target DNA, the common base in the three signals was identified and listed in the following table as the nucleotide residue at that position.
Table 13: sequencing results before correction
| Signal x | K | K | M | M | M | K | K | K | M | K | M | M | M | K | K | M | K | K | M | | |
| Signal y | Y | R | R | R | R | Y | Y | Y | Y | R | R | Y | Y | R | Y | R | R | R | R | Y | R |
| Signal z | W | S | W | W | S | W | W | W | W | S | S | S | W | S | S | S | W | S | W | W | |
| Common base | T | G | A | A | ? | T | T | T | ? | G | ? | C | ? | G | ? | ? | ? | G | A | ? | ? |
When the three sets of signals are resolved into a common base at each position, several positions have no common base. This indicates that errors have occurred in the sequencing signals. In this example, changing the 2nd value of signal y from 4 to 3 and the 6th value of signal x from 3 to 4 changes the signals as shown in the following table.
Table 14: corrected sequencing results
In the above table, "the 2nd value of signal y is changed from 4 to 3" is shown as an R with a strikethrough, and "the 6th value of signal x is changed from 3 to 4" is shown as an added M (in italics and underlined). After these two modifications, the three sets of signals have a common base at every position, and the sequence formed by the common bases is the DNA sequence to be determined. This result indicates that by "encoding" the DNA with degenerate indicators (e.g., M, K, R, Y, W, S, B and D), the method can effectively detect errors that occur during sequencing, while "decoding" the sequence can effectively correct these errors.
Example 7: detecting and/or correcting sequencing errors
In this example, the DNA to be tested comprises a fixed fragment of known sequence that hybridizes complementary to the sequencing primer. The sequence of the region from the 3 'end of the fragment to the 3' end of the test DNA is the region of the test sequence. In this example, the sequence to be sequenced was 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2).
First, a sequencing primer is hybridized to the fragment of known sequence in the target DNA. The reaction mixture containing the template DNA with the hybridized sequencing primer is divided into three portions, which can be measured in parallel or sequentially. The four types of dNTPs, some types of ddNTPs, and the enzymes and buffers necessary for the DNA synthesis reaction are added to each portion. In some aspects, the dNTPs added are native dNTPs and the ddNTPs added carry a detectable label (e.g., a label that can be detected by an instrument), including but not limited to radioisotope labels, chemiluminescent labels, and the like. In the first portion, ddGTP and ddTTP carry the same label, while ddATP and ddCTP carry another label. In the second portion, ddCTP and ddTTP carry the same label, and ddATP and ddGTP carry another label. In the third portion, ddATP and ddTTP carry the same label, and ddCTP and ddGTP carry another label.
All three are reacted under suitable conditions for a period of time during which the DNA synthesis reaction takes place. After the reaction is complete, the reaction product may optionally be washed or purified. Then, three reaction products may be subjected to DNA electrophoresis. According to the electrophoresis band, three sequencing signals can be obtained respectively:
x=KKMMMKKKMKMMMKKMKKM
y=YRRRRYYYYRRYYRYRRRRYR
z=WSWWSWWWWSSSWSSSWSWW
The sequencing signal values are then analyzed for the type of nucleotide base represented by the signal to obtain sequencing information. For each residue of the target DNA, the common base in the three signals was identified and listed in the table below as the nucleotide residue at that position.
Table 15: sequencing results before correction
| Signal x | K | K | M | M | M | K | K | K | M | K | M | M | M | K | K | M | K | K | M | ||
| Signal y | Y | R | R | R | R | Y | Y | Y | Y | R | R | Y | Y | R | Y | R | R | R | R | Y | R |
| Signal z | W | S | W | W | S | W | W | W | W | S | S | S | W | S | S | S | W | S | W | W | |
| Common base | T | G | A | A | ? | T | T | T | ? | G | ? | C | ? | G | ? | ? | ? | G | A | ? | ? |
When the three sets of signals are resolved into common bases position by position, several positions have no common base. This indicates that an error has occurred in the sequencing signals. In this embodiment, when the 2nd value of signal y is changed from 4 to 3 and the 6th value of signal x is changed from 3 to 4, the signals change as shown in the following table.
Table 16: corrected sequencing results
In the above table, the change of the 2nd value of signal y from 4 to 3 is shown as an R with a strikethrough, and the change of the 6th value of signal x from 3 to 4 is shown as an added M (in italics and underlined). After the two modifications, the three sets of signals share a common base at every position, and the sequence formed by the common bases is the DNA sequence to be determined. This result indicates that by "encoding" the DNA with degenerate indicators (e.g., M, K, R, Y, W, S, B and D), the method can effectively detect errors that occur during sequencing, while "decoding" the sequence can effectively correct these errors.
Example 8: sequencing by the "2+2 two-color three-round" method
In this example, a single stranded DNA molecule to be sequenced is immobilized on a solid surface. The method of immobilization may be chemical crosslinking, molecular adsorption, or the like. The 3 'end or 5' end of the DNA may be immobilized on a surface. The DNA to be tested comprises a fixed fragment with a known sequence, and can be complementarily hybridized with a sequencing primer. The sequence of the region from the 3 'end of the fragment to the 3' end of the test DNA is the region of the test sequence. In this example, the sequence to be sequenced was 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2).
First, a sequencing primer is hybridized to a fragment having a known sequence of the target DNA. dG4P and dT4P (each labeled with a fluorophore emitting a different color, such as fluorophore X and group Y), along with the corresponding reaction buffer, enzyme, and metal ion, were added to the reaction system to initiate a sequencing reaction that produces a fluorescent signal. The signal is collected by a CCD. The values of these fluorescence signals were recorded. The reaction was recorded as the first reaction.
Then, the reaction residues dG4P and dT4P were washed away. Then, dA4P and dC4P (each labeled with a fluorescent group emitting a different color, e.g., fluorescent group X and group Y) were added to the reaction system to initiate the same sequencing reaction as described above, and the value of the fluorescent signal was recorded. This reaction should be recorded as the second reaction.
The above process is repeated. dG4P and dT4P were added for odd-numbered reactions and dA4P and dC4P were added for even-numbered reactions. The two types of dN4P added in each reaction were labeled with fluorophores of different colors. A set of signal values is obtained: x = (1G+1T, 2A+1C, 0G+3T, 1A+0C, 1G+0T, 1A+2C, 2G+0T, 1A+0C, 1G+1T, 1A+0C).
Next, the nascent DNA strand synthesized in the above-described sequencing reaction is melted and washed away using high temperature or strongly hydrophilic substances (e.g., urea and formamide). The sequencing primer was rehybridized to the template DNA. dC4P and dT4P were added for odd-numbered reactions and dA4P and dG4P were added for even-numbered reactions. The two types of dN4P added in each reaction were labeled with fluorophores of different colors. A set of signal values is obtained: y = (0C+1T, 3A+1G, 1C+3T, 1A+1G, 2C+0T, 1A+0G, 1C+0T, 1A+3G, 0C+1T, 1A+0G).
Next, the nascent DNA strand synthesized in the above-described sequencing reaction is again melted and washed away using high temperature or strongly hydrophilic substances (e.g., urea and formamide). The sequencing primer was rehybridized to the template DNA. dA4P and dT4P were added for odd-numbered reactions and dC4P and dG4P were added for even-numbered reactions, and the two types of dN4P added in each reaction were labeled with fluorophores of different colors. A set of sequencing signals is obtained: z = (0A+1T, 0C+1G, 2A+0T, 1C+0G, 1A+3T, 2C+1G, 1A+0T, 0C+1G, 1A+1T).
This method is referred to as the "2+2 two-color" sequencing method. The sequence can be obtained from the data of any two of its sequencing rounds, so the rounds can be regarded as orthogonal sequencing results.
The sequencing signal values are then analyzed for the type of nucleotide base represented by the signal to obtain sequencing information. For each residue of the target DNA, the common base in the three signals was identified and listed in the following table as the nucleotide residue at that position.
Table 17: sequencing results before correction
When the three sets of signals are resolved into common bases position by position, several positions have no common base, from which it is concluded that an error has occurred in the sequencing signals. When the 2nd value of signal y (3A+1G) is changed to (2A+1G) and the 6th value of signal x (1A+2C) is changed to (1A+3C), the signals change as shown in the following table.
Table 18: corrected sequencing results
In the above table, the change of the 2nd value of signal y from (3A+1G) to (2A+1G) is shown as an A with a strikethrough, and the change of the 6th value of signal x from (1A+2C) to (1A+3C) is shown as an added C (in italics and underlined). After the two modifications, the three sets of signals share a common base at every position, and the sequence formed by the common bases is the DNA sequence to be determined. This result indicates that by "encoding" the DNA with degenerate indicators (e.g., M, K, R, Y, W, S, B and D), the method can effectively detect errors that occur during sequencing, while "decoding" the sequence can effectively correct these errors.
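To make the two-color signal notation concrete, the following Python sketch computes the error-free signal that a known sequence would give for a chosen pair of odd-round nucleotides. It is only an illustration under stated assumptions: the function name `two_color_signal` is hypothetical, extension is assumed to run to completion in every reaction, and the sequence is read in the 5' to 3' order given for SEQ ID NO: 2.

```python
from itertools import groupby

SEQ = "TGAACTTTAGCCACGGAGTA"  # SEQ ID NO: 2 as given in this example

def two_color_signal(seq, odd_pair):
    """Ideal per-reaction counts, e.g. '2A+1C', for a 2+2 two-color round.

    odd_pair: the two bases added in odd-numbered reactions (e.g. "GT");
    the remaining two bases are added in even-numbered reactions.
    """
    even_pair = "".join(b for b in "ACGT" if b not in odd_pair)

    def pair_of(base):
        return odd_pair if base in odd_pair else even_pair

    reactions = []
    if seq and seq[0] not in odd_pair:
        # The first (odd) reaction extends nothing and gives a zero signal.
        reactions.append("+".join(f"0{b}" for b in sorted(odd_pair)))
    for pair, run in groupby(seq, key=pair_of):
        run = list(run)
        reactions.append("+".join(f"{run.count(b)}{b}" for b in sorted(pair)))
    return reactions

x_ideal = two_color_signal(SEQ, "GT")
y_ideal = two_color_signal(SEQ, "CT")
z_ideal = two_color_signal(SEQ, "AT")
print(x_ideal[5], y_ideal[1])  # 6th value of x and 2nd value of y
```

For this sequence the ideal 6th value of x is 1A+3C and the ideal 2nd value of y is 2A+1G, which are the two values the correction step restores.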
Example 9: sequencing by the "2+2 two-color two-round" method
In this example, a single stranded DNA molecule to be sequenced is immobilized on a solid surface. The method of immobilization may be chemical crosslinking, molecular adsorption, or the like. The 3 'end or 5' end of the DNA may be immobilized on a surface. The DNA to be tested comprises a fixed fragment with a known sequence, and can be complementarily hybridized with a sequencing primer. The sequence of the region from the 3 'end of the fragment to the 3' end of the test DNA is the region of the test sequence. In this example, the sequence to be sequenced was 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2).
First, a sequencing primer is hybridized to a fragment having a known sequence of the target DNA. dG4P and dT4P (each labeled with a fluorophore emitting a different color, such as fluorophore X and group Y), along with the corresponding reaction buffer, enzyme, and metal ion, were added to the reaction system to initiate a sequencing reaction that produces a fluorescent signal. The signal is collected by a CCD. The values of these fluorescence signals were recorded. The reaction was recorded as the first reaction.
Then, the reaction residues dG4P and dT4P were washed away. Then, dA4P and dC4P (each labeled with a fluorescent group emitting a different color, e.g., fluorescent group X and group Y) were added to the reaction system to initiate the same sequencing reaction as described above, and the value of the fluorescent signal was recorded. This reaction should be recorded as the second reaction.
The above process is repeated. dG4P and dT4P were added for odd-numbered reactions and dA4P and dC4P were added for even-numbered reactions. The two types of dN4P added in each reaction were labeled with fluorophores of different colors. A set of signal values is obtained: x = (1G+1T, 2A+1C, 0G+3T, 1A+0C, 1G+0T, 1A+2C, 2G+0T, 1A+0C, 1G+1T, 1A+0C).
Next, the nascent DNA strand synthesized in the above-described sequencing reaction is melted and washed away using high temperature or strongly hydrophilic substances (e.g., urea and formamide). The sequencing primer was rehybridized to the template DNA. dC4P and dT4P were added for odd-numbered reactions and dA4P and dG4P were added for even-numbered reactions. The two types of dN4P added in each reaction were labeled with fluorophores of different colors. A set of signal values is obtained: y = (0C+1T, 3A+1G, 1C+3T, 1A+1G, 2C+0T, 1A+0G, 1C+0T, 1A+3G, 0C+1T, 1A+0G).
The sequencing signal values are then analyzed for the type of nucleotide base represented by the signal to obtain sequencing information. For each residue of the target DNA, the common base in the two signals was identified and listed in the following table as the nucleotide residue at that position.
Table 19: sequencing results before correction
When the two sets of signals are resolved into common bases position by position, several positions have no common base, from which it is concluded that an error has occurred in the sequencing signals. When the 2nd value of signal y (3A+1G) is changed to (2A+1G) and the 6th value of signal x (1A+2C) is changed to (1A+3C), the signals change as shown in the following table.
Table 20: corrected sequencing results
In the above table, the change of the 2nd value of signal y from (3A+1G) to (2A+1G) is shown as an A with a strikethrough, and the change of the 6th value of signal x from (1A+2C) to (1A+3C) is shown as an added C (in italics and underlined). After the two modifications, the two sets of signals share a common base at every position, and the sequence formed by the common bases is the DNA sequence to be determined. This result indicates that by "encoding" the DNA with degenerate indicators (e.g., M, K, R, Y, W, S, B and D), the method can effectively detect errors that occur during sequencing, while "decoding" the sequence can effectively correct these errors.
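A two-round decode can be sketched as follows. This is a hedged illustration, not the procedure claimed in this example: the names `expand` and `decode_two_rounds` are hypothetical, only the total count of each reaction is used (the per-color split is ignored), and a simple length comparison stands in for a full consistency check.

```python
# Base shared by one code from round x (K = G/T, M = A/C)
# and one code from round y (Y = C/T, R = A/G).
SHARED = {("K", "Y"): "T", ("K", "R"): "G", ("M", "R"): "A", ("M", "Y"): "C"}

def expand(totals, odd_code, even_code):
    """Turn per-reaction totals into a string of degenerate codes."""
    return "".join(
        (odd_code if i % 2 == 0 else even_code) * n for i, n in enumerate(totals)
    )

def decode_two_rounds(x_totals, y_totals):
    """Return the decoded sequence, or None if the two rounds disagree in length."""
    x = expand(x_totals, "K", "M")
    y = expand(y_totals, "Y", "R")
    if len(x) != len(y):
        return None  # at least one reaction total must be wrong
    return "".join(SHARED[(a, b)] for a, b in zip(x, y))

# Totals per reaction taken from the signals x and y listed above.
x_raw = [2, 3, 3, 1, 1, 3, 2, 1, 2, 1]
y_raw = [1, 4, 4, 2, 2, 1, 1, 4, 1, 1]
print(decode_two_rounds(x_raw, y_raw))

# After the two corrections described in the text.
x_fixed = [2, 3, 3, 1, 1, 4, 2, 1, 2, 1]
y_fixed = [1, 3, 4, 2, 2, 1, 1, 4, 1, 1]
print(decode_two_rounds(x_fixed, y_fixed))
```

Equal lengths alone do not prove that both rounds are error-free. With only two rounds there is no third signal to intersect against, so the per-color counts in each reaction are the natural additional check on the decoded sequence.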
Example 10: sequencing by the "1+3 monochromatic" method
In this example, a single stranded DNA molecule to be sequenced is immobilized on a solid surface. The method of immobilization may be chemical crosslinking, molecular adsorption, or the like. The 3 'end or 5' end of the DNA may be immobilized on a surface. The DNA to be tested comprises a fixed fragment with a known sequence, and can be complementarily hybridized with a sequencing primer. The sequence of the region from the 3 'end of the fragment to the 3' end of the test DNA is the region of the test sequence. In this example, the sequence to be sequenced was 5'-TGAACTTTAGCCACGGAGTA-3' (SEQ ID NO: 2).
First, a sequencing primer is hybridized to a fragment having a known sequence of the target DNA. dC4P, dG4P and dT4P, as well as the corresponding reaction buffer, enzyme and metal ions, are added to the reaction system to initiate the sequencing reaction that generates a fluorescent signal. The signal is collected by a CCD. The values of these fluorescence signals were recorded. The reaction was recorded as the first reaction.
Then, the reaction residue dC4P, dG4P and dT4P were washed away. Then, dA4P was added to the reaction system, the same sequencing reaction as described above was initiated, and the value of the fluorescence signal was recorded. This reaction should be recorded as the second reaction.
The above process is repeated. dC4P, dG4P and dT4P were added for odd-numbered reactions and dA4P was added for even-numbered reactions. A set of signal values is obtained: x = (2, 2, 4, 1, 3, 1, 3, 1, 2, 1).
Next, the nascent DNA strand synthesized in the above-described sequencing reaction is melted and washed away using high temperature or strongly hydrophilic substances (e.g., urea and formamide). The sequencing primer was rehybridized to the template DNA. dA4P, dG4P and dT4P were added for odd-numbered reactions and dC4P was added for even-numbered reactions. A set of signal values is obtained: y = (4, 1, 6, 2, 1, 1, 6).
The nascent strand is again melted and washed away in the same way, and the sequencing primer was rehybridized to the template DNA. dA4P, dC4P and dT4P were added for odd-numbered reactions and dG4P was added for even-numbered reactions. A set of signal values is obtained: z = (1, 1, 7, 1, 4, 2, 1, 1, 2).
The nascent strand is again melted and washed away in the same way, and the sequencing primer was rehybridized to the template DNA. dT4P was added for odd-numbered reactions and dA4P, dC4P and dG4P were added for even-numbered reactions. A set of signal values is obtained: w = (1, 4, 3, 9, 1, 1).
The sequencing signal values are then analyzed for the type of nucleotide base represented by the signal to obtain sequencing information. For each residue of the target DNA, the common base in the four signals was identified and listed in the following table as the nucleotide residue at that position.
Table 21: sequencing results before correction
| Signal x | B | B | A | A | B | B | B | B | A | B | B | B | A | B | B | B | A | B | B | A | |
| Signal y | D | D | D | D | C | D | D | D | D | D | D | C | C | D | C | D | D | D | D | D | D |
| Signal z | H | G | H | H | H | H | H | H | H | G | H | H | H | H | G | G | H | G | H | H | |
| Signal w | T | V | V | V | V | T | T | T | V | V | V | V | V | V | V | V | V | T | V | ||
| Common base | T | G | A | A | C | T | T | T | A | G | ? | C | ? | ? | ? | G | A | ? | ? | ? | ? |
When the four sets of signals are resolved into common bases position by position, several positions have no common base, from which it is concluded that an error has occurred in the sequencing signals. When the third value of signal y is changed from 6 to 5 and the fourth value of signal w is changed from 9 to 10, the signals change as shown in the following table.
Table 22: corrected sequencing results
In the above table, the change of the third value of signal y from 6 to 5 is shown as a D with a strikethrough, and the change of the fourth value of signal w from 9 to 10 is shown as an added V (in italics and underlined). After the two modifications, the four sets of signals share a common base at every position, and the sequence formed by the common bases is the target DNA sequence to be determined. This result indicates that by "encoding" the DNA with degenerate indicators (e.g., M, K, R, Y, W, S, B and D), the method can effectively detect errors that occur during sequencing, while "decoding" the sequence can effectively correct these errors.
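The run lengths expected from an error-free "1+3" round can be computed directly from a known sequence, which makes it easy to see where the measured signals of this example deviate. The sketch below is illustrative only: `run_lengths` and `mismatches` are hypothetical names, and the measured lists for y and w are the run lengths read from the rows of Table 21.

```python
from itertools import groupby

SEQ = "TGAACTTTAGCCACGGAGTA"  # SEQ ID NO: 2 as given in this example

def run_lengths(seq, odd_bases):
    """Ideal signal of a monochromatic round: bases extended per reaction.

    odd_bases: bases supplied in odd-numbered reactions; all other bases
    are supplied in even-numbered reactions.
    """
    lengths = [len(list(run)) for _, run in groupby(seq, key=lambda b: b in odd_bases)]
    if seq and seq[0] not in odd_bases:
        lengths.insert(0, 0)  # the first (odd) reaction extends nothing
    return lengths

def mismatches(measured, ideal):
    """1-based positions at which a measured signal differs from the ideal one."""
    return [i + 1 for i, (m, e) in enumerate(zip(measured, ideal)) if m != e]

ideal = {
    "x": run_lengths(SEQ, "CGT"),
    "y": run_lengths(SEQ, "AGT"),
    "z": run_lengths(SEQ, "ACT"),
    "w": run_lengths(SEQ, "T"),
}
measured_y = [4, 1, 6, 2, 1, 1, 6]
measured_w = [1, 4, 3, 9, 1, 1]
print(mismatches(measured_y, ideal["y"]), mismatches(measured_w, ideal["w"]))
```

In practice the sequence is unknown and the errors are located by the missing common bases; the comparison against a known sequence is used here only to show that the third value of y and the fourth value of w are the ones that differ.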
Example 11: method for detecting and/or correcting sequencing errors
Section 1: substrate synthesis and spectral characterization
General aspects: All anhydrous solvents were freshly distilled by the general procedure (Na or CaH2). Unless otherwise indicated, reagents were used as received from commercial suppliers. Air- and/or moisture-sensitive experiments were performed under an argon atmosphere. Mass spectrometry was performed using a Bruker APEX IV mass spectrometer and an AB Sciex MALDI-TOF 5800 spectrometer. Reverse-phase HPLC was performed on a Shimadzu LC-20A HPLC system. Samples were dissolved in water and analyzed on an Inertsil ODS-3 C18 column (250 × 4.6 mm, 5 μm) at a flow rate of 1 mL/min with a gradient of B (CH3CN) in A (50 mM TEAA, pH 7.3): 0-20% B over 15 min, then 20-30% B over 10 min.
1.1 Synthesis of terminal phosphate-labeled fluorescent nucleotides (TPLFN)
Figures 5A-C show that the fluorescence properties of TPLFN are improved by changing the fluorophore structure. FIG. 5A shows a previously developed Me-FAM-labeled nucleotide. FIG. 5B shows the previously developed Me-HCF-labeled nucleotides. FIG. 5C shows TG-labeled nucleotides in this example.
For fluorogenic sequencing, the fluorophore used to label the terminal phosphate of the nucleotide plays a key role. On the one hand, the phosphorylated fluorophore must be completely quenched, meaning that no fluorescence emission is detected at the chosen excitation wavelength. On the other hand, once the fluorophore is released, strong fluorescence emission is required for adequate signal detection. According to this principle, Me-FAM was chosen as the labeling dye in a previous report (FIG. 5A; see Sims, P.A.; Greenleaf, W.J.; Duan, H.; Xie, X. "Fluorogenic DNA sequencing in PDMS microreactors", Nature Methods 2011, 8, 575-). Subsequently, the chlorinated form of Me-FAM, known as Me-HCF, was developed; it shows significant red shifts in excitation and emission wavelengths and is suitable for multicolor sequencing (FIG. 5B; Chen, Z.; Duan, H.; Qiao, S.; Zhou, W.; Qiu, H.; Kang, L.; Xie, X.; Huang, Y. "Fluorogenic sequencing using halogen-fluorescein-labeled nucleotides", ChemBioChem, 2015, DOI: 10.1002/cbic.201500117). Despite these successful applications, the fluorescence properties of Me-FAM and Me-HCF (derived from FAM and HCF by 3'-OH methylation) remain problematic, as shown by the parameters set forth in FIG. 5. Methylation of the 3'-OH (or use of another protecting group) is a prerequisite for generating a fluorogenic substrate, but it not only broadens the absorption and emission spectra, it also drastically reduces the extinction coefficient and quantum yield, especially for Me-FAM. Therefore, there is still a great need to develop fluorophores with better fluorescence properties.
TG (Tokyo Green) was developed by Nagano et al. (Y. Urano, M. Kamiya, K. Kanda, T. Ueno, K. Hirose, T. Nagano, Evolution of fluorescein as a platform for finely tunable fluorescence probes, J. Am. Chem. Soc., 2005, 127, 4888-4894). TG has been shown to have excellent fluorescence properties. The distinguishing structural feature of TG compared with 5(6)-FAM is that a methyl group replaces the carboxyl group on the benzene moiety to keep the benzene ring and the fluorophore orthogonal to each other. In addition, phosphorylated TG has also been demonstrated to have excellent fluorescence properties. Another convenient aspect is that TG has only one phenol group, compared with the two phenol groups of 5(6)-FAM or HCF, which facilitates the synthesis of TPLFN because no methylation is required. The absence of this protecting methyl group not only makes the synthesis of TPLFN easier, but also preserves the original high extinction coefficient and high quantum yield once the TG fluorophore is released by enzymatic digestion, resulting in a much higher fluorescence/background contrast. The detailed synthetic procedure is as follows:
(I) preparation of TG-monophosphoric acid (S2)
Tokyo Green S1 [Y. Urano, M. Kamiya, K. Kanda, T. Ueno, K. Hirose, T. Nagano, Evolution of fluorescein as a platform for finely tunable fluorescence probes, J. Am. Chem. Soc., 2005, 127, 4888-4894].
In a flame-dried flask, S1 (332 mg, 1.00 mmol) was suspended in 15 mL of anhydrous CH2Cl2 under Ar. Proton sponge (759 mg, 3.50 mmol) was added to the suspension with stirring. After 10 min, the mixture was cooled to -10 °C and phosphorus(V) oxychloride (275 μL, 3.00 mmol) was added. The reaction was kept at the same temperature for 30 min. TEAA buffer (20 mL of a 1 M solution) was then added to quench the reaction and hydrolyze the phosphoryl chloride intermediate at 0 °C for 1 h. The two phases were then separated, and the aqueous phase was filtered and concentrated in vacuo for further purification on a reverse-phase flash LC system. Conditions: AQ C-18 column (Agela, 40 g), flow rate 20 mL/min, 0-50% acetonitrile in 50 mM triethylammonium acetate buffer (pH 7.4). The fractions containing the pure product were concentrated, coevaporated twice with anhydrous DMF (2 mL), and then dissolved in a defined amount of anhydrous DMF; the resulting monophosphate S2 (100 mM solution in DMF) was stored in a freezer at -20 °C until use. MS (ESI): calculated for C21H15O7P (M-H) 411.06; found m/z 411.21.
Synthesis of dN4P-δ-TG (TPLFN)
1) dA4P-δ-TG: 2'-deoxyadenosine-5'-triphosphate (dATP) disodium salt (12.5 μL of 100 mM solution, 12.5 μmol) was converted to the tributylammonium salt by treatment with ion exchange resin (BioRad AG-50W-XB) and tributylamine. After removal of water on a rotary evaporator using an oil pump, the resulting tributylammonium salt was coevaporated twice with anhydrous DMF (1 mL) and then dissolved in 0.5 mL of anhydrous DMF under Ar. Carbonyldiimidazole (CDI, 10.1 mg, 63 μmol) was added to the solution, and the mixture was stirred at room temperature for 12 h. MeOH (3.2 μL) was then added and the solution was stirred for 0.5 h. Then the DMF solution (0.25 mL) of TG-monophosphate tributylammonium salt S2 (25 μmol) from the previous step was transferred to the reaction with a syringe, followed by addition of MgBr2 (25 mg, 100 μmol) in DMF (0.5 mL). The mixture was stirred at room temperature for 30 h. The reaction mixture was then concentrated by oil pump, diluted with water, and purified on a C18 reverse-phase HPLC system (Shimadzu) using a prep. Sepax ethyl C18-H column (21.2 × 150 mm) at a flow rate of 5 mL/min with a gradient of B (CH3CN) in A (50 mM TEAA, pH 7.3): 0-20% B over 15 min, 20-30% B over 10 min, 30-50% B over 10 min. The desired fractions were collected and concentrated using a Hi-Trap Q-HP 5 mL anion exchange column (GE Healthcare). The collected solution containing the desired product can be purified again by HPLC using the same elution conditions and concentrated on the Hi-Trap Q-HP column. The product solution was stored at -20 °C until use. MS (MALDI-TOF): calculated for C31H31N5O18P4 895.0615; found m/z 884.1019 (M-H). dC4P-δ-TG, dT4P-δ-TG and dG4P-δ-TG were synthesized following the same procedure as dA4P-δ-TG. dC4P-δ-TG: MS (MALDI-TOF): calculated for C30H31N3O19P4 861.0502; found m/z 860.0732 (M-H). dT4P-δ-TG: MS (MALDI-TOF): calculated for C31H32N2O20P4 876.0499; found m/z 875.0706 (M-H). dG4P-δ-TG: MS (MALDI-TOF): calculated for C31H31N5O19P4 901.0564; found m/z 900.0903 (M-H). FIG. 6 shows the MALDI-TOF mass spectra of the purified TPLFN.
1.2. Spectral properties of fluorophores and TPLFN
The excitation/emission spectrum of TG (S1) is shown in FIG. 7. While Me-FAM has a maximum emission wavelength similar to that of TG, its extinction coefficient and quantum yield are much lower (FIG. 8). In addition, the broad emission spectrum of Me-FAM overlaps strongly with those of other fluorophores such as Me-HCF, making it unsuitable for future multicolor sequencing applications. In contrast, the strong fluorescence and narrower spectrum of TG make this problem easier to solve. FIG. 7 shows the excitation and emission spectra of TG (Tokyo Green). FIG. 8 shows the emission spectra of TG (Tokyo Green), Me-FAM and Me-HCF under the same conditions (2 μM, pH 8.3, TE buffer, normalized by area). The optical properties of TG, Me-FAM and Me-HCF are listed in the following table (and in FIGS. 5A-C).
Table 23
| Fluorophore | Excitation max (nm) | Emission max (nm) | Quantum yield (%) | Extinction coefficient |
| TG | 490 | 513 | 82 | 8×10^4 |
| Me-FAM | 463 | 514 | 55 | 2×10^4 |
| Me-HCF | 544 | 567 | 57 | 7×10^4 |
In the sequencing method, the substrate (TPLFN) must be non-fluorescent before incorporation by the DNA polymerase. After the polymerase extends the primer, the triphosphate still bearing the dye label is released and then hydrolyzed in the presence of a phosphatase to give the fluorescent product. FIGS. 9 and 10 show the difference in absorption and emission between the TPLFN TG-dA4P and the released TG fluorophore. As shown in FIG. 9, TG-dA4P was not digested by CIP (calf intestinal alkaline phosphatase) alone. However, once the polyphosphate chain of TG-dA4P is cleaved by the polymerase or by PDE (phosphodiesterase), the remaining TG-labeled triphosphate chain is rapidly digested, yielding free TG with its strong absorption and emission restored.
The above spectra were recorded under the following conditions:
First, the spectrum of TG-dA4P was measured at room temperature. For emission measurements, the excitation wavelength was set to 460 nm and the emission was scanned over 480-600 nm; for absorption measurements, the scan range was 310-550 nm. Then CIP and PDE were added and the spectra were recorded sequentially under the same conditions.
The stability of the TPLFN substrate in aqueous solution was also considered, since spontaneous hydrolysis of the TPLFN increases the fluorescence background during the sequencing reaction, which can interfere with the desired signal and reduce sequencing accuracy. Fortunately, the hydrolysis rate of the TPLFN substrate is very low, about 2 ppm of substrate per second when measured at 65 °C, and the resulting signal is negligible compared with that from polymerase incorporation. Nevertheless, in some aspects, the substrate solution is preferably kept on a 4 °C rack during sequencing and stored in a -20 °C freezer for long-term storage.
Section 2: polymerase kinetic study
Polymerase kinetics, including the TPLFN incorporation/misincorporation rates, homopolymer linearity, and temperature dependence, were measured using a fluorometer. FIG. 11 shows the proposed kinetic pathway of the sequencing-by-synthesis process, where S is the matched substrate (TPLFN) and S* is the mismatched substrate; E is the enzyme (polymerase) and D_N is the primer/template pair.
Although both the TPLFN and the template serve as reaction substrates, the system can be simplified to a single substrate reaction process, since the concentration of one of the substrates, TPLFN (which is in large excess compared to the primer/template), will remain nearly unchanged. This makes the analysis of the process much easier. As shown in fig. 11, the polymerase catalyzed reaction had 3 steps including: a) DNA polymerase binds to the primer/template; b) incorporation of complementary nucleotides (TPLFN); c) the nucleotides extend along the template. By varying reaction conditions such as primer/template concentration, match or mismatch type of TPLFN, and temperature, the kinetics of the polymerase used in the sequencing process can be assessed.
FIG. 12 shows the differences in Bst polymerase incorporation rate among the TPLFNs. To test and compare the reaction rates, all four TPLFNs (TG-dA4P, TG-dG4P, TG-dC4P, TG-dT4P) were adjusted to the same concentration (2.0 μM). The reactions were carried out at 65 °C with Bst (120 nM), single-base-extension primer/template (T, C, G and A, respectively, for the four TPLFNs), CIP (0.01 U) and pH 8.3 buffer, and were triggered by Mn(II) (1 mM). The labeled polyphosphate moiety released by Bst-mediated extension must be hydrolyzed by CIP to generate the fluorescent dye molecule. An excess of CIP was used, and it was confirmed that the hydrolysis was very fast and did not become a rate-determining step affecting the observed Bst reaction rate. In FIG. 12, the observed Bst incorporation rates of the four labeled nucleotides are in the order TG-dC4P > TG-dA4P > TG-dG4P > TG-dT4P.
The 4 curves in fig. 12 can be fitted to the functions in the table below. The fit results indicate that the reaction system can be considered as a first order reaction to the primer/template concentration. However, unlike running the reaction in a sample cell on a fluorometer, the actual sequencing reaction on the chip will differ slightly because all primers/templates are grafted onto the chip surface. To keep each reaction cycle for different TPLFNs completed on the same time scale, the reaction rates for the four TPLFNs can be adjusted to the same level by increasing the concentration of the slower running TPLFN.
Table 24
| Substrate | Fitting function | R^2 |
| dA4P | 9.319×10^5 × (1 - e^(-0.05242t)) | 0.9976 |
| dT4P | 8.698×10^5 × (1 - e^(-0.02616t)) | 0.9994 |
| dC4P | 8.977×10^5 × (1 - e^(-0.06189t)) | 0.9959 |
| dG4P | 8.839×10^5 × (1 - e^(-0.0405t)) | 0.9961 |
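The fitted functions in the table have the form F(t) = A × (1 - e^(-kt)), so the rate constant k alone determines how quickly each substrate approaches its plateau. The following Python sketch evaluates the four fits; the names `FITS`, `signal` and `time_to_fraction` are hypothetical, and the unit of t is not stated in the text, so the times are in whatever unit the fit used.

```python
import math

# (amplitude A, rate constant k) taken from the fitting functions in the table.
FITS = {
    "dA4P": (9.319e5, 0.05242),
    "dT4P": (8.698e5, 0.02616),
    "dC4P": (8.977e5, 0.06189),
    "dG4P": (8.839e5, 0.0405),
}

def signal(substrate, t):
    """Fitted fluorescence signal A * (1 - exp(-k t)) at time t."""
    amplitude, k = FITS[substrate]
    return amplitude * (1.0 - math.exp(-k * t))

def time_to_fraction(substrate, fraction=0.95):
    """Time at which the signal reaches the given fraction of its plateau."""
    _, k = FITS[substrate]
    return -math.log(1.0 - fraction) / k

# Substrates ordered from fastest to slowest rate constant.
ranking = sorted(FITS, key=lambda name: FITS[name][1], reverse=True)
print(ranking)
for name in ranking:
    relative = FITS[name][1] / FITS["dT4P"][1]
    print(name, round(time_to_fraction(name), 1), round(relative, 2))
```

The ranking by k reproduces the order reported for FIG. 12. The ratio of each k to that of dT4P gives a rough idea of how far apart the substrates are, although the concentration change needed to equalize the rates would have to be determined experimentally.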
In 2+2 sequencing, two different nucleotides are added to the reaction mixture together; for example, "M" means that dA4P and dC4P are added in the same cycle and "K" means that dG4P and dT4P are added in the same cycle. As described above for FIG. 11, one of the added nucleotides can act as S*, which does not extend the current template position but competes with the complementary substrate S for binding to Bst, so S* may reduce the extension rate of S. Substrate competition was therefore evaluated in a competition experiment.
In this experiment, 100 nM template-primer (with only one paired nucleoside to be sequenced at the 3' end of the template), 2 μM complementary substrate, 2 μM mismatched substrate, and excess Bst and CIP enzymes were mixed together. The reaction was carried out at 65 ℃ and pH 8.3 and was triggered by 1 mM Mn(II).
The results show that the reaction rate does not decrease significantly when the mismatched substrate is added at the same concentration (see FIG. 13). This can be explained as follows. Bst enzyme is polymerase I from Bacillus stearothermophilus. Bst binds primer-template with a K_d of 5 nM, matched nucleotides with a K_d of 5 μM, and mismatched nucleotides with a K_d of 5 μM-10 μM. See, e.g., Kornberg and Baker, DNA Replication, 2nd edition, 2005, University Science Books, page 126. Step 1) and step 2) in FIG. 11 are treated as two thermodynamic equilibria whose dissociation constants (K_d) are 5 nM and 30 μM, respectively (arithmetic mean). If substrate competition does not occur, the two equilibria can be combined into one, and the K_d of the new equilibrium equals 150 (nM)(μM).
Thus, the D_NES and D_NE concentrations are 25.6 nM and 63.9 nM, respectively. If competition occurs, the D_NES concentration is 22.6 nM, the D_NES* concentration is 11.3 nM, and the D_NE concentration is 56.5 nM. The calculation shows that the D_NES concentration changes only slightly with or without competition, so the reaction rate also differs only slightly.
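The size of the competition effect can be illustrated with a simple equilibrium partition among the polymerase-bound template species D_NE, D_NES and D_NES*. The Python sketch below is a minimal illustration, not the calculation used in the text: it assumes that the substrates are in large excess over the template (so free and total substrate concentrations are equal) and takes K_d values of 5 μM for the matched substrate and 10 μM for the mismatched substrate, which are choices from the ranges quoted above. The function name is hypothetical.

```python
def bound_fractions(s_um, kd_s_um, s_star_um=0.0, kd_star_um=10.0):
    """Partition of polymerase-bound template among DNE, DNES and DNES*.

    Simple competitive-binding equilibrium; all concentrations in uM.
    Assumes substrates are in large excess over template (free ~ total).
    """
    a = s_um / kd_s_um            # [DNES] / [DNE]
    b = s_star_um / kd_star_um    # [DNES*] / [DNE]
    z = 1.0 + a + b
    return {"DNE": 1.0 / z, "DNES": a / z, "DNES*": b / z}

without = bound_fractions(2.0, 5.0)
with_comp = bound_fractions(2.0, 5.0, s_star_um=2.0, kd_star_um=10.0)
ratio = with_comp["DNES"] / without["DNES"]
print(f"productive complex retained under competition: {ratio:.3f}")
```

With these assumed constants the ratios [D_NES]/[D_NE] = 0.4 and [D_NES*]/[D_NE] = 0.2 agree with the concentrations quoted above (22.6/56.5 and 11.3/56.5), and the productive complex drops by only about one eighth.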
In summary, in 2+2 sequencing, the reaction rates of the four substrates varied acceptably, but could be adjusted to be equal by varying the substrate concentrations. Competition between the substrates does not significantly reduce the reaction rate. Thus, in the present method, the reaction rate per cycle can be set to a particular value and adjusted to optimized lead and lag values.
100 nM of the single-base-extension primer/template poly-G was aliquoted into two PCR tubes, and the mismatched nucleotide TG-dG4P (2 μM) and excess Bst and CIP were added to each. After the two mixtures were bubbled with argon for 2 minutes, the capped tubes were incubated at different temperatures, one at 4 ℃ and the other at 65 ℃. After 1 hour, 2 μM of the matching nucleotide TG-dC4P was added to each tube, and the extension reactions were measured on a fluorometer at 65 ℃. If misincorporation occurred during incubation, the two tubes would be expected to show different signal levels, with the tube incubated at 65 ℃ giving a lower signal than the tube incubated at 4 ℃, because the misincorporation rate is higher at 65 ℃. However, the results in FIG. 13 show that the extension signals in the two tubes are almost identical, indicating that misincorporation of TPLFN by Bst under sequencing conditions is at an undetectable level.
One of the challenges of the continuous fluorescence sequencing strategy is the need to accurately measure homopolymer or copolymer regions of the template from the generated fluorescence signal. FIG. 14 shows primer extension by Bst polymerase on different homopolymer templates. The reaction was carried out on a fluorometer under the following conditions: 100 nM primer/template poly-T, poly-TT, poly-TTTT, or poly-TTTTTT; excess Bst and CIP; 2 μM TG-dA4P; pH 8.3 buffer; 65 ℃; triggered by Mn(II). The results in FIG. 14 show that the generated fluorescence signal is proportional to the number of consecutive identical bases over a relatively wide range. In addition, FIG. 15 shows that, in this linear assay, the heteropolymer (or copolymer) sequence poly-TCTCTCTCTC gives the same signal level as poly-TTTTTTTT when a dA-dG mixture is used instead of dA alone.
In addition to reaction rates, polymerase fidelity is also a critical issue in 2+2 sequencing strategies, especially given that the polymerases used herein have proofreading deficiencies in some respects. Not only does the incorporation of mismatched nucleotides reduce sequencing accuracy, but it also results in signal attenuation for each sequencing cycle. Although fidelity is primarily an inherent ability of a polymerase, specific reaction conditions can still affect the ability of the polymerase to distinguish errors. To evaluate the fidelity of the polymerase, misincorporation experiments were designed as follows:
Excess Bst and CIP, Mn(II), 100 nM primer-template (with an unpaired G on the template immediately beyond the 3' end of the primer) and 2 μM dC4P were mixed at 65 ℃ and pH 8.3, generating a fluorescence signal of 4.5 × 10^5.
Then, a mixture of Bst, CIP, Mn(II) and primer-template at the same concentrations was mixed with 2 μM dG4P and bubbled with argon to prevent oxidation of Mn(II). Next, half of the mixture was incubated at 65 ℃ for 30 minutes and the other half at 65 ℃ for 1 hour. After incubation, 2 μM dC4P was added to each mixture, and the resulting fluorescence signals were 4.6 × 10^5 and 4.5 × 10^5, respectively. This indicates that mismatch extension is hardly detectable in the reaction system when Bst polymerase is used. The small signal difference is mainly caused by inaccuracy in sample mixing. A very slow mismatch extension rate is highly desirable in sequencing reactions, because once a primer-template is extended by a mismatch, a substitution mutation is generated at the current nucleotide site, which alters the structure at the front of the duplex and blocks further extension of the primer-template. In this way, mismatch extension would gradually decrease the effective concentration of the surface-grafted template array and cause significant signal attenuation in each sequencing cycle. The studies herein exclude an effect of mismatch extension on the sequencing reactions and confirm the high accuracy of the reaction system.
FIG. 16 shows that the extension rate of Bst is temperature dependent, with optimal enzyme activity at 65 ℃ and no activity at all at 4 ℃. This temperature dependence can benefit sequencing performance, because ultimately all reactions in high-throughput sequencing will be separated and confined in microreactors on the sequencing chip. Thus, when the substrate and enzyme are loaded at 4 ℃, neither signal generation nor diffusion is a critical concern. However, once the temperature is raised to 65 ℃, the polymerase becomes fully active and rapidly generates a signal with a high signal-to-noise ratio.
The stability of the substrate TPLFN was also measured at different temperatures. The results show that the higher the temperature, the greater the hydrolysis rate. However, the hydrolysis rate did not exceed 2ppm/s, indicating that the background generated by autohydrolysis is still much lower than the polymerase extension signal. Even so, for better performance, the substrate will preferably be stored at low temperature to prevent autohydrolysis before extension begins.
Section 3: sequencing chip surface grafting
Before oligonucleotide grafting, the glass chips used for sequencing were modified with a hydrogel. The modification method is based on reported procedures, as described below. See, for example, U.S. Patent No. 8,247,177.
3.1. Hydrogel polymer coating
1) BRAPA synthesis:
the hydrogel monomer N- (5- (2-bromoacetamido) pentyl) acrylamide (BRAPA) was synthesized by the following method. (FIG. 17)
1, 5-diaminopentane (10.2g, 0.1 mol) was dissolved in 300mL of anhydrous methanol at 0 deg.C, and a solution of acryloyl chloride in anhydrous THF (0.9g, 0.09mol of acryloyl chloride dissolved in 15mL of anhydrous THF) was added dropwise with stirring. After the addition, the reaction mixture was stirred for 10 h. 200g of silica gel and 1% benzoquinone were added to the reaction and all solvent was removed with a vacuum evaporator. The silica gel powder with the chemicals adsorbed on it was loaded on top of a preparative silica gel column, eluted with DCM/methanol (10/1-1/1), the eluent containing the desired product was collected and concentrated to give 13g of off-white powder which was used directly in the next step without further purification or longer storage to prevent polymerization.
The product was suspended in 150 mL THF (20 mL methanol can be added to increase solubility), and aqueous sodium bicarbonate (2 equiv.) was then added at 0 ℃. Bromoacetyl bromide (0.8 mol) was added dropwise to the mixture at this temperature, and the mixture was stirred for 10 hours before the reaction was stopped. Then, 50 mL brine was added to the solution, the two phases were separated, and the aqueous phase was extracted with 3 × 50 mL DCM. The combined organic phases were dried over Na2SO4, concentrated and purified on a silica gel column (eluting with EA/methanol) to give 13.5 g BRAPA as a white solid. The product can be further purified by recrystallization from ethyl acetate. Mp 102-104 ℃. HRMS: C10H18BrN2O2 (M + H) calculated 277.0541; found m/z 277.0546. 1H NMR (500 MHz, d6-DMSO) δ 8.22 (s, 1H, NH), 8.02 (s, 1H, NH), 6.21 (dd, J = 15 Hz, 10 Hz, 1H, CH), 6.07 (dd, J = 15 Hz, 5 Hz, 1H, CH), 5.55 (dd, J = 10 Hz, 5 Hz, 1H, CH), 3.82 (s, 2H, CH2), 3.08 (ddd, J = 10 Hz, 5 Hz, 4H, CH2), 1.43 (m, 4H, CH2), 1.27 (m, 2H, CH2). 13C NMR (126 MHz, d6-DMSO) δ 166.29, 164.93, 132.40, 125.16, 39.40, 38.90, 30.05, 29.17, 28.95, 24.21.
2) Cleaning the surface of the chip:
The channeled glass chip was cleaned using the following procedure: it was washed with chromic acid cleaning solution for 5 minutes and then rinsed thoroughly with MilliQ H2O; after drying in an oven at 120 ℃, the chip surface was treated with oxygen plasma for 3 minutes. The chip was then used immediately for surface modification.
3) Preparing hydrogel:
To 10 mL of a 2% acrylamide solution in MilliQ H2O, BRAPA (70 mg in 700 μL DMF) was added and the solution was mixed well. The mixture was filtered through a 0.22 μm filter and then bubbled with argon for 15 minutes. Then, 11.5 μL TEMED was added, followed by a solution of potassium persulfate in MilliQ H2O (50 mg/mL, 100 μL). The well-mixed solution was immediately loaded into the channels of a clean chip and kept under a humid argon atmosphere for 35 minutes. The hydrogel-coated chip was then washed thoroughly with 200 mL MilliQ H2O.
3.2. Primer grafting and template amplification and hybridization
A 10 μM solution of the 5'-phosphorothioate oligonucleotide PS-T10-P7 (5'-T TTTTTTTCAAGCAGAAGACGGCATACGA-3', ═ phosphorothioate) in PBS buffer, pH 8.0, was loaded into the coated channel and held there at 50 ℃ for 1 hour. The grafted chip surface was then blocked with 10 mM 2-mercaptoethanol in PBS buffer, pH 8.0, for 40 minutes, followed by thorough washing with MilliQ H2O. The grafted surface is shown in FIG. 18.
3.3. Preparation of DNA templates
ECCS library design:
A lambda phage genomic DNA fragment (about 300 bp) was used as the test DNA oligomer for preparing the sequencing template. Lambda DNA was obtained from New England Biolabs, USA. The complete sequencing template included linker 2 (43 bp) and P7 (21 bp) on the 5' end of the ssDNA template, and the reverse complement of linker 1 (38 bp) and P5 (20 bp) on the 3' end of the lambda ssDNA. Except for several bases, the sequences of P5, P7, linker 1 and linker 2 are identical to those used by Illumina, for compatibility.
One component library preparation (from phage λ):
A two-step PCR amplification method was used to prepare the sequencing template. In the first PCR, lambda genomic DNA (500 ng, NEB), the first-round PCR primers (200 nM each), and 1x Q5 High-Fidelity 2X Master Mix (NEB) were mixed in H2O, and 50 μL of the mixture was subjected to the following PCR thermal cycling profile: (i) initial heating at 95 ℃ for 90 seconds; (ii) 30 cycles, each of 95 ℃ for 30 seconds, 65 ℃ for 30 seconds, and 72 ℃ for 30 seconds. The amplification product was then purified with a PCR purification kit (Zymo, D4061) and transferred into Eppendorf tubes for the second PCR amplification step. The second PCR step used conditions and a thermal cycling profile similar to those of the first step, but with the following primers applied to the newly generated template from above: P5-Adp1 (200 nM) and P7-Adp2 (200 nM).
The PCR products were gel purified and verified by Sanger sequencing with primers P5, P7 and P5SeqP1. After the final concentration was measured, the product, consisting of identical DNA templates, was stored in a -20 ℃ freezer until use.
3.4. Library immobilization: solid phase PCR in flow cells
The identical DNA templates prepared above were mixed with PCR reagents and then loaded into a flow cell whose surface had been grafted with primer P7 as described above. The mixture contained DNA template (1 nM), primer P5 (500 nM), primer P7 (62.5 nM), MgCl2 (6 mM), dNTP (0.5 mM), Platinum Taq polymerase (0.5 U/mL, Life Tech), BSA (0.2 mg/mL), and PCR buffer (200 mM Tris-HCl, 500 mM KCl). The solid-phase amplification thermal cycling comprises two stages with different temperature profiles. The first stage is an asymmetric pre-amplification: (i) hot start at 95 ℃ for 90 seconds; (ii) 15 cycles, each comprising 30 seconds at 95 ℃, 15 seconds at 65-60 ℃ (ramp down), and 30 seconds at 72 ℃. After asymmetric amplification, the template strand derived from primer P5 is strongly dominant in the PCR solution. The second stage of solid-phase PCR thermal cycling is then performed, mainly to hybridize to and extend the oligomer P7 grafted on the flow cell surface, with the following profile: 30 cycles, each comprising 30 seconds at 95 ℃ and 300 seconds at 65 ℃. The sample is then denatured with formamide to remove the strand complementary to the grafted oligomer, leaving only the P7-derived strand of the template on the flow cell surface.
After solid phase PCR, the PCR solution was aspirated using a pipette. Formamide was injected into the flow cell to denature all remaining double stranded DNA. Finally, the chip was washed with a washing buffer (20mM Tris-HCl buffer, pH 8.0, 50mM KCl) to remove the remaining formamide.
Density measurement of solid-phase ssDNA templates
First, 5. mu.M of an oligonucleotide with a fluorescent probe (FAM-T-SeqP1) was injected into the flow cell, and the injection port was sealed. The chip was then placed on a hot plate at 80 ℃ for 2 minutes and then cooled to room temperature (or below 30 ℃) within 30 minutes. The flow cell was washed thoroughly using a wash buffer. Then, a fluorescence image of the chip was taken with a fluorescence microscope having an automatic stage. Images were taken at 5 different positions on each lane to detect uniformity and minimize random errors.
Previous experiments demonstrated that the fluorescence value correlates positively and linearly with the number of FAM-modified primers. Therefore, to calculate the PCR product concentration, a standard concentration curve was established first by recording the fluorescence of lanes containing 0 nM (wash buffer without FAM-modified primers) and 100 nM TG solution. The mean intensity of the chip images was then mapped onto this standard curve to obtain the PCR product concentration.
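The two-point calibration described above amounts to a linear interpolation between the blank and the 100 nM standard. The following Python sketch shows the arithmetic; the intensity values are hypothetical placeholders and the function name is not from the source.

```python
def concentration_from_intensity(intensity, blank, standard, standard_nm=100.0):
    """Map a mean image intensity to a concentration (nM) by a two-point line.

    blank:    mean intensity of the 0 nM lane (wash buffer only)
    standard: mean intensity of the lane with the 100 nM standard solution
    """
    if standard == blank:
        raise ValueError("standard and blank intensities must differ")
    slope = (standard - blank) / standard_nm   # intensity units per nM
    return (intensity - blank) / slope

# Hypothetical intensities, for illustration only.
blank_i, standard_i = 120.0, 1120.0
sample_nm = concentration_from_intensity(870.0, blank_i, standard_i)
print(sample_nm)  # 75.0
```

A calibration with more than two standards would allow the linearity assumption to be checked rather than taken from earlier experiments.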
Characterization of the solid phase PCR products is shown in figure 24. The upper panel of fig. 24 shows a heat map of PCR product density for different lanes and locations. The lower panel of fig. 24 shows the PCR product density for different templates.
Typically, the PCR product concentration of the chips used for sequencing is about 50-150 nM (2.5-7.5 fmol/mm²). The average densities of the four lanes of one chip were approximately the same. Solid-phase PCR on templates of different lengths showed no obvious difference in density. To evaluate the uniformity of the PCR product density, the coefficient of variation (CV) was calculated from the density values of all imaged sites of a chip. The CV was 0.15 ± 0.13 for all chips.
The identified and qualified flowcell was denatured with formamide prior to hybridization of the sequencing primer (P5-SeqP 1). The treated flow cell was then transferred to a microscope platform for sequencing.
3.5. Sequencing
For sequencing experiments, a simple sequencing instrument was developed, as shown in FIG. 25. As shown in the upper panel of fig. 25, the sequencing chip (HiSeq 2000, for study only) was placed on a temperature controller, under which was a 3D translation stage for 3-dimensional movement of the sequencing chip. Above the chip is a highly sensitive CCD and 10 x microscope. The green light emitted is captured by the CCD via a microscope when blue light is illuminated on the chip during the reaction. At one end of the chip, there is a thin tube to which a valve and a pump are connected to introduce a reaction buffer and a washing buffer, and at the other end, the chip is mounted to the tube to lead out a waste liquid.
For the sequential sequencing strategy, a mixture of two different nucleotides is added to the flow cell at each reaction cycle. Thus, paired combinations of four nucleotides generate three sets, each set having two pairs of nucleotides (AC/GT or AG/TC or AT/GC). The six pairwise combinations are represented by M/K, R/Y and W/S, respectively.
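To make the pairwise encoding concrete, the Python sketch below collapses a sequence into runs of bases that belong to the same nucleotide pair and returns the run lengths, i.e., the ideal per-cycle signal of a 2+2 run. The group letters follow the pairings given above (M = A/C, K = G/T, R = A/G, Y = C/T, W = A/T, S = C/G); the function name is hypothetical, and for simplicity the input is treated as the sequence of incorporated bases.

```python
from itertools import groupby

GROUPS = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}

def degenerate_runs(seq, first, second):
    """Split `seq` into runs of bases from the same pair.

    Returns a list of (group_letter, run_length) tuples. The two groups must
    be complementary pairs (M/K, R/Y or W/S) so that every base is covered.
    """
    label = {base: name for name in (first, second) for base in GROUPS[name]}
    return [(name, len(list(run)))
            for name, run in groupby(seq.upper(), key=label.__getitem__)]

runs = degenerate_runs("AAGTCTGTAGGAATCACT", "M", "K")
lengths = [n for _, n in runs]
print(lengths)  # [2, 2, 1, 3, 1, 2, 2, 1, 3, 1]
```

The same sequence gives different run lengths under R/Y or W/S, which is why the three runs carry complementary information.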
Prior to each sequencing run, the reagents were premixed and stored in two separate vials in the cooling rack. Both vials contained Bst DNA polymerase (100 U/μL, McLab), bovine intestinal alkaline phosphatase (0.5 U/mL, NEB), MnCl2 (1 mM), and DTT (10 mM) in reaction buffer (40 mM Tris base, 40 mM NH4Cl, 100 mM KCl); one vial additionally contained TG-dA4P (3 μM)/TG-dG4P (3 μM) for R, and the other contained TG-dC4P (2.5 μM)/TG-dT4P (5 μM) for Y. After one sequencing run, the vials were switched to W/S and then to M/K, with the same formulation as for R/Y. These nucleotide groups need not be used in any particular order; any order works in the same manner.
The flow cell was mounted on the microscope platform, the reagent vials were placed in the cooling rack, and the automated sequencing process was carried out in the following steps: (i) washing the flow cell and the reagent input system (rotary valve, tubing between flow cell and reagent vials) with wash buffer; (ii) washing the flow cell 3 times with wash buffer; (iii) cooling the flow cell to 4 ℃ and loading one of the nucleotide mixtures (e.g., TG-dA4P/TG-dG4P for R) by syringe pump through the rotary valve; (iv) heating the flow cell to 15 ℃ and taking a background fluorescence image with a CCD camera (Hamamatsu); (v) heating the flow cell to 65 ℃ to trigger polymerase-mediated nucleotide incorporation and primer extension, and holding at 65 ℃ for 1 minute; (vi) cooling the flow cell to 15 ℃, taking an image to record the fluorescence signal, and then returning to step (ii). This process is controlled automatically until the entire template has been sequenced or the sequencing limit is reached. The flow cell is then denatured with formamide to regenerate the single-stranded template. After the primers are annealed, the next round of sequencing with a different set of reagent mixtures is performed in the same manner as described above.
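The repeating part of this procedure, steps (ii) to (vi), can be written down as a simple schedule. The Python sketch below only enumerates the steps with the temperatures and hold time given above; it does not drive any instrument, the names are hypothetical, and the strict alternation between the two vials from cycle to cycle is an assumed reading of the procedure.

```python
from typing import Iterator, NamedTuple, Optional, Sequence

class Step(NamedTuple):
    cycle: int
    mix: str
    action: str
    temp_c: Optional[float]   # None where the text gives no temperature
    hold_s: float

def cycle_schedule(n_cycles: int, mixes: Sequence[str] = ("R", "Y")) -> Iterator[Step]:
    """Yield steps (ii)-(vi) for each cycle, alternating the reagent vials."""
    for c in range(n_cycles):
        mix = mixes[c % len(mixes)]                               # assumed alternation
        yield Step(c + 1, mix, "wash x3", None, 0.0)              # step (ii)
        yield Step(c + 1, mix, "load mix", 4.0, 0.0)              # step (iii)
        yield Step(c + 1, mix, "background image", 15.0, 0.0)     # step (iv)
        yield Step(c + 1, mix, "extend", 65.0, 60.0)              # step (v)
        yield Step(c + 1, mix, "signal image", 15.0, 0.0)         # step (vi)

plan = list(cycle_schedule(4))
for step in plan[:5]:
    print(step)
```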
The bottom left panel of FIG. 25 is a typical fluorescence reaction kinetics curve, with the fluorescence intensity recorded every 5 seconds. When the chip is heated to 65 ℃, the fluorescence intensity increases markedly within about 20 seconds and reaches a plateau, indicating that the reaction is nearly complete. The temperature controller is then cooled to 20 ℃ to record the post-reaction fluorescence intensity; the fluorescence intensity rises at this point because of the decrease in temperature. However, the unit signal decreases over the course of sequencing because of phase loss and template loss. The bottom right panel of FIG. 25 depicts the kinetic curve of each reaction cycle throughout sequencing.
Table 25: oligonucleotide sequences used in this section
Note that: "+" indicates phosphorothioate linkages; FAM: 5, 6-fluorescein phosphoramidites
Table 26: template sequences used in this section
Section 4: continuous sequencing phase loss correction
4.1. Signal lead and lag
For amplification-based sequencing-by-synthesis methods, one of the inevitable limiting factors is phase loss, i.e., desynchronization of the extended molecules. This phenomenon is due to accidental addition of nucleotides (lead) or incomplete extension (lag), and will lead to increased noise and sequencing errors. In the ideal case, i.e., in the absence of phase loss, all nascent DNA molecules have the same extension length; however, when phase loss is a concern, nascent DNA molecules may have different extension lengths. As the sequencing reaction proceeds, the distribution of extension lengths becomes more and more dispersed.
4.2. Virtual sequencer
4.2.1. MATLAB-based virtual sequencer
To monitor the distribution of nascent DNA extension lengths in sequencing reactions, a virtual sequencer program was developed in MATLAB to simulate the sequencing reactions. For a DNA sequence of length L, the chemical reactions considered and their corresponding kinetic constants are as follows:
Table 27: Chemical reactions in the virtual sequencer program and their corresponding kinetic constants
where k = 1, 2, …, L, and
Bst indicates Bst DNA polymerase,
DNA_(k-1) indicates the (k-1)th position of the DNA to be sequenced,
dN_k4P indicates a terminal phosphate-labeled fluorescent nucleotide that can pair with the kth position of the DNA,
pFluorescein indicates non-fluorescent fluorescein phosphate,
Phosphatase indicates alkaline phosphatase,
P indicates phosphate,
Fluorescein indicates fluorescent, unphosphorylated fluorescein,
Bst-DNA_(k-1), Bst-DNA_(k-1)-dN_k4P, etc. indicate the corresponding complexes.
The initial concentrations of the species used in the simulations are listed in the following table:
Table 28: Initial concentrations of various species in the virtual sequencer program
The virtual sequencer program reads a given DNA sequence from the table and automatically generates a series of chemical reactions, which are passed to the SimBiology toolbox of MATLAB to generate the corresponding ordinary differential equations (ODEs). All chemical kinetics used in the ODEs are mass-action kinetics. The ODEs were solved by the 4-stage Runge-Kutta method.
In the first sequencing cycle, DNA_0 is set to 0.05 and DNA_k (k > 0) is set to 0. The final values of DNA_k (k ≥ 0) are used as the initial values for the next cycle. The concentrations of the other species are reset to the values listed in the table. A flowgram of the sequencing process is simulated by rotating the initial values of dN4P in each cycle. The final value of Fluorescein is taken as the signal of each cycle.
In 2+2 sequencing simulated by the virtual sequencer program, if the concentration of the primary dN4P species is sufficient and there are no impurities in the modified nucleotides, it gives a signal in each cycle proportional to the length of each copolymer and all nascent DNA molecules will have exactly the same length (fig. 27 a-b). The sequences used in the simulation were L10115-301, with base combinations M/K.
When impurities are present in the modified nucleotides or the reaction time is insufficient, phase loss occurs and the sequencing signal is no longer proportional to the length of the corresponding copolymer. The effects of impurities and reaction time on the sequencing signal were evaluated with the virtual sequencer program by monitoring the concentration distribution of nascent DNA molecules. When impurities were present and the reaction time was sufficient, a lead effect was observed (FIGS. 27c-d). When no impurities were present but the reaction time was insufficient, a lag effect was observed (FIGS. 27e-f).
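The qualitative behaviour described above can be reproduced with a much simpler model than the full reaction scheme. The Python sketch below is a deliberately reduced stand-in, not the MATLAB/SimBiology program: it assumes that each base addition is a single pseudo-first-order step whose rate depends only on whether the base belongs to the main mix or to the trace impurity, treats the input as the sequence of incorporated bases, and integrates the resulting linear ODEs with a classical 4-stage Runge-Kutta step. All rate values and names are illustrative assumptions.

```python
import numpy as np

def rk4_step(f, x, dt):
    """One classical 4-stage Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def run_cycle(x, seq, rates, t_end, dt=0.05):
    """Advance the length distribution x (x[k] = fraction with k bases added)."""
    r = np.array([rates.get(b, 0.0) for b in seq])   # r[k]: rate of adding base k+1
    def f(state):
        flow = r * state[:-1]
        dx = np.zeros_like(state)
        dx[:-1] -= flow
        dx[1:] += flow
        return dx
    lengths = np.arange(len(x))
    before = lengths @ x
    for _ in range(int(round(t_end / dt))):
        x = rk4_step(f, x, dt)
    return x, lengths @ x - before        # signal = mean number of bases added

def simulate(seq, n_cycles, main_rate, impurity_rate, t_end):
    """Simulate an M/K run: M (A, C) and K (G, T) alternate as the main mix."""
    x = np.zeros(len(seq) + 1)
    x[0] = 1.0
    mixes = ("AC", "GT")
    signals = []
    for i in range(n_cycles):
        rates = {b: main_rate for b in mixes[i % 2]}
        rates.update({b: impurity_rate for b in mixes[(i + 1) % 2]})
        x, s = run_cycle(x, seq, rates, t_end)
        signals.append(s)
    return np.array(signals), x

SEQ = "AAGTCTGTAGGAATCACT"
ideal, _ = simulate(SEQ, 10, main_rate=1.0, impurity_rate=0.0, t_end=60.0)
lag, _ = simulate(SEQ, 10, main_rate=1.0, impurity_rate=0.0, t_end=3.0)
lead, _ = simulate(SEQ, 10, main_rate=1.0, impurity_rate=0.001, t_end=60.0)
print(np.round(ideal, 3))
print(np.round(lag, 3))
print(np.round(lead, 3))
```

In this reduced model the ideal run returns the copolymer lengths, a short reaction time lowers the first signal (lag), and a trace impurity raises it (lead), mirroring the trends described for FIG. 27.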
4.2.2. Principle of one-pass and multiple-stop
To observe the effect of phase loss on the distribution of nascent DNA extension lengths, the virtual sequencer program was used to simulate the sequencing reaction with ordinary differential equations (ODEs). In the simulation, the molecule to be sequenced was set to K(M)_nKMM, with K (G and T) as the main nucleotide species in the reaction solution and M (A and C) as the impurities. Other parameters, such as the reaction time and kinetic parameters, were set to estimated normal values. It was observed that after the first K was extended by the main species, the consecutive M were partially extended by the impurities, as expected, producing a lead effect. If n is 1, the K following M is extended almost entirely by the main nucleotide species. However, if n > 1, the secondary lead decreases rapidly (FIG. 28, top panel). This one-pass, multiple-stop feature makes it possible to predict the distribution of DNA extension lengths and to develop the correction algorithm described below.
4.3. Phase loss correction by flux matrix
Assume that in a 2+2 sequencing run, the parameters are defined as follows: N indicates the number of sequencing cycles; M indicates the number of copolymers in the molecule to be sequenced; h is a column vector whose element h_j indicates the length of the jth copolymer; s is a column vector whose element s_i indicates the sequencing signal of cycle i; D (N×M) is the distribution matrix, whose element d_ij indicates the proportion of nascent DNA molecules that have been extended to the jth copolymer in the ith sequencing cycle; T (N×M) is the flux matrix, whose element t_ij indicates the proportion of nascent DNA molecules that extend through (pass) the jth copolymer in the ith sequencing cycle; λ is the lag coefficient, i.e., the proportion of nascent DNA molecules of the same length that are not extended by the main nucleotide species in a given cycle; ε is the lead coefficient, i.e., the proportion of nascent DNA molecules of the same length that are extended by the contaminating nucleotide species in a given cycle; and h' is a column vector whose elements are given by [equation not recoverable from the source].
As shown in FIG. 27, phase loss distorts the signal and reduces sequencing accuracy. An algorithm was developed to correct this distortion, as discussed in detail below. The lower panel of FIG. 28 summarizes the key concepts and gives an overview of the correction algorithm. The upper and lower parts of the lower panel of FIG. 28 are 3D displays of the distribution matrix D (N×M) and the flux matrix T (N×M), respectively. Each entry of D and T is represented as a cuboid whose dimension along the sequence axis is related to the length of its corresponding copolymer. The matrices D and T can be computed in an interdependent, iterative manner; both are positive on or near their diagonals and zero elsewhere. Because all nascent DNA strands eventually extend through each copolymer, the accumulation of T along the cycle axis equals 1. The accumulation of T along the sequence axis is the measured dephased sequencing signal. The matrices D and T and their accumulations along the two axes can each be divided into three parts: primary, lead and lag. The primary part is the diagonal of D and T, representing nascent DNA strands of exactly the expected length. The lead and lag parts are the upper and lower triangular parts of D and T, representing nascent DNA strands longer or shorter than expected, respectively. As shown in the lower panel of FIG. 28, during the first few sequencing cycles the primary part dominates D, T and their accumulations and contributes the vast majority of the sequencing signal. As sequencing continues, however, the primary part decreases and the lead and lag parts increase, indicating signal distortion.
4.3.1. Distribution and flux matrix
The following assumptions are made: 1) no nucleotide misincorporation occurs in the sequencing reaction, so misincorporation is not a cause of lead; 2) lead is caused by residual contaminating nucleotides from the previous cycle; 3) at most one base per molecule is extended by the contaminating nucleotides in a given cycle; 4) if the copolymer extended by the contaminating nucleotides has length 1, the molecule is extended further by the main nucleotides, which is termed secondary lead; 5) if the copolymer extended by the contaminating nucleotides has length greater than 1, no secondary lead occurs; 6) a secondary-lead strand is not extended further by the contaminating nucleotides. Assumptions 3 to 6 all rest on the fact that the contaminating nucleotide species are present only in trace amounts, which is consistent with the virtual sequencer simulation results herein (the one-pass, multiple-stop principle).
With the above assumptions, for a given N, M, h, λ, and ε, D and T are calculated as follows:
For example, consider sequencing the sequence AAGTCTGTAGGAATCACT with the base combination M/K for 6 cycles; then h = (2,2,1,3,1,2,2,1,3,1)^T. Assuming that the lead and lag coefficients are both 0.05, the matrices D and T are:
The incorporation rates and impurity levels of different nucleotides also differ; in view of this, different λ and ε were used for the two sequencing mixtures.
Phase loss correction algorithm
The relationship between h and s is as follows:
s=T(h’,ε,λ)h (4)
Since dim(s) < dim(h), the linear system is underdetermined; the Moore-Penrose pseudo-inverse and an iterative algorithm are used to obtain the minimum-norm solution (FIG. 29):
1. Set the initial value of h1.
2. Calculate the matrices D and T according to equations (2) and (3).
3. Set h2 = T⁺s, where T⁺ is the pseudo-inverse of T.
4. Compare [h2] and [h1], where [·] is the rounding operation. If the two are equal, return h2. Otherwise, go to step 5.
5. Set h1 ← h2. Go to step 2.
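The iteration in steps 1 to 5 can be sketched in Python as follows. Because equations (2) and (3) are not reproduced here, `build_flux` is a simplified stand-in written for this illustration only: it applies a lag fraction to main-species extension and allows lead (with secondary lead) only through copolymers of length 1, loosely following the assumptions listed earlier. The initial guess (rounded signal) is likewise an assumption. The demonstration uses more cycles than copolymers so that recovery is exact, whereas the text treats the case dim(s) < dim(h), where the pseudo-inverse gives the minimum-norm solution.

```python
import numpy as np

def build_flux(h_is_one, n_cycles, eps, lam):
    """Stand-in flux matrix T (n_cycles x M); NOT equations (2)-(3) of the text.

    h_is_one[j] is True when copolymer j has length 1 (the only way h enters).
    """
    M = len(h_is_one)
    T = np.zeros((n_cycles, M))
    d = np.zeros(M + 1)          # d[j]: fraction of strands with j copolymers done
    d[0] = 1.0
    for i in range(n_cycles):
        new = d.copy()           # (a) main species extends matching copolymers
        for j in range(M):
            if j % 2 == i % 2:
                move = (1.0 - lam) * d[j]
                T[i, j] += move
                new[j] -= move
                new[j + 1] += move
        d = new
        new = d.copy()           # (b) lead through a length-1 copolymer
        for j in range(M):
            if j % 2 != i % 2 and h_is_one[j]:
                move = eps * d[j]
                T[i, j] += move
                new[j] -= move
                if j + 1 < M:    # secondary lead by the main species
                    T[i, j + 1] += move
                    new[j + 2] += move
                else:
                    new[j + 1] += move
        d = new
    return T

def correct(s, M, eps, lam, max_iter=20):
    """Iterative pseudo-inverse correction (steps 1-5); returns the estimate of h."""
    s = np.asarray(s, dtype=float)
    h1 = np.ones(M)
    n = min(M, len(s))
    h1[:n] = np.maximum(np.round(s[:n]), 1.0)     # step 1 (assumed initial guess)
    for _ in range(max_iter):
        T = build_flux(np.round(h1) == 1, len(s), eps, lam)   # step 2 (stand-in)
        h2 = np.linalg.pinv(T) @ s                             # step 3
        if np.array_equal(np.round(h2), np.round(h1)):         # step 4
            return h2
        h1 = h2                                                # step 5
    return h2

h_true = np.array([2, 2, 1, 3, 1, 2, 2, 1, 3, 1], dtype=float)
s = build_flux(h_true == 1, 12, 0.01, 0.01) @ h_true   # simulated dephased signal
h_est = correct(s, len(h_true), 0.01, 0.01)
print(np.round(s[:10], 3))
print(np.round(h_est, 3))
```

The matrix depends on h only through which copolymers have length 1, which is why the matrix must be rebuilt whenever the rounded estimate changes.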
FIG. 29 shows a simplified flow chart of the phase loss correction algorithm. Briefly, the algorithm refines the sequencing signal iteratively until it converges. Typically, the iteration terminates within 5 rounds. An example of applying the algorithm to real sequencing data is shown in FIG. 30, which illustrates the refinement process during the iterations of the phase loss correction algorithm.
4.3.2. General solution of the equation
The relationship between h and s is as follows:
s=T(h’,ε,λ)h (4)
Because dim(s) < dim(h), the linear system is underdetermined and has infinitely many solutions that satisfy the equation exactly. The general form of these solutions is:
h = T⁺s + (I − T⁺T)w (5)
where T⁺ is the pseudo-inverse of T, I is the identity matrix and w is an arbitrary vector. In the dephasing correction algorithm, w is set to the zero vector. The term R = I − T⁺T was examined to see what effect it has on h. With the sequence set to L10115-301, the base combination M/K, the lead coefficient 0.007, the lag coefficient 0.005, and 100 sequencing cycles, the entries of R in rows 1-99 and columns 1-99 were found to be very close to zero (about 10^-16) and can therefore be regarded as numerical error (FIG. 31, which shows the matrix R), so h is effectively determined except for its last element.
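The standard general solution of an underdetermined consistent linear system, h = T⁺s + (I − T⁺T)w, can be checked numerically. The Python sketch below uses a small random matrix as a hypothetical stand-in for the flux matrix; it shows that every choice of w satisfies the equation and that w = 0 gives the solution of smallest norm.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.random((4, 6))               # hypothetical stand-in: 4 cycles, 6 copolymers
h_true = np.array([2.0, 1.0, 3.0, 1.0, 2.0, 2.0])
s = T @ h_true                       # consistent right-hand side

T_pinv = np.linalg.pinv(T)
R = np.eye(T.shape[1]) - T_pinv @ T  # projector onto the null space of T

h_min = T_pinv @ s                   # w = 0: minimum-norm solution
w = rng.standard_normal(T.shape[1])
h_other = h_min + R @ w              # another exact solution

print(np.allclose(T @ h_min, s), np.allclose(T @ h_other, s))
print(np.linalg.norm(h_min), np.linalg.norm(h_other))
```

Entries of R that are numerically zero mark the components of h that the measured signal fixes uniquely, which is the property examined in the text.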
4.3.3. Robustness of the dephasing correction algorithm-condition number
The Moore-Penrose pseudo-inverse is used in the dephasing correction algorithm. For the flux matrix T, the condition number is defined as:
κ(T) = ‖T‖·‖T⁺‖ (6)
A large condition number means that a small error in the elements of T can cause a large error in the entries of the solution. The influence of the dephasing coefficients on the condition number of T was evaluated. The sequences used were poly(AG) (AGAGAG…), poly(AAGG) (AAGGAAGG…), L718-308, L4418-305, L9730-303 and L10115-301, with the base combination M/K. The lead and lag coefficients used for the evaluation were 0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, and 0.1. For each sequence and pair of dephasing coefficients, the flux matrix T was calculated according to equation (3) and its condition number according to equation (6). FIG. 32 shows the logarithm of the condition number at different dephasing coefficients. For all sequences except poly(AAGG), increasing the lead or lag coefficient increased the condition number, indicating that the more dephased molecules there are, the worse the correction. For poly(AAGG), however, whose DPL equals 2, increasing the lead coefficient decreased the condition number. This indicates that a long DPL (length >2) significantly retards phase loss.
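A condition number based on the pseudo-inverse can be computed directly with NumPy. The Python sketch below assumes the spectral (2-) norm, for which the product of the norms of T and its pseudo-inverse equals the ratio of the largest to the smallest singular value; the choice of norm and the lag-only toy matrix are assumptions for illustration, not the matrix of equation (3).

```python
import numpy as np

def condition_number(T):
    """kappa(T) = ||T||_2 * ||pinv(T)||_2 (spectral norm assumed)."""
    T = np.asarray(T, dtype=float)
    return np.linalg.norm(T, 2) * np.linalg.norm(np.linalg.pinv(T), 2)

def toy_flux(n, lam):
    """Hypothetical lag-only stand-in: a fraction lam of each copolymer's
    flux arrives two cycles late (same nucleotide mix)."""
    return (1.0 - lam) * np.eye(n) + lam * np.eye(n, k=-2)

kappa_0 = condition_number(toy_flux(8, 0.0))
kappa_5 = condition_number(toy_flux(8, 0.05))
print(kappa_0, kappa_5, np.linalg.cond(toy_flux(8, 0.05)))
```

For a full-rank matrix this value matches `np.linalg.cond`, which also accepts non-square matrices in its default 2-norm mode.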
4.3.4. Algorithm robustness
A) Effect of dephasing coefficient deviation on Signal correction
The dephasing coefficients are obtained by fitting the signal of the reference sequence and are used to correct for other unknown sequences. In the ideal case, the phase loss coefficients of the reference and unknown sequences are the same. However, there is inevitably a slight difference between the two groups for random reasons. Therefore, if the coefficients are inaccurate, it is necessary to test how many errors it will produce in the out-of-phase correction. 100 DNA sequences of 370bp were randomly generated, their dephasing signals in a given dephasing coefficient calculated, and corrected using different but very close coefficients. The base combinations were set to M/K, the number of sequencing cycles was 150, and the given phase loss coefficients tested were 0.001, 0.005, and 0.010, respectively. Since the correction algorithm will still generate errors over the last few cycles even if the phase loss coefficients are accurate, the difference in the number of errors between accurate and inaccurate phase loss coefficients will be used to characterize performance, the average of which is shown in fig. 33, which shows the effect of phase loss coefficient deviation on signal correction. The asterisks in each graph indicate the location of the exact coefficients, and the color bars are limited to the range of 0-5, so any number of errors greater than 5 is shown in deep red. The results show that the more the phase loss coefficient deviates, the more errors it produces, and the greater the tolerance for lead deviation is relative to lag deviation.
B) Tolerance to global noise
Sequencing signal noise may come from out-of-focus imaging, CCD imaging, an unstable or abnormal fluidic system, etc. The effect of global white noise on the phase loss correction was therefore tested. A 220-cycle 2+2 sequencing run was first simulated by a virtual sequencer. In the simulation, the sequence was set to L8703-1012, the base combination was M/K, the reaction time was 130, and the concentrations of the main species and impurities were 2 and 0.002, respectively. White noise was added to all signals in the simulation, and the signals were then corrected using the algorithm described above. When the standard deviation σ of the white noise is 0, the algorithm fits the signal accurately (correlation 0.9996), and there is only 1 error in the corrected signal (cycle 219). When σ is 0.01, the algorithm still fits the signal well (correlation 0.9994), but more errors occur in the corrected signal (cycles 1-162 are error-free; cycles 163-220 contain 10 erroneous cycles). When σ is 0.02, the corrected signal is even less accurate (cycles 1-148 are error-free; cycles 149-220 contain 27 erroneous cycles). These results indicate that global white noise reduces the accuracy of the corrected signal and makes the later cycles error-prone.
Next, the number of error-free cycles after the dephasing correction was determined for given dephasing coefficients and global white noise. The out-of-phase signal is calculated according to equation (4), white noise is added, and the signal is corrected using the algorithm described above. The sequence used in the simulation was lam1, the base combination was M/K, the number of sequencing cycles was 500, and each condition (given dephasing coefficient and standard deviation of white noise) was repeated 100 times. If the first error in the corrected signal is in cycle (n_ef + 1), the number of error-free cycles is defined as n_ef. When the dephasing coefficient is as low as 0.30% and σ is 0.01, only about 50 cycles are error-free before correction, but all errors are corrected after correction. As the loss-of-phase coefficient or noise increases, the number of error-free cycles after correction decreases, while still being at least 3 times that before correction (FIG. 35, which shows the number of error-free cycles after loss-of-phase correction for given loss-of-phase coefficients and global white noise). These results demonstrate the effectiveness of the correction algorithm in increasing the read length, and the adverse effect of noise on the read length.
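The definition of n_ef can be illustrated with a minimal Python sketch. It assumes that a cycle counts as erroneous when the rounded corrected signal differs from the true DPL; the function name error_free_cycles and the sample values are hypothetical and not taken from the simulations above.

```python
def error_free_cycles(corrected, ideal):
    """Return n_ef: the number of leading cycles whose rounded corrected
    signal equals the ideal (true) degenerate polymer length."""
    n_ef = 0
    for got, want in zip(corrected, ideal):
        if round(got) != want:
            break  # first error is in cycle n_ef + 1
        n_ef += 1
    return n_ef


# Hypothetical example: the 4th cycle rounds to 2 instead of 1.
ideal = [2, 1, 3, 1, 2, 4]
corrected = [2.1, 0.9, 3.2, 1.6, 2.0, 4.1]
print(error_free_cycles(corrected, ideal))
```

Only errors after the first one are ignored by this measure, so it describes the usable read length rather than the total error count.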
C) Tolerance to spike noise
The effect of signal anomalies in a particular cycle was also tested. The out-of-phase signal s is calculated according to equation (4) and corrected to h. The signal in a specific single cycle is then increased by a given spike, resulting in an altered signal s_v, and s_v is then corrected to h_v. The sequence used in the simulation was L29732-497, the base combination was M/K, the number of sequencing cycles was 220, the tested spikes were 0.01, 0.1, and 0.5, the tested loss of phase coefficients were 0.001, 0.005, and 0.01, and the spikes were added in cycles 1, 25, 50, 75, 100, 125, 150, 175, and 200. With a phase loss coefficient of 0.01 and a spike of 0.5 (FIG. 36a), the same spike causes a more severe disturbance when added in a later cycle than in an earlier cycle. If the spike is added to cycle 200, the maximum difference between h_v and h can reach 47.5, even though the spike is only 0.5. Furthermore, adding a spike in a single cycle results in deviations of h_v in adjacent cycles. Similar phenomena were observed under other conditions.
A heat map of the maximum value of |h_v − h| under each condition was made, with the range of the color map set to [0, 1] (FIG. 36b). When the dephasing coefficient, the added spike, or the cycle number increases, the maximum value of |h_v − h| increases. These results indicate that as the lengths of the nascent DNA strands become more dispersed during sequencing, the signal becomes less robust to noise, since an abnormality in the sequencing signal in one cycle results in deviations in the corrected signal for more adjacent cycles.
4.4. Determination of loss of phase coefficient (fitting)
Lead and lag coefficients can be estimated from the sequencing results of a reference DNA molecule (i.e., a molecule with a known sequence).
For a given copolymer length array h, lead coefficient ε, and lag coefficient λ, the sequencing signal will be:
s=T(h’,ε,λ)h (4)
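Let f be the array of primary fluorescence signals directly collected by the CCD of the sequencer, and let s^(1) and s^(2) be the parity split of s, i.e.,
s^(1) = (s_1, 0, s_3, 0, …)^T, s^(2) = (0, s_2, 0, s_4, …)^T,
and let s*^(1) and s*^(2) be the corresponding indicator arrays,
s*^(1) = (1, 0, 1, 0, …)^T, s*^(2) = (0, 1, 0, 1, …)^T.
Therefore, the relationship between f and s^(1), s^(2) is:
f = a·b^t (s^(1) + s^(2)) + c·s*^(1) + d·s*^(2) + ξ (11)
where a, b, c, d and ξ are the unit sequencing signal, the attenuation coefficient, the signal offsets of the two sequencing mixtures (c and d), and the white noise term, respectively, and t is the array of cycle numbers, i.e., t = [1, 2, …, N]^T. The power b^t and its product with (s^(1) + s^(2)) are taken element-wise.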
So for any given h, ε, and λ, s can be calculated, and a set of a, b, c, and d can be found that best fits equation (11). The optimal ε and λ are then determined by a gradient descent strategy. The whole algorithm is as follows:
1. Define x = (ε, λ). Define the function F(x) as follows: calculate s by formula (4) from h and x; find the a, b, c and d that best fit formula (11) using a trust-region-reflective algorithm or the Levenberg-Marquardt algorithm; compute the fitted signal f̂ from formula (11) with these values; and take the Pearson correlation coefficient between f and f̂ as the value of F(x).
2. Set the initial values of ε and λ to ε_0 = λ_0 = 0.01 or any other reasonable value. Set the step sizes γ_g and γ_s to arbitrarily small positive numbers, such as 0.01.
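3. Consider the sequence x^(0), x^(1), x^(2), …, such that each x^(n+1) is obtained from x^(n) by a gradient step of size γ_g in the direction that increases F,
wherein the gradient of F is estimated numerically with step size γ_s.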
4. Stop the iteration if |F(x^(n+1)) − F(x^(n))| < ϵ, where the tolerance ϵ (not to be confused with the lead coefficient ε) is an arbitrarily small positive number, say 10^-6.
If different dephasing coefficients are considered for each sequencing mixture, then x is defined as x = (ε_1, ε_2, λ_1, λ_2), and the rest can be done in the same way.
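The building blocks of this fitting procedure can be sketched in Python under stated assumptions. The sketch assumes that the two offsets in formula (11) apply to odd and even cycles respectively, and that the iteration is a gradient step of size γ_g using a forward-difference gradient with step γ_s; neither detail is confirmed here. The names model_signal, pearson, ascend and toy_F are hypothetical, and the construction of T and the least-squares fit of a, b, c and d are omitted (a toy objective stands in for F).

```python
import math


def model_signal(s, a, b, c, d):
    """Noise-free right-hand side of formula (11) for cycles t = 1..N.
    Assumption: offset c applies to odd cycles and offset d to even cycles."""
    return [a * b ** t * s_t + (c if t % 2 == 1 else d)
            for t, s_t in enumerate(s, start=1)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def ascend(F, x0, gamma_g=0.01, gamma_s=0.01, tol=1e-6, max_iter=10000):
    """Hypothetical update rule: forward-difference gradient ascent on F,
    stopped when F changes by less than tol between iterations."""
    x = list(x0)
    fx = F(x)
    for _ in range(max_iter):
        grad = []
        for i in range(len(x)):
            shifted = x[:i] + [x[i] + gamma_s] + x[i + 1:]
            grad.append((F(shifted) - fx) / gamma_s)
        x_new = [xi + gamma_g * gi for xi, gi in zip(x, grad)]
        f_new = F(x_new)
        converged = abs(f_new - fx) < tol
        x, fx = x_new, f_new
        if converged:
            break
    return x


def toy_F(x):
    """Toy objective standing in for F(x); its peak is at (0.30, 0.10)."""
    return -((x[0] - 0.30) ** 2 + (x[1] - 0.10) ** 2)


eps, lam = ascend(toy_F, [0.01, 0.01])
print(eps, lam)
```

In a real fit, F(x) would compute s from h and x, fit a, b, c and d, and return pearson(f, model_signal(s, a, b, c, d)).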
4.4.1. Coefficient change trajectory; summary of the dephasing coefficients; and relationship between the phase loss coefficient and the sequencing reaction time
Coefficient variation trace
In a typical sequencing run, the fluorescence signal was fitted to the DNA sequence using the dephasing coefficient estimation algorithm. The trajectory of each coefficient during the estimation is depicted in FIG. 37A (X-axis: number of iterations). All coefficients converge to constant values during the iteration, which indicates an accurate estimation of the coefficients.
Summary of dephasing coefficients
The loss of phase coefficients in all rounds of sequencing were counted and summarized in FIG. 37B (loss of phase coefficients, error bars: standard deviation). The symbols a, b, c, and d in equation (11) are referred to as unit, attenuation, and two offsets.
Relationship between phase loss coefficient and sequencing reaction time
To examine the relationship between the phase loss factor and the sequencing reaction time, 5 consecutive 2+2 sequencing runs were performed in the same lane, with the reaction time increasing from 15s to 90s for each run. The DNA template in the experiment is L4418-305, the base combination is M/K, and the number of sequencing cycles is 40. The sequencing signals for each run were fitted using the algorithm described above and as a result, it was found that an increase in reaction time resulted in an increase in lead coefficient and a decrease in lag coefficient. The final reaction time for other sequencing experiments was taken to be 60s to account for the balance of the lead and lag coefficients. Fig. 37C shows the phase loss coefficients for different reaction times.
Section 5: decoding
5.1. Characterization of different sequencing profiles
5.1.1 information entropy of DNA
For a sufficiently long DNA molecule of length d, if the type of each base is independent and the probability of occurrence of each type of base is equal, i.e.,
P(A) = P(C) = P(G) = P(T) = 1/4,
then the Shannon entropy of the DNA molecule is
H_DNA = −d · Σ_{b ∈ {A,C,G,T}} P(b) log2 P(b) = 2d bits.
5.1.2 entropy of pyrosequencing
In this example, the term "degenerate sequence of a DNA molecule" is used to describe the sequence that has the same order of nucleotide types as the DNA molecule but in which every homopolymer length is equal to 1. For example, the degenerate sequence of 'ATTCCCG' is 'ATCG'.
In this example, the term "dark cycle" is used to describe a reaction cycle in 1×4 sequencing whose signal intensity is 0.
The 1×4 sequencing process is considered with the flow order (T, C, A, G, T, C, A, G, …). Without loss of generality, cycle 1 is assumed not to be a dark cycle and the delivered nucleotide is T. The type of the second homopolymer is then C, A or G, each with probability 1/3. If the second homopolymer is C, cycle 2 is not a dark cycle. If the second homopolymer is A, cycle 2 is a dark cycle and cycle 3 is not. If the second homopolymer is G, cycles 2 and 3 are both dark cycles and cycle 4 is not. Therefore, the probability distribution of N_dark, the number of dark cycles between two non-dark cycles, is as follows:
P(N_dark = 0) = P(N_dark = 1) = P(N_dark = 2) = 1/3,
so the expected value of N_dark is E[N_dark] = (0 + 1 + 2)/3 = 1. That is, the ratio of the number of non-dark cycles to dark cycles is 1:1.
The probability of signal intensity x in a non-dark cycle is:
P(x) = (3/4)·(1/4)^(x−1), x = 1, 2, 3, …
The average signal intensity (expected value) for the non-dark cycles is:
E[x] = Σ_{x≥1} x·P(x) = 4/3.
Since the ratio of the number of non-dark cycles to dark cycles is 1:1, the average signal intensity (expected value) for any 1×4 sequencing cycle is:
(1/2)·(4/3) = 2/3.
The expected number of sequencing cycles for a DNA molecule of length d is:
N_1×4 = d / (2/3) = 3d/2.
The Shannon entropy of a single 1×4 sequencing signal is:
H_1×4 = 4/3 bits, so that H_1×4 × N_1×4 = 2d = H_DNA.
Note that if a single 1×4 sequencing signal is considered without any a priori knowledge of the preceding cycles, the probability of its intensity is:
P′(0) = 1/2, P′(x) = (1/2)·(3/4)·(1/4)^(x−1) = (3/8)·(1/4)^(x−1), x = 1, 2, 3, …
The average signal intensity remains unchanged:
E′[x] = Σ_{x≥1} x·P′(x) = 2/3.
However, the Shannon entropy becomes:
H′_1×4 = −Σ_{x≥0} P′(x) log2 P′(x) ≈ 1.54 bits,
and:
H′_1×4 × N_1×4 = 2.31d > H_DNA
This counter-intuitive result arises because the signal intensities of the cycles in the same 1×4 sequencing run are not independent, so their Shannon entropies cannot simply be added up.
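The 2.31d figure can be checked numerically with a short Python sketch. It assumes independent, equiprobable bases, so that a homopolymer of length x has probability (3/4)(1/4)^(x−1), and it truncates the geometric tail at a hypothetical MAX_LEN; all variable names are illustrative.

```python
from math import log2


def entropy(pmf):
    """Shannon entropy in bits of a probability mass function."""
    return -sum(p * log2(p) for p in pmf if p > 0)


MAX_LEN = 60  # truncation point of the geometric tail (illustrative choice)

# Homopolymer length x >= 1: the next base matches with probability 1/4.
length_pmf = [(3 / 4) * (1 / 4) ** (x - 1) for x in range(1, MAX_LEN + 1)]
mean_length = sum(x * p for x, p in enumerate(length_pmf, start=1))

# On average one dark cycle accompanies each non-dark cycle.
cycles_per_base = 2 / mean_length

# Signal viewed without knowledge of earlier cycles: dark with probability 1/2.
naive_pmf = [1 / 2] + [p / 2 for p in length_pmf]
h_naive = entropy(naive_pmf)

# Joint description of one homopolymer: its dark-cycle count
# (three equally likely values) plus its length.
h_joint = log2(3) + entropy(length_pmf)
homopolymers_per_base = 1 / mean_length

print(h_naive * cycles_per_base)        # naive per-base total, above 2 bits
print(h_joint * homopolymers_per_base)  # joint per-base total
```

The joint description recovers the 2 bits per base of the DNA itself, whereas summing the per-cycle entropies as if the cycles were independent overcounts.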
5.1.3 entropy of ECC sequencing (Single color)
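The probability of signal intensity x in 2+2 sequencing is:
P(x) = (1/2)^x, x = 1, 2, 3, …
Therefore, the average signal intensity for 2+2 sequencing is:
E[x] = Σ_{x≥1} x·(1/2)^x = 2.
The Shannon entropy of a single monochromatic 2+2 sequencing signal is:
H = −Σ_{x≥1} P(x) log2 P(x) = Σ_{x≥1} x·(1/2)^x = 2 bits.
Thus, since d/2 cycles are required to sequence a molecule of length d, one round of monochromatic sequencing provides (d/2) × 2 = d bits of information.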
5.1.4 entropy of ECC sequencing (two-color)
The probability of signal intensity (x, y) in 2+2 sequencing is:
P(x, y) = C(x+y, x)·(1/4)^(x+y), x, y ≥ 0, x + y ≥ 1,
where C(n, x) is the binomial coefficient. The Shannon entropy of a single two-color 2+2 sequencing signal is:
H = −Σ_{x,y} P(x, y) log2 P(x, y) ≈ 3.37 bits.
Given that the average signal intensity for 2+2 sequencing is 2, d/2 cycles are required to complete the sequencing. Thus, a round of two-color sequencing provides (d/2)·H ≈ 1.69d bits of information.
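The per-cycle entropies of 2+2 sequencing can be evaluated numerically with a minimal Python sketch. It assumes independent, equiprobable bases, so that a degenerate polymer of length n has probability (1/2)^n and each of its bases carries either color with probability 1/2; MAX_DPL is a hypothetical truncation point and all names are illustrative.

```python
from math import comb, log2


def entropy(pmf):
    """Shannon entropy in bits of a probability mass function."""
    return -sum(p * log2(p) for p in pmf if p > 0)


MAX_DPL = 60  # truncation point of the geometric tail (illustrative choice)

# Monochromatic signal: only the DPL n is observed.
mono_pmf = [(1 / 2) ** n for n in range(1, MAX_DPL + 1)]
h_mono = entropy(mono_pmf)
mean_dpl = sum(n * p for n, p in enumerate(mono_pmf, start=1))

# Two-color signal: x bases of one color and n - x of the other are observed.
two_color_pmf = [comb(n, x) * (1 / 4) ** n
                 for n in range(1, MAX_DPL + 1)
                 for x in range(n + 1)]
h_two = entropy(two_color_pmf)

# Information per base supplied by one round of sequencing.
print(h_mono / mean_dpl)
print(h_two / mean_dpl)
```

Under these assumptions a single round of either kind supplies fewer than the 2 bits per base of the DNA, while three rounds together supply more, which is the redundancy that error correction can exploit.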
5.1.5 sequence reaction differences required for three rounds of ECC sequencing
Different base combinations require different numbers of cycles when sequencing the same molecule. For example, for the sequence 'ACACACA', 5 cycles are required to extend the entire molecule of R/Y, but only 1 cycle is required for M/K. 10000 different DNA sequences with the length of 100bp are randomly generated, and the sequencing cycle required by three base combinations M/K, R/Y and W/S is calculated. FIG. 57 shows the range distribution of the cycles of three base combinations. The average value of the range is 8.43 as shown by the red vertical line.
5.2 ECC decoding Algorithm
5.2.1 graphical representation of the Signal
The homopolymer length array of the DNA molecule to be sequenced is h, and the signal in ECC sequencing (after phase loss correction) is s = (s_1, s_2, ..., s_n). Assume that in cycle i, the probability of h_i given the signal s_i is P(h_i | s_i). The signal can then be represented as a graph as described below.
Each signal s_i in s is represented by s_i nodes. Among the nodes representing signal s_i, there is a directed edge from the j-th node to the (j+1)-th node, j = 1, 2, ..., s_i − 1. A directed edge is drawn from the s_i-th (last) node representing signal s_i to itself. A directed edge is drawn from each node representing signal s_i to the first node representing s_(i+1).
Each node in the graph can be labeled 1 or 0 depending on the type of nucleotide delivered in the cycle.
Next, the weights of paths in the graph representing the sequencing signal are defined. A path is defined as a series of nodes v_1 v_2 ... v_K in which, for each pair of adjacent nodes v_k and v_(k+1), there is a directed edge from v_k to v_(k+1). v_k and v_(k+1) are allowed to be the same node, in which case both are the last node representing the signal of a particular cycle.
If, on path v_1 v_2 ... v_K, exactly t_i nodes represent signal s_i, then these nodes are assigned the weight P(t_i | s_i). The weight of path v_1 v_2 ... v_K is defined as the product of the weights of all of its nodes. For convenience of calculation, the weights of the nodes may also be specified as logarithms of the probabilities, in which case the weight of a path is the sum of the weights of all of its nodes.
A path in the graph represents one possible DPL (degenerate polymer length) array consistent with the sequencing results, as shown in figure 1. Specifically, the edge from the last node representing s_i to itself represents an insertion, and an edge from a node representing s_i (other than the last one) to the first node representing s_(i+1) represents a deletion.
A DNA molecule is sequenced with the base combinations M/K, R/Y and W/S to obtain three signals. Each of the three signals may be represented as a graph as described above. Suppose that three paths, one from each of the three graphs, have the same length K. If the parity check of the k-th nodes of the three paths holds for all k = 1, 2, 3, ..., K, then these three paths are referred to as a common path of the three graphs. The decoding problem is therefore to find the common path of the three graphs with the largest weight (maximum common path, MCP).
5.2.2 ECC decoding by dynamic programming
The terminology in this section:
Codeword space and node: the 3D discrete space in which an element with index [i, j, k] (i, j, k ∈ N), called a node, represents a codeword consisting of the i-th bit of round 1, the j-th bit of round 2, and the k-th bit of round 3. In contrast to the BS (degenerate sequence), the codeword space records every possible codeword alignment in an intuitive manner.
Jump: a move from one bit to another bit.
Connection: a directed link between two nodes. A connection comprises three jumps, one in each round.
Parity of a node: the three-bit XOR value of the node.
Preparation of essential variables
Assuming that the maximum length of three binary strings is N, the binary strings (degenerate nucleotide sequences) are preprocessed into a look-up table.
The BS (binary string) is a 3 × N Boolean matrix (Boolean matrix) that is a binary version of the sequencing data. The value 0 (or 1) represents degenerate bases. For example:
[0,0,1,1,1,0,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0;
0,1,1,1,1,0,0,0,0,1,1,0,0,1,0,1,1,1,1,0,1;
1,0,1,1,0,1,1,1,1,0,0,0,1,0,0,0,1,0,1,1,0;]
The CNS (cycle number sequence) is a 3 × N integer matrix that records the number of the cycle in which each binary bit (degenerate base) is read out.
[1,1,2,2,2,3,3,3,4,5,6,6,6,7,7,8,9,9,10,11,11;
1,2,2,2,2,3,3,3,3,4,4,5,5,6,7,8,8,8,8,9,10;
1,2,3,3,4,5,5,5,5,6,6,6,7,8,8,8,9,10,11,11,12;]
The DPL (degenerate polymer length) is a 3 × N integer matrix that records, for each bit, the DPL of the cycle in which it is read.
[2,2,3,3,3,3,3,3,1,1,3,3,3,2,2,1,2,2,1,2,2;
1,4,4,4,4,4,4,4,4,2,2,2,2,1,1,4,4,4,4,1,1;
1,1,2,2,1,4,4,4,4,3,3,3,1,3,3,3,1,1,2,2,1;]
These tables allow easy lookup of the cycle number and DPL of any bit. Different cycle numbers indicate that two bits come from different cycles. The DPL is the input to the scoring function of the dynamic programming. For example, the cycle number of the 11th bit in round 1 is CNS(1,11) = 6, and its DPL (the DPL of that cycle) is DPL(1,11) = 3.
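The relation between the three tables can be illustrated in Python. The sketch assumes that the CNS and DPL rows are simply the run-length encoding of the corresponding BS row, which agrees with the example matrices above; the function name cycle_tables is hypothetical.

```python
from itertools import groupby


def cycle_tables(bits):
    """Run-length encode one row of BS into its CNS row and DPL row."""
    cns, dpl = [], []
    for cycle, (_, run) in enumerate(groupby(bits), start=1):
        length = len(list(run))
        cns.extend([cycle] * length)   # every bit of the run shares a cycle
        dpl.extend([length] * length)  # and shares the run length as its DPL
    return cns, dpl


# Round 1 of the example BS matrix.
bs_round1 = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
cns, dpl = cycle_tables(bs_round1)

# 11th bit of round 1 (index 10 when counting from zero).
print(cns[10], dpl[10])
```

For the 11th bit this gives cycle number 6 and DPL 3, matching CNS(1,11) and DPL(1,11) in the example.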
Initializing comparison variables
SCORE = N × N × N matrix, default NaN
CONNECTION = N × N × N matrix of nodes
ROUTABLE = N × N × N Boolean matrix, default false, except ROUTABLE(1,1,1) = true. ROUTABLE(node) being true means that the node has a connection path back to node (1,1,1).
STEP = N × N × N matrix of 3-element tuples, default (0,0,0), except STEP(1,1,1) = (1,1,1)
The 3-element tuple STEP(node) records the number of bits counted in the current cycle in each of the three rounds of sequencing. The STEP value is increased by 1 with each connection within a cycle and reset to 1 in a new cycle. When jumping across cycles, the STEP value is taken as the corrected DPL value.
Pseudo code of comparison process
Parity (node) checks the parity of a node. The binary values of nodes (i, j, k) are taken from BS (1, i), BS (2, j), BS (3, k). A three-bit xor is calculated.
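The parity check can be illustrated with a short Python sketch. The 0/1 labels chosen for the three base combinations are an illustrative assumption (any labeling gives a constant parity over the four bases), and the names MK, RY, WS and parity are hypothetical.

```python
# Assumed 0/1 labels for the three base combinations.
MK = {"A": 0, "C": 0, "G": 1, "T": 1}  # M = A/C, K = G/T
RY = {"A": 0, "G": 0, "C": 1, "T": 1}  # R = A/G, Y = C/T
WS = {"A": 0, "T": 0, "C": 1, "G": 1}  # W = A/T, S = C/G


def parity(b1, b2, b3):
    """Three-bit XOR of the bits a node takes from rounds 1, 2 and 3."""
    return b1 ^ b2 ^ b3


seq = "TGAACTTTAGCCACGGAGTA"
bits = [(MK[b], RY[b], WS[b]) for b in seq]

# With this labeling every correctly aligned base has parity 0.
print(all(parity(*t) == 0 for t in bits))

# Flipping a single bit breaks the check for that node.
print(parity(1, 0, 0))
```

A node whose three bits fail the check cannot correspond to a single real base, which is what lets the decoder reject misaligned codewords.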
Layers and traversal. The L-th layer of the codeword space contains all nodes within the box [L, L, L] but not within the box [L-1, L-1, L-1]. Take layer 5 as an example: it has 61 nodes. The traversal order must ensure that each node (index [i, j, k]) can be connected from an already traversed node (index [ii, jj, kk] with ii ≤ i, jj ≤ j, and kk ≤ k). FIG. 58 shows an example of a possible traversal order of the layers and nodes of the scoring matrix structure.
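One traversal order that satisfies this requirement can be sketched in Python. Visiting the layers in increasing order, and the nodes of each layer in lexicographic order, is an illustrative choice and not necessarily the order of FIG. 58; the names layer and traversal are hypothetical.

```python
from itertools import product


def layer(L):
    """Nodes [i, j, k] (1-based) inside box [L, L, L] but outside
    box [L-1, L-1, L-1], i.e. nodes whose largest index equals L."""
    return [n for n in product(range(1, L + 1), repeat=3) if max(n) == L]


def traversal(N):
    """Yield every node of the N x N x N codeword space, layer by layer."""
    for L in range(1, N + 1):
        yield from layer(L)


print(len(layer(5)))
```

Any node with componentwise smaller-or-equal indices either lies in an earlier layer or precedes the current node lexicographically within the same layer, so it has already been visited.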
5.2.3 hidden Markov model for ECC decoding
Similar to its application in sequence alignment, the hidden Markov model can also be applied to the signal decoding problem. Three symbols are introduced to describe the state: match (m), asterisk (*), and gap (-). A signal with intensity a is denoted as a matches. If the ideal signal intensity is b and b > a, (b − a) asterisks are added immediately after the a matches. If b < a, (a − b) gaps are added in the two orthogonal signals at the positions closest to the matches. For example, the ideal DPLs of the sequence TGAACTTTAGCCACGGAGTA under the three base combinations are:
M/K:0、2、3、3、1、1、4、2、1、2、1;
R/Y:0、1、3、4、2、2、1、1、4、1、1;
W/S:1、1、2、1、4、3、1、3、1、1、2。
the DPL measured in the experiment was: (bold underlined numbers indicate errors)
M/K:0、2、3、3、1、1、2、1、2、1;
R/Y:0、1、4、2、2、1、1、4、1、1;
W/S:1、1、2、1、4、3、1、3、1、1、2。
The decoded corrected signal using the above representation is:
M/K:mmmm-mmmmmmmmm*mmmmmm
R/Y:mmmmmmmmmmmmmmmmmmmmm
W/S:mmmm-mmmmmmmmmmmmmmmm
Obviously, the alignment of the signals M/K, R/Y and W/S can be regarded as a sequence of transitions between the following alignment states: (mmm), (-m-), (mmm), (*mm), (mmm), and (mmm). This inspired us to use hidden Markov models to describe signal alignment.
Typically, the overall hidden states of the model are: (mmm), (m-), (-m), (-mm), (m*m), (mm*), (m x) and ([ m ]). Each state except (m--), (-m-) and (--m) emits a nucleotide, the type of which is determined by the corresponding sequencing signal types. States (m--), (-m-) and (--m) do not emit any nucleotide. 1 million DNA reads were simulated to count the probabilities of the state transitions (FIG. 59). The Viterbi algorithm of the hidden Markov model would be an alternative implementation of the ECC decoding algorithm. FIG. 59 shows the state transition network of the hidden Markov model for ECC decoding. The width of each edge represents the magnitude of the transition probability.
5.3 other ECC decoding results
Exemplary decoding results are shown in fig. 61.
5.4 simulation of decoding at different raw precisions (raw accuracies)
To further investigate the ability of ECC decoding to enhance accuracy, decoding was simulated at 5 different levels of raw accuracy, each level with 10000 DNA sequences. Two parameters γ and δ are used to generate a probability matrix P, whose entry P_ij indicates the probability that a DPL of length i is sequenced as length j, according to the following steps:
1. For each entry P_ij in P, set
2. Let N(x, μ, σ²) be the probability density function of the normal distribution, i.e., N(x, μ, σ²) = exp(−(x − μ)² / (2σ²)) / √(2πσ²). Then
3. P is normalized and the resulting sum of each row is equal to one.
In the simulation, γ was set to 1.6, 1.7, 1.8, 2.0, and 2.1, respectively, and δ was set to 0.1. The overall raw accuracies at these parameter settings were 97.42%, 98.34%, 98.97%, 99.64%, and 99.80%, respectively. The same 10000 random 400bp DNA sequences are used to calculate the theoretical DPL, the new value is randomly modified according to the generated probability matrix P, and the correction is carried out by using a decoding algorithm. The scoring functions used in the decoding algorithm are subjected to their respective probability matrices P. If two of three consecutive DPLs of a DNA sequence are modified by the decoding algorithm, the DNA sequence is discarded because of the high probability of erroneous decoding. The accuracy of the sequencing was defined as follows: if a DPL of length i is sequenced (or decoded) to length j, then the accuracy of these i bases in the DPL isThe accuracy distribution of the first 300bp of the DNA sequence was calculated before and after decoding and significant accuracy shifts were found after decoding (fig. 60), indicating the capability of the decoding algorithm. Fig. 60 shows the simulated distribution of accuracy before and after decoding.
Example 12: method for correcting sequencing errors
Construction of transformation matrices
In this example, a 2+2 sequencing experiment was used with the nucleotide combination M/K: A or C is added in every odd round, and G or T is added in every even round. When the DNA sequence to be tested is CCTGTATGACCGTATTCCGGGTCCTGTCGGTA (SEQ ID NO:40), the ideal signal obtained is h = (2, 3, 1, 2, 3, 2, 1, 2, 2, 4, 2, 3, 1, 3, 1).
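The ideal signal of a known sequence can be derived with a small Python sketch. It assumes that the flow alternates between one nucleotide pair and its complement and that a leading 0 is recorded when the first base does not belong to the first pair, as in the DPL listings of section 5.2.3; the function name ideal_signal is hypothetical.

```python
from itertools import groupby

M = frozenset("AC")  # nucleotides added in odd rounds for the M/K combination


def ideal_signal(seq, first_mix=M):
    """Degenerate polymer lengths of seq under a 2+2 flow that alternates
    between first_mix and its complement, starting with first_mix."""
    runs = [(in_first, len(list(group)))
            for in_first, group in groupby(seq, key=lambda b: b in first_mix)]
    h = [length for _, length in runs]
    if runs and not runs[0][0]:
        h.insert(0, 0)  # the first round extends nothing
    return h


print(ideal_signal("CCTGTATGACCGTATTCCGGGTCCTGTCGGTA"))
```

Passing a different first pair, for example frozenset("AG") for R/Y or frozenset("AT") for W/S, gives the ideal signals of the other two base combinations.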
For simplicity, it is assumed in the calculation that the lead and lag coefficients of M and K are the same. For example, when the lead coefficient is 0.02 and the lag coefficient is 0.01, and 10 sequencing reactions are performed in total, the transformation matrix constructed according to the above method is:
For the sake of computational accuracy, it is assumed in the calculation that the lead coefficients and the lag coefficients of M and K are different. For example, when M has lead and lag coefficients of 0.02 and 0.01, respectively, and K has lead and lag coefficients of 0.01 and 0.02, respectively, for 10 sequencing reactions, the transformation matrix constructed according to the above method is:
If a 2+2 two-color sequencing method is used, the calculation method of the transformation matrix is unchanged. The only difference is the way in which the parameter estimation and signal correction are applied.
Parameter estimation for monochromatic 2+2 sequencing
In this example, one monochromatic 2+2 sequencing experiment was used with the nucleotide combination M/K: A or C is added in every odd round, and G or T is added in every even round. The sequence to be determined was as follows:
AAGAGCTGGACAGCGATACCTGGCAGGCGGAGCTGCATATCGAAGTTTTCCTGCCTGCTCAGGTGCCGGATTCAGAGCTGGATGCGTGGATGGAGTCCCGGATTTATCCGGTGATGAGCGATATCCCGGCACTGTCAGATTTGATCACCAGTATGGTGGCCAGCGGCTATGACTACCGGCGCGACGATGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGACCACCCTGTGGGTTTATAAGGGGAGCGGTGACCCTTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGTCCTATGACGACAG(SEQ ID NO:41)。
A total of 200 sequencing reactions were performed, and the actual raw sequencing signals obtained are shown in FIG. 43. It can be seen that the value of the raw sequencing signal ranges between about 100 and 1500, with an overall downward trend. From about the 80th sequencing reaction onward, the signal fluctuates alternately, and sequence information cannot be read directly from it. By using the above parameter estimation method, the ideal signal can be deduced from the sequence of the DNA molecule to be tested and the sequencing method: h = (1,1,1,1,3,3,1,1,1,1,1, 1,3,3,2,2,1,2,1,1, 2,2,1,1,1,1,1,2,5,2,2,2, 1,1,2,4,2, 1,2,1,1,1, 3,1,2,1,4,1,3,1,2, 1,3,1,1,1, 2,4,1,2,1,1,1,1,1,1,1, 3,2,3,3,2,1,1, 1,4,1,1,5,2,1,6,3,1,1,2,1,1,1,2,2,1,3,2,1,1,1,1, 1,1,2,2,1,1,1,1,1,2, 1,1,2,1,2,1,1, 2,2,1,3,2,2,3,1,1,2,3,4,1,2,2,1,1,1,1,2,2,3,6,1,2,1,4,2,2,4,3,4,2,3,7,9,1,1,2,4,1,1,1,4,4,2,2,1,1,1,2,1,2,1,1,3,2,1,2,4,2,4,1,1,1,2,1,3,5,3,3,1,3,2,2,1,3,2,1,1,3,2,3,1,1,2,1,2,2,1,1,2,2,1,3,1). The relevant parameters in this sequencing were estimated using the parameter estimation method described above. When constructing the transformation matrix, for accurate calculation, it is assumed that the lead and lag coefficients of M and K are different. Let t be the number of sequencing reactions. A transformation function is constructed, wherein:
1. a is referred to as the unit signal;
2. b is referred to as the attenuation coefficient;
3. d and e are referred to as the global offsets of M and K, respectively;
4. s is the out-of-phase signal.
In the parameter estimation, the correlation coefficient used is the pearson correlation coefficient, and the optimization method used is the gradient descent method. After 48 rounds of iterative computation, the gradient descent meets the convergence condition, and the obtained lead coefficient of M is 0.0117, and the obtained lag coefficient of M is 0.0067. The lead coefficient of K is 0.0128 and the lag coefficient of K is 0.0067. The unit signal is 519.7, the attenuation coefficient is 0.9849, the global offset for M is 122.7, the global offset for K is 150.1, and the correlation coefficient is 0.999961. The trend of all parameters in the iterative calculation process is shown in fig. 44.
Signal correction for monochrome 2+2 sequencing
In this example, one monochromatic 2+2 sequencing experiment was used in which the sequenced sequence was unknown. Its actual raw sequencing signal f, and the phase-lost signal obtained by applying the inverse of the transformation function of Example 1 with its estimated parameters, are shown in FIG. 45 (an inverted triangle indicates that the signal intensity at that position does not match the ideal signal).
It can be seen that the phase-lost signal obtained through the inverse of the transformation function still has signal values at many positions that do not match the ideal signal. In the signal correction step, 4 iterations are performed to obtain the first-order phase-loss signal s_1, the second-order phase-loss signal s_2, the third-order phase-loss signal s_3, and the fourth-order phase-loss signal s_4. After rounding, s_3 and s_4 are equal, so the iteration is stopped and s_4 is output as the correction result. The fourth-order phase-loss signal is shown in FIG. 46, where an inverted triangle indicates that the signal intensity at that position does not match the ideal signal. It can be seen that, as the iteration progresses, the inverted triangles become progressively fewer, indicating increasingly high accuracy. In the final correction result, the signals from the first 173 sequencing reactions were all corrected completely correctly; the first correction error did not occur until the 174th sequencing reaction.
Parameter estimation for two-color 2+2 sequencing
In this example, one two-color 2+2 sequencing experiment was used: the nucleotide combination is M/K, wherein A and G are labeled with fluorophores of the same color, and C and T are labeled with fluorophores of another color. The sequence was as follows:
AAGAGCTGGACAGCGATACCTGGCAGGCGGAGCTGCATATCGAAGTTTTCCTGCCTGCTCAGGTGCCGGATTCAGAGCTGGATGCGTGGATGGAGTCCCGGATTTATCCGGTGATGAGCGATATCCCGGCACTGTCAGATTTGATCACCAGTATGGTGGCCAGCGGCTATGACTACCGGCGCGACGATGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGACCACCCTGTGGGTTTATAAGGGGAGCGGTGACCCTTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGTCCTATGACGACAG(SEQ ID NO:41)
A total of 200 sequencing reactions were performed to obtain the actual raw sequencing signal shown in FIG. 47.
It can be seen that the value of the raw sequencing signal ranges between about 100 and 1200, with an overall downward trend. From about the 80th sequencing reaction onward, the signal fluctuates alternately, and sequence information cannot be read directly from it. Because a two-color sequencing method is adopted, there are two each of the ideal signal, the phase-loss signal, the raw sequencing signal and so on, corresponding to the A- and G-labeled fluorophores and to the C- and T-labeled fluorophores, respectively.
By utilizing the parameter estimation method, the ideal signal corresponding to the A- and G-labeled fluorophores can be deduced from the sequence of the DNA molecule to be tested and the sequencing method as follows: h_1 = (2,1,1,1,0,2,2,1,0,1,1,0,1,2,1,2,0,2,1,1,0,1, 0,1,0,0, 0,1,2,1,0,1,0, 0,1,3,0,2,1,0,1,1,1,1,0,2,1,1,0, 1,0,3,1,2,1,1,0,2,1,0,1,0,0,3,1,1,1,1,0, 0,1,1,1,1,1,1, 4,1,1,1, 0,2,0, 1,1,1,1,1,0, 1,2,0,1, 1,1,1,1,0,1,0,1,0,1,1,3,2,1,2,1,1,0,0,1,1,0,1,4,0,0,0,3,1,0,3,3,3,0,3,2,4,1,0,2,4,1,1,0,3,1,0,1,1,0,1,2,0,0,1,0,0,1,1,1,2,1,2,0,1,0,1,0,2,4,1,3,1,1,1,1,1). The ideal signal for the C- and T-labeled fluorophores is: h_2 = (0,0,0,0,1,1,1,0,1,0,0,1,2,1,1,0,1,0,0,0,1,1,1,1,0,1,1,0,0,4,2,1,2,1,1,1,1,1,2,0,0,2,1,0,0,0,1,1,0,1,1,1,0,1,0,1,3,0,0,3,0,1,2,1,0,1,0,0,1,0,0,1,0,1,3,0,2,2,1,0,0,3,0,1,3,1,0,2,2,0,1,0,1,1,0,1,1,1,2,0,1,0,1,0,1,0,0,1,0,1,1,0,1,3,0,2,1,0,2,0,0,1,1,1,1,2,0,2,1,2,2,1,0,1,0,2,0,0,1,0,1,1,0,1,2,2,2,1,2,1,1,1,2,1,0,1,2,0,5,5,0,1,0,0,0,0,1,1,3,2,1,0,1,0,0,1,2,0,1,3,1,0,1,2,1,2,1,0,1,1,1,1,1,2,0,0,2,1,1,0).
The relevant parameters in this sequencing were estimated using the parameter estimation method described above. When constructing the transformation matrix, for accurate calculation, it is assumed that the lead and lag coefficients of M and K are different. For a transformation matrix T constructed according to given dephasing coefficients, the dephasing signal of the A- and G-labeled fluorophores is s_1 = T h_1, and the dephasing signal of the C- and T-labeled fluorophores is s_2 = T h_2. Let t be the number of sequencing reactions. Transformation functions are constructed for the A- and G-labeled fluorophores and for the C- and T-labeled fluorophores, respectively, wherein:
1.Wherein a is1And a2Unit signals which are the signals released by the A and G and C and T labeled fluorophores, respectively;
2.where b is called the attenuation coefficient;
3.wherein d is1、e1、d2And e2Refers to the overall offset of A, G, C and T, respectively;
4.where s is the out-of-phase signal.
In the parameter estimation, the correlation coefficient used is the pearson correlation coefficient, and the optimization method used is the gradient descent method. After 17 rounds of iterative calculation, the gradient descent meets the convergence condition, and the obtained lead coefficient of M is 0.0125, and the lag coefficient of M is 0.0067. The lead coefficient of K is 0.0126 and the lag coefficient of K is 0.0068. The unit signals for the signals released by the A and G and C and T labeled fluorophores were 519.8 and 480.7, respectively, with an attenuation coefficient of 0.9860, an overall shift for A of 164.5, and an overall shift for G of 133.2. The global offset for C is 140.7 and the global offset for T is 175.7. The correlation coefficient was 0.999964. The trend of all parameters in the iterative calculation process is shown in fig. 48.
Signal correction for two-color 2+2 sequencing
Primary two-color 2+2 sequencing experiment: g and T are added in every odd round, and A and C are added in every even round, wherein A and G mark fluorophores with the same color. C and T label fluorophores of the same color (different from the colors of A and G). The sequenced sequence was unknown. The raw sequencing signal f obtained in this sequencing, and the transformation function applied in example 4Andthe inverse function of (d) and the associated parametric transformation results in a phase-loss signal as shown in figure 49. Because a two-color sequencing method is adopted, 2 ideal signals, phase-loss signals, original sequencing signals and the like respectively correspond to A-labeled fluorescent groups, G-labeled fluorescent groups and C-labeled fluorescent groups and T-labeled fluorescent groups. It can be seen that there are many inverted triangle signals in fig. 49, indicating that in the out-of-phase signal (or phase mismatch) s, there are still many positions where the signal does not match the ideal signal.
Following the signal correction procedure, four iterations were performed to obtain the first-order phase-loss signal $s_1$, the second-order phase-loss signal $s_2$, the third-order phase-loss signal $s_3$ and the fourth-order phase-loss signal $s_4$. After rounding, $s_3$ and $s_4$ were equal, so the iteration was stopped and $s_4$ was output as the correction result. The fourth-order phase-loss signal is shown in FIG. 50, where an inverted triangle indicates that the signal intensity at that position does not match the ideal signal. As the iteration progresses, the inverted-triangle marks become fewer, indicating increasing accuracy. In the final correction result, the signals from the first 166 sequencing reactions were all corrected without error; the first correction error occurred at the 167th sequencing reaction.
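The stopping rule described here (iterate until two successive rounded estimates agree) can be sketched as follows. This is only an illustration of the loop structure: `correct_once` is a hypothetical stand-in for the correction update defined elsewhere in this document, and the toy update used in the example simply moves the estimate halfway toward a fixed target.

```python
def iterate_until_stable(raw, correct_once, max_iter=50):
    """Apply correct_once repeatedly until the rounded estimate stops changing.

    raw          -- the starting signal (list of floats)
    correct_once -- hypothetical update: (raw, current_estimate) -> new estimate
    Returns (rounded_estimate, number_of_iterations_performed).
    """
    previous = None
    current = list(raw)
    for iteration in range(1, max_iter + 1):
        current = correct_once(raw, current)
        rounded = [round(v) for v in current]
        if rounded == previous:
            return rounded, iteration
        previous = rounded
    raise RuntimeError("estimate did not stabilize within max_iter iterations")


# Toy update (an assumption, not the correction described in the text):
# move every value halfway toward a fixed target signal.
target = [2, 3, 1]


def toy_update(raw, estimate):
    return [(e + t) / 2 for e, t in zip(estimate, target)]


result, n_iter = iterate_until_stable([2.9, 2.2, 1.8], toy_update)
```

With a real update in place of `toy_update`, the returned iteration count would correspond to the order of the last phase-loss signal computed (four in the experiment above).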
Overall performance derived from a large number of sequences
To comprehensively evaluate the accuracy of sequence information read from the raw sequencing signal, five monochromatic 2+2 sequencing experiments were performed. In one aspect, 500 sequencing reactions were performed in each experiment. In each sequencing experiment, part of the tested DNA was used as a reference, and its sequence and raw sequencing signal were used for parameter estimation; the other part of the tested DNA was used as the sequencing sample. Two methods were used for signal correction: one is signal correction according to the method described herein, using the parameters estimated from the reference DNA; the other simply assumes a proportional relationship between the raw sequencing signal and the ideal signal and infers the DNA sequence information from it.
In these five sequencing experiments, the phase-loss coefficients estimated from the raw sequencing signal of the reference DNA were 0.001, 0.003, 0.005, 0.010 and 0.011, respectively (the lead and lag coefficients were set equal for parameter estimation). For each of the two correction methods, the number of the first sequencing reaction at which the corrected signal intensity did not match the ideal signal intensity (i.e., the length of the completely correct corrected signal) was recorded and plotted as a histogram (FIG. 51; error bars are standard deviations). When the phase-loss coefficient is 0.001, the corrected signal obtained from the simple proportional relationship shows correction errors in fewer than 100 sequencing reactions, whereas the method described herein gives a completely correct result. The accuracy of both methods decreases as the phase-loss coefficient increases. However, in one aspect, the length of the completely correct corrected signal obtained with the method described herein remains 3-5 times that obtained from the simple proportional relationship, which demonstrates a distinct advantage of the method in improving the accuracy and effective read length of DNA sequences read from the raw sequencing signal.
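The evaluation metric used here, the number of the first sequencing reaction whose corrected signal disagrees with the ideal signal, can be sketched as below. The function names and the toy signals are illustrative assumptions; the sketch assumes the ideal signal is an integer per reaction and that the corrected signal is compared after rounding.

```python
def first_error_reaction(corrected, ideal):
    """1-based number of the first reaction whose rounded corrected signal
    differs from the ideal signal, or None if every reaction matches."""
    for number, (c, i) in enumerate(zip(corrected, ideal), start=1):
        if round(c) != i:
            return number
    return None


def correct_length(corrected, ideal):
    """Number of leading reactions that were corrected without error."""
    first = first_error_reaction(corrected, ideal)
    if first is None:
        return min(len(corrected), len(ideal))
    return first - 1


# Toy signals (hypothetical): the third reaction rounds to 4 instead of 3.
ideal_signal = [2, 1, 3, 1]
corrected_signal = [2.1, 0.9, 3.6, 1.0]
```

In the two-color experiment above, this count would be 166, with the first error at reaction 167.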
Example 13: error correction code fluorescent DNA sequencing
Principle of degenerate base fluorescent sequencing:
In this example, a series of fluorescent sequencing substrates was developed in which nucleotide tetraphosphates (dN4P or dN, see FIG. 52a and 5A-5C) are end-labeled with the high-performance fluorophore Tokyo Green (TG). TG provides a higher fluorescence quantum yield (0.82 at 490 nm), a higher absorption coefficient, a higher on-off ratio and better photostability than previously reported fluorescent dyes. In the fluorescent sequencing-by-synthesis (SBS) procedure, single-stranded DNA templates were grafted onto the surface of a glass flow cell by solid-phase PCR (FIG. 23). Each template was then annealed with a sequencing primer whose 3' end serves as the starting point for the SBS reaction. In each sequencing cycle, the reaction mixture (Bst polymerase, alkaline phosphatase and fluorescent nucleotides) reacts with the immobilized primed DNA templates. When the polymerase incorporates the correct nucleotide at the primer end, the dye-triphosphate, which is in a non-fluorescent "dark" state, is released at the same time and is immediately switched to the highly fluorescent "bright" state by dephosphorylation. This fluorescent SBS reaction produces a natural DNA duplex, so the 3' end of the synthesized strand is not terminated and remains extendable. Substrates that can form correct Watson-Crick pairs at the primer end continue to be incorporated until the first mismatch is encountered.
This feature has been used to sequence 30-40 bases with a single-base flowsheet, in which one of the four substrates is introduced into the reaction in each cycle. In this example, a two-base flowsheet was used. For example, in the first cycle of sequencing (FIG. 52b), the K (dG & dT) reaction mixture is brought to a primed DNA template whose sequence begins ACTTGAAA. The DNA polymerase incorporates one dT and one dG to pair with the first two bases, AC, releasing two fluorophores, and then stops at the third base, T, because of a mismatch. In the next cycle, M, two dA and one dC pair with the next three bases, TTG, releasing three fluorophores. The reaction mixtures M and K are introduced alternately to react with the primed DNA template (FIG. 52c). The number of fluorophores generated in each cycle equals the number of bases extended.
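The worked example above can be reproduced with a short simulation of an ideal, error-free two-base flowsheet. This is a sketch under stated assumptions: the template is given in the order in which the polymerase reads it, extension within a cycle continues until the first base whose complement is not in the mixture, and the names `dpl_array`, `K_MIX` and `M_MIX` are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

K_MIX = {"G", "T"}  # K reaction mixture: dG and dT
M_MIX = {"A", "C"}  # M reaction mixture: dA and dC


def dpl_array(template, mixtures):
    """Number of bases extended in each cycle of an ideal flowsheet.

    template -- template bases in the order the polymerase reads them
    mixtures -- list of nucleotide sets, applied cyclically (e.g. [K_MIX, M_MIX])
    """
    dpl = []
    position = 0
    idle_cycles = 0
    cycle = 0
    while position < len(template):
        mixture = mixtures[cycle % len(mixtures)]
        extended = 0
        # Extend while the nucleotide complementary to the template is available.
        while position < len(template) and COMPLEMENT[template[position]] in mixture:
            extended += 1
            position += 1
        dpl.append(extended)
        idle_cycles = idle_cycles + 1 if extended == 0 else 0
        if idle_cycles >= len(mixtures):
            raise ValueError("no mixture can extend the template at this position")
        cycle += 1
    return dpl


k_first = dpl_array("ACTTGAAA", [K_MIX, M_MIX])
m_first = dpl_array("ACTTGAAA", [M_MIX, K_MIX])
```

Starting with the K mixture, the first two cycles give 2 and 3 extended bases, as in the text; starting with M instead produces a leading 0 because the first template base, A, requires dT.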
After polymerase extension is complete, the fluorescence signal is measured. The normalized fluorescence signal, which represents the number of bases extended in each cycle rather than their actual composition and order, is referred to as the degenerate polymer length (DPL). In FIG. 52c, a DPL array (0, 2, 3, 1, …) can be converted to a degenerate sequence (KKMMKKKM…), where M = A or C and K = G or T. In addition to this M-K two-base flowsheet, there are two further two-base flowsheets, R (A, G)-Y (C, T) and W (A, T)-S (C, G), with which the same template can be represented as the different degenerate sequences (YRRRYYYYRRYY…) and (WSWWSWWWW…). To obtain these three orthogonal degenerate sequences, a reset operation is required between sequencing rounds to denature the nascent strand and re-anneal the sequencing primer. Each actual base can be deduced from the three sequences by taking the intersection of the degenerate bases at that position. This sequencing method, by which sequencing errors can be detected and corrected, is known as error-correction code (ECC) sequencing.
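The conversion from DPL arrays to degenerate sequences, and the base-by-base intersection, can be sketched as follows. The example assumes an error-free read of the nascent strand TGAACTTT (the complement of the template ACTTGAAA used above); the function names and the use of "N" for an empty intersection are illustrative choices.

```python
DEGENERATE = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
}


def expand(dpl, first, second):
    """Turn a DPL array into a degenerate sequence for an alternating flowsheet.

    first/second -- degenerate letters of the mixtures used in odd/even cycles.
    """
    letters = []
    for cycle, length in enumerate(dpl):
        letters.append((first if cycle % 2 == 0 else second) * length)
    return "".join(letters)


def intersect(mk, ry, ws):
    """Deduce each base as the intersection of three degenerate bases."""
    bases = []
    for m, r, w in zip(mk, ry, ws):
        common = DEGENERATE[m] & DEGENERATE[r] & DEGENERATE[w]
        # One letter from each pair shares either exactly one base or none.
        bases.append(next(iter(common)) if len(common) == 1 else "N")
    return "".join(bases)


mk_seq = expand([2, 3, 3], "K", "M")        # K-first M-K round
ry_seq = expand([1, 3, 4], "Y", "R")        # Y-first R-Y round
ws_seq = expand([1, 1, 2, 1, 3], "W", "S")  # W-first W-S round
decoded = intersect(mk_seq, ry_seq, ws_seq)
```

An "N" in the output marks a position where the three rounds disagree, which is the error signal exploited by ECC decoding.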
Degenerate base recognition
In this example, a laboratory prototype was built for fluorescent sequencing with a two-base flowsheet. As in other SBS methods, decay of the fluorescence intensity is inevitable. This decay, mainly due to incomplete reactions and loss of template or primer, poses a serious challenge for base recognition (FIG. 53a). In a typical fluorescence degenerate sequencing run, the decrease in fluorescence intensity can be normalized with an exponential decay function, with a signal drop of about 1% between reaction cycles. The normalized fluorescence signal in each cycle should then round to the DPL (FIG. 53b). However, the agreement between intensity and DPL holds only for about the first 30 cycles, after which phase loss can no longer be ignored, i.e., the signal of each cycle becomes significantly affected by adjacent cycles.
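The decay normalization described above can be sketched as a division by an exponential envelope followed by rounding. The parameter names (`unit_signal`, `decay`) and the default of 0.99 per cycle, taken from the "about 1%" figure, are assumptions for illustration; dephasing is deliberately ignored, so the sketch only applies to the early cycles.

```python
def normalize(raw, unit_signal, decay=0.99):
    """Divide each cycle's raw intensity by the decayed single-base signal."""
    if unit_signal <= 0 or not 0 < decay <= 1:
        raise ValueError("unit_signal must be positive and 0 < decay <= 1")
    return [value / (unit_signal * decay ** cycle) for cycle, value in enumerate(raw)]


def call_dpl(raw, unit_signal, decay=0.99):
    """Round the normalized signal to the nearest non-negative integer DPL."""
    return [max(0, round(v)) for v in normalize(raw, unit_signal, decay)]


# Toy run (hypothetical): known DPLs, 1% decay per cycle, +/-5% intensity noise.
true_dpl = [2, 3, 3, 1, 0, 4]
raw_signal = [
    n * 500.0 * 0.99 ** i * (1.05 if i % 2 else 0.95)
    for i, n in enumerate(true_dpl)
]
called = call_dpl(raw_signal, unit_signal=500.0)
```

With noise of this size the rounded values recover the DPL array; once dephasing contributes comparable deviations, rounding alone is no longer sufficient.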
Loss of phase, i.e., the lack of synchronization of the primer set (primer ensemble), has two major components: "lag" and "lead". The lag strand is mainly caused by incomplete extension, while the lead strand in double-base sequencing is mainly due to accidental extension by contaminating bases. In a given cycle, the fluorescent signal produced by the primer sets that are not synchronized is different from the corresponding DPL. Accumulation of lost phases will gradually reduce the correlation between the sequencing signal and the DPL array.
However, it has been shown that the cumulative effects of signal dephasing and decay can be estimated well with a first-order reaction scheme, with residuals between the estimate and the measurement below 0.2. In addition, a sequence-independent iterative dephasing correction algorithm was developed to infer the DPL array for each round of sequencing. With dephasing correction, the low-error range of the DPL array is extended significantly, from the first 50 cycles (about 100 nt) to more than 150 cycles (about 300 nt), beyond which crowded adjacent errors cannot be corrected accurately by the dephasing algorithm (FIG. 53c). This correction method can also be applied to the two other orthogonal degenerate sequences obtained from the same template with the RY and WS flowsheets (FIG. 53d-e). Each of the three degenerate sequences harbors rare errors (< 1%) that are unlikely to fall at the same base position.
ECC sequencing information communication model
In one aspect, a DPL array collected from one round of two-base sequencing cannot provide an unambiguous DNA sequence. In the absence of sequencing errors, the information entropy of a random DNA sequence of length L nt is 2L bits, whereas the information entropy of its DPL array is only L bits.
However, due to experimental sequencing errors, the entropy of the DPL array (referred to as L) is lower than L bits. Two DPL arrays containing this error provide insufficient nodal information to infer DNA sequences (L + L-0< 2L). At our current experimental error rates, additional DPL arrays were introduced to provide mutual/redundant information (2L <3L), which can be used to detect errors and infer unambiguous sequences.
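For the error-free case, the bit counts in the two preceding paragraphs can be checked by brute force on short sequences. The sketch below assumes uniformly random sequences, so that the entropy equals the base-2 logarithm of the number of distinct outcomes, and it works on degenerate sequences rather than DPL arrays; the table and function names are illustrative.

```python
from itertools import product
from math import log2

MK = {"A": "M", "C": "M", "G": "K", "T": "K"}
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
WS = {"A": "W", "T": "W", "C": "S", "G": "S"}


def degenerate(seq, table):
    """Rewrite a DNA sequence in one degenerate two-letter alphabet."""
    return "".join(table[base] for base in seq)


def entropy_bits(length, tables):
    """Bits of information carried jointly by the given degenerate views,
    for uniformly random DNA sequences of the given length."""
    views = {
        tuple(degenerate(seq, table) for table in tables)
        for seq in map("".join, product("ACGT", repeat=length))
    }
    return log2(len(views))


one_round = entropy_bits(4, [MK])             # L bits
two_rounds = entropy_bits(4, [MK, RY])        # 2L bits: already unambiguous
three_rounds = entropy_bits(4, [MK, RY, WS])  # still 2L bits: third round is redundant
```

Without errors, two rounds already carry the full 2L bits, so the third round adds only redundancy; that redundancy is what makes error detection possible once the measured arrays carry less than L bits each.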
An information communication model comprising an encoder, a decoder and a communication channel was also established to describe double-base sequencing and its inherent error detection and correction properties (FIG. 54a). The three orthogonal double-base flowsheets encode the DNA sequence, i.e., the information source, into three original DPL arrays (n). Analysis of the DPL distribution in the human, yeast and E. coli genomes showed that it is close to $p(n) = 1/2^n$, i.e., the theoretical distribution of DPLs from random DNA sequences. FIG. 54b also shows that only 0.39% of DPLs are greater than 8.
The sequencing reaction is considered a communication channel through which sequencing errors are inevitably introduced into the received information. For example, in cycle 3 of the R-Y round, the original DPL n = 3 is erroneously measured as m = 4 (a 3-to-4 insertion error, FIG. 54a). The agreement between the original and measured DPLs was analyzed in 42 rounds of double-base sequencing data: 5503 of 5609 (98.1%) original DPLs (n ≤ 9) were propagated faithfully (FIG. 54c).
The measured DPL arrays are rewritten as degenerate base sequences, and a codeword is defined as the 3-tuple of degenerate bases at the same position of the three degenerate base sequences, in the order MK, RY and WS. In the case of FIG. 54a, the first few codewords are (KYW), (KRS), (MRW), etc. Such codewords can be further converted into a binary format: M, R and W are assigned logic 1, and K, Y and S are assigned logic 0. Each degenerate sequence from a single flowsheet thus becomes a bit string (BS). The parity of a codeword is defined as the XOR of its three bits (FIG. 54d). If and only if the parity is logic 1, the degenerate bases in the codeword have exactly one base in common, and this common base is taken as the decoding result. Specifically, 111 (MRW) is decoded as base A, 100 (MYS) as C, 010 (KRS) as G, and 001 (KYW) as T. The Hamming distance between any two of these four legal codewords is 2. The remaining four codewords, which have parity logic 0 (no common base), are illegal and indicate a sequencing error. In the case shown in FIG. 54a, the DNA sequence is decoded from the BSs, and the 3-to-4 error is captured by the decoder as the illegal codeword MRS (110) at the 5th codeword during the parity check. In general, a memoryless code with a Hamming distance of 2 can detect errors but cannot correct them. However, double-base sequencing was found to yield BSs that are not memoryless but context-dependent, which provides additional information for error correction beyond error detection.
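The 0.39% figure is consistent with the theoretical distribution $p(n) = 1/2^n$, as the short check below shows. It assumes DPLs of random sequences take values n = 1, 2, 3, …, so that the probabilities sum to 1; the function names are illustrative.

```python
def dpl_prior(n):
    """Theoretical probability of a DPL of length n (n >= 1) in a random sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 0.5 ** n


def tail_probability(n):
    """Probability that a DPL is longer than n: sum of 1/2^k for k > n."""
    return 0.5 ** n


percent_longer_than_8 = 100 * tail_probability(8)
```

The tail beyond n = 8 is 1/256, i.e. about 0.39% of all DPLs.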
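The binary codeword scheme described above can be written out directly. The bit assignment, the parity rule and the four legal codewords are taken from the text; the function names and the use of `None` for an illegal codeword are illustrative choices.

```python
from itertools import combinations, product

BIT = {"M": 1, "R": 1, "W": 1, "K": 0, "Y": 0, "S": 0}
LEGAL = {(1, 1, 1): "A", (1, 0, 0): "C", (0, 1, 0): "G", (0, 0, 1): "T"}


def parity(codeword):
    """XOR of the three bits of a codeword (MK bit, RY bit, WS bit)."""
    return codeword[0] ^ codeword[1] ^ codeword[2]


def decode(mk, ry, ws):
    """Decode three aligned degenerate sequences; None marks an illegal codeword."""
    bases = []
    for m, r, w in zip(mk, ry, ws):
        codeword = (BIT[m], BIT[r], BIT[w])
        bases.append(LEGAL[codeword] if parity(codeword) == 1 else None)
    return bases


def hamming(a, b):
    """Number of positions at which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


first_codewords = decode("KKM", "YRR", "WSW")  # (KYW), (KRS), (MRW)
illegal = decode("M", "R", "S")                # MRS / 110
```

Every pair of legal codewords differs in exactly two bits, so any single bit error turns a legal codeword into an illegal one, which is why a parity check detects it.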
Sequence decoding using dynamic programming
Error correction decoding is performed by a dynamic programming based algorithm. Double-base sequencing errors, i.e., incorrectly measured DPLs, are easily identified in the codewords by parity checking. These errors are characteristically bit insertions or deletions in the BS rather than bit flips. When an error is found, it can be corrected by changing the corresponding DPL according to the BS context. Errors must be corrected in order, starting from the first, because a DPL change corresponds to a shift of the BS that affects all downstream codewords.
An exemplary embodiment is shown in FIG. 55a. The first illegal codeword, detected at codeword 5, has three possible error sources: (1) an insertion error in cycle 2 of the M-K round, with the original DPL (n = 2) incorrectly measured as 3; (2) an insertion error in cycle 2 of the R-Y round, with the original DPL (n = 3) measured as 4; and (3) a deletion error in cycle 3 of the W-S round, with the original DPL (n = 3) measured as 2. The insertion error in cycle 2 of the R-Y round is corrected by shifting BS2 left from the 6th bit. With this shift, many of the downstream illegal codewords simultaneously pass the parity check. A second error is then detected at base 14. This deletion error, along with the remaining illegal codewords, is resolved by shifting BS1 right from bit 14. In this case, the 9 illegal codewords are made legal by only two correction operations, yielding an error-free decoded DNA sequence.
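The link between a DPL change and a bit-string shift can be made concrete with a small sketch. It assumes a bit string is built by writing each DPL as a run of identical bits that alternates between cycles; the function names, the 0-based cycle index and the toy arrays are illustrative and do not reproduce the data of FIG. 55a.

```python
def dpl_to_bits(dpl, first_bit):
    """Expand a DPL array into a bit string: runs of alternating bits."""
    bits = []
    bit = first_bit
    for length in dpl:
        bits.extend([bit] * length)
        bit ^= 1
    return bits


def correct_insertion(dpl, cycle):
    """Undo an insertion error: the measured DPL at `cycle` (0-based) was 1 too long."""
    if dpl[cycle] < 1:
        raise ValueError("cannot shorten a DPL of 0")
    fixed = list(dpl)
    fixed[cycle] -= 1
    return fixed


def correct_deletion(dpl, cycle):
    """Undo a deletion error: the measured DPL at `cycle` (0-based) was 1 too short."""
    fixed = list(dpl)
    fixed[cycle] += 1
    return fixed


# Toy round (hypothetical): the true DPL 3 in the second cycle was measured as 4.
measured = [1, 4, 2]
measured_bits = dpl_to_bits(measured, first_bit=0)
corrected_bits = dpl_to_bits(correct_insertion(measured, 1), first_bit=0)
```

Decrementing one DPL removes a single bit and moves every downstream bit one position to the left, which is why corrections have to be applied in order from the first error.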
In fact, there are many possible combinations of operations to decode a sequence. Furthermore, the number of combinations increases exponentially with read length, making it impossible in practice to obtain the optimal sequence by counting all possible combinations.
Therefore, dynamic programming is employed to determine the globally optimal decoded sequence. The codeword space is constructed as a 3-dimensional matrix with the three BSs as its axes. Each node (i, j, k) represents the codeword consisting of the i-th bit of BS1, the j-th bit of BS2 and the k-th bit of BS3, and is classified as either pass or error according to the parity check (FIG. 55c). Any path that starts from node (1, 1, 1) and passes only through nodes that pass the parity check represents a possible decoded DNA sequence. The probability of a given path in codeword space can be calculated with Bayes' formula. The prior probability of a DPL of length n is $1/2^n$ (FIG. 54b), and the probability P(m|n) that a DPL of length n is measured as length m can be obtained from the reference sequence and the data compared with the theoretical values (FIG. 54c). Then, for round r (r being MK, RY or WS), the probability $P_r(n_i|m_i)$ that the i-th measured DPL, of length $m_i$, was generated by a DPL of length $n_i$ follows from Bayes' formula with this prior and likelihood.
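Assuming the standard form of Bayes' rule, with the prior $P(n) = 1/2^n$ and the measured likelihood $P(m \mid n)$ described above, this posterior can be sketched as follows (the normalizing sum over all candidate DPL lengths $n'$ is an assumption):

```latex
P_r(n_i \mid m_i) = \frac{P(m_i \mid n_i)\, 2^{-n_i}}{\sum_{n'} P(m_i \mid n')\, 2^{-n'}}
```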
The probability $P_r$ that the measured DPL array of round r was generated from a particular DNA sequence is the cumulative product of the $P_r(n_i|m_i)$. Under the assumption that the three rounds of ECC sequencing are independent of each other, the probability of a given path is
$P_{Path} = P_{MK} \cdot P_{RY} \cdot P_{WS}$
The probability of each path in the codeword space can be calculated in the same way (FIG. 55c). Dynamic programming is used to find the path with the maximum probability.
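The path probability described above can be sketched in a few lines. This is a minimal illustration, not the decoder itself: the likelihood table `toy_likelihood` is invented for the example (the text obtains P(m|n) from reference data), the posterior uses the standard form of Bayes' rule with the prior $1/2^n$, and the search over paths by dynamic programming is not implemented.

```python
import math


def prior(n):
    """Prior probability of a DPL of length n (n >= 1)."""
    return 0.5 ** n


def toy_likelihood(m, n):
    """Hypothetical P(m | n): mostly correct, occasionally off by one."""
    if m == n:
        return 0.98
    if abs(m - n) == 1:
        return 0.01
    return 0.0


def posterior(n, m, likelihood, max_len=12):
    """P(n | m) by Bayes' rule, normalized over candidate lengths 1..max_len."""
    evidence = sum(likelihood(m, k) * prior(k) for k in range(1, max_len + 1))
    if evidence == 0:
        raise ValueError("measured length has zero probability under the model")
    return likelihood(m, n) * prior(n) / evidence


def round_log_prob(proposed, measured, likelihood):
    """Log of the product of P(n_i | m_i) over one sequencing round."""
    total = 0.0
    for n, m in zip(proposed, measured):
        p = posterior(n, m, likelihood)
        total += math.log(p) if p > 0 else -math.inf
    return total


def path_log_prob(rounds, likelihood):
    """Log of P_MK * P_RY * P_WS, assuming the three rounds are independent.

    rounds -- list of (proposed_dpl_array, measured_dpl_array) pairs.
    """
    return sum(round_log_prob(p, m, likelihood) for p, m in rounds)


unchanged = [([2, 3], [2, 3])] * 3
one_edit = [([2, 4], [2, 3]), ([2, 3], [2, 3]), ([2, 3], [2, 3])]
```

Working in log space keeps the product numerically stable for long reads; a decoder would compare such scores only among paths whose codewords all pass the parity check.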
Decoding to improve ECC sequencing accuracy
ECC decoding can effectively correct errors in long sequencing reads. Fourteen longer ECC experiments of three rounds each were performed to sequence three different templates from lambda phage. Before ECC decoding, the sequencing signal occasionally contains small errors. After decoding, these errors were completely eliminated within the first 200 bp and significantly reduced between 200 and 250 bp (FIGS. 56a-c). For example, in FIG. 56a, although the first sequencing error occurred at base 39 of the RY round, this error was successfully corrected by ECC decoding, along with several other sequencing errors in the WS round. The first error after ECC decoding is pushed back to beyond 270 bp.
The ECC decoding algorithm is able to identify complex error patterns accurately. Adjacent errors in the same or different rounds are more challenging to correct than scattered sequencing errors, because the decoding algorithm needs more correction operations, and more elaborate ones. When the parity check among the three rounds of sequencing signals fails, the algorithm calculates the probabilities of the different operations.
In one case, two sequencing errors occurred within three cycles of the RY round (a 1-base deletion in cycle 22 and a 1-base insertion in cycle 24). At least two alternative correction paths, each comprising two correction operations, can repair these errors (FIG. 56b). The first applies a 1-to-2 insertion correction and a 2-to-1 deletion correction (p(2|1) × p(1|2) = 0.00015), while the second applies a 1-to-2 insertion correction and a 3-to-2 deletion correction (p(2|1) × p(3|2) = 0.00022). The second is therefore preferred because of its higher probability.
In another case, two adjacent long-DPL sequencing errors occurred in the MK and RY rounds, respectively. A one-base left shift in the WS round could also restore parity legality (FIG. 56c). However, because long DPLs are more error-prone, the algorithm, by comparing the probabilities of the different options, prefers to correct the two longer DPLs rather than the shorter one.
Fluorescence degenerate sequencing is inherently highly accurate. The error frequency for different DPLs was analyzed in 50-nt windows along the sequencing reads (FIG. 56d). Without ECC correction, 106 errors were found in 11062 bases. As with other sequencing methods, these errors are more likely to occur at longer DPLs and at later positions in the read. See Forgetta et al. (2013) Journal of Biomolecular Techniques, 24(1), 39-49; and Loman et al. (2012) Nature Biotechnology, 30(5), 434-439. The raw accuracy is 99.82% within the first 100 nt and 99.45% within the first 200 nt. At a 99% accuracy cutoff, read lengths in excess of 250 nt can be achieved.
ECC decoding eliminates most sequencing errors. The high raw accuracy of the fluorescence degenerate sequencing method is the basis on which ECC correction completely eliminates all errors in the first 200 nt (including errors in DPLs of up to 9 nt), with an estimated upper bound on the error rate as low as 0.034%. In addition, ECC decoding reduces the cumulative error rate over 250 nt from 0.96% to 0.33%.
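The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that the 106 errors in 11062 bases cover the full 250-nt reads, which is how the 0.96% raw cumulative error rate appears to arise; the function name is illustrative.

```python
def error_rate_percent(errors, bases):
    """Error rate as a percentage of the bases examined."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    return 100.0 * errors / bases


raw_rate = error_rate_percent(106, 11062)  # raw errors over all bases
raw_accuracy = 100.0 - raw_rate            # corresponding raw accuracy
fold_reduction = 0.96 / 0.33               # effect of ECC decoding over 250 nt
```

Under this assumption the raw error rate rounds to 0.96%, and ECC decoding lowers the 250-nt cumulative error rate by a factor of roughly three.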
Sequence listing
<110> Seine Biotechnology (Beijing) Ltd
<120> method for obtaining and correcting biological sequence information
<130>757272000140
<140>Not Yet Assigned
<141>Concurrently Herewith
<150>CN201610899880.X
<151>2016-10-14
<150>CN201510944878.5
<151>2015-12-11
<150>CN201510815685.X
<151>2015-11-18
<150>CN201510822361.9
<151>2015-11-18
<160>41
<170>FastSEQ for Windows Version 4.0
<210>1
<211>15
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<400>1
aactttggat tgcct 15
<210>2
<211>20
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<400>2
tgaactttag ccacggagta 20
<210>3
<211>31
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>PS3-T10-P7
<400>3
tttttttttt caagcagaag acggcatacg a 31
<210>4
<211>20
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>P5
<400>4
aatgatacgg cgaccaccga 20
<210>5
<211>21
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>P7
<400>5
caagcagaag acggcatacg a 21
<210>6
<211>58
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>P5-SeqP1
<400>6
aatgatacgg cgaccaccga gatctacact ctttccctac acgacgctct tccgatct 58
<210>7
<211>21
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>FAM-P7rc
<400>7
tcgtatgccg tcttctgctt g 21
<210>8
<211>35
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>FAM-T-SeqP1
<400>8
ttacactctt tccctacacg acgctcttcc gatct 35
<210>9
<211>55
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L10115-301-f
<400>9
acactctttc cctacacgac gctcttccga tctgtgttcg acggtgagct gagtt 55
<210>10
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L10115-301-r
<400>10
gtgactggag ttcagacgtg tgctcttccg atctcaagcc ctgccgcttt ctgc 54
<210>11
<211>55
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L4418-305-f
<400>11
acactctttc cctacacgac gctcttccga tctgtgacag cagagctgcg taatc 55
<210>12
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L4418-305-r
<400>12
gtgactggag ttcagacgtg tcatgcgatc atatgagtac ggctgcagcg cccg 54
<210>13
<211>57
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L718-308-f
<400>13
acactctttc cctacacgac gctcttccga tcttatcgaa cagtcaggtt aacaggc 57
<210>14
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L718-308-r
<400>14
gtgactggag ttcagacgtg tcatgcgatc atatcaacca gataagggtg ttgc 54
<210>15
<211>53
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L501-500-f
<400>15
acactctttc cctacacgac gctcttccga tctactccgc tgaagtggtg gaa 53
<210>16
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L501-500-r
<400>16
gtgactggag ttcagacgtg tcatgcgatc atatttatgc tctataaagt aggc 54
<210>17
<211>53
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L30501-500-f
<400>17
acactctttc cctacacgac gctcttccga tctcactcac aacaatgagt ggc 53
<210>18
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L30501-500-r
<400>18
gtgactggag ttcagacgtg tcatgcgatc atatcacgga atgcattttt ctgg 54
<210>19
<211>53
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L46499-500-f
<400>19
acactctttc cctacacgac gctcttccga tctgcctaaa gtaataaaac cga 53
<210>20
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L46499-500-r
<400>20
gtgactggag ttcagacgtg tcatgcgatc atatggcata atgcaatacg tgta 54
<210>21
<211>53
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L8703-1012-f
<400>21
acactctttc cctacacgac gctcttccga tctaagagct ggacagcgat acc 53
<210>22
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L8703-1012-r
<400>22
gtgactggag ttcagacgtg tgctcttccg atctcatcgc tgactctccg gatt 54
<210>23
<211>57
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L718-208-f
<400>23
acactctttc cctacacgac gctcttccga tcttatcgaa cagtcaggtt aacaggc 57
<210>24
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L718-208-r
<400>24
gtgactggag ttcagacgtg tgctcttccg atcttcgctg cccatcgcat tcat 54
<210>25
<211>55
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp1-L10115-201-f
<400>25
acactctttc cctacacgac gctcttccga tctgtgttcg acggtgagct gagtt 55
<210>26
<211>54
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>Adp2-L10115-201-r
<400>26
gtgactggag ttcagacgtg tgctcttccg atctgctgaa aaacaggctg agca 54
<210>27
<211>51
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>P7-Adp2-r
<400>27
caagcagaag acggcatacg agatactgac gtgactggag ttcagacgtg t 51
<210>28
<211>45
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>P5-Adp1-f
<400>28
aatgatacgg cgaccaccga gatctacact ctttccctac acgac 45
<210>29
<211>201
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L10115-201
<400>29
gtgttcgacg gtgagctgag ttttgccctg aaactggcgc gtgagatggg gcgacccgac 60
tggcgtgcca tgcttgccgg gatgtcatcc acggagtatg ccgactggca ccgcttttac 120
agtacccatt attttcatga tgttctgctg gatatgcact tttccgggct gacgtacacc 180
gtgctcagcc tgtttttcag c 201
<210>30
<211>193
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L718-208
<400>30
tatcgaacag tcaggttaac aggctgcggc attttgtccg cgccgggctt cgctcactgt 60
tcaggccgga gccacagacc gccgttgaat gggcggatgc taattactat ctcccgaaag 120
aatccgcata ccaggaaggg cgctgggaaa cactgccctt tcagcgggcc atcatgaatg 180
cgatgggcag cga 193
<210>31
<211>308
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L10115-301
<400>31
tatcgaacag tcaggttaac aggctgcggc attttgtccg cgccgggctt cgctcactgt 60
tcaggccgga gccacagacc gccgttgaat gggcggatgc taattactat ctcccgaaag 120
aatccgcata ccaggaaggg cgctgggaaa cactgccctt tcagcgggcc atcatgaatg 180
cgatgggcag cgactacatc cgtgaggtga atgtggtgaa gtctgcccgt gtcggttatt 240
ccaaaatgct gctgggtgtt tatgcctact ttatagagca taagcagcgc aacaccctta 300
tctggttg 308
<210>32
<211>305
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L4418-305
<400>32
gtgacagcag agctgcgtaa tctcccgcat attgccagca tggcctttaa tgagccgctg 60
atgcttgaac ccgcctatgc gcgggttttc ttttgtgcgc ttgcaggcca gcttgggatc 120
agcagcctga cggatgcggt gtccggcgac agcctgactg cccaggaggc actcgcgacg 180
ctggcattat ccggtgatga tgacggacca cgacaggccc gcagttatca ggtcatgaac 240
ggcatcgccg tgctgccggt gtccggcacg ctggtcagcc ggacgcgggc gctgcagccg 300
tactc 305
<210>33
<211>303
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L9730-303
<400>33
catttgaaca taacggtgtg accgtcacgc tttctgaact gtcagccctg cagcgcattg 60
agcatctcgc cctgatgaaa cggcaggcag aacaggcgga gtcagacagc aaccggaagt 120
ttactgtgga agacgccatc agaaccggcg cgtttctggt ggcgatgtcc ctgtggcata 180
accatccgca gaagacgcag atgccgtcca tgaatgaagc cgttaaacag attgagcagg 240
aagtgcttac cacctggccc acggaggcaa tttctcatgc tgaaaacgtg gtgtaccggc 300
tgt 303
<210>34
<211>301
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L718-308
<400>34
gtgttcgacg gtgagctgag ttttgccctg aaactggcgc gtgagatggg gcgacccgac 60
tggcgtgcca tgcttgccgg gatgtcatcc acggagtatg ccgactggca ccgcttttac 120
agtacccatt attttcatga tgttctgctg gatatgcact tttccgggct gacgtacacc 180
gtgctcagcc tgtttttcag cgatccggat atgcatccgc tggatttcag tctgctgaac 240
cggcgcgagg ctgacgaaga gcctgaagat gatgtgctga tgcagaaagc ggcagggctt 300
g 301
<210>35
<211>497
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L29732-497
<400>35
tactcaaccc gatgtttgag tacggtcatc atctgacact acagactctg gcatcgctgt 60
gaagacgacg cgaaattcag cattttcaca agcgttatct tttacaaaac cgatctcact 120
ctcctttgat gcgaatgcca gcgtcagaca tcatatgcag atactcacct gcatcctgaa 180
cccattgacc tccaaccccg taatagcgat gcgtaatgat gtcgatagtt actaacgggt 240
cttgttcgat taactgccgc agaaactctt ccaggtcacc agtgcagtgc ttgataacag 300
gagtcttccc aggatggcga acaacaagaa actggtttcc gtcttcacgg acttcgttgc 360
tttccagttt agcaatacgc ttactcccat ccgagataac accttcgtaa tactcacgct 420
gctcgttgag ttttgatttt gctgtttcaa gctcaacacg cagtttccct actgttagcg 480
caatatcctc gttctcc 497
<210>36
<211>500
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L501-500
<400>36
actccgctga agtggtggaa accgcattct gtactttcgt gctgtcgcgg atcgcaggtg 60
aaattgccag tattctcgac gggctccccc tgtcggtgca gcggcgtttt ccggaactgg 120
aaaaccgaca tgttgatttc ctgaaacggg atatcatcaa agccatgaac aaagcagccg 180
cgctggatga actgataccg gggttgctga gtgaatatat cgaacagtca ggttaacagg 240
ctgcggcatt ttgtccgcgc cgggcttcgc tcactgttca ggccggagcc acagaccgcc 300
gttgaatggg cggatgctaa ttactatctc ccgaaagaat ccgcatacca ggaagggcgc 360
tgggaaacac tgccctttca gcgggccatc atgaatgcga tgggcagcga ctacatccgt 420
gaggtgaatg tggtgaagtc tgcccgtgtc ggttattcca aaatgctgct gggtgtttat 480
gcctacttta tagagcataa 500
<210>37
<211>500
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L30501-500
<400>37
cactcacaac aatgagtggc agatatagcc tggtggttca ggcggcgcat ttttattgct 60
gtgttgcgct gtaattcttc tatttctgat gctgaatcaa tgatgtctgc catctttcat 120
taatccctga actgttggtt aatacgcttg agggtgaatg cgaataataa aaaaggagcc 180
tgtagctccc tgatgatttt gcttttcatg ttcatcgttc cttaaagacg ccgtttaaca 240
tgccgattgc caggcttaaa tgagtcggtg tgaatcccat cagcgttacc gtttcgcggt 300
gcttcttcag tacgctacgg caaatgtcat cgacgttttt atccggaaac tgctgtctgg 360
ctttttttga tttcagaatt agcctgacgg gcaatgctgc gaagggcgtt ttcctgctga 420
ggtgtcattg aacaagtccc atgtcggcaa gcataagcac acagaatatg aagcccgctg 480
ccagaaaaat gcattccgtg 500
<210>38
<211>500
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L46499-500
<400>38
gcctaaagta ataaaaccga gcaatccatt tacgaatgtt tgctgggttt ctgttttaac 60
aacattttct gcgccgccac aaattttggc tgcatcgaca gttttcttct gcccaattcc 120
agaaacgaag aaatgatggg tgatggtttc ctttggtgct actgctgccg gtttgttttg 180
aacagtaaac gtctgttgag cacatcctgt aataagcagg gccagcgcag tagcgagtag 240
catttttttc atggtgttat tcccgatgct ttttgaagtt cgcagaatcg tatgtgtaga 300
aaattaaaca aaccctaaac aatgagttga aatttcatat tgttaatatt tattaatgta 360
tgtcaggtgc gatgaatcgt cattgtattc ccggattaac tatgtccaca gccctgacgg 420
ggaacttctc tgcgggagtg tccgggaata attaaaacga tgcacacagg gtttagcgcg 480
tacacgtatt gcattatgcc 500
<210>39
<211>1011
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<220>
<223>L8703-1012
<400>39
aagagctgga cagcgatacc tggcaggcgg agctgcatat cgaagttttc ctgcctgctc 60
aggtgccgga ttcagagctg gatgcgtgga tggagtcccg gatttatccg gtgatgagcg 120
atatcccggc actgtcagat ttgatcacca gtatggtggc cagcggctat gactaccggc 180
gcgacgatga tgcgggcttg tggagttcag ccgatctgac ttatgtcatt acctatgaaa 240
tgtgaggacg ctatgcctgt accaaatcct acaatgccgg tgaaaggtgc cgggaccacc 300
ctgtgggttt ataaggggag cggtgaccct tacgcgaatc cgctttcaga cgttgactgg 360
tcgcgtctgg caaaagttaa agacctgacg cccggcgaac tgaccgctga gtcctatgac 420
gacagctatc tcgatgatga agatgcagac tggactgcga ccgggcaggg gcagaaatct 480
gccggagata ccagcttcac gctggcgtgg atgcccggag agcaggggca gcaggcgctg 540
ctggcgtggt ttaatgaagg cgatacccgt gcctataaaa tccgcttccc gaacggcacg 600
gtcgatgtgt tccgtggctg ggtcagcagt atcggtaagg cggtgacggc gaaggaagtg 660
atcacccgca cggtgaaagt caccaatgtg ggacgtccgt cgatggcaga agatcgcagc 720
acggtaacag cggcaaccgg catgaccgtg acgcctgcca gcacctcggt ggtgaaaggg 780
cagagcacca cgctgaccgt ggccttccag ccggagggcg taaccgacaa gagctttcgt 840
gcggtgtctg cggataaaac aaaagccacc gtgtcggtca gtggtatgac catcaccgtg 900
aacggcgttg ctgcaggcaa ggtcaacatt ccggttgtat ccggtaatgg tgagtttgct 960
gcggttgcag aaattaccgt caccgccagt taatccggag agtcagcgat g 1011
<210>40
<211>32
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<400>40
cctgtatgac cgtattccgg gtcctgtcgg ta 32
<210>41
<211>425
<212>DNA
<213>Artificial Sequence
<220>
<223>Synthetic
<400>41
aagagctgga cagcgatacc tggcaggcgg agctgcatat cgaagttttc ctgcctgctc 60
aggtgccgga ttcagagctg gatgcgtgga tggagtcccg gatttatccg gtgatgagcg 120
atatcccggc actgtcagat ttgatcacca gtatggtggc cagcggctat gactaccggc 180
gcgacgatga tgcgggcttg tggagttcag ccgatctgac ttatgtcatt acctatgaaa 240
tgtgaggacg ctatgcctgt accaaatcct acaatgccgg tgaaaggtgc cgggaccacc 300
ctgtgggttt ataaggggag cggtgaccct tacgcgaatc cgctttcaga cgttgactgg 360
tcgcgtctgg caaaagttaa agacctgacg cccggcgaac tgaccgctga gtcctatgac 420
gacag 425
Claims (121)
1. A method for obtaining sequence information of a polynucleotide of interest, the method comprising:
a) providing a first sequencing reagent to a target polynucleotide in the presence of a first polynucleotide replication catalyst, wherein the first sequencing reagent comprises at least two different nucleotide monomers each conjugated to a first label, and the nucleotide monomer/first label conjugate is substantially non-fluorescent until after incorporation of the nucleotide monomers into the target polynucleotide according to complementarity with the target polynucleotide, wherein the first labels of the at least two different nucleotide monomers are the same or different; and
b) providing a second sequencing reagent to the target polynucleotide in the presence of a second polynucleotide replication catalyst, wherein the second sequencing reagent comprises one or more nucleotide monomers each conjugated to a second label, and the nucleotide monomer/second label conjugate is substantially non-fluorescent until after incorporation of the nucleotide monomer into the target polynucleotide according to complementarity with the target polynucleotide, at least one of the one or more nucleotide monomers being different from the nucleotide monomer present in the first sequencing reagent, and wherein the second sequencing reagent is provided subsequent to providing the first sequencing reagent, and
c) obtaining sequence information of at least a portion of the target polynucleotide by detecting fluorescent emissions resulting from the first label and the second label after incorporation of the nucleotide monomers into the polynucleotide in steps a) and b).
2. A method according to claim 1 for obtaining sequence information for at least part of a single polynucleotide of interest, or for obtaining sequence information for at least part of a plurality of polynucleotides of interest simultaneously.
3. The method of claim 1 or 2, wherein the first polynucleotide replication catalyst and the second polynucleotide replication catalyst are the same polynucleotide replication catalyst, or wherein the first polynucleotide replication catalyst and the second polynucleotide replication catalyst are different polynucleotide replication catalysts.
4. The method of any one of claims 1-3, wherein the sequence information is obtained by one or more sequencing reactions, wherein optionally the one or more sequencing reactions are performed in one or more reaction volumes (e.g., reaction chambers), such as about 1 × 10^6 to about 5 × 10^8 reaction volumes, about 1 × 10^6 to about 1 × 10^8 reaction volumes, or about 1 × 10^6 to about 5 × 10^7 reaction volumes, wherein optionally the reaction volumes are physically separated from each other and/or there is substantially no material exchange between the reaction volumes, wherein optionally the reaction volumes are located in an array, such as a chip, and wherein optionally the reaction volumes are closed and/or isolated from each other by a liquid, such as an oil, immiscible with the liquid in the reaction volumes.
5. The method of claim 4, wherein the reaction volumes are provided in reaction chambers and the target polynucleotide in each of the reaction chambers is immobilized on a solid support in the reaction chamber, wherein optionally the sequence information is obtained by high throughput sequencing, e.g., wherein at least about 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, or 10^9 sequences are read substantially in parallel.
6. The method of any one of claims 1-5, wherein the first polynucleotide replication catalyst and/or the second polynucleotide replication catalyst is a polymerase, such as a DNA polymerase, an RNA polymerase or an RNA-dependent RNA polymerase, a ligase, a reverse transcriptase, or a terminal deoxynucleotidyl transferase.
7. The method of any one of claims 1-6, wherein the nucleotide monomers in the first and/or the second sequencing reagent are selected from the group consisting of: deoxyribonucleotides, modified deoxyribonucleotides, ribonucleotides, modified ribonucleotides, peptide nucleotides, modified sugar-phosphate backbone nucleotides, and mixtures thereof.
8. The method of claim 7, wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are both deoxyribonucleotides.
9. The method of claim 8, wherein the nucleotide monomer is selected from the group consisting of: A, T/U, C and G deoxyribonucleotides, and analogs thereof.
10. The method of claim 7, wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are both ribonucleotides.
11. The method of claim 10, wherein the nucleotide monomer is selected from the group consisting of: A, U/T, C and G ribonucleotides, and analogs thereof.
12. The method of any one of claims 1-11, wherein the first and/or the second label is releasably conjugated to the nucleotide monomer.
13. The method of claim 12, wherein the first and/or the second label is conjugated to a terminal phosphate group of the nucleotide monomer, or to a last or penultimate phosphate group of the nucleotide monomer.
14. The method of claim 13, wherein the nucleotide monomer/first label conjugate in the first sequencing reagent and/or the one or more nucleotide monomer/second label conjugates in the second sequencing reagent have the structure of formula I:
wherein n is 0-6, R is a nucleobase, X is H, OH or OMe, or a salt thereof,
wherein optionally the nucleotide monomer/first label conjugate in the first sequencing reagent and/or the one or more nucleotide monomer/second label conjugates in the second sequencing reagent have the structure of formula II below:
15. The method of claim 13 or 14, wherein the first and/or second labels are substantially non-fluorescent until after release from the terminal phosphate group of the nucleotide monomer.
16. The method of claim 15, further comprising releasing the first and/or second labels from the terminal phosphate group of the nucleotide monomer with an activating enzyme.
17. The method of claim 16, wherein the activating enzyme is an exonuclease, a phosphotransferase, or a phosphatase.
18. The method of any one of claims 1-17, wherein the first labels of the at least two different nucleotide monomers are the same.
19. The method of any one of claims 1-17, wherein the first labels of the at least two different nucleotide monomers are different.
20. The method of any one of claims 1-19, further comprising a washing step between steps a) and b).
21. The method of any one of claims 1-20, wherein the target polynucleotide is immobilized on a surface, such as a solid surface, a soft surface, a hydrogel surface, a microparticle surface, or a combination thereof.
22. The method of claim 21, wherein the solid surface is part of a microreactor and steps a) and b) are performed in the microreactor.
23. The method of any one of claims 1-22, which is carried out at a temperature in the range of from about 20 ℃ to about 70 ℃.
24. The method of any one of claims 1-23, wherein multiple rounds of steps a) and b) are performed using different combinations of the first sequencing reagent and the second sequencing reagent.
25. The method of any one of claims 1-24, wherein the sequence information obtained in step c) is a degenerate sequence.
26. The method of claim 25, wherein at least one additional round of steps a) and b) is performed using a combination of the first sequencing reagent and the second sequencing reagent that is different from the combination of the first sequencing reagent and the second sequencing reagent in the previous round or rounds of steps a) and b) to obtain at least one additional sequence, and the additional sequence is compared to the degenerate sequence to obtain a non-degenerate sequence.
27. The method according to any of claims 1-26, wherein the initial sequence information obtained in step c) contains no errors or one or more errors.
28. The method of claim 27, wherein at least one additional round of steps a) and b) is performed using a combination of the first sequencing reagent and the second sequencing reagent that is different from the combination of the first sequencing reagent and the second sequencing reagent in the previous round or rounds of steps a) and b) to obtain at least one additional sequence, and the additional sequence is compared to the initial sequence to reduce or eliminate sequence errors.
29. The method of any one of claims 26-28, wherein the sequence comparison is performed using a mathematical analysis, algorithm, or method.
30. The method of claim 29, wherein the mathematical analysis, algorithm or method comprises a Markov model or a Bayesian maximum likelihood method.
31. The method of any one of claims 1-30, wherein the first sequencing reagent comprises two different nucleotide monomer/first label conjugates, each nucleotide monomer/first label conjugate comprising a different nucleotide monomer, the second sequencing reagent comprises two different nucleotide monomer/second label conjugates, each nucleotide monomer/second label conjugate comprising a different nucleotide monomer, and the two nucleotide monomers in the first sequencing reagent are different from the two nucleotide monomers in the second sequencing reagent.
32. The method of claim 30 or 31, wherein the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent are selected from the group consisting of: A, T/U, C and G deoxyribonucleotides, and analogs thereof.
33. The method of claim 32, wherein the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent are selected from the group consisting of:
1) A and T/U deoxyribonucleotides in one sequencing reagent and C and G deoxyribonucleotides in another sequencing reagent;
2) A and G deoxyribonucleotides in one sequencing reagent and C and T/U deoxyribonucleotides in another sequencing reagent; and
3) A and C deoxyribonucleotides in one sequencing reagent and G and T/U deoxyribonucleotides in another sequencing reagent.
34. The method according to claim 33, wherein one round of steps a) and b) or at least two rounds of steps a) and b) is performed, one of the combinations 1) -3) is used in one round of steps a) and b) and another one of the combinations 1) -3) different from the combination used in the previous round of steps a) and b) is used in another round of steps a) and b).
35. The method of claim 34, wherein three rounds of steps a) and b) are performed, each round using a different combination selected from the combinations 1) -3).
36. The method of claim 34 or 35, wherein the sequences obtained from multiple rounds of steps a) and b) are aligned to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
37. The method of claim 30 or 31, wherein the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent are selected from the group consisting of: A, T/U, C and G ribonucleotides, and analogs thereof.
38. The method of claim 37, wherein the two nucleotide monomers in the first sequencing reagent and the two nucleotide monomers in the second sequencing reagent are selected from the group consisting of:
1) A and T/U ribonucleotides in one sequencing reagent and C and G ribonucleotides in another sequencing reagent;
2) A and G ribonucleotides in one sequencing reagent and C and T/U ribonucleotides in another sequencing reagent; and
3) A and C ribonucleotides in one sequencing reagent and G and T/U ribonucleotides in another sequencing reagent.
39. The method according to claim 38, wherein one round of steps a) and b) or at least two rounds of steps a) and b) is performed, one of the combinations 1) -3) is used in one round of steps a) and b) and another one of the combinations 1) -3) different from the combination used in the previous round of steps a) and b) is used in another round of steps a) and b).
40. The method of claim 39, wherein at least three rounds of steps a) and b) are performed, each round using a different one of the combinations 1) -3).
41. The method of claim 39 or 40, wherein the sequences obtained from multiple rounds of steps a) and b) are aligned to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
42. The method of any one of claims 31-41, wherein the first label of the two different nucleotide monomers is the same and the second label is the same as the first label.
43. The method of any one of claims 31-41, wherein the first label of the two different nucleotide monomers is different and the second label is the same as the first label.
44. The method of any one of claims 1-30, wherein one of the first and second sequencing reagents comprises three different nucleotide monomer/first label conjugates, each nucleotide monomer/first label conjugate comprises a different nucleotide monomer, the other sequencing reagent comprises one nucleotide monomer/second label conjugate, and the three nucleotide monomers in one sequencing reagent are different from the nucleotide monomers in the other sequencing reagent.
45. The method of claim 44, wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are selected from the group consisting of: A, T/U, C and G deoxyribonucleotides, and analogs thereof.
46. The method of claim 45, wherein the nucleotide monomers in the first and second sequencing reagents are selected from the group consisting of:
1) C, G and T/U deoxyribonucleotides in one sequencing reagent and A deoxyribonucleotides in another sequencing reagent;
2) A, G and T/U deoxyribonucleotides in one sequencing reagent and C deoxyribonucleotides in another sequencing reagent;
3) A, C and T/U deoxyribonucleotides in one sequencing reagent and G deoxyribonucleotides in another sequencing reagent; and
4) A, C and G deoxyribonucleotides in one sequencing reagent and T/U deoxyribonucleotides in another sequencing reagent.
47. The method according to claim 46, wherein one round of steps a) and b) or at least two rounds of steps a) and b) is performed, one of the combinations 1) -4) is used in one round of steps a) and b) and another one of the combinations 1) -4) different from the combination used in the previous round of steps a) and b) is used in another round of steps a) and b).
48. The method of claim 46, wherein three rounds of steps a) and b) are performed, each round using a different combination selected from the combinations 1) -4).
49. The method of claim 46, wherein four rounds of steps a) and b) are performed, each round using a different combination selected from the combinations 1) -4).
50. The method of any one of claims 47-49, wherein the sequences obtained from multiple rounds of steps a) and b) are aligned to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
51. The method of claim 44, wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are selected from the group consisting of: A, T/U, C and G ribonucleotides, and analogs thereof.
52. The method of claim 51, wherein the nucleotide monomers in the first and second sequencing reagents are selected from the group consisting of:
1) C, G and T/U ribonucleotides in one sequencing reagent and A ribonucleotides in another sequencing reagent;
2) A, G and T/U ribonucleotides in one sequencing reagent and C ribonucleotides in another sequencing reagent;
3) A, C and T/U ribonucleotides in one sequencing reagent and G ribonucleotides in another sequencing reagent; and
4) A, C and G ribonucleotides in one sequencing reagent and T/U ribonucleotides in another sequencing reagent.
53. The method according to claim 52, wherein one round of steps a) and b) or at least two rounds of steps a) and b) is performed, one of the combinations 1) -4) is used in one round of steps a) and b) and another one of the combinations 1) -4) different from the combination used in the previous round of steps a) and b) is used in another round of steps a) and b).
54. The method of claim 52, wherein at least three rounds of steps a) and b) are performed, each round using a different one of the combinations 1) -4).
55. The method of claim 52, wherein at least four rounds of steps a) and b) are performed, each round using a different one of the combinations 1) -4).
56. The method of any one of claims 52-55, wherein the sequences obtained from multiple rounds of steps a) and b) are aligned to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
57. The method of any one of claims 1-56, wherein a read length of about 250, about 350, about 400, about 500, about 800, or about 2400 base pairs is obtained.
58. The method of any one of claims 1-57, wherein at least about 95% code accuracy is obtained.
59. The method of any one of claims 1-58, wherein the target polynucleotide is a single-stranded polynucleotide.
60. A method for obtaining sequence information of a polynucleotide of interest, the method comprising:
a) providing a first sequencing reagent to a target polynucleotide in the presence of a first polynucleotide replication catalyst, wherein the first sequencing reagent comprises two different nucleotide monomers each conjugated to a first label, and the nucleotide monomer/first label conjugate is substantially non-fluorescent until after incorporation of the nucleotide monomers into the target polynucleotide according to complementarity with the target polynucleotide; and
b) providing a second sequencing reagent to the target polynucleotide in the presence of a second polynucleotide replication catalyst, wherein the second sequencing reagent comprises two different nucleotide monomers each conjugated to a second label, and the nucleotide monomer/second label conjugate is substantially non-fluorescent until after the nucleotide monomers are incorporated into the target polynucleotide according to complementarity with the target polynucleotide, and wherein the second sequencing reagent is provided subsequent to providing the first sequencing reagent, and
c) obtaining sequence information of at least part of the target polynucleotide by detecting fluorescence emissions caused by the first label and the second label after incorporation of the nucleotide monomers into the polynucleotide in said steps a) and b),
wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are selected from the group consisting of:
1) adenine (A) and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent and cytosine (C) and guanine (G) nucleotide monomers in another sequencing reagent;
2) adenine (A) and guanine (G) nucleotide monomers in one sequencing reagent and cytosine (C) and thymine (T)/uracil (U) nucleotide monomers in another sequencing reagent; and
3) adenine (A) and cytosine (C) nucleotide monomers in one sequencing reagent and guanine (G) and thymine (T)/uracil (U) nucleotide monomers in another sequencing reagent.
61. The method of claim 60, wherein the first labels of the two different nucleotide monomers in step a) and the second labels of the two different nucleotide monomers in step b) are the same label.
62. The method of claim 60, wherein the first label comprises two different labels, and wherein one of the first labels is the same as one of the second labels and the other of the first labels is the same as the other of the second labels.
63. The method of any one of claims 60-62, wherein multiple rounds of steps a) and b) are performed, each round using a combination selected from the combinations 1) -3).
64. The method of claim 63, wherein at least two or three sets of sequence information are obtained in step c), the method comprising:
performing multiple rounds of steps a) and b) in a first sequencing reaction volume using combination 1) to obtain a first set of sequence information,
performing multiple rounds of steps a) and b) in a second sequencing reaction volume using combination 2) to obtain a second set of sequence information, and/or
performing multiple rounds of steps a) and b) in a third sequencing reaction volume using combination 3) to obtain a third set of sequence information.
65. The method of claim 64, wherein the first, second, and third sets of sequence information are obtained in parallel from separate sequencing reaction volumes.
66. The method of claim 64, wherein the first, second, and third sets of sequence information are obtained sequentially from the same sequencing reaction volume, wherein the product of a previous sequencing reaction is excised before the next sequencing reaction is initiated.
67. The method of any one of claims 64-66, further comprising comparing the at least two or three sets of sequence information to reduce or eliminate sequence errors.
68. The method of claim 67, wherein the comparison indicates that there are no errors in the obtained target polynucleotide sequence when the at least two or three sets of sequence information are identical to each other.
69. The method of claim 67, wherein when the at least two or three sets of sequence information comprise a difference in at least one nucleotide residue of the target polynucleotide sequence, the comparison indicates that an error exists in the obtained target polynucleotide sequence.
70. The method of claim 69, further comprising correcting at least one nucleotide residue in the obtained polynucleotide sequence of interest such that, after correction, the at least two or three sets of sequence information are identical to each other.
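The comparison described in claims 67-70 can be illustrated with a minimal sketch. It assumes each set of sequence information has already been reduced to one degenerate symbol per residue, written here with the standard IUPAC two-base codes (W = A/T, S = C/G, R = A/G, Y = C/T, M = A/C, K = G/T). The function names `encode` and `decode` are hypothetical and are not taken from the claims. Any two combinations determine each base; the third acts as a consistency check.

```python
# IUPAC two-base ambiguity codes for the three 2+2 combinations of claim 60.
COMBINATIONS = {
    1: {"W": "AT", "S": "CG"},  # combination 1): A and T/U vs. C and G
    2: {"R": "AG", "Y": "CT"},  # combination 2): A and G vs. C and T/U
    3: {"M": "AC", "K": "GT"},  # combination 3): A and C vs. G and T/U
}


def encode(seq, combo):
    """Return the degenerate sequence that one combination would report."""
    lookup = {
        base: code
        for code, bases in COMBINATIONS[combo].items()
        for base in bases
    }
    return "".join(lookup[base] for base in seq)


def decode(reads):
    """Intersect the per-position base sets of two or three degenerate reads.

    `reads` maps a combination number to its degenerate string. Returns the
    decoded sequence and the positions where the reads disagree; those
    positions are reported as 'N'.
    """
    length = len(next(iter(reads.values())))
    decoded, error_positions = [], []
    for i in range(length):
        candidates = set("ACGT")
        for combo, read in reads.items():
            candidates &= set(COMBINATIONS[combo][read[i]])
        if len(candidates) == 1:
            decoded.append(candidates.pop())
        else:
            # Zero candidates: the sets contradict each other (an error).
            decoded.append("N")
            error_positions.append(i)
    return "".join(decoded), error_positions
```

For example, encoding `ACTGACTG` under all three combinations and decoding the result recovers the original sequence with no flagged positions, while changing a single symbol in one of the three reads leaves an empty intersection at that position and flags it as an error. Correcting the flagged residue, as in claim 70, is not modeled here.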
71. A method for obtaining sequence information of a polynucleotide of interest, the method comprising:
a) providing a first sequencing reagent to a target polynucleotide in the presence of a first polynucleotide replication catalyst, wherein the first sequencing reagent comprises three different nucleotide monomers each conjugated to a first label, and the nucleotide monomer/first label conjugate is substantially non-fluorescent until after incorporation of the nucleotide monomers into the target polynucleotide according to complementarity with the target polynucleotide; and
b) providing a second sequencing reagent to the target polynucleotide in the presence of a second polynucleotide replication catalyst, wherein the second sequencing reagent comprises one nucleotide monomer conjugated to a second label, and the nucleotide monomer/second label conjugate is substantially non-fluorescent until after the nucleotide monomer is incorporated into the target polynucleotide according to complementarity with the target polynucleotide, and wherein the second sequencing reagent is provided prior to or subsequent to providing the first sequencing reagent, and
c) obtaining sequence information of at least part of the target polynucleotide by detecting fluorescence emissions caused by the first label and the second label after incorporation of the nucleotide monomers into the polynucleotide in said steps a) and b),
wherein the nucleotide monomers in the first sequencing reagent and the second sequencing reagent are selected from the group consisting of:
1) cytosine (C) nucleotide monomer, guanine (G) nucleotide monomer, and thymine (T)/uracil (U) nucleotide monomer in one sequencing reagent, and adenine (A) nucleotide monomer in another sequencing reagent;
2) adenine (A), guanine (G) and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent, and cytosine (C) nucleotide monomers in another sequencing reagent;
3) adenine (A), cytosine (C), and thymine (T)/uracil (U) nucleotide monomers in one sequencing reagent, and guanine (G) nucleotide monomers in another sequencing reagent; and
4) adenine (A), cytosine (C) and guanine (G) nucleotide monomers in one sequencing reagent, and thymine (T)/uracil (U) nucleotide monomers in another sequencing reagent.
72. The method of claim 71, wherein the first labels of the three different nucleotide monomers in step a) and the second label of the one nucleotide monomer in step b) are the same label.
73. The method of claim 71 or 72, wherein multiple rounds of steps a) and b) are performed, each round using a combination selected from the combinations 1) -4).
74. The method of claim 73, wherein at least two, three or four sets of sequence information are obtained in step c), the method comprising:
performing multiple rounds of steps a) and b) in a first sequencing reaction volume using combination 1) to obtain a first set of sequence information,
performing multiple rounds of steps a) and b) in a second sequencing reaction volume using combination 2) to obtain a second set of sequence information,
performing multiple rounds of steps a) and b) in a third sequencing reaction volume using combination 3) to obtain a third set of sequence information, and/or
performing multiple rounds of steps a) and b) in a fourth sequencing reaction volume using combination 4) to obtain a fourth set of sequence information.
75. The method of claim 74, wherein the first, second, third, and fourth sets of sequence information are obtained in parallel from separate sequencing reaction volumes.
76. The method of claim 74, wherein the first, second, third, and fourth sets of sequence information are obtained sequentially from the same sequencing reaction volume, wherein the product of a previous sequencing reaction is excised before the next sequencing reaction is initiated.
77. The method of any one of claims 74-76, further comprising comparing the at least two, three, or four sets of sequence information to reduce or eliminate sequence errors.
78. The method of claim 77, wherein the comparison indicates that there are no errors in the obtained target polynucleotide sequences when the at least two, three or four sets of sequence information are identical to each other.
79. The method of claim 77, wherein when the at least two, three or four sets of sequence information comprise a difference in at least one nucleotide residue of the target polynucleotide sequence, the comparison indicates that an error exists in the obtained target polynucleotide sequence.
80. The method of claim 79, further comprising correcting at least one nucleotide residue in the obtained target polynucleotide sequence such that after correction, the at least two, three or four sets of sequence information are identical to each other, wherein optionally the at least one nucleotide residue in the set of sequence information is corrected by insertion or deletion at the position where the error occurred, wherein the insertion extends the sequence by at least one nucleotide and sequence information from the other set or sets of sequence information is compared to the extended sequence to arrive at a corrected sequence, wherein the deletion shortens the sequence by at least one nucleotide and sequence information from the other set or sets of sequence information is compared to the shortened sequence to arrive at a corrected sequence.
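The alternating provision of two reagents in steps a) and b) can be sketched as an idealized signal model. The sketch assumes, as an illustration only, that each reagent extends the growing strand until the next base is absent from that reagent, and that the detected signal is proportional to the number of monomers incorporated in that step. The function name `flow_signals` and the error-free chemistry are assumptions, not features recited in the claims.

```python
def flow_signals(seq, first, second):
    """Ideal incorporation counts when two reagents are provided alternately.

    `seq` is the strand being synthesized (the complement of the template).
    `first` and `second` are the bases contained in the first and second
    sequencing reagents. Each provision extends the strand until the next
    base is not in the current reagent.
    """
    sets = (set(first), set(second))
    missing = set(seq) - (sets[0] | sets[1])
    if missing:
        # Without this check, an uncovered base would stall extension forever.
        raise ValueError(f"bases not covered by either reagent: {sorted(missing)}")

    signals, pos, flow = [], 0, 0
    while pos < len(seq):
        current = sets[flow % 2]
        run = 0
        while pos < len(seq) and seq[pos] in current:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals
```

Under these assumptions, the 3+1 combination 1) applied to `ACTGACTG` (C, G and T in the first reagent, A in the second) gives the counts `[0, 1, 3, 1, 3]`, and the 2+2 combination 3) (A and C, then G and T) gives `[2, 2, 2, 2]`. In both cases the counts sum to the read length, which is why a miscounted step shifts every later position and shows up as an insertion or deletion when the sets are compared.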
81. A kit or system for obtaining sequence information of a polynucleotide, the kit or system comprising:
a) a first sequencing reagent comprising at least two different nucleotide monomer/first label conjugates that are substantially non-fluorescent until after incorporation of the nucleotide monomer into a polynucleotide according to complementarity with a target polynucleotide; and
b) a second sequencing reagent comprising one or more nucleotide monomer/second label conjugates that are substantially non-fluorescent until after incorporation of the nucleotide monomer into a polynucleotide according to complementarity with the target polynucleotide, at least one of the one or more nucleotide monomers being different from the nucleotide monomer present in the first sequencing reagent, and
c) a detector for detecting fluorescent emissions resulting from the first label and the second label after incorporation of the nucleotide monomer into the polynucleotide.
82. The kit or system of claim 81, further comprising a first polynucleotide replication catalyst and/or a second polynucleotide replication catalyst.
83. The kit or system of claim 81 or 82, wherein the first and/or second label is conjugated to a terminal phosphate group of the nucleotide monomer.
84. The kit or system of claim 83, further comprising an activating enzyme for releasing the first and/or second labels from the terminal phosphate group of the nucleotide monomer.
85. The kit or system of any one of claims 81-84, further comprising a solid surface on which the target polynucleotide is configured to be immobilized.
86. The kit or system of claim 85, wherein the solid surface is part of a microreactor.
87. The kit or system of any one of claims 81-86, further comprising means for obtaining sequence information for at least a portion of a polynucleotide of interest based on the fluorescent emission resulting from the first and second labels following incorporation of the nucleotide monomer into the polynucleotide.
88. The kit or system of claim 87, wherein the means comprises a computer readable medium containing executable instructions that when executed obtain sequence information for at least a portion of a target polynucleotide based on the fluorescent emission caused by the first and second labels following incorporation of the nucleotide monomer into the polynucleotide.
89. The kit or system of any one of claims 81-88, further comprising means for comparing a plurality of sequences to obtain a non-degenerate sequence and/or to reduce or eliminate sequence errors in the non-degenerate sequence.
90. The kit or system of claim 89, wherein the means comprises a computer readable medium containing executable instructions that when executed can compare sequences to obtain a non-degenerate sequence and/or reduce or eliminate sequence errors in the non-degenerate sequence.
91. A method of correcting sequencing information errors, comprising:
(a) performing parameter estimation based on sequencing signals from one or more reference polynucleotides during the sequencing reaction and the known nucleic acid sequence of the reference polynucleotides, using the parameter estimation to obtain information on lead and/or lag dephasing of the sequencing reaction;
(b) obtaining a sequencing signal from the target polynucleotide during a sequencing reaction;
(c) calculating a secondary lead amount for the target polynucleotide based on the information obtained from step (a) and the sequencing signal obtained from step (b);
(d) calculating the amount of dephasing of the target polynucleotide based on the sequencing signal obtained from step (b) and the secondary lead amount of step (c);
(e) correcting the sequencing signal obtained from step (b) using the amount of dephasing to generate a predicted sequencing signal for the polynucleotide of interest;
(f) repeating steps (c) through (e) for one or more rounds, wherein the predicted sequencing signal from round i is used to calculate the secondary lead amount for the target polynucleotide in round i + 1 until the predicted sequencing signal for the target polynucleotide from round j is mathematically convergent, wherein i and j are integers and 1 ≤ i < i + 1 ≤ j,
wherein the secondary lead phenomenon refers to an unexpected nucleotide extension occurring at a residue of the target polynucleotide and the unexpected extension being further extended by a nucleotide other than the next residue during sequencing, and
wherein the amount of dephasing comprises a change in the sequencing result due to the lead and/or lag dephasing during sequencing.
92. The method of claim 91, wherein the parameter estimation in step (a) comprises obtaining an attenuation coefficient.
93. The method of claim 91 or 92, wherein the parameter estimation in step (a) further comprises obtaining an offset.
94. The method of any one of claims 91-93, wherein the parameter estimation in step (a) further comprises obtaining unit signal information.
95. The method of any one of claims 91-94, wherein the parameter estimation in step (a) comprises obtaining the lead coefficient and/or lag coefficient for each nucleotide or combination of nucleotides.
96. The method of any one of claims 91-95, comprising obtaining the lead and/or lag dephasing information for each sequencing reaction when multiple sequencing reactions are performed.
97. The method of any one of claims 1-80, wherein the sequence information is processed (e.g., corrected) by the method of any one of claims 91-96.
98. A method of correcting sequencing information errors, comprising:
(a) performing parameter estimation based on sequencing signals from one or more reference polynucleotides during the sequencing reaction and the known nucleic acid sequence of the reference polynucleotides;
(b) obtaining a sequencing signal from a polynucleotide of interest during the sequencing reaction;
(c) calculating the secondary lead amount of the target polynucleotide from the lead or lag dephasing information obtained by parameter estimation in step (a) and the sequencing signal obtained from step (b);
(d) calculating the amount of dephasing of the target polynucleotide based on the sequencing signal obtained from step (b) and the secondary lead amount of step (c);
(e) correcting the sequencing signal obtained from step (b) using the amount of phase loss to generate a predicted sequencing signal for the polynucleotide of interest;
(f) repeating steps (c) through (e) for one or more rounds, wherein the predicted sequencing signal from round i is used to calculate the secondary lead for the target polynucleotide in round i +1 until the predicted sequencing signal for the target polynucleotide from round j is mathematically convergent, wherein i and j are integers and 1 ≦ i < i +1 ≦ j,
wherein the parameter estimation comprises obtaining the lead amount, the lag amount, the attenuation coefficient, and/or the offset based on the sequencing signal from the reference polynucleotide and the known nucleic acid sequence of the reference polynucleotide,
wherein the secondary lead phenomenon refers to an unexpected nucleotide extension occurring at a residue of the target polynucleotide and the unexpected extension being further extended by a nucleotide other than the next residue during sequencing, and
wherein the amount of dephasing comprises a change in the sequencing result due to the pre- and/or post-phasing loss during sequencing.
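The iterative correction of steps (c) through (f) can be pictured as a fixed-point iteration. The following is a minimal sketch only, not the claimed implementation: the linear leakage model, the function names `shift`, `secondary_lead`, `dephasing` and `correct_signal`, and the numeric coefficients are all assumptions introduced for illustration. It assumes the lead and lag coefficients are already known from step (a) and that the raw signal equals the ideal signal plus a dephasing term.

```python
import numpy as np

def shift(x, k):
    """Return x shifted by k cycles, zero-padded (k > 0 reads from later cycles)."""
    out = np.zeros_like(x)
    if k > 0:
        out[:-k] = x[k:]
    elif k < 0:
        out[-k:] = x[:k]
    else:
        out[:] = x
    return out

def secondary_lead(h, lead):
    # Toy assumption: signal leaking in from two cycles ahead, at rate lead**2.
    return lead**2 * shift(h, 2)

def dephasing(h, lead, lag):
    # Toy assumption: per-cycle change = signal gained from neighbouring
    # cycles (including the secondary lead) minus signal lost to them.
    gain = lead * shift(h, 1) + lag * shift(h, -1) + secondary_lead(h, lead)
    loss = (lead + lag + lead**2) * h
    return gain - loss

def correct_signal(f, lead, lag, tol=1e-10, max_rounds=200):
    """Repeat steps (c)-(e) until the predicted signal stops changing."""
    h = np.asarray(f, dtype=float).copy()   # first round starts from the raw signal
    for rounds in range(1, max_rounds + 1):
        h_new = f - dephasing(h, lead, lag)  # correct raw signal with current estimate
        if np.max(np.abs(h_new - h)) < tol:  # convergence test between rounds
            return h_new, rounds
        h = h_new
    raise RuntimeError("signal correction did not converge")

# Hypothetical ideal signal, distorted with the same toy model, then recovered.
h_true = np.array([1.0, 2.0, 0.0, 1.0, 3.0, 1.0, 0.0, 2.0])
f = h_true + dephasing(h_true, lead=0.05, lag=0.03)
h_est, rounds = correct_signal(f, lead=0.05, lag=0.03)
```

In this toy model the iteration converges because the leakage coefficients are small, so each round shrinks the remaining error; a real dephasing model may need a different convergence criterion.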
99. The method of any one of claims 1-80, wherein the sequence information is processed (e.g., corrected) by the method of claim 98.
100. A method of correcting lead during sequencing, comprising:
obtaining a sequencing signal from a target polynucleotide during a sequencing reaction, the sequencing signal corresponding to the sequence of the target polynucleotide; and
correcting the sequencing signal from the target polynucleotide with a secondary lead amount resulting from the secondary lead phenomenon, optionally using parameter estimation;
wherein the secondary lead phenomenon refers to an unexpected nucleotide extension occurring at a residue of the target polynucleotide during sequencing and the unexpected extension being further extended by a nucleotide other than the next residue.
101. The method of claim 100, wherein the sequencing signal from a target polynucleotide comprises a primary lead due to a primary lead phenomenon, wherein the primary lead phenomenon is the occurrence of an unintended nucleotide extension at a residue of the target polynucleotide during sequencing.
102. The method of claim 100 or 101, wherein the sequencing signal from a particular nucleotide residue of the target polynucleotide is corrected using the secondary lead amount if that sequencing signal is close to a unit signal,
wherein the sequencing signal is close to the unit signal when the deviation of the sequencing signal intensity from the unit signal intensity is within about 60%, within about 50%, within about 40%, within about 30%, within about 20%, within about 10%, or within about 5%.
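The closeness test in the preceding claim amounts to a relative-deviation check. A minimal sketch follows, assuming the unit signal is a positive scalar; the function name `is_near_unit` and the default 20% threshold are illustrative choices, not taken from the source.

```python
def is_near_unit(signal, unit, max_rel_dev=0.2):
    """Return True if signal deviates from unit by at most max_rel_dev (a fraction)."""
    if unit <= 0:
        raise ValueError("unit signal must be positive")
    return abs(signal - unit) / unit <= max_rel_dev

# Hypothetical intensities: 1.15 is within 20% of a unit signal of 1.0, 1.3 is not.
near = is_near_unit(1.15, 1.0)
far = is_near_unit(1.3, 1.0)
```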
103. The method of any one of claims 100-102, wherein when obtaining the nth sequencing signal, the method comprises:
providing an error correction method in which said sequencing signal of a reference polynucleotide is compared to said known sequence of said reference polynucleotide to identify errors arising during sequencing and to correct said errors;
using the sequencing signal of the target polynucleotide before n and the error correction method to obtain a corrected sequencing signal, e.g., by feeding back the sequencing signal of the target polynucleotide before n into the error correction method; and
determining whether a secondary lead is present at residue n by comparing the sequencing signal of the target polynucleotide at residue n to the corrected sequencing signal.
104. The method of any one of claims 91-103, wherein the sequencing comprises adding one or more sequencing reagents to the reaction solution, wherein the one or more sequencing reagents optionally comprise a nucleotide and/or an enzyme.
105. The method of any one of claims 91-104, wherein in said sequencing, one, two, or three types of nucleotides are added in each sequencing reaction.
106. The method of any one of claims 91-105, wherein the sequencing reaction involves an open or unblocked 3' end of the polynucleotide.
107. The method of any one of claims 91-106, wherein in the sequencing the added nucleotides comprise one or more of A, G, C and T, or one or more of A, G, C and U.
108. The method of any one of claims 91-107, wherein in said sequencing, said detected sequencing signal comprises an electrical signal, a bioluminescent signal, a chemiluminescent signal, or any combination thereof.
109. The method of any one of claims 91-108, wherein the parameter estimation comprises:
inferring the ideal signal h from the reference polynucleotide,
calculating the phase-lost signal (or the phase mismatch) s and the predicted original sequencing signal p based on the preset parameters, and
calculating the correlation coefficient c between p and the actual raw sequencing signal f.
110. The method of claim 109, wherein the method further comprises using an optimization method to find a set of parameters such that the correlation coefficient c reaches an optimal value.
111. The method of claim 110, wherein the set of parameters comprises lead coefficients or amounts, lag coefficients or amounts, attenuation coefficients, offsets, unit signals, or any combination thereof.
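The parameter estimation described in the three preceding claims (ideal signal h, phase-lost signal s, predicted signal p, correlation coefficient c with the raw signal f, then an optimization over the parameters) can be sketched as below. This is an illustration under stated assumptions rather than the claimed procedure: the linear leakage model in `predict`, the helper `shift`, and the use of an exhaustive grid search as the "optimization method" are all choices made for this example, and only the lead and lag coefficients are fitted.

```python
import numpy as np

def shift(x, k):
    """Return x shifted by k cycles, zero-padded (k > 0 reads from later cycles)."""
    out = np.zeros_like(x)
    if k > 0:
        out[:-k] = x[k:]
    elif k < 0:
        out[-k:] = x[:k]
    else:
        out[:] = x
    return out

def predict(h, lead, lag):
    """Predicted raw signal p = h + s under a toy leakage model (assumption)."""
    s = lead * (shift(h, 1) - h) + lag * (shift(h, -1) - h)  # phase-lost signal
    return h + s

def estimate(h, f, grid):
    """Grid search for the (lead, lag) pair maximizing the correlation of p with f."""
    best_c, best_lead, best_lag = -np.inf, None, None
    for lead in grid:
        for lag in grid:
            p = predict(h, lead, lag)
            c = np.corrcoef(p, f)[0, 1]   # Pearson correlation coefficient
            if c > best_c:
                best_c, best_lead, best_lag = c, lead, lag
    return best_c, best_lead, best_lag

# Hypothetical ideal signal of a reference polynucleotide and a simulated raw
# signal with an arbitrary unit signal (2.0) and offset (0.1).
h = np.array([1.0, 2.0, 0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 1.0, 1.0, 0.0, 3.0])
f = 2.0 * predict(h, lead=0.05, lag=0.03) + 0.1
grid = np.linspace(0.0, 0.2, 21)
c, lead_hat, lag_hat = estimate(h, f, grid)
```

The Pearson correlation is unchanged by a positive rescaling or a constant shift of either signal, so in this sketch the unit signal and the offset cannot be recovered from c alone; they would need a separate fit, for example a least-squares regression of f on p once the lead and lag coefficients are fixed.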
112. The method of any one of claims 91-111, wherein during the sequencing, two sets of reaction solutions are provided, each set comprising one or more nucleotides that are different from the other set, and one reaction solution is provided in each sequencing reaction.
113. The method of claim 112, wherein the two sets of reaction solutions are used in an alternating manner for performing the sequencing reaction.
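To see what alternating two reaction solutions implies for the expected signal, the sketch below computes an ideal, error-free signal in which each reaction extends the strand through every consecutive base that belongs to the nucleotide set supplied in that reaction. The function name `ideal_signal`, the example sets "AC" and "GT", and the example sequence are assumptions for illustration; the sequence is taken to be the strand being synthesized.

```python
def ideal_signal(seq, flows, n_reactions):
    """Count bases incorporated per reaction when the nucleotide sets alternate.

    seq         -- sequence of the strand being synthesized (assumption)
    flows       -- nucleotide sets, used cyclically, e.g. ["AC", "GT"]
    n_reactions -- number of sequencing reactions to simulate
    """
    pos = 0
    signal = []
    for k in range(n_reactions):
        allowed = flows[k % len(flows)]   # reaction solution used in reaction k
        run = 0
        # Extension continues while the next base is in the supplied set.
        while pos < len(seq) and seq[pos] in allowed:
            run += 1
            pos += 1
        signal.append(run)
    return signal

# Hypothetical example: sets {A, C} and {G, T} supplied alternately.
h = ideal_signal("AACGTTGA", ["AC", "GT"], 4)
# h == [3, 4, 1, 0]
```

Each value is a run length over a nucleotide set rather than a single base call, so the signal from one pair of sets is degenerate with respect to the underlying sequence.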
114. The method of any one of claims 91-113, wherein the sequencing of the target polynucleotide and the reference polynucleotide is performed simultaneously.
115. The method of any one of claims 91-114, wherein the reference polynucleotide is used for parameter estimation to obtain one or more of the following parameters of the sequencing reaction: a lead coefficient or amount, a lag coefficient or amount, a decay coefficient, an offset, and a unit signal.
116. The method of any one of claims 91-115, wherein the signal of the target polynucleotide is corrected using one or more parameters of the sequencing reaction obtained by parameter estimation.
117. The method of any one of claims 91-116, wherein the target polynucleotide comprises a tag comprising a known sequence and/or a known amount of nucleotides, and the known sequence and/or known amount of the nucleotides is used to generate a unit signal for the sequencing reaction.
118. The method of any one of claims 91-117, wherein the unit signal for each sampling point is different.
119. The method according to any one of claims 1-80, wherein the sequence information is processed (e.g. corrected) by using the method according to any one of claims 100-118.
120. A computer-readable medium comprising instructions for correcting sequencing information errors, the instructions comprising the steps of:
a) receiving sequencing information for the target polynucleotide and the reference polynucleotide; and
b) using the method of any one of claims 91-119, correcting the sequencing information of the target polynucleotide.
121. A computer system for sequencing comprising the computer-readable medium of claim 120.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510815685.X | 2015-11-19 | | |
| CN201510822361.9 | 2015-11-19 | | |
| CN201510944878.5 | 2015-12-12 | | |
| CN201610899880.X | 2016-10-14 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1259978A1 (en) | 2019-12-13 |