WO2024168196A1 - Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide basepairs - Google Patents
- Publication number
- WO2024168196A1 (PCT/US2024/015068)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- standard
- base
- standard nucleotide
- nucleotide
- dna
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/10—Transferases (2.)
- C12N9/12—Transferases (2.) transferring phosphorus containing groups, e.g. kinases (2.7)
- C12N9/1241—Nucleotidyltransferases (2.7.7)
- C12N9/1252—DNA-directed DNA polymerase (2.7.7.7), i.e. DNA replicase
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1093—General methods of preparing gene libraries, not provided for in other subgroups
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12P—FERMENTATION OR ENZYME-USING PROCESSES TO SYNTHESISE A DESIRED CHEMICAL COMPOUND OR COMPOSITION OR TO SEPARATE OPTICAL ISOMERS FROM A RACEMIC MIXTURE
- C12P19/00—Preparation of compounds containing saccharide radicals
- C12P19/26—Preparation of nitrogen-containing carbohydrates
- C12P19/28—N-glycosides
- C12P19/30—Nucleotides
- C12P19/34—Polynucleotides, e.g. nucleic acids, oligoribonucleotides
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Y—ENZYMES
- C12Y207/00—Transferases transferring phosphorus-containing groups (2.7)
- C12Y207/07—Nucleotidyltransferases (2.7.7)
- C12Y207/07007—DNA-directed DNA polymerase (2.7.7.7), i.e. DNA replicase
-
- C—CHEMISTRY; METALLURGY
- C40—COMBINATORIAL TECHNOLOGY
- C40B—COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
- C40B40/00—Libraries per se, e.g. arrays, mixtures
- C40B40/04—Libraries containing only organic compounds
- C40B40/06—Libraries containing nucleotides or polynucleotides, or derivatives thereof
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/10—Design of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Definitions
- the name of the XML file containing the sequence listing is 3915-P1293WO.UW_Sequence_Listing.xml.
- the XML file is 172,291 bytes; was created on February 07, 2024; and is being submitted electronically via Patent Center with the filing of the specification.
- BACKGROUND: The four-letter standard genetic alphabet of DNA (A, T, G, C) is ubiquitous and one of the defining biomolecular signatures of life on Earth. Organisms’ ability to read, write, and translate this information forms the basis for evolution as an emergent property of nucleic acid heteropolymers. Humanity has learned how to manipulate the standard four letters of DNA, spurring major advancements in biotechnology, information, and healthcare.
- non-standard nucleotides that are capable of base-pairing with other non-standard nucleotides and/or standard nucleotides.
- non-standard nucleotide refers to any nucleotide that is not one of the standard four nucleotides of DNA (i.e., A, T, G, C).
- An example of such a nucleotide includes, but is not limited to, a xenonucleotide (XNA).
- the disclosure provides a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
- the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
- the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I.
- the polypeptide sequence comprises a sequence of SEQ ID NO:2.
- the non-standard nucleotide is B or P
- the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
- the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
- the engineered polymerase is a variant of 9°N DNA polymerase.
- the polypeptide sequence comprises a sequence of SEQ ID NO:3.
- the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
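The two sets of tailing conditions listed above can be collected into a small lookup table. The following is a minimal sketch, not the applicants' software: the class name, field names, and the plain-text base spellings (B, P, Sn, Sc, Z, Xt, Kn, J, V) are assumptions made for illustration, while the numeric values are the approximate ("about") figures stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TailingCondition:
    """Approximate N+1 tailing condition (hypothetical container type)."""
    polymerase: str
    temperature_c: float
    hours_min: float
    hours_max: float
    polymerase_u_per_ul: float
    dntp_mm: float


# Values are the "about" figures given in the text; names are illustrative.
KF_EXO_MINUS = TailingCondition("Klenow Fragment (exo-)", 37.0, 1.0, 16.0, 0.71, 1.19)
NINE_N_VARIANT = TailingCondition("9°N DNA polymerase variant", 60.0, 4.0, 16.0, 0.29, 1.19)

CONDITIONS = {base: KF_EXO_MINUS for base in ("B", "P")}
CONDITIONS.update(
    {base: NINE_N_VARIANT for base in ("Sn", "Sc", "Z", "Xt", "Kn", "J", "V")}
)


def tailing_condition(base: str) -> TailingCondition:
    """Return the listed tailing condition for a non-standard base."""
    try:
        return CONDITIONS[base]
    except KeyError:
        raise ValueError(f"no tailing condition listed for base {base!r}") from None


if __name__ == "__main__":
    for base in ("B", "Z"):
        print(base, tailing_condition(base))
```

Keeping the conditions in one table makes it easy to see that only the polymerase, temperature, minimum duration, and enzyme concentration differ between the two groups of bases; the dNTP concentration is the same in both.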
- the disclosure provides a method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
- the method comprises: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
- the N+1 tailing product comprises a hairpin.
- the second N+1 tailing product comprises a hairpin.
- the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
- the method comprises: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
- the non-standard nucleotide comprises a nucleobase that can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
- the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
- the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
- the disclosure provides a dsDNA ligation product. In an aspect, the disclosure provides a further dsDNA ligation product.
- the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product or the blunt-end dsDNA template, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
- the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product or the further blunt-end dsDNA template, wherein the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
- the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-
- the disclosure provides a method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
- the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
- the disclosure provides a non-transitory computer-readable storage medium having stored thereon at least part of a ML model.
- the disclosure provides a computational device or computational system comprising the non-transitory computer-readable storage medium.
- the disclosure provides a nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium.
- the disclosure provides a method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read; computing, with the computational device or computational system, the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library associated with the subject current read for an association; and computing, based on the association, a structure of the non-standard nucleotide.
- the disclosure provides a circuitry configured to perform all or part of a method.
- the disclosure provides a nanopore sequencing kit, device, or system comprising the circuitry.
- FIGs 1A and 1B show nucleobases for an expanded 12-letter supernumerary DNA alphabet.
- FIG. 1A Structures of standard purine and pyrimidine nucleobases.
- FIG. 1B Structures of mutually orthogonal synthetic xenonucleobases that can form the basis of a 12-letter supernumerary DNA. Single letter abbreviations of each base indicated above nucleobase structure.
- FIGs 2A-2H show XNA tailing and XNA ligation enable a facile means for enzymatic XNA incorporation.
- FIG. 2A Polymerase XNA tailing activity screened by detection of released 2′-deoxy-xenonucleoside monophosphates (dxNMPs). Hairpin HP-3′PT was used as tailing substrate (Table 2); ‘*’ indicate positions of phosphorothioate bonds.
- Extracted ion chromatograms for each dNMP and dxNMP in assays indicate dNTP and dxNTP tailing by (FIG. 2B) Klenow Fragment (exo-) and (FIG. 2C) Therminator polymerase.
- Source data are provided as a Source Data file.
- FIG. 2D Assay measuring extent of XNA tailing by T4 ligation. Tailed hairpins are not substrates for T4 ligation.
- FIG. 2E XNA tailing of hairpin using optimized conditions showing XNA tailed hairpin is the major product.
- (–) is blunt-ended hairpin negative control.
- G+ is a hairpin synthesized to contain a single nucleotide 3′-G overhang as the positive control (gel representative of 3 experimental replicates; yield estimates are listed in Table 9).
- FIG. 2F Assay to ligate two DNA hairpins with complementary single nucleotide XNA overhangs. Ligated hairpins are protected from exonucleases as they lack free 5′ and 3′ ends.
- FIG. 2G XNA ligation of hairpins tailed with complementary purine (pur) and pyrimidine (pyr) XNA bases using optimized reaction conditions. (+) is a positive control that used blunt DNA substrate.
- (*) is a negative control that used blunt DNA substrate without DNA ligase.
- FIGs 3A-3D show generation of 12-letter (ATGCBSPZXKJV) nanopore sequencing kmer models.
- FIG. 3A Overview of construction of NNNNNNN libraries, starting from two synthetic oligo pools (NNN-Pool) that contain blunt, NNN-3′ ends. The 24-nt triplet-barcodes in these hairpins are linked to the 3′-NNN sequence, allowing for proper identification of bases adjacent to XNA inserts. Complementary XNA base pairs are added to the library hairpins using XNA tailing and XNA ligation.
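The size of such a library follows from simple enumeration. The sketch below assumes, as an interpretation of the description rather than a stated fact, that each heptamer has the form NNN-X-NNN with the XNA letter X at the center and standard bases at the six flanking positions. The function name is hypothetical.

```python
from itertools import product

STANDARD = "ATGC"


def heptamer_contexts(xna: str) -> list[str]:
    """Enumerate all NNN-X-NNN heptamers for one XNA letter at the center.

    Assumption: flanking positions hold only standard bases and the XNA
    occupies the middle position, as suggested by the NNN-3' pool design.
    """
    if len(xna) != 1 or xna in STANDARD:
        raise ValueError("xna must be a single non-standard letter")
    triplets = ["".join(p) for p in product(STANDARD, repeat=3)]
    return [left + xna + right for left in triplets for right in triplets]


if __name__ == "__main__":
    contexts = heptamer_contexts("P")
    print(len(contexts))   # 64 * 64 = 4096 contexts per XNA letter
    print(contexts[:3])    # ['AAAPAAA', 'AAAPAAT', 'AAAPAAG']
```

Under this assumption each XNA letter is observed in 4096 distinct sequence contexts, which is why a barcode linked to each 3′-NNN end is needed to recover the bases next to the insert.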
- FIGs 4A-4C show construction and end-to-end nanopore sequencing of 6-letter DNA alphabets.
- FIG. 4A Proof of concept deployment of an XNA-refinement pipeline using 4-nt kmer models measured in this disclosure.
- Pipeline is used to transform raw commercial nanopore reads into likely XNA basecalls for the sense (+) and antisense (-) strands.
- FIG. 4C Response
- FIG. 5 shows enzyme-assisted synthesis and third-generation sequencing of supernumerary 12-letter DNA.
- the kmer probability density function (observed signal mean ⟨I_z⟩, model mean μ_ki, model standard deviation σ) is used to calculate log-likelihoods, while a maximum likelihood with outlier-robust log-likelihood ratios is used to determine the base call.
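The base-calling rule described above can be sketched as follows. This is an illustrative simplification, not the disclosed implementation: it assumes a Gaussian density for each kmer level, and it makes the log-likelihood ratio "outlier-robust" by clipping each per-kmer term to a fixed range, which is only one of several possible robustification schemes. The function names, the `clip` parameter, and the example numbers are hypothetical.

```python
import math


def gaussian_loglik(observed_mean: float, model_mean: float, model_sd: float) -> float:
    """Log of a normal density evaluated at the observed signal mean."""
    var = model_sd ** 2
    return -0.5 * math.log(2 * math.pi * var) - (observed_mean - model_mean) ** 2 / (2 * var)


def call_base(observed_means, models, reference, clip=10.0):
    """Pick the candidate base with the largest clipped log-likelihood ratio.

    observed_means: signal means for the kmers overlapping the position.
    models: {base: [(model_mean, model_sd), ...]}, one pair per kmer.
    reference: base whose model serves as the null hypothesis.
    clip: bound on each per-kmer ratio (assumed robustification).
    """
    ref = models[reference]
    scores = {}
    for base, params in models.items():
        if not (len(params) == len(ref) == len(observed_means)):
            raise ValueError("each model needs one (mean, sd) pair per observation")
        llr = 0.0
        for obs, (mu, sd), (mu0, sd0) in zip(observed_means, params, ref):
            diff = gaussian_loglik(obs, mu, sd) - gaussian_loglik(obs, mu0, sd0)
            llr += max(-clip, min(clip, diff))
        scores[base] = llr
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical levels: the read looks like the "P" model, not the "G" model.
    models = {"G": [(80.0, 5.0), (70.0, 5.0)], "P": [(100.0, 5.0), (90.0, 5.0)]}
    best, scores = call_base([100.0, 90.0], models, reference="G")
    print(best, scores)
```

Clipping keeps a single badly segmented kmer from dominating the sum, which is the practical motivation for using a robust ratio rather than the raw likelihood.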
- FIG. 6A shows an overview of an example non-templated N+1 tailing reaction. Tailing of blunt-end hairpin DNA substrates (N) can lead to complete formation of XNA-tailed hairpin products (N+1 major).
- PPi release from tailing leads to slow background rate of pyrophosphorolysis, which acts in the reverse direction of nucleotide tailing (3′-exo). Pyrophosphorolysis is mitigated by adding YiPP to tailing reactions and balancing reaction duration and reaction rates.
- the over-tailing of products to generate (N+2) hairpins is also considered in optimization of tailing reactions.
- N+1 tailing is generally thought to occur at a first-order reaction rate, 2 orders of magnitude slower than templated polymerization.
- N+2 addition rates are polymerase specific and are thought to occur at first order rates 2 orders of magnitude slower than N+1 product formation. End abbreviations: 3′ indicates 3′-OH, 5′- indicates 5′-PO4.
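The trade-off between incomplete tailing and over-tailing can be illustrated with the textbook solution for two consecutive first-order reactions, N → N+1 → N+2. This sketch ignores pyrophosphorolysis (the reverse reaction) and uses hypothetical rate constants; only the hundredfold ratio between the two rates is taken from the text.

```python
import math


def tailing_fractions(t: float, k1: float, k2: float):
    """Fractions of N, N+1, and N+2 species at time t.

    Model (an assumption for illustration): irreversible consecutive
    first-order steps N -> N+1 (rate k1) and N+1 -> N+2 (rate k2).
    """
    if k1 == k2:
        raise ValueError("closed form used here requires k1 != k2")
    n0 = math.exp(-k1 * t)
    n1 = k1 / (k2 - k1) * (math.exp(-k1 * t) - math.exp(-k2 * t))
    n2 = 1.0 - n0 - n1
    return n0, n1, n2


if __name__ == "__main__":
    k1 = 1.0          # hypothetical N+1 rate, per hour
    k2 = k1 / 100.0   # N+2 rate, two orders of magnitude slower
    for hours in (1, 4, 16):
        n0, n1, n2 = tailing_fractions(hours, k1, k2)
        print(f"{hours:>2} h  N={n0:.3f}  N+1={n1:.3f}  N+2={n2:.3f}")
```

With these illustrative constants the N+1 fraction rises quickly and then decays slowly as N+2 accumulates, which is the qualitative reason reaction duration has to be balanced against reaction rate.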
- T4 ligation assay: A 5′-phosphorylated hairpin oligo with a 3′-blunt end was purchased from IDT (5′Phos-15HP; Table 2). Oligos were first refolded by incubating 20 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C. All subsequent tailing reactions used 16 µM 5′Phos-15HP (blunt-end with 15 nt in the hairpin region) and 1.19 mM dNTP (the dNTP used is specified on the lane figure panel), and were tailed for 1 h at the specified temperature using the specified polymerases.
- T4 ligation reactions were performed with 11.2 µM of oligo for 1 h using T4 DNA Ligase Reaction Buffer which contains 1 mM ATP.
- FIG. 6BA Tailing screen for Taq polymerase (0.25 U/µL, 72 °C) and Klenow Fragment (exo-; KF) polymerase (0.68 U/µL, 37 °C) followed by high concentration T4 ligation.
- FIG. 6BB Tailing screen for Deep Vent (exo-; DV) polymerase (0.1 U/µL, 72 °C) and Therminator (Therm) polymerase (0.1 U/µL, 72 °C) followed by high concentration T4 ligation.
- FIGs 6CA-6CM show UPLC/QTOF validation of tailing activity for all dNTPs and dxNTPs by Klenow Fragment (exo-).
- FIG. 2B Full set of controls for the data shown in FIG. 2B.
- Extracted ion chromatograms (EIC) show relative abundance of either dNMP or dxNMP release when corresponding dNTPs/dxNTPs are used as a substrate for polymerase (KF exo-) tailing. Chromatogram scales are normalized for comparison of runs within each panel. dNTP or dxNTP used in each reaction shown in panel legend.
- FIGs 6DA-6DM show UPLC/QTOF validation of tailing activity for all dNTPs and dxNTPs by Therminator.
- FIG. 2C Full set of controls for the data shown in FIG. 2C.
- Extracted ion chromatograms (EIC) show relative abundance of either dNMP or dxNMP release when corresponding dNTPs/dxNTPs are used as a substrate for polymerase (Therminator; Therm) tailing. Chromatogram scales are normalized for comparison of runs within each panel. dNTP or dxNTP used in each reaction shown in
- FIGs 6EA-6EE show screening and optimization of XNA tailing conditions. All tailing reactions used 11.9 µM 5′Phos-11HP, 1.19 mM of specified dNTP/dxNTP, and tailed at the specified temperature for the specified times using either Klenow Fragment (KF exo-; 0.71 U/µL) or Therminator (Therm; 0.29 U/µL). Tailing completeness was measured via T4 ligation assays.
- FIG. 6EA XNA tailing screen using KF exo- and Therm for 8 h.
- FIG. 6EB XNA tailing screen using KF and Therm for 8 h.
- FIG. 6EC Additional Sc tailing screen using Therm for 8 or 16 h.
- FIG. 6F shows addition of yeast inorganic pyrophosphatase (YiPP) leads to slight improvements in XNA tailing reaction yield.
- 5′-phosphorylated hairpin oligos with either a 3′-blunt end or 3′-single nucleotide (-G, or -C) overhangs were purchased from IDT (5′-Phos-11HP; Table 2). Separately, 11.4 µM of 3′-blunt end oligos were tailed with 1.14 mM of dCTP or dGTP, Klenow Fragment (exo-; KF; 0.68 U/µL), and either 0.009 U/µL of YiPP or no YiPP at 37 °C for 4 h.
- Ligation reactions were performed using 2.6 µM of two oligos with complementary overhang bases, either enzymatically tailed (G, C) or synthesized overhangs (G*, C*). Ligation reactions were incubated for 15 min at 16 °C using T7 DNA ligase (272 U/µL) and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000. Blunt-end hairpins (-/-) serve as a negative ligation control as the short reaction time prevents blunt end ligation.
- Unligated materials were digested using exonuclease I (2.7 U/µL), exonuclease III (13.3 U/µL) and exonuclease VII (1.33 U/µL) for 1 h at 37 °C. Exonuclease reactions were heat inactivated by incubation at 95 °C for 10 min and then at 80 °C for 10 min.
- Exo VII was used, which has a higher heat inactivation temperature than the Exo VIII (truncated) used in other aspects of this disclosure. It was also found that Exo VII would result in incomplete digestion (lower band) and required different buffer conditions. In subsequent screening work, Exo VIII (truncated) was used instead in the exonuclease treatment steps. Positive control with G* and C* shows ligation of hairpins with G and C synthetic overhangs. Gel representative of a single experimental replicate.
- FIG. 6G shows enzymatic tailing does not lead to measurable differences in ligation when compared to ligation using fully synthetic hairpin with N+1 tails.
- over-tailed product i.e., more than one nucleotide added to the blunt 3′-end
- N+1 tailed hairpin would result in dsDNA that contains a gap of one or more nucleotides.
- the gap region exposes a 3′ and 5′ end that would make this product susceptible to exonuclease degradation. Therefore, one way to test whether over-tailing was a problem was to compare how much ligated product was observed (as measured by agarose gel band intensity) when hairpins were tailed enzymatically versus made synthetically.
- 5′-phosphorylated hairpin oligos with either a 3′-blunt end or 3′-single nucleotide (-G, or -C) overhangs were purchased from IDT (Table 2). Oligos were first folded using previously described methods. Blunt end oligo 5′Phos-11HP was then tailed with dCTP using conditions listed in Table 8. Subsequent ligation reactions were performed using T7 or T4 DNA ligase. Either the dCTP-tailed oligo (Tailed) or 5′Phos-HP-3′C (Synth) was ligated to 5′Phos-HP-3′G.
- T7 ligation reactions: 2.7 µM of each oligo were incubated with 272 U/µL of T7 DNA ligase and StickTogether™ DNA ligase buffer at 16 °C for 15 min, after which the ligase was heat inactivated at 65 °C for 10 min.
- 4.2 µM of each oligo were incubated with 80 U/µL of T4 DNA ligase and T4 DNA ligase buffer at 16 °C for 2 h, after which the ligase was heat inactivated at 65 °C for 10 min.
- FIGs 6HA-6HQ show high resolution LC/MS of oligo showing N+1 tailing as major product.
- FIG. 6HA Hairpin oligo, 5′Phos-ScaI-HP (Table 2) was tailed
- FIG. 6I shows an overview of T3 DNA ligase, T4 DNA ligase, and T7 DNA ligase products. (top) Major products formed from T3 ligation and T4 ligation assays between hairpins generated in this disclosure. (bottom) Major and minor products formed for T7 ligation assays in this disclosure.
- T7 ligase preferentially ligates hairpins with a cohesive nucleotide overhang and has minimal blunt-end ligation activity.
- T7 ligase has been observed to perform blunt end ligation though to a lesser extent than T3 ligase and T4 ligase.
- Full hairpin sequences used in this disclosure can be found in Table 2. Nucleic acid end abbreviation: 3′ indicates 3′-OH, 5P′- indicates 5′-PO4.
- FIG. 6J shows an overview of XNA ligation products from XNA tailed hairpins. XNA ligation reactions were optimized making the following considerations of possible side products.
- FIGs 6KA-6KE show screening and optimization of ligation conditions across all XNA bases. All tailing reactions used conditions listed in Table 8 unless otherwise specified.
- Ligation reactions were performed using 4.7 µM of one oligo or 2.4 µM of two oligos with complementary tailed bases. Ligation reactions were incubated for 16 h at 16 °C using the specified ligase and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000. Improperly ligated
- FIGs 6LA-6LC show results from screening T3 ligase, T4 ligase, T7 ligase for JV, XtKn, and BSc XNA ligation.
- Two blunt end hairpins that create a restriction enzyme site upon blunt ligation were purchased from IDT (5′Phos-NdeI-HP-1 and 5′Phos-NdeI-HP-2; Table 2). Blunt-end ligated hairpins create an NdeI restriction site, while successfully tailed and ligated hairpins do not.
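The logic of this assay can be checked in a few lines. The sketch below uses the NdeI recognition sequence (CATATG) and two made-up stem ends; the actual hairpin sequences are in Table 2 and are not reproduced here, so the example strings and the function name are hypothetical.

```python
NDEI_SITE = "CATATG"  # NdeI recognition sequence (palindromic)


def junction_has_ndei(left_end: str, right_start: str, inserted: str = "") -> bool:
    """Check whether joining two stem ends recreates an NdeI site.

    left_end / right_start: top-strand sequence on each side of the junction.
    inserted: letters added between them by tailing and ligation ("" = blunt).
    """
    return NDEI_SITE in (left_end + inserted + right_start)


if __name__ == "__main__":
    left, right = "GGCCAT", "ATGGCC"   # hypothetical stem ends, not from Table 2
    print(junction_has_ndei(left, right))                 # blunt ligation
    print(junction_has_ndei(left, right, inserted="P"))   # one XNA pair inserted
```

A blunt junction of the two halves restores CATATG and is therefore cleavable, while an inserted base pair splits the site, so NdeI digestion distinguishes the two outcomes.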
- FIG. 6LA T3 ligase assay (272 U/µL);
- FIG. 6LB T4 ligase assay (36 U/µL);
- FIG. 6LC T7 ligase assay (272 U/µL) for reactions containing single hairpins or mixture of two hairpins (as indicated).
- FIGs 6MA-6MC show full gels of XNA tailing and XNA ligation using optimized conditions. All assays were done with a 5′-phosphorylated hairpin oligo with a 3′-blunt end, purchased from IDT (5′-Phos-11HP; Table 2). Each DNA/XNA base was tailed using conditions from Table 8. (FIG. 6MA) Full gel for optimized XNA tailing conditions from FIG. 2E. Tailing completeness was measured via T4 ligation.
- FIGs 6NA-6NF show a proof of concept for XNA tailing and XNA ligation cycling to insert two consecutive P·Z base pairs.
- FIG. 6NA Agarose gel showing steps in consecutive XNA insertion.
- FIG. 6NB A hairpin containing an MlyI restriction site adjacent to the site of XNA ligation is used (donor hairpin, HPD).
- MlyI is a type IIS restriction enzyme (5′-GAGTCNNNNN↓-3′) that leaves a blunt end after cutting.
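Because a type IIS enzyme cuts at a fixed distance from its recognition site, the cut position can be computed directly from the sequence. The following is a minimal sketch under stated simplifications: it scans only the given (top) strand for GAGTC in the forward orientation, ignores sites in the reverse orientation, and treats the sequence as a plain string. The function name and the example sequence are hypothetical.

```python
def mlyi_cut_positions(seq: str) -> list[int]:
    """Return indices i where MlyI would cut between seq[i-1] and seq[i].

    Rule used: recognition site GAGTC followed by five arbitrary bases,
    then a blunt cut (5'-GAGTCNNNNN|-3'). Forward orientation only.
    """
    site, offset = "GAGTC", 5
    cuts = []
    start = seq.find(site)
    while start != -1:
        cut = start + len(site) + offset
        if cut <= len(seq):          # skip sites too close to the end
            cuts.append(cut)
        start = seq.find(site, start + 1)
    return cuts


if __name__ == "__main__":
    seq = "TTGAGTCAAAAAGGG"          # hypothetical example sequence
    for cut in mlyi_cut_positions(seq):
        print(seq[:cut], "|", seq[cut:])   # TTGAGTCAAAAA | GGG
```

Placing the recognition site so that the cut falls immediately next to the ligated base pair is what lets the donor hairpin be removed while the inserted pair stays at the new blunt end.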
- a donor hairpin with an MlyI site and an acceptor hairpin were tailed with P and Z respectively (generating HPD-P, HPA-Z), ligated and treated with exonucleases following the optimized conditions described in this disclosure, and then purified (lane 1).
- the purified construct contains a single P·Z base pair insertion.
- site was prepared by XNA tailing (HPP-P).
- XNA ligation followed by MlyI and exonuclease treatment does not result in formation of a ligation product (lane 3).
- FIG. 6ND In a second round, reaction product mixture from lane 2 was tailed with Z to produce Z-tailed donor hairpin (HPD-Z) and Z-tailed PZ-acceptor hairpin (HPA-ZZ).
- XNA ligation followed by MlyI and exonuclease treatment does not result in formation of a ligation product (lane 4).
- FIGs 6OA-6OB show examples of basecalling XNA sequences with guppy.
- FIG. 6OA ONT guppy was trained to basecall sequences composed of standard nucleic acids (A, T, G, or C).
- FIG. 6PB Complete NNNNNNN library products for all XNA base pairs and blunt end ligation library sequenced in this disclosure.
- FIG. 6PC Self-ligation for library hairpins to check for incomplete tailing and pyrophosphorolysis products. Library hairpins were tailed with the listed XNA using conditions listed in Table 8, and 4.7 µM of each hairpin (except B* and Sc at 2.6 µM) was ligated to itself using the conditions listed in Table 10.
- FIGs 6QA-6QI show examples of variance minimization for segmentation steps of signal-to-sequence mapping.
- Signal-to-sequence mapping was performed using Tombo. Tombo uses an informed kmer model to improve the accuracy of signal-to-sequence mapping. Without a prior model, segmentation requires assigning each XNA to a standard base. Improper segmentation leads to inaccurate model parameter estimates. To minimize bias in segmentation, each XNA was assigned to the standard base that minimized the total variance in observed kmer signal levels.
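The selection rule in the last sentence can be sketched as a small argmin over candidate substitutions. This is an illustration of the idea only: the data layout (one list of observed signal levels per kmer, per candidate standard base), the use of population variance, and all names and numbers are assumptions, not the disclosed implementation.

```python
from statistics import pvariance


def total_variance(levels_by_kmer: dict[str, list[float]]) -> float:
    """Sum of per-kmer variances of the observed signal levels."""
    return sum(pvariance(levels) for levels in levels_by_kmer.values() if levels)


def best_standard_substitute(levels_by_substitute):
    """Choose the standard base whose segmentation gives the lowest total variance.

    levels_by_substitute: {standard_base: {kmer: [observed levels]}}, where the
    levels were obtained by segmenting reads with the XNA replaced by that base.
    """
    totals = {base: total_variance(kmers) for base, kmers in levels_by_substitute.items()}
    return min(totals, key=totals.get), totals


if __name__ == "__main__":
    # Hypothetical levels for one XNA segmented as "A" versus as "G".
    observed = {
        "A": {"kmer1": [1.0, 5.0, 9.0], "kmer2": [2.0, 6.0, 10.0]},
        "G": {"kmer1": [4.0, 5.0, 6.0], "kmer2": [5.0, 6.0, 7.0]},
    }
    base, totals = best_standard_substitute(observed)
    print(base, totals)
```

The intuition is that a poor substitute misplaces segment boundaries, which mixes signal from neighboring kmers and inflates the spread of the extracted levels.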
- FIGs 6RA-6RE show example traces of signal deviation from the standard model.
- FIG. 6S shows an example xenomorph preprocessing pipeline.
- Xenomorph preprocess integrates basecalling, raw multi-to-single fast5 conversion, reference sequence fasta conversion, segmentation, and level assignment into a single command.
- Level extracted output files from xenomorph preprocess are inputs to basecalling through alternative hypothesis testing using xenomorph morph. Separating the preprocessing steps from alternative hypothesis testing allows users to experiment with basecalling using various model parameter settings or with alternative models without having to rerun the slower signal extraction steps.
- xenomorph preprocess uses guppy for initial basecalling, minimap2 for initial basecall-reference alignment, and ONT Tombo for signal normalization and signal-to-sequence alignment.
- FIGs 6TA-6TC show PCR amplification and sequencing of a DNA template with a P·Z base pair.
- FIG. 6TA Synthetic template DNA containing a P·Z base pair was amplified with Taq polymerase in a pH 8.0 buffer with varying concentrations of dxNTP and dNTP (Tables 22, 23). PCR products were sequenced on a MinION nanopore flow cell then basecalled for PZ detection. Read fractions that basecalled to (FIG. 6TB) P and (FIG. 6TC) Z for each condition are shown. PCR conditions differ by concentration of dxNTP and dNTPs used. The remaining fraction for each base corresponds to G and C basecalls (the most likely standard mutation for P and Z), respectively.
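The read fractions described here reduce to a simple count. The sketch below assumes per-read basecalls at the XNA position are already available as a list of letters, and that only two outcomes are considered for each position (the XNA and its most likely standard replacement); the function name and the example calls are hypothetical.

```python
from collections import Counter


def basecall_fractions(calls, xna: str, fallback: str):
    """Fraction of reads called as the XNA versus its standard fallback.

    calls: per-read basecalls at one position, e.g. ["P", "P", "G", ...].
    xna / fallback: e.g. ("P", "G") or ("Z", "C").
    """
    counts = Counter(calls)
    considered = counts[xna] + counts[fallback]
    if considered == 0:
        raise ValueError("no reads called as either base")
    return counts[xna] / considered, counts[fallback] / considered


if __name__ == "__main__":
    calls = ["P", "P", "P", "G"]     # hypothetical basecalls for one PCR condition
    retained, mutated = basecall_fractions(calls, "P", "G")
    print(retained, mutated)         # 0.75 0.25
```

Computing the same two fractions for each PCR condition gives a direct measure of how well the P·Z pair is retained through amplification.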
- FIGs 6UA-6UB show construction of 12-letter DNA for nanopore sequencing. All assays were performed using 12-letter DNA construction oligos as
- FIG. 6V shows an example workflow from sequencing to heptamer classification.
- FIGs 6WA-6WB and 6XA-6XB show an example method for generating a defined non-standard nucleotide base pair library that uses a Type IIS restriction enzyme and a context barcode (“Barcode”) associated with a sequence context and a pool barcode (“Pool-Barcode”) associated with a non-standard nucleotide, as well as steps for sequencing and machine learning (ML) model training. Randomer region indicated.
- FIGs 6YA-6YF show example process flows for training ML models for processing read data obtained by nanopore sequencing of polynucleotide sequences containing non-standard nucleotides (FIGs 6YA-6YD), as well as base calling using trained ML models for quantification of XNA retention in PCR reactions (FIG. 6YE) and quantification of XNA transcription errors from in vivo transcription (FIG. 6YF).
- the present disclosure provides an array of breakthrough approaches for synthesizing polynucleotide (e.g., DNA) sequences containing at least one non-standard nucleotide.
- the non-standard nucleotide can include a hydrogen bonding pattern that is consistent or compatible with a hydrogen bonding pattern of a standard or existing nucleotide, e.g., C, G, T, A
- the present disclosure also provides breakthrough approaches for synthesizing polynucleotide sequences containing one or more non-standard nucleotides, optionally using next-generation sequencing (NGS) platforms, such as nanopore sequencing.
- the disclosure also enables non-standard nucleotides to be integrated into a wide range of technologies, such as biological computing and information storage systems, therapeutics, aptamers, biosensors, and the like.
- Methods of synthesizing polynucleotides containing one or more non-standard nucleotides make use of an N+1 tailing reaction of a suitable DNA polymerase. Accordingly, in an aspect, the disclosure provides a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template, such that the non-standard nucleotide is non-base-paired.
- the method comprises combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to facilitate a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
- the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
- the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I, as further described herein.
- the polypeptide sequence comprises a sequence of SEQ ID NO:2.
- a variety of XNAs can be incorporated into DNA using methods of the present disclosure; however, it was found that improvement or optimization of reaction conditions allows for the N+1 tailing reaction to proceed at an acceptable rate.
- the non-standard nucleotide being added is B or P, and the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71
- the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP. While these or similar conditions were found to be effective for the disclosed reaction, other conditions, including less-than-optimal or non-improved conditions, can be implemented in embodiments without departing from the scope and spirit of the disclosure.
- While the KF exo- of DNA polymerase I can be used in embodiments, this is not the only DNA polymerase that was surprisingly and unexpectedly found to have the ability to add non-standard nucleotides to a dsDNA template in an N+1 tailing reaction.
- the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
- the engineered polymerase is a variant of 9°N DNA polymerase.
- the polypeptide sequence comprises a sequence of SEQ ID NO:3 (e.g., Therminator™).
- the disclosure provides a method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
- the base pair is comprised of one non-standard nucleotide base paired with one standard nucleotide.
- the base pair is comprised of a first non-standard nucleotide base paired with a second non-standard nucleotide.
- Creation of a base pair that is comprised of two non-standard nucleotides can be implemented with a method that comprises generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, such that the second non-standard nucleotide is non-
- the second N+1 tailing product can be generated based on the same or a similar reaction as the N+1 tailing product (of the first N+1 tailing reaction).
- the method can further include ligating the N+1 tailing product with the second N+1 tailing product, which forms a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide, as further described herein.
- the N+1 tailing product can be linear or, in embodiments, can comprise a hairpin.
- the second N+1 tailing product can be linear or, in embodiments, can comprise a hairpin.
- the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end and is fully resistant to exonucleases.
- Additional non-standard nucleotides can be added iteratively and/or sequentially, such that two or more non-standard nucleotides can be added or inserted to a polynucleotide. This can be achieved by cleaving the dsDNA ligation product and exposing the non-standard base pair. The resultant blunt-end DNA template then becomes a template for a subsequent N+1 tailing reaction.
- the method comprises contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition that is conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product, which generates a blunt-end DNA template.
- the resultant blunt-end DNA template comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
- the method can be performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of the further dsDNA ligation product.
- the method comprises contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
- the method is modular and can be repeated any number of times for addition of any number of non-standard nucleotides, either with non-standard nucleotides added in a continuous manner or in a manner such that the non-standard nucleotides are interspersed with, or interrupted by, one or more standard nucleotides, for example.
- a quantity of non-standard nucleotides added to a polynucleotide with a method of the disclosure is selected from the group including, but not necessarily limited to, the set of integers defined by the range of 1 to 10,000,000,000, inclusive.
- a quantity of standard nucleotides added to a polynucleotide with a method of the present disclosure is selected from the group including, but not necessarily limited to, the set of integers defined by the range of 1 to 10,000,000,000, inclusive.
- the non-standard nucleotide comprises an epigenetic modification, a modified sugar, a phosphate backbone, a nucleobase, a nucleobase that can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
- the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base. In other example embodiments, the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base. In embodiments, the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
- the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
- the disclosure also contemplates products, and in at least some instances, intermediates, of methods herein as also being within the scope of the disclosure.
- the disclosure provides a dsDNA ligation product that can comprise a non-standard nucleotide.
- the disclosure provides a further dsDNA ligation product that can comprise two or more non-standard nucleotides.
- the disclosure contemplates defined libraries of non-standard nucleotide base pairs, in any of a variety of nucleotide contexts, produced by the methods
- the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product or the blunt-end dsDNA template.
- the library polynucleotide sequence comprises a base pair between a non-standard nucleotide and a second non-standard nucleotide.
- a plurality of base pairs can be incorporated into one or more defined libraries.
- the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of a further dsDNA ligation product or a further blunt-end dsDNA template, such that the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
- a library polynucleotide sequence further comprises a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence, and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
- These or similar barcodes can be comprised of standard or otherwise sequence-able nucleotides, such that the identities of the non-standard nucleotides and the contexts can be known with a high degree of confidence. This facilitates correlation between the empirical data and the non-standard nucleotide bases being observed.
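- As a rough illustration of how the two barcodes could be used together during analysis, the sketch below maps a (pool barcode, context barcode) pair, read from standard bases, back to a known XNA base pair and its flanking context. The barcode strings, the `LibraryEntry` fields, and the lookup-table layout are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LibraryEntry:
    xna_pair: Tuple[str, str]  # the non-standard base pair, e.g. ("B", "Sn")
    context: str               # flanking standard bases; "X" marks the XNA position


# Hypothetical barcodes made only of standard, sequence-able bases.
POOL_BARCODES = {"ACGT": ("B", "Sn"), "TTAG": ("P", "Z")}
CONTEXT_BARCODES = {"GGCA": "AGTXCCT", "CATG": "TTAXGGA"}


def identify(pool_bc: str, context_bc: str) -> Optional[LibraryEntry]:
    """Return the known XNA pair and context for a read, or None if a barcode is unknown."""
    if pool_bc not in POOL_BARCODES or context_bc not in CONTEXT_BARCODES:
        return None
    return LibraryEntry(POOL_BARCODES[pool_bc], CONTEXT_BARCODES[context_bc])


entry = identify("ACGT", "GGCA")
```

- Because both barcodes are ordinary A/T/G/C sequence, a read can be labeled with its ground-truth XNA identity before any XNA-aware basecalling is attempted.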
- Machine learning can be used with one or more methods for facilitation of sequence data analysis.
- the disclosure provides a method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide, for assignment of an identity to the unknown non-standard nucleotide.
- Such a method comprises sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library to produce the one or more observed current reads, and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library.
- the ML model can be configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
- the ML model comprises a convolutional long short-term memory recurrent neural network (LSTM RNN); however, other ML models can be implemented in embodiments.
- the disclosure also contemplates computer memory, computer products, computer devices, computer systems, and the like, that implement all or part of one or more methods of the disclosure as being within the scope of the disclosure.
- the disclosure provides a non-transitory computer-readable storage medium having stored thereon at least part of a ML model.
- the disclosure provides a computational device or computational system comprising the non-transitory computer-readable storage medium.
- the disclosure provides a nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium, optionally further including instructional materials for use of the kit.
- the disclosure provides novel and innovative tools for use in synthesizing and sequencing polynucleotides containing non-standard nucleotides. Accordingly, in an aspect, the disclosure provides a method for basecalling a non-standard nucleotide expanded alphabet.
- the method comprises sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read, computing, with the computational device or computational system, the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library associated with the subject current read for an association, and computing, based on the association, a structure of the non-standard nucleotide.
- the structure of the non-standard nucleotide can include, correspond, or relate to an identity of the non-standard nucleotide.
- circuitry includes dedicated hardware having electronic circuitry configured to perform operations or computations on a dedicated basis, without any use of microprocessors, central processing units, or software or firmware or processor-executable instructions.
- circuitry includes, among other things, one or more computing devices such as one or more processors (e.g., microprocessor(s)), one or more central processing units (CPU), one or more digital signal processors (DSP), one or more application-specific integrated circuits (ASIC), one or more field-programmable gate arrays (FPGA), or the like, or any variations or combinations thereof, and can include discrete digital and/or analog circuit elements or electronics, or combinations thereof.
- circuitry includes combinations of circuits and computer program products having software or firmware processor-executable instructions stored on one or more computer readable memories, e.g., non-transitory computer-readable storage mediums, that work together to cause a device or system to perform one or more methodologies or technologies described herein.
- circuitry includes circuits, such as, for example, microprocessors or portions of microprocessors, that require software, firmware, and the like for operation.
- circuitry includes an implementation comprising one or more processors or portions thereof and accompanying software, firmware, hardware, and the like.
- circuitry includes a baseband integrated circuit or applications processor integrated circuit or a similar integrated circuit in a server, a cellular network device, other network device, or other computing device.
- circuitry includes one or more remotely located components.
- remotely located components e.g., server, server cluster, server farm, virtual private network, etc.
- non-remotely located components e.g., desktop computer, workstation, mobile device, controller, etc.
- remotely located components are operatively connected via one or more receivers, transmitters, transceivers, or the like.
- Embodiments include one or more data stores that, for example, store instructions and/or data.
- Non-limiting examples of one or more data stores include volatile memory (e.g., Random Access memory (RAM), Dynamic Random Access memory (DRAM), or the like), non-volatile memory (e.g., Read-Only memory (ROM), Electrically Erasable Programmable Read-Only memory (EEPROM), Compact Disc Read-Only memory (CD-ROM), or the like), persistent memory, or the like. Further non-limiting examples of one or more data stores include Erasable Programmable Read-Only memory (EPROM), flash memory, or the like.
- the one or more data stores can be connected to, for example, one or more computing devices by one or more instructions, data, or power buses.
- circuitry includes one or more computer-readable media drives, interface sockets, Universal Serial Bus (USB) ports, memory card slots, or the like, and one or more input/output components such as, for example, a graphical user
- circuitry includes one or more user input/output components that are operatively connected to at least one computing device to control (electrical, electromechanical, software-implemented, firmware-implemented, or other control, or combinations thereof) one or more aspects of the embodiment.
- circuitry includes a computer-readable media drive or memory slot configured to accept signal-bearing medium (e.g., computer-readable memory media, computer-readable recording media, or the like).
- a program for causing a system to execute any of the disclosed methods can be stored on, for example, a computer-readable recording medium (CRMM), a signal-bearing medium, or the like.
- signal-bearing media include a recordable type medium such as any form of flash memory, magnetic tape, floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), Blu-Ray Disc, a digital tape, a computer memory, or the like, as well as transmission type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link (e.g., transmitter, receiver, transceiver, transmission logic, reception logic, etc.)).
- signal-bearing media include, but are not limited to, DVD-ROM, DVD-RAM, DVD+RW, DVD-RW, DVD-R, DVD+R, CD-ROM, Super Audio CD, CD-R, CD+R, CD+RW, CD-RW, Video Compact Discs, Super Video Discs, flash memory, magnetic tape, magneto-optic disk, MINIDISC, non-volatile memory card, EEPROM, optical disk, optical storage, RAM, ROM, system memory, web server, or the like.
- the present application can include references to directions, such as “vertical,” “horizontal,” “front,” “rear,” “left,” “right,” “top,” and “bottom,” etc. These references, and other similar references in the present application, are intended to assist in helping describe and understand the particular embodiment (such as when the embodiment is positioned for use) and are not intended to limit the present disclosure to these directions or locations.
- The present application can also reference quantities and numbers. Unless specifically stated, such quantities and numbers are not to be considered restrictive, but examples of the possible quantities or numbers associated with the present application.
- “about” refers to the stated value and a range that includes values 11% above the stated value, 12% above the stated value, 13% above the stated value, 14% above the stated value, 15% above the stated value, 16% above the stated value, 17% above the stated value, 18% above the stated value, 19% above the stated value, 20% above the stated value, 21% above the stated value, 22% above the stated value, 23% above the stated value, 24% above the stated value, or 25% above the stated value.
- When a range is stated, e.g., the range of 1-16, the stated range includes every value between the lower and upper limits as well as the lower and upper limits of the stated range, themselves, as stated values.
- the approximately stated range includes every value between the lower and upper limits as well as the lower and upper limits of the stated range, themselves, as stated values (e.g., 1 and 16 are each stated values), including those non-stated values that are near to or approximate the stated values according to practicable ranges as would be recognized by those skilled in the art or as otherwise described herein.
- the phrase “at least one of A, B, and C,” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C), including all further possible permutations when greater than three elements are listed.
- the term “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C), including all further possible permutations when greater than three elements are listed.
- the term “or” is an inclusive “or”, and the phrase “A or B” means (A), (B), or (A and B).
- the term “and” requires both elements; for example, the phrase “A and B” means (A and B).
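- The definitions of “at least one of A, B, and C” and “A, B, and/or C” given above amount to every non-empty combination of the listed elements. As an illustrative sketch only (the function name `at_least_one_of` is hypothetical), the enumeration can be written as:

```python
from itertools import combinations


def at_least_one_of(items):
    """Return every non-empty combination of the listed elements, smallest first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


three = at_least_one_of(["A", "B", "C"])
four = at_least_one_of(["A", "B", "C", "D"])
```

- For three elements this yields the seven groupings listed above, (A) through (A, B, and C); for n elements it yields 2^n - 1 groupings.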
- the term “comprising” is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.
- Example 1 Enzymatic Synthesis and Nanopore Sequencing of 12-letter Supernumerary DNA
- Abstract The 4-letter DNA alphabet (A, T, G, C) is an elegant, yet non-exhaustive solution to the problem of storage, transfer, and evolution of biological information. This example provides strategies for both writing and reading DNA with expanded alphabets composed of up to 12 letters (A, T, G, C, B, S, P, Z, X, K, J, V).
- an enzymatic strategy is devised for inserting a singular, orthogonal xenonucleic acid (XNA) base pair into standard DNA sequences using 2′-deoxy-xenonucleoside triphosphates as substrates. Integrating this strategy with combinatorial oligos generated on a chip, libraries are constructed containing single XNA bases for parameterizing kmer basecalling models for nanopore sequencing. These elementary steps are combined to synthesize and sequence DNA containing 12 letters – the upper limit of what is accessible within the electroneutral, canonical base pairing framework.
- the 4-letter standard genetic alphabet of DNA (A, T, G, C) is ubiquitous and one of the defining biomolecular signatures of life on Earth. The ability to read, write, and translate this information forms the basis for life as an emergent property of nucleic acid heteropolymers. Humanity has learned how to manipulate the 4 letters of DNA, spurring major advances in biotechnology, information storage, and healthcare.
- the standard nucleic acids can be components for diagnostic tests to screen for disease or detect toxins, therapeutics that create immune responses, and even as a molecular system for long-term storage of digital information.
- Parameters of biomolecular compatibility of expanded non-canonical hydrogen bonding base pairings include stability in the DNA double helix, the ability to be replicated by DNA polymerases, transcribed by RNA polymerases, reverse transcribed by reverse transcriptases, and even translated by the ribosome. These xenonucleotides are at the forefront of nucleic acids research since they significantly expand DNA’s chemical, structural, and binding repertoire.
- methods for sequencing of xenonucleic acids are decades behind that of DNA and RNA, and rely on low-throughput, non-multiplexed measurements, such as gel-shift assays, mass spectrometry, and selective conversion of XNAs to standard bases followed by Sanger sequencing.
- XNA sequencing technology is lower throughput, less sensitive, and less generalizable than the methods Sanger and Coulson developed in the 1970s and has no service-oriented solution.
- ATGC-sequencing technology is in its ‘third generation.’
- One possible solution is to adapt existing first-, second-, or third-generation DNA sequencing technology to work with more DNA letters.
- Nanopore sequencing has the ability to sequence non-canonical bases such as epigenetic and epitranscriptomic modifications.
- nanopore sequencing can be used for sequencing 8-letter hachimoji DNA (A, T, G, C, B, Sc, P, Z) using the Hel308 motor protein with an MspA pore.
- third-generation (high throughput, multiplexable, single molecule, real-time) sequencing of supernumerary DNA is possible despite the “k-mer explosion” in possible current signals induced by an expanded DNA alphabet.
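- The “k-mer explosion” referred to above follows from simple counting: an alphabet of a letters has a^k distinct k-mers. The sketch below compares 4-, 8-, and 12-letter alphabets; the k values shown are illustrative choices, not parameters taken from the disclosure.

```python
def kmer_count(alphabet_size: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet with the given number of letters."""
    return alphabet_size ** k


# Illustrative k values only.
for k in (4, 6):
    print(
        f"k={k}: standard={kmer_count(4, k)}, "
        f"hachimoji={kmer_count(8, k)}, 12-letter={kmer_count(12, k)}"
    )
```

- At k = 4, for example, the count grows from 256 (4 letters) to 4,096 (8 letters) to 20,736 (12 letters), which is why the number of distinct current levels a model must distinguish rises so quickly.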
- previous efforts in this regard did not attempt to build models for decoding the nanopore current signals to nucleic acid sequences.
- Non-standard bases can be classified using commercial nanopores (e.g., GridION, ONT). This can show that commercial nanopore sequencing platforms are indeed capable of sequencing chemically modified nucleobases including 2,4-diamino-purine, 5-nitro-indole, and 5-octadiynyldeoxyuracil.
- phosphoramidite synthesis – commercial access is both limited and costly, standing as a major barrier to entry.
- standard phosphoramidite synthesis costs for non-standard bases average around $100-400 USD/nt – or over 1000 times more expensive than A, T, G, C synthesis ($0.04-0.40 USD/nt).
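- The cost comparison above can be checked with a few lines of arithmetic. The sketch below uses only the two price ranges quoted in the text; comparing range endpoints and midpoints is an illustrative choice and not a calculation stated in the disclosure.

```python
from fractions import Fraction

# Price ranges quoted in the text, in USD per nucleotide.
xna_low, xna_high = Fraction(100), Fraction(400)
dna_low, dna_high = Fraction("0.04"), Fraction("0.40")

smallest_ratio = xna_low / dna_high    # cheapest XNA vs. most expensive DNA
largest_ratio = xna_high / dna_low     # most expensive XNA vs. cheapest DNA
midpoint_ratio = ((xna_low + xna_high) / 2) / ((dna_low + dna_high) / 2)

print(float(smallest_ratio), float(midpoint_ratio), float(largest_ratio))
```

- Depending on which ends of the ranges are compared, the ratio spans 250-fold to 10,000-fold, with the midpoints giving roughly 1,100-fold.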
- next-generation synthesis methods that have transformed the ability to explore sequence space (pooled synthesis, synthesis-on-a-chip, enzymatic synthesis) are not commercially available for orthogonal base pairs.
- Enzymes like terminal deoxynucleotidyl transferase (TdT) can catalyze non-templated addition of a wide range of modified nucleotide building blocks on ssDNA, and can do so at neutral pH.
- enzymes precludes them from being used for sequence-defined addition of dNTPs. Moreover, TdT-based enzymatic synthesis of nucleic acids would require specially protected building blocks or polymerase-nucleotide conjugates that are not commercially available.
- Lacking a suitable alternative, it was necessary to develop an enzymatic synthesis strategy that would be flexible enough to handle all desired xenonucleobases using 2′-deoxynucleoside triphosphates as the universal building block and be specific enough to catalyze a non-processing N+1 addition.
- the 2′-deoxy-xenonucleoside triphosphates of the remaining bases were chemically synthesized: dXtTP, dKnTP, dJTP, dVTP (FIGs 6BA-6BE).
- a sensitive liquid chromatography/mass spectrometry (UPLC/QTOF) assay was developed for detecting tailing activity.
- the hairpin design of the substrates generates a desired dsDNA ligation product that lacks a free 5′ or 3′ end, making it fully resistant to exonucleases. Subsequent treatment of the ligation reaction with exonucleases therefore allows one to remove unreacted starting material and partially ligated products.
- the ideal dsDNA ligase should be able to ligate DNA strands with single nucleotide overhangs and have relaxed specificity for both the overhanging nucleotide
- phage ligases T3 DNA ligase, T4 DNA ligase, and T7 DNA ligase
- FIG. 6I modified and non-standard nucleotide substrates
- a negative control can be performed in which hairpins are incubated individually in the presence of the respective ligases (FIG. 6J). In these single hairpin reactions, any ligation product would indicate either blunt-end ligation, from incomplete XNA tailing, or formation of a self-ligation (mismatch ligation) product.
- Nanopore sequencing (e.g., from Oxford Nanopore Technologies®) has features that make it adaptable for sequencing supernumerary DNA: it can sequence single DNA molecules without amplification, without the requirement for fluorescently labeled building blocks, and with high throughput (100k-10M reads per run). In nanopore sequencing, an ion current signal is generated as single-stranded DNA is threaded through a protein nanopore. Conversion of signal-to-sequence, or basecalling, is performed computationally by either statistical or machine learning models. However, since commercial nanopore basecalling algorithms were empirically trained on standard 4-letter DNA (A, T, G, C), they are unable to decode xenonucleobases (B, Sn, Sc, P, Z, Xt, Kn, J, V; FIGs 6OA-6OB).
- With this in mind, one can build and measure diverse DNA-XNA libraries that can be used to construct de novo ground-up models for sequencing single xenonucleotides within a natural DNA context.
- NNNNNNN library was sequenced independently for model building, generating between 150k – 800k raw reads per library (Tables 14-15). Signals were then segmented and aligned to each barcoded reference sequence while filtering reads that aligned to possible ligation side products (FIGs 3B, 6J and 6QA-6QI). From these signal-to-sequence alignments, XNA-heptamer
- Example kmer signal distributions can be generated. Mean signal currents spanning all 2,304 xenonucleotide-containing kmers, $\mu_k$, are shown in FIG. 3C and comparisons can be made to the most similar standard bases.
- Basecalling single xenonucleotide substitutions. Next, one can apply this model to predict signals emitted by sequences that contain a single xenonucleotide (B, Sn, Sc, P, Z, Xt, Kn, J, or V).
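- The figure of 2,304 xenonucleotide-containing kmers quoted above is consistent with counting 4-nt kmers that contain exactly one of the nine xenonucleotides and three standard bases (9 x 4 positions x 4^3 flanks). The sketch below enumerates them under that assumption; the tuple-of-symbols representation is a hypothetical encoding chosen so that two-character symbols such as Sn stay intact.

```python
from itertools import product

STANDARD = ["A", "T", "G", "C"]
XNA = ["B", "Sn", "Sc", "P", "Z", "Xt", "Kn", "J", "V"]


def single_xna_kmers(k: int = 4):
    """Enumerate k-mers with exactly one xenonucleotide and k-1 standard bases."""
    kmers = []
    for xna in XNA:
        for position in range(k):
            for flank in product(STANDARD, repeat=k - 1):
                symbols = list(flank)
                symbols.insert(position, xna)  # place the XNA at this position
                kmers.append(tuple(symbols))
    return kmers


kmers = single_xna_kmers()
print(len(kmers))
```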
- the expected signal is found by decomposition of a heptamer sequence into its constitutive kmers, then using measured kmer means to model current transitions (e.g., AGTBCCT → $[\mu_{\mathrm{AGTB}}, \mu_{\mathrm{GTBC}}, \mu_{\mathrm{TBCC}}, \mu_{\mathrm{BCCT}}]$).
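- A minimal sketch of this decomposition, assuming a sliding window of 4 nt and a lookup table of measured kmer means, is given below. The function names and the numeric means in `example_means` are hypothetical placeholders, not measured values, and the single-character sequence encoding would need a tokenizer for two-character symbols such as Sn.

```python
def decompose(sequence: str, k: int = 4):
    """Split a sequence into its overlapping k-mers, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def expected_signal(sequence: str, kmer_means: dict, k: int = 4):
    """Look up the mean current level for each k-mer of the sequence."""
    return [kmer_means[kmer] for kmer in decompose(sequence, k)]


# Placeholder means in arbitrary normalized units (not measured data).
example_means = {"AGTB": 0.42, "GTBC": -0.10, "TBCC": 1.05, "BCCT": 0.31}

windows = decompose("AGTBCCT")
levels = expected_signal("AGTBCCT", example_means)
```

- A heptamer therefore contributes four 4-nt windows, and the modeled signal is the corresponding sequence of four mean levels.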
- FIG. 3D shows examples of signal-level predictions generated by an example model (XNA model) overlaid on observations of that library sequence and the most similar standard-bases model (DNA model).
- the modeled probability density function can be used to calculate the likelihood that an observed set of signal levels was emitted from a particular sequence.
- the correct basecall should be the one that has the maximum likelihood of observation.
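- The maximum-likelihood rule described above can be sketched as follows. This example assumes, purely for illustration, that each kmer level is emitted from an independent Gaussian with a known mean and standard deviation; the disclosure does not specify the density in this passage, and all numeric values below are hypothetical.

```python
import math


def log_likelihood(observed, means, sds):
    """Sum of Gaussian log-densities of the observed levels under one candidate."""
    total = 0.0
    for x, mu, sd in zip(observed, means, sds):
        total += -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)
    return total


def basecall(observed, candidates):
    """Return the candidate label with the maximum log-likelihood.

    candidates maps a label to a (means, sds) pair for that candidate sequence.
    """
    return max(candidates, key=lambda label: log_likelihood(observed, *candidates[label]))


# Hypothetical signal levels for one read and two candidate models (e.g., P vs. G).
observed_levels = [0.9, -0.2, 1.4, 0.3]
candidate_models = {
    "P": ([1.0, -0.1, 1.5, 0.2], [0.3, 0.3, 0.3, 0.3]),
    "G": ([0.2, 0.5, 0.8, 1.0], [0.3, 0.3, 0.3, 0.3]),
}
call = basecall(observed_levels, candidate_models)
```

- Adding more entries to `candidate_models` extends the same comparison from a pairwise test to a test against all standard bases or all supernumerary letters.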
- the modularity of the 4-nt kmer model allows a diverse set of comparisons to be made between a xenonucleotide and 1) a standard base (e.g., P vs. G), 2) any of the standard bases (e.g., P vs. A, T, G, C), or 3) any of the full supernumerary letters (e.g., P vs. A, T, G, C, B, Sc, Z, Xt, Kn, J, V).
- XNA tailing and XNA ligation to enzymatically synthesize a new validation library composed of contextually diverse sequences.
- In this library, the nucleotide sequences adjacent to the XNA-containing heptamer can be further diversified, making them further removed in sequence space from those used to build the 4-nt kmer models.
- This validation library can be built
- each set of hairpins can contain 10 unique sequences.
- the 20 bp at the 3′-end of each hairpin can be designed by randomly selecting standard bases from a uniform probability distribution.
- Individual hairpin sets can be tailed with XNA bases using XNA tailing.
- Two sets of hairpins with complementary tails can be ligated, producing a library of 100 possible sequences (10 x 10), with each sequence containing a single XNA base pair. These ligated hairpin libraries can be pooled together and sequenced for benchmarking (FIGs 4B-4C).
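- The combinatorics of the validation library can be sketched as below: two sets of 10 hairpins, each with a 20-base 3′-end region drawn uniformly from the standard bases, pair up into 10 x 10 = 100 ligation products. The seeds and the function name `random_hairpin_ends` are illustrative assumptions, and the sketch models only the randomized region, not the full hairpin or the XNA tail.

```python
import random
from itertools import product


def random_hairpin_ends(n: int = 10, length: int = 20, seed: int = 0):
    """Draw n unique sequences of the given length, uniformly over A, T, G, C."""
    rng = random.Random(seed)
    ends = set()
    while len(ends) < n:
        ends.add("".join(rng.choice("ATGC") for _ in range(length)))
    return sorted(ends)


set_a = random_hairpin_ends(seed=1)  # hairpins to be tailed with one XNA
set_b = random_hairpin_ends(seed=2)  # hairpins tailed with the complementary XNA

# Every hairpin in set A can ligate to every hairpin in set B.
library = list(product(set_a, set_b))
print(len(library))
```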
- the elementary tailing and ligation synthesis steps can be coupled with an additional Golden Gate ligation to generate two proof-of-concept 12-letter supernumerary dsDNA hairpins: Scuper-12 and Snuper-12 (FIGs 6UA-6UB, Tables 7, 12, and 13).
- exonucleases can be added to remove intermediary DNA products, generating the desired 244 bp 12-letter dsDNA product.
- basecalling can be performed two different ways: 1) by comparing the XNA base at a position against a model that contains all 12 possible nucleobases, and 2) by comparing the XNA base at a position against a model that contains the XNA and the most similar standard nucleobase. Even when all 12 letters are present in the model, the presently disclosed basecalling model is able to properly decode XNAs in Scuper-12 with 39-89% per-read recall (FIG. 5, Tables 25, 26). In an example experiment, for the Snuper-12 sequence, all but one XNA were properly decoded in the 12-letter model, with the exception being Kn (per-read recall of 14%).
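- For readers unfamiliar with the metric, per-read recall can be understood as the fraction of reads covering a known XNA position that are called as the correct base. The sketch below assumes that definition; the example calls are made up and are not data from the disclosure.

```python
def per_read_recall(calls, true_base):
    """Fraction of per-read basecalls at one position that match the true base."""
    if not calls:
        raise ValueError("at least one read is required")
    correct = sum(1 for call in calls if call == true_base)
    return correct / len(calls)


# Hypothetical calls from four reads at a position known to be Kn.
example_calls = ["Kn", "G", "G", "Kn"]
recall = per_read_recall(example_calls, "Kn")
```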
- a general strategy is described for incorporating up to four additional orthogonal base pairs into standard DNA, and these methods can be used to build openly accessible models for sequencing XNAs (B, Sn, Sc, P, Z, Xt, Kn, J, V) in a standard DNA context (A, T, G, C) on commercial nanopore devices.
- the enzymatic synthesis strategy developed utilizes unmodified 2′-deoxy-xenonucleoside triphosphates as the elementary building blocks, avoiding the use of phosphoramidites or caged-triphosphates.
- Nanopore sequencing of XNAs can be performed using a nanopore sequencing device. This significantly expands the accessibility of sequencing XNAs. As history in sequencing progress has shown, additional widespread adoption and collection of XNA nanopore sequencing data can help further catalyze the improvement of sequencing models with newer basecalling algorithms, including data-intensive deep learning models. As these methods improve and adoption widens, strategies for synthesis and sequencing of higher complexity nucleic acids are possible.
- an additional base pair enables site-specific incorporation of chemically modified groups, including the addition of nucleobases such as Z that can act as a Brønsted base.
- Adenosine triphosphate sodium salt (ATP; A6419-5G), acetonitrile (A955-4; LC/MS-grade), formic acid (A118P-500), ammonium acetate (A637-500), ammonium carbonate (207861-25G), Tris base (10708976001), 5 M betaine solution (B0300-1VL), 6 N hydrochloric acid (1430071000), GelGreen (SCT124), and sodium chloride (S3014-5KG) were purchased from Sigma-Aldrich (St. Louis, MO).
- AMPure XP beads (A63880) were purchased from Beckman Coulter (Brea, CA).
- T4 DNA ligase and high-concentration T4 DNA ligase (M0202M, M0202L), T7 DNA ligase (M0318L), T3 DNA ligase (M0317S), yeast inorganic pyrophosphatase (YiPP; M2403L), thermolabile proteinase K (P8111S), Exo III (M0206L), thermolabile Exo I (M0568L), Exo I (M0293L), Exo VII (M0379L), Exo VIII (truncated; M0545S), Klenow Fragment (exo-; M0212L), Taq polymerase (M0267L), Bsu polymerase (M0330S), Deep Vent (exo-) polymerase (M0259S), Bst polymerase (M0275S), Sulfolobus DNA polymerase IV (M0327S), Therminator polymerase (M0261L), NEBNext® Ultra™ II End Repair
- Xenonucleoside triphosphates dScTP, dPTP, dZTP, and dBTP (dSTP-401S, dPTP-201, dZTP-101, dBTP-301P) were purchased from FireBird Biomolecular Sciences LLC (Alachua, FL).
- Xenonucleoside triphosphate dSnTP (M-1015) was purchased from TriLink
- the eluted oligo was then folded in 100 mM NaCl and 10 mM Tris-HCl (pH 8.2) buffer by incubating at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C. 15 µL of this refolded oligo was incubated with 0.17 mM dNTP or dxNTP, 300 units of Exo III, and either KF (exo-) with rCutSmart™ buffer or Therminator with ThermoPol® buffer for 16 h. For reactions using KF, the reaction was incubated with 15 units of KF at 37 °C.
- oligos are first refolded by incubating 40 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C.
- the refolded oligos are then tailed by incubating 23.8 µM of oligo in the presence of dNTP or dxNTP (1.19 mM or 2.38 mM), YiPP (0.005 U/µL; except for the dATP tailing reaction, which did not contain YiPP), polymerase (0.71 U/µL Klenow Fragment (KF exo-), 0.29 U/µL Therminator polymerase, or 0.71 U/µL Taq polymerase), and polymerase buffer (either rCutSmart™ or ThermoPol buffer). Full conditions are tabulated in Table 8.
- reactions were terminated by heat inactivation at 72 °C for 20 min.
- Therminator and Taq reactions were terminated by addition of 1X rCutSmart™ buffer and 0.005 U/µL of thermolabile proteinase K at 37 °C for 15 min, followed by subsequent heat inactivation at 72 °C for 20 min.
- hairpins were refolded.
- 19.8 µM of oligo was incubated with 1.8 U/µL of ScaI-HF at 37 °C for 2 h, followed by subsequent heat inactivation at 80 °C for 20 min.
- oligos are first refolded by incubating 20 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C.
- the refolded oligos are then tailed by incubating 11.9 µM of oligo in the presence of dNTP or dxNTP (1.19 mM or 2.38 mM), YiPP (0.005 U/µL; except for the dATP tailing reaction, which did not contain YiPP), polymerase (0.71 U/µL Klenow Fragment (KF exo-), 0.29 U/µL Therminator polymerase, or 0.71 U/µL Taq polymerase), and polymerase buffer (either rCutSmart™ or ThermoPol buffer).
- Reactions were either incubated for 8 h at 37 °C (KF exo-); 1, 4, 8, or 16 h at 60 °C (Therminator); or 1 h at 60 °C (Taq). Following incubation, KF exo- reactions were terminated by heat inactivation at 72 °C for 20 min. Therminator and Taq reactions were terminated by addition of 0.005 U/µL of thermolabile proteinase K at 37 °C for 15 min, followed by subsequent heat inactivation at 72 °C for 20 min. Following either set of heat inactivation steps, hairpins were refolded.
- Resulting hairpins contained a mixture of product (tailed hairpins) and unreacted starting material (3′-blunt end hairpins).
- T4 DNA ligase was then used to screen reactions for remaining unreacted 3′-blunt ends by adding 80 U/µL of T4 DNA ligase alongside 1X T4 DNA ligase reaction buffer. These T4 ligation reactions were incubated at 16 °C for 2 h, after which T4 ligase was heat inactivated at 65 °C for 10 min.
- a synthetic oligo hairpin with a 3′-G overhang (5′Phos-HP-3′G , Table 2) was used in the T4 ligation reaction.
- the starting material (5′Phos-11HP) was used in the T4 ligation reaction. Reaction products were run on a 2% (w/v) agarose gel, stained with GelGreen,
- Exonuclease reactions were heat inactivated by incubation at either 80 °C for 20 min (for reactions containing Exo I) or at 70 °C for 20 min (for reactions containing thermolabile Exo I). Reaction products were run on a 2% (w/v) agarose gel, stained with GelGreen, and visualized using a blue light transilluminator.
- Consecutive insertion of XNA base pairs using MlyI type IIS restriction enzyme: 5′-phosphorylated hairpin oligos were purchased from IDT (5′Phos-11HP, 5′Phos-15HP, and 5′Phos-ScaI-HP; Table 2). 5′Phos-15HP contains an MlyI restriction site adjacent to the site of XNA ligation.
- MlyI is a type IIS restriction enzyme (5′-GAGTCNNNNN↓-3′) that leaves a blunt end after cutting.
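The position of the MlyI cut relative to its recognition site can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, it scans a single strand in the forward orientation, and it assumes the blunt cut falls immediately after the five unspecified bases that follow GAGTC, as the recognition sequence above indicates.

```python
import re

MLYI_SITE = "GAGTC"  # recognition sequence given in the text
MLYI_OFFSET = 5      # assumed: blunt cut directly after the five N bases

def mlyi_cut_positions(seq):
    """Return 0-based indices where a forward-strand MlyI cut would fall."""
    seq = seq.upper()
    cuts = []
    # A lookahead is used so overlapping site matches are not skipped.
    for match in re.finditer("(?=" + MLYI_SITE + ")", seq):
        cut = match.start() + len(MLYI_SITE) + MLYI_OFFSET
        if cut <= len(seq):
            cuts.append(cut)
    return cuts

def mlyi_fragments(seq):
    """Split a sequence at every predicted forward-strand cut position."""
    bounds = [0] + mlyi_cut_positions(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]
```

Because the cut is blunt, the same index applies to both strands, so a base pair placed directly after the five N bases ends up at the terminus of the retained fragment.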
- 5′Phos-15HP donor hairpin with MlyI site; abbreviated HPD
- 5′Phos-11HP acceptor hairpin; abbreviated HPA
- These two hairpins were then ligated and subsequently treated with exonuclease following the optimized conditions described in “XNA ligation conditions and reaction components.” This material was purified using Zymo’s DNA Clean and Concentrator and eluted in 30 µL of elution buffer.
- the purified construct contains a single P:Z base pair insertion and was digested using 1.24 U/µL of MlyI and 1X rCutSmart™ buffer at 37 °C for 2 h, then heat inactivated at 65 °C for 20 min. MlyI digestion results in a hairpin with a terminal P:Z,
- 5′-phosphorylated oligo pools (purchased as oPools™ from Integrated DNA Technologies) were designed to form blunt-end hairpins with two barcodes: a 24 nt Triplet-barcode [NNN-BC] and an 8 nt pool-barcode [Pool-BC] (FIG. 3A, Tables 3-5).
- the Triplet-barcode is linked to the NNN sequence at the 3′-blunt end of the hairpin, while the pool-barcode is used to decode which dxNTP/dNTP was tailed (Table 12).
- Each Triplet-barcode maps 1:1 with a corresponding NNN sequence adjacent to an XNA base.
- Ligation reactions for libraries generate combinations with two different pool barcodes. Restriction enzyme cut sites were included upstream of Triplet-barcodes to remove hairpins following ligation reactions and prepare DNA for nanopore sequencing. Full hairpin sequences in each library can be produced based on the present disclosure.
- Val-20 validation library design: 5′-phosphorylated oligo pools (purchased as oPools™ from Integrated DNA Technologies) were designed to form blunt-ended hairpins with a variable 20 nt region at the end (Tables 3, 6).
- variable 20 nt region was designed computationally by randomization with a uniform prior probability for each base.
- Candidate sequences were passed through IDT oligo analyzer tool to remove sequences that might form secondary structures that could disrupt hairpin formation.
- Each validation oligo pool contained 10 unique sequences (six total pools: Val_A-F; Table 6) and was synthesized at a scale of 50 pmol/oligo.
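A uniform randomization of the variable region can be sketched as follows. This is an illustrative sketch, not the code used to design the pools: the function names and the seed are assumptions, and the secondary-structure screening step is not reproduced.

```python
import random

def random_variable_region(length=20, alphabet="ATGC", rng=None):
    """Draw one sequence with a uniform prior over the four standard bases."""
    rng = rng or random.Random()
    return "".join(rng.choice(alphabet) for _ in range(length))

def make_validation_pool(n_sequences=10, length=20, seed=0):
    """Collect unique random sequences until the pool reaches the target size."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    pool = set()
    while len(pool) < n_sequences:
        pool.add(random_variable_region(length, rng=rng))
    return sorted(pool)
```

Candidates produced this way would still need to be screened for unwanted secondary structure before ordering.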
- Two different validation oligo pools can be tailed with a dxNTP. Ligating two pools together (with complementary N+1 tails) results in a library with 100 possible sequences (10 x 10 combinations). Restriction enzyme cut sites were included upstream of these variable regions for nanopore library preparation following ligation.
- the assembled product contains two different restriction sites for hairpin removal, 5′-GATATC-3′ (EcoRV) and 5′-AGTACT-3′ (ScaI).
- Asymmetric presence of restriction sites on the hairpins allows us to remove a singular hairpin and therefore generate a blunt end on the assembled product.
- the resulting dsDNA contains a single 3′- and 5′-end.
- Subsequent library preparation and sequencing of dsDNA results in reads where both sense and antisense strands, containing all 12 nucleobases, can be read in a single sequencing event (Scuper-12 and Snuper-12; FIG. 5, FIGs 6UA-6UB).
- NNNNNNN library, validation library, and 12-letter DNA preparation by XNA tailing and XNA ligation: oligos or oligo pools were first refolded by incubating 20 µM of oligo pool in a 100 mM NaCl, 10 mM Tris-HCl (pH 8.2) buffer at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C.
- oligos or oligo pools were tailed with a corresponding dxNTP using tailing conditions listed in Table 8. Reactions tailed with KF exo- were heat inactivated, while those tailed with Therminator were inactivated by thermolabile proteinase K treatment. Following inactivation of polymerase, oligos were refolded. Tailed oligo or oligo pools with complementary 3′-ends were then ligated with either T4 DNA ligase, T3 DNA ligase, or T7 DNA ligase using ligation conditions listed in Table 10. As a negative control for tailing, the starting material 3′-blunt end oligo or oligo pool (e.g.
- Purified NNN-oligo pools were then digested for 1 h at 37 °C using 1 U/µL of BbsI-HF and rCutSmart™ buffer, then purified again using AMPure XP with a 2:1 bead-to-sample ratio and eluted in 30 µL of nuclease-free water. Purified NNNNNNN library samples were then prepared for nanopore sequencing following the details in the Nanopore sample preparation section.
- ligated validation oligo pool reactions were purified using AMPure XP with a 3:1 bead-to-sample ratio and eluted in 30 µL of elution buffer (10 mM Tris-HCl, pH 8.2), then combined to a final concentration of 0.2 µM/pool before enzymatic digestion for 1 h at 37 °C using 1 U/µL of BbsI-HF and 1X rCutSmart™ buffer.
- Each ligated oligo set was then combined at a final equimolar concentration of 0.05 or 0.075 µM/oligo before proceeding to a Golden Gate ligation with the addition of 1 U/µL of BbsI-HF, 20 U/µL of T4 DNA ligase, 1X rCutSmart™ buffer, and 1X T4 DNA Ligase Reaction Buffer (FIG. 6UA).
- the Golden Gate ligation included 60 cycles of 1) 37 °C for 5 min and 2) 16 °C for 5 min, followed by a final step at 37 °C for 10 min and a heat inactivation step at 65 °C for 20 min.
- the reaction was further digested to remove incomplete ligation products by the addition of 0.45 U/µL of BbsI-HF, 0.45 U/µL of thermolabile Exo I, 2.27 U/µL of Exo III, and 0.23 U/µL of Exo VIII (truncated), incubating at 37 °C for 1 h, followed by a heat inactivation step at 70 °C for 20 min.
- This reaction was then purified using AMPure XP with a 1.8:1 bead-to-sample ratio and eluted in 30 µL of nuclease-free water.
- the hairpin on either end of the complete, desired product was removed by splitting the reaction in half and adding 1X rCutSmart™ and 2.78 U/µL of either ScaI-HF or EcoRV-HF. These reactions were incubated at 37 °C for 1 h, followed by a heat inactivation step at 80 °C for 20 min. The split samples were then
- Nanopore sample preparation and data acquisition: Nanopore sample preparation followed the standard Flongle or MinION Genomic DNA by Ligation protocol (available on the ONT community) using the SQK-LSK110 preparation kit with the following modifications.
- the NEBNext FFPE Repair Mix was omitted to avoid potential XNA removal by repair enzymes.
- the volume of the repair mix was replaced by nuclease-free water.
- AMPure XP bead-to-sample ratio was increased to 2:1 for the NNNNNNN library and 3:1 for the validation library.
- Signal-to-sequence mapping uses the Tombo (github.com/nanoporetech/tombo, ONT) pipeline.
- raw multi FAST5 files are split into single FAST5 using the ont-fast5-api (github.com/nanoporetech/ont_fast5_api, ONT) command multi_to_single_fast5.
- Single FAST5 files are then basecalled using guppy (version 6.1.5+446c355, ONT) with the high accuracy configuration settings (dna_r9.4.1_450bps_hac.cfg).
- FASTQ basecalls passing default guppy quality score settings are assigned to their corresponding single FAST5 files using the Tombo command tombo preprocess annotate_raw_with_fastqs.
- Tombo uses a reference FASTA file that contains ground-truth sequences.
- the reference FASTA file was generated programmatically by considering every possible combination of ligation product including mismatch homo-ligation (e.g. P1-A+P1-A, see Table 12), blunt-end ligations leading to a gap (e.g. P1-P2, P1-P1, P2-P2), or pyrophosphorolysis ligation products.
- Full reference alignment files are deposited in the SRA (Table 31).
- the ground truth XNA (B, Sn, Sc, P, Z, J, V, Xt, Kn) base needs to be substituted with a canonical base (A, T, G, C) for processing in a FASTA format.
- XNAs in reference sequences were substituted with the canonical bases that minimized observed variance in kmer levels, as determined empirically (B→A; Sn→A; Sc→A; P→G; Z→C; X→A; K→G; J→C; V→G).
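The substitution step can be sketched as a simple lookup. This is a hypothetical illustration rather than the Xenomorph implementation: the sequence is represented as a list of base symbols (an assumption made so that two-character symbols such as Sn and Sc are unambiguous), and the mapping is copied from the list above, with X and K written as they appear there.

```python
# Empirical XNA-to-canonical substitutions listed in the text.
XNA_TO_CANONICAL = {
    "B": "A", "Sn": "A", "Sc": "A",
    "P": "G", "Z": "C",
    "X": "A", "K": "G",
    "J": "C", "V": "G",
}

CANONICAL = {"A", "T", "G", "C"}

def to_canonical_reference(bases):
    """Replace each XNA symbol with its canonical stand-in for FASTA output."""
    out = []
    for base in bases:
        if base in CANONICAL:
            out.append(base)
        elif base in XNA_TO_CANONICAL:
            out.append(XNA_TO_CANONICAL[base])
        else:
            raise ValueError("unknown base symbol: " + repr(base))
    return "".join(out)

def xna_positions(bases):
    """Record where XNAs sat so they can be restored after mapping."""
    return [(i, b) for i, b in enumerate(bases) if b in XNA_TO_CANONICAL]
```

Keeping the original positions alongside the substituted sequence lets downstream steps know which reference positions to test against XNA models.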
- Substituted bases are in general agreement with observations from basecalling XNA-containing reads with guppy (FIGs 6OA-6OB and 6QA-6QI). Signal-to-sequence mapping then proceeds using Tombo resquiggle.
- the Tombo resquiggle command uses mappy (minimap2 version 2.22-r1101 with ONT configuration) to first assign each single FAST5 read to a reference FASTA sequence based on the given FASTQ basecall. Following sequence assignment, Tombo uses dynamic programming for signal segmentation and proceeds to perform per-read signal normalization. As a general comment on the limitations of segmentation-based basecalling, Tombo is sensitive to the reference canonical base chosen for signal assignment.
- the per-read, median normalized level signal for each base is then extracted using the Tombo resquiggle results through the Tombo Python API. Details regarding how Tombo performs mapping, matching, and normalization, along with the Tombo Python API usage, can be found in the Tombo documentation (nanoporetech.github.io/tombo/).
- the resulting preprocessed and normalized signal-extracted data is exported to a CSV file for downstream processing (Tables 17, 18).
- the entire data preprocessing steps, including command groups and parameter settings, are wrapped into a single command (xenomorph preprocess) and available on the Xenomorph repository.
- XNA kmer model parameterization: NNNNNNN libraries for a given XNA base pair are prepared as previously described in “NNNNNNN library, validation library, and 12-letter DNA preparation by XNA tailing and XNA ligation” and sequenced
- Signal-to-sequence mapping is then performed using the previously described pipeline in “Raw nanopore data preprocessing and signal-to-sequence mapping” with the following specifications. Reads that do not fully map with full coverage of triplet-barcodes and pool-barcodes of the XNA position are filtered out. Likewise, reads with a q-score < 9 and signal match score > 3 are not used in the model building. Signal-to-sequence mapping is also carried out with blunt-end ligation products (i.e. NNNNNN, or no XNA insertion), such that sequences that map better to blunt-end ligation products are not used.
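The read-quality filter can be expressed as a small predicate. This sketch rests on two assumptions that the text does not settle: the q-score cutoff is read as a lower bound of 9, and a read is discarded if it fails either threshold. The function and parameter names are hypothetical.

```python
def keep_read(q_score, signal_match_score, min_q=9.0, max_match=3.0):
    """Return True if a read passes both quality thresholds."""
    # Assumed interpretation: failing either criterion excludes the read.
    return q_score >= min_q and signal_match_score <= max_match

def filter_reads(reads):
    """Keep (read_id, q_score, signal_match_score) tuples that pass."""
    return [r for r in reads if keep_read(r[1], r[2])]
```

For example, `filter_reads([("r1", 12.0, 1.5), ("r2", 7.0, 1.0)])` keeps only `r1`.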
- the 4-nt kmer was chosen in this disclosure as a proof of concept since reasonable kmer coverage could be obtained for the full NNNNNNN library (512 kmers per XNA base pair insertion) in a single Flongle flow cell run.
- each kmer consists of four nucleotide bases centered around the 0th position nucleotide, as exemplified in Table 16. Therefore, each heptamer sequence (NNNNNNN) is composed of four 4-nt kmers (i.e. +2 pos NNNN, +1 pos NNNN, 0 pos NNNN, -1 pos NNNN).
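The decomposition of a heptamer into its four overlapping 4-nt kmers can be sketched with a sliding window. The function name is hypothetical, single-letter base symbols are assumed, and the position labels of Table 16 are not reproduced; the sketch only shows that every window of width four across a heptamer contains the central base.

```python
def heptamer_kmers(heptamer, k=4):
    """Return the overlapping k-nt windows of a 7-nt sequence, left to right."""
    if len(heptamer) != 7:
        raise ValueError("expected a 7-nt sequence")
    return [heptamer[i:i + k] for i in range(len(heptamer) - k + 1)]
```

With a placeholder heptamer such as `"ACGBTCA"`, where B sits at the central position, the function returns four kmers that each include B.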
- Observed kmer levels are modeled as normal distributions parameterized with a mean ($\mu_k$) and standard deviation ($\sigma_k$). These parameters are used to describe observed kmer signal level probability density functions:

  $$P(x \mid k) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$

  where $P(x \mid k)$ is the probability that the observed level $x$ came from kmer $k$; $\mu_k$ is the normalized kmer level mean for kmer $k$; $\sigma_k$ is the standard deviation of median normalized kmer levels for kmer $k$; and $x$ is the observed median normalized kmer level.
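The normal density used for each kmer can be evaluated directly. The following is a minimal sketch with hypothetical names; it simply implements the standard normal probability density function for a given kmer mean and standard deviation.

```python
import math

def kmer_level_pdf(x, mu, sigma):
    """Normal density of an observed level x for a kmer with mean mu and SD sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

The density peaks at the kmer mean and falls off symmetrically, so observed levels far from the mean contribute little support for that kmer.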
- level model means were approximated using the following kmer-specific bandwidth selection (Silverman's rule of thumb):

  $$BW_k = 0.9 \cdot \min\left(\sigma_k, \frac{IQR_k}{1.34}\right) \cdot n_k^{-1/5}$$

  where $IQR_k$ is the interquartile range of kmer levels for kmer $k$; $\sigma_k$ is the standard deviation of median normalized kmer levels for kmer $k$; $n_k$ is the number of observations (measurements) of kmer $k$; and $BW_k$ is the bandwidth used for the kernel density estimate. [0153] For practical purposes detailed in the Tombo documentation (github.com/nanoporetech/tombo), one can set a global standard deviation taken as the average observed standard deviation across all kmers in the model (i.e.
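Silverman's rule of thumb for the kernel bandwidth can be sketched with the Python standard library. The function name is hypothetical, and two details are assumptions: the sample standard deviation is used, and quartiles are taken with the default method of `statistics.quantiles`.

```python
import statistics

def silverman_bandwidth(levels):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)."""
    n = len(levels)
    if n < 2:
        raise ValueError("need at least two observations")
    sigma = statistics.stdev(levels)  # sample standard deviation (assumed)
    q1, _median, q3 = statistics.quantiles(levels, n=4)
    iqr = q3 - q1
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-0.2)
```

Taking the smaller of the standard deviation and the scaled interquartile range keeps a few outlying measurements from inflating the bandwidth.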
- kmer models Documentation for model building and code used to generate kmer models can be found in the Xenomorph repository (github.com/xenobiolab/xenomorph). For quality control, the entire experimental and computational procedure, from building libraries to generating 4-nt kmer models, was performed in duplicate. Models were built from data collected in a single run. The
- For each heptamer sequence (NNNNNNN), a set of mapping kmer sequences (NNNN, NNNN, NNNN, NNNN) and observed signal levels (I_NNNN, I_NNNN, I_NNNN, I_NNNN) are extracted. See Table 16 for additional information on numbering nomenclature of kmer sequences within a heptamer region.
- the kmer probability density function described previously in “XNA kmer model parameterization,” is used to estimate the probability that each observed level (e.g., ⁇ ⁇ ) came from the corresponding kmer (e.g.
- LLR: log-likelihood ratio
- LLR > 0 is used as the default criterion for deciding whether the XNA model is more likely than an alternative model for a given observed sequence of signals.
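A log-likelihood ratio over the kmers that cover one position can be sketched as below. This is an illustration under stated assumptions, not the Xenomorph code: the function names are hypothetical, a single global standard deviation is used for both models, and the per-kmer log densities are summed as if the kmer observations were independent. The sign convention follows the text, so a positive value favors the XNA model.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log of the normal density, computed directly for numerical stability."""
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2

def log_likelihood_ratio(levels, xna_means, alt_means, sigma):
    """Sum of log P(level | XNA kmer) minus log P(level | alternative kmer)."""
    if not (len(levels) == len(xna_means) == len(alt_means)):
        raise ValueError("levels and model means must have equal length")
    ll_xna = sum(log_normal_pdf(x, m, sigma) for x, m in zip(levels, xna_means))
    ll_alt = sum(log_normal_pdf(x, m, sigma) for x, m in zip(levels, alt_means))
    return ll_xna - ll_alt

def call_is_xna(levels, xna_means, alt_means, sigma):
    """Apply the default decision rule: the XNA model wins when LLR > 0."""
    return log_likelihood_ratio(levels, xna_means, alt_means, sigma) > 0
```

When the same standard deviation is used for both models, the normalization terms cancel and the ratio depends only on how close the observed levels lie to each set of model means.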
- ORLLR is a modified LLR test statistic that is nominally more robust towards outliers.
- the ORLLR test statistic is defined as follows: ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ 2 sequence ⁇ ⁇ ⁇ median normalized kmer level for kmer ′ ⁇ ⁇ ⁇ ′ ⁇ ⁇ ⁇ median normalized kmer level for kmer ′ ⁇ ⁇ ⁇ ′ ⁇ ⁇ ⁇ ⁇ scale difference ⁇ ⁇ global standard deviation of median normalized kmer levels
- Consensus recall and specificity perform sequence-level assignments in calculations (rather than per-read level). Specificity of kmer models was calculated by alternative hypothesis testing on sequences that did not contain any XNAs. The definition of each statistic is provided below.
- $$\text{recall} = \frac{TP}{TP + FN}$$ where TP = true positive and FN = false negative.
- $$\text{specificity} = 1 - FDR = 1 - \frac{FP}{FP + TN}$$ where FP = false positive, TN = true negative, and FDR = false discovery rate.
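Both statistics follow directly from the confusion-matrix counts. A minimal sketch with hypothetical function names:

```python
def recall(tp, fn):
    """Fraction of true XNA positions that were called as the XNA."""
    if tp + fn == 0:
        raise ValueError("no positive examples")
    return tp / (tp + fn)

def specificity(fp, tn):
    """One minus the fraction of XNA-free positions called as an XNA."""
    if fp + tn == 0:
        raise ValueError("no negative examples")
    return 1.0 - fp / (fp + tn)
```

For instance, 89 correct calls out of 100 XNA-containing reads gives a recall of 0.89, and 5 spurious calls out of 100 XNA-free reads gives a specificity of 0.95.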
- Receiver operating characteristic (ROC) curves were generated using the roc_curve function from the scikit-learn Python library.
- contained XNA bases flanked by 20 randomly chosen canonical bases. Recall on the validation set was calculated at the per-read and consensus level as described previously in “Recall and specificity calculations.”
- PCR amplification and basecalling of P:Z template DNA: Two complementary oligos containing P and Z (PCR_Template_P, PCR_Template_Z, Table 22) were synthesized by Firebird Biomolecular Sciences (Alachua, FL) and hybridized in a 1:1 molar ratio. 25 ng of this hybridized PZ DNA construct was used as the template for a PCR reaction.
- PCR reactions contained 0.2 µM of each forward and reverse primer (PCR_Amp_F, PCR_Amp_R1-4, Table 22), 5 U/µL of Taq polymerase in 1X ThermoPol buffer (pH 8.0). Triphosphate concentrations for dxNTPs and dNTPs varied by condition (no dxNTP, limiting, equimolar, optimal) and are tabulated in FIGs 6TA-6TC. The PCR reaction then proceeded with thermocycler conditions tabulated in Table 23. PCR reactions were purified using Zymo DNA Clean and Concentrator and eluted in 30 µL of nuclease-free water.
- the Xenomorph XNA sequencing pipeline: One of the goals of this disclosure was to build a publicly available end-to-end pipeline for validation of XNA incorporation in target sequences. As a proof of concept, one can create a tool in Python called “Xenomorph”, a pipeline consisting of two steps: 1) preprocessing (xenomorph preprocess) and 2) alternative hypothesis testing (xenomorph morph).
- Xenomorph runs raw FAST5 data through the preprocessing pipeline with an additional FASTA handling modification that allows users to input reference sequences with XNA base pairs. Outputs for preprocessing steps are provided in a .csv file (see Table 17 for header description), which is used as an input for xenomorph morph.
- Xenomorph uses the XNA base pairs found in the input reference sequence to perform LLR or ORLLR testing against user-defined alternatives. For example, for a sequence containing A, T, G, C, B, Sn base pairs, users can calculate the most likely base at the XNA position against the most similar canonical base (e.g.
- B vs A purines/pyrimidines
- canonical bases e.g. B vs A, T, G, C
- all bases e.g. B vs A, T, G, C, S n .
- Alternative hypothesis testing can be performed on a per-read basis or a global basis.
- XNA kmer models generated in this disclosure are built in and can be viewed using xenomorph models. Model compilation is performed ad hoc, allowing users to experiment with kmer models.
- Outputs for alternative hypothesis testing are provided as a .csv file (see Table 18 for header description).
- kmer models are inherently independent (i.e. signal observations of NNNBNNN are independent of NNNSNNN observations) and therefore modular.
- Xenomorph was built to be flexible, allowing users to add more kmer models or modify them, and straightforward, requiring two commands to go from raw nanopore data to XNA-refined sequences.
- A graphical overview of the preprocessing pipeline can be found in FIG. 6S.
- Xenomorph can be found in the Xenomorph repository (github.com/xenobiolab/xenomorph) alongside all code, documentation, and parameters used in this disclosure.
- building and basecalling can be downloaded from the SRA Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328]. Additional overview of how the Xenomorph pipeline performs XNA basecalling is found in Note 1. [0164] Data availability: Models measured in this disclosure used for basecalling are provided in Data Table 1, and can also be found on the Xenomorph GitHub repository (github.com/xenobiolab/xenomorph/tree/main/models).
- the raw nanopore sequences (FAST5) and guppy basecalls (FASTQ) used in this disclosure to build models, validate models, and test 12-letter DNA sequencing have been deposited in the sequence reads archive (SRA) under Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328] and can be accessed without restriction (Table 31).
- Raw nanopore data for PZ PCR amplification experiments (FIGs 6TA-6TC) are available under restricted access, as this data was collected in a pooled nanopore run and contains additional data. Full sequences for hairpin libraries purchased for this work can be produced based on this disclosure. Additional source data can be produced based on this disclosure.
- Code availability: Code for end-to-end processing of nanopore reads and basecalling xenonucleotides described in this example can be produced based on this disclosure.
- Information for Example 1 Enzymatic Synthesis and Nanopore Sequencing of 12-letter Supernumerary DNA
- Methods [0168] Organic synthesis of dXtTP: 8-(2′-Deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-s-triazin-2,4-dione 5′-triphosphate.
- xenonucleobases (B, Sn, Sc, P, Z, Xt, Kn, J, V) are integrated for selection.
- the pipeline, as built, also allows users to generate their own models.
- Basecalling can be performed either per-read or per-sequence (global). In per-read basecalling, individual reads are basecalled while in per-sequence, the signal of all reads that match a sequence are averaged before determining a global call.
- the per-read consensus is defined as the most frequent basecall among all reads that match a certain sequence.
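The two calling modes can be contrasted in a short sketch. The function names are hypothetical: `per_read_consensus` takes the most frequent per-read call for a sequence, while `global_levels` averages the signal levels of all matching reads position by position, after which a single call would be made on the averaged levels.

```python
from collections import Counter

def per_read_consensus(calls):
    """Most frequent basecall among the reads that match one sequence."""
    if not calls:
        raise ValueError("no calls supplied")
    # Ties resolve to the call seen first, following Counter ordering.
    return Counter(calls).most_common(1)[0][0]

def global_levels(read_levels):
    """Average the kmer levels of all reads, one mean per kmer position."""
    if not read_levels:
        raise ValueError("no reads supplied")
    return [sum(column) / len(column) for column in zip(*read_levels)]
```

Averaging before calling pools the evidence from every read, whereas the per-read consensus lets each read vote independently.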
- 4-nt kmer models are parameterized with a kmer mean (μk) and a kmer variance (σk). Users have the choice of setting experimentally measured signal means, signal medians, or means from kernel density estimates as μk. Options for σk values are either the kmer-specific measured variance or a fixed global variance. The choice of bases to use in the model can also be specified. As described, basecalling in this disclosure uses signal means for μk and global average kmer variance for σk. [0207] Full code and documentation of Xenomorph are available on GitHub. Sample data, such as the FAST5 data generated in this disclosure, can be found in the SRA under Bioproject PRJNA932328 (Table 31). [0208] Note 2.
- Each hairpin pool contains 10 unique sequences. Ligating two hairpin pools together generates a final library of 100 possible sequence combinations (10 x 10).
- the table shows constant regions for all oligos in each pool (black), with regions in brackets (blue, bold) being replaced with their corresponding sequence elements from Tables 4-6. ‘-F’ and ‘-R’ are used to note forward and reverse sequences of different components after the hairpin is folded.
- NNN denotes the 3 randomized bases at the end of the hairpins
- [NNN-BC] i.e., Triplet-barcode
- [Pool-BC] i.e., Pool-barcode
- Regions highlighted in red denote restriction site sequence difference between HP_v1 and HP_v2, HP1 and HP2. All sequences are shown in the 5′ to 3′ direction.
- Full hairpin sequences purchased for this disclosure can be produced based on this disclosure.
- Triplet-barcodes sequences Sequences of the Triplet-barcodes and NNN sequences they are assigned to.
- the Triplet-barcode is a 24 nt sequence that is distal to the 3′-NNN end in each hairpin and is used to assign the true identity of the 3′-NNN bases that flank XNA insertions (FIG. 3A).
- N A, T, G, or C; 64 NNN combinations
- Barcode sequences were chosen from Oxford Nanopore Technologies list of barcodes for long-read sequencing.
- Barcode sequences are shown in 5′ to 3′ direction.
- the Triplet-barcode (abbreviated as [NNN-BC]) and NNN sequences are used to construct the HP_v1-NNN-[Pool-ID] and HP_v2-NNN-[Pool-ID] hairpin sequences shown in Table 3 by insertion into the [NNN-BC] and [NNN] regions, respectively.
- Full sequences of all hairpins used for model generation can be produced based on this disclosure.
- Validation pool sequences were randomly generated and intended to provide a sequence diversity (+/- 20 nt surrounding an XNA nt) much greater than what is present in the model training NNN-pools.
- the smaller library size (100 sequences per ligated pool) and richer sequence diversity made it possible to multiplex all the validation sets while still obtaining sufficient coverage for calculating appropriate statistics.
- Validation pool sequences are a subset of HP1-[VAL-ID] and HP2-[VAL-ID] hairpin sequences shown in Table 3. Sequences are shown in 5′ to 3′ direction. Full sequences of hairpins ordered, alongside ligation products generated, can be produced based on this disclosure.
- Table shows barcodes for each oligo that links to the variable 3 nt sequence on the 3′-end and the xenonucleotide tailed on the 3′-end (bold), as well as restriction site sequences (red, bold). Sequences are shown in 5′ to 3′ direction.
- Primer sequences are used to amplify the template: each condition used a different barcoded reverse primer (PCR_Amp_R1: Equimolar; PCR_Amp_R2: Optimal; PCR_Amp_R3: No dxNTP; PCR_Amp_R4: Limiting). All conditions used the same forward primer (PCR_Amp_F). Sequences are shown in 5′ to 3′ direction.
- Table shows: (left) fraction of base called at each xenonucleotide position using the full 12-letter supernumerary model; (right) base called using a model with simplified priors, comparing the xenonucleotide at the called position against the most similar standard base. Box highlights the base pair chosen by picking the most likely nucleobase among any purine or pyrimidine set, then fixing the complementary base. Base called – Scuper-12
- Table shows: (left) fraction of base called at each xenonucleotide position using the full 12-letter supernumerary model; (right) base called using a model with simplified priors, comparing the xenonucleotide at the called position against the most similar standard base. Box highlights the base pair chosen by picking the most likely nucleobase among any purine or pyrimidine set, then fixing the complementary base. Base called – Snuper-12
- Table 27 Tabulation of per-read recall from simulated signal levels for the standard genetic code (A, T, G, C). Information regarding read simulation can be found in the Note section.
- Table 28 Tabulation of per-read recall from simulated signal levels for the isoG/isoC code (A, T, G, C, B, Sn).
- a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
- Embodiment 2 The method of Embodiment 1 or any other Embodiment, wherein the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
- Embodiment 3 The method of Embodiment 1 or any other Embodiment, wherein the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I.
- Embodiment 4 The method of Embodiment 3 or any other Embodiment, wherein the polypeptide sequence comprises a sequence of SEQ ID NO:2.
- Embodiment 5. The method of any of Embodiments 3-4 or any other Embodiment, wherein the non-standard
- Embodiment 6 The method of Embodiment 1 or any other Embodiment, wherein the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
- Embodiment 7 The method of Embodiment 6 or any other Embodiment, wherein the engineered polymerase is a variant of 9°N DNA polymerase.
- Embodiment 9 The method of any of Embodiments 6-8 or any other Embodiment, wherein the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
- Embodiment 10 A method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
- Embodiment 11 The method of Embodiment 10 or any other Embodiment, comprising the method of any of Embodiments 1-9 or any other Embodiment.
- Embodiment 12 The method of any of Embodiments 10-11 or any other Embodiment, comprising: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
- Embodiment 13 The method of any of Embodiments 10-12 or any other Embodiment, wherein the N+1 tailing product comprises a hairpin.
- Embodiment 14 The method of any of Embodiments 10-13 or any other Embodiment, wherein the second N+1 tailing product comprises a hairpin.
- Embodiment 15 The method of Embodiment 14 or any other Embodiment, wherein the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
- Embodiment 16 The method of any of Embodiments 12-15 or any other Embodiment, comprising: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
- Embodiment 17 Embodiment 17.
- Embodiment 16 The method of Embodiment 16 or any other Embodiment, wherein the method is performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of a further dsDNA ligation product.
- Embodiment 17 comprising: contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
- Embodiment 20 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
- Embodiment 21 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base.
- Embodiment 22 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
- Embodiment 23 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
- Embodiment 24 A dsDNA ligation product produced by the method of any of Embodiments 12-23 or any other Embodiment.
- Embodiment 25 A further dsDNA ligation product produced by the method of any of Embodiments 17-23 or any other Embodiment.
- Embodiment 26 A blunt-end dsDNA template produced by the method of any of Embodiments 16-23 or any other Embodiment.
- Embodiment 27 A further blunt-end dsDNA template produced by the method of any of Embodiments 18-23 or any other Embodiment.
- Embodiment 28 A further blunt-end dsDNA template produced by the method of any of Embodiments 18-23 or any other Embodiment.
- a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product of Embodiment 24 or any other Embodiment or the blunt-end dsDNA template of Embodiment 26 or any other Embodiment, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
- a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product of Embodiment 25 or any other Embodiment or the further blunt-end dsDNA template of Embodiment 27 or any other Embodiment, wherein the library polynucleotide sequence
- Embodiment 30 The defined non-standard nucleotide base pair library of any of Embodiments 28-29 or any other Embodiment, wherein the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence; and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
- Embodiment 31 A method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library of any of Embodiments 28-30 or any other Embodiment to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library of any of Embodiments 28-30 or any other Embodiment, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
- ML machine learning
- Embodiment 32 The method of Embodiment 31 or any other Embodiment, wherein the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
- Embodiment 33 A non-transitory computer-readable storage medium having stored thereon at least part of a ML model produced by any of Embodiments 31-32 or any other Embodiment.
- Embodiment 34 A computational device or computational system comprising the non-transitory computer-readable storage medium of Embodiment 33 or any other Embodiment.
- Embodiment 35 Embodiment 35.
- Embodiment 36 A method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing
- Embodiment 37 A circuitry configured to perform all or part of the method of Embodiment 36 or any other Embodiment.
- Embodiment 38 A circuitry configured to perform all or part of the method of Embodiment 36 or any other Embodiment.
- Embodiment 39 A nanopore sequencing kit, device, or system comprising the circuitry of Embodiment 37 or any other Embodiment.
- Embodiment 40 A nanopore sequencing kit, device, or system comprising the circuitry of Embodiment 38 or any other Embodiment.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Engineering & Computer Science (AREA)
- Organic Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Wood Science & Technology (AREA)
- Zoology (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Biochemistry (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Microbiology (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- General Chemical & Material Sciences (AREA)
- Library & Information Science (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Medicinal Chemistry (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Crystallography & Structural Chemistry (AREA)
- Plant Pathology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Systems and methods for generating defined base pairs in a sequence-defined library format that include at least one non-standard nucleotide in a base pair. Non-standard base pairs can be created using a nucleic acid polymerase (e.g., DNA polymerase, RNA polymerase, terminal deoxynucleotide polymerase) for blunt-end tailing addition of the non-standard base, which then can be ligated to another nucleotide end. The nucleotide sequences containing non-standard base pairs can be used to generate libraries for basecalling models of the non-standard base with next-generation sequencing (NGS) platforms and for sequencing non-standard nucleotide sequences, including xenonucleotides (XNAs).
Description
SYSTEMS AND METHODS FOR ENZYMATIC SYNTHESIS OF POLYNUCLEOTIDES CONTAINING NON-STANDARD NUCLEOTIDE BASEPAIRS
CROSS-REFERENCE(S) TO RELATED APPLICATION(S)
[0001] This PCT application claims the benefit of U.S. Provisional Patent Application No. 63/483,926, filed February 08, 2023, the contents of which are incorporated herein in their entirety for all purposes.
STATEMENT REGARDING SEQUENCE LISTING
[0002] The Sequence Listing XML associated with this application is provided in XML format and is hereby incorporated by reference into the specification. The name of the XML file containing the sequence listing is 3915-P1293WO.UW_Sequence_Listing.xml. The XML file is 172,291 bytes; was created on February 07, 2024; and is being submitted electronically via Patent Center with the filing of the specification.
BACKGROUND
[0003] The four-letter standard genetic alphabet of DNA (A, T, G, C) is ubiquitous and one of the defining biomolecular signatures of life on Earth. Organisms’ ability to read, write, and translate this information forms the basis for evolution as an emergent property of nucleic acid heteropolymers. Humanity has learned how to manipulate the standard 4-letters of DNA, spurring major advancements in biotechnology, information, and healthcare. As examples, the standard nucleic acids are a component of many diagnostic tests to screen for disease, biosensors to detect toxins, therapeutics that create immune responses, and even as a molecular system for long-term storage of digital information.
[0004] In addition, there are non-standard nucleotides that are capable of base-pairing with other non-standard nucleotides and/or standard nucleotides. As used herein, “non-standard nucleotide” refers to any nucleotide that is not one of the standard four nucleotides of DNA (i.e., A, T, G, C). An example of such a nucleotide includes, but is not limited to, a xenonucleotide (XNA).
However, despite the existence of non-standard nucleotides and their potential applications, the availability of basic molecular techniques
3915-P1293WO.UW () -1-
for molecules that include at least one non-standard nucleotide is limited, and progress in this area lags far behind progress made for standard nucleotides.
[0005] Accordingly, there is a need for systems and methods for working with non-standard nucleotides, including systems and methods for generating defined base pairs wherein at least one of the bases is a non-standard nucleotide, generating a sequence-defined library format, and others. The present disclosure addresses these and other long-felt and unmet needs in the field.
SUMMARY
[0006] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0007] In an aspect, the disclosure provides a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
[0008] In embodiments, the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
[0009] In embodiments, the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I. In embodiments, the polypeptide sequence comprises a sequence of SEQ ID NO:2.
[0010] In embodiments, the non-standard nucleotide is B or P, and the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
[0011] In embodiments, the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon. In embodiments, the engineered polymerase is a variant of 9°N DNA polymerase. In embodiments, the polypeptide sequence comprises a sequence of SEQ ID NO:3.
[0012] In embodiments, the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
[0013] In an aspect, the disclosure provides a method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
[0014] In embodiments, the method comprises: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
[0015] In embodiments, the N+1 tailing product comprises a hairpin. In embodiments, the second N+1 tailing product comprises a hairpin. In embodiments, the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
[0016] In embodiments, the method comprises: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
[0017] In embodiments, the method is performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of a further dsDNA ligation product.
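The two sets of tailing conditions stated in paragraphs [0010] and [0012] can be collected into a small lookup table. The Python sketch below is illustrative only: the class `TailingCondition`, its field names, and the helper `tailing_condition` are hypothetical names introduced here, and every number is the approximate ("about") value recited above rather than a validated protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TailingCondition:
    """Approximate N+1 tailing condition (all values are 'about' figures)."""
    polymerase: str
    temperature_c: float
    hours_min: float
    hours_max: float
    polymerase_u_per_ul: float
    dxntp_mm: float


# Paragraph [0010]: Klenow Fragment (exo-) conditions.
KF_EXO_MINUS = TailingCondition("Klenow Fragment (exo-)", 37.0, 1.0, 16.0, 0.71, 1.19)
# Paragraph [0012]: engineered 9°N DNA polymerase variant conditions.
NINE_N_VARIANT = TailingCondition("9°N DNA polymerase variant", 60.0, 4.0, 16.0, 0.29, 1.19)

# Hypothetical mapping from non-standard nucleotide label to its condition.
CONDITIONS = {base: KF_EXO_MINUS for base in ("B", "P")}
CONDITIONS.update(
    {base: NINE_N_VARIANT for base in ("Sn", "Sc", "Z", "Xt", "Kn", "J", "V")}
)


def tailing_condition(base: str) -> TailingCondition:
    """Return the recited condition for a non-standard nucleotide label."""
    if base not in CONDITIONS:
        raise ValueError(f"no tailing condition recorded for {base!r}")
    return CONDITIONS[base]


if __name__ == "__main__":
    for label in ("B", "Sc"):
        print(label, tailing_condition(label))
```

Keeping the conditions in one immutable record per polymerase makes it clear that the nucleotide label selects the enzyme, and the enzyme in turn fixes temperature, duration, and enzyme concentration.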
[0018] In embodiments, the method comprises: contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
[0019] In embodiments, the non-standard nucleotide comprises: an epigenetic modification, a modified sugar, a phosphate backbone, a nucleobase, a nucleobase that
can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
[0020] In embodiments, the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
[0021] In embodiments, the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base.
[0022] In embodiments, the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
[0023] In embodiments, the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
[0024] In an aspect, the disclosure provides a dsDNA ligation product. In an aspect, the disclosure provides a further dsDNA ligation product.
[0025] In an aspect, the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product or the blunt-end dsDNA template, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
[0026] In an aspect, the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product or the further blunt-end dsDNA template, wherein the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
[0027] In embodiments, the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence; and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
[0028] In an aspect, the disclosure provides a method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide. In embodiments, the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
[0029] In an aspect, the disclosure provides a non-transitory computer-readable storage medium having stored thereon at least part of a ML model. In an aspect, the disclosure provides a computational device or computational system comprising the non-transitory computer-readable storage medium. In an aspect, the disclosure provides a nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium.
[0030] In an aspect, the disclosure provides a method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read; computing, with the computational device or computational system, the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library associated with the subject current read for an association; and computing, based on the association, a structure of the non-standard nucleotide.
[0031] In various aspects, the disclosure provides a circuitry configured to perform all or part of a method. In various aspects, the disclosure provides a nanopore sequencing kit, device, or system comprising the circuitry.
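The train-then-assign workflow of the model-generation and basecalling aspects can be illustrated with a deliberately simple stand-in. The Python sketch below does not implement the convolutional LSTM RNN mentioned in the embodiments; it assumes instead a single Gaussian per known identity, fitted to normalized current means, and assigns an unknown read to the identity with the highest log-likelihood. The function names `train`, `log_likelihood`, and `assign`, and all signal values, are hypothetical.

```python
import math
from statistics import mean, stdev


def train(labelled_signals):
    """Fit one (mean, standard deviation) pair per known identity.

    labelled_signals maps an identity label to a list of normalized
    current means observed for reads of that identity.
    """
    return {
        label: (mean(values), stdev(values))
        for label, values in labelled_signals.items()
    }


def log_likelihood(x, mu, sigma):
    """Log of the Gaussian density N(x; mu, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def assign(model, x):
    """Return the identity whose fitted Gaussian best explains signal x."""
    return max(model, key=lambda label: log_likelihood(x, *model[label]))


if __name__ == "__main__":
    # Toy training data (hypothetical values, not measurements).
    library = {"B": [1.0, 1.2, 0.8], "Z": [-1.0, -0.9, -1.1]}
    model = train(library)
    print(assign(model, 0.9))
    print(assign(model, -1.05))
```

A per-identity Gaussian is chosen here only because it fits in a few lines; any classifier trained on a library of known identities would occupy the same place in the workflow.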
DESCRIPTION OF THE DRAWINGS
[0032] The foregoing aspects and many of the attendant advantages of this disclosure will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings.
[0033] FIGs 1A and 1B show nucleobases for an expanded 12-letter supernumerary DNA alphabet. (FIG. 1A) Structures of standard purine and pyrimidine nucleobases. (FIG. 1B) Structures of mutually orthogonal synthetic xenonucleobases that can form the basis of a 12-letter supernumerary DNA. Single letter abbreviations of each base are indicated above the nucleobase structure. Arrows indicate hydrogen bonding between base pairs, drawn in the direction of donor-to-acceptor. The S nucleobase has two possible structures which both base pair with B: the N-nucleoside (Sn) and C-nucleoside (Sc).
[0034] FIGs 2A-2H show XNA tailing and XNA ligation enable a facile means for enzymatic XNA incorporation. (FIG. 2A) Polymerase XNA tailing activity screened by detection of released 2′-deoxy-xenonucleoside monophosphates (dxNMPs). Hairpin HP-3′PT was used as tailing substrate (Table 2); ‘*’ indicates positions of phosphorothioate bonds. Extracted ion chromatograms for each dNMP and dxNMP in assays indicate dNTP and dxNTP tailing by (FIG. 2B) Klenow Fragment (exo-) and (FIG. 2C) Therminator polymerase. Source data are provided as a Source Data file. (FIG. 2D) Assay measuring extent of XNA tailing by T4 ligation. Tailed hairpins are not substrates for T4 ligation. (FIG. 2E) XNA tailing of hairpin using optimized conditions showing XNA tailed hairpin is the major product. (–) is blunt-ended hairpin negative control. G+ is a hairpin synthesized to contain a single nucleotide 3′-G overhang as the positive control (gel representative of 3 experimental replicates; yield estimates are listed in Table 9). (FIG. 2F) Assay to ligate two DNA hairpins with complementary single nucleotide XNA overhangs.
Ligated hairpins are protected from exonucleases as they lack free 5′ and 3′ ends. (FIG. 2G) XNA ligation of hairpins tailed with complementary purine (pur) and pyrimidine (pyr) XNA bases using optimized reaction conditions. (+) is a positive control that used blunt DNA substrate. (*) is a negative control that used blunt DNA substrate without DNA ligase. (–) is a negative control without ligase or exonuclease, shown quantitatively for comparison with XNA ligation products (gel representative of 3 experimental replicates; yield estimates are listed in Table 11). (FIG. 2H) XNA tailing
and XNA ligation steps can be cycled for consecutive additions using Type IIS restriction enzyme MlyI.
[0035] FIGs 3A-3D show generation of 12-letter (ATGCBSPZXKJV) nanopore sequencing kmer models. (FIG. 3A) Overview of construction of NNNNNNN libraries, starting from two synthetic oligo pools (NNN-Pool) that contain blunt, NNN-3′ ends. The 24-nt triplet-barcodes in these hairpins are linked to the 3′-NNN sequence, allowing for proper identification of bases adjacent to XNA inserts. Complementary XNA base pairs are added to the library hairpins using XNA tailing and XNA ligation. The 8-nt pool-barcode is used to identify which XNA was tailed to the 3′-end. Restriction enzymes (RE) remove the hairpin ends. Final libraries contain an XNA base insert in every possible NNN x NNN context (N = A, T, G, C; 64 x 64 = 4,096 unique sequences per XNA base). (FIG. 3B) 4-nt kmer models were generated by decomposing every sequenced heptamer (NNNNNNN; N = modified nucleotide) into its corresponding 4-nt kmers. For a kmer’s observed current signals, mean values from the observed signal (obs) or from a kernel density estimate (KDE) can be calculated. (FIG. 3C) All measured normalized current signal means (µk, 2,304 total values from kernel density estimate) for each 4-nt kmer, with positive values in deeper purple and negative values in deeper orange. Heatmaps are binned by kmer position containing the xenonucleobase (-1, 0, +1, or +2). ‘N’ is denoted in the x-axis and the remaining NNN is denoted by row, sorted alphabetically (AAA to TTT). (FIG. 3D) Example traces overlaying observed mean signal (orange) with expected signals produced by either the XNA kmer model (blue) or a standard DNA kmer model (gray). For the standard DNA model, the most similar standard base chosen for each XNA was determined from empirical observation (FIGs 6OA-6OB and 6QA-6QI).
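The library size and the heptamer decomposition described for FIGs 3A and 3B can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the xenonucleobase is written as a single-character symbol (for example "B"), and the function names `flanking_contexts`, `heptamers`, and `kmers` are hypothetical helpers, not part of the disclosed pipeline.

```python
from itertools import product

STANDARD_BASES = "ATGC"


def flanking_contexts():
    """Every (left NNN, right NNN) pair drawn from the four standard bases."""
    triplets = ["".join(t) for t in product(STANDARD_BASES, repeat=3)]
    return [(left, right) for left in triplets for right in triplets]


def heptamers(xna):
    """All NNN + XNA + NNN heptamers for one single-character XNA symbol."""
    return [left + xna + right for left, right in flanking_contexts()]


def kmers(sequence, k=4):
    """Decompose a sequence into its overlapping k-nt kmers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    library = heptamers("B")
    print(len(library))        # 64 x 64 contexts for one XNA base
    print(kmers(library[0]))   # the four 4-nt kmers of one heptamer
```

With the xenonucleobase at the center of a heptamer, each of the four overlapping 4-nt kmers contains it at a different offset, which is why a 7-nt window suffices to populate a 4-nt kmer model.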
n = number of reads used: B (n = 18); Sn (n = 24); Sc (n = 40); P (n = 32); Z (n = 28); Xt (n = 18); Kn (n = 12); J (n = 14); and V (n = 18). Error bars indicate standard deviation of observed normalized signal level.
[0036] FIGs 4A-4C show construction and end-to-end nanopore sequencing of 6-letter DNA alphabets. (FIG. 4A) Proof of concept deployment of an XNA-refinement pipeline using 4-nt kmer models measured in this disclosure. Pipeline is used to transform raw commercial nanopore reads into likely XNA basecalls for the sense (+) and antisense (-) strands. (FIG. 4B) Confusion matrix showing per-read recall of the validation libraries using the full 12-letter supernumerary DNA model (n = 5,000 reads of each 6-letter set). Example shown with S = Sc kmer model used to analyze BSc reads. (FIG. 4C) Receiver operating characteristic (ROC) curve plots recall versus false discovery rate (FDR) for 4-nt kmer model basecalling of each XNA in the validation libraries, performing comparison between XNA (N) and most likely guppy basecall from the natural bases (§); (B, J, P, Xt = solid line; Sn, V, Z, Kn = dashed line; Sc = dotted line). Legend shows area under the curve for each base. Source data are provided as a Source Data file. Additional benchmarking of 4-nt XNA kmer models is tabulated in Tables 19-21.
[0037] FIG. 5 shows enzyme-assisted synthesis and third-generation sequencing of supernumerary 12-letter DNA. Enzyme-assisted synthesis was used to construct two supernumerary 12-letter dsDNA hairpins with either S = Sc (Scuper-12) or S = Sn bases (Snuper-12). Sequenced reads are processed to produce signal-to-sequence alignments and subsequently segmented into their corresponding kmer sequences and kmer signals. The kmer probability density function (observed signal mean <Iz>, model mean µki, model standard deviation σ) is used to calculate log-likelihoods while a maximum likelihood with outlier-robust log-likelihood ratios is used to determine base call. Confusion matrices show: (left) fraction of base called at each xenonucleotide position in Scuper-12 (n = 824 reads) and Snuper-12 (n = 1,438 reads); (right) base called using model with simplified priors ( = xenonucleotide at position called, § = most similar standard base called). Box denotes base pair called from paired analysis. Values of confusion matrices are tabulated in Tables 25, 26.
[0038] FIG. 6A shows an overview of an example non-templated N+1 tailing reaction. Tailing of blunt-end hairpin DNA substrates (N) can lead to complete formation of XNA-tailed hairpin products (N+1 major). PPi release from tailing leads to slow background rate of pyrophosphorolysis, which acts in the reverse direction of nucleotide tailing (3′-exo).
Pyrophosphorolysis is mitigated by adding YiPP to tailing reactions and balancing reaction duration and reaction rates. Over-tailing of products to generate (N+2) hairpins is also considered in optimization for tailing reactions. N+1 tailing is generally thought to occur at a first-order reaction rate, 2 orders of magnitude slower than templated polymerization. Furthermore, N+2 addition rates are polymerase specific and are thought to occur at first-order rates 2 orders of magnitude slower than N+1 product formation. End abbreviations: 3′ indicates 3′-OH, 5′ indicates 5′-PO4.
[0039] FIGs 6BA-6BE show results from screening polymerases capable of effective tailing of both purine and pyrimidine dNTPs for canonical bases (N = A, T, G, C) by T4 ligation assay. A 5′-phosphorylated hairpin oligo with a 3′-blunt end was
purchased from IDT (5′Phos-15HP; Table 2). Oligos are first refolded by incubating 20 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C. All subsequent tailing reactions used 16 µM 5′Phos-15HP (blunt-end with 15 nt in the hairpin region), 1.19 mM dNTP (with dNTP used specified on lane figure panel), and tailed for 1 h at the specified temperature using the specified polymerases. Subsequent T4 ligation reactions were performed with 11.2 µM of oligo for 1 h using T4 DNA Ligase Reaction Buffer which contains 1 mM ATP. (FIG. 6BA) Tailing screen for Taq polymerase (0.25 U/µL, 72 °C) and Klenow Fragment (exo-; KF) polymerase (0.68 U/µL, 37 °C) followed by high concentration T4 ligation. (FIG. 6BB) Tailing screen for Deep Vent (exo-; DV) polymerase (0.1 U/µL, 72 °C) and Therminator (Therm) polymerase (0.1 U/µL, 72 °C) followed by high concentration T4 ligation. (FIG. 6BC) Tailing screen for Taq polymerase (0.25 U/µL, 55 °C) and Bst polymerase (0.4 U/µL, 65 °C) followed by T4 ligation. (FIG. 6BD) Tailing screen for Bsu polymerase (0.25 U/µL, 37 °C) and Sulfolobus (Sulf) polymerase (0.1 U/µL, 55 °C) followed by T4 ligation. (FIG. 6BE) Positive control (G+) shows no ligation for a hairpin with a 3′ single nucleotide overhang (5′Phos-HP-3′G, lower band) while negative control (-) shows full ligation of blunt-end hairpin (upper band). Polymerase screening gels are representative of a single experimental replicate.
[0040] FIGs 6CA-6CM show UPLC/QTOF validation of tailing activity for all dNTPs and dxNTPs by Klenow Fragment (exo-). (FIGs 6CA-6CM) Full set of controls for the data shown in FIG. 2B. Extracted ion chromatograms (EIC) show relative abundance of either dNMP or dxNMP release when corresponding dNTPs/dxNTPs are used as a substrate for polymerase (KF exo-) tailing. Chromatogram scales are normalized for comparison of runs within each panel.
dNTP or dxNTP used in each reaction shown in panel legend. Reactions controlled for polymerase (+/- KF), Exo III (+/- Exo), or hairpin DNA (+/- DNA). Source data are provided as a Source Data file.
[0041] FIGs 6DA-6DM show UPLC/QTOF validation of tailing activity for all dNTPs and dxNTPs by Therminator. (FIGs 6DA-6DM) Full set of controls for the data shown in FIG. 2C. Extracted ion chromatograms (EIC) show relative abundance of either dNMP or dxNMP release when corresponding dNTPs/dxNTPs are used as a substrate for polymerase (Therminator; Therm) tailing. Chromatogram scales are normalized for comparison of runs within each panel. dNTP or dxNTP used in each reaction shown in
panel legend. Reactions controlled for polymerase (+/- Therm), Exo III (+/- Exo), or hairpin DNA (+/- DNA). Source data are provided as a Source Data file.
[0042] FIGs 6EA-6EE show screening and optimization of XNA tailing conditions. All tailing reactions used 11.9 µM 5′Phos-11HP, 1.19 mM of specified dNTP/dxNTP, and tailed at the specified temperature for the specified times using either Klenow Fragment (KF exo-; 0.71 U/µL) or Therminator (Therm; 0.29 U/µL). Tailing completeness was measured via T4 ligation assays. Hairpins tailed with a dNTP or dxNTP result in a single nucleotide overhang which is no longer a substrate for self-ligation. No tailing results in blunt-ended hairpins which self-ligate in the presence of T4 DNA ligase. (FIG. 6EA) XNA tailing screen using KF exo- and Therm for 8 h. (FIG. 6EB) XNA tailing screen using KF and Therm for 8 h. (FIG. 6EC) Additional Sc tailing screen using Therm for 8 or 16 h. Positive control (G+) shows no ligation for a hairpin with a 3′ single nucleotide overhang (5′Phos-HP-3′G, lower band) while negative control (-) shows ligation of blunt-end hairpin (upper band). (FIG. 6ED) Tailing screen for dTTP and dCTP using Therm at 60 °C for 4 h. (FIG. 6EE) Tailing screen for dTTP and dCTP using Therm at 60 °C for 4 h followed by T4 DNA ligation and digestion with exonucleases at 37 °C for 1 h using Exo I (2.7 U/µL), Exo III (13.3 U/µL) and Exo VIII (truncated, 0.67 U/µL). Screening gels are representative of a single experimental replicate.
[0043] FIG. 6F shows addition of yeast inorganic pyrophosphatase (YiPP) leads to slight improvements in XNA tailing reaction yield. 5′-phosphorylated hairpin oligos with either a 3′-blunt end or 3′-single nucleotide (-G, or -C) overhangs were purchased from IDT (5′-Phos-11HP; Table 2). Separately, 11.4 µM of 3′-blunt end oligos were tailed with 1.14 mM of dCTP or dGTP, Klenow Fragment (exo-; KF; 0.68 U/µL), and either 0.009 U/µL of YiPP or no YiPP at 37 °C for 4 h.
Subsequent ligation reactions were performed using 2.6 µM of two oligos with complementary overhang bases, either enzymatically tailed (G, C) or synthesized overhangs (G*, C*). Ligation reactions were incubated for 15 min at 16 °C using T7 DNA ligase (272 U/µL) and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000. Blunt-end hairpins (-/-) serve as a negative ligation control as the short reaction time prevents blunt end ligation. Unligated materials were digested using exonuclease I (2.7 U/µL), exonuclease III (13.3 U/µL) and exonuclease VII (1.33 U/µL) for 1 h at 37 °C. Exonuclease reactions were heat inactivated by incubation at 95 °C for 10 min and then at 80 °C for 10 min.
Note that in this set of experiments, Exo VII was used, which has a higher heat inactivation temperature than the Exo VIII (truncated) used in other aspects of this disclosure. It was also found that Exo VII resulted in incomplete digestion (lower band) and required different buffer conditions. In subsequent screening work, Exo VIII (truncated) was used instead in the exonuclease treatment steps. Positive control with G* and C* shows ligation of hairpins with G and C synthetic overhangs. Gel representative of a single experimental replicate. [0044] FIG. 6G shows enzymatic tailing does not lead to measurable differences in ligation when compared to ligation using fully synthetic hairpin with N+1 tails. Ligation of over-tailed product (i.e., more than one nucleotide added to the blunt 3′-end) with an N+1 tailed hairpin would result in dsDNA that contains a gap of one or more nucleotides. The gap region exposes a 3′ and 5′ end that would make this product susceptible to exonuclease degradation. Therefore, one way to test whether over-tailing was a problem was to compare how much ligated product was observed (as measured by agarose gel band intensity) when hairpins were tailed enzymatically versus made synthetically. Here, 5′-phosphorylated hairpin oligos with either a 3′-blunt end or 3′-single nucleotide (-G, or -C) overhangs were purchased from IDT (Table 2). Oligos were first folded using previously described methods. Blunt end oligo 5′Phos-11HP was then tailed with dCTP using conditions listed in Table 8. Subsequent ligation reactions were performed using T7 or T4 DNA ligase. Either the dCTP-tailed oligo (Tailed) or 5′Phos-HP-3′C (Synth) was ligated to 5′Phos-HP-3′G. For T7 ligation reactions, 2.7 µM of each oligo were incubated with 272 U/µL of T7 DNA ligase and StickTogether™ DNA ligase buffer at 16 °C for 15 min, after which the ligase was heat inactivated at 65 °C for 10 min.
For T4 ligation reactions, 4.2 µM of each oligo were incubated with 80 U/µL of T4 DNA ligase and T4 DNA ligase buffer at 16 °C for 2 h, after which the ligase was heat inactivated at 65 °C for 10 min. Unreacted hairpins or incomplete ligation products were removed by exonuclease treatment performed at 37 °C for 1 h using Exo I (2.7 U/µL), Exo III (13.3 U/µL), and Exo VIII (truncated, 0.67 U/µL). Exonuclease reactions were heat inactivated by incubation at 80 °C for 20 min. Results suggest over-tailing, if present, is not significant under tested conditions. Gel representative of a single experimental replicate. [0045] FIGs 6HA-6HQ show high resolution LC/MS of oligo showing N+1 tailing as major product. (FIG. 6HA) Hairpin oligo, 5′Phos-ScaI-HP (Table 2) was tailed
using optimized conditions with all dNTPs and dxNTPs described in this disclosure. After tailing, ScaI digestion was used to cleave the 3′-end, generating a short oligo that could be directly detected by LC/MS. (FIGs 6HB-6HN) Extracted ion chromatograms (EIC) showing formation of N+1 tailed product (6 nt oligo) for all dNTPs and dxNTPs. In each chromatogram set, EIC for starting material (N), pyrophosphorolysis (N-1), and processive tailing (N+2) are also shown. Deconvoluted mass spectra of N+1 product is shown to resolve isotopes. (FIGs 6HO-6HP) EIC and deconvoluted spectra show negative control reactions (ScaI-treated starting material, sense and antisense strands). (FIG. 6HQ) Exact masses calculated for EIC are tabulated. End abbreviations: 3′ indicates 3′-OH, 5P′- indicates 5′-PO4. Source data are provided as a Source Data file. [0046] FIG. 6I shows an overview of T3 DNA ligase, T4 DNA ligase, and T7 DNA ligase products. (top) Major products formed from T3 ligation and T4 ligation assays between hairpins generated in this disclosure. (bottom) Major and minor products formed for T7 ligation assays in this disclosure. T7 ligase preferentially ligates hairpins with a cohesive nucleotide overhang and has minimal blunt-end ligation activity.56 In reaction conditions with crowding agents such as high MW PEG, T7 ligase has been observed to perform blunt end ligation though to a lesser extent than T3 ligase and T4 ligase. Full hairpin sequences used in this disclosure can be found in Table 2. Nucleic acid end abbreviation: 3′ indicates 3′-OH, 5P′- indicates 5′-PO4. [0047] FIG. 6J shows an overview of XNA ligation products from XNA tailed hairpins. XNA ligation reactions were optimized making the following considerations of possible side products. Starting material is thought to be tailed by XNA tailing to > 95% completion. Untailed starting material (blunt-end hairpin DNA) can self-ligate forming blunt-ended dsDNA side product. 
XNA tailed DNA can also self-ligate in the presence of promiscuous DNA ligases in a mismatch configuration (e.g., P:P ligation). Desired product should form in reactions that contain two hairpins tailed with complementary XNA bases. Nucleic acid end abbreviation: 3′ indicates 3′-OH, 5P′- indicates 5′-PO4. [0048] FIGs 6KA-6KE show screening and optimization of ligation conditions across all XNA bases. All tailing reactions used conditions listed in Table 8 unless otherwise specified. Subsequent ligation reactions were performed using 4.7 µM of one oligo or 2.4 µM of two oligos with complementary tailed bases. Ligation reactions were incubated for 16 h at 16 °C using the specified ligase and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000. Improperly ligated
materials were digested using Exo I (1.5 U/µL), Exo III (7.7 U/µL) and Exo VIII (truncated, 0.77 U/µL) for 1 h at 37 °C. Exonuclease reactions were heat inactivated by incubation at 80 °C for 20 min. (FIG. 6KA) T4 ligase (36 U/µL) assay. (FIG. 6KB) T7 ligase (272 U/µL) assay. (FIG. 6KC) T3 ligase assay (272 U/µL), with 8 h tailing for Sc. (FIG. 6KD) T4 ligase assay (36 U/µL) containing 0.4 M betaine with 8 h tailing for Sc. (FIG. 6KE) BSc ligation with differing amounts of T7 ligase. In all gels, ^ indicates absence of the hairpin tailed with the complementary base. Positive control for ligation (–/–) shows full ligation of blunt-end hairpin (upper band, no degradation by exonucleases), while negative control (G+/ ^) shows either mismatch ligation or lack of ligation and subsequent digestion for a hairpin with a 3′ single nucleotide overhang (5′Phos-HP-3′G). Gels of FIGs 6KC-6KD are representative of two experimental replicates. [0049] FIGs 6LA-6LC show results from screening T3 ligase, T4 ligase, and T7 ligase for JV, XtKn, and BSc XNA ligation. Two blunt end hairpins that create a restriction enzyme site upon blunt ligation were purchased from IDT (5′Phos-NdeI-HP-1 and 5′Phos-NdeI-HP-2; Table 2). Blunt-end ligated hairpins create an NdeI restriction site, while successfully tailed and ligated hairpins do not. This ensures that after XNA tailing, XNA ligation and NdeI/exonuclease treatment, the products left are ligation products from properly tailed material, which prohibits the formation of the NdeI restriction site. All tailing reactions used conditions listed in Table 8 except Sc, which was tailed for 8 h. Subsequent ligation reactions were performed using 4.7 µM of a single oligo or 2.4 µM of two oligos with complementary tailed bases. Ligation reactions were incubated for 16 h at 16 °C using the specified DNA ligases and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000.
Unligated materials, as well as blunt end ligated materials, were digested using a combination of Exo I (1.4 U/µL), Exo III (7.1 U/µL), Exo VIII (truncated, 0.71 U/µL), and NdeI (1.4 U/µL) for 1 h at 37 °C. Enzymes were heat inactivated by incubation at 80 °C for 20 min. (FIG. 6LA) T3 ligase assay (272 U/µL); (FIG. 6LB) T4 ligase assay (36 U/µL); (FIG. 6LC) T7 ligase assay (272 U/µL) for reactions containing single hairpins or mixture of two hairpins (as indicated). Lanes labeled with single letter abbreviation of nucleotide tailed onto 3′-end of hairpin. In all gels, ^ indicates absence of the hairpin tailed with the complementary base. Pre-tailed negative control (G+/ ^) shows either mismatch ligation or lack of ligation and subsequent digestion for a hairpin with a 3′ single nucleotide overhang (5′Phos-HP-3′G). Blunt end negative control of a reaction ligation, (–/–) condition, containing 5′Phos-
NdeI-HP-1 and 5′Phos-NdeI-HP-2 shows digestion by NdeI, and subsequent digestion by exonucleases. Gel representative of a single experimental replicate. [0050] FIGs 6MA-6MC show full gels of XNA tailing and XNA ligation using optimized conditions. All assays were done with a 5′-phosphorylated hairpin oligo with a 3′-blunt end, purchased from IDT (5′-Phos-11HP; Table 2). Each DNA/XNA base was tailed using conditions from Table 8. (FIG. 6MA) Full gel for optimized XNA tailing conditions from FIG. 2E. Tailing completeness was measured via T4 ligation. Positive control (G+) shows no ligation for a hairpin with a 3′ single nucleotide overhang (5′Phos- HP-3′G, lower band) while negative control (–) shows ligation of blunt-end hairpin (upper band). Samples were diluted 4-fold before loading. (FIG. 6MB) Full gel for optimized DNA tailing conditions. Tailing completeness was measured by T4 ligation. Samples were diluted 8-fold before loading. (FIG. 6MC) Full gel for optimized XNA ligation conditions from FIG. 2G. Following tailing, 2.4 µM of each oligo with a complementary tailed base (except for the B:Sc base pair which was 1.3 µM of each) was ligated using conditions from Table 10. Unreacted hairpins or incomplete ligation products were removed by exonuclease treatment. Positive control ( + ) shows full ligation of blunt-end hairpin, while negative control ( * ) shows starting material with no polymerase and no ligase added, leading to full subsequent digestion by exonucleases. Starting material with no polymerase, ligase, or exonuclease added shown as a reference ( ^ ). Gels are representative of two experimental replicates. [0051] FIGs 6NA-6NF show a proof of concept for XNA tailing and XNA ligation cycling to insert two consecutive P≡Z base pairs. (FIG. 6NA) Agarose gel showing steps in consecutive XNA insertion. Each lane is described in the schematics that follow. (FIG. 
6NB) A hairpin containing an MlyI restriction site adjacent to the site of XNA ligation is used (donor hairpin, HPD). MlyI is a type IIS restriction enzyme (5′- GAGTCNNNNN↓-3′) that leaves a blunt end after cutting. A donor hairpin with an MlyI site and an acceptor hairpin were tailed with P and Z respectively (generating HPD-P, HPA-Z), ligated and treated with exonucleases following the optimized conditions described in this disclosure, and then purified (lane 1). The purified construct contains a single P≡Z base pair insertion. Product from lane 1 was digested using MlyI, resulting in products observed in lane 2: (major product) blunt end hairpin products HPD (regenerated donor hairpin) and HPA-Z (acceptor hairpin with a 3′- P≡Z base pair); (minor product) undigested product from lane 1. (FIG. 6NC) Separately, a donor hairpin without an MlyI
site was prepared by XNA tailing (HPP-P). XNA ligation followed by MlyI and exonuclease treatment does not result in formation of a ligation product (lane 3). (FIG. 6ND) In a second round, reaction product mixture from lane 2 was tailed with Z to produce Z-tailed donor hairpin (HPD-Z) and Z-tailed PZ-acceptor hairpin (HPA-ZZ). XNA ligation followed by MlyI and exonuclease treatment does not result in formation of a ligation product (lane 4). Minor ligation product previously observed in lane 2 is also no longer observed, suggesting this additional MlyI + exonuclease digestion round effectively removes MlyI-containing products. (FIG. 6NE) Tailed donor and acceptor hairpins (HPD-Z, HPA-ZZ, HPP-P) were ligated. XNA ligation followed by MlyI and exonuclease treatment results in formation of a product with two consecutive P≡Z base pair insertions (lane 5, q). Incorrect ligation product from two donor hairpins (HPD-Z + HPP-P) would not be present since HPD-Z contains an MlyI site. (FIG.6NF) MlyI cycling control reactions were carried out to assay how yield is generally affected by multiple ligation cycles. Blunt-end HPD was ligated using T4 DNA ligase (blunt-end ligation), treated with exonuclease, and purified (lane 6). Reaction product was then digested with MlyI (lane 7), then subjected to an additional round of blunt-end ligation using T4 DNA ligase (lane 8). Consecutive rounds of MlyI digestion and ligation result in a visible decrease in blunt end ligation product yield. For additional details regarding hairpin sequences used, please see methods section “Consecutive insertion of XNA base pairs using MlyI Type IIS restriction enzyme”. Nucleic acid end abbreviation: 3′ indicates 3′- OH, 5P′- indicates 5′-PO4. Gel representative of a single experimental replicate. [0052] FIGs 6OA-6OB show examples of basecalling XNA sequences with guppy. (FIG. 6OA) ONT guppy was trained to basecall sequences composed of standard nucleic acids (A, T, G, or C). 
One can enzymatically synthesize sequences containing various XNA base pairs (B≡Sn, P≡Z, Xt≡Kn, J≡V) to determine what canonical base is assigned to each of these XNAs. Guppy with high accuracy configuration (dna_r9.4.1_450bps_hac.cfg) was used for basecalling, and minimap2 was used to align sequences. (FIG. 6OB) Example sequence context surrounding an XNA base with the frequency of standard basecalls assigned to each XNA: i) B≡Sn base pair most frequently assigned to A:A; ii) P≡Z base pair most frequently assigned to G:C; iii) Xt≡Kn base pair most frequently assigned to A:G; iv) J≡V base pair most frequently assigned to C:G. Results suggest pyrimidine:purine guppy basecalling trends do not persist between standard and non-standard nucleic acids.
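As an illustration of how the frequencies in FIG. 6OB could be tabulated, the sketch below counts which standard base a basecaller assigned at a known XNA position across aligned reads. It is a minimal sketch rather than the disclosed pipeline: the function name `tally_assignments` and the example calls are hypothetical, and the reads are assumed to be already aligned so that the call at the XNA position can be read off directly.

```python
from collections import Counter

STANDARD_BASES = "ACGT"


def tally_assignments(calls):
    """Return (most frequent base, {base: fraction}) for calls at one XNA position.

    `calls` is an iterable of single-letter basecalls observed at the position
    of a known XNA across many aligned reads (hypothetical input format).
    Non-standard symbols such as '-' (deletion) are ignored.
    """
    counts = Counter(c for c in calls if c in STANDARD_BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard basecalls to tally")
    fractions = {base: counts[base] / total for base in STANDARD_BASES}
    # Ties are broken by the A, C, G, T ordering of STANDARD_BASES.
    top = max(fractions, key=fractions.get)
    return top, fractions


# Made-up calls at a single XNA position, for illustration only.
example_calls = list("GGGAGGCG")
top_base, freqs = tally_assignments(example_calls)
```

Running the same tally on the complementary strand position would give the second member of each assigned pair (for example, the G:C assignment reported for P≡Z).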
[0053] FIGs 6PA-6PC show example full gels of NNNNNNN library construction for nanopore sequencing. All assays were performed using NNN-pool oligos as starting material, listed in Tables 3-5. (FIG. 6PA) Steps involved in library building process exemplified using blunt-end hairpins. 8-fold diluted starting material (–) for library building shown as reference. Starting material without ligase shows subsequent full digestion via exonuclease digestion ( * ). If ligase is added and ligation is successful, a subsequent exonuclease digest leaves the ligated product (+) which does not have free 3′-ends. After removing one or both of the hairpin ends via restriction enzyme digestion, the final library product remains as the major product, with incomplete removal of the hairpin ends as the minor product ( ^ ). Ligation and subsequent processing were done using the methods outlined previously. (FIG. 6PB) Complete NNNNNNN library products for all XNA base pairs and blunt end ligation library sequenced in this disclosure. (FIG. 6PC) Self-ligation for library hairpins to check for incomplete tailing and pyrophosphorolysis products. Library hairpins were tailed with the listed XNA using conditions listed in Table 8, and 4.7 µM of each hairpin (except B* and Sc at 2.6 µM) was ligated to itself using the conditions listed in Table 10. Blunt end ligation ( ^ ) was included as a negative control. Unreacted hairpins or incomplete ligation products were removed via exonuclease treatment. Minimal self-ligation is observed for XNA-tailed NNN-pools, with J and V-tailed NNN-pools showing the most self-ligation compared to all the other XNA-tailed NNN-pools. These data suggest that the major product of XNA ligation with complementary sets is the desired heteroligation product. After sequencing, self-ligation products are identified by their pool barcodes and are removed prior to model building. Gels representative of a single experimental replicate.
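The removal of self-ligation products by pool barcode can be sketched as a simple filter. This is an illustrative assumption about how such a filter might look, not the disclosed implementation: the read tuples, the barcode strings, and the function names are hypothetical, and only the XNA pairings named in this disclosure (B≡Sn, B:Sc, P≡Z, Xt≡Kn, J≡V) are taken from the text.

```python
# Complementary XNA pairings named in this disclosure.
COMPLEMENTARY_PAIRS = {
    frozenset(pair)
    for pair in [("B", "Sn"), ("B", "Sc"), ("P", "Z"), ("Xt", "Kn"), ("J", "V")]
}


def is_heteroligation(xna_left, xna_right):
    """True if the two hairpin halves carry complementary XNAs."""
    return frozenset((xna_left, xna_right)) in COMPLEMENTARY_PAIRS


def filter_reads(reads, barcode_to_xna):
    """Keep read IDs whose two pool barcodes decode to a complementary pair.

    `reads` is an iterable of (read_id, left_barcode, right_barcode) tuples and
    `barcode_to_xna` maps a pool barcode to the XNA tailed onto that pool.
    Both formats are hypothetical. Reads with an unknown barcode are dropped.
    """
    kept = []
    for read_id, left_bc, right_bc in reads:
        left = barcode_to_xna.get(left_bc)
        right = barcode_to_xna.get(right_bc)
        if left is None or right is None:
            continue
        if is_heteroligation(left, right):
            kept.append(read_id)
    return kept


# Made-up barcodes and reads, for illustration only.
example_barcodes = {"AAGG": "P", "CCTT": "Z", "GGAA": "J"}
example_reads = [
    ("read1", "AAGG", "CCTT"),  # P + Z: heteroligation, kept
    ("read2", "AAGG", "AAGG"),  # P + P: self-ligation, removed
    ("read3", "GGAA", "CCTT"),  # J + Z: mismatch, removed
    ("read4", "AAGG", "TTTT"),  # unknown barcode, removed
]
kept_ids = filter_reads(example_reads, example_barcodes)
```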
[0054] FIGs 6QA-6QI show examples of variance minimization for segmentation steps of signal-to-sequence mapping. Signal-to-sequence mapping was performed using Tombo. Tombo uses an informed kmer model to improve the accuracy of signal-to-sequence mapping. Without a prior model, segmentation requires assigning each XNA to a standard base. Improper segmentation leads to inaccurate model parameter estimates. To minimize bias in segmentation, each XNA was assigned to the standard base that minimized the total variance in observed kmer signal levels. (FIGs 6QA-6QI) Boxplots of observed variance in kmer signal levels for each XNA base (B, Sn, Sc, P, Z, Xt, Kn, J, V respectively) at each position within a 4-nt kmer (-1: NNNN; 0: NNNN; +1: NNNN; +2: NNNN). Segmentation assignments used are indicated above each panel.
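The variance-minimizing choice of a standard base can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be, for each candidate standard base, the normalized signal levels observed per kmer after segmenting with that candidate; population variance is used; and the function names and example numbers are hypothetical rather than taken from the disclosure.

```python
from statistics import pvariance


def total_variance(levels_by_kmer):
    """Sum the per-kmer variance of observed signal levels.

    Kmers with fewer than two observations contribute nothing.
    """
    return sum(
        pvariance(levels)
        for levels in levels_by_kmer.values()
        if len(levels) > 1
    )


def best_assignment(levels_by_candidate):
    """Return the candidate standard base with the lowest total variance.

    `levels_by_candidate` maps a candidate base ('A', 'C', 'G', 'T') to a dict
    of {kmer: [observed levels]} produced by segmenting with that candidate.
    """
    return min(
        levels_by_candidate,
        key=lambda base: total_variance(levels_by_candidate[base]),
    )


# Made-up levels: segmenting with 'G' gives tight per-kmer distributions,
# while segmenting with 'A' scatters the observations.
example_levels = {
    "A": {"kmer1": [0.10, 0.90, -0.50], "kmer2": [1.20, 0.20]},
    "G": {"kmer1": [0.10, 0.12, 0.08], "kmer2": [1.20, 1.18]},
}
chosen = best_assignment(example_levels)
```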
Results are generally in agreement with guppy’s basecalling observations and indicate that standard pyrimidine:purine pairing assignments fail to properly describe signal-level observations. With optimum choice of standard base assignment. [0055] FIGs 6RA-6RE show example traces of signal deviation from the standard model. (FIGs 6RA-6RE) Example traces showing how observed normalized signal for sequences that contain a xenonucleotide (<Iz>) deviates from the expected standard DNA model signal (<I^ ୡ ^) of the most similar standard base for B (n = 70), Sn (n = 22), P (n = 45), Z (n = 44), Xt (n = 75), Kn (n = 12), J (n = 10), V (n = 10), and Sc (n = 11). Most similar standard base shown below xenonucleotide. Error bars indicate standard deviation. [0056] FIG. 6S shows an example xenomorph preprocessing pipeline. Flow diagram depicting the user input and outputs of the Xenomorph preprocessing pipeline. Xenomorph preprocess integrates basecalling, raw multi-to-single fast5 conversion, reference sequence fasta conversion, segmentation, and level assignment into a single command. Level extracted output files from xenomorph preprocess are inputs to basecalling through alternative hypothesis testing using xenomorph morph. Separating the preprocessing steps from alternative hypothesis testing allows users to experiment with basecalling using various model parameter settings or with alternative models without having to rerun the slower signal extraction steps. xenomorph preprocess uses guppy for initial basecalling, minimap2 for initial basecall-reference alignment, and ONT Tombo for signal normalization and signal-to-sequence alignment. [0057] FIGs 6TA-6TC show PCR amplification and sequencing of a DNA template with a P≡Z base pair. (FIG. 6TA) Synthetic template DNA containing a P≡Z base pair was amplified with Taq polymerase in a pH 8.0 buffer with varying concentrations of dxNTP and dNTP (Tables 22, 23). 
PCR products were sequenced on a MinION nanopore flow cell then basecalled for PZ detection. Read fractions that basecalled to (FIG. 6TB) P and (FIG. 6TC) Z for each condition are shown. PCR conditions differ by concentration of dxNTP and dNTPs used. The remaining fraction for each base corresponds to G and C basecalls (the most likely standard mutation for P and Z), respectively. Unamplified, synthetic P≡Z DNA was sequenced as a positive control (Std) for basecalling. Source data are provided as a Source Data file. [0058] FIGs 6UA-6UB show construction of 12-letter DNA for nanopore sequencing. All assays were performed using 12-letter DNA construction oligos as
starting material, listed in Table 7. (FIG. 6UA) Oligos are tailed with a dxNTP and ligated to a complementary pair forming a sequence with a single xenonucleotide base pair insertion. These oligos contain Golden Gate sites. Four single insertion constructs undergo Golden Gate ligation to form a single dsDNA sequence containing all 12 DNA letters. To remove intermediary 6-letter, 8-letter, or 10-letter DNA products, unsuccessfully assembled hairpins are digested by restriction endonucleases and exonucleases. (FIG. 6UB) Agarose gel showing steps in construction of 12-letter DNA. Starting material (–) shown as a reference. An example of a successful xenonucleotide tailing and ligation reaction resulting in an insertion of a single P≡Z base pair shown; subsequent exonuclease digestion leaves the ligated product (+) which does not have free 5′- or 3′-ends. Lane (xc) shows Golden Gate ligation product of Scuper-12 and lane (xn) shows Golden Gate ligation product of Snuper-12. While smaller assembled products are visible, only those that align to the full span of the 12-letter DNA product are considered for basecalling analysis. Gel representative of two experimental replicates. [0059] FIG. 6V shows an example workflow from sequencing to heptamer classification. From the nanopore sequencer, one can extract the raw signal of the heptamer. By plotting the signal as a function of time, one can feed images of the signal into a 2D CNN to classify them into the different heptamers. [0060] FIGs 6WA-6WB and 6XA-6XB show an example method for generating a defined non-standard nucleotide base pair library that uses a Type IIS restriction enzyme and a context barcode (“Barcode”) associated with a sequence context and a pool barcode (“Pool-Barcode”) associated with a non-standard nucleotide, as well as steps for sequencing and machine learning (ML) model training. Randomer region indicated.
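For the FIG. 6V workflow, the step of turning a raw signal trace into an image suitable for a 2D CNN can be sketched as a simple rasterization. This is only one possible way to do it and is an assumption on our part: the function name `rasterize`, the image height, and the binary pixel encoding are hypothetical, and the CNN itself is not shown.

```python
def rasterize(signal, height=16):
    """Convert a 1D signal into a binary 2D image (list of rows).

    Each time point becomes one column with a single pixel set to 1 at the
    row corresponding to its min-max scaled value. Row 0 is the top of the
    image, so larger signal values appear higher up, as in a plot of signal
    versus time.
    """
    if not signal:
        raise ValueError("signal must not be empty")
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    image = [[0] * len(signal) for _ in range(height)]
    for t, value in enumerate(signal):
        row = int((value - lo) / span * (height - 1))
        image[height - 1 - row][t] = 1
    return image


# Tiny made-up trace, for illustration only.
example_image = rasterize([0.0, 0.5, 1.0], height=3)
```

The resulting array of rows could then be stacked with others into a batch and passed to whatever 2D CNN framework is used for heptamer classification.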
[0061] FIGs 6YA-6YF show example process flows for training ML models for processing read data obtained by nanopore sequencing of polynucleotide sequences containing non-standard nucleotides (FIGs 6YA-6YD), as well as base calling using trained ML models for quantification of XNA retention in PCR reactions (FIG. 6YE) and quantification of XNA transcription errors from in vivo transcription (FIG. 6YF).
DETAILED DESCRIPTION
[0062] The present disclosure provides an array of breakthrough approaches for synthesizing polynucleotide (e.g., DNA) sequences containing at least one non-standard nucleotide. The non-standard nucleotide can include a hydrogen bonding pattern that is consistent or compatible with a hydrogen bonding pattern of a standard or existing
nucleotide (e.g., C, G, T, A), such that the non-standard nucleotide can be integrated within the overall structure of the biopolymer. Since there has been a dearth of basic tools for working with non-standard nucleotides, the present disclosure also provides breakthrough approaches for synthesizing polynucleotide sequences containing one or more non-standard nucleotides, optionally using next-generation sequencing (NGS) platforms, such as nanopore sequencing. This disclosure provides a basic set of tools that serve as the basis for working with non-standard nucleotides, which, to date, have not been tractable. The disclosure also enables non-standard nucleotides to be integrated into a wide range of technologies, such as biological computing and information storage systems, therapeutics, aptamers, biosensors, and the like. [0063] Methods of synthesizing polynucleotides containing one or more non-standard nucleotides make use of an N+1 tailing reaction of a suitable DNA polymerase. Accordingly, in an aspect, the disclosure provides a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template, such that the non-standard nucleotide is non-base-paired. The method comprises combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase. While the disclosed methods are useful for incorporating any non-standard nucleotide into a polynucleotide, in embodiments, the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
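The outcome of the N+1 tailing reaction can be pictured with a small data model. The sketch below is purely illustrative and makes several assumptions: a hairpin template is represented as a tuple of nucleotide names read 5′ to 3′ (tuples are used because XNA names such as Sn or Kn are longer than one character), the class and function names are hypothetical, and no enzyme kinetics or side products are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hairpin:
    """A self-complementary hairpin, written 5'->3' (illustrative model)."""
    strand: tuple          # nucleotide names, 5'->3'
    overhang: tuple = ()   # unpaired nucleotides at the 3' end


def tail_n_plus_1(hairpin, nucleotide):
    """Add exactly one unpaired nucleotide to the 3' end of a blunt template."""
    if hairpin.overhang:
        raise ValueError("template must be blunt-ended before tailing")
    return Hairpin(
        strand=hairpin.strand + (nucleotide,),
        overhang=(nucleotide,),
    )


# Made-up blunt hairpin sequence, for illustration only.
blunt = Hairpin(strand=("G", "C", "T", "T", "T", "T", "G", "C"))
tailed = tail_n_plus_1(blunt, "P")
```

In this model, a tailed hairpin is rejected as a template for further tailing, which mirrors the goal of a single (N+1) addition rather than processive (N+2) tailing.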
[0064] Upon performance of a screen of possible DNA polymerases for the N+1 tailing reaction, it is disclosed herein that at least two DNA polymerases are surprisingly and unexpectedly capable of performing this reaction with non-standard dNTPs. Accordingly, in embodiments, the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I, as further described herein. In embodiments, the polypeptide sequence comprises a sequence of SEQ ID NO:2. [0065] A variety of XNAs can be incorporated into DNA using methods of the present disclosure; however, it was found that improvement or optimization of reaction conditions allows for the N+1 tailing reaction to proceed at an acceptable rate. In example embodiments, the non-standard nucleotide being added is B or P, and the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71
U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP. In other example embodiments, the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP. While these or similar conditions were found to be effective for the disclosed reaction, other conditions, including less-than-optimal or non-improved conditions, can be implemented in embodiments without departing from the scope and spirit of the disclosure. [0066] While the KF exo- of DNA polymerase I can be used in embodiments, this is not the only DNA polymerase that was surprisingly and unexpectedly found to have the ability to add non-standard nucleotides to a dsDNA template in an N+1 tailing reaction. For example, in embodiments, the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon. In embodiments, the engineered polymerase is a variant of 9°N DNA polymerase. In embodiments, the polypeptide sequence comprises a sequence of SEQ ID NO:3 (e.g., Therminator™). [0067] The unexpected and surprising discovery of the ability of at least some DNA polymerases to add non-standard nucleotides as part of an N+1 tailing reaction leads to the disclosure herein of novel and innovative methods of use of these, and possibly other, DNA polymerases for stepwise, iterative synthesis of polynucleotides containing one or more non-standard nucleotides. These methods of use represent a substantial and significant improvement in the field of developing tools for synthesizing and characterizing non-standard polynucleotides, an area which has previously been largely undeveloped and in great need of innovation.
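The example tailing conditions given above can be collected into a small lookup for planning purposes. The sketch below is illustrative only and rests on stated assumptions: the temperatures, durations, enzyme amounts, and dxNTP concentration are the approximate example values from this disclosure, the first group is taken to be B and P, the pairing of 0.71 U/µL with KF exo- and 0.29 U/µL with Therminator is inferred from the legend of FIGs 6EA-6EE, and the function names are hypothetical.

```python
# Approximate example conditions from this disclosure (see assumptions above).
GROUP_37C = {
    "polymerase": "KF exo-",      # inferred from the FIG. 6EA-6EE legend
    "temp_c": 37,
    "hours": (1, 16),
    "polymerase_u_per_ul": 0.71,
    "dxntp_mm": 1.19,
}
GROUP_60C = {
    "polymerase": "Therminator",  # inferred from the FIG. 6EA-6EE legend
    "temp_c": 60,
    "hours": (4, 16),
    "polymerase_u_per_ul": 0.29,
    "dxntp_mm": 1.19,
}

TAILING_CONDITIONS = {
    **{xna: GROUP_37C for xna in ("B", "P")},
    **{xna: GROUP_60C for xna in ("Sn", "Sc", "Z", "Xt", "Kn", "J", "V")},
}


def conditions_for(xna):
    """Look up the example tailing conditions for a named XNA."""
    try:
        return TAILING_CONDITIONS[xna]
    except KeyError:
        raise ValueError(f"no example conditions listed for {xna!r}") from None


def molar_excess(dxntp_mm, hairpin_um):
    """Fold molar excess of dxNTP over hairpin ends (mM converted to µM)."""
    return dxntp_mm * 1000.0 / hairpin_um


# With the 11.9 µM hairpin used in the tailing screens, 1.19 mM dxNTP is
# roughly a 100-fold molar excess.
excess = molar_excess(conditions_for("Z")["dxntp_mm"], hairpin_um=11.9)
```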
[0068] In an aspect, the disclosure provides a method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide. In embodiments, the base pair is comprised of one non-standard nucleotide base paired with one standard nucleotide. In embodiments, the base pair is comprised of a first non-standard nucleotide base paired with a second non-standard nucleotide. [0069] Creation of a base pair that is comprised of two non-standard nucleotides can be implemented with a method that comprises generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, such that the second non-standard nucleotide is non-
base-paired. The second N+1 tailing product can be generated based on the same or a similar reaction as the N+1 tailing product (of the first N+1 tailing reaction). The method can further include ligating the N+1 tailing product with the second N+1 tailing product, which forms a dsDNA ligation product that comprises a base pair between the non- standard nucleotide and the second non-standard nucleotide, as further described herein. The N+1 tailing product can be linear or, in embodiments, can comprise a hairpin. The second N+1 tailing product can be linear or, in embodiments, can comprise a hairpin. In embodiments, the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end and is fully resistant to exonucleases. [0070] Additional non-standard nucleotides can be added iteratively and/or sequentially, such that two or more non-standard nucleotides can be added or inserted to a polynucleotide. This can be achieved by cleaving the dsDNA ligation product and exposing the non-standard base pair. The resultant blunt-end DNA template then becomes a template for a subsequent N+1 tailing reaction. Accordingly, in embodiments, the method comprises contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition that is conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product, which generates a blunt-end DNA template. The resultant blunt-end DNA template comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide. Upon subsequent N+1 tailing reactions and DNA ligation, a further dsDNA ligation product is produced. [0071] In embodiments, the method can be performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of the further dsDNA ligation product. 
Accordingly, in embodiments, the method comprises contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides. [0072] The method is modular and can be repeated any number of times for addition of any number of non-standard nucleotides, either with non-standard nucleotides added in a continuous manner or in a manner such that the non-standard nucleotides are interspersed with, or interrupted by, one or more standard nucleotides, for example. In
embodiments, a quantity of non-standard nucleotides added to a polynucleotide with a method of the disclosure is selected from the group including, but not necessarily limited to, the set of integers defined by the range of 1 to 10,000,000,000, inclusive. In embodiments, a quantity of standard nucleotides added to a polynucleotide with a method of the present disclosure is selected from the group including, but not necessarily limited to, the set of integers defined by the range of 1 to 10,000,000,000, inclusive. [0073] While a set of example XNAs is expressly described herein, any non-standard nucleotide or set of non-standard nucleotides can be used, in various embodiments. Accordingly, in embodiments, the non-standard nucleotide comprises an epigenetic modification, a modified sugar, a phosphate backbone, a nucleobase, a nucleobase that can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof. In example embodiments, the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base. In other example embodiments, the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base. In embodiments, the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
In embodiments, the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof. [0074] The disclosure also contemplates products, and in at least some instances, intermediates, of methods herein as also being within the scope of the disclosure. Accordingly, in an aspect, the disclosure provides a dsDNA ligation product that can comprise a non-standard nucleotide. In an aspect, the disclosure provides a further dsDNA ligation product that can comprise two or more non-standard nucleotides. [0075] In addition, the disclosure contemplates defined libraries of non-standard nucleotide base pairs, in any of a variety of nucleotide contexts, produced by the methods
of the disclosure that are useful for the generation of novel, empirical nanopore sequencing data that can be used for training computational systems, including but not limited to systems implementing one or more machine learning (ML) models for facilitated analysis of nanopore sequencing data. Accordingly, in an aspect, the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product or the blunt-end dsDNA template. The library polynucleotide sequence comprises a base pair between a non-standard nucleotide and a second non-standard nucleotide. Similarly, a plurality of base pairs can be incorporated into one or more defined libraries. Accordingly, in another aspect, the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of a further dsDNA ligation product or a further blunt-end dsDNA template, such that the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides. [0076] Since empirical measurements benefit from ground truth, it can be greatly beneficial to include control sequences that are associated with, and that can therefore identify, sequence contexts and non-standard nucleotides being measured empirically. Accordingly, in embodiments, a library polynucleotide sequence further comprises a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence, and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both. These or similar barcodes can be comprised of standard or otherwise sequence-able nucleotides, such that the identities of the non-standard nucleotides and the contexts can be known with a high degree of confidence.
This facilitates correlation between the empirical data and the non-standard nucleotide bases being observed. [0077] Machine learning can be used with one or more methods for facilitation of sequence data analysis. In various aspects, the disclosure provides a method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide, for assignment of an identity to the unknown non-standard nucleotide. Such a method comprises sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library to produce the one or more observed current reads, and training, with a ML algorithm, the ML model to
associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library. The ML model can be configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide. In embodiments, the ML model comprises a convolutional long short-term memory recurrent neural network (LSTM RNN); however, other ML models can be implemented, in embodiments. [0078] The disclosure also contemplates computer memory, computer products, computer devices, computer systems, and the like, that implement all or part of one or more methods of the disclosure as being within the scope of the disclosure. Accordingly, in an aspect, the disclosure provides a non-transitory computer-readable storage medium having stored thereon at least part of a ML model. In an aspect, the disclosure provides a computational device or computational system comprising the non-transitory computer-readable storage medium. In an aspect, the disclosure provides a nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium, optionally further including instructional materials for use of the kit. [0079] In a general aspect, the disclosure provides novel and innovative tools for use in synthesizing and sequencing polynucleotides containing non-standard nucleotides. Accordingly, in an aspect, the disclosure provides a method for basecalling a non-standard nucleotide expanded alphabet.
The method comprises sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read, computing, with the computational device or computational system, an association between the subject current read and the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library, and computing, based on the association, a structure of the non-standard nucleotide. The structure of the non-standard nucleotide can include, correspond, or relate to an identity of the non-standard nucleotide. These and similar computational methods are generally envisioned as being capable of being carried out by a programmable computer or computational device or system, or a cluster of such devices or systems. [0080] In a further aspect, the disclosure considers instructions executable by one or more processors (e.g., one or more microprocessors) for carrying out all or part of a method of the disclosure as also being within the scope of the disclosure. Accordingly, in various aspects, the disclosure provides a circuitry configured to perform all or part of a
method. In various aspects, the disclosure provides a nanopore sequencing kit, device, or system comprising the circuitry. In embodiments the circuitry can comprise a non-transitory computer-readable storage medium having stored thereon instructions which, when executed by one or more processors, cause the one or more processors to coordinate or carry out all or part of a method. However, in embodiments, the circuitry can comprise dedicated hardware circuitry having logic elements that together are configured to coordinate or carry out all or part of a method.

Kit, Device, System, Circuitry, Processor, and Computer Implementations

[0081] Accordingly, in various aspects, all or part of methods, compositions, circuitry, non-transitory computer-readable storage media, instructional materials, and the like, can be integrated into certain commercially relevant form factors, such as kits, products, devices, computational devices, systems, computational systems, processor-executable code, firmware, software, circuitry, and others. [0082] Accordingly, embodiments of devices and any systems disclosed herein, including embodiments that include or utilize a processor and/or processor executable instructions can utilize circuitry to implement those technologies and methodologies. Such circuitry can operatively connect two or more components, generate information, determine operation conditions, control an appliance, device, or method, and/or the like. Circuitry of any type can be used. In embodiments, circuitry includes dedicated hardware having electronic circuitry configured to perform operations or computations on a dedicated basis, without any use of microprocessors, central processing units, or software or firmware or processor-executable instructions.
However, in embodiments, circuitry includes, among other things, one or more computing devices such as one or more processors (e.g., microprocessor(s)), one or more central processing units (CPU), one or more digital signal processors (DSP), one or more application-specific integrated circuits (ASIC), one or more field-programmable gate arrays (FPGA), or the like, or any variations or combinations thereof, and can include discrete digital and/or analog circuit elements or electronics, or combinations thereof. [0083] In embodiments, circuitry includes one or more ASICs having a plurality of predefined logic components. In embodiments, circuitry includes one or more FPGA having a plurality of programmable logic components. In embodiments, circuitry includes hardware circuit implementations (e.g., implementations in analog circuitry,
implementations in digital circuitry, and the like, and combinations thereof). In embodiments, circuitry includes combinations of circuits and computer program products having software or firmware processor-executable instructions stored on one or more computer readable memories, e.g., non-transitory computer-readable storage mediums, that work together to cause a device or system to perform one or more methodologies or technologies described herein. [0084] In embodiments, circuitry includes circuits, such as, for example, microprocessors or portions of microprocessors, that require software, firmware, and the like for operation. In embodiments, circuitry includes an implementation comprising one or more processors or portions thereof and accompanying software, firmware, hardware, and the like. In embodiments, circuitry includes a baseband integrated circuit or applications processor integrated circuit or a similar integrated circuit in a server, a cellular network device, other network device, or other computing device. In embodiments, circuitry includes one or more remotely located components. In embodiments, remotely located components (e.g., server, server cluster, server farm, virtual private network, etc.) are operatively connected via wired and/or wireless communication to non-remotely located components (e.g., desktop computer, workstation, mobile device, controller, etc.). In embodiments, remotely located components are operatively connected via one or more receivers, transmitters, transceivers, or the like. [0085] Embodiments include one or more data stores that, for example, store instructions and/or data. 
Non-limiting examples of one or more data stores include volatile memory (e.g., Random Access memory (RAM), Dynamic Random Access memory (DRAM), or the like), non-volatile memory (e.g., Read-Only memory (ROM), Electrically Erasable Programmable Read-Only memory (EEPROM), Compact Disc Read-Only memory (CD-ROM), or the like), persistent memory, or the like. Further non-limiting examples of one or more data stores include Erasable Programmable Read-Only memory (EPROM), flash memory, or the like. The one or more data stores can be connected to, for example, one or more computing devices by one or more instructions, data, or power buses. [0086] In embodiments, circuitry includes one or more computer-readable media drives, interface sockets, Universal Serial Bus (USB) ports, memory card slots, or the like, and one or more input/output components such as, for example, a graphical user
interface, a display, a keyboard, a keypad, a trackball, a joystick, a touch-screen, a mouse, a switch, a dial, or the like, and any other peripheral device. In embodiments, circuitry includes one or more user input/output components that are operatively connected to at least one computing device to control (electrical, electromechanical, software-implemented, firmware-implemented, or other control, or combinations thereof) one or more aspects of the embodiment. [0087] In embodiments, circuitry includes a computer-readable media drive or memory slot configured to accept a signal-bearing medium (e.g., computer-readable memory media, computer-readable recording media, or the like). In embodiments, a program for causing a system to execute any of the disclosed methods can be stored on, for example, a computer-readable recording medium (CRMM), a signal-bearing medium, or the like. Non-limiting examples of signal-bearing media include a recordable type medium such as any form of flash memory, magnetic tape, floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), Blu-Ray Disc, a digital tape, a computer memory, or the like, as well as transmission type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link (e.g., transmitter, receiver, transceiver, transmission logic, reception logic, etc.)). Further non-limiting examples of signal-bearing media include, but are not limited to, DVD-ROM, DVD-RAM, DVD+RW, DVD-RW, DVD-R, DVD+R, CD-ROM, Super Audio CD, CD-R, CD+R, CD+RW, CD-RW, Video Compact Discs, Super Video Discs, flash memory, magnetic tape, magneto-optic disk, MINIDISC, non-volatile memory card, EEPROM, optical disk, optical storage, RAM, ROM, system memory, web server, or the like.
Terminology

[0088] The description set forth herein in connection with the appended drawings, where like numerals may reference like elements, is intended as a description of various embodiments of the present disclosure and is not intended to represent the only embodiments. Each embodiment described in this disclosure is provided merely as an example or illustration and should not be construed as preferred or advantageous over other embodiments. The illustrative examples provided herein are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Similarly, any steps described herein can be interchangeable with other steps, or combinations of steps, in any
suitable combination and/or order to achieve the same or substantially similar result. Generally, the embodiments disclosed herein are non-limiting, and other embodiments within the scope of this disclosure can include structures and functionalities from more than one specific embodiment shown in the figures and described in the specification. [0089] In the foregoing description, specific details are set forth to provide a thorough understanding of example embodiments of the present disclosure. It will be apparent to one skilled in the art, however, that the embodiments disclosed herein can be practiced without embodying all the specific details. In some instances, process steps have not been described in detail in order not to unnecessarily obscure various aspects of the present disclosure. Further, it will be appreciated that embodiments of the present disclosure can employ any combination of features described herein and/or alternatives thereof. [0090] The present application can include references to directions, such as “vertical,” “horizontal,” “front,” “rear,” “left,” “right,” “top,” and “bottom,” etc. These references, and other similar references in the present application, are intended to assist in helping describe and understand the particular embodiment (such as when the embodiment is positioned for use) and are not intended to limit the present disclosure to these directions or locations. [0091] The present application can also reference quantities and numbers. Unless specifically stated, such quantities and numbers are not to be considered restrictive, but examples of the possible quantities or numbers associated with the present application. Also in this regard, the present application can use the term “plurality” to reference a quantity or number. In this regard, the term “plurality” is meant to be any number that is more than one, for example, two, three, four, five, etc. 
[0092] As used herein, the term “about,” “approximately,” “near,” etc., includes the stated value as well as non-stated values that are near to or approximate the stated value according to practicable ranges as would be recognized by those skilled in the art. The term “based on” means “based at least partially on.” [0093] In at least some embodiments, “about” refers to the stated value and a range that includes values 10% below the stated value to 10% above the stated value. In embodiments, “about” refers to the stated value and a range that includes values 11% below the stated value, 12% below the stated value, 13% below the stated value, 14% below the stated value, 15% below the stated value, 16% below the stated value, 17%
below the stated value, 18% below the stated value, 19% below the stated value, 20% below the stated value, 21% below the stated value, 22% below the stated value, 23% below the stated value, 24% below the stated value, or 25% below the stated value. In embodiments, “about” refers to the stated value and a range that includes values 11% above the stated value, 12% above the stated value, 13% above the stated value, 14% above the stated value, 15% above the stated value, 16% above the stated value, 17% above the stated value, 18% above the stated value, 19% above the stated value, 20% above the stated value, 21% above the stated value, 22% above the stated value, 23% above the stated value, 24% above the stated value, or 25% above the stated value. [0094] In embodiments wherein a range is stated, e.g., the range of 1-16, the stated range includes every value between the lower and upper limits as well as the lower and upper limits of the stated range, themselves, as stated values. In embodiments wherein a range is approximately stated, e.g., about 1-16, the approximately stated range includes every value between the lower and upper limits as well as the lower and upper limits of the stated range, themselves, as stated values (e.g., 1 and 16 are each stated values), including those non-stated values that are near to or approximate the stated values according to practicable ranges as would be recognized by those skilled in the art or as otherwise described herein. [0095] For the purposes of the present disclosure, the phrase “at least one of A, B, and C,” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C), including all further possible permutations when greater than three elements are listed. Likewise, as used herein, the term “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C), including all further possible permutations when greater than three elements are listed. 
Unless otherwise stated, the term “or” is an inclusive “or”, and the phrase “A or B” means (A), (B), or (A and B). Unless otherwise stated, the term “and” requires both elements; for example, the phrase “A and B” means (A and B). [0096] In the claims and for purposes of the present disclosure, the terms “a”, “an”, “the”, and the like, refer to the singular and the plural forms of the object or element referenced. As used herein in the description and claims, the term “comprising”, is inclusive or open-ended and does not exclude additional, unrecited elements or method steps. The term “consisting of,” as used in a claim, excludes any element, step, or ingredient not specified in the claim. The term “consisting essentially of,” as used in a
claim, limits the scope of the claim to the specified materials or steps and those that do not materially affect the basic and novel characteristic(s) of the claim.

Examples

[0097] Example 1: Enzymatic Synthesis and Nanopore Sequencing of 12-letter Supernumerary DNA

[0098] Abstract: The 4-letter DNA alphabet (A, T, G, C) is an elegant, yet non-exhaustive solution to the problem of storage, transfer, and evolution of biological information. This example provides strategies for both writing and reading DNA with expanded alphabets composed of up to 12 letters (A, T, G, C, B, S, P, Z, X, K, J, V). For writing, an enzymatic strategy is devised for inserting a singular, orthogonal xenonucleic acid (XNA) base pair into standard DNA sequences using 2′-deoxy-xenonucleoside triphosphates as substrates. Integrating this strategy with combinatorial oligos generated on a chip, libraries are constructed containing single XNA bases for parameterizing kmer basecalling models for nanopore sequencing. These elementary steps are combined to synthesize and sequence DNA containing 12 letters – the upper limit of what is accessible within the electroneutral, canonical base pairing framework. By introducing low-barrier synthesis and sequencing strategies, this disclosure overcomes previous obstacles and paves the way for making expanded alphabets widely accessible. [0099] The 4-letter standard genetic alphabet of DNA (A, T, G, C) is ubiquitous and one of the defining biomolecular signatures of life on Earth. The ability to read, write, and translate this information forms the basis for life as an emergent property of nucleic acid heteropolymers. Humanity has learned how to manipulate the 4 letters of DNA, spurring major advances in biotechnology, information storage, and healthcare.
As examples, the standard nucleic acids can be components for diagnostic tests to screen for disease or detect toxins, therapeutics that create immune responses, and even as a molecular system for long-term storage of digital information. [0100] However, four building blocks for DNA is far from exhausting two rules of complementarity that govern canonical, hydrogen-bonding base pairing – (a) size complementarity, where the larger purines (A, G) pair with the smaller pyrimidines (C, T), and (b) hydrogen bonding complementarity, where hydrogen bond donors pair with hydrogen bond acceptors. These rules allow for up to 12 different nucleotides, forming six orthogonal pairs (FIGs 1A-1B, Table 1). Within these rules, multiple heterocyclic
systems are also available to support each of the various hydrogen bonding combinations (e.g., Sn and Sc). It is therefore possible to envision various ‘supernumerary’ (def. in excess of the normal) DNA codes as a fusion of the natural nucleobases (A, T, G, C) with a set of the synthetic hydrogen bonding xenonucleobases (B, S, P, Z, X, K, J, V). As an example, one can perform chemical synthesis and characterization of one such 8-letter code, hachimoji DNA, comprised of natural (A, T, G, C) and synthetic (B, Sc, P, Z) nucleobases. [0101] Parameters of biomolecular compatibility of expanded non-canonical hydrogen bonding base pairings include stability in the DNA double helix, the ability to be replicated by DNA polymerases, transcribed by RNA polymerases, reverse transcribed by reverse transcriptases, and even translated by the ribosome. These xenonucleotides are at the forefront of nucleic acids research since they significantly expand DNA’s chemical, structural, and binding repertoire. Efforts at appropriating expanded DNA codes can result in more sensitive diagnostics tests, highly specific aptamer-based therapeutics that are cheaper and more soluble than antibodies, semi-synthetic organisms capable of biomanufacturing new molecules, catalytic nucleic acids (XNAzymes) with enzyme-like activity, and even denser forms of digital information storage. [0102] However, the biomolecular tools and commercial infrastructure for sequencing alphabets comprised of either more than 4 letters or alternative sets of letters are critically lacking. Notably, methods for sequencing of xenonucleic acids (XNAs) are decades behind that of DNA and RNA, and rely on low-throughput, non-multiplexed measurements, such as gel-shift assays, mass spectrometry, and selective conversion of XNAs to standard bases followed by Sanger sequencing. 
This stands in stark contrast to the state of sequencing for the standard nucleobases (A, T, G, C), which has a multitude of high throughput, multiplexable, and low-cost options. To put the disparity of sequencing technology in perspective, XNA sequencing technology is lower throughput, less sensitive, and less generalizable than the methods Sanger and Coulson developed in the 1970s and has no service-oriented solution. Conversely, ATGC-sequencing technology is in its ‘third generation.’ [0103] Currently, research and development in the field of XNA face fundamental barriers to entry in the form of sequencing, which generally requires highly specialized equipment and analytical expertise. One possible solution is to adapt existing first-, second-, or third-generation DNA sequencing technology to work with more DNA letters.
However, modern sequencing infrastructure is inherently inflexible and highly specialized for ATGC sequencing. Adapting fluorescence-based DNA sequencing techniques for XNA sequencing, such as Illumina® sequencing, would require a plethora of innovations including new reagents (e.g., XNA nucleotides with unique fluorophores), engineered polymerases capable of replicating XNAs, modification of instrumentation to handle more cycles, and creation of new data collection/analysis pipelines. No fluorescence-based XNA next-generation sequencing strategy has been attained to date. As an alternative approach, other next-generation sequencing methods built for DNA may be more amenable to serving as XNA sequencing solutions. [0104] Nanopore sequencing has the ability to sequence non-canonical bases such as epigenetic and epitranscriptomic modifications. For example, nanopore sequencing can be used for sequencing 8-letter hachimoji DNA (A, T, G, C, B, Sc, P, Z) using the Hel308 motor protein with an MspA pore. As such, it is proposed that third-generation (high throughput, multiplexable, single molecule, real-time) sequencing of supernumerary DNA is possible despite the “k-mer explosion” in possible current signals induced by an expanded DNA alphabet. As a limitation, previous efforts in this regard did not attempt to build models for decoding the nanopore current signals to nucleic acid sequences. In addition, these efforts were performed on a non-commercial research platform consisting of a single nanopore run by a technician (low throughput, non-multiplexable). While these previous efforts were informative, these prior approaches cannot be easily adopted by the industry or those in other fields. [0105] Non-standard bases can be classified using commercial nanopores (e.g., GridION, ONT).
This can show that commercial nanopore sequencing platforms are indeed capable of sequencing chemically modified nucleobases including 2,4-diamino-purine, 5-nitro-indole, and 5-octadiynyldeoxyuracil. However, a shortcoming of previous efforts is that orthogonal base pairing nucleotides were not tested and only a small sequence space was explored, both of which exclude their applicability to expanded and evolvable genetic alphabets. For commercial nanopore sequencing to be applicable to 4+-letter genetic alphabet systems that contain orthogonal XNA base pairs, bespoke nanopore sequencing models are needed. [0106] Similarly, the ability to synthesize nucleic acids with xenonucleotide base pairs is at least a generation behind modern ATGC-synthesis technology. To date, de novo synthesis of DNA with non-standard base pairs is only possible through
phosphoramidite synthesis – commercial access is both limited and costly, standing as a major barrier to entry. For example, standard phosphoramidite synthesis costs for non-standard bases average around $100-400 USD/nt – or over 1000 times more expensive than A, T, G, C synthesis ($0.04-0.40 USD/nt). Furthermore, the next-generation synthesis methods that have transformed the ability to explore sequence space (pooled synthesis, synthesis-on-a-chip, enzymatic synthesis) are not commercially available for orthogonal base pairs. Lowering barriers to entry for synthesis and sequencing of XNAs with orthogonal base pairs is needed to bring expanded genetic alphabets to the next-generation eras of synthetic biology, information storage, therapeutic discovery, sequencing, and synthesis. [0107] This example provides compositions, methods, devices, and systems reflective of significant progress with both synthesis and sequencing that makes supernumerary DNA sequences containing 6-letter, 8-letter, 10-letter, or 12-letter alphabets easily accessible. In the area of XNA synthesis, an enzyme-assisted strategy is introduced that can be used to incorporate single orthogonal XNA base pairs (B≡Sn or B≡Sc; P≡Z; Xt≡Kn; and J≡V) into synthetic 4-letter DNA. For XNA sequencing, theory is put to practice and commercial nanopore basecalling models capable of sequencing single XNA bases (B, Sn, Sc, P, Z, Xt, Kn, J, and V) embedded in a standard DNA (i.e., A, T, G, C only) context are developed. [0108] Results [0109] Non-templated XNA tailing by DNA polymerases. Under the supernumerary DNA framework, the two standard base pairs (A=T, G≡C) can be combined with any of the four mutually orthogonal base pairs (B≡Sn or B≡Sc; P≡Z; Xt≡Kn; and J≡V) shown in FIG. 1 and Table 1.
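As an illustration only, the pairing scheme described above (two standard pairs plus four orthogonal pairs) can be sketched as a lookup table, together with a count of possible k-mers that shows how quickly the signal space grows as letters are added (the “k-mer explosion” noted earlier). This is a minimal sketch, not part of the disclosed methods: the single-letter codes S, X, and K stand in for the Sn/Sc, Xt, and Kn variants without distinguishing them, the function names are hypothetical, and k = 6 is an arbitrary illustrative choice rather than a parameter taken from the disclosure.

```python
# Pairing table for a 12-letter supernumerary alphabet, as listed in the text:
# A=T, G≡C, B≡S, P≡Z, X≡K, J≡V.
# Assumption: S, X and K are used as single-letter stand-ins for Sn/Sc, Xt, Kn.
PAIRS = [("A", "T"), ("G", "C"), ("B", "S"), ("P", "Z"), ("X", "K"), ("J", "V")]

# Build a symmetric lookup so each letter maps to its pairing partner.
COMPLEMENT = {}
for left, right in PAIRS:
    COMPLEMENT[left] = right
    COMPLEMENT[right] = left


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of seq under the 12-letter pairing table."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmer_count(alphabet_size: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet with alphabet_size letters."""
    return alphabet_size ** k


if __name__ == "__main__":
    print(reverse_complement("ATGBPXJ"))
    # Compare k-mer space for 4-, 6-, 8-, 10- and 12-letter alphabets (k = 6 is illustrative).
    for letters in (4, 6, 8, 10, 12):
        print(letters, kmer_count(letters, 6))
```

With k = 6, a 4-letter alphabet gives 4,096 possible k-mers, whereas a 12-letter alphabet gives 2,985,984, which is one way to see why basecalling models need dedicated training data for expanded alphabets.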
Though phosphoramidite synthesis might seem like a general approach for de novo synthesis of supernumerary DNA, the chemical instability of xenonucleobases J, Sn, and Xt in organic synthesis means that, in practice, sequences using that approach can be limited to only 8 letters (the hachimoji set: B, Sc, P, Z). [0110] To meet the challenge of generalizing synthesis of supernumerary DNA to include any possible base pair, it was decided to implement enzymatic synthesis of nucleic acids. Enzymes like terminal deoxynucleotidyl transferase (TdT) can catalyze non-templated addition of a wide range of modified nucleotide building blocks on ssDNA, and can do so at neutral pH. However, the processive nature of TdT-like
enzymes precludes them from being used for sequence-defined addition of dNTPs. Moreover, TdT-based enzymatic synthesis of nucleic acids would require specially protected building blocks or polymerase-nucleotide conjugates that are not commercially available. [0111] Lacking a suitable alternative, it was necessary to develop an enzymatic synthesis strategy that would be flexible enough to handle all desired xenonucleobases using 2′-deoxynucleoside triphosphates as the universal building block and be specific enough to catalyze a non-processive N+1 addition. A solution was developed with exploration and implementation of a side reaction, largely unappreciated or uncharacterized in previous studies, that is catalyzed by many DNA polymerases. The non-templated blunt-end N+1 addition of a nucleotide to the 3′-end of dsDNA by the small fragment of DNA Pol I (small Klenow Fragment or KF exo-) was chosen as an example. In this reaction, KF catalyzes the addition of a dNTP to the free 3′-OH end of blunt-end dsDNA resulting in a 3′ N+1 DNA product. It was imagined that if 2′-deoxy-xenonucleoside triphosphates (dxNTPs) could serve as tailing substrates for a synthetic DNA hairpin, the non-processive nature of this reaction would provide a means for a controlled, non-templated, enzymatic semi-synthesis of 6-, 8-, 10-, and 12-letter DNA (FIG. 6A; hairpins used in this disclosure are listed in Tables 2-7). Furthermore, since this strategy avoids the use of environmentally harmful phosphoramidites, it is inherently an environmentally friendly solution to a synthesis problem. [0112] After a campaign of screening, two enzymes were identified as being able to tail both standard DNA purines and pyrimidines to the blunt end of dsDNA hairpins (Table 8). These two enzymes included the small Klenow Fragment (KF exo-) polymerase and an engineered polymerase from hyperthermophilic marine archaea (engineered 9°N DNA polymerase - Therminator™).
Next, activity was tested on the expanded set of XNA letters. 2′-Deoxy-xenonucleoside triphosphate building blocks for an 8-letter code can be readily available from various commercial sources (dBTP, dSnTP/dScTP, dPTP, and dZTP). To reach the full extent of the 12-letter alphabet, the 2′-deoxy-xenonucleoside triphosphates of the remaining bases were chemically synthesized: dXtTP, dKnTP, dJTP, dVTP (FIGs 6BA-6BE). Next, a sensitive liquid chromatography/mass spectrometry (UPLC/QTOF) assay was developed for detecting tailing activity. In this assay, a DNA polymerase and a 3′→5′ exonuclease are simultaneously used to perform N+1 tailing and N+1 removal of a dxNTP substrate on an exo-resistant, blunt-end DNA hairpin (FIG. 2A). The net reaction results in formation of
dxNMP + PPi, and requires the presence of a 3′-OH blunt-end DNA, exonuclease, DNA polymerase, and dxNTP. From this UPLC/QTOF assay, it was evident that both KF (exo-) and Therminator™ polymerase were fully capable of non-templated N+1 addition of all four standard dNTPs and all nine dxNTPs tested, including both the N-nucleoside (Sn) and C-nucleoside (Sc) of S (FIGs 2B-2C, FIGs 6CA-6CM, and 6DA-6DM). [0113] Next, efforts were directed to optimizing XNA tailing reaction components and conditions. Of particular concern were competing side reactions such as PPi-mediated pyrophosphorolysis (N-1) and consecutive tailing (N+2) (FIGs 6A, 6EA-6EE, and 6F). Since N+1-tailed DNA is unable to undergo self or blunt-end ligation, agarose gel-based assays were used to characterize the amount of remaining starting material. Reaction conditions (reaction time, dxNTP concentration, temperature, choice of polymerase, and reaction additives) were optimized around maximizing N+1 tailing and consuming unreacted blunt-end DNA, the latter of which would be a major source of non-specific product formation in subsequent steps. To confirm that this strategy is indeed non-processive, high-resolution UPLC/QTOF assays were then used to show formation of the N+1 DNA as the main tailing product (FIG. 6F) under optimized conditions. XNA tailing reaction characterization for each dxNTP used for a supernumerary 12-letter DNA alphabet (dBTP, dPTP, dSnTP or dScTP, dZTP, dXtTP, dKnTP, dJTP, and dVTP) is shown in FIGs 2D-2E, with conditions listed in Table 8. Under these optimized conditions for all dxNTPs tested, extent of reaction is estimated to be >95% (Table 9). [0114] Ligation of XNA overhangs with complementary XNA base pairs. While N+1 tailing can be used for non-templated extension of the 3′ blunt-end of DNA, a base pair embedded in a dsDNA sequence was desired. 
A strategy was then developed for joining two DNA hairpins with complementary N+1 base overhangs to generate xenonucleotide base pairs. Here, dsDNA ligases were envisioned to catalyze end-to-end joining of two N+1 tailed DNA hairpins with complementary xenonucleotide overhangs (FIG. 2F). The hairpin design of the substrates generates a desired dsDNA ligation product that lacks a free 5′ or 3′ end, making it fully resistant to exonucleases. Subsequent treatment of the ligation reaction with exonucleases therefore allows one to remove unreacted starting material and partially ligated products. [0115] The ideal dsDNA ligase should be able to ligate DNA strands with single nucleotide overhangs and have relaxed specificity for both the overhanging nucleotide
and its adjacent sequence context. There is a general promiscuity of phage ligases (T3 DNA ligase, T4 DNA ligase, and T7 DNA ligase), including their ability to ligate modified and non-standard nucleotide substrates (FIG. 6I). To ensure any ligation product observed comes from complementary overhang ligation, a negative control can be performed in which hairpins are incubated individually in the presence of the respective ligases (FIG. 6J). In these single hairpin reactions, any ligation product would indicate either blunt-end ligation, from incomplete XNA tailing, or formation of a self-ligation (mismatch ligation) product. Mismatch ligation can potentially arise when crowding agents are present in the ligation buffer, ligase concentration is high, or reaction time is long, and it is dependent on both the overhang base and the choice of DNA ligase. Taking these constraints into consideration, XNA ligation reaction conditions that would generate the ligation product when two hairpins with complementary N+1 overhangs are present in the same reaction were screened. Though the sequence context for all ligation reactions was the same (Table 2), variable ligation yields were observed, suggesting the chosen DNA ligases had varying xenonucleotide base pair tolerance. After optimizing reaction conditions for XNA tailing, it was surprisingly discovered that it is indeed possible to incorporate all xenonucleotide base pairs (B:Sn, B:Sc, P:Z, Xt:Kn, and J:V) into DNA, with varying yields (estimated yield: B≡Sn ≈ 73%; B≡Sc ≈ 7%; P≡Z ≈ 53%; Xt≡Kn ≈ 31%; and J≡V ≈ 15%; FIGs 2F-2G and FIGs 6J, 6KA-6KE, 6LA-6LC, and 6MA-6MC, Tables 10-11). In totality, the described enzyme-assisted synthesis reaction is successful and can be composed of two reactions that use dxNTPs and commercially available enzymes: 1) xenonucleotide tailing and 2) xenonucleotide ligation. Moreover, strategies for extending the scope of these steps beyond singular XNA base pair insertions are contemplated. 
For example, following one round of XNA tailing and XNA ligation, one can add a type IIS restriction enzyme that regenerates blunt-ended starting material, allowing one to perform consecutive dxNTP additions (FIGs 2H and 6NA-6NF). [0116] Generation of XNA libraries for nanopore model building. From these advances in XNA writing, efforts were made to advance the capacity for XNA reading with a commercial nanopore platform. Nanopore sequencing (from Oxford Nanopore Technology®) has features that make it adaptable for sequencing supernumerary DNA: it can sequence single DNA molecules without amplification, without the requirement for fluorescently labeled building blocks, and with high throughput (100k-10M reads per run). In nanopore sequencing, an ion current signal is generated as single-stranded DNA
is threaded through a protein nanopore. Conversion of signal-to-sequence, or basecalling, is performed computationally by either statistical or machine learning models. However, since commercial nanopore basecalling algorithms were empirically trained on standard 4-letter DNA (A, T, G, C), they are unable to decode xenonucleobases (B, Sn, Sc, P, Z, Xt, Kn, J, V; FIGs 6OA-6OB). [0117] With this in mind, one can build and measure diverse DNA-XNA libraries that can be used to construct de novo ground-up models for sequencing single xenonucleotides within a natural DNA context. Here, one can take note of the predictive ‘kmer models’ for nanopore sequencing. In these models, the current signal produced by any given DNA sequence is a function of the sequence kmer, which consists of the incident nucleotide in the pore and its surrounding nucleotide context. Sequencing models built with longer kmer sequences will benchmark with a higher overall accuracy. There is, however, a diminishing return in accuracy improvements as kmer size increases, which is balanced against the exponential increase in library complexity and data collection requirements with longer kmers. Balancing performance and complexity, it was decided to measure the signal produced by every 4-nt kmer that contains a single xenonucleobase from the set. The synthetic capabilities of XNA tailing and XNA ligation make it possible to generate libraries containing all 4-nt kmers with a single xenonucleotide (4 positions × 4^3 flanking sequences = 4^4 = 256 kmers per xenonucleotide). To cover the entirety of the 4-nt-long kmer sequence space, a dual-barcoded DNA hairpin library was developed that could be synthesized on a chip (NNN-Pool; FIGs 3A and 6PA-6PC, Tables 3-5, 12, and 13). To establish the ground truth for each read, full factorial NNN coverage at the blunt end was linked to a unique 24-nt barcode (Triplet-barcode), while the identity of the tailed xenonucleotide was linked to a unique 8-nt barcode (Pool-barcode). 
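The kmer count stated above can be checked by direct enumeration. The following is a minimal sketch, not the disclosed library-design code: the function name `single_xeno_kmers` and the tuple representation of kmers are assumptions made for illustration. It places one xenonucleotide at each of the four kmer positions and fills the remaining three positions with standard bases.

```python
from itertools import product

STANDARD = "ATGC"
# Xenonucleotide labels as named in the text; two-character labels such as
# "Sn" are kept as single list items rather than split into characters.
XENO = ["B", "Sn", "Sc", "P", "Z", "Xt", "Kn", "J", "V"]


def single_xeno_kmers(xeno, k=4):
    """Return every k-mer (as a tuple of letters) with exactly one `xeno`."""
    kmers = []
    for pos in range(k):  # position of the xenonucleotide
        for flank in product(STANDARD, repeat=k - 1):  # standard-base context
            kmer = list(flank)
            kmer.insert(pos, xeno)
            kmers.append(tuple(kmer))
    return kmers


per_xeno = {x: single_xeno_kmers(x) for x in XENO}
total = sum(len(kmers) for kmers in per_xeno.values())
print(len(per_xeno["P"]), total)  # 256 2304
```

The per-xenonucleotide count is 4 × 4^3 = 256, and summing over the nine xenonucleotides gives 9 × 256 = 2,304 kmers.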
These barcode sequences are distal to the site of XNA ligation and can therefore be decoded through standard ATGC basecalling. Though ligation biases could make it difficult to acquire reads of certain combinations (NNNNNNN, in which the central position is the modified nucleotide and each flanking N = A, T, G, or C), only a subset of total sequence space would be needed to obtain full coverage of all 4-nt kmers. [0118] Building a 4-nt XNA kmer model. Each NNNNNNN library was sequenced independently for model building, generating between 150k and 800k raw reads per library (Tables 14-15). Signals were then segmented and aligned to each barcoded reference sequence while filtering reads that aligned to possible ligation side products (FIGs 3B, 6J and 6QA-6QI). From these signal-to-sequence alignments, XNA-heptamer
nanopore signals are observed to deviate from the signal expected for a canonical DNA sequence (FIGs 6RA-6RE). After binning signal-to-sequence alignments into their constitutive kmers (Table 16), these differences can be quantified to give a measure of how the presence of a xenonucleotide in a sequence produces subtle, yet measurable deviations in the observed normalized current signal levels <Iz>. [0119] These empirical kmer signal distribution measurements formed the basis for a xenonucleotide kmer model. One can model the probability that a given 4-nt kmer will produce an ionic signal current as a normal distribution (FIG. 3B). Example kmer signal distributions can be generated. Mean signal currents spanning all 2,304 xenonucleotide-containing kmers, µk, are shown in FIG. 3C and comparisons can be made to the most similar standard bases. [0120] Basecalling single xenonucleotide substitutions. Next, one can apply this model to predict signals emitted by sequences that contain a single xenonucleotide (B, Sn, Sc, P, Z, Xt, Kn, J, or V). For any such sequence, the expected signal is found by decomposition of a heptamer sequence into its constitutive kmers, then using measured kmer means to model current transitions (e.g., AGTBCCT → [µ_AGTB, µ_GTBC, µ_TBCC, µ_BCCT]). FIG. 3D shows examples of signal-level predictions generated by an example model (XNA model) overlaid on observations of that library sequence and the most similar standard-bases model (DNA model). [0121] One can integrate the 4-nt kmer model into an end-to-end basecaller for single xenonucleotide substitutions (FIGs 4A, 6S, Tables 17-18, Note 1). For any given set of observed signals, the modeled probability density function can be used to calculate the likelihood that an observed set of signal levels was emitted from a particular sequence. The correct basecall should be the one that has the maximum likelihood of observation. 
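The decomposition and maximum-likelihood steps described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the disclosed basecaller: the function names, the fixed standard deviation of 0.3, and every mean value in `toy_means` are invented for the example, and the sketch assumes one already-segmented signal level per kmer.

```python
import math


def xeno_kmers(heptamer, k=4):
    """Split a 7-letter sequence into the k-mers that overlap its center."""
    center = len(heptamer) // 2
    return [tuple(heptamer[i:i + k]) for i in range(center - k + 1, center + 1)]


def log_likelihood(signal, heptamer, model):
    """Sum Gaussian log-densities of observed levels under (mean, sd) per kmer."""
    total = 0.0
    for level, kmer in zip(signal, xeno_kmers(heptamer)):
        mu, sd = model[kmer]
        total += -0.5 * math.log(2 * math.pi * sd ** 2)
        total -= (level - mu) ** 2 / (2 * sd ** 2)
    return total


def basecall(signal, flank5, flank3, candidates, model):
    """Return the candidate central base with the highest log-likelihood."""
    return max(
        candidates,
        key=lambda base: log_likelihood(signal, flank5 + [base] + flank3, model),
    )


# Hypothetical kmer means (arbitrary normalized units), NOT measured values.
flank5, flank3 = list("AGT"), list("CCT")
toy_means = {"B": [0.5, -0.3, 1.0, 0.2], "A": [-0.5, 0.6, 0.1, -0.8]}
model = {}
for base, means in toy_means.items():
    for kmer, mu in zip(xeno_kmers(flank5 + [base] + flank3), means):
        model[kmer] = (mu, 0.3)  # assumed common standard deviation

observed = [0.4, -0.2, 1.1, 0.3]  # invented signal levels close to the B means
print(["".join(k) for k in xeno_kmers(list("AGTBCCT"))])
print(basecall(observed, flank5, flank3, ["B", "A"], model))
```

Sequences are passed as lists of letters so that two-character labels such as Sn or Xt can occupy a single position. Restricting or widening `candidates` corresponds to comparing a xenonucleotide against one standard base, all standard bases, or the full alphabet.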
The modularity of the 4-nt kmer model allows one to make a diverse set of comparisons between a xenonucleotide and 1) a standard base (e.g., P vs. G), 2) any of the standard bases (e.g., P vs. A, T, G, C), or 3) any of the full supernumerary letters (e.g., P vs. A, T, G, C, B, Sc, Z, Xt, Kn, J, V). [0122] To test the recall of the XNAs, one can use XNA tailing and XNA ligation to enzymatically synthesize a new validation library composed of contextually diverse sequences. In this library, the nucleotide sequences adjacent to the XNA-containing heptamer can be further diversified, making them further removed in sequence space from those used to build the 4-nt kmer models. This validation library can be built
combinatorically using synthetic hairpin pools as starting material. Each set of hairpins can contain 10 unique sequences. To avoid biasing which sequence contexts are chosen for validation, the 20 bp at the 3′-end of each hairpin can be designed by randomly selecting standard bases from a uniform probability distribution. Individual hairpin sets can be tailed with XNA bases using XNA tailing. Two sets of hairpins with complementary tails can be ligated, producing a library of 100 possible sequences (10 x 10), with each sequence containing a single XNA base pair. These ligated hairpin libraries can be pooled together and sequenced for benchmarking (FIGs 4B-4C). [0123] One can perform XNA basecalling model benchmarking by calculating two major performance metrics: recall (true positive rate) and specificity (1 - false discovery rate). Recall and specificity can be calculated per-read, as a per-read consensus, or as a signal-averaged per-sequence consensus. Per-read, the 4-nt kmer model is able to recall between 60-87% of XNA nucleotides correctly when comparing against the respective most similar standard base (Tables 19-21). By consensus basecalling of at least 10 reads (per-read consensus), correct sequence recall for all XNA sequences ranged from 63-99%. In an all-by-all comparison of the validation sequences, the 4-nt kmer model had sufficiently high recall to properly basecall each non-standard base as the per-read consensus (FIG. 4B, S = Sc). To determine specificity, one can test basecalling using the 4-nt kmer model against a standard DNA library (i.e., A, T, G, C only). Of note, specificity can be found to be high, ranging from 80-93% (per-read) and 89-99% (per-read consensus). ROC curves generated for XNA vs most similar standard base comparisons indicate overall high performance of the 4-nt kmer model, with values for area under the curve between 0.8-0.96 (FIG. 4C). 
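As a rough illustration of the benchmarking metrics described above, the sketch below computes per-read recall, a per-read specificity, and a majority-vote consensus. It rests on assumptions: the function names and the toy read calls are invented, the majority vote is one plausible reading of "per-read consensus", and specificity is computed here as the fraction of standard-DNA reads not called as the XNA, which is only one interpretation of the definition given in the text.

```python
from collections import Counter


def recall(calls, xna_label):
    """Fraction of reads from an XNA-containing sequence called as the XNA."""
    return sum(call == xna_label for call in calls) / len(calls)


def specificity(calls_on_standard, xna_label):
    """Fraction of reads from a standard-DNA sequence NOT called as the XNA."""
    not_xna = sum(call != xna_label for call in calls_on_standard)
    return not_xna / len(calls_on_standard)


def consensus(calls, min_reads=10):
    """Majority vote over the reads of one sequence; None if too few reads."""
    if len(calls) < min_reads:
        return None
    return Counter(calls).most_common(1)[0][0]


# Invented per-read basecalls for a P vs. G comparison (not measured data).
reads_from_p_sequence = ["P"] * 8 + ["G"] * 4
reads_from_g_sequence = ["G"] * 9 + ["P"] * 1

print(round(recall(reads_from_p_sequence, "P"), 3))
print(specificity(reads_from_g_sequence, "P"))
print(consensus(reads_from_p_sequence))
```

Consensus calling can rescue a sequence whose per-read recall is modest, which is consistent with the higher consensus ranges reported above.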
Additional recall and specificity benchmarking for the kmer models, including per-read consensus and per-sequence recall/specificity, are summarized in Tables 19-21. [0124] As an example of how these sequencing models can be applied to accelerate XNA research, one can consider that analysis of successful P≡Z amplification can be carried out using low throughput agarose gel electrophoresis assays. Showcasing the leap to the NGS era, in a single multiplex nanopore run, it is shown herein how the PZ kmer models enable simultaneous measurement of PCR amplification efficiency for a P≡Z base pair amplified under various dxNTP and dNTP concentrations (FIGs 6TA-6TC, Tables 22-23). These sequencing results show near complete retention of P≡Z base pair using optimized dxNTP (0.6 mM dPTP; 0.05 mM dZTP) and dNTP (0.1 mM dATP,
dGTP, dTTP; 0.6 mM dCTP) concentration, with increasing loss of P≡Z bases as dxNTPs become limiting. Given the throughput of nanopore flow cells (1-10M reads, MinION flow cell), it is now possible to use nanopore sequencing to screen PCR replication efficiency across hundreds to thousands of conditions (e.g., polymerase mutants, buffer composition, dxNTP/dNTP concentrations) simultaneously. [0125] Synthesis and sequencing of 12-letter DNA. This example has shown that 1) enzyme-assisted synthesis can be used to add a single xenonucleotide base pair and 2) 4-nt kmer models can properly basecall individual xenonucleotides with high recall and specificity. A proof of principle is described that takes the methods developed in this example to their alphabetical limits: synthesizing and sequencing DNA that contains a full 12-letter code: A, T, G, C, B, Sn or Sc, P, Z, Xt, Kn, J, and V (FIG. 5, Table 24). Using synthetic 4-letter DNA as a starting point, the elementary tailing and ligation synthesis steps can be coupled with an additional Golden Gate ligation to generate two proof-of-concept 12-letter supernumerary dsDNA hairpins: Scuper-12 and Snuper-12 (FIGs 6UA-6UB, Tables 7, 12, and 13). In the construction procedure, exonucleases can be added to remove intermediary DNA products, generating the desired 244 bp 12-letter dsDNA product. In this proof-of-concept example, basecalling can be performed two different ways: 1) by comparing the XNA base at a position against a model that contains all 12 possible nucleobases, and 2) by comparing the XNA base at a position against a model that contains the XNA and the most similar standard nucleobase. Even when all 12 letters are present in the model, the presently disclosed basecalling model is able to properly decode XNAs in Scuper-12 with 39-89% per-read recall (FIG. 5, Tables 25, 26). 
In an example experiment, for the Snuper-12 sequence, all but one XNA were properly decoded in the 12-letter model, with the exception being Kn (per-read recall of 14%). When performing most similar standard base comparisons, all XNAs in Snuper-12 were properly recalled (67-93% per-read recall). Given the complexity of possible current signals when 12-letter models are invoked for basecalling, one can expect the chosen 12-letter-containing sequence to have a large influence on recall (Tables 27-30, Note 2). Despite being an example, this foray into 12-letter DNA space represents a milestone, demonstrating that DNA containing 6 orthogonal base pairs can be synthesized and sequenced. [0126] Discussion
[0127] A general strategy is described for incorporating up to four additional orthogonal base pairs into standard DNA, and these methods can be used to build openly accessible models for sequencing XNAs (B, Sn, Sc, P, Z, Xt, Kn, J, V) in a standard DNA context (A, T, G, C) on commercial nanopore devices. The enzymatic synthesis strategy developed utilizes unmodified 2′-deoxy-xenonucleoside triphosphates as the elementary building blocks, avoiding the use of phosphoramidites or caged-triphosphates. To further eliminate barriers to entry, 4-nt kmer sequencing models are benchmarked and it is shown that simultaneous basecalling of 6-letter and 12-letter DNA is possible. This latter development brings XNA base pair sequencing from the “zeroth-generation” of sequencing to the third-generation sequencing era. [0128] Nanopore sequencing of XNAs, as implemented herein, can be performed using a nanopore sequencing device. This significantly expands the accessibility of sequencing XNAs. As history in sequencing progress has shown, additional widespread adoption and collection of XNA nanopore sequencing data can help further catalyze the improvement of sequencing models with newer basecalling algorithms, including data-intensive deep learning models. As these methods improve and adoption widens, strategies for synthesis and sequencing of higher complexity nucleic acids are possible. For example, variations of XNA tailing and XNA ligation that allow one to incorporate multiple consecutive XNA bases, such as the example of MlyI cycling (FIGs 6NA-6NF), create an opportunity for 4-nt kmer models with multiple xenonucleobases present in close proximity. [0129] The generalizability of the disclosed synthesis approaches eliminates many barriers to accessing site-specifically modified DNA for applications in therapeutics, biomaterials, and genetic engineering, all while bringing supernumerary genetics to the third generation of sequencing. 
In the area of genetic code expansion, a single insertion of these additional base pairs allows for various arrangements of up to 448 possible codon-anticodon pairs (made up of 64 canonical codons and 96 additional codon-anticodon pairs for each of the four XNA base pairs, constrained to one XNA per codon or anticodon: 64 + 4 × 96 = 448). In the design and discovery of DNAzymes/aptamers, an additional base pair enables site-specific incorporation of chemically modified groups, including the addition of nucleobases such as Z that can act as a Brønsted base. In the realm of DNA digital information storage, these additional bases markedly increase information density, as the encoding capacity rises from log2(4) = 2 bits per base to log2(12) ≈ 3.58 bits per base. Beyond the 12-letter DNA
alphabet presented in this example, the described enzyme-assisted synthesis strategies and nanopore sequencing pipeline provide low barrier points-of-entry for writing and reading with other modified nucleotides, including epigenomic and epitranscriptomic modifications. The dual synthesis and sequencing capabilities presented herein open a frontier for openly accessible reading and writing DNA with up to 12 letters and aid in discovery and development at the limit of Nature’s rules for hydrogen bonding base pairing. [0130] Methods [0131] Commercial Materials. Agarose (0710-500G; electrophoresis grade) was purchased from Thermo Fisher Scientific (Waltham, MA). Adenosine triphosphate sodium salt (ATP; A6419-5G), acetonitrile (A955-4; LC/MS-grade), formic acid (A118P-500), ammonium acetate (A637-500), ammonium carbonate (207861-25G), Tris base (10708976001), 5 M betaine solution (B0300-1VL), 6 N hydrochloric acid (1430071000), GelGreen (SCT124), and sodium chloride (S3014-5KG) were purchased from Sigma-Aldrich (St. Louis, MO). AMPure XP beads (A63880) were purchased from Beckman Coulter (Brea, CA). Restriction enzymes, T4 DNA ligase, high concentration T4 DNA ligase (M0202M, M0202L), T7 DNA ligase (M0318L), T3 DNA ligase (M0317S), yeast inorganic pyrophosphatase (YiPP; M2403L), thermolabile proteinase K (P8111S), Exo III (M0206L), thermolabile Exo I (M0568L), Exo I (M0293L), Exo VII (M0379L), Exo VIII (truncated; M0545S), Klenow Fragment (exo-; M0212L), Taq polymerase (M0267L), Bsu polymerase (M0330S), Deep Vent (exo-) polymerase (M0259S), Bst polymerase (M0275S), Sulfolobus DNA polymerase IV (M0327S), Therminator polymerase (M0261L), NEBNext® Ultra™ II End Repair/dA-Tailing Module (E7546S), 50 bp DNA ladder (N3236S), NEBNext® Quick Ligation Module (E6056S), and 2′-deoxynucleoside triphosphates (dNTPs; N = A, T, G, C; N0446S) were purchased from New England Biolabs (Ipswich, MA). 
Gene ruler 1 kb plus DNA ladder (SM1331) was purchased from Invitrogen (Carlsbad, CA). Oligonucleotides and oligo pools were purchased from Integrated DNA Technologies (Coralville, IA), resuspended at a stock concentration of 100 µM in elution buffer (10 mM Tris-HCl, pH = 8.2) and stored at either 4 °C for immediate usage or -20 °C for long-term storage. Xenonucleoside triphosphates dScTP, dPTP, dZTP, dBTP (dSTP-401S, dPTP-201, dZTP-101, dBTP-301P) were purchased from FireBird Biomolecular Sciences LLC (Alachua, FL). Xenonucleoside triphosphate dSnTP (M-1015) was purchased from TriLink
BioTechnologies (San Diego, CA). DNA purification kits (ZD4034, ZD7011) were purchased from Zymo Research (Irvine, CA). Flongle Flow Cell R9.4.1 (76521-802) and MinION Flow Cell R9.4.1 (76487-106) were purchased from VWR (Radnor, PA). MinION sequencing device (MIN-101B), Flongle Adapter, Ligation Sequencing Kit (SQK-LSK110), and Flongle Sequencing Expansion kit (EXP-CTL001) were purchased from Oxford Nanopore Technologies (ONT; Oxford, United Kingdom). Unless otherwise specified, other commodity chemicals used in this disclosure were purchased from major domestic distributors (Sigma-Aldrich, St. Louis, MO; Thermo Fisher Scientific, Waltham, MA). [0132] Polymerase-exonuclease coupled assays to measure tailing activity via nucleoside monophosphate release. A hairpin oligo with five consecutive phosphorothioate bonds on the 3′ end was purchased from IDT (HP-3′PT, Table 2). Prior to tailing, 10 µM of HP-3′PT was incubated with rCutSmart™ and 200 units of Exo III at 37 °C for 2 h to digest hairpins, then cleaned using the Zymo ssDNA/RNA Clean & Concentrator Kit and eluted in 15 µL of elution buffer. The eluted oligo was then folded in 100 mM of NaCl and 10 mM Tris-HCl (pH 8.2) buffer by incubating at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C. 15 µL of this refolded oligo was incubated with 0.17 mM dNTP or dxNTP, 300 units of Exo III and either KF (exo-) with rCutSmart™ buffer or Therminator with ThermoPol® buffer for 16 h. For reactions using KF, the reaction was incubated with 15 units of KF at 37 °C. For reactions using Therminator, the reaction was incubated with 6 units of Therminator at 48 °C. Samples were then prepared for UPLC/MS-QTOF using the methods described in “General procedure for high resolution HPLC/MS analysis of polar 2′-deoxynucleotides”. [0133] Assays to measure tailing extent through HPLC/MS analysis of oligonucleotides. A hairpin oligo was purchased from IDT (5′Phos-ScaI-HP, Table 2). 
In these reactions, oligos are first refolded by incubating 40 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C. The refolded oligos are then tailed by incubating 23.8 µM of oligo in the presence of dNTP or dxNTP (1.19 mM or 2.38 mM), YiPP (0.005 U/µL; except for the dATP tailing reaction which did not contain YiPP), polymerase (0.71 U/µL Klenow Fragment (KF exo-), 0.29 U/µL Therminator polymerase, or 0.71 U/µL Taq polymerase), and polymerase buffer (either rCutSmart™ or ThermoPol buffer). Full conditions are tabulated in Table 8. Reactions were either incubated for 8 h at 37 °C (KF exo-); 1, 4, 8, or 16 h at 60 °C (Therminator); or 1 h at 60 °C (Taq). Following incubation, KF exo-
reactions were terminated by heat inactivation at 72 °C for 20 min. Therminator and Taq reactions were terminated by addition of 1X rCutSmart™ buffer and 0.005 U/µL of thermolabile proteinase K at 37 °C for 15 min, followed by subsequent heat inactivation at 72 °C for 20 min. Following either set of heat inactivation steps, hairpins were refolded. Afterward, 19.8 µM of oligo was incubated with 1.8 U/µL of ScaI-HF at 37 °C for 2 h, followed by subsequent heat inactivation at 80 °C for 20 min. Samples were then prepared for UPLC/MS-QTOF using the methods described in “General procedure for high resolution HPLC/MS analysis of oligonucleotides.” [0134] XNA tailing conditions and reaction components. 5′-phosphorylated hairpin oligos with either a 3′-blunt end (5′Phos-11HP) or 3′-single nucleotide overhangs (G: 5′Phos-HP-3′G; or C: 5′Phos-HP-3′C) were purchased from IDT (Table 2). For tailing dNTP and dxNTP nucleotides to 3′-blunt ends, 5′Phos-11HP oligo was used as the substrate. In these reactions, oligos are first refolded by incubating 20 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C. The refolded oligos are then tailed by incubating 11.9 µM of oligo in the presence of dNTP or dxNTP (1.19 mM or 2.38 mM), YiPP (0.005 U/µL; except for the dATP tailing reaction which did not contain YiPP), polymerase (0.71 U/µL Klenow Fragment (KF exo-), 0.29 U/µL Therminator polymerase, or 0.71 U/µL Taq polymerase), and polymerase buffer (either rCutSmart™ or ThermoPol buffer). Reactions were either incubated for 8 h at 37 °C (KF exo-); 1, 4, 8, or 16 h at 60 °C (Therminator); or 1 h at 60 °C (Taq). Following incubation, KF exo- reactions were terminated by heat inactivation at 72 °C for 20 min. Therminator and Taq reactions were terminated by addition of 0.005 U/µL of thermolabile proteinase K at 37 °C for 15 min, followed by subsequent heat inactivation at 72 °C for 20 min. 
Following either set of heat inactivation steps, hairpins were refolded. Resulting hairpins contained a mixture of product (tailed hairpins) and unreacted starting material (3′-blunt end hairpins). T4 DNA ligase was then used to screen reactions for remaining unreacted 3′-blunt ends by adding 80 U/µL of T4 DNA ligase alongside 1X T4 DNA ligase reaction buffer. These T4 ligation reactions were incubated at 16 °C for 2 h, after which T4 ligase was heat inactivated at 65 °C for 10 min. As a positive control for the tailing reaction, a synthetic oligo hairpin with a 3′-G overhang (5′Phos-HP-3′G, Table 2) was used in the T4 ligation reaction. As a negative control for tailing, the starting material (5′Phos-11HP) was used in the T4 ligation reaction. Reaction products were run on a 2% (w/v) agarose gel, stained with GelGreen,
and visualized using a blue light transilluminator. Optimized conditions for tailing each substrate are tabulated in Table 8. [0135] XNA ligation conditions and reaction components. 5′-phosphorylated blunt-ended hairpin oligo (5′Phos-11HP; Table 2) was tailed with either dNTPs or dxNTPs using conditions described in “XNA tailing conditions and reaction components.” Following the tailing reaction, two sets of tailed hairpin oligos with complementary nucleotide overhangs (XNA complementary bases shown in Table 1) were ligated by incubating oligos (2.4 µM of each, except for the B:Sc base pair which was 1.3 µM of each) in a reaction containing a DNA ligase (either 272 U/µL of T3 DNA ligase; 36 U/µL of T4 DNA ligase; 272 U/µL or 750 U/µL of T7 DNA ligase) and 1X NEB StickTogether™ buffer, which contains 7.5% (w/v) PEG 6000, for 16 h at 16 °C. Following incubation, all ligation reactions were heat inactivated at 65 °C for 10 min. The desired product possesses no free 3′-OH end, making it resistant to 3′-exonuclease treatment. Unreacted hairpins or incomplete ligation products were removed by exonuclease treatment performed at 37 °C for 1 h and using a combination of: 7.7 U/µL of Exo III; 1.5 U/µL of thermolabile Exo I or Exo I; 0.4 U/µL or 0.77 U/µL of Exo VIII (truncated). Exonuclease reactions were heat inactivated by incubation at either 80 °C for 20 min (for reactions containing Exo I) or at 70 °C for 20 min (for reactions containing thermolabile Exo I). Reaction products were run on a 2% (w/v) agarose gel, stained with GelGreen, and visualized using a blue light transilluminator. [0136] Consecutive insertion of XNA base pairs using MlyI type IIS restriction enzyme. 5′-phosphorylated hairpin oligos were purchased from IDT (5′Phos-11HP, 5′Phos-15HP, and 5′Phos-ScaI-HP; Table 2). 5′-Phos-15HP contains an MlyI restriction site adjacent to the site of XNA ligation. MlyI is a type IIS restriction enzyme (5′-GAGTCNNNNN↓-3′) that leaves a blunt end after cutting. 
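To make the MlyI cut position concrete, the following sketch locates the recognition sequence in a linear top-strand string and cuts bluntly five nucleotides downstream, as in the 5′-GAGTCNNNNN↓-3′ notation above. It is a simplified illustration under stated assumptions: the function names and the example sequence are hypothetical, only sites written on the given strand are searched (the reverse-complement orientation is ignored), and hairpin structure and xenonucleotides are not modeled.

```python
def mlyi_cut_sites(seq):
    """Return 0-based cut indices for top-strand GAGTC(N)5 sites in `seq`."""
    site, spacer = "GAGTC", 5
    cuts = []
    start = seq.find(site)
    while start != -1:
        cut = start + len(site) + spacer  # blunt cut after the five N's
        if cut <= len(seq):
            cuts.append(cut)
        start = seq.find(site, start + 1)
    return cuts


def digest(seq):
    """Split `seq` at every cut index, returning fragments in order."""
    fragments, previous = [], 0
    for cut in mlyi_cut_sites(seq):
        fragments.append(seq[previous:cut])
        previous = cut
    fragments.append(seq[previous:])
    return fragments


example = "TTGAGTCAAAAACCGG"  # hypothetical sequence, not from Table 2
print(mlyi_cut_sites(example))  # [12]
print(digest(example))  # ['TTGAGTCAAAAA', 'CCGG']
```

Because the cut falls outside the recognition sequence, whatever lies immediately past the five-nucleotide spacer becomes the new blunt terminus.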
5′Phos-15HP (donor hairpin with MlyI site; abbreviated HPD) and 5′Phos-11HP (acceptor hairpin; abbreviated HPA) were tailed with P and Z, respectively, generating HPD-P and HPA-Z. These two hairpins were then ligated and subsequently treated with exonuclease following the optimized conditions described in “XNA ligation conditions and reaction components.” This material was purified using Zymo’s DNA Clean and Concentrator and eluted in 30 µL of elution buffer. The purified construct contains a single P≡Z base pair insertion and was digested using 1.24 U/µL of MlyI and 1X rCutSmart™ buffer at 37 °C for 2 h then heat inactivated at 65 °C for 20 min. MlyI digestion results in a hairpin with a terminal P≡Z,
which also possesses the termini for another tailing reaction (5′-PO4 and 3′-OH blunt end). In the second round of cycling, this hairpin (which already contained a P≡Z base pair) was subjected to Z-tailing generating HPA-ZZ. A third hairpin (5′Phos-ScaI-HP, lacks MlyI site, abbreviated HPP) was P-tailed to generate HPP-P. Hairpins were then ligated, and incomplete ligation products were removed by adding 1 U/µL of MlyI, 7.7 U/µL of Exo III, 1.5 U/µL of thermolabile Exo I, and 0.77 U/µL of Exo VIII (truncated) and incubating at 37 °C for 1 h, followed by a heat inactivation step at 72 °C for 20 min. Products were analyzed on a 2% (w/v) agarose gel, stained with GelGreen, and visualized using a blue light transilluminator. [0137] General procedure for high resolution HPLC/MS analysis of polar 2′-deoxynucleotides. Prior to HPLC/MS analysis, samples were mixed with an equal volume of 4% formic acid in methanol (v/v) and centrifuged at 20,000 × g for 10 min at room temperature. The soluble fraction of the resulting sample containing nucleoside or nucleoside mono, di, or triphosphates was analyzed using an Agilent 1290 Infinity II Bio UPLC on a HILIC-Z column (Poroshell 120 HILIC-Z 2.7 µm, 2.1 mm x 50 mm; Agilent) using the following mobile phases: Buffer A (20 mM ammonium carbonate in water) and Buffer B (100% acetonitrile) at room temperature. A linear gradient from 85% to 20% Buffer B over 7.5 min followed by a linear gradient from 20% to 10% Buffer B over 1 min was applied at a flow rate of 0.3 mL/min. Mass spectra were acquired in positive ionization mode using an Agilent 6530C QTOF (2 GHz) mode with the following source and acquisition parameters: gas temperature 300 °C; drying gas 8 L/min; nebulizer 35 psi; capillary voltage 3,500 V; fragmentor 175 V; skimmer 65 V; oct 1 RF vpp 750 V; acquisition rate 1 spectrum/s; acquisition time 1 s/spectrum. [0138] General procedure for high resolution HPLC/MS analysis of oligonucleotides. 
Prior to HPLC/MS analysis, samples were mixed with 0.85 volumes of 4% formic acid in methanol (v/v) and centrifuged at 20,000 × g for 10 min at room temperature. Soluble fractions of the resulting sample containing oligonucleotides were analyzed using an Agilent 1290 Infinity II Bio UPLC on a HILIC-Z column (Poroshell 120 HILIC-Z 2.7 µm, 2.1 mm x 100 mm; Agilent) using the following mobile phases: Buffer A (15 mM ammonium acetate in 70% water and 30% acetonitrile) and Buffer B (15 mM ammonium acetate in 30% water and 70% acetonitrile) with the column at 30 °C. A linear gradient from 85% to 60% Buffer B over 10 min followed by a linear gradient from 60% to 40% Buffer B over 2 min was applied at a flow rate of 0.4 mL/min. Mass
spectra were acquired in positive ionization mode using an Agilent 6530C QTOF in 4 GHz mode with the following source and acquisition parameters: gas temperature 350 °C; drying gas 13 L/min; nebulizer 35 psi; capillary voltage 4,500 V; fragmentor 180 V; skimmer 65 V; oct 1 RF vpp 750 V; acquisition rate 1 spectrum/s; acquisition time 1 s/spectrum. [0139] NNNNNNN library design. 5′-phosphorylated oligo pools (purchased as oPools™ from Integrated DNA Technologies) were designed to form blunt-end hairpins with two barcodes: a 24 nt Triplet-barcode [NNN-BC] and an 8 nt pool-barcode [Pool-BC] (FIG. 3A, Tables 3-5). The Triplet-barcode is linked to the NNN sequence at the 3′-blunt end of the hairpin, while the pool-barcode is used to decode which dxNTP/dNTP was tailed (Table 12). Each Triplet-barcode maps 1:1 with a corresponding NNN sequence adjacent to an XNA base. Each NNN-oligo pool contained 64 (NNN = 4³ = 64) unique sequences and was synthesized at a scale of 50 pmol/oligo. Ligation reactions for libraries generate combinations with two different pool barcodes. Restriction enzyme cut sites were included upstream of Triplet-barcodes to remove hairpins following ligation reactions and prepare DNA for nanopore sequencing. Full hairpin sequences in each library can be produced based on the present disclosure. [0140] Val-20 validation library design. 5′-phosphorylated oligo pools (purchased as oPools™ from Integrated DNA Technologies) were designed to form blunt-ended hairpins with a variable 20 nt region at the end (Tables 3, 6). The variable 20 nt region was designed computationally by randomization with a uniform prior probability for each base. Candidate sequences were passed through the IDT OligoAnalyzer tool to remove sequences that might form secondary structures that could disrupt hairpin formation. Each validation oligo pool contained 10 unique sequences (six total pools: Val_A-F; Table 6) and was synthesized at a scale of 50 pmol/oligo. 
Two different validation oligo pools can be tailed with a dxNTP. Ligating two pools together (with complementary N+1 tails) results in a library with 100 possible sequences (10 x 10 combinations). Restriction enzyme cut sites were included upstream of these variable regions for nanopore library preparation following ligation. Validation libraries containing different XNA base pairs were prepared in independent reactions and pooled together for sequencing; a full list of these sequences can be produced based on the present disclosure. [0141] 12-letter DNA design. 5′-phosphorylated oligos were designed to form blunt-ended hairpins with a barcode sequence (Table 7). The barcode is linked to a
variable 3 nt sequence at the 3′-end, as well as the dxNTP tailed to the blunt 3′-end. Oligos can be tailed with a dxNTP and ligated to a complementary pair, forming a sequence with a single xenonucleotide base pair insertion. By including BbsI Golden Gate sites, four single insertion constructs could then undergo Golden Gate ligation to form a single dsDNA sequence containing all 12 letters (4 standard nucleotides and 8 xenonucleotides). To remove any intermediary 6-letter, 8-letter, or 10-letter DNA products, unsuccessfully assembled hairpins can be digested by restriction exonucleases. The assembled product contains two different restriction sites for hairpin removal, 5′-GATATC-3′ (EcoRV) and 5′-AGTACT-3′ (ScaI). Asymmetric presence of restriction sites on the hairpins allows removal of a single hairpin and therefore generation of a blunt end on the assembled product. The resulting dsDNA contains a single 3′- and 5′-end. Subsequent library preparation and sequencing of dsDNA results in reads where both sense and antisense strands, containing all 12 nucleobases, can be read in a single sequencing event (Scuper-12 and Snuper-12; FIG. 5, FIGs 6UA-6UB). [0142] NNNNNNN library, validation library, and 12-letter DNA preparation by XNA tailing and XNA ligation. For tailing dxNTP to the 3′-end of oligo pools (NNN-oligo pools, Val-oligo pools, Table 3) or 12-letter DNA oligos (HP12, Table 7), oligos were first refolded by incubating 20 µM of oligo pool in a 100 mM NaCl, 10 mM Tris-HCl (pH 8.2) buffer at 90 °C for 3 minutes, then allowing for cooling at 0.1 °C/s until reaching 20 °C. After refolding, oligos or oligo pools were tailed with a corresponding dxNTP using tailing conditions listed in Table 8. Reactions tailed with KF exo- were heat inactivated, while those tailed with Therminator were inactivated by thermolabile proteinase K treatment. Following inactivation of polymerase, oligos were refolded. 
Tailed oligo or oligo pools with complementary 3′-ends were then ligated with either T4 DNA ligase, T3 DNA ligase, or T7 DNA ligase using ligation conditions listed in Table 10. As a negative control for tailing, the starting material 3′-blunt end oligo or oligo pool (e.g. HP_v1-NNN-P1; Val_A; HP12-A1) was used. All ligation reactions were heat inactivated at 65 °C for 10 min. Following ligation, unreacted hairpins or incomplete ligation products were removed by adding 7.7 U/µL of Exo III (3′→5′ dsDNA exonuclease), 1.5 U/µL of thermolabile Exo I (3′→5′ ssDNA exonuclease), and 0.77 U/µL of Exo VIII (truncated, 5′→3′ dsDNA exonuclease) and incubating at 37 °C for 1 h, followed by a heat inactivation step at 72 °C for 20 min. This combination of
exonucleases was used for rapid undesired product removal, but other exonuclease combinations could also accomplish the same goal. [0143] For NNNNNNN library preparation, ligated NNN-oligo pool reactions were then purified using Zymo DNA Clean and Concentrator and eluted in 30 µL of elution buffer (10 mM Tris-HCl, pH 8.2). Purified NNN-oligo pools were then digested for 1 h at 37 °C using 1 U/µL of BbsI-HF and rCutSmart™ buffer, then purified again using AMPure XP with a 2:1 bead-to-sample ratio and eluted in 30 µL of nuclease-free water. Purified NNNNNNN library samples were then prepared for nanopore sequencing following the details in “Nanopore sample preparation and data acquisition.” [0144] For validation library preparation, ligated validation oligo pool reactions were purified using AMPure XP with a 3:1 bead-to-sample ratio and eluted in 30 µL of elution buffer (10 mM Tris-HCl, pH 8.2), then combined to a final concentration of 0.2 µM/pool before enzymatic digestion for 1 h at 37 °C using 1 U/µL of BbsI-HF and 1X rCutSmart™ buffer. Validation library samples were then prepared for nanopore sequencing following the details in “Nanopore sample preparation and data acquisition.” [0145] For the 12-letter DNA preparation (Scuper-12 and Snuper-12), ligated oligo reactions were first purified using Zymo DNA Clean and Concentrator and eluted in 30 µL of elution buffer (10 mM Tris-HCl, pH 8.2). Each ligated oligo set was then combined at a final equimolar concentration of 0.05 or 0.075 µM/oligo before proceeding to a Golden Gate ligation with the addition of 1 U/µL of BbsI-HF, 20 U/µL of T4 DNA ligase, 1X rCutSmart™ buffer, and 1X T4 DNA Ligase Reaction Buffer (FIG. 6UA). The Golden Gate ligation included 60 cycles of 1) 37 °C for 5 min and 2) 16 °C for 5 min, followed by a final step at 37 °C for 10 min and a heat inactivation step at 65 °C for 20 min. 
Following the Golden Gate ligation, the reaction was further digested to remove incomplete ligation products by the addition of 0.45 U/µL of BbsI-HF, 0.45 U/µL of thermolabile Exo I, 2.27 U/µL of Exo III, and 0.23 U/µL of Exo VIII (truncated), incubating at 37 °C for 1 h, followed by a heat inactivation step at 70 °C for 20 min. This reaction was then purified using AMPure XP with a 1.8:1 bead-to-sample ratio and eluted in 30 µL of nuclease-free water. The hairpin on either end of the complete, desired product was removed by splitting the reaction in half and adding 1X rCutSmart™ and 2.78 U/µL of either ScaI-HF or EcoRV-HF. These reactions were incubated at 37 °C for 1 h, followed by a heat inactivation step at 80 °C for 20 min. The split samples were then
recombined and prepared for nanopore sequencing following the details in “Nanopore sample preparation and data acquisition.” [0146] Nanopore sample preparation and data acquisition. Nanopore sample preparation followed standard Flongle or MinION Genomic DNA by Ligation protocol (available on the ONT community) using the SQK-LSK110 preparation kit with the following modifications. During the DNA repair and end prep step, the NEBNext FFPE Repair Mix was omitted to avoid potential XNA removal by repair enzymes. The volume of the repair mix was replaced by nuclease-free water. To preserve short fragments, AMPure XP bead-to-sample ratio was increased to 2:1 for the NNNNNNN library, and 3:1 for the validation. For the validation library, the first AMPure purification step was omitted to avoid sample loss. Both Flongle and MinION flow cells used in this disclosure were from the R9.4.1 series. Flow cells were used once per sample, without washing, and collected between 0.15 - 1 M reads (Flongle) or 1-10 M reads (MinION). Summary of nanopore sequencing runs is shown in Tables 14, 15. Depending on available pores, data collection was allowed to proceed between 24 - 48 h. The collected raw nanopore reads are then passed to the data preprocessing pipeline for basecalling and signal-to-sequence mapping. [0147] Statistics and Reproducibility. Statistics utilized to analyze this data are described in the following methods sections. Analysis can be reproduced with datasets that are deposited to the SRA (Table 31) and the code developed and utilized in this disclosure (see Code Availability Statement). Read filtering is specified in the text; as a general guideline, all reads with a q-score <9 and signal match score >3 were filtered out in this disclosure. Sample size for analyses that used subsets of data are presented with figure legends. Model building dataset sizes are described in Table 15. 
No statistical method was used to predetermine sample size (all data that passed filter threshold were used in individual analyses). The experiments were not randomized. The investigators of the example were not blinded to allocation during experiments and outcome assessment. [0148] Raw nanopore data preprocessing and signal-to-sequence mapping. Signal-to-sequence mapping uses the Tombo (github.com/nanoporetech/tombo, ONT) pipeline. First, raw multi FAST5 files are split into single FAST5 using the ont-fast5-api (github.com/nanoporetech/ont_fast5_api, ONT) command multi_to_single_fast5. Single FAST5 files are then basecalled using guppy (version 6.1.5+446c355, ONT) with the high accuracy configuration settings (dna_r9.4.1_450bps_hac.cfg). FASTQ basecalls
passing default guppy quality score settings are assigned to their corresponding single FAST5 files using the Tombo command tombo preprocess annotate_raw_with_fastqs. For signal-to-sequence mapping, Tombo uses a reference FASTA file that contains ground-truth sequences. The reference FASTA file was generated programmatically by considering every possible combination of ligation product including mismatch homo-ligation (e.g. P1-A+P1-A, see Table 12), blunt-end ligations leading to a gap (e.g. P1-P2, P1-P1, P2-P2), or pyrophosphorolysis ligation products. Full reference alignment files are deposited in the SRA (Table 31). For sequences containing an XNA, the ground truth XNA (B, Sn, Sc, P, Z, J, V, Xt, Kn) base needs to be substituted for a canonical base (A, T, G, C) for processing in a FASTA format. When processing data for model building, XNAs in reference sequences were substituted for the canonical bases that minimized observed variance in kmer levels, determined empirically (B→A; Sn→A; Sc→A; P→G; Z→C; X→A; K→G; J→C; V→G). Substituted bases are in general agreement with observations from basecalling XNA-containing reads with guppy (FIGs 6OA-6OB and 6QA-6QI). Signal-to-sequence mapping then proceeds using tombo resquiggle. The tombo resquiggle command uses mappy (minimap2 version 2.22-r1101 with ONT configuration) to first assign each single FAST5 read to a reference FASTA sequence based on the given FASTQ basecall. Following sequence assignment, Tombo uses dynamic programming for signal segmentation and proceeds to perform per-read signal normalization. As a general comment on the limitations of segmentation-based basecalling, Tombo is sensitive to the reference canonical base chosen for signal assignment. The per-read, median normalized level signal for each base is then extracted using the Tombo resquiggle results through the Tombo Python API. 
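The canonical-base substitution step described above can be illustrated with a short Python sketch. This is not the Xenomorph implementation: the function name and the list-of-symbols input are hypothetical, and the mapping assumes that "X" and "K" in the substitution list refer to the Xt and Kn bases.

```python
# Canonical stand-ins taken from the substitution list in the text.
# Assumption: "X" and "K" in that list denote the Xt and Kn bases.
XNA_TO_CANONICAL = {
    "B": "A", "Sn": "A", "Sc": "A", "P": "G", "Z": "C",
    "Xt": "A", "Kn": "G", "J": "C", "V": "G",
}
CANONICAL = {"A", "T", "G", "C"}


def substitute_xna(tokens):
    """Return (canonical_sequence, xna_positions) for a tokenized sequence.

    `tokens` is a list of base symbols, e.g. ["A", "T", "P", "G"]. A list is
    used because some XNA symbols (Sn, Sc, Xt, Kn) are two characters long.
    """
    canonical = []
    xna_positions = []
    for index, base in enumerate(tokens):
        if base in CANONICAL:
            canonical.append(base)
        elif base in XNA_TO_CANONICAL:
            # Remember where the XNA was so it can be tested later.
            xna_positions.append((index, base))
            canonical.append(XNA_TO_CANONICAL[base])
        else:
            raise ValueError(f"unknown base symbol: {base!r}")
    return "".join(canonical), xna_positions


def fasta_record(name, tokens):
    """Format one FASTA record using the substituted canonical sequence."""
    sequence, _ = substitute_xna(tokens)
    return f">{name}\n{sequence}\n"
```

Keeping the XNA positions alongside the substituted sequence lets a later step know which reference positions should be evaluated against an XNA model.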
Details regarding how Tombo performs mapping, matching, and normalization, along with the Tombo Python API usage, can be found in the Tombo documentation (nanoporetech.github.io/tombo/). The resulting preprocessed and normalized signal-extracted data is exported to a CSV file for downstream processing (Tables 17, 18). The entire set of data preprocessing steps, including command groups and parameter settings, is wrapped into a single command (xenomorph preprocess) and available on the Xenomorph repository. [0149] XNA kmer model parameterization. NNNNNNN libraries for a given XNA base pair are prepared as previously described in “NNNNNNN library, validation library, and 12-letter DNA preparation by XNA tailing and XNA ligation” and sequenced
on a Flongle (r9.4.1) flow cell. Signal-to-sequence mapping is then performed using the previously described pipeline in “Raw nanopore data preprocessing and signal-to-sequence mapping” with the following specifications. Reads that do not fully map with full coverage of triplet-barcodes and pool-barcodes of the XNA position are filtered out. Likewise, reads with a q-score < 9 and signal match score > 3 are not used in the model building. Signal-to-sequence mapping is also carried out with blunt-end ligation products (i.e. NNNNNN, or no XNA insertion), such that sequences that map better to blunt-end ligation products are not used. Though ligation reactions were designed to minimize blunt-end ligation product formation, this additional filtering helps further reduce blunt-end ligation products. Similarly, pyrophosphorolysis products are also included in the null alignment and reads that map better to these products are removed from analysis. [0150] Kmers of length 4 nt (k = 4) were chosen as the basis for the XNA kmer model. The 4-nt kmer was chosen in this disclosure as a proof of concept since reasonable kmer coverage could be obtained for the full NNNNNNN library (512 kmers per XNA base pair insertion) in a single Flongle flow cell run. Compared to using a larger kmer model (e.g., 5-nt or 6-nt) or machine learning, 4-nt kmer models have orders of magnitude lower data requirements, making this model size both attainable and desirable. Larger kmer models are possible and generally result in higher accuracy. Each kmer consists of four nucleotide bases centered around the 0th position nucleotide, as exemplified in Table 16. Therefore, each heptamer sequence (NNNNNNN) is composed of four 4-nt kmers (i.e. +2 pos NNNN, +1 pos NNNN, 0 pos NNNN, -1 pos NNNN). Observed kmer levels are modeled as normal distributions parameterized with a mean ($\mu_k$) and standard deviation ($\sigma_k$). 
These parameters are used to describe observed kmer signal level probability density functions:

$$P(I_k) = \frac{1}{\sigma_k\sqrt{2\pi}}\, e^{-\frac{(I_k-\mu_k)^2}{2\sigma_k^2}}$$

$P(I_k)$ = probability that the observed level $I_k$ came from kmer 'k'
$\mu_k$ = normalized kmer level mean for kmer 'k'
$\sigma_k$ = standard deviation of median normalized kmer levels for kmer 'k'
$I_k$ = observed median normalized kmer level
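As a minimal sketch of the density above, the following Python function evaluates the logarithm of the normal density for one observed level. The function name is hypothetical; working in log space is a choice made here because log probabilities are summed over kmers in the later hypothesis test.

```python
import math


def kmer_log_density(level, mu, sigma):
    """Log of the normal density P(I_k) for an observed kmer level.

    level: observed median normalized kmer level (I_k)
    mu:    model mean for the kmer (mu_k)
    sigma: model standard deviation for the kmer (sigma_k), must be > 0
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (level - mu) / sigma
    # log(1 / (sigma * sqrt(2*pi))) - z^2 / 2
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
```

For example, `math.exp(kmer_log_density(0.0, 0.0, 1.0))` recovers the peak height of the standard normal density.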
[0151] In choosing appropriate $\mu_k$ estimates, similar performance in alternative hypothesis testing was found using the mean of observed levels, the median of observed levels, or the observed mean levels from a kernel density estimate (KDE). All parameter measurements are provided in Data Table 1 and are available on the Xenomorph repository. [0152] For kernel density estimates, level model means were approximated using the following kmer-specific bandwidth selection:

$$A = 0.9 \cdot \min\left(\frac{\mathrm{IQR}}{1.34},\ \sigma_k\right)$$

$$\mathrm{BW} = A \cdot n_k^{-1/5}$$

$A$ = Silverman's rule of thumb
IQR = interquartile range of kmer levels for kmer 'k'
$\sigma_k$ = standard deviation of median normalized kmer levels for kmer 'k'
$n_k$ = number of observations (measurements) of kmer 'k'
BW = bandwidth used for kernel density estimate

[0153] For practical purposes detailed in the Tombo documentation (github.com/nanoporetech/tombo), one can set a global standard deviation taken as the average observed standard deviation across all kmers in the model (i.e. $\sigma_k = \sigma_g \approx 0.4$ for all k). Generally, it is found that a global model $\sigma_g$ outperforms kmer-specific choices for $\sigma_k$ in the kmer probability density function. In the deployed code for single xenonucleotide detection, the option to use a global $\sigma_g$, a kmer-specific $\sigma_k$, or a manually set $\sigma$ is available to users. [0154] Tabulated kmer model values alongside coverage, mean, min, max, median, and standard deviation of observed levels determined from this disclosure can be found in Data Table 1. These values can be used to test alternative models that could differ in performance based on application or desired metric (e.g. recall vs specificity). Custom models can also be measured and linked. Documentation for model building and code used to generate kmer models can be found in the Xenomorph repository (github.com/xenobiolab/xenomorph). For quality control, the entire experimental and computational procedure, from building libraries to generating 4-nt kmer models, was performed in duplicate. Models were built from data collected in a single run. The
specific nanopore runs used to build models are found in Table 15. Raw FAST5 reads for reproducing model building, testing model building replicates, or experimenting with alternative models can be found in the SRA under Bioproject PRJNA932328. [0155] Alternative hypothesis testing of canonical vs XNA kmer models. Alternative hypothesis testing can be used as the basis for xenonucleotide detection and can be performed at either the per-read or per-sequence level. Though the results shown with the example are from per-read alternative hypothesis testing, both options are available for experimentation with the deployed code. Per-read testing uses the signal observed from a single read, while per-sequence testing averages the signal across all observations that map to the same sequence. For each heptamer sequence (NNNNNNN), a set of mapping kmer sequences (NNNN, NNNN, NNNN, NNNN) and observed signal levels ($I_{NNNN}$, $I_{NNNN}$, $I_{NNNN}$, $I_{NNNN}$) are extracted. See Table 16 for additional information on numbering nomenclature of kmer sequences within a heptamer region. The kmer probability density function, described previously in “XNA kmer model parameterization,” is used to estimate the probability that each observed level (e.g., $I_{ATPG}$) came from the corresponding kmer (e.g. ATPG) or an alternative kmer with a substituted base (e.g. ATGG). Individual log probabilities are added to calculate the log likelihood that the observed signal level observations came from a given sequence (e.g. AATPGCC). Log likelihoods of XNA-containing or canonical-only sequences can then be used for hypothesis testing based on a log likelihood ratio (LLR) or outlier-robust log-likelihood ratio (ORLLR). By default, the basecalling for alternative hypothesis testing uses agnostic maximum likelihood criteria for rejecting the null hypothesis: if LLR or ORLLR > 0, then the XNA base is more likely than the proposed alternative. 
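The kmer extraction and the per-sequence averaging described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, sequences are plain strings with one character per base (so two-character XNA symbols such as Sn would need a tokenized representation), and the order in which windows are returned is not meant to reproduce the position numbering of Table 16.

```python
from collections import defaultdict
from statistics import mean

K = 4  # kmer length used by the model


def heptamer_kmers(heptamer):
    """Return the four overlapping 4-nt windows of a 7-nt sequence.

    Every window covers the central (4th) base, which is where the XNA sits.
    """
    if len(heptamer) != 7:
        raise ValueError("expected a 7-nt sequence")
    return [heptamer[i:i + K] for i in range(len(heptamer) - K + 1)]


def per_sequence_levels(reads):
    """Average observed levels over all reads mapping to the same heptamer.

    reads: iterable of (heptamer, levels), where levels holds one observed
    signal level per kmer window (four values per read).
    Returns {heptamer: [mean level per window]}.
    """
    grouped = defaultdict(list)
    for heptamer, levels in reads:
        grouped[heptamer].append(levels)
    return {
        heptamer: [mean(column) for column in zip(*rows)]
        for heptamer, rows in grouped.items()
    }
```

Per-read testing would use each read's four levels directly, while per-sequence testing would use the averaged levels returned by `per_sequence_levels`.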
All alternative hypothesis testing of XNA models in this disclosure uses ORLLR rather than LLR as the main test statistic. Code for alternative hypothesis testing is available in the Xenomorph repository using the xenomorph morph command and choice of LLR or ORLLR for test statistic can be specified by users. Additionally, the likelihood ratio threshold is an adjustable parameter that can be used to improve case-specific performance. [0156] Log-likelihood ratio (LLR) calculations. LLR statistics can be used to test if observed signal levels better match kmers containing a specified XNA or kmers containing an alternative base. For every kmer in a heptamer sequence the LLR ratio is calculated. The sum of the LLRs over each kmer is then taken as the LLR of the entire
heptamer. LLR > 0 is used as the default criterion for deciding if the XNA model is more likely than an alternative model for a given observed sequence of signals.

$$\mathrm{LLR} = \log\big(P(I_k \mid k_x)\big) - \log\big(P(I_k \mid k_c)\big)$$

$I_k$ = observed signal level for kmer 'k' in heptamer sequence
$P(I_k \mid k_x)$ = probability observed signal level $I_k$ belongs to the kmer containing the specified XNA
$P(I_k \mid k_c)$ = probability observed signal level $I_k$ belongs to the kmer containing the alternative base
LLR = log likelihood ratio

[0157] Outlier robust log-likelihood ratio (ORLLR). ORLLR is a modified LLR test statistic that is nominally more robust towards outliers. The ORLLR scaling parameters were fixed for all analysis and set as the default used by Tombo (Sf = 4; Sf2 = 3; Sp = 0.3). Additional information on usage of ORLLR for alternative hypothesis testing can be found in the Tombo documentation (github.com/nanoporetech/tombo). ORLLR > 0 indicates the specified XNA model is more likely than the alternative DNA model for a given observed sequence of signals. The ORLLR test statistic is computed, following the Tombo implementation, from the scale difference, the global standard deviation, and the scaling parameters defined below:

$$S_{\mathrm{diff},k} = I_k - \frac{\mu_{x,k} + \mu_{c,k}}{2}$$

$I_k$ = observed signal level for kmer 'k' in heptamer sequence
$\mu_{x,k}$ = median normalized kmer level for the XNA-containing kmer
$\mu_{c,k}$ = median normalized kmer level for the alternative kmer
$S_{\mathrm{diff},k}$ = scale difference
$\sigma$ = global standard deviation of median normalized kmer levels
ORLLR = outlier robust log likelihood ratio
$S_f$, $S_{f2}$, $S_p$ = ORLLR scaling parameters
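The plain LLR test of paragraph [0156] can be sketched as below, assuming normal kmer densities with a shared global standard deviation of 0.4 as discussed in the model parameterization. The function names are hypothetical and any model means passed in are placeholders, not values from Data Table 1. The ORLLR variant is not sketched here; its scaling follows the Tombo implementation.

```python
import math


def log_normal(level, mu, sigma):
    """Log of the normal density for one observed kmer level."""
    z = (level - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def heptamer_llr(levels, xna_means, alt_means, sigma=0.4):
    """Sum per-kmer LLRs over the kmers of one heptamer.

    levels:    observed levels, one per kmer window
    xna_means: model means of the XNA-containing kmers
    alt_means: model means of the kmers with the alternative base
    """
    total = 0.0
    for level, mu_x, mu_c in zip(levels, xna_means, alt_means):
        total += log_normal(level, mu_x, sigma) - log_normal(level, mu_c, sigma)
    return total


def call_xna(levels, xna_means, alt_means, sigma=0.4, threshold=0.0):
    """Return True when the XNA model is favored (LLR above the threshold)."""
    return heptamer_llr(levels, xna_means, alt_means, sigma) > threshold
```

The `threshold` argument corresponds to the adjustable likelihood ratio threshold mentioned in the text; the default of 0 reproduces the LLR > 0 criterion.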
[0158] Recall and specificity calculations. Alternative hypothesis testing is used to refine reads and generate a per-read assignment for deciding if a given heptamer sequence contains an XNA (i.e. NNNNNNN) or an alternative base (i.e. NNNYNNN; Y≠N). As metrics to describe how well the 4-nt XNA kmer models perform at identifying XNA bases correctly, two statistics were calculated: recall and specificity. Recall and specificity are calculated either at the per-read level (n = 1) or at the consensus level with a specified minimum number of reads mapping to a heptamer required (e.g. n ≥ 10). Consensus recall and specificity perform sequence-level assignments in calculations (rather than per-read level). Specificity of kmer models was calculated by alternative hypothesis testing on sequences that did not contain any XNAs. The definition of each statistic is provided below.

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

$$\mathrm{specificity} = 1 - \mathrm{FDR} = 1 - \frac{FP}{FP + TN}$$

TP = true positive
FN = false negative
FP = false positive
TN = true negative
FDR = false discovery rate (computed here as FP/(FP + TN), i.e. the false positive rate)

[0159] Receiver operating characteristic. Receiver operating characteristic (ROC) curves were generated using the roc_curve function from the scikit-learn Python library. The log-likelihood ratios obtained from basecall outputs were used in the function as target scores and used to compute the recall (or true positive rate) and false discovery rate (FDR) at different classification thresholds. The area under the curve (AUC) was calculated by the auc function from the scikit-learn Python library. [0160] Proof-of-concept model validation in a new sequence context. Kmer models in this disclosure were built from an NNNNNNN heptamer sequence embedded within a largely fixed sequence context. As a proof of concept that this model can be applied to sequences outside of those found in the NNNNNNN library, one can enzymatically synthesize a smaller validation library for each XNA base pair, each of which contained 100 unique sequences (FIG. 
4, Tables 3, 6). Validation sequences
contained XNA bases flanked by 20 randomly chosen canonical bases. Recall on the validation set was calculated at the per-read and consensus level as described previously in “Recall and specificity calculations.” [0161] PCR amplification and basecalling of P≡Z template DNA. Two complementary oligos containing P and Z (PCR_Template_P, PCR_Template_Z, Table 22) were synthesized by Firebird Biomolecular Sciences (Alachua, FL) and hybridized in a 1:1 molar ratio. 25 ng of this hybridized PZ DNA construct was used as the template for a PCR reaction. PCR reactions contained 0.2 µM of each forward and reverse primer (PCR_Amp_F, PCR_Amp_R1-4, Table 22) and 5 U/µL of Taq polymerase in 1X ThermoPol buffer (pH 8.0). Triphosphate concentrations for dxNTPs and dNTPs varied by condition (no dxNTP, limiting, equimolar, optimal) and are tabulated in FIGs 6TA-6TC. The PCR reaction then proceeded with thermocycler conditions tabulated in Table 23. PCR reactions were purified using Zymo DNA Clean and Concentrator and eluted in 30 µL of nuclease-free water. All PCR products, as well as the template synthetic sequence, were pooled equivalently by mass and prepared for nanopore sequencing following the details in “Nanopore sample preparation and data acquisition.” Reactions were sequenced on a MinION 9.4.1 flow cell as part of a larger multiplex run (1M total reads in run, with a subset belonging to this disclosure). All mapped reads were used for analysis (between 2000-3000 reads mapped to each barcode). PZ basecalling was performed per-read, using methods outlined in this disclosure with the outlier-robust LLR test and P to G and Z to C kmer comparisons. In the absence of dxNTPs, the P≡Z base pair is observed to mutate to a G≡C base pair. [0162] Simulation of genetic codes. Simulated reads were used to test theoretical recall of nanopore-based sequencing for various genetic codes using the 4-nt kmer models described in this disclosure. 
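The sampling step of the read simulation in this section (signal levels drawn for each kmer from a normal distribution centered on the kmer model mean, with a fixed standard deviation of 0.4) can be sketched as follows. This is not the xenosim.py code: the function name is hypothetical, the seeding is added here only for repeatability, and any model means supplied are placeholders rather than measured values.

```python
import random


def simulate_reads(heptamer, kmer_means, n_reads=1000, sigma=0.4, seed=0):
    """Simulate signal levels for one heptamer.

    heptamer:   7-character sequence (one character per base)
    kmer_means: dict mapping each 4-nt kmer to its model mean level
    Returns a list of n_reads simulated reads, each a list of four levels
    (one per overlapping 4-nt window of the heptamer).
    """
    if len(heptamer) != 7:
        raise ValueError("expected a 7-nt sequence")
    kmers = [heptamer[i:i + 4] for i in range(4)]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [
        [rng.gauss(kmer_means[kmer], sigma) for kmer in kmers]
        for _ in range(n_reads)
    ]
```

Each simulated read pairs a sequence with four sampled levels, which is the same shape of input that per-read alternative hypothesis testing consumes.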
To simulate reads, every possible heptamer sequence of NNNNNNN (N = A, T, G, or C; X = A, T, G, C, B, Sn, Sc, P, Z, Xt, Kn, J, V) was generated. Heptamer sequences were then split into their corresponding kmer sequences (i.e. NNNN, NNNN, NNNN, NNNN). For each kmer, 1000 signal levels were simulated by random sampling from the corresponding kmer probability density function using 4-nt kmer model means (µk) and a fixed model standard deviation (σg = 0.4). Simulated kmer signals were then recompiled to their sequences to form simulated sequence-signal pairs. A total of 4,096,000 reads (4,096 x 1,000) were simulated for each substitution base (N) in a given genetic code. Alternative hypothesis testing was performed on simulated reads
using the Xenomorph pipeline and recall was calculated for every NNNNNNN sequence. Code used in this publication to simulate reads is available in the Xenomorph repository under xenosim.py. [0163] The Xenomorph XNA sequencing pipeline. One of the goals of this disclosure was to build a publicly available end-to-end pipeline for validation of XNA incorporation in target sequences. As a proof of concept, one can create a tool in Python called “Xenomorph” comprising a pipeline of two steps: 1) preprocessing - xenomorph preprocess and 2) alternative hypothesis testing - xenomorph morph. For preprocessing using xenomorph preprocess, Xenomorph runs raw FAST5 data through the preprocessing pipeline with an additional FASTA handling modification that allows users to input reference sequences with XNA base pairs. Outputs for preprocessing steps are provided in a .csv file (see Table 17 for header description), which is used as an input for xenomorph morph. For alternative hypothesis testing with the xenomorph morph command, Xenomorph uses the XNA base pairs found in the reference sequence to perform LLR or ORLLR testing against user-defined alternatives. For example, for a sequence containing A, T, G, C, B, Sn base pairs, users can calculate the most likely base at the XNA position against the most similar canonical base (e.g. B vs A), purines/pyrimidines (e.g. B vs A, G), canonical bases (e.g. B vs A, T, G, C), or all bases (e.g. B vs A, T, G, C, Sn). Alternative hypothesis testing can be performed on a per-read basis or a global basis. XNA kmer models generated in this disclosure are built-in and can be viewed using xenomorph models. Model compilation is performed ad hoc, allowing users to experiment with kmer models. Outputs for alternative hypothesis testing are provided as a .csv file (see Table 18 for header description). 
Users can experimentally generate their own kmer models for arbitrary base pairs and integrate them into the Xenomorph tool by linking model .csv files to available model selections in models/config.csv. Since this disclosure considers single-insertions of an XNA base, kmer models are inherently independent (i.e. signal observations of NNNBNNN are independent of NNNSNNN observations) and therefore modular. Xenomorph was built to be flexible, allowing users to add more kmer models or modify them, and straightforward, requiring two commands to go from raw nanopore data to XNA-refined sequences. A graphical overview of the preprocessing pipeline can be found in FIG. 6S. Xenomorph can be found in the Xenomorph repository (github.com/xenobiolab/xenomorph) alongside all code, documentation, and parameters used in this disclosure. Experimental data for model
building and basecalling can be downloaded from the SRA Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328]. Additional overview of how the Xenomorph pipeline performs XNA basecalling is found in Note 1. [0164] Data availability: Models measured in this disclosure used for basecalling are provided in Data Table 1, and can also be found on the Xenomorph github repository (github.com/xenobiolab/xenomorph/tree/main/models). The raw nanopore sequences (FAST5) and guppy basecalls (FASTQ) used in this disclosure to build models, validate models, and test 12-letter DNA sequencing have been deposited in the sequence reads archive (SRA) under Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328] and can be accessed without restriction (Table 31). Raw nanopore data for PZ PCR amplification experiments (FIGs 6TA-6TC) are available under restricted access, as this data was collected in a pooled nanopore run and contains additional data. Full sequences for hairpin libraries purchased for this work can be produced based on this disclosure. Additional source data can be produced based on this disclosure. [0165] Code availability: Code for end-to-end processing of nanopore reads and basecalling xenonucleotides described in this example can be produced based on this disclosure. [0166] Information for Example 1 (Enzymatic Synthesis and Nanopore Sequencing of 12-letter Supernumerary DNA) [0167] Methods [0168] Organic synthesis of dXtTP: 8-(2′-Deoxy-β-D-erythro- pentofuranosyl)imidazo[1,2-a]-s-triazin-2,4-dione 5′-triphosphate. [0169] 8-[2′-Deoxy-3′,5′-di-O-(p-toluoyl)-β-D-erythro-pentofuranosyl]-2-[(2- methylpropionyl)amino]-imidazo[1,2-a]-s-triazin-4-one (3). 5-Aza-7-deazaguanine 2 (8 g, 51.9 mmol) was dissolved in 10% aqueous K2CO3 solution (400 mL) and a solution of Bu4NHSO4 (1.28 g, 3.77 mmol) in CH2Cl2 (280 mL) was added at room temperature. 
After vigorous stirring for more than 1 min, a solution of 2′-deoxy-di-O-(p-toluoyl)-α-D-erythro-pentofuranosyl chloride 1 (21.4 g, 55.0 mmol) in CH2Cl2 (180 mL) was added at room temperature. The reaction mixture was stirred for 1 h at room temperature and the organic layer was separated. The aqueous layer was extracted with CH2Cl2 (500 mL × 2). The combined organic layer was dried (Na2SO4), filtered and concentrated. The residue was dissolved in pyridine (300 mL) and DMAP (1.12 g) and isobutyryl chloride (6 mL)
were added at room temperature. After stirring overnight at room temperature, the reaction mixture was concentrated and the residue was dissolved in hot MeOH (150 mL). This mixture was stored at 0 °C for 3 h and the resulting precipitate was filtered and washed with cold MeOH (50 mL) to give the β-isomer 3 (5.9 g, 10.3 mmol, 19%) as a white solid. 1H NMR (DMSO-d6, 300 MHz) δ 10.39 (s, 1H), 7.90 (d, 2H, J=8.1), 7.81 (d, 2H, J=8.1), 7.70 (d, 1H, J=2.7), 7.60 (d, 1H, J=2.4), 7.34 (d, 2H, J=8.1), 7.26 (d, 2H, J=8.1), 6.36 (t, 1H, J=6.6), 5.74 (m, 1H), 4.50-4.64 (m, 3H), 3.13 (m, 1H), 2.81 (m, 1H), 2.68 (m, 1H), 2.38, 2.34 (2s, 6H), 1.03, 1.01 (2s, 6H).
[0170] 2-Amino-8-(2′-Deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-s-triazin-4-one (4). To a stirred suspension of 3 (5 g, 8.7 mmol) in MeOH (80 mL) was added 40% methylamine solution (5 mL), and the mixture was stirred for 2 days at room temperature. After removal of solvent, ethanol/ether mixture was added to the residue. The resulting precipitate was filtered and dried to give 4 (2.1 g, 7.9 mmol, 90%) as a white solid. 1H NMR (DMSO-d6, 300 MHz) δ 7.42 (d, 1H, J=2.4), 7.34 (d, 1H, J=2.4), 6.93 (brs, 2H), 6.13 (t, 1H, J = 6.6), 5.30 (brs, 1H), 4.98 (m, 1H), 4.29 (m, 1H), 3.78 (m, 1H), 3.50 (m, 2H), 2.32 (m, 1H), 2.12 (m, 1H).
[0171] 8-(2′-Deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-s-triazin-2,4-dione (5). 4 (1.34 g, 5 mmol) was dissolved in acetic acid (38 mL) and a solution of NaNO2 (2 g) in H2O (6 mL) was added at room temperature. The reaction mixture was stirred at room temperature for 2 days and concentrated. The residue was purified by silica gel column chromatography (CH2Cl2:MeOH=8:1 to 4:1) to give dXt nucleoside 5 (900 mg, 3.36 mmol, 67%) as a white solid. 1H NMR (DMSO-d6, 300 MHz) δ 7.45 (d, 1H, J=3.0), 7.41 (d, 1H, J=3.0), 6.05 (t, 1H, J=6.6), 4.28 (m, 1H), 3.79 (m, 1H), 3.46-3.58 (m, 2H), 2.28 (m, 1H), 2.13 (m, 1H).
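The percent yields quoted for compounds 4 and 5 follow from the molar amounts stated in the two preceding procedures. The short Python sketch below is an illustrative check only, not part of the disclosed method: the function names and the `STEPS` table are hypothetical, and the only inputs are the mmol values given above. The combined figure is simply the product of the two fractional yields, since the steps were run at different scales.

```python
def percent_yield(product_mmol: float, limiting_mmol: float) -> float:
    """Isolated yield in percent, computed from molar amounts."""
    if limiting_mmol <= 0:
        raise ValueError("limiting reagent amount must be positive")
    return 100.0 * product_mmol / limiting_mmol


# (product mmol, limiting reagent mmol) as stated in the procedures above.
STEPS = {
    "3 -> 4": (7.9, 8.7),
    "4 -> 5": (3.36, 5.0),
}


def overall_yield(steps) -> float:
    """Combined percent yield of consecutive steps (product of fractions)."""
    total = 1.0
    for product_mmol, limiting_mmol in steps.values():
        total *= product_mmol / limiting_mmol
    return 100.0 * total


if __name__ == "__main__":
    for name, (product, limiting) in STEPS.items():
        print(f"{name}: {percent_yield(product, limiting):.1f}%")
    print(f"3 -> 5 combined: {overall_yield(STEPS):.1f}%")
```

The computed step yields agree with the reported 90% and 67% to within rounding, and the two steps combine to roughly 61%.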
[0172] 8-[2′-Deoxy-5′-O-(4,4′-dimethoxytriphenylmethyl)-β-D-erythro-pentofuranosyl]imidazo[1,2-a]-triazine-2,4-dione (6). To a stirred solution of 5 (469 mg, 1.75 mmol) in pyridine (25 mL) was added DMTCl (652 mg, 1.92 mmol) at room temperature. The reaction mixture was stirred overnight at room temperature and evaporated. The residue was purified by silica gel column chromatography (EtOAc to EtOAc:MeOH=9:1) to give 6 (580 mg, 1.02 mmol, 58%) as a white solid. 1H NMR (DMSO-d6, 300 MHz) δ 11.26 (s, 1H), 7.42 (d, 1H, J=2.4), 7.18-7.35 (m, 10H), 6.83 (m, 4H), 6.09 (t, 1H, J=6.3), 5.37 (d, 1H, J=4.5), 4.28 (m, 1H), 3.90 (m, 1H), 3.72 (s, 6H), 3.12 (m, 2H), 2.39 (m, 1H), 2.18 (m, 1H).
[0173] 8-(3′-O-Acetyl-2′-deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-triazine-2,4-dione (7). To a stirred solution of 6 (580 mg, 1.02 mmol) in pyridine (15 mL) was added Ac2O (144 µL, 1.53 mmol) at room temperature. After being stirred at room temperature overnight, the reaction mixture was evaporated. The residue was treated with 3% dichloroacetic acid in CH2Cl2 (25 mL) for 2 h. Solvents were removed and the residue was purified by silica gel column chromatography (CH2Cl2:MeOH=7:1) to give 7 (250 mg, 0.81 mmol, 79%) as a white solid. 1H-NMR (300 MHz, DMSO-d6): δ 11.23 (s, 1H), 7.45 (d, 1H, J=2.7), 7.42 (d, 1H, J=2.4), 6.04 (m, 1H), 5.23 (m, 2H), 4.01 (m, 1H), 3.56 (m, 2H), 2.52 (m, 1H), 2.31 (m, 1H), 2.04 (s, 3H).
[0174] 8-(2′-Deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-s-triazin-2,4-dione 5′-triphosphate (8). To a solution of 7 (309 mg, 1 mmol) in pyridine (4 mL) and dioxane (3.4 mL) was added a solution of 2-chloro-4H-1,3,2-benzodioxaphosphorin-4-one (260 mg) in dioxane (2.6 mL) at RT. After 20 min a mixture of tributylammonium pyrophosphate in DMF (0.2 M, 10 mL, 2 mmol) and tributylamine (1.2 mL, 4.8 mmol) was added. After 20 min a solution of iodine (360 mg) and water (0.56 mL) in pyridine (28 mL) was added. After 30 min the reaction was quenched by the addition of aqueous Na2SO3 (5%, 1 mL). The solvents were removed in vacuo. The residue was treated with NH4OH (20 mL) for 3 h at room temperature and the mixture was lyophilized. The residue was dissolved in water (50 mL), and the mixture was filtered (0.2 µm). Purification by reverse phase HPLC (Sunfire Prep C18 column, 5 µm, 30 x 250 mm, eluent A = 25 mM TEAA pH 7, eluent B = CH3CN in A, gradient from 0 to 40% B in 20 min, flow rate = 15 mL/min), followed by ion-exchange HPLC (Dionex BioLC DNAPac PA-100, 22 x 250 mm, eluent A = water, eluent B = 1 M aq. NH4HCO3, gradient from 0 to 30% B in 20 min, flow rate = 10 mL/min) gave 8 as a colorless foam after lyophilization.
1H-NMR (D2O, 300 MHz): δ 7.38 (m, 1H), 7.30 (m, 1H), 6.15 (t, 1H, J=6.0), 4.54 (m, 1H), 4.01-4.09 (m, 3H), 2.41 (m, 1H), 2.34 (m, 1H). 31P-NMR (D2O, 120 MHz): δ -10.01 (d, 1P), -10.51 (d, 1P), -22.48 (t, 1P).
[0175] Organic synthesis of dKnTP: 5-(2′-Deoxy-β-D-ribofuranosyl)-2,6-diamino-3-nitropyridine 5′-triphosphate.
[0176] 2-Amino-6-chloro-5-iodo-3-nitropyridine (10). A mixture of 2-amino-6-chloro-3-nitropyridine 9 (5.7 g, 32.8 mmol), water (4.5 mL), concentrated H2SO4 (1.26 mL) and H5IO6 (1.59 g) was stirred for 15 min at 95 °C. Iodine (3.6 g) was added in portions. The reaction mixture was stirred for 1 h at 95 °C, cooled to room temperature,
poured into sat. aqueous sodium thiosulfate solution and extracted with ethyl acetate. The organic layer was dried (Na2SO4), filtered and concentrated. The residue was purified by silica gel column chromatography (Hex:EtOAc=3:2) to give 10 (8.7 g, 29.1 mmol, 88%) as an orange solid. 1H NMR (DMSO-d6, 300 MHz) δ 8.62 (s, 1H), 8.26 (brs, 2H).
[0177] 2-Amino-5-[3′-O-(tert-butyldiphenylsilyl)-β-D-glycero-pentofuran-3′-ulos-1′-yl]-6-chloro-3-nitropyridine (12). A solution of palladium acetate (187 mg, 0.83 mmol) and triphenylarsine (509 mg, 1.66 mmol) in chloroform (30 mL) was stirred for 30 min at room temperature. This solution was added to the mixture of glycal 11 (3.25 g, 9.2 mmol), 10 (2.49 g, 8.3 mmol) and silver carbonate (4.59 g, 16.6 mmol) in chloroform (60 mL) at room temperature. The reaction mixture was refluxed overnight, cooled to room temperature and filtered through a Celite pad. The filtrate was concentrated and the residue was purified by silica gel column chromatography (Hex:EtOAc=4:1 to 7:3) to give compound 12 (2.75 g, 5.23 mmol, 63%) as an orange foam. 1H NMR (CDCl3, 300 MHz) δ 8.42 (s, 1H), 7.73-7.82 (m, 4H), 7.41-7.48 (m, 6H), 5.83 (m, 1H), 7.77 (m, 1H), 4.23 (s, 1H), 3.90 (m, 2H), 1.78 (t, 1H, J=6.0), 1.23 (t, 1H, J=6.9), 1.08 (s, 9H).
[0178] 2-Amino-5-(2′-deoxy-β-D-ribofuranosyl)-6-chloro-3-nitropyridine (14). To a stirred solution of 12 (2.75 g, 5.23 mmol) in THF (60 mL) was added AcOH (1.5 mL), followed by addition of 1 M TBAF in THF (7.9 mL) at 0 °C. After 30 min stirring, the reaction mixture was concentrated to give crude compound 13, which was dissolved in CH3CN/AcOH (46 mL/23 mL). To this mixture was added Na(OAc)3BH (1.66 g, 7.83 mmol) at 0 °C. After 1 h stirring at 0 °C, acetone was added and the reaction mixture was concentrated. The residue was purified by silica gel column chromatography (CH2Cl2:MeOH=15:1) to give 14 (1.21 g, 4.18 mmol, 80%) as a yellow solid.
1H NMR (DMSO-d6, 300 MHz) δ 8.49 (s, 1H), 8.14 (brs, 2H), 5.13 (d, 1H, J=3.9), 5.06 (dd, 1H, J=5.7, 9.9), 4.83 (t, 1H, J=5.4), 4.17 (m, 1H), 3.78 (m, 1H), 3.43-3.52 (m, 2H), 2.16 (dd, 1H, J = 5.7, 12.6), 1.66 (m, 1H).
[0179] 5-(2′-Deoxy-β-D-ribofuranosyl)-2,6-diamino-3-nitropyridine (15). 14 (1.2 g, 4.14 mmol) was dissolved in 7 N NH3 in MeOH (80 mL) and heated overnight at 110 °C. The reaction mixture was cooled and concentrated. The residue was washed with ethanol/ether mixture to give 15 (1 g, 3.7 mmol, 90%) as a yellow solid. 1H NMR (DMSO-d6, 300 MHz) δ 7.96 (s, 1H), 7.25 (brs, 4H), 5.01-5.15 (m, 2H), 4.88 (dd, 1H, J= 6.3, 9.6), 4.20 (m, 1H), 3.74 (m, 1H), 3.47-3.58 (m, 2H), 1.89-1.97 (m, 2H). 13C
NMR (DMSO-d6, 75 MHz) δ 160.6, 155.4, 133.7, 118.2, 112.7, 88.4, 78.1, 72.7, 62.1, 40.9.
[0180] 5-[2′-Deoxy-5′-O-(4,4′-dimethoxytriphenylmethyl)-β-D-ribofuranosyl]-2,6-diamino-3-nitropyridine (16). To a stirred solution of 15 (310 mg, 1.15 mmol) in pyridine (20 mL) was added DMTCl (428 mg, 1.26 mmol) at room temperature. After being stirred at room temperature for 3 h, a catalytic amount of DMAP was added. The reaction mixture was stirred for an additional 1 h and concentrated. The residue was purified by silica gel column chromatography (Hex:EtOAc=1:2 to 1:4) to give 16 (410 mg, 0.72 mmol, 62%) as a yellow foam. 1H-NMR (300 MHz, DMSO-d6): δ 8.07 (s, 1H), 6.79-8.0 (m, 17H), 5.13 (d, 1H, J = 3.9), 4.94 (dd, 1H, J = 9.0, 6.0), 4.11 (m, 1H), 3.85 (m, 1H), 3.71 (s, 6H), 3.08 (d, 2H, J = 3.6), 2.15 (m, 1H), 1.86 (m, 1H).
[0181] 5-[3′-O-Acetyl-2′-deoxy-5′-O-(4,4′-dimethoxytriphenylmethyl)-β-D-ribofuranosyl]-2,6-diamino-3-nitropyridine (17). To a stirred solution of 16 (1.08 g, 1.89 mmol) in pyridine (40 mL) were added Ac2O (0.25 mL, 2.63 mmol) and a catalytic amount of DMAP at room temperature. After being stirred at room temperature for 2 h, the reaction mixture was concentrated and the residue was purified by silica gel column chromatography (Hex:EtOAc=1:2) to give 17 (1.08 g, 1.76 mmol, 93%) as a yellow foam. 1H-NMR (300 MHz, CDCl3): δ 8.09 (s, 1H), 7.21-7.38 (m, 9H), 6.82 (d, 4H, J = 7.8), 5.53 (d, 1H, J = 4.8), 4.93 (dd, 1H, J = 4.8, 11.4), 4.06 (m, 1H), 3.57 (dd, 1H, J = 1.8, 10.5), 3.38 (dd, 1H, J = 2.1, 10.2), 2.63 (m, 1H), 2.14 (m, 1H), 2.10 (s, 3H).
[0182] 5-(3′-O-Acetyl-2′-deoxy-β-D-ribofuranosyl)-2,6-diamino-3-nitropyridine (18). A mixture of 17 (1.08 g, 1.76 mmol) in 3% dichloroacetic acid in CH2Cl2 (40 mL) was stirred at room temperature for 1 h and concentrated. The residue was purified by silica gel column chromatography (EtOAc to EtOAc:MeOH=9:1) to give 18 (512 mg, 1.64 mmol, 93%) as a yellow solid.
1H-NMR (300 MHz, DMSO-d6): δ 8.03 (s, 1H), 7.90, 7.56 (2brs, 2H), 7.23 (brs, 2H), 5.33 (brs, 1H), 5.17 (m, 1H), 4.87 (t, 1H, J = 8.1), 3.93 (m, 1H), 3.65 (m, 1H), 3.51 (dd, 1H, J = 2.7, 11.4), 2.10 (m, 2H), 2.04 (s, 3H).
[0183] 5-(2′-Deoxy-β-D-ribofuranosyl)-2,6-diamino-3-nitropyridine 5′-triphosphate (19). 18 (312 mg, 1 mmol) was dissolved in pyridine (6 mL) and dioxane (5 mL). To this mixture was added a solution of 2-chloro-1,3,2-benzodioxaphosphorin-4-one (300 mg, 1.48 mmol) in dioxane (3 mL) at room temperature. After 20 min of stirring, a mixture of 0.2 M tributylammonium pyrophosphate in DMF (15 mL) and Bu3N (1.6 mL) was added. After an additional 20 min of stirring, a mixture of I2 (360 mg) and water (0.5 mL)
in pyridine (25 mL) was added. After 30 min, the reaction mixture was quenched with 5% sodium sulfite solution. After solvents were removed, the residue was dissolved in water (30 mL) and left to stand at room temperature overnight. Water was removed and 25% NH4OH (50 mL) was added. The mixture was stirred at room temperature for 4 h and concentrated. The residue was dissolved in water (50 mL), and the mixture was filtered (0.2 µm). Purification by reverse phase HPLC (Sunfire Prep C18 column, 5 µm, 30 x 250 mm, eluent A = 25 mM TEAA pH 7, eluent B = CH3CN in A, gradient from 0 to 40% B in 20 min, flow rate = 15 mL/min, Rt = 14 min), followed by ion-exchange HPLC (Dionex BioLC DNAPac PA-100, 22 x 250 mm, eluent A = water, eluent B = 1 M aq. NH4HCO3, gradient from 0 to 30% B in 20 min, flow rate = 10 mL/min, Rt = 15 min) gave compound 19 (180 µmol, 22%) as a yellow foam after lyophilization. 1H-NMR (D2O, 300 MHz): δ 7.96 (s, 1H), 4.86 (dd, 1H, J = 4.8, 11.1), 4.40 (m, 1H), 4.00 (m, 1H), 2.14 (m, 1H), 1.87 (m, 1H); 31P-NMR (D2O, 120 MHz): δ -7.5 (d, 1P), -12.4 (d, 1P), -21.0 (t, 1P).
[0184] Organic synthesis of dJTP: 4-Amino-8-(2′-deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-1,3,5-triazin-2-one 5′-triphosphate.
[0185] 1-(2′-Deoxy-β-D-erythro-pentofuranosyl)-2-nitroimidazole (23). To a stirred suspension of 2-nitroimidazole 20 (2 g, 17.8 mmol) and K2CO3 (8 g, 58 mmol) in CH3CN (800 mL) was added TDA-1 (0.4 mL, 0.84 mmol) at room temperature. This mixture was stirred at RT for 1 h and chloro sugar 21 (8 g, 20.7 mmol) was added at room temperature. After stirring at room temperature for 2 h, the reaction mixture was filtered and the filtrate was evaporated. The residue was purified by silica gel column chromatography (Hex:EtOAc=2:1) to give 22 as a white foam. To a solution of crude 22 in MeOH (150 mL) was added 40% MeNH2 in water (10 mL) at RT. The reaction mixture was stirred overnight at RT and evaporated, then ethyl ether was added to the residue.
The resulting precipitate was filtered to give 23 (3.4 g, 14.8 mmol, 84%) as a white solid. 1H NMR (300 MHz, DMSO-d6): δ 8.00 (s, 1H), 7.18 (s, 1H), 6.52 (t, 1H, J = 5.4 Hz), 5.29 (d, 1H, J = 4.5 Hz), 5.08 (t, 1H, J = 5.1 Hz), 4.23 (m, 1H), 3.83 (m, 1H), 3.55-3.66 (m, 2H), 2.42 (m, 1H), 2.28 (m, 1H). 13C NMR (75 MHz, DMSO-d6): δ 144.8, 128.4, 123.9, 89.2, 88.7, 69.4, 61.0, 42.6.
[0186] 1-[2′-Deoxy-3′,5′-O-di-(tert-butyldimethylsilyl)-β-D-erythro-pentofuranosyl]-2-nitroimidazole (24). To a stirred solution of 23 (3.4 g, 14.8 mmol) in DMF (140 mL) were added imidazole (3 g, 44.1 mmol) and TBDMSCl (6.7 g, 44.5
mmol) at room temperature. The reaction mixture was stirred overnight at room temperature, poured into water (300 mL) and extracted with ethyl ether. The organic layer was dried (Na2SO4), filtered and evaporated. The residue was purified by silica gel column chromatography (Hex:EtOAc=4:1) to give 24 (6.2 g, 13.5 mmol, 91%) as a white solid. 1H NMR (300 MHz, CDCl3): δ 7.91 (s, 1H), 7.01 (s, 1H), 6.62 (dd, 1H, J = 4.2, 6.3 Hz), 4.45 (m, 1H), 3.77-3.97 (m, 3H), 2.59 (m, 1H), 2.19 (m, 1H), 0.91, 0.87 (2s, 18H), 0.11, 0.10, 0.05 (3s, 12H). 13C NMR (75 MHz, CDCl3): δ 128.4, 122.9, 89.1, 88.2, 69.6, 61.7, 43.7, 26.1, 25.9, 18.5, 18.1.
[0187] 2-Amino-1-[2′-Deoxy-3′,5′-O-di-(tert-butyldimethylsilyl)-β-D-erythro-pentofuranosyl]-imidazole (25). A suspension of 24 (3.1 g, 6.8 mmol) and 10% Pd/C (800 mg) in EtOH (80 mL) was degassed. The reaction mixture was stirred overnight at room temperature under H2, filtered through a Celite pad and washed with MeOH. The filtrate was evaporated to give 25 (2.45 g, 5.7 mmol, 85%) as a pale yellow solid, which was used for the cyclization without further purification. 1H NMR (300 MHz, DMSO-d6): δ 6.68 (s, 1H), 6.38 (s, 1H), 5.84 (t, 1H, J = 6.6 Hz), 5.49 (brs, 2H), 4.39 (m, 1H), 3.73 (m, 1H), 3.62 (m, 2H), 2.23 (m, 1H), 2.05 (m, 1H), 0.87, 0.86 (2s, 18H), 0.08, 0.04, 0.03 (3s, 12H). 13C NMR (75 MHz, CDCl3): δ 149.7, 124.8, 111.4, 87.1, 83.6, 72.8, 63.5, 26.5, 26.4, 18.7, 18.5.
[0188] 8-[2′-Deoxy-3′,5′-O-di-(tert-butyldimethylsilyl)-β-D-erythro-pentofuranosyl]imidazo[1,2-a]-1,3,5-triazin-2-one-4-thione (26). A mixture of phenyl chloroformate (0.17 mL) and potassium thiocyanate (150 mg) in EtOAc (5 mL) was stirred for 1 h at room temperature. To this mixture was added a solution of 25 (428 mg, 1 mmol) in 1,4-dioxane (4.5 mL) at room temperature. The reaction mixture was stirred for 4 h at 40 °C and MeOH (0.5 mL) was added.
The mixture was evaporated and the residue was purified by silica gel column chromatography (CH2Cl2:acetone=10:1) to give 26 (165 mg, 0.32 mmol, 32%) as a yellow foam. 1H NMR (300 MHz, CDCl3): δ 9.65 (s, 1H), 7.51 (d, 1H, J = 3.0 Hz), 7.43 (d, 1H, J = 2.7 Hz), 6.28 (t, 1H, J = 5.7 Hz), 4.45 (m, 1H), 3.88-3.96 (m, 2H), 3.75 (m, 1H), 2.35 (m, 1H), 2.21 (m, 1H), 0.93, 0.89 (2s, 18H), 0.12, 0.08 (2s, 12H). 13C NMR (75 MHz, CDCl3): δ 171.1, 153.3, 147.4, 116.4, 110.5, 88.2, 84.8, 71.2, 62.7, 42.0, 26.2, 25.9, 18.6, 18.2.
[0189] 8-[2′-Deoxy-3′,5′-O-di-(tert-butyldimethylsilyl)-β-D-erythro-pentofuranosyl]-4-methylthioimidazo[1,2-a]-1,3,5-triazin-2-one-4-thione (27). Methyl iodide (0.32 mL, 5.1 mmol) was added to a mixture of 26 (870 mg, 1.7 mmol) and
NaHCO3 (214 mg, 2.04 mmol) in 1,4-dioxane (4 mL) and MeOH (8 mL). After being stirred for 30 h at room temperature, the reaction mixture was evaporated and the residue was purified by silica gel column chromatography (CH2Cl2:acetone=7:3) to give 27 (600 mg, 1.14 mmol, 67%) as a white foam. 1H NMR (300 MHz, CDCl3): δ 7.34 (d, 1H, J = 2.7 Hz), 6.87 (d, 1H, J = 2.7 Hz), 6.37 (t, 1H, J = 6.0 Hz), 4.43 (m, 1H), 3.70-3.94 (m, 3H), 2.71 (s, 3H), 2.34 (m, 1H), 2.16 (m, 1H), 0.92, 0.88 (2s, 18H), 0.10, 0.06 (2s, 12H). 13C NMR (75 MHz, CDCl3): δ 160.2, 158.6, 149.1, 116.1, 106.6, 88.0, 84.2, 71.4, 62.8, 42.0, 26.2, 25.9, 18.7, 18.2, 13.3.
[0190] 4-Amino-8-[2′-deoxy-3′,5′-O-di-(tert-butyldimethylsilyl)-β-D-erythro-pentofuranosyl]-imidazo[1,2-a]-1,3,5-triazin-2-one (28). A solution of 27 (600 mg, 1.14 mmol) in methanolic ammonia (7 N, 20 mL) was stirred for 40 h at room temperature, evaporated and the residue was purified by silica gel column chromatography (CH2Cl2:MeOH=9:1) to give 28 (360 mg, 0.73 mmol, 64%). 1H NMR (300 MHz, DMSO-d6): δ 7.50-7.66 (m, 3H), 7.35 (d, 1H, J = 2.4 Hz), 6.05 (t, 1H, J = 6.6 Hz), 4.43 (m, 1H), 3.59-3.77 (m, 3H), 2.40 (m, 1H), 2.14 (m, 1H), 0.86, 0.85 (2s, 18H), 0.08, 0.04 (2s, 12H). 13C NMR (75 MHz, DMSO-d6): δ 155.5, 152.9, 150.7, 115.3, 107.8, 87.4, 83.0, 72.5, 63.2, 39.0, 26.4, 26.3, 18.6, 18.3.
[0191] 8-[2′-Deoxy-3′,5′-O-di-(tert-butyldimethylsilyl)-β-D-erythro-pentofuranosyl]-4-[(2-methylpropionyl)amino]-imidazo[1,2-a]-1,3,5-triazin-2-one (29). To a stirred solution of 28 (800 mg, 1.61 mmol) and DMAP (100 mg) in pyridine (20 mL) was added isobutyryl chloride (0.254 mL, 2.42 mmol) at RT.
After being stirred for 1 h at room temperature, the reaction mixture was evaporated and the residue was purified by silica gel column chromatography (Hex:EtOAc=1:1) to give 29 (780 mg, 1.38 mmol, 85%) as a white foam. 1H NMR (300 MHz, CDCl3): δ 11.7 (brs, 1H), 7.37 (d, 1H, J = 2.7 Hz), 7.28 (d, 1H, J = 2.4 Hz), 6.30 (t, 1H, J = 6.0 Hz), 4.44 (m, 1H), 3.74-3.93 (m, 3H), 2.60 (m, 1H), 2.32 (m, 1H), 2.16 (m, 1H), 1.18, 1.16 (2s, 6H), 0.92, 0.87 (2s, 18H), 0.10, 0.06 (2s, 12H). 13C NMR (75 MHz, CDCl3): δ 192.5, 152.5, 149.0, 147.9, 115.9, 107.9, 88.2, 84.4, 71.4, 62.8, 42.0, 39.8, 26.2, 25.9, 19.4, 18.6, 18.2.
[0192] 8-(2′-Deoxy-β-D-erythro-pentofuranosyl)-4-(2-methylpropionyl)amino-imidazo[1,2-a]-1,3,5-triazin-2-one (30). To a stirred solution of 29 (780 mg, 1.38 mmol) in THF (20 mL) was added a solution of HF (70% in pyridine, 0.95 mL) in pyridine (1.2 mL) at 0 °C. After being stirred overnight at room temperature, the reaction mixture was evaporated and the residue was purified by silica gel column chromatography (CH2Cl2:
MeOH=8:1) to give 30 (390 mg, 1.16 mmol, 84%) as a white solid. 1H NMR (300 MHz, DMSO-d6): δ 11.5 (s, 1H), 7.55 (d, 1H, J = 2.7 Hz), 7.48 (d, 1H, J = 2.7 Hz), 6.07 (t, 1H, J = 6.6 Hz), 5.29 (d, 1H, J = 4.2 Hz), 4.99 (t, 1H, J = 5.4 Hz), 4.28 (m, 1H), 3.79 (m, 1H), 3.48-3.58 (m, 2H), 2.52 (m, 1H), 2.27 (m, 1H), 2.13 (m, 1H), 1.10, 1.08 (2s, 6H). 13C NMR (75 MHz, DMSO-d6): δ 191.6, 152.3, 149.9, 148.0, 117.2, 108.9, 88.4, 83.9, 70.9, 61.9, 41.0, 39.3, 19.7.
[0193] 8-[2′-Deoxy-5′-O-(4,4′-dimethoxytriphenylmethyl)-β-D-erythro-pentofuranosyl]-4-[(2-methylpropionyl)amino]-imidazo[1,2-a]-1,3,5-triazin-2-one (31). To a stirred suspension of 30 (380 mg, 1.13 mmol) and DMT-Cl (460 mg, 1.36 mmol) in CH2Cl2 (15 mL) was added Et3N (0.32 mL, 2.26 mmol) at room temperature. After being stirred overnight at room temperature, the reaction mixture was evaporated and the residue was purified by silica gel column chromatography (EtOAc to EtOAc:MeOH=9:1) to give 31 (620 mg, 0.97 mmol, 86%) as a pale yellow foam. 1H NMR (300 MHz, CDCl3): δ 11.73 (s, 1H), 6.81-7.41 (m, 11H), 6.33 (t, 1H, J = 6.0 Hz), 4.67 (m, 1H), 4.13 (m, 1H), 3.79 (s, 6H), 3.42-3.52 (m, 3H), 2.57-2.68 (m, 2H), 2.43 (m, 1H), 1.19, 1.16 (2s, 6H). 13C NMR (75 MHz, CDCl3): δ 192.6, 158.9, 152.8, 148.9, 147.7, 144.5, 135.6, 135.5, 130.4, 128.5, 128.2, 127.3, 116.3, 113.5, 108.1, 87.1, 86.5, 84.5, 71.6, 55.5, 41.5, 39.8, 19.4.
[0194] 8-[3′-O-Acetyl-2′-deoxy-5′-O-(4,4′-dimethoxytriphenylmethyl)-β-D-erythro-pentofuranosyl]-4-[(2-methylpropionyl)amino]-imidazo[1,2-a]-1,3,5-triazin-2-one (32). To a stirred solution of 31 (420 mg, 0.66 mmol) in pyridine (10 mL) was added Ac2O (0.093 mL, 0.98 mmol) at room temperature. The reaction mixture was stirred overnight at room temperature and evaporated.
The residue was purified by silica gel column chromatography (EtOAc) to give 32 (440 mg, 0.65 mmol, 98%) as a pale yellow foam. 1H NMR (300 MHz, CDCl3): δ 11.72 (brs, 1H), 7.25-7.39 (m, 9H), 7.16 (d, 1H, J = 2.7 Hz), 7.02 (d, 1H, J = 2.7 Hz), 6.82 (d, 4H, J = 8.7 Hz), 6.38 (t, 1H, J = 7.2 Hz), 5.43 (m, 1H), 4.19 (m, 1H), 3.79 (s, 6H), 3.46 (m, 2H), 2.63 (m, 1H), 2.51 (m, 2H), 2.09 (s, 3H), 1.19, 1.17 (2s, 6H). 13C NMR (75 MHz, CDCl3): δ 192.5, 170.6, 158.9, 152.4, 149.6, 147.8, 144.4, 135.4, 135.3, 130.4, 128.4, 128.3, 127.4, 115.5, 113.5, 108.5, 87.3, 84.6, 83.9, 75.1, 63.7, 55.5, 39.9, 38.6, 21.2, 19.4.
[0195] 8-(3′-O-Acetyl-2′-deoxy-β-D-erythro-pentofuranosyl)-4-[(2-methylpropionyl)amino]-imidazo[1,2-a]-1,3,5-triazin-2-one (33). A solution of 32 (440 mg, 0.65 mmol) in 3% trichloroacetic acid in CH2Cl2 (20 mL) was stirred for 2 h at room
temperature and evaporated. The residue was purified by silica gel column chromatography (CH2Cl2:MeOH=10:1) to give 33 (200 mg, 0.53 mmol, 82%) as a white solid. 1H NMR (300 MHz, DMSO-d6): δ 9.19 (s, 1H), 7.79 (d, 1H, J = 3.6 Hz), 7.13 (t, 1H, J = 2.7 Hz), 6.74 (t, 1H, J = 2.7 Hz), 5.97 (m, 1H), 5.22 (m, 1H), 3.96 (m, 1H), 3.53-3.54 (m, 2H), 2.37-2.53 (m, 2H), 2.22 (m, 1H), 2.06 (s, 3H), 0.85 (m, 6H). 13C NMR (75 MHz, DMSO-d6): δ 195.1, 176.2, 170.7, 157.7, 151.1, 114.1, 97.3, 85.5, 83.5, 75.7, 62.2, 36.9, 34.1, 21.6, 19.9.
[0196] 4-Amino-8-(2′-deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-1,3,5-triazin-2-one 5′-triphosphate (34). To a solution of 33 (260 mg, 0.69 mmol) in pyridine (6 mL) and dioxane (4.5 mL) was added a solution of 2-chloro-4H-1,3,2-benzodioxaphosphorin-4-one (214 mg, 1.06 mmol) in dioxane (2.1 mL) at RT. After 15 min a mixture of tributylammonium pyrophosphate in DMF (0.2 M, 10.5 mL, 2.1 mmol) and tributylamine (1.14 mL, 4.8 mmol) was added. After 20 min a solution of iodine (255 mg, 1.0 mmol) and water (0.35 mL) in pyridine (18 mL) was added. After 20 min the reaction was quenched by the addition of aqueous Na2SO3 (5%, 1 mL). The solvents were removed in vacuo. NH4OH (30 mL) was added, and the mixture was stirred overnight at room temperature. After evaporation, the residue was dissolved in water (50 mL) and filtered (0.2 µm). Purification by ion-exchange HPLC (Dionex BioLC DNAPac PA-100, 22 x 250 mm, eluent A = water, eluent B = 1 M aq. NH4HCO3, gradient from 0 to 40% B in 20 min, flow rate = 10 mL/min, Rt = 12 min), followed by reverse phase HPLC (Sunfire Prep C18 column, 5 µm, 19 x 250 mm, eluent A = 25 mM TEAA pH 7, eluent B = CH3CN, gradient from 0 to 20% B in 20 min, flow rate = 10 mL/min, Rt = 10 min) gave 34 as a colorless foam after lyophilization. 1H NMR (300 MHz, D2O): δ 7.44 (m, 1H), 7.34 (m, 1H), 6.19 (t, 1H, J = 6.6 Hz), 4.56 (m, 1H), 4.01-4.07 (m, 3H), 2.38 (m, 1H), 2.32 (m, 1H).
31P NMR (120 MHz, D2O): δ -9.78 (br, 1P), -10.42 (br, 1P), -22.13 (br, 1P).
[0197] Organic synthesis of dVTP: 2-Amino-5-nitro-3-(2′-deoxy-β-D-ribofuranosyl)-1H-pyridin-6-one-5′-triphosphate.
[0198] 2-Amino-5-nitro-3-(2′-deoxy-β-D-ribofuranosyl)-6-[2-(4-nitrophenyl)ethoxy]-pyridine (36). Palladium acetate (0.055 g, 0.243 mmol) and triphenylarsine (0.175 g, 0.486 mmol) were dissolved in chloroform (10 mL), and the mixture was stirred at rt for 30 min. Then it was added to a mixture of compound 35 (1.05 g, 2.43 mmol), glycal (0.86 g, 2.43 mmol) and silver carbonate (1.34 g, 4.86 mmol) in
chloroform (20 mL). The resulting mixture was refluxed overnight. After cooling to rt, it was filtered through Celite and washed with ethyl acetate. The combined filtrate was concentrated in vacuo. The residue was purified by flash chromatography (silica, hexanes:EtOAc=2:1 to 1:1) to give a brown solid. This solid was dissolved in THF (20 mL) and treated with HF-pyridine (0.58 mL) and stirred at rt for 1 h. The mixture was evaporated with silica gel and the residue was purified by flash chromatography (silica, hexanes:EtOAc=1:2 to 100% EtOAc) to give a yellow solid. This material, without further characterization, was dissolved in acetic acid (10 mL) and acetonitrile (10 mL) and treated with sodium triacetoxyborohydride (0.602 g, 2.84 mmol) and stirred at rt for 1 h. The mixture was poured into brine and extracted with EtOAc, and the organic layer was dried over Na2SO4. Solvent was removed under reduced pressure, and the residue was purified by flash chromatography (silica, EtOAc to EtOAc:MeOH=30:1) to give compound 36 as a yellow solid (0.59 g, 59% for 3 steps). 1H NMR (300 MHz, DMSO-d6) δ 8.18 (s, 1H), 8.15 (d, J = 2.1 Hz, 1H), 7.66 (d, J = 8.4 Hz, 2H), 7.49 (s, 2H), 5.16 (t, J = 4.7 Hz, 1H), 5.09 (d, J = 3.9 Hz, 1H), 5.01 (t, J = 8.0 Hz, 1H), 4.58 (t, J = 6.5 Hz, 2H), 4.24 (m, 1H), 3.80 (d, J = 2.4 Hz, 1H), 3.50-3.62 (m, 2H), 3.21 (t, J = 6.5 Hz, 2H), 1.97-1.99 (m, 2H). 13C NMR (75 MHz, DMSO-d6) δ 158.6, 156.6, 146.8, 146.3, 135.0, 130.5, 123.3, 120.9, 112.2, 87.8, 76.7, 72.1, 66.4, 61.4, 34.4.
[0199] 2-{[(dimethylamino)methylidene]amino}-5-nitro-3-(2′-deoxy-β-D-ribofuranosyl)-6-[2-(4-nitrophenyl)ethoxy]-pyridine (37). A mixture of 36 (1.48 g, 3.52 mmol) and N,N-dimethylformamide dimethyl acetal (1.87 mL, 14.08 mmol) in methanol (20 mL) was stirred at rt overnight. The mixture was evaporated and purified by flash chromatography (neutral silica, EtOAc:MeOH=30:1 to 10:1) to give compound 37 as a yellow solid (1.47 g, 88%).
1H NMR (300 MHz, DMSO-d6) δ 8.65 (s, 1H), 8.31 (s, 1H), 8.16 (d, J = 8.7 Hz, 2H), 7.64 (d, J = 8.4 Hz, 2H), 5.25 (dd, J = 9.5, 5.6 Hz, 1H), 5.03 (d, J = 3.6 Hz, 1H), 4.80 (J = 5.3 Hz, 1H), 4.70 (m, 2H), 4.12 (s, 1H), 3.76-3.77 (m, 1H), 3.42-3.55 (m, 2H), 3.21 (s, 5H), 3.10 (s, 3H), 2.31 (dd, J = 12.5, 5.3 Hz, 1H), 1.52-1.62 (m, 1H). 13C NMR (75 MHz, DMSO-d6) δ 159.9, 157.1, 154.7, 147.1, 146.2, 133.1, 130.5, 125.6, 124.0, 123.3, 87.2, 74.3, 72.1, 66.3, 62.3, 42.1, 40.8, 34.8, 34.5.
[0200] 2-{[(dimethylamino)methylidene]amino}-5-nitro-3-[2′-deoxy-5′-O-(4,4′-dimethoxytrityl)-β-D-ribofuranosyl]-6-[2-(4-nitrophenyl)ethoxy]-pyridine (38). To a mixture of 37 (1.15 g, 3.05 mmol), TEA (1.39 mL, 13.7 mmol) and DMAP (37 mg, 0.305 mmol) in CH2Cl2 (40 mL) was added DMTr-Cl (3.02 g, 7.78 mmol) and the mixture was
stirred at rt overnight. This was poured into water and extracted with CH2Cl2 and the organic layer was dried over Na2SO4. Solvent was removed under reduced pressure; the residue was purified by chromatography (neutral silica, EtOAc:hexanes=3:1 to EtOAc) to give compound 38 as a dark-yellow solid (3.16 g, 91%). 1H NMR (300 MHz, CDCl3) δ 8.48 (s, 1H), 8.43 (s, 1H), 8.16 (d, J = 8.4 Hz, 2H), 7.54 (d, J = 8.7 Hz, 2H), 7.45 (d, J = 7.5 Hz, 2H), 7.26-7.36 (m, 6H), 7.17-7.22 (m, 1H), 6.84 (d, J = 8.1 Hz, 4H), 5.41 (dd, J = 9.7, 6.2 Hz, 1H), 4.62 (t, J = 6.0 Hz, 2H), 4.31-4.32 (m, 1H), 4.02 (dd, J = 8.4, 4.8 Hz, 1H), 3.79 (s, 6H), 3.43 (dd, J = 9.6, 4.8 Hz, 1H), 3.21-3.27 (m, 3H), 3.16 (s, 3H), 3.12 (s, 3H), 2.46 (ddd, J = 13.2, 6.0, 2.4 Hz, 1H), 1.78-1.90 (m, 2H). 13C NMR (75 MHz, CDCl3) δ 159.6, 158.5, 156.0, 155.0, 146.8, 146.4, 144.8, 135.9, 135.8, 133.8, 130.2, 130.0, 128.1, 127.8, 127.2, 126.8, 123.8, 123.6, 113.1, 86.3, 85.5, 75.1, 74.7, 66.5, 64.5, 60.4, 55.2, 42.1, 41.1, 35.3, 35.2, 21.0, 14.2.
[0201] 2-{[(dimethylamino)methylidene]amino}-5-nitro-3-[2′-deoxy-5′-O-(4,4′-dimethoxytrityl)-3′-O-acetyl-β-D-ribofuranosyl]-6-[2-(4-nitrophenyl)ethoxy]-pyridine (39). To a solution of 38 (0.3 g, 0.39 mmol), DMAP (4.8 mg, 0.039 mmol) and pyridine (0.125 mL, 1.56 mmol) in CH2Cl2 (5 mL) was added Ac2O (0.074 mL) and the mixture was stirred at rt overnight. The reaction was quenched with brine, extracted with CH2Cl2 and the organic layer was dried over Na2SO4. Solvent was removed under reduced pressure, and the residue was purified by chromatography (neutral silica, hexanes:EtOAc=1:1 to 1:4) to give compound 39 as a yellow solid (0.282 g, 88%).
1H NMR (300 MHz, CDCl3) δ 8.51 (s, 1H), 8.48 (s, 1H), 8.17 (d, J = 8.4 Hz, 2H), 7.55 (d, J = 8.4 Hz, 2H), 7.44 (d, J = 7.5 Hz, 2H), 7.35 (d, J = 8.7 Hz, 4H), 7.30-7.26 (m, 2H), 7.16-7.21 (m, 1H), 6.84 (d, J = 8.4 Hz, 4H), 5.30-5.36 (m, 1H), 5.26 (d, J = 5.4 Hz, 1H), 4.63 (t, J = 6.2 Hz, 2H), 4.17-4.18 (m, 1H), 3.79 (m, 6H), 3.28-3.39 (m, 2H), 3.24 (t, J = 6.2 Hz, 2H), 3.18 (s, 3H), 3.11 (s, 3H), 2.58 (dd, J = 13.8, 5.1 Hz, 1H), 2.07 (s, 3H), 1.85-1.95 (m, 1H). 13C NMR (75 MHz, CDCl3) δ 170.5, 159.6, 158.4, 156.0, 155.1, 146.8, 146.4, 144.8, 135.9, 135.8, 133.9, 130.2, 130.1, 130.0, 128.1, 127.8, 127.3, 126.7, 123.6, 123.1, 113.1, 107.2, 86.2, 83.5, 75.5, 66.5, 64.0, 55.2, 41.2, 40.0, 35.3, 35.0, 21.2.
[0202] 2-{[(dimethylamino)methylidene]amino}-5-nitro-3-(2′-deoxy-3′-O-acetyl-β-D-ribofuranosyl)-6-[2-(4-nitrophenyl)ethoxy]-pyridine (40). To a solution of 39 (0.232 g, 0.28 mmol) in CH2Cl2 was added Cl2CHCOOH (0.23 mL, 2.08 mmol) and the mixture was stirred at rt for 1 h. The reaction was quenched with saturated NaHCO3 solution, extracted with CH2Cl2 and the organic layer was dried over Na2SO4. Solvent was
removed under reduced pressure, and the residue was purified by chromatography (neutral silica, hexanes:EtOAc=1:3 to 100% EtOAc) to give compound 40 as a yellow solid (0.122 g, 85%). 1H NMR (300 MHz, CDCl3) δ 8.50 (s, 1H), 8.39 (s, 1H), 8.16 (d, J = 8.7 Hz, 2H), 7.54 (d, J = 8.4 Hz, 2H), 5.29 (dd, J = 10.4, 5.3 Hz, 1H), 5.22 (d, J = 5.4 Hz, 1H), 4.63 (t, J = 6.0 Hz, 2H), 4.06 (d, J = 2.7 Hz, 1H), 3.85 (m, 2H), 3.20-3.25 (m, 5H), 3.15 (s, 3H), 2.44-2.50 (m, 2H), 2.06-2.16 (m, 4H). 13C NMR (75 MHz, CDCl3) δ 170.9, 160.3, 156.6, 155.3, 146.8, 146.3, 134.9, 130.2, 126.9, 123.6, 121.5, 85.1, 76.9, 66.6, 63.2, 41.4, 39.2, 35.3, 35.3, 21.1.
[0203] 2-Amino-5-nitro-3-(2′-deoxy-β-D-ribofuranosyl)-1H-pyridin-6-one-5′-triphosphate (41). To a solution of 40 (64.2 mg, 0.12 mmol) in pyridine (0.8 mL) and dioxane (2.4 mL) was added a solution of 2-chloro-4H-1,3,2-benzodioxaphosphorin-4-one (48 mg, 0.24 mmol) in dioxane (1.0 mL) at room temperature. After 15 min, a mixture of tributylammonium pyrophosphate in DMF (0.2 M, 2.4 mL, 0.48 mmol) and tributylamine (0.27 mL, 1.1 mmol) was added. After 20 min, a solution of iodine (61 mg, 0.24 mmol) and water (0.095 mL) in pyridine (4.76 mL) was added. After 30 min, the reaction was quenched by the addition of aqueous Na2SO3 (5%, until the color disappeared). The pyridine and dioxane were removed under reduced pressure. The residue was dissolved in acetonitrile (3 mL) and DBU (0.5 mL). The mixture was stirred at room temperature for 4 h. The volatiles were removed under reduced pressure and the residue was dissolved in ammonium hydroxide (10 mL). The mixture was stirred at room temperature overnight. Ammonia was removed by rotary evaporation, and the residue was diluted with water (20 mL). Purification by ion-exchange HPLC (water to water:1 M ammonium bicarbonate = 50:50 in 25 min) gave the triphosphate 41 as a yellow solid after lyophilization (ε = 11800 in H2O, ^^^^^^^^nm, 88.5 µmoles, 74%).
1H NMR (300 MHz, D2O) δ 8.37 (s, 1H), 5.06 (dd, J = 8.1, 5.1 Hz, 1H), 4.58 (d, J = 5.4 Hz, 1H), 4.15-4.20 (m, 3H), 2.31-2.42 (m, 1H), 2.05 (dd, J = 13.8, 5.1 Hz, 1H). 31P NMR (121 MHz, D2O) δ −8.4 (m, 1P), −10.4 (d, J = 18.3 Hz, 1P), −21.5 (m, 1P).
[0204] Notes
[0205] Note 1. Xenonucleotide substitution basecalling with Xenomorph. An end-to-end pipeline for processing raw nanopore reads into xenonucleotide basecalls is available on the Xenomorph GitHub repository (https://github.com/xenobiolab/xenomorph). The repository is also available on Zenodo (https://doi.org/10.5281/zenodo.8356450). The empirically measured 4-nt kmer models for all standard DNA bases (A, T, G, C) and all
xenonucleobases (B, Sn, Sc, P, Z, Xt, Kn, J, V) are integrated for selection. The pipeline, as built, also allows users to generate their own models. Basecalling can be performed either per-read or per-sequence (global). In per-read basecalling, individual reads are basecalled, while in per-sequence basecalling, the signals of all reads that match a sequence are averaged before a global call is determined. The per-read consensus is defined as the most frequent basecall among all reads that match a given sequence.
[0206] 4-nt kmer models are parameterized with a kmer mean (µk) and a kmer variance (σk). Users can set µk to experimentally measured signal means, signal medians, or means from kernel density estimates. Options for σk are either the kmer-specific measured variance or a fixed global variance. The choice of bases to use in the model can also be specified. As described, basecalling in this disclosure uses signal means for µk and the global average kmer variance for σk.
[0207] Full code and documentation for Xenomorph are available on GitHub. Sample data, such as the FAST5 data generated in this disclosure, can be found in the SRA under Bioproject PRJNA932328 (Table 31).
[0208] Note 2. Per-read recall of simulated reads for artificial genetic codes. The statistical limits of per-read recall of the 4-nt XNA/DNA kmer model were explored on simulated reads for a few theoretical, but synthetically accessible, expanded genetic alphabets (Tables 27-30). For each genetic alphabet, 1,000 sets of signal levels were simulated for every possible heptamer sequence (4,096 possible sequences of the form NNNNNNN), and these sequences were then basecalled using Xenomorph. Average per-read recall decreased as the density of signal levels increased, from 88.5 ± 7.5% average recall for the standard genetic code to 65 ± 11% average recall with the 12-letter supernumerary (S = Sn) genetic code.
Even in this most complex case, per-read recall is strongly sequence-specific, with certain sequences showing > 80% recall while others show < 30%. These simulations suggest an upper bound for single nucleotide recall and can be used to guide design and sequencing constraints.
[0209] Results
[0210] Table 1. Full description and abbreviations of XNA nucleobases used in this disclosure. Base abbreviation (Base), base pair (BP), and full chemical name of XNA nucleosides (Nucleoside) used in this disclosure. Additional references for each base and base pair are provided in the (Ref) column.
Base BP Nucleoside Ref
[0211] Table 2. Synthetic hairpin sequences. Names of hairpins and sequences of hairpins used in screening and optimization of non-library XNA tailing and XNA ligation reactions. All sequences shown in 5′ to 3′ direction. SEQ ID
4 HP-3′PT ATCTTGGCTCGCTAAAAGACCACGGGCCTCTT TTTGAGGCCCGTGGTCTTTTAGCGAGCC*A*A* C T A C T T A A T
[0212] Table 3. Synthetic hairpin NNN-oligo pools and hairpin 20mer-validation pools. Two sets of libraries were constructed in this disclosure: 1) a random NNN-Pool library and 2) a validation library. NNN-Pool library was generated by ligating two hairpins together, each containing a randomized NNN-3′ end (library size = 64 x 64). The validation library was constructed by ligating hairpin pools that consist of a constant
region followed by a randomly chosen 20mer sequence. Each hairpin pool contains 10 unique sequences. Ligating two hairpin pools together generates a final library of 100 possible sequence combinations (10 x 10). The table shows constant regions for all oligos in each pool (black), with regions in brackets (blue, bold) being replaced with their corresponding sequence elements from Tables 4-6. ‘-F’ and ‘-R’ are used to note forward and reverse sequences of different components after the hairpin is folded. NNN denotes the 3 randomized bases at the end of the hairpins, while [NNN-BC] (i.e., Triplet-barcode) and [Pool-BC] (i.e., Pool-barcode) are the barcodes that link to the 3′-NNN randomized bases and the tailed XNA, respectively. Regions highlighted in red denote restriction site sequence difference between HP_v1 and HP_v2, HP1 and HP2. All sequences are shown in the 5′ to 3′ direction. Full hairpin sequences purchased for this disclosure can be produced based on this disclosure. SEQ ID NO NNN Pool Sequence Construction n
15 HP2- /5Phos/[VAL- 10 VAL ID R ATAGAAGTCTTCTAGCTCTTTTTGAGCTAG
[0213] Table 4. Library pool-barcode sequences. Sequences of pool barcodes used for construction of NNN-libraries. Pool barcodes were used to identify which XNA was tailed onto the 3′-end of each hairpin pool from sequencing results (shown in Table 12). Pool barcode sequences are used to construct the HP_v1-NNN-[Pool-ID] and HP_v2-NNN-[Pool-ID] hairpin sequences shown in Table 3 by insertion into the [Pool-BC] region. Sequences shown in 5′ to 3′ direction. [Pool-ID] [Pool-BC] Pool Name
[0214] Table 5. Triplet-barcode sequences. Sequences of the Triplet-barcodes and the NNN sequences they are assigned to. The Triplet-barcode is a 24 nt sequence that is distal to the 3′-NNN end in each hairpin and is used to assign the true identity of the 3′-NNN bases that flank XNA insertions (Fig. 3a). Each NNN combination (N = A, T, G, or C; 64 NNN combinations) has a corresponding Triplet-barcode that maps to it 1:1. Barcode sequences were chosen from the Oxford Nanopore Technologies list of barcodes for long-read sequencing. Barcode sequences are shown in 5′ to 3′ direction. The Triplet-barcode (abbreviated as [NNN-BC]) and NNN sequences are used to construct the HP_v1-NNN-[Pool-ID] and HP_v2-NNN-[Pool-ID] hairpin sequences shown in Table 3 by insertion into the [NNN-BC] and [NNN] regions, respectively. Full sequences of all hairpins used for model generation can be produced based on this disclosure.
29 NB AACGAGTCTCTTGGG AT 61 NB GCTGTGTTCCACTTC CT 14 ACCCATAGA G 46 ATTCTCCTG G
45 NB TCAGTGAGGATCTAC GT 77 NB CACCCACACTTACTT TT 30 TTCGACCCA G 62 CAGGACGTA G
[0215] Table 6. 20-mer validation library sequences. The randomly chosen 20mer sequences contained within each validation library pool are listed. Each validation pool contained a mixture of 10 hairpin sequences (numbered 1 to 10) with 20 randomized base pairs at the end of the hairpin. Two pools of validation hairpins (each containing 10 unique sequences) can be ligated together to generate 100 (10 x 10) random sequence combinations. Validation pool sequences were randomly generated and intended to provide a sequence diversity (+/- 20 nt surrounding an XNA nt) much greater than what is present in the model training NNN-pools. The smaller library size (100 sequences per ligated pool) and richer sequence diversity made it possible to multiplex all the validation sets while still obtaining sufficient coverage for calculating appropriate statistics. Validation pool sequences are a subset of HP1-[VAL-ID] and HP2-[VAL-ID] hairpin sequences shown in Table 3. Sequences are shown in 5′ to 3′ direction. Full sequences of hairpins ordered, alongside ligation products generated, can be produced based on this disclosure. SE SE A A
83 VAL_A TCCTCCTTTTCGACTG 113 VAL_D TCTGTGCGATTACAA 4 ACAT 4 ACGCT G C G A G A T T
99 VAL_B AAATTGTATGCATTT 129 VAL_E TTAGTCGGAAGTATC 10 GACCC 10 TGTAC A C G A
[0216] Table 7. 12-letter DNA construction sequences. Sequences of hairpins used to build 12-letter DNA. Two tailed hairpins can be ligated together to generate a sequence with a single xenonucleotide insertion. Four single insertion constructs can undergo Golden Gate ligation to form a DNA sequence containing all 12 letters. Table shows barcodes for each oligo that links to the variable 3 nt sequence on the 3′-end and the xenonucleotide tailed on the 3′-end (bold), as well as restriction site sequences (red, bold). Sequences are shown in 5′ to 3′ direction.
1 H 5 P1 /5Ph /GAGTCAGCGTGGGAATGAATCCTTTGATGGGTCTTCCAGCT T
[0217] Table 8. Optimized reaction components and conditions used for XNA tailing. Polymerase choice, polymerase amount, reaction time, and reaction temperature for each dxNTP and dNTP tailed to blunt-ended hairpin oligos or oligo pools (Tables 2, 3). All reactions excluding N = A also contain 0.005 U/µL of YiPP. N/X Polymerase [Pol] U/µL [dNTP] mM Time Temp
[0218] Table 9. XNA tailing extent of reaction. Tailing extent reaction estimate was calculated as the percent relative intensity of the product band, relative to intensities
in each lane for blunt-ligation (upper band) and product (lower band). Tailing reactions were optimized to eliminate residual starting material (blunt-end DNA), which would be a major source of non-specific ligation in subsequent reactions. Band intensity was estimated using ImageJ with background intensity subtraction. Estimates presented are for tailing reactions shown in FIG. 2 using optimized reaction conditions (see FIGs 6MA-6MC for the full gel). Note that these estimates present an upper limit of tailing success, and more conservative estimates (< 5% remaining) should be used if quantifying by gel. Base Blunt-end remaining
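As a non-limiting illustration of the calculation described for Table 9, the tailing extent can be sketched in Python as the product band intensity divided by the summed intensities of the blunt-ligation and product bands in a lane. The function names and example intensities are hypothetical and are not taken from this disclosure; band intensities are assumed to be background-subtracted already (e.g., in ImageJ).

```python
def tailing_extent(product: float, blunt_ligation: float) -> float:
    """Percent relative intensity of the product band within one lane.

    Both arguments are background-subtracted band intensities
    (arbitrary units); the names are illustrative assumptions.
    """
    total = product + blunt_ligation
    if total <= 0:
        raise ValueError("summed band intensity must be positive")
    return 100.0 * product / total


def blunt_end_remaining(product: float, blunt_ligation: float) -> float:
    """Percent of signal attributed to untailed (blunt-end) material."""
    return 100.0 - tailing_extent(product, blunt_ligation)
```

For example, hypothetical intensities of 95 (product) and 5 (blunt-ligation) correspond to a 95% tailing extent and 5% blunt-end remaining.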
[0219] Table 10. Optimized reaction components and conditions used for XNA ligation. DNA ligase choice and ligase amount for XNA ligation reactions between complementary single nucleotide 3′-overhang of XNA base pairs. All ligation reactions were carried out at 16 °C for 16 h. Base pair Ligase [Ligase]
XtKn T3 272 U/µL
[0220] Table 11. XNA ligation yield. Ligation yield was calculated as the relative intensity of the product band, setting negative control as 0% yield and starting material band intensity as 100% yield. Ligation yield for individual sequences is known to have a sequence context dependence. Estimates presented are for ligation reactions shown in FIG. 2 using optimized reaction conditions (see FIGs 6MA-6MC for the full gel). Base pair Ligation yield estimate
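As a non-limiting illustration, the normalization described for Table 11 can be sketched in Python as a linear rescaling between the negative-control intensity (0% yield) and the starting-material band intensity (100% yield). The linear form, the function name, and the example values are assumptions made for illustration; this disclosure does not state the exact formula used.

```python
def ligation_yield(product: float,
                   negative_control: float,
                   starting_material: float) -> float:
    """Percent ligation yield from band intensities (arbitrary units).

    Assumed linear scale: negative_control maps to 0% and
    starting_material maps to 100%.
    """
    span = starting_material - negative_control
    if span <= 0:
        raise ValueError("starting material must exceed the negative control")
    return 100.0 * (product - negative_control) / span
```

With hypothetical intensities of 60 (product), 10 (negative control), and 110 (starting material), the sketch returns a 50% yield.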
[0221] Table 12. Constructs generated through XNA tailing. Construction of single XNA nt tailed NNN-library hairpins, validation library hairpins, and 12-letter DNA hairpins by XNA tailing of different oligos and oligo pools (Tables 2, 3, 7). HP Substrate Name(s) dxNTP Product Name(s) n c t Kn
HP_v2-NNN-P4 V HP_v2-NNN-P4-V
[0222] Table 13. Constructs generated through XNA ligation. Assembly of single XNA-bp constructs from two oligos or oligo pools containing a single nucleotide overhang (Table 12). NNNNNNN libraries were built to obtain full coverage over an NNNNNNN heptamer region for building kmer models. Val-20 validation library was built to sample randomized 20-mer regions for testing kmer models. 12-letter DNA was built such that all 12 bases could be sequenced in a single read. Library size calculation includes consideration of both sense and antisense sequences as independent reads. N/A = not applicable. Library size
HP_v1-NNN-P3- HP_v2-NNN-P4- 4096 x 2 Model B Sn HPNNNBSn bildi et et et et et et et et et
HP12-A1-B HP12-A2-Sc HP12-BSc-A N/A 12-letter DNA n n A A A A A A A
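As a non-limiting illustration of the library size calculation described for Table 13, the following Python sketch enumerates the 64 randomized NNN 3′-ends and computes the size of a library formed by ligating two pools, optionally counting sense and antisense sequences as independent reads. The function names are hypothetical and are not part of this disclosure.

```python
from itertools import product

STANDARD_BASES = "ATGC"


def nnn_triplets() -> list:
    """All 64 possible NNN triplets over the standard bases."""
    return ["".join(p) for p in product(STANDARD_BASES, repeat=3)]


def ligated_library_size(pool_a: int, pool_b: int,
                         count_both_strands: bool = True) -> int:
    """Number of distinct products from ligating two pools.

    If count_both_strands is True, sense and antisense sequences are
    counted as independent reads, doubling the total.
    """
    combinations = pool_a * pool_b
    return 2 * combinations if count_both_strands else combinations
```

Under these assumptions, two NNN pools give 64 x 64 = 4096 combinations (4096 x 2 when both strands are counted), and two validation pools of 10 sequences each give 100 combinations.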
[0223] Table 14. Nanopore run overview. Run IDs, contents of run, flow cells, and flow cell chemistry used to generate training and validation datasets. Run ID Run contents Flow cell
NNN-JV-NNN- HP-NNN-JV-2 Flongle 2 R941
[0224] Table 15. Nanopore run read summary for model building. Summary of total reads obtained from each run for model building, listed by Run ID. In the xenomorph processing pipeline, reads are first basecalled by guppy then aligned to a reference using minimap2. Reads that align to the reference are further filtered by reads that fully align to barcoded heteroligation products and subsequently analyzed for kmer model building and kmer model validation. Run ID Model built Total reads Reads pass filter
[0225] Table 16. Heptamer, kmer, and level structure used in kmer model. Heptamer sequence contains every combination of canonical nucleotide (N = A, T, G, C)
and any XNA base (N = B, Sn, Sc, P, Z, J, V, Xt, Kn for XNA or N = A, T, G, C for canonical). Kmers of k = 4 are extracted from heptamer sequences and mapped to the signal level using a (-1, 0, +1, +2) sequence-to-level mask. The kmer sequences are assigned signal levels matching the 0th position nucleotide. Red letters highlight the sliding window within the heptamer sequence that corresponds to the kmer sequence. For a given heptamer sequence S, corresponding kmers are numbered k1 to k4 from left to right. Heptamer(S) Kmer seq level kmer i
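As a non-limiting illustration of the heptamer-to-kmer mapping described for Table 16, the following Python sketch slides a 4-nt window across a heptamer and assigns each kmer to the position of its 0th-offset base under the (-1, 0, +1, +2) mask. The function name, the use of 0-based heptamer positions, and the use of single-letter symbols for each base (as in the kmers of Table 32) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Offsets of the four kmer bases relative to the base whose signal
# level the kmer is assigned to (the sequence-to-level mask).
LEVEL_MASK = (-1, 0, 1, 2)


def heptamer_to_kmers(heptamer: Sequence[str]) -> List[Tuple[str, str, int]]:
    """Return (kmer name, kmer sequence, level position) for k1..k4.

    `heptamer` holds exactly 7 single-letter base symbols; the level
    position is the 0-based index of the base matched to the signal level.
    """
    if len(heptamer) != 7:
        raise ValueError("expected exactly 7 base symbols")
    kmers = []
    # Only positions 1..4 keep the whole mask inside the heptamer.
    for n, level_pos in enumerate(range(1, 5), start=1):
        kmer = "".join(heptamer[level_pos + off] for off in LEVEL_MASK)
        kmers.append((f"k{n}", kmer, level_pos))
    return kmers
```

For a hypothetical heptamer ACGBTCA with B at the center, the sketch yields k1 = ACGB, k2 = CGBT, k3 = GBTC, and k4 = BTCA, so the central xenonucleotide appears once at each position of the 4-nt window.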
[0226] Table 17. Column header and description of xenomorph preprocess output file. Col Header Description Format
Position of XNA base in reference 13 d ii I
[0227] Table 18. Column header and description of xenomorph morph output file. Col Header Description Format
[0228] Table 19. Recall benchmarking for 4-nt kmer XNA models. Tabulation of estimated recall for 4-nt XNA kmer models (mean model) tested against the Val-20 set (n = 5,000 reads for each base). recall e
J 0.60 0.63 0.74 0.55 0.59 0.73
[0229] Table 20. Specificity benchmarking for 4-nt kmer XNA models. Tabulation of specificity (1-FDR) for the 4-nt XNA kmer models tested against a DNA library without any XNAs incorporated (NNN-blunt; n = 50,000 reads for each base). Per-read (Per-Read), per-read consensus (Consensus), and per-sequence consensus (Per- sequence) values calculated from sequences with at least 10 mapped reads. specificity (1 – FDR) e
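As a non-limiting illustration of the per-read, per-read consensus, and per-sequence values referenced for Table 20, the following Python sketch basecalls toy signal levels against a small kmer-mean model. It assumes a fixed global variance, under which the most likely base is the one whose expected levels are closest in squared distance; the model layout and all names are hypothetical and do not reproduce the Xenomorph implementation.

```python
from collections import Counter
from statistics import fmean
from typing import Dict, List, Sequence


def call_levels(levels: Sequence[float],
                model: Dict[str, Sequence[float]]) -> str:
    """Pick the base whose expected levels are nearest to `levels`."""
    def squared_distance(base: str) -> float:
        return sum((x - mu) ** 2 for x, mu in zip(levels, model[base]))
    return min(model, key=squared_distance)


def per_read_calls(reads: List[Sequence[float]],
                   model: Dict[str, Sequence[float]]) -> List[str]:
    """Basecall every read independently."""
    return [call_levels(read, model) for read in reads]


def per_read_consensus(calls: Sequence[str]) -> str:
    """Most frequent per-read basecall (ties go to the first seen)."""
    return Counter(calls).most_common(1)[0][0]


def per_sequence_call(reads: List[Sequence[float]],
                      model: Dict[str, Sequence[float]]) -> str:
    """Average the signal of all reads level-by-level, then call once."""
    mean_levels = [fmean(column) for column in zip(*reads)]
    return call_levels(mean_levels, model)
```

The per-read consensus votes over individual calls, whereas the per-sequence call averages the signal first, which is why the two values can differ for the same set of reads.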
[0230] Table 21. Recall benchmarking for 4-nt and 12-letter kmer XNA model. Tabulation of estimated recall for a 4-nt XNA kmer model tested against Val-20 sets using a full 12-letter (ATGCBScPZXKJV) kmer model for alternative hypothesis testing, from the confusion matrix shown in FIG. 4. In ‘N vs §’ comparisons, ‘§’ denotes the most similar standard base as determined by guppy. Mean signal levels and outlier-robust log-likelihood ratios were used for the base classification. Box highlights the base chosen by picking the most likely nucleobase among any purine or pyrimidine.
Per-read recall of Val-20 XNA in sequence
2 7 8 0 2 4 1 2 2 6 6 1
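As a non-limiting illustration of the ‘N vs §’ comparison referenced for Table 21, the following Python sketch computes a log-likelihood ratio between a xenonucleotide hypothesis and a standard-base hypothesis under a Gaussian kmer model. This disclosure does not specify how outlier robustness is achieved; clipping each per-level contribution to a fixed cap is an assumption made only for this sketch, as are the function names, the treatment of `sigma` as a standard deviation, and the default cap and threshold.

```python
import math
from typing import Sequence


def gaussian_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution with mean mu and s.d. sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) \
        - (x - mu) ** 2 / (2.0 * sigma ** 2)


def robust_llr(levels: Sequence[float],
               mu_xna: Sequence[float],
               mu_std: Sequence[float],
               sigma: float,
               cap: float = 5.0) -> float:
    """Sum of per-level log-likelihood ratios (XNA vs standard base).

    Each per-level term is clipped to [-cap, +cap] so that a single
    outlier level cannot dominate the total (assumed robustness scheme).
    """
    total = 0.0
    for x, mx, ms in zip(levels, mu_xna, mu_std):
        term = gaussian_logpdf(x, mx, sigma) - gaussian_logpdf(x, ms, sigma)
        total += max(-cap, min(cap, term))
    return total


def classify(levels: Sequence[float],
             mu_xna: Sequence[float],
             mu_std: Sequence[float],
             sigma: float,
             threshold: float = 0.0) -> str:
    """Return 'XNA' when the clipped ratio exceeds the threshold."""
    llr = robust_llr(levels, mu_xna, mu_std, sigma)
    return "XNA" if llr > threshold else "standard"
```

A positive ratio favors the xenonucleotide and a negative ratio favors the most similar standard base; the threshold could be shifted to trade recall against specificity.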
[0231] Table 22. Template sequences and primer sequences used for PCR of P≡Z base pair. Synthetic oligo template sequence with a P≡Z base pair (red, bold), received from Firebird Biosciences (Alachua, Fl). Oligo template sequences are hybridized prior to use as a PCR template. Primer sequences are used to amplify the template: each condition used a different barcoded reverse primer (PCR_Amp_R1: Equimolar; PCR_Amp_R2: Optimal; PCR_Amp_R3: No dxNTP; PCR_Amp_R4: Limiting). All conditions used the same forward primer (PCR_Amp_F). Sequences are shown in 5′ to 3′ direction. S
15 PCR_Tem GGTCTGGTGCCACTGGTAACTGGGACAGCTGAAGTPCAGTCA 6 l P GCCAGGGAAACACGATAGGCAACCACACC G A C T C
[0232] Table 23. Thermocycler settings used for PCR of P≡Z base pair. Thermocycler conditions used to amplify P≡Z template. 25 total cycles were performed.
[0233] Table 24. 12-letter DNA Sequence (Scuper-12 and Snuper-12). Sequences of 12-letter DNA as prepared for nanopore sequencing. XNA positions are bolded in orange. Two sets of sequences were constructed, with set A containing Sc and set B
containing Sn. After Golden Gate ligation, both sets were digested using restriction enzymes at either restriction site 5′-GATATC-3′ or 5′-AGTACT-3′ to remove one of the hairpin ends. By removing one hairpin, a singular blunt end is generated for nanopore DNA preparation; this allows for subsequent sequencing of both sense and antisense strands in a single nanopore sequencing event. Since 4-nt kmer models were built from dsDNA data, basecalling uses signals collected from dsDNA portion of each read. Sequences shown in 5′ to 3′ direction. SEQ ID 12-letter Sequence T T C G A
164 12L-DNA- ACTCAGGGAACAAACCAAGTTACGTGCTGAACTVAG A 2 CTCAGCGTGGGAATGAATCCTTTGATAAGGCAGAAA T G T A A C A C G A A G T
166 12L-DNA- ACTCAGGGAACAAACCAAGTTACGTGCTGAACTVGA B 2 GTCAGCGTGGGAATGAATCCTTTGATAAGGCAGAAA T G T A A G A
[0234] Table 25. Scuper-12 per-read recall confusion matrix values. Tabulation of per-read recall results using the 4-nt XNA kmer model for Scuper-12, from confusion matrices shown in Fig. 5. Table shows: (left) fraction of base called at each xenonucleotide position using the full 12-letter supernumerary model; (right) base called using model with simplified priors, where denotes the xenonucleotide at position called, and § denotes the most similar standard base called instead. Box highlights base pair chosen from picking the most likely nucleobase among any purine or pyrimidine set, then fixing complementary base. Base called – Scuper-12 § .2 9 .1 3 .0 8
0. 0.0 0.0 0.6 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.7 0.2 Z 03 0 1 9 3 1 2 0 0 3 0 8 8 2 .5 2 .0 5 .1 9 .0 7
[0235] Table 26. Snuper-12 per-read recall confusion matrix values. Tabulation of per-read recall results using the 4-nt XNA kmer model for Snuper-12, from confusion matrices shown in Fig. 5. Table shows: (left) fraction of base called at each xenonucleotide position using the full 12-letter supernumerary model; (right) base called using model with simplified priors, where denotes the xenonucleotide at position called, and § denotes the most similar standard base called instead. Box highlights base pair chosen from picking the most likely nucleobase among any purine or pyrimidine set, then fixing complementary base. Base called – Snuper-12 § .2 0 .0 9 .2 2 .0 4 .2
01 6 7 0 7 5 1 7 3 0 0 1 3 7 .3 3 .2 1 .0 7
[0236] Table 27. Tabulation of per-read recall from simulated signal levels for the standard genetic code (A, T, G, C). Information regarding read simulation can be found in the Note section. Standard code A.
[0237] Table 28. Tabulation of per-read recall from simulated signal levels for the isoG/isoC code (A, T, G, C, B, Sn). isoG/isoC code 6 0 7 0 3
Sn 0.07 0.00 0.07 0.00 0.03 0.83
[0238] Table 29. Tabulation of per-read recall from simulated signal levels for the hachimoji code (A, T, G, C, B, Sc, P, Z). Hachimoji code
[0239] Table 30. Tabulation of per-read recall from simulated signal levels for the 12-base Snupernumerary code (A, T, G, C, B, Sn, P, Z, Xt, Kn, J, V). Snupernumerary code
0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 C 1 3 0 9 3 0 0 9 2 1 9 0 0 1 1 0 0 0 0 6
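As a non-limiting illustration of the kind of simulation summarized in Tables 27-30 and described in Note 2, the following Python sketch draws Gaussian signal levels around the expected kmer means of a known base, basecalls each simulated read by nearest expected levels, and reports per-read recall. The toy model, the single shared standard deviation, and all names are assumptions made for illustration; the sketch is not the Xenomorph simulation itself.

```python
import random
from typing import Dict, Sequence


def simulate_recall(model: Dict[str, Sequence[float]],
                    true_base: str,
                    sigma: float,
                    n_reads: int = 1000,
                    seed: int = 0) -> float:
    """Fraction of simulated reads basecalled as `true_base`.

    `model` maps each candidate base to its expected signal levels;
    `sigma` is an assumed global standard deviation of the signal.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    hits = 0
    for _ in range(n_reads):
        # Simulate one read: one noisy observation per expected level.
        levels = [rng.gauss(mu, sigma) for mu in model[true_base]]
        # With a shared variance, maximum likelihood is nearest-mean.
        called = min(
            model,
            key=lambda b: sum((x - m) ** 2 for x, m in zip(levels, model[b])),
        )
        hits += called == true_base
    return hits / n_reads
```

Adding more bases whose expected levels lie close together crowds the signal space and lowers the simulated recall, which is the qualitative trend reported as alphabets grow from 4 to 12 letters.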
[0240] Table 31. Nanopore datasets of supernumerary DNA uploaded to the Sequence Read Archive. Accession numbers for samples deposited in the SRA. For each sample, both raw FAST5 and FASTQ basecalls (guppy) are provided. Model building datasets used in this disclosure, alongside independent replicates, are available as testing data for the public. All samples can be found under SRA Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328] – Nanopore sequencing of 12-letter DNA (ATGCBSPZXKJV). Sample Name Contents Sample Accession
SEQ ID NO:2 S ll Kl F t KF f DNA Pl I thti t t V E I N E Y D R D
SEQ ID NO:3 A Y E E Y H G IP T I L
Table 32: KDE values for various Kmers Kmer Model Coverage KDE Mean Median Max Min Std
ACGT ATGC 471 0.510 0.429 0.491 1.432 -1.749 0.376 ACTA ATGC 531 0825 0843 0837 2867 0461 0260
ATGT ATGC 523 1.985 1.732 1.874 3.074 -1.370 0.623 ATTA ATGC 488 1410 1365 1401 2406 1464 0379
CGCA ATGC 545 -0.260 -0.220 -0.245 1.177 -1.338 0.241 CGGA ATGC 566 1110 1054 1073 0833 2808 0368
GCTA ATGC 550 0.595 0.640 0.634 1.642 -1.324 0.288 GGAA ATGC 565 2210 2095 2163 0450 3903 0396
TCAC ATGC 511 0.515 0.591 0.555 2.232 -1.365 0.279 TCAG ATGC 567 0955 0963 0949 2644 0992 0354
CTCT ATGC 568 2.230 2.103 2.183 3.051 -0.770 0.403 CTGC ATGC 556 1225 1094 1126 1986 0713 0312
TCTC ATGC 565 0.035 0.120 0.048 2.784 -1.435 0.394 TCTG ATGC 569 0155 0322 0280 2827 2055 0329
GGCC ATGC 551 -1.250 -1.228 -1.236 0.665 -3.145 0.282 GGCG ATGC 559 1185 1102 1131 1933 2282 0325
ACBG B 1672 1.610 1.494 1.574 2.578 -2.021 0.405 ACBT B 252 1285 1114 1243 1802 1834 0476
BAGT B 1617 -1.260 -1.304 -1.277 0.345 -3.461 0.418 BATA B 313 0385 0487 0428 1846 2301 0341
BGGT B 2402 -1.785 -1.942 -1.850 0.113 -3.589 0.575 BGTA B 2704 1490 1343 1407 1597 2581 0295
CBAT B 256 -0.555 -0.650 -0.596 0.279 -1.721 0.274 CBCA B 1223 0415 0398 0394 2138 0797 0283
CTBG B 3608 2.335 2.199 2.273 3.305 -1.068 0.413 CTBT B 409 2955 2469 2799 3806 0860 0938
GCBG B 1164 1.405 1.300 1.394 2.431 -2.352 0.434 GCBT B 170 1170 0884 1087 1954 1240 0536
TBAT B 651 -0.410 -0.425 -0.418 1.166 -2.237 0.334 TBCA B 2355 0190 0275 0197 3364 2141 0542
TTBG B 5100 1.500 1.425 1.460 2.719 -0.964 0.360 TTBT B 443 1955 1676 1830 3144 0128 0565
ASAT Sn 124 -2.065 -1.906 -1.989 -0.403 -2.950 0.468 ASCA S 48 0535 0456 0499 0709 1484 0449
CCSA Sn 1789 1.010 0.613 0.730 2.623 -2.046 0.536 CCSC S 341 0480 0452 0505 1873 1401 0513
CTSA Sn 830 0.530 0.478 0.480 3.201 -1.589 0.669 CTSC S 101 0035 0837 0856 3014 1087 1110
GSAT Sn 512 -2.925 -2.797 -2.832 -0.556 -3.512 0.357 GSCA S 151 1230 1103 1137 0752 2712 0585
SAGT Sn 762 -1.755 -1.870 -1.799 -0.545 -3.385 0.355 SATA S 468 1335 1447 1430 0586 2730 0457
SGGT Sn 317 -1.965 -2.120 -2.073 -0.828 -3.687 0.365 SGTA S 567 1915 1804 1874 0310 2443 0325
TCSA Sn 1457 1.585 1.263 1.517 2.968 -1.699 0.954 TCSC S 409 1370 0508 0850 2177 1630 0996
TTSA Sn 242 0.195 0.571 0.338 2.115 -1.862 0.741 TTSC S 51 0080 0122 0052 1734 1416 0635
APAT P 777 -2.925 -2.764 -2.853 0.010 -3.805 0.417 APCA P 1451 0995 0905 0931 1488 3207 0375
CCPA P 2333 0.775 0.790 0.807 2.662 -1.829 0.295 CCPC P 1080 0275 0324 0316 1381 1677 0247
CTPA P 3139 1.575 1.428 1.521 2.882 -1.834 0.504 CTPC P 2376 1125 1029 1117 2724 1682 0519
GPAT P 1029 -3.830 -3.511 -3.648 -0.601 -4.440 0.526 GPCA P 1151 1500 1376 1471 1550 3342 0530
PAGT P 2648 -1.770 -2.042 -1.879 0.059 -3.559 0.452 PATA P 1897 1215 1318 1281 0617 3534 0410
PGGT P 3618 -2.255 -2.391 -2.346 0.608 -3.902 0.374 PGTA P 3922 1755 1699 1730 1448 3236 0333
TCPA P 2409 0.865 0.861 0.865 3.107 -1.847 0.310 TCPC P 1650 0380 0373 0346 2478 1137 0304
TTPA P 2584 0.815 0.759 0.787 1.991 -2.205 0.386 TTPC P 1944 0315 0235 0335 2727 1728 0546
ATTZ Z 1288 1.520 1.398 1.471 2.652 -1.828 0.420 ATZA Z 1154 2475 2305 2412 3416 0791 0463
CCTZ Z 1439 0.260 0.274 0.241 2.304 -1.361 0.313 CCZA Z 1429 1230 1230 1246 2574 1815 0314
CZGT Z 2216 -0.180 -0.181 -0.186 2.139 -1.580 0.262 CZTA Z 1944 0400 0553 0496 2625 0663 0342
GTTZ Z 1435 1.100 1.038 1.086 3.001 -1.823 0.420 GTZA Z 1060 2125 2013 2082 3163 0728 0374
TCTZ Z 2032 0.365 0.464 0.402 3.134 -1.808 0.339 TCZA Z 2117 1280 1236 1240 2639 0335 0230
TZGT Z 2048 0.390 0.358 0.377 1.584 -1.383 0.279 TZTA Z 816 0800 0907 0853 2997 0033 0331
ZCGT Z 325 0.585 0.492 0.532 1.687 -2.926 0.563 ZCTA Z 625 1125 1030 1057 2290 1364 0342
ZTGT Z 1088 1.700 1.616 1.680 2.903 -1.343 0.468 ZTTA Z 907 1835 1664 1737 2568 1572 0438
ATTX Xt 266 0.335 0.684 0.530 2.631 -2.304 0.633 ATXA X 376 2620 2205 2397 3253 1555 0689
CCTX Xt 330 -0.065 0.294 0.066 2.837 -1.043 0.873 CCXA X 459 1100 1004 1028 1989 1159 0271
CXGT Xt 1797 -1.115 -1.174 -1.157 1.104 -2.803 0.257 CXTA X 145 0460 0406 0464 1862 1988 0478
GTTX Xt 635 -0.120 0.158 -0.029 1.867 -1.753 0.591 GTXA X 436 2270 1808 1999 2984 1751 0735
TCTX Xt 370 -0.500 -0.088 -0.387 2.890 -0.915 0.753 TCXA X 1493 1040 1021 1019 2413 0677 0209
TXGT Xt 1199 -0.665 -0.702 -0.699 1.199 -2.709 0.331 TXTA X 105 0055 0118 0071 1676 2332 0483
XCGT Xt 368 -0.815 -0.654 -0.777 2.248 -1.724 0.492 XCTA X 538 0000 0634 0531 3391 0956 0714
XTGT Xt 485 0.100 0.136 0.077 2.847 -2.049 0.582 XTTA X 127 0185 0343 0253 1598 1250 0439
AKAT Kn 65 -2.035 -1.709 -1.829 1.875 -4.022 0.838 AKCA K 164 0485 0504 0468 0632 2101 0503
CCKA Kn 654 0.720 0.699 0.725 1.773 -1.196 0.319 CCKC K 920 0330 0361 0367 2143 0578 0186
CTKA Kn 449 2.220 1.843 1.991 2.679 -0.383 0.593 CTKC K 233 1970 1730 1831 2627 0835 0449
GKAT Kn 416 -2.385 -2.353 -2.382 -0.482 -4.183 0.393 GKCA K 356 0705 0717 0694 0590 2583 0317
KAGT Kn 251 -2.740 -2.560 -2.684 -0.011 -3.447 0.526 KATA K 479 1540 1493 1543 1131 2755 0443
KGGT Kn 147 -1.870 -2.205 -1.944 -0.595 -4.206 0.714 KGTA K 839 1095 1091 1096 0836 3406 0467
TCKA Kn 1281 0.645 0.647 0.648 1.933 -2.023 0.287 TCKC K 3032 0240 0336 0300 2671 1675 0258
TTKA Kn 330 1.065 0.940 1.063 2.072 -1.223 0.561 TTKC K 599 1145 1053 1079 1848 0622 0297
AJAT J 126 0.160 -0.034 0.080 2.418 -2.555 0.792 AJCA J 102 1115 0885 1047 2816 2226 0721
CCJA J 725 1.250 1.202 1.231 3.239 -0.765 0.322 CCJC J 363 0910 0868 0882 1984 1095 0307
CTJA J 705 2.515 2.264 2.438 3.803 -1.394 0.598 CTJC J 583 2400 2307 2390 3065 2133 0458
GJAT J 311 0.055 -0.217 -0.085 2.368 -3.260 0.734 GJCA J 88 0610 0610 0623 3183 2563 0857
JAGT J 1181 -0.655 -0.800 -0.709 1.266 -3.162 0.627 JATA J 504 0370 0059 0208 2251 2232 0577
JGGT J 780 -1.715 -1.624 -1.692 0.849 -3.829 0.486 JGTA J 1999 0985 0904 0952 2119 3162 0347
TCJA J 975 1.130 1.156 1.159 2.130 -1.291 0.263 TCJC J 602 0815 0859 0840 2307 2057 0358
TTJA J 510 1.705 1.657 1.709 3.450 -0.750 0.443 TTJC J 359 1625 1542 1596 2315 0662 0343
ATTV V 61 0.165 0.311 0.144 2.640 -1.006 0.790 ATVA V 253 0625 0134 0392 2927 2513 1035
CCTV V 122 -0.080 0.213 0.232 2.046 -1.035 0.597 CCVA V 288 0480 0394 0474 2788 0903 0513
CVGT V 710 -2.075 -1.435 -1.550 0.596 -2.810 0.655 CVTA V 95 1655 1117 1454 1585 2228 0896
GTTV V 82 -0.260 -0.016 -0.127 1.574 -3.517 0.681 GTVA V 227 0400 0129 0324 2460 3131 0964
TCTV V 183 0.030 0.012 -0.067 2.871 -1.354 0.778 TCVA V 230 0240 0415 0261 2402 1708 0830
TVGT V 269 -1.500 -1.302 -1.399 1.205 -2.810 0.516 TVTA V 52 1290 0527 0888 1745 2342 1106
VCGT V 196 0.100 -0.069 0.051 1.529 -1.921 0.471 VCTA V 57 0550 0643 0615 2688 0615 0509
VTGT V 181 0.960 0.592 0.853 2.290 -2.437 0.873 VTTA V 65 0950 0697 0885 1962 1879 0624
ASAT Sc 18 -2.890 -2.515 -2.708 -0.935 -3.695 0.848 ASCA S 7 0125 0385 0056 0365 1616 0710
CCSA Sc 207 0.905 0.887 0.897 1.870 -0.612 0.345 CCSC S 97 0725 0622 0675 1222 0694 0312
CTSA Sc 51 1.130 1.034 1.111 2.175 -1.675 0.700 CTSC S 55 2130 1733 2019 2698 0929 0839
GSAT Sc 88 -3.525 -3.203 -3.357 -0.611 -4.153 0.676 GSCA S 22 0790 0863 0815 0543 2113 0708
SAGT Sc 87 -2.235 -2.040 -2.125 0.796 -3.425 0.680 SATA S 47 1530 1527 1545 0050 2407 0451
SGGT Sc 40 -2.290 -2.125 -2.201 0.166 -3.294 0.632 SGTA S 32 1900 1670 1825 0270 2631 0594
TCSA Sc 56 0.490 0.812 0.669 2.408 -0.043 0.599 TCSC S 68 0055 0388 0187 2254 2915 0718
TTSA Sc 14 1.210 1.110 1.154 1.779 0.319 0.450 TTSC S 10 1175 0849 1043 1913 0216 0680
NON-LIMITING EMBODIMENTS [0241] While general features of the disclosure are described and shown and particular features of the disclosure are set forth in the claims, the following non-limiting embodiments relate to features, and combinations of features, that are explicitly envisioned as being part of the disclosure. The following non-limiting Embodiments contain elements that are modular and can be combined with each other in any number, order, or combination to form a new non-limiting Embodiment, which can itself be further combined with other non-limiting Embodiments. [0242] Embodiment 1. A method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase. [0243] Embodiment 2. The method of Embodiment 1 or any other Embodiment, wherein the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP). [0244] Embodiment 3. The method of Embodiment 1 or any other Embodiment, wherein the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I. [0245] Embodiment 4. The method of Embodiment 3 or any other Embodiment, wherein the polypeptide sequence comprises a sequence of SEQ ID NO:2. [0246] Embodiment 5. The method of any of Embodiments 3-4 or any other Embodiment, wherein the non-standard nucleotide is B or p, and the reaction
3915-P1293WO.UW -185-
condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
[0247] Embodiment 6. The method of Embodiment 1 or any other Embodiment, wherein the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
[0248] Embodiment 7. The method of Embodiment 6 or any other Embodiment, wherein the engineered polymerase is a variant of 9°N DNA polymerase.
[0249] Embodiment 8. The method of Embodiment 7 or any other Embodiment, wherein the polypeptide sequence comprises a sequence of SEQ ID NO:3.
[0250] Embodiment 9. The method of any of Embodiments 6-8 or any other Embodiment, wherein the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
[0251] Embodiment 10. A method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
[0252] Embodiment 11. The method of Embodiment 10 or any other Embodiment, comprising the method of any of Embodiments 1-9 or any other Embodiment.
[0253] Embodiment 12. The method of any of Embodiments 10-11 or any other Embodiment, comprising: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
[0254] Embodiment 13. The method of any of Embodiments 10-12 or any other Embodiment, wherein the N+1 tailing product comprises a hairpin.
[0255] Embodiment 14. The method of any of Embodiments 10-13 or any other Embodiment, wherein the second N+1 tailing product comprises a hairpin.
[0256] Embodiment 15. The method of Embodiment 14 or any other Embodiment, wherein the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
[0257] Embodiment 16. The method of any of Embodiments 12-15 or any other Embodiment, comprising: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
[0258] Embodiment 17. The method of Embodiment 16 or any other Embodiment, wherein the method is performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of a further dsDNA ligation product.
[0259] Embodiment 18. The method of Embodiment 17 or any other Embodiment, comprising: contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
[0260] Embodiment 19. The method of any of Embodiments 1, 3-4, 6-8, and 10-18 or any other Embodiment, wherein the non-standard nucleotide comprises: an epigenetic modification, a modified sugar, a phosphate backbone, a nucleobase, a nucleobase that can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
[0261] Embodiment 20. The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
[0262] Embodiment 21. The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base.
[0263] Embodiment 22. The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
[0264] Embodiment 23. The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
[0265] Embodiment 24. A dsDNA ligation product produced by the method of any of Embodiments 12-23 or any other Embodiment.
[0266] Embodiment 25. A further dsDNA ligation product produced by the method of any of Embodiments 17-23 or any other Embodiment.
[0267] Embodiment 26. A blunt-end dsDNA template produced by the method of any of Embodiments 16-23 or any other Embodiment.
[0268] Embodiment 27. A further blunt-end dsDNA template produced by the method of any of Embodiments 18-23 or any other Embodiment.
[0269] Embodiment 28. A defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product of Embodiment 24 or any other Embodiment or the blunt-end dsDNA template of Embodiment 26 or any other Embodiment, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
[0270] Embodiment 29. A defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product of Embodiment 25 or any other Embodiment or the further blunt-end dsDNA template of Embodiment 27 or any other Embodiment, wherein the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
[0271] Embodiment 30. The defined non-standard nucleotide base pair library of any of Embodiments 28-29 or any other Embodiment, wherein the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence; and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
[0272] Embodiment 31. A method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library of any of Embodiments 28-30 or any other Embodiment to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library of any of Embodiments 28-30 or any other Embodiment, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
[0273] Embodiment 32. The method of Embodiment 31 or any other Embodiment, wherein the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
[0274] Embodiment 33. A non-transitory computer-readable storage medium having stored thereon at least part of a ML model produced by any of Embodiments 31-32 or any other Embodiment.
[0275] Embodiment 34. A computational device or computational system comprising the non-transitory computer-readable storage medium of Embodiment 33 or any other Embodiment.
[0276] Embodiment 35. A nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium of Embodiment 33 or any other Embodiment.
[0277] Embodiment 36. A method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read; computing, with the computational device or computational system of Embodiment 34 or any other Embodiment, the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library associated with the subject current read for an association; and computing, based on the association, a structure of the non-standard nucleotide.
[0278] Embodiment 37. A circuitry configured to perform all or part of the method of Embodiment 36 or any other Embodiment.
[0279] Embodiment 38. A circuitry configured to perform all of the method of Embodiment 36 or any other Embodiment.
[0280] Embodiment 39. A nanopore sequencing kit, device, or system comprising the circuitry of Embodiment 37 or any other Embodiment.
[0281] Embodiment 40. A nanopore sequencing kit, device, or system comprising the circuitry of Embodiment 38 or any other Embodiment.
[0282] While illustrative embodiments have been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the disclosure.
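The reaction conditions recited for the two polymerases (about 0.71 U/µL at about 37°C for the Klenow fragment with B or p; about 0.29 U/µL at about 60°C for the 9°N variant with Sn, Sc, Z, Xt, Kn, J, or V; about 1.19 mM non-standard dNTP in both cases) are final concentrations. As a rough illustration of how such values could translate into pipetting volumes, the following sketch applies the dilution relation C1·V1 = C2·V2. The stock concentrations (5 U/µL polymerase, 10 mM dNTP), the 20 µL reaction volume, and all function and key names are assumptions made for illustration only and are not taken from this disclosure.

```python
# Final concentrations as recited in the embodiments; key names are hypothetical.
CONDITIONS = {
    "KF_exo_minus": {"temp_c": 37, "hours": (1, 16), "pol_u_per_ul": 0.71, "dntp_mm": 1.19},
    "9N_variant": {"temp_c": 60, "hours": (4, 16), "pol_u_per_ul": 0.29, "dntp_mm": 1.19},
}


def stock_volume_ul(final_conc, stock_conc, reaction_volume_ul):
    """Volume of stock needed to reach final_conc in the reaction (C1*V1 = C2*V2)."""
    if stock_conc <= 0 or final_conc > stock_conc:
        raise ValueError("stock must be positive and at least as concentrated as the target")
    return final_conc * reaction_volume_ul / stock_conc


def plan(name, reaction_volume_ul=20.0, pol_stock_u_per_ul=5.0, dntp_stock_mm=10.0):
    """Return pipetting volumes in microliters; the default stocks are assumed values."""
    cond = CONDITIONS[name]
    pol = stock_volume_ul(cond["pol_u_per_ul"], pol_stock_u_per_ul, reaction_volume_ul)
    dntp = stock_volume_ul(cond["dntp_mm"], dntp_stock_mm, reaction_volume_ul)
    return {
        "polymerase_ul": round(pol, 2),
        "dntp_ul": round(dntp, 2),
        # Remaining volume for template, buffer, and water.
        "remainder_ul": round(reaction_volume_ul - pol - dntp, 2),
    }


if __name__ == "__main__":
    for name in CONDITIONS:
        print(name, plan(name))
```

With different stock concentrations, only the keyword arguments to `plan` change; the recited final concentrations stay fixed in `CONDITIONS`.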
Claims
The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
2. The method of claim 1, wherein the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
3. The method of claim 1, wherein the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I.
4. The method of claim 3, wherein the polypeptide sequence comprises a sequence of SEQ ID NO:2.
5. The method of claim 3, wherein the non-standard nucleotide is B or p, and the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
6. The method of claim 1, wherein the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
7. The method of claim 6, wherein the engineered polymerase is a variant of 9°N DNA polymerase.
8. The method of claim 7, wherein the polypeptide sequence comprises a sequence of SEQ ID NO:3.
9. The method of claim 6, wherein the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
10. A method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
11. The method of claim 10, comprising: combining a double-stranded DNA (dsDNA) template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase to produce an N+1 tailing product; wherein the N+1 tailing product comprises the non-standard nucleotide covalently bound with the 3’ end of the precursor dsDNA template and is non-base-paired.
12. The method of claim 11, comprising: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
13. The method of claim 11, wherein the N+1 tailing product comprises a hairpin.
14. The method of claim 12, wherein the second N+1 tailing product comprises a hairpin.
15. The method of claim 12, wherein the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
16. The method of claim 12, comprising: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
17. The method of claim 16, wherein the method is performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of a further dsDNA ligation product.
18. The method of claim 17, comprising: contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
19. The method of claim 1, wherein the non-standard nucleotide comprises: an epigenetic modification, a modified sugar, a phosphate backbone, a nucleobase, a nucleobase that can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
20. The method of claim 19, wherein the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
21. The method of claim 19, wherein the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base.
22. The method of claim 19, wherein the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
23. The method of claim 19, wherein the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
24. A dsDNA ligation product produced by the method of claim 12.
25. A further dsDNA ligation product produced by the method of claim 17.
26. A blunt-end dsDNA template produced by the method of claim 16.
27. A further blunt-end dsDNA template produced by the method of claim 18.
28. A defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product of claim 24, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
29. A defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product of claim 25, wherein the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
30. The defined non-standard nucleotide base pair library of claim 28, wherein the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence; and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
31. A method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library of claim 28 to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library of claim 28, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
32. The method of claim 31, wherein the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
33. A non-transitory computer-readable storage medium having stored thereon at least part of a ML model produced by claim 31.
34. A computational device or computational system comprising the non-transitory computer-readable storage medium of claim 33.
35. A nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium of claim 33.
36. A method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read; computing, with the computational device or computational system of claim 34, the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library associated with the subject current read for an association; and computing, based on the association, a structure of the non-standard nucleotide.
37. A circuitry configured to perform all or part of the method of claim 36.
38. A circuitry configured to perform all of the method of claim 36.
39. A nanopore sequencing kit, device, or system comprising the circuitry of claim 37.
40. A nanopore sequencing kit, device, or system comprising the circuitry of claim 38.
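The train-then-assign flow recited in claims 31 and 36 can be illustrated with a deliberately simplified sketch. It is not the convolutional LSTM RNN of claim 32: a nearest-centroid classifier on the mean of each current trace is swapped in so that the example stays dependency-free. The function names and the current values are hypothetical, and the data are synthetic numbers chosen for illustration, not measurements from this disclosure.

```python
from collections import defaultdict
from statistics import mean


def summarize(trace):
    """Collapse one current read (a list of samples) to a single feature: its mean."""
    return mean(trace)


def train(labeled_reads):
    """Build a model from (trace, known_identity) pairs taken from a defined library.

    Returns a dict mapping each known identity to the centroid of its features.
    """
    features = defaultdict(list)
    for trace, identity in labeled_reads:
        features[identity].append(summarize(trace))
    return {identity: mean(values) for identity, values in features.items()}


def assign(model, trace):
    """Assign to an unknown read the identity whose centroid is closest."""
    feature = summarize(trace)
    return min(model, key=lambda identity: abs(model[identity] - feature))


if __name__ == "__main__":
    # Synthetic training reads labeled with the known identity from the library.
    training = [
        ([80.0, 82.0, 81.0], "B"),
        ([79.0, 81.0, 80.0], "B"),
        ([95.0, 97.0, 96.0], "Sn"),
        ([96.0, 98.0, 97.0], "Sn"),
    ]
    model = train(training)
    print(assign(model, [94.0, 95.0, 96.0]))
```

In a realistic setting the labels would come from the known composition of the defined library, and the single-feature summary would be replaced by a sequence model operating on the full trace.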
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363483926P | 2023-02-08 | 2023-02-08 | |
| US63/483,926 | 2023-02-08 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024168196A1 (en) | 2024-08-15 |
Family
ID=92263538
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/015068 (Pending; published as WO2024168196A1 (en)) | 2023-02-08 | 2024-02-08 | Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide basepairs |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024168196A1 (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150133320A1 (en) * | 1997-04-01 | 2015-05-14 | Illumina, Inc. | Method of nucleic acid amplification |
| WO2019081680A1 (en) * | 2017-10-25 | 2019-05-02 | Institut Pasteur | Immobilization of nucleic acids using an enzymatic his-tag mimic for diagnostic applications |
| US20200263218A1 (en) * | 2017-10-04 | 2020-08-20 | Centrillion Technology Holdings Corporation | Method and system for enzymatic synthesis of oligonucleotides |
| US20200392572A1 (en) * | 2017-12-21 | 2020-12-17 | Curevac Ag | Linear double stranded dna coupled to a single support or a tag and methods for producing said linear double stranded dna |
| US10934569B1 (en) * | 2018-12-20 | 2021-03-02 | Nicole A Leal | Enzymatic processes for synthesizing RNA containing certain non-standard nucleotides |
| US20210171920A1 (en) * | 2015-10-29 | 2021-06-10 | Temple University-Of The Commonwealth System Of Higher Education | Modification of 3' Terminal Ends of Nucleic Acids by DNA Polymerase Theta |
| US20210355519A1 (en) * | 2020-05-15 | 2021-11-18 | Codex Dna, Inc. | Demand synthesis of polynucleotide sequences |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Lucas et al. | Quantitative analysis of tRNA abundance and modifications by nanopore RNA sequencing | |
| Lu et al. | Enzymatic DNA synthesis by engineering terminal deoxynucleotidyl transferase | |
| US10865410B2 (en) | Next-generation sequencing libraries | |
| DK2245187T3 (en) | Methods for accurate sequence data and modified due to localization | |
| US20200370202A1 (en) | Methods, systems, computer readable media, and kits for sample identification | |
| Leu et al. | Cascade of reduced speed and accuracy after errors in enzyme-free copying of nucleic acid sequences | |
| Fleming et al. | Structural elucidation of bisulfite adducts to pseudouridine that result in deletion signatures during reverse transcription of RNA | |
| Kupakuwana et al. | Acyclic identification of aptamers for human alpha-thrombin using over-represented libraries and deep sequencing | |
| CN106574286A (en) | Selective amplification of nucleic acid sequences | |
| Kawabe et al. | Enzymatic synthesis and nanopore sequencing of 12-letter supernumerary DNA | |
| US20170101675A1 (en) | Ion sensor dna and rna sequencing by synthesis using nucleotide reversible terminators | |
| CN105579592B (en) | DNA linker molecules for the preparation of DNA libraries and methods for their production and use | |
| Arguello et al. | In vitro selection with a site-specifically modified RNA library reveals the binding preferences of N6-methyladenosine reader proteins | |
| US20200190574A1 (en) | Rna-stitch sequencing: an assay for direct mapping of rna : rna interactions in cells | |
| US20240052342A1 (en) | Method for duplex sequencing | |
| KR20240069835A (en) | Improved method and kit for the generation of dna libraries for massively parallel sequencing | |
| JP2002525129A (en) | Methods for analyzing polynucleotides | |
| US20060141516A1 (en) | De-novo sequencing of nucleic acids | |
| Wang et al. | Small-molecule-catalysed deamination enables transcriptome-wide profiling of N 6-methyladenosine in RNA | |
| CN116287167B (en) | Method for sequencing nucleic acid molecules | |
| Giurgiu et al. | A Fluorescent G‐Quadruplex Sensor for Chemical RNA Copying | |
| US20160239732A1 (en) | System and method for using nucleic acid barcodes to monitor biological, chemical, and biochemical materials and processes | |
| Depmeier et al. | Expanding the Horizon of the Xeno Nucleic Acid Space: Threose Nucleic Acids with Increased Information Storage | |
| WO2024168196A1 (en) | Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide basepairs | |
| EP4314339B1 (en) | Chimeric artefact detection method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24754101; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |