
WO2019079802A1 - Methods for high-throughput encoding and decoding of information stored in DNA - Google Patents

Info

Publication number
WO2019079802A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotide
nucleotides
digits
sequence
strands
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2018/056900
Other languages
English (en)
Inventor
Henry Hung-yi LEE
Reza Kalhor
George M. Church
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harvard University
Original Assignee
Harvard University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harvard University
Publication of WO2019079802A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G11C13/0009 RRAM elements whose operation depends upon chemical change
    • G11C13/0014 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
    • G11C13/0019 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules

Definitions

  • the present invention relates in general to methods of using nucleotide transitions to encode information into a nucleotide sequence and high-throughput decoding of information stored in the nucleotide sequence.
  • DNA is a compelling data storage medium given its superior density, stability, energy-efficiency, and longevity compared to currently used electronic media (C. Bancroft, T. Bowler, B. Bloom, C. T. Clelland, Long-term storage of information in DNA. Science. 293, 1763-1765 (2001); V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, W. L. Hughes, Nucleic acid memory. Nat. Mater. 15, 366-370 (2016)). Recent studies have demonstrated that any digital data can be written in DNA, stored, and accurately read (G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337, 1628 (2012), N.
  • the present disclosure provides a method of decoding a nucleotide sequence, the nucleotide sequence encoding a value corresponding to a format of information.
  • the method includes determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value encoded in the nucleotide sequence corresponding to the format of information.
  • the nucleotide sequence encodes a series of values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information.
  • the value corresponding to the format of information can be obtained from analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources.
  • the value is a digital value and the series of values are digital values.
  • the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more.
  • the format of information is selected from the group consisting of text, image, video or audio format, sensor data, and combinations thereof.
  • the different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides. In other embodiments, the different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine. In one embodiment, the nucleotide sequence includes at least one nucleotide homopolymer. In another embodiment, the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence.
  • the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence and wherein the transition between one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge.
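As a minimal sketch of the transition-based decoding described above: each homopolymer run is collapsed to a single nucleotide, and every boundary between nonidentical nucleotides is assigned a predetermined value. The source does not reproduce its transition-to-value table here, so the cyclic mapping below (order A, C, G, T; value = forward step minus one) and the names `compress` and `decode_transitions` are purely illustrative assumptions.

```python
from itertools import groupby

# Assumed (hypothetical) mapping, not taken from the source: bases are placed
# in the cyclic order A -> C -> G -> T, and a transition is worth the number
# of forward steps (1, 2, or 3) from the previous base, minus one.
BASES = "ACGT"


def compress(raw: str) -> str:
    """Collapse every homopolymer run to a single nucleotide."""
    return "".join(base for base, _ in groupby(raw))


def decode_transitions(raw: str) -> list:
    """Return one trit (0, 1, or 2) per transition between nonidentical bases."""
    compressed = compress(raw)
    trits = []
    for prev, cur in zip(compressed, compressed[1:]):
        step = (BASES.index(cur) - BASES.index(prev)) % 4  # always 1, 2, or 3
        trits.append(step - 1)
    return trits
```

Because only transitions carry information, the homopolymer lengths in the raw read do not affect the decoded values: `decode_transitions("GGATTC")` and `decode_transitions("GATC")` give the same result.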
  • the series of digital values comprises two different digital values.
  • the series of digital values comprises three different digital values.
  • the series of digital values comprises more than three different digital values.
  • each digital value in the series of digital values represents two, three or more different digital values.
  • each nucleotide transition or boundary or edge is assigned a predetermined digital value.
  • the step of determining the nucleotide sequence is carried out by sequencing methods including nanopore sequencing, sequencing-by-synthesis, sequencing-by-ligation, and sequencing-by-hybridization. In one embodiment, the step of determining the nucleotide sequence is carried out by nucleotides modified with reversible terminators. In another embodiment, the step of determining the nucleotide sequence is carried out by detection of pyrophosphate or hydrogen ions generated during DNA polymerization of a complementary nucleotide strand. In one embodiment, the step of determining the nucleotide sequence is carried out by ligation of fluorescently modified single-stranded nucleotides with complementarity to the nucleotide sequence to be sequenced.
  • the series of digital values includes a corresponding barcode.
  • the method further includes decoding a plurality of nucleotide sequences, each member of the plurality encoding for an identical value corresponding to the format of information, wherein the nucleotide sequence is determined for each member of the plurality, and identifying a transition or boundary or edge between different nucleotides of each member of the plurality and assigning a predetermined value to each identified transition or boundary or edge to create the identical value corresponding to the format of information.
  • each member of the plurality of the nucleotide sequence encodes a series of identical values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of each member of the plurality of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of identical values encoded in each member of the plurality of the nucleotide sequence corresponding to the format of information.
  • the nucleotide sequence is attached to a substrate. In another embodiment, each member of the plurality of nucleotide sequence is attached to a substrate. In one embodiment, the series of digital values is a bit or trit stream and the nucleotide sequence corresponds to a bit or trit sequence within the bit or trit stream.
  • the series of digital values is a bit or trit stream and the bit or trit stream comprises a plurality of bit or trit sequences each having a corresponding barcode to indicate position within the bit or trit stream and with the plurality of bit or trit sequences having a corresponding plurality of nucleotide sequences, wherein each member of the plurality of nucleotide sequences is sequenced, and identifying a plurality of transitions or boundaries or edges between different nucleotides of each member of the plurality and assigning a predetermined bit or trit value to each transition or boundary or edge of the plurality of transitions or boundaries or edges to create the bit or trit sequences corresponding to each member of the plurality.
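The barcode-based reassembly described above can be sketched as follows. This assumes each strand has already been decoded into a pair of (barcode, values) and that the barcode is an integer position within the stream; both the data layout and the function name `assemble_stream` are hypothetical.

```python
def assemble_stream(decoded):
    """Rebuild a bit or trit stream from (barcode, values) pairs in any order.

    Assumption: the barcode is an integer giving the position of the
    sequence within the stream. Repeated barcodes keep the first copy seen.
    """
    by_position = {}
    for barcode, values in decoded:
        by_position.setdefault(barcode, values)
    stream = []
    for barcode in sorted(by_position):
        stream.extend(by_position[barcode])
    return stream
```

Reads may arrive in any order from a pooled sequencing run, so ordering is recovered from the barcode alone.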
  • the present disclosure provides a method of decoding a nucleotide sequence encoding for a series of digital values corresponding to a format of information.
  • the method includes determining the nucleotide sequence to identify nucleotide homopolymers and for each homopolymer assigning one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic or chemical synthesis, and assigning a particular digital value for each of the one or more nucleotides.
  • the predicted homopolymer length is determined from empirical observation.
  • the predicted homopolymer length is a median, a mean, or a mode.
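One possible reading of the homopolymer-length approach, sketched under stated assumptions: the predicted length of a single extension is taken as the median of empirically observed lengths, and an observed run is converted to a nucleotide count by dividing by that predicted length and rounding. The rounding rule, the per-base dictionary, and the function names are illustrative assumptions, not the method defined in the source.

```python
from itertools import groupby
from statistics import median


def predicted_length(observed_lengths):
    """Predicted homopolymer length per extension, here the median."""
    return median(observed_lengths)


def assign_nucleotides(raw, predicted):
    """Convert each homopolymer run into an estimated number of nucleotides.

    `predicted` maps each base to its predicted homopolymer length
    (hypothetical values supplied by the caller). Every run counts as at
    least one nucleotide.
    """
    out = []
    for base, run in groupby(raw):
        run_len = len(list(run))
        count = max(1, round(run_len / predicted[base]))
        out.append(base * count)
    return "".join(out)
```

A mean or mode could be substituted for the median in `predicted_length` without changing the rest of the sketch.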
  • the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
  • the nucleotides comprise natural nucleotides or nonnatural nucleotides.
  • the nucleotides comprise adenine, cytosine, guanine, and thymine.
  • the present disclosure provides a method of sequencing and decoding a plurality of nucleotide sequences representing a format of information wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences.
  • the method includes determining the sequences and decoded series of digital values for the sequences within a first portion of the plurality of nucleotide sequences, translating the series of digital values into the portions of the format of information, and sequencing and decoding in series additional portions into series of digital values and translating the series of digital values into the portions of the format of information until the entire format of information is achieved.
  • the present disclosure provides a method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence.
  • the method includes for each digital value, assigning a corresponding nucleotide to different or nonidentical nucleotide transition to generate the nucleotide sequence, synthesizing the nucleotide sequence, and optionally storing the nucleotide sequence.
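The encoding direction can be sketched in the same spirit: each ternary digit selects one of the three nucleotides that differ from the previous one, so consecutive template nucleotides are never identical. The cyclic mapping (order A, C, G, T), the default starting base, and the name `encode_trits` are assumptions made for illustration; the source's own mapping is given in a figure that is not reproduced in this text.

```python
# Assumed (hypothetical) cyclic order used to pick the next base.
BASES = "ACGT"


def encode_trits(trits, start="G"):
    """Return the template nucleotides to synthesize after the `start` base.

    Trit t moves t + 1 steps forward in the cyclic order A -> C -> G -> T,
    so the new base always differs from the previous one.
    """
    prev = start
    template = []
    for t in trits:
        if t not in (0, 1, 2):
            raise ValueError("trits must be 0, 1, or 2")
        nxt = BASES[(BASES.index(prev) + t + 1) % 4]
        template.append(nxt)
        prev = nxt
    return "".join(template)
```

For example, `encode_trits([1, 2, 1], start="G")` yields the template `ATC`, in which no nucleotide repeats its predecessor.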
  • the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more.
  • the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
  • nucleotides or different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides. In another embodiment, the nucleotides or different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine.
  • the present disclosure provides a method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences.
  • the method includes providing a plurality of nucleotide sequences, the plurality of nucleotide sequences represents a packet of information, the packet comprises at least one unique identifier; sequencing at least one of the plurality of nucleotide sequences using a selective sequencer; storing the sequence and its unique identifier; and preventing, using the selective sequencer, redundant sequencing of the same nucleotide sequence.
  • the step of preventing comprises using the unique identifier to prevent sequencing of additional nucleotide sequence with the same identifier.
  • the selective sequencer is a nanopore sequencer or a sequencer compatible with sequencing-by-synthesis, sequencing-by-ligation and sequencing-by-hybridization methods.
  • the sequence is stored in computer memory.
  • the sequence is decoded into digital values.
  • the unique identifier is a synthetic sequence.
  • the unique identifier is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence.
  • the plurality of nucleotide sequences comprises a plurality of unique identifiers.
  • the method further includes sequencing a predetermined number of nucleotide sequences; assembling the packet of information; and analyzing the information to determine if the information is correctly decoded.
  • the method further includes permitting sequencing of any nucleotide sequences that were not correctly decoded.
  • the step of analyzing is performed using a decoding algorithm.
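The bookkeeping behind selective sequencing can be illustrated with a small software simulation. It assumes reads arrive as (identifier, sequence) pairs and that the set of required identifiers is known in advance; a real selective sequencer would reject the strand physically, which this sketch only counts. The function name and data layout are hypothetical.

```python
def selective_sequencing(reads, needed_ids):
    """Store one sequence per unique identifier and skip redundant reads.

    reads: iterable of (identifier, sequence) pairs in arrival order.
    needed_ids: set of identifiers that make up the packet of information.
    Returns the stored sequences and the number of redundant reads skipped.
    """
    stored = {}
    rejected = 0
    for identifier, sequence in reads:
        if identifier in stored:
            rejected += 1  # simulated rejection of an already-seen identifier
            continue
        stored[identifier] = sequence
        if needed_ids.issubset(stored):
            break  # the whole packet has been collected
    return stored, rejected
```

If a stored sequence later fails the decoding check, its identifier could be removed from `stored` so that further reads carrying it are accepted again.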
  • the present disclosure provides a method of encoding information using nucleotides.
  • the method includes converting a format of information into a sequence of binary ASCII bits, converting the sequence of binary ASCII bits into a sequence of ternary ASCII bits, converting the sequence of ternary ASCII bits into a corresponding oligonucleotide sequence such that one bit represents a transition between non-identical nucleotides, and synthesizing the corresponding oligonucleotide sequence by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nu
  • the present disclosure provides a method of decoding a format of information from a synthesized oligonucleotide sequence encoding bit sequences of the format of information.
  • the method includes amplifying the oligonucleotide sequence, sequencing the amplified oligonucleotide sequence, converting the oligonucleotide sequence to bit sequences wherein each bit represents a transition between non-identical nucleotides, and converting the bit sequences to the format of information.
  • the oligonucleotide sequence is ligated to a universal adaptor before amplification.
  • the present disclosure provides a method of storing information using nucleotides.
  • the method includes converting a format of information into a sequence of binary ASCII bits, converting the sequence of binary ASCII bits into a sequence of ternary ASCII bits, converting the sequence of ternary ASCII bits into a corresponding oligonucleotide sequence such that one bit represents a transition between non-identical nucleotides, synthesizing the corresponding oligonucleotide sequence by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleo
  • the nucleotide triphosphate comprises dATP, dTTP, dCTP, dGTP, and dUTP.
  • synthesis activity is modulated by the ratio of the amount of TdT to the amount of apyrase.
  • divalent cations comprise magnesium and cobalt.
  • the reaction mixture further comprises additives comprising glycerol, sucrose, PEG8000, betaine, DMSA, Triton X-100 and Tween20.
  • the 3' terminal nucleotide of the initiator oligonucleotide is preferably A, G or T.
  • a polyC tail is added to the end of the corresponding oligonucleotide sequence.
  • a washing step is included between steps (a) and (b).
  • an index is included in the oligonucleotide sequence to specify strand order.
  • the nucleotide sequence is synthesized by a template-independent DNA polymerase.
  • the template-independent DNA polymerase is terminal deoxynucleotidyl transferase (TdT).
  • the nucleotide sequence is synthesized by a mixture of a template-independent DNA polymerase and an apyrase.
  • the information is stored using a codec model.
  • the codec model is capable of correcting errors accumulated from synthesis, storage and sequencing.
  • the sequencing is streaming nanopore sequencing.
  • Fig. 1 depicts a schematic comparison of the number of steps required for a single coupling in enzymatic DNA synthesis vs. phosphoramidite chemistry.
  • Figs. 2A - 2C depict results for optimizing and tuning the TdT:apyrase ratio.
  • Fig. 2A depicts initiator extension with dATP, dCTP, dGTP or dTTP by four different TdT to apyrase ratios.
  • TdT concentration is constant at 1 U/µL; apyrase concentration varies and is marked above each lane. mU is milliunits. Gels are 15% TBE-urea. "L" is ssDNA size marker.
  • Figs. 2B & 2C depict extension of an initiator with various concentrations of dCTP (Fig. 2B) or dGTP (Fig. 2C).
  • Apyrase:TdT ratio, as well as dNTP concentrations, are marked above each lane.
  • Gels are 15% TBE-urea.
  • "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional cytosines (Fig. 2B) or guanines (Fig. 2C).
  • Figs. 3A - 3C depict effects of cobalt on TdT:apyrase performance.
  • Fig. 3A depicts an initiator extension with each dNTP by various ratios of TdT to apyrase in presence of magnesium and presence or absence of supplemental cobalt.
  • TdT concentration is constant at ⁇ / ⁇
  • apyrase concentration which varies, as well as presence or absence of cobalt are marked above each lane.
  • cobalt is at 250 ⁇ .
  • Gels are 15% TBE-urea.
  • "L" is ssDNA size marker.
  • Fig. 3B depicts an initiator extension with 300 ⁇ dATP in presence of Magnesium and increasing amounts of supplemental cobalt.
  • Cobalt concentrations are marked above each lane.
  • Gel is 15% TBE-urea.
  • "L” is ssDNA size marker.
  • Fig. 3C depicts an initiator extension with each dNTP by TdT:apyrase in magnesium-only or cobalt-only reactions.
  • dNTP concentration is marked above each lane.
  • Gel is 15% TBE-urea.
  • "L” is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional nucleotides of the corresponding base, that is Cytosines for the gel with cytosine extension.
  • Figs. 4A - 4C depict buffer and additives optimization for TdT:apyrase.
  • Fig. 4A depicts an initiator extension with dATP by TdT:apyrase with increasing concentration of Enzymatics Green Buffer. Final buffer concentration is marked above each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker.
  • Fig. 4B depicts an initiator extension with a 500 ⁇ mixture of all dNTPs by TdT:apyrase in presence of various additives in different concentrations. Each lane is labelled with a number; the additive and its concentration in that lane are listed below the gels. Gels are 10% TBE-urea.
  • "L" is an RNA size marker.
  • Fig. 4C depicts an initiator extension with various dCTP concentrations by TdT:apyrase in the optimized buffer and the standard buffer. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional cytosines.
  • Fig. 5 depicts optimization of the polymerase to initiator ratio.
  • Initiator extension with dATP by TdT:apyrase with increasing concentration of TdT. Values above each lane mark the concentration of TdT at units per ⁇. Apyrase concentration is constant at 1 ⁇ / ⁇. Gel is 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 27 bases long.
  • Fig. 6 depicts TdT: apyrase performance and nucleotide concentration optimization for all sixteen possible combinations of 3' base of the initiator and the incoming nucleotide triphosphate (4 by 4). Each combination is evaluated on five lanes. The concentration of the relevant nucleotide is shown in ⁇ on top of each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 27 bases long.
  • Fig. 7 depicts multiple consecutive rounds of extension using the TdT:apyrase reagent. Two different series of transitions are shown. The nucleotide that is added is marked on top of each lane. All samples that are shown on each gel were aliquots of the same reaction that were sampled after the addition of each nucleotide. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 24 bases long.
  • Figs. 8A - 8C depict schematics for an enzymatic synthesis platform for DNA information storage.
  • Fig. 8A shows a schematic depiction of the synthesis reaction consisting of an oligonucleotide initiator, terminal deoxynucleotidyl transferase (TdT) and apyrase (AP).
  • TdT catalyzes the addition of nucleotides to the 3' end of the initiator, and apyrase degrades nucleotide triphosphates to terminate polymerization. Subsequent nucleotide triphosphates are added for further DNA synthesis. All synthesized strands share the same order of transitions between different nucleotides.
  • Fig. 8B depicts a schematic conversion between DNA and information. Synthesized DNA polymers are processed in silico by extracting transitions, which are then mapped to trits and bits.
  • Fig. 8C depicts the conversion between nucleotide transitions and trits used in this study.
  • Figs. 9A - 9D depict encoding "hello world!" in DNA using enzymatic synthesis.
  • Fig. 9A depicts an overview of the encoding scheme. Each character is represented by its own DNA strand containing a header index. To encode each character, its respective ASCII binary representation is converted to ternary, then to nucleotide transitions according to the mapping in Fig. 8C. DNA is synthesized using the enzymatic strategy disclosed herein, then sequenced as a pool using Illumina or Oxford Nanopore platforms.
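The character-to-ternary step of this scheme can be sketched as below. The fixed width of five ternary digits is an assumption chosen because 3^5 = 243 covers every 7-bit ASCII value; the header index format is not specified in this text and is left out. Function names are hypothetical.

```python
def to_ternary(value, width):
    """Return `value` as a list of `width` base-3 digits, most significant first."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 3)
        digits.append(remainder)
    if value:
        raise ValueError("value does not fit in the requested width")
    return digits[::-1]


def char_to_trits(ch, width=5):
    """Convert one ASCII character to ternary digits (assumed width of 5)."""
    return to_ternary(ord(ch), width)


# 'h' has ASCII decimal value 104 = 1*81 + 0*27 + 2*9 + 1*3 + 2
print(char_to_trits("h"))  # [1, 0, 2, 1, 2]
```

Each resulting digit would then be mapped to one nucleotide transition, and the strand for each character would be prefixed with its index.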
  • Fig. 9B depicts strand fidelity of each strand by Illumina and Oxford sequencing platforms.
  • Fig. 9C depicts streams of nanopore sequencing data. Each read is represented as a light gray dot. Reads passing the correct number of transitions (dark gray) and those with correct transitions (black) are marked. For each strand, the vertical line marks the time where the correct data can be decoded with a 99.9% confidence from the collected sequences.
  • Fig. 9D depicts data reconstruction using streaming nanopore sequencing compared to batch sequencing-by-synthesis (SBS). For each platform, the point in time at which the entire message can be decoded is marked by a box and an arrow.
  • Fig. 10 depicts profiling accuracy of each "hello world!" strand at every position. Illumina sequencing output was subjected to run-length encoding.
  • the black line indicates the percentage of reads that contained a nucleotide.
  • the bars indicate percentage of all reads that had a deletion, mismatch, or insertion at each position. As the frequencies of deletions and insertions are small, their bars are not visible in most positions.
  • Fig. 11 depicts the length distribution for each of the twelve synthesized strands. Lengths of all reads are denoted by the black line. Lengths of perfect reads are denoted by the gray shading. As perfect reads are longer, on average, size selection will increase the yield of correctly synthesized strands.
  • Figs. 12A - 12B depict the evaluation of 5-Bromo-dCTP and natural dCTP for TdT:apyrase.
  • 5-Bromo-dCTP as a substitute for natural dCTP is evaluated.
  • "L" is ssDNA size marker and includes the initiator oligonucleotide which is 27 bases long and ends in three cytosines.
  • Fig. 12A depicts that the extension lengths were evaluated over indicated concentration of natural dCTP.
  • Fig. 12B depicts that the extension lengths were evaluated over indicated concentration of 5-Bromo-dCTP (5Br-dCTP).
  • Figs. 13A - 13C depict an enzymatic synthesis strategy for storing information in DNA.
  • Fig. 13 A depicts a schematic depiction of a series of enzymatic synthesis reactions consisting of an oligonucleotide initiator (N, gray), terminal deoxynucleotidyl transferase (TdT) and apyrase (AP).
  • the initiator is tethered to a solid support.
  • TdT catalyzes the addition of a given nucleotide triphosphate to the 3' end of all initiators while apyrase degrades the added substrate to limit net polymerization.
  • Fig. 13B depicts the DNA strands synthesized for each of eight consecutive synthesis cycles, as shown on a 15% TBE-urea gel. The initiators were not tethered to a solid support and no wash was performed between cycles. The first lane is a single-stranded DNA size marker which includes the 24-nucleotide-long initiator oligonucleotide.
  • Fig. 13C depicts a schema for
  • Raw strands represent enzymatically-synthesized DNA.
  • Compressed strands represent sequences of non-identical nucleotides. Transitions between nucleotides, starting with the last nucleotide of the initiator
  • strands is equivalent to the template sequence, all desired transitions are present and the information stored in DNA is retrieved.
  • Figs. 14A - 14H depict the demonstration of information storage in DNA using enzymatic synthesis.
  • Fig. 14A depicts that the message "hello world!" was encoded in twelve template sequences, H01-H12, each representing one character. Transitions between nucleotides start with the last base of the initiator, which is labeled 'g'. A header index (shaded gray) denotes strand order. Only results from the first five transitions sequences are shown (see Fig. 15).
  • To encode each character its respective ASCII decimal value, prefixed with an address is represented in base 2 (binary) or in base 3 (ternary) (see Table 1), mapped to transitions (see Fig. 13C), resulting in template sequences with nucleotides to be synthesized (capitalized).
  • Fig. 14B depicts the extension lengths for each base from (A). Only
  • Fig. 14C depicts the distribution of extension lengths for each nucleotide transition, combined across all positions from all perfect strands.
  • Fig. 14D depicts the stepwise increases in strand R length with an increasing strand ⁇ length for all synthesized strands of H01-H12.
  • Fig. 14E depicts the distribution of all strand R lengths. Distributions are derived via kernel density estimation for all synthesized strands ('all ', gray shading) and a subpopulation of strands that contain all desired transitions ('perfect', dotted line).
  • Fig. 14F depicts the bulk error analysis for all synthesized strands of H01-H12.
  • strands ' were aligned, by Needleman-Wunsch, to their respective template sequences, and the number of mismatches, insertions, and missing nucleotides were tabulated.
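The alignment-based error tabulation can be sketched with a small Needleman-Wunsch implementation. The scoring values (match +1, mismatch -1, gap -1) and the tie-breaking order in the traceback are assumptions, since the source does not state its alignment parameters; the function name is hypothetical.

```python
def align_and_count(template, strand, match=1, mismatch=-1, gap=-1):
    """Globally align `strand` to `template` and tabulate the error types.

    Scoring parameters are assumed values. Returns counts of mismatches,
    insertions (extra bases in the strand) and missing nucleotides
    (template bases absent from the strand).
    """
    n, m = len(template), len(strand)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = template[i - 1] == strand[j - 1]
            score[i][j] = max(
                score[i - 1][j - 1] + (match if same else mismatch),
                score[i - 1][j] + gap,   # template base missing from strand
                score[i][j - 1] + gap,   # extra base inserted in strand
            )
    counts = {"mismatches": 0, "insertions": 0, "missing": 0}
    i, j = n, m
    while i > 0 or j > 0:  # traceback, preferring the diagonal move
        if i > 0 and j > 0:
            same = template[i - 1] == strand[j - 1]
            if score[i][j] == score[i - 1][j - 1] + (match if same else mismatch):
                if not same:
                    counts["mismatches"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            counts["missing"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts
```

When several alignments share the best score, this sketch reports only one of them, which mirrors the ambiguity that missing nucleotides can introduce.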
  • Fig. 14G depicts the information retrieval with in silico filtering. Fraction of perfect strands are shown before
  • Fig. 14H depicts the information retrieval by different sequencing platforms. Streaming nanopore sequencing (Oxford) was compared to batch sequencing-by-synthesis (Illumina). Each dot indicates the fraction of sequencing run at which each strand is robustly retrieved (100% correct with 99.99% probability). Arrow denotes the fraction of the sequencing run at which all data is robustly retrieved using each platform.
  • Fig. 15 depicts the extension lengths for perfect strands of H01-H12. Extension lengths for each nucleotide from perfect strands are displayed as a letter-value plot for each template sequence.
  • Fig. 16 depicts the raw lengths for all and perfect strands of H01-H12. All
  • synthesized strands of H01-H12 were sequenced with Illumina. Length distribution for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands are shown. Distributions are derived via kernel density estimation. The number of all strands to perfect strands for each template sequence are as follows: H01 {all: 399363, perfect: 42337}; H02 {all: 431770, perfect: 62243}; H03 {all: 611804, perfect: 89302}; H04
  • Fig. 17 depicts the synthesis error analysis for all strands of H01-H12. All synthesized strands R were sequenced with Illumina and transitions of non-identical nucleotides were
  • strands ' Each of these strands is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the number of mismatches, insertions, and missing nucleotides are tabulated.
  • Figs. 18A - 18B depict the nanopore sequencing and decoding of H01-H12.
  • Nanopore sequencing (Oxford) of synthesized raw strands. For each raw strand, the sequence of non-identical nucleotides is extracted to form compressed strands. The fraction of perfect compressed strands is plotted out of the set of all compressed strands (filled triangles) or out of the set of the top 3 most abundant compressed strands (open triangles). Compressed strands can be filtered based on the design of the template sequence (Methods).
  • Figs. 19A - 19E depict the coded strand architecture for sequence reconstruction.
  • Fig. 19A depicts a DNA information storage channel. Data is converted to template sequences, synthesized (strand ), and can be stored in vitro. Retrieval starts with sequencing, then
  • Fig, 19B depicts the coded strand architecture, 'scaffold', enables
  • Fig. 19C depicts a 16-base transition sequence, E0, is synthesized and sequenced with Illumina. Examples of diverse strands "' produced by synthesis of E0.
  • Strands are aligned, by Needleman-Wunsch, to the template.
  • Ambiguous alignments can exist depending on the location and number of missing nucleotides within a strand ' .
  • Fig. 19D depicts the error analysis for purified strands of E0. Synthesized strands were purified in silico, by filtering for strands R between 32-48 bases in length, and aligned by Needleman-Wunsch to the E0 template. For each alignment, the number of mismatches, insertions, and missing nucleotides were tabulated.
  • Fig. 19E depicts evaluating the diversity of synthesized
  • the number of sequencing reads for each length of strand^C was tabulated. Diversity was evaluated as the number of unique variants at each length of strand^C and the Levenshtein edit distance was computed with respect to the E0 template.
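As a rough illustration of this tabulation, the sketch below computes the Levenshtein edit distance with the standard dynamic-programming recurrence and groups read counts, unique variants, and edit distances by strand length. The input layout (a dictionary from compressed strand to read count) and the function names are assumptions.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def diversity_by_length(read_counts, template):
    """Summarise reads, unique variants, and edit distances per strand length.

    `read_counts` maps each compressed strand to its number of reads
    (a hypothetical input format).
    """
    summary = defaultdict(lambda: {"reads": 0, "variants": 0, "edit_distances": []})
    for strand, reads in read_counts.items():
        row = summary[len(strand)]
        row["reads"] += reads
        row["variants"] += 1
        row["edit_distances"].append(levenshtein(strand, template))
    return dict(summary)
```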
  • the set of 802 purified strands contains 2 perfect strands.
  • Figs. 20A - 20C depict the synthesis error analyses and diversity of all synthesized strands of E0. All synthesized strands^R of E0 were sequenced with Illumina and transitions of non-identical nucleotides were extracted to form strands^C. Fig. 20A depicts the length distribution for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands. Distributions are derived via kernel density estimation. The number of all strands and perfect strands for the template sequence is as follows: E0 {all: 79192, perfect: 3}.
  • a sequence of non-identical nucleotides was extracted to form strand^C, which is then aligned, by Needleman-Wunsch, to its respective template sequence.
  • Fig. 20B depicts that for each alignment, the numbers of mismatches, insertions, and missing nucleotides from strand^C are tabulated.
  • Fig. 20C depicts that the number of sequencing reads at each length (number of strand^C nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each strand^C length and the Levenshtein edit
  • Strands were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants.
  • Figs. 21A - 21B depict the constraints for valid transitions between nucleotides. As physical processes, both chemical synthesis and enzymatic synthesis have constraints for valid transitions between nucleotides. A transition matrix with no self-transitions (Fig. 21A) and a transition matrix excluding specific transitions (Fig. 21B) are depicted. Based on whether certain transitions are permitted, there exists a fundamental limit for the maximum number of bits per nucleotide that is possible to store. This limit is equal to
  • Figs. 22A - 22B depict the placement and modulation of information into template sequences.
  • Fig. 22A depicts the placement of information within a template sequence for both experimental and simulated storage systems.
  • template sequences contained 8 or 16 nucleotides each.
  • template sequences contained 38, 74, or 152 nucleotides each.
  • Each nucleotide in a template sequence either stores 1 trit (blue), 1 bit (red), or is allocated for synchronization (orange).
  • Fig. 22B depicts a modulation scheme to map 16 bits to a sequence of 16 nucleotides. As an intermediate step, 16 bits are converted to a mixture of 8 trits and 4 bits using map M1 (Table 9).
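A mixture of 8 trits and 4 bits can hold 16 bits because 3^8 x 2^4 = 104,976 exceeds 2^16 = 65,536. The sketch below shows one possible mixed-radix conversion that has this property. It is a hypothetical stand-in and is not map M1 of Table 9, which is not reproduced here; the function names and the choice of which bits are stored directly are assumptions.

```python
def bits16_to_trits_and_bits(value):
    """Split a 16-bit integer into 8 trits and 4 bits (hypothetical mapping, not map M1)."""
    if not 0 <= value < 2 ** 16:
        raise ValueError("value must fit in 16 bits")
    low_bits = [(value >> k) & 1 for k in range(4)]   # 4 bits stored directly
    rest = value >> 4                                  # remaining 12 bits (< 4096 <= 3**8)
    trits = []
    for _ in range(8):
        rest, t = divmod(rest, 3)
        trits.append(t)
    return trits, low_bits

def trits_and_bits_to_bits16(trits, low_bits):
    """Inverse of bits16_to_trits_and_bits."""
    rest = 0
    for t in reversed(trits):
        rest = rest * 3 + t
    value = rest << 4
    for k, b in enumerate(low_bits):
        value |= b << k
    return value
```

With 12 information-carrying nucleotides (8 trits plus 4 bits) and the synchronization nucleotides, such a layout would fill a 16-nucleotide template sequence.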
  • Figs. 23 A - 23 B depict the Markov model for the production of DNA strands.
  • Fig. 23A depicts that a Markov model provides a statistical framework for the production of DNA strands " created from a desired template sequence.
  • the k-th state of the Markov model specifies the process for writing the k-th nucleotide in the template sequence.
  • An example is provided for the template sequence (AGCT).
  • the Markov model contains states which include a deletion error
  • Fig. 23B depicts that in the event of synthesis of a strand^C nucleotide, either a correct write occurs with probability 1 - p_sub, or a write error (mismatch or substituted strand^C nucleotide) occurs with total probability p_sub. A specific substitution error occurs with probability
  • a function of (x, y) mathematically represents the probability for substitutions of different strand^C nucleotides.
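To make the Markov description of strand production more concrete, the following sketch simulates writing each template nucleotide in turn, with a chance of deletion, substitution, or insertion at each state. The probability values, the uniform choice among substituted or inserted nucleotides, and the function name are assumptions for illustration and are not taken from the source model.

```python
import random

NUCLEOTIDES = "ACGT"

def synthesize_strand(template, p_del=0.1, p_sub=0.05, p_ins=0.02, rng=None):
    """Simulate one strand produced from `template` under a simple per-nucleotide error model.

    All probabilities are illustrative assumptions.
    """
    rng = rng or random.Random()
    out = []
    for base in template:
        if rng.random() < p_ins:
            out.append(rng.choice(NUCLEOTIDES))       # spurious extra nucleotide
        if rng.random() < p_del:
            continue                                   # template nucleotide is skipped
        if rng.random() < p_sub:
            # write error: any of the three other nucleotides, chosen uniformly
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            out.append(base)                           # correct write
    return "".join(out)
```

For example, `synthesize_strand("AGCT", rng=random.Random(1))` draws one variant of the AGCT template used as the example above; repeated calls yield a diverse set of strands.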
  • Figs. 24A - 24E depict reconstruction of a template sequence by MAP estimation.
  • a template sequence may be successfully reconstructed from multiple DNA strands^C.
  • a template DNA sequence, associated scaffold sequence, and mathematical representation is c
  • the entries of the alpha and beta tables represent alpha forward probabilities and beta backward probabilities, and are computed incrementally and efficiently based on dynamic programming recursions. These alpha and beta probabilities are necessary for the MAP estimation of each nucleotide in the template sequence as illustrated in (Fig. 24D) and (Fig. 24E). Specifically, an example of decoding the fourth nucleotide of the template sequence is provided in (Fig. 24D).
  • This decoding involves determining the following probabilities: P(ATCGCT | **CA*A**), P(ATCGCT | **CT*A**), P(ATCGCT | **CC*A**), and P(ATCGCT | **CG*A**), each representing the fact that either an A, T, C, or G is possible for the fourth nucleotide respectively.
  • the decomposition of the probability P(ATCGCT | **CG*A**) into different cases is given in (Fig. 24E).
  • the result of MAP estimation applied for all nucleotides reveals that a nearly correct reconstruction of the template sequence is possible even with one received DNA strand^C, and that errors may be localized to their proper positions within the sequence.
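The idea of choosing the most probable nucleotide at one position, given a scaffold of known synchronization nucleotides, can be sketched as follows. This is a simplified illustration under stated assumptions: the channel is deletion-only, all templates consistent with the scaffold are equally likely a priori, adjacent template nucleotides must differ, and unknown positions are marginalized by brute-force enumeration rather than by the alpha and beta recursions described above. Function names and the deletion probability are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def likelihood(received, template, p_del):
    """P(received | template) for a deletion-only channel, via a forward recursion."""
    n, m = len(template), len(received)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            p = alpha[i - 1][j] * p_del                      # template nucleotide deleted
            if j > 0 and template[i - 1] == received[j - 1]:
                p += alpha[i - 1][j - 1] * (1 - p_del)       # nucleotide written correctly
            alpha[i][j] = p
    return alpha[n][m]

def candidates(scaffold):
    """Yield templates matching the scaffold ('*' = unknown) with no identical neighbours."""
    slots = [NUCLEOTIDES if c == "*" else c for c in scaffold]
    for combo in product(*slots):
        if all(a != b for a, b in zip(combo, combo[1:])):
            yield "".join(combo)

def map_estimate(received_strands, scaffold, position, p_del=0.2):
    """Return the most probable nucleotide at `position` and its posterior distribution."""
    weights = dict.fromkeys(NUCLEOTIDES, 0.0)
    for template in candidates(scaffold):
        p = 1.0
        for received in received_strands:
            p *= likelihood(received, template, p_del)
        weights[template[position]] += p
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no candidate template explains the received strands")
    posterior = {b: w / total for b, w in weights.items()}
    return max(posterior, key=posterior.get), posterior
```

A call such as `map_estimate(["ATCGCT"], "**C**A**", 3)` mirrors the example of estimating the fourth nucleotide from a single received strand; the enumeration is only practical for short templates, which is why the recursions are preferred for real decoding.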
  • Figs. 25A - 25C depict the coded strand architecture for robust information storage in imperfectly synthesized DNA strands.
  • Fig. 25A depicts that the message "Eureka!" was encoded and partitioned into four template sequences, E1-E4. Each sequence stores a 2-bit address and 14 bits of data and these bits are mapped to a template sequence of 16 nucleotides, which includes four synchronization nucleotides (dark gray). Synthesis was performed with initiators tethered to beads and sequencing was performed on the Illumina platform.
  • Fig. 25B depicts retrieving information from E1-E4.
  • Synthesized strands^R were sequenced using the Illumina sequencing-by-synthesis (SBS) platform and purified in silico based on raw length of 32-48 nucleotides (Methods). The decoding accuracy for each sequence is defined as the probability of 100% correct data retrieval for a given number of reads, estimated over 500 decoding trials. Each trial is based on a randomly drawn set of purified strand^C variants. A 90% decoding accuracy (gray band) is considered sufficient for robust data retrieval, and the accuracy could be further reinforced by other codec modules.
  • Fig. 25C depicts the decoding of E3.
  • a set of 10 DNA strands^C is decoded as two sets of five n
  • the decoder uses MAP estimation and a scaffold to determine the probability for each of the four nucleotides at every position.
  • the decoded sequence is a probabilistic consensus of the reconstructed sequences from MAP estimation and successfully retrieves the data stored in E3.
  • Fig. 26 depicts the raw lengths for all and perfect strands for E1-E4. All synthesized strands of E1-E4 were sequenced with Illumina. Length distributions for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands are shown.
  • Distributions are derived via kernel density estimation.
  • the numbers of all strands and perfect strands for each template sequence are as follows: E1 {all: 119677, perfect: 21}; E2 {all: 106983, perfect: 3}; E3 {all: 106793, perfect: 3}; E4 {all: 146710, perfect: 19}.
  • Figs. 27A - 27B depict the synthesis error analysis for all strands and purified strands
  • strands^C. Each of these strands is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the fraction of strands with the indicated number of mismatches, insertions, and missing nucleotides is tabulated. The set of all strands is evaluated in (Fig. 27A), and the set of purified strands, obtained by filtering the length of the corresponding strands^R between 32-48 bases, assuming an extension length of 3 to 4 bases per template nucleotide, is evaluated in (Fig. 27B).
  • Figs. 28A - 28B depict the lengths, diversity, and edit distance for all and purified strands for E1-E4. All synthesized strands^R of E1-E4 were sequenced with Illumina and
  • Strands^C were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants.
  • the number of sequencing reads at each length (number of strand^C nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each length and the Levenshtein edit distance is computed according to its respective template sequence.
  • Fig. 29 depicts the diversity of compressed synthesized strands for E0.
  • Strands^C obtained for template sequence E0. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E0 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 30 depicts the diversity of compressed synthesized strands for E1.
  • Strands obtained for sequence E1. Different strand variants are ranked in the vertical axis in order of the number of reads per variant.
  • the strands are arranged on the horizontal axis in order of increasing length.
  • most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 31 depicts the diversity of compressed synthesized strands for E2. Strands obtained for sequence E2. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E2 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 32 depicts the diversity of compressed synthesized strands for E3. Strands obtained for sequence E3. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E3 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 33 depicts the diversity of compressed synthesized strands for E4. Strands obtained for sequence E4. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E4 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Figs. 34A - 34H depict the decoding curves for E1-E4 template sequences for "Eureka!". Results are shown for the successful reconstruction of sequences E1-E4 from the in silico size-selected set of DNA strands^C.
  • All decoding curves illustrate the probability of correct decoding of a sequence vs. the number of purified reads of synthesized DNA strands.
  • the probability of correct decoding is based on 500 decoding trials, each of which involves sampling a set of purified DNA strands according to the target number of total reads. In each decoding trial, the sampled set of DNA strands is filtered further based on the number of reads per strand (between 1 and 5 reads per strand).
  • the 10 strands with the longest length are selected for reconstruction via MAP decoding and consensus.
  • Decoding curves are presented for sequences E1-E4 in (Fig. 34 A), (Fig. 34C), (Fig. 34E), and (Fig. 34G) respectively when applying the different filters based on reads per strand.
  • the best decoding results from the filters are compiled for each datapoint to produce the "Best MAP Decoding" curve in (Fig. 34B), (Fig. 34D), (Fig. 34F), and (Fig. 34H).
  • This curve is compared to decoding with the two-step baseline filter used for H01-H12, which outputs the longest DNA strand that also has the highest number of reads among strands of equal length. Taken together, these results show that decoding accuracy improves substantially when applying MAP decoding and consensus with 10 filtered strands compared to baseline decoding with one filtered strand.
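The baseline filter and the trial-based accuracy estimate can be sketched as below. The baseline picks the longest strand and breaks ties by read count, as described above; the accuracy estimate repeats random sampling of reads and reports the fraction of trials that return the template. Sampling without replacement, the fixed seed, and the function names are assumptions.

```python
import random
from collections import Counter

def baseline_decode(reads):
    """Return the longest strand; among equal lengths, the one with the most reads."""
    counts = Counter(reads)
    return max(counts, key=lambda s: (len(s), counts[s]))

def decoding_accuracy(pool, template, n_reads, decoder=baseline_decode,
                      trials=500, seed=0):
    """Estimate the probability of correct decoding from `n_reads` sampled reads.

    `pool` is a list of purified compressed strands, one entry per read
    (a hypothetical input format).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(pool, n_reads)     # sampling without replacement (assumption)
        hits += decoder(sample) == template
    return hits / trials
```

Any decoder with the same signature, such as a MAP-plus-consensus decoder, could be passed as `decoder` to produce a comparable curve.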
  • Figs. 35A - 35C depict a roadmap for scaling DNA storage systems.
  • Fig. 35A depicts the efficiency of storage for experimental and simulated systems.
  • Experimental systems are shown in black; simulated maximum storage systems are shown as white circles.
  • the amount of bits stored per sequence is dependent on the amount of error-correction codes (ECC) that are applied. Reducing ECCs increases the efficiency rate of storage.
  • the upper bound theoretical limit represents a maximum efficiency of storage of ~1.58 bits per transition between non-identical nucleotides.
  • the lower bound theoretical limit represents the minimum number of bits per template sequence that must be stored for addressing only.
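The figure of about 1.58 bits per transition is consistent with log2(3), since each transition can move to any of three non-identical nucleotides. More generally, a standard result for constrained sequences is that the maximum rate equals the base-2 logarithm of the largest eigenvalue of the 0/1 matrix of permitted transitions. The sketch below estimates that eigenvalue by power iteration; the function name and iteration count are assumptions, and the simple iteration is only expected to converge for well-behaved (primitive) transition matrices.

```python
import math

def capacity_bits_per_transition(allowed, iterations=200):
    """Estimate log2 of the largest eigenvalue of a 0/1 transition matrix.

    allowed[i][j] is 1 if nucleotide j may follow nucleotide i, else 0.
    """
    n = len(allowed)
    v = [1.0] * n
    growth = 0.0
    for _ in range(iterations):
        w = [sum(allowed[i][j] * v[j] for j in range(n)) for i in range(n)]
        growth = max(w)
        if growth == 0:
            raise ValueError("no transitions are permitted")
        v = [x / growth for x in w]          # renormalise to avoid overflow
    return math.log2(growth)

# Transition matrix with no self-transitions over A, C, G, T.
NO_SELF = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
```

Calling `capacity_bits_per_transition(NO_SELF)` gives log2(3); forbidding further transitions lowers the value.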
  • Fig. 35B depicts that flexible-write storage is enabled by a codec which harnesses diversely synthesized strands. The decoding pipeline supports robust data retrieval from synthesized strands with a significant percentage of errors.
  • Fig. 35C depicts a system architecture for storing information in enzymatically-synthesized DNA.
  • a bitstream is partitioned into rows, each augmented with an address to delineate its order for reassembly.
  • An ECC such as a Bose-Chaudhuri-Hocquenghem (BCH) code can be applied to each row, or an ECC such as a Reed-Solomon (RS) code can be applied across multiple rows, to protect data from errors.
  • Modulation consists of mapping sequences of bits to template sequences, which include synchronization nucleotides. Enzymatic synthesis then produces multiple diverse strands^C per template sequence. The resulting strands^C are used for sequence reconstruction based on MAP estimation and probabilistic consensus. Subsequently, the reconstructed sequence is demodulated into bits. Error-correction is applied to ensure data retrieval.
  • Figs. 36A - 36F depict the estimated capacity in bits per template sequence with increased synthesis accuracy for simulated DNA storage systems. Tradeoffs between estimated capacity (bits stored per sequence) vs. synthesis accuracy.
  • Fig. 36A estimated capacity vs. synthesis accuracy measured in terms of the probability of deletions only (missing nucleotides) or (Fig. 36B) including additional 5% substitution and 2% insertion errors.
  • Fig. 36C estimated capacity vs. synthesis accuracy measured in terms of the probability of deletions only (missing nucleotides) or (Fig. 36D) including additional 5% substitution and 2% insertion errors.
  • Fig. 36E estimated capacity vs. synthesis accuracy measured in terms of the probability of deletions only (missing nucleotides) or (Fig. 36F) including additional 5% substitution and 2% insertion errors.
  • the estimated capacity decreases smoothly as synthesis accuracy decreases. The tradeoffs are non-linear. If more compressed strand variants are utilized for decoding, the estimated capacity increases.
  • Figs. 37A - 37F depict the waterfall decoding curves for simulated DNA storage systems. Simulation results for successfully decoding and retrieving information from multiple DNA strands synthesized per sequence. Decoding results are visualized as "waterfall curves", representing the probability of correct retrieval for varying levels of errors tolerated per strand. The boundary of error-tolerance for all displayed systems is between 25-30% per strand^C, including missing nucleotides (deletions), mismatches (substitutions), and insertion errors. This error tolerance is obtained for decoding with up to 10 diverse strands^C per sequence. (Fig. 37A) Decoding 23 bits of information stored in template sequences of 38 nucleotides, based on multiple strands^C containing only missing nucleotides and (Fig. 37B) with the inclusion of mismatches (substitutions) and insertion errors.
  • (Fig. 37C) Decoding 36 bits of information stored in template sequences of 74 nucleotides, based on multiple strands^C containing only missing nucleotides.
  • Figs. 38A - 38D depict the majority alignment of DNA strands per sequence. Simulation results for decoding sequences using the majority alignment algorithm.
  • Template sequences have (Fig. 38A) 16, (Fig. 38B) 24, (Fig. 38C) 74, and (Fig. 38D) 152 nucleotides respectively. Each template sequence is randomly created per decoding trial. A total of 1000 decoding trials were simulated per data point. The production of DNA strands from a template sequence is simulated according to a Markov model with probability of deletion per nucleotide. Sequences are decoded from either 10, 100, or 1000 diverse strands^C. Majority alignment achieves an increase in decoding accuracy given more strands^C. However, the decoding accuracy reaches a theoretical limit. The error-tolerance saturates at approximately
  • Figs. 39A - 39B depict the system architecture of a codec for storing information in DNA.
  • Fig. 39A depicts a high-level block diagram of a DNA storage system. Data is represented as bits of information which are encoded into a set of DNA sequences. De novo synthesis (e.g., enzymatic synthesis) of each sequence results in the creation of diverse DNA strands which can be stored at high volumetric density. For random-access retrieval of data, a subset of the DNA strands may be PCR-amplified and then sequenced (e.g., using Illumina or nanopore sequencing technologies). DNA sequencing results in several reads. All reads are clustered, filtered, processed in silico, and provided to a decoder for reconstruction.
  • Fig. 39B depicts a detailed block diagram of a codec for robust storage of digital information in DNA.
  • the encoder first partitions payload data into rows of bits. Each row is prefixed with an address (turquoise) to delineate its order. To recover missing rows of data, an error-correction code (ECC) may be applied per block of rows, resulting in redundant rows of information (purple). Additionally, an ECC may be applied per row/sequence of data, resulting in redundant bits per row (light green).
  • Each row of bits is modulated into a DNA sequence of nucleotides (blue) containing interspersed synchronization nucleotides (orange). Synthesis of each sequence results in diverse compressed strands which may contain nucleotide errors (red).
  • the decoder fully or partially reconstructs DNA sequences using synchronization alignment and consensus algorithms. After demodulation of DNA sequences to rows/sequences of bits, the decoder may apply error-correction decoding per row/sequence to correct remaining bit errors (red). The decoder then orders all rows according to their addresses. If any rows are missing, additional error-correction may be applied across rows using a block ECC. The final step of the decoder is to extract the original payload data from the ordered rows of bits. Overall, the encoding and decoding pipelines ensure the robust storage of data in DNA sequences.
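The partitioning and reordering steps of the codec can be sketched as follows, leaving out the ECC and modulation stages. Bits are represented as strings of "0" and "1"; the function names, zero-padding of the last row, and the 8-bit character encoding in the usage note are assumptions.

```python
def encode_rows(payload_bits, data_bits_per_row, address_bits):
    """Partition a bit string into rows, each prefixed with a binary address."""
    rows = []
    for start in range(0, len(payload_bits), data_bits_per_row):
        chunk = payload_bits[start:start + data_bits_per_row]
        chunk = chunk.ljust(data_bits_per_row, "0")          # pad the final row
        address = format(len(rows), f"0{address_bits}b")
        rows.append(address + chunk)
    if len(rows) > 2 ** address_bits:
        raise ValueError("address field too small for the number of rows")
    return rows

def decode_rows(rows, address_bits, payload_length):
    """Order rows by address and strip addresses to recover the payload bits."""
    ordered = sorted(rows, key=lambda r: int(r[:address_bits], 2))
    return "".join(r[address_bits:] for r in ordered)[:payload_length]
```

As a usage example, if "Eureka!" is taken as seven 8-bit characters (56 bits), `encode_rows(bits, 14, 2)` yields four 16-bit rows, matching a layout of a 2-bit address plus 14 data bits per sequence.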
  • Figs. 40A - 40E depict an array-format enzymatic synthesis platform.
  • Fig. 40A depicts that the prototype is comprised of two main parts: a Mantis liquid handler, which has a single robotic arm that can be programmed to dispense one of six reagents at a time, and custom jigs, which were either laser cut (Epilog Legend 36EXT) or machined (gift from Formulatrix) to hold the glass slide acting as a solid support substrate for the DNA.
  • Fig. 40B depicts that the enzymatic mix is dispensed according to programmed coordinates on the treated slide, resulting in a 2D grid of features.
  • Fig. 40C depicts that the Mantis places the enzymatic mix, according to programmed coordinates, in serial to all features on the slide.
  • Fig. 40D depicts that for each synthesis cycle, there are four dispense cycles, one for each of the four nucleotide triphosphates used. The specific nucleotide triphosphate is dispensed only to the desired features (bold).
  • Fig. 40E depicts that the Mantis has a single dispenser and places the nucleotide triphosphate, according to programmed coordinates, in serial to the desired features on the slide.
  • Fig. 41 depicts the raw lengths for all and perfect raw strands for S01-S03.
  • the numbers of all strands and perfect strands for each template sequence are as follows: S01 rep 1 {all: 192989, perfect: 1}; S01 rep 2 {all: 220921, perfect: 684}; S01 rep 3 {all: 153002, perfect: 286}; S02 rep 1 {all: 277897, perfect: 3545}; S02 rep 2 {all: 385615, perfect: 4889}; S02 rep 3 {all: 176680, perfect: 248}; S03 rep 1 {all: 185327, perfect: 464}; S03 rep 2 {all: 169000, perfect: 273}; S03 rep 3 {all: 209018, perfect: 898}. The S01 rep 1 distribution for perfect strands is not visible due to the low number of perfect strands.
  • Figs. 42A - 42B depict the synthesis error analysis for all and purified strands for S01-S03.
  • Figs. 43A - 43B depict the lengths, diversity, and edit distance for all and purified strands for S01-S03. All synthesized strands^R of S01-S03 were sequenced with Illumina and transitions extracted. Run-length compressed strands (strands^C) were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants. The number of sequencing reads at each length (number of strand^C nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each length and the Levenshtein edit distance is computed according to its respective template sequence. These measurements are presented for all synthesized strands^C (Fig. 43A) or for a set of purified strands^C (Fig. 43B) obtained by filtering the length of the corresponding strands^R between 39-52 bases, assuming an extension length of 3 to 4 bases per template nucleotide.
  • Figs. 44A - 44B depict the reagent cost projections for phosphoramidite chemistry and enzymatic synthesis.
  • the minimum feature size is 2.37 nm, which corresponds to the diameter of double-stranded DNA.
  • the price per megabyte for 1 million features with current feature sizes of 15 (gray circle) or 38 microns (gray diamond) are indicated.
  • Embodiments of the present disclosure are directed to methods of decoding a nucleotide sequence.
  • the nucleotide sequence contains one or more encoded values, or a series of encoded values, corresponding to a format of information. Each value or value point within the nucleotide sequence is represented as a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence.
  • the steps of decoding include determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value that was originally encoded in the nucleotide sequence corresponding to the format of information.
  • the step of determining the nucleotide sequence includes sequencing according to methods known to those skilled in the art. In one embodiment, sequencing includes nanopore sequencing.
  • the values are represented by a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence, which can be identified.
  • Each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information.
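The decoding steps above can be sketched for the ternary case. The rule used here is a hypothetical assignment chosen for illustration: each transition is read as a digit 0, 1, or 2 according to the rank of the new nucleotide among the three nucleotides that differ from the previous one (in the order A, C, G, T), starting from an assumed reference nucleotide. The source only requires that each transition be assigned a predetermined value and does not fix this particular rule.

```python
NUCLEOTIDES = "ACGT"

def decode_transitions(sequence, start="A"):
    """Assign one ternary digit to each transition between non-identical nucleotides.

    `start` is an assumed reference nucleotide preceding the sequence.
    """
    digits = []
    prev = start
    for base in sequence:
        if base == prev:
            continue                      # still inside the same homopolymer: no transition
        options = [b for b in NUCLEOTIDES if b != prev]
        digits.append(options.index(base))
        prev = base
    return digits
```

Because repeated nucleotides are skipped, a raw read with homopolymer runs and its run-length compressed form decode to the same digits.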
  • the value corresponding to the format of information can be obtained from many input sources, including but not limited to analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources.
  • the disclosure contemplates digital values.
  • Digital values can include multiple digits according to a specific need.
  • the digital values include two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more digits to accommodate a certain need or application.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more digits.
  • the series of digital values comprises two, three or more different digital values.
  • Each digital value of the series of digital values represents one of the two, three or more different digital values.
  • the disclosure contemplates natural nucleotides or nonnatural nucleotides for information encoding, storage and decoding.
  • the nucleotides can be RNA or DNA.
  • the nucleotides can include adenine, cytosine, guanine, thymine and uridine.
  • Any format of information can be converted into corresponding values and encoded in the nucleotide sequence.
  • a format of information includes but is not limited to text, image, video or audio format, sensor data, and combinations thereof.
  • the present disclosure contemplates the use of nucleotide transitions for information encoding and decoding.
  • the transition can be from a certain nucleotide to another different or nonidentical nucleotide.
  • the transition can also be from a certain nucleotide or nucleotide homopolymer to another different or nonidentical nucleotide or nucleotide homopolymer.
  • the transition between one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge.
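Treating each homopolymer-to-homopolymer boundary as a single transition amounts to run-length compressing the raw read. A minimal sketch, with hypothetical function names:

```python
from itertools import groupby

def compress(raw_strand):
    """Collapse each homopolymer run to a single nucleotide."""
    return "".join(base for base, _ in groupby(raw_strand))

def count_transitions(raw_strand):
    """Number of boundaries between non-identical neighbouring homopolymers."""
    return max(len(compress(raw_strand)) - 1, 0)
```

For example, the raw read AAAGGGGCTTT compresses to AGCT and contains three transitions, regardless of how long each homopolymer run is.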
  • each nucleotide transition or boundary or edge is assigned a predetermined digital value.
  • the series of digital values includes a corresponding barcode.
  • the disclosed method further contemplates decoding a plurality of nucleotide sequences.
  • Each member of the plurality encodes for an identical value or series of identical values corresponding to the format of information.
  • the nucleotide sequence or a plurality of nucleotide sequences can be attached to a substrate or solid support.
  • Embodiments of the present disclosure are directed to a method of decoding a nucleotide sequence encoding for a series of digital values corresponding to a format of information.
  • the nucleotide sequence can be determined by sequencing methods known to those skilled in the art to identify nucleotide homopolymers. Each homopolymer is assigned one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic synthesis, and a particular digital value is assigned for each of the one or more nucleotides.
  • the predicted homopolymer length can be determined from empirical observation.
  • the predicted homopolymer length is a median, a mean, or a mode based on data collected from empirical observation.
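One way to picture the homopolymer-length approach is to divide each observed run length by the predicted length per encoded nucleotide and round to the nearest whole number. The rounding rule, the minimum of one nucleotide per run, and the function names below are assumptions; the source states only that the predicted length may be a median, mean, or mode of empirical observations.

```python
from itertools import groupby
from statistics import median

def predicted_length(observed_run_lengths):
    """Median homopolymer length observed per single encoded nucleotide."""
    return median(observed_run_lengths)

def call_nucleotides(raw_strand, predicted):
    """Assign one or more nucleotides to each homopolymer run in a raw read."""
    called = []
    for base, run in groupby(raw_strand):
        run_length = sum(1 for _ in run)
        copies = max(1, round(run_length / predicted))   # at least one nucleotide per run
        called.append(base * copies)
    return "".join(called)
```

With a predicted length of 3, the raw read AAAGGGGGGCCC would be called as AGGC, since the G run is about twice the predicted length.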
  • Embodiments of the present disclosure are directed to a method of sequencing and decoding a plurality of nucleotide sequences representing a format of information wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences.
  • the nucleotide sequences are determined and series of digital values for the sequences within a first portion of the plurality of nucleotide sequences are decoded and translated into the portion of the format of information.
  • the sequencing and decoding are continued in series for additional portions into series of digital values and the series of digital values are translated into the portions of the format of information until the entire format of information is achieved.
  • Embodiments of the present disclosure further provide a method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence.
  • the method includes for each digital value, assigning a corresponding nucleotide to different or nonidentical nucleotide transition to generate the nucleotide sequence, synthesizing the nucleotide sequence, and optionally storing the nucleotide sequence.
  • the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more digits.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more digits.
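For ternary digits, the encoding method can be sketched with a hypothetical assignment in which each trit selects one of the three nucleotides that differ from the previous one (in the order A, C, G, T), starting from an assumed reference nucleotide. This guarantees that every digit produces a transition between non-identical nucleotides; the specific rule and names are illustrative assumptions rather than the mapping used in the disclosure.

```python
NUCLEOTIDES = "ACGT"

def encode_trits(trits, start="A"):
    """Map each ternary digit to a nucleotide that differs from its predecessor.

    `start` is an assumed reference nucleotide preceding the template sequence.
    """
    prev = start
    out = []
    for t in trits:
        if t not in (0, 1, 2):
            raise ValueError("each digit must be 0, 1, or 2")
        options = [b for b in NUCLEOTIDES if b != prev]
        prev = options[t]
        out.append(prev)
    return "".join(out)
```

Under this rule the digits 0, 1, 2, 0 become CGTA, and no two adjacent nucleotides are ever identical, so the sequence survives homopolymer extension during synthesis.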
  • Embodiments of the present disclosure also provide a method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences or a plurality of DNA strands.
  • the plurality of nucleotide sequences or DNA strands are separated (packetized) into many packets.
  • each packet includes a plurality of DNA strands.
  • each packet includes a plurality of identical DNA strands.
  • each nucleotide sequence or DNA strand can include a unique identifier (such as a barcode sequence) corresponding to the specific packet of information.
  • each packet includes a plurality of identical nucleotide sequences (each as an independent DNA strand), thus, sequencing one strand in that packet is sufficient since the remaining strands are considered redundant.
  • each packet includes a plurality of near perfect identical nucleotide sequences (each as an independent DNA strand), due to encoding errors. In this case, an algorithm is designed to sample a predetermined number of nucleotide sequences with redundant identifiers, which leads to decoding of the format of information.
  • the algorithm will dictate for each packet, sequencing and decoding more than one strand with a specific identifier until a certain confidence of correctness is reached, without requiring sequencing of all the strands with the same/redundant identifier.
  • the sequence with its unique identifier is stored. In this manner, redundant sequencing of the same nucleotide sequence is prevented using the selective sequencer.
  • the selective sequencer is a sequencing platform that can prevent or halt redundant sequencing of the nucleotide sequences based on the unique identifier that is associated with the nucleotide sequence.
  • the selective sequencer is a nanopore sequencer that includes the selective functionality.
  • Embodiments of the disclosure relate to optimizing packet information management to improve data accuracy and increase the content loading speed, which can drive faster internet connections for many types of utilities including cellphones.
  • the information stored in DNA is packetized (separated) into units of DNA strands.
  • each packet can contain multiple copies of representative DNA strands. In decoding or retrieving the stored information, it would be more efficient to sequence one or a few representative DNA strands for each packet.
  • the initial results and simulations shown in Fig. 9D indicated that sequencing time and cost can be reduced by at least 2-fold, which would be a dramatic benefit when scaled to very large datasets.
  • Embodiments of the disclosure are directed to the use of the selective sequencer to optimize packet information management.
  • the selective sequencer has a first feature which can generate DNA sequences on the fly. This is an improvement over the current state of the art sequencer (Illumina being an exemplary case), which must fully sequence the DNA strand that was deposited on the sequencer before the sequence data can be used for further decoding, retrieval or recovery.
  • the Oxford Nanopore sequencer allows each DNA strand to be sequenced and decoded independently. This asynchronous sequencing allows processing and decoding each packet on the fly.
  • the selective sequencer has a second feature such that after a packet is sequenced and decoded, the sequencer moves on to sequence only the strands of the remaining unsequenced packets.
  • the sequencer is able to physically prevent further redundant sequencing of copies of DNA strands of the decoded packets.
  • a unique identifier such as a barcode, or header index is included in the DNA strands which signals the sequencer whether the strand has been decoded so that the sequencer can make a decision of whether to block continued sequencing.
  • Oxford Nanopore's nanopore sequencing platform has the first feature, and there has been a proof-of-concept demonstration for the second feature for sequencing genomes (DNA strands of biological origin, not of synthetic origin). This platform performs the second feature by physically kicking the DNA strand out of the pore after reading just a fraction of the DNA strand.
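The two features of the selective sequencer (reading strands one at a time, then ejecting strands whose packet is already decoded) can be illustrated with a small simulation. This is a sketch under stated assumptions: the identifier is assumed to sit at the start of each strand with a fixed length `id_length`, one full read is assumed sufficient to decode a packet, and the function name is hypothetical. It does not model any real sequencer interface.

```python
def selective_sequencing(strands, id_length=4):
    """Simulate selective sequencing over (identifier + payload) strands.

    Returns (decoded, bases_read, bases_total), where bases_total is the
    cost of sequencing every strand in full.
    """
    decoded = {}
    bases_read = 0
    bases_total = 0
    for strand in strands:
        bases_total += len(strand)
        identifier = strand[:id_length]
        bases_read += id_length  # the identifier is always read
        if identifier in decoded:
            continue  # packet already decoded: eject the strand
        decoded[identifier] = strand[id_length:]
        bases_read += len(strand) - id_length
    return decoded, bases_read, bases_total


# Three redundant copies of one packet and a single copy of another.
strands = ["AAAACGTACGTA", "AAAACGTACGTA", "CCCCTTGGTTGG", "AAAACGTACGTA"]
decoded, read, total = selective_sequencing(strands)
print(decoded, read, total)
```

In this toy run only the identifier of each redundant copy is read, so the number of bases sequenced drops relative to reading every strand in full.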
  • nanopore sequencing is artificially slowed down to obtain high accuracy reads because it is highly error-prone.
  • Embodiments of the disclosure are thus directed to interspersing the unique identifier throughout the DNA strand to improve accuracy of sequencing using nanopore sequencing. Theoretically, the sequencing rate of nanopore sequencing can increase more than 20-fold, and at this rate, the error rate will likely be even higher.
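One way to see why interspersed identifier copies help with error-prone reads is a per-position majority vote across the copies. The sketch below is illustrative only: the fixed layout (one identifier copy before every `chunk` payload bases), the function names, and the restriction to substitution errors are assumptions. Real nanopore reads also contain insertions and deletions, which this sketch does not handle.

```python
from collections import Counter


def intersperse(identifier, payload, chunk=8):
    """Place a copy of the identifier before every payload chunk."""
    parts = []
    for i in range(0, len(payload), chunk):
        parts.append(identifier + payload[i:i + chunk])
    return "".join(parts)


def recover_identifier(read, id_length, chunk=8):
    """Majority-vote each identifier position across all copies in the read."""
    step = id_length + chunk
    copies = [read[i:i + id_length] for i in range(0, len(read), step)]
    copies = [c for c in copies if len(c) == id_length]
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


read = intersperse("ACGT", "CATGCATGCATGCATGCATGCATG")
noisy = read[:12] + "G" + read[13:]  # corrupt the first base of the second copy
print(recover_identifier(noisy, 4))
```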
  • the sequence information can be stored in a suitable medium including computer memory.
  • the stored sequence information can be further decoded into digital values.
  • Any unique identifier can be used including a synthetic sequence or barcode sequence.
  • the synthetic sequence or barcode sequence is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence.
  • a plurality of nucleotide sequences can be labeled with a plurality of unique identifiers.
  • the method can further include sequencing a predetermined number of nucleotide sequences; assembling the packet of information; and analyzing the assembled information to determine if the information is correctly decoded.
  • the method further includes permitting sequencing of any nucleotide sequences that were not correctly decoded.
  • the assembled information can be analyzed using a decoding algorithm.
  • a format of information is first converted to a binary sequence, such as zeros "0s" and ones "1s", and then to a ternary sequence, such as zeros "0s", ones "1s", and twos "2s", although any number can be used.
  • Each digit of the ternary sequence corresponds to a transition of different or non-identical nucleotides according to a conversion scheme.
  • the ternary bit sequence is further converted to a corresponding oligonucleotide sequence.
  • Figs. 8B-8C and Fig. 9A provide an exemplary embodiment of such a conversion scheme.
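The conversion chain (information, then binary, then ternary, then nucleotide transitions) can be sketched in code. The actual conversion scheme is the one shown in the figures, which are not reproduced here, so the mapping below is an assumption chosen only for illustration: each ternary digit selects one of the three nucleotides that differ from the previous nucleotide, taken in the fixed order A, C, G, T. The six-digit width per byte and the starting nucleotide `start` (standing in for the last base of the initiator) are also assumptions.

```python
BASES = "ACGT"


def byte_to_trits(value, width=6):
    """Convert one byte (0-255) to a fixed-width list of ternary digits."""
    trits = []
    for _ in range(width):
        value, remainder = divmod(value, 3)
        trits.append(remainder)
    return trits[::-1]


def trits_to_dna(trits, start="A"):
    """Map each ternary digit to a transition to a different nucleotide."""
    prev = start
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # three non-identical options
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def encode(text, start="A"):
    """ASCII text -> bytes (binary) -> ternary digits -> nucleotide transitions."""
    trits = []
    for byte in text.encode("ascii"):
        trits.extend(byte_to_trits(byte))
    return trits_to_dna(trits, start)


print(encode("A"))  # → CATCAT
```

Because every digit forces a change of nucleotide, the output never contains two identical adjacent bases, which is the property the transition-based scheme relies on.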
  • the oligonucleotide sequence is synthesized and contains the encoded format of information. Synthesis can be carried out according to methods known to those skilled in the art. Embodiments of the disclosure are directed to enzymatic synthesis of oligonucleotides.
  • a template-independent DNA polymerase such as a terminal deoxynucleotidyl transferase (TdT) is used.
  • an initiator oligonucleotide (a primer/an initiator) immobilized to a solid support is sequentially contacted by a reaction mixture that comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations.
  • any enzymatic, chemical or physical methods or reagents can be used to control the length of the nucleotide extension/polymerization.
  • one or more desired/selected nucleotides is added to the extending oligonucleotide chain until corresponding oligonucleotide sequence is formed.
  • the nucleotide triphosphate includes dATP, dTTP, dCTP, dGTP, and dUTP.
  • the synthesis activity is modulated by the ratio of the amount of TdT to the amount of apyrase.
  • divalent cations comprising magnesium and cobalt
  • additives comprising glycerol, sucrose, PEG8000, betaine, DMSA, Triton-X100 and Tween20 can also modulate the enzymatic reaction. Since each bit represents a transition between different or non-identical nucleotides, the information can be accurately encoded into oligonucleotide sequences independent of the length of each nucleotide extension/polymerization.
  • the disclosure provides that during each round of nucleotide extension/polymerization, one type of selected nucleotide triphosphate is added. In one embodiment, the excessive nucleotide triphosphate is inactivated by apyrase. This inactivation allows for multiple rounds of nucleotide polymerization that each adds a different nucleotide to the initiator or growing polynucleotide chain.
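The claim that the encoding is independent of the length of each extension can be checked with a toy model. In the sketch below each synthesis round adds a random number of copies of the selected nucleotide (a stand-in for uncontrolled TdT extension before apyrase inactivates the remaining triphosphate), and collapsing each homopolymer run recovers the intended transitions. The function names, the uniform run-length distribution, and `max_run` are assumptions for illustration, not measured behaviour.

```python
import random
from itertools import groupby


def simulate_synthesis(transitions, max_run=5, seed=0):
    """Each round adds 1..max_run copies of the selected nucleotide."""
    rng = random.Random(seed)
    return "".join(base * rng.randint(1, max_run) for base in transitions)


def collapse_runs(strand):
    """Reduce every homopolymer run to a single nucleotide."""
    return "".join(base for base, _ in groupby(strand))


target = "CATCAT"  # consecutive nucleotides always differ
strand = simulate_synthesis(target)
print(strand, collapse_runs(strand) == target)
```

The recovery works only because consecutive rounds always add a different nucleotide; two identical consecutive rounds would merge into one run.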
  • Embodiments of the present disclosure are directed to a method of decoding a format of information from a synthesized oligonucleotide sequence encoding bit sequences of the format of information.
  • the synthesized oligonucleotide sequence containing the encoded information can be amplified.
  • the amplified oligonucleotide sequence is sequenced and the sequence can be converted to bit sequences according to the encoding scheme wherein each bit represents a transition between different or non-identical nucleotides.
  • the bit sequences can be converted back to the format of information.
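The decoding direction can be sketched the same way. The mapping used here is an assumption for illustration rather than the scheme of the figures: each transition is read as a ternary digit equal to the position of the new nucleotide among the three nucleotides (in the order A, C, G, T) that differ from the previous one, six ternary digits are taken per ASCII character, and `start` stands in for the last base of the initiator. Homopolymer runs are collapsed first, since only transitions carry information.

```python
from itertools import groupby

BASES = "ACGT"


def dna_to_trits(strand, start="A"):
    """Collapse homopolymer runs, then read each transition as a ternary digit."""
    collapsed = [base for base, _ in groupby(strand)]
    prev = start
    trits = []
    for base in collapsed:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))  # assumes the first base differs from start
        prev = base
    return trits


def trits_to_text(trits, width=6):
    """Group ternary digits into fixed-width values and map them to characters."""
    chars = []
    for i in range(0, len(trits) - width + 1, width):
        value = 0
        for t in trits[i:i + width]:
            value = value * 3 + t
        chars.append(chr(value))
    return "".join(chars)


read = "CCAAATCCATT"  # sequenced strand with variable-length runs
print(trits_to_text(dna_to_trits(read)))  # → A
```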
  • the oligonucleotide sequence is ligated to a universal adaptor before amplification.
  • Embodiments of the present disclosure are directed to a method of storing information using nucleotides.
  • a format of information is first converted into a sequence of binary ASCII bits, then converted into a ternary sequence, which is further converted into a corresponding oligonucleotide sequence such that one bit of the ternary sequence represents a transition between different or non-identical nucleotides.
  • the corresponding oligonucleotide sequence is synthesized by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleotide of the initiator oligonucleotide, and wherein the apyrase degrades excessive nucleotide triphosphates to inactive diphosphates and monophosphates, and (b) repeating step (a) until the corresponding oligonucleotide sequence is formed, and storing the synthesized corresponding oligonucleotide sequence.
  • the initiator oligonucleotides are immobilized on beads and pre-mixed with reagents that include TdT, apyrase and reaction buffer.
  • the initiator oligonucleotides can also be immobilized on the surface of a solid support such as beads or on the surface of a fluidic channel.
  • Certain embodiments of the disclosure are directed to an initiator that is attached by a cleavable moiety. This mixture is sequentially contacted with one type of the desired nucleotide triphosphates (dNTPs).
  • the ratio of the amount of TdT to the amount of apyrase in the reaction reagents modulates the enzymatic synthesis.
  • the desired or selected nucleotide is a natural nucleotide or any nucleotide analog known to those skilled in the art.
  • the reaction reagent can include a buffer comprising a monovalent salt, a divalent salt, a buffering agent, and a reducing agent at a suitable pH and temperature.
  • the selected concentration of reaction reagents is determined by the selected nucleotide triphosphate present in the reaction reagent.
  • a washing step is included between each round of enzymatic synthesis.
  • the present disclosure provides methods of enzymatic oligonucleotide synthesis which enable rapid and high-accuracy synthesis of custom DNA sequences by the template-independent DNA polymerase terminal deoxynucleotidyl transferase (TdT).
  • the methods according to the present disclosure can be used for synthesis of cheaper, more accurate and longer custom DNA sequences for various biochemical, biomedical, or biosynthetic applications.
  • the methods according to the present disclosure can facilitate the use of DNA as an information storage medium.
  • a solid-phase synthesis device can be used to record digital information in DNA molecules.
  • the method according to the disclosure further comprises releasing the polynucleotide after the desired sequence of nucleotides has been added to the 3' end of the polynucleotide.
  • the method according to the disclosure further comprises releasing the polynucleotide using an enzyme, a chemical, light, heat or other suitable method or reagent.
  • the method according to the disclosure further comprises releasing the polynucleotide, collecting the polynucleotide, amplifying the polynucleotide and sequencing the polynucleotide.
  • nucleotide triphosphate inactivating enzyme is an apyrase.
  • nucleotide triphosphate inactivating enzyme is a nucleotide triphosphate degrading enzyme that degrades nucleotide triphosphates at a rate slower than rate of addition of nucleotides by the error prone or template independent DNA polymerase.
  • the nucleotide triphosphate inactivating enzyme is a nucleotide triphosphate degrading enzyme present at a concentration that degrades nucleotide triphosphates at a rate slower than rate of addition of nucleotides by the present concentration of the error prone or template independent DNA polymerase.
  • the nucleotide triphosphate inactivating enzyme comprises ATP diphosphohydrolase, dNTP pyrophosphatases, dNTPases, and phosphatases.
  • the concentration of nucleotide triphosphate inactivating enzyme is modulated to control addition of one or more nucleotides.
  • the nucleotide triphosphate inactivating enzyme renders free nucleotide triphosphates inactive.
  • the nucleotide inactivating enzyme renders free nucleotide triphosphates inactive by degradation.
  • the nucleotide inactivating enzyme renders free nucleotide triphosphates inactive by polymerizing them with each other.
  • the reaction conditions present a competing reaction between addition of free nucleotide triphosphates to the initiator sequence and degradation of free nucleotide triphosphates.
  • Polymerases including without limitation error-prone or template-dependent polymerases, modified or otherwise, can be used to create nucleotide polymers having a random or known or desired sequence of nucleotides.
  • Template-independent polymerases whether modified or otherwise, can be used to create the nucleic acids de novo. Ordinary nucleotides are used, such as A, T/U, C or G. Nucleotides may be used which lack chain terminating moieties.
  • a template independent polymerase may be used to make the nucleic acid sequence. Such template independent polymerase may be error-prone which may lead to the addition of more than one nucleotide resulting in a homopolymer.
  • oligonucleotide sequences or polynucleotide sequences are synthesized using an error prone polymerase, such as template independent error prone polymerase, and common or natural nucleic acids, which may be unmodified.
  • Initiator sequences or primers are attached to a substrate, such as a silicon dioxide substrate, at various locations whether known, such as in an addressable array, or random.
  • Reagents including at least a selected nucleotide, a template independent polymerase and other reagents required for enzymatic activity of the polymerase are applied at one or more locations of the substrate where the initiator sequences are located and under conditions where the polymerase adds one or more than one or a plurality of the nucleotide to the initiator sequence to extend the initiator sequence.
  • the nucleotides (“dNTPs") may be applied or flow in periodic applications. Nucleotides with blocking groups or reversible terminators can be used with the dNTPs under reaction conditions that are sufficient to limit or reduce the probability of enzymatic addition of the dNTP to one dNTP, i.e. one dNTP is added using the selected reaction conditions taking into consideration the reaction kinetics.
  • a microfluidic channel or microfluidic channels having an input and an output can be used to deliver reaction fluids including reagents, such as a polymerase, a nucleotide and other appropriate reagents and washes to particular locations on a substrate within the flow cell, such as within a microfluidic channel.
  • reaction conditions will be based on dimensions of the substrate reaction region, reagents, concentrations, reaction temperature, and the structures used to create and deliver the reagents and washes.
  • pH and other reactants and reaction conditions can be optimized for the use of TdT to add a dNTP to an existing nucleotide or oligonucleotide in a template independent manner.
  • reagents and reaction conditions for dNTP addition such as initiator size, divalent cation and pH.
  • TdT was reported to be active over a wide pH range with an optimal pH of 6.85. Methods of providing or delivering dNTP, rNTP or rNDP are useful in making nucleic acids.
  • As used herein, the terms “nucleic acid molecule,” “nucleic acid sequence,” “nucleic acid fragment” and “oligomer” are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides that may have various lengths, including either deoxyribonucleotides or ribonucleotides, or analogs thereof.
  • nucleotide refers to a nucleoside having one or more phosphate groups joined in ester linkages to the sugar moiety. Exemplary nucleotides include nucleoside monophosphates, diphosphates and triphosphates.
  • In general, the terms “nucleic acid molecule,” “nucleic acid sequence,” “nucleic acid fragment,” “oligonucleotide” and “polynucleotide” are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof.
  • an oligonucleotide is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA).
  • deoxynucleotides such as dATP, dCTP, dGTP, dTTP
  • rNTPs ribonucleotide triphosphates
  • rNDPs ribonucleotide diphosphates
  • oligonucleotide sequence is the alphabetical representation of a polynucleotide molecule; alternatively, the term may be applied to the polynucleotide molecule itself.
  • This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching.
  • Oligonucleotides may optionally include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
  • the present disclosure contemplates any deoxyribonucleotide or ribonucleotide and chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of the bases, and the like.
  • natural nucleotides are used in the methods of making the nucleic acids. Natural nucleotides lack chain terminating moieties. According to certain aspects, nucleotides with blocking groups or reversible terminators can be used in certain embodiments. Nucleotides with blocking groups or reversible terminators are known to those of skill in the art.
  • nucleotide analog refers to a non-standard nucleotide, including non-naturally occurring ribonucleotides or deoxyribonucleotides.
  • nucleotide analogs are modified at any position so as to alter certain chemical properties of the nucleotide yet retain the ability of the nucleotide analog to perform its intended function.
  • positions of the nucleotide which may be derivatized include the 5 position, e.g., 5-(2-amino)propyl uridine, 5-bromo uridine, 5-propyne uridine, 5-propenyl uridine, etc.; the 6 position, e.g., 6-(2-amino)propyl uridine; the 8-position for adenosine and/or guanosines, e.g., 8-bromo guanosine, 8-chloro guanosine, 8-fluoroguanosine, etc.
  • Nucleotide analogs also include deaza nucleotides, e.g., 7-deaza-adenosine; O- and N-modified (e.g., alkylated, e.g., N6-methyl adenosine, or as otherwise known in the art) nucleotides; and other heterocyclically modified nucleotide analogs such as those described in Herdewijn, Antisense Nucleic Acid Drug Dev., 2000 Aug. 10(4):297-310.
  • Nucleotide analogs may also comprise modifications to the sugar portion of the nucleotides.
  • the 2' OH-group may be replaced by a group selected from H, OR, R, F, Cl, Br, I, SH, SR, NH2, NHR, NR2, COOR, or OR, wherein R is substituted or unsubstituted C1-C6 alkyl, alkenyl, alkynyl, aryl, etc.
  • Other possible modifications include those described in U.S. Pat. Nos. 5,858,988, and 6,291,438.
  • the phosphate group of the nucleotide may also be modified, e.g., by substituting one or more of the oxygens of the phosphate group with sulfur (e.g., phosphorothioates), or by making other substitutions which allow the nucleotide to perform its intended function such as described in, for example, Eckstein, Antisense Nucleic Acid Drug Dev. 2000 Apr. 10(2):117-21, Rusckowski et al. Antisense Nucleic Acid Drug Dev. 2000 Oct. 10(5):333-45, Stein, Antisense Nucleic Acid Drug Dev. 2001 Oct. 11(5):317-25, Vorobjev et al. Antisense Nucleic Acid Drug Dev. 2001 Apr.
  • modified nucleotides include, but are not limited to diaminopurine, S2T, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D- man
  • Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone.
  • Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexhylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxysuccinimide esters (NHS).
  • a nucleic acid used in the invention can also include native or non-native bases.
  • a native deoxyribonucleic acid can have one or more bases selected from the group consisting of adenine, thymine, cytosine or guanine and a ribonucleic acid can have one or more bases selected from the group consisting of uracil, adenine, cytosine or guanine.
  • Exemplary non-native bases that can be included in a nucleic acid, whether having a native backbone or analog structure, include, without limitation, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, 5-methylcytosine, 5-hydroxymethyl cytosine, 2-aminoadenine, 6-methyl adenine, 6-methyl guanine, 2-propyl guanine, 2-propyl adenine, 2-thiouracil, 2-thiothymine, 2-thiocytosine, 15-halouracil, 15-halocytosine, 5-propynyl uracil, 5-propynyl cytosine, 6-azo uracil, 6-azo cytosine, 6-azo thymine, 5-uracil, 4-thiouracil, 8-halo adenine or guanine, 8-amino adenine or guanine, 8-thiol adenine
  • adenine or guanine, 5-halo substituted uracil or cytosine, 7-methylguanine, 7-methyladenine, 8-azaguanine, 8-azaadenine, 7-deazaguanine, 7-deazaadenine, 3-deazaguanine, 3-deazaadenine or the like.
  • unique barcode sequences may be attached to each nucleic acid, i.e. DNA or RNA strands. Then adapters and/or primers or other reagents known to those of skill in the art may be used as desired to sequence or amplify the nucleic acid with the unique barcode sequence.
  • polymerases are used to build nucleic acid molecules, such as for representing information which is referred to herein as being recorded in the nucleic acid sequence or the nucleic acid is referred to herein as being storage media.
  • Polymerases are enzymes that produce a nucleic acid sequence, for example, using DNA or RNA as a template. Polymerases that produce RNA polymers are known as RNA polymerases, while polymerases that produce DNA polymers are known as DNA polymerases. Polymerases that incorporate errors are known in the art and are referred to herein as "error-prone polymerases". Template independent polymerases may be error prone polymerases.
  • Error-prone polymerases will either accept a non-standard base, such as a reversible chain terminating base, or will incorporate a different nucleotide, such as a natural or unmodified nucleotide that is selectively given to it as it tries to copy a template.
  • Template-independent polymerases such as terminal deoxynucleotidyl transferase (TdT), also known as DNA nucleotidylexotransferase (DNTT) or terminal transferase create nucleic acid strands by catalyzing the addition of nucleotides to the 3' terminus of a DNA molecule without a template.
  • Cobalt is a cofactor; however, the enzyme also catalyzes the reaction in vitro when Mg or Mn is supplied.
  • Nucleic acid initiators may be 4 or 5 nucleotides or longer and may be single stranded or double stranded. Double stranded initiators may have a 3' overhang or they may be blunt ended or they may have a 3' recessed end.
  • TdT, like all DNA polymerases, also requires divalent metal ions for catalysis.
  • TdT is unique in its ability to use a variety of divalent cations such as Co2+, Mn2+, Zn2+ and Mg2+.
  • the extension rate of the primer p(dA)n (where n is the chain length from 4 through 50) with dATP in the presence of divalent metal ions is ranked in the following order: Mg2+ > Zn2+ > Co2+ > Mn2+.
  • each metal ion has different effects on the kinetics of nucleotide incorporation.
  • Mg2+ facilitates the preferential utilization of dGTP and dATP whereas Co2+ increases the catalytic polymerization efficiency of the pyrimidines, dCTP and dTTP.
  • Zn2+ behaves as a unique positive effector for TdT since reaction rates with Mg2+ are stimulated by the addition of micromolar quantities of Zn2+. This enhancement may reflect the ability of Zn2+ to induce conformational changes in TdT that yield higher catalytic efficiencies. Polymerization rates are lower in the presence of Mn2+ compared to Mg2+, suggesting that Mn2+ does not support the reaction as efficiently as Mg2+.
  • TdT is provided in Biochim Biophys Acta., May 2010; 1804(5): 1151-1166 hereby incorporated by reference in its entirety.
  • the nucleotide pulse replaces Mg++ with other cation(s), such as Na+, K+, Rb+, Be++, Ca++, or Sr++
  • the nucleotide can bind but not incorporate, thereby regulating whether the nucleotide will incorporate or not.
  • an optional pre-wash pulse without nucleotide or Mg++ can be provided, or an Mg++ buffer without nucleotide can then be provided.
  • the incorporation of specific nucleic acids into the polymer can be regulated.
  • these polymerases are capable of incorporating nucleotides independent of the template sequence and are therefore beneficial for creating nucleic acid sequences de novo.
  • the combination of an error-prone polymerase and a primer sequence serves as a writing mechanism for imparting information into a nucleic acid sequence.
  • By controlling the primer/initiator, the nucleotide substrate, or the template independent polymerase, the addition of a nucleotide to an initiator sequence or an existing nucleotide or oligonucleotide can be regulated to produce an oligonucleotide by extension.
  • these polymerases are capable of incorporating nucleotides without a template sequence and are therefore beneficial for creating nucleic acid sequences de novo.
  • polymers such as nucleotide sequences, including DNA strands identified herein may be sequenced by passing the strand through nanopores or nanogaps or nanochannels to determine the individual nucleic acid/nucleotide.
  • Nanopore means a hole or passage having a nanometer scale width.
  • Exemplary nanopores include a hole or passage through a membrane formed by a multimeric protein ring. Typically, the passage is 0.2-25 nm wide.
  • Nanopores may include transmembrane structures that may permit the passage of molecules through a membrane. Examples of nanopores include α-hemolysin (Staphylococcus aureus) and MspA (Mycobacterium smegmatis).
  • Nanopores may be found in the art describing nanopore sequencing or described in the art as pore-forming toxins, such as the β-PFTs Panton-Valentine leukocidin S, aerolysin, and Clostridial Epsilon-toxin, the α-PFTs cytolysin A, the binary PFT anthrax toxin, or others such as pneumolysin or gramicidin.
  • Nanopores have become technologically and economically significant with the advent of nanopore sequencing technology. Methods for nanopore sequencing are known in the art, for example, as described in US 5,795,782, which is incorporated by reference.
  • nanopore detection involves a nanopore-perforated membrane immersed in a voltage-conducting fluid, such as an ionic solution including, for example, KCl, NaCl, NiCl2, LiCl or other ion forming inorganic compounds known to those of skill in the art.
  • a voltage is applied across the membrane, and an electric current results from the conduction of ions through the nanopore.
  • Nanopores within the scope of the present disclosure include solid state nonprotein nanopores known to those of skill in the art and DNA origami nanopores known to those of skill in the art. Such nanopores provide a nanopore width larger than known protein nanopores which allow the passage of larger molecules for detection while still being sensitive enough to detect a change in ionic current when the complex passes through the nanopore.
  • Nanopore sequencing means a method of determining the components of a polymer based upon interaction of the polymer with the nanopore. Nanopore sequencing may be achieved by measuring a change in the conductance of ions through a nanopore that occurs when the size of the opening is altered by interaction with the polymer.
  • the present disclosure envisions the use of a nanogap which is known in the art as being a gap between two electrodes where the gap is about a few nanometers in width such as between about 0.2 nm to about 25 nm or between about 2 and about 5 nm. The gap mimics the opening in a nanopore and allows polymers to pass through the gap and between the electrodes.
  • aspects of the present disclosure also envision use of a nanochannel in which electrodes are placed adjacent to the nanochannel through which the polymer passes. It is to be understood that one of skill will readily envision different embodiments of molecule or moiety identification and sequencing based on movement of a molecule or moiety through an electric field and creating a distortion of the electric field representative of the structure passing through the electric field.
  • Methods described herein are capable of generating large amounts of data (billions of bits). Accordingly, high throughput methods of sequencing these nucleic acid molecules, such as that disclosed in Mitra (1999) Nucleic Acids Res. 27(24):e34; pp.1-6, are useful. In preferred embodiments, high throughput methods are used with PCR amplicons or other nucleic acid molecules having lengths of less than 100 bp.
  • PCR amplicons of 100 bp, 110 bp, 120 bp, 130 bp, 140 bp, 150 bp, 160 bp, 170 bp, 180 bp, 190 bp, 200 bp, 250 bp, 300 bp, 350 bp, 400 bp, 450 bp, 500 bp, 550 bp, 600 bp, 650 bp, 700 bp, 750 bp, 800 bp, 850 bp, 900 bp, 950 bp, 1000 bp or more may be used.
  • Sequencing methods useful in the present disclosure include sequencing-by-ligation, sequencing-by-synthesis, and sequencing-by-hybridization, as known to those skilled in the art.
  • Shendure et al., Accurate multiplex polony sequencing of an evolved bacterial genome, Science, vol. 309, p. 1728-32, 2005; Drmanac et al., Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays, Science, vol. 327, p. 78-81, 2009; McKernan et al., Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding, Genome Res., vol. 19, p. 1527-41.
  • Sequencing primers are those that are capable of binding to a known binding region of the target polynucleotide and facilitating ligation of an oligonucleotide probe of the present disclosure. Sequencing primers may be designed with the aid of a computer program such as, for example, DNAWorks, or Gene2Oligo. The binding region can vary in length but it should be long enough to hybridize the sequencing primer. Target polynucleotides may have multiple different binding regions thereby allowing different sections of the target polynucleotide to be sequenced. Sequencing primers are selected to form highly stable duplexes so that they remain hybridized during successive cycles of ligation.
  • Sequencing primers can be selected such that ligation can proceed in either the 5' to 3' direction or the 3' to 5' direction or both. Sequencing primers may contain modified nucleotides or bonds to enhance their hybridization efficiency, or improve their stability, or prevent extension from one terminus or the other.
  • single stranded DNA templates are prepared by PCR amplification to be used with sequencing primers.
  • single stranded template is attached to beads or nanoparticles in an emulsion and amplified through ePCR.
Supports and Attachment
  • one or more oligonucleotide sequences described herein are immobilized on a support (e.g., a solid and/or semi-solid support).
  • an oligonucleotide sequence can be attached to a support using one or more of the phosphoramidite linkers described herein.
  • Suitable supports include, but are not limited to, slides, beads, chips, particles, strands, gels, sheets, tubing, spheres, containers, capillaries, pads, slices, films, plates and the like.
  • a solid support may be biological, nonbiological, organic, inorganic, or any combination thereof.
  • Supports of the present invention can be any shape, size, or geometry as desired. Supports may be made from glass (silicon dioxide), metal, ceramic, polymer or other materials known to those of skill in the art. Supports may be a solid, semi-solid, elastomer or gel.
  • a support is a microarray.
  • Oligonucleotides immobilized on microarrays include nucleic acids that are generated in or from an assay reaction.
  • the oligonucleotides or polynucleotides on microarrays are single stranded and are covalently attached to the solid phase support, usually by a 5'-end or a 3'-end.
  • probes are immobilized via one or more cleavable linkers.
  • a covalent interaction is a chemical linkage between two atoms or radicals formed by the sharing of a pair of electrons (i.e., a single bond), two pairs of electrons (i.e., a double bond) or three pairs of electrons (i.e., a triple bond).
  • Covalent interactions are also known in the art as electron pair interactions or electron pair bonds.
  • Noncovalent interactions include, but are not limited to, van der Waals interactions, hydrogen bonds, weak chemical bonds (i.e., via short-range noncovalent forces), hydrophobic interactions, ionic bonds and the like.
  • affixing or immobilizing nucleic acid molecules to the substrate is performed using a covalent linker that is selected from the group that includes oxidized 3-methyl uridine, an acrylyl group and hexaethylene glycol. In addition to the attachment of linker sequences to the molecules of the pool for use in directional attachment to the support, a restriction site or regulatory element (such as a promoter element, cap site or translational termination signal) is, if desired, joined with the members of the pool.
  • Nucleic acids that have been synthesized on the surface of a support may be removed, such as by a cleavable linker or linkers known to those of skill in the art.
  • Linkers can be designed with chemically reactive segments which are optionally cleavable with agents such as enzymes, light, heat, pH buffers, and redox reagents. Such linkers can be employed to pre-fabricate an in situ solid-phase inactive reservoir of a different solution-phase primer for each discrete feature. Upon linker cleavage, the primer would be released into solution for PCR, perhaps by using the heat from the thermocycling process as the trigger.
  • affixing of nucleic acid molecules to the support is performed via hybridization of the members of the pool to nucleic acid molecules that are covalently bound to the support.
  • reagents and washes are delivered so that the reactants are present at a desired location for a desired period of time to, for example, covalently attach a dNTP to an initiator sequence or to an existing nucleotide attached at the desired location.
  • a selected nucleotide reagent liquid is pulsed or flowed or deposited at the reaction site where reaction takes place and then may be optionally followed by delivery of a buffer or wash that does not include the nucleotide.
  • Suitable delivery systems include fluidics systems, microfluidics systems, syringe systems, ink jet systems, pipette systems and other fluid delivery systems known to those of skill in the art.
  • flow cell embodiments or flow channel embodiments or microfluidic channel embodiments are envisioned which can deliver separate reagents or a mixture of reagents or washes, using pumps or electrodes or other methods known to those of skill in the art of moving fluids, through one or more channels or microfluidic channels to a reaction region or vessel where the surface of the substrate is positioned so that the reagents can contact the desired location where a nucleotide is to be added.
  • a microfluidic device is provided with one or more reservoirs which include one or more reagents which are then transferred via microchannels to a reaction zone where the reagents are mixed and the reaction occurs.
  • Such microfluidic devices and the methods of moving fluid reagents through such microfluidic devices are known to those of skill in the art.
  • Immobilized nucleic acid molecules may, if desired, be produced using a device (e.g., any commercially-available inkjet printer, which may be used in substantially unmodified form) which sprays a focused burst of reagent-containing solution onto a support (see Castellino (1997) Genome Res. 7:943-976, incorporated herein in its entirety by reference).
  • Such a method is currently in practice at Incyte Pharmaceuticals and Rosetta Biosystems, Inc., the latter of which employs "minimally modified Epson Inkjet cartridges" (Epson America, Inc.; Torrance, CA).
  • the method of inkjet deposition depends upon the piezoelectric effect, whereby a narrow tube containing a liquid of interest (in this case, oligonucleotide synthesis reagents) is encircled by an adapter.
  • An electric charge sent across the adapter causes the adapter to expand at a different rate than the tube, and forces a small drop of liquid reagents from the tube onto a coated slide or other support.
  • Reagents can be deposited onto a discrete region of the support, such that each region forms a feature of the array.
  • the feature is capable of generating an anion toroidal vortex as described herein.
  • the desired nucleic acid sequence can be synthesized drop-by-drop at each position, as is true for other methods known in the art. If the angle of dispersion of reagents is narrow, it is possible to create an array comprising many features. Alternatively, if the spraying device is more broadly focused, such that it disperses nucleic acid synthesis reagents in a wider angle, as much as an entire support is covered each time, and an array is produced in which each member has the same sequence (i.e., the array has only a single feature).
  • This example describes an embodiment of using nucleotide transitions to encode a format of information using DNA polymerase-catalyzed DNA oligonucleotide sequences.
  • the encoded DNA sequence can be stored or decoded.
  • Such an enzymatic based nucleotide synthesis can catalyze the linkage of naturally occurring deoxynucleotide triphosphates (dNTPs) rapidly, in a single step, and under non-toxic biocompatible conditions, as compared to chemical methods (Fig. 1).
  • the methods used terminal deoxynucleotidyl transferase (TdT), a unique template-independent DNA polymerase which rampantly and indiscriminately adds dNTP substrates to the 3' termini of DNA strands (F. J. Bollum, Thermal conversion of nonpriming deoxyribonucleic acid to primer. J. Biol. Chem. 234, 2733-2734 (1959), F. J. Bollum, Oligodeoxyribonucleotide-primed reactions catalyzed by calf thymus polymerase. J. Biol. Chem. 237, 1945-1949 (1962), L. M. Chang, F. J. Bollum, Molecular biology of terminal transferase.
  • dNTPs are added by TdT before being degraded by apyrase (Figs. 2A-2C, Figs. 3A-3C, Figs. 4A-4C and Fig. 5).
  • the lowest dNTP concentrations required for maximum coupling efficiency were further determined (Fig. 6), such that adding nucleotide substrates in series would result in stepwise increases in DNA length (Fig. 7).
  • Embodiments of the disclosure provide an enzymatic synthesis strategy that is rapid and simple, requiring few components to produce DNA with a given information content (Fig. 8A).
  • Embodiments of the disclosure include a reaction mixture of short oligonucleotide initiators, TdT, and apyrase.
  • the initiators are immobilized on solid supports, such as beads or a surface, to allow removal of reaction byproducts and facilitate downstream processing and amplification.
  • TdT extends the initiators until the substrate is degraded by apyrase, allowing immediate addition of subsequent nucleotide substrates.
  • Adding a series of dNTPs results in a population of DNA strands, all extended by the same order of nucleotides. While extension lengths may vary across strands, the same information content is stored in the whole population as transitions between different or non-identical nucleotides (Fig. 8B).
  • trits were used to maximize information capacity, given three possible transitions for each nucleotide (Fig. 8C). Trits are the ternary equivalent of bits; one trit is log(3)/log(2) (about 1.58496) bits of information.
  • the message "hello world!" was encoded and synthesized (Fig. 9A).
  • To encode each character, its binary ASCII representation was first converted to ternary and then to nucleotide transitions (Table 1). Each character was then synthesized as its own DNA strand preceded by a header index to specify strand order. Following synthesis, these strands were ligated to a universal adapter, PCR amplified, and stored as a single pool without additional purification (Materials and Methods). These 12 DNA strands, each with 8 trits, carry the 144 bits of data (Table 1).
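  • As an illustration of the conversion described above, the following minimal sketch packs a header index and one ASCII character into 12 bits, converts that value to 8 trits, and walks the trits through a transition table. The 4-bit header width, the transition table (each trit selects one of the three other nucleotides in alphabetical order), the starting nucleotide G, and all function names are assumptions made for illustration; the mapping actually used is the one given in Fig. 8C and Table 1.

```python
BASES = "ACGT"  # alphabetical order; the trit-to-base choice below is hypothetical


def to_trits(value, width):
    """Return `value` as `width` base-3 digits, most significant first."""
    trits = []
    for _ in range(width):
        value, digit = divmod(value, 3)
        trits.append(digit)
    if value:
        raise ValueError("value does not fit in the requested number of trits")
    return trits[::-1]


def from_trits(trits):
    """Inverse of to_trits."""
    value = 0
    for t in trits:
        value = value * 3 + t
    return value


def trits_to_template(trits, start="G"):
    """Each trit picks one of the three nucleotides that differ from the previous one."""
    prev, out = start, []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def template_to_trits(template, start="G"):
    """Inverse of trits_to_template."""
    prev, trits = start, []
    for base in template:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def encode_char(index, char):
    """Pack an assumed 4-bit index and an 8-bit ASCII code into an 8-nucleotide template."""
    value = (index << 8) | ord(char)
    return trits_to_template(to_trits(value, 8))


def decode_template(template):
    """Recover (index, character) from an 8-nucleotide template."""
    value = from_trits(template_to_trits(template))
    return value >> 8, chr(value & 0xFF)


templates = [encode_char(i, c) for i, c in enumerate("hello world!")]
```

  • Because 3^8 = 6561 exceeds 2^12 = 4096, any 12-bit value fits in 8 trits, and by construction no template contains two identical adjacent nucleotides.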
  • the pool of DNA strands was sequenced using both Illumina and Oxford Nanopore platforms, and nucleotide transitions were extracted from each read by performing run-length encoding, a lossless data compression algorithm ubiquitously used in modern communications.
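  • A minimal sketch of how run-length encoding exposes nucleotide transitions in a read is given below. The function names and the example read are hypothetical, and the read is assumed to be already trimmed of initiator and adapter sequence.

```python
from itertools import groupby


def run_length_encode(read):
    """Return (nucleotide, run length) pairs for consecutive identical nucleotides."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(read)]


def extract_transitions(read):
    """Keep one nucleotide per run, discarding the variable extension lengths."""
    return "".join(base for base, _ in run_length_encode(read))


example = "AAACCAGGGT"
runs = run_length_encode(example)
compressed = extract_transitions(example)
```

  • The run lengths correspond to extension lengths, which vary between strands, while the order of the runs carries the stored information.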
  • the correct transition was the most abundant species, comprising 88.6%, on average, of sequences filtered for the expected number of transitions and 19%, on average, of all sequences (Fig. 9B).
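  • The filtering step described above can be sketched as follows: reads are collapsed to their transitions, kept only if they show the expected number of nucleotides, and the most abundant remaining species is reported with its share of the filtered set. This is an illustrative sketch under the assumption that reads are already trimmed; the function names and example reads are hypothetical.

```python
from collections import Counter
from itertools import groupby


def extract_transitions(read):
    """Collapse each run of identical nucleotides to a single nucleotide."""
    return "".join(base for base, _ in groupby(read))


def most_abundant_species(reads, expected_length):
    """Filter collapsed reads by length, then return the top species and its fraction."""
    collapsed = [extract_transitions(r) for r in reads]
    kept = [c for c in collapsed if len(c) == expected_length]
    if not kept:
        return None, 0.0
    sequence, count = Counter(kept).most_common(1)[0]
    return sequence, count / len(kept)


reads = ["AACCCGGT", "ACGT", "ACCGGGT", "AGT", "TTGCA"]
best, fraction = most_abundant_species(reads, expected_length=4)
```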
  • the remainder of the reads largely contained deletions and, to a smaller extent, mismatches and insertions (Figs. 10 and 11).
  • the same pool was next sequenced with the Oxford Nanopore MinION and a similar result was observed (Fig.
  • DNA translocation rates through nanopores may be increased since nucleotide transitions are, in principle, easier to detect (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005), M. Vega, P. Granell, C. Lasorsa, B. Lerner, M. Perez, Automated and inexpensive method to manufacture solid-state nanopores and micropores in robust silicon wafers. J. Phys. Conf. Ser. 687, 012029 (2016), B.
  • the present disclosure contemplates improvements and design optimizations of the nucleotide encoding and decoding methods described herein.
  • the current implementation of the methods results in an approximately 25-fold decrease in information density compared to the maximum possible for DNA, which is more than a thousand-fold better than electronic storage systems (V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, W. L. Hughes, Nucleic acid memory. Nat. Mater. 15, 366-370 (2016), G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337, 1628 (2012), Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture. Science.
  • coding systems that are tailored to these biochemical processes may enable the use of all transitions, by considering extension lengths, and provide highly efficient data recovery, saving on synthesis and sequencing costs even with imperfectly synthesized DNA strands.
  • the length of nucleotide extensions per transition may be considered a design optimization and tuned according to application demands, trading density for read-out by specialized nanopore sequencing (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005), M. Vega, P. Granell, C. Lasorsa, B. Lerner, M.
  • TdT to apyrase ratio optimization: To obtain a ratio of TdT polymerization activity to apyrase degradation activity that would allow for net positive extension of the initiator, initiator extension by TdT was assessed in the presence of a wide range of apyrase concentrations with every dNTP substrate (Fig. 2A).
  • Each reaction was carried out in 20 µL total volume. All reaction components but the dNTP were assembled in 18 µL while the dNTP was prepared in 2 µL of water.
  • the 18 µL mix was composed such that upon mixing with the 2 µL dNTP solution, the following initial composition would be obtained: 200 µM dNTP, 1X Enzymatics Green Buffer, 0.05 µM f-P5-SBS3 initiator oligo, 1 U/µL TdT, and 4, 2, 1, 0.5, or 0.25 milliunits (mU) of apyrase per microliter.
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
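  • The concentrations that each part of such a two-part reaction must carry before mixing follow from simple dilution arithmetic (C1·V1 = C2·V2). The sketch below works this out for a 20 µL reaction assembled from an 18 µL mix and a 2 µL dNTP solution with 200 µM final dNTP; the function name is hypothetical and the calculation is an illustration rather than part of the described protocol.

```python
def pre_mix_concentration(final_concentration, total_volume, part_volume):
    """Concentration a component needs in its own part so that it reaches
    `final_concentration` once the parts are combined to `total_volume`."""
    return final_concentration * total_volume / part_volume


# dNTP prepared in 2 uL, diluted to 200 uM in a 20 uL reaction
dntp_stock_uM = pre_mix_concentration(200, 20, 2)

# initiator assembled in the 18 uL mix, diluted to 0.05 uM in 20 uL
initiator_mix_uM = pre_mix_concentration(0.05, 20, 18)
```

  • Under these assumptions the dNTP solution is a 10X stock (2000 µM), while components of the 18 µL mix only need to be about 11% more concentrated than their final values.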
  • 1X Enzymatics Green Buffer, 0.1 µM 150617_LT2 initiator (AGATCAATTAATACGATACCTGCG) (36), 1 U/µL TdT, and 0.125, 0.25, 0.5, or 1 mU/µL apyrase.
  • the starting final concentration of substrate was varied at 5, 10, 20, 40, or 80 µM for dCTP or at 1.25, 2.5, 5, 10, or 20 µM for dGTP.
  • the 16 µL mixture was added to a tube containing the 4 µL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • apyrase leads to the same level of extension as 10 µM dCTP with 0.5 mU/µL apyrase, 5 µM dCTP with 0.25 mU/µL apyrase, and 2.5 µM dCTP with 0.125 mU/µL apyrase.
  • T. P. Chirpich, The effect of different buffers on terminal deoxynucleotidyl transferase activity. Biochim. Biophys. Acta. 518, 535-538 (1978), M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), L. M. Chang, F. J. Bollum, Multiple roles of divalent cation in the terminal deoxynucleotidyltransferase reaction. J. Biol. Chem. 265, 17436-17440 (1990)).
  • the buffer system as disclosed is based on magnesium as divalent cation with the option of supplementing cobalt.
  • the buffer system as disclosed is based on cobalt as the sole divalent cation.
  • the performance of the TdT:apyrase system in all three conditions was evaluated, namely, magnesium as the only divalent cation, magnesium supplemented with cobalt, and cobalt as the only divalent cation. For that, two experiments were carried out, comparing each of the magnesium-with-cobalt and cobalt-only conditions separately with the magnesium-only condition (Figs. 3A-3C).
  • each reaction was carried out in 20 µL total volume. All reaction components but the dNTP were assembled in 18 µL while the dNTP was prepared in 2 µL of water. The 18 µL mix was composed such that upon mixing with the 2 µL dNTP solution, the following initial composition would be obtained: 200 µM dNTP, 1X Enzymatics Green Buffer, 0.05 µM f-P5-SBS3 initiator oligo, 250 µM cobalt chloride (if present), 1 U/µL TdT, and 4, 2, 1, 0.5, or 0.25 milliunits (mU) of apyrase per microliter.
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a
  • the 14 µL mix was prepared as a master mix for all reactions and composed such that upon mixing with the 6 µL dNTP and cobalt solution, the following initial composition would be obtained: 300 µM dATP, 0.05 µM f-P5-SBS3 initiator oligo, 1X Enzymatics Green Buffer, 1 U/µL TdT, 1 mU/µL apyrase and 50, 100, 150, 200, 250, or 300 µM cobalt chloride.
  • the 14 µL mixture was added to a tube containing the 6 µL dATP and cobalt mixture and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • the 16 µL mix was composed such that upon mixing with the 4 µL dNTP solution, the following initial composition would be obtained: 1X Enzymatics Green Buffer (Composition of 10X Green Buffer (B0120) from Enzymatics according to the manufacturer: 200 mM Tris-Acetate, 500 mM Potassium Acetate, 100 mM Magnesium Acetate, pH 7.9 @ 25°C) or 1X Promega TdT buffer (Composition of Terminal Transferase 5X Buffer (M189A) from Promega according to the manufacturer: 500 mM cacodylate buffer (pH 6.8), 5 mM CoCl2 and 0.5 mM DTT), 0.1 µM 150617_LT2 initiator, 1 U/µL TdT, and 1 mU/µL apyrase.
  • the starting final concentration of dNTPs was varied at 25, 50, 100, 200, or 400 µM for dCTP, dATP, and dTTP, or at 12.5, 25, 50, 100, or 200 µM for dGTP.
  • the 16 µL mixture was added to a tube containing the 4 µL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • extension dynamics with TdT make a few patterns clear.
  • extension with pyrimidines (dCTP and dTTP) is stimulated by cobalt as the divalent cation while extension with purines (dATP and dGTP) is hampered.
  • TdT behavior L. M. Chang, F. J. Bollum, Multiple roles of divalent cation in the terminal deoxynucleotidyltransferase reaction. J. Biol. Chem. 265, 17436-17440 (1990), K. I. Kato, J. M. Goncalves, G. E. Houts, F. J.
  • each reaction was carried out in 20 µL total volume. All reaction components but the dNTP and buffer were assembled in 14 µL while the dNTP and desired amount of buffer were prepared in 6 µL volume.
  • the 14 µL mix was prepared as a master mix for all reactions and composed such that upon mixing with the 6 µL dNTP and buffer solution, the following initial composition would be obtained: 300 µM dATP, 0.05 µM f-P5-SBS3 initiator oligo, 1 U/µL TdT, 1 mU/µL apyrase and 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.2, or 1.4X Enzymatics Green Buffer.
  • the 14 µL mixture was added to a tube containing the 6 µL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Each reaction was composed of 0.1 µM 150617_LT2, 0.7X Enzymatics Green Buffer, 125 µM of each dNTP, 1 U/µL TdT, and the desired amount of the additive.
  • the additives were glycerol at 27% (v/v), sucrose at 20 and 40% (w/v), PEG 8000 at 5 and 10% (w/v), betaine at 0.5 and 1M, DMSO at 5, 10, 20, and 30% (v/v), Triton X-100 at 0.01, 0.1, 0.5, and 1.0% (v/v), and Tween 20 at 0.01, 0.1, 0.5, and 1.0% (v/v).
  • the reactions were carried out at room temperature for 20 minutes and then mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 10% Novex TBE-Urea gel.
  • 0.1 µM 150617_LT2 initiator, 1 U/µL TdT, and 1 mU/µL apyrase.
  • the starting final concentrations of dCTP were 25, 50, 100, 200, or 400 µM.
  • the mixture was added to a tube containing the 4 µL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Consistent and reproducible extension of the initiator upon addition of various nucleotides in the presence of apyrase demands that TdT be at saturating concentrations relative to the initiator. Subsaturating levels of TdT can result in high extension variability, or extension of less than the maximum possible fraction of initiators upon the addition of dNTPs. With the final composition of the reaction having taken shape, it was examined what levels of TdT would be saturating relative to the initiator concentrations that were commonly used.
  • 1X Custom Synthesis Buffer, 0.1 µM (or less) initiator oligo, 1 U/µL TdT (or more), and 1 mU/µL apyrase.
  • nucleotide composition of the initiator at the 3' end is also important (K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland.
  • TdT operates in a distributive manner (M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), E. A. Motea, A. J. Berdis, Terminal deoxynucleotidyl transferase: the story of a misguided DNA polymerase. Biochim. Biophys. Acta. 1804, 1151-1166 (2010)); it does not remain bound to the nascent oligonucleotide and is not processive.
  • the 18 µL mix was composed such that upon mixing with the 2 µL dNTP solution, the following initial composition would be obtained: 1X Custom Synthesis Buffer, 0.1 µM initiator oligo, 1 U/µL TdT, and 0.25 mU/µL apyrase.
  • the initial final concentration of dNTPs was varied at 2, 4, 8, 16, or 32 µM for dCTP, dATP, and dTTP, or at 1, 2, 4, 8, or 16 µM for dGTP.
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP sample and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Template sequences were synthesized using the TdT:apyrase mixture by cyclic addition of nucleotide triphosphates to the reaction.
  • the template sequence GATGTAGA was synthesized (Fig. 7, left) and in another, the template sequence CGCACTCG was synthesized (Fig. 7, right).
  • Each reaction was carried out in 100 µL total volume and was mixed with 2 µL of dNTP at 50X the desired final concentration.
  • the mix consisted of: 1X Custom Synthesis Buffer, 0.1 µM initiator oligo, 1 U/µL TdT, and 0.25 mU/µL apyrase.
  • the initial final concentration of dNTP was 40 µM for dATP, 200 µM for dCTP, 20 µM for dGTP, and ⁇ ⁇ ! for dTTP.
  • the mixture was added to a tube containing 2 µL of the desired dNTP sample and mixed immediately by pipetting. After 1 minute incubation at room temperature, a 2 µL sample of the mix was taken to be run on a gel. The remainder was added to another tube containing 2 µL of the next nucleotide, mixed and incubated as before, followed by collection of another 2 µL sample for PAGE analysis. These steps were repeated for 8 cycles without washing, thereby extending the initiator with 8 different dNTPs while using the same enzymatic mix. Afterwards, each of the 2 µL samples that were taken was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Example III Enzymatic DNA Synthesis for Digital Information Storage
  • a template-independent DNA polymerase for controlled synthesis of sequences with user-defined information was harnessed.
  • retrieval of 144-bits, including addressing, from perfectly synthesized DNA strands using batch Illumina and real-time Oxford Nanopore sequencing was demonstrated.
  • a codec was then developed for data retrieval from populations of diverse but imperfectly synthesized DNA strands, each with ~30% error tolerance. With this codec, a kilobyte-scale design was experimentally validated which stores 1 bit per nucleotide. Simulations of the codec supported reliable and robust storage of information for large-scale systems.
  • a de novo DNA synthesis strategy and a digital codec designed specifically for information storage are provided.
  • While DNA for biological functionality requires single-base precision and accuracy, these demands can be relaxed for DNA for digital information.
  • a template-independent DNA polymerase was used, a protein evolved to rapidly catalyze the linkage of naturally occurring nucleotide triphosphates (dNTPs) under non-toxic biocompatible conditions. Information was encoded in transitions between non-identical nucleotides, rather than in single nucleotides. It was demonstrated that enzymatic synthesis and tailored computational tools provide robust information storage, as assessed using batch (Illumina) and real-time (Oxford Nanopore) sequencing. The presently-disclosed enzymatic synthesis strategy is cheaper than phosphoramidite chemistry and may reduce reagent costs by orders of magnitude, facilitating the adoption of DNA as a storage medium.
  • the enzyme terminal deoxynucleotidyl transferase is used.
  • TdT is a template-independent DNA polymerase which rampantly and indiscriminately adds dNTPs to the 3' termini of DNA.
  • TdT is largely used in reactions where one nucleotide triphosphate is added to indeterminate lengths.
  • it is sought to leverage apyrase, which degrades nucleotide triphosphates into their TdT-inactive diphosphate and monophosphate precursors. By competing with TdT for nucleotide triphosphates, apyrase effectively limits DNA polymerization.
  • a mixture was thus created and optimized containing a tuned ratio of these two enzymes such that a nucleotide triphosphate is added at least once to each strand by TdT before being degraded by apyrase (Figs. 2A-2C and Fig. 5).
  • the lowest nucleotide triphosphate concentrations required were determined such that adding a series of nucleotides results in stepwise increases in the length of synthesized DNA (Figs. 6-7).
  • the core of the reaction contemplates a mixture of TdT, apyrase, and short oligonucleotide initiators.
  • Upon addition of a nucleotide triphosphate, TdT extends the initiators until all added substrate is degraded by apyrase.
  • the number of polymerized nucleotides was defined as the 'extension length'.
  • Subsequent nucleotide triphosphates are added to continue the synthesis process. While the extension length for each added nucleotide triphosphate may vary, the resulting population of synthesized strands all share the same number and sequence of nucleotide transitions (Fig. 13B).
  • information was encoded as transitions between non-identical nucleotides (Fig. 13C). Given three possible transitions for each nucleotide, trits were used to maximize information capacity. Trits are the ternary equivalent of bits; one trit is log(3)/log(2) (about 1.58496) bits of information.
  • To convert information to DNA, information in trits was mapped to a template sequence of non-identical nucleotides, starting with the last nucleotide of the initiator. Enzymatic DNA synthesis of each template sequence produced 'raw strands', or strands R, which can be physically stored. To retrieve information, strands R are sequenced and transitions between non-identical nucleotides extracted, resulting in 'compressed strands', or strands C. If a strand C is equivalent to the template sequence, the strand (compressed or raw) is considered 'perfect' and the information is retrieved by mapping the sequence of non-identical nucleotides back to trits.
  • "hello world!", a message containing 96 bits of ASCII data (Fig. 1 A), was encoded and synthesized. This message was split into twelve individual 8-bit characters, and each character's bit representation was prefixed with a 4-bit address to denote its order. These 144 total bits of information, including addressing, were also expressed in trits and mapped according to nucleotide transitions (Fig. 13C), resulting in twelve eight-nucleotide template sequences (Table 1). All twelve template sequences (H01-H12) were synthesized in parallel on bead-conjugated initiators, with washing performed every two
  • Illumina sequencing was used to read out the synthesized strands R and to assess the information stored in the corresponding strands C (Methods).
  • DNA strands R synthesized for H01-H12 were first sequenced using an entire MinION flowcell (Oxford Nanopore), and it was observed that the most abundant species, an average of 49.9% of filtered strands C, were perfectly synthesized (Fig. 18A). This is largely consistent with results from
  • nanopore sequencing can enable faster and more efficient information retrieval from strands synthesized with the enzymatic strategy.
  • DNA translocation rates are slowed through nanopores for accurate single-base sequencing. This rate may be increased since it is, in principle, easier to detect transitions between non-identical nucleotides, each with extension lengths greater than one (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005); M. Vega, P. Granell, C. Lasorsa, B.
  • Coded strand architecture: It has been established that data can be stored in enzymatically-synthesized DNA and retrieved by in silico filtering for perfectly synthesized DNA strands. However, perfect strands C may not be required for data retrieval; imperfectly synthesized strands may be used to reconstruct template sequences if nucleotide errors occur in different locations. It was thus sought to develop a codec for robust data retrieval which leverages the diversity of imperfectly synthesized strands C for template sequence reconstruction. The core of the codec relies on three elements: (i) A coded strand architecture which includes synchronization nucleotides to facilitate error localization, (ii) Sufficiently diverse strands produced by
  • a key feature of the presently disclosed codec is the addition of synchronization nucleotides which are interspersed between information-encoding nucleotides (Fig. 19B). These nucleotides act as a scaffold to aid reconstruction of a template sequence from imperfectly-synthesized DNA strands that may contain errors as a result of missing, mismatched, and inserted nucleotides.
  • CTCGTGCT template sequence of 8 nucleotides
  • CTCTGC and TCGTCT synthesized DNA strands C
  • the codec includes a module for encoding information in template sequences which incorporates synchronization nucleotides.
  • the population of synthesized DNA strands for a desired sequence must be sufficiently diverse. That is, if the same nucleotide is missing systematically across all strands, then it cannot be retrieved without additional forms of error correction. The diversity generated by the synthesis process was thus analyzed by synthesizing a longer 16-nucleotide template sequence (called E0), which contains 12 unique transitions between nucleotides to mitigate ambiguous alignments
  • In silico size selection was performed for strands R ranging from 32 to 48 bases in length, assuming that each of the 16 template nucleotides was synthesized with an extension length of two to three bases (Fig. 20A). This purified set was analyzed by aligning the corresponding strands to the E0 template, and it was observed that missing nucleotides were predominant, in line with the previous analyses, but could occur in different positions (Fig. 19C, Fig. 19D, Fig. 20B).
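  • The size-selection window follows directly from the template length and the assumed extension range (16 × 2 = 32 to 16 × 3 = 48 bases). A minimal sketch of such a filter is shown below; the function name and the example reads are hypothetical, and reads are assumed to be trimmed of initiator and adapter sequence.

```python
def size_select(raw_reads, template_length=16, min_extension=2, max_extension=3):
    """Keep raw reads whose length is consistent with every template nucleotide
    having been extended by between min_extension and max_extension bases."""
    shortest = template_length * min_extension
    longest = template_length * max_extension
    return [read for read in raw_reads if shortest <= len(read) <= longest]


reads = ["A" * 31, "A" * 32, "A" * 48, "A" * 49]
selected = size_select(reads)
```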
  • Levenshtein edit distances of strands C from the purified set (Fig. 19D, Fig. 20C). It was observed that the median strand C length was 12 nucleotides and the maximal number of variants occurred at this length.
  • the Levenshtein edit distance was also calculated (V. I. Levenshtein, in Soviet physics doklady (1966), vol. 10, pp. 707-710), which summarizes the number of single-nucleotide edits required to repair a strand C to the desired E0 sequence.
  • the median edit distance for these variants was four, indicating that synchronization nucleotides could be placed approximately every three or four nucleotides to recall missing strand nucleotides from diversely synthesized strands. It was thus set out to reconstruct a template sequence from a population of diverse but imperfect strands ' using statistical inference and mathematical models.
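  • As an illustration of the edit-distance metric referenced above, the following is a minimal sketch of the standard dynamic-programming Levenshtein computation. It is not the analysis code used in the disclosure; the function name `levenshtein` is hypothetical, and the example strands are the 8-nucleotide template CTCGTGCT and the two synthesized strands quoted earlier in this section.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char from source
                current[j - 1] + 1,      # insert t_char into source
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


template = "CTCGTGCT"
for strand in ("CTCTGC", "TCGTCT"):
    print(strand, levenshtein(strand, template))
```

  • Both example strands are subsequences of the template that lack two nucleotides, so each is two edits away from it, with the missing nucleotides in different positions.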
  • MAP maximum a posteriori
  • Each template sequence contained a 2-bit address to delineate its order, and 14 bits of data. These 16 bits are encoded in a template sequence of 16 nucleotides, which includes four synchronization nucleotides, resulting in 1 bit stored per nucleotide (Fig. 22B). Sequences E1-E4 carry a total of 64 bits of information including addressing, and were synthesized in parallel on beads with a wash every cycle. Following the last synthesis cycle, strands were ligated to a universal adapter, PCR amplified, and stored as a single pool.
  • Sequence E3 required the most sequencing reads for reconstruction as synthesized strands contained one extra edit on average in comparison to synthesized strands for other template sequences (Figs. 27A-27B and Figs. 28A-28B). It was also found that MAP estimation was a more robust decoding algorithm than the previous two-step filter for H01-H12, requiring fewer reads for data retrieval (Figs. 34A-34H). These results show that the codec can accurately reconstruct data without requiring perfectly synthesized DNA strands.
  • the experimental results demonstrate that byte- and kilobyte-scale storage systems can be achieved if a sufficient number of strands are synthesized (Fig. 35A).
  • the "hello world!" experiment stored 12 bits per template sequence. This is sufficient for a 256-byte maximum storage system where 11 bits are used for addressing 2,048 total template sequences, each with 1 bit of data.
  • the "Eureka!" experiment stored 16 bits per template sequence. This allows for a 4-kilobyte maximum storage system, where 15 bits are used for addressing 32,768 total template sequences, each with 1 bit of data (Table 7).
  • the scalability of the DNA storage codec was next assessed for gigabyte- and petabyte-scale storage through simulation, assuming that the requisite number of DNA strands for each could be produced.
  • Increased storage capacity requires more nucleotides per template sequence for additional address space, synchronization nucleotides, and data. In one embodiment, 36 bits were stored, including data and address, in a 74-nucleotide template sequence and similarly, 57 bits in a 152-nucleotide template sequence to simulate gigabyte- and petabyte-scale systems, respectively (Fig. 3 A).
  • the codec is able to resolve several types of errors, including missing nucleotides in synthesized strands, which would otherwise drastically reduce information storage capacities (M. C. Davey, D. J. C. Mackay, Reliable communication over channels with insertions, deletions, and substitutions. IEEE Trans. Inf. Theory. 47, 687-698 (2001); M. Mitzenmacher, A survey of results for deletion channels and related synchronization channels. Probab. Surv. 6, 1-33 (2009)).
  • the comprehensive codec architecture consists of encoding and decoding frameworks to extract information from diversely synthesized DNA strands (Fig. 35C, Figs. 39A-39B).
  • the encoder consists of several core components: (i) partitioning of data into ordered rows of bits; (ii) prefixing of rows with addresses; (iii) error correction per row of bits via an error-correction code (ECC) per template sequence (e.g., Bose-Chaudhuri-Hocquenghem code), and error correction per block of rows via a block ECC (e.g., Reed-Solomon or Fountain code); (iv) modulation to map rows of bits to template sequences. All template sequences are subsequently synthesized enzymatically, resulting in a population of diverse DNA strands. Strands are read out by sequencing and corresponding strands are input to a decoder.
  • ECC error-correction code
  • the crucial first step of the decoding pipeline is MAP estimation aided by scaffolding, followed by probabilistic consensus. Multiple subsets of strands can be used for sequence reconstruction. Each reconstructed sequence need not be identical to the template sequence. After demodulation of the reconstructed sequence, the resulting bit sequence can be corrected by bit-level ECCs in the decoding pipeline to reinforce error-free data retrieval.
  • the design harnesses the diversity of enzymatically-synthesized DNA strands and supports a flexible-write approach to provide a functional and robust storage system.
  • extension lengths per template nucleotide may be considered a design optimization and tuned according to application demands, trading density for read-out speed and cost by specialized nanopore sequencing (S. M. H. T. Yazdi,
  • DNA for information storage is synthesized in a high-density array format with proprietary machines.
  • the presently disclosed bead-based process was thus translated to a 2D array-based platform (Figs. 40A-40E).
  • this prototype produced perfectly synthesized strands for each of the three 13-nucleotide template sequences tested herein. Analyses of the synthesized strands indicate similar error and diversity profiles to those observed using the bead-based process, indicating that the codec could be used to store information in DNA synthesized with this platform (Fig. 41, Figs. 42A-42B and Figs. 43A-43B).
  • Synthesis accuracy can be further improved by additional process engineering, e.g., more stringent washing per cycle that reduces carryover of nucleotide triphosphates from previous cycles to further diminish the rate of substituted strand nucleotides. Optimization of reaction conditions to improve mixing or the use of more processive, rather than distributive, TdT mutants may reduce the rate of missing strand nucleotides (M. A. Jensen,
  • the enzymatic DNA synthesis strategy disclosed herein is advantageous in speed and cost relative to phosphoramidite chemistry.
  • reagent costs were compared for both processes as a function of feature size (reagent volume) (Figs. 44A-44B, Table 6).
  • the analyses indicate that the enzymatic synthesis strategy could already be cheaper as a drop-in replacement to phosphoramidite chemistry when using existing automation which synthesizes DNA strands in 15-30-micron features (Figs. 44A-44B).
  • Further miniaturization, together with reductions to enzyme cost through recycling, provide a potential roadmap for overall reduction in reagent costs by several orders of magnitude (Figs. 44A-44B).
  • the increased rate of enzymatic catalysis over chemical coupling and a lack of blocking moieties may shorten the synthesis cycle times compared to phosphoramidite chemistry, reducing write time and equipment amortization time (Table 6).
  • aspects of the present disclosure are directed to an enzymatic synthesis strategy and tailored coding architecture for robust information storage in DNA.
  • This storage solution is an alternative to prior studies which utilized phosphoramidite chemistry to produce DNA for information storage.
  • This approach offers potentially dramatic benefits to the cost and speed of synthesis and sequencing without requiring single-base accuracy. Additionally, this approach may alleviate biosecurity concerns associated with widespread DNA synthesis of genetic information, as genes are unlikely to be produced with this strategy. While this work illustrates DNA information storage in vitro, it could provide a foundation for development of de novo molecular recording systems in vivo (B. M. Zamft et al, Measuring cation dependent DNA polymerase fidelity landscapes by deep sequencing. PLoS One.
  • the phrase "hello world!" was converted to decimal ASCII and then to ternary as shown in Table 1.
  • the ASCII decimal (data) was converted to base 2 (for binary, 8 bits) or to base 3 (for ternary, 5 trits).
  • the addresses were converted from a decimal value to base 2 (for binary, 4 bits) or base 3 (for ternary, 3 trits). Addresses were concatenated to data to form a resulting string of 12 bits or 8 trits.
  • a custom Python script was used to map trits to template sequences H01-H12 shown in Table 1.
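  • The address-plus-data conversion described above can be sketched as follows. This is a hedged illustration, not the custom script mentioned in the disclosure: the function names `to_base` and `encode_message` are hypothetical, and the sketch assumes zero-based addresses, most-significant digit first, and the address placed before the data. The actual conventions are those of Table 1.

```python
def to_base(value: int, base: int, width: int) -> str:
    """Fixed-width representation of `value` in `base`,
    most significant digit first (assumed digit order)."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, base)
        digits.append(str(remainder))
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))


def encode_message(message: str, base: int, addr_width: int, data_width: int) -> list:
    """One digit string per character: address digits followed by the
    character's ASCII code (zero-based addressing is an assumption)."""
    return [
        to_base(index, base, addr_width) + to_base(ord(char), base, data_width)
        for index, char in enumerate(message)
    ]


ternary = encode_message("hello world!", base=3, addr_width=3, data_width=5)
binary = encode_message("hello world!", base=2, addr_width=4, data_width=8)
print(ternary[0], binary[0])
```

  • Each of the twelve characters yields one string of 8 trits (3 address + 5 data) or 12 bits (4 address + 8 data), matching the string lengths stated above.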
  • Nucleotide triphosphates were prepared at the following concentrations: 8mM dATP, 4mM dCTP, 4mM dGTP, and 16mM dTTP.
  • each template sequence Table 1
  • the required dNTP volumes corresponding to each transition type were dispensed (Table 3) in a 96-well PCR plate (VWR) using a Mantis liquid handler (Formulatrix), which has a minimum dispense volume of 0.2 μL.
  • initiator-conjugated polystyrene beads for each of the twelve template sequences were suspended in an enzymatic reaction mix composed of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1 U/μL TdT (Enzymatics) and 1 mU/μL apyrase (NEB).
  • a universal adapter was ligated to the 3' end of the synthesized strands using a hybridization-based strategy as previously described (C. K. Kwok, Y. Ding, M. E. Sherlock, S. M. Assmann, P. C. Bevilacqua, A hybridization-based approach for quantitative and low-bias single-stranded DNA ligation. Anal. Biochem. 435, 181-186 (2013)).
  • the 5P-rSBS9-GGG adaptor (/5Phos/AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC T/ideoxyU/CCGATCT GGG/3SpC3/) forms a hairpin with a 5' polyG overhang which hybridizes to single-stranded DNA strands ending in polyC.
  • the beads carrying synthesized DNA with polyC tail were resuspended in a reaction composed of ⁇ ⁇ 5P-rSBS9-GGG adaptor, 1X NEB T4 DNA Ligase Buffer, 20% PEG 8K, 500 mM Betaine, and 6 units of T4 DNA Ligase (Enzymatics). The ligation mixture was incubated at 16 °C overnight.
  • Each sample was column purified after amplification.
  • To add the complete Illumina sequencing adapters, amplified strands were diluted and used as a template for a PCR reaction with NEBNext Dual Indexing Primers. Each strand received a different index by real-time cycle-limited PCR for 15 cycles. Barcoded strands were then combined and sequenced single end using Illumina MiSeq v2 150 Micro. Samples were then combined and sequenced using Illumina MiSeq with reagent kit v2. Sequencing was done in one direction, starting from the forward primer in each sample for 150 bases. Oxford Nanopore sequencing
  • each Illumina-barcoded strand was diluted 100-fold in Tris-HCl pH 8.0 with 0.01% Tween-20 and amplified with nested primers, comprising a barcoding primer pair, LWB 01-12 from SQK-LWB001 (Oxford Nanopore), and 50 nM of primers PR2-P5 and 3580F-P7 for 10 cycles.
  • 5 μL of each strand was then pooled (60 μL total) and cleaned with 90 μL of Agencourt Ampure XP beads according to the manufacturer's protocol.
  • 1 μL of the pooled library was diluted with 9 μL.
  • the first step is to filter for the designed number of nucleotides which contain a terminal 'C', used for ligation, in compressed strands.
  • eight of the twelve template sequences, specifically H01, H02, H04, H08, H09, H10, H11, and H12, have 9 nucleotides to be synthesized.
  • four of the twelve template sequences, specifically H03, H05, H06, and H07, contain only 8 nucleotides to be synthesized.
  • the second step is to select the most frequently synthesized compressed strand variant.
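  • The two-step filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `compress` and `two_step_filter` are hypothetical, "compressed strand" is taken to mean the read with each run of identical bases collapsed to one base, and the designed length is assumed to count the terminal 'C'. The reads in the usage example are invented for illustration only.

```python
from collections import Counter
from itertools import groupby


def compress(read: str) -> str:
    """Collapse each run of identical bases into one base, leaving a
    sequence of non-identical nucleotides (run-length compression)."""
    return "".join(base for base, _run in groupby(read))


def two_step_filter(reads, designed_length):
    """Step 1: keep compressed strands of the designed length that end in 'C'.
    Step 2: return the most frequent surviving variant, or None if none survive."""
    candidates = [
        strand
        for strand in map(compress, reads)
        if len(strand) == designed_length and strand.endswith("C")
    ]
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


# Invented reads, for illustration only.
reads = ["AAGGTTC", "AGTCC", "AGGGTC", "ATTC", "AGAC"]
print(two_step_filter(reads, designed_length=4))
```

  • Because the filter only keeps strands of exactly the designed compressed length, it discards strands with missing nucleotides rather than using them, which is the limitation the MAP-based decoder is intended to address.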
  • Reads in the opposite orientation were not processed. Data retrieval for each sequence was performed as above for Illumina sequencing with a two-step filter. Real-time data reconstruction with nanopore sequencing reads was simulated by applying the two-step data retrieval filter to a subsampled number of shuffled sequencing reads obtained up to a given time point. The 48-hour sequencing run was split into 2-hour increments. For each increment, the timestamps for all reads obtained during the entire sequencing run were shuffled and the number of reads corresponding to the total elapsed sequencing time up to the given increment was randomly sampled. The probability of correct retrieval was assessed by performing 10,000 decoding trials for each increment, with each time interval expressed as a fraction of total sequencing time.
  • Encoding and decoding pipelines were implemented partly using the C++ programming language, compiled via a g++ compiler on an Ubuntu Linux operating system, and partly via specialized MATLAB (Mathworks) functions.
  • the message "Eureka!" consisting of 7 ASCII characters, equivalent to 56 bits of payload data, was encoded into 4 template sequences E1-E4 each containing 16 nucleotides.
  • the encoding steps consisted of data partitioning, addressing, and modulation of bit sequences to nucleotide sequences with no repeated bases (i.e., self-transitions). Modulation included the placement of synchronization nucleotides within DNA sequences as described herein.
  • sequence E0 was specified and designed for purposes of error analysis.
  • decoding consisted of sequence reconstruction from run-length compressed DNA strands via MAP estimation and consensus. Reconstructed E1-E4 DNA sequences were demodulated into bit sequences, and data were extracted by ordering according to addresses.
  • the initiator oligonucleotide Bio-U-LT2 was conjugated to streptavidin beads (Invitrogen) according to manufacturer instructions at 20% binding capacity and Biotin-14-dCTP was used to bind remaining free streptavidin. Blank beads, which have free streptavidin bound by Biotin-14-dCTP, were also prepared. Prior to use, the initiator-conjugated beads were diluted 10-fold with blank beads and washed with 1x Custom Synthesis Buffer without PEG.
  • Synthesis of E0-E4 was performed similarly as described above. However, Bromo-dCTP was used instead of dCTP (Figs. 12A-12B) and concentrations of each dNTP regardless of transition type were fixed. The final concentrations of dNTPs for each cycle were as follows: ⁇ dATP, 15 ⁇ Bromo-dCTP, 5 ⁇ dGTP, and 15 ⁇ dTTP. As above, a series of dNTPs were dispensed for each nucleotide of the template sequence in a 96-well PCR plate.
  • initiator-conjugated magnetic beads were suspended in the enzymatic reaction mix, composed of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1 U/μL TdT (Enzymatics) and 0.25 mU/μL apyrase (NEB).
  • each reaction was pulse vortexed and incubated for 30 seconds at room temperature. Beads were collected by magnet and washed in 1x Custom Synthesis Buffer without PEG and resuspended with fresh enzymatic mix. The reaction mixture was then transferred to the next well containing the next nucleotide substrate. Following the last cycle, each sample was prepared for Illumina sequencing as described above. Complete Illumina sequencing adapters were added by real-time cycle-limited PCR for 12 cycles. Barcoded strands were then combined and sequenced as single-end 175 bp reads using Illumina MiSeq v2 Nano.
  • Sequences were trimmed as before to remove the 5' initiator oligo sequence (Bio-U-LT2) and the 3' universal oligo sequence (5P-rSBS9-GGG). Only reads which presented both sequences for trimming were retained for further analysis.
  • a sequence of non-identical nucleotides for each raw strand was extracted as above. Purified strands were obtained by selecting strands with raw lengths between 32-48 bases, corresponding to average extension lengths of 2 to 3 per template nucleotide. Purified strands were used for analysis of synthesis errors with Needleman-Wunsch and for sequence reconstruction of E1-E4 with the decoding pipeline.
  • DNA strands synthesized for each template sequence E1-E4 were randomly sampled from data according to a target number of reads, and then subjected to a two-step filter. A filter was first applied to include those DNA strands with read counts of either 1, 2, 3, 4, or 5, depending on the target number of reads, to exclude aberrant DNA strands, which could arise from combinations of synthesis and sequencing errors. A second filter was applied to rank DNA strands according to compressed strand lengths. A total of 10 top-ranked DNA strands were selected from all purified and filtered strands. These 10 strands were used to reconstruct each template sequence using MAP estimation and consensus implemented according to equations explained herein. The probability of correct retrieval of each template sequence E1-E4 was assessed by performing 500 decoding trials for each target number of reads. Each trial consisted of a random sampling of purified reads. Simulated large-scale storage systems
  • BCH Bose-Chaudhuri-Hocquenghem
  • the robustness of the codec was next assessed by performing 500 decoding trials for varying levels of synthesis accuracies.
  • a template sequence was randomly generated and ten compressed strands were synthesized by simulation with the Markov model. These compressed strands were used towards reconstruction of the template sequence via MAP estimation and probabilistic consensus.
  • Each reconstructed sequence of $K$ nucleotides was demodulated into $B$ bits, and decoded with a MATLAB BCH decoder (Mathworks) to yield the decoded bits.
  • the probability of correct data retrieval for a specific level of synthesis accuracy was computed as the fraction of successful decoding trials. Results for data retrieval were benchmarked on a multi-core server.
  • This example describes the evaluation of the use of 5-Bromo-dCTP (5Br-dCTP) as an alternative to natural dCTP in the synthesis reactions.
  • 5Br-dCTP 5-Bromo-dCTP
  • reaction components not including the dNTP were assembled in 18 μL while nucleotide triphosphates were prepared in 2 μL of water.
  • the 18 μL mix was composed such that upon mixing with a 2 μL nucleotide triphosphate solution, the following initial composition would be obtained: 1X Custom Synthesis Buffer, 0.1 μM LT2+3C initiator, 1 U/μL TdT, and 0.25 mU/μL apyrase.
  • the initial final concentration of the dNTP was varied at 2, 4, 8, 16, or 32 μM.
  • the 18 μL mixture was added to a tube containing the 2 μL dNTP sample and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • a modular design for encoding and decoding digital information in DNA is presented (Fig. 35C, Figs. 39A-39B). While single monolithic architectures can be more efficient, modular designs allow for the optimization of encoding and decoding blocks separately. Such a distributed approach simplifies the design space considerably. Within individual blocks, error-correcting codes borrowed from traditional communication systems (e.g., Reed-Solomon, Fountain, BCH, LDPC) may be applied to handle multiple types of errors.
  • Information is stored in short sequences of DNA, and must be reassembled by a decoder. Alignment errors (e.g., missing or inserted nucleotides) due to inaccurate DNA synthesis or sequencing are more difficult to correct compared to substitutions or erasures common in communication systems.
  • encoding and decoding frameworks were presented, together defined as a codec, for storing and extracting information from populations of diverse DNA strands.
  • An important part of the encoding strategy is the placement of synchronization patterns which are regularly interspersed throughout data, allowing a decoder to compute accurate alignments from diverse synthesized strands. Synchronization patterns are inserted in the modulation step of the encoding pipeline, which translates rows of bits into DNA sequences which adhere to modulation constraints (Fig. 39B, Figs. 21A-21B).
  • the codec is inclusive of core components such as Reed-Solomon or Fountain codes utilized in prior DNA storage systems (Y.
  • the encoder first partitions data into ordered rows of bits, prefixing an address to each row to delineate its order in reassembly. Error-correction is incorporated within each row of bits, or block of rows to protect against synthesis errors, missing sequences, or low sequencing coverage.
  • the encoder outputs a book of template sequences, which are written by enzymatic synthesis to DNA strands.
  • the resulting strands can then be stored.
  • the stored strands are read by high-throughput DNA sequencing and transitions extracted to form a sequence of non-identical nucleotides, which is then fed into a decoder.
  • a crucial first step of the decoder is to harness information latent in diverse DNA strands by MAP estimation and probabilistic consensus (Fig. 13B).
  • the decoder is designed to function with minimal sequencing reads.
  • Existing approaches for strand alignments (S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970); T. F. Smith, M. S. Waterman, Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981); C. Notredame, D. G. Higgins, J. Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol.
  • the encoder (Fig. 39B) first partitions data into ordered rows of bits. Each row of bits is eventually stored in one template sequence of DNA. In subsequent paragraphs, this correspondence is maintained between rows of bits and template sequences of DNA. Each row is prefixed with a unique address to delineate its order in reassembly. Let $B$ denote the total number of bits stored per row, including both payload data and addresses. Let $A \le B$ indicate the number of address bits. With $A$ bits, it is possible to address a total of $2^{A}$ template sequences, in which each template sequence stores $(B - A)$ bits of payload data.
  • the storage capacity is equal to the number of DNA sequences multiplied by the number of bits of payload data stored per template sequence.
  • the storage capacity is maximized by maximizing the total number of DNA template sequences. The following equations specify the storage capacity, and the maximum storage capacity.
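  • Storage Capacity $= 2^{A}(B - A)$ bits, for $0 \le A \le B$, where $A$ is the number of address bits and $B$ is the total number of bits stored per row.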
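  • As a worked check of the capacity relation (number of addressable template sequences multiplied by payload bits per sequence), the sketch below evaluates it for the two experimental systems described earlier. The function names and the symbols `total_bits` and `address_bits` are illustrative assumptions, not notation confirmed by the source.

```python
def storage_capacity_bits(total_bits: int, address_bits: int) -> int:
    """Capacity = (number of addressable sequences) x (payload bits per sequence)."""
    if not 0 <= address_bits <= total_bits:
        raise ValueError("address_bits must lie between 0 and total_bits")
    return 2 ** address_bits * (total_bits - address_bits)


def max_storage_capacity_bits(total_bits: int) -> int:
    """Largest capacity over every possible split between address and payload bits."""
    return max(
        storage_capacity_bits(total_bits, address_bits)
        for address_bits in range(total_bits + 1)
    )


# "hello world!" system: 12 bits per sequence, 11 of them address bits.
print(storage_capacity_bits(12, 11) // 8, "bytes")
# "Eureka!" system: 16 bits per sequence, 15 of them address bits.
print(storage_capacity_bits(16, 15) // 8, "bytes")
```

  • With 11 of 12 bits used as address the capacity is 2,048 bits (256 bytes), and with 15 of 16 bits it is 32,768 bits (4 kilobytes), in agreement with the figures quoted for the two experiments.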
  • the goal of an encoder and decoder architecture is to recover both the address and payload data correctly. If the address is irretrievable or only partially reconstructed, the order of information is lost. In this sense, it is more critical to recover the address. If the address is correct, it is possible to correct errors in the payload data using redundant information stored in other DNA sequences. However, in the analyses, both the address and payload information (a total of $B$ bits per sequence) are decoded reliably with equal error protection.
  • Reed-Solomon (RS) codes and Fountain codes which may be incorporated within the encoding and decoding architecture are briefly described (Fig. 39B). However, these codes were not explicitly used in the experiments or simulations. If synthesizing thousands or millions of DNA sequences, error-correction across multiple sequences is necessary to protect against the following types of errors: 1) Missing strands for particular sequences; 2) Strands with low sequencing coverage (i.e., too few reads of DNA strands for particular sequences after PCR-amplification and sequencing); 3) Sequences with detected errors after reconstruction from multiple strands; 4) Sequences with undetected errors in either the address or payload after reconstruction from multiple strands.
  • RS Reed-Solomon
  • the error locations within a block of reconstructed sequences are known and may be pinpointed by checking the addresses of all sequences. For example, missing sequences can be identified by their missing addresses which are not available in a block.
  • the error locations within a block of reconstructed sequences are unknown and undetected.
  • the fourth type of error is not accommodated by most Fountain codes which are specialized only to handle erased/missing sequences.
  • Fountain codes were originally applied in packet communication networks for recovering missing packets at a high-level abstraction layer in the communication protocol stack. While RS codes can correct undetected errors in the payload data, they assume that the address per sequence is reconstructed correctly.
  • an RS code is applied in the vertical direction across multiple rows/sequences of bits (Fig. 39B).
  • a total of $B$ bits per row exist prior to RS encoding, and a total of $B$ bits per row exist after RS encoding (Fig. 39B).
  • the organization of information in the horizontal direction is unaltered.
  • the RS code inserts extra rows of redundant parity bits. Each extra row of parity bits contains its own unique address which is utilized by the RS decoder for error-correction.
  • an RS($n_{rs}$, $k_{rs}$) code, which has a minimum Hamming distance of $(n_{rs} - k_{rs} + 1)$.
  • $k_{rs}$ rows store address and payload information bits, while $(n_{rs} - k_{rs})$ additional rows store RS parity bits.
  • the RS code is able to correct up to $E$ sequences with known error locations within a block of sequences, and $U$ sequences containing undetected errors, where $2U + E \le (n_{rs} - k_{rs})$.
  • the undetected errors cost twice as much in terms of added redundancy required.
  • an RS(255,223) code is specified, which corrects up to 16 sequences with undetected errors within a block of sequences, or corrects up to 32 sequences with known error locations within a block.
  • the RS code may be applied to a block of $n_{rs}$ sequences of bits as a layer of protection for both detected and undetected errors, with the assumption that the address for sequences is known (Fig. 39B).
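  • The trade-off between known-location and undetected errors can be checked numerically with a short sketch. It only evaluates the standard Reed-Solomon decoding condition (twice the undetected errors plus the known-location errors must not exceed the number of parity rows); it does not encode or decode anything, and the function name `rs_can_correct` is hypothetical.

```python
def rs_can_correct(n: int, k: int, known_location: int, undetected: int) -> bool:
    """True if an RS(n, k) code can correct the given mix of errors:
    2 * undetected + known_location <= n - k."""
    if known_location < 0 or undetected < 0:
        raise ValueError("error counts must be non-negative")
    return 2 * undetected + known_location <= n - k


# RS(255,223) has 32 parity rows.
print(rs_can_correct(255, 223, known_location=0, undetected=16))   # limit for undetected errors
print(rs_can_correct(255, 223, known_location=32, undetected=0))   # limit for known locations
print(rs_can_correct(255, 223, known_location=10, undetected=11))  # mixed case at the limit
```

  • Each undetected error consumes two parity rows while each known-location error consumes one, which is why the same code handles either 16 of the former or 32 of the latter.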
  • ECCs error-correcting codes
  • BCH Bose-Chaudhuri-Hocquenghem
  • LDPC low-density parity check
  • Synchronization itself is insufficient for correct decoding.
  • a missing nucleotide in the compressed strand causes a synchronization error, but even if the position of the deletion is known via synchronization, the missing nucleotide must be recovered correctly.
  • the alignment and synchronization step of the decoder may resolve all errors perfectly by utilizing the diversity of synthesized strands per sequence. If enough diversity is available, the missing information in one strand variant may be recovered correctly from other variants. In this way, alignment and consensus algorithms have a probability of success for decoding correctly.
  • the number of nucleotides for template sequences must increase. In these systems, a few errors may still occur after the alignment and consensus step of the decoder.
  • BCH codes were applied to encode and decode bits stored per DNA sequence. LDPC codes could also provide similar error-correction capabilities.
  • BCH(63,57,1); BCH(63,45,3); BCH(63,39,4);
  • BCH(31,21,2), BCH(63,36,5), and BCH(127,57,11) codes were applied respectively. These BCH codes are applicable for DNA storage due to their short sequence length requirements, and efficient error-correcting abilities.
  • the parameters of the BCH code (block length, number of information bits, and number of correctable errors) directly affect overall system parameters.
  • the coding scheme establishes baseline efficiencies in simulations, towards a flexible-write strategy for DNA storage. The level of efficiency for coded systems is anticipated to improve.
  • a principal element of DNA storage is the encoder's mapping from bits to template nucleotides (modulation), as well as the decoder's mapping from nucleotides of reconstructed sequences to bits (demodulation).
  • modulation maps $B$ bits to $K$ nucleotides: $b_1 b_2 b_3 \ldots b_B \rightarrow a_1 a_2 a_3 \ldots a_K$. In the ideal case, one template nucleotide stores a maximum of 2 bits. Therefore, an upper bound for every modulation scheme is the limit $B \le 2K$.
  • a demodulation step maps $K$ nucleotides to $B$ bits: $a_1 a_2 a_3 \ldots a_K \rightarrow b_1 b_2 b_3 \ldots b_B$.
  • $B = 2K$ is not achievable for several reasons.
  • the controlled process of synthesis adds each nucleotide one by one. According to a specific concentration of nucleotide triphosphates, each nucleotide is added correctly to the strand, or an error such as a missing nucleotide in the strand (deletion) may occur.
  • a current design constraint for enzymatic synthesis is to specify sequences with non-identical nucleotides (e.g., without AA, TT, CC, GG transitions). Specifying information only in sequences of non-identical nucleotides allows for facile data processing. Further work to account for polymerization extension lengths could remove such a constraint.
  • Constraints reflecting valid and invalid transitions between nucleotides may be expressed via a transition matrix $\Gamma$ (Figs. 21A-21B).
  • An upper bound for the maximum amount of bits stored per nucleotide is $\log_2 \lambda_{\max}(\Gamma)$, where $\lambda_{\max}(\Gamma)$ is the maximum eigenvalue of $\Gamma$. For enzymatic synthesis in this paper, self-transitions were forbidden, leading to an upper bound of $B \le K \log_2(3)$ (Fig. 21A).
  • minimizing the use of certain transition types, such as CA or CG would improve synthesis accuracy but reduce the amount of information bits stored per template nucleotide (Fig. 21B).
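  • The eigenvalue bound on bits per nucleotide can be illustrated with a short sketch. The 0/1 transition matrix is built with one row and column per nucleotide, and its largest eigenvalue is estimated by power iteration, a technique chosen here for self-containment rather than taken from the source. The helper names are hypothetical, and fully forbidding CA and CG in the second example is an illustrative assumption (the text only speaks of minimizing such transitions).

```python
import math

NUCLEOTIDES = "ACGT"


def transition_matrix(forbidden):
    """Entry [i][j] is 1.0 if nucleotide i may be followed by nucleotide j."""
    return [
        [0.0 if (a + b) in forbidden else 1.0 for b in NUCLEOTIDES]
        for a in NUCLEOTIDES
    ]


def max_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a non-negative primitive matrix by power iteration."""
    size = len(matrix)
    vector = [1.0] * size
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [
            sum(matrix[i][j] * vector[j] for j in range(size))
            for i in range(size)
        ]
        eigenvalue = max(product)  # vector is scaled so that its largest entry is 1
        vector = [x / eigenvalue for x in product]
    return eigenvalue


def bits_per_nucleotide(forbidden):
    """Upper bound log2(lambda_max) for the given set of forbidden transitions."""
    return math.log2(max_eigenvalue(transition_matrix(forbidden)))


self_transitions = {"AA", "CC", "GG", "TT"}
print(bits_per_nucleotide(self_transitions))                 # log2(3), about 1.585
print(bits_per_nucleotide(self_transitions | {"CA", "CG"}))  # a lower bound
```

  • Forbidding only self-transitions leaves three choices after every nucleotide, giving log2(3) bits per nucleotide; removing further transitions lowers the bound, which is the accuracy-versus-density trade-off described above.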
  • An important aspect of the modulation step of the encoder (Fig. 39B) is the insertion of synchronization nucleotides at regular intervals within each sequence.
  • embedded synchronization patterns provide resilience against alignment errors. The error-resilience is boosted significantly during the alignment and consensus step of the decoder (prior to the demodulation step in the pipeline).
  • Synchronization nucleotides are also utilized in the demodulation step of the decoder (Fig. 39B). As a tradeoff, the inclusion of synchronization nucleotides reduces the total space allocated for address and payload information.
  • each template nucleotide either stores 1 bit or 1 trit, or is selected for synchronization (Fig. 22A). Without the necessity for synchronization nucleotides, it would be possible to store up to 1.5 bits per template nucleotide (close to the upper bound of $\log_2(3)$ bits per nucleotide) by converting all input bits directly into trits.
  • the modulation scheme for specific sequences El, E2, E3, E4 synthesized in experiment is provided (Figs. 25A, 22B).
  • the demodulation step of the decoding pipeline attempts to reverse the steps of modulation.
  • demodulation converts a sequence of nucleotides into a mixture of bits and trits, and subsequently extracts a sequence of bits according to tables of conversion (Fig. 22B, Table 9). If errors exist within the sequence of nucleotides, the demodulation step may also output a sequence of bits containing errors. Synchronization nucleotides (Figs. 22A-22B) ensure that errors are localized within a sequence to some degree, limiting a propagation of errors.
  • the modulation scheme for simulations is nearly identical to the modulation scheme used in the "Eureka!" experiment and includes a similar synchronization pattern embedded per sequence.
  • a sequence of bits is converted to a mixture of bits and trits, and then to a sequence of nucleotides. It is noted that the intermediate mixture of bits and trits is designed to facilitate placement of information between synchronization nucleotides, while also ensuring that no self-transitions are possible.
  • the demodulation step consists of reciprocal conversions to map a sequence of nucleotides to a sequence of bits (Table 9).
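The bits-to-trits-to-nucleotides conversion described above can be sketched as follows. This is a minimal illustration, not the conversion of Table 9 (which is not reproduced in this text): the 3-bits-to-2-trits mapping, the alphabetical ordering of candidate nucleotides, the starting nucleotide `previous="A"`, and all function names are assumptions, and synchronization nucleotides are omitted. The sketch only shows why coding each trit as "one of the three nucleotides that differ from the previous one" rules out self-transitions.

```python
NUCLEOTIDES = "ACGT"


def bits_to_trits(bits):
    """Map each group of 3 bits to 2 trits (hypothetical mapping, 8 of 9 trit pairs used)."""
    if len(bits) % 3:
        raise ValueError("this sketch expects a multiple of 3 bits")
    trits = []
    for i in range(0, len(bits), 3):
        value = bits[i] * 4 + bits[i + 1] * 2 + bits[i + 2]
        trits.extend(divmod(value, 3))  # base-3 digits of a value in 0..7
    return trits


def trits_to_bits(trits):
    """Inverse of bits_to_trits."""
    bits = []
    for i in range(0, len(trits), 2):
        value = trits[i] * 3 + trits[i + 1]
        if value > 7:
            raise ValueError("trit pair (2, 2) is unused in this sketch")
        bits.extend([(value >> 2) & 1, (value >> 1) & 1, value & 1])
    return bits


def trits_to_nucleotides(trits, previous="A"):
    """Each trit selects one of the 3 nucleotides that differ from the previous one."""
    out = []
    for t in trits:
        choices = [n for n in NUCLEOTIDES if n != previous]
        previous = choices[t]
        out.append(previous)
    return "".join(out)


def nucleotides_to_trits(seq, previous="A"):
    """Inverse of trits_to_nucleotides."""
    trits = []
    for n in seq:
        choices = [c for c in NUCLEOTIDES if c != previous]
        trits.append(choices.index(n))
        previous = n
    return trits


bits = [1, 0, 1, 0, 1, 1]
seq = trits_to_nucleotides(bits_to_trits(bits))
recovered = trits_to_bits(nucleotides_to_trits(seq))
```

In this sketch 6 bits map to 4 nucleotides, which is the 1.5 bits per nucleotide figure quoted for a scheme without synchronization nucleotides.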
  • the following table specifies the conversion of B bits per sequence to K nucleotides per template sequence for all DNA storage systems analyzed in this paper.
  • the conversion utilizes an intermediate form of information which consists of a mixture of bits and trits.
  • the demodulation step of the decoder reverses the steps of modulation.
  • The end-to-end efficiency rate of storage may be computed for all experimental and simulated systems. Specifically, starting with 12 bits of data and addresses stored per sequence, an ECC per sequence results in B bits per sequence. Then B bits per sequence are converted and modulated into K nucleotides per sequence, including synchronization nucleotides. The following table lists these efficiencies for information storage in template DNA sequences.
  • DNA storage was modeled as an input-output subsystem involving only a sequence of B bits (Fig. 39B). Based on this abstraction, the input to the DNA storage system can be represented by a sequence of B bits prior to modulation. Similarly, the output can be represented by a sequence of B bits, obtained after demodulation. The output bit sequence may contain errors.
  • Random input sequences of B bits were generated, and output sequences of B bits were obtained by simulating a subsystem within the encoding and decoding pipeline (Fig. 39B).
  • The probability of bit error, denoted by $P_{\text{bit-error}}$, was estimated by averaging over all input-output bit sequences.
  • The capacity was derived to be $B\,(1 - H_2(P_{\text{bit-error}}))$ bits. In this standard capacity formula for a bit-error memoryless channel, $H_2(\cdot)$ denotes the binary entropy function (T. M. Cover, J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012)).
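As a worked illustration of the standard capacity formula for a memoryless bit-error channel, B(1 − H2(p)), the following sketch computes the capacity for a given number of bits per sequence and an estimated bit-error probability. The function names and the example values are illustrative assumptions, not values taken from the experiments.

```python
import math


def binary_entropy(p):
    """Binary entropy H2(p) in bits; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def capacity_bits(b, p_bit_error):
    """Capacity of B uses of a binary symmetric channel with the given error rate."""
    return b * (1.0 - binary_entropy(p_bit_error))


error_free = capacity_bits(33, 0.0)   # every bit is usable
useless = capacity_bits(10, 0.5)      # output is independent of input
noisy = capacity_bits(100, 0.11)      # roughly half of the bits remain usable
```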
  • The capacity in bits, up to a maximum of B bits per template sequence, was plotted for different levels of synthesis accuracy (Figs. 36A-36F).
  • a template sequence of 38 nucleotides could store 10 more bits of data and addresses, an increase from 23 to 33 bits (Fig. 36A).
  • 27 and 70 more bits of data and addresses could be stored per template sequence of 74 and 152 nucleotides, respectively, at the same level of synthesis accuracy (Figs. 36C and 36E).
  • the codec was also tested with a combination of missing nucleotides, substitutions, and insertion errors (Figs. 36B, 36D, and 36F).
  • Enzymatic synthesis produces populations of diverse strand variants from each DNA sequence.
  • the presence of diversity in DNA strands enables a larger set of strategies for synthesis, storage, and sequencing.
  • Encoding DNA sequences with synchronization patterns (i.e., scaffolding)
  • the term scaffolding is used to denote specially designed synchronization patterns in DNA sequences. This section describes algorithms for the alignment of diverse DNA strands by scaffolding and consensus.
  • The $i$-th synthesized compressed strand comprises a random number of nucleotides, and its random length is represented by the random variable $L_i$.
  • One particular realization of the $i$-th synthesized strand is denoted by the following vector:
  • The length $\ell_i$ is a realization of the random variable $L_i$. Given a set of synthesized strands, a decoder must estimate correctly which original sequence was intended for storage. This estimation is computed based on the probabilistic framework of the Markov model (Fig. 23A). Such a framework is common to and adapted from the framework of synchronization codes used in traditional communication systems (24).
  • Optimal alignment of diverse strands
  • the method for aligning diverse strands is based on maximum a posteriori (MAP) estimation of each nucleotide.
  • The notation $\cap$ indicates a set of events occurring simultaneously. Realizations of random variables are denoted by lower-case symbols in the above formula. Associated probabilities are computed based on a Markov chain model which characterizes how synthesized DNA strands (outputs) are produced from an input sequence (Fig. 23A).
  • The optimal alignment is computed efficiently via dynamic programming recursions (explained in subsequent sections) if the number of strands is a small constant, and if the length of DNA sequences is short. While sequence lengths are short in DNA storage systems, the number of synthesized strands per sequence may be large. Therefore, it is critical to design approximations to the above exact optimization. For future algorithmic designs, it is noted that a superior alignment may be estimated for all input nucleotides $O_1, O_2, \ldots, O_K$ jointly. However, individual probability estimates computed per nucleotide allow for the direct application of consensus rules and error-correction after alignment.
  • the above product rule may be derived from Bayes' theorem directly, and is related to a simple Bayes classifier.
  • The given consensus optimization is computed efficiently via dynamic programming recursions, and remains tractable even for an increasing number of strand variants. Its computational complexity scales linearly in the number of strands.
  • The key difference between the above consensus product rule and the optimal solution of alignment is that the inner probability only involves a single strand, as opposed to all strands jointly. As the number of strands increases, the inner probability may be computed for each strand separately and efficiently, after which a product rule is applied.
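A product rule of this kind can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes the usual naive-Bayes form: if strands are conditionally independent given the template nucleotide, the joint posterior is proportional to the prior raised to the power (1 − N) times the product of the N per-strand posteriors. The function name, the dictionary layout, and the uniform default prior are assumptions for illustration.

```python
import math

ALPHABET = "ACGT"


def consensus_posterior(per_strand_posteriors, prior=None):
    """Combine per-strand posteriors for one template position with a product rule.

    per_strand_posteriors: list of dicts mapping nucleotide -> P(O_k = o | strand i).
    Returns a normalized dict mapping nucleotide -> combined posterior.
    """
    n = len(per_strand_posteriors)
    if prior is None:
        prior = {a: 1.0 / len(ALPHABET) for a in ALPHABET}  # assumed uniform prior
    log_score = {}
    for a in ALPHABET:
        score = (1 - n) * math.log(prior[a])
        for posterior in per_strand_posteriors:
            score += math.log(max(posterior[a], 1e-300))  # guard against log(0)
        log_score[a] = score
    top = max(log_score.values())
    weights = {a: math.exp(s - top) for a, s in log_score.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


strand_view = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
combined = consensus_posterior([strand_view, strand_view])
best = max(combined, key=combined.get)
```

Because each strand contributes one term to the sum of logarithms, the cost grows linearly with the number of strands, which matches the scaling stated above.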
  • Dynamic programming is designed to utilize pre-existing computations in a recursive manner.
  • a two-dimensional table of probabilities is populated in the "forward" direction.
  • a two-dimensional table of probabilities is populated in the "backward” direction.
  • The following table summarizes the recursive computations required. The sum of the probabilities in each column of the forward and backward tables yields the forward and backward probabilities, respectively.
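The forward pass of such a dynamic program can be sketched as below. This is a simplified stand-in for the recursion summarized in the table, which is not reproduced in this text: it assumes independent deletions and substitutions only (no insertions, unlike the Markov model of Fig. 23A), and the function name and default probabilities are hypothetical. Entry `forward[k][j]` holds the probability that the first k template nucleotides produced the first j nucleotides of the observed strand.

```python
def strand_likelihood(template, strand, p_del=0.1, p_sub=0.01):
    """P(strand | template) under an assumed i.i.d. deletion/substitution channel."""
    K, L = len(template), len(strand)
    forward = [[0.0] * (L + 1) for _ in range(K + 1)]
    forward[0][0] = 1.0
    for k in range(1, K + 1):
        for j in range(L + 1):
            # Case 1: template nucleotide k was deleted (missing in the strand).
            prob = p_del * forward[k - 1][j]
            if j > 0:
                # Case 2: template nucleotide k was written as strand nucleotide j.
                match = template[k - 1] == strand[j - 1]
                emit = (1.0 - p_sub) if match else p_sub / 3.0
                prob += (1.0 - p_del) * emit * forward[k - 1][j - 1]
            forward[k][j] = prob
    return forward[K][L]


perfect = strand_likelihood("ACGT", "ACGT", p_del=0.0, p_sub=0.0)
one_missing = strand_likelihood("AC", "A", p_del=0.1, p_sub=0.0)
```

A backward table is filled the same way from the opposite end of both sequences; combining the two gives per-position posteriors.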
  • MAP estimation by scaffolding is provided (Figs. 23A-23B and Figs. 24A-24E).
  • Decoding by alignment is possible because of the synchronization pattern embedded as a scaffold in the template sequence.
  • the synchronization nucleotides provide strong cues for the correct placement of other nucleotides.
  • the ⁇ / ⁇ probabilities include and ⁇ ) as well as e C » *
  • groupwise alignment from three strands may be computed:
  • Another algorithm for alignment, termed majority voting alignment, consists of greedy consensus (T. Batu, S. Kannan, S. Khanna, A. McGregor, Reconstructing Strings From Random Traces, in Proceedings of the Fifteenth Annual (ACM-SIAM) Symposium on Discrete Algorithms, (SODA), New Orleans, Louisiana, USA, January 11-14, 2004, pp. 910-918). It was found that such an algorithm was not sufficient to correct a large number of errors such as missing nucleotides, given only 10 filtered strands (Figs. 38A-38D). However, majority voting alignment may be combined with codes such as repetition coding to correct a larger number of errors. A full analysis of a coded form of majority voting alignment is an interesting direction to explore for future algorithmic designs.
  • Considerations of increased diversity for consensus
  • Enzymatic synthesis of a template sequence produces raw strands (strands^R) with variable extension length per nucleotide. From each strand^R, transitions can be extracted to form compressed strands (strands^C). Each strand may be of variable length. For subsequent analyses in this section, the distribution of strand lengths was modeled, and the number of diverse strand variants of each length was computed. Edit distances between synthesized strand variants and the original template sequence were also provided, along with a detailed error analysis.
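Extracting transitions from a raw strand amounts to collapsing each run of identical nucleotides into a single nucleotide. A minimal sketch, with a hypothetical function name and an invented example read:

```python
from itertools import groupby


def compress_strand(raw_strand):
    """Collapse homopolymer runs so that only transitions between nucleotides remain."""
    return "".join(base for base, _run in groupby(raw_strand))


compressed = compress_strand("AAACCGTTTA")
```

The number of nucleotides added per extension step therefore does not affect the compressed strand, as long as at least one nucleotide is added.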
  • Synthesis errors resulting in missing nucleotides (deletions) or insertions directly affect the length of a strand^C, unlike conventional errors such as substituted (mismatched) nucleotides.
  • A mathematical model was constructed for nucleotide errors occurring in synthesized strands^C.
  • The model is a Markov model (Fig. 23A) with a state space indicating different types of nucleotide errors such as missing nucleotides (deletions), substituted nucleotides, and insertions.
  • Each strand^C variant is synthesized independently and according to identical error statistics, as specified in the Markov model (e.g., $P_{sub}$, $P_{ins}$). The error process results in several unique synthesized strands^C.
  • These strands^C can be aligned to reconstruct the original sequence. While reconstruction is possible through alignment and probabilistic consensus, often the exact determination of error events in strands is ambiguous. For example, a random insertion followed by a deletion of an intended nucleotide is indistinguishable from a substitution error (Fig. 23A).
  • There exist K nucleotides in the template sequence.
  • The length of a synthesized strand^C is also a random variable, denoted here by $L$.
  • The length $L$ has a probability mass function $P_L(l)$. Assuming each write is independent of previous and future writes, the generating function for length $L$ is given by,
  • Binomial distribution (Special case)
  • Size selection processes, performed in silico or in vitro to keep only longer synthesized strand^C variants, decrease the effective number of missing nucleotides to be resolved.
  • Size selection of strand^C variants led to a reduction in the effective probability of missing nucleotides.
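The effect of size selection can be illustrated under the binomial special case. The sketch below assumes that deletions are the only length-changing error and that each of the K template nucleotides is missing independently with probability `p_del`, so the strand length follows a Binomial(K, 1 − p_del) distribution. The function names and the numeric values are illustrative assumptions.

```python
from math import comb


def length_pmf(K, p_del):
    """P(L = l) for l = 0..K when each nucleotide is independently missing."""
    return [comb(K, l) * (1 - p_del) ** l * p_del ** (K - l) for l in range(K + 1)]


def effective_deletion_probability(K, p_del, min_length):
    """Average fraction of missing nucleotides among strands with L >= min_length."""
    pmf = length_pmf(K, p_del)
    kept = sum(pmf[min_length:])
    if kept == 0:
        raise ValueError("no strands survive this size selection")
    mean_length = sum(l * pmf[l] for l in range(min_length, K + 1)) / kept
    return 1 - mean_length / K


unfiltered = effective_deletion_probability(10, 0.2, min_length=0)
filtered = effective_deletion_probability(10, 0.2, min_length=8)
full_length_only = effective_deletion_probability(10, 0.2, min_length=10)
```

Raising the length threshold lowers the effective deletion probability but also discards strands, so fewer strands remain for consensus.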
  • Enzymatic synthesis not only produces strands^C of different lengths, but also produces diverse strands^C.
  • Each strand^C may contain errors such as missing nucleotides in different positions relative to its corresponding template sequence.
  • the theoretical diversity was compared with the experimentally observed diversity produced by enzymatic synthesis.
  • This upper bound is equivalent to the total number of strands of length $l$ obtained after $(K - l)$ deletion errors, and is independent of the template sequence itself.
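The bound itself is not reproduced in this text. As a hedged illustration only, the sketch below uses a simple combinatorial bound that is certainly valid: there are C(K, l) ways to choose which K − l nucleotides to delete, so no template can yield more than C(K, l) distinct strands of length l. The bound in the source may be tighter. The function name and example templates are assumptions.

```python
from itertools import combinations
from math import comb


def distinct_variants(template, length):
    """All distinct strands of the given length obtainable by deletions alone."""
    return {"".join(kept) for kept in combinations(template, length)}


all_different = distinct_variants("ACGT", 2)
repetitive = distinct_variants("ACAC", 2)
bound = comb(4, 2)
```

A template with repeated motifs produces fewer distinct variants than the bound, because different deletion patterns can lead to the same strand.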
  • Proposition: Define $C_{\text{cap}}(M, K)$ as the storage capacity.
  • the first inequality states that the number of nucleotides per sequence must increase, and not be held constant, to store enough bits.
  • The third inequality states that the number of nucleotides per sequence must increase at least by $\log_2 M$ in order to increase storage capacity. This bound indirectly implies that storing an address per sequence, which requires $O(\log M)$ bits of storage per sequence, is a minimal requirement for reassembly.
  • The prototype comprises two main parts: a Mantis liquid handler, which has a single robotic arm that can be programmed to dispense one of six reagents at a time, and custom jigs, which were either laser cut (Epilog Legend 36EXT) or machined (gift from Formulatrix) to hold the glass slide acting as a solid support substrate for the DNA (Fig. 40A).
  • Initiator immobilization and surface preparation
  • a 5' amine-modified initiator oligo (5Aml2-fSBS3-ctgag) and a 3' amine-modified blocking oligo (10T-3Am) were covalently attached onto an aldehydesilane-coated microscope slide (Schott Nexterion Slide AL).
  • The blocking oligo was included to prevent unwanted interactions, such as adsorption, between the initiator or enzymes and the surface. To do this, an oligo mixture was created containing 2 µM 5Aml2-fSBS3-ctgag and 8 µM 10T-3Am in 3X SSC (1X SSC is 150 mM NaCl and 15 mM sodium citrate) and 1.5 M betaine.
  • The oligo mixture was dispensed as 0.1 µL droplets onto the slide using a Mantis liquid handler (Figs. 40B-40C). Following the dispense, the slide was incubated at room temperature for 30 minutes in a parafilm-sealed Petri dish with Kimwipes saturated with 4X SSC. Then, the slide was transferred to a 100°C hotplate and dried for 30 minutes.
  • The synthesis procedure depends on precise and specific localization of nucleotide triphosphates and enzymatic mixes to initiator spots, which are denoted as features.
  • Once these droplets are dispensed, however, they are prone to spread unevenly and uncontrollably across the glass surface and may contaminate neighboring features.
  • To constrain the droplets, virtual "wells" were created for each feature by increasing the hydrophobicity in the areas between features. Dispensed droplets should then stay localized on each feature. First, 0.3 µL droplets containing 3X SSC and 1.5 M betaine were dispensed on top of the features using a Mantis liquid handler, and the slide was then dried for 30 minutes on a 100°C hotplate.
  • The slide was dipped in Sigmacote (Sigma), which produces a neutral hydrophobic film over the areas of the glass that do not contain features, dried under a fume hood for 5 minutes, then dried for 5 minutes on a 100°C hotplate. Afterwards, the slide was washed twice with 0.2% SDS and three times with distilled water (Invitrogen UltraPure). The slide was then stringently washed by placing it in a boiling solution of 0.2X SSC for 15 minutes, then in room-temperature distilled water (Invitrogen UltraPure). Lastly, to reduce Schiff bases and unreacted aldehydes, the slide was incubated for 10 minutes in a sodium borohydride reducing solution.
  • The solution was prepared by dissolving 0.12 g of NaBH4 (Sigma) in 30 mL phosphate buffered saline (PBS, Invitrogen), then adding 10 mL of 100% ethanol. Afterwards, the slide was washed once with 0.2% SDS and three times with distilled water (Invitrogen UltraPure). The prepared slide was then kept in an ice-cold ethanol bath until use.
  • Each synthesis cycle was composed of the following six steps: (i) the slide is placed on a custom jig for the Mantis liquid handler; (ii) a 0.5 ⁇ !, dispense of enzymatic reaction mix, comprised of Ix Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with ⁇ / ⁇ TdT (Enzymatics) and 0.25mU/uL apyrase (NEB); (iii) a 0.1 ⁇ dispense of a nucleotide triphosphate at the following 6X concentrations in 10% PEG 8000 + 0.05% Triton X-100: 60 ⁇ dATP, 75 ⁇ Br-dCTP, 18 ⁇ dGTP, and () ⁇ !
  • the synthesized strands were then released from the slide surface by cleaving the uracils located on the 5' end of the initiators with USER enzymes.
  • The cleavage reaction mixture was composed of 0.18 units of UDG (Enzymatics) per µL, 0.18 units of Endonuclease VIII (Enzymatics) per µL, and 0.5 µM ttSBS9 in USER TE-T buffer (40 mM Tris-HCl pH 8.0, 1 mM EDTA, 0.01% Tween-20). The cleavage mixture was dispensed as 2 µL droplets with the Mantis liquid handler onto each of the features.
  • A sequencing library for each feature was generated next. Using cycle-limited real-time PCR, 5 µL of each feature was first amplified with the primers tSBS3 and ttSBS9, then with NEBNext Dual Indexing Primers for 15 cycles. Barcoded strands were then combined and sequenced single end using Illumina MiSeq v3 50.
  • each DNA sequence to be synthesized occupies a physical spot, also denoted as a feature, on a planar surface.
  • All DNA sequences are arranged as a 2D array to allow spatial addressing (x and y Cartesian coordinates). All DNA sequences are synthesized in parallel per cycle, that is, all features receive their first nucleotide during the first cycle, they then all receive their second nucleotide during the next cycle, and so on.
  • Each cycle consists of a series of reactions. Reagents for each reaction may be dispensed directly to each feature by non-contact inkjet dispense, or to all features by first sealing the array surface to form a flow cell and then flushing the reagent through.
  • A reagent dispensed by inkjet is denoted as droplet, whereas a reagent that is flushed is denoted as flowcell.
  • the total cost of reagents for a cycle of each synthesis process can be computed as follows:
  • $\text{Cycle\_cost}_{\text{chem}} = (\$_{fc} \times V_f) + (n \times \$_{dc} \times V_d)$
  • n is the total number of features
  • is a constant representing the cost of droplet reagents in enzymatic synthesis
  • $V_d$ is the droplet volume in cubic centimeters
  • $\$_{fc}$ represents the cost of flowcell reagents in chemical synthesis
  • $\$_{de}$ represents the cost of droplet reagents in enzymatic synthesis.
  • Flowcell area ($A$) can be expressed as a function of the number of features ($n$) and the density ($D$) of features:
  • Cycle_cast 9ng ($ fsS X c X. n ⁇ D) -f- (n. X $ ⁇ X 3 ⁇ 4 X rf 3 )
  • The number of features and feature density from the Agilent SurePrint G3 system were then utilized as a physical basis for projecting reagent costs for synthesis.
  • The phosphoramidite reagent cost per cycle is 0.626 USD, whereas the enzymatic reagent cost per cycle is 0.055 USD (assuming 61.3) or 0.0044 USD (assuming 4.38), a ~11-fold and ~140-fold drop in cost, respectively.
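The quoted fold reductions follow directly from the per-cycle reagent costs stated above. A short check, using only those three figures (the variable names are illustrative):

```python
phosphoramidite_cost = 0.626          # USD per cycle, as quoted
enzymatic_costs = (0.055, 0.0044)     # USD per cycle, two quoted scenarios

# Ratio of phosphoramidite cost to each enzymatic cost.
fold_reductions = [phosphoramidite_cost / cost for cost in enzymatic_costs]
```

The ratios come out at roughly 11.4 and 142, consistent with the rounded figures in the text.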
  • A cost-effective strategy would be to increase the number of synthesized features, $n$, for a given surface area (increasing the feature density, $D$, as a result) per cycle and to minimize the total number of cycles, thereby limiting flowcell reagent cost. For this approach, it is assumed that features are maximally packed, end-to-end, in a given surface area.
  • The flowcell area ($A$) can be alternatively expressed as a function of the number of features ($n$) and feature size diameter ($d$):
  • Efficiency rate of storage: For ease, the average efficiency rate of storage is set to be equivalent for both enzymatic and phosphoramidite synthesis, storing an average of 1 bit per template nucleotide. The rate for each approach may be different depending on factors such as synthesis accuracy and the required addition of error-correction codes per template sequence to ensure accurate information recovery. Altering the efficiency rate of storage for each process will change costs linearly, and the resulting difference between enzymatic and phosphoramidite approaches would likely be within an order of magnitude. Improvements to enzymatic synthesis will increase the efficiency rate of storage to be competitive with that of phosphoramidite synthesis. Such improvements will also influence the number of diversely synthesized strands needed for template reconstruction and inform the minimum required feature size.
  • Feature density: For the reagent cost per megabyte projections, features are maximally packed with no spacing between them. Practically, features are likely to be separated by a gap, usually a fraction of the feature size, to accommodate potential positioning errors when droplets are dispensed. The number of features will then decrease inversely proportional to the square of the gap size (equation 11 to be modified accordingly). As this parameter is the same for calculating reagent costs for both phosphoramidite and enzymatic synthesis, altering the number of features may change absolute costs for each approach, but relative comparisons between approaches will remain unchanged.
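The effect of a gap on feature count can be sketched as follows. The packing geometry is an assumption: features are placed on a square grid whose pitch is the feature diameter plus the gap, with the gap expressed as a fraction of the feature diameter. The function name and the example dimensions are hypothetical.

```python
import math


def features_per_area(area_um2, feature_diameter_um, gap_fraction=0.0):
    """Number of features on a square grid with pitch d * (1 + gap_fraction)."""
    pitch = feature_diameter_um * (1.0 + gap_fraction)
    return math.floor(area_um2 / pitch ** 2)


packed = features_per_area(10_000, 10)                     # no gap
spaced = features_per_area(10_000, 10, gap_fraction=1.0)   # gap equals feature size
```

Doubling the pitch cuts the feature count by a factor of four, which is the quadratic dependence described above. The same factor applies to both synthesis chemistries, so their cost ratio is unchanged.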
  • Feature size: Reaching the projected costs depends on overcoming significant engineering challenges associated with miniaturizing feature sizes.
  • Current inkjet printheads dispense 1-10 picoliter droplets, resulting in feature sizes of 15-38 microns (equation 4 and (FUJIFILM Dimatix collaborates with Agilent in developing Inkjet technology for advanced life sciences applications | Press Center
  • To reach the projected cost per megabyte equivalent to magnetic tape, phosphoramidite features must be ~40 nm, which requires dispensing a 0.016 attoliter droplet, whereas enzymatic features must be ~350-800 nm, which requires dispensing a droplet of 11-134 attoliters.
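The relation between feature diameter and droplet volume (equation 4) is not reproduced in this text. The sketch below assumes a hemispherical droplet sitting on the feature, V = (2/3)π(d/2)³ = πd³/12. This geometry is an assumption, although it does reproduce the nanometer-scale volumes quoted above. The function name is hypothetical.

```python
import math

NM3_PER_ATTOLITER = 1e6  # 1 attoliter = 1e-21 m^3 = 1e6 nm^3


def droplet_volume_attoliters(feature_diameter_nm):
    """Volume of an assumed hemispherical droplet of the given base diameter."""
    volume_nm3 = math.pi * feature_diameter_nm ** 3 / 12.0
    return volume_nm3 / NM3_PER_ATTOLITER


phosphoramidite = droplet_volume_attoliters(40)   # about 0.017 aL
enzymatic_small = droplet_volume_attoliters(350)  # about 11 aL
enzymatic_large = droplet_volume_attoliters(800)  # about 134 aL
```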
  • Equipment amortization is another important, but often neglected, cost consideration.
  • Capital equipment costs are likely to increase significantly as DNA synthesis is scaled to achieve target costs.
  • specialty dispensers will be required.
  • The time required for a dispenser to find the correct feature to receive a droplet reagent, the dispenser seek time, becomes an important consideration.
  • positioning systems likely with nanometer-scale resolution will be required, which may be expensive or prone to breakdown. While these are all important factors, there is insufficient information to estimate relevant parameters. Accordingly, it was assumed for ease that all seek times could be instantaneous and thus equipment amortization for enzymatic and phosphoramidite would be primarily dependent on their respective cycle time.
  • A conservative estimate of enzymatic cycle time is ~4-fold shorter than phosphoramidite chemistry (Table 6), which could result in a shortened amortization schedule, further reducing total synthesis costs.
  • $V_f$ and $V_d$ represent flowcell and droplet volume, respectively.
  • the two highlighted values on the bottom of the Phosphoramidite Price section are and c .
  • Table 7. Parameters of DNA Storage Systems

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present disclosure provides a method of decoding a nucleotide sequence, where the nucleotide sequence encodes a value corresponding to a format of information. The method includes determining the nucleotide sequence, identifying a transition or boundary or edge between different or non-identical nucleotides of the nucleotide sequence, and assigning a predefined value to the identified transition or boundary or edge to create the value encoded in the nucleotide sequence corresponding to the format of information.
PCT/US2018/056900 2017-10-20 2018-10-22 Methods of encoding and high-throughput decoding of information stored in DNA Ceased WO2019079802A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762575017P 2017-10-20 2017-10-20
US62/575,017 2017-10-20

Publications (1)

Publication Number Publication Date
WO2019079802A1 true WO2019079802A1 (fr) 2019-04-25

Family

ID=66173900

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/056900 Ceased WO2019079802A1 (fr) 2017-10-20 2018-10-22 Methods of encoding and high-throughput decoding of information stored in DNA

Country Status (1)

Country Link
WO (1) WO2019079802A1 (fr)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091876A (zh) * 2019-12-16 2020-05-01 中国科学院深圳先进技术研究院 DNA storage method, system and electronic device
CN112288089A (zh) * 2020-09-28 2021-01-29 清华大学 Array-type nucleic acid information storage method and device
WO2021064095A1 (fr) 2019-10-01 2021-04-08 Centre National De La Recherche Scientifique Biocompatible nucleic acids for digital data storage
CN113314187A (zh) * 2021-05-27 2021-08-27 广州大学 Data storage method, decoding method, system, device and storage medium
WO2021242446A1 (fr) * 2020-05-28 2021-12-02 Microsoft Technology Licensing, Llc De novo polynucleotide synthesis with substrate-bound polymerase
US11268091B2 (en) 2018-12-13 2022-03-08 Dna Script Sas Direct oligonucleotide synthesis on cells and biomolecules
US11773422B2 (en) 2019-08-16 2023-10-03 Microsoft Technology Licensing, Llc Regulation of polymerase using cofactor oxidation states
US11795450B2 (en) 2019-09-06 2023-10-24 Microsoft Technology Licensing, Llc Array-based enzymatic oligonucleotide synthesis
US11995558B2 (en) 2018-05-17 2024-05-28 The Charles Stark Draper Laboratory, Inc. Apparatus for high density information storage in molecular chains
EP4397772A3 (fr) * 2020-04-24 2024-07-24 Microsoft Technology Licensing, LLC Homopolymer primers for amplification of polynucleotides created by enzymatic synthesis
US12227775B2 (en) 2017-05-22 2025-02-18 The Charles Stark Draper Laboratory, Inc. Modified template-independent DNA polymerase

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120142006A1 (en) * 2000-10-06 2012-06-07 The Trustees Of Columbia University In The City Of New York Massive parallel method for decoding dna and rna
US20170017436A1 (en) * 2015-07-13 2017-01-19 President And Fellows Of Harvard College Methods for Retrievable Information Storage Using Nucleic Acids

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12227775B2 (en) 2017-05-22 2025-02-18 The Charles Stark Draper Laboratory, Inc. Modified template-independent DNA polymerase
US11995558B2 (en) 2018-05-17 2024-05-28 The Charles Stark Draper Laboratory, Inc. Apparatus for high density information storage in molecular chains
US11993773B2 (en) 2018-12-13 2024-05-28 Dna Script Sas Methods for extending polynucleotides
US11268091B2 (en) 2018-12-13 2022-03-08 Dna Script Sas Direct oligonucleotide synthesis on cells and biomolecules
US11773422B2 (en) 2019-08-16 2023-10-03 Microsoft Technology Licensing, Llc Regulation of polymerase using cofactor oxidation states
US11795450B2 (en) 2019-09-06 2023-10-24 Microsoft Technology Licensing, Llc Array-based enzymatic oligonucleotide synthesis
WO2021064095A1 (fr) 2019-10-01 2021-04-08 Centre National De La Recherche Scientifique Biocompatible nucleic acids for digital data storage
CN115380329A (zh) * 2019-10-01 2022-11-22 法国国家科学研究中心 Biocompatible nucleic acids for digital data storage
JP2022552790A (ja) * 2019-10-01 2022-12-20 サントル ナショナル デ ラ ルシェルシュ シアンティフィック Biocompatible nucleic acids for digital data storage
CN111091876B (zh) * 2019-12-16 2024-05-17 中国科学院深圳先进技术研究院 DNA storage method, system and electronic device
CN111091876A (zh) * 2019-12-16 2020-05-01 中国科学院深圳先进技术研究院 DNA storage method, system and electronic device
EP4397772A3 (fr) * 2020-04-24 2024-07-24 Microsoft Technology Licensing, LLC Homopolymer primers for amplification of polynucleotides created by enzymatic synthesis
US12385086B2 (en) 2020-04-24 2025-08-12 Microsoft Technology Licensing, Llc Homopolymer primers for amplification of polynucleotides created by enzymatic synthesis
US11702683B2 (en) 2020-05-28 2023-07-18 Microsoft Technology Licensing, Llc De novo polynucleotide synthesis with substrate-bound polymerase
WO2021242446A1 (fr) * 2020-05-28 2021-12-02 Microsoft Technology Licensing, Llc De novo polynucleotide synthesis with substrate-bound polymerase
CN112288089B (zh) * 2020-09-28 2022-12-20 清华大学 Array-type nucleic acid information storage method and device
CN112288089A (zh) * 2020-09-28 2021-01-29 清华大学 Array-type nucleic acid information storage method and device
CN113314187A (zh) * 2021-05-27 2021-08-27 广州大学 Data storage method, decoding method, system, device and storage medium

Similar Documents

Publication Publication Date Title
WO2019079802A1 (fr) Methods of encoding and high-throughput decoding of information stored in DNA
Lee et al. Terminator-free template-independent enzymatic DNA synthesis for digital information storage
JP7586880B2 (ja) Nucleic acid-based data storage
Yazdi et al. DNA-based storage: Trends and methods
Lu et al. Enzymatic DNA synthesis by engineering terminal deoxynucleotidyl transferase
US20240070422A1 (en) Methods of Storing Information Using Nucleic Acids
JP7277054B2 (ja) Homopolymer encoded nucleic acid memory
Yu et al. High-throughput DNA synthesis for data storage
US11286479B2 (en) Chemical methods for nucleic acid-based data storage
CA3100529A1 (fr) Compositions and methods for nucleic acid-based data storage
US11795450B2 (en) Array-based enzymatic oligonucleotide synthesis
US11174512B2 (en) Homopolymer encoded nucleic acid memory
Lee et al. Enzymatic DNA synthesis for digital information storage
Baek et al. Recent progress in high-throughput enzymatic DNA synthesis for data storage
Roquet et al. DNA-based data storage via combinatorial assembly
Jo et al. Recent progress in DNA data storage based on high-throughput DNA synthesis
US20240293818A1 (en) Temperature-controlled fluidic reactions system
JP2024530614A (ja) Compositions, systems, and methods for nucleic acid data storage
US20250239331A1 (en) Combinatorial enumeration and search for nucleic acid-based data storage
Lin et al. Cap-free DNA synthesis enables scalable high-yield DNA data storage
HK40015249A (en) Nucleic acid-based data storage
HK1210848B (zh) Methods of storing information using nucleic acids

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18868230

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18868230

Country of ref document: EP

Kind code of ref document: A1