WO2025054432A1 - Systems and methods for DNA reading

Info

Publication number: WO2025054432A1
Application number: PCT/US2024/045555
Authority: WO
Inventors: Mirko PALLA; Sean MIHM; Swapnil P. BHATIA; Miriam S. RAMLIDEN; Das PERERA; Dionis Minev
Original assignee: Catalog Technologies Inc
Current assignee: Catalog Technologies Inc
Priority date: 2023-09-06
Filing date: 2024-09-06
Publication date: 2025-03-13
Anticipated expiration: 2026-03-06

Abstract

The technologies described in this specification include technologies for fast and accurate reading of information stored in nucleic acid molecules. The technologies include a method for reading digital information written into nucleic acid sequence(s). The technologies include providing nucleic acid molecules indicative of digital information comprising a string of symbols and modifying at least a portion of the nucleic acid molecules with one or more labels. One or more modified nucleic acid molecules are translocated through one or more nanopores in a substrate and current signals are detected. A first current level corresponds to passage of a portion of the nucleic acid molecules and a second current level corresponds to passage of one or more labels. The reading includes identifying the one or more labels from the second voltage level and identifying the digital information based at least in part on the identified label(s).

Description

SYSTEMS AND METHODS FOR DNA READING

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This Application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/536,760, filed on September 6, 2023, titled “Systems and Methods for DNA reading,” the entire contents of which are hereby incorporated by reference.

BACKGROUND

[0002] Nucleic acid digital data storage is a stable approach for encoding and storing information for long periods of time, with data stored at higher densities than magnetic tape or hard drive storage systems. Additionally, digital data stored in nucleic acid molecules that are stored in cold and dry conditions can be retrieved as long as 60,000 years later or longer. To access digital data stored in nucleic acid molecules, the nucleic acid molecules may be sequenced. As such, nucleic acid digital data storage may be an ideal method for storing data that is not frequently accessed but may have a high volume of information to be stored or archived for long periods of time.

SUMMARY

[0003] The technologies described in this specification include methods and systems for fast and accurate reading of information stored in nucleic acid molecules. In an aspect, the technologies include a method for reading digital information written into nucleic acid sequence(s). The technologies include providing nucleic acid molecules indicative of digital information comprising a string of symbols and modifying at least a portion of the nucleic acid molecules with one or more labels. The technologies include translocating one or more modified nucleic acid molecules through one or more nanopores disposed in a substrate and configured to receive an input nucleic acid molecule. The technologies include reading one or more signals received from the one or more nanopores and translating the one or more signals into the information. The reading includes detecting an electric current signal from the one or more nanopores and detecting a change in current through the one or more nanopores corresponding to the translocation. A first current level corresponds to passage of a portion of the nucleic acid molecules indicative of the digital information and a second current level corresponds to passage of one of the one or more labels through the one or more nanopores. The reading includes identifying the one label from the second voltage level and identifying, from a library of nucleic acid molecules, the digital information based at least in part on the identified label.
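
The two-level readout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the reference levels in `LEVELS`, the sign convention (deeper blockage as a more negative value), and the sample trace are hypothetical and are not taken from the specification. Each sample is assigned to the nearest reference level, and each contiguous run at the label level is counted as one label passage.

```python
from itertools import groupby

# Hypothetical reference levels in arbitrary units. Real values depend on the
# pore, the buffer, and the label chemistry; none are given in the text.
LEVELS = {"open": 0.0, "dna": -1.0, "label": -2.0}

def classify(sample):
    """Return the name of the reference level closest to one current sample."""
    return min(LEVELS, key=lambda name: abs(LEVELS[name] - sample))

def segment(trace):
    """Collapse a sampled trace into (level_name, run_length) segments."""
    names = (classify(s) for s in trace)
    return [(name, len(list(run))) for name, run in groupby(names)]

def count_labels(trace):
    """Count contiguous runs at the label level (one run per label passage)."""
    return sum(1 for name, _ in segment(trace) if name == "label")

# Illustrative trace: open pore, DNA entering, two label passages, DNA leaving.
trace = [0.0, 0.1, -0.9, -1.1, -2.0, -1.9, -1.0, -2.1, -1.0, -0.1, 0.0]
print(segment(trace))
print(count_labels(trace))
```

A practical reader would also need filtering and noise-aware thresholds; nearest-level assignment is used here only to keep the sketch short.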

[0004] In an aspect, the technologies include a method for reading information written into nucleic acid sequence(s), including providing a plurality of nucleic acid molecules, each molecule including a plurality of nucleic acid motifs and modifying at least a portion of the plurality of nucleic acid molecules with one or more structural labels and decorating at least one of the one or more structural labels with a fluorescent label. The method includes reading one or more fluorescent signals and translating the one or more fluorescent signals into the information. The reading includes detecting a change in optical signal along a length of a nucleic acid molecule. A first optical signal corresponds to a first motif and a second optical signal corresponds to a second motif. The reading includes identifying a first label from the first optical signal and a second label from the second optical signal; and identifying, from a library of nucleic acid molecules, the information based at least in part on the identified labels.

[0005] In an aspect, the technologies include a method including translating information into a string of symbols; and mapping the string of symbols to a plurality of identifiers. Each individual identifier of the plurality of identifiers includes a combination of a plurality of components from a library of components, and each component in an individual identifier of the plurality of identifiers includes a distinct nucleic acid sequence. Each identifier represents a symbol position of an individual symbol in the string of symbols. The method includes forming at least one individual identifier of the plurality of identifiers by depositing the plurality of components into a compartment. The plurality of components assemble via one or more reactions in the compartment to form the combination of components of the at least one individual identifier representing at least one symbol position of an individual symbol in the string of symbols.
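
A minimal sketch of the mapping from symbol positions to identifiers follows. It assumes a binary symbol string in which the presence of an identifier denotes a 1 at its position and absence denotes a 0. The component names in `LAYERS` and the lexicographic ordering of identifiers are hypothetical choices made for illustration, not details from the specification.

```python
from itertools import product

# Hypothetical component library: three layers with two components each.
LAYERS = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]

# Position i in the symbol string maps to the i-th combination of components
# (one component per layer), in lexicographic order.
IDENTIFIERS = list(product(*LAYERS))

def encode(bits):
    """Return the set of identifiers to construct for a bit string."""
    if len(bits) > len(IDENTIFIERS):
        raise ValueError("bit string exceeds the combinatorial space")
    return {IDENTIFIERS[i] for i, b in enumerate(bits) if b == "1"}

def decode(identifiers, length):
    """Rebuild the bit string from the set of identifiers that are present."""
    return "".join(
        "1" if IDENTIFIERS[i] in identifiers else "0" for i in range(length)
    )

bits = "10110010"
written = encode(bits)
print(sorted(written))
print(decode(written, len(bits)))
```

With three layers of two components the space holds eight identifiers, so an 8-bit string fills it exactly; larger libraries grow the space as the product of the layer sizes.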

[0006] In an aspect, the technologies include a system for reading information written into nucleic acid sequence(s). The system includes one or more first reagents and components to assemble one or more nucleic acid molecules encoding information. The system includes one or more second reagents to label one or more nucleic acid molecules. The system includes a fluidic device configured to transfer fluid, the fluid comprising one or more of the first or second reagents. The system includes a processor and a memory functionally connected to the fluidic device, the memory comprising instructions that, when executed, cause the processor to actuate one or more components of the device to perform one or more method(s) for reading digital information written into nucleic acid sequence(s) as described in this specification.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The novel features of the technologies are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present technologies will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein).

[0008] FIG. 1 is a schematic of an example identifier assembled from components of different combinatorial layers.

[0009] FIG. 2 is a schematic representation of a nanopore FET device as described in this specification.

[0010] FIGS. 3A and 3B are schematics illustrating the use of catalytically dead restriction enzymes to read example FLIs or other nucleic acid molecules encoding (digital) information. FIG. 3A illustrates example identifiers A and B that can be distinguished based on restriction enzyme site composition. FIG. 3B illustrates example identifiers A and B that can be distinguished based on MnlI restriction enzyme patterning.

[0011] FIGS. 4A and 4B are schematics illustrating the use of multiple enzymes to label components using different proteins. FIG. 4A illustrates direct labeling. FIG. 4B illustrates the use of indirect labels (e.g., on DNA flaps).

[0012] FIGS. 5A-5C are schematics illustrating an example general overview of scaffold-mediated component ligation. FIG. 5A illustrates example components, templating strands, and edge strands. FIG. 5B illustrates templating strands binding the component oligos to the scaffold sequence. FIG. 5C shows an example of different component sequences (when compared to FIG. 5B).

[0013] FIGS. 6A and 6B are schematics illustrating the modulation of edge sequence length and use of structural components for, e.g., component identification. FIG. 6A illustrates different example edge sequences, e.g., 6bp, less than 6bp, or blunt ends. FIG. 6B illustrates component sequences as shown, e.g., in FIG. 5A, that include an example hairpin motif.

[0014] FIGS. 7A-D are schematics illustrating an example scheme for single-stranded component ligation with structural features. FIG. 7A shows an example of a single-stranded component that forms a programmed secondary structural feature (e.g., a hairpin). FIG. 7B shows an example of a combinatorial 6-layer ligation scheme. FIG. 7C shows different example hairpin structural features that can be detected, e.g., through solid-state nanopore sequencing. FIG. 7D shows an example of how hairpin loops can serve as addressable sites for short oligos with 3’ or 5’ overhangs for further diversification of structural features.

[0015] FIG. 8A is a schematic illustrating an example nucleic acid molecule (e.g., an identifier) with secondary structural features (labels). FIG. 8B is a set of schematics illustrating the reading of an example nucleic acid molecule (e.g., an identifier) with secondary structural features (labels) using nanopores. FIG. 8C is a graph illustrating the current versus time signal of the first three structural features/labels (e.g., hairpins) translocated through the solid-state nanopore.

[0016] FIG. 9 is a schematic illustrating a hairpin structure on a double-stranded DNA segment.

[0017] FIGS. 10A-F are schematics illustrating a general overview of an example process for split template-based assembly of DNA using nicking endonuclease and polymerase.

[0018] FIG. 11 is a schematic overview of an example split template-based assembly of DNA using nicking endonuclease and polymerase in a reaction with three nucleotides and a fourth nucleotide acting as a stopper.

[0019] FIG. 12A is a schematic illustrating an example identifier with secondary structural features (labels). FIG. 12B is a schematic illustrating an example process to generate an example nucleic acid molecule (e.g., an identifier) with secondary structural features (labels). FIG. 12C is a schematic illustrating the reading of an example nucleic acid molecule (e.g., an identifier) with secondary structural features using nanopores.

[0020] FIG. 13A is a schematic of an example labeled DNA molecule (e.g., FLI). The example structural label is represented as a dumbbell DNA structure. FIG. 13B is a set of schematics illustrating different forms of dumbbell structures.

[0021] FIG. 14A is a schematic of a 7.2 kb single-stranded DNA molecule labeled with short oligos ~40 bp in length and including 29 DNA dumbbells. FIG. 14B shows an agarose gel electrophoresis validation showing: (1) only tiling oligos, (2) 14 DNA dumbbells and tiling oligos, (3) 15 DNA dumbbells and tiling oligos (different position of dumbbells), and (4) 29 DNA dumbbells and tiling oligos.

[0022] FIGS. 15A-C illustrate translocation of a 20 kb double-stranded DNA without any structural features. FIG. 15A is a graph showing average blockage (current) versus log dwell time of molecules in the nanopore. FIG. 15B is a graph showing an overlayed plot of individual translocation events. FIG. 15C (i-iii) are graphs showing individual translocation events of example 20 kb dsDNA fragments.

[0023] FIGS. 16A and 16B illustrate translocation of a 7.2 kb linear DNA fragment with structural labels (construct number 4 in FIG. 14). FIG. 16A is a graph showing an overlayed plot of individual translocation events. FIG. 16B(i-iii) are graphs showing individual translocation events of example 7.2 kb linear DNA fragments with structural labels.

[0024] FIG. 17A is a graph showing average blockage (current) versus log dwell time of molecules in the nanopore for translocation of an example 7.2 kb linear DNA fragment with structural labels (construct number 4 in FIG. 14). FIG. 17B shows an agarose gel electrophoresis validation showing example 7.2 kb DNA constructs labeled with 29 DNA dumbbells (directly in the middle, ~400 bp long) and tiling oligos.

[0025] FIGS. 18A-C illustrate translocation of a 7.2 kb linear DNA fragment with structural labels (construct number 2 in FIG. 14). FIG. 18A is a graph showing average blockage (current) versus log dwell time of molecules in the nanopore. FIG. 18B is a graph showing an overlayed plot of individual translocation events as described above. FIG. 18C(i-iii) are graphs showing individual translocation events of example 7.2 kb linear DNA fragments.

[0026] FIGS. 19A-C illustrate translocation of a 7.2 kb linear DNA fragment without any structural labels, i.e., a 7.2 kb single-stranded fragment with short ~30-40 nt oligo tiles (construct number 1 in FIG. 14). FIG. 19A is a graph showing average blockage (current) versus log dwell time of molecules in the nanopore. FIG. 19B is a graph showing an overlayed plot of individual translocation events. FIG. 19C(i-iii) are graphs showing individual translocation events of 7.2 kb linear DNA fragments with no structural labels.

[0027] FIGS. 20A and 20B are a set of schematics illustrating an example expandomer labeling strategy that can be used with the technologies described in this specification. FIG. 20A is a schematic of example single-stranded scaffolds (“expandable tiling oligos”). FIG. 20B is a schematic of looped expandable tiling oligos on a scaffold.

[0028] FIG. 21 is a diagram illustrating two example configurations of a scaffold and looped expandable tiling oligos.

[0029] FIG. 22 shows a denaturing PAGE gel illustrating the assembly of an example expandomer design. The gel lanes are as follows: (1): Single-stranded DNA ladder (L). (2 and 3): 160 nt scaffold only (sc2 and sc3). (4 and 5): expandable tiling oligos only. Expandable tiling oligos in the middle of the scaffold are shorter than on the end of the scaffolds. (6 and 7): scaffold and expandable tiling oligos showing ligated product of expected length (labeled with asterisk) and scaffold only (white arrow).

[0030] FIG. 23 is a diagram illustrating an example process for consensus generation from a concatemer.

[0031] FIG. 24 is a diagram illustrating rolling circle amplification of circularized NMDIs/FLIs enabling the creation of consensus-based labeling.

[0032] FIG. 25 schematically illustrates an overview of a process for encoding, writing, accessing, reading, and decoding digital information stored in nucleic acid sequences.

[0033] FIGS. 26A and 26B schematically illustrate an example method of encoding digital data, referred to as “data at address”, using objects or identifiers (e.g., nucleic acid molecules). FIG. 26A illustrates combining a rank object (or address object) with a byte-value object (or data object) to create an identifier. FIG. 26B illustrates an embodiment of the data at address method wherein the rank objects and byte-value objects are themselves combinatorial concatenations of other objects.

[0034] FIGS. 27A and 27B schematically illustrate an example method of encoding digital information using objects or identifiers (e.g., nucleic acid sequences). FIG. 27A illustrates encoding digital information using a rank object as an identifier. FIG. 27B illustrates an embodiment of the encoding method wherein the address objects are themselves combinatorial concatenations of other objects.

[0035] FIG. 28 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) that may be constructed to store information of a given size (contour lines).

[0036] FIG. 29 schematically illustrates an overview of a method for writing information to nucleic acid sequences (e.g., deoxyribonucleic acid).

[0037] FIGS. 30A and 30B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling distinct components (e.g., nucleic acid sequences). FIG. 30A illustrates the architecture of identifiers constructed using the product scheme. FIG. 30B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme.

[0038] FIG. 31 schematically illustrates the use of overlap extension polymerase chain reaction to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences).

[0039] FIG. 32 schematically illustrates the use of sticky end ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences).

[0040] FIG. 33 schematically illustrates the use of recombinase assembly to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences).

[0041] FIGS. 34A and 34B demonstrate template directed ligation. FIG. 34A schematically illustrates the use of template directed ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences). FIG. 34B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each combinatorially assembled from six nucleic acid sequences (e.g., components) in one pooled template directed ligation reaction.

[0042] FIGS. 35A-35G schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences). FIG. 35A illustrates the architecture of identifiers constructed using the permutation scheme. FIG. 35B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme. FIG. 35C shows an example implementation of the permutation scheme with template directed ligation. FIG. 35D shows an example of how the implementation from FIG. 35C may be modified to construct identifiers with permuted and repeated components. FIG. 35E shows how the example implementation from FIG. 35D may lead to unwanted byproducts that may be removed with nucleic acid size selection. FIG. 35F shows another example of how to use template directed ligation and size selection to construct identifiers with permuted and repeated components. FIG. 35G shows an example of when size selection may fail to isolate a particular identifier from unwanted byproducts.

[0043] FIGS. 36A-36D schematically illustrate an example method, referred to as the “MchooseK” scheme, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components. FIG. 36A illustrates the architecture of identifiers constructed using the MchooseK scheme. FIG. 36B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme. FIG. 36C shows an example implementation of the MchooseK scheme using template directed ligation. FIG. 36D shows how the example implementation from FIG. 36C may lead to unwanted byproducts that may be removed with nucleic acid size selection.

[0044] FIGS. 37A and 37B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components. FIG. 37A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme. FIG. 37B shows an example implementation of the partition scheme using template directed ligation.

[0045] FIGS. 38A and 38B schematically illustrate an example method, referred to as the “unconstrained string” (or USS) scheme, for constructing identifiers made up of any string of components from a number of possible components. FIG. 38A shows an example of the combinatorial space of identifiers that may be constructed using the USS scheme. FIG. 38B shows an example implementation of the USS scheme using template directed ligation.

[0046] FIGS. 39A and 39B schematically illustrate an example method, referred to as “component deletion”, for constructing identifiers by removing components from a parent identifier. FIG. 39A shows an example of the combinatorial space of identifiers that may be constructed using the component deletion scheme. FIG. 39B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair.

[0047] FIG. 40 schematically illustrates a parent identifier with recombinase recognition sites where further identifiers may be constructed by applying recombinases to the parent identifier.

DESCRIPTION

[0048] The technologies described in this specification include technologies for nanopore-based information reading. Nucleic acid-based data storage, e.g., DNA-based data storage (DDS), e.g., as described in this specification, relies on three primary methods for encoding data into nucleic acids, e.g., DNA (throughout this specification, “DNA” serves as an example nucleic acid; other nucleic acids, e.g., RNA and/or their analogues, can be used). Base-by-base writing is a method in which source bits (e.g., digital information, e.g., binary information) are mapped to DNA bases, which are then synthesized using standard DNA synthesis techniques. This method can include a bijective translation where one source bit maps to a single base, for example, as is performed using phosphoramidite electrochemistry, and more sophisticated methods where several bits map to several bases using, for example, TdT-based synthesis. While the viability of these two approaches for encoding information into DNA has been demonstrated, these approaches have several disadvantages: for example, DNA strands encoding information using these approaches tend to be very long (e.g., 100s of bases), rendering the strands prone to breaking and requiring cumbersome reading/writing techniques.

[0049] The technologies described in this specification use methods to encode source bits in specific subsets of DNA species from a universe of possible species using combinatorial writing, where each species is built from a combination of a finite set of possible oligos (see, e.g., N. Roquet et al., “DNA-based data storage via combinatorial assembly,” bioRxiv, p. 2021.04.20.440194, Jan. 2021). An example DNA writer uses picoliter-scale inkjet printing, has a writing throughput of 1 Tbit/day (including error correction), and is currently the fastest among all writers pertaining to DDS. Recently, this system was used to encode the whole of the English-language Wikipedia into DNA, demonstrating success in terms of speed, scale, and accuracy. Moreover, this printer is continuously being improved in automation, scalability, throughput, and in reducing reagent volumes.

[0050] Various techniques that use variations of DNA molecules can be used to encode information. For example, some methods rely on marking the DNA molecule, e.g., by nicking, methylation, or DNA dumbbells. Strategies for encoding schemes include labeling specific DNA regions of a unique site, e.g., a 10-15 base pair long site, with 3D DNA structures. Labels producing distinct signal amplitudes and unique features during reading can support various bit information encoding schemes, e.g., as described below.

[0051] To achieve scale, massive parallelization can be combined with the precise control provided by complementary metal-oxide-semiconductor (CMOS)-based process technology and reader electronics. At the DNA reading level, reading long fragments of nucleic acids (10 kb - 1 Mb) has been demonstrated, in a highly parallelized biological nanopore array, using commercial technology. An example sequencing technology is foundationally built on single-base readout: in a current implementation, such a device has ~2500 nanopores and a throughput of 500 bases/s per flow cell. Utilizing a base-by-base reading strategy limits the data rate to 5 Mbit/s per flow cell, translating to a few years to read PB-sized information. On the other hand, state-of-the-art solid-state nanopores (ssNPs) have a reported throughput of ~30 kbit/s per pore (a 56-bit carrier translocates in 1.5 ms), for which DNA molecules are translocating in a linear fashion. Typical ssNP diameters (~20 nm) in this reading approach require larger molecular structures to be added to the DNA backbone to encode readable bits. Departing from this conventional ssNP capability, described in this specification are novel nanopore field-effect transistor technologies that can bring the read speed to 1 Mbit/s per pore (30x higher throughput) and a potential array-based throughput of up to one PB/day, or more.
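
The per-pore figures above can be checked with a short calculation. This is a sketch using only the numbers quoted in the paragraph; treating the "30x" factor as relative to the quoted ~30 kbit/s is an assumption. The carrier figure works out to about 37 kbit/s, which is consistent with the rounded ~30 kbit/s, and 1 Mbit/s is about 33 times the quoted rate.

```python
# Figures quoted in the text for a conventional solid-state nanopore (ssNP).
carrier_bits = 56             # bits carried by one DNA carrier
translocation_s = 1.5e-3      # time for the carrier to translocate, in seconds
quoted_ssnp_bits_per_s = 30e3 # rounded per-pore rate quoted in the text

# Target per-pore rate quoted for the nanopore field-effect transistor reader.
npfet_bits_per_s = 1e6

# Per-pore rate implied by the carrier example.
ssnp_bits_per_s = carrier_bits / translocation_s

# Speed-up relative to the quoted rate (assumed baseline for "30x").
speedup = npfet_bits_per_s / quoted_ssnp_bits_per_s

print(f"ssNP rate from carrier example: {ssnp_bits_per_s / 1e3:.1f} kbit/s")
print(f"NPFET speed-up over quoted ssNP rate: {speedup:.0f}x")
```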

[0052] Based on initial calculations for ssNPs, one can estimate the throughput of a single Nanopore Field Effect Transistor (NPFET) reader and extrapolate for an array of NPFETs co-integrated on a CMOS chip. Importantly, despite utilizing larger molecular bit structures on the components of a nucleic acid encoding digital data (as described below), an NPFET multi-base reader allows much higher (e.g., Mbit/s) data throughput than regular biological pores working at kbit/s read rates. Therefore, the technologies described herein provide the capabilities to read freely translocating multi-base bits at a very high translocation speed. The NPFET can achieve this speed due to its high bandwidth, which can be maintained at very high integration densities. It is estimated that with an array of 1M NPFET readers, it would be possible to achieve 1 PB/day DNA read throughput (or more). Therefore, the technologies described in this specification may achieve, on a component level, data read throughput (up to Mbit/s) for multi-base bits using active ssNPs.
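
The array extrapolation can be sketched as follows. The function name and its parameters are hypothetical, and the sketch assumes every pore reads continuously at the full per-pore rate, with 1 PB taken as 10^15 bytes and no allowance for redundancy or coding overhead. Under those idealized assumptions, 1M pores at 1 Mbit/s give a raw rate of roughly 10 PB/day, which leaves headroom above the 1 PB/day estimate.

```python
SECONDS_PER_DAY = 24 * 3600
BYTES_PER_PB = 1e15  # decimal petabyte; an assumption, the text does not define PB

def array_throughput_pb_per_day(n_pores, bits_per_s_per_pore):
    """Raw read throughput of an ideal array, in petabytes per day.

    Assumes all pores run continuously with no idle time between molecules.
    """
    bits_per_day = n_pores * bits_per_s_per_pore * SECONDS_PER_DAY
    return bits_per_day / 8 / BYTES_PER_PB

raw = array_throughput_pb_per_day(n_pores=1_000_000, bits_per_s_per_pore=1e6)
print(f"{raw:.1f} PB/day raw")
```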

[0053] In some implementations, the technologies described in this specification utilize passive ssNPs (no electrodes) with, e.g., <15 nm diameter with optimized noise performance. The ssNPs used with the technologies described in this specification can have any diameter of less than 100 nm, e.g., between 0 and 5 nm, between 5 and 10 nm, between 10 and 20 nm, between 20 and 50 nm, or between 50 and 100 nm. These pores are fabricated en masse using a CMOS-compatible lithographic process in a 200- and 300-mm line. These passive nanopores can serve as the basis for the described detection methods. The NPFET extended gate geometry can be optimized for SNR to obtain clear discriminative signatures of the multi-base bits (DNA labels) during high throughput nanopore measurements.

[0054] In an example implementation, a nanopore reader as described in this specification can be used to perform reading in a DNA multi-base encoding approach using nucleic acid molecules encoding digital information, e.g., DNA “identifiers” as described below. Assume that we will write and read 1 Tbit of information. We would have 1×10¹² addresses corresponding to 1 Tbit of information. In a representative encoding scheme, a 40-bit (2⁴⁰ ≈ 1×10¹²) address DNA strand, called an identifier (described below), is written if the related bit is 1 (presence of address strand = 1, absence = 0). If the address bits are separated by 100 bp, we would have to build DNA fragments of 4 kbp to represent 1 Tbit of information. To read 1 Tbit, we would need to read all the address strands present at ~1,000-fold redundancy, given SNR considerations. Using a single pore, reading DNA molecules at the rate of 2 ms/bp, it would take more than 200 million years to achieve this goal. This assumes that the wait time between different translocating molecules is negligible, which is generally the case for certain nanopores. On the other hand, a single ssNP working at 24 ns/bp read speed would still take ~3,000 years. This demonstrates that throughput increase could be achieved by arraying nanopores in a CMOS circuit. In a readout chip with 1,000 individually addressable nanopores, 1 Tbit could be read in 3 years. Therefore, by scaling this nanopore array to 1M, 1 Tbit could be read in less than a day. Ultimately, by reducing redundancy, increasing the number of bits per address to decrease overhead, and keeping the wait time between successive translocation events sufficiently low, read throughputs pertaining to DDS archives of up to 1 PB/day could be achieved.
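
The read-time estimates in this example can be reproduced with a short calculation. This is a sketch under the paragraph's own simplifications: every one of the 10¹² addresses is counted as present (an upper bound), wait time between molecules is ignored, and pores scale perfectly. The function name and parameter names are hypothetical. With these inputs the sketch gives about 250 million years at 2 ms/bp, about 3,000 years at 24 ns/bp, about 3 years with 1,000 pores, and on the order of one day with 1M pores.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def read_time_years(n_addresses, fragment_bp, redundancy, seconds_per_bp, n_pores=1):
    """Time to read every address strand at the given redundancy, in years.

    Assumes all addresses are present and pores never sit idle.
    """
    total_bp = n_addresses * fragment_bp * redundancy
    return total_bp * seconds_per_bp / n_pores / SECONDS_PER_YEAR

address_bits = 40                      # bits per address strand
spacing_bp = 100                       # separation between address bits
fragment_bp = address_bits * spacing_bp  # 4,000 bp per identifier
n_addresses = 1e12                     # one address per bit of 1 Tbit
redundancy = 1000                      # ~1,000-fold read redundancy

slow_single = read_time_years(n_addresses, fragment_bp, redundancy, 2e-3)
fast_single = read_time_years(n_addresses, fragment_bp, redundancy, 24e-9)
fast_1k = read_time_years(n_addresses, fragment_bp, redundancy, 24e-9, n_pores=1_000)
fast_1m = read_time_years(n_addresses, fragment_bp, redundancy, 24e-9, n_pores=1_000_000)

print(f"single pore, 2 ms/bp:  {slow_single:.3g} years")
print(f"single pore, 24 ns/bp: {fast_single:.0f} years")
print(f"1,000 pores, 24 ns/bp: {fast_1k:.2f} years")
print(f"1M pores, 24 ns/bp:    {fast_1m * 365.25:.2f} days")
```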

[0055] The multi-base encoding scheme described below allows for the highest writing throughput to record digital data into DNA. In an example implementation, the nanopore approach described in this specification includes double-stranded, labeled DNA, via multi-base encoding with molecular tags, to be read at high speed (as described below). Specific DNA labeling also provides novel methods to create metadata, error correction codes, and potentially watermarking (to prove non-fungibility) of the information at the level of the bits. Various labeling technologies can be used and adapted for the required nanopore signal-to-noise ratio threshold, for example, biotin-streptavidin, 3D DNA structures, and enzyme labels. In some implementations, the closest label spacing that can be reliably discriminated would determine the bit length or the component length in an identifier. By assessing results on the optimal spacing of labels, amount, and type of labels, one can design the encoding scheme for an NPFET-based readout.

[0056] In some implementations, ssNPs can be produced at quantity at a size of 15 nm, e.g., using advanced photolithography and patterning processes. The ssNPs can be produced to have any diameter of less than 100 nm, e.g., between 0 and 5 nm, between 5 and 10 nm, between 10 and 20 nm, between 20 and 50 nm, or between 50 and 100 nm. The passive ssNP has excellent noise properties and can be readily used to develop assays using labeling techniques. Advanced NPFET fabrication processes can provide nanopore signals to the CMOS plane with high signal bandwidth. Further nanopore diameter size reduction to <10 nm can also be provided. Single-base resolution is not required in the described detection approach. Thus, <3 nm nanopores are not required. The described design can provide reading of 0.5-30 kbp DNA molecules to recognize different bit representations (25-mer to 100-mer bits) at very high throughput. In some implementations, the described design can provide reading of DNA molecules of a length of between 1 bp and 1 Mbp, e.g., between 1 bp and 10 bp, between 10 bp and 100 bp, between 100 bp and 1 kbp, between 1 kbp and 10 kbp, between 10 kbp and 100 kbp, or between 100 kbp and 1 Mbp.

[0057] Conventional methods of reading nanopores are limited by the signal bandwidth of individual nanopores and the number of nanopore cells that can be arrayed on one die. The NPFET has the intrinsic advantage of a larger signal bandwidth than existing solid-state technologies. The technologies described in this specification include devices operating at 10 MHz and with 1,000 nanopores or more. Moreover, techniques that can reduce the unit cell size, e.g., by separating the critical analog front-end function close to the nanopore from any analog and digital processing that can be done more efficiently further away, can help to bring down the active area per nanopore. The approach using standard CMOS for ssNP readers combined with a microfluidic cartridge can be processed in industry-compatible settings.

[0058] Described in this specification are high-throughput reading technologies that utilize combinatorial encoding schemes that consist of labeling specific components of nucleic acid molecules encoding digital data with 3D DNA structures producing distinct signal amplitudes during readout, that can support bit information encoding and provide 100B-1T reads per run applicable to DNA storage.

[0059] In an example implementation, the technologies described in this specification can provide DNA data storage using a collection of up to 10¹⁶ DNA molecules called identifiers, each with a unique sequence, and at some fixed number of copies. Each identifier is constructed by concatenating a unique combination of prefabricated oligonucleotides, called components, that are grouped into a set of layers and have a priori known sequences (FIG. 1). To read the information encoded in each library of identifiers, one must invert this process and discover the identifiers present in a particular sample from the set of all possible identifiers. Therefore, a reading problem can be mathematically defined as: given a list of possible identifiers and a sample library containing a subset of those identifiers, determine which identifiers exist in the sample, and calculate their distribution.
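
As a minimal sketch of the reading problem defined above, the following Python example enumerates a small identifier space and tallies the identifiers observed in a sample. The layer contents, component names, and the representation of a read as a tuple of component names are assumptions made for illustration only and are not taken from this specification.

```python
from collections import Counter
from itertools import product

# Hypothetical component names: three layers with two components each.
LAYERS = [["A0", "A1"], ["B0", "B1"], ["C0", "C1"]]

def all_identifiers(layers):
    """Enumerate every possible identifier as one component per layer."""
    return list(product(*layers))

def read_library(observed_reads, layers):
    """Count each valid identifier observed in the sample.

    Reads that do not match any possible identifier are discarded.
    """
    possible = set(all_identifiers(layers))
    return Counter(read for read in observed_reads if read in possible)

# Illustrative sample: the last read contains an unknown component ("B9").
reads = [
    ("A0", "B1", "C0"),
    ("A0", "B1", "C0"),
    ("A1", "B0", "C1"),
    ("A1", "B9", "C1"),
]
distribution = read_library(reads, LAYERS)
```

In this sketch, the keys of `distribution` answer which identifiers exist in the sample, and the counts give their distribution.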

[0060] Alternatively or additionally to DNA encoding approaches using DNA identifiers, the labeling and reading technologies described in this specification can also be used with other types of nucleic acid molecules encoding digital information and/or other encoding schemes. For example, the labeling approaches can be used with nucleic acid molecules other than identifiers, e.g., to label bases or base sequences (e.g., components) of nucleic acid molecules that are either formed using components or assembled base-by-base. Example encoding schemes for encoding digital information into nucleic acid molecules (other than identifiers) are described below.

[0061] Described in this specification are technologies for reading DNA libraries (e.g., libraries of nucleic acid molecules encoding digital information, e.g., libraries of identifiers) that are orders of magnitude faster than existing solutions and support molecular-scale data storage and archiving. To this end, DNA reading performance metrics are established with estimates for capacity, reading speed per identifier, scalability, and automation listed in the table below (Table 1). Due to the writing strategy described below, which uses pre-synthesized components, the described solution does not require base-pair level resolution or even base-by-base sequencing.

TABLE 1

[0062] Described in this specification are devices with a very high-density array of nanopores with individually addressable electrodes, e.g., as illustrated in FIG. 2. Nanopore field-effect transistors can break the translocation speed limitations of regular nanopores due to their improved detection bandwidth (100 MHz), while they can also scale to very large arrays (to millions). To this end, these sensor arrays can analyze up to 10⁹ barcode molecules per second.
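
The throughput figures stated in this specification can be related to each other with simple arithmetic. The following Python sketch combines the rate of up to 10⁹ molecules per second with the 100B-1T reads per run mentioned elsewhere in this specification. Sustained operation at the peak rate for the whole run is an assumption made only for this illustration, so the resulting times are lower bounds rather than expected values.

```python
# Figures taken from this specification; sustained operation at the peak
# rate is an assumption made only for this back-of-the-envelope estimate.
PEAK_MOLECULES_PER_SECOND = 10**9
READS_PER_RUN_RANGE = (100 * 10**9, 10**12)  # 100B to 1T reads per run

def minimum_run_seconds(reads, rate=PEAK_MOLECULES_PER_SECOND):
    """Shortest possible run time if the array sustains its peak rate."""
    return reads / rate

low, high = (minimum_run_seconds(r) for r in READS_PER_RUN_RANGE)
molecules_per_day = PEAK_MOLECULES_PER_SECOND * 24 * 60 * 60
```

Under these assumptions, a run of 100B to 1T reads corresponds to at least 100 to 1,000 seconds of acquisition.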

[0063] Described in this specification are technologies for combinatorial DNA encoding where billions of information-encoded molecules (e.g., identifiers) are created and read by a high-throughput (HT) sensing technology applicable to DNA storage. In some implementations, a commercially available DNA sequencer can be used, which has limitations in read throughput pertaining to the number of pores on the chip and DNA translocation speed controlled by a molecular motor. In an example reader as described in this specification, combinatorial encoding schemes including labeling specific components with 3D DNA structures can be used, producing distinct signal amplitudes during readout that can support bit information encoding (e.g., as described below). An example reader can include a nanopore field-effect transistor (NPFET) that can provide 100B-1T reads per run with the capabilities described in Table 1.

[0064] The technologies described in this specification improve on techniques with single-base resolution using nanopore sequencing by moving to the identification of unique structural labels that can serve as individual signatures of a certain region of a nucleic acid molecule, e.g., a sub-sequence, a motif, or a component. Read throughput can be improved by leveraging the combinatorial encoding scheme described below and using component identification rather than component sequencing. Components can be labeled or identified using a secondary structural feature, e.g., motif or entity, e.g., as described below. The technologies can bring the read speed to an array-based throughput of up to 1 PB/day and provide scalability via massive parallelization by combining the precise control of CMOS-based process technology and reader electronics. Combinatorial writing allows the use of redundant bases to reduce reading error.

Protein-based labeling

[0065] Described in this specification are technologies for protein-based labeling of nucleic acids for reading data encoded in nucleic acid molecules, e.g., DNA.

[0066] In some implementations, reading and subsequent decoding of data encoded in DNA relies on DNA sequencing, whereby individual bases are read and mapped to a string of components that classify/define an identifier or other nucleic acid molecule encoding digital information. This process is time-consuming and computationally intensive. Given that there are a limited number of component options per layer, reading DNA-encoded libraries can be accomplished by distinguishing different components. Therefore, an approach as described here is to read regions of nucleic acids, e.g., sub-base sequences, motifs and/or components, instead of individual bases. Sequencing approaches, such as nanopore and optical mapping-based technologies, can be leveraged to detect, e.g., component labels on DNA.

[0067] In some implementations, the technologies utilize proteins that target specific DNA motifs found on specific components. Enzymes, such as restriction endonucleases, and RNA-guided endonucleases, e.g., Cas proteins, can be modified by generating mutations of their nuclease domains to eliminate their ability to cleave DNA. Alternatively, transcription factor recognition sites can be incorporated into component sequences allowing transcription factors to bind along a full-length identifier (FLI) or other nucleic acid molecule encoding digital information (NMDI) (throughout this specification, FLIs represent an example implementation of the labeling and reading technologies described). Overall, these methods result in the mentioned proteins binding to specific sequences (e.g., components or fragments thereof) and decorating NMDIs/FLIs. These labels can then be identified via nanopore or optical mapping to characterize an NMDI/FLI. Rather than reading individual or multiple bases on a DNA molecule, stretches of DNA can be identified by reading the protein label, which can improve reading speed and accuracy. For example, reading speed can be increased because there is a greater distance between two labels than between two bases. Current decoding involves base-calling, component calling by mapping bases to components and then identifying/defining NMDIs/FLIs by calling components. A protein-labeling technique as described here can include calling components directly.

[0068] Alternatively or additionally, optical mapping using fluorescent labels to detect and distinguish different DNA motifs can be used. These optical methods, however, may be limited due to the limited number of fluorophores available. In contrast, protein labeling has the advantage that a large library of proteins, variants, or combinations can be created to distinguish multiple components. In some implementations, fluorescent labels can be used to tag/label the labeling proteins, or can be used alone or in combination with any other labeling technology described in this specification (e.g., to label one or more structural motifs, e.g., hairpins, dumbbells or other, as described below). Fluorescent signals can be detected using, e.g., CCD cameras or other optical detectors. Such cameras or detectors can be combined with one or more nanopore readers as described in this specification into a single device or can be stand-alone devices.

[0069] There are different variants of this approach. In some implementations, a catalytically dead restriction enzyme can be used as illustrated in FIG. 3. In FIG. 3A, Identifiers A and B can be distinguished based on restriction enzyme site composition. In FIG. 3B, MnlI restriction enzyme sits on two different identifiers, illustrating how a pattern of MnlI can classify identifiers.
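
The classification of identifiers by a pattern of protein binding sites can be sketched in Python as follows. The motif string and the two identifier sequences are placeholders chosen for illustration and are not taken from this specification; in practice the recognition sequence of the enzyme actually used would be substituted, and both strands may need to be searched.

```python
def site_positions(sequence, motif):
    """Return 0-based start positions of every motif occurrence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions

def site_fingerprint(sequence, motif):
    """Describe a molecule by the spacing between consecutive sites."""
    pos = site_positions(sequence, motif)
    return tuple(b - a for a, b in zip(pos, pos[1:]))

MOTIF = "CCTC"  # placeholder recognition motif (assumption)

# Two hypothetical identifiers with the same number of sites but
# different spacing between them.
identifier_a = "AA" + MOTIF + "G" * 4 + MOTIF + "T" * 6 + MOTIF + "AA"
identifier_b = "AA" + MOTIF + "G" * 10 + MOTIF + "TT" + MOTIF

fingerprint_a = site_fingerprint(identifier_a, MOTIF)
fingerprint_b = site_fingerprint(identifier_b, MOTIF)
```

In this sketch the two identifiers carry the same number of bound proteins, yet the spacing between sites differs, so the label pattern alone can distinguish them.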

[0070] Other variations include the use of multiple enzymes to label, e.g., components or motifs using different proteins (FIG. 4A) or the use of indirect labels (e.g., on DNA flaps) as shown in FIG. 4B.

[0071] The proteins used in the technologies described in this specification can have modifications, e.g., phosphorylation, acetylation, hydroxylation, ubiquitination, or methylation. The modifications can be used as a method of identifying DNA sequences. Protein cross-linking agents, such as formaldehyde, can be added to crosslink proteins to the DNA, preventing them from disassociating.

Scaffold-mediated combinatorial DNA assembly with structural motifs

[0072] Ligation of short double-stranded fragments (e.g., ~30 bp) relies on overhangs (e.g., 6 bp long) that provide specificity when assembling long dsDNA from shorter fragments. These overhangs are called edge sequences. Edge sequences serve the purpose of introducing specificity to each component position in the full-length final product (NMDI/FLI). To do that, edge sequences need to have a certain length to provide the desired specificity and binding strength for, e.g., a ligase to ligate the backbone of the DNA (instead of using a ligase, other chemical ligation approaches, such as click chemistry, can be used). This means that sequences must be long enough so that two components are bound to each other for a sufficiently long time for a ligase to bind and seal the nicks, and so that crosstalk is minimized. Increasing the edge sequence length provides more binding energy and can increase the time two components are bound to each other. This method comes with a tradeoff: longer edge sequences can also contribute to more off-target ligation (more stringent computational sequence design may be needed to minimize crosstalk).
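
One simple form of computational sequence design for minimizing crosstalk can be sketched in Python as follows. The sketch greedily selects edge sequences that differ from every previously selected edge sequence, and from its reverse complement, in at least a minimum number of positions. The use of Hamming distance as a proxy for crosstalk, the greedy search, and the parameter values are assumptions made for illustration; a practical design would typically also consider binding energy and secondary structure.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pick_edge_sequences(length, min_distance, count):
    """Greedily pick edge sequences that are mutually dissimilar.

    A candidate is rejected if it is self-complementary, or if it is too
    close to an already chosen edge or to the reverse complement of one.
    """
    chosen = []
    for bases in product("ACGT", repeat=length):
        candidate = "".join(bases)
        if candidate == reverse_complement(candidate):
            continue  # would pair with itself
        others = chosen + [reverse_complement(c) for c in chosen]
        if all(hamming(candidate, o) >= min_distance for o in others):
            chosen.append(candidate)
            if len(chosen) == count:
                break
    return chosen

# Illustrative parameters: four 6-base edges differing in at least 3 positions.
edges = pick_edge_sequences(length=6, min_distance=3, count=4)
```

Longer edge sequences enlarge the space of candidates that satisfy a given distance, which illustrates why shorter overhangs make specificity harder to achieve by sequence design alone.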

[0073] The scaffold-mediated combinatorial assembly of DNA described in this specification (see FIG. 5) addresses this problem by utilizing a template-based approach that pre-organizes the components in the correct order along a scaffold (e.g., each component is now in high local concentration) and then uses, e.g., a varying length of edge sequences (shown in FIG. 6A) to ligate the components together. FIG. 5 illustrates the general overview of an example scaffold-mediated ligation scheme. FIG. 5A shows short oligos that are present in the one-pot scaffold-mediated ligation reaction alongside a long example scaffold, e.g., M13 bacteriophage genome ssDNA (~10 kb). Single-stranded oligos have components encoded into them (L0-L15) with specific edges and/or the universal primer sequences (FIG. 5A, left: strands that are mixed solid lines and dashed lines). FIG. 5B shows templating strands binding the component oligos to the scaffold sequence. Templating strands (FIG. 5A, center, solid lines) are partially complementary to the scaffold and to the component sequence. In some implementations, the scaffold sequence stays constant and does not change. Similarly, edge strands (FIG. 5A, right, dashed lines) bind to the scaffold and edge sequences that are present in the component strands. Only component oligos that are prearranged along the scaffold with the help of templating and edge oligos can ligate, e.g., via the 5’ phosphate. Combinatorial assembly occurs by switching the templating and component strands, therefore allowing different component sequences to be present at positions a, b, c, etc. FIG. 5C shows an example of different component sequences (when compared to FIG. 5B) assembled on the scaffold (i.e., a2, b2, c2 versus a1, b1, c1).

[0074] FIG. 6 illustrates the modulation of edge sequence length and use of structural components for, e.g., component identification. FIG. 6A shows an example of how edge sequences can be shortened to, e.g., 6 bp, less than 6 bp, or blunt end ligation. FIG. 6B shows component sequences as shown, e.g., in FIG. 5A, that include a hairpin motif where stem length and hairpin shape (different secondary structural motifs) can be designed to create a unique secondary structural feature for a given component. These structural components can utilize the same scaffold-mediated ligation scheme outlined in FIG. 5.

[0075] The advantage of a scaffold-based approach as described in this specification is that the scaffold allows for the decrease of the edge sequence length, down to 1-2-base overhangs or even blunt end ligation. Without a scaffold, a very short sticky end (1-2 bases) ligation would take significant time (the time two components are bound to each other for a 1-2-base overhang is short), especially when N components need to be ligated together. The scaffold brings the components into extremely high local concentration and allows faster ligation. As a result, this system favors ligation of what is bound to the scaffold and reduces unwanted background ligation of any components that are not bound to the scaffold.

[0076] Moreover, designing the components to contain one or more motifs, e.g., hairpin motifs (as shown in FIG. 6B), can introduce unique structural features for downstream applications, such as solid-state nanopore sequencing. Structural features can include DNA strands of varying lengths, e.g., 10-500 bp, and can be a single feature or multiple features per component. These labeling schemes expand the component-based encoding of DNA libraries from single base calling with traditional techniques, e.g., biological nanopore-based sequencing, to custom solid-state nanopores in combination with engineered structural labels (e.g., hairpins or similar). This technology provides the capability to expand the set of uniquely identifiable features in subsequent current-time signals, e.g., in a nanopore.

[0077] Scaffold-mediated combinatorial DNA assembly allows high-yield, low off-target ligation of regular ssDNA components and components that have a structural label, such as a uniquely identifiable (in current-time signal) hairpin. This technology can maintain specificity and assembly order through scaffold-mediated templating despite shorter overhangs, provide quicker ligation because components are pre-arranged along the scaffold, provide decreased off-target ligation (e.g., of components that are in incorrect order), and provide structural labels (e.g., different hairpins) that can be integrated for component-based encoding of DNA libraries. This technology offers the ability to expand the set of uniquely identifiable features in the subsequent current-time signal on solid-state nanopores.

Combinatorial assembly of long single-stranded oligonucleotides with structural features

[0078] Described in this specification are technologies for the combinatorial assembly of nucleic acid molecules encoding digital information, e.g., identifiers (as described below with double-stranded components) using single-stranded components that ligate via a hairpin stem (see FIG. 7). In some implementations, edge sequences are encoded in single-stranded regions below a hairpin stem and provide specificity for component ligation following the principles of the combinatorial assembly techniques described in this specification, e.g., as described below. Moreover, each hairpin can serve as a unique structural feature for downstream applications, such as solid-state nanopore sequencing. With this technique, technologies based on component-based encoding of DNA libraries can be expanded from single-base calling with traditional biological nanopore-based sequencing to solid-state nanopores in combination with engineered structural labels (e.g., different hairpins). This technology provides capabilities to expand the set of uniquely identifiable features in the subsequent current-time signal.

[0079] In an example implementation of the combinatorial assembly technologies described in this specification, N double-stranded components (~30 bp) are ligated together via a 3’ overhang on either side of the double-stranded components. Each component can be ordered from a chemical DNA synthesis vendor and is composed of a top and a bottom oligo that are subsequently annealed. In some implementations, both top and bottom oligos are present in an equimolar ratio in the annealing reaction; otherwise, an excess of either oligo will be present in the reaction. If such an excess is present in the downstream combinatorial assembly, single oligos can act as terminators to full-length assembly. For example, a single oligo can have only one edge sequence present on its 3’ end that can bind and ligate to an edge sequence of a double-stranded component. The double-stranded component now has a long single-stranded oligo on one end and is effectively terminated in its ability to further assemble in that direction. This problem can be mitigated by size selection (e.g., using gel purification) of each component after the annealing process, which, however, can result in low yields and significant time and labor costs for, e.g., >100 components. Using single-stranded components can remove the stoichiometry differences between the top and bottom strands and can remove the annealing step (intramolecular formation of the hairpin stem can be very robust). Moreover, hairpins can create unique current-time signals in nanopore sequencing, e.g., as described (above) in this specification.

[0080] FIG. 7 shows a schematic overview of single-stranded component ligation schemes with structural features. FIG. 7A shows an example of a single-stranded component that forms a programmed secondary structural feature (e.g., a hairpin). FIG. 7B shows an example of a combinatorial 6-layer ligation where each component is composed of a ssDNA. In some implementations, ligation is templated by using the double-stranded region of the hairpin stem with a 5’ phosphate modification, resulting in a continuous 6-layer ssDNA containing equidistant hairpin structural features. FIG. 7C shows an example of different hairpin structural features that can be detected through solid-state nanopore sequencing. Design parameters that can be varied when constructing unique secondary structural features or motifs with, e.g., DNA hairpins, include (1) hairpin stem length, (2) the number of hairpin stems, (3) hairpin loop size, and (4) incorporation of unnatural bases (e.g., 3-Cyanovinylcarbazole Phosphoramidite, CNVK) or methylated bases in the hairpin. FIG. 7D shows an example of how hairpin loops can be used as addressable sites for short oligos with 3’ or 5’ overhangs that may or may not contain various modifications (e.g., biotin) to further diversify structural features or provide addressable sites for selective purification/pull-down.
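
Two of the design parameters listed above, stem length and loop size, can be illustrated with a short Python sketch that builds hairpin-forming sequences as a stem, a loop, and the reverse complement of the stem. The specific stem and loop sequences are hypothetical, and the sketch does not model folding stability, multiple stems, or modified bases.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def make_hairpin(stem, loop):
    """Return a ssDNA sequence able to fold into a single stem-loop.

    The 3' half of the stem is the reverse complement of the 5' half,
    so the two halves can pair and leave the loop unpaired.
    """
    return stem + loop + reverse_complement(stem)

def hairpin_library(stems, loops):
    """Build one hairpin per (stem, loop) combination.

    Keys are (stem length, loop length); values are the sequences.
    """
    return {(len(s), len(l)): make_hairpin(s, l) for s in stems for l in loops}

# Hypothetical stems and loops of two different sizes each.
library = hairpin_library(stems=["GGAC", "GGACGT"], loops=["TTTT", "TTTTTT"])
```

Varying both parameters independently yields four distinct structural features in this example; additional parameters would enlarge the set further.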

[0081] FIG. 8 illustrates the reading of an example nucleic acid molecule encoding digital information, e.g., an identifier, with secondary structural features using nanopores. FIG. 8A shows a schematic depicting a DNA object with seven unique structural features or labels as described in this specification, e.g., different hairpin structures. Such (secondary) structural features or labels can include one or more of the above-mentioned modifications to the hairpin (see FIG. 7C). FIG. 8B illustrates translocation of the concatenated DNA labels, e.g., hairpins, through a nanopore (e.g., a solid-state nanopore). FIG. 8C is a graph illustrating current versus time signal of the first three secondary structural features or labels (e.g., hairpins) translocated through the solid-state nanopore. The current detected is dependent on size and/or shape of the secondary structural feature or label. The current-time signals including, e.g., signal strength, signal spacing, signal amplitude, or a combination thereof can be detected and used to determine a pattern (e.g., a current change pattern) or “fingerprint.”
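
A minimal Python sketch of extracting such a pattern from a current-time signal is shown below. It assumes an idealized trace with a first current level for passage of plain DNA and a second level for passage of a label, and uses the midpoint between the two levels as a threshold. The sample values, the threshold choice, and the function name are assumptions made for illustration; a practical reader would also need noise filtering and baseline tracking.

```python
def find_label_events(trace, dna_level, label_level):
    """Return (start_index, duration_in_samples) for each label event.

    A sample is attributed to a label when it lies on the label side of
    the midpoint between the DNA level and the label level.
    """
    threshold = (dna_level + label_level) / 2.0
    label_is_lower = label_level < dna_level
    events = []
    start = None
    for i, sample in enumerate(trace):
        in_label = sample < threshold if label_is_lower else sample > threshold
        if in_label and start is None:
            start = i  # a label has entered the sensing zone
        elif not in_label and start is not None:
            events.append((start, i - start))
            start = None
    if start is not None:  # trace ended while a label was in the pore
        events.append((start, len(trace) - start))
    return events

# Idealized trace in arbitrary units: DNA at 1.0, labels near 0.5.
trace = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 0.4, 1.0]
events = find_label_events(trace, dna_level=1.0, label_level=0.5)
spacings = [b[0] - a[0] for a, b in zip(events, events[1:])]
```

The event durations and the spacings between events together form a simple version of the pattern or "fingerprint" described above.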

[0082] Hairpin features as described above can, in some implementations, be used with the double-stranded DNA-based technologies described in this specification. Hairpin structures can be added as a secondary structural feature in combinatorial assembly techniques with double-stranded components. Here, either the top or bottom strand of the dsDNA component can be modified to contain a (unique) hairpin structural feature. The same principles that are outlined in FIGS. 7C and 7D can be applied to dsDNA technologies, e.g., as shown for an example component decorated with a hairpin structure in FIG. 9.

Labeling Strategies for Component Identification via Nanopore Readout Post-Writing

[0083] In some implementations of the technologies described in this specification, a nick-translation-driven labeling scheme can be used for high-throughput component identification via nanopore readout post-writing.

[0084] A feature of the labeling technologies described in this specification is the use of labeling distance and labeling types specific for components that provide a unique signal (e.g., a unique current-time signal pattern or a “fingerprint”), via nanopore-based DNA readout. In some implementations, chemistry optimization and labeling rule determination utilizing nick-translation can be used. In an example implementation, first, nicking enzymes can be used to produce single-stranded “nicks” at sequence-specific locations along the NMDI/FLI. Next, DNA Polymerase I can be used to replace some of the nucleotides of a DNA sequence with their labeled analogues. Finally, the original “nick” is sealed by DNA ligase.

[0085] In some implementations, methylation with CpG Methyltransferase (M.SssI), which methylates all cytosine residues (C5) within the double-stranded dinucleotide recognition sequence 5'...CG...3', can be used. Next, the methylated NMDI/FLI is run on established DNA sequencing technology to generate an amplified signal related to the methylation sites that can be compared to a predetermined NMDI/FLI set to be tested. A signal processing scheme can be used to classify nanopore signals corresponding to labels.
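
The comparison of detected methylation sites against a predetermined set can be sketched in Python as follows. The sketch locates CG dinucleotides on one strand of each candidate sequence and returns the candidates whose expected positions match the observed ones. The candidate names and sequences are hypothetical, and exact matching of positions is a simplifying assumption; real signals would require tolerance for missed or shifted sites.

```python
def cpg_positions(sequence):
    """Return 0-based positions of the C in every CG dinucleotide."""
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "CG"]

def match_candidates(observed_positions, candidates):
    """Return names of candidates whose CG positions equal the observed ones."""
    observed = list(observed_positions)
    return [name for name, seq in candidates.items()
            if cpg_positions(seq) == observed]

# Hypothetical predetermined set of molecules to be tested.
candidates = {
    "id1": "ACGTTCGA",
    "id2": "ACGTTTTA",
}
hits = match_candidates([1, 5], candidates)
```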

Component sequence replacement post-writing

[0086] The technologies described in this specification can be used to replace component sequences post-writing using an integrase or other sequence insertion/replacement methods. That is, the writing process remains the same as described elsewhere in this specification with no change in component design. When reading is desired, a Component Sequence Replacement step is performed. In this step, each component (of a nucleic acid molecule encoding digital information, e.g., an identifier) is replaced with a new component sequence. In some implementations, the new component sequence is longer than the original component sequence to create a sufficient number and length of sites for chemical operations. An example application of this technique is in label-based reading approaches that require long sites, or several sites spaced out by long inter-site regions. Another application is in DNA computing where the result of one round of computation (through chemical operation) is marked by replacing one or more component sequences with new sequences that can be used in a second round of computation.

[0087] In some implementations, component replacement is carried out by including a short sequence motif that is recognized by a specific enzyme, e.g., an integrase/recombinase, into each component sequence. In some implementations, component replacement is carried out by including a similar type of sequence motif that is recognized by a transposase into each component sequence. In some implementations, instead of using specific recognition motifs (that are pre-determined by the specific enzyme), a programmable enzyme or set of enzymes that target any DNA sequence for insertion or recombination based on a guide DNA of any chosen sequence can be used. In some implementations, the replacement operations described in this specification for all identifiers in a library take as many sites and replacement steps as the number of components. In some implementations, this process can be carried out in a number of steps equal to the number of components in a layer. Generally, a labeling technique need not be unique to every layer. For example, every component Ci can be used to report the same signal/readout.

Split template-based assembly of DNA using nicking endonuclease and polymerase.

[0088] Described in this specification are technologies that use a nicking endonuclease and a polymerase to sequentially build up a long ssDNA strand, e.g., to construct a nucleic acid molecule encoding digital data, e.g., an identifier molecule, from component oligos as described in this specification. The short template oligos can be used in a combinatorial assembly approach, e.g., in the encoding schemes described below in this specification to create a large set of unique long ssDNA strands.

[0089] The combinatorial assembly of long (>500 nucleotides) ssDNA using a polymerase is challenging. The technologies described in this specification can be used for such a combinatorial assembly using sequential assembly of short template oligos that can be combined in a one-pot reaction. Each oligo includes a nicking endonuclease recognition site that allows selective cleaving of a small (<10 bp) fragment close to the 5’-end, which allows subsequent binding of the next template oligo. With this technique, a full-length ssDNA molecule (e.g., an NMDI/FLI) can be assembled (e.g., tiled with short oligos). The technologies described in this specification can be used without a strand displacing polymerase and special stopping sequences (or modifications) to halt the polymerase and without using nicking endonuclease in combination with a strand-displacing polymerase to amplify ssDNA templates.

[0090] The technologies described here provide ssDNA assembly without the need to tightly control the stoichiometry of each template oligo because the methods are based on a single-direction polymerization (versus bi-functional templates that polymerize in both directions). This technique can result in fewer termination events and overall better yield of full-length products. Moreover, the technologies described here use ligation of ssDNA template oligo, which can be cheaper than ligation of double-stranded components. Moreover, after the assembly process, only one type of oligo is a full-length molecule (e.g., an identifier) while short template strands (e.g., components from one or more layers) remain un-ligated. The process thus facilitates any downstream purifications to isolate long ssDNA molecules from reaction mixtures containing, e.g., double-stranded molecules or intermediate products.

[0091] FIG. 10 shows a general schematic overview of an example split template-based assembly of DNA using nicking endonuclease and polymerase. FIG. 10A illustrates an example one-pot reaction of ssDNA oligos that each include a recognition site for a nicking endonuclease close to the 5’-end. Each of the ssDNA oligos includes a left edge sequence, a “component barcode” sequence, a right edge sequence, and a nicking site for nicking endonuclease between the component barcode sequence and the right edge sequence. Initiator oligos form a double-stranded template (e.g., this template can be layer 0 of a multilayer identifier molecule or a universal primer site) that includes a ssDNA overhang of the bottom strand with sequence ‘a’. The first oligo from Layer 1 (4th from the top) binds with edge sequence ‘a*’ on the 3’-end to the overhang and forms a template for the 3’-end extension of the bottom initiator oligo with a polymerase. In some implementations, the polymerase has no strand displacing activity and no 5’->3’ exonuclease activity, e.g., Sulfolobus DNA Polymerase IV. FIG. 10B illustrates the extension of the bottom initiator oligo, which results in the nicking endonuclease recognition site becoming double-stranded and, therefore, allows the oligo to be nicked at the specific site. FIG. 10C illustrates how, after nicking, the short oligo (<10 bp) will have a short dwell time and fall off. The second oligo (Layer 2, 1st from the top) binds with ‘b’ on the 3’-end to the overhang (b*) to form a template for the 3’-end extension of the bottom initiator oligo with a polymerase. Once the Layer 2 oligo binds, it provides the next template for the polymerase to extend and activates the next nicking site. FIGS. 10D-E illustrate the same process for the ligation of Layer 3. FIG. 10F shows the result of the process repeated to sequentially build up the entire full-length DNA assembly (total of 15 layers in this example).
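
The sequential logic of this assembly can be sketched symbolically in Python. Each template oligo is reduced to three hypothetical fields: the edge with which it binds the current overhang, its component barcode, and the edge left exposed after extension and nicking. This abstraction, the field layout, and the edge and barcode names are assumptions made for illustration and do not model polymerase extension, nicking kinetics, or strand complementarity.

```python
def assemble(initiator_overhang, template_oligos):
    """Order component barcodes by following matching edges.

    Each template oligo is (binding_edge, barcode, next_edge): it binds
    the current overhang via binding_edge, templates extension through
    barcode, and leaves next_edge exposed once the short fragment is
    nicked and falls off.
    """
    pool = {binding: (barcode, nxt) for binding, barcode, nxt in template_oligos}
    product = []
    overhang = initiator_overhang
    while overhang in pool:
        # Each oligo is consumed once, so the loop always terminates.
        barcode, overhang = pool.pop(overhang)
        product.append(barcode)
    return product

# One-pot mixture listed in arbitrary order (hypothetical names).
one_pot = [("b", "L2-x", "c"), ("a", "L1-y", "b"), ("c", "L3-z", "d")]
assembled = assemble("a", one_pot)
```

Although the oligos are mixed in arbitrary order, the edges determine a unique assembly order, which is the property the one-pot reaction relies on.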

[0092] Modifications on the 3’-end of the short template oligo can include a biotin or a barcode sequence that is addressable and can be used, e.g., to selectively purify the full-length product. Moreover, gel electrophoresis (e.g., urea polyacrylamide-based denaturing) can be used to purify the long ssDNA from the short template strands.

[0093] Polymerases that can be used with the technologies described in this specification include any polymerase that exhibits neither strand displacing activity nor 5’->3’ exonuclease activity. Example polymerases that can be used include Q5 High-Fidelity DNA Polymerase, Phusion High-Fidelity DNA Polymerase, or T7 DNA Polymerase.

[0094] Nicking endonucleases that can be used with the technologies described in this specification include any nicking endonuclease, depending on the recognition sequence that is encoded in the template oligos. Example endonucleases include Nb.BsrDI or Nb.BtsI.

[0095] In some implementations, strand displacing polymerases such as Bst DNA Polymerase, Large Fragment, or Bst 2.0 DNA Polymerase can be used. By using only three nucleotides in the reaction (e.g., dATP, dCTP, and dTTP, omitting dGTP), the polymerase can only incorporate where the template strand contains an A, G, or T. Every template sequence that contains one or more C’s can serve as a stopper, as illustrated in FIG. 11. This technique can be used to facilitate the usage of a strand displacing polymerase. Moreover, 3’-ends of the template strands can be modified with inverted dT’s to stop any unwanted polymerase activity on the short template strands.
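
The stopper behavior obtained by omitting one nucleotide can be sketched in Python as follows. The sketch extends a strand along a template, read in the 3’ to 5’ direction, until it reaches a template base whose complementary nucleotide is not available. The template sequence is hypothetical, and the model ignores misincorporation and enzyme kinetics.

```python
# Watson-Crick partner incorporated opposite each template base.
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}

def extend(template_3_to_5, available_dntps):
    """Return the newly synthesized strand (5' to 3').

    Extension halts at the first template base whose complementary
    nucleotide is missing from the reaction.
    """
    product = []
    for base in template_3_to_5:
        needed = PAIR[base]
        if needed not in available_dntps:
            break  # the polymerase stalls here (stopper position)
        product.append(needed)
    return "".join(product)

# dGTP omitted: the first C in the template acts as a stopper.
stalled = extend("ATGGCAT", available_dntps={"A", "C", "T"})
complete = extend("ATGGCAT", available_dntps={"A", "C", "G", "T"})
```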

NMDI/FLI Flossing

[0096] The technologies described in this specification can be used for high-accuracy identification of NMDIs/FLIs or other nucleic acid molecule encoding digital information for DNA storage and computation applications as described below in this specification.

[0097] The NMDI/FLI-Flossing technology is based on nanopore-based detection of a single or concatemerized NMDI/FLI-dsDNA template. The NMDI/FLI template is positioned between the cis/trans chambers separated by a membrane or layer with one or more nanopores. An example NMDI/FLI includes one or more DNA dumbbells large enough to block (complete) translocation of the NMDI/FLI through the pore. In some implementations, the dumbbells can be positioned at one or both ends of an NMDI/FLI, e.g., such that the components encoding digital information are positioned between two dumbbells. By iteratively changing voltage polarity, one can translate, or “floss”, the template back and forth through the nanopore sensing zone, providing multiple interrogations of the NMDI/FLI bases and/or components (e.g., using secondary structural features as described in this specification). At each voltage polarity change, the template moves from 5' to 3' end (or vice versa) in a linear fashion and stops when one or more dumbbells reach the nanopore opening at the end of the reading cycle. Flossing the DNA template multiple times can create multiple reads of the same molecule, providing consensus (sequence) generation in a downstream bioinformatics pipeline. By strategically introducing a specific restriction enzyme site at the location of the one or more dumbbells, one can cleave the NMDI/FLI on the cis/trans sides on demand, whereby the dsDNA NMDI/FLI can be scarlessly recovered for recollection in the DNA storage library. Alternatively, one could introduce photo-enzymatically and chemically cleavable (reversible) moieties to provide cleavage of the dumbbells and NMDI/FLI recovery.
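
Consensus generation from repeated flossing passes can be sketched in Python as follows. The sketch assumes that each pass yields a list of component calls of equal length, that every second pass is read in the opposite direction, and that a simple per-position majority vote is adequate. These assumptions, and the component names, are for illustration only; a downstream pipeline would normally also align passes of unequal length.

```python
from collections import Counter

def floss_consensus(passes):
    """Return the per-position majority call across flossing passes.

    Passes at odd indices are assumed to have been read in the reverse
    direction and are flipped before voting.
    """
    oriented = [p if i % 2 == 0 else p[::-1] for i, p in enumerate(passes)]
    length = len(oriented[0])
    consensus = []
    for pos in range(length):
        votes = Counter(p[pos] for p in oriented if len(p) == length)
        consensus.append(votes.most_common(1)[0][0])
    return consensus

# Three passes over the same molecule; the third contains one miscall ("X").
passes = [
    ["A", "B", "C"],  # forward
    ["C", "B", "A"],  # reverse
    ["A", "X", "C"],  # forward, with an error
]
consensus = floss_consensus(passes)
```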

Context-dependent DNA reading via polymerase kinetics

[0098] The technologies described in this specification can be used for DNA template reading using a sequence context-dependent process that reflects on DNA polymerase (Pol) kinetics.

[0099] For example, when DNA Pol (humans have 15 different polymerases) encounters a DNA modification site, such as methylation, the DNA Pol incorporates the complementary nucleotide in the template at different kinetic rates from its normal counterpart. In a specific example, using a sequencing system based on distinguishable fluorescently labeled dNTPs that detects a fluorescent signal at each base incorporation step, the incorporation time for a modified site is longer, which is reflected in a longer dwell time based on a longer fluorescent peak associated with that particular nucleotide. Various tandem repeat expansions associated with neurological diseases slow down DNA Pol incorporation activity, e.g., the polymerase stalls for various lengths of time when crossing these repeated regions. Therefore, these DNA Pol-related SM kinetic signatures of regulation or tandem repeats can be used for barcoding, e.g., to designate a particular component of an identifier in a combinatorial data encoding scheme as described in this specification. In some implementations, DNA reading kinetic signatures can be detected with DNA nanowire resistance measurements. Here, a reverse approach can be used to rationally design barcoded regions into the various components that can produce unique kinetic signatures.

Prioritized/selective nanopore sequencing

[0100] Throughput in nanopore sequencing is dependent on nanopore-finding rate (e.g., how many molecules, e.g., NMDIs/FLIs can find a nanopore and/or how fast a molecule can find a nanopore) and nanopore translocation rate. The technologies described in this specification can be used to increase pore-finding rate.

[0101] In some implementations, to increase throughput, the length of molecules (e.g., NMDIs/FLIs) can be increased, e.g., by adding more layers. In some implementations, an example system is configured to have longer molecules reach nanopores first. For example, the charge of DNA or the size of a DNA molecule can be used for size selection. For example, agarose gels can be used for selection, e.g., by running a gel in reverse. Using this method, longer molecules can reach nanopores first. In some implementations, flow-based techniques can be used. In some implementations, (electrical) resistance can be used for size selection.

[0102] In some implementations, the rim of one or more nanopores can be decorated with one or more molecules that attract other molecules. This technique is similar to affinity capture near a nanopore. For example, a charged molecule positioned near a nanopore can attract a DNA molecule. Tuning charge density (e.g., of a DNA molecule, e.g., an NMDI/FLI, or of the charged molecule) can improve selection performance. In some implementations, a library of NMDIs/FLIs includes molecules tagged with beads (e.g., magnetic beads). The beads can be attracted towards a nanopore. In some implementations, nanopores can be pre-loaded with bead-tagged DNA molecules (e.g., a library of NMDIs/FLIs) given a voltage potential across the cis and trans chambers of the nanopore, or nanopore array. In this configuration, the bead dimension (a sphere with diameter D_b) is larger than the nanopore diameter (D_p), which essentially captures the bead on the cis side of the chamber and stretches the NMDI/FLI in the pore lumen towards the trans side. After this pre-loading step, beads can be cleaved so that the molecule, e.g., an NMDI/FLI, can translocate through the nanopore. In some implementations, a bead can be attached with an ssDNA adapter. In some implementations, beads and/or helicase enzymes (as stoppers) can be used to repeatedly read an identifier. First, the DNA molecules tagged with a stopper are captured in the nanopore using a voltage potential applied across the nanopore or nanopore array and subsequently identified based on their current blockade signature (an electronic fingerprint). Next, by reversing the voltage potential, the detected molecule is “kicked out” from the nanopore. Then, by continuously iterating between capturing and “kicking out” the molecule (using voltage reversal), the molecule can be re-attracted to the nanopore and read again. Using this method, the same DNA molecule can be read/interrogated multiple times, increasing the probability of accurately identifying the molecule or its components based on its repeated current blockade signature. In some implementations, the DNA that has been read and accurately identified can be (enzymatically and/or chemically) cleaved in a selective manner near a nanopore, which will minimize excessive data acquisition upon molecule classification, especially when using a large (millions of pores) nanopore array.

[0103] In some implementations, a library of DNA molecules (e.g., NMDIs/FLIs) is prepared with helicases so that the helicases can find a nanopore faster than DNA without helicase. In some implementations, helicase can be split, with one half attached to the nanopore and another attached to the identifier molecule so that the helicase units assemble faster than the DNA finds the nanopore.

[0104] In some implementations, the translocation rate of a molecule can be adaptively tuned based on how many molecules are captured successfully. In other words, given an initial molecular concentration, the translocation rate can be dynamically tuned, e.g., by increasing/decreasing the voltage gradient across the nanopore or nanopore array, which is reflected in the increase/decrease of the number of molecules, e.g., NMDIs/FLIs, captured through the pores in a unit time. Given a constant voltage potential across a nanopore array, the initial concentration of the molecules contained in the cis chamber will decrease as a function of time, as the molecules will translocate to the trans chamber. Thus, a constant translocation rate can be dynamically established based on the actual concentration (number of molecules) in the cis chamber, which is measured as the rate of molecule detection in the nanopore. Therefore, when the rate of molecule detection decreases (the number of molecules in the cis chamber decreases), the voltage gradient can be dynamically adjusted across the pore array to accelerate molecule transport and hence establish a constant translocation rate, e.g., until all molecules have been translocated to the trans side of the nanopore chamber.
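The dynamic voltage adjustment described above can be sketched as a simple feedback loop. The proportional update rule, the gain, the voltage limits, and the toy model in which the capture rate is proportional to both the remaining molecule count and the applied voltage are all assumptions made for illustration; this specification does not prescribe a particular control law.

```python
def adjust_voltage(voltage_mv, measured_rate, target_rate,
                   gain=0.5, v_min=50.0, v_max=400.0):
    """Proportional update: raise the voltage when detections fall below target, lower it when above."""
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    relative_error = (target_rate - measured_rate) / target_rate
    new_voltage = voltage_mv * (1.0 + gain * relative_error)
    return max(v_min, min(v_max, new_voltage))  # keep within assumed hardware limits


def simulate(n_molecules=10000, target_rate=50.0, k=2.5e-5,
             steps=20, voltage_mv=200.0):
    """Toy depletion model: captures per step = k * remaining molecules * voltage (an assumption)."""
    history = []
    remaining = float(n_molecules)
    for _ in range(steps):
        rate = min(k * remaining * voltage_mv, remaining)
        remaining -= rate  # captured molecules leave the cis chamber
        history.append((round(voltage_mv, 1), round(rate, 1)))
        voltage_mv = adjust_voltage(voltage_mv, rate, target_rate)
    return history


for voltage, rate in simulate():
    print(voltage, rate)
```

In this sketch the voltage rises gradually as the cis chamber depletes, so the detection rate stays near the target instead of decaying with concentration.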

Example labels for increased sensitivity in translocation signal detection using solid-state nanopores

[0105] Unlike biological nanopores, solid-state nanopores do not incorporate proteins into their systems, but use various metal or metal alloy substrates with nanometer-sized pores that allow DNA or RNA to pass through. Solid-state nanopore measurements typically have a higher translocation speed that can be ~1000-fold higher than standard biological nanopore translocation driven by helicase enzymes (Oxford Nanopore Technologies: ~450 bp/s). The increased translocation speed results in a current versus time signal that, depending on the sampling rate, can impede the accurate detection or identification of bases or structural features. To increase the sensitivity of the translocation signal resulting from linear DNA molecules decorated with structural DNA labels, a DNA expandomer strategy can be deployed, as described in this specification. The strategy includes ligating individual oligos or more complex nanostructures onto a scaffold (this can be a single-stranded DNA, e.g., an NMDI or FLI) that can subsequently be expanded under denaturing conditions (or by any other means that unfolds the DNA expandomer). The expandomer structures are ligated onto or between structural labels. The resulting increase in template length of a label (e.g., anywhere between 10 and 10,000 nucleotides) adds additional DNA that will space out the structural features and allow better detection of structural labels in the current versus time measurements on the solid-state nanopore.
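The dependence on sampling rate can be made concrete with a short calculation. The sketch below estimates how many current samples are recorded while a feature of a given length passes the pore. The 250 kHz sampling rate, the 40 nt and 4,000 nt feature lengths, and the assumption of a constant translocation speed are illustrative choices, not values from this specification; only the ~450 bp/s figure and the ~1000-fold speed ratio come from the text above.

```python
def samples_per_feature(feature_length_nt, speed_nt_per_s, sampling_rate_hz):
    """Number of current samples recorded while a feature of the given length passes the pore."""
    dwell_time_s = feature_length_nt / speed_nt_per_s  # constant speed assumed
    return dwell_time_s * sampling_rate_hz


HELICASE_SPEED = 450.0                      # nt/s, helicase-driven biological pore
SOLID_STATE_SPEED = HELICASE_SPEED * 1000   # ~1000-fold faster
SAMPLING_RATE = 250_000.0                   # Hz, hypothetical acquisition rate

for name, length in [("short label", 40), ("expanded label", 4000)]:
    for pore, speed in [("biological", HELICASE_SPEED), ("solid-state", SOLID_STATE_SPEED)]:
        n = samples_per_feature(length, speed, SAMPLING_RATE)
        print(f"{name:>14} ({length} nt), {pore:>11} pore: {n:,.0f} samples")
```

Under these assumptions a short label yields only a few tens of samples on a solid-state pore, whereas lengthening the labeled region by a given factor increases the number of samples by the same factor, which is the motivation for the expandomer strategy.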

[0106] FIG. 12 illustrates the reading of an example identifier with secondary structural features using nanopores. FIG. 12A shows a schematic depicting a DNA object with seven unique structural features or labels, in this example a linear double-stranded DNA template with structural labels S0 to S6. The length of the DNA oligo can vary and can include between 1 and 10 labels, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. In some implementations, a DNA oligo can include 10 or more labels, e.g., between 10 and 30 labels. Structural labels can be any nucleic acid (e.g., DNA) nanostructure composed of a single DNA strand with a secondary structure or can be composed of multiple strands that form a 2D or 3D DNA nanostructure. Furthermore, structural labels can also include one or more chemical modifications that distinguish the labels by charge or molecular weight. FIG. 12B illustrates an example combinatorial assembly used to assemble structural labels on an FLI template. In some implementations of the technologies described in this specification, e.g., the described (expandomer) labeling technologies, other (double-stranded) nucleic acid molecules encoding digital information can be used, e.g., molecules encoding digital data in bases or base sequences (e.g., using bit-per-base encoding). An FLI or other double-stranded molecule can be rendered single-stranded. Subsequently, structural labels are added (and/or ligated), or an NMDI/FLI is assembled from single components already containing the structural labels. In addition, in some implementations, an NMDI/FLI can also consist of only single-stranded DNA that contains structural labels comprising DNA secondary structures. In some implementations, the features or secondary structures of a label are designed such that the labeled DNA segment almost completely fills the nanopore. FIG. 12C illustrates an example FLI with structural labels translocating through a (solid-state) nanopore.

[0107] FIG. 13A illustrates an example labeled DNA molecule (e.g., FLI). The example structural label is represented as a dumbbell DNA structure. FIG. 13B illustrates different forms of dumbbell structures where t-arm length, t-arm stem, and/or number of arms can vary in nucleotide composition or size, or both. In some implementations, the nucleic acid structure of the labels can vary in one or more of size, shape, stem length, or mechanical properties (flexibility). In some implementations, one or more of the nucleic acids of the labels are decorated, e.g., with one or more atoms or molecules. In some implementations, the labels described in this specification, e.g., dumbbells, hairpins, proteins, etc., are fluorescently labeled. In some implementations, one or more of the nucleic acids of the labels are decorated with one or more fluorescently labeled nucleotides (e.g., dye- or quencher-labeled molecules). In some implementations, one or more of the nucleic acids of the labels include nucleotides decorated with small moieties. Such moieties can be or include Biotin, Desthiobiotin, Digoxigenin, DNP (Dinitrophenol), Photo-labile groups (“Caged”), Triple bonds (Alkyne), DBCO, Azide (-N3), Trans-Cyclooctene (TCO), Vinyl, Free amino group (-NH2), Redox Dyes, Halogen atoms (F, Cl, Br, I), Mercury (Hg), Selenium (Se), Methyl group (-CH3). In some implementations, click chemistry can be used to decorate nucleic acids of the labels, wherein pairs of functional groups rapidly and selectively react (“click”) with each other under mild, aqueous conditions. This method provides a convenient, versatile, and reliable two-step coupling procedure for coupling molecules A and B, where A is or includes DBCO-containing nucleotides or Alkyne-containing nucleotides and where B is or includes an Azide-containing fluorescent dye, an Azide-containing (desthio) biotinylation reagent, an Azide-containing FLAG reagent, or an Azide-containing amino acid, or where A is or includes Azide-containing nucleotides and B is or includes a DBCO-containing fluorescent dye, a (desthio) biotinylation reagent, a FLAG reagent, an Alkyne-containing fluorescent dye, or an amino acid. In some implementations, one or more of the nucleic acids of the labels include nucleotides decorated using (1) a dead Cas9 system programmed to target and bind, but not cleave, a specific structural label region (~20 nt) by a guide RNA, (2) transcription factors (TF) to bind to a label region that contains specific TF recognition sites (as described above), or (3) catalytically dead restriction enzymes to target binding motifs designed into the label regions.

[0108] FIG. 14 illustrates an example molecule (FLI) labeled with a set of “dumbbell” DNA labels. FIG. 14A is a schematic of a 7.2 kb single-stranded DNA molecule labeled with short oligos ~40 bp in length and including 29 DNA dumbbells located in the middle (length of labeled region: ~400 bp). The labeled portion of the construct is flanked by regions comprising “tiles,” i.e., short single-stranded unlabeled oligos that are separated by nicks (flanking regions ~3 kb). FIG. 14B shows results of an agarose gel electrophoresis validation (2% agarose gel, 10 µL SYBR Safe in 160 mL gel (1x TE, 10 mM MgCl2), 60 V for 4.5 h, loaded equimolar amounts), showing four different 7.2 kb DNA constructs labeled with (1) only tiling oligos, (2) 14 DNA dumbbells and tiling oligos, (3) 15 DNA dumbbells and tiling oligos, and (4) 29 DNA dumbbells and tiling oligos. The 14 and 15 dumbbells of samples 2 and 3 are located in the same region of sample 4. Sample 2 contains the first 14 dumbbells and sample 3 contains the last 15 dumbbells (assuming counting from left to right). Results indicate little variation in molecular weight. The two lanes to the left show results for circularized and linearized M13 bacteriophage genomes.

[0109] FIG. 15 illustrates translocation of a 20 kb double-stranded DNA without any structural features. Experimental conditions were: 1 M KCl + 1x TBS Buffer, 15 nm Norcada nanopore, 200 mV, 12.4 nA, 4m12s experiment duration. FIG. 15A shows average blockage (current) versus log dwell time of molecules in the nanopore. FIG. 15B is an overlayed plot of individual translocation events (all plots were generated using Nanolyzer software). The change in current from about 0 to about -400 indicates entry of a molecule into the nanopore, while the change in current from about -400 to about 0 indicates exit of the molecule from the nanopore. FIG. 15C (i-iii) show individual translocation events of example 20 kb dsDNA fragments. FIG. 15C (iii) shows a step in the signal at about 550 ms, indicating translocation of a folded molecule.

[0110] FIG. 16 illustrates translocation of a 7.2 kb linear DNA fragment with structural labels (construct number 4 in FIG. 14). Experimental conditions: full set of 29 dumbbells present (~400 bp stretch), 3.75 M LiCl + 1x TE Buffer, 15 nm Norcada nanopore, 200 mV, 14.5 nA, 3m43s experiment duration. FIG. 16A is an overlayed plot of individual translocation events. FIG. 16B(i-iii) show individual translocation events of example 7.2 kb linear DNA fragments with structural labels. FIG. 16B(i) shows a translocation event with a particularly strong signal from the structural feature in the middle 400 bp stretch of the molecule containing 29 DNA dumbbells (corresponding time interval: approx. 1800-2000). It should be noted that the number of read molecules can be increased to improve the signal-to-noise ratio (e.g., increasing the likelihood of reading molecules with strong signals).

[0111] FIG. 17A shows average blockage (current) versus log dwell time of molecules in the nanopore for translocation of an example 7.2 kb linear DNA fragment with structural labels (construct number 4 in FIG. 14). Experimental conditions were: all dumbbells present (~400 bp stretch), 3.75 M LiCl + 1x TE Buffer, 15 nm Norcada nanopore, 200 mV, 14.5 nA, 3m43s experiment duration. FIG. 17B illustrates an agarose gel electrophoresis validation (2% agarose gel, 10 µL SYBR Safe in 160 mL gel (1x TE, 10 mM MgCl2), 60 V for 4.5 h, loaded equimolar amounts), showing example 7.2 kb DNA constructs labeled with 29 DNA dumbbells (directly in the middle, ~400 bp long) and tiling oligos. The gel also shows individual (not hybridized) tiling oligos and DNA dumbbells running at a lower molecular weight than the constructs.

[0112] FIG. 18 illustrates translocation of a 7.2 kb linear DNA fragment with structural labels (construct number 2 in FIG. 14), i.e., with only the first 14 DNA dumbbells out of the 29 possible dumbbells (see FIG. 14B). The labeled length is about 200 bp (approx. half of the ~400 bp tiled by the 29 dumbbells). Experimental conditions were: 3 µL DNA [10 nM stock] in 50 µL KCl [1 M] + 1x TBS Buffer, 10 nm Norcada nanopore, 150 mV, 2m45s experiment duration. FIG. 18A shows average blockage (current) versus log dwell time of molecules in the nanopore. FIG. 18B is an overlayed plot of individual translocation events as described above. FIG. 18C(i-iii) show individual translocation events of example 7.2 kb linear DNA fragments. The structural labels are clearly visible, e.g., at approximate time intervals 790-840 (i), 800-850 (ii), and 2100-2200 (iii). Note that a nanopore is agnostic in terms of the direction of a molecule entering the pore. Therefore, different time plots can be seen depending on whether the molecule entered with the left or right end first (FIG. 18B).

[0113] FIG. 19 illustrates translocation of a 7.2 kb linear DNA fragment without any structural labels, i.e., a 7.2 kb single-stranded fragment with short ~30-40 nt oligo tiles (construct number 1 in FIG. 14). Experimental conditions were: all tiles (no dumbbells), 1.5 µL DNA [10 nM stock] in 50 µL KCl [1 M] + 1x TBS Buffer, 10 nm Norcada nanopore, 150 mV, 8.8 nA, 5m0s experiment duration. FIG. 19A shows average blockage (current) versus log dwell time of molecules in the nanopore. FIG. 19B is an overlayed plot of individual translocation events. FIG. 19C(i-iii) show individual translocation events of a 7.2 kb linear DNA fragment with no structural labels. None of the plots shows any distinguishable feature/label signals.
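Translocation traces such as those in FIGS. 15-19 can, in principle, be segmented with two thresholds: one level marking the passage of unlabeled DNA and a deeper level marking the passage of a structural label. The following is a minimal sketch of such a segmentation. The threshold values, the minimum label duration, the synthetic trace, and the function names are hypothetical and are not the method used by the analysis software named above.

```python
def find_runs(trace, threshold):
    """Return (start, end) index pairs, end exclusive, where the trace lies below the threshold."""
    runs, start = [], None
    for i, value in enumerate(trace):
        if value < threshold and start is None:
            start = i
        elif value >= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(trace)))
    return runs


def call_labels(trace, dna_threshold=-200.0, label_threshold=-600.0, min_samples=2):
    """Find translocation events, then deeper blockades (candidate labels) inside each event."""
    events = []
    for ev_start, ev_end in find_runs(trace, dna_threshold):
        segment = trace[ev_start:ev_end]
        labels = [(ev_start + s, ev_start + e)
                  for s, e in find_runs(segment, label_threshold)
                  if e - s >= min_samples]  # ignore single-sample spikes
        events.append({"event": (ev_start, ev_end), "labels": labels})
    return events


# Synthetic trace: open pore (0), DNA level (-400), label level (-800).
trace = [0, 0, -400, -400, -800, -800, -800, -400, -400, 0, 0]
print(call_labels(trace))
```

Because a molecule can enter the pore with either end first, a downstream step would typically compare each event against both the expected label order and its reverse.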

Expandomer labeling for component identification

[0114] Alternatively or additionally to labeling with dumbbells or other labels described above, nucleic acid molecules encoding digital information, e.g., identifiers, can be labeled with labels including one or more expandomers, e.g., for faster and more accurate reading compared with standard sequencing. In expandomer-based sequencing, DNA to be sequenced is converted into a longer surrogate molecule (an expandomer). Expandomers can be synthesized from a DNA template using a polymerase in a modified DNA replication process by incorporating customized expandable nucleotides (e.g., X-NTPs). These expandable nucleotides support longer, high signal-to-noise reporters that can be 10-100 (or more) times longer than the original DNA. After synthesis and denaturation, bonds between the X-NTPs are degraded, which allows the backbone containing the reporters to expand. The expandomers pass through a nanopore (or other sequencing technology) to read out the signal. In some implementations, expandomers can be used for standard sequencing where each base in the sequence is hybridized to an X-NTP, e.g., where bits or bit sequences are encoded in individual bases.

[0115] FIG. 20 illustrates an example expandomer labeling strategy that can be used with the technologies described in this specification. The template (a single-stranded nucleotide encoding digital information, e.g., identifier) is represented by a right-facing arrow indicating a single-stranded scaffold that can be of any length. This strand is tiled with oligos that contain an unhybridized single-stranded looped section analogous to the reporters described above. The oligos contain a 5’-phosphate modification that can be used to ligate the looped expandable tiling oligos onto the scaffold. Upon ligation of the labels, the construct can be denatured, which results in two different length single-stranded oligos. Upon denaturation, the looped sections extend, forming a strand that is longer (e.g., several times longer) than the scaffold strand. FIG. 20A is a schematic of long single-stranded scaffolds (e.g., expandable tiling oligos). These oligos are designed for minimized secondary structure. Single-stranded DNA can have some form of a secondary structure (e.g., sections that display hairpins/base pairing), and the likelihood of the occurrence of such structure increases with strand length. The strands used here can be computationally optimized to decrease the amount of predicted secondary structure using a commercial thermodynamic prediction tool. Example oligos that can tile the scaffold with a looped section (to form the expandomer) are shown above the scaffold and are of length 30-60 nucleotides. In some implementations, oligos with looped sections can be between 10 and 1000 oligos long, between 10 and 100 oligos, or over 1000 oligos long. FIG. 20B is a schematic of looped expandable tiling oligos on a scaffold. The looped sections are now looped between two straight sections of the label, which are hybridized to the scaffold. In some implementations, a looped section terminates in an X-NTP. In some implementations, the straight parts of the oligos are of the same respective length in each label. In some implementations, the straight parts of the oligos are of different respective lengths. In some implementations, the loop can be or can include a DNA nanostructure, e.g., one or more dumbbells, hairpins, coils, or a combination thereof.

[0116] FIG. 21 shows two example configurations of scaffold and looped expandable tiling oligos. The 160 nucleotide scaffolds are shown by the dotted arrows, and tiling oligos containing loops are shown by the white arrows. There are two different example sequence design versions: the top scaffold is called sc2, and the bottom scaffold is called sc3.

[0117] FIG. 22 shows a denaturing PAGE gel illustrating assembly of an example expandomer design. Two separate 160 nucleotide scaffolds were designed using only A, G, and T (sc2 and sc3, see FIG. 21). Each scaffold is annealed to the looped expandable tiling oligos and subsequently ligated to form a looped hybrid structure (as illustrated in FIG. 20B). The gel lanes are as follows: (1): Single-stranded DNA ladder (L). (2 and 3): 160 nt scaffold only (sc2 and sc3). (4 and 5): expandable tiling oligos only. Tiling oligos in the middle of the scaffold are shorter than on the end of the scaffolds. (6 and 7): scaffold and expandable tiling oligos showing ligated product of expected length (labeled with asterisk) and scaffold only (white arrow). Expandable tiling oligos were in 10x molar excess (scaffold at 10 nM and tiling oligos at 100 nM final reaction concentration) during the reaction. (8 and 9): Same reaction as lanes 6 and 7, but without the ligase. (10): same ladder as lane 1. Reactions were incubated at 34°C for 10 minutes before being quenched with 30 mM EDTA (final concentration) and prepared on a 6% denaturing gel (run at 150 V for 40 minutes in 1x TBE buffer). The results illustrate the successful assembly of DNA with looped reporter-type oligos for expandomer sequencing of DNA encoding digital information.

NMDI/FLI concatemerization strategy for consensus-based DNA reading

[0118] Despite the significant gains in speed and throughput that nanopore-based sequencing can generate, controlling sequencing error rate can be challenging. This issue has been shown in commercially available biological nanopore-based devices, such as the MinION by Oxford Nanopore Technologies, which has an error rate of 10.5%. Comparatively, the increase in speed of translocation in solid-state nanopores can result in lower spatial and temporal resolution than that of biological pores. While there are approaches to slow down translocation and improve resolution, these approaches may not be appropriate or yield sufficient reduction in error rate. Therefore, it is important to build redundancy into DNA molecules to ensure that all nucleotides of nucleic acid molecules encoding digital information, e.g., NMDIs/FLIs, are read. An effective strategy for building this redundancy is the concatenation of such NMDIs/FLIs. Using this approach, rather than reading signals from a single NMDI/FLI translocating across a pore, concatemers of a plurality of NMDIs/FLIs can be generated, e.g., concatemers of greater than three of the same NMDI/FLI translocating across a pore. These concatemers can be used in a consensus-based sequencing approach, which can further improve the accuracy of nanopore-based sequencing.

[0119] Consensus-based sequencing is a powerful tool that has been shown to significantly reduce base-calling errors in DNA sequencing. This technique is also particularly useful when reading labeled DNA. Labeling of individual components of NMDIs/FLIs can be a powerful strategy in reading NMDIs/FLIs. The premise behind this approach is to read entire stretches of DNA (or a component) by reading one or more labels instead of individual bases. One potential source of error when taking a label-based approach to reading NMDIs/FLIs is the efficiency of labeling. Missing labels can result in loss of data or incorrect component calling. Concatenation of NMDIs/FLIs can provide a consensus-based method of reading labeled NMDIs/FLIs. To ensure identification accuracy, the concatenated NMDI/FLI is read in repeated passes by detecting component-specific labels, thus improving results in single-molecule measurements. Molecular biology-based strategies for concatenating NMDIs/FLIs include rolling circle amplification, which can be a powerful tool for consensus generation for the purpose of reading labels.

[0120] FIG. 23 illustrates an example process for consensus generation from a concatemer. Labels from concatenated DNA (e.g., three NMDIs/FLIs) can generate a consensus, providing error correction to reading labels. The concatenated NMDIs/FLIs are read in repeated passes by detecting component-specific labels to improve identification accuracy. In the illustrative example, a single NMDI/FLI has three components (shaded segments; first row). In the second row, three concatenated NMDI/FLI molecules are shown as a long DNA sequence (triangles = component-specific labels, vertical line = NMDI/FLI junction). Consensus reads (fourth row) obtained from multiple passes on a single NMDI/FLI molecule (shown in the third row) can be used to improve results for error-prone single-molecule measurements.
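One simple way to form a consensus from repeated passes is a per-component majority vote that ignores missed labels. The sketch below assumes that each pass has already been split at the NMDI/FLI junctions into one label call per component, with None marking a label that was not detected; the label names and the tie-breaking behavior are illustrative assumptions, not the consensus procedure of this specification.

```python
from collections import Counter


def consensus(passes):
    """Majority vote per component position across repeated passes.

    passes: list of equal-length lists; each entry is a label call,
    or None if the label was missed in that pass.
    """
    if not passes:
        return []
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must cover the same number of components")
    result = []
    for position in range(length):
        votes = Counter(p[position] for p in passes if p[position] is not None)
        # On a tie, Counter.most_common keeps the call that was seen first.
        result.append(votes.most_common(1)[0][0] if votes else None)
    return result


# Three passes over a three-component molecule (hypothetical label names).
passes = [
    ["A1", "B2", "C3"],
    ["A1", None, "C3"],   # label on the second component was missed
    ["A1", "B2", "C1"],   # third component was miscalled
]
print(consensus(passes))
```

A missed label in one pass and a miscall in another are both outvoted here, which is the error-correcting effect that repeated reads of a concatemer are intended to provide.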

[0121] FIG. 24 illustrates rolling circle amplification of circularized NMDIs/FLIs enabling the creation of consensus-based labeling. In this implementation, a circular DNA template is used, which can be in the form of a plasmid or a circularized oligonucleotide. This template is amplified by a DNA polymerase enzyme (e.g., phi29 DNA polymerase) that binds to the template to synthesize new DNA strands. As the DNA polymerase moves around the circular template, it continuously synthesizes new copies of the DNA molecule, leading to exponential amplification. This process can be repeated multiple times to generate large amounts of DNA from a small starting sample. These copies of the DNA can be used for consensus generation as described above.

Encoding Schemes

[0122] Described in this specification are methods and systems for encoding information, e.g., digital information, in nucleic acid (e.g., deoxyribonucleic acid, DNA) molecules, e.g., without base-by-base synthesis, by encoding bit-value information in the presence or absence of unique nucleic acid sequences within a pool, comprising specifying each bit location in a bit-stream with a unique nucleic acid sequence and specifying the bit value at that location by the presence or absence of the corresponding unique nucleic acid sequence in the pool. More generally, unique bytes in a byte stream can be represented by unique subsets of nucleic acid sequences. Also disclosed are methods for generating unique nucleic acid sequences without base-by-base synthesis using combinatorial genomic strategies (e.g., assembly of multiple nucleic acid sequences or enzymatic-based editing of nucleic acid sequences).
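The presence/absence scheme can be illustrated with a minimal sketch in which each bit position is assigned one unique sequence and a pool contains exactly the sequences whose bit value is ‘1’. The placeholder sequence names, the choice that presence encodes ‘1’, and the function names are assumptions made for illustration.

```python
def encode(bits, sequences):
    """Return the pool: the set of sequences whose bit position holds a '1'."""
    if len(bits) > len(sequences):
        raise ValueError("need one unique sequence per bit position")
    return {sequences[i] for i, bit in enumerate(bits) if bit == "1"}


def decode(pool, sequences, n_bits):
    """Recover the bit string by testing each position's sequence for presence in the pool."""
    return "".join("1" if sequences[i] in pool else "0" for i in range(n_bits))


# Hypothetical unique sequences, one per bit position.
sequences = ["seq0", "seq1", "seq2", "seq3", "seq4"]

pool = encode("10110", sequences)
print(sorted(pool))
print(decode(pool, sequences, n_bits=5))
```

Note that the pool itself is unordered; the bit positions are recovered only through the fixed assignment of sequences to positions, so writer and reader must share that assignment.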

[0123] The term “symbol,” as used herein, generally refers to a representation of a unit of digital information. Digital information may be divided or translated into a string of symbols. In an example, a symbol may be a bit and the bit may have a value of ‘0’ or ‘1’.

[0124] The term “distinct,” or “unique,” as used herein, generally refers to an object that is distinguishable from other objects in a group. For example, a distinct, or unique, nucleic acid sequence may be a nucleic acid sequence that does not have the same sequence as any other nucleic acid sequence. A distinct, or unique, nucleic acid molecule may not have the same sequence as any other nucleic acid molecule. The distinct, or unique, nucleic acid sequence or molecule may share regions of similarity with another nucleic acid sequence or molecule.

[0125] The term “component,” as used herein, generally refers to a nucleic acid sequence. A component may be a distinct nucleic acid sequence. A component may be concatenated or assembled with one or more other components to generate other nucleic acid sequences or molecules.

[0126] The term “layer,” as used herein, generally refers to a group or pool of components. Each layer may comprise a set of distinct components such that the components in one layer are different from the components in another layer. Components from one or more layers may be assembled to generate one or more identifiers.

[0127] The term “identifier,” as used herein, generally refers to a nucleic acid molecule or a nucleic acid sequence that represents the position and value of a bit-string within a larger bit-string. More generally, an identifier may refer to any object that represents or corresponds to a symbol in a string of symbols. In some embodiments, identifiers may comprise one or multiple concatenated components.

[0128] The term “combinatorial space,” as used herein, generally refers to the set of all possible distinct identifiers that may be generated from a starting set of objects, such as components, and a permissible set of rules for how to modify those objects to form identifiers. The size of a combinatorial space of identifiers made by assembling or concatenating components may depend on the number of layers of components, the number of components in each layer, and the particular assembly method used to generate the identifiers.

[0129] The term “identifier rank,” as used herein, generally refers to a relation that defines the order of identifiers in a set.
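As a concrete illustration of these two definitions, the sketch below assumes one particular assembly rule, in which each identifier takes exactly one component from every layer in a fixed layer order, and one particular rank, namely lexicographic order over the layers. Under that assumed rule the size of the combinatorial space is the product of the layer sizes. The layer contents and function names are hypothetical, and other assembly methods would give different sizes and orderings.

```python
from itertools import product
from math import prod

# Hypothetical layers of components.
layers = [["A0", "A1"], ["B0", "B1", "B2"], ["C0", "C1"]]


def space_size(layers):
    """Number of distinct identifiers when one component is taken from each layer."""
    return prod(len(layer) for layer in layers)


def identifier_at_rank(layers, rank):
    """Map a rank to its identifier using mixed-radix digits (last layer varies fastest)."""
    if not 0 <= rank < space_size(layers):
        raise IndexError("rank outside the combinatorial space")
    chosen = []
    for layer in reversed(layers):
        rank, index = divmod(rank, len(layer))
        chosen.append(layer[index])
    return tuple(reversed(chosen))


def rank_of(layers, identifier):
    """Inverse mapping: identifier (tuple of components) back to its rank."""
    rank = 0
    for layer, component in zip(layers, identifier):
        rank = rank * len(layer) + layer.index(component)
    return rank


print(space_size(layers))
print(identifier_at_rank(layers, 5))
print(rank_of(layers, ("A0", "B2", "C1")))
```

With such a rank, a symbol position in a string can be mapped to an identifier and back without storing an explicit table of all identifiers.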

[0130] The term “identifier library,” as used herein, generally refers to a collection of identifiers corresponding to the symbols in a symbol string representing digital information. In some embodiments, the absence of a given identifier in the identifier library may indicate a symbol value at a particular position. One or more identifier libraries may be combined in a pool, group, or set of identifiers. Each identifier library may include a unique barcode that identifies the identifier library.

[0131] The term “nucleic acid,” as used herein, generally refers to deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or a variant thereof. A nucleic acid may include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T), and uracil (U), or variants thereof. A nucleotide can include A, C, G, T, or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be A, C, G, T, or U, or any other subunit that may be specific to one or more complementary A, C, G, T, or U, or complementary to a purine (i.e., A or G, or variant thereof) or pyrimidine (i.e., C, T, or U, or variant thereof). In some examples, a nucleic acid may be single-stranded or double-stranded. In some cases, a nucleic acid is circular.

[0132] The terms “nucleic acid molecule” or “nucleic acid sequence,” as used herein, generally refer to a polymeric form of nucleotides, or polynucleotide, that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof. The term “nucleic acid sequence” may refer to the alphabetical representation of a polynucleotide; alternatively, the term may be applied to the physical polynucleotide itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for mapping nucleic acid sequences or nucleic acid molecules to symbols, or bits, encoding digital information. Nucleic acid sequences or oligonucleotides may include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.

[0133] An “oligonucleotide,” as used herein, generally refers to a single-stranded nucleic acid sequence, and is typically composed of a specific sequence of four nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T) (or uracil (U) when the polynucleotide is RNA).

[0134] Examples of modified nucleotides include, but are not limited to, diaminopurine, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5’-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl)uracil, (acp3)w, 2,6-diaminopurine and the like. Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone. Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexhylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxysuccinimide esters (NHS).

[0135] The term “primer,” as used herein, generally refers to a strand of nucleic acid that serves as a starting point for nucleic acid synthesis, such as polymerase chain reaction (PCR). In an example, during replication of a DNA sample, an enzyme that catalyzes replication starts replication at the 3’-end of a primer attached to the DNA sample and copies the opposite strand.

[0136] The term “polymerase” or “polymerase enzyme,” as used herein, generally refers to any enzyme capable of catalyzing a polymerase reaction. Examples of polymerases include, without limitation, a nucleic acid polymerase. The polymerase can be naturally occurring or synthesized. An example polymerase is a Φ29 polymerase or derivative thereof. In some cases, a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences. Examples of polymerases include a DNA polymerase, an RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Tfl polymerase, Pfuturbo polymerase, Pyrobest polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Klenow fragment polymerase with 3’ to 5’ exonuclease activity, and variants, modified products and derivatives thereof.

[0137] Digital information, such as computer data, in the form of binary code can comprise a sequence or string of symbols. A binary code may encode or represent text or computer processor instructions using, for example, a binary number system having two binary symbols, typically 0 and 1, referred to as bits. Digital information may be represented in the form of non-binary code which can comprise a sequence of non-binary symbols. Each encoded symbol can be re-assigned to a unique bit string (or “byte”), and the unique bit string or byte can be arranged into strings of bytes or byte streams. A bit value for a given bit can be one of two symbols (e.g., 0 or 1). A byte, which can comprise a string of N bits, can have a total of 2^N unique byte-values. For example, a byte comprising 8 bits can produce a total of 2⁸ or 256 possible unique byte-values, and each of the 256 bytes can correspond to one of 256 possible distinct symbols, letters, or instructions which can be encoded with the bytes. Raw data (e.g., text files and computer instructions) can be represented as strings of bytes or byte streams. Zip files, or compressed data files comprising raw data, can also be stored in byte streams. These files can be stored as byte streams in a compressed form, and then decompressed into raw data before being read by the computer.

[0138] Methods and systems of the present disclosure may be used to encode computer data or information in a plurality of identifiers, each of which may represent one or more bits of the original information. In some examples, methods and systems of the present disclosure encode data or information using identifiers that each represents two bits of the original information.

[0139] Previous methods for encoding digital information into nucleic acids have relied on base-by-base synthesis of the nucleic acids, which can be costly and time consuming. Alternative methods may improve the efficiency, improve the commercial viability of digital information storage by reducing the reliance on base-by-base nucleic acid synthesis for encoding digital information, and eliminate the de novo synthesis of distinct nucleic acid sequences for every new information storage request.

[0140] New methods, e.g., as described in this specification, can encode digital information (e.g., binary code, ternary code, quaternary code, base-x code, wherein x is an integer, decimal code, or hexadecimal code, or combinations thereof) in a plurality of identifiers, or nucleic acid sequences, comprising combinatorial arrangements of components instead of relying on base-by-base or de-novo nucleic acid synthesis (e.g., phosphoramidite synthesis). As such, new strategies may produce a first set of distinct nucleic acid sequences (or components) for the first request of information storage, and can thereafter re-use the same nucleic acid sequences (or components) for subsequent information storage requests. These approaches can significantly reduce the cost of DNA-based information storage by reducing the role of de-novo synthesis of nucleic acid sequences in the information-to-DNA encoding and writing process. Moreover, unlike implementations of base-by-base synthesis, such as phosphoramidite chemistry-based or template-free polymerase-based nucleic acid elongation, which may use cyclical delivery of each base to each elongating nucleic acid, new methods of information-to-DNA writing using identifier construction from components are highly parallelizable processes that do not necessarily use cyclical nucleic acid elongation. Thus, new methods may increase the speed of writing digital information to DNA compared to older methods.

Example methods for encoding and writing information to nucleic acid sequence(s)

[0141] In an aspect, the present disclosure provides methods for encoding information into nucleic acid sequences. A method for encoding information into nucleic acid sequences may comprise (a) translating the information into a string of symbols, (b) mapping the string of symbols to a plurality of identifiers, and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers. An individual identifier of the plurality of identifiers may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence. Each symbol at each position in the string of symbols may correspond to a distinct identifier. The individual identifier may correspond to an individual symbol at an individual position in the string of symbols. Moreover, one symbol at each position in the string of symbols may correspond to the absence of an identifier. For example, in a string of binary symbols (e.g., bits) of ‘0’s and ‘1’s, each occurrence of ‘0’ may correspond to the absence of an identifier.

[0142] In another aspect, the present disclosure provides methods for nucleic acid-based computer data storage. A method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, and (c) storing the nucleic acid molecules having the nucleic acid sequences. The computer data may be encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules.

[0143] In another aspect, the present disclosure provides methods for writing and storing information in nucleic acid sequences. The method may comprise (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations. An individual identifier of the identifier library may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.

[0144] In another aspect, the present disclosure provides methods for nucleic acid-based computer data storage. A method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence. Synthesizing the nucleic acid molecule may be in the absence of base-by-base nucleic acid synthesis.

[0145] In another aspect, the present disclosure provides methods for writing and storing information in nucleic acid sequences. A method for writing and storing information in nucleic acid sequences may comprise (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations. An individual identifier of the identifier library may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.

[0146] FIG. 25 illustrates an overview process for encoding information into nucleic acid sequences, writing information to the nucleic acid sequences, reading information written to nucleic acid sequences, and decoding the read information. Digital information, or data, may be translated into one or more strings of symbols. In an example, the symbols are bits and each bit may have a value of either ‘0’ or ‘1’. Each symbol may be mapped, or encoded, to an object (e.g., identifier) representing that symbol. Each symbol may be represented by a distinct identifier. The distinct identifier may be a nucleic acid molecule made up of components. The components may be nucleic acid sequences. The digital information may be written into nucleic acid sequences by generating an identifier library corresponding to the information. The identifier library may be physically generated by physically constructing the identifiers that correspond to each symbol of the digital information. All or any portion of the digital information may be accessed at a time. In an example, a subset of identifiers is accessed from an identifier library. The subset of identifiers may be read by sequencing and identifying the identifiers. The identified identifiers may be associated with their corresponding symbol to decode the digital data.

[0147] A method for encoding and reading information using the approach of FIG. 25 can, for example, include receiving a bit stream and mapping each one-bit (bit with bit-value of ‘1’) in the bit stream to a distinct nucleic acid identifier using an identifier rank or a nucleic acid index. The method can further include constructing a nucleic acid sample pool, or identifier library, comprising copies of the identifiers that correspond to bit values of 1 (and excluding identifiers for bit values of 0). Reading the sample can comprise using molecular biology methods (e.g., sequencing, hybridization, PCR, etc.), determining which identifiers are represented in the identifier library, and assigning bit-values of ‘1’ to the bits corresponding to those identifiers and bit-values of ‘0’ elsewhere (again referring to the identifier rank to identify the bits in the original bit-stream that each identifier corresponds to), thus decoding the information into the original encoded bit stream.
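For illustration only, the presence/absence encoding described above may be sketched in software as follows. The sketch assumes that the identifier rank is simply the zero-based position of a bit and that the length of the bit stream is known when decoding; the function names encode_bits and decode_bits are hypothetical and are not part of the disclosure.

```python
def encode_bits(bits):
    """Return the set of ranks whose identifiers would be constructed.

    Assumption: the rank of an identifier equals the zero-based position
    of the corresponding bit. Only positions holding '1' get an identifier.
    """
    return {rank for rank, bit in enumerate(bits) if bit == "1"}


def decode_bits(present_ranks, length):
    """Rebuild the bit stream from the ranks found when reading the library.

    Ranks that were observed decode to '1'; every other position decodes
    to '0'. The original length must be supplied by the caller.
    """
    return "".join("1" if rank in present_ranks else "0" for rank in range(length))


original = "0110100101"
library = encode_bits(original)          # ranks of identifiers to construct
recovered = decode_bits(library, len(original))
print(sorted(library), recovered == original)
```

In this model the identifier library is nothing more than a set of ranks, which mirrors the statement that identifiers for bit values of 0 are excluded from the pool.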

[0148] Encoding a string of N distinct bits can use an equivalent number of unique nucleic acid sequences as possible identifiers. This approach to information encoding may use de-novo synthesis of identifiers (e.g., nucleic acid molecules) for each new item of information (string of N bits) to store. In other instances, the cost of newly synthesizing identifiers (equivalent in number to or less than N) for each new item of information to store can be reduced by the one-time de-novo synthesis and subsequent maintenance of all possible identifiers, such that encoding new items of information may involve mechanically selecting and mixing together pre-synthesized (or pre-fabricated) identifiers to form an identifier library. In other instances, both the cost of (1) de-novo synthesis of up to N identifiers for each new item of information to store or (2) maintaining and selecting from N possible identifiers for each new item of information to store, or any combination thereof, may be reduced by synthesizing and maintaining a number (less than N, and in some cases much less than N) of nucleic acid sequences and then modifying these sequences through enzymatic reactions to generate up to N identifiers for each new item of information to store.

[0149] The identifiers may be rationally designed and selected for ease of read, write, access, copy, and deletion operations. The identifiers may be designed and selected to minimize write errors, mutations, degradation, and read errors.

[0150] FIGS. 26A and 26B schematically illustrate an example method, referred to as “data at address”, of encoding digital data in objects or identifiers (e.g., nucleic acid molecules). FIG. 26A illustrates encoding a bit stream into an identifier library wherein the individual identifiers are constructed by concatenating or assembling a single component that specifies an identifier rank with a single component that specifies a byte-value. In general, the data at address method uses identifiers that encode information modularly by comprising two objects: one object, the “byte-value object” (or “data object”), that identifies a byte-value and one object, the “rank object” (or “address object”), that identifies the identifier rank (or the relative position of the byte in the original bit-stream). FIG. 26B illustrates an example of the data at address method wherein each rank object may be combinatorially constructed from a set of components and each byte-value object may be combinatorially constructed from a set of components. Such combinatorial construction of rank and byte-value objects enables more information to be written into identifiers than if the objects were made from the single components alone (e.g., FIG. 26A).
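As a non-limiting sketch, the data at address method may be modeled by representing each identifier as a pair consisting of a rank object and a byte-value object. The pair representation and the function names below are assumptions made for illustration; they stand in for the physical concatenation of components and are not taken from the figures.

```python
import random


def encode_data_at_address(data):
    """Model each identifier as a (rank object, byte-value object) pair.

    Assumption: the rank object is the zero-based byte position and the
    byte-value object is the integer value (0-255) of that byte.
    """
    return [(rank, value) for rank, value in enumerate(data)]


def decode_data_at_address(identifiers):
    """Order identifiers by rank and read out their byte-values.

    The pool may be read in any order, so sorting by the rank object is
    what restores the original byte stream.
    """
    return bytes(value for _, value in sorted(identifiers))


pool = encode_data_at_address(b"DNA")
random.shuffle(pool)                     # a physical pool has no inherent order
print(decode_data_at_address(pool))
```

Unlike presence/absence encoding, every byte position contributes an identifier here, and the value is carried by the identifier itself rather than by its presence.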

[0151] FIGS. 27A and 27B schematically illustrate another example method of encoding digital information in objects or identifiers (e.g., nucleic acid sequences). FIG. 27A illustrates encoding a bit stream into an identifier library wherein identifiers are constructed from single components that specify identifier rank. The presence of an identifier at a particular rank (or address) specifies a bit-value of ‘1’ and the absence of an identifier at a particular rank (or address) specifies a bit-value of ‘0’. This type of encoding may use identifiers that solely encode rank (the relative position of a bit in the original bit stream) and use the presence or absence of those identifiers in an identifier library to encode a bit-value of ‘1’ or ‘0’, respectively. Reading and decoding the information may include identifying the identifiers present in the identifier library, assigning bit-values of ‘1’ to their corresponding ranks and assigning bit-values of ‘0’ elsewhere. FIG. 27B illustrates an example encoding method where each identifier may be combinatorially constructed from a set of components such that each possible combinatorial construction specifies a rank. Such combinatorial construction enables more information to be written into identifiers than if the identifiers were made from the single components alone (e.g., FIG. 27A). For example, a component set may comprise five distinct components. The five distinct components may be assembled to generate ten distinct identifiers, each comprising two of the five components. The ten distinct identifiers may each have a rank (or address) that corresponds to the position of a bit in a bit stream. An identifier library may include the subset of those ten possible identifiers that corresponds to the positions of bit-value ‘1’, and exclude the subset of those ten possible identifiers that corresponds to the positions of the bit-value ‘0’ within a bit stream of length ten.
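The five-component example above may be illustrated with a short sketch. The component names c0 through c4, the example bit stream, and the choice to rank identifiers in the order produced by itertools.combinations are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical component names; the disclosure only states that there are five.
components = ["c0", "c1", "c2", "c3", "c4"]

# Every unordered pair of distinct components is one possible identifier.
# The list position of each pair serves as its rank (an assumed ordering).
identifiers = list(combinations(components, 2))

bits = "1001000110"                       # example bit stream of length ten

# Construct only the identifiers whose rank holds a '1'.
library = [ident for ident, bit in zip(identifiers, bits) if bit == "1"]

# Decode by checking which possible identifiers are present in the library.
decoded = "".join("1" if ident in library else "0" for ident in identifiers)

print(len(identifiers), library, decoded == bits)
```

Five maintained components therefore address ten bit positions, which is the gain the combinatorial construction provides over single-component identifiers.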

[0152] FIG. 28 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) to be physically constructed in order to store information of a given original size in bits (D, contour lines) using the encoding method shown in FIGS. 27A and 27B. This plot assumes that the original information of size D is re-coded into a string of C bits (where C may be greater than D) where a number of bits, k, has a bit-value of ‘1’. Moreover, the plot assumes that information-to-nucleic-acid encoding is performed on the re-coded bit string and that identifiers for positions where the bit-value is ‘1’ are constructed and identifiers for positions where the bit-value is ‘0’ are not constructed. Following the assumptions, the combinatorial space of possible identifiers has size C to identify every position in the re-coded bit string, and the number of identifiers used to encode the bit string of size D is such that D = log₂(C choose k), where “C choose k” may be the mathematical formula for the number of ways to pick k unordered outcomes from C possibilities. Thus, as the combinatorial space of possible identifiers increases beyond the size (in bits) of a given item of information, a decreasing number of physically constructed identifiers may be used to store the given information.
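The relationship D = log₂(C choose k) can be explored numerically. The sketch below is illustrative only: the function names are hypothetical, and it treats k as an exact count of constructed identifiers rather than the average plotted in FIG. 28.

```python
from math import comb, log2


def capacity_bits(C, k):
    """Bits representable by choosing exactly k identifiers out of C."""
    return log2(comb(C, k))


def min_identifiers(D, C):
    """Smallest k (searched up to C // 2) with capacity_bits(C, k) >= D.

    Raises ValueError when even k = C // 2, where the binomial coefficient
    peaks, cannot represent D bits.
    """
    for k in range(C // 2 + 1):
        if capacity_bits(C, k) >= D:
            return k
    raise ValueError("combinatorial space too small for D bits")


# For a fixed amount of information, a larger space needs fewer identifiers.
for C in (200, 1000, 10000):
    print(C, min_identifiers(100, C))
```

Under these assumptions the printed k values shrink as C grows, which is the trend described for the contour lines.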

[0153] FIG. 29 shows an overview method for writing information into nucleic acid sequences. Prior to writing the information, the information may be translated into a string of symbols and encoded into a plurality of identifiers. Writing the information may include setting up reactions to produce possible identifiers. A reaction may be set up by depositing inputs into a compartment. The inputs may comprise nucleic acids, components, templates, enzymes, or chemical reagents. The compartment may be a well, a tube, a position on a surface, a chamber in a microfluidic device, or a droplet within an emulsion. Multiple reactions may be set up in multiple compartments. Reactions may proceed to produce identifiers through programmed temperature incubation or cycling. Reactions may be selectively or ubiquitously removed (e.g., deleted). Reactions may also be selectively or ubiquitously interrupted, consolidated, and purified to collect their identifiers in one pool. Identifiers from multiple identifier libraries may be collected in the same pool. An individual identifier may include a barcode or a tag to identify to which identifier library it belongs. Alternatively, or in addition to, the barcode may include metadata for the encoded information. Supplemental nucleic acids or identifiers may also be included in an identifier pool together with an identifier library. The supplemental nucleic acids or identifiers may include metadata for the encoded information or serve to obfuscate or conceal the encoded information.

[0154] An identifier rank (e.g., nucleic acid index) can comprise a method or key for determining the ordering of identifiers. The method can comprise a look-up table with all identifiers and their corresponding rank. The method can also comprise a look-up table with the rank of all components that constitute identifiers and a function for determining the ordering of any identifier comprising a combination of those components. Such a method may be referred to as lexicographical ordering and may be analogous to the manner in which words in a dictionary are alphabetically ordered. In the data at address encoding method, the identifier rank (encoded by the rank object of the identifier) may be used to determine the position of a byte (encoded by the byte-value object of the identifier) within a bit stream. In an alternative method, the identifier rank (encoded by the entire identifier itself) for a present identifier may be used to determine the position of bit-value of ‘1’ within a bit stream.
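One possible form of the component look-up table and ordering function is sketched below. It assumes identifiers that contain exactly one component from each layer in a fixed layer order, and it assumes that each component's rank within its layer is a zero-based integer; the function names are hypothetical.

```python
def identifier_rank(component_ranks, layer_sizes):
    """Lexicographic rank of an identifier from its per-layer component ranks.

    The first layer is the most significant position, exactly as the first
    letter is most significant when ordering words in a dictionary.
    """
    rank = 0
    for index, size in zip(component_ranks, layer_sizes):
        if not 0 <= index < size:
            raise ValueError("component rank outside its layer")
        rank = rank * size + index
    return rank


def identifier_from_rank(rank, layer_sizes):
    """Inverse mapping: recover the per-layer component ranks from a rank."""
    indices = []
    for size in reversed(layer_sizes):
        rank, index = divmod(rank, size)
        indices.append(index)
    return tuple(reversed(indices))


sizes = (3, 3, 3)                         # three layers of three components
print(identifier_rank((1, 0, 2), sizes), identifier_from_rank(11, sizes))
```

A function of this kind avoids storing a table entry for every identifier, since only the per-layer component ranks need to be tabulated.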

[0155] A key may assign distinct bytes to unique subsets of identifiers (e.g., nucleic acid molecules) within a sample. For example, in a simple form, a key may assign each bit in a byte to a unique nucleic acid sequence that specifies the position of the bit, and then the presence or absence of that nucleic acid sequence within a sample may specify the bit-value of 1 or 0, respectively. Reading the encoded information from the nucleic acid sample can comprise any number of molecular biology techniques including sequencing, hybridization, or PCR. In some embodiments, reading the encoded dataset may comprise reconstructing a portion of the dataset or reconstructing the entire encoded dataset from each nucleic acid sample. Once the sample has been read, the nucleic acid index can be used along with the presence or absence of each unique nucleic acid sequence to decode the nucleic acid sample into a bit stream (e.g., each string of bits, byte, bytes, or string of bytes).

[0156] Identifiers may be constructed by combinatorially assembling component nucleic acid sequences. For example, information may be encoded by taking a set of nucleic acid molecules (e.g., identifiers) from a defined group of molecules (e.g., combinatorial space). Each possible identifier of the defined group of molecules may be an assembly of nucleic acid sequences (e.g., components) from a prefabricated set of components that may be divided into layers. Each individual identifier may be constructed by concatenating one component from every layer in a fixed order. For example, if there are M layers and each layer may have n components, then up to C = n^M unique identifiers may be constructed and up to 2^C different items of information, or C bits, may be encoded and stored. For example, storage of a megabit of information may use 1 x 10⁶ distinct identifiers or a combinatorial space of size C = 1 x 10⁶. The identifiers in this example may be assembled from a variety of components organized in different ways. Assemblies may be made from M = 2 prefabricated layers, each containing n = 1 x 10³ components. Alternatively, assemblies may be made from M = 3 layers, each containing n = 1 x 10² components. As this example illustrates, encoding the same amount of information using a larger number of layers may allow for the total number of components to be smaller. Using a smaller number of total components may be advantageous in terms of writing cost.
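The trade-off between the number of layers and the total number of components can be checked with a small calculation. The sketch assumes equally sized layers; the helper names are hypothetical, and the layer counts other than M = 2 and M = 3 are added only for comparison.

```python
def combinatorial_space(n, M):
    """Number of identifiers from M layers of n components each (C = n^M)."""
    return n ** M


def components_per_layer(C, M):
    """Smallest n such that n^M >= C, using integer checks to avoid float error."""
    n = max(1, round(C ** (1.0 / M)))
    while n ** M < C:
        n += 1
    while n > 1 and (n - 1) ** M >= C:
        n -= 1
    return n


target = 10 ** 6                          # one megabit, as in the example above
for M in (1, 2, 3, 6):
    n = components_per_layer(target, M)
    # Columns: layers, components per layer, total components to maintain
    print(M, n, n * M)
```

For the megabit example, two layers require 2 x 10³ components in total whereas three layers require 3 x 10², consistent with the observation that more layers may reduce the total component count.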

[0157] In an example, one can start with two sets of unique nucleic acid sequences or layers, X and Y, each with x and y components (e.g., nucleic acid sequences), respectively. Each nucleic acid sequence from X can be assembled to each nucleic acid sequence from Y. Though the total number of nucleic acid sequences maintained in the two sets may be the sum of x and y, the total number of nucleic acid molecules, and hence possible identifiers, that can be generated may be the product of x and y. Even more nucleic acid sequences (e.g., identifiers) can be generated if the sequences from X can be assembled to the sequences of Y in any order. For example, the number of nucleic acid sequences (e.g., identifiers) generated may be twice the product of x and y if the assembly order is programmable. This set of all possible nucleic acid sequences that can be generated may be referred to as XY. The order of the assembled units of unique nucleic acid sequences in XY can be controlled using nucleic acids with distinct 5’ and 3’ ends, and restriction digestion, ligation, polymerase chain reaction (PCR), and sequencing may occur with respect to the distinct 5’ and 3’ ends of the sequences. Such an approach can reduce the total number of nucleic acid sequences (e.g., components) used to encode N distinct bits, by encoding information in the combinations and orders of their assembly products. For example, to encode 100 bits of information, two layers of 10 distinct nucleic acid molecules (e.g., components) may be assembled in a fixed order to produce 10*10 or 100 distinct nucleic acid molecules (e.g., identifiers), or one layer of 5 distinct nucleic acid molecules (e.g., components) and another layer of 10 distinct nucleic acid molecules (e.g., components) may be assembled in any order to produce 100 distinct nucleic acid molecules (e.g., identifiers).
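The counts in the 100-bit example can be verified by enumeration. In the sketch below, an identifier is modeled as an ordered pair of component names; the names X1 through X5 and Y1 through Y10 are placeholders chosen for illustration.

```python
from itertools import product

# Hypothetical component names for a layer of 5 and a layer of 10.
X = [f"X{i}" for i in range(1, 6)]
Y = [f"Y{j}" for j in range(1, 11)]

# Fixed order: an X component always precedes a Y component.
fixed_order = [(x, y) for x, y in product(X, Y)]

# Programmable order: either X-then-Y or Y-then-X is a distinct identifier.
any_order = fixed_order + [(y, x) for x, y in product(X, Y)]

print(len(X) + len(Y), len(fixed_order), len(set(any_order)))
```

Fifteen maintained components thus yield 50 identifiers in a fixed order and 100 when the order is programmable, matching the figure of twice the product of x and y.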

[0158] Nucleic acid sequences (e.g., components) within each layer may comprise a unique (or distinct) sequence, or barcode, in the middle, a common hybridization region on one end, and another common hybridization region on the other end. The barcode may contain a sufficient number of nucleotides to uniquely identify every sequence within the layer. For example, there are typically four possible nucleotides for each base position within a barcode. Therefore, a three-base barcode may uniquely identify 4³ = 64 nucleic acid sequences. The barcodes may be designed to be randomly generated. Alternatively, the barcodes may be designed to avoid sequences that may create complications to the construction chemistry of identifiers or sequencing. Additionally, barcodes may be designed so that each may have a minimum Hamming distance from the other barcodes, thereby decreasing the likelihood that base-resolution mutations or read errors may interfere with the proper identification of the barcode.
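A minimal sketch of barcode enumeration and selection under a minimum Hamming distance is given below. The greedy selection strategy and the function names are assumptions made for illustration; the disclosure does not specify how barcodes are chosen, and a greedy pass is not guaranteed to find the largest possible set.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(base_a != base_b for base_a, base_b in zip(a, b))


def all_barcodes(length):
    """Every barcode of the given length over the four standard bases."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]


def select_barcodes(length, min_distance):
    """Greedily keep barcodes at least min_distance from all kept so far."""
    chosen = []
    for candidate in all_barcodes(length):
        if all(hamming(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
    return chosen


print(len(all_barcodes(3)))               # the 4^3 barcodes of three bases
print(select_barcodes(3, 2)[:4])          # first few barcodes kept at distance >= 2
```

With a minimum distance of two, any single-base substitution in a kept barcode produces a sequence that is not itself a kept barcode, so such an error can be detected rather than silently misread.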

[0159] The hybridization region on one end of the nucleic acid sequence (e.g., component) may be different in each layer, but the hybridization region may be the same for each member within a layer. Adjacent layers are those that have complementary hybridization regions on their components that allow them to interact with one another. For example, any component from layer X may be able to attach to any component from layer Y because they may have complementary hybridization regions. The hybridization region on the opposite end may serve the same purpose as the hybridization region on the first end. For example, any component from layer Y may attach to any component of layer X on one end and any component of layer Z on the opposite end.

[0160] FIGS. 30A and 30B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling a distinct component (e.g., nucleic acid sequence) from each layer in a fixed order. FIG. 30A illustrates the architecture of identifiers constructed using the product scheme. An identifier may be constructed by combining a single component from each layer in a fixed order. For M layers, each with N components, there are N^M possible identifiers. FIG. 30B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme. In an example, a combinatorial space may be generated from three layers each comprising three distinct components. The components may be combined such that one component from each layer may be combined in a fixed order. The entire combinatorial space for this assembly method may comprise twenty-seven possible identifiers.
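The three-layer example may be enumerated directly. In the following sketch the component labels are placeholders, and string concatenation stands in for physical assembly of the components.

```python
from itertools import product


def product_scheme(layers):
    """All identifiers formed by taking one component per layer, in layer order."""
    return ["".join(combo) for combo in product(*layers)]


# Placeholder labels for three layers of three components each.
layers = [
    ["X1", "X2", "X3"],
    ["Y1", "Y2", "Y3"],
    ["Z1", "Z2", "Z3"],
]

space = product_scheme(layers)
print(len(space), space[0], space[-1])
```

The enumeration contains 3³ = 27 identifiers, in agreement with the N^M count given for M layers of N components.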

[0161] FIGS. 31-34 illustrate chemical methods for implementing the product scheme (see FIG. 30). Methods depicted in FIGS. 31-34, along with any other methods for assembling two or more distinct components in a fixed order may be used, for example, to produce any one or more identifiers in an identifier library. Identifiers may be constructed using any of the implementation methods described in FIGS. 31-34, at any time during the methods or systems disclosed herein. In some instances, all or a portion of the combinatorial space of possible identifiers may be constructed before digital information is encoded or written, and then the writing process may involve mechanically selecting and pooling the identifiers (that encode the information) from the already existing set. In other instances, the identifiers may be constructed after one or more steps of the data encoding or writing process may have occurred (i.e., as information is being written).

[0162] Enzymatic reactions may be used to assemble components from the different layers or sets. Assembly can occur in a one pot reaction because components (e.g., nucleic acid sequences) of each layer have specific hybridization or attachment regions for components of adjacent layers. For example, a nucleic acid sequence (e.g., component) X1 from layer X, a nucleic acid sequence Y1 from layer Y, and a nucleic acid sequence Z1 from layer Z may form the assembled nucleic acid molecule (e.g., identifier) X1Y1Z1. Additionally, multiple nucleic acid molecules (e.g., identifiers) may be assembled in one reaction by including multiple nucleic acid sequences from each layer. For example, including both Y1 and Y2 in the one pot reaction of the previous example may yield two assembled products (e.g., identifiers), X1Y1Z1 and X1Y2Z1. This reaction multiplexing may be used to speed up writing time for the plurality of identifiers that are physically constructed. Assembly of the nucleic acid sequences may be performed in a time period that is less than or equal to about 1 day, 12 hours, 10 hours, 9 hours, 8 hours, 7 hours, 6 hours, 5 hours, 4 hours, 3 hours, 2 hours, or 1 hour. The accuracy of the encoded data may be at least about or equal to about 90%, 95%, 96%, 97%, 98%, 99%, or greater.
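Reaction multiplexing may be modeled, in a highly simplified way, by assuming that every component supplied for a layer attaches to every component supplied for the adjacent layer. The function name and the second scenario below are illustrative assumptions and do not describe any particular reaction chemistry.

```python
from itertools import product


def one_pot_products(*layer_inputs):
    """Identifiers expected from one reaction given the inputs for each layer.

    Simplifying assumption: all cross-layer attachments occur, so the
    products are every combination of one input component per layer.
    """
    return {"".join(combo) for combo in product(*layer_inputs)}


# The example from the text: X1 with both Y1 and Y2, and Z1.
print(sorted(one_pot_products(["X1"], ["Y1", "Y2"], ["Z1"])))

# Under the same assumption, pooling inputs for two desired identifiers
# can also yield identifiers that were not requested.
wanted = {"X1Y1Z1", "X2Y2Z1"}
made = one_pot_products(["X1", "X2"], ["Y1", "Y2"], ["Z1"])
print(sorted(made - wanted))
```

Under this model, a set of identifiers can share a single compartment without extra products only when it equals the full combination of its per-layer inputs, which may inform how reactions are grouped for multiplexing.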

[0163] Identifiers may be constructed in accordance with the product scheme using overlap extension polymerase chain reaction (OEPCR), as illustrated in FIG. 31. Each component in each layer may comprise a double-stranded or single-stranded (as depicted in the figure) nucleic acid sequence with a common hybridization region on the sequence end that may be homologous and/or complementary to the common hybridization region on the sequence end of components from an adjacent layer. An individual identifier may be constructed by concatenating one component (e.g., unique sequence) from a layer X (or layer 1) comprising components X1 through XA, a second component (e.g., unique sequence) from a layer Y (or layer 2) comprising Y1 through YA, and a third component (e.g., unique sequence) from layer Z (or layer 3) comprising Z1 through ZB. The components from layer X may have a 3’ end that shares complementarity with the 3’ end on components from layer Y. Thus single-stranded components from layer X and Y may be annealed together at the 3’ end and may be extended using PCR to generate a double-stranded nucleic acid molecule. The generated double-stranded nucleic acid molecule may be melted to generate a 3’ end that shares complementarity with a 3’ end of a component from layer Z. A component from layer Z may be annealed with the generated nucleic acid molecule and may be extended to generate a unique identifier comprising a single component from layers X, Y, and Z in a fixed order. DNA size selection (e.g., with gel extraction) or polymerase chain reaction (PCR) with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction.

[0164] Identifiers may be assembled in accordance with the product scheme using sticky end ligation, as illustrated in FIG. 32. Three layers, each comprising double stranded components (e.g., double stranded DNA (dsDNA)) with single-stranded 3’ overhangs, can be used to assemble distinct identifiers. For example, an identifier may comprise one component from layer X (or layer 1) comprising components X1 to XA, a second component from layer Y (or layer 2) comprising Y1 to YB, and a third component from layer Z (or layer 3) comprising Z1 to ZC. To combine components from layer X with components from layer Y, the components in layer X can comprise a common 3’ overhang, labeled a in FIG. 32, and the components in layer Y can comprise a common, complementary 3’ overhang, a*. To combine components from layer Y with components from layer Z, the components in layer Y can comprise a common 3’ overhang, labeled b in FIG. 32, and the components in layer Z can comprise a common, complementary 3’ overhang, b*. The 3’ overhang in layer X components can be complementary to the 3’ end in layer Y components, and the other 3’ overhang in layer Y components can be complementary to the 3’ end in layer Z components, allowing the components to hybridize and ligate. As such, components from layer X cannot hybridize with other components from layer X or layer Z, and similarly components from layer Y cannot hybridize with other components from layer Y. Furthermore, a single component from layer Y can ligate to a single component of layer X and a single component of layer Z, ensuring the formation of a complete identifier. DNA size selection (for example with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction.

[0165] The sticky ends for sticky end ligation may be generated by treating the components of each layer with restriction endonucleases. In some embodiments, the components of multiple layers may be generated from one “parent” set of components. For example, in one embodiment, a single parent set of double-stranded components may have complementary restriction sites on each end (e.g., restriction sites for BamHI and BglII). Any two components may be selected for assembly and individually digested with one or the other of the complementary restriction enzymes (e.g., BglII or BamHI), resulting in complementary sticky ends that can be ligated together to leave an inert scar. The product nucleic acid sequence may comprise the complementary restriction sites on each end (e.g., BamHI on the 5’ end and BglII on the 3’ end), and can be further ligated to another component from the parent set following the same process. This process may cycle indefinitely. If the parent set comprises N components, then each cycle may be equivalent to adding an extra layer of N components to the product scheme.
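
The cycle described above can be sketched at the level of top-strand sequence strings. The recognition sequences are those of BamHI (G^GATCC) and BglII (A^GATCT), which leave compatible GATC overhangs. The helper names, the payload sequences, and the single-strand simplification are assumptions made for illustration only.

```python
BAMHI_SITE = "GGATCC"  # BamHI cuts G^GATCC, leaving a GATC overhang
BGLII_SITE = "AGATCT"  # BglII cuts A^GATCT, leaving a compatible GATC overhang

def make_component(payload):
    """Hypothetical parent-set component: BamHI site, payload, BglII site (top strand)."""
    return BAMHI_SITE + payload + BGLII_SITE

def ligate(upstream, downstream):
    """Cut upstream at its 3' BglII site and downstream at its 5' BamHI site, then join."""
    if not upstream.endswith(BGLII_SITE) or not downstream.startswith(BAMHI_SITE):
        raise ValueError("components must carry BamHI (5') and BglII (3') sites")
    left = upstream[: -len(BGLII_SITE)] + "A"  # BglII leaves ...A on the top strand
    right = downstream[1:]                     # BamHI leaves GATCC... on the top strand
    return left + right                        # junction reads AGATCC (the scar)

assembled = ligate(make_component("AAAA"), make_component("CCCC"))
print(assembled)
```

The junction AGATCC is recognized by neither enzyme, while the assembled product still carries one BamHI site at its 5’ end and one BglII site at its 3’ end, so it can enter the next cycle as an ordinary component.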

[0166] A method for using ligation to construct a sequence of nucleic acids comprising elements from set X (e.g., set 1 of dsDNA) and elements from set Y (e.g., set 2 of dsDNA) can comprise the steps of obtaining or constructing two or more pools (e.g., set 1 of dsDNA and set 2 of dsDNA) of double stranded sequences, wherein a first set (e.g., set 1 of dsDNA) comprises a sticky end (e.g., a) and a second set (e.g., set 2 of dsDNA) comprises a sticky end (e.g., a*) that is complementary to the sticky end of the first set. Any DNA from the first set (e.g., set 1 of dsDNA) and any subset of DNA from the second set (e.g., set 2 of dsDNA) can be combined and assembled and then ligated together to form a single double stranded DNA with an element from the first set and an element from the second set.

[0167] Identifiers may be assembled in accordance with the product scheme using site specific recombination, as illustrated in FIG. 33. Identifiers may be constructed by assembling components from three different layers. The components in layer X (or layer 1) may comprise double-stranded molecules with an attBx recombinase site on one side of the molecule, components from layer Y (or layer 2) may comprise double-stranded molecules with an attPx recombinase site on one side and an attBy recombinase site on the other side, and components in layer Z (or layer 3) may comprise an attPy recombinase site on one side of the molecule. attB and attP sites within a pair, as indicated by their subscripts, are capable of recombining in the presence of their corresponding recombinase enzyme. One component from each layer may be combined such that one component from layer X associates with one component from layer Y, and one component from layer Y associates with one component from layer Z. Application of one or more recombinase enzymes may recombine the components to generate a double-stranded identifier comprising the ordered components. DNA size selection (for example with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction. In general, multiple orthogonal attB and attP pairs may be used, and each pair may be used to assemble a component from an extra layer. For the large serine family of recombinases, up to six orthogonal attB and attP pairs may be generated per recombinase, and multiple orthogonal recombinases may be implemented as well. For example, thirteen layers may be assembled by using twelve orthogonal attB and attP pairs, six orthogonal pairs from each of two large serine recombinases, such as Bxb1 and PhiC31. Orthogonality of attB and attP pairs ensures that an attB site from one pair does not react with an attP site from another pair. This enables components from different layers to be assembled in a fixed order. Recombinase-mediated recombination reactions may be reversible or irreversible depending on the recombinase system implemented. For example, the large serine recombinase family catalyzes irreversible recombination reactions without requiring any high energy cofactors, whereas the tyrosine recombinase family catalyzes reversible reactions.

[0168] Identifiers may be constructed in accordance with the product scheme using template directed ligation (TDL), as shown in FIG. 34A. Template directed ligation utilizes single stranded nucleic acid sequences, referred to as “templates” or “staples”, to facilitate the ordered ligation of components to form identifiers. The templates simultaneously hybridize to components from adjacent layers and hold them adjacent to each other (3’ end against 5’ end) while a ligase ligates them. In the example from FIG. 34A, three layers or sets of single-stranded components are combined: a first layer of components (e.g., layer X or layer 1) that share common sequences a on their 3’ end, which are complementary to sequences a*; a second layer of components (e.g., layer Y or layer 2) that share common sequences b and c on their 5’ and 3’ ends respectively, which are complementary to sequences b* and c*; a third layer of components (e.g., layer Z or layer 3) that share common sequence d on their 5’ end, which may be complementary to sequences d*; and a set of two templates or “staples”, with the first staple comprising the sequence a*b* (5’ to 3’) and the second staple comprising the sequence c*d* (5’ to 3’). In this example, one or more components from each layer may be selected and mixed into a reaction with the staples, which, by complementary annealing, may facilitate the ligation of one component from each layer in a defined order to form an identifier. DNA size selection (for example with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction.

[0169] FIG. 34B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each assembled with 6-layer TDL. The edge layers (first and final layers) each had one component, and each of the internal layers (the remaining four layers) had four components. Each edge layer component was 28 bases including a 10 base hybridization region. Each internal layer component was 30 bases including a 10 base common hybridization region on the 5’ end, a 10 base variable (barcode) region, and a 10 base common hybridization region on the 3’ end. Each of the three template strands was 20 bases in length. All 256 distinct sequences were assembled in a multiplex fashion with one reaction containing all of the components and templates, T4 Polynucleotide Kinase (for phosphorylating the components), T4 Ligase, ATP, and other appropriate reaction reagents. The reaction was incubated at 37 degrees for 30 minutes and then at room temperature for 1 hour. Sequencing adapters were added to the reaction product with PCR, and the product was sequenced with an Illumina MiSeq instrument. The relative copy number of each distinct assembled sequence out of 192910 total assembled sequence reads is shown. Other embodiments of this method may use double stranded components, where the components are initially melted to form single stranded versions that can anneal to the staples. Other embodiments or derivatives of this method (i.e., TDL) may be used to construct a combinatorial space of identifiers more complex than what may be accomplished in the product scheme.

[0170] FIGS. 35A and 35B schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences). FIG. 35A illustrates the architecture of identifiers constructed using the permutation scheme. An identifier may be constructed by combining a single component from each layer in a programmable order. FIG. 35B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme. In an example, a combinatorial space of size six may be generated from three layers each comprising one distinct component. The components may be concatenated in any order. In general, with M layers, each with N components, the permutation scheme enables a combinatorial space of N^M · M! total identifiers.
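
The size of the permutation scheme can be checked by brute-force enumeration. The sketch below assumes the count is N^M · M! for M layers of N components each, which agrees with the six-identifier example above (three layers, one component each). The function name and labels are hypothetical.

```python
from itertools import permutations, product
from math import factorial

def permutation_scheme(layers):
    """One component per layer, with the layers concatenated in every possible order."""
    identifiers = []
    for order in permutations(layers):
        for combo in product(*order):
            identifiers.append("".join(combo))
    return identifiers

three_layers = permutation_scheme([["A"], ["B"], ["C"]])
two_by_two = permutation_scheme([["a1", "a2"], ["b1", "b2"]])
print(len(three_layers), len(two_by_two))
```

With N = 2 and M = 2 the enumeration yields 2^2 · 2! = 8 distinct identifiers, consistent with the assumed formula.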

[0171] FIG. 35C illustrates an example implementation of the permutation scheme with template directed ligation (TDL). Components from multiple layers are assembled in between fixed left end and right end components, referred to as edge scaffolds. These edge scaffolds are the same for all identifiers in the combinatorial space and thus may be added as part of the reaction master mix for the implementation. Templates or staples exist for any possible junction between any two layers or scaffolds such that the order in which components from different layers are incorporated into an identifier in the reaction depends on the templates selected for the reaction. In order to enable any possible permutation of layers for M layers, there may be M² + 2M distinct selectable staples, one for every possible junction (including junctions with the scaffolds). M of those templates (shaded in grey) form junctions between layers and themselves and may be excluded for the purposes of permutation assembly as described herein. However, their inclusion can enable a larger combinatorial space with identifiers comprising repeat components as illustrated in FIGS. 35D-G. DNA size selection (for example with gel extraction) or PCR with primers targeting the edge scaffolds may be implemented to isolate identifier products from other byproducts that may form in the reaction.

[0172] FIGS. 35D-G illustrate example methods of how the permutation scheme may be expanded to include certain instances of identifiers with repeated components. FIG. 35D shows an example of how the implementation from FIG. 35C may be used to construct identifiers with permuted and repeated components. For example, an identifier may comprise three total components assembled from two distinct components. In this example, a component from a layer may be present multiple times in an identifier. Adjacent concatenations of the same component may be achieved by using a staple with adjacent complementary hybridization regions for both the 3’ end and 5’ end of the same component, such as the a*b* (5’ to 3’) staple in the figure. In general, for M layers, there are M such staples. Incorporation of repeated components with this implementation may generate nucleic acid sequences of more than one length (i.e., comprising one, two, three, four, or more components) that are assembled between the edge scaffolds, as demonstrated in FIG. 35E. FIG. 35E shows how the example implementation from FIG. 35D may lead to non-targeted nucleic acid sequences, besides the identifier, that are assembled between the edge scaffolds. The appropriate identifier cannot be isolated from non-targeted nucleic acid sequences with PCR because they share the same primer binding sites on the edge. However, in this example, DNA size selection (e.g., with gel extraction) may be implemented to isolate the targeted identifier (e.g., the second sequence from the top) from the non-targeted sequences, since each assembled nucleic acid sequence can be designed to have a unique length (e.g., if all components have the same length). FIG. 35F shows another example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences but distinct lengths in the same reaction. In this method, templates that assemble components in one layer with components in other layers in an alternating pattern may be used. As with the method shown in FIG. 35E, size selection may be used to select identifiers of the designed length. FIG. 35G shows an example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences and, for some nucleic acid sequences (e.g., the third and fourth from the top and the sixth and seventh from the top), equal lengths. In this example, those nucleic acid sequences that share equal lengths may be excluded from both being individual identifiers, as it may not be possible to construct one without also constructing the other, even if PCR and DNA size selection are implemented.

[0173] FIGS. 36A-36D schematically illustrate an example method, referred to as the “MchooseK scheme”, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components. FIG. 36A illustrates the architecture of identifiers constructed using the MchooseK scheme. Using this method, identifiers are constructed by assembling one component from each layer in any subset of all layers (e.g., choosing components from K layers out of M possible layers). FIG. 36B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme. In this assembly scheme the combinatorial space may comprise N^K · MchooseK possible identifiers for M layers, N components per layer, and an identifier length of K components. In an example, if there are five layers each comprising one component, then up to ten distinct identifiers may be assembled comprising two components each.
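
The MchooseK count can be reproduced by enumeration: choose K of the M layers (keeping their rank order), then choose one component from each chosen layer. This is a minimal sketch under the assumption that the count is N^K multiplied by the binomial coefficient MchooseK; the function name and labels are hypothetical.

```python
from itertools import combinations, product
from math import comb

def mchoosek_scheme(layers, k):
    """Identifiers built from one component of each of k layers, in layer order."""
    identifiers = []
    for chosen in combinations(layers, k):  # combinations preserve layer order
        for combo in product(*chosen):
            identifiers.append("-".join(combo))
    return identifiers

five_layers = [[f"L{i}"] for i in range(1, 6)]  # five layers, one component each
pairs = mchoosek_scheme(five_layers, 2)
print(len(pairs))
```

For five single-component layers and K = 2, the enumeration gives ten identifiers, matching the example in the text.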

[0174] The MchooseK scheme may be implemented using template directed ligation, as shown in FIG. 36C. As with the TDL implementation for the permutation scheme (FIG. 35C), components in this example are assembled between edge scaffolds that may or may not be included in the reaction master mix. Components may be divided into M layers, for example M = 4 layers with predefined rank from 2 to M, where the left edge scaffold may be rank 1 and the right edge scaffold may be rank M+1. Templates comprise nucleic acid sequences for the 3’ to 5’ ligation of any two components with lower rank to higher rank, respectively. There are ((M+1)² + M + 1)/2 such templates. An individual identifier of any K components from distinct layers may be constructed by combining those selected components in a ligation reaction with the corresponding K+1 staples used to bring the K components together with the edge scaffolds in their rank order. Such a reaction setup may yield the nucleic acid sequence corresponding to the target identifier between the edge scaffolds. Alternatively, a reaction mix comprising all templates may be combined with the selected components to assemble the target identifier. This alternative method may generate various nucleic acid sequences with the same edge sequences but distinct lengths (if all component lengths are equal), as illustrated in FIG. 36D. The target identifier (bottom) may be isolated from byproduct nucleic acid sequences by size.
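
The template count above can be read as the number of lower-to-higher rank pairs among the ranked elements. The sketch below assumes M layers plus two edge scaffolds, i.e. M+2 ranked elements, which is an assumption chosen because it reproduces the stated count. The zero-based indexing (left scaffold 0, layers 1 to M, right scaffold M+1) and the function names are hypothetical and are not the rank numbering used in the text.

```python
from itertools import combinations

def all_templates(m):
    """All lower-to-higher junctions among left scaffold (0), layers 1..m, right scaffold (m+1)."""
    return list(combinations(range(m + 2), 2))

def staples_for(chosen_layers, m):
    """The K+1 templates that chain K chosen layers between the edge scaffolds in rank order."""
    path = [0] + sorted(chosen_layers) + [m + 1]
    return list(zip(path, path[1:]))

M = 4
templates = all_templates(M)
staples = staples_for({2, 4}, M)
print(len(templates), staples)
```

Selecting only the K+1 staples along the chosen path is what restricts the reaction to the target identifier; supplying every template instead would also permit the shorter byproducts described for FIG. 36D.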

[0175] FIGS. 37A and 37B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components. FIG. 37A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme. An individual identifier may be constructed by assembling one component from each layer in a fixed order with the optional placement of any partition (specially classified component) between any two components of different layers. For example, a set of components may be organized into one partition component and four layers containing one component each. A component from each layer may be combined in a fixed order and a single partition component may be assembled in various locations between layers. An identifier in this combinatorial space may comprise no partition components, a partition component between the components from the first and second layer, a partition between the components from the second and third layer, and so on to make a combinatorial space of eight possible identifiers. In general, with M layers, each with N components, and p partition components, there are N^M · (p+1)^(M-1) possible identifiers that may be constructed. This method may generate identifiers of various lengths.
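
The partition scheme can be enumerated by treating each of the M-1 gaps between adjacent layers as holding either nothing or one of the p partition components. The sketch assumes the resulting count is N^M · (p+1)^(M-1), which agrees with the eight-identifier example above (four layers of one component, one partition). The function name and labels are hypothetical.

```python
from itertools import product

def partition_scheme(layers, partitions):
    """One component per layer in fixed order, with an optional partition in each gap."""
    gap_options = [None] + list(partitions)
    identifiers = []
    for combo in product(*layers):
        for gaps in product(gap_options, repeat=len(layers) - 1):
            parts = [combo[0]]
            for gap, component in zip(gaps, combo[1:]):
                if gap is not None:
                    parts.append(gap)
                parts.append(component)
            identifiers.append("".join(parts))
    return identifiers

example = partition_scheme([["A"], ["B"], ["C"], ["D"]], ["|"])
print(len(example), example[:3])
```

Because each inserted partition lengthens the product, the enumerated identifiers have several distinct lengths, as noted in the text.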

[0176] FIG. 37B shows an example implementation of the partition scheme using template directed ligation. Templates comprise nucleic acid sequences for ligating together one component from each of M layers in a fixed order. For each partition component, additional pairs of templates exist that enable the partition component to ligate in between the components from any two adjacent layers. For example, in one such pair, one template (with sequence g*b* (5’ to 3’), for example) enables the 3’ end of layer 1 (with sequence b) to ligate to the 5’ end of the partition component (with sequence g), and the second template in the pair (with sequence c*h* (5’ to 3’), for example) enables the 3’ end of the partition component (with sequence h) to ligate to the 5’ end of layer 2 (with sequence c). To insert a partition between any two components of adjacent layers, the standard template for ligating together those layers may be excluded from the reaction and the pair of templates for ligating the partition in that position may be selected for the reaction. In the current example, targeting the partition component between layer 1 and layer 2 may use the pair of templates c*h* (5’ to 3’) and g*b* (5’ to 3’) rather than the template c*b* (5’ to 3’). Components may be assembled between edge scaffolds that may be included in the reaction mix (along with their corresponding templates for ligating to the first and Mth layers, respectively). In general, a total of around M-1 + 2*p*(M-1) selectable templates may be used for this method for M layers and p partition components. This implementation of the partition scheme may generate various nucleic acid sequences in a reaction with the same edge sequences but distinct lengths. The target identifier may be isolated from byproduct nucleic acid sequences by DNA size selection. Specifically, there may be exactly one nucleic acid sequence product with exactly M layer components. If the layer components are designed large enough compared to the partition components, it may be possible to define a universal size selection region whereby the identifier (and none of the non-targeted byproducts) may be selected regardless of the particular partitioning of the components within the identifier, thereby allowing for multiple partitioned identifiers from multiple reactions to be isolated in the same size selection step.

[0177] FIGS. 38A and 38B schematically illustrate an example method, referred to as the “unconstrained string scheme” or “USS”, for constructing identifiers made up of any string of components from a number of possible components. FIG. 38A shows an example of the combinatorial space of 3-component (or 4-scaffold) length identifiers that may be constructed using the unconstrained string scheme. The unconstrained string scheme constructs an individual identifier of length K components with one or more distinct components each taken from one or more layers, where each distinct component can appear at any of the K component positions in the identifier (allowing for repeats). For example, for two layers, each comprising one component, there are eight possible 3-component length identifiers. In general, with M layers, each with one component, there are M^K possible identifiers of length K components. FIG. 38B shows an example implementation of the unconstrained string scheme using template directed ligation. In this method, K+1 single-stranded and ordered scaffold DNA components (including two edge scaffolds and K-1 internal scaffolds) are present in the reaction mix. An individual identifier comprises a single component ligated between every pair of adjacent scaffolds. For example, a component ligated between scaffolds A and B, a component ligated between scaffolds C and D, and so on until all K adjacent scaffold junctions are occupied by a component. In a reaction, selected components from different layers are introduced to scaffolds along with selected pairs of staples that direct them to assemble onto the appropriate scaffolds. For example, the pair of staples a*L* (5’ to 3’) and A*b* (5’ to 3’) direct the layer 1 component with a 5’ end region ‘a’ and 3’ end region ‘b’ to ligate in between the L and A scaffolds. In general, with M layers and K+1 scaffolds, 2*M*K selectable staples may be used to construct any USS identifier of length K. Because the staples that connect a component to a scaffold on the 5’ end are disjoint from the staples that connect the same component to a scaffold on the 3’ end, nucleic acid byproducts may form in the reaction with the same edge scaffolds as the target identifier, but with fewer than K components (fewer than K+1 scaffolds) or with more than K components (more than K+1 scaffolds). The targeted identifier may form with exactly K components (K+1 scaffolds) and may therefore be selectable through techniques like DNA size selection if all components are designed to be equal in length and all scaffolds are designed to be equal in length. In certain embodiments of the unconstrained string scheme where there may be one component per layer, that component may solely comprise a single distinct nucleic acid sequence that fulfills all three roles of (1) an identification barcode, (2) a hybridization region for staple-mediated ligation of the 5’ end to a scaffold, and (3) a hybridization region for staple-mediated ligation of the 3’ end to a scaffold.
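
The two counts for the unconstrained string scheme, M^K identifiers and 2*M*K selectable staples, can be illustrated as follows. The sketch assumes one component per layer and models a staple as a (component, position, side) triple; that representation and the function names are hypothetical.

```python
from itertools import product

def uss_identifiers(components, k):
    """Every string of k components, with repeats allowed."""
    return ["".join(s) for s in product(components, repeat=k)]

def uss_staples(components, k):
    """One 5'-side and one 3'-side staple per (component, position) pair."""
    return [(c, pos, side)
            for c in components
            for pos in range(k)
            for side in ("5'", "3'")]

strings = uss_identifiers(["a", "b"], 3)
staples = uss_staples(["a", "b"], 3)
print(len(strings), len(staples))
```

For two single-component layers and K = 3 this gives eight identifiers, as in the example above, drawn from a pool of twelve selectable staples.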

[0178] The internal scaffolds illustrated in FIG. 38B may be designed such that they use the same hybridization sequence for both the staple-mediated 5’ ligation of the scaffold to a component and the staple-mediated 3’ ligation of the scaffold to another (not necessarily distinct) component. Thus the depicted one-scaffold, two-staple stacked hybridization events in FIG. 38B represent the statistical back-and-forth hybridization events that occur between the scaffold and each of the staples, thus enabling both 5’ component ligation and 3’ component ligation. In other embodiments of the unconstrained string scheme, the scaffold may be designed with two concatenated hybridization regions: a distinct 3’ hybridization region for staple-mediated 3’ ligation and a distinct 5’ hybridization region for staple-mediated 5’ ligation.

[0179] FIGS. 39A and 39B schematically illustrate an example method, referred to as the “component deletion scheme”, for constructing identifiers by deleting nucleic acid sequences (or components) from a parent identifier. FIG. 39A shows an example of the combinatorial spaces of possible identifiers that may be constructed using the component deletion scheme. In this example, a parent identifier may comprise multiple components. A parent identifier may comprise more than or equal to about 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50 or more components. An individual identifier may be constructed by selectively deleting any number of components from N possible components, leading to a “full” combinatorial space of size 2^N, or by deleting a fixed number of K components from N possible components, thus leading to an “NchooseK” combinatorial space of size NchooseK. In an example with a parent identifier with 3 components, the full combinatorial space may be 8 and the 3choose2 combinatorial space may be 3.
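
The two combinatorial spaces of the component deletion scheme can be enumerated directly. In this minimal sketch a parent identifier is a tuple of component labels, and an identifier is whatever remains after deleting a chosen set of positions; the function name and labels are hypothetical.

```python
from itertools import combinations

def deletion_space(parent, k=None):
    """Identifiers obtained by deleting any number (k=None) or exactly k components."""
    n = len(parent)
    delete_counts = range(n + 1) if k is None else [k]
    identifiers = []
    for count in delete_counts:
        for deleted in combinations(range(n), count):
            kept = tuple(c for i, c in enumerate(parent) if i not in deleted)
            identifiers.append(kept)
    return identifiers

parent = ("A", "B", "C")
full_space = deletion_space(parent)
choose_two = deletion_space(parent, k=2)
print(len(full_space), len(choose_two))
```

For a three-component parent the full space has 2^3 = 8 members and the 3choose2 space has 3, matching the example in the text.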

[0180] FIG. 39B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair (DSTCR). The parent sequence may be a single stranded DNA substrate comprising components flanked by nuclease-specific target sites (which can be 4 or fewer bases in length), and the parent may be incubated with one or more double-strand-specific nucleases corresponding to the target sites. An individual component may be targeted for deletion with a complementary single stranded DNA (or cleavage template) that binds the component DNA (and flanking nuclease sites) on the parent, thus forming a stable double stranded sequence on the parent that may be cleaved on both ends by the nucleases. Another single stranded DNA (or repair template) hybridizes to the resulting disjoint ends of the parent (between which the component sequence had been) and brings them together for ligation, either directly or bridged by a replacement sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites. We refer to this method as “double stranded targeted cleavage and repair” or “DSTCR”. Size selection may be used to select for identifiers with a certain number of deleted components.

[0181] Alternatively, or in addition to, the parent identifier may be a double or single stranded nucleic acid substrate comprising components separated by spacer sequences such that no two components are flanked by the same sequence. The parent identifier may be incubated with Cas9 nuclease. An individual component may be targeted for deletion with guide ribonucleic acids (the cleavage templates) that bind to the edges of the component and enable Cas9-mediated cleavage at its flanking sites. A single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier (e.g., between the ends where the component sequence had been), thus bringing them together for ligation. Ligation may be done directly or by bridging the ends with a replacement sequence, such that the ligated sequences on the parent no longer contain spacer sequences that can be targeted by Cas9. We refer to this method as “sequence specific targeted cleavage and repair” or “SSTCR”.

[0182] Identifiers may be constructed by inserting components into a parent identifier using a derivative of DSTCR. A parent identifier may be a single stranded nucleic acid substrate comprising nuclease-specific target sites (which can be 4 or fewer bases in length), each embedded within a distinct nucleic acid sequence. The parent identifier may be incubated with one or more double-strand-specific nucleases corresponding to the target sites. An individual target site on the parent identifier may be targeted for component insertion with a complementary single stranded nucleic acid (the cleavage template) that binds the target site and the distinct surrounding nucleic acid sequence on the parent identifier, thus forming a double stranded site. The double-stranded site may be cleaved by a nuclease. Another single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites. Alternatively, a derivative of SSTCR may be used to insert components into a parent identifier. The parent identifier may be a double or single-stranded nucleic acid and the parent may be incubated with a Cas9 nuclease. A distinct site on the parent identifier may be targeted for cleavage with a guide RNA (the cleavage template). A single stranded nucleic acid (the repair template) may hybridize to the disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent identifier no longer contain active nuclease-targeted sites. Size selection may be used to select for identifiers with a certain number of component insertions.

[0183] FIG. 40 schematically illustrates a parent identifier with recombinase recognition sites. Recognition sites of different patterns can be recognized by different recombinases. All recognition sites for a given set of recombinases are arranged such that the nucleic acids in between them may be excised if the recombinase is applied. The nucleic acid strand shown in FIG. 40 can adopt 2⁵=32 different sequences depending on the subset of recombinases that are applied to it. In some embodiments, as depicted in FIG. 40, unique molecules can be generated using recombinases to excise, shift, invert, and transpose segments of DNA to create different nucleic acid molecules. In general, with N recombinases there can be 2^N possible identifiers built from a parent. In some embodiments, multiple orthogonal pairs of recognition sites from different recombinases may be arranged on a parent identifier in an overlapping fashion such that the application of one recombinase affects the type of recombination event that occurs when a downstream recombinase is applied (see Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference). Such a system may be capable of constructing a different identifier for every ordering of N recombinases, that is, N! identifiers. Recombinases may be of the tyrosine family such as Flp and Cre, or of the large serine recombinase family such as PhiC31, Bxb1, TP901, or A118. The use of recombinases from the large serine recombinase family may be advantageous because they facilitate irreversible recombination and therefore may produce identifiers more efficiently than other recombinases.

[0184] In some instances, a single nucleic acid sequence can be programmed to become many distinct nucleic acid sequences by applying numerous recombinases in a distinct order. Approximately e·M! distinct nucleic acid sequences may be generated by applying M recombinases in different subsets and orders thereof, when the number of recombinases, M, is less than or equal to 7 for the large serine recombinase family. When the number of recombinases, M, is greater than 7, the number of sequences that can be produced approximates 3.9^M, see e.g., Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference. Additional methods for producing different DNA sequences from one common sequence can include targeted nucleic acid editing enzymes such as CRISPR-Cas, TALENS, and Zinc Finger Nucleases. Sequences produced by recombinases, targeted editing enzymes or the like can be used in conjunction with any of the previous methods, for example methods disclosed in any of the figures and disclosure in the present application.
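
As a non-limiting illustration of the counting argument, the following Python sketch sums the number of ordered selections of k recombinases out of M for k = 0 to M and compares the total with e·M!. It assumes, for illustration only, that every ordered subset yields a distinct sequence; the function name `ordered_subset_count` is hypothetical and does not appear elsewhere in this disclosure.

```python
from math import e, factorial


def ordered_subset_count(m: int) -> int:
    """Count ordered selections of k distinct recombinases, summed over k = 0..m."""
    # m! / (m - k)! is the number of ways to apply k of m recombinases in order.
    return sum(factorial(m) // factorial(m - k) for k in range(m + 1))


if __name__ == "__main__":
    for m in range(1, 8):
        # Print the exact count next to the e * M! approximation.
        print(m, ordered_subset_count(m), round(e * factorial(m), 2))
```

For M = 7 the exact count is 13,700, while e·7! is approximately 13,700.16, so the approximation is already tight for small M.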

[0185] If the bit-stream of information to be encoded is larger than that which can be encoded by any single nucleic acid molecule, then the information can be split and indexed with nucleic acid sequence barcodes. Moreover, any subset of size k nucleic acid molecules from the set of N nucleic acid molecules can be chosen to produce log2(N choose k) bits of information. Barcodes may be assembled onto the nucleic acid molecules within the subsets of size k to encode even longer bit streams. For example, M barcodes may be used to produce M*log2(N choose k) bits of information. Given a number, N, of available nucleic acid molecules in a set and a number, M, of available barcodes, subsets of size k = k_o may be chosen to minimize the total number of molecules in a pool to encode a piece of information. A method for encoding digital information can comprise steps for breaking up the bit stream and encoding the individual elements. For example, a bit stream comprising 6 bits can be split into 3 components, each component comprising two bits. Each two bit component can be barcoded to form an information cassette, and grouped or pooled together to form a hyperpool of information cassettes.
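
As a non-limiting illustration, the capacity expressions above can be evaluated as in the following Python sketch. The function names are hypothetical, and the search in `smallest_k` assumes, for illustration only, that the total number of molecules in a pool grows with k, so that the smallest sufficient k is the preferred choice.

```python
from math import comb, log2


def bits_per_subset(n: int, k: int) -> float:
    """Bits carried by choosing k molecules out of n: log2(n choose k)."""
    return log2(comb(n, k))


def bits_with_barcodes(m: int, n: int, k: int) -> float:
    """Bits carried when m barcodes each index one size-k subset."""
    return m * bits_per_subset(n, k)


def smallest_k(n: int, m: int, required_bits: float):
    """Return the smallest subset size k whose capacity meets required_bits."""
    # (n choose k) grows with k up to n // 2, so the search stops there.
    for k in range(n // 2 + 1):
        if bits_with_barcodes(m, n, k) >= required_bits:
            return k
    return None  # the requested capacity is not reachable with n and m
```

For example, with N = 16 molecules and M = 4 barcodes, `smallest_k(16, 4, 40)` returns 4, because (16 choose 3) = 560 gives about 36.5 bits across the four barcodes whereas (16 choose 4) = 1820 gives about 43.3 bits.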

[0186] Barcodes can facilitate information indexing when the amount of digital information to be encoded exceeds the amount that can fit in one pool alone. Information comprising longer strings of bits and/or multiple bytes can be encoded by layering the approach disclosed in FIG. 27, for example, by including a tag with unique nucleic acid sequences encoded using the nucleic acid index. Information cassettes or identifier libraries can comprise nitrogenous bases or nucleic acid sequences that include unique nucleic acid sequences that provide location and bit-value information in addition to a barcode or tag which indicates the component or components of the bit stream that a given sequence corresponds to. Information cassettes can comprise one or more unique nucleic acid sequences as well as a barcode or tag. The barcode or tag on the information cassette can provide a reference for the information cassette and any sequences included in the information cassette. For example, the tag or barcode on an information cassette can indicate which portion of the bit stream or bit component of the bit stream the unique sequence encodes information for (e.g., the bit value and bit position information for).

[0187] Using barcodes, more information in bits can be encoded in a pool than the size of the combinatorial space of possible identifiers. A sequence of 10 bits, for example, can be separated into two sets of bytes, each byte comprising 5 bits. Each byte can be mapped to a set of 5 possible distinct identifiers. Initially, the identifiers generated for each byte can be the same, but they may be kept in separate pools or else someone reading the information may not be able to tell which byte a particular nucleic acid sequence belongs to. However each identifier can be barcoded or tagged with a label that corresponds to the byte for which the encoded information applies (e.g., barcode one may be attached to sequences in the nucleic acid pool to provide the first five bits and barcode two may be attached to sequences in the nucleic acid pool to provide the second five bits), and then the identifiers corresponding to the two bytes can be combined into one pool (e.g., “hyper-pool” or one or more identifier libraries). Each identifier library of the one or more combined identifier libraries may comprise a distinct barcode that identifies a given identifier as belonging to a given identifier library. Methods for adding a barcode to each identifier in an identifier library can comprise using PCR, Gibson, ligation, or any other approach that enables a given barcode (e.g., barcode 1) to attach to a given nucleic acid sample pool (e.g., barcode 1 to nucleic acid sample pool 1 and barcode 2 to nucleic acid sample pool 2). The sample from the hyper-pool can be read with sequencing methods, and sequencing information can be parsed using the barcode or tag. A method using identifier libraries and barcodes with a set of M barcodes and N possible identifiers (the combinatorial space) can encode a stream of bits with a length equivalent to the product of M and N.
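
As a non-limiting illustration of the hyper-pool approach, the following Python sketch splits a 10-bit string into two 5-bit groups, represents each ‘1’ bit as a (barcode, identifier) pair, pools all pairs together, and then recovers the bit string by parsing on the barcode. The tuple representation and the function names are assumptions made for illustration; physical barcodes and identifiers are nucleic acid sequences.

```python
def encode_hyperpool(bits: str, group: int) -> set:
    """Return the (barcode, identifier) pairs present in the combined pool."""
    pool = set()
    for position, bit in enumerate(bits):
        if bit == "1":
            # The barcode names the group; the identifier names the bit within it.
            pool.add((position // group, position % group))
    return pool


def decode_hyperpool(pool: set, n_barcodes: int, group: int) -> str:
    """Rebuild the bit string: a pair that is present reads as '1', else '0'."""
    return "".join(
        "1" if (barcode, ident) in pool else "0"
        for barcode in range(n_barcodes)
        for ident in range(group)
    )


if __name__ == "__main__":
    message = "1011000110"
    pool = encode_hyperpool(message, group=5)
    print(sorted(pool))
    print(decode_hyperpool(pool, n_barcodes=2, group=5))
```

In this sketch the same five identifiers are reused under both barcodes, so M = 2 barcodes and N = 5 identifiers address M × N = 10 bit positions.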

[0188] In some embodiments, identifier libraries may be stored in an array of wells. The array of wells may be defined as having n columns and q rows and each well may comprise two or more identifier libraries in a hyper-pool. The information encoded in each well may constitute one large contiguous item of information of a size n × q times larger than the information contained in each of the wells. An aliquot may be taken from one or more of the wells in the array of wells and the encoding may be read using sequencing, hybridization, or PCR.

[0189] A nucleic acid sample pool, hyper-pool, identifier library, group of identifier libraries, or a well, containing a nucleic acid sample pool or hyper-pool may comprise unique nucleic acid molecules (e.g., identifiers) corresponding to bits of information and a plurality of supplemental nucleic acid sequences. The supplemental nucleic acid sequences may not correspond to encoded data (e.g., do not correspond to a bit value). The supplemental nucleic acid samples may mask or encrypt the information stored in the sample pool. The supplemental nucleic acid sequences may be derived from a biological source or synthetically produced. Supplemental nucleic acid sequences derived from a biological source may include randomly fragmented nucleic acid sequences or rationally fragmented sequences. The biologically derived supplemental nucleic acids may hide or obscure the data-containing nucleic acids within the sample pool by providing natural genetic information along with the synthetically encoded information, especially if the synthetically encoded information (e.g., the combinatorial space of identifiers) is made to resemble natural genetic information (e.g., a fragmented genome). In an example, the identifiers are derived from a biological source and the supplemental nucleic acids are derived from a biological source. A sample pool may contain multiple sets of identifiers and supplemental nucleic acid sequences. Each set of identifiers and supplemental nucleic acid sequences may be derived from different organisms. In an example, the identifiers are derived from one or more organisms and the supplemental nucleic acid sequences are derived from a single, different organism. The supplemental nucleic acid sequences may also be derived from one or more organisms and the identifiers may be derived from a single organism that is different from the organism that the supplemental nucleic acids are derived from. 
Both the identifiers and the supplemental nucleic acid sequences may be derived from multiple different organisms. A key may be used to distinguish the identifiers from the supplemental nucleic acid sequences.

[0190] The supplemental nucleic acid sequences may store metadata about the written information. The metadata may comprise extra information for determining and/or authorizing the source of the original information and/or the intended recipient of the original information. The metadata may comprise extra information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into the identifiers. The metadata may comprise additional information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into nucleic acid sequences. The metadata may comprise additional information about modifications made to the original information after writing the information into nucleic acid sequences. The metadata may comprise annotations to the original information or one or more references to external information. Alternatively, or in addition to, the metadata may be stored in one or more barcodes or tags attached to the identifiers.

[0191] The identifiers in an identifier pool may have the same, similar, or different lengths than one another. The supplemental nucleic acid sequences may have a length that is less than, substantially equal to, or greater than the length of the identifiers. The supplemental nucleic acid sequences may have an average length that is within one base, within two bases, within three bases, within four bases, within five bases, within six bases, within seven bases, within eight bases, within nine bases, within ten bases, or within more bases of the average length of the identifiers. In an example, the supplemental nucleic acid sequences are the same or substantially the same length as the identifiers. The concentration of supplemental nucleic acid sequences may be less than, substantially equal to, or greater than the concentration of the identifiers in the identifier library. The concentration of the supplemental nucleic acids may be less than or equal to about 1 %, 10 %, 20 %, 40 %, 60 %, 80 %, 100 %, 125 %, 150 %, 175 %, 200 %, 1000 %, 1×10⁴ %, 1×10⁵ %, 1×10⁶ %, 1×10⁷ %, 1×10⁸ % or less than the concentration of the identifiers. The concentration of the supplemental nucleic acids may be greater than or equal to about 1 %, 10 %, 20 %, 40 %, 60 %, 80 %, 100 %, 125 %, 150 %, 175 %, 200 %, 1000 %, 1×10⁴ %, 1×10⁵ %, 1×10⁶ %, 1×10⁷ %, 1×10⁸ % or more than the concentration of the identifiers. Larger concentrations may be beneficial for obfuscation or concealing data. In an example, the concentration of the supplemental nucleic acid sequences is substantially greater (e.g., 1×10⁸ % greater) than the concentration of identifiers in an identifier pool.

Example methods for reading information stored in nucleic acid sequences

[0192] In another aspect, the present disclosure provides methods for reading information encoded in nucleic acid sequences. A method for reading information encoded in nucleic acid sequences may comprise (a) providing an identifier library, (b) identifying the identifiers present in the identifier library, (c) generating a string of symbols from the identifiers present in the identifier library, and (d) compiling information from the string of symbols. An identifier library may comprise a subset of a plurality of identifiers from a combinatorial space. Each individual identifier of the subset of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.

[0193] Information may be written into one or more identifier libraries as described elsewhere herein. Identifiers may be constructed using any method described elsewhere herein. Stored data may be copied and accessed using any method described elsewhere herein.

[0194] The identifier may comprise information relating to a location of the encoded symbol, a value of the encoded symbol, or both the location and the value of the encoded symbol. An identifier may include information relating to a location of the encoded symbol and the presence or absence of the identifier in an identifier library may indicate the value of the symbol. The presence of an identifier in an identifier library may indicate a first symbol value (e.g., first bit value) in a binary string and the absence of an identifier in an identifier library may indicate a second symbol value (e.g., second bit value) in a binary string. In a binary system, basing a bit value on the presence or absence of an identifier in an identifier library may reduce the number of identifiers assembled and, therefore, reduce the write time. In an example, the presence of an identifier may indicate a bit value of ‘1’ at the mapped location and the absence of an identifier may indicate a bit value of ‘0’ at the mapped location.
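
As a non-limiting illustration of presence-or-absence reading, the following Python sketch walks an ordered combinatorial space of identifiers, emits ‘1’ for each identifier observed in the library and ‘0’ otherwise, and compiles the resulting bit string into bytes. The identifier names, the ordering of the space, and the function names are assumptions made for illustration only.

```python
def read_library(observed: set, space: list) -> str:
    """Generate the bit string: '1' where an identifier is present, else '0'."""
    return "".join("1" if ident in observed else "0" for ident in space)


def compile_bytes(bits: str) -> bytes:
    """Compile a bit string whose length is a multiple of 8 into bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


if __name__ == "__main__":
    # Hypothetical product-scheme space: 4 components in layer X, 2 in layer Y.
    space = [f"X{i}Y{j}" for i in range(1, 5) for j in range(1, 3)]
    observed = {"X1Y2", "X4Y2"}          # identifiers detected by the reader
    bits = read_library(observed, space)
    print(bits, compile_bytes(bits))
```

Here only two identifiers need to be assembled to write the eight bits 01000001, which illustrates why mapping one bit value to absence may reduce the number of identifiers written.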

[0195] Generating symbols (e.g., bit values) for a piece of information may include identifying the presence or absence of the identifier that the symbol (e.g., bit or bit sequence) may be mapped or encoded to. Determining the presence or absence of an identifier may include sequencing the present identifiers or using a hybridization array to detect the presence of an identifier. In an example, decoding and reading the encoded sequences may be performed using sequencing platforms. Examples of sequencing platforms are described in U.S. Patent Application Ser. No. 14/465,685 filed August 21, 2014, U.S. Patent Application Ser. No. 13/886,234 filed May 2, 2013, and U.S. Patent Application Ser. No. 12/400,593 filed March 9, 2009, each of which is entirely incorporated herein by reference.

[0196] In an example, decoding nucleic acid encoded data may be achieved using the technologies described above and/or by base-by-base sequencing of the nucleic acid strands, such as Illumina® Sequencing, or by utilizing a sequencing technique that indicates the presence or absence of specific nucleic acid sequences, such as fragmentation analysis by capillary electrophoresis. The sequencing may employ the use of reversible terminators. The sequencing may employ the use of natural or non-natural (e.g., engineered) nucleotides or nucleotide analogs. Alternatively or in addition to, decoding nucleic acid sequences may be performed using a variety of analytical techniques, including but not limited to, any methods that generate optical, electrochemical, or chemical signals. A variety of sequencing approaches may be used including, but not limited to, polymerase chain reaction (PCR), digital PCR, Sanger sequencing, high-throughput sequencing, sequencing-by-synthesis, single-molecule sequencing, sequencing-by-ligation, RNA-Seq (Illumina), Next generation sequencing, Digital Gene Expression (Helicos), Clonal Single MicroArray (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, or massively-parallel sequencing. The sequencing technologies described in this specification can be used each alone or in combination with one or more other sequencing technologies.

[0197] Various read-out methods can be used to pull information from the encoded nucleic acid. In an example, microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be further used to read out the encoded sequences and by extension digitally encoded data.

[0198] An identifier library may further comprise supplemental nucleic acid sequences that provide metadata about the information, encrypt or mask the information, or that both provide metadata and mask the information. The supplemental nucleic acids may be identified simultaneously with identification of the identifiers. Alternatively, the supplemental nucleic acids may be identified prior to or after identifying the identifiers. In an example, the supplemental nucleic acids are not identified during reading of the encoded information. The supplemental nucleic acid sequences may be indistinguishable from the identifiers. An identifier index or a key may be used to differentiate the supplemental nucleic acid molecules from the identifiers.

[0199] The efficiency of encoding and decoding data may be increased by recoding input bit strings to enable the use of fewer nucleic acid molecules. For example, if an input string is received with a high occurrence of ‘111’ substrings, which may map to three nucleic acid molecules (e.g., identifiers) with an encoding method, it may be recoded to a ‘000’ substring which may map to a null set of nucleic acid molecules. The alternate input substring of ‘000’ may also be recoded to ‘111’. This method of recoding may reduce the total amount of nucleic acid molecules used to encode the data because there may be a reduction in the number of ‘1’s in the dataset. In this example, the total size of the dataset may be increased to accommodate a codebook that specifies the new mapping instructions. An alternative method for increasing encoding and decoding efficiency may be to recode the input string to reduce the variable length. For example, ‘111’ may be recoded to ‘00’ which may shrink the size of the dataset and reduce the number of ‘1’s in the dataset.
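
As a non-limiting illustration of the substring swap, the following Python sketch recodes an input string in fixed, non-overlapping 3-bit blocks using a codebook that exchanges ‘111’ and ‘000’. The fixed block width, the treatment of a trailing partial block (left unchanged), and the names `SWAP` and `recode` are assumptions made for illustration only.

```python
# Hypothetical codebook: exchange the all-ones and all-zeros blocks.
SWAP = {"111": "000", "000": "111"}


def recode(bits: str, codebook: dict = SWAP, width: int = 3) -> str:
    """Apply the codebook to each non-overlapping block of `width` bits."""
    blocks = [bits[i:i + width] for i in range(0, len(bits), width)]
    # Blocks that are not in the codebook pass through unchanged.
    return "".join(codebook.get(block, block) for block in blocks)


if __name__ == "__main__":
    original = "111111000101111"
    recoded = recode(original)
    print(original.count("1"), recoded.count("1"))  # prints "11 5"
    print(recode(recoded) == original)              # prints "True"
```

Because the swap is its own inverse, applying the same codebook again restores the input, so the codebook is the only extra information that needs to be stored with the dataset.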

[0200] The speed and efficiency of decoding nucleic acid encoded data may be controlled (e.g., increased) by specifically designing identifiers for ease of detection, for example, using the labeling technologies described above. Other technologies can be used alone or in combination with the labeling technologies described above. For example, nucleic acid sequences (e.g., identifiers) that are designed for ease of detection may include nucleic acid sequences comprising a majority of nucleotides that are easier to call and detect based on their optical, electrochemical, chemical, or physical properties. Engineered nucleic acid sequences may be either single or double stranded. Engineered nucleic acid sequences may include synthetic or unnatural nucleotides that improve the detectable properties of the nucleic acid sequence. Engineered nucleic acid sequences may comprise all natural nucleotides, all synthetic or unnatural nucleotides, or a combination of natural, synthetic, and unnatural nucleotides. Synthetic nucleotides may include nucleotide analogues such as peptide nucleic acids, locked nucleic acids, glycol nucleic acids, and threose nucleic acids. Unnatural nucleotides may include dNaM, an artificial nucleoside containing a 3-methoxy-2-naphthyl group, and d5SICS, an artificial nucleoside containing a 6-methylisoquinoline-1-thione-2-yl group. Engineered nucleic acid sequences may be designed for a single enhanced property, such as enhanced optical properties, or the designed nucleic acid sequences may be designed with multiple enhanced properties, such as enhanced optical and electrochemical properties or enhanced optical and chemical properties.

[0201] Engineered nucleic acid sequences may comprise reactive natural, synthetic, and unnatural nucleotides that do not improve the optical, electrochemical, chemical, or physical properties of the nucleic acid sequences. The reactive components of the nucleic acid sequences may enable the addition of a chemical moiety that confers improved properties to the nucleic acid sequence. Each nucleic acid sequence may include a single chemical moiety or may include multiple chemical moieties. Example chemical moieties may include, but are not limited to, fluorescent moieties, chemiluminescent moieties, acidic or basic moieties, hydrophobic or hydrophilic moieties, and moieties that alter oxidation state or reactivity of the nucleic acid sequence.

[0202] A sequencing platform may be designed specifically for decoding and reading information encoded into nucleic acid sequences. The sequencing platform may be dedicated to sequencing single or double stranded nucleic acid molecules. The sequencing platform may decode nucleic acid encoded data by reading individual bases (e.g., base-by-base sequencing) or by detecting the presence or absence of an entire nucleic acid sequence (e.g., component) incorporated within the nucleic acid molecule (e.g., identifier), e.g., using a labeling technology as described above. The sequencing platform may include the use of promiscuous reagents, increased read lengths, and the detection of specific nucleic acid sequences by the addition of detectable chemical moieties. The use of more promiscuous reagents during sequencing may increase reading efficiency by enabling faster base calling which in turn may decrease the sequencing time. The use of increased read lengths may enable longer sequences of encoded nucleic acids to be decoded per read. The addition of detectable chemical moiety tags may enable the detection of the presence or absence of a nucleic acid sequence by the presence or absence of a chemical moiety. For example, each nucleic acid sequence encoding a bit of information may be tagged with a chemical moiety that generates a unique optical, electrochemical, or chemical signal. The presence or absence of that unique optical, electrochemical, or chemical signal may indicate a ‘0’ or a ‘1’ bit value. The nucleic acid sequence may comprise a single chemical moiety or multiple chemical moieties. The chemical moiety may be added to the nucleic acid sequence prior to use of the nucleic acid sequence to encode data. Alternatively or in addition to, the chemical moiety may be added to the nucleic acid sequence after encoding the data, but prior to decoding the data. 
The chemical moiety tag may be added directly to the nucleic acid sequence or the nucleic acid sequence may comprise a synthetic or unnatural nucleotide anchor and the chemical moiety tag may be added to that anchor.

[0203] Unique codes may be applied to minimize or detect encoding and decoding errors. Encoding and decoding errors may occur from false negatives (e.g., a nucleic acid molecule or identifier not included in a random sampling). An example of an error detecting code may be a checksum sequence that counts the number of identifiers in a contiguous set of possible identifiers that is included in the identifier library. While reading the identifier library, the checksum may indicate how many identifiers from that contiguous set of identifiers to expect to retrieve, and identifiers can continue to be sampled for reading until the expected number is met. In some embodiments, a checksum sequence may be included for every contiguous set of R identifiers where R can be equal in size or greater than 1, 2, 5, 10, 50, 100, 200, 500, or 1000 or less than 1000, 500, 200, 100, 50, 10, 5, or 2. The smaller the value of R, the better the error detection. In some embodiments, the checksums may be supplemental nucleic acid sequences. For example, a set comprising seven nucleic acid sequences (e.g., components) may be divided into two groups, nucleic acid sequences for constructing identifiers with a product scheme (components X1-X3 in layer X and Y1-Y3 in layer Y), and nucleic acid sequences for the supplemental checksums (X4-X7 and Y4-Y7). The checksum sequences X4-X7 may indicate whether zero, one, two, or three sequences of layer X are assembled with each member of layer Y. Alternatively, the checksum sequences Y4-Y7 may indicate whether zero, one, two, or three sequences of layer Y are assembled with each member of layer X. In this example, an original identifier library with identifiers {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3} may be supplemented to include checksums to become the following pool: {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3, X1Y6, X2Y7, X3Y4, X6Y1, X5Y2, X6Y3}. The checksum sequences may also be used for error correction. 
For example, absence of X1Y1 from the above dataset and the presence of X1Y6 and X6Y1 may enable inference that the X1Y1 nucleic acid molecule is missing from the dataset. The checksum sequences may indicate whether identifiers are missing from a sampling of the identifier library or an accessed portion of the identifier library. In the case of a missing checksum sequence, access methods such as PCR or affinity tagged probe hybridization may amplify and/or isolate it. In some embodiments, the checksums may not be supplemental nucleic acid sequences. The checksums may be coded directly into the information such that they are represented by identifiers.
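
As a non-limiting illustration of the checksum example above, the following Python sketch represents each identifier XiYj as an index pair (i, j), appends one checksum identifier per member of each layer (component index 4 plus the count, so that components 4 to 7 denote counts of zero to three), and infers a missing identifier from rows and columns whose observed count falls short of the checksum. The pair representation and function names are assumptions made for illustration, and the inference step is only reliable when a single identifier is missing.

```python
def with_checksums(library: set, n: int = 3) -> set:
    """Add checksum pairs: (i, n+1+count) per X member, (n+1+count, j) per Y member."""
    pool = set(library)
    for i in range(1, n + 1):
        pool.add((i, n + 1 + sum((i, j) in library for j in range(1, n + 1))))
    for j in range(1, n + 1):
        pool.add((n + 1 + sum((i, j) in library for i in range(1, n + 1)), j))
    return pool


def infer_missing(sample: set, n: int = 3) -> set:
    """Return data pairs lying on both a short row and a short column."""
    data = {(x, y) for (x, y) in sample if x <= n and y <= n}
    short_rows, short_cols = set(), set()
    for (x, y) in sample:
        if x <= n < y:    # checksum for X member x; expected count is y - n - 1
            if sum((x, j) in data for j in range(1, n + 1)) < y - n - 1:
                short_rows.add(x)
        elif y <= n < x:  # checksum for Y member y; expected count is x - n - 1
            if sum((i, y) in data for i in range(1, n + 1)) < x - n - 1:
                short_cols.add(y)
    return {(i, j) for i in short_rows for j in short_cols if (i, j) not in data}


if __name__ == "__main__":
    library = {(1, 1), (1, 3), (2, 1), (2, 2), (2, 3)}
    pool = with_checksums(library)
    print(sorted(f"X{i}Y{j}" for (i, j) in pool))
    print(infer_missing(pool - {(1, 1)}))  # prints "{(1, 1)}"
```

Running the sketch on the library {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3} adds the checksum identifiers X1Y6, X2Y7, X3Y4, X6Y1, X5Y2, and X6Y3, matching the pool given in the example.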

[0204] Noise in data encoding and decoding may be reduced by constructing identifiers palindromically, for example, by using palindromic pairs of components rather than single components in the product scheme. Then the pairs of components from different layers may be assembled to one another in a palindromic manner (e.g., YXY instead of XY for components X and Y). This palindromic method may be expanded to larger numbers of layers (e.g., ZYXYZ instead of XYZ) and may enable detection of erroneous cross reactions between identifiers.

[0205] Adding supplemental nucleic acid sequences in excess (e.g., vast excess) to the identifiers may prevent sequencing from recovering the encoded identifiers. Prior to decoding the information, the identifiers may be enriched from the supplemental nucleic acid sequences. For example, the identifiers may be enriched by a nucleic acid amplification reaction using primers specific to the identifier ends. Alternatively, or in addition to, the information may be decoded without enriching the sample pool by sequencing (e.g., sequencing by synthesis) using a specific primer. In both decoding methods, it may be difficult to enrich or decode the information without having a decoding key or knowing something about the composition of the identifiers. Alternative access methods may also be employed such as using affinity tag based probes.

[0206] The technologies described in this specification can be used with encoding schemes that are not based on identifiers as described above. For example, the digital information (e.g., binary information) can be encoded in base sequences of the nucleic acid molecules. These base sequences encode information, e.g., either in one or more bases or transitions between bases as described below. One or more bases, base (sub-)sequences, or transitions encode bits of digital information that can be in any form, e.g., the digital information can include binary (base-2), ternary (base-3), quaternary (base-4), decimal, or hexadecimal digits, or combinations thereof. The digital information can include base-x digits, where x is an integer.

[0207] In some implementations, the base sequences can include one or more bit barcodes. A bit barcode is a sequence of bits that can determine the location of the encoded bits within the overall format of information, e.g., the location of a word in a sentence or a page in a book. Each bit barcode can be labeled as described above.

[0208] In some implementations, the digital information is encoded using bit-per-base encoding. For example, a single message can be encoded in a plurality of ways, e.g., A or C for zero, G or T for the number 1. Alternatively or additionally, digital information can be encoded in base transitions, a transition from an A to a C for zero and from a G to a T for the number 1. In some implementations, one bit is encoded into a base (“bit-per-base”) or base transition (e.g., A = 0; G = 1; A-C = 0; G-T= 1). In some implementations, multiple bits are encoded into a base or base transition (e.g., A = 10; G = 11; A-C = 10; G-T= 11). In some implementations, one or more bits are encoded in a sub-base sequence. A sub-base sequence can encode information in a bit-per-base scheme (e.g., AAGG = 0011) or other schemes (e.g., ACA = 10; GTG = 11). A sub-base sequence can be labeled as described above. In some implementations, one bit barcode is labeled with one or more first labels. In some implementations, one sub-base sequence is labeled with one or more second labels.
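
As a non-limiting illustration of bit-per-base encoding with A or C for zero and G or T for one, the following Python sketch encodes a bit string and decodes a base sequence. The rule used to choose between the two bases of a pair (pick the one that differs from the preceding base, which avoids repeating a base) is an assumption made for illustration and is not prescribed by this disclosure; the function names are likewise hypothetical.

```python
ZERO_BASES = "AC"  # either base reads as bit 0
ONE_BASES = "GT"   # either base reads as bit 1


def encode_bits(bits: str) -> str:
    """Map each bit to a base, alternating within a pair to avoid repeats."""
    bases = []
    for bit in bits:
        if bit not in "01":
            raise ValueError(f"not a bit: {bit!r}")
        pair = ONE_BASES if bit == "1" else ZERO_BASES
        # Use the first base of the pair unless it would repeat the previous base.
        use_second = bool(bases) and bases[-1] == pair[0]
        bases.append(pair[1] if use_second else pair[0])
    return "".join(bases)


def decode_bases(sequence: str) -> str:
    """Map A or C to '0' and G or T to '1'."""
    bits = []
    for base in sequence:
        if base in ZERO_BASES:
            bits.append("0")
        elif base in ONE_BASES:
            bits.append("1")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(bits)


if __name__ == "__main__":
    print(encode_bits("0011"))   # prints "ACGT"
    print(decode_bases("AAGG"))  # prints "0011"
```

Both ACGT and AAGG decode to 0011, which reflects the statement above that a single message can be encoded in a plurality of ways.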

[0209] The labeling and reading technologies described in this specification can be used to label and read information encoded in nucleic acid molecules. This information can include digital information or other information, e.g., biological information. The nucleic acid molecules can be artificial or naturally occurring. In some implementations, the technologies described in this specification can be used to label motifs or sub-sequences in naturally occurring DNA or RNA, and said DNA or RNA can be read using the nanopore and/or fluorescence-based reading technologies described in this specification.

Systems for encoding sequence data

[0210] A system for encoding (digital) information into nucleic acids (e.g., DNA) can comprise systems, methods and devices for converting files and data (e.g., raw data, compressed zip files, integer data, and other forms of data) into bytes and encoding the bytes into segments or sequences of nucleic acids, typically DNA, or combinations thereof.

[0211] In an aspect, the present disclosure provides systems for encoding sequence data (e.g., digital sequence data, e.g., binary sequence data) using nucleic acids. A system for encoding said sequence data using nucleic acids may comprise a device and one or more computer processors. The device may be configured to construct an identifier library. The one or more computer processors may be individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, and (iii) construct an identifier library comprising at least a subset of a plurality of identifiers. An individual identifier of the plurality of identifiers may correspond to an individual symbol of the string of symbols. An individual identifier of the plurality of identifiers may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.

[0212] In another aspect, the present disclosure provides systems for reading sequence data (e.g., digital sequence data, e.g., binary sequence data) using nucleic acids. A system for reading said sequence data using nucleic acids may comprise a database and one or more computer processors. The database may store an identifier library encoding the information. The one or more computer processors may be individually or collectively programmed to (i) identify the identifiers in the identifier library, (ii) generate a plurality of symbols from identifiers identified in (i), and (iii) compile the information from the plurality of symbols. The identifier library may comprise a subset of a plurality of identifiers. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.

[0213] Systems for encoding, writing, copying, accessing, reading, and decoding information encoded and written into nucleic acid molecules may be a single integrated unit or may be multiple units configured to execute one or more of the aforementioned operations. A system for encoding and writing information into nucleic acid molecules (e.g., identifiers) may include a device and one or more computer processors.

[0214] The device may comprise a plurality of regions, sections, or partitions. The reagents and components to assemble the identifiers may be stored in one or more regions, sections, or partitions of the device. Layers may be stored in separate regions or sections of the device. A layer may comprise one or more unique components. The component in one layer may be unique from the components in another layer. The regions or sections may comprise vessels and the partitions may comprise wells. Each layer may be stored in a separate vessel or partition. Each reagent or nucleic acid sequence may be stored in a separate vessel or partition. Alternatively, or in addition to, reagents may be combined to form a master mix for identifier construction. The device may transfer reagents, components, and templates from one section of the device to be combined in another section. The device may provide the conditions for completing the assembly reaction. For example, the device may provide heating, agitation, and detection of reaction progress. The constructed identifiers may be directed to undergo one or more (subsequent) reactions to add barcodes, common sequences, variable sequences, labels, or tags to one or more identifiers. The identifiers may then be directed to a region or partition to generate an identifier library. One or more identifier libraries may be stored in each region, section, or individual partition of the device. The device may transfer fluid (e.g., reagents, components, templates, or labels) using pressure, vacuum, or suction.

[0215] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. An example computer system is programmed or otherwise configured to encode digital information into nucleic acid sequences and/or read (e.g., decode) information derived from nucleic acid sequences. The computer system can regulate various aspects of the encoding and decoding procedures of the present disclosure, such as, for example, the bit-values and bit location information for a given bit or byte from an encoded bitstream or byte stream. The computer system is functionally connected to one or more devices (e.g., electro-mechanically, e.g., via electronically actuated valves or pumps), e.g., one or more devices that may transfer fluid (e.g., reagents, components, templates, or labels) using pressure, vacuum, or suction.

[0216] A computer system includes a central processing unit (CPU, also “processor” and “computer processor” herein), which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system also includes memory or memory location (e.g., random-access memory, read-only memory, flash memory), electronic storage unit (e.g., hard disk), communication interface (e.g., network adapter) for communicating with one or more other systems, and peripheral devices, such as cache, other memory, data storage and/or electronic display adapters. The memory, storage unit, interface and peripheral devices are in communication with the CPU through a communication bus (solid lines), such as a motherboard. The storage unit can be a data storage unit (or data repository) for storing data. The computer system can be operatively coupled to a computer network (“network”) with the aid of the communication interface. The network can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network in some cases is a telecommunication and/or data network. The network can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network, in some cases with the aid of the computer system, can implement a peer-to-peer network, which may enable devices coupled to the computer system to behave as a client or a server.

[0217] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit. The algorithm can, for example, be used with a DNA index and raw data or zip file compressed or decompressed data, to determine a customized method for coding digital information from the raw data or zip file compressed data, prior to encoding the digital information.

Illustrative Embodiments

[0218] Item 1. A method for reading digital information written into nucleic acid sequence(s), comprising:

(a) providing nucleic acid molecules indicative of digital information comprising a string of symbols;

(b) modifying at least a portion of the nucleic acid molecules with one or more labels;

(c) translocating one or more modified nucleic acid molecules through one or more nanopores disposed in a substrate and configured to receive an input nucleic acid molecule;

(d) reading one or more signals received from the one or more nanopores and translating the one or more signals into the information, the reading comprising

(i) detecting an electric current signal from the one or more nanopores;

(ii) detecting a change in current through the one or more nanopores corresponding to the translocation, wherein a first current level corresponds to passage of a portion of the nucleic acid molecules indicative of the digital information and a second current level corresponds to passage of one of the one or more labels through the one or more nanopores;

(iii) identifying the one label from the second current level; and

(iv) identifying, from a library of nucleic acid molecules, the digital information based at least in part on the identified label.
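By way of non-limiting illustration, reading steps (i)-(iv) of Item 1 may be sketched as follows. The sketch assumes an idealized, low-noise current trace; the current levels, the tolerance, the label names, and the library contents are hypothetical values chosen for illustration and are not taken from the disclosure.

```python
from itertools import groupby

# Hypothetical values in arbitrary units, for illustration only.
BACKBONE_LEVEL = 100.0                              # first current level: unlabeled nucleic acid
LABEL_LEVELS = {"hairpin": 60.0, "protein": 30.0}   # second current levels: one per label type
TOLERANCE = 5.0

# Hypothetical library mapping an ordered label sequence to stored symbols.
LIBRARY = {
    ("hairpin", "protein"): "01",
    ("protein", "hairpin"): "10",
}


def classify(sample):
    """Return 'backbone', a label name, or None for an unrecognized level."""
    if abs(sample - BACKBONE_LEVEL) <= TOLERANCE:
        return "backbone"
    for name, level in LABEL_LEVELS.items():
        if abs(sample - level) <= TOLERANCE:
            return name
    return None


def read_labels(trace):
    """Collapse a current trace into the ordered sequence of labels it contains."""
    states = [classify(sample) for sample in trace]
    runs = [state for state, _ in groupby(states)]  # merge consecutive equal states
    return tuple(state for state in runs if state not in ("backbone", None))


def decode(trace):
    """Look up the label sequence in the library; returns None if absent."""
    return LIBRARY.get(read_labels(trace))


trace = [100, 101, 61, 60, 99, 100, 31, 30, 29, 100]
print(read_labels(trace), decode(trace))
```

A practical reader would additionally need filtering and event detection suited to the noise of the particular nanopore; the fixed tolerance here is only a placeholder for that step.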

[0219] Item 2. The method of item 1, comprising mapping said string of symbols to a plurality of identifiers, an individual identifier of said plurality of identifiers corresponding to an individual symbol of said string of symbols.

[0220] Item 3. The method as in any one of items 1-2, comprising constructing a plurality of components, each individual component of said plurality of components being a nucleic acid molecule having a component nucleic acid sequence.

[0221] Item 4. The method as in any one of items 1-3, comprising chemically linking together two or more components of said plurality of components, thereby generating a plurality of identifier nucleic acid molecules corresponding to at least a subset of said plurality of identifiers, each identifier nucleic acid molecule of said plurality of identifier nucleic acid molecules comprising two or more components.

[0222] Item 5. The method of item 1, wherein the digital information is encoded in base sequences of the nucleic acid molecules.

[0223] Item 6. The method of item 5, wherein the base sequences comprise one or more bit barcodes.

[0224] Item 7. The method as in any one of items 5-6, wherein the digital information is encoded using bit-per-base encoding.

[0225] Item 8. The method of item 7, comprising encoding one bit per base or base transition.

[0226] Item 9. The method of item 7, comprising encoding multiple bits per base or base transition.

[0227] Item 10. The method of item 7, comprising encoding one or more bits in a sub-base sequence.
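By way of non-limiting illustration, the bit-per-base encodings of Items 7-9 may be sketched as follows. Both mappings are assumptions made for this example: a two-bits-per-base table (00, 01, 10, 11 mapped to A, C, G, T) and a one-bit-per-base-transition scheme in which each bit selects a step of one or two positions around the cycle A, C, G, T. The disclosure does not prescribe either mapping.

```python
BASES = "ACGT"


def encode_two_bits_per_base(bits):
    """Hypothetical mapping: 00->A, 01->C, 10->G, 11->T."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def decode_two_bits_per_base(seq):
    """Invert encode_two_bits_per_base."""
    return "".join(format(BASES.index(base), "02b") for base in seq)


def encode_bit_per_transition(bits, start="A"):
    """Hypothetical scheme: bit 0 steps one base around A->C->G->T->A,
    bit 1 steps two bases, so adjacent bases always differ."""
    seq = [start]
    for bit in bits:
        step = 1 if bit == "0" else 2
        seq.append(BASES[(BASES.index(seq[-1]) + step) % 4])
    return "".join(seq)


def decode_bit_per_transition(seq):
    """Recover one bit from each pair of adjacent bases."""
    bits = []
    for prev, cur in zip(seq, seq[1:]):
        step = (BASES.index(cur) - BASES.index(prev)) % 4
        if step not in (1, 2):
            raise ValueError(f"transition {prev}->{cur} does not encode a bit")
        bits.append("0" if step == 1 else "1")
    return "".join(bits)


print(encode_two_bits_per_base("0110"))
print(encode_bit_per_transition("0110"))
```

The transition scheme spends one extra starting base but never produces two identical adjacent bases, which is one reason a transition-based code might be preferred in practice.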

[0228] Item 11. The method as in any one of items 6-10, wherein one or more of the bit barcodes or one or more of the sub-base sequences, or both, are labeled.

[0229] Item 12. The method of item 11, wherein one bit barcode is labeled with one or more first labels.

[0230] Item 13. The method as in any one of items 11-12, wherein one sub-base sequence is labeled with one or more second labels.

[0231] Item 14. The method as in any one of items 1-13, wherein the digital information comprises binary code, ternary code, quaternary code, base-x code, wherein x is an integer, decimal code, or hexadecimal code, or a combination thereof.

[0232] Item 15. The method as in any one of items 1-14, wherein unmodified identifier nucleic acid molecules are single-stranded DNA molecules.

[0233] Item 16. The method as in any one of items 1-15, wherein the one or more nanopores are or comprise a Nanopore Field Effect Transistor (NPFET).

[0234] Item 17. The method as in any one of items 1-16, wherein the one or more nanopores are integrated into a complementary metal-oxide-semiconductor (CMOS) circuit.

[0235] Item 18. The method as in any one of items 1-17, wherein the one or more nanopores have a diameter of 10-15 nm.

[0236] Item 19. The method as in any one of items 1-18, wherein the one or more nanopores have a diameter of less than 10 nm.

[0237] Item 20. The method as in any one of items 1-19, wherein at least one component in an identifier nucleic acid molecule is labeled.

[0238] Item 21. The method as in any one of items 1-20, wherein each component of one or more identifier nucleic acid molecules of said plurality of identifier nucleic acid molecules is labeled with one or more distinct labels.

[0239] Item 22. The method of item 21, wherein a first label causes a first current change in the one or more nanopores, and a second label causes a second current change.

[0240] Item 23. The method of item 22, wherein the first current change is different from the second current change.

[0241] Item 24. The method of item 23, wherein reading comprises detecting a current change fingerprint.

[0242] Item 25. The method of item 24, wherein the current change includes one or more changes in signal spacing, signal amplitude, or a combination thereof.
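By way of non-limiting illustration, a current change fingerprint built from signal amplitude and signal spacing (Items 24-25) may be sketched as follows. The baseline, the minimum drop that counts as an event, and the representation of the fingerprint as two lists are assumptions made for this example.

```python
def fingerprint(trace, baseline, min_drop):
    """Summarize a current trace as (amplitudes, spacings).

    An event is a contiguous run of samples at least `min_drop` below
    `baseline`. Amplitude is the deepest drop within the event; spacing
    is the number of samples between the starts of consecutive events.
    """
    events = []          # list of (start_index, depth)
    start = None
    depth = 0.0
    for i, sample in enumerate(trace):
        drop = baseline - sample
        if drop >= min_drop:
            if start is None:
                start, depth = i, drop       # a new event begins
            else:
                depth = max(depth, drop)     # event continues; track deepest point
        elif start is not None:
            events.append((start, depth))    # event ended on the previous sample
            start = None
    if start is not None:                    # trace ended inside an event
        events.append((start, depth))

    amplitudes = [d for _, d in events]
    spacings = [b[0] - a[0] for a, b in zip(events, events[1:])]
    return amplitudes, spacings


trace = [100, 100, 70, 72, 100, 100, 100, 40, 100]
print(fingerprint(trace, baseline=100, min_drop=10))
```

Two labels that differ in size would be expected to differ in amplitude, while the distance between labels along the molecule would show up as spacing, so the pair of lists can be compared against expected fingerprints for each identifier.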

[0243] Item 26. The method as in any one of items 21-25, wherein the one or more labels are or comprise biotin-streptavidin, 3D DNA structures, enzyme labels, proteins, or a combination thereof.

[0244] Item 27. The method as in any one of items 21-25, wherein the one or more labels are or comprise a protein.

[0245] Item 28. The method of item 27, wherein the protein is an enzyme or an endonuclease.

[0246] Item 29. The method as in any one of items 1-28, wherein the one or more components include one or more transcription factor recognition sites to bind one or more transcription factors.

[0247] Item 30. The method as in any one of items 27-29, wherein the protein is modified by phosphorylation, acetylation, hydroxylation, ubiquitination, methylation, or cross-linking.

[0248] Item 31. The method as in any one of items 27-30, wherein the one or more labels comprise two or more proteins.

[0249] Item 32. The method as in any one of items 21-31, wherein the one or more labels comprise an indirect label on a DNA flap.

[0250] Item 33. The method as in any one of items 21-32, wherein the one or more labels are or comprise one or more DNA hairpins.

[0251] Item 34. The method of item 33, wherein each component has at least one hairpin.

[0252] Item 35. The method of item 33, wherein one or more components have two or more hairpins.

[0253] Item 36. The method as in any one of items 33-35, wherein the one or more hairpins have a length of 10-500 bp.

[0254] Item 37. The method as in any one of items 33-36, wherein two or more hairpins have different hairpin stem lengths.

[0255] Item 38. The method as in any one of items 33-37, wherein two or more hairpins have different numbers of hairpin stems.

[0256] Item 39. The method as in any one of items 33-38, wherein two or more hairpins have different hairpin loop sizes.

[0257] Item 40. The method as in any one of items 33-39, wherein two or more hairpins incorporate one or more unnatural bases or methylated bases.

[0258] Item 41. The method as in any one of items 33-40, wherein the hairpins are chemically modified.

[0259] Item 42. The method as in any one of items 33-41, wherein the one or more hairpins are located on one strand of a double stranded component.

[0260] Item 43. The method as in any one of items 21-42, wherein the one or more labels comprise a looped section, the looped sections configured to expand upon denaturation of the one or more nucleic acid molecules.

[0261] Item 44. The method of item 43, comprising denaturing the one or more nucleic acid molecules, thereby forming an expandomer comprising the one or more labels.

[0262] Item 45. The method as in any one of items 1-44, comprising reading a plurality of nucleic acid molecules and generating, from the plurality of reads, a consensus read.

[0263] Item 46. The method of item 45, comprising circularizing one or more nucleic acid molecules and performing rolling circle amplification.
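By way of non-limiting illustration, generating a consensus read from a plurality of reads (Item 45) may be sketched as a per-position majority vote. The sketch assumes the reads have already been aligned and trimmed to equal length and that each read is a string with one character per label; both are simplifying assumptions, and other consensus methods may be used.

```python
from collections import Counter


def consensus(reads):
    """Return the per-position majority call across equal-length reads.

    Ties are resolved in favor of the call seen first at that position,
    which follows from the ordering of Counter.most_common.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


reads = ["ABBA", "ABCA", "ACBA"]
print(consensus(reads))
```

Repeated copies of the same molecule, for example copies produced by rolling circle amplification, would supply the plurality of reads in this picture.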

[0264] Item 47. The method as in any one of items 1-46, wherein two or more components are assembled into an identifier nucleic acid molecule using a single-stranded DNA scaffold.

[0265] Item 48. The method of item 47, comprising binding one or more templating strands to (a) the scaffold and (b) to a component, each templating strand being at least partially complementary to the scaffold and one component.

[0266] Item 49. The method of item 48, wherein each templating strand corresponds to one specific position along the scaffold.

[0267] Item 50. The method as in any one of items 48-49, wherein ligation of two component edges via a templating strand occurs via ligation with overhangs of 6 bases or fewer.

[0268] Item 51. The method as in any one of items 48-49, wherein ligation of two component edges via a templating strand occurs via ligation with overhangs of 1 or 2 bases.

[0269] Item 52. The method as in any one of items 48-49, wherein ligation of two component edges via a templating strand occurs via blunt end ligation.

[0270] Item 53. The method as in any one of items 1-52, wherein modifying comprises replacement of one or more components.

[0271] Item 54. The method of item 53, wherein one or more components are replaced with a component with a different length.

[0272] Item 55. The method as in any one of items 53-54, wherein one or more components are replaced with a component that is a result of chemical computation.

[0273] Item 56. The method as in any one of items 1-55, wherein modifying comprises a nick-translation scheme.

[0274] Item 57. The method of item 56, wherein each component comprises a nicking endonuclease recognition site.

[0275] Item 58. The method as in any one of items 1-57, wherein chemically linking together comprises:

(a) forming an initiator oligo comprising an initiator strand and a template strand;

(b) binding, to an overhang of the template strand, a portion of a first component, the first component comprising a recognition site for a nicking endonuclease;

(c) extending, using a polymerase, the template strand along the first component;

(d) nicking a portion of the first component, thereby causing a second overhang of the template strand; and

(e) binding, to the second overhang of the template strand, a portion of a second component, the second component comprising a recognition site for a nicking endonuclease.

[0276] Item 59. The method of item 58, comprising repeating steps (c)-(e).

[0277] Item 60. The method as in any one of items 58-59, wherein a binding portion of the first or second component is modified.

[0278] Item 61. The method as in any one of items 58-60, wherein the polymerase is a strand displacing polymerase.

[0279] Item 62. The method as in any one of items 1-61, wherein translocating comprises repeatedly passing the identifier nucleic acid molecule through the nanopore by iteratively changing nanopore voltage polarity.

[0280] Item 63. The method as in any one of items 1-62, wherein each identifier nucleic acid molecule comprises a DNA dumbbell.

[0281] Item 64. The method as in any one of items 1-63, wherein each identifier nucleic acid molecule comprises two DNA dumbbells.

[0282] Item 65. The method of item 64, wherein each identifier nucleic acid molecule comprises a first DNA dumbbell disposed between a first end and a first component of the identifier nucleic acid molecule and a second DNA dumbbell disposed between a second end and the first component of the identifier nucleic acid molecule.

[0283] Item 66. The method as in any one of items 1-65, comprising selecting a subset of identifier nucleic acid molecules based on molecule size.

[0284] Item 67. The method of item 66, wherein one or more nanopores are decorated with one or more molecules to attract an identifier nucleic acid molecule.

[0285] Item 68. The method as in any one of items 1-67, comprising labeling or tagging each of a plurality of identifiers with a bead.

[0286] Item 69. The method as in any one of items 1-68, wherein a nanopore is pre-loaded with a bead-tagged identifier nucleic acid molecule, the bead having a diameter larger than the diameter of the nanopore.

[0287] Item 70. The method as in any one of items 1-69, comprising labeling each of a plurality of identifiers with a helicase configured to act as a stopper preventing complete translocation of the identifier nucleic acid molecule.

[0288] Item 71. The method as in any one of items 1-70, wherein translocating comprises repeatedly passing the identifier nucleic acid molecule through the nanopore by iteratively changing nanopore voltage polarity.

[0289] Item 72. The method as in any one of items 1-71, comprising dynamically adjusting nanopore translocation rate based on molecule detection rate.
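By way of non-limiting illustration, dynamically adjusting translocation based on molecule detection rate (Item 72) may be sketched as a simple proportional feedback step. The sketch assumes that the applied bias voltage is the control variable and that raising it increases the detection rate; the gain, the voltage limits, and the function name are hypothetical and are not taken from the disclosure.

```python
def adjust_bias(voltage_mv, detected_per_s, target_per_s,
                gain=0.5, v_min=50.0, v_max=300.0):
    """Return an updated bias voltage (mV) for the next measurement window.

    The voltage is nudged in proportion to the shortfall (or excess) in
    detection rate and clamped to the assumed safe range [v_min, v_max].
    """
    error = target_per_s - detected_per_s
    proposed = voltage_mv + gain * error
    return max(v_min, min(v_max, proposed))


voltage = 100.0
for observed in (10.0, 14.0, 19.0):      # hypothetical detection rates per second
    voltage = adjust_bias(voltage, observed, target_per_s=20.0)
print(voltage)
```

A real controller would be tuned to the pore and electrolyte in use and might act on a different variable altogether; the sketch only shows the feedback structure.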

[0290] Item 73. The method as in any one of items 1-72, comprising modifying at least a portion of one of the one or more labels with one or more fluorescent labels, reading one or more fluorescent signals from the one or more fluorescent labels and translating the one or more signals into the information, the reading comprising

(i) detecting a change in optical signal along a length of a nucleic acid molecule, wherein a first optical signal corresponds to a first motif and a second optical signal corresponds to a second motif;

(ii) identifying a first optical label from the first optical signal and a second optical label from the second optical signal; and

(iii) identifying, from the library of nucleic acid molecules, the information based at least in part on the identified optical labels.

[0291] Item 74. A method for reading information written into nucleic acid sequence(s), comprising:

(a) providing a plurality of nucleic acid molecules, each molecule comprising a plurality of nucleic acid motifs;

(b) modifying at least a portion of the plurality of nucleic acid molecules with one or more structural labels and decorating at least one of the one or more structural labels with a fluorescent label;

(c) reading one or more fluorescent signals and translating the one or more fluorescent signals into the information, the reading comprising

(ii) identifying a first label from the first optical signal and a second label from the second optical signal; and

(iii) identifying, from a library of nucleic acid molecules, the information based at least in part on the identified labels.

[0292] Item 75. The method as in any of items 1-74, comprising: translating the information into the string of symbols; mapping the string of symbols to a plurality of identifiers, wherein each individual identifier of the plurality of identifiers comprises a combination of a plurality of components from a library of components, wherein each component in an individual identifier of the plurality of identifiers comprises a distinct nucleic acid sequence, and wherein each identifier represents a symbol position of an individual symbol in the string of symbols; and forming at least one individual identifier of the plurality of identifiers by depositing the plurality of components into a compartment, wherein the plurality of components assemble via one or more reactions in the compartment to form the combination of components of the at least one individual identifier representing at least one symbol position of an individual symbol in the string of symbols.

[0293] Item 76. The method of item 75, wherein each symbol is one of two or more possible symbol values.

[0294] Item 77. The method of item 75, wherein a first symbol value of two possible symbol values is represented by an absence of a distinct identifier of the plurality of identifiers, and wherein a second symbol value of the two possible symbol values is represented by a presence of the distinct identifier, or vice versa.
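By way of non-limiting illustration, the presence/absence representation of Items 75-77 may be sketched as follows. The sketch assumes binary symbols, assumes that symbol value "1" is represented by presence of the identifier for that position, and uses integer positions as stand-ins for identifiers; the opposite convention is equally possible.

```python
def bits_to_identifier_set(bits):
    """Return the set of symbol positions whose identifier must be present.

    Convention assumed here: '1' -> identifier present, '0' -> absent.
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    return {position for position, bit in enumerate(bits) if bit == "1"}


def identifier_set_to_bits(present, length):
    """Rebuild the bit string from the identifiers detected in a library."""
    return "".join("1" if position in present else "0" for position in range(length))


bits = "01101"
present = bits_to_identifier_set(bits)
print(sorted(present), identifier_set_to_bits(present, len(bits)))
```

Because absent identifiers are never constructed, only the positions holding the "present" symbol value need to be written, and the original string length must be known or stored separately in order to decode.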

[0295] Item 78. The method as in any one of items 75-77, wherein the plurality of components assemble via the one or more reactions to form identifiers that represent symbol values that are represented by a presence of the identifiers.

[0296] Item 79. The method of item 75, wherein the distinct nucleic acid sequence of an individual component appears in more than one distinct identifier.

[0297] Item 80. The method as in any one of items 75-79, wherein each component comprises a distinct nucleic acid sequence with first and second ends, a first hybridization region on the first end, and a second hybridization region on the second end.

[0298] Item 81. The method as in any one of items 75-80, wherein each of the plurality of components belongs in one of M layers, and wherein one component from each of the M layers assemble to form the at least one identifier.

[0299] Item 82. The method of item 81, wherein each layer of the M layers comprises a distinct set of components.

[0300] Item 83. The method of item 82, wherein within each layer, the components have a common first hybridization region and a common second hybridization region.
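By way of non-limiting illustration, forming an identifier from one component per layer (Items 81-83) may be sketched as a mixed-radix mapping from a symbol position to a combination of components. The layer contents, the component names, and the ordering convention (last layer varies fastest) are hypothetical choices made for this example.

```python
from itertools import product
from math import prod

# Hypothetical component library with M = 3 layers.
LAYERS = [
    ["A0", "A1"],
    ["B0", "B1", "B2"],
    ["C0", "C1"],
]


def identifier_count(layers):
    """Number of distinct identifiers: the product of the layer sizes."""
    return prod(len(layer) for layer in layers)


def identifier_for_position(position, layers):
    """Map a symbol position to one component from each layer."""
    if not 0 <= position < identifier_count(layers):
        raise ValueError("position is outside the identifier space")
    chosen = []
    for layer in reversed(layers):           # peel off the fastest-varying layer first
        position, index = divmod(position, len(layer))
        chosen.append(layer[index])
    return tuple(reversed(chosen))


print(identifier_count(LAYERS), identifier_for_position(5, LAYERS))
```

With these layer sizes, seven components are enough to address twelve symbol positions, which illustrates why a combinatorial library needs far fewer distinct components than identifiers.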

[0301] Item 84. The method of item 75, wherein each component in the compartment has first and second hybridization regions, and the first or second hybridization region of each component is complementary to the first or second hybridization region of another component, and wherein the one or more reactions comprise hybridization of the first and second complementary hybridization regions.

[0302] Item 85. The method of item 75, wherein a subset of the plurality of identifiers are formed via one reaction in a multiplex fashion.

[0303] Item 86. The method of item 75, wherein the one or more reactions comprise overlap-extension polymerase chain reaction (PCR), polymerase cycling assembly, sticky end ligation, ligase cycling reaction, or template directed ligation.

[0304] Item 87. The method as in any one of items 75-86, further comprising generating an identifier library comprising the at least one formed identifier and at least one additional formed identifier.

[0305] Item 88. The method of item 87, wherein the identifier library comprises a distinct barcode, metadata of the information, or both.

[0306] Item 89. The method of item 88, further comprising extracting a targeted subset of the identifier library.

[0307] Item 90. The method of item 89, further comprising combining a plurality of probes with the identifier library, wherein the plurality of probes share complementarity with the plurality of components of each identifier of the targeted subset such that each identifier of the targeted subset hybridizes with at least one probe when combined with the plurality of probes.

[0308] Item 91. The method of item 90, wherein the plurality of probes comprises one or more affinity tags, and wherein the one or more affinity tags are captured by an affinity bead or an affinity column.

[0309] Item 92. The method of item 89, wherein each identifier comprises one or more common primer binding regions, one or more variable primer binding regions, or any combination thereof.

[0310] Item 93. The method of item 92, further comprising: combining the identifier library with primers that bind to the one or more common primer binding regions or to the one or more variable primer binding regions, wherein the primers bind to the one or more common primer binding regions or to the one or more variable primer binding regions; and selectively amplifying said targeted subset of said identifier library.

[0311] Item 94. The method of item 89, further comprising selectively removing a portion of non-targeted identifiers from said identifier library.

[0312] Item 95. A method for reading information written into nucleic acid sequence(s), comprising:

(a) providing nucleic acid molecules indicative of information comprising a string of symbols;

(i) detecting an electric current signal from the one or more nanopores;

(ii) detecting a change in current through the one or more nanopores corresponding to the translocation, wherein a first current level corresponds to passage of a portion of the nucleic acid molecules indicative of the information and a second current level corresponds to passage of one of the one or more labels through the one or more nanopores;

(iii) identifying the one label from the second current level; and

(iv) identifying, from a library of nucleic acid molecules, the information based at least in part on the identified label.

[0313] Item 96. The method of item 95, comprising performing one or more steps of the method(s) as in any one of items 2-94.

[0314] Item 97. A system for reading information written into nucleic acid sequence(s), the system comprising: one or more first reagents and components to assemble one or more nucleic acid molecules encoding information; one or more second reagents to label one or more nucleic acid molecules; a fluidic device configured to transfer fluid, the fluid comprising one or more of the first or second reagents; and a processor and a memory functionally connected to the fluidic device, the memory comprising instructions that, when executed, cause the processor to actuate one or more components of the device to perform one or more method(s) as in any one of items 1-96.

What is claimed is:

Claims

1. A method for reading digital information written into nucleic acid sequence(s), comprising:

(i) detecting an electric current signal from the one or more nanopores;

(iii) identifying the one label from the second current level; and

2. The method of claim 1, comprising mapping said string of symbols to a plurality of identifiers, an individual identifier of said plurality of identifiers corresponding to an individual symbol of said string of symbols.

3. The method as in any one of claims 1-2, comprising constructing a plurality of components, each individual component of said plurality of components being a nucleic acid molecule having a component nucleic acid sequence.

4. The method as in any one of claims 1-3, comprising chemically linking together two or more components of said plurality of components, thereby generating a plurality of identifier nucleic acid molecules corresponding to at least a subset of said plurality of identifiers, each identifier nucleic acid molecule of said plurality of identifier nucleic acid molecules comprising two or more components.

5. The method of claim 1, wherein the digital information is encoded in base sequences of the nucleic acid molecules.

6. The method of claim 5, wherein the base sequences comprise one or more bit barcodes.

7. The method as in any one of claims 5-6, wherein the digital information is encoded using bit-per-base encoding.

8. The method of claim 7, comprising encoding one bit per base or base transition.

9. The method of claim 7, comprising encoding multiple bits per base or base transition.

10. The method of claim 7, comprising encoding one or more bits in a sub-base sequence.

11. The method as in any one of claims 6-10, wherein one or more of the bit barcodes or one or more of the sub-base sequences, or both, are labeled.

12. The method of claim 11, wherein one bit barcode is labeled with one or more first labels.

13. The method as in any one of claims 11-12, wherein one sub-base sequence is labeled with one or more second labels.

14. The method as in any one of claims 1-13, wherein the digital information comprises binary code, ternary code, quaternary code, base-x code, wherein x is an integer, decimal code, or hexadecimal code, or a combination thereof.

15. The method as in any one of claims 1-14, wherein unmodified identifier nucleic acid molecules are single-stranded DNA molecules.

16. The method as in any one of claims 1-15, wherein the one or more nanopores are or comprise a Nanopore Field Effect Transistor (NPFET).

17. The method as in any one of claims 1-16, wherein the one or more nanopores are integrated into a complementary metal-oxide-semiconductor (CMOS) circuit.

18. The method as in any one of claims 1-17, wherein the one or more nanopores have a diameter of 10-15 nm.

19. The method as in any one of claims 1-18, wherein the one or more nanopores have a diameter of less than 10 nm.

20. The method as in any one of claims 1-19, wherein at least one component in an identifier nucleic acid molecule is labeled.

21. The method as in any one of claims 1-20, wherein each component of one or more identifier nucleic acid molecules of said plurality of identifier nucleic acid molecules is labeled with one or more distinct labels.

22. The method of claim 21, wherein a first label causes a first current change in the one or more nanopores, and a second label causes a second current change.

23. The method of claim 22, wherein the first current change is different from the second current change.

24. The method of claim 23, wherein reading comprises detecting a current change fingerprint.

25. The method of claim 24, wherein the current change includes one or more changes in signal spacing, signal amplitude, or a combination thereof.

26. The method as in any one of claims 21-25, wherein the one or more labels are or comprise biotin-streptavidin, 3D DNA structures, enzyme labels, proteins, or a combination thereof.

27. The method as in any one of claims 21-25, wherein the one or more labels are or comprise a protein.

28. The method of claim 27, wherein the protein is an enzyme or an endonuclease.

29. The method as in any one of claims 1-28, wherein the one or more components include one or more transcription factor recognition sites to bind one or more transcription factors.

30. The method as in any one of claims 27-29, wherein the protein is modified by phosphorylation, acetylation, hydroxylation, ubiquitination, methylation, or cross-linking.

31. The method as in any one of claims 27-30, wherein the one or more labels comprise two or more proteins.

32. The method as in any one of claims 21-31, wherein the one or more labels comprise an indirect label on a DNA flap.

33. The method as in any one of claims 21-32, wherein the one or more labels are or comprise one or more DNA hairpins.

34. The method of claim 33, wherein each component has at least one hairpin.

35. The method of claim 33, wherein one or more components have two or more hairpins.

36. The method as in any one of claims 33-35, wherein the one or more hairpins have a length of 10-500 bp.

37. The method as in any one of claims 33-36, wherein two or more hairpins have different hairpin stem lengths.

38. The method as in any one of claims 33-37, wherein two or more hairpins have different numbers of hairpin stems.

39. The method as in any one of claims 33-38, wherein two or more hairpins have different hairpin loop sizes.

40. The method as in any one of claims 33-39, wherein two or more hairpins incorporate one or more unnatural bases or methylated bases.

41. The method as in any one of claims 33-40, wherein the hairpins are chemically modified.

42. The method as in any one of claims 33-41, wherein the one or more hairpins are located on one strand of a double stranded component.

43. The method as in any one of claims 21-42, wherein the one or more labels comprise a looped section, the looped sections configured to expand upon denaturation of the one or more nucleic acid molecules.

44. The method of claim 43, comprising denaturing the one or more nucleic acid molecules, thereby forming an expandomer comprising the one or more labels.

45. The method as in any one of claims 1-44, comprising reading a plurality of nucleic acid molecules and generating, from the plurality of reads, a consensus read.

46. The method of claim 45, comprising circularizing one or more nucleic acid molecules and performing rolling circle amplification.

47. The method as in any one of claims 1-46, wherein two or more components are assembled into an identifier nucleic acid molecule using a single-stranded DNA scaffold.

48. The method of claim 47, comprising binding one or more templating strands to (a) the scaffold and (b) to a component, each templating strand being at least partially complementary to the scaffold and one component.

49. The method of claim 48, wherein each templating strand corresponds to one specific position along the scaffold.

50. The method as in any one of claims 48-49, wherein ligation of two component edges via a templating strand occurs via ligation with overhangs of 6 bases or fewer.

51. The method as in any one of claims 48-49, wherein ligation of two component edges via a templating strand occurs via ligation with overhangs of 1 or 2 bases.

52. The method as in any one of claims 48-49, wherein ligation of two component edges via a templating strand occurs via blunt end ligation.

53. The method as in any one of claims 1-52, wherein modifying comprises replacement of one or more components.

54. The method of claim 53, wherein one or more components are replaced with a component with a different length.

55. The method as in any one of claims 53-54, wherein one or more components are replaced with a component that is a result of chemical computation.

56. The method as in any one of claims 1-55, wherein modifying comprises a nick-translation scheme.

57. The method of claim 56, wherein each component comprises a nicking endonuclease recognition site.

58. The method as in any one of claims 1-57, wherein chemically linking together comprises:

(e) binding, to the second overhang of the template strand, a portion of a second component, the second component comprising a recognition site for a nicking endonuclease.

59. The method of claim 58, comprising repeating steps (c)-(e).

60. The method as in any one of claims 58-59, wherein a binding portion of the first or second component is modified.

61. The method as in any one of claims 58-60, wherein the polymerase is a strand displacing polymerase.

62. The method as in any one of claims 1-61, wherein translocating comprises repeatedly passing the identifier nucleic acid molecule through the nanopore by iteratively changing nanopore voltage polarity.

63. The method as in any one of claims 1-62, wherein each identifier nucleic acid molecule comprises a DNA dumbbell.

64. The method as in any one of claims 1-63, wherein each identifier nucleic acid molecule comprises two DNA dumbbells.

65. The method of claim 64, wherein each identifier nucleic acid molecule comprises a first DNA dumbbell disposed between a first end and a first component of the identifier nucleic acid molecule and a second DNA dumbbell disposed between a second end and the first component of the identifier nucleic acid molecule.

66. The method as in any one of claims 1-65, comprising selecting a subset of identifier nucleic acid molecules based on molecule size.

67. The method of claim 66, wherein one or more nanopores are decorated with one or more molecules to attract an identifier nucleic acid molecule.

68. The method as in any one of claims 1-67, comprising labeling or tagging each of a plurality of identifiers with a bead.

69. The method as in any one of claims 1-68, wherein a nanopore is pre-loaded with a bead-tagged identifier nucleic acid molecule, the bead having a diameter larger than the diameter of the nanopore.

70. The method as in any one of claims 1-69, comprising labeling each of a plurality of identifiers with a helicase configured to act as a stopper preventing complete translocation of the identifier nucleic acid molecule.

71. The method as in any one of claims 1-70, wherein translocating comprises repeatedly passing the identifier nucleic acid molecule through the nanopore by iteratively changing nanopore voltage polarity.

72. The method as in any one of claims 1-71, comprising dynamically adjusting nanopore translocation rate based on molecule detection rate.

73. The method as in any one of claims 1-72, comprising modifying at least a portion of one of the one or more labels with one or more fluorescent labels, reading one or more fluorescent signals from the one or more fluorescent labels and translating the one or more signals into the information, the reading comprising

74. A method for reading information written into nucleic acid sequence(s), comprising:

(a) providing a plurality of nucleic acid molecules, each molecule comprising a plurality of nucleic acid motifs;

(b) modifying at least a portion of the plurality of nucleic acid molecules with one or more structural labels and decorating at least one of the one or more structural labels with a fluorescent label;

(c) reading one or more fluorescent signals and translating the one or more fluorescent signals into the information, the reading comprising (i) detecting a change in optical signal along a length of a nucleic acid molecule, wherein a first optical signal corresponds to a first motif and a second optical signal corresponds to a second motif;

75. The method as in any one of claims 1-74, comprising: translating the information into the string of symbols; mapping the string of symbols to a plurality of identifiers, wherein each individual identifier of the plurality of identifiers comprises a combination of a plurality of components from a library of components, wherein each component in an individual identifier of the plurality of identifiers comprises a distinct nucleic acid sequence, and wherein each identifier represents a symbol position of an individual symbol in the string of symbols; and forming at least one individual identifier of the plurality of identifiers by depositing the plurality of components into a compartment, wherein the plurality of components assemble via one or more reactions in the compartment to form the combination of components of the at least one individual identifier representing at least one symbol position of an individual symbol in the string of symbols.

76. The method of claim 75, wherein each symbol is one of two or more possible symbol values.

77. The method of claim 75, wherein a first symbol value of two possible symbol values is represented by an absence of a distinct identifier of the plurality of identifiers, and wherein a second symbol value of the two possible symbol values is represented by a presence of the distinct identifier, or vice versa.
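
The presence/absence encoding of claim 77 can be sketched as follows, assuming a binary symbol string in which the identifier for position `i` is present when the symbol is "1" and absent when it is "0" (the claim also allows the reverse convention). The function names `encode_presence` and `decode_presence`, and the use of integer positions as stand-ins for identifiers, are hypothetical.

```python
def encode_presence(bits):
    """Return the set of positions whose identifier is present.

    Assumption: "1" means the identifier is present, "0" means absent.
    """
    if any(b not in "01" for b in bits):
        raise ValueError("this sketch handles binary symbol strings only")
    return {i for i, b in enumerate(bits) if b == "1"}


def decode_presence(present, length):
    """Rebuild the symbol string from the set of detected identifiers."""
    return "".join("1" if i in present else "0" for i in range(length))


if __name__ == "__main__":
    present = encode_presence("01101")
    print(sorted(present))
    print(decode_presence(present, 5))
```

Under this convention only the identifiers for one of the two symbol values need to be assembled, which is consistent with claim 78.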

78. The method as in any one of claims 75-77, wherein the plurality of components assemble via the one or more reactions to form identifiers that represent symbol values that are represented by a presence of the identifiers.

79. The method of claim 75, wherein the distinct nucleic acid sequence of an individual component appears in more than one distinct identifier.

80. The method as in any one of claims 75-79, wherein each component comprises a distinct nucleic acid sequence with first and second ends, a first hybridization region on the first end, and a second hybridization region on the second end.

81. The method as in any one of claims 75-80, wherein each of the plurality of components belongs in one of M layers, and wherein one component from each of the M layers assemble to form the at least one identifier.

82. The method of claim 81, wherein each layer of the M layers comprises a distinct set of components.

83. The method of claim 82, wherein within each layer, the components have a common first hybridization region and a common second hybridization region.
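
The layered scheme of claims 81-83 can be illustrated with a short sketch in which an identifier is modeled as one component chosen from each of M layers. The component names, the function names, and the particular ordering used to map a symbol position to an identifier (a mixed-radix ordering that matches `itertools.product`) are assumptions for illustration only; the claims do not prescribe a specific mapping.

```python
from itertools import product


def build_identifier_space(layers):
    """Enumerate every identifier as one component per layer, in a fixed order."""
    return list(product(*layers))


def identifier_for_position(layers, position):
    """Map a symbol position to an identifier without enumerating all of them.

    Treats the position as a mixed-radix number whose digits index into
    each layer, with the last layer varying fastest.
    """
    total = 1
    for layer in layers:
        total *= len(layer)
    if not 0 <= position < total:
        raise IndexError("position is outside the identifier space")

    chosen = []
    for layer in reversed(layers):
        position, index = divmod(position, len(layer))
        chosen.append(layer[index])
    return tuple(reversed(chosen))


if __name__ == "__main__":
    # Hypothetical component names: two layers with 2 and 3 components.
    layers = [["A1", "A2"], ["B1", "B2", "B3"]]
    print(len(build_identifier_space(layers)))
    print(identifier_for_position(layers, 4))
```

With M layers of sizes n_1 through n_M, the number of distinct identifiers is the product of the layer sizes, while the number of distinct components is only their sum.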

84. The method of claim 75, wherein each component in the compartment has first and second hybridization regions, and the first or second hybridization region of each component is complementary to the first or second hybridization region of another component, and wherein the one or more reactions comprise hybridization of the first and second complementary hybridization regions.

85. The method of claim 75, wherein a subset of the plurality of identifiers are formed via one reaction in a multiplex fashion.

86. The method of claim 75, wherein the one or more reactions comprise overlap-extension polymerase chain reaction (PCR), polymerase cycling assembly, sticky end ligation, ligase cycling reaction, or template directed ligation.

87. The method as in any one of claims 75-86, further comprising generating an identifier library comprising the at least one formed identifier and at least one additional formed identifier.

88. The method of claim 87, wherein the identifier library comprises a distinct barcode, metadata of the information, or both.

89. The method of claim 88, further comprising extracting a targeted subset of the identifier library.

90. The method of claim 89, further comprising combining a plurality of probes with the identifier library, wherein the plurality of probes share complementarity with the plurality of components of each identifier of the targeted subset such that each identifier of the targeted subset hybridizes with at least one probe when combined with the plurality of probes.

91. The method of claim 90, wherein the plurality of probes comprises one or more affinity tags, and wherein the one or more affinity tags are captured by an affinity bead or an affinity column.

92. The method of claim 89, wherein each identifier comprises one or more common primer binding regions, one or more variable primer binding regions, or any combination thereof.

93. The method of claim 92, further comprising: combining the identifier library with primers that bind to the one or more common primer binding regions or to the one or more variable primer binding regions, wherein the primers bind to the one or more common primer binding regions or to the one or more variable primer binding regions; and selectively amplifying said targeted subset of said identifier library.

94. The method of claim 89, further comprising selectively removing a portion of nontargeted identifiers from said identifier library.

95. A system for reading information written into nucleic acid sequence(s), the system comprising: one or more first reagents and components to assemble one or more nucleic acid molecules encoding information; one or more second reagents to label one or more nucleic acid molecules; a fluidic device configured to transfer fluid, the fluid comprising one or more of the first or second reagents; and a processor and a memory functionally connected to the fluidic device, the memory comprising instructions that, when executed, cause the processor to actuate one or more components of the device to perform one or more method(s) as in any one of claims 1-94.