WO2024159068A1 - Quality control for dna data storage - Google Patents
Quality control for DNA data storage
- Publication number
- WO2024159068A1 (PCT/US2024/013050)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- instances
- polynucleotides
- data
- bases
- cases
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C13/00—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
- G11C13/0002—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
- G11C13/0009—RRAM elements whose operation depends upon chemical change
- G11C13/0014—RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
- G11C13/0019—RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B82—NANOTECHNOLOGY
- B82Y—SPECIFIC USES OR APPLICATIONS OF NANOSTRUCTURES; MEASUREMENT OR ANALYSIS OF NANOSTRUCTURES; MANUFACTURE OR TREATMENT OF NANOSTRUCTURES
- B82Y15/00—Nanotechnology for interacting, sensing or actuating, e.g. quantum dots as markers in protein assays or molecular motors
Definitions
- a goal of DNA data storage can be to provide a long lasting backup, especially, for example, where other backups fail.
- verifying proper synthesis and/or storage of polynucleotide pools can be critical.
- these pools can be extremely large, and a typical full sequencing run can be very expensive and may not be scalable.
- Quality control of polynucleotide pools can comprise verifying the data integrity and/or quantifying DNA degradation.
- quality control further comprises correcting a polynucleotide pool or a subset thereof, as needed. In these instances, the original data in the polynucleotide pool is not available unless the polynucleotide pool is fully decoded and/or sequenced.
- the plurality of polynucleotides can store digital information (e.g., binary data). The quality control of the plurality of polynucleotides can be performed before, during, and/or after synthesis or storage of the plurality of polynucleotides.
- the quality control of the plurality of polynucleotides can be performed on any suitable synthesis or storage device, such as those described herein.
- the quality control can further comprise quality control polynucleotides and/or one or more codecs, as further described herein.
- the provided designs and implementations can provide cost-effective and/or scalable quality control of a plurality of polynucleotides.
- Attorney Docket No. 00415-0047-00304
- a method for quality control (QC) of data polynucleotides comprising: (i) providing a plurality of QC polynucleotides on a surface, wherein the plurality of QC polynucleotides comprises a first primer sequence; (ii) amplifying the plurality of QC polynucleotides based on the first primer sequence; (iii) sequencing the plurality of QC polynucleotides; and (iv) aligning the plurality of QC polynucleotides against a reference to estimate an error rate in the data polynucleotides, a synthesis uniformity in the data polynucleotides, or a combination thereof for QC of the data polynucleotides.
- the error rate, the synthesis uniformity, or a combination thereof is based at least in part on a relative read count of the plurality of QC polynucleotides.
- the plurality of QC polynucleotides is about 1% or less of the polynucleotides on the surface.
- the plurality of QC polynucleotides are provided at a portion of the surface.
- the plurality of QC polynucleotides are provided uniformly on the surface.
- the data polynucleotides comprise a second primer sequence.
- the first primer sequence is different than the second primer sequence.
- the first primer sequence and the second primer sequence are different lengths.
- the quality control is performed after synthesis of the data polynucleotides. In some instances, the QC is performed prior to cleavage of the data polynucleotides from the surface. In some instances, each of the QC polynucleotides is about 50 to 200 nucleobases in length. In some instances, each of the data polynucleotides is about 100 to about 300 nucleobases in length.
- a method for quality control (QC) of data polynucleotides comprising: (i) selecting a subset of a plurality of data polynucleotides; (ii) applying an inner codec to the subset of the plurality of data polynucleotides, wherein the inner codec comprises probabilistic decoding; and (iii) estimating an error rate in the plurality of polynucleotides based at least in part on a likelihood associated with each decoded sequence in the subset of the plurality of data polynucleotides.
- a high likelihood is associated with a lower error rate.
- a low likelihood is associated with a higher error rate.
- the method further comprises decoding an index of the subset of the data polynucleotides.
- the index is decoded using the inner codec, an outer codec, or a combination thereof.
- the index is used to estimate a relative distribution of the subset of the plurality of data polynucleotides.
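As a minimal sketch of how decoded indices might be turned into a relative distribution, the Python below counts how often each index appears in the sampled subset and normalizes the counts. The function name `relative_distribution` and the integer indices are illustrative assumptions; the source does not specify how the distribution is computed.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def relative_distribution(decoded_indices: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the fraction of sampled reads observed for each decoded index."""
    counts = Counter(decoded_indices)
    total = sum(counts.values())
    # An empty sample yields an empty distribution.
    return {index: count / total for index, count in counts.items()}


# Hypothetical indices decoded from four sampled data polynucleotides.
distribution = relative_distribution([0, 0, 1, 2])
print(distribution)
```

An index that is strongly under-represented relative to the others could then be examined further, under whatever threshold the operator chooses.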
- the QC is performed during synthesis of the polynucleotides, on stored polynucleotides, or a combination thereof.
- the subset of the plurality of the data polynucleotides are selected at random. In some instances, the subset of the plurality of the data polynucleotides are selected based at least in part on their location on a surface.
- the plurality of data polynucleotides comprises about 100,000 polynucleotides. In some instances, the subset of the plurality of data polynucleotides is about 0.1 % of the plurality of data polynucleotides.
- the method is used in conjunction with current sensing, optical imaging, flow sensing, size estimation, quality estimation, mass estimation, or any combination thereof. In some instances, the current sensing comprises measuring a current of a chip or a section of the chip. In some instances, the current is compared to a reference value. In some instances, a difference between the current and the reference value is indicative of a chip failure, a deblocking failure, or a combination thereof.
- the current sensing is performed before synthesis of the plurality of data polynucleotides. In some instances, the current sensing is used to detect a chip defect, adjust polynucleotide synthesis locations on a chip, or a combination thereof. In some instances, mass estimation is performed using fluorescence. In some instances, the fluorescence is used to detect a yield of the plurality of polynucleotides. In some instances, optical imaging comprises detecting a chip defect, non-uniformity, or a combination thereof.
- Also provided herein is a method of performing QC of a plurality of cells on a surface, comprising: (i) measuring a current of each cell in the plurality of cells on the surface; (ii) determining if one or more cells in the plurality of cells comprises a defect based at least in part on the current; and (iii) synthesizing and/or storing polynucleotides at a second one or more cells in the plurality of cells, wherein the second one or more cells do not comprise the defect.
- the defect comprises a physical defect.
- the surface is a synthesis surface, a storage surface, or a combination thereof. In some instances, the method further comprises blocking the one or more cells comprising the defect.
- blocking is performed by a protecting group on the surface. In some instances, blocking is performed by a photolabile protecting group on the surface. In some instances, blocking is performed by selectively supplying energy to the one or more cells. In some instances, blocking is performed by a masking material. In some instances, blocking is performed by addressable control of each cell in the plurality of cells.
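The cell-level QC steps above (measure a current per cell, compare it with a reference value, then synthesize only at cells without a defect) can be sketched in Python as follows. This is an illustration under stated assumptions: the cell IDs, the reference value, the tolerance, and the function names `find_defective_cells` and `select_usable_cells` are hypothetical, and the source does not specify how the difference from the reference is thresholded.

```python
from typing import Dict, List


def find_defective_cells(currents: Dict[str, float],
                         reference: float,
                         tolerance: float) -> List[str]:
    """Return IDs of cells whose measured current differs from the
    reference value by more than the allowed tolerance."""
    return sorted(
        cell for cell, current in currents.items()
        if abs(current - reference) > tolerance
    )


def select_usable_cells(currents: Dict[str, float],
                        reference: float,
                        tolerance: float) -> List[str]:
    """Return IDs of cells that passed the current check and can be
    used for synthesis and/or storage."""
    defective = set(find_defective_cells(currents, reference, tolerance))
    return sorted(cell for cell in currents if cell not in defective)


# Hypothetical per-cell currents in arbitrary units (assumed values).
measured = {"A1": 1.02, "A2": 0.10, "A3": 0.98, "A4": 1.75}
defective = find_defective_cells(measured, reference=1.0, tolerance=0.2)
usable = select_usable_cells(measured, reference=1.0, tolerance=0.2)
print("defective:", defective)
print("usable:", usable)
```

The defective list would be the input to whichever blocking mechanism is used; the sketch only covers the decision, not the blocking itself.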
- FIG.2 shows a non-limiting example of periodic quality control of stored polynucleotides in accordance with some embodiments.
- FIG.3 shows a non-limiting example of digital information storage in accordance with some embodiments.
- FIG.4 shows a non-limiting example of generating a hash in accordance with some embodiments.
- FIG.5 shows a non-limiting example of an encoding scheme, including an outer codec, in accordance with some embodiments.
- FIG.6 shows a non-limiting example of an encoding scheme, including shuffling lanes of binary data, in accordance with some embodiments.
- FIG.7 shows a non-limiting example of an encoding scheme, including an inner codec, in accordance with some embodiments.
- FIG.8 shows a non-limiting example of an encoding scheme, including an alternative inner codec, in accordance with some embodiments.
- FIG.9 shows a non-limiting example of a decoding scheme, including an inner codec and an outer codec, in accordance with some embodiments.
- FIG.10 shows a non-limiting example of a greedy algorithm for decoding in accordance with some embodiments.
- FIG.11 shows a non-limiting example of a maximum likelihood (ML) algorithm for decoding in accordance with some embodiments.
- ML maximum likelihood
- FIG.12 shows a non-limiting example of a computing device; in this case, a device with one or more processors, memory, storage, and a network interface.
- DETAILED DESCRIPTION [021] Provided herein are methods and systems for quality control of digital information stored in nucleic acids. For DNA data storage to be a viable option for long-lasting storage, scalable, efficient, and cost-effective methods for verifying proper synthesis and/or storage of polynucleotide pools can be critical. As such, provided herein are quality control methods for verifying digital information encoded in nucleic acids, referred to herein as data polynucleotides.
- the quality control method comprises using designated quality control (QC) polynucleotides that are synthesized and/or stored with data polynucleotides. In some cases, the quality control method comprises verifying a subset of data polynucleotides.
- the methods provided herein use QC polynucleotides or a subset of the data polynucleotides as a proxy to estimate an error rate, uniformity, or a combination thereof in the data polynucleotides.
- the methods provide for quality control (QC) of data polynucleotides. In some instances, the method comprises providing a plurality of QC polynucleotides on a surface.
- the plurality of QC polynucleotides comprises a first primer sequence. In some instances, the method comprises amplifying the plurality of QC polynucleotides based on the first primer sequence. In some instances, the method comprises sequencing the plurality of QC polynucleotides. In some instances, the method comprises aligning the plurality of QC polynucleotides against a reference. In some examples, aligning is done to estimate an error rate in the data polynucleotides, a synthesis uniformity in the data polynucleotides, or a combination thereof for QC of the data polynucleotides. [023] In some instances, the methods provide for quality control (QC) of data polynucleotides.
- QC quality control
- the method comprises selecting a subset of a plurality of data polynucleotides. In some instances, the method comprises applying an inner codec to the subset of the plurality of data polynucleotides. In some examples, the inner codec comprises probabilistic decoding. In some instances, the method comprises estimating an error rate in the plurality of polynucleotides. In some examples, the error rate in the plurality of polynucleotides is based at least in part on a likelihood associated with each decoded sequence in the subset of the plurality of data polynucleotides.
- the methods are for performing quality control (QC) of a plurality of cells on a synthesis surface.
- the method comprises measuring a current of each cell in the plurality of cells on the surface.
- the method comprises determining if one or more cells in the plurality of cells comprises a defect.
- determining the defect is based at least in part on the current.
- the method comprises synthesizing and/or storing polynucleotides at a second one or more cells in the plurality of cells. In some examples, the second one or more cells do not comprise the defect.
- Nucleic Acid Based Information Storage [026] Provided herein are devices, compositions, systems and methods for nucleic acid-based information (data) storage.
- a biomolecule such as a DNA molecule provides a suitable host for information storage in part due to its stability over time and capacity for enhanced information coding, as opposed to traditional binary information coding.
- a digital sequence encoding an item of information i.e., digital information in a binary code for processing by a computer
- An encryption scheme is applied to convert the digital sequence from a binary code to a nucleic acid sequence.
- a surface material for nucleic acid extension, a design for loci for nucleic acid extension (aka, arrangement spots), and reagents for nucleic acid synthesis are selected.
- an early step of the data storage process disclosed herein includes obtaining or receiving one or more items of information in the form of an initial code.
- Items of information e.g., digital information
- Items of information include, without limitation, text, audio and visual information.
- Exemplary sources for items of information include, without limitation, books, periodicals, electronic databases, medical records, letters, forms, voice recordings, animal recordings, biological profiles, broadcasts, films, short videos, emails, bookkeeping phone logs, internet activity logs, drawings, paintings, prints, photographs, pixelated graphics, and software code.
- Exemplary biological profile sources for items of information include, without limitation, gene libraries, genomes, gene expression data, and protein activity data.
- Exemplary formats for items of information include, without limitation, .txt, .PDF, .doc, .docx, .ppt, .pptx, .xls, .xlsx, .rtf, .jpg, .gif, .psd, .bmp, .tiff, .png, and .mpeg.
- the size of an individual file encoding an item of information, or of a plurality of files encoding items of information, in digital format includes, without limitation, up to 1024 bytes (equal to 1 KB), 1024 KB (equal to 1 MB), 1024 MB (equal to 1 GB), 1024 GB (equal to 1 TB), 1024 TB (equal to 1 PB), 1 exabyte, 1 zettabyte, 1 yottabyte, 1 xenottabyte or more.
- In some instances, an amount of digital information is at least 1 gigabyte (GB).
- the amount of digital information is at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more than 1000 gigabytes. In some instances, the amount of digital information is at least 1 terabyte (TB). In some instances, the amount of digital information is at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more than 1000 terabytes. In some instances, the amount of digital information is at least 1 petabyte (PB).
- PB petabyte
- the amount of digital information is at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more than 1000 petabytes.
- the digital information does not contain genomic data acquired from an organism. Items of information in some instances are encoded. Non-limiting encoding method examples include 1 bit/base, 2 bit/base, 4 bit/base or another encoding method. [029] Systems and Methods for Quality Control of Polynucleotide Pools [030] Provided herein are systems and methods for QC of a polynucleotide pool or a plurality of polynucleotide pools.
- the polynucleotide pool or the plurality of polynucleotide pools comprise data polynucleotides.
- the data polynucleotides comprise digital information, such as binary data.
- the digital information comprises an item of information, such as, but not limited to, those described herein.
- the polynucleotide pool or the plurality of polynucleotide pools comprises one or more items of information.
- the one or more items of information are encoded in data polynucleotides in a polynucleotide pool or a plurality of polynucleotide pools. [031] Provided herein are systems and methods for QC of data polynucleotides.
- Data polynucleotides can encode digital information as described herein.
- the digital information is encoded as data polynucleotides using the systems and methods described herein.
- the QC methods described herein are agnostic to the systems and methods of encoding digital information in polynucleotides.
- the QC methods described herein are agnostic to the size or type of the digital information encoded in polynucleotides.
- the QC methods described herein are agnostic to the size of the polynucleotide pool described herein. [032]
- QC of data polynucleotides comprises a plurality of QC polynucleotides.
- the QC polynucleotides can be provided on a surface, for example, for synthesis and/or storage of the data polynucleotides, such as those described herein. In some cases, the QC polynucleotides are synthesized at the same time as the data polynucleotides. In some cases, the QC polynucleotides are provided on a surface or a portion of a surface. In some cases, the QC polynucleotides are provided uniformly on a surface or a portion of a surface. In some instances, the portion of the surface comprises a discrete location (e.g., loci, cell, feature, etc.) or a plurality of discrete locations on the surface.
- a discrete location e.g., loci, cell, feature, etc.
- the QC polynucleotides are about 1 % to about 10 % of the polynucleotides on the surface. In some instances, the polynucleotides on the surface comprise the QC polynucleotides and the data polynucleotides encoding digital information.
- the QC polynucleotides are about 1 % to about 2 %, about 1 % to about 3 %, about 1 % to about 4 %, about 1 % to about 5 %, about 1 % to about 6 %, about 1 % to about 7 %, about 1 % to about 8 %, about 1 % to about 9 %, about 1 % to about 10 %, about 2 % to about 3 %, about 2 % to about 4 %, about 2 % to about 5 %, about 2 % to about 6 %, about 2 % to about 7 %, about 2 % to about 8 %, about 2 % to about 9 %, about 2 % to about 10 %, about 3 % to about 4 %, about 3 % to about 5 %, about 3 % to about 6 %, about 3 % to about 7 %, about 3 % to about 8 %, about 3 % to about 9
- the QC polynucleotides are about 1 %, about 2 %, about 3 %, about 4 %, about 5 %, about 6 %, about 7 %, about 8 %, about 9 %, or about 10 % of the polynucleotides on the surface. In some cases, the QC polynucleotides are at least about 1 %, about 2 %, about 3 %, about 4 %, about 5 %, about 6 %, about 7 %, about 8 %, or about 9 % of the polynucleotides on the surface.
- the QC polynucleotides are at most about 2 %, about 3 %, about 4 %, about 5 %, about 6 %, about 7 %, about 8 %, about 9 %, or about 10 % of the polynucleotides on the surface.
- the length of each of the plurality of QC polynucleotides is about 20 to about 500 bases.
- the length of each of the plurality of QC polynucleotides is about 20 bases to about 50 bases, about 20 bases to about 100 bases, about 20 bases to about 200 bases, about 20 bases to about 300 bases, about 20 bases to about 400 bases, about 20 bases to about 500 bases, about 50 bases to about 100 bases, about 50 bases to about 200 bases, about 50 bases to about 300 bases, about 50 bases to about 400 bases, about 50 bases to about 500 bases, about 100 bases to about 200 bases, about 100 bases to about 300 bases, about 100 bases to about 400 bases, about 100 bases to about 500 bases, about 200 bases to about 300 bases, about 200 bases to about 400 bases, about 200 bases to about 500 bases, about 300 bases to about 400 bases, about 300 bases to about 500 bases, or about 400 bases to about 500 bases.
- the length of each of the plurality of QC polynucleotides is about 20 bases, about 50 bases, about 100 bases, about 200 bases, about 300 bases, about 400 bases, or about 500 bases. In some cases, the length of each of the plurality of QC polynucleotides is at least about 20 bases, about 50 bases, about 100 bases, about 200 bases, about 300 bases, or about 400 bases. In some cases, the length of each of the plurality of QC polynucleotides is at most about 50 bases, about 100 bases, about 200 bases, about 300 bases, about 400 bases, or about 500 bases. [034] In some cases, the plurality of QC polynucleotides each comprise a first primer sequence.
- the first primer sequence of each of the plurality of QC polynucleotides is different than a second primer sequence of each of the data polynucleotides. In some instances, the first primer sequence of the plurality of QC polynucleotides is a different length than a second primer sequence of data polynucleotides. In some instances, the first primer sequence is unique to a polynucleotide pool.
- a first plurality of QC polynucleotides for QC of a first polynucleotide pool encoding a first file can have a different primer sequence than a second plurality of QC polynucleotides for QC of a second polynucleotide pool encoding a second file.
- a polynucleotide pool can comprise two or more files and two or more QC polynucleotides can each have a unique primer sequence for QC of each of the files.
- a plurality of QC polynucleotides with a primer sequence can be used for QC of a plurality of polynucleotide pools.
- the length of the first primer sequence is about 10 bases to about 50 bases. In some cases, the length of the first primer sequence is about 10 bases to about 15 bases, about 10 bases to about 18 bases, about 10 bases to about 20 bases, about 10 bases to about 22 bases, about 10 bases to about 25 bases, about 10 bases to about 28 bases, about 10 bases to about 30 bases, about 10 bases to about 35 bases, about 10 bases to about 40 bases, about 10 bases to about 45 bases, about 10 bases to about 50 bases, about 15 bases to about 18 bases, about 15 bases to about 20 bases, about 15 bases to about 22 bases, about 15 bases to about 25 bases, about 15 bases to about 28 bases, about 15 bases to about 30 bases, about 15 bases to about 35 bases, about 15 bases to about 40 bases, about 15 bases to about 45 bases, about 15 bases to about 50 bases, about 18 bases to about 20 bases, about 18 bases to about 22 bases, about 18 bases to about 25 bases, about 18 bases to about 28 bases, about 18 bases to about 30 bases, about 18 bases to about 35 bases, about 15 bases to about 40 bases, about 15
- the length of the first primer sequence is about 10 bases, about 15 bases, about 18 bases, about 20 bases, about 22 bases, about 25 bases, about 28 bases, about 30 bases, about 35 bases, about 40 bases, about 45 bases, or about 50 bases. In some cases, the length of the first primer sequence is at least about 10 bases, about 15 bases, about 18 bases, about 20 bases, about 22 bases, about 25 bases, about 28 bases, about 30 bases, about 35 bases, about 40 bases, or about 45 bases. In some cases, the length of the first primer sequence is at most about 15 bases, about 18 bases, about 20 bases, about 22 bases, about 25 bases, about 28 bases, about 30 bases, about 35 bases, about 40 bases, about 45 bases, or about 50 bases.
- the QC polynucleotides described herein can be extracted and/or amplified.
- the plurality of QC polynucleotides are amplified.
- the plurality of QC polynucleotides are amplified based on their primer sequence (e.g., first primer sequence).
- the plurality of QC polynucleotides are extracted and/or amplified from surfaces where they are synthesized or stored. After extraction and/or amplification of QC polynucleotides from the surface of a structure, suitable sequencing technology may be employed to sequence the polynucleotides, as further described herein.
- the DNA sequence is read on the substrate or within a feature of a structure.
- the plurality of QC polynucleotides are aligned.
- the plurality of QC polynucleotides are aligned against a reference.
- the reference is a known sequence or a preselected sequence.
- the known sequence or the preselected sequence is the original sequence of the plurality of QC polynucleotides.
- the plurality of QC polynucleotides are aligned against a reference to estimate an error rate in the data polynucleotides, a synthesis uniformity in the data polynucleotides, or a combination thereof.
- An error rate in the data polynucleotides can be estimated by aligning the plurality of QC polynucleotides against a reference.
- the reference is a known sequence, as previously described herein.
- aligning the plurality of QC polynucleotides against a reference generates a relative read count.
- the relative read count comprises the number of sequenced QC polynucleotides that have the same sequence as the reference sequence.
- the relative read count is used to estimate an error rate in the plurality of QC polynucleotides by determining the number of sequenced QC polynucleotides that have the same sequence as the reference sequence out of all the sequenced QC polynucleotides. In some instances, the error rate in the plurality of QC polynucleotides is used to estimate an error rate in the data polynucleotides. In some instances, the error rate in the data polynucleotides is based at least in part on the relative read count. [038] A synthesis uniformity in the data polynucleotides can be estimated by aligning the plurality of QC polynucleotides against a reference.
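The error-rate estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the claimed method: a read counts as matching only when it is identical to the reference, and the error rate is taken as one minus the matching fraction. The function names and the example sequences are hypothetical.

```python
from typing import Sequence


def relative_read_count(reads: Sequence[str], reference: str) -> float:
    """Fraction of sequenced QC reads identical to the reference sequence."""
    if not reads:
        raise ValueError("at least one sequenced read is required")
    matches = sum(1 for read in reads if read == reference)
    return matches / len(reads)


def estimated_error_rate(reads: Sequence[str], reference: str) -> float:
    """Assumed proxy: the share of QC reads that differ from the reference."""
    return 1.0 - relative_read_count(reads, reference)


# Hypothetical QC reference and sequenced QC reads (one read has a substitution).
reference = "ACGTACGT"
reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "ACGTACGT"]
print(relative_read_count(reads, reference))
print(estimated_error_rate(reads, reference))
```

Under the proxy idea in the text, the value obtained for the QC polynucleotides would be carried over as an estimate for the data polynucleotides synthesized alongside them.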
- the reference is a known sequence, as previously described herein.
- aligning the plurality of QC polynucleotides against a reference generates a relative read count, as described herein.
- the relative read count is used to estimate a synthesis uniformity in the plurality of QC polynucleotides by determining the number of sequenced QC polynucleotides that have the same sequence as the reference sequence out of all the sequenced QC polynucleotides.
- the relative read count is used to estimate a synthesis uniformity in one or more discrete locations (e.g., loci, cell, feature, etc.) where the sequenced QC polynucleotides have the same sequence as the reference sequence.
- a particular cell out of a plurality of cells comprises fewer QC polynucleotides or comprises QC polynucleotides that align less closely with the reference compared to QC polynucleotides in other cells.
- the synthesis uniformity in the plurality of QC polynucleotides is used to estimate a synthesis uniformity in the data polynucleotides.
- the synthesis uniformity in the data polynucleotides is based at least in part on the relative read count.
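A per-location view of synthesis uniformity could look like the Python sketch below. It assumes, purely for illustration, that each cell carries its own distinct QC reference sequence so that a read can be attributed to a cell by exact match, and that a cell is flagged when its matching read count falls below half of the mean. The function names, the threshold, and the sequences are hypothetical and are not taken from the source.

```python
from typing import Dict, List, Sequence


def read_counts_per_cell(reads: Sequence[str],
                         reference_by_cell: Dict[str, str]) -> Dict[str, int]:
    """Count reads that exactly match each cell's QC reference sequence."""
    counts = {cell: 0 for cell in reference_by_cell}
    cell_by_reference = {ref: cell for cell, ref in reference_by_cell.items()}
    for read in reads:
        cell = cell_by_reference.get(read)
        if cell is not None:  # reads matching no reference are ignored
            counts[cell] += 1
    return counts


def low_yield_cells(counts: Dict[str, int],
                    min_fraction_of_mean: float = 0.5) -> List[str]:
    """Flag cells whose matching read count is well below the mean."""
    if not counts:
        raise ValueError("at least one cell is required")
    mean = sum(counts.values()) / len(counts)
    return sorted(cell for cell, n in counts.items()
                  if n < min_fraction_of_mean * mean)


# Hypothetical per-cell QC references and sequenced reads.
reference_by_cell = {"c1": "ACGTAC", "c2": "TTGCAA", "c3": "GGATCC"}
reads = ["ACGTAC"] * 4 + ["TTGCAA"] * 4 + ["GGATCC", "GGATCA"]
counts = read_counts_per_cell(reads, reference_by_cell)
flagged = low_yield_cells(counts)
print(counts)
print(flagged)
```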
- the methods for QC of polynucleotides comprising QC polynucleotides, as described herein, are performed after synthesis of the data polynucleotides. In some cases, the methods for QC of polynucleotides comprising QC polynucleotides, as described herein, are performed after an initial synthesis of the data polynucleotides, as illustrated by way of example in FIG.1. In some cases, the methods for QC of polynucleotides comprising QC polynucleotides, as described herein, are performed after re-synthesis of the data polynucleotides if a synthesis or storage error is encountered.
- the methods for QC of polynucleotides comprising QC polynucleotides, as described herein, are performed on data polynucleotides that are stored.
- QC of data polynucleotides comprises selecting a subset of a plurality of data polynucleotides. In some instances, the subset of the plurality of data polynucleotides is selected randomly. In some instances, the subset of the plurality of data polynucleotides is selected pseudo randomly. In some instances, the subset of the plurality of data polynucleotides are selected at least in part based on their location on a synthesis or storage surface, such as those described herein.
- the subset of the plurality of data polynucleotides are selected at least in part based on one or more physical or chemical properties, such as, but not limited to, those measured by current sensing, optical imaging, flow sensing, etc. In some cases, the subset of the plurality of data polynucleotides comprises about 0.01 % to about 5 % of the plurality of data polynucleotides.
- the subset of the plurality of data polynucleotides comprises about 0.01 % to about 0.02 %, about 0.01 % to about 0.05 %, about 0.01 % to about 0.08 %, about 0.01 % to about 0.1 %, about 0.01 % to about 0.2 %, about 0.01 % to about 0.5 %, about 0.01 % to about 1 %, about 0.01 % to about 2 %, about 0.01 % to about 3 %, about 0.01 % to about 4 %, about 0.01 % to about 5 %, about 0.02 % to about 0.05 %, about 0.02 % to about 0.08 %, about 0.02 % to about 0.1 %, about 0.02 % to about 0.2 %, about 0.02 % to about 0.5 %, about 0.02 % to about 1 %, about 0.02 % to about 2 %, about 0.02 % to about 3 %, about 0.02 % to about 4 %, about 0.02 % to
- the subset of the plurality of data polynucleotides comprises about 0.01 %, about 0.02 %, about 0.05 %, about 0.08 %, about 0.1 %, about 0.2 %, about 0.5 %, about 1 %, about 2 %, about 3 %, about 4 %, or about 5 % of the plurality of data polynucleotides.
- the subset of the plurality of data polynucleotides comprises at least about 0.01 %, about 0.02 %, about 0.05 %, about 0.08 %, about 0.1 %, about 0.2 %, about 0.5 %, about 1 %, about 2 %, about 3 %, or about 4 % of the plurality of data polynucleotides.
- the subset of the plurality of data polynucleotides comprises at most about 0.02 %, about 0.05 %, about 0.08 %, about 0.1 %, about 0.2 %, about 0.5 %, about 1 %, about 2 %, about 3 %, about 4 %, or about 5 % of the plurality of data polynucleotides.
- a polynucleotide pool comprises the plurality of data polynucleotides.
- the plurality of data polynucleotides comprise about 100 to 500,000 polynucleotides.
- the plurality of data polynucleotides comprise about 100 to about 500, about 100 to about 1,000, about 100 to about 5,000, about 100 to about 10,000, about 100 to about 50,000, about 100 to about 100,000, about 100 to about 200,000, about 100 to about 300,000, about 100 to about 400,000, about 100 to about 500,000, about 500 to about 1,000, about 500 to about 5,000, about 500 to about 10,000, about 500 to about 50,000, about 500 to about 100,000, about 500 to about 200,000, about 500 to about 300,000, about 500 to about 400,000, about 500 to about 500,000, about 1,000 to about 5,000, about 1,000 to about 10,000, about 1,000 to about 50,000, about 1,000 to about 100,000, about 1,000 to about 200,000, about 1,000 to about 300,000, about 1,000 to about 400,000, about 1,000 to about 500,000, about 5,000 to about 10,000, about 5,000 to about 50,000, about 5,000 to about 100,000, about 5,000 to about 200,000, about 5,000 to about 300,000, about 5,000 to about 400,000, about 1,000 to about 500,000, about 5,000 to
- the plurality of data polynucleotides comprise about 100, about 500, about 1,000, about 5,000, about 10,000, about 50,000, about 100,000, about 200,000, about 300,000, about 400,000, or about 500,000 polynucleotides. In some cases, the plurality of data polynucleotides comprise at least about 100, about 500, about 1,000, about 5,000, about 10,000, about 50,000, about 100,000, about 200,000, about 300,000, or about 400,000 polynucleotides. In some cases, the plurality of data polynucleotides comprise at most about 500, about 1,000, about 5,000, about 10,000, about 50,000, about 100,000, about 200,000, about 300,000, about 400,000, or about 500,000 polynucleotides.
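Selecting a random or pseudo-random subset of a pool, as described above, can be sketched with the Python standard library. The pool of integer identifiers, the function name `select_qc_subset`, and the seed are illustrative assumptions; the example uses a pool of 100,000 entries and a 0.1 % fraction, both of which fall within the ranges given in the text.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_qc_subset(pool: Sequence[T],
                     fraction: float,
                     seed: Optional[int] = None) -> List[T]:
    """Draw a subset of the pool without replacement.

    Passing a seed makes the draw reproducible (pseudo-random selection)."""
    if not pool:
        raise ValueError("pool must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    size = max(1, round(len(pool) * fraction))
    rng = random.Random(seed)
    return rng.sample(pool, size)


# Hypothetical pool: identifiers standing in for 100,000 data polynucleotides.
pool = list(range(100_000))
subset = select_qc_subset(pool, fraction=0.001, seed=7)
print(len(subset))
```

Selection by surface location or by a measured physical property would replace the random draw with a filter on the corresponding metadata.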
- an inner codec is applied to the subset of the plurality of data polynucleotides.
- the inner codec comprises probabilistic decoding.
- An inner codec generally comprises decoding polynucleotides into digital information.
- the inner codec comprises converting or transforming each of the subset of the plurality of data polynucleotides into binary data.
- a full length of the subset of the plurality of data polynucleotides is transformed or converted into binary data (e.g., full decoding).
- a partial length of the subset of the plurality of data polynucleotides is transformed or converted into binary data (e.g., partial decoding).
- the partial length comprises an index, such as those described herein (e.g., lane index, frame index, UUID, content ID, etc.).
- the inner codec is applied to the subset of the plurality of data polynucleotides that have been sequenced. In some instances, the inner codec is applied to the subset of the plurality of data polynucleotides that have or have not been ordered, aligned, clustered, or any combination thereof.
- the plurality of data polynucleotides and/or the subset of the plurality of data polynucleotides are encoded using the methods described herein. In some cases, the plurality of data polynucleotides and/or the subset of the plurality of data polynucleotides are decoded using the methods described herein. In some instances, the inner codec comprises a greedy algorithm. In some instances, the inner codec comprises a maximum likelihood (ML) algorithm. In some instances, the inner codec comprises a mixed greedy ML algorithm. [044] In some cases, the probabilistic decoding of the inner codec provides a likelihood of the overall decoded sequence.
- redundancy within each polynucleotide sequence helps to estimate error rates without knowing a reference polynucleotide.
- if the inner codec decodes sequences with high probabilities and/or in very few steps, then the error rate is likely low.
- if the inner codec decodes sequences with low probabilities and/or takes more steps, then the error rate is likely high.
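The relationship between decoding confidence and error rate can be sketched as follows. This is a minimal illustration and not the inner codec described herein: the `DecodeResult` record, the `estimate_error_level` function, and both threshold values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DecodeResult:
    """Hypothetical per-read summary from a probabilistic inner codec."""
    likelihood: float  # likelihood of the overall decoded sequence, in [0, 1]
    steps: int         # number of decoding steps taken


def estimate_error_level(results, min_likelihood=0.9, max_steps=10):
    """Label a sequenced subset 'low' or 'high' error without a reference.

    Both thresholds are illustrative assumptions, not values from the text.
    """
    avg_likelihood = mean(r.likelihood for r in results)
    avg_steps = mean(r.steps for r in results)
    if avg_likelihood >= min_likelihood and avg_steps <= max_steps:
        return "low"
    return "high"
```

The sketch requires both conditions to hold before reporting a low error rate; a deployment could equally weigh the two signals separately.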
- the data polynucleotides comprise an index. In some instances, an index of the subset of the plurality of data polynucleotides is decoded.
- the index is decoded using an inner codec, an outer codec, or a combination thereof, such as, but not limited to, those described herein.
- the index is used to estimate a relative distribution of the subset of the plurality of polynucleotides.
- the relative distribution is used to estimate uniformity of the data polynucleotides. For example, if the plurality of data polynucleotides comprises about 100,000 polynucleotide sequences, and the subset selected is 0.1% of the data polynucleotides, a distribution centered around 100 decoded indexes can be expected.
- relative distribution changes between subsets of the data polynucleotides indicate a loss of uniformity across the data polynucleotides.
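The arithmetic in the example above, and one possible way to compare relative distributions between subsets, can be sketched as follows. The function names and the use of the largest frequency shift as a comparison measure are assumptions made for illustration; the text does not prescribe a specific metric.

```python
from collections import Counter


def expected_subset_size(pool_size, fraction):
    """Expected number of decoded indexes when sampling `fraction` of the pool."""
    return pool_size * fraction


def index_frequencies(decoded_indexes):
    """Relative frequency of each decoded index within one subset."""
    counts = Counter(decoded_indexes)
    total = sum(counts.values())
    return {index: n / total for index, n in counts.items()}


def max_frequency_shift(subset_a, subset_b):
    """Largest change in relative frequency of any index between two subsets.

    A hypothetical measure: a larger value suggests a loss of uniformity.
    """
    freq_a = index_frequencies(subset_a)
    freq_b = index_frequencies(subset_b)
    keys = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)
```

For a pool of 100,000 sequences and a 0.1 % subset, `expected_subset_size(100_000, 0.001)` gives approximately 100, matching the example above.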
- the methods for QC of data polynucleotides comprising selecting a subset, as described herein, are performed after synthesis of the data polynucleotides. In some cases, the methods for QC of polynucleotides comprising selecting a subset, as described herein, are performed after an initial synthesis of the data polynucleotides, as illustrated by way of example in FIG.1.
- the methods for QC of polynucleotides comprising selecting a subset, as described herein, are performed after re-synthesis of the data polynucleotides if a synthesis or storage error is encountered.
- the methods for QC of polynucleotides comprising selecting a subset, as described herein, are performed on data polynucleotides that are stored, as illustrated by way of example in FIG.2. [047]
- the methods and systems for QC of polynucleotides are used in conjunction with one or more additional methods for QC.
- the one or more additional methods comprise any one of: current sensing, resistance sensing, optical imaging, flow sensing, size estimation, quality estimation, mass estimation, or any combination thereof.
- current or resistance sensing comprises measuring a current or a resistance, respectively, of a chip or a section of the chip.
- a current or resistance is compared to a reference value.
- the difference between the current or resistance and the standard value is indicative of a chip failure, synthesis error, or a combination thereof.
- the chip failure comprises a defect in a chip.
- the defect in the chip causes a polynucleotide synthesis and/or storage problem.
- the synthesis error comprises a deblocking failure.
- mass estimation comprises measuring absorbance to estimate a mass of a polynucleotide sequence. In some cases, mass estimation comprises using fluorescence to measure the mass of a polynucleotide sequence. In some instances, the fluorescence is used to detect a yield of the plurality of polynucleotides. In some cases, optical imaging comprises detecting a chip defect, non-uniformity, or a combination thereof. In some cases, flow sensing is used to detect the flow of a liquid, gas, or a combination thereof over the synthesis and/or storage chip. [048] An exemplary flow diagram of QC of a polynucleotide pool is provided in FIG.1.
- current sensing may be employed prior to synthesis to QC a synthesis chip.
- Current sensing prior to synthesis can be performed to detect a chip defect, adjust polynucleotide synthesis locations on the chip, or a combination thereof. This can be followed by synthesis placement optimizations, followed by the synthesis of data polynucleotides.
- the data polynucleotides are synthesized along with QC polynucleotides, as previously described herein.
- synthesis of the plurality of data polynucleotides is performed with continuous QC using one or more additional methods described herein.
- the continuous QC comprises current sensing, resistance sensing, optical imaging, flow sensing, or a combination thereof.
- post-synthesis QC comprises determining an oligo length distribution and/or mass estimation of the plurality of data polynucleotides using, for example, techniques described herein.
- the QC polynucleotides are amplified and sequenced for QC of the data polynucleotides, as previously described herein. In some instances, the QC polynucleotides are fully sequenced. In some instances, the QC polynucleotides are partially sequenced. In some cases, the QC polynucleotides are aligned, and an error rate and/or a uniformity is estimated, as previously described herein.
- the data polynucleotides are amplified and a subset (or sub-sample) is sequenced.
- the subset is fully decoded.
- the subset is partially decoded.
- partial decoding comprises applying an inner codec, an outer codec, or a combination thereof.
- an inner codec is applied to estimate an error rate, as previously described herein.
- an index of the subset is partially decoded.
- an outer codec is applied to estimate a uniformity, as previously described herein.
- QC of the QC polynucleotides, data polynucleotides, or a combination thereof is used to determine a final QC decision.
- the final QC decision is based on the error rate, uniformity, or both.
- the final QC decision comprises a pass or fail of the synthesized data polynucleotides.
- the data polynucleotides are stored.
- the data polynucleotides are resynthesized.
- the final QC decision comprises a pass for some sections of a synthesis surface and a fail for some sections of the synthesis surface. In some instances, only the data polynucleotides from the pass sections of the synthesis surface are stored. In some instances the data polynucleotides from the fail sections of the synthesis surface are resynthesized. [052] In some cases, the final QC decision comprises a threshold. In some instances, the threshold comprises a static value or a dynamic value. In some instances, the threshold comprises a static range or a dynamic range. In some instances, the threshold is based on one or more combined values or ranges, such as error rate, uniformity, or both.
- the error rate may be less than 5 %, 4 %, 3 %, 2 %, 1 %, 0.5 %, 0.1 %, 0.05 %, 0.01 %, 0.0005 %, or 0.0001 %.
- the uniformity may be greater than 90 %, 91 %, 92 %, 93 %, 94 %, 95 %, 96 %, 97 %, 98 %, 99 %, 99.5 %, 99.9 %, 99.95 %, or 99.99%.
- the error rate may be greater than 10 %, 15 %, 20 %, 25 %, 30 %, 35 %, 40 %, 45 %, or 50 %.
- the uniformity may be less than 80 %, 70%, 60 %, 50 %, 40 %, 30 %, 20 %, or 10 %.
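The threshold-based decision described above can be sketched as follows. This is a minimal illustration using static thresholds; the function names and the default values of 1 % error rate and 95 % uniformity are assumptions picked from within the ranges listed, not prescribed settings.

```python
def qc_decision(error_rate, uniformity, max_error_rate=0.01, min_uniformity=0.95):
    """Return 'pass' when the error rate is low and the uniformity is high.

    Rates are fractions (0.01 == 1 %). The default thresholds are
    illustrative static values, not values prescribed by the text.
    """
    if error_rate < max_error_rate and uniformity > min_uniformity:
        return "pass"
    return "fail"


def route_sections(section_metrics, **thresholds):
    """Split surface sections into those to store and those to resynthesize.

    `section_metrics` maps a section name to (error_rate, uniformity).
    """
    store, resynthesize = [], []
    for section, (error_rate, uniformity) in section_metrics.items():
        if qc_decision(error_rate, uniformity, **thresholds) == "pass":
            store.append(section)
        else:
            resynthesize.append(section)
    return store, resynthesize
```

A dynamic threshold could be modeled by computing `max_error_rate` and `min_uniformity` from earlier QC runs and passing them in as keyword arguments.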
- QC of data polynucleotides in a pool is periodically performed. In some cases, QC of data polynucleotides is performed over days, weeks, months, or years. In some cases, a plurality of data polynucleotides is retrieved from storage. In some cases, the data polynucleotides are amplified, and a subset (or sub-sample) is sequenced. In some cases, the subset is fully decoded. In some cases, the subset is partially decoded. In some instances, partial decoding comprises applying an inner codec, an outer codec, or a combination thereof. In some examples, an inner codec is applied to estimate an error rate, as previously described herein.
- QC of the retrieved data polynucleotides further comprises a pool level QC comprising determining an oligo length distribution and/or mass estimation of the plurality of data polynucleotides using, for example, techniques described herein.
- the pool level QC is combined with the sub-set QC to determine a final QC decision, as shown in FIG.2.
- the final QC decision comprises a pass or fail of the synthesized data polynucleotides.
- the data polynucleotides are returned to storage. In some cases, if the final QC decision is a fail, then the data polynucleotides are fully sequenced and/or decoded. In some instances, the full sequencing and decoding of the data polynucleotides comprises sequencing and/or decoding duplicate data polynucleotides if a sample of data polynucleotides cannot be decoded. In some cases, the data polynucleotides are resynthesized. In some cases, the final QC decision comprises a pass for some sections of a storage surface and a fail for some sections of the storage surface.
- the final QC decision comprises a threshold.
- the threshold comprises a static value or a dynamic value.
- the threshold comprises a static range or a dynamic range.
- the threshold is based on one or more combined values or ranges, such as error rate, uniformity, or both. In some instances, if the error rate is low and the uniformity is high, then the final QC decision is a pass.
- the error rate may be less than 5 %, 4 %, 3 %, 2 %, 1 %, 0.5 %, 0.1 %, 0.05 %, 0.01 %, 0.0005 %, or 0.0001 %.
- the uniformity may be greater than 90 %, 91 %, 92 %, 93 %, 94 %, 95 %, 96 %, 97 %, 98 %, 99 %, 99.5 %, 99.9 %, 99.95 %, or 99.99%.
- if the error rate is high and the uniformity is low, then the final QC decision is a fail.
- the error rate may be greater than 10 %, 15 %, 20 %, 25 %, 30 %, 35 %, 40 %, 45 %, or 50 %.
- the uniformity may be less than 80 %, 70%, 60 %, 50 %, 40 %, 30 %, 20 %, or 10 %.
- QC quality control
- the cells comprise active regions on a surface, such as compartments, locations, loci, features, spots, or any variation thereof suitable for synthesis and/or storage of polynucleotides.
- the surface comprises a synthesis and/or storage surface of a device, such as, but not limited to, those described herein.
- a method for performing QC of a plurality of cells comprises one or more steps.
- a step comprises measuring a physical and/or chemical property of each of the plurality of cells on the surface.
- a step comprises current sensing, resistance sensing, optical imaging, or a combination thereof.
- a voltage is applied to the surface. In some instances, the voltage is about 0.1 V to about 3 V.
- the voltage is about 0.1 V to about 0.25 V, about 0.1 V to about 0.5 V, about 0.1 V to about 0.75 V, about 0.1 V to about 1 V, about 0.1 V to about 1.25 V, about 0.1 V to about 1.5 V, about 0.1 V to about 1.75 V, about 0.1 V to about 2 V, about 0.1 V to about 2.5 V, about 0.1 V to about 3 V, about 0.25 V to about 0.5 V, about 0.25 V to about 0.75 V, about 0.25 V to about 1 V, about 0.25 V to about 1.25 V, about 0.25 V to about 1.5 V, about 0.25 V to about 1.75 V, about 0.25 V to about 2 V, about 0.25 V to about 2.5 V, about 0.25 V to about 3 V, about 0.5 V to about 0.75 V, about 0.5 V to about 1 V, about 0.5 V to about 1.25 V, about 0.5 V to about 1.5 V, about 0.5 V to about 1.75 V, about 0.5 V to about 2
- the voltage is about 0.1 V, about 0.25 V, about 0.5 V, about 0.75 V, about 1 V, about 1.25 V, about 1.5 V, about 1.75 V, about 2 V, about 2.5 V, or about 3 V. In some instances, the voltage is at least about 0.1 V, about 0.25 V, about 0.5 V, about 0.75 V, about 1 V, about 1.25 V, about 1.5 V, about 1.75 V, about 2 V, or about 2.5 V. In some instances, the voltage is at most about 0.25 V, about 0.5 V, about 0.75 V, about 1 V, about 1.25 V, about 1.5 V, about 1.75 V, about 2 V, about 2.5 V, or about 3 V.
- a step comprises measuring a current, resistance, or a combination thereof of each of the plurality of cells on the surface when a voltage is applied.
- the current sensing, resistance sensing, optical imaging, or a combination thereof is used to determine if one or more cells in the plurality of cells comprises a defect.
- the defect comprises a physical defect in the surface.
- the defect is determined based at least in part on the current, resistance, or a combination thereof.
- a current, resistance, or a combination thereof measured in a cell comprising a defect is different than a corresponding standard value measured in cells that do not comprise a defect.
- the current, resistance, or a combination thereof measured in a cell comprising a defect is different than the corresponding standard value by about 1 % to about 40 %. In some instances, the current, resistance, or a combination thereof is different by about 1 % to about 5 %, about 1 % to about 10 %, about 1 % to about 15 %, about 1 % to about 20 %, about 1 % to about 25 %, about 1 % to about 30 %, about 1 % to about 35 %, about 1 % to about 40 %, about 5 % to about 10 %, about 5 % to about 15 %, about 5 % to about 20 %, about 5 % to about 25 %, about 5 % to about 30 %, about 5 % to about 35 %, about 5 % to about 40 %, about 10 % to about 15 %, about 10 % to about 20 %, about 10 % to about 25 %, about 10 % to about 30 %, about 10 % to about 15
- the current, resistance, or a combination thereof is different by about 1 %, about 5 %, about 10 %, about 15 %, about 20 %, about 25 %, about 30 %, about 35 %, or about 40 %. In some instances, the current, resistance, or a combination thereof is different by at least about 1 %, about 5 %, about 10 %, about 15 %, about 20 %, about 25 %, about 30 %, or about 35 %.
- the current, resistance, or a combination thereof is different by at most about 5 %, about 10 %, about 15 %, about 20 %, about 25 %, about 30 %, about 35 %, or about 40 %.
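Defect detection by comparison against a standard value can be sketched as follows. This is a minimal illustration: the function names, the cell identifiers, and the 5 % cutoff are hypothetical, and the same comparison would apply to resistance readings.

```python
def percent_difference(measured, standard):
    """Absolute difference from the standard value, as a percentage."""
    return abs(measured - standard) / abs(standard) * 100.0


def find_defective_cells(currents, standard, min_percent=5.0):
    """Return IDs of cells whose current deviates from the standard value.

    `currents` maps a cell ID to the current measured under an applied
    voltage. `min_percent` is an illustrative cutoff (an assumption).
    """
    return [
        cell_id
        for cell_id, current in currents.items()
        if percent_difference(current, standard) >= min_percent
    ]
```

Cells returned by `find_defective_cells` would be candidates for blocking, with synthesis or storage placed on the remaining cells.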
- the one or more cells comprising the defect are blocked.
- blocking one or more cells comprises leaving a protecting group, such as DMT, on the surface.
- blocking one or more cells comprises preventing deblocking of the protecting group.
- blocking comprises one or more photolabile protecting groups, where the hydroxyl groups generated on the surface are blocked by photolabile-protecting groups.
- a pattern of free hydroxyl groups on the surface may be generated. These hydroxyl groups can react with photoprotected nucleoside phosphoramidites, according to phosphoramidite chemistry.
- a second photolithographic mask can be applied and the surface can be exposed to UV light to generate a second pattern of hydroxyl groups, followed by coupling with 5'-photoprotected nucleoside phosphoramidite.
- patterns can be generated and oligomer chains can be extended.
- blocking further comprises selectively supplying energy to one or more cells.
- a mask is created on a surface through heating elements on or proximal to the surface.
- a layer of masking material is applied to the surface and the heating elements are employed to apply energy to the masking material at selected sites, whereby the applied energy brings about a phase change in the masking material at the selected sites such that it adheres to the surface or can be displaced from the surface to mask or unmask the selected sites respectively.
- the masking material is a solid, gas, liquid, or a combination thereof.
- the masking material comprises, for example, C15-C30 n-alkanes (e.g., tetracosane (C24), icosane (C20), etc.).
- the masking material comprises a mixture of two or more higher straight chain alkanes, such as, for example, C16-C30 n-alkanes or C18-C28 n-alkanes.
- the masking material, for example nanospheres, may be deposited on the surface in the form of a dispersion, for example in acetonitrile.
- blocking further comprises addressable locations on a surface. In some instances, the locations are addressable through one or more electrodes near the locations. In some instances, the one or more electrodes are independently addressable. In some instances, the one or more electrodes at each of the locations on a surface are independently addressable.
- each electrode controls nucleoside (nucleoside phosphoramidite) coupling through electrochemistry at a specific location on the surface.
- reagents comprise protons or other acid molecules.
- electrodes are located at positions around the edges of a surface of a well. In some instances, electrodes control chemical reactions occurring near the synthesis surface. For example, if acid or other reagent is generated near the synthesis surface, the portion of a polynucleotide bound to this surface will be contacted with a higher concentration of acid than the portion of the polynucleotide that is distal to the site of acid generation.
- Electrodes such as those located near the surface of a well, in some instances produce or control a proton gradient which results in uniform or targeted exposure of a portion of the polynucleotide to acid. Sites near uncharged electrodes do not couple with nucleosides deposited over the synthesis surface, and the pattern of charged electrodes is altered before addition of the next nucleoside. By applying a series of electrode-controlled masks to the surface, the desired polynucleotides are synthesized at exact locations on the surface.
- polynucleotides are synthesized and/or stored at a second one or more cells in the plurality of cells. In some instances, the second one or more cells do not comprise a defect. In some instances, the polynucleotides are synthesized and/or stored using the systems and methods described herein. In some instances, the polynucleotides are encoded according to the systems and methods described herein. [063] Systems and Methods for Digital Information Storage [064] Provided herein are methods and systems for storage of digital information. In some cases, the digital information comprises one or more objects. In some cases, the one or more objects comprises an item of information, such as, but not limited to, those described herein.
- the one or more objects comprises a file or metadata associated with the file.
- the digital information comprises binary data.
- the binary data is a byte stream or a byte array.
- each of the one or more objects is about 1 GB to about 1 TB.
- each of the one or more objects is about 1 GB to about 10 GB, about 1 GB to about 50 GB, about 1 GB to about 100 GB, about 1 GB to about 500 GB, about 1 GB to about 1 TB, about 10 GB to about 50 GB, about 10 GB to about 100 GB, about 10 GB to about 500 GB, about 10 GB to about 1 TB, about 50 GB to about 100 GB, about 50 GB to about 500 GB, about 50 GB to about 1 TB, about 100 GB to about 500 GB, about 100 GB to about 1 TB, or about 500 GB to about 1 TB.
- each of the one or more objects is about 1 GB, about 10 GB, about 50 GB, about 100 GB, about 500 GB, or about 1 TB. In some cases, each of the one or more objects is at least about 1 GB, about 10 GB, about 50 GB, about 100 GB, or about 500 GB. In some cases, each of the one or more objects is at most about 10 GB, about 50 GB, about 100 GB, about 500 GB, or about 1 TB.
- a system of storing digital information can comprise one or more processing units, a memory in communication with the one or more processing units, instructions stored in the memory and executed on the one or more processing units, or any combination thereof.
- the one or more processing units and memory are distributed across one or more physical or logical locations.
- the one or more processing units include any combination of central processing units (CPUs), graphics processing units (GPUs), single core processors, multi-core processors, processor clusters, application-specific integrated circuits (ASICs), programmable circuits such as Field Programmable Gate Arrays (FPGAs), an AI-accelerator, and variations thereof.
- one or more of the processing units comprise a Single Instruction Multiple Data (SIMD) or Single Program Multiple Data (SPMD) parallel architecture.
- the one or more processing units include one or more GPUs or CPUs that implement SIMD or SPMD.
- an AI-accelerator comprises Google-TPU, Graphcore, Cerebras, SambaNova, or a combination thereof.
- one or more of the processing units is implemented in software and/or firmware, in addition to hardware implementations.
- Software or firmware implementations of the processing units can include computer- or machine-executable instructions written in any suitable programming language to perform the various functions described herein.
- Software implementations of the one or more processing units can be stored in whole or part in the memory.
- the system can comprise one or more hardware logic components.
- the memory comprises removable storage, non-removable storage, local storage, and/or remote storage to provide storage of instructions, data structures, program modules (e.g., hashing module), and any other data described herein.
- the memory is used to store information related to the algorithms described herein (e.g., software code, parameters, executable instructions, etc.).
- the instructions stored on the memory can comprise one or more steps for storing digital information.
- the one or more steps comprises splitting digital information of one or more objects into a plurality of pools.
- each of the plurality of pools is about 1 GB to about 1 TB.
- each of the plurality of pools is about 1 GB to about 10 GB, about 1 GB to about 50 GB, about 1 GB to about 100 GB, about 1 GB to about 500 GB, about 1 GB to about 1 TB, about 10 GB to about 50 GB, about 10 GB to about 100 GB, about 10 GB to about 500 GB, about 10 GB to about 1 TB, about 50 GB to about 100 GB, about 50 GB to about 500 GB, about 50 GB to about 1 TB, about 100 GB to about 500 GB, about 100 GB to about 1 TB, or about 500 GB to about 1 TB.
- each of the plurality of pools is about 1 GB, about 10 GB, about 50 GB, about 100 GB, about 500 GB, or about 1 TB. In some cases, each of the plurality of pools is at least about 1 GB, about 10 GB, about 50 GB, about 100 GB, or about 500 GB. In some cases, each of the plurality of pools is at most about 10 GB, about 50 GB, about 100 GB, about 500 GB, or about 1 TB. [067] In some instances, the one or more objects comprises an item of information, such as a file, as previously described herein.
- the one or more objects comprises metadata associated with an item of information (e.g., metadata associated with a file).
- metadata associated with an object include a list of keywords attached to an object, an object size, a thumbnail picture, a text summary, an ID range for a sorted key-value database, a timestamp, a version, or any other data providing information about one or more aspects of an object, or any combination thereof.
- the metadata is customizable.
- the metadata is used to search for an object in the plurality of pools. [068] An exemplary diagram of digital information storage is illustrated in FIG.3. As shown, one or more objects 305 can be split into a plurality of pools 310.
- one object is split into a plurality of pools. In some cases, one object is split into two, three, four, five, six, seven, eight, nine, or ten pools. In some cases, more than one object is split into a plurality of pools. In some cases, one or more objects is in a pool. In some cases, one, two, three, four, five, six, seven, eight, nine, or ten objects are in a pool. In some cases, the plurality of pools are duplicated. In some cases, the plurality of pools comprise redundant pools, where two or more pools comprise the same one or more objects. In some cases, two, three, four, five, six, seven, eight, nine, or ten pools comprise the same one or more objects.
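Splitting one object into a plurality of pools, with optional redundant copies, can be sketched as follows. This is a minimal illustration that works on bytes held in memory; the function name, the fixed pool size, and the `copies` parameter are assumptions made for the example.

```python
def split_into_pools(payload: bytes, pool_size: int, copies: int = 1):
    """Split an object's bytes into fixed-size pools, optionally duplicated.

    Returns a list of (object_offset, chunk) pairs. With copies > 1, each
    chunk appears in that many redundant pools.
    """
    if pool_size <= 0 or copies <= 0:
        raise ValueError("pool_size and copies must be positive")
    pools = []
    for offset in range(0, len(payload), pool_size):
        chunk = payload[offset:offset + pool_size]
        pools.extend((offset, chunk) for _ in range(copies))
    return pools
```

Keeping the object offset alongside each chunk makes it possible to record the range of the pool item within the object when the pool is described.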
- Each pool in the plurality of pools can comprise any one of or a combination of a pool descriptor, a pool item, or an end descriptor.
- a pool comprises at least one pool item.
- a pool comprises more than one pool item.
- a pool comprises at least one pool descriptor.
- a pool comprises more than one pool descriptor.
- a pool comprises at least one end descriptor.
- a pool comprises more than one end descriptor.
- each pool comprises a pool descriptor 315, one or more pool items 320, and an end descriptor 325.
- a pool comprises redundant pool items, pool descriptors, end pool descriptors, or a combination thereof.
- two or more pool items, pool descriptors, end pool descriptors, or a combination thereof are identical.
- two, three, four, five, six, seven, eight, nine, or ten, pool descriptors, end pool descriptors, or a combination thereof are identical.
- the one or more steps in the instructions comprise generating a pool descriptor, a pool item, an end descriptor, or any combination thereof in each pool of the plurality of pools.
- the pool descriptor comprises a version, a pool ID, a list of pool item descriptors, or any combination thereof.
- the version comprises the version of information (e.g., if information is updated).
- the pool ID comprises a unique ID of the pool.
- the unique ID comprises a universal unique identifier (UUID).
- the unique ID comprises a content ID.
- the content ID comprises a digital fingerprinting system, which can be used to identify and/or manage copyright or ownership of a content.
- the list of pool item descriptors comprises a path of an object, a size of an object (e.g., a total size of an object), a range of the pool item within an object, offset of the pool item in a pool, or any combination thereof.
- the range of the pool item within an object comprises one or more locations of a payload in the pool item within an object.
- the one or more locations comprises a start and/or an end range of a payload in a pool item (e.g., line 1-6 in pool item 1, line 7-13 in pool item 2, ... etc., in a pool).
- the offset of the pool item comprises a payload location within the one or more pool items in a pool.
- the pool item comprises a data payload and/or a hash of the pool item.
- the data payload comprises the object or a portion of the object that is being stored.
- the hash of the pool item comprises a hashed value of the object or a portion of the object that is being stored.
- the end pool descriptor comprises a list of object descriptors.
- the list of object descriptors comprises a path of the object and/or a hash of the object.
- the path of the object comprises a unique path.
- the path of the object comprises a hierarchy (e.g., directory hierarchy). In some examples, the path of the object does not comprise a hierarchy.
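The pool layout described above (a pool descriptor, one or more pool items, and an end descriptor) can be sketched as a set of data classes. This is a minimal sketch under stated assumptions: all class and field names are hypothetical, a UUID is used as the pool ID, SHA-256 is used as the hash, and the pool holds a single item.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class PoolItemDescriptor:
    path: str            # path of the object
    object_size: int     # total size of the object, in bytes
    range_start: int     # start of this item's payload within the object
    range_end: int       # end (exclusive) of this item's payload within the object
    offset_in_pool: int  # where the payload sits within the pool


@dataclass
class PoolItem:
    payload: bytes       # the object, or a portion of it
    payload_hash: str    # hash of the payload


@dataclass
class ObjectDescriptor:
    path: str
    object_hash: str     # hash of the whole object


@dataclass
class Pool:
    version: int
    pool_id: str
    item_descriptors: List[PoolItemDescriptor] = field(default_factory=list)
    items: List[PoolItem] = field(default_factory=list)
    end_descriptor: List[ObjectDescriptor] = field(default_factory=list)


def build_pool(path: str, data: bytes, start: int, end: int) -> Pool:
    """Store data[start:end] of one object as a single-item pool."""
    payload = data[start:end]
    pool = Pool(version=1, pool_id=str(uuid.uuid4()))
    pool.item_descriptors.append(
        PoolItemDescriptor(path, len(data), start, end, offset_in_pool=0)
    )
    pool.items.append(PoolItem(payload, hashlib.sha256(payload).hexdigest()))
    pool.end_descriptor.append(
        ObjectDescriptor(path, hashlib.sha256(data).hexdigest())
    )
    return pool
```

For example, `build_pool("docs/a.txt", b"hello world", 0, 5)` yields a pool whose single item carries the payload `b"hello"` together with its hash.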
- the systems and methods for storing digital information can comprise one or more hashes. In some cases, the one or more hashes are determined using a hashing module. In some cases, the hashing module is executed on the one or more processing units, such as those described herein. In some cases, the hashing module comprises instructions for determining the one or more hashes (e.g., a hash function). In some cases, the instructions (e.g., a hash function) are stored on a memory, such as those described herein.
- a hash may be determined using a hash function (FIG.4).
- a hash function generally comprises a function that turns an input of arbitrary length into an output with a fixed length (e.g., 224, 256, 384, 512 bits or characters).
- the hash function comprises a cryptographic hash function.
- the hash function comprises MD-5, SHA-1, SHA-2, SHA-3, RIPEMD-160, Whirlpool, BLAKE, BLAKE2, BLAKE3, or a variation thereof.
- the hash function comprises SHA-2.
- SHA-2 comprises SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, or SHA-512/256.
- the output of a hash function can be deterministic and infeasible to reverse-engineer. Further, generating an output of fixed length can increase security, since any party involved in decrypting a hash would not be able to tell the length of the input.
- a hash is generated upon inputting an identification code, encryption key, password, or any variation thereof. In some examples, the hash allows verification of the content (e.g., item of information or digital information stored in a pool) during decoding.
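The fixed-length, deterministic output of a hash function, and its use for verifying content during decoding, can be illustrated with SHA-256 from Python's standard `hashlib` module. The helper names `sha256_hex` and `verify` are hypothetical and exist only for this sketch.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    """Check decoded content against a hash recorded at storage time."""
    return sha256_hex(data) == expected_hex


# Inputs of very different lengths produce digests of the same length.
short_digest = sha256_hex(b"hi")
long_digest = sha256_hex(b"x" * 1_000_000)
```

Both digests are 64 hexadecimal characters (256 bits), so the digest length reveals nothing about the length of the input.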
- the input 405 comprises an object.
- a hash function 410 is used to determine a hashed output (or hash) 415.
- the input 420 comprises an object.
- a hash function 425 is used to determine a hashed output (or hash) 430. In some examples, the hash function 410 and hash function 425 are the same hash function.
- the hash function 410 and hash function 425 are both SHA-256. In some examples, the hash function 410 and hash function 425 are different hash functions. In some examples, the output 415 and the output 430 are the same length. In some examples, the output 415 and the output 430 are both 256 bits. In some examples, the output 415 and the output 430 are different lengths. [074] A hash function can comprise one or more steps to generate a hash. In some cases, the one or more steps in a hash function comprises padding bits. In some instances, extra bits are added to the digital information (or the message) being hashed.
- extra bits are added to the message such that the length of the padded message falls short of a total number of bits by a modulus value.
- the modulus value is 64 bits.
- the number of bits is 512 bits and the length of the digital information is 448 bits (e.g., for SHA-256).
- the first extra bit comprises a binary digit of 1.
- the subsequently added extra bits comprise binary digits of 0. [075]
- the one or more steps in a hash function comprises padding a length.
- padding the length comprises adding a modulus value to the digital information (e.g., also referred to as a big-endian (BE) integer).
- the modulus value or the BE integer generally represents the length of the original input comprising the original digital information in binary.
- the modulus value is 64 bits.
- 64 bits are added to the digital message of 448 bits, and the total number of bits is 512 bits (e.g., for SHA-256).
- the modulus value is calculated by applying a modulus to the original digital information. As an example, if the original digital information is “hello world” in binary, the length of the original input is 88 bits, which is “1011000” in binary. As such, 0s followed by “1011000” are added to the end of the 448 bits of digital information such that the total number of bits is 512.
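The padding steps for the "hello world" example can be reproduced with a short sketch. It follows the SHA-256 padding rule (a single 1 bit, then 0 bits up to 448 bits modulo 512, then the original length as a 64-bit big-endian integer); the function name `sha256_pad` is hypothetical.

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad `message` to a multiple of 512 bits, following SHA-256 padding."""
    bit_length = len(message) * 8
    padded = message + b"\x80"               # a single 1 bit, then seven 0 bits
    while len(padded) % 64 != 56:            # stop at 448 bits modulo 512
        padded += b"\x00"
    padded += bit_length.to_bytes(8, "big")  # 64-bit big-endian length
    return padded


block = sha256_pad(b"hello world")
```

Here `block` is 64 bytes (512 bits) long, and its final 8 bytes encode the value 88, which is `1011000` in binary.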
- the one or more steps in the hash function comprises initializing one or more hash values or buffers.
- 8 hash values or buffers are initialized.
- the initialized hash values are hard-coded (e.g., constants).
- the initialized hash values represent the first 32 bits of the fractional parts of the square roots of the first 8 primes (e.g., 2, 3, 5, 7, 11, 13, 17, 19).
- the one or more steps in the hash function further comprises initializing round constants (or keys).
- 64 round constants are initialized.
- each of the 64 round constants represents the first 32 bits of the fractional part of the cube root of one of the first 64 primes (e.g., 2-311).
- the 64 different round constants are stored in an array.
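As a non-limiting sketch, the hard-coded initial hash values and round constants described above can be reproduced with exact integer arithmetic. The helper names (`first_primes`, `icbrt`, `frac_sqrt_bits`, `frac_cbrt_bits`) are hypothetical and exist only for this illustration; a practical implementation would typically store the constants directly in an array.

```python
from math import isqrt


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Integer cube root: the largest r with r**3 <= n (binary search)."""
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    while lo < hi - 1:  # invariant: lo**3 <= n < hi**3
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo


def frac_sqrt_bits(p):
    # floor(sqrt(p) * 2**32) equals isqrt(p * 2**64); the low 32 bits are
    # the first 32 bits of the fractional part.
    return isqrt(p << 64) & 0xFFFFFFFF


def frac_cbrt_bits(p):
    # floor(cbrt(p) * 2**32) equals icbrt(p * 2**96).
    return icbrt(p << 96) & 0xFFFFFFFF


PRIMES = first_primes(64)
INITIAL_HASH_VALUES = [frac_sqrt_bits(p) for p in PRIMES[:8]]
ROUND_CONSTANTS = [frac_cbrt_bits(p) for p in PRIMES]

print([f"{h:08x}" for h in INITIAL_HASH_VALUES])
print(f"{ROUND_CONSTANTS[0]:08x}", f"{ROUND_CONSTANTS[63]:08x}")
```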
- the one or more steps in the hash function comprises compression.
- each block of information (e.g., every 512 bits) undergoes compression.
- each block of information undergoes a fixed number of rounds.
- compression is performed by a one-way compression function.
- the one-way compression function is a single-block-length compression function.
- the compression function is a Davies-Meyer, Matyas-Meyer-Oseas, or Miyaguchi-Preneel compression function.
- the one-way compression function is a double-block-length compression function.
- the compression function is an MDC-2/Meyer-Schilling, MDC-4, or Hirose compression function.
- the output from the compression function is shorter than the block of information. In some examples, the output has a length of 256 bits. [078]
- one or more of the hashes, or all of the hashes (e.g., hashes of pool item(s), hashes of object(s)), are calculated during storage of information.
- this allows stable low memory usage regardless of the size of the objects.
- the first one or more hashes of each of the one or more objects require less memory than the one or more objects.
- the second one or more hashes of each of the one or more pool items require less memory than one or more pool items.
- the source data e.g., item of information
- each of the pools is written once without seeks. In some examples, this minimizes data transfers and latency. [079]
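By way of non-limiting example, calculating a hash while the data is being stored can be sketched with Python's standard `hashlib` module, which accepts data incrementally. The function name `hash_stream` and the chunk size are assumptions made for this sketch; the point is only that memory use stays bounded by the chunk size regardless of the size of the object.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=1 << 20):
    """Hash a readable binary stream chunk by chunk (hypothetical helper)."""
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Only one chunk is held in memory at a time.
        hasher.update(chunk)
    return hasher.hexdigest()


data = b"hello world" * 1000
streamed = hash_stream(io.BytesIO(data), chunk_size=64)
print(streamed == hashlib.sha256(data).hexdigest())
```

The same loop could also pass each chunk to a writer, so that the source is read once and the pool is written once.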
- the hashes described herein can serve one or more purposes.
- the one or more purposes can comprise, by way of non-limiting example, one or more of: verifying the integrity of one or more items of information (e.g., an object), signature generation and verification (e.g., for digital signatures), password verification, proof-of-work, or identifier for item of information.
- an encryption and/or compression can further be added.
- the encryption and/or compression is implemented with a streaming application programming interface (API). In some examples, this avoids the need to store intermediate results.
- the digital information to be stored is already compressed, for example, to reduce data transfer costs.
- the digital information to be stored is already encrypted, for example, for security reasons.
- the one or more steps in the instructions stored on the memory can further comprise creating a plurality of index pools.
- the plurality of index pools contain only indices.
- the index pools are used when retrieving the objects stored in the plurality of pools encoded in a plurality of polynucleotides.
- index pools are sequenced and temporarily stored in digital storage systems (e.g., flash drives) to search for objects.
- the plurality of polynucleotides encoding the pool is sequenced.
- the one or more index pools comprise an index pool descriptor and/or a list of object indexing.
- the index pool descriptor comprises a version, a pool ID, a size of a pool, a timestamp, or a combination thereof.
- the pool ID comprises a unique ID of the pool.
- the unique ID comprises a universally unique identifier (UUID).
- the unique ID comprises a content ID.
- the content ID comprises a digital fingerprinting system, which can be used to identify and/or manage copyright or ownership of a content.
- the size of each of the plurality of index pools is about 1 GB to about 1 TB.
- the list of an object indexing comprises a path of an object, a hash of an object, a list of object fragments, a list of object metadata, or any combination thereof.
- the path of the object comprises a unique path.
- the path of the object comprises a hierarchy (e.g., directory hierarchy).
- the path of the object does not comprise a hierarchy.
- the hash of the object is a hash as previously described herein (e.g., SHA-256).
- the list of object fragments comprises a pool ID of a pool containing a fragment, a range of a fragment, or a combination thereof.
- the list of object metadata comprises the type of metadata, the metadata payload, or any combination thereof.
- the type of metadata comprises a list of keywords attached to an object, a thumbnail picture, a text summary, an ID range for a sorted key-value database, a timestamp, a version, or any combination thereof.
- the metadata is customizable.
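As a non-limiting sketch, the index pool layout described above (a descriptor plus a list of object indexing entries) can be modeled with plain data classes. All class and field names below are hypothetical and chosen only to mirror the elements listed above; they are not a defined file format.

```python
from dataclasses import dataclass, field
import hashlib
import time
import uuid


@dataclass
class IndexPoolDescriptor:
    version: int
    pool_id: str        # e.g., a UUID
    size_bytes: int
    timestamp: float


@dataclass
class ObjectFragment:
    pool_id: str        # pool containing the fragment
    start: int          # range of the fragment within that pool
    end: int


@dataclass
class ObjectMetadata:
    kind: str           # type of metadata, e.g., "keywords"
    payload: object     # metadata payload


@dataclass
class ObjectIndexEntry:
    path: str
    sha256: str
    fragments: list = field(default_factory=list)
    metadata: list = field(default_factory=list)


@dataclass
class IndexPool:
    descriptor: IndexPoolDescriptor
    entries: list = field(default_factory=list)

    def search_keyword(self, keyword):
        """Return entries whose keyword metadata contains `keyword`."""
        return [
            entry for entry in self.entries
            if any(m.kind == "keywords" and keyword in m.payload
                   for m in entry.metadata)
        ]


data_pool_id = str(uuid.uuid4())
index = IndexPool(IndexPoolDescriptor(1, str(uuid.uuid4()), 0, time.time()))
index.entries.append(ObjectIndexEntry(
    path="/reports/2024/summary.txt",
    sha256=hashlib.sha256(b"example object").hexdigest(),
    fragments=[ObjectFragment(data_pool_id, 0, 14)],
    metadata=[ObjectMetadata("keywords", ["report", "summary"])],
))
print([entry.path for entry in index.search_keyword("report")])
```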
- the metadata is used to search for an object in the plurality of pools. [083] In some cases, an index pool can store information of about 1 to about 1 million pools.
- an index pool can store information of about 1 pool to about 10 pools, about 1 pool to about 100 pools, about 1 pool to about 1,000 pools, about 1 pool to about 5,000 pools, about 1 pool to about 10,000 pools, about 1 pool to about 50,000 pools, about 1 pool to about 100,000 pools, about 1 pool to about 500,000 pools, about 1 pool to about 1 million pools, about 10 pools to about 100 pools, about 10 pools to about 1,000 pools, about 10 pools to about 5,000 pools, about 10 pools to about 10,000 pools, about 10 pools to about 50,000 pools, about 10 pools to about 100,000 pools, about 10 pools to about 500,000 pools, about 10 pools to about 1 million pools, about 100 pools to about 1,000 pools, about 100 pools to about 5,000 pools, about 100 pools to about 10,000 pools, about 100 pools to about 50,000 pools, about 100 pools to about 100,000 pools, about 100 pools to about 100,000 pools, about 100 pools to about 500,000 pools, about 100 pools to about 1 million pools, about 1,000 pools to about 5,000 pools, about 1,000 pools to about 10,000 pools, about 100 pools to about 50,000 pools, about 100 pools to about 100,000 pools, about
- an index pool can store information of about 1 pool, about 10 pools, about 100 pools, about 1,000 pools, about 5,000 pools, about 10,000 pools, about 50,000 pools, about 100,000 pools, about 500,000 pools, or about 1 million pools. In some cases, an index pool can store information of at least about 1 pool, about 10 pools, about 100 pools, about 1,000 pools, about 5,000 pools, about 10,000 pools, about 50,000 pools, about 100,000 pools, or about 500,000 pools. In some cases, an index pool can store information of at most about 10 pools, about 100 pools, about 1,000 pools, about 5,000 pools, about 10,000 pools, about 50,000 pools, about 100,000 pools, about 500,000 pools, or about 1 million pools. [084] In some cases, each of the one or more index pools is about 1 GB to about 1 TB.
- each of the plurality of pools is about 1 GB to about 1 TB.
- each of the one or more index pools is about 1 GB to about 10 GB, about 1 GB to about 50 GB, about 1 GB to about 100 GB, about 1 GB to about 500 GB, about 1 GB to about 1 TB, about 10 GB to about 50 GB, about 10 GB to about 100 GB, about 10 GB to about 500 GB, about 10 GB to about 1 TB, about 50 GB to about 100 GB, about 50 GB to about 500 GB, about 50 GB to about 1 TB, about 100 GB to about 500 GB, about 100 GB to about 1 TB, or about 500 GB to about 1 TB.
- each of the one or more index pools is about 1 GB, about 10 GB, about 50 GB, about 100 GB, about 500 GB, or about 1 TB. In some cases, each of the one or more index pools is at least about 1 GB, about 10 GB, about 50 GB, about 100 GB, or about 500 GB. In some cases, each of the one or more index pools is at most about 10 GB, about 50 GB, about 100 GB, about 500 GB, or about 1 TB. [085] Encoding Scheme [086] An encoding scheme can be applied to each of the plurality of pools and/or index pools.
- the encoding scheme encodes the digital information in the plurality of pools as a plurality of polynucleotides. In some cases, the encoding scheme encodes the digital information in the index pools as a plurality of polynucleotides. In some instances, the encoding scheme comprises codecs for encoding binary data as nucleic acid sequences (e.g., inner codec). In some instances, the encoding scheme comprises an error correction code (ECC) (e.g., outer codec). In some cases, the encoding scheme (e.g., inner codec or low-level codec) is also designed and implemented to allow streaming read and write API access.
- the encoding scheme (e.g., inner codec or low-level codec) is also designed and implemented to match the streaming of the systems and methods for digital storage (e.g., outer codec or high-level codec) described herein.
- the encoding scheme can generally comprise one or more operations.
- the one or more operations can comprise one or more operations to manipulate or transform data (e.g., digital information).
- the one or more operations can comprise by way of non-limiting example, splitting, shuffling, concatenating, transposing, translating, duplicating, labeling (e.g., using an index) data or a part of the data, or any combination thereof.
- a method of encoding digital or binary data in a plurality of nucleotide sequences can comprise splitting the binary data into a plurality of frames.
- the plurality of frames comprise about 100 to about 10,000 frames.
- the plurality of frames comprise about 100 frames to about 250 frames, about 100 frames to about 500 frames, about 100 frames to about 750 frames, about 100 frames to about 1,000 frames, about 100 frames to about 2,500 frames, about 100 frames to about 5,000 frames, about 100 frames to about 7,500 frames, about 100 frames to about 10,000 frames, about 250 frames to about 500 frames, about 250 frames to about 750 frames, about 250 frames to about 1,000 frames, about 250 frames to about 2,500 frames, about 250 frames to about 5,000 frames, about 250 frames to about 7,500 frames, about 250 frames to about 10,000 frames, about 500 frames to about 750 frames, about 500 frames to about 1,000 frames, about 500 frames to about 2,500 frames, about 500 frames to about 5,000 frames, about 500 frames to about 7,500 frames, about 500 frames to about 750 frames, about 500 frames to about 1,000 frames, about 500 frames to about
- the plurality of frames comprise about 100 frames, about 250 frames, about 500 frames, about 750 frames, about 1,000 frames, about 2,500 frames, about 5,000 frames, about 7,500 frames, or about 10,000 frames. In some instances, the plurality of frames comprise at least about 100 frames, about 250 frames, about 500 frames, about 750 frames, about 1,000 frames, about 2,500 frames, about 5,000 frames, or about 7,500 frames. In some instances, the plurality of frames comprise at most about 250 frames, about 500 frames, about 750 frames, about 1,000 frames, about 2,500 frames, about 5,000 frames, about 7,500 frames, or about 10,000 frames. In some cases, the frames each comprise the same amount of data. In alternative cases, the frames each comprise a different amount of data. In some instances, each frame is assigned a frame index.
- the frame index increases for each successive frame (e.g., 0, 1, 2, 3, 4, 5, etc.). In some examples, the frame index increases monotonically from frame to frame.
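By way of non-limiting illustration, splitting binary data into frames and assigning each frame a monotonically increasing frame index can be sketched as follows. The function name `split_into_frames` and the choice of equal-sized frames (with a possibly shorter final frame) are assumptions made for this sketch.

```python
def split_into_frames(data: bytes, frame_size: int):
    """Split data into (frame_index, frame_bytes) pairs (hypothetical helper)."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    return [
        (offset // frame_size, data[offset:offset + frame_size])
        for offset in range(0, len(data), frame_size)
    ]


frames = split_into_frames(b"abcdefghij", 4)
print(frames)
```

In this sketch the final frame is shorter when the data length is not a multiple of the frame size; a practical system might pad it instead so that all frames carry the same amount of data.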
- Methods for encoding digital or binary data comprise an outer codec. In some instances, methods for encoding digital or binary data in a plurality of nucleotide sequences comprise an outer codec. In some instances, an outer codec is applied to the binary data. In some instances, an outer codec is applied to the binary data once the binary data is split into a plurality of frames. In such instances, the outer codec is applied to each of the plurality of frames.
- An exemplary diagram of splitting a data stream into frames and applying an outer codec is illustrated in FIG.5.
- the outer codec comprises an error correction code or scheme, such as a Reed-Solomon (RS) code.
- This outer codec is used for spreading the digital or binary data to be stored over many oligonucleotides.
- spreading the data builds redundancy to correct for erasures (e.g., lost oligos).
- spreading the data also builds redundancy to correct errors from an inner codec.
- the error correction scheme comprises a Reed-Solomon (RS) code.
- an RS encoder is used to encode the binary data or plurality of frames comprising binary data.
- the RS code operates on a block of data treated as a set of finite-field elements.
- the RS code comprises mapping data, e.g., $m = (m_1, \dots, m_k) \in F^k$, to a polynomial $p_m$, where $p_m(a) = \sum_{i=1}^{k} m_i a^{i-1}$.
- the encoded data $C$ is obtained by evaluating $p_m$ at $n$ different points $a_1, \dots, a_n$ in the field $F$ (e.g., $C = (p_m(a_1), \dots, p_m(a_n))$).
- the RS code comprises an encoding scheme in which each codeword contains the message as a prefix, and error correcting symbols are appended as a suffix.
- the RS code is specified as RS(n, k) with m-bit symbols.
- the encoder takes k data symbols of m bits each, and adds parity symbols (error correcting symbols or check symbols) to make an n-symbol codeword.
- the codeword C(x) comprises the parity check information CK(x) which is systematically appended to the message information M(x).
- k refers to the message length (e.g., in symbols)
- t refers to the number of errors to be corrected
- n refers to the block length (e.g., the message length k plus the correction length n − k, which is 2t for an RS code)
- $x^{n-k}$ refers to the displacement shift in the message
- GF Galois field
- the RS decoder corrects up to 16 symbol errors in the codeword, meaning errors up to 16 bytes can be corrected by the decoder.
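As a non-limiting sketch, systematic RS encoding over GF(2^8) can be illustrated as follows. The sketch assumes the standard construction C(x) = x^(n-k) M(x) + CK(x), where CK(x) is the remainder of x^(n-k) M(x) divided by a generator polynomial g(x); the primitive polynomial 0x11D, the generator roots alpha^0 through alpha^(n-k-1), the RS(255, 223) parameters, and all function names are assumptions made for this illustration, not parameters stated above.

```python
# GF(2^8) log/antilog tables, assuming primitive polynomial 0x11D.
EXP = [0] * 512
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 512):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply polynomials (coefficients listed highest degree first)."""
    result = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            result[i + j] ^= gf_mul(a, b)
    return result


def generator_poly(num_parity):
    """g(x) = product of (x - alpha^i); minus equals plus in GF(2^m)."""
    g = [1]
    for i in range(num_parity):
        g = poly_mul(g, [1, EXP[i]])
    return g


def rs_encode(message, num_parity):
    """Return the message followed by num_parity check symbols."""
    g = generator_poly(num_parity)
    # Divide x^(n-k) * M(x) by the monic g(x); the tail is the remainder CK(x).
    work = list(message) + [0] * num_parity
    for i in range(len(message)):
        coef = work[i]
        if coef:
            for j in range(1, len(g)):
                work[i + j] ^= gf_mul(g[j], coef)
    return list(message) + work[len(message):]


def poly_eval(poly, x):
    """Evaluate a polynomial at x using Horner's rule."""
    y = 0
    for c in poly:
        y = gf_mul(y, x) ^ c
    return y


message = list(range(223))            # k = 223 data symbols of 8 bits each
codeword = rs_encode(message, 32)     # n = 255, so t = (n - k) / 2 = 16
syndromes = [poly_eval(codeword, EXP[i]) for i in range(32)]
print(len(codeword), codeword[:223] == message, any(syndromes))
```

The message appears unchanged as a prefix of the codeword, and every syndrome of an error-free codeword is zero; a decoder (not shown) would use nonzero syndromes to locate and correct up to t symbol errors.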
- the error correction scheme comprises a linear error correction code (or linear block code), such as a low-density parity-check (LDPC) code.
- the error correction scheme comprises a linear block error-correcting code, such as polar code.
- the error correction scheme comprises a high-performance forward error correction (FEC) code, such as a turbo code.
- the error correction scheme comprises an RS code, an LDPC code, a turbo code, a polar code, or any combination thereof (e.g., RS-based LDPC codes).
- the error correction scheme comprises a low-density parity-check (LDPC) code.
- the LDPC code is used to encode the binary data or plurality of frames comprising binary data.
- the structure of an LDPC code is defined by a parity check matrix containing 0s at most entries and 1s elsewhere.
- an (N, K) LDPC code for K information bits is a linear block code with a block size of N, defined by a sparse (N−K) × N parity check matrix in which all elements other than 1s are 0s.
- the number of 1s in a row or a column is referred to as the degree of the row or the column.
- a codeword of length N is represented as a vector C and, for information bits of length K, an (N, K) code with $2^K$ codewords is used.
- the LDPC code is regular when each row and each column of the parity check matrix has a constant degree and irregular otherwise.
- an irregular LDPC code outperforms a regular LDPC code.
- the irregular LDPC code promises improved performance only if the row degrees and the column degrees are appropriately adjusted.
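By way of non-limiting example, the notions of row degree, column degree, regularity, and codeword membership described above can be sketched for a small parity check matrix. The matrix below is a toy (6, 3) example chosen for readability, so it is far denser than a practical LDPC matrix; all names are hypothetical.

```python
from itertools import product


def row_degrees(H):
    """Number of 1s in each row of the parity check matrix."""
    return [sum(row) for row in H]


def column_degrees(H):
    """Number of 1s in each column of the parity check matrix."""
    return [sum(col) for col in zip(*H)]


def is_regular(H):
    """Regular when all rows share one degree and all columns share one degree."""
    return len(set(row_degrees(H))) == 1 and len(set(column_degrees(H))) == 1


def is_codeword(H, c):
    """A vector is a codeword when every parity check sums to 0 modulo 2."""
    return all(sum(h * b for h, b in zip(row, c)) % 2 == 0 for row in H)


# (N - K) x N = 3 x 6 parity check matrix for a toy (6, 3) code.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
codewords = [c for c in product((0, 1), repeat=6) if is_codeword(H, c)]
print(row_degrees(H), column_degrees(H), is_regular(H), len(codewords))
```

The toy matrix has constant row degree but varying column degree, so it is irregular, and exhaustive enumeration finds 2^K = 8 codewords.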
- the error correction scheme comprises a polar code.
- a polar code can achieve Shannon capacity by theoretical proof.
- a polar code comprises low encoding and decoding complexity.
- $B_N$ comprises a permutation matrix such as, for example, a bit-reversal matrix.
- $G_N(A^c)$ is a submatrix obtained from the rows of $G_N$ whose indices are in the set $A^c$, and $u_{A^c}$ denotes the frozen bits, of which there are $(N-K)$, with N being the code length and K being the length of information bits.
- the error correction scheme comprises a turbo code.
- a turbo code generally comprises the parallel concatenation of two or more component codes applied to different interleaved versions of the same information sequence.
- recursive systematic convolutional (RSC) codes are used as the component codes.
- the input to the first RSC encoder is the original information sequence.
- the original information sequence d is also applied to an interleaver to produce an interleaved version d′.
- the interleaved version d′ of the information sequence is the input to the second RSC encoder.
- the outputs from the turbo encoder comprise the systematic sequence $u$ and the redundant parts $x^{(1)}$ (output from the first RSC encoder) and $x^{(2)}$ (output from the second RSC encoder).
- the output of the encoder comprises $u_1, x_1^{(1)}, x_1^{(2)}, u_2, x_2^{(1)}, x_2^{(2)}$, where $u_k$ is the $k$-th systematic bit (i.e., data bit), $x_k^{(1)}$ is the parity output from the first RSC encoder associated with the $k$-th systematic bit $u_k$, and $x_k^{(2)}$ is the parity output from the second RSC encoder associated with the $k$-th systematic bit $u_k$.
- the decoding procedure for the turbo codes generally comprises iterative decoding.
- the turbo code decoding procedure can comprise two component decoders (corresponding to two RSC encoders), an interleaver, and a de-interleaver.
- the two component decoders are soft-input and soft-output (SISO) decoders.
- outputs of the two component decoders comprise likelihood information concerning the coded data sequence.
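As a non-limiting sketch, the parallel concatenation described above can be illustrated with two identical RSC encoders and an interleaver. The component code (feedback polynomial 1 + D + D^2, feedforward polynomial 1 + D^2), the example interleaver, the absence of trellis termination, and the function names are all assumptions made for this illustration.

```python
def rsc_parity(bits):
    """Parity stream of a rate-1/2 RSC encoder (assumed polynomials 7 and 5)."""
    s1 = s2 = 0                      # two-bit shift register state
    parity = []
    for u in bits:
        a = u ^ s1 ^ s2              # recursive feedback: 1 + D + D^2
        parity.append(a ^ s2)        # feedforward: 1 + D^2
        s1, s2 = a, s1
    return parity


def turbo_encode(bits, permutation):
    """Multiplex systematic bits with the two parity streams."""
    interleaved = [bits[i] for i in permutation]  # d' from the interleaver
    x1 = rsc_parity(bits)            # first RSC encoder sees d
    x2 = rsc_parity(interleaved)     # second RSC encoder sees d'
    output = []
    for u, p1, p2 in zip(bits, x1, x2):
        output.extend((u, p1, p2))   # u_k, x_k^(1), x_k^(2)
    return output


info_bits = [1, 0, 0, 0]
encoded = turbo_encode(info_bits, permutation=[3, 2, 1, 0])
print(encoded)
```

The resulting code rate is 1/3, since each data bit is sent with one parity bit from each component encoder. Iterative decoding with two soft-input soft-output decoders is not shown.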
- the size of the binary data is increased once an outer codec (e.g., ECC) is applied.
- the frame sizes are increased once an ECC is applied to each of the frames comprising binary data.
- the frames are divided into a plurality of lanes.
- each lane comprises a lane index.
- each frame comprises about 1000 to about 10,000 lanes.
- each frame comprises about 5000 lanes.
- each frame comprises about 1,000 lanes to about 2,500 lanes, about 1,000 lanes to about 5,000 lanes, about 1,000 lanes to about 7,500 lanes, about 1,000 lanes to about 10,000 lanes, about 2,500 lanes to about 5,000 lanes, about 2,500 lanes to about 7,500 lanes, about 2,500 lanes to about 10,000 lanes, about 5,000 lanes to about 7,500 lanes, about 5,000 lanes to about 10,000 lanes, or about 7,500 lanes to about 10,000 lanes.
- each frame comprises about 1,000 lanes, about 2,500 lanes, about 5,000 lanes, about 7,500 lanes, or about 10,000 lanes.
- each frame comprises at least about 1,000 lanes, about 2,500 lanes, about 5,000 lanes, or about 7,500 lanes.
- each frame comprises at most about 2,500 lanes, about 5,000 lanes, about 7,500 lanes, or about 10,000 lanes.
- Each lane can further comprise about 100 to about 300 bits.
- each lane comprises about 100 bits to about 150 bits, about 100 bits to about 200 bits, about 100 bits to about 250 bits, about 100 bits to about 300 bits, about 150 bits to about 200 bits, about 150 bits to about 250 bits, about 150 bits to about 300 bits, about 200 bits to about 250 bits, about 200 bits to about 300 bits, or about 250 bits to about 300 bits.
- each lane comprises about 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 bits. In some cases, each lane comprises at least about 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 bits. In some cases, each lane comprises at most about 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 bits.
- the methods for encoding digital or binary data in a plurality of nucleotide sequences comprise shuffling the binary data.
- each lane is shuffled based at least in part on lane indices.
- each lane is shuffled after applying an outer codec (e.g., ECC) to the binary data.
- shuffling each lane allows resistance against errors that can occur during synthesis or sequencing, such as those affecting a whole oligonucleotide library.
- the errors can comprise an insertion, a deletion, a substitution, or a combination thereof.
- the shuffling comprises a rotation scheme within each lane based partly on each lane index.
- each bit in a lane may be shifted by each lane index (e.g., no shuffling in lane 0, 1 bit shift in lane 1, 2 bit shift in lane 2, etc.).
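By way of non-limiting illustration, the rotation scheme described above can be sketched as a circular shift of each lane by its lane index. The direction of the shift (toward higher positions) and the function names are assumptions made for this sketch.

```python
def rotate_lane(bits, lane_index):
    """Circularly shift the bits of a lane by its lane index."""
    if not bits:
        return []
    shift = lane_index % len(bits)
    return bits[-shift:] + bits[:-shift] if shift else list(bits)


def unrotate_lane(bits, lane_index):
    """Undo rotate_lane during decoding."""
    if not bits:
        return []
    shift = lane_index % len(bits)
    return bits[shift:] + bits[:shift]


lane = [1, 0, 0, 0]
for lane_index in range(3):
    print(lane_index, rotate_lane(lane, lane_index))
```

Lane 0 is left unchanged, lane 1 is shifted by one bit, and lane 2 by two bits, so the same bit position lands at different offsets in different lanes.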
- the shuffling comprises a pseudorandom process within each lane.
- a random seed is used to initialize a pseudorandom number generator.
- a number generated by the pseudorandom number generator is determined by the random seed. Therefore, the same sequence of numbers is generated by the pseudorandom number generator using the same seed.
- when the shuffling comprises a pseudorandom process, each bit in a lane is shifted according to the numbers generated by the pseudorandom number generator.
- the lane index is used as a seed to create a permutation of some or all the bits for that lane.
- the permutation of the some or all the bits is created by sampling from a random number generator.
- the permutation is stored in a pre-compiled form.
- the use of a pseudorandom number generator allows for a smaller implementation source code.
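As a non-limiting sketch, using the lane index as a seed to create a permutation of the bits in a lane can be illustrated with Python's standard `random` module. The function names are hypothetical, and the choice of generator is an assumption: a practical system would need a fully specified pseudorandom number generator so that the encoder and the decoder derive the same permutation on any platform.

```python
import random


def lane_permutation(lane_index, length):
    """Derive a permutation of bit positions from the lane index (the seed)."""
    rng = random.Random(lane_index)
    permutation = list(range(length))
    rng.shuffle(permutation)
    return permutation


def shuffle_lane(bits, lane_index):
    """Reorder the bits of a lane according to its seeded permutation."""
    permutation = lane_permutation(lane_index, len(bits))
    return [bits[i] for i in permutation]


def unshuffle_lane(shuffled, lane_index):
    """Regenerate the same permutation from the seed and invert it."""
    permutation = lane_permutation(lane_index, len(shuffled))
    original = [None] * len(shuffled)
    for out_pos, in_pos in enumerate(permutation):
        original[in_pos] = shuffled[out_pos]
    return original


lane = [1, 0, 1, 1, 0, 0, 1, 0]
shuffled = shuffle_lane(lane, lane_index=7)
print(shuffled, unshuffle_lane(shuffled, lane_index=7) == lane)
```

Because the permutation is fully determined by the seed, it does not need to be stored with the data; it could alternatively be precomputed and stored in a pre-compiled form.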
- the frame index and the lane index are prepended. In some instances, the frame index and the lane index are prepended to each lane once each lane is shuffled.
- the frame index comprises about 12 bits to about 20 bits.
- the frame index comprises about 12 bits to about 14 bits, about 12 bits to about 16 bits, about 12 bits to about 18 bits, about 12 bits to about 20 bits, about 14 bits to about 16 bits, about 14 bits to about 18 bits, about 14 bits to about 20 bits, about 16 bits to about 18 bits, about 16 bits to about 20 bits, or about 18 bits to about 20 bits.
- the frame index comprises about 12 bits, about 14 bits, about 16 bits, about 18 bits, or about 20 bits.
- the frame index comprises at least about 12 bits, about 14 bits, about 16 bits, or about 18 bits.
- the frame index comprises at most about 14 bits, about 16 bits, about 18 bits, or about 20 bits.
- the lane index comprises about 12 bits to about 16 bits.
- the lane index comprises about 12 bits to about 14 bits, about 12 bits to about 16 bits, or about 14 bits to about 16 bits.
- the lane index comprises about 12 bits, about 14 bits, or about 16 bits.
- the lane index comprises at least about 12 bits, or about 14 bits.
- the lane index comprises at most about 14 bits, or about 16 bits. As shown in FIG.6, in some instances, the lane index is 12 bits and the frame index is 20 bits. In some cases, the lane index is the symbol width m from the RS code.
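By way of non-limiting example, prepending a 20-bit frame index and a 12-bit lane index to a shuffled lane can be sketched as follows. The field order (frame index first), the most-significant-bit-first layout, and the function names are assumptions made for this sketch; only the two bit widths are taken from the example above.

```python
FRAME_INDEX_BITS = 20
LANE_INDEX_BITS = 12


def to_bits(value, width):
    """Encode a non-negative integer as a fixed-width list of bits, MSB first."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def from_bits(bits):
    """Decode an MSB-first list of bits into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def prepend_indices(frame_index, lane_index, payload_bits):
    """Prefix the payload with the frame index and the lane index."""
    return (to_bits(frame_index, FRAME_INDEX_BITS)
            + to_bits(lane_index, LANE_INDEX_BITS)
            + list(payload_bits))


def split_indices(lane_bits):
    """Recover (frame_index, lane_index, payload_bits) from a labeled lane."""
    frame_index = from_bits(lane_bits[:FRAME_INDEX_BITS])
    header_end = FRAME_INDEX_BITS + LANE_INDEX_BITS
    lane_index = from_bits(lane_bits[FRAME_INDEX_BITS:header_end])
    return frame_index, lane_index, lane_bits[header_end:]


labeled = prepend_indices(5, 4095, [1, 0, 1])
print(len(labeled), split_indices(labeled))
```

With these widths the header occupies 32 bits, addressing up to 2^20 frames of up to 2^12 lanes each.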
- the methods for encoding digital or binary data in a plurality of nucleotide sequences comprise an inner codec.
- the inner codec is applied to the binary data.
- the inner codec is applied to the binary data from the ECC.
- the inner codec is applied to the lanes of the binary data.
- the inner codec is applied to the lanes of the binary data once the lanes have been shuffled.
- the encoding scheme comprises an inner codec.
- an inner codec is applied to each lane to encode the binary data as a nucleotide sequence. The inner codec is used to transform digital or binary data into nucleotide bases.
- the inner codec is capable of correcting deletion, substitution, or insertion errors, or any combination thereof. In some further embodiments, the inner codec is used to validate oligos and discard any suspicious oligos to avoid contaminating the outer decoding. The inner codec further encodes the indices (frame index and lane index), which can allow for efficient clustering during decoding. [0107] In some instances, the encoding scheme adds redundancy across the plurality of oligonucleotide sequences. In some instances, the redundancy is about 5 % to about 10 %.
- the redundancy is about 5 % to about 6 %, about 5 % to about 7 %, about 5 % to about 8 %, about 5 % to about 9 %, about 5 % to about 10 %, about 6 % to about 7 %, about 6 % to about 8 %, about 6 % to about 9 %, about 6 % to about 10 %, about 7 % to about 8 %, about 7 % to about 9 %, about 7 % to about 10 %, about 8 % to about 9 %, about 8 % to about 10 %, or about 9 % to about 10 %.
- the redundancy is about 5 %, about 6 %, about 7 %, about 8 %, about 9 %, or about 10 %. In some instances, the redundancy is at least about 5 %, about 6 %, about 7 %, about 8 %, or about 9 %. In some instances, the redundancy is at most about 6 %, about 7 %, about 8 %, about 9 %, or about 10 %. In some cases, this redundancy allows a library of oligos to be decoded in the presence of errors in the individual oligos, such as insertions, deletions, substitutions, or any combination thereof. [0108] An exemplary diagram of an encoding scheme is shown in FIG.7.
- the encoding scheme in the inner codec combines two or more of: bits from each lane, a bit history, and a bit position.
- a model e.g., adaptive model
- each context is mapped to a bit history.
- the bit history is represented by an 8-bit state.
- the bit history is updated each time a context is encountered, for example, through the use of a lookup table.
- a bit position comprises the least significant bit (LSB) from a bit index of the bits to encode. For example, if 100 bits encode a 100-mer oligonucleotide, a “bit index” refers to an index from 0 to 99 in the bits to encode.
- the LSB comprises the bit position in a binary integer representing the binary 1s place of the integer.
- the LSB index is any length.
- the LSB index is represented by a 4-bit state.
- the inner codec comprises generating base candidates for bits of the binary data. Base candidates are generated for the binary data using a lookup table, a hash, or a combination thereof. In some instances, the hash is determined using methods previously described herein.
- the binary data comprises two or more of: bits from each lane, bit history, and a bit position. In some instances, the bit rate for encoding is about 1 bit per base to about 2 bits per base.
- the bit rate for encoding is about 1 bit per base to about 1.1 bits per base, about 1 bit per base to about 1.2 bits per base, about 1 bit per base to about 1.3 bits per base, about 1 bit per base to about 1.4 bits per base, about 1 bit per base to about 1.5 bits per base, about 1 bit per base to about 1.6 bits per base, about 1 bit per base to about 1.7 bits per base, about 1 bit per base to about 1.8 bits per base, about 1 bit per base to about 1.9 bits per base, about 1 bit per base to about 2 bits per base, about 1.1 bits per base to about 1.2 bits per base, about 1.1 bits per base to about 1.3 bits per base, about 1.1 bits per base to about 1.4 bits per base, about 1.1 bits per base to about 1.5 bits per base, about 1.1 bits per base to about 1.6 bits per base, about 1.1 bits per base to about 1.7 bits per base, about 1.1 bits per base to about 1.8 bits per base, about 1.1 bits per base to about 1.9 bits per base, about 1.1 bits per base to
- the bit rate for encoding is about 1 bit per base, about 1.1 bits per base, about 1.2 bits per base, about 1.3 bits per base, about 1.4 bits per base, about 1.5 bits per base, about 1.6 bits per base, about 1.7 bits per base, about 1.8 bits per base, about 1.9 bits per base, or about 2 bits per base. In some instances, the bit rate for encoding is at least about 1 bit per base, about 1.1 bits per base, about 1.2 bits per base, about 1.3 bits per base, about 1.4 bits per base, about 1.5 bits per base, about 1.6 bits per base, about 1.7 bits per base, about 1.8 bits per base, or about 1.9 bits per base.
- the bit rate for encoding is at most about 1.1 bits per base, about 1.2 bits per base, about 1.3 bits per base, about 1.4 bits per base, about 1.5 bits per base, about 1.6 bits per base, about 1.7 bits per base, about 1.8 bits per base, about 1.9 bits per base, or about 2 bits per base.
- a hash comprises a function that can be used to map data of an arbitrary size (e.g., arbitrary number bits) to a fixed size value (e.g., a hashed value).
- the hashed value is mapped to nucleotide sequences.
- the inner codec comprises a base repetition check.
- the base repetition check is performed once the base candidates are selected.
- the base repetition check checks for repetitions in two or more sequential bases.
- the base repetition check substitutes one base for another if there is a repetition in two or more sequential bases.
- the lookup table or the hash is updated based on bases that were updated during the base repetition check. Further, after the base repetition check, the bit history is updated. In some instances, the frame index and/or lane index are incremented. In some instances, this process is repeated until sequences of all of the plurality of nucleotide sequences are determined.
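As a non-limiting sketch, selecting a base candidate from a lookup table and substituting another base when it would repeat the previous base can be illustrated at a rate of 1 bit per base. This is a deliberately simplified stand-in, not the context-dependent scheme described above: the two tables, the decoding table, and the function names are assumptions made for this illustration.

```python
PRIMARY = {0: "A", 1: "C"}       # lookup table: bit -> base candidate
SUBSTITUTE = {0: "G", 1: "T"}    # base used when the candidate would repeat
DECODE = {"A": 0, "G": 0, "C": 1, "T": 1}


def encode_bits(bits):
    """Map bits to bases, substituting a base whenever it would repeat."""
    bases = []
    previous = None
    for bit in bits:
        candidate = PRIMARY[bit]
        if candidate == previous:        # base repetition check
            candidate = SUBSTITUTE[bit]
        bases.append(candidate)
        previous = candidate
    return "".join(bases)


def decode_bases(sequence):
    """Both the primary and the substitute base decode to the same bit."""
    return [DECODE[base] for base in sequence]


def has_repeat(sequence):
    """True if any two sequential bases are identical."""
    return any(a == b for a, b in zip(sequence, sequence[1:]))


bits = [0, 0, 0, 1, 1]
sequence = encode_bits(bits)
print(sequence, has_repeat(sequence), decode_bases(sequence) == bits)
```

A substitute is only chosen when the previous base was a primary base, so the output never contains two identical sequential bases and remains decodable without side information.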
- the inner codec further comprises performing GC filtering prior to synthesizing the plurality of the nucleotide sequences.
- the GC filtering removes about 1% to about 10% of lanes in the plurality of lanes.
- the GC filtering removes about 5% to about 10% of lanes in the plurality of lanes.
- the GC filtering removes no lanes in the plurality of lanes.
- the GC filtering removes about 1 %, about 2 %, about 3 %, about 4 %, about 5 %, about 6 %, about 7 %, about 8 %, about 9 %, or about 10 %.
- the GC filtering removes at least about 1 %, about 2 %, about 3 %, about 4 %, about 5 %, about 6 %, about 7 %, about 8 %, or about 9 %. In some cases, the GC filtering removes at most about 2 %, about 3 %, about 4 %, about 5 %, about 6 %, about 7 %, about 8 %, about 9 %, or about 10 %. In some cases, the plurality of nucleotide sequences comprises about 40% to about 60% GC content.
- the plurality of nucleotide sequences comprises about 40 % to about 45 %, about 40 % to about 50 %, about 40 % to about 55 %, about 40 % to about 60 %, about 45 % to about 50 %, about 45 % to about 55 %, about 45 % to about 60 %, about 50 % to about 55 %, about 50 % to about 60 %, or about 55 % to about 60 % GC content. In some cases, the plurality of nucleotide sequences comprises about 40 %, about 45 %, about 50 %, about 55 %, or about 60 % GC content.
- the plurality of nucleotide sequences comprises at least about 40 %, about 45 %, about 50 %, or about 55 % GC content. In some cases, the plurality of nucleotide sequences comprises at most about 45 %, about 50 %, about 55 %, or about 60 % GC content. In some cases, at least 90% of the plurality of nucleotide sequences comprises about 40% to about 60 % GC content.
- At least 90% of the plurality of nucleotide sequences comprises about 40 % to about 45 %, about 40 % to about 50 %, about 40 % to about 55 %, about 40 % to about 60 %, about 45 % to about 50 %, about 45 % to about 55 %, about 45 % to about 60 %, about 50 % to about 55 %, about 50 % to about 60 %, or about 55 % to about 60 % GC content. In some cases, at least 90% of the plurality of nucleotide sequences comprises about 40 %, about 45 %, about 50 %, about 55 %, or about 60 % GC content.
- At least 90% of the plurality of nucleotide sequences comprises at least about 40 %, about 45 %, about 50 %, or about 55 % GC content. In some cases, at least 90% of the plurality of nucleotide sequences comprises at most about 45 %, about 50 %, about 55 %, or about 60 % GC content.
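By way of non-limiting illustration, GC filtering against a 40 % to 60 % GC content window can be sketched as follows. The function names and the example sequences are hypothetical; the thresholds are taken from the range given above.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def gc_filter(sequences, low=0.40, high=0.60):
    """Keep sequences inside the GC window and report the fraction removed."""
    kept = [s for s in sequences if low <= gc_content(s) <= high]
    removed_fraction = 1 - len(kept) / len(sequences) if sequences else 0.0
    return kept, removed_fraction


candidates = ["ACGT", "AAAA", "GGGC", "ACGA"]
kept, removed_fraction = gc_filter(candidates)
print(kept, removed_fraction)
```

In an actual library the removed fraction would be much smaller than in this toy input (e.g., about 1 % to about 10 % of lanes), and the redundancy added by the outer codec is what allows the removed lanes to be recovered as erasures.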
- the output from the inner codec comprises a final oligonucleotide library.
- An exemplary diagram of an alternative encoding scheme is shown in FIG.8. In some instances, the encoding scheme in the inner codec comprises starting with a default lookup table. The default lookup table is used to select a word to encode within each lane.
- the word is an 8 bit word or a byte.
- the lookup table is applied to generate base candidates for each word or byte) within each lane. A next lookup table is selected based on the previously encoded word or byte.
- the encoding scheme further comprises performing a base repetition check, GC filtering, or a combination thereof, as previously described herein. In some instances, this process is repeated until sequences of all of the plurality of nucleotide sequences may be determined.
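The lookup-table scheme described above can be illustrated with a toy encoder. This is only a sketch under stated assumptions and is not the tables or parameters of the disclosure: each byte is given four 5-base candidate codewords, a "lookup table" is modeled as a cyclic shift chosen by the previously encoded byte, the base repetition check rejects homopolymer runs longer than an assumed limit of 4, and GC filtering is applied to the finished lane. All names are hypothetical.

```python
BASES = "ACGT"
MAX_RUN = 4  # assumed homopolymer limit for the base repetition check


def codeword(index: int) -> str:
    """Write an index in [0, 1024) as 5 bases, most significant digit first."""
    return "".join(BASES[(index >> shift) & 3] for shift in (8, 6, 4, 2, 0))


def candidates(byte: int, table: int) -> list[str]:
    """Four candidate codewords for a byte under lookup table number `table`."""
    slot = (byte + table) % 256  # a table is modeled as a cyclic shift
    return [codeword(k * 256 + slot) for k in range(4)]


def longest_run(seq: str) -> int:
    """Length of the longest run of identical bases in seq."""
    best = run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


def encode_lane(data: bytes, default_table: int = 0) -> str:
    """Encode one lane, starting from the default table."""
    seq, table = "", default_table
    for byte in data:
        for cand in candidates(byte, table):
            # Only the tail of seq can form a run with the new codeword.
            if longest_run(seq[-MAX_RUN:] + cand) <= MAX_RUN:
                seq += cand
                break
        else:
            raise ValueError("no candidate passed the repetition check")
        table = byte  # the next table depends on the byte just encoded
    return seq


def decode_lane(seq: str, default_table: int = 0) -> bytes:
    """Invert encode_lane by replaying the same table selection."""
    out, table = bytearray(), default_table
    for i in range(0, len(seq), 5):
        index = 0
        for base in seq[i:i + 5]:
            index = index * 4 + BASES.index(base)
        byte = (index % 256 - table) % 256
        out.append(byte)
        table = byte
    return bytes(out)


def passes_gc_filter(seq: str, low: float = 40.0, high: float = 60.0) -> bool:
    """GC filtering on a finished lane, using an assumed 40-60 % window."""
    gc = 100.0 * sum(b in "GC" for b in seq) / len(seq)
    return low <= gc <= high


lane = encode_lane(b"DNA storage")
print(lane, decode_lane(lane), passes_gc_filter(lane))
```

In this toy design the four candidates of a byte differ only in their first base, so a candidate that avoids both the preceding base and the following base always exists, and decoding needs no side information because the table choice is a function of already decoded bytes.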
- the output from the inner codec comprises a final oligonucleotide library. [0113] In some cases, the length of each of the oligonucleotides (or polynucleotides) in a library is about 20 to about 500 bases.
- the length of each of the oligonucleotides (or polynucleotides) in a library is about 20 bases to about 50 bases, about 20 bases to about 100 bases, about 20 bases to about 200 bases, about 20 bases to about 300 bases, about 20 bases to about 400 bases, about 20 bases to about 500 bases, about 50 bases to about 100 bases, about 50 bases to about 200 bases, about 50 bases to about 300 bases, about 50 bases to about 400 bases, about 50 bases to about 500 bases, about 100 bases to about 200 bases, about 100 bases to about 300 bases, about 100 bases to about 400 bases, about 100 bases to about 500 bases, about 200 bases to about 300 bases, about 200 bases to about 400 bases, about 200 bases to about 500 bases, about 300 bases to about 400 bases, about 300 bases to about 500 bases, or about 400 bases to about 500 bases.
- the length of each of the oligonucleotides (or polynucleotides) in a library is about 20 bases, about 50 bases, about 100 bases, about 200 bases, about 300 bases, about 400 bases, or about 500 bases. In some cases, the length of each of the oligonucleotides (or polynucleotides) in a library is at least about 20 bases, about 50 bases, about 100 bases, about 200 bases, about 300 bases, or about 400 bases. In some cases, the length of each of the oligonucleotides (or polynucleotides) in a library is at most about 50 bases, about 100 bases, about 200 bases, about 300 bases, about 400 bases, or about 500 bases.
- the library comprising a plurality of polynucleotides from the encoding scheme is synthesized.
- the library comprising the plurality of polynucleotides from the encoding scheme encodes a pool of the plurality of pools.
- the library comprising the plurality of polynucleotides from the encoding scheme encodes an index pool.
- methods comprise use of electrochemical deprotection.
- the substrate is a flexible substrate.
- At least 10^10, 10^11, 10^12, 10^13, 10^14, or 10^15 bases are synthesized in one day.
- at least 10 x 10^8, 10 x 10^9, 10 x 10^10, 10 x 10^11, or 10 x 10^12 polynucleotides are synthesized in one day.
- each polynucleotide synthesized comprises at least 20, 50, 100, 200, 300, 400 or 500 nucleobases.
- these bases are synthesized with a total average error rate of less than about 1 in 100; 200; 300; 400; 500; 1000; 2000; 5000; 10000; 15000; 20000 bases.
- these error rates are for at least 50%, 60%, 70%, 80%, 90%, 95%, 98%, 99%, 99.5%, or more of the polynucleotides synthesized. In some instances, these at least 90%, 95%, 98%, 99%, 99.5%, or more of the polynucleotides synthesized do not differ from a predetermined sequence for which they encode. In some instances, the error rate for synthesized polynucleotides on a substrate using the methods and systems described herein is less than about 1 in 200, less than about 1 in 1,000, less than about 1 in 2,000, less than about 1 in 3,000, or less than about 1 in 5,000.
- error rate refers to a comparison of the collective amount of synthesized polynucleotide to an aggregate of predetermined polynucleotide sequences.
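One way to read the aggregate error rate defined above is as the total number of base-level errors across all synthesized polynucleotides divided by the total number of bases in the predetermined sequences. The sketch below takes that reading and counts substitutions, insertions, and deletions with a Levenshtein edit distance; both choices are assumptions, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def aggregate_error_rate(pairs) -> float:
    """Errors per base over (synthesized, predetermined) sequence pairs."""
    errors = sum(edit_distance(synth, target) for synth, target in pairs)
    bases = sum(len(target) for _, target in pairs)
    return errors / bases


pairs = [
    ("ACGTACGT", "ACGTACGT"),  # exact match
    ("ACGTTCGT", "ACGTACGT"),  # one substitution
    ("ACGACGT", "ACGTACGT"),   # one deletion
]
rate = aggregate_error_rate(pairs)
print(f"about 1 error in {1 / rate:.0f} bases")
```

Under this reading, an error rate of "less than about 1 in 1,000" corresponds to `aggregate_error_rate(pairs) < 1 / 1000`.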
- synthesized polynucleotides disclosed herein comprise a tether of 12 to 25 bases.
- the tether comprises 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 or more bases.
- Electrochemical reactions in some instances are controlled by any source of energy, such as light, heat, radiation, or electricity.
- electrodes are used to control chemical reactions at all or a portion of discrete loci on a surface. Electrodes in some instances are charged by applying an electrical potential to the electrode to control one or more chemical steps in polynucleotide synthesis. In some instances, these electrodes are addressable. Any number of the chemical steps described herein is in some instances controlled with one or more electrodes.
- Electrochemical reactions may comprise oxidations, reductions, acid/base chemistry, or other reactions controlled by an electrode.
- electrodes generate electrons or protons that are used as reagents for chemical transformations.
- Electrodes in some instances directly generate a reagent such as an acid.
- an acid is a proton.
- Electrodes in some instances directly generate a reagent such as a base. Acids or bases are often used to cleave protecting groups, or influence the kinetics of various polynucleotide synthesis reactions, for example by adjusting the pH of a reaction solution.
- Electrochemically controlled polynucleotide synthesis reactions in some instances comprise redox-active metals or other redox-active organic materials.
- metal or organic catalysts are employed with these electrochemical reactions.
- acids are generated from oxidation of quinones.
- Control of chemical reactions is not limited to the electrochemical generation of reagents; chemical reactivity may be influenced indirectly through biophysical changes to substrates or reagents through electric fields (or gradients) which are generated by electrodes.
- substrates include but are not limited to nucleic acids.
- electrical fields which repel or attract specific reagents or substrates towards or away from an electrode or surface are generated. Such fields in some instances are generated by application of an electrical potential to one or more electrodes. For example, negatively charged nucleic acids are repelled from negatively charged electrode surfaces.
- Electrodes generate electric fields which repel polynucleotides away from a synthesis surface, structure, or device.
- electrodes generate electric fields which attract polynucleotides towards a synthesis surface, structure, or device.
- protons are repelled from a positively charged surface to limit contact of protons with substrates or portions thereof.
- repulsion or attractive forces are used to allow or block entry of reagents or substrates to specific areas of the synthesis surface.
- nucleoside monomers are prevented from contacting a polynucleotide chain by application of an electric field in the vicinity of one or both components.
- Such arrangements allow gating of specific reagents, which may obviate the need for protecting groups when the concentration or rate of contact between reagents and/or substrates is controlled.
- unprotected nucleoside monomers are used for polynucleotide synthesis.
- application of the field in the vicinity of one or both components promotes contact of nucleoside monomers with a polynucleotide chain.
- application of electric fields to a substrate can alter the substrate's reactivity or conformation.
- electric fields generated by electrodes are used to prevent polynucleotides at adjacent loci from interacting.
- the substrate is a polynucleotide, optionally attached to a surface.
- Application of an electric field in some instances alters the three-dimensional structure of a polynucleotide. Such alterations comprise folding or unfolding of various structures, such as helices, hairpins, loops, or other 3-dimensional nucleic acid structures. Such alterations are useful for manipulating nucleic acids inside of wells, channels, or other structures.
- electric fields are applied to a nucleic acid substrate to prevent secondary structures.
- a suitable method for polynucleotide synthesis on a substrate of this disclosure is a phosphoramidite-based synthesis of DNA.
- a reagent for the phosphoramidite-based synthesis comprises any one of or a combination of a nucleoside phosphoramidite, an oxidizer, an activator, or a deblocker or the solvent comprises acetonitrile.
- the phosphoramidite-based synthesis method comprises the controlled addition of a phosphoramidite building block, i.e., a nucleoside phosphoramidite, to a growing polynucleotide chain in a coupling step that forms a phosphite triester linkage between the phosphoramidite building block and a nucleoside bound to the substrate.
- the nucleoside phosphoramidite is provided to the substrate activated.
- the nucleoside phosphoramidite is provided to the substrate with an activator.
- nucleoside phosphoramidites are provided to the substrate in a 1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100-fold excess or more over the substrate-bound nucleosides.
- addition of the nucleoside phosphoramidite is performed in an anhydrous environment, for example, in anhydrous acetonitrile.
- the substrate is optionally washed.
- the coupling step is repeated one or more additional times, optionally with a wash step between nucleoside phosphoramidite additions to the substrate.
- a polynucleotide synthesis method used herein comprises 1, 2, 3 or more sequential coupling steps.
- the nucleoside bound to the substrate is de-protected by removal of a protecting group, where the protecting group functions to prevent polymerization.
- Protecting groups may comprise any chemical group that prevents extension of the polynucleotide chain.
- the protecting group is cleaved (or removed) in the presence of an acid.
- the protecting group is cleaved in the presence of a base.
- the protecting group is removed with electromagnetic radiation such as light, heat, or other energy source.
- the protecting group is removed through an oxidation or reduction reaction (e.g., a .
- a protecting group comprises a triarylmethyl group.
- a protecting group comprises an aryl ether.
- a protecting group comprises a disulfide.
- a protecting group comprises an acid-labile silane.
- a protecting group comprises an acetal. In some instances, a protecting group comprises a ketal. In some instances, a protecting group comprises an enol ether. In some instances, a protecting group comprises a methoxybenzyl group. In some instances, a protecting group comprises an azide. In some instances, a protecting group is 4,4’-dimethoxytrityl (DMT). In some instances, a protecting group is a tert-butyl carbonate. In some instances, a protecting group is a tert-butyl ester. In some instances, a protecting group comprises a base-labile group.
- phosphoramidite polynucleotide synthesis methods optionally comprise a capping step.
- in a capping step, the growing polynucleotide is treated with a capping agent.
- a capping step generally serves to block unreacted substrate-bound 5’-OH groups after coupling from further chain elongation, preventing the formation of polynucleotides with internal base deletions.
- phosphoramidites activated with 1H-tetrazole often react, to a small extent, with the O6 position of guanosine. Without being bound by theory, upon oxidation with I2 /water, this side product, possibly via O6-N7 migration, undergoes depurination.
- the apurinic sites can end up being cleaved in the course of the final deprotection of the polynucleotide thus reducing the yield of the full-length product.
- the O6 modifications may be removed by treatment with the capping reagent prior to oxidation with I2/water.
- inclusion of a capping step during polynucleotide synthesis decreases the error rate as compared to synthesis without capping.
- the capping step comprises treating the substrate-bound polynucleotide with a mixture of acetic anhydride and 1-methylimidazole. Following a capping step, the substrate is optionally washed.
- a substrate described herein comprises a bound growing nucleic acid that may be oxidized.
- the oxidation step comprises oxidizing the phosphite triester into a tetracoordinated phosphate triester, a protected precursor of the naturally occurring phosphate diester internucleoside linkage.
- phosphite triesters are oxidized electrochemically.
- oxidation of the growing polynucleotide is achieved by treatment with iodine and water, optionally in the presence of a weak base such as a pyridine, lutidine, or collidine.
- Oxidation is sometimes carried out under anhydrous conditions using tert-Butyl hydroperoxide or (1S)-(+)-(10-camphorsulfonyl)-oxaziridine (CSO).
- a capping step is performed following oxidation.
- a second capping step allows for substrate drying, as residual water from oxidation that may persist can inhibit subsequent coupling.
- the substrate and growing polynucleotide is optionally washed.
- the step of oxidation is substituted with a sulfurization step to obtain polynucleotide phosphorothioates, wherein any capping steps can be performed after the sulfurization.
- reagents capable of efficient sulfur transfer include, but are not limited to, 3-((dimethylaminomethylidene)amino)-3H-1,2,4-dithiazole-3-thione (DDTT); 3H-1,2-benzodithiol-3-one 1,1-dioxide, also known as Beaucage reagent; and N,N,N',N'-tetraethylthiuram disulfide (TETD).
- a protected 5’ end (or 3’ end, if synthesis is conducted in a 5’ to 3’ direction) of the substrate-bound growing polynucleotide is removed so that the primary hydroxyl group can react with a next nucleoside phosphoramidite.
- the protecting group is DMT and deblocking occurs with trichloroacetic acid in dichloromethane.
- the protecting group is DMT and deblocking occurs with electrochemically generated protons.
- Conducting detritylation for an extended time or with stronger than recommended solutions of acids may lead to increased depurination of solid support-bound polynucleotide and thus reduce the yield of the desired full-length product.
- Methods and compositions described herein provide for controlled deblocking conditions limiting undesired depurination reactions.
- the substrate bound polynucleotide is washed after deblocking.
- efficient washing after deblocking contributes to synthesized polynucleotides having a low error rate.
- Methods for the synthesis of polynucleotides on a substrate described herein may involve an iterating sequence of the following steps: application of a protected monomer to a surface of a substrate feature to link with either the surface, a linker or with a previously deprotected monomer; deprotection of the applied monomer so that it can react with a subsequently applied protected monomer; and application of another protected monomer for linking.
- One or more intermediate steps include oxidation and/or sulfurization.
- one or more wash steps precede or follow one or all of the steps.
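The iterating sequence above can be written out as an explicit step list. The sketch below is illustrative only: the step names, the ordering within a cycle (couple, optional cap, oxidize or sulfurize, deprotect), and the option of washing after every step are assumptions chosen for the example, and the function name is hypothetical.

```python
def synthesis_schedule(bases_in_coupling_order, capping=True,
                       sulfurize=False, wash=True):
    """Return (cycle number, step) tuples for an iterated synthesis."""
    steps = []
    for cycle_no, base in enumerate(bases_in_coupling_order, start=1):
        cycle = [f"couple protected {base} monomer"]
        if capping:
            cycle.append("cap unreacted hydroxyl groups")
        cycle.append("sulfurize" if sulfurize else "oxidize")
        cycle.append("deprotect the applied monomer")
        for step in cycle:
            steps.append((cycle_no, step))
            if wash:
                steps.append((cycle_no, "wash"))  # optional wash after each step
    return steps


for cycle_no, step in synthesis_schedule("ACG", wash=False):
    print(cycle_no, step)
```

The caller supplies the bases in the order in which they are coupled, which depends on whether synthesis proceeds in the 3’ to 5’ or 5’ to 3’ direction.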
- Methods for the synthesis of polynucleotides on a substrate described herein may comprise an oxidation step.
- methods involve an iterating sequence of the following steps: application of a protected monomer to a surface of a substrate feature to link with either the surface, a linker or with a previously deprotected monomer; deprotection of the applied monomer so that it can react with a subsequently applied protected monomer; application of another protected monomer for linking, and oxidation and/or sulfurization.
- one or more wash steps precede or follow one or all of the steps.
- Methods for the synthesis of polynucleotides on a substrate described herein may further comprise an iterating sequence of the following steps: application of a protected monomer to a surface of a substrate feature to link with either the surface, a linker or with a previously deprotected monomer; deprotection of the applied monomer so that it can react with a subsequently applied protected monomer; and oxidation and/or sulfurization.
- one or more wash steps precede or follow one or all of the steps.
- Methods for the synthesis of polynucleotides on a substrate described herein may further comprise an iterating sequence of the following steps: application of a protected monomer to a surface of a substrate feature to link with either the surface, a linker or with a previously deprotected monomer; and oxidation and/or sulfurization.
- one or more wash steps precede or follow one or all of the steps.
- Methods for the synthesis of polynucleotides on a substrate described herein may further comprise an iterating sequence of the following steps: application of a protected monomer to a surface of a substrate feature to link with either the surface, a linker or with a previously deprotected monomer; deprotection of the applied monomer so that it can react with a subsequently applied protected monomer; and oxidation and/or sulfurization.
- one or more wash steps precede or follow one or all of the steps.
- polynucleotides are synthesized with photolabile protecting groups, where the hydroxyl groups generated on the surface are blocked by photolabile-protecting groups.
- a pattern of free hydroxyl groups on the surface may be generated. These hydroxyl groups can react with photoprotected nucleoside phosphoramidites, according to phosphoramidite chemistry.
- a second photolithographic mask can be applied and the surface can be exposed to UV light to generate a second pattern of hydroxyl groups, followed by coupling with 5'-photoprotected nucleoside phosphoramidite.
- patterns can be generated and oligomer chains can be extended.
- the lability of a photocleavable group depends on the wavelength and polarity of a solvent employed and the rate of photocleavage may be affected by the duration of exposure and the intensity of light.
- This method can leverage a number of factors such as accuracy in alignment of the masks, efficiency of removal of photo-protecting groups, and the yields of the phosphoramidite coupling step. Further, unintended leakage of light into neighboring sites can be minimized.
- the density of synthesized oligomer per spot can be monitored by adjusting loading of the leader nucleoside on the surface of synthesis.
- the surface of a substrate described herein that provides support for polynucleotide synthesis may be chemically modified to allow for the synthesized polynucleotide chain to be cleaved from the surface. In some instances, the polynucleotide chain is cleaved at the same time as the polynucleotide is deprotected.
- the polynucleotide chain is cleaved after the polynucleotide is deprotected.
- a trialkoxysilyl amine such as (CH3CH2O)3Si-(CH2)2-NH2 is reacted with surface SiOH groups of a substrate, followed by reaction with succinic anhydride with the amine to create an amide linkage and a free OH on which the nucleic acid chain growth is supported.
- Cleavage includes gas cleavage with ammonia or methylamine.
- cleavage includes linker cleavage with electrically generated reagents such as acids or bases.
- polynucleotides are assembled into larger nucleic acids that are sequenced and decoded to extract stored information.
- the surfaces described herein can be reused after polynucleotide cleavage to support additional cycles of polynucleotide synthesis.
- the linker can be reused without additional treatment/chemical modifications.
- a linker is non-covalently bound to a substrate surface or a polynucleotide. In some embodiments, the linker remains attached to the polynucleotide after cleavage from the surface.
- Linkers in some embodiments comprise reversible covalent bonds such as esters, amides, ketals, beta substituted ketones, heterocycles, or other group that is capable of being reversibly cleaved. Such reversible cleavage reactions are in some instances controlled through the addition or removal of reagents, or by electrochemical processes controlled by electrodes. Optionally, chemical linkers or surface-bound chemical groups are regenerated after a number of cycles, to restore reactivity and remove unwanted side product formation on such linkers or surface-bound chemical groups. [0130] Alternatively, the polymer synthesis can be enzymatic DNA synthesis.
- the enzymatic DNA synthesis uses water as a solvent and the reagent is an enzyme terminal deoxynucleotidyl transferase (TdT) or a deblocker.
- enzymatic synthesis of DNA uses a template-independent DNA polymerase, terminal deoxynucleotidyl transferase (TdT), which is a protein that evolved to rapidly catalyze the linkage of naturally occurring dNTPs.
- TdT adds nucleotides indiscriminately, so it is stopped from continuing unregulated synthesis by various techniques such as tethering the TdT, creating variant enzymes, and using nucleotides that include reversible terminators to prevent chain elongation.
- the synthesized libraries of polynucleotides can be stored in a device.
- the device comprises a polynucleotide data storage system.
- the libraries encoding pools e.g., a plurality of pools or index pools
- the compartments comprise, by way of non-limiting example, active surfaces (e.g., loci), tubes, cells, spots, or any other physical storage solutions.
- the compartments comprise locations (e.g., spots) on a microfluidic chip, such as a digital microfluidic chip.
- the compartments are marked with a label.
- the label comprises a barcode, a name (e.g., customer name, sample type, etc.), a timestamp, a list of objects stored, or any combination thereof.
- the device for storing digital information in DNA comprises one or more compartments.
- each of the one or more compartments comprises a library comprising a plurality of polynucleotides.
- the library encodes a pool comprising digital information corresponding to one or more objects (e.g., a pool of the plurality of pools described herein).
- the pool comprises a pool descriptor, one or more pool items, and an end pool descriptor, such as those described herein.
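The relationships among a device, its labeled compartments, the stored library, and the encoded pool can be modeled as plain data structures. The following is a hypothetical sketch: the class and field names are invented for illustration, and keeping the pool contents next to the library that encodes them is a modeling convenience rather than something the disclosure specifies.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Label:
    barcode: str
    name: str                      # e.g., customer name or sample type
    timestamp: datetime
    objects: list[str] = field(default_factory=list)  # objects stored


@dataclass
class Pool:
    descriptor: str                # pool descriptor
    items: list[bytes]             # one or more pool items
    end_descriptor: str            # end pool descriptor


@dataclass
class Compartment:
    label: Label
    library: list[str]             # polynucleotide sequences in the library
    pool: Pool                     # digital content the library encodes


@dataclass
class Device:
    compartments: list[Compartment] = field(default_factory=list)

    def find(self, barcode: str) -> Optional[Compartment]:
        """Look up a compartment by the barcode on its label."""
        return next((c for c in self.compartments
                     if c.label.barcode == barcode), None)


device = Device()
device.compartments.append(Compartment(
    label=Label("BC-0001", "example customer",
                datetime(2024, 1, 1, tzinfo=timezone.utc), ["photo.png"]),
    library=["ACGTACGTAC"],
    pool=Pool("pool-start", [b"example item"], "pool-end"),
))
print(device.find("BC-0001").label.objects)
```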
- the pool comprises about 1 GB to about 1 TB of digital information, as previously described herein.
- each of the one or more compartments comprises a medium for storing the plurality of polynucleotides.
- the medium comprises a solid, a liquid, a gas, or any combination thereof.
- the medium comprises a salt solution.
- the molar ratio of salt to DNA may range from about 20:1 to about 2:1. In some examples, the molar ratio depends on the molecular weight of the salt used and on the relative amounts of salt and DNA combined. In some examples, the molar ratio is calculated between the cation of the salt and the negatively charged phosphate groups of the DNA. In some examples, the salt solution comprises a molar ratio of less than 20:1 salt cation to phosphate groups in the DNA. In some examples, the salt solution is dried to create a dried product.
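The cation-to-phosphate ratio described above can be estimated from masses. This is a rough sketch under two assumptions that are not from the disclosure: one phosphate group per nucleotide, and an average nucleotide molar mass of about 330 g/mol (a common approximation for single-stranded DNA). The function name and parameters are hypothetical.

```python
def cation_to_phosphate_ratio(salt_mass_g: float,
                              salt_molar_mass: float,
                              cations_per_formula: int,
                              dna_mass_g: float,
                              nt_molar_mass: float = 330.0) -> float:
    """Molar ratio of salt cations to DNA phosphate groups."""
    cation_mol = salt_mass_g / salt_molar_mass * cations_per_formula
    phosphate_mol = dna_mass_g / nt_molar_mass  # assumes one phosphate per nucleotide
    return cation_mol / phosphate_mol


# Example: 1 microgram of DNA with calcium chloride (about 110.98 g/mol, one Ca2+ per formula unit).
ratio = cation_to_phosphate_ratio(salt_mass_g=3.363e-6, salt_molar_mass=110.98,
                                  cations_per_formula=1, dna_mass_g=1e-6)
print(f"cation:phosphate is about {ratio:.1f}:1")
print("within 2:1 to 20:1:", 2.0 <= ratio <= 20.0)
```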
- the salt solution comprises, by way of non-limiting examples, calcium chloride, calcium nitrate, calcium carbonate, calcium phosphate, magnesium chloride, magnesium sulfate, magnesium nitrate, magnesium carbonate, lanthanum chloride, lanthanum nitrate, lanthanum carbonate, lanthanum bromide, or a mixture thereof.
- the salt solution comprises barium (II) chloride dihydrate, calcium chloride dihydrate, copper (II) chloride anhydrous, lanthanum trichloride, magnesium dichloride hexahydrate, sodium chloride, or strontium chloride hexahydrate.
- a medium for storing the plurality of polynucleotides comprises nanoparticles.
- the nanoparticles comprise silica nanoparticles.
- a subset of the plurality of polynucleotides are encapsulated in the nanoparticles.
- the nanoparticles encapsulating polynucleotides are stored in a water-free or near-to water-free environment.
- nanoparticles comprise a protective layer of silica (e.g., tetraethoxysilane).
- the nanoparticles comprise a co-interacting compound with the polynucleotides (e.g., N-[3-(Trimethoxysilyl)propyl]-N,N,N-trimethylammonium chloride).
- the nanoparticles encapsulating polynucleotides are stored on a digital microfluidic chip.
- the digital microfluidic chip allows for programmability of fluid.
- the programmability allows for automated storage and/or retrieval of polynucleotides.
- each location on a digital microfluidic chip comprises about 100 GB, 500 GB, 1 TB, 2 TB, 10 TB, 20 TB, 30 TB, or 50 TB.
- each location comprises about 50 µg, 100 µg, 150 µg, 200 µg, 250 µg, 300 µg, 350 µg, 400 µg, 450 µg, 500 µg, 600 µg, 700 µg, 800 µg, 900 µg, or 1000 µg of nanoparticles.
- each of the one or more compartments are in communication. In some instances, each of the one or more compartments are in communication through the medium. In some cases, each of the one or more compartments are not in communication. In some instances, each of the one or more compartments are not in communication through the medium.
- the device further comprises one or more second compartments.
- each of the one or more second compartments comprises a second library.
- the second library encodes an index pool, such as those described herein.
- the one or more second compartments comprise a medium as previously described herein.
- the one or more second compartments comprise the same medium as the one or more compartments.
- the one or more second compartments comprise a different medium than the one or more compartments.
- each of the one or more second compartments are in communication with each other and/or the one or more compartments (e.g., through the medium). In some cases, each of the one or more second compartments are not in communication with each other and/or the one or more compartments.
- the device further comprises a solid support comprising a surface.
- a size of the solid support is between about 40 and 120 mm by between about 25 and 100 mm.
- a size of the solid support is about 80 mm by about 50 mm.
- a width of a solid support is at least or about 10 mm, 20 mm, 40 mm, 60 mm, 80 mm, 100 mm, 150 mm, 200 mm, 300 mm, 400 mm, 500 mm, or more than 500 mm.
- a height of a solid support is at least or about 10 mm, 20 mm, 40 mm, 60 mm, 80 mm, 100 mm, 150 mm, 200 mm, 300 mm, 400 mm, 500 mm, or more than 500 mm.
- the solid support has a planar surface area of at least or about 100 mm^2; 200 mm^2; 500 mm^2; 1,000 mm^2; 2,000 mm^2; 4,500 mm^2; 5,000 mm^2; 10,000 mm^2; 12,000 mm^2; 15,000 mm^2; 20,000 mm^2; 30,000 mm^2; 40,000 mm^2; 50,000 mm^2 or more.
- the thickness of the solid support is between about 50 mm and about 2000 mm, between about 50 mm and about 1000 mm, between about 100 mm and about 1000 mm, between about 200 mm and about 1000 mm, or between about 250 mm and about 1000 mm.
- Non-limiting examples of the thickness of the solid support include 275 mm, 375 mm, 525 mm, 625 mm, 675 mm, 725 mm, 775 mm and 925 mm.
- the thickness of the solid support is at least or about 0.5 mm, 1.0 mm, 1.5 mm, 2.0 mm, 2.5 mm, 3.0 mm, 3.5 mm, 4.0 mm, or more than 4.0 mm.
- Described herein are devices wherein two or more solid supports are assembled.
- solid supports are interfaced together on a larger unit. Interfacing may comprise exchange of fluids, electrical signals, or other medium of exchange between solid supports.
- This unit is capable of interface with any number of servers, computers, or networked devices.
- a plurality of solid supports is integrated onto a rack unit, which is conveniently inserted or removed from a server rack.
- the rack unit may comprise any number of solid supports.
- the rack unit comprises at least 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10,000, 20,000, 50,000, 100,000 or more than 100,000 solid supports.
- two or more solid supports are not interfaced with each other.
- Nucleic acids (and the information stored in them) present on solid supports can be accessed from the rack unit. Access includes removal of polynucleotides from solid supports, direct analysis of polynucleotides on the solid support, or any other method which allows the information stored in the nucleic acids to be manipulated or identified. Information in some instances is accessed from a plurality of racks, a single rack, a single solid support in a rack, a portion of the solid support, or a single locus on a solid support. In various instances, access comprises interfacing nucleic acids with additional devices such as mass spectrometers, HPLC, sequencing instruments, PCR thermocyclers, or other device for manipulating nucleic acids.
- Access to nucleic acid information in some instances is achieved by cleavage of polynucleotides from all or a portion of a solid support.
- Cleavage in some instances comprises exposure to chemical reagents (ammonia or other reagent), electrical potential, radiation, heat, light, acoustics, or other form of energy capable of manipulating chemical bonds.
- cleavage occurs by charging one or more electrodes in the vicinity of the polynucleotides.
- electromagnetic radiation in the form of UV light is used for cleavage of polynucleotides.
- a lamp is used for cleavage of polynucleotides, and a mask mediates exposure locations of the UV light to the surface.
- Solid supports as described herein comprise an active area.
- the active area comprises regions, cells, features, or loci for nucleic acid synthesis.
- the active area comprises regions or loci for nucleic acid storage.
- the regions or loci comprise the one or more compartments.
- the regions or loci comprise the second one or more compartments.
- the regions are addressable. In some examples, the regions are addressable through an electrode.
- the active area comprises varying dimensions. For example, the dimension of the active area is between about 1 mm to about 50 mm by about 1 mm to about 50 mm. In some instances, the active area comprises a width of at least or about 0.5, 1, 1.5, 2, 2.5, 3, 5, 5, 10, 12, 14, 16, 18, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, or more than 80 mm. In some instances, the active area comprises a height of at least or about 0.5, 1, 1.5, 2, 2.5, 3, 5, 5, 10, 12, 14, 16, 18, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, or more than 80 mm.
- the solid support has a number of sites (e.g., spots) or positions for synthesis or storage.
- the solid support comprises up to or about 10,000 by 10,000 positions in an area.
- the solid support comprises between about 1000 and 20,000 by between about 1000 and 20,000 positions in an area.
- the solid support comprises at least or about 10, 30, 50, 75, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 12,000, 14,000, 16,000, 18,000, 20,000 positions by at least or about 10, 30, 50, 75, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 12,000, 14,000, 16,000, 18,000, 20,000 positions in an area. In some instances the area is up to 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, or 2.0 inches squared.
- the solid support comprises loci having a pitch of at least or about 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5, 6, 7, 8, 9, 10, or more than 10 um. In some instances, the solid support comprises loci having a pitch of about 5 um. In some instances, the solid support comprises loci having a pitch of about 2 um. In some instances, the solid support comprises loci having a pitch of about 1 um. In some instances, the solid support comprises loci having a pitch of about 0.2 um.
- the solid support comprises loci having a pitch of about 0.2 um to about 10 um, about 0.2 to about 8 um, about 0.5 to about 10 um, about 1 um to about 10 um, about 2 um to about 8 um, about 3 um to about 5 um, about 1 um to about 3 um or about 0.5 um to about 3 um. In some instances, the solid support comprises loci having a pitch of about 0.1 um to about 3 um.
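The pitch values above translate directly into loci density and positions per side. The sketch below assumes loci laid out on a regular square grid, which the text does not state; the function names are hypothetical.

```python
def loci_per_mm2(pitch_um: float) -> float:
    """Loci per square millimetre for a square grid with the given pitch."""
    per_mm = 1000.0 / pitch_um      # loci along 1 mm
    return per_mm ** 2


def positions_per_side(length_mm: float, pitch_um: float) -> int:
    """Number of loci that fit along one side of an active area."""
    return int(length_mm * 1000.0 / pitch_um)


for pitch in (5.0, 2.0, 1.0):
    print(f"pitch {pitch} um: {loci_per_mm2(pitch):,.0f} loci per mm^2")

# A 50 mm side at a 5 um pitch holds 10,000 positions.
print(positions_per_side(50.0, 5.0))
```

Under this assumption, a 50 mm by 50 mm active area at a 5 um pitch corresponds to 10,000 by 10,000 positions.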
- the solid support for nucleic acid synthesis or storage as described herein comprises a high capacity for storage of data. For example, the capacity of the solid support is at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more than 1000 petabytes.
- the capacity of the solid support is between about 1 to about 10 petabytes or between about 1 to about 100 petabytes. In some instances, the capacity of the solid support is about 100 petabytes.
- the data is stored as arrays of packets as droplets. In some examples, the arrays of packets are addressable packets. In some examples, the packets are addressable using an electrode. In some instances, the data is stored as arrays of packets as droplets on a spot. In some instances, the data is stored as arrays of packets as dry wells.
- the arrays comprise at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, or more than 200 gigabytes of data. In some instances, the arrays comprise at least or about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200, or more than 200 terabytes of data. In some instances, an item of information is stored in a background of data. For example, an item of information encodes for about 10 to about 100 megabytes of data and is stored in 1 petabyte of background data.
- an item of information encodes for at least or about 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 400, 500, or more than 500 megabytes of data and is stored in 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 400, 500, or more than 500 petabytes of background data.
- devices for solid support based nucleic acid synthesis and storage wherein following synthesis, the polynucleotides are collected in packets as one or more droplets. In some instances, the polynucleotides are collected in packets as one or more droplets and stored.
- a number of droplets is at least or about 1, 10, 20, 50, 100, 200, 300, 500, 1000, 2500, 5000, 7500, 10,000, 25,000, 50,000, 75,000, 100,000, 1 million, 5 million, 10 million, 25 million, 50 million, 75 million, 100 million, 250 million, 500 million, 750 million, or more than 750 million droplets.
- a droplet volume comprises 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more than 100 um (micrometer) in diameter.
- a droplet volume comprises 1-100 um, 10-90 um, 20-80 um, 30-70 um, or 40-50 um in diameter.
- the polynucleotides that are collected in the packets comprise a similar sequence.
- the polynucleotides further comprise a non-identical sequence to be used as a tag or barcode.
- the non-identical sequence is used to index the polynucleotides stored on the solid support and to later search for specific polynucleotides based on the non-identical sequence.
- Exemplary tag or barcode lengths include barcode sequences comprising, without limitation, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25 or more bases in length.
- the tag or barcode comprises at least or about 10, 50, 75, 100, 200, 300, 400, or more than 400 base pairs in length.
- the packets comprise about 100 to about 1000 copies of each polynucleotide.
- the packets comprise at least or about 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, or more than 2000 copies of each polynucleotide.
- the packets comprise about 1000X to about 5000X synthesis redundancy.
- Synthesis redundancy in some instances is at least or about 500X, 1000X, 1500X, 2000X, 2500X, 3000X, 3500X, 4000X, 5000X, 6000X, 7000X, 8000X, or more than 8000X.
- the polynucleotides that are synthesized using solid support based methods as described herein comprise various lengths. In some instances, the polynucleotides are synthesized and further stored on the solid support. In some instances, the polynucleotide length is in between about 100 to about 1000 bases.
- the polynucleotides comprise at least or about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, or more than 2000 bases in length.
- Sequencing [0148] Polynucleotides are extracted and/or amplified from surfaces where they are synthesized or stored.
- suitable sequencing technology may be employed to sequence the polynucleotides.
- the DNA sequence is read on the substrate or within a feature of a structure.
- the polynucleotides stored on the substrate are extracted, optionally assembled into longer nucleic acids, and then sequenced.
- Polynucleotides synthesized and stored on the structures described herein encode data that can be interpreted by reading the sequence of the synthesized polynucleotides and converting the sequence into binary code readable by a computer. In some cases the sequences require assembly, and the assembly step may need to be at the nucleic acid sequence stage or at the digital sequence stage.
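As a rough illustration of converting a read sequence into binary code, the sketch below assumes a fixed two-bits-per-base mapping (A=00, C=01, G=10, T=11). The mapping, the table name `BASE_TO_BITS`, and the function name `bases_to_bytes` are assumptions made for illustration only; the encoding used by the codecs described herein may differ.

```python
# Hypothetical 2-bit mapping chosen for illustration; not taken from the disclosure.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def bases_to_bytes(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            # Shift in two bits per base, most significant bits first.
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)
```

Under this assumed mapping, every four bases yield one byte, so a 100-base payload would carry 25 bytes before any index or error-correction overhead.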
- the detection system comprises a device for holding and advancing the structure through a detection location and a detector disposed proximate the detection location for detecting a signal originated from a section of the tape when the section is at the detection location.
- the signal is indicative of a presence of a polynucleotide.
- the signal is indicative of a sequence of a polynucleotide (e.g., a fluorescent signal).
- a detection system comprises a computer system comprising a polynucleotide sequencing device, a database for storage and retrieval of data relating to polynucleotide sequence, software for converting DNA code of a polynucleotide sequence to binary code, a computer for reading the binary code, or any combination thereof.
- sequencing systems that can be integrated into the devices described herein. Various methods of sequencing are well known in the art, and comprise “base calling” wherein the identity of a base in the target polynucleotide is identified.
- polynucleotides synthesized using the methods, devices, compositions, and systems described herein are sequenced after cleavage from the synthesis surface.
- sequencing occurs during or simultaneously with polynucleotide synthesis, wherein base calling occurs immediately after or before extension of a nucleoside monomer into the growing polynucleotide chain.
- Methods for base calling include measurement of electrical currents/voltages generated by polymerase-catalyzed addition of bases to a template strand.
- synthesis surfaces comprise enzymes, such as polymerases. In some instances, such enzymes are tethered to electrodes or to the synthesis surface.
- enzymes comprise terminal deoxynucleotidyl transferases, or variants thereof.
- Systems and Methods for Digital Information Retrieval [0153] Provided herein are methods and systems for retrieval of digital information.
- the digital information comprises one or more objects as previously described herein.
- each of the one or more objects is about 1 GB to about 1 TB as previously described herein.
- the one or more objects comprises an item of information, such as, but not limited to, those described herein.
- the systems and methods decode nucleotide sequences (e.g., polynucleotides, oligonucleotides, plurality of polynucleotides, etc.).
- a method for retrieving a digital information stored in a plurality of polynucleotides comprises one or more steps.
- retrieving a digital information stored in a plurality of polynucleotides comprises accessing an index pool.
- accessing an index pool comprises fully or partially sequencing a library encoding an index pool.
- the index pool is encoded in the library using the systems and methods described herein.
- the polynucleotides in a library encoding an index pool are sequenced using the systems and methods described herein. In some instances, more than one index pool is accessed. In some instances, the polynucleotides in more than one library are sequenced. In some instances, the sequenced library is temporarily stored in a memory storage system (e.g., flash drives). In some instances, the sequenced library is converted to digital information to retrieve an index pool. In some instances, the index pool is temporarily stored in a memory storage system (e.g., flash drives). In some instances, the digital information in the index pool is used to search for one or more objects of interest.
- the one or more objects of interest are stored in a library comprising a plurality of polynucleotides encoding the one or more objects.
- the one or more objects of interest are searched using metadata associated with the one or more objects.
- accessing an index pool determines a plurality of pools corresponding to one or more objects.
- the one or more objects of interest is retrieved from a compartment in a storage device.
- retrieving a digital information stored in a plurality of polynucleotides comprises sequencing the plurality of polynucleotides corresponding to one or more objects in a plurality of pools.
- the plurality of polynucleotides are in a library.
- the library is in a compartment of a device, as previously described herein.
- the plurality of polynucleotides in a library encoding a pool are sequenced using the systems and methods described herein.
- the pool is encoded in the library using the systems and methods described herein.
- the plurality of polynucleotides in more than one compartment is sequenced to retrieve the one or more objects. [0157]
- retrieving a digital information stored in a plurality of polynucleotides further comprises applying a decoding scheme.
- the decoding scheme decodes the digital information in the plurality of pools.
- the decoding scheme is applied to the sequenced library comprising a plurality of polynucleotides.
- a decoding scheme comprises an inner codec, an outer codec (e.g., ECC), or a combination thereof.
- the decoding scheme decodes a plurality of nucleotide sequences to generate an output comprising digital information (e.g., an object).
- the decoding scheme comprises undoing operations in the encoding scheme.
- the operations comprise, splitting, shuffling, concatenating, transposing, translating, duplicating, labeling (e.g., using an index) data or a part of the data, or any combination thereof.
- the methods and systems decode nucleotide sequences (e.g., polynucleotides, oligonucleotides, plurality of polynucleotides, etc.).
- the nucleotide sequences are encoded using the methods described herein.
- the methods and systems comprise an inner codec, an outer codec, or a combination thereof.
- methods for decoding the plurality of nucleotide sequences may comprise determining the plurality of nucleotide sequences.
- determining the plurality of nucleotide sequences comprises sequencing the nucleotides.
- the nucleotides are sequenced using the methods described herein. [0159] After sequencing the plurality of nucleotides, the encoded binary data is decoded. In some instances, the plurality of nucleotides are decoded using the schematic illustrated, by way of non-limiting example, in FIG.9. The output from sequencing comprises an unordered list of reads (e.g., nucleotide sequences), as shown in FIG.9. [0160] In some instances, the sequenced polynucleotides, such as an unordered list of reads, are clustered after sequencing. In some cases, clustering is performed prior to applying an inner codec.
- the sequenced polynucleotides are clustered based on an index, such as the frame index, the lane index, or a combination thereof. In such instances, the sequenced polynucleotides are partially decoded to obtain the frame index, the lane index, or the combination thereof. In some instances, clustering is performed using a hash function, as previously described herein. In some instances, a hash function is used if the bases in the nucleotide sequences were determined using a hash in the encoding scheme, as previously described herein. [0161] In some instances, the sequenced polynucleotides (e.g., reads) are aligned.
- sequenced polynucleotides are aligned after they have been clustered. In some cases, the sequenced polynucleotides are aligned prior to applying the inner codec. In some instances, aligning comprises analyzing consensus of the reads (e.g., nucleotide sequences) using an alignment algorithm. In some examples, the alignment algorithm comprises a pairwise alignment algorithm, a multi-sequence alignment algorithm, or a combination thereof. [0162] In some instances, a pairwise alignment algorithm comprises initializing a position for each read. Initializing comprises aligning a nucleotide sequence to a position 0. Consensus of a next one or more bases is analyzed between reads.
- about 3 to about 10 reads are analyzed for consensus.
- about 3 to about 4, about 3 to about 5, about 3 to about 6, about 3 to about 7, about 3 to about 8, about 3 to about 9, about 3 to about 10, about 4 to about 5, about 4 to about 6, about 4 to about 7, about 4 to about 8, about 4 to about 9, about 4 to about 10, about 5 to about 6, about 5 to about 7, about 5 to about 8, about 5 to about 9, about 5 to about 10, about 6 to about 7, about 6 to about 8, about 6 to about 9, about 6 to about 10, about 7 to about 8, about 7 to about 9, about 7 to about 10, about 8 to about 9, about 8 to about 10, or about 9 to about 10 reads are analyzed for consensus.
- the next one or more bases comprise the next 2 to 10 bases. In some instances, the next one or more bases is about 2, 3, 4, 5, 6, 7, 8, 9, or 10 bases. In some instances, the next one or more bases is at least about 2, 3, 4, 5, 6, 7, 8, or 9 bases. In some instances, the next one or more bases is at most about 3, 4, 5, 6, 7, 8, 9, or 10 bases.
- the next one or more bases is about 2, 3, 4, or 5 bases.
- the consensus is analyzed between the reads, and it is determined whether the next one or more bases are correct. If there is consensus on a base at a position, e.g., x, among all reads, then the subsequent base, e.g., x+1, may then be analyzed. If there is an inconsistency in a base at a position, e.g., x, among the reads, then it is determined whether the read comprising the inconsistency has an error. In some instances, the error is an insertion, deletion, or substitution.
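The position-by-position consensus check described above can be sketched as a majority vote over each column of a cluster of reads. This is a simplified sketch that assumes the reads are already aligned at position 0 and only flags disagreeing reads; it does not classify an inconsistency as an insertion, deletion, or substitution, and the function name `consensus_with_flags` is hypothetical.

```python
from collections import Counter


def consensus_with_flags(reads):
    """Majority-vote each position x across reads.

    Returns the consensus string and the sorted indices of reads that
    disagreed with the consensus at one or more positions.
    """
    length = min(len(r) for r in reads)  # compare only the shared prefix
    consensus = []
    suspects = set()
    for x in range(length):
        column = [r[x] for r in reads]
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
        for i, observed in enumerate(column):
            if observed != base:
                suspects.add(i)  # this read may carry an error at x
    return "".join(consensus), sorted(suspects)
```

A fuller implementation would re-align a flagged read (for example, by testing whether skipping or inserting one base restores agreement over the next few bases) before continuing to position x+1.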
- the decoding scheme comprises an inner codec.
- the inner codec is applied to the plurality of nucleotide sequences.
- the inner codec is used to transform the nucleotide sequences into digital or binary data.
- the inner codec is capable of correcting deletion, substitution, or insertion errors, or any combination thereof.
- the inner codec is used to validate oligos and discard any suspicious oligos to avoid contaminating the outer decoding.
- the inner codec allows for efficient decoding using the indices (frame index and lane index).
- An inner codec comprising a decoding scheme is applied to the plurality of nucleotide sequences.
- the inner codec may transform each of the plurality of nucleotide sequences into lanes of binary data.
- the inner codec is applied to a plurality of nucleotides that have been sequenced.
- the inner codec is applied to the unordered reads.
- the inner codec is applied to the reads or the plurality of polynucleotides once they have been clustered, as described herein.
- the inner codec is applied to the reads or the plurality of nucleotides once they have been aligned, as described herein.
- [0165]
- the inner codec comprises a greedy algorithm.
- the inner codec comprises a maximum likelihood (ML) algorithm.
- the inner codec comprises a mixed greedy ML algorithm.
- an inner codec comprising a greedy algorithm (e.g., greedy decoder) is illustrated by way of example in FIG.10. As shown, a greedy algorithm takes into account transitions from only the most probable state as it decodes each bit position in a sequence. In some instances, each bit is guessed using the greedy algorithm one at a time.
- the x-axis comprises the bit position and the y-axis comprises a state.
- a state comprises one or more valid encoding states S that are analyzed at each bit position.
- each state S is assigned a probability.
- the state S is defined as the encoded bits from each lane, a bit history, and a bit position.
- the state S is defined as the bit history and the bit word. The greedy algorithm repeatedly finds the most probable state at each position until the most probable end state is reached. In some instances, the decoded bits are backtracked by following the most probable states at each bit position.
- the greedy decoder finds a locally optimal solution.
- the locally optimal solution is an approximation of a globally optimal solution.
- the greedy decoder provides a solution (or end state) in a reasonable amount of time compared to other inner codecs, such as those described herein.
- performance of the inner codec is improved by knowing where the oligonucleotide sequence ends.
- the oligonucleotide lengths are determined during sequencing, for example, through paired-end sequencing.
- a drift term is introduced to the greedy algorithm.
- the drift term comprises an integer associated with the total number of insertions and deletions.
- each insertion is represented as a +1 value and each deletion is represented as a -1 value. For example, if there are no insertions and 2 deletions, the total drift is -2.
- the greedy algorithm discards as invalid all end decoding states that do not match the length of the oligo. Therefore, the drift term allows the greedy algorithm to know which end decoding states are valid, and can further improve the performance.
- the inner codec further comprises a z-axis corresponding to the drift.
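The drift bookkeeping described above can be sketched as follows. The event labels (`"ins"`, `"del"`, `"sub"`), the dictionary layout of an end state, and both function names are assumptions for illustration; only the +1/-1 convention and the length-matching rule come from the text.

```python
def total_drift(events):
    """Sum +1 for each insertion and -1 for each deletion; substitutions add 0."""
    return sum(+1 if e == "ins" else -1 if e == "del" else 0 for e in events)


def filter_end_states(end_states, designed_length, observed_length):
    """Keep only end states whose drift explains the observed read length.

    Each end state is assumed to be a dict with an integer "drift" entry.
    """
    expected_drift = observed_length - designed_length
    return [s for s in end_states if s["drift"] == expected_drift]
```

For example, a read that is two bases shorter than the designed oligo is only consistent with end states whose total drift is -2, so every other candidate can be discarded without further scoring.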
- an inner codec comprising an ML algorithm is illustrated by way of example in FIG.11. As shown, an ML algorithm takes into account transitions from all states as it decodes each bit position in a sequence.
- each bit is guessed using the ML algorithm one at a time. In some instances, more than one bit is guessed using the ML algorithm at a given time. In some cases, the ML algorithm repeatedly finds all transition states at each position until end candidate states are determined. In some instances, the x-axis comprises the bit position and the y-axis comprises a state, as previously described herein. In some instances, a drift term, as previously described herein, is used to filter the end candidate states. In some instances, the ML algorithm provides the globally optimal solution by tracking all state transitions. In some cases, the ML algorithm is computationally intensive compared to other decoding schemes, such as those described herein.
- an inner codec comprises a mixed greedy ML algorithm.
- a mixed greedy ML algorithm takes into account transitions from a plurality of states as it decodes each bit position in a sequence.
- the plurality of states are about 100 to about 1000 states as it decodes each bit position in a sequence.
- the plurality of states are about 100 to about 200, about 100 to about 300, about 100 to about 400, about 100 to about 500, about 100 to about 600, about 100 to about 700, about 100 to about 800, about 100 to about 900, about 100 to about 1,000, about 200 to about 300, about 200 to about 400, about 200 to about 500, about 200 to about 600, about 200 to about 700, about 200 to about 800, about 200 to about 900, about 200 to about 1,000, about 300 to about 400, about 300 to about 500, about 300 to about 600, about 300 to about 700, about 300 to about 800, about 300 to about 900, about 300 to about 1,000, about 400 to about 500, about 400 to about 600, about 400 to about 700, about 400 to about 800, about 400 to about 900, about 400 to about 1,000, about 500 to about 600, about 500 to about 700, about 500 to about 800, about 500 to about 900, about 500 to about 1,000, about 600 to about 700, about 600 to about 800, about 600 to about 900, about 600 to about 1,000, about 700 to about 800, about 500
- the plurality of states are about 100, about 200, about 300, about 400, about 500, about 600, about 700, about 800, about 900, or about 1,000 states. In some instances, the plurality of states are at least about 100, about 200, about 300, about 400, about 500, about 600, about 700, about 800, or about 900 states. In some instances, the plurality of states are at most about 200, about 300, about 400, about 500, about 600, about 700, about 800, about 900, or about 1,000 states.
- the states are defined as previously described herein.
- each bit is guessed using the mixed greedy ML algorithm one at a time. In some instances, more than one bit is guessed using the mixed greedy ML algorithm at a given time.
- the mixed greedy ML algorithm repeatedly finds about 100 to about 1000 transition states at each position until end candidate states are determined.
- a drift term as previously described herein, is used to filter the end candidate states.
- the mixed greedy ML algorithm provides the globally optimal solution, while being less computationally expensive relative to other inner codecs, such as the ML algorithm described herein.
- the inner codec comprises a beam search decoder or a random sampling decoder (e.g., pure sampling decoder, a top-K sampling decoder, etc.).
- a beam search decoder or a random sampling decoder provides a diversity of candidate states compared to a greedy decoder.
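The greedy, ML, and mixed greedy ML variants differ mainly in how many candidate states are carried forward at each bit position. The sketch below illustrates that idea with a generic beam search over bit histories: a width of 1 behaves like the greedy decoder, a width large enough to hold every history behaves like exhaustive ML tracking, and an intermediate width (the text mentions about 100 to about 1000 states) corresponds to the mixed approach. The state here is reduced to a bit history, and `toy_model` is an invented transition model used only to show a case where the greedy choice is not globally optimal.

```python
import math


def beam_decode(num_bits, step_prob, beam_width):
    """Keep the `beam_width` most probable partial bit histories per position.

    step_prob(history, bit, pos) returns P(bit | history) at position pos.
    Returns (best_history, log_probability).
    """
    beam = [((), 0.0)]  # (bit history, accumulated log-probability)
    for pos in range(num_bits):
        candidates = []
        for history, logp in beam:
            for bit in (0, 1):
                p = step_prob(history, bit, pos)
                if p > 0:
                    candidates.append((history + (bit,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]  # prune to the most probable states
    return beam[0]


def toy_model(history, bit, pos):
    """Invented model: bit 0 is slightly less likely first but leads to a better path."""
    if pos == 0:
        return 0.6 if bit == 1 else 0.4
    if history[-1] == 1:
        return 0.5
    return 0.9 if bit == 1 else 0.1
```

With `beam_width=1` the decoder commits to a first bit of 1 and ends with path probability 0.3, whereas any width of 2 or more recovers the history (0, 1) with probability 0.36.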
- the inner codec further comprises a checksum.
- the checksum is used to verify data integrity, detect errors, or a combination thereof.
- a checksum is generated using a checksum function or checksum algorithm (e.g., parity byte or parity word (longitudinal parity check), sum complement, position dependent, fuzzy checksum, etc.).
- checksum functions or algorithms include, but are not limited to, BSD checksum (Unix), SYSV checksum (Unix), sum4, sum8, sum16, sum32, fletcher-4, fletcher-8, fletcher-16, fletcher-32, Adler-32, xor8, Luhn algorithm, Verhoeff algorithm, or Damm algorithm.
- the checksum comprises a RS code (e.g., a small RS code).
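As one concrete instance of the checksum functions listed above, the following is a minimal sketch of Fletcher-16 together with a helper that compares a recomputed checksum against a stored one. The helper name `payload_is_intact` and the choice of Fletcher-16 are illustrative assumptions; the disclosure does not state which checksum an implementation would use.

```python
def fletcher16(data: bytes) -> int:
    """Compute the Fletcher-16 checksum of a byte string."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255  # running sum of the bytes
        sum2 = (sum2 + sum1) % 255  # running sum of the running sums
    return (sum2 << 8) | sum1


def payload_is_intact(payload: bytes, stored_checksum: int) -> bool:
    """Return True when the recomputed checksum matches the stored value."""
    return fletcher16(payload) == stored_checksum
```

Because the second sum depends on byte order, this kind of position-dependent checksum detects swapped bytes that a plain sum would miss.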
- the decoder gives a list of possibilities (e.g., “list decoding”) assuming the user can decide which one it actually is.
- the decoding scheme further comprises arranging lanes into frames.
- the decoded lanes from the inner codec are arranged into frames based on the lane index and the frame index.
- one or more lanes are missing from a frame, as shown in FIG.9.
- the lanes are missing due to errors that occurred during synthesis or sequencing of the nucleotides.
- about 1% to about 10% of the lanes are missing from a frame.
- about 1 % to about 2 %, about 1 % to about 4 %, about 1 % to about 6 %, about 1 % to about 8 %, about 1 % to about 10 %, about 2 % to about 4 %, about 2 % to about 6 %, about 2 % to about 8 %, about 2 % to about 10 %, about 4 % to about 6 %, about 4 % to about 8 %, about 4 % to about 10 %, about 6 % to about 8 %, about 6 % to about 10 %, or about 8 % to about 10 % of the lanes are missing from a frame.
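Arranging decoded lanes into frames by frame index and lane index, and noting which lanes are absent, can be sketched as below. The tuple layout `(frame_index, lane_index, payload)`, the fixed `lanes_per_frame` parameter, and the function names are assumptions for illustration.

```python
from collections import defaultdict


def arrange_into_frames(decoded_lanes, lanes_per_frame):
    """Place each decoded lane at its (frame index, lane index) slot.

    Slots that never receive a lane stay None and can later be treated
    as erasures by the outer codec.
    """
    frames = defaultdict(lambda: [None] * lanes_per_frame)
    for frame_idx, lane_idx, payload in decoded_lanes:
        if 0 <= lane_idx < lanes_per_frame:  # ignore out-of-range indices
            frames[frame_idx][lane_idx] = payload
    return dict(frames)


def missing_lanes(frame):
    """Return the lane indices that are absent from a frame."""
    return [i for i, lane in enumerate(frame) if lane is None]
```

Knowing the positions of missing lanes is useful because an erasure at a known position is generally cheaper for an ECC to repair than an error at an unknown position.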
- the inner codec comprises a “format”. In some cases, there is no a-priori information about the size of the data (e.g., binary data) during decoding.
- frame index 0 comprises the size of the data.
- frame 0 is decoded first.
- the data is then extracted from frame 0 to reject frames outside of the expected data size (e.g., from incorrectly decoded oligos).
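The use of frame 0 to bound the expected data can be sketched as follows. The layout is an assumption made for illustration: frame 0 is taken to begin with the data size as an 8-byte big-endian integer, and every data frame is taken to carry a fixed `bytes_per_frame`; neither detail is specified in the text.

```python
import math


def expected_frame_count(frame0: bytes, bytes_per_frame: int) -> int:
    """Read the data size from frame 0 and derive how many frames should exist."""
    data_size = int.from_bytes(frame0[:8], "big")  # assumed header layout
    return 1 + math.ceil(data_size / bytes_per_frame)  # +1 for frame 0 itself


def reject_out_of_range(frames: dict, bytes_per_frame: int) -> dict:
    """Drop frames whose index lies outside the range implied by frame 0."""
    limit = expected_frame_count(frames[0], bytes_per_frame)
    return {idx: f for idx, f in frames.items() if 0 <= idx < limit}
```

Frames with an index beyond the limit most likely come from incorrectly decoded oligos, so discarding them early keeps them out of the outer decoding.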
- the inner codec comprises a hash (e.g., SHA-256).
- the hash verifies that the data was correctly decoded.
- the encoding and decoding are performed as a stream. In some instances, this can limit memory use to only temporary buffers.
- Methods for decoding a plurality of nucleotide sequences can comprise an outer codec (e.g., ECC).
- the plurality of nucleotide sequences are decoded into digital or binary data.
- an ECC is applied to the digital or binary data.
- an ECC is applied to each of the frames.
- the ECC is applied to the lanes from the inner codec.
- the ECC is applied after the lanes from the inner codec are arranged into frames.
- the outer codec comprises an ECC used to encode the data (e.g., binary data).
- the ECC comprises a Reed-Solomon (RS) code, a LDPC code, a polar code, a turbo code, or any combination thereof.
- the ECC comprises a Reed-Solomon (RS) code.
- the RS decoder receives a codeword, r(x), which is the original codeword c(x) plus errors e(x) (e.g., r(x) = c(x) + e(x)). In some cases, the errors e(x) are 0.
- the RS decoder attempts to identify the position and magnitude of up to t errors (or 2t erasures). The RS decoder then attempts to correct these identified errors and/or erasures.
- the RS decoder comprises a syndrome calculation.
- the syndrome calculation comprises receiving incoming symbols and dividing them into the generator polynomial g(x), as previously described herein.
- the syndromes are calculated by substituting the 2t roots (or syndromes of the RS codeword c(x)) of the generator polynomial g(x) into r(x).
- the generator polynomial g(x) is a known parameter of the decoder.
- the RS codeword c(x) has 2t syndromes that depend on errors.
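The syndrome calculation can be illustrated with a small Reed-Solomon sketch over GF(2^8). The field size, the primitive polynomial 0x11D, and the choice of generator roots a^0 through a^(2t-1) are common conventions assumed here, not parameters stated in the text; `nsym` plays the role of 2t. The sketch only encodes and computes syndromes; it does not locate or correct errors.

```python
# GF(2^8) log/antilog tables (assumed primitive polynomial 0x11D, generator 2).
EXP = [0] * 512
LOG = [0] * 256
_v = 1
for _i in range(255):
    EXP[_i] = _v
    LOG[_v] = _i
    _v <<= 1
    if _v & 0x100:
        _v ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(poly, x):
    """Horner evaluation; poly[0] is the highest-degree coefficient."""
    y = 0
    for coef in poly:
        y = gf_mul(y, x) ^ coef
    return y


def generator_poly(nsym):
    """g(x) = (x + a^0)(x + a^1)...(x + a^(nsym-1))."""
    g = [1]
    for i in range(nsym):
        nxt = g + [0]  # g(x) * x
        for j in range(len(g)):
            nxt[j + 1] ^= gf_mul(g[j], EXP[i])  # + g(x) * a^i
        g = nxt
    return g


def rs_encode(msg, nsym):
    """Systematic encoding: append the remainder of msg(x) * x^nsym mod g(x)."""
    g = generator_poly(nsym)
    rem = list(msg) + [0] * nsym
    for i in range(len(msg)):
        coef = rem[i]
        if coef:
            for j in range(1, len(g)):
                rem[i + j] ^= gf_mul(g[j], coef)
    return list(msg) + rem[len(msg):]


def syndromes(received, nsym):
    """Evaluate r(x) at each of the nsym roots of g(x)."""
    return [poly_eval(received, EXP[i]) for i in range(nsym)]
```

For an error-free codeword every syndrome is zero, since c(x) is a multiple of g(x); any nonzero syndrome signals that r(x) differs from c(x) and triggers the error-location step.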
- the RS decoder comprises finding symbol error location.
- parity or check symbols t cause the syndrome calculation to be zero in the case of no errors.
- parity or check symbols t comprise the remainder in the RS encoder. If there are errors, the resulting polynomial g(x) is passed to a Euclid algorithm. In some instances, factors of the remainder are found using the Euclid algorithm. In some instances, the results are evaluated over iterations for each of the incoming symbols. In some instances, errors are found and the errors are corrected. In some cases, the corrected code word c(x) is then outputted from the RS decoder. In some instances, there are more errors in the code word than can be corrected by the RS code (e.g., e(x) > 2t).
- the received codeword r(x) is outputted from the RS decoder. In some instances, the received codeword r(x) is outputted with an indication that the error correction has failed (e.g., a flag). In some instances, the received codeword r(x) (e.g., the lane or the frame comprising binary data as described herein) is discarded. [0180] In some instances, the frames from the ECC are merged to generate an output comprising the binary data. In some instances, the binary data comprises byte streams or byte arrays, as previously described herein. The decoding methods described herein can be used to recover data in the presence of an error in at least one nucleotide sequence in the plurality of nucleotide sequences that was stored.
- the error comprises an insertion, deletion, substitution, or any combination thereof.
- the data is recovered in the presence of errors (e.g., error rate) in about 0.001% to about 30% of the nucleotide sequences in the plurality of nucleotides.
- the data is recovered in the presence of an error rate of about 0.001 % to about 0.01 %, about 0.001 % to about 0.1 %, about 0.001 % to about 0.5 %, about 0.001 % to about 1 %, about 0.001 % to about 2 %, about 0.001 % to about 5 %, about 0.001 % to about 10 %, about 0.001 % to about 15 %, about 0.001 % to about 20 %, about 0.001 % to about 25 %, about 0.001 % to about 30 %, about 0.01 % to about 0.1 %, about 0.01 % to about 0.5 %, about 0.01 % to about 1 %, about 0.01 % to about 2 %, about 0.01 % to about 5 %, about 0.01 % to about 10 %, about 0.01 % to about 15 %, about 0.01 % to about 20 %, about 0. 0.01
- the data is recovered in the presence of an error rate of about 0.001 %, about 0.01 %, about 0.1 %, about 0.5 %, about 1 %, about 2 %, about 5 %, about 10 %, about 15 %, about 20 %, about 25 %, or about 30 %. In some instances, the data is recovered in the presence of an error rate of at least about 0.001 %, about 0.01 %, about 0.1 %, about 0.5 %, about 1 %, about 2 %, about 5 %, about 10 %, about 15 %, about 20 %, or about 25 %.
- the decoding scheme comprises soft decoding.
- Soft decoding generally refers to decoding by considering a range of possible values (e.g., using probability estimates). As an example, sequencing carries quality for each base which can be considered during probability calculations. In such an example, each state comprises a final probability, which can be used in the outer decoder as, for example, log-likelihood if that outer decoder supports soft-decoding.
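To make the role of per-base quality concrete, the sketch below converts Phred quality scores into error probabilities (P = 10^(-Q/10)) and accumulates a log-likelihood for a candidate sequence. The uniform split of error probability over the three other bases, the clamp at 0.75, and the function names are simplifying assumptions; the sketch also ignores insertions and deletions.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Phred score Q to base-call error probability, clamped to at most 0.75."""
    return min(10 ** (-q / 10), 0.75)


def base_log_likelihood(observed: str, candidate: str, q: float) -> float:
    """log P(observed base | candidate base) under a uniform substitution model."""
    p_err = phred_to_error_prob(q)
    p = 1 - p_err if observed == candidate else p_err / 3
    return math.log(p)


def read_log_likelihood(read: str, quals, candidate: str) -> float:
    """Sum per-base log-likelihoods over aligned positions."""
    return sum(
        base_log_likelihood(o, c, q) for o, c, q in zip(read, candidate, quals)
    )
```

A decoder could use such scores to rank candidate states, and could pass the final values to an outer decoder that accepts log-likelihood inputs.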
- an LDPC ECC comprises an iterative decoder. This provides possibilities to go back and forth between the inner and outer decoder in an iterative manner instead of a single pass. However, in some instances, this is accompanied by the cost of higher computing requirements.
- the hashes of the present disclosure can allow verification of digital information during retrieval. In some cases, retrieving a digital information stored in a plurality of polynucleotides further comprises verifying at least the one or more objects. In some instances, the one or more objects are verified using a first one or more hashes in the plurality of pools.
- retrieving a digital information stored in a plurality of polynucleotides further comprises verifying one or more pool items.
- the one or more pool items are verified using a second one or more hashes in the plurality of pools.
- Verifying hashes generally comprises generating hashes (e.g., cryptographic hashes).
- Verifying can further comprise comparing the generated hashes with the previously determined hashes.
- the previously determined hashes and the new hashes are determined using the same hash function.
- the hash function comprises a cryptographic hash function.
- the hash function comprises MD-5, SHA-1, SHA-2, SHA-3, RIPEMD-160, Whirlpool, BLAKE, BLAKE2, BLAKE3, or a variation thereof.
- the hash function comprises SHA-2.
- SHA-2 comprises SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, or SHA-512/256.
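Verification by regenerating a hash and comparing it with the previously determined hash can be sketched with Python's standard `hashlib` module. The function name `verify_object` and the convention of storing the hash as a hexadecimal string are assumptions for illustration.

```python
import hashlib
import hmac


def verify_object(data: bytes, stored_hex_digest: str, algorithm: str = "sha256") -> bool:
    """Recompute the hash of `data` and compare it with the stored digest.

    The same hash function must be used for the stored and the new hash.
    """
    computed = hashlib.new(algorithm, data).hexdigest()
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(computed, stored_hex_digest.lower())
```

Changing `algorithm` to another name supported by `hashlib`, such as `"sha512"`, switches the SHA-2 variant without altering the comparison logic.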
- Retrieving digital information can comprise combining the information stored across pool items and/or the plurality of pools. In some cases, retrieving a digital information stored in a plurality of polynucleotides further comprises combining the digital information in the plurality of pools. In some instances, the data payload in the one or more pool items are combined.
- the data payload in the one or more pool items across the plurality of pools are combined.
- the combined data payloads comprise the digital information.
- the retrieved digital information is further stored on a memory.
- the retrieved digital information is presented to a user.
- the information is presented to a user on an interface.
- the interface is an interface of an electronic device (e.g., personal electronic device).
- the electronic device comprises an application configured to communicate with the systems described herein via a computer network to access the information.
- the methods for retrieving digital information in DNA can be carried out on a system.
- such a system comprises an apparatus comprising one or more processing units, a memory, instructions, a sequencing device, or a combination thereof.
- the memory is in communication with the one or more processing units.
- the instructions are stored on the memory.
- the sequencing device is in communication with the memory, the one or more processing units, or the combination thereof.
- the one or more processing units and memory are distributed across one or more physical or logical locations.
- the memory is used to store digital information, polynucleotide sequences (e.g., partially or fully decoded sequences), or the combination thereof.
- the memory is used to store information related to the algorithms described herein (e.g., software code, parameters, executable instructions, etc.).
- the memory can comprise any suitable memory described herein.
- the memory can be configured according to embodiments described herein.
- the sequencing device is configured to determine the plurality of nucleotide sequences using the methods described herein.
- the one or more processing units include any combination of central processing units (CPUs), graphical processing units (GPUs), single core processors, multi-core processors, processor clusters, application-specific integrated circuits (ASICs), programmable circuits such as Field Programmable Gate Arrays (FPGAs), an AI-accelerator, and variations thereof.
- one or more of the processing units comprise a Single Instruction Multiple Data (SIMD) or Single Program Multiple Data (SPMD) parallel architecture.
- the one or more processing units include one or more GPUs or CPUs that implement SIMD or SPMD.
- an AI-accelerator comprises Google-TPU, Graphcore, Cerebras, SambaNova, or a combination thereof.
- one or more of the processing units is implemented in software and/or firmware, in addition to hardware implementations.
- Software or firmware implementations of the processing units can include computer- or machine-executable instructions written in any suitable programming language to perform the various functions described herein.
- Software implementations of the one or more processing units can be stored in whole or part in the memory.
- the system can comprise one or more hardware logic components.
- illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
- decoding is run on compute-on-memory technologies, such as, but not limited to, UpMem.
- the one or more processing units are configured to perform one or more decoding steps.
- the processing device is configured to perform one or more steps comprising: applying a decoding scheme to decode the digital information in the plurality of pools; verifying at least the one or more objects using a first one or more hashes in the plurality of pools; combining the digital information in the plurality of pools to retrieve the one or more objects; and storing the digital information on a memory.
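The verify-and-combine steps listed above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes each pool item has already been decoded to bytes, that one SHA-256 hash is stored per pool, and that pools are ordered by an integer ID. The names `retrieve_object`, `pools`, and `expected_hashes` are hypothetical.

```python
import hashlib


def retrieve_object(pools, expected_hashes):
    """Verify each pool's payload against its hash, then concatenate in pool order.

    pools: dict mapping pool ID -> list of decoded pool-item payloads (bytes).
    expected_hashes: dict mapping pool ID -> hex SHA-256 digest (assumed layout).
    """
    combined = bytearray()
    for pool_id in sorted(pools):
        # Combine the data payloads of the pool items within one pool.
        payload = b"".join(pools[pool_id])
        digest = hashlib.sha256(payload).hexdigest()
        if digest != expected_hashes[pool_id]:
            raise ValueError(f"hash mismatch in pool {pool_id}")
        # Combine the verified payloads across pools.
        combined.extend(payload)
    return bytes(combined)


# Toy data standing in for decoded pools.
pools = {0: [b"hello ", b"DNA "], 1: [b"storage"]}
expected = {
    pid: hashlib.sha256(b"".join(items)).hexdigest()
    for pid, items in pools.items()
}
obj = retrieve_object(pools, expected)
```

Raising on the first mismatch is one possible policy; a real system might instead fall back to error correction or re-sequencing before giving up.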
- the one or more processing units are configured to perform one or more steps comprising: applying an inner codec to the plurality of polynucleotides; or applying an ECC to the plurality of polynucleotides.
- the inner codec transforms each of the plurality of polynucleotides into digital information.
- the inner codec comprises a mixed decoding algorithm comprising a greedy algorithm and a maximum likelihood (ML) algorithm.
- the outputs from an ECC are merged to generate an output comprising the digital information.
- a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual values within that range, for example, 1.1, 2, 2.3, 5, and 5.9. This applies regardless of the breadth of the range.
- the upper and lower limits of these intervening ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention, unless the context clearly dictates otherwise.
- As used herein, the terms “preselected sequence”, “predefined sequence” or “predetermined sequence” are used interchangeably. The terms mean that the sequence of the polymer is known and chosen before synthesis or assembly of the polymer. In particular, various aspects of the invention are described herein primarily with regard to the preparation of nucleic acid molecules, the sequence of the polynucleotide being known and chosen before the synthesis or assembly of the nucleic acid molecules.
- [0196] As used herein, the term “hash” or “hashes” may generally refer to a string of fixed length that is outputted from a hash function.
- a hash function may generally comprise a function that maps an input of arbitrary length to an output with a fixed length.
- the input may be one or more terms of a transaction or a contract, which may be passed through a hash function to generate a hash.
- the hash function may be deterministic, and it may be infeasible to reverse-engineer the input from the hashed output. The act of feeding an input into a hash function may be referred to as “hashing”.
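The fixed-length and deterministic properties described above can be illustrated with SHA-256 from Python's standard `hashlib` module. This is only a sketch; SHA-256 is chosen here as one member of the SHA-2 family, and the helper name `sha256_hex` is hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Inputs of very different lengths produce digests of the same length.
short = sha256_hex(b"A")
long_ = sha256_hex(b"ACGT" * 10_000)

# Hashing the same input again yields the same digest (determinism).
repeat = sha256_hex(b"A")
```

Both digests are 64 hexadecimal characters (256 bits), regardless of input size.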
- Polynucleotide sequences described herein may, unless stated otherwise, comprise DNA or RNA or an analog or derivative thereof.
- the terms nucleic acids, polynucleotides, oligonucleotides, oligos, and oligonucleic acids are used synonymously throughout to represent a polymer of nucleoside monomers.
- nucleic acids are connected via phosphate or sulfur-containing linkages.
- Nucleic acids in some instances comprise DNA, RNA, non-canonical nucleic acids, unnatural nucleic acids, or other nucleosides.
- nucleotides comprise non-canonical bases, sugars, or other moieties.
- nucleotides comprise terminators which are configured to prevent extension reactions. In some instances, such terminators are removed before addition of subsequent nucleotides to the growing chain.
- Referring to FIG.12, a block diagram is shown depicting an exemplary machine that includes a computer system 1200 (e.g., a processing or computing system) within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies for static code scheduling of the present disclosure.
- the components in FIG.12 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.
- Computer system 1200 may include one or more processors 1201, a memory 1203, and a storage 1208 that communicate with each other, and with other components, via a bus 1240.
- the bus 1240 may also link a display 1232, one or more input devices 1233 (which may, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 1234, one or more storage devices 1235, and various tangible storage media 1236. All of these elements may interface directly or via one or more interfaces or adaptors to the bus 1240. For instance, the various tangible storage media 1236 can interface with the bus 1240 via storage medium interface 1226.
- Computer system 1200 may have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile handheld devices (such as mobile telephones or PDAs), laptop or notebook computers, distributed computer systems, computing grids, or servers.
- Computer system 1200 includes one or more processor(s) 1201 (e.g., central processing units (CPUs), general purpose graphics processing units (GPGPUs), or quantum processing units (QPUs)) that carry out functions.
- Processor(s) 1201 optionally contain a cache memory unit 1202 for temporary local storage of instructions, data, or computer addresses.
- Processor(s) 1201 are configured to assist in execution of computer readable instructions.
- Computer system 1200 may provide functionality for the components depicted in FIG.12 as a result of the processor(s) 1201 executing non-transitory, processor-executable instructions embodied in one or more tangible computer-readable storage media, such as memory 1203, storage 1208, storage devices 1235, and/or storage medium 1236.
- the computer-readable media may store software that implements particular embodiments, and processor(s) 1201 may execute the software.
- Memory 1203 may read the software from one or more other computer-readable media (such as mass storage device(s) 1235, 1236) or from one or more other sources through a suitable interface, such as network interface 1220.
- the software may cause processor(s) 1201 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein.
- the memory 1203 may include various components (e.g., machine readable media) including, but not limited to, a random access memory component (e.g., RAM 1204) (e.g., static RAM (SRAM), dynamic RAM (DRAM), ferroelectric random access memory (FRAM), phase-change random access memory (PRAM), etc.), a read-only memory component (e.g., ROM 1205), and any combinations thereof.
- ROM 1205 may act to communicate data and instructions unidirectionally to processor(s) 1201, and RAM 1204 may act to communicate data and instructions bidirectionally with processor(s) 1201.
- ROM 1205 and RAM 1204 may include any suitable tangible computer-readable media described below.
- a basic input/output system 1206 (BIOS), including basic routines that help to transfer information between elements within computer system 1200, such as during start-up, may be stored in the memory 1203.
- Fixed storage 1208 is connected bidirectionally to processor(s) 1201, optionally through storage control unit 1207.
- Fixed storage 1208 provides additional data storage capacity and may also include any suitable tangible computer-readable media described herein.
- Storage 1208 may be used to store operating system 1209, executable(s) 1210, data 1211, applications 1212 (application programs), and the like.
- Storage 1208 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above. Information in storage 1208 may, in appropriate cases, be incorporated as virtual memory in memory 1203. [0203]
- storage device(s) 1235 may be removably interfaced with computer system 1200 (e.g., via an external port connector (not shown)) via a storage device interface 1225.
- storage device(s) 1235 and an associated machine-readable medium may provide non-volatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 1200.
- Bus 1240 connects a wide variety of subsystems.
- reference to a bus may encompass one or more digital signal lines serving a common function, where appropriate.
- Bus 1240 may be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures.
- Computer system 1200 may also include an input device 1233.
- a user of computer system 1200 may enter commands and/or other information into computer system 1200 via input device(s) 1233.
- Examples of input device(s) 1233 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a touch screen, a multi-touch screen, a joystick, a stylus, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof.
- the input device is a Kinect, Leap Motion, or the like.
- Input device(s) 1233 may be interfaced to bus 1240 via any of a variety of input interfaces 1223 including, but not limited to, serial, parallel, game port, USB, FIREWIRE, THUNDERBOLT, or any combination of the above.
- computer system 1200 may communicate with other devices, specifically mobile devices and enterprise systems, distributed computing systems, cloud storage systems, cloud computing systems, and the like, connected to network 1230. Communications to and from computer system 1200 may be sent through network interface 1220.
- network interface 1220 may receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 1230, and computer system 1200 may store the incoming communications in memory 1203 for processing.
- Computer system 1200 may similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 1203 and communicate them to network 1230 from network interface 1220.
- Processor(s) 1201 may access these communication packets stored in memory 1203 for processing.
- Examples of the network interface 1220 include, but are not limited to, a network interface card, a modem, and any combination thereof.
- Examples of a network 1230 or network segment 1230 include, but are not limited to, a distributed computing system, a cloud computing system, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, a peer-to-peer network, and any combinations thereof.
- a network, such as network 1230 may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
- Information and data can be displayed through a display 1232.
- Examples of a display 1232 include, but are not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a thin film transistor liquid crystal display (TFT-LCD), an organic light-emitting diode (OLED) display such as a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display, a plasma display, and any combinations thereof.
- the display 1232 can interface to the processor(s) 1201, memory 1203, and fixed storage 1208, as well as other devices, such as input device(s) 1233, via the bus 1240.
- the display 1232 is linked to the bus 1240 via a video interface 1222, and transport of data between the display 1232 and the bus 1240 can be controlled via the graphics control 1221.
- the display is a video projector.
- the display is a head-mounted display (HMD) such as a VR headset.
- suitable VR headsets include, by way of non-limiting examples, HTC Vive, Oculus Rift, Samsung Gear VR, Microsoft HoloLens, Razer OSVR, FOVE VR, Zeiss VR One, Avegant Glyph, Freefly VR headset, and the like.
- the display is a combination of devices such as those disclosed herein.
- computer system 1200 may include one or more other peripheral output devices 1234 including, but not limited to, an audio speaker, a printer, a storage device, and any combinations thereof.
- Such peripheral output devices may be connected to the bus 1240 via an output interface 1224.
- Examples of an output interface 1224 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE port, a THUNDERBOLT port, and any combinations thereof.
- computer system 1200 may provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which may operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein.
- Reference to software in this disclosure may encompass logic, and reference to logic may encompass software.
- a computer-readable medium may encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate.
- the present disclosure encompasses any suitable combination of hardware, software, or both.
- the various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
- a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
- a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- the steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by one or more processor(s), or in a combination of the two.
- a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
- the storage medium may be integral to the processor.
- the processor and the storage medium may reside in an ASIC.
- the ASIC may reside in a user terminal.
- the processor and the storage medium may reside as discrete components in a user terminal.
- suitable computing devices include, by way of non-limiting examples, server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, media streaming devices, handheld computers, Internet appliances, mobile smartphones, tablet computers, personal digital assistants, video game consoles, and vehicles.
- the computing device includes an operating system configured to perform executable instructions.
- the operating system is, for example, software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
- suitable server operating systems include, by way of non-limiting examples, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
- suitable personal computer operating systems include, by way of non-limiting examples, Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
- the operating system is provided by cloud computing.
- suitable mobile smartphone operating systems include, by way of non-limiting examples, Nokia® Symbian® OS, Apple® iOS®, Research In Motion® BlackBerry OS®, Google® Android®, Microsoft® Windows Phone® OS, Microsoft® Windows Mobile® OS, Linux®, and Palm® WebOS®.
- suitable media streaming device operating systems include, by way of non-limiting examples, Apple TV®, Roku®, Boxee®, Google TV®, Google Chromecast®, Amazon Fire®, and Samsung® HomeSync®.
- Non-transitory computer readable storage medium includes one or more non-transitory computer readable storage media encoded with a program including instructions executable by the operating system of an optionally networked computing device.
- a computer readable storage medium is a tangible component of a computing device.
- a computer readable storage medium is optionally removable from a computing device.
- a computer readable storage medium includes, by way of non-limiting examples, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, distributed computing systems including cloud computing systems and services, and the like.
- the program and instructions are permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
- Computer program [0217]
- the platforms, systems, media, and methods disclosed herein include at least one computer program, or use of the same.
- a computer program includes a sequence of instructions, executable by one or more processor(s) of the computing device’s CPU, written to perform a specified task.
- Computer readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), computing data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those of skill in the art will recognize that a computer program may be written in various versions of various languages. [0218] The functionality of the computer readable instructions may be combined or distributed as desired in various environments.
- a computer program comprises one sequence of instructions.
- a computer program comprises a plurality of sequences of instructions.
- a computer program is provided from one location.
- a computer program is provided from a plurality of locations.
- a computer program includes one or more software modules.
- a computer program includes, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or combinations thereof.
- Web application [0219]
- a computer program includes a web application.
- a web application, in various embodiments, utilizes one or more software frameworks and one or more database systems.
- a web application is created upon a software framework such as Microsoft® .NET or Ruby on Rails (RoR).
- a web application utilizes one or more database systems including, by way of non-limiting examples, relational, non-relational, object oriented, associative, XML, and document oriented database systems.
- suitable relational database systems include, by way of non-limiting examples, Microsoft® SQL Server, MySQL™, and Oracle®.
- a web application may be written in one or more markup languages, presentation definition languages, client-side scripting languages, server-side coding languages, database query languages, or combinations thereof.
- a web application is written to some extent in a markup language such as Hypertext Markup Language (HTML), Extensible Hypertext Markup Language (XHTML), or eXtensible Markup Language (XML).
- a web application is written to some extent in a presentation definition language such as Cascading Style Sheets (CSS).
- a web application is written to some extent in a client-side scripting language such as Asynchronous JavaScript and XML (AJAX), Flash® ActionScript, JavaScript, or Silverlight®.
- a web application is written to some extent in a server-side coding language such as Active Server Pages (ASP), ColdFusion®, Perl, Java™, JavaServer Pages (JSP), Hypertext Preprocessor (PHP), Python™, Ruby, Tcl, Smalltalk, WebDNA®, or Groovy.
- a web application is written to some extent in a database query language such as Structured Query Language (SQL).
- a web application integrates enterprise server products such as IBM® Lotus Domino®.
- a web application includes a media player element.
- a media player element utilizes one or more of many suitable multimedia technologies including, by way of non-limiting examples, Adobe® Flash®, HTML 5, Apple® QuickTime®, Microsoft® Silverlight®, Java™, and Unity®.
- Mobile application [0220]
- a computer program includes a mobile application provided to a mobile computing device.
- the mobile application is provided to a mobile computing device at the time it is manufactured.
- the mobile application is provided to a mobile computing device via the computer network described herein.
- a mobile application is created by techniques known to those of skill in the art using hardware, languages, and development environments known to the art.
- Suitable programming languages include, by way of non-limiting examples, C, C++, C#, Objective-C, Java™, JavaScript, Pascal, Object Pascal, Python™, Ruby, VB.NET, WML, and XHTML/HTML with or without CSS, or combinations thereof.
- Suitable mobile application development environments are available from several sources. Commercially available development environments include, by way of non-limiting examples, AirplaySDK, alcheMo, Appcelerator®, Celsius, Bedrock, Flash Lite, .NET Compact Framework, Rhomobile, and WorkLight Mobile Platform.
- a computer program includes a standalone application, which is a program that is run as an independent computer process, not an add-on to an existing process, e.g., not a plug-in. Those of skill in the art will recognize that standalone applications are often compiled.
- a compiler is a computer program(s) that transforms source code written in a programming language into binary object code such as assembly language or machine code. Suitable compiled programming languages include, by way of non-limiting examples, C, C++, Objective-C, COBOL, Delphi, Eiffel, Java™, Lisp, Python™, Visual Basic, and VB .NET, or combinations thereof. Compilation is often performed, at least in part, to create an executable program.
- a computer program includes one or more executable compiled applications.
- Web browser plug-in [0225]
- the computer program includes a web browser plug-in (e.g., extension, etc.).
- a plug-in is one or more software components that add specific functionality to a larger software application.
- Makers of software applications support plug-ins to enable third-party developers to create abilities which extend an application, to support easily adding new features, and to reduce the size of an application.
- plug-ins enable customizing the functionality of a software application.
- plug-ins are commonly used in web browsers to play video, generate interactivity, scan for viruses, and display particular file types.
- Those of skill in the art will be familiar with several web browser plug-ins including Adobe® Flash® Player, Microsoft® Silverlight®, and Apple® QuickTime®.
- the toolbar comprises one or more web browser extensions, add-ins, or add-ons.
- the toolbar comprises one or more explorer bars, tool bands, or desk bands.
- In view of the disclosure provided herein, those of skill in the art will recognize that several plug-in frameworks are available that enable development of plug-ins in various programming languages, including, by way of non-limiting examples, C++, Delphi, Java™, PHP, Python™, and VB .NET, or combinations thereof.
- Web browsers (also called Internet browsers) are software applications, designed for use with network-connected computing devices, for retrieving, presenting, and traversing information resources on the World Wide Web.
- Suitable web browsers include, by way of non-limiting examples, Microsoft® Internet Explorer®, Mozilla® Firefox®, Google® Chrome, Apple® Safari®, Opera Software® Opera®, and KDE Konqueror.
- the web browser is a mobile web browser.
- Mobile web browsers (also called microbrowsers, mini-browsers, and wireless browsers) are designed for use on mobile computing devices including, by way of non-limiting examples, handheld computers, tablet computers, netbook computers, subnotebook computers, smartphones, music players, personal digital assistants (PDAs), and handheld video game systems.
- Suitable mobile web browsers include, by way of non-limiting examples, Google® Android® browser, RIM BlackBerry® Browser, Apple® Safari®, Palm® Blazer, Palm® WebOS® Browser, Mozilla® Firefox® for mobile, Microsoft® Internet Explorer® Mobile, Amazon® Kindle® Basic Web, Nokia® Browser, Opera Software® Opera® Mobile, and Sony® PSP™ browser.
- Software modules [0228]
- the platforms, systems, media, and methods disclosed herein include software, server, and/or database modules, or use of the same.
- software modules are created by techniques known to those of skill in the art using machines, software, and languages known to the art.
- a software module comprises a file, a section of code, a programming object, a programming structure, a distributed computing resource, a cloud computing resource, or combinations thereof.
- a software module comprises a plurality of files, a plurality of sections of code, a plurality of programming objects, a plurality of programming structures, a plurality of distributed computing resources, a plurality of cloud computing resources, or combinations thereof.
- the one or more software modules comprise, by way of non-limiting examples, a web application, a mobile application, a standalone application, and a distributed or cloud computing application.
- software modules are in one computer program or application.
- software modules are in more than one computer program or application. In some embodiments, software modules are hosted on one machine. In other embodiments, software modules are hosted on more than one machine. In further embodiments, software modules are hosted on a distributed computing platform such as a cloud computing platform. In some embodiments, software modules are hosted on one or more machines in one location. In other embodiments, software modules are hosted on one or more machines in more than one location.
- Databases [0229]
- In some embodiments, the platforms, systems, media, and methods disclosed herein include one or more databases, or use of the same. In view of the disclosure provided herein, those of skill in the art will recognize that many databases are suitable for storage and retrieval of information.
- suitable databases include, by way of non-limiting examples, relational databases, non-relational databases, object oriented databases, object databases, entity-relationship model databases, associative databases, XML databases, document oriented databases, and graph databases. Further non-limiting examples include SQL, PostgreSQL, MySQL, Oracle, DB2, Sybase, and MongoDB.
- a database is Internet-based.
- a database is web-based.
- a database is cloud computing-based.
- a database is a distributed database.
- a database is based on one or more local computer storage devices.
- Example 1 Quality Control of Digital Information in DNA Post-Synthesis
- Digital information is encoded in data polynucleotides using methods described herein (FIGs.5-7). A voltage is applied to a synthesis surface and a current is measured to determine if there is a defect on the synthesis surface. About 100,000 data polynucleotides are synthesized using the methods described herein with continuous quality control by current sensing, optical imaging, and flow sensing.
- QC polynucleotides are synthesized on the synthesis surface with the data polynucleotides.
- the QC polynucleotides constitute about 1 % of the total polynucleotides on the surface (e.g., data polynucleotides plus QC polynucleotides).
- the QC polynucleotides comprise a first primer sequence, and are amplified based on the first primer sequence.
- the data polynucleotides comprise a second primer sequence that is different than the first primer sequence.
- the QC polynucleotides are fully sequenced.
- the QC polynucleotides are then aligned against a reference.
- the reference is a preselected sequence that is known. Aligning the QC polynucleotides against a reference generates a relative read count to determine the number of QC polynucleotides that have the same sequence as the reference sequence.
- the QC polynucleotides are aligned against a reference to estimate an error rate in the polynucleotides.
- the error rate in the QC polynucleotides serves as a proxy for the error rate in the data polynucleotides.
- the error rate is estimated to be less than 5%.
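The alignment-based estimate described above can be illustrated with a minimal sketch. It assumes a plain edit distance as a stand-in for the alignment step (a production pipeline would normally use a dedicated aligner), and the names `edit_distance` and `qc_summary`, as well as the toy reads, are hypothetical and not taken from the disclosure.

```python
from typing import Dict, Sequence


def edit_distance(read: str, reference: str) -> int:
    """Levenshtein distance between a read and the reference."""
    previous = list(range(len(reference) + 1))
    for i, read_base in enumerate(read, start=1):
        current = [i]
        for j, ref_base in enumerate(reference, start=1):
            cost = 0 if read_base == ref_base else 1
            current.append(min(
                previous[j] + 1,         # base present only in the read
                current[j - 1] + 1,      # base present only in the reference
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def qc_summary(reads: Sequence[str], reference: str) -> Dict[str, float]:
    """Relative read count (exact matches) and a per-base error rate."""
    if not reads or not reference:
        raise ValueError("reads and reference must be non-empty")
    distances = [edit_distance(read, reference) for read in reads]
    exact_matches = sum(1 for d in distances if d == 0)
    return {
        # Fraction of QC reads identical to the known reference.
        "relative_read_count": exact_matches / len(reads),
        # Assumed definition: total edits divided by total reference bases.
        "per_base_error_rate": sum(distances) / (len(reads) * len(reference)),
    }


# Toy data: two perfect reads, one substitution, one deletion.
reference = "ACGTACGTAC"
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGTACGTA"]
summary = qc_summary(reads, reference)
print(summary)
```

Under the proxy assumption stated in the example, the per-base error rate measured on the QC polynucleotides would be compared against the acceptance threshold (less than 5% here) in place of a direct measurement on the data polynucleotides.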
- the QC polynucleotides are also aligned against a reference to estimate a synthesis uniformity in the data polynucleotides.
- the synthesis uniformity of the QC polynucleotides is analyzed across locations of the synthesis surface. The synthesis uniformity is estimated to be greater than 95%.
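The disclosure does not give a formula for synthesis uniformity, so the following is only a sketch under an assumed definition: the fraction of surface locations whose QC read count lies within a tolerance of the median count. The function name `synthesis_uniformity`, the 50% tolerance, and the grid of counts are all hypothetical.

```python
from statistics import median
from typing import Dict, Tuple


def synthesis_uniformity(counts_by_location: Dict[Tuple[int, int], int],
                         tolerance: float = 0.5) -> float:
    """Fraction of locations with a read count close to the median.

    Assumed definition: a location counts as uniform when its read count
    differs from the median by at most `tolerance` times the median.
    """
    counts = list(counts_by_location.values())
    if not counts:
        raise ValueError("at least one location is required")
    centre = median(counts)
    within = sum(1 for c in counts if abs(c - centre) <= tolerance * centre)
    return within / len(counts)


# Toy 4 x 5 grid of QC read counts with one poorly synthesized location.
counts = {(row, col): 100 + (row + col) % 3
          for row in range(4) for col in range(5)}
counts[(3, 4)] = 10
uniformity = synthesis_uniformity(counts)
print(f"uniformity = {uniformity:.2f}")
```

Other definitions, such as one based on the coefficient of variation of the counts, would fit the text equally well; the point is only that per-location read counts of the QC polynucleotides are reduced to a single figure that can be compared with a threshold.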
- a subset of the data polynucleotides comprising about 0.1% of the data polynucleotides is also selected and amplified.
- the subset is randomly selected from across the synthesis surface and is sequenced.
- the subset is partially decoded by an inner codec to determine an index, as shown in FIG.9.
- the index is used to estimate a relative distribution of the subset of the plurality of data polynucleotides that are then arranged according to lanes and frames. Since the subset comprises about 0.1% of 100,000 data polynucleotides, the relative distribution of the subset is expected to be centered around about every 100 decode indices. The relative distribution is used to estimate synthesis uniformity, which is estimated to be greater than 95%.
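As a rough illustration of the index-based check, the sketch below draws a random 0.1% subset of 100,000 indices, bins the decoded indices, and scores how evenly the bins are filled. The equal-width bins are a hypothetical stand-in for the lanes and frames of the disclosure, the chi-square statistic is an assumed choice of score, and `bin_counts` and `chi_square_uniform` are illustrative names.

```python
import random
from typing import List, Sequence


def bin_counts(indices: Sequence[int], total: int, n_bins: int) -> List[int]:
    """Count decoded indices falling into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for index in indices:
        if not 0 <= index < total:
            raise ValueError(f"index {index} is outside 0..{total - 1}")
        counts[index * n_bins // total] += 1
    return counts


def chi_square_uniform(counts: Sequence[int]) -> float:
    """Chi-square statistic against an even spread (0.0 = perfectly even)."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


total_polynucleotides = 100_000
subset_fraction = 0.001  # about 0.1 %
sample_size = round(total_polynucleotides * subset_fraction)

# Simulated decoded indices; a real run would take them from the inner codec.
rng = random.Random(0)
decoded_indices = rng.sample(range(total_polynucleotides), sample_size)

counts = bin_counts(decoded_indices, total_polynucleotides, n_bins=10)
print(sample_size, counts, round(chi_square_uniform(counts), 2))
```

A strongly uneven histogram, for instance several empty bins, would suggest that parts of the index space (and therefore parts of the surface) are under-represented in the sequenced subset.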
- an inner codec comprising a mixed greedy ML algorithm is further applied and a likelihood is generated.
- the likelihood is based on the number of steps required for decoding and the probability associated with each step in the algorithm.
- the likelihood is associated with an error rate in the data polynucleotides, where a high likelihood is associated with a low error rate, and a low likelihood is associated with a high error rate.
- the error rate is estimated to be less than 5 %.
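The text states only that the likelihood depends on the number of decoding steps and the probability associated with each step. One way to combine the two, offered here as an assumption rather than as the codec's actual scoring rule, is to sum the log-probabilities of the steps and, for comparison across reads of different length, normalize by the step count. The function names and step probabilities below are hypothetical.

```python
import math
from typing import Sequence


def decode_log_likelihood(step_probabilities: Sequence[float]) -> float:
    """Sum of log-probabilities over all decoding steps."""
    if not step_probabilities:
        raise ValueError("at least one decoding step is required")
    if any(not 0.0 < p <= 1.0 for p in step_probabilities):
        raise ValueError("step probabilities must lie in (0, 1]")
    return sum(math.log(p) for p in step_probabilities)


def per_step_likelihood(step_probabilities: Sequence[float]) -> float:
    """Geometric mean of the step probabilities (length-normalized score)."""
    total = decode_log_likelihood(step_probabilities)
    return math.exp(total / len(step_probabilities))


# Hypothetical traces: a clean read decodes in few, confident steps; a noisy
# read needs extra, low-confidence steps.
clean_trace = [0.99] * 20
noisy_trace = [0.99] * 15 + [0.60] * 10

for name, trace in (("clean", clean_trace), ("noisy", noisy_trace)):
    print(name, round(decode_log_likelihood(trace), 3),
          round(per_step_likelihood(trace), 3))
```

In this sketch both more steps and less confident steps lower the score, which is consistent with the stated relationship that a high likelihood is associated with a low error rate and a low likelihood with a high error rate.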
- a method for quality control (QC) of data polynucleotides comprising: a. providing a plurality of QC polynucleotides on a surface, wherein the plurality of QC polynucleotides comprises a first primer sequence; b. amplifying the plurality of QC polynucleotides based on the first primer sequence; c. sequencing the plurality of QC polynucleotides; and d.
- Item 2 The method of item 1, wherein the error rate, the synthesis uniformity, or a combination thereof is based at least in part on a relative read count of the plurality of QC polynucleotides.
- Item 3 The method of item 1 or 2, wherein the plurality of QC polynucleotides is about 1% or less of the polynucleotides on the surface.
- a method for quality control (QC) of data polynucleotides comprising: a.
- the method of any one of items 13-15 further comprising decoding an index of the subset of the data polynucleotides.
- Item 17. The method of item 16, wherein the index is decoded using the inner codec, an outer codec, or a combination thereof.
- Item 18. The method of item 16, wherein the index is used to estimate a relative distribution of the subset of the plurality of data polynucleotides.
- Item 19. The method of any one of items 13-18, wherein the QC is performed during synthesis of the polynucleotides, QC of stored polynucleotides, or a combination thereof.
- Item 20 The method of any one of items 13-19, wherein the subset of the plurality of the data polynucleotides is selected at random.
- Item 21 The method of any one of items 13-19, wherein the subset of the plurality of the data polynucleotides is selected based at least in part on their location on a surface.
- Item 22 The method of any one of items 13-21, wherein the plurality of data polynucleotides comprises about 100,000 polynucleotides.
- Item 23 The method of item 22, wherein the subset of the plurality of data polynucleotides is about 0.1 % of the plurality of data polynucleotides.
- Item 24 The method of any one of items 13-19, wherein the subset of the plurality of the data polynucleotides are selected at random.
- Item 25 The method of item 24, wherein the current sensing comprises measuring a current of a chip or a section of the chip.
- Item 26 The method of item 25, wherein the current is compared to a reference value.
- Item 27 The method of item 26, wherein a difference between the current and the reference value is indicative of a chip failure, a deblocking failure, or a combination thereof.
- Item 28 The method of any one of items 24-27, wherein the current sensing is performed before synthesis of the plurality of data polynucleotides.
- Item 29 The method of item 28, wherein the current sensing is used to detect a chip defect, adjust polynucleotide synthesis locations on a chip, or a combination thereof.
- Item 30 The method of any one of items 24-29, wherein mass estimation is performed using fluorescence.
- Item 31 The method of item 30, wherein the fluorescence is used to detect a yield of the plurality of polynucleotides.
- Item 32 The method of any one of items 24-31, wherein optical imaging comprises detecting a chip defect, non-uniformity, or a combination thereof.
- a method of performing QC of a plurality of cells on a surface comprising: a. measuring a current of each cell in the plurality of cells on the surface; b. determining if one or more cells in the plurality of cells comprises a defect based at least in part on the current; c. synthesizing and/or storing polynucleotides at a second one or more cells in the plurality of cells, wherein the second one or more cells do not comprise the defect.
- Item 34 The method of item 33, wherein the defect comprises a physical defect.
- Item 35 The method of item 33 or 34, wherein the surface is a synthesis surface, a storage surface, or a combination thereof.
- Item 36 The method of item 33 or 34, wherein the surface is a synthesis surface, a storage surface, or a combination thereof.
- Item 37 The method of item 36, wherein blocking is performed by a protecting group on the surface.
- Item 38 The method of item 37, wherein blocking is performed by a photolabile protecting group on the surface.
- Item 39 The method of any one of items 36-38, wherein blocking is performed by selectively supplying energy to the one or more cells.
- Item 40 The method of any one of items 36-39, wherein blocking is performed by a masking material.
- Item 41 The method of any one of items 36-40, wherein blocking is performed by addressable control of each cell in the plurality of cells.
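The current-sensing items above (measuring a current per cell, comparing it with a reference value, and synthesizing only at cells without a defect) can be sketched as follows. The 20% tolerance, the arbitrary current units, the cell labels, and the function names are assumptions made for illustration; the disclosure does not specify them.

```python
from typing import Dict, List, Set


def find_defective_cells(currents: Dict[str, float],
                         reference_current: float,
                         tolerance: float = 0.2) -> Set[str]:
    """Cells whose measured current deviates too far from the reference.

    Assumed rule: a cell is flagged when the absolute difference exceeds
    `tolerance` times the reference current.
    """
    if reference_current <= 0:
        raise ValueError("reference_current must be positive")
    limit = tolerance * reference_current
    return {cell for cell, current in currents.items()
            if abs(current - reference_current) > limit}


def usable_cells(currents: Dict[str, float],
                 reference_current: float,
                 tolerance: float = 0.2) -> List[str]:
    """Cells that remain available for synthesis or storage."""
    defective = find_defective_cells(currents, reference_current, tolerance)
    return sorted(cell for cell in currents if cell not in defective)


# Hypothetical per-cell currents (arbitrary units) measured before synthesis.
measured = {"A1": 1.00, "A2": 0.98, "A3": 0.10, "B1": 1.05, "B2": 1.60}
defective = find_defective_cells(measured, reference_current=1.0)
available = usable_cells(measured, reference_current=1.0)
print(sorted(defective), available)
```

Running the check before synthesis, as in items 28 and 29, lets the synthesis locations be reassigned to the remaining cells rather than discovering the defect after polynucleotides have been written.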
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24710911.9A EP4655240A1 (en) | 2023-01-26 | 2024-01-26 | Quality control for dna data storage |
| CN202480019397.0A CN120882652A (en) | 2023-01-26 | 2024-01-26 | Quality control of DNA data storage |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363481747P | 2023-01-26 | 2023-01-26 | |
| US63/481,747 | 2023-01-26 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024159068A1 (en) | 2024-08-02 |
Family
ID=90364177
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/013050 Ceased WO2024159068A1 (en) | 2023-01-26 | 2024-01-26 | Quality control for dna data storage |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4655240A1 (en) |
| CN (1) | CN120882652A (en) |
| WO (1) | WO2024159068A1 (en) |
2024
- 2024-01-26 CN CN202480019397.0A patent/CN120882652A/en active Pending
- 2024-01-26 EP EP24710911.9A patent/EP4655240A1/en active Pending
- 2024-01-26 WO PCT/US2024/013050 patent/WO2024159068A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| CN120882652A (en) | 2025-10-31 |
| EP4655240A1 (en) | 2025-12-03 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24710911; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2025543913; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | WIPO information: entry into national phase | Ref document number: 11202504984R; Country of ref document: SG |
| | WWP | WIPO information: published in national office | Ref document number: 11202504984R; Country of ref document: SG |
| | WWE | WIPO information: entry into national phase | Ref document number: CN2024800193970; Country of ref document: CN; Ref document number: 202480019397.0; Country of ref document: CN |
| | WWP | WIPO information: published in national office | Ref document number: 202480019397.0; Country of ref document: CN |
| | WWP | WIPO information: published in national office | Ref document number: 2024710911; Country of ref document: EP |