WO2019204702A1 - Error-correcting dna barcodes - Google Patents
Error-correcting dna barcodes Download PDFInfo
- Publication number
- WO2019204702A1 WO2019204702A1 PCT/US2019/028279 US2019028279W WO2019204702A1 WO 2019204702 A1 WO2019204702 A1 WO 2019204702A1 US 2019028279 W US2019028279 W US 2019028279W WO 2019204702 A1 WO2019204702 A1 WO 2019204702A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- barcode
- barcodes
- genetic
- corrupt
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/10—Design of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- the disclosure generally relates to methods to correct errors in digitized genetic barcode sequences, including substitution, insertion, and deletion errors that occur in pooled populations of diverse genetic materials.
- DNA barcodes short, unique DNA sequences that are coupled to each member in the population (FIG. 1A).
- DNA barcode-based identification is central to such diverse applications as single-cell genome and RNA sequencing 1-7 , gene synthesis 8 9 , high-throughput antibody screens 10 11 , and drug discovery 12 13 .
- Such experiments have been enabled by recent breakthroughs in massively-parallel, pooled DNA synthesis 14 15 .
- a recent study used DNA barcodes to discover small molecule inhibitors of enzymes by screening about 10 8 small molecules. Each small molecule was attached to a unique set of three DNA barcodes.
- the highest affinity ligands were enriched via multiple rounds of selection and then identified via high-throughput sequencing of the attached barcodes 16 .
- the rapid growth of such methodologies in all areas of biomedicine requires the development of large pools (>l0 6 members) of unique DNA barcodes to identify individual members (e.g., cells, proteins, drugs) in heterogeneous ensembles.
- Every assay using DNA barcodes is subject to errors introduced during DNA synthesis and sequencing. These errors decrease experimental power and accuracy by confounding the identity of individual biomolecules in the population. Base pair insertions and deletions are particularly challenging to decode because these mutations cause a frameshift in all downstream sequencing. Applying a manufacturer-advertised error rate of up to 1 per 200 nucleotides (nt) 17 to a 20 base pair (bp) long barcode with no error correction translates to a best-case scenario of 10% data lost or, worse, incorrectly interpreted. Next-generation sequencing also has error rates between 10 3 and 10 4 . This alone represents errors in approximately 1% of the above example 20 bp barcodes, which can be limiting for detection of rare events. These errors can be overcome through the use of error-correcting DNA barcodes— DNA sequences that can correctly identify the underlying individuals in a pooled experiment even in the presence of DNA sequencing and synthesis errors.
- Error-correcting barcodes should efficiently detect and correct DNA sequencing and synthesis errors.
- Many current DNA barcode strategies repurpose error-correcting codes developed for computers 18 19 , such as Hamming or Reed-Solomon codes, to DNA applications 20 ’ 21 .
- Hamming distance which describes the number of substitutions between two sequences of equal length, is possibly the most used due to its simplicity.
- nearly all well-studied error-correcting codes developed in computer science—including the widely-used Hamming codes— were not designed to handle deletions and insertions. Such codes are generally used to only detect errors without correcting them. However, this method also fails to account for the possibility that a single error (e.g., deletion) can convert one barcode into another.
- Levenshtein codes also known as edit codes, can theoretically account for all three types of common error: substitutions, insertions, and deletions, but only when the corrupted length of each barcode after errors is known 22 23 . This is a critical limitation in real-world DNA barcode applications because errors can change the barcode length unpredictably, which can lead to erroneous decoding of Levenshtein-based barcodes in the context of a longer read (FIG. 1B).
- Levenshtein codes can be used at twice the level of error correction as desired for a given application, for example using a 2 error-correcting code when a 1 -error correcting code is desired, but this is inefficient and significantly decreases the number of valid barcodes for a given oligonucleotide length.
- existing DNA barcode strategies are unable to efficiently detect and decode real-world errors encountered during DNA synthesis and sequencing.
- a genetic barcode may be selected for a member of the population, and that genetic barcode will have been created in a manner that allows for a given error tolerance in the barcode at decoding.
- the compositions and methods disclosed herein address these and other needs.
- FREE barcodes can correct substitutions, insertions, and deletions even when the edited length of the barcode is unknown, which is a significant advantage over known DNA barcode error-correcting methods.
- FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, absence of GGC motifs, and no self complementarity of more than two bases to reduce internal hairpin propensity. Lists of barcodes were generated with different lengths and error-correction levels that may be broadly useful in diverse high- throughput applications.
- hairpin melting temperatures were calculated which can be used to select subsets of barcodes compatible with experimental conditions.
- the largest barcode list includes >l0 6 unique error-correcting barcodes usable in a single experiment. Moreover, appending two or more barcodes together in combination increases the total barcode set, producing >10 9 -10 12 unique error-correcting DNA barcodes.
- the included software for creating new barcode libraries and decoding/error-correcting observed barcodes is fast and efficient, decoding >120,000 barcodes per second with a single processor, and is designed to be user friendly for a broad biologist community.
- Modern high-throughput biological assays study pooled populations of individual members by labeling each member with a unique DNA sequence called a“barcode.”
- DNA barcodes are frequently corrupted by DNA synthesis and sequencing errors, leading to significant data loss and incorrect data interpretation.
- Disclosed herein is a novel error-correction strategy to improve the efficiency and statistical power of DNA barcodes.
- the error-correcting method accurately handles insertions and deletions in DNA barcodes, the most common type of error encountered during DNA synthesis and sequencing, resulting in order-of-magnitude increases in accuracy, efficiency, and signal-to-noise.
- the accompanying software package makes deployment of these barcodes effortless for the broader experimental scientist community.
- a computerized method of generating a set of unique genetic barcodes for attaching to members of a pooled population of samples represented by respective genetic sequences The set of genetic barcodes has boundaries established by controlling the presence and absence of selected combinations of base values when the selected combinations represent disallowed physical characteristics within the genetic samples.
- the method includes (i) identifying a common length of the barcodes in the set, wherein the length is a given number of base positions present in each genetic barcode, (ii) identifying base values available to fill each base position, wherein the respective genetic barcode is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode, and (iii) identifying a center barcode from which a sphere of multiple corrupt barcodes may be generated.
- Each of the multiple corrupt barcodes refers back to the center barcode also in the sphere
- generating the multiple corrupt barcodes utilizes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the number of edits necessary in any one of the multiple corrupt barcodes to match the center barcode and generates combinations of bases filling the length of respective corrupt barcode sequences, comprising errors in the base values within the acceptable error level, such that the corrupt barcode sequences pack the sphere having the center barcode.
- the computerized method includes (i) identifying an original length of a generated barcode from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions present in the generated barcode, (ii) identifying a barcode starting value position along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base of a respective genetic barcode, and (iii) identifying a center barcode within a sphere of multiple corrupt barcodes such that each of the multiple corrupt barcodes refers back to the center barcode also in the sphere.
- Identifying the center barcode includes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the number of edits necessary in any one of the corrupt barcodes to match the center barcode.
- the computer is further configured to utilize the barcode starting value position within the genetic sequence and evaluating bases up to the length of the generated barcode.
- the method stores the evaluated bases as a corrupt barcode in a computerized memory buffer and identifies a respective decode sphere in which the corrupt barcode exists. Finally, the method includes decoding the corrupt barcode to the center barcode of the respective decode sphere.
- FIG. l(A-C) are schematics depicting applications and error-correction strategies of DNA barcodes.
- FIG. 1A depicts illustrative examples of high-throughput sequencing assays that require large lists of error-correcting DNA barcodes. Barcodes can be used to identify individual cells or molecules in pooled libraries (Klein, 2015; Fan, 2008; Melkko, 2004).
- FIG. IB depicts current strategies to correct synthesis and sequencing errors in DNA barcodes that are confounded by insertions and deletions.
- Hamming distance can only handle substitutions.
- Levenshtein distance is limited by the fact that barcodes are prepended to other sequences of interest. Insertions and/or deletions (“indels”) produce phantom Levenshtein distance errors when bases from the remaining DNA molecule shift into or out of the barcode window.
- FIG. 1C depicts examples of FREE divergence given an actual edit history. Levenshtein and Hamming distances are also shown for comparison. A substitution and insertion are correctly attributed as two edits by FREE divergence (first column).
- Indels can have zero cost, particularly near the end of the barcode where they can occasionally be undone by fill or truncation (fourth column). Edits past the barcode end can matter since the fill/truncation step happens only upon observation (fifth column).
- FIG. 2(A-D) are schematics and graphs showing FREE barcode generation and decoding.
- FIG. 2A shows error-correcting barcode generation as a sphere packing problem.
- Reserved around each accepted barcode B e.g.,“CTCA” is DecodeSphere m (B), the set of all sequences within FREE divergence m of B. That is, the set of all sequences with any combination of up to m errors from B, followed by fill or truncation as necessary.
- Any set of disjoint decode spheres is a valid FREE code (FIG. 2A, right panel).
- FIG. 2B is a graph showing the number of single- and double-error correction barcodes generated for a range of barcode lengths.
- FIG. 2C shows that the herein disclosed methods and software can decode more than 120,000 barcodes per second for all barcode lengths considered here.
- FIG. 2D shows a comparison of FREE barcode counts against pruned Hamming codes and Levenshtein codes.
- Hamming codes were pruned to remove members that did not decode FREE divergence errors, while Levenshtein codes were produced at double the error-correction levels for the same purpose.
- FREE codes produce more barcodes than either of the other methods for all barcode lengths.
- FIG. 3(A-C) shows experimental measurement of synthesis and sequencing error rates.
- FIG. 3A is a schematic of the DNA constructs used for barcode validation experiments.
- Each member in the synthetic library had a unique pair of left and right barcodes drawn from a list of more than 8,000 17- nucleotides FREE codes with double-error correction.
- FIG. 3B is a graph showing the measured synthesis error rates, by intended reference base and error type— substitution (sub), deletion (del), and insertion (ins).
- FIG. 3C is a graph showing measured sequencing substitution error rates, by reference base. Insertions and deletions from illumina sequencing are extremely rare and are omitted for clarity.
- FIG. 4(A-B) are graphs showing decoding corrupted barcodes from simulated errors.
- 4A shows modeled and simulated decoding error rates given per-base error rate for length 8 barcodes and length 16 barcodes. Barcode sets are labeled according to length and number of errors corrected.
- FIG. 4B shows that the 16-2 code is length 16 and corrects up to 2 errors.
- Solid lines show the error rate approximations using a binomial model. Circles and triangles show direct simulation error rates for single- and double-error correcting codes, respectively. Substitution, insertion, and deletion errors each have simulated error rates P( error per base)/3 for simplicity.
- FIG. 5 is a graph showing decoded corrupted barcodes from experimental data. Observed decoding error rates are compared with theoretical rates from the synthesis and sequencing error rates.
- FIG. 6(A-D) shows combined barcode libraries via concatenation of FREE barcodes.
- FIG. 6A shows concatenated barcodes can be decoded sequentially in a left-to-right order, even when the end position of each edited sub-barcode is not initially known.
- the decoded first FREE sub-barcode can be used to find the starting position of the next sub-barcode, and similarly for subsequent sub-barcodes.
- FIG. 6B shows concatenated barcode decoding error rates.
- Concatenated barcode labels use the following format: a 3x(l6-l) barcode consists of three concatenated sub-barcodes, each of which is 16 bp long and can correct up to 1 error.
- FIG. 6C and FIG. 6D show concatenating multiple barcodes combining to increase the numbers of effective FREE barcodes.
- Concatenated barcodes can correct the same number of errors per sub-barcode. When the errors are distributed evenly among the sub-barcodes, concatenated barcodes can correct a higher total number of errors than the individual sub-barcodes.
- FIG. 6C shows concatenated single-error correcting barcodes.
- FIG. 6D shows concatenated double-error correcting barcodes. Dashed lines: projected quantities calculated by sampling; dotted lines: log-linear projections.
- FIG. 7(A-C) are graphs showing decode sphere volumes and code efficiency.
- FIG. 7A shows that unlike Hamming decode spheres, FREE divergence decode spheres do not have uniform volume due to degeneracy of insertions and deletions.
- the sequence AACT only has three unique deletions because a deletion of either A generates the same resulting sequence.
- Sphere volumes of 1- and 2-error codes are shown for all words and for only valid code words after application of FREE code synthesis and sequencing filters (no homopolymer runs, no triplet complementarity, etc.) ⁇ Black lines are explained in FIG. 7B below.
- FIG. 7B shows optimal sphere packing bounds.
- the optimal packing for an error-correcting code is not known in general.
- Typical code generating algorithms including the herein disclosed algorithms, are instead heuristics for finding relatively good codes.
- the volumes of FREE divergence decode spheres are not uniform, so the volume of every sphere in the space is instead determined, sorted, and the minimum number of barcodes at which the cumulative sum of barcode sphere volumes is smaller than the space is determined.
- the upper bound calculated for valid code words is shown.
- the volume at which that happens for each code is shown in FIG. 7A as black lines.
- the lower bound is the best efficiency achieved by any code generation method to date, which for FREE codes is simply the number of barcodes reported herein.
- the actual maximum possible number of barcodes is somewhere between the two.
- FIG. 7C shows raw and actual code rates for each FREE barcode set disclosed herein as well as the asymptotic values they approach.
- FIG. 8(A-B) are graphs depicting error rate simulations by error type.
- FIG. 8A and FIG. 8B show the simulations performed for FIG. 4, repeated for each error type - substitutions (top panel; also called mismatches), deletions (middle panel), insertions (bottom panel) - individually.
- Barcode sets are labeled according to length and number of errors corrected; for example, the 16-2 code is length 16 and corrects up to 2 errors. Mismatches follow the binomial approximation closely, while deletions and especially insertions perform slightly better than the binomial approximation.
- FIG. 8A is shown for base pair length 8.
- FIG. 8B is shown for base pair length 16.
- FIG. 9 is a set of graphs depicting error rate comparison with constant barcode length.
- the barcode length used for each is denoted at the top of each individual graph.
- FIG. 10 is a set of graphs depicting error rate comparison with constant barcode number of errors corrected. The binomial approximation of the decode error rate as a function of the error rate per base, grouped by given number of errors corrected.
- FIG. 11 is a set of graphs depicting error rate comparison with constant number of barcodes.
- FIG. 12 is a graph depicting a coverage histogram and statistics for the FREE code validation experiment. Each of the 8,684 oligos were observed with average coverage of l59x.
- FIG. 13 is a graph depicting maximum error run length probabilities. The probability distribution of maximum consecutive-error run lengths from a model assuming independent errors and from herein disclosed data. The two differ significantly because errors in the herein disclosed data are not independent.
- FIG. 14 is a set of graphs depicting hairpin melting temperatures (Tm). Hairpin melting temperature CDFs are shown for all barcodes libraries included with this manuscript. The barcodes included here nearly all have a Tm ⁇ 60°C, and users can further filter the barcode sets to avoid hairpins in their specific experimental conditions.
- compositions disclosed herein have certain functions. Disclosed herein are certain structural requirements for performing the disclosed functions, and it is understood that there are a variety of structures which can perform the same function which are related to the disclosed structures, and that these structures will ultimately achieve the same result.
- an agent includes a plurality of agents, including mixtures thereof.
- the terms“can,”“may,”“optionally,”“can optionally,” and“may optionally” are used interchangeably and are meant to include cases in which the condition occurs as well as cases in which the condition does not occur.
- the statement that a formulation“may include an excipient” is meant to include cases in which the formulation includes an excipient as well as cases in which the formulation does not include an excipient.
- Ranges can be expressed herein as from “about” one particular value, and/or to "about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent "about,” it will be understood that the particular value forms another embodiment. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as "about” that particular value in addition to the value itself. For example, if the value" 10" is disclosed, then “about 10" is also disclosed.
- a polynucleotide e.g. DNA or RNA
- a polynucleotide comprises a sequence of nucleotides that enables it to non- covalently bind, to another polynucleotide in a sequence-specific, antiparallel, manner (i.e., a polynucleotide specifically binds to a complementary polynucleotide) under the appropriate in vitro and/or in vivo conditions of temperature and solution ionic strength.
- polynucleotide hybridization is the binding of a polymerase chain reaction (PCR) primer to a polynucleotide template (e.g., in a sequencing reaction).
- PCR polymerase chain reaction
- a polynucleotide e.g., a PCR primer or probe
- a polynucleotide can comprise at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or 100% sequence complementarity to a target site within the target polynucleotide sequence.
- a “primer” is a short polynucleotide, generally with a free 3'-OH group that binds to a target or "template” potentially present in a sample of interest by hybridizing with the target, and thereafter promoting polymerization of a polynucleotide complementary to the target.
- a “polymerase chain reaction” (“PCR”) is a reaction in which replicate copies are made of a target polynucleotide using a "pair of primers” or a “set of primers” consisting of an "upstream” and a “downstream” primer, and a catalyst of polymerization, such as a DNA polymerase, and typically a thermally- stable polymerase enzyme. Methods for PCR are well known in the art, and taught, for example in "PCR: A
- polynucleotide and“oligonucleotide” are used interchangeably and generally refer to a linear polymer of nucleotide monomers of DNA or RNA.
- Monomers making up a polynucleotide are capable of specifically binding to a second polynucleotide by way of a regular pattern of monomer- to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, for example.
- Such monomers and their internucleosidic linkages may be naturally occurring or may be analogs thereof (e.g., naturally occurring or non-naturally occurring analogs).
- Non-limiting examples non-naturally occurring analogs include phosphorothioate internucleosidic linkages, bases containing linking groups permitting the attachment of labels, such as fluorophores, or haptens.
- “oligonucleotide” can refer to (relatively) smaller polynucleotides comprising, for example, 2-100, 3-75, or 3-50 monomeric units.
- Polynucleotides may, in some instances, include the natural deoxyribonucleosides (e.g., deoxyadenosine, deoxycytidine, deoxyguanosine, and deoxythymidine for DNA or their ribose counterparts for RNA) linked by phosphodiester linkages.
- polynucleotides can also include non-natural nucleotide analogs (e.g., including modified bases, sugars, or internucleosidic linkages).
- a polynucleotide may be represented by a sequence of letters (upper or lower case), such as“ATGCCTG,” and it will be understood that the nucleotides are in 5' 3' order from left to right and that“A” denotes deoxyadenosine,“C” denotes deoxycytidine,“G” denotes deoxyguanosine, and“T” denotes deoxythymidine, and that“I” denotes deoxyinosine, and“U” denotes deoxyuridine (in the case of RNA), unless otherwise indicated or implied from context. Unless otherwise noted the terminology and atom numbering conventions will follow those disclosed in Strachan and Read, HUMAN MOLECULAR GENETICS, 2nd ed. (Wiley-Liss, New York
- a barcode can be used to identify a barcoded sample. Often, one or more barcoded samples are mixed in a heterogenous pool, which can contain known or unknown components.
- the use of a barcode facilitates the identification of a barcoded sample. Typically, any given barcode identifies only one barcoded sample, and any given barcoded sample is identified by only one barcode. Thus, the barcode and the barcoded sample can be exclusively assigned. This is conceptually similar to the use of scannable universal product codes (UPC) on consumer products of a grocery store shelf.
- UPC scannable universal product codes
- the barcode is a specially designed DNA sequence attached to a barcoded sample and which identifies the barcoded sample for each read output by a DNA sequencer.
- DNA can be extracted from each barcoded sample, and the unique identifying barcode for that sample is amplified in a polymerase chain reaction (PCR).
- PCR polymerase chain reaction
- DNA extracts from multiple barcoded samples, or multiple PCR amplicons from PCRs of separate barcoded samples, can be mixed together (multiplexed) for sequencing numerous samples simultaneously in bulk, thereby reducing time and complexity.
- multiplexed sequencings a barcoded sample can thus be tracked by identification of its barcode.
- DNA barcodes can be selected to be of sufficient length to generate the desired number of barcodes with sufficient variability to account for common sequencing errors, generally ranging in size from about 2 to about 20 bases, but may be longer or shorter. Longer barcodes will permit higher sequence diversity and typically allow more samples to be combined.
- the target specific PCR sequences for the forward and reverse PCR primers can be specific for any DNA sequence, in coding or non-coding regions of a target genome, plasmid, or organelle.
- DNA barcoding has been described for many sequencing platforms. For instance, the first platform of next-generation sequencing (NGS), the GS 20, was developed for barcoding in 2007 (Parameswaran P, Jalili R, Tao L, et al. Nucleic Acids Research. 2007; 35(l9):el30). Since that time, barcoding strategies have been made commercially available by Illumina (San Diego, Calif.), Pacific Biosciences (Menlo Park, Calif.), and Thermo Fisher Scientific (Waltham, Mass.), such as the Ion TorrentTM (Thermo Fisher Scientific, Waltham, Mass.) sequencing platforms.
- NGS next-generation sequencing
- a high throughput sequencing method can be any sequencing method, with high throughput generally meaning greater than 100 (e.g., 1,000 or more) reads per run.
- Next-generation sequencing refers to modem high throughput sequencing platforms that parallelize the sequencing process, producing thousands or millions of sequences concurrently, in contrast to less-efficient and more expensive standard dye-terminator methods.
- Non-limiting examples of NGS methods include single molecule real-time sequencing (also referred to as Pacific Biosciences or PacBio), ion semiconductor (also referred to as Ion Torrent sequencing), pyrosequencing (also referred to as Roche 454), sequencing by synthesis (also referred to as Illumina sequencing), sequencing by ligation (also referred to as SOLiD sequencing) and chain termination sequencing (also referred to as Sanger sequencing).
- single molecule real-time sequencing also referred to as Pacific Biosciences or PacBio
- ion semiconductor also referred to as Ion Torrent sequencing
- pyrosequencing also referred to as Roche 454
- sequencing by synthesis also referred to as Illumina sequencing
- sequencing by ligation also referred to as SOLiD sequencing
- chain termination sequencing also referred to as Sanger sequencing.
- High throughput DNA sequencing can be carried out on pooled, barcoded PCR amplicon DNA sequences as described above, producing a file with individual DNA sequences from all samples in random order.
- the high throughput DNA sequencing is a next-generation sequencing (NGS) method, such as single-molecule real-time sequencing, ion semiconductor sequencing, pyrosequencing, sequencing by synthesis, sequencing by ligation and chain termination sequencing.
- NGS next-generation sequencing
- Obtained DNA sequences from each barcoded sample (or PCR amplicons thereof) can be identified by barcode after sequencing and all reads are sorted into files by barcode.
- separating the DNA sequences and identifying target DNA sequences and/or barcode sequences can be performed by computer implemented methods, and can further employ database searches.
- the databases may include online databases such as BLAST
- barcodes can be designed to exhibit improved read accuracy for sequencing using a sequence-by-synthesis platform (as discussed previously), which can include fluorophore-labeled nucleotide sequencing platforms or non-labeled sequencing platforms, such as, for example, the Ion PGMTM and Ion ProtonTM Sequencers, and the Ion SSTM and Ion S5 XL Next GenerationTM Sequencing System.
- a sequence-by-synthesis platform can include fluorophore-labeled nucleotide sequencing platforms or non-labeled sequencing platforms, such as, for example, the Ion PGMTM and Ion ProtonTM Sequencers, and the Ion SSTM and Ion S5 XL Next GenerationTM Sequencing System.
- Design of the barcodes is not limited to any particular instrument platform or sequencing technology, however.
- the barcode has a number of nucleotide bases (sometimes referred to as a barcode length) of at least 3 bases, at least 4 bases, at least 5 bases, at least 6 bases, at least 7 bases, at least 8 bases, at least 9 bases, at least 10 bases, at least 11 bases, at least 12 bases, at least 13 bases, at least 14 bases, at least 15 bases, at least 16 bases, at least 17 bases, at least 18 bases, at least 19 bases, or at least 20 bases.
- the barcode has a number of nucleotide bases from 3 to 20 bases, from 3-18 bases, from 3-16 bases, from 3-15 bases, from 3-12 bases, or from 3-8 bases.
- two or more barcodes can be used in combination in a concatenated sequence.
- Combined use of barcodes in concatenated barcodes facilitate generation of a high number of barcodes (e.g., over 106 and even 1012 or more).
- Combined use of barcodes, as it relates to concatenated barcodes refers to the combining of two or more barcodes from a DecodeSphere in a single polynucleotide to form a larger barcode (a“concatenated barcode”).
- a concatenated barcode comprises sub-barcodes in a single polynucleotide wherein the sub-barcodes are disposed along the polynucleotide sufficiently close to an adjacent sub-barcode such that the concatenated barcode can be identified from a single amplicon formed from a PCR amplification reaction.
- a sub-barcode can be disposed along the polynucleotide within a nucleotide base length of an adjacent sub-barcode of 100 bases or less, 75 bases or less, 50 bases or less, 25 bases or less, 10 bases or less, 9 bases or less, 8 bases or less, 7 bases or less, 6 bases or less, 5 bases or less, 4 bases or less, 3 bases or less, 2 bases or less, or 1 base or less.
- a first sub-barcode can be immediately adjacent to a second sub barcode (e.g., the 3’ end of the first sub-barcode is directly and covalently linked by a phosphodiester bond to the 5’ end of a second sub-barcode).
- a concatenated barcode can be defined by an intended sequence, for example a polynucleotide sequence designed by a user, without regard to the
- a concatenated barcode can comprise sub-barcodes which are of the same base length (number of bases or base pairs; bp) or of different base lengths.
- a barcoded sample can be any composition of matter capable of attachment, either directly or indirectly, with a polynucleotide comprising a barcode.
- the barcoded sample can be, for example, a small molecule, a macromolecule (e.g., a nucleic acid, protein, lipid, polysaccharide, or synthetic or biological polymer), a solid support (e.g., a synthetic particle, a bead such as an affinity bead, a plastic, a metallic substance), a vesicle (e.g., liposome, exosome, micelle), an artificial or biological cell, or others.
- a macromolecule e.g., a nucleic acid, protein, lipid, polysaccharide, or synthetic or biological polymer
- a solid support e.g., a synthetic particle, a bead such as an affinity bead, a plastic, a metallic substance
- a vesicle e.g., liposome,
- Suitable beads include silica gel beads, controlled pore glass beads, magnetic beads, Dynabeads, Sephadex/Sepharose beads, cellulose beads, polystyrene beads, or any combination thereof.
- a barcoded sample and a cell are said to be attached when a barcoded sample either enters a cell or is attached to a component (e.g., a lipid) on the surface of a cell.
- Direct attachment can include, for instance, covalent linkage of a 3’ end of a polynucleotide comprising a barcode to the barcoded sample, for instance a chemically accessible motif of a small molecule capable of covalent linkage to the 3’ end or other portion of a polynucleotide comprising a barcode.
- Indirect attachment can include a physical attachment but without a direct covalent linkage between the barcoded sample and the polynucleotide comprising a barcode.
- an intervening component can link the barcoded sample and the polynucleotide comprising a barcode.
- An example of an indirect attachment includes a linkage between a polynucleotide comprising a barcode and a solid support (e.g., a bead), wherein the solid support is further linked to the barcoded sample (e.g., an antibody or small molecule).
- a solid support e.g., a bead
- the methods can include a plurality of barcodes.
- the methods can include at least 50, at least 100, at least 250, at least 500, at least 750, at least 1,000, at least 2,000, at least 5,000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1,000,000 barcodes.
- the methods can include at least 107, at least 108, at least 109, at least 1010, at least 1011, at least 1012, at least 1013, at least 1014, or at least 1015 barcodes.
- the total number of barcodes can include single barcodes only, concatenated barcodes only, or both single barcodes and concatenated barcodes.
- a plurality of barcodes can sometimes be referred to as a barcode library.
- the term“barcode library” can also refer to a plurality of barcodes, each attached to a single barcoded sample of a plurality of barcoded samples.
- a polynucleotide (e.g., a polynucleotide comprising a barcode) can be sequenced by any sequencing method known in the art to sequence a polynucleotide.
- the DNA sequencing method is a high throughput sequencing method, for instance a next-generation sequencing (NGS) method.
- the DNA sequencing method is a low throughput sequencing method, for example Sanger sequencing.
- Sanger sequencing is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication.
- sequencing results and other polynucleotide analyses can be represented and analyzed by digitized sequences using known computer implemented methods.
- a polynucleotide comprising a barcode can be immobilized on a solid support.
- the solid support is a bead (e.g., an affinity bead), multi-well plate (e.g., 96-well plate), filter surface, or other solid support.
- a barcoded polynucleotide can be attached to a solid support at one end, and attached (directly or indirectly) to a barcoded sample at the other end.
- the technology disclosed herein takes advantage of a highly efficient way to tag samples in a pooled set of specimens with a genetic barcode as shown in Figure 1A (5A-5C) and account for likely corruption of the barcode sequences during experimentation, genetic synthesis, and genetic sequencing operations.
- the corruption is caused by naturally occurring errors in genetic sequences that change certain portions of a genetic barcode sequence from an originally generated genetic barcode (25) to a different sequence that does not match the original barcode exactly and may lead to errors for certain identification purposes, absent additional processing.
- This disclosure utilizes computers, computer processors, and associated kinds of computerized memory to provide additional processing functionality that enables even corrupted barcodes to be decoded back to an intended original genetic barcode, which in turn identifies a property of the specimen.
- the identified property of the specimen includes, but is not limited to, a source of the specimen, a characteristic of the specimen, or any information or metadata that can be useful to track in regard to the specimen to which the genetic barcode is attached.
- Figure 1B illustrates that the genetic barcodes (25) described herein start with a particular set of base values (27) filling the base positions (20) of a genetic barcode having a defined length (145). The base values, however, may change during the course of genetic synthesis and genetic sequencing operations.
- FIG. 1C shows that originally generated genetic barcodes (25) have base values with a starting position (45A-45E) in an overall genetic sequence for a sample.
- the base values fill the base positions in the bar code sequence up to the length (147) of the genetic barcode (25).
- these base values (27) in an expected configuration of the original genetic barcode (25) are subject to actual edits (53) such that the observed version (55) of the barcode (corrupted with mutations in the base values) is different due to truncation and fill effects as shown.
- One goal of the technology described herein lies in identifying not only the corrupt barcodes (63) as observed after various edits, but also to be able to use those corrupt barcodes in reliably determining to which of the originally generated genetic barcodes (25) the corrupt barcode (63) should refer back to after decoding.
- the set of genetic barcodes has boundaries (33) established by controlling the presence and absence of selected combinations of base values (20) when the selected combinations represent disallowed physical characteristics within the genetic samples. Controlling select combinations of base values involves implementing lexicographical constraints (e.g., constraining the genetic code in a sequence) and identifying base values (27) available to fill each base position in terms of “GC” code content, disallowing homopolymer triplets, disallowing triplet self complementarity, and disallowing GGC base values (illumina error motif).
- the method includes (i) identifying a common length (145) of the barcodes (25) in the set (23), wherein the length (145) is a given number of base positions (20) present in each genetic barcode (25), (ii) identifying base values (27) available to fill each base position (20), wherein the respective genetic barcode (25) is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode (63), and (iii) identifying a center barcode (125) from which a sphere (70) of multiple corrupt barcodes (63) may be generated.
- Each of the multiple corrupt barcodes (63) refers back to the center barcode (125) also in the sphere (70).
- the method also includes generating the multiple corrupt barcodes (63) by using a computer executing software configured to receive an acceptable error level (herein,“m”) for the sphere (70) of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits necessary in any one of the multiple corrupt barcodes (63) to match the center barcode (125) and generates combinations of base values (27) filling the length (145) of respective corrupt barcode (63) sequences.
- the corrupt barcodes (63) within a given sphere (70) and within given set (33) have errors in the base values (27) within the acceptable error level, such that the corrupt barcode sequences pack the sphere (70) having the center barcode (125).
- Generating combinations of base values (27) having up to the maximum number of errors allowed by the error level includes adjusting base values (27) from the center barcode (125) with end fills and end truncations of base values at one end of the length of the genetic barcode, up to the entire length (145) of the genetic barcode.
- the method further includes generating distinct center barcodes (125) for each member of the pooled population of genetic samples (5A-5C) and using respective distinct center barcodes (125) to generate additional spheres of corrupt barcodes that relate back to a corresponding distinct center barcode.
- the computerized method is configured to result in the sphere and the additional spheres (80, 90, 110) being disjointed with no common members at the identified error level.
- genetic barcodes (25) can be concatenated so that multiple barcodes form a single new barcode (i.e., the multiple barcodes become sub-barcodes), and, using the sub-bar codes, new spheres (150, 175, 195) of corrupt barcodes are generated with computerized equipment.
- the computerized method includes (i) identifying an original length (145) of a generated barcode (25) from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions (20) present in the generated barcode, (ii) identifying a barcode starting value position (45) along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base value of a respective genetic barcode, and (iii) identifying a center barcode (125) within a sphere (70) of multiple corrupt barcodes (63) such that each of the multiple corrupt barcodes (63) refers back to the center barcode also in the sphere (70).
- Identifying the center barcode (125) includes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits (53) necessary in any one of the corrupt barcodes to match the center barcode (125).
- the computer is further configured to utilize the barcode starting value position (45) within the genetic sequence and evaluating bases up to the length (145) of the generated barcode.
- the method stores the evaluated bases as a corrupt barcode in a computerized memory buffer and identifies a respective decode sphere (70) in which the corrupt barcode (63) exists.
- the method includes decoding the corrupt barcode to the center barcode of the respective decode sphere.
- a computerized method as described above for decoding genetic barcodes further includes identifying the respective decode sphere in which the corrupt barcode exists by evaluating the base position values (27) after the starting point (45) and up to the original length of the generated barcode (25).
- the base position values after the starting point (45) and up to the original length (145) of the generated barcode include filled positions at an end of the length, wherein the filled positions do not match the corresponding positions of the generated barcode.
- the base position values after the starting point and up to the original length of the generated barcode do not include base positions from the originally generated barcode that have been truncated at an end of the length.
- the computerized method identifies a true length (145) of the first corrupt barcode and from that identifies a next start position in the genetic sequence for a sequential barcode (175, 195) to be decoded.
- the hardware and software used to enable this disclosure will be sufficient to provide for sets of genetic barcodes to be stored, subject to entry in a -up table or other matching means, and configured with a graphical user interface to receive at least one of an acceptable error level, an acceptable probability of a number of errors in a corrupted barcode, and a genetic barcode length.
- DNA barcodes short DNA sequences prepended to DNA libraries— for identification of individuals in pooled biomolecule populations.
- DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results.
- Widely-used error-correcting codes borrowed from computer science e.g., Hamming and Levenshtein codes
- Disclosed herein are experimentally validated FREE (Filled/truncated Right End Edit) barcodes, which can correct substitution, insertion, and deletion errors, even when these errors alter the barcode length.
- FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, and reduced internal hairpin propensity. Lists of barcodes with different lengths and error-correction levels were generated which may be useful in diverse high-throughput applications, including >l0 6 single-error correcting l6-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space in combination, generating lists with > 10 15 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community. Results
- FIG. 1C shows a typical example of how FREE divergence captured the actual number of barcode edits in the context of a longer read.
- An insertion caused the final T to move out of the barcode window, but FREE divergence correctly accounts for its loss.
- a barcode length n is set, and any DNA sequence of length n is referred to as a word.
- the set of all words is called W such that FreeDiv(B, W) ⁇ m the m-error decode sphere of B, written as DecodeSphere m (B), or just DecodeSphere(B) if m is clear from context. Any observed DNA sequence within
- DecodeSphere(B) will by definition decode to (error-correct to) the center word B (FIG. 2 A). Then, an m-error correcting FREE code is simply any set of barcodes such that the m-error decode spheres of all barcodes are disjoint, or, in other words, no two decode spheres overlap. Any corrupted barcode with up to m errors is thus in the decode sphere of exactly one barcode and can be decoded (error-corrected) uniquely (FIG. 2A).
- Requiring disjoint decode spheres places a limit on the relationship between allowed m, the number of correctible errors, and n, the barcode length: to fit more than one non overlapping decode sphere in the space requires that 2 m + 1 ⁇ n.
- Candidate barcodes must have: (1) balanced GC content (40-60%); (2) no homopolymer triples (e.g., AAA); (3) no GGC (a known Illumina-based error motif 25 ); and (4) no self-complementarity of >2 bases to reduce hairpin propensity. All disclosed software is available in the GitHub repository (https://github.com/finkelsteinlab).
- the number of available error-correcting barcodes for a DNA sequence of length n depends on the experimentally-required degree of error-correction (FIG. 2B).
- Libraries of single-error correcting codes up to a l6-nucleotide length were generated, containing >1,600,000 barcodes.
- more robust, double-error correcting codes up to a l7-nucleotide length with >23,000 unique members were generated (Table 1).
- Barcodes correcting m errors required length at least 2m + 1 bp to avoid having all decode spheres overlap all other decode spheres.
- the 1 -error and 2-error correcting barcode libraries had minimum lengths of 3 bp and 5 bp respectively.
- the barcode decoding software ran in time proportional to the length of the barcodes but constant with respect to the number of barcodes in the library. Hence, 1 -error and 2-error correcting codes decoded at the same speed for a given barcode length even though the 1 -error libraries contained many more barcodes (FIG. 2C). Even the slowest decodes considered, the l7-mer double-error correction barcodes, decoded at >120,000 barcodes ⁇ sec 1 on a desktop computer using a single processor.
- Table 1 Number of FREE barcodes for each barcode set disclosed herein, by barcode length and number of errors corrected. Comparison with current error-correcting DNA barcode strategies. Current state-of-the art error correcting DNA barcoding applications often use Hamming or Levenshtein error-correction strategies 2023 . Hamming codes only correct substitutions, and are thus insufficient for any DNA barcode applications with indels 26 . However, Hamming codes are linear codes, meaning the code words form a well-structured lattice in barcode space.
- Levenstein codes can be used directly (e.g., without pruning) because they account for indels, but must be used at 2-fold higher error correction for DNA barcode applications (FIG. IB). Such over corrected Levenshtein barcode sets were generated in a manner similar to the FREE code generation strategy. This strategy produced even fewer barcodes than the pruned Hamming code sets. (FIG. 2D). Sequence-Levenshtein codes attempted to solve the problems inherent in using Levenstein codes for DNA applications, but an error in the derivation of these codes often caused them to decode to the wrong barcode 27 .
- FREE codes offer a substantially larger number of usable barcodes for a given barcode length, when taking into consideration real-world errors such as deletions, insertions, and substitutions that are encountered during DNA sequencing and synthesis. Error Correction in Real and Simulated Data.
- FREE barcodes generated herein were validated by both numerical simulation and by experiment. Pooled oligonucleotide synthesis was used to produce a library of >8,000 oligos with double-error correcting barcodes at both ends (FIG. 3A). The barcodes were arranged such that each left barcode should only be observed on the same oligo with one specific right barcode sequence, and similarly for right barcodes. Hence, the rate of incorrectly decoding barcodes could be measured by observing unexpected left-right barcode pairs. 1.4 million copies of this library were sequenced on an Illumina MiSeq for an average coverage of l59x using the standard Illumina workflow.
- Full-length, paired-end Illumina sequencing was used to measure the background synthesis and sequencing error rates (FIG. 3B and 3C). Using full-length paired-end reads permitted discrimination between synthesis and sequencing errors. Substitution, insertion, and deletion error rates from library amplification using Q5 polymerase have previously been reported to occur at rates less than 10 5 , and thus are a negligible fraction of the measured synthesis errors 28 . Measured errors were dominated by single-base synthesis deletions, which occurred at rates of approximately 1 in 200 bp and 1 in 100 bp in the left and right barcode regions, respectively (FIG. 3B and FIG. 13). The two-fold difference in synthesis error rates between the two sides was consistent with statements from the manufacturer regarding their synthesis error rates 17 .
- Sequencing error rates were between 10 4 and 10 3 , as advertised by Illumina (FIG. 3C). In sum, experimental error rates were dominated by deletion errors. As Hamming codes are not designed to error-correct deletions in barcodes, they perform very poorly in DNA-based experiments.
- each increase in error correction level resulted in at least an order of magnitude improvement in the decoding error rate (FIG. 4).
- FIG. 4 For example, experimental data showed an overall per-base p err of approximately 10 2 (FIG. 3B and 3C).
- the approximate uncorrected decode error rate (solid line) was 8% for length 8 barcodes and 15% for length 16 barcodes. Without error correction, a best-case scenario would be that these errors could be successfully filtered out, representing a significant loss of data. In other scenarios, these data might be erroneously counted.
- FREE barcodes were validated by measuring the decoding error rates for the experimental dataset (FIG. 5). For double-error correction, mismatches in barcode pairs were used to identify erroneously decoded barcodes. After corrections, error rates of 0.29% and 0.46% were observed for left and right barcodes, respectively.
- the 0- and l-error correction rates shown in FIG. 5 were counted by also counting the number of errors observed in each correctly decoded barcode. That is, 0-error correction decode error rates were calculated as the number of erroneously decoded barcodes plus the number of correctly decoded barcodes with 1 or 2 errors; l-error correction errors were counted similarly.
- the theoretical model was calculated using the synthesis and sequencing error rates found in FIG. 3 to calculate the decode error probability of each barcode depending on its base composition, and then combined for an overall error rate.
- each barcode may not be defined exactly because the primer region could also have errors.
- sub-barcode 1 The left-most sub-barcode (referred to as sub-barcode 1) was decoded first, and then the decoded sub-barcode was used to find the starting position of the immediately adjacent sub-barcode 3’ to sub-barcode 1 (referred to as sub-barcode 2).
- the error-correction level of each FREE sub-barcode remained the same, such that, for example, three concatenated double-error correction sub-barcodes could each correct up to two errors for a maximum total of six corrected errors if, and only if, the errors were evenly distributed, two per sub-barcode.
- Overall concatenated barcode decoding error rates were given by the probability of any decoding error in any sub-barcode or -barcodes. Concatenated barcode error rates were thus slightly higher than for the individual sub-barcodes (FIG. 6B). The decoding process was performed automatically using the disclosed software.
- Concatenating FREE barcodes results in combinations of large barcode sets that are sufficient for even the most demanding high-throughput sequencing applications (FIG. 6).
- Concatenated barcodes were pruned to remain compatible with experimental constrains by removing DNA sequences that had triplet repeats of a single base or had excess self-complementary (defined as any self-complementarity of three or more bases). Even with these filters, lists of up to 10 10 barcodes were generated with concatenation of three single-error correcting codes (FIG. 6). Beyond that, the projected total barcode count was estimated via subsampling, where possible.
- FLEE filled/truncated right end edit
- FREE barcodes are a powerful tool to correct DNA barcode errors, reducing measurement errors in modem, high-throughput experiments. Use of FREE barcodes improves these assays in three key ways: (1) helping avoid spurious results; (2) decreasing the amount of discarded data; and (3) increasing experimental signal-to-noise ratios. Decreasing spurious results and discarded data are important for any experiment involving DNA barcodes. Further, increased signal-to-noise ratios facilitates new and useful possibilities for assays. The power to decrease error rates from 15% to 0.05%, as in FIG. 4B, opens the door for entirely new assay designs. FREE barcodes are broadly useful for the ever-growing set of pooled high-throughput sequencing experiments in cell and molecular biology, protein engineering, and drug discovery. Methods
- n the word length, n, is given. Any DNA sequence of length n is a word, and any word observed in the data is an observed word.
- Strings of DNA are represented as base-4 numbers where A, C, G, and T correspond to 0, 1, 2, and 3 respectively.
- A, C, G, and T correspond to 0, 1, 2, and 3 respectively.
- 39 is the word number and 5 is the word length.
- the word length is required to uniquely convert numbers to DNA to account for leading A’s.
- the word number from the example above, 39, with word length 3 is simply GCT.
- word length n the largest valid word number is 4" - 1.
- a decode sphere is defined around a barcode B to be the set of all words with FreeDiv less than or equal to m, and an encode sphere is defined to be the set of all words of FreeDiv less than or equal to 2m.
- FREE barcode sets are generated with a modified lexicographic code generation method.
- Lexicographic code generation consists of marching through all words lexicographically, alphabetically in this case, and adding new words to the list of barcodes whenever they are sufficiently far from all previous barcodes 30 .
- lexicographic codes are linear 30 , and more generally, lexicographic code generation is shown to have relatively good sphere packing efficiency 24 .
- the first FREE modification to the procedure is to enforce the following sequencing and synthesis properties:
- the coloring of barcodes, decode spheres, and encode spheres is accomplished by having an array of 4" integers valued 0, 1, or 2: 0 for uncolored, 1 for black, and 2 for red.
- the location of each integer in memory itself represents the word, via the numerical representation of DNA given above. This is both memory and speed efficient. Memory efficiency is important, as it is a limiting resource for this method.
- the memory required for barcode generation is 4 k bytes, which for the disclosed experiments was up to l6Gb of random access memory (RAM).
- Barcode Decoding The decoding process builds the code book and looks up decoded words directly. This is performed in a memory efficient fashion as follows. For each barcode in a list, the barcode index is defined as the index of that barcode within the list of barcodes. A user again reserves a space of A k integers to represent the code space. For each barcode B, a user stores the barcode index of B at every word of DecodeSphere(B). A user stores barcode indices rather than barcode numbers because barcode indices require fewer bits per word.
- the memory required for barcode decoding is (1, 2, or 4) x 4 n bytes, requiring 1, 2, or 4 bytes to store each barcode index. For the disclosed experiments, the maximum memory used for barcode decoding was 32Gb of RAM.
- Barcode Pruning Specific barcode lists from the literature or elsewhere may sometimes be required for a given experiment, but require pruning to find a subset with error-correction. A user can accomplish barcode pruning via the same strategy as barcode generation, but only considering the input set of barcodes as potential new barcodes. This pruning method was also used to prune the linear Hamming codes.
- Levenshtein Barcodes Levenshtein Barcodes. Levenshtein barcodes were generated lexicographically using the standard technique of code generation with a metric. Briefly, for desired barcode length n and number of correctable errors e, a user can walk through the space of n-mers lexicographically adding any new word if it: (a) satisfies the same sequencing and synthesis properties as above, and (b) is Levenshtein distance at least 2 ⁇ ? +l from any previously accepted barcode.
- Oligonucleotide pools were designed as in FIG. 3A, with primers and barcodes on each end and a spacer in the middle (116 bp total length). To test the FREE method, 8,634 barcodes of length 17 and double-error correction were used in 8,634 unique pairs. Oligos were synthesized (CustomArray), and the oligo pool was amplified for twenty cycles with Q5 polymerase (NEB) and sequenced on an Illumina MiSeq machine with 2x150 bp paired-end reads. Maximum likelihood sequences were inferred using both reads.
- the left and right primer sequences were used to determine both the read orientation and the starting position of each barcode.
- Each barcode was then decoded using the FREE decoding software. Matching barcodes identified correctly decoded barcodes, while mismatching barcodes indicated an error.
- the FREE method was powerful enough to reveal a surprising and unrelated source of error: the creation of oligo chimeras, sequences with the left part of one oligo and right part of another, which were then also accounted for (Example 2).
- FreeDiv(X, Y ) can be efficiently calculated with a modified Needleman-Wunsch algorithm 1 , where the last row and column of the matrix have zero penalty for insertion and deletion corresponding to right-end fill or truncation respectively.
- FREE divergence is symmetric because any minimum filled/truncated right end edit path (FREE path) is invertible by inverting all the edits and then inverting the fill/truncation step. Substitutions are invertible with substitutions, while insertions and deletions are invertible with each other in the natural way, so edits by themselves are invertible.
- FREE path minimum filled/truncated right end edit path
- Invertibility with the fill/truncation step is less obvious, and requires no edit be truncated off the end. For example, a substitution in the last position followed by any insertion results in the substitution getting truncated off the end. Minimum FREE paths never have any edits truncated off the end, because any truncated edit can be omitted to create a shorter edit path.
- a and Y be barcodes and let P be any minimum FREE path from X to Y . If P has no fill or truncation, then the fill/truncation step is trivially invertible by doing nothing.
- P has a fill step which fills /bases at the end. Then starting at Y and inverting the edits results in exactly those/bases being outside the barcode window, so they are truncated to arrive at X.
- P has a truncation step which truncates t bases. Since P is a minimum edit path, none of the truncated bases were edited bases, so they are not needed for the inverted edit path starting at Y .
- FreeDiv(X, Y) FreeDiv(Y, A) and any inverted minimum FREE path is itself a minimum FREE path.
- FREE divergence is not a metric.
- the counter-example shown in the right column of FIG. 1C was used.
- FreeDiv(TAGA, ACGC ) the modified Needleman-Wunsch algorithm described above produces
- Sphere iterator Central to the disclosed generation and decoding algorithms is the ability to deterministically iterate over decode and encode spheres. Recursive iteration is far too slow for practical use due to redundancy. For example, attempting to find DecodeSphereJJT) by finding DecodeSpherefW) of all words W in DecodeSphere ⁇ (B) results in iterating over each 2-error word at least twice, by switching the order of added edits. As the number of edits, m, grows, the redundancy grows as m ⁇ due to edit permutations.
- Code efficiency is measured, where possible, in terms of a code rate, defined as the number of usable“message” bits that can be encoded in a single barcode divided by the actual number of bits in the sent barcode.
- k message bits have r bits added for error correction, giving a code rate of k/(k + r).
- each sent base is two bits of information, so the denominator is 2 n.
- the numerator is the effective number of message bits: the length of the largest binary number smaller than the number of barcodes, given by b ⁇ ogi(Number of barcodes)c.
- the number of message bits does not need to be an integer, so one can refer to the previous as the actual message bits, while one may be more interested in the “raw” message bits: ⁇ ogi(Number of barcodes) without a floor function. These correspond to raw and actual code rates, shown in FIG. 7.
- the code rate of FREE codes increases with barcode length, and appears to asymptotically approach a maximal code rate determined by the properties of the decode sphere packing.
- FIG. 2B shows that after some boundary effects at short barcode lengths, the number of raw message bits (log of the number of barcodes) increases linearly with the length of the barcodes. The slope of this line, up to a factor of 2 for the x-axis due to using base-4 instead of base-2, is an empirical estimate for the asymptotic code rate— message bits over sent bits— for this packing method. Estimated asymptotic values are shown for single- and double-error correcting codes as dashed lines in FIG. 1C.
- H parity check matrix
- Primer processing Primers were used both chemically for library amplification and informatically to distinguish left from right sides. However, the possibility of insertions and/or deletions in these primer sites introduced some uncertainty in the starting position of the DNA barcodes.
- a custom adaptation of the Smith-Waterman algorithm for overhanging sequences was written. The user specifies an expected primer sequence, a full-length observed read, and a maximum allowable number of errors, which can be chosen to be 2 for both the left (19 bp) and right (18 bp) primers. Using the modified Smith-Waterman algorithm with unity penalties for all error types, the highest scoring prefix of the observed sequence which matched the expected sequence was identified. If two or more possible lengths had the same score, the one closest to the expected length was selected. If the number of edits is less than or equal to 2, this best inferred length then determines the position to be used as the start of the barcode sequence.
- Decode errors are detected by whether or not the left and right barcodes, as shown in FIG. 3A, match an intended left/right barcode pair. There are two possible ways to decode incorrectly: either by decoding to a wrong barcode or by decoding to“None” if the observed barcode is not in any decode sphere at all. If a barcode decodes to“None”, then that decode is an error. If a barcode decodes to an incorrect barcode, then the observed output is that the left and right barcodes mismatch but it is unclear which is actually the decode error. To determine which barcode was in error, the edit distance of the entire oligo was measured against the two possible intended sequences, accepting the one with lowest edit distance. To measure the 0- and 1 -Error correction data in FIG. 5, the edit distance of each observed barcode to the intended barcode was measured using the primer processing algorithm described above.
- the observed number of wrong barcodes with zero errors was ⁇ 10 4 , the approximate size of a decode sphere, so this was accepted as an approximation for the number of chimera oligos having barcodes with zero errors.
- This number and the distribution of correct barcode errors was then used to approximate the number of the wrong barcodes l-error and 2-errors away from the wrong barcode were chimera oligos. These were then omitted from analysis.
- the decoding error rate of an / «-error correction code is the probability of seeing more than m errors in a given barcode.
- each barcode was modeled as a queue of intended bases. At each read position, an intended base is popped off the queue and attempted to be added. One of four things will happen: 1) the correct base will be added, 2) an incorrect base will be added, 3) the base will be deleted, or 4) another base will be inserted and the intended base will go back to the top of the queue. The first three options do not return the base to the queue, resulting in the same structure of expected output 7 observed output. However, insertions cause the intended base to return to the top of the queue, and the output was never expected in the first place.
- CDS the 3-by-4 matrix with columns corresponding to the DNA bases, and rows corresponding to all non-insertion outputs: correct bases, deletions, and substitutions.
- Insertion and deletion rates, p,(b) and p b), are taken directly from synthesis error rate measurements.
- Substitution rates, p s (b), are calculated as the probability of not observing the event ⁇ no synthesis substitution and no sequencing substitution ⁇ nor the event ⁇ synthesis substitution to another base c and correcting synthesis substitution back to b ⁇ , and are thus given by
- FIG. 5 since experimental errors clump together, increasing the probability of having more than two, say, in a single barcode.
- the user would use a sphere iterator (see Supplemental Methods) which, instead of iterating over all barcodes with up to some number of errors, iterates over the most likely erroneous barcodes on the chosen synthesis and sequencing platform until the total probability of the barcodes contained in the sphere is at least 1-10 6 .
- the rest of the generation and decoding process would remain the same.
- These barcodes would only achieve the expected accuracy on exactly the same DNA synthesis and sequencing pipeline, but they would be more efficient (produce more barcodes per barcode length) for a given desired decode error rate. As such, popular synthesis and sequencing pipelines may warrant their own dedicated barcodes in the future.
- Exemplary embodiments may include program products comprising computer or machine- readable media for carrying or having machine-executable instructions or data structures stored thereon.
- the sensing electrode may be computer driven.
- Exemplary embodiments illustrated in the methods of the figures may be controlled by program products comprising computer or machine-readable media for carrying or having machine-executable instructions or data structures stored thereon.
- Such computer or machine-readable media can be any available media which can be accessed by a general purpose or special purpose computer or other machine with a processor.
- Such computer or machine-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of computer or machine-readable media.
- Computer or machine-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.
- Software implementations of the present disclosure could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various connection steps, processing steps, comparison steps and decision steps.
- elements shown as integrally formed may be constructed of multiple parts or elements shown as multiple parts may be integrally formed, the operation of the assemblies may be reversed or otherwise varied, the length or width of the structures and/or members or connectors or other elements of the system may be varied, the nature or number of adjustment or attachment positions provided between the elements may be varied.
- the elements and/or assemblies of the system may be constructed from any of a wide variety of materials that provide sufficient strength or durability. Accordingly, all such modifications are intended to be included within the scope of the present disclosure.
- the order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments.
- Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the preferred and other exemplary embodiments without departing from the spirit of the present subject matter.
- the methods and systems may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
- the methods and systems may take the form of a computer program product on a computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium.
- the present methods and systems may take the form of web-implemented computer software. Any suitable computer-readable storage medium may be utilized including hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
- These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including computer-readable instructions for implementing the function specified in the flowchart block or blocks.
- the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
- blocks of the block diagram and flowchart illustration support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the block diagram and flowchart illustration, and combinations of blocks in the block diagram and flowchart illustration, can be implemented by special purpose hardware- based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
- Results can be delivered to a gateway (remote computer via the Internet or satellite) for in graphical user interface format.
- the described system can be used with an algorithm, such as those disclosed herein.
- the computer may include a processing unit that communicates with other elements. Also included in the computer readable medium may be an output device and an input device for receiving and displaying data. This display device/input device may be, for example, a keyboard or pointing device that is used in combination with a monitor.
- the computer system may further include at least one storage device, such as a hard disk drive, a floppy disk drive, a CD Rom drive, SD disk, optical disk drive, or the like for storing information on various computer-readable media, such as a hard disk, a removable magnetic disk, or a CD-ROM disk.
- each of these storage devices may be connected to the system bus by an appropriate interface.
- the storage devices and their associated computer-readable media may provide nonvolatile storage. It is important to note that the computer described above could be replaced by any other type of computer in the art. Such media include, for example, magnetic cassettes, flash memory cards and digital video disks.
- a network interface controller can be implemented via a gateway that comprises a general-purpose computing device in the form of a computing device or computer.
- bus structures can be used as well, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- bus architectures can comprise an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, an ISA bus or a bus that can be used as a bus or peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- such architectures can comprise an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, an ISA bus or a bus that can comprise an Industry Standard Architecture
- AGP Accelerated Graphics Port
- PCI Peripheral Component Interconnects
- PCMCIA Personal Computer Memory Card Industry Association
- USB Universal Serial Bus
- the bus, and all buses specified in this description can also be implemented over a wired or wireless network connection and each of the subsystems, including the processor , a mass storage device, an operating system, network interface controller, Input/Output Interface, and a display device, can be contained within one or more remote computing devices at physically separate locations, connected through buses of this form, in effect implementing a fully distributed system.
- the computer typically comprises a variety of computer readable media.
- Exemplary readable media can be any available media that is accessible by the computer and comprises, for example and not meant to be limiting, both volatile and non-volatile media, removable and non-removable media.
- the system memory comprises computer readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM).
- the computer can also comprise other removable/non-removable, volatile/non-volatile computer storage media.
- a mass storage device can be a hard disk, a removable magnetic disk, a removable optical disk, magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like.
- any number of program modules can be stored on the mass storage device, including by way of example, an operating system and computational software.
- Each of the operating system and computational software (or some combination thereof) can comprise elements of the programming and the computational software.
- Data can also be stored on the mass storage device.
- Data can also be stored in any of one or more databases known in the art. Examples of such databases comprise, DB2TM, MICROSOFTTM ACCESS, MICROSOFTTM SQL Server, ORACLETM, mySQL, PostgreSQL, and the like.
- the databases can be centralized or distributed across multiple systems.
- the user can enter commands and information into the computer 102 via an input device.
- input devices comprise, but are not limited to, a keyboard, pointing device (e.g., a“mouse”), a microphone, a joystick, a scanner, tactile input devices such as gloves, and other body coverings, and the like
- a human machine interface that is coupled to the network interface controller, but can be connected by other interface and bus structures, such as a parallel port, game port, an IEEE 1394 Port (also known as a Firewire port), a serial port, or a universal serial bus (USB).
- a display device can also be connected to the system bus via an interface, such as a display adapter.
- the computer can have more than one display adapter and the computer can have more than one display device.
- a display device can be a monitor, an LCD (Liquid Crystal Display), or a projector.
- other output peripheral devices can comprise components such as speakers and a printer which can be connected to the computer via Input/Output Interface. Any step and/or result of the methods can be output in any form to an output device.
- Such output can be any form of visual representation, including, but not limited to, textual, graphical, animation, audio, tactile, and the like.
- the computer can operate in a networked environment.
- a remote computing device can be a personal computer, portable computer, a server, a router, a network computer, a peer device, sensor node, or other common network node, and so on.
- Logical connections between the computer and a remote computing device can be made via a local area network (LAN), a general wide area network (WAN), or any other form of a network.
- LAN local area network
- WAN wide area network
- a network adapter can be implemented in both wired and wireless environments.
- Such networking environments are conventional and commonplace in offices, enterprise-wide computer networks, intranets, and other networks such as the Internet.
- Computer readable media can be any available media that can be accessed by a computer.
- Computer readable media can comprise“computer storage media” and“communications media.”“Computer storage media” comprise volatile and non-volatile, removable and non-removable media implemented in any methods or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.
- Exemplary computer storage media comprises, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- the methods and systems described herein can employ Artificial Intelligence techniques such as machine learning and iterative learning.
- Artificial Intelligence techniques such as machine learning and iterative learning.
- Such techniques include, but are not limited to, expert systems, case-based reasoning, Bayesian networks, behavior based AI, neural networks, fuzzy systems, evolutionary computation (e.g. genetic algorithms), swarm intelligence (e.g. ant algorithms), and hybrid intelligent systems (e.g. Expert inference rules generated through a neural network or production rules from statistical learning).
- CustomArray, Inc. maker of custom microarrays, oligo pools and instrumentation. Available at: http://www.customarrayinc.com/aboutus_main.htm. (Accessed: 8th January 2018)
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Chemical & Material Sciences (AREA)
- Biochemistry (AREA)
- Molecular Biology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
A computerized method of generating a set of genetic barcodes for attaching to samples predicts errors in base values of genetic sequences and provides for decoding corrupt barcodes that include the errors. The set of genetic barcodes is bounded by either requiring or preventing selected combinations of base values that correspond to physical features of a genetic sample. The method includes identifying a common length of the barcodes in the set, identifying base values available to fill each base position, and identifying a center barcode from which a sphere of multiple corrupt barcodes may be generated. The sphere of barcodes referring to a given center barcode is stored as a look-up table to decode corrupt barcodes back to a single unique center barcode.
Description
ERROR-CORRECTING DNA BARCODES
CROSS REFERENCE TO RELATED APPLICATIONS
This application is an international filing under the Patent Cooperation Treaty and claims priority to United States Provisional Patent Application Serial No. 62/660,531 filed on April 20, 2018, which is incorporated by reference as if set forth fully in this disclosure.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
This invention was made with government support under Grant Nos. GM124141 and R01 GM120554, each awarded by the National Institutes of Health. The Government has certain rights in the invention.
FIELD
The disclosure generally relates to methods to correct errors in digitized genetic barcode sequences, including substitution, insertion, and deletion errors that occur in pooled populations of diverse genetic materials.
BACKGROUND
Many modem large-scale biology experiments use high-throughput DNA sequencing to study the behavior of individual biomolecules in pooled populations. These experiments encode the identity of individual members via DNA barcodes— short, unique DNA sequences that are coupled to each member in the population (FIG. 1A). DNA barcode-based identification is central to such diverse applications as single-cell genome and RNA sequencing1-7, gene synthesis8 9, high-throughput antibody screens10 11, and drug discovery12 13. Such experiments have been enabled by recent breakthroughs in massively-parallel, pooled DNA synthesis14 15. For example, a recent study used DNA barcodes to discover small molecule inhibitors of enzymes by screening about 108 small molecules. Each small molecule was attached to a unique set of three DNA barcodes. The highest affinity ligands were enriched via multiple rounds of selection and then identified via high-throughput sequencing of the attached barcodes16. The rapid growth of such methodologies in all areas of biomedicine requires the development of large pools (>l06 members) of unique DNA barcodes to identify individual members (e.g., cells, proteins, drugs) in heterogeneous ensembles.
Every assay using DNA barcodes is subject to errors introduced during DNA synthesis and sequencing. These errors decrease experimental power and accuracy by confounding the identity of individual biomolecules in the population. Base pair insertions and deletions are particularly challenging to decode because these mutations cause a frameshift in all downstream sequencing. Applying a manufacturer-advertised error rate of up to 1 per 200 nucleotides (nt)17 to a 20 base pair
(bp) long barcode with no error correction translates to a best-case scenario of 10% data lost or, worse, incorrectly interpreted. Next-generation sequencing also has error rates between 103 and 104. This alone represents errors in approximately 1% of the above example 20 bp barcodes, which can be limiting for detection of rare events. These errors can be overcome through the use of error-correcting DNA barcodes— DNA sequences that can correctly identify the underlying individuals in a pooled experiment even in the presence of DNA sequencing and synthesis errors.
Error-correcting barcodes should efficiently detect and correct DNA sequencing and synthesis errors. Many current DNA barcode strategies repurpose error-correcting codes developed for computers18 19, such as Hamming or Reed-Solomon codes, to DNA applications20 ’ 21. Hamming distance, which describes the number of substitutions between two sequences of equal length, is possibly the most used due to its simplicity. However, nearly all well-studied error-correcting codes developed in computer science— including the widely-used Hamming codes— were not designed to handle deletions and insertions. Such codes are generally used to only detect errors without correcting them. However, this method also fails to account for the possibility that a single error (e.g., deletion) can convert one barcode into another. Levenshtein codes, also known as edit codes, can theoretically account for all three types of common error: substitutions, insertions, and deletions, but only when the corrupted length of each barcode after errors is known22 23. This is a critical limitation in real-world DNA barcode applications because errors can change the barcode length unpredictably, which can lead to erroneous decoding of Levenshtein-based barcodes in the context of a longer read (FIG. 1B). As a workaround, Levenshtein codes can be used at twice the level of error correction as desired for a given application, for example using a 2 error-correcting code when a 1 -error correcting code is desired, but this is inefficient and significantly decreases the number of valid barcodes for a given oligonucleotide length. In sum, existing DNA barcode strategies are unable to efficiently detect and decode real-world errors encountered during DNA synthesis and sequencing.
The art of computerized sequencing for genes and small molecules continues to face a problem in that these materials may be analyzed in pooled populations with each population including a large number of individual specimens (e.g., strands of RNA or DNA) that need to be associated with different and often entirely unique identifiers. Coupling short, unique genetic sequences, referred to herein as“barcodes”, to each member of a given population of genetic specimens has been shown previously; however, decoding these barcodes has been complicated by mutations that occur within the barcode sequences due to physical genetic synthesis and computerized genetic sequencing operations. Problems to be solved, as discussed herein, lie in accounting for these kinds of errors by creating barcode libraries with error correction in mind. According to this disclosure, a genetic barcode may be selected for a member of the population, and that genetic barcode will have been created in a manner that allows for a given error tolerance in the barcode at decoding. The compositions and methods
disclosed herein address these and other needs.
SUMMARY
Disclosed herein are experimentally validated, error-correcting, filled/truncated Right End Edit (FREE) barcodes. FREE barcodes can correct substitutions, insertions, and deletions even when the edited length of the barcode is unknown, which is a significant advantage over known DNA barcode error-correcting methods. FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, absence of GGC motifs, and no self complementarity of more than two bases to reduce internal hairpin propensity. Lists of barcodes were generated with different lengths and error-correction levels that may be broadly useful in diverse high- throughput applications. For each barcode set, hairpin melting temperatures were calculated which can be used to select subsets of barcodes compatible with experimental conditions. The largest barcode list includes >l06 unique error-correcting barcodes usable in a single experiment. Moreover, appending two or more barcodes together in combination increases the total barcode set, producing >109-1012 unique error-correcting DNA barcodes. The included software for creating new barcode libraries and decoding/error-correcting observed barcodes is fast and efficient, decoding >120,000 barcodes per second with a single processor, and is designed to be user friendly for a broad biologist community.
Modern high-throughput biological assays study pooled populations of individual members by labeling each member with a unique DNA sequence called a“barcode.” DNA barcodes are frequently corrupted by DNA synthesis and sequencing errors, leading to significant data loss and incorrect data interpretation. Disclosed herein is a novel error-correction strategy to improve the efficiency and statistical power of DNA barcodes. The error-correcting method accurately handles insertions and deletions in DNA barcodes, the most common type of error encountered during DNA synthesis and sequencing, resulting in order-of-magnitude increases in accuracy, efficiency, and signal-to-noise. The accompanying software package makes deployment of these barcodes effortless for the broader experimental scientist community.
In one aspect, disclosed herein is a computerized method of generating a set of unique genetic barcodes for attaching to members of a pooled population of samples represented by respective genetic sequences. The set of genetic barcodes has boundaries established by controlling the presence and absence of selected combinations of base values when the selected combinations represent disallowed physical characteristics within the genetic samples. The method includes (i) identifying a common length of the barcodes in the set, wherein the length is a given number of base positions present in each genetic barcode, (ii) identifying base values available to fill each base position, wherein the respective genetic barcode is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode, and (iii) identifying a center barcode from which a sphere of multiple corrupt barcodes may be generated. Each of the multiple corrupt barcodes refers back to the
center barcode also in the sphere, and generating the multiple corrupt barcodes utilizes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the number of edits necessary in any one of the multiple corrupt barcodes to match the center barcode and generates combinations of bases filling the length of respective corrupt barcode sequences, comprising errors in the base values within the acceptable error level, such that the corrupt barcode sequences pack the sphere having the center barcode.
In another aspect, provided herein are computerized methods of decoding genetic barcodes subject to error formation and identified as being attached to members of a pooled population of genetic samples represented by respective genetic sequences. The computerized method includes (i) identifying an original length of a generated barcode from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions present in the generated barcode, (ii) identifying a barcode starting value position along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base of a respective genetic barcode, and (iii) identifying a center barcode within a sphere of multiple corrupt barcodes such that each of the multiple corrupt barcodes refers back to the center barcode also in the sphere. Identifying the center barcode includes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the number of edits necessary in any one of the corrupt barcodes to match the center barcode. The computer is further configured to utilize the barcode starting value position within the genetic sequence and evaluating bases up to the length of the generated barcode. The method stores the evaluated bases as a corrupt barcode in a computerized memory buffer and identifies a respective decode sphere in which the corrupt barcode exists. Finally, the method includes decoding the corrupt barcode to the center barcode of the respective decode sphere.
Additional aspects and advantages of the disclosure will be set forth, in part, in the detailed description and any claims which follow, and in part will be derived from the detailed description or can be learned by practice of the various aspects of the disclosure. The advantages described below will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are in and constitute a part of this specification, illustrate certain examples of the present disclosure and together with the description, serve to explain, without limitation, the principles of the disclosure. Like numbers represent the same element(s) throughout the figures.
FIG. l(A-C) are schematics depicting applications and error-correction strategies of DNA barcodes.
FIG. 1A depicts illustrative examples of high-throughput sequencing assays that require large lists of error-correcting DNA barcodes. Barcodes can be used to identify individual cells or molecules in pooled libraries (Klein, 2015; Fan, 2008; Melkko, 2004).
FIG. IB depicts current strategies to correct synthesis and sequencing errors in DNA barcodes that are confounded by insertions and deletions. Hamming distance can only handle substitutions. Levenshtein distance is limited by the fact that barcodes are prepended to other sequences of interest. Insertions and/or deletions (“indels”) produce phantom Levenshtein distance errors when bases from the remaining DNA molecule shift into or out of the barcode window.
FIG. 1C depicts examples of FREE divergence given an actual edit history. Levenshtein and Hamming distances are also shown for comparison. A substitution and insertion are correctly attributed as two edits by FREE divergence (first column). FREE divergence is a symmetric function, describable as FreeDiv(E, O) = FreeDiv(0, E) (first and second columns). Different actual edit paths can result in the same observed sequence (second and third columns). Indels can have zero cost, particularly near the end of the barcode where they can occasionally be undone by fill or truncation (fourth column). Edits past the barcode end can matter since the fill/truncation step happens only upon observation (fifth column).
FIG. 2(A-D) are schematics and graphs showing FREE barcode generation and decoding.
FIG. 2A shows error-correcting barcode generation as a sphere packing problem. Reserved around each accepted barcode B (e.g.,“CTCA”) is DecodeSpherem(B), the set of all sequences within FREE divergence m of B. That is, the set of all sequences with any combination of up to m errors from B, followed by fill or truncation as necessary. Any set of disjoint decode spheres is a valid FREE code (FIG. 2A, right panel).
FIG. 2B is a graph showing the number of single- and double-error correction barcodes generated for a range of barcode lengths.
FIG. 2C shows that the herein disclosed methods and software can decode more than 120,000 barcodes per second for all barcode lengths considered here.
FIG. 2D shows a comparison of FREE barcode counts against pruned Hamming codes and Levenshtein codes. Hamming codes were pruned to remove members that did not decode FREE divergence errors, while Levenshtein codes were produced at double the error-correction levels for the same purpose. FREE codes produce more barcodes than either of the other methods for all barcode lengths.
FIG. 3(A-C) shows experimental measurement of synthesis and sequencing error rates.
FIG. 3A is a schematic of the DNA constructs used for barcode validation experiments. Each
member in the synthetic library had a unique pair of left and right barcodes drawn from a list of more than 8,000 17- nucleotides FREE codes with double-error correction. By using the left and right primer regions to distinguish the left and right ends from one another, it could be determined whether the left and right barcodes were correctly decoded (matching) or incorrectly decoded (mismatching).
FIG. 3B is a graph showing the measured synthesis error rates, by intended reference base and error type— substitution (sub), deletion (del), and insertion (ins).
FIG. 3C is a graph showing measured sequencing substitution error rates, by reference base. Insertions and deletions from illumina sequencing are extremely rare and are omitted for clarity.
FIG. 4(A-B) are graphs showing decoding corrupted barcodes from simulated errors. FIG.
4A shows modeled and simulated decoding error rates given per-base error rate for length 8 barcodes and length 16 barcodes. Barcode sets are labeled according to length and number of errors corrected.
FIG. 4B shows that the 16-2 code is length 16 and corrects up to 2 errors. Solid lines show the error rate approximations using a binomial model. Circles and triangles show direct simulation error rates for single- and double-error correcting codes, respectively. Substitution, insertion, and deletion errors each have simulated error rates P( error per base)/3 for simplicity.
FIG. 5 is a graph showing decoded corrupted barcodes from experimental data. Observed decoding error rates are compared with theoretical rates from the synthesis and sequencing error rates.
FIG. 6(A-D) shows combined barcode libraries via concatenation of FREE barcodes. FIG. 6A shows concatenated barcodes can be decoded sequentially in a left-to-right order, even when the end position of each edited sub-barcode is not initially known. The decoded first FREE sub-barcode can be used to find the starting position of the next sub-barcode, and similarly for subsequent sub-barcodes.
FIG. 6B shows concatenated barcode decoding error rates. Concatenated barcode labels use the following format: a 3x(l6-l) barcode consists of three concatenated sub-barcodes, each of which is 16 bp long and can correct up to 1 error.
FIG. 6C and FIG. 6D show concatenating multiple barcodes combining to increase the numbers of effective FREE barcodes. Concatenated barcodes can correct the same number of errors per sub-barcode. When the errors are distributed evenly among the sub-barcodes, concatenated barcodes can correct a higher total number of errors than the individual sub-barcodes.
FIG. 6C shows concatenated single-error correcting barcodes.
FIG. 6D shows concatenated double-error correcting barcodes. Dashed lines: projected quantities calculated by sampling; dotted lines: log-linear projections.
FIG. 7(A-C) are graphs showing decode sphere volumes and code efficiency.
FIG. 7A shows that unlike Hamming decode spheres, FREE divergence decode spheres do not have uniform volume due to degeneracy of insertions and deletions. For example, the sequence AACT only has three unique deletions because a deletion of either A generates the same resulting sequence.
Sphere volumes of 1- and 2-error codes are shown for all words and for only valid code words after application of FREE code synthesis and sequencing filters (no homopolymer runs, no triplet complementarity, etc.)· Black lines are explained in FIG. 7B below.
FIG. 7B shows optimal sphere packing bounds. The optimal packing for an error-correcting code is not known in general. Typical code generating algorithms, including the herein disclosed algorithms, are instead heuristics for finding relatively good codes. For reference, one can usually find a (typically impossible) upper bound on the maximum number of code words by calculating the volume of the space divided by the volume of a single decode sphere. As shown in FIG. 7A, the volumes of FREE divergence decode spheres are not uniform, so the volume of every sphere in the space is instead determined, sorted, and the minimum number of barcodes at which the cumulative sum of barcode sphere volumes is smaller than the space is determined. The upper bound calculated for valid code words is shown. The volume at which that happens for each code is shown in FIG. 7A as black lines. The lower bound is the best efficiency achieved by any code generation method to date, which for FREE codes is simply the number of barcodes reported herein. The actual maximum possible number of barcodes is somewhere between the two.
FIG. 7C shows raw and actual code rates for each FREE barcode set disclosed herein as well as the asymptotic values they approach.
FIG. 8(A-B) are graphs depicting error rate simulations by error type. FIG. 8A and FIG. 8B show the simulations performed for FIG. 4, repeated for each error type - substitutions (top panel; also called mismatches), deletions (middle panel), insertions (bottom panel) - individually. Barcode sets are labeled according to length and number of errors corrected; for example, the 16-2 code is length 16 and corrects up to 2 errors. Mismatches follow the binomial approximation closely, while deletions and especially insertions perform slightly better than the binomial approximation.
FIG. 8A is shown for base pair length 8.
FIG. 8B is shown for base pair length 16.
FIG. 9 is a set of graphs depicting error rate comparison with constant barcode length. The binomial approximation of the decode error rate as a function of the error rate per base, grouped by given barcode length. The barcode length used for each is denoted at the top of each individual graph.
FIG. 10 is a set of graphs depicting error rate comparison with constant barcode number of errors corrected. The binomial approximation of the decode error rate as a function of the error rate per base, grouped by given number of errors corrected.
FIG. 11 is a set of graphs depicting error rate comparison with constant number of barcodes.
The binomial approximation of the decode error rate as a function of the error rate per base, grouped by number of barcodes. Numbers of barcodes were not precisely equal, and are indicated as approximations at the top of each graph. Rather, each panel started with the number of 2-error
correcting barcodes and used the smallest 0- and 1 -error correcting barcode sets with at least as many barcodes.
FIG. 12 is a graph depicting a coverage histogram and statistics for the FREE code validation experiment. Each of the 8,684 oligos were observed with average coverage of l59x.
FIG. 13 is a graph depicting maximum error run length probabilities. The probability distribution of maximum consecutive-error run lengths from a model assuming independent errors and from herein disclosed data. The two differ significantly because errors in the herein disclosed data are not independent.
FIG. 14 is a set of graphs depicting hairpin melting temperatures (Tm). Hairpin melting temperature CDFs are shown for all barcodes libraries included with this manuscript. The barcodes included here nearly all have a Tm < 60°C, and users can further filter the barcode sets to avoid hairpins in their specific experimental conditions.
DETAILED DESCRIPTION
The following description of the disclosure is provided as an enabling teaching of the disclosure in its best, currently known embodiment(s). To this end, those skilled in the relevant art will recognize and appreciate that many changes can be made to the various embodiments of the invention described herein, while still obtaining the beneficial results of the present disclosure. It will also be apparent that some of the desired benefits of the present disclosure can be obtained by selecting some of the features of the present disclosure without utilizing other features. Accordingly, those who work in the art will recognize that many modifications and adaptations to the present disclosure are possible and can even be desirable in certain circumstances and are a part of the present disclosure. Thus, the following description is provided as illustrative of the principles of the present disclosure and not in limitation thereof.
Terminology
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which this invention belongs. The following definitions are provided for the full understanding of terms used in this specification.
Disclosed are the components to be used to prepare the disclosed compositions as well as the compositions themselves to be used within the methods disclosed herein. These and other materials are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these materials are disclosed that while specific reference of each various individual and collective combinations and permutation of these compounds may not be explicitly disclosed, each is specifically contemplated and described herein. For example, if a particular barcode is disclosed and discussed
and a number of modifications that can be made to the barcode are discussed, specifically contemplated is each and every combination and permutation of the barcode and the modifications that are possible unless specifically indicated to the contrary. Thus, if a class of barcodes A, B, and C are disclosed as well as a class of barcodes D, E, and F and an example of a combination barcode, or, for example, a combination barcode comprising A-D is disclosed, then even if each is not individually recited each is individually and collectively contemplated meaning combinations, A-E, A-F, B-D, B-E, B-F, C-D, C-E, and C-F are considered disclosed. Likewise, any subset or combination of these is also disclosed. Thus, for example, the sub-group of A-E, B-F, and C-E would be considered disclosed.
This concept applies to all aspects of this application including, but not limited to, steps in methods of making and using the disclosed compositions. Thus, if there are a variety of additional steps that can be performed it is understood that each of these additional steps can be performed with any specific embodiment or combination of embodiments of the disclosed methods.
It is understood that the compositions disclosed herein have certain functions. Disclosed herein are certain structural requirements for performing the disclosed functions, and it is understood that there are a variety of structures which can perform the same function which are related to the disclosed structures, and that these structures will ultimately achieve the same result.
Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its steps or it is not otherwise specifically stated in the claims or descriptions that the steps are to be limited to a specific order, it is no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including: matters of logic with respect to arrangement of steps or operational flow; plain meaning derived from grammatical organization or punctuation; and the number or type of embodiments described in the specification.
As used in the specification and claims, the singular form“a,”“an,” and“the” include plural references unless the context clearly dictates otherwise. For example, the term“an agent” includes a plurality of agents, including mixtures thereof.
As used herein, the terms“can,”“may,”“optionally,”“can optionally,” and“may optionally” are used interchangeably and are meant to include cases in which the condition occurs as well as cases in which the condition does not occur. Thus, for example, the statement that a formulation“may include an excipient” is meant to include cases in which the formulation includes an excipient as well as cases in which the formulation does not include an excipient.
Ranges can be expressed herein as from "about" one particular value, and/or to "about" another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as
approximations, by use of the antecedent "about," it will be understood that the particular value forms another embodiment. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as "about" that particular value in addition to the value itself. For example, if the value" 10" is disclosed, then "about 10" is also disclosed.
By“hybridizable” or“complementary” or“substantially complementary” it is meant that a polynucleotide (e.g. DNA or RNA) comprises a sequence of nucleotides that enables it to non- covalently bind, to another polynucleotide in a sequence-specific, antiparallel, manner (i.e., a polynucleotide specifically binds to a complementary polynucleotide) under the appropriate in vitro and/or in vivo conditions of temperature and solution ionic strength. An example of polynucleotide hybridization is the binding of a polymerase chain reaction (PCR) primer to a polynucleotide template (e.g., in a sequencing reaction). It is understood in the art that the sequence of polynucleotide (e.g., a PCR primer or probe) need not be 100% complementary to that of a target polynucleotide to specifically hybridize with the target polynucleotide. A polynucleotide (e.g., a PCR primer or probe) can comprise at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or 100% sequence complementarity to a target site within the target polynucleotide sequence.
A "primer" is a short polynucleotide, generally with a free 3'-OH group that binds to a target or "template" potentially present in a sample of interest by hybridizing with the target, and thereafter promoting polymerization of a polynucleotide complementary to the target. A "polymerase chain reaction" ("PCR") is a reaction in which replicate copies are made of a target polynucleotide using a "pair of primers" or a "set of primers" consisting of an "upstream" and a "downstream" primer, and a catalyst of polymerization, such as a DNA polymerase, and typically a thermally- stable polymerase enzyme. Methods for PCR are well known in the art, and taught, for example in "PCR: A
PRACTICAL APPROACH" (M. MacPherson et al., IRL Press at Oxford University Press (1991)).
As used herein,“polynucleotide” and“oligonucleotide” are used interchangeably and generally refer to a linear polymer of nucleotide monomers of DNA or RNA. Monomers making up a polynucleotide are capable of specifically binding to a second polynucleotide by way of a regular pattern of monomer- to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, for example. Such monomers and their internucleosidic linkages may be naturally occurring or may be analogs thereof (e.g., naturally occurring or non-naturally occurring analogs). Non-limiting examples non-naturally occurring analogs include phosphorothioate internucleosidic linkages, bases containing linking groups permitting the attachment of labels, such as fluorophores, or haptens. In some embodiments,“oligonucleotide” can refer to (relatively) smaller polynucleotides comprising, for example, 2-100, 3-75, or 3-50 monomeric
units. Polynucleotides may, in some instances, include the natural deoxyribonucleosides (e.g., deoxyadenosine, deoxycytidine, deoxyguanosine, and deoxythymidine for DNA or their ribose counterparts for RNA) linked by phosphodiester linkages. However, polynucleotides can also include non-natural nucleotide analogs (e.g., including modified bases, sugars, or internucleosidic linkages). In an embodiment, a polynucleotide may be represented by a sequence of letters (upper or lower case), such as“ATGCCTG,” and it will be understood that the nucleotides are in 5' 3' order from left to right and that“A” denotes deoxyadenosine,“C” denotes deoxycytidine,“G” denotes deoxyguanosine, and“T” denotes deoxythymidine, and that“I” denotes deoxyinosine, and“U” denotes deoxyuridine (in the case of RNA), unless otherwise indicated or implied from context. Unless otherwise noted the terminology and atom numbering conventions will follow those disclosed in Strachan and Read, HUMAN MOLECULAR GENETICS, 2nd ed. (Wiley-Liss, New York, 1999).
A barcode can be used to identify a barcoded sample. Often, one or more barcoded samples are mixed in a heterogenous pool, which can contain known or unknown components. The use of a barcode facilitates the identification of a barcoded sample. Typically, any given barcode identifies only one barcoded sample, and any given barcoded sample is identified by only one barcode. Thus, the barcode and the barcoded sample can be exclusively assigned. This is conceptually similar to the use of scannable universal product codes (UPC) on consumer products of a grocery store shelf. The barcode is a specially designed DNA sequence attached to a barcoded sample and which identifies the barcoded sample for each read output by a DNA sequencer. DNA can be extracted from each barcoded sample, and the unique identifying barcode for that sample is amplified in a polymerase chain reaction (PCR). DNA extracts from multiple barcoded samples, or multiple PCR amplicons from PCRs of separate barcoded samples, can be mixed together (multiplexed) for sequencing numerous samples simultaneously in bulk, thereby reducing time and complexity. In multiplexed sequencings, a barcoded sample can thus be tracked by identification of its barcode.
Barcoding individual samples with specific DNA tags allows many samples to be combined for sequencing, streamlining the workflow, reducing workflow complexity, decreasing time to result, and reducing costs. DNA barcodes can be selected to be of sufficient length to generate the desired number of barcodes with sufficient variability to account for common sequencing errors, generally ranging in size from about 2 to about 20 bases, but may be longer or shorter. Longer barcodes will permit higher sequence diversity and typically allow more samples to be combined. The target specific PCR sequences for the forward and reverse PCR primers can be specific for any DNA sequence, in coding or non-coding regions of a target genome, plasmid, or organelle.
DNA barcoding has been described for many sequencing platforms. For instance, the first platform of next-generation sequencing (NGS), the GS 20, was developed for barcoding in 2007 (Parameswaran P, Jalili R, Tao L, et al. Nucleic Acids Research. 2007; 35(l9):el30). Since that time,
barcoding strategies have been made commercially available by Illumina (San Diego, Calif.), Pacific Biosciences (Menlo Park, Calif.), and Thermo Fisher Scientific (Waltham, Mass.), such as the Ion Torrent™ (Thermo Fisher Scientific, Waltham, Mass.) sequencing platforms.
A high throughput sequencing method can be any sequencing method, with high throughput generally meaning greater than 100 (e.g., 1,000 or more) reads per run. Next-generation sequencing (NGS) refers to modem high throughput sequencing platforms that parallelize the sequencing process, producing thousands or millions of sequences concurrently, in contrast to less-efficient and more expensive standard dye-terminator methods. Non-limiting examples of NGS methods include single molecule real-time sequencing (also referred to as Pacific Biosciences or PacBio), ion semiconductor (also referred to as Ion Torrent sequencing), pyrosequencing (also referred to as Roche 454), sequencing by synthesis (also referred to as Illumina sequencing), sequencing by ligation (also referred to as SOLiD sequencing) and chain termination sequencing (also referred to as Sanger sequencing).
High throughput DNA sequencing can be carried out on pooled, barcoded PCR amplicon DNA sequences as described above, producing a file with individual DNA sequences from all samples in random order. In some embodiments, the high throughput DNA sequencing is a next-generation sequencing (NGS) method, such as single-molecule real-time sequencing, ion semiconductor sequencing, pyrosequencing, sequencing by synthesis, sequencing by ligation and chain termination sequencing. Obtained DNA sequences from each barcoded sample (or PCR amplicons thereof) can be identified by barcode after sequencing and all reads are sorted into files by barcode. In some embodiments, separating the DNA sequences and identifying target DNA sequences and/or barcode sequences can be performed by computer implemented methods, and can further employ database searches. In some embodiments, the databases may include online databases such as BLAST
(Altschul, S. et al. (1990).“Basic local alignment search tool”. Journal of Molecular Biology. 215 (3): 403-410), GreenGenes (Appl. Environ. Microbiol. July 2006 vol. 72 no. 7 5069-5072), SILVA (Quast, et al., (2013) Nucl. Acids Res. 41 (Dl): D590-D596.), and/or analysis pipelines such as QIIME (Caporaso I., Nature Methods, 2010; doi:l0.l038/nmeth.f.303) or mothur (Schloss, et al., Appl.
Environ. Microbiol. December 2009 vol. 75 no. 23 7537-7541).
In some embodiments, barcodes can be designed to exhibit improved read accuracy for sequencing using a sequence-by-synthesis platform (as discussed previously), which can include fluorophore-labeled nucleotide sequencing platforms or non-labeled sequencing platforms, such as, for example, the Ion PGM™ and Ion Proton™ Sequencers, and the Ion SS™ and Ion S5 XL Next Generation™ Sequencing System. Design of the barcodes is not limited to any particular instrument platform or sequencing technology, however.
General methods of creating, using, and interpreting experiments results of barcodes are disclosed in a number of references, for example, US Pat. App. Pub. Nos. US20170335410,
US20170166956, US20170198345, US20170342465, US20170342405, and US20180002748, each of which are incorporated by reference in their entireties.
In some embodiments, the barcode has a number of nucleotide bases (sometimes referred to as a barcode length) of at least 3 bases, at least 4 bases, at least 5 bases, at least 6 bases, at least 7 bases, at least 8 bases, at least 9 bases, at least 10 bases, at least 11 bases, at least 12 bases, at least 13 bases, at least 14 bases, at least 15 bases, at least 16 bases, at least 17 bases, at least 18 bases, at least 19 bases, or at least 20 bases. In some embodiments, the barcode has a number of nucleotide bases from 3 to 20 bases, from 3-18 bases, from 3-16 bases, from 3-15 bases, from 3-12 bases, or from 3-8 bases.
In some embodiments, two or more barcodes can be used in combination in a concatenated sequence. Combined use of barcodes in concatenated barcodes facilitate generation of a high number of barcodes (e.g., over 106 and even 1012 or more). Combined use of barcodes, as it relates to concatenated barcodes, refers to the combining of two or more barcodes from a DecodeSphere in a single polynucleotide to form a larger barcode (a“concatenated barcode”). In the context of concatenated barcodes, the two or more barcodes from a DecodeSphere are occasionally referred to as “sub-barcodes.” The concatenated barcode can, but need not, identify a different barcoded sample compared to that of the individual sub-barcodes. A concatenated barcode comprises sub-barcodes in a single polynucleotide wherein the sub-barcodes are disposed along the polynucleotide sufficiently close to an adjacent sub-barcode such that the concatenated barcode can be identified from a single amplicon formed from a PCR amplification reaction. For example, a sub-barcode can be disposed along the polynucleotide within a nucleotide base length of an adjacent sub-barcode of 100 bases or less, 75 bases or less, 50 bases or less, 25 bases or less, 10 bases or less, 9 bases or less, 8 bases or less, 7 bases or less, 6 bases or less, 5 bases or less, 4 bases or less, 3 bases or less, 2 bases or less, or 1 base or less. In some embodiments, a first sub-barcode can be immediately adjacent to a second sub barcode (e.g., the 3’ end of the first sub-barcode is directly and covalently linked by a phosphodiester bond to the 5’ end of a second sub-barcode). However, as disclosed herein, frameshift mutations such as insertions and deletions can interrupt one or more sub-barcodes, or can alter the spacing between two or more sub-barcodes (e.g., by inserting a base pair between two immediately adjacent sub barcodes). Thus, in some embodiments, a concatenated barcode can be defined by an intended sequence, for example a polynucleotide sequence designed by a user, without regard to the
experimentally determined sequence of the concatenated barcode prior to error-correcting. A concatenated barcode can comprise sub-barcodes which are of the same base length (number of bases or base pairs; bp) or of different base lengths.
A barcoded sample can be any composition of matter capable of attachment, either directly or indirectly, with a polynucleotide comprising a barcode. The barcoded sample can be, for example, a small molecule, a macromolecule (e.g., a nucleic acid, protein, lipid, polysaccharide, or synthetic or
biological polymer), a solid support (e.g., a synthetic particle, a bead such as an affinity bead, a plastic, a metallic substance), a vesicle (e.g., liposome, exosome, micelle), an artificial or biological cell, or others. Examples of suitable beads include silica gel beads, controlled pore glass beads, magnetic beads, Dynabeads, Sephadex/Sepharose beads, cellulose beads, polystyrene beads, or any combination thereof. In the instance of a cell, a barcoded sample and a cell are said to be attached when a barcoded sample either enters a cell or is attached to a component (e.g., a lipid) on the surface of a cell. Direct attachment can include, for instance, covalent linkage of a 3’ end of a polynucleotide comprising a barcode to the barcoded sample, for instance a chemically accessible motif of a small molecule capable of covalent linkage to the 3’ end or other portion of a polynucleotide comprising a barcode. Indirect attachment can include a physical attachment but without a direct covalent linkage between the barcoded sample and the polynucleotide comprising a barcode. For example, an intervening component can link the barcoded sample and the polynucleotide comprising a barcode. An example of an indirect attachment includes a linkage between a polynucleotide comprising a barcode and a solid support (e.g., a bead), wherein the solid support is further linked to the barcoded sample (e.g., an antibody or small molecule).
In some embodiments, the methods can include a plurality of barcodes. In some embodiments, the methods can include at least 50, at least 100, at least 250, at least 500, at least 750, at least 1,000, at least 2,000, at least 5,000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1,000,000 barcodes. In some embodiments, the methods can include at least 107, at least 108, at least 109, at least 1010, at least 1011, at least 1012, at least 1013, at least 1014, or at least 1015 barcodes. The total number of barcodes can include single barcodes only, concatenated barcodes only, or both single barcodes and concatenated barcodes. A plurality of barcodes can sometimes be referred to as a barcode library. Alternatively, the term“barcode library” can also refer to a plurality of barcodes, each attached to a single barcoded sample of a plurality of barcoded samples.
A polynucleotide (e.g., a polynucleotide comprising a barcode) can be sequenced by any sequencing method known in the art to sequence a polynucleotide. In some embodiments, the DNA sequencing method is a high throughput sequencing method, for instance a next-generation sequencing (NGS) method. In some embodiments, the DNA sequencing method is a low throughput sequencing method, for example Sanger sequencing. Sanger sequencing is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication. In some embodiments, sequencing results and other polynucleotide analyses can be represented and analyzed by digitized sequences using known computer implemented methods.
In some embodiments, a polynucleotide comprising a barcode can be immobilized on a solid support. In some embodiments, the solid support is a bead (e.g., an affinity bead), multi-well plate (e.g., 96-well plate), filter surface, or other solid support. For example, a barcoded polynucleotide can
be attached to a solid support at one end, and attached (directly or indirectly) to a barcoded sample at the other end.
As set forth in the figures attached, the technology disclosed herein takes advantage of a highly efficient way to tag samples in a pooled set of specimens with a genetic barcode as shown in Figure 1A (5A-5C) and account for likely corruption of the barcode sequences during experimentation, genetic synthesis, and genetic sequencing operations. The corruption is caused by naturally occurring errors in genetic sequences that change certain portions of a genetic barcode sequence from an originally generated genetic barcode (25) to a different sequence that does not match the original barcode exactly and may lead to errors for certain identification purposes, absent additional processing. This disclosure utilizes computers, computer processors, and associated kinds of computerized memory to provide additional processing functionality that enables even corrupted barcodes to be decoded back to an intended original genetic barcode, which in turn identifies a property of the specimen. The identified property of the specimen includes, but is not limited to, a source of the specimen, a characteristic of the specimen, or any information or metadata that can be useful to track in regard to the specimen to which the genetic barcode is attached. Figure 1B illustrates that the genetic barcodes (25) described herein start with a particular set of base values (27) filling the base positions (20) of a genetic barcode having a defined length (145). The base values, however, may change during the course of genetic synthesis and genetic sequencing operations. These changes may be the result of base substitutions (10), base insertions (12), and base deletions (14) as illustrated in Fig. 1B. Figure 1C shows that originally generated genetic barcodes (25) have base values with a starting position (45A-45E) in an overall genetic sequence for a sample. The base values fill the base positions in the bar code sequence up to the length (147) of the genetic barcode (25). As shown in the figure, however, these base values (27) in an expected configuration of the original genetic barcode (25) are subject to actual edits (53) such that the observed version (55) of the barcode (corrupted with mutations in the base values) is different due to truncation and fill effects as shown.
One goal of the technology described herein lies in identifying not only the corrupt barcodes (63) as observed after various edits, but also to be able to use those corrupt barcodes in reliably determining to which of the originally generated genetic barcodes (25) the corrupt barcode (63) should refer back to after decoding.
This disclosure incorporates the use of any computerized hardware and software necessary to fulfill these purposes herein. Accordingly, in describing the embodiments as computerized methods for generating and decoding barcodes, the methods may be practiced with software and hardware particularly configured for the purposes at hand. This disclosure is particularly suited for generating a set of unique barcodes that can be used in genetic labeling and then decoded even after certain errors have been incorporated into the barcode sequences.
In one aspect, illustrated generally at FIG. 2A, a computerized method of generating a set (23) of unique genetic barcodes (25) for attaching to members of a pooled population of samples (5) represented by respective genetic sequences (35). The set of genetic barcodes has boundaries (33) established by controlling the presence and absence of selected combinations of base values (20) when the selected combinations represent disallowed physical characteristics within the genetic samples. Controlling select combinations of base values involves implementing lexicographical constraints (e.g., constraining the genetic code in a sequence) and identifying base values (27) available to fill each base position in terms of “GC” code content, disallowing homopolymer triplets, disallowing triplet self complementarity, and disallowing GGC base values (illumina error motif).
The method includes (i) identifying a common length (145) of the barcodes (25) in the set (23), wherein the length (145) is a given number of base positions (20) present in each genetic barcode (25), (ii) identifying base values (27) available to fill each base position (20), wherein the respective genetic barcode (25) is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode (63), and (iii) identifying a center barcode (125) from which a sphere (70) of multiple corrupt barcodes (63) may be generated. Each of the multiple corrupt barcodes (63) refers back to the center barcode (125) also in the sphere (70). The method also includes generating the multiple corrupt barcodes (63) by using a computer executing software configured to receive an acceptable error level (herein,“m”) for the sphere (70) of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits necessary in any one of the multiple corrupt barcodes (63) to match the center barcode (125) and generates combinations of base values (27) filling the length (145) of respective corrupt barcode (63) sequences. The corrupt barcodes (63) within a given sphere (70) and within given set (33) have errors in the base values (27) within the acceptable error level, such that the corrupt barcode sequences pack the sphere (70) having the center barcode (125). Generating combinations of base values (27) having up to the maximum number of errors allowed by the error level includes adjusting base values (27) from the center barcode (125) with end fills and end truncations of base values at one end of the length of the genetic barcode, up to the entire length (145) of the genetic barcode.
In one non- limiting embodiment, the method further includes generating distinct center barcodes (125) for each member of the pooled population of genetic samples (5A-5C) and using respective distinct center barcodes (125) to generate additional spheres of corrupt barcodes that relate back to a corresponding distinct center barcode. The computerized method is configured to result in the sphere and the additional spheres (80, 90, 110) being disjointed with no common members at the identified error level. In yet another embodiment, genetic barcodes (25) can be concatenated so that multiple barcodes form a single new barcode (i.e., the multiple barcodes become sub-barcodes), and, using the
sub-bar codes, new spheres (150, 175, 195) of corrupt barcodes are generated with computerized equipment.
In another aspect, provided herein are computerized methods of decoding genetic barcodes (25) subject to error formation and identified as being attached to members of a pooled population of genetic samples (5) represented by respective genetic sequences. The computerized method includes (i) identifying an original length (145) of a generated barcode (25) from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions (20) present in the generated barcode, (ii) identifying a barcode starting value position (45) along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base value of a respective genetic barcode, and (iii) identifying a center barcode (125) within a sphere (70) of multiple corrupt barcodes (63) such that each of the multiple corrupt barcodes (63) refers back to the center barcode also in the sphere (70). Identifying the center barcode (125) includes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits (53) necessary in any one of the corrupt barcodes to match the center barcode (125). The computer is further configured to utilize the barcode starting value position (45) within the genetic sequence and evaluating bases up to the length (145) of the generated barcode. The method stores the evaluated bases as a corrupt barcode in a computerized memory buffer and identifies a respective decode sphere (70) in which the corrupt barcode (63) exists. Finally, the method includes decoding the corrupt barcode to the center barcode of the respective decode sphere.
In another embodiment, a computerized method as described above for decoding genetic barcodes further includes identifying the respective decode sphere in which the corrupt barcode exists by evaluating the base position values (27) after the starting point (45) and up to the original length of the generated barcode (25). The base position values after the starting point (45) and up to the original length (145) of the generated barcode include filled positions at an end of the length, wherein the filled positions do not match the corresponding positions of the generated barcode. In other scenarios, the base position values after the starting point and up to the original length of the generated barcode do not include base positions from the originally generated barcode that have been truncated at an end of the length. For genetic sequences having multiple barcodes (155, 175, 195) therein, after decoding a first corrupt barcode (155) in the genetic sequence, the computerized method identifies a true length (145) of the first corrupt barcode and from that identifies a next start position in the genetic sequence for a sequential barcode (175, 195) to be decoded.
As noted above, the hardware and software used to enable this disclosure will be sufficient to provide for sets of genetic barcodes to be stored, subject to entry in a -up table or other matching means, and configured with a graphical user interface to receive at least one of an acceptable error level, an acceptable probability of a number of errors in a corrupted barcode, and a genetic barcode length.
EXAMPLES
To further illustrate the principles of the present disclosure, the following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how the compositions, articles, and methods claimed herein are made and evaluated. They are intended to be purely exemplary of the invention and are not intended to limit the scope of what the inventors regard as their disclosure. These examples are not intended to exclude equivalents and variations of the present invention which are apparent to one skilled in the art. Unless indicated otherwise, temperature is °C or is at ambient temperature, and pressure is at or near atmospheric.
There are numerous variations and combinations of process conditions that can be used to optimize product quality and performance. Only reasonable and routine experimentation will be required to optimize such process conditions.
Example 1.
Abstract
Many large-scale high-throughput experiments use DNA barcodes— short DNA sequences prepended to DNA libraries— for identification of individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results. Widely-used error-correcting codes borrowed from computer science (e.g., Hamming and Levenshtein codes) do not properly account for insertions and deletions in DNA barcodes, even though deletions are the most common type of synthesis error. Disclosed herein are experimentally validated FREE (Filled/truncated Right End Edit) barcodes, which can correct substitution, insertion, and deletion errors, even when these errors alter the barcode length. FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, and reduced internal hairpin propensity. Lists of barcodes with different lengths and error-correction levels were generated which may be useful in diverse high-throughput applications, including >l06 single-error correcting l6-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space in combination, generating lists with > 1015 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.
Results
Overview of Filled/truncated Right End Edit (FREE) Divergence Codes. After DNA synthesis and sequencing, a barcode of length n can be altered, and is not guaranteed to end after exactly n bases. A goal was to design barcodes that can be unambiguously identified from the first n bases of the sequenced read. To begin, a filled/truncated right-end m-edit, hereafter written“FRE m-edit,” of a DNA sequence of length n was designed to be the result of any m edits— substitutions (sub), insertions (ins), or deletions (del)— followed by truncating or filling with any random bases on the right (as from the unknown downstream read) as necessary to return to original length n (FIG. 1B). For any two DNA sequences X and Y of the same length, the Filled/truncated Right End Edit (FREE) Divergence between X and Y, written FreeDiv(X, Y), was defined to be the minimum m such that Y is a FRE m-edit of X.
FIG. 1C shows a typical example of how FREE divergence captured the actual number of barcode edits in the context of a longer read. An insertion caused the final T to move out of the barcode window, but FREE divergence correctly accounts for its loss. FREE divergence is a symmetric function, i.e. FreeDiv(X, Y) = FreeDiv(Y, X) (FIG. 1C). This is because reversing the edits and reversing the right-end fill or truncation step moves one from Y back to X in the same minimum number of steps. FREE divergence is defined as the minimum number of steps between the expected and observed barcode, but it is possible to accomplish the same transformation with more edits, for example via the identity ins-del = sub (FIG. 1C). Also, insertions and/or deletions (indels) near the end of the sequence can result in a FREE divergence of zero if the inserted or filled bases match the truncated or deleted bases respectively. While FIG. 1C shows this for deletions (resulting in a zero score), inserting‘GC’ instead of deleting it results in the same sequenced barcode example. Finally, FREE divergence is not a metric— a mathematically precise term for distance— because edits outside the barcode window can lead to violation of the triangle inequality (FIG. 1C, Example 2). This requires the use of specialized code generation techniques that do not rely on the properties of a metric, and also underlies usage of the term divergence rather than distance throughout this example.
With FREE divergence defined, building an error correcting barcode list is conceptually equivalent to packing spheres in the space of possible barcodes (FIG. 2A). A barcode length n is set, and any DNA sequence of length n is referred to as a word. For any word B, the set of all words is called W such that FreeDiv(B, W) < m the m-error decode sphere of B, written as DecodeSpherem(B), or just DecodeSphere(B) if m is clear from context. Any observed DNA sequence within
DecodeSphere(B) will by definition decode to (error-correct to) the center word B (FIG. 2 A). Then, an m-error correcting FREE code is simply any set of barcodes such that the m-error decode spheres of all barcodes are disjoint, or, in other words, no two decode spheres overlap. Any corrupted barcode with up to m errors is thus in the decode sphere of exactly one barcode and can be decoded (error-corrected)
uniquely (FIG. 2A). Requiring disjoint decode spheres places a limit on the relationship between allowed m, the number of correctible errors, and n, the barcode length: to fit more than one non overlapping decode sphere in the space requires that 2 m + 1 < n.
Efficient FREE barcode generation and decoding. A software library accompanying this manuscript efficiently generates FREE barcodes with a given total length and error-correction level. The generation algorithm is conceptually very simple: iterate through the space of n-mers
alphabetically, determine the decode sphere for each candidate barcode, and reserve barcodes when their decode spheres do not overlap the decode spheres of any previously reserved barcodes (FIG. 2A). This set of reserved barcodes by definition forms a valid FREE code. Additional algorithmic details make the process faster and more memory efficient. Adding valid code words in alphabetical order is a heuristic method previously observed to efficiently pack spheres24. Experimental synthesis and sequencing limitations are also during barcode selection. Candidate barcodes must have: (1) balanced GC content (40-60%); (2) no homopolymer triples (e.g., AAA); (3) no GGC (a known Illumina-based error motif25); and (4) no self-complementarity of >2 bases to reduce hairpin propensity. All disclosed software is available in the GitHub repository (https://github.com/finkelsteinlab).
The number of available error-correcting barcodes for a DNA sequence of length n depends on the experimentally-required degree of error-correction (FIG. 2B). Libraries of single-error correcting codes up to a l6-nucleotide length were generated, containing >1,600,000 barcodes. In addition, more robust, double-error correcting codes up to a l7-nucleotide length with >23,000 unique members were generated (Table 1). Barcodes correcting m errors required length at least 2m + 1 bp to avoid having all decode spheres overlap all other decode spheres. Thus, the 1 -error and 2-error correcting barcode libraries had minimum lengths of 3 bp and 5 bp respectively. The barcode decoding software ran in time proportional to the length of the barcodes but constant with respect to the number of barcodes in the library. Hence, 1 -error and 2-error correcting codes decoded at the same speed for a given barcode length even though the 1 -error libraries contained many more barcodes (FIG. 2C). Even the slowest decodes considered, the l7-mer double-error correction barcodes, decoded at >120,000 barcodes · sec 1 on a desktop computer using a single processor.
Barcode 1-Error 2 -Error
Length Correction Correction
3 2
4 3
5 10 2
6 27 2
7 67 4
8 213 7
9 554 12
10 1,903 31
11 6,161 75
12 17,214 179
13 56,736 468
14 157,197 1,156
15 518,509 3,183
16 1,636,418 8,777
Table 1: Number of FREE barcodes for each barcode set disclosed herein, by barcode length and number of errors corrected. Comparison with current error-correcting DNA barcode strategies. Current state-of-the art error correcting DNA barcoding applications often use Hamming or Levenshtein error-correction strategies2023. Hamming codes only correct substitutions, and are thus insufficient for any DNA barcode applications with indels26. However, Hamming codes are linear codes, meaning the code words form a well-structured lattice in barcode space. An alternative hypothesis was tested whether pruning well-packed Hamming decode spheres to subsets with disjoint FreeDiv decode spheres could result in a more efficient packing— more barcodes for a given barcode length— than an alphabetical generation strategy. This was not, in fact, the case: FREE codes have about a factor of two more barcodes for a given length than pruning of Hamming codes (FIG. 2D).
Levenstein codes can be used directly (e.g., without pruning) because they account for indels, but must be used at 2-fold higher error correction for DNA barcode applications (FIG. IB). Such over corrected Levenshtein barcode sets were generated in a manner similar to the FREE code generation strategy. This strategy produced even fewer barcodes than the pruned Hamming code sets. (FIG. 2D). Sequence-Levenshtein codes attempted to solve the problems inherent in using Levenstein codes for DNA applications, but an error in the derivation of these codes often caused them to decode to the wrong barcode27. In sum, FREE codes offer a substantially larger number of usable barcodes for a given barcode length, when taking into consideration real-world errors such as deletions, insertions, and substitutions that are encountered during DNA sequencing and synthesis.
Error Correction in Real and Simulated Data. FREE barcodes generated herein were validated by both numerical simulation and by experiment. Pooled oligonucleotide synthesis was used to produce a library of >8,000 oligos with double-error correcting barcodes at both ends (FIG. 3A). The barcodes were arranged such that each left barcode should only be observed on the same oligo with one specific right barcode sequence, and similarly for right barcodes. Hence, the rate of incorrectly decoding barcodes could be measured by observing unexpected left-right barcode pairs. 1.4 million copies of this library were sequenced on an Illumina MiSeq for an average coverage of l59x using the standard Illumina workflow.
Full-length, paired-end Illumina sequencing was used to measure the background synthesis and sequencing error rates (FIG. 3B and 3C). Using full-length paired-end reads permitted discrimination between synthesis and sequencing errors. Substitution, insertion, and deletion error rates from library amplification using Q5 polymerase have previously been reported to occur at rates less than 105, and thus are a negligible fraction of the measured synthesis errors28. Measured errors were dominated by single-base synthesis deletions, which occurred at rates of approximately 1 in 200 bp and 1 in 100 bp in the left and right barcode regions, respectively (FIG. 3B and FIG. 13). The two-fold difference in synthesis error rates between the two sides was consistent with statements from the manufacturer regarding their synthesis error rates17. Sequencing error rates were between 104 and 10 3, as advertised by Illumina (FIG. 3C). In sum, experimental error rates were dominated by deletion errors. As Hamming codes are not designed to error-correct deletions in barcodes, they perform very poorly in DNA-based experiments.
Experimentally-determined error rates were compared to simulations of the overall decoding error rate, or in other words, the probability of incorrectly demultiplexing a barcode. Simulations were used to analyze the decode error rate for several error-correcting codes as a function of the per-base error rate, perr (FIG. 4). Simulations were performed in two different ways. First a binomial model, which assumes independent and identically distributed errors at each base, was used to calculate the probability of observing more than 1- or 2-errors given per-base perr. Second, errors were directly simulated using the disclosed decoding software: for a given per-base perr, barcodes were randomly selected and errors with probability perr were added. For simplicity, models were made of insertion, deletion, and substitution error rates of perr / 3 with no correlation between individual errors within a given barcode. The corrupted barcodes were then decoded using the disclosed software, and the fraction of incorrectly decoded barcodes was used as a measure of the decode error rate.
At experimentally-determined per-base error rates, perr, each increase in error correction level resulted in at least an order of magnitude improvement in the decoding error rate (FIG. 4). For example, experimental data showed an overall per-base perr of approximately 10 2 (FIG. 3B and 3C). At this per-base error rate, the approximate uncorrected decode error rate (solid line) was 8% for
length 8 barcodes and 15% for length 16 barcodes. Without error correction, a best-case scenario would be that these errors could be successfully filtered out, representing a significant loss of data. In other scenarios, these data might be erroneously counted. For zero-, single-, and double-error correction length 8 barcodes, the approximate decode error rate decreases from 8% to 0.3% to 0.005%. For length 16 barcodes, the approximate decode error rate decreases from 15% to 1% to 0.05%. A more comprehensive comparison of the various barcode lists is given in FIGs. 9, 10, and 11. The simulated results were consistently better than the binomial approximation because indels near the right end occasionally add the correct base and because insertions occasionally push other errors out of the barcode window (FIG. 8).
FREE barcodes were validated by measuring the decoding error rates for the experimental dataset (FIG. 5). For double-error correction, mismatches in barcode pairs were used to identify erroneously decoded barcodes. After corrections, error rates of 0.29% and 0.46% were observed for left and right barcodes, respectively. The 0- and l-error correction rates shown in FIG. 5 were counted by also counting the number of errors observed in each correctly decoded barcode. That is, 0-error correction decode error rates were calculated as the number of erroneously decoded barcodes plus the number of correctly decoded barcodes with 1 or 2 errors; l-error correction errors were counted similarly. On the other hand, the theoretical model was calculated using the synthesis and sequencing error rates found in FIG. 3 to calculate the decode error probability of each barcode depending on its base composition, and then combined for an overall error rate.
The experimentally-observed decoding error rates followed the same trend as the simulated errors: decode error rates decreased by approximately an order of magnitude with each additional error-correction level. Also, experimental error rates were higher than the theoretical error rate. This can be explained by two observations. First, the theoretical model assumes independent errors at each position along the barcode. This assumption is not observed in the experimental data (FIG. 13).
Second, the starting position of each barcode may not be defined exactly because the primer region could also have errors. One should be careful to identify the start of each barcode as precisely as possible, but any errors in starting position can appear as spurious insertions or deletions during decoding. Nonetheless, even though per-base errors are not independent, the overall order-of- magnitude decrease in decode errors per error-correction level was recapitulated in the experimental dataset.
Combinations of large barcode lists via concatenation. State-of-the-art high-throughput sequencing applications already require >l06 unique barcodes16. We anticipate that improvements in high-density pooled oligo synthesis, along with the continuing reduction in sequencing costs, will continue to push the need for even larger error-correcting barcode sets. Below, we demonstrate that arbitrarily large barcode lists (>l015 unique members shown here) can be constructed from FREE
barcodes by concatenating multiple FREE barcodes in a row.
As a demonstration, two or three barcodes were concatenated from the same starting list of sub-barcodes (FIG. 6). For purposes of FIG. 6, the original barcodes to be concatenated were referred to as sub-barcodes, while the full length, concatenated barcode was referred to as a barcode (or, occasionally referred to herein as a“concatenated barcode”). Due to the possibility of insertions and deletions, the starting positions of the second and third sub-barcodes were only known approximately, and that approximation worsened as more sub-barcodes were added (FIG. 6A). Decoding the sub barcodes sequentially from left-to-right was a strategy to account for this ambiguity. The left-most sub-barcode (referred to as sub-barcode 1) was decoded first, and then the decoded sub-barcode was used to find the starting position of the immediately adjacent sub-barcode 3’ to sub-barcode 1 (referred to as sub-barcode 2). The error-correction level of each FREE sub-barcode remained the same, such that, for example, three concatenated double-error correction sub-barcodes could each correct up to two errors for a maximum total of six corrected errors if, and only if, the errors were evenly distributed, two per sub-barcode. Overall concatenated barcode decoding error rates were given by the probability of any decoding error in any sub-barcode or -barcodes. Concatenated barcode error rates were thus slightly higher than for the individual sub-barcodes (FIG. 6B). The decoding process was performed automatically using the disclosed software.
Concatenating FREE barcodes results in combinations of large barcode sets that are sufficient for even the most demanding high-throughput sequencing applications (FIG. 6). Concatenated barcodes were pruned to remain compatible with experimental constrains by removing DNA sequences that had triplet repeats of a single base or had excess self-complementary (defined as any self-complementarity of three or more bases). Even with these filters, lists of up to 1010 barcodes were generated with concatenation of three single-error correcting codes (FIG. 6). Beyond that, the projected total barcode count was estimated via subsampling, where possible. When this process was limited by available hard drive space, the projected total was estimated via log-linear fit, which went above 1015 barcodes for 3 x (16 bp single-error) barcodes. Such concatenated barcode sets can be generated on demand using the disclosed software and single barcode lists. In sum, concatenating FREE codes produced a rapid and efficient strategy for further increasing the size of error-correcting barcode lists for pooled high-throughput sequencing experiments.
Discussion
Shown herein are the design and experimental validation of Filled/truncated Right End Edit (FREE) error-correcting DNA barcodes capable of correcting substitution, insertion, and deletion errors, even when the corrupted length of the barcode is unknown. Lists of FREE Divergence error correcting barcodes were generated, and software available on GitHub was developed for user-friendly generation and decoding of DNA barcodes for real-world applications.
Most high-throughput DNA sequencing applications require PCR-based amplification or reverse transcription (in the case of RNA) of the input nucleic acid libraries. The polymerase and reverse transcriptase enzymes used during library preparation perform best on libraries that avoid stable secondary structures and self-complimentary regions. To improve the utility of codes for such demanding applications, UNAFold was used to calculate the melting temperature of hairpins for FREE barcodes29. This information facilitates the pruning out of barcode sequences having a propensity to form stable hairpins in specific experimental conditions (FIG. 14). Such experimental considerations can further increase the utility of FREE codes for demanding high-throughput sequencing applications.
In validating FREE barcodes, the types and frequency of errors introduced during massively- parallel oligo synthesis and Illumina-based high-throughput sequencing were measured. Deletions during synthesis were the most frequent sources of error (about 1 per 100 nucleotides), followed by substitutions and insertions (about 1 per 1000 nucleotides). These experimentally measured error frequencies were used to simulate and experimentally measure the decoding quality of FREE codes. Even though the observed decoding error rates did not follow a model that assumed independent errors at each base, exponential improvement of the final decoding error rate was obtained with codes that correct for increasing numbers of errors. Importantly, the error-correcting decode software operated sufficiently fast to handle the massive data sets involved in modern high-throughput sequencing applications, decoding hundreds of thousands of barcodes per second on a single processor for all barcode lists considered.
Although focus was placed exclusively on filled/truncated right end edit (FREE) codes prepended to the start of sequenced DNA reads, the present disclosure applies equally to their natural mirrored counterparts, filled/truncated left end edit (FLEE) codes. FLEE may be required for applications where the barcode appears at the end of each sequenced read (3’ to the read) rather than at the beginning (5’ to the read). In fact, the same codes can be used by simply determining the reverse complement of FREE codes before synthesis and again before decoding. Hence, FREE barcodes can be used equally well at the 5’ or 3’ end of pooled samples.
FREE barcodes are a powerful tool to correct DNA barcode errors, reducing measurement errors in modem, high-throughput experiments. Use of FREE barcodes improves these assays in three key ways: (1) helping avoid spurious results; (2) decreasing the amount of discarded data; and (3) increasing experimental signal-to-noise ratios. Decreasing spurious results and discarded data are important for any experiment involving DNA barcodes. Further, increased signal-to-noise ratios facilitates new and useful possibilities for assays. The power to decrease error rates from 15% to 0.05%, as in FIG. 4B, opens the door for entirely new assay designs. FREE barcodes are broadly useful for the ever-growing set of pooled high-throughput sequencing experiments in cell and molecular biology, protein engineering, and drug discovery.
Methods
Definitions and Numerical Representation ofDNA. For any barcode system, the word length, n, is given. Any DNA sequence of length n is a word, and any word observed in the data is an observed word.
Strings of DNA are represented as base-4 numbers where A, C, G, and T correspond to 0, 1, 2, and 3 respectively. For example,
AAGCT = (00213)&ase 4 = 39; length 5
Here, 39 is the word number and 5 is the word length. The word length is required to uniquely convert numbers to DNA to account for leading A’s. For example, the word number from the example above, 39, with word length 3 is simply GCT. For word length n, the largest valid word number is 4" - 1.
For an /«-error correcting code, a decode sphere is defined around a barcode B to be the set of all words with FreeDiv less than or equal to m, and an encode sphere is defined to be the set of all words of FreeDiv less than or equal to 2m. These are written as DecodeSphere(B) and
EncodeSphere( B).
Barcode Generation. FREE barcode sets are generated with a modified lexicographic code generation method. Lexicographic code generation consists of marching through all words lexicographically, alphabetically in this case, and adding new words to the list of barcodes whenever they are sufficiently far from all previous barcodes30. For Hamming codes, lexicographic codes are linear30, and more generally, lexicographic code generation is shown to have relatively good sphere packing efficiency24. The first FREE modification to the procedure is to enforce the following sequencing and synthesis properties:
Balanced GC content (40-60%)
No homopolymer triples (e.g., TTT)
No triplet self-complementarity
No GGC (Illumina error motif25)
For speed, these potential barcodes are iterated over via recursive base addition: given a barcode prefix P, the next base is added only if it does not violate any of the above. Such an approach thereby skips large recursive subtrees in which all words violate one of the above conditions.
For an «/-error correcting code, a primary requirement is that the decode spheres of all barcodes are disjoint. Because FREE divergence is not a metric, standard metric-based code generation methods cannot be used. Instead, this is accomplished directly with a sphere iterator. For
every accepted barcode B, one can iterate over DecodeSphere(B) and reserve all words therein as mapping to B. And for any potential new barcode P, one first verifies no words in DecodeSphere(P) are reserved before accepting it as a new barcode.
This algorithm would otherwise be very slow because most decode sphere tests would ran into reserved words and fail to add new barcodes. One further observation makes this process tractable. Given a barcode B and a proposed new barcode W, if FreeDiv(B, W) < 2m, that is, if W is in
EncodeSphere(B), then DecodeSphere(W) and DecodeSphere(B) overlap and W is not a valid new barcode (Example 2). This implies the following algorithm: generate the code by lexicographically iterating over words while looking for new barcodes to add to the code. For each accepted new barcode B, a user colors any uncolored words in EncodeSphere(B) black, and then a user colors all words in DecodeSphere(B) red. Restricting encode sphere coloring to previously uncolored words avoids overwriting the decode spheres of all previous barcodes. All black- and red-colored words are guaranteed to not be valid barcodes, so addition of new barcodes is restricted to uncolored words. For an uncolored proposed new barcode W, DecodeSphere(W) is checked for red words. If no red words are found, W is added as a new barcode.
The coloring of barcodes, decode spheres, and encode spheres is accomplished by having an array of 4" integers valued 0, 1, or 2: 0 for uncolored, 1 for black, and 2 for red. The location of each integer in memory itself represents the word, via the numerical representation of DNA given above. This is both memory and speed efficient. Memory efficiency is important, as it is a limiting resource for this method. The memory required for barcode generation is 4k bytes, which for the disclosed experiments was up to l6Gb of random access memory (RAM).
Barcode Decoding. The decoding process builds the code book and looks up decoded words directly. This is performed in a memory efficient fashion as follows. For each barcode in a list, the barcode index is defined as the index of that barcode within the list of barcodes. A user again reserves a space of Ak integers to represent the code space. For each barcode B, a user stores the barcode index of B at every word of DecodeSphere(B). A user stores barcode indices rather than barcode numbers because barcode indices require fewer bits per word. The memory required for barcode decoding is (1, 2, or 4) x 4n bytes, requiring 1, 2, or 4 bytes to store each barcode index. For the disclosed experiments, the maximum memory used for barcode decoding was 32Gb of RAM.
Barcode Pruning. Specific barcode lists from the literature or elsewhere may sometimes be required for a given experiment, but require pruning to find a subset with error-correction. A user can accomplish barcode pruning via the same strategy as barcode generation, but only considering the input set of barcodes as potential new barcodes. This pruning method was also used to prune the linear Hamming codes.
Simulation of Errors. To test the error-correcting capacity of FREE barcodes, error-simulating
code was written which adds a given number of substitutions, insertions, deletions, or all three randomly distributed. This could be used to verify the correctness of each of the FREE /«-error correcting codes by randomly selecting barcodes, adding m errors, and verifying that the decoded word matches the expected word. The same code was used for generating FIG. 4 by randomly choosing the number of errors from a binomial distribution with probability of error
.
Levenshtein Barcodes. Levenshtein barcodes were generated lexicographically using the standard technique of code generation with a metric. Briefly, for desired barcode length n and number of correctable errors e, a user can walk through the space of n-mers lexicographically adding any new word if it: (a) satisfies the same sequencing and synthesis properties as above, and (b) is Levenshtein distance at least 2<?+l from any previously accepted barcode.
Pruned Linear Hamming Barcodes. Hamming barcode lists were generated using linearity in base-4. Briefly, a Hamming code of length n with k < n“message bits” can be generated by all linear combinations of k basis vectors of length n which are chosen to enforce the error-correction properties required. These codes were then filtered according to the FREE sequencing and synthesis property requirements and pruned as described above to form valid FREE codes.
Experimental Synthesis, Sequencing, and Decoding Error Rates. Oligonucleotide pools were designed as in FIG. 3A, with primers and barcodes on each end and a spacer in the middle (116 bp total length). To test the FREE method, 8,634 barcodes of length 17 and double-error correction were used in 8,634 unique pairs. Oligos were synthesized (CustomArray), and the oligo pool was amplified for twenty cycles with Q5 polymerase (NEB) and sequenced on an Illumina MiSeq machine with 2x150 bp paired-end reads. Maximum likelihood sequences were inferred using both reads.
The left and right primer sequences were used to determine both the read orientation and the starting position of each barcode. Each barcode was then decoded using the FREE decoding software. Matching barcodes identified correctly decoded barcodes, while mismatching barcodes indicated an error. The FREE method was powerful enough to reveal a surprising and unrelated source of error: the creation of oligo chimeras, sequences with the left part of one oligo and right part of another, which were then also accounted for (Example 2).
Once each oligo had been identified from its barcodes, the observed sequence was aligned with the reference sequence. At each base where the two reads agreed with each other but not with the reference sequence, a synthesis error was counted. At each base where the reads disagreed and one read matched the reference sequence, a sequencing error was counted. At each base where the reads disagreed and neither matched the reference sequence, a synthesis and a sequencing error were counted.
Observed synthesis and sequencing error rates for each reference base were used to find
theoretical decoding error rates for each barcode given its base composition. These were then used to estimate overall expected error rate (Example 2).
Example 2.
Error Correcting Code and Free Divergence Properties
Minimum error correcting barcode length. An error correcting code which corrects m substitutions, deletions, or insertions is at least 2/«+ 1 bp long. Proof: Suppose the contrary. Let L £ 2m be the length of the barcode. Then, by definition, every barcode is at most L substitutions from any other barcode by substituting all of the bases. For any two barcodes B\ and I _, define B,,,,,i to be the barcode with the first m bases of B\ and the remaining L - m £ m bases of Bi. Then Bm„i G
DecodeSpherem(B\) and /?„„>/ G DecodeSpherem(B2). Since B\ and
were arbitrary, it is thus impossible to have two disjoint decode spheres. Therefore, it is impossible to have a non-trivial (i.e. more than one barcode) /«-error correcting code of length less than 2m + 1 bp.
Calculating FREE divergence. FreeDiv(X, Y ) can be efficiently calculated with a modified Needleman-Wunsch algorithm1, where the last row and column of the matrix have zero penalty for insertion and deletion corresponding to right-end fill or truncation respectively.
Symmetry and minimum paths. FREE divergence is symmetric because any minimum filled/truncated right end edit path (FREE path) is invertible by inverting all the edits and then inverting the fill/truncation step. Substitutions are invertible with substitutions, while insertions and deletions are invertible with each other in the natural way, so edits by themselves are invertible.
Invertibility with the fill/truncation step is less obvious, and requires no edit be truncated off the end. For example, a substitution in the last position followed by any insertion results in the substitution getting truncated off the end. Minimum FREE paths never have any edits truncated off the end, because any truncated edit can be omitted to create a shorter edit path.
Let A and Y be barcodes and let P be any minimum FREE path from X to Y . If P has no fill or truncation, then the fill/truncation step is trivially invertible by doing nothing. Suppose P has a fill step which fills /bases at the end. Then starting at Y and inverting the edits results in exactly those/bases being outside the barcode window, so they are truncated to arrive at X. Suppose P has a truncation step which truncates t bases. Since P is a minimum edit path, none of the truncated bases were edited bases, so they are not needed for the inverted edit path starting at Y . After inverting the edits, t bases need to be filled, which are filled with the last t bases of X. Hence, any minimum FREE path can be inverted in the same number of edits. Furthermore, since all minimum FREE paths from A to T and from T to A are invertible, FreeDiv( X, Y) < FreeDiv(Y, A) and FreeDiviY, A) < FreeDiv( X, Y). Therefore,
FreeDiv(X, Y) = FreeDiv(Y, A) and any inverted minimum FREE path is itself a minimum FREE path.
FREE divergence is not a metric. The counter-example shown in the right column of FIG. 1C was used. For FreeDiv(TAGA, ACGC ), the modified Needleman-Wunsch algorithm described above
produces
so, from the value in the last row and column, FreeDiv(TAGA, ACGC ) = 3. Hence, the following is a minimum filled/truncated right end edit path (FREE path) between TAGA and ACGC:
AGGG
The vertical bars (“I”) show the end of the barcode window, though the truncation step would not happen until after all actual edits. Now, the above FREE path shows that FreeDiv(TAGA, TACG) = FreeDivfTACG, ACGC ) = 1. But FreeDiv(TAGA, ACGC ) = 3, a violation of the triangle inequality.
This was the error made in the paper defining Sequence-Levenshtein codes2. That code generation technique depends on the Sequence-Levenshtein distance function being a metric, which it is not.
Barcode Generation
Sphere iterator. Central to the disclosed generation and decoding algorithms is the ability to deterministically iterate over decode and encode spheres. Recursive iteration is far too slow for practical use due to redundancy. For example, attempting to find DecodeSphereJJT) by finding DecodeSpherefW) of all words W in DecodeSphere \(B) results in iterating over each 2-error word at least twice, by switching the order of added edits. As the number of edits, m, grows, the redundancy grows as m\ due to edit permutations.
To iterate over a sphere centered at barcode B, a method was instead built to iterate over all words at a given FREE divergence d from B and then iterate over d as needed. Additionally, the following identities regarding substitution (sub), insertion (ins), and deletion (del) edits were exploited to optimize iteration: sub-del = del, ins-del = del-ins = sub, ins-sub = sub-ins, and in the last position, ins = del = sub. Use of these identities assumes one is only interested in solid spheres, so it will for example iterate over a sequence at divergence d with an ins-del sequence during the previous d - 1 divergence sphere with a sub.
Use of encode spheres. The algorithm used for efficient code generation relies on the fact that if a word W is in EncodeSphereiB), then DecodeSphereiB) and DecodeSphere(W) overlap. That is,
there exists a word U such that U G DecodeSphereiB) and U E DecodeSpherefW). Let W E
EncodeSphere(B). If W E DecodeSphereiB), then U = W and by symmetry, the process is done.
Suppose W 6E EncodeSphereiB). By the definition of encode spheres, there exists a filled/truncated right end edit path (FREE path) with at most 2 m edits from B to W. Let U be the filled or truncated sequence m edits along this path from B. With this choice, FreeDiviB, U) < m, where the less than or equal sign is in case of any fill/truncation effects. Furthermore, there are at most m more edits along the path to W by choosing the fill or truncation for word U to allow use of the same FREE path to W. Then, by use of symmetry, FreeDiv(W, U) £ m, and the process is done.
Code efficiency. Code efficiency is measured, where possible, in terms of a code rate, defined as the number of usable“message” bits that can be encoded in a single barcode divided by the actual number of bits in the sent barcode. In many standard codes, k message bits have r bits added for error correction, giving a code rate of k/(k + r). For n-mer barcodes, each sent base is two bits of information, so the denominator is 2 n. The numerator is the effective number of message bits: the length of the largest binary number smaller than the number of barcodes, given by b\ogi(Number of barcodes)c. However, for purposes herein, the number of message bits does not need to be an integer, so one can refer to the previous as the actual message bits, while one may be more interested in the “raw” message bits: \ogi(Number of barcodes) without a floor function. These correspond to raw and actual code rates, shown in FIG. 7.
The code rate of FREE codes increases with barcode length, and appears to asymptotically approach a maximal code rate determined by the properties of the decode sphere packing. FIG. 2B shows that after some boundary effects at short barcode lengths, the number of raw message bits (log of the number of barcodes) increases linearly with the length of the barcodes. The slope of this line, up to a factor of 2 for the x-axis due to using base-4 instead of base-2, is an empirical estimate for the asymptotic code rate— message bits over sent bits— for this packing method. Estimated asymptotic values are shown for single- and double-error correcting codes as dashed lines in FIG. 1C.
Generating Linear Hamming Codes. Generating a linear Hamming code for DNA strings of length n encoding raw messages of length k and which corrects up to e errors is equivalent to finding a parity check matrix H = ( -PT I /„_*) over Galois field F4 such that any subset of 2e columns is linearly independent3. Given such a matrix H, all barcodes can be expressed as mG, where m is any raw message vector of length k and G is the generation matrix G = (h IF). Matrices Pi and Pi,
corresponding to single- and double-error correcting linear Hamming codes, were found via lexicographical search through possible columns of H, accepting new columns if they were linearly independent from all previous subsets of 2<? - 1 columns. Such P matrices were found for k up to length 100. The submatrices corresponding to codes up to length 14, as used in FIG. 2, are given by
Experimental Validation
Primer processing. Primers were used both chemically for library amplification and informatically to distinguish left from right sides. However, the possibility of insertions and/or deletions in these primer sites introduced some uncertainty in the starting position of the DNA barcodes. To address this, a custom adaptation of the Smith-Waterman algorithm for overhanging sequences was written. The user specifies an expected primer sequence, a full-length observed read, and a maximum allowable number of errors, which can be chosen to be 2 for both the left (19 bp) and right (18 bp) primers. Using the modified Smith-Waterman algorithm with unity penalties for all error types, the highest scoring prefix of the observed sequence which matched the expected sequence was identified. If two or more possible lengths had the same score, the one closest to the expected length was selected. If the number of edits is less than or equal to 2, this best inferred length then determines the position to be used as the start of the barcode sequence.
Experimental decode errors. Decode errors are detected by whether or not the left and right barcodes, as shown in FIG. 3A, match an intended left/right barcode pair. There are two possible ways to decode incorrectly: either by decoding to a wrong barcode or by decoding to“None” if the observed barcode is not in any decode sphere at all. If a barcode decodes to“None”, then that decode is an error. If a barcode decodes to an incorrect barcode, then the observed output is that the left and right barcodes mismatch but it is unclear which is actually the decode error. To determine which barcode was in error, the edit distance of the entire oligo was measured against the two possible intended sequences, accepting the one with lowest edit distance. To measure the 0- and 1 -Error correction data in FIG. 5, the edit distance of each observed barcode to the intended barcode was measured using the primer processing algorithm described above.
This analysis resulted in the detection of chimera oligos, oligos with the left side of one intended oligo and the right side of another intended oligo. Most of the barcodes which decoded to wrong barcodes matched the wrong barcode with zero errors, which was unexpected. The decode
spheres for l7-mer, 2-error correcting codes contain about 104 barcodes, of which about 102 are 1- error away and exactly 1 is the 0-error wrong barcode (FIG. 7A). Furthermore, the wrong barcode with zero errors is the center word, furthest from the sphere boundary and other barcodes. Thus, seeing a barcode decode to a wrong barcode with zero errors should be vanishingly rare compared with 1 and 2 errors. These together imply that random errors were not observed. Rather, chimera oligos appeared to have been generated. This is likely explained by degeneracy in the spacer region: the spacers are all different, but have stretches of identical sequences around 20 bp long. Incomplete PCR products could then act as primers for this sequence in later rounds of PCR, creating chimera sequences. To correct for these chimeras, it was conservatively assumed that the distribution of the number of errors in chimera barcodes is the same as that for correct barcodes, though it is likely higher. The observed number of wrong barcodes with zero errors was < 104, the approximate size of a decode sphere, so this was accepted as an approximation for the number of chimera oligos having barcodes with zero errors. This number and the distribution of correct barcode errors was then used to approximate the number of the wrong barcodes l-error and 2-errors away from the wrong barcode were chimera oligos. These were then omitted from analysis.
Decode error rate model. The decoding error rate of an /«-error correction code is the probability of seeing more than m errors in a given barcode. For error analysis, each barcode was modeled as a queue of intended bases. At each read position, an intended base is popped off the queue and attempted to be added. One of four things will happen: 1) the correct base will be added, 2) an incorrect base will be added, 3) the base will be deleted, or 4) another base will be inserted and the intended base will go back to the top of the queue. The first three options do not return the base to the queue, resulting in the same structure of expected output 7 observed output. However, insertions cause the intended base to return to the top of the queue, and the output was never expected in the first place. For this reason, it must be modeled differently from the other three. Assuming independent errors of all types and positions, models were made for insertions with a negative binomial distribution and the correct bases, deletions, and substitutions with a multinomial distribution, using measured error rates per reference base, shown in FIG. 3.
Let a barcode be given and let B be the l-by-4 row vector with counts of each of the bases ACGT in the given barcode. Let / be the l-by-4 row vector of insertion counts for each of the four bases. Further, let CDS be the 3-by-4 matrix with columns corresponding to the DNA bases, and rows corresponding to all non-insertion outputs: correct bases, deletions, and substitutions. The rows of CDS are occasionally referred to individually as C, D, and S, but they are left in matrix form as they are tightly connected. In fact, it must be true that C + D + S = B.
The measured error rates given reference base shown in FIG. 3 were used. Insertion and deletion rates, p,(b) and p b), are taken directly from synthesis error rate measurements. Substitution
rates, ps(b), are calculated as the probability of not observing the event {no synthesis substitution and no sequencing substitution} nor the event {synthesis substitution to another base c and correcting synthesis substitution back to b } , and are thus given by
Now let pc(b ) = 1 - pd(b) - ps(b ) be the probability of correctly adding a base, let N be the random variable for the total number of errors, let nerr be given, and let m, , and n, be the number of insertions, deletions, and substitutions respectively. Then, from the assumption of independent error rates given reference base, it is obtained for each reference base the previously mentioned negative binomial distribution for insertions and multinomial distribution for the rest:
Maximum error run lengths. The error models used herein, both the simpler binomial model and that derived above, assume independent errors at each position, understanding that this is an oversimplification. A quick way to see that the errors in the experimental data are not independent and to show why this impacts this work directly is to check the distribution of maximum error ran lengths (the maximum number of consecutive errors in a given oligo). The binomial model where each position is either an error with probability p or correct with probability q = 1 - p was considered. It was desirable to know the probability that in a sequence of length n the maximum run of errors will be r bases long. This is a well-studied problem, and Simpson’s solution as presented by Hald [4] was used. Briefly, let Z„be the probability that the maximum ran of errors in a sequence of length n is at least r bases long and let n be the probability that the first run of r errors ends at the n* position. Then
For n < r, Zn = Zn = 0 trivially, since there are not enough bases, and for n = r, Zn = zn = pr since they must all be errors. For n > r, n— (1 Z//-/-] )qp' by definition of zn, since this is the probability that no ran of length > r in the first n - r - l bases, then there is a correct base followed by r errors. One can then recursively find Zcr+i for increasing c, resulting in the general formula
Finally, for fixed n and p, P(maximum run length = r) = Zn(r)-Zn(r+l), shown in FIG. 13 using the oligo length, n = 116, and measured probability of error, p = 0.005 (FIG. 3). The experimental data were similar to this model for maximum runs of zero and one errors, but deviate significantly for maximum runs of more than one error because our errors are not, in fact, independent. This helps explain the deviation of the experimental barcode decoding error rates from the model predictions in
FIG. 5, since experimental errors clump together, increasing the probability of having more than two, say, in a single barcode.
focused on codes that fix a specified number of errors because such barcodes are platform- independent and hence widely applicable. However, for very popular end-to-end synthesis and sequencing pipelines, one can use the FREE barcode generation and decoding method here described to produce platform- specific barcodes where the user specifies a desired total decode error probability rather than a fixed maximum number of errors to correct. For example, if the rates of substitutions, deletions, and insertions for a given synthesis and sequencing pipeline are known, a user could choose to generate barcodes which have, say, a 10 6 probability of decode error. To accomplish this, the user would use a sphere iterator (see Supplemental Methods) which, instead of iterating over all barcodes with up to some number of errors, iterates over the most likely erroneous barcodes on the chosen
synthesis and sequencing platform until the total probability of the barcodes contained in the sphere is at least 1-10 6. The rest of the generation and decoding process would remain the same. These barcodes would only achieve the expected accuracy on exactly the same DNA synthesis and sequencing pipeline, but they would be more efficient (produce more barcodes per barcode length) for a given desired decode error rate. As such, popular synthesis and sequencing pipelines may warrant their own dedicated barcodes in the future.
Publications cited herein are hereby specifically by reference in their entireties and at least for the material for which they are cited.
It should be understood that while the present disclosure has been provided in detail with respect to certain illustrative and specific aspects thereof, it should not be considered limited to such, as numerous modifications are possible without departing from the broad spirit and scope of the present disclosure as defined in the appended claims. It is, therefore, intended that the appended claims cover all such equivalent variations as fall within the true spirit and scope of the invention.
The present disclosure has been described with reference to example embodiments, however persons skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the disclosed subject matter. For example, although different example embodiments may have been described as including one or more features providing one or more benefits, it is contemplated that the described features may be interchanged with one another or alternatively be combined with one another in the described example embodiments or in other alternative embodiments. Because the technology of the present disclosure is relatively complex, not all changes in the technology are foreseeable. The present disclosure described with reference to the exemplary embodiments is manifestly intended to be as broad as possible. For example, unless specifically otherwise noted, the exemplary embodiments reciting a single particular element also encompass a plurality of such particular elements.
Exemplary embodiments may include program products comprising computer or machine- readable media for carrying or having machine-executable instructions or data structures stored thereon. For example, the sensing electrode may be computer driven. Exemplary embodiments illustrated in the methods of the figures may be controlled by program products comprising computer or machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such computer or machine-readable media can be any available media which can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such computer or machine-readable media can comprise RAM, ROM, EPROM,
EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of computer or machine-readable media. Computer or machine-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions. Software implementations of the present disclosure could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various connection steps, processing steps, comparison steps and decision steps.
It is also important to note that the construction and arrangement of the elements of the system as shown in the preferred and other exemplary embodiments is illustrative only. Although only a certain number of embodiments have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible (e.g., variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters, mounting arrangements, use of materials, colors, orientations, etc.) without materially departing from the novel teachings and advantages of the subject matter recited. For example, elements shown as integrally formed may be constructed of multiple parts or elements shown as multiple parts may be integrally formed, the operation of the assemblies may be reversed or otherwise varied, the length or width of the structures and/or members or connectors or other elements of the system may be varied, the nature or number of adjustment or attachment positions provided between the elements may be varied. It should be noted that the elements and/or assemblies of the system may be constructed from any of a wide variety of materials that provide sufficient strength or durability. Accordingly, all such modifications are intended to be included within the scope of the present disclosure. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the preferred and other exemplary embodiments without departing from the spirit of the present subject matter.
Disclosed are components that can be used to perform the disclosed methods and systems. These and other components are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these components are disclosed that while specific reference of each various individual and collective combinations and permutation of these may not be explicitly disclosed, each is specifically contemplated and described herein, for all methods and systems. This applies to all aspects of this application including, but not limited to, steps in disclosed methods. Thus, if there are a variety of additional steps that can be performed it is understood that each of these
additional steps can be performed with any specific embodiment or combination of embodiments of the disclosed methods.
As will be appreciated by one skilled in the art, the methods and systems may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the methods and systems may take the form of a computer program product on a computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. More particularly, the present methods and systems may take the form of web-implemented computer software. Any suitable computer-readable storage medium may be utilized including hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
Embodiments of the methods and systems are described herein with reference to block diagrams and flowchart illustrations of methods, systems, apparatuses and computer program products. It will be understood that each block of the block diagram and flowchart illustration can be
implemented by computer program instructions. These computer program instructions may be loaded onto a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create a means for implementing the functions specified in the flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including computer-readable instructions for implementing the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
Accordingly, blocks of the block diagram and flowchart illustration support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the block diagram and flowchart illustration, and combinations of blocks in the block diagram and flowchart illustration, can be implemented by special purpose hardware- based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
The figures present an overview of an embodiment of a computer readable medium for use
with the methods disclosed herein. Results can be delivered to a gateway (remote computer via the Internet or satellite) for in graphical user interface format. The described system can be used with an algorithm, such as those disclosed herein.
As may be understood from the figures, in this implementation, the computer may include a processing unit that communicates with other elements. Also included in the computer readable medium may be an output device and an input device for receiving and displaying data. This display device/input device may be, for example, a keyboard or pointing device that is used in combination with a monitor. The computer system may further include at least one storage device, such as a hard disk drive, a floppy disk drive, a CD Rom drive, SD disk, optical disk drive, or the like for storing information on various computer-readable media, such as a hard disk, a removable magnetic disk, or a CD-ROM disk. As will be appreciated by one of ordinary skill in the art, each of these storage devices may be connected to the system bus by an appropriate interface. The storage devices and their associated computer-readable media may provide nonvolatile storage. It is important to note that the computer described above could be replaced by any other type of computer in the art. Such media include, for example, magnetic cassettes, flash memory cards and digital video disks.
Further comprising an embodiment of the system can be a network interface controller. One skilled in the art will appreciate that the systems and methods disclosed herein can be implemented via a gateway that comprises a general-purpose computing device in the form of a computing device or computer.
One or more of several possible types of bus structures can be used as well, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can comprise an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, an
Accelerated Graphics Port (AGP) bus, and a Peripheral Component Interconnects (PCI), a PCI- Express bus, a Personal Computer Memory Card Industry Association (PCMCIA), Universal Serial Bus (USB) and the like. The bus, and all buses specified in this description can also be implemented over a wired or wireless network connection and each of the subsystems, including the processor , a mass storage device, an operating system, network interface controller, Input/Output Interface, and a display device, can be contained within one or more remote computing devices at physically separate locations, connected through buses of this form, in effect implementing a fully distributed system.
The computer typically comprises a variety of computer readable media. Exemplary readable media can be any available media that is accessible by the computer and comprises, for example and not meant to be limiting, both volatile and non-volatile media, removable and non-removable media. The system memory comprises computer readable media in the form of volatile memory, such as
random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM).
In another aspect, the computer can also comprise other removable/non-removable, volatile/non-volatile computer storage media. For example and not meant to be limiting, a mass storage device can be a hard disk, a removable magnetic disk, a removable optical disk, magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like.
Optionally, any number of program modules can be stored on the mass storage device, including by way of example, an operating system and computational software. Each of the operating system and computational software (or some combination thereof) can comprise elements of the programming and the computational software. Data can also be stored on the mass storage device. Data can also be stored in any of one or more databases known in the art. Examples of such databases comprise, DB2™, MICROSOFT™ ACCESS, MICROSOFT™ SQL Server, ORACLE™, mySQL, PostgreSQL, and the like. The databases can be centralized or distributed across multiple systems.
In another aspect, the user can enter commands and information into the computer 102 via an input device. Examples of such input devices comprise, but are not limited to, a keyboard, pointing device (e.g., a“mouse”), a microphone, a joystick, a scanner, tactile input devices such as gloves, and other body coverings, and the like These and other input devices can be connected to the processing unit via a human machine interface that is coupled to the network interface controller, but can be connected by other interface and bus structures, such as a parallel port, game port, an IEEE 1394 Port (also known as a Firewire port), a serial port, or a universal serial bus (USB).
In yet another aspect, a display device can also be connected to the system bus via an interface, such as a display adapter. It is contemplated that the computer can have more than one display adapter and the computer can have more than one display device. For example, a display device can be a monitor, an LCD (Liquid Crystal Display), or a projector. In addition to the display device, other output peripheral devices can comprise components such as speakers and a printer which can be connected to the computer via Input/Output Interface. Any step and/or result of the methods can be output in any form to an output device. Such output can be any form of visual representation, including, but not limited to, textual, graphical, animation, audio, tactile, and the like.
The computer can operate in a networked environment. By way of example, a remote computing device can be a personal computer, portable computer, a server, a router, a network computer, a peer device, sensor node, or other common network node, and so on. Logical connections between the computer and a remote computing device can be made via a local area network (LAN), a general wide area network (WAN), or any other form of a network. Such network connections can be through a network adapter. A network adapter can be implemented in both wired and wireless
environments. Such networking environments are conventional and commonplace in offices, enterprise-wide computer networks, intranets, and other networks such as the Internet.
Any of the disclosed methods can be performed by computer readable instructions embodied on computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example and not meant to be limiting, computer readable media can comprise“computer storage media” and“communications media.”“Computer storage media” comprise volatile and non-volatile, removable and non-removable media implemented in any methods or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Exemplary computer storage media comprises, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
The methods and systems described herein can employ Artificial Intelligence techniques such as machine learning and iterative learning. Examples of such techniques include, but are not limited to, expert systems, case-based reasoning, Bayesian networks, behavior based AI, neural networks, fuzzy systems, evolutionary computation (e.g. genetic algorithms), swarm intelligence (e.g. ant algorithms), and hybrid intelligent systems (e.g. Expert inference rules generated through a neural network or production rules from statistical learning).
The embodiments of the method, system and computer program product described herein are further set forth in the claims below.
References
1. Klein, A. M. et al. Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells. Cell 161, 1187-1201 (2015).
2. Macosko, E. Z. et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202-1214 (2015).
3. Zheng, G. X. Y. et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol. 34, 303-311 (2016).
4. Kitzman, J. O. Haplotypes drop by drop. Nat. Biotechnol. 34, 296-298 (2016).
5. Haque, A., Engel, J., Teichmann, S. A. & Lonnberg, T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 9, 75 (2017).
6. Zilionis, R. et al. Single-cell barcoding and sequencing using droplet microfluidics.
Nat. Protoc. 12, 44-73 (2017).
7. Spies, N. et al. Genome-wide reconstruction of complex structural variants using read clouds. Nat. Methods 14, 915-920 (2017).
8. Eroshenko, N., Kosuri, S., Marblestone, A. H., Conway, N. & Church, G. M. Gene Assembly from Chip-Synthesized Oligonucleotides. Curr. Protoc. Chem. Biol. 2012, (2012).
9. Plesa, C., Sidore, A. M., Lubock, N. B., Zhang, D. & Kosuri, S. Multiplexed gene synthesis in emulsions for exploring protein functional landscapes. Science eaao5l67 (2018).
doi : 10.1126/science. aao5167
10. Fan, R. et al. Integrated barcode chips for rapid, multiplexed analysis of proteins in microliter quantities of blood. Nat. Biotechnol. 26, 1373-1378 (2008).
11. Ma, C. et al. A clinical microchip for evaluation of single immune cells reveals high functional heterogeneity in phenotypically similar T cells. Nat. Med. 17, 738-743 (2011).
12. Zimmermann, G. & Neri, D. DNA-encoded chemical libraries: foundations and applications in lead discovery. Drug Discov. Today 21, 1828-1834 (2016).
13. Melkko, S., Scheuermann, J., Dumelin, C. E. & Neri, D. Encoded self-assembling chemical libraries. Nat. Biotechnol. 22, 568 (2004).
14. Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499-507 (2014).
15. Petrone, J. DNA writers attract investors. Nat. Biotechnol. 34, 363-364 (2016).
16. Litovchick, A. et al. Encoded Library Synthesis Using Chemical Ligation and the Discovery of sEH Inhibitors from a 334-Million Member Library. Sci. Rep. 5, (2015).
17. CustomArray, Inc. - maker of custom microarrays, oligo pools and instrumentation. Available at: http://www.customarrayinc.com/aboutus_main.htm. (Accessed: 8th January 2018)
18. Peterson, W. W. & Weldon, E. J. Error-correcting Codes. (MIT Press, 1972).
19. MacWilliams, F. J. & Sloane, N. J. A. The Theory of Error-correcting Codes. (Elsevier,
1977).
20. Lyons, E., Tremmel, G., Sheridan, P., Miyano, S. & Sugano, S. Large-scale DNA Barcode Library Generation for Biomolecule Identification in High-throughput Screens. Sci. Rep. 7, 13899 (2017).
21. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950-954 (2017).
22. Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals in Soviet physics doklady 10, 707-710 (1966).
23. Costea, P. I., Lundeberg, J. & Akan, P. TagGD: Fast and Accurate Software for DNA Tag Generation and Demultiplexing. PLOS ONE 8, e57521 (2013).
24. Houghten, S. K., Ashlock, D. & Lenarz, J. Constmction of Optimal Edit Metric Codes in 2006 IEEE Information Theory Workshop - ITW '06 Chengdu 259-263 (2006).
doi: 10.1109/ITW2.2006.323799
25. Quail, M. A. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Tllumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
26. Hamming, R. W. Error detecting and error correcting codes. Bell Labs Tech. J. 29, 147-160 (1950).
27. Buschmann, T. & Bystrykh, L. V. Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics 14, 272 (2013).
28. Lee, D. F., Lu, J., Chang, S., Loparo, J. J. & Xie, X. S. Mapping DNA polymerase errors by single-molecule sequencing. Nucleic Acids Res. 44, el l8-el l8 (2016).
29. Markham, N. R. & Zuker, M. UNAFold: software for nucleic acid folding and hybridization. Bioinforma. Struct. Funct. Appl. 3-31 (2008).
30. Zanten, A. J. van. Lexicographic Order and Linearity. Des. Codes Cryptogr. 10, 85-97
(1997).
Claims
1. A computerized method of generating a set of unique genetic barcodes for attaching to members of a pooled population of samples represented by respective genetic sequences, the method comprising: establishing boundaries for the set of unique genetic barcodes by controlling the presence and absence of selected combinations of base values when the selected combinations represent disallowed physical characteristics within the genetic samples; identifying a common length of the barcodes in the set, wherein the length comprises a given number of base positions present in each genetic barcode; identifying base values available to fill each base position, wherein the respective genetic barcode is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode; identifying a center barcode from which a sphere of multiple corrupt barcodes may be generated such that each of the multiple corrupt barcodes refers back to the center barcode also in the sphere, wherein generating the multiple corrupt barcodes comprises a computer executing software that performs the following steps: receiving an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits necessary in any one of the multiple corrupt barcodes to match the center barcode; generating combinations of bases filling the length of respective corrupt barcode sequences, comprising errors in the base values within the acceptable error level, such that the corrupt barcode sequences pack the sphere having the center barcode.
2. The computerized method according to Claim 1, further comprising generating distinct center barcodes for each member of the pooled population of genetic samples.
3. The computerized method according to Claim 2, further comprising using respective distinct center barcodes to generate additional spheres of corrupt barcodes that relate back to a corresponding distinct center barcode.
4. The computerized method according to Claim 3, wherein the sphere and the additional spheres are disjointed with no common members at the identified error level.
5. The computerized method according to Claim 1, wherein generating combinations of bases filling the length of respective corrupt barcode sequences comprises generating combinations having a number of errors in comparison to the center barcode and the number of errors is less than or equal to the acceptable error level.
6. The computerized method according to Claim 1, wherein generating combinations having the number of errors comprises adjusting base values from the center barcode with end fills and end truncations of base values at one end of the length of the genetic barcode, up to the entire length of the genetic barcode.
7. The computerized method according to Claim 1, wherein the length of the genetic barcode is at least two times the error level“m” plus 1 (“2m+l”).
8. The computerized method according to Claim 1, wherein the error formation comprises at least one of a substitution, an insertion, and a deletion of a base value at a position in the genetic sequence and within a respective genetic barcode.
9. The computerized method according to Claim 1, wherein controlling select combinations of base values comprises implementing lexicographical constraints in identifying base values available to fill each base position GC content, disallowing homopolymer triplets, disallowing triplet self complementarity, and disallowing GGC base values (illumina error motif).
10. The computerized method according to Claim 1, further comprising concatenating multiple center barcodes into a new single barcode from which new spheres of corrupt barcodes are generated from each of the multiple center barcodes.
11. A computerized method according to Claim 1, further comprising minimizing internal hairpin propensity of respective genetic barcodes.
12. A computerized method according to Claim 1, wherein the set of unique genetic barcodes forms a barcode space in computerized memory, wherein the barcode space defines respective barcode spheres in the memory, wherein each center barcode and each corresponding corrupt barcode are stored in arbitrary order with no mathematical symmetry in the barcode sphere.
13. A computerized method according to Claim 1, wherein the barcode space is stored as a look up table implemented in computerized memory, wherein the corrupt barcodes are matched to a respective one of the center barcodes.
14. A computerized method of decoding genetic barcodes subject to error formation and identified as being attached to members of a pooled population of genetic samples represented by respective genetic sequences, the method comprising: identifying an original length of a generated barcode from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions present in the generated barcode; identifying a barcode starting value position along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base of a respective genetic barcode; identifying a center barcode within a sphere of multiple corrupt barcodes such that each of the multiple corrupt barcodes refers back to the center barcode also in the sphere, wherein identifying the center barcode comprises a computer executing software that performs the following steps: receiving an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits necessary in any one of the corrupt barcodes to match the center barcode; from the barcode starting value position within the genetic sequence, evaluating bases up to the length of the generated barcode and storing the evaluated bases as a corrupt barcode in a computerized memory buffer; identifying a respective decode sphere in which the corrupt barcode exists; decoding the corrupt barcode to the center barcode of the respective decode sphere.
15. A computerized method according to Claim 14, further comprising identifying the respective decode sphere in which the corrupt barcode exists by evaluating the base position values after the starting point and up to the original length of the generated barcode.
16. A computerized method according to Claim 15, wherein the base position values after the starting point and up to the original length of the generated barcode include filled positions at an end of the length, wherein the filled positions do not match the corresponding positions of the generated barcode.
17. A computerized method according to Claim 15, wherein the base position values after the starting point and up to the original length of the generated barcode do not include base positions from the originally generated barcode that have been truncated at an end of the length.
18. A computerized method according to Claim 14, wherein decoding to the center barcode comprises accessing a look-up table stored in a computerized memory.
19. A computerized method according to Claim 14, further comprising, for genetic sequences having multiple barcodes therein: after decoding a first corrupt barcode in the genetic sequence, identifying a true length of the first corrupt barcode; identifying a next start position in the genetic sequence for a sequential barcode to be decoded.
20. A computerized method according to Claim 14, further comprising configuring a graphical user interface to receive at least one of an acceptable error level, an acceptable probability of a number of errors in a corrupted barcode, and a genetic barcode length.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201862660531P | 2018-04-20 | 2018-04-20 | |
| US62/660,531 | 2018-04-20 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019204702A1 true WO2019204702A1 (en) | 2019-10-24 |
Family
ID=68239272
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2019/028279 Ceased WO2019204702A1 (en) | 2018-04-20 | 2019-04-19 | Error-correcting dna barcodes |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019204702A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021053208A1 (en) * | 2019-09-20 | 2021-03-25 | Sophia Genetics S.A. | Methods for dna library generation to facilitate the detection and reporting of low frequency variants |
| US20220084628A1 (en) * | 2020-09-16 | 2022-03-17 | 10X Genomics, Inc. | Methods and systems for barcode error correction |
| CN114774516A (en) * | 2022-03-28 | 2022-07-22 | 深圳裕康医学检验实验室 | UMI sequence design method for correcting sequencing errors and application thereof |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160289740A1 (en) * | 2015-03-30 | 2016-10-06 | Cellular Research, Inc. | Methods and compositions for combinatorial barcoding |
-
2019
- 2019-04-19 WO PCT/US2019/028279 patent/WO2019204702A1/en not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160289740A1 (en) * | 2015-03-30 | 2016-10-06 | Cellular Research, Inc. | Methods and compositions for combinatorial barcoding |
Non-Patent Citations (2)
| Title |
|---|
| BUSCHMANN ET AL.: "Levenshtein error-correcting barcodes for multiplexed DNA sequencing", BMC BIOINFORMATICS, vol. 14, no. 272, 11 September 2013 (2013-09-11), pages 1 - 10, XP055194297 * |
| COSTEA ET AL.: "TagGD: fast and accurate software for DNA Tag generation and demultiplexing", PLOS ONE, vol. 8, no. 3, 4 March 2013 (2013-03-04), pages 1 - 5, XP055289247, DOI: 10.1371/journal.pone.0057521 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021053208A1 (en) * | 2019-09-20 | 2021-03-25 | Sophia Genetics S.A. | Methods for dna library generation to facilitate the detection and reporting of low frequency variants |
| EP4031664A1 (en) * | 2019-09-20 | 2022-07-27 | Sophia Genetics S.A. | Methods for dna library generation to facilitate the detection and reporting of low frequency variants |
| EP4570922A3 (en) * | 2019-09-20 | 2025-09-17 | Sophia Genetics S.A. | Methods for dna library generation to facilitate the detection and reporting of low frequency variants |
| US20220084628A1 (en) * | 2020-09-16 | 2022-03-17 | 10X Genomics, Inc. | Methods and systems for barcode error correction |
| WO2022060889A3 (en) * | 2020-09-16 | 2022-04-28 | 10X Genomics, Inc. | Methods and systems for barcode error correction |
| CN114774516A (en) * | 2022-03-28 | 2022-07-22 | 深圳裕康医学检验实验室 | UMI sequence design method for correcting sequencing errors and application thereof |
| CN114774516B (en) * | 2022-03-28 | 2024-04-12 | 深圳裕康医学检验实验室 | UMI sequence design method for correcting sequencing errors and application thereof |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20250137049A1 (en) | Methods, systems, computer readable media, and kits for sample identification | |
| Anavy et al. | Data storage in DNA with fewer synthesis cycles using composite DNA letters | |
| Liu et al. | Hi-TOM: a platform for high-throughput tracking of mutations induced by CRISPR/Cas systems | |
| Boisvert et al. | Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies | |
| Organick et al. | Random access in large-scale DNA data storage | |
| Marsan et al. | Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification | |
| US12260937B2 (en) | Reverse concatenation of error-correcting codes in DNA data storage | |
| US20180211001A1 (en) | Trace reconstruction from noisy polynucleotide sequencer reads | |
| Organick et al. | Scaling up DNA data storage and random access retrieval | |
| Bhardwaj et al. | Trace reconstruction problems in computational biology | |
| Kumar et al. | Long-read amplicon denoising | |
| US20120185177A1 (en) | Harnessing high throughput sequencing for multiplexed specimen analysis | |
| CN107969138A (en) | Barcode sequences and related systems and methods | |
| WO2019204702A1 (en) | Error-correcting dna barcodes | |
| Milenkovic et al. | DNA-based data storage systems: A review of implementations and code constructions | |
| US20210202032A1 (en) | Method of tagging nucleic acid sequences, composition and use thereof | |
| Westesson et al. | Accurate detection of recombinant breakpoints in whole-genome alignments | |
| Erlich et al. | Capacity-approaching DNA storage | |
| Anavy et al. | Improved DNA based storage capacity and fidelity using composite DNA letters | |
| Sharma et al. | Efficiently enabling block semantics and data updates in dna storage | |
| EP4052260B1 (en) | Trace reconstruction of polymer sequences using quality scores | |
| Hawkins et al. | Error-correcting DNA barcodes for high-throughput sequencing | |
| Simpson | Efficient sequence assembly and variant calling using compressed data structures | |
| Schwarz | Methods for Constraint Satisfaction, Error Handling, and Data Recovery in DNA Data Storage | |
| Ebler | Design and application of methods for genome inference |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 19788060 Country of ref document: EP Kind code of ref document: A1 |