WO2019046768A1 - Symbolic sequencing of DNA and RNA via sequence encoding
- Publication number
- WO2019046768A1 (PCT/US2018/049173)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- probe
- translation
- region
- auxiliary
- codeword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6813—Hybridisation assays
- C12Q1/6827—Hybridisation assays for detection of mutation or polymorphism
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6844—Nucleic acid amplification reactions
- C12Q1/6862—Ligase chain reaction [LCR]
Definitions
- This disclosure relates generally to the field of molecular biology. More particularly, it concerns methods, and compositions for use therein, of encoding nucleic acid sequence information into rationally-designed short sequences for sequencing by next-generation sequencing.
- NGS: next-generation sequencing
- This disclosure describes new compositions and methods for profiling DNA and RNA using Next Generation Sequencing (NGS).
- target DNA/RNA sequences are stoichiometrically "translated" into designed codewords that compactly and error-resiliently encode the targets' sequence information.
- Sequence encoding can suppress NGS errors, as well as reduce both NGS procedure and interpretation time. Sequence encoding can significantly impact both molecular diagnostics for precision medicine, as well as academic and clinical research on the human genome/transcriptome.
- compositions of nucleic acid molecules comprising: (a) at least three auxiliary probes, wherein each auxiliary probe comprises a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region of each auxiliary probe has a unique sequence, wherein the first auxiliary probe universal regions of each auxiliary probe have the same sequence; (b) at least three translation probes, wherein each translation probe comprises a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region of each translation probe has a unique sequence, wherein the first translation probe codeword region of each translation probe has a unique sequence; and (c) at least three translation probe protection oligonucleotides, wherein each translation probe protection oligonucleotide comprises a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to
- the translation probes are modular probes.
- the first nucleic acid molecules of the translation probes further comprise a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region.
- the translation probes further comprise a second nucleic acid molecule, wherein each of the second nucleic acid molecules comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule.
- the translation probes further comprise a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule.
- the third nucleic acid molecules of the translation probes further comprise a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
- the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
- the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
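The two length rules above can be checked mechanically. The following is a minimal sketch, assuming sequences are written 5'->3' as plain A/C/G/T strings and that both rules are applied together (the source states them as separate aspects). The names `reverse_complement`, `longest_complementary_run`, and `protector_ok` are hypothetical and do not come from the disclosure.

```python
# Hypothetical helper names; default thresholds are the values stated in the text.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_complementary_run(probe_region: str, protector_region: str) -> int:
    """Longest contiguous stretch of the protector that base-pairs with the probe.

    A stretch of the protector pairs with the probe exactly when its reverse
    complement occurs in the probe, so this is a longest-common-substring
    search between the probe and the protector's reverse complement.
    """
    probe = probe_region.upper()
    mirrored = reverse_complement(protector_region)
    best = 0
    prev = [0] * (len(probe) + 1)
    for i in range(1, len(mirrored) + 1):
        cur = [0] * (len(probe) + 1)
        for j in range(1, len(probe) + 1):
            if mirrored[i - 1] == probe[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def protector_ok(probe_region: str, protector_region: str,
                 min_overhang: int = 5, min_run: int = 12,
                 min_probe_len: int = 17) -> bool:
    """Check the length and complementarity rules for one probe/protector pair."""
    return (
        len(probe_region) >= min_probe_len
        and len(probe_region) - len(protector_region) >= min_overhang
        and longest_complementary_run(probe_region, protector_region) >= min_run
    )


probe = "ACGTTGCAAGGCTTACG"                 # 17 nt, made-up sequence
protector = reverse_complement(probe[:12])  # 12 nt, pairs with the first 12 nt
print(protector_ok(probe, protector))
```

The same check would apply unchanged to the auxiliary probe protection oligonucleotides, since the stated thresholds are identical.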
- each first translation probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
- each first translation probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
- each of the translation probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region in the composition.
- each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition.
- each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 12 nucleotide positions.
- each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
- each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region in the composition.
- each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition.
- each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 11 nucleotide positions.
- each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
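The identity limits and the Hamming-distance limits above are two statements of the same constraint: for codewords of length n, the number of shared positions equals n minus the Hamming distance, so a distance of at least 2 at 14 nt means at most 12 shared positions. Below is a minimal sketch of a validator for a codeword set. It assumes all codewords in a set have the same length (needed for Hamming distance to be defined); the function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(codewords) -> int:
    """Smallest Hamming distance over all pairs in the set (needs >= 2 codewords)."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def codewords_valid(codewords, min_distance: int = 2) -> bool:
    """Check the length rule (7 to 98 nt, multiple of 7) and the distance rule."""
    lengths = {len(c) for c in codewords}
    if len(lengths) != 1:
        return False  # assumption: one common length per codeword set
    (length,) = lengths
    if not (7 <= length <= 98 and length % 7 == 0):
        return False
    if len(codewords) < 2:
        return True  # nothing to compare
    return min_pairwise_distance(codewords) >= min_distance


# Made-up 14-nt codewords used only to exercise the checks.
example = ["A" * 14, "A" * 12 + "CC", "A" * 11 + "CGT"]
print(codewords_valid(example, min_distance=2))
print(codewords_valid(example, min_distance=3))
```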
- each of the translation probes further comprises a first translation probe universal region, wherein the first translation probe universal regions of each translation probe have the same sequence.
- the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
- each of the translation probes comprises a 5' phosphate. In other aspects, each of the translation probes lacks a 5' phosphate.
- each of the translation probes is between 30 and 200 nucleotides long. For example, each of the translation probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
- each of the auxiliary probes further comprises a first auxiliary probe codeword region, wherein each auxiliary probe in the composition has a unique first auxiliary probe codeword region sequence.
- the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
- each first auxiliary probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first auxiliary probe codeword region is a multiple of 7.
- each first auxiliary probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
- each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 12 nucleotide positions.
- each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
- each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 11 nucleotide positions.
- each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
- each of the auxiliary probes comprises a 5' phosphate.
- each of the auxiliary probes may comprise a 5' phosphate when each of the translation probes lacks a 5' phosphate.
- each of the auxiliary probes lacks a 5' phosphate.
- each of the auxiliary probes may lack a 5' phosphate when each of the translation probes comprises a 5' phosphate.
- compositions further comprise at least three auxiliary probe protection oligonucleotides, wherein each auxiliary probe protection oligonucleotide comprises a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region of one of the auxiliary probes.
- the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
- the first auxiliary probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
- each of the auxiliary probes is between 30 and 200 nucleotides long.
- each of the auxiliary probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
- compositions further comprise at least one target nucleic acid molecule comprising a first target region and a second target region, wherein the first target region and the second target region are directly adjacent within the target nucleic acid molecule, wherein the first target region is complementary to the first translation probe hybridization region of one of the translation probes in the composition, wherein the second target region is complementary to the first auxiliary probe hybridization region of one of the auxiliary probes in the composition.
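The adjacency requirement above can be illustrated by deriving both hybridization regions from a single junction in the target. This is a sketch under stated assumptions: sequences are DNA written 5'->3', the first target region is taken to lie immediately 5' of the second (the text also allows the opposite order), and `design_hybridization_regions` and its parameters are hypothetical names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_hybridization_regions(target: str, junction: int,
                                 first_len: int, second_len: int) -> dict:
    """Split a target at `junction` into two directly adjacent regions and
    return the probe hybridization regions complementary to each.

    Assumption: the first target region ends at `junction` and the second
    target region starts there, so no target nucleotide lies between them.
    """
    if junction - first_len < 0 or junction + second_len > len(target):
        raise ValueError("regions extend beyond the target sequence")
    first_region = target[junction - first_len:junction]
    second_region = target[junction:junction + second_len]
    return {
        "first_target_region": first_region,
        "second_target_region": second_region,
        # Per the composition described above: the translation probe pairs
        # with the first target region, the auxiliary probe with the second.
        "translation_probe_hybridization_region": reverse_complement(first_region),
        "auxiliary_probe_hybridization_region": reverse_complement(second_region),
    }


# Made-up target sequence, used only for illustration.
target = "GGATCCACGTTGCAAGGCTTACGTTAGCCGATAGGCTAACGGAATTC"
design = design_hybridization_regions(target, junction=23, first_len=17, second_len=17)
print(design["translation_probe_hybridization_region"])
print(design["auxiliary_probe_hybridization_region"])
```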
- methods are provided herein for determining the presence of a target nucleic acid molecule in a sample, the target nucleic acid molecule comprising a known target sequence having a first target region and a second target region that is directly adjacent to the first target region, the method comprising: (a) contacting the sample with at least a first auxiliary probe and at least a first translation probe, wherein the auxiliary probe comprises a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region is complementary to the first target region, and wherein the first translation probe comprises a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region is complementary to the second target region; (b) incubating the product of step (a) under conditions to allow the first auxiliary probe hybridization region to anneal to the first target region and the first translation probe hybridization region to anneal to the
- the first translation probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
- each first translation probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
- the first translation probe further comprises a first translation probe universal region. In certain aspects, the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
- step (a) further comprises contacting the sample with at least a first translation probe protection oligonucleotide, wherein the translation probe protection oligonucleotide comprises a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
- the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
- the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
- the first translation probe is a modular probe.
- the first nucleic acid molecule of the translation probe further comprises a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region.
- the translation probe further comprises a second nucleic acid molecule, wherein the second nucleic acid molecule comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule.
- the translation probe further comprises a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule.
- the third nucleic acid molecule of the translation probe further comprises a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
- the first translation probe is between 30 and 200 nucleotides long.
- the first translation probe may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
- step (a) further comprises contacting the sample with at least a second translation probe, wherein the second translation probe comprises a second translation probe hybridization region and a second translation probe codeword region, wherein the translation probe hybridization region on each of the first and second translation probes has a unique sequence, wherein the translation probe codeword region on each of the first and second translation probes has a unique sequence.
- each of the translation probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region in the composition.
- each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition.
- each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
- each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region in the composition.
- each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition.
- each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
- the first auxiliary probe further comprises a first auxiliary probe codeword region.
- the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
- step (a) further comprises contacting the sample with at least a second auxiliary probe, wherein the second auxiliary probe comprises a second auxiliary probe hybridization region and a second auxiliary probe codeword region, wherein the auxiliary probe codeword region on each of the first and second auxiliary probes has a unique sequence.
- each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
- each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
- the first auxiliary probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first auxiliary probe codeword region is a multiple of 7.
- each first auxiliary probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
- step (a) further comprises contacting the sample with at least a first auxiliary probe protection oligonucleotide, wherein the auxiliary probe protection oligonucleotide comprises a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
- the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
- the first auxiliary probe protection oligonucleotide hybridization region may comprise at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
- the first auxiliary probe is between 30 and 200 nucleotides long.
- the first auxiliary probe may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
- step (a) comprises contacting the sample with a composition of the present embodiments.
- step (c) is performed by incubating the product of step (b) with a ligase.
- the first target region is positioned upstream of the second target region, wherein the first auxiliary probe comprises a 5' phosphate.
- the first translation probe lacks a 5' phosphate.
- the second target region is positioned upstream of the first target region, wherein the first translation probe comprises a 5' phosphate.
- the first auxiliary probe lacks a 5' phosphate.
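For enzymatic ligation, the four statements above pair up into a simple rule: which probe carries the 5' phosphate depends on which target region lies upstream. A minimal sketch of that rule follows, using the method's convention that the auxiliary probe binds the first target region and the translation probe binds the second. The function name and the dictionary keys are hypothetical, and combining the "comprises" and "lacks" aspects into one rule is an assumption.

```python
def five_prime_phosphate_plan(first_target_region_upstream: bool) -> dict:
    """Return which probe carries a 5' phosphate for ligase-mediated joining.

    True means the probe comprises a 5' phosphate, False means it lacks one.
    """
    if first_target_region_upstream:
        # First target region upstream: the auxiliary probe is phosphorylated.
        return {"auxiliary_probe": True, "translation_probe": False}
    # Second target region upstream: the translation probe is phosphorylated.
    return {"auxiliary_probe": False, "translation_probe": True}


print(five_prime_phosphate_plan(first_target_region_upstream=True))
```

In either orientation exactly one of the two probes is phosphorylated.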
- step (c) is performed chemically.
- the first target region is positioned upstream of the second target region, wherein the first auxiliary probe comprises a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
- the first translation probe lacks a 5' functionalization.
- the second target region is positioned upstream of the first target region, wherein the first translation probe comprises a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
- the first auxiliary probe lacks a 5' functionalization.
- detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region. In some aspects, detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe codeword region and the first auxiliary probe universal region that is present in the sample.
- quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
- detecting and/or quantitating the amount of the ligation product comprises performing DNA sequencing.
- the DNA sequencing comprises Sanger sequencing, sequencing-by-synthesis, or nanopore sequencing.
- detecting and/or quantitating the amount of the ligation product comprises performing Hamming error correction to the sequences obtained for the translation probe and/or auxiliary probe codeword regions.
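One way to picture error correction on sequenced codewords is nearest-codeword decoding. The sketch below is a generic minimum-distance decoder, not necessarily the specific Hamming code construction used in the disclosure, which this section does not spell out. It assumes substitution errors only (no insertions or deletions); `decode_codeword`, the codebook contents, and the target labels are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def decode_codeword(read: str, codebook: dict, max_errors: int = 1):
    """Return the label of the unique codeword within `max_errors`
    substitutions of `read`, or None if no codeword or several qualify."""
    hits = [
        label
        for codeword, label in codebook.items()
        if len(codeword) == len(read)
        and hamming_distance(codeword, read) <= max_errors
    ]
    return hits[0] if len(hits) == 1 else None


# Made-up 14-nt codebook whose pairwise Hamming distances are all >= 3,
# so any single substitution maps back to exactly one codeword.
codebook = {
    "A" * 14: "target_1",
    "A" * 11 + "CCC": "target_2",
    "CCC" + "A" * 11: "target_3",
}

print(decode_codeword("A" * 13 + "C", codebook))  # one substitution away from target_1
print(decode_codeword("G" * 14, codebook))        # far from every codeword
```

With a minimum distance of three, a single substitution is uniquely correctable; with a minimum distance of two, a single substitution can be detected but may be equidistant from two codewords, in which case this decoder returns None.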
- determining the presence of a target nucleic acid molecule does not comprise a step of bead capture. In some other aspects, determining the presence of a target nucleic acid molecule further comprises a step of bead capture.
- the target nucleic acid molecule comprises DNA. In some aspects, the target nucleic acid molecule comprises RNA.
- if the ligation product is not detected, then the target sequence is determined to not be present in the sample. In some aspects, if the ligation product is detected, then the target sequence is determined to be present in the sample.
- methods are provided herein for determining the presence of a plurality of target nucleic acid molecules in a sample, each target nucleic acid molecule comprising a known target sequence having a first target region and a second target region that is directly adjacent to the first target region, the method comprising: (a) contacting the sample with at least two auxiliary probes and at least two translation probes, wherein the auxiliary probes each comprise a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region of each auxiliary probe has a unique sequence, wherein the first auxiliary probe universal regions of each auxiliary probe have the same sequence, wherein the first auxiliary probe hybridization region is complementary to the first target region of one of the plurality of target nucleic acid molecules, and wherein the translation probes each comprise a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region
- the translation probes are modular probes.
- the first nucleic acid molecules of the translation probes further comprise a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region.
- the translation probes further comprise a second nucleic acid molecule, wherein each of the second nucleic acid molecules comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule.
- the translation probes further comprise a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule.
- the third nucleic acid molecules of the translation probes further comprise a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
- the translation probes are between 30 and 200 nucleotides long.
- the translation probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
- each of the translation probe codeword regions is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
- each first translation probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
- each of the translation probe codeword regions lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region.
- each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition.
- each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions share sequence identity at more than 12 nucleotide positions.
- each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions share sequence identity at more than 19 nucleotide positions.
- each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region.
- each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition.
- each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
- each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 11 nucleotide positions.
- each of the translation probes further comprises a first translation probe universal region, wherein the first translation probe universal regions of each translation probe have the same sequence.
- the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
- step (a) further comprises contacting the sample with at least two first translation probe protection oligonucleotides, wherein the translation probe protection oligonucleotides comprise a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region of one of the translation probes.
- the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
- the first translation probe protection oligonucleotide hybridization region may comprise at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
- each of the auxiliary probes further comprises a first auxiliary probe codeword region.
- the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
- each of the auxiliary probe codeword regions lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region.
- each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 12 nucleotide positions.
- each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 19 nucleotide positions.
- each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region.
- each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition.
- each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 11 nucleotide positions.
- each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 18 nucleotide positions.
- step (a) further comprises contacting the sample with at least two first auxiliary probe protection oligonucleotides, wherein the auxiliary probe protection oligonucleotides comprise a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region of one of the auxiliary probes.
- the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
- the first auxiliary probe protection oligonucleotide hybridization region may comprise at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
- the auxiliary probes are between 30 and 200 nucleotides long.
- the auxiliary probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
- step (a) comprises contacting the sample with a composition of the present embodiments.
- step (c) is performed by incubating the product of step (b) with a ligase.
- the first target regions are positioned upstream of the second target regions, wherein the first auxiliary probes comprise a 5' phosphate.
- the first translation probes may lack a 5' phosphate.
- the second target regions are positioned upstream of the first target regions, wherein the first translation probes comprise a 5' phosphate.
- the first auxiliary probes may lack a 5' phosphate.
- some targets within a single sample have their first target region positioned upstream of their second target region while other targets within the same single sample have their second target regions positioned upstream of their first target regions.
- step (c) is performed chemically.
- the first target regions are positioned upstream of the second target regions, wherein the first auxiliary probes comprise a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
- the first translation probes may lack a 5' functionalization.
- the second target regions are positioned upstream of the first target regions, wherein the first translation probes comprise a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
- the first auxiliary probes may lack a 5' functionalization.
- some targets within a single sample have their first target region positioned upstream of their second target region while other targets within the same single sample have their second target regions positioned upstream of their first target regions.
- detecting the ligation products in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region. In some aspects, detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe codeword region and the first auxiliary probe universal region that is present in the sample.
- quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
- detecting the ligation products in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe universal region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe universal region. In some aspects, detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe universal region and the first auxiliary probe universal region that is present in the sample.
- quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe universal region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe universal region.
- detecting and/or quantitating the amount of the ligation product comprises performing DNA sequencing.
- the DNA sequencing comprises Sanger sequencing, sequencing-by-synthesis, or nanopore sequencing.
- detecting and/or quantitating the amount of the ligation product comprises performing Hamming error correction to the sequences obtained for the translation probe and/or auxiliary probe codeword regions.
- determining the presence of a target nucleic acid molecule does not comprise a step of bead capture. In some other aspects, determining the presence of a target nucleic acid molecule further comprises a step of bead capture. In some aspects, the target nucleic acid molecules comprise DNA. In some aspects, the target nucleic acid molecules comprise RNA.
- if the ligation products are not detected, then the target sequences are determined to not be present in the sample. In some aspects, if the ligation products are detected, then the target sequences are determined to be present in the sample.
- essentially free in terms of a specified component, is used herein to mean that none of the specified component has been purposefully formulated into a composition and/or is present only as a contaminant or in trace amounts.
- the total amount of the specified component resulting from any unintended contamination of a composition is therefore well below 0.05%, preferably below 0.01%.
- Most preferred is a composition in which no amount of the specified component can be detected with standard analytical methods.
- FIGS. 1A-B Sequence encoding.
- FIG. 1A An overview of the general concept of sequence encoding.
- a target biological DNA or RNA sequence (SEQ ID NO: 87) is stoichiometrically converted into a predesigned Codeword sequence (SEQ ID NO: 88).
- FIG. 1B A diagram illustrating how sequence encoding is implemented using a translator probe (left) and an auxiliary probe (right).
- the translator probe comprises regions 5, 6, and 7, and the auxiliary probe comprises regions 3 and 4.
- the target comprises regions 1 and 2, which are complementary to regions 5 and 4, respectively.
- Region 6 is a Codeword representing the sequence of region 1 of the target.
- Region 3 is a universal sequence conserved across all auxiliary probes.
- FIGS. 3A-B Sequence encoding using a single-stranded translator probe.
- FIG. 3A The matched translator probe (probe 5b/6b; SEQ ID NO: 2) was designed against a 20 nt sequence around the SNP rs1509186.
- a mismatched translator probe (probe 5a/6a; SEQ ID NO: 3) was designed against a 20 nt sequence bearing consecutive 5 nt mismatches from the target.
- the auxiliary probe (probe 3/4; SEQ ID NO: 1) was designed against a 33 nt adjacent sequence.
- FIG. 3B Experimental qPCR results using single-stranded translator probes.
- the primer hybridizing to the matched translator probe's Codeword is the reverse primer 1 (rp1; SEQ ID NO: 85).
- the primer hybridizing to the mismatch translator probe's Codeword is the reverse primer 2 (rp2; SEQ ID NO: 86).
- the primer hybridizing to the sequencing adapter on the auxiliary probe is the forward primer 1 (fp1; SEQ ID NO: 15).
- FIGS. 4A-B Sequence encoding using a double-stranded translator probe.
- FIG. 4A The matched translator probe (SEQ ID NOS: 4 and 5) was designed against a 35 nt sequence around the SNP rs3217424, where the SNP at rs3217424 is C.
- the mismatched translator probe (SEQ ID NOS: 10 and 11) was designed against a 35 nt sequence bearing a single-nucleotide mismatch, where the SNP at rs3217424 is G.
- the auxiliary probe (SEQ ID NO: 8) was designed against a 15 nt sequence downstream of the translator probe-targeted region.
- the primer hybridizing to the matched translator probe's Codeword is the reverse primer 1 (rp1; SEQ ID NO: 9).
- the primer hybridizing to the mismatch translator probe's Codeword is the reverse primer 2 (rp2; SEQ ID NO: 12).
- the primer hybridizing to the sequencing adapter on the auxiliary probe is the forward primer 1 (fp1; SEQ ID NO: 15).
- FIG. 5 Sanger sequencing results of qPCR amplicons from FIG. 4B. Sanger sequencing was performed after 40 cycles. The underlined sequence matches the Codeword for the correct translator probe. The sequence shown corresponds to SEQ ID NO: 89.
- FIGS. 6A-F Hamming encoding of DNA Codewords.
- FIG. 6A Every 4-nt DNA word is appended with three additional error correction nucleotides.
- FIG. 6B The error correction nucleotides x, y, and z are designed to satisfy the three error correction equations displayed. Note that modular arithmetic is used: 2, 6, 10, and 14 are all equal to 2 in mod 4.
- FIG. 6C Error correction via a (7,4) Hamming encoding. In the left panel, the second nucleotide b is mutated T>C, resulting in two of the three error correction equations being violated.
- FIG. 6D Out of the 256 (7,4) Hamming codes, 216 are amenable for serving as Codewords due to properties of DNA synthesis and sequencing.
- FIG. 6E Hamming encoding greatly decreases the error rate of sequence encoding. The top line represents "Uncoded (12nt)"; the middle line represents “Correcting (21nt)”; the bottom line represents “Detecting (21nt).”
- FIG. 7 Sequence encoding embodiment with sequence adaptors to facilitate downstream NGS analysis.
- the translator probe comprises a new region 7. Regions 3 and 7 serve as sequencing adaptors for NGS.
- FIG. 8 Proposed workflow for sequence translation, library preparation, and NGS. The entire process is expected to take less than eight hours, including sequencing and bioinformatic analysis.
- FIGS. 9A-B Experimental NGS results on 22-plex translation.
- FIG. 9A Eleven translators were designed to subsequences of different human genes, and 11 translators were designed to mouse homologs of the human genes.
- experimental NGS used 150 sequencing cycles. The gray dots show the number of reads aligned to each human Codeword, and the black dots show the number of reads mapped to mouse Codewords. Further analysis of the mouse Codeword sequences revealed that roughly 90% of these did not contain any auxiliary probe sequence, and likely correspond to non-specific dimers with sequencing adaptors. The overall ratio of reads between human Codewords and mouse Codewords with auxiliary sequence is over 500.
- Each translator probe's Codeword was 21 nt long, as described in FIG. 6.
- the Codeword error correction system was able to recover roughly 4% of the library.
- the 11% of the library that could not be aligned to any Codeword likely represents adaptor dimers not perfectly removed by size selection.
- FIGS. 10A-B Sequence encoding using M-Probes as translator probes.
- FIG. 10A The matched translator probe (formed by SEQ ID NOS: 72-78) was designed against a 104 nt sequence around the SNP rs2775256.
- the mismatched translator probe formed by SEQ ID NOS: 72-75 and 78-80 was designed against a 104 nt sequence bearing consecutive 6-nt mismatches from the target.
- the auxiliary probe formed by SEQ ID NOS: 81-82 was designed against a 37 nt sequence downstream of translator probe-targeted region.
- FIG. 10B Experimental qPCR results for a M-Probe translator.
- the ligated product was detected by qPCR with two sets of primers (fp1+rp1 and fp1+rp2).
- the primer hybridizing to the matched translator probe's Codeword is the reverse primer 1 (rp1; SEQ ID NO: 83).
- the primer hybridizing to the mismatch translator probe's Codeword is the reverse primer 2 (rp2; SEQ ID NO: 84).
- the primer hybridizing to the sequencing adapter on the auxiliary probe is the forward primer 1 (fp1; SEQ ID NO: 15).
- FIG. 11 Sequence encoding using Codewords on both the translator probe and the auxiliary probe to overcome potential nonspecific ligation. NGS reads with inconsistent Codewords can be excluded from interpretation as they likely result from nonspecific ligation.
- NGS detection and quantification of sequence variants is hampered by the intrinsic NGS error rate, commonly estimated to be between 0.1% and 1% for the Illumina platforms. Based on the intrinsic error rate, most commercial LDTs state mutation limits of detection of between 1% and 5%. Suppression of sequencing errors using molecular barcodes (e.g., SafeSeq, CAPPseq, DuplexSeq) greatly increases the sequencing depth required, which in turn increases sequencing cost. In contrast, in sequence encoding, Codewords can be designed to be orthogonal and error-correcting, essentially eliminating the problem of NGS intrinsic error.
- a typical 300 nucleotide (nt) NGS read can uniquely specify $4^{300} \approx 4 \times 10^{180}$ sequences.
- the entire human genome is only $3 \times 10^{9}$ nt long, and there are fewer than $10^{8}$ known DNA sequence variants.
- the number of RNA transcripts and transcript variants is even smaller: $3 \times 10^{4}$ genes, and likely fewer than $10^{7}$ total RNA splice variants. This gross numerical mismatch points to an enormous NGS inefficiency when applied to profiling the human genome/transcriptome, or any other sample for which a reference genome or transcriptome is known.
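The numerical mismatch described above can be made concrete with a short calculation. The sketch below compares the sequence space of a 300 nt read with the number of known DNA sequence variants, and computes the shortest uncoded nucleotide index that could still label every variant. The variable names are illustrative, and the index length ignores any error-correction overhead.

```python
import math

READ_LENGTH_NT = 300
KNOWN_DNA_VARIANTS = 1e8  # upper bound quoted in the text

# A read of length L can specify 4**L sequences; work in log10 to keep numbers readable.
log10_read_space = READ_LENGTH_NT * math.log10(4)

# Shortest uncoded index (in nt) able to label every known variant.
min_index_nt = math.ceil(math.log(KNOWN_DNA_VARIANTS, 4))

print(f"300 nt read space: about 10^{log10_read_space:.1f} sequences")
print(f"Minimum uncoded index length: {min_index_nt} nt")
```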
- Sequence encoding offers significant advantages over the direct or "literal" NGS that is in use today in speed, accuracy, and ease of interpretation.
- sequencing by synthesis takes between 5 and 10 minutes per cycle, depending on the exact platform. Consequently a 300 to 600 cycle NGS run will typically require 1 to 3 full days. With 21 nt Codewords, the actual sequencing time can be reduced by 15- to 30-fold, to 2 hours.
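The run-time comparison above is per-cycle arithmetic, which the following sketch reproduces. It counts only cycle time, using the 5 to 10 minutes per cycle stated in the text, and ignores instrument setup and analysis overhead. The function name is illustrative.

```python
def run_time_hours(cycles: int, minutes_per_cycle: float) -> float:
    """Sequencing time in hours, counting cycle time only."""
    return cycles * minutes_per_cycle / 60

for cycles in (21, 300, 600):
    fastest = run_time_hours(cycles, 5)
    slowest = run_time_hours(cycles, 10)
    print(f"{cycles:>3} cycles: {fastest:.2f} to {slowest:.2f} h")

# Fold reduction in cycle count when reading a 21 nt Codeword instead of a full read.
print(f"Fold reduction: {300 / 21:.1f} to {600 / 21:.1f}")
```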
- sequencing by synthesis chemistry is imperfect and results in incorrect base calls with a probability of between 0.1% and 1% (intrinsic error rate). This error rate is especially problematic for detection and quantitation of single nucleotide variants. Sequence encoding allows the designer to intentionally map closely-related sequences into highly distinct Codewords, and essentially eliminates the impact of sequencing error. For example, one could encode wild-type KRAS as "ATATCCC," KRAS-G12D as "ATATGAG," and KRAS-G12V as "ATATCCA." Every Codeword is different by three nucleotides, so it would require an extremely unlikely three simultaneous NGS base call errors for one Codeword to be misinterpreted as the other.
- the method of detecting, identifying, or quantifying nucleic acid molecules by sequence encoding does not require a step of bead capture, and thus can avoid the bead-washing steps required following bead capture, because ligation alone is sufficient to exclude unbound molecules.
- FIG. 7 shows one embodiment of sequence encoding for analysis by next generation sequencing (NGS), also known as sequencing-by-synthesis.
- the translator probe further comprises a region 7.
- Regions 3 and 7 are adaptors for NGS amplification or index appending, so nucleic acid molecules lacking either region 3 or region 7 will not be sequenced. Thus, only ligation products will be analyzed by NGS.
- one potential workflow for sequence encoding and NGS analysis is provided.
- the entire sample-to-answer workflow is expected to require less than eight hours total when optimized and using only 21 sequencing cycles and represents a significant speedup over current NGS-based laboratory developed tests (LDTs).
- Further speeding up this workflow requires shortening its bottleneck, Step 2, the hybridization of genomic DNA to the variant and auxiliary probes.
- Hybridization kinetics are primarily determined by the individual concentration of each probe. For low- to medium-plex translation, individual probe concentrations can be quite high and hybridization quite fast; for example, 5 nM per probe for a set of 1000-plex translators corresponds to a hybridization half-life of roughly three minutes.
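The dependence of hybridization time on probe concentration can be sketched with a pseudo-first-order model, in which each probe is in large excess over its target and the half-life is ln(2) divided by the product of the rate constant and the probe concentration. The rate constant used below (1e6 per molar per second) is an assumed order-of-magnitude value and is not stated in the disclosure, so the output should be read as a rough estimate only.

```python
import math

def hybridization_half_life_s(probe_conc_molar: float, k_hyb_per_molar_s: float) -> float:
    """Pseudo-first-order half-life, assuming the probe is in large excess over target."""
    return math.log(2) / (k_hyb_per_molar_s * probe_conc_molar)

ASSUMED_K_HYB = 1e6   # M^-1 s^-1; assumed value, not taken from the text
PROBE_CONC = 5e-9     # 5 nM per probe, as in the text

half_life = hybridization_half_life_s(PROBE_CONC, ASSUMED_K_HYB)
print(f"Half-life at 5 nM: {half_life / 60:.1f} min")
```

Under this model, halving the per-probe concentration doubles the half-life, which is why higher-plex panels at fixed total probe concentration hybridize more slowly.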
- the simplest (7,4) Hamming code inserts 3 error-correcting bits for every 4 bit message (longer messages are first broken up into 4 bit words). All 7-bit instances of the Hamming code have the property that they are at least Hamming distance 3 from any other instance - that is to say, one would need to change at least 3 bits in order to transform one Hamming code instance into another. This property means that (7,4) Hamming codes are correcting for up to one error, and tolerant for up to two errors: The original sequence can be restored from any sequence mutated by one base; more conservatively, any sequence with two mutations will not match any other Codewords and can be excluded.
- the (7,4) DNA encoding shown in FIG. 6 can be used.
- the assignment of A, T, C, and G to numerical values and the design of the error check equations are selected such that long homopolymers and extremal G/C content are rare.
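A (7,4) nucleotide code of this kind can be sketched as follows. The numerical assignment (A=0, C=1, G=2, T=3) and the three check equations are assumptions chosen for illustration, since the equations of FIG. 6B are not reproduced in this text. The structure is the same: four message nucleotides, three check nucleotides, and arithmetic modulo 4.

```python
from itertools import combinations, product

BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3
VAL = {b: i for i, b in enumerate(BASES)}

def encode(word: str) -> str:
    """Append check nucleotides x, y, z to a 4-nt word (assumed equations, mod 4)."""
    a, b, c, d = (VAL[ch] for ch in word)
    x = -(a + b + d) % 4   # check 0: a + b + d + x = 0 (mod 4)
    y = -(a + c + d) % 4   # check 1: a + c + d + y = 0 (mod 4)
    z = -(b + c + d) % 4   # check 2: b + c + d + z = 0 (mod 4)
    return word + BASES[x] + BASES[y] + BASES[z]

def violated_checks(read: str) -> list:
    """Indices of the check equations that a 7-nt read fails."""
    a, b, c, d, x, y, z = (VAL[ch] for ch in read)
    sums = (a + b + d + x, a + c + d + y, b + c + d + z)
    return [i for i, s in enumerate(sums) if s % 4 != 0]

def hamming(s: str, t: str) -> int:
    return sum(p != q for p, q in zip(s, t))

CODEBOOK = [encode("".join(w)) for w in product(BASES, repeat=4)]

def decode(read: str, correct: bool = True):
    """Return the 4-nt message, or None if the read is rejected.

    correct=True repairs a single substitution by nearest-codeword search;
    correct=False accepts only exact codeword matches (detection only).
    """
    best = min(CODEBOOK, key=lambda cw: hamming(cw, read))
    limit = 1 if correct else 0
    return best[:4] if hamming(best, read) <= limit else None

print(encode("ACGT"))
print(violated_checks("AGGTATG"))  # second nucleotide mutated C>G
print(decode("AGGTATG"), decode("AGGTATG", correct=False))
```

With these assumed equations, a substitution in the second nucleotide violates two of the three checks, and the 256 codewords have a minimum pairwise Hamming distance of three.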
- Manual pruning of the 256 possible (7,4) Hamming codes removes 40 sequences that can contribute to homopolymers of more than 5 nt (via having a homopolymer of length 3 at the beginning or end of the Codeword) or have G/C content of >75% or <25%, resulting in 216 good (7,4) nt Codeword segments.
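The two pruning criteria can be expressed as a simple filter. The sketch below uses hypothetical function names and example sequences. How many codewords survive depends on the particular nucleotide assignment and check equations, so no count is asserted here.

```python
def has_terminal_homopolymer(codeword: str, run: int = 3) -> bool:
    """True if the first or last `run` nucleotides are all the same base."""
    return len(set(codeword[:run])) == 1 or len(set(codeword[-run:])) == 1

def gc_fraction(codeword: str) -> float:
    return sum(base in "GC" for base in codeword) / len(codeword)

def is_usable(codeword: str, gc_min: float = 0.25, gc_max: float = 0.75) -> bool:
    """Keep codewords with no terminal homopolymer and G/C content within bounds."""
    return (not has_terminal_homopolymer(codeword)
            and gc_min <= gc_fraction(codeword) <= gc_max)

# Hypothetical 7-nt segments, for illustration only.
for cw in ("ACGTACG", "AAATCGC", "ATATTAA", "GCGCGCA"):
    print(cw, is_usable(cw))
```

Rejecting a run of three at either end ensures that concatenating two segments cannot create a homopolymer longer than 5 nt across the junction.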
- FIG. 6E shows the effective error rate of the Codewords given different intrinsic error rates; at 1% intrinsic error rate, the proposed Codewords exhibit roughly 0.6% error rate when NGS reads unmatched to any designed Codeword are corrected, and 0.01% error when unmatched NGS reads are discarded. These are roughly 20-fold and 1000-fold better than a naive encoding with no error correction.
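These error rates can be approximated with a simple binomial model, sketched below. It assumes independent substitution errors at rate p per nucleotide, a 21 nt Codeword made of three 7-nt blocks, and a 12 nt uncoded message. Whether FIG. 6E was computed this way is an assumption. In correcting mode a block is miscalled when it carries two or more errors; in detecting mode a miscall requires at least three errors in one block, so the value returned is an upper bound.

```python
from math import comb

def prob_at_most(k: int, n: int, p: float) -> float:
    """Probability of at most k errors among n nucleotides."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def uncoded_error(p: float, length: int = 12) -> float:
    """Any error in an uncoded message is a miscall."""
    return 1 - (1 - p)**length

def correcting_error(p: float, blocks: int = 3, n: int = 7) -> float:
    """Miscall if any block has two or more errors."""
    return 1 - prob_at_most(1, n, p)**blocks

def detecting_error_bound(p: float, blocks: int = 3, n: int = 7) -> float:
    """Upper bound: miscall requires three or more errors in some block."""
    return 1 - prob_at_most(2, n, p)**blocks

p = 0.01
print(f"Uncoded (12 nt):    {uncoded_error(p):.4f}")
print(f"Correcting (21 nt): {correcting_error(p):.4f}")
print(f"Detecting (21 nt):  {detecting_error_bound(p):.6f}")
```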
- the present disclosure provides synthetic oligonucleotide probes for use in sequence encoding.
- the adaptors are single-stranded probes.
- the adaptors are at least partially double-stranded probes.
- the oligonucleotide probes can have a length of 30 to 200 nucleotides, particularly 50 to 100 nucleotides, such as between 60 and 70 nucleotides. Exemplary structures of the probes are provided in FIGS. 1, 4A, 8, 11A, and 12.
- the probes can comprise part or all of sequencing primer sequences or their binding sites, such as index sequencing primers for particular sequencing platforms (e.g., Illumina index primers).
- Sequence encoding is implemented via a pair of DNA hybridization probes that are conditionally ligated using the target sequence as a splint (FIG. 1B).
- the left probe called the translator probe
- the right probe called the auxiliary probe
- the universal region 3 that is identical for all auxiliary probes.
- the translator probe is functionalized with a phosphate at the 5' end. Only when both the translator probe and the auxiliary probe bind adjacent DNA subsequences, is the 5' end of the translator probe flush against the 3' end of the auxiliary probe to allow subsequent ligation.
- Multiple translator probes can be used simultaneously (FIG. 2), to allow either sequence encoding of multiple target sequences or to identify which one of several potential target sequences is present in a sample.
- the molecular specificity of the translator and auxiliary probes is beneficial to accurate inference of genomic DNA variants based on Codeword analysis. Nonspecific binding of variant probes and auxiliary probes to other genomic loci would result in false positive results.
- the ligation process helps improve specificity, but is insufficient on its own to ensure accurate translation of target DNA sequences into Codewords.
- a protector oligonucleotide comprising a region 8 that is partially complementary to region 5 is introduced.
- at least five continuous nucleotides on region 5 are not bound to the protector, i.e., form a toehold, in order to allow initiation of hybridization between the target and the translator probe.
- This protector oligonucleotide can improve the specificity of hybridization reactions (see Zhang et al., 2012, Wang and Zhang, 2015, U.S. Pat. No. 9,284,602, and U.S. Pat. Publn. No. 2016/0340727, each of which is incorporated herein by reference in its entirety), and maintains high sequence selectivity across a large range of temperatures and buffer conditions.
- the protector oligonucleotide is present in molar excess.
- the nucleic acid probes are rationally designed so that the standard free energy for hybridization (e.g., theoretical standard free energy) between the specific target nucleic acid molecule and the region 5 is close to zero, while the standard free energy for hybridization between a spurious target (even one differing from the specific (actual) target by as little as a single nucleotide) and the probe is high enough to make their binding unfavorable by comparison.
- the "toehold" region is present in region 5, is complementary to a target sequence 1 and not complementary to a protector region 8.
- the sequence of the complementary regions is rationally designed to achieve this matching under desired conditions of temperature and probe concentration.
- the equilibrium for the actual target and probe rapidly approaches 50% target:probe::protector:probe (or whatever ratio is desired), while equilibrium for the spurious target and primer greatly favors protector:probe.
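The relationship between the reaction standard free energy and the equilibrium yield can be illustrated with a simplified strand-exchange model: target + probe:protector in equilibrium with target:probe + protector. The sketch assumes equal initial concentrations of target and probe:protector duplex and no free protector at the start, which is a simplification of the excess-protector conditions described above. The 3 kcal/mol penalty used for a mismatched target is an illustrative value only.

```python
import math

R_KCAL_PER_MOL_K = 1.987e-3  # gas constant in kcal/(mol*K)

def equilibrium_yield(delta_g_kcal_per_mol: float, temp_c: float = 37.0) -> float:
    """Fraction of target bound to probe at equilibrium (simplified model).

    With equal initial concentrations, yield^2 / (1 - yield)^2 = K,
    so yield = sqrt(K) / (1 + sqrt(K)).
    """
    temp_k = temp_c + 273.15
    k_eq = math.exp(-delta_g_kcal_per_mol / (R_KCAL_PER_MOL_K * temp_k))
    root = math.sqrt(k_eq)
    return root / (1 + root)

print(f"Intended target (0 kcal/mol):     {equilibrium_yield(0.0):.2f}")
print(f"Mismatched target (+3 kcal/mol):  {equilibrium_yield(3.0):.2f}")
```

A reaction free energy of zero gives a 50% yield in this model, and a positive free energy shifts the equilibrium toward the probe:protector duplex.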
- it is thought that hybridization to a target begins at the toehold and continues along the length of the region 5 until the probe is no longer "double-stranded." This assumes complementarity between the target and the region 5.
- the region 5 of the probe will bind stably to a target in the absence of a mismatch but not in the presence of a mismatch. If a mismatch exists between the region 5 of the probe and the target, the probe duplex prefers to reform. In this way, the frequency of producing a ligation product when a target sequence is not present is reduced. This type of discrimination is typically not possible using the standard single-stranded probes because in those reactions there is no competing nucleic acid strand (such as the protector oligonucleotide) to which a mismatched probe strand would prefer to bind.
- both the translation probe and the auxiliary probe may have a protector oligonucleotide.
- PCR polymerase chain reaction
- NGS next-generation sequencing
- Standard hybridization probes are: (1) length-limited by synthesis capabilities and cannot query long target regions; (2) not economical for profiling of DNA samples with combinatorial diversity, such as T-cell receptors and antibody fragments; (3) incapable of accurate quantitation of trinucleotide repeats such as in Huntington's gene, Fragile X, and Friedreich's Ataxia, as well as microsatellite repeats.
- Another embodiment of sequence encoding uses modular M-Probes constructed from many oligonucleotides, as shown in FIG. 10A (U.S. Pat. Appln. No. 62/398,484 and Wang et al., 2017, each of which is incorporated herein by reference in its entirety).
- M-Probes are capable of sequence-selective binding of very long nucleic acid targets, and furthermore tolerate non-pathogenic sequence variations at specified locations.
- the modular probe is designed based on detection or capture of a target nucleic acid sequence of at least partially known sequence.
- the target sequence is divided conceptually into several regions, a region being a number of continuous nucleotides that act as a unit in hybridization or dissociation. Note that the regions may or may not be directly adjoining one another.
- Several diseases are caused or characterized by an abnormal number of triplet repeats; examples include Huntington's disease (excessive number of CAG repeats), Friedreich's Ataxia (GAA repeats), Myotonic dystrophy (CTG repeats), and the Fragile X syndrome (CGG repeats).
- these repeats induce slipped strand mispairing during DNA replication; slipped strand mispairing likewise complicates or precludes many conventional DNA analysis techniques, such as Sanger sequencing, quantitative PCR, and next-generation sequencing.
- Modular probes can be designed to, for example, the Huntington's gene sequence.
- Each modular probe is designed to target a threshold number of repeats (6, 9, 12, 15, 18, 21, 24, and 27), as well as the 3' neighboring sequence.
- a 12 repeat probe is designed to hybridize to any target sequences bearing 12 or more CAG repeats, in addition to the 8 nt downstream of the CAG repeats.
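Because each threshold probe responds to any target with at least its designed number of repeats, the pattern of positive probes brackets the repeat count. The sketch below illustrates this inference under the idealized assumption that every probe behaves as a perfect threshold detector. The function names are hypothetical.

```python
THRESHOLDS = (6, 9, 12, 15, 18, 21, 24, 27)  # repeat thresholds listed in the text

def expected_signals(n_repeats: int) -> dict:
    """Idealized response: a probe is positive if the target has >= its threshold."""
    return {t: n_repeats >= t for t in THRESHOLDS}

def repeat_range(signals: dict):
    """Infer (lower, upper) bounds on the repeat count from probe signals.

    upper is None when every probe is positive (no upper bound from this panel).
    """
    positives = [t for t in THRESHOLDS if signals.get(t)]
    negatives = [t for t in THRESHOLDS if not signals.get(t)]
    lower = max(positives) if positives else 0
    upper = min(negatives) - 1 if negatives else None
    return lower, upper

print(repeat_range(expected_signals(14)))  # bracketed between two thresholds
print(repeat_range(expected_signals(30)))  # above the largest threshold
```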
- Another embodiment of sequence encoding that overcomes potential nonspecific ligation is shown in FIG. 11.
- a secondary Codeword is placed on the auxiliary probe near region 3, or between region 3 and region 4.
- PCR amplification using both Codewords (regions 6 and 10) or both universal regions (regions 3 and 7) or NGS paired-end reads to analyze both Codewords will differentiate correctly ligated species with consistent Codeword pairs from nonspecific ligation products.
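For paired-end NGS data, the consistency check described above amounts to keeping only reads whose translator and auxiliary Codewords form a designed pair. The following sketch uses a hypothetical design table and hypothetical 7-nt Codewords. Reads are represented as (translator Codeword, auxiliary Codeword) tuples that have already been extracted from the raw sequences.

```python
from collections import Counter

# Hypothetical design table: translator Codeword -> expected auxiliary Codeword.
DESIGNED_PAIRS = {
    "ACGTATG": "CATGGAC",
    "TGCAGTC": "GTACCTA",
}

def tally_reads(read_pairs, designed_pairs=DESIGNED_PAIRS):
    """Count reads with consistent Codeword pairs; reject all others."""
    accepted = Counter()
    rejected = 0
    for translator_cw, auxiliary_cw in read_pairs:
        if designed_pairs.get(translator_cw) == auxiliary_cw:
            accepted[translator_cw] += 1
        else:
            rejected += 1  # likely nonspecific ligation or an unreadable Codeword
    return accepted, rejected

reads = [
    ("ACGTATG", "CATGGAC"),
    ("ACGTATG", "CATGGAC"),
    ("ACGTATG", "GTACCTA"),  # inconsistent pair
    ("TGCAGTC", "GTACCTA"),
    ("GGGGGGG", "CATGGAC"),  # translator Codeword not in the design
]
accepted, rejected = tally_reads(reads)
print(dict(accepted), rejected)
```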
- a nucleic acid molecule of interest can be a single nucleic acid molecule or a plurality of nucleic acid molecules. Also, a nucleic acid molecule of interest can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, amplified DNA, a pre-existing nucleic acid library, etc.
- Nucleic acids in a nucleic acid sample being analyzed (or processed) in accordance with the present disclosure can be from any nucleic acid source.
- nucleic acids in a nucleic acid sample can be from virtually any nucleic acid source, including but not limited to genomic DNA, complementary DNA (cDNA), RNA (e.g., messenger RNA, ribosomal RNA, short interfering RNA, microRNA, etc.), plasmid DNA, mitochondrial DNA, etc.
- Exemplary organisms include, but are not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), bacteria, fungi (e.g., yeast), viruses, etc.
- the nucleic acids in the nucleic acid sample are derived from a mammal, where in certain embodiments the mammal is a human.
- nucleic acid molecules examples include genomic DNA, cDNA, cell-free DNA (cfDNA), RNA, amplified DNA, a pre-existing nucleic acid library, etc.
- the target nucleic acid is a double-stranded DNA molecule, such as, for example, human genomic DNA.
- a nucleic acid molecule of interest may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, chemical, enzymatic, degradation over time, etc. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc.
- a nucleic acid molecule of interest may also be subjected to chemical modification (e.g. , bisulfite conversion, methylation / demethylation), extension, amplification (e.g. , PCR, isothermal, etc.), etc.
- RNA molecule may be obtained from a sample, such as a sample comprising total cellular RNA, a transcriptome, or both; the sample may be obtained from one or more viruses; from one or more bacteria; or from a mixture of animal cells, bacteria, and/or viruses, for example.
- the sample may comprise mRNA, such as mRNA that is obtained by affinity capture.
- Obtaining nucleic acid molecules may comprise generation of the cDNA molecule by reverse transcribing the mRNA molecule with a reverse transcriptase, such as, for example Tth DNA polymerase, HIV Reverse Transcriptase, AMV Reverse Transcriptase, MMLV Reverse Transcriptase, or a mixture thereof.
- PCR™ (polymerase chain reaction)
- two synthetic oligonucleotide primers which are complementary to two regions of the template DNA (one for each strand) to be amplified, are added to the template DNA (that need not be pure), in the presence of excess deoxynucleotides (dNTP's) and a thermostable polymerase, such as, for example, Taq (Thermus aquaticus) DNA polymerase.
- the target DNA is repeatedly denatured (around 90°C), annealed to the primers (typically at 50-60°C) and a daughter strand extended from the primers (72°C). As the daughter strands are created they act as templates in subsequent cycles.
- the template region between the two primers is amplified exponentially, rather than linearly.
- a barcode such as a sample barcode, may be added to the target nucleic acid molecules during amplification.
- One method involves annealing a primer to the sequence encoded nucleic acid molecule, the primer including a first portion complementary to the sequence encoded nucleic acid molecule and a second portion including a barcode; and extending the annealed primer to form a barcoded nucleic acid molecule.
- the primer may include a 3' portion and a 5' portion, where the 3' portion may anneal to a portion of the sequence encoded nucleic acid molecule and the 5' portion comprises the barcode.
- Methods are also provided for the sequencing of the library of sequence encoded nucleic acid molecules. Any technique for sequencing nucleic acids known to those skilled in the art can be used in the methods of the present disclosure. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing-by-synthesis, pyrosequencing, 454 sequencing, nanopore sequencing, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, and SOLiD sequencing.
- Amplification refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 30-100 "cycles" of denaturation and replication. “Polymerase chain reaction,” or “PCR,” means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA.
- PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates.
- the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g. , exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively).
- Primer means an oligonucleotide, either natural or synthetic that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed.
- the sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide.
- primers are extended by a DNA polymerase.
- Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of 18-40, 20-35, or 21-30 nucleotides long, and any length between the stated ranges.
- Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges.
- the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
- a nucleic acid "region" or “domain” is a consecutive stretch of nucleotides of any length.
- a "nucleoside” is a base-sugar combination, i.e. , a nucleotide lacking a phosphate. It is recognized in the art that there is a certain inter-changeability in usage of the terms nucleoside and nucleotide.
- the nucleotide deoxyuridine triphosphate, dUTP is a deoxyribonucleoside triphosphate. After incorporation into DNA, it serves as a DNA monomer, formally being deoxyuridylate, i.e. , dUMP or deoxyuridine monophosphate.
- one may say that one incorporates deoxyuridine into DNA even though that is only a part of
- Nucleotide is a term of art that refers to a base-sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e., of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.
- nucleic acid or “polynucleotide” will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g. , adenine "A,” guanine “G,” thymine “T” and cytosine “C”) or RNA (e.g. A, G, uracil "U” and C).
- nucleic acid encompasses the terms “oligonucleotide” and “polynucleotide.”
- Oligonucleotide refers collectively and interchangeably to two terms of art, “oligonucleotide” and “polynucleotide.” Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein.
- adaptor may also be used interchangeably with the terms “oligonucleotide” and “polynucleotide.”
- the term “adaptor” can indicate a linear adaptor (either single stranded or double stranded) or a stem-loop adaptor. These definitions generally refer to at least one single-stranded molecule, but in specific embodiments will also encompass at least one additional strand that is partially, substantially, or fully complementary to at least one single-stranded molecule.
- a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or "complement(s)" of a particular sequence comprising a strand of the molecule.
- nucleic acid molecule or “nucleic acid target molecule” refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof.
- the nucleic acid molecule contains the four canonical DNA bases - adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases - adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2'-deoxyribose group.
- the nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA.
- mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase, and DNA can be transcribed into RNA using RNA polymerase.
- a nucleic acid molecule can be of biological or synthetic origin.
- nucleic acid molecules examples include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc.
- a nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc.
- a nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing.
- Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc.
- Nucleic acid(s) that are “complementary” or “complement(s)” are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules.
- the term “complementary” or “complement(s)” may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above.
- substantially complementary may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if fewer than all nucleobases base pair with a counterpart nucleobase.
- a "substantially complementary" nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
- the term “substantially complementary” refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions.
- a “partially complementary” nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
- non-complementary refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.
- ligase refers to an enzyme that is capable of joining the 3' hydroxyl terminus of one nucleic acid molecule to a 5' phosphate terminus of a second nucleic acid molecule to form a single molecule.
- the ligase may be a DNA ligase or RNA ligase. Examples of DNA ligases include E. coli DNA ligase, T4 DNA ligase, and mammalian DNA ligases.
- sample means a material obtained or isolated from a fresh or preserved biological sample or synthetically-created source that contains nucleic acids of interest.
- a sample is the biological material that contains the variable immune region(s) for which data or information are sought.
- Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non- human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
- substantially known refers to having sufficient sequence information in order to permit preparation of a nucleic acid molecule, including its amplification. This will typically be about 100%, although in some embodiments some portion of an adaptor sequence is random or degenerate. Thus, in specific embodiments, substantially known refers to about 50% to about 100%, about 60% to about 100%, about 70% to about 100%, about 80% to about 100%, about 90% to about 100%, about 95% to about 100%, about 97% to about 100%, about 98% to about 100%, or about 99% to about 100%.
- kits for creating libraries of target nucleic acids in a sample refers to a combination of physical elements.
- a kit may include, for example, one or more components, such as translation and auxiliary adaptors, either with or without protector oligonucleotides, as well as specific primers, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein. These physical elements can be arranged in any way suitable for carrying out the disclosure.
- the components of the kits may be packaged either in aqueous media or in lyophilized form.
- the container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g. , aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial.
- the kits of the present disclosure also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained.
- kits will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented. It is contemplated that such reagents are embodiments of kits of the disclosure. Such kits, however, are not limited to the particular items identified above. VII. Examples
- Sequence encoding is implemented via a pair of DNA hybridization probes that are conditionally ligated using the target sequence as a splint (FIG. 1B).
- the left probe called the translator probe
- the right probe bears a universal region 3 that is identical for all auxiliary probes.
- the translator probe is functionalized with a phosphate at the 5' end. Only when both the translator probe and the auxiliary probe bind adjacent DNA subsequences is the 5' end of the translator probe flush against the 3' end of the auxiliary probe to allow subsequent ligation.
- FIG. 2 provides experimental results on sequence encoding using quantitative PCR (qPCR).
- Two different translator probes were present and simultaneously allowed to hybridize to the NA18562 human cell line genomic DNA, which bears a target sequence matching one of the two translators (FIG. 3A).
- Regions 5a and 5b differ by 5 nucleotides.
- Using the Codewords 6a and 6b as reverse primer binding sites in a qPCR reaction shows a roughly 10 cycle Ct difference, indicating roughly 1000-fold higher intended ligation over nonspecific ligation.
- Molecular specificity of the translator and auxiliary probes is beneficial to accurately infer genomic DNA variants based on Codeword analysis. Nonspecific binding of Variant and Auxiliary Probes to other genomic loci can result in false positive results. The ligation process helps improve specificity, but is insufficient on its own to ensure accurate translation of target DNA sequences into Codewords.
- a protector oligonucleotide comprising a region 8 that is partially complementary to region 5 is introduced. Importantly, at least 5 continuous nucleotides on region 5 are not bound by the protector, to allow initiation of hybridization between the target and the translator probe.
- This protector oligonucleotide can improve the specificity of hybridization reactions (see Zhang, Chen, and Yin, Nature Chemistry 2012 and Wang and Zhang, Nature Chemistry 2015; U.S. Patent No. 9,284,602 and U.S. Patent Appln. No. 15/174,373) and maintains high sequence selectivity across a large range of temperatures and buffer conditions.
- FIG. 4B provides qPCR results using a double-stranded translator probe.
- the simplest (7,4) Hamming code inserts 3 error-correcting bits for every 4-bit message (longer messages are first broken up into 4-bit words). All 7-bit instances of the Hamming code have the property that they are at least Hamming distance 3 from any other instance; that is to say, one would need to change at least 3 bits in order to transform one Hamming code instance into another. This property means that (7,4) Hamming codes can correct up to one error or, alternatively, detect up to two errors: the original sequence can be restored from any sequence mutated by one base; more conservatively, any sequence with up to two mutations will not match any other Codeword and can be excluded.
- FIG. 6E shows the effective error rate of the Codewords given different intrinsic error rates; at 1% intrinsic error rate, the proposed Codewords exhibit roughly 0.6% error rate when NGS reads unmatched to any designed Codeword are corrected, and 0.01% error when unmatched NGS reads are discarded. These are roughly 20-fold and 1000-fold better than a naive encoding with no error correction.
- FIG. 7 illustrates another embodiment of sequence encoding for analysis by next generation sequencing (NGS), also known as sequencing-by-synthesis.
- the translator probe further comprises a region 7. Regions 3 and 7 are adaptors for NGS amplification or index appending, so nucleic acid molecules lacking either region 3 or region 7 will not be sequenced. Thus, only ligation products will be analyzed by NGS.
- one potential workflow for sequence encoding and NGS analysis is provided.
- the entire sample-to-answer workflow is expected to require less than eight hours total when optimized and using only 21 sequencing cycles, which represents a significant speedup over current NGS-based laboratory developed tests (LDTs).
- To further speed up this workflow requires shortening the time bottlenecking Step 2, the hybridization of genomic DNA to the Variant and Auxiliary Probes. Hybridization kinetics are primarily determined by the individual concentration of each Probe. For low- to medium-plex Translation, individual Probe concentrations can be quite high and hybridization quite fast; for example, 5 nM per Probe for a set of 1000-plex Translators corresponds to a hybridization half-life of roughly three minutes. The four hours listed for Step 2 assumes 100,000-plex translators, and is similar to the times allotted for hybridization by commercial whole exome capture panels. Ligation by NEB Quick Ligase has a half-life of one minute, and does not significantly contribute to overall time.
- FIG. 9 provides experimental results of a 22-plex sequence encoding system (SEQ ID NOS: 17-60): 11 translators were designed to bind specifically to exon subsequences of 11 human genes, and 11 translators were designed to bind the mouse homolog genes.
- the sample input is 200 ng of human gDNA, so all 11 human gene Codewords are expected to yield high NGS reads, and all 11 mouse gene Codewords are expected to have few or no NGS reads. All Codewords were 21 nt long and use the Hamming error-correction mechanism described in Example 3.
- FIG. 9A shows that there were roughly 50-fold more human
- FIG. 9B shows the distribution of NGS reads in the library.
- the 3.8% of the library that were corrected Codewords demonstrates the error resilience of sequence encoding to NGS intrinsic error. In a standard NGS protocol and analysis, these 3.8% would have shown up spuriously as single nucleotide variants, but this possibility is completely eliminated in symbolic sequencing.
- Another embodiment of sequence encoding uses modular probes (M-Probes) constructed from many oligonucleotides, shown in FIG. 10A (U.S. Pat. Appln. No. 62/398,484; Wang et al, 2017).
- M-Probes are capable of sequence-selective binding of very long nucleic acid targets, and furthermore tolerate non-pathogenic sequence variations at specified locations.
- FIG. 10B shows qPCR results using M-Probe translators. A roughly 6 cycle Ct difference is observed, indicating roughly 60-fold higher intended ligation over nonspecific ligation.
- Another embodiment of sequence encoding that overcomes potential nonspecific ligation is shown in FIG. 11.
- a secondary Codeword is placed on the auxiliary probe near region 3.
- PCR amplification using both Codewords or NGS paired-end reads to analyze both Codewords will differentiate correctly ligated species with consistent Codeword pairs from nonspecific ligation products.
- Table 1 Sequences used throughout the Examples
- CAC CTA GTC AGA GAG ACA AAC ACC AGA ACA CTA TAA CGA GTA CTA GCA AAA CCC AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
- PPIA-Human-P4 TAA GTA AGT TCT TGG GAA TTA AAG
- AGA CTG GCT CTT AAA AAG TG GAPDH-Human-C5 /5Phos/CCA GAC CCT GCA CTT TTT AAG AGC CAG TCT CTG GCC CCA GCC ACA TAC CAA TGC GGG GTT TCA CAA ACA GAT CCT AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
- GUSB-Human-P6 GGA AAC AGC GGG GCC CAG GGT GGC
- TAA TAA TTA TGC ACG TCA CAT CTG TAA TAA CAT TCG CAT TCG GAG TAA CTC AGG CAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T
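The qPCR comparisons reported in the Examples above (a roughly 10-cycle Ct difference read as roughly 1000-fold, and a roughly 6-cycle Ct difference read as roughly 60-fold) follow from the usual idealization that each PCR cycle doubles the amount of template. The sketch below shows that arithmetic only. The function name and the `efficiency` parameter are illustrative assumptions and are not taken from the source, and real amplification efficiencies are typically somewhat below perfect doubling.

```python
def fold_change_from_delta_ct(delta_ct: float, efficiency: float = 1.0) -> float:
    """Approximate fold difference in starting template implied by a Ct gap.

    efficiency=1.0 assumes perfect doubling per cycle (an idealization);
    lower values model less efficient amplification.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return (1.0 + efficiency) ** delta_ct


if __name__ == "__main__":
    # Ct gaps quoted in the text: ~10 cycles and ~6 cycles.
    for delta_ct in (10, 6):
        print(delta_ct, fold_change_from_delta_ct(delta_ct))
```

Under perfect doubling, 10 cycles corresponds to 2^10 = 1024 and 6 cycles to 2^6 = 64, consistent with the "roughly 1000-fold" and "roughly 60-fold" figures quoted above.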
Abstract
Provided herein are new compositions and methods for profiling DNA and RNA using next-generation sequencing (NGS). In particular, target nucleic acid sequences are sequence encoded. Sequence encoding stoichiometrically "translates" the target nucleic acid sequences into rationally-designed short Codewords that compactly and error-resiliently encode the targets' sequence information and can be efficiently and accurately sequenced by short-read NGS platforms.
Description
DESCRIPTION
SYMBOLIC SEQUENCING OF DNA AND RNA VIA SEQUENCE ENCODING
BACKGROUND
[0001] The present application claims the priority benefit of United States provisional application number 62/552,652, filed August 31, 2017, the entire contents of which are incorporated herein by reference.
[0002] This invention was made with government support under Grant No. R01 HG008752 awarded by the National Institutes of Health. The government has certain rights in the invention.
1. Field
[0003] This disclosure relates generally to the field of molecular biology. More particularly, it concerns methods, and compositions for use therein, of encoding nucleic acid sequence information into rationally-designed short sequences for sequencing by next-generation sequencing.
2. Description of Related Art
[0004] Next-generation sequencing (NGS) has been a great boon to the study of the human genome and transcriptome, but remains slow, labor-intensive, error-prone, and expensive. Today, a majority of the NGS used for research and clinical applications does not involve de novo sequencing of an entirely unknown sample. For these applications, NGS as currently implemented wastes a vast majority of its information capacity. For example, in the profiling of point mutations that cause hereditary diseases or cancers, only 1 nucleotide (nt) out of a 300 nt read contains the actual information of interest, and the other 299 nt of information only serve to align the read to the proper genetic loci. Reducing the amount of NGS information capacity wasted can improve the speed, ease, accuracy, and effective throughput, and lower the cost, of targeted NGS profiling of DNA and RNA samples.
SUMMARY
[0005] This disclosure describes new compositions and methods for profiling DNA and RNA using Next Generation Sequencing (NGS). In the disclosure, target DNA/RNA sequences are stoichiometrically "translated" into designed codewords that compactly and error-resiliently encode the targets' sequence information. Sequence encoding can suppress NGS errors, as well as reduce both NGS procedure and interpretation time. Sequence encoding can significantly impact both molecular diagnostics for precision medicine, as well as academic and clinical research on the human genome/transcriptome.
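As an illustration of what "error-resiliently encode" can mean, the Examples of this document cite the (7,4) Hamming code, which adds 3 check bits to every 4-bit message. The sketch below is a textbook binary implementation and not the disclosed Codeword design: how bits are mapped onto nucleotides is not stated in this section, and all function names are hypothetical.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits as a 7-bit codeword (parity bits at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (p1, p2, d1, p3, d2, d3, d4)


def syndrome(word):
    """Return 0 for a valid codeword, else the 1-based position of a single flipped bit."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 + 2 * s2 + 4 * s3


def correct(word):
    """Correction mode: flip the bit named by the syndrome (valid for at most one error)."""
    pos = syndrome(word)
    fixed = list(word)
    if pos:
        fixed[pos - 1] ^= 1
    return tuple(fixed)


def decode(word):
    """Correct up to one error, then return the 4 data bits."""
    w = correct(word)
    return (w[2], w[4], w[5], w[6])


def min_distance(codewords):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(
        sum(a != b for a, b in zip(x, y))
        for i, x in enumerate(codewords)
        for y in codewords[i + 1:]
    )


# All 16 codewords of the code.
ALL_CODEWORDS = [encode(bits) for bits in product((0, 1), repeat=4)]
```

The two analysis modes described in the Examples correspond to two uses of `syndrome`: call `correct` to repair a single error, or discard any word whose syndrome is nonzero, which rejects every word carrying one or two errors.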
[0006] In one embodiment, compositions of nucleic acid molecules are provided herein, the compositions comprising: (a) at least three auxiliary probes, wherein each auxiliary probe comprises a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region of each auxiliary probe has a unique sequence, wherein the first auxiliary probe universal regions of each auxiliary probe have the same sequence; (b) at least three translation probes, wherein each translation probe comprises a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region of each translation probe has a unique sequence, wherein the first translation probe codeword region of each translation probe has a unique sequence; and (c) at least three translation probe protection oligonucleotides, wherein each translation probe protection oligonucleotide comprises a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region of one of the translation probes.
[0007] In some aspects, the translation probes are modular probes. In certain aspects, the first nucleic acid molecules of the translation probes further comprise a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region. In certain aspects, the translation probes further comprise a second nucleic acid molecule, wherein each of the second nucleic acid molecules comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule. In certain aspects, the translation probes further comprise a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule. In certain aspects, the third nucleic acid molecules of the translation probes further comprise a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region. In certain aspects, the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region. In certain aspects, the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
[0008] In some aspects, each first translation probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7. For example, each first translation probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
[0009] In some aspects, each of the translation probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region in the composition. In certain aspects, each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition. For example, each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 12 nucleotide positions. Alternatively, each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
[0010] In some aspects, each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region in the composition. In certain aspects, each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition. For example, each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 11 nucleotide positions. Alternatively, each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
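The relationship between Hamming distance and shared sequence identity described above can be checked programmatically. The following is a minimal sketch, not part of the disclosed compositions; the function names and the three 14-nucleotide codewords are hypothetical and were invented solely for illustration. For codewords of length n and a minimum Hamming distance d, no two codewords can share identity at more than n - d positions.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(codewords: list[str]) -> int:
    """Smallest Hamming distance over all pairs of codewords in the set."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def max_shared_identity(length: int, min_distance: int) -> int:
    """Largest number of identical positions two codewords may share."""
    return length - min_distance


# Hypothetical 14-nucleotide codewords (illustrative only, not from the source).
example_codewords = [
    "ACGTACGTACGTAC",
    "ACGTACGTACGCCA",
    "TTGTACGTACGTAA",
]

print(min_pairwise_distance(example_codewords))
print(max_shared_identity(14, 3))
```

Under these assumptions, a codeword set passes the "Hamming distance of at least three" condition when `min_pairwise_distance` returns 3 or more.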
[0011] In some aspects, each of the translation probes further comprises a first translation probe universal region, wherein the first translation probe universal regions of each translation probe have the same sequence. In certain aspects, the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
[0012] In some aspects, each of the translation probes comprises a 5' phosphate. In other aspects, each of the translation probes lacks a 5' phosphate.
[0013] In some aspects, each of the translation probes is between 30 and 200 nucleotides long. For example, each of the translation probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
[0014] In some aspects, each of the auxiliary probes further comprises a first auxiliary probe codeword region, wherein each auxiliary probe in the composition has a unique first auxiliary probe codeword region sequence. In certain aspects, the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region. In certain aspects, each first auxiliary probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first auxiliary probe codeword region is a multiple of 7. For example, each first auxiliary probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
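The codeword length constraint above (between 7 and 98 nucleotides, in multiples of 7) can be expressed as a simple check. This is an illustrative sketch only; the function name is hypothetical.

```python
def is_valid_codeword_length(n: int) -> bool:
    """True if n lies between 7 and 98 inclusive and is a multiple of 7."""
    return 7 <= n <= 98 and n % 7 == 0


# Reproduce the example lengths listed in the text (14 through 98).
listed_lengths = [n for n in range(14, 99) if is_valid_codeword_length(n)]
print(listed_lengths)
```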
[0015] In some aspects, each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region in the composition. In certain aspects, each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition. For example, each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 12 nucleotide positions. Alternatively, each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
[0016] In some aspects, each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region in the composition. In certain aspects, each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition. For example, each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 11 nucleotide positions. Alternatively, each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
[0017] In some aspects, each of the auxiliary probes comprises a 5' phosphate. In particular, each of the auxiliary probes may comprise a 5' phosphate when each of the translation probes lacks a 5' phosphate. In other aspects, each of the auxiliary probes lacks a 5' phosphate. In particular, each of the auxiliary probes may lack a 5' phosphate when each of the translation probes comprises a 5' phosphate.
[0018] In some aspects, the compositions further comprise at least three auxiliary probe protection oligonucleotides, wherein each auxiliary probe protection oligonucleotide comprises a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region of one of the auxiliary probes. In certain aspects, the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region. In certain aspects, the first auxiliary probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
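The length relationships between a probe hybridization region and its protection oligonucleotide can be sketched as a validation routine. The following is a hedged illustration, not a disclosed design procedure: the function names and the 20-nucleotide example sequence are hypothetical, and complementarity is modeled as exact Watson-Crick reverse complementarity over A, C, G, and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_complementary_run(probe_region: str, protector_region: str) -> int:
    """Longest contiguous stretch of the protector that is exactly
    complementary to a stretch of the probe hybridization region."""
    target = reverse_complement(probe_region)
    best = 0
    prev = [0] * (len(target) + 1)
    for ch in protector_region:
        cur = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            if ch == t:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def protector_meets_constraints(probe_region: str, protector_region: str,
                                min_overhang: int = 5, min_run: int = 12,
                                min_probe_len: int = 17) -> bool:
    """Check the three length conditions stated in the text."""
    return (len(probe_region) >= min_probe_len
            and len(probe_region) - len(protector_region) >= min_overhang
            and longest_complementary_run(probe_region, protector_region) >= min_run)


# Hypothetical 20-nt hybridization region and a 14-nt protector covering part of it.
probe_region = "ACGTTGCAAGGCTTACGATC"
protector = reverse_complement(probe_region[:14])
print(protector_meets_constraints(probe_region, protector))
```

In this sketch a 14-nucleotide protector leaves a 6-nucleotide unprotected stretch on the 20-nucleotide hybridization region, which satisfies the "at least 5 nucleotides longer" condition; a 16-nucleotide protector would not.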
[0019] In some aspects, each of the auxiliary probes is between 30 and 200 nucleotides long. For example, each of the auxiliary probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
[0020] In some aspects, the compositions further comprise at least one target nucleic acid molecule comprising a first target region and a second target region, wherein the first target region and the second target region are directly adjacent within the target nucleic acid molecule, wherein the first target region is complementary to the first translation probe hybridization region of one of the translation probes in the composition, wherein the second target region is complementary to the first auxiliary probe hybridization region of one of the auxiliary probes in the composition.
[0021] In one embodiment, methods are provided herein for determining the presence of a target nucleic acid molecule in a sample, the target nucleic acid molecule comprising a known target sequence having a first target region and a second target region that is directly adjacent to the first target region, the method comprising: (a) contacting the sample with at least a first auxiliary probe and at least a first translation probe, wherein the auxiliary probe comprises a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region is complementary to the first target region, and wherein the first translation probe comprises a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region is complementary to the second target region; (b) incubating the product of step (a) under conditions to allow the first auxiliary probe hybridization region to anneal to the first target region and the first translation probe hybridization region to anneal to the second target region, thereby producing an annealed product if the target nucleic acid molecule is present in the sample; (c) incubating the product of step (b) under conditions to allow the ligation of the annealed first auxiliary probe to the annealed first translation probe, thereby producing a ligation product having both the first translation probe codeword region and the first auxiliary probe universal region if the target nucleic acid molecule is present in the sample; and (d) detecting the ligation product, thereby determining the presence of the target nucleic acid molecule in the sample.
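Steps (a) through (c) above can be modeled in silico as a simplified sketch. Several assumptions are made that are not stated in the text: annealing is modeled as exact reverse-complement matching, the first target region is taken to lie upstream (5') of the second target region, and all sequences and function names are hypothetical. Under that orientation the ligated strand reads, 5' to 3', codeword region, translation probe hybridization region, auxiliary probe hybridization region, universal region.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def ligation_product(target: str, aux_hyb: str, aux_universal: str,
                     trans_codeword: str, trans_hyb: str) -> Optional[str]:
    """Return the ligated strand (5' to 3') if both probes can anneal to
    directly adjacent regions of the target, otherwise None."""
    first_region = reverse_complement(aux_hyb)     # site bound by the auxiliary probe
    second_region = reverse_complement(trans_hyb)  # site bound by the translation probe
    if first_region + second_region not in target:
        return None  # regions absent or not directly adjacent: no ligation
    return trans_codeword + trans_hyb + aux_hyb + aux_universal


# Hypothetical sequences, invented for illustration only.
first_region = "GATTACAGGCTTAACGT"
second_region = "CCGTAAGCTTGGATCCA"
codeword = "ACGTACGTACGTAC"
universal = "CTGACCTAGGATCCAGT"
target = "TT" + first_region + second_region + "AA"

product = ligation_product(
    target,
    aux_hyb=reverse_complement(first_region),
    aux_universal=universal,
    trans_codeword=codeword,
    trans_hyb=reverse_complement(second_region),
)
print(product)
```

A product is returned only when both regions are present and directly adjacent, mirroring the requirement that a ligation product forms only if the target is present in the sample.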
[0022] In some aspects, the first translation probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7. For example, each first translation probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
[0023] In some aspects, the first translation probe further comprises a first translation probe universal region. In certain aspects, the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
[0024] In some aspects, step (a) further comprises contacting the sample with at least a first translation probe protection oligonucleotide, wherein the translation probe protection oligonucleotide comprises a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region. In certain aspects, the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region. In certain aspects, the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
[0025] In some aspects, the first translation probe is a modular probe. In certain aspects, the first nucleic acid molecule of the translation probe further comprises a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region. In certain aspects, the translation probe further comprises a second nucleic acid molecule, wherein the second nucleic acid molecule comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule. In certain aspects, the translation probe further comprises a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule. In certain aspects, the third nucleic acid molecule of the translation probe further comprises a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
[0026] In some aspects, the first translation probe is between 30 and 200 nucleotides long. For example, the first translation probe may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
[0027] In some aspects, step (a) further comprises contacting the sample with at least a second translation probe, wherein the second translation probe comprises a second translation probe hybridization region and a second translation probe codeword region, wherein the translation probe hybridization region on each of the first and second translation probes has a unique sequence, wherein the translation probe codeword region on each of the first and second translation probes has a unique sequence. In some aspects, each of the translation probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region in the composition. In certain aspects, each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition. For example, each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions. In some aspects, each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region in the composition. In some aspects, each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition. For example, each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
[0028] In some aspects, the first auxiliary probe further comprises a first auxiliary probe codeword region. In certain aspects, the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
[0029] In some aspects, step (a) further comprises contacting the sample with at least a second auxiliary probe, wherein the second auxiliary probe comprises a second auxiliary probe hybridization region and a second auxiliary probe codeword region, wherein the auxiliary probe codeword region on each of the first and second auxiliary probes has a unique sequence. In some aspects, each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region in the composition. In certain aspects, each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition. For example, each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions. In some aspects, each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region in the composition. In certain aspects, each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition. For example, each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
[0030] In some aspects, the first auxiliary probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first auxiliary probe codeword region is a multiple of 7. For example, each first auxiliary probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
[0031] In some aspects, step (a) further comprises contacting the sample with at least a first auxiliary probe protection oligonucleotide, wherein the auxiliary probe protection oligonucleotide comprises a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region. In certain aspects, the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region. For example, the first auxiliary probe protection oligonucleotide hybridization region may comprise at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
[0032] In some aspects, the first auxiliary probe is between 30 and 200 nucleotides long. For example, the first auxiliary probe may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
[0033] In some aspects, step (a) comprises contacting the sample with a composition of the present embodiments.
[0034] In some aspects, step (c) is performed by incubating the product of step (b) with a ligase. In certain aspects, the first target region is positioned upstream of the second target region, wherein the first auxiliary probe comprises a 5' phosphate. In these aspects, the first translation probe lacks a 5' phosphate. In certain other aspects, the second target region is positioned upstream of the first target region, wherein the first translation probe comprises a 5' phosphate. In these aspects, the first auxiliary probe lacks a 5' phosphate.
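The pairing between target-region orientation and 5' phosphorylation described above can be summarized in a small lookup. This is only a restatement of the two cases in the text as a sketch; the function name is hypothetical.

```python
def five_prime_phosphate_plan(first_region_upstream: bool) -> dict[str, bool]:
    """Map each probe type to whether it carries a 5' phosphate.

    When the first target region is upstream of the second, the auxiliary
    probe carries the 5' phosphate and the translation probe lacks it;
    in the opposite orientation the roles are swapped.
    """
    return {
        "auxiliary": first_region_upstream,
        "translation": not first_region_upstream,
    }


print(five_prime_phosphate_plan(True))
print(five_prime_phosphate_plan(False))
```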
[0035] In some aspects, step (c) is performed chemically. In certain aspects, the first target region is positioned upstream of the second target region, wherein the first auxiliary probe comprises a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid. In these aspects, the first translation probe lacks a 5' functionalization. In certain other aspects, the second target region is positioned upstream of the first target region, wherein the first translation probe comprises a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid. In these aspects, the first auxiliary probe lacks a 5' functionalization.
[0036] In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region. In some aspects, detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe codeword region and the first auxiliary probe universal region that is present in the sample. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
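One of the primer pairings described above combines a primer identical to the translation probe codeword region with a primer complementary to the auxiliary probe universal region. The sketch below illustrates that pairing under stated assumptions: the ligation product is written 5' to 3' with the codeword region upstream of the universal region, "complementary" is taken to mean the reverse complement, and all sequences and function names are hypothetical.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_pair(codeword: str, universal: str) -> dict[str, str]:
    """Primer identical to the codeword, primer complementary to the universal region."""
    return {
        "identical_to_codeword": codeword,
        "complementary_to_universal": reverse_complement(universal),
    }


def amplicon(product: str, sense_primer: str, antisense_primer: str) -> Optional[str]:
    """Return the region of the product spanned by the two primers, or None
    if either primer site is missing or the sites are in the wrong order."""
    start = product.find(sense_primer)
    site = reverse_complement(antisense_primer)
    end = product.find(site)
    if start == -1 or end == -1 or end < start:
        return None
    return product[start:end + len(site)]


# Hypothetical ligation product: codeword + hybridization regions + universal.
codeword = "ACGTTGCAAGGCTA"
universal = "CTGACCTAGGATCCAGT"
product = codeword + "TGGATCCAAGCTTACGG" + "ACGTTAAGCCTGTAATC" + universal

primers = primer_pair(codeword, universal)
print(amplicon(product, primers["identical_to_codeword"],
               primers["complementary_to_universal"]))
```

Because the codeword differs for each translation probe while the universal region is shared, a primer pair of this kind would be specific to one codeword.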
[0037] In some aspects, detecting and/or quantitating the amount of the ligation product comprises performing DNA sequencing. In certain aspects, the DNA sequencing comprises Sanger sequencing, sequencing-by-synthesis, or nanopore sequencing. In certain aspects, detecting and/or quantitating the amount of the ligation product comprises performing Hamming error correction to the sequences obtained for the translation probe and/or auxiliary probe codeword regions.
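The error-correction step can be illustrated with nearest-codeword decoding, a generic technique substituted here for illustration; the text does not specify the decoding procedure or any mapping of nucleotides to code symbols. When every pair of codewords has a Hamming distance of at least three, a read containing a single substitution remains closer to its original codeword than to any other. The codebook, labels, and function names below are hypothetical.

```python
from typing import Optional


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def decode_codeword(read: str, codebook: dict[str, str],
                    max_errors: int = 1) -> Optional[str]:
    """Return the label of the unique codeword within max_errors
    substitutions of the read, or None if there is no such codeword."""
    match = None
    for label, codeword in codebook.items():
        if hamming_distance(read, codeword) <= max_errors:
            if match is not None:
                return None  # ambiguous: more than one codeword is close enough
            match = label
    return match


# Hypothetical codebook with minimum pairwise Hamming distance of three.
codebook = {
    "target_A": "ACGTACGTACGTAC",
    "target_B": "ACGTACGTACGCCA",
    "target_C": "TTGTACGTACGTAA",
}

# A read of target_A's codeword with one substitution at the last position.
print(decode_codeword("ACGTACGTACGTAA", codebook))
```

This sketch handles substitutions only; insertions and deletions in the sequenced codeword would require a different approach.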
[0038] In some aspects, determining the presence of a target nucleic acid molecule does not comprise a step of bead capture. In some other aspects, determining the presence of a target nucleic acid molecule further comprises a step of bead capture.
[0039] In some aspects, the target nucleic acid molecule comprises DNA. In some aspects, the target nucleic acid molecule comprises RNA.
[0040] In some aspects, if the ligation product is not detected, then the target sequence is determined to not be present in the sample. In some aspects, if the ligation product is detected, then the target sequence is determined to be present in the sample.
[0041] In one embodiment, methods are provided herein for determining the presence of a plurality of target nucleic acid molecules in a sample, each target nucleic acid molecule comprising a known target sequence having a first target region and a second target region that is directly adjacent to the first target region, the method comprising: (a) contacting the sample with at least two auxiliary probes and at least two translation probes, wherein the auxiliary probes each comprise a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region of each auxiliary probe has a unique sequence, wherein the first auxiliary probe universal regions of each auxiliary probe have the same sequence, wherein the first auxiliary probe hybridization region is complementary to the first target region of one of the plurality of target nucleic acid molecules, and wherein the translation probes each comprise a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region of each translation probe has a unique sequence, wherein the first translation probe codeword region of each translation probe has a unique sequence, wherein the first translation probe hybridization region is complementary to the second target region of one of the plurality of target nucleic acid molecules; (b) incubating the product of step (a) under conditions to allow the first auxiliary probe hybridization regions to anneal to the first target regions and the first translation probe hybridization regions to anneal to the second target regions, thereby producing annealed products if the target nucleic acid molecules are present in the sample; (c) incubating the product of step (b) under conditions to allow the ligation of the auxiliary probe to the translation probe annealed to a known target sequence, thereby producing ligation products having both a first translation probe codeword region and a first auxiliary probe universal region if one of the target nucleic acid molecules is present in the sample; and (d) detecting the ligation products, thereby determining the presence of the target nucleic acid molecules in the sample.
[0042] In some aspects, the translation probes are modular probes. In certain aspects, the first nucleic acid molecules of the translation probes further comprise a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region. In certain aspects, the translation probes further comprise a second nucleic acid molecule, wherein each of the second nucleic acid molecules comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule. In certain aspects, the translation probes further comprise a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule. In certain aspects, the third nucleic acid molecules of the translation probes further comprise a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
[0043] In some aspects, the translation probes are between 30 and 200 nucleotides long. For example, the translation probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein. In some aspects, each of the translation probe codeword regions is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7. For example, each first translation probe codeword region may be 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, or 98 nucleotides long.
[0044] In some aspects, each of the translation probe codeword regions lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region. In certain aspects, each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition. For example, each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions share sequence identity at more than 12 nucleotide positions. Alternatively, each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions share sequence identity at more than 19 nucleotide positions.
[0045] In some aspects, each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region. In certain aspects, each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition. For example, each of the translation probe codeword regions may be 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions. Alternatively, each of the translation probe codeword regions may be 14 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 11 nucleotide positions.
[0046] In some aspects, each of the translation probes further comprises a first translation probe universal region, wherein the first translation probe universal regions of each translation probe have the same sequence. In certain aspects, the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
[0047] In some aspects, step (a) further comprises contacting the sample with at least two first translation probe protection oligonucleotides, wherein the translation probe protection oligonucleotides comprise a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region of one of the translation probes. In certain aspects, the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region. For example, the first translation probe protection oligonucleotide hybridization region may comprise at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
[0048] In some aspects, each of the auxiliary probes further comprises a first auxiliary probe codeword region. In certain aspects, the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
[0049] In some aspects, each of the auxiliary probe codeword regions lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region. In certain aspects, each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition. For example, each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 12 nucleotide positions. Alternatively, each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 19 nucleotide positions.
[0050] In some aspects, each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region. In certain aspects, each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition. For example, each of the auxiliary probe codeword regions may be 14 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 11 nucleotide positions. Alternatively, each of the auxiliary probe codeword regions may be 21 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 18 nucleotide positions.
[0051] In some aspects, step (a) further comprises contacting the sample with at least two first auxiliary probe protection oligonucleotides, wherein the auxiliary probe protection oligonucleotides comprise a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region of one of the auxiliary probes. In certain aspects, the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region. For example, the first auxiliary probe protection oligonucleotide hybridization region may comprise at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
[0052] In some aspects, the auxiliary probes are between 30 and 200 nucleotides long. For example, the auxiliary probes may be 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nucleotides long, or any nucleotide length derivable therein.
[0053] In some aspects, step (a) comprises contacting the sample with a composition of the present embodiments.
[0054] In some aspects, step (c) is performed by incubating the product of step (b) with a ligase. In certain aspects, the first target regions are positioned upstream of the second target regions, wherein the first auxiliary probes comprise a 5' phosphate. In these aspects, the first translation probes may lack a 5' phosphate. In certain other aspects, the second target regions are positioned upstream of the first target regions, wherein the first translation probes comprise a 5' phosphate. In these aspects, the first auxiliary probes may lack a 5' phosphate. In some aspects, some targets within a single sample have their first target region positioned upstream of their second target region while other targets within the same single sample have their second target regions positioned upstream of their first target regions.
[0055] In some aspects, step (c) is performed chemically. In certain aspects, the first target regions are positioned upstream of the second target regions, wherein the first auxiliary probes comprise a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid. In these aspects, the first translation probes may lack a 5' functionalization. In certain other aspects, the second target regions are positioned upstream of the first target regions, wherein the first translation probes comprise a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid. In these aspects, the first auxiliary probes may lack a 5' functionalization. In some aspects, some targets within a single sample have their first target region positioned upstream of their second target region while other targets within the same single sample have their second target regions positioned upstream of their first target regions.
[0056] In some aspects, detecting the ligation products in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region. In some aspects, detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe codeword region and the first auxiliary probe universal region that is present in the sample. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
[0057] In some aspects, detecting the ligation products in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe universal region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In some aspects, detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe universal region. In some aspects, detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe universal region and the first auxiliary probe universal region that is present in the sample. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe universal region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region. In certain aspects, quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe universal region.
[0058] In some aspects, detecting and/or quantitating the amount of the ligation product comprises performing DNA sequencing. In certain aspects, the DNA sequencing comprises Sanger sequencing, sequencing-by-synthesis, or nanopore sequencing. In some aspects, detecting and/or quantitating the amount of the ligation product comprises performing Hamming error correction to the sequences obtained for the translation probe and/or auxiliary probe codeword regions.
[0059] In some aspects, determining the presence of a target nucleic acid molecule does not comprise a step of bead capture. In some other aspects, determining the presence of a target nucleic acid molecule further comprises a step of bead capture.
[0060] In some aspects, the target nucleic acid molecules comprise DNA. In some aspects, the target nucleic acid molecules comprise RNA.
[0061] In some aspects, if the ligation products are not detected, then the target sequences are determined to not be present in the sample. In some aspects, if the ligation products are detected, then the target sequences are determined to be present in the sample.
[0062] As used herein, "essentially free," in terms of a specified component, is used herein to mean that none of the specified component has been purposefully formulated into a composition and/or is present only as a contaminant or in trace amounts. The total amount of the specified component resulting from any unintended contamination of a composition is therefore well below 0.05%, preferably below 0.01%. Most preferred is a composition in which no amount of the specified component can be detected with standard analytical methods.
[0063] As used herein in the specification, "a" or "an" may mean one or more. As used herein in the claim(s), when used in conjunction with the word "comprising," the words "a" or "an" may mean one or more than one.
[0064] The use of the term "or" in the claims is used to mean "and/or" unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and "and/or." As used herein "another" may mean at least a second or more.
[0065] Throughout this application, the term "about" is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.
[0066] Other objects, features and advantages of the present disclosure will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only, since various changes and modifications within the spirit and scope of the disclosure will become apparent to those skilled in the art from this detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0067] The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present disclosure. The disclosure may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
[0068] FIGS. 1A-B: Sequence encoding. FIG. 1A: An overview of the general concept of sequence encoding. A target biological DNA or RNA sequence (SEQ ID NO: 87) is stoichiometrically converted into a predesigned Codeword sequence (SEQ ID NO: 88). FIG. 1B: A diagram illustrating how sequence encoding is implemented using a translator probe (left) and an auxiliary probe (right). The translator probe comprises regions 5, 6, and 7, and the auxiliary probe comprises regions 3 and 4. The target comprises regions 1 and 2, which are complementary to regions 5 and 4, respectively. Region 6 is a Codeword representing the sequence of region 1 of the target. Region 3 is a universal sequence conserved across all auxiliary probes.
[0069] FIG. 2: Sequence encoding using multiple translators. A sample suspected to possess either a 1a, 1b, or 1c sequence is reacted with three different translators with corresponding regions 5a, 5b, or 5c. In this figure, the 1b sequence is present, so the translator with region 5b is ligated to the auxiliary probe.
[0070] FIGS. 3A-B: Sequence encoding using a single-stranded translator probe. FIG. 3A: The matched translator probe (probe 5b/6b; SEQ ID NO: 2) was designed against a 20 nt sequence around the SNP rs1509186. A mismatched translator probe (probe 5a/6a; SEQ ID NO: 3) was designed against a 20 nt sequence bearing consecutive 5 nt mismatches from the target. The auxiliary probe (probe 3/4; SEQ ID NO: 1) was designed against a 33 nt adjacent sequence. FIG. 3B: Experimental qPCR results using single-stranded translator probes. The three probes, each 1 nM, were reacted with 50 ng/well human genomic DNA (sample NA18562) at 64°C in 1x HiFi Taq Ligase buffer for 15 minutes. Then HiFi Taq Ligase was added to the solution and the ligation reaction proceeded at 64°C for 15 minutes. The ligated product was detected by qPCR with two sets of primers (fp1+rp1 and fp1+rp2). The primer hybridizing to the matched translator probe's Codeword is the reverse primer 1 (rp1; SEQ ID NO: 85). The primer hybridizing to the mismatch translator probe's Codeword is the reverse primer 2 (rp2; SEQ ID NO: 86). The primer hybridizing to the sequencing adapter on the auxiliary probe is the forward primer 1 (fp1; SEQ ID NO: 15).
[0071] FIGS. 4A-B: Sequence encoding using a double-stranded translator probe. FIG. 4A: The matched translator probe (SEQ ID NOS: 4 and 5) was designed against a 35 nt sequence around the SNP rs3217424, where the SNP at rs3217424 is C. The mismatched translator probe (SEQ ID NOS: 10 and 11) was designed against a 35 nt sequence bearing a single-nucleotide mismatch, where the SNP at rs3217424 is G. The auxiliary probe (SEQ ID NO: 8) was designed against a 15 nt sequence downstream of the translator probe-targeted region. FIG. 4B: Experimental qPCR results using double-stranded translator probes to discriminate a single-nucleotide polymorphism (SNP) in genomic DNA. The three probes, each 2 nM, were reacted together with 50 ng/well human genomic DNA (sample NA18537; SNP at rs3217424 is C) at 50°C in 1x HiFi Taq Ligase buffer for 2 hours. Then HiFi Taq Ligase was added to the solution and the ligation reaction proceeded at 50°C for 15 minutes. The ligated product was detected by qPCR with two sets of primers (fp1+rp1 and fp1+rp2). The primer hybridizing to the matched translator probe's Codeword is the reverse primer 1 (rp1; SEQ ID NO: 9). The primer hybridizing to the mismatch translator probe's Codeword is the reverse primer 2 (rp2; SEQ ID NO: 12). The primer hybridizing to the sequencing adapter on the auxiliary probe is the forward primer 1 (fp1; SEQ ID NO: 15).
[0072] FIG. 5: Sanger sequencing results of qPCR amplicons from FIG. 4B. Sanger sequencing was performed after 40 cycles. The underlined sequence matches the Codeword for the correct translator probe. The sequence shown corresponds to SEQ ID NO: 89.
[0073] FIGS. 6A-F: Hamming encoding of DNA Codewords. FIG. 6A: Every 4-nt DNA word is appended with three additional error correction nucleotides. FIG. 6B: The error correction nucleotides x, y, and z are designed to satisfy the three error correction equations displayed. Note that modular arithmetic is used: 2, 6, 10, and 14 are all equal to 2 in mod 4. FIG. 6C: Error correction via a (7,4) Hamming encoding. In the left panel, the second nucleotide b is mutated T>C, resulting in two of the three error correction equations being violated. Because b is the only variable to appear in both the first and second equations, it is clear that b was mutated; simple modular arithmetic shows that the proper value of b should be T = 2 to allow the equations to be satisfied. In the right panel, a nucleotide deletion results in violation of multiple equations. FIG. 6D: Out of the 256 (7,4) Hamming codes, 216 are amenable for serving as Codewords due to properties of DNA synthesis and sequencing. FIG. 6E: Hamming encoding greatly decreases the error rate of sequence encoding. The top line represents "Uncoded (12nt)"; the middle line represents "Correcting (21nt)"; the bottom line represents "Detecting (21nt)." FIG. 6F: Computational workflow for interpreting NGS reads.
[0074] FIG. 7: Sequence encoding embodiment with sequence adaptors to facilitate downstream NGS analysis. The translator probe comprises a new region 7. Regions 3 and 7 serve as sequencing adaptors for NGS.
[0075] FIG. 8: Proposed workflow for sequence translation, library preparation, and NGS. The entire process is expected to take less than eight hours, including sequencing and bioinformatic analysis.
[0076] FIGS. 9A-B: Experimental NGS results on 22-plex translation. FIG. 9A: Eleven translators were designed to subsequences of different human genes, and 11 translators were designed to mouse homologs of the human genes. For development and optimization purposes, experimental NGS used 150 sequencing cycles. The gray dots show the number of reads aligned to each human Codeword, and the black dots show the number of reads mapped to mouse Codewords. Further analysis of the mouse Codeword sequences revealed that roughly 90% of these did not contain any auxiliary probe sequence, and likely correspond to non-specific dimers with sequencing adaptors. The overall ratio of reads between human Codewords and mouse Codewords with auxiliary sequence is over 500. FIG. 9B: Each translator probe's Codeword was 21 nt long, as described in FIG. 6. The Codeword error correction system was able to recover roughly 4% of the library. The 11% of the library that could not be aligned to any Codeword likely represents adaptor dimers not perfectly removed by size selection.
[0077] FIGS. 10A-B: Sequence encoding using M-Probes as translator probes. FIG. 10A: The matched translator probe (formed by SEQ ID NOS: 72-78) was designed against a 104 nt sequence around the SNP rs2775256. The mismatched translator probe (formed by SEQ ID NOS: 72-75 and 78-80) was designed against a 104 nt sequence bearing consecutive 6-nt mismatches from the target. The auxiliary probe (formed by SEQ ID NOS: 81-82) was designed against a 37 nt sequence downstream of the translator probe-targeted region. FIG. 10B: Experimental qPCR results for an M-Probe translator. The ligated product was detected by qPCR with two sets of primers (fp1+rp1 and fp1+rp2). The primer hybridizing to the matched translator probe's Codeword is the reverse primer 1 (rp1; SEQ ID NO: 83). The primer hybridizing to the mismatch translator probe's Codeword is the reverse primer 2 (rp2; SEQ ID NO: 84). The primer hybridizing to the sequencing adapter on the auxiliary probe is the forward primer 1 (fp1; SEQ ID NO: 15).
[0078] FIG. 11: Sequence encoding using Codewords on both the translator probe and the auxiliary probe to overcome potential nonspecific ligation. NGS reads with inconsistent Codewords can be excluded from interpretation as they likely result from nonspecific ligation.
DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0079] Currently in clinical reference laboratories, it takes at least three full days to go from sample (e.g., blood, biopsy) to answer (e.g., a clinical recommendation) for NGS laboratory developed tests (LDTs), with each step of library preparation, sequencing, and bioinformatic alignment taking a full day or more. Sequence encoding offers the potential to significantly accelerate the clinical NGS workflow to under one day, via reduced cycle sequencing and simplified sequence alignment and interpretation. Accelerating NGS LDTs will have positive effects on clinical outcomes for rapidly progressing diseases.
[0080] Another limitation of NGS studies on the human genome or transcriptome is that NGS detection and quantification of sequence variants, such as drug resistance mutations, is hampered by the intrinsic NGS error rate, commonly estimated to be between 0.1% and 1% for the Illumina platforms. Based on the intrinsic error rate, most commercial LDTs state mutation limits of detection of between 1% and 5%. Suppression of sequencing errors using molecular barcodes (e.g., SafeSeq, CAPPseq, DuplexSeq) greatly increases the sequencing depth required, which in turn increases sequencing cost. In contrast, in sequence encoding, Codewords can be designed to be orthogonal and error-correcting, essentially eliminating the problem of NGS intrinsic error. This reduces the complexity of bioinformatic variant calling and potentially allows easier and more accurate conversion of NGS data into clinical recommendations.
[0081] Furthermore, the read length limitation of Illumina and Ion Torrent (the two leading NGS platforms) means that long DNA sequence variants are challenging to profile. These include structural variants, such as gene fusions and translocations, as well as long repetitive sequences, such as variable number tandem repeats and long interspersed nuclear elements. On the other hand, long-read sequencing platforms (e.g., Pacific Biosciences and Oxford Nanopore) can handle long DNA sequences, but have low throughput and high error rates, rendering these platforms unlikely candidates for FDA clearance for in vitro diagnostic (IVD) use. Diagnostic profiling of long sequence variants should thus be based on short-read sequencing to facilitate clinical adoption. Combining sequence encoding with M-Probe technology that is capable of specific hybridization for targets over 500 nucleotides in length will enable long and/or difficult DNA sequence variants to be profiled by short-read sequencing.
I. Symbolic Sequencing
[0082] A typical 300 nucleotide (nt) NGS read can uniquely specify 4^300 ≈ 4 × 10^180 sequences. In contrast, the entire human genome is only 3 × 10^9 nt long, and there are fewer than 10^8 known DNA sequence variants. The number of RNA transcripts and transcript variants is even smaller: 3 × 10^4 genes, and likely fewer than 10^7 total RNA splice variants. This gross numerical mismatch points to an enormous NGS inefficiency when applied to profiling the human genome/transcriptome, or any other sample in which a reference genome or transcriptome is known.
[0083] Direct enumeration of all known variants and transcripts could in principle be achieved with fewer than 16 nt (4^16 ≈ 4.3 × 10^9). In other words, a unique numerical identifier between 1 and 4.3 billion is assigned to every single known DNA variant and RNA transcript, and then that identifier is encoded in modular base 4, implemented in DNA nucleotides (e.g., A = 1, T = 2, C = 3, and G = 0). Due to the exponential nature of information, sequence encoding is highly scalable in the face of discoveries of new genomic variants and alternative RNA splice variants. For example, using 25 nt Codewords allows for more than 10^15 numerical identifiers, over 10 million-fold more than all variants known to date.
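The enumeration described above can be illustrated with a short base-4 conversion. This is a minimal sketch using the digit assignment given in the paragraph (A = 1, T = 2, C = 3, G = 0); the function names, the most-significant-digit-first ordering, and the fixed-length zero padding with G are assumptions made for illustration. Unlike the text, which numbers identifiers from 1, the sketch also accepts 0.

```python
# Digit assignment taken from the paragraph above.
DIGIT_TO_NT = {0: "G", 1: "A", 2: "T", 3: "C"}
NT_TO_DIGIT = {nt: digit for digit, nt in DIGIT_TO_NT.items()}


def encode_identifier(identifier: int, length: int = 16) -> str:
    """Write a numerical identifier as a fixed-length base-4 nucleotide string."""
    if not 0 <= identifier < 4 ** length:
        raise ValueError("identifier does not fit in the requested length")
    nucleotides = []
    for _ in range(length):
        identifier, digit = divmod(identifier, 4)
        nucleotides.append(DIGIT_TO_NT[digit])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(nucleotides))


def decode_codeword(codeword: str) -> int:
    """Recover the numerical identifier from a base-4 nucleotide string."""
    value = 0
    for nt in codeword:
        value = value * 4 + NT_TO_DIGIT[nt]
    return value


print(encode_identifier(27, length=4))   # 27 = 0*64 + 1*16 + 2*4 + 3
print(4 ** 16, 4 ** 25)                   # capacity of 16-nt and 25-nt words
```

A plain positional encoding of this kind carries no error protection; it only shows how few nucleotides are needed to enumerate the identifiers.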
A. Sequence Encoding
[0084] Sequence encoding offers significant advantages in speed, accuracy, and ease of interpretation over the direct or "literal" NGS that is in use today. First, sequencing by synthesis takes between 5 and 10 minutes per cycle, depending on the exact platform. Consequently, a 300 to 600 cycle NGS run will typically require 1 to 3 full days. With 21 nt Codewords, the actual sequencing time can be reduced by 15- to 30-fold, to 2 hours.
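The timing figures above follow from simple arithmetic on the per-cycle time. The sketch below assumes run time is just cycles multiplied by minutes per cycle, ignoring instrument setup, cluster generation, and index reads; the function name is hypothetical.

```python
def run_time_hours(cycles: int, minutes_per_cycle: float) -> float:
    """Sequencing time in hours, counting only the sequencing cycles."""
    return cycles * minutes_per_cycle / 60


for cycles in (21, 300, 600):
    fastest = run_time_hours(cycles, 5)
    slowest = run_time_hours(cycles, 10)
    print(f"{cycles} cycles: {fastest:.2f} to {slowest:.2f} hours")

# Fold reduction in cycle count relative to a 21-cycle Codeword read.
print(300 / 21, 600 / 21)
```

Under these assumptions a 21-cycle read takes roughly 2 to 3.5 hours of sequencing, and the cycle count drops by about 14- to 29-fold relative to 300 and 600 cycle runs.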
[0085] Second, sequencing by synthesis chemistry is imperfect and results in incorrect base calls with a probability of between 0.1% and 1% (intrinsic error rate). This error rate is especially problematic for detection and quantitation of single nucleotide variants. Sequence encoding allows the designer to intentionally map closely-related sequences into highly distinct Codewords, and essentially eliminates the impact of sequencing error. For example, one could encode wild-type KRAS as "ATATCCC," KRAS-G12D as "ATATGAG," and KRAS-G12V as "ATATCCA." Every Codeword is different by three nucleotides, so it would require an extremely unlikely three simultaneous NGS base call errors for one Codeword to be misinterpreted as the other.
[0086] Furthermore, different NGS platforms have known classes of sequences on which they struggle (e.g., GC-rich sequences for Illumina, homopolymer sequences for Ion Torrent). Codewords can be designed to avoid these problematic sequences to decrease intrinsic error rate.
[0087] Third, bioinformatic alignment of NGS reads to the human genome is often a computationally intensive and imperfect process, because of the size and the repetitiveness of the human genome. In contrast, matching NGS reads to a set of designed Codewords is computationally trivial via Hash Table or Suffix Tree exact string matching algorithms, and essentially error-free.
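The hash table matching mentioned above can be sketched with an ordinary Python dictionary. This is an illustration only: the Codeword-to-target table reuses the KRAS Codewords quoted earlier in this section, while the function name, the `offset` parameter, and the assumption that the Codeword sits at a fixed position in every read are hypothetical.

```python
from collections import Counter

# Codeword table; the three entries reuse the KRAS example from the text.
CODEWORD_TO_TARGET = {
    "ATATCCC": "KRAS wild-type",
    "ATATGAG": "KRAS-G12D",
    "ATATCCA": "KRAS-G12V",
}


def count_codewords(reads, codeword_to_target, offset=0):
    """Tally reads per target by exact dictionary lookup of the Codeword.

    Assumes every Codeword has the same length and starts at `offset`
    within the read. Returns (per-target counts, number of unmatched reads).
    """
    length = len(next(iter(codeword_to_target)))
    counts = Counter()
    unmatched = 0
    for read in reads:
        target = codeword_to_target.get(read[offset:offset + length])
        if target is None:
            unmatched += 1
        else:
            counts[target] += 1
    return counts, unmatched


example_reads = ["ATATCCC", "ATATGAG", "ATATCCC", "ATATTTT"]
print(count_codewords(example_reads, CODEWORD_TO_TARGET))
```

Each lookup costs time proportional to the Codeword length and does not depend on how many Codewords are in the table, which is why this step scales to high multiplex.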
[0088] Finally, the method of detecting, identifying, or quantifying nucleic acid molecules by sequence encoding does not require a step of bead capture, and thus can avoid the bead-washing steps required following bead capture, because ligation alone is sufficient to exclude unbound molecules.
B. Sequence Encoding and NGS readout
[0089] FIG. 7 shows one embodiment of sequence encoding for analysis by next generation sequencing (NGS), also known as sequencing-by-synthesis. In this embodiment, the translator probe further comprises a region 7. Regions 3 and 7 are adaptors for NGS amplification or index appending, so nucleic acid molecules lacking either region 3 or region 7 will not be sequenced. Thus, only ligation products will be analyzed by NGS.
[0090] In FIG. 8, one potential workflow for sequence encoding and NGS analysis is provided. The entire sample-to-answer workflow is expected to require less than eight hours total when optimized and using only 21 sequencing cycles, and represents a significant speedup over current NGS-based laboratory developed tests (LDTs). Further speeding up this workflow requires shortening the bottleneck, Step 2, the hybridization of genomic DNA to the variant and auxiliary probes. Hybridization kinetics are primarily determined by the individual concentration of each probe. For low- to medium-plex translation, individual probe concentrations can be quite high and hybridization quite fast; for example, 5 nM per probe for a set of 1000-plex translators corresponds to a hybridization half-life of roughly three minutes. The four hours listed for Step 2 assumes 100,000-plex translators and is similar to the times allotted for hybridization by commercial whole exome capture panels. Ligation by NEB Quick Ligase has a half-life of one minute and does not significantly contribute to overall time.
II. Rational Design of Error-Resilient Codewords
[0091] Naive design of Codeword sequences can result in sequence encoding being susceptible to NGS intrinsic error. In the field of signal processing, passing messages across faulty channels (e.g., the Internet) has led to the development of error correcting and error detecting codes. These ideas can be directly applied in Codeword design. Because Illumina sequencing errors are predominantly base replacements (as opposed to insertions or deletions), Hamming encoding is well-suited for sequence encoding.
[0092] To review, the simplest (7,4) Hamming code inserts 3 error-correcting bits for every 4 bit message (longer messages are first broken up into 4 bit words). All 7-bit instances of the Hamming code have the property that they are at least Hamming distance 3 from any other instance - that is to say, one would need to change at least 3 bits in order to transform one Hamming code instance into another. This property means that (7,4) Hamming codes are correcting for up to one error, and tolerant for up to two errors: The original sequence can be restored from any sequence mutated by one base; more conservatively, any sequence with two mutations will not match any other Codewords and can be excluded.
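The binary (7,4) Hamming code reviewed above can be sketched as follows. This is the classic binary code in a systematic layout (four data bits followed by three parity bits), shown only to illustrate the distance-3 and single-error-correction properties; it is not the mod-4 nucleotide encoding of FIG. 6B, whose equations are not reproduced here. The particular parity coverage and all names are assumptions.

```python
from itertools import product

# Data-bit indices covered by each of the three parity bits (one common choice).
PARITY_COVERAGE = ((0, 1, 3), (0, 2, 3), (1, 2, 3))


def encode(data):
    """Append three parity bits to a 4-bit message."""
    data = list(data)
    parity = [sum(data[i] for i in cover) % 2 for cover in PARITY_COVERAGE]
    return data + parity


def syndrome(word):
    """Evaluate the three parity checks; all zeros means no error detected."""
    return tuple((sum(word[i] for i in cover) + word[4 + k]) % 2
                 for k, cover in enumerate(PARITY_COVERAGE))


# Each single-bit error produces a distinct non-zero syndrome.
SYNDROME_TO_POSITION = {}
for position in range(7):
    error = [0] * 7
    error[position] = 1
    SYNDROME_TO_POSITION[syndrome(error)] = position


def correct(word):
    """Flip the single bit indicated by the syndrome, if any check fails."""
    word = list(word)
    s = syndrome(word)
    if any(s):
        word[SYNDROME_TO_POSITION[s]] ^= 1
    return word


def min_distance():
    """Smallest Hamming distance between any two of the 16 code instances."""
    codewords = [encode(bits) for bits in product((0, 1), repeat=4)]
    return min(sum(a != b for a, b in zip(u, v))
               for i, u in enumerate(codewords) for v in codewords[i + 1:])


received = encode([1, 0, 1, 1])
received[2] ^= 1                      # simulate one substitution error
print(correct(received), min_distance())
```

With two errors in one 7-bit word, the syndrome is still non-zero, so the error is detected, but `correct` would flip the wrong bit; this is the trade-off between the correcting and the more conservative detecting mode described above.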
[0093] For example, the (7,4) DNA encoding shown in FIG. 6 can be used. The assignment of A, T, C, and G to numerical values and the design of the error check equations are selected such that long homopolymers and extremal G/C content are rare. Manual pruning of the 256 possible (7,4) Hamming codes removes 40 sequences that can contribute to homopolymers of more than 5 nt (via having a homopolymer of length 3 at the beginning or end of the Codeword) or have G/C content of >75% or <25%, resulting in 216 good (7,4) nt Codeword segments.
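The pruning criteria above can be expressed as a simple filter over 7-nt segments. This is a minimal sketch of the two stated rules only (a homopolymer of length 3 at either end, or G/C content above 75% or below 25%); it does not enumerate the 256 encoded segments, because that requires the FIG. 6B equations, and so it does not by itself reproduce the count of 216. Function names are hypothetical.

```python
def gc_fraction(segment: str) -> float:
    """Fraction of G and C nucleotides in the segment."""
    return sum(nt in "GC" for nt in segment) / len(segment)


def has_terminal_homopolymer(segment: str, run: int = 3) -> bool:
    """True if the first or last `run` nucleotides are all identical."""
    return len(set(segment[:run])) == 1 or len(set(segment[-run:])) == 1


def is_usable_segment(segment: str) -> bool:
    """Apply both pruning rules to a candidate Codeword segment."""
    return (not has_terminal_homopolymer(segment)
            and 0.25 <= gc_fraction(segment) <= 0.75)


for candidate in ("ATCGATC", "AAATCGC", "GCGCGCG", "ATATATA"):
    print(candidate, is_usable_segment(candidate))
```

The terminal-run rule matters because two adjacent 7-nt segments, each ending or starting with a run of three identical nucleotides, could otherwise join into a homopolymer longer than 5 nt.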
[0094] For demonstration purposes, 21 nt Codewords, corresponding to three (7,4) Codeword segments that can enumerate over 10 million distinct Codewords, were used. These Codewords can correct 1 nt error every 7 nt, or tolerate 2 nt errors every 7 nt. FIG. 6E shows the effective error rate of the Codewords given different intrinsic error rates; at 1% intrinsic error rate, the proposed Codewords exhibit roughly 0.6% error rate when NGS reads unmatched to any designed Codeword are corrected, and 0.01% error when unmatched NGS reads are discarded. These are roughly 20-fold and 1000-fold better than a naive encoding with no error correction.
[0095] Correction of NGS reads that do not match any designed Codeword is done at the level of satisfying the error-checking equations in FIG. 6B, and does not require knowledge of the designed Codeword sequences. The time complexity of this operation is O(M), where M is the length of the Codeword (here M = 21). After correcting or discarding NGS reads that do not exactly match any designed Codeword, a Suffix Tree algorithm can be used to perform exact string matching on the designed Codewords (FIG. 6F). Suffix Tree is extremely rapid, with runtime complexity of O(M); importantly it has no dependence on the number of Codewords designed, and thus scales well to high multiplex.
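The error rates quoted above for FIG. 6E can be approximated with a simple binomial model. The sketch below is an illustration under stated assumptions, not necessarily how FIG. 6E was computed: base call errors are treated as independent substitutions, a corrected segment is counted as wrong whenever it contains two or more errors, and a segment in detecting mode is counted as wrong whenever it contains three or more errors (an upper bound, since not every triple error lands on another Codeword). All function names are hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more errors among n independent base calls."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


def uncoded_error(p: float, length: int = 12) -> float:
    """Any error in an unprotected word of `length` nt changes its meaning."""
    return 1 - (1 - p) ** length


def correcting_error(p: float, segments: int = 3, n: int = 7) -> float:
    """Correction fails if any 7-nt segment carries two or more errors."""
    return 1 - (1 - prob_at_least(2, n, p)) ** segments


def detecting_error_bound(p: float, segments: int = 3, n: int = 7) -> float:
    """Upper bound: a miscall needs three or more errors in some segment."""
    return 1 - (1 - prob_at_least(3, n, p)) ** segments


p = 0.01  # 1% intrinsic error rate
print(f"uncoded 12 nt:    {uncoded_error(p):.4%}")
print(f"correcting 21 nt: {correcting_error(p):.4%}")
print(f"detecting 21 nt:  {detecting_error_bound(p):.4%}")
```

At a 1% intrinsic error rate this model gives roughly 11% for the uncoded 12-nt word, about 0.6% in correcting mode, and about 0.01% in detecting mode, which is consistent with the approximately 20-fold and 1000-fold improvements stated above.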
III. Nucleic Acid Probes
[0096] In some embodiments, the present disclosure provides synthetic oligonucleotide probes for use in sequence encoding. In particular embodiments, the adaptors are single-stranded probes. In other embodiments, the adaptors are at least partially double-stranded probes. The oligonucleotide probes can have a length of 30 to 200 nucleotides, particularly 50 to 100 nucleotides, such as between 60 and 70 nucleotides. Exemplary structures of the probes are provided in FIGS. 1, 4A, 8, 11A, and 12. Further, the probes can comprise part or all of sequencing primer sequences or their binding sites, such as index sequencing primers for particular sequencing platforms (e.g., Illumina index primers).
A. Single-Stranded Translator Implementation of Sequence Encoding
[0097] Sequence encoding is implemented via a pair of DNA hybridization probes that are conditionally ligated using the target sequence as a splint (FIG. 1B). The left probe, called the translator probe, bears a Codeword (region 6, boxed) that uniquely specifies both the gene and the variant. The right probe, called the auxiliary probe, bears a universal region 3 that is identical for all auxiliary probes. In some embodiments, as pictured in FIG. 1B, the translator probe is functionalized with a phosphate at the 5' end. Only when both the translator probe and the auxiliary probe bind adjacent DNA subsequences is the 5' end of the translator probe flush against the 3' end of the auxiliary probe to allow subsequent ligation. Multiple translator probes can be used simultaneously (FIG. 2), to allow either sequence encoding of multiple target sequences or to identify which one of several potential target sequences is present in a sample.
B. Double-Stranded Translator Implementation of Sequence Encoding
[0098] The molecular specificity of the translator and auxiliary probes is beneficial to accurate inference of genomic DNA variants based on Codeword analysis. Nonspecific binding of variant probes and auxiliary probes to other genomic loci would result in false positive results. The ligation process helps improve specificity, but is insufficient on its own to ensure accurate translation of target DNA sequences into Codewords.
[0099] In another embodiment (FIG. 4A), a protector oligonucleotide comprising a region 8 that is partially complementary to region 5 is introduced. Importantly, at least five continuous nucleotides on region 5 are not bound to the protector, i.e., form a toehold, in order to allow initiation of hybridization between the target and the translator probe. This protector oligonucleotide can improve the specificity of hybridization reactions (see Zhang et al., 2012, Wang and Zhang, 2015, U.S. Pat. No. 9,284,602, and U.S. Pat. Publn. No. 2016/0340727, each of which is incorporated herein by reference in its entirety), and maintains high sequence selectivity across a large range of temperatures and buffer conditions. In some aspects, the protector oligonucleotide is present in molar excess.
[00100] In some embodiments, the nucleic acid probes are rationally designed so that the standard free energy for hybridization (e.g., theoretical standard free energy) between the specific target nucleic acid molecule and the region 5 is close to zero, while the standard free energy for hybridization between a spurious target (even one differing from the specific (actual) target by as little as a single nucleotide) and the probe is high enough to make their binding unfavorable by comparison.
[00101] As shown in FIG. 4A, the "toehold" region is present in region 5, is complementary to a target sequence 1 and not complementary to a protector region 8. The sequence of the complementary regions is rationally designed to achieve this matching under desired conditions of temperature and probe concentration. As a result, the equilibrium for the actual target and probe rapidly approaches 50% target:probe::protector:probe (or whatever ratio is desired), while equilibrium for the spurious target and primer greatly favors protector:probe.
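The relationship between a near-zero standard free energy and the roughly 50% equilibrium described above can be illustrated numerically. The sketch below is a simplified model, assuming a single exchange reaction (target + protector:probe ⇌ target:probe + protector) with equilibrium constant K = exp(-ΔG°/RT) and a fixed ratio of free target to free protector; the function name, the 50°C temperature, and the 3 kcal/mol mismatch penalty are illustrative assumptions, not values from the disclosure.

```python
from math import exp

GAS_CONSTANT_KCAL = 1.987e-3  # kcal / (mol * K)


def fraction_probe_on_target(delta_g_kcal: float, temp_c: float,
                             target_to_protector: float = 1.0) -> float:
    """Equilibrium fraction of probe bound to target rather than protector.

    delta_g_kcal: standard free energy of the exchange reaction (kcal/mol).
    target_to_protector: ratio of free target to free protector, held fixed.
    """
    k_eq = exp(-delta_g_kcal / (GAS_CONSTANT_KCAL * (temp_c + 273.15)))
    ratio = k_eq * target_to_protector   # [target:probe] / [protector:probe]
    return ratio / (1 + ratio)


# Intended target: free energy designed to be close to zero.
print(fraction_probe_on_target(0.0, 50))
# Spurious target: assumed +3 kcal/mol penalty for a single mismatch.
print(fraction_probe_on_target(3.0, 50))
```

In this model a standard free energy of zero gives an even split between target:probe and protector:probe, while a penalty of a few kcal/mol shifts the equilibrium strongly toward protector:probe; changing `target_to_protector` shows how a molar excess of protector moves the split.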
[00102] Mechanistically, it is thought that hybridization to a target begins at the toehold and continues along the length of the region 5 until the probe is no longer "double-stranded." This assumes complementarity between the target and the region 5. When nucleotide mismatches exist between a spurious target and region 5, displacement of the second strand (i.e., the protector oligonucleotide) is thermodynamically unfavorable and the association between the region 5 and the spurious target is reversed.
[00103] Because the standard free energy favors a complete match (fully complementary) between the target sequence of the nucleic acid and toehold regions of the probe rather than a mismatch (e.g., single nucleotide change), the region 5 of the probe will bind stably to a target in the absence of a mismatch but not in the presence of a mismatch. If a mismatch exists between the region 5 of the probe and the target, the probe duplex prefers to reform. In this way, the frequency of producing a ligation product when a target sequence is not present is reduced. This type of discrimination is typically not possible using the standard single-stranded probes because in those reactions there is no competing nucleic acid strand (such as the protector oligonucleotide) to which a mismatched probe strand would prefer to bind.
[00104] In addition, as shown in FIG. 11, both the translation probe and the auxiliary probe may have a protector oligonucleotide.
C. M-Probe Translators for Sequence Encoding of Long Targets
[00105] Common techniques for analyzing nucleic acid sequences include the polymerase chain reaction (PCR) and next-generation sequencing (NGS), but these techniques fail in the analysis of long or complex sequences. Trinucleotide repeats, in particular, are difficult to analyze because of slipped strand mispairing and because pathogenic variants are frequently characterized by long strands (>200 nucleotides) that exceed the read length of NGS.
[00106] Standard hybridization probes are: (1) length-limited by synthesis capabilities and cannot query long target regions; (2) not economical for profiling of DNA samples with combinatorial diversity, such as T-cell receptors and antibody fragments; and (3) incapable of accurate quantitation of trinucleotide repeats such as in Huntington's gene, Fragile X, and Friedreich's Ataxia, as well as microsatellite repeats.
[00107] As such, another embodiment of sequence encoding uses modular M-Probes constructed from many oligonucleotides, as shown in FIG. 10A (U.S. Pat. Appln. No. 62/398,484 and Wang et al, 2017, each of which is incorporated herein by reference in its entirety). M-Probes are capable of sequence-selective binding of very long nucleic acid targets, and furthermore tolerate non-pathogenic sequence variations at specified locations.
[00108] The modular probe is designed based on detection or capture of a target nucleic acid sequence of at least partially known sequence. The target sequence is divided conceptually into several regions, a region being a number of continuous nucleotides that act as a unit in hybridization or dissociation. Note that the regions may or may not be directly adjoining one another.
[00109] Several diseases are caused or characterized by an abnormal number of triplet repeats; examples include Huntington's disease (excessive number of CAG repeats), Friedreich's Ataxia (GAA repeats), Myotonic dystrophy (CTG repeats), and the Fragile X syndrome (CGG repeats). Biologically, these repeats induce slipped strand mispairing during DNA replication; slipped strand mispairing likewise complicates or precludes many conventional DNA analysis techniques, such as Sanger sequencing, quantitative PCR, and next-generation sequencing.
[00110] Modular probes can be designed to, for example, the Huntington's gene sequence. Each modular probe is designed to target a threshold number of repeats (6, 9, 12, 15, 18, 21, 24, and 27), as well as the 3' neighboring sequence. For example, a 12 repeat probe is designed to hybridize to any target sequences bearing 12 or more CAG repeats, in addition to the 8 nt downstream of the CAG repeats.
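The threshold logic described above can be illustrated with a short sketch. It assumes idealized, error-free binding (a probe binds if and only if the target carries at least its threshold number of repeats); the function names are hypothetical and are not part of the disclosure.

```python
THRESHOLDS = (6, 9, 12, 15, 18, 21, 24, 27)  # repeat thresholds listed in the text

def probes_that_bind(repeat_count):
    """Thresholds of the modular probes expected to hybridize to a target
    carrying `repeat_count` CAG repeats (idealized, error-free binding)."""
    return [t for t in THRESHOLDS if repeat_count >= t]

def repeat_bracket(bound_thresholds):
    """Infer (inclusive lower bound, exclusive upper bound) on the repeat
    count from the probes that gave a signal. The upper bound is None
    when the largest probe binds."""
    lower = max(bound_thresholds, default=0)
    higher = [t for t in THRESHOLDS if t > lower]
    upper = higher[0] if higher else None
    return lower, upper

bound = probes_that_bind(14)
print(bound)                  # [6, 9, 12]
print(repeat_bracket(bound))  # (12, 15)
```

Under these assumptions, a panel of threshold probes brackets the repeat count to within the spacing between adjacent thresholds rather than reading it out exactly.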
D. Dual Translators to Suppress Nonspecific Ligation Errors
[00111] Another embodiment of sequence encoding that overcomes potential nonspecific ligation is shown in FIG. 11. A secondary Codeword is placed on the auxiliary probe near region 3, or between region 3 and region 4. PCR amplification using both Codewords (regions 6 and 10) or both universal regions (regions 3 and 7) or NGS paired-end reads to analyze both Codewords will differentiate correctly ligated species with consistent Codeword pairs from nonspecific ligation products.
IV. Further Processing of Target Nucleic Acids
A. Target Nucleic Acid Molecules
[00112] A nucleic acid molecule of interest can be a single nucleic acid molecule or a plurality of nucleic acid molecules. Also, a nucleic acid molecule of interest can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, amplified DNA, a pre-existing nucleic acid library, etc.
[00113] Nucleic acids in a nucleic acid sample being analyzed (or processed) in accordance with the present disclosure can be from virtually any nucleic acid source, including but not limited to genomic DNA, complementary DNA (cDNA), RNA (e.g., messenger RNA, ribosomal RNA, short interfering RNA, microRNA, etc.), plasmid DNA, mitochondrial DNA, etc. Furthermore, as any organism can be used as a source of nucleic acids to be processed in accordance with the present disclosure, no limitation in that regard is intended. Exemplary organisms include, but are not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), bacteria, fungi (e.g., yeast), viruses, etc. In certain embodiments, the nucleic acids in the nucleic acid sample are derived from a mammal, where in certain embodiments the mammal is a human. A nucleic acid molecule of interest can be a single nucleic acid molecule or a plurality of nucleic acid molecules. Also, a nucleic acid molecule of interest can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, cell-free DNA (cfDNA), RNA, amplified DNA, a pre-existing nucleic acid library, etc. In some aspects, the target nucleic acid is a double-stranded DNA molecule, such as, for example, human genomic DNA.
[00114] A nucleic acid molecule of interest may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, chemical, enzymatic, degradation over time, etc. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc. A nucleic acid molecule of interest may also be subjected to chemical modification (e.g. , bisulfite conversion, methylation / demethylation), extension, amplification (e.g. , PCR, isothermal, etc.), etc.
[00115] An RNA molecule may be obtained from a sample, such as a sample comprising total cellular RNA, a transcriptome, or both; the sample may be obtained from one or more viruses; from one or more bacteria; or from a mixture of animal cells, bacteria, and/or viruses, for example. The sample may comprise mRNA, such as mRNA that is obtained by affinity capture.
[00116] Obtaining nucleic acid molecules may comprise generation of the cDNA molecule by reverse transcribing the mRNA molecule with a reverse transcriptase, such as, for example, Tth DNA polymerase, HIV Reverse Transcriptase, AMV Reverse Transcriptase, MMLV Reverse Transcriptase, or a mixture thereof.
B. Amplification of Sequence Encoded Nucleic Acids
[00117] A number of template-dependent processes are available to amplify the sequence encoded nucleic acids present in a given sample. One of the best known amplification methods is the polymerase chain reaction (referred to as PCR™) which is described in detail in U.S. Patent Nos. 4,683,195, 4,683,202, and 4,800,159, each of which is incorporated herein by reference in its entirety. Briefly, two synthetic oligonucleotide primers, which are complementary to two regions of the template DNA (one for each strand) to be amplified, are added to the template DNA (that need not be pure), in the presence of excess deoxynucleotides (dNTP's) and a thermostable polymerase, such as, for example, Taq (Thermus aquaticus) DNA polymerase. In a series (typically 30-35) of temperature cycles, the target DNA is repeatedly denatured (around 90°C), annealed to the primers (typically at 50-60°C) and a daughter strand extended from the primers (72°C). As the daughter strands are created they act as templates in subsequent cycles. Thus, the template region between the two primers is amplified exponentially, rather than linearly.
[00118] A barcode, such as a sample barcode, may be added to the target nucleic acid molecules during amplification. One method involves annealing a primer to the sequence encoded nucleic acid molecule, the primer including a first portion complementary to the sequence encoded nucleic acid molecule and a second portion including a barcode; and extending the annealed primer to form a barcoded nucleic acid molecule. Thus, the primer may include a 3' portion and a 5' portion, where the 3' portion may anneal to a portion of the sequence encoded nucleic acid molecule and the 5' portion comprises the barcode.
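As a minimal sketch of the primer layout described above, the following builds a primer whose 5' portion is a barcode and whose 3' portion is complementary to the 3' end of the molecule to be amplified. The template sequence, the barcode, and the annealing length are hypothetical values chosen for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def barcoded_primer(template, barcode, anneal_len=18):
    """Primer (5'->3') whose 5' portion is the barcode and whose 3' portion
    anneals to the last `anneal_len` bases at the 3' end of the template."""
    return barcode + reverse_complement(template[-anneal_len:])

template = "ACGTTGCAAGGCTTAACCGGATCGATCC"   # hypothetical encoded molecule
primer = barcoded_primer(template, barcode="ATCACG", anneal_len=8)
print(primer)
```

Extension of the annealed 3' portion copies the template, while the unpaired 5' barcode is carried into the product and copied in later cycles.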
C. Sequencing of Sequence Encoded Nucleic Acids
[00119] Methods are also provided for the sequencing of the library of sequence encoded nucleic acid molecules. Any technique for sequencing nucleic acids known to those skilled in the art can be used in the methods of the present disclosure. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing-by-synthesis, pyrosequencing, 454 sequencing, nanopore sequencing, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, and SOLiD sequencing.
V. Definitions
[00120] "Amplification," as used herein, refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 30-100 "cycles" of denaturation and replication.
[00121] "Polymerase chain reaction," or "PCR," means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. In other words, PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates. Usually, the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g., exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively).
[00122] "Primer" means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of between 18-40, 20-35, 21-30 nucleotides long, and any length between the stated ranges. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.
[00123] As used herein, a nucleic acid "region" or "domain" is a consecutive stretch of nucleotides of any length.
[00124] "Incorporating," as used herein, means becoming part of a nucleic acid polymer.
[00125] A "nucleoside" is a base-sugar combination, i.e. , a nucleotide lacking a phosphate. It is recognized in the art that there is a certain inter-changeability in usage of the terms nucleoside and nucleotide. For example, the nucleotide deoxyuridine triphosphate, dUTP, is a deoxyribonucleoside triphosphate. After incorporation into DNA, it serves as a DNA monomer, formally being deoxyuridylate, i.e. , dUMP or deoxyuridine monophosphate. One may say that one incorporates dUTP into DNA even though there is no dUTP moiety in the resultant DNA. Similarly, one may say that one incorporates deoxyuridine into DNA even though that is only a part of the substrate molecule.
[00126] "Nucleotide," as used herein, is a term of art that refers to a base- sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e. , of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.
[00127] The term "nucleic acid" or "polynucleotide" will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g., adenine "A," guanine "G," thymine "T" and cytosine "C") or RNA (e.g., A, G, uracil "U" and C). The term "nucleic acid" encompasses the terms "oligonucleotide" and "polynucleotide." "Oligonucleotide," as used herein, refers collectively and interchangeably to two terms of art, "oligonucleotide" and "polynucleotide." Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein. The term "adaptor" may also be used interchangeably with the terms "oligonucleotide" and "polynucleotide." In addition, the term "adaptor" can indicate a linear adaptor (either single stranded or double stranded) or a stem-loop adaptor. These definitions generally refer to at least one single-stranded molecule, but in specific embodiments will also encompass at least one additional strand that is partially, substantially, or fully complementary to at least one single-stranded molecule. Thus, a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or "complement(s)" of a particular sequence comprising a strand of the molecule. As used herein, a single stranded nucleic acid may be denoted by the prefix "ss," a double-stranded nucleic acid by the prefix "ds," and a triple stranded nucleic acid by the prefix "ts."
[00128] A "nucleic acid molecule" or "nucleic acid target molecule" refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof. For example and without limitation, the nucleic acid molecule contains the four canonical DNA bases - adenine, cytosine, guanine, and thymine, and/or the four canonical RNA bases - adenine, cytosine, guanine, and uracil. Uracil can be substituted for thymine when the nucleoside contains a 2'-deoxyribose group. The nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA. For example, and without limitation, mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase and DNA can be converted into RNA using RNA polymerase. A nucleic acid molecule can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc. A nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc. A nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc. A nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation / demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.
[00129] Nucleic acid(s) that are "complementary" or "complement(s)" are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules. As used herein, the term "complementary" or "complement(s)" may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above. The term "substantially complementary" may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if less than all nucleobases base pair with a counterpart nucleobase. In certain embodiments, a "substantially complementary" nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization. In certain embodiments, the term "substantially complementary" refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions. In certain embodiments, a "partially complementary" nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.
[00130] The term "non-complementary" refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.
[00131] The term "ligase" as used herein refers to an enzyme that is capable of joining the 3' hydroxyl terminus of one nucleic acid molecule to a 5' phosphate terminus of a second nucleic acid molecule to form a single molecule. The ligase may be a DNA ligase or RNA ligase. Examples of DNA ligases include E. coli DNA ligase, T4 DNA ligase, and mammalian DNA ligases.
[00132] "Sample" means a material obtained or isolated from a fresh or preserved biological sample or synthetically-created source that contains nucleic acids of interest. In certain embodiments, a sample is the biological material that contains the variable immune region(s) for which data or information are sought. Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
[00133] As used herein in relation to a nucleotide sequence, "substantially known" refers to having sufficient sequence information in order to permit preparation of a nucleic acid molecule, including its amplification. This will typically be about 100%, although in some embodiments some portion of an adaptor sequence is random or degenerate. Thus, in specific embodiments, substantially known refers to about 50% to about 100%, about 60% to about 100%, about 70% to about 100%, about 80% to about 100%, about 90% to about 100%, about 95% to about 100%, about 97% to about 100%, about 98% to about 100%, or about 99% to about 100%.
VI. Kits
[00134] The technology herein includes kits for creating libraries of target nucleic acids in a sample. A "kit" refers to a combination of physical elements. For example, a kit may include one or more components, such as translation and auxiliary adaptors, either with or without protector oligonucleotides, as well as specific primers, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein. These physical elements can be arranged in any way suitable for carrying out the disclosure.
[00135] The components of the kits may be packaged either in aqueous media or in lyophilized form. The container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g., aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial. The kits of the present disclosure also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained.
[00136] A kit will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented. It is contemplated that such reagents are embodiments of kits of the disclosure. Such kits, however, are not limited to the particular items identified above.
VII. Examples
[00137] The following examples are included to demonstrate preferred embodiments of the disclosure. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the disclosure, and thus can be considered to constitute preferred modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the disclosure.
Example 1 - Sequence Encoding with Single-stranded Translator Probes
[00138] Sequence encoding is implemented via a pair of DNA hybridization probes that are conditionally ligated using the target sequence as a splint (FIG. 1B). The left probe, called the translator probe, bears a Codeword (region 6, boxed) that uniquely specifies both the gene and the variant. The right probe, called the auxiliary probe, bears a universal region 3 that is identical for all auxiliary probes. In some embodiments, as illustrated in FIG. 1B, the translator probe is functionalized with a phosphate at the 5' end. Only when both the translator probe and the auxiliary probe bind adjacent DNA subsequences is the 5' end of the translator probe flush against the 3' end of the auxiliary probe to allow subsequent ligation.
[00139] Multiple translator probes can be used simultaneously (FIG. 2) to allow either sequence encoding of multiple target sequences or to identify which one of several potential target sequences is present in a sample. FIG. 3B provides experimental results on sequence encoding using quantitative PCR (qPCR). Two different translator probes were present and simultaneously allowed to hybridize to the NA18562 human cell line genomic DNA, which bears a target sequence matching one of the two translators (FIG. 3A). Regions 5a and 5b differ by 5 nucleotides. Using the Codewords 6a and 6b as reverse primer binding sites in a qPCR reaction shows a roughly 10 cycle Ct difference, indicating roughly 1000-fold higher intended ligation over nonspecific ligation.
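The conversion from a Ct difference to a fold difference can be sketched as follows. The sketch assumes ideal amplification, in which the product doubles every cycle; the `efficiency` parameter is a hypothetical knob for non-ideal reactions and is not taken from the disclosure.

```python
def fold_difference(delta_ct, efficiency=1.0):
    """Fold difference in starting template implied by a Ct gap,
    assuming each cycle multiplies the product by (1 + efficiency)."""
    return (1.0 + efficiency) ** delta_ct

print(fold_difference(10))  # 1024.0
print(fold_difference(6))   # 64.0
```

Under perfect doubling, a 10 cycle gap corresponds to 2^10 = 1024, i.e., roughly 1000-fold, and a 6 cycle gap corresponds to 2^6 = 64, i.e., roughly 60-fold.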
Example 2 - Sequence Encoding with Double-stranded Translator Probes
[00140] Molecular specificity of the translator and auxiliary probes is beneficial to accurately infer genomic DNA variants based on Codeword analysis. Nonspecific binding of Variant and Auxiliary Probes to other genomic loci can result in false positive results. The ligation process helps improve specificity, but is insufficient on its own to ensure accurate translation of target DNA sequences into Codewords.
[00141] In another embodiment (FIG. 4A), a protector oligonucleotide comprising a region 8 that is partially complementary to region 5 is introduced. Importantly, at least 5 continuous nucleotides on region 5 are not bound by the protector, to allow initiation of hybridization between the target and the translator probe. This protector oligonucleotide can improve the specificity of hybridization reactions (see Zhang, Chen, and Yin, Nature Chemistry 2012 and Wang and Zhang, Nature Chemistry 2015; U.S. Patent No. 9,284,602 and U.S. Patent Appln. No. 15/174,373) and maintains high sequence selectivity across a large range of temperatures and buffer conditions. FIG. 4B provides qPCR results using a double-stranded translator probe. A roughly 6 cycle Ct difference is observed, indicating a single-nucleotide specificity of roughly 60-fold. Subsequent Sanger sequencing of the qPCR amplicon (FIG. 5) verifies that the amplicon bears the designed Codeword for the intended translator probe.
Example 3 - Rational Design of Error-Resilient Codewords
[00142] Naive design of Codeword sequences can result in sequence encoding being susceptible to NGS intrinsic error. In the field of signal processing, passing messages across faulty channels (e.g., the Internet) has led to the development of error correcting and error detecting codes. These ideas can be directly applied in Codeword design. Because Illumina sequencing errors are predominantly base replacements (as opposed to insertions or deletions), Hamming encoding is well-suited for sequence encoding.
[00143] To review, the simplest (7,4) Hamming code inserts 3 error-correcting bits for every 4 bit message (longer messages are first broken up into 4 bit words). All 7-bit instances of the Hamming code have the property that they are at least Hamming distance 3 from any other instance - that is to say, one would need to change at least 3 bits in order to transform one Hamming code instance into another. This property means that (7,4) Hamming codes are correcting for up to one error, and tolerant for up to two errors: The original sequence can be restored from any sequence mutated by one base; more conservatively, any sequence with two mutations will not match any other Codewords and can be excluded.
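For readers unfamiliar with the construction, the following is a minimal sketch of the textbook binary (7,4) Hamming code. It is not the nucleotide encoding of FIG. 6; the bit layout (parity bits at positions 1, 2, and 4) is the conventional one and is assumed here for illustration.

```python
from itertools import product

def encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword laid out as
    p1 p2 d1 p3 d2 d3 d4 (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (p1, p2, d1, p3, d2, d3, d4)

def correct(word):
    """Return the word with a single-bit error (if any) corrected.
    The syndrome, read as a binary number, is the 1-based error position."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        w[pos - 1] ^= 1
    return tuple(w)

codewords = [encode(d) for d in product((0, 1), repeat=4)]
min_dist = min(sum(a != b for a, b in zip(x, y))
               for i, x in enumerate(codewords) for y in codewords[i + 1:])
print(len(codewords), min_dist)  # 16 3

sent = encode((1, 0, 1, 1))
received = list(sent)
received[4] ^= 1                     # introduce one error
print(correct(received) == sent)     # True
```

The minimum distance of 3 is what permits either correcting any single error or, more conservatively, detecting and discarding any word with one or two errors.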
[00144] Here, the (7,4) DNA encoding shown in FIG. 6 was used. The assignment of A, T, C, G to numerical values and the design of the error check equations are selected such that long homopolymers and extreme G/C content are rare. Manual pruning of the 256 possible (7,4) Hamming codes removes 40 sequences that can contribute to homopolymers of more than 5 nt (via having a homopolymer of length 3 at the beginning or end of the Codeword) or have G/C content of >75% or <25%, resulting in 216 good (7,4) nt Codeword segments.
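The pruning criteria can be expressed as a simple screen, sketched below. This is an illustrative reading of the stated rules (terminal homopolymer of 3 or more nt, or G/C content outside 25% to 75%); because the actual encoding of FIG. 6 is not reproduced here, the sketch is not expected to reproduce the counts of 40 removed and 216 retained segments. Function names are hypothetical.

```python
def max_end_run(seq):
    """Length of the longer homopolymer run at the start or end of seq."""
    def run(s):
        n = 1
        while n < len(s) and s[n] == s[0]:
            n += 1
        return n
    return max(run(seq), run(seq[::-1]))

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def passes_screen(seq):
    """Reject segments with a terminal homopolymer of 3 or more nt,
    or with G/C content above 75% or below 25%."""
    return max_end_run(seq) < 3 and 0.25 <= gc_fraction(seq) <= 0.75

for s in ("ACGTACG", "AAAGCTC", "ATATATA", "GCGCGCA"):
    print(s, passes_screen(s))
```

A terminal run of 3 is screened because two such segments placed end to end could form a homopolymer of 6 nt.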
[00145] For demonstration purposes here, 21 nt Codewords were used, corresponding to three (7,4) Codeword segments that can enumerate over 10 million distinct Codewords. These Codewords can correct 1 nt error every 7 nt, or tolerate 2 nt errors every 7 nt. FIG. 6E shows the effective error rate of the Codewords given different intrinsic error rates; at 1% intrinsic error rate, the proposed Codewords exhibit roughly 0.6% error rate when NGS reads unmatched to any designed Codeword are corrected, and 0.01% error when unmatched NGS reads are discarded. These are roughly 20-fold and 1000-fold better than a naive encoding with no error correction.
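The figures quoted above can be approximated with a short binomial calculation. The sketch assumes independent substitution errors at a fixed per-base rate, treats a corrected segment as wrong whenever it contains 2 or more errors, and uses the probability of 3 or more errors in a segment as an upper bound for the discard mode. The naive baseline is assumed to be the 12 nt message without check nucleotides; that baseline is an interpretation, not a statement from the disclosure.

```python
from math import comb

def p_at_least(k, n, p):
    """Probability of k or more errors among n bases, each wrong with probability p."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

p = 0.01                      # intrinsic per-base error rate
n_codewords = 216 ** 3        # three screened 7 nt segments
corrected = 1 - (1 - p_at_least(2, 7, p)) ** 3   # a segment fails with >= 2 errors
discarded = 1 - (1 - p_at_least(3, 7, p)) ** 3   # upper bound: >= 3 errors in a segment
naive = p_at_least(1, 12, p)  # assumed baseline: 12 nt message, no check bases

print(n_codewords)                       # 10077696
print(f"corrected: {corrected:.2%}")     # corrected: 0.61%
print(f"discarded: {discarded:.3%}")     # discarded: 0.010%
print(f"naive:     {naive:.1%}")         # naive:     11.4%
```

Under these assumptions the ratios naive/corrected and naive/discarded come out near 19 and 1100, consistent with the roughly 20-fold and 1000-fold improvements stated above.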
[00146] Correction of NGS reads that do not match any designed Codeword is done at the level of satisfying the error-checking equations in FIG. 6B, and does not require knowledge of the designed Codeword sequences. The time complexity of this operation is O(M), where M is the length of the Codeword (here M = 21). After correcting or discarding NGS reads that do not exactly match any designed Codeword, a Suffix Tree algorithm can be used to perform exact string matching on the designed Codewords (FIG. 6F). Suffix Tree is extremely rapid, with runtime complexity of O(M); importantly it has no dependence on the number of Codewords designed, and thus scales well to highly multiplex uses.
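The exact-matching step can be illustrated with a simple trie, used here as a stand-in for the Suffix Tree named above; it shares the property that lookup time depends on the read length M and not on the number of designed Codewords. The Codeword sequences and labels below are hypothetical.

```python
def build_trie(codewords):
    """Nested-dict trie; the key '$' at a node stores the label of a complete Codeword."""
    root = {}
    for label, seq in codewords.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node["$"] = label
    return root

def lookup(trie, read):
    """Walk one node per base: O(M) in the read length, regardless of library size."""
    node = trie
    for base in read:
        node = node.get(base)
        if node is None:
            return None
    return node.get("$")

designed = {"geneA_ref": "ACGTACGTCAGTCAGTACGAT",   # hypothetical 21 nt Codewords
            "geneA_var": "TGCATGCAGTCAGTCATGCTA"}
trie = build_trie(designed)
print(lookup(trie, "ACGTACGTCAGTCAGTACGAT"))  # geneA_ref
print(lookup(trie, "ACGTACGTCAGTCAGTACGAA"))  # None
```

A hash table would give similar behavior for fixed-length Codewords; the trie is shown because it makes the one-step-per-base cost explicit.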
Example 4 - Sequence Encoding and NGS Readout
[00147] FIG. 7 illustrates another embodiment of sequence encoding for analysis by next generation sequencing (NGS), also known as sequencing-by-synthesis. In this embodiment, the translator probe further comprises a region 7. Regions 3 and 7 are adaptors for NGS amplification or index appending, so nucleic acid molecules lacking either region 3 or region 7 will not be sequenced. Thus, only ligation products will be analyzed by NGS.
[00148] In FIG. 8, one potential workflow for sequence encoding and NGS analysis is provided. The entire sample-to-answer workflow is expected to require less than eight hours total when optimized and using only 21 sequencing cycles, which represents a significant speedup over current NGS-based laboratory developed tests (LDTs). Speeding up this workflow further requires shortening its bottleneck, Step 2, the hybridization of genomic DNA to the Variant and Auxiliary Probes. Hybridization kinetics are primarily determined by the individual concentration of each Probe. For low- to medium-plex Translation, individual Probe concentrations can be quite high and hybridization quite fast; for example, 5 nM per Probe for a set of 1000-plex Translators corresponds to a hybridization half-life of roughly three minutes. The four hours listed for Step 2 assumes 100,000-plex translators, and is similar to the times allotted for hybridization by commercial whole exome capture panels. Ligation by NEB Quick Ligase has a half-life of one minute, and does not significantly contribute to overall time.
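The dependence of hybridization time on per-Probe concentration can be sketched with a pseudo-first-order estimate. The sketch is illustrative only: it assumes each Probe is in large excess over its target, a second-order rate constant of 1e6 per molar per second (an order-of-magnitude value that is not stated in the disclosure), and a fixed total Probe concentration as the plex increases.

```python
import math

K_HYB = 1e6  # assumed second-order rate constant, 1/(M*s); not given in the text

def half_life_seconds(probe_conc_molar, k=K_HYB):
    """Pseudo-first-order half-life of target hybridization when the probe
    is in large excess over the target: t_1/2 = ln(2) / (k * [probe])."""
    return math.log(2) / (k * probe_conc_molar)

total = 5e-9 * 1_000           # 1000-plex at 5 nM each -> 5 uM total probe
for plex in (1_000, 100_000):
    per_probe = total / plex   # hold total probe concentration fixed (assumption)
    t = half_life_seconds(per_probe)
    print(f"{plex:>7}-plex: {per_probe * 1e9:.2f} nM per probe, "
          f"half-life {t / 60:.1f} min")
```

With the assumed rate constant, 5 nM per Probe gives a half-life of a little over two minutes, and a 100-fold lower per-Probe concentration lengthens it 100-fold to nearly four hours, the same order as the figures quoted above.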
[00149] FIG. 9 provides experimental results of a 22-plex sequence encoding system (SEQ ID NOS: 17-60): 11 translators were designed to bind specifically to exon subsequences of 11 human genes, and 11 translators were designed to bind the mouse homolog genes. The sample input is 200 ng of human gDNA, so all 11 human gene Codewords are expected to yield high NGS reads, and all 11 mouse gene Codewords are expected to have few or no NGS reads. All Codewords were 21 nt long and use the Hamming error-correction mechanism described in Example 3.
[00150] FIG. 9A shows that there were roughly 50-fold more human Codewords (upper dots) than mouse Codewords (middle dots). For this experiment, we performed 150 cycle NGS rather than 21 cycle NGS to allow debugging. Detailed analysis of the mouse Codeword NGS reads showed that over 90% of these reads did not contain any auxiliary probe sequence, and are likely due to nonspecific ligation of mouse translator probes to sequencing adaptors. When these reads are filtered out, human Codeword reads were roughly 500-fold higher than those of mouse Codewords.
[00151] FIG. 9B shows the distribution of NGS reads in the library. The 3.8% of the library that were corrected Codewords demonstrates the error resilience of sequence encoding to NGS intrinsic error. In a standard NGS protocol and analysis, these 3.8% would have shown up spuriously as single nucleotide variants, but this possibility is completely eliminated in symbolic sequencing.
Example 5 - M-Probe Translators for Sequence Encoding of Long Targets
[00152] Another embodiment of sequence encoding uses modular probes (M-Probes) constructed from many oligonucleotides, shown in FIG. 10A (U.S. Pat. Appln. No. 62/398,484; Wang et al., 2017). M-Probes are capable of sequence-selective binding of very long nucleic acid targets, and furthermore tolerate non-pathogenic sequence variations at specified locations. FIG. 10B shows qPCR results using M-Probe translators. A roughly 6 cycle Ct difference is observed, indicating roughly 60-fold higher intended ligation over nonspecific ligation.
Example 6 - Dual Translators to Suppress Nonspecific Ligation Errors
[00153] Another embodiment of sequence encoding that overcomes potential nonspecific ligation is shown in FIG. 11. A secondary Codeword is placed on the auxiliary probe near region 3. PCR amplification using both Codewords or NGS paired-end reads to analyze both Codewords will differentiate correctly ligated species with consistent Codeword pairs from nonspecific ligation products.
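The paired-Codeword consistency check can be sketched as a simple filter over decoded read pairs. The Codeword labels, the `expected_partner` mapping, and the function name are hypothetical; the sketch assumes each read pair has already been decoded into a primary (translation probe) Codeword and a secondary (auxiliary probe) Codeword.

```python
from collections import Counter


def count_consistent_pairs(read_pairs, expected_partner):
    """Separate correctly ligated products from nonspecific ligation products.

    read_pairs: iterable of (primary_codeword, secondary_codeword) tuples.
    expected_partner: dict mapping each primary codeword to the secondary
        codeword designed to accompany it (hypothetical design table).
    Returns (Counter of consistent reads per primary codeword,
             number of inconsistent read pairs).
    """
    consistent = Counter()
    inconsistent = 0
    for primary, secondary in read_pairs:
        if expected_partner.get(primary) == secondary:
            consistent[primary] += 1
        else:
            inconsistent += 1
    return consistent, inconsistent


if __name__ == "__main__":
    # Hypothetical design: each translation Codeword has one auxiliary partner.
    design = {"CW_T1": "CW_A1", "CW_T2": "CW_A2"}
    pairs = [
        ("CW_T1", "CW_A1"),
        ("CW_T1", "CW_A1"),
        ("CW_T2", "CW_A2"),
        ("CW_T2", "CW_A1"),  # mismatched pair: treated as nonspecific ligation
    ]
    print(count_consistent_pairs(pairs, design))
```

Only read pairs whose two Codewords match the design table contribute to the per-target counts; mismatched pairs are tallied separately.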
Table 1. Sequences used throughout the Examples
B2M-Mouse-C1 /5Phos/CAA ATG AAT CTT CAG AGC TGA AAA GAA AAG GGG AAG GGA GGG AGA GAA GGA GAG TCA ATA ACT GTA CCT CCT TAG AGC TTC AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
TFRC-Human-P2 CGA ATT GGC AGG AAC CGA GTC TCC
AGT GAG GGA GGA GC
TFRC-Human-C2 /5Phos/GTC CTC TCC TGG CTC CTC CCT
CAC TGG AGA CTC GGT TCC TGC CAA TTC GCG TAT CGT TTC CCC TGG GGT GAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T
TFRC-Mouse-P2 TTT CTT CTG GCT GAA ACG GAG GAG
ACA GAC AAG TCA GAA ACC ATG G
TFRC-Mouse-C2 /5Phos/ATC CTC TGT TTC CAT GGT TTC
TGA CTT GTC TGT CTC CTC CGT TTC AGC CAG AAG AAA CGT ATC GCA GGT ACA CTT GCA AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
ACTB-Human-P3 TAG TGT TCT GGT GTT TGT CTC TCT
GAC TAG GTG TCT AAG A
ACTB-Human-C3 /5Phos/CCA CAA CAC TGT CTT AGA
CAC CTA GTC AGA GAG ACA AAC ACC AGA ACA CTA TAA CGA GTA CTA GCA AAA CCC AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
ACTB-Mouse-P3 AAC TAT GTT CTC TCA ATT GCC TTT
CTG ACT AGG TGT TTA AAC CCT ACA
ACTB-Mouse-C3 /5Phos/CCA CAG CAC TGT AGG GTT
TAA ACA CCT AGT CAG AAA GGC AAT TGA GAG AAC ATA GTT TAA CGA GTC ATC GAG GAC CTT AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
PPIA-Human-P4 TAA GTA AGT TCT TGG GAA TTA AAG
TAA TTA CTG AAG
PPIA-Human-C4 /5Phos/CTC ACT AGA ATA CTT CTT CAG
TAA TTA CTT TAA TTC CCA AGA ACT TAC TTA TTA CCG GAA AAC CCA ACC ACC AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
PPIA-Mouse-P4 TTG GGA TTT CTT ACT AGG AAT TGA
ACG TTA TTA CCA AAG
PPIA-Mouse-C4 /5Phos/CAG AAC ACC ACT TTG GTA
ATA ACG TTC AAT TCC TAG TAA GAA ATC CCA ATT ACC GGT TAC CGG CAG AAG TAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T
GAPDH-Human-P5 GCA TTG GTA TGT GGC TGG GGC CAG
AGA CTG GCT CTT AAA AAG TG
GAPDH-Human-C5 /5Phos/CCA GAC CCT GCA CTT TTT AAG AGC CAG TCT CTG GCC CCA GCC ACA TAC CAA TGC GGG GTT TCA CAA ACA GAT CCT AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
GAPDH-Mouse-P5 CTG GAT TGG TAT GAC AAT GAA TAC
GGC TAC AGC AAC AGG GTG G
GAPDH-Mouse-C5 /5Phos/TGA GGT CCA CCA CCC TGT TGC
TGT AGC CGT ATT CAT TGT CAT ACC AAT CCA GGG GGT TTC GCA TTC AAC CAC CAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T
GUSB-Human-P6 GGA AAC AGC GGG GCC CAG GGT GGC
TCT GTT TGT TCC CTG TTT
GUSB-Human-C6 /5Phos/GAG AGC TTT CCA AAC AGG
GAA CAA ACA GAG CCA CCC TGG GCC CCG CTG TTT CCC AAC CAC CTG AGC TTC AAG ATA GAT CGG AAG AGC GTC GTG TAG GGA AAG AGT GT
GUSB-Mouse-P6 TTC GGT GCT AGG CTG GGT CCA TTT
TTA TCT CAA TTC CCA AG
GUSB-Mouse-C6 /5Phos/GAT TTT TTT TCC TCT TGG GAA
TTG AGA TAA AAA TGG ACC CAG CCT AGC ACC GAA CAA CCA CTT ATG AAC GAC GTC AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
VCP-Human-P7 TCC AAG TCT CAG TAT GTT GCC CAT
GCT GGT TTC GGA TTT CTG G
VCP-Human-C7 /5Phos/ATC ACT TGA GGC CAG AAA
TCC GAA ACC AGC ATG GGC AAC ATA CTG AGA CTT GGA ACG TGA CGC GGC CTT CCG AGA AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
VCP-Mouse-P7 TCT GGC ATG GCT GTG GGA AAA CCT
TAC TGG CTA AAG CCA TTG
VCP-Mouse-C7 /5Phos/GGC ATT CAT TAG CAA TGG CTT
TAG CCA GTA AGG TTT TCC CAC AGC CAT GCC AGA ACG TGA CCC AAC AAC GAG CAT AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
GPI-Human-P8 AGC ACT TCG AGC AGC TGC TCT CGG
GGG CTC ACT GGA TGG TGA G
GPI-Human-C8 /5Phos/AGC CTC AGC ACT CAC CAT
CCA GTG AGC CCC CGA GAG CAG CTG CTC GAA GTG CTG GTT GTT CTG GAG CTC TTC CGA GAT CGG AAG AGC GTC GTG TAG GGA AAG AGT GT
GPI-Mouse-P8 ATT CGA TAC TCT ACG AAC ACG GCC
AAA GTG AAA GAG TTT GGA ATT GA
GPI-Mouse-C8 /5Phos/TGT TTT GAG GGT CAA TTC CAA ACT CTT TCA CTT TGG CCG TGT TCG TAG AGT ATC GAA TGG TTG TTT GCT TAC GTC CAT GAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T
REEP5-Human-P9 CTA AAT AAC AGA GGC ATC TCC CAT
CCC CAG AGT AGT GAA AGA C
REEP5-Human-C9 /5Phos/GAG AAT ACA CAC TGT CTT
TCA CTA CTC TGG GGA TGG GAG ATG CCT CTG TTA TTT AGG CTG CAG ATG ATC GTT ATG AAA GAT CGG AAG AGC GTC GTG TAG GGA AAG AGT GT
REEP5-Mouse-P9 TAC TAC GAC ACC CAG TGG CTG ACG
TAC TGG GTG GTA TAT GGT G
REEP5-Mouse-C9 /5Phos/AAT GCT GAA CAC ACC ATA
TAC CAC CCA GTA CGT CAG CCA CTG GGT GTC GTA GTA GCT GCA GGA CCT CGG ATT CAT AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
SNRPD3-Human-P10 ATT TCA TGA TTG CTG GGT AAT TCA
CCA GCT TTG TCA CAA TGT CA
SNRPD3-Human-C10 /5Phos/AGA AAA GCA GAC TGA CAT
TGT GAC AAA GCT GGT GAA TTA CCC AGC AAT CAT GAA ATC TGA GCT TCC AGC GGC CAT CTA GAT CGG AAG AGC GTC GTG TAG GGA AAG AGT GT
SNRPD3-Mouse-P10 AAT TGG CTA TTG GTG TGC CGA TTA
AAG TCT TGC ACG AGG CTG AAG G
SNRPD3-Mouse-C10 /5Phos/CAC TAT GTG GCC TTC AGC CTC
GTG CAA GAC TTT AAT CGG CAC ACC AAT AGC CAA TTC TGA GCT TTG GTG GTC CCT ATA GAT CGG AAG AGC GTC GTG TAG GGA AAG AGT GT
HPRT1-Human-P11 AAT GTT ATT ACA GAT GTG ACG TGC
ATA ATT ATT AGT A
HPRT1-Human-C11 /5Phos/TGT AAA CAT ACA AAT TAC
TAA TAA TTA TGC ACG TCA CAT CTG TAA TAA CAT TCG CAT TCG GAG TAA CTC AGG CAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T
HPRT1-Mouse-P11 GTT GAA TTT CTC CTA AGG TTA CTA
AGT AGT TTA TTT TTC CTT T
HPRT1-Mouse-C11 /5Phos/GTA CCA ATC CAA AAG GAA
AAA TAA ACT ACT TAG TAA CCT TAG GAG AAA TTC AAC CGC ATT CCT AAG TAA CCA ACA AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT
B2M-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC AGG CTG CTG TTC CTA CC
TFRC-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC ACG TGC TGC AGG GAA
ACTB-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC CCA GTG TTA GTA CCT ACA C
PPIA-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC CTT CTG AGT CAT AAA TTC ATT TT
GAPDH-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC GCC ACC AGA GGG CG
GUSB-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC GAG CAC CTT TTT CCT GG
VCP-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC GGC CAA GGT GGG AGG
GPI-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC GCA CTT GGC AGA GAA CC
REEP5-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC GGA GGG GGC ACA TTC T
SNRPD3-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC TGG GAG GGG GAA GTC C
HPRT1-Human-Au GTG ACT GGA GTT CAG ACG TGT GCT
CTT CCG ATC TTT GCC AGA CTG ACC CA
rs2775_S1P CGT GCA CGT CAA ACA GTA ACT TTG
GAT CTG TAA CAT ACA GGG AAT GCG TGC TAC AGT CTC AGC AG
rs2775_S1C GAA CGA CGG ACG TTG TGC CAT TCC
CTG TAT GTT ACA GAT CCA AAG TTA CTG TTT GAC GTG CAC GTT TAG TGC GAA GTC TAC TAT CCA CG
rs2775_S2P CTG CTG AGA CTG TAG CAC GTA AAT
GCA ACT GAA AGC ATC AGT AGC AAA AAT AGG GGC TGA ACG TAA CTC CTC
G
rs2775_S2C AGC TAT CTT CGT CCA TCT GGC CTA
TTT TTG CTA CTG ATG CTT TCA GTT GCA TTT AGC ACA ACG TCC GTC GTT
C
rs2775_tP_A CGA GGA GTT ACG TTC AGC CAA AGC
AGA AAA AAG
rs2775_tC_T /5Phos/TAA CTC ACT GAT TCT TTT TTC
TGC TTT CCA GAT GGA CGA AGA TAG CTT ACC GGT CAT CGA ATC CGA TAG ATC GGA AGA GCG TCG TG
rs2775_uC ttttttt GTT AAA TCG TGG ATA GTA GAC
TTC GCA Ct
79 rs2775_tP_CGTCT CGA GGA GTT ACG TTC AGC CAA AGC
AGC GTC TGG
80 rs2775_tC_AGACG /5Phos/TAA CTC ACT GAT TCC AGA
CGC TGC TTT CCA GAT GGA CGA AGA TAG CTT ACC GGT CCA TCT GAT GAG CAG ATC GGA AGA GCG TCG TG
81 rs2775_AP CTA TTT AAA AAT ATA CGA TCT GAG
ATG
82 rs2775_AC TTC AGA CGT GTG CTC TTC CGA TCC
ATC TCA GAT CGT ATA TTT TTA AAT AGC CTG TCT TCC
83 rs2775_rP_AT TCC GAT CTA TCG GAT TCG ATG AC
84 rs2775_rP_AGACG TCT TCC GAT CTG CTC ATC AGA TG
85 rs1509186_rP_18562 TTC CGA TCT CGA TCG AGT TCC
86 rs1509186_rP_fake5nt TCC GAT CTA TCG GAT TCG ATG AC
87 sequence in FIG. 1 AAA CTT GTG GTA GTT GGA GCT GGT
GGC GTA GGC AAG AGT GCC
88 sequence in FIG. 1 GCG TAA GCA ACT CCG TAA TTC
89 sequence in FIG. 5 GAC TGT GNN NCT TAT TGA CGA ATC
TAG GAT TGA ACC ACT CCT GAG ATC GG
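The relationship between paired entries in Table 1 can be checked with a short reverse-complement sketch. It uses the TFRC-Human-P2 and TFRC-Human-C2 sequences listed above and tests whether the reverse complement of the -P entry occurs within the -C entry. Reading the -P entry as a partially complementary partner of the -C entry is an interpretation made here for illustration; the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Strip the 5' phosphate annotation and spacing from a Table 1 entry."""
    return seq.replace("/5Phos/", "").replace(" ", "").upper()


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


# Sequences copied from Table 1.
TFRC_HUMAN_P2 = clean("CGA ATT GGC AGG AAC CGA GTC TCC AGT GAG GGA GGA GC")
TFRC_HUMAN_C2 = clean(
    "/5Phos/GTC CTC TCC TGG CTC CTC CCT CAC TGG AGA CTC GGT TCC TGC CAA TTC "
    "GCG TAT CGT TTC CCC TGG GGT GAG ATC GGA AGA GCG TCG TGT AGG GAA AGA GTG T"
)

# Position in C2 where the reverse complement of P2 begins (-1 if absent).
OFFSET = TFRC_HUMAN_C2.find(reverse_complement(TFRC_HUMAN_P2))

if __name__ == "__main__":
    print("P2 length:", len(TFRC_HUMAN_P2))
    print("Complementary region starts at C2 position:", OFFSET)
```

For this pair the full -P sequence is complementary to a stretch of the -C sequence that begins 11 nucleotides from the 5' end of the -C entry, leaving those 11 nucleotides unpaired.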
* * *
[00154] All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this disclosure have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the disclosure. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the disclosure as defined by the appended claims.
REFERENCES
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
U.S. Pat. No. 4,683,195
U.S. Pat. No. 4,683,202
U.S. Pat. No. 4,800,159
U.S. Pat. No. 9,284,602
U.S. Pat. Publn. No. 2016/0340727
U.S. Pat. Appln. No. 62/398,484
Wang et al., Modular probes for enriching and detecting complex nucleic acid sequences, Nature Chemistry, DOI: 10.1038/NCHEM.2820, Published online July 17, 2017.
Wang and Zhang, Simulation-guided DNA probe design for consistently ultraspecific hybridization, Nature Chemistry, 7:545-53, 2015.
Wu et al., Continuously tunable nucleic acid hybridization probes, Nature Methods, 12:1191-96, 2015.
Zhang et al., Optimizing the specificity of nucleic acid hybridization, Nature Chemistry, 4:208-14, 2012.
Claims
1. A composition of nucleic acid molecules, the composition comprising:
(a) at least three auxiliary probes, wherein each auxiliary probe comprises a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region of each auxiliary probe has a unique sequence, wherein the first auxiliary probe universal regions of each auxiliary probe have the same sequence;
(b) at least three translation probes, wherein each translation probe comprises a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region of each translation probe has a unique sequence, wherein the first translation probe codeword region of each translation probe has a unique sequence; and
(c) at least three translation probe protection oligonucleotides, wherein each translation probe protection oligonucleotide comprises a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region of one of the translation probes.
2. The composition of claim 1, wherein the translation probes are modular probes.
3. The composition of claim 2, wherein the first nucleic acid molecules of the translation probes further comprise a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region.
4. The composition of claim 3, wherein the translation probes further comprise a second nucleic acid molecule, wherein each of the second nucleic acid molecules comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule.
5. The composition of claim 4, wherein the translation probes further comprise a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule.
6. The composition of claim 5, wherein the third nucleic acid molecules of the translation probes further comprise a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
7. The composition of any one of claims 1-6, wherein the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
8. The composition of claim 7, wherein the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
9. The composition of any one of claims 1-8, wherein each first translation probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
10. The composition of any one of claims 1-9, wherein each of the translation probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region in the composition.
11. The composition of claim 10, wherein each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition.
12. The composition of claim 10, wherein each of the translation probe codeword regions is 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
13. The composition of any one of claims 1-9, wherein each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region in the composition.
14. The composition of claim 13, wherein each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition.
15. The composition of claim 13, wherein each of the translation probe codeword regions is 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
16. The composition of any one of claims 1-15, wherein each of the translation probes further comprises a first translation probe universal region, wherein the first translation probe universal regions of each translation probe have the same sequence.
17. The composition of claim 16, wherein the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
18. The composition of any one of claims 1-17, wherein each of the translation probes comprises a 5' phosphate.
19. The composition of any one of claims 1-17, wherein each of the translation probes lacks a 5' phosphate.
20. The composition of any one of claims 1-19, wherein each of the translation probes is between 30 and 200 nucleotides long.
21. The composition of any one of claims 1-20, wherein each of the auxiliary probes further comprises a first auxiliary probe codeword region, wherein each auxiliary probe in the composition has a unique first auxiliary probe codeword region sequence.
22. The composition of claim 21, wherein the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
23. The composition of any one of claims 21-22, wherein each first auxiliary probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
24. The composition of any one of claims 21-23, wherein each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
25. The composition of claim 24, wherein each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition.
26. The composition of any one of claims 21-24, wherein each of the auxiliary probe codeword regions is 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
27. The composition of any one of claims 21-23, wherein each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
28. The composition of claim 27, wherein each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition.
29. The composition of any one of claims 21-27, wherein each of the auxiliary probe codeword regions is 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
30. The composition of any one of claims 1-29, wherein each of the auxiliary probes comprises a 5' phosphate.
31. The composition of any one of claims 1-29, wherein each of the auxiliary probes lacks a 5' phosphate.
32. The composition of any one of claims 1-29, further comprising at least three auxiliary probe protection oligonucleotides, wherein each auxiliary probe protection oligonucleotide comprises a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region of one of the auxiliary probes.
33. The composition of claim 32, wherein the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
34. The composition of claim 33, wherein the first auxiliary probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
35. The composition of any one of claims 1-34, wherein each of the auxiliary probes is between 30 and 200 nucleotides long.
36. The composition of any one of claims 1-35, further comprising at least one target nucleic acid molecule comprising a first target region and a second target region, wherein the first target region and the second target region are directly adjacent within the target nucleic acid molecule, wherein the first target region is complementary to the first auxiliary probe hybridization region of one of the auxiliary probes in the composition, wherein the second target region is complementary to the first translation probe hybridization region of one of the translation probes in the composition.
37. A method for determining the presence of a target nucleic acid molecule in a sample, the target nucleic acid molecule comprising a known target sequence having a first target region and a second target region that is directly adjacent to the first target region, the method comprising:
(a) contacting the sample with at least a first auxiliary probe and at least a first translation probe, wherein the auxiliary probe comprises a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region is complementary to the first target region, and wherein the first translation probe comprises a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region is complementary to the second target region;
(b) incubating the product of step (a) under conditions to allow the first auxiliary probe hybridization region to anneal to the first target region and the first translation probe hybridization region to anneal to the second target region, thereby producing an annealed product if the target nucleic acid molecule is present in the sample;
(c) incubating the product of step (b) under conditions to allow the ligation of the annealed first auxiliary probe to the annealed first translation probe, thereby producing a ligation product having both the first translation probe codeword region and the first auxiliary probe universal region if the target nucleic acid molecule is present in the sample; and
(d) detecting the ligation product, thereby determining the presence of the target nucleic acid molecule in the sample.
38. The method of claim 37, wherein the first translation probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
39. The method of claim 37 or 38, wherein the first translation probe further comprises a first translation probe universal region.
40. The method of claim 39, wherein the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
41. The method of any one of claims 37-40, wherein step (a) further comprises contacting the sample with at least a first translation probe protection oligonucleotide, wherein the translation probe protection oligonucleotide comprises a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
42. The method of claim 41, wherein the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
43. The method of claim 42, wherein the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
44. The method of any one of claims 37-43, wherein the first translation probe is a modular probe.
45. The method of claim 44, wherein the first nucleic acid molecule of the first translation probe further comprises a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region.
46. The method of claim 45, wherein the first translation probe further comprises a second nucleic acid molecule, wherein the second nucleic acid molecule comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule.
47. The method of claim 46, wherein the first translation probe further comprises a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule.
48. The method of claim 47, wherein the third nucleic acid molecule of the translation probe further comprises a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
49. The method of any one of claims 37-48, wherein the first translation probe is between 30 and 200 nucleotides long.
50. The method of any one of claims 37-49, wherein step (a) further comprises contacting the sample with at least a second translation probe, wherein the second translation probe comprises a second translation probe hybridization region and a second translation probe codeword region, wherein the translation probe hybridization regions on each of the first and second translation probes has a unique sequence, wherein the translation probe codeword region on each of the first and second translation probes has a unique sequence.
51. The method of claim 50, wherein each of the translation probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region in the composition.
52. The method of claim 50, wherein each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition.
53. The method of claim 50 or 51, wherein each of the translation probe codeword regions is 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
54. The method of any one of claims 50-53, wherein each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region in the composition.
55. The method of claim 54, wherein each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition.
56. The method of any one of claims 50-54, wherein each of the translation probe codeword regions is 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
57. The method of any one of claims 37-56, wherein the first auxiliary probe further comprises a first auxiliary probe codeword region.
58. The method of claim 57, wherein the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
59. The method of claim 58, wherein step (a) further comprises contacting the sample with at least a second auxiliary probe, wherein the second auxiliary probe comprises a second auxiliary probe hybridization region and a second auxiliary probe codeword region, wherein the auxiliary probe codeword region on each of the first and second auxiliary probes has a unique sequence.
60. The method of claim 59, wherein each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
61. The method of claim 60, wherein each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition.
62. The method of any one of claims 59-60, wherein each of the auxiliary probe codeword regions is 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 19 nucleotide positions.
63. The method of any one of claims 59-62, wherein each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region in the composition.
64. The method of claim 63, wherein each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition.
65. The method of any one of claims 59-63, wherein each of the auxiliary probe codeword regions is 21 nucleotides long, wherein no two auxiliary probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
66. The method of any one of claims 37-65, wherein the first auxiliary probe codeword region is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
67. The method of any one of claims 37-66, wherein step (a) further comprises contacting the sample with at least a first auxiliary probe protection oligonucleotide, wherein the auxiliary probe protection oligonucleotide comprises a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
68. The method of claim 67, wherein the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
69. The method of claim 68, wherein the first auxiliary probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
70. The method of any one of claims 37-69, wherein the first auxiliary probe is between 30 and 200 nucleotides long.
71. The method of claim 37, wherein step (a) comprises contacting the sample with a composition of any one of claims 1-35.
72. The method of claim 37, wherein step (c) is performed by incubating the product of step (b) with a ligase.
73. The method of claim 72, wherein the first target region is positioned upstream of the second target region, wherein the first auxiliary probe comprises a 5' phosphate.
74. The method of claim 73, wherein the first translation probe lacks a 5' phosphate.
75. The method of claim 72, wherein the second target region is positioned upstream of the first target region, wherein the first translation probe comprises a 5' phosphate.
76. The method of claim 75, wherein the first auxiliary probe lacks a 5' phosphate.
77. The method of claim 37, wherein step (c) is performed chemically.
78. The method of claim 77, wherein the first target region is positioned upstream of the second target region, wherein the first auxiliary probe comprises a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
79. The method of claim 78, wherein the first translation probe lacks a 5' functionalization.
80. The method of claim 77, wherein the second target region is positioned upstream of the first target region, wherein the first translation probe comprises a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
81. The method of claim 80, wherein the first auxiliary probe lacks a 5' functionalization.
82. The method of claim 37, wherein detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region.
83. The method of claim 37, wherein detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
84. The method of claim 37, wherein detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe codeword region and the first auxiliary probe universal region that is present in the sample.
85. The method of claim 84, wherein quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region.
86. The method of claim 84, wherein quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
87. The method of either claim 37 or 84, wherein detecting and/or quantitating the amount of the ligation product comprises performing DNA sequencing.
88. The method of claim 87, wherein the DNA sequencing comprises Sanger sequencing, sequencing-by-synthesis, or nanopore sequencing.
89. The method of claim 88, wherein detecting and/or quantitating the amount of the ligation product comprises performing Hamming error correction to the sequences obtained for the translation probe and/or auxiliary probe codeword regions.
90. The method of either claim 37 or 84, wherein the method does not comprise a step of bead capture.
91. The method of either claim 37 or 84, wherein the method further comprises a step of bead capture.
92. The method of claim 37, wherein the target nucleic acid molecule comprises DNA.
93. The method of claim 37, wherein the target nucleic acid molecule comprises RNA.
94. The method of claim 37, wherein if the ligation product is not detected, then the target sequence is determined to not be present in the sample.
95. The method of claim 37, wherein if the ligation product is detected, then the target sequence is determined to be present in the sample.
96. A method for determining the presence of a plurality of target nucleic acid molecules in a sample, each target nucleic acid molecule comprising a known target sequence having a first target region and a second target region that is directly adjacent to the first target region, the method comprising:
(a) contacting the sample with at least two auxiliary probes and at least two translation probes,
wherein the auxiliary probes each comprise a first auxiliary probe hybridization region and a first auxiliary probe universal region, wherein the first auxiliary probe hybridization region of each auxiliary probe has a unique sequence, wherein the first auxiliary probe universal regions of each auxiliary probe have the same sequence, wherein the first auxiliary probe hybridization region is complementary to the first target region of one of the plurality of target nucleic acid molecules, and
wherein the translation probes each comprise a first nucleic acid molecule, wherein said first nucleic acid molecule comprises a first translation probe hybridization region and a first translation probe codeword region, wherein the first translation probe hybridization region of each translation probe has a unique sequence, wherein the first translation probe codeword region of each translation probe has a unique sequence, wherein the first translation probe hybridization region is complementary to the second target region of one of the plurality of target nucleic acid molecules;
(b) incubating the product of step (a) under conditions to allow the first auxiliary probe hybridization regions to anneal to the first target regions and the first translation probe hybridization regions to anneal to the second target regions, thereby producing annealed products if the target nucleic acid molecules are present in the sample;
(c) incubating the product of step (b) under conditions to allow the ligation of the auxiliary probe to the translation probe annealed to a known target sequence, thereby producing ligation products having both a first translation probe codeword region and a first auxiliary probe universal region if one of the target nucleic acid molecules is present in the sample; and
(d) detecting the ligation products, thereby determining the presence of the target nucleic acid molecules in the sample.
97. The method of claim 96, wherein the translation probes are modular probes.
98. The method of claim 97, wherein the first nucleic acid molecules of the translation probes further comprise a second translation probe hybridization region positioned between the first translation probe hybridization region and the first translation probe codeword region.
99. The method of claim 98, wherein the translation probes further comprise a second nucleic acid molecule, wherein each of the second nucleic acid molecules comprises a third translation probe hybridization region and a fourth translation probe hybridization region, wherein the third translation probe hybridization region is complementary to the second translation probe hybridization region of the first nucleic acid molecule.
100. The method of claim 99, wherein the translation probes further comprise a third nucleic acid molecule, wherein the third nucleic acid molecule comprises a fifth translation probe hybridization region, wherein the fifth translation probe hybridization region is complementary to the fourth translation probe hybridization region of the second nucleic acid molecule.
101. The method of claim 100, wherein the third nucleic acid molecules of the translation probes further comprise a sixth translation probe hybridization region, wherein the translation probe protection oligonucleotides further comprise a second translation probe protection oligonucleotide hybridization region, wherein the sixth translation probe hybridization region is complementary to the second translation probe protection oligonucleotide hybridization region.
102. The method of any one of claims 96-101, wherein the translation probes are between 30 and 200 nucleotides long.
103. The method of any one of claims 96-102, wherein each of the translation probe codeword regions is between 7 and 98 nucleotides long, wherein the length of each first translation probe codeword region is a multiple of 7.
104. The method of any one of claims 96-103, wherein each of the translation probe codeword regions lacks sequence identity at at least 2 nucleotide positions as compared to any other translation probe codeword region.
105. The method of claim 104, wherein each of the translation probe codeword regions in the composition has a Hamming distance of at least two relative to every other translation probe codeword region in the composition.
106. The method of claim 104, wherein each of the translation probe codeword regions is 21 nucleotides long, wherein no two translation probe codeword regions share sequence identity at more than 19 nucleotide positions.
107. The method of claim 104, wherein each of the translation probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other translation probe codeword region.
108. The method of claim 107, wherein each of the translation probe codeword regions in the composition has a Hamming distance of at least three relative to every other translation probe codeword region in the composition.
109. The method of claim 107, wherein each of the translation probe codeword regions is 21 nucleotides long, wherein no two translation probe codeword regions in the composition share sequence identity at more than 18 nucleotide positions.
110. The method of any one of claims 96-109, wherein each of the translation probes further comprises a first translation probe universal region, wherein the first translation probe universal regions of each translation probe have the same sequence.
111. The method of claim 110, wherein the first translation probe codeword region is positioned between the first translation probe universal region and the first translation probe hybridization region.
112. The method of any one of claims 96-111, wherein step (a) further comprises contacting the sample with at least two first translation probe protection oligonucleotides, wherein the translation probe protection oligonucleotides comprise a first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region of one of the translation probes.
113. The method of claim 112, wherein the first translation probe hybridization region is at least 5 nucleotides longer than the first translation probe protection oligonucleotide hybridization region that is at least partially complementary to the first translation probe hybridization region.
114. The method of claim 113, wherein the first translation probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first translation probe hybridization region of one of the translation probes, wherein the first translation probe hybridization region is at least 17 nucleotides long.
115. The method of any one of claims 96-114, wherein each of the auxiliary probes further comprises a first auxiliary probe codeword region.
116. The method of claim 115, wherein the first auxiliary probe codeword region is positioned between the first auxiliary probe hybridization region and the first auxiliary probe universal region.
117. The method of claim 116, wherein each of the auxiliary probe codeword regions lacks sequence identity at at least 2 nucleotide positions as compared to any other auxiliary probe codeword region.
118. The method of claim 117, wherein each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least two relative to every other auxiliary probe codeword region in the composition.
119. The method of claim 117, wherein each of the auxiliary probe codeword regions is 21 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 19 nucleotide positions.
120. The method of claim 116, wherein each of the auxiliary probe codeword regions in the composition lacks sequence identity at at least 3 nucleotide positions as compared to any other auxiliary probe codeword region.
121. The method of claim 120, wherein each of the auxiliary probe codeword regions in the composition has a Hamming distance of at least three relative to every other auxiliary probe codeword region in the composition.
122. The method of claim 120, wherein each of the auxiliary probe codeword regions is 21 nucleotides long, wherein no two auxiliary probe codeword regions share sequence identity at more than 18 nucleotide positions.
123. The method of any one of claims 96-122, wherein step (a) further comprises contacting the sample with at least two first auxiliary probe protection oligonucleotides, wherein the auxiliary probe protection oligonucleotides comprise a first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region of one of the auxiliary probes.
124. The method of claim 123, wherein the first auxiliary probe hybridization region is at least 5 nucleotides longer than the first auxiliary probe protection oligonucleotide hybridization region that is at least partially complementary to the first auxiliary probe hybridization region.
125. The method of claim 124, wherein the first auxiliary probe protection oligonucleotide hybridization region comprises at least 12 continuous nucleotides that are complementary to the first auxiliary probe hybridization region of one of the auxiliary probes, wherein the first auxiliary probe hybridization region is at least 17 nucleotides long.
126. The method of any one of claims 96-125, wherein the auxiliary probe is between 30 and 200 nucleotides long.
127. The method of claim 96, wherein step (a) comprises contacting the sample with a composition of any one of claims 1-35.
128. The method of any one of claims 96-127, wherein step (c) is performed by incubating the product of step (b) with a ligase.
129. The method of claim 128, wherein the first target regions are positioned upstream of the second target regions, wherein the first auxiliary probes comprise a 5' phosphate.
130. The method of claim 129, wherein the first translation probes lack a 5' phosphate.
131. The method of claim 128, wherein the second target regions are positioned upstream of the first target regions, wherein the first translation probes comprise a 5' phosphate.
132. The method of claim 131, wherein the first auxiliary probes lack a 5' phosphate.
133. The method of any one of claims 96-127, wherein step (c) is performed chemically.
134. The method of claim 133, wherein the first target regions are positioned upstream of the second target regions, wherein the first auxiliary probes comprise a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
135. The method of claim 134, wherein the first translation probes lack a 5' functionalization.
136. The method of claim 133, wherein the second target regions are positioned upstream of the first target regions, wherein the first translation probes comprise a 5' functionalization selected from the group consisting of an alkyne, an azide, a primary amine, and a carboxylic acid.
137. The method of claim 136, wherein the first auxiliary probes lack a 5' functionalization.
138. The method of any one of claims 96-137, wherein detecting the ligation products in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region.
139. The method of any one of claims 96-137, wherein detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
140. The method of any one of claims 96-137, wherein detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe codeword region and the first auxiliary probe universal region that is present in the sample.
141. The method of claim 140, wherein quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe codeword region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region.
142. The method of claim 140, wherein quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe codeword region.
143. The method of any one of claims 110-111, wherein detecting the ligation products in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first translation probe universal region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region.
144. The method of any one of claims 110-111, wherein detecting the ligation product in step (d) comprises performing PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe universal region.
145. The method of any one of claims 110-111, wherein detecting the ligation product in step (d) further comprises quantitating the amount of the ligation product having both the first translation probe universal region and the first auxiliary probe universal region that is present in the sample.
146. The method of claim 145, wherein quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first translation probe universal region and a reverse primer that comprises a sequence that is identical to the first auxiliary probe universal region.
147. The method of claim 145, wherein quantitating the amount of the ligation product in step (d) comprises performing quantitative PCR using a forward primer that comprises a sequence that is complementary to the first auxiliary probe universal region and a reverse primer that comprises a sequence that is identical to the first translation probe universal region.
148. The method of any one of claims 96-147, wherein detecting and/or quantitating the amount of the ligation product comprises performing DNA sequencing.
149. The method of claim 148, wherein the DNA sequencing comprises Sanger sequencing, sequencing-by-synthesis, or nanopore sequencing.
150. The method of claim 148, wherein detecting and/or quantitating the amount of the ligation product comprises performing Hamming error correction to the sequences obtained for the translation probe and/or auxiliary probe codeword regions.
151. The method of any one of claims 96-147, wherein the method does not comprise a step of bead capture.
152. The method of any one of claims 96-147, wherein the method further comprises a step of bead capture.
153. The method of any one of claims 96-149, wherein the target nucleic acid molecules comprise DNA.
154. The method of any one of claims 96-149, wherein the target nucleic acid molecules comprise RNA.
155. The method of claim 96, wherein if the ligation products are not detected, then the target sequences are determined to not be present in the sample.
156. The method of claim 96, wherein if the ligation products are detected, then the target sequences are determined to be present in the sample.
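The codeword constraints recited in claims 103-109 and the Hamming error correction of claim 150 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names, the four 7-nucleotide codewords, and the nearest-codeword decoding rule are hypothetical and are not taken from the application, which does not disclose a specific codebook or decoder in this section.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(codewords):
    """Smallest Hamming distance between any two codewords in the set."""
    return min(hamming(a, b) for a, b in combinations(codewords, 2))


def check_codebook(codewords, min_distance=3):
    """Check the length rule of claim 103 (7 to 98 nt, a multiple of 7)
    and a minimum pairwise Hamming distance (claims 105 and 108)."""
    lengths = {len(c) for c in codewords}
    if len(lengths) != 1:
        return False
    n = lengths.pop()
    if n % 7 != 0 or not 7 <= n <= 98:
        return False
    return min_pairwise_distance(codewords) >= min_distance


def decode(read, codewords, max_errors=1):
    """Assumed decoder: return the unique codeword within max_errors
    substitutions of the read, or None if there is no unique match."""
    hits = [c for c in codewords if hamming(read, c) <= max_errors]
    return hits[0] if len(hits) == 1 else None


# Hypothetical codebook, chosen only so that every pair differs at >= 3 positions.
CODEBOOK = ["AAAAAAA", "CCCAAAA", "GGGGGGG", "AAACCCT"]

if __name__ == "__main__":
    print(check_codebook(CODEBOOK))     # distance and length rules hold
    print(decode("AAAAAAT", CODEBOOK))  # one substitution away from AAAAAAA
    print(decode("ACGTACG", CODEBOOK))  # no codeword within one substitution
```

With a minimum pairwise distance of three, a read carrying a single substitution lies within distance one of exactly one codeword, which is why a distance-three codebook supports correcting one sequencing error per codeword region.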
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201762552652P | 2017-08-31 | 2017-08-31 | |
| US62/552,652 | 2017-08-31 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2019046768A1 (en) | 2019-03-07 |
| WO2019046768A8 (en) | 2019-12-05 |
Family
ID=65526066
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2018/049173 Ceased WO2019046768A1 (en) | 2017-08-31 | 2018-08-31 | Symbolic squencing of dna and rna via sequence encoding |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019046768A1 (en) |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022055885A1 (en) * | 2020-09-08 | 2022-03-17 | Catalog Technologies, Inc. | Systems and methods for writing by sequencing of nucleic acids |
| US11286479B2 (en) | 2018-03-16 | 2022-03-29 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
| US11379729B2 (en) | 2016-11-16 | 2022-07-05 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
| US11535842B2 (en) | 2019-10-11 | 2022-12-27 | Catalog Technologies, Inc. | Nucleic acid security and authentication |
| US11610651B2 (en) | 2019-05-09 | 2023-03-21 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in DNA-based data storage |
| US11763169B2 (en) | 2016-11-16 | 2023-09-19 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
| WO2025058684A1 (en) * | 2023-09-11 | 2025-03-20 | Western Digital Technologies, Inc. | Multi-tier error correction codes for dna data storage |
| US12430202B2 (en) | 2022-12-05 | 2025-09-30 | Western Digital Technologies, Inc. | Nested error correction codes for DNA data storage |
| US12437841B2 (en) | 2018-08-03 | 2025-10-07 | Catalog Technologies, Inc. | Systems and methods for storing and reading nucleic acid-based data with error protection |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040137484A1 (en) * | 2001-10-15 | 2004-07-15 | Zhang David Y. | Nucleic acid amplification methods |
| US20070190542A1 (en) * | 2005-10-03 | 2007-08-16 | Ling Xinsheng S | Hybridization assisted nanopore sequencing |
| US20110172975A1 (en) * | 2009-08-19 | 2011-07-14 | University Of Sao Paulo | Generation and reproduction of dna sequences and analysis of polymorphisms and mutations by using error-correcting codes |
| US20140171338A1 (en) * | 2011-05-17 | 2014-06-19 | Dxterity Diagnostics Incorporated | Methods and compositions for detecting target nucleic acids |
| US20140235470A1 (en) * | 2012-12-07 | 2014-08-21 | Invitae Corporation | Multiplex nucleic acid detection methods |
| US20140302486A1 (en) * | 2011-09-02 | 2014-10-09 | President And Fellows Of Harvard College | Systems and methods for detecting biomarkers of interest |
Non-Patent Citations (2)
| Title |
|---|
| WANG ET AL.: "Modular probes for enriching and detecting complex nucleic acid sequences", NAT CHEM, vol. 9, no. 12, 17 July 2017 (2017-07-17), pages 1222 - 1228, XP055581011, ISSN: 1755-4330, DOI: 10.1038/nchem.2820 * |
| YARKIN ET AL.: "Detection of HPV DNA in cervical specimens collected in cytologic solution by ligation-dependent PCR", ACTA CYTOL, vol. 47, no. 3, 1 May 2003 (2003-05-01), pages 450 - 456, XP055581017 * |
Cited By (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12236354B2 (en) | 2016-11-16 | 2025-02-25 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
| US11379729B2 (en) | 2016-11-16 | 2022-07-05 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
| US12001962B2 (en) | 2016-11-16 | 2024-06-04 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
| US11763169B2 (en) | 2016-11-16 | 2023-09-19 | Catalog Technologies, Inc. | Systems for nucleic acid-based data storage |
| US11286479B2 (en) | 2018-03-16 | 2022-03-29 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
| US12006497B2 (en) | 2018-03-16 | 2024-06-11 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
| US12437841B2 (en) | 2018-08-03 | 2025-10-07 | Catalog Technologies, Inc. | Systems and methods for storing and reading nucleic acid-based data with error protection |
| US11610651B2 (en) | 2019-05-09 | 2023-03-21 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in DNA-based data storage |
| US12002547B2 (en) | 2019-05-09 | 2024-06-04 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in DNA-based data storage |
| US11535842B2 (en) | 2019-10-11 | 2022-12-27 | Catalog Technologies, Inc. | Nucleic acid security and authentication |
| WO2022055885A1 (en) * | 2020-09-08 | 2022-03-17 | Catalog Technologies, Inc. | Systems and methods for writing by sequencing of nucleic acids |
| US12430202B2 (en) | 2022-12-05 | 2025-09-30 | Western Digital Technologies, Inc. | Nested error correction codes for DNA data storage |
| US12474994B2 (en) | 2022-12-05 | 2025-11-18 | Western Digital Technologies, Inc. | Preprocessing for correcting insertions and deletions in DNA data storage |
| WO2025058684A1 (en) * | 2023-09-11 | 2025-03-20 | Western Digital Technologies, Inc. | Multi-tier error correction codes for dna data storage |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2019046768A8 (en) | 2019-12-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2019046768A1 (en) | Symbolic squencing of dna and rna via sequence encoding | |
| EP3164489B1 (en) | Tagging and assessing a target sequence | |
| DK2623613T3 (en) | Increasing the reliability of the allele-indications by molecular counting | |
| CN108138228B (en) | High Molecular Weight DNA Sample Tracking Tags for Next Generation Sequencing | |
| AU2015315103A1 (en) | Methods and compositions for rapid nucleic acid library preparation | |
| CN102177250A (en) | Method for Direct Amplification from Crude Nucleic Acid Samples | |
| WO2011156795A2 (en) | Nucleic acids for multiplex organism detection and methods of use and making the same | |
| JP7602464B2 (en) | Quantitative amplicon sequencing for multiple copy number variation detection and allelic ratio quantification | |
| US20220267848A1 (en) | Detection and quantification of rare variants with low-depth sequencing via selective allele enrichment or depletion | |
| US20160115544A1 (en) | Molecular barcoding for multiplex sequencing | |
| CN109790577B (en) | Methods for removing adapter dimers from nucleic acid sequencing preparations | |
| KR20230006852A (en) | Quantitative blocker displacement amplification (QBDA) sequencing for quantification of uncorrected and multiple variant allele frequencies | |
| US20230416730A1 (en) | Methods and compositions for addressing inefficiencies in amplification reactions | |
| US20240301466A1 (en) | Efficient duplex sequencing using high fidelity next generation sequencing reads | |
| AU2006226873B2 (en) | Nucleic acid detection | |
| US20230250470A1 (en) | Amplicon comprehensive enrichment | |
| WO2023220621A1 (en) | Long-range dna sequencing through concatenating chimeric amplicon reads | |
| HK40062228A (en) | Quantitative amplicon sequencing for multiplexed copy number variation detection and allele ratio quantitation | |
| WO2023107512A2 (en) | Methods for detecting inherited mutations using multiplex gene specific pcr |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18849523; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18849523; Country of ref document: EP; Kind code of ref document: A1 |