WO2025083068A1 - Method for capturing epigenetically modified dna - Google Patents
Method for capturing epigenetically modified DNA
- Publication number
- WO2025083068A1 (PCT/EP2024/079220)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- region
- nucleic acid
- acid molecules
- nucleotide
- strand
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
Definitions
- the present invention relates to methods for capturing nucleic acid sequences while keeping/preserving their epigenetic information, preferably for capturing certain regions of the genome, thereby enriching these regions while maintaining their epigenetic information, with the same probability/efficiency, independently of whether one or more nucleotides are epigenetically modified.
- the invention also relates to computer programs related to the methods of the invention.
- When studying the epigenetic variations present in a polynucleotide of interest or in a region of interest within a polynucleotide (e.g., the methylation status of a region of interest within a polynucleotide), the sample usually contains a plurality of nucleic acid molecules that differ in the chemical (epigenetic) modifications present in one or more nucleotides at one or more loci (e.g., the sample usually contains a plurality of nucleic acid molecules that differ in their methylation status).
- the variations in the epigenetic modifications present in a certain sample may vary from non-modified molecules to fully modified molecules, and all the possibilities in between. If the epigenetic modification is methylation, the variations in the pattern of methylated cytosines may vary from non-methylated molecules to fully methylated molecules, and all the possibilities in between.
- WGBS Whole-Genome Bisulfite Sequencing
- the molecules from a given sample are first treated with bisulfite, to preserve in each molecule the information regarding the original methylation status of the sample.
- the chemical transformation with bisulfite of the nucleic acids results in the generation of ambiguity, as non-methylated cytosines will be transformed to uracils and visualized (read) as thymines, whereas the methylated cytosines will not be transformed and will be read as cytosines.
- the sample is sequenced to study the sequence and to ascertain the methylation status of the sample.
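The read-out rule described above (non-methylated cytosines are read as thymines, methylated cytosines are still read as cytosines) can be sketched in a few lines of Python. This is only an illustration: the function name `bisulfite_convert`, the 1-based position convention and the ten-base example sequence are assumptions made for the sketch, not part of the described method.

```python
def bisulfite_convert(sequence: str, methylated_positions: set[int]) -> str:
    """Return the sequence as it would be read after bisulfite treatment.

    Positions are 1-based. Non-methylated cytosines are read as T,
    methylated cytosines stay C, all other bases are unchanged.
    """
    converted = []
    for position, base in enumerate(sequence.upper(), start=1):
        if base == "C" and position not in methylated_positions:
            converted.append("T")  # non-methylated C -> U, read as T
        else:
            converted.append(base)  # methylated C and A/G/T are kept
    return "".join(converted)


example = "ACCGTCGACG"  # illustrative sequence only
print(bisulfite_convert(example, set()))       # no cytosine methylated
print(bisulfite_convert(example, {3, 6, 9}))   # cytosines at positions 3, 6 and 9 methylated
```

Two molecules with the same original sequence therefore yield different reads whenever their methylation patterns differ, which is the ambiguity the text refers to.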
- Prior to sequencing, the molecules are generally randomly fragmented (e.g., by physical shearing), a step that is essential in sample preparation for sequencing platforms such as Next Generation Sequencing.
- nucleic acid molecule e.g., DNA in the case of WGBS
- probes or baits that are designed to target specific regions of interest.
- these probes are marked (e.g., biotinylated) and can be recovered (e.g., using streptavidin-coated magnetic beads). The process can be used to capture targeted nucleic acids.
- the epigenetic modification status (such as the methylation status) of a certain region of interest (ROI) within a genome or, generally, within a nucleic acid molecule, where the ROI comprises nucleotides which are susceptible of carrying epigenetic modifications (e.g., cytosine methylations or other chemical modifications).
- Nucleic acid molecules comprising nucleotides located at loci corresponding to those in the ROI may be fully modified or non-modified, and all the possibilities in between.
- two possibilities exist.
- One possibility would be to design one probe for each of the possible molecules (with different epigenetic modification information preserved), to capture all molecules. If the number of possibly modified nucleotides within a region of interest covered by a single probe is high, the number of probes to capture all possible combinations within the same region would then be really high.
- the second possibility would be to design a single "consensus" probe.
- the first possibility (different probes) is expensive and time-consuming, since a large number of probes would be needed when the sample contains multiple possible epigenetic variation information.
- the bias will also cause molecules that bind with high affinity to the consensus probe (again due to their specific preserved modification status information) to be captured with more efficiency/affinity and thus to be more represented in the final sequencing step, even if they were scarce in the original sample.
- the bias will ultimately result in an erroneous assessment of the modification status (e.g., methylation status) of a certain sample.
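A rough sketch of both problems, under stated assumptions: with n independently modifiable cytosines there are 2^n possible converted sequences (hence 2^n probes for the first possibility), and a single consensus probe matches those sequences unequally. The example sequence, the choice of variable positions, the "consensus" target (here simply the fully converted read) and the use of a mismatch count as a crude stand-in for hybridization affinity are all illustrative assumptions, not details taken from the text.

```python
from itertools import product


def converted_variants(sequence: str, variable_loci: list[int]) -> list[str]:
    """Enumerate converted reads for every methylation pattern.

    Cytosines at variable_loci (1-based) may be methylated (stay C) or
    not (read as T); every other cytosine is assumed non-methylated.
    """
    variants = []
    for pattern in product([False, True], repeat=len(variable_loci)):
        methylated = {locus for locus, is_meth in zip(variable_loci, pattern) if is_meth}
        read = "".join(
            "T" if base == "C" and pos not in methylated else base
            for pos, base in enumerate(sequence, start=1)
        )
        variants.append(read)
    return variants


def mismatches(read: str, probe_target: str) -> int:
    """Count positions where a read differs from the sequence the probe targets."""
    return sum(a != b for a, b in zip(read, probe_target))


variants = converted_variants("ACCGTCGACG", [3, 6, 9])
consensus_target = "ATTGTTGATG"  # assumption: probe designed against the fully converted read

print(len(variants), "distinct sequences, i.e. probes needed for full coverage")
for read in variants:
    print(read, "mismatches vs consensus target:", mismatches(read, consensus_target))
```

In this toy case the mismatch count ranges from zero to three depending only on the methylation pattern, which is the source of the capture bias discussed above.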
- the present invention addresses the above needs and provides a new capturing method capable of overcoming the bias caused by the use of consensus capture probes that are designed to hybridize to a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed to them in order to preserve the information on their original epigenetic modifications (e.g., a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed to them to preserve the information of their original methylation status).
- the new method comprises the use of a single capture probe for all molecules of the sample, avoiding the need to design, synthesise and use a large number of probes to capture every possible molecule that could be comprised in the sample.
- the first region may comprise one or more modified nucleotides or transformed modified nucleotides, or a copy thereof.
- the second region is characterized by being identical or at least substantially identical in each of the nucleic acid molecules comprised in the plurality at at least a position which corresponds to the same locus in a region of interest (ROI), wherein the locus in the region of interest is occupied by a nucleotide which is susceptible of being modified.
- ROI region of interest
- because the second region is identical or substantially identical, in all the molecules, at at least a locus occupied by a nucleotide which is susceptible of being modified in the ROI, it is possible to capture all of the molecules in the plurality with a single probe, and with the same efficiency and affinity regardless of the original modification status of the nucleotide susceptible of being modified, thus eliminating the bias derived from the presence of a different epigenetic status in at least two of the molecules.
- the present invention can be used in, e.g., a type of sequencing technology called Genomic and Epigenomic Unified Sequencing (GEUS) (as described, e.g., in WO 2015/104302), which makes it possible to account for methylation and to greatly reduce the error rate by interrogating the same position of the original sequence in different contexts from two related strands.
- GEUS Genomic and Epigenomic Unified Sequencing
- Figure 1 represents the capturing step of a method for determining the epigenetic status (methylation status in this case) of 8 different original nucleic acid molecules corresponding to the same region of interest (ACCGTCGACG, wherein "C” represents a cytosine which may or may not be modified, e.g., methylated).
- the 8 original nucleic acid molecules have the exact same nucleotide sequence but different epigenetic modifications (see the left-hand region (or 5' region) of the molecules represented in Figure 1B).
- Figure 1A: the eight molecules in Figure 1B were treated with bisulfite to convert non-methylated cytosines into uracils/thymines (represented with a T) and to differentiate them from the originally methylated cytosines, which are resistant to bisulfite treatment.
- the eight molecules in Figure 1A become different in sequence due to the original differences in the methylation status in the first region of each molecule, see Figure 1A. This way, the different epigenetic modification status in the original molecules ("first region" in Figure 1B) is fixed or preserved in the molecules of the present invention (Figure 1A).
- the "second" region (the 3' region in this specific case) in every molecule is identical or at least substantially identical at at least a locus which is occupied by a nucleotide susceptible of being modified (in this case, there are 8 identical inserts in the 3' region of the molecules, see Figure 1A).
- a single capture probe designed to bind (hybridize) to at least a portion of the second region which comprises at least one locus occupied by a nucleotide which is susceptible of being modified will capture with same affinity and efficacy (i.e., without generating a bias due to the different epigenetic status within the original nucleic acid molecules) all of the nucleic acid molecules, regardless of their differences in sequence.
- the first regions of the molecules comprised in the plurality are not the same, but all comprise at least one nucleotide in a position which corresponds to a certain locus in a certain region of interest (e.g., in a certain region of interest in a genome or in a certain region of interest in a nucleic acid molecule).
- a single probe will capture the nucleic acid molecules comprised in a plurality with the same efficiency/efficacy regardless of the epigenetic status of the original molecule at a certain locus, since the second region of each molecule in the plurality is identical in all molecules at least at a position which corresponds to the same locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
- the probe will capture molecules 1-3, 6 and 8 (i.e., a plurality of nucleic acid molecules) with the same efficacy/affinity regardless of the epigenetic modifications present at a certain locus, e.g., locus 6, because all these molecules comprise, at a position corresponding to locus 6 in the region of interest, the same nucleotide (G).
- molecules 1-3, 6 and 8 i.e., a plurality of nucleic acid molecules
- G nucleotide
- Region of interest A C C G T C G A C G (see Figure 1D)
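The capture principle can be illustrated with a small sketch: molecules whose first regions differ after conversion all carry the same second region, so one probe complementary to that second region matches every molecule equally. The concrete second-region sequence, the tuple representation of a molecule and the mismatch count used as a proxy for hybridization are hypothetical choices made for this illustration; they are not taken from the figures.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def hybridization_mismatches(region: str, probe: str) -> int:
    """Mismatches between a region and the sequence the probe is complementary to."""
    target = reverse_complement(probe)
    return sum(a != b for a, b in zip(region, target))


# Hypothetical second region, identical in every molecule of the plurality.
second_region = "ATTGTTGATG"
# First regions differ because the original methylation patterns differed.
first_regions = ["ATTGTTGATG", "ATCGTTGATG", "ATTGTCGATG", "ATCGTCGACG"]
molecules = [(first, second_region) for first in first_regions]

probe = reverse_complement(second_region)  # single capture probe
for first, second in molecules:
    print(
        first,
        "first-region mismatches:", hybridization_mismatches(first, probe),
        "second-region mismatches:", hybridization_mismatches(second, probe),
    )
```

Targeting the first region would give a different mismatch count per molecule, whereas targeting the shared second region gives the same (zero) count for all of them.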
- the bias caused by the use of consensus capture probes that are designed to hybridize to a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed to them in order to preserve the information on their original epigenetic e.g., a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed to them in order to preserve the information on their original methylation
- the method of the present invention comprises the use of at least a single capture probe for all molecules of the sample, avoiding the need to design, synthesise and use a large number of probes to capture every possible molecule that could be comprised in a sample.
- the present invention provides a method comprising:
- each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in the region of interest which is occupied by a nucleotide susceptible of being modified, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the molecules of the plurality of nucleic acid molecules may have (and preferably have) a different nucleotide at least at one position
- Figure 1 Schematic representation of a plurality of nucleic acid molecules from the exact same region or with the exact same nucleotide sequence, wherein the nucleic acid molecules differ between them in three transformed modified nucleotides and/or copies thereof in the first region. Every molecule has an identical sequence at the second region that binds with the same complementarity and thus, same efficiency, to the same probe.
- certain Ts highlighted in red and underlined
- cytosines with a certain methylation status e.g., unmethylated (non-modified)
- the nucleotide molecules comprised in the plurality differ in the nucleotide sequence of the first region, as a consequence of the treatment (conversion or transformation) for preserving the epigenetic modification status of the original molecules, but are substantially identical (e.g., 100% identical) at the second region, and are also identical to the probe in bold (100% complementary, if strictly speaking) (Figure 1A).
- Figure 1A Schematic representation of a plurality of molecules corresponding to the exact same region of interest, or with the exact same nucleotide sequence, wherein the nucleic acid molecules differ between them in three modified nucleotides in the first region.
- Unmodified cytosines (unmethylated) are represented in red and highlighted with a double underline; modified cytosines (methylated) are represented in black and highlighted with a single underline at the first region of the molecule.
- the second region is identical or substantially identical for all of the molecules (1B).
- the first regions of the molecules in Figure 1B correspond to the original molecules, and are particular realizations (possibilities) of nucleic acid molecules corresponding to the region of interest (A C C G T C G A C G, wherein C refers to either methylated C (mC) or non-methylated C (uC)).
- the molecules in Figure 1A correspond to the molecules in Figure 1B once the molecules in Figure 1B have been treated with an agent capable of converting the non-methylated cytosines into U (T), e.g., bisulfite.
- T non-methylated cytosines in U
- Schematic representation of several nucleic acid molecules which comprise a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of the original nucleic acid molecules.
- the first region of each of the molecules comprises a nucleotide at at least one position corresponding to a locus in the region of interest (A C C G T C G A C G, wherein C refers to either methylated C (mC) or non-methylated C (uC)) which may be occupied by a nucleotide susceptible of being modified.
- the sequence of the second region is identical among a plurality of nucleic acid molecules at least at a position corresponding to the same locus in the region of interest, wherein the locus in the original molecule is occupied by a nucleotide susceptible of being modified.
- the second region of all molecules comprised in the plurality is complementary to the unique probe and, thus, all molecules comprised in said plurality may bind to the probe with the same efficiency regardless of the methylation status of the nucleotide in that locus (Figure 1C).
- Figure 1C Schematic representation of the molecules represented in Figure 1C, but before transformation with an agent capable of converting the non-methylated cytosines into U (T), e.g., bisulfite.
- Unmodified cytosines (unmethylated) are represented in red and highlighted with a double underline, and modified cytosines (methylated) are highlighted with a single underline in the first region of the molecule (Figure 1D).
- the first regions of the molecules in Figure 1D correspond to the original molecules, and are particular realizations (possibilities) of nucleic acid molecules with nucleotides at positions corresponding to at least one locus in the region of interest.
- the molecules in Figure 1C correspond to the molecules in Figure 1D once the molecules in Figure 1D have been treated with an agent capable of converting the non-methylated cytosines into U (T), e.g., bisulfite.
- FIG 2 Schematic representation of bisulfite sequencing library preparation after the bisulfite conversion step and the number and sequences of probes needed to capture with the same complementarity/efficiency a plurality of molecules corresponding to the exact same region of interest.
- the molecules differ between them in the nucleotides at three positions, corresponding to loci 3, 6 and 9.
- the differences in these nucleotides among the molecules are the consequence of the transformation of the original molecules with an agent capable of transforming non-methylated cytosines into uracil (thymine).
- the plurality of molecules is the same as the plurality of the first regions of the molecules shown in Figure 1A. As in Figure 1, double underlined "T" represents converted unmethylated cytosines, and single underlined "C" represents methylated cytosines or copies thereof (unmethylated cytosines).
- Figure 3 Schematic representation of a nucleic acid molecule according to the present invention starting from a GEUS molecule (as described, e.g., in WO 2015/104302) as a template.
- the 5' region e.g., the first region
- the 5' region is represented by left to right, the nucleotides A-T-T-G-A-A-C-G-C-T in the gradient coloured (grey scale) lines.
- the darker side of the gradient represents the start (5') of the original molecule and the lighter side of the molecule, the end of it (3').
- the 3' region is represented by left to right, the nucleotides A-G-T-G-T-T-T-G-A-T.
- the darker side of the gradient represents the start of the original molecule and the lighter side, the end.
- the adapters at the 5' and 3' ends of the molecule are represented by a line in solid grey.
- the nucleotide sequence covalently linking the 5' region and the 3' region is represented in black, and it links the 5' and 3' regions of the molecule.
- Optional unique molecular barcodes (UMIs) are represented by a white line.
- At least two different primers are represented by thin black arrows indicating the direction of the sequencing synthesis.
- the white thick arrows containing a sequence correspond to the reads synthesized after the primer hybridization in 5' to 3' direction. Reads can have the same or different length.
- R1 stands for read 1
- R2 for read 2
- mC stands for methylated cytosine and uC stands for unmethylated cytosine.
- Figure 4 Schematic representation of a way of obtaining the nucleic acid sequence provided in step (i) of the method of the present invention, as described, e.g., in WO 2015/104302.
- the darker side of the gradient represents the start of the original molecule and the lighter side, the end of it.
- FIG. 5 Schematic diagram showing sequencing step of the method of the present invention inside an NGS machine (e.g., Illumina MiSeq NGS machine).
- NGS machine e.g., Illumina MiSeq NGS machine.
- A is a representation of a tile of the flow cell where the cluster amplification will take place, with a nucleic acid molecule according to the present invention, in this case a DNA GEUS molecule (as described, e.g., in WO 2015/104302), attached to it.
- the end of the molecule contains the NGS adapter, in this case Illumina P7, with the first sample index (to multiplex samples in the same lane), then the external unique molecular index (UMI), then the 5' region (or first region) of the nucleic acid molecule followed by the internal UMI, then a known sequence (i.e., a nucleotide sequence to which primers can bind, which can be a hairpin or a region thereof if it is a GEUS molecule), the synthetic internal UMI followed by the 3' region of the nucleic acid molecule, which in this case is the synthetic 3' region (or second region) of the molecule, the synthetic external UMI and lastly the Illumina P5 NGS sequencing adapter with the second sample index, which prevents index hopping.
- the NGS adapter in this case Illumina P7
- the first sample index to multiplex samples in the same lane
- UMI external unique molecular index
- the synthetic internal UMI followed by the 3' region of the nucleic acid molecule
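The order of elements along the molecule, as listed above, can be written down as a simple ordered layout. The sketch below only records that order and computes where each element would start and end for a given set of lengths; the segment labels, the function name `segment_offsets` and every length value are hypothetical and chosen purely for illustration.

```python
# Segment order as listed in the description of the molecule on the flow cell.
SEGMENT_ORDER = [
    "P7 adapter + sample index 1",
    "external UMI",
    "first (5') region",
    "internal UMI",
    "known sequence",
    "synthetic internal UMI",
    "second (3') region",
    "synthetic external UMI",
    "P5 adapter + sample index 2",
]


def segment_offsets(lengths: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each segment to its (start, end) offsets, 0-based and end-exclusive."""
    offsets = {}
    start = 0
    for name in SEGMENT_ORDER:
        end = start + lengths[name]
        offsets[name] = (start, end)
        start = end
    return offsets


# Hypothetical lengths (in nucleotides), for illustration only.
example_lengths = {name: 8 for name in SEGMENT_ORDER}
example_lengths["first (5') region"] = 100
example_lengths["second (3') region"] = 100

for name, (start, end) in segment_offsets(example_lengths).items():
    print(f"{name}: {start}-{end}")
```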
- step by step how the first 3 reads are sequenced by synthesis after each step of primer hybridization (hybridization of primer 1 and sequencing of read 1, hybridization of primer 2 and sequencing of read 2, hybridization of sample index primer and first index read). Then, the complementary molecule is synthesized and amplified by clustering, and the last 3 other reads are generated step by step in the same way, primer hybridization and synthesis of read 3, primer hybridization and synthesis of read 4 and second index sample primer hybridization and synthesis of the second sample index.
- the reads can have different order of being synthesized and the NGS instrument protocol needs to be adapted accordingly.
- Fig. 6 A) Paired-end (PE) sequencing (which is, in practice, single-end (SE) sequencing) of the molecule provided in step i. of the method of the present invention. B) paired-end (PE) sequencing of the molecule provided in step i. of the method of the present invention.
- PE Paired-end
- SE single-end
- PE paired-end
- Figure 7A-N Exemplary method as described in Example 2.
- the molecule has been divided in two to fit it in the page, and the arrow indicates that the sequence continues.
- FIG. 8A-I Exemplary method as described in Example 3.
- Figure 9A-N Exemplary method as described in Example 4.
- Figure 10 Extended step with methylated cytosines vs extended without methylated cytosines comparison.
- Fig. 11A-C Method with E15 primer.
- the term “about” as used in connection with a numerical value throughout the specification and the claims denotes an interval of accuracy, familiar and acceptable to a person skilled in the art. For instance, the term “about” means the indicated value ± 1% of its value, or the term “about” means the indicated value ± 2% of its value, or the term “about” means the indicated value ± 5% of its value, or the term “about” means the indicated value ± 10% of its value, or the term “about” means the indicated value ± 20% of its value, or the term “about” means the indicated value ± 30% of its value; preferably the term “about” means exactly the indicated value (± 0%).
- Percent (%) sequence identity with respect to polynucleotides described herein is defined as the percentage of nucleotide residues in a candidate sequence that are identical with the nucleotide residues in the reference sequence, after aligning the sequences and introducing gaps, if necessary, to achieve the maximum percent sequence identity, and not considering any conservative substitutions as part of the sequence identity. Alignment for purposes of determining percent of sequence identity can be achieved in various ways that are within the skill in the art, for example, using publicly available computer software such as BLAST. Those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximum alignment over the full-length of the sequences being compared.
- the "percentage of identity" as used herein is decided in the context of a local alignment, i.e., it is based on the alignment of regions of local similarity between nucleobase sequences, contrary to a global alignment, which aims to align two sequences across their entire span.
- percentage identity is calculated preferably only based on the local alignment comparison algorithm.
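As a minimal sketch of the final arithmetic only, the following function computes a percentage of identity from two sequences that have already been aligned (with "-" marking gaps). The alignment itself is assumed to come from an external tool; the function name and the choice of the number of alignment columns as the denominator are assumptions of this sketch, and other conventions for the denominator exist.

```python
def percent_identity(aligned_candidate: str, aligned_reference: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same base.

    Both inputs must be the same length and use '-' for gaps.
    Assumption: the denominator is the number of alignment columns.
    """
    if len(aligned_candidate) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_candidate:
        raise ValueError("alignment is empty")
    identical = sum(
        a == b and a != "-"  # a gap never counts as an identity
        for a, b in zip(aligned_candidate.upper(), aligned_reference.upper())
    )
    return 100.0 * identical / len(aligned_candidate)


print(percent_identity("ATCG-GA", "ATCGGGA"))  # six identical columns out of seven
```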
- the present invention relates to methods for capturing or enriching a plurality of nucleic acid molecules which have information regarding the epigenetic modification status of one or more original nucleic acids.
- the molecules to be captured may comprise, and preferably comprise, in at least one of the molecules, one or more modified nucleotides or transformed modified nucleotides, and/or a copy thereof.
- the present invention provides a method (hereinafter "the method of the present invention"), such as a nucleic acid capturing (hybridisation capture) or enriching (target enrichment) method, comprising at least steps (i) and (ii).
- the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules.
- "plurality" refers herein to two or more nucleic acid molecules.
- the term "plurality of nucleic acid molecules" is not limited, as long as there is more than one nucleic acid molecule.
- the nucleic acid molecules comprised in the plurality provided in step (i) may be DNA or RNA molecules, preferably DNA molecules. They may be single-stranded (ss) molecules (such as ssDNA molecules) or double-stranded (ds) molecules (such as dsDNA molecules).
- the plurality provided in step (i) comprises ss molecules, more preferably ssDNA molecules. They may be synthetic (or partially synthetic) molecules.
- the plurality of nucleic acid molecules provided in step (i) each comprise two regions, a first region and a second region. The length of the regions is not particularly limited.
- the plurality of nucleic acid molecules may comprise, as the first region, fragments of DNA, such as fragments of genomic DNA which have been treated with an agent (or a method or process, see below) capable of converting a nucleotide to another nucleotide which is read distinctly from the original nucleotide.
- agent or a method or process, see below
- genomic DNA refers to the total genetic information of an organism. It is the (biological) information of inheritance which is passed from one generation of organism to the next.
- a nucleic acid may be fragmented through any suitable method including, but not limited to, mechanical stress (sonication, nebulization, cavitation, etc.), enzymatic fragmentation (enzyme digestion with restriction endonucleases, nicking endonucleases, exonucleases, etc.) and chemical fragmentation (dimethyl sulphate, hydrazine, NaCl, piperidine, acid, etc.), or may already be fragmented in the original organism (e.g., cell free DNA).
- the suitable size of fragments may be selected prior to step (i) of the method of the invention. The optimal length will ultimately depend on the probes' properties and suitable ratios and methods and/or the available sequencing instruments and methods and the desired percentage of read overlap, if the ultimate goal for the target enrichment is to sequence the captured molecules.
- the ends of genomic fragments are processed so that the sample can enter the specific protocol of the sequencing platform.
- the genomic fragments are end-repaired after having been fragmented.
- end-repaired refers to the conversion of the nucleic acid (such as DNA) fragments that contain damaged or incompatible 5'- and/or 3'-protruding ends into blunt-ended DNA containing 5'-phosphate and 3'-hydroxyl groups.
- the blunting of the DNA ends can be achieved by enzymes including, without limitation, T4 DNA polymerase (having 5'->3' polymerase activity that fills-in 5' protruded DNA ends) and the Klenow fragment of E. coli DNA polymerase I (having 3'->5' exonuclease activity that removes 3'-overhangs).
- T4 DNA polymerase having 5'->3' polymerase activity that fills-in 5' protruded DNA ends
- Klenow fragment of E. coli DNA polymerase I having 3'->5' exonuclease activity that removes 3'-overhangs.
- the term "dA-tailing” or "A-tailing”, as used herein, refers to the addition of an A base to the 3' end of a blunt phosphorylated DNA fragment. This treatment creates compatible overhangs for subsequent ligation. This step is performed by methods well known by the person skilled in the art by using, for example, the Klenow fragment of E. coli DNA polymerase I.
- the second region comprised in the nucleic acid molecules of the plurality provided in (i) may preferably be synthetic DNA, and preferably does not comprise any modified nucleotide, such as any modified cytosine, e.g., it does not comprise methylated cytosines.
- Both the first and the second regions comprised in the nucleic acid molecules of the plurality provided in (i) are related in the sense that the base identities in both regions provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, as explained in detail below.
- both the first region and the second region of each of the molecules comprised in the plurality provided in (i) provide information regarding the base identities in one or more original nucleic acid molecules.
- These one or more original nucleic acid molecules when treated with an agent or method capable of converting the base of a nucleotide to a base which is read distinctly from the original base (i.e., when converted or transformed), may correspond to the first region of the nucleic acid molecules comprised in the plurality provided in (i).
- both the first region and the second region of each of the molecules comprised in the plurality provided in (i) provide information regarding at least one base identity in a corresponding locus in a region of interest (ROI), preferably the base identity of a nucleotide which is susceptible of being modified.
- ROI region of interest
- region of interest or "ROI” is referred herein as a region (e.g., at least 1 nucleotide, but preferably more, such as at least 2, 3, 4, 5, 10, 15, 20, 30, 50, 100, 160, 180, 200, 250, 300, 350, 400, 500, 1000, 1500, 3000, or more nucleotides) in a genome and/or in a nucleic acid molecule.
- the ROI comprises at least one nucleotide that may have an epigenetic modification, and thus is susceptible of being (epigenetically) modified (e.g., a cytosine, which is susceptible of being methylated and may thus have an epigenetic modification).
- the ROI may comprise more than one nucleotide susceptible of being (epigenetically) modified.
- the nucleotide susceptible of being modified is a cytosine.
- Methylated and non-methylated cytosines are susceptible of being transformed or converted.
- non-methylated cytosines can be converted into uracil upon bisulfite treatment, as will be explained below. Therefore, the ROI represents a region of the genome or a region of a nucleic acid molecule whose epigenetic modification status may be interrogated.
- An example of a ROI may be: 5'-ATCGGGA-3', wherein "C” refers to a cytosine which may be methylated (mC) or not (uC), and is thus a nucleotide susceptible of being modified.
- C refers to a cytosine which may be methylated (mC) or not (uC), and is thus a nucleotide susceptible of being modified.
- a "locus" in a nucleic acid molecule refers to a specific, fixed position on that nucleic acid molecule. For instance, an ROI molecule comprising 7 bp has 7 loci. If the ROI has the following sequence ("C" refers to cytosine which may be methylated (mC) or not (uC)): 5'-ATCGGGA-3',
- the nucleotide located in the locus at position 1 (5') is an A.
- the nucleotide located in the locus at position 2 is a T.
- the nucleotide located in the locus at position 3 is a C, and so on and so forth.
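The 1-based locus numbering described above can be illustrated with a minimal sketch. It assumes the example ROI 5'-ATCGGGA-3' and treats every cytosine as a nucleotide susceptible of being modified; the function names are hypothetical and are not part of the described method.

```python
def base_at_locus(seq: str, locus: int) -> str:
    """Return the base at a 1-based locus, counted from the 5' end."""
    if not 1 <= locus <= len(seq):
        raise ValueError(f"locus {locus} is outside the {len(seq)}-nt sequence")
    return seq[locus - 1]


def susceptible_loci(seq: str, base: str = "C") -> list[int]:
    """List the 1-based loci occupied by a nucleotide susceptible of being modified."""
    return [i for i, b in enumerate(seq, start=1) if b == base]


roi = "ATCGGGA"  # example ROI, written 5' to 3'
print(base_at_locus(roi, 1), base_at_locus(roi, 2), base_at_locus(roi, 3))
print(susceptible_loci(roi))
```

In this example the only locus that may carry a methylation is locus 3.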
- one or more original molecules refers to one or more nucleic acid molecule which at least partially overlaps with the sequence of the ROI and which comprises at least one nucleotide at a position corresponding to a locus in the ROI, wherein the locus in the ROI is occupied by a nucleotide susceptible of being modified.
- the original molecule may comprise more than one nucleotide at more than one positions corresponding to more than one locus in the ROI, wherein the loci in the ROI are occupied by nucleotides susceptible of being modified.
- the original molecules correspond to the first region of the plurality of molecules provided in step (i) before these have been transformed/converted. See Figure 1.
- the one or more original molecules could be 5'-ATuCGGGA-3', or 5'-ATmCGGGA-3'.
- the "ROI” represents a theoretical region of a nucleic acid molecule or genome that comprises at least one nucleotide susceptible of being epigenetically modified
- the "one or more original molecules” represent the actual existing molecules present in a given sample, at least partially overlapping with the sequence of the ROI (the loci), comprising the at least one nucleotide with or without the epigenetic modification.
- an "M” in capital also denotes a mC, i.e., a methylated C
- a "C" in small or capital letter also denotes a uC, i.e., a non-methylated C (see, e.g., Fig. 7-10).
- the original nucleic acid molecule or molecules may be derived from an organism, such as a human or non-human animal, or from plants, bacteria, fungi, yeasts, and/or viruses.
- the one or more original nucleic acid molecules may be a fragment of genomic DNA (e.g., nuclear DNA, mitochondrial DNA and chloroplast DNA).
- the one or more original nucleic acid molecule(s) is/are a fragment of genomic DNA.
- the genomic DNA comprises the DNA of the nucleus (also referred to as chromosomal DNA, including cell-free DNA (cfDNA)) but also the DNA of the plastids (e.g., chloroplasts) and other cellular organelles (e.g., mitochondria, etc.).
- the original nucleic acid molecule may be plasmid DNA or fragments of single stranded nucleic acid molecules (e.g., DNA, cDNA, mRNA).
- the original nucleic acid molecule or molecules may be single-stranded or double-stranded DNA.
- the one or more original nucleic acid molecule or molecules may also be RNA, such as mRNA.
- the one or more original nucleic acid molecules may be a synthetic nucleic acid molecule or molecules, such as synthetic DNA or synthetic RNA.
- the synthetic nucleic acid may be double-stranded or single-stranded.
- the one or more original nucleic acid molecules are fragments of genomic DNA.
- the ROI comprises at least one nucleotide susceptible of conversion at a certain locus (position).
- the ROI may comprise a C in the locus at position 3:
- the original nucleic acid molecules may have, at least at that one locus (e.g., the locus at position 3 in the current example), either a modified nucleotide (e.g., mC) or the non-modified version of the same nucleotide (e.g., uC):
- nucleic acid molecules provided in (i) is obtained, for instance as represented in steps A and B:
- Step A synthesis of the precursor of the second region:
- Step B conversion of the molecules provided in Step A:
- the molecules in Step B represent an example of the plurality of the nucleic acid molecules provided in step (i) of the method of the present invention.
- said plurality of nucleic acid molecules comprises a first and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules.
- the sequence of the second region in the at least two molecules comprised in the plurality provided in step (i) may be identical or substantially identical at at least a position corresponding to a locus in the region of interest which has a nucleotide (which is occupied by a nucleotide) susceptible of being modified.
- the corresponding locus in the original molecule would be at position 5 in the second region (namely, the fifth nucleotide from the 5' start of the second region).
- the corresponding locus in the first region would be at position 3 (namely, the third nucleotide from the 5' start of the first region).
- the corresponding locus in the original molecule and in the ROI is also at position 3 (namely, the third nucleotide from the 5' start of the ROI).
- nucleic acid molecules each comprising a first and a second region.
- the first and the second regions in each of the molecules comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified (e.g., locus 3 in the ROI, as described above).
- the first region comprises a T or a C (molecules 1 and 2, respectively) at a position corresponding to locus 3 in the ROI and the second region comprises, in both molecules, a G at the position corresponding to locus 3 in the ROI.
- the two molecules may have (and preferably have) a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified.
- the sequences of the first regions of both molecules may be (and are preferably) different, because the nucleotide at the position corresponding to locus 3 in the ROI (occupied by a nucleotide susceptible of being modified, C, in the ROI) may be (and is preferably) different in each of the molecules (e.g., T in the first and C in the second one).
- a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the two nucleic acid molecules.
- the nucleotide at a position corresponding to locus 3 in the ROI is occupied, in the second region of both molecules, by the same nucleotide (G).
- the first region of at least one of the molecules of the plurality of nucleic acid molecules may comprise, at least at one certain position (corresponding to a certain locus in the original sequence), a transformed unmodified nucleotide (in this case, the first region of molecule 1 comprises a T (i.e., originally a transformed nucleotide, transformed uC, U, or copy thereof (T)) at position (locus) 3).
- the first region of the other molecule (molecule 2) may not comprise, at the corresponding position (locus), a transformed nucleotide, but a modified one (it comprises a mC, or a copy thereof (uC), at position 3).
- the nucleotide sequence of the second region is identical or at least substantially identical in each molecule (1 and 2). They are certainly identical at at least a position corresponding to a locus in the original molecule which has a nucleotide (which is occupied by a nucleotide) susceptible of being modified (in this case, in position 5 of both second regions, which corresponds to the locus at position 3 in the original molecule, as discussed above, there is a G).
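The two-molecule example above can be reproduced with a short sketch. It rests on several assumptions that are not spelled out at this point in the text: "M" stands for a methylated C and "C" for a non-methylated C, the second region is arranged as the reverse complement of the original molecule, the newly synthesized second region contains only non-methylated cytosines, and conversion turns every non-methylated C into U. All function names are hypothetical.

```python
# "M" = methylated C (pairs with G like any cytosine), "C" = non-methylated C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def reverse_complement(strand: str) -> str:
    """Synthesize the precursor of the second region (carries only non-methylated C)."""
    return "".join(COMPLEMENT[b] for b in reversed(strand))


def convert(strand: str) -> str:
    """Conversion step: non-methylated C becomes U, methylated C (M) is left unchanged."""
    return strand.replace("C", "U")


def build_molecule(original: str) -> tuple[str, str]:
    """Return (first region, second region) after synthesis and conversion."""
    return convert(original), convert(reverse_complement(original))


def second_region_position(locus: int, length: int) -> int:
    """1-based position in the second region that corresponds to a locus in the ROI."""
    return length - locus + 1


molecule_1 = build_molecule("ATCGGGA")  # original with uC at locus 3
molecule_2 = build_molecule("ATMGGGA")  # original with mC at locus 3
pos = second_region_position(3, 7)

print(molecule_1, molecule_2, pos)
```

Under these assumptions the first regions differ at locus 3 (U in molecule 1, M in molecule 2), while both second regions are identical and carry a G at position 5.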
- there is more than one original nucleic acid molecule.
- the original nucleic acid molecules, if more than one, may all have the same sequence, or at least share at least one corresponding locus in the ROI, and may differ only in their epigenetic modifications, such as in their methylation status.
- the original nucleic acid molecule(s) is/are untreated nucleic acid molecule(s), i.e., they have not been converted or transformed, and have preferably not been treated, with agents/methods capable of converting the base of a nucleotide to a base which is read distinctly from the original base.
- the term “epigenetic modification” refers to any chemical modification that may be present in one or more nucleotides, but which does not change the nucleotide sequence (does not change the genetic code sequence of the nucleotide).
- an epigenetic modification may thus be present in any nucleotide, i.e., A, T, U, G and/or C, in DNA or RNA sequences.
- “epigenetic modification” can be also used interchangeably with the term “chemical modification” of a nucleotide.
- a “modified nucleotide” or a “chemically modified nucleotide” or an “epigenetically modified nucleotide” thus refers to a nucleotide that has been chemically modified with an epigenetic modification or a "tag".
- a “modified nucleotide” in the context of the present invention, is a nucleotide that differs in its structure from primary nucleotides (Guanine, Cytosine, Thymine, Uracil, or Adenine), e.g., because it comprises an epigenetically modified base.
- a modified nucleotide may thus be a nucleotide that carries "epigenetic information", i.e., a nucleotide that carries an "epigenetic modification", as described above.
- an epigenetically modified base is a methylated base, a hydroxymethylated base, a formylated base, an acetylated base or a carboxylic acid-containing base.
- the epigenetic modification is a methylation
- the modified base is a modified (e.g., methylated) cytosine.
- An epigenetic modification may refer to cytosine methylation.
- the nucleotide sequence (C) does not change, but the nucleotide cytosine is chemically modified by the incorporation of a methyl group. Hence, the cytosine is chemically modified because it is methylated.
- the principal epigenetic tag found in DNA is that of covalent attachment of a methyl group to the C5 position of cytosine residues in CpG dinucleotide sequences (see, e.g., Handy DE. et al. "Epigenetic modifications: basic mechanisms and role in cardiovascular disease", Circulation, 2011, 123(19):2145-56).
- DNA methylation modulates the chromatin structure and affects cognate gene expression by maintaining various expression patterns across cell types.
- the presence of DNA methylation in the promoter region is directly connected to repression of transcription.
- DNA methylation in the gene body shows positive correlation with gene expression.
- epigenetic modifications may also be associated with the presence of disease.
- 5mC oxidation derivatives could be used as markers in cancer diagnostics and prognostics (see, e.g., Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85).
- CpG CpG dinucleotide sequences
- cytosines other than those in CpG, can be methylated as well. See, e.g., Handy DE. et al., "Epigenetic modifications: basic mechanisms and role in cardiovascular disease", Circulation, 2011;123(19):2145-56.
- Chemical or epigenetic modifications which take place in nucleotides are, for example, 5-methylcytosine (5mC) and its oxidative derivatives (e.g., 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC)) and N6-methyladenine (6mA) in DNA; N6-methyladenosine (m6A), pseudouridine (psi, Ψ), and 5-methylcytosine (m5C) in messenger RNA and long noncoding RNA; or N4-methylcytosine (4mC or m4dC) in bacterial genomes, see, Chen K., Zhao BS.
- RNA molecules are also decorated with similar modifications.
- N6-methyladenosine (m6A) is also present in mRNA, see, e.g., Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85.
- RNAs transfer, ribosomal, small nuclear, and small nucleolar
- Cytosine can also be methylated in RNA in order to form 5mC.
- tRNA modifications are known to affect translation and different physiological processes. For example, in S. cerevisiae, there are 74 genes involved in the installation of ~25 chemically distinct modifications presented at 36 positions in yeast cytoplasmic tRNAs, see Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85.
- modified cytosines refers to cytosine bases that are modified by the replacement or addition of one or more atoms or chemical groups, such as a methyl group.
- base that is detectably dissimilar to cytosine in terms of hybridization properties refers to a base that cannot hybridize (hydrogen bridges will not be present) with a guanine in the complementary strand, such as uracil.
- the conversion of (non-methylated) cytosine to uracil in the paired DNA molecules is performed with a deamination agent such as bisulfite, but any other agent or enzymatic treatment (e.g., TET oxidation of modified cytosines, followed by APOBEC deamination of nonmodified cytosines) may be used.
- Modified cytosines such as methylated cytosines are resistant to the treatment with reagents such as bisulfite and A3A because the cytosines remain unchanged after the treatment with these reagents (e.g., they remain as cytosines) or because, upon treatment or after being copied (e.g., amplified by PCR), they are converted into a base that is complementary to guanine and is read as (unmodified, such as unmethylated) cytosine in polymerase-based amplification and sequencing (e.g., 5-hydroxymethylcytosine that is converted to cytosine-5-methylsulfonate, or 5mC/5fC which is converted to 5hmC/5caC, respectively, after treatment with TET methylcytosine dioxygenase 2).
- the modified cytosine is 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) or 5-formylcytosine (5fC).
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position (locus), a 5-methylcytosine (5mC), a 5-hydroxymethylcytosine (5hmC) or a 5-formylcytosine (5fC).
- the second region of the same molecule does not comprise any one of 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) or 5-formylcytosine (5fC).
- the base detectably dissimilar to cytosine is uracil, which is then complementary to A and copied as T.
- a "nucleotide which is susceptible of being modified” refers to any nucleotide which may carry an epigenetic modification, as described above.
- a cytosine is a preferred nucleotide susceptible of being modified.
- the cytosine may be methylated (mC) or not (uC), i.e., the cytosine may carry at least one epigenetic modification (e.g., methylation).
- the nucleotide which is susceptible of being modified is cytosine.
- base that is detectably dissimilar to a certain nucleotide in terms of hybridization properties refers to a base that cannot hybridize (hydrogen bridges will not be present) with a base which would be otherwise complementary to it in the complementary strand (an example would be adenine, if the original base was guanine; another example would be uracil if the original base was cytosine).
- the base detectably dissimilar to cytosine is thymine or uracil, more preferably is uracil.
- the reagent used in this step could be a reagent capable of converting non-methylated cytosines to a base that is detectably dissimilar to cytosine in terms of hybridization properties but incapable of acting on methylated cytosines.
- agents are, without limitation, deamination agents, bisulfite, metabisulfite or cytidine-deaminases such as activation-induced cytidine deaminase (AID).
- the reagent is bisulfite.
- the base identities in the first and second regions of the nucleic acid molecules comprised in the plurality provided in step (i) are characterized in that they provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, as mentioned above.
- the first and the second regions of the nucleic acid molecules of the present invention are related in the sense that they both have information regarding the base identity in the corresponding loci in an original nucleic acid molecule.
- the information provided by the sequence in the first region is independent from the information provided by the sequence in the second region.
- Each of the nucleic acid molecules of the present invention thus comprises two sources of information regarding the base identity in the corresponding positions (loci) of the original nucleic acid molecule.
- the modification of a nucleotide (e.g., cytosine residue) at a certain position (locus) in the one or more original nucleic acid molecules may be ascertained from the information provided by the identity of the bases at the corresponding positions (loci) in the first and second regions comprised in the nucleic acid molecules of the invention. With this information, the epigenetic modification status (such as the methylation status) in a ROI within a sample can be ascertained.
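How the two regions might be combined to ascertain the original base and its methylation status can be sketched as a small lookup. This is an illustrative decoding rule inferred from the description, not a procedure stated in the text. It assumes that both regions are read after conversion and amplification (so that U appears as T and mC appears as C), that the second region is the converted reverse complement of the original molecule, and that the converted cytosine is non-methylated unless it resisted conversion. The names `CALLS`, `call_locus` and `call_region` are hypothetical.

```python
# (base read in first region, base read at the corresponding second-region position)
# mapped to the inferred base in the original molecule.
CALLS = {
    ("A", "T"): "A",
    ("T", "A"): "T",   # genuine T: the second region carries its complement A
    ("G", "T"): "G",   # complement C in the second region was converted and read as T
    ("C", "G"): "mC",  # cytosine resisted conversion: methylated
    ("T", "G"): "uC",  # cytosine was converted: non-methylated
}


def call_locus(first_base: str, second_base: str) -> str:
    """Infer the original base at one locus from the two independent reads."""
    return CALLS.get((first_base, second_base), "inconsistent")


def call_region(first_read: str, second_read: str) -> list[str]:
    """Infer all loci; locus i pairs with second-region position n - i + 1."""
    n = len(first_read)
    if len(second_read) != n:
        raise ValueError("both regions must cover the same number of loci")
    return [call_locus(first_read[i], second_read[n - 1 - i]) for i in range(n)]


unmethylated = call_region("ATTGGGA", "TTTTGAT")
methylated = call_region("ATCGGGA", "TTTTGAT")
print(unmethylated)
print(methylated)
```

The second region is what distinguishes a converted non-methylated C (read as T opposite G) from a genuine T (read as T opposite A); the first region alone cannot make that distinction.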
- the first region may provide information regarding the base identities in the corresponding loci of an original nucleic acid molecule.
- the second region may provide, independently, information regarding the base identities in the same loci of the same original nucleic acid molecule.
- the base identities in one of the first or second regions provide information of the base identities of the corresponding loci in an original nucleic acid molecule
- the base identities in the other region second or first, respectively
- the sequence of the first region of the nucleic acid molecules of the present invention corresponds to the sequence of the original nucleic acid molecule after this has been treated with an agent (or method or process, as described herein) capable of converting the base of a nucleotide (e.g., non-methylated cytosine(s)) to a base which is read distinctly from the original base (e.g., cytosine) (e.g., bisulfite or A3A), i.e., the sequence of the first region of the nucleic acid molecule of the present invention corresponds to the converted original nucleic acid molecule.
- the first region thus provides information on the base identity in the original nucleic acid molecule.
- the sequence of the second region of the nucleic acid molecule of the present invention may correspond to the converted sequence of the reverse complementary strand of the first region previous to the conversion (e.g., previous to the bisulfite treatment, see below for further details).
- the second region also provides independent information on the base identity in the corresponding loci in the original nucleic acid molecule.
- the term "agent capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide” refers to any agent (e.g., a reactive, reagent or enzyme) or method or process which is able to convert or transform (i.e., to alter its chemical structure) a certain nucleotide, so that the converted or transformed nucleotide is read (recognized) by the enzyme responsible of copying the nucleic acid molecule (e.g., polymerase) as another nucleotide which is different from the original nucleotide.
- the enzyme responsible of copying the nucleic acid molecule will introduce, at that position, a nucleotide which is different to the nucleotide which the enzyme would have introduced if the original nucleotide had not been modified.
- an agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base may be bisulfite. Bisulfite is able to convert uC to U, by deaminating the C.
- a U is read by the polymerase differently from a C, i.e., the polymerase would introduce an A when reading U, instead of introducing a G if it had read C.
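The effect of conversion on copying can be illustrated with a toy model of template-directed synthesis. The pairing table and the function name `copy_strand` are assumptions made for illustration only; the sketch merely encodes that a polymerase places an A opposite U and a G opposite C.

```python
# Base placed by the polymerase opposite each template base.
TEMPLATE_TO_NEW = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def copy_strand(template: str) -> str:
    """Return the newly synthesized strand, written 5' to 3'."""
    return "".join(TEMPLATE_TO_NEW[b] for b in reversed(template))


unconverted_copy = copy_strand("ATCGGGA")  # C at locus 3 was not converted
converted_copy = copy_strand("ATUGGGA")    # uC at locus 3 was deaminated to U

print(unconverted_copy)
print(converted_copy)
```

The two copies differ only at the position opposite locus 3, where G is incorporated for C and A is incorporated for U.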
- agents capable of converting or transforming a nucleotide to another nucleotide which is read distinctly from the original nucleotide are, without limitation, deamination agents, metabisulfite or cytidine-deaminases such as activation-induced cytidine deaminase (AID).
- the enzyme beta-glycosyltransferase is able to glycosylate 5hmCs
- the enzyme APOBEC3A cytosine deaminase (A3A) is able to deaminate uCs to Us.
- the enzyme ten-eleven translocation (TET) methylcytosine dioxygenase 2 is capable of oxidizing 5mC to 5hmC or 5fC or 5caC.
- an agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base may be the AID/APOBEC family of enzymes, see, e.g., Berney, M.
- the agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base is bisulfite.
- bisulfite sodium bisulfite
- bisulfite selectively changes unmethylated cytosines into uracil through deamination, while leaving methylated cytosines (both 5-methylcytosine and 5-hydroxymethylcytosine) unchanged.
- bisulfite ion has its accustomed meaning of HSO₃⁻.
- bisulfite is used as an aqueous solution of a bisulfite salt, for example sodium bisulfite, which has the formula NaHSO₃, or magnesium bisulfite, which has the formula Mg(HSO₃)₂.
- Suitable counter-ions for the bisulfite compound may be monovalent or divalent.
- Examples of monovalent cations include, without limitation, sodium, lithium, potassium, ammonium, and tetraalkylammonium.
- Suitable divalent cations include, without limitation, magnesium, manganese, and calcium.
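The selectivity of bisulfite described above (non-methylated C is deaminated to U, methylated C is left unchanged) can be modeled in a few lines. The sketch assumes the notation "C" for a non-methylated cytosine and "M" for a methylated cytosine, and it additionally models the read-out after amplification, where U is copied as T and a methylated C is copied as a plain C. Function names are hypothetical and the model ignores incomplete conversion.

```python
def bisulfite_convert(strand: str) -> str:
    """Deaminate non-methylated C to U; methylated C (M) is resistant."""
    return strand.replace("C", "U")


def read_after_amplification(converted: str) -> str:
    """Sequence as read after copying: U appears as T, methylated C appears as C."""
    return converted.replace("U", "T").replace("M", "C")


for original in ("ATCGGGA", "ATMGGGA"):
    converted = bisulfite_convert(original)
    print(original, converted, read_after_amplification(converted))
```

After conversion and amplification, a T at locus 3 indicates a non-methylated cytosine in the original molecule and a C indicates a methylated one.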
- the agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base is A3A. In another embodiment, the agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base is the enzyme beta-glycosyltransferase.
- a "converted nucleotide" or a "transformed nucleotide" refers to a nucleotide which has been put in contact with an agent or method capable of converting the base of a nucleotide to a base which is read distinctly from the original base, under the conditions suitable for the conversion to occur. If the nucleotide is susceptible of conversion, it will be converted by the action of the agent, thus leading to a "converted or transformed nucleotide".
- converted nucleic acid refers to a nucleic acid which has been put in contact with an agent or method capable of converting the base of a nucleotide into another one which is read distinctly from the original base, under the conditions suitable for the conversion to occur, as explained in detail above. If the nucleic acid molecule comprises one or more nucleotides susceptible of conversion, these will be converted by the action of the agent/method, thus leading to a “converted or transformed nucleic acid".
- the term "convert a nucleotide” refers to the chemical modification of the nucleotide originated by the agent (or method or process) capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide as described above, so that it is read distinctly from the original nucleotide.
- the conversion of C to U takes place by the chemical modification of the structure of C, which is deaminated to give rise to U.
- the one or more original nucleic acid molecules comprises one or more nucleotides susceptible of conversion, as explained above.
- uC is a nucleotide which is susceptible of conversion, because it can be deaminated and converted to U, which is a base which is read distinctly from the original uC.
- a transformed molecule or sequence in the context of the present invention, refers to a sequence or a molecule comprising nucleotides that have been converted.
- the sequence of the first region may not comprise nucleotides at all positions corresponding to all loci in the ROI. But the sequence of the first region comprises nucleotides in at least one, preferably in at least two, more preferably in at least three, or in at least 4, 5, 10, 15, 20, 30, 50, 100, 160, 200, 300, 500, 1000 or more corresponding loci in the ROI.
- the skilled person is able to assign the information (base identity) for each locus in the original nucleic acid molecule and/or in the ROI, combining the information given by both, the first and second regions of the nucleic acid molecule of the present invention.
- the ROI is 5' ATTTGGC 3' and one original nucleic acid molecule is 5' ATTTGGuC 3': 5'-ATTTGGU---GUUAAAT-3'
- the second region (GUUAAAT) has a sequence which was complementary to the reverse sequence (GuCuCAAAT) of the first region (ATTTGGC) before both, the second and the first regions were treated with an agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base, e.g., bisulfite or A3A, and, thus, were converted (e.g., non-methylated C to U, also referred to as "uC-to-T conversion").
- the first region will have a mC, uC in the corresponding locus, and the second region will have a G in the corresponding locus.
- Any other modification will work in the same way considering the original and converted nucleotide sequences and the relationship between the first and second regions. For instance, in this case, both regions provide information regarding the base identity in the corresponding loci in an original nucleic acid molecule (5' ATTTGGuC 3').
- the second region has a sequence which comprises complementary bases at the corresponding locus in the first region ("same complementary" sequence) in tandem before conversion, i.e., the second region (TAAAUUG) has a sequence which is complementary to the sequence in the first region before it has been converted (ATTTGGuC).
- both regions provide information regarding the base identity in the corresponding loci in an original nucleic acid molecule (5' ATTTGGuC 3').
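The second-region sequences quoted above for the original molecule 5' ATTTGGuC 3' can be checked with a short sketch covering both arrangements: the one in which the second region derives from the reverse complement, and the one in which it derives from the complement in the same order. The sketch assumes that all cytosines in the example are non-methylated and that conversion changes every non-methylated C to U; the function names are hypothetical.

```python
# "M" = methylated C, "C" = non-methylated C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def complement(strand: str) -> str:
    """Complement each base without changing the order."""
    return "".join(COMPLEMENT[b] for b in strand)


def convert(strand: str) -> str:
    """Conversion step: non-methylated C becomes U."""
    return strand.replace("C", "U")


original = "ATTTGGC"  # the terminal C is non-methylated (uC)

first_region = convert(original)
second_region_reverse = convert(complement(original)[::-1])
second_region_same = convert(complement(original))

print(first_region, second_region_reverse, second_region_same)
```

In both arrangements the G that pairs with the cytosine of interest survives conversion, which is why the second region still reports that the original base was a C.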
- the nucleic acid molecules of the present invention provide two sources of independent information regarding the true base identity in a certain locus of an original nucleic acid molecule, and conversely, two sources of independent information regarding the true base identity in at least one certain locus in a ROI, wherein the at least one locus in the ROI is occupied by a nucleotide susceptible of being modified.
- the first and second regions of the molecules of the present invention are located, within the molecule, next or close to each other, but they are not overlapping.
- the first and the second regions may be contiguous to each other, i.e., they may be directly linked to each other.
- the first region of a nucleic acid molecule is located towards the 5' region in the molecule or, in other words, the second region of a molecule is located towards the 3' in the same molecule.
- the first region is comprised in the 5' region of the nucleic acid molecule of the present invention
- the second region is comprised in the 3' region of the nucleic acid molecule of the present invention.
- the first region of the nucleic acid molecule is located closer to the 5' end of the molecule than the second region (but may not be exactly at the 5' end of the molecule, e.g., there may be other sequences at the 5' end of the molecule which do not belong to the 5' region), and the second region is located closer to the 3' end of the molecule than the first region (but may not be exactly at the 3' end of the molecule, e.g., there may be other sequences at the 3' end of the molecule which do not belong to the 3' region).
- end refers to the regions of sequence at (or proximal to) either end of a nucleic acid sequence.
- the expression "5' region”, as used in the present invention, refers to a region of a nucleotide strand which is located towards the 5' end of said strand.
- the 5' region of a strand may include the 5' end of said strand.
- the term "5' end”, as used herein, designates the end of a nucleotide strand that has the fifth carbon in the sugar-ring of the deoxyribose at its terminus.
- 3' region refers to a region of a nucleotide strand which is located towards the 3' end of said strand.
- the 3' region of a strand may include the 3' end of said strand.
- the term "3' end”, as used herein, designates the end of a nucleotide strand that has the third carbon in the sugar-ring of the deoxyribose at its terminus.
- At least two of the molecules of the plurality of nucleic acid molecules may have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified.
- one of the molecules (the "first” molecule) may have a T in the first region at a certain position and the other molecule (the “second” molecule) may have, in its first region, a C, in a position corresponding to the same locus as the T in the first region of the first molecule.
- At least two of the molecules of the plurality of nucleic acid molecules have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified.
- the first region of at least one of the molecules of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention comprises, at least at a certain position (locus), a modified nucleotide or a copy thereof, wherein the first region of at least another nucleic acid molecule of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention does not comprise, at least at the same position (same locus), a modified nucleotide or a copy thereof; and wherein the at least a modified nucleotide, or copies thereof, are not present in the second region in any of the nucleic acid molecules comprised in the plurality provided in step (i).
- in the plurality of nucleic acid molecules provided in step (i), there are at least two molecules that differ at least in that, while one molecule comprises, at a certain position (locus) in the first region, a modified nucleotide or a copy thereof, the other nucleic acid molecule does not comprise, in the same position (locus) in the first region, a modified nucleotide or a copy thereof.
- the first region of at least one of the molecules of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention comprises, at least at a certain position (locus) a transformed (converted) modified nucleotide, or a copy thereof, wherein the first region of at least another nucleic acid molecule of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention does not comprise, at least at the same position (same locus), a transformed (converted) modified nucleotide, or a copy thereof.
- the plurality of nucleic acid molecules provided in step (i) there are at least two molecules that differ at least in that, while one molecule comprises, at a certain position (locus) in the first region, a transformed modified nucleotide, or a copy thereof, the other nucleic molecule does not comprise, in the same position (locus) in the first region, transformed modified nucleotide, or a copy thereof.
- the plurality of nucleic acid molecules provided in step (i) there are at least two molecules that differ at least in that, while one molecule comprises, at a certain position (locus) in the first region, a transformed non-modified nucleotide, or a copy thereof, the other nucleic molecule does not comprise, in the same position (locus) in the first region, the same transformed non-modified nucleotide, or a copy thereof (e.g., one of the molecules comprises, at a certain position (locus) in the first region, a U, or a T, and the other nucleic molecule does not comprise, in the same position (locus) in the first region, the same nucleotide or copy thereof, i.e., does not comprise a U or a T).
- "transformed modified nucleotide" or "converted modified nucleotide", as used herein, refers to a nucleotide that was originally an (epigenetically) modified nucleotide, but that has been treated with an agent (or method or process, as described herein) capable of converting a nucleotide or, more specifically, the base of a nucleotide into another base (or into another nucleotide) which is read distinctly from the original base (nucleotide), and that has been converted (or transformed), as defined above; as a consequence of the transformation (conversion), the epigenetic modification has been removed and the epigenetic modification information has been transformed (or converted).
- a "transformed modified nucleotide” or “converted modified nucleotide” is a nucleotide that was originally modified, and that has been treated with an agent or method capable of converting the base of said modified nucleotide into another one, which is read distinctly from the original base.
- the nucleotide has been converted.
- the originally modified nucleotide has been transformed, leading to the "transformed modified nucleotide” or "converted modified nucleotide”.
- a transformed modified nucleotide may be a T derived from a methylated cytosine that has been treated with an agent capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, and converted, so that the original mC has been converted to T as a consequence of the treatment with the AID/APOBEC family of enzymes; see, e.g., Berney, M. and McGouran, J.F., "Methods for detection of cytosine and thymine modifications in DNA", Nat Rev Chem, 2018, 2, 332-348, or Nabel, C.S. et al., "AID/APOBEC deaminases disfavor modified cytosines implicated in DNA demethylation", Nat Chem Biol., 2012, 8(9):751-8.
- the first region of at least one of the molecules comprised in the plurality of nucleic acid molecules provided in (i) comprises, at least at one certain position (certain locus), a modified nucleotide or a copy thereof, and, at least one other molecule comprised in the plurality of nucleic acid molecules provided in (i) does not comprise, at least at the same position (same locus) in its first region, a modified nucleotide or a copy thereof.
- Figure 1B eight original nucleic acid molecules are provided, with the same sequence (i.e., the same bases at the same or corresponding loci), wherein each of them comprises a different methylation status ("C" with single underline represents a methylated cytosine, and "C" with double underline represents a non-methylated cytosine). See the first region (towards the 5' region) of the molecules of Figure 1B.
- the region of interest is, in this case: A C C G T C G A C G, wherein "C” without underline represents a cytosine that may be methylated (mC) or not (uC):
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at a certain position (certain locus), a modified nucleotide (e.g., the third nucleotide, methylated cytosine "C", in molecule number 2 in Figure 1A) or a copy thereof (e.g., the third nucleotide, methylated or unmethylated cytosine "C" depending on the nucleotides given to the polymerase to synthesise the copies, in molecule number 2 in Figure 1A), and the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at the same position (same locus), the same modified nucleotide (which would be a mC) or copy thereof (uC), but it comprises a transformed non-modified nucleotide (e.g., uracil "U") or copy thereof (e.g., the third nucleotide,
- copy of a nucleotide refers to the nucleotide obtained after copying or amplifying (e.g., via PCR) a given nucleic acid molecule comprising a nucleotide, possibly after conversion or transformation.
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof; and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof or a transformed modified nucleotide or a copy thereof; or b) - the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules a transformed modified nucleotide, or a copy thereof; and wherein the first region of at least one other molecule of the plurality of nucle
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof, and the first region of at least one other molecule may comprise, at the corresponding locus, a transformed non-modified nucleotide, or a copy thereof.
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules a transformed modified nucleotide, or a copy thereof; and the first region of at least one other molecule of the plurality of nucleic acid molecules may comprise, at the corresponding locus, a non-modified nucleotide, or a copy thereof.
- the plurality of molecules of the present invention there are at least two molecules that may have (and preferably have), at at least one same locus in the first region, a different sequence (a different nucleotide).
- This potential difference in sequence at at least this specific locus in the first region is a consequence of the potential differences in epigenetic status in the same locus in the original molecules.
- this potential difference in sequence is the way the epigenetic information potentially present in the original molecules is fixed or preserved in the molecules of the present invention (and their copies, if any).
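How a difference in epigenetic status becomes a difference in sequence can be illustrated with a minimal sketch. It assumes a bisulfite-like conversion (unmethylated C is converted to U and read as T after copying, while methylated C is still read as C); other conversion agents described in this document behave differently. The function names and the two example molecules are hypothetical, and the sequence is the example region of interest A C C G T C G A C G.

```python
# Illustrative sketch only. ASSUMPTION: a bisulfite-like conversion in which
# unmethylated C is read as T after conversion and copying, and methylated C
# is still read as C. Function names are hypothetical.

ROI = "ACCGTCGACG"  # example region of interest; each C may be mC or uC


def convert_first_region(seq, methylated_loci):
    """Return the bases read from a converted copy of `seq`.

    `methylated_loci` is a set of 1-based loci holding methylated cytosines.
    """
    read = []
    for locus, base in enumerate(seq, start=1):
        if base == "C" and locus not in methylated_loci:
            read.append("T")  # unmethylated C -> U -> read as T
        else:
            read.append(base)  # methylated C and all other bases unchanged
    return "".join(read)


def differing_loci(a, b):
    """Return the 1-based loci at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


# Two hypothetical original molecules with the same sequence:
# one methylated at locus 3 only, the other not methylated at all.
mol_methylated = convert_first_region(ROI, {3})
mol_unmethylated = convert_first_region(ROI, set())

print(mol_methylated)    # ATCGTTGATG
print(mol_unmethylated)  # ATTGTTGATG
print(differing_loci(mol_methylated, mol_unmethylated))  # [3]
```

Under these assumptions the two first regions differ only at locus 3, which is where the original molecules differed in methylation status.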
- the one or more original molecules are at least partially overlapping a region of interest (ROI) in the genomic DNA.
- the first regions in the nucleic acid molecules provided in (i) are fragments of genomic DNA (original molecules) which have been treated with an agent capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide, and they all have nucleotides at at least one position corresponding to a locus in the ROI (e.g., they overlap with at least part of the ROI).
- the genomic DNA molecules are generally randomly fragmented.
- the first regions of the molecules are, for instance, transformed fragments of genomic DNA; the bigger the genome, the lower the odds that more than one fragment from independent original molecules share the exact same region (same start and end on the genome reference sequence).
- the first regions of some molecules will comprise at least part of the sequence of the region of interest.
- the first regions of the nucleic acid molecules comprised in the plurality provided in (i), comprising a first and a second region, will not be all identical.
- the nucleic acid sequences of the second region should be identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in the region of interest, wherein the locus is occupied by a nucleotide susceptible of being converted.
- the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at least at one certain position (locus) (the same position (locus) in the second region in all of the molecules), the same nucleotide, which is a nucleotide that corresponds in the ROI to a nucleotide susceptible of being converted by an agent or method capable of converting a nucleotide into another nucleotide which is read distinctly from the original nucleotide.
- the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, at least two nucleotides which are identical or substantially identical in all nucleic acid molecules.
- the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, one or more nucleotides which are identical or substantially identical in all nucleic acid molecules, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 160, 200, 300, 500, 1000 or more, nucleotides which are identical or substantially identical in all nucleic acid molecules.
- the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, at least 1 nucleotide which is identical or substantially identical in all nucleic acid molecules.
- the first region (the 5' region in this specific case) of the molecules shown in Figure 1D may be fragments of genomic DNA and may represent the original molecules.
- the cytosines in the first region have different methylation status in each of the molecules (methylated cytosines are highlighted with a single underline and non-methylated cytosines with a double underline).
- Figure 1C shows a plurality of nucleic acid molecules comprising two regions (they comprise a first region in the 5' region and a second region in the 3' region).
- the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of the original nucleic acid molecule (that overlaps with a ROI in this case).
- Figure 1C shows the transformed or converted version of the molecules of Figure 1D. These molecules may be comprised in the plurality provided in step (i) of the method of the present invention.
- the first region of at least two of them may have (and preferably have) a different nucleotide at at least one corresponding locus.
- molecule 2 has a uC
- molecule 3 has a T.
- the nucleotide is the same in both molecules (G at position 8 (starting from the 5' of the second region), which corresponds to locus 3 in the ROI).
- a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules.
- Both the first and the second regions in each of the molecules comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified (locus 3 in the ROI, which is occupied by a C).
- the first region of at least one of the molecules of a plurality of nucleic acid molecules comprises, at a certain position corresponding to a certain locus in the region of interest, a modified nucleotide (e.g., at a position corresponding to position (locus) number three in the region of interest, there is a uC (which corresponds to a methylated cytosine in the original molecule) in molecules number 2, 5 and 8).
- the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at the same corresponding position (locus) in the original molecule, a modified nucleotide (e.g., at the same position (locus), i.e., at a position (locus) corresponding to position number three in the region of interest, there is a T, a copy of a transformed non-methylated cytosine (which corresponds to a non-methylated cytosine in the original molecule), in molecules number 3 and 7).
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at a certain position (locus of the original molecule), a modified nucleotide (i.e. a mC), or a copy thereof (i.e. mC or uC, complementary to a G)
- the first region of at least one other molecule of the plurality of nucleic acid molecules comprises, at least at the same position (locus of the original molecule), a transformed non-modified nucleotide (i.e., U) or a copy thereof (i.e., a T), both complementary to a A.
- the second region of the nucleic acid molecules comprised in a plurality of nucleic acid molecules comprises, at the same position corresponding to the same locus in the region of interest, at least one nucleotide which is identical or substantially identical in the plurality of nucleic acid molecules, and the locus is occupied in the original molecule, by a nucleotide susceptible of being modified or transformed.
- For instance, in Figure 1C, at position 8 in the second region of molecules 2, 3, 5, 7 and 8, there is always a G.
- Position 8 in the second region corresponds to locus 3 in the ROI. Hence, although this corresponding locus was originally occupied by a mC in molecules 2, 5 and 8 and by a uC in molecules 3 and 7 (see locus 3 at the first region of Figure 1D), in the second region the corresponding locus is occupied by the same nucleotide, i.e., a G.
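The correspondence between position 8 of the second region and locus 3 of the ROI can be checked with a short sketch. It assumes that the second region is the reverse complement of the 10-nucleotide example ROI (A C C G T C G A C G), which is consistent with the G at position 8 described above but is not stated in this form in the text. The helper names are hypothetical.

```python
# Illustrative sketch only. ASSUMPTION: the second region is the reverse
# complement of the 10-nt example ROI. Helper names are hypothetical.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def roi_locus_for_position(position, roi_length):
    """Map a 1-based position in the second region to a 1-based ROI locus."""
    return roi_length - position + 1


ROI = "ACCGTCGACG"
second_region = reverse_complement(ROI)

position = 8
locus = roi_locus_for_position(position, len(ROI))

print(second_region)                # CGTCGACGGT
print(second_region[position - 1])  # G
print(locus, ROI[locus - 1])        # 3 C
```

Because this second region is built from unconverted sequence, the G at position 8 is the same whether locus 3 originally carried a mC or a uC.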
- the second region of the nucleic acid molecules comprised in a plurality of nucleic acid molecules does not comprise any modified nucleotide.
- the second region of the nucleic acid molecules comprised in a plurality of nucleic acid molecules does not comprise any modified C, preferably it does not comprise any methylated C (mC).
- the resulting molecule provided in step (i) comprises at least four regions that are substantially different from each other in sequence, so that one primer can only specifically bind to one of the four regions, and not to the others.
- Said regions are named 1, 2, 3, and 4 in the context of a strand with a Watson insert, and regions 1', 2', 3' and 4' in the context of a strand with a Crick insert, see e.g., Fig. 7A.
- Regions 1 to 4 and 1' to 4' can also be referred to in the present document as A, B, C, D or A', B', C', D', respectively.
- said 1-4 and 1'-4' regions are not the same as the first and second regions of the molecule, which have been defined above.
- the Watson and Crick insert may be representative of the first region of the molecule provided in step (i).
- the present invention refers to regions 1 to 4 or 1' to 4', it is also referring to the complementary sequences thereof.
- amplification and sequencing primers can be designed against one of said four regions, so that the primers specifically bind only to one of said regions, and not to the others.
- a region is "substantially different" to another region when the percentage of nucleobase identity between both regions is less than 90%, such as less than 80%, or less than 70%, or less than 60%, or less than 50%, or less than 40%, or less than 30%, or less than 20%, or less than 10%.
- a region is "substantially different” to another region when the percentage of nucleobase identity between both regions is such that it does not allow a primer that is capable of specifically binding (specifically hybridizing) to one of these regions to specifically hybridize to the other.
- a region is "substantially different" to another region when the percentage of nucleobase identity between both regions is such that it does not allow a primer that is capable of efficiently hybridizing to one of the regions to efficiently hybridize to the others.
- "efficient hybridization" refers herein to a hybridization that has sufficient specificity to serve as a primer for a specific amplification or sequencing step.
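The percentage-based definition of "substantially different" can be sketched as a simple check. This is a simplification: it compares two equal-length regions position by position without gaps, whereas an alignment tool such as BLASTN would be used in practice. The function names, the 90% default threshold (the broadest figure given above) and the example sequences are illustrative assumptions.

```python
# Illustrative sketch only. ASSUMPTION: equal-length regions compared
# position by position without gaps; names and sequences are hypothetical.


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def substantially_different(a, b, threshold=90.0):
    """True when nucleobase identity is below `threshold` percent."""
    return percent_identity(a, b) < threshold


region_a = "ACGTACGTAC"  # hypothetical primer-binding region
region_b = "TTGCAAGGCT"  # hypothetical, unrelated region

print(percent_identity(region_a, region_b))         # 30.0
print(substantially_different(region_a, region_b))  # True
print(substantially_different(region_a, region_a))  # False
```

A stricter embodiment can be modelled by lowering `threshold`, for example to 50.0.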
- the at least four regions (1-4 or 1'-4') that are substantially different from each other of the resulting molecule of step (i) are located flanking the first and second regions of the molecule of step (i).
- "flanking" refers herein to a location at both sides of a given region.
- the at least four regions that are substantially different from each other of the resulting molecule of step (i) are located 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 or more than 16 nucleotides upstream and downstream of the first and second regions of the molecule of step (i).
- regions 2 and 2' are comprised in the linking region between the first and second regions of the molecules provided in (i).
- regions 3 and 3' are comprised in the linking region between the first and second regions of the molecules provided in (i).
- the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to said four different regions in the nucleic acid molecule provided in (i), wherein:
- 1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- 2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i);
- 3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
- 4. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
- the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to said four different regions in the nucleic acid molecule provided in (i), wherein:
- 1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- 2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i);
- 3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); or
- 4. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
- the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to three of said four different regions in the nucleic acid molecule provided in (i), wherein:
- 1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- 2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
- 3. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
- the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to at least three of said four different regions in the nucleic acid molecule provided in (i), wherein:
- 1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- 2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i); and
- 3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i).
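The primer combinations listed above can be summarised as a small lookup from primer-binding region to the region it is used to sequence. The sketch below is an illustrative data structure, not part of the described method: it ignores the adapter alternatives, the key names are hypothetical, and the linear order shown (with region 2 before region 3 inside the linking region) is an assumption.

```python
# Illustrative sketch only. ASSUMPTIONS: adapters are ignored, names are
# hypothetical, and region 2 is placed before region 3 in the linking region.

LAYOUT = [
    "region_1", "first_region", "region_2",
    "region_3", "second_region", "region_4",
]

# Which region each primer-binding site is said to sequence.
PRIMER_TARGET = {
    "region_1": "first_region",
    "region_2": "first_region",
    "region_3": "second_region",
    "region_4": "second_region",
}


def covered_targets(primer_sites):
    """Return the set of regions that the chosen primer sites can sequence."""
    return {PRIMER_TARGET[site] for site in primer_sites}


def reads_both_regions(primer_sites):
    """True when the chosen sites cover the first and the second region."""
    return covered_targets(primer_sites) == {"first_region", "second_region"}


print(reads_both_regions(["region_1", "region_4", "region_2"]))  # True
print(reads_both_regions(["region_1", "region_3", "region_4"]))  # True
print(reads_both_regions(["region_1", "region_2"]))              # False
```

The first two calls mirror the three-primer combinations described above. The last shows that binding only regions 1 and 2 would leave the second region unsequenced.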
- the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules provided in (i), as described in detail above (see also Figures 1A and 1B).
- a region is "substantially identical" to another region when the percentage of nucleobase identity between both regions is at least 98%, at least 99%, at least 99.9%, preferably at least 99.99%.
- the "substantial identity" includes the possible errors (i.e., insertion, deletion or substitution of nucleotides) made by polymerase enzymes or by DNA damage, library processing, sequencing or mapping.
- the second region of the molecules provided by the invention is also identical or at least substantially identical in all of the plurality of nucleic acid molecules, because it has been synthesised using the original molecule as a template, and before any transformation step occurs (see below for an exemplary embodiment on how to provide the molecules of step (i)).
- the nucleotide sequence of the second region will be identical or substantially identical in all of the plurality of nucleic acid molecules.
- the second region represents a common region in all of the plurality of nucleic acid molecules, that will serve for an efficient capture step when a capture probe is designed to hybridize to said second region.
- the first and second regions of the nucleic acid molecules of the present invention are linked, preferably covalently linked.
- the first and second regions are covalently linked by a third region, also called herein a "linking region".
- the third or linking region comprises or, preferably, is a nucleotide sequence.
- the first and the second regions are directly linked to each other, i.e., the first and the second regions are continuous in the molecule, and there is no linker between them.
- the linker is a nucleotide sequence (such as an adaptor) that is identical or substantially identical in all of the plurality of nucleic acid molecules.
- primers can (at least partially) bind (hybridize) to said linker.
- the third region is preferably a nucleotide sequence that is long enough so that primers can (at least partially) bind (hybridize) to it, preferably with enough specificity so that the primer does not substantially bind to other regions of the molecule, in order to sequence the molecule of the present invention, especially the first and second regions of the molecule.
- primer refers to a short strand of nucleic acid that is at least partially complementary to a sequence in another nucleic acid and serves as a starting point for nucleic acid (e.g., DNA) synthesis.
- the primer is at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, preferably at least 18, at least 20, at least 25, at least 30 or more bases long.
- complementary refers to the base pairing that allows the formation of a duplex between nucleotides or nucleic acids, such as, for instance, between the two strands of a double-stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single-stranded nucleic acid or between an oligonucleotide probe and its complementary sequence in a DNA molecule.
- Complementary nucleotides are, generally, A and T (or A and U), or C and G.
- Two single-stranded DNA molecules are said to be substantially complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with at least about 60% of the nucleotides of the other strand, such as at least 70%, at least 80%, at least 85%, usually at least about 90% to about 95%, and even about 98% to about 100%.
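The notion of substantial complementarity can be illustrated with a minimal sketch that counts base pairs between two strands aligned in antiparallel orientation. It is a simplification: the strands must have equal length and no insertions or deletions are considered, whereas the definition above allows for an optimal alignment. The function names, the example sequences and the 60% default (the lowest figure mentioned above) are illustrative assumptions.

```python
# Illustrative sketch only. ASSUMPTION: ungapped, equal-length strands, both
# written 5'->3' and aligned antiparallel. Names and sequences are hypothetical.

PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


def fraction_paired(strand, other):
    """Fraction of bases in `strand` that pair with `other` (antiparallel)."""
    if not strand or len(strand) != len(other):
        raise ValueError("strands must be non-empty and of equal length")
    paired = sum((x, y) in PAIRS for x, y in zip(strand, reversed(other)))
    return paired / len(strand)


def substantially_complementary(strand, other, minimum=0.60):
    """True when at least `minimum` of the bases pair."""
    return fraction_paired(strand, other) >= minimum


primer = "AAGGCT"
perfect_site = "AGCCTT"    # reverse complement of the primer
mismatch_site = "AGCCTA"   # one mismatch at the 3' end of the site

print(fraction_paired(primer, perfect_site))               # 1.0
print(substantially_complementary(primer, mismatch_site))  # True
```

With one mismatch in six bases, five of six positions still pair, so the strands remain above the 60% figure but would fail a 90% requirement.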
- the "degree of identity" between two nucleotide regions can be determined using algorithms implemented in a computer and methods which are widely known by the persons skilled in the art. The identity between two nucleotide sequences is preferably determined using the BLASTN algorithm (BLAST Manual, Altschul, S. et al., NCBI NLM NIH Bethesda, Md.).
- the third region or linking region may have a length of at least 5 nucleotides, such as at least 10, or at least 15 nucleotides, or at least 17 nucleotides, such as 17 nucleotides.
- the third region or linking region may have a length of from 5 to 100 nucleotides, such as from 15 to 100 nucleotides, such as from 15 to 80 nucleotides, such as from 15 to 70 nucleotides, preferably from 15 to 80 nucleotides, more preferably from 17 to 70 nucleotides, even more preferably from 25 to 65 nucleotides, such as 17 nucleotides, or 29 nucleotides, or 64 nucleotides.
- the third region or linking region may have a length of at least 20 nucleotides, such as at least 25, 26, 27, 28, 29 or 30 nucleotides.
- the third region or linking region has a length of at least 17 nucleotides, such as 17 nucleotides, or 18 nucleotides, or 19 nucleotides.
- the adaptor has a length of 29 nucleotides.
- the third region or linking region can also have a longer length, such as at least 35, 40, 45, 50, 55 or at least 60 nucleotides.
- the third region or linking region has a length of 64 nucleotides, but it can be longer, such as at least 65, 70, 75, 80 or more nucleotides.
- the third region or linking region may comprise from 5 to 100 nucleotides, preferably from 15 to 80 nucleotides, more preferably from 25 to 70 nucleotides, even more preferably from 29 to 64 nucleotides.
- the length allows that primers can (at least partially) bind (hybridize) to it with enough specificity so that the primer does not substantially bind to other regions of the molecule, in order to sequence the first and/or second regions of the nucleic acid molecules provided in step (i) of the method of the present invention.
- hybridization refers to the process in which two single-stranded polynucleotides bind (at least partially) non-covalently to form a stable double-stranded polynucleotide.
- binding may be used to refer to “hybridize” or “at least partially hybridize”.
- the skilled person is familiar with conditions and buffers suitable for the hybridization of two single-stranded polynucleotides, as described above.
- the nucleic acid molecules of the present invention may further comprise one adapter at the 5' end of the molecule and/or one adapter at the 3' end of the molecule.
- the terms "adapter” and “adaptor” are used interchangeably in the present description and refer to an oligonucleotide or nucleic acid fragment or segment that can be ligated to a nucleic acid molecule of interest.
- the "adapter molecule” of the method of the invention is preferably a DNA molecule having one end which is compatible with the end of the nucleic acid molecules (preferably DNA) of the present invention.
- An adapter or adaptor in genetic engineering is a short, chemically synthesized, single-stranded or double-stranded oligonucleotide that can be ligated to the ends of other DNA or RNA molecules.
- Adaptors may contain "sites for cutting” (e.g., "restriction sites", sequences of oligonucleotides that are recognized by restriction enzymes). The "sites for cutting” add a way to adapt the final elements of the library to the needs of the different sequencing platforms.
- At least one portion of the adaptors has sequences common to all the adaptors present in the population of nucleic acid molecules of step (i), if this is the case. In this case, identical primers for sequencing all molecules could be used.
- the adapters include unique and combinatorial barcodes (also referred to as "combinatorial sequences" or "barcodes" or "barcode sequences" or "combinatorial labelling") that allow sample identification, multiplexing, pairing as well as quantitative analysis.
- the constructs obtained by the methods of the invention may have barcodes that allow generating unique identifiers associated with the initial construct, thus giving the ability to differentiate between constructs.
- Said unique identifiers allow identification of a specific construct comprising said identifier and its descendants. Each unique identifier is associated with an individual molecule or a fragment of an individual molecule in the starting sample. Therefore, any amplification products of said initial individual molecule bearing the unique identifier are assumed to be identical by descent.
- the combinatorial barcodes also allow for quantifying the percentage of individual sequences within a sample and are useful for monitoring biases and error control during the amplification steps.
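The use of unique identifiers for grouping reads by descent and for quantification can be sketched as follows. This is a simplified illustration rather than the method of the invention: reads are assumed to arrive as (barcode, sequence) pairs, each barcode family is collapsed to its most frequent sequence, and percentages are computed over families instead of raw reads. All names and the example reads are hypothetical.

```python
# Illustrative sketch only. ASSUMPTIONS: reads are (barcode, sequence) pairs
# and each barcode family is collapsed to its most frequent sequence.
from collections import Counter, defaultdict


def collapse_by_barcode(reads):
    """Return one consensus sequence per barcode (most frequent sequence)."""
    families = defaultdict(Counter)
    for barcode, sequence in reads:
        families[barcode][sequence] += 1
    return {bc: counts.most_common(1)[0][0] for bc, counts in families.items()}


def sequence_percentages(reads):
    """Percentage of original molecules (barcode families) per sequence."""
    consensus = collapse_by_barcode(reads)
    totals = Counter(consensus.values())
    n_molecules = sum(totals.values())
    return {seq: 100.0 * n / n_molecules for seq, n in totals.items()}


reads = [
    ("AAT", "ATCG"), ("AAT", "ATCG"), ("AAT", "ATTG"),  # one family, one error
    ("GCA", "ATTG"),
    ("TTC", "ATTG"), ("TTC", "ATTG"),
]

print(collapse_by_barcode(reads))
print(sequence_percentages(reads))
```

Here six reads collapse to three original molecules, so amplification duplicates and the single discordant read in family "AAT" do not distort the estimated proportions.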
- the terms "combinatorial sequence”, “barcode sequence”, “barcode” and “combinatorial barcode” are used interchangeably all along the present description and refer to an identifier unique to the individual adapter sequence or a separate nucleic acid (e.g., DNA) molecule (barcode sequence on its own, not belonging to the adapter).
- the barcode sequence is included in the adapter.
- the combinatorial sequence within the adapter sequence is a degenerate nucleic acid sequence.
- the combinatorial sequence may contain any nucleotide, including adenine, guanine, thymine, cytosine, uracil, methylated cytosine (e.g., 5mC or 5hmC) and other modified nucleotides.
- the number of nucleotides in the combinatorial sequence is preferably designed such that the number of potential and actual sequences represented by the combinatorial sequence is greater than the total number of adapters in the library.
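As a rough illustration of the sizing rule above, the following sketch computes the shortest degenerate barcode whose sequence space exceeds a given number of adapters. It assumes a four-letter alphabet with every position fully degenerate; the function name and the example library size are hypothetical and are not taken from the description.

```python
def min_barcode_length(n_adapters: int, alphabet_size: int = 4) -> int:
    """Return the smallest length L such that alphabet_size**L > n_adapters.

    Assumption: every barcode position can hold any of `alphabet_size`
    nucleotides, so a barcode of length L encodes alphabet_size**L sequences.
    """
    if n_adapters < 0 or alphabet_size < 2:
        raise ValueError("n_adapters must be >= 0 and alphabet_size >= 2")
    length = 0
    capacity = 1  # number of distinct sequences of the current length
    while capacity <= n_adapters:
        length += 1
        capacity *= alphabet_size
    return length


if __name__ == "__main__":
    # Hypothetical library of one million adapters.
    print(min_barcode_length(1_000_000))
```

If modified nucleotides such as 5mC are also allowed in the combinatorial sequence, the alphabet grows and a shorter barcode may suffice, which is why the alphabet size is left as a parameter.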
- the combinatorial sequence may be located in any region of the adapter sequence.
- nucleic acid molecules the plurality of nucleic acid molecules provided in step (i) of the present invention. Some non-limiting examples thereof are described below.
- the plurality of nucleic acid molecules provided in step (i) may be generated by a method comprising the following steps:
- Step (a) Providing a plurality or population of nucleic acid molecules, preferably a plurality or population of double-stranded nucleic acid molecules.
- a plurality of nucleic acid molecules is provided.
- the molecules may be single stranded (ss) or double stranded (ds).
- the plurality or population of nucleic acid molecules are ds, preferably they are fragments of genomic DNA.
- the plurality or population of nucleic acid molecules provided in step (a) would correspond to the "original molecules” in the context of the present invention.
- the "population or plurality of nucleic acid molecules”, as used herein, is a collection of nucleic acid molecules that may be ds or ss. For instance, they may be ssDNA molecules, or RNA molecules, as described in detail above.
- the population or plurality of nucleic acid molecules is double stranded, and may be, without limitation, genomic DNA (nuclear DNA, mitochondrial DNA, chloroplast DNA, cfDNA, etc.), plasmid DNA or ds DNA molecules obtained from ss nucleic acid samples (e.g., DNA, cDNA, mRNA, etc.). In an embodiment said population is formed by fragments of dsDNA.
- the plurality of ds nucleic acid molecules is genomic DNA.
- genomic DNA comprises the DNA of the nucleus (also referred to as chromosomal DNA) but also the DNA of the plastids (e.g., chloroplasts) and other cellular organelles (e.g., mitochondria) or circulating/cell-free DNA (cfDNA).
- the double stranded DNA molecules are fragments of genomic DNA.
- the plurality or population of nucleic acid molecules comprised in the first region may correspond to the first region of molecules 1 to 8 as shown in Figure 1B or 1D.
- the nucleic acid molecules differ in their methylation status at the positions occupied by a cytosine in the corresponding loci in the first region and in the original molecule.
- Step (b) Ligating one adaptor to at least one end of the nucleic acid molecules provided in (a), thereby obtaining an adaptor-containing nucleic acid molecule.
- an adapter is ligated to at least the 3' region of the nucleic acid molecules provided in (a).
- the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase.
- the adaptor may be preferably a double-stranded adaptor.
- the adaptors may be added as a complex comprising an elongation primer with a hairpin adapter under conditions adequate for the hybridization of the elongation primer to the second strand of the adapter, wherein the elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule and which, after hybridization with the second strand of the adapter molecule creates overhanging ends, and wherein the hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer to the second strand of the adapter.
- Step (b) is also called herein "ligation step”.
- the adaptors are added (ligated) at least at the 3' end of the nucleic acid molecules provided in (a), but adaptors may also be optionally ligated to the 5' region of the molecules provided in (a).
- an adaptor represented by " — " in the figure
- an adaptor may be ligated to the 3' region of the plurality or population of eight nucleic acid molecules, which then become adaptor-containing nucleic acid molecules:
- the ligation is preferably performed under conditions adequate for the ligation of the adaptors to at least the 3' end of the nucleic acid molecules, thereby obtaining a plurality of "adapter-containing nucleic acid molecules” (also referred herein as “adapter-modified nucleic acid molecules”).
- At least a portion of the adaptors have sequences common to all the adaptors used in step (b).
- when the nucleic acid molecules provided in (a) are double-stranded, the adapter is a so-called "Y adapter", which is a ds adapter, and can be ligated to at least one end of a ds nucleic acid molecule.
- the "Y adapter” has a “Y” form. Further details regarding "Y- adapters” are explained, e.g., in WO 2015/104302.
- in a Y-adapter, the 3' region of the first DNA strand and the 5' region of the second DNA strand form a double stranded region by sequence complementarity, whereas the 5' region of the first DNA strand and the 3' region of the second DNA strand are not complementary.
- Y-adapter and "Y-adaptor” are used interchangeably and, in the context of the present invention, refer to an adapter formed by two nucleic acid (preferably DNA) strands wherein the 3' region of the first DNA strand and the 5' region of the second DNA strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of the double stranded DNA molecules.
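The Y-adapter geometry described above can be sketched in a few lines: the 3' region of the first strand pairs with the 5' region of the second strand, while the remaining arms stay unpaired. This is only an illustrative sketch; both strands are assumed to be written 5' to 3', and the example sequences and function names are hypothetical rather than the sequences of the invention.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def duplex_length(first_strand: str, second_strand: str) -> int:
    """Count the bases at the 3' end of `first_strand` that pair, antiparallel,
    with the 5' end of `second_strand`."""
    paired = 0
    for a, b in zip(reverse_complement(first_strand), second_strand):
        if a != b:
            break  # the strands diverge here: start of the unpaired arms
        paired += 1
    return paired


if __name__ == "__main__":
    # Hypothetical strands: a 12-nt duplex plus non-complementary arms.
    first = "ACACAC" + "GCTCTTCCGATC"
    second = "GATCGGAAGAGC" + "TTTTTT"
    print(duplex_length(first, second))
```

A result smaller than the length of either strand indicates that single-stranded arms remain, which is what gives the adapter its "Y" form.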
- the "Y adapter” can also be obtained by cleavage of a hairpin.
- a hairpin is ligated to at least one end, preferably to both ends of the ds nucleic acid molecules (such as ds DNA molecules) and, in a further step, the hairpin(s) is(are) cleaved, so that at least one of the strands of the ds nucleic acid molecules comprises an adapter ligated to it.
- further primers may be ligated to at least one end of the ds nucleic acid molecules (such as ds DNA molecules), see below.
- a hairpin may be considered to be a type of "Y-adapter", since it may become a Y-adapter if the hairpin is cleaved.
- a "hairpin” or “stem loop” occurs when two regions of the same strand, usually complementary in nucleotide sequence when read in opposite directions, base-pair to form a double helix that ends in an unpaired loop.
- the molecules generated in step (b), when the adapters do not comprise a hairpin loop are contacted with a hairpin adapter under conditions adequate for the ligation of the hairpin adapter to the molecules generated in step (b), as described in detail below.
- a hairpin adapter may be ligated to the adapters ligated in step (b).
- a hairpin adapter may be incorporated as described in WO 2015/104302.
- sequence complementarity refers to a property shared between two nucleic acid sequences, such that when they are aligned antiparallel to each other, the nucleotide bases at each position (locus) will be complementary.
- the 3' region of the second nucleic acid (e.g., DNA) strand of the Y- adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment being located at the 3' end of the 3' region and the second segment being located in the vicinity of the 5' region of the second DNA strand.
- hairpin loop refers to a region of DNA formed by unpaired bases that is created when a DNA strand folds and forms base pairs with another section or segment of the same strand.
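To make the hairpin definition concrete, the sketch below checks whether the 3' end of a single strand can fold back and base-pair with an upstream segment of the same strand, leaving an unpaired loop in between. The minimum loop size, the function names and the example sequence are assumptions chosen for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def hairpin_stem_length(strand: str, min_loop: int = 3) -> int:
    """Longest 3'-terminal segment that can pair with an upstream segment of
    the same strand while leaving at least `min_loop` unpaired bases.

    Returns 0 when no such stem exists. Only perfect, ungapped stems are
    considered, which is a simplification.
    """
    max_stem = (len(strand) - min_loop) // 2
    for stem in range(max_stem, 0, -1):
        three_prime_segment = strand[-stem:]
        partner = reverse_complement(three_prime_segment)
        # The partner must end at least `min_loop` bases before the 3' segment.
        search_end = len(strand) - stem - min_loop
        if strand.find(partner, 0, search_end) != -1:
            return stem
    return 0


if __name__ == "__main__":
    # Hypothetical hairpin: 7-nt stem, 4-nt loop.
    print(hairpin_stem_length("GGCATCG" + "AAAA" + "CGATGCC"))
```

Because the stem ends at the 3' terminus, a polymerase could in principle extend from that end, which is the property the adapter embodiments rely on.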
- the 3' region of the second DNA strand of the Y-adapter does not form a hairpin loop by hybridization between a first and a second segment within said 3' region.
- the adapters preferably ds adapters (such as DNA adapters, which may be Y-adapters or not), comprise at least one barcode sequence in a region, preferably the ds region, of the adapter.
- the adapters comprise at least one barcode sequence in a region, preferably the ds region, of the adapter.
- This at least allows pairing the original nucleic acid (such as DNA) strands of each original ds nucleic acid molecule and deduplicating reads, that is, differentiating reads that originate from the same original sequence from reads that are independent but start and end at the same loci, which becomes crucial for enrichment, especially of low-input/low-diversity libraries or high-depth whole genome sequencing.
- This allows for keeping track of both strands of each ds nucleic acid fragment originally used in step (a) as described above.
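The deduplication idea described above can be sketched as grouping reads by barcode plus mapping coordinates. Reads that share all three are treated as copies of one original molecule, whereas reads with the same coordinates but a different barcode are kept as independent molecules. The read representation (dictionaries with `umi`, `start` and `end` keys) is a hypothetical simplification, not a format defined in the description.

```python
from collections import defaultdict


def group_reads_by_molecule(reads):
    """Group reads into families keyed by (barcode, start, end).

    Each family is assumed to descend from one original molecule; the
    number of families estimates the number of independent molecules.
    """
    families = defaultdict(list)
    for read in reads:
        key = (read["umi"], read["start"], read["end"])
        families[key].append(read)
    return dict(families)


if __name__ == "__main__":
    reads = [
        {"umi": "ACGTAC", "start": 100, "end": 250},  # molecule 1
        {"umi": "ACGTAC", "start": 100, "end": 250},  # PCR copy of molecule 1
        {"umi": "TTGCAA", "start": 100, "end": 250},  # independent molecule
    ]
    families = group_reads_by_molecule(reads)
    print(len(families), "independent molecules")
```

Without the barcode, all three example reads would collapse into a single family, and the independent molecule sharing the same loci would be lost.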
- each of the strands of the double-stranded original nucleic acid molecules can be paired by using barcode sequences in step (b).
- the barcode that pairs each of the strands of an original double-stranded nucleic acid molecule is placed in the 5' region of the second strand of the adapter. Said barcode pairing can be performed either before or after the ligation, or simultaneously with the ligation.
- the Y adapter comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region (which may be referred to as "duplex" or "double-stranded, ds, region” in the context of the present invention) by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase
- the first strand comprises at least two regions:
- a region comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region ("duplex")
- a region that is not complementary to the second strand, i.e., a single-stranded, ss, region
- the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the adapter comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region (which may be referred to as "duplex" or "double-stranded, ds, region” in the context of the present invention) by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase
- the first strand comprises at least a region comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region ("duplex")
- the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the adapter, preferably the Y adapter, is further characterized in that a primer can specifically bind to one of the strands of the double stranded region of the adaptor, or to the complementary or transformed complementary sequence thereof (including reverse complementary and transformed reverse complementary), thereby allowing the primer to be extended by action of a polymerase.
- a primer can specifically bind to the complementary sequence or to the transformed complementary, preferably transformed reverse complementary, of a sequence comprised in the double-stranded region of the adapter (when denatured), thereby allowing the primer to be extended by action of a polymerase.
- the adapter, preferably the Y adapter, is further characterized in that a primer can specifically bind:
- the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region with a length of at least 3 nucleotides, such as 5 nucleotides, or 6 nucleotides, or 7 nucleotides, or 8, or 9, or 10, or more, by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase
- the first strand comprises at least 5 nucleotides, preferably at least 10, 13, or 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
- a 3' region comprising at least two nucleotides, preferably more, such as at least 5, or at least 7, or at least 10, or more, such as at least 12, or at least 13, or at least 14, or at least 15, or at least 16, or more, that are complementary to the second strand and thus form a double stranded region, and
- the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the 3' region (a) of the first strand of the adapter that is complementary to the second strand and thus forms a double stranded region with it comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides.
- the 3' region of the first strand that is complementary to the 5' region of the second strand comprises at least 7, more preferably at least 10 nucleotides.
- the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides.
- the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand comprises at least 5, more preferably at least 10 nucleotides, and it comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand comprises: at least 1, 2, 3, 4, preferably 5, 6, 7, 8, 9, 10 or more non-modified nucleotides complementary to a modified nucleotide, (e.g., G if we are converting C to U/T); at least 1, 2, 3, 4, preferably 5, 6, 7, 8, 9, 10 or more modified nucleotides (e.g., methylated C if we are converting C to U/T), and at least 1, 2, 3, 4, preferably 5, 6, 7, 8, 9, 10 or more non-modified nucleotides (e.g., non-methylated C if we are converting C to U/T).
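The composition requirement for the duplex can be illustrated with a simple counter. This sketch takes one simplified reading of the requirement for the C to U/T case: the two duplex strands together must contain at least a minimum number of G, methylated C and non-methylated C. The letter `M` for methylated C, the function name and the example sequences are all hypothetical conventions introduced here.

```python
from collections import Counter


def duplex_composition_ok(strand_a: str, strand_b: str, minimum: int = 1) -> bool:
    """Check that the duplex holds at least `minimum` of each required base.

    Convention assumed for this sketch: 'G' is the non-modified nucleotide
    complementary to the modifiable one, 'M' is a methylated C, and 'C' is a
    non-methylated C that a conversion treatment would change to U/T.
    """
    counts = Counter(strand_a + strand_b)
    return all(counts[base] >= minimum for base in ("G", "M", "C"))


if __name__ == "__main__":
    # Hypothetical 12-nt duplex strands, one carrying a methylated C.
    print(duplex_composition_ok("GATMGGAAGAGC", "GCTCTTCCGATC"))
    print(duplex_composition_ok("AATT", "AATT"))
```

Raising `minimum` mirrors the preferred embodiments in which several nucleotides of each kind are present in the duplex.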
- the 3' region of the first strand that is complementary to the second strand of the adapter comprises one or more barcode sequences.
- the one or more optional barcode sequences are placed towards, preferably in, the 5' end of the first strand of the adapter and/or in the 3' region of the second strand of the adapter.
- said barcode sequences comprise at least 4, preferably at least 6, nucleotides.
- the one or more optional barcode sequences can be used as unique molecular identifiers within a population of nucleic acid molecules.
- the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region of at least 3, preferably 5 or 10 or 16, or more, nucleotides by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase
- the first strand comprises at least 5 nucleotides, preferably at least 10, 12, 13, or 16, such as 16, 17, 18, 19, 20, 21 nucleotides or more, and:
- a 3' region comprising at least five, preferably at least 7, more preferably at least 10, nucleotides that are complementary to the second strand and thus form a double stranded region
- the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase
- the first strand comprises at least 7 nucleotides, preferably at least 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
- the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the first strand comprises at least 7 nucleotides, preferably at least 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
- first strand comprises SEQ ID NO: 44 and the second strand comprises SEQ ID NO: 45, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44 and 45, respectively, or
- first strand comprises SEQ ID NO: 46 and the second strand comprises SEQ ID NO: 47, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 46 and 47, respectively, and wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule, and, preferably, wherein the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase.
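As a minimal sketch of what a percentage sequence identity threshold means, the function below compares two equal-length sequences position by position. This is an assumption-laden simplification: identity is usually computed on an alignment that may contain gaps, the description does not specify the algorithm here, and the example sequences are hypothetical placeholders rather than SEQ ID NO: 44 to 47.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Ungapped percent identity between two sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # hypothetical reference strand
    variant = "ACGTACGTAA"    # hypothetical variant with one substitution
    print(percent_identity(reference, variant) >= 90.0)
```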
- the Y adapter may contain one or more barcode sequences in the 5' region of the first nucleic acid (DNA) strand and/or in the 3' region of the second nucleic acid (DNA) strand of the Y adapter formed by two nucleic acid (DNA) strands (and/or in the double stranded region).
- the barcode sequences may thus be located in the single-stranded region of the Y-adapter molecule and/or in the double stranded region of the Y-adapter.
- each original nucleic acid (DNA) strand and its synthetic complementary strand (see step (c)) would then be paired.
- the adaptor has a first barcode sequence in the double stranded region and/or a second barcode sequence in the 5' region of the second strand of the adaptor.
- At least one adapter, preferably comprising a hairpin from which a polymerase can synthesize a complementary strand, is ligated to the 3' end of the molecule provided in (a).
- the 3' adapter may preferably comprise one or more barcodes, as explained above.
- a second adapter e.g., a linear adapter
- the adapter ligated at the 5' end of the molecule has a length which is enough for a primer to hybridize to it.
- Step (c) Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand.
- the complementary strand is also referred to as the "synthetic complementary strand", and it is generated by polymerase elongation from the 3' end of the adapter molecule, using the strands of the nucleic acid molecules obtained in step (b) as template.
- each original strand of a nucleic acid (e.g., DNA) molecule is physically bound to a complementary strand obtained by synthetic extension. See also Figures 1B and 1D.
- Step (c) is also called herein "extension step".
- the strands may be denatured.
- the original nucleic acid strand of a nucleic acid (e.g., DNA) molecule and its synthetic complementary strand are physically linked by one of their ends by a loop, which is preferably a nucleotide sequence to which primers can at least partially bind, as defined above.
- an extension step is performed in which, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", is generated by polymerase elongation from the 3' end of the adapter molecule, using the strands of the five nucleic acid molecules obtained in step (b) as a template, to provide barcode paired adaptor-containing double stranded nucleic acid molecules (see Figures 1B and 1D, although in these figures the molecules are represented in a linearised form when in fact, as they are complementary, they have a ds configuration).
- the extension step is performed with naturally occurring nucleotides (canonical bases (e.g., A, C, G, T, or U) or non-modified nucleotides), and not with modified nucleotides (such as methylated C), so that the resulting molecule of step (i) does not have a synthetic complementary strand (which will give rise to the second region of the molecule of the invention) comprising modified nucleotides.
- the synthesis is performed with non-modified cytosines.
- polymerase elongation refers to the synthesis of a complementary strand by a DNA polymerase that adds free nucleotides to the 3' end of the second DNA strand in the adapter molecule.
- Said adapter molecule may act as a primer for the elongation step, as described above. During this step the temperature is chosen depending on the optimal temperature for the specific DNA polymerase used.
- the paired double stranded nucleic acid molecules obtained in step (c) are amplified to provide amplified paired double stranded nucleic acid molecules.
- the pairing between both strands of the original double-stranded nucleic acid (e.g., DNA) molecules allows keeping track of both strands of each double stranded nucleic acid (e.g., DNA) fragment originally used.
- each adapter may include unique and combinatorial barcodes (e.g., unique molecular identifiers or UMIs) that allow sample identification and multiplexing as well as quantitative analysis.
- the adapter, such as the Y-adapter, is provided as a library of adapters wherein each member of the library is distinguishable from the others by a combinatorial sequence located within the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand of the adapter.
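A library in which every member carries a distinct combinatorial sequence can be sketched by enumerating barcodes and embedding each one in a fixed duplex template. In practice degenerate positions would typically be produced during oligonucleotide synthesis rather than enumerated; the deterministic enumeration, the flanking sequences and the function name below are assumptions made only to keep the example reproducible.

```python
from itertools import islice, product


def adapter_library(prefix: str, suffix: str, barcode_length: int, size: int):
    """Build `size` adapter duplex sequences, each with a unique barcode
    of `barcode_length` nucleotides placed between `prefix` and `suffix`."""
    capacity = 4 ** barcode_length
    if size > capacity:
        raise ValueError(
            f"a {barcode_length}-nt barcode encodes only {capacity} sequences"
        )
    barcodes = ("".join(bases) for bases in product("ACGT", repeat=barcode_length))
    return [prefix + barcode + suffix for barcode in islice(barcodes, size)]


if __name__ == "__main__":
    # Hypothetical flanking duplex sequences around a 4-nt barcode.
    library = adapter_library("GATC", "AGAGC", barcode_length=4, size=100)
    print(len(library), len(set(library)))
```

The capacity check reflects the design rule that the number of possible combinatorial sequences should exceed the number of adapters in the library.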
- the Y-adapter incorporates bases labelled with the second member of a binding pair that allows the recovery of the original nucleic acid (e.g., DNA) template after the elongation or amplification steps.
- the sample used as a nucleic acid (e.g., DNA) template may be identified, preserved during the process and recovered, stored and submitted to multiple amplifications with different conditions and sequencings without sample exhaustion.
- the adaptors (such as Y-adaptors) may contain "sites for cutting", as described above.
- the molecules generated in step (b), when the ds adapters do not comprise a hairpin loop, are contacted with a hairpin adapter under conditions adequate for the ligation of the hairpin adapter to the molecules generated in step (b), as described in detail below.
- a hairpin adapter may be ligated to the adapters ligated in step (b).
- a hairpin adapter may be incorporated as described in WO 2015/104302.
- step (i) of the method of the present invention comprises, after step (c), the following step (c1):
- contacting each strand of the adapter-containing nucleic acid molecules with a complex of an elongation primer with a hairpin adapter under conditions adequate for the hybridization of the elongation primer to the second strand of the adapter, wherein the elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule and which, after hybridization with the second strand of the adapter molecule creates overhanging ends, and wherein the hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer to the second strand of the adapter.
- after step (c), the molecules generated are treated with an agent or method or process, as described herein, capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur.
- the method of the present invention further comprises a step (ii), also called herein "capture step”.
- This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules.
- Target enrichment may also be advantageous for other applications, since it allows for the specific selection of nucleic acid molecules or regions, facilitating the sequencing process and data analysis.
- Several methods of target enrichment are available, but they all comprise the use of a probe or capture probe specifically designed for hybridizing with (“capturing") nucleic acid molecules comprising the region of interest. After the capture, the sample will be enriched with the molecules comprising the region of interest, and the sequencing can thus be performed only with the material of interest.
- the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules at least at one position corresponding to the same locus in the region of interest which is occupied by a nucleotide susceptible of being modified (regardless of the epigenetic modifications present in the one or more original molecules).
- an efficient capture step can be carried out.
- the bias associated with the differences in the epigenetic modifications in the one or more original molecules is thus eliminated.
- a region of interest can be enriched with the same efficiency and efficacy, regardless of the modification status (epigenetic modifications) of the original molecules.
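The reasoning behind this bias-free enrichment can be sketched with a toy simulation. The original strand (first region) may carry methylated or non-methylated C; the synthetic complementary strand (second region) is built only from non-modified nucleotides; a conversion treatment then changes non-methylated C to T while methylated C is still read as C. The letter `M` for methylated C, the function names and the 7-nt example are hypothetical conventions used only for this illustration.

```python
# Pairing rules for building the synthetic strand: a methylated C ('M')
# in the template still directs incorporation of a non-modified G.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def synthetic_complement(original: str) -> str:
    """Synthetic complementary strand, written 5' to 3', made only of
    non-modified nucleotides (so it never contains 'M')."""
    return "".join(PAIRING[base] for base in reversed(original))


def convert(strand: str) -> str:
    """Sequence as read after conversion: non-methylated C becomes T,
    methylated C ('M') is protected and read as C."""
    return strand.replace("C", "T").replace("M", "C")


if __name__ == "__main__":
    methylated = "ACGTMGA"    # same locus, C at position 5 methylated
    unmethylated = "ACGTCGA"  # same locus, C at position 5 not methylated
    # First regions differ after conversion ...
    print(convert(methylated), convert(unmethylated))
    # ... but second regions are identical, whatever the methylation status.
    print(convert(synthetic_complement(methylated)),
          convert(synthetic_complement(unmethylated)))
```

In this toy model the converted second region is the same for both molecules, so a probe designed against it would not distinguish between methylation states of the original molecule.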
- the capture probe binds to at least a part of the second region of the plurality of molecules provided in step (i), including a nucleotide which is located at a position corresponding to a locus in the ROI which is occupied by a nucleotide susceptible of being modified.
- the probe "captures" the nucleic acid molecules provided in (i), because it binds to at least part of the second region of these molecules.
- the overlap region (the "binding region") between the captured region and the probe can be shorter than 20 nucleotides, such as from 11 to 20 nucleotides, for instance from 13 to 18 nucleotides.
- first region shows different sequence (e.g., Figures 1A and 1C) arising from a different methylation status in their corresponding original molecules (e.g., Figures IB and ID).
- TAACAACTAC TAACAACTAC
- the capture probe would bind better to some of the molecules (if the capture probe shows higher complementarity to them) than to other molecules (if the capture probe shows low complementarity to those molecules because of the differences in sequence). This will cause a bias in the capture step, where some molecules will be captured better than others.
- the capture probe binds with the same affinity to all the nucleic acid molecules of the present invention since the capture probe is 100% complementary to all of the second regions.
- the capture step of the present method does not cause any bias, making the method more efficient.
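The contrast between a biased and an unbiased capture can be illustrated with a minimal Python sketch. It scores how many positions of one probe base-pair with each target. The function names, the one-base difference in the hypothetical first regions, and the use of a simple match fraction as a stand-in for binding affinity are all assumptions made for illustration; the ten-base string is reused from the text only as a convenient example.

```python
# Hypothetical sketch: match fraction as a crude proxy for probe affinity.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def complementarity(probe: str, target: str) -> float:
    """Fraction of probe positions that base-pair with an equal-length target."""
    if len(probe) != len(target):
        raise ValueError("probe and target must have the same length")
    expected = reverse_complement(target)
    matches = sum(p == e for p, e in zip(probe, expected))
    return matches / len(probe)

# Illustrative sequences only (assumed, not taken from the figures).
second_regions = ["TAACAACTAC", "TAACAACTAC"]  # identical regardless of modification
first_regions = ["TAACGACTAC", "TAACAACTAC"]   # differ at one modifiable locus

probe = reverse_complement("TAACAACTAC")
second_scores = [complementarity(probe, t) for t in second_regions]
first_scores = [complementarity(probe, t) for t in first_regions]
print(second_scores)  # same score for every molecule
print(first_scores)   # scores differ between molecules
```

In this toy setting the probe matches every second region equally, whereas the same probe directed at the first regions would match the two molecules to different degrees, which is the source of bias described above.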
- the capture probe is attached to a support that facilitates the purification and/or immobilization of the nucleic acids captured with the method of the present invention.
- support refers to any material configured to chemically bond with a nucleic acid including but not limited to plastic, latex, glass, metal (e.g., a magnetized metal), nylon, nitrocellulose, quartz, silicon or ceramic.
- the support is preferably solid and may be roughly spherical (e.g., a bead) or may comprise a standard laboratory container such as a microwell plate or a surface.
- immobilized refers to the association or binding between the molecule (e.g., the capture probe) and the support in a manner that provides a stable association under the conditions of elongation, amplification, excision, and other processes as described herein.
- binding can be covalent or non-covalent.
- Non-covalent binding includes electrostatic, hydrophilic and hydrophobic interactions.
- Covalent binding is the formation of covalent bonds that are characterized by the sharing of pairs of electrons between atoms.
- Such covalent binding can be directly between the capture probe and the support or can be formed by a cross linker or by inclusion of a specific reactive group on either the support or the adapter or both.
- Covalent attachment of a probe can be achieved using a binding partner, such as avidin or streptavidin, immobilized to the support and the non-covalent binding of the biotinylated adapter to the avidin or streptavidin. Immobilization may also involve a combination of covalent and non-covalent interactions.
- the capture probe may be synthesized first, with subsequent attachment to the support. Alternatively, the capture probe may be synthesized directly on the support.
- the capture step or step (ii) is performed on the nucleic acid molecules of the present invention that are in the supernatant of the reaction vessel, so that the capture probe is not attached to any support.
- the capture probe is conjugated to one or more molecules such as one or more chromophores, fluorophores, beads, etc.
- the capture probe comprises or is conjugated to tags or one or more recognition molecules (e.g., streptavidin, avidin, neutravidin, horseradish peroxidase, alkaline phosphatase, antibodies, etc).
- the method of the present invention further comprises a step (iii) of determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules.
- with this step, it is possible to ascertain the epigenetic status of the plurality of molecules in a sample in the region of interest.
- the first and second regions of the plurality of nucleic acid molecules provided in step (i) and captured in step (ii) are sequenced and/or analysed.
- the skilled person is aware of means of sequencing the molecules captured in step (ii) of the method of the present invention.
- the sequencing can take place by using one or more of the currently available sequencing technologies or platforms (e.g., Illumina, Roche, Ion Torrent, etc. sequencing platforms).
- the sequencing can be performed either at low scale, which consists of the analysis of selected fragments, or at high throughput (also named genome-scale), which consists of the massive analysis of all or a large representation of the whole material, such as Next Generation Sequencing (NGS) approaches.
- the length of the fragment that can be analysed depends on the sequencing methodology used.
- Current state-of-the-art sequencing techniques aimed at the genomic scale, and most locus-specific techniques, assess single-stranded (ss) nucleic acid molecules (such as DNA strands) separately.
- sequencing or the expressions "determining the sequence" or "sequence determination" and the like, such as "determining the base identity" or "determining the identity of a base", mean the determination of the information relating to the nucleotide base sequence of a nucleic acid, particularly involving determination and ordering of a plurality of contiguous nucleotides in a nucleic acid.
- Said information may include the identification or determination of partial as well as full sequence information of the nucleic acid molecule.
- Said information refers, e.g., to the primary sequence of a DNA molecule, such as a ss or ds DNA molecule or to the epigenetic modifications (for example methylations or hydroxymethylations), or both.
- the sequence information may be determined with varying degrees of statistical reliability or confidence.
- the determination of the primary sequence of a DNA molecule includes the detection of mutations or genetic variants such as polymorphisms (SNPs, INDELs, etc.). By analysing the output of the sequencing, each read will provide information regarding the primary sequence (including mutations and SNPs) and the epigenetic modifications (e.g., methylation status) of the one or more original nucleic acid sequences within a ROI.
- the methods described herein may be useful in identifying and/or distinguishing epigenetic modifications (i.e., for ascertaining the epigenetic modification status of one or more nucleic acid molecules), as explained above, e.g., cytosine (C), 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) and 5-formylcytosine (5fC) in the one or more original nucleic acid sequences within the ROI.
- methods described herein may be useful in distinguishing one residue from the group consisting of cytosine (C), 5-methylcytosine (5mC), 5-
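As a minimal sketch of how a per-locus modification status might be summarised from sequencing output, the Python snippet below counts read calls at a single cytosine locus. It assumes a bisulfite-like conversion in which an unmodified cytosine is read as T and a modified cytosine is read as C; the text here does not specify the conversion chemistry, so this is an illustrative assumption, as are the function name and the input format. Such a scheme only separates modified from unmodified cytosines and does not by itself distinguish 5mC from 5hmC or 5fC.

```python
from collections import Counter

def modification_fraction(first_region_calls):
    """Estimate the fraction of molecules modified at one cytosine locus.

    Assumption (hypothetical chemistry): 'C' means modified, 'T' means
    unmodified. Any other call is treated as uninformative and ignored.
    Returns None when no informative call is available.
    """
    counts = Counter(first_region_calls)
    informative = counts["C"] + counts["T"]
    if informative == 0:
        return None
    return counts["C"] / informative

# Illustrative calls from six reads covering the same locus (made-up data).
calls = ["C", "C", "T", "C", "T", "A"]
frac = modification_fraction(calls)
print(frac)
```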
- the method further comprises diagnosing a condition in the subject based at least in part on the sequencing information provided in step (iii) of the method of the present invention.
- the condition may be any condition, a trait or even aging, obesity, etc.
- the condition may be cancer, which can be selected from a sarcoma, a glioma, an adenoma, leukemia, such as chronic lymphocytic leukaemia (CLL), bladder cancer, breast cancer, colorectal cancer (CRC), endometrial cancer, kidney cancer, liver cancer, lung cancer, melanoma, non-Hodgkin lymphoma, pancreatic cancer, prostate cancer, thyroid cancer, etc.
- the condition may also be a neurodegenerative condition, such as Alzheimer's disease, frontotemporal dementia, amyotrophic lateral sclerosis, Parkinson's disease, spinocerebellar ataxia, spinal muscle atrophy, Lewy body dementia, or Huntington's disease.
- the condition may also be any inherited or environmental disease or any rare or common disease or any trait not necessarily linked to disease.
- the condition may be caused by or be related to the epigenetic modifications in one or more nucleotides susceptible of conversion comprised in a ROI of the genome of the subject.
- the present invention further comprises an in vitro method for diagnosing a condition, the method comprising the steps of:
- (2) Providing a plurality of nucleic acid molecules as defined in step (i) of the method of the present invention from a sample obtained from the patient; (3) Capturing the molecules provided in (i) by using a probe as defined in step (ii) of the method of the present invention;
- (4) Diagnosing a condition in the subject based at least in part on the information provided in step (4).
- the method of the present invention may further comprise a step of determining the true identity of a base at a certain position (locus) in a ROI, based on the information provided in step (iii).
- a "true identity" is the identity of the base originally present at a certain position (locus) in the original nucleic acid molecule (e.g., A, C, G, T, U, or any modification thereof, such as a modified nucleotide, e.g., a modified cytosine, such as a methylated cytosine, mC (e.g., 5mC, 5hmC and/or 5fC)).
- step (i) information regarding the sequence of at least part (and preferably all) of the first and second regions is provided.
- at least two sources of information for at least one, preferably for each one, of the first and second regions are provided. Since the first and second regions of the nucleic acid molecule provided in step (i) of the present invention provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule, at least two sources of information on the base identities in the corresponding loci in the original nucleic acid molecule and corresponding ROI are provided.
- the method of the present invention also allows for the determination of the base identities in the original molecule (in the corresponding ROI), including the epigenetic modifications, with reduced error.
- the method of the present invention further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, determine the identity of a base (e.g., the true base) at a certain position (locus) in an original nucleic acid molecule (in the corresponding ROI), based on the sequencing information provided in step (iii).
- the processor may comprise one or more processing units, such as a microprocessor, GPU, CPU, multi-core processor or the like.
- the memory may comprise one or more volatile or non-volatile memory devices, such as DRAM, SRAM, flash memory, read-only memory, ferroelectric RAM, hard disk drives, floppy disks, magnetic tapes, optical disks or the like.
- the present invention thus further provides a computer program comprising instructions which, when executed by a computer, is able to determine the identity (and/or a BQ score or probability of being an error) of a true base at a certain position (locus) in an original nucleic acid molecule (in the corresponding ROI), based on the information provided in step (iii) of the method of the present invention.
- the present invention also further provides a computer program comprising instructions which, when executed by a computer, is able to implement any of the methods disclosed in the present document. Therefore, such a computer program is communicatively coupled to the electronic components of a sequencing machine.
- the computer program product may be implemented in software, hardware, or a combination of both.
- the computer program product can be stored in a memory of the sequencing machine or can be saved remotely, for example, on a remote server communicatively connected to the device.
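One way such a program might combine several independent readings of the same locus into a single base call with a quality score is sketched below in Python. This is not the algorithm of the invention, which the text does not spell out; it is a generic Bayesian consensus under stated assumptions: readings are independent, the prior over A, C, G and T is uniform, each reading carries a Phred-scaled quality, and an erroneous reading is equally likely to be any of the other three bases. All names are hypothetical.

```python
import math

BASES = "ACGT"

def phred_to_error(q: float) -> float:
    """Convert a Phred quality to an error probability, capped at 0.75."""
    return min(10 ** (-q / 10), 0.75)

def consensus_call(observations):
    """Combine (base, phred_quality) readings of one locus.

    Returns (best_base, consensus_phred), where consensus_phred is the
    Phred-scaled posterior probability that best_base is wrong.
    """
    log_like = {}
    for candidate in BASES:
        total = 0.0
        for base, q in observations:
            e = phred_to_error(q)
            # A matching reading is correct with prob 1 - e; a mismatching
            # reading is one specific wrong base with prob e / 3.
            total += math.log(1 - e) if base == candidate else math.log(e / 3)
        log_like[candidate] = total
    # Normalise in a numerically stable way (uniform prior over bases).
    max_ll = max(log_like.values())
    weights = {b: math.exp(ll - max_ll) for b, ll in log_like.items()}
    norm = sum(weights.values())
    best = max(weights, key=weights.get)
    p_error = sum(w for b, w in weights.items() if b != best) / norm
    p_error = max(p_error, 1e-30)  # guard against log10(0)
    return best, -10 * math.log10(p_error)

# Four readings of one locus (made-up values): three agree, one disagrees.
base, quality = consensus_call([("C", 30), ("C", 30), ("T", 20), ("C", 30)])
print(base, round(quality, 1))
```

With a single reading the consensus quality simply equals the input quality; agreeing readings raise it, and a disagreeing reading lowers it in proportion to its own quality.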
- step (iii) of the method of the present invention comprises using at least two different primers, such as two, three or four different primers, preferably at least three primers, even more preferably four different primers, to sequence the molecule provided in step (i).
- step (iii) comprises the use of at least two different primers, preferably at least three different primers, even more preferably four different primers, and the sequencing of the molecules provided in step (i) using the at least two different primers, preferably at least three different primers, even more preferably four different primers.
- Step (iii) thus provides sequence information of the molecule provided in step (i) of the method of the present invention.
- the molecules provided in step (i) of the present invention further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule, for instance as described above.
- Sequencing a nucleic acid molecule can comprise the determination of the identity of the base (e.g., adenine (A), cytosine (C), thymine (T), guanine (G), uracil (U) and, its modifications, such as methyl cytosines (5mC, 5hmC), etc) present at the specific locus in the original nucleic acid molecule (in the corresponding ROI).
- the at least two different primers such as two, three or four different primers, preferably three or four different primers, bind at least to four different regions in the nucleic acid molecule provided in (i), wherein:
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of the first region of the molecule, to sequence at least part of the first region of at least one of the nucleic acid molecules provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the third region which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of at least one of the nucleic acid molecules provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least a portion of the second region of the molecule, to sequence at least part of the second region of at least one nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the third region which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of at least one of the nucleic acid molecules provided in (i).
- the at least two different primers preferably at least three different primers, even more preferably four different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three, preferably to at least four different regions in the nucleic acid molecule provided in (i), wherein:
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and/or
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
- step iii) comprises using at least two different primers, preferably at least three different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three different regions in the nucleic acid molecule provided in (i), wherein:
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially either:
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
- At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
- step iii) comprises using at least two different primers, preferably at least three different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three different regions in the nucleic acid molecule provided in (i), wherein:
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
- the primer may at least partially bind (hybridize) to the above sequences under low stringency conditions, preferably medium stringency conditions, most preferably high stringency conditions. See, e.g., Figure 3.
- the binding of the primers to the at least three, preferably to the at least four, different regions in the nucleic acid molecules provided in (i) may be performed simultaneously (i.e., all two, three or four primers at the same time) or not-simultaneously.
- the binding of the primers to the at least three, preferably at least to four, different regions in the nucleic acid molecule provided in (i) is not performed simultaneously.
- the binding of the at least two different primers is specific binding.
- the primer(s) bind to the above-described regions in the molecule in a specific manner, i.e., it binds to the above-described regions, but it does not substantially bind to any other region in the nucleic acid molecule provided in (i).
- the skilled person is aware of means for designing primers and checking them for specificity.
- the at least two different primers which bind to at least four different regions in the plurality of nucleic acid molecules provided in (i) are four different primers, each specifically at least partially binding (hybridizing) to the regions 1-3 and/or 1-4, preferably 1-4, as described above.
- primers 1 (capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of the nucleic acid molecule provided in (i)) and 2 (capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the second region of the nucleic acid molecule provided in (i)) should be different, and primers 3 (capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i)) and 4 (capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of the nucleic acid molecule provided in (i)) should also be different.
- the at least two different primers, preferably at least three different primers, even more preferably four different primers, provided in step (iii) of the method of the present invention may be used to sequence the molecules provided in step (i) and captured in step (ii).
- the sequencing can take place by using one or many of the currently available sequencing technologies (e.g., Illumina, Roche, Ion Torrent, etc. sequencing platforms).
- the fact that the at least two different primers (such as three or four different primers) bind to at least three, preferably to at least four different regions in the molecule provided in step (i), and that both the 5' and 3' regions are to be at least partially sequenced, implies that, for the (at least partial) sequencing of:
- (1) the second region of the nucleic acid molecule provided in step (i), using the primer binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i); and
- (2) the first region of the nucleic acid molecule provided in step (i), using the primer binding (hybridizing) at least partially to at least a portion of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), the complementary sequence of the nucleic acid molecules provided in (i) needs to be synthesized and amplified (cluster amplification for parallel sequencing by synthesis).
- the nucleic acid molecules are synthesized using the so-called “sequencing by synthesis” technique of next generation sequencing, such as Illumina sequencing, which makes use of the synthesis of the original and the complementary strands to read the sequence of a certain nucleic acid molecule.
- a primer attaches to the forward strand adapter primer binding site, and a polymerase adds a fluorescently tagged dNTP to the DNA strand. Only one base is able to be added per round due to the fluorophore acting as a blocking or synthesis terminator group; however, the blocking group is reversible.
- each of the four bases has a unique emission, and after each round, the machine records which nucleotide was added. Once the colour is recorded the fluorophore is washed away and another dNTP is washed over the flow cell and the process is repeated.
- the method of the present invention can be used to reduce uncertainty and overall error rate in the determination of a sequence of a polynucleotide (e.g., an original DNA polynucleotide), mainly before requiring alignment to a reference genome (or reference nucleic acid sequence).
- the methods of the present invention thus provide more than two, such as three, and preferably up to four sources of independent information from a single-stranded nucleic acid molecule as described in step (i) (e.g., preferably up to eight sources of independent information if a double-stranded nucleic acid molecule is considered) regarding the base identity in each corresponding locus in an original nucleic acid molecule.
- since every nucleotide of the original molecule is represented more than two, such as three, and preferably up to four times, the raw probability of error for each base can be greatly reduced, mainly at the pre-mapping step but also at post-mapping steps. Reducing the error rate at the pre-mapping step also improves the mapping quality of each read, which in turn reduces mapping errors and therefore variant-calling errors. Knowing exactly where every insert (first and second regions) starts and ends also improves the mapping and the calling of SNPs, but mainly the calling of INDELs and other types of rearrangements.
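As a rough worked illustration of this reduction, suppose (a simplification not stated in the text) that each of n readings of a locus errs independently with probability e, and that an erroneous reading is equally likely to report any of the three other bases. The probability that all n readings agree on the same wrong base is then:

```latex
P(\text{all } n \text{ readings agree on one wrong base})
  = 3\left(\frac{e}{3}\right)^{n}
  = \frac{e^{n}}{3^{\,n-1}}
% Example with e = 10^{-3}:
%   n = 1:  10^{-3}
%   n = 4:  10^{-12} / 27 \approx 3.7 \times 10^{-14}
```

Under these assumptions, four concordant readings make an undetected error many orders of magnitude less likely than a single reading does; real sequencing errors are not fully independent, so the actual gain would be smaller.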
- UMIs (optional) at the beginning and end of every insert improve the sequencing at the beginning of every read, allow deduplication, which becomes crucial when doing enrichment, and reduce the number of uninformative sequencing cycles and unnecessary bioinformatic resources. They also allow linking the dsDNA strands of the original molecule if they have been separated during the procedure.
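The deduplication role of UMIs can be sketched in Python as follows. The record layout, the field names and the rule of keeping the highest-quality read per group are assumptions for illustration; the text does not prescribe a particular deduplication strategy, and real pipelines usually also tolerate sequencing errors within the UMI itself.

```python
from typing import NamedTuple

class Read(NamedTuple):
    umi_start: str       # UMI at the beginning of the insert
    umi_end: str         # UMI at the end of the insert
    position: int        # mapped start position (hypothetical field)
    mean_quality: float  # mean base quality of the read
    sequence: str

def deduplicate(reads):
    """Keep one read per (UMI pair, position): the one with highest quality."""
    best = {}
    for r in reads:
        key = (r.umi_start, r.umi_end, r.position)
        if key not in best or r.mean_quality > best[key].mean_quality:
            best[key] = r
    return list(best.values())

# Made-up reads: the first two are duplicates of the same original molecule.
reads = [
    Read("AACG", "TTGA", 100, 31.0, "TAACAACTAC"),
    Read("AACG", "TTGA", 100, 35.5, "TAACAACTAC"),
    Read("GGTC", "CATA", 100, 33.0, "TAACAACTAC"),
]
unique = deduplicate(reads)
print(len(unique))
```

Here the third read shares the mapping position with the first two but carries a different UMI pair, so it is kept as a separate original molecule rather than discarded as a duplicate.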
- the computer program comprises instructions to perform a locus analysis between different readings to determine the true base at a certain locus of the original nucleic acid to be sequenced. It is noted that from the present description the skilled person may envisage different ways in which locus analysis may be performed, all of them comprised within this invention.
- a sequencing machine usually comprises samples, trays, incubators, consumables, micropipetting systems, and many other elements that enable the full automation of the sequencing of a particular nucleic acid molecule.
- the present embodiment is thus not limited to sequencing machines comprising just these elements, but extends to any other machines capable of automating any of the methods disclosed in the present document, as the person skilled in the art may envisage.
- the present invention further provides a kit comprising at least two different primers, wherein the at least two different primers are capable of at least partially binding to at least three, preferably to at least four different regions in the nucleic acid molecule provided in step (i) of the method of the present invention.
- the at least two, such as three or preferably four different primers are capable of at least partially binding to at least three, preferably to at least four different regions in the nucleic acid molecule provided in step (i) of the method of the present invention, wherein:
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and/or
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
- at least one of the primers (e.g., the second primer) is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i).
- At least one of the primers is capable of binding (hybridizing) at least partially either: to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the second region of the nucleic acid molecule provided in (i); or to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of the nucleic acid molecule provided in (i).
- the invention provides a kit comprising at least two different primers, such as at least three different primers, preferably four different primers, wherein the at least two different primers, such as at least three different primers, preferably four different primers, are capable of at least partially binding to at least four different regions in the nucleic acid molecule provided in step (i) of the method of the present invention, wherein:
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
- the kit of the present invention further comprises instructions for its use.
- the kit of the present invention further comprises a double stranded adapter for use in the method for the generation of the nucleic acid molecule provided in step (i) of the method of the present invention, wherein the adapter comprises a first nucleic acid strand and a second nucleic acid strand, wherein the second region of the first nucleic acid strand and the first region of the second nucleic acid strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the second region of the first nucleic acid strand and the first region of the second nucleic acid strand of the adapter are compatible with the ends of a double stranded nucleic acid molecule, wherein the double-stranded region of the adapter comprises one or more barcode sequence(s), and wherein the second region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said second region, the first segment being located at the 3' end of
- the adapter has a restriction site in the first region of the first strand of the adapter.
- the adapter comprises at least one barcode sequence in the single stranded region of the adapter and wherein the second region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said second region, the first segment being located at the 3' end of the second region and the second segment being located in the vicinity of the first region of the second strand.
- the kit further comprises:
- a library of double-stranded adapters comprising a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity and wherein the ends of said double stranded region are compatible with the ends of double stranded nucleic acid molecules;
- each elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule as defined in (i) and which, after hybridization with the second strand of the adapter molecule creates overhanging ends;
- each hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer as defined in (ii) to the second strand of the Y-adapter as defined in (i), wherein the elongation primers of (ii) and the hairpin adapters of (iii) may be provided as a complex; wherein the adapters of (i), the elongation primers of (ii) and the hairpin adapters of (iii) are suitable for obtaining a library of adapters for use in the method for the generation of the nucleic acid molecule provided in step (i) of the method of the present invention.
- the kit comprises one or more of the adapters of the invention, as defined above under section "adapters of the invention".
- the kit comprises one or more adapters comprising the sequence of SEQ ID NO: 44, 45, 46, 47, 48, 49, 50, 51, or 52, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44, 45, 46, 47, 48, 49, 50, 51, or 52, respectively.
- the kit comprises at least the adapters comprising the sequence of SEQ ID NO: 44 and 45, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44, or 45, respectively.
- the kit comprises at least the adapters comprising the sequence of SEQ ID NO: 46 and 47, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 46 or 47 respectively, and an optional adapter comprising the sequence of SEQ ID NO: 52, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 52.
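The "at least 90% ... sequence identity" criterion used in the kit items above can be illustrated with a minimal sketch. It assumes an ungapped, position-by-position comparison of two equal-length sequences and treats M (methylated cytosine) and C as distinct symbols; real identity calculations normally rely on a sequence alignment, and the function names are hypothetical.

```python
def percent_identity(candidate: str, reference: str) -> float:
    """Ungapped percent identity between two equal-length sequences.

    Assumption: no alignment is performed; positions are compared one by one.
    """
    if len(candidate) != len(reference):
        raise ValueError("sequences must have equal length for an ungapped comparison")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(candidate, reference))
    return 100.0 * matches / len(reference)


def meets_identity_threshold(candidate: str, reference: str, threshold: float = 90.0) -> bool:
    """Return True if the candidate reaches the given identity threshold."""
    return percent_identity(candidate, reference) >= threshold


# SEQ ID NO: 47 as listed in this document (19 nucleotides).
SEQ_ID_47 = "CTGCMACGCMGTGCCTCAG"
one_mismatch = "CTGCMACGCMGTGCCTCAA"   # hypothetical variant, last base changed
two_mismatches = "CTGCMACGCMGTGCCTCTA"  # hypothetical variant, last two bases changed
```

With a 19-nucleotide reference, a single mismatch leaves 18/19 identical positions, which is above 90%, whereas two mismatches leave 17/19, which is below it.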
- M denotes methylated cytosine
- C denotes non-methylated cytosine
- G denotes guanine
- T denotes thymine
- SEQ ID NO: 48 (E9 duplex): GGMGTGG
- SEQ ID NO: 46 (E15 full length): GMTMTTMMGATMTGGMGTGGMAG
- SEQ ID NO: 47 (E15 full length): CTGCMACGCMGTGCCTCAG
- SEQ ID NO: 50 (E15 duplex): GGMGTGG
- SEQ ID NO: 52 (E15 hairpin):
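The adapter items above describe a double-stranded region formed between the 3' region of the first strand and the 5' region of the second strand. The following sketch shows one way to check such complementarity for sequences written in the M/C/G/T notation of this listing. It assumes that M pairs like an ordinary cytosine and that SEQ ID NO: 46 and 47 play the roles of first and second strand, respectively; that assignment and the helper names are illustrative assumptions, not statements from the source.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def unmask(seq: str) -> str:
    """Replace M (methylated cytosine) with C for base-pairing purposes."""
    return seq.replace("M", "C")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence, treating M as C."""
    return "".join(COMPLEMENT[base] for base in reversed(unmask(seq)))


def duplex_length(first_strand: str, second_strand: str) -> int:
    """Longest k such that the last k bases of first_strand pair perfectly
    with the first k bases of second_strand."""
    second = unmask(second_strand)
    best = 0
    for k in range(1, min(len(first_strand), len(second)) + 1):
        if reverse_complement(first_strand[-k:]) == second[:k]:
            best = k
    return best


# Sequences as printed in this listing (assumed strand roles, see lead-in).
SEQ_ID_46 = "GMTMTTMMGATMTGGMGTGGMAG"
SEQ_ID_47 = "CTGCMACGCMGTGCCTCAG"
overlap = duplex_length(SEQ_ID_46, SEQ_ID_47)
```

The sketch only tests perfect Watson-Crick pairing at the strand ends; it does not model the hairpin or any mismatch tolerance.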
- a method comprising the steps of:
- each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in a region of interest which is occupied by a nucleotide susceptible of being modified, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the nucleic acid molecules of the plurality of nucleic acid molecules may have (and preferably have) a different nucleotide in
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, at least a modified nucleotide or a copy thereof, preferably wherein the at least one modified nucleotide is a methylated cytosine and/or the copy thereof is an unmethylated cytosine.
- the method according to any one of items 1 or 2 wherein the first region is in the 5' region of the nucleic acid molecule, and the second region is in the 3' region of the nucleic acid molecule.
- the method according to any one of the preceding items wherein the first region of the at least one of the plurality of the nucleic acid molecules provided in step (i) is a fragment of genomic DNA.
- the method according to any one of the preceding items, wherein the first and the second region are bound by a linker.
- the plurality of nucleic acid molecules provided in step i) are DNA molecules.
- the method according to any one of the preceding items, wherein at least one of the plurality of nucleic acid molecules provided in step (i) further comprises:
- step (i) The method according to any one of the preceding items, wherein the plurality of nucleic acid molecules in step (i) is provided by: a) Providing one or more original nucleic acid molecules, preferably wherein the one or more original nucleic acid molecules are fragments of genomic DNA; b) Ligating one adaptor to at least one end of the one or more original nucleic acid molecules provided in a), thereby obtaining one or more adaptor-containing original nucleic acid molecules, wherein the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase; c) Synthesizing, for each of the one or more adaptor-containing original nucleic acid molecules obtained in step b), a complementary strand, the "synthetic complementary strand", by polymerase elongation of the 3' end of the adaptor molecule, using the one or more adaptor-containing original nucleic acid molecules obtained in step b) as template,
- the one or more original nucleic acid molecules are double-stranded (ds) nucleic acid molecules, preferably genomic ds DNA, and wherein the adaptors of step b) are ds adaptors.
- ds: double-stranded
- the adaptors of step b) are ds adaptors.
- step b The method according to one or more of items 11 to 13, wherein the strands of the double-stranded original nucleic acid molecules are further paired by using barcode sequences in step b).
- step 14 wherein the pairing can be performed either before or after the ligation, or simultaneously with the ligation.
- step (i) the determination of the true identity of a base at a certain locus in the one or more original nucleic acid molecules is performed by using at least two different primers, preferably at least three different primers, even more preferably four different primers, to sequence the molecule provided in step (i) as defined in claim 1, wherein the molecules provided in step (i) further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule; wherein the at least two different primers bind to at least three, preferably at least four different regions in the nucleic acid molecule provided in step (i), wherein:
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of at least one of the nucleic acid molecules provided in step (i);
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in step (i), to sequence the second region of at least one of the nucleic acid molecules provided in step (i);
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of at least one of the nucleic acid molecules provided in step (i);
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of at least one of the nucleic acid molecules provided in step (i).
- a modified cytosine at a given position is determined if a cytosine appears in the first region of the nucleic acid molecules obtained in step (c) and a guanine appears in the corresponding position (locus) in the second region of the same molecule, and/or wherein the presence of a non-modified cytosine at a given position (locus) is determined if a uracil or thymine appears in the first region of the nucleic acid molecules obtained in step (c) and a guanine appears in the corresponding position (locus) in the second region of the same molecule.
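The decision rule in the preceding item can be written as a small lookup. This is a minimal sketch which assumes the base from the first region and the base from the second region have already been matched to the same locus; the function name and the returned labels are hypothetical.

```python
from typing import Optional


def call_cytosine(first_region_base: str, second_region_base: str) -> Optional[str]:
    """Classify a locus from the pair of bases read in the two regions.

    Returns "modified" or "unmodified" when the rule applies, and None when
    the pair does not correspond to a cytosine call under this rule.
    """
    first = first_region_base.upper()
    second = second_region_base.upper()
    if second != "G":
        # The rule requires a guanine at the corresponding position of the second region.
        return None
    if first == "C":
        return "modified"      # cytosine survived conversion
    if first in ("U", "T"):
        return "unmodified"    # cytosine was converted and reads as U/T
    return None
```

Any other combination is left uncalled here; how such cases are resolved (for example as sequencing errors or true variants) is outside the scope of this sketch.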
- a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii).
- step (i) of the method as defined in any one of items 1 to 19, from a sample obtained from the patient;
- a computer program comprising instructions which, when executed by a computer, is able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii) of the method as defined in any one of items 16 to 19.
- step (a) Selecting or identifying a region of interest within a genome and/or within a nucleic acid molecule;
- step (b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method as defined in any one of items 1 to 19;
- a computer program comprising instructions which, when executed by a computer, is able to obtain the sequence of at least a capture probe that binds to at least a portion of the second region, by implementing the method as defined in item 22.
- a method comprising the steps of:
- each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the nucleic acid molecules of the plurality of nucleic acid molecules may have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified and wherein, in the second region, a position corresponding to that same locus in the region of interest is occupied by
- the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof, preferably wherein the at least one modified nucleotide is a methylated cytosine, and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof or a transformed modified nucleotide or a copy thereof.
- the plurality of nucleic acid molecules in step (i) is provided by: a) Providing one or more original nucleic acid molecules, preferably wherein the one or more original nucleic acid molecules are fragments of genomic DNA; b) Ligating one adaptor to at least one end of the one or more original nucleic acid molecules provided in a), thereby obtaining one or more adaptor-containing original nucleic acid molecules, wherein the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase; c) Synthesizing, for each of the one or more adaptor-containing original nucleic acid molecules obtained in step b), a complementary strand, the "synthetic complementary strand", by polymerase elongation of the 3' end of the adaptor molecule, using the one or more adaptor-containing original nucleic acid molecules obtained in step b) as template, thereby pairing the one or more original nucleic acid molecules obtained in step
- the one or more original nucleic acid molecules are double-stranded (ds) nucleic acid molecules, preferably genomic ds DNA, and wherein the adaptors of step b) are ds adaptors.
- the method further comprises a step (iii) of determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules.
- step (i) the determination of the true identity of a base at a certain locus in the one or more original nucleic acid molecules is performed by using at least two different primers to sequence the molecule provided in step (i) as defined in item 1, wherein the molecules provided in step (i) further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule; wherein the at least two different primers bind to four different regions in the nucleic acid molecule provided in step (i), wherein:
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of at least one of the nucleic acid molecules provided in step (i);
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in step (i), to sequence the second region of at least one of the nucleic acid molecules provided in step (i);
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of at least one of the nucleic acid molecules provided in step (i);
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of at least one of the nucleic acid molecules provided in step (i).
- the method according to one or more of items 7 or 8, wherein the method further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii).
- An in vitro method for diagnosing a condition comprising the steps of: (1) Selecting or identifying a region of interest relevant to the condition to be diagnosed within the genome of a patient;
- step (i) of the method as defined in any one of items 1 to 9, from a sample obtained from the patient;
- a computer program comprising instructions which, when executed by a computer, is able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii) of the method as defined in any one of items 7 to 9.
- step (b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method as defined in any one of items 1 to 9;
- (c) Obtaining the sequence of at least a capture probe that binds to at least a portion of the second region, wherein the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
- a computer program comprising instructions which, when executed by a computer, is able to obtain the sequence of at least a capture probe that binds to at least a portion of the second region, by implementing the method as defined in item 12.
- a nucleic acid molecule which comprises a 5' region and a 3' region, wherein the 5' region and the 3' region are covalently linked by a nucleotide sequence to which primers can bind, wherein the base identities in one of the 5' or 3' regions and the base identities in the other region both provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule, wherein the molecule further comprises:
- One adapter at the 3' end of the molecule
- ii. Using at least two different primers, such as at least three different primers, preferably at least four different primers, to sequence the molecule provided in step (i), wherein the at least two different primers, such as at least three different primers, preferably at least four different primers, bind to at least three, preferably to at least four different regions in the nucleic acid molecule provided in (i), wherein:
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the 5' region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 3' region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the 3' region of the nucleic acid molecule provided in (i); and/or
- At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 5' region of the nucleic acid molecule provided in (i).
- the method according to item 1 wherein the 3' region of the nucleic acid molecule provided in step (i) is at least partially complementary to the reverse strand of the 5' region.
- the method according to any one of the preceding items, wherein the 5' region and/or the 3' region of the nucleic acid molecule provided in step (i) are fragments of genomic DNA.
- nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in step (i) has a length of at least 5 nucleotides, preferably a length of at least 10 nucleotides, even more preferably at least 17 nucleotides.
- step (ii) four different primers are used to sequence the molecule provided in step (i).
- the nucleic acid molecule of step (i) is generated by: a.
- step (b) Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the second nucleic acid strand in the adapter molecule, using each of the strands of the nucleic acid molecules obtained in step (b) as template, thereby pairing each of the strands of the nucleic acid molecules obtained in step (b) with its synthetic complementary strand to provide a plurality of adaptor-modified nucleic acid molecules, wherein the original nucleic acid strand and its synthetic complementary obtained in step (c) are covalently linked by a nucleotide sequence to which primers can at least partially bind; d.
- step (c) providing complementary strands of the plurality of adaptor-modified DNA molecules obtained in step (c), optionally using primers the sequences of which are complementary to at least a portion of the double-stranded adaptors; e.
- step (d) optionally amplifying the paired double stranded nucleic acid molecules obtained in step (d) to provide amplified paired double stranded nucleic acid molecules.
- step (c) the plurality of paired adaptor-modified DNA molecules is separated to generate a library of paired adaptor-modified DNA molecules.
- each strand of said adapter-containing nucleic acid molecules with a complex of an elongation primer with a hairpin adapter under conditions adequate for the hybridization of the elongation primer to the second strand of the adapter, wherein the elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule and which, after hybridization with the second strand of the adapter molecule creates overhanging ends, and wherein the hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer to the second strand of the adapter.
- step (b) a plurality of adaptor-modified nucleic acid molecules is provided, wherein the strands of genomic DNA fragments are further paired by using barcode sequences.
- step (c) Converting non-modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules; and/or c22) Converting modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules.
- nucleic acid molecule of step (i) is generated by: a. Providing a double stranded nucleic acid molecule, preferably wherein the double-stranded DNA molecule is a fragment of genomic DNA; b. Covalently linking the forward and reverse single-stranded nucleic acid molecules provided in step a., wherein the covalent linking in step b.
- nucleic acid molecule which comprises a 5' region and a 3' region, wherein the 5' region and the 3' region are covalently linked by a nucleotide sequence to which primers can bind, wherein the base identities in one of the 5' or 3' regions and the base identities in the other region both provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule.
- step a. The method according to item 20, wherein the double stranded nucleic acid molecule provided in step a. is provided as a population of double stranded nucleic acid molecules, preferably wherein the plurality of double-stranded DNA molecules are fragments of genomic DNA.
- step (ii) The method according to one or more of items 1 to 21, wherein the method further comprises determining the true identity of a base at a certain locus in an original nucleic acid molecule, based on the information provided in step (ii).
- step (ii) The method according to one or more of the preceding items, wherein the method further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position in an original nucleic acid molecule, based on the information provided in step (ii).
- a computer program comprising instructions which, when executed by a computer, is able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position in an original nucleic acid molecule, based on the information provided in step (ii) of the method as defined in any one of the preceding items.
- a kit comprising at least two different primers, such as at least three different primers, preferably four different primers, wherein the at least two different primers, such as at least three different primers, preferably four different primers, are capable of at least partially binding to at least three, preferably at least four different regions in the nucleic acid molecule provided in step (i) of the method as defined in any one of items 1 to 25, wherein:
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the 5' region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 3' region of the nucleic acid molecule provided in (i);
- At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), to sequence at least part of the 3' region of the nucleic acid molecule provided in (i); and/or
- At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 5' region of the nucleic acid molecule provided in (i).
- kit further comprises a double stranded adapter for use in the method as defined in any one of items 9-24, wherein the adapter comprises a first nucleic acid strand and a second nucleic acid strand, wherein the 3' region of the first nucleic acid strand and the 5' region of the second nucleic acid strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first nucleic acid strand and the 5' region of the second nucleic acid strand of the adapter are compatible with the ends of a double stranded nucleic acid molecule, wherein the double-stranded region of the adapter comprises one or more barcode sequence(s), and wherein the 3' region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment being located at the 3' end of the 3' region and the second segment being located
- the adapter comprises at least one barcode sequence in the single stranded region of the adapter and wherein the 3' region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment being located at the 3' end of the 3' region and the second segment being located in the vicinity of the 5' region of the second strand.
- the kit further comprises:
- a library of double-stranded adapters comprising a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity and wherein the ends of said double stranded region are compatible with the ends of double stranded nucleic acid molecules;
- each elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule as defined in (i) and which, after hybridization with the second strand of the adapter molecule creates overhanging ends;
- each hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer as defined in (ii) to the second strand of the Y-adapter as defined in (i), wherein the elongation primers of (ii) and the hairpin adapters of (iii) may be provided as a complex; wherein the adapters of (i), the elongation primers of (ii) and the hairpin adapters of (iii) are suitable for obtaining a library of adapters for use in the method as defined in any one of items 9 to 24.
- the molecule exemplified herein is the GEUS molecule as described in WO 2015/104302. As discussed above, this molecule is an example of a molecule as defined in step i. of the method of the present invention.
- the advantages and effects explained herein for the GEUS molecule are equally applicable to any molecule as defined in step i. of the method of the present invention.
- the nucleic acid molecule of STEP i. may be generated by:
- Step a Providing a double-stranded nucleic acid molecule: See, e.g., Figure 4A:
- Step b Ligating, at least partially, double-stranded (ds) adaptors to at least one end, and preferably to both ends, wherein one of the adaptors comprises a hairpin ("hairpin" in the below representation): See, e.g., Figure 4B.
  adaptor - ATCGAAMGMT - hairpin
  hairpin - TAGCTTGMGA - adaptor
- Step c Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the "hairpin", using each of the strands of the nucleic acid molecules obtained in step (b2) as template.
- See, e.g., Figure 4D:
  adaptor - ATCGAAMGMT - hairpin - AGCGTTCGAT
  adaptor - AGMGTTCGAT - hairpin - ATCGAACGCT
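Step c can be mimicked in silico. The sketch below assumes that the polymerase incorporates ordinary (non-methylated) cytosine opposite guanine and guanine opposite both C and M, which is consistent with the two example molecules shown for Figure 4D; the function names are hypothetical.

```python
# Base incorporated opposite each template base; M (methylated C) templates a G.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def synthetic_complementary_strand(template: str) -> str:
    """Strand synthesized by elongation from the hairpin 3' end.

    The template is read 3'->5', so the product (written 5'->3') is the
    reverse complement of the template, without methylation marks.
    """
    return "".join(COMPLEMENT[base] for base in reversed(template))


def extend_hairpin(template: str) -> str:
    """Text representation of the paired molecule obtained after step c."""
    return f"adaptor - {template} - hairpin - {synthetic_complementary_strand(template)}"


top = extend_hairpin("ATCGAAMGMT")     # strand shown 5'->3' in step b
bottom = extend_hairpin("AGMGTTCGAT")  # opposite strand, written 5'->3'
```

Note that the methylation marks (M) remain only in the template half of each molecule; the newly synthesized half carries plain C.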
- Step c.21 Converting the non-methylated cytosine(s) of the molecules obtained after step c), to a base which is read distinctly from cytosine (e.g., uracil/thymine) by, e.g., bisulfite treatment:
- For simplicity, we will continue this example with only one of the molecules obtained after step c. However, it is noted that the method can be continued with both of them, see Fig. 4E.
  adaptor - ATUGAACGCT - hairpin - AGUGTTUGAT - adaptor
  (Converted non-methylated cytosines are shown as U)
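The bisulfite conversion of step c.21 can be sketched as a per-base substitution. The example assumes complete conversion of every non-methylated cytosine to uracil and full protection of methylated cytosine (M), which is then read as C; the function name is hypothetical.

```python
def bisulfite_convert(seq: str) -> str:
    """Idealized bisulfite treatment of a strand in M/C/G/T/A notation.

    Non-methylated cytosine (C) becomes uracil (U); methylated cytosine (M)
    is protected and is subsequently read as an ordinary cytosine (C).
    """
    converted = []
    for base in seq:
        if base == "C":
            converted.append("U")
        elif base == "M":
            converted.append("C")
        else:
            converted.append(base)
    return "".join(converted)


first_region = bisulfite_convert("ATCGAAMGMT")   # original strand of step a
second_region = bisulfite_convert("AGCGTTCGAT")  # synthetic complementary strand
```

Because the synthetic complementary strand is assumed to contain no methylated cytosine, all of its cytosines are converted, whereas the original strand keeps a C wherever it carried an M.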
- the resulting molecule is a "GEUS molecule", as described, e.g., in WO 2015/104302, and comprises a first region (in bold), second region (in italics), wherein both regions are covalently linked by a nucleotide sequence to which primers can bind (comprising the hairpin of the adaptor, as explained above, also referred to as "linking region"), wherein the base identities in one of the first or second regions and the base identities in the other region both provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule (in this case, they provide information on the base identities of the original strand of step a "ATCGAAMGMT"), and wherein the molecule comprises one adaptor in the 5' end and another adaptor in the 3' end.
- This molecule is an example molecule of the molecule provided in step i. of the method of the present invention, also represented in Fig. 5A.
- This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules.
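The reason a probe directed at the second region captures molecules irrespective of their methylation state can be shown with a toy computation. It assumes, as in the worked example, that the second region is a synthetic copy containing only non-methylated cytosine and that bisulfite conversion is complete; the helper names and the three methylation variants of the example strand are illustrative assumptions.

```python
# Pairing table; M pairs like C, U pairs like T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G", "U": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def bisulfite_convert(seq: str) -> str:
    # C -> U (converted), M -> C (protected, read as C), other bases unchanged.
    return "".join({"C": "U", "M": "C"}.get(base, base) for base in seq)


def second_region(original: str) -> str:
    """Converted synthetic complementary strand of an original strand."""
    return bisulfite_convert(reverse_complement(original))


def capture_probe(original: str) -> str:
    """DNA probe complementary to the converted second region."""
    return reverse_complement(second_region(original))


# Same locus, three hypothetical methylation states of the example strand.
variants = ["ATCGAAMGMT", "ATCGAACGCT", "ATMGAAMGMT"]
first_regions = {bisulfite_convert(v) for v in variants}
second_regions = {second_region(v) for v in variants}
```

The three variants give three different converted first regions but a single shared second region, so one probe sequence hybridizes equally well to all of them.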
- the molecule provided in STEP i. is sequenced using at least two primers that bind to at least three, preferably to at least four different regions.
- four different primers binding to four different regions in the nucleic acid molecule provided in (i) are used, as explained below.
- the skilled person will however immediately recognise that at least some of the advantages and effects described herein are equally applicable to methods where at least two different primers, such as three different primers, binding to at least three different regions in the nucleic acid molecule provided in (i) are used.
- primer 1 is capable of binding to a portion of the adapter at the 5' end of the GEUS molecule, to sequence at least part of the first region of the GEUS molecule provided (primer corresponding to item 1. of the claimed method
- primer 2 is capable of binding to a region of hairpin, to sequence the second region of the GEUS molecule (primer corresponding to item 2. of the claimed method
- primer 3 is capable of binding to the adapter at the 3' end of the GEUS molecule, to sequence at least part of the second region of the GEUS molecule (primer corresponding to item 3. of the claimed method
- step iii.) primer 4 is capable of binding to a region of hairpin, to sequence the first region of the
- the GEUS molecule will be sequenced, in practice, based on a single-end (SE) sequencing, as Read 1 and Read 3 provide information of the bases located in the same loci of the original molecule. Since this GEUS molecule comprises two regions (a first region highlighted in bold, and a second region highlighted in italics) that provide related information, which is the information on the base identities in the corresponding loci in an original nucleic acid molecule (strand ATCGAAMGMT in step a. above), using only two conventional primers that hybridize in the adaptors of the molecule (i.e., "usual PE sequencing") will result, in practice, in SE sequencing, as shown in Fig. 6A.
- SE sequencing single-end
- Read 2 and 4 two additional reads are obtained, named Read 2 and 4. Said reads derive from primers that hybridize in the hairpin region of the GEUS molecules (p2 and p4, respectively, see Figure 5B).
- true paired-end (PE) sequencing for each of the first and second regions of the GEUS molecule is obtained, because the first and second regions are now read from each of their ends, as shown above and in Fig. 6B.
- the method of the present invention allows for PE sequencing for molecules as the ones defined in step i. of the method, for example the GEUS molecule described in WO 2015/104302.
- the method claimed herein provides more than two, such as at least three and preferably at least four sources of information (namely four reads) per original single strand molecule (e.g., up to eight sources in total if the original molecule was a double stranded molecule, such as in the present example).
- more than two, such as three, preferably four sources of information per original strand is advantageous as it allows the detection of sequencing errors that would not be detected if only two reads were obtained, as explained above and herein below:
- the underlined bases in the reads correspond to a true base in the original DNA template. For example, in Read 1, when an "A" is read, it means that the original DNA template has an "A" in said position.
- the non-underlined bases in the reads are bases that, when read, admit two possible alternatives, so that the true identity of the base at that position of the DNA template cannot be inferred from that source of information. For instance, in Read 1, when a "T" is read, it can be a "T" or an "unmethylated C" in the original DNA template.
- Read 3 cannot overcome the ambiguity, and the true identity of the base will be inferred incorrectly. This is the case of the erroneous "G" highlighted with a thick black border in Read 1. Although Read 3 shows an "A", the base "A" in Read 3 can mean either a true "G" or a true "A", as Read 3 is ambiguous for said bases.
- Read 1 contains an error in a base that cannot be directly inferred by Read 3
- the true base identity in the DNA template will be mistaken: since a "G" in Read 1 means a true "G" in the DNA template, the mistaken "G", highlighted with a thick black border, will be inferred as a true "G", resulting in an error in the sequencing of the DNA template.
- Read 1 and Read 3 are similar to those shown above. However, two new reads are provided: Read 2 and Read 4 are the reads derived from primers 2 and 4, respectively, see Fig. 6B.
- the claimed method allows for very low error rates when sequencing a molecule as defined in step i. of the method of the present invention. This is because more than two reads (e.g., at least three, and preferably at least four reads) per molecule are obtained.
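The relationship between the first and second regions described in this example can be sketched in a few lines of Python. This is an illustrative sketch only, not the claimed method: the function names and the `DECODE` table are assumptions derived from the conversion rules stated in the text (non-methylated C is read as T, methylated C, written M, is read as C, and the synthetic complementary strand contains no M). It rebuilds both regions from the example strand "ATCGAAMGMT" and shows why an erroneous "G" in the first region goes unnoticed when only the two regions are compared.

```python
# Illustrative sketch; all names and the decoding table are assumptions
# derived from the conversion rules described in the text.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a plain A/C/G/T sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def bisulfite_read(strand: str) -> str:
    """Sequence as read after conversion: C -> U (read as T), M read as C."""
    return strand.replace("C", "T").replace("M", "C")


def make_regions(original: str) -> tuple[str, str]:
    """Return (first region, second region) as they would be read."""
    first = bisulfite_read(original)
    # The synthetic strand is built with non-modified nucleotides only.
    synthetic = reverse_complement(original.replace("M", "C"))
    second = bisulfite_read(synthetic)
    return first, second


# (base in first region, base in reverse-complemented second region)
DECODE = {
    ("A", "A"): "A",
    ("G", "A"): "G",
    ("T", "T"): "T",
    ("T", "C"): "C",  # non-methylated cytosine
    ("C", "C"): "M",  # methylated cytosine
}


def decode(first: str, second: str) -> str:
    """Infer the original strand; '?' marks an inconsistent pair."""
    aligned = reverse_complement(second)
    return "".join(DECODE.get(pair, "?") for pair in zip(first, aligned))


first, second = make_regions("ATCGAAMGMT")
print(first, second)          # regions as read (U shown as T)
print(decode(first, second))  # recovered original strand

# An A -> G error in the first region is a valid pair ("G", "A"),
# so it is silently decoded as a true G.
print(decode("ATTGGACGCT", second))
# An A -> C error gives the impossible pair ("C", "A") and is flagged.
print(decode("CTTGAACGCT", second))
```

The sketch models only the two regions; it does not model Reads 2 and 4, which according to the text supply the additional, independent observations needed to catch errors such as the undetected "G".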
- EXAMPLE 2 Advantages of using the molecule of the invention wherein the second region does not comprise any modified nucleotides, and of using adapters having the special features of “wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T)”.
- a modified nucleotide e.g., G if we are converting C to U/T
- at least one modified nucleotide e.g., methylated C if we are converting C to U/T
- Example 4 reproduces the same method and starting molecule, but including modified nucleotides (methylated C) in the extension step.
- Example 3 reproduces the same method but using different adapters not having the above mentioned special features.
- the starting molecule exemplified herein is the GEUS molecule as described in WO 2015/104302. As discussed above, this molecule is an example of a molecule as defined in step i. of the method of the present invention.
- STEP i providing a plurality of nucleic acid molecules of the present invention.
- the nucleic acid molecules of STEP i. may be generated by:
- Step a Providing a double-stranded nucleic acid molecule: See, e.g., Figure 7A, where a fragment of dsDNA from whole genome is represented, to which an "A tailing" is added (highlighted with black background).
- Step b Ligating double-stranded (ds) adaptors to each end of the molecule of step a) wherein one of the adaptors comprises a hairpin ("hairpin" in the below representation): See Fig. 7B, where the adaptors are called herein E9 adapters and are added to each end of the molecule resulting from step a.
- hairpin in the below representation
- M denotes methylated cytosine (i.e., a modified nucleotide)
- C denotes non-methylated C (i.e., non-modified nucleotide).
- each adapter has a first and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of the double stranded DNA molecules, wherein: the 3' region of the second strand ("GTGCCTCAGGCTCCGATCGAGTGTTGTCTCGATCGGAGCCTGAGGCAC" in Fig 7B) forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- GMTTMMGATMTGGMGTGGMAGGATTATT in Fig 7B
- Fig 7B the first strand
- a region (named DUPLEXUMIT in Fig 7B) comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region, and
- a region (named Y_ssDNA in Fig 7B, "GMTMTTMMGATMT") that is not complementary to the second strand but that is sufficiently complementary to a primer, thereby allowing the primer to bind to said region and be extended by action of a polymerase, wherein the double stranded region formed by the first and the second strands of the adapter (sequences "GGMGTGGM" and "MCGCAMCG" in Fig. 7B) comprises at least one non-modified nucleotide complementary to a modified nucleotide (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the ds region formed by the first and second strands of the adapter comprises:
- the double stranded region of the adapter comprises UMI sequences (i.e., one or more barcode), together with the A tailing added in step a., as shown in bold and with black highlight, respectively, in Fig. 7B.
- Step c Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the "hairpin", using each of the strands of the nucleic acid molecules obtained in step (b2) as template.
- Fig. 7C represents the ligation of the adapters E9 to the molecule provided in step a.
- the strands are denatured (see Fig. 7D).
- the synthetic complementary strand is generated by extension of the 3' end of the adapters, using natural (non-modified) nucleotides (A, C, G, T). Hence, the synthetic strand does not have any M, see Fig. 7E.
- step c the molecules generated are treated with an agent (in this case, bisulfite) capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur.
- an agent in this case, bisulfite
- Fig. 7G shows the same molecules as Fig. 7F but in a linear form.
- the adapters comprising the special features and used herein also provide further advantages to the method and the molecule of the invention. Since the adapters have the above mentioned special features, the resulting extended molecule comprises four regions, highlighted in Fig 7F with a box (numbered 1-4 in the Watson insert and 1'-4' in the Crick insert), that are substantially different from each other, so that a primer can only specifically bind to one of them, and not to the others.
- the advantages associated with the presence of the four different regions will be explained below. Of note, these four boxes represent regions 1 to 4 and 1' to 4' explained above, and can also be named as A, B, C, D or A', B', C', D', respectively.
- FW and RV the forward (FW) and reverse (RV) fusion primers (FP) used are shown in Fig. 7H. These primers are prepared for Illumina sequencing, which requires different primer binding sites at each end of the molecule, see Fig. 5B.
- Fig. 7I shows in the upper part all of the molecules of the reaction (the two molecules of Fig. 7G and the two primers (two copies of each) of Fig. 7H), and in the bottom part the specific hybridization of the primers of Fig. 7H to the molecules of Fig. 7G.
- Fig. 7I shows that the reverse primer can bind and amplify the molecule of the invention (it hybridizes in boxes 4 and 4'), while the forward primer does not yet have a complementary sequence to bind. This ensures sequencing directionality (the 5' and 3' ends of each molecule are different). This directionality is an effect derived from the features of the adaptors, which create the four different regions comprised in boxes 1-4 and 1'-4'.
- the primers would specifically hybridize in more than one region in the molecule of the invention, losing said directionality and generating aberrant amplicons.
- An example of the full method using adapters not having the above mentioned special features in the ds region is provided in Example 4.
- a first effect of having the adaptors with the special features is that they create the necessary directionality for the reverse primer to bind in the first cycle of PCR.
- the adaptors did not have at least one G in the double stranded region, two of the sequences included in the boxes would be reverse complements of each other ("complementarity effect"), resulting in the loss of directionality because the forward primer would also be able to hybridize in the first molecule.
- the adapters did not have the special feature of having at least one modified nucleotide (M) and at least one non-modified nucleotide susceptible of being converted (C), the different regions included in the boxes would be identical, so they would be tandem repetitions ("mirror effect"). This would cause the forward primer to hybridize in both ends of the molecule, thereby creating molecules with the same ends, and thus not suitable for Illumina sequencing, as will be explained in the next example.
- the regions included in boxes 1-4 and 1'-4' are all different.
- the presence of C and M in the adaptors also serves as a control for efficient bisulfite conversion, because if the bisulfite conversion does not work, or is not carried out, the complementarity/specificity of the reverse primer will be reduced and that of the forward primer will increase, leading to the loss of directionality.
- the presence of C and M in the adapter also ensures a method with optimized efficiency, which would not be possible if no bisulfite treatment is performed, or if all of the cytosines are methylated (see Example 4).
- a further control of the efficiency of the bisulfite conversion is the fact that, since the extension step was performed with non-modified C, the resulting synthetic strand of the GEUS molecule lacks non-methylated C (all non-methylated C were converted to U after bisulfite treatment). When the synthetic strand of the GEUS molecule comprises non-modified C, it is indicative of low bisulfite conversion efficiency.
- Fig. 7J shows the extension and denaturation cycles, in preparation for the next amplification step, shown in Fig. 7K. It is observed that, in the second PCR cycle (Fig. 7K), the forward fusion primer can anneal to the first-generation amplicons, thereby generating more amplicons. Once the forward primer is extended, the first complete molecules to be sequenced appear (see Fig. 7K, bottom part, molecules inside a box).
- Fig. 7L and 7M represent the third cycle of PCR, wherein the complete dsDNA molecules to be sequenced are highlighted with a black box.
- This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules.
- a universal probe Watson can be designed, which will be complementary to the second region of all of the molecules in the plurality of molecules (Fig. 7N represents a plurality of 4 different molecules, each one with a different methylation status). If the second region were absent, four different probes Watson would need to be designed in order to capture all molecules (called probe I, II, III, and IV Watson in Fig. 7N).
- Fig. 7N represents the benefits of the capture probe designed for the method of the present invention. It is capable of capturing all molecules with the same specificity and efficiency because it is directed to hybridize with the second region of the molecules, which is identical in all of them.
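The benefit described for Fig. 7N can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the procedure of the example: the four input sequences are hypothetical molecules that share one locus but differ in methylation (M denotes methylated C), probes are modelled simply as the reverse complement of the region they target, and the helper names are invented for this sketch.

```python
# Illustrative sketch; sequences, names and the probe model are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a plain A/C/G/T sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def bisulfite_read(strand: str) -> str:
    """Sequence as read after conversion: C -> U (read as T), M read as C."""
    return strand.replace("C", "T").replace("M", "C")


# Four hypothetical molecules: same locus, different methylation status.
variants = ["ATCGAACGCT", "ATMGAACGCT", "ATCGAAMGMT", "ATMGAAMGMT"]

# First region: the converted original strand, which keeps the methylation
# information and therefore differs between molecules.
first_regions = {bisulfite_read(v) for v in variants}

# Second region: converted synthetic complement built with non-modified
# nucleotides, so every C is converted regardless of the original status.
second_regions = {
    bisulfite_read(reverse_complement(v.replace("M", "C"))) for v in variants
}

probes_for_first = {reverse_complement(r) for r in first_regions}
probes_for_second = {reverse_complement(r) for r in second_regions}

print(len(probes_for_first), "probes needed if targeting the first region")
print(len(probes_for_second), "probe needed if targeting the second region")
```

Under these assumptions, targeting the first region requires one probe per methylation pattern, whereas the second region is the same in all four molecules and a single probe matches each of them perfectly.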
- EXAMPLE 3 Disadvantages of using modified adapters without the special features of “wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T)”.
- a modified nucleotide e.g., G if we are converting C to U/T
- Fig. 8A shows modified adapters not comprising at least one non-modified nucleotide complementary to a modified nucleotide (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
- the bottom part of Fig. 8A shows the ligation of these modified adapters to the molecule.
- Fig. 8B represents the denaturation step.
- Fig. 8C shows the extension step once said modified adapters have been ligated to the molecule.
- a synthetic complement strand (second region) is created for each original strand (first region).
- the extension is performed with non-methylated Cs.
- Fig. 8D shows the resulting molecule after transformation or conversion step.
- Fig. 8D represents the molecules of Fig. 8B where four boxes are indicated.
- boxes 1 and 3, and boxes 2 and 4, are tandem repetitions. The same effect is observed in boxes 1' and 3', and 2' and 4'.
- This tandem effect is caused by the lack of at least one M and one C in the ds region of the adapters. See also Fig. 8E, upper part (modified adapter molecules).
- the regions included in boxes 1 and 2, and in boxes 3 and 4, are reverse complements of each other.
- the same effect is observed in boxes 1' and 2', and 3' and 4'.
- This reverse complement repeat effect is caused by the lack of at least one G complementary to a modified nucleotide in the ds region of the adapters. See also Fig. 8E, upper part (modified adapter molecules). These effects do not occur when the adaptors have the special features, as they result in four different regions, see Fig. 8E, bottom part (GEUS molecule).
- tandem and reverse complement repeats will have negative implications in the subsequent steps of the method.
- the tandem repeat effect means that, during amplification, the primers will be able to bind to two regions of the molecules (2 and 4, or 2' and 4') and copies thereof, see Fig. 8G. This will produce short amplicons (aberrant molecules that will be amplified and sequenced more efficiently) alongside long amplicons (complete molecules).
- the reverse complement repeat effect will cause that the forward primer would have to join the complementary molecule (once the reverse primer acts) but, because of complementarity in the dsDNA sequences in the adapter, these are already complementary regions and can join directly (except for one letter), yielding aberrant molecules that will be amplified but not sequenced (short and complete FPRV-FPRV) and molecules that will be amplified but with lost directionality (short and complete FPRV-FPFW) see Fig. 8G.
- Fig. 8H and I show the next amplification steps performed with the negative effects described above.
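The three adapter features whose absence this example illustrates can be checked mechanically on a double-stranded adapter region. The Python sketch below is an assumption-laden illustration: it takes the two sequences quoted for Fig. 7B in Example 2 ("GGMGTGGM" and "MCGCAMCG"), assumes that the second one is written 3' to 5' so that the two strings pair position by position, and uses hypothetical helper names. It does not model the tandem or reverse complement repeat effects themselves.

```python
# Illustrative sketch; helper names and the pairing convention are assumptions.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def plain(base: str) -> str:
    """Treat methylated C (M) as C for base-pairing purposes."""
    return "C" if base == "M" else base


def is_complementary(first: str, second_3to5: str) -> bool:
    """True if the strands pair position by position (second written 3'->5')."""
    return len(first) == len(second_3to5) and all(
        PAIRS[plain(x)] == plain(y) for x, y in zip(first, second_3to5)
    )


def has_special_features(first: str, second_3to5: str) -> bool:
    """Check for a G opposite an M, at least one M, and at least one C."""
    pairs = list(zip(first, second_3to5))
    g_opposite_m = any(p in (("G", "M"), ("M", "G")) for p in pairs)
    has_modified = any("M" in p for p in pairs)
    has_convertible = any("C" in p for p in pairs)
    return g_opposite_m and has_modified and has_convertible


# Double-stranded region quoted for Fig. 7B.
print(is_complementary("GGMGTGGM", "MCGCAMCG"))
print(has_special_features("GGMGTGGM", "MCGCAMCG"))

# Hypothetical variant in which every cytosine is methylated: no C is left
# to be converted, so the third feature is missing.
print(has_special_features("GGMGTGGM", "MMGMAMMG"))
```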
- EXAMPLE 4 Disadvantages of using a molecule wherein the second region comprises methylated C (M).
- This Example represents the same method as Example 2, but with the difference that the extension step is performed using methylated C (M).
- STEP i providing a plurality of nucleic acid molecules.
- Step a Providing a double-stranded nucleic acid molecule: See, e.g., Fig. 9A, where a fragment of dsDNA from whole genome is represented, to which an "A tailing" is added.
- Step b Ligating double-stranded (ds) adaptors to each end of the molecule of step a) wherein one of the adaptors comprises a hairpin ("hairpin"): See Fig. 9B, where adaptors called herein E9 adapters are added to each end of the molecule resulting from step a.
- ds double-stranded
- M denotes methylated cytosine (i.e., a modified nucleotide)
- C denotes non-methylated C (i.e., non-modified nucleotide).
- Step c Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the "hairpin", using each of the strands of the nucleic acid molecules obtained in step (b2) as template.
- Fig. 9C represents the ligation of the adapters E9 to the molecule provided in step a.
- the strands are denatured (see Fig. 9D).
- the synthetic complementary strand is generated by extension of the 3' end of the adapters, using natural nucleotides (A, G, T), and modified C (methylated C). Since only methylated cytosines were used in step c., the synthetic complementary strand generated comprises all methylated cytosines (M), see Fig. 9E.
- step c. the molecules generated are treated with an agent (in this case, bisulfite) capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur.
- an agent in this case, bisulfite
- this molecule is more difficult to treat with said agent because both strands are strongly bound by G-M bonds. The agent is more effective on ssDNA and, in order to separate the G-M bonds, more energy is required.
- the extension step is performed with M
- the strands of each molecule are bound more strongly to each other than in the case of the molecule in Example 2 (Fig. 7F), where the extension step was performed with non-methylated C, which is transformed into uracil by the agent, leaving non-complementary gaps between the two strands that facilitate the access of more agent (such as bisulfite) to completely convert the entire molecule.
- An efficient transformation step is crucial for the efficiency of the method, as discussed in the previous examples.
- the resulting molecule due to the presence of the methylated C in the synthetic complementary strand, comprises four regions, indicated in Fig. 9F with boxes. Two of said four regions are tandem copies, i.e., are similar to each other: box 1 is identical to box 3, and box 1' is identical to box 3'. This effect was avoided in the Example above, Fig. 7F, by performing an extension step with natural C (non-modified C), rather than with methylated C.
- Fig. 9G shows the same molecules as Fig. 9F but in a linear form.
- the molecules need to be denatured.
- the molecule shows a high level of complementarity. This did not happen in the molecules depicted in Fig. 7G and 7F, where the level of complementarity between both strands was weaker due to the transformation of non-methylated cytosines into uracil, which did not base pair with the corresponding G.
- Fig. 9H shows the fusion primers to be used, which are identical to those shown in Fig. 7H.
- Fig. 9I shows in the upper part all of the molecules prepared for the amplification/sequencing reaction, and in the bottom part the specific hybridization of the primers of Fig. 9H to the molecules of Fig. 9G.
- Fig. 9I shows that, in this example, it is the forward primer that can bind to box 4 of the molecules, while the reverse primer cannot bind.
- the first amplicon is created, another forward primer will bind to it (Fig. 9J-L).
- the molecule is being amplified from both ends with the same primer, thereby generating molecules that have identical ends, produced by the forward primer.
- These molecules would be extended but not sequenced, since sequencing platforms require different primer binding regions at each end (Fig. 5B).
- Fig. 7I-L only the reverse primer could bind in the first cycle, and the forward primer could bind in the second cycle, creating the desired directionality.
- Fig. 9M the first complete molecules are generated, but they are not suitable for sequencing because their ends are not different.
- This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules.
- Fig. 9N shows the capture step for this example.
- EXAMPLE 5 Method of the present invention using different adapter (E15).
- Fig. 11 A, B, C show the adapter E15, and the beginning of the method of the present invention.
- the only difference with Example 2 is that the adapter is provided as a complex of two molecules, wherein the hairpin is provided as a separate molecule, see Fig. 11A.
- the ligation is thus performed in two steps, see Fig. 11B and C.
Abstract
The present invention relates to methods for capturing nucleic acid sequences while keeping/preserving their epigenetic information, preferably for capturing certain regions of the genome, thereby enriching these regions, while maintaining their epigenetic information, with the same probability/efficiency, independently of whether one or more nucleotides are epigenetically modified. In particular, the present invention relates to a method comprising the steps of: (i) providing a plurality of nucleic acid molecules, wherein each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the nucleic acid molecules of the plurality of nucleic acid molecules may have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified and wherein, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules; and (ii) capturing the molecules provided in (i) by using at least one capture probe that binds to at least a portion of the second region which comprises at least one nucleotide which is located at a position which corresponds to a locus in the ROI which is occupied by a nucleotide susceptible of being modified.
Description
METHOD FOR CAPTURING EPIGENETICALLY MODIFIED DNA
FIELD OF THE INVENTION
The present invention relates to methods for capturing nucleic acid sequences while keeping/preserving their epigenetic information, preferably for capturing certain regions of the genome, thereby enriching these regions, while maintaining their epigenetic information, with the same probability/efficiency, independently of whether one or more nucleotides are epigenetically modified. The invention also relates to computer programs related to the methods of the invention.
BACKGROUND OF THE INVENTION
When studying the epigenetic variations present in a polynucleotide of interest or in a region of interest within a polynucleotide (e.g., the methylation status of a region of interest within a polynucleotide), the sample usually contains a plurality of nucleic acid molecules that differ in the chemical (epigenetic) modifications present in one or more nucleotides at one or more loci (e.g., the sample usually contains a plurality of nucleic acid molecules that differ in their methylation status). This means that some molecules will have at certain loci, certain nucleotides which are chemically (epigenetically) modified (e.g., some molecules will have certain cytosines methylated), while other molecules will have different nucleotides which are chemically (epigenetically) modified (e.g., other molecules will have different methylation status). In fact, the variations in the epigenetic modifications present in a certain sample may vary from non-modified molecules to fully modified molecules, and all the possibilities in between. If the epigenetic modification is methylation, the variations in the pattern of methylated cytosines may vary from non-methylated molecules to fully methylated molecules, and all the possibilities in between.
On the other hand, in Whole-Genome Bisulfite Sequencing (WGBS) methods, the molecules from a given sample are first treated with bisulfite, to preserve in each molecule the information regarding the original methylation status of the sample. The chemical transformation with bisulfite of the nucleic acids results in the generation of ambiguity, as non-methylated cytosines will be transformed to uracils and visualized (read) as thymines, whereas the methylated cytosines will not be transformed and will be read as cytosines. Then, the sample is sequenced to study the sequence and to ascertain the methylation status of the sample. Prior to sequencing, the molecules are generally randomly fragmented (e.g., by physical shearing), a step that is essential in sample preparation for sequencing platforms, such as Next Generation Sequencing.
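The ambiguity introduced by bisulfite treatment can be made concrete with a minimal Python sketch. The two template sequences and the function name are hypothetical, and M is used here to denote a methylated cytosine; the sketch only applies the conversion rule stated above.

```python
# Illustrative sketch; the templates and the function name are assumptions.
def bisulfite_read(strand: str) -> str:
    """Non-methylated C is read as T; methylated C (M) is read as C."""
    return strand.replace("C", "T").replace("M", "C")


# Two different templates: the second position is a non-methylated C in one
# and a genuine T in the other.
templates = ["ACGTMG", "ATGTMG"]
reads = [bisulfite_read(t) for t in templates]

for template, read in zip(templates, reads):
    print(template, "->", read)

# Both templates give the same read, so a T in the read cannot by itself
# tell a true T from a converted C, whereas a C in the read always
# indicates a methylated C.
print(reads[0] == reads[1])
```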
In addition, it may be necessary to include a capture step to select and enrich regions of interest within the sample. Targeted sequencing of just the coding regions or specific genes or segments/regions of chromosomes or genes that are relevant to a particular disease/purpose has several advantages, such as reduction of sequencing costs (less sequencing yield and fewer reagents are needed; less data is generated and, thus, less data needs to be analysed; greater depth of coverage is obtained in those regions; and errors are also reduced). Several methods of target enrichment are available. For instance, the nucleic acid molecule (e.g., DNA in the case of WGBS) is hybridized to single-stranded oligonucleotides (probes or baits that are designed to target specific regions of interest). Typically, these probes are marked (e.g., biotinylated) and can be recovered (e.g., using streptavidin-coated magnetic beads). The process can be used to capture targeted nucleic acids.
Hence, it may be desirable to know the epigenetic modification status (such as the methylation status) of a certain region of interest (ROI) within a genome or, generally, within a nucleic acid molecule, where the ROI comprises nucleotides which are susceptible of carrying epigenetic modifications (e.g., cytosine methylations or other chemical modifications). Nucleic acid molecules comprising nucleotides located at loci corresponding to those in the ROI (e.g., nucleic acid molecules from different cells, or from different samples) may be fully modified or non-modified, and all the possibilities in between. Hence, in order to capture all possible variations, two possibilities exist. One possibility would be to design one probe for each of the possible molecules (with different epigenetic modification information preserved), to capture all molecules. If the number of possibly modified nucleotides within a region of interest covered by a single probe is high, the number of probes to capture all possible combinations within the same region would then be very high. The second possibility would be to design a single "consensus" probe. The first possibility (different probes) is expensive and time-consuming, since a large number of probes would be needed when the sample contains multiple possible epigenetic variation information.
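How quickly the number of required probes grows can be shown with a short Python sketch. It assumes, for illustration only, that every cytosine in the probed region can be methylated or not independently of the others, so that a region with n cytosines yields 2^n distinct converted sequences; the region sequence and the function name are hypothetical.

```python
# Illustrative sketch; the region and the independence assumption are
# hypothetical and chosen only to show the combinatorial growth.
from itertools import product


def converted_variants(region: str) -> set[str]:
    """All sequences a region can show after conversion.

    Each C is either methylated (still read as C) or not (read as T).
    """
    c_positions = [i for i, base in enumerate(region) if base == "C"]
    variants = set()
    for states in product("CT", repeat=len(c_positions)):
        seq = list(region)
        for pos, base in zip(c_positions, states):
            seq[pos] = base
        variants.add("".join(seq))
    return variants


region = "ATCGAACGCT"  # contains three cytosines
variants = converted_variants(region)
print(len(variants), "distinct converted sequences for", region)
for seq in sorted(variants):
    print(seq)
```

With three cytosines the sketch lists eight sequences; a probe-sized region with, say, ten such positions would already give 1024 under the same assumption.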
On the other hand, designing a single "consensus" probe comes with the drawback that the probe will bind with more affinity/efficiency to some of the molecules than to the others, depending on the epigenetic modifications present in each molecule and their degree of complementarity with the consensus probe. The "consensus" probe may even not be able to hybridize to any of the existing molecules. This possibility will of course cause a bias in the capturing step. This bias will inevitably result in low representation of those molecules that, due to their specific modification status preserved information (e.g., C to T conversion for methylation status), bind with less affinity to the consensus probe, and are thus captured and sequenced less, even if they were predominant in the original sample. Similarly, the bias will also cause molecules that bind with high affinity to the consensus probe (again due to their specific modification status preserved information) to be captured with more efficiency/affinity and thus to be more represented in the final sequencing step, even if they were scarce in the original sample. The bias will ultimately result in an erroneous assessment of the modification status (e.g., methylation status) of a certain sample.
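The source of this bias can be illustrated by counting mismatches between a consensus target and molecules with different methylation patterns. The Python sketch below is a deliberately crude illustration: the sequences are hypothetical converted reads of one region, the consensus is arbitrarily taken to be the fully non-methylated form, and the mismatch count is used only as a rough stand-in for hybridization affinity, which in practice depends on many other factors.

```python
# Illustrative sketch; sequences, the consensus choice and the use of
# mismatch counts as an affinity proxy are all assumptions.
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Converted sequences of the same region for three methylation states.
converted = {
    "non-methylated": "ATTGAATGTT",
    "partially methylated": "ATTGAACGCT",
    "fully methylated": "ATCGAACGCT",
}

# Sequence that a single consensus probe is designed to match.
consensus_target = "ATTGAATGTT"

scores = {
    name: mismatches(seq, consensus_target) for name, seq in converted.items()
}
for name, count in scores.items():
    print(f"{name}: {count} mismatch(es) against the consensus target")
```

In this toy case the non-methylated molecule matches perfectly while the fully methylated one carries three mismatches in ten bases, which is the kind of difference that would skew capture toward one methylation state.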
Therefore, there is a need for improved capturing methods that are more efficiently produced and/or that reduce the undesired bias caused by the use of consensus capture probes that hybridize to nucleic acids that present and preserve epigenetic variation information (such as different methylation status) within a sample.
SUMMARY OF THE INVENTION
The present invention addresses the above needs and provides a new capturing method capable of overcoming the bias caused by the use of consensus capture probes that are designed to hybridize to a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed on them in order to preserve the information on their original epigenetic modifications (e.g., a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed on them to preserve the information on their original methylation status). In addition, the new method comprises the use of a single capture probe for all molecules of the sample, avoiding the need to design, synthesise and use a large number of probes to capture every possible molecule that could be comprised in the sample. This is achieved by providing a plurality of nucleic acid molecules that are characterized by having two regions, a first region and a second region. The first region may comprise one or more modified nucleotides or transformed modified nucleotides, or a copy thereof. The second region is characterized by being identical or at least substantially identical in each of the nucleic acid molecules comprised in the plurality at at least a position which corresponds to the same locus in a region of interest (ROI), wherein the locus in the region of interest is occupied by a nucleotide which is susceptible of being modified. Thanks to the presence of the second region that is identical or substantially identical at at least a locus occupied by a nucleotide which is susceptible of being modified in the ROI in all the molecules, it is possible to capture all of the molecules in the plurality with a single probe, and with the same efficiency and affinity regardless of the original modification status of the nucleotide susceptible of being modified, thus eliminating the bias derived from the presence of a different epigenetic status in at least two of the molecules.
The present invention can be used in, e.g., a type of sequencing technology called Genomic and Epigenomic Unified Sequencing (GEUS) (as described, e.g., in WO 2015/104302), which makes it possible to account for methylation and to greatly reduce the error rate by interrogating the same position of the original sequence in different contexts from two related strands.
An advantage of the present invention is shown in Figure 1, which represents the capturing step of a method for determining the epigenetic status (methylation status in this case) of 8 different original nucleic acid molecules corresponding to the same region of interest (ACCGTCGACG, wherein "C" represents a cytosine which may or may not be modified, e.g., methylated). The 8 original nucleic acid molecules have the exact same nucleotide sequence but different epigenetic modifications (see the left-hand region (or 5' region) of the molecules represented in Figure 1B). In Figure 1A, the eight molecules in Figure 1B were treated with bisulfite to convert non-methylated cytosines into uracils/thymines (represented with a T) and to differentiate them from the originally methylated cytosines, which are resistant to bisulfite treatment. The eight molecules in Figure 1A become different in sequence due to the original differences in the methylation status in the first region of each molecule, see Figure 1A. This way, the different epigenetic modification status in the original molecules ("first region" in Figure 1B) is fixed or preserved in the molecules of the present invention (Figure 1A). The "second" region (the 3' region in this specific case) in every molecule is identical or at least substantially identical at at least a locus which is occupied by a nucleotide susceptible of being modified (in this case, there are 8 identical inserts in the 3' region of the molecules, see Figure 1A). Hence, a single capture probe designed to bind (hybridize) to at least a portion of the second region which comprises at least one locus occupied by a nucleotide which is susceptible of being modified (in this case, the probe binds to the whole second region) will capture with the same affinity and efficacy (i.e., without generating a bias due to the different epigenetic status within the original nucleic acid molecules) all of the nucleic acid molecules, regardless of their differences in sequence. This results in a more efficient and accurate capture method, which translates into a better sequencing method, wherein the bias due to the different epigenetic status within the original nucleic acid molecules is eliminated.
The above advantages of the method of the present invention are also achieved even if the first region of each of the molecules comprised in the plurality is not the same, provided that all comprise at least one nucleotide in a position which corresponds to a certain locus in a certain region of interest (e.g., in a certain region of interest in a genome or in a certain region of interest in a nucleic acid molecule). As shown in Figure 1C and D, a single probe will capture the nucleic acid molecules comprised in a plurality with the same efficiency/efficacy regardless of the epigenetic status of the original molecule at a certain locus, since the second region of each molecule in the plurality is identical in all molecules at least at a position which corresponds to the same locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
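The two-region principle described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the construction shown in the figures: loci 3, 6 and 9 are taken as the variable cytosines, the cytosine at locus 2 is treated as unmethylated, and the second region is modelled as an unmodified same-strand copy of the region of interest (the figures may instead use a complementary arrangement). The helper name `bisulfite_read` is hypothetical.

```python
from itertools import product

ROI = "ACCGTCGACG"          # region of interest from Figure 1
VARIABLE_LOCI = (3, 6, 9)   # 1-based loci whose methylation varies (assumption)


def bisulfite_read(seq, methylated_loci):
    """Read each cytosine as T unless its 1-based locus is methylated."""
    return "".join(
        "T" if base == "C" and locus not in methylated_loci else base
        for locus, base in enumerate(seq, start=1)
    )


molecules = []
for flags in product((True, False), repeat=len(VARIABLE_LOCI)):
    methylated = {locus for locus, on in zip(VARIABLE_LOCI, flags) if on}
    first_region = bisulfite_read(ROI, methylated)  # preserves methylation info
    second_region = bisulfite_read(ROI, set())      # unmodified copy: all C read as T
    molecules.append((first_region, second_region))

first_regions = {first for first, _ in molecules}
second_regions = {second for _, second in molecules}
print(len(first_regions), "distinct first regions")
print(len(second_regions), "distinct second region(s):", second_regions)
```

In this model the first regions differ from molecule to molecule, whereas the second region is the same in all of them, so one probe designed against the second region is equally complementary to every molecule.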
For instance, as shown in Figure 1C, the probe will capture molecules 1-3, 6 and 8 (i.e., a plurality of nucleic acid molecules) with the same efficacy/affinity regardless of the epigenetic modifications present at a certain locus, e.g., locus 6, because all these molecules comprise, at a position corresponding to locus 6 in the region of interest, the same nucleotide (G). To exemplify this:
Region of interest: A C C G T C G A C G (see Figure 1D)
Nucleotide at locus 6, marked in bold above: C (this can be methylated (mC) or non-methylated (uC)). Molecules 1-3, 6 and 8 and probe binding to their second region:
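[Table not reproduced: molecule number, first region and second region of molecules 1-3, 6 and 8, with the probe bound to the second region; see Figure 1C.]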
Thus, with the method of the present invention, the bias caused by the use of consensus capture probes that are designed to hybridize to a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed on them in order to preserve the information on their original epigenetic modifications (e.g., a plurality of nucleic acid molecules that differ in their sequence as a consequence of the transformation/conversion performed on them in order to preserve the information on their original methylation) can be overcome or at least drastically reduced. In addition, the method of the present invention comprises the use of at least a single capture probe for all molecules of the sample, avoiding the need to design, synthesise and use a large number of probes to capture every possible molecule that could be comprised in a sample. In particular, the present invention provides a method comprising:
(i) providing a plurality of nucleic acid molecules, wherein each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in the region of interest which is occupied by a nucleotide susceptible of being modified, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the molecules of the plurality of nucleic acid molecules may have (and preferably have) a different nucleotide at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified and wherein, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules; and
(ii) Capturing the molecules provided in (i) by using at least one capture probe that binds to at least a portion of the second region of the plurality of nucleic acid molecules, preferably to at least a portion of the second region which comprises at least one nucleotide which is located at a position which corresponds to a locus in the original molecule which is occupied by a nucleotide susceptible of being modified.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1. Schematic representation of a plurality of nucleic acid molecules from the exact same region or with the exact same nucleotide sequence, wherein the nucleic acid molecules differ between them in three transformed modified nucleotides and/or copies thereof in the first region. Every molecule has an identical sequence at the second region that binds with the same complementarity and thus, same efficiency, to the same probe. In the plurality of molecules shown in 1A, certain Ts (highlighted in red and underlined) were transformed from cytosines with a certain methylation status (e.g., unmethylated (non-modified)) at one of the regions (e.g., first region). The nucleic acid molecules comprised in the plurality differ in the nucleotide sequence of the first region, as a consequence of the treatment (conversion or transformation) for preserving the epigenetic modification status of the original molecules, but are substantially identical (e.g., 100% identical) at the second region, and are also identical to the probe in bold (100% complementary, strictly speaking) (Figure 1A).
Schematic representation of a plurality of molecules corresponding to the exact same region of interest, or with the exact same nucleotide sequence, wherein the nucleic acid molecules differ between them in three modified nucleotides in the first region. Unmodified cytosines (unmethylated) are represented in red and highlighted with double underlining; modified cytosines (methylated) are represented in black and highlighted with single underlining at the first region of the molecule. The second region is identical or substantially identical for all of the molecules (Figure 1B). The first regions of the molecules in Figure 1B correspond to the original molecules, and are particular realizations (possibilities) of nucleic acid molecules corresponding to the region of interest (A C C G T C G A C G, wherein C refers to either methylated C (mC) or non-methylated C (uC)). The molecules in Figure 1A correspond to the molecules in Figure 1B once the molecules in Figure 1B have been treated with an agent capable of converting the non-methylated cytosines into U (T), e.g., bisulfite.
Schematic representation of several nucleic acid molecules which comprise a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of the original nucleic acid molecules. The first region of each of the molecules comprises a nucleotide at at least one position corresponding to a locus in the region of interest (A C C G T C G A C G, wherein C refers to either methylated C (mC) or non-methylated C (uC)) which may be occupied by a nucleotide susceptible of being modified. The sequence of the second region is identical among a plurality of nucleic acid molecules at least at a position corresponding to the same locus in the region of interest, wherein the locus in the original molecule is occupied by a nucleotide susceptible of being modified. Hence, at least at that position, the second region of all molecules comprised in the plurality is complementary to the unique probe and, thus, all molecules comprised in said plurality may bind to the probe with the same efficiency regardless of the methylation status of the nucleotide in that locus (Figure 1C).
Schematic representation of the molecules represented in Figure 1C, but before transformation with an agent capable of converting the non-methylated cytosines into U (T), e.g., bisulfite. Unmodified cytosines (unmethylated) are represented in red and are highlighted with double underlining, and modified cytosines (methylated) are highlighted with single underlining in the first region of the molecule (Figure 1D). The first regions of the molecules in Figure 1D correspond to the original molecules, and are particular realizations (possibilities) of nucleic acid molecules with nucleotides at positions corresponding to at least one locus in the region of interest. The molecules in Figure 1C correspond to the molecules in Figure 1D once the molecules in Figure 1D have been treated with an agent capable of converting the non-methylated cytosines into U (T), e.g., bisulfite.
Figure 2. Schematic representation of bisulfite sequencing library preparation after the bisulfite conversion step and the number and sequences of probes needed to capture with the same complementarity/efficiency a plurality of molecules corresponding to the exact same region of interest. The molecules differ between them in the nucleotides at three positions, corresponding to loci 3, 6 and 9. The differences in these nucleotides among the molecules are the consequence of the transformation of the original molecules with an agent capable of transforming non-methylated cytosines into uracil (thymine). For comparison, the plurality of molecules is the same as the plurality of the first regions of the molecules shown in Figure 1A. As in Figure 1, double underlined "T" represents converted unmethylated cytosines, and single underlined "C" represents methylated cytosines or copies thereof (unmethylated cytosines).
Figure 3. Schematic representation of a nucleic acid molecule according to the present invention starting from a GEUS molecule (as described, e.g., in WO 2015/104302) as a template. The 5' region (e.g., the first region) is represented, from left to right, by the nucleotides A-T-T-G-A-A-C-G-C-T in the gradient coloured (grey scale) lines. The darker side of the gradient represents the start (5') of the original molecule and the lighter side represents its end (3'). The 3' region is represented, from left to right, by the nucleotides A-G-T-G-T-T-T-G-A-T. As before, the darker side of the gradient represents the start of the original molecule and the lighter side, the end. The adapters at the 5' and 3' ends of the molecule are represented by a line in solid grey. The nucleotide sequence covalently linking the 5' region and the 3' region (e.g., the second region) is represented in black, and it links the 5' and 3' regions of the molecule. Optional unique molecular barcodes (UMIs) are represented by a white line. At least two different primers are represented by thin black arrows indicating the direction of the sequencing synthesis. The white thick arrows containing a sequence correspond to the reads synthesized after the primer hybridization in 5' to 3' direction. Reads can have the same or different length. R1 stands for read 1, R2 for read 2, R3 for read 3 and R4 for read 4. mC stands for methylated cytosine and uC stands for unmethylated cytosine.
Figure 4. Schematic representation of a way of obtaining the nucleic acid sequence provided in step (i) of the method of the present invention, as described, e.g., in WO 2015/104302. The darker side of the gradient represents the start of the original molecule and the lighter side, the end of it.
Figure 5. Schematic diagram showing the sequencing step of the method of the present invention inside an NGS machine (e.g., Illumina MiSeq NGS machine). (A) is a representation of a tile of the flow cell where the cluster amplification will take place, with a nucleic acid molecule according to the present invention, in this case a DNA GEUS molecule (as described, e.g., in WO 2015/104302), attached to it. From the bottom to the top, the end of the molecule contains the NGS adapter, in this case Illumina P7, with the first sample index (to multiplex samples in the same lane), then the external unique molecular index (UMI), then the 5' region (or first region) of the nucleic acid molecule followed by the internal UMI, then a known sequence (i.e., a nucleotide sequence to which primers can bind, which can be a hairpin or a region thereof if it is a GEUS molecule), the synthetic internal UMI followed by the 3' region of the nucleic acid molecule, which in this case is the synthetic 3' region (or second region) of the molecule, the synthetic external UMI and lastly the Illumina P5 NGS sequencing adapter with the second sample index, which prevents index hopping.
(B) describes, step by step, how the first 3 reads are sequenced by synthesis after each step of primer hybridization (hybridization of primer 1 and sequencing of read 1, hybridization of primer 2 and sequencing of read 2, hybridization of sample index primer and first index read). Then, the complementary molecule is synthesized and amplified by clustering, and the other 3 reads are generated step by step in the same way: primer hybridization and synthesis of read 3, primer hybridization and synthesis of read 4, and second sample index primer hybridization and synthesis of the second sample index. Depending on the instrument, the reads may be synthesized in a different order, and the NGS instrument protocol needs to be adapted accordingly.
Fig. 6: A) Paired-end (PE) sequencing (which is, in practice, single-end (SE) sequencing) of the molecule provided in step i. of the method of the present invention. B) paired-end (PE) sequencing of the molecule provided in step i. of the method of the present invention.
Figure 7A-N: Exemplary method as described in Example 2. In some graphs of this figure, the molecule has been divided in two to fit it in the page, and the arrow indicates that the sequence continues.
Figure 8A-I: Exemplary method as described in Example 3.
Figure 9A-N: Exemplary method as described in Example 4.
Figure 10: Extended step with methylated cytosines vs extended without methylated cytosines comparison.
Fig. 11A-C: Method with E15 primer.
DETAILED DESCRIPTION OF THE INVENTION
Definitions
All terms as used herein, unless otherwise stated, shall be understood in their ordinary meaning as known in the art. Other more specific definitions for certain terms as used in the present application are as set forth below and are intended to apply uniformly throughout the description and claims unless an otherwise expressly set out definition provides a broader definition.
Throughout the description and claims the word "comprise" and variations of the word (e.g., "comprising", "having", "including", "containing"), are not intended to exclude other technical features, additives, components, or steps. Furthermore, the word "comprise" encompasses the case of "consisting of". Additional objects, advantages and features of the invention will become apparent to those skilled in the art upon examination of the description or may be learned by practice of the invention.
In this specification and claims, the use of the terms "a" and "an" and "the" and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context.
The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
The term "about", as used in connection with a numerical value throughout the specification and the claims, denotes an interval of accuracy that is familiar and acceptable to a person skilled in the art. For instance, the term "about" means the indicated value ± 1% of its value, or the term "about" means the indicated value ± 2% of its value, or the term "about" means the indicated value ± 5% of its value, or the term "about" means the indicated value ± 10% of its value, or the term "about" means the indicated value ± 20% of its value, or the term "about" means the indicated value ± 30% of its value; preferably the term "about" means exactly the indicated value (± 0%).
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. For example, the Concise Dictionary of Biomedicine and Molecular Biology, Juo, Pei-Show, 2nd ed., 2002, CRC Press; The Dictionary of Cell and Molecular Biology, 3rd ed., 1999, Academic Press; the Oxford Dictionary Of Biochemistry And Molecular Biology, Revised, 2000, Oxford University Press; or the "Dictionary of Biology" (Hale, W. G.; Margham, J. P.; Saunders, V. A., Collins), provide one of skill with a general dictionary of many of the terms used in this disclosure. Units, prefixes, and symbols are denoted in their "Système International d'Unités" (SI) accepted form.
Numeric ranges are inclusive of the numbers defining the range.
The headings provided herein are not limitations of the various aspects or embodiments of the disclosure. Furthermore, the present invention covers all possible combinations of particular aspects and embodiments described herein.
"Percent (%) sequence identity" with respect to the nucleic acid sequences described herein is defined as the percentage of nucleotide residues in a candidate sequence that are identical with the nucleotide residues in the reference sequence, after aligning the sequences and introducing gaps, if necessary, to achieve the maximum percent sequence identity, and not considering any conservative substitutions as part of the sequence identity. Alignment for purposes of determining percent of sequence identity can be achieved in various ways that are within the skill in the art, for example, using publicly available computer software such as BLAST. Those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximum alignment over the full-length of the sequences being compared.
Preferably, the "percentage of identity" as used herein is decided in the context of a local alignment, i.e., it is based on the alignment of regions of local similarity between nucleobase sequences, contrary to a global alignment, which aims to align two sequences across their entire span. Thus, in the context of the present invention, percentage identity is calculated preferably only based on the local alignment comparison algorithm.
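As an illustration of the definition above, the following sketch computes a percent identity from two sequences that have already been aligned (for example by a local alignment tool), with "-" marking gaps. The choice of denominator is an assumption made here: the number of aligned columns, gaps included. Other conventions exist, and the text does not fix one. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same nucleotide in both rows.

    Both inputs must come from the same alignment, so they have equal length.
    Columns containing a gap never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        a == b and a != gap for a, b in zip(aligned_a, aligned_b)
    )
    return 100.0 * matches / len(aligned_a)


# One substitution in seven aligned positions.
print(round(percent_identity("ATCGGGA", "ATTGGGA"), 1))
# One gap column in seven aligned positions.
print(round(percent_identity("ATCG-GA", "ATCGGGA"), 1))
```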
The method of the present invention
The present invention relates to methods for capturing or enriching a plurality of nucleic acid molecules which have information regarding the epigenetic modification status of one or more original nucleic acids. The molecules to be captured may comprise, and preferably comprise, in at least one of the molecules, one or more modified nucleotides or transformed modified nucleotides, and/or a copy thereof.
In a first aspect, the present invention provides a method (hereinafter "the method of the present invention"), such as a nucleic acid capturing (hybridisation capture) or enriching (target enrichment) method, comprising at least steps (i) and (ii).
Step (i): Providing a plurality of nucleic acid molecules
Step (i): Providing a plurality of nucleic acid molecules (hereinafter "the nucleic acid molecules of the present invention"), wherein each molecule of the plurality of nucleic acid molecules comprises a first region and a second region (also referred to as "a 5' region" and "a 3' region"). The base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules.
The term "plurality" refers herein to two or more nucleic acid molecules.
In the context of the present invention, the term "plurality of nucleic acid molecules" is not limited, as long as there is more than one nucleic acid molecule. The nucleic acid molecules comprised in the plurality provided in step (i) may be DNA or RNA molecules, preferably DNA molecules. They may be single-stranded (ss) molecules (such as ssDNA molecules) or double-stranded (ds) molecules (such as dsDNA molecules). Preferably, the plurality provided in step (i) comprises ss molecules, more preferably ssDNA molecules. They may be synthetic (or partially synthetic) molecules. The nucleic acid molecules of the plurality provided in step (i) each comprise two regions, a first region and a second region. The length of the regions is not particularly limited.
The plurality of nucleic acid molecules may comprise, as the first region, fragments of DNA, such as fragments of genomic DNA which have been treated with an agent (or a method or process, see below) capable of converting a nucleotide to another nucleotide which is read distinctly from the original nucleotide. The term "genomic DNA" refers to the total genetic information of an organism. It is the (biological) information of inheritance which is passed from one generation of organism to the next.
A nucleic acid may be fragmented through any suitable method including, but not limited to, mechanical stress (sonication, nebulization, cavitation, etc.), enzymatic fragmentation (enzyme digestion with restriction endonucleases, nicking endonucleases, exonucleases, etc.) and chemical fragmentation (dimethyl sulphate, hydrazine, NaCl, piperidine, acid, etc.), or may already be fragmented in the original organism (e.g., cell-free DNA). In principle, there is no restriction on the length of the nucleic acid fragments, though it is preferable to have a narrow range of lengths. The suitable size of fragments may be selected prior to step (i) of the method of the invention. The optimal length will ultimately depend on the probes' properties and suitable ratios and methods and/or the available sequencing instruments and methods and the desired percentage of read overlap, if the ultimate goal of the target enrichment is to sequence the captured molecules.
Usually, the ends of genomic fragments are processed so that the sample can enter the specific protocol of the sequencing platform.
Preferably, the genomic fragments are end-repaired after having been fragmented. The term "end-repaired", as used herein, refers to the conversion of the nucleic acid (such as DNA) fragments that contain damaged or incompatible 5'- and/or 3'-protruding ends into blunt-ended DNA containing 5'-phosphate and 3'-hydroxyl groups. The blunting of the DNA ends can be achieved by enzymes including, without limitation, T4 DNA polymerase (having 5'->3' polymerase activity that fills in 5' protruding DNA ends) and the Klenow fragment of E. coli DNA polymerase I (having 3'->5' exonuclease activity that removes 3'-overhangs). For efficient phosphorylation of DNA ends any enzyme capable of adding 5'-phosphates to ends of unphosphorylated DNA fragments can be used, including, without limitation, T4 polynucleotide kinase.
The term "dA-tailing" or "A-tailing", as used herein, refers to the addition of an A base to the 3' end of a blunt phosphorylated DNA fragment. This treatment creates compatible overhangs for subsequent ligation. This step is performed by methods well known by the person skilled in the art by using, for example, the Klenow fragment of E. coli DNA polymerase I.
The second region comprised in the nucleic acid molecules of the plurality provided in (i) may preferably be synthetic DNA, and preferably does not comprise any modified nucleotide, such as any modified cytosine, e.g., it does not comprise methylated cytosines.
Both the first and the second regions comprised in the nucleic acid molecules of the plurality provided in (i) are related in the sense that the base identities in both regions provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, as explained in detail below. Hence, both the first region and the second region of each of the molecules comprised in the plurality provided in (i) provide information regarding the base identities in one or more original nucleic acid molecules. These one or more original nucleic acid molecules, when treated with an agent or method capable of converting the base of a nucleotide to a base which is read distinctly from the original base (i.e., when converted or transformed), may correspond to the first region of the nucleic acid molecules comprised in the plurality provided in (i). In addition, both the first region and the second region of each of the molecules comprised in the plurality provided in (i) provide information regarding at least one base identity in a corresponding locus in a region of interest (ROI), preferably the base identity of a nucleotide which is susceptible of being modified.
The term "region of interest" or "ROI" refers herein to a region (e.g., at least 1 nucleotide, but preferably more, such as at least 2, 3, 4, 5, 10, 15, 20, 30, 50, 100, 160, 180, 200, 250, 300, 350, 400, 500, 1000, 1500, 3000, or more nucleotides) in a genome and/or in a nucleic acid molecule. The ROI comprises at least one nucleotide that may have an epigenetic modification, and thus is susceptible of being (epigenetically) modified (e.g., a cytosine, which is susceptible of being methylated and may thus have an epigenetic modification). The ROI may comprise more than one nucleotide susceptible of being (epigenetically) modified. Preferably, the nucleotide susceptible of being modified is a cytosine. Methylated and non-methylated cytosines are susceptible of being transformed or converted. For example, non-methylated cytosines can be converted into uracil upon bisulphite treatment, as will be explained below. Therefore, the ROI represents a region of the genome or a region of a nucleic acid molecule whose epigenetic modification status may be interrogated. An example of a ROI may be:
5'-ATCGGGA-3', wherein "C" refers to a cytosine which may be methylated (mC) or not (uC), and is thus a nucleotide susceptible of being modified.
In the context of the present invention, a "locus" (plural: loci) in a nucleic acid molecule refers to a specific, fixed position on that nucleic acid molecule. For instance, an ROI comprising 7 nucleotides has 7 loci. Consider an ROI with the following sequence ("C" refers to a cytosine which may be methylated (mC) or not (uC)):
5'-ATCGGGA-3'.
The nucleotide located in the locus at position 1 (5') is an A. The nucleotide located in the locus at position 2 is a T. The nucleotide located in the locus at position 3 is a C, and so on and so forth.
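The locus numbering used above is 1-based, which differs from the 0-based indexing of most programming languages. A minimal sketch of the lookup follows; the helper names `nucleotide_at` and `modifiable_loci` are hypothetical, and cytosine is taken as the only modifiable nucleotide, in line with the preferred case described in the text.

```python
ROI = "ATCGGGA"  # example region of interest, written 5' to 3'


def nucleotide_at(roi, locus):
    """Return the nucleotide at a 1-based locus of the region of interest."""
    if not 1 <= locus <= len(roi):
        raise IndexError(f"locus {locus} is outside the ROI (1..{len(roi)})")
    return roi[locus - 1]


def modifiable_loci(roi, base="C"):
    """List the 1-based loci occupied by a nucleotide susceptible of modification."""
    return [locus for locus, nt in enumerate(roi, start=1) if nt == base]


for locus in range(1, len(ROI) + 1):
    print(locus, nucleotide_at(ROI, locus))
print("modifiable loci:", modifiable_loci(ROI))
```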
In the context of the present invention, the term "one or more original molecules" refers to one or more nucleic acid molecules which at least partially overlap with the sequence of the ROI and which comprise at least one nucleotide at a position corresponding to a locus in the ROI, wherein the locus in the ROI is occupied by a nucleotide susceptible of being modified. The original molecule may comprise more than one nucleotide at more than one position corresponding to more than one locus in the ROI, wherein the loci in the ROI are occupied by nucleotides susceptible of being modified. The original molecules correspond to the first region of the plurality of molecules provided in step (i) before these have been transformed/converted. See Figure 1. For example, in the ROI exemplified above, 5'-ATCGGGA-3', wherein the C (cytosine) may be methylated (mC, M, m) or non-methylated (uC), the one or more original molecules could be 5'-ATuCGGGA-3', or 5'-ATmCGGGA-3'. Thus, while the "ROI" represents a theoretical region of a nucleic acid molecule or genome that comprises at least one nucleotide susceptible of being epigenetically modified, the "one or more original molecules" represent the actual existing molecules present in a given sample, at least partially overlapping with the sequence of the ROI (the loci), comprising the at least one nucleotide with or without the epigenetic modification. Of note, in the context of the present invention, a capital "M" also denotes an mC, i.e., a methylated C; and a "C" in small or capital letter also denotes a uC, i.e., a non-methylated C (see, e.g., Fig. 7-10).
The original nucleic acid molecule or molecules may be derived from an organism, such as a human or non-human animal, or from plants, bacteria, fungi, yeasts, and/or viruses. For instance, the one or more original nucleic acid molecules may be a fragment of genomic DNA (e.g., nuclear DNA, mitochondrial DNA and chloroplast DNA). In a preferred embodiment, the one or more original nucleic acid molecule(s) is/are a fragment of genomic DNA. The genomic DNA comprises the DNA of the nucleus (also referred to as chromosomal DNA, including cell- free DNA (cfDNA)) but also the DNA of the plastids (e.g., chloroplasts) and other cellular organelles (e.g., mitochondria, etc.). The original nucleic acid molecule may be plasmid DNA or fragments of single stranded nucleic acid molecules (e.g., DNA, cDNA, mRNA). The original nucleic acid molecule or molecules may be single-stranded or double-stranded DNA. The one or more original nucleic acid molecule or molecules may also be RNA, such as mRNA.
In another embodiment, the one or more original nucleic acid molecules may be a synthetic nucleic acid molecule or molecules, such as synthetic DNA or synthetic RNA. The synthetic nucleic acid may be double-stranded or single-stranded. In a preferred embodiment, the one or more original nucleic acid molecules are fragments of genomic DNA.
In the context of the present invention, as explained above, the ROI comprises at least one nucleotide susceptible of conversion at a certain locus (position). For instance, the ROI may comprise a C in the locus at position 3:
5'-ATCGGGA-3'.
Since the nucleotide in this locus may be modified (e.g., methylated, such as a mC) or non-modified (e.g., non-methylated, such as a uC), the original nucleic acid molecules may have, at least at that one locus (e.g., in the locus at position 3 in the current example), either a modified nucleotide (e.g., mC) or the non-modified version of the same nucleotide (e.g., uC):
Molecule 1: 5'-ATuCGGGA-3'; or
Molecule 2: 5'-ATmCGGGA-3'.
From the one or more original nucleic acid molecules, a plurality of nucleic acid molecules provided in (i) is obtained, for instance as represented in steps A and B:
Step A (synthesis of the precursor of the second region):
Molecule 1: 5'-ATuCGGGA - TCCCGAT-3'
Molecule 2: 5'-ATmCGGGA - TCCCGAT-3'
Step B (conversion of the molecules provided in Step A):
Molecule 1: 5'-ATTGGGA - TTTTGAT-3'
Molecule 2: 5'-ATCGGGA - TTTTGAT-3'
The molecules in Step B represent an example of the plurality of the nucleic acid molecules provided in step (i) of the method of the present invention.
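Steps A and B can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the claimed method: it uses the notation introduced above ("M" for mC, "C" for uC), it assumes that the second region is synthesised as the reverse complement of the first region with unmethylated cytosines only, and it assumes complete conversion of every uC (read as T after copying). The function names are illustrative.

```python
# Base pairing used when the precursor of the second region is synthesised.
# 'M' (methylated C) pairs with G, exactly like an unmethylated C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def reverse_complement(seq):
    """Step A: the newly synthesised strand carries only unmethylated C ('C')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def convert(seq):
    """Step B: uC ('C') is read as T; mC ('M') is protected and read as C."""
    # Order matters: convert the unmethylated C first, then rename M to C.
    return seq.replace("C", "T").replace("M", "C")


def build_molecule(original):
    """Return (first region, second region) after Steps A and B."""
    first = original
    second = reverse_complement(original)
    return convert(first), convert(second)


for label, original in ((1, "ATCGGGA"), (2, "ATMGGGA")):
    first, second = build_molecule(original)
    print(f"Molecule {label}: 5'-{first} - {second}-3'")
```

Under these assumptions the two molecules differ only at position 3 of the first region (T versus C), while their second regions are identical.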
As explained above, said plurality of nucleic acid molecules comprises a first and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules.
The sequence of the second region in the at least two molecules comprised in the plurality provided in step (i) may be identical or substantially identical at least at a position corresponding to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified. In the above example, the position corresponding to that locus in the second region would be position 5 (namely, the fifth nucleotide from the 5' start of the second region). The corresponding position in the first region would be position 3 (namely, the third nucleotide from the 5' start of the first region). The corresponding locus in the original molecule and in the ROI is also at position 3 (namely, the third nucleotide from the 5' start of the ROI).
Also, in this example, there are at least two possible (a plurality) nucleic acid molecules, each comprising a first and a second region. The first and the second regions in each of the molecules comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified (e.g., locus 3 in the ROI, as described above). The first region comprises a T or a C (molecules 1 and 2, respectively) at a position corresponding to locus 3 in the ROI and the second region comprises, in both molecules, a G at the position corresponding to locus 3 in the ROI.
In addition, the two molecules may have (and preferably have) a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified. In this case, the sequences of the first regions of both molecules may be (and are preferably) different, because the nucleotide at the position corresponding to locus 3 in the ROI (occupied by a nucleotide susceptible of being modified, C, in the ROI) may be (and is preferably) different in each of the molecules (e.g., T in the first and C in the second one).
In addition, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the two nucleic acid molecules. In this case, the nucleotide at a position corresponding to locus 3 in the ROI is occupied, in the second region of both molecules, by the same nucleotide (G).
In particular, the first region of at least one of the molecules of the plurality of nucleic acid molecules may comprise, at least at one certain position (corresponding to a certain locus in the original sequence), a transformed unmodified nucleotide (in this case, the first region of molecule 1 comprises, at position (locus) 3, a T, i.e., a transformed uC (U) or a copy thereof (T)). In addition, the first region of the other molecule (molecule 2) may not comprise, at the corresponding position (locus), a transformed nucleotide, but a modified one (it comprises, at position 3, a mC or a copy thereof, uC). The nucleotide sequence of the second region is identical or at least substantially identical in each molecule (1 and 2). They are certainly identical at least at a position corresponding to a locus in the original molecule which is occupied by a nucleotide susceptible of being modified (in this case, position 5 of both second regions, which corresponds to the locus at position 3 in the original molecule, as discussed above, is occupied by a G).
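The correspondence between locus 3 and position 5 of the second region follows from the reverse-complement arrangement of this example: for a region of length n, locus p maps to position n + 1 - p. The short Python sketch below checks this for molecules 1 and 2 of Step B. The helper name is hypothetical, and the mapping assumes that both regions have the same length and cover the whole ROI.

```python
def second_region_position(locus, region_length):
    """1-based position in a reverse-complement second region for a 1-based locus."""
    return region_length + 1 - locus


# (first region, second region) of molecules 1 and 2 after Step B
molecules = {1: ("ATTGGGA", "TTTTGAT"), 2: ("ATCGGGA", "TTTTGAT")}

locus = 3
position = second_region_position(locus, region_length=7)

for label, (first, second) in molecules.items():
    print(
        f"Molecule {label}: first region locus {locus} = {first[locus - 1]}, "
        f"second region position {position} = {second[position - 1]}"
    )
```

The first regions differ at locus 3 (T in molecule 1, C in molecule 2), whereas both second regions carry a G at position 5.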
In one embodiment, there is more than one original nucleic acid molecule. For instance, there may be at least two original nucleic acid molecules, but there may be many. The original nucleic acid molecules, if more than one, may all have the same sequence, or at least share at least one corresponding locus in the ROI, and may differ only in their epigenetic modifications, such as in their methylation status. Preferably, the original nucleic acid molecule(s) is/are untreated nucleic acid molecule(s), i.e., they have not been converted or transformed, and have preferably not been treated with agents/methods capable of converting the base of a nucleotide to a base which is read distinctly from the original base.
In the context of the present invention, the term "epigenetic modification" refers to any chemical modification that may be present in one or more nucleotides, but which does not change the nucleotide sequence (does not change the genetic code sequence of the nucleotide). Epigenetic modifications, or "tags," such as DNA methylation, alter DNA accessibility and chromatin structure, thereby regulating patterns of gene expression. In the context of the present invention, an epigenetic modification may thus be present in any nucleotide, i.e., A, T, U, G and/or C, in DNA or RNA sequences. In the context of the present invention, "epigenetic modification" can also be used interchangeably with the term "chemical modification" of a nucleotide. A "modified nucleotide" or a "chemically modified nucleotide" or an "epigenetically modified nucleotide" thus refers to a nucleotide that has been chemically modified with an epigenetic modification or a "tag". Hence, a "modified nucleotide", in the context of the present invention, is a nucleotide that differs in its structure from primary nucleotides (Guanine, Cytosine, Thymine, Uracil, or Adenine), e.g., because it comprises an epigenetically modified base. A modified nucleotide may thus be a nucleotide that carries "epigenetic information", i.e., a nucleotide that carries an "epigenetic modification", as described above. Preferably, an epigenetically modified base is a methylated base, a hydroxymethylated base, a formylated base, an acetylated base or a carboxylic acid-containing base. Preferably, the epigenetic modification is a methylation, and the modified base is a modified (e.g., methylated) cytosine.
An epigenetic modification may refer to cytosine methylation. In this case, the nucleotide sequence (C) does not change, but the nucleotide cytosine is chemically modified by the incorporation of a methyl group. Hence, the cytosine is chemically modified because it is methylated. In differentiated mammalian cells, the principal epigenetic tag found in DNA is that of covalent attachment of a methyl group to the C5 position of cytosine residues in CpG dinucleotide sequences (referred to as CpG), but cytosines other than those in CpG can be methylated as well (see, e.g., Handy DE. et al., "Epigenetic modifications: basic mechanisms and role in cardiovascular disease", Circulation, 2011;123(19):2145-56). Together with histone modifications, DNA methylation modulates the chromatin structure and affects cognate gene expression by maintaining various expression patterns across cell types. The presence of DNA methylation in the promoter region is directly connected to repression of transcription. In contrast, DNA methylation in the gene body shows positive correlation with gene expression. Of note, epigenetic modifications may also be associated with the presence of disease. For instance, 5mC oxidation derivatives could be used as markers in cancer diagnostics and prognostics (see, e.g., Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85).
Chemical or epigenetic modifications which take place in nucleotides are, for example, 5-methylcytosine (5mC) and its oxidative derivatives (e.g., 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC)) and N6-methyladenine (6mA) in DNA; N6-methyladenosine (m6A), pseudouridine (psi, Ψ), and 5-methylcytosine (m5C) in messenger RNA and long noncoding RNA, or N4-methylcytosine (4mC or m4dC) in bacterial genomes, see Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85. As stated above, in addition to genomic DNA, RNA molecules are also decorated with similar modifications. For example, N6-methyladenosine (m6A) is also present in mRNA, see, e.g., Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85. In addition, pseudouridine is a ubiquitous constituent of structural RNAs (transfer, ribosomal, small nuclear, and small nucleolar), see, e.g., Charette M and Gray MW, "Pseudouridine in RNA: what, where, how, and why", IUBMB Life, 2000;49(5):341-51. Cytosine can also be methylated in RNA in order to form 5mC. tRNA modifications are known to affect translation and affect different physiological processes. For example, in S. cerevisiae, there are 74 genes involved in the installation of ~25 chemically distinct modifications presented at 36 positions in yeast cytoplasmic tRNAs, see Chen K., Zhao BS. and He C., "Nucleic acid modifications in regulation of gene expression", Cell Chem Biol., 2016;23(1):74-85.
The term "modified cytosines", as used herein, refers to cytosine bases that are modified by the replacement or addition of one or more atoms or chemical groups, such as a methyl group. The expression "base that is detectably dissimilar to cytosine in terms of hybridization properties", as used herein, refers to a base that cannot hybridize (hydrogen bridges will not be present) with a guanine in the complementary strand, such as uracil.
Preferably, the conversion of (non-methylated) cytosine to uracil in the paired DNA molecules is performed with a deamination agent such as bisulfite, but any other agent or enzymatic treatment (e.g., TET oxidation of modified cytosines, followed by APOBEC deamination of non-modified cytosines) may be used.
Modified cytosines such as methylated cytosines are resistant to the treatment with reagents such as bisulfite and A3A because the cytosines remain unchanged after the treatment with these reagents (e.g., they remain as cytosines) or because, upon treatment or after being copied (e.g., amplified by PCR), they are converted into a base that is complementary to guanine and is read as (unmodified, such as unmethylated) cytosine in polymerase-based amplification and sequencing (e.g., 5-hydroxymethylcytosine that is converted to cytosine-5-methylsulfonate, or 5mC/5fC which is converted to 5hmC/5caC, respectively, after treatment with TET methylcytosine dioxygenase 2). Preferably, the modified cytosine is 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) or 5-formylcytosine (5fC). Thus, in a preferred embodiment, the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position (locus), a 5-methylcytosine (5mC), a 5-hydroxymethylcytosine (5hmC) or a 5-formylcytosine (5fC). In addition, in a preferred embodiment, the second region of the same molecule does not comprise any one of 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) or 5-formylcytosine (5fC).
Preferably, the base detectably dissimilar to cytosine is uracil, which is then complementary to A and copied as T.
In the context of the present invention, a "nucleotide which is susceptible of being modified" refers to any nucleotide which may carry an epigenetic modification, as described above. For instance, a cytosine is a preferred nucleotide susceptible of being modified. The cytosine may be methylated (mC) or not (uC), i.e., the cytosine may carry at least one epigenetic modification (e.g., methylation). Preferably, the nucleotide which is susceptible of being modified is cytosine.
The expression "base that is detectably dissimilar to a certain nucleotide in terms of hybridization properties", as used herein, refers to a base that cannot hybridize (hydrogen bridges will not be present) with a base which would be otherwise complementary to it in the complementary strand (an example would be adenine, if the original base was guanine; another example would be uracil if the original base was cytosine). Preferably, the base detectably dissimilar to cytosine is thymine or uracil, more preferably is uracil. The reagent used in this step could be a reagent capable of converting non-methylated cytosines to a base that is detectably dissimilar to cytosine in terms of hybridization properties but incapable of acting on methylated cytosines. As discussed above, examples of such agents are, without limitation, deamination agents, bisulfite, metabisulfite or cytidine-deaminases such as activation-induced cytidine deaminase (AID). In a preferred embodiment the reagent is bisulfite.
The base identities in the first and second regions of the nucleic acid molecules comprised in the plurality provided in step (i) are characterized in that they provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, as mentioned above. Hence, the first and the second regions of the nucleic acid molecules of the present invention are related in the sense that they both have information regarding the base identity in the corresponding loci in an original nucleic acid molecule.
The information provided by the sequence in the first region is independent from the information provided by the sequence in the second region. Each of the nucleic acid molecules of the present invention thus comprises two sources of information regarding the base identity in the corresponding positions (loci) of the original nucleic acid molecule. For example, the modification of a nucleotide (e.g., cytosine residue) at a certain position (locus) in the one or more original nucleic acid molecules may be ascertained from the information provided by the identity of the bases at the corresponding positions (loci) in the first and second regions comprised in the nucleic acid molecules of the invention. With this information, the epigenetic modification status (such as the methylation status) in a ROI within a sample can be ascertained.
For example, in the molecules of the present invention comprised in the plurality provided in step (i), the first region may provide information regarding the base identities in the corresponding loci of an original nucleic acid molecule. Additionally, the second region may provide, independently, information regarding the base identities in the same loci of the same original nucleic acid molecule.
Hence, in the nucleic acid molecules of the present invention, the base identities in one of the first or second regions provide information of the base identities of the corresponding loci in an original nucleic acid molecule, and the base identities in the other region (second or first, respectively) provide information on the base identities of the same loci in the same original nucleic acid molecule, such that there is enriched information coming from the first and second regions of the molecules of the present invention for each locus of the original nucleic acid molecule.
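How the two regions can be combined per locus may be illustrated with a small lookup table. The sketch below is an assumption-laden illustration rather than the decoding procedure of the invention: it assumes the reverse-complement arrangement of the example above, complete conversion of every unmethylated C, a second region free of modified cytosines, and sequences read as DNA (U already copied as T). The table `CALLS` and the function `decode` are hypothetical names.

```python
# (base in first region, base at the matching position of the second region)
# -> base inferred for the original molecule at that locus.
CALLS = {
    ("A", "T"): "A",   # A is untouched; its partner T is untouched
    ("G", "T"): "G",   # G is untouched; its partner C was converted to T
    ("T", "A"): "T",   # a genuine T: the partner is A
    ("T", "G"): "uC",  # partner G shows the T is a converted unmethylated C
    ("C", "G"): "mC",  # a C that survived conversion was methylated
}


def decode(first, second):
    """Infer the original base at each locus from both regions."""
    n = len(first)
    calls = []
    for i, base in enumerate(first):
        partner = second[n - 1 - i]  # reverse-complement position mapping
        calls.append(CALLS.get((base, partner), "?"))  # '?' marks a conflict
    return calls


print(decode("ATTGGGA", "TTTTGAT"))  # molecule 1 of Step B
print(decode("ATCGGGA", "TTTTGAT"))  # molecule 2 of Step B
```

In this sketch a T in the first region is ambiguous on its own (genuine T or converted uC) and is resolved by the second region, while a T in the second region (partner of A or of G) is resolved by the first region.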
The sequence of the first region of the nucleic acid molecules of the present invention corresponds to the sequence of the original nucleic acid molecule after this has been treated with an agent (or method or process, as described herein) capable of converting the base of a nucleotide (e.g., non-methylated cytosine(s)) to a base which is read distinctly from the original base (e.g., cytosine) (e.g., bisulfite or A3A), i.e., the sequence of the first region of the nucleic acid molecule of the present invention corresponds to the converted original nucleic acid molecule. The first region thus provides information on the base identity in the original nucleic acid molecule. Conversely, in one embodiment, the sequence of the second region of the nucleic acid molecule of the present invention may correspond to the converted sequence of the reverse complementary strand of the first region prior to the conversion (e.g., prior to the bisulfite treatment, see below for further details). Hence, the second region also provides independent information on the base identity in the corresponding loci in the original nucleic acid molecule.
In the context of the present invention, the term "agent capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide" refers to any agent (e.g., a reagent or enzyme) or method or process which is able to convert or transform (i.e., to alter the chemical structure of) a certain nucleotide, so that the converted or transformed nucleotide is read (recognized) by the enzyme responsible for copying the nucleic acid molecule (e.g., polymerase) as another nucleotide which is different from the original nucleotide. Hence, once the agent has transformed/converted a modified/non-modified nucleotide, the enzyme responsible for copying the nucleic acid molecule (e.g., polymerase) will introduce, at that position, a nucleotide which is different from the nucleotide which the enzyme would have introduced if the original nucleotide had not been converted. For example, an agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base may be bisulfite. Bisulfite is able to convert uC to U, by deaminating the C. A U is read by the polymerase differently from a C, i.e., the polymerase would introduce an A when reading U, instead of introducing a G as it would when reading C. There are other agents capable of converting or transforming a nucleotide to another nucleotide which is read distinctly from the original nucleotide. Examples of other agents are, without limitation, deamination agents, metabisulfite or cytidine-deaminases such as activation-induced cytidine deaminase (AID). For instance, the enzyme beta-glycosyltransferase is able to glycosylate 5hmCs, and the enzyme APOBEC3A cytosine deaminase (A3A) is able to deaminate uCs to Us. The enzyme ten-eleven translocation (TET) methylcytosine dioxygenase 2 is capable of oxidizing 5mC to 5hmC, 5fC or 5caC.
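The meaning of "read distinctly" can be made concrete by simulating how a polymerase copies a converted template. The sketch below is illustrative only; it relies on standard base pairing (U templates A, C templates G), and the function name `copy_strand` is an assumption.

```python
# Nucleotide incorporated by the polymerase opposite each template base.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def copy_strand(template):
    """Return the newly synthesised strand, written 5'->3'."""
    return "".join(PAIRING[base] for base in reversed(template))


converted = "ATUGGGA"  # 5'-ATuCGGGA-3' after deamination of uC to U
protected = "ATCGGGA"  # same locus read as C (e.g., a mC that resisted conversion)

first_copy_converted = copy_strand(converted)   # A is introduced opposite U
first_copy_protected = copy_strand(protected)   # G is introduced opposite C
second_copy_converted = copy_strand(first_copy_converted)  # U now appears as T

print(first_copy_converted, first_copy_protected, second_copy_converted)
```

After two rounds of copying, the converted uC is fixed as a T, which is why the conversion of non-methylated C to U is also read out as a C-to-T change.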
For instance, an agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base may be the AID/APOBEC family of enzymes, see, e.g., Berney, M. and McGouran, J.F., "Methods for detection of cytosine and thymine modifications in DNA", Nat Rev Chem, 2018, 2, 332-348. The AID/APOBEC family of enzymes can deaminate mC to T (Nabel CS. et al., "AID/APOBEC deaminases disfavor modified cytosines implicated in DNA demethylation", Nat Chem Biol., 2012, 8(9):751-8). In one embodiment, the agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base is bisulfite. As explained above, sodium bisulfite (commonly known as "bisulfite") selectively changes unmethylated cytosines into uracil through deamination, while leaving methylated cytosines (both 5-methylcytosine and 5-hydroxymethylcytosine) unchanged. As used herein, bisulfite ion has its accustomed meaning of HSO3-. Typically, bisulfite is used as an aqueous solution of a bisulfite salt, for example sodium bisulfite, which has the formula NaHSO3, or magnesium bisulfite, which has the formula Mg(HSO3)2. Suitable counter-ions for the bisulfite compound may be monovalent or divalent. Examples of monovalent cations include, without limitation, sodium, lithium, potassium, ammonium, and tetraalkylammonium. Suitable divalent cations include, without limitation, magnesium, manganese, and calcium. Treatment of DNA with bisulfite converts unmethylated cytosine bases to uracil, but leaves 5-methylcytosine bases unaffected. Said conversion is performed by standard procedures (Frommer et al. 1992, Proc Natl Acad Sci USA, 89:1827-31; Olek, 1996, Nucleic Acid Res. 24:5064-6; EP 1394172). Methods to obtain the sample include those used for reduced representation bisulfite sequencing (RRBS).
In another embodiment, the agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base is A3A. In another embodiment, the agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base is the enzyme beta-glycosyltransferase.
In the context of the present invention, a "converted nucleotide" or a "transformed nucleotide" refers to a nucleotide which has been put in contact with an agent or method capable of converting the base of a nucleotide to a base which is read distinctly from the original base, under the conditions suitable for the conversion to occur. If the nucleotide is susceptible of conversion, it will be converted by the action of the agent, thus leading to a "converted or transformed nucleotide". In addition, the term "converted nucleic acid" refers to a nucleic acid which has been put in contact with an agent or method capable of converting the base of a nucleotide into another one which is read distinctly from the original base, under the conditions suitable for the conversion to occur, as explained in detail above. If the nucleic acid molecule comprises one or more nucleotides susceptible of conversion, these will be converted by the action of the agent/method, thus leading to a "converted or transformed nucleic acid". In the context of the present invention, the term "convert a nucleotide" refers to the chemical modification of the nucleotide caused by the agent (or method or process) capable of converting (or transforming) a nucleotide to another nucleotide, as described above, so that it is read distinctly from the original nucleotide. For instance, the conversion of C to U takes place by the chemical modification of the structure of C, which is deaminated to give rise to U. In the context of the present invention, the one or more original nucleic acid molecules comprise one or more nucleotides susceptible of conversion, as explained above. A nucleotide "susceptible of conversion" or a "nucleotide susceptible of being converted", or a "nucleotide susceptible of transformation", or a "nucleotide susceptible of being transformed", or the like, in the context of the present invention, refers to any of the primary nucleotides, G, C, T, A or U, modified or not, which may be transformed or converted by an agent or method or process, as described above, capable of converting the base of a nucleotide to a base which is read distinctly from the original base, such as bisulfite or A3A. For instance, uC is a nucleotide which is susceptible of conversion, because it can be deaminated and converted to U, which is a base which is read distinctly from the original uC. A transformed molecule or sequence, in the context of the present invention, refers to a sequence or a molecule comprising nucleotides that have been converted.
The sequence of the first region may not comprise nucleotides at all positions corresponding to all loci in the ROI. But the sequence of the first region comprises nucleotides in at least one, preferably in at least two, more preferably in at least three, or in at least 4, 5, 10, 15, 20, 30, 50, 100, 160, 200, 300, 500, 1000 or more corresponding loci in the ROI.
The skilled person is able to assign the information (base identity) for each locus in the original nucleic acid molecule and/or in the ROI, combining the information given by both, the first and second regions of the nucleic acid molecule of the present invention.
In the following example, the ROI is 5' ATTTGGC 3' and one original nucleic acid molecule is 5' ATTTGGuC 3':
5'-ATTTGGU - GUUAAAT-3'
The second region (GUUAAAT) has a sequence which, before conversion (GuCuCAAAT), was complementary to the reverse sequence of the first region (ATTTGGC), i.e., before both the second and the first regions were treated with an agent capable of converting the base of a nucleotide to another base which is read distinctly from the original base, e.g., bisulfite or A3A, and, thus, were converted (e.g., non-methylated C to U, also referred to as "uC-to-T conversion"). In this specific case, if a methylated C (mC) was present in the original molecule at a certain locus, the first region will have a mC (or a copy thereof, uC) in the corresponding locus, and the second region will have a G in the corresponding locus. Any other modification will work in the same way considering the original and converted nucleotide sequences and the relationship between the first and second regions. In this case, both regions provide information regarding the base identity in the corresponding loci in an original nucleic acid molecule (5' ATTTGGuC 3').
5'-ATTTGGU - TAAAUUG-3'
The second region (TAAAUUG) has a sequence which, before conversion, comprised complementary bases at the corresponding loci in the first region ("same complementary" sequence) in tandem, i.e., the second region (TAAAUUG) derives from a sequence which is complementary to the sequence of the first region before it has been converted (ATTTGGuC). In this case, both regions provide information regarding the base identity in the corresponding loci in an original nucleic acid molecule (5' ATTTGGuC 3').
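The two arrangements of the second region in this example can be checked computationally. The following Python sketch is a minimal illustration under stated assumptions: "C" denotes uC and "M" denotes mC, the second region is synthesised without modified cytosines, conversion is complete, and converted bases are written as U (before copying). The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def complement(seq):
    """Complement in the same order (no reversal); new cytosines are unmethylated."""
    return "".join(COMPLEMENT[base] for base in seq)


def deaminate(seq):
    """Convert every unmethylated C to U; a methylated C ('M') stays a C."""
    return seq.replace("C", "U").replace("M", "C")


original = "ATTTGGC"  # 5'-ATTTGGuC-3'

first_region = deaminate(original)
# Arrangement 1: second region derived from the reverse complement.
reverse_complement_region = deaminate(complement(original)[::-1])
# Arrangement 2: second region derived from the complement in the same order.
same_order_region = deaminate(complement(original))

print(f"5'-{first_region} - {reverse_complement_region}-3'")
print(f"5'-{first_region} - {same_order_region}-3'")
```

In both arrangements the G that was paired with the original cytosine is retained in the second region, so the identity of the original base at that locus can still be recovered after the first region has been converted.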
Hence, as explained above, the nucleic acid molecules of the present invention provide two sources of independent information regarding the true base identity in a certain locus of an original nucleic acid molecule, and conversely, two sources of independent information regarding the true base identity in at least one certain locus in a ROI, wherein the at least one locus in the ROI is occupied by a nucleotide susceptible of being modified.
The first and second regions of the molecules of the present invention are located, within the molecule, next or close to each other, but they are not overlapping. The first and the second regions may be contiguous to each other, i.e., they may be directly linked to each other. Preferably, the first region of a nucleic acid molecule is located towards the 5' end of the molecule or, in other words, the second region of a molecule is located towards the 3' end of the same molecule. Preferably, the first region is comprised in the 5' region of the nucleic acid molecule of the present invention, and the second region is comprised in the 3' region of the nucleic acid molecule of the present invention. Also preferably, the first region of the nucleic acid molecule is located closer to the 5' end of the molecule than the second region (but may not be exactly at the 5' end of the molecule, e.g., there may be other sequences at the 5' end of the molecule which do not belong to the first region), and the second region is located closer to the 3' end of the molecule than the first region (but may not be exactly at the 3' end of the molecule, e.g., there may be other sequences at the 3' end of the molecule which do not belong to the second region).
The term "ends", as used herein, refers to the regions of sequence at (or proximal to) either end of a nucleic acid sequence. The expression "5' region", as used in the present invention, refers to a region of a nucleotide strand which is located towards the 5' end of said strand. The 5' region of a strand may include the 5' end of said strand. The term "5' end", as used herein, designates the end of a nucleotide strand that has the fifth carbon in the sugar-ring of the deoxyribose at its terminus. The expression "3' region", as used in the present invention, refers to a region of a nucleotide strand which is located towards the 3' end of said strand. The 3' region of a strand may include the 3' end of said strand. The term "3' end", as used herein, designates the end of a nucleotide strand that has the third carbon in the sugar-ring of the deoxyribose at its terminus.
At least two of the molecules of the plurality of nucleic acid molecules may have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified. For instance, one of the molecules (the "first" molecule) may have a T in the first region at a certain position and the other molecule (the "second" molecule) may have, in its first region, a C, in a position corresponding to the same locus as the T in the first region of the first molecule. Preferably, at least two of the molecules of the plurality of nucleic acid molecules have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified.
In one embodiment, the first region of at least one of the molecules of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention comprises, at least at a certain position (locus), a modified nucleotide or a copy thereof, wherein the first region of at least another nucleic acid molecule of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention does not comprise, at least at the same position (same locus), a modified nucleotide or a copy thereof; and wherein the at least a modified nucleotide, or copies thereof, are not present in the second region in any of the nucleic acid molecules comprised in the plurality provided in step (i). That is, in one embodiment, in the plurality of nucleic acid molecules provided in step (i), there are at least two molecules that differ at least in that, while one molecule comprises, at a certain position (locus) in the first region, a modified nucleotide or a copy thereof, the other nucleic acid molecule does not comprise, in the same position (locus) in the first region, a modified nucleotide or a copy thereof.
In another embodiment, the first region of at least one of the molecules of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention comprises, at least at a certain position (locus), a transformed (converted) modified nucleotide, or a copy thereof, wherein the first region of at least another nucleic acid molecule of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention does not comprise, at least at the same position (same locus), a transformed (converted) modified nucleotide, or a copy thereof. That is, in another embodiment, in the plurality of nucleic acid molecules provided in step (i), there are at least two molecules that differ at least in that, while one molecule comprises, at a certain position (locus) in the first region, a transformed modified nucleotide, or a copy thereof, the other nucleic acid molecule does not comprise, in the same position (locus) in the first region, a transformed modified nucleotide, or a copy thereof.
In another embodiment, in the plurality of nucleic acid molecules provided in step (i), there are at least two molecules that differ at least in that, while one molecule comprises, at a certain position (locus) in the first region, a transformed non-modified nucleotide, or a copy thereof, the other nucleic molecule does not comprise, in the same position (locus) in the first region, the same transformed non-modified nucleotide, or a copy thereof (e.g., one of the molecules comprises, at a certain position (locus) in the first region, a U, or a T, and the other
nucleic molecule does not comprise, in the same position (locus) in the first region, the same nucleotide or copy thereof, i.e., does not comprise a U or a T).
The term "transformed modified nucleotide" or "converted modified nucleotide", as used herein, refers to a nucleotide that was originally an (epigenetically) modified nucleotide but that has been treated with an agent (or method or process, as described herein) capable of converting a nucleotide or, more specifically, the base of a nucleotide, into another base (or into another nucleotide) which is read distinctly from the original base (nucleotide), and that has thereby been converted (or transformed), as defined above. As a consequence of the transformation (conversion), the epigenetic modification has been removed and the epigenetic modification information has been transformed (or converted). Hence, with the transformation or conversion step, the epigenetic information is preserved or fixed in the successive copies of the nucleic acid molecules that may be performed. Hence, a "transformed modified nucleotide" or "converted modified nucleotide" is a nucleotide that was originally modified, and that has been treated with an agent or method capable of converting the base of said modified nucleotide into another one, which is read distinctly from the original base. As a consequence of the treatment, the nucleotide has been converted. As a consequence of the conversion, the originally modified nucleotide has been transformed, leading to the "transformed modified nucleotide" or "converted modified nucleotide". For instance, a transformed modified nucleotide may be a T derived from a methylated cytosine that has been treated with an agent capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, and converted, so that the original mC has been converted to T as a consequence of the treatment with the AID/APOBEC family of enzymes, see, e.g., Berney, M. and McGouran, J.F., "Methods for detection of cytosine and thymine modifications in DNA", Nat Rev Chem, 2018, 2, 332-348, or Nabel CS. et al., "AID/APOBEC deaminases disfavor modified cytosines implicated in DNA demethylation", Nat Chem Biol., 2012, 8(9):751-8.
In a preferred embodiment, the first region of at least one of the molecules comprised in the plurality of nucleic acid molecules provided in (i) comprises, at least at one certain position (certain locus), a modified nucleotide or a copy thereof, and, at least one other molecule
comprised in the plurality of nucleic acid molecules provided in (i) does not comprise, at least at the same position (same locus) in its first region, a modified nucleotide or a copy thereof.
In order to illustrate this embodiment, the following example in Figure 1 is provided. In Figure 1B, eight original nucleic acid molecules are provided, with the same sequence (i.e., the same bases at the same or corresponding loci), wherein each of them comprises a different methylation status ("C" with single underline represents a methylated cytosine, and "C" with double underline represents a non-methylated cytosine). See the first region (towards the 5' region) of the molecules of Figure 1B. The region of interest is, in this case: A C C G T C G A C G, wherein "C" without underline represents a cytosine that may be methylated (mC) or not (uC):
Molecule number
For these original DNA molecules, a plurality of nucleic acid molecules comprising two regions and transformed/converted is provided (see Figures 1B and 1A).
In the molecules of Figure 1A, the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at a certain position (certain locus), a modified nucleotide (e.g., the third nucleotide, methylated cytosine "C", in molecule number 2 in Figure 1A) or a copy thereof (e.g., the third nucleotide, methylated or unmethylated cytosine "C" depending on the nucleotides given to the polymerase to synthesize the copies, in molecule number 2 in Figure 1A), and the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at the same position (same locus), the same modified nucleotide (which would be a mC) or copy thereof (uC), but it comprises a transformed non-modified nucleotide (e.g., uracil "U") or copy thereof (e.g., the third nucleotide, thymine "T" in molecule number 1 in Figure 1A, which was originally a non-modified C, see Figure 1B, and whose complementary base would be an A).
As used herein, the terms "copy of a nucleotide", "copy of a modified nucleotide" or "the copy of the transformed modified nucleotide" refer to the nucleotide obtained after copying or amplifying (e.g., via PCR) a given nucleic acid molecule comprising a nucleotide, possibly after conversion or transformation.
Hence, in one embodiment in the molecule of the present invention: a) - the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof; and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof or a transformed modified nucleotide or a copy thereof; or b) - the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules a transformed modified nucleotide, or a copy thereof; and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecule, a transformed modified
nucleotide, or a copy thereof; or a modified nucleotide or a copy thereof.
Hence, in this particular embodiment, in a) above, the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof, and the first region of at least one other molecule may comprise, at the corresponding locus, a transformed non-modified nucleotide, or a copy thereof.
Hence, in this particular embodiment, in b) above, the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules a transformed modified nucleotide, or a copy thereof; and the first region of at least one other molecule of the plurality of nucleic acid molecules may comprise, at the corresponding locus, a non-modified nucleotide, or a copy thereof.
Hence, in the plurality of molecules of the present invention, there are at least two molecules that may have (and preferably have), at least at one and the same locus in the first region, a different sequence (a different nucleotide). This potential difference in sequence at at least this specific locus in the first region is a consequence of the potential differences in epigenetic status at the same locus in the original molecules. In other words, this potential difference in sequence is the way the epigenetic information potentially present in the original molecules is fixed or preserved in the molecules of the present invention (and their copies, if any).
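As a purely illustrative sketch (not part of the described method), the following Python snippet shows how a difference in methylation status at one locus becomes a difference in sequence after conversion and copying. It assumes a bisulfite-like conversion in which a non-methylated cytosine is read as T in the copies while a methylated cytosine is still read as C; the function name `convert_and_copy` and the use of a set of 1-based methylated loci are hypothetical conventions chosen for the example.

```python
ROI = "ACCGTCGACG"  # region of interest used in the example of Figure 1


def convert_and_copy(sequence, methylated_loci):
    """Return the sequence as read after conversion and copying.

    Assumption (illustrative only): bisulfite-like chemistry, i.e. a
    non-methylated C (uC) is converted to U and read as T in the copies,
    whereas a methylated C (mC) is protected and read as C.
    Loci are 1-based positions in `sequence`.
    """
    read = []
    for locus, base in enumerate(sequence, start=1):
        if base == "C" and locus not in methylated_loci:
            read.append("T")   # uC -> U -> T in the copy
        else:
            read.append(base)  # mC is read as C; A, G and T are unchanged
    return "".join(read)


# Two original molecules with the same sequence but a different
# methylation status at locus 3 only.
molecule_a = convert_and_copy(ROI, methylated_loci={3})    # locus 3 is mC
molecule_b = convert_and_copy(ROI, methylated_loci=set())  # locus 3 is uC

# Loci (1-based) at which the two first regions differ after conversion.
differing_loci = [
    locus
    for locus, (x, y) in enumerate(zip(molecule_a, molecule_b), start=1)
    if x != y
]

print(molecule_a)
print(molecule_b)
print(differing_loci)
```

In this sketch the two converted first regions differ only at the locus whose methylation status differed in the original molecules, which is how the epigenetic information is carried as sequence information.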
In the following example, the one or more original molecules are at least partially overlapping a region of interest (ROI) in the genomic DNA. The first regions in the nucleic acid molecules provided in (i) are fragments of genomic DNA (original molecules) which have been treated with an agent capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide, and they all have nucleotides at at least one position corresponding to a locus in the ROI (e.g., they overlap with at least part of the ROI). As explained above, prior to sequencing, the genomic DNA molecules are generally
randomly fragmented. Hence, if the first regions of the molecules are, for instance, transformed fragments of genomic DNA, the larger the genome, the lower the odds that more than one fragment from independent original molecules share the exact same region (same start and end on the genome reference sequence). But the first regions of some molecules will comprise at least part of the sequence of the region of interest. The first regions of the nucleic acid molecules comprised in the plurality provided in (i), comprising a first and a second region, will not all be identical. The nucleic acid sequences of the second region should be identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in the region of interest, wherein the locus is occupied by a nucleotide susceptible of being converted. Hence, the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at least at one certain position (locus) (the same position (locus) in the second region in all of the molecules), the same nucleotide, which is a nucleotide that corresponds in the ROI to a nucleotide susceptible of being converted by an agent or method capable of converting a nucleotide into another nucleotide which is read distinctly from the original nucleotide. In a preferred embodiment, the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, at least two nucleotides which are identical or substantially identical in all nucleic acid molecules. 
More preferably, the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, one or more nucleotides which are identical or substantially identical in all nucleic acid molecules, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 160, 200, 300, 500, 1000 or more nucleotides which are identical or substantially identical in all nucleic acid molecules. The second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, at least 1 nucleotide which is identical or substantially identical in all nucleic acid molecules. For instance, the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, at least 13 nucleotides which are identical or substantially identical in all nucleic acid molecules. For instance, the second region of each of the nucleic acid molecules comprised in the plurality provided in (i) comprises, at the same loci, at least 20 nucleotides which are identical or substantially identical in all nucleic acid molecules.
See Figures 1C and 1D. Here, the region of interest in the genome is A C C G T C G A C G, where C represents a cytosine which may be methylated (mC) or not (uC).
The first region (the 5' region in this specific case) of the molecules shown in Figure 1D may be fragments of genomic DNA and may represent the original molecules. The cytosines in the first region have different methylation status in each of the molecules (methylated cytosines are single-underlined and non-methylated cytosines are double-underlined).
Figure 1C shows a plurality of nucleic acid molecules comprising two regions (they comprise a first region in the 5' region and a second region in the 3' region). The base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of the original nucleic acid molecule (that overlaps with a ROI in this case).
Figure 1C shows the transformed or converted version of the molecules of Figure 1D. These molecules may be comprised in the plurality provided in step (i) of the method of the present invention.
In these molecules, the first region of at least two of them may have (and preferably have) a different nucleotide at at least one corresponding locus. For instance, at locus 3, molecule 2 has a uC whereas molecule 3 has a T. This is because the methylation status of the C present in locus 3 in the ROI was different in each of the original molecules (mC in the original molecule of 2 and uC in the original molecule of 1, see Figure 1A). However, in the corresponding locus in the second region, the nucleotide is the same in both molecules (G at position 8 (starting from the 5' end of the second region), which corresponds to locus 3 in the ROI). Hence, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules.
Both the first and the second regions in each of the molecules comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified (locus 3 in the ROI, which is occupied by a C).
In these molecules (Figure 1D), the first region of at least one of the molecules of a plurality of nucleic acid molecules comprises, at a certain position corresponding to a certain locus in the region of interest, a modified nucleotide (e.g., at a position corresponding to position (locus) number three in the region of interest, there is a uC (which corresponds to a methylated cytosine in the original molecule) in molecules number 2, 5 and 8). In addition, the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at the same corresponding position (locus) in the original molecule, a modified nucleotide (e.g., at the same position (locus), i.e., at a position (locus) corresponding to position number three in the region of interest, there is a T, a copy of a transformed non-methylated cytosine (which corresponds to a non-methylated cytosine in the original molecule), in molecules number 3 and 7). This is equivalent to saying that, while the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at a certain position (locus of the original molecule), a modified nucleotide (i.e., a mC), or a copy thereof (i.e., mC or uC, complementary to a G), the first region of at least one other molecule of the plurality of nucleic acid molecules comprises, at least at the same position (locus of the original molecule), a transformed non-modified nucleotide (i.e., U) or a copy thereof (i.e., a T), both complementary to an A.
As shown above, the second region of the nucleic acid molecules comprised in a plurality of nucleic acid molecules comprises, at the same position corresponding to the same locus in the region of interest, at least one nucleotide which is identical or substantially identical in the plurality of nucleic acid molecules, and the locus is occupied, in the original molecule, by a nucleotide susceptible of being modified or transformed. For instance, in Figure 1C, at position 8 in the second region of molecules 2, 3, 5, 7 and 8, there is always a G. Position 8 in the second region corresponds to locus 3 in the ROI. Hence, although this corresponding locus was originally occupied by a mC in molecules 2, 5 and 8 and by a uC in molecules 3 and 7 (see locus 3 at the first region of Figure 1D), in the second region, the corresponding locus is occupied by the same nucleotide, i.e., a G.
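The relationship between the two regions can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the first region is modelled with a bisulfite-like conversion (non-methylated C read as T), and the second region is modelled as the reverse complement of the unconverted original sequence, which is consistent with position 8 of the second region corresponding to locus 3 of the 10-nucleotide ROI. The function names are hypothetical, and linking regions and adapters are omitted.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_two_region_molecule(roi, methylated_loci):
    """Return (first_region, second_region) for one original molecule.

    Assumptions (illustrative only):
    - first region: the original sequence after a bisulfite-like
      conversion, so a non-methylated C is read as T;
    - second region: a copy made from the unconverted original,
      modelled here as its reverse complement.
    """
    first_region = "".join(
        "T" if base == "C" and locus not in methylated_loci else base
        for locus, base in enumerate(roi, start=1)
    )
    second_region = reverse_complement(roi)
    return first_region, second_region


def second_region_position(locus, roi_length):
    """1-based position in the second region that corresponds to a ROI locus."""
    return roi_length - locus + 1


roi = "ACCGTCGACG"
first_m, second_m = build_two_region_molecule(roi, methylated_loci={3})
first_u, second_u = build_two_region_molecule(roi, methylated_loci=set())

position = second_region_position(3, len(roi))
print(first_m[2], first_u[2])   # nucleotide at locus 3 of each first region
print(second_m == second_u)     # second regions do not depend on methylation
print(position, second_m[position - 1])
```

Under these assumptions, the first regions differ at locus 3 while the second regions are identical, which is what allows a single capture probe directed at the second region to treat both molecules alike.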
In an embodiment, the second region of the nucleic acid molecules comprised in a plurality of nucleic acid molecules does not comprise any modified nucleotide. Preferably, the second
region of the nucleic acid molecules comprised in a plurality of nucleic acid molecules does not comprise any modified C, preferably it does not comprise any methylated C (mC).
In an embodiment, the resulting molecule provided in step (i) comprises at least four regions that are substantially different from each other in sequence, so that one primer can only specifically bind to one of the four regions, and not to the others. Said regions are named 1, 2, 3, and 4 in the context of a strand with a Watson insert, and regions 1', 2', 3' and 4' in the context of a strand with a Crick insert, see e.g., Fig. 7A. Regions 1 to 4 and 1' to 4' can also be referred to in the present document as A, B, C, D or A', B', C', D', respectively. It is noted that said 1-4 and 1'-4' regions are not the same as the first and second regions of the molecule, which have been defined above. The Watson and Crick insert may be representative of the first region of the molecule provided in step (i). As will be readily understood by the skilled person, when the present invention refers to regions 1 to 4 or 1' to 4', it is also referring to the complementary sequences thereof.
Thanks to the presence of the 1-4 or 1'-4' regions in the molecule provided in step (i), amplification and sequencing primers can be designed against one of said four regions, so that the primers specifically bind only to one of said regions, and not to the others. A region is "substantially different" to another region when the percentage of nucleobase identity between both regions is less than 90%, such as less than 80%, or less than 70%, or less than 60%, or less than 50%, or less than 40%, or less than 30%, or less than 20%, or less than 10%. Preferably, a region is "substantially different" to another region when the percentage of nucleobase identity between both regions is such that it does not allow a primer that is capable of specifically binding (specifically hybridizing) to one of these regions to specifically hybridize to the other. Hence, a region is "substantially different" to another region when the percentage of nucleobase identity between both regions is such that it does not allow a primer that is capable of efficiently hybridizing to one of the regions to efficiently hybridize to the others. "Efficient hybridization" refers herein to a hybridization that has sufficient specificity to serve as a primer for a specific amplification or sequencing step.
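The percentage-identity criterion can be illustrated with a short Python sketch. It is a deliberate simplification: identity is computed position by position on two equal-length, ungapped sequences rather than with an alignment algorithm, the 90% threshold is the broadest value given above, and the example sequences and function names are hypothetical.

```python
def percent_identity(region_a, region_b):
    """Ungapped, position-by-position nucleobase identity (in percent).

    Simplifying assumption: both regions have the same, non-zero length
    and are compared without insertions or deletions.
    """
    if not region_a or len(region_a) != len(region_b):
        raise ValueError("regions must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(region_a, region_b))
    return 100.0 * matches / len(region_a)


def substantially_different(region_a, region_b, threshold=90.0):
    """True when the identity is below the chosen threshold (default 90%)."""
    return percent_identity(region_a, region_b) < threshold


# Hypothetical 10-nucleotide regions, for illustration only.
region_1 = "ACGTACGTAC"
region_2 = "TTGCACGGAT"   # matches region_1 at 5 of 10 positions
region_1_variant = "ACGTACGTAG"  # matches region_1 at 9 of 10 positions

print(percent_identity(region_1, region_2))
print(substantially_different(region_1, region_2))
print(substantially_different(region_1, region_1_variant))
```

In practice a stricter threshold from the list above could be passed as `threshold`, and the functional test (whether a primer for one region can also prime from another) remains the preferred criterion.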
Preferably, the at least four regions (1-4 or 1'-4') that are substantially different from each other of the resulting molecule of step (i) are located flanking the first and second regions of the molecule of step (i). "Flanking" refers herein to a position on both sides of a given region. Preferably, the at least four regions that are substantially different from each other of the resulting molecule of step (i) are located 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 or more than 16 nucleotides upstream and downstream of the first and second regions of the molecule of step (i).
In an embodiment, regions 1 (in strand with Watson insert) and 1' (in strand with Crick insert) are in the 5' region (or in the vicinity of the 5' region) of the molecules provided in step (i).
In an embodiment, regions 4 (in strand with Watson insert) and 4' (in strand with Crick insert) are in the 3' region (or in the vicinity of the 3' region) of the molecules provided in step (i).
In an embodiment, regions 2 (in strand with Watson insert) and 2' (in strand with Crick insert) are in the 5' region of the region that covalently links the first and second region of the molecule provided in step (i). In other words, regions 2 and 2' are comprised in the linking region between the first and second regions of the molecules provided in (i).
In an embodiment, regions 3 (in strand with Watson insert) and 3' (in strand with Crick insert) are in the 3' region of the region that covalently links the first and second region of the molecule provided in step (i). In other words, regions 3 and 3' are comprised in the linking region between the first and second regions of the molecules provided in (i).
Preferably, the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to said four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
4. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
Preferably, the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to said four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); or
4. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
Preferably, the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to three of said four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
3. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
Preferably, the at least four regions that are substantially different from each other of the resulting molecule of step (i) are characterized in that at least two different primers, such as two, three or four different primers, preferably three or four different primers, can bind to at least three of said four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i); and
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i).
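The primer arrangements listed above can be summarised with a small Python sketch that maps each primer binding site to the region it is used to sequence. The function name `primer_plan` and the string labels are hypothetical; the sketch only encodes the rule stated above that an outer primer binds to the adapter when one is present and otherwise to region 1 (or 1') at the 5' side or region 4 (or 4') at the 3' side.

```python
def primer_plan(has_5prime_adapter, has_3prime_adapter, crick_insert=False):
    """Return (binding site, region sequenced) pairs for the four primers.

    Regions are labelled 1-4 for a strand with a Watson insert and
    1'-4' for a strand with a Crick insert.
    """
    prime = "'" if crick_insert else ""
    # Outer primers bind to the adapter if present, otherwise to region 1 / 4.
    outer_5 = "5' adapter" if has_5prime_adapter else f"region 1{prime}"
    outer_3 = "3' adapter" if has_3prime_adapter else f"region 4{prime}"
    return [
        (outer_5, "first region"),
        (f"region 3{prime}", "second region"),
        (outer_3, "second region"),
        (f"region 2{prime}", "first region"),
    ]


# Watson-insert strand with a 5' adapter but no 3' adapter.
for site, target in primer_plan(True, False):
    print(f"primer on {site} -> sequences {target}")
```

Embodiments that use only three of the four sites correspond to dropping the region 2 (2') entry or the region 3 (3') entry from the returned list.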
In one embodiment, the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules provided in (i), as described in detail above (see also Figures 1A and 1B).
A region is "substantially identical" to another region when the percentage of nucleobase identity between both regions is at least 98%, at least 99%, at least 99.9%, preferably at least 99.99%. Thus, the "substantial identity" includes the possible errors (i.e., insertion, deletion or substitution of nucleotides) made by polymerase enzymes or by DNA damage, library processing, sequencing or mapping. When the original sequence of different molecules is identical or substantially identical, the second region of the molecules provided by the invention is also identical or at least substantially identical in all of the plurality of nucleic acid molecules, because it has been synthesized using the original molecule as a template, and before any transformation step occurs (see below for an exemplary embodiment on how to provide the molecules of step (i)). This means that, in this specific embodiment, regardless of the nucleotide modifications present in the first region, the nucleotide sequence of the second region will be identical or substantially identical in all of the plurality of nucleic acid molecules. Thus, in this specific embodiment, the second region represents a common region in all of the plurality of nucleic acid molecules, which will serve for an efficient capture step when a capture probe is designed to hybridize to said second region.
The first and second regions of the nucleic acid molecules of the present invention are linked, preferably covalently linked. In one embodiment, the first and second regions are covalently linked by a third region, also called herein a "linking region". Preferably, the third or linking region comprises or, preferably, is a nucleotide sequence. In another embodiment, the first and the second regions are directly linked to each other, i.e., the first and the second regions are continuous in the molecule, and there is no linker between them.
Preferably, the linker is a nucleotide sequence (such as an adaptor) that is identical or substantially identical in all of the plurality of nucleic acid molecules. Preferably, primers can (at least partially) bind (hybridize) to said linker. In this case, the third region is preferably a nucleotide sequence that is long enough so that primers can (at least partially) bind (hybridize) to it, preferably with enough specificity so that the primer does not substantially bind to other regions of the molecule, in order to sequence the molecule of the present invention, especially the first and second regions of the molecule.
The term "primer", as used herein, refers to a short strand of nucleic acid that is at least partially complementary to a sequence in another nucleic acid and serves as a starting point for nucleic acid (e.g., DNA) synthesis. Preferably, the primer is at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, preferably at least 18, at least 20, at least 25, at least 30 or more bases long.
The term "complementary" refers to the base pairing that allows the formation of a duplex between nucleotides or nucleic acids, such as, for instance, between the two strands of a double-stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single-stranded nucleic acid or between an oligonucleotide probe and its complementary sequence in a DNA molecule. Complementary nucleotides are, generally, A and T (or A and U), or C and G. Two single-stranded DNA molecules are said to be substantially complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with about 60% of the other strand, at least 70%, at least 80%, at least 85%, usually at least about 90% to about 95%, and even about 98% to about 100%. The "degree of identity" between two nucleotide regions can be determined using algorithms implemented in a computer and methods which are widely known by the persons skilled in the art. The identity between two nucleotide sequences is preferably determined using the BLASTN algorithm (BLAST Manual, Altschul, S. et al., NCBI NLM NIH Bethesda, Md. 20894; Altschul, S., et al., 1990, J. Mol. Biol. 215:403-410).
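By way of illustration only, the degree of complementarity between two single strands may be estimated as in the following minimal sketch. It assumes two strands of equal length, both written 5' to 3', compared antiparallel and without the insertions or deletions that an optimal alignment would allow; the names used are hypothetical and do not correspond to any particular software tool.

```python
# Watson-Crick pairs, including the A/U pair found in RNA.
PAIRS = {
    ("A", "T"), ("T", "A"),
    ("A", "U"), ("U", "A"),
    ("C", "G"), ("G", "C"),
}


def percent_complementarity(strand_a: str, strand_b: str) -> float:
    """Percentage of positions of strand_a that pair with strand_b.

    Both strands are given 5'->3'. strand_b is read in reverse so that the
    comparison is antiparallel. No gaps are considered (simplification).
    """
    if len(strand_a) != len(strand_b) or not strand_a:
        raise ValueError("strands must be non-empty and of equal length")
    antiparallel_b = strand_b.upper()[::-1]
    paired = sum(
        (a, b) in PAIRS for a, b in zip(strand_a.upper(), antiparallel_b)
    )
    return 100.0 * paired / len(strand_a)
```

For example, the strands ACGTACGTAC and GTACGTACGT are fully complementary, whereas replacing the last base of the second strand with A leaves nine of ten positions paired, i.e., 90% complementarity.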
The skilled person is aware of means for designing the third or linking region as described herein. For instance, the third region or linking region may have a length of at least 5 nucleotides, such as at least 10, or at least 15 nucleotides, or at least 17 nucleotides, such as 17 nucleotides. For instance, the third region or linking region may have a length of from 5 to 100 nucleotides, such as from 15 to 100 nucleotides, such as from 15 to 80 nucleotides, such as from 15 to 70 nucleotides, preferably from 15 to 80 nucleotides, more preferably from 17 to 70 nucleotides, even more preferably from 25 to 65 nucleotides, such as 17 nucleotides, or 29 nucleotides, or 64 nucleotides. For instance, the third region or linking region may have a length of at least 20 nucleotides, such as at least 25, 26, 27, 28, 29 or 30 nucleotides. In a preferred embodiment, the third region or linking region has a length of at least 17 nucleotides, such as 17 nucleotides, or 18 nucleotides, or 19 nucleotides. In another preferred embodiment, the adaptor has a length of 29 nucleotides. The third region or linking region can also have a longer length, such as at least 35, 40, 45, 50, 55 or at least 60 nucleotides. In another preferred embodiment, the third region or linking region has a length of 64 nucleotides, but it can be longer, such as at least 65, 70, 75, 80 or more nucleotides. Hence, the third region or linking region may comprise from 5 to 100 nucleotides, preferably from 15 to 80 nucleotides, more preferably from 25 to 70 nucleotides, even more preferably from 29 to 64 nucleotides. Of course, other lengths are possible, although it is preferable that the length allows primers to (at least partially) bind (hybridize) to it with enough specificity so that the primer does not substantially bind to other regions of the molecule, in order to sequence the first and/or second regions of the nucleic acid molecules provided in step (i) of the method of the present invention.
In the context of the present invention, "hybridization" (or "hybridize", or variants thereof) refers to the process in which two single-stranded polynucleotides bind (at least partially) non-covalently to form a stable double-stranded polynucleotide. In the context of the present invention, the term "binding" may be used to refer to "hybridize" or "at least partially hybridize".
The skilled person is familiar with conditions and buffers suitable for the hybridization of two single-stranded polynucleotides, as described above.
The nucleic acid molecules of the present invention may further comprise one adapter at the 5' end of the molecule and/or one adapter at the 3' end of the molecule. The terms "adapter" and "adaptor" are used interchangeably in the present description and refer to an oligonucleotide or nucleic acid fragment or segment that can be ligated to a nucleic acid molecule of interest. The "adapter molecule" of the method of the invention is preferably a DNA molecule having one end which is compatible with the end of the nucleic acid molecules (preferably DNA) of the present invention.
An adapter or adaptor in genetic engineering is a short, chemically synthesized, single-stranded or double-stranded oligonucleotide that can be ligated to the ends of other DNA or RNA molecules. Adaptors may contain "sites for cutting" (e.g., "restriction sites", sequences of oligonucleotides that are recognized by restriction enzymes). The "sites for cutting" add a way to adapt the final elements of the library to the needs of the different sequencing platforms.
In one embodiment, at least one portion of the adaptors has sequences common to all the adaptors present in the population of nucleic acid molecules of step (i), if this is the case. In this case, identical primers for sequencing all molecules could be used.
Optionally, the adapters include unique and combinatorial barcodes (also referred to as "combinatorial sequences" or "barcodes" or "barcode sequences" or "combinatorial labelling") that allow sample identification, multiplexing, pairing as well as quantitative analysis. The constructs obtained by the methods of the invention may have barcodes that allow generating unique identifiers associated with the initial construct, thus giving the ability to differentiate between constructs. Said unique identifiers allow identification of a specific construct comprising said identifier and its descendants. Each unique identifier is associated with an individual molecule or a fragment of an individual molecule in the starting sample. Therefore, any amplification products of said initial individual molecule bearing the unique identifier are assumed to be identical by descent. The combinatorial barcodes also allow for quantifying the percentage of individual sequences within a sample and are useful for monitoring biases and error control during the amplification steps.
The terms "combinatorial sequence", "barcode sequence", "barcode" and "combinatorial barcode" are used interchangeably all along the present description and refer to an identifier unique to the individual adapter sequence or a separate nucleic acid (e.g., DNA) molecule (barcode sequence on its own, not belonging to the adapter). Preferably, the barcode sequence is included in the adapter. In an embodiment, the combinatorial sequence within the adapter sequence is a degenerate nucleic acid sequence. The combinatorial sequence may contain any nucleotide, including adenine, guanine, thymine, cytosine, uracil, methylated cytosine (e.g., 5mC or 5hmC) and other modified nucleotides. The number of nucleotides in the combinatorial sequence is preferably designed such that the number of potential and actual sequences represented by the combinatorial sequence is greater than the total number of adapters in the library. The combinatorial sequence may be located in any region of the adapter sequence.
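The relationship between the length of the combinatorial sequence and the number of adapters can be illustrated with the following minimal sketch. It assumes a fully degenerate sequence in which every position can take any letter of a given alphabet (four by default, i.e., A, C, G and T), so that a sequence of length n represents alphabet_size to the power n potential sequences; the function name and parameters are hypothetical.

```python
def min_barcode_length(n_adapters: int, alphabet_size: int = 4) -> int:
    """Smallest length n such that alphabet_size**n exceeds n_adapters.

    Assumes a fully degenerate barcode: every position may carry any of
    the alphabet_size letters independently of the other positions.
    """
    if n_adapters < 0:
        raise ValueError("n_adapters must not be negative")
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    length = 1
    while alphabet_size ** length <= n_adapters:
        length += 1
    return length
```

Under these assumptions, a library of one million adapters would call for a combinatorial sequence of at least 10 nucleotides, since 4^9 = 262,144 is smaller than one million whereas 4^10 = 1,048,576 is larger.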
The skilled person is aware of methods for obtaining the nucleic acid molecules (the plurality of nucleic acid molecules) provided in step (i) of the present invention. Some non-limiting examples thereof are described below.
Exemplary method for obtaining the nucleic acid molecules provided in step (i) of the method of the present invention
The plurality of nucleic acid molecules provided in step (i) may be generated by a method comprising the following steps:
Step (a) Providing a plurality or population of nucleic acid molecules, preferably a plurality or population of double-stranded nucleic acid molecules.
In step (a), a plurality of nucleic acid molecules is provided. The molecules may be single stranded (ss) or double stranded (ds). In a preferred embodiment, the plurality or population of nucleic acid molecules are ds, preferably they are fragments of genomic DNA.
The plurality or population of nucleic acid molecules provided in step (a) would correspond to the "original molecules" in the context of the present invention.
The "population or plurality of nucleic acid molecules", as used herein, is a collection of nucleic acid molecules that may be ds or ss. For instance, they may be ssDNA molecules, or RNA molecules, as described in detail above.
In one embodiment, the population or plurality of nucleic acid molecules are double stranded, and may be, without limitation, genomic DNA (nuclear DNA, mitochondrial DNA, chloroplast DNA, cfDNA, etc.), plasmid DNA or ds DNA molecules obtained from ss nucleic acid samples (e.g., DNA, cDNA, mRNA, etc.). In an embodiment, said population is formed by fragments of dsDNA.
Preferably, the plurality of ds nucleic acid molecules is genomic DNA. This can be the whole genome or a reduced representation of the genome. The genomic DNA comprises the DNA of the nucleus (also referred to as chromosomal DNA) but also the DNA of the plastids (e.g., chloroplasts) and other cellular organelles (e.g., mitochondria) or circulating/cell-free DNA (cfDNA).
In a more preferred embodiment, the double stranded DNA molecules are fragments of genomic DNA.
Continuing with the above examples and Figure 1, the plurality or population of nucleic acid molecules comprised in the first region may correspond to the first region of molecules 1 to 8 as shown in Figure 1B or 1D.
As shown in Figures 1B and 1D, the nucleic acid molecules differ in their methylation status at the positions occupied by a cytosine in the corresponding loci in the first region and in the original molecule.
Step (b) Ligating one adaptor to at least one end of the nucleic acid molecules provided in (a), thereby obtaining an adaptor-containing nucleic acid molecule.
In this step, an adapter is ligated to at least the 3' region of the nucleic acid molecules provided in (a).
Preferably, the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase.
If the nucleic acid molecules provided in (a) are double-stranded, then the adaptor may be preferably a double-stranded adaptor. The adaptors may be added as a complex comprising an elongation primer with a hairpin adapter under conditions adequate for the hybridization of the elongation primer to the second strand of the adapter, wherein the elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule and which, after hybridization with the second strand of the adapter molecule creates overhanging ends, and wherein the hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer to the second strand of the adapter. Step (b) is also called herein "ligation step".
The adaptors are added (ligated) at least at the 3' end of the nucleic acid molecules provided in (a), but adaptors may also be optionally ligated to the 5' region of the molecules provided in (a). Following the above example (e.g., Figure 1B), an adaptor (represented by " — " in the figure) may be ligated to the 3' region of the plurality or population of eight nucleic acid molecules, which then become adaptor-containing nucleic acid molecules.
[Tables listing the adaptor-containing molecules by molecule number, for Figure 1B and, alternatively, for Figure 1D, are not reproduced here.]
The ligation is preferably performed under conditions adequate for the ligation of the adaptors to at least the 3' end of the nucleic acid molecules, thereby obtaining a plurality of "adapter-containing nucleic acid molecules" (also referred herein as "adapter-modified nucleic acid molecules").
In a preferred embodiment, at least a portion of the adaptors have sequences common to all the adaptors used in step (b).
In one embodiment, when the nucleic acid molecules provided in (a) are double-stranded, the adapter is a so-called "Y adapter", which is a ds adapter, and can be ligated to at least one end of a ds nucleic acid molecule. The "Y adapter" has a "Y" form. Further details regarding "Y-adapters" are explained, e.g., in WO 2015/104302. In a Y-adapter, the 3' region of the first DNA strand and the 5' region of the second DNA strand form a double stranded region by sequence complementarity, whereas the 5' region of the first DNA strand and the 3' region of the second DNA strand are not complementary. The terms "Y-adapter" and "Y-adaptor" are used interchangeably and, in the context of the present invention, refer to an adapter formed by two nucleic acid (preferably DNA) strands (ds DNA), wherein the 3' region of the first DNA strand and the 5' region of the second DNA strand form a double stranded region by sequence complementarity, and wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of the double stranded DNA molecules. The expression "3' region", as used herein, refers to a region of a nucleotide strand that includes the 3' end of said strand.
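The "Y" geometry described above can be sketched computationally as follows. This is a simplified illustration under stated assumptions: both strands are written 5' to 3', only unmodified A, C, G and T are considered, the duplex is taken to be the uninterrupted run of base pairs starting at the 3' end of the first strand and the 5' end of the second strand, and the minimum duplex length of 3 is merely an illustrative default. The function names and the example sequences are hypothetical and are not the sequences of the present invention.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def duplex_length(first_strand: str, second_strand: str) -> int:
    """Number of consecutive base pairs formed between the 3' end of the
    first strand and the 5' end of the second strand (antiparallel)."""
    paired = 0
    for a, b in zip(reversed(first_strand.upper()), second_strand.upper()):
        if DNA_COMPLEMENT.get(a) != b:
            break
        paired += 1
    return paired


def looks_like_y_adapter(
    first_strand: str, second_strand: str, min_duplex: int = 3
) -> bool:
    """True when the strands form a duplex of at least min_duplex base
    pairs and each strand keeps an unpaired arm beyond the duplex."""
    n = duplex_length(first_strand, second_strand)
    return min_duplex <= n < min(len(first_strand), len(second_strand))
```

For example, with the hypothetical strands TTTTTACGGTC (first) and GACCGTTTTTT (second), the last six nucleotides of the first strand pair with the first six nucleotides of the second strand, while the two poly-T arms remain unpaired, giving the "Y" form.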
The "Y adapter" can also be obtained by cleavage of a hairpin. In this case, a hairpin is ligated to at least one end, preferably to both ends of the ds nucleic acid molecules (such as ds DNA molecules) and, in a further step, the hairpin(s) is(are) cleaved, so that at least one of the strands of the ds nucleic acid molecules comprises an adapter ligated to it. If desired/necessary, in a separate step, further primers may be ligated to at least one end of the ds nucleic acid molecules (such as ds DNA molecules), see below. Hence, a hairpin may be considered to be a type of "Y-adapter", since it may become a Y-adapter if the hairpin is cleaved.
A "hairpin" or "stem loop" occurs when two regions of the same strand, usually complementary in nucleotide sequence when read in opposite directions, base-pair to form a double helix that ends in an unpaired loop. In one embodiment, the molecules generated in step (b), when the adapters do not comprise a hairpin loop, are contacted with a hairpin adapter under conditions adequate for the ligation of the hairpin adapter to the molecules generated in step (b), as described in detail below. If the adapters ligated in step (b) do not comprise a hairpin loop, a hairpin adapter may be ligated to the adapters ligated in step (b). For instance, a hairpin adapter may be incorporated as described in WO 2015/104302.
The expression "sequence complementarity", as used herein, refers to a property shared between two nucleic acid sequences, such that when they are aligned antiparallel to each other, the nucleotide bases at each position (locus) will be complementary.
In one embodiment, the 3' region of the second nucleic acid (e.g., DNA) strand of the Y-adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment being located at the 3' end of the 3' region and the second segment being located in the vicinity of the 5' region of the second DNA strand. The term "hairpin loop", as used herein, refers to a region of DNA formed by unpaired bases that is created when a DNA strand folds and forms base pairs with another section or segment of the same strand.
Optionally, the 3' region of the second DNA strand of the Y-adapter does not form a hairpin loop by hybridization between a first and a second segment within said 3' region.
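The hairpin definition given above (a 3' terminal segment that base-pairs with an upstream segment of the same strand, leaving an unpaired loop in between) can be illustrated with the following minimal sketch. It assumes perfect Watson-Crick pairing along the stem, unmodified A, C, G and T only, and a minimum loop of 3 unpaired nucleotides chosen purely for illustration; the function names and the example sequence are hypothetical.

```python
from typing import Optional

DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_hairpin_stem(
    strand: str, stem_len: int, min_loop: int = 3
) -> Optional[int]:
    """Return the 0-based start of the upstream segment that pairs with
    the last stem_len nucleotides of the strand, or None if there is none.

    At least min_loop unpaired nucleotides must separate both segments.
    """
    strand = strand.upper()
    if stem_len <= 0 or len(strand) < 2 * stem_len + min_loop:
        return None
    three_prime_segment = strand[-stem_len:]
    partner = reverse_complement(three_prime_segment)
    # Only search the part of the strand that leaves room for the loop.
    upstream = strand[: len(strand) - stem_len - min_loop]
    position = upstream.rfind(partner)
    return position if position >= 0 else None
```

For instance, in the hypothetical strand AACTGACGTTTTCGTCAG, the last six nucleotides (CGTCAG) can pair with the segment CTGACG starting at the third nucleotide, leaving a TTTT loop, so that the 3' end is base-paired and could in principle be extended by a polymerase.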
The adapters of the present invention
In the context of the present invention, the adapters, preferably ds adapters (such as DNA adapters, which may be Y-adapters or not), comprise at least one barcode sequence in a region, preferably the ds region, of the adapter. This provides at least for the pairing of the two original nucleic acid (such as DNA) strands of each original ds nucleic acid molecule, and makes it possible to deduplicate reads, that is, to differentiate reads that originate from the same original sequence from reads that are independent but start and end at the same loci. This becomes crucial for enrichment, especially of low input/low diversity libraries, or for high depth whole genome sequencing. This allows for keeping track of both strands of each ds nucleic acid fragment originally used in step (a) as described above. Hence, each of the strands of the double-stranded original nucleic acid molecules can be paired by using barcode sequences in step (b). Preferably, the barcode that pairs each of the strands of an original double-stranded nucleic acid molecule is placed in the 5' region of the second strand of the adapter. Said barcode pairing can be performed either before or after the ligation, or simultaneously with the ligation.
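The deduplication logic described above can be sketched as follows. This is only an illustrative simplification: it assumes that each read has already been mapped and annotated with its barcode sequence and its start and end loci, and that exact matching of the barcode is sufficient (sequencing errors within the barcode are ignored). The Read type, its fields and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str      # read identifier
    barcode: str   # barcode sequence found in the adapter
    start: int     # mapped start locus
    end: int       # mapped end locus


def group_by_origin(
    reads: Iterable[Read],
) -> Dict[Tuple[str, int, int], List[str]]:
    """Group read names by (barcode, start, end).

    Reads sharing the three values are treated as copies of the same
    original molecule; reads sharing only start and end are kept apart.
    """
    groups: Dict[Tuple[str, int, int], List[str]] = defaultdict(list)
    for read in reads:
        groups[(read.barcode, read.start, read.end)].append(read.name)
    return dict(groups)
```

In this sketch, two reads with barcode AAGT mapping to loci 100 to 250 would be collapsed as duplicates, whereas a third read mapping to the same loci but bearing barcode CCTA would be counted as an independent original molecule.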
In one embodiment, the Y adapter comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region (which may be referred to as "duplex" or "double-stranded, ds, region" in the context of the present invention) by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand comprises at least two regions:
(a) a region comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region ("duplex"), and
(b) a region that is not complementary to the second strand (i.e., a single-stranded, ss, region), and wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
In one embodiment, the adapter comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region (which may be referred to as "duplex" or "double-stranded, ds, region" in the context of the present invention) by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand comprises at least a region comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region ("duplex"), and wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
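The effect of a C to U/T conversion on the three kinds of nucleotides named above can be illustrated with the following minimal sketch. It assumes an idealized, complete conversion in which every non-methylated C is read as T after conversion and sequencing, whereas every methylated C is protected and read as C. The letter "M" is used here purely as a hypothetical notation for a methylated C (it is not the IUPAC ambiguity code in this sketch), and the function name is illustrative.

```python
# Idealized C-to-U/T conversion as seen after sequencing:
#   non-methylated C -> T, methylated C (written "M" here) -> C.
_CONVERSION = str.maketrans({"C": "T", "M": "C"})


def convert_and_read(strand: str) -> str:
    """Sequence read from a strand after an idealized C-to-U/T conversion.

    Input uses A, C, G, T plus the hypothetical symbol M for methylated C.
    Both substitutions are applied simultaneously, so a protected M is
    reported as C and is not converted further.
    """
    return strand.upper().translate(_CONVERSION)
```

For example, the hypothetical duplex segment ACMGC would be read as ATCGT: the G and the methylated C are retained, whereas both non-methylated C are read as T.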
Preferably, the adapter, preferably the Y adapter, is further characterized in that a primer can specifically bind to one of the strands of the double stranded region of the adaptor, or to the complementary or transformed complementary thereof (including reverse complementary and transformed reverse complementary), thereby allowing the primer to be extended by action of a polymerase.
Preferably, the adapter, preferably the Y adapter, is further characterized in that a primer can specifically bind to the complementary sequence or to the transformed complementary, preferably transformed reverse complementary, of a sequence comprised in the double-stranded region of the adapter (when denatured), thereby allowing the primer to be extended by action of a polymerase.
Preferably, the adapter, preferably the Y adapter, is further characterized in that a primer can specifically bind:
- to the complementary sequence, or to the transformed complementary, preferably transformed reverse complementary, of a sequence comprised in the double-stranded region of the adapter (when denatured), and
- to a part or a region of the single stranded region of the adapter, thereby allowing the primer to be extended by action of a polymerase.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region with a length of at least 3 nucleotides, such as 5 nucleotides, or 6 nucleotides, or 7 nucleotides, or 8, or 9, or 10, or more, by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand comprises at least 5 nucleotides, preferably at least 10, 13, or 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
(a) a 3' region comprising at least two nucleotides, preferably more, such as at least 5, or at least 7, or at least 10, or more, such as at least 12, or at least 13, or at least 14, or at least 15, or at least 16, or more, that are complementary to the second strand and thus form a double stranded region, and
(b) optionally, a 5' region that is not complementary to the second strand (a single-stranded region), and wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
In an embodiment, the 3' region (a) of the first strand of the adapter, which is complementary to the second strand and thus forms a double stranded region with it, comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides. Preferably, the 3' region of the first strand that is complementary to the 5' region of the second strand comprises at least 7, more preferably at least 10 nucleotides.
In an embodiment, the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides. Preferably, the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand comprises at least 5, more preferably at least 10 nucleotides, and it comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
Preferably, the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand comprises: at least 1, 2, 3, 4, preferably 5, 6, 7, 8, 9, 10 or more non-modified nucleotides complementary to a modified nucleotide, (e.g., G if we are converting C to U/T); at least 1, 2, 3, 4, preferably 5, 6, 7, 8, 9, 10 or more modified nucleotides (e.g., methylated C if we are converting C to U/T), and at least 1, 2, 3, 4, preferably 5, 6, 7, 8, 9, 10 or more non-modified nucleotides (e.g., non-methylated C if we are converting C to U/T).
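The composition requirement on the double stranded region can be checked as in the following minimal sketch, given for the C to U/T example only. It assumes that the duplex portions of both strands are provided as strings in which "M" is a hypothetical symbol for a methylated C, and it simply counts G, methylated C and non-methylated C across both strands without verifying which G is actually paired with which modified nucleotide. The function name and the parameter minimum are illustrative.

```python
from collections import Counter
from typing import Dict


def duplex_composition(first_duplex: str, second_duplex: str) -> Dict[str, int]:
    """Count G, methylated C ("M") and non-methylated C ("C") over the
    duplex portions of both adapter strands."""
    counts = Counter((first_duplex + second_duplex).upper())
    return {
        "G": counts["G"],
        "methylated_C": counts["M"],
        "non_methylated_C": counts["C"],
    }


def duplex_meets_composition(
    first_duplex: str, second_duplex: str, minimum: int = 1
) -> bool:
    """True when each of the three nucleotide classes occurs at least
    `minimum` times in the double stranded region."""
    composition = duplex_composition(first_duplex, second_duplex)
    return all(count >= minimum for count in composition.values())
```

With the default minimum of 1, the check corresponds to the broadest wording above; a higher value such as 5 could be passed to reflect the preferred numbers of each nucleotide class.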
Optionally, the 3' region of the first strand that is complementary to the second strand of the adapter comprises one or more barcode sequences. Preferably, the one or more optional barcode sequences are placed towards, preferably in, the 5' end of the first strand of the adapter and/or in the 3' region of the second strand of the adapter. Preferably, said barcode sequences comprise at least 4, preferably at least 6, nucleotides. Preferably, the one or more optional barcode sequences can be used as unique molecular identifiers within a population of nucleic acid molecules.
In an embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region of at least 3, preferably 5 or 10 or 16, or more, nucleotides by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand comprises at least 5 nucleotides, preferably at least 10, 12, 13, or 16, such as 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
(a) a 3' region comprising at least five, preferably at least 7, more preferably at least 10, nucleotides that are complementary to the second strand and thus form a double stranded region, and
(b) optionally, a 5' region that is not complementary to the second strand, and wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
In an embodiment, the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T).
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the 5' region of the second strand comprises SEQ ID NO: 49,
- the first strand comprises at least 7 nucleotides, preferably at least 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
(a) a 3' region comprising SEQ ID NO: 48, and
(b) optionally, a 5' region that is not complementary to the second strand (a single-stranded region), and wherein SEQ ID NO: 48 and 49 are comprised in the double stranded region.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the 5' region of the second strand comprises SEQ ID NO: 51,
- the first strand comprises at least 7 nucleotides, preferably at least 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
(a) a 3' region comprising SEQ ID NO: 50, and
(b) optionally, a 5' region that is not complementary to the second strand (a single-stranded region), and wherein SEQ ID NO: 51 and 50 are comprised in the double stranded region.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 5' region of the second strand comprises SEQ ID NO: 49,
- the first strand comprises at least 7 nucleotides, preferably at least 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
(a) a 3' region comprising SEQ ID NO: 48, and
(b) optionally, a 5' region that is not complementary to the second strand (a single-stranded region), and wherein SEQ ID NO: 48 and 49 are comprised in the double stranded region.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 5' region of the second strand comprises SEQ ID NO: 51,
- the first strand comprises at least 7 nucleotides, preferably at least 15, such as 15, 16, 17, 18, 19, 20, 21 nucleotides or more, and comprises:
(a) a 3' region comprising SEQ ID NO: 50, and
(b) optionally, a 5' region that is not complementary to the second strand (a single-stranded region), and wherein SEQ ID NO: 51 and 50 are comprised in the double stranded region.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 44 and the second strand comprises SEQ ID NO: 45, or a sequence with at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44 and 45, respectively.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 46 and the second strand comprises SEQ ID NO: 47, or a sequence with at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 46 and 47, respectively.
In another embodiment, the adapter, preferably the Y adapter, comprises a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 46 and the second strand comprises SEQ ID NO: 47, or a sequence with at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 46 and 47, respectively, wherein the adapter further comprises SEQ ID NO: 52, or a sequence with at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 52 (hairpin), that is placed in the 3' end of SEQ ID NO: 47 (second strand).
In an embodiment, the adapter, preferably the Y adapter, comprises:
- a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 44 and the second strand comprises SEQ ID NO:45, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44 and 45, respectively, or
- a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 46 and the second strand comprises SEQ ID NO: 47, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 46 and 47, respectively, and wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein, preferably, the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule,
and, preferably, wherein the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase.
Hence, a plurality of adaptor-containing nucleic acid molecules is obtained.
If the adapter is a Y-adapter, the Y adapter may contain one or more barcode sequences in the 5' region of the first nucleic acid (DNA) strand and/or in the 3' region of the second nucleic acid (DNA) strand of the Y adapter formed by two nucleic acid (DNA) strands (and/or in the double stranded region). The barcode sequences may thus be located in the single-stranded region of the Y-adapter molecule and/or in the double stranded region of the Y-adapter. In this case, each original nucleic acid (DNA) strand and its synthetic complementary strand, see step (c), would then be paired. Preferably, the adaptor has a first barcode sequence in the double stranded region and/or a second barcode sequence in the 5' region of the second strand of the adaptor.
Hence, at least one adapter, preferably comprising a hairpin from which a polymerase can synthesize a complementary strand, is ligated to the 3' end of the molecule provided in (a). The 3' adapter may preferably comprise one or more barcodes, as explained above. Preferably, a second adapter, this time without a hairpin (e.g., a linear adapter), may be ligated at the 5' end of the molecule provided in (a). Preferably, the adapter ligated at the 5' end of the molecule is long enough for a primer to hybridize to it.
Step (c) Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand.
The complementary strand is also referred to as the "synthetic complementary strand", and it is generated by polymerase elongation from the 3' end of the adapter molecule, using the strands of the nucleic acid molecules obtained in step (b) as template. In this specific embodiment, each original strand of a nucleic acid (e.g., DNA) molecule is physically bound to a complementary strand obtained by synthetic extension. See also Figures 1B and 1D. Step (c) is also called herein the "extension step". Before the extension step, the strands may be denatured.
Hence, the original nucleic acid strand of a nucleic acid (e.g., DNA) molecule and its synthetic complementary strand are physically linked by one of their ends by a loop, which is preferably a nucleotide sequence to which primers can at least partially bind, as defined above.
Continuing with the above examples, an extension step (step (c)) is performed in which, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", is generated by polymerase elongation from the 3' end of the adapter molecule, using the strands of the five nucleic acid molecules obtained in step (b) as a template, to provide barcode paired adaptor-containing double stranded nucleic acid molecules (see Figures 1B and 1D, although in these figures the molecules are represented in a linearised form when in fact, as they are complementary, they have a ds configuration). Preferably, the extension step is performed with naturally occurring nucleotides (canonical bases (e.g., A, C, G, T, or U) or non-modified nucleotides), and not with modified nucleotides (such as methylated C), so that the resulting molecule of step i) does not have a synthetic complementary strand (which will give rise to the second region of the molecule of the invention) comprising modified nucleotides. Preferably, the synthesis is performed with non-modified cytosines.
The expression "polymerase elongation", as used herein, refers to the synthesis of a complementary strand by a DNA polymerase that adds free nucleotides to the 3' end of the second DNA strand in the adapter molecule. Said adapter molecule may act as a primer for the elongation step, as described above. During this step, the temperature is chosen depending on the optimal temperature for the specific DNA polymerase used.
After step (c), double-stranded nucleic acid (e.g., DNA) molecules are obtained from each adapter-containing nucleic acid (e.g., DNA) molecule, and each of said double-stranded nucleic acid (e.g., DNA) molecules is formed by a nucleic acid (e.g., DNA) strand of a nucleic acid (e.g., DNA) molecule and its synthetic complementary strand that are paired at least physically by a linker region (i.e., a Y-adapter or linker), to which primers can preferably bind. They may additionally be paired by barcode sequences, as described above.
Optionally, complementary strands of the plurality of paired adaptor-modified DNA molecules obtained in step (c) can be provided, using primers the sequences of which are complementary to at least a portion of the double-stranded adaptors. The complementary strands of the plurality of paired adaptor-modified DNA molecules obtained in step (c) may be provided using the nucleotides A, G, C, T. Modified nucleotides may also be used, such as modified cytosines, mC (e.g., 5mC, 5hmC or 5fC). This amplification can be referred to as "amplification step". The amplification step is carried out between steps (c) and (d). Preferably, the amplification is performed with natural nucleotides (A, C, T, G), that is, not using modified nucleotides, such as methylated C.
Optionally, the paired double stranded nucleic acid molecules obtained in step (c) are amplified to provide amplified paired double stranded nucleic acid molecules.
The pairing between both strands of the original double-stranded nucleic acid (e.g., DNA) molecules allows keeping track of both strands of each double stranded nucleic acid (e.g., DNA) fragment originally used.
Therefore, each adapter may include unique and combinatorial barcodes (e.g., unique molecular identifiers or UMIs) that allow sample identification and multiplexing as well as quantitative analysis. In a preferred embodiment the adapter, such as the Y-adapter, is provided as a library of adapters wherein each member of the library is distinguishable from the others by a combinatorial sequence located within the double stranded region formed by the 3' region of the first strand and the 5' region of the second strand of the adapter.
Optionally, the Y-adapter incorporates bases labelled with the second member of a binding pair that allows the recovery of the original nucleic acid (e.g., DNA) template after the elongation or amplification steps. This provides the advantage that the sample used as a nucleic acid (e.g., DNA) template may be identified, preserved during the process and recovered, stored and submitted to multiple amplifications with different conditions and sequencings without sample exhaustion.
Optionally, the adaptors (such as Y-adaptors) may contain "sites for cutting", as described above.
In one embodiment, the molecules generated in step (b), when the ds adapters do not comprise a hairpin loop, are contacted with a hairpin adapter under conditions adequate for the ligation of the hairpin adapter to the molecules generated in step (b), as described in detail below. If the adapters ligated in step (b) do not comprise a hairpin loop, a hairpin adapter may be ligated to the adapters ligated in step (b). For instance, a hairpin adapter may be incorporated as described in WO 2015/104302.
Optionally, in this embodiment, the method for the generation of the nucleic acid molecule provided in step (i) of the method of the present invention comprises, after step (c), the following step (c1):
(1) Contacting each strand of the adapter-containing nucleic acid molecules with a complex of an elongation primer with a hairpin adapter under conditions adequate for the hybridization of the elongation primer to the second strand of the adapter, wherein the elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule and which, after hybridization with the second strand of the adapter molecule creates overhanging ends, and wherein the hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer to the second strand of the adapter.
Step (d) Transformation or conversion step
After step (c), the molecules generated are treated with an agent or method or process, as described herein, capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur.
Hence, the following steps ((c21) and/or (c22)) are performed after step (c):
(1) Converting the non-modified nucleotides (e.g., non-methylated cytosine(s)) in the paired adaptor-containing nucleic acid molecules obtained in step (c), if any, into another base which is read distinctly from said non-modified nucleotide (e.g., uracil), in the paired adaptor-containing nucleic acid molecules (c21); and/or
(2) Converting modified nucleotides (e.g., methylated cytosine(s)) in the paired adaptor-containing nucleic acid molecules obtained in step (c), if any, into another base which is read distinctly from said modified nucleotide (e.g., cytosine), in the paired adaptor-containing nucleic acid molecules (c22).
Preferably, the method for the generation of the nucleic acid molecule provided in step (i) of the method of the present invention comprises, after step (c), the following steps (c21 and/or c22, respectively):
(1) Converting non-methylated cytosine(s) in the paired adaptor-containing nucleic acid molecules obtained in step (c), if any, into another nucleotide which is read distinctly from cytosine (e.g., uracil), in the paired adaptor-containing nucleic acid molecules (c21); and/or
(2) Converting methylated cytosine(s) in the paired adaptor-containing nucleic acid molecules obtained in step (c), if any, into another nucleotide which is read distinctly from cytosine (e.g., uracil or thymine), in the paired adaptor-containing nucleic acid molecules (c22).
See Figures 1A and 1C.
Hence, step c2 (in any of its variants) provides for the conversion of non-modified (e.g., non-methylated) (c21) or modified (e.g., methylated) (c22) nucleotides (e.g., cytosines) in the nucleic acid molecule provided in step (c), leading to the nucleic acid molecules comprised in the plurality provided in step (i) of the method of the present invention. Steps c21 and c22 are also referred to herein as the "transformation or conversion step". Preferably, step c2 (d) (in any of its variants, c21 or c22) provides for the conversion of non-methylated (c21) or methylated (c22) cytosines in the molecules obtained after step (c), leading to the nucleic acid molecules provided in step (i) of the method of the present invention.
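The two conversion variants can be illustrated with a minimal sketch. It assumes a simplified two-state model in which uppercase "C" denotes a non-methylated cytosine and lowercase "m" a methylated cytosine; this notation, the function names and the example strand are assumptions made for illustration only, and uracil is shown as it would be read after sequencing (as T).

```python
# Illustrative model only. Notation assumed for this sketch:
#   "C" = non-methylated cytosine, "m" = methylated cytosine.

def convert_c21(strand: str) -> str:
    """c21-type conversion (e.g., bisulfite): non-methylated C is read as T
    (via uracil), whereas methylated C is protected and is read as C."""
    return strand.replace("C", "T").replace("m", "C")

def convert_c22(strand: str) -> str:
    """c22-type conversion: methylated C is read as T, whereas
    non-methylated C is left unchanged and is read as C."""
    return strand.replace("m", "T")

# Hypothetical strand with one methylated and one non-methylated cytosine.
original = "AmGTCGA"
print(convert_c21(original))  # ACGTTGA
print(convert_c22(original))  # ATGTCGA
```

In both variants the two cytosine states end up being read as different bases, which is what allows the modification status to be recovered after sequencing.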
For instance, in this optional embodiment, the nucleic acid molecules obtained in step (c) are treated with an agent capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide (e.g., a reagent which allows conversion of non-methylated cytosines to a base that is detectably dissimilar to cytosine in terms of hybridization properties, preferably uracil), or with a method or process, as described above, in order to analyse the epigenetic modification status (e.g., the methylation pattern) of the sample (optional step c21).
In an embodiment, after treatment with an agent capable of converting (or transforming) a nucleotide to another nucleotide which is read distinctly from the original nucleotide (e.g., a reagent which allows conversion of non-methylated cytosines to a base that is detectably dissimilar to cytosine in terms of hybridization properties, preferably uracil), the resulting molecule comprises at least four regions that are substantially different so that a primer can only specifically bind to one of them, and not to the others, as explained above.
Once the transformation or conversion step (step (d), c21 or c22) is performed, the complementarity between the first and second regions will be at least partially lost, and the nucleic acid molecules may no longer be in the form of a ds molecule, but in the form of a ss molecule.
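The loss of complementarity can be quantified with a rough sketch. It assumes a c21-type conversion, a fully non-methylated hypothetical first region, strict Watson-Crick pairing (wobble pairs are ignored) and converted cytosines written as T; the sequence and the function name are illustrative assumptions only.

```python
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def paired_fraction(first: str, second: str) -> float:
    """Fraction of positions forming a Watson-Crick pair when the second
    region folds back (antiparallel) onto the first region."""
    pairs = zip(first, reversed(second))
    return sum((a, b) in PAIRS for a, b in pairs) / len(first)

# Hypothetical, fully non-methylated first region.
first = "ACGACTGCCA"
# Synthetic complementary strand (second region) before conversion.
second = first.translate(COMPLEMENT)[::-1]

before = paired_fraction(first, second)
# c21-type conversion: every non-methylated C in either region is read as T.
after = paired_fraction(first.replace("C", "T"), second.replace("C", "T"))
print(before, after)  # 1.0 0.4
```

Under these assumptions only the A/T positions of the first region remain paired, which is consistent with the molecule tending towards a ss form.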
Preferably, the conversion or transformation of modified or non-modified nucleotides (e.g., the transformation of (non-methylated) cytosines) into another nucleotide which is read distinctly from said modified or non-modified nucleotide (e.g., to uracil) in the paired nucleic acid molecules (e.g., DNA molecules) (step (d)) is performed with a deamination agent such as bisulfite or A3A, as described above, but any other reagent, reactive and/or enzymatic treatment or method/process as described in detail above (e.g., TET oxidation of modified cytosines, followed by APOBEC deamination of non-modified cytosines, treatment with the AID/APOBEC family of enzymes, etc.) may be used.
Step (ii): The capture step
The method of the present invention further comprises a step (ii), also called herein the "capture step". This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules. As used herein, a "capture probe", or simply "probe", is a nucleic acid molecule, preferably a single-stranded one, that is independent of the nucleic acid molecules of the present invention and that is designed to hybridize with the same efficiency and efficacy, or substantially with the same efficiency and efficacy, to the second region of the nucleic acid molecules of the present invention, regardless of the modification status (e.g., methylation status) of the original nucleic acid molecule(s). The probe may bind (hybridize) to the second region of the molecules of the present invention under low stringency conditions, preferably medium stringency conditions, most preferably high stringency conditions. The probe binds specifically to at least part of the second region, as described above, wherein this part of the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the original molecule which is occupied by a nucleotide susceptible of being modified. According to the present invention, "specific binding" means that the probe binds to the at least part of the second region, as described above, in a specific manner, i.e., it binds at least partially to the described second region, but it does not substantially bind to any other region in the nucleic acid molecule provided in (i). The skilled person is aware of means for designing probes and checking them for specificity, see, e.g., Singh RR., "Target enrichment approaches for next-generation sequencing applications in oncology", Diagnostics (Basel), 2022, 12(7):1539.
In one embodiment, the present invention further provides a method for designing at least one probe suitable for performing step (ii) of the method of the present invention, i.e., suitable for capturing the molecules provided in step (i) of the method of the present invention, the method comprising:
a) Selecting or identifying a region of interest within a genome and/or within a nucleic acid molecule;
b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention; and
c) Obtaining the sequence of at least a capture probe that binds to at least a portion of the second region, wherein the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
The method for designing at least one probe of the present invention may further comprise using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, provide the sequence of at least a capture probe that binds to at least a portion of the second region as defined in (c).
It is noted that the processor may comprise one or more processing units, such as a microprocessor, GPU, CPU, multi-core processor or the like. Similarly, the memory may comprise one or more volatile or non-volatile memory devices, such as DRAM, SRAM, flash memory, read-only memory, ferroelectric RAM, hard disk drives, floppy disks, magnetic tapes, optical disks or the like.
Hence, the invention also provides a computer-implemented method for designing at least one probe suitable for performing step (ii) of the method of the present invention, i.e., suitable for capturing the molecules provided in step (i) of the method of the present invention, the method comprising: a) Selecting or identifying a region of interest within a genome and/or within a nucleic acid molecule; b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method of the present invention; and c) Obtaining the sequence of at least a capture probe that binds to at least a portion of the second region, wherein the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
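Steps a) to c) can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes a c21-type conversion, a synthetic complementary strand built only from non-modified nucleotides, and converted cytosines read as T. The function names, the tiling strategy and the example region of interest are hypothetical.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def infer_second_region(roi: str) -> str:
    """Step b): the second region is the complement of the region of interest;
    being made of non-modified nucleotides, all of its C are converted to T."""
    return reverse_complement(roi).replace("C", "T")

def design_probes(roi: str, probe_len: int, step: int = 1) -> List[str]:
    """Step c): tile candidate probes complementary to the converted second
    region, keeping only windows that cover at least one modifiable locus.
    A G in the second region lies opposite a C of the region of interest."""
    second = infer_second_region(roi)
    probes = []
    for i in range(0, len(second) - probe_len + 1, step):
        window = second[i:i + probe_len]
        if "G" in window:
            probes.append(reverse_complement(window))
    return probes

# Step a): a hypothetical region of interest.
roi = "ACGACTGCCA"
print(infer_second_region(roi))   # TGGTAGTTGT
print(design_probes(roi, 10))     # ['ACAACTACCA']
```

Because the inferred second region does not depend on which cytosines of the region of interest were modified, a single probe sequence per window is obtained.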
The invention also provides a computer program comprising instructions which, when executed by a computer, cause the computer to obtain the sequence of at least a capture probe that binds to at least a portion of the second region, by implementing the method for designing at least one probe of the present invention.
The computer program product may be implemented in software, hardware, or a combination of both. The computer program product can be stored in a memory of the sequencing machine or can be saved remotely, for example, on a remote server communicatively connected to the device.
Massive sequencing methodologies such as next-generation sequencing (NGS) have enabled large-scale sequencing (of up to Terabases of sequences) in a short time (see, e.g., Shendure J. et al., "DNA sequencing at 40: past, present and future", Nature, 2017, 550(7676):345-353), and they have also lowered the cost of sequencing considerably (see, e.g., https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost).
However, even with the advances in NGS, whole genome sequencing (WGS) is still expensive, requires more sequencing yield and reagents, produces massive amounts of data that have to be analysed and interpreted, and generates the need to reconcile the associated uncertainties in data interpretations. Hence, being able to select only specific regions of the genome, such as specific disease-associated genes or other specific genes of interest would be advantageous, since the rest of the whole genome can be disregarded, simplifying downstream bioinformatics analysis and affording the ability to obtain greater depth of coverage. By only targeting specific regions such as exons, one can obtain greater depth of DNA sequencing coverage for regions of interest or increase the sampling numbers of individuals, thereby saving both time and cost, see, e.g., Kozarewa I, Armisen J. et al., "Overview of target enrichment strategies". Curr Protoc Mol Biol, 2015, 112:7.21.1-7.21.23.
Target enrichment may also be advantageous for other applications, since it allows for the specific selection of nucleic acid molecules or regions, facilitating the sequencing process and data analysis.
Several methods of target enrichment are available, but they all comprise the use of a probe or capture probe specifically designed for hybridizing with ("capturing") nucleic acid molecules comprising the region of interest. After the capture, the sample will be enriched with the molecules comprising the region of interest, and the sequencing can thus be performed only with the material of interest.
As explained above, in the method of the present invention, since the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in the region of interest which is occupied by a nucleotide susceptible of being modified (regardless of the epigenetic modifications present in the one or more original molecules), an efficient capture step can be carried out. The bias associated with the differences in the epigenetic modifications in the one or more original molecules is thus eliminated. Hence, with the method of the present invention, a region of interest can be enriched with the same efficiency and efficacy, regardless of the modification status (epigenetic modifications) of the original molecules.
As explained above, in order to capture nucleic acid molecules that share a certain sequence, but which have different status of epigenetic modifications at the same locus, one possibility would be to design one probe for each of the possible molecules (with different epigenetic modifications), to capture all molecules. This possibility is expensive and time-consuming, since a large number of probes would be needed when the sample contains multiple possible epigenetic modifications.
For instance, the capture probe binds to at least a part of the second region of the plurality of molecules provided in step (i), including a nucleotide which is located at a position corresponding to a locus in the ROI which is occupied by a nucleotide susceptible of being modified. Hence, the probe "captures" the nucleic acid molecules provided in (i), because it binds to at least part of the second region of these molecules.
The overlap region (the "binding region") between the captured region and the probe can be shorter than 20 nucleotides, such as from 11 to 20 nucleotides, for instance from 13 to 18 nucleotides.
For instance, as shown in Figures 1 and 2, there are eight nucleic acid molecules whose first regions differ in sequence (e.g., Figures 1A and 1C) as a result of the different methylation status of their corresponding original molecules (e.g., Figures 1B and 1D). If a single capture probe (e.g., TAACAACTAC) is designed to target the first region, said capture probe will not bind with the same efficiency and efficacy to all of the nucleic acid molecules of the present invention, as they vary in sequence, see e.g., Figure 2.
Hence, the capture probe would bind better to some of the molecules (if the capture probe shows higher complementarity to them) than to other molecules (if the capture probe shows low complementarity to those molecules because of the differences in sequence). This will cause a bias in the capture step, where some molecules will be captured better than others.
However, following the method of the present invention, if the capture probe is designed against the second region (e.g., ACAACTACCA, see Figures 1A and 1C), all molecules will be captured with the same efficiency or efficacy, regardless of the epigenetic modifications present in the corresponding loci in the original molecule(s) (Figures 1B and 1D).
For instance, in the example shown in Figure 1A, the capture probe binds with the same affinity to all the nucleic acid molecules of the present invention since the capture probe is 100% complementary to all of the second regions. Thus, the capture step of the present method does not cause any bias, making the method more efficient.
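The difference between targeting the first region and targeting the second region can be illustrated by counting probe mismatches over every possible methylation state. The sketch below assumes a c21-type conversion, a two-state (methylated or non-methylated) model for each cytosine, and a hypothetical region of interest; all names are illustrative and the sequence is not taken from the figures.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(probe, target):
    """Number of positions at which the probe fails to base-pair with the target."""
    return sum(a != b for a, b in zip(reverse_complement(probe), target))

def converted_regions(roi, methylated):
    """Return the (first, second) regions after a c21-type conversion.
    `methylated` is the set of indices of the ROI carrying a methylated C."""
    first = "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(roi)
    )
    # The synthetic strand carries no modified C, so all of its C are converted.
    second = reverse_complement(roi).replace("C", "T")
    return first, second

roi = "ACGACTGCCA"  # hypothetical region of interest
c_sites = [i for i, base in enumerate(roi) if base == "C"]

# Probe against the first region (perfect match only if every C is methylated)
# and probe against the second region.
probe_first = reverse_complement(roi)
probe_second = reverse_complement(converted_regions(roi, set())[1])

first_mm, second_mm = set(), set()
for states in product((True, False), repeat=len(c_sites)):
    methylated = {i for i, is_meth in zip(c_sites, states) if is_meth}
    first, second = converted_regions(roi, methylated)
    first_mm.add(mismatches(probe_first, first))
    second_mm.add(mismatches(probe_second, second))

print(sorted(first_mm), sorted(second_mm))  # [0, 1, 2, 3, 4] [0]
```

In this model the first-region probe shows between 0 and 4 mismatches depending on the methylation state, whereas the second-region probe matches perfectly in all 16 states.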
In an embodiment, the capture probe is attached to a support that facilitates the purification and/or immobilization of the nucleic acids captured with the method of the present invention. The term "support", as used herein, refers to any material configured to chemically bond with a nucleic acid including but not limited to plastic, latex, glass, metal (i.e., for example a magnetized metal), nylon, nitrocellulose, quartz, silicon or ceramic. The support is preferably solid and may be roughly spherical (i.e., for example a bead) or may comprise a standard laboratory container such as a microwell plate or a surface.
The term "immobilized", as used herein, refers to the association or binding between the molecule (e.g., the capture probe) and the support in a manner that provides a stable association under the conditions of elongation, amplification, excision, and other processes as described herein. Such binding can be covalent or non-covalent. Non-covalent binding includes electrostatic, hydrophilic and hydrophobic interactions. Covalent binding is the formation of covalent bonds that are characterized by sharing of pairs of electrons between atoms. Such covalent binding can be directly between the capture probe and the support or can be formed by a cross linker or by inclusion of a specific reactive group on either the support or the adapter or both. Covalent attachment of a probe can be achieved using a binding partner, such as avidin or streptavidin, immobilized to the support and the non-covalent binding of the biotinylated adapter to the avidin or streptavidin. Immobilization may also involve a combination of covalent and non-covalent interactions.
The capture probe may be synthesized first, with subsequent attachment to the support. Alternatively, the capture probe may be synthesized directly on the support.
In some other embodiments, the capture step or step (ii) is performed over the nucleic acid molecules of the present invention that are on the supernatant of the reaction vessel, so that the capture probe is not attached to any support.
In some embodiments, the capture probe is conjugated to one or more molecules such as one or more chromophores, fluorophores, beads, etc. In some embodiments, the capture probe comprises or is conjugated to tags or one or more recognition molecules (e.g., streptavidin, avidin, neutravidin, horseradish peroxidase, alkaline phosphatase, antibodies, etc.).
Optional step (iii): determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules and/or determining the epigenetic status of the original molecule(s) within the ROI
Optionally but preferably, the method of the present invention further comprises a step (iii) of determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules. With this step, it is possible to ascertain the epigenetic status of the plurality of molecules in a sample in the region of interest. In this step (iii), the first and second regions of the plurality of nucleic acid molecules provided in step (i) and captured in step (ii) are sequenced and/or analysed.
The skilled person is aware of means of sequencing the molecules captured in step (ii) of the method of the present invention. The sequencing can take place by using one or more of the currently available sequencing technologies or platforms (e.g., Illumina, Roche, Ion Torrent, etc. sequencing platforms).
The sequencing can be performed either at low scale, which consists in the analysis of selected fragments, or at high throughput (also named genome-scale), which consists in the massive analysis of all or a large representation of the whole material, such as Next Generation Sequencing (NGS) approaches. The length of the fragment that can be analysed depends on the sequencing methodology used. Current state-of-the-art sequencing techniques aimed at the genomic scale, and most of the locus-specific ones, assess ss nucleic acid molecules (such as DNA strands) separately.
The term "sequencing" or the expressions "determining the sequence" or "sequence determination" and the like, such as "determining the base identity" or "determining the identity of a base" means the determination of the information relating to the nucleotide base sequence of a nucleic acid, particularly involving determination and ordering of a plurality of contiguous nucleotides in a nucleic acid. Said information may include the identification or determination of partial as well as full sequence information of the nucleic acid molecule. Said information refers, e.g., to the primary sequence of a DNA molecule, such as a ss or ds DNA molecule or to the epigenetic modifications (for example methylations or hydroxymethylations), or both. The sequence information may be determined with varying degrees of statistical reliability or confidence.
The determination of the primary sequence of a DNA molecule includes the detection of mutations or genetic variants such as polymorphisms (SNPs, INDELs, etc.). By analysing the output of the sequencing, each read will provide information regarding the primary sequence (including mutations and SNPs) and the epigenetic modifications (e.g., methylation status) of the one or more original nucleic acid sequences within a ROI.
The methods described herein may be useful in identifying and/or distinguishing epigenetic modifications (i.e., for ascertaining the epigenetic modification status of one or more nucleic acid molecules), as explained above, e.g., cytosine (C), 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) and 5-formylcytosine (5fC) in the one or more original nucleic acid sequences within the ROI. For example, methods described herein may be useful in distinguishing one residue from the group consisting of cytosine (C), 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) and 5-formylcytosine (5fC) from the other residues in the group. Thus, the sequencing step may comprise identifying the presence of modified cytosine residues in the one or more original nucleic acid sequences within the ROI.
In some cases, the method further comprises diagnosing a condition in the subject based at least in part on the sequencing information provided in step (iii) of the method of the present invention. The condition may be any condition, a trait or even aging, obesity, etc. For instance, the condition may be cancer, which can be selected from a sarcoma, a glioma, an adenoma, leukemia, such as chronic lymphocytic leukemia (CLL), bladder cancer, breast cancer, colorectal cancer (CRC), endometrial cancer, kidney cancer, liver cancer, lung cancer, melanoma, non-Hodgkin lymphoma, pancreatic cancer, prostate cancer, thyroid cancer, etc. The condition may also be a neurodegenerative condition, such as Alzheimer's disease, frontotemporal dementia, amyotrophic lateral sclerosis, Parkinson's disease, spinocerebellar ataxia, spinal muscle atrophy, Lewy body dementia, or Huntington's disease. The condition may also be any inherited or environmental disease or any rare or common disease or any trait not necessarily linked to disease. The condition may be caused by or be related to the epigenetic modifications in one or more nucleotides susceptible of conversion comprised in a ROI of the genome of the subject.
Hence, the present invention further comprises an in vitro method for diagnosing a condition, the method comprising the steps of:
(1) Selecting or identifying a region of interest relevant to the condition to be diagnosed within the genome of a patient;
(2) Providing a plurality of nucleic acid molecules as defined in step (i) of the method of the present invention from a sample obtained from the patient;
(3) Capturing the molecules provided in (i) by using a probe as defined in step (ii) of the method of the present invention;
(4) Determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules and/or determining the epigenetic status of the original molecule(s) within the ROI; and
(5) Diagnosing a condition in the subject based at least in part on the information provided in step (4).
The method of the present invention may further comprise a step of determining the true identity of a base at a certain position (locus) in a ROI, based on the information provided in step (iii). In the context of the present invention, a "true identity" is the identity of the base originally present at a certain position (locus) in the original nucleic acid molecule (e.g., A, C, G, T, U, or any modification thereof, such as a modified nucleotide, e.g., a modified cytosine, such as a methylated cytosine, mC (e.g., 5mC, 5hmC and/or 5fC)).
Hence, once the plurality of nucleic acid molecules provided in step (i) are captured and sequenced, information regarding the sequence of at least part (and preferably all) of the first and second regions is provided. In particular, at least two sources of information for at least one, preferably for each one, of the first and second regions are provided. Since the first and second regions of the nucleic acid molecule provided in step (i) of the present invention provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule, at least two sources of information on the base identities at the corresponding loci in the original nucleic acid molecule and corresponding ROI are provided.
Hence, the method of the present invention also allows for the determination of the base identities in the original molecule (in the corresponding ROI), including the epigenetic modifications, with reduced error.
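By way of a non-limiting illustration, the following sketch shows one way in which several independent observations of the same locus could be combined into a single base call. It is a simplified heuristic and not the algorithm of the invention: the function name `consensus_base`, the input layout, and the rule of summing Phred-scaled qualities (Q = -10·log10 of the error probability) are assumptions made for this example only, and conversion-related differences between the first and second regions are not modelled.

```python
from collections import defaultdict


def consensus_base(observations):
    """Combine independent observations of one locus (hypothetical helper).

    observations: list of (base, phred_quality) pairs, one per read.
    Returns (called_base, net_support), where net_support is the summed
    quality of the winning base minus the summed quality of all others.
    """
    if not observations:
        raise ValueError("need at least one observation")
    weight = defaultdict(float)
    for base, qual in observations:
        # Assumption: read errors are independent, so qualities are summed.
        weight[base.upper()] += qual
    # sorted() makes tie-breaking deterministic (alphabetical order).
    best = max(sorted(weight), key=weight.get)
    conflicting = sum(w for b, w in weight.items() if b != best)
    return best, max(weight[best] - conflicting, 0.0)


# Three reads support C, one lower-quality read reports T.
call = consensus_base([("C", 30), ("C", 30), ("T", 20), ("C", 25)])
```

In this example the call is C, supported by a summed quality of 85 against 20 for the conflicting read.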
In other cases, the method of the present invention further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, determine the identity of a base (e.g., the true base) at a certain position (locus) in an original nucleic acid molecule (in the corresponding ROI), based on the sequencing information provided in step (iii).
It is noted that the processor may comprise one or more processing units, such as a microprocessor, GPU, CPU, multi-core processor or the like. Similarly, the memory may comprise one or more volatile or non-volatile memory devices, such as DRAM, SRAM, flash memory, read-only memory, ferroelectric RAM, hard disk drives, floppy disks, magnetic tapes, optical disks or the like.
The present invention thus further provides a computer program comprising instructions which, when executed by a computer, is able to determine the identity (and/or a BQ score or probability of being an error) of a true base at a certain position (locus) in an original nucleic acid molecule (in the corresponding ROI), based on the information provided in step (iii) of the method of the present invention.
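For illustration only, a base quality (BQ) score and a probability of error are commonly related through the Phred scale, Q = -10·log10(p). The sketch below assumes that the BQ score mentioned above follows this convention; the function names are hypothetical and are not taken from the present description.

```python
import math


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality back to an error probability."""
    return 10.0 ** (-q / 10.0)


# A base call with a 1-in-1000 chance of being wrong has a quality of about 30.
q30 = error_prob_to_phred(0.001)
```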
The present invention also further provides a computer program comprising instructions which, when executed by a computer, is able to implement any of the methods disclosed in the present document. Therefore, such a computer program is communicatively coupled to the electronic components of a sequencing machine.
The computer program product may be implemented in software, hardware, or a combination of both. The computer program product can be stored in a memory of the sequencing machine or can be saved remotely, for example, on a remote server communicatively connected to the device.
Preferably, step (iii) of the method of the present invention comprises using at least two different primers, such as two, three or four different primers, preferably at least three primers, even more preferably four different primers, to sequence the molecule provided in step (i). Hence, step (iii) comprises the use of at least two different primers, preferably at least three different primers, even more preferably four different primers, and the sequencing of the molecules provided in step (i) using the at least two different primers, preferably at least three different primers, even more preferably four different primers. Step (iii) thus provides sequence information of the molecule provided in step (i) of the method of the present invention.
For the sequencing step, it is preferable that the molecules provided in step (i) of the present invention further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule, for instance as described above.
Sequencing a nucleic acid molecule can comprise the determination of the identity of the base (e.g., adenine (A), cytosine (C), thymine (T), guanine (G), uracil (U) and modifications thereof, such as methylcytosines (5mC, 5hmC), etc.) present at the specific locus in the original nucleic acid molecule (in the corresponding ROI).
In one embodiment, the at least two different primers, such as two, three or four different primers, preferably three or four different primers, bind at least to four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of the first region of the molecule, to sequence at least part of the first region of at least one of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to the third region which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of at least one of the nucleic acid molecules provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least a portion of the second region of the molecule, to sequence at least part of the second region of at least one nucleic acid molecule provided in (i);
4. At least one of the primers is capable of binding (hybridizing) at least partially to the third region which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of at least one of the nucleic acid molecules provided in (i).
Preferably, step (iii) comprises using at least two different primers, preferably at least three different primers, even more preferably four different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three, preferably to at least four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of the nucleic acid molecule provided in (i);
2. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
3. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and/or
4. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
In one embodiment, step iii) comprises using at least two different primers, preferably at least three different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers (e.g., the first primer) is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i);
2. At least one of the primers (e.g., the second primer) is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
3. At least one of the primers (e.g., the third primer) is capable of binding (hybridizing) at least partially either:
3.1. to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i); or
3.2. to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
Preferably, the at least two different primers, such as two, three or four different primers, preferably three or four different primers, bind at least to four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
4. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
In one embodiment, step iii) comprises using at least two different primers, preferably at least three different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and
3. At least one of the primers is capable of binding (hybridizing) at least partially to region 2 or 2' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i).
In one embodiment, step iii) comprises using at least two different primers, preferably at least three different primers, to sequence the molecule provided in step (i), wherein the at least two different primers bind to at least three different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of an adapter at the 5' end of the nucleic acid molecule provided in (i), if present, otherwise, it is capable of binding (hybridizing) at least partially to at least a portion of region 1 or 1' of the molecule, to sequence at least part of the first region of the nucleic acid molecules provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the region 3 or 3' of the molecule, to sequence at least part of the second region of the nucleic acid molecules provided in (i); and
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), if present, otherwise it is capable of binding (hybridizing) at least partially to at least region 4 or 4' of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i).
Preferably, the at least two different primers, such as two, or at least three or at least four different primers, preferably four different primers, bind to at least four, preferably to four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
4. At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
The primer may at least partially bind (hybridize) to the above sequences under low stringency conditions, preferably medium stringency conditions, most preferably high stringency conditions. See, e.g., Figure 3.
The binding of the primers to the at least three, preferably to the at least four, different regions in the nucleic acid molecules provided in (i) may be performed simultaneously (i.e., all two, three or four primers at the same time) or non-simultaneously. Preferably, the binding of the primers to the at least three, preferably to the at least four, different regions in the nucleic acid molecule provided in (i) is not performed simultaneously.
In a preferred embodiment, the binding of the at least two different primers, such as two, three or four different primers, preferably three or four different primers, to the at least three, preferably at least four, different regions in the nucleic acid molecule provided in (i) is specific binding. This means that the primer(s) bind to the above-described regions in the molecule in a specific manner, i.e., they bind to the above-described regions, but do not substantially bind to any other region in the nucleic acid molecule provided in (i). The skilled person is aware of means for designing primers and checking them for specificity. See, e.g., the primer designing tool provided by the National Library of Medicine (NIH) (Primer designing tool (nih.gov)), or the "How to: Design PCR primers and check them for specificity" from the National Library of Medicine (NIH) (Design PCR primers and check them for specificity (nih.gov)).
In a preferred embodiment, the at least two different primers which bind to at least four different regions in the plurality of nucleic acid molecules provided in (i) are four different primers, each specifically at least partially binding (hybridizing) to the regions 1-3 and/or 1-4, preferably 1-4, as described above.
For example, for Illumina sequencing, primers 1 (capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of the nucleic acid molecule provided in (i)) and 2 (capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the second region of the nucleic acid molecule provided in (i)) should be different, and primers 3 (capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i)) and 4 (capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of the nucleic acid molecule provided in (i)) should also be different. However, it is not essential that primers 1 and 3, or 1 and 4, or 2 and 3, or 2 and 4, as defined above, are different.
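The four-primer arrangement and the pairwise constraint described above for Illumina sequencing can be summarised, purely as an illustrative sketch, in the following data structure and check. The names `READ_PLAN`, `MUST_DIFFER` and `check_primer_set` are hypothetical, and the primer sequences in the usage line are placeholders, not sequences of the invention.

```python
# Primer number -> where it binds and which region it is used to sequence,
# following the numbering used in the text above.
READ_PLAN = {
    1: {"binds": "adapter at the 5' end", "reads": "first region"},
    2: {"binds": "linking (third) region", "reads": "second region"},
    3: {"binds": "adapter at the 3' end", "reads": "second region"},
    4: {"binds": "linking (third) region", "reads": "first region"},
}

# Pairs that should differ according to the text: (1, 2) and (3, 4).
MUST_DIFFER = [(1, 2), (3, 4)]


def check_primer_set(sequences):
    """Return the list of primer pairs that violate the constraint.

    sequences: dict mapping primer number (1-4) to its sequence string.
    """
    return [
        (a, b)
        for a, b in MUST_DIFFER
        if sequences[a].upper() == sequences[b].upper()
    ]


# Placeholder sequences: primers 1 and 3 may coincide, 1 and 2 may not.
violations = check_primer_set({1: "ACGT", 2: "ACGT", 3: "ACGT", 4: "TTAA"})
```

Here `violations` contains only the pair (1, 2), since identical primers 1 and 3 are permitted.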
Hence, the at least two different primers, preferably at least three different primers, even more preferably four different primers, provided in step (iii) of the method of the present invention may be used to sequence the molecules provided in step (i) and captured in step (ii). The sequencing can take place by using one or many of the currently available sequencing technologies (e.g., Illumina, Roche, Ion Torrent, etc. sequencing platforms).
As described herein, and as will be understood by the skilled person, the fact that the at least two different primers (such as three or four different primers) bind to at least three, preferably to at least four different regions in the molecule provided in step (i), and that both the 5' and 3' regions are to be at least partially sequenced, implies that, for the (at least partial) sequencing of:
(1) the second region of the nucleic acid molecule provided in step (i), using the primer binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i); and
(2) the first region of the nucleic acid molecule provided in step (i), using the primer binding (hybridizing) at least partially to at least a portion of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i),
the complementary sequence of the nucleic acid molecules provided in (i) needs to be synthesized and amplified (cluster amplification for parallel sequencing by synthesis). This is because, using primers, the nucleic acid molecules are synthesized using the so-called "sequencing by synthesis" technique of next generation sequencing, such as Illumina sequencing, which makes use of the synthesis of the original and the complementary strands to read the sequence of a certain nucleic acid molecule. For instance, a primer attaches to the forward strand adapter primer binding site, and a polymerase adds a fluorescently tagged dNTP to the DNA strand. Only one base can be added per round because the fluorophore acts as a blocking or synthesis terminator group; however, the blocking group is reversible. Using the four-color chemistry, each of the four bases has a unique emission, and after each round, the machine records which nucleotide was added. Once the colour is recorded, the fluorophore is washed away, another dNTP is washed over the flow cell, and the process is repeated. Since the polymerase adds nucleotides to the 3' end of a nucleic acid (DNA) strand, the nucleic acid molecule to be sequenced needs to be read in the 5' to 3' direction. Hence, the use of at least two different primers, preferably at least three different primers, even more preferably four different primers, for sequencing the first and second regions (preferably twice) of the nucleic acid molecule provided in step (i) implies that sequencing the first and second regions using primers (a) binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i); and (b) binding (hybridizing) at least partially to at least a portion of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), takes place in a strand complementary to the nucleic acid molecule provided in step (i).
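Because some reads are obtained from the strand complementary to the molecule provided in step (i), such reads would typically be reverse-complemented before being compared, position by position, with reads obtained from the molecule itself. The following minimal sketch illustrates that orientation step only; the helper name is hypothetical and modified bases are not represented.

```python
# Translation table for the four canonical bases plus N (unknown).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string."""
    return seq.translate(_COMPLEMENT)[::-1]


# A read taken on the complementary strand, brought back to the
# orientation of the molecule provided in step (i).
oriented = reverse_complement("AACG")
```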
Thus, the method of the present invention can be used to reduce uncertainty and overall error rate in the determination of a sequence of a polynucleotide (e.g., an original DNA polynucleotide), mainly before requiring alignment to a reference genome (or reference nucleic acid sequence). The methods of the present invention thus provide more than two, such as three, and preferably up to four sources of independent information from a single-stranded nucleic acid molecule as described in step (i) (e.g., preferably up to eight sources of independent information if we consider a double-stranded nucleic acid molecule) regarding the base identity in each corresponding locus in an original nucleic acid molecule. The methods of the present invention provide more than two, such as three, and preferably up to four sources of information regarding the base identity for each corresponding locus in an original nucleic acid molecule as described in step (i) (preferably up to eight sources of information if we consider a double-stranded nucleic acid molecule). Since each of the four nucleotides can be read in different sequence contexts in the nucleic acid provided in step (i) of the method of the present invention, errors that are biased by the sequence preceding and following the base to be analysed are reduced with the method of the present invention. Since every nucleotide of the original molecule is represented more than two, such as three, and preferably up to four times, the raw probability of error for each base can be greatly reduced, mainly at the pre-mapping step but also at post-mapping steps. Reducing the error rate at the pre-mapping step also improves the mapping quality of each read, which in turn reduces mapping errors and therefore variant calling errors. Knowing exactly where every insert (first and second regions) starts and ends also improves the mapping and the calling of SNPs, but mainly the calling of INDELs and other types of rearrangements. Having UMIs (optional) at the beginning and end of every insert improves the sequencing at the beginning of every read, allows deduplication, which becomes crucial when doing enrichment, and reduces the number of uninformative cycles of sequencing and unnecessary bioinformatic resources. It also allows linking the dsDNA strands of the original molecule if they have been separated during the procedure.
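As a non-limiting sketch of the deduplication mentioned above, reads sharing the same pair of UMIs (one at the beginning and one at the end of the insert) may be treated as copies of one original molecule, and a single representative kept. The record layout, the function name and the rule of keeping the highest-quality copy are assumptions for this example; practical pipelines usually also take mapping coordinates and UMI sequencing errors into account.

```python
def deduplicate_by_umi(reads):
    """Keep one representative insert sequence per UMI pair.

    reads: iterable of (umi_start, umi_end, insert_sequence, mean_quality).
    Returns a dict mapping (umi_start, umi_end) to the insert sequence of
    the highest-quality read observed for that UMI pair.
    """
    best = {}
    for umi_start, umi_end, sequence, quality in reads:
        key = (umi_start, umi_end)
        # Replace the stored read only if this copy has a higher quality.
        if key not in best or quality > best[key][1]:
            best[key] = (sequence, quality)
    return {key: value[0] for key, value in best.items()}


unique = deduplicate_by_umi([
    ("AAC", "GGT", "ACGTACGT", 31.0),
    ("AAC", "GGT", "ACGTACGA", 22.0),  # PCR duplicate with a likely error
    ("TTG", "CCA", "GGGTTTAA", 35.0),
])
```

In this example two distinct molecules remain, and the lower-quality duplicate of the first one is discarded.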
Therefore, once the at least two different primers have bound to at least three, preferably to at least four different regions in the nucleic acid molecule as described in the method step (ii) of the present invention, the computer program comprises instructions to perform a locus analysis between different readings to determine the true base at a certain locus of the original nucleic acid to be sequenced. It is noted that from the present description the skilled person may envisage different ways in which locus analysis may be performed, all of them comprised within this invention.
It is noted that a sequencing machine usually comprises samples, trays, incubators, consumables, micropipetting systems, and many other elements within it, that enable the full automation of the sequencing of a particular nucleic acid molecule. The present embodiment is thus not limited to sequencing machines comprising just these elements but extends to any other machines capable of automating any of the methods disclosed in the present document, as the person skilled in the art may envisage.
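One of the many possible ways of performing such a locus analysis is sketched below for illustration. It assumes that the different readings have already been brought to the same orientation and aligned so that index i refers to the same locus in every reading, and it uses a simple majority vote. The function name and the output layout are hypothetical, and positions that differ because of a conversion treatment (which would carry epigenetic information rather than error) are not handled here.

```python
from collections import Counter


def locus_analysis(readings):
    """Compare aligned readings position by position (hypothetical helper).

    readings: list of equally oriented, aligned sequence strings.
    Returns a list of (called_base, fraction_of_readings_agreeing) tuples,
    one per position; 'N' calls in a reading are ignored.
    """
    length = min(len(r) for r in readings)
    calls = []
    for i in range(length):
        counts = Counter(r[i] for r in readings if r[i] != "N")
        if not counts:
            calls.append(("N", 0.0))
            continue
        # sorted() keeps tie-breaking deterministic (alphabetical order).
        base, n = max(sorted(counts.items()), key=lambda kv: kv[1])
        calls.append((base, n / sum(counts.values())))
    return calls


# Four readings of the same stretch; the third disagrees at position 1.
calls = locus_analysis(["ACGT", "ACGT", "ATGT", "ACGN"])
```

Positions where the agreeing fraction is below 1.0 could then be flagged for closer inspection.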
Kit of the present invention
The present invention further provides a kit comprising at least two different primers, wherein the at least two different primers are capable of at least partially binding to at least three, preferably to at least four different regions in the nucleic acid molecule provided in step (i) of the method of the present invention.
In one embodiment, the at least two, such as three or preferably four different primers are capable of at least partially binding to at least three, preferably to at least four different regions in the nucleic acid molecule provided in step (i) of the method of the present invention, wherein:
1. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of the nucleic acid molecule provided in (i);
2. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
3. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i); and/or
4. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
In one embodiment, at least one of the primers (e.g., the first primer) is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i). In addition, at least one of the primers (e.g., the second primer) is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of the nucleic acid molecule provided in (i). Finally, at least one of the primers (e.g., the third primer) is capable of binding (hybridizing) at least partially either: to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the second region of the nucleic acid molecule provided in (i); or to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of the nucleic acid molecule provided in (i).
In a preferred embodiment, the invention provides a kit comprising at least two different primers, such as at least three different primers, preferably four different primers, wherein the at least two different primers, such as at least three different primers, preferably four different primers, are capable of at least partially binding to at least four different regions in the nucleic acid molecule provided in step (i) of the method of the present invention, wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), to sequence at least part of the second region of the nucleic acid molecule provided in (i);
4. At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence at least part of the first region of the nucleic acid molecule provided in (i).
In a preferred embodiment, the kit of the present invention further comprises instructions for its use.
In another preferred embodiment, the kit of the present invention further comprises a double stranded adapter for use in the method for the generation of the nucleic acid molecule provided in step (i) of the method of the present invention, wherein the adapter comprises a first nucleic acid strand and a second nucleic acid strand, wherein the second region of the first nucleic acid strand and the first region of the second nucleic acid strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the second region of the first nucleic acid strand and the first region of the second nucleic acid strand of the adapter are compatible with the ends of a double stranded nucleic acid molecule, wherein the double-stranded region of the adapter comprises one or more barcode sequence(s), and wherein the second region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said second region, the first segment being located at the 3' end of the second region and the second segment being located in the vicinity of the first region of the second strand, and/or wherein the adapter comprises at least one barcode sequence in the single stranded region of the adapter, wherein the barcode sequences consist of unique identifiers that allow identification of a specific construct comprising said identifier and its amplification products, and wherein compatible means that the ends of said double stranded region of the adapter molecule are capable of being ligated to one end or to both ends of a double stranded nucleic acid molecule.
Preferably, the adapter has a restriction site in the first region of the first strand of the adapter. In another preferred embodiment, the adapter comprises at least one barcode sequence in the single stranded region of the adapter and wherein the second region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said second region, the first segment being located at the 3' end of the second region and the second segment being located in the vicinity of the first region of the second strand.
In another preferred embodiment, the kit further comprises:
(i) a library of double-stranded adapters, said adapters comprising a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity and wherein the ends of said double stranded region are compatible with the ends of double stranded nucleic acid molecules;
(ii) a plurality of elongation primers, wherein each elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule as defined in (i) and which, after hybridization with the second strand of the adapter molecule creates overhanging ends; and
(iii) a plurality of hairpin adapters, wherein each hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer as defined in (ii) to the second strand of the Y-adapter as defined in (i), wherein the elongation primers of (ii) and the hairpin adapters of (iii) may be provided as a complex;
wherein the adapters of (i), the elongation primers of (ii) and the hairpin adapters of (iii) are suitable for obtaining a library of adapters for use in the method for the generation of the nucleic acid molecule provided in step (i) of the method of the present invention.
Preferably, the kit comprises one or more of the adapters of the invention, as defined above under section "adapters of the invention". Preferably, the kit comprises one or more adapters comprising the sequence of SEQ ID NO: 44, 45, 46, 47, 48, 49, 50, 51, or 52, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44, 45, 46, 47, 48, 49, 50, 51, or 52, respectively. Preferably, the kit comprises at least the adapters comprising the sequence of SEQ ID NO: 44 and 45, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 44, or 45, respectively.
Preferably, the kit comprises at least the adapters comprising the sequence of SEQ ID NO: 46 and 47, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 46 or 47 respectively, and an optional adapter comprising the sequence of SEQ ID NO: 52, or a sequence with at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% sequence identity to SEQ ID NO: 52.
ADAPTERS OF THE INVENTION
The nomenclature used in the following sequence listing is as follows:
"M" denotes methylated cytosine,
"C" denotes non-methylated cytosine,
"A" denotes adenine,
"G" denotes guanine, and "T" denotes thymine.
SEQ ID NO: 44 (E9 full length): GMTMTTMMGATMTGGMGTGGMAG
SEQ ID NO: 45 (E9 full length):
CTGCMACGCMGTGCCTCAGGCTCCGATCGAGTGTTGTCTCGATCGGAGCCTGAGGCAC
SEQ ID NO: 48 (E9 duplex): GGMGTGG
SEQ ID NO: 49 (E9 duplex): CMACGCM
SEQ ID NO: 46 (E15 full length): GMTMTTMMGATMTGGMGTGGMAG
SEQ ID NO: 47 (E15 full length): CTGCMACGCMGTGCCTCAG
SEQ ID NO: 50 (E15 duplex): GGMGTGG
SEQ ID NO: 51 (E15 duplex): CMACGCM
SEQ ID NO: 52 (E15 hairpin):
GCTCCGATCGAGTGTTGTCTCGATCGGAGCCTGAGGCACGGCGTGGCAG
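The nomenclature above can be handled programmatically. The following is a minimal sketch, not part of the disclosed method, that decodes the "M" notation and checks that the two E9 duplex sequences (SEQ ID NO: 48 and 49) are reverse complements of each other once methylation is ignored. The helper names `plain_bases`, `methylated_positions` and `reverse_complement` are hypothetical and chosen only for this illustration.

```python
# Illustrative helpers for the "M" = methylated cytosine notation used above.
# The function names are assumptions made for this sketch.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def plain_bases(seq: str) -> str:
    """Replace 'M' (methylated cytosine) with 'C' to get the plain base sequence."""
    return seq.replace("M", "C")


def methylated_positions(seq: str) -> list[int]:
    """Return the 0-based positions that carry a methylated cytosine ('M')."""
    return [i for i, base in enumerate(seq) if base == "M"]


def reverse_complement(seq: str) -> str:
    """Reverse complement of a plain A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


SEQ_ID_48 = "GGMGTGG"  # E9 duplex
SEQ_ID_49 = "CMACGCM"  # E9 duplex

# The two duplex sequences pair over their full length when 'M' is read as 'C'.
duplex_ok = reverse_complement(plain_bases(SEQ_ID_48)) == plain_bases(SEQ_ID_49)
print(duplex_ok)                        # True
print(methylated_positions(SEQ_ID_48))  # [2]
print(methylated_positions(SEQ_ID_49))  # [1, 6]
```

Keeping the methylation marks separate from the base sequence, as done here, allows ordinary sequence operations to be applied without losing the positions of the methylated cytosines.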
ITEMS OF THE PRESENT INVENTION
1. A method comprising the steps of:
(i) providing a plurality of nucleic acid molecules, wherein each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the nucleotide sequence of the second region is identical or at least substantially identical in each of the plurality of nucleic acid molecules at at least a position corresponding to the same locus in a region of interest which is occupied by a nucleotide susceptible of being modified, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified;
wherein at least two of the nucleic acid molecules of the plurality of nucleic acid molecules may have (and preferably have) a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified and wherein, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules, preferably wherein: a) - the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof; and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof or a transformed modified nucleotide or a copy thereof; or b) - the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules a transformed modified nucleotide, or a copy thereof; and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecule, a transformed modified nucleotide, or a copy thereof; or a modified nucleotide or a copy thereof,
and
(ii) Capturing the molecules provided in (i) by using at least one capture probe that binds to at least a portion of the second region which comprises at least one nucleotide which is located at a position which corresponds to a locus in the original molecule which is occupied by a nucleotide susceptible of being modified.
2. The method according to item 1, wherein the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, at least a modified nucleotide or a copy thereof, preferably wherein the at least one modified nucleotide is a methylated cytosine and/or the copy thereof is an unmethylated cytosine.
3. The method according to any one of items 1 or 2, wherein the first region is in the 5' region of the nucleic acid molecule, and the second region is in the 3' region of the nucleic acid molecule.
4. The method according to any one of the preceding items, wherein the first region of the at least one of the plurality of the nucleic acid molecules provided in step (i) is a fragment of genomic DNA.
5. The method according to any one of the preceding items, wherein the first and the second region are bound by a linker.
6. The method according to item 5, wherein the linker comprises a nucleotide sequence that is identical or substantially identical in all of the plurality of nucleic acid molecules.
7. The method according to any one of items 5 or 6, wherein the linker is a nucleotide sequence with a length of at least 5 nucleotides, preferably with a length of at least 10 nucleotides, even more preferably with a length of at least 17 nucleotides.
8. The method according to any one of the preceding items, wherein the plurality of nucleic acid molecules provided in step (i) are DNA molecules.
9. The method according to any one of the preceding items, wherein at least one of the plurality of nucleic acid molecules provided in step (i) further comprises:
- One adapter at the 5' end of the molecule; and/or
- One adapter at the 3' end of the molecule.
10. The method according to any one of the preceding items, wherein the plurality of nucleic acid molecules in step (i) is provided by: a) Providing one or more original nucleic acid molecules, preferably wherein the one or more original nucleic acid molecules are fragments of genomic DNA; b) Ligating one adaptor to at least one end of the one or more original nucleic acid molecules provided in a), thereby obtaining one or more adaptor-containing original nucleic acid molecules, wherein the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase; c) Synthesizing, for each of the one or more adaptor-containing original nucleic acid molecules obtained in step b), a complementary strand, the "synthetic complementary strand", by polymerase elongation of the 3' end of the adaptor molecule, using the one or more adaptor-containing original nucleic acid molecules obtained in step b) as template, thereby pairing the one or more original nucleic acid molecules obtained in step b) with its synthetic complementary strand, to provide one or more paired double-stranded adaptor-containing nucleic acid molecules; and d) Treating the molecules generated with an agent or method or process capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur, i.e.: c21) Converting non-modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules; and/or c22) Converting modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules.
11. The method according to item 10, wherein the one or more original nucleic acid molecules are double-stranded (ds) nucleic acid molecules, preferably genomic ds DNA, and wherein the adaptors of step b) are ds adaptors.
12. The method according to any one of items 10 or 11, wherein at least a portion of the adaptors has a nucleotide sequence common to all the adaptors used in step b).
13. The method according to one or more of items 11 to 12, wherein the adaptor has a first barcode sequence in the double stranded region and/or a second barcode sequence in the 5' region of the second strand of the adaptor.
14. The method according to one or more of items 11 to 13, wherein the strands of the double-stranded original nucleic acid molecules are further paired by using barcode sequences in step b).
15. The method according to item 14, wherein the pairing can be performed either before or after the ligation, or simultaneously with the ligation.
16. The method according to any one of the preceding items, wherein the method further comprises a step (iii) of determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules.
17. The method according to item 16, wherein the determination of the true identity of a base at a certain locus in the one or more original nucleic acid molecules is performed by using at least two different primers, preferably at least three different primers, even more preferably four different primers, to sequence the molecule provided in step (i) as defined in item 1, wherein the molecules provided in step (i) further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule; wherein the at least two different primers bind to at least three, preferably at least four different regions in the nucleic acid molecule provided in step (i), wherein:
1) At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of at least one nucleic acid molecules provided in step (i);
2) At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in step (i), to sequence the second region of at least one of the nucleic acid molecules provided in step (i);
3) At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of at least one of the nucleic acid molecules provided in step (i);
4) At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of at least one of the nucleic acid molecules provided in step (i).
18. The method according to any one of items 16 to 17, wherein the presence of a modified cytosine at a given position (locus) is determined if a cytosine appears in the first region of the nucleic acid molecules obtained in step (c) and a guanine appears in the corresponding position (locus) in the second region of the same molecule, and/or wherein the presence of a non-modified cytosine at a given position (locus) is determined if a uracil or thymine appears in the first region of the nucleic acid molecules obtained in step (c) and a guanine appears in the corresponding position (locus) in the second region of the same molecule.
19. The method according to one or more of items 16 to 18, wherein the method further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii).
20. An in vitro method for diagnosing a condition, the method comprising the steps of:
(1) Selecting or identifying a region of interest relevant to the condition to be diagnosed within the genome of a patient;
(2) Providing a plurality of nucleic acid molecules as defined in step (i) of the method as defined in any one of items 1 to 19, from a sample obtained from the patient;
(3) Capturing the molecules provided in (i) by using a probe as defined in step (ii) of the method as defined in any one of items 1 to 19;
(4) Determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules and/or determining the epigenetic status of the original molecule(s) within the ROI; and
(5) Diagnosing a condition in the subject based at least in part on the information provided in step (4).
21. A computer program comprising instructions which, when executed by a computer, is able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii) of the method as defined in any one of items 16 to 19.
22. A method for designing at least one probe suitable for performing step (ii) of the method as defined in any one of items 1 to 19, the method comprising:
(a) Selecting or identifying a region of interest within a genome and/or within a nucleic acid molecule;
(b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method as defined in any one of items 1 to 19; and
(c) Obtaining the sequence of at least a capture probe that binds to at least a portion of the second region, wherein the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
23. A computer program comprising instructions which, when executed by a computer, is able to obtain the sequence of at least a capture probe that binds to at least a portion of the second region, by implementing the method as defined in item 22.
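Steps (a) to (c) of the probe design method can be sketched in a few lines. This is only an illustration under explicitly assumed chemistry, which the items above do not fix: the second region is taken to be the synthetic complementary strand of the region of interest, synthesized with unmodified cytosines, and the conversion step is taken to make every unmodified cytosine read as thymine. The function names `infer_second_region` and `design_capture_probe` and the toy sequence are hypothetical.

```python
# Minimal sketch of steps (a)-(c) of the probe design method. All names are hypothetical.
# ASSUMPTIONS (not fixed by the items above):
#   - the second region is the synthetic complementary strand of the ROI,
#   - it is synthesized with unmodified cytosines,
#   - the conversion step makes every unmodified cytosine read as thymine.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a plain A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def infer_second_region(roi: str) -> str:
    """Step (b): predicted second-region sequence after conversion."""
    copy_strand = reverse_complement(roi)
    # Under the stated assumptions, all cytosines of the copy strand are converted.
    return copy_strand.replace("C", "T")


def design_capture_probe(roi: str) -> str:
    """Step (c): a probe fully complementary to the inferred second region."""
    return reverse_complement(infer_second_region(roi))


roi = "ACGTCGGA"  # step (a): toy region of interest, written 5'->3'
second_region = infer_second_region(roi)
probe = design_capture_probe(roi)
print(second_region)  # TTTGATGT
print(probe)          # ACATCAAA
```

Under these assumptions the inferred second region does not depend on whether the cytosines of the region of interest are modified: the positions opposite those cytosines are guanines in every molecule, so a single probe sequence is expected to bind modified and non-modified molecules alike.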
MORE ITEMS OF THE PRESENT INVENTION
1. A method comprising the steps of:
(i) providing a plurality of nucleic acid molecules, wherein each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the nucleic acid molecules of the plurality of nucleic acid molecules may have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified and wherein, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules; and
(ii) Capturing the molecules provided in (i) by using at least one capture probe that binds to at least a portion of the second region which comprises at least one nucleotide which is located at a position which corresponds to a locus in the ROI which is occupied by a nucleotide susceptible of being modified.
2. The method according to item 1, wherein the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof, preferably wherein the at least one modified nucleotide is a methylated cytosine, and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof or a transformed modified nucleotide or a copy thereof.
3. The method according to any one of items 1 to 2, wherein the portion in the second region to which at least one capture probe binds has a length of at least one nucleotide, preferably at least two nucleotides, more preferably at least five nucleotides and even more preferably at least 10 nucleotides.
4. The method according to any one of the preceding items, wherein the first region of the at least one of the plurality of the nucleic acid molecules provided in step (i) is a fragment of genomic DNA, preferably wherein the first region of all of the nucleic acid molecules provided in step (i) are fragments of genomic DNA.
5. The method according to any one of the preceding items, wherein the plurality of nucleic acid molecules in step (i) is provided by: a) Providing one or more original nucleic acid molecules, preferably wherein the one or more original nucleic acid molecules are fragments of genomic DNA; b) Ligating one adaptor to at least one end of the one or more original nucleic acid molecules provided in a), thereby obtaining one or more adaptor-containing original nucleic acid molecules, wherein the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase; c) Synthesizing, for each of the one or more adaptor-containing original nucleic acid molecules obtained in step b), a complementary strand, the "synthetic complementary strand", by polymerase elongation of the 3' end of the adaptor molecule, using the one or more adaptor-containing original nucleic acid molecules obtained in step b) as template, thereby pairing the one or more original nucleic acid molecules obtained in step b) with its synthetic complementary strand, to provide one or more paired double-stranded adaptor-containing nucleic acid molecules; and c21) Converting non-modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules; and/or c22) Converting modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules.
6. The method according to item 5, wherein the one or more original nucleic acid molecules are double-stranded (ds) nucleic acid molecules, preferably genomic ds DNA, and wherein the adaptors of step b) are ds adaptors.
7. The method according to any one of the preceding items, wherein the method further comprises a step (iii) of determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules.
8. The method according to item 7, wherein the determination of the true identity of a base at a certain locus in the one or more original nucleic acid molecules is performed by using at least two different primers to sequence the molecule provided in step (i) as defined in item 1, wherein the molecules provided in step (i) further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule; wherein the at least two different primers bind to four different regions in the nucleic acid molecule provided in step (i), wherein:
1) At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of at least one nucleic acid molecules provided in step (i);
2) At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in step (i), to sequence the second region of at least one of the nucleic acid molecules provided in step (i);
3) At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of at least one of the nucleic acid molecules provided in step (i);
4) At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of at least one of the nucleic acid molecules provided in step (i).
9. The method according to one or more of items 7 or 8, wherein the method further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii).
10. An in vitro method for diagnosing a condition, the method comprising the steps of:
(1) Selecting or identifying a region of interest relevant to the condition to be diagnosed within the genome of a patient;
(2) Providing a plurality of nucleic acid molecules as defined in step (i) of the method as defined in any one of items 1 to 9, from a sample obtained from the patient;
(3) Capturing the molecules provided in (i) by using a probe as defined in step (ii) of the method as defined in any one of items 1 to 9;
(4) Determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules and/or determining the epigenetic status of the original molecule(s) within the ROI; and
(5) Diagnosing a condition in the subject based at least in part on the information provided in step (4).
11. A computer program comprising instructions which, when executed by a computer, is able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii) of the method as defined in any one of items 7 to 9.
12. A method for designing at least one probe suitable for performing step (ii) of the method as defined in any one of items 1 to 9, the method comprising:
(a) Selecting or identifying a region of interest within a genome and/or within a nucleic acid molecule;
(b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method as defined in any one of items 1 to 9; and
(c) Obtaining the sequence of at least a capture probe that binds to at least a portion of the second region, wherein the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
13. A computer program comprising instructions which, when executed by a computer, is able to obtain the sequence of at least a capture probe that binds to at least a portion of the second region, by implementing the method as defined in item 12.
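The base determination performed by the computer program of item 11 can be illustrated with the rule stated in the first set of items: a cytosine in the first region paired with a guanine at the corresponding position of the second region indicates a modified cytosine, while a uracil or thymine in the first region paired with a guanine indicates a non-modified cytosine. The sketch below assumes that the two regions have already been sequenced and aligned position by position. The function names and the "no call" outcome for every other combination are assumptions of this illustration, not part of the items.

```python
# Minimal sketch of the cytosine calling rule. Names and the "no call"
# category are hypothetical; the two decision rules come from the items.

def call_cytosine(first_base: str, second_base: str) -> str:
    """Classify one locus from the bases read in the first and second region."""
    first = first_base.upper()
    second = second_base.upper()
    if second != "G":
        # The rule only covers loci where the second region shows a guanine.
        return "no call"
    if first == "C":
        return "modified cytosine"
    if first in ("U", "T"):
        return "non-modified cytosine"
    return "no call"


def call_loci(pairs: list[tuple[str, str]]) -> list[str]:
    """Apply the rule to (first-region base, second-region base) pairs."""
    return [call_cytosine(first, second) for first, second in pairs]


print(call_loci([("C", "G"), ("T", "G"), ("U", "G"), ("A", "T")]))
# ['modified cytosine', 'non-modified cytosine', 'non-modified cytosine', 'no call']
```

A fuller implementation would also propagate a base quality (BQ) for each call, which this sketch leaves out.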
MORE ITEMS OF THE PRESENT INVENTION
1. A method comprising:
i. Providing a nucleic acid molecule which comprises a 5' region and a 3' region, wherein the 5' region and the 3' region are covalently linked by a nucleotide sequence to which primers can bind, wherein the base identities in one of the 5' or 3' regions and the base identities in the other region both provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule, wherein the molecule further comprises:
- One adapter at the 5' end of the molecule;
- One adapter at the 3' end of the molecule;
ii. Using at least two different primers, such as at least three different primers, preferably at least four different primers, to sequence the molecule provided in step (i), wherein the at least two different primers, such as at least three different primers, preferably at least four different primers, bind to at least three, preferably to at least four different regions in the nucleic acid molecule provided in (i), wherein:
1. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the 5' region of the nucleic acid molecule provided in (i);
2. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 3' region of the nucleic acid molecule provided in (i);
3. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the 3' region of the nucleic acid molecule provided in (i); and/or
4. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 5' region of the nucleic acid molecule provided in (i).
2. The method according to item 1, wherein the 3' region of the nucleic acid molecule provided in step (i) is at least partially complementary to the reverse strand of the 5' region.
3. The method according to any one of the preceding items, wherein the nucleic acid molecule provided in step (i) is a DNA molecule.
4. The method according to any one of the preceding items, wherein the 5' region and/or the 3' region of the nucleic acid molecule provided in step (i) are fragments of genomic DNA.
5. The method according to any one of the preceding items, wherein the 5' region of the nucleic acid molecule provided in step (i) is a fragment of genomic DNA, and the 3' region of the nucleic acid molecule provided in step (i) is the complementary of the reverse strand of the 5' region.
6. The method according to any one of items 1-4, wherein the 3' region of the nucleic acid molecule provided in step (i) is a fragment of genomic DNA, and the 5' region of the nucleic acid molecule provided in step (i) is the complementary of the reverse strand of the 3' region.
7. The method according to any one of the preceding items, wherein the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in step (i) has a length of at least 5 nucleotides, preferably a length of at least 10 nucleotides, even more preferably at least 17 nucleotides.
8. The method according to any one of the preceding items, wherein, in step (ii), four different primers are used to sequence the molecule provided in step (i).
9. The method according to any one of the preceding items, wherein the nucleic acid molecule of step (i) is generated by:
a. Providing a population of double stranded nucleic acid molecules, preferably wherein the plurality of double-stranded DNA molecules are fragments of genomic DNA;
b. Ligating, at least partially, double-stranded (ds) adaptors to at least one end of the strands of a plurality of double-stranded nucleic acid molecules, thereby obtaining a plurality of adaptor-containing nucleic acid molecules;
c. Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the second nucleic acid strand in the adapter molecule, using each of the strands of the nucleic acid molecules obtained in step (b) as template, thereby pairing each of the strands of the nucleic acid molecules obtained in step (b) with its synthetic complementary strand to provide a plurality of adaptor-modified nucleic acid molecules, wherein the original nucleic acid strand and its synthetic complementary obtained in step (c) are covalently linked by a nucleotide sequence to which primers can at least partially bind;
d. Optionally, providing complementary strands of the plurality of adaptor-modified DNA molecules obtained in step (c), optionally using primers the sequences of which are complementary to at least a portion of the double-stranded adaptors;
e. Optionally amplifying the paired double stranded nucleic acid molecules obtained in step (d) to provide amplified paired double stranded nucleic acid molecules.
10. The method according to item 9, wherein at least a portion of the double-stranded adaptors has sequences common to all the double-stranded adaptors used in step (b).
11. The method according to one or more of items 9 to 10, wherein prior to step (c), the plurality of paired adaptor-modified DNA molecules is separated to generate a library of paired adaptor-modified DNA molecules.
12. The method according to one or more of items 9 to 11, wherein the 3' region of the second DNA strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment located at the 3' end of the 3' region and the second segment located in the vicinity of the 5' region of the second DNA strand.
13. The method according to one or more of items 9 to 12, wherein the method further comprises, after step (c), the following step (c1):
Contacting each strand of said adapter-containing nucleic acid molecules with a complex of an elongation primer with a hairpin adapter under conditions adequate for the hybridization of the elongation primer to the second strand of the adapter, wherein the elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule and which, after hybridization with the second strand of the adapter molecule creates overhanging ends, and wherein the hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer to the second strand of the adapter.
14. The method according to one or more of items 9 to 13 wherein the adaptor has a first barcode sequence in the double stranded region and/or a second barcode sequence in the 3' region of the second strand of the adaptor.
15. The method according to one or more of items 9 to 14, wherein the adaptor has a restriction site in the 5' region of the first strand of the adaptor.
16. The method according to one or more of items 9 to 15, wherein, in step (b), a plurality of adaptor-modified nucleic acid molecules is provided, wherein the strands of genomic DNA fragments are further paired by using barcode sequences.
17. The method according to item 16, wherein the pairing can be performed either before or after the ligation, or simultaneously with the ligation.
18. The method according to one or more of items 9 to 17, wherein at least a portion of the double-stranded adaptors has sequences common to all the double-stranded adaptors used in step (b).
19. The method according to one or more of items 9 to 18, wherein the method further comprises, after step (c), the following steps c21 and/or c22: c21) Converting non-modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules; and/or c22) Converting modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules.
20. The method according to any one of items 1 to 8, wherein the nucleic acid molecule of step (i) is generated by: a. Providing a double stranded nucleic acid molecule, preferably wherein the double-stranded DNA molecule is a fragment of genomic DNA; b. Covalently linking the forward and reverse single-stranded nucleic acid molecules provided in step a.,
wherein the covalent linking in step b. is performed by a nucleotide sequence to which primers can bind, to obtain a nucleic acid molecule which comprises a 5' region and a 3' region, wherein the 5' region and the 3' region are covalently linked by a nucleotide sequence to which primers can bind, wherein the base identities in one of the 5' or 3' regions and the base identities in the other region both provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule.
21. The method according to item 20, wherein the double stranded nucleic acid molecule provided in step a. is provided as a population of double stranded nucleic acid molecules, preferably wherein the plurality of double-stranded DNA molecules are fragments of genomic DNA.
22. The method according to one or more of items 1 to 21, wherein the method further comprises determining the true identity of a base at a certain locus in an original nucleic acid molecule, based on the information provided in step (ii).
23. The method according to item 22, wherein the identity of the true base at the locus of the original nucleic acid molecule is determined to be a miscall if the identities of the first base from reads 1 and 3 and the identities of the second base from reads 4 and 2, respectively, do not match any of the following combinations: 1) adenine, adenine, thymine and thymine that corresponds to A, 2) thymine, thymine, adenine and adenine that corresponds to T, 3) thymine, guanine, adenine and guanine that corresponds to unmethylated C, 4) guanine, adenine, guanine, thymine that corresponds to a G and 5) cytosine, cytosine, guanine and guanine that corresponds to a methylated C.
24. The method according to one or more of the preceding items, wherein the method further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position in an original nucleic acid molecule, based on the information provided in step (ii).
25. A computer program comprising instructions which, when executed by a computer, are able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position in an original nucleic acid molecule, based on the information provided in step (ii) of the method as defined in any one of the preceding items.
26. A kit comprising at least two different primers, such as at least three different primers, preferably four different primers, wherein the at least two different primers, such as at least three different primers, preferably four different primers, are capable of at least partially binding to at least three, preferably at least four different regions in the nucleic acid molecule provided in step (i) of the method as defined in any one of items 1 to 25, wherein:
1. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 5' end of the nucleic acid molecule provided in (i), to sequence at least part of the 5' region of the nucleic acid molecule provided in (i);
2. At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 3' region of the nucleic acid molecule provided in (i);
3. At least one of the primers is capable of binding (hybridizing) at least partially to at least a portion of the adapter at the 3' end of the nucleic acid molecule provided in (i), to sequence at least part of the 3' region of the nucleic acid molecule provided in (i); and/or
4. At least one of the primers is capable of binding (hybridizing) at least partially to the region of the nucleotide sequence which covalently links the 5' region and the 3' region of the nucleic acid molecule provided in (i), to sequence at least part of the 5' region of the nucleic acid molecule provided in (i).
27. The kit according to item 26, wherein the kit further comprises a double stranded adapter for use in the method as defined in any one of items 9-24, wherein the adapter comprises a first nucleic acid strand and a second nucleic acid strand, wherein the 3' region of the first nucleic acid strand and the 5' region of the second nucleic acid strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first nucleic acid strand and the 5' region of the second nucleic acid strand of the adapter are compatible with the ends of a double stranded nucleic acid molecule, wherein the double-stranded region of the adapter comprises one or more barcode sequence(s), and wherein the 3' region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment being located at the 3' end of the 3' region and the second segment being located in the vicinity of the 5' region of the second strand, and/or wherein the adapter comprises at least one barcode sequence in the single stranded region of the adapter, wherein the barcode sequences consist of unique identifiers that allow identification of a specific construct comprising said identifier and its amplification products, and wherein compatible means that the ends of said double stranded region of the adapter molecule are capable of being ligated to one end or to both ends of a double stranded nucleic acid molecule.
28. The kit according to any one of items 26 or 27, wherein the adapter has a restriction site in the 5' region of the first strand of the adapter.
29. The kit according to any one of items 26 to 28, wherein the adapter comprises at least one barcode sequence in the single stranded region of the adapter and wherein the 3' region of the second strand of the adapter forms a hairpin loop by hybridization between a first and a second segment within said 3' region, the first segment being located at the 3' end of the 3' region and the second segment being located in the vicinity of the 5' region of the second strand.
30. The kit according to any one of items 26 to 29, wherein the kit further comprises:
(i) a library of double-stranded adapters, said adapters comprising a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity and wherein the ends of said double stranded region are compatible with the ends of double stranded nucleic acid molecules;
(ii) a plurality of elongation primers, wherein each elongation primer comprises a 3' region which is complementary to the second strand of the adapter molecule as defined in (i) and which, after hybridization with the second strand of the adapter molecule creates overhanging ends; and
(iii) a plurality of hairpin adapters, wherein each hairpin adapter comprises a hairpin loop region and overhanging ends which are compatible with the overhanging ends formed after hybridization of the elongation primer as defined in (ii) to the second strand of the Y-adapter as defined in (i), wherein the elongation primers of (ii) and the hairpin adapters of (iii) may be provided as a complex; wherein the adapters of (i), the elongation primers of (ii) and the hairpin adapters of (iii) are suitable for obtaining a library of adapters for use in the method as defined in any one of items 9 to 24.
EXAMPLES
EXAMPLE 1: Advantages of sequencing the molecule of the invention
STEP i. of the method of the present invention:
This section gives an example of how to provide the molecule of step i). Particularly, the molecule exemplified herein is the GEUS molecule as described in WO 2015/104302. As discussed above, this molecule is an example of a molecule as defined in step i. of the method of the present invention. The advantages and effects explained herein for the GEUS molecule are equally applicable to any molecule as defined in step i. of the method of the present invention.
The nucleic acid molecule of STEP i. may be generated by:
Step a. Providing a double-stranded nucleic acid molecule: See, e.g., Figure 4A:
ATCGAAMGMT TAGCTTGMGA
("M" indicates methylated cytosine, and C indicates nonmethylated cytosine)
Step b. Ligating, at least partially, double-stranded (ds) adaptors to at least one end, and preferably to both ends, wherein one of the adaptors comprises a hairpin ("hairpin" in the below representation): See, e.g., Figure 4B.
adaptor - ATCGAAMGMT - hairpin
hairpin - TAGCTTGMGA - adaptor
Step c. Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the "hairpin", using each of the strands of the nucleic acid molecules obtained in step (b2) as template. See, e.g., Figure 4D.
adaptor - ATCGAAMGMT - hairpin - AGCGTTCGAT
adaptor - AGMGTTCGAT - hairpin - ATCGAACGCT
Step c.21: Converting the non-methylated cytosine(s) of the molecules obtained after step c), to a base which is read distinctly from cytosine (e.g., uracil/thymine) by, e.g., bisulfite treatment: For simplicity, we will continue this example with only one of the molecules obtained after step c. However, it is noted that the method can be continued with both of them, see Fig. 4E.
adaptor - ATUGAACGCT - hairpin - AGUGTTUGAT - adaptor
(Converted non-methylated cytosines, underlined in the original, appear as U)
The resulting molecule is a "GEUS molecule", as described, e.g., in WO 2015/104302, and comprises a first region (in bold), second region (in italics), wherein both regions are covalently linked by a nucleotide sequence to which primers can bind (comprising the hairpin of the adaptor, as explained above, also referred to as "linking region"), wherein the base identities in one of the first or second regions and the base identities in the other region both provide, independently, information on the base identities in the corresponding loci in an original nucleic acid molecule (in this case, they provide information on the base identities of the original strand of step a "ATCGAAMGMT"), and wherein the molecule comprises one adaptor in the 5' end and another adaptor in the 3' end. This molecule is an example molecule of the molecule provided in step i. of the method of the present invention, also represented in Fig. 5A.
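Steps c and c.21 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the method: the function names are hypothetical, the polymerase is assumed to incorporate only natural nucleotides, and bisulfite conversion is assumed to be complete. Unlike the representation in the text, the sketch keeps methylated cytosine written as "M" in the first region so that it stays distinguishable from converted positions.

```python
# Alphabet of this example: A, C, G, T, "M" (methylated cytosine)
# and "U" (uracil produced by bisulfite conversion).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def synthesize_complement(template):
    """Return the synthetic complementary strand, written 5'->3'.

    Assumption: only natural nucleotides are incorporated, so the new
    strand never contains "M".
    """
    return "".join(COMPLEMENT[base] for base in reversed(template))


def bisulfite_convert(strand):
    """Convert every non-methylated C to U; "M" is left unchanged."""
    return strand.replace("C", "U")


def build_geus_regions(original):
    """Return (first region, second region) of the converted molecule."""
    first_region = bisulfite_convert(original)
    second_region = bisulfite_convert(synthesize_complement(original))
    return first_region, second_region


first, second = build_geus_regions("ATCGAAMGMT")
print(f"adaptor - {first} - hairpin - {second} - adaptor")
```

The second region contains no "M" and, after conversion, no "C" either, which is the property the capture step relies on.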
STEP ii. of the method of the present invention
This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules.
STEP iii. of the method of the present invention
The molecule provided in STEP i. is sequenced using at least two primers that bind to at least three, preferably to at least four different regions. In this Example, four different primers binding to four different regions in the nucleic acid molecule provided in (i) are used, as explained below. The skilled person will however immediately recognise that at least some of the advantages and effects described herein are equally applicable to methods where at least two different primers, such as three different primers, binding to at least three different regions in the nucleic acid molecule provided in (i) are used.
In the present case, as described above, four different primers are used:
primer 1 (p1): is capable of binding to a portion of the adapter at the 5' end of the GEUS molecule, to sequence at least part of the first region of the GEUS molecule provided (primer corresponding to item 1. of the claimed method, step iii.)
primer 2 (p2): is capable of binding to a region of the hairpin, to sequence the second region of the GEUS molecule (primer corresponding to item 2. of the claimed method, step iii.)
primer 3 (p3): is capable of binding to the adapter at the 3' end of the GEUS molecule, to sequence at least part of the second region of the GEUS molecule (primer corresponding to item 3. of the claimed method, step iii.)
primer 4 (p4): is capable of binding to a region of the hairpin, to sequence the first region of the GEUS molecule (primer corresponding to item 4. of the claimed method, step iii.)
The above four primers will generate four reads (see also Fig. 5B):
p1: READ 1 ATTGAACGCT
p3: READ 3 ATCAAACACT
p4: READ 4 TAACTTGCGA
p2: READ 2 TAGTTTGTGA
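The four reads listed above can be reproduced from the two converted regions with a short sketch. The function names are hypothetical, and the orientation convention is an assumption chosen to match the listing: every read is written locus by locus against the original strand, so Read 4 is the complement of Read 1 and Read 3 is the complement of Read 2.

```python
FIRST_REGION = "ATUGAAMGMT"   # converted original strand (M = methylated C)
SECOND_REGION = "AGUGTTUGAT"  # converted synthetic complementary strand

PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def as_sequenced(strand):
    """Base calls a sequencer would report: U reads as T, M reads as C."""
    return strand.replace("U", "T").replace("M", "C")


def complement(seq):
    """Complement of a read, position by position (no reversal)."""
    return "".join(PAIR[base] for base in seq)


def four_reads(first_region, second_region):
    """Return the four reads, all aligned to the loci of the original strand."""
    read1 = as_sequenced(first_region)          # p1: first region
    read2 = as_sequenced(second_region)[::-1]   # p2: second region, locus order
    read3 = complement(read2)                   # p3: opposite strand of Read 2
    read4 = complement(read1)                   # p4: opposite strand of Read 1
    return {"READ 1": read1, "READ 2": read2, "READ 3": read3, "READ 4": read4}


for name, read in four_reads(FIRST_REGION, SECOND_REGION).items():
    print(name, read)
```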
Importantly, if we only consider Read 1 and Read 3, derived from p1 and p3, respectively, the GEUS molecule will be sequenced, in practice, based on a single-end (SE) sequencing, as Read 1 and Read 3 provide information of the bases located in the same loci of the original molecule. Since this GEUS molecule comprises two regions (a first region highlighted in bold, and a second region highlighted in italics) that provide related information, which is the information on the base identities in the corresponding loci in an original nucleic acid molecule (strand ATCGAAMGMT in step a. above), using only two conventional primers that hybridize in the adaptors of the molecule (i.e., "usual PE sequencing") will result, in practice, in SE sequencing, as shown in Fig. 6A.
However, in the claimed method, two additional reads are obtained, named Read 2 and Read 4. Said reads derive from primers that hybridize in the hairpin region of the GEUS molecules (p2 and p4, respectively, see Figure 5B). By obtaining these additional two reads, a true paired-end (PE) sequencing for each of the first and second regions of the GEUS molecule is obtained, because the first and second regions are now read from each of their ends, as shown above and in Fig. 6B.
Therefore, the method of the present invention allows for PE sequencing for molecules as the ones defined in step i. of the method, for example the GEUS molecule described in WO 2015/104302. The advantages of PE sequencing versus SE sequencing are as described above, see background of the invention.
Additionally and importantly, when the four reads are capable of at least partially, preferably completely covering the 5' and 3' regions of the GEUS molecule, the method claimed herein provides more than two, such as at least three and preferably at least four sources of information (namely four reads) per original single strand molecule (e.g., up to eight sources in total if the original molecule was a double stranded molecule, such as in the present example). The presence of more than two, such as three, preferably four sources of information per original strand is advantageous as it allows the detection of sequencing errors that would not be detected if only two reads were obtained, as explained above and herein below:
If only p1 and p3 are used, see above, the following reads are obtained, see also Fig. 6A:
THEORETICAL EXAMPLE
CODE FOR 2R INFERRED with an A>G error in READ 1:
The underlined bases in reads mean that they correspond to a true base in the original DNA template. For example, in Read 1, when an "A" is read, it means that the original DNA template has an "A" in said position.
The not-underlined bases in reads refer to bases that, when read, provide two possible alternatives, so that the true identity of the base in that position of the DNA template cannot be inferred from that source of information. For instance, in Read 1, when a "T" is read, it can be a "T" or an "unmethylated C" in the original DNA template.
The ambiguity or redundancy of the bases with a white background in Read 1 is solved by reading Read 3. For example, the second base "T" in Read 1, which is not underlined, is inferred when in Read 3 a "T" appears. This is because, in Read 3, "T" indicates true "T" in the DNA template. Therefore, even if Read 1 does not provide information of the second base of the DNA template, Read 3 provides the true identity of said base, and thus the sequence of the template DNA can be inferred.
However, when Read 1 has a sequencing error in one of the underlined bases, Read 3 cannot overcome the ambiguity, and the true identity of the base will be inferred incorrectly. This is the case of the erroneous "G" highlighted with a thick black border in Read 1. Although Read 3 shows an "A", the base "A" in Read 3 can mean either a true "G" or a true "A", as Read 3 is ambiguous for said bases. Therefore, if Read 1 contains an error in a base that cannot be directly inferred by Read 3, the true base identity in the DNA template will be mistaken: since a "G" in Read 1 means a true "G" in the DNA template, the mistaken "G", highlighted with a thick black border, will be inferred as a true "G", resulting in an error in the sequencing of the DNA template.
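The two-read inference and its blind spot can be illustrated as follows. This is a hedged sketch: the lookup table is derived column by column from Read 1 and Read 3 of this Example, the function name is hypothetical, and the position of the A>G error (the fifth base) is chosen only for illustration.

```python
# (Read 1 base, Read 3 base) -> inferred base of the original template.
# "C" denotes non-methylated cytosine and "M" methylated cytosine.
TWO_READ_CODE = {
    ("A", "A"): "A",
    ("G", "A"): "G",
    ("T", "T"): "T",
    ("T", "C"): "C",
    ("C", "C"): "M",
}


def infer_two_reads(read1, read3):
    """Infer the template from Read 1 and Read 3; "?" marks an invalid pair."""
    return "".join(TWO_READ_CODE.get(pair, "?") for pair in zip(read1, read3))


read1 = "ATTGAACGCT"
read3 = "ATCAAACACT"
print(infer_two_reads(read1, read3))        # error-free reads

# Hypothetical A>G sequencing error at the fifth base of Read 1.
read1_error = read1[:4] + "G" + read1[5:]
print(infer_two_reads(read1_error, read3))  # the error yields a valid pair
```

Because the pair ("G", "A") is itself a valid combination, the erroneous read produces no "?" and the wrong base is silently accepted.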
These sequencing errors would be removed in the method of the present invention, where at least one, preferably at least two more reads, i.e., Read 2 and/or 4, are obtained, see also Fig. 6B:
In the above table, Read 1 and Read 3 are the same as those shown above. However, two new reads are provided: Read 4 and Read 2 are the reads derived from primers 4 and 2, respectively, see Fig. 6B.
In this case, thanks to the presence of Read 2 and 4, it can be seen that a discrepancy in the bases has been obtained, as Read 4 shows a "T", underlined, but "T" in read 4 means a true "T" in the DNA template. Therefore, thanks to a true paired-end sequencing, two additional reads are obtained, and sequencing errors can be detected due to base discrepancies in said reads.
Hence, the claimed method allows for very low error rates when sequencing a molecule as defined in step i. of the method of the present invention. This is because more than two reads (e.g., at least three, and preferably at least four reads) per molecule are obtained.
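A four-read consistency check of this kind might look like the sketch below. The table is an assumption built column by column from the four reads listed in Example 1 (order: Read 1, Read 3, Read 4, Read 2), the function names are hypothetical, and the A>G error in Read 1 is again placed at the fifth base for illustration only.

```python
# (Read 1, Read 3, Read 4, Read 2) -> base of the original template.
# "C" denotes non-methylated cytosine and "M" methylated cytosine.
FOUR_READ_CODE = {
    ("A", "A", "T", "T"): "A",
    ("T", "T", "A", "A"): "T",
    ("T", "C", "A", "G"): "C",
    ("G", "A", "C", "T"): "G",
    ("C", "C", "G", "G"): "M",
}


def call_bases(read1, read3, read4, read2):
    """Call each locus from four reads; "N" marks an inconsistent column."""
    columns = zip(read1, read3, read4, read2)
    return "".join(FOUR_READ_CODE.get(column, "N") for column in columns)


def miscall_positions(calls):
    """Zero-based indices of the loci flagged as miscalls."""
    return [index for index, base in enumerate(calls) if base == "N"]


read3, read4, read2 = "ATCAAACACT", "TAACTTGCGA", "TAGTTTGTGA"
clean = call_bases("ATTGAACGCT", read3, read4, read2)
with_error = call_bases("ATTGGACGCT", read3, read4, read2)
print(clean, miscall_positions(clean))
print(with_error, miscall_positions(with_error))
```

With the extra reads, the erroneous column no longer matches any valid combination, so the locus is flagged instead of being called as "G".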
EXAMPLE 2: Advantages of using the molecule of the invention wherein the second region does not comprise any modified nucleotides, and of using adapters having the special features of "wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T)".
The previous Example shows the advantage of sequencing the molecule of the present invention. In this Example, we will show the advantage derived from the fact that the second region of the molecule of the invention does not comprise any methylated Cytosine and derived from using adapters having certain special features. Example 4 below reproduces the same method and starting molecule, but including modified nucleotides (methylated C) in the extension step. Example 3 below reproduces the same method but using different adapters not having the above mentioned special features.
The starting molecule exemplified herein is the GEUS molecule as described in WO 2015/104302. As discussed above, this molecule is an example of a molecule as defined in step i. of the method of the present invention.
STEP i. providing a plurality of nucleic acid molecules of the present invention.
The nucleic acid molecules of STEP i. may be generated by:
Step a. Providing a double-stranded nucleic acid molecule: See, e.g., Figure 7A, where a fragment of dsDNA from whole genome is represented, to which an "A tailing" is added (highlighted with black background).
Step b. Ligating double-stranded (ds) adaptors to each end of the molecule of step a) wherein one of the adaptors comprises a hairpin ("hairpin" in the below representation): See Fig. 7B, where the adaptors are called herein E9 adapters and are added to each end of the molecule resulting from step a. In the adapter's sequence, "M" denotes methylated cytosine (i.e., a modified nucleotide), and "C" denotes non-methylated C (i.e., non-modified nucleotide).
It can be observed in Fig 7B that each adapter has a first and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of the double stranded DNA molecules, wherein:
- the 3' region of the second strand ("GTGCCTCAGGCTCCGATCGAGTGTTGTCTCGATCGGAGCCTGAGGCAC" in Fig 7B) forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand ("GMTMTTMMGATMTGGMGTGGMAGGATTATT" in Fig 7B) comprises at least two regions:
(a) a region (named DUPLEXUMIT in Fig 7B) comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region, and
(b) a region (named Y_ssDNA in Fig 7B, "GMTMTTMMGATMT") that is not complementary to the second strand but that is sufficiently complementary to a primer, thereby allowing the primer to bind to said region and be extended by action of a polymerase,
wherein the double stranded region formed by the first and the second strands of the adapter (sequences "GGMGTGGM" and "MCGCAMCG" in Fig. 7B) comprises at least one non-modified nucleotide complementary to a modified nucleotide, (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T). In this particular example, the ds region formed by the first and second strands of the adapter comprises:
GGMGTGGM
McGCAMcG
Four "G"s (at least one non-modified nucleotide complementary to a modified nucleotide), in bold "G"
Four M (methylated C; at least one modified nucleotide), underlined "M"
Two C (non-methylated C; at least one non-modified nucleotide susceptible of being converted), indicated with a small letter "c"
Hence, the adaptor shown in Fig. 7B comprises the special features mentioned above.
Further, in this Example, the double stranded region of the adapter comprises UMI sequences (i.e., one or more barcode), together with the A tailing added in step a., as shown in bold and with black highlight, respectively, in Fig. 7B.
Step c. Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the "hairpin", using each of the strands of the nucleic acid molecules obtained in step (b2) as template. Fig. 7C represents the ligation of the adapters E9 to the molecule provided in step a.
Before the extending step, the strands are denatured (see Fig. 7D). Next, the synthetic complementary strand is generated by extension of the 3' end of the adapters, using natural (non-modified) nucleotides (A, C, G, T). Hence, the synthetic strand does not have any M, see Fig. 7E.
Step d. Transformation or conversion step
After step c, the molecules generated are treated with an agent (in this case, bisulfite) capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur. This is represented in Fig. 7F. Fig. 7G shows the same molecules as Fig. 7F but in a linear form.
On one hand, it is noted that the conversion of non-methylated C into uracil allows the two strands of the molecule to separate more easily, providing better access to the reagent that acts on single stranded molecules. This is because the hybridization between nucleotides U-G (derived from a C-G, wherein the non-methylated C was converted to U) is weaker than between nucleotides M-G. If the second region (synthetic complementary strand) were created with M rather than non-methylated C, then both strands would be bound by more hydrogen bonds, and the bisulfite reagent would have less access to the single strands. A comparison between an extension step using M versus an extension step using non-modified C is depicted in Fig. 10. An example of the full method using an extension step using only M is provided in Example 4.
On the other hand, the adapters comprising the special features and used herein also provide further advantages to the method and the molecule of the invention. Since the adapters have the above mentioned special features, the resulting extended molecule comprises four regions, highlighted in Fig 7F with a box (numbered 1-4 in the Watson insert and 1'-4' in the Crick insert), that are substantially different among each other, so that a primer can only specifically bind to one of them, and not to the others. The advantages associated to the presence of the four different regions will be explained below. Of note, these four boxes represent regions 1 to 4 and 1' to 4' explained above, and can also be named as A, B, C, D or A', B', C', D', respectively.
The forward (FW) and reverse (RV) fusion primers (FP) used are shown in Fig. 7H. These primers are prepared for Illumina sequencing, which requires different primer binding sites in each end of the molecule, see Fig. 5b.
The next step is amplification and sequencing. Fig. 7I shows in the upper part all of the molecules of the reaction (the two molecules of Fig. 7G and the two primers (two copies of each) of Fig. 7H), and in the bottom the specific hybridization of the primers of Fig. 7H in the molecules of Fig. 7G.
Fig. 7I shows that the reverse primer can bind and amplify the molecule of the invention (it hybridizes in box 4 and 4'), while the forward primer does not have a complementary sequence to bind yet. This ensures sequencing directionality (the 5' and 3' ends of each molecule are different). This directionality is an effect derived from the features of the adaptors, which created the four different regions comprised in boxes 1-4 and 1'-4'.
If the adapters did not have the feature of having a double stranded region formed by the first and the second strands of the adapter comprising at least one G, the primers would specifically hybridize in more than one region in the molecule of the invention, losing said directionality and generating aberrant amplicons. An example of the full method using adapters not having the above mentioned special features in the ds region is provided in Example 3.
Hence, a first effect of having the adaptors with the special features (in this case, four Gs complementary to a modified nucleotide M) is that they create the necessary directionality for the reverse primer to bind in the first cycle of PCR. In other words, if the adaptors did not have at least one G in the double stranded region, two of the sequences included in the boxes would be the reverse complement of each other ("complementarity effect"), resulting in the loss of directionality because the forward primer would also be able to hybridize in the first molecule.
Additionally, if the adapters did not have the special feature of having at least one modified nucleotide (M) and at least one non-modified nucleotide susceptible of being converted (C), the different regions included in the boxes would be identical, so they would be tandem repetitions ("mirror effect"). This would cause the forward primer to hybridize in both ends of the molecule, thereby creating molecules with the same ends, and thus not suitable for Illumina sequencing, as will be explained in the next example. In other words, thanks to the presence of M, G and C in the adaptors, the regions included in boxes 1-4 and 1'-4' are all different.
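The "mirror effect" can be made concrete with a small sketch. It rests on assumptions that are not confirmed by the figures: the four boxed regions are taken to be the two strands of the adapter duplex ("GGMGTGGM" and, written 5'->3', "GCMACGCM") plus their synthetic copies made with natural nucleotides, every "C" in these strings is treated as non-methylated, and the second adapter is a hypothetical variant of the same duplex in which no cytosine is methylated. Function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def copy_strand(template):
    """Synthetic copy (reverse complement) made with natural nucleotides only."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def read_after_bisulfite(strand):
    """Non-methylated C reads as T; methylated C ("M") still reads as C."""
    return strand.replace("C", "T").replace("M", "C")


def four_regions(top, bottom):
    """Regions as read after conversion; both strands are written 5'->3'."""
    strands = (top, bottom, copy_strand(bottom), copy_strand(top))
    return [read_after_bisulfite(strand) for strand in strands]


def all_distinct(regions):
    """True when no two regions share the same sequence."""
    return len(set(regions)) == len(regions)


e9_like = four_regions("GGMGTGGM", "GCMACGCM")
unmethylated = four_regions("GGCGTGGC", "GCCACGCC")  # hypothetical adapter
print(e9_like, all_distinct(e9_like))
print(unmethylated, all_distinct(unmethylated))
```

Under these assumptions the duplex carrying both M and C yields four different sequences, whereas the fully non-methylated variant yields regions 1 and 3 identical and regions 2 and 4 identical, i.e. tandem repetitions.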
Further, the presence of C and M in the adaptors also helps as a control for an efficient bisulfite conversion, because if the bisulfite conversion did not work, or is not carried out, the complementarity/specificity of the reverse primer will be reduced and will increase for the forward primer, leading to the loss of directionality. Hence, the presence of c and M in the adapter also ensures a method with optimized efficiency, which will not be possible if no bisulfite treatment is performed, or if all of the cytosines are methylated (see Example 4). A further control of the efficiency of the bisulfite conversion is the fact that, since the extension step was performed with non-modified C, the resulting synthetic strand of the GEUS molecule lacks non-methylated C (all non-methylated C were converted to U after bisulfite treatment). When the synthetic strand of the GEUS molecule comprises non-modified C, it is indicative of low bisulfite conversion efficiency.
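The second control described above, residual C in the synthetic strand, lends itself to a simple estimate. The sketch below is an illustration only: the function name is hypothetical, and it assumes that the synthetic strand before conversion is known (here, the strand from Example 1) and that the read of that strand is given in the same orientation.

```python
def conversion_efficiency(synthetic_before, synthetic_read):
    """Fraction of C positions of the synthetic strand that are read as T.

    Returns None when the synthetic strand contains no C at all.
    """
    if len(synthetic_before) != len(synthetic_read):
        raise ValueError("sequences must have the same length")
    c_sites = [i for i, base in enumerate(synthetic_before) if base == "C"]
    if not c_sites:
        return None
    converted = sum(1 for i in c_sites if synthetic_read[i] == "T")
    return converted / len(c_sites)


synthetic = "AGCGTTCGAT"  # synthetic strand of Example 1 before conversion
print(conversion_efficiency(synthetic, "AGTGTTTGAT"))  # every C converted
print(conversion_efficiency(synthetic, "AGTGTTCGAT"))  # one C left unconverted
```

Any value below 1.0 means that non-modified C survived in the synthetic strand, which the text treats as a sign of low conversion efficiency.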
Fig. 7J shows the extension and denaturation cycles, in preparation for the next amplification step, shown in Fig. 7K. It is observed that, in the second PCR cycle (Fig. 7K), the forward reverse fusion primer can anneal into the first generation amplicons, thereby generating more amplicons. Once the forward primer is extended, the first complete molecules to be sequenced appear (see Fig. 7K, bottom part, molecules inside a box).
Fig. 7L and 7M represent the third cycle of PCR, wherein the complete dsDNA molecules to be sequenced are highlighted with a black box.
STEP ii. Capturing step
This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules. As shown in Fig. 7N, thanks to the presence of the second region, a universal probe Watson can be designed, which will be complementary to the second region of all of the molecules in the plurality of molecules (Fig. 7N represents a plurality of 4 different molecules, each one with a different methylation status). If the second region were absent, four different probes Watson would need to be designed in order to capture all molecules (called probe I, II, III, and IV Watson in Fig. 7N).
Hence, the universal probe Watson is very different from the original sequence, so it will only work against the synthetic strand, which is the same for any methylation status of the same fragment. The same holds for any possible combination of "c" and "M" in the insert (insert fragments are about 150-200 bp), and likewise for any possible combination of "c" and "M" in the Crick strands.
Fig. 7N represents the benefits of the capture probe designed for the method of the present invention. It is capable of capturing all molecules with the same specificity and efficiency because it is directed to hybridize with the second region of the molecules, which is identical in all of them.
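The reasoning behind the universal probe can be reproduced in silico. The following is a minimal sketch under simplifying assumptions: the 8-nt fragment and its four methylation states are invented for illustration, "M" denotes methylated cytosine as in the adapter notation, bisulfite conversion is modelled as reading every non-methylated C as T, and the synthetic strand is modelled as the plain reverse complement built with non-methylated C. None of the names below come from the described method.

```python
# Complement table; a methylated cytosine (M) pairs with G like a normal C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "M": "G"}


def reverse_complement(seq: str) -> str:
    """Reverse complement synthesized with non-methylated C only."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def bisulfite_read(seq: str) -> str:
    """Sequence as read after conversion: C is read as T, M is read as C."""
    # Order matters: convert C first, then relabel the protected M as C.
    return seq.replace("C", "T").replace("M", "C")


# Hypothetical fragment with two cytosines, in its four methylation states.
states = ["ACGTTCGA", "AMGTTCGA", "ACGTTMGA", "AMGTTMGA"]

# First region: the original strand, whose converted sequence depends on methylation.
first_regions = [bisulfite_read(s) for s in states]

# Second region: the synthetic complementary strand, identical for all states.
second_regions = [bisulfite_read(reverse_complement(s)) for s in states]

# A single probe complementary to the shared second region.
universal_probe = reverse_complement(second_regions[0])
```

In this toy case the four first regions are all different, so a probe directed at the first region would need four variants, whereas the four second regions collapse into one sequence that a single probe can target.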
EXAMPLE 3: Disadvantages of using modified adapters without the special features of "wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T)".
In this example we show the disadvantages of using adapters without the above-mentioned special features. Fig. 8A shows modified adapters not comprising at least one non-modified nucleotide complementary to a modified nucleotide (e.g., G if we are converting C to U/T), at least one modified nucleotide (e.g., methylated C if we are converting C to U/T), and at least one non-modified nucleotide susceptible of being converted (e.g., non-methylated C if we are converting C to U/T). The bottom part of Fig. 8A shows the ligation of these modified adapters to the molecule.
Fig. 8B represents the denaturation step. Fig. 8C shows the extension step once said modified adapters have been ligated to the molecule. A synthetic complement strand (second region) is created for each original strand (first region). In this example, the extension is performed with non-methylated Cs.
Fig. 8D shows the resulting molecule after the transformation or conversion step. Fig. 8D represents the molecules of Fig. 8B where four boxes are indicated. As can be observed, boxes 1 and 3, and boxes 2 and 4, are tandem repetitions. The same effect is observed in boxes 1' and 3', and 2' and 4'. This tandem effect is caused by the lack of at least one M and one C in the ds region of the adapters. See also Fig. 8E, upper part (modified adapter molecules).
Further, the regions included in boxes 1 and 2, and in boxes 3 and 4, are the reverse complement of each other. The same effect is observed in boxes 1' and 2', and 3' and 4'. This reverse complement repeat effect is caused by the lack of at least one G complementary to a modified nucleotide in the ds region of the adapters. See also Fig. 8E, upper part (modified adapter molecules).
These effects do not occur when the adaptors have the special features, as they result in four different regions; see Fig. 8E, bottom part (GEUS molecule).
Both effects (tandem and reverse complement repeats) have negative implications for the subsequent steps of the method. On the one hand, because of the tandem repeat effect, the primers are able to bind to two regions of the molecules (2 and 4, or 2' and 4') and of the copies thereof during amplification, see Fig. 8G. This results in short amplicons (aberrant molecules that will be amplified and sequenced more efficiently) and long amplicons (complete molecules).
On the other hand, because of the reverse complement repeat effect, the forward primer would have to anneal to the complementary molecule (once the reverse primer has acted) but, because of the complementarity of the dsDNA sequences in the adapter, these regions are already complementary and can anneal directly (except for one letter). This yields aberrant molecules that will be amplified but not sequenced (short and complete FPRV-FPRV) and molecules that will be amplified but with lost directionality (short and complete FPRV-FPFW), see Fig. 8G.
Of note, none of these negative effects will be seen if the correct adapters (having at least one G, one M, and one C) are used, see Fig. 8E (Watson, GEUS molecule) and 8F (Crick, GEUS molecule).
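Both repeat effects can be checked on a candidate molecule with simple string operations. The sketch below is illustrative only: the box sequences, the spacers and the helper names are hypothetical, and primer binding is approximated by an exact match of the binding site. It counts how many times a primer binding site occurs in a molecule (more than one occurrence corresponds to the tandem effect) and tests whether two boxes are the reverse complement of each other.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_sites(template: str, site: str) -> list[int]:
    """Start index of every (possibly overlapping) occurrence of site."""
    hits = []
    start = template.find(site)
    while start != -1:
        hits.append(start)
        start = template.find(site, start + 1)
    return hits


def is_reverse_complement(box_a: str, box_b: str) -> bool:
    """True when box_a can anneal to box_b along its whole length."""
    return box_a == reverse_complement(box_b)


# Hypothetical molecule in which box 2 and box 4 are identical (tandem effect).
box = "TTGAGT"
tandem_molecule = "AAAA" + box + "CCCC" + box
tandem_sites = find_sites(tandem_molecule, box)

# Hypothetical molecule in which the fourth box differs, as with correct adapters.
unique_molecule = "AAAA" + box + "CCCC" + "TCGAGT"
unique_sites = find_sites(unique_molecule, box)

# Two boxes that are reverse complements can anneal to each other directly.
self_annealing = is_reverse_complement("TTGAGT", "ACTCAA")
```

A molecule with a single binding site per primer and no pair of reverse-complementary boxes corresponds to the situation described for the correct adapters.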
Fig. 8H and 8I show the subsequent amplification steps, which are affected by the negative effects described above.
EXAMPLE 4: Disadvantages of using a molecule wherein the second region comprises methylated C (M).
This Example represents the same method as Example 2, but with the difference that the extension step is performed using methylated C (M).
STEP i. providing a plurality of nucleic acid molecules.
Step a. Providing a double-stranded nucleic acid molecule: See, e.g., Fig. 9A, where a fragment of dsDNA from whole genome is represented, to which an "A tailing" is added.
Step b. Ligating double-stranded (ds) adaptors to each end of the molecule of step a) wherein one of the adaptors comprises a hairpin ("hairpin"): See Fig. 9B, where adaptors, herein called E9 adapters, are added to each end of the molecule resulting from step a. In the adapter's sequence, "M" denotes methylated cytosine (i.e., a modified nucleotide), and "C" denotes non-methylated C (i.e., non-modified nucleotide). These adapters are identical to those of Example 2 (Fig. 7B).
Step c. Synthesizing, for each of the strands of the nucleic acid molecules obtained in step (b), a complementary strand, the "synthetic complementary strand", by polymerase elongation from the 3' end of the "hairpin", using each of the strands of the nucleic acid molecules obtained in step (b2) as template. Fig. 9C represents the ligation of the adapters E9 to the molecule provided in step a.
Before the extending step, the strands are denatured (see Fig. 9D). Next, the synthetic complementary strand is generated by extension of the 3' end of the adapters, using natural nucleotides (A, G, T), and modified C (methylated C). Since only methylated cytosines were used in step c., the synthetic complementary strand generated comprises all methylated cytosines (M), see Fig. 9E.
Step d. Transformation or conversion step
After step c., the molecules generated are treated with an agent (in this case, bisulfite) capable of converting a nucleotide into another one which is read distinctly from the original nucleotide, under the conditions suitable for the conversion/transformation to occur. This is represented in Fig. 9F.
First, it is noted that this molecule is more difficult to treat with said agent because both strands are strongly bound by G-M bonds: the agent is more effective with ssDNA, and more energy is required to separate the G-M bonds. Thus, when the extension step is performed with M, the strands of each molecule are bound more strongly to each other than in the molecule of Example 2 (Fig. 7F), where the extension step was performed with non-methylated C, which are transformed into uracil by the agent and leave non-complementary gaps between the two strands that facilitate the access of more agent (such as bisulfite) to completely convert the entire molecule. An efficient transformation step is crucial for the efficiency of the method, as discussed in the previous examples.
Secondly, the resulting molecule, due to the presence of the methylated C in the synthetic complementary strand, comprises four regions, indicated in Fig. 9F with boxes. Two of said four regions are tandem copies, i.e., are similar to each other: box 1 is identical to box 3, and box 1' is identical to box 3'. This effect was avoided in the Example above, Fig. 7F, by performing an extension step with natural C (non-modified C), rather than with methylated C.
Fig. 9G shows the same molecules as Fig. 9F but in a linear form. For the purposes of the amplification, the molecules need to be denatured. However, in this case, due to the presence of M bound to G, the molecule shows a high level of complementarity. This did not happen in the molecules depicted in Fig. 7G and 7F, where the level of complementarity between both strands was weaker due to the transformation of non-methylated cytosines into uracil, which did not base pair with the corresponding G.
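The difference in complementarity between the two extension strategies can be made concrete with a small calculation. This is a simplified sketch, not a thermodynamic model: the 8-nt strand is hypothetical, "M" denotes methylated cytosine, "U" denotes converted cytosine, and, following the text above, a G facing a U is counted as unpaired. The helper names are assumptions introduced for illustration.

```python
# Pairs counted as complementary; G opposite U is deliberately absent.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "M"), ("M", "G")}


def facing_strand(top: str, cytosine: str) -> str:
    """Base facing each position of top (no reversal).

    cytosine is the symbol incorporated opposite G during extension:
    "C" for extension with non-methylated C, "M" for methylated C.
    """
    facing = {"A": "T", "T": "A", "C": "G", "M": "G", "G": cytosine}
    return "".join(facing[base] for base in top)


def bisulfite(strand: str) -> str:
    """Convert non-methylated C to U; methylated C (M) is protected."""
    return strand.replace("C", "U")


def paired_fraction(top: str, bottom: str) -> float:
    """Fraction of facing positions that still form a complementary pair."""
    pairs = sum((a, b) in WATSON_CRICK for a, b in zip(top, bottom))
    return pairs / len(top)


# Hypothetical original strand with one methylated and one non-methylated C.
original = "AMGTTCGA"

after_c_extension = paired_fraction(
    bisulfite(original), bisulfite(facing_strand(original, "C"))
)
after_m_extension = paired_fraction(
    bisulfite(original), bisulfite(facing_strand(original, "M"))
)
```

In this toy case, 5 of 8 positions remain paired after extension with non-methylated C, against 7 of 8 after extension with methylated C, which mirrors the stronger strand association described for Example 4.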
Fig. 9H shows the fusion primers to be used, which are identical to those shown in Fig. 7H.
Fig. 9I shows, in the upper part, all of the molecules prepared for the amplification/sequencing reaction and, in the bottom part, the specific hybridization of the primers of Fig. 9H to the molecules of Fig. 9G.
Fig. 9I shows that, in this example, it is the forward primer that can bind to box 4 of the molecules, while the reverse primer cannot bind. When the first amplicon is created, another forward primer will bind to it (Fig. 9J-L). As a result, the molecule is amplified from both ends with the same primer, thereby generating molecules that have identical ends, produced by the forward primer. These molecules would be extended but not sequenced, since sequencing platforms require different primer binding regions at each end (Fig. 5B). In contrast, in Example 2 (Fig. 7I-L), only the reverse primer could bind in the first cycle, and the forward primer could bind in the second cycle, creating the desired directionality.
In Fig. 9M, the first complete molecules are generated, but they are not suitable for sequencing because their ends are not different.
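The requirement that a complete molecule carries a different primer at each end can be expressed as a simple check. The sketch below is a hypothetical illustration: the labels "FW" and "RV" stand for the forward and reverse fusion primers, and the amplicon lists are invented to contrast an outcome like that of Example 2 with one like that of this example.

```python
from collections import Counter

VALID_PRIMERS = {"FW", "RV"}


def classify_amplicon(five_prime: str, three_prime: str) -> str:
    """Label an amplicon by the fusion primers found at its two ends."""
    if five_prime not in VALID_PRIMERS or three_prime not in VALID_PRIMERS:
        raise ValueError("primer labels must be 'FW' or 'RV'")
    if {five_prime, three_prime} == VALID_PRIMERS:
        return "sequenceable"
    return "identical ends"


# Hypothetical amplicons: different ends (as in Example 2) and
# forward primer at both ends (as described for this example).
example_2_like = [("FW", "RV"), ("RV", "FW")]
example_4_like = [("FW", "FW"), ("FW", "FW")]

summary = Counter(
    classify_amplicon(five, three) for five, three in example_2_like + example_4_like
)
```

Only the amplicons labelled "sequenceable" would present the two different primer binding regions that the text above states are required by sequencing platforms.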
STEP ii. Capturing step
This step comprises capturing at least some of the molecules provided in (i) by using at least one capture probe that binds at least partially to the second region of the plurality of nucleic acid molecules. Fig. 9N shows the capture step for this example (compare Fig. 7N).
EXAMPLE 5: Method of the present invention using a different adapter (E15).
Fig. 11A, B and C show the adapter E15 and the beginning of the method of the present invention. The only difference from Example 2 is that the adapter is provided as a complex of two molecules, wherein the hairpin is provided as a separate molecule, see Fig. 11A. The ligation is thus performed in two steps, see Fig. 11B and C.
Claims
1. A method comprising the steps of: (i) providing a plurality of nucleic acid molecules, wherein each of the nucleic acid molecules comprises two regions: a first region and a second region, wherein the base identities in the first and second region both provide, independently, information on the base identities in the corresponding loci of one or more original nucleic acid molecules, wherein the first and the second regions in each of the molecules provided in (i) comprise at least one nucleotide at a position corresponding to the same locus in a region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified; wherein at least two of the nucleic acid molecules of the plurality of nucleic acid molecules may have a different nucleotide in the first region at least at one position corresponding to the same locus in the region of interest, wherein that locus in the region of interest is occupied by a nucleotide susceptible of being modified and wherein, in the second region, a position corresponding to that same locus in the region of interest is occupied by the same nucleotide in the at least two nucleic acid molecules; wherein the second region does not comprise any methylated cytosine, and
(ii) Capturing the molecules provided in (i) by using at least one capture probe that binds to at least a portion of the second region which comprises at least one nucleotide which is located at a position which corresponds to a locus in the ROI which is occupied by a nucleotide susceptible of being modified.
2. The method according to claim 1, wherein the first region of at least one of the molecules of the plurality of nucleic acid molecules comprises, at least at one certain position which corresponds to a certain locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof, preferably wherein the at least one modified nucleotide is a methylated cytosine, and wherein the first region of at least one other molecule of the plurality of nucleic acid molecules does not comprise, at least at a position which corresponds to the same locus in the one or more original nucleic acid molecules, a modified nucleotide or a copy thereof or a transformed modified nucleotide or a copy thereof.
3. The method according to any one of claims 1 to 2, wherein the portion in the second region to which at least one capture probe binds has a length of at least one nucleotide, preferably at least two nucleotides, more preferably at least five nucleotides and even more preferably at least 10 nucleotides.
4. The method according to any one of the preceding claims, wherein the first region of the at least one of the plurality of the nucleic acid molecules provided in step (i) is a fragment of genomic DNA, preferably wherein the first region of all of the nucleic acid molecules provided in step (i) are fragments of genomic DNA.
5. The method according to any one of the preceding claims, wherein the plurality of nucleic acid molecules in step (i) is provided by: a) Providing one or more original nucleic acid molecules, preferably wherein the one or more original nucleic acid molecules are fragments of genomic DNA; b) Ligating one adaptor to at least one end of the one or more original nucleic acid molecules provided in a), thereby obtaining one or more adaptor-containing original nucleic acid molecules, wherein the 3' region of the adaptor forms a hairpin loop whose 3' end can be extended by action of a polymerase; c) Synthesizing, for each of the one or more adaptor-containing original nucleic acid molecules obtained in step b), a complementary strand, the "synthetic complementary strand", by polymerase elongation of the 3' end of the adaptor molecule, using the one or more adaptor-containing original nucleic acid molecules obtained in step b) as template, thereby pairing the one or more original nucleic acid molecules obtained in step b) with its synthetic complementary strand, to provide one or more paired double-stranded adaptor-containing nucleic acid molecules; wherein the synthesis is performed with non-modified cytosines and c21) Converting non-modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules; and/or c22) Converting modified nucleotides in the paired adaptor-containing nucleic acid molecules, if any, to another nucleotide which is read distinctly from said nucleotide, in the paired adaptor-containing nucleic acid molecules.
6. The method according to claim 5, wherein the adaptor comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand comprises:
(a) a region comprising at least two nucleotides that are complementary to the second strand and thus form a double stranded region, and
(b) optionally, a region that is not complementary to the second strand, and wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, at least one modified nucleotide, and at least one non-modified nucleotide susceptible of being converted.
7. The method according to claim 5 or 6, wherein the adaptor is a Y adapter and comprises a first strand and a second strand, wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region with a length of at least 5 nucleotides by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule, wherein:
- the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase,
- the first strand comprises at least 13 nucleotides, and comprises at least two regions:
(a) a 3' region comprising at least 5 nucleotides that are complementary to the second strand and thus form a double stranded region, and
(b) a 5' region that is not complementary to the second strand and is thus a single-stranded region, and wherein the double stranded region formed by the first and the second strands of the adapter comprises at least one non-modified nucleotide complementary to a modified nucleotide, at least one modified nucleotide, and at least one non-modified nucleotide susceptible of being converted.
8. The method according to any one of claims 5 to 7, wherein the Y adapter comprises:
- a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 44 and the second strand comprises SEQ ID NO: 45, or a sequence with at least 90% sequence identity to SEQ ID NO: 44 and 45, respectively, or,
- a first strand and a second strand, wherein the first strand comprises SEQ ID NO: 46 and the second strand comprises SEQ ID NO: 47, or a sequence with at least 90% sequence identity to SEQ ID NO: 46 and 47, respectively, or wherein the 3' region of the first strand and the 5' region of the second strand form a double stranded region by sequence complementarity, wherein the ends of said double stranded region formed by the 3' region of the first DNA strand and the 5' region of the second DNA strand of the Y-adapter are compatible with the ends of a double stranded DNA molecule, and, preferably, wherein the 3' region of the second strand forms a hairpin loop whose 3' end can be extended by action of a polymerase.
9. The method according to any one of claims 5 to 8, wherein the one or more original nucleic acid molecules are double-stranded (ds) nucleic acid molecules, preferably genomic ds DNA, and wherein the adaptors of step b) are ds adaptors.
10. The method according to any one of the preceding claims, wherein the method further comprises a step (iii) of determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules.
11. The method according to claim 10, wherein the determination of the true identity of a base at a certain locus in the one or more original nucleic acid molecules is performed by using at least two different primers, preferably at least three different primers, even more preferably four different primers, to sequence the molecule provided in step (i) as defined in claim 1, wherein the molecules provided in step (i) further comprise one adapter at the 5' end of the molecule and one adapter at the 3' end of the molecule; wherein the at least two different primers bind to at least three, preferably to at least four, different regions in the nucleic acid molecule provided in step (i), wherein:
1. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 5' end of the molecule, to sequence at least part of the first region of at least one nucleic acid molecules provided in step (i);
2. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in step (i), to sequence the second region of at least one of the nucleic acid molecules provided in step (i);
3. At least one of the primers is capable of binding at least partially to at least a portion of the adapter at the 3' end of the molecule, to sequence at least part of the second region of at least one of the nucleic acid molecules provided in step (i);
4. At least one of the primers is capable of binding at least partially to the region of the nucleotide sequence which covalently links the first region and the second region of the nucleic acid molecule provided in (i), to sequence the first region of at least one of the nucleic acid molecules provided in step (i).
12. The method according to one or more of claims 10 or 11, wherein the method further comprises using a computer comprising a processor, a memory, and instructions stored thereupon that, when executed, ascertain the identity of a base and/or an associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii).
13. An in vitro method for diagnosing a condition, the method comprising the steps of:
(1) Selecting or identifying a region of interest relevant to the condition to be diagnosed within the genome of a patient;
(2) Providing a plurality of nucleic acid molecules as defined in step (i) of the method as defined in any one of claims 1 to 12, from a sample obtained from the patient;
(3) Capturing the molecules provided in (i) by using a probe as defined in step (ii) of the method as defined in any one of claims 1 to 12;
(4) Determining the true identity of a base at a certain locus in the one or more original nucleic acid molecules and/or determining the epigenetic status of the original molecule(s) within the ROI; and
(5) Diagnosing a condition in the subject based at least in part on the information provided in step (4).
14. A computer program comprising instructions which, when executed by a computer, is able to determine the identity of an ascertained/inferred base and/or the associated BQ at a certain position (locus) in an original nucleic acid molecule, based on the information provided in step (iii) of the method as defined in any one of claims 10 to 12.
15. A method for designing at least one probe suitable for performing step (ii) of the method as defined in any one of claims 1 to 12, the method comprising:
(a) Selecting or identifying a region of interest within a genome and/or within a nucleic acid molecule;
(b) Inferring the sequence of the second region of the plurality of nucleic acid molecules provided in step (i) of the method as defined in any one of claims 1 to 12; and
(c) Obtaining the sequence of at least a capture probe that binds to at least a portion of the second region, wherein the second region comprises at least one nucleotide which is located at a position which corresponds to a locus in the region of interest which is occupied by a nucleotide susceptible of being modified.
16. A computer program comprising instructions which, when executed by a computer, is able to obtain the sequence of at least a capture probe that binds to at least a portion of the second region, by implementing the method as defined in claim 15.
17. A kit comprising one or more of the adapters comprising SEQ ID NO: 44, 45, 46, 47, 48, 49, 50, 51, or 52, or a sequence with at least 90% sequence identity to SEQ ID NO: 44, 45, 46, 47, 48, 49, 50, 51, or 52, respectively.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23383079.3 | 2023-10-20 | ||
| EP23383078 | 2023-10-20 | ||
| EP23383079 | 2023-10-20 | ||
| EP23383078.5 | 2023-10-20 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025083068A1 (en) | 2025-04-24 |
Family
ID=93150368
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2024/079220 Pending WO2025083068A1 (en) | 2023-10-20 | 2024-10-16 | Method for capturing epigenetically modified dna |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025083068A1 (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1394172A1 (en) | 2002-08-29 | 2004-03-03 | Boehringer Mannheim Gmbh | Improved method for bisulfite treatment |
| WO2010048337A2 (en) * | 2008-10-22 | 2010-04-29 | Illumina, Inc. | Preservation of information related to genomic dna methylation |
| WO2015104302A1 (en) | 2014-01-07 | 2015-07-16 | Fundació Privada Institut De Medicina Predictiva I Personalitzada Del Càncer | Method for generating double stranded dna libraries and sequencing methods for the identification of methylated cytosines |
| WO2015131107A1 (en) * | 2014-02-28 | 2015-09-03 | Nugen Technologies, Inc. | Reduced representation bisulfite sequencing with diversity adaptors |
| WO2023028478A2 (en) * | 2021-08-26 | 2023-03-02 | Illumina, Inc. | Methods and compositions for detecting genomic methylation |
| WO2023164505A2 (en) * | 2022-02-23 | 2023-08-31 | Ultima Genomics, Inc. | Methods and compositions for simultaneously sequencing a nucleic acid template sequence and copy sequence |
Non-Patent Citations (13)
| Title |
|---|
| "Concise Dictionary of Biomedicine and Molecular Biology", 2002, CRC PRESS |
| "The Dictionary of Cell and Molecular Biology", 1999, ACADEMIC PRESS |
| ALTSCHUL, S. ET AL., MOL. BIOL., vol. 215, 1990, pages 403 - 410 |
| BERNEY, M.MCGOURAN, J.F.: "Methods for detection of cytosine and thymine modifications in DNA", NAT REV CHEM, vol. 2, 2018, pages 332 - 348, XP036632179, DOI: 10.1038/s41570-018-0044-4 |
| BLAST MANUALALTSCHUL, S. ET AL., NCBI NLM NIH BETHESDA, pages 20894 |
| CHARETTE MGRAY MW: "Pseudouridine in RNA: what, where, how, and why", IUBMB LIFE, vol. 49, no. 5, 2000, pages 341 - 51, XP002598531 |
| CHEN K.ZHAO BS.HE C.: "Nucleic acid modifications in regulation of gene expression", CELL CHEM BIOL., vol. 23, no. 1, 2016, pages 74 - 85, XP029392702, DOI: 10.1016/j.chembiol.2015.11.007 |
| FROMMER ET AL., PROC NATL ACAD SCI USA, vol. 89, 1992, pages 1827 - 31 |
| HANDY DE. ET AL.: "Epigenetic modifications: basic mechanisms and role in cardiovascular disease", CIRCULATION, vol. 123, no. 19, 2011, pages 2145 - 56 |
| KOZAREWA IARMISEN J. ET AL.: "Overview of target enrichment strategies", CURR PROTOC MOL BIOL, 2015 |
| NABEL CS ET AL.: "AID/APOBEC deaminases disfavor modified cytosines implicated in DNA demethylation", NAT CHEM BIOL., vol. 8, no. 9, 2012, pages 751 - 8 |
| OLEK, NUCLEIC ACID RES., vol. 24, 1996, pages 5064 - 6 |
| SHENDURE J. ET AL.: "DNA sequencing at 40: past, present and future", NATURE, vol. 550, no. 7676, 2017, pages 345 - 353 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24790945; Country of ref document: EP; Kind code of ref document: A1 |