WO2022253288A1 - Methylation sequencing method and device - Google Patents
Methylation sequencing method and device
- Publication number
- WO2022253288A1 (PCT/CN2022/096730; CN2022096730W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- methylation
- block
- cancer
- nucleic acid
- target nucleic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/154—Methylation markers
Definitions
- the present application relates to the field of biomedicine, in particular to a methylation sequencing method and device.
- cfDNA Cell-free DNA
- WBC white blood cells
- ctDNA circulating tumor DNA
- methylome analysis offers several advantages: (i) aberrant methylation often occurs widely during cancer initiation, reflecting early changes in tumors; (ii) specific genomic regions such as "CpG islands" are frequently found to be methylated, which provides a great opportunity to analyze large numbers of alterations by targeted sequencing; (iii) methylation status is cell-type specific and thus can be used to infer the tissue source of ctDNA.
- NGS-based DNA methylation analysis techniques can be classified into two categories: bisulfite conversion-based methods (WGBS, RRBS) and enrichment-based methods (MeDIP, MBD-seq).
- WGBS, RRBS bisulfite conversion-based methods
- MeDIP, MBD-seq enrichment-based methods
- BS-seq bisulfite sequencing
- BC bisulfite conversion
- bisulfite-converted DNA is often poor in sequence diversity, so issues such as bias-prone target enrichment and high sequencing error rates further complicate scoring analysis.
- the present applicant developed the MERMAID methylation detection method, which can be based on a sequencing method that maximizes the use of cfDNA and greatly reduces the artifacts of methylation sequencing.
- Aided by a robust machine learning classifier, MERMAID substantially outperformed other biopsy-free methods in a proof-of-principle study of low-frequency ctDNA detection by ELSA-seq (see International Patent Publications WO2019/191900A1 and WO2019/192489A1), providing new opportunities for advancing clinical applications.
- the present application provides a method for detecting methylation modification of a target nucleic acid, the method comprising the following steps: step (a-1), determining a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites and the position information of the CpG sites; and/or step (a-2), determining a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the information amount of the candidate co-methylation block and the division balance degree of the candidate co-methylation block; and step (b), determining the presence and/or amount of the target nucleic acid in the sample to be tested based on the methylation degree of the co-methylation block.
- the present application provides an analysis device for detecting the methylation modification of a target nucleic acid, the device comprising: a block division module (a-1), which determines a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites and the position information of the CpG sites; and/or a block division module (a-2), which determines a co-methylation block based on the corrected correlation coefficient of the CpG sites in the target nucleic acid, the information amount of the candidate co-methylation block and the division balance degree of the candidate co-methylation block; and a judgment module (b), which determines the presence and/or content of the target nucleic acid in the sample to be tested based on the methylation degree of the co-methylation block.
- Methylation sequencing has attracted enormous interest because of its great potential to improve current ctDNA assays.
- MERMAID is a novel epigenetic analysis method characterized by well-conserved molecular diversity, robust noise suppression, and robust high-dimensional modeling.
- MERMAID may be particularly useful for blood-based applications: (i) a portion of cfDNA can be in single-stranded form, so this ssDNA-compatible approach can maximize the use of limited starting material, increasing the chance of detecting rare ctDNA; (ii) capture panels are designed with an excess of long RNA probes (>100 nucleotides long) complementary to various methylation patterns.
- MERMAID does not require prior knowledge of the assay (e.g., biopsied tissue), thus providing a solution for patients without surgically resected samples.
- although this method has only been validated on LC, it could be customized to other types of cancer (e.g., CRC) or body fluids (e.g., urine). It can also be extended to answer fundamental questions, such as tumor heterogeneity, or applied to other clinical scenarios, such as evaluating treatment effects.
- Fig. 1 illustrates an overview of the MERMAID method of the present application, including (A) to (E), wherein,
- B-C Schematic representation of the library construction and targeted sequencing workflow.
- cfDNA fragments were denatured and converted with sodium bisulfite (Lightning).
- a "tailing" (Tail&Tag-1) step is then performed by TdT (terminal deoxynucleotidyl transferase) to add an extra nucleotide (mainly dC; dark purple, just below Tail&Tag-1 at the right end).
- Splint adapter 1 (pink and yellow; the complementary double strand at the right end of the small fragment at the lower right of Tail&Tag-1) facilitates the ligation process through its 5' overhanging "arm" (mainly dG; light purple, at the left end of the same small fragment) in the presence of E. coli ligase (dotted arc).
- a copy strand of the original template is then generated from the common anchor site (yellow; the right end of the lower two strands between Pre-amp and Tail&Tag-2) by a uracil-tolerant polymerase, followed by ligation mediated by adapter 2 (green and blue; the complementary double strand at the left end of the small fragment at the lower left of Tail&Tag-2).
- PCR-1 PCR amplification
- barcoding dual indexing
- RNA probes purple, short dark gray fragments below the strand directly below Hyb&Cap
- streptavidin beads solid gray, spheres with dendrites above PCR-2
- FIG. 2 schematically illustrates an exemplary WGBS library construction by the present application, showing a comparison of different WGBS protocols.
- Partially methylated E. coli (DH5α) DNA was sheared to ~200 bp to mimic cfDNA.
- a no-template control (NTC) was performed to assess the background readout (e.g., residual adapter dimer) for each protocol, which was then subtracted from the library yield measured in each experiment.
- ELSA An exemplary sequencing method of this application
- SWT Accel-NGS Methyl-SEQ
- NEB NEBNext Ultra II
- Unless otherwise noted, experiments were performed with two technical replicates and two biological replicates; error bars represent one S.D.
- Fig. 2 includes (A) to (F), wherein,
- E-F Mutation spectrum and frequency observed without (E) or with (F) deep sequencing-induced error suppression.
- the same library was sequenced on Illumina HiSeq 2500, NovaSeq 6000 and MiniSeq, respectively.
- Figure 3 illustrates the design and performance of an exemplary target panel of the present application, including (A) to (E), wherein,
- Figure 4 illustrates methylation block definition and pattern recognition, including (A) to (D), wherein,
- Figure 5 illustrates the analytical validation of MERMAID, including (A) to (F), where,
- C FDR measured by sequencing normal leukocytes at different depths and with different input amounts (X-axis).
- the Y axis represents the percentage of false positive calls.
- E-F Assay sensitivity of MERMAID (E), ddPCR (F, left panel) and HS-UMI (F, right panel) measured by serial dilution of cancer cell (LC, CRC) DNA with normal leukocyte DNA at ratios of 0.0001–0.1.
- the Y-axis represents the percentage (E) or allele frequency (F) of positive markers observed.
- Dashed lines represent detection thresholds (95% CI) of ddPCR or HS-UMI for the indicated mutations.
- Figure 6 shows an exemplary sequencing method of the present application and the adapter ligation efficiency of TruSeq, including (A) to (E), wherein,
- the synthesized DNA template was ligated to Adapter 1 or Adapter 2 by one of the exemplary sequencing methods of the present application.
- Primers were designed to amplify "total" (For.1-Rev.1) or "ligated” fragments (For.1-Rev.2). Reactions were arranged in separate wells with the same reporter fluorescent probe.
- C-E ddPCR copy number curves (left panel) and ligation efficiency table (right panel) using ds TruSeq adapters (C), adapter 1 (D) and adapter 2 (E).
- the Y-axis represents copies/µL
- the X-axis represents 3 technical replicates (T1, 2, 3) and 3 biological replicates (B1, 2, 3).
- the curves show counts per well, while the tables reflect counts normalized by reaction volume (Tail-Tag.1 vs. Tail-Tag.2).
- FIG 7 illustrates schematically the principle of the library preparation method, including (A) to (B), wherein,
- Figure 8 illustrates the free capture performance of an exemplary sequencing method of the present application, including (A) to (G), wherein,
- C Representative libraries constructed by ELSA using 500 pg (red and dark red; lower peak), 1 ng (green and blue-purple; middle peak) and 2 ng (yellow and aqua; upper peak) of cfDNA from two patients. Each sample was prepared with technical replicates, and a no-template control (NTC; blue line, the flatter bottom line) was included.
- NTC no-template control
- Figure 9 illustrates the targeting performance of an exemplary sequencing method of the present application, including (A) to (G), wherein,
- normalized coverage was calculated as the observed unique read depth for each base divided by the average unique read depth for all targeted bases.
- the Y-axis represents the unique depth over the panel region, while the X-axis indicates the GC content of the reference genome.
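- The normalized coverage described above (per-base unique read depth divided by the mean unique read depth over all targeted bases) can be sketched as follows. This is a minimal illustration; the function name `normalized_coverage` and the example depths are hypothetical and do not come from the application.

```python
def normalized_coverage(unique_depths):
    """Divide each base's unique read depth by the mean unique depth
    over all targeted bases (illustrative helper, not from the source)."""
    if not unique_depths:
        return []
    mean_depth = sum(unique_depths) / len(unique_depths)
    if mean_depth == 0:
        # No coverage anywhere: report zeros instead of dividing by zero.
        return [0.0 for _ in unique_depths]
    return [depth / mean_depth for depth in unique_depths]


# Hypothetical per-base unique depths across a tiny target region.
example_depths = [50, 100, 150]
print(normalized_coverage(example_depths))
```

A value of 1.0 means the base is covered at exactly the panel-wide average; values far from 1.0 flag under- or over-captured positions.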
- Figure 10 illustrates a depiction of methylation blocks and patterns, including (A) to (E), wherein,
- Figure 11 illustrates the reproducibility of MERMAID, including (A) to (C), where,
- Figure 12 illustrates the accuracy of MERMAID, including (A) to (D), where,
- Figure 13 illustrates the FDR and LoD of MERMAID, including (A) to (F), where,
- the X-axis represents the pre-defined tumor fraction ⁇ i in each simulated "tumor" sample and the Y-axis represents the average detection of that sample over 10,000 replicates.
- E-F Fluorescence curves of ddPCR showing detection of the EGFR p.G719S mutation (E) and the EML4-ALK fusion (F) in spike-in experiments with the SW48 and NCI-H2228 cell lines, respectively.
- gating thresholds are indicated by solid horizontal lines (e.g., 6000, 2400, 3000, and 2760).
- Figure 14 illustrates tissue-based selection and classification of LC-specific markers, including (A) to (D), wherein,
- Figure 15 graphically illustrates the clinical characteristics of the validation cohort (plasma).
- the table shows 308 LC patients and 261 non-cancer controls stratified by clinical characteristics.
- UNK Unknown.
- LUAD lung adenocarcinoma;
- LUSC lung squamous cell carcinoma.
- Figure 16 illustrates the side-by-side comparison of MERMAID, HS-UMI and patient-specific ddPCR, including (A) to (F), wherein,
- the DNA input for each ddPCR reaction was 38, 62 and 54 ng, respectively.
- NTC no template control
- NC normal WBC
- PC positive control with the desired mutation at 0.1% or 0.5% AF (Multiple I cfDNA reference standard set, Horizon Discovery).
- 0.5% PC was generated by mixing WT and 1% reference DNA in a 1:1 ratio. Details are provided in Table 7.
- Figure 17 graphically illustrates the performance results for detecting cancer-associated changes based on the methylation level iAF of a single site, and based on the average methylation level mAF (mean methylation allele frequency) per block.
- Figure 18 graphically illustrates the results of the relationship between the "regional median length” and “regional length variation coefficient” for the values of ⁇ 2 and ⁇ 1 , respectively.
- Figure 19 graphically illustrates the density curves of the MBS statistic and the mAF statistic in samples of different blends.
- next-generation gene sequencing NGS
- high-throughput sequencing or “next-generation sequencing” generally refer to the second-generation high-throughput sequencing technology and higher-throughput sequencing methods developed thereafter.
- Next-generation sequencing platforms include but are not limited to existing sequencing platforms such as Illumina. With the continuous development of sequencing technology, those skilled in the art can understand that other sequencing methods and devices can also be used for this method. For example, second-generation gene sequencing can have the advantages of high sensitivity, high throughput, high sequencing depth, or low cost.
- examples of next-generation sequencing methods include Massively Parallel Signature Sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, and Complete Genomics' DNA nanoarray and combinatorial probe-anchor ligation sequencing.
- second-generation gene sequencing makes it possible to analyze the transcriptome and genome of a species in detail, so it is also called deep sequencing.
- the method of the present application can also be applied to first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing or single molecule sequencing (SMS).
- SMS single molecule sequencing
- sample to be tested generally refers to a sample that needs to be tested. For example, it can be detected whether one or more gene regions on the sample to be tested are modified.
- complementary region generally refers to a region that is complementary to a reference nucleotide sequence.
- a complementary nucleic acid can be a nucleic acid molecule that optionally has an opposite orientation.
- complementarity may refer to the following base pairings: guanine and cytosine; adenine and thymine; adenine and uracil.
- hybridization generally refers to a reaction in which one or more polynucleotides react to form a complex stabilized by hydrogen bonds between the bases of the nucleotide residues. Hydrogen bonding can occur through Watson-Crick base pairing, Hoogsteen binding, or in any other sequence-specific manner based on base complementarity.
- the complex may comprise two strands forming a double helix, three or more strands forming a multi-strand complex, self-hybridizing single strands, or any combination of these.
- the hybridization reaction may constitute a step in a wider method, such as the initiation of PCR or the enzymatic cleavage of polynucleotides by endonucleases.
- a second sequence that is completely complementary to a first sequence or that is polymerized by a polymerase using the first sequence as a template is said to be "complementary" to said first sequence.
- hybridizable refers to the ability of a polynucleotide to form complexes that are stabilized by hydrogen bonds between the bases of the nucleotide residues in a hybridization reaction.
- a hybridizable nucleotide sequence is at least about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, or 100% complementary to the sequence to which it hybridizes.
- polynucleotide represents polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogs thereof.
- a polynucleotide can have any three-dimensional structure and can perform any function, whether known or unknown.
- examples of polynucleotides include coding or non-coding regions of genes or gene segments, loci defined by linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), short interfering RNA (siRNA), short-hairpin RNA (shRNA), microRNA (miRNA), ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers and linkers.
- a polynucleotide may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.
- the term "modification state” generally refers to the modification state of the gene fragment, nucleotide or its base in the present application.
- the modification state in the present application may refer to the modification state of cytosine.
- a gene segment of the present application having a modified state may have altered gene expression activity.
- the modification status of the present application may refer to the methylation modification of a base.
- the modified state in this application may refer to the covalent bonding of a methyl group at the 5-carbon position of cytosine in the CpG region of genomic DNA, which, for example, yields 5-methylcytosine (5mC).
- a modification state can refer to the presence or absence of 5-methylcytosine ("5-mCyt") within the DNA sequence.
- methylation generally refers to the methylation state of a gene fragment, nucleotide or its base in this application.
- the DNA fragment where the gene in this application is located may have methylation on one strand or multiple strands.
- the DNA fragment where the gene in this application is located may have methylation at one site or multiple sites.
- transformation generally refers to the transformation of one or more structures into another structure.
- the transformations of the present application can be specific.
- cytosine without methylation modification can be converted into other structures (such as uracil), and cytosine with methylation modification can remain substantially unchanged upon conversion.
- cytosine without methylation modification can be cleaved after conversion, and cytosine with methylation modification can be substantially unchanged after conversion.
- the term "bisulfite" generally refers to a reagent that can distinguish DNA regions with and without modification states.
- the bisulfite may include bisulfite, or an analog thereof, or a combination thereof.
- bisulfite can deaminate the amino group of unmodified cytosine to distinguish it from modified cytosine.
- the term “analogue” generally refers to a substance having a similar structure and/or function.
- analogs of bisulfite may have a similar structure to bisulfite.
- an analog of bisulfite may refer to a reagent that can also distinguish between DNA regions that have a modified state and those that do not.
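- The conversion behavior described above (unmethylated cytosine is deaminated and read as a different base, while methylated cytosine stays substantially unchanged) can be illustrated in silico. The sketch below is only a toy model: the function name `bisulfite_convert`, the 0-based position convention, and the choice to emit "T" (the base read after amplification of uracil) are illustrative assumptions, not part of the application.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Toy in-silico bisulfite conversion of a single DNA strand.

    sequence: DNA string (A/C/G/T).
    methylated_positions: 0-based indices of cytosines assumed to be methylated.
    Unmethylated C is written as T (uracil is read as T after amplification);
    methylated C is left as C.
    """
    protected = set(methylated_positions)
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in protected:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# Hypothetical fragment: the C at index 1 is methylated, the others are not.
print(bisulfite_convert("ACGTCCG", [1]))
```

Comparing the converted read with the reference then reveals methylation: a C that survives conversion is inferred to be methylated.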
- the term "about" generally refers to a range of 0.5%-10% above or below the specified value, such as 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
- the present application provides a method for detecting methylation modification of a target nucleic acid, the method comprising the following steps: step (a-1), determining a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites and the position information of the CpG sites; and/or step (a-2), determining a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the information amount of the candidate co-methylation block and the division balance degree of the candidate co-methylation block; and step (b), determining the presence and/or content of the target nucleic acid in the sample to be tested based on the degree of methylation of the co-methylation block.
- the method detects the presence and/or content of the target nucleic acid with methylation modification in the test sample.
- the "subject" from which the sample to be tested is obtained can be a mammal, such as a non-primate (for example, cow, pig, horse, cat, dog, rat, etc.) or a primate (for example, monkey or human).
- the subject is a human.
- the subject is a mammal (e.g., a human) suffering from or potentially suffering from a disease, disorder or condition, examples of which are described herein.
- the subject is a mammal (e.g., a human) at risk of developing a disease, disorder or condition, examples of which are described herein.
- the correlation coefficient of the CpG sites comprises a Pearson correlation coefficient between two or more of the CpG sites in the target nucleic acid.
- said level of methylation of said CpG sites comprises a difference in mean methylation allele frequency (mAF) between two or more of said CpG sites in said target nucleic acid.
- mAF mean methylation allele frequency
- the methylation level of the CpG sites comprises the ratio of the difference in mAF to the sum of mAF between two or more of the CpG sites in the target nucleic acid.
- the location information of the CpG sites comprises differences in genomic locations between two or more of the CpG sites in the target nucleic acid.
- the position information of the CpG sites comprises the ratio of the genomic position distance between two or more of the CpG sites in the target nucleic acid to the length of the target nucleic acid.
- the step (a-1) includes: determining the corrected correlation coefficient between every two CpG sites of the target nucleic acid, the corrected correlation coefficient d_ij between site i and site j being calculated by the following formula: where ρ_ij represents the Pearson correlation coefficient, E(y_i) represents the average methylated allele frequency (mAF) of all samples at site i, E(y_j) represents the average methylated allele frequency (mAF) of all samples at site j, pos_i represents the genomic position of site i, pos_j represents the genomic position of site j, L represents the length of the target nucleic acid region, and ⁇ 1 and ⁇ 2 are independently selected from numbers greater than or equal to 0.
- the value range of ⁇ 1 is 0-1.
- the value range of ⁇ 2 is 0 to 1.
- ⁇ 1 and ⁇ 2 are independently selected from 0.
- ⁇ 1 is selected from 0 in the present application; for example, ⁇ 2 is selected from 0 in the present application.
- the correlation coefficient of the CpG sites in the target nucleic acid used in the step (a-2) comprises a corrected correlation coefficient of the CpG sites in the target nucleic acid; the corrected correlation coefficient comprises the Pearson correlation coefficient between two or more of the CpG sites in the target nucleic acid, corrected based on the methylation level of the CpG sites and/or the position information of the CpG sites.
- the corrected correlation coefficient of the CpG site in the target nucleic acid comprises the corrected correlation coefficient in the above step (a-1) method.
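- The ingredients of the corrected correlation coefficient named above (the Pearson correlation between two sites, the ratio of the mAF difference to the mAF sum, and the ratio of genomic distance to region length) can be sketched as follows. The formula itself is not reproduced in this text, so the way the terms are combined here (subtracting two weighted penalties from the Pearson coefficient) is purely an assumption, and the names `w_maf` and `w_pos` are hypothetical stand-ins for the two weights that are selected from numbers of 0 or greater.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def corrected_correlation(y_i, y_j, pos_i, pos_j, region_length,
                          w_maf=0.0, w_pos=0.0):
    """ASSUMED form: Pearson coefficient minus weighted mAF and distance penalties.

    y_i, y_j: per-sample methylation frequencies at sites i and j.
    w_maf, w_pos: hypothetical names for the two non-negative weights.
    """
    rho = pearson(y_i, y_j)
    mean_i = sum(y_i) / len(y_i)
    mean_j = sum(y_j) / len(y_j)
    maf_sum = mean_i + mean_j
    # Ratio of the mAF difference to the mAF sum (0 if both sites are unmethylated).
    maf_term = abs(mean_i - mean_j) / maf_sum if maf_sum > 0 else 0.0
    # Ratio of the genomic distance to the length of the target region.
    pos_term = abs(pos_i - pos_j) / region_length
    return rho - w_maf * maf_term - w_pos * pos_term


# Hypothetical per-sample frequencies at two nearby CpG sites.
site_i = [0.1, 0.2, 0.3]
site_j = [0.2, 0.4, 0.6]
print(corrected_correlation(site_i, site_j, 100, 150, 500))
print(corrected_correlation(site_i, site_j, 100, 150, 500, w_maf=1.0, w_pos=1.0))
```

With both weights set to 0, this sketch reduces to the plain Pearson coefficient, which is consistent with the statement that the weights may be selected as 0.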
- the information amount of the candidate co-methylation block includes the quantity information of the CpG sites of the candidate co-methylation block.
- the degree of partition balance of the candidate co-methylation blocks includes differences in the number of CpG sites of different candidate co-methylation blocks.
- the degree of partition balance of the candidate co-methylation blocks includes the coefficient of variation of the numbers of CpG sites of the different candidate co-methylation blocks.
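- The two partition statistics described above, the number of CpG sites per candidate block and the coefficient of variation of those numbers, can be computed as in the sketch below. It covers only these ingredients and does not reproduce the block index itself; the function names and the breakpoint convention (a breakpoint is the index of the first site of a new block) are illustrative assumptions.

```python
import statistics


def block_sizes(n_sites, breakpoints):
    """Number of CpG sites in each block of a partition.

    n_sites: total number of ordered CpG sites in the region.
    breakpoints: indices (0 < b < n_sites) at which a new block starts.
    """
    edges = [0] + sorted(breakpoints) + [n_sites]
    return [end - start for start, end in zip(edges, edges[1:])]


def size_coefficient_of_variation(sizes):
    """Population standard deviation of block sizes divided by their mean."""
    mean_size = statistics.fmean(sizes)
    if mean_size == 0:
        return 0.0
    return statistics.pstdev(sizes) / mean_size


# Hypothetical region with 10 CpG sites split at sites 3 and 7.
sizes = block_sizes(10, [3, 7])
print(sizes)
print(size_coefficient_of_variation(sizes))
```

A coefficient of variation of 0 corresponds to a perfectly balanced partition; larger values indicate blocks of very different sizes.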
- the step (a-2) includes: determining the co-methylation blocks by maximizing the block index of the candidate co-methylation blocks in the target nucleic acid,
- where B_i represents the number of the CpG sites of the i-th candidate co-methylation block
- ⁇ 1 and ⁇ 2 are independently selected from 0 or greater numbers.
- the block breakpoints for the co-methylated blocks are determined by an iterative method of unique breakpoints.
- the value range of ⁇ 1 is 0-10.
- the value of ⁇ 2 ranges from 0 to 10.
- the value ranges of ⁇ 1 and ⁇ 2 are independently selected from rational numbers from 0 to 10.
- ⁇ 1 and ⁇ 2 are independently selected from 0.
- ⁇ 1 is selected from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10
- ⁇ 2 is selected from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10.
- the step (b) comprises: determining the presence and/or amount of the target nucleic acid based on the length of the continuous CpG sites on each sequencing read of the co-methylation block, the number of CpGs on the read, and the total number of reads of the co-methylation block.
- the step (b) includes: determining the methylation block score of the co-methylation block, and the methylation block score MBS is calculated by the following formula:
- where n is the total number of reads covering all CpG sites of the co-methylation block, L_i is the number of CpG sites contained on the i-th read, l_ij is the length of the continuously methylated CpG sites on the i-th read, and m is the sequencing depth on the i-th read.
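- The quantities that enter the methylation block score (the number of CpG sites on each read and the lengths of its runs of continuously methylated CpG sites) can be extracted as sketched below. The MBS formula itself is not reproduced in this text, so `run_length_score` is an assumed stand-in that averages squared, length-normalized run lengths over reads; it should not be read as the MBS definition. Reads are encoded as hypothetical strings with "1" for a methylated CpG and "0" for an unmethylated one.

```python
from itertools import groupby


def methylated_runs(read):
    """Lengths of consecutive methylated CpG runs on one read ('1' = methylated)."""
    return [len(list(group)) for state, group in groupby(read) if state == "1"]


def run_length_score(reads):
    """ASSUMED score: mean over reads of sum_j (l_ij / L_i) ** 2.

    L_i is the number of CpG sites on read i and l_ij the length of its
    j-th methylated run. Long uninterrupted runs score higher than the
    same number of scattered methylated sites.
    """
    if not reads:
        return 0.0
    total = 0.0
    for read in reads:
        n_cpg = len(read)
        if n_cpg == 0:
            continue
        total += sum((run / n_cpg) ** 2 for run in methylated_runs(read))
    return total / len(reads)


# Hypothetical reads covering a 4-CpG block.
print(methylated_runs("1101"))
print(run_length_score(["1111", "0000"]))
```

Under this assumed form, a fully methylated read contributes 1, an unmethylated read contributes 0, and a read such as "1101" contributes less than a read with three adjacent methylated sites.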
- UMI correction is applied to the summary of the sequencing data of the samples to be tested.
- the method further includes: extracting the feature values of the MBS of the co-methylation blocks of tumor samples and healthy samples through a machine learning model, and determining the presence and/or amount of said target nucleic acid based on the MBS of the co-methylation blocks in the sample to be tested.
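- As an illustration of training a model on per-block MBS values from tumor and healthy samples, the sketch below fits a small logistic regression by gradient descent. The application does not specify the model type in this text, so logistic regression is an assumed stand-in, and the feature values, learning rate and epoch count are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / len(features)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def predict_proba(weights, bias, x):
    """Estimated probability that a sample with block scores x is tumor-derived."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical MBS values for two blocks: label 1 = tumor, 0 = healthy.
train_x = [[0.30, 0.25], [0.28, 0.35], [0.40, 0.30],
           [0.02, 0.01], [0.03, 0.02], [0.01, 0.04]]
train_y = [1, 1, 1, 0, 0, 0]
model_w, model_b = train_logistic(train_x, train_y)
print(predict_proba(model_w, model_b, [0.35, 0.30]))
print(predict_proba(model_w, model_b, [0.02, 0.02]))
```

In practice the number of blocks is far larger than the two used here, which is why the text emphasizes robust high-dimensional modeling.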
- An analysis device for detecting the methylation modification of a target nucleic acid, comprising: a block division module (a-1), which determines a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites and the position information of the CpG sites; and/or a block division module (a-2), which determines a co-methylation block based on the corrected correlation coefficient of the CpG sites in the target nucleic acid, the information amount of the candidate co-methylation block and the division balance degree of the candidate co-methylation block; and a determination module (b), which determines the presence and/or amount of the target nucleic acid in the sample to be tested based on the degree of methylation of the co-methylation block.
- the analytical device for detecting the methylation modification of the target nucleic acid of the present application may comprise modules that implement the steps of the method for detecting the methylation modification of the target nucleic acid of the present application.
- the present application provides a method for detecting ctDNA using methylation sequencing, which includes: selecting differentially methylated CpG sites; separating the CpG sites into multiple co-methylation blocks based on the similarity of their methylation status; sequencing samples to obtain methylation sequencing reads; and detecting the average methylation level of each co-methylation block of the samples for further DNA methylation analysis.
- differentially methylated CpG sites are selected from the TCGA database generated by the Infinium HumanMethylation 450K array.
- the average methylation level is the average methylation allele frequency.
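- The average methylation allele frequency of a block can be sketched as follows. The sketch assumes that the per-site frequency is the number of methylated calls divided by the total calls at that site and that the block value is the unweighted mean of the per-site frequencies; the application may define the averaging differently, and the function name and example counts are hypothetical.

```python
def block_maf(site_counts):
    """Mean methylation allele frequency over the CpG sites of one block.

    site_counts: list of (methylated_calls, total_calls) per CpG site.
    Sites with no coverage are skipped; returns None if no site is covered.
    """
    frequencies = [methylated / total
                   for methylated, total in site_counts if total > 0]
    if not frequencies:
        return None
    return sum(frequencies) / len(frequencies)


# Hypothetical block with three CpG sites, one of them uncovered.
print(block_maf([(5, 10), (2, 10), (0, 0)]))
```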
- the number of co-methylation blocks is between 1/30 and 1/5 of the number of CpG sites.
- co-methylated blocks are further restricted by comparing a specific tumor sample with a normal tissue sample, e.g., a primary lung tumor with a normal lung tissue sample.
- co-methylated blocks of insufficient depth (<100) on most samples (>80%) are excluded from downstream analysis.
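The depth filter above can be sketched in a few lines. This is only an illustration under stated assumptions: "insufficient depth" is read as a depth below 100, "most samples" as more than 80% of samples, and the function names `keep_block` and `filter_blocks` as well as the example depths are hypothetical, not taken from the application.

```python
def keep_block(depths, min_depth=100, max_low_fraction=0.8):
    """Return True if a co-methylation block passes the depth filter.

    depths: per-sample sequencing depths for one block.
    The block is dropped when MORE than max_low_fraction of the samples
    have a depth below min_depth (assumed reading of the criterion).
    """
    if not depths:
        return False
    n_low = sum(1 for d in depths if d < min_depth)
    return n_low / len(depths) <= max_low_fraction


def filter_blocks(block_depths, min_depth=100, max_low_fraction=0.8):
    """Keep only the blocks that pass keep_block."""
    return {
        name: depths
        for name, depths in block_depths.items()
        if keep_block(depths, min_depth, max_low_fraction)
    }


# Hypothetical depths for three blocks across five samples.
example = {
    "block_A": [250, 310, 90, 400, 120],  # 1 of 5 samples low: kept
    "block_B": [20, 35, 60, 15, 80],      # 5 of 5 samples low: dropped
    "block_C": [50, 60, 70, 80, 150],     # 4 of 5 low (exactly 80%): kept
}
kept = filter_blocks(example)
print(sorted(kept))
```

Whether a block that is low in exactly 80% of samples should be kept is a boundary choice the text does not settle; the sketch keeps it.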
- said co-methylation blocks are separated based on a modified correlation matrix called "block index”.
- the method further comprises normalizing the depth difference of each methylation block using the Methylation Block Score (MBS), so as to distinguish very small tumor signals, such as 0.1%, 0.2%, 0.5% and 1%.
- MBS Methylation Block Score
- the sequencing reads to be analyzed are trimmed by any one or more of the following criteria: (i) the queried length of G bases is longer than a fixed value m; (ii) the fraction of non-G bases is less than a fixed number p; (iii) the next base is a high-quality A/T/C (Phred score > 30).
- duplicate removal is applied with a tolerance of +/- 3 bp at the start points of both R1 and R2 to minimize artifacts associated with inappropriately assigned fragment end positions.
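A minimal sketch of position-tolerant duplicate removal is given below. It assumes each fragment is summarized as a hypothetical tuple `(chrom, r1_start, r2_start)` and uses a simple greedy comparison against already accepted representatives; the actual pipeline (for example its handling of strand or UMIs) is not described here, and the name `dedup_with_tolerance` is illustrative.

```python
def dedup_with_tolerance(fragments, tol=3):
    """Collapse fragments whose R1 and R2 start points both lie within
    +/- tol bp of an already accepted representative on the same chromosome.

    fragments: iterable of (chrom, r1_start, r2_start) tuples.
    Returns the list of representative fragments.
    """
    representatives = []
    for chrom, s1, s2 in sorted(fragments):
        for rep_chrom, rep_s1, rep_s2 in representatives:
            if (rep_chrom == chrom
                    and abs(s1 - rep_s1) <= tol
                    and abs(s2 - rep_s2) <= tol):
                break  # treated as a duplicate of this representative
        else:
            representatives.append((chrom, s1, s2))
    return representatives


# Hypothetical fragments: the second and third are near-duplicates of the first.
frags = [
    ("chr1", 1000, 1150),
    ("chr1", 1002, 1149),
    ("chr1", 1003, 1150),
    ("chr1", 1000, 1160),  # R2 start differs by 10 bp: distinct fragment
    ("chr2", 1000, 1150),  # different chromosome: distinct fragment
]
unique = dedup_with_tolerance(frags)
print(unique)
```

The greedy pass is quadratic in the worst case and compares against the first accepted fragment rather than a cluster centre, which is adequate for an illustration but not for production data.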
- UMI is applied in the correction.
- the average methylation level of each co-methylated block of unmethylated phage lambda DNA is detected to measure genome-wide "technical noise" (C/(C+T) in read 1, G/(G+A) in read 2).
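The noise metric can be illustrated as follows. Because lambda DNA is unmethylated, every cytosine should be converted, so any residual C in read 1 (or G in read 2) counts as background. The sketch assumes base counts have already been tallied at reference C positions for read 1 and reference G positions for read 2; the function name `conversion_noise` and the counts are hypothetical.

```python
def conversion_noise(read1_counts, read2_counts):
    """Estimate background signal on unmethylated lambda DNA.

    read1_counts: base counts observed in read 1 at reference C positions.
    read2_counts: base counts observed in read 2 at reference G positions.
    Returns C/(C+T) for read 1 and G/(G+A) for read 2.
    """
    def fraction(retained, converted):
        total = retained + converted
        return retained / total if total else float("nan")

    return {
        "read1": fraction(read1_counts.get("C", 0), read1_counts.get("T", 0)),
        "read2": fraction(read2_counts.get("G", 0), read2_counts.get("A", 0)),
    }


# Hypothetical counts, for illustration only.
noise = conversion_noise({"C": 20, "T": 9980}, {"G": 15, "A": 9985})
print(noise)
```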
- a machine learning classifier of methylation patterns is applied to assess tumor levels, preferably in early tumor screening.
- the method described above is used in the assessment of tumor levels during early screening of tumors from homogeneous tumors, heterogeneous tumors, hematologic cancers and/or solid tumors; preferably, said tumor is from one or more of the cancers of the following group: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, bowel cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcomas, thoracic malignancies (except lung), melanoma, and testicular cancer.
- kits for detecting ctDNA with methylation sequencing wherein the kit can be used to capture at least 50, 100, 150, 200, 300, 500, 800, 1000, 1500 or 2000 co-methylation blocks as shown in Table 5-2; preferably, this can be used to capture at least all co-methylation blocks as shown in Table 5-1.
- it provides a device for performing the above method in assessing general tumor level during early tumor screening.
- non-volatile memory storing a program, which can be used to perform the above method in assessing general tumor level during early tumor screening.
- the present application provides a method for detecting the level of base modification, comprising providing the nucleic acid molecule combination of the present application and/or the kit of the present application.
- the base modification includes methylation modification.
- the present application provides a storage medium, which records a program capable of running the method of the present application.
- the non-transitory computer readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid state storage (SSS) (such as a solid state drive (SSD), a solid state card (SSC) or a solid state module (SSM)), an enterprise-grade flash drive, tape, or any other non-transitory magnetic media, etc.
- SSD solid state drive
- SSC solid state card
- SSM solid state module
- Non-transitory computer readable storage media may also include punched cards, paper tape, optical mark sheets (or any other physical media having a pattern of holes or other optically identifiable markings), compact disc read only memory (CD-ROM), compact disc rewritable (CD-RW), Digital Versatile Disc (DVD), Blu-ray Disc (BD) and/or any other non-transitory optical media.
- CD-ROM compact disc read only memory
- CD-RW Rewritable Disc
- DVD Digital Versatile Disc
- BD Blu-ray Disc
- the present application provides a device, and the device includes the storage medium of the present application.
- the device further includes a processor coupled to the storage medium, and the processor is configured to execute based on a program stored in the storage medium to implement the method of the present application.
- Embodiment 1 The detection method of the present application
- differentially methylated sites were initially screened from the TCGA database generated by the Infinium HumanMethylation 450K array. A total of 4539 tumor samples and 521 normal tissue samples were analyzed. Data from a GEO dataset (GSE40279) of 656 normal WBC samples were used to remove CpG sites that are hypermethylated (>0.1) in the hematopoietic lineage. CpG sites located on the X or Y chromosomes were also excluded. Selection of differentially methylated loci (DML) was performed using the "limma (V2.0)" software, with the cutoff set to a B-H corrected FDR < 0.05. In addition, CpG sites associated with common cancers in previous studies were also included. This resulted in a total of 80,672 CpG sites in the marker discovery phase.
- the CpG sites were then partitioned into 8,312 blocks (described in "Co-methylation block partitioning") and verified using in-house sequencing data from 48 primary lung tumor and 20 normal lung tissue samples. Blocks with insufficient depth (<100) on most samples (>80%) were excluded from downstream analysis. Linear regression was used to select differentially methylated blocks, with cutoffs set at log(fold change) > 0.05 and B-H corrected FDR < 0.05. A total of 2473 blocks were selected as classification features, and their genome coordinates are listed in Table 5.
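The two selection cutoffs can be illustrated with a short sketch of the standard Benjamini-Hochberg (B-H) adjustment followed by thresholding. The linear regression that produces the per-block statistics is not reproduced; the per-block log fold changes and p-values below are hypothetical inputs, and the names `benjamini_hochberg` and `select_blocks` are illustrative rather than taken from the application.

```python
def benjamini_hochberg(pvalues):
    """Return B-H adjusted p-values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_blocks(stats, lfc_cutoff=0.05, fdr_cutoff=0.05):
    """stats: dict mapping block name -> (log_fold_change, p_value).

    Keeps blocks with log fold change above lfc_cutoff and
    B-H adjusted p-value below fdr_cutoff.
    """
    names = list(stats)
    fdr = benjamini_hochberg([stats[name][1] for name in names])
    return [
        name
        for name, q in zip(names, fdr)
        if stats[name][0] > lfc_cutoff and q < fdr_cutoff
    ]


# Hypothetical per-block statistics (log fold change, raw p-value).
stats = {
    "b1": (0.30, 0.001),  # passes both cutoffs
    "b2": (0.02, 0.002),  # fold change too small
    "b3": (0.10, 0.04),   # adjusted p-value is about 0.053, fails FDR
    "b4": (0.20, 0.60),   # not significant
}
selected = select_blocks(stats)
print(selected)
```

Block `b3` shows why the correction matters: its raw p-value is below 0.05, but its adjusted value is not.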
- FASTQ files were generated from raw BCL data by using bcl2fastq (V2.19.1). Illumina-specific adapters and low-quality sequences (SLIDINGWINDOW:4:15 TRAILING:20) were trimmed with Trimmomatic (V0.36). For Accel-NGS Methyl-Seq (Swift Biosciences), additional trimming was performed as described in previous work. For ELSA-seq, an exemplary sequencing method of the present application, the present application tested parameters at different stringency levels to remove low-complexity tail sequences.
- The present application first separates the design regions of the capture panel into co-methylation blocks based on the "block index", an improved correlation matrix.
- ⁇ ij the Poisson correlation coefficient
- E(y i ) denotes the mAF over all samples at position i
- L represents the length of the original region
- ⁇ 1 and ⁇ 2 are parameters estimated by using prior information.
- ⁇ 1 and ⁇ 2 are the penalty coefficients for unbalanced split and over-split, respectively.
- both ⁇ 1 and ⁇ 2 are set to 1.0 based on the desired block length and uniformity.
- the "median length of the region" measures the size of the defined regions: the smaller the value, the less information a region contains, and the larger the value, the more information it contains. The "variation coefficient of region length" (standard deviation/mean) measures the size difference between regions: the larger the value, the more unbalanced the region division (more isolated points and more information lost), and the smaller the value, the more balanced the region division.
- the value of ⁇ 1 mainly affects the "variation coefficient of regional length”
- the value of ⁇ 2 mainly affects the "median length of the region”. The influence relationship is shown in Figure 18.
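The two summary statistics described above, the median region length and the variation coefficient of region length (standard deviation/mean), can be computed as in the sketch below. It assumes blocks are given as hypothetical half-open `(start, end)` coordinate pairs and uses the population standard deviation; the text does not say whether the sample or the population form is intended.

```python
import statistics


def block_length_summary(blocks):
    """Summarize a block partition.

    blocks: list of (start, end) coordinates, end exclusive.
    Returns (median_length, coefficient_of_variation).
    """
    lengths = [end - start for start, end in blocks]
    median_length = statistics.median(lengths)
    mean_length = statistics.mean(lengths)
    # Population standard deviation; an assumption, see the lead-in.
    cv = statistics.pstdev(lengths) / mean_length
    return median_length, cv


# Hypothetical partition: three similar blocks and one near-isolated site.
blocks = [(0, 100), (200, 350), (400, 540), (600, 610)]
median_length, cv = block_length_summary(blocks)
print(median_length, round(cv, 3))
```

A partition with many very short blocks next to long ones raises the variation coefficient, which is the imbalance the first penalty coefficient is said to control.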
- MBS Methylation Block Score
- n is the total number of reads covering multiple CpG sites.
- Li is the number of CpG sites covered on the i-th read.
- lij represents the length of consecutive methylated CpG sites (>1)
- m represents the total count on the i-th read.
- the depth difference was normalized using the number of reads in each block.
- Tail addition: based on empirical observations, a tail sequence (90% G) of varying length was added to the 5'-end of R2;
- Template count: set the original (post-BC) fragment depth at 250-500X and the original fragment depth at 500-1000X.
- the 3' adapter was modified with a 6-base random UMI inserted next to the overhang sequence (adapter 1-UMI), so that the UMI was sequenced during the first six cycles of R2 (Table 1).
- two nucleotides eg, 5'-NNDDNN-3', D: A/T/G
- a minimum edit distance of 2 was allowed for error correction.
- TruSeq dsDNA was amplified using primers KRASF and KRASR (KRAS-177)
- Ligated copies were detected by primers LEF and LER, probe KRAS-G13D.
- Ligated copies were detected by primers LEF and LER-ATNR1, probe KRAS-G13D.
- WGBS library construction with ELSA-seq, NEBNext Ultra II (New England Biolabs) and Accel-NGS Methyl-Seq (Swift Biosciences) was performed as described in Methods or according to the manufacturer's instructions. Genomic DNA was sheared to ~200 bp (peak) by sonication. Library quality was assessed using LabChip GXII Touch 24 (Perkin Elmer). Paired-end sequencing (2×150 bp) was then performed on the Illumina NovaSeq 6000 system.
- the in vitro methylation process was performed at 37°C for 1 hour and terminated by heating at 65°C for 20 minutes.
- the sample size (n) is calculated as follows:
- MERMAID was evaluated for diagnostic yield for LC in a hypothetical screening population of 10,000 subjects. Based on the results of this study, the sensitivity was set at 63.0% (194/308), and the specificity was set at 96.2% (251/261). According to the Surveillance, Epidemiology, and End Results (SEER) program (SEER.cancer.gov/data/access, 2020), the prevalence of LC among average-risk older adults was assumed to be 0.53%. Among 10000 subjects, the number of individuals with true positive (TP), false positive (FP), true negative (TN) and false negative (FN) results was predicted. Positive predictive value (PPV) and negative predictive value (NPV) were calculated as follows:
- Predicted FP per TP indicates the number of FP subjects observed for each TP subject detected.
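The screening arithmetic above can be reproduced with the stated inputs (10,000 subjects, prevalence 0.53%, sensitivity 63.0%, specificity 96.2%). The formulas themselves are not shown in this text, so the sketch assumes the standard definitions PPV = TP/(TP+FP) and NPV = TN/(TN+FN); the function name `screening_yield` is illustrative.

```python
def screening_yield(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts and predictive values
    for a screening population of n subjects."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity      # true positives
    fn = diseased - tp               # false negatives
    tn = healthy * specificity       # true negatives
    fp = healthy - tn                # false positives
    return {
        "TP": tp, "FN": fn, "TN": tn, "FP": fp,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "FP_per_TP": fp / tp,
    }


result = screening_yield(n=10000, prevalence=0.0053,
                         sensitivity=0.630, specificity=0.962)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

With these inputs the expected counts are roughly 33 TP, 20 FN, 378 FP and 9569 TN, giving a PPV of about 8.1%, an NPV of about 99.8% and about 11.3 FP per TP. The low PPV follows from the low assumed prevalence rather than from the test alone.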
- stage I Stages Ia and Ib
- R refers to the risk of pulmonary nodules being malignant, for which the applicant can choose to monitor or actively investigate without bias.
- R is set at 1.1% based on reports from the National Lung Screening Trial (NLST) study group.
- Example 2 The results of the methylation detection method of the present application
- the methylation detection method of the present application may be based on methylation sequencing data known in the art.
- the methylation detection method of the present application can be based on an ultrasensitive BS-seq (bisulfite sequencing).
- an available BS-seq can be ELSA-Seq described in WO2019192489A1.
- the design principle of the sequencing method for the methylation sequencing data of this application involves two aspects, molecular and computational (FIG. 1).
- the present application first focuses on increasing the number of templates that can be sequenced effectively: (i) DNA molecules need specific adapters at both the 5' end and the 3' end to be "read" by the high-throughput sequencer.
- Adapter ligation is another common limiting factor, so this applicant devised a new strategy called "tail and tag" to improve efficiency. Briefly, bisulfite-treated DNA is denatured, dephosphorylated and extended with a cytosine-rich nucleotide tail by TdT (terminal deoxynucleotidyl transferase). Then, the splint adapter was annealed to the tail in the presence of E.
- in order to estimate the template recovery rate of the sequencing method for the methylation sequencing data of the present application, its ligation efficiency was first compared with that of the traditional method (TruSeq) by ddPCR (Fig. 6A-B). The ratios of ligated/total DNA copies are 82% (Tail-Tag.1) and 86% (Tail-Tag.2) for the sequencing method of the present application, and 64% for TruSeq (Fig. 6C-E). Considering that two rounds of ligation are required for both methods, the recovery is almost doubled (0.64*0.64 vs. 0.82*0.86) just by applying step (ii).
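The recovery comparison is a product of per-round ligation efficiencies, which the following sketch makes explicit. It assumes the two ligation rounds are independent and that TruSeq has the same 64% efficiency in both rounds, as the expression 0.64*0.64 in the text implies; `two_round_recovery` is an illustrative name.

```python
def two_round_recovery(efficiency_round1, efficiency_round2):
    """Fraction of templates carrying both adapters after two
    independent ligation rounds."""
    return efficiency_round1 * efficiency_round2


truseq = two_round_recovery(0.64, 0.64)    # 0.4096
tail_tag = two_round_recovery(0.82, 0.86)  # 0.7052
fold = tail_tag / truseq                   # about 1.72

print(f"TruSeq: {truseq:.4f}, Tail&Tag: {tail_tag:.4f}, fold: {fold:.2f}")
```

The ratio comes out at roughly 1.7, which is what the text rounds to "almost doubled".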
- the applicant compared the sequencing method for the methylation sequencing data of the present application with two commercial kits, Accel Methyl-Seq (SWT) and NEBNext Ultra (NEB), and found that whole-genome bisulfite sequencing (WGBS) libraries constructed with the sequencing method of the present application showed a 10-fold increase in yield and the highest number of unique molecules regardless of input amount or sequencing depth (Fig. 2A-C). Furthermore, for an input as low as 500 pg, the sequencing method of the present application showed the greatest methylome coverage, little amplification bias and highly reproducible methylation levels (Fig. 2D, Fig. 8A-E).
- SWT Accel Methyl-Seq
- NEBNext Ultra NEBNext Ultra
- the present application utilizes deep sequencing to suppress errors within PCR duplicate families.
- the present application first applied the sequencing method for the methylation sequencing data of the present application to unmethylated phage lambda DNA to measure the genome-wide "technical noise" (C/(C+T) in read 1, G/(G+A) in read 2).
- the error rate (maximum 0.0025, average 0.0017) of the sequencing method for the methylation sequencing data of the present application was reduced by almost 10 times, regardless of the number of sequencing cycles (Fig. 8F).
- the detection method of the present application recognizes signals through single-molecule derivatization patterns
- MERMAID the methylation detection method of the present application, devised a metric, the "block index" (BI), to separate CpG sites showing similar methylation status into different blocks (Fig. 4A).
- a total of 8312 blocks were defined, with a median block size of ~143 bp and an average of ~13 CpG sites/block (Fig. 10A-D).
- the present application defines the average methylation level in each block as mAF (mean methylated allele frequency) and compares its performance with iAF for detection of cancer-associated changes.
- for SHOX2, a gene frequently methylated in LC, mAF showed significantly higher AUC values than iAF, showing that "blocks" are more distinguishable units than "sites" (Fig. 4B).
- MBS Methylation Block Score
- AUROC area under the receiver operating characteristic curve
- the present application also compares the advantages of the MBS statistic over the traditional mAF statistic.
- the density curves of the MBS statistic and the mAF statistic in samples with different blending ratios were drawn. It can be seen from the figure that, compared with the mAF statistic, the MBS statistic shows a larger difference between the "Negative Standard" and the "0.1% Spiked Positive Standard", indicating that the MBS statistic is better able to discriminate differential methylation signals.
- qMSP quantitative methylation-specific PCR
- FDR false discovery rate
- the present application compares the LoD of MERMAID with ddPCR and ultra-deep mutation sequencing with unique molecular identifiers (HS-UMI), two methods that are exceptional in detecting variants at very low frequencies.
- HS-UMI ultra-deep mutation sequencing with unique molecular identifiers
- Methylation sequencing has attracted enormous interest because it has great potential to improve current ctDNA assays.
- the present application presents MERMAID as a novel epigenetic analysis method characterized by well-conserved molecular diversity, robust noise suppression and robust high-dimensional modeling.
- MERMAID can be particularly useful for blood-based applications: (i) a portion of cfDNA may be in single-stranded form, so this ssDNA-compatible approach can maximize the use of limited starting material, increasing the chance of detecting rare ctDNA. (ii) Capture panels were designed with an excess of long RNA probes (>100 nucleotides long) complementary to various methylation patterns.
- MERMAID does not require prior knowledge for the assay (e.g., biopsied tissue), thus providing a solution for patients without surgically resected samples.
- although this method has only been validated on LC, it could be customized for other types of cancer (e.g., CRC) or body fluids (e.g., urine). It can be extended to answer fundamental questions, such as tumor heterogeneity, or applied to other clinical scenarios, such as evaluating treatment effects.
- MERMAID can adopt the bias processing method commonly used in the field to deal with the accompanying risk of C->T/G->A artifacts caused by the conversion of cytosine to uracil after oxidative stress.
- MERMAID of the present application can add tissue-specific markers to the target panel for multi-cancer classification.
- Table 1-3: oligonucleotide sequences used for ligation efficiency measurement (ddPCR)
- Table 2-6: quality control metrics of the method of this application with 2 to 30 ng of human WBC input
Abstract
Description
The present application relates to the field of biomedicine, and in particular to a methylation sequencing method and device.
In 2018, human cancers caused 9.6 million deaths worldwide, and most cases were diagnosed at an advanced stage. Intervention before distant metastasis offers by far the greatest chance of improving prognosis, so the development of sensitive, reliable and minimally invasive tests that detect cancer before symptoms appear is highly desirable. Unfortunately, serum-based biomarkers (e.g., carbohydrate antigen 19-9) are usually limited to cancer surveillance and are not sensitive and/or specific enough for screening purposes.
Cell-free DNA (cfDNA) refers to degraded DNA fragments in the bloodstream, most of which are derived from normal white blood cells (WBC). In cancer patients, a fraction of cfDNA is of tumor origin (ctDNA) and provides a real-time snapshot of the cancer genome. Mutational characterization of ctDNA has achieved exciting success in cancer diagnosis, prognosis and monitoring. However, these methods are largely limited to advanced patients because of limited sensitivity or the need for a biopsy to guide downstream analysis.
With the rapid development of next-generation sequencing (NGS) technologies, epigenetic analysis has gained significant attention. Compared with searching for rare somatic mutations in blood, methylome analysis shows several advantages: (i) aberrant methylation usually occurs widely during cancer initiation and thus reflects early changes in tumors; (ii) methylation modifications are frequently found in specific genomic regions such as "CpG islands", which provides a great opportunity to analyze large numbers of alterations by targeted sequencing; (iii) methylation status is cell-type specific and can therefore be used to infer the tissue of origin of ctDNA.
Currently, NGS-based DNA methylation analysis techniques can be classified into two categories: bisulfite conversion-based methods (WGBS, RRBS) and enrichment-based methods (MeDIP, MBD-seq). Among them, bisulfite sequencing (BS-seq) is considered the gold standard for DNA methylation analysis because it provides quantification at single-base resolution. Unfortunately, the harsh conditions of bisulfite conversion (BC) can cause enormous damage to DNA, limiting its use in blood-based applications. Furthermore, converted DNA usually has poor sequence diversity, so problems such as biased target enrichment and high sequencing errors further complicate the analysis.
To overcome these problems, the present application developed the MERMAID methylation detection method, which can be based on a sequencing method that maximizes the use of cfDNA and greatly reduces the artifacts of methylation sequencing. Aided by a robust machine learning classifier, MERMAID substantially outperformed other biopsy-free methods in a proof-of-principle study of low-frequency ctDNA detection using ELSA-seq (see International Patent Publications WO2019/191900A1 and WO2019/192489A1), providing new opportunities for advancing clinical applications.
Summary of the invention
To overcome these problems, the present application developed the MERMAID sequencing method. Aided by a robust machine learning classifier, MERMAID substantially outperformed other biopsy-free methods in a proof-of-principle study of low-frequency ctDNA detection, providing new opportunities to advance clinical applications.
In one aspect, the present application provides a method for detecting the methylation modification of a target nucleic acid, the method comprising the following steps: step (a-1), determining a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites and the position information of the CpG sites; and/or step (a-2), determining the co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the amount of information of the candidate co-methylation block and the division balance degree of the candidate co-methylation block; and step (b), determining the presence and/or content of the target nucleic acid in the sample to be tested based on the degree of methylation of the co-methylation block.
In another aspect, the present application provides an analysis device for detecting the methylation modification of a target nucleic acid, the device comprising: a block division module (a-1), which determines a co-methylation block based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites and the position information of the CpG sites; and/or a block division module (a-2), which determines the co-methylation block based on the corrected correlation coefficient of the CpG sites in the target nucleic acid, the amount of information of the candidate co-methylation block and the division balance degree of the candidate co-methylation block; and a determination module (b), which determines the presence and/or content of the target nucleic acid in the sample to be tested based on the degree of methylation of the co-methylation block.
Methylation sequencing has attracted enormous interest because it has great potential to improve current ctDNA assays. Herein the present application presents MERMAID as a novel epigenetic analysis method characterized by well-preserved molecular diversity, strong noise suppression and robust high-dimensional modeling. In addition to these properties, MERMAID may be particularly useful for blood-based applications: (i) a portion of cfDNA can be in single-stranded form, so this ssDNA-compatible approach can maximize the use of limited starting material and increase the chance of detecting rare ctDNA. (ii) Capture panels are designed with an excess of long RNA probes (>100 nucleotides long) complementary to various methylation patterns. Compared with amplicon-based targeting approaches (~20 nucleotides long), this strategy is more tolerant of sequence-related biases and polymorphisms. (iii) MERMAID does not require prior knowledge for the assay (e.g., biopsied tissue), thus providing a solution for patients without surgically resected samples. Although this method has only been validated on LC, it could be customized for other types of cancer (e.g., CRC) or body fluids (e.g., urine). It can be extended to answer fundamental questions, such as tumor heterogeneity, or applied to other clinical scenarios, such as evaluating treatment effects.
Those skilled in the art can readily perceive other aspects and advantages of the present application from the detailed description below. Only exemplary embodiments of the present application are shown and described in the detailed description below. As those skilled in the art will recognize, the content of the present application enables those skilled in the art to modify the specific embodiments disclosed without departing from the spirit and scope of the invention to which this application relates. Accordingly, the drawings and the descriptions in the specification of the present application are merely exemplary and not restrictive.
The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention can be better understood with reference to the exemplary embodiments described in detail hereinafter and the accompanying drawings. The drawings are briefly described as follows:
Figure 1 illustrates an overview of the MERMAID method of the present application, including (D) to (E) among (A) to (E), wherein,
(A) Schematic of the sample preparation procedure.
(B-C) Schematic of the library construction and targeted sequencing workflow. First, cfDNA fragments are denatured and converted with sodium bisulfite (Lightning). A "tailing" (Tail&Tag-1) step is then performed by TdT (terminal deoxynucleotidyl transferase) to add extra nucleotides (mainly dC; dark purple, at the right end directly below Tail&Tag-1) to the 3' end of the template strand (gray, between BC and Tail&Tag-1). Splint adapter 1 (pink and yellow, the complementary double strand at the right end of the small fragment at the lower right of Tail&Tag-1) facilitates the ligation process through a 5' protruding "arm" (mainly dG; light purple, the overhang at the left end of the small fragment at the lower right of Tail&Tag-1) in the presence of E. coli ligase (dotted arc). A copy strand of the original template is then generated from the common anchor site (yellow, the right end of the two lower strands between Pre-amp and Tail&Tag-2) by a uracil-tolerant polymerase, followed by ligation mediated by adapter 2 (green and blue, the complementary double strand at the left end of the small fragment at the lower left of Tail&Tag-2). Whole-methylome sequencing libraries are generated by PCR amplification (PCR-1), and a dual-index (barcode) system is implemented to distinguish up to 384 samples. A library of biotin-labeled RNA probes (purple, the short dark gray fragments below the strand directly below Hyb&Cap) and streptavidin beads (solid gray, the spheres with dendrites above PCR-2) is used to pull down both strands of the DNA target, and the captured molecules are then subjected to PCR amplification (PCR-2) and NGS sequencing.
(D) Schematic of deep sequencing-driven noise suppression and single-molecule-based methylation pattern recognition.
(E) Schematic of machine learning-assisted classification.
Figure 2 illustrates an exemplary WGBS library construction of the present application and compares different WGBS protocols. Partially methylated E. coli (DH5α) DNA was sheared to ~200 bp to mimic cfDNA. A no-template control (NTC) was run to assess the background readout (e.g., residual adapter dimers) of each protocol, and this background was subtracted from the library yield measured in each experiment. ELSA: ELSA-seq, an exemplary sequencing method of the present application; SWT: Accel-NGS Methyl-Seq; NEB: NEBNext Ultra II. Unless otherwise noted, experiments were performed with two technical replicates and two biological replicates, and error bars represent one S.D. Figure 2 includes (A) to (F), wherein,
(A) Final library yields obtained with 12, 12 and 16 PCR cycles for ELSA, SWT and NEB, respectively. Note: 12 cycles with NEB did not yield enough DNA for quantification.
(B) Unique read depth (PCR duplicates removed) observed with different input amounts (0.5, 1, 5 ng) when sequenced to ~2,000X median depth.
(C) Unique read depth observed when libraries constructed from 500 pg of E. coli DNA were shallow-sequenced to different depths. The DH5α genome is approximately 4.5 Mbp, so 0.5 ng corresponds to approximately 100,000 haploid genome copies before BC. Normalization was performed by subtracting the unique read depth observed in the NTC.
(D) Diagram showing the overlap of detected methylation sites (depth > 0) between 500 pg and 5 ng DNA inputs.
(E-F) Mutation spectrum and frequency observed without (E) or with (F) error suppression enabled by deep sequencing. The same libraries were sequenced on Illumina HiSeq 2500, NovaSeq 6000 and MiniSeq, respectively.
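The copy-number figure quoted for Figure 2(C) (0.5 ng of a 4.5 Mbp genome being roughly 100,000 haploid copies) can be checked with a short calculation. The following is a minimal sketch, assuming an average mass of about 650 g/mol per base pair of double-stranded DNA; that constant and the function name are illustrative assumptions and are not stated in the present application.

```python
AVOGADRO = 6.02214076e23     # molecules per mole
BP_MASS_G_PER_MOL = 650.0    # assumed average mass of one double-stranded base pair


def genome_copies(mass_ng: float, genome_size_bp: float) -> float:
    """Estimate the number of haploid genome copies in a given mass of dsDNA."""
    grams = mass_ng * 1e-9
    grams_per_genome = genome_size_bp * BP_MASS_G_PER_MOL / AVOGADRO
    return grams / grams_per_genome


# 0.5 ng of E. coli DH5α DNA, genome size taken as 4.5 Mbp (value from the caption).
copies = genome_copies(0.5, 4.5e6)
print(f"{copies:,.0f} haploid genome copies")
```

Under these assumptions the estimate is on the order of 100,000 copies, consistent with the caption.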
Figure 3 illustrates the design and performance of an exemplary target panel of the present application, including (A) to (E), wherein,
(A) Heatmap of methylation levels at 2,765 probe regions (Illumina 450K) in WBC (n=656), cancer tissues (n=4,539) and normal tissues (n=521), based on the GSE and TCGA databases.
(B-D) Performance of an exemplary target panel of the present application using 10 ng of cfDNA from a representative donor, wherein,
(B) Unique depth across all genomic regions in the target panel.
(C) Fragment sizes obtained with an exemplary library construction method of the present application or with TruSeq library construction.
(D) Methylation level (iAF) of each CpG site calculated from the "+" or "-" strand.
(E) Panel-wide technical background assessed by CHH methylation levels.
Figure 4 illustrates methylation block definition and pattern recognition, including (A) to (D), wherein,
(A) Representative genomic region (CHR14:26674253-26674534) showing co-methylation status measured by Poisson correlation (left panel) or "block index" (right panel).
(B) Box plots showing the distribution of AUC for classifying malignant (n=48) and normal lung tissues (n=20) using the methylation status (mAF or iAF) of SHOX2 (CHR3:157813800-157822158). All samples were sequenced to unique read depths of 500X (left panel) and 100X (right panel). Unless otherwise noted, figures are marked with one asterisk (*) if P<0.05, two asterisks (**) if P<0.01, and three asterisks (***) if P<0.001. In this figure, P values were calculated using the Wilcoxon Sign test.
(C) Formula for MBS and examples of genomic loci with various methylation patterns.
(D) Density curves of mAF (left panel) and MBS (right panel) observed in methyltransferase-treated λ spike-in experiments. Colored lines represent samples with different dilution ratios (technical replicates share a color), and the X-axis shows log-transformed mAF or MBS values.
Figure 5 illustrates the analytical validation of MERMAID, including (A) to (F), wherein,
(A) Matrix of Poisson correlations (ρ) of MBS values between replicates, individuals and disease states. Data were generated from 4 healthy donors (H1-4) and 3 LC patients (C1-3), with three replicates for H1 (H1R1-R3). The calculation was performed across the entire panel using 8,312 blocks.
(B) Detection of methylation signal in serial dilutions (0.0001-0.05) of CRC cell DNA. Error bars depict the 95% CI of the estimated tumor fraction, and the dotted line indicates y=x.
(C) FDR measured by sequencing normal white blood cells at different depths and with different input amounts (X-axis). The Y-axis represents the percentage of false-positive calls.
(D) In silico sensitivity measured by mixing sequencing reads from CRC into normal cfDNA data at different ratios (0.00001-0.05). The X-axis represents the expected tumor fraction. The Y-axis represents the percentage of positive calls. P values were calculated with a t-test.
(E-F) Assay sensitivity of MERMAID (E), ddPCR (F, left panel) and HS-UMI (F, right panel) measured by serially diluting cancer cell (LC, CRC) DNA with normal white blood cell DNA at ratios of 0.0001-0.1. The Y-axis represents the percentage of positive markers observed (E) or the allele frequency (F). Dashed lines represent the detection thresholds (95% CI) of ddPCR or HS-UMI for the indicated mutations.
Figure 6 shows the adapter ligation efficiency of an exemplary sequencing method of the present application and of TruSeq, including (A) to (E), wherein,
(A) Illustration of the droplet digital PCR (ddPCR) used to estimate ligation efficiency. A synthetic DNA template was ligated to adapter 1 or adapter 2 by an exemplary sequencing method of the present application. Primers were designed to amplify the "total" (For.1-Rev.1) or "ligated" (For.1-Rev.2) fragments. The reactions were set up in separate wells with the same fluorescent reporter probe.
(B) Representative fluorescence plot of ddPCR reactions. Each dot represents a single PCR reaction (droplet) containing one DNA molecule. Gray dots show background fluorescence, and green dots are single droplets with amplification signal.
(C-E) ddPCR copy number plots (left panels) and ligation efficiency tables (right panels) using double-stranded TruSeq adapters (C), adapter 1 (D) and adapter 2 (E). For the copy number plots, the Y-axis represents copies/μL and the X-axis represents 3 technical replicates (T1, 2, 3) and 3 biological replicates (B1, 2, 3). Note that the plots show counts per well, whereas the tables show counts normalized by reaction volume (Tail-Tag.1 vs. Tail-Tag.2).
Figure 7 illustrates the principles of the library preparation methods, including (A) to (B), wherein,
(A) Schematic of the library construction workflows used by NEB (NEBNext Ultra II), SWT (Accel-NGS Methyl-Seq), TELP, ELSA (an exemplary sequencing method of the present application), SPLAT, SALP (SALP-seq) and Padlock.
(B) Summary of library preparation methods for whole-genome bisulfite sequencing (WGBS). Data come from the present study (NEB, SWT, ELSA) or previous studies (TELP, SPLAT, SALP). ds: double-stranded adapter; ss: single-stranded adapter; ds-TA: TA-cloning-mediated adapter attachment; ss-Tailing: synthetic-tail-mediated adapter attachment; ss-Priming: random-priming-mediated adapter attachment; chimera: a read pair combining two or more different partial sequences. Note: Padlock can only be used for targeted bisulfite sequencing.
Figure 8 illustrates the capture-free performance of an exemplary sequencing method of the present application, including (A) to (G), wherein,
(A) Fragment size density curves using different input amounts of λ DNA.
(B) Genome-wide Poisson correlation (sliding window: 20,400 bp) of methylation levels using 0.5, 1 and 5 ng of E. coli (DH5α) DNA.
(C) Representative libraries constructed with ELSA using 500 pg (red, dark red; lower peaks), 1 ng (green, blue-violet; middle peaks) and 2 ng (yellow, aqua; higher peaks) of cfDNA from two patients. Each sample was prepared with technical replicates, and a no-template control was included (NTC; blue line, the flatter line at the bottom).
(D) Representative libraries constructed with NEB, SWT and ELSA using 500 pg of E. coli DNA. Each sample was prepared with technical replicates, and a no-template control was included (NTC; dark red line, the line with the higher peak).
(E) Mutation spectrum and frequency of libraries constructed with NEB.
(F) Coverage distribution across the entire λ genome when sequenced at ~2000X with different input amounts. Gray shading indicates unique depth across the whole genome, and black bars highlight coverage at CpG sites.
(G) Methylation signal at each sequencing cycle, estimated as C/(C+T) (R1) and G/(G+A) (R2), on Illumina NovaSeq 6000 with different amounts of λ DNA (bars in each group, from left to right: HiSeq, MiniSeq, NovaSeq).
Figure 9 illustrates the targeted performance of an exemplary sequencing method of the present application, including (A) to (G), wherein,
(A) Overlap of the selected DMLs (n=80,672) with known regions of the human genome.
(B) Distribution of targeted CpG sites relative to known categories of gene features.
(C) Coverage uniformity curve showing the fraction of targeted bases (Y-axis) with unique depth equal to or greater than a given normalized coverage (X-axis). Here, normalized coverage is calculated as the observed unique read depth at each base divided by the mean unique read depth over all targeted bases.
(D) Density plot showing GC content versus normalized read depth. The Y-axis represents unique depth over the panel regions, and the X-axis indicates the GC content of the reference genome.
(E) Sequencing quality (Phred score) of read 1 and read 2 measured at each sequencing cycle on Illumina NovaSeq 6000.
(F) Size distribution of original cfDNA (blue; dashed peak marker on the left), amplified cfDNA without capture (Pre-Lib, red; dashed peak marker in the middle) and amplified cfDNA after capture (Post-Lib, dark red; dashed peak marker on the right) from two representative donors. The shift of the peaks (dotted lines) toward higher molecular weight reflects the attachment of adapter/linker sequences to the template during PCR-1 (80 bp, left dashed line) and PCR-2 (69 bp, middle dashed line).
(G) Unique depth of each CpG site on the "+" or "-" strand.
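The normalized coverage defined for Figure 9(C) (per-base unique depth divided by the mean unique depth over all targeted bases) and the resulting uniformity curve can be sketched as follows. This is an illustrative sketch only: the function names and the toy depth values are assumptions made for the example and are not data from the present application.

```python
from statistics import mean


def normalized_coverage(depths):
    """Divide each base's unique read depth by the mean depth over all targeted bases."""
    avg = mean(depths)
    if avg == 0:
        raise ValueError("mean depth is zero; normalized coverage is undefined")
    return [d / avg for d in depths]


def uniformity_curve(depths, thresholds):
    """For each threshold x, return the fraction of bases with normalized coverage >= x."""
    norm = normalized_coverage(depths)
    return [sum(1 for v in norm if v >= x) / len(norm) for x in thresholds]


# Toy per-base unique depths (made-up values for illustration only).
depths = [100, 200, 300, 400]
print(uniformity_curve(depths, thresholds=[0.0, 0.5, 1.0, 1.5]))
# -> [1.0, 0.75, 0.5, 0.25]
```

A perfectly uniform panel would give a step function that drops from 1 to 0 at a normalized coverage of 1; the further the curve departs from that step, the less even the coverage.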
Figure 10 illustrates the description of methylation blocks and patterns, including (A) to (E), wherein,
(A-B) Determination of the penalty coefficients α2 (A) and α1 (B) for block segmentation.
(C) Distribution of block lengths (base pairs) across the capture panel.
(D) Distribution of the number of CpG sites per block across the capture panel.
(E) Methylated (gray) and unmethylated (black) states shown for each unique sequencing read (vertical line) within a representative SHOX2 region (chr3:157821291-157821596).
Figure 11 illustrates the reproducibility of MERMAID, including (A) to (C), wherein,
(A) Scatter plots showing the intra-batch concordance of iAF (left panel) and MBS (right panel) using the same NA12878 DNA sample (10 ng).
(B) Scatter plots of MBS values observed in two intra-batch replicates (left panel) and inter-batch replicates (right panel) using the same cfDNA sample (10 ng).
(C) Density curves of MBS values using different NA12878 DNA input amounts (2, 5, 10, 30 ng) or different sequencing depths (1000-5000X).
Figure 12 illustrates the accuracy of MERMAID, including (A) to (D), wherein,
(A) IGV (Integrative Genomics Viewer) view of MERMAID data for H2122, H2228 and NA12878 cells at the SHOX2 and SEPT9 loci. For forward reads, a red "C" indicates a methylated C (unconverted) and a blue "T" indicates an unmethylated C (C->T conversion). For reverse reads, the interpretation is the opposite.
(B) Validation of SHOX2 and SEPT9 methylation by qMSP (quantitative methylation-specific PCR). The Y-axis represents the relative fold change in methylation level (in each group of bars, msp_PCR is on the left and NGS on the right).
(C) Density curve showing the differences in iAF measured in NA12878 cells with MERMAID and Illumina Epic TruSeq (public data). Only overlapping CpG sites (n=32,383) were used. The red dashed line indicates the median platform-related iAF difference.
(D) Scatter plot of log-transformed iAF generated with Illumina Epic TruSeq Methyl (Y-axis) and MERMAID (X-axis).
Figure 13 illustrates the FDR and LoD of MERMAID, including (A) to (F), wherein,
(A) Box plot showing the dependence of FDR (Y-axis) on sequencing depth (X-axis) using 10 ng of cfDNA.
(B) Numerical simulation of sampling noise using a Poisson distribution (unique fragments = 500, noise = 0.002). The X-axis represents the number of observations (depth), and the Y-axis represents the variance.
(C) Detection of tumor-derived methylation counts as a function of tumor burden (0.000001-0.001), unique depth (left panel, markers = 1000) and number of markers (right panel, depth = 500). The X-axis represents the predefined tumor fraction θ_i in each simulated "tumor" sample, and the Y-axis represents the mean detection of that sample over 10,000 replicates. A successful observation was defined as P<0.05, with the P value determined by a likelihood ratio test of the null hypothesis θ_i = 0.
(D) Evaluation of batch effects on the in silico LoD determination. Samples A (left) and B (right) came from two different healthy donors and were sequenced in two separate runs. Read spike-in and analysis were performed as described in the Methods section of Example 1.
(E-F) ddPCR fluorescence plots showing the detection of the EGFR p.G719S mutation (E) and the EML4-ALK fusion (F) in spike-in experiments with the SW48 and NCI-H2228 cell lines, respectively. In each panel, the gating threshold is indicated by a solid horizontal line (e.g., 6000, 2400, 3000 and 2760).
Figure 14 illustrates tissue-based selection of LC-specific markers and classification, including (A) to (D), wherein,
(A) Volcano plots of MBS measured in LC tissue (n=48) versus normal lung tissue (n=20, left panel) and LC tissue versus healthy cfDNA (n=30, right panel). Dark dots indicate significant markers and light gray dots indicate non-significant markers.
(B) Violin plots showing MBS at representative specifiers in LC, normal lung and control cfDNA samples. Each dot represents one sample.
(C) Supervised classification of tumor and normal tissue samples using different numbers of specifiers. The Y-axis represents the prediction accuracy across 100 replicates, and the X-axis represents the number of specifiers.
(D) PCA (principal component analysis) clustering of tumor and normal tissue samples using different numbers of specifiers.
Figure 15 shows the clinical characteristics of the validation cohort (plasma). The table shows 308 LC patients and 261 non-cancer controls stratified by clinical characteristics. UNK: unknown; LUAD: lung adenocarcinoma; LUSC: lung squamous cell carcinoma.
Figure 16 illustrates a side-by-side comparison of MERMAID, HS-UMI and patient-specific ddPCR, including (A) to (F), wherein,
(A) Study design for the "double", "triple" and "quadruple" comparisons.
(B-D) ddPCR fluorescence plots showing the detection of EGFR P.S746_I750DEL for patient P65 (B), KRAS p.G12D for patient P64 (C) and EGFR P.L858R for patient P66 (D). The DNA input for each ddPCR reaction was 38, 62 and 54 ng, respectively. NTC: no-template control; NC: normal WBC; PC: positive control carrying the expected mutation at 0.1% or 0.5% AF (Multiplex I cfDNA Reference Standard Set, Horizon Discovery). Note that the 0.5% PC was generated by mixing WT and 1% reference DNA at a 1:1 ratio. Details are provided in Table 7.
(E) maxAF of all participants in group 1 and group 2, categorized by disease status (n=115).
(F) Prediction scores of all participants in group 1 and group 2, categorized by maxAF (n=115). The dotted line indicates the threshold (positive or negative) at 96% training specificity.
Figure 17 illustrates the performance of detecting cancer-associated changes based on the single-site methylation level iAF and on the per-block mean methylation level mAF (mean methylation allele frequency).
Figure 18 illustrates how the "median region length" and the "coefficient of variation of region length" relate to the values of α2 and α1, respectively.
Figure 19 illustrates the density curves of the MBS and mAF statistics in samples with different mixing ratios.
The embodiments of the invention of the present application are described below by way of specific examples. Those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification.
Definition of Terms
In this application, the terms "second-generation gene sequencing (NGS)", "high-throughput sequencing" and "next-generation sequencing" generally refer to second-generation high-throughput sequencing technologies and the higher-throughput sequencing methods developed thereafter. Next-generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. As sequencing technology continues to develop, those skilled in the art will understand that other sequencing methods and devices can also be used for the present method. For example, second-generation gene sequencing can offer high sensitivity, high throughput, high sequencing depth, or low cost. Classified by development history, influence, sequencing principle and technology, the main types include: Massively Parallel Signature Sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, and Complete Genomics' DNA nanoarray and combinatorial probe-anchor ligation sequencing. Second-generation gene sequencing makes it possible to analyze the transcriptome and genome of a species in comprehensive detail, and is therefore also called deep sequencing. For example, the method of the present application can equally be applied to first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing or single-molecule sequencing (SMS).
In this application, the term "test sample" generally refers to a sample that needs to be tested. For example, one or more gene regions in the test sample can be tested for the presence of a modification state.
In this application, the term "complementary region" generally refers to a region that is complementary to a reference nucleotide sequence. For example, a complementary nucleic acid can be a nucleic acid molecule that optionally has the opposite orientation. For example, complementarity can refer to the following pairings: guanine and cytosine; adenine and thymine; adenine and uracil.
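The base pairings listed in the definition of "complementary region" can be illustrated with a small helper that returns the complementary strand in the opposite orientation. This is a minimal sketch for illustration only; the function name and the handling of RNA through a flag are assumptions, and ambiguity codes are deliberately not supported.

```python
# Pairings from the definition: G-C, A-T (DNA) and A-U (RNA).
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str, rna: bool = False) -> str:
    """Return the complementary strand, written 5'->3' (opposite orientation)."""
    pairs = RNA_PAIRS if rna else DNA_PAIRS
    # A KeyError is raised for any base outside the four supported letters.
    return "".join(pairs[base] for base in reversed(seq.upper()))


print(reverse_complement("ATGC"))            # GCAT
print(reverse_complement("AUGC", rna=True))  # GCAU
```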
In this application, the term "hybridization" generally refers to a reaction in which one or more polynucleotides react to form a complex that is stabilized by hydrogen bonding between the bases of the nucleotide residues. The hydrogen bonding can occur by Watson-Crick base pairing, Hoogsteen binding, or in any other sequence-specific manner based on base complementarity. The complex can comprise two strands forming a duplex, three or more strands forming a multi-stranded complex, a single self-hybridizing strand, or any combination of these. A hybridization reaction can constitute a step in a broader method, such as the initiation of PCR or the enzymatic cleavage of a polynucleotide by an endonuclease. A second sequence that is fully complementary to a first sequence, or that is polymerized by a polymerase using the first sequence as a template, is said to be "complementary" to the first sequence. The term "hybridizable" as applied to a polynucleotide refers to the ability of the polynucleotide to form, in a hybridization reaction, a complex that is stabilized by hydrogen bonding between the bases of the nucleotide residues. In some embodiments, a hybridizable nucleotide sequence is at least about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95% or 100% complementary to the sequence to which it hybridizes.
In this application, the terms "polynucleotide", "nucleotide", "nucleic acid" and "oligonucleotide" are used interchangeably. They denote a polymeric form of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogs thereof. A polynucleotide can have any three-dimensional structure and can perform any function, known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined by linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), short interfering RNA (siRNA), short hairpin RNA (shRNA), microRNA (miRNA), ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers and adapters. A polynucleotide can include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.
In this application, the term "modification state" generally refers to the modification state of a gene fragment, a nucleotide or a base thereof in the present application. For example, the modification state can refer to the modification state of cytosine. For example, a gene fragment having a modification state can have altered gene expression activity. For example, the modification state can refer to a methylation modification of a base. For example, the modification state can refer to the covalent attachment of a methyl group at the carbon-5 position of cytosine in a CpG region of genomic DNA, which can, for example, yield 5-methylcytosine (5mC). For example, the modification state can refer to the presence or absence of 5-methylcytosine ("5-mCyt") within a DNA sequence.
In this application, the term "methylation" generally refers to the methylation state of a gene fragment, a nucleotide or a base thereof in the present application. For example, the DNA fragment in which a gene of the present application is located can be methylated on one strand or on multiple strands. For example, the DNA fragment in which a gene of the present application is located can be methylated at one site or at multiple sites.
In this application, the term "conversion" generally refers to converting one or more structures into another structure. For example, the conversion of the present application can be specific. For example, a cytosine without methylation modification can be converted into another structure (e.g., uracil), while a cytosine with methylation modification remains substantially unchanged by the conversion. For example, a cytosine without methylation modification can be cleaved upon conversion, while a cytosine with methylation modification remains substantially unchanged by the conversion.
In this application, the term "bisulfite" (also called "hydrogen sulfite") generally refers to a reagent that can distinguish DNA regions that have a modification state from those that do not. For example, the bisulfite can include bisulfite, an analog thereof, or a combination of the foregoing. For example, bisulfite can deaminate the amino group of an unmodified cytosine so that it can be distinguished from a modified cytosine. In this application, the term "analog" generally refers to a substance having a similar structure and/or function. For example, an analog of bisulfite can have a structure similar to that of bisulfite. For example, an analog of bisulfite can refer to a reagent that can likewise distinguish DNA regions that have a modification state from those that do not.
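The selective conversion described above can be mimicked in silico: unmethylated cytosines are deaminated to uracil and are read as T after amplification, whereas methylated cytosines are read as C. The sketch below is illustrative only; it models a single forward-strand read with complete conversion, and the function names and toy sequence are assumptions rather than part of the present application.

```python
def bisulfite_convert(seq: str, methylated_positions) -> str:
    """Return the sequence as read after conversion and amplification.

    Unmethylated C is read as T (via U); methylated C stays C.
    Positions are 0-based indices into seq.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference: str, read: str) -> dict:
    """For each reference C, report True if the read retains C, False if it shows T."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base == "C" and read_base in ("C", "T"):
            calls[i] = read_base == "C"
    return calls


reference = "ACGTCGAC"                         # toy sequence with C at indices 1, 4, 7
read = bisulfite_convert(reference, methylated_positions=[1])
print(read)                                    # ACGTTGAT
print(call_methylation(reference, read))       # {1: True, 4: False, 7: False}
```

Reads from the reverse strand would need the complementary G/A comparison, which is omitted here for brevity.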
In this application, the term "comprising" generally means including the explicitly specified features without excluding other elements.
In this application, the term "about" generally refers to variation within a range of 0.5%-10% above or below the specified value, for example within 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5% or 10% above or below the specified value.
Detailed Description of the Invention
In one aspect, the present application provides a method for detecting methylation modification of a target nucleic acid, the method comprising the following steps: step (a-1) determining a co-methylation block based on the correlation coefficient of CpG sites in the target nucleic acid, the methylation level of the CpG sites, and the position information of the CpG sites; and/or step (a-2) determining the co-methylation block based on the correlation coefficient of CpG sites in the target nucleic acid, the information content of candidate co-methylation blocks, and the degree of partition balance of the candidate co-methylation blocks; and step (b) determining the presence and/or amount of the target nucleic acid in a test sample based on the degree of methylation of the co-methylation block.
For example, the method detects the presence and/or amount of the target nucleic acid having methylation modification in the test sample. As used herein, the "subject" from which the test sample is derived can be a mammal, such as a non-primate (e.g., cow, pig, horse, cat, dog, rat, etc.) or a primate (e.g., monkey or human). In some embodiments, the subject is a human. In some embodiments, the subject is a mammal (e.g., a human) having or potentially having a disease, disorder or condition, examples of which are described herein. In some embodiments, the subject is a mammal (e.g., a human) at risk of developing a disease, disorder or condition, examples of which are described herein.
For example, the correlation coefficient of the CpG sites comprises the Pearson correlation coefficient between two or more of the CpG sites in the target nucleic acid.
For example, the methylation level of the CpG sites comprises the difference in mean methylated allele frequency (mAF) between two or more of the CpG sites in the target nucleic acid. For example, the methylation level of the CpG sites comprises the ratio of the difference in mAF to the sum of the mAF between two or more of the CpG sites in the target nucleic acid.
For example, the position information of the CpG sites comprises the difference in genomic position between two or more of the CpG sites in the target nucleic acid. For example, the position information of the CpG sites comprises the ratio of the genomic distance between two or more of the CpG sites in the target nucleic acid to the length of the target nucleic acid.
For example, step (a-1) comprises: determining a corrected correlation coefficient between every two CpG sites of the target nucleic acid, where the corrected correlation coefficient d_ij between site i and site j is calculated from the following quantities: ρ_ij denotes the Pearson correlation coefficient; E(y_i) denotes the mean methylated allele frequency (mAF) of all samples at site i; E(y_j) denotes the mean mAF of all samples at site j; pos_i denotes the genomic position of site i; pos_j denotes the genomic position of site j; L denotes the length of the target nucleic acid region; and λ1 and λ2 are each independently selected from numbers of 0 or greater. For example, λ1 ranges from 0 to 1. For example, λ2 ranges from 0 to 1. For example, in the present application, λ1 and λ2 are each independently selected from 0. For example, λ1 is selected from 0; for example, λ2 is selected from 0.
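The formula for d_ij is not reproduced in this text. One form consistent with the quantities defined above and with the preceding examples (the ratio of the mAF difference to the mAF sum, and the ratio of genomic distance to region length) is sketched below. The subtraction of the two weighted terms is an assumption, not a statement of the formula of the application:

```latex
d_{ij} = \rho_{ij}
       - \lambda_1 \,\frac{\lvert E(y_i) - E(y_j) \rvert}{E(y_i) + E(y_j)}
       - \lambda_2 \,\frac{\lvert pos_i - pos_j \rvert}{L}
```

Under this assumed form, setting λ1 = λ2 = 0 leaves the plain Pearson coefficient, and larger weights penalize pairs of sites that differ in mean methylation or lie far apart.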
For example, the correlation coefficient of the CpG sites in the target nucleic acid used in step (a-2) comprises a corrected correlation coefficient of the CpG sites in the target nucleic acid, that is, the Pearson correlation coefficient between two or more of the CpG sites in the target nucleic acid after correction based on the methylation level of the CpG sites and/or the position information of the CpG sites.
For example, the corrected correlation coefficient of the CpG sites in the target nucleic acid comprises the corrected correlation coefficient of step (a-1) above.
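A minimal Python sketch of such a corrected correlation follows. It assumes the corrected coefficient is the Pearson coefficient minus a λ1-weighted mAF term (difference over sum) and a λ2-weighted distance term (distance over region length); that combination, and the function and parameter names, are illustrative assumptions rather than the formula of the application.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (both must vary)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def corrected_correlation(maf_i, maf_j, pos_i, pos_j, region_len,
                          lam1=0.0, lam2=0.0):
    """Assumed form: rho - lam1 * |dE| / sum(E) - lam2 * |dpos| / L.

    maf_i, maf_j: per-sample mAF values at sites i and j (same sample order).
    """
    rho = pearson(maf_i, maf_j)
    mean_i = sum(maf_i) / len(maf_i)
    mean_j = sum(maf_j) / len(maf_j)
    level_term = abs(mean_i - mean_j) / (mean_i + mean_j)
    dist_term = abs(pos_i - pos_j) / region_len
    return rho - lam1 * level_term - lam2 * dist_term
```

With both weights at 0 the function returns the uncorrected Pearson coefficient, which matches the case in which λ1 and λ2 are selected from 0.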
For example, the information content of a candidate co-methylation block comprises the number of CpG sites in the candidate co-methylation block.
For example, the degree of partition balance of the candidate co-methylation blocks comprises the differences in the number of CpG sites between different candidate co-methylation blocks. For example, the degree of partition balance of the candidate co-methylation blocks comprises the coefficient of variation of the number of CpG sites across the different candidate co-methylation blocks.
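The two quantities above can be illustrated with a short Python sketch. The representation of a partition as a list of breakpoint indices, and the use of the population standard deviation in the coefficient of variation, are assumptions made for illustration.

```python
from statistics import mean, pstdev

def block_sizes(n_sites, breakpoints):
    """Number of CpG sites per block; a breakpoint b starts a new block at site index b."""
    edges = [0] + sorted(breakpoints) + [n_sites]
    return [right - left for left, right in zip(edges, edges[1:])]

def coefficient_of_variation(sizes):
    """Population standard deviation divided by the mean (0.0 means perfectly balanced)."""
    return pstdev(sizes) / mean(sizes)
```

For instance, `block_sizes(10, [3, 6])` describes three blocks of 3, 3, and 4 sites, and a lower coefficient of variation indicates a more balanced partition.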
For example, step (a-2) comprises: determining the co-methylation blocks by maximizing the block index of the candidate co-methylation blocks, where the block index of the candidate co-methylation blocks in the target nucleic acid is calculated by a formula in which B_i denotes the number of CpG sites in the i-th candidate co-methylation block, and α1 and α2 are each independently selected from numbers of 0 or greater. For example, the block breakpoints of the co-methylation blocks are determined by iteration with unique breakpoints. For example, α1 ranges from 0 to 10. For example, α2 ranges from 0 to 10. For example, in the present application, α1 and α2 are each independently selected from rational numbers from 0 to 10. For example, in the present application, α1 and α2 are each independently selected from 0. For example, in the present application, α1 is selected from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10; for example, α2 is selected from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10.
For example, step (b) comprises: determining the presence and/or amount of the target nucleic acid based on the length of consecutive CpGs on each sequencing read of the co-methylation block, the number of CpGs on the read, and the total number of reads of the co-methylation block.
For example, step (b) comprises: determining a methylation block score (MBS) of the co-methylation block, the MBS being calculated from the following quantities: n is the total number of reads covering all CpG sites of the co-methylation block; L_i is the number of CpG sites contained on the i-th read; l_ij is the length of consecutive methylated CpG sites on the i-th read; and m is the sequencing depth on the i-th read.
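The MBS formula itself is not reproduced in this text. A form consistent with the symbols defined above is given below as an assumption only; it is not confirmed by the text of the application:

```latex
\mathrm{MBS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( \frac{l_{ij}}{L_i} \right)^{2}
```

Under this assumed form, a read whose CpG sites are all methylated in one run contributes 1, a fully unmethylated read contributes 0, and dividing by n normalizes for the number of reads in the block.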
For example, UMI correction is applied to the sequencing data of the test sample.
For example, the method further comprises: extracting, with a machine learning model, feature values of the MBS of the co-methylation blocks in tumor samples and healthy samples, and determining the presence and/or amount of the target nucleic acid based on the MBS of the co-methylation blocks in the test sample.
An analysis device for detecting methylation modification of a target nucleic acid, the device comprising: a block partitioning module (a-1), which determines co-methylation blocks based on the correlation coefficient of the CpG sites in the target nucleic acid, the methylation level of the CpG sites, and the position information of the CpG sites; and/or a block partitioning module (a-2), which determines the co-methylation blocks based on the corrected correlation coefficient of the CpG sites in the target nucleic acid, the information content of candidate co-methylation blocks, and the degree of partition balance of the candidate co-methylation blocks; and a determination module (b), which determines the presence and/or amount of the target nucleic acid in a test sample based on the degree of methylation of the co-methylation blocks.
For example, the analysis device of the present application for detecting methylation modification of a target nucleic acid may carry out the steps of the method of the present application for detecting methylation modification of a target nucleic acid.
As one aspect of the present disclosure, the present application provides a method for detecting ctDNA by methylation sequencing, comprising: selecting differentially methylated CpG sites; partitioning the CpG sites into multiple co-methylation blocks based on the similarity of their methylation status; sequencing a sample to obtain methylation sequencing reads; and determining the average methylation level of each co-methylation block of the sample for further DNA methylation analysis.
In a preferred embodiment of the present disclosure, the differentially methylated CpG sites are selected from TCGA data generated with the Infinium HumanMethylation 450K array.
In another preferred embodiment of the present disclosure, the average methylation level is the mean methylated allele frequency.
In another preferred embodiment of the present disclosure, the number of co-methylation blocks is between 1/30 and 1/5 of the number of CpG sites.
In another preferred embodiment of the present disclosure, the co-methylation blocks are further restricted by comparing samples of a specific tumor with normal tissue samples, for example primary lung tumor and normal lung tissue samples.
In another preferred embodiment of the present disclosure, co-methylation blocks with insufficient depth (<100) in most samples (>80%) are excluded from downstream analysis.
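The depth filter can be sketched in Python as follows. The input layout (a dictionary of per-sample depths keyed by a block identifier) and the reading of "most samples" as strictly more than 80% are assumptions for illustration.

```python
def keep_blocks(depth, min_depth=100, max_low_fraction=0.8):
    """Return the ids of blocks that pass the depth filter.

    depth: dict mapping a block id to a list of per-sample depths.
    A block is dropped when more than max_low_fraction of its samples
    have depth below min_depth.
    """
    kept = []
    for block_id, depths in depth.items():
        low = sum(1 for d in depths if d < min_depth)
        if low / len(depths) <= max_low_fraction:
            kept.append(block_id)
    return kept
```

A block that is shallow in exactly 80% of samples is retained under this reading; only blocks above that fraction are excluded.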
In another preferred embodiment of the present disclosure, the co-methylation blocks are partitioned based on a modified correlation matrix, referred to as the "block index".
In another preferred embodiment of the present disclosure, the method further comprises normalizing the depth differences of each methylation block using the methylation block score (MBS), which is used to distinguish very small tumor signals, for example 0.1%, 0.2%, 0.5%, and 1%.
In another preferred embodiment of the present disclosure, the sequencing reads to be analyzed are trimmed by any one or more of the following criteria: (i) the number of runs of G bases longer than a fixed value m in the query; (ii) the fraction of non-G bases is less than a fixed number p; (iii) the next base is a high-quality A/T/C (Phred score > 30).
In another preferred embodiment of the present disclosure, duplicate removal is applied with a tolerance of +/-3 bp at the start points of both R1 and R2, to minimize artifacts associated with misassigned fragment end positions.
In another preferred embodiment of the present disclosure, UMIs are applied in the correction.
In another preferred embodiment of the present disclosure, the average methylation level of each co-methylation block is measured on unmethylated phage λ DNA in order to estimate genome-wide "technical noise" (C/(C+T) in read 1, G/(G+A) in read 2).
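As a small illustration of the two ratios named above, the following Python sketch computes them from base-call counts on the unmethylated λ control. The function name and the idea of passing pre-tallied counts are assumptions; how the counts are collected is not specified here.

```python
def technical_noise(r1_c, r1_t, r2_g, r2_a):
    """Residual unconverted fraction on unmethylated lambda DNA.

    r1_c, r1_t: counts of C and T calls at cytosine positions in read 1.
    r2_g, r2_a: counts of G and A calls at the corresponding positions in read 2.
    Returns (C / (C + T) for read 1, G / (G + A) for read 2).
    """
    noise_r1 = r1_c / (r1_c + r1_t)
    noise_r2 = r2_g / (r2_g + r2_a)
    return noise_r1, noise_r2
```

Because the control DNA is unmethylated, any nonzero value reflects incomplete conversion or sequencing error rather than true methylation.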
In another preferred embodiment of the present disclosure, a machine learning classifier of methylation patterns is applied to assess tumor levels, preferably in early tumor screening.
As another aspect of the present disclosure, the present application provides the use of the above method for assessing general tumor levels during early tumor screening. In a preferred embodiment of the present disclosure, the above method is used to assess tumor levels during early tumor screening, the tumors being homogeneous tumors, heterogeneous tumors, blood cancers, and/or solid tumors; preferably, the tumors are one or more cancers from the following group: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, thoracic malignancies (other than lung), melanoma, and testicular cancer. In a preferred embodiment of the present disclosure, the above method is used to assess lung tumor levels during early tumor screening.
As another aspect of the present disclosure, the present application provides a kit for detecting ctDNA by methylation sequencing, wherein the kit can be used to capture at least 50, 100, 150, 200, 300, 500, 800, 1000, 1500, or 2000 of the co-methylation blocks shown in Table 5-2; preferably, it can be used to capture at least all of the co-methylation blocks shown in Table 5-1.
As another aspect of the present disclosure, the present application provides a device for performing the above method for assessing general tumor levels during early tumor screening.
As another aspect of the present disclosure, the present application provides a non-volatile memory storing a program that can be used to perform the above method for assessing general tumor levels during early tumor screening.
In one aspect, the present application provides a method for detecting the level of base modification, comprising providing the nucleic acid molecule combination of the present application and/or the kit of the present application. For example, the base modification comprises methylation modification.
In one aspect, the present application provides a storage medium on which is recorded a program capable of running the method of the present application. For example, the non-volatile computer-readable storage medium may include a floppy disk, flexible disk, hard disk, solid-state storage (SSS) (for example, a solid-state drive (SSD), solid-state card (SSC), or solid-state module (SSM)), enterprise-grade flash drive, magnetic tape, or any other non-transitory magnetic medium. The non-volatile computer-readable storage medium may also include a punch card, paper tape, optical mark sheet (or any other physical medium having a pattern of holes or other optically recognizable marks), compact disc read-only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile disc (DVD), Blu-ray disc (BD), and/or any other non-transitory optical medium.
In one aspect, the present application provides a device comprising the storage medium of the present application. For example, the device further comprises a processor coupled to the storage medium, the processor being configured to execute the program stored in the storage medium to implement the method of the present application.
Without wishing to be bound by any theory, the following examples serve only to illustrate the methods, uses, and the like of the present application and are not intended to limit the scope of the invention of the present application.
Examples
Example 1: Detection method of the present application
Methods
Marker discovery and validation
First, differentially methylated sites were preliminarily screened from TCGA data generated with the Infinium HumanMethylation 450K array. A total of 4,539 tumor samples and 521 normal tissue samples were analyzed. Data for 656 normal white blood cell (WBC) samples from a GEO dataset (GSE40279) were used to remove CpG sites that are hypermethylated (>0.1) in the hematopoietic lineage. CpG sites located on the X or Y chromosome were also excluded. Differentially methylated loci (DML) were selected with the "limma (V2.0)" software, with the cutoff set to a Benjamini-Hochberg (B-H) corrected FDR < 0.05. In addition, CpG sites associated with common cancers in previous studies were included. In total, this yielded 80,672 CpG sites in the marker discovery phase.
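The filtering and FDR steps above can be sketched in Python as follows. The sketch assumes per-site p-values are already available (the moderated tests performed by limma are not reimplemented), applies the WBC and sex-chromosome filters before the B-H correction, and uses hypothetical dictionary keys (`id`, `chrom`, `wbc_beta`, `pvalue`); all of these are illustrative choices.

```python
def benjamini_hochberg(pvalues):
    """Return B-H adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_sites(sites, fdr_cutoff=0.05, wbc_cutoff=0.1):
    """Keep autosomal sites with low WBC methylation and adjusted p < cutoff.

    sites: list of dicts with hypothetical keys 'id', 'chrom', 'wbc_beta', 'pvalue'.
    """
    candidates = [
        s for s in sites
        if s["wbc_beta"] <= wbc_cutoff and s["chrom"] not in ("chrX", "chrY")
    ]
    qvalues = benjamini_hochberg([s["pvalue"] for s in candidates])
    return [s["id"] for s, q in zip(candidates, qvalues) if q < fdr_cutoff]
```

Filtering before the correction reduces the number of tests entering the B-H procedure; the source does not state the order, so the opposite order would be an equally plausible reading.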
The CpG sites were then partitioned into 8,312 blocks (as described under "Co-methylation block partitioning") and validated by comparison using in-house sequencing data from 48 primary lung tumor samples and 20 normal lung tissue samples. Blocks with insufficient depth (<100) in most samples (>80%) were excluded from downstream analysis. Linear regression was used to select differentially methylated blocks, with cutoffs of log(fold change) > 0.05 and B-H corrected FDR < 0.05. A total of 2,473 blocks were selected as classification features; their genomic coordinates are listed in Table 5.
Data processing framework
FASTQ files were generated from the raw BCL data with bcl2fastq (V2.19.1). Illumina-specific adapters and low-quality sequences were trimmed with Trimmomatic (V0.36) (SLIDINGWINDOW:4:15 TRAILING:20). For Accel-NGS Methyl-Seq (Swift Biosciences), additional trimming was performed as described in previous work. For ELSA-seq, an exemplary sequencing method of the present application, parameters were tested at different stringencies to remove low-complexity tail sequences. The final trimming criteria were set as follows: (i) the number of runs of G bases longer than a fixed value m in the query; (ii) the fraction of non-G bases is less than a fixed number p; (iii) the next base is a high-quality A/T/C (Phred score > 30). Based on simulation and experimental analyses, m = 10 and p = 0.1 were used in this study. The quality of the trimmed reads was then checked with FastQC (V0.11.5), and a minimum read length of 50 bases was required.
Next, reads were aligned with bwa-meth (V0.2.0), using default parameters, to the in silico converted hg19 human genome (C->T for R1, G->A for R2). Duplicate reads were marked with Samblaster (V0.1.24), and BAM files were sorted with Picard (V1.138). Reads with an alignment score < 20, more than 5 mismatches, improper pairing, or multiple mapping positions were excluded from secondary analysis. To minimize artifacts associated with misassigned fragment end positions, duplicate removal was applied with a tolerance of +/-3 bp at the start points of both R1 and R2 (the "fuzzy window"). An important caveat is that, although this strategy correctly identifies the original molecule more than 90% of the time, its accuracy drops when the library is severely over- or under-sequenced.
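The "fuzzy window" duplicate removal can be sketched in Python as below. Representing a fragment as a `(chrom, r1_start, r2_start)` tuple and resolving clusters greedily in sorted order are assumptions for illustration; the in-house implementation is not described in this text.

```python
def dedup_fuzzy(fragments, tol=3):
    """Collapse fragments whose R1 and R2 start points both lie within tol bp.

    fragments: iterable of (chrom, r1_start, r2_start) tuples.
    Returns one representative per cluster, in sorted order.
    """
    kept = []
    for chrom, s1, s2 in sorted(fragments):
        is_dup = any(
            c == chrom and abs(s1 - k1) <= tol and abs(s2 - k2) <= tol
            for c, k1, k2 in kept
        )
        if not is_dup:
            kept.append((chrom, s1, s2))
    return kept
```

The pairwise comparison is quadratic and is meant only to show the rule; a production implementation would index fragments by position.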
For context-based error correction and calculation of methylation metrics, an in-house module was built to collapse forward and reverse reads into a single consensus sequence, taking into account base-call quality, the CIGAR string, and the direction of mutation relative to the reference genome (for example, C->A). Base calls in low-fidelity regions (for example, near read ends) are down-weighted, and missing bases are recovered from the supporting reads with the smallest Hamming distance.
Co-methylation block partitioning
The design regions of the capture panel are first partitioned into co-methylation blocks based on a modified correlation matrix, referred to as the "block index".
(i) The mAF correlation matrix of the CpG sites in region r is defined, with the correlation coefficient between site i and site j calculated from the following quantities:
ρ_ij denotes the Pearson correlation coefficient,
E(y_i) denotes the mAF across all samples at position i,
pos_i denotes the genomic position,
L denotes the length of the original region,
and λ1 and λ2 are parameters estimated using prior information.
(ii) The "block index" of a new split within the original region r is defined as a function
in which B_i denotes the number of sites in the i-th block,
and α1 and α2 are the penalty coefficients for unbalanced splitting and over-splitting, respectively. In this study, both α1 and α2 were set to 1.0 based on the desired block length and uniformity.
This application defines two metrics, "region median length" and "region length coefficient of variation". The region median length measures the size of the defined regions: a smaller value indicates that the regions contain less information, and a larger value that they contain more. The region length coefficient of variation (standard deviation / mean) measures the size differences between regions: a larger value indicates a more unbalanced partition (more isolated sites and more information lost), and a smaller value a more balanced partition. The value of α1 mainly affects the region length coefficient of variation, and the value of α2 mainly affects the region median length; the relationships are shown in Figure 18.
(iii) The partitioning is repeated by iterative division with unique breakpoints. Given k breakpoints, the (k+1)-th new breakpoint is selected by maximizing the block index.
(iv) The iterative algorithm terminates when the (k+1)-th breakpoint coincides with the k-th breakpoint.
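Steps (iii) and (iv) can be sketched as a greedy search in Python. Because the block index formula is not reproduced in this text, the sketch takes the index as a caller-supplied function (`block_index`, a hypothetical name); it also reads the termination rule as "stop when no new breakpoint raises the index", which is an interpretation rather than the stated rule.

```python
def partition_blocks(n_sites, block_index):
    """Greedy partition of sites 0..n_sites-1 into contiguous blocks.

    block_index: callable that takes a sorted list of breakpoints and
    returns a score to maximize. A breakpoint b starts a new block at site b.
    """
    breakpoints = []
    best_score = block_index(breakpoints)
    while True:
        candidates = [b for b in range(1, n_sites) if b not in breakpoints]
        if not candidates:
            break
        # Score every partition obtained by adding one unused breakpoint.
        scored = [(block_index(sorted(breakpoints + [b])), b) for b in candidates]
        score, b = max(scored)
        if score <= best_score:
            break  # no additional breakpoint improves the index
        breakpoints = sorted(breakpoints + [b])
        best_score = score
    return breakpoints
```

Any scoring function that rewards within-block similarity and penalizes unbalanced or excessive splitting can be passed in, so the same loop can be reused once the actual block index is available.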
Methylation block score (MBS) definition
The MBS metric is defined as follows.
For a given block,
n is the total number of reads covering multiple CpG sites,
L_i is the number of CpG sites covered on the i-th read,
l_ij denotes the length of a run of consecutive methylated CpG sites (>1),
and m denotes the total count on the i-th read. The number of reads in each block is used to normalize depth differences.
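A Python sketch of one possible MBS computation follows. It assumes the score is the mean over reads of the summed squared ratios l_ij / L_i, counts only runs longer than one CpG (the reading taken here of "(>1)"), and encodes each read as a string of `1` (methylated) and `0` (unmethylated) characters, one per CpG. The formula and the encoding are assumptions, since the formula is not reproduced in this text.

```python
def methylation_block_score(reads):
    """Assumed MBS: mean over reads of sum_j (l_ij / L_i) ** 2.

    reads: non-empty list of non-empty strings over {'1', '0'},
    one character per CpG site covered by the read.
    """
    total = 0.0
    for read in reads:
        n_cpg = len(read)  # L_i
        # Runs of consecutive methylated CpGs, keeping only runs longer than 1.
        runs = [len(run) for run in read.split("0") if len(run) > 1]
        total += sum((r / n_cpg) ** 2 for r in runs)
    return total / len(reads)
```

Under this form, long uninterrupted methylated runs are weighted more heavily than the same number of scattered methylated sites.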
Simulation of fragment identification based on UMI correction or shift correction
The current pipeline of this application and the UMI-assisted method were simulated as follows:
(i) Template generation: a pool of 2 kb DNA segments was generated and computationally "sheared" into fragments of various lengths (mean = 170, sd = 30);
(ii) Tail addition: based on empirical observation, tail sequences (90% G) of various lengths were added to the 5' end of R2;
(iii) UMI incorporation: a random 6-base UMI tag (corresponding to the modified adapter 1) was added to the 5' end of R2 to label the original template;
(iv) Technical errors: three datasets were generated with low, medium, and high substitution error rates typical of Illumina sequencers (0.005, 0.01, 0.02); all substitutions (for example, A->T; T->A) were set to be equally likely; indels (+/-1 base) were introduced in the tail region to account for its homopolymer nature; the accumulation of PCR errors was mimicked by 10 rounds of PCR with a duplication probability of 0.9 per round (1 meaning perfect doubling); and incomplete extension was simulated by random positional fluctuation (<10 bases) at the 5' end of R1.
(v) UMI and non-UMI schemes:
a) UMI: UMI extraction (5' end of R2) and shift correction (5' end of R1, fw = 3) were applied to the respective ends; to avoid overcounting UMIs because of amplification or sequencing errors, a minimum edit distance of 2 was allowed.
b) Non-UMI: shift correction was performed on both ends (5' end of R1 and 5' end of R2, fw = 3);
(vi) Template counting: the original (post-BC) fragment depth was set to 250-500X, and the raw fragment depth to 500-1000X.
The whole process was repeated 9 times, and the average performance of the simulations is summarized in Table 2.
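Three of the simulation ingredients above (fragment length sampling, equally likely substitution errors, and PCR duplication with probability 0.9 per round) can be sketched in Python as follows. The normal length distribution, the minimum length floor, and all function names are assumptions; tail addition, indels, and positional fluctuation are omitted.

```python
import random

def sample_fragment_length(rng, mean=170, sd=30, min_len=20):
    """Draw one fragment length from a normal distribution, floored at min_len (assumed)."""
    return max(min_len, round(rng.gauss(mean, sd)))

def add_substitution_errors(seq, rate, rng):
    """Replace each base with probability `rate`, choosing the alternatives equally."""
    out = []
    for base in seq:
        if rng.random() < rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

def pcr_copies(rng, rounds=10, dup_prob=0.9):
    """Copies of one template after `rounds` cycles; each copy duplicates with dup_prob."""
    copies = 1
    for _ in range(rounds):
        copies += sum(rng.random() < dup_prob for _ in range(copies))
    return copies
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when the whole process is repeated several times and averaged.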
UMI-facilitated template counting experiments
A DNA mixture (10 ng) of λ DNA and filler DNA containing ~5,000 haploid copies (before BC) was prepared. To incorporate UMIs into ELSA-seq, an exemplary sequencing method of the present application, the 3' adapter was modified with a 6-base random UMI inserted immediately next to the overhang sequence (adapter 1-UMI), so that the UMI is sequenced in the first six cycles of R2 (Table 1). To help locate the UMI, two nucleotides were predefined to serve as anchors (for example, 5'-NNDDNN-3', D: A/T/G). To minimize the risk of false UMI identification caused by polymerase or base-calling errors, a minimum edit distance of two was allowed for error correction. A ~2 kb region of the λ genome (4,500-6,539) was targeted and analyzed, yielding ~50,000 raw reads. Using random read downsampling, the unique fragments measured by the current (non-UMI) pipeline of this application were compared with the UMI-facilitated counting strategy. The results are summarized in Table 2.
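The anchor check and the error-tolerant UMI counting can be sketched in Python as below. The sketch reads the 5'-NNDDNN-3' design as "positions 3 and 4 must be A, T, or G", uses Hamming distance as the edit distance for these fixed-length tags, and merges a UMI into a more abundant one when they differ at fewer than two positions. Each of these readings is an assumption.

```python
from collections import Counter

def is_valid_umi(umi):
    """True for a 6-base tag matching NNDDNN, where D is A, T, or G."""
    return (
        len(umi) == 6
        and all(b in "ACGT" for b in umi)
        and all(b in "ATG" for b in umi[2:4])
    )

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def collapse_umis(umis, min_distance=2):
    """Keep UMIs at least min_distance away from every more abundant kept UMI."""
    counts = Counter(u for u in umis if is_valid_umi(u))
    kept = []
    # Visit UMIs from most to least abundant; ties are broken alphabetically.
    for umi, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if all(hamming(umi, k) >= min_distance for k in kept):
            kept.append(umi)
    return kept
```

The length of the returned list is then an estimate of the number of distinct templates at a given fragment position.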
Supplementary Methods
Determination by ddPCR of the adapter ligation efficiency of ELSA-seq, an exemplary sequencing method of the present application, and of TruSeq
To evaluate ligation efficiency, the ddPCR reactions were set up as follows:
1. ELSA-seq ligation reaction (20 μl)
2. ddPCR reaction (20 μl):
Substrates:
ELSA-seq: synthetic ssDNA (KRAS-10N)
TruSeq: dsDNA amplified with primers KRASF and KRASR (KRAS-177)
ELSA ligation-1:
Total copies were detected with primers LEF and KRASR and probe KRAS-G13D.
Ligated copies were detected with primers LEF and LER and probe KRAS-G13D.
ELSA ligation-2:
Total copies were detected with primers LEF and KRASR and probe KRAS-G13D.
Ligated copies were detected with primers LEF and LER-ATNR1 and probe KRAS-G13D.
TruSeq ligation:
Total copies were detected with primers LEF and KRASR and probe KRAS-G13D.
Ligated copies were detected with primers LEF and LER-ATNR1 and probe KRAS-G13D.
The 'N' positions in the KRAS-10N template were designed to avoid sequence-dependent bias. All oligonucleotide sequences are listed in Table 1 (Summary of oligonucleotide sequences).
Construction and analysis of whole-genome bisulfite sequencing (WGBS) libraries
WGBS libraries were constructed with ELSA-seq, NEBNext Ultra II (New England Biolabs), and Accel-NGS Methyl-Seq (Swift Biosciences) as described in Methods or according to the manufacturers' instructions. Genomic DNA was sheared by sonication to ~200 bp (peak). Library quality was assessed with the LabChip GXII Touch 24 (PerkinElmer). Paired-end sequencing (2×150 bp) was then performed on the Illumina NovaSeq 6000 system.
Simulation of fragment identification based on UMI correction or shift correction
The current pipeline of this application and the UMI-assisted method were simulated as follows:
(vii) Template generation: a library of 2 kb DNA segments was generated and computationally "cut" into fragments of various lengths (mean = 170, sd = 30);
(viii) Tail addition: based on empirical observations, tail sequences (90% G) of varying lengths were added to the 5' end of R2;
(ix) UMI incorporation: a random 6-base UMI tag (corresponding to modified adapter 1) was added to the 5' end of R2 to label the original template;
(x) Technical errors: three datasets were generated with low, medium and high substitution error rates typical of Illumina sequencers (0.005, 0.01, 0.02); all substitutions (e.g., A->T; T->A) were set to be equally likely; indels (+/-1 base) were introduced in the tail region to reflect its homopolymer nature; the accumulation of PCR errors was mimicked by 10 PCR cycles with a per-cycle duplication probability of 0.9 (1 means perfect doubling); incomplete extension was simulated by random positional fluctuation (<10 bases) at the 5' end of R1.
(xi) UMI and non-UMI schemes:
a) UMI: UMI extraction (5' end of R2) and shift correction (5' end of R1, fw=3) were applied to the respective ends;
b) Non-UMI: shift correction was applied to both ends (5' end of R1 and 5' end of R2, fw=3);
(xii) Template counting: to reflect ELSA-seq libraries in a clinical setting, the depth of original (post-BC) fragments was set at 250-500X and the depth of raw fragments at 500-1000X.
The whole process was repeated 9 times, and the average simulated performance is summarized in Table 2.
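The first steps of the simulation described above can be sketched as follows. This is only an illustrative sketch, not the pipeline used in the study: the function names, the uniformly random template, the fixed seed and the way the UMI is prefixed are all assumptions, and tail addition, indels, PCR duplication and end jitter are omitted for brevity.

```python
import random

BASES = "ACGT"

def make_fragments(template, n, mean=170, sd=30, rng=random):
    """Computationally 'cut' n fragments with normally distributed lengths."""
    frags = []
    for _ in range(n):
        length = max(1, min(len(template), round(rng.gauss(mean, sd))))
        start = rng.randrange(0, len(template) - length + 1)
        frags.append(template[start:start + length])
    return frags

def add_umi(fragment, rng=random, k=6):
    """Prefix a random k-base UMI (hypothetical stand-in for the R2 5' end tag)."""
    umi = "".join(rng.choice(BASES) for _ in range(k))
    return umi, umi + fragment

def add_substitutions(seq, rate, rng=random):
    """Substitute each base with probability `rate`; all alternatives equally likely."""
    out = []
    for b in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in BASES if x != b]))
        else:
            out.append(b)
    return "".join(out)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
template = "".join(rng.choice(BASES) for _ in range(2000))  # one 2 kb segment
fragments = make_fragments(template, 500, rng=rng)
reads = [add_substitutions(add_umi(f, rng)[1], 0.01, rng) for f in fragments]
```

The three error-rate datasets in step (x) would correspond to calling `add_substitutions` with 0.005, 0.01 and 0.02.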
UMI-facilitated template counting experiments
A DNA mixture (10 ng) of λ DNA and filler DNA containing ~5,000 haploid copies (before BC) was prepared. To incorporate UMIs into ELSA-seq, an exemplary sequencing method of this application, the 3' adapter was modified with a 6-base random UMI inserted immediately next to the overhang sequence (adapter 1-UMI), so that the UMI was sequenced in the first six cycles of R2 (Table 1). To help locate the UMI, two nucleotides were predefined as anchors (e.g., 5'-NNDDNN-3', D: A/T/G). To minimize the risk of false UMI identification caused by polymerase or base-calling errors, a minimum edit distance of 2 was allowed for error correction. A ~2 kb region of the λ genome (4,500-6,539) was targeted and analyzed, yielding ~50,000 raw reads. Using random read downsampling, the unique fragments measured by the current pipeline of this application (non-UMI) were compared with those from the UMI-facilitated counting strategy. The results are summarized in Table 2.
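The anchored UMI layout described above (5'-NNDDNN-3', D = A/T/G) can be checked with a short helper. This is a minimal sketch: the function names are hypothetical, and the Hamming distance shown here is only one possible way to compare equal-length UMIs; the text does not specify how the edit-distance rule was implemented.

```python
import re

# 5'-NNDDNN-3' with D = A/T/G (i.e. any base except C), as stated in the text
UMI_PATTERN = re.compile(r"[ACGT]{2}[ATG]{2}[ACGT]{2}")

def extract_umi(read2, k=6):
    """Return (umi, remainder) if the first k bases fit the anchored pattern, else None."""
    umi = read2[:k]
    if UMI_PATTERN.fullmatch(umi):
        return umi, read2[k:]
    return None

def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))
```

A read whose first six bases carry a C at an anchor position is rejected, which is how the anchors help separate true UMIs from shifted or erroneous ones.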
Methyltransferase-mediated in vitro methylation
The reaction was set up as follows:
In vitro methylation was performed at 37°C for 1 hour and terminated by heating at 65°C for 20 minutes.
Sample size calculation with the desired margin of error (<0.05)
The sample size (n) was calculated as follows:
where the quantile term is the inverse function of the standard cumulative normal distribution.
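The formula itself is not preserved in this text. One standard form that matches the description (a margin of error and the inverse of the standard normal cumulative distribution) is n = z² · p(1 − p) / E², with z = Φ⁻¹(1 − α/2). The sketch below assumes that form; the function name and the default values of `p` and `alpha` are illustrative assumptions, not values taken from the application.

```python
import math
from statistics import NormalDist

def sample_size(p=0.5, margin=0.05, alpha=0.05):
    """Assumed form: n = z^2 * p * (1 - p) / margin^2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # inverse standard normal CDF
    return math.ceil(z * z * p * (1 - p) / margin ** 2)
```

Under these assumptions, the most conservative case (p = 0.5) with a 0.05 margin of error at 95% confidence gives the familiar value of 385.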
Analysis of cohort composition
To assess potential confounding effects in the cohort design (e.g., age and sex), the chi-square test was used, with the null hypothesis that the categorical variables were comparable between the case and control classes (P>0.05). Additional data on patients and tumors are provided in Figure 15 and Table 6.
Multivariate analysis of predictors
Univariate and multivariate analyses were performed by logistic regression to identify clinical factors important for the MERMAID test result. All independent variables (clinical factors) were tested for a linear relationship with the logit of the dependent variable (prediction score). The results are provided in Table 6.
Calculation of potential clinical benefit in a hypothetical screening population
MERMAID was evaluated for its diagnostic yield for LC in a hypothetical screening population of 10,000 subjects. Based on the results of this study, the sensitivity was set at 63.0% (194/308) and the specificity at 96.2% (251/261). According to the Surveillance, Epidemiology, and End Results (SEER) program (SEER.cancer.gov/data/access, 2020), the prevalence of LC among average-risk older adults was assumed to be 0.53%. The numbers of individuals with true-positive (TP), false-positive (FP), true-negative (TN) and false-negative (FN) results among the 10,000 subjects were predicted. The positive predictive value (PPV) and negative predictive value (NPV) were calculated as follows: PPV = TP/(TP + FP) and NPV = TN/(TN + FN).
The predicted FP per TP indicates the number of FP subjects observed for each TP subject detected. One caveat is that, although all patients in this application had early-stage (stage I-III) LC and 65% of them (199/308) were diagnosed with stage I disease (stages Ia and Ib) at study entry, the stage distribution in a true screening setting is still unknown, so the PPV may be lower than reported here.
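The screening calculation above can be reproduced from the stated inputs (10,000 subjects, 0.53% prevalence, 63.0% sensitivity, 96.2% specificity). The sketch below uses the standard definitions of PPV and NPV; the function and key names are illustrative, and the expected counts are left unrounded.

```python
def screening_outcomes(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts and predictive values in a screened population."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity      # true positives
    fn = diseased - tp               # false negatives
    tn = healthy * specificity       # true negatives
    fp = healthy - tn                # false positives
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "FP_per_TP": fp / tp,
    }

result = screening_outcomes(10_000, 0.0053, 0.630, 0.962)
```

With these inputs the sketch yields roughly 33 TP and 378 FP, so the PPV is low (about 8%) even at high specificity, which is typical when prevalence is well below 1%.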
For a test in which a positive result leads to an action, performance can be evaluated as follows:
(1)
(2) where R is the risk of a lung nodule being malignant at which the choice between surveillance and active investigation is made without bias. Here, R was set at 1.1% based on a report from the National Lung Screening Trial (NLST) research team.
In the hypothetical average-risk population, the performance of MERMAID met the criterion based on formulas (1) and (2) (MERMAID: 16.6 > 2.1).
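Formulas (1) and (2) are not preserved in this text. The two quoted values are, however, reproduced if the left-hand side is the positive likelihood ratio, sensitivity/(1 − specificity), and the right-hand side is the ratio of the odds at the risk threshold R to the odds at the population prevalence. That reading is an inference from the numbers, not a statement of the original formulas; the function names below are illustrative.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """TPR / FPR of the test."""
    return sensitivity / (1 - specificity)

def odds(p):
    return p / (1 - p)

def required_ratio(risk_threshold, prevalence):
    """Assumed criterion: odds at the risk threshold divided by odds at prevalence."""
    return odds(risk_threshold) / odds(prevalence)

lr_pos = positive_likelihood_ratio(0.630, 0.962)   # about 16.6
needed = required_ratio(0.011, 0.0053)             # about 2.1
```

Under this reading, the test is considered useful when `lr_pos` exceeds `needed`, which is the comparison "16.6 > 2.1" quoted above.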
Example 2 Results of the methylation detection method of this application
Design of an ultrasensitive BS-seq assay of this application
The methylation detection method of this application may be based on methylation sequencing data known in the art. For example, it may be based on an ultrasensitive BS-seq (bisulfite sequencing) assay; one usable BS-seq assay is the ELSA-seq described in WO2019192489A1. The design principle of the sequencing method used to obtain the methylation sequencing data of this application involves both molecular and computational aspects (Figure 1). To improve detection capability, this application first focused on increasing the number of templates that can be effectively sequenced: (i) A DNA molecule needs specific adapters at both its 5' and 3' ends to be "read" by a high-throughput sequencer. Breakage of a tagged molecule leads to failure of "seeding" on the flow cell surface. To minimize BC-induced adapter loss, this denaturing step is performed first, followed by single-stranded DNA (ssDNA)-compatible steps, to maximize retrieval of the original templates. (ii) Adapter ligation is another common limiting factor, so a new strategy called "Tail-and-Tag" was designed to improve its efficiency. Briefly, bisulfite-treated DNA is denatured, dephosphorylated and extended with a cytosine-rich nucleotide tail by TdT (terminal deoxynucleotidyl transferase). A splint adapter is then annealed to the tail in the presence of E. coli ligase to enable a highly efficient ligation step (Tail-Tag.1). (iii) Next, copy strands are generated from the common anchor site by a uracil-tolerant polymerase, providing high molecular redundancy for the single-tagged intermediates and thereby minimizing template loss during the next round of adapter attachment (Tail-Tag.2) (Figure 1).
To estimate the template recovery rate of the sequencing method used for the methylation sequencing data of this application, its ligation efficiency was first compared with that of the conventional method (TruSeq) by ddPCR (Figure 6A-B). The proportion of ligated to total DNA copies was 82% (Tail-Tag.1) and 86% (Tail-Tag.2) for the method of this application and 64% for TruSeq (Figure 6C-E). Because both methods require two rounds of ligation, applying step (ii) alone nearly doubles the recovery rate (0.82*0.86 ≈ 0.71 versus 0.64*0.64 ≈ 0.41). Previous work has shown that the concepts used in steps (i) and (iii) can increase template retention, so the combination of all these steps is expected to substantially increase library complexity. Figure 7 summarizes the comparison of the sequencing method of this application with other library preparation methods.
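The recovery comparison above is a product of per-round ligation efficiencies. A minimal sketch of that arithmetic, assuming the two rounds are independent (the function name is illustrative):

```python
def two_round_recovery(first_round, second_round):
    """Fraction of templates carrying both adapters after two ligation rounds."""
    return first_round * second_round

tail_tag = two_round_recovery(0.82, 0.86)   # Tail-Tag.1 then Tail-Tag.2
truseq = two_round_recovery(0.64, 0.64)     # same efficiency in both rounds
fold_gain = tail_tag / truseq
```

This gives about 0.71 versus 0.41, a gain of roughly 1.7-fold, which is what the text describes as nearly doubling.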
To further evaluate its performance, the sequencing method of this application was compared with two commercial kits, Accel Methyl-Seq (SWT) and NEBNext Ultra (NEB). Whole-genome bisulfite sequencing (WGBS) libraries built with the method of this application showed a 10-fold increase in yield and the highest number of unique molecules regardless of input amount or sequencing depth (Figure 2A-C). In addition, for inputs as low as 500 pg, the method showed the greatest methylome coverage, very little amplification bias and highly reproducible methylation levels (Figure 2D, Figure 8A-E). Notably, although reads undergo stringent adapter and tail removal, residual synthetic sequence and incomplete extension of copy strands can still alter the start or end position of a DNA fragment, leading to incorrect estimates of library complexity. "Conditional trimming" and "shift correction" were therefore adopted in data processing to overcome these challenges (Methods section of Example 1). Through simulation- and experiment-based analyses, this application found that >90% of fragments can be correctly identified by its current informatics strategy, approaching the level of UMI-assisted measurement (Supplementary Methods section of Example 1, Table 2).
Conventional BS-seq introduces a large number of technical artifacts, which hinders its use for rare-allele detection. To address this, this application uses deep sequencing to suppress errors within PCR duplicate families. The sequencing method of this application was first applied to unmethylated phage λ DNA to measure genome-wide "technical noise" (C/(C+T) in read 1, G/(G+A) in read 2). Compared with previous studies, the error rate of the method (maximum 0.0025, mean 0.0017) was almost 10-fold lower, regardless of the number of sequencing cycles (Figure 8F). The non-reference allele frequencies for each substitution type (0.00003-0.00135) appeared comparable to those described for the Illumina platform (Figure 2E-F, Figure 8G). Recently, a method called EM-seq, characterized by mild enzyme-based conversion, was reported to allow library preparation from small inputs. Analysis of human methylome data revealed that both EM-seq and the method of this application perform well in library complexity and coverage uniformity, with EM-seq showing less AT bias and the method of this application showing lower patterned noise (two unconverted Cs in tandem, Table 2). The full capability of this new technique remains to be tested.
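The "technical noise" ratio described above can be computed from base counts at reference cytosine positions of an unmethylated control, where every unconverted call counts as an error. This is a minimal sketch; the function name and the dictionary layout of the counts are assumptions.

```python
def technical_noise(base_counts, read):
    """Unconverted fraction: C/(C+T) for read 1, G/(G+A) for read 2."""
    if read == 1:
        unconverted, converted = base_counts["C"], base_counts["T"]
    elif read == 2:
        unconverted, converted = base_counts["G"], base_counts["A"]
    else:
        raise ValueError("read must be 1 or 2")
    total = unconverted + converted
    return unconverted / total if total else float("nan")
```

For example, 17 unconverted calls among 10,000 informative bases in read 1 correspond to a rate of 0.0017, the mean value quoted in the text.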
Panel Design and Target Capture Performance
Because deep sequencing of the whole human methylome would be prohibitively expensive, this application focused on epigenetic changes associated with common cancers by analyzing data from previous studies (Figure 3A). Accordingly, 80,672 CpG sites were selected, spanning approximately 1.05 Mb of genomic regions. The panel showed significant enrichment in CpG islands, consistent with the reported role of DNA methylation in transcriptional regulation during tumorigenesis (Figure 9A-B).
The performance of targeted sequencing was evaluated using human lymphocyte DNA (NA12878) and plasma samples. Fairly uniform capture of the amplified DNA fragments was achieved with as little as 2 ng of cfDNA: 60-80% of reads aligned to the bait regions (on-target rate), and >90% of the bait regions were covered by >200% reads (uniformity) (Figure 3B, Figure 9C, Table 2). Read coverage plotted against GC content revealed a typical unimodal distribution (Figure 9D), and the estimated base-calling accuracy exceeded 99.9% (Phred score >30) for most bases (>0.94) (Figure 9E, Table 2). The sequenced cfDNA fragments had a mononucleosomal peak length of ~160 bp, in close agreement with the result of the conventional method (TruSeq) (Figure 3C). In addition, negligible amplification or capture preference was observed for fragments associated with mononucleosomes and dinucleosomes (Figure 9F). To assess common capture bias, the percentage of methylated cytosines at each individual CpG site (iAF, individual methylated allele frequency) was calculated, and highly correlated values were found between the plus and minus strands (ρ=0.90, Figure 3D). This was corroborated by the nearly balanced read depth of the plus and minus strands (Figure 9G). The typical technical error rate after target enrichment was 0.0012, similar to that observed with the capture-free method (Figure 3E).
The detection method of this application, MERMAID, identifies signals through single-molecule-derived patterns
Conventional DNA methylation analysis is largely based on iAF, which is sensitive to sampling variance and technical noise. In response, the methylation detection method of this application, MERMAID, uses a metric, the "block index" (BI), to partition CpG sites showing similar methylation states into distinct blocks (Figure 4A). A total of 8,312 blocks were defined, with a median block size of ~143 bp and an average of ~13 CpG sites per block (Figure 10A-D). The average methylation level within each block is defined as mAF (mean methylated allele frequency), and its performance in detecting cancer-associated changes was compared with that of iAF. Examination of SHOX2, a gene frequently methylated in LC, showed that mAF gave a significantly higher AUC than iAF, indicating that the "block" is a more discriminative unit than the "site" (Figure 4B).
An advantage of MERMAID is the greatly improved separation of biological signal from technical noise. Analysis of individual DNA fragments revealed distinct patterns for cancer cells and normal cells, whereas chemical or sequencing errors were usually sporadic (Figure 10E). To accentuate this contrast, a new metric, the "methylation block score" (MBS), was developed; it is defined as the ratio of the weighted occurrence of consecutive methylation patterns to the total number of CpG sites per read (Figure 4C). To evaluate the performance of MBS, DNA treated in vitro with methyltransferase was mixed into λ DNA at different ratios. As shown in Figure 4D, with MBS even samples with a tiny spike-in (0.001) could be clearly distinguished from the negative control, whereas with mAF a clear overlap of signals was observed, supporting the success of pattern recognition in improving the signal-to-noise ratio.
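The relationship between the per-site iAF and the per-block mAF can be sketched as follows. The block assignment is taken as given, because the block index computation is not described in this section. Averaging the per-site frequencies (rather than pooling read counts) is an assumption, as are the function names and the tuple layout.

```python
from collections import defaultdict

def iaf(methylated_reads, total_reads):
    """Methylated allele frequency at one CpG site."""
    return methylated_reads / total_reads

def maf_by_block(sites):
    """sites: iterable of (block_id, methylated_reads, total_reads), one per CpG.

    Returns the mean of per-site iAF values for each block; uncovered sites are skipped.
    """
    per_block = defaultdict(list)
    for block_id, methylated, total in sites:
        if total > 0:
            per_block[block_id].append(iaf(methylated, total))
    return {block: sum(vals) / len(vals) for block, vals in per_block.items()}
```

Because mAF averages over roughly a dozen neighbouring CpG sites per block, it is less affected by sampling variance at any single site than iAF is.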
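The idea behind a run-weighted, per-read score can be illustrated with a toy function. The weighting used here (each run of consecutive methylated CpGs contributes the square of its length, normalized by the squared number of CpGs on the read) is a hypothetical choice made only for illustration; this section does not state the weights actually used for MBS.

```python
from itertools import groupby

def read_block_score(states):
    """states: string of '1' (methylated) and '0' (unmethylated) CpG calls on one read.

    Hypothetical weighting: sum of squared run lengths of methylated CpGs,
    divided by the squared number of CpG sites on the read.
    """
    n = len(states)
    if n == 0:
        return 0.0
    runs = [len(list(group)) for bit, group in groupby(states) if bit == "1"]
    return sum(k * k for k in runs) / (n * n)
```

Under this weighting, the reads `1100` and `1010` have the same mean methylation (0.5) but scores of 0.25 and 0.125, so contiguous methylation is rewarded and sporadic calls, which are more typical of conversion or sequencing errors, are down-weighted.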
As shown in Figure 17, conventional DNA methylation analysis is mainly based on the single-site methylation level iAF, which is very sensitive to sequencing depth and technical noise. This application therefore designed a "block index" metric that assigns CpG sites to different blocks according to their methylation status and genomic position. The average methylation level of each block is defined as mAF (mean methylated allele frequency), and its performance in detecting cancer-associated changes was compared with that of iAF. By examining SHOX2, a gene in which methylation alterations frequently occur in lung cancer patients, this application found that the area under the receiver operating characteristic curve (AUROC) of mAF was significantly higher than that of iAF, indicating that the "block" is a more discriminative unit than the "site"; moreover, the lower the sequencing depth, the greater the advantage of mAF over iAF.
As shown in Figure 19, this application also compared the MBS statistic with the conventional mAF statistic. Using spike-in data from a set of negative and positive standards, density curves of the MBS and mAF statistics were plotted for samples with different spike-in ratios. The figure shows that the MBS statistic differs more than the mAF statistic between the "negative standard" and the "positive standard at a 0.1% spike-in ratio", indicating that the MBS statistic is better able to distinguish differential methylation signals.
Evaluation of MERMAID for Detecting Tumor-Derived Signals
The analytical performance of MERMAID was examined at both the assay and bioinformatics levels. Highly correlated MBS values (ρ=0.94-0.97) were observed among healthy individuals (including replicates), indicating excellent reproducibility (Figure 5A). Similar results were observed with different input amounts (2-30 ng), sequencing depths (1,000-5,000X) or DNA sources (cells or plasma) (Figure 11). In contrast, the differences between the healthy and cancer groups, and among different cancer patients, were pronounced (ρ=0.60-0.83), suggesting that methylation profiles are strongly affected by disease status (Figure 5A).
The quantitative accuracy of MERMAID was assessed with tumor-cell spike-in experiments. In a dilution series of colorectal cancer (CRC) DNA in normal WBC DNA, a near-perfect correlation between the observed and expected tumor fractions was found at a dilution of 0.0005 (Figure 5B, r² = 0.99), indicating precise quantification. Highly consistent results were also observed when MERMAID was compared with quantitative methylation-specific PCR (qMSP) or with different targeting methods (Figure 12).
The false discovery rate (FDR) of the assay was evaluated empirically by repeatedly sequencing normal white blood cells (WBC). As shown in Figure 5C and Table 2, the FDR decreased steadily as DNA input or sequencing depth increased, most likely because markers with low methylation counts produced most of the false calls and these markers benefited most from the reduction in Poisson noise (Figure 13A-B).
The limit of detection (LoD) of MERMAID was evaluated at three levels: (i) Numerical simulation: by simulating methylation counts from a binomial distribution, it was found that increasing either the number of markers or the sequencing depth improved the success rate (sensitivity) (Figure 13C). Notably, when the ctDNA fraction dropped from 1/10,000 to 1/100,000, the marker panel size expanded sharply, showing that tumor burden is a fundamental limiting factor for detection sensitivity. (ii) Bioinformatic simulation: the theory was further tested by computationally mixing sequencing reads from cancer cells with reads from healthy cfDNA at different ratios. As shown in Figure 5D, the bioinformatic sensitivity of MERMAID reached 1/100,000 in simulated data from both LC and CRC (two-tailed t-test, P<0.001). This observation appeared to be batch-independent, because mixed control data from different sequencing runs produced no significant detections (two-tailed t-test, P>0.05, Figure 13D). (iii) Experimental evaluation: LC and CRC DNA were spiked into normal WBC DNA in a dilution series. For both cell lines, the cancer signal quantified by MBS (detection rate) was detected at dilutions as low as 1/10,000 (two-tailed t-test, P<0.01, Figure 5E).
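The dependence of sensitivity on marker number, depth and ctDNA fraction in the numerical simulation can be illustrated with a deliberately simplified binomial model. The sketch assumes that every read at every marker is an independent draw that is tumor-derived and methylated with probability equal to the ctDNA fraction, that there is no background noise, and that one such read is enough for detection. These are simplifying assumptions for illustration, not the simulation settings of the study, and the function names are hypothetical.

```python
import math

def detection_probability(ctdna_fraction, n_markers, depth):
    """P(at least one tumor-derived read) for X ~ Binomial(n_markers * depth, fraction)."""
    return 1.0 - (1.0 - ctdna_fraction) ** (n_markers * depth)

def markers_needed(ctdna_fraction, depth, target=0.95):
    """Smallest marker count whose detection probability reaches `target`."""
    per_marker_log_miss = depth * math.log1p(-ctdna_fraction)
    return math.ceil(math.log(1.0 - target) / per_marker_log_miss)
```

Under these assumptions, at 1,000X depth about 30 markers are needed for 95% detection at a ctDNA fraction of 1/10,000 and about 300 at 1/100,000, which mirrors the sharp growth in panel size noted above.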
Finally, the LoD of MERMAID was compared with those of ddPCR and ultra-deep mutation sequencing with unique molecular identifiers (HS-UMI), two methods that excel at detecting variants at very low frequencies. For each diluted sample, approximately 9,000 copies of the human haploid genome were used to mimic the average amount of cfDNA in 10 ml of blood. For predefined hotspot mutations, EGFR p.G719S was identified by ddPCR and HS-UMI at an AF of 0.03-0.11% in all 1/1,000 dilutions (Figure 5F, Figure 13E, Table 4). Detection of the oncogenic EML4-ALK fusion in the 1/1,000 dilution was confirmed by ddPCR at an AF of 0.03% but failed with HS-UMI (Figure 5F, Figure 13F, Table 4). Subject to read depth and background noise, including mutations of uncertain functional significance did not extend the maximum dilution ratio beyond 1/1,000 (Table 4). In summary, under the same conditions MERMAID showed at least 10-fold greater detection power than mutation analysis.
Effectiveness of the MERMAID methylation sequencing method of this application
Methylation sequencing has attracted great interest because of its potential to improve current ctDNA detection. This application provides MERMAID as a new epigenetic analysis method characterized by well-preserved molecular diversity, strong noise suppression and robust high-dimensional modeling. Beyond these properties, MERMAID can be particularly useful for blood-based applications: (i) A fraction of cfDNA may be single-stranded, so this ssDNA-compatible method makes maximal use of limited starting material and increases the chance of detecting rare ctDNA. (ii) The capture panel was designed with an excess of long RNA probes (>100 nucleotides) complementary to various methylation patterns. Compared with amplicon-based targeting methods (~20 nucleotides), this strategy is more tolerant of sequence-related bias and polymorphisms. (iii) MERMAID requires no prior knowledge for the analysis (e.g., biopsied tissue) and therefore offers a solution for patients without surgically resected samples. Although the method has been validated only in LC, it can be customized for other cancer types (e.g., CRC) or body fluids (e.g., urine). It can also be extended to address basic questions such as tumor heterogeneity, or applied to other clinical scenarios such as evaluating treatment effect.
A drawback of introducing synthetic sequences to improve adapter tagging efficiency is that the native ends of the DNA template are not preserved, which can result in incomplete duplicate removal. The computational strategy adopted by MERMAID can, at least in part, overcome this problem. In addition, MERMAID can use bias-handling methods commonly used in the art to address the attendant risk of C->T/G->A artifacts arising from the conversion of cytosine to uracil under oxidative stress. MERMAID can also add tissue-specific markers to the target panel for multi-cancer classification.
Supplementary tables
Table 1-1 Oligonucleotide sequences used in the method of this application
Table 1-2 Oligonucleotide sequences used in the method of this application + UMI correction
Table 1-3 Oligonucleotide sequences used for ligation efficiency measurement (ddPCR)
Table 2-1 Whole-methylome sequencing metrics of ELSA, SWT and NEB using 500 pg of E. coli DNA (depth titration)
Table 2-2 Method of this application and method of this application + UMI correction (data simulation)
Table 2-3 Method of this application and method of this application + UMI correction (experiment)
Table 2-4 EM-seq and the method of this application
Note 1: Sequencing data were downloaded from the NCBI Sequence Read Archive (SRA) under accessions PRJNA591788 and PRJNA534206. Note 2: To compare the methods objectively, all data were downsampled to ~100M reads.
Table 2-5 Quality control metrics of the method of this application using 2 to 30 ng of human cfDNA input
Table 2-6 Quality control metrics of the method of this application using 2 to 30 ng of human WBC input
Table 4-1 Detection of hotspot mutations by ddPCR using titrated SW48 and H2228 cancer cells (biological replicates)
Table 4-2 Detection of hotspot mutations by HS-UMI using titrated SW48 and H2228 cancer cells (biological replicates)
Table 4-3 Detection of non-hotspot mutations by HS-UMI using titrated SW48 and H2228 cancer cells (biological replicates)
Table 4-4 QC metrics of HS-UMI in spike-in experiments using cancer cell lines (NCI-H2228, SW48)
Table 5-1 "Specifier" tissue classification (n=157)
Table 5-2 "Specifier" plasma classification (n=2,473)
Table 6 Multivariate analysis of ctDNA detection by the method of this application
Table 7 Summary of the Group 2 "quadruple comparison"
Table 7-1 Summary of ctDNA detection by HS-UMI and ddPCR in the "Group 2 - quadruple comparison"
Table 7-2 Summary of ctDNA mutations detected by ddPCR in the "Group 2 - quadruple comparison"
Table 7-3 ddPCR assays used in the "Group 2 - quadruple comparison"
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Many variations of the embodiments presently recited in this application will be apparent to those of ordinary skill in the art and remain within the scope of the appended claims and their equivalents.
Claims (46)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2021098117 | 2021-06-03 | ||
| CNPCT/CN2021/098117 | 2021-06-03 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022253288A1 (en) | 2022-12-08 |
Family
ID=82960588
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/096730 (Ceased), WO2022253288A1 (en) | 2021-06-03 | 2022-06-02 | Methylation sequencing method and device |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN114974417B (en) |
| WO (1) | WO2022253288A1 (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116153417A (en) * | 2023-04-18 | 2023-05-23 | 珠海圣美生物诊断技术有限公司 | Methylation characteristic screening method and device |
| CN117423388A (en) * | 2023-12-19 | 2024-01-19 | 北京求臻医疗器械有限公司 | Methylation-level-based multi-cancer detection system and electronic equipment |
| WO2024212820A1 (en) * | 2023-04-10 | 2024-10-17 | 上海交通大学医学院附属新华医院 | Method for quantitatively detecting dna modification from nanopore sequencing data |
| CN119560014A (en) * | 2025-01-22 | 2025-03-04 | 浙江高美基因科技有限公司 | A method for identifying interacting DNA fragments based on DNA methylation correlation coefficient |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116153418B (en) * | 2023-04-18 | 2023-07-18 | 臻和(北京)生物科技有限公司 | Method, apparatus, device and storage medium for correcting batch effect of genome-wide methylation sequencing data |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102796808A (en) * | 2011-05-23 | 2012-11-28 | 深圳华大基因科技有限公司 | Methylation high-flux detection method |
| US20140113286A1 (en) * | 2010-12-21 | 2014-04-24 | Sloan-Kettering Institute For Cancer Research | Epigenomic Markers of Cancer Metastasis |
| CN109415771A (en) * | 2016-07-08 | 2019-03-01 | 哈鲁曼有限公司 | The determination method of colorectal cancer initiation potential |
| US20190309372A1 (en) * | 2016-07-06 | 2019-10-10 | Case Western Reserve University | Methods and compositions for detecting esophageal neoplasias and/or metaplasias in the esophagus |
| CN111095422A (en) * | 2017-06-19 | 2020-05-01 | 琼格拉有限责任公司 | Interpretation of genetic and genomic variants through a comprehensive computational and experimental deep mutation learning framework |
| CN112375822A (en) * | 2020-06-01 | 2021-02-19 | 广州市基准医疗有限责任公司 | Methylation biomarker for detecting breast cancer and application thereof |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2992632B1 (en) * | 1998-08-31 | 1999-12-20 | 工業技術院長 | Method for sequentially cutting out DNA fragments and method for analyzing DNA using the same |
| CN103483441A (en) * | 2012-06-10 | 2014-01-01 | 复旦大学 | Rhesus monkey NY-ESO-1 protein, coding gene, and applications thereof |
| WO2017212428A1 (en) * | 2016-06-07 | 2017-12-14 | The Regents Of The University Of California | Cell-free dna methylation patterns for disease and condition analysis |
| CN106701913A (en) * | 2016-10-28 | 2017-05-24 | 华中科技大学 | Method for detecting methylation level of Sipa1 gene promoter |
| ES2945191T3 (en) * | 2018-05-03 | 2023-06-29 | Becton Dickinson Co | High-throughput multi-omics sample analysis |
| NZ773619A (en) * | 2019-05-31 | 2025-07-25 | Freenome Holdings Inc | Methods and systems for high-depth sequencing of methylated nucleic acid |
| CN112176419B (en) * | 2019-10-16 | 2022-03-22 | 中国医学科学院肿瘤医院 | Method for detecting variation and methylation of tumor specific genes in ctDNA |
2022
- 2022-06-02 WO PCT/CN2022/096730 patent/WO2022253288A1/en not_active Ceased
- 2022-06-02 CN CN202210629330.1A patent/CN114974417B/en active Active
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140113286A1 (en) * | 2010-12-21 | 2014-04-24 | Sloan-Kettering Institute For Cancer Research | Epigenomic Markers of Cancer Metastasis |
| CN102796808A (en) * | 2011-05-23 | 2012-11-28 | 深圳华大基因科技有限公司 | Methylation high-flux detection method |
| US20190309372A1 (en) * | 2016-07-06 | 2019-10-10 | Case Western Reserve University | Methods and compositions for detecting esophageal neoplasias and/or metaplasias in the esophagus |
| CN109415771A (en) * | 2016-07-08 | 2019-03-01 | 哈鲁曼有限公司 | The determination method of colorectal cancer initiation potential |
| CN111095422A (en) * | 2017-06-19 | 2020-05-01 | 琼格拉有限责任公司 | Interpretation of genetic and genomic variants through a comprehensive computational and experimental deep mutation learning framework |
| CN112375822A (en) * | 2020-06-01 | 2021-02-19 | 广州市基准医疗有限责任公司 | Methylation biomarker for detecting breast cancer and application thereof |
Non-Patent Citations (3)
| Title |
|---|
| KUN SUN, PEIYONG JIANG, K. C. ALLEN CHAN, JOHN WONG, YVONNE K. Y. CHENG, RAYMOND H. S. LIANG, WAI-KONG CHAN, EDMOND S. K. MA, STEP: "Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, NATIONAL ACADEMY OF SCIENCES, vol. 112, no. 40, 6 October 2015 (2015-10-06), pages E5503 - E5512, XP055373988, ISSN: 0027-8424, DOI: 10.1073/pnas.1508736112 * |
| LIANG NAIXIN; LI BINGSI; JIA ZIQI; WANG CHENYANG; WU PANCHENG; ZHENG TAO; WANG YANYU; QIU FUJUN; WU YIJUN; SU JING; XU JIAYUE; XU : "Ultrasensitive detection of circulating tumour DNA via deep methylation sequencing aided by machine learning", NATURE BIOMEDICAL ENGINEERING, NATURE PUBLISHING GROUP UK, LONDON, vol. 5, no. 6, 1 June 2021 (2021-06-01), London , pages 586 - 599, XP037483444, DOI: 10.1038/s41551-021-00746-5 * |
| MARTINS JADE, CZAMARA DARINA, SAUER SUSANN, REX-HAFFNER MONIKA, DITTRICH KATJA, DÖRR PEGGY, DE PUNDER KARIN, OVERFELD JUDITH, KNOP: "Childhood adversity correlates with stable changes in DNA methylation trajectories in children and converges with epigenetic signatures of prenatal stress", NEUROBIOLOGY OF STRESS, vol. 15, 1 November 2021 (2021-11-01), pages 100336, XP093009661, ISSN: 2352-2895, DOI: 10.1016/j.ynstr.2021.100336 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024212820A1 (en) * | 2023-04-10 | 2024-10-17 | 上海交通大学医学院附属新华医院 | Method for quantitatively detecting dna modification from nanopore sequencing data |
| CN116153417A (en) * | 2023-04-18 | 2023-05-23 | 珠海圣美生物诊断技术有限公司 | Methylation characteristic screening method and device |
| CN117423388A (en) * | 2023-12-19 | 2024-01-19 | 北京求臻医疗器械有限公司 | Methylation-level-based multi-cancer detection system and electronic equipment |
| CN117423388B (en) * | 2023-12-19 | 2024-03-22 | 北京求臻医疗器械有限公司 | Methylation-level-based multi-cancer detection system and electronic equipment |
| CN119560014A (en) * | 2025-01-22 | 2025-03-04 | 浙江高美基因科技有限公司 | A method for identifying interacting DNA fragments based on DNA methylation correlation coefficient |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114974417B (en) | 2025-11-14 |
| CN114974417A (en) | 2022-08-30 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN114974417B (en) | A methylation sequencing method and apparatus | |
| CN105518151B (en) | Identification and use of circulating nucleic acid tumor markers | |
| CN112941180A (en) | Group of lung cancer DNA methylation molecular markers and application thereof in preparation of lung cancer early diagnosis kit | |
| TW202124728A (en) | Determination of base modifications of nucleic acids | |
| KR20230017169A (en) | Method and system for colorectal cancer detection through nucleic acid methylation analysis | |
| CN111094590A (en) | Cancer detection and classification using methylation component analysis | |
| JP2023526252A (en) | Detection of homologous recombination repair defects | |
| JP7665659B2 (en) | Multimodal analysis of circulating tumor nucleic acid molecules | |
| CN113748467B (en) | Allele frequency-based loss-of-function computational model | |
| WO2021016441A1 (en) | Systems and methods for determining tumor fraction | |
| EP4118653B1 (en) | Methods for classifying genetic mutations detected in cell-free nucleic acids as tumor or non-tumor origin | |
| WO2020061380A9 (en) | Cell-free dna hydroxymethylation profiles in the evaluation of pancreatic lesions | |
| JP2023029945A (en) | Epigenetic profiling of cancer | |
| JP2024056984A (en) | Methods, compositions and systems for calibrating epigenetic compartment assays | |
| CN112779338A (en) | Gene marker for esophageal cancer prognosis evaluation | |
| CN111028888B (en) | A method for detecting whole genome copy number variation and its application | |
| US20200232010A1 (en) | Methods, compositions, and systems for improving recovery of nucleic acid molecules | |
| KR20250019610A (en) | Molecular counting of methylated cell-free DNA for treatment monitoring | |
| CN114752672B (en) | Detection panels, kits and applications for prognostic assessment of follicular lymphoma based on circulating cell-free DNA mutations | |
| KR20240046525A (en) | Compositions and methods associated with TET-assisted pyridine borane sequencing for cell-free DNA | |
| CN112970068A (en) | Method and system for detecting contamination between samples | |
| AU2023226165A1 (en) | Probe sets for a liquid biopsy assay | |
| WO2022262831A1 (en) | Substance and method for tumor assessment | |
| HK40073428A (en) | A methylation sequencing method and device | |
| Doebley | Predicting cancer subtypes from nucleosome profiling of cell-free DNA |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22815330; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 22815330; Country of ref document: EP; Kind code of ref document: A1 |