[go: up one dir, main page]

WO2013097061A1 - Bs and rrbs sequencing-based bioinformatics analysis method and device - Google Patents

Bs and rrbs sequencing-based bioinformatics analysis method and device Download PDF

Info

Publication number
WO2013097061A1
WO2013097061A1 PCT/CN2011/002243 CN2011002243W WO2013097061A1 WO 2013097061 A1 WO2013097061 A1 WO 2013097061A1 CN 2011002243 W CN2011002243 W CN 2011002243W WO 2013097061 A1 WO2013097061 A1 WO 2013097061A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
thiolation
sequence
site
thiolated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2011/002243
Other languages
French (fr)
Chinese (zh)
Other versions
WO2013097061A8 (en
Inventor
吴红龙
吴博欣
黄树嘉
夏渝东
李英睿
汪建
王俊
杨焕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd filed Critical BGI Shenzhen Co Ltd
Priority to PCT/CN2011/002243 priority Critical patent/WO2013097061A1/en
Publication of WO2013097061A1 publication Critical patent/WO2013097061A1/en
Publication of WO2013097061A8 publication Critical patent/WO2013097061A8/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Definitions

  • the invention belongs to the field of bioinformatics, and performs bioinformatics analysis on the sequencing results of two important DNA thiolation detection methods in epigenetics. Background technique
  • Bisulfite-sequencing is a powerful tool for DNA thiolation research.
  • the unthiolated C thiol group in the genome can be converted to U by bisulfite treatment, and further converted to T by PCR to distinguish it from the thiolated C base.
  • Combining bisulfite treatment with high-throughput sequencing technology it is possible to map DNA methylation profiles with single-subtraction resolution.
  • the analysis of high-accuracy thiolation modification patterns of specific species is bound to be a milestone in the study of epigenetics, and lays a foundation for cell differentiation, tissue development, and animal and plant breeding research.
  • the reduced representative ') ⁇ reduced representation bisulfite sequencing is a new method based on BS.
  • the cost of BS sequencing is high.
  • RRBS selects parts of the whole genome by enzymatic cleavage technology and then performs BS sequencing.
  • the treatment of DNA by heavy tartaric acid converts ungerminated C in the genome to U and is further converted to T by PCR to distinguish it from the thiolated C base. This method is capable of plotting a single-base resolution DNA thiolated map with a cost advantage.
  • the invention relates to a method for analyzing genomic thiolation by heavy bisulfite hydrogenate sequencing or reduced representative bisulfite sequencing, the method comprising the steps of:
  • the number of sequencing sequences that are thiolated at the corresponding position by the first or second sequencing sequence covering the C site and the number of sequencing sequences that are not thiolated is determined according to the following inequality, ie, the minimum k value satisfying the following formula: ⁇ C *: Q)p k (l - p ⁇ k ⁇ 0.01 * mC; wherein
  • mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the C site is thiolated at the corresponding position
  • nmC is the number of sequencing sequences that cover the C-site of the first sequencing sequence or the second sequencing sequence that is not thiolized at the corresponding position
  • n is the total number of sequencing sequences for this C site
  • p is the error rate, for example, the above steps 1) - 2) are carried out using unmodified control DNA such as lamda DNA, and the p value is determined according to the conversion of C to T 1 ⁇ 2, p is usually 0.001-0.01, if covering The first sequencing sequence or the second sequencing sequence of the C site is thiolated at the corresponding position The number of sequencing sequences is greater than the number of minimal desired sequencing sequences at the site, which is the site of thiolation, preferably forming a result file comprising the following information:
  • the second column b the position of the C site
  • the fourth column c the type of C site
  • the fifth column d specific bases
  • the eighth column g the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position
  • the invention relates to a method of bioinformatics analysis by detecting a method of DNA thiolation selected from:
  • the C site can be divided into three types, namely CG, CHG, and CHH, where H represents a non-G base. From the result file, it can be obtained whether the site is a methylation site, and the site supporting the thiolated reads number greater than the expected value is the thiolation site. The ratio of mC sites to total sites under different coverage depths (the number of times this site is covered by reads) is counted. At the same time, the coverage of the loci under different coverage depths on the genotype is also counted, and the coverage is the proportion of the loci on the whole genome.
  • the level of thiolation of the thiolation site which can be performed by: To understand the overall characteristics of the genomic thiolation map and the average thiolation level of the genome, count each thiolated C base The average thiolation level of the base is calculated as:
  • the number of reads that support merging / (the number of reads that support merging + the number of reads that do not support methylation) ⁇ 100.
  • the statistics reflect the effective coverage of the reads of the thiolated information.
  • the calculation of the effective coverage is based on three aspects: chromosome, gene interval and gene function elements. After the reads with ambiguous ambiguity are excluded, based on the support-supported reads and the reads data that does not support priming, The effective coverage calculation is performed, and the statistical result is obtained.
  • the thiolation levels of C-magnetic groups (CG, CHG, and CHH) for each distribution type in each region are calculated by chromosome, gene interval, and gene functional elements.
  • the distribution of bases near the thiolated C at non-CG sites was analyzed, and the probability of occurrence of different thiolation patterns (CHG and CHH) was counted.
  • Whole genome DNA thiolation map which can be performed by: Delineating the distribution of thiolated C bases from the chromosomal level. The density of the thiolated C bases on the chromosome was counted by a 10 kb window. At the same time, the density distributions of different types of thiolated C tt (CG, CHG, and CHH) in each lOOkb window were counted. Different genomic regions have different biological functions, and the thiolation distribution is preferably expressed in the form of a heat map to help to further understand the thiolated graphical features of these regions. The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different regions of transcriptional elements, here! ⁇ Statistics on the level of thiolation of different regions of the transcriptional element.
  • DMR analysis which can be performed by: Determining the thiolation difference region by the thiolation level of the two samples, as follows: First, identify a CG site as a seed At the locus, two samples were obtained to support the thiolation and the unsupported methylation values at the locus, and then the chi-squared level of the two samples was tested by the chi-square test to see whether the difference was significant or not.
  • the horizontal threshold p is generally 0.05, preferably 0.01, more preferably 0.001, most preferably 0.0001; if the difference is significant, a CG site is extended backward, and two samples are obtained to support thiolation and unsupporting at these two sites, respectively.
  • the fragments were obtained by restriction enzyme digestion and then subjected to BS sequencing.
  • the theoretical fragment of the restriction enzyme digestion (40-220 bp) was counted in the chromosomal spots and their proportions in each chromosome and genome, and then counted. The coverage of the actual data obtained by the two samples at the theoretical site.
  • the invention also provides a device for bioinformatics analysis based on heavy bisulfite hydrogenation sequencing or reduced representative bisulfite sequencing, the device comprising bisulfite sequencing and/or reduced representation a bisulfite sequencing module for performing bisulfite sequencing or reduced representative bisulfite sequencing of the sample, and preferably filtering the sequencing sequence to remove unqualified sequences; data output analysis module, Statistics are made on the output of the sequencing data, including the conversion rate, the length of the sequencing fragment, the number of sequencing fragments, and the total amount of the sample; the sequencing fragment comparison module is used to compare the fragments obtained by sequencing to the reference.
  • the statistical content includes the original number of original fragments, the number of comparison fragments, the amount of comparison data, the average comparison ratio, the average coverage depth; the coverage analysis module of the thiolation site Used to calculate the ratio of the number of C-sites covered by the thiolated sequencing sequence to the total number of sites at different depths of coverage; Point methylation level analysis module for counting the average thiolation level of each thiolated C base; genome-wide thiolation data distribution trend analysis module, used to obtain the whole genome thiolation distribution trend, statistics The number and distribution ratio of CG, CHG and CHH in the thiolated C base, where H stands for non-G 3 ⁇ 4; a genome-wide DNA thiolation map analysis module for describing the distribution of thiolated C bases from the chromosomal level Situation; a differential thiolation region analysis module for determining the thiolation difference region by the thiolation level of the two samples; the insert length statistical analysis module, comparing the length of
  • Figure 1 Percentage of loci at different depths of coverage based on BS sequencing.
  • Figure 3 Distribution ratios of different types of thiolated C-based groups based on BS sequencing.
  • Figure 4. The thiolation level distribution of thiolated C based on BS sequencing.
  • Figure 5. Quaternized C-attached features of non-CG sites based on BS sequencing.
  • Figure 6 Illustratively shows the thiolated C density distribution on chromosomes based on BS sequencing.
  • Figure 7 C 1 ⁇ 2 ⁇ -based distribution of different genomic regions based on BS sequencing, and the corresponding CpG density.
  • Figure 8 The horizontal distribution of thiolation in different functional component regions of the entire base group based on BS sequencing.
  • Figure 9. Insert length statistics based on RRBS sequencing.
  • Figure 10 Percentage of loci at different depths of coverage based on RRBS sequencing.
  • bisulfite sequencing or reduced representative bisulfite sequencing can be performed by high throughput sequencing techniques, such as Illumina GA sequencing technology, or other existing high throughput sequencing techniques. Removal of unacceptable sequences can be performed by those skilled in the art based on the actual conditions of sequencing, as needed.
  • the unqualified sequence includes: The number of bases whose sequencing quality is lower than the preset threshold exceeds 50% of the entire sequence and is considered to be a failed sequence.
  • the low-quality threshold is determined by the specific sequencing technology and sequencing environment; the number of subtractions (such as N in Illumina GA sequencing results) whose sequence is uncertain in sequencing is more than 10% of the number of bases in the entire sequence is considered to be inconsistent.
  • Each sequence; in addition to the sample linker sequence is aligned with other exogenous sequences introduced by the experiment, such as various linker sequences. If there is a foreign sequence in the sequence, it is considered to be a non-conforming sequence.
  • the short sequence mapping program may be a block program such as SOAP (available from http://soap.genomics.org.cn/).
  • the screening results of the thiolation can be selected by, for example, unique comparison, comparison of comparison lengths, comparison of mismatches, comparison of the number of comparisons, etc., to screen out the alignment results of each sequence.
  • the selection criteria should be based on the selected software and sequence background.
  • reads refers to a sequence of fragments obtained from a sequencer, or a read.
  • the number of minimum expected sequencing sequences of the site thiolation is determined according to the following inequality, ie, the minimum k value satisfying the following formula: leg C * Q) p k l - p) n ⁇ k ⁇ 0.01 * mC; Its towel,
  • k is the expected value to support the number of singulated reads, which reflects the minimum support at a certain error rate ⁇ Number of base sequencing sequences
  • mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the site is thiolated at the corresponding position
  • nmC is the number of sequencing sequences that cover the first sequencing sequence or the second sequencing sequence of the site that are not deuterated at the corresponding position
  • n is the total number of sequencing sequences at this site
  • p is the error rate, for example, the above steps 1) - 2) are carried out using unmodified control DNA such as lamda DNA, and the p value is determined according to the conversion rate of C to T tt, p is usually 0.001-0.01, The number of minimum expected sequencing sequences that determine the thiolation of the site is determined based on the binomial distribution.
  • the present invention will be described in detail below with reference to the accompanying drawings and embodiments. invention.
  • two human blood sample DNAs numbered NB and 103 were subjected to bisulfite sequencing or reduced representative bisulfite sequencing at Shenzhen Huada Gene Research Institute, wherein the sequencing technology was Illumina GA sequencing technology. . Sequencing yielded 657.2 trillion and 660.5 trillion sequences, respectively, and then the sequencing sequence was filtered to remove the unqualified sequences, yielding 575.46 megabytes and 576.12 megabytes of qualified sequences, respectively.
  • the unqualified sequence includes: The number of sequencing quality below the preset threshold (66 in this example) exceeds 50% of the entire sequence minus the number of bases. Bases with undefined sequencing results in the sequence (in this example, N in Illumina GA sequencing results) are 10% of the entire number of sequences; exogenous sequences introduced by other experiments except sample linker sequences ; There is an exogenous sequence in the sequence.
  • the reference genomic sequence of the species used in this example is the human genome hgl8, from the database UCSC (http://genome.ucsc.edu/).
  • the filtered sequencing sequence and the reference sequence are preprocessed as follows: i) a sequence obtained by converting all C (cytosine) bases in the respective sequencing sequences into T (thymidine) bases is defined as fql; all G (guanine) bases in the complementary sequence of the respective sequencing sequences The sequence obtained by conversion to A (adenine) tt is defined as fq2, and the schematic is as follows: fql
  • C m represents methylated C (cytosine)
  • the expected value k of the number of thiolated reads supported by each site on the reference genome sequence is obtained by the above formula. For a certain site, if the number of sequencing sequences supporting thiolation is greater than the expected value of the site, the site is a thiol group.
  • the resulting site preferably forms a result file *.cout including the following information:
  • the second column b the position of the C site
  • the fourth column c the type of C site
  • the eighth column g the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position
  • the NB methylation CG site of sample NB was 36.47 megabytes, accounting for 96.88% of the total CG site; the thiolated CHG site was 0.19 megabytes, accounting for 0.51% of the total CHG site; the thiolated CHH site was 0.98 megabytes. It accounts for 2.62% of the total CHH locus.
  • Sample 103 thiolated CG site was 35.01 megabytes, accounting for 97.56% of the total CG site; thiolated CHG site was 0.15 megabytes, accounting for 0.43% of the total CHG site; thiolated CHH site was 0.72 megabytes, Total 2.02% of the CHH locus.
  • the sequenced fragments are aligned to the genome of the species.
  • Step 2 of Data Preprocessing Determination of the position of the sequence on the genome.
  • the comparison results of the two human samples are counted, and the statistics include Raw reads, Raw bases, Mapped reads, and comparisons t. See Figure 2 for the results of the Mapped bases, Average mapped rate, and Average coverage depth (X), the average number of times the entire genome is covered by reads.
  • C 1 ⁇ 2 sites can be divided into three types, namely CG, CHG, CHH, where H represents a non-G base. From the *.cout file we can get whether the site is a thiolated site, and the site that supports the thiolized reads greater than the expected value is the thiolation site. The ratio of mC locus to the total locus under different coverage depths (the number of times the locus is covered by reads) is counted. The statistical result is shown in Figure 1. The horizontal axis represents greater than or equal to the coverage depth, and the vertical axis represents the corresponding depth. proportion.
  • the genomic coverage of the sites under different coverage depths is also counted, and the coverage is the proportion of the number of sites on the full genomics.
  • the statistical results are shown in Fig. 2, the horizontal axis ⁇ the depth of coverage, and the vertical axis represents the coverage of the site under the corresponding depth.
  • the average thiolation level of each methylated C base is calculated by:
  • the statistics reflect the reads of the priming information by 3 ⁇ 4 ⁇ degrees, and the coverage calculation is based on three aspects: chromosomes, gene intervals, and gene functional elements.
  • the effective coverage calculation is performed based on the read-supported reads and the reads data that does not support the support, and the statistical results are obtained, as shown in Table 4.
  • the distribution of thiolated C-velocines is described from the chromosomal level.
  • the distribution of density on the chromosome was counted by a 10 kb window; at the same time, the density distribution of different types of thiolated C minus groups (CG, CHG, and CHH) in each lOOkb window was counted, and then plotted by drawing software. Figure, the results are shown in Figure 6.
  • each graph represents the heat map distribution characteristics of a genomic region, where n represents the number of CpGs involved in the analysis (each CpG has a coverage depth greater than 4X on a single strand).
  • the horizontal axis represents the number of CpGs per 200 bp window (ie, CpG density), and the vertical axis represents the C 1 ⁇ 2 average thiolation level of CpG.
  • the black thin line in the figure indicates the median of the average thiolation level of c bases in this window at a specific CpG density. From the shallow to the deep, the B region indicates the number of CpGs at a specific thiolation level and CpG density.
  • the upper A area represents the distribution of CpG density, mapped to the horizontal axis, and the right C area represents the distribution of the thiolation level and is mapped to the vertical axis.
  • Fig. 8 The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different regions of transcriptional elements, here! ⁇ Statistics on the level of thiolation of different transcriptional element regions. The distribution of DNA thiolation levels in different functional regions helps to understand the role of DNA thiolation modifications in different regions from the genome-wide level. The statistical results are shown in Fig. 8: the horizontal axis represents 7 different transcription element regions, and the vertical axis represents the average thiolation level of each point in a specific region, and each red dot is a bin (person For the average thiolization level of a defined length interval, the curve represents the average 5-bin shift, and the dashed line between a and b is the TSS (transcription start site).
  • the methylation difference region was determined by the thiolation level of the two samples as follows: First, a CG site was identified as a seed site, and two samples were obtained to support thiolation and thiolization at the site, respectively. Four values were used to check whether the thiolation level of the two samples was significant at the site by a chi-square test; if the difference was significant, a CG site was extended backward, and two samples were supported at these two sites. ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ Area. This site is then used as a seed site to cycle the above process, resulting in a differential thiolated region.
  • Example 2 Samples of biological information analysis methods based on RRBS sequencing: d liquid sample DNA
  • the liquid sample DNA numbered YH is subjected to heavy subhydro acid hydrogenation sequencing or reduced representative heavy acid sulfite sequencing in the Shenzhen Hua ⁇ ⁇ Institute, wherein the sequencing technology is Illumina GA sequencing technology. .
  • Sequencing yielded 94.5 mega-sequences, and the sequencing sequence was then filtered to remove unqualified sequences, resulting in 83.62 mega-qualified sequences.
  • the unqualified sequence includes: The number of bases whose sequencing quality is lower than the preset threshold (66 in this example) exceeds 50% of the number of bases minus the entire sequence.
  • the number of subtractions in the present example, the number of N in the Illumina GA sequencing results
  • the number of subtractions in the present example, the number of N in the Illumina GA sequencing results
  • the exogenous sequence introduced by other experiments;
  • the presence of an exogenous sequence in the sequence is considered a non-conforming sequence.
  • the reference genomic sequence of the species used in this embodiment is the human genome hgl8, from the database UCSC.
  • the filtered sequencing sequence and the species reference sequence are pretreated as follows: i) the sequence obtained by converting all C (cytosine) tt in the respective sequencing sequences into T (thymine) bases is defined as Fql; The sequence obtained by converting all G (guanine) bases in the complementary sequence of each sequencing sequence into A (adenine) ⁇ is defined as fq2, and the schematic is as follows: fql ⁇ bond ⁇ (3 ⁇ 4' ⁇ 3 ⁇ 43 ⁇ 4$ TT 3 ⁇ 43 ⁇ 4S Beach STAGI ⁇ ⁇ T TeC3 ⁇ 4 CCTC Job TO ⁇ "GCT : Cft3 ⁇ 43 ⁇ 4TCG: The simplified surface is shown in the figure, C m represents the thiolated C (cytosine)
  • RRBS sequencing is performed by BS sequencing of the digested fragments, we need to filter the obtained effective sequences by cleavage sites, and then obtain different cleavage points according to different alcohols.
  • Mspl is used to identify CCGG sites. Enzyme. Both ends of the effective sequence are judged whether there is a restriction site for Mspl. If one of the ends has the alcohol cleavage site, the effective sequence is considered to be a normal restriction fragment, which is used for subsequent analysis.
  • the number of sequencing sequences supporting the thiolation at this site, mC, and the number of sequencing sequences that do not support thiolation, are counted. According to the binomial distribution, the following inequality is used to determine whether the site is a sulfhydryl group.
  • the expected value k of the number of thiolated reads supported by each site on the reference genome sequence is obtained by the above formula. For a certain site, if the number of sequencing sequences supporting thiolation is greater than the expected value of the site, the site is a thiol group.
  • the resulting site preferably forms a result file *.cout including the following information: Chr-a b +/- cdefgh First column Chr-a: chromosome number,
  • the second column b the position of the C site
  • the fourth column c the type of C site
  • the fifth column d the specific ⁇
  • the YH methylation CG site of the sample was 1.84 megabytes, accounting for 52.93% of the covered CG sites; the thiolated CHG site was 0.013 megagrams, accounting for 0.49% of the total CHG sites; the thiolated CHH site was 0.036. Mega, accounting for 2.55% of the total CHH locus.
  • the sequencing data of the DNA of the two jk liquid samples were generated by counting the libraries of different enzyme fragments, and the statistics included the length of the sequencing fragments, the number of sequencing fragments, and the total amount of samples. See Table 7 for specific results.
  • sequenced fragments are aligned to the genome of the species, and the specific comparison method is described.
  • step 2 of the pretreatment the determination of the position of the sequence on the genome. After the data comparison is completed, the comparison results of the two human samples are counted, and the statistics include the Raw reads, the Raw Data, the Mapped reads, and the comparisons. See Table 8 for specific results of Mapped bases, Mapped rate, and Enzyme cutting rate.
  • the C site can be divided into three types, namely CG, CHG, CHH, where H stands for non-G. From the *.cout file we can get whether the site is a thiolated site, and the site that supports the thiolized reads greater than the expected value is the thiolation site. The ratio of the mC locus to the total locus under different coverage depths (the number of times the locus is covered by the reads) is counted. The statistical result is shown in Fig. 10. The horizontal axis represents greater than or equal to the coverage depth, and the vertical axis represents the locus at the corresponding depth. proportion.
  • the coverage of the sites in different coverage depths in each region of the genome is also counted, and the coverage is the proportion of the number of sites covered in each region of the genome.
  • the statistical results are shown in Fig. 11, the horizontal axis ⁇ the depth of coverage, and the vertical axis represents the coverage of the site at the corresponding depth.
  • the average coverage depth of CG loci in each genomic region was counted. See Table 9 for the results.
  • RRBS was digested to obtain fragments and then subjected to BS sequencing.
  • the theoretical fragment (40-220 bp) of the enzyme digestion was counted in the number of CG sites and their proportions in each chromosome and genome, and then the two samples were actually counted.
  • the coverage of the obtained data at the theoretical site is shown in Table 10 and Figure 12.
  • the thiolation difference region was determined by the thiolation level of the two samples as follows:
  • a CG locus is determined as a seed site, and two samples are supported at the site to support the thiolation and the four values that do not support thiolation, and then the chi-square level of the two samples is tested by the chi-square test. Whether the difference in the locus is significant; if the difference is significant, a CG locus is extended backward, and two samples are supported at these two loci to support the thiolation and the four values that do not support thiolation, and then the chi-square test is performed.
  • Such a push cycle when encountering a point where the difference is no longer significant, ⁇ more stop, the front significant area is the differential thiolated area. This site is then used as a seed site to cycle the above process to obtain a differential thiolated region.
  • the corresponding genes namely methylation differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner.
  • the RRBS is subjected to restriction enzyme digestion to obtain fragments and then subjected to BS sequencing, and the results include, but are not limited to, the above. RRBS can also be analyzed as appropriate for the content of bioinformatics analysis of BS sequencing. Example results:

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Provided in the present invention are a method and a device for analyzing methylation in a genome by bisulfite sequencing or reduced representation bisulfite sequencing. Also provided in the present invention is a bioinformatics analysis method based on bisulfite sequencing or reduced representation bisulfite sequencing. The method comprises: detecting DNA methylation, then performing the bioinformatics analysis, wherein the analysis is an analysis for one or more items selected from the following: data output, sequencing fragments alignment, the coverage situation of methylation sites, the methylation level of the methylation sites, the distribution trend of the whole genome methylation data, the whole genome DNA methylation map, the differentially methylated regions, the statistic for the length of insert fragments, and the coverage of CpG sites.

Description

基于 BS和 RRBS测序的生物信息分析方法和装置  Biological information analysis method and device based on BS and RRBS sequencing

技术领域 Technical field

本发明属于生物信息学领域,对表观遗传学中两种重要的 DNA曱基化检测 方法的测序结果进行生物信息学分析。 背景技术  The invention belongs to the field of bioinformatics, and performs bioinformatics analysis on the sequencing results of two important DNA thiolation detection methods in epigenetics. Background technique

重亚硫酸氢盐测序法( Bisulfite-sequencing, BS )是 DNA曱基化研究的一 项有力工具。 通过重亚硫酸盐处理 DNA能够将基因组中未曱基化的 C緘基转 化为 U, 并且经过 PCR.进一步转化为 T, 从而与曱基化的 C碱基区分开来。将 重亚硫酸盐处理与高通量测序技术的结合, 能够绘制单减基分辨率的 DNA甲 基化图谱。 特定物种的高精确度曱基化修饰模式的分析, 必将在表 因组学 研究中具有里程碑式的意义, 并且为细胞分化、 组织发育等^出机制研究以及 动植物育种研究奠定基础。  Bisulfite-sequencing (BS) is a powerful tool for DNA thiolation research. The unthiolated C thiol group in the genome can be converted to U by bisulfite treatment, and further converted to T by PCR to distinguish it from the thiolated C base. Combining bisulfite treatment with high-throughput sequencing technology, it is possible to map DNA methylation profiles with single-subtraction resolution. The analysis of high-accuracy thiolation modification patterns of specific species is bound to be a milestone in the study of epigenetics, and lays a foundation for cell differentiation, tissue development, and animal and plant breeding research.

减少的代表 ')·生重亚石克酸盐测序 ( reduced representation bisulfite sequencing, RRBS )是基于 BS的新方法。 BS测序成本高, RRBS在此基础上, 通过酶切 技术选取全基因组的部分区域再进行 BS测序。通过重亚石克酸盐处理 DNA能够 将基因组中未曱基化的 C 转化为 U, 并且经过 PCR进一步转化为 T, 从而 与曱基化的 C碱基区分开来。 该方法能够绘制单 基分辨率的 DNA曱基化图 谱, 而更具有成本优势。  The reduced representative ')·reduced representation bisulfite sequencing (RRBS) is a new method based on BS. The cost of BS sequencing is high. Based on this, RRBS selects parts of the whole genome by enzymatic cleavage technology and then performs BS sequencing. The treatment of DNA by heavy tartaric acid converts ungerminated C in the genome to U and is further converted to T by PCR to distinguish it from the thiolated C base. This method is capable of plotting a single-base resolution DNA thiolated map with a cost advantage.

本领域需要利用 BS法和 RRBS法这两种 DNA曱基化检测方法进行基因组 曱基化分析的方法, 并需要利用其测序结果进行生物信息学分析的方法。 发明内容  In this field, two methods of DNA thiolation detection, such as BS method and RRBS method, are needed for genomic thiolation analysis, and methods for bioinformatics analysis using the sequencing results are needed. Summary of the invention

在一个方面中, 本发明涉及一种通过重亚磅 υ酸氢盐测序或减少的代表性重 亚硫酸盐测序分析基因组曱基化的方法, 所述方法包括步骤:  In one aspect, the invention relates to a method for analyzing genomic thiolation by heavy bisulfite hydrogenate sequencing or reduced representative bisulfite sequencing, the method comprising the steps of:

1 )对样品进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序,并优选 过滤所述测序序列以去除不合格的序列;  1) subjecting the sample to bisulfite sequencing or reduced representative bisulfite sequencing, and preferably filtering the sequencing sequence to remove unqualified sequences;

2 )确定上述测序序列代表的曱基化位点, 步骤如下: i ) 将所述各个测序序列中所有 c (胞嘧啶) 转化为 T (胸腺嘧啶)碱 基得到的第一测序序列; 将所述各个测序序列互补序列中所有 G (鸟嘌呤)碱 基转化为 A (腺嘌呤)碱基得到的第二测序序列; 2) Determine the thiolation site represented by the above sequencing sequence, the steps are as follows: i) a first sequencing sequence obtained by converting all c (cytosine) in the respective sequencing sequences into T (thymidine) bases; converting all G (guanine) bases in the complementary sequence of the respective sequencing sequences into a second sequencing sequence obtained from the A (adenine) base;

ii )将相应物种的参考基因组序列中所有 C (胞嘧啶;) ½转化为 T (胸腺 嘧啶)减基生成的第一参考序列; 将相应物种的参考基因组序列中所有 G (鸟 嘌呤)碱基转化为 A (腺嘌呤)减基生成的第二参考序列;  Ii) converting all C (cytosine;) 1⁄2 in the reference genome sequence of the corresponding species into the first reference sequence generated by T (thymidine) subtraction; all G (guanine) bases in the reference genome sequence of the corresponding species Converting to a second reference sequence generated by A (adenine) subtraction;

iii )将各个第一测序序列和第二测序序列分别和第一参考序列和第二参考 序列进行比对, 经过比对得到各个第一测序序列和第二测序序列在第一参考序 列和第二参考序列上的位置, 选择在第一参考序列和第二参考序列上有唯一比 对的各个第一测序序列和第二测序序列;  Iii) aligning each of the first sequencing sequence and the second sequencing sequence with the first reference sequence and the second reference sequence, respectively, and obtaining each of the first sequencing sequence and the second sequencing sequence in the first reference sequence and the second Selecting, in the position on the sequence, each of the first sequencing sequence and the second sequencing sequence having unique alignments on the first reference sequence and the second reference sequence;

IV )利用上述选择的第一测序序列和第二测序序列比对的参考基因组序列 的序列信息确定第一测序序列和第二测序序列的曱基化信息, 步骤如下: 如果 在第一测序序列和比对的参考基因组序列的相应位置都是 C (胞嘧啶)碱基, 则表示在第一测序序列的该位置被曱基化; 如果在第二测序序列和比对的参考 基因组序列的相应位置都是 G (鸟嘌呤)碱基, 则表示在第二测序序列的该位 置被曱基化;  IV) determining the thiolation information of the first sequencing sequence and the second sequencing sequence by using the sequence information of the first sequencing sequence and the second sequencing sequence aligned by the second sequencing sequence selected above, as follows: if in the first sequencing sequence and The corresponding positions of the aligned reference genomic sequences are C (cytosine) bases, indicating that they are thiolated at this position of the first sequencing sequence; if at the corresponding position of the second sequencing sequence and the aligned reference genome sequence All of which are G (guanine) bases, indicating that they are thiolated at this position of the second sequencing sequence;

3 )对于参考基因组序列上每个 C位点, 统计覆盖该 C位点的第一测序序 列或第二测序序列在相应位置被曱基化的测序序列数目和未被曱基化的测序序 列数目, 根据以下不等式确定该 C位点曱基化的最小期望测序序列数目, 即满 足下式的最小 k值: 匪 C *: Q)pk(l - p ~k < 0.01 * mC; 其中, 3) For each C site on the reference genomic sequence, the number of sequencing sequences that are thiolated at the corresponding position by the first or second sequencing sequence covering the C site and the number of sequencing sequences that are not thiolated The number of minimum expected sequencing sequences for thiolation of the C site is determined according to the following inequality, ie, the minimum k value satisfying the following formula: 匪C *: Q)p k (l - p ~ k < 0.01 * mC; wherein

mC是覆盖该 C位点的第一测序序列或第二测序序列在相应位置被曱基化 的测序序列数目,  mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the C site is thiolated at the corresponding position,

nmC是覆盖该 C位点的第一测序序列或第二测序序列在相应位置未被曱基 化的测序序列数目,  nmC is the number of sequencing sequences that cover the C-site of the first sequencing sequence or the second sequencing sequence that is not thiolized at the corresponding position,

n为该 C位点的测序序列总数,  n is the total number of sequencing sequences for this C site,

p为错误率, 例如, 用 C碱基均为未修饰的对照 DNA例如 lamda DNA进 行上述步骤 1 ) -2), 根据 C到 T ½转化率确定 p值, p通常为 0.001-0.01, 如果覆盖该 C位点的第一测序序列或第二测序序列在相应位置被曱基化的 测序序列数目大于该位点的最小期望测序序列数目 , 该 C位点即为曱基化的位 点, 优选形成包括如下信息的结果文件: p is the error rate, for example, the above steps 1) - 2) are carried out using unmodified control DNA such as lamda DNA, and the p value is determined according to the conversion of C to T 1⁄2, p is usually 0.001-0.01, if covering The first sequencing sequence or the second sequencing sequence of the C site is thiolated at the corresponding position The number of sequencing sequences is greater than the number of minimal desired sequencing sequences at the site, which is the site of thiolation, preferably forming a result file comprising the following information:

Chr-a b +/- c d e f g h  Chr-a b +/- c d e f g h

第一列 Chr-a: 染色体号,  First column Chr-a: chromosome number,

第二列 b: C位点的位置,  The second column b: the position of the C site,

第三列 +/-: 正负链信息,  The third column +/-: positive and negative chain information,

第四列 c: C位点的类型,  The fourth column c: the type of C site,

第五列 d: 具体的碱基,  The fifth column d: specific bases,

第六列 e: 测序序列平均比对位置,  Column 6 e: average alignment position of the sequencing sequence,

第七列 f: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置被 甲基化的测序序列数目,  Column 7 f: the number of sequencing sequences that are methylated at the corresponding position by the first sequencing sequence or the second sequencing sequence covering the C site,

第八列 g: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置未 被曱基化的测序序列数目,  The eighth column g: the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position,

第九列 h: 所述 C位点曱基化的最小期望测序序列数目。  Ninth column h: Number of minimum expected sequencing sequences for thiolation of the C site.

在另一方面中, 本发明涉及一种通过检测 DNA 曱基化的方法进行生物信 息学分析的方法, 所述生物信息学分析选自于:  In another aspect, the invention relates to a method of bioinformatics analysis by detecting a method of DNA thiolation selected from:

1. 曱基化位点的覆盖情况, 所述分析可以通过以下方式进行:  1. Coverage of the thiolation site, the analysis can be performed in the following manner:

C 位点可以被分为三种类型, 分别是 CG、 CHG、 CHH, 其中 H代表 非 G碱基。 从结果文件可以得到位点是否为甲基化位点, 支持曱基化的 reads 数大于期望值的位点即为曱基化位点。 统计不同覆盖深度(该位点被 reads覆 盖的次数 )下 mC位点占总位点的比例。 同时还对不同覆盖深度下的位点在基 因组上的覆盖度进行统计, 覆盖度即为位点数在全基因组上的比例。  The C site can be divided into three types, namely CG, CHG, and CHH, where H represents a non-G base. From the result file, it can be obtained whether the site is a methylation site, and the site supporting the thiolated reads number greater than the expected value is the thiolation site. The ratio of mC sites to total sites under different coverage depths (the number of times this site is covered by reads) is counted. At the same time, the coverage of the loci under different coverage depths on the genotype is also counted, and the coverage is the proportion of the loci on the whole genome.

2. 曱基化位点的曱基化水平, 所述分析可以通过以下方式进行: 为了了解基因组曱基化图谱的总体特征和全基因组的平均曱基化水平, 统 计每个曱基化 C碱基的平均曱基化水平, 计算方式为:  2. The level of thiolation of the thiolation site, which can be performed by: To understand the overall characteristics of the genomic thiolation map and the average thiolation level of the genome, count each thiolated C base The average thiolation level of the base is calculated as:

支持曱基化的 reads数 /(支持曱基化的 reads数 +不支持甲基化的 reads 数) χ100。  The number of reads that support merging / (the number of reads that support merging + the number of reads that do not support methylation) χ100.

同时, 统计反映曱基化信息的 reads的有效覆盖度, 有效覆盖度的计算基 于三个方面进行: 染色体、 基因区间和基因功能元件。 将曱基化状态不明确的 reads排除之后,基于支持曱基化的 reads和不支持支持曱基化的 reads数据,进 行有效覆盖度计算, 得到统计结果。 At the same time, the statistics reflect the effective coverage of the reads of the thiolated information. The calculation of the effective coverage is based on three aspects: chromosome, gene interval and gene function elements. After the reads with ambiguous ambiguity are excluded, based on the support-supported reads and the reads data that does not support priming, The effective coverage calculation is performed, and the statistical result is obtained.

3. 全基因组曱基化数据分布趋势, 所述分析可以通过以下方式进行: 为了得到全基因组曱基化分布趋势, 统计曱基化 C 中 CG, CHG与 CHH的数目和分布比例, 即各类型曱基化的 C占总曱基化 C的百分比, 例如: mCHG所占比例 = mCHG数目 /mC的总数。同时,不同分布类型的 C ¾( CG、 CHG和 CHH ), 其曱基化水平在不同物种间, 甚至同一物种不同细胞类型间都 存在差异。 因此, 统计每种类型 ( CG、 CHG和 CHH ) 曱基化的 C碱基曱基化 水平的分布。 同样, 按照染色体、 基因区间和基因功能元件统计各区域内每种 分布类型的 C磁基(CG、 CHG和 CHH )的曱基化水平, 计算公式为: 平均曱 基化水平=支持甲基化的 reads/ (支持曱基化的 reads+不支持甲基化的 reads) χ100。 最后, 分析非 CG位点的曱基化的 C附近碱基的分布情况, 统计不同曱 基化模式 (CHG和 CHH)出现的概率。 3. Trends in genome-wide thiolation data distribution, which can be performed in the following manner: In order to obtain the trend of genome-wide thiolation distribution, the number and distribution ratio of CG, CHG and CHH in the thiolated C, ie each type The percentage of thiolated C to the total thiolization C, for example: proportion of mCHG = number of mCHG / total number of mC. At the same time, different distribution types of C 3⁄4 ( CG, CHG and CHH ) have different levels of thiolation between different species and even different cell types of the same species. Therefore, the distribution of the C base thiolation levels of each type (CG, CHG, and CHH) thiolation was counted. Similarly, the thiolation levels of C-magnetic groups (CG, CHG, and CHH) for each distribution type in each region are calculated by chromosome, gene interval, and gene functional elements. The calculation formula is: Average thiolation level = support for methylation The reads / ( supports simplification of reads+ does not support methylated reads) χ100. Finally, the distribution of bases near the thiolated C at non-CG sites was analyzed, and the probability of occurrence of different thiolation patterns (CHG and CHH) was counted.

4. 全基因组 DNA曱基化图谱, 所述分析可以通过以下方式进行: 从染色体水平来描述曱基化 C碱基的分布情况。 以 10kb的窗口统计曱基 化 C碱基的密度在染色体上的分布情况; 同时, 统计每 lOOkb的窗口内不同类 型曱基化 C tt( CG、 CHG和 CHH )的密度分布。不同的基因组区域具有各自 不同的生物学功能, 优选以热度图的形式表示曱基化分布情况, 以有助于进一 步了解这些区域的曱基化图语特征。 基因的转录受不同的调控元件调控, 将所 有编码基因序列分成 7个不同的转录元件区域, 在此! ^出上对不同转录元件区 域的曱基化水平进行统计。  4. Whole genome DNA thiolation map, which can be performed by: Delineating the distribution of thiolated C bases from the chromosomal level. The density of the thiolated C bases on the chromosome was counted by a 10 kb window. At the same time, the density distributions of different types of thiolated C tt (CG, CHG, and CHH) in each lOOkb window were counted. Different genomic regions have different biological functions, and the thiolation distribution is preferably expressed in the form of a heat map to help to further understand the thiolated graphical features of these regions. The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different regions of transcriptional elements, here! ^ Statistics on the level of thiolation of different regions of the transcriptional element.

5. 差异性曱基化区域(DMR )分析, 所述分析可以通过以下方式进行: 通过两个样品的曱基化水平, 确定曱基化差异区域, 方法如下: 首先确定 一个 CG位点作为种子位点, 得到两个样本在该位点分别支持曱基化和不支持 甲基化四个数值, 再通过卡方检验检验两个样本的曱基化水平在该位点差异是 否显著, 显著性水平的阈值 p—般是 0.05, 优选 0.01 , 更优选 0.001 , 最优选 0.0001 ; 如果差异显著则往后延伸一个 CG位点, 得到两个样本在这两个位点 分别支持曱基化和不支持曱基化四个数值, 再进行卡方检验; 以此类推循环, 当遇到差异不再显著的点时,便停止,前面显著的区域即为差异性曱基化区域。 再以此位点为种子位点循环上面过程, 从而得到差异性曱基化区域。 在找到差 异性甲基化区域后, 通过这些区域找到相对应的基因, 即曱基化差异基因, 并 对这些基因做 GO聚类功能分析,以分析差异相关基因是否有明显的功能聚类, 即是否针对性地调控行使某类功能。 6. 插入片段长度统计, 所述分析可以通过以下方式进行: 5. Differential thiolated region (DMR) analysis, which can be performed by: Determining the thiolation difference region by the thiolation level of the two samples, as follows: First, identify a CG site as a seed At the locus, two samples were obtained to support the thiolation and the unsupported methylation values at the locus, and then the chi-squared level of the two samples was tested by the chi-square test to see whether the difference was significant or not. The horizontal threshold p is generally 0.05, preferably 0.01, more preferably 0.001, most preferably 0.0001; if the difference is significant, a CG site is extended backward, and two samples are obtained to support thiolation and unsupporting at these two sites, respectively. The 数值 化 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个This site is then used as a seed site to cycle the above process to obtain a differential thiolated region. After finding the differential methylation regions, the corresponding genes, namely the thiolated differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner. 6. Insert fragment length statistics, which can be done in the following ways:

对比对上的 reads进行长度统计, 以判断比对结果是否符合建库片段长度, 统计不同插入片段长度下 reads占总 reads数的百分比。  Compare the length of the reads on the pair to determine whether the comparison result matches the length of the built-in fragment, and count the percentage of the total number of reads in the length of the different inserts.

7. CpG位点的覆盖度, 所述分析可以通过以下方式进行:  7. Coverage of CpG sites, the analysis can be performed in the following ways:

在 RRBS中, 采用酶切技术得到片段再进行 BS测序, !JL对酶切的理论 片段(40-220bp )在各染色体和基因组各区域内覆盖到的 CG位点数及其比例 进行统计, 然后统计两个样品实际得到的数据在理论位点上的覆盖度。  In the RRBS, the fragments were obtained by restriction enzyme digestion and then subjected to BS sequencing. The theoretical fragment of the restriction enzyme digestion (40-220 bp) was counted in the chromosomal spots and their proportions in each chromosome and genome, and then counted. The coverage of the actual data obtained by the two samples at the theoretical site.

本发明还提供了一种基于重亚疏酸氢盐测序或减少的代表性重亚硫酸盐测 序进行生物信息学分析的装置,所述装置包括重亚硫酸氢盐测序和 /或减少的代 表性重亚硫酸盐测序模块, 用于对样品进行重亚硫酸氢盐测序或减少的代表性 重亚硫酸盐测序, 并优选过滤所述测序序列以去除不合格的序列; 数据产出分 析模块, 用于对测序数据产出做出统计,统计内容包括转化率、测序片段读长、 测序片段数量和样品的总数据量等; 测序片段比对分析模块, 用于将测序得到 的片段比对到参考基因组上, 并比对结果做出统计, 统计内容包括原始片段数 原始数据量、 比对片段数、 比对数据量、 平均比对率、 平均覆盖深度; 曱基化 位点的覆盖情况分析模块用于统计不同覆盖深度下, 被曱基化的测序序列覆盖 的 C位点数目占总位点数目的比例; 曱基化位点的甲基化水平分析模块, 用于 统计每个曱基化 C碱基的平均曱基化水平; 全基因组曱基化数据分布趋势分析 模块, 用于得到全基因组曱基化分布趋势, 统计曱基化 C碱基中 CG, CHG与 CHH的数目和分布比例,其中 H代表非 G ¾;全基因组 DNA曱基化图谱分 析模块, 用于从染色体水平来描述曱基化 C碱基的分布情况; 差异性曱基化区 域分析模块, 用于通过两个样品的曱基化水平, 确定曱基化差异区域; 插入片 段长度统计分析模块, 对比对上的测序序列进行长度统计, 以判断比对结果是 否符合建库片段长度; CpG位点的覆盖度分析模块, 用于在各染色体和基因组 各区域内覆盖到的 CG位点数及其比例进行统计, 然后统计实际得到的数据在 理论位点上的覆盖度。 附图说明  The invention also provides a device for bioinformatics analysis based on heavy bisulfite hydrogenation sequencing or reduced representative bisulfite sequencing, the device comprising bisulfite sequencing and/or reduced representation a bisulfite sequencing module for performing bisulfite sequencing or reduced representative bisulfite sequencing of the sample, and preferably filtering the sequencing sequence to remove unqualified sequences; data output analysis module, Statistics are made on the output of the sequencing data, including the conversion rate, the length of the sequencing fragment, the number of sequencing fragments, and the total amount of the sample; the sequencing fragment comparison module is used to compare the fragments obtained by sequencing to the reference. On the genome, and compare the results, the statistical content includes the original number of original fragments, the number of comparison fragments, the amount of comparison data, the average comparison ratio, the average coverage depth; the coverage analysis module of the thiolation site Used to calculate the ratio of the number of C-sites covered by the thiolated sequencing sequence to the total number of sites at different depths of coverage; Point methylation level analysis module for counting the average thiolation level of each thiolated C base; genome-wide thiolation data distribution trend analysis module, used to obtain the whole genome thiolation distribution trend, statistics The number and distribution ratio of CG, CHG and CHH in the thiolated C base, where H stands for non-G 3⁄4; a genome-wide DNA thiolation map analysis module for describing the distribution of thiolated C bases from the chromosomal level Situation; a differential thiolation region analysis module for determining the thiolation difference region by the thiolation level of the two samples; the insert length statistical analysis module, comparing the length of the sequencing sequence on the pair to determine the ratio Whether the result conforms to the length of the built-in fragment; the coverage analysis module of the CpG locus is used to count the number of CG sites and their proportions covered in each chromosome and genome, and then statistically obtain the actual data at the theoretical site. Coverage on. DRAWINGS

图 1 , 基于 BS测序的不同覆盖深度下位点的百分比。  Figure 1. Percentage of loci at different depths of coverage based on BS sequencing.

图 2, 基于 BS测序的不同覆盖深度下位点的 度。  Figure 2. Degrees of loci at different depths of coverage based on BS sequencing.

图 3, 基于 BS测序的不同类型曱基化 C威基的分布比例。  Figure 3. Distribution ratios of different types of thiolated C-based groups based on BS sequencing.

图 4, 基于 BS测序的曱基化 C 的曱基化水平分布。 图 5, 基于 BS测序的非 CG位点的曱基化 C附 列特征。 Figure 4. The thiolation level distribution of thiolated C based on BS sequencing. Figure 5. Quaternized C-attached features of non-CG sites based on BS sequencing.

图 6, 示例性显示基于 BS测序的染色体上曱基化 C密度分布。  Figure 6. Illustratively shows the thiolated C density distribution on chromosomes based on BS sequencing.

图 7, 基于 BS测序的不同基因组区域的 C ½曱基化分布, 以及相应的 CpG密度。  Figure 7. C 1⁄2 曱-based distribution of different genomic regions based on BS sequencing, and the corresponding CpG density.

图 8, 基于 BS测序的全基国组不同功能元件区域的曱基化水平分布。 图 9, 基于 RRBS测序的插入片段长度统计。  Figure 8. The horizontal distribution of thiolation in different functional component regions of the entire base group based on BS sequencing. Figure 9. Insert length statistics based on RRBS sequencing.

图 10, 基于 RRBS测序的不同覆盖深度下位点的百分比。  Figure 10. Percentage of loci at different depths of coverage based on RRBS sequencing.

图 11 , 基于 RRBS测序的覆盖到的 CG位点在基因组各区域覆盖度。  Figure 11. Coverage of covered CG loci in the genome based on RRBS sequencing.

图 12, 基于 RRBS测序的不同深度下 CpG位点的覆盖情况。 具体实施方式  Figure 12. Coverage of CpG loci at different depths based on RRBS sequencing. detailed description

在本发明中 , 重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序可以通过 高通量测序技术进行, 例如 Illumina GA测序技术, 也可以为现有的其他高通 量测序技术。 根据需要, 去除不合格的序列是本领域技术人员根据测序的实际 情况可以进行的。 例如, 不合格序列包括: 测序质量低于预设阀值的 基个数 超过整条序列 个数的 50%则认为是不合格序列。 低质量阀值由具体测序技 术及测序环境而定; 序列中测序结果不确定的减基(例如 Illumina GA测序结 果中的 N )个数超过整条序列减基个数的 10%则认为是不合才各序列; 除样本接 头序列外, 与其它实验引入的外源序列比对, 如各种接头序列。 若序列中存在 外源序列则认为是不合格序列。  In the present invention, bisulfite sequencing or reduced representative bisulfite sequencing can be performed by high throughput sequencing techniques, such as Illumina GA sequencing technology, or other existing high throughput sequencing techniques. Removal of unacceptable sequences can be performed by those skilled in the art based on the actual conditions of sequencing, as needed. For example, the unqualified sequence includes: The number of bases whose sequencing quality is lower than the preset threshold exceeds 50% of the entire sequence and is considered to be a failed sequence. The low-quality threshold is determined by the specific sequencing technology and sequencing environment; the number of subtractions (such as N in Illumina GA sequencing results) whose sequence is uncertain in sequencing is more than 10% of the number of bases in the entire sequence is considered to be inconsistent. Each sequence; in addition to the sample linker sequence, is aligned with other exogenous sequences introduced by the experiment, such as various linker sequences. If there is a foreign sequence in the sequence, it is considered to be a non-conforming sequence.

在本发明 中 , 短序列映射程序可以是例如 SOAP ( 获 自 http://soap.genomics.org.cn/ )等块射程序。  In the present invention, the short sequence mapping program may be a block program such as SOAP (available from http://soap.genomics.org.cn/).

在本发明中, 对曱基化结果进行筛选可以通过例如唯一性的比较、 比对长 度的比较、 错配数的比较, 比对次数的比较等, 筛选出每条序列比对结果合乎 要求的比对信息, 选用的筛选条件需视选用的比对软件、 序列背景而定。  In the present invention, the screening results of the thiolation can be selected by, for example, unique comparison, comparison of comparison lengths, comparison of mismatches, comparison of the number of comparisons, etc., to screen out the alignment results of each sequence. For comparison information, the selection criteria should be based on the selected software and sequence background.

在本发明中, reads是指从测序仪获得的测序片段, 或者称为读段。  In the present invention, reads refers to a sequence of fragments obtained from a sequencer, or a read.

在本发明中 ,根据以下不等式确定该位点曱基化的最小期望测序序列数目, 即满足下式的最小 k值: 腿 C * Q)pk l一 p)n~k < 0.01 * mC; 其巾, In the present invention, the number of minimum expected sequencing sequences of the site thiolation is determined according to the following inequality, ie, the minimum k value satisfying the following formula: leg C * Q) p k l - p) n ~ k < 0.01 * mC; Its towel,

k为支持曱基化 reads数的期望值,它反映了在一定错误率下的最小支持曱 基化测序序列数目, k is the expected value to support the number of singulated reads, which reflects the minimum support at a certain error rate曱 Number of base sequencing sequences,

mC是覆盖该位点的第一测序序列或第二测序序列在相应位置被曱基化的 测序序列数目,  mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the site is thiolated at the corresponding position,

nmC是覆盖该位点的第一测序序列或第二测序序列在相应位置未被曱基 化的测序序列数目,  nmC is the number of sequencing sequences that cover the first sequencing sequence or the second sequencing sequence of the site that are not deuterated at the corresponding position,

n为该位点的测序序列总数,  n is the total number of sequencing sequences at this site,

p为错误率, 例如, 用 C碱基均为未修饰的对照 DNA例如 lamda DNA进 行上述步骤 1 ) -2), 根据 C到 T tt转化率确定 p值, p通常为 0.001-0.01, 上式基于二项分布得到确定该位点曱基化的最小期望测序序列数目。 为了详细说明本发明的目的、 技术方案及具体步骤, 以下结合附图及实施 例, 对本发明进行详细阐述, 应当理解, 此处所描述的具体实施例仅用以解释 本发明, 并不用于限定本发明。 实施例一, 基于 BS测序的生物信息分析方法 样本: 两个人的样本  p is the error rate, for example, the above steps 1) - 2) are carried out using unmodified control DNA such as lamda DNA, and the p value is determined according to the conversion rate of C to T tt, p is usually 0.001-0.01, The number of minimum expected sequencing sequences that determine the thiolation of the site is determined based on the binomial distribution. The present invention will be described in detail below with reference to the accompanying drawings and embodiments. invention. Example 1 Bio-information analysis method based on BS sequencing Sample: Sample of two people

具 作錄 Record

一、 数据预处理 First, data preprocessing

1.过滤测序序列;  1. Filter the sequencing sequence;

在本实施例中, 编号为 NB和 103的两个人血液样品 DNA在深圳华大基 因研究院进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序, 其中测序技 术为 Illumina GA测序技术。 测序分别得到 657.2兆和 660.5兆条序列, 然后对 测序序列进行过滤,去除不合格的序列,分别得到 575.46兆和 576.12兆条合格 序列。 不合格序列包括: 测序质量低于预设阀值(本例中为 66 )的 个数超 过整条序列减基个数的 50%。 序列中测序结果不确定的碱基(在本实施例中为 Illumina GA测序结果中的 N )个«£过整条序列 个数的 10%; 除样本接头 序列外, 其它实验引入的外源序列; 序列中存在外源序列。  In this example, two human blood sample DNAs numbered NB and 103 were subjected to bisulfite sequencing or reduced representative bisulfite sequencing at Shenzhen Huada Gene Research Institute, wherein the sequencing technology was Illumina GA sequencing technology. . Sequencing yielded 657.2 trillion and 660.5 trillion sequences, respectively, and then the sequencing sequence was filtered to remove the unqualified sequences, yielding 575.46 megabytes and 576.12 megabytes of qualified sequences, respectively. The unqualified sequence includes: The number of sequencing quality below the preset threshold (66 in this example) exceeds 50% of the entire sequence minus the number of bases. Bases with undefined sequencing results in the sequence (in this example, N in Illumina GA sequencing results) are 10% of the entire number of sequences; exogenous sequences introduced by other experiments except sample linker sequences ; There is an exogenous sequence in the sequence.

2.确定上述过滤后的测序序列代表的曱基化信息  2. Determine the thiolation information represented by the above-described filtered sequencing sequence

在本实施例中使用的物种的参考基因组序列是人的基因组 hgl8,来自数据 库 UCSC ( http://genome.ucsc.edu/ )。 对过滤后的测序序列和所述参考序列进行 预处理, 步骤如下: i ) 将所述各个测序序列中所有 C (胞嘧啶)碱基转化为 T (胸腺嘧啶)碱 基得到的序列定义为 fql; 将所述各个测序序列互补序列中所有 G (鸟嘌呤)碱 基转化为 A (腺嘌呤) tt得到的序列定义为 fq2, 示意图如下: fql The reference genomic sequence of the species used in this example is the human genome hgl8, from the database UCSC (http://genome.ucsc.edu/). The filtered sequencing sequence and the reference sequence are preprocessed as follows: i) a sequence obtained by converting all C (cytosine) bases in the respective sequencing sequences into T (thymidine) bases is defined as fql; all G (guanine) bases in the complementary sequence of the respective sequencing sequences The sequence obtained by conversion to A (adenine) tt is defined as fq2, and the schematic is as follows: fql

iq2

Figure imgf000009_0001
Iq2
Figure imgf000009_0001

在图中, Cm表示甲基化的 C (胞嘧啶) In the figure, C m represents methylated C (cytosine)

ii )将所述参考基因组序列中所有 c (胞嘧啶) 转化为 τ (胸腺嘧啶) tt生成的序列定义为 refl; 将所述参考基因组序列中所有 G (鸟嘌呤) ½ 转化为 A (腺嘌呤) ½生成的序列定义为 re 2;  Ii) converting the sequence generated by converting all c (cytosine) in the reference genome sequence to τ (thymidine) tt as refl; converting all G (guanine) 1⁄2 in the reference genome sequence into A (adenine) The sequence generated is defined as re 2;

iii )将各个序列 fql和 fq2分别和参考序列 refl和 ref2进行比对, 经过比 对得到各个序列 fql和 fq2在转化后的参考基因组序列上的位置信息, 保留在 转化后的参考基因组序列上有唯一比对的各个序列 fql和 fq2, 示意图如下:  Iii) aligning each sequence fql and fq2 with reference sequences refl and ref2, respectively, and obtaining positional information of each sequence fql and fq2 on the transformed reference genomic sequence, which is retained on the transformed reference genome sequence. The unique sequences fql and fq2 are as follows:

refl fql Refl fql

M M  M M

i;ef2 iv )将一个序列 fql和 fq2在转化后的参考基因组序列上的位置信息参考 序列在相应位置处的序列信息相结合, 确定该序列 fql和 fq2的曱基化信息: 如果在该序列 fql上和在参考序列相应位置上都是 C (胞嘧啶)碱基, 则表示 在该序列 fql的该位置被曱基化; 如果在该序列 fq2上和在参考序列相应位置 都是 G (鸟嘌呤)碱基, 则表示在该序列 fq2的该位置被甲基化, 示意图如下:  i;ef2 iv) combining the sequence information of a sequence fql and fq2 on the transformed reference genome sequence with the sequence information at the corresponding position to determine the thiolation information of the sequences fql and fq2: if in the sequence Both fql and C (cytosine) bases at the corresponding positions in the reference sequence indicate that the position of the sequence fql is thiolated; if the sequence fq2 and the corresponding position in the reference sequence are G (bird)嘌呤) base, which means that the position of the sequence fq2 is methylated, the schematic is as follows:

fq2. T CC ACC C TO C S¾ A^ ¾CC Gft^AT CC GAAiSC TG G T AC AT CACj. Fq2. T CC ACC C TO C S3⁄4 A^ 3⁄4CC Gft^AT CC GAAiSC TG G T AC AT CACj.

T TCCACCC TC C T A& C GSSG CC TGA ½C C C T AC AT C¾G 序列 在图中, *号和 #号表示未曱基化, Cra表示曱基化的 C (胞嘧啶) 。 3.经过上述处理, 样品 NB得到 498.82兆有效序列, 样品 103得到 49—1.28 兆有¾ ^列, 能覆盖整个基因组确定参考基因组序列上曱基化位点。 T TCCACCC TC CT A & C GSSG CC TGA 1⁄2C CCT AC AT C3⁄4G Sequence In the figure, * and # indicate undensylated, and C ra denotes thiolated C (cytosine). 3. After the above treatment, the sample NB obtained the 498.82 megabyte effective sequence, and the sample 103 obtained 49-1.28 megabytes with 3⁄4 ^ column, which can cover the entire genome to determine the thiolation site of the reference genome sequence.

对于参考基因组上每个位点, 统计在该位点支持曱基化的测序序列数 mC 和不支持甲基化的测序序列数 nmC,根据二项分布得到以下不等式确定该位点 是否是甲基化位点: nmC * ( )pk(t - p)n-k < 0.01 * mC; 其中, n为该位点的测序序列总数, p为错误率(o.oi ), k为支持曱基化 reads数的期望值, mC为支持曱基化的测序序列数, nmC为不支持曱基化的测 序序列数; For each locus on the reference genome, the number of sequencing sequences supporting the thiolation at this site, mC, and the number of sequencing sequences that do not support methylation, are counted. The following inequality is obtained from the binomial distribution to determine whether the site is methyl. Site: nmC * ( )p k (t - p) n - k < 0.01 * mC; where n is the total number of sequencing sequences at this site, p is the error rate (o.oi), and k is the supporting thiol group. The expected number of reads, mC is the number of sequencing sequences that support thiolation, and nmC is the number of sequencing sequences that do not support thiolation;

通过上述公式得到参考基因组序列上每个位点支持曱基化 reads数的期望 值 k, 对于某个位点如果支持曱基化的测序序列数大于该位点的期望值, 该位 点即为曱基化的位点, 优选形成包括如下信息的结果文件 *.cout:  The expected value k of the number of thiolated reads supported by each site on the reference genome sequence is obtained by the above formula. For a certain site, if the number of sequencing sequences supporting thiolation is greater than the expected value of the site, the site is a thiol group. The resulting site preferably forms a result file *.cout including the following information:

Chr-a b +/- c d e f g h 第一列 Chr-a: 染色体号,  Chr-a b +/- c d e f g h First column Chr-a: chromosome number,

第二列 b: C位点的位置,  The second column b: the position of the C site,

第三列 +/-: 正负链信息,  The third column +/-: positive and negative chain information,

第四列 c: C位点的类型,  The fourth column c: the type of C site,

第五列 d: 具体的 ,  The fifth column d: specific,

第六列 e: 测序序列平均比对位置,  Column 6 e: average alignment position of the sequencing sequence,

第七列 f: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置被 曱基化的测序序列数目,  Column 7 f: the number of sequencing sequences that cover the first or the second sequencing sequence of the C site and are thiolated at the corresponding position,

第八列 g: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置未 被曱基化的测序序列数目,  The eighth column g: the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position,

第九列 h: 所述 C位点曱基化的最小期望测序序列数目。  Ninth column h: Number of minimum expected sequencing sequences for thiolation of the C site.

结果如下:  The results are as follows:

样品 NB甲基化 CG位点为 36.47兆,占总 CG位点的 96.88%;曱基化 CHG 位点为 0.19兆, 占总 CHG位点的 0.51%; 曱基化 CHH位点为 0.98兆, 占总 CHH位点的 2.62%。  The NB methylation CG site of sample NB was 36.47 megabytes, accounting for 96.88% of the total CG site; the thiolated CHG site was 0.19 megabytes, accounting for 0.51% of the total CHG site; the thiolated CHH site was 0.98 megabytes. It accounts for 2.62% of the total CHH locus.

样品 103曱基化 CG位点为 35.01兆,占总 CG位点的 97.56%;曱基化 CHG 位点为 0.15兆, 占总 CHG位点的 0.43%; 曱基化 CHH位点为 0.72兆, 占总 CHH位点的 2.02%。 Sample 103 thiolated CG site was 35.01 megabytes, accounting for 97.56% of the total CG site; thiolated CHG site was 0.15 megabytes, accounting for 0.43% of the total CHG site; thiolated CHH site was 0.72 megabytes, Total 2.02% of the CHH locus.

二、 生物信息分析  Second, biological information analysis

1.数据产出  Data output

对两个 jk液样本 DNA的测序数据产出做出统计, 统计内容包括转化率、 测序片段读长、 测序片段数量和样品的总数据量等。 具体的结果参见表 1。  Statistics were made on the sequencing data of the DNA of the two jk liquid samples, including the conversion rate, the length of the sequencing fragment read, the number of sequencing fragments, and the total amount of data in the sample. See Table 1 for the specific results.

2.测序片段比对  2. Sequencing fragment alignment

将测序得到的片段比对到该物种的基因组上, 具体比对方法见数据预处理 中的第 2步: 序列在基因组上位置的确定。 在完成数据比对后, 对两个人样品 的比对结果做出统计,统计内容包括原始片段数( Raw reads )、原始数据量( Raw bases )、 比对片段数 ( Mapped reads )、 比对 t据量 ( Mapped bases )、 平均比 对率 ( Average mapped rate ) 、 平均覆盖深度 ( Average coverage depth(X), 全 基因组每个位点被 reads覆盖到的平均次数 )具体结果参见表 2。  The sequenced fragments are aligned to the genome of the species. For specific alignment methods, see Step 2 of Data Preprocessing: Determination of the position of the sequence on the genome. After the data comparison is completed, the comparison results of the two human samples are counted, and the statistics include Raw reads, Raw bases, Mapped reads, and comparisons t. See Figure 2 for the results of the Mapped bases, Average mapped rate, and Average coverage depth (X), the average number of times the entire genome is covered by reads.

3.曱基化位点的覆盖情况  3. Coverage of thiolated sites

C ½位点可以被分为三种类型, 分别是 CG、 CHG、 CHH, 其中 H代 表非 G碱基。 从 *.cout文件中我们可以得到位点是否为曱基化位点, 支持 曱基化的 reads数大于期望值的位点即为曱基化位点。 统计不同覆盖深度(该 位点被 reads覆盖的次数)下 mC位点占总位点的比例, 统计结果如图 1所示, 横轴代表大于等于该覆盖深度, 纵轴代表相应深度下位点的比例。 同时还对不 同覆盖深度下的位点在基因组上的覆盖度进行统计, 覆盖度即为位点数在全基 因组上的比例。 统计结果如图 2所示, 横轴^ 该覆盖深度, 纵轴代表相应深 度下位点的覆盖度。  C 1⁄2 sites can be divided into three types, namely CG, CHG, CHH, where H represents a non-G base. From the *.cout file we can get whether the site is a thiolated site, and the site that supports the thiolized reads greater than the expected value is the thiolation site. The ratio of mC locus to the total locus under different coverage depths (the number of times the locus is covered by reads) is counted. The statistical result is shown in Figure 1. The horizontal axis represents greater than or equal to the coverage depth, and the vertical axis represents the corresponding depth. proportion. At the same time, the genomic coverage of the sites under different coverage depths is also counted, and the coverage is the proportion of the number of sites on the full genomics. The statistical results are shown in Fig. 2, the horizontal axis ^ the depth of coverage, and the vertical axis represents the coverage of the site under the corresponding depth.

4. 曱基化位点的曱基化水平  4. The thiolation level of the thiolated site

为了了解基因组曱基化图谱的总体特征和全基因组的平均曱基化水平, 统 计每个甲基化 C碱基的平均曱基化水平, 计算方式为:  To understand the overall characteristics of the genomic thiolation map and the average thiolation level of the whole genome, the average thiolation level of each methylated C base is calculated by:

支持曱基化的 reads数 /(支持曱基化的 reads数 +不支持曱基化的 reads 数 )xl'00。 具体结果如表 3所示。  The number of reads that support merging / (the number of reads that support merging + the number of reads that do not support merging) xl'00. The specific results are shown in Table 3.

同时, 统计反映曱基化信息的 reads的有¾ ^度, 有 盖度的计算基 于三个方面进行: 染色体、 基因区间和基因功能元件。 将曱基化状态不明确的 reads排除之后,基于支持曱基化的 reads和不支持支持曱基化的 reads数据,进 行有效覆盖度计算, 得到统计结果, 如表 4所示。  At the same time, the statistics reflect the reads of the priming information by 3⁄4^ degrees, and the coverage calculation is based on three aspects: chromosomes, gene intervals, and gene functional elements. After the reads with ambiguous state are excluded, the effective coverage calculation is performed based on the read-supported reads and the reads data that does not support the support, and the statistical results are obtained, as shown in Table 4.

5.全基因组甲基化数据分布趋势  5. Whole genome methylation data distribution trend

为了得到全基因组曱基化分布趋势, 统计曱基化 C碱基中 CG, CHG与 CHH的数目和分布比例, 即各类型甲基化的 C占总曱基化 C的百分比, 例如: mCHG所占比例 = mCHG数目 /mC的总数。 统计结果如表 5和图 3所示。 In order to obtain the trend of genome-wide thiolation distribution, statistics of thiolated C bases in CG, CHG and The number and distribution ratio of CHH, that is, the percentage of each type of methylated C to the total thiolation C, for example: the proportion of mCHG = the number of mCHG / the total number of mC. The statistical results are shown in Table 5 and Figure 3.

同时, 不同分布类型的 C CG、 CHG和 CHH ), 其曱基化水平在不 同物种间, 甚至同一物种不同细胞类型间都存在差异。 因此, 统计每种类型 ( CG、 CHG和 CHH ) 曱基化的 C曱基化水平的分布, 统计结果如图 4所示: 其中, 横轴表示曱基化水平从低到高, 以每 10%为一档, 纵轴表示某一特定曱 基化水平的 C ¾ ^所有曱基化 C 中所占的比例。 同样, 按照染色体、基 因区间和基因功能元件统计各区域内每种分布类型的 C碱基( CG、 CHG和 CHH )的曱基化水平,计算公式为: 平均曱基化水平 =支持曱基化的 reads/ (支持 曱基化的 reads+不支持曱基化的 reads)xl00。 统计结果如表 6所示。  At the same time, different distribution types of C CG, CHG and CHH ) have different levels of thiolation between different species and even different cell types of the same species. Therefore, the distribution of the thiolation of each type (CG, CHG, and CHH) is statistically calculated. The statistical results are shown in Figure 4: where the horizontal axis represents the thiolation level from low to high, for every 10 % is a file, and the vertical axis represents the proportion of C 3⁄4 ^ all thiolation C of a particular thiolation level. Similarly, the thiolation levels of C bases (CG, CHG, and CHH) for each distribution type in each region are calculated by chromosome, gene interval, and gene function elements, and the formula is: Average thiolation level = support thiolation The read/ (supports simplistic reads+ does not support simplified reads) xl00. The statistical results are shown in Table 6.

最后, 分析非 CG位点的曱基化的 C附近碱基的分布情况, 统计不同曱基 化模式 (CHG和 CHH)出现的概率, 结果如图 5所示。  Finally, the distribution of bases near the thiolated C of non-CG sites was analyzed, and the probability of occurrence of different thiolation patterns (CHG and CHH) was statistically analyzed. The results are shown in Fig. 5.

6.全基因组 DNA曱基化图谱 6. Whole genome DNA thiolation map

从染色体水平来描述曱基化 C威基的分布情况。 以 10kb的窗口统计甲基 化( 的密度在染色体上的分布情况; 同时, 统计每 lOOkb的窗口内不同类 型曱基化 C减基 ( CG、 CHG和 CHH )的密度分布, 然后以画图软件作图, 结果 如图 6所示。  The distribution of thiolated C-velocines is described from the chromosomal level. The distribution of density on the chromosome was counted by a 10 kb window; at the same time, the density distribution of different types of thiolated C minus groups (CG, CHG, and CHH) in each lOOkb window was counted, and then plotted by drawing software. Figure, the results are shown in Figure 6.

不同的基因组区域具有各自不同的生物学功能, 以热度图的形式表示甲基 化分布情况, 有助于进一步了解这些区域的曱基化图语特征。  Different genomic regions have different biological functions, and the methylation distribution in the form of heat maps helps to further understand the thiolated graphical features of these regions.

结果如图 7所示: 其中, 每张图表示一种基因组区域的热度图分布特征, 其中 n表示参与分析的 CpG个数(每个 CpG在单条链上的覆盖深度均大于等 于 4X )。 横轴表示每 200bp大小窗口中的 CpG个数 (即 CpG密度), 纵轴表 示 CpG的 C ½平均曱基化水平。 图中的黑色细线表示在特定 CpG密度下, 这个窗口中的 c碱基平均曱基化水平的中位数。 B区从浅到深表示特定曱基化 水平和 CpG密度下的 CpG个数。 上方的 A区表示 CpG密度的分布,映射到横 轴上, 右侧的 C区表示曱基化水平的分布, 并且映射到纵轴上。  The results are shown in Figure 7: where each graph represents the heat map distribution characteristics of a genomic region, where n represents the number of CpGs involved in the analysis (each CpG has a coverage depth greater than 4X on a single strand). The horizontal axis represents the number of CpGs per 200 bp window (ie, CpG density), and the vertical axis represents the C 1⁄2 average thiolation level of CpG. The black thin line in the figure indicates the median of the average thiolation level of c bases in this window at a specific CpG density. From the shallow to the deep, the B region indicates the number of CpGs at a specific thiolation level and CpG density. The upper A area represents the distribution of CpG density, mapped to the horizontal axis, and the right C area represents the distribution of the thiolation level and is mapped to the vertical axis.

基因的转录受不同的调控元件调控, 将所有编码基因序列分成 7个不同的 转录元件区域, 在此!^出上对不同转录元件区域的曱基化水平进行统计。 DNA 曱基化水平在不同功能区的分布特点有助于从全基因组水平去了解不同区域的 DNA曱基化修饰的作用。统计结果如图 8所示:横轴表示 7个不同的转录元件 区, 纵轴表示特定区域各位点的平均曱基化水平,每一个红点即为一个 bin (人 为划定的一定长度的区间)的平均曱基化水平, 曲线表示的是平均 5-bin的移 动, a与 b之间的虚线即为 TSS (转录起始位点)。 The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different regions of transcriptional elements, here! ^ Statistics on the level of thiolation of different transcriptional element regions. The distribution of DNA thiolation levels in different functional regions helps to understand the role of DNA thiolation modifications in different regions from the genome-wide level. The statistical results are shown in Fig. 8: the horizontal axis represents 7 different transcription element regions, and the vertical axis represents the average thiolation level of each point in a specific region, and each red dot is a bin (person For the average thiolization level of a defined length interval, the curve represents the average 5-bin shift, and the dashed line between a and b is the TSS (transcription start site).

7.差异性甲基化区域(DMR )分析  7. Differential methylation region (DMR) analysis

通过两个样品的曱基化水平, 确定甲基化差异区域, 方法如下: 首先确定一个 CG位点作为种子位点, 得到两个样本在该位点分别支持曱 基化和不支持曱基化四个数值, 再通过卡方检验检验两个样本的曱基化水平在 该位点差异是否显著; 如果差异显著则往后延伸一个 CG位点, 得到两个样本 在这两个位点分别支持曱基化和不支持曱基化四个数值, 再进行卡方 4佥验; 以 此类推循环, 当遇到差异不再显著的点时, 便停止, 前面显著的区域即为差异 性曱基化区域。 再以此位点为种子位点循环上面过程, 从而得到差异性曱基化 区域。  The methylation difference region was determined by the thiolation level of the two samples as follows: First, a CG site was identified as a seed site, and two samples were obtained to support thiolation and thiolization at the site, respectively. Four values were used to check whether the thiolation level of the two samples was significant at the site by a chi-square test; if the difference was significant, a CG site was extended backward, and two samples were supported at these two sites.曱 化 和 和 和 和 和 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个 四个Area. This site is then used as a seed site to cycle the above process, resulting in a differential thiolated region.

在找到差异性曱基化区域后, 通过这些区域找到相对应的基因, 即曱基化 差异基因, 并对这些基因做 GO聚类功能分析, 以分析差异相关基因是否有明 显的功能聚类, 即是否针对性地调控行使某类功能。 实施例结果:  After finding the differential thiolated regions, the corresponding genes, namely the thiolated differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner. Example results:

表 1 , 两个样本的数据产出统计  Table 1, data output statistics for two samples

Figure imgf000013_0001
Figure imgf000013_0001

表 2, 比对结果统计 Table 2, comparison results statistics

Figure imgf000013_0002
表 3 , 不同类型曱基化 C威基的平均甲基化水平
Figure imgf000013_0002
Table 3, average methylation levels of different types of thiolated C-Wiki

Figure imgf000014_0001
Figure imgf000014_0001

表 4, 反映曱基化信息的 reads的有^ ^盖度 Table 4, the readings that reflect the singularization information

Figure imgf000014_0002
Figure imgf000014_0002

表 5, 不同分布类型曱基化 C的数量及比例 Table 5, Number and proportion of different distribution types thiolated C

Figure imgf000014_0003
Figure imgf000014_0003

不同基因区域上各类型 C的曱基化水平 The level of thiolation of various types of C in different gene regions

Figure imgf000014_0004
实施例二, 基于 RRBS测序的生物信息分析方法 样本: d液样本 DNA
Figure imgf000014_0004
Example 2: Samples of biological information analysis methods based on RRBS sequencing: d liquid sample DNA

具 作緣  Have a relationship

一、 数据预处理  First, data preprocessing

1.过滤测序序列; 在本实施例中, 编号为 YH的 i液样品 DNA在深圳华 λ ^因研究院进 行重亚疏酸氢盐测序或减少的代表性重亚疏酸盐测序,其中测序技术为 Illumina GA测序技术。 测序得到 94.5兆条序列, 然后对测序序列进行过滤, 去除不合 格的序列, 得到 83.62兆条合格序列。 不合格序列包括: 测序质量低于预设阀 值(本例中为 66 )的 基个数超过整条序列减基个数的 50%。 序列中测序结果 不确定的减基 (在本实施例中 Illumina GA测序结果中的 N )个数超过整条序列 ½个数的 10%; 除样本接头序列外, 其它实验引入的外源序列; 序列中存在 外源序列则认为是不合格序列。 1. Filter the sequencing sequence; In this example, the liquid sample DNA numbered YH is subjected to heavy subhydro acid hydrogenation sequencing or reduced representative heavy acid sulfite sequencing in the Shenzhen Hua λ ^ Institute, wherein the sequencing technology is Illumina GA sequencing technology. . Sequencing yielded 94.5 mega-sequences, and the sequencing sequence was then filtered to remove unqualified sequences, resulting in 83.62 mega-qualified sequences. The unqualified sequence includes: The number of bases whose sequencing quality is lower than the preset threshold (66 in this example) exceeds 50% of the number of bases minus the entire sequence. The number of subtractions (in the present example, the number of N in the Illumina GA sequencing results) in the sequence is more than 10% of the entire sequence; in addition to the sample linker sequence, the exogenous sequence introduced by other experiments; The presence of an exogenous sequence in the sequence is considered a non-conforming sequence.

2.确定上述过滤后的测序序列代表的甲基化信息  2. Determine the methylation information represented by the above filtered sequencing sequence

在本实施中使用的物种的参考基因组序列是人的基因组 hgl8,来自数据库 UCSC。 对过滤后的测序序列和所述的物种参考序列进行预处理, 步骤如下: i )将所述各个测序序列中所有 C (胞嘧啶) tt转化为 T (胸腺嘧啶)碱 基得到的序列定义为 fql; 将所述各个测序序列互补序列中所有 G (鸟嘌呤)碱 基转化为 A (腺嘌呤) ^^得到的序列定义为 fq2, 示意图如下: fql ΘΘΤ鍵 βΆΘΘΑΤΤΜΤ霞 (¾'ΤΪ職 ¾¾$T T ¾¾S灘 STAGI^ ββ T TeC¾ CCTC 請職 TO嚷 "GCT:Cft¾¾TCG:簡麵顯 在图中, Cm表示曱基化的 C (胞嘧啶) The reference genomic sequence of the species used in this embodiment is the human genome hgl8, from the database UCSC. The filtered sequencing sequence and the species reference sequence are pretreated as follows: i) the sequence obtained by converting all C (cytosine) tt in the respective sequencing sequences into T (thymine) bases is defined as Fql; The sequence obtained by converting all G (guanine) bases in the complementary sequence of each sequencing sequence into A (adenine) ^^ is defined as fq2, and the schematic is as follows: fql ΘΘΤ bond βΆΘΘΑΤΤΜΤ霞(3⁄4'ΤΪ职3⁄43⁄4$ TT 3⁄43⁄4S Beach STAGI^ ββ T TeC3⁄4 CCTC Job TO嚷"GCT : Cft3⁄43⁄4TCG: The simplified surface is shown in the figure, C m represents the thiolated C (cytosine)

ii )将所述参考基因组序列中所有 C (胞嘧啶) ½转化为 T (胸腺嘧啶) 生成的序列定义为 refl; 将所述参考基因组序列中所有 G (鸟嘌呤)减基 转化为 A (腺嘌呤) 生成的序列定义为 ref2;  Ii) converting all C (cytosine) 1⁄2 in the reference genomic sequence into T (thymine) to generate a sequence defined as refl; converting all G (guanine) subunits in the reference genome sequence to A (gland)嘌呤) The generated sequence is defined as ref2;

iii )将各个序列 fql和 fq2分别和参考序列 refl和 ref2进行比对, 经过 比对得到各个序列 fql和 fq2在转化后的参考基因组序列上的位置信息, 保留 在转化后的参考基因组序列上有唯一比对的各个序列 fql和 fq2, 示意图如下:  Iii) aligning each sequence fql and fq2 with reference sequences refl and ref2, respectively, and obtaining positional information of each sequence fql and fq2 on the transformed reference genomic sequence, which is retained on the transformed reference genome sequence. The unique sequences fql and fq2 are as follows:

r

Figure imgf000015_0001
ef2 T iv )将一个序列 fql和 fq2在转化后的参考基因组序列上的位置信息参考序 列在相应位置处的序列信息相结合, 确定该序列 fql和 fq2的甲基化信息: 如 果在该序列 fql上和在参考序列相应位置上都是 C (胞嘧啶)碱基, 则表示在 该序列 fql的该位置被曱基化; 如果在该序列 fq2上和在参考序列相应位置都 是 G 则表示在该序列 fq2的该位置被曱基化示意图如下:
Figure imgf000016_0001
r
Figure imgf000015_0001
e f2 T Iv) combining the sequence information of a sequence fql and fq2 on the transformed reference genome sequence with the sequence information at the corresponding position to determine the methylation information of the sequences fql and fq2: if and in the sequence fql If the C (cytosine) base is at the corresponding position of the reference sequence, it means that the position of the sequence fql is thiolated; if both the sequence fq2 and the corresponding position in the reference sequence are G, the sequence is represented. The position of fq2 is singulated as follows:
Figure imgf000016_0001

,權 麵 i^eG^TceT:!^卿:雄聽 ¾?¾:¾3¾¾ ¾¾|0<≤^¾3: ¾聽 AT:e 2: τ^Αε^'Θ σϊ^^ΐ^Ε^^^βόδ^^τ,ίρί^^^τΐ^^漏;^ [" 翔: 在图中, *号和 #号表示未曱基化, Cm表示曱基化的 C (胞嘧啶) ½。 3. 经过上述处理, 样品 YH得到 63.45兆有¾ ^列, 能覆盖整个基因组的 6.2%的 CG位点。 , 权面 i^eG^TceT:!^卿: 雄听3⁄4?3⁄4:3⁄433⁄43⁄4 3⁄43⁄4|0<≤^3⁄43 : 3⁄4 listen AT:e 2: τ^Αε^'Θ σϊ^^ΐ^Ε^^^ Βόδ^^τ, ίρί^^^τΐ^^漏;^ [" Xiang: In the figure, * and # indicate undensified, and C m denotes thiolated C (cytosine) 1⁄2. After the above treatment, the sample YH obtained 63.45 megabytes with 3⁄4 ^ columns, covering 6.2% of the CG sites of the entire genome.

由于 RRBS测序 因酶切片段进行 BS测序, 因此我们需要对得到的有 效序列进行酶切位点的过滤, 才艮据不同的醇得到不同的理^ t刀点, 这里使用了 Mspl识别 CCGG位点的酶。对有效序列的两端进行判断是否还有为 Mspl的酶 切位点, 如果其中一端还有该醇切位点则认为该有效序列为正常酶切片段, 用 于后续的分析。  Since RRBS sequencing is performed by BS sequencing of the digested fragments, we need to filter the obtained effective sequences by cleavage sites, and then obtain different cleavage points according to different alcohols. Here, Mspl is used to identify CCGG sites. Enzyme. Both ends of the effective sequence are judged whether there is a restriction site for Mspl. If one of the ends has the alcohol cleavage site, the effective sequence is considered to be a normal restriction fragment, which is used for subsequent analysis.

3.确定参考基因组序列上曱基化位点  3. Identify the thiolation site on the reference genome sequence

对于参考基因组上每个位点, 统计在该位点支持曱基化的测序序列数 mC 和不支持曱基化的测序序列数 nmC, 据二项分布得到以下不等式确定该位点 是否是曱基化位点: nmC * ()pk(l - v ~k < 0.01 * mC; 其中, n为该位点的测序序列总数, p为错误率(本例中为 0.01 ) , k为 支持曱基化 reads数的期望值, mC为支持曱基化的测序序列数, nmC为不支持 曱基化的测序序列数; For each locus on the reference genome, the number of sequencing sequences supporting the thiolation at this site, mC, and the number of sequencing sequences that do not support thiolation, are counted. According to the binomial distribution, the following inequality is used to determine whether the site is a sulfhydryl group. Site: nmC * ()p k (l - v ~ k < 0.01 * mC; where n is the total number of sequencing sequences at this site, p is the error rate (0.01 in this case), k is the supporting thiol group The expected number of reads, mC is the number of sequencing sequences that support thiolation, and nmC is the number of sequencing sequences that do not support thiolation;

通过上述公式得到参考基因组序列上每个位点支持曱基化 reads数的期望 值 k, 对于某个位点如果支持曱基化的测序序列数大于该位点的期望值, 该位 点即为曱基化的位点, 优选形成包括如下信息的结果文件 *.cout: Chr-a b +/- c d e f g h 第一列 Chr-a: 染色体号, The expected value k of the number of thiolated reads supported by each site on the reference genome sequence is obtained by the above formula. For a certain site, if the number of sequencing sequences supporting thiolation is greater than the expected value of the site, the site is a thiol group. The resulting site preferably forms a result file *.cout including the following information: Chr-a b +/- cdefgh First column Chr-a: chromosome number,

第二列 b: C位点的位置,  The second column b: the position of the C site,

第三列 +/-: 正负链信息,  The third column +/-: positive and negative chain information,

第四列 c: C位点的类型,  The fourth column c: the type of C site,

第五列 d: 具体的^ ,  The fifth column d: the specific ^,

第六列 e: 测序序列平均比对位置,  Column 6 e: average alignment position of the sequencing sequence,

第七列 f: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置被曱 基化的测序序列数目,  Column 7 f: the number of sequencing sequences that are ligated at the corresponding position by the first sequencing sequence or the second sequencing sequence covering the C site,

第八列 g: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置未被 曱基化的测序序列数目,  Column 8 g: the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position,

第九列 h: 所述 C位点曱基化的最小期望测序序列数目。  Ninth column h: Number of minimum expected sequencing sequences for thiolation of the C site.

结果如下: The results are as follows:

样品 YH甲基化 CG位点为 1.84兆, 占覆盖到的 CG位点的 52.93%; 曱基 化 CHG位点为 0.013兆, 占总 CHG位点的 0.49%; 曱基化 CHH位点为 0.036 兆, 占总 CHH位点的 2.55%。  The YH methylation CG site of the sample was 1.84 megabytes, accounting for 52.93% of the covered CG sites; the thiolated CHG site was 0.013 megagrams, accounting for 0.49% of the total CHG sites; the thiolated CHH site was 0.036. Mega, accounting for 2.55% of the total CHH locus.

二、 生物信息分析  Second, biological information analysis

1.数据产出  Data output

对两个 jk液样本 DNA的测序数据产出分不同酶切片段文库做出统计,统 计内容包括测序片段读长、 测序片段数量和样品的总数据量等。 具体的结果参 见表 7。  The sequencing data of the DNA of the two jk liquid samples were generated by counting the libraries of different enzyme fragments, and the statistics included the length of the sequencing fragments, the number of sequencing fragments, and the total amount of samples. See Table 7 for specific results.

2. 测序片段比对  2. Sequencing fragment alignment

将测序得到的片段比对到该物种的基因组上, 具体比对方法见数  The sequenced fragments are aligned to the genome of the species, and the specific comparison method is described.

据预处理中的第 2步: 序列在基因组上位置的确定。 在完成数据比对后, 对两个人样品的比对结果做出统计, 统计内容包括原始片段数(Raw reads ) 、 原始数据量(Raw Data )、 比对片段数 ( Mapped reads )、 比对¾据量(Mapped bases )、 比对率 ( Mapped rate )、 醉切率 ( Enzyme cutting rate )具体结果参见 表 8。  According to step 2 of the pretreatment: the determination of the position of the sequence on the genome. After the data comparison is completed, the comparison results of the two human samples are counted, and the statistics include the Raw reads, the Raw Data, the Mapped reads, and the comparisons. See Table 8 for specific results of Mapped bases, Mapped rate, and Enzyme cutting rate.

3. 插入片段长度统计  3. Insert length statistics

对比对上的 reads进行长度统计, 以判断比对结果是否符合建库片段长度, 统计不同插入片段长度下 reads占总 reads数的百分比。 统计结果如图 9所示。 4. 曱基化位点的覆盖情况 Compare the length of the reads on the pair to determine whether the comparison results match the length of the database. Calculate the percentage of reads in the total number of reads for different insert lengths. The statistical results are shown in Figure 9. 4. Coverage of thiolated sites

C位点可以被分为三种类型, 分别是 CG、 CHG、 CHH, 其中 H代表非 G 。从 *.cout文件中我们可以得到位点是否为曱基化位点,支持曱基化的 reads 数大于期望值的位点即为曱基化位点。 统计不同覆盖深度(该位点被 reads覆 盖的次数)下 mC位点占总位点的比例, 统计结果如图 10所示,横轴代表大于 等于该覆盖深度, 纵轴代表相应深度下位点的比例。 同时还对不同覆盖深度下 的位点在基因组各区域的覆盖度进行统计, 覆盖度即为覆盖到的位点数在基因 组各区域的比例。 统计结果如图 11所示,横轴^ 该覆盖深度, 纵轴代表相应 深度下位点的覆盖度。 最后统计各基因组区域内的 CG位点的平均覆盖深度, 结果参见表 9。 除此之外, 统计基因组各区域内的 CG位点的平均覆盖深度 The C site can be divided into three types, namely CG, CHG, CHH, where H stands for non-G. From the *.cout file we can get whether the site is a thiolated site, and the site that supports the thiolized reads greater than the expected value is the thiolation site. The ratio of the mC locus to the total locus under different coverage depths (the number of times the locus is covered by the reads) is counted. The statistical result is shown in Fig. 10. The horizontal axis represents greater than or equal to the coverage depth, and the vertical axis represents the locus at the corresponding depth. proportion. At the same time, the coverage of the sites in different coverage depths in each region of the genome is also counted, and the coverage is the proportion of the number of sites covered in each region of the genome. The statistical results are shown in Fig. 11, the horizontal axis ^ the depth of coverage, and the vertical axis represents the coverage of the site at the corresponding depth. Finally, the average coverage depth of CG loci in each genomic region was counted. See Table 9 for the results. In addition, the average coverage depth of CG loci in each region of the statistical genome

5. CpG位点的覆盖度 5. Coverage of CpG sites

RRBS 采用酶切技术得到片段再进行 BS 测序, 这里对酶切的理论片段 ( 40-220bp )在各染色体和基因组各区域内覆盖到的 CG位点数及其比例进行 统计, 然后统计两个样品实际得到的数据在理论位点上的覆盖度, 统计结果见 表 10及图 12。  RRBS was digested to obtain fragments and then subjected to BS sequencing. Here, the theoretical fragment (40-220 bp) of the enzyme digestion was counted in the number of CG sites and their proportions in each chromosome and genome, and then the two samples were actually counted. The coverage of the obtained data at the theoretical site is shown in Table 10 and Figure 12.

6. 差异性曱基化区域(DMR )分析  6. Differential thiolized region (DMR) analysis

通过两个样品的曱基化水平, 确定曱基化差异区域, 方法如下:  The thiolation difference region was determined by the thiolation level of the two samples as follows:

首先确定一个 CG位点作为种子位点, 得到两个样本在该位点分别支持曱 基化和不支持曱基化的四个数值, 再通过卡方检验检验两个样本的曱基化水平 在该位点差异是否显著; 如果差异显著则往后延伸一个 CG位点, 得到两个样 本在这两个位点分别支持曱基化和不支持曱基化的四个数值,再进行卡方检验; 以此类推循环, 当遇到差异不再显著的点时, <更停止, 前面显著的区域即为差 异性曱基化区域。 再以此位点为种子位点循环上面过程, 从而得到差异性曱基 化区域。  First, a CG locus is determined as a seed site, and two samples are supported at the site to support the thiolation and the four values that do not support thiolation, and then the chi-square level of the two samples is tested by the chi-square test. Whether the difference in the locus is significant; if the difference is significant, a CG locus is extended backward, and two samples are supported at these two loci to support the thiolation and the four values that do not support thiolation, and then the chi-square test is performed. Such a push cycle, when encountering a point where the difference is no longer significant, <more stop, the front significant area is the differential thiolated area. This site is then used as a seed site to cycle the above process to obtain a differential thiolated region.

在找到差异性曱基化区域后, 通过这些区域找到相对应的基因, 即甲基化 差异基因, 并对这些基因做 GO聚类功能分析, 以分析差异相关基因是否有明 显的功能聚类, 即是否针对性地调控行使某类功能。  After finding the differential thiolated regions, the corresponding genes, namely methylation differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner.

RRBS采用酶切技术得到片段再进行 BS测序,所得结果包括但不局限于上 述内容。 对于 BS测序的生物信息分析中的内容, RRBS也可酌情分析。 实施例结果: The RRBS is subjected to restriction enzyme digestion to obtain fragments and then subjected to BS sequencing, and the results include, but are not limited to, the above. RRBS can also be analyzed as appropriate for the content of bioinformatics analysis of BS sequencing. Example results:

表 7, 两个样本的数据产出统计  Table 7, Data output statistics for two samples

Figure imgf000019_0001
Figure imgf000019_0001

表 8, 比对结果统计 Table 8, statistics of comparison results

Figure imgf000019_0002
Figure imgf000019_0002

表 9, 各基因组区域内的 CG位点的平均覆盖深度 Table 9. Average coverage depth of CG loci in each genomic region

CG的平均 ¾ 果度 (X) 基因元件  CG average 3⁄4 degree (X) gene component

LB-NB9 LB-NB10 LB-NB9 LB-NB10

CG岛上游 2k 18.04 17.61Upstream of CG Island 2k 18.04 17.61

CG岛 20.92 19.89CG Island 20.92 19.89

CG岛下游 2k 17.68 17.39 上游 2k 18.7 17.33Downstream of CG Island 2k 17.68 17.39 Upstream 2k 18.7 17.33

5-非翻译区 21.17 19.95 编码序列 19.98 20.67 内含子 18.02 17.555-untranslated region 21.17 19.95 coding sequence 19.98 20.67 intron 18.02 17.55

3-非翻译区 17.69 17.73 下游 2k 20.11 19.85 启动子 19.94 18.73 ii因 20.78 20.98 串联重复序列 18.46 17.593-untranslated region 17.69 17.73 downstream 2k 20.11 19.85 promoter 19.94 18.73 ii due 20.78 20.98 tandem repeat 18.46 17.59

Alu重复序列 13.49 11.3 核糖体 RNA 982.78 726.59 转运 RNA 20.45 18.81 0, CpG位点的覆盖度 Alu repeat sequence 13.49 11.3 ribosomal RNA 982.78 726.59 transfer RNA 20.45 18.81 0, coverage of CpG sites

Figure imgf000020_0001
Figure imgf000020_0001

Claims

权 利 要 求 书 Claim 1. 一种通过重亚石充酸氢盐测序或减少的代表性重亚^ ^酸盐测序检测 DNA 曱基化的方法, 所述方法包括步骤:  CLAIMS 1. A method for detecting DNA thiolation by heavy bisulfite hydrogenation sequencing or reduced representative heavy acid salt sequencing, the method comprising the steps of: 1 )对样品进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序,并优选 过滤所述测序序列以去除不合格的序列;  1) subjecting the sample to bisulfite sequencing or reduced representative bisulfite sequencing, and preferably filtering the sequencing sequence to remove unqualified sequences; 2 )确定上述测序序列代表的曱基化位点, 步骤如下  2) Determine the thiolation site represented by the above sequencing sequence, the steps are as follows i ) 将所述各个测序序列中所有 C (胞嘧啶) 转化为 T (胸腺嘧啶)碱 基得到的第一测序序列; 将所述各个测序序列互补序列中所有 G (鸟嘌呤)碱 基转化为 A (腺嘌呤) 得到的第二测序序列;  i) a first sequencing sequence obtained by converting all C (cytosine) in the respective sequencing sequences into T (thymine) bases; converting all G (guanine) bases in the complementary sequences of the respective sequencing sequences into A (adenine) obtained second sequencing sequence; ϋ )将相应物种的参考基因组序列中所有 C (胞嘧啶 ) ½转化为 Τ (胸腺 嘧啶)减基生成的第一参考序列; 将相应物种的参考基因组序列中所有 G (鸟 嘌呤)碱基转化为 Α (腺嘌呤) 生成的第二参考序列;  ϋ ) converting all C (cytosine) 1⁄2 in the reference genome sequence of the corresponding species into the first reference sequence generated by the Τ (thymidine) subtraction; converting all G (guanine) bases in the reference genome sequence of the corresponding species a second reference sequence generated for Α (adenine); iii )将各个第一测序序列和第二测序序列分别和第一参考序列和第二参考 序列进行比对 , 经过比对得到各个第一测序序列和第二测序序列在第一参考序 列和第二参考序列上的位置, 选择在第一参考序列和第二参考序列上有唯一比 对的各个第一测序序列和第二测序序列;  Iii) aligning each of the first sequencing sequence and the second sequencing sequence with the first reference sequence and the second reference sequence, respectively, and obtaining each of the first sequencing sequence and the second sequencing sequence in the first reference sequence and the second Selecting, in the position on the sequence, each of the first sequencing sequence and the second sequencing sequence having unique alignments on the first reference sequence and the second reference sequence; iv )利用上述选择的第一测序序列和第二测序序列比对的参考基因组序列 的序列信息确定第一测序序列和第二测序序列的甲基化信息, 步骤如下: 如果 在第一测序序列和比对的参考基因组序列的相应位置都是 C (胞嘧啶)碱基, 则表示在第一测序序列的该位置被曱基化; 如果在第二测序序列和比对的参考 基因组序列的相应位置都是 G (鸟嘌呤)碱基, 则表示在第二测序序列的该位 置被曱基化;  Iv) determining the methylation information of the first sequencing sequence and the second sequencing sequence by using the sequence information of the first sequencing sequence and the second sequencing sequence aligned by the second sequencing sequence, as follows: If in the first sequencing sequence and The corresponding positions of the aligned reference genomic sequences are C (cytosine) bases, indicating that they are thiolated at this position of the first sequencing sequence; if at the corresponding position of the second sequencing sequence and the aligned reference genome sequence All of which are G (guanine) bases, indicating that they are thiolated at this position of the second sequencing sequence; 3 )对于参考基因组序列上每个 C位点, 统计覆盖该位点的第一测序序列 或第二测序序列在相应位置被曱基化的测序序列数目和未被曱基化的测序序列 数目, 根据以下不等式确定该位点曱基化的最小期望测序序列数目, 即满足下 式的最小 k: nmC * ( )pk(t p ~k < 0.01 * mC; 其中, 3) for each C site on the reference genomic sequence, the number of sequencing sequences that are thiolated at the corresponding position of the first or second sequencing sequence covering the site and the number of sequencing sequences that are not thiolated, The minimum number of expected sequencing sequences for thiolation of the site is determined according to the following inequality, ie, the minimum k satisfying the following formula: nmC * ( )p k (tp ~ k < 0.01 * mC; wherein mC是覆盖该 C位点的第一测序序列或第二测序序列在相应位置被曱基化 的测序序列数目,  mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the C site is thiolated at the corresponding position, nmC是覆盖该 C位点的第一测序序列或第二测序序列在相应位置未被曱基 化的测序序列数目, nmC is the first sequencing sequence covering the C site or the second sequencing sequence is not thiol at the corresponding position Number of sequencing sequences, n为该位点的测序序列总数,  n is the total number of sequencing sequences at this site, p为 4 吴率,  p is 4 Wu rate, 如果覆盖该 C位点的第一测序序列或第二测序序列在相应位置被甲基化的 测序序列数目大于该位点的最小期望测序序列数目 , 该 C位点即为曱基化的位 点。  If the number of sequencing sequences that are methylated at the corresponding position by the first or second sequencing sequence covering the C site is greater than the minimum number of desired sequencing sequences at the site, the C site is the site of thiolation . 2.权利要求 1的方法, 其中用 C碱基均为未修饰的对照 DNA例如 lamda DNA代替所述样品进行上述步骤 1 ) -2), 根据 C到 T减基转化率确定 p值, p 通常为 0.001-0.01, 例如 0.01。  2. The method of claim 1, wherein the step 1) -2) is carried out by replacing the sample with an unmodified control DNA such as lamda DNA, and the p value is usually determined according to the conversion ratio of C to T minus, p is usually It is 0.001-0.01, such as 0.01. 3. 权利要求 1或 2的方法,还包括形成包括如下信息的结果文件: 染色体 号、 C位点的位置、 正负链信息、 C位点的类型、 具体的碱基、 覆盖所述 C位 点的第一测序序列或第二测序序列数目、覆盖所述 C位点的第一测序序列或第 二测序序列在相应位置被曱基化的测序序列数目、覆盖所述 C位点的第一测序 序列或第二测序序列在相应位置未被曱基化的测序序列数目和 /或所述 C位点 曱基化的最小期望测序序列数目。  3. The method of claim 1 or 2, further comprising forming a result file comprising: a chromosome number, a position of the C site, positive and negative strand information, a type of C site, a specific base, and a coverage of the C site The number of the first sequencing sequence or the second sequencing sequence of the point, the number of sequencing sequences that are thiolated at the corresponding position by the first sequencing sequence or the second sequencing sequence covering the C site, and the first number covering the C site The number of sequencing sequences that the sequencing sequence or the second sequencing sequence is not thiolated at the corresponding position and/or the minimum number of desired sequencing sequences that are thiolated by the C site. 4.一种基于重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序的生物信息 学分析的方法, 所述方法包括通过权利要求 1 -3任一项的方法检测 DNA曱基 化, 然后进行生物信息学分析, 所述生物信息学分析为对选自如下的一项或多 项进行分析: 数据产出、 测序片段比对、 曱基化位点的覆盖情况、 曱基化位点 的曱基化水平、 全基因组曱基化数据分布趋势、 全基因组 DNA曱基化图谱、 差异性曱基化区域、 插入片段长度统计和 CpG位点的覆盖度。  4. A method of bioinformatics analysis based on bisulfite sequencing or reduced representative bisulfite sequencing, the method comprising detecting DNA thiolation by the method of any of claims 1-3, Bioinformatics analysis is then performed, which analyzes one or more selected from the group consisting of: data production, sequencing of fragments, coverage of thiolated sites, thiolation sites The level of thiolation, genome-wide thiolation data distribution trends, genome-wide DNA thiolation maps, differential thiolated regions, insert length statistics, and coverage of CpG sites. 5权利要求 4的方法, 所述曱基化位点的覆盖情况分析按如下进行: C tt位点被分为三种类型, 分别是 CG、 CHG、 CHH, 其中 H代表非 G 碱基; 从结果文件得到位点是否为曱基化位点, 支持曱基化的 reads数大于期 望值的位点即为曱基化位点; 统计不同覆盖深度下 mC位点占总位点的比例; 同时还对不同覆盖深度下的位点在基因组上的; ί¾Α度进行统计, 覆盖度即为位 点数在全基因组上的比例。  5. The method of claim 4, wherein the coverage analysis of the thiolation site is performed as follows: C tt sites are classified into three types, namely CG, CHG, CHH, wherein H represents a non-G base; The result file obtains whether the site is a thiolation site, and the site supporting the thiolated reads larger than the expected value is the thiolation site; the ratio of the mC site to the total site at different coverage depths is also calculated; For sites with different coverage depths on the genome, ί3⁄4Α is counted, and the coverage is the proportion of the number of sites on the whole genome. 6.权利要求 4的方法, 所述曱基化位点的曱基化水平分析按如下进行: 为了了解基因组曱基化图谱的总体特征和全基因组的平均甲基化水平, 统 计每个甲基化 C碱基的平均曱基化水平; 统计反映曱基化信息的 reads的有效 覆盖度, 有 盖度的计算基于三个方面进行: 染色体、 基因区间和基因功能 元件, 将曱基化状态不明确的 reads排除之后, 基于支持曱基化的 reads和不支 持支持甲基化的 reads数据, 进行有效覆盖度计算, 得到统计结果。 6. The method of claim 4, wherein the thiolation level analysis of the thiolation site is performed as follows: To understand the overall characteristics of the genomic thiolation map and the average methylation level of the whole genome, each methyl group is counted. The average thiolation level of C bases; statistically reflects the effective coverage of reads of thiolated information. The calculation of coverage is based on three aspects: chromosome, gene interval and gene function After the components are excluded from the ambiguous reads, the effective coverage calculation is performed based on the read-supported singly-based reads and the reads data that does not support methylation, and the statistical results are obtained. 7.权利要求 4的方法,所述全基因组曱基化数据分布趋势分析按如下进行: 为了得到全基因组曱基化分布趋势, 统计曱基化 C ¾中 CG, CHG与 7. The method of claim 4, wherein said genome-wide thiolation data distribution trend analysis is performed as follows: To obtain a genome-wide thiolation distribution trend, statistical thiolation C 3⁄4 CG, CHG and CHH的数目和分布比例, 即各类型曱基化的 C占总曱基化 C的百分比; 统计 每种类型 (CG、 CHG和 CHH ) 曱基化的 C曱基化水平的分布; 按照染色体、 基因区间和基因功能元件统计各区域内每种分布类型的 C CG、 CHG和The number and distribution ratio of CHH, ie the percentage of each type of thiolated C to the total thiolation C; the distribution of the C thiolation level of each type (CG, CHG and CHH) thiolation; Gene intervals and gene functional elements count C CG, CHG and each type of distribution in each region CHH )的曱基化水平); 或者分析非 CG位点的曱基化的 C附近碱基的分布情 况, 统计不同曱基化模式 (CHG和 CHH)出现的概率。 The thiolation level of CHH); or the distribution of bases near the thiolated C at non-CG sites, and the probability of occurrence of different thiolation patterns (CHG and CHH). 8.权利要求 4的方法, 所述全基因组 DNA曱基化图语分析按如下进行: 从染色体水平来描述甲基化 C碱基的分布情况, 以 10kb的窗口统计曱基 化 的密度在染色体上的分布情况; 同时, 统计每 lOOkb的窗口内不同类 型曱基化 C ½( CG、 CHG和 CHH )的密度分布,不同的基因组区域具有各自 不同的生物学功能, 以热度图的形式表示曱基化分布情况, 有助于进一步了解 这些区域的曱基化图语特征, 基因的转录受不同的调控元件调控, 将所有编码 基因序列分成 7个不同的转录元件区域, 在此! ^出上对不同转录元件区域的曱 基化水平进行统计。  8. The method of claim 4, wherein said genome-wide DNA thiolation pattern analysis is performed as follows: Delineating the distribution of methylated C bases from the chromosomal level, and counting the density of thiolation at the chromosome of 10 kb on the chromosome The distribution of the upper; at the same time, the density distribution of different types of thiolized C 1⁄2 ( CG, CHG and CHH ) in each window of 100 kb is counted, and different genomic regions have different biological functions, which are expressed in the form of heat maps. The basic distribution helps to further understand the thiolated graphical features of these regions. The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different transcriptional element regions, here! The statistics on the thiolation levels of different transcriptional element regions were made. 9.权利要求 4的方法, 所述差异性曱基化区域(DMR)分析按如下进行: 通过两个样品的曱基化水平, 确定曱基化差异区域, 方法如下: 首先确定 一个 CG位点作为种子位点, 得到两个样本在该位点分别支持曱基化和不支持 曱基化四个数, 再通过卡方检验检验两个样本的曱基化水平在该位点差异是否 显著; 如果差异显著则往后延伸一个 CG位点, 得到两个样本在这两个位点分 别支持曱基化和不支持曱基化四个数, 再进行卡方检验; 以此类推循环, 当遇 到差异不再显著的点时, 便停止, 前面显著的区域即为差异性曱基化区域。 再 以此位点为种子位点循环上面过程, 从而得到差异性甲基化区域, 在找到差异 性曱基化区域后, 通过这些区域找到相对应的基因, 即曱基化差异基因, 并对 这些基因做 GO聚类功能分析, 以分析差异相关基因是否有明显的功能聚类, 即是否针对性地调控行使某类功能。  9. The method of claim 4, said differential thiolated region (DMR) analysis being performed as follows: determining the thiolation difference region by the thiolation level of the two samples, as follows: First determining a CG site As a seed site, two samples were obtained at the site to support thiolation and do not support thiolated four numbers, and then the chi-square test was used to check whether the thiolation level of the two samples was significant at the site; If the difference is significant, then a CG site is extended backwards, and two samples are supported at these two sites to support thiolation and do not support thiolated four numbers, and then perform chi-square test; When the point where the difference is no longer significant, it stops, and the area that is prominent in the front is the difference thiolized area. This site is used as a seed site to circulate the above process, thereby obtaining differential methylation regions. After finding the differential thiolated regions, the corresponding genes are found through these regions, namely the thiolated differential genes, and These genes were analyzed by GO clustering function to analyze whether the differentially related genes have obvious functional clustering, that is, whether to specifically control the exercise of certain functions. 10.权利要求 4的方法, 所述插入片段长度统计分析按如下进行: 对比对上的 reads进行长度统计, 以判断比对结果是否符合建库片段长度, 统计不同插入片段长度下 reads占总 reads数的百分比。  10. The method of claim 4, wherein the statistical analysis of the length of the insert is performed as follows: comparing the length of the reads on the pair to determine whether the comparison result is consistent with the length of the database, and counting the length of the different inserts. The percentage of the number. 11.权利要求 4的方法, 所述 CpG位点的覆盖度分析按如下进行: 在 RRBS中, 采用酶切技术得到片段再进行 BS测序, 这里对酶切的理论 片段(40-220bp )在各染色体和基因组各区域内覆盖到的 CG位点数及其比例 进行统计 , 然后统计两个样品实际得到的数据在理论位点上的覆盖度。 11. The method of claim 4, wherein the coverage analysis of the CpG sites is performed as follows: In RRBS, the fragment was obtained by enzyme digestion and then subjected to BS sequencing. Here, the theoretical fragment (40-220 bp) of the enzyme digestion was counted in the number of CG sites and their proportions in each chromosome and genome, and then the statistics were counted. The coverage of the actual data obtained from a sample at the theoretical site. 12. 一种基于重亚^ ^酸氢盐测序或减少的代表性重亚石充酸盐测序进行生物 信息学分析的装置,所述装置包括重亚硫酸氢盐测序和 /或减少的代表性重亚硫 酸盐测序模块, 用于对样品进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐 测序, 并优选过滤所述测序序列以去除不合格的序列; 数据产出分析模块, 用 于对测序数据产出做出统计, 统计内容包括转化率、 测序片段读长、 测序片段 数量和样品的总数据量等; 测序片段比对分析模块, 用于将测序得到的片段比 对到参考基因组上, 并比对结果做出统计, 统计内容包括原始片段凄 始数据 量、 比对片段数、 比对数据量、 平均比对率、 平均覆盖深度; 曱基化位点的覆 盖情况分析模块用于统计不同覆盖深度下, 被曱基化的测序序列 ¾Α的 C位点 数目占总位点数目的比例; 甲基化位点的曱基化水平分析模块, 用于统计每个 甲基化 C tt的平均曱基化水平; 全基因组曱基化数据分布趋势分析模块, 用 于得到全基因组曱基化分布趋势, 统计曱基化( 中 CG, CHG与 CHH的 数目和分布比例,其中 H代表非 G ^tt;全基因组 DNA曱基化图谱分析模块, 用于从染色体水平来描述甲基化 C碱基的分布情况; 差异性曱基化区域分析模 块, 用于通过两个样品的甲基化水平, 确定曱基化差异区域; 插入片段长度统 计分析模块, 对比对上的测序序列进行长度统计, 以判断比对结果是否符合建 库片段长度; CpG位点的覆盖度分析模块, 用于在各染色体和基因组各区域内 覆盖到的 CG位点数及其比例进行统计, 然后统计实际得到的数据在理论位点 上的覆盖度。  12. A device for bioinformatics analysis based on heavy bisulfite sequencing or reduced representative heavy sulphate acidation sequencing, the device comprising bisulfite sequencing and/or reduced representation a bisulfite sequencing module for performing bisulfite sequencing or reduced representative bisulfite sequencing of the sample, and preferably filtering the sequencing sequence to remove unqualified sequences; data output analysis module, Statistics are made on the output of the sequencing data, including the conversion rate, the length of the sequencing fragment, the number of sequencing fragments, and the total amount of data of the sample; the sequencing fragment alignment analysis module is used to compare the fragments obtained by sequencing to the reference. On the genome, and compare the results, the statistical content includes the initial fragment data volume, the number of comparison fragments, the amount of comparison data, the average alignment ratio, the average coverage depth; the coverage analysis module of the thiolated site It is used to calculate the ratio of the number of C-sites of the thiolated sequence of 3⁄4Α to the total number of sites at different depths of coverage; A horizontal analysis module for counting the average thiolation level of each methylated C tt; a genome-wide thiolation data distribution trend analysis module for obtaining a genome-wide thiolation distribution trend, statistical thiolation (中CG , the number and distribution ratio of CHG and CHH, where H represents non-G ^tt; a genome-wide DNA thiolation map analysis module for describing the distribution of methylated C bases from the chromosomal level; differential thiolation A region analysis module is used to determine the thiolation difference region by the methylation level of the two samples; the insert length statistical analysis module, and the length of the sequencing sequence on the comparison pair to determine whether the alignment result conforms to the database fragment Length; CpG site coverage analysis module, used to count the number of CG sites and their proportions covered in each chromosome and genome, and then statistically report the coverage of the actual data on the theoretical site.
PCT/CN2011/002243 2011-12-31 2011-12-31 Bs and rrbs sequencing-based bioinformatics analysis method and device Ceased WO2013097061A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/002243 WO2013097061A1 (en) 2011-12-31 2011-12-31 Bs and rrbs sequencing-based bioinformatics analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/002243 WO2013097061A1 (en) 2011-12-31 2011-12-31 Bs and rrbs sequencing-based bioinformatics analysis method and device

Publications (2)

Publication Number Publication Date
WO2013097061A1 true WO2013097061A1 (en) 2013-07-04
WO2013097061A8 WO2013097061A8 (en) 2013-08-22

Family

ID=48696160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/002243 Ceased WO2013097061A1 (en) 2011-12-31 2011-12-31 Bs and rrbs sequencing-based bioinformatics analysis method and device

Country Status (1)

Country Link
WO (1) WO2013097061A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937475A (en) * 2022-04-12 2022-08-23 桂林电子科技大学 Automatic evaluation method for error correction result of PacBio sequencing data
CN119943159A (en) * 2025-04-07 2025-05-06 东华理工大学南昌校区 A method to quantitatively identify the strength of gene interactions

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040175702A1 (en) * 2003-03-07 2004-09-09 Illumigen Biosciences, Inc. Method and apparatus for pattern identification in diploid DNA sequence data
CN102061337A (en) * 2010-11-24 2011-05-18 深圳华大基因科技有限公司 Method and system for detecting tissue-specific differentially methylated region (tDMR)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040175702A1 (en) * 2003-03-07 2004-09-09 Illumigen Biosciences, Inc. Method and apparatus for pattern identification in diploid DNA sequence data
CN102061337A (en) * 2010-11-24 2011-05-18 深圳华大基因科技有限公司 Method and system for detecting tissue-specific differentially methylated region (tDMR)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, LI ET AL.: "Systematic assessment of reduced representation bisulfite sequencing to human blood samples: A promising method for large-sample-scale epigenomic studies", JOURNAL OF BIOTECHNOLOGY, vol. 157, no. 1, 6 July 2011 (2011-07-06), pages 1 - 6 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937475A (en) * 2022-04-12 2022-08-23 桂林电子科技大学 Automatic evaluation method for error correction result of PacBio sequencing data
CN119943159A (en) * 2025-04-07 2025-05-06 东华理工大学南昌校区 A method to quantitatively identify the strength of gene interactions

Also Published As

Publication number Publication date
WO2013097061A8 (en) 2013-08-22

Similar Documents

Publication Publication Date Title
Lowe et al. Transcriptomics technologies
Zhao et al. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives
JP7051900B2 (en) Methods and systems for the generation and error correction of unique molecular index sets with non-uniform molecular lengths
Alexander et al. Annotating non-coding regions of the genome
Chan et al. Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing
AU2011207561B2 (en) Partition defined detection methods
US8036835B2 (en) Probe design methods and microarrays for comparative genomic hybridization and location analysis
Son et al. Database of mRNA gene expression profiles of multiple human organs
AU2014205038B2 (en) Noninvasive prenatal molecular karyotyping from maternal plasma
WO2018144782A1 (en) Methods of detecting somatic and germline variants in impure tumors
Corney RNA-seq using next generation sequencing
AU2012327251A1 (en) Set membership testers for aligning nucleic acid samples
Liu et al. A comprehensive evaluation of computational tools to identify differential methylation regions using RRBS data
Digby et al. Computational approaches and challenges in the analysis of circRNA data
WO2013097061A1 (en) Bs and rrbs sequencing-based bioinformatics analysis method and device
CN107563152A (en) The data analysis application system that methylates based on biological cloud platform
Wei et al. A short review of variants calling for single-cell-sequencing data with applications
Rachappanavar et al. Analytical Pipelines for the GBS Analysis
CN109979534B (en) C site extraction method and device
Jiang et al. High-performance single-chip exon capture allows accurate whole exome sequencing using the Illumina Genome Analyzer
Esim et al. Determination of malignant melanoma by analysis of variation values
CN105787294B (en) Determine method, the kit and application thereof of probe collection
Jaratlerdsiri et al. African‑specific prostate cancer molecular taxonomy
Magagula The chromatin landscape of colorectal cancer cells
Dimartino A machine learning based method to detect genomic imbalances exploiting X chromosome exome reads

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11878977

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11878977

Country of ref document: EP

Kind code of ref document: A1