WO2013097061A1

WO2013097061A1 - Bs and rrbs sequencing-based bioinformatics analysis method and device

Info

Publication number: WO2013097061A1
Application number: PCT/CN2011/002243
Authority: WO
Inventors: 吴红龙; 吴博欣; 黄树嘉; 夏渝东; 李英睿; 汪建; 王俊; 杨焕明
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2011-12-31
Filing date: 2011-12-31
Publication date: 2013-07-04
Anticipated expiration: 2014-06-30
Also published as: WO2013097061A8

Abstract

Provided in the present invention are a method and a device for analyzing methylation in a genome by bisulfite sequencing or reduced representation bisulfite sequencing. Also provided in the present invention is a bioinformatics analysis method based on bisulfite sequencing or reduced representation bisulfite sequencing. The method comprises: detecting DNA methylation, then performing the bioinformatics analysis, wherein the analysis is an analysis for one or more items selected from the following： data output, sequencing fragments alignment, the coverage situation of methylation sites, the methylation level of the methylation sites, the distribution trend of the whole genome methylation data, the whole genome DNA methylation map, the differentially methylated regions, the statistic for the length of insert fragments, and the coverage of CpG sites.

Description

基于 BS和 RRBS测序的生物信息分析方法和装置 Biological information analysis method and device based on BS and RRBS sequencing

技术领域 Technical field

本发明属于生物信息学领域，对表观遗传学中两种重要的 DNA曱基化检测方法的测序结果进行生物信息学分析。背景技术 The invention belongs to the field of bioinformatics, and performs bioinformatics analysis on the sequencing results of two important DNA thiolation detection methods in epigenetics. Background technique

重亚硫酸氢盐测序法（ Bisulfite-sequencing, BS )是 DNA曱基化研究的一项有力工具。通过重亚硫酸盐处理 DNA能够将基因组中未曱基化的 C緘基转化为 U, 并且经过 PCR.进一步转化为 T, 从而与曱基化的 C碱基区分开来。将重亚硫酸盐处理与高通量测序技术的结合，能够绘制单减基分辨率的 DNA甲基化图谱。特定物种的高精确度曱基化修饰模式的分析，必将在表因组学研究中具有里程碑式的意义，并且为细胞分化、组织发育等^出机制研究以及动植物育种研究奠定基础。 Bisulfite-sequencing (BS) is a powerful tool for DNA thiolation research. The unthiolated C thiol group in the genome can be converted to U by bisulfite treatment, and further converted to T by PCR to distinguish it from the thiolated C base. Combining bisulfite treatment with high-throughput sequencing technology, it is possible to map DNA methylation profiles with single-subtraction resolution. The analysis of high-accuracy thiolation modification patterns of specific species is bound to be a milestone in the study of epigenetics, and lays a foundation for cell differentiation, tissue development, and animal and plant breeding research.

减少的代表 ')·生重亚石克酸盐测序 ( reduced representation bisulfite sequencing, RRBS )是基于 BS的新方法。 BS测序成本高， RRBS在此基础上，通过酶切技术选取全基因组的部分区域再进行 BS测序。通过重亚石克酸盐处理 DNA能够将基因组中未曱基化的 C 转化为 U, 并且经过 PCR进一步转化为 T, 从而与曱基化的 C碱基区分开来。该方法能够绘制单基分辨率的 DNA曱基化图谱，而更具有成本优势。 The reduced representative ')·reduced representation bisulfite sequencing (RRBS) is a new method based on BS. The cost of BS sequencing is high. Based on this, RRBS selects parts of the whole genome by enzymatic cleavage technology and then performs BS sequencing. The treatment of DNA by heavy tartaric acid converts ungerminated C in the genome to U and is further converted to T by PCR to distinguish it from the thiolated C base. This method is capable of plotting a single-base resolution DNA thiolated map with a cost advantage.

本领域需要利用 BS法和 RRBS法这两种 DNA曱基化检测方法进行基因组曱基化分析的方法，并需要利用其测序结果进行生物信息学分析的方法。发明内容 In this field, two methods of DNA thiolation detection, such as BS method and RRBS method, are needed for genomic thiolation analysis, and methods for bioinformatics analysis using the sequencing results are needed. Summary of the invention

在一个方面中，本发明涉及一种通过重亚磅 υ酸氢盐测序或减少的代表性重亚硫酸盐测序分析基因组曱基化的方法，所述方法包括步骤： In one aspect, the invention relates to a method for analyzing genomic thiolation by heavy bisulfite hydrogenate sequencing or reduced representative bisulfite sequencing, the method comprising the steps of:

1 )对样品进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序，并优选过滤所述测序序列以去除不合格的序列； 1) subjecting the sample to bisulfite sequencing or reduced representative bisulfite sequencing, and preferably filtering the sequencing sequence to remove unqualified sequences;

2 )确定上述测序序列代表的曱基化位点，步骤如下： i ) 将所述各个测序序列中所有 c (胞嘧啶）转化为 T (胸腺嘧啶）碱基得到的第一测序序列；将所述各个测序序列互补序列中所有 G (鸟嘌呤）碱基转化为 A (腺嘌呤）碱基得到的第二测序序列； 2) Determine the thiolation site represented by the above sequencing sequence, the steps are as follows: i) a first sequencing sequence obtained by converting all c (cytosine) in the respective sequencing sequences into T (thymidine) bases; converting all G (guanine) bases in the complementary sequence of the respective sequencing sequences into a second sequencing sequence obtained from the A (adenine) base;

ii )将相应物种的参考基因组序列中所有 C (胞嘧啶；) ½转化为 T (胸腺嘧啶）减基生成的第一参考序列；将相应物种的参考基因组序列中所有 G (鸟嘌呤）碱基转化为 A (腺嘌呤）减基生成的第二参考序列； Ii) converting all C (cytosine;) 1⁄2 in the reference genome sequence of the corresponding species into the first reference sequence generated by T (thymidine) subtraction; all G (guanine) bases in the reference genome sequence of the corresponding species Converting to a second reference sequence generated by A (adenine) subtraction;

iii )将各个第一测序序列和第二测序序列分别和第一参考序列和第二参考序列进行比对，经过比对得到各个第一测序序列和第二测序序列在第一参考序列和第二参考序列上的位置，选择在第一参考序列和第二参考序列上有唯一比对的各个第一测序序列和第二测序序列； Iii) aligning each of the first sequencing sequence and the second sequencing sequence with the first reference sequence and the second reference sequence, respectively, and obtaining each of the first sequencing sequence and the second sequencing sequence in the first reference sequence and the second Selecting, in the position on the sequence, each of the first sequencing sequence and the second sequencing sequence having unique alignments on the first reference sequence and the second reference sequence;

IV )利用上述选择的第一测序序列和第二测序序列比对的参考基因组序列的序列信息确定第一测序序列和第二测序序列的曱基化信息，步骤如下：如果在第一测序序列和比对的参考基因组序列的相应位置都是 C (胞嘧啶）碱基，则表示在第一测序序列的该位置被曱基化；如果在第二测序序列和比对的参考基因组序列的相应位置都是 G (鸟嘌呤）碱基，则表示在第二测序序列的该位置被曱基化； IV) determining the thiolation information of the first sequencing sequence and the second sequencing sequence by using the sequence information of the first sequencing sequence and the second sequencing sequence aligned by the second sequencing sequence selected above, as follows: if in the first sequencing sequence and The corresponding positions of the aligned reference genomic sequences are C (cytosine) bases, indicating that they are thiolated at this position of the first sequencing sequence; if at the corresponding position of the second sequencing sequence and the aligned reference genome sequence All of which are G (guanine) bases, indicating that they are thiolated at this position of the second sequencing sequence;

3 )对于参考基因组序列上每个 C位点，统计覆盖该 C位点的第一测序序列或第二测序序列在相应位置被曱基化的测序序列数目和未被曱基化的测序序列数目，根据以下不等式确定该 C位点曱基化的最小期望测序序列数目，即满足下式的最小 k值：匪 C *: Q)p^k(l - p ~^k < 0.01 * mC; 其中， 3) For each C site on the reference genomic sequence, the number of sequencing sequences that are thiolated at the corresponding position by the first or second sequencing sequence covering the C site and the number of sequencing sequences that are not thiolated The number of minimum expected sequencing sequences for thiolation of the C site is determined according to the following inequality, ie, the minimum k value satisfying the following formula: 匪C *: Q)p ^k (l - p ~ ^k < 0.01 * mC; wherein

mC是覆盖该 C位点的第一测序序列或第二测序序列在相应位置被曱基化的测序序列数目， mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the C site is thiolated at the corresponding position,

nmC是覆盖该 C位点的第一测序序列或第二测序序列在相应位置未被曱基化的测序序列数目， nmC is the number of sequencing sequences that cover the C-site of the first sequencing sequence or the second sequencing sequence that is not thiolized at the corresponding position,

n为该 C位点的测序序列总数， n is the total number of sequencing sequences for this C site,

p为错误率，例如，用 C碱基均为未修饰的对照 DNA例如 lamda DNA进行上述步骤 1 ) -2)，根据 C到 T ½转化率确定 p值， p通常为 0.001-0.01，如果覆盖该 C位点的第一测序序列或第二测序序列在相应位置被曱基化的测序序列数目大于该位点的最小期望测序序列数目 , 该 C位点即为曱基化的位点，优选形成包括如下信息的结果文件： p is the error rate, for example, the above steps 1) - 2) are carried out using unmodified control DNA such as lamda DNA, and the p value is determined according to the conversion of C to T 1⁄2, p is usually 0.001-0.01, if covering The first sequencing sequence or the second sequencing sequence of the C site is thiolated at the corresponding position The number of sequencing sequences is greater than the number of minimal desired sequencing sequences at the site, which is the site of thiolation, preferably forming a result file comprising the following information:

Chr-a b +/- c d e f g h Chr-a b +/- c d e f g h

第一列 Chr-a: 染色体号， First column Chr-a: chromosome number,

第二列 b: C位点的位置， The second column b: the position of the C site,

第三列 +/-: 正负链信息， The third column +/-: positive and negative chain information,

第四列 c: C位点的类型， The fourth column c: the type of C site,

第五列 d: 具体的碱基， The fifth column d: specific bases,

第六列 e: 测序序列平均比对位置， Column 6 e: average alignment position of the sequencing sequence,

第七列 f: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置被甲基化的测序序列数目， Column 7 f: the number of sequencing sequences that are methylated at the corresponding position by the first sequencing sequence or the second sequencing sequence covering the C site,

第八列 g: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置未被曱基化的测序序列数目， The eighth column g: the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position,

第九列 h: 所述 C位点曱基化的最小期望测序序列数目。 Ninth column h: Number of minimum expected sequencing sequences for thiolation of the C site.

在另一方面中，本发明涉及一种通过检测 DNA 曱基化的方法进行生物信息学分析的方法，所述生物信息学分析选自于： In another aspect, the invention relates to a method of bioinformatics analysis by detecting a method of DNA thiolation selected from:

1. 曱基化位点的覆盖情况，所述分析可以通过以下方式进行： 1. Coverage of the thiolation site, the analysis can be performed in the following manner:

C 位点可以被分为三种类型，分别是 CG、 CHG、 CHH, 其中 H代表非 G碱基。从结果文件可以得到位点是否为甲基化位点，支持曱基化的 reads 数大于期望值的位点即为曱基化位点。统计不同覆盖深度（该位点被 reads覆盖的次数 )下 mC位点占总位点的比例。同时还对不同覆盖深度下的位点在基因组上的覆盖度进行统计，覆盖度即为位点数在全基因组上的比例。 The C site can be divided into three types, namely CG, CHG, and CHH, where H represents a non-G base. From the result file, it can be obtained whether the site is a methylation site, and the site supporting the thiolated reads number greater than the expected value is the thiolation site. The ratio of mC sites to total sites under different coverage depths (the number of times this site is covered by reads) is counted. At the same time, the coverage of the loci under different coverage depths on the genotype is also counted, and the coverage is the proportion of the loci on the whole genome.

2. 曱基化位点的曱基化水平，所述分析可以通过以下方式进行：为了了解基因组曱基化图谱的总体特征和全基因组的平均曱基化水平，统计每个曱基化 C碱基的平均曱基化水平，计算方式为： 2. The level of thiolation of the thiolation site, which can be performed by: To understand the overall characteristics of the genomic thiolation map and the average thiolation level of the genome, count each thiolated C base The average thiolation level of the base is calculated as:

支持曱基化的 reads数 /(支持曱基化的 reads数 +不支持甲基化的 reads 数) χ100。 The number of reads that support merging / (the number of reads that support merging + the number of reads that do not support methylation) χ100.

同时，统计反映曱基化信息的 reads的有效覆盖度，有效覆盖度的计算基于三个方面进行：染色体、基因区间和基因功能元件。将曱基化状态不明确的 reads排除之后，基于支持曱基化的 reads和不支持支持曱基化的 reads数据，进行有效覆盖度计算，得到统计结果。 At the same time, the statistics reflect the effective coverage of the reads of the thiolated information. The calculation of the effective coverage is based on three aspects: chromosome, gene interval and gene function elements. After the reads with ambiguous ambiguity are excluded, based on the support-supported reads and the reads data that does not support priming, The effective coverage calculation is performed, and the statistical result is obtained.

3. 全基因组曱基化数据分布趋势，所述分析可以通过以下方式进行：为了得到全基因组曱基化分布趋势，统计曱基化 C 中 CG， CHG与 CHH的数目和分布比例，即各类型曱基化的 C占总曱基化 C的百分比，例如： mCHG所占比例 = mCHG数目 /mC的总数。同时，不同分布类型的 C ¾( CG、 CHG和 CHH )，其曱基化水平在不同物种间，甚至同一物种不同细胞类型间都存在差异。因此，统计每种类型（ CG、 CHG和 CHH ) 曱基化的 C碱基曱基化水平的分布。同样，按照染色体、基因区间和基因功能元件统计各区域内每种分布类型的 C磁基（CG、 CHG和 CHH )的曱基化水平，计算公式为: 平均曱基化水平=支持甲基化的 _reads/ ⁽支持曱基化的 reads+不支持甲基化的 reads) χ100。最后，分析非 CG位点的曱基化的 C附近碱基的分布情况，统计不同曱基化模式 (CHG和 CHH)出现的概率。 3. Trends in genome-wide thiolation data distribution, which can be performed in the following manner: In order to obtain the trend of genome-wide thiolation distribution, the number and distribution ratio of CG, CHG and CHH in the thiolated C, ie each type The percentage of thiolated C to the total thiolization C, for example: proportion of mCHG = number of mCHG / total number of mC. At the same time, different distribution types of C 3⁄4 ( CG, CHG and CHH ) have different levels of thiolation between different species and even different cell types of the same species. Therefore, the distribution of the C base thiolation levels of each type (CG, CHG, and CHH) thiolation was counted. Similarly, the thiolation levels of C-magnetic groups (CG, CHG, and CHH) for each distribution type in each region are calculated by chromosome, gene interval, and gene functional elements. The calculation formula is: Average thiolation level = support for methylation The _reads / ⁽ supports simplification of reads+ does not support methylated reads) χ100. Finally, the distribution of bases near the thiolated C at non-CG sites was analyzed, and the probability of occurrence of different thiolation patterns (CHG and CHH) was counted.

4. 全基因组 DNA曱基化图谱，所述分析可以通过以下方式进行：从染色体水平来描述曱基化 C碱基的分布情况。以 10kb的窗口统计曱基化 C碱基的密度在染色体上的分布情况；同时，统计每 lOOkb的窗口内不同类型曱基化 C tt( CG、 CHG和 CHH )的密度分布。不同的基因组区域具有各自不同的生物学功能，优选以热度图的形式表示曱基化分布情况，以有助于进一步了解这些区域的曱基化图语特征。基因的转录受不同的调控元件调控，将所有编码基因序列分成 7个不同的转录元件区域，在此！ ^出上对不同转录元件区域的曱基化水平进行统计。 4. Whole genome DNA thiolation map, which can be performed by: Delineating the distribution of thiolated C bases from the chromosomal level. The density of the thiolated C bases on the chromosome was counted by a 10 kb window. At the same time, the density distributions of different types of thiolated C tt (CG, CHG, and CHH) in each lOOkb window were counted. Different genomic regions have different biological functions, and the thiolation distribution is preferably expressed in the form of a heat map to help to further understand the thiolated graphical features of these regions. The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different regions of transcriptional elements, here! ^ Statistics on the level of thiolation of different regions of the transcriptional element.

5. 差异性曱基化区域（DMR )分析，所述分析可以通过以下方式进行：通过两个样品的曱基化水平，确定曱基化差异区域，方法如下：首先确定一个 CG位点作为种子位点，得到两个样本在该位点分别支持曱基化和不支持甲基化四个数值，再通过卡方检验检验两个样本的曱基化水平在该位点差异是否显著，显著性水平的阈值 p—般是 0.05, 优选 0.01 , 更优选 0.001 , 最优选 0.0001 ; 如果差异显著则往后延伸一个 CG位点，得到两个样本在这两个位点分别支持曱基化和不支持曱基化四个数值，再进行卡方检验；以此类推循环，当遇到差异不再显著的点时，便停止，前面显著的区域即为差异性曱基化区域。再以此位点为种子位点循环上面过程，从而得到差异性曱基化区域。在找到差异性甲基化区域后，通过这些区域找到相对应的基因，即曱基化差异基因，并对这些基因做 GO聚类功能分析，以分析差异相关基因是否有明显的功能聚类，即是否针对性地调控行使某类功能。 6. 插入片段长度统计，所述分析可以通过以下方式进行： 5. Differential thiolated region (DMR) analysis, which can be performed by: Determining the thiolation difference region by the thiolation level of the two samples, as follows: First, identify a CG site as a seed At the locus, two samples were obtained to support the thiolation and the unsupported methylation values at the locus, and then the chi-squared level of the two samples was tested by the chi-square test to see whether the difference was significant or not. The horizontal threshold p is generally 0.05, preferably 0.01, more preferably 0.001, most preferably 0.0001; if the difference is significant, a CG site is extended backward, and two samples are obtained to support thiolation and unsupporting at these two sites, respectively. The 数值化四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个This site is then used as a seed site to cycle the above process to obtain a differential thiolated region. After finding the differential methylation regions, the corresponding genes, namely the thiolated differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner. 6. Insert fragment length statistics, which can be done in the following ways:

对比对上的 reads进行长度统计，以判断比对结果是否符合建库片段长度，统计不同插入片段长度下 reads占总 reads数的百分比。 Compare the length of the reads on the pair to determine whether the comparison result matches the length of the built-in fragment, and count the percentage of the total number of reads in the length of the different inserts.

7. CpG位点的覆盖度，所述分析可以通过以下方式进行： 7. Coverage of CpG sites, the analysis can be performed in the following ways:

在 RRBS中，采用酶切技术得到片段再进行 BS测序， !JL对酶切的理论片段（40-220bp )在各染色体和基因组各区域内覆盖到的 CG位点数及其比例进行统计，然后统计两个样品实际得到的数据在理论位点上的覆盖度。 In the RRBS, the fragments were obtained by restriction enzyme digestion and then subjected to BS sequencing. The theoretical fragment of the restriction enzyme digestion (40-220 bp) was counted in the chromosomal spots and their proportions in each chromosome and genome, and then counted. The coverage of the actual data obtained by the two samples at the theoretical site.

本发明还提供了一种基于重亚疏酸氢盐测序或减少的代表性重亚硫酸盐测序进行生物信息学分析的装置，所述装置包括重亚硫酸氢盐测序和 /或减少的代表性重亚硫酸盐测序模块，用于对样品进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序，并优选过滤所述测序序列以去除不合格的序列；数据产出分析模块，用于对测序数据产出做出统计，统计内容包括转化率、测序片段读长、测序片段数量和样品的总数据量等；测序片段比对分析模块，用于将测序得到的片段比对到参考基因组上，并比对结果做出统计，统计内容包括原始片段数原始数据量、比对片段数、比对数据量、平均比对率、平均覆盖深度；曱基化位点的覆盖情况分析模块用于统计不同覆盖深度下，被曱基化的测序序列覆盖的 C位点数目占总位点数目的比例；曱基化位点的甲基化水平分析模块，用于统计每个曱基化 C碱基的平均曱基化水平；全基因组曱基化数据分布趋势分析模块，用于得到全基因组曱基化分布趋势，统计曱基化 C碱基中 CG, CHG与 CHH的数目和分布比例，其中 H代表非 G ¾;全基因组 DNA曱基化图谱分析模块，用于从染色体水平来描述曱基化 C碱基的分布情况；差异性曱基化区域分析模块，用于通过两个样品的曱基化水平，确定曱基化差异区域；插入片段长度统计分析模块，对比对上的测序序列进行长度统计，以判断比对结果是否符合建库片段长度； CpG位点的覆盖度分析模块，用于在各染色体和基因组各区域内覆盖到的 CG位点数及其比例进行统计，然后统计实际得到的数据在理论位点上的覆盖度。附图说明 The invention also provides a device for bioinformatics analysis based on heavy bisulfite hydrogenation sequencing or reduced representative bisulfite sequencing, the device comprising bisulfite sequencing and/or reduced representation a bisulfite sequencing module for performing bisulfite sequencing or reduced representative bisulfite sequencing of the sample, and preferably filtering the sequencing sequence to remove unqualified sequences; data output analysis module, Statistics are made on the output of the sequencing data, including the conversion rate, the length of the sequencing fragment, the number of sequencing fragments, and the total amount of the sample; the sequencing fragment comparison module is used to compare the fragments obtained by sequencing to the reference. On the genome, and compare the results, the statistical content includes the original number of original fragments, the number of comparison fragments, the amount of comparison data, the average comparison ratio, the average coverage depth; the coverage analysis module of the thiolation site Used to calculate the ratio of the number of C-sites covered by the thiolated sequencing sequence to the total number of sites at different depths of coverage; Point methylation level analysis module for counting the average thiolation level of each thiolated C base; genome-wide thiolation data distribution trend analysis module, used to obtain the whole genome thiolation distribution trend, statistics The number and distribution ratio of CG, CHG and CHH in the thiolated C base, where H stands for non-G 3⁄4; a genome-wide DNA thiolation map analysis module for describing the distribution of thiolated C bases from the chromosomal level Situation; a differential thiolation region analysis module for determining the thiolation difference region by the thiolation level of the two samples; the insert length statistical analysis module, comparing the length of the sequencing sequence on the pair to determine the ratio Whether the result conforms to the length of the built-in fragment; the coverage analysis module of the CpG locus is used to count the number of CG sites and their proportions covered in each chromosome and genome, and then statistically obtain the actual data at the theoretical site. Coverage on. DRAWINGS

图 1 , 基于 BS测序的不同覆盖深度下位点的百分比。 Figure 1. Percentage of loci at different depths of coverage based on BS sequencing.

图 2，基于 BS测序的不同覆盖深度下位点的度。 Figure 2. Degrees of loci at different depths of coverage based on BS sequencing.

图 3，基于 BS测序的不同类型曱基化 C威基的分布比例。 Figure 3. Distribution ratios of different types of thiolated C-based groups based on BS sequencing.

图 4，基于 BS测序的曱基化 C 的曱基化水平分布。图 5，基于 BS测序的非 CG位点的曱基化 C附列特征。 Figure 4. The thiolation level distribution of thiolated C based on BS sequencing. Figure 5. Quaternized C-attached features of non-CG sites based on BS sequencing.

图 6, 示例性显示基于 BS测序的染色体上曱基化 C密度分布。 Figure 6. Illustratively shows the thiolated C density distribution on chromosomes based on BS sequencing.

图 7, 基于 BS测序的不同基因组区域的 C ½曱基化分布，以及相应的 CpG密度。 Figure 7. C 1⁄2 曱-based distribution of different genomic regions based on BS sequencing, and the corresponding CpG density.

图 8, 基于 BS测序的全基国组不同功能元件区域的曱基化水平分布。图 9, 基于 RRBS测序的插入片段长度统计。 Figure 8. The horizontal distribution of thiolation in different functional component regions of the entire base group based on BS sequencing. Figure 9. Insert length statistics based on RRBS sequencing.

图 10，基于 RRBS测序的不同覆盖深度下位点的百分比。 Figure 10. Percentage of loci at different depths of coverage based on RRBS sequencing.

图 11 , 基于 RRBS测序的覆盖到的 CG位点在基因组各区域覆盖度。 Figure 11. Coverage of covered CG loci in the genome based on RRBS sequencing.

图 12, 基于 RRBS测序的不同深度下 CpG位点的覆盖情况。具体实施方式 Figure 12. Coverage of CpG loci at different depths based on RRBS sequencing. detailed description

在本发明中 , 重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序可以通过高通量测序技术进行，例如 Illumina GA测序技术，也可以为现有的其他高通量测序技术。根据需要，去除不合格的序列是本领域技术人员根据测序的实际情况可以进行的。例如，不合格序列包括：测序质量低于预设阀值的基个数超过整条序列个数的 50%则认为是不合格序列。低质量阀值由具体测序技术及测序环境而定；序列中测序结果不确定的减基（例如 Illumina GA测序结果中的 N )个数超过整条序列减基个数的 10%则认为是不合才各序列；除样本接头序列外，与其它实验引入的外源序列比对，如各种接头序列。若序列中存在外源序列则认为是不合格序列。 In the present invention, bisulfite sequencing or reduced representative bisulfite sequencing can be performed by high throughput sequencing techniques, such as Illumina GA sequencing technology, or other existing high throughput sequencing techniques. Removal of unacceptable sequences can be performed by those skilled in the art based on the actual conditions of sequencing, as needed. For example, the unqualified sequence includes: The number of bases whose sequencing quality is lower than the preset threshold exceeds 50% of the entire sequence and is considered to be a failed sequence. The low-quality threshold is determined by the specific sequencing technology and sequencing environment; the number of subtractions (such as N in Illumina GA sequencing results) whose sequence is uncertain in sequencing is more than 10% of the number of bases in the entire sequence is considered to be inconsistent. Each sequence; in addition to the sample linker sequence, is aligned with other exogenous sequences introduced by the experiment, such as various linker sequences. If there is a foreign sequence in the sequence, it is considered to be a non-conforming sequence.

在本发明中，短序列映射程序可以是例如 SOAP ( 获自 http://soap.genomics.org.cn/ )等块射程序。 In the present invention, the short sequence mapping program may be a block program such as SOAP (available from http://soap.genomics.org.cn/).

在本发明中，对曱基化结果进行筛选可以通过例如唯一性的比较、比对长度的比较、错配数的比较，比对次数的比较等，筛选出每条序列比对结果合乎要求的比对信息，选用的筛选条件需视选用的比对软件、序列背景而定。 In the present invention, the screening results of the thiolation can be selected by, for example, unique comparison, comparison of comparison lengths, comparison of mismatches, comparison of the number of comparisons, etc., to screen out the alignment results of each sequence. For comparison information, the selection criteria should be based on the selected software and sequence background.

在本发明中， reads是指从测序仪获得的测序片段，或者称为读段。 In the present invention, reads refers to a sequence of fragments obtained from a sequencer, or a read.

在本发明中 ,根据以下不等式确定该位点曱基化的最小期望测序序列数目，即满足下式的最小 k值：腿 C * Q)p^k l一 p)ⁿ~^k < 0.01 * mC; 其巾， In the present invention, the number of minimum expected sequencing sequences of the site thiolation is determined according to the following inequality, ie, the minimum k value satisfying the following formula: leg C * Q) p ^k l - p) ⁿ ~ ^k < 0.01 * mC; Its towel,

k为支持曱基化 reads数的期望值，它反映了在一定错误率下的最小支持曱基化测序序列数目， k is the expected value to support the number of singulated reads, which reflects the minimum support at a certain error rate曱 Number of base sequencing sequences,

mC是覆盖该位点的第一测序序列或第二测序序列在相应位置被曱基化的测序序列数目， mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the site is thiolated at the corresponding position,

nmC是覆盖该位点的第一测序序列或第二测序序列在相应位置未被曱基化的测序序列数目， nmC is the number of sequencing sequences that cover the first sequencing sequence or the second sequencing sequence of the site that are not deuterated at the corresponding position,

n为该位点的测序序列总数， n is the total number of sequencing sequences at this site,

p为错误率，例如，用 C碱基均为未修饰的对照 DNA例如 lamda DNA进行上述步骤 1 ) -2), 根据 C到 T tt转化率确定 p值， p通常为 0.001-0.01，上式基于二项分布得到确定该位点曱基化的最小期望测序序列数目。为了详细说明本发明的目的、技术方案及具体步骤，以下结合附图及实施例，对本发明进行详细阐述，应当理解，此处所描述的具体实施例仅用以解释本发明，并不用于限定本发明。实施例一，基于 BS测序的生物信息分析方法样本：两个人的样本 p is the error rate, for example, the above steps 1) - 2) are carried out using unmodified control DNA such as lamda DNA, and the p value is determined according to the conversion rate of C to T tt, p is usually 0.001-0.01, The number of minimum expected sequencing sequences that determine the thiolation of the site is determined based on the binomial distribution. The present invention will be described in detail below with reference to the accompanying drawings and embodiments. invention. Example 1 Bio-information analysis method based on BS sequencing Sample: Sample of two people

具作錄 Record

一、数据预处理 First, data preprocessing

1.过滤测序序列； 1. Filter the sequencing sequence;

在本实施例中，编号为 NB和 103的两个人血液样品 DNA在深圳华大基因研究院进行重亚硫酸氢盐测序或减少的代表性重亚硫酸盐测序，其中测序技术为 Illumina GA测序技术。测序分别得到 657.2兆和 660.5兆条序列，然后对测序序列进行过滤，去除不合格的序列，分别得到 575.46兆和 576.12兆条合格序列。不合格序列包括：测序质量低于预设阀值（本例中为 66 )的个数超过整条序列减基个数的 50%。序列中测序结果不确定的碱基（在本实施例中为 Illumina GA测序结果中的 N )个«£过整条序列个数的 10%; 除样本接头序列外，其它实验引入的外源序列；序列中存在外源序列。 In this example, two human blood sample DNAs numbered NB and 103 were subjected to bisulfite sequencing or reduced representative bisulfite sequencing at Shenzhen Huada Gene Research Institute, wherein the sequencing technology was Illumina GA sequencing technology. . Sequencing yielded 657.2 trillion and 660.5 trillion sequences, respectively, and then the sequencing sequence was filtered to remove the unqualified sequences, yielding 575.46 megabytes and 576.12 megabytes of qualified sequences, respectively. The unqualified sequence includes: The number of sequencing quality below the preset threshold (66 in this example) exceeds 50% of the entire sequence minus the number of bases. Bases with undefined sequencing results in the sequence (in this example, N in Illumina GA sequencing results) are 10% of the entire number of sequences; exogenous sequences introduced by other experiments except sample linker sequences ; There is an exogenous sequence in the sequence.

2.确定上述过滤后的测序序列代表的曱基化信息 2. Determine the thiolation information represented by the above-described filtered sequencing sequence

在本实施例中使用的物种的参考基因组序列是人的基因组 hgl8,来自数据库 UCSC ( http://genome.ucsc.edu/ )。对过滤后的测序序列和所述参考序列进行预处理，步骤如下： i ) 将所述各个测序序列中所有 C (胞嘧啶）碱基转化为 T (胸腺嘧啶）碱基得到的序列定义为 fql; 将所述各个测序序列互补序列中所有 G (鸟嘌呤）碱基转化为 A (腺嘌呤） tt得到的序列定义为 fq2，示意图如下： fql The reference genomic sequence of the species used in this example is the human genome hgl8, from the database UCSC (http://genome.ucsc.edu/). The filtered sequencing sequence and the reference sequence are preprocessed as follows: i) a sequence obtained by converting all C (cytosine) bases in the respective sequencing sequences into T (thymidine) bases is defined as fql; all G (guanine) bases in the complementary sequence of the respective sequencing sequences The sequence obtained by conversion to A (adenine) tt is defined as fq2, and the schematic is as follows: fql

iq2

Iq2

在图中， C^m表示甲基化的 C (胞嘧啶） In the figure, C ^m represents methylated C (cytosine)

ii )将所述参考基因组序列中所有 c (胞嘧啶）转化为 τ (胸腺嘧啶） tt生成的序列定义为 refl; 将所述参考基因组序列中所有 G (鸟嘌呤） ½ 转化为 A (腺嘌呤） ½生成的序列定义为 re 2; Ii) converting the sequence generated by converting all c (cytosine) in the reference genome sequence to τ (thymidine) tt as refl; converting all G (guanine) 1⁄2 in the reference genome sequence into A (adenine) The sequence generated is defined as re 2;

iii )将各个序列 fql和 fq2分别和参考序列 refl和 ref2进行比对，经过比对得到各个序列 fql和 fq2在转化后的参考基因组序列上的位置信息，保留在转化后的参考基因组序列上有唯一比对的各个序列 fql和 fq2，示意图如下： Iii) aligning each sequence fql and fq2 with reference sequences refl and ref2, respectively, and obtaining positional information of each sequence fql and fq2 on the transformed reference genomic sequence, which is retained on the transformed reference genome sequence. The unique sequences fql and fq2 are as follows:

refl fql Refl fql

M M M M

i;ef2 iv )将一个序列 fql和 fq2在转化后的参考基因组序列上的位置信息参考序列在相应位置处的序列信息相结合，确定该序列 fql和 fq2的曱基化信息：如果在该序列 fql上和在参考序列相应位置上都是 C (胞嘧啶）碱基，则表示在该序列 fql的该位置被曱基化；如果在该序列 fq2上和在参考序列相应位置都是 G (鸟嘌呤）碱基，则表示在该序列 fq2的该位置被甲基化，示意图如下： i;ef2 iv) combining the sequence information of a sequence fql and fq2 on the transformed reference genome sequence with the sequence information at the corresponding position to determine the thiolation information of the sequences fql and fq2: if in the sequence Both fql and C (cytosine) bases at the corresponding positions in the reference sequence indicate that the position of the sequence fql is thiolated; if the sequence fq2 and the corresponding position in the reference sequence are G (bird)嘌呤) base, which means that the position of the sequence fq2 is methylated, the schematic is as follows:

fq2. T CC ACC C TO C S¾ A^ ¾CC Gft^AT CC GAAiSC TG G T AC AT CACj. Fq2. T CC ACC C TO C S3⁄4 A^ 3⁄4CC Gft^AT CC GAAiSC TG G T AC AT CACj.

T TCCACCC TC C T A& C GSSG CC TGA ½C C C T AC AT C¾G 序列在图中， *号和 #号表示未曱基化， C^ra表示曱基化的 C (胞嘧啶）。 3.经过上述处理，样品 NB得到 498.82兆有效序列，样品 103得到 49—1.28 兆有¾ ^列，能覆盖整个基因组确定参考基因组序列上曱基化位点。 T TCCACCC TC CT A & C GSSG CC TGA 1⁄2C CCT AC AT C3⁄4G Sequence In the figure, * and # indicate undensylated, and C ^ra denotes thiolated C (cytosine). 3. After the above treatment, the sample NB obtained the 498.82 megabyte effective sequence, and the sample 103 obtained 49-1.28 megabytes with 3⁄4 ^ column, which can cover the entire genome to determine the thiolation site of the reference genome sequence.

对于参考基因组上每个位点，统计在该位点支持曱基化的测序序列数 mC 和不支持甲基化的测序序列数 nmC,根据二项分布得到以下不等式确定该位点是否是甲基化位点： nmC * ( )p^k(t - p)ⁿ-^k < 0.01 * mC; 其中， n为该位点的测序序列总数， p为错误率（o.oi )， k为支持曱基化 reads数的期望值， mC为支持曱基化的测序序列数， nmC为不支持曱基化的测序序列数； For each locus on the reference genome, the number of sequencing sequences supporting the thiolation at this site, mC, and the number of sequencing sequences that do not support methylation, are counted. The following inequality is obtained from the binomial distribution to determine whether the site is methyl. Site: nmC * ( )p ^k (t - p) ⁿ - ^k < 0.01 * mC; where n is the total number of sequencing sequences at this site, p is the error rate (o.oi), and k is the supporting thiol group. The expected number of reads, mC is the number of sequencing sequences that support thiolation, and nmC is the number of sequencing sequences that do not support thiolation;

通过上述公式得到参考基因组序列上每个位点支持曱基化 reads数的期望值 k, 对于某个位点如果支持曱基化的测序序列数大于该位点的期望值，该位点即为曱基化的位点，优选形成包括如下信息的结果文件 *.cout: The expected value k of the number of thiolated reads supported by each site on the reference genome sequence is obtained by the above formula. For a certain site, if the number of sequencing sequences supporting thiolation is greater than the expected value of the site, the site is a thiol group. The resulting site preferably forms a result file *.cout including the following information:

Chr-a b +/- c d e f g h 第一列 Chr-a: 染色体号， Chr-a b +/- c d e f g h First column Chr-a: chromosome number,

第四列 c: C位点的类型， The fourth column c: the type of C site,

第五列 d: 具体的 , The fifth column d: specific,

第七列 f: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置被曱基化的测序序列数目， Column 7 f: the number of sequencing sequences that cover the first or the second sequencing sequence of the C site and are thiolated at the corresponding position,

结果如下： The results are as follows:

样品 NB甲基化 CG位点为 36.47兆，占总 CG位点的 96.88%;曱基化 CHG 位点为 0.19兆，占总 CHG位点的 0.51%; 曱基化 CHH位点为 0.98兆，占总 CHH位点的 2.62%。 The NB methylation CG site of sample NB was 36.47 megabytes, accounting for 96.88% of the total CG site; the thiolated CHG site was 0.19 megabytes, accounting for 0.51% of the total CHG site; the thiolated CHH site was 0.98 megabytes. It accounts for 2.62% of the total CHH locus.

样品 103曱基化 CG位点为 35.01兆，占总 CG位点的 97.56%;曱基化 CHG 位点为 0.15兆，占总 CHG位点的 0.43%; 曱基化 CHH位点为 0.72兆，占总 CHH位点的 2.02%。 Sample 103 thiolated CG site was 35.01 megabytes, accounting for 97.56% of the total CG site; thiolated CHG site was 0.15 megabytes, accounting for 0.43% of the total CHG site; thiolated CHH site was 0.72 megabytes, Total 2.02% of the CHH locus.

二、生物信息分析 Second, biological information analysis

1.数据产出 Data output

对两个 jk液样本 DNA的测序数据产出做出统计，统计内容包括转化率、测序片段读长、测序片段数量和样品的总数据量等。具体的结果参见表 1。 Statistics were made on the sequencing data of the DNA of the two jk liquid samples, including the conversion rate, the length of the sequencing fragment read, the number of sequencing fragments, and the total amount of data in the sample. See Table 1 for the specific results.

2.测序片段比对 2. Sequencing fragment alignment

将测序得到的片段比对到该物种的基因组上，具体比对方法见数据预处理中的第 2步：序列在基因组上位置的确定。在完成数据比对后，对两个人样品的比对结果做出统计，统计内容包括原始片段数（ Raw reads )、原始数据量（ Raw bases )、比对片段数 ( Mapped reads )、比对 t据量 ( Mapped bases )、平均比对率 ( Average mapped rate ) 、平均覆盖深度 ( Average coverage depth(X), 全基因组每个位点被 reads覆盖到的平均次数 )具体结果参见表 2。 The sequenced fragments are aligned to the genome of the species. For specific alignment methods, see Step 2 of Data Preprocessing: Determination of the position of the sequence on the genome. After the data comparison is completed, the comparison results of the two human samples are counted, and the statistics include Raw reads, Raw bases, Mapped reads, and comparisons t. See Figure 2 for the results of the Mapped bases, Average mapped rate, and Average coverage depth (X), the average number of times the entire genome is covered by reads.

3.曱基化位点的覆盖情况 3. Coverage of thiolated sites

C ½位点可以被分为三种类型，分别是 CG、 CHG、 CHH, 其中 H代表非 G碱基。从 *.cout文件中我们可以得到位点是否为曱基化位点，支持曱基化的 reads数大于期望值的位点即为曱基化位点。统计不同覆盖深度（该位点被 reads覆盖的次数）下 mC位点占总位点的比例，统计结果如图 1所示，横轴代表大于等于该覆盖深度，纵轴代表相应深度下位点的比例。同时还对不同覆盖深度下的位点在基因组上的覆盖度进行统计，覆盖度即为位点数在全基因组上的比例。统计结果如图 2所示，横轴^ 该覆盖深度，纵轴代表相应深度下位点的覆盖度。 C 1⁄2 sites can be divided into three types, namely CG, CHG, CHH, where H represents a non-G base. From the *.cout file we can get whether the site is a thiolated site, and the site that supports the thiolized reads greater than the expected value is the thiolation site. The ratio of mC locus to the total locus under different coverage depths (the number of times the locus is covered by reads) is counted. The statistical result is shown in Figure 1. The horizontal axis represents greater than or equal to the coverage depth, and the vertical axis represents the corresponding depth. proportion. At the same time, the genomic coverage of the sites under different coverage depths is also counted, and the coverage is the proportion of the number of sites on the full genomics. The statistical results are shown in Fig. 2, the horizontal axis ^ the depth of coverage, and the vertical axis represents the coverage of the site under the corresponding depth.

4. 曱基化位点的曱基化水平 4. The thiolation level of the thiolated site

为了了解基因组曱基化图谱的总体特征和全基因组的平均曱基化水平，统计每个甲基化 C碱基的平均曱基化水平，计算方式为： To understand the overall characteristics of the genomic thiolation map and the average thiolation level of the whole genome, the average thiolation level of each methylated C base is calculated by:

支持曱基化的 reads数 /(支持曱基化的 reads数 +不支持曱基化的 reads 数 )xl'00。具体结果如表 3所示。 The number of reads that support merging / (the number of reads that support merging + the number of reads that do not support merging) xl'00. The specific results are shown in Table 3.

同时，统计反映曱基化信息的 reads的有¾ ^度，有盖度的计算基于三个方面进行：染色体、基因区间和基因功能元件。将曱基化状态不明确的 reads排除之后，基于支持曱基化的 reads和不支持支持曱基化的 reads数据，进行有效覆盖度计算，得到统计结果，如表 4所示。 At the same time, the statistics reflect the reads of the priming information by 3⁄4^ degrees, and the coverage calculation is based on three aspects: chromosomes, gene intervals, and gene functional elements. After the reads with ambiguous state are excluded, the effective coverage calculation is performed based on the read-supported reads and the reads data that does not support the support, and the statistical results are obtained, as shown in Table 4.

5.全基因组甲基化数据分布趋势 5. Whole genome methylation data distribution trend

为了得到全基因组曱基化分布趋势，统计曱基化 C碱基中 CG， CHG与 CHH的数目和分布比例，即各类型甲基化的 C占总曱基化 C的百分比，例如： mCHG所占比例 = mCHG数目 /mC的总数。统计结果如表 5和图 3所示。 In order to obtain the trend of genome-wide thiolation distribution, statistics of thiolated C bases in CG, CHG and The number and distribution ratio of CHH, that is, the percentage of each type of methylated C to the total thiolation C, for example: the proportion of mCHG = the number of mCHG / the total number of mC. The statistical results are shown in Table 5 and Figure 3.

同时，不同分布类型的 C CG、 CHG和 CHH )，其曱基化水平在不同物种间，甚至同一物种不同细胞类型间都存在差异。因此，统计每种类型 ( CG、 CHG和 CHH ) 曱基化的 C曱基化水平的分布，统计结果如图 4所示：其中，横轴表示曱基化水平从低到高，以每 10%为一档，纵轴表示某一特定曱基化水平的 C ¾ ^所有曱基化 C 中所占的比例。同样，按照染色体、基因区间和基因功能元件统计各区域内每种分布类型的 C碱基（ CG、 CHG和 CHH )的曱基化水平,计算公式为：平均曱基化水平 =支持曱基化的 reads/ (支持曱基化的 reads+不支持曱基化的 reads)xl00。统计结果如表 6所示。 At the same time, different distribution types of C CG, CHG and CHH ) have different levels of thiolation between different species and even different cell types of the same species. Therefore, the distribution of the thiolation of each type (CG, CHG, and CHH) is statistically calculated. The statistical results are shown in Figure 4: where the horizontal axis represents the thiolation level from low to high, for every 10 % is a file, and the vertical axis represents the proportion of C 3⁄4 ^ all thiolation C of a particular thiolation level. Similarly, the thiolation levels of C bases (CG, CHG, and CHH) for each distribution type in each region are calculated by chromosome, gene interval, and gene function elements, and the formula is: Average thiolation level = support thiolation The read/ (supports simplistic reads+ does not support simplified reads) xl00. The statistical results are shown in Table 6.

最后，分析非 CG位点的曱基化的 C附近碱基的分布情况，统计不同曱基化模式 (CHG和 CHH)出现的概率，结果如图 5所示。 Finally, the distribution of bases near the thiolated C of non-CG sites was analyzed, and the probability of occurrence of different thiolation patterns (CHG and CHH) was statistically analyzed. The results are shown in Fig. 5.

6.全基因组 DNA曱基化图谱 6. Whole genome DNA thiolation map

从染色体水平来描述曱基化 C威基的分布情况。以 10kb的窗口统计甲基化( 的密度在染色体上的分布情况；同时，统计每 lOOkb的窗口内不同类型曱基化 C减基 ( CG、 CHG和 CHH )的密度分布，然后以画图软件作图，结果如图 6所示。 The distribution of thiolated C-velocines is described from the chromosomal level. The distribution of density on the chromosome was counted by a 10 kb window; at the same time, the density distribution of different types of thiolated C minus groups (CG, CHG, and CHH) in each lOOkb window was counted, and then plotted by drawing software. Figure, the results are shown in Figure 6.

不同的基因组区域具有各自不同的生物学功能，以热度图的形式表示甲基化分布情况，有助于进一步了解这些区域的曱基化图语特征。 Different genomic regions have different biological functions, and the methylation distribution in the form of heat maps helps to further understand the thiolated graphical features of these regions.

结果如图 7所示：其中，每张图表示一种基因组区域的热度图分布特征，其中 n表示参与分析的 CpG个数（每个 CpG在单条链上的覆盖深度均大于等于 4X )。横轴表示每 200bp大小窗口中的 CpG个数 (即 CpG密度），纵轴表示 CpG的 C ½平均曱基化水平。图中的黑色细线表示在特定 CpG密度下，这个窗口中的 c碱基平均曱基化水平的中位数。 B区从浅到深表示特定曱基化水平和 CpG密度下的 CpG个数。上方的 A区表示 CpG密度的分布,映射到横轴上，右侧的 C区表示曱基化水平的分布，并且映射到纵轴上。 The results are shown in Figure 7: where each graph represents the heat map distribution characteristics of a genomic region, where n represents the number of CpGs involved in the analysis (each CpG has a coverage depth greater than 4X on a single strand). The horizontal axis represents the number of CpGs per 200 bp window (ie, CpG density), and the vertical axis represents the C 1⁄2 average thiolation level of CpG. The black thin line in the figure indicates the median of the average thiolation level of c bases in this window at a specific CpG density. From the shallow to the deep, the B region indicates the number of CpGs at a specific thiolation level and CpG density. The upper A area represents the distribution of CpG density, mapped to the horizontal axis, and the right C area represents the distribution of the thiolation level and is mapped to the vertical axis.

基因的转录受不同的调控元件调控，将所有编码基因序列分成 7个不同的转录元件区域，在此！^出上对不同转录元件区域的曱基化水平进行统计。 DNA 曱基化水平在不同功能区的分布特点有助于从全基因组水平去了解不同区域的 DNA曱基化修饰的作用。统计结果如图 8所示：横轴表示 7个不同的转录元件区，纵轴表示特定区域各位点的平均曱基化水平，每一个红点即为一个 bin (人为划定的一定长度的区间）的平均曱基化水平，曲线表示的是平均 5-bin的移动， a与 b之间的虚线即为 TSS (转录起始位点）。 The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different regions of transcriptional elements, here! ^ Statistics on the level of thiolation of different transcriptional element regions. The distribution of DNA thiolation levels in different functional regions helps to understand the role of DNA thiolation modifications in different regions from the genome-wide level. The statistical results are shown in Fig. 8: the horizontal axis represents 7 different transcription element regions, and the vertical axis represents the average thiolation level of each point in a specific region, and each red dot is a bin (person For the average thiolization level of a defined length interval, the curve represents the average 5-bin shift, and the dashed line between a and b is the TSS (transcription start site).

7.差异性甲基化区域（DMR )分析 7. Differential methylation region (DMR) analysis

通过两个样品的曱基化水平，确定甲基化差异区域，方法如下：首先确定一个 CG位点作为种子位点，得到两个样本在该位点分别支持曱基化和不支持曱基化四个数值，再通过卡方检验检验两个样本的曱基化水平在该位点差异是否显著；如果差异显著则往后延伸一个 CG位点，得到两个样本在这两个位点分别支持曱基化和不支持曱基化四个数值，再进行卡方 4佥验；以此类推循环，当遇到差异不再显著的点时，便停止，前面显著的区域即为差异性曱基化区域。再以此位点为种子位点循环上面过程，从而得到差异性曱基化区域。 The methylation difference region was determined by the thiolation level of the two samples as follows: First, a CG site was identified as a seed site, and two samples were obtained to support thiolation and thiolization at the site, respectively. Four values were used to check whether the thiolation level of the two samples was significant at the site by a chi-square test; if the difference was significant, a CG site was extended backward, and two samples were supported at these two sites.曱化和和和和和四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个四个Area. This site is then used as a seed site to cycle the above process, resulting in a differential thiolated region.

在找到差异性曱基化区域后，通过这些区域找到相对应的基因，即曱基化差异基因，并对这些基因做 GO聚类功能分析，以分析差异相关基因是否有明显的功能聚类，即是否针对性地调控行使某类功能。实施例结果： After finding the differential thiolated regions, the corresponding genes, namely the thiolated differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner. Example results:

表 1 , 两个样本的数据产出统计 Table 1, data output statistics for two samples

表 2，比对结果统计 Table 2, comparison results statistics

表 3 , 不同类型曱基化 C威基的平均甲基化水平

Table 3, average methylation levels of different types of thiolated C-Wiki

表 4, 反映曱基化信息的 reads的有^ ^盖度 Table 4, the readings that reflect the singularization information

表 5，不同分布类型曱基化 C的数量及比例 Table 5, Number and proportion of different distribution types thiolated C

不同基因区域上各类型 C的曱基化水平 The level of thiolation of various types of C in different gene regions

实施例二，基于 RRBS测序的生物信息分析方法样本： d液样本 DNA

Example 2: Samples of biological information analysis methods based on RRBS sequencing: d liquid sample DNA

具作緣 Have a relationship

一、数据预处理 First, data preprocessing

1.过滤测序序列；在本实施例中，编号为 YH的 i液样品 DNA在深圳华 λ ^因研究院进行重亚疏酸氢盐测序或减少的代表性重亚疏酸盐测序，其中测序技术为 Illumina GA测序技术。测序得到 94.5兆条序列，然后对测序序列进行过滤，去除不合格的序列，得到 83.62兆条合格序列。不合格序列包括：测序质量低于预设阀值（本例中为 66 )的基个数超过整条序列减基个数的 50%。序列中测序结果不确定的减基 (在本实施例中 Illumina GA测序结果中的 N )个数超过整条序列 ½个数的 10%; 除样本接头序列外，其它实验引入的外源序列；序列中存在外源序列则认为是不合格序列。 1. Filter the sequencing sequence; In this example, the liquid sample DNA numbered YH is subjected to heavy subhydro acid hydrogenation sequencing or reduced representative heavy acid sulfite sequencing in the Shenzhen Hua λ ^ Institute, wherein the sequencing technology is Illumina GA sequencing technology. . Sequencing yielded 94.5 mega-sequences, and the sequencing sequence was then filtered to remove unqualified sequences, resulting in 83.62 mega-qualified sequences. The unqualified sequence includes: The number of bases whose sequencing quality is lower than the preset threshold (66 in this example) exceeds 50% of the number of bases minus the entire sequence. The number of subtractions (in the present example, the number of N in the Illumina GA sequencing results) in the sequence is more than 10% of the entire sequence; in addition to the sample linker sequence, the exogenous sequence introduced by other experiments; The presence of an exogenous sequence in the sequence is considered a non-conforming sequence.

2.确定上述过滤后的测序序列代表的甲基化信息 2. Determine the methylation information represented by the above filtered sequencing sequence

在本实施中使用的物种的参考基因组序列是人的基因组 hgl8，来自数据库 UCSC。对过滤后的测序序列和所述的物种参考序列进行预处理，步骤如下： i )将所述各个测序序列中所有 C (胞嘧啶） tt转化为 T (胸腺嘧啶）碱基得到的序列定义为 fql; 将所述各个测序序列互补序列中所有 G (鸟嘌呤）碱基转化为 A (腺嘌呤） ^^得到的序列定义为 fq2, 示意图如下： fql ΘΘΤ鍵 βΆΘΘΑΤΤΜΤ霞 (¾'ΤΪ職 ¾¾$T T ¾¾S灘 STAGI^ ββ T TeC¾ CCTC 請職 TO嚷 "GCT_:Cft¾¾TCG:簡麵顯在图中， C^m表示曱基化的 C (胞嘧啶） The reference genomic sequence of the species used in this embodiment is the human genome hgl8, from the database UCSC. The filtered sequencing sequence and the species reference sequence are pretreated as follows: i) the sequence obtained by converting all C (cytosine) tt in the respective sequencing sequences into T (thymine) bases is defined as Fql; The sequence obtained by converting all G (guanine) bases in the complementary sequence of each sequencing sequence into A (adenine) ^^ is defined as fq2, and the schematic is as follows: fql ΘΘΤ bond βΆΘΘΑΤΤΜΤ霞(3⁄4'ΤΪ职3⁄43⁄4$ TT 3⁄43⁄4S Beach STAGI^ ββ T TeC3⁄4 CCTC Job TO嚷"GCT _: Cft3⁄43⁄4TCG: The simplified surface is shown in the figure, C ^m represents the thiolated C (cytosine)

ii )将所述参考基因组序列中所有 C (胞嘧啶） ½转化为 T (胸腺嘧啶）生成的序列定义为 refl; 将所述参考基因组序列中所有 G (鸟嘌呤）减基转化为 A (腺嘌呤）生成的序列定义为 ref2; Ii) converting all C (cytosine) 1⁄2 in the reference genomic sequence into T (thymine) to generate a sequence defined as refl; converting all G (guanine) subunits in the reference genome sequence to A (gland)嘌呤) The generated sequence is defined as ref2;

_r

_ef2 T iv )将一个序列 fql和 fq2在转化后的参考基因组序列上的位置信息参考序列在相应位置处的序列信息相结合，确定该序列 fql和 fq2的甲基化信息：如果在该序列 fql上和在参考序列相应位置上都是 C (胞嘧啶）碱基，则表示在该序列 fql的该位置被曱基化；如果在该序列 fq2上和在参考序列相应位置都是 G 则表示在该序列 fq2的该位置被曱基化示意图如下：

_r

_e f2 T Iv) combining the sequence information of a sequence fql and fq2 on the transformed reference genome sequence with the sequence information at the corresponding position to determine the methylation information of the sequences fql and fq2: if and in the sequence fql If the C (cytosine) base is at the corresponding position of the reference sequence, it means that the position of the sequence fql is thiolated; if both the sequence fq2 and the corresponding position in the reference sequence are G, the sequence is represented. The position of fq2 is singulated as follows:

，權麵 i^eG^TceT:!^卿:雄聽 ¾?¾：¾3¾¾ ¾¾|0<≤^¾3^: ¾聽 AT:e 2: τ^Αε^'Θ σϊ^^ΐ^Ε^^^βόδ^^τ,ίρί^^^τΐ^^漏;^ [" 翔：在图中， *号和 #号表示未曱基化， C^m表示曱基化的 C (胞嘧啶） ½。 3. 经过上述处理，样品 YH得到 63.45兆有¾ ^列，能覆盖整个基因组的 6.2%的 CG位点。 , 权面 i^eG^TceT:!^卿: 雄听3⁄4?3⁄4:3⁄433⁄43⁄4 3⁄43⁄4|0<≤^3⁄43 ^: 3⁄4 listen AT:e 2: τ^Αε^'Θ σϊ^^ΐ^Ε^^^ Βόδ^^τ, ίρί^^^τΐ^^漏;^ [" Xiang: In the figure, * and # indicate undensified, and C ^m denotes thiolated C (cytosine) 1⁄2. After the above treatment, the sample YH obtained 63.45 megabytes with 3⁄4 ^ columns, covering 6.2% of the CG sites of the entire genome.

由于 RRBS测序因酶切片段进行 BS测序，因此我们需要对得到的有效序列进行酶切位点的过滤，才艮据不同的醇得到不同的理^ t刀点，这里使用了 Mspl识别 CCGG位点的酶。对有效序列的两端进行判断是否还有为 Mspl的酶切位点，如果其中一端还有该醇切位点则认为该有效序列为正常酶切片段，用于后续的分析。 Since RRBS sequencing is performed by BS sequencing of the digested fragments, we need to filter the obtained effective sequences by cleavage sites, and then obtain different cleavage points according to different alcohols. Here, Mspl is used to identify CCGG sites. Enzyme. Both ends of the effective sequence are judged whether there is a restriction site for Mspl. If one of the ends has the alcohol cleavage site, the effective sequence is considered to be a normal restriction fragment, which is used for subsequent analysis.

3.确定参考基因组序列上曱基化位点 3. Identify the thiolation site on the reference genome sequence

对于参考基因组上每个位点，统计在该位点支持曱基化的测序序列数 mC 和不支持曱基化的测序序列数 nmC，据二项分布得到以下不等式确定该位点是否是曱基化位点： nmC * ()p^k(l - v ~^k < 0.01 * mC; 其中， n为该位点的测序序列总数， p为错误率（本例中为 0.01 ) , k为支持曱基化 reads数的期望值， mC为支持曱基化的测序序列数， nmC为不支持曱基化的测序序列数； For each locus on the reference genome, the number of sequencing sequences supporting the thiolation at this site, mC, and the number of sequencing sequences that do not support thiolation, are counted. According to the binomial distribution, the following inequality is used to determine whether the site is a sulfhydryl group. Site: nmC * ()p ^k (l - v ~ ^k < 0.01 * mC; where n is the total number of sequencing sequences at this site, p is the error rate (0.01 in this case), k is the supporting thiol group The expected number of reads, mC is the number of sequencing sequences that support thiolation, and nmC is the number of sequencing sequences that do not support thiolation;

通过上述公式得到参考基因组序列上每个位点支持曱基化 reads数的期望值 k，对于某个位点如果支持曱基化的测序序列数大于该位点的期望值，该位点即为曱基化的位点，优选形成包括如下信息的结果文件 *.cout: Chr-a b +/- c d e f g h 第一列 Chr-a: 染色体号， The expected value k of the number of thiolated reads supported by each site on the reference genome sequence is obtained by the above formula. For a certain site, if the number of sequencing sequences supporting thiolation is greater than the expected value of the site, the site is a thiol group. The resulting site preferably forms a result file *.cout including the following information: Chr-a b +/- cdefgh First column Chr-a: chromosome number,

第四列 c: C位点的类型， The fourth column c: the type of C site,

第五列 d: 具体的^ , The fifth column d: the specific ^,

第七列 f: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置被曱基化的测序序列数目， Column 7 f: the number of sequencing sequences that are ligated at the corresponding position by the first sequencing sequence or the second sequencing sequence covering the C site,

第八列 g: 覆盖所述 C位点的第一测序序列或第二测序序列在相应位置未被曱基化的测序序列数目， Column 8 g: the number of sequencing sequences covering the first or the second sequencing sequence of the C site that are not thiolated at the corresponding position,

结果如下： The results are as follows:

样品 YH甲基化 CG位点为 1.84兆，占覆盖到的 CG位点的 52.93%; 曱基化 CHG位点为 0.013兆，占总 CHG位点的 0.49%; 曱基化 CHH位点为 0.036 兆，占总 CHH位点的 2.55%。 The YH methylation CG site of the sample was 1.84 megabytes, accounting for 52.93% of the covered CG sites; the thiolated CHG site was 0.013 megagrams, accounting for 0.49% of the total CHG sites; the thiolated CHH site was 0.036. Mega, accounting for 2.55% of the total CHH locus.

二、生物信息分析 Second, biological information analysis

1.数据产出 Data output

对两个 jk液样本 DNA的测序数据产出分不同酶切片段文库做出统计，统计内容包括测序片段读长、测序片段数量和样品的总数据量等。具体的结果参见表 7。 The sequencing data of the DNA of the two jk liquid samples were generated by counting the libraries of different enzyme fragments, and the statistics included the length of the sequencing fragments, the number of sequencing fragments, and the total amount of samples. See Table 7 for specific results.

2. 测序片段比对 2. Sequencing fragment alignment

将测序得到的片段比对到该物种的基因组上，具体比对方法见数 The sequenced fragments are aligned to the genome of the species, and the specific comparison method is described.

据预处理中的第 2步：序列在基因组上位置的确定。在完成数据比对后，对两个人样品的比对结果做出统计，统计内容包括原始片段数（Raw reads ) 、原始数据量（Raw Data )、比对片段数 ( Mapped reads )、比对¾据量（Mapped bases )、比对率 ( Mapped rate )、醉切率 ( Enzyme cutting rate )具体结果参见表 8。 According to step 2 of the pretreatment: the determination of the position of the sequence on the genome. After the data comparison is completed, the comparison results of the two human samples are counted, and the statistics include the Raw reads, the Raw Data, the Mapped reads, and the comparisons. See Table 8 for specific results of Mapped bases, Mapped rate, and Enzyme cutting rate.

3. 插入片段长度统计 3. Insert length statistics

对比对上的 reads进行长度统计，以判断比对结果是否符合建库片段长度，统计不同插入片段长度下 reads占总 reads数的百分比。统计结果如图 9所示。 4. 曱基化位点的覆盖情况 Compare the length of the reads on the pair to determine whether the comparison results match the length of the database. Calculate the percentage of reads in the total number of reads for different insert lengths. The statistical results are shown in Figure 9. 4. Coverage of thiolated sites

C位点可以被分为三种类型，分别是 CG、 CHG、 CHH, 其中 H代表非 G 。从 *.cout文件中我们可以得到位点是否为曱基化位点，支持曱基化的 reads 数大于期望值的位点即为曱基化位点。统计不同覆盖深度（该位点被 reads覆盖的次数）下 mC位点占总位点的比例，统计结果如图 10所示，横轴代表大于等于该覆盖深度，纵轴代表相应深度下位点的比例。同时还对不同覆盖深度下的位点在基因组各区域的覆盖度进行统计，覆盖度即为覆盖到的位点数在基因组各区域的比例。统计结果如图 11所示，横轴^ 该覆盖深度，纵轴代表相应深度下位点的覆盖度。最后统计各基因组区域内的 CG位点的平均覆盖深度，结果参见表 9。除此之外，统计基因组各区域内的 CG位点的平均覆盖深度 The C site can be divided into three types, namely CG, CHG, CHH, where H stands for non-G. From the *.cout file we can get whether the site is a thiolated site, and the site that supports the thiolized reads greater than the expected value is the thiolation site. The ratio of the mC locus to the total locus under different coverage depths (the number of times the locus is covered by the reads) is counted. The statistical result is shown in Fig. 10. The horizontal axis represents greater than or equal to the coverage depth, and the vertical axis represents the locus at the corresponding depth. proportion. At the same time, the coverage of the sites in different coverage depths in each region of the genome is also counted, and the coverage is the proportion of the number of sites covered in each region of the genome. The statistical results are shown in Fig. 11, the horizontal axis ^ the depth of coverage, and the vertical axis represents the coverage of the site at the corresponding depth. Finally, the average coverage depth of CG loci in each genomic region was counted. See Table 9 for the results. In addition, the average coverage depth of CG loci in each region of the statistical genome

5. CpG位点的覆盖度 5. Coverage of CpG sites

RRBS 采用酶切技术得到片段再进行 BS 测序，这里对酶切的理论片段 ( 40-220bp )在各染色体和基因组各区域内覆盖到的 CG位点数及其比例进行统计，然后统计两个样品实际得到的数据在理论位点上的覆盖度，统计结果见表 10及图 12。 RRBS was digested to obtain fragments and then subjected to BS sequencing. Here, the theoretical fragment (40-220 bp) of the enzyme digestion was counted in the number of CG sites and their proportions in each chromosome and genome, and then the two samples were actually counted. The coverage of the obtained data at the theoretical site is shown in Table 10 and Figure 12.

6. 差异性曱基化区域（DMR )分析 6. Differential thiolized region (DMR) analysis

通过两个样品的曱基化水平，确定曱基化差异区域，方法如下： The thiolation difference region was determined by the thiolation level of the two samples as follows:

首先确定一个 CG位点作为种子位点，得到两个样本在该位点分别支持曱基化和不支持曱基化的四个数值，再通过卡方检验检验两个样本的曱基化水平在该位点差异是否显著；如果差异显著则往后延伸一个 CG位点，得到两个样本在这两个位点分别支持曱基化和不支持曱基化的四个数值，再进行卡方检验；以此类推循环，当遇到差异不再显著的点时， <更停止，前面显著的区域即为差异性曱基化区域。再以此位点为种子位点循环上面过程，从而得到差异性曱基化区域。 First, a CG locus is determined as a seed site, and two samples are supported at the site to support the thiolation and the four values that do not support thiolation, and then the chi-square level of the two samples is tested by the chi-square test. Whether the difference in the locus is significant; if the difference is significant, a CG locus is extended backward, and two samples are supported at these two loci to support the thiolation and the four values that do not support thiolation, and then the chi-square test is performed. Such a push cycle, when encountering a point where the difference is no longer significant, <more stop, the front significant area is the differential thiolated area. This site is then used as a seed site to cycle the above process to obtain a differential thiolated region.

在找到差异性曱基化区域后，通过这些区域找到相对应的基因，即甲基化差异基因，并对这些基因做 GO聚类功能分析，以分析差异相关基因是否有明显的功能聚类，即是否针对性地调控行使某类功能。 After finding the differential thiolated regions, the corresponding genes, namely methylation differential genes, are found through these regions, and the GO clustering function analysis is performed on these genes to analyze whether the differentially related genes have obvious functional clustering. That is, whether to perform certain types of functions in a targeted manner.

RRBS采用酶切技术得到片段再进行 BS测序，所得结果包括但不局限于上述内容。对于 BS测序的生物信息分析中的内容， RRBS也可酌情分析。实施例结果： The RRBS is subjected to restriction enzyme digestion to obtain fragments and then subjected to BS sequencing, and the results include, but are not limited to, the above. RRBS can also be analyzed as appropriate for the content of bioinformatics analysis of BS sequencing. Example results:

表 7，两个样本的数据产出统计 Table 7, Data output statistics for two samples

表 8，比对结果统计 Table 8, statistics of comparison results

表 9, 各基因组区域内的 CG位点的平均覆盖深度 Table 9. Average coverage depth of CG loci in each genomic region

CG的平均 ¾ 果度 (X) 基因元件 CG average 3⁄4 degree (X) gene component

LB-NB9 LB-NB10 LB-NB9 LB-NB10

CG岛上游 2k 18.04 17.61Upstream of CG Island 2k 18.04 17.61

CG岛 20.92 19.89CG Island 20.92 19.89

CG岛下游 2k 17.68 17.39 上游 2k 18.7 17.33Downstream of CG Island 2k 17.68 17.39 Upstream 2k 18.7 17.33

5-非翻译区 21.17 19.95 编码序列 19.98 20.67 内含子 18.02 17.555-untranslated region 21.17 19.95 coding sequence 19.98 20.67 intron 18.02 17.55

3-非翻译区 17.69 17.73 下游 2k 20.11 19.85 启动子 19.94 18.73 ii因 20.78 20.98 串联重复序列 18.46 17.593-untranslated region 17.69 17.73 downstream 2k 20.11 19.85 promoter 19.94 18.73 ii due 20.78 20.98 tandem repeat 18.46 17.59

Alu重复序列 13.49 11.3 核糖体 RNA 982.78 726.59 转运 RNA 20.45 18.81 0, CpG位点的覆盖度 Alu repeat sequence 13.49 11.3 ribosomal RNA 982.78 726.59 transfer RNA 20.45 18.81 0, coverage of CpG sites

Claims

Claim

CLAIMS 1. A method for detecting DNA thiolation by heavy bisulfite hydrogenation sequencing or reduced representative heavy acid salt sequencing, the method comprising the steps of:

1) subjecting the sample to bisulfite sequencing or reduced representative bisulfite sequencing, and preferably filtering the sequencing sequence to remove unqualified sequences;

2) Determine the thiolation site represented by the above sequencing sequence, the steps are as follows

i) a first sequencing sequence obtained by converting all C (cytosine) in the respective sequencing sequences into T (thymine) bases; converting all G (guanine) bases in the complementary sequences of the respective sequencing sequences into A (adenine) obtained second sequencing sequence;

ϋ ) converting all C (cytosine) 1⁄2 in the reference genome sequence of the corresponding species into the first reference sequence generated by the Τ (thymidine) subtraction; converting all G (guanine) bases in the reference genome sequence of the corresponding species a second reference sequence generated for Α (adenine);

Iii) aligning each of the first sequencing sequence and the second sequencing sequence with the first reference sequence and the second reference sequence, respectively, and obtaining each of the first sequencing sequence and the second sequencing sequence in the first reference sequence and the second Selecting, in the position on the sequence, each of the first sequencing sequence and the second sequencing sequence having unique alignments on the first reference sequence and the second reference sequence;

Iv) determining the methylation information of the first sequencing sequence and the second sequencing sequence by using the sequence information of the first sequencing sequence and the second sequencing sequence aligned by the second sequencing sequence, as follows: If in the first sequencing sequence and The corresponding positions of the aligned reference genomic sequences are C (cytosine) bases, indicating that they are thiolated at this position of the first sequencing sequence; if at the corresponding position of the second sequencing sequence and the aligned reference genome sequence All of which are G (guanine) bases, indicating that they are thiolated at this position of the second sequencing sequence;

3) for each C site on the reference genomic sequence, the number of sequencing sequences that are thiolated at the corresponding position of the first or second sequencing sequence covering the site and the number of sequencing sequences that are not thiolated, The minimum number of expected sequencing sequences for thiolation of the site is determined according to the following inequality, ie, the minimum k satisfying the following formula: nmC * ( )p ^k (tp ~ ^k < 0.01 * mC; wherein

mC is the number of sequencing sequences in which the first sequencing sequence or the second sequencing sequence covering the C site is thiolated at the corresponding position,

nmC is the first sequencing sequence covering the C site or the second sequencing sequence is not thiol at the corresponding position Number of sequencing sequences,

n is the total number of sequencing sequences at this site,

p is 4 Wu rate,

If the number of sequencing sequences that are methylated at the corresponding position by the first or second sequencing sequence covering the C site is greater than the minimum number of desired sequencing sequences at the site, the C site is the site of thiolation .

2. The method of claim 1, wherein the step 1) -2) is carried out by replacing the sample with an unmodified control DNA such as lamda DNA, and the p value is usually determined according to the conversion ratio of C to T minus, p is usually It is 0.001-0.01, such as 0.01.

3. The method of claim 1 or 2, further comprising forming a result file comprising: a chromosome number, a position of the C site, positive and negative strand information, a type of C site, a specific base, and a coverage of the C site The number of the first sequencing sequence or the second sequencing sequence of the point, the number of sequencing sequences that are thiolated at the corresponding position by the first sequencing sequence or the second sequencing sequence covering the C site, and the first number covering the C site The number of sequencing sequences that the sequencing sequence or the second sequencing sequence is not thiolated at the corresponding position and/or the minimum number of desired sequencing sequences that are thiolated by the C site.

4. A method of bioinformatics analysis based on bisulfite sequencing or reduced representative bisulfite sequencing, the method comprising detecting DNA thiolation by the method of any of claims 1-3, Bioinformatics analysis is then performed, which analyzes one or more selected from the group consisting of: data production, sequencing of fragments, coverage of thiolated sites, thiolation sites The level of thiolation, genome-wide thiolation data distribution trends, genome-wide DNA thiolation maps, differential thiolated regions, insert length statistics, and coverage of CpG sites.

5. The method of claim 4, wherein the coverage analysis of the thiolation site is performed as follows: C tt sites are classified into three types, namely CG, CHG, CHH, wherein H represents a non-G base; The result file obtains whether the site is a thiolation site, and the site supporting the thiolated reads larger than the expected value is the thiolation site; the ratio of the mC site to the total site at different coverage depths is also calculated; For sites with different coverage depths on the genome, ί3⁄4Α is counted, and the coverage is the proportion of the number of sites on the whole genome.

6. The method of claim 4, wherein the thiolation level analysis of the thiolation site is performed as follows: To understand the overall characteristics of the genomic thiolation map and the average methylation level of the whole genome, each methyl group is counted. The average thiolation level of C bases; statistically reflects the effective coverage of reads of thiolated information. The calculation of coverage is based on three aspects: chromosome, gene interval and gene function After the components are excluded from the ambiguous reads, the effective coverage calculation is performed based on the read-supported singly-based reads and the reads data that does not support methylation, and the statistical results are obtained.

7. The method of claim 4, wherein said genome-wide thiolation data distribution trend analysis is performed as follows: To obtain a genome-wide thiolation distribution trend, statistical thiolation C 3⁄4 CG, CHG and

The number and distribution ratio of CHH, ie the percentage of each type of thiolated C to the total thiolation C; the distribution of the C thiolation level of each type (CG, CHG and CHH) thiolation; Gene intervals and gene functional elements count C CG, CHG and each type of distribution in each region

The thiolation level of CHH); or the distribution of bases near the thiolated C at non-CG sites, and the probability of occurrence of different thiolation patterns (CHG and CHH).

8. The method of claim 4, wherein said genome-wide DNA thiolation pattern analysis is performed as follows: Delineating the distribution of methylated C bases from the chromosomal level, and counting the density of thiolation at the chromosome of 10 kb on the chromosome The distribution of the upper; at the same time, the density distribution of different types of thiolized C 1⁄2 ( CG, CHG and CHH ) in each window of 100 kb is counted, and different genomic regions have different biological functions, which are expressed in the form of heat maps. The basic distribution helps to further understand the thiolated graphical features of these regions. The transcription of genes is regulated by different regulatory elements, dividing all coding gene sequences into seven different transcriptional element regions, here! The statistics on the thiolation levels of different transcriptional element regions were made.

9. The method of claim 4, said differential thiolated region (DMR) analysis being performed as follows: determining the thiolation difference region by the thiolation level of the two samples, as follows: First determining a CG site As a seed site, two samples were obtained at the site to support thiolation and do not support thiolated four numbers, and then the chi-square test was used to check whether the thiolation level of the two samples was significant at the site; If the difference is significant, then a CG site is extended backwards, and two samples are supported at these two sites to support thiolation and do not support thiolated four numbers, and then perform chi-square test; When the point where the difference is no longer significant, it stops, and the area that is prominent in the front is the difference thiolized area. This site is used as a seed site to circulate the above process, thereby obtaining differential methylation regions. After finding the differential thiolated regions, the corresponding genes are found through these regions, namely the thiolated differential genes, and These genes were analyzed by GO clustering function to analyze whether the differentially related genes have obvious functional clustering, that is, whether to specifically control the exercise of certain functions.

10. The method of claim 4, wherein the statistical analysis of the length of the insert is performed as follows: comparing the length of the reads on the pair to determine whether the comparison result is consistent with the length of the database, and counting the length of the different inserts. The percentage of the number.

11. The method of claim 4, wherein the coverage analysis of the CpG sites is performed as follows: In RRBS, the fragment was obtained by enzyme digestion and then subjected to BS sequencing. Here, the theoretical fragment (40-220 bp) of the enzyme digestion was counted in the number of CG sites and their proportions in each chromosome and genome, and then the statistics were counted. The coverage of the actual data obtained from a sample at the theoretical site.

12. A device for bioinformatics analysis based on heavy bisulfite sequencing or reduced representative heavy sulphate acidation sequencing, the device comprising bisulfite sequencing and/or reduced representation a bisulfite sequencing module for performing bisulfite sequencing or reduced representative bisulfite sequencing of the sample, and preferably filtering the sequencing sequence to remove unqualified sequences; data output analysis module, Statistics are made on the output of the sequencing data, including the conversion rate, the length of the sequencing fragment, the number of sequencing fragments, and the total amount of data of the sample; the sequencing fragment alignment analysis module is used to compare the fragments obtained by sequencing to the reference. On the genome, and compare the results, the statistical content includes the initial fragment data volume, the number of comparison fragments, the amount of comparison data, the average alignment ratio, the average coverage depth; the coverage analysis module of the thiolated site It is used to calculate the ratio of the number of C-sites of the thiolated sequence of 3⁄4Α to the total number of sites at different depths of coverage; A horizontal analysis module for counting the average thiolation level of each methylated C tt; a genome-wide thiolation data distribution trend analysis module for obtaining a genome-wide thiolation distribution trend, statistical thiolation (中CG , the number and distribution ratio of CHG and CHH, where H represents non-G ^tt; a genome-wide DNA thiolation map analysis module for describing the distribution of methylated C bases from the chromosomal level; differential thiolation A region analysis module is used to determine the thiolation difference region by the methylation level of the two samples; the insert length statistical analysis module, and the length of the sequencing sequence on the comparison pair to determine whether the alignment result conforms to the database fragment Length; CpG site coverage analysis module, used to count the number of CG sites and their proportions covered in each chromosome and genome, and then statistically report the coverage of the actual data on the theoretical site.