WO2012171213A1

WO2012171213A1 - Method and system for assembly of genome

Info

Publication number: WO2012171213A1
Application number: PCT/CN2011/075852
Authority: WO
Inventors: 王俊; 罗锐邦; 谢寅龙; 刘允杰; 倪培相; 李英睿
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2011-06-17
Filing date: 2011-06-17
Publication date: 2012-12-20
Anticipated expiration: 2013-12-17

Abstract

The present invention provides a method and a system for genome assembly. The method comprises: breaking the genome into Fosmid clones using Fosmid as the vector; dividing the Fosmid clones into pooling libraries; shearing and sequencing the Fosmid clones of each pooling library to obtain reads, and performing short-read assembly on the reads to obtain the scaffolds of each pooling library; removing the invalid bases from the scaffolds to obtain the scaftigs of each pooling library; and splicing the scaftigs, based on the scaftig alignment results, to obtain the scaffolds of the genome, wherein alignment results in repetitive-sequence regions are masked. The method and system combine whole-genome shotgun sequencing with Fosmid-to-Fosmid sequencing and use hierarchical assembly, which improves the accuracy of the assembly.

Description

Genome assembly method and system

Technical Field

The present invention relates to the field of bioinformatics, and more particularly to a genome assembly method and system.

Background

At present, whole-genome shotgun (WGS) sequencing is the mainstream design for genome assembly projects. According to the specific characteristics of the repetitive sequences in the genome, DNA inserts of different lengths are combined for paired-end sequencing; when the average sequencing depth across the genome is sufficient, single-base accuracy and genome completeness can be ensured. As next-generation sequencing (NGS) has matured and become widespread, sequencing costs have fallen sharply, and WGS sequencing based on NGS has become the mainstream sequencing approach for genome projects of all kinds.

However, complex genomes present problems such as high heterozygosity (heterozygosity being the state in which different alleles are present at one or more loci on homologous chromosomes) and repetitive sequences. The above approach is easily disturbed by these problems: the assembly results fall short of the required standard, data analysis and assembly become difficult, and the approach is therefore unsuitable for complex genomes.

Summary of the Invention

One technical problem to be solved by one aspect of the present disclosure is to provide a genome assembly method and system that can improve the accuracy of assembly.

According to one aspect of the present invention, a genome assembly method is provided, comprising: breaking the genome into Fosmid clone sequences using Fosmid as the vector; dividing the Fosmid clone sequences into pooling libraries; breaking and sequencing the Fosmid clone sequences of each pooling library, and performing short-read assembly on the resulting reads to obtain the scaffolds of each pooling library; removing the invalid bases from the scaffolds to obtain the scaftigs (scaffold-cut contigs) of each pooling library; and splicing the scaftigs, based on the alignment results between scaftigs, to obtain the scaffolds of the genome, wherein alignment results in repetitive-sequence regions are masked.

According to one embodiment of the method of the invention, splicing the scaftigs based on the alignment results between scaftigs to obtain the scaffolds of the genome, with alignment results in repetitive-sequence regions masked, comprises: aligning all scaftigs pairwise; determining the repetitive-sequence regions from the alignment results and masking the alignment results within those regions; and splicing the scaftigs based on the alignment results to obtain the scaffolds of the genome.

According to one embodiment of the method of the invention, the method further comprises: converting invalid bases in the scaffolds of the genome into valid bases using the pairing relationships between reads.

According to one embodiment of the method of the invention, breaking the genome into Fosmid clone sequences using Fosmid as the vector comprises: randomly breaking the genome into Fosmid clone sequences using Fosmid as the vector, the total length of the Fosmid clone sequences being several times the length of the genome.

According to one embodiment of the method of the invention, each pooling library comprises 10 to 90 Fosmid clone sequences, preferably 30 to 90 Fosmid clone sequences.

According to one embodiment of the method of the invention, performing short-read assembly on the sequencing reads of each pooling library to obtain the scaffolds of each pooling library comprises, for each pooling library: cutting the sequencing reads into successive short sequences of length K (K-mers); storing the K-mers in a hash table to form the vertices of a de Bruijn graph; connecting K-mers that follow one another on a read to form the edges of the de Bruijn graph; processing all reads to obtain the whole de Bruijn graph; removing from the de Bruijn graph the paths caused by sequencing errors and heterozygous sites; joining the linear K-mer paths to form first-level contigs; and aligning the reads to the first-level contig sequences and establishing the relative positions and orientations of the contigs from the pairing relationships between reads, forming the scaffolds of each pooling library.

According to one embodiment of the method of the invention, several different K values are used for each pooling library, so that contigs and scaffolds are produced for each pooling library under each K value; for each pooling library, the contigs and scaffolds are then selected from the assembly results of the different K values according to N50.

According to one embodiment of the method of the invention, splicing the scaftigs of the pooling libraries based on the alignment results to obtain the scaffolds of the genome comprises: clustering the scaftigs based on the alignment results to obtain contig clusters; ordering the sequences in each contig cluster by their relative positions and outputting a unified contig sequence; and joining the contig sequences into scaffolds using the pairing relationships between reads.

According to another aspect of the present invention, a genome assembly system is provided, comprising: a pooling library generation unit for breaking the genome into Fosmid clone sequences using Fosmid as the vector and dividing the Fosmid clone sequences into pooling libraries; a sequencing unit for breaking and sequencing the Fosmid clone sequences of each pooling library; a first-level assembly unit for performing short-read assembly on the reads obtained from each pooling library to obtain the scaffolds of each pooling library; a scaftig acquisition unit for removing the invalid bases from the scaffolds to obtain the scaftigs of each pooling library; and a second-level assembly unit for splicing the scaftigs, based on the alignment results between scaftigs, to obtain the scaffolds of the genome, wherein alignment results in repetitive-sequence regions are masked.

According to one embodiment of the system of the invention, the second-level assembly unit comprises: an alignment module for aligning all scaftigs pairwise; a repeat processing module for determining the repetitive-sequence regions from the alignment results and masking the alignment results within those regions; and a splicing module for splicing the scaftigs based on the alignment results to obtain the scaffolds of the genome.

According to one embodiment of the system of the invention, the system further comprises: a gap-filling unit for converting invalid bases in the scaffolds of the genome into valid bases using the pairing relationships between reads.

According to one embodiment of the system of the invention, the total length of the Fosmid clone sequences is several times the length of the genome; or each pooling library comprises 10 to 90 Fosmid clone sequences.

According to one embodiment of the system of the invention, the first-level assembly unit, for each pooling library: cuts the sequencing reads into successive short sequences of length K (K-mers); stores the K-mers in a hash table to form the vertices of a de Bruijn graph; connects K-mers that follow one another on a read to form the edges of the de Bruijn graph; processes all reads to obtain the whole de Bruijn graph; removes from the de Bruijn graph the paths caused by sequencing errors and heterozygous sites; joins the linear K-mer paths to form first-level contigs; and aligns the reads to the first-level contig sequences and establishes the relative positions and orientations of the contigs from the pairing relationships between reads, forming the scaffolds of each pooling library.

According to one embodiment of the system of the invention, the first-level assembly unit uses several different K values for each pooling library, producing contigs and scaffolds for each pooling library under each K value; for each pooling library, the contigs and scaffolds are then selected from the assembly results of the different K values according to N50.
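The selection among K values can be illustrated with a small sketch. The source only states that the result is chosen by N50; the sketch below assumes that the K value giving the largest N50 is preferred, and the function names `n50` and `pick_best_k` as well as the lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def pick_best_k(assemblies):
    """assemblies maps a K value to the sequence lengths assembled with it.
    Assumption: the K value with the largest N50 is the one to keep."""
    return max(assemblies, key=lambda k: n50(assemblies[k]))


# Hypothetical contig lengths from one pooling library under three K values.
assemblies = {
    31: [900, 700, 400, 300, 200],
    41: [1500, 600, 300, 100],
    51: [800, 800, 500, 400],
}
best_k = pick_best_k(assemblies)
```

The same comparison could be run separately on contig lengths and on scaffold lengths, since the two may favor different K values.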

According to one embodiment of the system of the invention, the splicing module clusters the scaftigs based on the alignment results to obtain contig clusters; orders the sequences in each contig cluster by their relative positions and outputs a unified contig sequence; and joins the contig sequences into scaffolds using the pairing relationships between reads.

The genome assembly method and system provided by the invention combine whole-genome shotgun sequencing with Fosmid-to-Fosmid sequencing and perform hierarchical assembly, which improves the accuracy of the assembly.

Brief Description of the Drawings

Figure 1 is a flowchart of one embodiment of the genome assembly method of the present invention;

Figure 2 is a flowchart of another embodiment of the genome assembly method of the present invention;

Figure 3 is a flowchart of yet another embodiment of the genome assembly method of the present invention;

Figure 4 shows an example of the alignment result of two sequences;

Figure 5 shows an example of a repetitive sequence within sequences and its alignment results;

Figure 6 shows an example of determining a repetitive-sequence region;

Figure 7 illustrates the masking strategy before modification and the modified masking strategy;

Figure 8 is a structural diagram of one embodiment of the genome assembly system of the present invention;

Figure 9 is a partial structural diagram of one embodiment of the genome assembly system of the present invention.

Detailed Description

The invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention.

One embodiment of the present invention provides a hierarchical assembly method and system aided by next-generation sequencing technology, in which whole-genome shotgun sequencing is combined with Fosmid-to-Fosmid sequencing in order to solve the assembly problem of complex genomes.

Figure 1 shows a flowchart of one embodiment of the genome assembly method of the present invention. As shown in Figure 1, in step 102, the genome is broken into Fosmid clone sequences using Fosmid as the vector.

Step 104: the Fosmid clone sequences are divided into pooling libraries, and each pooling library is sequenced. For example, the Fosmid clone sequences of a pooling library are mixed and broken into small fragments for paired-end sequencing.

Step 106: short-read assembly is performed on the reads of each pooling library to obtain the scaffolds of each pooling library.

Step 108: the invalid bases in the scaffolds are removed to obtain the scaftigs of each pooling library. For example, if a scaffold is "ATTGCNNNGGAC", where N denotes an invalid base, the corresponding scaftigs are "ATTGC" and "GGAC".
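The example in step 108 can be reproduced with a few lines of code. This is only a minimal sketch; the function name `scaffold_to_scaftigs` is hypothetical, and it assumes that invalid bases are written as the letter N.

```python
import re


def scaffold_to_scaftigs(scaffold):
    """Split a scaffold at every run of invalid bases (N) and drop empty pieces."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


scaftigs = scaffold_to_scaftigs("ATTGCNNNGGAC")
```

Runs of N at the start or end of a scaffold produce empty pieces, which the list comprehension discards.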

Step 110: the scaftigs are spliced, based on the alignment results between scaftigs, to obtain the scaffolds of the genome, with alignment results in repetitive-sequence regions masked. All scaftigs are aligned pairwise; the repetitive-sequence regions are determined from the alignment results and the alignment results within those regions are masked; the scaftigs are then spliced based on the alignment results to obtain the scaffolds of the genome.

In the genome assembly method provided by the invention, the genome is broken into Fosmid DNA fragments, which are divided into pooling libraries; after sequencing, hierarchical assembly is performed, which improves the accuracy of the assembly.

Figure 2 shows a flowchart of another embodiment of the genome assembly method of the present invention. As shown in Figure 2, in step 202, the whole genome is randomly broken into fragments of a predetermined length, for example 32 kb.

Step 204: a Fosmid library is constructed.

Step 206: all of the resulting Fosmid clones are randomly combined into pooling libraries, for example in groups of about 30 to 90 clones, each group forming one pooling library.
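The random grouping in step 206 might be sketched as follows. The function name `make_pools`, the fixed seed and the choice of drawing each pool size uniformly from the 30 to 90 range are assumptions for illustration; the source only specifies the approximate group size.

```python
import random


def make_pools(clone_ids, min_size=30, max_size=90, seed=0):
    """Shuffle the clones and cut the shuffled list into pools.
    Note: the last pool simply takes the leftover clones, so it may be
    smaller than min_size."""
    rng = random.Random(seed)
    shuffled = list(clone_ids)
    rng.shuffle(shuffled)
    pools = []
    start = 0
    while start < len(shuffled):
        size = rng.randint(min_size, max_size)  # inclusive on both ends
        pools.append(shuffled[start:start + size])
        start += size
    return pools


# Hypothetical library of 1000 clones identified by integers.
pools = make_pools(range(1000))
```

Every clone ends up in exactly one pool, which keeps each pool small enough to be treated as a separate "virtual genome" in the first-level assembly.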

Step 208: the inserts in each pooling library are randomly broken into small fragments, and a small-insert library is built. For example, the small fragments are 300 bp to 500 bp long.

Step 210: the small fragments of each pooling library are sequenced, giving sequencing data organized per pooling library.

Step 212, first-level assembly: the inserts in a pooling library are treated as a virtual genome, and first-level assembly is carried out separately for each pooling library. This stage uses the assembly software SOAPdenovo, which performs short-read assembly based on the de Bruijn graph. The paired-end (PE) reads are assembled into longer contig sequences, which contain no invalid bases; the contigs are then assembled into still longer scaffolds, which may contain invalid bases. Removing the invalid bases from all scaffolds gives scaftig (scaffold-cut contig) sequences, which are intermediate between contigs and scaffolds. The scaftigs from all pooling libraries are then put together for second-level assembly.

Step 214, second-level assembly: this stage mainly uses an alignment-based splicing method. The scaftigs obtained from the first-level assembly of all pooling libraries are put together and aligned pairwise to assemble second-level contigs and, further, second-level scaffolds.

Figure 3 shows a flowchart of yet another embodiment of the genome assembly method of the present invention. As shown in Figure 3, in step 302, the pooling libraries composed of Fosmid clone sequences are sequenced to obtain reads.

According to one embodiment of the invention, a Fosmid-to-Fosmid scheme is adopted: with Fosmid as the vector, the whole genome is randomly broken into many Fosmid clone sequences of roughly Fosmid length. The total base length of all Fosmid clone sequences is several times the base length of the whole genome, so that every base of the genome is covered as far as possible. For example, according to the Lander-Waterman model (Lander, E. S. and Waterman, M. S., "Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis", Genomics, 1988 Apr; 2(3): 231-9), for Fosmids 35 kbp long, 8-fold genome coverage can in theory ensure that every base of the genome is covered at least once.
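As a rough numerical illustration of the coverage argument: under the usual Poisson approximation associated with the Lander-Waterman model (ignoring edge effects), the expected fraction of bases covered by no clone at coverage depth c is about e^(-c). The sketch below evaluates this for 8-fold coverage; the function names and the 100 Mbp genome size are hypothetical.

```python
import math


def uncovered_fraction(depth):
    """Expected fraction of bases covered by no clone (Poisson approximation)."""
    return math.exp(-depth)


def clones_needed(genome_length, clone_length, depth):
    """Number of clones whose total length is `depth` times the genome length."""
    return math.ceil(depth * genome_length / clone_length)


# Hypothetical 100 Mbp genome, 35 kbp Fosmid clones, 8-fold coverage.
n_clones = clones_needed(100_000_000, 35_000, 8)
missed = uncovered_fraction(8)
```

At 8-fold coverage the expected uncovered fraction is below 0.04%, so the statement that every base is covered at least once should be read as holding in expectation for nearly all bases rather than as a strict guarantee.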

For example, 10 to 90 Fosmid clone sequences are taken as one pooling library. After the pooling libraries are built, the Fosmid clone sequences are sequenced in pools to obtain paired-end sequencing reads. High-throughput sequencing can be performed, for example, with Illumina GA (Solexa) sequencing technology. Solexa sequencing is a next-generation (second-generation) sequencing technology whose core idea is sequencing by synthesis or ligation (SBS & SbL); bridge PCR is carried out on a small chip (flow cell) using single-molecule arrays. Reversible terminator chemistry allows only one base to be incorporated per cycle; the fluorophore carried by that base is then excited with the corresponding laser and the emitted light is captured, so that the base information is read.

Step 304: assembly based on the de Bruijn graph. Once the sequencing reads of each pooling library have been obtained, each pooling library can be assembled; this stage is called first-level assembly.

Each read (sequencing sequence) is cut into successive short sequences of length K, called K-mers; consecutive K-mers overlap by K-1 bases. The K-mers are stored in a hash table and form the vertices of the de Bruijn graph; two K-mers that follow one another on a read are considered connected and form an edge of the graph. Once all reads have been processed, the whole de Bruijn graph is obtained. Paths caused by sequencing errors and heterozygous sites are removed from the graph, and the bases of each linear K-mer path are joined to form the first-level contig sequences. The reads are then aligned to the contig sequences, and the relative positions and orientations of the contigs are established from the paired-end relationships of the reads, forming the first-level scaffolds. Finally, the paired-end relationships of the reads are used to fill gaps.
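The graph construction and the joining of linear K-mer paths can be sketched in a few lines. This is a toy illustration, not SOAPdenovo's implementation: it ignores reverse complements, skips the removal of error and heterozygosity paths, and does not report isolated cycles. The function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Hash table mapping each K-mer (vertex) to the set of K-mers that
    directly follow it on some read (edges)."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # register the vertex even without outgoing edges
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def linear_contigs(graph):
    """Join unbranched K-mer paths into contig strings."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    def is_linear(node):
        # Exactly one way in and one way out: interior of a linear path.
        return indegree[node] == 1 and len(graph[node]) == 1

    contigs = []
    for node in sorted(graph):
        if is_linear(node):
            continue  # reached later by walking from a path start
        for nxt in sorted(graph[node]):
            path = node + nxt[-1]
            while is_linear(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            contigs.append(path)
        if not graph[node] and indegree[node] == 0:
            contigs.append(node)  # isolated K-mer
    return contigs


graph = build_de_bruijn(["ATTGCA", "TTGCAG"], 3)
contigs = linear_contigs(graph)
```

Two overlapping reads collapse into a single path here, while a branch in the graph (for example from a repeat or a heterozygous site) ends one contig and starts new ones.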

For example, the assembly at this stage can be carried out with SOAPdenovo, assembly software developed in-house by BGI, which performs short-read assembly based on the de Bruijn graph and yields the first-level scaffolds of each pooling library. For the assembly software, see Li, R. et al., "De novo assembly of human genomes with massively parallel short read sequencing", Genome Res (2009). The software is freely available online at http://soap.genomics.org.cn/soapdenovo.html.

Step 306: the scaftigs are obtained. When the first-level scaffolds are assembled, the bases of some repetitive sequences cannot be assembled into the scaffolds. For each pooling library, after the first-level assembly is complete, all invalid bases N are removed from the scaffolds to obtain the scaftigs of that pooling library.

Using the data of all pooling libraries, the scaftigs are assembled in the second-level assembly; this step mainly uses an alignment-based splicing method. The second-level contig assembly procedure comprises the following steps:

Step 308: pairwise alignment of the scaftigs:

The relationships between sequences are established by alignment. The scaftigs are aligned pairwise; for the two scaftigs of each alignment result a similarity score is computed, and the alignment results are sorted in descending order of this score. In addition, a list is built for each scaftig that records all other scaftigs with which it has an alignment relationship. For example, if scaftig a has alignment relationships with b and c, the list records: a, 2 (the number of alignment relationships), b, c.
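The bookkeeping described above (a score-sorted list of alignments plus a partner list per scaftig) could look like the following sketch. The tuple layout `(seq_a, seq_b, similarity)`, the function name and the scores are assumptions for illustration.

```python
from collections import defaultdict


def index_alignments(alignments):
    """Sort alignment records by similarity (descending) and build, for each
    scaftig, the list of scaftigs it aligns with."""
    ranked = sorted(alignments, key=lambda rec: rec[2], reverse=True)
    partners = defaultdict(list)
    for seq_a, seq_b, _score in ranked:
        partners[seq_a].append(seq_b)
        partners[seq_b].append(seq_a)
    return ranked, dict(partners)


# Hypothetical alignment records: scaftig a aligns with b and c.
alignments = [("a", "b", 0.95), ("a", "c", 0.99), ("b", "d", 0.90)]
ranked, partners = index_alignments(alignments)
```

For scaftig a, `len(partners["a"])` gives the number of alignment relationships (2) and `partners["a"]` lists the partners, matching the record "a, 2, b, c" in the text.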

A sequence alignment indicates where two scaftigs share the same bases. Suppose sequence A is ATTTCGTGA and sequence B is TTGGTGACTG. The alignment result can be written as:

ATTTCGTGA
  TTGGTGACTG

The underlined bases are those that do not align; the remaining, corresponding bases form the aligned part. The alignment result is illustrated in Figure 4, where the vertically hatched part is the aligned part and the diagonally hatched part is the unaligned part.
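The overlap between the example sequences A and B can be found with a simple ungapped scan. This is only a sketch under the assumption of a gap-free suffix-to-prefix overlap; real aligners also handle insertions and deletions, and the function name is hypothetical.

```python
def best_ungapped_overlap(a, b):
    """Slide the start of b along a and return (offset, overlap_length, matches)
    for the placement with the most identical bases."""
    best = (0, 0, 0)
    for offset in range(len(a)):
        overlap = min(len(a) - offset, len(b))
        matches = sum(1 for i in range(overlap) if a[offset + i] == b[i])
        if matches > best[2]:
            best = (offset, overlap, matches)
    return best


offset, overlap, matches = best_ungapped_overlap("ATTTCGTGA", "TTGGTGACTG")
similarity = matches / overlap  # fraction of identical bases in the overlap
```

For these two sequences the best placement starts B at the third base of A: the overlap spans 7 positions, 6 of which are identical, and the single C/G mismatch plus the overhanging ends form the unaligned part.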

Step 310: repeat boundary detection. The genome contains repetitive sequences; if the identical-base portion of an alignment result contains a repetitive sequence, a misassembly may occur. In Figure 5, for example, b is a repetitive sequence; if the boundaries of b are not identified, the splicing result may contain the erroneous joins a-b-e or d-b-c. The alignment results are therefore filtered to remove alignments within repetitive-sequence regions and prevent misassembly.

The repeat regions in the scaftigs are determined from the branches detected through the overlap relationships. A repetitive-sequence region must be flanked on both sides by non-repetitive regions (in Figure 5, a and d differ, and c and e differ); using this property and the mutual relationships among the pairwise alignment results, the positional relationships between sequences can be determined and most potential repetitive-sequence regions can be found. Alignment results whose identical bases lie entirely within a repetitive-sequence region can then be masked. Take Figure 6 as an example: it shows the alignment relationships of sequences A, B and C, where the gridded areas are regions in which two sequences align and the diagonally hatched areas are regions that do not align. The alignments among A, B and C indicate that their relative positions are as shown in Figure 6. To decide whether the alignment between A and B can be used in the subsequent splicing, the A-C and B-C alignment results are consulted. As shown in Figure 6, the front end of A does not align with the front end of C, so the left boundary of the repetitive sequence can be taken to lie at the boundary of the A-C alignment; B and C behave similarly at the right end, so the right boundary of the repetitive sequence can be taken to lie at the boundary of the B-C alignment. Building such a local graph makes it possible to identify the repeat boundaries (the region marked by the two vertical dashed lines in Figure 6). Alignment results that, like the A-B alignment, lie entirely within the repeat boundaries are filtered out and not used in the subsequent splicing.
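The filtering rule for the situation of Figure 6 can be written down as a small sketch. It assumes that A, B and C have already been placed on a shared coordinate axis, that each alignment is given as a `(start, end)` interval on that axis, and that the A-C and B-C alignments have already been confirmed to end at a branch (divergence) rather than at a sequence end. All names and coordinates are hypothetical.

```python
def repeat_region(aln_ac, aln_bc):
    """Left boundary from where the A-C alignment begins, right boundary
    from where the B-C alignment ends."""
    return (aln_ac[0], aln_bc[1])


def is_masked(alignment, region):
    """An alignment is masked when it lies entirely inside the repeat region."""
    return region[0] <= alignment[0] and alignment[1] <= region[1]


# Hypothetical coordinates on the shared axis.
aln_ac = (100, 400)   # A and C agree only from position 100 onward
aln_bc = (150, 450)   # B and C agree only up to position 450
aln_ab = (150, 400)   # candidate A-B alignment

region = repeat_region(aln_ac, aln_bc)
masked = is_masked(aln_ab, region)
```

An alignment that extends past either boundary reaches unique flanking sequence and is therefore kept for splicing.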

Step 312: contig clustering (contig assembly):

A set of alignments that make up the contigs is built. Using a scoring strategy, each scaftig is initially treated as a contig of its own, and these contigs are clustered through their alignment relationships: one alignment joins two contigs into a larger contig, and repeated clustering produces large contigs, that is, contig clusters.

This step builds the set of alignment results that make up the contigs. Each alignment result is scored by the proportion of identical bases between the sequences (the similarity): the more identical bases, the higher the score. The alignment results are sorted by score in descending order, and the alignment results and sequences whose similarity is high and whose proportion of inconsistent sequence is below a certain threshold are merged into the contig cluster structure.
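One common way to realize this kind of greedy merging is a union-find (disjoint-set) structure; the source does not say which data structure is used, so the sketch below is an assumption, as are the function name, the record layout and the 0.9 similarity threshold.

```python
def cluster_contigs(alignments, min_similarity=0.9):
    """Merge scaftigs connected by sufficiently similar alignments.
    alignments: iterable of (seq_a, seq_b, similarity) records."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Process the best alignments first, as in the sorted list of the text.
    for a, b, score in sorted(alignments, key=lambda r: r[2], reverse=True):
        root_a, root_b = find(a), find(b)
        if score >= min_similarity:
            parent[root_a] = root_b

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical alignment records between five scaftigs.
alignments = [
    ("s1", "s2", 0.98),
    ("s2", "s3", 0.95),
    ("s4", "s5", 0.99),
    ("s3", "s4", 0.60),  # below threshold: does not join the two clusters
]
clusters = cluster_contigs(alignments)
```

Alignments already masked as lying inside repeat regions would be removed from the input before this step.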

Step 314: consensus:

From the set of alignment results generated in the previous step, inconsistencies between bases are removed and the contig sequences are generated. The contig clustering results of the previous step are read; for each contig cluster the corresponding sequences are known, so the sequences in the cluster are ordered by their relative positions and, using the alignment results between them, all of them are merged into one unified sequence. The unified sequence output at this point is the second-level contig sequence.
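A simple way to resolve base inconsistencies is a per-column majority vote over the ordered, positioned sequences. The source does not state the exact rule, so the majority vote, the function name and the `(sequence, offset)` input layout below are assumptions.

```python
from collections import Counter


def consensus(placed):
    """placed: list of (sequence, offset) pairs on a shared coordinate axis.
    Returns the majority base per column; uncovered columns become N."""
    length = max(offset + len(seq) for seq, offset in placed)
    columns = [Counter() for _ in range(length)]
    for seq, offset in placed:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# Hypothetical cluster of three scaftigs; the second one carries a
# mismatch (T instead of A) at shared position 4.
placed = [("ACGTAC", 0), ("GTTCGG", 2), ("ACGGTT", 4)]
unified = consensus(placed)
```

With three sequences covering the mismatching column, the two agreeing bases outvote the third; with only two sequences a tie-breaking rule (for example by alignment score) would be needed.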

Steps 316 and 318: part of the sequence bases are masked and the scaffolds are constructed. Once the contig sequences have been obtained, the second-level scaffolds can be built by joining the contigs into scaffolds using the paired-end relationships between reads. For example, this stage can be carried out with SOAPdenovo, the assembly software developed in-house by BGI.

此外，可以将此方法中重复序列屏蔽策略根据复杂基因组的情况进行改进，将屏蔽一个序列及其相关连接改为屏蔽其中一个连接，从而令骨架序列延长。原有的屏蔽策略如图 7 ( A )所示，一旦发现某个重叠群（重叠群 1 ) 的上游或下游（图 7 中示例为下游）与之有连接的重叠群出现在位置重叠（如图 7 中的重叠群 2,3, 通过读间对关系和读对应库的插入片段长度 (insert size), 计算出重叠群间的距离，通过这个距离就能计算出相互是否出现位置重叠），则屏蔽此重叠群 1及其相关的连接（如图示）。修^的策略 (如图 7 ( B )所示）为，出现位置重叠后只保留令路径延伸的最远的重叠群及其连接（以图中情况为例，假设重叠群 2后续无连接，而重叠群 3则能连接到更远。这时候就选择屏蔽重叠群 1到重叠群 2的连接)。 In addition, the repeat sequence masking strategy in this method can be based on complex genomes. The situation is improved by masking a sequence and its associated connections to mask one of the connections, thereby extending the skeleton sequence. The original shielding strategy is shown in Figure 7 (A). Once a contig that is connected upstream or downstream of a contig (contig 1) (as illustrated in Figure 7 downstream) is found to overlap in position (such as The contigs 2, 3 in Fig. 7 calculate the distance between the contigs by reading the inter-read relationship and reading the insert size of the corresponding library, by which the position overlap can be calculated. This contig 1 and its associated connections (as shown) are then masked. The strategy of repairing (as shown in Fig. 7(B)) is that only the farthest contigs that extend the path and their connections are retained after the position overlap occurs (as in the case of the figure, it is assumed that the contig 2 is not connected subsequently, The contig 3 can be connected further. At this time, the connection of the contig 1 to the contig 2 is selected.
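The modified masking strategy can be sketched as below. This is an illustration under stated assumptions rather than the SOAPdenovo code: links are given as hypothetical `(target_id, distance)` pairs, `lengths` holds the target contig lengths, and `reach` is an assumed precomputed measure of how far the path continues beyond each target.

```python
def intervals_overlap(a_start, a_len, b_start, b_len):
    """True if the half-open intervals [start, start + len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def choose_links(links, lengths, reach):
    """Return (kept_links, masked_links) for the downstream links of one contig.

    links:   list of (target_id, distance) estimated from read pairs
    lengths: {target_id: contig length}
    reach:   {target_id: how far the path extends beyond that target}
    """
    conflict = any(
        intervals_overlap(d1, lengths[t1], d2, lengths[t2])
        for i, (t1, d1) in enumerate(links)
        for (t2, d2) in links[i + 1:]
    )
    if not conflict:
        return list(links), []
    # Modified strategy: keep only the link that extends the path furthest,
    # and mask the competing links instead of masking the whole contig.
    best = max(links, key=lambda link: reach[link[0]])
    return [best], [link for link in links if link != best]
```

Under the original strategy the conflicting case would instead return no kept links at all, which is what breaks the scaffold at that contig.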

Compared with the original version, the modified SOAPdenovo increases the scaffold N50 by about 10%.

Step 320, gap filling. After the scaffolds are complete, the paired-end relationships between reads can be used to fill in the invalid bases N in the scaffolds. For example, gaps can be filled with KGF, software developed in-house by BGI, or with GapCloser, the gap-filling tool that accompanies SOAPdenovo. GapCloser is freely available at soap.genomics.org.cn.

A specific application example of the method of the invention is given below, in which the oyster genome is sequenced and assembled. The steps are as follows:

(1) Processing of pooled Fosmid data

1) Remove the adapter sequences from the raw data to obtain sequence files free of adapters.

2) Compute the base yield of the adapter-trimmed sequences and discard pooled libraries whose yield is too low, reducing the effect of low-yield libraries on the assembly.

3) Align the adapter-trimmed reads to a reference sequence with the short-read aligner SOAPaligner to obtain the corresponding pair and single alignment results. Reference: Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (2009). The software is freely available at http://soap.genomics.org.cn/soapaligner.html. The reference sequence can be a genome sequence obtained by whole-genome shotgun sequencing.

4) Compute the actual insert size of the sequences from the SOAPaligner results of the previous step and write an insert-size file, which is used later to build the assembly configuration file.

5) Build a separate assembly configuration file for each pooled library from its insert size obtained in the previous step. A per-library configuration file reduces the variance between the configured and the actual insert size and avoids the loss of assembly accuracy that an excessive variance would cause.

6) Using the configuration files from the previous step, assemble each pooled library separately with SOAPdenovo. Several different k values are used for each library, yielding, per pooled library, a series of assembly results (contigs, scaffolds and so on) at the different k values.

7) For each pooled library, select from the assemblies at different k values the contigs and scaffolds with the best N50; these sequences are used in subsequent processing.

8) Fill the gaps in the scaffolds of each pooled library from the previous step with KGF, producing new gap-filled scaffolds.

9) Correct the gap-filled scaffolds of each pooled library from the previous step, producing corrected scaffolds.

10) Score the corrected scaffolds of each pooled library from the previous step with the qualScore software to obtain a score for every base of the corrected scaffolds, and generate a score file.

11) Merge the corrected scaffolds of all pooled libraries together with their corresponding base scores. This gives scaffolds totaling about 5.3 G with an N50 of about 11 kbp.

12) Remove all invalid bases N from the sequences, giving 5.3 G of scaftigs with an N50 of about 5 kbp.
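Step 12, which turns scaffolds into scaftigs by removing the invalid bases N, can be illustrated with a short sketch. The function name and the `min_len` parameter are assumptions for the example; the source does not say whether short fragments are filtered.

```python
import re


def scaffold_to_scaftigs(scaffold, min_len=1):
    """Split a scaffold at every run of N (or n) and return the pieces.

    Pieces shorter than min_len are dropped, which also removes the empty
    strings produced when the scaffold starts or ends with N.
    """
    return [piece for piece in re.split(r"[Nn]+", scaffold) if len(piece) >= min_len]
```

For example, a scaffold `ACGTNNNNGGCCNA` yields the three scaftigs `ACGT`, `GGCC` and `A`. Splitting at gaps explains why the scaftig N50 reported above is lower than the scaffold N50.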

(2) Assembling contig sequences

1) Take the set of scaftig sequences output by the first-level assembly as the input sequences and convert their IDs.

2) Build a suffix-tree index over the input sequences. This facilitates the seed alignment in the next step: the sequences can be indexed quickly so that matching subsequences are found quickly.

3) Perform seed alignment on the input sequences. This fast step picks out short similar subsequences shared between pairs of sequences to establish that two sequences have some similarity, thereby selecting the sequence pairs that are likely to align well and passing them on to the exact alignment.

4) Build a sequence lookup library for the input sequences. It serves mainly for fast sequence lookup and for compressing the sequences.

5) Align the sequences. This step performs exact alignment to establish the alignment relationships between sequences.

6) Detect repeat regions. From the alignment results of the previous step, a series of subgraphs is built to identify repeat regions, and alignment results lying entirely within a repeat region are removed.

7) Score each alignment result by the proportion of identical bases between the sequences; the more identical bases, the higher the score. The alignment results are sorted by score in descending order, and those with high similarity and a proportion of inconsistent sequence below a given threshold are merged, together with their sequences, into the contig structure.

8) Remove base inconsistencies and generate the second-level contig sequences.

This gives a contig file of about 656 M with an N50 of about 9 kbp.
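The seed comparison of step 3 can be illustrated with a small sketch that finds sequence pairs sharing short exact subsequences. Note the substitution: the source builds a suffix-tree index, whereas this example uses a plain hash index of fixed-length seeds because it is shorter to show. The names `candidate_pairs`, `seed_len` and `min_shared` are assumptions, and reverse-complement matches are not considered.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(seqs, seed_len=12, min_shared=1):
    """Return {(id_a, id_b): number of distinct shared seeds} for likely overlaps.

    seqs: {sequence_id: sequence string}
    """
    # Index every seed (substring of length seed_len) by the sequences containing it.
    index = defaultdict(set)
    for sid, seq in seqs.items():
        for i in range(len(seq) - seed_len + 1):
            index[seq[i:i + seed_len]].add(sid)

    # Count distinct shared seeds for each pair of sequences.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

Only the pairs returned here would be handed to the slower exact alignment, which is the purpose of the seed step.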

(3) Assembling the second-level scaffolds

1) Data preparation for scaffolding. This step mainly generates preliminary files for use by SOAPdenovo.

2) Scaffolding. The second-level scaffolds are built from the paired-end information of the reads. Reads from libraries with different insert sizes are first aligned to the contig sequences; the paired-end reads with both ends aligned to contigs then determine the order of the contigs, which are arranged into scaffolds.

This gives a scaffold file of about 777 M with an N50 of about 46174 bp.

3) Gap filling. Where only one end, or only part, of a read pair aligns to a contig, the other end can be placed in an "N" region of the scaffold according to the insert size, so that invalid bases in the scaffold are converted into valid bases.

This yields the final assembly.

The genome assembly method provided by the embodiments of the invention offers high genome coverage, a long N50 and high accuracy. For example, in the application example above about 99.2% of the assembly is covered by aligned reads, and the total assembly length is very close to the estimated genome length, indicating high accuracy. The oyster assembly is about 850 Mbp against an estimated genome size of 800 Mbp, indicating high genome coverage.

N50 is determined as follows: the assembled contigs or scaffolds are sorted from longest to shortest and their lengths accumulated; when the cumulative length first exceeds 50% of the total length of all assembled sequences, the length of the last contig or scaffold added is the N50. N50 is an important measure of the completeness of a genome assembly.
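The N50 computation described above is short enough to show directly. The sketch below uses the common convention of stopping once the running total reaches at least half of the total length; the helper `best_k_by_n50`, which picks among assemblies made at different k values as in the application example, is a hypothetical name introduced for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def best_k_by_n50(assemblies):
    """assemblies: {k: list of sequence lengths}. Return the k with the largest N50."""
    return max(assemblies, key=lambda k: n50(assemblies[k]))
```

For lengths 80, 70, 50, 40, 30 and 20 the total is 290; the running sum reaches 150 after the second sequence, so the N50 is 70.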

In the oyster genome sequencing project, the N50 lengths of the second-level contigs and second-level scaffolds are higher than those of the contigs and scaffolds obtained by the shotgun method, reaching 9 kb and 50 kbp respectively.

Figure 8 shows a structural diagram of one embodiment of the genome assembly system of the invention. As shown in Figure 8, the system 800 comprises: a pooled-library generation unit 81, which breaks the genome into Fosmid clone sequences using Fosmid as the vector and distributes the Fosmid clone sequences among pooled libraries (pooling libraries); a sequencing unit 82, which fragments and sequences the Fosmid clone sequences of each pooled library; a first-level assembly unit 83, which performs short-read assembly on the reads obtained from each pooled library to obtain the scaffolds of each pooled library; a scaftig acquisition unit 84, which removes the invalid bases from the scaffolds to obtain the scaftig sequences of each pooled library; and a second-level assembly unit 85, which assembles the scaftig sequences into the scaffolds of the genome based on the alignment results between scaftig sequences, with alignment results in repeat regions masked. The total length of the Fosmid clone sequences is several times the length of the genome, and each pooled library may contain 10 to 90 Fosmid clone sequences.

According to one embodiment of the invention, the first-level assembly unit performs the following for each pooled library: cut short sequences of length K (K-mers) successively from the sequencing reads; store the K-mers in a hash table to form the vertices of a de Bruijn graph; connect K-mers that follow one another on a read to form the edges of the de Bruijn graph; process all reads to obtain the complete de Bruijn graph; remove paths in the de Bruijn graph caused by sequencing errors and heterozygous sites; join linear K-mer paths to form first-level contigs; align the reads to the first-level contig sequences and establish the relative positions and orientations of the contigs from the paired-end relationships of the reads, forming the scaffolds of each pooled library. The first-level assembly unit may use several different K values for each pooled library, generating contigs and scaffolds at each K value per pooled library, and then select, for each pooled library, the contigs and scaffolds with the best N50 from the assemblies at different K values.
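The graph construction and the joining of linear K-mer paths can be sketched as follows. This is a toy illustration, not SOAPdenovo: it ignores reverse complements, performs no removal of error or heterozygosity paths, and does not report isolated cycles. A Python `dict` stands in for the hash table, and the function names are assumptions.

```python
def build_de_bruijn(reads, k):
    """Return {kmer: set of successor kmers} built from the reads."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k):
            left, right = read[i:i + k], read[i + 1:i + k + 1]
            graph.setdefault(left, set()).add(right)  # edge between successive K-mers
            graph.setdefault(right, set())            # make sure the vertex exists
    return graph


def linear_contigs(graph):
    """Join unbranched K-mer paths into contig strings."""
    indegree = {v: 0 for v in graph}
    for successors in graph.values():
        for w in successors:
            indegree[w] += 1

    def is_internal(v):
        # Exactly one way in and one way out: the path does not branch here.
        return indegree[v] == 1 and len(graph[v]) == 1

    contigs = []
    for v in graph:
        if is_internal(v):
            continue  # paths are only started at branch points or path ends
        for w in graph[v]:
            path = v
            while is_internal(w):
                path += w[-1]
                w = next(iter(graph[w]))
            contigs.append(path + w[-1])
    return sorted(contigs)
```

With reads `ACGTT` and `CGTTA` and K = 3, the four K-mers form a single unbranched path, and the sketch returns the one contig `ACGTTA`.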

Figure 9 shows a partial structural diagram of one embodiment of the genome assembly system of the invention. According to one embodiment, the second-level assembly unit 95 of the genome assembly system comprises: an alignment module 951 for pairwise alignment of all scaftig sequences; a repeat processing module 952 for determining repeat regions from the alignment results and masking the alignment results in repeat regions; and an assembly module 953 for assembling the scaftig sequences into the scaffolds of the genome based on the alignment results. According to one embodiment, the genome assembly system further comprises a gap-filling unit 96 for converting invalid bases in the scaffolds of the genome into valid bases using the paired-end relationships between reads.

According to one embodiment of the system, the assembly module clusters the scaftigs based on the alignment results to obtain contig clusters; orders the sequences within each contig cluster by their relative positions and outputs unified contig sequences; and links the contig sequences into scaffolds using the paired-end relationships between reads.

For the functions of the devices and units in Figures 8 and 9, refer to the description of the corresponding parts of the method embodiments above; for brevity they are not repeated here.

The description of the invention is given for purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments with various modifications suited to particular uses.

Claims

1. A genome assembly method, comprising:

breaking the genome into Fosmid clone sequences using Fosmid as the vector;

distributing the Fosmid clone sequences among pooled libraries;

fragmenting and sequencing the Fosmid clone sequences of each pooled library, and performing short-read assembly on the reads obtained to obtain the scaffolds of each pooled library;

removing the invalid bases from the scaffolds to obtain the scaftig sequences of each pooled library;

assembling the scaftig sequences into the scaffolds of the genome based on the alignment results between scaftig sequences, wherein alignment results in repeat regions are masked.

2. The method according to claim 1, wherein assembling the scaftig sequences into the scaffolds of the genome based on the alignment results between scaftig sequences, with alignment results in repeat regions masked, comprises:

performing pairwise alignment of all scaftig sequences;

determining repeat regions from the alignment results and masking the alignment results in repeat regions;

assembling the scaftig sequences into the scaffolds of the genome based on the alignment results.

3. The method according to claim 1, further comprising:

converting invalid bases in the scaffolds of the genome into valid bases using the paired-end relationships between reads.

4. The method according to claim 1, wherein breaking the genome into Fosmid clone sequences using Fosmid as the vector comprises:

randomly breaking the genome into Fosmid clone sequences using Fosmid as the vector, the total length of the Fosmid clone sequences being several times the length of the genome.

5. The method according to claim 1, wherein each pooled library comprises 10 to 90 Fosmid clone sequences.

6. The assembly method according to claim 1, wherein fragmenting and sequencing the Fosmid clone sequences of each pooled library and performing short-read assembly on the reads obtained to obtain the scaffolds of each pooled library comprises:

for each pooled library:

cutting short sequences of length K (K-mers) successively from the sequencing reads;

storing the K-mers in a hash table to form the vertices of a de Bruijn graph;

connecting K-mers that follow one another on a sequencing read to form the edges of the de Bruijn graph; processing all sequencing reads to obtain the complete de Bruijn graph;

removing paths in the de Bruijn graph caused by sequencing errors and heterozygous sites;

joining linear K-mer paths to form first-level contigs;

aligning the sequencing reads to the first-level contig sequences and establishing the relative positions and orientations of the contig sequences from the paired-end relationships between reads, forming the scaffolds of each pooled library.

7. The assembly method according to claim 6, wherein:

several different K values are used for each pooled library, and contigs and scaffolds are generated at each K value per pooled library;

for each pooled library, the contigs and scaffolds with the best N50 are selected from the assemblies at different K values.

8. The assembly method according to claim 2, wherein assembling the scaftig sequences of the pooled libraries into the scaffolds of the genome based on the alignment results comprises:

clustering the scaftigs based on the alignment results to obtain contig clusters; ordering the sequences within each contig cluster by their relative positions and outputting unified contig sequences;

linking the contig sequences into scaffolds using the paired-end relationships between reads.

9. A genome assembly system, comprising:

a pooled-library generation unit for breaking the genome into Fosmid clone sequences using Fosmid as the vector and distributing the Fosmid clone sequences among pooled libraries;

a sequencing unit for fragmenting and sequencing the Fosmid clone sequences of each pooled library;

a first-level assembly unit for performing short-read assembly on the reads obtained from each pooled library to obtain the scaffolds of each pooled library;

a scaftig acquisition unit for removing the invalid bases from the scaffolds to obtain the scaftig sequences of each pooled library;

a second-level assembly unit for assembling the scaftig sequences into the scaffolds of the genome based on the alignment results between scaftig sequences, wherein alignment results in repeat regions are masked.

10. The system according to claim 9, wherein the second-level assembly unit comprises:

an alignment module for performing pairwise alignment of all scaftig sequences; a repeat processing module for determining repeat regions from the alignment results and masking the alignment results in repeat regions;

an assembly module for assembling the scaftig sequences into the scaffolds of the genome based on the alignment results.

11. The system according to claim 9, further comprising: a gap-filling unit for converting invalid bases in the scaffolds of the genome into valid bases using the paired-end relationships between reads.

12. The system according to claim 9, wherein the total length of the Fosmid clone sequences is several times the length of the genome;

or

each pooled library comprises 10 to 90 Fosmid clone sequences.

13. The assembly system according to claim 9, wherein the first-level assembly unit, for each pooled library: cuts short sequences of length K (K-mers) successively from the sequencing reads; stores the K-mers in a hash table to form the vertices of a de Bruijn graph; connects K-mers that follow one another on a sequencing read to form the edges of the de Bruijn graph; processes all sequencing reads to obtain the complete de Bruijn graph; removes paths in the de Bruijn graph caused by sequencing errors and heterozygous sites; joins linear K-mer paths to form first-level contigs; and aligns the sequencing reads to the first-level contig sequences and establishes the relative positions and orientations of the contig sequences from the paired-end relationships between reads, forming the scaffolds of each pooled library.

14. The assembly system according to claim 13, wherein:

the first-level assembly unit uses several different K values for each pooled library and generates contigs and scaffolds at each K value per pooled library; and, for each pooled library, selects the contigs and scaffolds with the best N50 from the assemblies at different K values.

15. The assembly system according to claim 10, wherein the assembly module clusters the scaftigs based on the alignment results to obtain contig clusters; orders the sequences within each contig cluster by their relative positions and outputs unified contig sequences; and links the contig sequences into scaffolds using the paired-end relationships between reads.