WO2010066115A1

WO2010066115A1 - Method and system for lowering time complexity in short sequences assembly

Info

Publication number: WO2010066115A1
Application number: PCT/CN2009/001427
Authority: WO
Inventors: 李瑞强; 朱红梅; 李松岗; 王俊; 杨焕明; 汪建
Original assignee: SHENZHEN HUADA GENE INSTITUTE
Current assignee: SHENZHEN HUADA GENE INSTITUTE
Priority date: 2008-12-12
Filing date: 2009-12-11
Publication date: 2010-06-17
Anticipated expiration: 2011-06-12
Also published as: CN101430742B; CN101430742A

Abstract

The invention belongs to the technical field of genetic engineering and provides a method and a system for reducing time complexity in short sequence assembly. The method comprises the following steps: receiving sequencing sequences; sliding along each received sequencing sequence one base at a time and cutting it into short strings of a fixed base length, and obtaining the left and right connection relationships of the short strings; and storing the sequence value of each short string, together with its left and right connection relationships and connection counts, as one node of a de Bruijn graph, with the nodes of the de Bruijn graph stored in a hash table in which the hash key is the sequence value and the hash value is the node. Because the de Bruijn graph is stored in a hash table, updating the connection relationships of a node is equivalent to looking up the node and updating the connection counts of the bases that have left and right connections to it. Looking up nodes, adding nodes, and updating their connection relationships can therefore each be completed in O(1) time. This lowers the time complexity of short sequence assembly and makes it possible to assemble the short sequences of large genomes.

Description

Method and system for reducing the time complexity of a short sequence assembly process

Technical Field

The invention belongs to the technical field of genetic engineering, and in particular relates to a method and system for reducing the time complexity of a short sequence assembly process.

Background Art

The short sequences produced by new sequencing technologies have two characteristics:

1. The sequence length is short;

2. The volume of data is large.

In long sequence assembly, commonly used software such as phrap stitches sequences together based on the overlap between them. Applied to short sequences, this approach requires too much computation to be of practical value. Some of the emerging short sequence assemblers do handle short sequences successfully, for example velvet, which is based on the de Bruijn graph. However, because of memory and time limits, these existing short sequence assemblers can only assemble relatively small prokaryotic genomes. For large genomes, such as eukaryotic genomes and mammalian genomes in particular, the time complexity of data processing is high and the memory footprint is large, so existing short sequence assembly software has difficulty assembling the short sequences.

Summary of the Invention

An object of the embodiments of the present invention is to provide a method for reducing the time complexity of a short sequence assembly process, so as to solve the problem that existing short sequence assembly software cannot assemble large genomes.

An embodiment of the present invention is implemented as follows: a method for reducing the time complexity of a short sequence assembly process, the method comprising the following steps:

receiving sequencing sequences; sliding along each received sequencing sequence one base at a time and cutting it into short strings of a fixed base length, and obtaining the left and right connection relationships of each short string, the left and right connection relationships comprising the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection;

storing the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of a de Bruijn graph, and storing the nodes of the de Bruijn graph in a hash table, wherein the hash key is the sequence value and the hash value is the node.

Another object of the embodiments of the present invention is to provide a system for reducing the time complexity of a short sequence assembly process, the system comprising:

a receiving unit, configured to receive sequencing sequences;

a sequence cutting unit, configured to slide along each received sequencing sequence one base at a time and cut it into short strings of a fixed base length, and to obtain the left and right connection relationships of each short string, the left and right connection relationships comprising the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection; and

a graph construction unit, configured to store the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of a de Bruijn graph, and to store the nodes of the de Bruijn graph in a hash table, wherein the hash key is the sequence value and the hash value is the node.

In the embodiments of the present invention, each received sequencing sequence is cut, sliding one base at a time, into short strings of a fixed base length, and the left and right connection relationships of the short strings are obtained; the sequence value of each short string, together with its left and right connection relationships and their connection counts, is stored as one node of a de Bruijn graph, and the nodes are stored in a hash table. Because hash table storage makes updating a node's connection relationships equivalent to looking up the node and updating the connection counts of its left-connected and right-connected bases, node lookup, node insertion, and connection updates can each be completed in O(1) time. This reduces the time complexity of the short sequence assembly process and in turn makes short sequence assembly of large genomes possible.

Brief Description of the Drawings

FIG. 1 is a flowchart of the method for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the content stored in a node according to an embodiment of the present invention;

FIG. 3 is a structural diagram of the system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

In the embodiments of the present invention, each received sequencing sequence is cut, sliding one base at a time, into short strings of a fixed base length, and the left and right connection relationships of the short strings are obtained; the sequence value of each short string, together with its left and right connection relationships and their connection counts, is stored as one node of a de Bruijn graph, and the nodes of the de Bruijn graph are stored in a hash table, wherein the hash key is the sequence value and the hash value is the node.

FIG. 1 shows the flow of the method for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention, described in detail as follows:

In step S101, sequencing sequences are received;

In step S102, each received sequencing sequence is cut, sliding one base at a time, into short strings (kmers) of a fixed length, and the left and right connection relationships of each short string are obtained, the left and right connection relationships comprising the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection;
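The sliding cut of step S102 can be illustrated with a minimal Python sketch. The function name `slide_cut` and the use of `None` for a missing neighbour at either end of a read are assumptions made for illustration; the source does not specify how read ends are handled.

```python
from typing import Iterator, Optional, Tuple


def slide_cut(read: str, k: int) -> Iterator[Tuple[str, Optional[str], Optional[str]]]:
    """Yield (short_string, left_base, right_base) for every window of length k.

    left_base and right_base are the bases adjacent to the window in the read,
    or None where the window touches an end of the read (an assumption).
    """
    if k > len(read):
        return  # read too short to produce any short string
    for i in range(len(read) - k + 1):
        left = read[i - 1] if i > 0 else None
        right = read[i + k] if i + k < len(read) else None
        yield read[i:i + k], left, right


# Hypothetical 6-base read cut into 3-base short strings.
windows = list(slide_cut("ACGTAC", 3))
for kmer, left, right in windows:
    print(kmer, left, right)
```

Each window moves one base to the right of the previous one, so a read of length n yields n - k + 1 short strings, and every interior short string has one left neighbour base and one right neighbour base.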

In step S103, the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, is stored as one node of a de Bruijn graph, and the nodes of the de Bruijn graph are stored in a hash table, wherein the hash key is the sequence value and the hash value is the node.

In this embodiment, the sequencing sequences are 25 to 75 bases long and are cut into short strings of a fixed length of 21 to 31 bases. The short strings are of course shorter than the sequencing sequences, and their length can be set according to the length of the sequencing sequences and the practical situation. Each node of the de Bruijn graph uses dedicated bits to store its sequence value, the connection count of each base that has a left connection, and the connection count of each base that has a right connection. Here each node of the de Bruijn graph is stored in 16 bytes, in the following format:

[ seq: 64, left_links: 24, right_links: 24, … ];

Here, seq stores the sequence value of the short string. The sequence value is computed by storing each nucleotide in 2 bits (A is 00, C is 01, T is 10, and G is 11) and encoding the bases in order to produce an integer that occupies 64 bits. Note that an even-length short string can be its own complementary short string; for example, the complementary short string of GATC is GATC itself. To prevent this ambiguity, the length of the short strings is always odd, and because of the limits of the data structure in this embodiment, the length is no greater than 31.

left_links uses 24 bits to store the left connection relationships and their counts. The 24 bits are divided into four 6-bit fields, namely A: 6, T: 6, G: 6, C: 6, which store the number of left connections between the short string and base A, T, G, or C respectively; each count takes a value in the range [0, 63]. right_links likewise uses 24 bits to store the right connection relationships and their counts, divided into four 6-bit fields, namely A: 6, T: 6, G: 6, C: 6, which store the number of right connections between the short string and base A, T, G, or C respectively; each count takes a value in the range [0, 63]. The 8 bits that follow can be used to store other values: for example, a deletion flag closed, marking whether the short string has been deleted; a usage flag in_use, marking whether the short string has been used; or other flags.

In this way, the connection relationships between the nodes of the de Bruijn graph can be constructed from the sequence value, the connection count of each left-connected base, and the connection count of each right-connected base stored in each node. For example, suppose short string A is AAAAAAAA and the connection count of its right-connected base T is 19. The short string B reached through that base T is AAAAAAAT, which equals short string A shifted left by one base with the connected base T appended, and short string B has been found connected to short string A 19 times. The content stored in the node for the connection count of the right-connected base T is shown in FIG. 2.
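The 2-bit encoding and the 6-bit connection counters can be sketched in Python as follows. The base codes and the field listing order A, T, G, C come from the text; the placement of A in the highest 6 bits of the 24-bit field and the saturation of a counter at 63 are assumptions for illustration, as are the helper names `encode`, `bump`, and `get_count`.

```python
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}  # 2-bit codes from the text
FIELD_ORDER = "ATGC"   # order in which the text lists the four 6-bit fields
MAX_COUNT = 63         # largest value a 6-bit counter can hold


def encode(kmer: str) -> int:
    """Encode a short string of at most 31 bases as an integer, 2 bits per base."""
    if len(kmer) > 31:
        raise ValueError("short string longer than 31 bases")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def _shift(base: str) -> int:
    # Assumption: the first listed base (A) occupies the highest 6 bits.
    return 6 * (3 - FIELD_ORDER.index(base))


def get_count(links: int, base: str) -> int:
    """Read the 6-bit counter for `base` out of a 24-bit links field."""
    return (links >> _shift(base)) & 0x3F


def bump(links: int, base: str) -> int:
    """Add one connection for `base`, saturating at 63 (an assumption)."""
    if get_count(links, base) < MAX_COUNT:
        links += 1 << _shift(base)
    return links


# The example from the text: AAAAAAAA followed 19 times by base T.
seq = encode("AAAAAAAA")
right_links = 0
for _ in range(19):
    right_links = bump(right_links, "T")
print(hex(seq), get_count(right_links, "T"))
```

Saturating rather than wrapping keeps a full counter from spilling into the neighbouring 6-bit field; the source only states that each count lies in [0, 63], so other overflow policies are equally compatible with it.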

Step S103 above specifically comprises:

Step 1. Using the sequence value of the obtained short string, query whether a corresponding node already exists among the stored nodes;

Step 2. If no corresponding node is found, add a node;

Step 3. If a corresponding node is found, update the connection relationships of that node.

In this embodiment, the nodes of the de Bruijn graph are stored in a hash table, in which the hash key is the sequence value and the hash value is the node. For example, take the short string AAAAAAAA, whose sequence value is 0x0000. The sequence value 0x0000 is used as the key to query the hash table for a corresponding node. If no corresponding node is found, a node is added to the hash table; the seq in its value is the sequence value 0x0000 of the short string, and the connection counts of the corresponding left-connected and right-connected bases in the node are set to 1 according to the short strings adjacent to this short string. If a corresponding node is found, its connection relationships are updated: according to the short strings adjacent to this short string, the connection count of each base connected to the short string is incremented by 1. Step 1 is then executed again for the next short string, until all short strings have been processed.
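The lookup, add, and update loop of steps 1 to 3 can be sketched with a Python `dict` standing in for the hash table. The node layout here (a dictionary with per-base counters) is a readable stand-in rather than the packed 16-byte format, and the names `new_node` and `add_read` are hypothetical.

```python
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}


def encode(kmer: str) -> int:
    """2-bit-per-base sequence value, used as the hash key."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def new_node(seq: int) -> dict:
    """Unpacked stand-in for a node: sequence value plus per-base counters."""
    return {
        "seq": seq,
        "left_links": dict.fromkeys("ATGC", 0),
        "right_links": dict.fromkeys("ATGC", 0),
    }


def add_read(graph: dict, read: str, k: int) -> None:
    """Cut one read into short strings and insert or update their nodes."""
    for i in range(len(read) - k + 1):
        seq = encode(read[i:i + k])
        node = graph.get(seq)              # step 1: look up by sequence value
        if node is None:                   # step 2: not found, add a node
            node = graph[seq] = new_node(seq)
        # step 3 (and initialisation of a new node): count the neighbours
        if i > 0:
            base = read[i - 1]
            node["left_links"][base] = min(node["left_links"][base] + 1, 63)
        if i + k < len(read):
            base = read[i + k]
            node["right_links"][base] = min(node["right_links"][base] + 1, 63)


graph = {}
add_read(graph, "AAAT", 3)
add_read(graph, "AAAT", 3)
print(len(graph), graph[encode("AAA")]["right_links"]["T"])
```

A dictionary lookup, insertion, and in-place counter update each take constant time on average, which is the property the embodiment relies on.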

In this embodiment, using a hash table allows node lookup, node insertion (that is, storing a node), and updating of node connection relationships to be completed in O(1) time. Updating a node's connection relationships is equivalent to looking up the node and updating the connection counts of its left-connected and right-connected bases, so its time complexity is still O(1). The embodiment turns what originally required several logical processing steps into a single step: in the one operation of identifying the content stored for a node, the computer completes the three steps of lookup, insertion, and connection update, while the time complexity remains O(1). This saves the time the computer spends on logical decisions and improves its internal performance during short sequence assembly, thereby making short sequence assembly of large genomes achievable.

To reduce the memory needed to store the nodes of the de Bruijn graph, in a preferred embodiment a single node of the de Bruijn graph can store two complementary short strings, with the node's sequence value taken as the smaller of the two strings' sequence values. If the sequence value of a short string is smaller than that of its complementary short string, the node stores the short string's own sequence value in seq; the connection counts of its left-connected bases are updated in left_links and those of its right-connected bases in right_links. If the sequence value of a short string is greater than that of its complementary short string, the node stores the complementary short string's sequence value in seq; the connection counts of the short string's right-connected bases are updated in left_links and those of its left-connected bases in right_links. When operating on the graph, the program can use one additional variable to mark which of the two complementary short strings is in use. When traversing the graph, the program only needs to maintain this one variable to obtain the correct orientation of every node on the path.
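The choice of which of two complementary short strings names the shared node can be sketched as follows. "Complementary short string" is read here as the reverse complement, which matches the GATC example in the text; that reading, and the helper names `canonical` and `target_field`, are assumptions for illustration.

```python
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def encode(kmer: str) -> int:
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def reverse_complement(kmer: str) -> str:
    """Complement every base and reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def canonical(kmer: str):
    """Return (sequence value stored in the node, flipped).

    flipped is True when the complementary short string has the smaller
    sequence value and therefore names the node.
    """
    forward = encode(kmer)
    reverse = encode(reverse_complement(kmer))
    if forward <= reverse:
        return forward, False
    return reverse, True


def target_field(kmer: str, side: str):
    """Map a 'left' or 'right' connection of kmer to the node field it updates."""
    seq, flipped = canonical(kmer)
    if flipped:
        side = "right" if side == "left" else "left"
    return seq, side + "_links"


print(canonical("ACG"), canonical("CGT"))
print(target_field("TTT", "right"))
```

Both ACG and its reverse complement CGT map to the same stored sequence value, and the `flipped` result plays the role of the additional orientation variable mentioned in the text.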

To speed up graph construction, in another preferred embodiment multiple hash tables are used, each node of the de Bruijn graph being stored in exactly one of them, and different threads access different hash tables.

In this embodiment, 8 hash tables are created and a certain number of raw sequences are read in. Eight threads cut the raw sequencing sequences that were read in and compute the complements of the short strings in parallel. Once this data has been collected, 8 threads insert and update nodes, with each thread handling only sequence values that have a fixed prefix. Each hash table stores the sequence values with a designated prefix, and each hash table is accessed by only one thread, which guarantees that each node is stored exactly once.
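The one-table-per-thread partitioning can be sketched as below. The text only says that each table holds sequence values with a designated prefix; taking the top 3 bits of the 2k-bit sequence value as that prefix (giving 8 tables) is an assumption, and a plain occurrence count stands in for the full node update. In CPython the threads illustrate the ownership rule rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_TABLES = 8  # number of hash tables and worker threads, as in the text


def table_index(seq_value: int, k: int) -> int:
    """Assumed prefix rule: the top 3 bits of the 2k-bit sequence value."""
    return seq_value >> (2 * k - 3)


def build(seq_values, k: int):
    """Bucket sequence values by prefix, then fill one table per thread."""
    buckets = [[] for _ in range(NUM_TABLES)]
    for value in seq_values:
        buckets[table_index(value, k)].append(value)

    tables = [dict() for _ in range(NUM_TABLES)]

    def fill(i: int) -> None:
        table = tables[i]  # only worker i ever touches table i
        for value in buckets[i]:
            table[value] = table.get(value, 0) + 1  # stand-in for node update

    with ThreadPoolExecutor(max_workers=NUM_TABLES) as pool:
        list(pool.map(fill, range(NUM_TABLES)))  # drain to surface exceptions
    return tables


# Hypothetical sequence values for 3-base short strings (6-bit values).
tables = build([0, 0, 63, 32], k=3)
print([len(t) for t in tables])
```

Because the prefix fully determines the table, the same sequence value can never land in two tables, and no lock is needed since each table has a single writer.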

With the compact data structure provided by the above embodiment, node information (the sequence value) and node connection information (the edges) are combined, so that from the value of a single node one can obtain the short string at that node, the sequence values of the short strings adjacent to it, and their counts.

Of course, other structures can also be used to store the nodes of the de Bruijn graph, for example a tree. Storing the nodes in a hash table is similar to storing them in a tree in terms of memory and usage, but the hash table is clearly faster than the tree for both access and modification.

Resequencing data from an African human genome were selected. After error correction, the data amounted to 254G bases; cutting them into fixed-length short strings of 25 bases gave a total of 7G short strings (counting both forward and reverse sequences). Building the de Bruijn graph with the method of this embodiment used at most 110G of memory and consumed 23 CPU hours in total, on a Quad-Core AMD Opteron(tm) Processor 8356 at 2.2 GHz.

Those of ordinary skill in the art will understand that all or some of the steps of the method in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and is used to perform the following steps:

1. Receiving sequencing sequences;

2. Sliding along each received sequencing sequence one base at a time and cutting it into short strings of a fixed base length, and obtaining the left and right connection relationships of each short string, the left and right connection relationships comprising the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection;

3. Storing the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of a de Bruijn graph, and storing the nodes of the de Bruijn graph in a hash table, wherein the hash key is the sequence value and the hash value is the node.

FIG. 3 shows the structure of the system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. For ease of description, only the parts relevant to the embodiment are shown.

The system can be used in short sequence assembly, wherein:

The receiving unit 301 receives sequencing sequences.

The sequence cutting unit 302 slides along each received sequencing sequence one base at a time and cuts it into short strings of a fixed base length, and obtains the left and right connection relationships of the short strings. Its implementation is as described above and is not repeated here.

The graph construction unit 303 stores the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of a de Bruijn graph, and stores the nodes of the de Bruijn graph in a hash table, wherein the hash key is the sequence value and the hash value is the node. In this embodiment, the graph construction unit 303 uses dedicated bits in each node of the de Bruijn graph to store its sequence value, the connection count of each base that has a left connection, and the connection count of each base that has a right connection.

The graph construction unit 303 comprises:

a query module 3031, which uses the sequence value of the obtained short string to query whether a corresponding node already exists among the stored nodes;

a node adding module 3032, which adds a node when the query module 3031 finds no corresponding node; its implementation is as described above and is not repeated here;

a connection update module 3033, which updates the connection relationships of the corresponding node when the query module 3031 finds that node; its implementation is as described above and is not repeated here.

To reduce the space needed to store the nodes of the de Bruijn graph, in a preferred embodiment the graph construction unit 303 uses a single node of the de Bruijn graph to store two complementary short strings, with the node's sequence value taken as the smaller of the two strings' sequence values. Its implementation is as described above and is not repeated here.

To speed up graph construction, in another preferred embodiment the graph construction unit 303 uses multiple hash tables, each node of the de Bruijn graph being stored in exactly one of them, and different threads access different hash tables.

In the embodiments of the present invention, each received sequencing sequence is cut, sliding one base at a time, into short strings of a fixed base length, and the left and right connection relationships of the short strings are obtained; the sequence value of each short string, together with its left and right connection relationships and their connection counts, is stored as one node of a de Bruijn graph, and the nodes are stored in a hash table. Because the graph construction of the embodiments is combined with hash table storage, updating a node's connection relationships is equivalent to looking up the node and updating the connection counts of its left-connected and right-connected bases, so node lookup, node insertion, and connection updates can each be completed in O(1) time. This reduces the time complexity of the short sequence assembly process and in turn makes short sequence assembly of large genomes possible.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for reducing the time complexity of a short sequence assembly process, the method comprising the steps of:

receiving sequencing sequences;

sliding along each received sequencing sequence one base at a time and cutting it into short strings of a fixed base length, and obtaining the left and right connection relationships of each short string, the left and right connection relationships comprising the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection; and

storing the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of a de Bruijn graph, and storing the nodes of the de Bruijn graph in a hash table, wherein the hash key is the sequence value and the hash value is the node.

2. The method according to claim 1, wherein the node of the de Bruijn graph uses corresponding bits to store the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection.

3. The method according to claim 1, wherein the step of storing the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of the de Bruijn graph specifically comprises:

querying, according to the sequence value of the obtained short string, whether a corresponding node already exists among the stored nodes;

adding a node if no corresponding node is found; and

updating the connection relationships of the corresponding node if the corresponding node is found.

4. The method according to claim 1, wherein one node of the de Bruijn graph stores two complementary short strings.

5. The method according to claim 1, wherein a plurality of hash tables are used to uniquely store the different nodes of the de Bruijn graph, and different threads access different hash tables.

6. A system for reducing the time complexity of a short sequence assembly process, the system comprising:

a receiving unit, configured to receive a sequencing sequence;

a sequence cutting unit, configured to slide along each received sequencing sequence one base at a time and cut it into short strings of a fixed base length, and to obtain the left and right connection relationships of each short string, the left and right connection relationships comprising the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection; and

a graph construction unit, configured to store the sequence value of each obtained short string, together with its left and right connection relationships and their connection counts, as one node of a de Bruijn graph, and to store the different nodes of the de Bruijn graph in a hash table, wherein the hash key is the sequence value and the hash value is the node.

7. The system according to claim 6, wherein the graph construction unit uses corresponding bits in a node of the de Bruijn graph to store the sequence value of the short string, the connection count of each base that has a left connection, and the connection count of each base that has a right connection.

8. The system according to claim 7, wherein the graph construction unit comprises:

a query module, configured to query, according to the sequence value of the obtained short string, whether a corresponding node already exists among the stored nodes;

a node adding module, configured to add a node when the query module does not find a corresponding node; and

a connection update module, configured to update the connection relationships of the corresponding node when the query module finds the corresponding node.