[go: up one dir, main page]

WO2010066115A1 - Method and system for lowering time complexity in short sequences assembly - Google Patents

Method and system for lowering time complexity in short sequences assembly Download PDF

Info

Publication number
WO2010066115A1
WO2010066115A1 PCT/CN2009/001427 CN2009001427W WO2010066115A1 WO 2010066115 A1 WO2010066115 A1 WO 2010066115A1 CN 2009001427 W CN2009001427 W CN 2009001427W WO 2010066115 A1 WO2010066115 A1 WO 2010066115A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
short
sequence
connections
base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2009/001427
Other languages
French (fr)
Chinese (zh)
Inventor
李瑞强
朱红梅
李松岗
王俊
杨焕明
汪建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN HUADA GENE INSTITUTE
Original Assignee
SHENZHEN HUADA GENE INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN HUADA GENE INSTITUTE filed Critical SHENZHEN HUADA GENE INSTITUTE
Publication of WO2010066115A1 publication Critical patent/WO2010066115A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly

Definitions

  • the invention belongs to the technical field of genetic engineering, and in particular relates to a method and system for reducing the time complexity of a short sequence assembly process. Background technique
  • sequence length is short
  • the embodiment of the present invention is implemented as a method for reducing the time complexity of a short sequence assembly process, and the method includes the following steps:
  • the received sequencing sequence is slid and cut one by one to obtain a short string of a fixed base length, and the left and right connection relationships of the short string are obtained, and the left and right connection relationships include a short string sequence value and a left connection exists.
  • Another object of embodiments of the present invention is to provide a system for reducing the time complexity of a short sequence assembly process, the system comprising:
  • a receiving unit configured to receive a sequencing sequence
  • a sequence cutting unit configured to separately slide the received sequencing sequence one by one to obtain a short string of a fixed base length, and obtain a left and right connection relationship of the short string, wherein the left and right connection relationships include short strings a sequence value, the number of linkages of each base in which there is a left junction, and the number of linkages of each base in which a right junction exists;
  • composition unit configured to store the obtained sequence values of the short strings, the left and right connection relationships, and the number of connections thereof as a node of the de Bruijn graph, and store the nodes of the de Bruijn graph by using a hash table,
  • the hash key is the sequence value
  • the hash value is the node.
  • the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
  • the left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph and stored using a hash table. Since the hash table storage is used to make the update node connection relationship equal to the number of connections of the node to find and update the left and right connected bases of the found node, the search, insertion, and insertion of the node can be completed within O ( l ) time. The update of the node connection relationship, thereby reducing the time complexity in the short sequence assembly process, and thus achieving short sequence assembly of large genomes.
  • FIG. 1 is a flowchart of an implementation of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a node storage content according to an embodiment of the present invention.
  • FIG. 3 is a structural diagram of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. detailed description
  • the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
  • the left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph, and the hash table is used to store the nodes of the de Bruijn graph, wherein the hash key is a sequence value and the hash value is a node.
  • FIG. 1 is a flowchart showing an implementation process of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention, which is described in detail as follows:
  • step S101 receiving a sequencing sequence
  • step S102 the received sequencing sequence is slid and cut one by one to obtain a short length (kmer) of a fixed length, and a left and right connection relationship of the short string is obtained, and the left and right connection relationships include short sequence values.
  • step S103 the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof are stored as a node of the de Bruijn graph, and the hash table is used to store de Each node of the Briiijn graph, where the hash key is a sequence value and the hash value is a node.
  • the sequencing sequence has a base length of 25 - 75 and is cut into short strings of a fixed length of 21 - 31.
  • the length of the short string obtained by the cutting is smaller than the length of the sequence, and the length can be set according to the length of the sequencing sequence and the actual situation.
  • Each node in the de Bruijn diagram uses the corresponding bits to store its sequence value, the number of connections for each base with left joins, and the number of connections for each base with right joins.
  • each node on the de B ijn map is stored in 16 bytes, and its storage format is as follows:
  • sequence value is calculated by using 2 bits to store a nucleotide sequence, A is represented by 00, C is represented by 01, T is represented by 10, G is represented by 11, and the sequence is encoded to generate a
  • A is represented by 00
  • C is represented by 01
  • T is represented by 10
  • G is represented by 11
  • the sequence is encoded to generate a
  • the 64-bit integer value and considering the short string of even length, its complementary short string may be itself, for example, the short string GATC's complementary short string is GATC itself. In order to prevent such confusion, the lengths of the short strings are all odd.
  • the length of the short string is not more than 31;
  • the leftjinks uses 24 bits to store the left connection relationship and the number, and divides the 24 bits into 4 6-bit, ie A: 6, T: 6, G: 6, C: 6, respectively, using 6 bits to store the number of connections to the base A, T, G or C with the left link of the short string, each The number of connections is in the range [0, 63];
  • right-link uses 24 bits to store its right connection and number, and divides 24 bits into 4 6 bits, ie A: 6, T: 6, G: 6, C: 6, respectively, the number of connections of A, T, G or C with the right connection of the short string is stored by 6 bits, and the number of each connection is in the range of [0, 63]; the following 8 bits can be used.
  • the delete flag closed may be stored to identify whether the short string is deleted; the use mark in one may also be stored to identify whether the short string has been used, and other identifiers may also be stored.
  • the connection relationship of each node in the de Bmijii diagram can be constructed according to the short string sequence value stored in the node, the number of connections of each base having the left connection, and the number of connections of each base having the right connection.
  • the short-chain A is AAAAAAAA
  • the number of connections of the right-connected T is 19
  • the short-chain B of the right-connected base T is AAAAAAAT, which is equal to the short-chain ⁇ left-shifting one base and adding the base T connected thereto.
  • the storage contents of the number of connections in the node storing the right port are as shown in FIG. 2 .
  • step S103 is specifically:
  • Step 1 Query whether the corresponding node exists in the stored node according to the obtained sequence value of the short string;
  • Step 2 If no corresponding node is queried, add a node
  • Step 3 If the corresponding node is queried, the connection relationship of the corresponding node is updated.
  • each node of the de Bruijn graph is stored using a hash table, the hash key is a sequence value, and the hash value is a node. For example, take a short string of AAAAAAAA, its sequence value is 0x0000, and use its sequence value 0x0000 as a key to query whether the corresponding node exists in the hash table. If no corresponding node is queried, add the node to the hash table.
  • the seq of the value is the sequence value of the short string 0x0000, and the number of connections of the corresponding left and right connected bases in the node is set to 1 according to the short string adjacent to the short string; if the query already has a corresponding
  • the node updates the connection relationship of the corresponding node, that is, updates the number of connections of the corresponding left and right connected bases in the node according to the short string adjacent to the short string, and the number of corresponding connections of the bases connected to the short string plus 1.
  • step 1 to find the next short string until all short strings are found.
  • the hash node can be used to complete the lookup node, the insert node (ie, the storage node), and the update node connection relationship in the time of O(1). Updating the node connection relationship is equivalent to finding the node and updating the number of connections of the left and right connected bases of the found node, so the time complexity is still O ( l ).
  • the embodiment of the present invention converts the steps that originally required multiple logical processes into one step, so that the computer completes the search, insert, and update connection relationship in the operation of identifying the step of storing the content of the node.
  • the three steps, while the time complexity is still O ( l ) thus saving the time for the computer to make logical judgments, improving the internal performance of the computer during the short sequence assembly process, thus providing for the realization of short sequence assembly of large genomes.
  • only one node in the de Bruijn graph can be used to store two complementary short strings, and the sequence values of the nodes are complementary to two short A smaller sequence value in the string. If the sequence value of a short string is smaller than the sequence value of its complementary short string, the node in the de Bruijn graph stores the sequence value of the short string, and seq stores the sequence value of the short string, and the number of corresponding connections to the left connected base.
  • Update to left-links the number of corresponding connections to its right-connected base is updated to right-links; if the sequence value of a short string is greater than the sequence value of its complementary short-string, the node in the de Bruijn graph stores its complementary short string
  • the sequence value, seq stores the sequence value of its complementary short string
  • the number of corresponding connections to its right-linked base is updated to leftjinks
  • the number of corresponding connections to its left-linked base is updated to ri g M-links.
  • multiple hash tables are used to uniquely store different nodes in the de Bruijn graph, and different threads are used to access different hash tables.
  • eight hash tables are created, a certain number of original sequences are read, and eight threads are used to perform multi-thread cutting and short-string complementation on the original sequence to be read. After the data collection is completed, the data is collected. Eight threads are inserted into the update node, where each thread only processes the sequence value of the fixed prefix. Each hash table stores a sequence value of the specified prefix, and a hash table has only one thread access to guarantee the uniqueness of the node storage.
  • node information ie, sequence value
  • node connection information ie, edges
  • nodes of the de Bruijn graph can be stored in a tree structure, and the hash table is used to store each node in memory and usage and stored in a tree structure, but using a hash table storage.
  • Each node is significantly better than the tree's storage structure in terms of access and modification speed.
  • the African human genome resequencing data was selected. After error correction processing, the sequence data amount was 254 G bases, and after cutting into a fixed length short string of 25 base length, the total number of short strings (including the forward and reverse sequences) was 7G.
  • the method provided by the embodiment of the present invention constructs a de Bruijn graph, and the maximum memory usage value is 110G, which consumes 23 CPU hours, wherein the CPU parameter is Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ.
  • the received sequencing sequence is slid and cut one by one to obtain a short string of fixed base length, and the left and right connection relationships of the short strings are obtained, and the left and right connection relationships include short string sequence values and presence of left connections.
  • FIG. 3 is a diagram showing the structure of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. For the convenience of description, only the embodiments related to the embodiment of the present invention are shown. section.
  • the system can be used in short sequence assembly, where:
  • the receiving unit 301 receives the sequencing sequence.
  • the sequence cutting unit 302 respectively slid and cuts the received sequencing sequence by a base to obtain a short string of a fixed base length, and obtains a left-right connection relationship of the short strings.
  • the actual manner is as described above, and will not be described again.
  • the composition unit 303 stores the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and uses a hash table to store each node of the de Bruijn graph, wherein the hash key is The sequence value, the hash value is the node.
  • the composition unit 303 uses the corresponding bit in the node of the de Bruijn diagram to store its sequence value, the number of connections of each base in which there is a left connection, and the number of connections in which each base of the right connection exists.
  • the composition unit 303 includes:
  • the query module 3031 queries, among the stored nodes, whether the corresponding node has been stored according to the obtained sequence value of the short string.
  • the node adding module 3032 adds a node when the query module 3031 does not query the corresponding node, and the implementation manner is as described above, and details are not described herein.
  • connection update module 3033 when the query module 3031 queries the corresponding node, updates the connection relationship of the corresponding node, and the implementation manner thereof is as described above, and details are not described herein again.
  • the composition unit 303 uses a node in the de Bruijn diagram to store complementary two short strings, and the sequence values of the nodes are taken in two complementary short strings.
  • the smaller sequence values are implemented as described above and will not be described again.
  • the composition unit 303 uniquely stores different nodes in the de Bruijn map using a plurality of hash tables, and uses different threads to access different hash tables.
  • the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
  • the left and right connection relationships and their number of connections are stored as a node of the de Bruijii diagram and stored using a hash table.
  • the update node connection relationship is equivalent to the number of connections of the left and right connected bases of the node searched and updated by the node, and thus at O ( l )
  • the search, insertion and node connection relationship of the node can be updated in time, thereby reducing the time complexity in the short sequence assembly process and realizing the short sequence assembly of the large genome.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of gene engineering, and provides a method for lowering time complexity in short sequences assembly and a system thereof. The method comprises the following steps: receiving sequencing sequences; respectively processing base sliding cutting to the received sequencing sequences one by one to obtain short strings with constant base length and to obtain left and right connection relations of the short strings; and storing the sequence values of the obtained all short strings, left and right connection relations and connection amount as one node of a de Bruijn graph, using a hash table to store the nodes of the de Bruijn graph, the hash key is the sequence value, the hash value is the node. Because of using the de Bruijn graph and applying the hash table for storing, it makes updating the connection relation of the nodes to be equal to searching nodes and updating the connection amount of bases having left and right connections for searched nodes. Thus, the searching and adding nodes and updating the connection relations of nodes can be finished during the time of 0(1). The lowering time complexity in the short sequences assembly can be realized and the short sequences of large genome can be assembled.

Description

一种降低短序列组装过程的时间复杂度的方法及系统 技术领域  Method and system for reducing time complexity of short sequence assembly process

本发明属于基因工程技术领域, 尤其涉及一种降低短序列组装 过程时间复杂度的方法及系统。 背景技术  The invention belongs to the technical field of genetic engineering, and in particular relates to a method and system for reducing the time complexity of a short sequence assembly process. Background technique

新测序技术产生的短序列有两个特点:  The short sequence generated by the new sequencing technology has two characteristics:

1.序列长度短;  1. The sequence length is short;

2.数据量大。  2. The amount of data is large.

在长序列组装过程中, 常用的 phrap等软件均是基于序列间的 交叠(overlap )来进行拼接, 如果应用在短序列上, 则运算量太大, 没有实际应用价值。而新兴的短序列组装软件中成功处理短序列的, 例如基于 de Bruijn图的 velvet等。 但是, 由于受内存、 时间等的限 制 , 现有的这些短序列组装软件只能组装较小的原核生物基因组, 对于大基因组, 例如真核生物基因组, 特别是哺乳动物基因组数据, 由于数据处理时时间复杂度较高、 内存占用较大, 现有的短序列组 装软件均难以实现短序列的组装。 发明内容  In the long sequence assembly process, the commonly used software such as phrap is based on the overlap between sequences. If the application is on a short sequence, the amount of calculation is too large, and there is no practical application value. In the emerging short-sequence assembly software, short sequences are successfully processed, such as velvet based on de Bruijn graphs. However, due to limitations in memory, time, etc., these short sequence assembly software can only assemble smaller prokaryotic genomes, for large genomes, such as eukaryotic genomes, especially mammalian genomic data, due to data processing. The time complexity is high and the memory usage is large. It is difficult to assemble short sequences by the existing short sequence assembly software. Summary of the invention

本发明实施例的目的在于提供一种降低短序列组装过程的时间 复杂度的方法, 旨在解决现有短序列组装软件不能组装大基因组的 问题。  It is an object of embodiments of the present invention to provide a method for reducing the time complexity of a short sequence assembly process, which aims to solve the problem that existing short sequence assembly software cannot assemble a large genome.

本发明实施例是这样实现的, 一种降低短序列组装过程的时间 复杂度的方法, 所述方法包括下述步骤:  The embodiment of the present invention is implemented as a method for reducing the time complexity of a short sequence assembly process, and the method includes the following steps:

接收测序序列; 分别将接收到的测序序列逐个碱基滑动切割得到固定碱基长度 的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连接关系 包括短串序列值、 存在左连接的各碱基的连接数量和存在右连接的 各½的连接数量; Receiving a sequencing sequence; The received sequencing sequence is slid and cut one by one to obtain a short string of a fixed base length, and the left and right connection relationships of the short string are obtained, and the left and right connection relationships include a short string sequence value and a left connection exists. The number of connections for each base and the number of connections for each of the right connections;

将得到的各所述短串的序列值, 左、 右连接关系及其连接数量 存储为 de Bruijn图的一个节点, 并釆用哈希表存储所述 de Bruijn 图的各节点, 其中哈希键为所述序列值, 哈希值为所述节点。  And storing the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and storing the nodes of the de Bruijn graph by using a hash table, wherein the hash key For the sequence value, the hash value is the node.

本发明实施例的另一目的在于提供一种降低短序列组装过程的 时间复杂度的系统, 所述系统包括:  Another object of embodiments of the present invention is to provide a system for reducing the time complexity of a short sequence assembly process, the system comprising:

接收单元, 用于接收测序序列;  a receiving unit, configured to receive a sequencing sequence;

序列切割单元, 用于分别将接收到的测序序列逐个碱基滑动切 割得到固定碱基长度的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基的连接 数量和存在右连接的各碱基的连接数量; 以及  a sequence cutting unit, configured to separately slide the received sequencing sequence one by one to obtain a short string of a fixed base length, and obtain a left and right connection relationship of the short string, wherein the left and right connection relationships include short strings a sequence value, the number of linkages of each base in which there is a left junction, and the number of linkages of each base in which a right junction exists;

构图单元, 用于将得到的各所述短串的序列值, 左、 右连接关 系及其连接数量存储为 de Bruijn图的一个节点,并采用哈希表存储 所述 de Bruijn图的各节点,其中哈希键为所述序列值,哈希值为所 述节点。  a composition unit, configured to store the obtained sequence values of the short strings, the left and right connection relationships, and the number of connections thereof as a node of the de Bruijn graph, and store the nodes of the de Bruijn graph by using a hash table, The hash key is the sequence value, and the hash value is the node.

在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn图的一个节点, 并使用哈希表进行存储。 由于使用哈希表存 储使得更新节点连接关系等同于节点查找和更新查找到的节点的 左、 右相连碱基的连接数量, 因此在 O ( l )的时间内就可以完成节 点的查找、 插入和节点连接关系的更新, 从而实现在短序列组装过 程降低时间复杂度, 进而实现对大基因组的短序列组装。 附图说明 In the embodiment of the present invention, the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained. The left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph and stored using a hash table. Since the hash table storage is used to make the update node connection relationship equal to the number of connections of the node to find and update the left and right connected bases of the found node, the search, insertion, and insertion of the node can be completed within O ( l ) time. The update of the node connection relationship, thereby reducing the time complexity in the short sequence assembly process, and thus achieving short sequence assembly of large genomes. DRAWINGS

图 1是本发明实施例提供的降低短序列组装过程的时间复杂度 的方法的实现流程图;  1 is a flowchart of an implementation of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention;

图 2是本发明实施例提供的节点存储内容的示意图;  2 is a schematic diagram of a node storage content according to an embodiment of the present invention;

图 3本发明实施例提供的降低短序列组装过程的时间复杂度的 系统的结构图。 具体实施方式  FIG. 3 is a structural diagram of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. detailed description

为了使本发明的目的、 技术方案及优点更加清楚明白, 以下结 合附图及实施例, 对本发明进行进一步详细说明。 应当理解, 此处 所描述的具体实施例仅仅用以解释本发明, 并不用于限定本发明。  The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn图的一个节点,并采用哈希表存储 de Bruijn图的各节点,其 中哈希键为序列值, 哈希值为节点。  In the embodiment of the present invention, the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained. The left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph, and the hash table is used to store the nodes of the de Bruijn graph, wherein the hash key is a sequence value and the hash value is a node.

图 1示出了本发明实施例提供的降低短序列组装过程的时间复 杂度的方法的实现流程, 详述如下:  FIG. 1 is a flowchart showing an implementation process of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention, which is described in detail as follows:

在步驟 S101中, 接收测序序列;  In step S101, receiving a sequencing sequence;

在步骤 S102中,分别将接收到的测序序列逐个碱基滑动切割得 到固定 长度的短串 (kmer ) , 并得到短串的左、 右连接关系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基的连接 数量和存在右连接的各碱基的连接数量;  In step S102, the received sequencing sequence is slid and cut one by one to obtain a short length (kmer) of a fixed length, and a left and right connection relationship of the short string is obtained, and the left and right connection relationships include short sequence values. The number of connections of each base in which there is a left connection and the number of connections of each base in which a right connection exists;

在步骤 S103中, 将得到的各短串的序列值, 左、 右连接关系及 其连接数量存储为 de Bruijn图的一个节点, 并采用哈希表存储 de Briiijn图的各节点, 其中哈希键为序列值, 哈希值为节点。 In step S103, the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof are stored as a node of the de Bruijn graph, and the hash table is used to store de Each node of the Briiijn graph, where the hash key is a sequence value and the hash value is a node.

在本发明实施例中, 测序序列的碱基长度为 25 - 75, 切割成固 定 长度为 21 - 31的短串。 当然, 切割得到的短串的长度小于测 序序列的长度, 其长度可以根据测序序列的长度和实际情况设定。 de Bruijn图中每个节点使用相应位存储其序列值、存在左连接的各 碱基的连接数量和存在右连接的各碱基的连接数量。 这里, 用 16字 节存储 de B ijn图上的各节点, 其存储格式如下:  In the embodiment of the present invention, the sequencing sequence has a base length of 25 - 75 and is cut into short strings of a fixed length of 21 - 31. Of course, the length of the short string obtained by the cutting is smaller than the length of the sequence, and the length can be set according to the length of the sequencing sequence and the actual situation. Each node in the de Bruijn diagram uses the corresponding bits to store its sequence value, the number of connections for each base with left joins, and the number of connections for each base with right joins. Here, each node on the de B ijn map is stored in 16 bytes, and its storage format is as follows:

[ seq: 64, left links: 24, right— links: 24, …】;  [ seq: 64, left links: 24, right- links: 24, ...];

其中, seq存储短串的序列值,序列值的计算方法是使用 2位存 储一个核苷序列, A用 00表示, C用 01表示, T用 10表示, G用 11表示, 顺序编码下去生成一个占 64位的整数值, 并且, 考虑到 对于偶数长度的短串, 其互补短串可能为它自己, 例如短串 GATC 的互补短串为 GATC自己。 为了防止这种混淆, 短串的长度均为奇 数, 由于本发明实施例中数据结构的限制, 短串的长度不大于 31; leftjinks用 24位存储其左连接关系及数量, 将 24位分割成 4个 6 位, 即 A: 6, T: 6, G: 6, C: 6, 分别用 6位存储与该短串存在 左连接的碱基 A、 T、 G或 C的连接数量, 每种连接数量的取值范 围为 [0, 63]; right— links用 24位存储其右连接关系及数量, 将 24 位分割成 4个 6位, 即 A: 6, T: 6, G: 6, C: 6, 分别用 6位存 储与该短串存在右连接的 A、 T、 G或 C的连接数量, 每种连 接数量的取值范围为 [0, 63]; 其后面的 8位可以用于存储其他值, 例如, 可以存储删除标记 closed, 以标识该短串是否被删除; 也可 以存储使用标记 in一 use, 以标识该短串是否被使用过, 还可以存储 其他标识。 这样, 根据节点中存储的短串序列值、 存在左连接的各 碱基的连接数量和存在右连接的各碱基的连接数量即可构建 de Bmijii图中各节点的连接关系。 例如,短串甲为 AAAAAAAA存在右连接的 T的连接数量 为 19, 与其右连接碱基 T的短串乙为 AAAAAAAT, 等于短串曱左 移一个碱基并加上与其连接的碱基 T, 并且与短串甲连接的短串乙 有 19个, 节点中存储右连接 Τ的连接数量的存储内容如图 2 所示。 Where seq stores the sequence value of the short string, the sequence value is calculated by using 2 bits to store a nucleotide sequence, A is represented by 00, C is represented by 01, T is represented by 10, G is represented by 11, and the sequence is encoded to generate a The 64-bit integer value, and considering the short string of even length, its complementary short string may be itself, for example, the short string GATC's complementary short string is GATC itself. In order to prevent such confusion, the lengths of the short strings are all odd. Due to the limitation of the data structure in the embodiment of the present invention, the length of the short string is not more than 31; the leftjinks uses 24 bits to store the left connection relationship and the number, and divides the 24 bits into 4 6-bit, ie A: 6, T: 6, G: 6, C: 6, respectively, using 6 bits to store the number of connections to the base A, T, G or C with the left link of the short string, each The number of connections is in the range [0, 63]; right-link uses 24 bits to store its right connection and number, and divides 24 bits into 4 6 bits, ie A: 6, T: 6, G: 6, C: 6, respectively, the number of connections of A, T, G or C with the right connection of the short string is stored by 6 bits, and the number of each connection is in the range of [0, 63]; the following 8 bits can be used. For storing other values, for example, the delete flag closed may be stored to identify whether the short string is deleted; the use mark in one may also be stored to identify whether the short string has been used, and other identifiers may also be stored. Thus, the connection relationship of each node in the de Bmijii diagram can be constructed according to the short string sequence value stored in the node, the number of connections of each base having the left connection, and the number of connections of each base having the right connection. For example, the short-chain A is AAAAAAAA, the number of connections of the right-connected T is 19, and the short-chain B of the right-connected base T is AAAAAAAT, which is equal to the short-chain 曱 left-shifting one base and adding the base T connected thereto. And there are 19 short strings B connected to the short string A, and the storage contents of the number of connections in the node storing the right port are as shown in FIG. 2 .

上述步骤 S103具体为:  The above step S103 is specifically:

步骤 1.根据得到的短串的序列值在已存储的节点中查询是否已 存有相应节点;  Step 1. Query whether the corresponding node exists in the stored node according to the obtained sequence value of the short string;

步骤 2.如果没有查询到相应节点, 则添加节点;  Step 2. If no corresponding node is queried, add a node;

步骤 3.如果查询到相应节点, 则更新该相应节点的连接关系。 在本发明实施例中,使用哈希表存储 de Bruijn图的各节点,哈 希键为序列值, 哈希值为节点。 例如取一短串为 AAAAAAAA, 其 序列值为 0x0000 , 将其序列值 0x0000作为键在哈希表中查询是否 已存有相应节点, 如果没有查询到相应节点, 则添加节点存储到哈 希表中, 其值中的 seq为该短串的序列值 0x0000, 并根据该短串相 邻的短串将该节点中相应左、 右相连碱基的连接数量置为 1; 如果 查询到已存有相应节点, 则更新相应节点的连接关系, 即根据与该 短串相邻的短串更新该节点中相应左、 右相连碱基的连接数量, 将 与该短串有连接的碱基的相应连接数量加 1。 完成后, 执行步骤 1, 查找下一个短串, 直至完成全部短串的查找。  Step 3. If the corresponding node is queried, the connection relationship of the corresponding node is updated. In the embodiment of the present invention, each node of the de Bruijn graph is stored using a hash table, the hash key is a sequence value, and the hash value is a node. For example, take a short string of AAAAAAAA, its sequence value is 0x0000, and use its sequence value 0x0000 as a key to query whether the corresponding node exists in the hash table. If no corresponding node is queried, add the node to the hash table. The seq of the value is the sequence value of the short string 0x0000, and the number of connections of the corresponding left and right connected bases in the node is set to 1 according to the short string adjacent to the short string; if the query already has a corresponding The node updates the connection relationship of the corresponding node, that is, updates the number of connections of the corresponding left and right connected bases in the node according to the short string adjacent to the short string, and the number of corresponding connections of the bases connected to the short string plus 1. When finished, perform step 1 to find the next short string until all short strings are found.

在本发明实施例中, 使用哈希表可以在 O ( 1 )的时间内完成查 找节点、 插入节点 (即存储节点)和更新节点连接关系。 更新节点 连接关系等同于查找节点, 并更新查找到的节点的左、 右相连碱基 的连接数量, 所以时间复杂度依然为 O ( l )。 本发明实施例将原来 需要多个逻辑处理的步骤转化为一个步骤, 使得计算机在识别节点 存储内容这个步骤的运算中, 完成了查找、 插入和更新联结关系的 三个步骤, 而时间复杂度仍然为 O ( l ), 因此节约了计算机进行逻 辑判断的时间, 改善了短序列组装过程中计算机的内部性能, 从而 为大基因组的短序列组装的实现提供了 。 In the embodiment of the present invention, the hash node can be used to complete the lookup node, the insert node (ie, the storage node), and the update node connection relationship in the time of O(1). Updating the node connection relationship is equivalent to finding the node and updating the number of connections of the left and right connected bases of the found node, so the time complexity is still O ( l ). The embodiment of the present invention converts the steps that originally required multiple logical processes into one step, so that the computer completes the search, insert, and update connection relationship in the operation of identifying the step of storing the content of the node. The three steps, while the time complexity is still O ( l ), thus saving the time for the computer to make logical judgments, improving the internal performance of the computer during the short sequence assembly process, thus providing for the realization of short sequence assembly of large genomes.

为了降低存储 de Bruijn图中节点所需的内存空间,作为本发明 的一个优选实施例,还可以只用 de Bruijn图中的一个节点存储互补 的两短串, 节点的序列值取互补的两短串中较小的序列值。 如果一 个的短串的序列值小于其互补短串的序列值,则 de Bruijn图中的节 点存储该短串的序列值, seq存储该短串的序列值,与其左连接碱基 的相应连接数量更新到 left— links, 与其右连接碱基的相应连接数量 更新到 right— links; 如果一个的短串的序列值大于其互补短串的序 列值, 则 de Bruijn图中的节点存储其互补短串的序列值, seq存储 其互补短串的序列值, 与其右连接碱基的相应连接数量更新到 leftjinks, 与其左连接碱基的相应连接数量更新到 rigM—links。 操 作图时, 可以在程序中使用一个附加的变量来标记我们使用的是互 补的两短串的哪一个。 并且, 在沿图遍历时, 只需要程序维持一个 这样的变量, 就可以正确地得到路径中所有节点的正方向。 In order to reduce the memory space required for storing nodes in the de Bruijn graph, as a preferred embodiment of the present invention, only one node in the de Bruijn graph can be used to store two complementary short strings, and the sequence values of the nodes are complementary to two short A smaller sequence value in the string. If the sequence value of a short string is smaller than the sequence value of its complementary short string, the node in the de Bruijn graph stores the sequence value of the short string, and seq stores the sequence value of the short string, and the number of corresponding connections to the left connected base. Update to left-links, the number of corresponding connections to its right-connected base is updated to right-links; if the sequence value of a short string is greater than the sequence value of its complementary short-string, the node in the de Bruijn graph stores its complementary short string The sequence value, seq stores the sequence value of its complementary short string, the number of corresponding connections to its right-linked base is updated to leftjinks, and the number of corresponding connections to its left-linked base is updated to ri g M-links. When working with diagrams, you can use an additional variable in the program to mark which of the two short strings we are using. Moreover, when traversing along the graph, only the program needs to maintain one such variable, and the positive direction of all nodes in the path can be correctly obtained.

为了加快构建图的速度, 作为本发明的另一个优选实施例, 使 用多个哈希表唯一存储 de Bruijn图中的不同节点,并采用不同线程 访问不同的哈希表。  In order to speed up the construction of the map, as another preferred embodiment of the present invention, multiple hash tables are used to uniquely store different nodes in the de Bruijn graph, and different threads are used to access different hash tables.

在本发明实施例中, 建立 8个哈希表, 读入一定数目的原始序 列, 采用 8个线程对读入的原始测序列进行多线程切割、 短串求互 补, 在数据收集完毕后, 采用 8个线程进行插入更新节点, 其中每 个线程只处理固定前缀的序列值。 每个哈希表存储指定前缀的序列 值, 并且一个哈希表只有一个线程访问, 以保证节点存储的唯一性。  In the embodiment of the present invention, eight hash tables are created, a certain number of original sequences are read, and eight threads are used to perform multi-thread cutting and short-string complementation on the original sequence to be read. After the data collection is completed, the data is collected. Eight threads are inserted into the update node, where each thread only processes the sequence value of the fixed prefix. Each hash table stores a sequence value of the specified prefix, and a hash table has only one thread access to guarantee the uniqueness of the node storage.

采用上述本发明实施例提供的压缩的数据结构, 可以将节点信 息(即序列值)和节点的连接信息 (即边)组合在一起, 从一个节 点的值可以得到该节点上的短串、 与该短串相邻的短串的序列值及 其数量。 By using the compressed data structure provided by the embodiment of the present invention, node information (ie, sequence value) and node connection information (ie, edges) can be combined together from one section. The value of the point gives the sequence value of the short string on the node, the short string adjacent to the short string, and the number thereof.

当然,也可以用其他结构来存储 de Bruijn图的各节点,例如可 以用树结构来存储, 使用哈希表存储各节点在内存和使用上与用树 状结构存储近似, 但是使用哈希表存储各节点在访问和修改速度上 都明显优于树的存储结构。  Of course, other structures can also be used to store the nodes of the de Bruijn graph, for example, can be stored in a tree structure, and the hash table is used to store each node in memory and usage and stored in a tree structure, but using a hash table storage. Each node is significantly better than the tree's storage structure in terms of access and modification speed.

选取非洲人基因组重测序数据, 经纠错处理后, 序列数据量 254G碱基, 切割成 25碱基长度的定长短串后, 短串的总数目 (包 括正反向序列) 为 7G 条, 采用本发明实施例提供的方法构建 de Bruijn图, 内存最大使用值为 110G, 共消耗 23 CPU小时, 其中, CPU 的参数为 Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ。  The African human genome resequencing data was selected. After error correction processing, the sequence data amount was 254 G bases, and after cutting into a fixed length short string of 25 base length, the total number of short strings (including the forward and reverse sequences) was 7G. The method provided by the embodiment of the present invention constructs a de Bruijn graph, and the maximum memory usage value is 110G, which consumes 23 CPU hours, wherein the CPU parameter is Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ.

本领域普通技术人员可以理解, 实现上述实施例方法中的全部 或部分步驟是可以通过程序来指令相关的硬件来完成, 所述的程序 可以在存储于一计算机可读取存储介质中, 所述的存储介质, 如 ROM/RAM、 磁盘、 光盘等, 该程序用来执行如下步骤:  It will be understood by those skilled in the art that all or part of the steps of the foregoing embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium. The storage medium, such as ROM/RAM, disk, CD, etc., is used to perform the following steps:

1.接收测序序列;  1. Receiving a sequencing sequence;

2. 分别将接收到的测序序列逐个碱基滑动切割得到固定碱基 长度的短串, 并得到短串的左、 右连接关系, 所述左、 右连接关系 包括短串序列值、 存在左连接的各碱基的连接数量和存在右连接的 各碱基的连接数量;  2. The received sequencing sequence is slid and cut one by one to obtain a short string of fixed base length, and the left and right connection relationships of the short strings are obtained, and the left and right connection relationships include short string sequence values and presence of left connections. The number of connections of each base and the number of connections of each base in the right junction;

3. 将得到的各短串的序列值, 左、 右连接关系及其连接数量存 储为 de Bruijn图的一个节点, 并采用哈希表存储 de Bruijn图的各 节点, 其中哈希键为序列值, 哈希值为节点。  3. Store the sequence values of the short strings, the left and right connection relationships and their number of connections as a node of the de Bruijn graph, and use the hash table to store the nodes of the de Bruijn graph, where the hash key is the sequence value. , the hash value is a node.

图 3示出了本发明实施例提供的降低短序列组装过程的时间复 杂度的系统的结构, 为了便于说明仅示出了与本发明实施例相关的 部分。 FIG. 3 is a diagram showing the structure of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. For the convenience of description, only the embodiments related to the embodiment of the present invention are shown. section.

该系统可以用于短序列组装中, 其中:  The system can be used in short sequence assembly, where:

接收单元 301, 接收测序序列。  The receiving unit 301 receives the sequencing sequence.

序列切割单元 302, 分别将接收到的测序序列逐个碱基滑动切 割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 其实 现方式如上所述, 不再赘述。  The sequence cutting unit 302 respectively slid and cuts the received sequencing sequence by a base to obtain a short string of a fixed base length, and obtains a left-right connection relationship of the short strings. The actual manner is as described above, and will not be described again.

构图单元 303, 将得到的各短串的序列值, 左、 右连接关系及 其连接数量存储为 de Bruijn图的一个节点, 并采用哈希表存储 de Bruijn图的各节点, 其中哈希键为序列值, 哈希值为节点。 在本发 明实施例中,构图单元 303在 de Bruijn图的节点中使用相应位存储 其序列值、 存在左连接的各碱基的连接数量和存在右连接的各碱基 的连接数量。  The composition unit 303 stores the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and uses a hash table to store each node of the de Bruijn graph, wherein the hash key is The sequence value, the hash value is the node. In the embodiment of the present invention, the composition unit 303 uses the corresponding bit in the node of the de Bruijn diagram to store its sequence value, the number of connections of each base in which there is a left connection, and the number of connections in which each base of the right connection exists.

其中, 构图单元 303包括:  The composition unit 303 includes:

查询模块 3031, 根据得到的短串的序列值在已存储的节点中查 询是否已存有相应节点。  The query module 3031 queries, among the stored nodes, whether the corresponding node has been stored according to the obtained sequence value of the short string.

节点添加模块 3032, 在查询模块 3031没有查询到相应节点时, 添加节点, 其实现方式如上所述, 不再赘述。  The node adding module 3032 adds a node when the query module 3031 does not query the corresponding node, and the implementation manner is as described above, and details are not described herein.

连接更新模块 3033, 在查询模块 3031查询到相应节点时, 更 新该相应节点的连接关系, 其实现方式如上所述, 不再赘述。  The connection update module 3033, when the query module 3031 queries the corresponding node, updates the connection relationship of the corresponding node, and the implementation manner thereof is as described above, and details are not described herein again.

为了降低存储 de Bruijn图中节点所需空间,作为本发明的一个 优选实施例,构图单元 303使用 de Bruijn图中的一个节点存储互补 的两短串, 节点的序列值取互补的两短串中较小的序列值, 其实现 方式如上所述, 不再赘述。  In order to reduce the space required for storing nodes in the de Bruijn diagram, as a preferred embodiment of the present invention, the composition unit 303 uses a node in the de Bruijn diagram to store complementary two short strings, and the sequence values of the nodes are taken in two complementary short strings. The smaller sequence values are implemented as described above and will not be described again.

为了加快构建图的速度, 作为本发明的另一个优选实施例, 构 图单元 303采用多个哈希表唯一存储 de Bruijn图中的不同节点,并 采用不同线程访问不同的哈希表。 在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijii图的一个节点, 并使用哈希表进行存储。 由于釆用本发明实 施例的构图方式并使用哈希表进行存储, 使得更新节点连接关系等 同于节点查找和更新查找到的节点的左、 右相连碱基的连接数量, 因此在 O ( l )的时间内就可以完成节点的查找、 插入和节点连接关 系的更新, 从而实现在短序列组装过程降低时间复杂度, 进而实现 对大基因组的短序列组装。 In order to speed up the construction of the map, as another preferred embodiment of the present invention, the composition unit 303 uniquely stores different nodes in the de Bruijn map using a plurality of hash tables, and uses different threads to access different hash tables. In the embodiment of the present invention, the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained. The left and right connection relationships and their number of connections are stored as a node of the de Bruijii diagram and stored using a hash table. Since the composition mode of the embodiment of the present invention is used and stored by using a hash table, the update node connection relationship is equivalent to the number of connections of the left and right connected bases of the node searched and updated by the node, and thus at O ( l ) The search, insertion and node connection relationship of the node can be updated in time, thereby reducing the time complexity in the short sequence assembly process and realizing the short sequence assembly of the large genome.

以上所述仅为本发明的较佳实施例而已, 并不用以限制本发 明, 凡在本发明的精神和原则之内所作的任何修改、 等同替换和改 进等, 均应包含在本发明的保护范围之内。  The above is only the preferred embodiment of the present invention, and is not intended to limit the present invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the protection of the present invention. Within the scope.

Claims

1. 一种降低短序列组装过程的时间复杂度的方法, 其特征在 于, 所述方法包括下述步骤: A method of reducing the time complexity of a short sequence assembly process, the method comprising the steps of: 接收测序序列;  Receiving a sequencing sequence; 分别将接收到的测序序列逐个碱基滑动切割得到固定碱基 长度的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连 接关系包括短串序列值、 存在左连接的各碱基的连接数量和存在 右连接的各碱基的连接数量;  The received sequencing sequence is slid and cut one by one to obtain a short string of a fixed base length, and the left and right connection relationships of the short string are obtained, and the left and right connection relationships include a short string sequence value and a left connection exists. The number of connections of each base and the number of connections of each base in the right junction; 将得到的各所述短串的序列值, 左、 右连接关系及其连接数 量存储为 de Bruijn 图的一个节点, 并采用哈希表存储所述 de Bruijn图的各节点, 其中哈希键为所述序列值, 哈希值为所述节 点。  And storing the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and storing the nodes of the de Bruijn graph by using a hash table, where the hash key is The sequence value, the hash value is the node. 2. 如权利要求 1所述的方法, 其特征在于, 所述 de Bruijn图 的节点使用相应位存储所述短串的序列值、 存在左连接的各碱基 的连接数量和存在右连接的各碱基的连接数量。  2. The method according to claim 1, wherein the node of the de Bruijn graph stores a sequence value of the short string, a number of connections of each base having a left connection, and a presence of a right connection using a corresponding bit. The number of base connections. 3. 如权利要求 1所述的方法, 其特征在于, 所述将得到的各 短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn 图的一个节点的步骤具体为:  The method according to claim 1, wherein the step of storing the obtained sequence values of the short strings, the left and right connection relationships, and the number of connections thereof as a node of the de Bruijn graph is specifically: 根据得到的短串的序列值在已存储的节点中查询是否已存有 相应节点;  Querying, according to the obtained sequence value of the short string, whether the corresponding node exists in the stored node; 如果没有查询到相应节点, 则添加节点;  Add a node if no corresponding node is queried; 如果查询到相应节点, 则更新所述相应节点的连接关系。  If the corresponding node is queried, the connection relationship of the corresponding node is updated. 4. 如权利要求 1所述的方法, 其特征在于, 所述 de Bruijn图 的一个节点中存储互补的两短串。  4. The method according to claim 1, wherein two nodes of the de Bruijn graph store complementary two short strings. 5. 如权利要求 1所述的方法, 其特征在于, 采用多个哈希表 唯一存储所述 de Bruijn 图的不同节点, 并釆用不同线程访问不 同的哈希表。 5. The method of claim 1 wherein a plurality of hash tables are employed Uniquely store the different nodes of the de Bruijn graph and access different hash tables with different threads. 6. 一种降低短序列组装过程的时间复杂度的系统, 其特征在 于, 所述系统包括:  6. A system for reducing the time complexity of a short sequence assembly process, the system comprising: 接收单元, 用于接收测序序列;  a receiving unit, configured to receive a sequencing sequence; 序列切割单元,用于分别将接收到的测序序列逐个碱基滑动 切割得到固定碱基长度的短串, 并得到所述短串的左、 右连接关 系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基 的连接数量和存在右连接的各碱基的连接数量; 以及  a sequence cutting unit, configured to respectively slide the received sequencing sequence one by one to obtain a short string of a fixed base length, and obtain a left and right connection relationship of the short string, wherein the left and right connection relationships include short strings a sequence value, the number of linkages of each base in which there is a left junction, and the number of linkages of each base in which a right junction exists; 构图单元, 用于将得到的各所述短串的序列值, 左、 右连接 关系及其连接数量存储为 de Bruijn 图的一个节点, 并采用哈希 表存储所述 de Bruijn 图中的不同节点, 其中哈希键为所述序列 值, 哈希值为所述节点。  a composition unit, configured to store the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and store the different nodes in the de Bruijn graph by using a hash table Where the hash key is the sequence value and the hash value is the node. 7. 如权利要求 6所述的系统, 其特征在于, 所述构图单元在 de Bruijn图的节点中使用相应位存储所述短串的序列值、存在左 连接的各碱基的连接数量和存在右连接的各碱基的连接数量。  7. The system according to claim 6, wherein the composition unit stores a sequence value of the short string, a number of connections of each base having a left connection, and a presence using a corresponding bit in a node of the de Bruijn diagram. The number of connections for each base connected to the right. 8. 如权利要求 7所述的系统, 其特征在于, 所述构图单元包 括:  8. The system of claim 7, wherein the composition unit comprises: 查询模块,用于根据得到的短串的序列值在已存储的节点中 查询是否已存有相应节点;  a query module, configured to query, according to the obtained sequence value of the short string, whether the corresponding node is already stored in the stored node; 节点添加模块, 用于在所述查询模块没有查询到相应节点 时, 添加节点;  a node adding module, configured to add a node when the query module does not query the corresponding node; 连接更新模块, 用于在所述查询模块查询到相应节点时, 更 新所述相应节点的连接关系。  And a connection update module, configured to update a connection relationship of the corresponding node when the query module queries the corresponding node.
PCT/CN2009/001427 2008-12-12 2009-12-11 Method and system for lowering time complexity in short sequences assembly Ceased WO2010066115A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810218338.9 2008-12-12
CN2008102183389A CN101430742B (en) 2008-12-12 2008-12-12 A way to assemble a genome

Publications (1)

Publication Number Publication Date
WO2010066115A1 true WO2010066115A1 (en) 2010-06-17

Family

ID=40646135

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/001427 Ceased WO2010066115A1 (en) 2008-12-12 2009-12-11 Method and system for lowering time complexity in short sequences assembly

Country Status (2)

Country Link
CN (1) CN101430742B (en)
WO (1) WO2010066115A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013004005A1 (en) * 2011-07-05 2013-01-10 深圳华大基因科技有限公司 Method for assembling sequenced segments

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430742B (en) * 2008-12-12 2011-06-29 深圳华大基因研究院 A way to assemble a genome
TWI443544B (en) * 2009-12-23 2014-07-01 Ind Tech Res Inst Data compression method and sequence compression devices
WO2012171213A1 (en) * 2011-06-17 2012-12-20 深圳华大基因科技有限公司 Method and system for assembly of genome
US8751166B2 (en) * 2012-03-23 2014-06-10 International Business Machines Corporation Parallelization of surprisal data reduction and genome construction from genetic data for transmission, storage, and analysis
US8812243B2 (en) 2012-05-09 2014-08-19 International Business Machines Corporation Transmission and compression of genetic data
US8855938B2 (en) 2012-05-18 2014-10-07 International Business Machines Corporation Minimization of surprisal data through application of hierarchy of reference genomes
US10353869B2 (en) 2012-05-18 2019-07-16 International Business Machines Corporation Minimization of surprisal data through application of hierarchy filter pattern
US9002888B2 (en) 2012-06-29 2015-04-07 International Business Machines Corporation Minimization of epigenetic surprisal data of epigenetic data within a time series
US8972406B2 (en) 2012-06-29 2015-03-03 International Business Machines Corporation Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters
CN103258145B (en) * 2012-12-22 2016-06-29 中国科学院深圳先进技术研究院 A kind of parallel gene-splicing method based on De Bruijn
CN103093121B (en) * 2012-12-28 2016-01-27 深圳先进技术研究院 The compression storage of two-way multistep deBruijn figure and building method
CN103699819B (en) * 2013-12-10 2016-09-07 深圳先进技术研究院 The summit extended method of elongated kmer based on multistep two-way De Bruijn inquiry
CN104751015B (en) * 2013-12-30 2017-08-29 中国科学院天津工业生物技术研究所 A kind of gene order-checking data sequence assemble method
CN106067824B (en) * 2016-06-02 2019-11-05 洛阳晶云信息科技有限公司 A kind of sequencing data compression method based on bigeminy codon
CN115394348B (en) * 2022-07-15 2025-08-19 中南大学 Method, equipment and medium for predicting lncRNA subcellular localization based on graph rolling network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430742A (en) * 2008-12-12 2009-05-13 深圳华大基因研究院 Method and system for drawing construction in short sequence assembly

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6952651B2 (en) * 2002-06-17 2005-10-04 Intel Corporation Methods and apparatus for nucleic acid sequencing by signal stretching and data integration
AU2003296994A1 (en) * 2002-12-13 2004-07-09 Applera Corporation Methods for identifying, viewing, and analyzing syntenic and orthologous genomic regions between two or more species
CN101196921A (en) * 2007-12-24 2008-06-11 北京大学 Dimensionality Reduction Method for Long Sequence Data for Approximate Query

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430742A (en) * 2008-12-12 2009-05-13 深圳华大基因研究院 Method and system for drawing construction in short sequence assembly

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOKHARI S.H. ET AL: "A Parallel Graph Decomposition Algorithm for DNA Sequencing with Nanopores", BIOINFORMATICS, vol. 21, no. 7, 11 November 2004 (2004-11-11), pages 889 - 896 *
HUO H. W. ET AL: "A Multiple Alignment Approach for DNA Sequences Based on the Maximum Weighted Path Algorithms", JOURNAL OF SOFTWARE, vol. 18, no. 2, February 2007 (2007-02-01), pages 185 - 195 *
ZERBINO D.R. ET AL: "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs", GENOME RES., vol. 18, 18 March 2008 (2008-03-18), pages 821 - 829 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013004005A1 (en) * 2011-07-05 2013-01-10 深圳华大基因科技有限公司 Method for assembling sequenced segments

Also Published As

Publication number Publication date
CN101430742B (en) 2011-06-29
CN101430742A (en) 2009-05-13

Similar Documents

Publication Publication Date Title
WO2010066115A1 (en) Method and system for lowering time complexity in short sequences assembly
CN101594319B (en) Entry lookup method and entry lookup device
US20210165805A1 (en) Partition Merging Method and Database Server
US20130080485A1 (en) Quick filename lookup using name hash
Almodaresi et al. PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index
JP4995125B2 (en) How to search fixed length data
JP3570323B2 (en) How to store prefixes for addresses
EP2344959A2 (en) Index compression in databases
CN103561133A (en) IP address ownership information indexing and fast querying method
CN101751517B (en) A method and system for rapid processing of genome short sequence mapping
Xiao et al. Using parallel bloom filters for multiattribute representation on network services
WO2021179781A1 (en) Method, systemand device for sequence alignment, and readable storage medium
CN103065067B (en) The filter method of sequence fragment and system in short sequence assembling
US20130307710A1 (en) Compression match enumeration
CN114884877B (en) IPv6 route searching method combining hash table and HOT
CN100496019C (en) A Method for Rapid Search and Update of IPv6 Routing Table
Hon et al. Efficient algorithm for circular Burrows-Wheeler transform
CN112231400B (en) Access method, device, equipment and storage medium of distributed database
CN112269784B (en) Hash table device based on hardware realization and inserting, inquiring and deleting method
WO2010127536A1 (en) Table creating and searching method used by network processor
CN1312890C (en) Method for generating a trie having a reduced number of trie blocks
CN100396015C (en) A TCAM routing table management method and system
WO2015192742A1 (en) Lookup device, lookup method and configuration method
CN109831384B (en) Name lookup method and router
WO2005066835A1 (en) A method for quickly retrieving a record in a data page of a database

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09831392

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09831392

Country of ref document: EP

Kind code of ref document: A1