WO2010066115A1 - Method and system for lowering time complexity in short sequences assembly - Google Patents
Method and system for lowering time complexity in short sequences assembly Download PDFInfo
- Publication number
- WO2010066115A1 WO2010066115A1 PCT/CN2009/001427 CN2009001427W WO2010066115A1 WO 2010066115 A1 WO2010066115 A1 WO 2010066115A1 CN 2009001427 W CN2009001427 W CN 2009001427W WO 2010066115 A1 WO2010066115 A1 WO 2010066115A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- short
- sequence
- connections
- base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
Definitions
- the invention belongs to the technical field of genetic engineering, and in particular relates to a method and system for reducing the time complexity of a short sequence assembly process. Background technique
- sequence length is short
- the embodiment of the present invention is implemented as a method for reducing the time complexity of a short sequence assembly process, and the method includes the following steps:
- the received sequencing sequence is slid and cut one by one to obtain a short string of a fixed base length, and the left and right connection relationships of the short string are obtained, and the left and right connection relationships include a short string sequence value and a left connection exists.
- Another object of embodiments of the present invention is to provide a system for reducing the time complexity of a short sequence assembly process, the system comprising:
- a receiving unit configured to receive a sequencing sequence
- a sequence cutting unit configured to separately slide the received sequencing sequence one by one to obtain a short string of a fixed base length, and obtain a left and right connection relationship of the short string, wherein the left and right connection relationships include short strings a sequence value, the number of linkages of each base in which there is a left junction, and the number of linkages of each base in which a right junction exists;
- composition unit configured to store the obtained sequence values of the short strings, the left and right connection relationships, and the number of connections thereof as a node of the de Bruijn graph, and store the nodes of the de Bruijn graph by using a hash table,
- the hash key is the sequence value
- the hash value is the node.
- the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
- the left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph and stored using a hash table. Since the hash table storage is used to make the update node connection relationship equal to the number of connections of the node to find and update the left and right connected bases of the found node, the search, insertion, and insertion of the node can be completed within O ( l ) time. The update of the node connection relationship, thereby reducing the time complexity in the short sequence assembly process, and thus achieving short sequence assembly of large genomes.
- FIG. 1 is a flowchart of an implementation of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention
- FIG. 2 is a schematic diagram of a node storage content according to an embodiment of the present invention.
- FIG. 3 is a structural diagram of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. detailed description
- the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
- the left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph, and the hash table is used to store the nodes of the de Bruijn graph, wherein the hash key is a sequence value and the hash value is a node.
- FIG. 1 is a flowchart showing an implementation process of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention, which is described in detail as follows:
- step S101 receiving a sequencing sequence
- step S102 the received sequencing sequence is slid and cut one by one to obtain a short length (kmer) of a fixed length, and a left and right connection relationship of the short string is obtained, and the left and right connection relationships include short sequence values.
- step S103 the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof are stored as a node of the de Bruijn graph, and the hash table is used to store de Each node of the Briiijn graph, where the hash key is a sequence value and the hash value is a node.
- the sequencing sequence has a base length of 25 - 75 and is cut into short strings of a fixed length of 21 - 31.
- the length of the short string obtained by the cutting is smaller than the length of the sequence, and the length can be set according to the length of the sequencing sequence and the actual situation.
- Each node in the de Bruijn diagram uses the corresponding bits to store its sequence value, the number of connections for each base with left joins, and the number of connections for each base with right joins.
- each node on the de B ijn map is stored in 16 bytes, and its storage format is as follows:
- sequence value is calculated by using 2 bits to store a nucleotide sequence, A is represented by 00, C is represented by 01, T is represented by 10, G is represented by 11, and the sequence is encoded to generate a
- A is represented by 00
- C is represented by 01
- T is represented by 10
- G is represented by 11
- the sequence is encoded to generate a
- the 64-bit integer value and considering the short string of even length, its complementary short string may be itself, for example, the short string GATC's complementary short string is GATC itself. In order to prevent such confusion, the lengths of the short strings are all odd.
- the length of the short string is not more than 31;
- the leftjinks uses 24 bits to store the left connection relationship and the number, and divides the 24 bits into 4 6-bit, ie A: 6, T: 6, G: 6, C: 6, respectively, using 6 bits to store the number of connections to the base A, T, G or C with the left link of the short string, each The number of connections is in the range [0, 63];
- right-link uses 24 bits to store its right connection and number, and divides 24 bits into 4 6 bits, ie A: 6, T: 6, G: 6, C: 6, respectively, the number of connections of A, T, G or C with the right connection of the short string is stored by 6 bits, and the number of each connection is in the range of [0, 63]; the following 8 bits can be used.
- the delete flag closed may be stored to identify whether the short string is deleted; the use mark in one may also be stored to identify whether the short string has been used, and other identifiers may also be stored.
- the connection relationship of each node in the de Bmijii diagram can be constructed according to the short string sequence value stored in the node, the number of connections of each base having the left connection, and the number of connections of each base having the right connection.
- the short-chain A is AAAAAAAA
- the number of connections of the right-connected T is 19
- the short-chain B of the right-connected base T is AAAAAAAT, which is equal to the short-chain ⁇ left-shifting one base and adding the base T connected thereto.
- the storage contents of the number of connections in the node storing the right port are as shown in FIG. 2 .
- step S103 is specifically:
- Step 1 Query whether the corresponding node exists in the stored node according to the obtained sequence value of the short string;
- Step 2 If no corresponding node is queried, add a node
- Step 3 If the corresponding node is queried, the connection relationship of the corresponding node is updated.
- each node of the de Bruijn graph is stored using a hash table, the hash key is a sequence value, and the hash value is a node. For example, take a short string of AAAAAAAA, its sequence value is 0x0000, and use its sequence value 0x0000 as a key to query whether the corresponding node exists in the hash table. If no corresponding node is queried, add the node to the hash table.
- the seq of the value is the sequence value of the short string 0x0000, and the number of connections of the corresponding left and right connected bases in the node is set to 1 according to the short string adjacent to the short string; if the query already has a corresponding
- the node updates the connection relationship of the corresponding node, that is, updates the number of connections of the corresponding left and right connected bases in the node according to the short string adjacent to the short string, and the number of corresponding connections of the bases connected to the short string plus 1.
- step 1 to find the next short string until all short strings are found.
- the hash node can be used to complete the lookup node, the insert node (ie, the storage node), and the update node connection relationship in the time of O(1). Updating the node connection relationship is equivalent to finding the node and updating the number of connections of the left and right connected bases of the found node, so the time complexity is still O ( l ).
- the embodiment of the present invention converts the steps that originally required multiple logical processes into one step, so that the computer completes the search, insert, and update connection relationship in the operation of identifying the step of storing the content of the node.
- the three steps, while the time complexity is still O ( l ) thus saving the time for the computer to make logical judgments, improving the internal performance of the computer during the short sequence assembly process, thus providing for the realization of short sequence assembly of large genomes.
- only one node in the de Bruijn graph can be used to store two complementary short strings, and the sequence values of the nodes are complementary to two short A smaller sequence value in the string. If the sequence value of a short string is smaller than the sequence value of its complementary short string, the node in the de Bruijn graph stores the sequence value of the short string, and seq stores the sequence value of the short string, and the number of corresponding connections to the left connected base.
- Update to left-links the number of corresponding connections to its right-connected base is updated to right-links; if the sequence value of a short string is greater than the sequence value of its complementary short-string, the node in the de Bruijn graph stores its complementary short string
- the sequence value, seq stores the sequence value of its complementary short string
- the number of corresponding connections to its right-linked base is updated to leftjinks
- the number of corresponding connections to its left-linked base is updated to ri g M-links.
- multiple hash tables are used to uniquely store different nodes in the de Bruijn graph, and different threads are used to access different hash tables.
- eight hash tables are created, a certain number of original sequences are read, and eight threads are used to perform multi-thread cutting and short-string complementation on the original sequence to be read. After the data collection is completed, the data is collected. Eight threads are inserted into the update node, where each thread only processes the sequence value of the fixed prefix. Each hash table stores a sequence value of the specified prefix, and a hash table has only one thread access to guarantee the uniqueness of the node storage.
- node information ie, sequence value
- node connection information ie, edges
- nodes of the de Bruijn graph can be stored in a tree structure, and the hash table is used to store each node in memory and usage and stored in a tree structure, but using a hash table storage.
- Each node is significantly better than the tree's storage structure in terms of access and modification speed.
- the African human genome resequencing data was selected. After error correction processing, the sequence data amount was 254 G bases, and after cutting into a fixed length short string of 25 base length, the total number of short strings (including the forward and reverse sequences) was 7G.
- the method provided by the embodiment of the present invention constructs a de Bruijn graph, and the maximum memory usage value is 110G, which consumes 23 CPU hours, wherein the CPU parameter is Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ.
- the received sequencing sequence is slid and cut one by one to obtain a short string of fixed base length, and the left and right connection relationships of the short strings are obtained, and the left and right connection relationships include short string sequence values and presence of left connections.
- FIG. 3 is a diagram showing the structure of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. For the convenience of description, only the embodiments related to the embodiment of the present invention are shown. section.
- the system can be used in short sequence assembly, where:
- the receiving unit 301 receives the sequencing sequence.
- the sequence cutting unit 302 respectively slid and cuts the received sequencing sequence by a base to obtain a short string of a fixed base length, and obtains a left-right connection relationship of the short strings.
- the actual manner is as described above, and will not be described again.
- the composition unit 303 stores the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and uses a hash table to store each node of the de Bruijn graph, wherein the hash key is The sequence value, the hash value is the node.
- the composition unit 303 uses the corresponding bit in the node of the de Bruijn diagram to store its sequence value, the number of connections of each base in which there is a left connection, and the number of connections in which each base of the right connection exists.
- the composition unit 303 includes:
- the query module 3031 queries, among the stored nodes, whether the corresponding node has been stored according to the obtained sequence value of the short string.
- the node adding module 3032 adds a node when the query module 3031 does not query the corresponding node, and the implementation manner is as described above, and details are not described herein.
- connection update module 3033 when the query module 3031 queries the corresponding node, updates the connection relationship of the corresponding node, and the implementation manner thereof is as described above, and details are not described herein again.
- the composition unit 303 uses a node in the de Bruijn diagram to store complementary two short strings, and the sequence values of the nodes are taken in two complementary short strings.
- the smaller sequence values are implemented as described above and will not be described again.
- the composition unit 303 uniquely stores different nodes in the de Bruijn map using a plurality of hash tables, and uses different threads to access different hash tables.
- the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
- the left and right connection relationships and their number of connections are stored as a node of the de Bruijii diagram and stored using a hash table.
- the update node connection relationship is equivalent to the number of connections of the left and right connected bases of the node searched and updated by the node, and thus at O ( l )
- the search, insertion and node connection relationship of the node can be updated in time, thereby reducing the time complexity in the short sequence assembly process and realizing the short sequence assembly of the large genome.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
一种降低短序列组装过程的时间复杂度的方法及系统 技术领域 Method and system for reducing time complexity of short sequence assembly process
本发明属于基因工程技术领域, 尤其涉及一种降低短序列组装 过程时间复杂度的方法及系统。 背景技术 The invention belongs to the technical field of genetic engineering, and in particular relates to a method and system for reducing the time complexity of a short sequence assembly process. Background technique
新测序技术产生的短序列有两个特点: The short sequence generated by the new sequencing technology has two characteristics:
1.序列长度短; 1. The sequence length is short;
2.数据量大。 2. The amount of data is large.
在长序列组装过程中, 常用的 phrap等软件均是基于序列间的 交叠(overlap )来进行拼接, 如果应用在短序列上, 则运算量太大, 没有实际应用价值。而新兴的短序列组装软件中成功处理短序列的, 例如基于 de Bruijn图的 velvet等。 但是, 由于受内存、 时间等的限 制 , 现有的这些短序列组装软件只能组装较小的原核生物基因组, 对于大基因组, 例如真核生物基因组, 特别是哺乳动物基因组数据, 由于数据处理时时间复杂度较高、 内存占用较大, 现有的短序列组 装软件均难以实现短序列的组装。 发明内容 In the long sequence assembly process, the commonly used software such as phrap is based on the overlap between sequences. If the application is on a short sequence, the amount of calculation is too large, and there is no practical application value. In the emerging short-sequence assembly software, short sequences are successfully processed, such as velvet based on de Bruijn graphs. However, due to limitations in memory, time, etc., these short sequence assembly software can only assemble smaller prokaryotic genomes, for large genomes, such as eukaryotic genomes, especially mammalian genomic data, due to data processing. The time complexity is high and the memory usage is large. It is difficult to assemble short sequences by the existing short sequence assembly software. Summary of the invention
本发明实施例的目的在于提供一种降低短序列组装过程的时间 复杂度的方法, 旨在解决现有短序列组装软件不能组装大基因组的 问题。 It is an object of embodiments of the present invention to provide a method for reducing the time complexity of a short sequence assembly process, which aims to solve the problem that existing short sequence assembly software cannot assemble a large genome.
本发明实施例是这样实现的, 一种降低短序列组装过程的时间 复杂度的方法, 所述方法包括下述步骤: The embodiment of the present invention is implemented as a method for reducing the time complexity of a short sequence assembly process, and the method includes the following steps:
接收测序序列; 分别将接收到的测序序列逐个碱基滑动切割得到固定碱基长度 的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连接关系 包括短串序列值、 存在左连接的各碱基的连接数量和存在右连接的 各½的连接数量; Receiving a sequencing sequence; The received sequencing sequence is slid and cut one by one to obtain a short string of a fixed base length, and the left and right connection relationships of the short string are obtained, and the left and right connection relationships include a short string sequence value and a left connection exists. The number of connections for each base and the number of connections for each of the right connections;
将得到的各所述短串的序列值, 左、 右连接关系及其连接数量 存储为 de Bruijn图的一个节点, 并釆用哈希表存储所述 de Bruijn 图的各节点, 其中哈希键为所述序列值, 哈希值为所述节点。 And storing the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and storing the nodes of the de Bruijn graph by using a hash table, wherein the hash key For the sequence value, the hash value is the node.
本发明实施例的另一目的在于提供一种降低短序列组装过程的 时间复杂度的系统, 所述系统包括: Another object of embodiments of the present invention is to provide a system for reducing the time complexity of a short sequence assembly process, the system comprising:
接收单元, 用于接收测序序列; a receiving unit, configured to receive a sequencing sequence;
序列切割单元, 用于分别将接收到的测序序列逐个碱基滑动切 割得到固定碱基长度的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基的连接 数量和存在右连接的各碱基的连接数量; 以及 a sequence cutting unit, configured to separately slide the received sequencing sequence one by one to obtain a short string of a fixed base length, and obtain a left and right connection relationship of the short string, wherein the left and right connection relationships include short strings a sequence value, the number of linkages of each base in which there is a left junction, and the number of linkages of each base in which a right junction exists;
构图单元, 用于将得到的各所述短串的序列值, 左、 右连接关 系及其连接数量存储为 de Bruijn图的一个节点,并采用哈希表存储 所述 de Bruijn图的各节点,其中哈希键为所述序列值,哈希值为所 述节点。 a composition unit, configured to store the obtained sequence values of the short strings, the left and right connection relationships, and the number of connections thereof as a node of the de Bruijn graph, and store the nodes of the de Bruijn graph by using a hash table, The hash key is the sequence value, and the hash value is the node.
在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn图的一个节点, 并使用哈希表进行存储。 由于使用哈希表存 储使得更新节点连接关系等同于节点查找和更新查找到的节点的 左、 右相连碱基的连接数量, 因此在 O ( l )的时间内就可以完成节 点的查找、 插入和节点连接关系的更新, 从而实现在短序列组装过 程降低时间复杂度, 进而实现对大基因组的短序列组装。 附图说明 In the embodiment of the present invention, the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained. The left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph and stored using a hash table. Since the hash table storage is used to make the update node connection relationship equal to the number of connections of the node to find and update the left and right connected bases of the found node, the search, insertion, and insertion of the node can be completed within O ( l ) time. The update of the node connection relationship, thereby reducing the time complexity in the short sequence assembly process, and thus achieving short sequence assembly of large genomes. DRAWINGS
图 1是本发明实施例提供的降低短序列组装过程的时间复杂度 的方法的实现流程图; 1 is a flowchart of an implementation of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention;
图 2是本发明实施例提供的节点存储内容的示意图; 2 is a schematic diagram of a node storage content according to an embodiment of the present invention;
图 3本发明实施例提供的降低短序列组装过程的时间复杂度的 系统的结构图。 具体实施方式 FIG. 3 is a structural diagram of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. detailed description
为了使本发明的目的、 技术方案及优点更加清楚明白, 以下结 合附图及实施例, 对本发明进行进一步详细说明。 应当理解, 此处 所描述的具体实施例仅仅用以解释本发明, 并不用于限定本发明。 The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn图的一个节点,并采用哈希表存储 de Bruijn图的各节点,其 中哈希键为序列值, 哈希值为节点。 In the embodiment of the present invention, the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained. The left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph, and the hash table is used to store the nodes of the de Bruijn graph, wherein the hash key is a sequence value and the hash value is a node.
图 1示出了本发明实施例提供的降低短序列组装过程的时间复 杂度的方法的实现流程, 详述如下: FIG. 1 is a flowchart showing an implementation process of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention, which is described in detail as follows:
在步驟 S101中, 接收测序序列; In step S101, receiving a sequencing sequence;
在步骤 S102中,分别将接收到的测序序列逐个碱基滑动切割得 到固定 长度的短串 (kmer ) , 并得到短串的左、 右连接关系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基的连接 数量和存在右连接的各碱基的连接数量; In step S102, the received sequencing sequence is slid and cut one by one to obtain a short length (kmer) of a fixed length, and a left and right connection relationship of the short string is obtained, and the left and right connection relationships include short sequence values. The number of connections of each base in which there is a left connection and the number of connections of each base in which a right connection exists;
在步骤 S103中, 将得到的各短串的序列值, 左、 右连接关系及 其连接数量存储为 de Bruijn图的一个节点, 并采用哈希表存储 de Briiijn图的各节点, 其中哈希键为序列值, 哈希值为节点。 In step S103, the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof are stored as a node of the de Bruijn graph, and the hash table is used to store de Each node of the Briiijn graph, where the hash key is a sequence value and the hash value is a node.
在本发明实施例中, 测序序列的碱基长度为 25 - 75, 切割成固 定 长度为 21 - 31的短串。 当然, 切割得到的短串的长度小于测 序序列的长度, 其长度可以根据测序序列的长度和实际情况设定。 de Bruijn图中每个节点使用相应位存储其序列值、存在左连接的各 碱基的连接数量和存在右连接的各碱基的连接数量。 这里, 用 16字 节存储 de B ijn图上的各节点, 其存储格式如下: In the embodiment of the present invention, the sequencing sequence has a base length of 25 - 75 and is cut into short strings of a fixed length of 21 - 31. Of course, the length of the short string obtained by the cutting is smaller than the length of the sequence, and the length can be set according to the length of the sequencing sequence and the actual situation. Each node in the de Bruijn diagram uses the corresponding bits to store its sequence value, the number of connections for each base with left joins, and the number of connections for each base with right joins. Here, each node on the de B ijn map is stored in 16 bytes, and its storage format is as follows:
[ seq: 64, left links: 24, right— links: 24, …】; [ seq: 64, left links: 24, right- links: 24, ...];
其中, seq存储短串的序列值,序列值的计算方法是使用 2位存 储一个核苷序列, A用 00表示, C用 01表示, T用 10表示, G用 11表示, 顺序编码下去生成一个占 64位的整数值, 并且, 考虑到 对于偶数长度的短串, 其互补短串可能为它自己, 例如短串 GATC 的互补短串为 GATC自己。 为了防止这种混淆, 短串的长度均为奇 数, 由于本发明实施例中数据结构的限制, 短串的长度不大于 31; leftjinks用 24位存储其左连接关系及数量, 将 24位分割成 4个 6 位, 即 A: 6, T: 6, G: 6, C: 6, 分别用 6位存储与该短串存在 左连接的碱基 A、 T、 G或 C的连接数量, 每种连接数量的取值范 围为 [0, 63]; right— links用 24位存储其右连接关系及数量, 将 24 位分割成 4个 6位, 即 A: 6, T: 6, G: 6, C: 6, 分别用 6位存 储与该短串存在右连接的 A、 T、 G或 C的连接数量, 每种连 接数量的取值范围为 [0, 63]; 其后面的 8位可以用于存储其他值, 例如, 可以存储删除标记 closed, 以标识该短串是否被删除; 也可 以存储使用标记 in一 use, 以标识该短串是否被使用过, 还可以存储 其他标识。 这样, 根据节点中存储的短串序列值、 存在左连接的各 碱基的连接数量和存在右连接的各碱基的连接数量即可构建 de Bmijii图中各节点的连接关系。 例如,短串甲为 AAAAAAAA存在右连接的 T的连接数量 为 19, 与其右连接碱基 T的短串乙为 AAAAAAAT, 等于短串曱左 移一个碱基并加上与其连接的碱基 T, 并且与短串甲连接的短串乙 有 19个, 节点中存储右连接 Τ的连接数量的存储内容如图 2 所示。 Where seq stores the sequence value of the short string, the sequence value is calculated by using 2 bits to store a nucleotide sequence, A is represented by 00, C is represented by 01, T is represented by 10, G is represented by 11, and the sequence is encoded to generate a The 64-bit integer value, and considering the short string of even length, its complementary short string may be itself, for example, the short string GATC's complementary short string is GATC itself. In order to prevent such confusion, the lengths of the short strings are all odd. Due to the limitation of the data structure in the embodiment of the present invention, the length of the short string is not more than 31; the leftjinks uses 24 bits to store the left connection relationship and the number, and divides the 24 bits into 4 6-bit, ie A: 6, T: 6, G: 6, C: 6, respectively, using 6 bits to store the number of connections to the base A, T, G or C with the left link of the short string, each The number of connections is in the range [0, 63]; right-link uses 24 bits to store its right connection and number, and divides 24 bits into 4 6 bits, ie A: 6, T: 6, G: 6, C: 6, respectively, the number of connections of A, T, G or C with the right connection of the short string is stored by 6 bits, and the number of each connection is in the range of [0, 63]; the following 8 bits can be used. For storing other values, for example, the delete flag closed may be stored to identify whether the short string is deleted; the use mark in one may also be stored to identify whether the short string has been used, and other identifiers may also be stored. Thus, the connection relationship of each node in the de Bmijii diagram can be constructed according to the short string sequence value stored in the node, the number of connections of each base having the left connection, and the number of connections of each base having the right connection. For example, the short-chain A is AAAAAAAA, the number of connections of the right-connected T is 19, and the short-chain B of the right-connected base T is AAAAAAAT, which is equal to the short-chain 曱 left-shifting one base and adding the base T connected thereto. And there are 19 short strings B connected to the short string A, and the storage contents of the number of connections in the node storing the right port are as shown in FIG. 2 .
上述步骤 S103具体为: The above step S103 is specifically:
步骤 1.根据得到的短串的序列值在已存储的节点中查询是否已 存有相应节点; Step 1. Query whether the corresponding node exists in the stored node according to the obtained sequence value of the short string;
步骤 2.如果没有查询到相应节点, 则添加节点; Step 2. If no corresponding node is queried, add a node;
步骤 3.如果查询到相应节点, 则更新该相应节点的连接关系。 在本发明实施例中,使用哈希表存储 de Bruijn图的各节点,哈 希键为序列值, 哈希值为节点。 例如取一短串为 AAAAAAAA, 其 序列值为 0x0000 , 将其序列值 0x0000作为键在哈希表中查询是否 已存有相应节点, 如果没有查询到相应节点, 则添加节点存储到哈 希表中, 其值中的 seq为该短串的序列值 0x0000, 并根据该短串相 邻的短串将该节点中相应左、 右相连碱基的连接数量置为 1; 如果 查询到已存有相应节点, 则更新相应节点的连接关系, 即根据与该 短串相邻的短串更新该节点中相应左、 右相连碱基的连接数量, 将 与该短串有连接的碱基的相应连接数量加 1。 完成后, 执行步骤 1, 查找下一个短串, 直至完成全部短串的查找。 Step 3. If the corresponding node is queried, the connection relationship of the corresponding node is updated. In the embodiment of the present invention, each node of the de Bruijn graph is stored using a hash table, the hash key is a sequence value, and the hash value is a node. For example, take a short string of AAAAAAAA, its sequence value is 0x0000, and use its sequence value 0x0000 as a key to query whether the corresponding node exists in the hash table. If no corresponding node is queried, add the node to the hash table. The seq of the value is the sequence value of the short string 0x0000, and the number of connections of the corresponding left and right connected bases in the node is set to 1 according to the short string adjacent to the short string; if the query already has a corresponding The node updates the connection relationship of the corresponding node, that is, updates the number of connections of the corresponding left and right connected bases in the node according to the short string adjacent to the short string, and the number of corresponding connections of the bases connected to the short string plus 1. When finished, perform step 1 to find the next short string until all short strings are found.
在本发明实施例中, 使用哈希表可以在 O ( 1 )的时间内完成查 找节点、 插入节点 (即存储节点)和更新节点连接关系。 更新节点 连接关系等同于查找节点, 并更新查找到的节点的左、 右相连碱基 的连接数量, 所以时间复杂度依然为 O ( l )。 本发明实施例将原来 需要多个逻辑处理的步骤转化为一个步骤, 使得计算机在识别节点 存储内容这个步骤的运算中, 完成了查找、 插入和更新联结关系的 三个步骤, 而时间复杂度仍然为 O ( l ), 因此节约了计算机进行逻 辑判断的时间, 改善了短序列组装过程中计算机的内部性能, 从而 为大基因组的短序列组装的实现提供了 。 In the embodiment of the present invention, the hash node can be used to complete the lookup node, the insert node (ie, the storage node), and the update node connection relationship in the time of O(1). Updating the node connection relationship is equivalent to finding the node and updating the number of connections of the left and right connected bases of the found node, so the time complexity is still O ( l ). The embodiment of the present invention converts the steps that originally required multiple logical processes into one step, so that the computer completes the search, insert, and update connection relationship in the operation of identifying the step of storing the content of the node. The three steps, while the time complexity is still O ( l ), thus saving the time for the computer to make logical judgments, improving the internal performance of the computer during the short sequence assembly process, thus providing for the realization of short sequence assembly of large genomes.
为了降低存储 de Bruijn图中节点所需的内存空间,作为本发明 的一个优选实施例,还可以只用 de Bruijn图中的一个节点存储互补 的两短串, 节点的序列值取互补的两短串中较小的序列值。 如果一 个的短串的序列值小于其互补短串的序列值,则 de Bruijn图中的节 点存储该短串的序列值, seq存储该短串的序列值,与其左连接碱基 的相应连接数量更新到 left— links, 与其右连接碱基的相应连接数量 更新到 right— links; 如果一个的短串的序列值大于其互补短串的序 列值, 则 de Bruijn图中的节点存储其互补短串的序列值, seq存储 其互补短串的序列值, 与其右连接碱基的相应连接数量更新到 leftjinks, 与其左连接碱基的相应连接数量更新到 rigM—links。 操 作图时, 可以在程序中使用一个附加的变量来标记我们使用的是互 补的两短串的哪一个。 并且, 在沿图遍历时, 只需要程序维持一个 这样的变量, 就可以正确地得到路径中所有节点的正方向。 In order to reduce the memory space required for storing nodes in the de Bruijn graph, as a preferred embodiment of the present invention, only one node in the de Bruijn graph can be used to store two complementary short strings, and the sequence values of the nodes are complementary to two short A smaller sequence value in the string. If the sequence value of a short string is smaller than the sequence value of its complementary short string, the node in the de Bruijn graph stores the sequence value of the short string, and seq stores the sequence value of the short string, and the number of corresponding connections to the left connected base. Update to left-links, the number of corresponding connections to its right-connected base is updated to right-links; if the sequence value of a short string is greater than the sequence value of its complementary short-string, the node in the de Bruijn graph stores its complementary short string The sequence value, seq stores the sequence value of its complementary short string, the number of corresponding connections to its right-linked base is updated to leftjinks, and the number of corresponding connections to its left-linked base is updated to ri g M-links. When working with diagrams, you can use an additional variable in the program to mark which of the two short strings we are using. Moreover, when traversing along the graph, only the program needs to maintain one such variable, and the positive direction of all nodes in the path can be correctly obtained.
为了加快构建图的速度, 作为本发明的另一个优选实施例, 使 用多个哈希表唯一存储 de Bruijn图中的不同节点,并采用不同线程 访问不同的哈希表。 In order to speed up the construction of the map, as another preferred embodiment of the present invention, multiple hash tables are used to uniquely store different nodes in the de Bruijn graph, and different threads are used to access different hash tables.
在本发明实施例中, 建立 8个哈希表, 读入一定数目的原始序 列, 采用 8个线程对读入的原始测序列进行多线程切割、 短串求互 补, 在数据收集完毕后, 采用 8个线程进行插入更新节点, 其中每 个线程只处理固定前缀的序列值。 每个哈希表存储指定前缀的序列 值, 并且一个哈希表只有一个线程访问, 以保证节点存储的唯一性。 In the embodiment of the present invention, eight hash tables are created, a certain number of original sequences are read, and eight threads are used to perform multi-thread cutting and short-string complementation on the original sequence to be read. After the data collection is completed, the data is collected. Eight threads are inserted into the update node, where each thread only processes the sequence value of the fixed prefix. Each hash table stores a sequence value of the specified prefix, and a hash table has only one thread access to guarantee the uniqueness of the node storage.
采用上述本发明实施例提供的压缩的数据结构, 可以将节点信 息(即序列值)和节点的连接信息 (即边)组合在一起, 从一个节 点的值可以得到该节点上的短串、 与该短串相邻的短串的序列值及 其数量。 By using the compressed data structure provided by the embodiment of the present invention, node information (ie, sequence value) and node connection information (ie, edges) can be combined together from one section. The value of the point gives the sequence value of the short string on the node, the short string adjacent to the short string, and the number thereof.
当然,也可以用其他结构来存储 de Bruijn图的各节点,例如可 以用树结构来存储, 使用哈希表存储各节点在内存和使用上与用树 状结构存储近似, 但是使用哈希表存储各节点在访问和修改速度上 都明显优于树的存储结构。 Of course, other structures can also be used to store the nodes of the de Bruijn graph, for example, can be stored in a tree structure, and the hash table is used to store each node in memory and usage and stored in a tree structure, but using a hash table storage. Each node is significantly better than the tree's storage structure in terms of access and modification speed.
选取非洲人基因组重测序数据, 经纠错处理后, 序列数据量 254G碱基, 切割成 25碱基长度的定长短串后, 短串的总数目 (包 括正反向序列) 为 7G 条, 采用本发明实施例提供的方法构建 de Bruijn图, 内存最大使用值为 110G, 共消耗 23 CPU小时, 其中, CPU 的参数为 Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ。 The African human genome resequencing data was selected. After error correction processing, the sequence data amount was 254 G bases, and after cutting into a fixed length short string of 25 base length, the total number of short strings (including the forward and reverse sequences) was 7G. The method provided by the embodiment of the present invention constructs a de Bruijn graph, and the maximum memory usage value is 110G, which consumes 23 CPU hours, wherein the CPU parameter is Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ.
本领域普通技术人员可以理解, 实现上述实施例方法中的全部 或部分步驟是可以通过程序来指令相关的硬件来完成, 所述的程序 可以在存储于一计算机可读取存储介质中, 所述的存储介质, 如 ROM/RAM、 磁盘、 光盘等, 该程序用来执行如下步骤: It will be understood by those skilled in the art that all or part of the steps of the foregoing embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium. The storage medium, such as ROM/RAM, disk, CD, etc., is used to perform the following steps:
1.接收测序序列; 1. Receiving a sequencing sequence;
2. 分别将接收到的测序序列逐个碱基滑动切割得到固定碱基 长度的短串, 并得到短串的左、 右连接关系, 所述左、 右连接关系 包括短串序列值、 存在左连接的各碱基的连接数量和存在右连接的 各碱基的连接数量; 2. The received sequencing sequence is slid and cut one by one to obtain a short string of fixed base length, and the left and right connection relationships of the short strings are obtained, and the left and right connection relationships include short string sequence values and presence of left connections. The number of connections of each base and the number of connections of each base in the right junction;
3. 将得到的各短串的序列值, 左、 右连接关系及其连接数量存 储为 de Bruijn图的一个节点, 并采用哈希表存储 de Bruijn图的各 节点, 其中哈希键为序列值, 哈希值为节点。 3. Store the sequence values of the short strings, the left and right connection relationships and their number of connections as a node of the de Bruijn graph, and use the hash table to store the nodes of the de Bruijn graph, where the hash key is the sequence value. , the hash value is a node.
图 3示出了本发明实施例提供的降低短序列组装过程的时间复 杂度的系统的结构, 为了便于说明仅示出了与本发明实施例相关的 部分。 FIG. 3 is a diagram showing the structure of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. For the convenience of description, only the embodiments related to the embodiment of the present invention are shown. section.
该系统可以用于短序列组装中, 其中: The system can be used in short sequence assembly, where:
接收单元 301, 接收测序序列。 The receiving unit 301 receives the sequencing sequence.
序列切割单元 302, 分别将接收到的测序序列逐个碱基滑动切 割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 其实 现方式如上所述, 不再赘述。 The sequence cutting unit 302 respectively slid and cuts the received sequencing sequence by a base to obtain a short string of a fixed base length, and obtains a left-right connection relationship of the short strings. The actual manner is as described above, and will not be described again.
构图单元 303, 将得到的各短串的序列值, 左、 右连接关系及 其连接数量存储为 de Bruijn图的一个节点, 并采用哈希表存储 de Bruijn图的各节点, 其中哈希键为序列值, 哈希值为节点。 在本发 明实施例中,构图单元 303在 de Bruijn图的节点中使用相应位存储 其序列值、 存在左连接的各碱基的连接数量和存在右连接的各碱基 的连接数量。 The composition unit 303 stores the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and uses a hash table to store each node of the de Bruijn graph, wherein the hash key is The sequence value, the hash value is the node. In the embodiment of the present invention, the composition unit 303 uses the corresponding bit in the node of the de Bruijn diagram to store its sequence value, the number of connections of each base in which there is a left connection, and the number of connections in which each base of the right connection exists.
其中, 构图单元 303包括: The composition unit 303 includes:
查询模块 3031, 根据得到的短串的序列值在已存储的节点中查 询是否已存有相应节点。 The query module 3031 queries, among the stored nodes, whether the corresponding node has been stored according to the obtained sequence value of the short string.
节点添加模块 3032, 在查询模块 3031没有查询到相应节点时, 添加节点, 其实现方式如上所述, 不再赘述。 The node adding module 3032 adds a node when the query module 3031 does not query the corresponding node, and the implementation manner is as described above, and details are not described herein.
连接更新模块 3033, 在查询模块 3031查询到相应节点时, 更 新该相应节点的连接关系, 其实现方式如上所述, 不再赘述。 The connection update module 3033, when the query module 3031 queries the corresponding node, updates the connection relationship of the corresponding node, and the implementation manner thereof is as described above, and details are not described herein again.
为了降低存储 de Bruijn图中节点所需空间,作为本发明的一个 优选实施例,构图单元 303使用 de Bruijn图中的一个节点存储互补 的两短串, 节点的序列值取互补的两短串中较小的序列值, 其实现 方式如上所述, 不再赘述。 In order to reduce the space required for storing nodes in the de Bruijn diagram, as a preferred embodiment of the present invention, the composition unit 303 uses a node in the de Bruijn diagram to store complementary two short strings, and the sequence values of the nodes are taken in two complementary short strings. The smaller sequence values are implemented as described above and will not be described again.
为了加快构建图的速度, 作为本发明的另一个优选实施例, 构 图单元 303采用多个哈希表唯一存储 de Bruijn图中的不同节点,并 采用不同线程访问不同的哈希表。 在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijii图的一个节点, 并使用哈希表进行存储。 由于釆用本发明实 施例的构图方式并使用哈希表进行存储, 使得更新节点连接关系等 同于节点查找和更新查找到的节点的左、 右相连碱基的连接数量, 因此在 O ( l )的时间内就可以完成节点的查找、 插入和节点连接关 系的更新, 从而实现在短序列组装过程降低时间复杂度, 进而实现 对大基因组的短序列组装。 In order to speed up the construction of the map, as another preferred embodiment of the present invention, the composition unit 303 uniquely stores different nodes in the de Bruijn map using a plurality of hash tables, and uses different threads to access different hash tables. In the embodiment of the present invention, the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained. The left and right connection relationships and their number of connections are stored as a node of the de Bruijii diagram and stored using a hash table. Since the composition mode of the embodiment of the present invention is used and stored by using a hash table, the update node connection relationship is equivalent to the number of connections of the left and right connected bases of the node searched and updated by the node, and thus at O ( l ) The search, insertion and node connection relationship of the node can be updated in time, thereby reducing the time complexity in the short sequence assembly process and realizing the short sequence assembly of the large genome.
以上所述仅为本发明的较佳实施例而已, 并不用以限制本发 明, 凡在本发明的精神和原则之内所作的任何修改、 等同替换和改 进等, 均应包含在本发明的保护范围之内。 The above is only the preferred embodiment of the present invention, and is not intended to limit the present invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the protection of the present invention. Within the scope.
Claims
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN200810218338.9 | 2008-12-12 | ||
| CN2008102183389A CN101430742B (en) | 2008-12-12 | 2008-12-12 | A way to assemble a genome |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2010066115A1 true WO2010066115A1 (en) | 2010-06-17 |
Family
ID=40646135
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2009/001427 Ceased WO2010066115A1 (en) | 2008-12-12 | 2009-12-11 | Method and system for lowering time complexity in short sequences assembly |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN101430742B (en) |
| WO (1) | WO2010066115A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2013004005A1 (en) * | 2011-07-05 | 2013-01-10 | 深圳华大基因科技有限公司 | Method for assembling sequenced segments |
Families Citing this family (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101430742B (en) * | 2008-12-12 | 2011-06-29 | 深圳华大基因研究院 | A way to assemble a genome |
| TWI443544B (en) * | 2009-12-23 | 2014-07-01 | Ind Tech Res Inst | Data compression method and sequence compression devices |
| WO2012171213A1 (en) * | 2011-06-17 | 2012-12-20 | 深圳华大基因科技有限公司 | Method and system for assembly of genome |
| US8751166B2 (en) * | 2012-03-23 | 2014-06-10 | International Business Machines Corporation | Parallelization of surprisal data reduction and genome construction from genetic data for transmission, storage, and analysis |
| US8812243B2 (en) | 2012-05-09 | 2014-08-19 | International Business Machines Corporation | Transmission and compression of genetic data |
| US8855938B2 (en) | 2012-05-18 | 2014-10-07 | International Business Machines Corporation | Minimization of surprisal data through application of hierarchy of reference genomes |
| US10353869B2 (en) | 2012-05-18 | 2019-07-16 | International Business Machines Corporation | Minimization of surprisal data through application of hierarchy filter pattern |
| US9002888B2 (en) | 2012-06-29 | 2015-04-07 | International Business Machines Corporation | Minimization of epigenetic surprisal data of epigenetic data within a time series |
| US8972406B2 (en) | 2012-06-29 | 2015-03-03 | International Business Machines Corporation | Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters |
| CN103258145B (en) * | 2012-12-22 | 2016-06-29 | 中国科学院深圳先进技术研究院 | A kind of parallel gene-splicing method based on De Bruijn |
| CN103093121B (en) * | 2012-12-28 | 2016-01-27 | 深圳先进技术研究院 | The compression storage of two-way multistep deBruijn figure and building method |
| CN103699819B (en) * | 2013-12-10 | 2016-09-07 | 深圳先进技术研究院 | The summit extended method of elongated kmer based on multistep two-way De Bruijn inquiry |
| CN104751015B (en) * | 2013-12-30 | 2017-08-29 | 中国科学院天津工业生物技术研究所 | A kind of gene order-checking data sequence assemble method |
| CN106067824B (en) * | 2016-06-02 | 2019-11-05 | 洛阳晶云信息科技有限公司 | A kind of sequencing data compression method based on bigeminy codon |
| CN115394348B (en) * | 2022-07-15 | 2025-08-19 | 中南大学 | Method, equipment and medium for predicting lncRNA subcellular localization based on graph rolling network |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101430742A (en) * | 2008-12-12 | 2009-05-13 | 深圳华大基因研究院 | Method and system for drawing construction in short sequence assembly |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6952651B2 (en) * | 2002-06-17 | 2005-10-04 | Intel Corporation | Methods and apparatus for nucleic acid sequencing by signal stretching and data integration |
| AU2003296994A1 (en) * | 2002-12-13 | 2004-07-09 | Applera Corporation | Methods for identifying, viewing, and analyzing syntenic and orthologous genomic regions between two or more species |
| CN101196921A (en) * | 2007-12-24 | 2008-06-11 | 北京大学 | Dimensionality Reduction Method for Long Sequence Data for Approximate Query |
-
2008
- 2008-12-12 CN CN2008102183389A patent/CN101430742B/en active Active
-
2009
- 2009-12-11 WO PCT/CN2009/001427 patent/WO2010066115A1/en not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101430742A (en) * | 2008-12-12 | 2009-05-13 | 深圳华大基因研究院 | Method and system for drawing construction in short sequence assembly |
Non-Patent Citations (3)
| Title |
|---|
| BOKHARI S.H. ET AL: "A Parallel Graph Decomposition Algorithm for DNA Sequencing with Nanopores", BIOINFORMATICS, vol. 21, no. 7, 11 November 2004 (2004-11-11), pages 889 - 896 * |
| HUO H. W. ET AL: "A Multiple Alignment Approach for DNA Sequences Based on the Maximum Weighted Path Algorithms", JOURNAL OF SOFTWARE, vol. 18, no. 2, February 2007 (2007-02-01), pages 185 - 195 * |
| ZERBINO D.R. ET AL: "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs", GENOME RES., vol. 18, 18 March 2008 (2008-03-18), pages 821 - 829 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2013004005A1 (en) * | 2011-07-05 | 2013-01-10 | 深圳华大基因科技有限公司 | Method for assembling sequenced segments |
Also Published As
| Publication number | Publication date |
|---|---|
| CN101430742B (en) | 2011-06-29 |
| CN101430742A (en) | 2009-05-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2010066115A1 (en) | Method and system for lowering time complexity in short sequences assembly | |
| CN101594319B (en) | Entry lookup method and entry lookup device | |
| US20210165805A1 (en) | Partition Merging Method and Database Server | |
| US20130080485A1 (en) | Quick filename lookup using name hash | |
| Almodaresi et al. | PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index | |
| JP4995125B2 (en) | How to search fixed length data | |
| JP3570323B2 (en) | How to store prefixes for addresses | |
| EP2344959A2 (en) | Index compression in databases | |
| CN103561133A (en) | IP address ownership information indexing and fast querying method | |
| CN101751517B (en) | A method and system for rapid processing of genome short sequence mapping | |
| Xiao et al. | Using parallel bloom filters for multiattribute representation on network services | |
| WO2021179781A1 (en) | Method, systemand device for sequence alignment, and readable storage medium | |
| CN103065067B (en) | The filter method of sequence fragment and system in short sequence assembling | |
| US20130307710A1 (en) | Compression match enumeration | |
| CN114884877B (en) | IPv6 route searching method combining hash table and HOT | |
| CN100496019C (en) | A Method for Rapid Search and Update of IPv6 Routing Table | |
| Hon et al. | Efficient algorithm for circular Burrows-Wheeler transform | |
| CN112231400B (en) | Access method, device, equipment and storage medium of distributed database | |
| CN112269784B (en) | Hash table device based on hardware realization and inserting, inquiring and deleting method | |
| WO2010127536A1 (en) | Table creating and searching method used by network processor | |
| CN1312890C (en) | Method for generating a trie having a reduced number of trie blocks | |
| CN100396015C (en) | A TCAM routing table management method and system | |
| WO2015192742A1 (en) | Lookup device, lookup method and configuration method | |
| CN109831384B (en) | Name lookup method and router | |
| WO2005066835A1 (en) | A method for quickly retrieving a record in a data page of a database |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09831392 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 09831392 Country of ref document: EP Kind code of ref document: A1 |