WO2010066115A1 - 一种降低短序列组装过程的时间复杂度的方法及系统 - Google Patents
一种降低短序列组装过程的时间复杂度的方法及系统 Download PDFInfo
- Publication number
- WO2010066115A1 WO2010066115A1 PCT/CN2009/001427 CN2009001427W WO2010066115A1 WO 2010066115 A1 WO2010066115 A1 WO 2010066115A1 CN 2009001427 W CN2009001427 W CN 2009001427W WO 2010066115 A1 WO2010066115 A1 WO 2010066115A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- short
- sequence
- connections
- base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
Definitions
- the invention belongs to the technical field of genetic engineering, and in particular relates to a method and system for reducing the time complexity of a short sequence assembly process. Background technique
- sequence length is short
- the embodiment of the present invention is implemented as a method for reducing the time complexity of a short sequence assembly process, and the method includes the following steps:
- the received sequencing sequence is slid and cut one by one to obtain a short string of a fixed base length, and the left and right connection relationships of the short string are obtained, and the left and right connection relationships include a short string sequence value and a left connection exists.
- Another object of embodiments of the present invention is to provide a system for reducing the time complexity of a short sequence assembly process, the system comprising:
- a receiving unit configured to receive a sequencing sequence
- a sequence cutting unit configured to separately slide the received sequencing sequence one by one to obtain a short string of a fixed base length, and obtain a left and right connection relationship of the short string, wherein the left and right connection relationships include short strings a sequence value, the number of linkages of each base in which there is a left junction, and the number of linkages of each base in which a right junction exists;
- composition unit configured to store the obtained sequence values of the short strings, the left and right connection relationships, and the number of connections thereof as a node of the de Bruijn graph, and store the nodes of the de Bruijn graph by using a hash table,
- the hash key is the sequence value
- the hash value is the node.
- the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
- the left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph and stored using a hash table. Since the hash table storage is used to make the update node connection relationship equal to the number of connections of the node to find and update the left and right connected bases of the found node, the search, insertion, and insertion of the node can be completed within O ( l ) time. The update of the node connection relationship, thereby reducing the time complexity in the short sequence assembly process, and thus achieving short sequence assembly of large genomes.
- FIG. 1 is a flowchart of an implementation of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention
- FIG. 2 is a schematic diagram of a node storage content according to an embodiment of the present invention.
- FIG. 3 is a structural diagram of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. detailed description
- the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
- the left and right connection relationships and their number of connections are stored as a node of the de Bruijn graph, and the hash table is used to store the nodes of the de Bruijn graph, wherein the hash key is a sequence value and the hash value is a node.
- FIG. 1 is a flowchart showing an implementation process of a method for reducing time complexity of a short sequence assembly process according to an embodiment of the present invention, which is described in detail as follows:
- step S101 receiving a sequencing sequence
- step S102 the received sequencing sequence is slid and cut one by one to obtain a short length (kmer) of a fixed length, and a left and right connection relationship of the short string is obtained, and the left and right connection relationships include short sequence values.
- step S103 the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof are stored as a node of the de Bruijn graph, and the hash table is used to store de Each node of the Briiijn graph, where the hash key is a sequence value and the hash value is a node.
- the sequencing sequence has a base length of 25 - 75 and is cut into short strings of a fixed length of 21 - 31.
- the length of the short string obtained by the cutting is smaller than the length of the sequence, and the length can be set according to the length of the sequencing sequence and the actual situation.
- Each node in the de Bruijn diagram uses the corresponding bits to store its sequence value, the number of connections for each base with left joins, and the number of connections for each base with right joins.
- each node on the de B ijn map is stored in 16 bytes, and its storage format is as follows:
- sequence value is calculated by using 2 bits to store a nucleotide sequence, A is represented by 00, C is represented by 01, T is represented by 10, G is represented by 11, and the sequence is encoded to generate a
- A is represented by 00
- C is represented by 01
- T is represented by 10
- G is represented by 11
- the sequence is encoded to generate a
- the 64-bit integer value and considering the short string of even length, its complementary short string may be itself, for example, the short string GATC's complementary short string is GATC itself. In order to prevent such confusion, the lengths of the short strings are all odd.
- the length of the short string is not more than 31;
- the leftjinks uses 24 bits to store the left connection relationship and the number, and divides the 24 bits into 4 6-bit, ie A: 6, T: 6, G: 6, C: 6, respectively, using 6 bits to store the number of connections to the base A, T, G or C with the left link of the short string, each The number of connections is in the range [0, 63];
- right-link uses 24 bits to store its right connection and number, and divides 24 bits into 4 6 bits, ie A: 6, T: 6, G: 6, C: 6, respectively, the number of connections of A, T, G or C with the right connection of the short string is stored by 6 bits, and the number of each connection is in the range of [0, 63]; the following 8 bits can be used.
- the delete flag closed may be stored to identify whether the short string is deleted; the use mark in one may also be stored to identify whether the short string has been used, and other identifiers may also be stored.
- the connection relationship of each node in the de Bmijii diagram can be constructed according to the short string sequence value stored in the node, the number of connections of each base having the left connection, and the number of connections of each base having the right connection.
- the short-chain A is AAAAAAAA
- the number of connections of the right-connected T is 19
- the short-chain B of the right-connected base T is AAAAAAAT, which is equal to the short-chain ⁇ left-shifting one base and adding the base T connected thereto.
- the storage contents of the number of connections in the node storing the right port are as shown in FIG. 2 .
- step S103 is specifically:
- Step 1 Query whether the corresponding node exists in the stored node according to the obtained sequence value of the short string;
- Step 2 If no corresponding node is queried, add a node
- Step 3 If the corresponding node is queried, the connection relationship of the corresponding node is updated.
- each node of the de Bruijn graph is stored using a hash table, the hash key is a sequence value, and the hash value is a node. For example, take a short string of AAAAAAAA, its sequence value is 0x0000, and use its sequence value 0x0000 as a key to query whether the corresponding node exists in the hash table. If no corresponding node is queried, add the node to the hash table.
- the seq of the value is the sequence value of the short string 0x0000, and the number of connections of the corresponding left and right connected bases in the node is set to 1 according to the short string adjacent to the short string; if the query already has a corresponding
- the node updates the connection relationship of the corresponding node, that is, updates the number of connections of the corresponding left and right connected bases in the node according to the short string adjacent to the short string, and the number of corresponding connections of the bases connected to the short string plus 1.
- step 1 to find the next short string until all short strings are found.
- the hash node can be used to complete the lookup node, the insert node (ie, the storage node), and the update node connection relationship in the time of O(1). Updating the node connection relationship is equivalent to finding the node and updating the number of connections of the left and right connected bases of the found node, so the time complexity is still O ( l ).
- the embodiment of the present invention converts the steps that originally required multiple logical processes into one step, so that the computer completes the search, insert, and update connection relationship in the operation of identifying the step of storing the content of the node.
- the three steps, while the time complexity is still O ( l ) thus saving the time for the computer to make logical judgments, improving the internal performance of the computer during the short sequence assembly process, thus providing for the realization of short sequence assembly of large genomes.
- only one node in the de Bruijn graph can be used to store two complementary short strings, and the sequence values of the nodes are complementary to two short A smaller sequence value in the string. If the sequence value of a short string is smaller than the sequence value of its complementary short string, the node in the de Bruijn graph stores the sequence value of the short string, and seq stores the sequence value of the short string, and the number of corresponding connections to the left connected base.
- Update to left-links the number of corresponding connections to its right-connected base is updated to right-links; if the sequence value of a short string is greater than the sequence value of its complementary short-string, the node in the de Bruijn graph stores its complementary short string
- the sequence value, seq stores the sequence value of its complementary short string
- the number of corresponding connections to its right-linked base is updated to leftjinks
- the number of corresponding connections to its left-linked base is updated to ri g M-links.
- multiple hash tables are used to uniquely store different nodes in the de Bruijn graph, and different threads are used to access different hash tables.
- eight hash tables are created, a certain number of original sequences are read, and eight threads are used to perform multi-thread cutting and short-string complementation on the original sequence to be read. After the data collection is completed, the data is collected. Eight threads are inserted into the update node, where each thread only processes the sequence value of the fixed prefix. Each hash table stores a sequence value of the specified prefix, and a hash table has only one thread access to guarantee the uniqueness of the node storage.
- node information ie, sequence value
- node connection information ie, edges
- nodes of the de Bruijn graph can be stored in a tree structure, and the hash table is used to store each node in memory and usage and stored in a tree structure, but using a hash table storage.
- Each node is significantly better than the tree's storage structure in terms of access and modification speed.
- the African human genome resequencing data was selected. After error correction processing, the sequence data amount was 254 G bases, and after cutting into a fixed length short string of 25 base length, the total number of short strings (including the forward and reverse sequences) was 7G.
- the method provided by the embodiment of the present invention constructs a de Bruijn graph, and the maximum memory usage value is 110G, which consumes 23 CPU hours, wherein the CPU parameter is Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ.
- the received sequencing sequence is slid and cut one by one to obtain a short string of fixed base length, and the left and right connection relationships of the short strings are obtained, and the left and right connection relationships include short string sequence values and presence of left connections.
- FIG. 3 is a diagram showing the structure of a system for reducing the time complexity of a short sequence assembly process according to an embodiment of the present invention. For the convenience of description, only the embodiments related to the embodiment of the present invention are shown. section.
- the system can be used in short sequence assembly, where:
- the receiving unit 301 receives the sequencing sequence.
- the sequence cutting unit 302 respectively slid and cuts the received sequencing sequence by a base to obtain a short string of a fixed base length, and obtains a left-right connection relationship of the short strings.
- the actual manner is as described above, and will not be described again.
- the composition unit 303 stores the obtained sequence values of the short strings, the left and right connection relationships and the number of connections thereof as a node of the de Bruijn graph, and uses a hash table to store each node of the de Bruijn graph, wherein the hash key is The sequence value, the hash value is the node.
- the composition unit 303 uses the corresponding bit in the node of the de Bruijn diagram to store its sequence value, the number of connections of each base in which there is a left connection, and the number of connections in which each base of the right connection exists.
- the composition unit 303 includes:
- the query module 3031 queries, among the stored nodes, whether the corresponding node has been stored according to the obtained sequence value of the short string.
- the node adding module 3032 adds a node when the query module 3031 does not query the corresponding node, and the implementation manner is as described above, and details are not described herein.
- connection update module 3033 when the query module 3031 queries the corresponding node, updates the connection relationship of the corresponding node, and the implementation manner thereof is as described above, and details are not described herein again.
- the composition unit 303 uses a node in the de Bruijn diagram to store complementary two short strings, and the sequence values of the nodes are taken in two complementary short strings.
- the smaller sequence values are implemented as described above and will not be described again.
- the composition unit 303 uniquely stores different nodes in the de Bruijn map using a plurality of hash tables, and uses different threads to access different hash tables.
- the short sequence of the fixed base length is obtained by slidingly cutting the received sequencing sequence one by one, and the left and right connection relationships of the short strings are obtained, and the sequence values of the obtained short strings are obtained.
- the left and right connection relationships and their number of connections are stored as a node of the de Bruijii diagram and stored using a hash table.
- the update node connection relationship is equivalent to the number of connections of the left and right connected bases of the node searched and updated by the node, and thus at O ( l )
- the search, insertion and node connection relationship of the node can be updated in time, thereby reducing the time complexity in the short sequence assembly process and realizing the short sequence assembly of the large genome.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
一种降低短序列组装过程的时间复杂度的方法及系统 技术领域
本发明属于基因工程技术领域, 尤其涉及一种降低短序列组装 过程时间复杂度的方法及系统。 背景技术
新测序技术产生的短序列有两个特点:
1.序列长度短;
2.数据量大。
在长序列组装过程中, 常用的 phrap等软件均是基于序列间的 交叠(overlap )来进行拼接, 如果应用在短序列上, 则运算量太大, 没有实际应用价值。而新兴的短序列组装软件中成功处理短序列的, 例如基于 de Bruijn图的 velvet等。 但是, 由于受内存、 时间等的限 制 , 现有的这些短序列组装软件只能组装较小的原核生物基因组, 对于大基因组, 例如真核生物基因组, 特别是哺乳动物基因组数据, 由于数据处理时时间复杂度较高、 内存占用较大, 现有的短序列组 装软件均难以实现短序列的组装。 发明内容
本发明实施例的目的在于提供一种降低短序列组装过程的时间 复杂度的方法, 旨在解决现有短序列组装软件不能组装大基因组的 问题。
本发明实施例是这样实现的, 一种降低短序列组装过程的时间 复杂度的方法, 所述方法包括下述步骤:
接收测序序列;
分别将接收到的测序序列逐个碱基滑动切割得到固定碱基长度 的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连接关系 包括短串序列值、 存在左连接的各碱基的连接数量和存在右连接的 各½的连接数量;
将得到的各所述短串的序列值, 左、 右连接关系及其连接数量 存储为 de Bruijn图的一个节点, 并釆用哈希表存储所述 de Bruijn 图的各节点, 其中哈希键为所述序列值, 哈希值为所述节点。
本发明实施例的另一目的在于提供一种降低短序列组装过程的 时间复杂度的系统, 所述系统包括:
接收单元, 用于接收测序序列;
序列切割单元, 用于分别将接收到的测序序列逐个碱基滑动切 割得到固定碱基长度的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基的连接 数量和存在右连接的各碱基的连接数量; 以及
构图单元, 用于将得到的各所述短串的序列值, 左、 右连接关 系及其连接数量存储为 de Bruijn图的一个节点,并采用哈希表存储 所述 de Bruijn图的各节点,其中哈希键为所述序列值,哈希值为所 述节点。
在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn图的一个节点, 并使用哈希表进行存储。 由于使用哈希表存 储使得更新节点连接关系等同于节点查找和更新查找到的节点的 左、 右相连碱基的连接数量, 因此在 O ( l )的时间内就可以完成节 点的查找、 插入和节点连接关系的更新, 从而实现在短序列组装过 程降低时间复杂度, 进而实现对大基因组的短序列组装。
附图说明
图 1是本发明实施例提供的降低短序列组装过程的时间复杂度 的方法的实现流程图;
图 2是本发明实施例提供的节点存储内容的示意图;
图 3本发明实施例提供的降低短序列组装过程的时间复杂度的 系统的结构图。 具体实施方式
为了使本发明的目的、 技术方案及优点更加清楚明白, 以下结 合附图及实施例, 对本发明进行进一步详细说明。 应当理解, 此处 所描述的具体实施例仅仅用以解释本发明, 并不用于限定本发明。
在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn图的一个节点,并采用哈希表存储 de Bruijn图的各节点,其 中哈希键为序列值, 哈希值为节点。
图 1示出了本发明实施例提供的降低短序列组装过程的时间复 杂度的方法的实现流程, 详述如下:
在步驟 S101中, 接收测序序列;
在步骤 S102中,分别将接收到的测序序列逐个碱基滑动切割得 到固定 长度的短串 (kmer ) , 并得到短串的左、 右连接关系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基的连接 数量和存在右连接的各碱基的连接数量;
在步骤 S103中, 将得到的各短串的序列值, 左、 右连接关系及 其连接数量存储为 de Bruijn图的一个节点, 并采用哈希表存储 de
Briiijn图的各节点, 其中哈希键为序列值, 哈希值为节点。
在本发明实施例中, 测序序列的碱基长度为 25 - 75, 切割成固 定 长度为 21 - 31的短串。 当然, 切割得到的短串的长度小于测 序序列的长度, 其长度可以根据测序序列的长度和实际情况设定。 de Bruijn图中每个节点使用相应位存储其序列值、存在左连接的各 碱基的连接数量和存在右连接的各碱基的连接数量。 这里, 用 16字 节存储 de B ijn图上的各节点, 其存储格式如下:
[ seq: 64, left links: 24, right— links: 24, …】;
其中, seq存储短串的序列值,序列值的计算方法是使用 2位存 储一个核苷序列, A用 00表示, C用 01表示, T用 10表示, G用 11表示, 顺序编码下去生成一个占 64位的整数值, 并且, 考虑到 对于偶数长度的短串, 其互补短串可能为它自己, 例如短串 GATC 的互补短串为 GATC自己。 为了防止这种混淆, 短串的长度均为奇 数, 由于本发明实施例中数据结构的限制, 短串的长度不大于 31; leftjinks用 24位存储其左连接关系及数量, 将 24位分割成 4个 6 位, 即 A: 6, T: 6, G: 6, C: 6, 分别用 6位存储与该短串存在 左连接的碱基 A、 T、 G或 C的连接数量, 每种连接数量的取值范 围为 [0, 63]; right— links用 24位存储其右连接关系及数量, 将 24 位分割成 4个 6位, 即 A: 6, T: 6, G: 6, C: 6, 分别用 6位存 储与该短串存在右连接的 A、 T、 G或 C的连接数量, 每种连 接数量的取值范围为 [0, 63]; 其后面的 8位可以用于存储其他值, 例如, 可以存储删除标记 closed, 以标识该短串是否被删除; 也可 以存储使用标记 in一 use, 以标识该短串是否被使用过, 还可以存储 其他标识。 这样, 根据节点中存储的短串序列值、 存在左连接的各 碱基的连接数量和存在右连接的各碱基的连接数量即可构建 de Bmijii图中各节点的连接关系。
例如,短串甲为 AAAAAAAA存在右连接的 T的连接数量 为 19, 与其右连接碱基 T的短串乙为 AAAAAAAT, 等于短串曱左 移一个碱基并加上与其连接的碱基 T, 并且与短串甲连接的短串乙 有 19个, 节点中存储右连接 Τ的连接数量的存储内容如图 2 所示。
上述步骤 S103具体为:
步骤 1.根据得到的短串的序列值在已存储的节点中查询是否已 存有相应节点;
步骤 2.如果没有查询到相应节点, 则添加节点;
步骤 3.如果查询到相应节点, 则更新该相应节点的连接关系。 在本发明实施例中,使用哈希表存储 de Bruijn图的各节点,哈 希键为序列值, 哈希值为节点。 例如取一短串为 AAAAAAAA, 其 序列值为 0x0000 , 将其序列值 0x0000作为键在哈希表中查询是否 已存有相应节点, 如果没有查询到相应节点, 则添加节点存储到哈 希表中, 其值中的 seq为该短串的序列值 0x0000, 并根据该短串相 邻的短串将该节点中相应左、 右相连碱基的连接数量置为 1; 如果 查询到已存有相应节点, 则更新相应节点的连接关系, 即根据与该 短串相邻的短串更新该节点中相应左、 右相连碱基的连接数量, 将 与该短串有连接的碱基的相应连接数量加 1。 完成后, 执行步骤 1, 查找下一个短串, 直至完成全部短串的查找。
在本发明实施例中, 使用哈希表可以在 O ( 1 )的时间内完成查 找节点、 插入节点 (即存储节点)和更新节点连接关系。 更新节点 连接关系等同于查找节点, 并更新查找到的节点的左、 右相连碱基 的连接数量, 所以时间复杂度依然为 O ( l )。 本发明实施例将原来 需要多个逻辑处理的步骤转化为一个步骤, 使得计算机在识别节点 存储内容这个步骤的运算中, 完成了查找、 插入和更新联结关系的
三个步骤, 而时间复杂度仍然为 O ( l ), 因此节约了计算机进行逻 辑判断的时间, 改善了短序列组装过程中计算机的内部性能, 从而 为大基因组的短序列组装的实现提供了 。
为了降低存储 de Bruijn图中节点所需的内存空间,作为本发明 的一个优选实施例,还可以只用 de Bruijn图中的一个节点存储互补 的两短串, 节点的序列值取互补的两短串中较小的序列值。 如果一 个的短串的序列值小于其互补短串的序列值,则 de Bruijn图中的节 点存储该短串的序列值, seq存储该短串的序列值,与其左连接碱基 的相应连接数量更新到 left— links, 与其右连接碱基的相应连接数量 更新到 right— links; 如果一个的短串的序列值大于其互补短串的序 列值, 则 de Bruijn图中的节点存储其互补短串的序列值, seq存储 其互补短串的序列值, 与其右连接碱基的相应连接数量更新到 leftjinks, 与其左连接碱基的相应连接数量更新到 rigM—links。 操 作图时, 可以在程序中使用一个附加的变量来标记我们使用的是互 补的两短串的哪一个。 并且, 在沿图遍历时, 只需要程序维持一个 这样的变量, 就可以正确地得到路径中所有节点的正方向。
为了加快构建图的速度, 作为本发明的另一个优选实施例, 使 用多个哈希表唯一存储 de Bruijn图中的不同节点,并采用不同线程 访问不同的哈希表。
在本发明实施例中, 建立 8个哈希表, 读入一定数目的原始序 列, 采用 8个线程对读入的原始测序列进行多线程切割、 短串求互 补, 在数据收集完毕后, 采用 8个线程进行插入更新节点, 其中每 个线程只处理固定前缀的序列值。 每个哈希表存储指定前缀的序列 值, 并且一个哈希表只有一个线程访问, 以保证节点存储的唯一性。
采用上述本发明实施例提供的压缩的数据结构, 可以将节点信 息(即序列值)和节点的连接信息 (即边)组合在一起, 从一个节
点的值可以得到该节点上的短串、 与该短串相邻的短串的序列值及 其数量。
当然,也可以用其他结构来存储 de Bruijn图的各节点,例如可 以用树结构来存储, 使用哈希表存储各节点在内存和使用上与用树 状结构存储近似, 但是使用哈希表存储各节点在访问和修改速度上 都明显优于树的存储结构。
选取非洲人基因组重测序数据, 经纠错处理后, 序列数据量 254G碱基, 切割成 25碱基长度的定长短串后, 短串的总数目 (包 括正反向序列) 为 7G 条, 采用本发明实施例提供的方法构建 de Bruijn图, 内存最大使用值为 110G, 共消耗 23 CPU小时, 其中, CPU 的参数为 Quad-Core AMD Opteron(tm) Processor 8356 2.2GHZ。
本领域普通技术人员可以理解, 实现上述实施例方法中的全部 或部分步驟是可以通过程序来指令相关的硬件来完成, 所述的程序 可以在存储于一计算机可读取存储介质中, 所述的存储介质, 如 ROM/RAM、 磁盘、 光盘等, 该程序用来执行如下步骤:
1.接收测序序列;
2. 分别将接收到的测序序列逐个碱基滑动切割得到固定碱基 长度的短串, 并得到短串的左、 右连接关系, 所述左、 右连接关系 包括短串序列值、 存在左连接的各碱基的连接数量和存在右连接的 各碱基的连接数量;
3. 将得到的各短串的序列值, 左、 右连接关系及其连接数量存 储为 de Bruijn图的一个节点, 并采用哈希表存储 de Bruijn图的各 节点, 其中哈希键为序列值, 哈希值为节点。
图 3示出了本发明实施例提供的降低短序列组装过程的时间复 杂度的系统的结构, 为了便于说明仅示出了与本发明实施例相关的
部分。
该系统可以用于短序列组装中, 其中:
接收单元 301, 接收测序序列。
序列切割单元 302, 分别将接收到的测序序列逐个碱基滑动切 割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 其实 现方式如上所述, 不再赘述。
构图单元 303, 将得到的各短串的序列值, 左、 右连接关系及 其连接数量存储为 de Bruijn图的一个节点, 并采用哈希表存储 de Bruijn图的各节点, 其中哈希键为序列值, 哈希值为节点。 在本发 明实施例中,构图单元 303在 de Bruijn图的节点中使用相应位存储 其序列值、 存在左连接的各碱基的连接数量和存在右连接的各碱基 的连接数量。
其中, 构图单元 303包括:
查询模块 3031, 根据得到的短串的序列值在已存储的节点中查 询是否已存有相应节点。
节点添加模块 3032, 在查询模块 3031没有查询到相应节点时, 添加节点, 其实现方式如上所述, 不再赘述。
连接更新模块 3033, 在查询模块 3031查询到相应节点时, 更 新该相应节点的连接关系, 其实现方式如上所述, 不再赘述。
为了降低存储 de Bruijn图中节点所需空间,作为本发明的一个 优选实施例,构图单元 303使用 de Bruijn图中的一个节点存储互补 的两短串, 节点的序列值取互补的两短串中较小的序列值, 其实现 方式如上所述, 不再赘述。
为了加快构建图的速度, 作为本发明的另一个优选实施例, 构 图单元 303采用多个哈希表唯一存储 de Bruijn图中的不同节点,并 采用不同线程访问不同的哈希表。
在本发明实施例中, 通过分别将接收到的测序序列逐个碱基滑 动切割得到固定碱基长度的短串, 并得到短串的左、 右连接关系, 将得到的各短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijii图的一个节点, 并使用哈希表进行存储。 由于釆用本发明实 施例的构图方式并使用哈希表进行存储, 使得更新节点连接关系等 同于节点查找和更新查找到的节点的左、 右相连碱基的连接数量, 因此在 O ( l )的时间内就可以完成节点的查找、 插入和节点连接关 系的更新, 从而实现在短序列组装过程降低时间复杂度, 进而实现 对大基因组的短序列组装。
以上所述仅为本发明的较佳实施例而已, 并不用以限制本发 明, 凡在本发明的精神和原则之内所作的任何修改、 等同替换和改 进等, 均应包含在本发明的保护范围之内。
Claims
1. 一种降低短序列组装过程的时间复杂度的方法, 其特征在 于, 所述方法包括下述步骤:
接收测序序列;
分别将接收到的测序序列逐个碱基滑动切割得到固定碱基 长度的短串, 并得到所述短串的左、 右连接关系, 所述左、 右连 接关系包括短串序列值、 存在左连接的各碱基的连接数量和存在 右连接的各碱基的连接数量;
将得到的各所述短串的序列值, 左、 右连接关系及其连接数 量存储为 de Bruijn 图的一个节点, 并采用哈希表存储所述 de Bruijn图的各节点, 其中哈希键为所述序列值, 哈希值为所述节 点。
2. 如权利要求 1所述的方法, 其特征在于, 所述 de Bruijn图 的节点使用相应位存储所述短串的序列值、 存在左连接的各碱基 的连接数量和存在右连接的各碱基的连接数量。
3. 如权利要求 1所述的方法, 其特征在于, 所述将得到的各 短串的序列值, 左、 右连接关系及其连接数量存储为 de Bruijn 图的一个节点的步骤具体为:
根据得到的短串的序列值在已存储的节点中查询是否已存有 相应节点;
如果没有查询到相应节点, 则添加节点;
如果查询到相应节点, 则更新所述相应节点的连接关系。
4. 如权利要求 1所述的方法, 其特征在于, 所述 de Bruijn图 的一个节点中存储互补的两短串。
5. 如权利要求 1所述的方法, 其特征在于, 采用多个哈希表
唯一存储所述 de Bruijn 图的不同节点, 并釆用不同线程访问不 同的哈希表。
6. 一种降低短序列组装过程的时间复杂度的系统, 其特征在 于, 所述系统包括:
接收单元, 用于接收测序序列;
序列切割单元,用于分别将接收到的测序序列逐个碱基滑动 切割得到固定碱基长度的短串, 并得到所述短串的左、 右连接关 系, 所述左、 右连接关系包括短串序列值、 存在左连接的各碱基 的连接数量和存在右连接的各碱基的连接数量; 以及
构图单元, 用于将得到的各所述短串的序列值, 左、 右连接 关系及其连接数量存储为 de Bruijn 图的一个节点, 并采用哈希 表存储所述 de Bruijn 图中的不同节点, 其中哈希键为所述序列 值, 哈希值为所述节点。
7. 如权利要求 6所述的系统, 其特征在于, 所述构图单元在 de Bruijn图的节点中使用相应位存储所述短串的序列值、存在左 连接的各碱基的连接数量和存在右连接的各碱基的连接数量。
8. 如权利要求 7所述的系统, 其特征在于, 所述构图单元包 括:
查询模块,用于根据得到的短串的序列值在已存储的节点中 查询是否已存有相应节点;
节点添加模块, 用于在所述查询模块没有查询到相应节点 时, 添加节点;
连接更新模块, 用于在所述查询模块查询到相应节点时, 更 新所述相应节点的连接关系。
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2008102183389A CN101430742B (zh) | 2008-12-12 | 2008-12-12 | 一种组装基因组的方法 |
| CN200810218338.9 | 2008-12-12 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2010066115A1 true WO2010066115A1 (zh) | 2010-06-17 |
Family
ID=40646135
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2009/001427 Ceased WO2010066115A1 (zh) | 2008-12-12 | 2009-12-11 | 一种降低短序列组装过程的时间复杂度的方法及系统 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN101430742B (zh) |
| WO (1) | WO2010066115A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2013004005A1 (zh) * | 2011-07-05 | 2013-01-10 | 深圳华大基因科技有限公司 | 组装测序片段的方法 |
Families Citing this family (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101430742B (zh) * | 2008-12-12 | 2011-06-29 | 深圳华大基因研究院 | 一种组装基因组的方法 |
| US8223043B2 (en) | 2009-12-23 | 2012-07-17 | Industrial Technology Research Institute | Method and apparatus for compressing nucleotide sequence data |
| WO2012171213A1 (zh) * | 2011-06-17 | 2012-12-20 | 深圳华大基因科技有限公司 | 一种基因组组装方法和系统 |
| US8751166B2 (en) | 2012-03-23 | 2014-06-10 | International Business Machines Corporation | Parallelization of surprisal data reduction and genome construction from genetic data for transmission, storage, and analysis |
| US8812243B2 (en) | 2012-05-09 | 2014-08-19 | International Business Machines Corporation | Transmission and compression of genetic data |
| US8855938B2 (en) | 2012-05-18 | 2014-10-07 | International Business Machines Corporation | Minimization of surprisal data through application of hierarchy of reference genomes |
| US10353869B2 (en) | 2012-05-18 | 2019-07-16 | International Business Machines Corporation | Minimization of surprisal data through application of hierarchy filter pattern |
| US8972406B2 (en) | 2012-06-29 | 2015-03-03 | International Business Machines Corporation | Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters |
| US9002888B2 (en) | 2012-06-29 | 2015-04-07 | International Business Machines Corporation | Minimization of epigenetic surprisal data of epigenetic data within a time series |
| CN103258145B (zh) * | 2012-12-22 | 2016-06-29 | 中国科学院深圳先进技术研究院 | 一种基于De Bruijn图的并行基因拼接方法 |
| CN103093121B (zh) * | 2012-12-28 | 2016-01-27 | 深圳先进技术研究院 | 双向多步deBruijn图的压缩存储和构造方法 |
| CN103699819B (zh) * | 2013-12-10 | 2016-09-07 | 深圳先进技术研究院 | 基于多步双向De Bruijn图的变长kmer查询的顶点扩展方法 |
| CN104751015B (zh) * | 2013-12-30 | 2017-08-29 | 中国科学院天津工业生物技术研究所 | 一种基因组测序数据序列组装方法 |
| CN106067824B (zh) * | 2016-06-02 | 2019-11-05 | 洛阳晶云信息科技有限公司 | 一种基于二联密码子的测序数据压缩方法 |
| CN115394348B (zh) * | 2022-07-15 | 2025-08-19 | 中南大学 | 基于图卷积网络的lncRNA亚细胞定位预测方法、设备及介质 |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101430742A (zh) * | 2008-12-12 | 2009-05-13 | 深圳华大基因研究院 | 一种短序列组装中构建图的方法及系统 |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6952651B2 (en) * | 2002-06-17 | 2005-10-04 | Intel Corporation | Methods and apparatus for nucleic acid sequencing by signal stretching and data integration |
| US20050066276A1 (en) * | 2002-12-13 | 2005-03-24 | Moore Helen M. | Methods for identifying, viewing, and analyzing syntenic and orthologous genomic regions between two or more species |
| CN101196921A (zh) * | 2007-12-24 | 2008-06-11 | 北京大学 | 用于近似查询的长序列数据降维方法 |
-
2008
- 2008-12-12 CN CN2008102183389A patent/CN101430742B/zh active Active
-
2009
- 2009-12-11 WO PCT/CN2009/001427 patent/WO2010066115A1/zh not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101430742A (zh) * | 2008-12-12 | 2009-05-13 | 深圳华大基因研究院 | 一种短序列组装中构建图的方法及系统 |
Non-Patent Citations (3)
| Title |
|---|
| BOKHARI S.H. ET AL: "A Parallel Graph Decomposition Algorithm for DNA Sequencing with Nanopores", BIOINFORMATICS, vol. 21, no. 7, 11 November 2004 (2004-11-11), pages 889 - 896 * |
| HUO H. W. ET AL: "A Multiple Alignment Approach for DNA Sequences Based on the Maximum Weighted Path Algorithms", JOURNAL OF SOFTWARE, vol. 18, no. 2, February 2007 (2007-02-01), pages 185 - 195 * |
| ZERBINO D.R. ET AL: "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs", GENOME RES., vol. 18, 18 March 2008 (2008-03-18), pages 821 - 829 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2013004005A1 (zh) * | 2011-07-05 | 2013-01-10 | 深圳华大基因科技有限公司 | 组装测序片段的方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN101430742A (zh) | 2009-05-13 |
| CN101430742B (zh) | 2011-06-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2010066115A1 (zh) | 一种降低短序列组装过程的时间复杂度的方法及系统 | |
| CN101594319B (zh) | 表项查找方法和装置 | |
| US20130080485A1 (en) | Quick filename lookup using name hash | |
| JP4995125B2 (ja) | 固定長データの検索方法 | |
| CN103561133A (zh) | 一种ip地址归属信息索引方法及快速查询方法 | |
| JP3570323B2 (ja) | アドレスに関するプレフィクスの格納方法 | |
| WO2010062554A2 (en) | Index compression in databases | |
| CN101751517B (zh) | 一种基因组短序列映射的快速处理方法及系统 | |
| Xiao et al. | Using parallel bloom filters for multiattribute representation on network services | |
| WO2021179781A1 (zh) | 一种序列比对的方法、系统、设备及可读存储介质 | |
| CN103065067B (zh) | 短序列组装中序列片段的过滤方法及系统 | |
| US20130307710A1 (en) | Compression match enumeration | |
| CN114884877B (zh) | 一种哈希表和HOT相结合的IPv6路由查找方法 | |
| CN100496019C (zh) | IPv6路由表快速查找和更新的方法 | |
| Hon et al. | Efficient algorithm for circular Burrows-Wheeler transform | |
| CN112231400B (zh) | 分布式数据库的访问方法、装置、设备及存储介质 | |
| CN112269784B (zh) | 一种基于硬件实现的哈希表装置以及插入、查询和删除方法 | |
| WO2010127536A1 (zh) | 网络处理器使用的建表和查表方法 | |
| CN1312890C (zh) | 用于产生具有数量减少的特里块的特里结构的方法 | |
| WO2015192742A1 (zh) | 一种查找装置、查找方法和配置方法 | |
| CN109831384B (zh) | 名字查找方法及路由器 | |
| WO2005066835A1 (en) | A method for quickly retrieving a record in a data page of a database | |
| CN100396015C (zh) | 一种tcam路由表管理方法和系统 | |
| US6233574B1 (en) | Method and apparatus for performing radix lookups using transition tables with pointers | |
| CN118824374A (zh) | 一种基于syncmer的进化距离估计及系统发育树构建方法 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09831392 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 09831392 Country of ref document: EP Kind code of ref document: A1 |