CN103699818B

CN103699818B - Two-way side extended method based on the elongated kmer inquiries of the two-way De Bruijns of multistep

Info

Publication number: CN103699818B
Application number: CN201310670740.1A
Authority: CN
Inventors: 孟金涛; 张慧琳; 彭丰斌; 魏彦杰; 冯圣中
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-12-10
Filing date: 2013-12-10
Publication date: 2017-04-05
Anticipated expiration: 2033-12-10
Also published as: CN103699818A

Abstract

The present invention relates to the technical field of gene sequencing, and provides a vertex expansion method for variable-length kmer query based on a multi-step bidirectional De Bruijn graph, comprising: step A: reading a sequencing data source file, and constructing a multi-step bidirectional De Bruijn graph; B: performing statistics on intersecting bidirectional edges in the multi-step bidirectional De Bruijn graph; Step C: expanding bidirectional edges based on variable-length kmer queries in the multi-step bidirectional De Bruijn graph. The present invention only constructs variable-length kmer combinations for existing bifurcated edges on the De Bruijn graph, and queries its occurrence times in the input sequence, and then selects the merging of bidirectional edges according to the times; selects the optimal merging of bifurcated edges , to minimize the possibility of erroneous merging of bidirectional edges; it can significantly increase the length of contigs, and can also minimize the quality loss of contigs.

Description

Bidirectional edge expansion for variable-length kmer queries based on multi-step bidirectional De Bruijn graphs method

【技术领域】【Technical field】

本发明涉及基因测序技术领域，特别是涉及一种基于多步双向De Bruijn图的变长kmer查询的双向边扩展方法。The invention relates to the technical field of gene sequencing, in particular to a bidirectional edge extension method of variable-length kmer query based on a multi-step bidirectional De Bruijn graph.

【背景技术】【Background technique】

基因序列分析以算法与数学模型为核心，研究内容涉及多个方面，主要包括：基因数据的存储与获取、序列比对、测序与拼接、基因预测、生物进化与系统发育分析、蛋白质结构预测、RNA结构预测、分子设计与药物设计、代谢网络分析、基因芯片、DNA计算等等。现在生物技术和计算机信息处理技术的紧密结合，加快了处理生物信息数据的速度，使得在尽量短的时间内对生物学意义做出尽量准确的诠释，加快了生物信息学的发展。目前，生物信息处理成为当前信息技术领域面临的巨大挑战之一。Gene sequence analysis takes algorithms and mathematical models as the core, and the research content involves many aspects, mainly including: storage and acquisition of genetic data, sequence comparison, sequencing and splicing, gene prediction, biological evolution and phylogenetic analysis, protein structure prediction, RNA structure prediction, molecular design and drug design, metabolic network analysis, gene chip, DNA computing, etc. Now the close combination of biotechnology and computer information processing technology has accelerated the speed of processing biological information data, making the interpretation of biological meaning as accurate as possible in the shortest possible time, and accelerating the development of bioinformatics. At present, biological information processing has become one of the great challenges faced by the current information technology field.

基因序列分析是对海量基因序列数据进行分析，从而提取和挖掘新的生物信息知识。其中，涉及到计算机技术中的机器学习、模式识别、书籍分析与挖掘、组合数学、随机模型、字符串、图形算法、分布式计算、高性能计算、并行计算等知识。其中，全基因组学的研究是当前生物信息学研究的核心之一。Gene sequence analysis is the analysis of massive gene sequence data to extract and mine new biological information knowledge. Among them, it involves machine learning, pattern recognition, book analysis and mining, combinatorial mathematics, random models, strings, graph algorithms, distributed computing, high-performance computing, parallel computing and other knowledge in computer technology. Among them, the study of whole genome is one of the cores of current bioinformatics research.

基因是人类最基本的遗传密码，代表着每个人的生命信息。基因序列上存在着遗传位点的细微差异，这些遗传密码的多态性与人类的健康、致病机理、医学治疗有着相当密切的关系。其中，DNA测序是研究全基因组序列需要完成的基本内容之一。Gene is the most basic genetic code of human beings, representing the life information of each person. There are subtle differences in genetic loci in the gene sequence, and these genetic code polymorphisms are closely related to human health, pathogenic mechanism, and medical treatment. Among them, DNA sequencing is one of the basic contents that need to be completed to study the whole genome sequence.

自1977年Sanger测序技术问世以来，经过三十多年的发展，DNA测序技术发展突飞猛进，以高通量、短序列为特点的第二代测序技术逐渐占领市场，以单分子测序为特点的第三代测序技术也逐渐出现，它们分别在测序特点上占有不同的优势。传统的基因测序方法的数据提取和分析软件经过近10年来的研究与开发，目前已经较为完善。但是，测序技术的发展，带来了测序数据的变化，使得当前存在的数据处理软件不能满足当前生物医学研究的需求。Since the advent of Sanger sequencing technology in 1977, after more than 30 years of development, DNA sequencing technology has developed by leaps and bounds. The second-generation sequencing technology characterized by high-throughput and short sequences has gradually occupied the market. Three generations of sequencing technologies have gradually emerged, and they each have different advantages in sequencing characteristics. The data extraction and analysis software of the traditional gene sequencing method has been relatively perfect after nearly 10 years of research and development. However, the development of sequencing technology has brought about changes in sequencing data, making the existing data processing software unable to meet the needs of current biomedical research.

新一代高通量测序方法的应用，可以在短时间内完成整个基因组数据的测定。高通量测序方法的日新月异也同时对获取的基因数据的分析处理方法提出了挑战。在这个目前炙手可热的研究领域，迫切需要开发能满足高通量测序技术的海量数据处理的生物信息学平台。面对个人基因组计划及未来的个性化医疗前景，高效低成本的测序技术成为必然的趋势。同时，简化高效的一站式完备的生物信息学数据分析平台等完备的测序解决方案，也是极为重要不可或缺的发展方向。The application of next-generation high-throughput sequencing methods can complete the determination of the entire genome data in a short time. The rapid development of high-throughput sequencing methods also poses challenges to the analysis and processing methods of the acquired genetic data. In this currently hot research field, there is an urgent need to develop a bioinformatics platform that can meet the massive data processing requirements of high-throughput sequencing technologies. In the face of the Personal Genome Project and the prospect of personalized medicine in the future, high-efficiency and low-cost sequencing technology has become an inevitable trend. At the same time, complete sequencing solutions such as simplified and efficient one-stop complete bioinformatics data analysis platforms are also extremely important and indispensable development directions.

然而新一代的高通量测序方法虽然测序通量高，但是却会引入测序误差，同时测序样本本身的测序错误，测序不均匀，SNP，以及基因组本身的重复序列Repeat，而这些测序误差、SNP、重复序列将会在基因组组装时构造的多步双向De Bruijn图中引入一些错误双向边或者顶点，而使得很多双向边无法继续扩展。同时由于固定kmer长度，使得测序序列的有效信息损失，无法解耦所有长度超过kmer长度的重复序列。以上这些情况使得De Bruijn图无法继续收缩，contig无法扩展，最终使得contig的长度和质量都很低。However, although the next-generation high-throughput sequencing method has a high sequencing throughput, it will introduce sequencing errors. At the same time, the sequencing errors of the sequencing sample itself, uneven sequencing, SNP, and the repeated sequence Repeat of the genome itself, and these sequencing errors, SNP , Repeated sequences will introduce some wrong bidirectional edges or vertices in the multi-step bidirectional De Bruijn graph constructed during genome assembly, making many bidirectional edges unable to continue to expand. At the same time, due to the fixed kmer length, the effective information of the sequencing sequence is lost, and it is impossible to decouple all repetitive sequences whose length exceeds the kmer length. The above conditions make it impossible for the De Bruijn graph to continue to shrink, and the contig cannot be expanded, which ultimately makes the length and quality of the contig very low.

新一代的高通量测序方法产生的短基因片段的组装导致了大量的测序错误，这大大加大了组装算法的计算量。大量的测序错误，使得组装错误率增加，严重影响了组装结果。能否有效地解决这个问题，成为评价一个组装算法优劣的关键。The assembly of short gene fragments generated by next-generation high-throughput sequencing methods leads to a large number of sequencing errors, which greatly increases the computational load of assembly algorithms. A large number of sequencing errors increase the assembly error rate and seriously affect the assembly results. Whether this problem can be effectively solved becomes the key to evaluating the pros and cons of an assembly algorithm.

目前组装算法的策略主要分为两类，一个就是基于Overlap-Layout-Consensus(OLC)的算法，另外一个就是基于De Bruijn图的算法。其中基于OLC组装算法开发的软件，如SSAKE、VCAKE、SHARCGS等，在基因长序列组装中更占有优势，但并不完全适用于新一代的短序列组装。与OLC组装算法不同，De Bruijn算法不再以read为单位组织数据，而是以k-mers为单位进行数据组装，其优点主要有以下几个方面：首先，以k-mers为单位进行序列组装，不影响节点的质量，减少了冗余数据量。其次，在图中重复区域只出现一次，便于识别，可以避免错误的组装，减小出错率。最后，采取将有重叠区域映射到同一条弧上的策略，从而简化了搜索路径。目前，很多短序列组装算法都使用这种框架，如Velvet、IDBA、SOAPdenovo，ABySS等。At present, the strategy of assembly algorithm is mainly divided into two categories, one is the algorithm based on Overlap-Layout-Consensus (OLC), and the other is the algorithm based on De Bruijn graph. Among them, the software developed based on the OLC assembly algorithm, such as SSAKE, VCAKE, SHARCGS, etc., has more advantages in the assembly of long gene sequences, but it is not fully suitable for the new generation of short sequence assembly. Different from the OLC assembly algorithm, the De Bruijn algorithm no longer organizes data in units of reads, but assembles data in units of k-mers. Its advantages mainly include the following aspects: First, sequence assembly is performed in units of k-mers , does not affect the quality of nodes, and reduces the amount of redundant data. Secondly, the repeated area only appears once in the figure, which is easy to identify, can avoid wrong assembly, and reduce the error rate. Finally, a strategy of mapping overlapping regions to the same arc is adopted, which simplifies the search path. At present, many short sequence assembly algorithms use this framework, such as Velvet, IDBA, SOAPdenovo, ABySS, etc.

其中Velvet有效地利用了De Bruijn图，实现了高效的短序列组装。Velvet以k-mer为基本单位构建De Bruijn图，利用图的结构，结合相应的序列特征，简化图的构造，最终找到一条最优路径完成组装过程。Velvet把焦点集中在错误的数据产生的三种结构上，即tip，bubble，以及erroneous connection。它依照长度原则和少数性原则，将长度小于2k的均去除；利用Tour Bus算法中的深度优先搜索策略合并bubble，最后利用覆盖度阈值法去除了erroneous connection。该方法也充分利用了paired-end双端信息，进一步解决repeat问题，优化了组装效果。Velvet充分利用图的结构性质，简化了数据冗余，速度较之前的算法有了很大的改进。虽然它没有在预处理阶段对序列进行纠错，但是其对错误的预防机制，很大程度上弥补了这方面的缺陷。这使得它更好的应用在大型基因组序列的组装中。Among them, Velvet effectively utilizes the De Bruijn graph to realize efficient short sequence assembly. Velvet constructs a De Bruijn graph with k-mer as the basic unit, uses the structure of the graph, combines the corresponding sequence features, simplifies the structure of the graph, and finally finds an optimal path to complete the assembly process. Velvet focuses on three structures generated by erroneous data, namely tip, bubble, and erroneous connection. According to the length principle and the minority principle, it removes all the lengths less than 2k; uses the depth-first search strategy in the Tour Bus algorithm to merge bubbles, and finally uses the coverage threshold method to remove erroneous connections. This method also makes full use of the paired-end double-ended information to further solve the repeat problem and optimize the assembly effect. Velvet makes full use of the structural properties of graphs, simplifies data redundancy, and has a great improvement in speed compared with previous algorithms. Although it does not correct the sequence in the preprocessing stage, its error prevention mechanism largely makes up for this defect. This makes it better for use in the assembly of large genome sequences.

IDBA也是基于De Bruijn图，实现了简便且高效的短序列组装。IDBA以k-mer为基本单位，与以往不同的是，它采用一个变化的k值域(Kmin-Kmax)，代替使用固定的k值来得到k-mers的长度。由于基因组装以k-mers为单位，通常会形成很多个重叠单元，这使得组装面临着错误位置组装、顶点缺失和覆盖度低的问题。正确的选择k值的大小成为组装的一个关键因素。一些错误的reads的产生，也导致产生了大量的branching。K值越小，branching问题越严重，k值越大，则出现的repeat区域则变少，这直接影响了组装的质量。IDBA采用不固定的k值进行组装，可以很好地解决branching问题，从而提高了组装的质量。另外IDBA通过删除低覆盖率的错误k-mers而使得IDBA的内存使用率明显降低，同时也提升了IDBA的处理速度。IDBA is also based on the De Bruijn graph, which realizes simple and efficient short sequence assembly. IDBA uses k-mer as the basic unit. Unlike the past, it uses a variable k value range (Kmin-Kmax) instead of using a fixed k value to obtain the length of k-mers. Since gene assembly is based on k-mers, many overlapping units are usually formed, which makes the assembly face the problems of wrong position assembly, missing vertices and low coverage. The correct selection of the value of k becomes a key factor in assembly. The generation of some wrong reads also led to a large number of branching. The smaller the K value, the more serious the branching problem, and the larger the k value, the fewer repeat areas appear, which directly affects the quality of assembly. IDBA uses an unfixed k value for assembly, which can solve the branching problem well, thereby improving the quality of assembly. In addition, IDBA reduces the memory usage of IDBA significantly by deleting the wrong k-mers with low coverage, and also improves the processing speed of IDBA.

SOAPdenovo能够高效高质量地完成数以亿计的reads的组装。SOAPdenovo继承了OLC算法和De Bruijn图算法的优点，使得其组装质量大为提高。SOAP通过预置k-mer阈值的方法，采取过滤、纠错的方式减少了错误序列的产生。同时，它借鉴了Velvet软件的方法成功处理了bubble，使得其平均覆盖度增加。另外，SOAPdenovo利用了双端信息进行重叠区域匹配，并合并read生成contig片段，生成基于contig的图结构，从而，SOAPdenovo大大简化了contig图的复杂性。SOAPdenovo can efficiently assemble hundreds of millions of reads with high quality. SOAPdenovo inherits the advantages of OLC algorithm and De Bruijn graph algorithm, which greatly improves its assembly quality. SOAP reduces the occurrence of error sequences by means of filtering and error correction by presetting the k-mer threshold. At the same time, it successfully handles bubbles by referring to the method of Velvet software, which increases its average coverage. In addition, SOAPdenovo uses double-ended information for overlapping region matching, and merges reads to generate contig fragments to generate a contig-based graph structure. Thus, SOAPdenovo greatly simplifies the complexity of the contig graph.

ABySS引进并行计算的思想，搭建了一个linux集群，在集群上建立了一个分布式的De Bruijn图结构，将数据分布式存储于每个节点上。其采用MPI通信机制完成节点之间的相互通信。从构建图、纠错处理到后面的定点融合，最后完成整个基因组序列的再现，其在运行时间和内存消耗方面占有很大的优势，并且其错误率极低，在性能方面特别是cluster中单机内存使用上均有很大的提升，正在得到越来越广泛的应用。ABySS introduces the idea of parallel computing, builds a linux cluster, and builds a distributed De Bruijn graph structure on the cluster to store data distributedly on each node. It uses the MPI communication mechanism to complete the mutual communication between nodes. From building graphs, error correction processing to subsequent fixed-point fusion, and finally completing the reproduction of the entire genome sequence, it has a great advantage in terms of running time and memory consumption, and its error rate is extremely low. In terms of performance, especially in a single machine in a cluster The use of memory has been greatly improved, and it is being used more and more widely.

现有的主要序列组装软件，例如SOAPdenovo,Velvet,ABySS,Ray等，都是基于给定长度的kmer进行De Bruijn构建，然后进行收缩。其优化的办法也只有去选择最佳的一个kmer长度。这种基于固定长度kmer的组装策略对于所有长度大约kmer长度的重复序列无法解耦。虽然IDBA能够对多种kmer长度进行迭代收缩De Bruijn图，但是它需要对每种kmer长度去将所有的序列进行分解存储和计算，该策略将会耗费巨大的内存和计算时间。The existing main sequence assembly software, such as SOAPdenovo, Velvet, ABySS, Ray, etc., are all based on the kmer of a given length for De Bruijn construction and then contraction. The only way to optimize it is to choose the best kmer length. This fixed-length kmer-based assembly strategy cannot be decoupled for all repeat sequences with a length of approximately kmer length. Although IDBA can iteratively shrink the De Bruijn graph for various kmer lengths, it needs to decompose, store and calculate all the sequences for each kmer length, and this strategy will consume huge memory and computing time.

鉴于此，克服该现有技术所存在的缺陷是本技术领域亟待解决的问题。In view of this, it is an urgent problem to be solved in this technical field to overcome the defects in the prior art.

【发明内容】【Content of invention】

本发明要解决的技术问题是提供一种基于多步双向De Bruijn图的变长kmer查询的双向边扩展方法。The technical problem to be solved by the present invention is to provide a bidirectional edge extension method of variable length kmer query based on multi-step bidirectional De Bruijn graph.

本发明采用如下技术方案：The present invention adopts following technical scheme:

一种基于多步双向De Bruijn图的变长kmer查询的双向边扩展方法，包括：A bidirectional edge extension method for variable-length kmer queries based on multi-step bidirectional De Bruijn graphs, including:

步骤A：读取测序数据源文件，构造多步双向De Bruijn图；Step A: Read the sequencing data source file and construct a multi-step bidirectional De Bruijn graph;

步骤B：在所述多步双向De Bruijn图中对交叉双向边进行统计；Step B: performing statistics on intersecting bidirectional edges in the multi-step bidirectional De Bruijn graph;

步骤C：在所述多步双向De Bruijn图中基于变长kmer查询的双向边扩展。Step C: Bidirectional edge expansion based on variable length kmer queries in the multi-step bidirectional De Bruijn graph.

进一步地，所述步骤B中，对所述多步双向De Bruijn图中的所有左右两边顶点分叉的双向边与所述双向边的分叉边合并的所有可能情况，组装为一个更长的kmer，并查询相应地更长的kmer的出现次数，以所述出现次数的大小为权重，进行分叉边走向的选取和合并操作。Further, in the step B, assemble into a longer kmer, and query the occurrence times of the correspondingly longer kmer, and use the size of the occurrence times as the weight to select and merge the direction of the bifurcation edge.

进一步地，所述步骤C中，在对每个分叉的双向边的所有可能的分叉转换为更长的kmer后，依据所述出现次数的高低，将出现次数最高的分叉路径作为最终的分叉合并方案，并将相应的分叉路径上的双向边在合并后一起删除。Further, in the step C, after converting all possible bifurcations of each forked bidirectional edge into a longer kmer, according to the number of occurrences, the fork path with the highest number of occurrences is used as the final The bidirectional edges on the corresponding forked paths are deleted together after merging.

进一步地，所述步骤B进一步包括：Further, the step B further includes:

步骤B1：遍历De Bruijn图中的每个双向边e；Step B1: traverse each bidirectional edge e in the De Bruijn graph;

步骤B2：统计所有的输入文件的测序序列的最大长度，并赋值给Lmax；Step B2: count the maximum length of the sequencing sequences of all input files, and assign it to Lmax;

步骤B3：若双向边的长度大于Lmax-kmer的长度，则返回执行步骤B1，否则执行步骤B4；Step B3: If the length of the two-way side is greater than the length of Lmax-kmer, return to step B1, otherwise execute step B4;

步骤B4：双向边e的起始顶点为u，起始方向为Direc_u，终止顶点为v，终止方向为Direc_v；Step B4: The starting vertex of the bidirectional edge e is u, the starting direction is Direc_u, the ending vertex is v, and the ending direction is Direc_v;

步骤B5：所有进入顶点u、终止方向为Direc_u的边取其双向边的最后k+1个字符放入字符串数组m，所有离开顶点v、起始方向为Direc_v的双向边取其最后一个字符放入字符串数组n；Step B5: Take the last k+1 characters of all bidirectional edges that enter vertex u and end in Direc_u and put them into the string array m, and take the last character of all bidirectional edges that leave vertex v and start in Direc_v put into string array n;

步骤B6：将双向边左右分别和m和n数组中的字符串的构造所有组合，构造变长kmer数组，其中变长kmer数组中的每个元素为变长kmer=m+e+n。Step B6: Combining the left and right sides of the bidirectional sides with the strings in the m and n arrays to construct a variable-length kmer array, wherein each element in the variable-length kmer array is variable-length kmer=m+e+n.

进一步地，所述步骤C进一步包括：Further, the step C further includes:

步骤C1：打开测序序列文件，逐个读取每条序列；Step C1: Open the sequence file and read each sequence one by one;

步骤C2：将变长kmer集合组各匹配读入的序列（包括对其反向序列的匹配），并对每个变长kmer计数；Step C2: read the sequence (including the match of its reverse sequence) of each matching sequence of the variable-length kmer set, and count each variable-length kmer;

步骤C3：遍历De Bruijn图中的每个双向边e；Step C3: traverse each bidirectional edge e in the De Bruijn graph;

步骤C4：统计所有的输入文件的测序序列的最大长度，并赋值给Lmax；Step C4: Count the maximum length of the sequencing sequences of all input files, and assign it to Lmax;

步骤C5：若双向边的长度大于Lmax-kmer的长度，则返回执行步骤C3，否则执行步骤C6；Step C5: If the length of the two-way side is greater than the length of Lmax-kmer, return to step C3, otherwise execute step C6;

步骤C6：双向边e的起始顶点为u，起始方向为Direc_u，终止顶点为v，终止方向为Direc_v；Step C6: The starting vertex of the bidirectional edge e is u, the starting direction is Direc_u, the ending vertex is v, and the ending direction is Direc_v;

步骤C7：所有进入顶点u、终止方向为Direc_u的边取其最后双向边e_i的最后k+1个字符放入字符串数组m，所有离开顶点v、起始方向为Direc_v的双向边e_j取其最后一个字符放入字符串数组n；Step C7: Take all the edges that enter the vertex u and end in the direction of Direc_u, take the last k+1 characters of the last two-way edge e_i and put them into the string array m, and take all the two-way edges e_j that leave the vertex v and start in the direction of Direc_v The last character is put into the string array n;

步骤C8：将双向边e左右分别和m和n数组中的字符串的构造所有组合，构造变长kmer数组，其中变长kmer数组中的每个元素为变长kmer=m+e+n。在计数的变长kmer中查询这些变长kmer，并取出计数值最大的一个，并将相应的e_i，e，e_j三条边进行合并，然后删除e_i，e，e_j。Step C8: Combine the left and right sides of the bidirectional side e with the strings in the m and n arrays to construct a variable-length kmer array, wherein each element in the variable-length kmer array is variable-length kmer=m+e+n. Query these variable-length kmers in the counted variable-length kmers, and take out the one with the largest count value, and merge the corresponding three sides e_i, e, and e_j, and then delete e_i, e, and e_j.

进一步地，所述步骤C2中将变长kmer集合组各匹配读入的序列包括对反向序列的匹配。Further, in the step C2, the sequence read into each match of the variable-length kmer set includes a match to the reverse sequence.

进一步地，所述步骤A进一步包括：Further, said step A further includes:

压缩存储步骤，具体为Compress storage steps, specifically

A11、读取一个序列s；A11. Read a sequence s;

A12、将序列s用滑动窗口切割为多个片段t；A12, using a sliding window to cut the sequence s into multiple fragments t;

A13、对每个片段t，使用核酸编码表进行编码，并表示为一个64位的整数a；A13, for each fragment t, use the nucleic acid coding table to encode, and express as a 64-bit integer a;

A14、将片段t进行反转，使用对称互补表将反转的片段互补处理，得到互补片段v，并再次使用步骤A13中的核酸编码表将互补片段进行编码，并表示为一个64位的整数b；A14. Reverse the fragment t, use the symmetrical complementation table to complement the reversed fragment to obtain the complementary fragment v, and use the nucleic acid coding table in step A13 to encode the complementary fragment again, and express it as a 64-bit integer b;

A15、取整数a和整数b的最大数，作为片段t和互补片段v的k分子的标志数；A15, take the maximum number of integer a and integer b, as the mark number of k molecules of fragment t and complementary fragment v;

A16、重复步骤A11-A15，直至所有序列完成；A16. Steps A11-A15 are repeated until all sequences are completed;

和De Bruijn图构造步骤，具体为and De Bruijn graph construction steps, specifically as

A21、读取一个序列s；A21. Read a sequence s;

A22、将序列s用滑动窗口切割为多个片段t，选取一片段t其标志数为cur、并标记其前、后的片段的标志数分别为pre、lat；A22, the sequence s is cut into a plurality of fragments t by a sliding window, a fragment t is selected as cur, and the marks of the preceding and following fragments are respectively pre and lat;

A23、若t的编码小于其互补片段编码，则交换pre,lat的值；A23, if the encoding of t is less than its complementary fragment encoding, then exchange the values of pre and lat;

A24、在cur的正向位置映射表的相应bit位置1来表示指向pre的边；A24, the corresponding bit position in the forward position mapping table of cur is 1 to indicate the edge pointing to pre;

A25、在cur的反向位置映射表的相应bit位置1来表示指向lat的边；A25. The corresponding bit position in the reverse position mapping table of cur is 1 to indicate the edge pointing to lat;

A26、重复步骤A22-A25，处理序列s的其他片段t，直至完成序列s的全部片段t，执行步骤A27；A26. Repeat steps A22-A25 to process other fragments t of the sequence s until all fragments t of the sequence s are completed, and then execute step A27;

A27、读取一个新的序列s，重复步骤A22-A26；直至处理完所有的序列，执行步骤A28；A27, read a new sequence s, repeat steps A22-A26; until all sequences are processed, execute step A28;

A28、完成双向多步De Bruijn图的构造。A28. Complete the construction of the bidirectional multi-step De Bruijn graph.

进一步地，所述步骤A12、A22中的滑动窗口为长度为k的滑动窗口，其中0<k<32且k为奇数；Further, the sliding window in the steps A12 and A22 is a sliding window with a length of k, where 0<k<32 and k is an odd number;

所述步骤A13中的核酸编码表为{A:00,C:01,G:10,T:11}；The nucleic acid coding table in the step A13 is {A:00, C:01, G:10, T:11};

所述步骤A14中的对称互补表为{A->T,C->G,G->C,T->A}；The symmetrical complementary table in the step A14 is {A->T, C->G, G->C, T->A};

所述步骤A14具体为，将片段t的字符串进行反转，使用对称互补表将反转的字符串中每个字符变为其互补字符，得到互补字符的字符串v，并再次使用步骤A13中的核酸编码表将字符串v进行编码，并表示为一个64位的整数b。The step A14 specifically is to reverse the character string of the segment t, use a symmetric complementary table to change each character in the reversed character string into its complementary character, obtain the character string v of the complementary character, and use step A13 again The nucleic acid encoding table in encodes the string v and expresses it as a 64-bit integer b.

进一步地，所述步骤A22中，若片段t没有之前或之后的片段，则对pre或者lat值赋为空或NULL；Further, in the step A22, if there is no segment before or after the segment t, the value of pre or lat is assigned empty or NULL;

步骤A24中正向位置映射表为{A:0，C:1，G:2，T:3}，位置查询字符为pre的最后一个字符；In step A24, the forward position mapping table is {A:0, C:1, G:2, T:3}, and the position query character is the last character of pre;

步骤A25中反向位置映射表为{A:4，C:5，G:6，T:7}，位置查询字符为lat的第一个字符的互补字符。In step A25, the reverse position mapping table is {A:4, C:5, G:6, T:7}, and the position query character is the complementary character of the first character of lat.

与现有技术相比，本发明的有益效果在于：Compared with prior art, the beneficial effect of the present invention is:

（1）只针对De Bruijn图上已经存在的分叉边构造变长的kmer组合，并在输入序列中查询其出现次数，然后根据其次数选择双向边的合并；而IDBA则是通过迭代所有的kmer长度，需要将每个kmer长度的所有可能的kmer都构造出来后，再收缩De Bruijn图，其方法将导致更大的内存消耗和计算时间消耗；(1) Construct a variable-length kmer combination only for the existing bifurcation edges on the De Bruijn graph, and query the number of occurrences in the input sequence, and then select the merge of two-way edges according to the number; while IDBA iterates all kmer length, it is necessary to construct all possible kmers of each kmer length, and then shrink the De Bruijn graph, which method will lead to greater memory consumption and calculation time consumption;

（2）选择最优的分叉边的合并，将双向边的错误合并的可能降到最低；(2) Select the optimal merging of bifurcated edges to minimize the possibility of erroneous merging of bidirectional edges;

（3）可以显著提高contigs的长度，也能够将contig的质量损失降到最小；相比于其他已有方法的提高contig长度就要牺牲contig质量，本发明有了一定程度上的控制和改善。(3) The length of contigs can be significantly increased, and the loss of contig quality can also be minimized; compared with other existing methods that increase the contig length and sacrifice contig quality, the present invention has a certain degree of control and improvement.

【附图说明】【Description of drawings】

图1是本发明实施例基于多步双向De Bruijn图的变长kmer查询的双向边扩展方法流程图；Fig. 1 is the flow chart of the bidirectional edge extension method of the variable length kmer query based on the multi-step bidirectional De Bruijn graph according to the embodiment of the present invention;

图2是图1步骤A中的压缩存储步骤流程图；Fig. 2 is the flow chart of the compression storage step in Fig. 1 step A;

图3是图1步骤A中De Bruijn图构造步骤流程图；Fig. 3 is a flowchart of the construction steps of De Bruijn diagram in Fig. 1 step A;

图4是步骤B在所述多步双向De Bruijn图中对交叉双向边进行统计的流程图；Fig. 4 is the flow chart that step B carries out statistics on crossing bidirectional edges in the multi-step bidirectional De Bruijn graph;

图5是步骤C在所述多步双向De Bruijn图中基于变长kmer查询的双向边扩展的流程图。FIG. 5 is a flow chart of step C of bidirectional edge expansion based on variable-length kmer queries in the multi-step bidirectional De Bruijn graph.

【具体实施方式】【detailed description】

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。In order to make the object, technical solution and advantages of the present invention more clear, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

此外，下面所描述的本发明各个实施方式中所涉及到的技术特征只要彼此之间未构成冲突就可以相互组合。In addition, the technical features involved in the various embodiments of the present invention described below may be combined with each other as long as they do not constitute a conflict with each other.

本发明的目的在于设计一种基于变长kmer查询的交叉双向边的扩展方法，它将使De Bruijn图继续收缩，contigs继续扩展，同时不会引入错误，造成contig质量的下降，准确度降低。The purpose of the present invention is to design an expansion method based on variable-length kmer query crossing two-way edges, which will make the De Bruijn graph continue to shrink and contigs continue to expand, and will not introduce errors at the same time, resulting in the decline of contig quality and accuracy.

本发明提供了一种基于多步双向De Bruijn图的变长kmer查询的双向边扩展方法，如图1所示，该方法包括：The present invention provides a kind of bidirectional edge expansion method based on the variable length kmer query of multi-step bidirectional De Bruijn graph, as shown in Figure 1, the method comprises:

步骤B：在多步双向De Bruijn图中对交叉双向边进行统计；Step B: counting the intersecting bidirectional edges in the multi-step bidirectional De Bruijn graph;

步骤C：在多步双向De Bruijn图中基于变长kmer查询的双向边扩展。Step C: Bidirectional edge expansion based on variable-length kmer queries in multi-step bidirectional De Bruijn graphs.

其中，步骤A可通过如下方式具体实现：Among them, step A can be specifically realized in the following ways:

压缩存储步骤，所需原始数据包括第一代，第二代和新一代的测序仪器产生出来的FASTA格式文件，将FASTA文件中的序列逐个切割成k分子并且用二进制编码进行压缩存储为一个64位的长整型k分子的标志数。In the compressed storage step, the required raw data includes the FASTA format files generated by the first-generation, second-generation and next-generation sequencing instruments. The sequences in the FASTA files are cut into k molecules one by one and compressed and stored in binary code as a 64 Long integer k number of bits in the numerator.

如图2所示，具体为As shown in Figure 2, specifically

A11、读取一个序列s；其中，序列s取自FASTA格式文件；A11, read a sequence s; wherein, the sequence s is taken from the FASTA format file;

A14、将片段t进行反转，使用对称互补表将反转的片段互补处理，得到互补片段，并再次使用步骤A13中的核酸编码表将互补片段进行编码，并表示为一个64位的整数b；A14, reverse the fragment t, use the symmetrical complementation table to complement the reversed fragment to obtain a complementary fragment, and use the nucleic acid coding table in step A13 to encode the complementary fragment again, and express it as a 64-bit integer b ;

A16、重复步骤A11-A15，直至所有序列完成。A16. Repeat steps A11-A15 until all sequences are completed.

通过上述步骤将两个传统的De Brujin图中的kmer，转化为一个64位的k分子的标志数来存储。该步骤可以将其他软件例如velvet、IDBA、SOAPdenovo里的两个压缩kmer存储为一个压缩k分子的标志数，并且在得到k分子的标志数后也可以反过来求出该k分子的长度为k的片段t和它的互补片段v。Through the above steps, the kmer in the two traditional De Brujin diagrams is converted into a 64-bit k-molecule sign number for storage. This step can store the two compressed kmers in other software such as velvet, IDBA, and SOAPdenovo as a compressed number of k molecules, and after obtaining the number of k molecules, the length of the k molecule can also be calculated as k The fragment t and its complementary fragment v.

步骤A12、A22中的滑动窗口为长度为k的滑动窗口，其中0<k<32且k为奇数；步骤A13中的核酸编码表为{A:00,C:01,G:10,T:11}；步骤A14中的对称互补表为{A->T,C->G,G->C,T->A}；步骤A14具体为，将片段t的字符串进行反转，使用对称互补表将反转的字符串中每个字符变为其互补字符，得到互补字符的字符串v，并再次使用步骤A13中的核酸编码表将字符串v进行编码，并表示为一个64位的整数b。The sliding window in steps A12, A22 is a sliding window whose length is k, wherein 0<k<32 and k is an odd number; the nucleic acid coding table in step A13 is {A:00, C:01, G:10, T: 11}; the symmetric complementary table in step A14 is {A->T, C->G, G->C, T->A}; step A14 is specifically to reverse the character string of segment t, using symmetric The complementary table changes each character in the reversed character string into its complementary character to obtain the character string v of the complementary character, and uses the nucleic acid encoding table in step A13 to encode the character string v again, and expresses it as a 64-bit integer b.

和De Bruijn图构造步骤，1、使用上述压缩存储步骤中计算k分子的标志数，2、将每个片段以及和它前后相邻的片段的扩展字符作为该k分子和其前后相邻的片段的对应的k分子的边并初始化k分子数据结构的边；3、将初始化后的k分子数据结构以k分子的标志数为关键值存入hash_map。and De Bruijn graph construction steps, 1, use the mark number of calculating k molecule in the above-mentioned compressed storage step, 2, use each fragment and the extended characters of its adjacent fragments as this k molecule and its adjacent fragments 3. Store the initialized k-molecule data structure into the hash_map with the key value of the k-molecule number as the key value.

如图3所示，具体为As shown in Figure 3, specifically

A21、读取一个序列s；A21. Read a sequence s;

A23、若t的编码小于其互补片段编码，则交换pre,lat的值;A23, if the coding of t is smaller than its complementary fragment coding, then exchange the values of pre and lat;

A24、在cur的正向位置映射表的相应bit位置1来表示指向pre的边;A24. The corresponding bit position in the forward position mapping table of cur is 1 to indicate the edge pointing to pre;

A25、在cur的反向位置映射表的相应bit位置1来表示指向lat的边;A25. The corresponding bit position in the reverse position mapping table of cur is 1 to indicate the edge pointing to lat;

A26、重复步骤A22-A25，处理序列s的其他片段t，直至完成序列s的全部片段t，执行步骤S27；A26. Steps A22-A25 are repeated to process other segments t of the sequence s until all segments t of the sequence s are completed, and step S27 is executed;

A27、读取一个新的序列s，重复步骤A22-A26；直至处理完所有的序列，执行步骤A28;A27, read a new sequence s, repeat steps A22-A26; until all sequences are processed, execute step A28;

步骤A22中，若片段t没有之前或之后的片段，则对pre或者lat值赋为空或NULL；步骤A24中正向位置映射表为{A:0，C:1，G:2，T:3}，位置查询字符为pre的最后一个字符；步骤A25中反向位置映射表为{A:4，C:5，G:6，T:7}，位置查询字符为lat的第一个字符的互补字符。In step A22, if the fragment t has no previous or subsequent fragments, the value of pre or lat is assigned empty or NULL; the forward position mapping table in step A24 is {A:0, C:1, G:2, T:3 }, the position query character is the last character of pre; the reverse position mapping table is {A:4, C:5, G:6, T:7} in step A25, and the position query character is the first character of lat Complementary characters.

步骤B中，对所述多步双向De Bruijn图中的所有左右两边顶点分叉的双向边与所述双向边的分叉边合并的所有可能情况，组装为一个更长的kmer，并查询相应地更长的kmer的出现次数，以所述出现次数的大小为权重，进行分叉边走向的选取和合并操作。In step B, assemble into a longer kmer for all possible cases where the bidirectional edges of all left and right vertices in the multi-step bidirectional De Bruijn graph are merged with the bifurcated edges of the bidirectional edges, and query the corresponding The number of occurrences of the longer kmer is used as the weight to select and merge the direction of the bifurcation edge.

本方法中设定多步双向De Bruijn图中的每个顶点的数据结构为：In this method, the data structure of each vertex in the multi-step bidirectional De Bruijn graph is set as:

如图4所示，步骤B进一步包括：As shown in Figure 4, step B further includes:

步骤B6：将双向边左右分别和m和n数组中的字符串的构造所有组合，构造变长kmer数组，其中变长kmer数组中的每个元素为变长kmer=m+e+n。当还有双向边需要处理时，返回执行步骤B1；当没有双向边需要处理时，结束。Step B6: Combining the left and right sides of the bidirectional sides with the strings in the m and n arrays to construct a variable-length kmer array, wherein each element in the variable-length kmer array is variable-length kmer=m+e+n. When there are still two-way edges to be processed, return to step B1; when there are no two-way edges to be processed, end.

步骤C中，在对每个分叉的双向边的所有可能的分叉转换为更长的kmer后，依据所述出现次数的高低，将出现次数最高的分叉路径作为最终的分叉合并方案，并将相应的分叉路径上的双向边在合并后一起删除。In step C, after converting all possible bifurcations of each bidirectional edge into a longer kmer, the fork path with the highest number of occurrences is used as the final fork merging scheme according to the number of occurrences , and the bidirectional edges on the corresponding bifurcated paths are deleted together after merging.

如图5所示，步骤C进一步包括：As shown in Figure 5, step C further includes:

步骤C8：将双向边左右分别和m和n数组中的字符串的构造所有组合，构造变长kmer数组，其中变长kmer数组中的每个元素为变长kmer=m+e+n。在计数的变长kmer中查询这些变长kmer，并取出计数值最大的一个，并将相应的e_i，e，e_j三条边进行合并，然后删除e_i，e，e_j。Step C8: Combine the left and right sides of the two-way sides with the strings in the m and n arrays to construct a variable-length kmer array, wherein each element in the variable-length kmer array is variable-length kmer=m+e+n. Query these variable-length kmers in the counted variable-length kmers, and take out the one with the largest count value, and merge the corresponding three sides e_i, e, and e_j, and then delete e_i, e, and e_j.

当还有双向边需要处理时，返回执行步骤C3；当还有序列需要处理时，返回执行步骤C1；当没有双向边需要处理，也没有序列需要处理时，结束。When there are still two-way edges to be processed, return to step C3; when there are still sequences to be processed, return to step C1; when there are no two-way edges to be processed and no sequences to be processed, end.

本领域普通技术人员可以理解实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成，该程序可以存储于一计算机可读存储介质中，存储介质可以包括：只读存储器（ROM，Read Only Memory）、随机存取存储器（RAM，Random AccessMemory）、磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the embodiments can be completed by instructing related hardware through a program. The program can be stored in a computer-readable storage medium, and the storage medium can include: only Read memory (ROM, Read Only Memory), random access memory (RAM, Random AccessMemory), disk or CD, etc.

以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention should be included in the protection of the present invention. within range.

Claims

1. a kind of two-way side extended method that elongated kmer based on the two-way De Bruijns of multistep is inquired about, it is characterised in that bag Include：

Step A：Read sequencing data source file, the two-way De Bruijns of construction multistep；

Step B：Count to intersecting two-way side in the two-way De Bruijns of the multistep；

Step C：The two-way side extension inquired about based on elongated kmer in the two-way De Bruijns of the multistep；

In step B, the two-way side of all the right and left summits bifurcated in De Bruijns two-way to the multistep and institute State it is two-way while bifurcated while the be possible to situation that merges, be assembled into a longer kmer, and inquire about correspondingly longer The occurrence number of kmer, the size with the occurrence number carry out selection and the union operation of bifurcated side trend as weight；It is described In step C, after all possible bifurcated on the two-way side to each bifurcated is converted to longer kmer, according to it is described go out occurrence Several height, will appear from number of times highest diverging paths as final bifurcated Merge Scenarios, and by corresponding diverging paths Two-way side delete together after merging.

2. the method for claim 1, it is characterised in that step B is further included：

Step B1：Each two-way side e in traversal De Bruijns；

Step B2：The maximum length of the sequencing sequence of all of input file is counted, and is assigned to Lmax；

Step B3：If the length on two-way side is more than the length of Lmax-kmer, execution step B1, otherwise execution step B4 are returned；

Step B4：The initial vertex of two-way side e is u, and prime direction is Direc_u, and termination summit is v, terminates direction and is Direc_v；

Step B5：All entrance summit u, terminate direction for Direc_u while take its it is two-way while last k+1 character be put into word Symbol string array m, it is all to leave the two-way side that vertex v, prime direction are Direc_v and take its last character and be put into character string number Group n；

Step B6：By two-way side or so respectively with m and n arrays in character string all combinations of construction, construct elongated kmer numbers Group, wherein each element in elongated kmer arrays is elongated kmer=m+e+n.

3. the method for claim 1, it is characterised in that step C is further included：

Step C1：Sequencing sequence file is opened, every sequence is read one by one；

Step C2：The sequence that elongated kmer collection charge-coupled each matching is read in, including the matching to its reverse sequence, and to each change Long kmer is counted；

Step C3：Each two-way side e in traversal De Bruijns；

Step C4：The maximum length of the sequencing sequence of all of input file is counted, and is assigned to Lmax；

Step C5：If the length on two-way side is more than the length of Lmax-kmer, execution step C3, otherwise execution step C6 are returned；

Step C6：The initial vertex of two-way side e is u, and prime direction is Direc_u, and termination summit is v, terminates direction and is Direc_v；

Step C7：All entrance summit u, terminate direction for Direc_u while take its it is last two-way while e_i last k+1 word Symbol is put into character string dimension m, all to leave the two-way side e_j that vertex v, prime direction are Direc_v and take its last character It is put into character string dimension n；

Step C8：By two-way side or so respectively with m and n arrays in character string all combinations of construction, construct elongated kmer numbers Group, wherein each element in elongated kmer arrays is elongated kmer=m+e+n, inquires about these changes in the elongated kmer for counting Long kmer, and maximum one of count value is taken out, and by corresponding e_i, tri- sides of e, e_j merge, and then delete e_i, E, e_j.

4. method as claimed in claim 3, it is characterised in that elongated kmer collection charge-coupled each matching is read in step C2 Sequence include the matching to reverse sequence.

5. the method for claim 1, it is characterised in that step A is further included：

Compression storing step, specially

A11, one sequence s of reading；

A12, sequence s sliding window is cut into into multiple fragments t；

A13, to each fragment t, encoded using nucleic acid coding table, and be expressed as the integer a of 64；

A14, fragment t is inverted, the fragment complementation for inverting is processed using symmetrical complement table, obtained complementary fragment v, and again Complementary fragment is encoded by the nucleic acid coding table in secondary use step A13, and is expressed as the integer b of 64；

The maximum number of A15, round numbers a and integer b, as the conventional number of the k molecules of fragment t and complementary fragment v；

A16, repeat step A11-A15, until all sequences are completed；

With De Bruijn constitution steps, specially

A21, one sequence s of reading；

A22, sequence s sliding window is cut into multiple fragments t, choose fragment t its conventional number be cur and before marking which, The conventional number of fragment afterwards is respectively pre, lat；

If the coding of A23, t is encoded less than its complementary fragment, pre, the value of lat are exchanged；

A24, cur forward position mapping table corresponding bit positions 1 come represent point to pre side；

A25, cur reverse position mapping table corresponding bit positions 1 come represent point to lat side；

A26, repeat step A22-A25, process other fragments t of sequence s, until completing whole fragments t of sequence s, perform step Rapid A27；

A27, one new sequence s of reading, repeat step A22-A26；Until having processed all of sequence, execution step A28；

A28, the construction for completing two-way multistep De Bruijn.

6. method as claimed in claim 5, it is characterised in that the sliding window in step A12, A22 is that length is k's Sliding window, wherein 0<k<32 and k is odd number；

Nucleic acid coding table in step A13 is { A:00,C:01,G:10,T:11}；

Symmetrical complement table in step A14 is { A->T,C->G,G->C,T->A}；

Step A14 specifically, the character string of fragment t is inverted, using in the character string that symmetrical complement table will be inverted Each character is changed into its complementary character, obtains character string v of complementary character, and reuses the nucleic acid coding table in step A13 Character string v is encoded, and is expressed as the integer b of 64.

7. method as claimed in claim 5, it is characterised in that in step A22, if before or after fragment t does not have Fragment, then be assigned to sky or NULL to pre or lat values；

In step A24, forward position mapping table is { A:0, C:1, G:2, T:3 }, the last character of position enquiring character for pre Symbol；

In step A25, reverse position mapping table is { A:4, C:5, G:6, T:7 }, first character of the position enquiring character for lat Complementary character.