CN115346605A - Gene data processing method and device under high-throughput sequencing background and related equipment - Google Patents
- Publication number
- CN115346605A CN202211019556.6A
- Authority
- CN
- China
- Prior art keywords
- gene data
- gene
- data block
- data
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present application discloses a gene data processing method and device in the context of high-throughput sequencing, and related equipment. The method includes: acquiring gene data blocks streamed in real time from a sequencing platform and inputting them into a first reserved area of a first memory; compressing each gene data block based on the data characteristics of the short sequences it contains to obtain a compressed gene data block, retaining the compressed gene data block in a second reserved area of the first memory, and releasing the gene data block from the first memory; and, once the number of compressed gene data blocks in the first memory reaches J, outputting the J compressed gene data blocks to a second memory and releasing them from the first memory. By processing the gene data blocks output by the sequencing platform in parallel and in a streaming manner, the application saves waiting time, improves the utilization of computing resources, and shortens the overall time from the data leaving the sequencing platform to bioinformatics analysis.
Description
Technical field
The present application relates to the field of sequencing technology, and more specifically to a gene data processing method and device in the context of high-throughput sequencing, and related equipment.
Background
As an important means of exploring the mysteries of life, gene sequencing technology has become an important branch of bioinformatics research and is widely used in species identification, genetic testing, disease diagnosis and other areas. Its rapid development has laid a solid foundation for precision medicine.
As gene sequencing technology has developed, sequencing costs have fallen steadily and the scale of sequencing services has grown accordingly. As that scale grows, the device types, service capacity and network structure of the networked equipment become increasingly complex. How to process more sequencing jobs in a timely manner with the existing computing and storage resources, and thereby improve the delivery efficiency of bioinformatics sequencing, is a technical problem worth exploring.
Summary of the invention
In view of this, the present application provides a gene data processing method and device in the context of high-throughput sequencing, and related equipment, which perform highly concurrent stream processing of gene sequencing data and thereby improve the delivery efficiency of bioinformatics sequencing.
To achieve the above purpose, a first aspect of the present application provides a gene data processing method in the context of high-throughput sequencing, comprising:
when the available space in a first reserved area of a first memory reaches a preset capacity, acquiring a gene data block of a preset size and inputting the gene data block into the first reserved area of the first memory, wherein the gene data block is a set of short sequences streamed in real time from a sequencing platform, the first reserved area is capable of holding N gene data blocks, and the preset size is not greater than the preset capacity;
compressing the gene data block based on the data characteristics of each short sequence in the gene data block to obtain a compressed gene data block, retaining the compressed gene data block in a second reserved area of the first memory, and releasing the gene data block from the first memory, wherein the second reserved area is capable of holding M compressed gene data blocks;
once the number of compressed gene data blocks in the first memory reaches J, outputting J compressed gene data blocks to a second memory and releasing the J compressed gene data blocks from the first memory;
wherein N, M and J are all preset natural numbers, and J is not greater than M.
Preferably, each short sequence includes metadata, base data and quality data, and the process of compressing the gene data block based on the data characteristics of each short sequence in the gene data block comprises:
aligning the base data in the gene data block against a preset reference genome, and compressing the base data in the gene data block according to the alignment result;
compressing the metadata in the gene data block using delta encoding or run-length encoding;
determining the complexity of the quality data by means of a preset adaptive model, and determining a context statistical model of a first target order based on that complexity;
compressing the quality data using the context statistical model of the first target order to obtain a first intermediate compression result;
compressing the first intermediate compression result using run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
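As an illustration only, the two metadata techniques named above (delta encoding and run-length encoding) can be sketched as follows. The application does not specify the identifier format or the exact encoding layout, so the prefix-based delta scheme, the function names and the sample header strings are all assumptions made for this example.

```python
from itertools import groupby

def delta_encode(headers):
    """Store each header as (length of prefix shared with the previous header, new suffix)."""
    encoded, previous = [], ""
    for header in headers:
        shared = 0
        while (shared < min(len(header), len(previous))
               and header[shared] == previous[shared]):
            shared += 1
        encoded.append((shared, header[shared:]))
        previous = header
    return encoded

def delta_decode(pairs):
    """Rebuild the headers from (shared-prefix length, suffix) pairs."""
    decoded, previous = [], ""
    for shared, suffix in pairs:
        previous = previous[:shared] + suffix
        decoded.append(previous)
    return decoded

def rle_encode(text):
    """Run-length encode a string as (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (character, run length) pairs back into the string."""
    return "".join(ch * count for ch, count in pairs)

# Hypothetical read identifiers; consecutive reads usually differ only near the end.
headers = ["@run1:lane1:100", "@run1:lane1:101", "@run1:lane2:7"]
packed = delta_encode(headers)
```

Consecutive read identifiers from one run tend to share long prefixes, which is why a delta scheme of this kind shrinks them well; runs of identical characters are what run-length encoding exploits.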
Preferably, the process of aligning the base data in the gene data block against a preset reference genome and compressing the base data in the gene data block according to the alignment result comprises:
dividing the base data in the gene data block into multiple subsequences;
aligning each subsequence against the preset reference genome using a hash-based alignment method to obtain matching information for each subsequence, the matching information including a mismatch value;
for a subsequence whose mismatch value is less than or equal to a preset threshold, compressing the subsequence based on its matching information;
for a subsequence whose mismatch value is greater than the preset threshold:
determining the complexity of the subsequence by means of a preset adaptive model, and determining a context statistical model of a second target order based on that complexity;
compressing the subsequence using the context statistical model of the second target order to obtain a second intermediate compression result;
compressing the second intermediate compression result using run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
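The threshold decision described above might look like the following minimal sketch. It assumes that the hash-based alignment has already proposed a candidate position `pos`, that the mismatch value is a simple count of differing bases, and that a poorly matching subsequence is stored verbatim as a stand-in for the context-model and entropy-coding path; the token layout and all names are hypothetical.

```python
def encode_subsequence(sub, reference, pos, max_mismatches=2):
    """Encode against the reference when the mismatch count is within the threshold."""
    window = reference[pos:pos + len(sub)]
    if len(window) == len(sub):
        # Record only the offsets where the read differs from the reference.
        diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, sub))
                 if ref_base != base]
        if len(diffs) <= max_mismatches:
            return ("ref", pos, len(sub), diffs)
    # Placeholder for the context statistical model plus entropy coder.
    return ("raw", sub)

def decode_subsequence(token, reference):
    """Invert encode_subsequence using the same reference."""
    if token[0] == "raw":
        return token[1]
    _, pos, length, diffs = token
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTTAGC"          # toy reference, not real genomic data
close = encode_subsequence("ACGTTCGT", reference, 0)
far = encode_subsequence("GGGGGGGG", reference, 0)
```

A well-matching subsequence is reduced to a position, a length and a short list of differences, which is where reference-based compression gains most of its ratio.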
Preferably, the process of aligning each subsequence against the preset reference genome using the hash-based alignment method to obtain the matching information of each subsequence comprises:
querying a preset hash table, using the hash value of each subsequence as the query condition, to obtain the matching information of each subsequence;
wherein the preset hash table records the hash value of each reference subsequence in the reference genome and the position of each reference subsequence in the reference genome, the reference subsequences being obtained by dividing the reference genome.
Preferably, the process of obtaining the reference subsequences by dividing the reference genome comprises:
dividing the reference genome, with a preset step size, into multiple overlapping reference subsequences of length K, where K is a preset length value.
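A minimal sketch of such a hash table is given below. It assumes Python's built-in `hash` as the hash function, exact-match lookup only, and a toy reference string; the application does not state which hash function is used or how mismatches are scored, so `build_index` and `lookup` are illustrative names rather than the claimed implementation.

```python
from collections import defaultdict

def build_index(reference, k, step):
    """Map the hash of each length-k reference subsequence to its start positions.

    Subsequences overlap whenever step < k.
    """
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        index[hash(reference[pos:pos + k])].append(pos)
    return index

def lookup(index, reference, subsequence):
    """Return reference positions whose subsequence equals the query exactly."""
    candidates = index.get(hash(subsequence), [])
    # Different strings can share a hash value, so verify each candidate.
    return [pos for pos in candidates
            if reference[pos:pos + len(subsequence)] == subsequence]

reference = "ACGTACGTTAGC"          # toy reference, not real genomic data
index = build_index(reference, k=4, step=2)
```

With `k=4` and `step=2` the indexed subsequences start at positions 0, 2, 4, 6 and 8, so `lookup(index, reference, "ACGT")` finds the two occurrences at positions 0 and 4. A smaller step indexes more positions at the cost of a larger table.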
Preferably, the process of aligning the base data in the gene data block against a preset reference genome comprises:
aligning the base data in the gene data block against the preset reference genome using a BWT algorithm.
Preferably, the method further comprises:
obtaining the processing speed of a first operation, the first operation comprising: acquiring a gene data block of a preset size and inputting it into the first reserved area of the first memory;
obtaining the processing speed of a second operation, the second operation comprising: compressing the gene data block based on the data characteristics of each short sequence in the gene data block;
determining, based on the processing speed of the first operation and the processing speed of the second operation, the computing resources allocated to the first operation and to the second operation.
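The application does not state how the two processing speeds are turned into an allocation. One plausible rule, shown here purely as an assumption, is to give each operation a share of workers proportional to the time it needs per block, so that the slower operation receives more workers and the two stages run at similar overall rates. The function name, the unit (blocks per second per worker) and the minimum of one worker per operation are all hypothetical.

```python
def allocate_workers(total_workers, acquire_speed, compress_speed):
    """Split workers between the acquire and compress operations.

    Speeds are measured in blocks per second per worker; the slower
    operation receives proportionally more workers.
    """
    if total_workers < 2:
        raise ValueError("need at least one worker per operation")
    if acquire_speed <= 0 or compress_speed <= 0:
        raise ValueError("speeds must be positive")
    acquire_cost = 1.0 / acquire_speed      # seconds per block for one worker
    compress_cost = 1.0 / compress_speed
    share = acquire_cost / (acquire_cost + compress_cost)
    acquire_workers = round(total_workers * share)
    # Keep at least one worker on each side.
    acquire_workers = min(max(acquire_workers, 1), total_workers - 1)
    return acquire_workers, total_workers - acquire_workers

# Compression is three times slower per worker, so it gets three times the workers.
split = allocate_workers(8, acquire_speed=30.0, compress_speed=10.0)
```

In a running system the speeds would be measured periodically and the split recomputed, so that neither reserved area stays full or empty for long.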
A second aspect of the present application provides a gene data processing device in the context of high-throughput sequencing, comprising:
a data block acquisition unit, configured to acquire a gene data block of a preset size when the available space in a first reserved area of a first memory reaches a preset capacity, and to input the gene data block into the first reserved area of the first memory, wherein the gene data block is a set of short sequences streamed in real time from a sequencing platform, the first reserved area is capable of holding N gene data blocks, and the preset size is not greater than the preset capacity;
a data block processing unit, configured to compress the gene data block based on the data characteristics of each short sequence in the gene data block to obtain a compressed gene data block, to retain the compressed gene data block in a second reserved area of the first memory, and to release the gene data block from the first memory, wherein the second reserved area is capable of holding M compressed gene data blocks;
a data block export unit, configured to output J compressed gene data blocks to a second memory once the number of compressed gene data blocks in the first memory reaches J, and to release the J compressed gene data blocks from the first memory;
wherein N, M and J are all preset natural numbers, and J is not greater than M.
A third aspect of the present application provides gene data processing equipment in the context of high-throughput sequencing, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement each step of the above gene data processing method in the context of high-throughput sequencing.
A fourth aspect of the present application provides a storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements each step of the above gene data processing method in the context of high-throughput sequencing.
As can be seen from the above technical solution, when the available space in the first reserved area of the first memory reaches the preset capacity, the present application acquires a gene data block of a preset size and inputs it into the first reserved area of the first memory. The gene data block is a set of short sequences streamed in real time from the sequencing platform, the first reserved area is capable of holding N gene data blocks, and the preset size is not greater than the preset capacity. Next, the gene data block is compressed based on the data characteristics of each short sequence in it to obtain a compressed gene data block, the compressed gene data block is retained in the second reserved area of the first memory, and the gene data block is released from the first memory. The second reserved area is capable of holding M compressed gene data blocks. It can be understood that the process of acquiring and placing gene data blocks and the process of compressing them run repeatedly, in a loop. That is, gene data blocks are delivered continuously from the sequencing platform to the first reserved area of the first memory until the first reserved area cannot hold any more; at the same time, each gene data block in the first reserved area is compressed, and the resulting compressed gene data block is retained in the second reserved area of the first memory. Once the number of compressed gene data blocks in the first memory reaches J, the J compressed gene data blocks are output to the second memory and released from the first memory. N, M and J are all preset natural numbers, and J is not greater than M. The present application thus processes the gene data blocks output by the sequencing platform in parallel and in a streaming manner, without waiting for all the gene data to arrive before processing begins. This saves waiting time, improves the utilization of computing resources, shortens the overall time from the data leaving the sequencing platform to bioinformatics analysis, and helps improve the delivery efficiency of bioinformatics sequencing.
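The loop described above can be sketched as a three-stage producer/consumer pipeline. The sketch below is a simplified model under several assumptions: `zlib` stands in for the domain-specific compression, a Python list stands in for the second memory, one thread runs per stage, and the default values of N, M and J are arbitrary. None of these choices comes from the application itself.

```python
import queue
import threading
import zlib

def run_pipeline(blocks, n=4, m=6, j=3):
    """Stream blocks through acquire -> compress -> export with bounded buffers."""
    assert j <= m, "J must not exceed M"
    done = object()                      # end-of-stream marker
    raw_area = queue.Queue(maxsize=n)    # first reserved area: at most N raw blocks
    packed = queue.Queue()               # compressed blocks awaiting export
    slots = threading.Semaphore(m)       # second reserved area: at most M compressed blocks
    second_memory = []                   # stands in for disk or other slower storage

    def acquire():
        for block in blocks:
            raw_area.put(block)          # waits while the first reserved area is full
        raw_area.put(done)

    def compress():
        # get() frees the raw slot slightly earlier than in the text,
        # which releases the raw block after compression.
        while (block := raw_area.get()) is not done:
            slots.acquire()              # waits while the second reserved area is full
            packed.put(zlib.compress(block))
        packed.put(done)

    def flush(batch):
        second_memory.append(list(batch))
        for _ in batch:
            slots.release()              # free the exported blocks' slots
        batch.clear()

    def export():
        batch = []
        while (item := packed.get()) is not done:
            batch.append(item)
            if len(batch) == j:          # J compressed blocks are ready: write them out
                flush(batch)
        if batch:                        # write any remainder at end of stream
            flush(batch)

    threads = [threading.Thread(target=stage) for stage in (acquire, compress, export)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return second_memory

blocks = [bytes([65 + i]) * 1000 for i in range(8)]   # eight dummy "gene data blocks"
batches = run_pipeline(blocks)
```

Because J is not greater than M, the export stage always collects a full batch before the second reserved area can fill up permanently, so the compress stage is never blocked forever.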
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the gene data processing method in the context of high-throughput sequencing disclosed in an embodiment of the present application;
Fig. 2 illustrates the streaming, parallel processing of the steps disclosed in an embodiment of the present application;
Fig. 3 is a schematic diagram of the gene data processing method in the context of high-throughput sequencing disclosed in an embodiment of the present application;
Fig. 4 is a schematic diagram of the gene data processing device in the context of high-throughput sequencing disclosed in an embodiment of the present application;
Fig. 5 is a schematic diagram of the gene data processing device in the context of high-throughput sequencing disclosed in an embodiment of the present application;
Fig. 6 is a schematic diagram of the gene data processing equipment in the context of high-throughput sequencing disclosed in an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application. The relevant terms are explained first:
High-throughput sequencing: also known as "next-generation" sequencing technology. It is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel and by generally short read lengths. Sequencing means determining the base sequence of a specific DNA fragment, that is, the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). The emergence of fast DNA sequencing methods has greatly advanced research and development in biology and medicine.
Base Calling: identifying the base types (the DNA sequence) from the raw images by means of computer vision, writing the results to a CAL file, and finally generating a sequencing report and FASTQ data.
Sequence alignment: arranging two or more sequences together to show their similarities. Gaps (usually written as a dash "-") can be inserted into the sequences. Corresponding identical or similar symbols (A, T (or U), C and G in nucleic acids; the one-letter codes of amino acid residues in proteins) are placed in the same column. Alignment is commonly used to study sequences that evolved from a common ancestor, especially biological sequences such as protein or DNA sequences. In an alignment, mismatches correspond to mutations, and gaps correspond to insertions or deletions.
Reference genome: the genome sequence of a species, that is, a complete genome sequence that has already been assembled. It is commonly used as the standard reference for the species, for example the human genome reference sequence (FASTA format).
K-mer: a subsequence of length K within a short sequence.
Networking technology: the technology for building networks, divided into Ethernet networking and ATM local area network networking. Ethernet networking is flexible and simple, can use a variety of physical media and supports different topologies. It is the most widely used type of network in China and abroad and has become the mainstream of network technology. By transmission rate, Ethernet is divided into 10 Mb/s, 100 Mb/s and 1000 Mb/s.
The inventors of the present application found that the existing process from sequencer output to bioinformatics analysis includes:
1) The sequencer completes the biochemistry and imaging steps to obtain base image data. Base Calling converts the base image data into a CAL file containing the DNA sequences, and data processing converts the CAL file into a FASTQ text file containing the sequence information. Data quality control is performed at the same time, and compression software compresses the FASTQ file, which is then stored and backed up to disk.
2) The compressed FASTQ file is read from disk, and sequence alignment software (such as BWA) compares the sequences with the reference genome, finds the position of each sequence on the reference genome, and then arranges the sequences in order.
3) Variant detection, interpretation and other related work are then carried out according to the needs of the research. The purpose of variant detection is to accurately identify the set of variants in the genome of each sample (for example, a human), that is, the DNA sequences that differ from person to person.
From the above, the current sequencing workflow has the following problems:
1) After Base Calling is completed, the CAL file must be transferred to a GPU server for data conversion. After conversion, the resulting FASTQ file must be transferred to a data quality control server for quality control. The quality-controlled data is then transferred to a secondary storage server for backup storage and, at the same time, to a computing server for subsequent computation and analysis. The mixed networking of these various types of specialized equipment is complex and extremely costly.
2) When a CAL file is converted into FASTQ, the entire CAL file must be processed and a complete FASTQ file written to disk before the subsequent data transmission, quality control, backup and analysis can begin. However, a whole CAL file contains a large amount of data, and the conversion takes a long time.
3) After the CAL file has been converted, the FASTQ file must be written to disk, and the compression software then reads it back from disk to compress it; for sequence alignment, the alignment software must likewise first read the FASTQ file from disk. Every data input and output requires disk I/O, which is frequent and slow and makes the workflow heavily dependent on storage performance.
Sequencing files of gene data range from a few gigabytes to tens or even hundreds of gigabytes. If the subsequent data compression and bioinformatics analysis wait until all the data has left the sequencer, computing resources sit idle and are not used effectively, which in turn reduces the efficiency of bioinformatics delivery. The present application works in units of data blocks and adopts a pipelined approach. It also exploits the fact that gene data compression and sequence alignment both require comparison against the reference genome, and merges data export and data compression, previously independent steps, into one. This improves overall efficiency and makes full use of computing resources.
The gene data processing method in the context of high-throughput sequencing provided by the embodiments of the present application is introduced below. Referring to Fig. 1, the method may include the following steps:
Step S101: acquire a gene data block of a preset size and input it into the first reserved area of the first memory.
The gene data block is a set of short sequences streamed in real time from the sequencing platform, and the first reserved area is capable of holding N gene data blocks. N is a preset natural number.
Exemplarily, the first memory may be main memory, and in particular DDR memory. Because gene data is large and the memory of a computing device is limited, the memory can usually be expanded with DDR memory. DDR memory transfers data once on the rising edge and once on the falling edge of the clock signal, which gives it twice the data transfer rate of conventional SDRAM. Moreover, because only the falling edge is additionally used, energy consumption does not increase. Addressing and control signals are handled as in conventional SDRAM and are transmitted only on the rising edge of the clock.
A short-sequence set contains multiple short sequences (reads). The size of a gene data block can be defined by the number of short sequences it contains; for example, based on the memory capacity of the computing device, the preset size of a gene data block can be set to between 40,000 and 500,000 short sequences.
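Cutting an incoming stream of reads into blocks of a preset size can be sketched as below. The generator name `iter_blocks` is hypothetical, and the block size of 4 in the example is chosen only to keep it readable; the text above suggests tens to hundreds of thousands of reads per block.

```python
from itertools import islice

def iter_blocks(reads, block_size):
    """Yield successive lists of up to block_size reads from any iterable."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    iterator = iter(reads)
    while True:
        block = list(islice(iterator, block_size))
        if not block:                # the stream is exhausted
            return
        yield block

# Ten dummy reads cut into blocks of four; the last block is shorter.
example_blocks = list(iter_blocks(range(10), 4))
```

Because the generator pulls reads lazily, a block can be handed to the compression step as soon as it is full, without waiting for the rest of the stream.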
Step S102: compress the gene data block based on the data characteristics of each short sequence in the block, obtaining a compressed gene data block.
Each short sequence in the gene data block may consist of the following four lines:
The first line is the metadata. It begins with @, followed by a sequence identifier and an optional description.
The second line is the base data, essentially a string over the five letters A, C, G, T and N. This is the DNA sequence of actual interest; N denotes a base that could not be identified during sequencing.
The third line begins with a + character, sometimes followed by the same sequence identifier as in the first line.
The fourth line is the quality data: a string of ASCII characters encoding the quality scores, which describe how reliable each base call is.
Note that the second and fourth lines must have the same length, with their elements in one-to-one correspondence. The structure of a short sequence suggests a compression strategy for each part: the metadata in the first line is a general-purpose description and can be compressed with a generic method; the base data in the second line carries biological structure, which can be exploited by aligning against a reference genome; and the quality data in the fourth line is highly correlated internally and can be compressed with run-length encoding or similar methods.
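The four-line record layout described above can be sketched in Python as follows. This is only an illustrative parser, not the application's implementation; the names `Read` and `parse_reads` are hypothetical, and rejecting any base letter outside A/C/G/T/N is an assumption drawn from the description of the second line.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    metadata: str  # line 1, without the leading '@'
    bases: str     # line 2, letters from A/C/G/T/N
    quality: str   # line 4, one ASCII character per base


def parse_reads(lines: Iterable[str]) -> Iterator[Read]:
    """Yield one Read per four-line record, checking the layout described above."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            return  # clean end of input
        if len(record) != 4:
            raise ValueError("truncated record: expected 4 lines")
        header, bases, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record markers")
        if set(bases) - set("ACGTN"):
            raise ValueError("unexpected base character")
        if len(bases) != len(quality):
            raise ValueError("base and quality lines differ in length")
        yield Read(header[1:], bases, quality)
```

Splitting each read into these three fields is what allows the metadata, base data and quality data to be routed to different compressors.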
Step S103: keep the compressed gene data block in a second reserved area of the first memory, and release the gene data block from the first memory.
The second reserved area can hold M compressed gene data blocks, where M is a preset natural number. Note that a compressed gene data block normally occupies less space than the corresponding gene data block. However, the rate at which gene data blocks are pulled in may differ from the rate at which compressed blocks are produced, so the relationship between the capacity N of the first reserved area and the capacity M of the second reserved area should be set according to these two rates to avoid wasting space.
Step S104: once the number of compressed gene data blocks in the first memory reaches J, output the J compressed gene data blocks to a second memory and release them from the first memory.
J is a preset natural number not greater than M. By way of example, the second memory may be a hard disk in the computing device: every J compressed gene data blocks are written out as one file slice, and the complete sequencing file is finally restored from the file slices.
It will be understood, with reference to FIG. 2, that steps S101 to S103 are executed in a loop. Step S101 continuously feeds gene data blocks from the sequencing platform into the first reserved area of the first memory until that area cannot hold any more. Step S102 compresses each gene data block in the first reserved area, and step S103 keeps the resulting compressed block in the second reserved area of the first memory. The space freed in the first reserved area by step S103 and the space freed in the second reserved area by step S104 are then used to receive new gene data blocks and new compressed gene data blocks, respectively. In this way, steps S101 to S103 compress the sequencing data of each gene data block in a streaming fashion, and steps S101 to S103 for different gene data blocks can run highly in parallel.
In this application, when the available space in the first reserved area of the first memory reaches a preset capacity, a gene data block of a preset size is obtained and input into the first reserved area. The gene data block is a set of short sequences streamed in real time from the sequencing platform, the first reserved area can hold N gene data blocks, and the preset size is not greater than the preset capacity. The gene data block is then compressed based on the data characteristics of each short sequence in it, the resulting compressed gene data block is kept in the second reserved area of the first memory, and the gene data block is released from the first memory. The second reserved area can hold M compressed gene data blocks. Acquiring and placing gene data blocks and compressing them are carried out in a continuous cycle: gene data blocks are fed from the sequencing platform into the first reserved area until it cannot hold any more, while each gene data block in the first reserved area is compressed and the resulting compressed block is kept in the second reserved area. Once the number of compressed gene data blocks in the first memory reaches J, the J compressed blocks are output to the second memory and released from the first memory. N, M and J are all preset natural numbers, and J is not greater than M. The application thus processes the gene data blocks output by the sequencing platform in a streaming, parallel manner without waiting for all gene data to be collected. This saves waiting time, improves the utilization of computing resources, shortens the overall time from data leaving the sequencer to bioinformatic analysis, and helps improve the delivery efficiency of sequencing-based bioinformatics.
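The interplay of the two reserved areas and the parameters N, M and J can be sketched as a small single-threaded simulation. This is a minimal sketch under stated assumptions, not the claimed implementation: `StreamingCompressor` and `run` are hypothetical names, `zlib` stands in for the field-specific compressors of step S102, a Python list stands in for the second memory, and writing leftover blocks as a final shorter slice is an assumption the text does not address.

```python
import zlib
from collections import deque
from typing import Deque, Iterable, List


class StreamingCompressor:
    def __init__(self, n: int, m: int, j: int) -> None:
        if not (n >= 1 and 1 <= j <= m):
            raise ValueError("require N >= 1 and 1 <= J <= M")
        self.n, self.m, self.j = n, m, j
        self.raw: Deque[bytes] = deque()         # first reserved area (up to N blocks)
        self.compressed: Deque[bytes] = deque()  # second reserved area (up to M blocks)
        self.slices: List[List[bytes]] = []      # stand-in for the second memory

    def can_accept(self) -> bool:
        return len(self.raw) < self.n

    def ingest(self, block: bytes) -> None:
        """Step S101: place a block in the first reserved area."""
        if not self.can_accept():
            raise BufferError("first reserved area is full")
        self.raw.append(block)

    def compress_one(self) -> bool:
        """Steps S102 and S103: compress one block and release the original."""
        if not self.raw or len(self.compressed) >= self.m:
            return False
        block = self.raw.popleft()
        self.compressed.append(zlib.compress(block))
        if len(self.compressed) >= self.j:
            self.flush(self.j)  # step S104
        return True

    def flush(self, count: int) -> None:
        """Move `count` compressed blocks out as one file slice."""
        if count:
            self.slices.append([self.compressed.popleft() for _ in range(count)])


def run(blocks: Iterable[bytes], n: int, m: int, j: int) -> List[List[bytes]]:
    pipe = StreamingCompressor(n, m, j)
    for block in blocks:
        while not pipe.can_accept():
            pipe.compress_one()
        pipe.ingest(block)
    while pipe.raw:
        pipe.compress_one()
    pipe.flush(len(pipe.compressed))  # assumption: leftovers form a final slice
    return pipe.slices
```

In a real deployment the ingest and compression loops would run concurrently; the sequential loop here only makes the buffer bookkeeping easy to follow.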
In some embodiments of this application, the process in step S102 of compressing the gene data block based on the data characteristics of each short sequence may include:
S1: compress the metadata in the gene data block using delta encoding or run-length encoding.
The metadata in gene data blocks from different sequencing platforms may be in different formats. Typically, the metadata begins with the symbol "@", followed by a sequence identifier and other optional information such as the instrument name, the flow cell lane, the tile number within the lane, the x coordinate of the cluster within the tile, the y coordinate of the cluster within the tile, the sample number in a multiplexed run, the paired-end flag, and the sequence length.
The metadata in a gene data block can first be parsed into separate fields using delimiters (punctuation marks), which generally include the period, space, underscore, hyphen, slash, equals sign and colon. A recognition algorithm then classifies the fields. Fixed information such as the instrument name, sequence length and paired-end flag is not encoded or compressed; the other fields are compressed with different encodings, including delta encoding, run-length encoding and context-mixing compression.
For fields that contain only consecutive integers, delta encoding records the difference between the field in one record and the same field in the next. For fields that contain both letters and digits, a modified run-length encoding records the characters repeated between successive fields, and delta encoding records the differences between them.
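The field splitting and delta encoding described above can be sketched as follows. The op format (`"="` for an unchanged field, `"d"` for an integer difference, `"s"` for a literal) and the function names are hypothetical choices made for illustration; the modified run-length encoding and context-mixing steps are not shown.

```python
import re
from typing import List, Tuple

# Delimiters named in the text: period, space, underscore, hyphen, slash, equals, colon.
_DELIMS = re.compile(r"([. _\-/=:])")
Op = Tuple[str, object]


def tokenize(header: str) -> List[str]:
    """Split a header into fields, keeping the delimiters as tokens."""
    return _DELIMS.split(header)


def _is_plain_int(tok: str) -> bool:
    # Only integers without leading zeros can be restored exactly from a delta.
    return tok.isascii() and tok.isdigit() and str(int(tok)) == tok


def encode(prev: str, cur: str) -> List[Op]:
    """Describe `cur` relative to the previous header `prev`."""
    p, c = tokenize(prev), tokenize(cur)
    ops: List[Op] = []
    for i, tok in enumerate(c):
        old = p[i] if i < len(p) else None
        if tok == old:
            ops.append(("=", None))                  # unchanged field
        elif old is not None and _is_plain_int(tok) and _is_plain_int(old):
            ops.append(("d", int(tok) - int(old)))   # integer field: store the delta
        else:
            ops.append(("s", tok))                   # anything else: store literally
    return ops


def decode(prev: str, ops: List[Op]) -> str:
    """Rebuild the current header from the previous one and the ops."""
    p = tokenize(prev)
    parts = []
    for i, (op, val) in enumerate(ops):
        if op == "=":
            parts.append(p[i])
        elif op == "d":
            parts.append(str(int(p[i]) + val))
        else:
            parts.append(val)
    return "".join(parts)
```

For two consecutive headers that differ only in their cluster coordinates, the encoded form is dominated by `"="` markers plus two small integers, which a downstream entropy coder can represent compactly.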
S2: compress the quality data in the gene data block using a context statistical model combined with run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
Specifically, this may include:
S21: determine the complexity of the quality data in the gene data block with a preset adaptive model, and determine a context statistical model of a first target order based on that complexity.
The complexity of the quality scores may correspond, for example, to 4, 8 or 40 distinct quality values. Depending on the number of quality values, the first target order may be 1, 2, 3, 4, 5 or 6. Once the preset adaptive model has measured the complexity of the quality data, the order of the context statistical model best suited to that data is determined, which yields better compression.
S22: compress the quality data with the context statistical model of the first target order to obtain a first intermediate compression result.
The first intermediate compression result is only an intermediate output; it is not written as a file to storage such as a disk.
S23: compress the first intermediate compression result using run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
ANS (Asymmetric Numeral Systems) + FSE (Finite State Entropy) is a lossless compression technique. Compressing the first intermediate result with any of these lossless techniques can further improve the compression ratio.
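To illustrate why the order of the context model matters, the sketch below estimates how many bits an ideal entropy coder would spend on a quality string under an adaptive order-k model. It is an assumption-laden illustration: the text does not specify the preset adaptive model, so `choose_order` simply picks the candidate order with the smallest estimated code length, and the Laplace-smoothed counts are one common modelling choice among several. No actual arithmetic or ANS coder is implemented.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable


def adaptive_code_length(symbols: str, order: int, alphabet_size: int) -> float:
    """Ideal code length in bits under an adaptive order-`order` context model.

    `alphabet_size` must be at least the number of distinct symbols.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    totals: Dict[str, int] = defaultdict(int)
    bits = 0.0
    for i, sym in enumerate(symbols):
        ctx = symbols[max(0, i - order):i]  # the preceding `order` symbols
        # Laplace-smoothed probability of `sym` in this context.
        p = (counts[ctx][sym] + 1) / (totals[ctx] + alphabet_size)
        bits -= math.log2(p)
        counts[ctx][sym] += 1               # update the model after coding
        totals[ctx] += 1
    return bits


def choose_order(symbols: str, candidate_orders: Iterable[int]) -> int:
    """Hypothetical stand-in for the preset adaptive model: pick the cheapest order."""
    alphabet = max(1, len(set(symbols)))
    return min(candidate_orders,
               key=lambda k: adaptive_code_length(symbols, k, alphabet))
```

On a strictly alternating string such as `"IH" * 50`, an order-1 model predicts each symbol from its predecessor and needs far fewer bits than an order-0 model, while raising the order further only adds context start-up cost.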
S3: align the base data in the gene data block against a preset reference genome, and compress the base data according to the alignment result.
Note that the reference genome here is the genome sequence of the species being sequenced, that is, a complete genome sequence that has already been assembled and is commonly used as the standard reference for that species.
By way of example, when aligning the base data in the gene data block, the bases can be compared with the reference genome to find the position of each base sequence on the reference genome, which gives the alignment result.
Base data that matches can be represented by its position in the reference genome: the position is recorded and the original bases need not be kept. The recorded match information is sorted and then delta-encoded, which achieves the compression. Base data that does not match can be compressed with a generic compression method.
Accordingly, in some embodiments of this application, the process in S3 of aligning the base data in the gene data block against the preset reference genome and compressing it according to the alignment result may include:
S31: divide the base data in the gene data block into multiple subsequences.
By way of example, the base data can be divided into subsequences with a step of 20.
S32: align each subsequence against the preset reference genome using a hash-based alignment method to obtain match information for each subsequence.
The match information includes a mismatch value (Mismatch).
S33: for a subsequence whose mismatch value is less than or equal to a preset threshold, compress the subsequence based on its match information.
For example, information such as the aligned position on the reference genome and the Mismatch value is recorded, and the original subsequence need not be kept. The recorded match information is sorted and then delta-encoded.
S34: for a subsequence whose mismatch value is greater than the preset threshold, compress the subsequence using a context-mixing compression algorithm combined with delta encoding and run-length encoding.
Specifically, this may include:
S341: determine the complexity of the subsequence with a preset adaptive model, and determine a context statistical model of a second target order based on that complexity.
The second target order may be 0, 1 or 2. Once the preset adaptive model has measured the complexity of the subsequence, the order of the context statistical model best suited to it is determined, which yields better compression.
Specifically, a subsequence (base sequence) contains five character types: A, T, C, G and N. If the GC content is greater than 80% or less than 20%, an order-0 context statistical model is selected for compression; if the GC content is between 60% and 80% or between 20% and 40%, an order-1 model is selected; and if the GC content is between 40% and 60%, an order-2 model is selected.
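The GC-content rule above translates directly into a small selection function. Two details are assumptions, because the text leaves them open: N bases are excluded from the GC calculation, and the boundary values 40% and 60% are assigned to the order-2 band while 20% and 80% fall into the order-1 band. The function names are hypothetical.

```python
def gc_content(subseq: str) -> float:
    """Fraction of G and C among the called bases (N is ignored; an assumption)."""
    called = [b for b in subseq.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return sum(b in "GC" for b in called) / len(called)


def second_target_order(subseq: str) -> int:
    """Map GC content to the order of the context statistical model."""
    gc = gc_content(subseq)
    if gc > 0.8 or gc < 0.2:
        return 0  # strongly skewed composition
    if 0.4 <= gc <= 0.6:
        return 2  # balanced composition
    return 1      # 20% to 40%, or 60% to 80%
```

For example, a subsequence with 70% GC selects the order-1 model, while a balanced one such as `"ACGT"` selects order 2.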
S342: compress the subsequence with the context statistical model of the second target order to obtain a second intermediate compression result.
The second intermediate compression result is only an intermediate output; it is not written as a file to storage such as a disk.
S343: compress the second intermediate compression result using run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
By way of example, run-length encoding is first used as preprocessing to remove redundancy. An adaptive statistical model then measures the complexity of the base data and quality data in the short sequences, the data is modelled with a dynamic context-mixing algorithm whose order depends on that complexity, and finally a combination of arithmetic coding and Huffman coding is used for compression.
In some embodiments of this application, the process in S32 of aligning each subsequence against the preset reference genome using the hash-based alignment method to obtain match information for each subsequence may include:
Using the hash value of each subsequence as the query key, look it up in a preset hash table to obtain the match information for that subsequence.
The preset hash table records the hash value of each reference subsequence in the reference genome together with the position of that reference subsequence in the reference genome. The reference subsequences are obtained by dividing the reference genome.
In some embodiments of this application, the process of dividing the reference genome into reference subsequences may include:
Dividing the reference genome, with a preset step, into multiple overlapping reference subsequences of length K, where K is a preset length. By way of example, K may be 20.
Specifically, the sequence of the reference genome R is first split into overlapping subsequences of length k (k-mers). A hash table HR is then built: a hash function maps each k-mer to its position of origin in the reference genome, and the mapping is stored in HR. Next, the short sequences in the gene data block are likewise split into k-mers. Finally, matching is performed by querying the hash table built from the reference k-mers, and the mismatch value Mismatch is computed at the same time.
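The hash-table lookup described above can be sketched with a Python dictionary, whose built-in string hashing stands in for the hash function. This is a minimal sketch: `build_index` and `candidate_starts` are hypothetical names, the read is cut into non-overlapping seeds of length k, and the vote counting used to pick a start position is an illustrative choice rather than something the text prescribes.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_index(reference: str, k: int = 20, step: int = 1) -> Dict[str, List[int]]:
    """Map every k-mer of the reference (taken every `step` bases) to its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_starts(read: str, index: Dict[str, List[int]], k: int = 20) -> Counter:
    """Count, for each implied read start on the reference, how many seeds agree."""
    votes: Counter = Counter()
    for off in range(0, len(read) - k + 1, k):   # non-overlapping seeds
        for pos in index.get(read[off:off + k], ()):
            votes[pos - off] += 1                # seed at `off` implies this start
    return votes
```

With a toy reference and k = 4, a read copied from reference positions 4 to 15 yields three seeds that all vote for start position 4. A production index would hash a compact 2-bit encoding of each k-mer instead of storing strings.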
If the Mismatch of a short sequence is less than or equal to a predefined threshold P, all subsequences of the short sequence are adjacent to one another in the correct order and match reference subsequences exactly; the short sequence can then be located on the reference genome and its match information recorded.
The match information may include the match position MATCHpos, the match length MATCHlen, the match type MATCHtype, the mismatch position MISpos and the mismatched base MISalt.
If the Mismatch of the short sequence is less than or equal to the predefined threshold P, the match information comprises the match position, the match length and the match type "P" (Perfect Match). Otherwise, the match information comprises the match position, the match length, the match type "I" (Insertion) or "D" (Deletion), the mismatch position and the mismatched base. The CIGAR information in sequence alignment results can serve as a reference here. The match information is then sorted by match position, which favours further compression, and finally a delta encoding scheme records the differences between consecutive match records.
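Sorting the match records by position and storing only the gaps between consecutive positions can be sketched as follows. The tuple layout `(MATCHpos, MATCHlen, MATCHtype)` mirrors the field names in the text, but the function names and the choice to delta-encode only the position are assumptions made for illustration.

```python
from itertools import accumulate
from typing import List, Tuple

Match = Tuple[int, int, str]  # (MATCHpos, MATCHlen, MATCHtype)


def encode_matches(matches: List[Match]) -> List[Match]:
    """Sort by match position, then replace each position by its gap to the previous one."""
    ordered = sorted(matches, key=lambda m: m[0])
    encoded = []
    prev = 0
    for pos, length, mtype in ordered:
        encoded.append((pos - prev, length, mtype))
        prev = pos
    return encoded


def decode_matches(encoded: List[Match]) -> List[Match]:
    """Recover the sorted match records from the gaps."""
    positions = accumulate(delta for delta, _, _ in encoded)
    return [(pos, length, mtype)
            for pos, (_, length, mtype) in zip(positions, encoded)]
```

After sorting, the gaps are small non-negative integers that an entropy coder can store in fewer bits than the absolute positions. Note that sorting discards the original read order, so a real format would need to record that order separately if it must be restored.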
Note that when a short sequence contains a SNP or INDEL, the mismatch necessarily lies within one of its subsequences, while the other subsequences can still match reference subsequences exactly.
The positions of the exactly matching seed subsequences can therefore be used as anchors. In the region adjacent to an anchor, a classical alignment algorithm precisely aligns the remaining subsequences that did not match exactly (the possibly mismatching subsequences) to find the final match position of the sequence, and the match information is recorded.
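The anchoring idea can be sketched for the simplest case, in which the read differs from the reference only by substitutions. This is a deliberately reduced illustration: `place_with_anchor` is a hypothetical name, the comparison is a plain position-by-position check, and insertions or deletions would require a real alignment algorithm (for example dynamic programming) in place of that check.

```python
from typing import List, Optional, Tuple


def place_with_anchor(
    read: str,
    reference: str,
    seed_offset: int,   # where the exactly matching seed starts in the read
    ref_pos: int,       # where that seed was found in the reference
    max_mismatch: int,  # plays the role of the threshold P
) -> Optional[Tuple[int, List[Tuple[int, str]]]]:
    """Return (match position, [(MISpos, MISalt), ...]) or None if placement fails."""
    start = ref_pos - seed_offset
    if start < 0 or start + len(read) > len(reference):
        return None
    window = reference[start:start + len(read)]
    mismatches = [(i, base)
                  for i, (base, ref_base) in enumerate(zip(read, window))
                  if base != ref_base]
    if len(mismatches) > max_mismatch:
        return None
    return start, mismatches
```

For a read taken from reference positions 4 to 15 with one substituted base, anchoring on its first seed returns start position 4 and a single `(MISpos, MISalt)` pair, which is all that needs to be stored besides the match position and length.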
In some other embodiments of this application, the process in S3 of aligning the base data in the gene data block against the preset reference genome may include:
Aligning the base data in the gene data block against the preset reference genome using the BWT (Burrows-Wheeler Transform) algorithm.
The BWT is a data transformation that tends to place similar characters of a string next to one another, which facilitates subsequent compression. Specifically, an FM-index is built, and backward search and backtracking are implemented on the basis of the Burrows-Wheeler matrix (BWM) and Last-First (LF) mapping. Because the BWT is derived from the sorted rotations of the string, it supports fast substring search. Backtracking over the base data in the gene data block is the process of finding the optimal alignment position of the aligned base data.
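The backward search mentioned above can be sketched on a small string. This is a teaching-sized sketch rather than a usable aligner: the transform is built by sorting all rotations, the rank query rescans the string on every call, and only exact-match counting is shown. Real FM-index implementations use suffix arrays and sampled rank tables, and the function names here are hypothetical.

```python
from typing import Dict


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end of text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of `pattern` using backward search with LF mapping."""
    # first_row[c]: index of the first row of the sorted matrix that starts with c.
    first_row: Dict[str, int] = {}
    for i, c in enumerate(sorted(bwt_str)):
        first_row.setdefault(c, i)

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; linear scan kept for clarity.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)       # current range of matrix rows
    for c in reversed(pattern):    # extend the match one character to the left
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each step narrows the range of rows whose prefix matches the part of the pattern processed so far, so the search cost depends on the pattern length and not on how many times the pattern occurs.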
In some embodiments of this application, referring to FIG. 3, the gene data processing method in the context of high-throughput sequencing may further include:
Step S105: obtain the processing speed of a first operation.
The first operation corresponds to step S101 above and includes: obtaining a gene data block of a preset size and inputting it into the first reserved area of the first memory.
Step S106: obtain the processing speed of a second operation.
The second operation corresponds to step S102 above and includes: compressing the gene data block based on the data characteristics of each short sequence in it.
Step S107: based on the processing speeds of the first and second operations, determine the computing resources allocated to the first operation and to the second operation.
Specifically, if the first operation is slower than the second operation, the first reserved area holds few gene data blocks and computing resources are relatively plentiful. A single gene data block can then run three processes at once (base data alignment and compression, metadata compression, and quality data compression), with suitable computing resources allocated to each.
When metadata or quality data compression for a gene data block finishes, the result is stored in the second reserved area and the computing resources are released. When the base data alignment task finishes, the alignment result is stored in the second reserved area and the subsequent compression is carried out according to that result; once it completes, the output is stored in the second reserved area and the computing resources are released.
If the first operation is faster than the second operation, several gene data blocks will be waiting in the first reserved area. Each gene data block can then be given fewer compute threads, with priority given to aligning its data. If resources become available for metadata and quality data compression during alignment, they are allocated accordingly; otherwise, those compressions and the storing of their results are carried out after the alignment of the gene data block has finished.
With this adaptive dynamic resource allocation algorithm, computing resources can be provided dynamically according to the number of gene data blocks in the first reserved area, first-in-first-out order is preserved as far as possible, and task management and scheduling that adjust computing resources automatically are achieved. Elastic scaling ensures that enough resources are available when demand rises in the next time period, so that resource shortages and lagging supply do not occur; when demand surges, the elastically released resources act as an effective buffer.
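One possible reading of this allocation rule is sketched below. Every numeric choice is an assumption, since the text gives only the qualitative policy: `plan_threads` is a hypothetical function, one thread each is reserved for metadata and quality compression when enough threads are available, and under backlog the thread pool is divided evenly among the waiting blocks with alignment served first.

```python
from typing import Dict


def plan_threads(
    ingest_rate: float,    # speed of the first operation (blocks per second)
    compress_rate: float,  # speed of the second operation (blocks per second)
    total_threads: int,
    pending_blocks: int,   # blocks waiting in the first reserved area
) -> Dict[str, int]:
    """Return a per-block thread plan for alignment, metadata and quality tasks."""
    if total_threads < 1:
        raise ValueError("at least one thread is required")

    if ingest_rate < compress_rate:
        # Compression keeps up: one block may use the whole pool.
        share = total_threads
    else:
        # Backlog: divide the pool among the waiting blocks.
        share = max(1, total_threads // max(1, pending_blocks))

    if share >= 3:
        # Enough threads to run all three tasks concurrently.
        return {"alignment": share - 2, "metadata": 1, "quality": 1}
    # Too few threads: alignment first; the other two tasks are deferred
    # until alignment of this block has finished.
    return {"alignment": share, "metadata": 0, "quality": 0}
```

With 8 threads and a single pending block while compression is the faster stage, the plan gives alignment 6 threads and the other two tasks 1 each; with 4 pending blocks and ingest running ahead, each block gets 2 alignment threads and its metadata and quality compression wait.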
The gene data processing apparatus in the context of high-throughput sequencing provided by embodiments of this application is described below. The apparatus described below and the method described above may be read with reference to each other.
Referring to FIG. 4, the gene data processing apparatus in the context of high-throughput sequencing provided by an embodiment of this application may include:
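a data block acquisition unit 21, configured to, when the available space in the first reserved area of the first memory reaches a preset capacity, obtain a gene data block of a preset size and input the gene data block into the first reserved area of the first memory, wherein the gene data block is a set of short sequences streamed in real time from the sequencing platform, the first reserved area can hold N gene data blocks, and the preset size is not greater than the preset capacity;
a data block processing unit 22, configured to compress the gene data block based on the data characteristics of each short sequence in the gene data block to obtain a compressed gene data block, keep the compressed gene data block in the second reserved area of the first memory, and release the gene data block from the first memory, wherein the second reserved area can hold M compressed gene data blocks;
a data block export unit 23, configured to, once the number of compressed gene data blocks in the first memory reaches J, output the J compressed gene data blocks to the second memory and release the J compressed gene data blocks from the first memory;
wherein N, M and J are all preset natural numbers, and J is not greater than M.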
In some embodiments of this application, the short sequence includes metadata, base data and quality data, and the process of compressing the gene data block based on the data characteristics of each short sequence in the gene data block includes:
aligning the base data in the gene data block against a preset reference genome, and compressing the base data in the gene data block according to the alignment result;
compressing the metadata in the gene data block using delta encoding or run-length encoding;
determining the complexity of the quality data with a preset adaptive model, and determining a context statistical model of a first target order based on that complexity;
compressing the quality data with the context statistical model of the first target order to obtain a first intermediate compression result;
compressing the first intermediate compression result using run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
In some embodiments of this application, the process by which the data block processing unit 22 aligns the base data in the gene data block against the preset reference genome and compresses it according to the alignment result may include:
dividing the base data in the gene data block into multiple subsequences;
aligning each subsequence against the preset reference genome using a hash-based alignment method to obtain match information for each subsequence, the match information including a mismatch value;
for a subsequence whose mismatch value is less than or equal to a preset threshold, compressing the subsequence based on its match information;
for a subsequence whose mismatch value is greater than the preset threshold:
determining the complexity of the subsequence with a preset adaptive model, and determining a context statistical model of a second target order based on that complexity;
compressing the subsequence with the context statistical model of the second target order to obtain a second intermediate compression result;
compressing the second intermediate compression result using run-length encoding, ANS+FSE coding, arithmetic coding or Huffman coding.
在本申请的一些实施例中,数据块处理单元22采用哈希比对方法将每一子序列与预设的参考基因组进行比对,得到每一子序列的匹配信息的过程,可以包括:In some embodiments of the present application, the data
利用每一子序列的哈希值作为查询条件,在预设的哈希表进行查询,得到每一子序列的匹配信息;Using the hash value of each subsequence as the query condition, query in the preset hash table to obtain the matching information of each subsequence;
其中,所述预设的哈希表记载有所述参考基因组中各参考子序列的哈希值以及各参考子序列在所述参考基因组中的位置信息,所述各参考子序列为从所述参考基因组划分得到的。Wherein, the preset hash table records the hash value of each reference subsequence in the reference genome and the position information of each reference subsequence in the reference genome, and each reference subsequence is obtained from the derived from the reference genome.
在本申请的一些实施例中,数据块处理单元22从所述参考基因组划分得到各参考子序列的过程,可以包括:In some embodiments of the present application, the process of dividing the data
以预设的步长从所述参考基因组中划分出长度为K、重叠的多个参考子序列,其中,K为预设的长度值。A plurality of overlapping reference subsequences of length K are divided from the reference genome with a preset step size, wherein K is a preset length value.
In some embodiments of the present application, the process by which the data block processing unit 22 aligns the base data in the gene data block against the preset reference genome may include:
aligning the base data in the gene data block against the preset reference genome using a BWT (Burrows-Wheeler transform) algorithm.
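The text does not say which BWT-based method is used. As one common possibility, the sketch below builds a small FM-index over the reference and finds exact occurrences of a read by backward search. It is an illustration under stated assumptions only: the function names are hypothetical, the suffix array is built naively (suitable for short references only), and real aligners additionally tolerate mismatches.

```python
def build_fm_index(reference):
    """Build the suffix array, first-column offsets, and occurrence table for reference + '$'."""
    text = reference + "$"  # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    first, total = {}, 0
    for c in alphabet:  # first[c] = number of characters in text smaller than c
        first[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}  # occ[c][i] = count of c in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ

def find_exact(fm_index, read):
    """Return the sorted start positions of exact occurrences of read in the reference."""
    sa, first, occ = fm_index
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(read):  # backward search: extend the match one base to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

Each step of the backward search costs constant time regardless of reference length, which is the main reason BWT-based indexes are popular for aligning short reads.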
In some embodiments of the present application, referring to FIG. 5, the gene data processing device in the context of high-throughput sequencing may further include a computing resource allocation unit 24, configured to:
obtain the processing speed of a first operation, the first operation including: acquiring a gene data block of a preset size and inputting it into the first reserved area of the first memory;
obtain the processing speed of a second operation, the second operation including: compressing the gene data block based on the data characteristics of each short sequence in the gene data block;
based on the processing speed of the first operation and the processing speed of the second operation, determine the computing resources allocated to the first operation and to the second operation.
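The text leaves the allocation rule open. One plausible rule, shown here purely as an assumption, is to balance the throughput of the two operations: if each worker of an operation processes blocks at the measured speed, giving each operation a share of workers inversely proportional to its speed keeps neither side waiting on the other. The function name and the notion of a fixed worker pool are hypothetical.

```python
def allocate_workers(first_speed, second_speed, total_workers):
    """Split a worker pool between the acquire (first) and compress (second) operations.

    Speeds are per-worker throughputs (for example, blocks per second).
    Balancing workers * speed across both operations gives the first operation
    the share second_speed / (first_speed + second_speed).
    """
    if total_workers < 2:
        raise ValueError("need at least one worker per operation")
    first_share = second_speed / (first_speed + second_speed)
    first_workers = round(total_workers * first_share)
    # Keep at least one worker on each operation.
    first_workers = min(max(first_workers, 1), total_workers - 1)
    return first_workers, total_workers - first_workers
```

With acquisition four times faster than compression, for instance `allocate_workers(100.0, 25.0, 10)`, the sketch assigns two workers to acquisition and eight to compression.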
The gene data processing device in the context of high-throughput sequencing provided by the embodiments of the present application can be applied to gene data processing equipment in the context of high-throughput sequencing, such as a computer. Optionally, FIG. 6 shows a block diagram of the hardware structure of such equipment. Referring to FIG. 6, the hardware structure may include: at least one processor 31, at least one communication interface 32, at least one memory 33, and at least one communication bus 34.
In the embodiments of the present application, there is at least one each of the processor 31, the communication interface 32, the memory 33, and the communication bus 34, and the processor 31, the communication interface 32, and the memory 33 communicate with one another through the communication bus 34;
the processor 31 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;
the memory 33 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory;
wherein the memory 33 stores a program, and the processor 31 can invoke the program stored in the memory 33, the program being used to:
when the available space in the first reserved area of the first memory reaches a preset capacity, acquire a gene data block of a preset size and input the gene data block into the first reserved area of the first memory, wherein the gene data block is a set of short sequences transmitted in real time from the sequencing platform, the first reserved area is able to hold N gene data blocks, and the preset size is not greater than the preset capacity;
based on the data characteristics of each short sequence in the gene data block, compress the gene data block to obtain a compressed gene data block, keep the compressed gene data block in the second reserved area of the first memory, and release the gene data block from the first memory, the second reserved area being able to hold M compressed gene data blocks;
once the number of compressed gene data blocks in the first memory reaches J, output the J compressed gene data blocks to the second memory and release the J compressed gene data blocks from the first memory;
wherein N, M, and J are all preset natural numbers, and J is not greater than M.
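The bookkeeping of the two reserved areas can be sketched in Python as follows. This is a simplified illustration, not the claimed implementation: capacity is counted in block slots rather than bytes, `zlib` stands in for the characteristic-based compression, a plain list stands in for the second memory, and the class and method names are hypothetical.

```python
import zlib
from collections import deque

class StreamingCompressor:
    """Tracks a first reserved area (N raw blocks) and a second one (M compressed blocks)."""

    def __init__(self, n_raw, m_compressed, j_flush, second_memory):
        if not 1 <= j_flush <= m_compressed:
            raise ValueError("J must satisfy 1 <= J <= M")
        self.n_raw = n_raw
        self.m_compressed = m_compressed
        self.j_flush = j_flush
        self.raw = deque()          # first reserved area
        self.compressed = []        # second reserved area
        self.second_memory = second_memory

    def can_accept(self):
        """True when the first reserved area has room for one more block."""
        return len(self.raw) < self.n_raw

    def accept(self, block):
        """Place a block arriving from the sequencing platform in the first area."""
        if not self.can_accept():
            raise BufferError("first reserved area is full")
        self.raw.append(block)

    def compress_one(self):
        """Compress the oldest raw block, release it, and flush once J blocks are ready."""
        if not self.raw or len(self.compressed) >= self.m_compressed:
            return False
        block = self.raw.popleft()                     # releases the raw block
        self.compressed.append(zlib.compress(block))   # zlib is a stand-in compressor
        if len(self.compressed) >= self.j_flush:
            self.second_memory.extend(self.compressed[:self.j_flush])
            del self.compressed[:self.j_flush]         # release the J flushed blocks
        return True
```

Because raw blocks are released as soon as they are compressed, the first area frees up continuously and new blocks can keep arriving while earlier ones are still waiting to be written out.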
Optionally, for the refined and extended functions of the program, refer to the description above.
The embodiments of the present application further provide a storage medium, which may store a program suitable for execution by a processor, the program being used to:
when the available space in the first reserved area of the first memory reaches a preset capacity, acquire a gene data block of a preset size and input the gene data block into the first reserved area of the first memory, wherein the gene data block is a set of short sequences transmitted in real time from the sequencing platform, the first reserved area is able to hold N gene data blocks, and the preset size is not greater than the preset capacity;
based on the data characteristics of each short sequence in the gene data block, compress the gene data block to obtain a compressed gene data block, keep the compressed gene data block in the second reserved area of the first memory, and release the gene data block from the first memory, the second reserved area being able to hold M compressed gene data blocks;
once the number of compressed gene data blocks in the first memory reaches J, output the J compressed gene data blocks to the second memory and release the J compressed gene data blocks from the first memory;
wherein N, M, and J are all preset natural numbers, and J is not greater than M.
Optionally, for the refined and extended functions of the program, refer to the description above.
In summary:
In the present application, when the available space in the first reserved area of the first memory reaches a preset capacity, a gene data block of a preset size is acquired and input into the first reserved area of the first memory, wherein the gene data block is a set of short sequences transmitted in real time from the sequencing platform, the first reserved area is able to hold N gene data blocks, and the preset size is not greater than the preset capacity. Next, based on the data characteristics of each short sequence in the gene data block, the gene data block is compressed to obtain a compressed gene data block, the compressed gene data block is kept in the second reserved area of the first memory, and the gene data block is released from the first memory, wherein the second reserved area is able to hold M compressed gene data blocks. It can be understood that the flow of acquiring and placing gene data blocks and the flow of compressing gene data blocks are carried out repeatedly in a loop. That is, gene data blocks are delivered continuously from the sequencing platform into the first reserved area of the first memory until the first reserved area cannot hold any more gene data blocks; meanwhile, each gene data block in the first reserved area is compressed, and the resulting compressed gene data block is kept in the second reserved area of the first memory. Once the number of compressed gene data blocks in the first memory reaches J, the J compressed gene data blocks are output to the second memory and released from the first memory, wherein N, M, and J are all preset natural numbers and J is not greater than M. The present application thus processes the gene data blocks output by the sequencing platform in a streaming, parallel manner, without waiting for all gene data to be collected before processing. This saves waiting time, improves the utilization of computing resources, increases the overall speed from the data coming off the sequencing platform to bioinformatics analysis, and helps improve the delivery efficiency of bioinformatics sequencing.
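To make the concurrent, looping behaviour summarized above concrete, the following Python sketch runs acquisition, compression, and output as three threads connected by bounded queues. It is an illustration under assumptions: `run_pipeline`, the `DONE` sentinel, and the use of `zlib` as the compressor are hypothetical, the queues count blocks rather than bytes, and a list stands in for the second memory.

```python
import queue
import threading
import zlib

def run_pipeline(blocks, n_raw=4, m_compressed=4, j_flush=2):
    """Stream blocks through bounded buffers and return what reached the second memory."""
    raw_area = queue.Queue(maxsize=n_raw)              # first reserved area
    packed_area = queue.Queue(maxsize=m_compressed)    # second reserved area
    second_memory = []
    DONE = object()  # sentinel marking the end of the stream

    def acquire():
        for block in blocks:
            raw_area.put(block)  # blocks here while the first area is full
        raw_area.put(DONE)

    def compress():
        while (block := raw_area.get()) is not DONE:
            packed_area.put(zlib.compress(block))  # zlib is a stand-in compressor
        packed_area.put(DONE)

    def flush():
        batch = []
        while (item := packed_area.get()) is not DONE:
            batch.append(item)
            if len(batch) == j_flush:  # J compressed blocks are ready
                second_memory.extend(batch)
                batch.clear()
        second_memory.extend(batch)  # write out any final partial batch

    threads = [threading.Thread(target=f) for f in (acquire, compress, flush)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return second_memory
```

The bounded queues provide the back-pressure described in the summary: acquisition pauses only when the first area is full, and compression starts as soon as the first block arrives instead of waiting for the whole run to finish.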
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; the embodiments may be combined as needed, and for identical or similar parts the embodiments may be referred to one another.
The above description of the disclosed embodiments enables any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211019556.6A CN115346605B (en) | 2022-08-24 | 2022-08-24 | Gene data processing method and device in high-throughput sequencing background and related equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115346605A (en) | 2022-11-15 |
| CN115346605B (en) | 2025-12-19 |
Family
ID=83953930
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211019556.6A Active CN115346605B (en) | 2022-08-24 | 2022-08-24 | Gene data processing method and device in high-throughput sequencing background and related equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115346605B (en) |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115346605B (en) | 2025-12-19 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| REG | Reference to a national code |
Ref country code: HK Ref legal event code: DE Ref document number: 40082992 Country of ref document: HK |
|
| GR01 | Patent grant |