WO2017214765A1

WO2017214765A1 - Multi-thread fast storage lossless compression method and system for fastq data

Info

Publication number: WO2017214765A1
Application number: PCT/CN2016/085426
Authority: WO
Inventors: 朱泽轩; 黄志安; 孙怡雯; 文振焜
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2016-06-12
Filing date: 2016-06-12
Publication date: 2017-12-21
Anticipated expiration: 2018-12-12

Abstract

Provided is a multi-threaded fast storage lossless compression method for FASTQ data, applied to the compression of DNA sequences. The method comprises: a data classification step (S11) of inputting original FASTQ data and dividing the short reads of the original FASTQ data into three data streams, namely metadata, quality scores, and base sequences; a data compression step (S12) of, for the metadata, using incremental encoding to detect and eliminate redundant information; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; and for the base sequences, compressing with fixed-order improved arithmetic coding; and a data output step (S13) of archiving and merging the compression results of the different data streams and outputting the final compressed data. The solution can improve compression efficiency and compression speed.

Description

Multi-thread fast storage lossless compression method and system thereof for FASTQ data

Technical field

The present invention relates to the field of data compression, and in particular to a multi-threaded fast storage lossless compression method and system for FASTQ data.

Background technique

With the development of DNA sequencing technology, the cost of genome sequencing keeps falling. In 2014, the milestone of sequencing a human genome for US$1,000 was reached. Because sequencing has become more efficient, the volume of DNA sequence data has grown explosively. DNA sequencing data is growing far faster than the capacity of computer microprocessors and storage devices, so storing and analyzing the "tsunami" of DNA data produced by sequencing technology and large genome projects has become an important bottleneck restricting the further development of the DNA sequencing industry. Moreover, DNA sequencing technology is moving from high-throughput sequencing (also known as next-generation sequencing) to single-molecule sequencing (also known as third-generation sequencing). Reads in FASTQ data have accordingly grown from fixed lengths of 50 to 200 bp to variable lengths of 1 kbp to 300 kbp. A change of this size further constrains the DNA sequencing industry, so suitable data compression technology is urgently needed.

Mainstream, efficient general-purpose compression software currently includes gzip (http://www.gzip.org/), bzip2 (http://gzip.org/), and LZMA (http://www.7-zip.org/sdk.html). gzip first applies a variant of the LZ77 algorithm to the file to be compressed, and then compresses the result with either static or dynamic Huffman coding, depending on the data. bzip2 divides the data to be compressed into blocks (100 to 900 KB each), transforms repeated character sequences with the BWT (Burrows-Wheeler transform), and then compresses with the MTF (Move-To-Front transform) algorithm and Huffman coding. LZMA uses a dictionary coding mechanism similar to LZ77; it compresses the data stream, the repeat lengths, and the repeat positions separately, and it supports several hash-chain variants, binary trees, and radix trees as the basis of its dictionary search.

However, these compression methods do not take into account the biological characteristics of DNA data, such as long repeated fragments and complementary palindromes. As a result, they compress DNA sequence data poorly and slowly.

Technical problem

In view of this, an object of the present invention is to provide a multi-threaded fast storage lossless compression method and system for FASTQ data, with the aim of solving the problem that prior-art compression of DNA sequence data achieves poor compression and is slow.

Technical solution

The present invention provides a multi-threaded fast storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, characterized in that the method comprises:

Data classification step: inputting original FASTQ data, and dividing the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences;

Data compression step: for the metadata, using incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compressing with fixed-order improved arithmetic coding;

Data output step: archiving and merging the compression results of the different data streams, and outputting the final compressed data.

Preferably, thread-level parallel programming with Pthreads is used in the data compression step to process the compression of the three data streams simultaneously.

Preferably, the data compression step specifically comprises:

For the quality scores, performing an initial compression of the quality score data stream with run-length encoding as a preprocessing step;

Compressing the preprocessed data again with a bit-level PPM prediction model and arithmetic coding.

For the base sequences, determining whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;

If it is the non-reference-based compression mode, compressing the base sequence data stream with fixed-order improved arithmetic coding;

If it is the reference-based compression mode, aligning the DNA short reads and removing redundancy with a DNA data matching tool, recording the corresponding matching information and saving it in a SAM format file, and then extracting and compressing the saved SAM format file with fixed-order improved arithmetic coding.

In another aspect, the present invention also provides a multi-threaded fast storage lossless compression system for FASTQ data, the system comprising:

A data classification module, configured to input original FASTQ data and divide the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences;

A data compression module, configured to, for the metadata, use incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; and for the base sequences, compress with fixed-order improved arithmetic coding;

A data output module, configured to archive and merge the compression results of the different data streams and output the final compressed data.

Preferably, the data compression module uses thread-level parallel programming with Pthreads to process the compression of the three data streams simultaneously.

Preferably, the data compression module is specifically configured to:

Beneficial effect

The technical solution provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By means of an efficient coding design, it splits FASTQ data into three types of data stream and compresses each separately, which greatly increases the compression ratio. It also provides an interface to upstream DNA data matching tools, so it can be used in the reference-based compression mode, exploiting the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data. In addition, thread-level parallel programming with Pthreads (POSIX threads) allows the intermediate files generated from the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability, and improves compression efficiency.

Drawings

FIG. 1 is a flowchart of a multi-threaded fast storage lossless compression method for FASTQ data according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the internal structure of a multi-threaded fast storage lossless compression system 10 for FASTQ data according to an embodiment of the present invention.

Embodiments of the invention

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it.

A specific embodiment of the present invention provides a multi-threaded fast storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, wherein the method mainly comprises the following steps:

S11. Data classification step: inputting original FASTQ data, and dividing the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences;

S12. Data compression step: for the metadata, using incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compressing with fixed-order improved arithmetic coding;

S13. Data output step: archiving and merging the compression results of the different data streams, and outputting the final compressed data.

The multi-threaded fast storage lossless compression method for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By means of an efficient coding design, it splits FASTQ data into three types of data stream and compresses each separately, which greatly increases the compression ratio. In addition, thread-level parallel programming with Pthreads (POSIX threads) allows the intermediate files generated from the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability, and improves compression efficiency.

The multi-threaded fast storage lossless compression method for FASTQ data provided by the present invention is described in detail below.

Refer to FIG. 1, which is a flowchart of a multi-threaded fast storage lossless compression method for FASTQ data according to an embodiment of the present invention.

In step S11, the data classification step, original FASTQ data is input, and the short reads of the original FASTQ data are divided into three data streams: metadata, quality scores, and base sequences.

In this embodiment, DNA sequencing produces many thousands of short reads, which are stored in FASTQ format files containing all the information produced by sequencing. In the widely used FASTQ format, each short read consists of four lines separated by newline characters. The first line begins with the character '@' followed by the metadata, which uniquely identifies the short read. The second line is the base data, a sequence over the five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may be any one of {'A', 'T', 'C', 'G'}. The third line begins with the character '+' followed by the same read identifier as the first line. The last line is the quality score line, which corresponds one-to-one with the bases and gives the confidence of the sequencing call at each base position.
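The four-line record layout described above can be illustrated with a minimal Python sketch of the data classification step. The function name `split_fastq` and the sample records are hypothetical and are not taken from the patent; the sketch assumes well-formed input with exactly four lines per read.

```python
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FASTQ records into metadata, base, and quality streams."""
    metadata, bases, qualities = [], [], []
    record = []
    for raw in lines:
        record.append(raw.rstrip("\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        record = []
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Quality scores correspond one-to-one with bases.
        if len(seq) != len(qual):
            raise ValueError("quality line must match the base line in length")
        metadata.append(header[1:])  # drop the leading '@'
        bases.append(seq)
        qualities.append(qual)
    if record:
        raise ValueError("truncated FASTQ record")
    return metadata, bases, qualities


# Hypothetical two-read input, for illustration only.
example = [
    "@read1 length=4\n", "ACGN\n", "+read1 length=4\n", "II#5\n",
    "@read2 length=4\n", "TTGA\n", "+read2 length=4\n", "IIII\n",
]
metadata, bases, qualities = split_fastq(example)
```

Keeping the three streams in separate lists mirrors the idea that each stream has different statistics and can be handed to a different coder.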

In step S12, the data compression step: for the metadata, incremental encoding is used to detect and eliminate redundant information in the metadata; for the quality scores, a bit-level PPM (Prediction by Partial Matching) prediction model and arithmetic coding are used for compression; for the base sequences, fixed-order improved arithmetic coding is used for compression.

In this embodiment, the data compression step uses thread-level parallel programming with Pthreads to process the compression of the three data streams simultaneously. The intermediate files generated from the three data streams can thus be compressed at the same time, which greatly increases compression speed and broadens applicability.
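The idea of compressing the three streams concurrently can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's `concurrent.futures.ThreadPoolExecutor` rather than Pthreads, and `zlib.compress` stands in for the patent's stream-specific coders, which are not reproduced here. The function name `compress_streams` and the sample payloads are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def compress_streams(streams: Dict[str, bytes]) -> Dict[str, bytes]:
    """Compress each named stream on its own worker thread."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        # zlib is a stand-in compressor, not the coder described in the patent.
        futures = {
            name: pool.submit(zlib.compress, data, 9)
            for name, data in streams.items()
        }
        # result() blocks until the corresponding worker has finished.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stream contents, for illustration only.
streams = {
    "metadata": b"read1\nread2\n",
    "quality": b"IIII\nIIII\n",
    "bases": b"ACGT\nACGT\n",
}
compressed = compress_streams(streams)
```

Because the three streams are independent, no locking is needed between workers; the only synchronization point is collecting the results.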

In this embodiment, for the metadata, incremental encoding is used to detect and eliminate redundant information in the metadata, after which fixed-order improved arithmetic coding is used for compression, with multi-threaded parallel processing.
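The patent does not spell out the details of its incremental encoding. One common form, shown here purely as an assumption, stores for each read identifier the length of the prefix it shares with the previous identifier plus the differing suffix. The names `delta_encode` and `delta_decode` and the sample headers are hypothetical.

```python
from typing import List, Tuple


def delta_encode(headers: List[str]) -> List[Tuple[int, str]]:
    """Encode each header as (shared prefix length, remaining suffix)."""
    encoded = []
    previous = ""
    for header in headers:
        shared = 0
        limit = min(len(header), len(previous))
        while shared < limit and header[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, header[shared:]))
        previous = header
    return encoded


def delta_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the headers from the (prefix length, suffix) pairs."""
    headers = []
    previous = ""
    for shared, suffix in encoded:
        header = previous[:shared] + suffix
        headers.append(header)
        previous = header
    return headers


# Hypothetical read identifiers, for illustration only.
headers = ["sample.1 lane=1", "sample.2 lane=1", "sample.10 lane=2"]
encoded = delta_encode(headers)
```

Consecutive read identifiers usually differ only in a counter or coordinate, so most of each header collapses into a small integer, leaving less data for the later entropy coder.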

In this embodiment, the data compression step specifically comprises:

In this embodiment, because the quality scores contain many runs of consecutive identical characters, run-length encoding is first applied as a preprocessing step to perform an initial compression of the quality score data stream. The preprocessed data is then compressed again with a bit-level PPM prediction model and arithmetic coding, with multi-threaded parallel processing. In this embodiment, the bit-level PPM prediction model and arithmetic coding achieve a relatively high compression ratio and compression speed when losslessly compressing quality scores, which draw on about 40 distinct characters and follow a pseudo-random distribution.
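The run-length preprocessing of quality scores can be sketched in a few lines of Python. This illustrates only the run-length stage; the bit-level PPM model and arithmetic coder that follow it are not shown. The function names and the sample quality string are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(quality: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical quality characters into (symbol, count)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def rle_decode(runs: List[Tuple[str, int]]) -> str:
    """Expand (symbol, count) pairs back into the original quality string."""
    return "".join(symbol * count for symbol, count in runs)


# Hypothetical quality line, for illustration only.
quality = "IIIIII###55"
runs = rle_encode(quality)
```

The transformation is lossless, so `rle_decode(rle_encode(q)) == q` for any quality string `q`.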

In this embodiment, because the performance of the DNA data matching tool directly affects the compression result, SAM format files are used as intermediate files to provide an interface to upstream DNA data matching tools, which leaves room for later improvements in compression performance.

In this embodiment, for the base sequences, the present invention supports two DNA data compression modes: the non-reference-based compression mode and the reference-based compression mode. The method determines which mode applies to the DNA sequence and then processes it accordingly. The DNA data matching tools in this embodiment, such as the BWA (Burrows-Wheeler Aligner) tool, the Bowtie tool, and the CompMap tool, can align DNA short reads and remove redundancy, achieving fast parallel matching of DNA short reads, and are applied in the compressed storage method for DNA sequences.

For the non-reference-based compression mode: this mode targets base sequences composed only of the five characters {'A', 'T', 'C', 'G', 'N'}. Because the alphabet is small, a fixed-order strategy greatly reduces the waste of unused bits and lowers, at the bit level, the space needed to store each character, so fixed-order improved arithmetic coding is used. In other words, in the non-reference-based compression mode, the base sequence data stream is compressed with fixed-order improved arithmetic coding.
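To give a feel for why a fixed-order model suits a five-symbol alphabet, the sketch below estimates the ideal code length that an arithmetic coder would approach when driven by an adaptive order-k context model. It is not the patent's improved arithmetic coder: no bitstream is produced, and the smoothing scheme, the default order, and the name `fixed_order_cost_bits` are assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGTN"


def fixed_order_cost_bits(sequence: str, order: int = 2) -> float:
    """Ideal code length in bits under an adaptive fixed-order context model."""
    # One count table per context, initialised to 1 so no symbol has zero probability.
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    total_bits = 0.0
    context = ""
    for base in sequence:
        index = ALPHABET.index(base)
        table = counts[context]
        # An arithmetic coder spends about -log2(p) bits on a symbol of probability p.
        total_bits -= math.log2(table[index] / sum(table))
        table[index] += 1  # adapt the model after coding the symbol
        context = (context + base)[-order:] if order > 0 else ""
    return total_bits


# Hypothetical, highly repetitive sequence, for illustration only.
sequence = "ACGT" * 50
bits = fixed_order_cost_bits(sequence, order=2)
```

On repetitive input the preceding `order` bases predict the next base well, so the estimated cost falls well below the 8 bits per character of plain text storage.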

For the reference-based compression mode: this mode performs data compression on top of the DNA data matching tools mentioned above and provides a corresponding interface so that they can be used to preprocess the base sequences. The DNA short reads are aligned and redundancy is removed, and the corresponding matching information is recorded and saved in a SAM format file. The matching information comprises the match position, the palindrome flag, the match length, the match type, and the non-matching characters. The compression mode then splits these five kinds of information into separate files and compresses each with fixed-order improved arithmetic coding. In other words, in the reference-based compression mode, the DNA short reads are aligned and redundancy is removed with a DNA data matching tool, the corresponding matching information is recorded and saved in a SAM format file, and the saved SAM format file is then extracted and compressed with fixed-order improved arithmetic coding.
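The patent does not state how its five kinds of matching information map onto SAM fields. The sketch below shows one possible mapping, offered only as an assumption: POS as the match position, bit 0x10 of FLAG (read reverse-complemented) as the palindrome flag, CIGAR lengths and operation letters as the match lengths and match types, and the read bases under insertion (I) and soft-clip (S) operations as non-matching characters. Substitutions inside M operations would need the reference or the optional MD tag and are not handled. The function name and the sample record are hypothetical.

```python
import re
from typing import Dict, List

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")  # CIGAR operations that consume read bases


def split_sam_record(line: str) -> Dict[str, object]:
    """Split one SAM alignment line into five illustrative information streams."""
    fields = line.rstrip("\n").split("\t")
    # Mandatory SAM columns: FLAG is column 2, POS column 4, CIGAR column 6, SEQ column 10.
    flag, pos, cigar, seq = int(fields[1]), int(fields[3]), fields[5], fields[9]
    lengths: List[int] = []
    types: List[str] = []
    unmatched: List[str] = []
    cursor = 0  # current offset into the read sequence
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        lengths.append(length)
        types.append(op)
        if op in QUERY_CONSUMING:
            if op in "IS":
                # These bases cannot be recovered from the reference alone.
                unmatched.append(seq[cursor:cursor + length])
            cursor += length
    return {
        "position": pos,
        "reverse_complement": bool(flag & 0x10),
        "lengths": lengths,
        "types": types,
        "unmatched": unmatched,
    }


# Hypothetical alignment record, for illustration only.
record = "read1\t16\tchr1\t100\t60\t3M1I2M\t*\t0\t0\tACGTAC\tIIIIII"
sam_streams = split_sam_record(record)
```

Separating the fields this way groups values with similar statistics together, which is the motivation the text gives for writing each kind of information to its own file before entropy coding.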

In step S13, the data output step, the compression results of the different data streams are archived and merged, and the final compressed data is output.
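The archive layout is not specified in the patent. As a minimal sketch under that caveat, the compressed streams could be merged into a single length-prefixed container such as the one below; the magic bytes `FQC1`, the field widths, and the function names are all hypothetical.

```python
import struct
from typing import Dict

MAGIC = b"FQC1"  # hypothetical file signature, not defined by the patent
ENTRY = "<HQ"    # little-endian: 2-byte name length, 8-byte payload length


def pack_archive(streams: Dict[str, bytes]) -> bytes:
    """Merge named compressed streams into one length-prefixed blob."""
    parts = [MAGIC, struct.pack("<I", len(streams))]
    for name, payload in streams.items():
        encoded_name = name.encode("utf-8")
        parts.append(struct.pack(ENTRY, len(encoded_name), len(payload)))
        parts.append(encoded_name)
        parts.append(payload)
    return b"".join(parts)


def unpack_archive(blob: bytes) -> Dict[str, bytes]:
    """Recover the named streams from a blob produced by pack_archive."""
    if blob[:4] != MAGIC:
        raise ValueError("not a recognised archive")
    (count,) = struct.unpack_from("<I", blob, 4)
    offset = 8
    streams = {}
    for _ in range(count):
        name_len, payload_len = struct.unpack_from(ENTRY, blob, offset)
        offset += struct.calcsize(ENTRY)
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        streams[name] = blob[offset:offset + payload_len]
        offset += payload_len
    return streams


# Hypothetical compressed payloads, for illustration only.
parts_in = {"metadata": b"\x01\x02", "quality": b"\x03", "bases": b""}
archive = pack_archive(parts_in)
```

Storing each stream's length up front lets a decompressor locate and decode the three streams independently, which also suits multi-threaded decompression.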

In this embodiment, the technical solution of the present invention improves the compression performance for FASTQ data in two respects: compression ratio and compression speed. In comparisons using high-compression settings, and regardless of whether the FASTQ data consists mainly of variable-length long reads or fixed-length short reads, the present invention achieves a higher compression ratio and a faster compression speed than mainstream general-purpose compression software in both modes (the reference-based compression mode and the non-reference-based compression mode). This advantage was demonstrated in tests on two FASTQ datasets downloaded from the US National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/): ERR385912 (641 MB, read length 51, short reads) and ERR654984 (1164 MB, read length 64 to 502, long reads). On these test data, the compression ratio of the present invention is on average 10.7% higher than that of bzip2 and 15.2% higher than that of gzip, and its compression speed is on average 40.2% faster than bzip2 and 45.5% faster than gzip.

The multi-threaded fast storage lossless compression method for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By means of an efficient coding design, it splits FASTQ data into three types of data stream and compresses each separately, which greatly increases the compression ratio. It also provides an interface to upstream DNA data matching tools, so it can be used in the reference-based compression mode, exploiting the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data. In addition, thread-level parallel programming with Pthreads allows the intermediate files generated from the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability, and improves compression efficiency.

A specific embodiment of the present invention further provides a multi-threaded fast storage lossless compression system 10 for FASTQ data, which mainly comprises:

A data classification module 11, configured to input original FASTQ data and divide the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences;

A data compression module 12, configured to, for the metadata, use incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; and for the base sequences, compress with fixed-order improved arithmetic coding;

A data output module 13, configured to archive and merge the compression results of the different data streams and output the final compressed data.

The multi-threaded fast storage lossless compression system 10 for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By means of an efficient coding design, it splits FASTQ data into three types of data stream and compresses each separately, which greatly increases the compression ratio. In addition, thread-level parallel programming with Pthreads allows the intermediate files generated from the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability, and improves compression efficiency.

Refer to FIG. 2, which is a schematic structural diagram of the multi-threaded fast storage lossless compression system 10 for FASTQ data according to an embodiment of the present invention.

In this embodiment, the multi-threaded fast storage lossless compression system 10 for FASTQ data is applied to the compression of DNA sequences and mainly comprises a data classification module 11, a data compression module 12, and a data output module 13.

The data classification module 11 is configured to input original FASTQ data and divide the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences.

The data compression module 12 is configured to, for the metadata, use incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; and for the base sequences, compress with fixed-order improved arithmetic coding.

In this embodiment, the data compression module 12 uses thread-level parallel programming with Pthreads to process the compression of the three data streams simultaneously. The intermediate files generated from the three data streams can thus be compressed at the same time, which greatly increases compression speed and broadens applicability.

For the metadata, the data compression module 12 uses incremental encoding to detect and eliminate redundant information in the metadata, and then compresses with fixed-order improved arithmetic coding, with multi-threaded parallel processing.

The data compression module 12 is specifically configured to:

In this embodiment, because the quality scores contain many runs of consecutive identical characters, the data compression module 12 first applies run-length encoding as a preprocessing step to perform an initial compression of the quality score data stream. It then compresses the preprocessed data again with a bit-level PPM prediction model and arithmetic coding, with multi-threaded parallel processing. In this embodiment, the bit-level PPM prediction model and arithmetic coding achieve a relatively high compression ratio and compression speed when losslessly compressing quality scores, which draw on about 40 distinct characters and follow a pseudo-random distribution.

The DNA data matching tools in this embodiment, such as the BWA (Burrows-Wheeler Aligner) tool, the Bowtie tool, and the CompMap tool, can align DNA short reads and remove redundancy, achieving fast parallel matching of DNA short reads, and are applied in the compressed storage method for DNA sequences.

In this embodiment, for the base sequences, the present invention supports two DNA data compression modes: the non-reference-based compression mode and the reference-based compression mode. The system determines which mode applies to the DNA sequence and then processes it accordingly. The specific processing flow of the two DNA data compression modes is as described for step S12 above and is not repeated here.

本发明提供的一种针对FASTQ数据的多线程快速存储无损压缩系统10，充分利用了DNA数据的生物学特性，能在两种FASTQ数据（即长读和短读）中获得极高的压缩比，通过设计高效的编码方式把FASTQ数据拆分成三种类型的数据流并对此分别进行单独压缩，大大提高了压缩率，并且，对上游DNA数据匹配工具提供了接口，能应用于基于参考基因的压缩模式，发挥同源物种基因组之间的高度相似性，进一步提高重测序数据的压缩比。另外，通过采用Pthreads的线程级并行编程，能够同时压缩处理三种数据流所产生的中间文件，从而大大提高压缩速度，增加了适用性，进而提高了压缩效率。The multi-threaded fast storage lossless compression system 10 for FASTQ data provided by the present invention fully utilizes the biological characteristics of DNA data, and can obtain extremely high compression ratio in two FASTQ data (ie, long read and short read). By designing efficient coding methods, FASTQ data is split into three types of data streams and compressed separately, which greatly improves the compression ratio, and provides an interface to the upstream DNA data matching tool, which can be applied to reference-based The compression pattern of genes plays a high degree of similarity between genomes of homologous species, further increasing the compression ratio of resequencing data. In addition, by using thread-level parallel programming of Pthreads, it is possible to simultaneously compress and process the intermediate files generated by the three data streams, thereby greatly increasing the compression speed, increasing the applicability, and thereby improving the compression efficiency.

值得注意的是，上述实施例中，所包括的各个单元只是按照功能逻辑进行划分的，但并不局限于上述的划分，只要能够实现相应的功能即可；另外，各功能单元的具体名称也只是为了便于相互区分，并不用于限制本发明的保护范围。It should be noted that, in the foregoing embodiment, each unit included is only divided according to functional logic, but is not limited to the above division, as long as the corresponding function can be implemented; in addition, the specific name of each functional unit is also They are only used to facilitate mutual differentiation and are not intended to limit the scope of the present invention.

In addition, those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

A multi-threaded fast storage lossless compression method for FASTQ data, applied to compression of DNA sequences, characterized in that the method comprises:

Data classification step: inputting original FASTQ data, and dividing the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences;

Data compression step: for the metadata, using incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, using a bit-level PPM prediction model and arithmetic coding for compression; and for the base sequences, using improved fixed-order arithmetic coding for compression;

Data output step: archiving and merging the compression results of the different data streams, and outputting the final compressed data.

The multi-threaded fast storage lossless compression method for FASTQ data according to claim 1, wherein thread-level parallel programming with Pthreads is used in the data compression step to compress the three data streams simultaneously.

The multi-threaded fast storage lossless compression method for FASTQ data according to claim 2, wherein the data compression step specifically comprises:

For the quality scores, the quality score data stream is first compressed by run-length encoding as a preprocessing step;

The preprocessed data is then further compressed using a bit-level PPM prediction model and arithmetic coding.

For the base sequences, determining whether the DNA sequence is to be compressed in a non-reference-based compression mode or in a reference-based compression mode;

If the compression mode is non-reference-based, the base sequence data stream is compressed using improved fixed-order arithmetic coding;

If the compression mode is reference-based, the short DNA reads are aligned and their redundancy is eliminated by a DNA data matching tool, the corresponding matching information is recorded and saved in a SAM format file, and the saved SAM format file is then compressed using improved fixed-order arithmetic coding.

A multi-threaded fast storage lossless compression system for FASTQ data, characterized in that the system comprises:

a data classification module, configured to input original FASTQ data and divide the short reads of the original FASTQ data into three data streams: metadata, quality scores, and base sequences;

a data compression module, configured to: for the metadata, use incremental encoding to detect and eliminate redundant information in the metadata; for the quality scores, use a bit-level PPM prediction model and arithmetic coding for compression; and for the base sequences, use improved fixed-order arithmetic coding for compression;

a data output module, configured to archive and merge the compression results of the different data streams and output the final compressed data.

The multi-threaded fast storage lossless compression system for FASTQ data according to claim 5, wherein the data compression module uses thread-level parallel programming with Pthreads to compress the three data streams simultaneously.

The multi-threaded fast storage lossless compression system for FASTQ data according to claim 6, wherein the data compression module is specifically configured to: