
WO2017214765A1 - Multi-thread fast storage lossless compression method and system for fastq data - Google Patents

Multi-thread fast storage lossless compression method and system for FASTQ data

Info

Publication number
WO2017214765A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
compression
fastq
arithmetic coding
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/085426
Other languages
French (fr)
Chinese (zh)
Inventor
朱泽轩
黄志安
孙怡雯
文振焜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to PCT/CN2016/085426
Publication of WO2017214765A1
Anticipated expiration
Legal status: Ceased

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B99/00 Subject matter not provided for in other groups of this subclass

Definitions

  • The present invention relates to the field of data compression, and in particular to a multi-threaded fast-storage lossless compression method and system for FASTQ data.
  • The short reads in FASTQ data have grown from fixed lengths of 50 to 200 bp to variable lengths of 1 kbp to 300 kbp.
  • This large change in the data further constrains the development of the DNA sequencing industry.
  • gzip http://www.gzip.org/
  • bzip2 http://gzip.org/
  • LZMA http://www.7-zip.org
  • gzip first compresses the input file with a variant of the LZ77 algorithm, then compresses the result with either static or dynamic Huffman coding, depending on the data.
  • bzip2 divides the data to be compressed into blocks (100 to 900 KB each), transforms repeated character sequences with the BWT (Burrows-Wheeler transform), and then compresses the result with the MTF (move-to-front transform) and Huffman coding.
  • LZMA uses a dictionary coding mechanism similar to the LZ77 algorithm, compresses the data stream, the repeated-sequence lengths, and the repeated-sequence positions separately, and supports several hash-chain variants, binary trees, and radix trees as the basis of its dictionary search algorithm.
  • An object of the present invention is to provide a multi-threaded fast-storage lossless compression method and system for FASTQ data, which aims to solve the prior-art problems of poor compression ratio and slow compression speed on DNA sequence data.
  • The invention provides a multi-threaded fast-storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, the method comprising:
  • Data classification step: inputting the original FASTQ data and dividing its short reads into three data streams: metadata, quality scores, and base sequences;
  • Data compression step: for the metadata, using incremental coding to detect and eliminate redundant information; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compressing with an improved fixed-order arithmetic coding;
  • Data output step: archiving and merging the compression results of the different data streams and outputting the final compressed data.
  • Preferably, Pthreads thread-level parallel programming is used in the data compression step to compress the three data streams simultaneously.
  • Preferably, the data compression step specifically includes:
  • for the quality scores, first compressing the quality-score data stream with run-length encoding as preprocessing;
  • recompressing the preprocessed data using a bit-level PPM prediction model and arithmetic coding.
  • Preferably, the data compression step specifically includes:
  • for the base sequences, determining whether the compression mode of the DNA sequence is the non-reference-based mode or the reference-based mode;
  • if the mode is non-reference-based, compressing the base-sequence data stream with the improved fixed-order arithmetic coding;
  • if the mode is reference-based, aligning the DNA short reads with a DNA data matching tool and removing redundancy, recording the corresponding matching information in a SAM-format file, and then extracting and compressing the saved SAM file with the improved fixed-order arithmetic coding.
  • The present invention also provides a multi-threaded fast-storage lossless compression system for FASTQ data, the system comprising:
  • a data classification module, configured to input the original FASTQ data and divide its short reads into three data streams: metadata, quality scores, and base sequences;
  • a data compression module, configured to detect and eliminate redundant metadata information using incremental coding, to compress the quality scores using a bit-level PPM prediction model and arithmetic coding, and to compress the base sequences using an improved fixed-order arithmetic coding;
  • a data output module, configured to archive and merge the compression results of the different data streams and output the final compressed data.
  • Preferably, the data compression module uses Pthreads thread-level parallel programming to compress the three data streams simultaneously.
  • Preferably, the data compression module is specifically configured to:
  • for the quality scores, first compress the quality-score data stream with run-length encoding as preprocessing;
  • recompress the preprocessed data using a bit-level PPM prediction model and arithmetic coding.
  • Preferably, the data compression module is specifically configured to:
  • for the base sequences, determine whether the compression mode of the DNA sequence is the non-reference-based mode or the reference-based mode;
  • if the mode is non-reference-based, compress the base-sequence data stream with the improved fixed-order arithmetic coding;
  • if the mode is reference-based, align the DNA short reads with a DNA data matching tool and remove redundancy, record the corresponding matching information in a SAM-format file, and then extract and compress the saved SAM file with the improved fixed-order arithmetic coding.
  • The technical solution provided by the invention makes full use of the biological characteristics of DNA data and achieves a high compression ratio on both kinds of FASTQ data (long reads and short reads).
  • Its coding methods split the FASTQ data into three data streams and compress each separately, which improves the compression ratio. It also provides an interface to upstream DNA data matching tools, so that it can be applied to the reference-based compression mode,
  • which exploits the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data.
  • Pthreads: POSIX Threads, used here for thread-level parallel programming
  • FIG. 1 is a flowchart of a multi-thread fast storage lossless compression method for FASTQ data according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram showing the internal structure of a multi-threaded fast storage lossless compression system 10 for FASTQ data according to an embodiment of the present invention.
  • A specific embodiment of the present invention provides a multi-threaded fast-storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, wherein the method mainly includes the following steps:
  • Data classification step: inputting the original FASTQ data and dividing its short reads into three data streams: metadata, quality scores, and base sequences;
  • The multi-threaded fast-storage lossless compression method for FASTQ data provided by the invention makes full use of the biological characteristics of DNA data and achieves a high compression ratio on both kinds of FASTQ data (long reads and short reads). Its coding methods split the FASTQ data into three data streams and compress each separately, which improves the compression ratio. In addition, Pthreads (POSIX Threads) thread-level parallel programming allows the intermediate files generated by the three data streams to be compressed simultaneously, which increases compression speed, broadens applicability, and thus improves compression efficiency.
  • The multi-threaded fast-storage lossless compression method for FASTQ data provided by the present invention is described in detail below.
  • FIG. 1 is a flowchart of a multi-thread fast storage lossless compression method for FASTQ data according to an embodiment of the present invention.
  • In step S11, the data classification step, the original FASTQ data is input and its short reads are divided into three data streams: metadata, quality scores, and base sequences.
  • DNA sequencing produces many thousands of short reads, which are stored in a FASTQ-format file containing all the information generated by sequencing.
  • In FASTQ data, each short read occupies four lines, separated by newline characters.
  • The first line begins with the character '@' followed by the metadata, which uniquely identifies the short read.
  • The second line is the base data, a sequence drawn from only five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may stand for any character in {'A', 'T', 'C', 'G'}.
  • The third line begins with the character '+' followed by the same short-read identifier as the first line.
  • The last line is the quality-score line; it corresponds one-to-one with the bases and indicates the sequencing confidence of the base character at each position.
  • In step S12, the data compression step: for the metadata, incremental coding is used to detect and eliminate redundant information; for the quality scores, compression uses a bit-level PPM (prediction by partial matching) prediction model and arithmetic coding; for the base sequences, compression uses an improved fixed-order arithmetic coding.
  • The data compression step uses Pthreads thread-level parallel programming to compress the three data streams simultaneously, so the intermediate files generated by the three data streams are compressed at the same time, which increases compression speed and broadens applicability.
  • For the metadata, incremental coding detects and eliminates redundant information, after which the result is compressed with the improved fixed-order arithmetic coding, using multi-threaded parallel processing.
  • The data compression step specifically includes:
  • for the quality scores, first compressing the quality-score data stream with run-length encoding as preprocessing;
  • recompressing the preprocessed data using a bit-level PPM prediction model and arithmetic coding.
  • That is, run-length encoding serves as preprocessing and compresses the quality-score data stream first; the bit-level PPM prediction model and arithmetic coding then recompress the preprocessed data, using multi-threaded parallel processing.
  • For lossless compression of quality scores, which use an alphabet of about 40 characters and follow a pseudo-random distribution, the bit-level PPM prediction model with arithmetic coding achieves a higher compression ratio and compression speed.
  • The data compression step specifically includes:
  • for the base sequences, determining whether the compression mode of the DNA sequence is the non-reference-based mode or the reference-based mode;
  • if the mode is non-reference-based, compressing the base-sequence data stream with the improved fixed-order arithmetic coding;
  • if the mode is reference-based, aligning the DNA short reads with a DNA data matching tool and removing redundancy, recording the corresponding matching information in a SAM-format file, and then extracting and compressing the saved SAM file with the improved fixed-order arithmetic coding.
  • The SAM-format file serves as an intermediate file that provides an interface to upstream DNA data matching tools, which leaves room for later improvements in compression performance.
  • The present invention supports two DNA data compression modes, non-reference-based and reference-based; it first determines which mode applies to the DNA sequence and then processes the data accordingly.
  • DNA data matching tools in this embodiment include, for example, BWA (Burrows-Wheeler Aligner), Bowtie, and CompMap. These tools align DNA short reads and remove redundancy, enable fast parallel matching of DNA short reads, and are suited to compressed storage of DNA sequences.
  • In the non-reference-based compression mode, the base data consists of only the five characters {'A', 'T', 'C', 'G', 'N'}. Because the alphabet is so small, a fixed-order strategy greatly reduces wasted spare bits and shrinks, at the bit level, the space needed to store each character, so fixed-order arithmetic coding is adopted. That is, if the compression mode is non-reference-based, the base-sequence data stream is compressed with the improved fixed-order arithmetic coding.
  • In the reference-based compression mode, compression builds on the DNA data matching tools above through a corresponding interface, which can be applied to preprocessing of the base sequences: the DNA short reads are aligned and redundancy is removed, and the corresponding matching information is recorded in a SAM-format file. The matching information comprises the match position, palindrome flag, match length, match type, and mismatched characters. This mode then splits these five kinds of information into separate files and compresses each with the improved fixed-order arithmetic coding.
  • That is, if the compression mode is reference-based, the DNA short reads are aligned and redundancy is removed by the DNA data matching tool, the corresponding matching information is recorded in a SAM-format file, and the saved SAM file is then extracted and compressed with the improved fixed-order arithmetic coding.
  • In step S13, the data output step, the compression results of the different data streams are archived and merged, and the final compressed data is output.
  • The technical solution of the present invention improves the compression performance on FASTQ data in two respects: compression ratio and compression speed.
  • On FASTQ data dominated either by variable-length long reads or by fixed-length short reads, and in both modes (reference-based and non-reference-based), the present invention achieves a higher compression ratio and a faster compression speed than mainstream general-purpose compression software.
  • The multi-threaded fast-storage lossless compression method for FASTQ data provided by the invention makes full use of the biological characteristics of DNA data and achieves a high compression ratio on both kinds of FASTQ data (long reads and short reads).
  • Its coding methods split the FASTQ data into three data streams and compress each separately, which improves the compression ratio. It also provides an interface to upstream DNA data matching tools, so that it can be applied to the reference-based compression mode,
  • which exploits the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data.
  • In addition, with Pthreads thread-level parallel programming, the intermediate files generated by the three data streams can be compressed simultaneously, which increases compression speed, broadens applicability, and thus improves compression efficiency.
  • An embodiment of the present invention further provides a multi-threaded fast-storage lossless compression system 10 for FASTQ data, which mainly includes:
  • a data classification module 11, configured to input the original FASTQ data and divide its short reads into three data streams: metadata, quality scores, and base sequences;
  • a data compression module 12, configured to detect and eliminate redundant metadata information using incremental coding, to compress the quality scores using a bit-level PPM prediction model and arithmetic coding, and to compress the base sequences using an improved fixed-order arithmetic coding;
  • a data output module 13, configured to archive and merge the compression results of the different data streams and output the final compressed data.
  • The multi-threaded fast-storage lossless compression system 10 for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a high compression ratio on both kinds of FASTQ data (long reads and short reads).
  • The FASTQ data is split into three data streams that are compressed separately, which improves the compression ratio.
  • In addition, with Pthreads thread-level parallel programming, the intermediate files generated by the three data streams can be compressed simultaneously, which increases compression speed, broadens applicability, and thus improves compression efficiency.
  • Referring to FIG. 2, a schematic structural diagram of a multi-threaded fast-storage lossless compression system 10 for FASTQ data according to an embodiment of the present invention is shown.
  • The multi-threaded fast-storage lossless compression system 10 for FASTQ data is applied to the compression of DNA sequences and mainly includes a data classification module 11, a data compression module 12, and a data output module 13.
  • The data classification module 11 is configured to input the original FASTQ data and divide its short reads into three data streams: metadata, quality scores, and base sequences.
  • DNA sequencing produces many thousands of short reads, which are stored in a FASTQ-format file containing all the information generated by sequencing.
  • In FASTQ data, each short read occupies four lines, separated by newline characters.
  • The first line begins with the character '@' followed by the metadata, which uniquely identifies the short read.
  • The second line is the base data, a sequence drawn from only five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may stand for any character in {'A', 'T', 'C', 'G'}.
  • The third line begins with the character '+' followed by the same short-read identifier as the first line.
  • The last line is the quality-score line; it corresponds one-to-one with the bases and indicates the sequencing confidence of the base character at each position.
  • The data compression module 12 is configured to detect and eliminate redundant metadata information using incremental coding, to compress the quality scores using a bit-level PPM prediction model and arithmetic coding, and to compress the base sequences using an improved fixed-order arithmetic coding.
  • The data compression module 12 uses Pthreads thread-level parallel programming to compress the three data streams simultaneously, so the intermediate files generated by the three data streams are compressed at the same time, which increases compression speed and broadens applicability.
  • For the metadata, the data compression module 12 detects and eliminates redundant information using incremental coding, then compresses the result with the improved fixed-order arithmetic coding, using multi-threaded parallel processing.
  • The data compression module 12 is specifically configured to:
  • for the quality scores, first compress the quality-score data stream with run-length encoding as preprocessing;
  • recompress the preprocessed data using a bit-level PPM prediction model and arithmetic coding.
  • That is, the data compression module 12 first applies run-length encoding as preprocessing to compress the quality-score data stream, then recompresses the preprocessed data using a bit-level PPM prediction model and arithmetic coding, with multi-threaded parallel processing.
  • For lossless compression of quality scores, which use an alphabet of about 40 characters and follow a pseudo-random distribution, the bit-level PPM prediction model with arithmetic coding achieves a higher compression ratio and compression speed.
  • The data compression module 12 is specifically configured to:
  • for the base sequences, determine whether the compression mode of the DNA sequence is the non-reference-based mode or the reference-based mode;
  • if the mode is non-reference-based, compress the base-sequence data stream with the improved fixed-order arithmetic coding;
  • if the mode is reference-based, align the DNA short reads with a DNA data matching tool and remove redundancy, record the corresponding matching information in a SAM-format file, and then extract and compress the saved SAM file with the improved fixed-order arithmetic coding.
  • The SAM-format file serves as an intermediate file that provides an interface to upstream DNA data matching tools, which leaves room for later improvements in compression performance.
  • DNA data matching tools in this embodiment include, for example, BWA (Burrows-Wheeler Aligner), Bowtie, and CompMap. These tools align DNA short reads and remove redundancy, enable fast parallel matching of DNA short reads, and are suited to compressed storage of DNA sequences.
  • The present invention supports two DNA data compression modes, non-reference-based and reference-based; it first determines which mode applies to the DNA sequence and then processes the data accordingly.
  • The specific processing of the two DNA data compression modes is as described for step S12 above and is not repeated here.
  • The data output module 13 is configured to archive and merge the compression results of the different data streams and output the final compressed data.
  • The multi-threaded fast-storage lossless compression system 10 for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a high compression ratio on both kinds of FASTQ data (long reads and short reads).
  • The FASTQ data is split into three data streams that are compressed separately, which improves the compression ratio. The system also provides an interface to upstream DNA data matching tools, so that it can be applied to the reference-based compression mode,
  • which exploits the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data.
  • In addition, with Pthreads thread-level parallel programming, the intermediate files generated by the three data streams can be compressed simultaneously, which increases compression speed, broadens applicability, and thus improves compression efficiency.
  • Each unit included is divided only according to functional logic; the division is not limited to the above as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of the present invention.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided is a multi-threaded fast-storage lossless compression method for FASTQ data, which is applied to the compression of DNA sequences. The method comprises: a data classification step of inputting original FASTQ data and dividing the short reads of the original FASTQ data into three data streams, namely metadata, quality scores, and base sequences (S11); a data compression step of: for the metadata, using incremental coding to detect and eliminate redundant information; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; and for the base sequences, compressing with an improved fixed-order arithmetic coding (S12); and a data output step of archiving and merging the compression results of the different data streams and outputting the final compressed data (S13). The solution can improve compression efficiency and compression speed.

Description

Multi-thread fast storage lossless compression method and system for FASTQ data

Technical field

The present invention relates to the field of data compression, and in particular to a multi-threaded fast-storage lossless compression method and system for FASTQ data.

Background art

With the development of DNA sequencing technology, the cost of genome sequencing keeps falling. In 2014, the milestone of sequencing a human genome for 1,000 US dollars was reached. As sequencing efficiency has improved, the volume of DNA sequence data has grown explosively. Because DNA sequencing data is growing far faster than computer microprocessors and storage devices, storing and analyzing the "tsunami" of DNA data produced by DNA sequencing and large genome projects has become a major bottleneck constraining the further development of the DNA sequencing industry. Moreover, DNA sequencing is moving from high-throughput sequencing, also known as next-generation sequencing, to single-molecule sequencing (also known as third-generation sequencing). As a result, the short reads in FASTQ data have grown from fixed lengths of 50 to 200 bp to variable lengths of 1 kbp to 300 kbp. This large change in the data further constrains the development of the DNA sequencing industry, so suitable data compression technology is urgently needed.

Mainstream, efficient general-purpose compression software currently includes gzip (http://www.gzip.org/), bzip2 (http://gzip.org/), and LZMA (http://www.7-zip.org/sdk.html). gzip first compresses the input file with a variant of the LZ77 algorithm, then compresses the result with either static or dynamic Huffman coding, depending on the data. bzip2 divides the data to be compressed into blocks (100 to 900 KB each), transforms repeated character sequences with the BWT (Burrows-Wheeler transform) algorithm, and then compresses the result with the MTF (move-to-front transform) algorithm and Huffman coding. LZMA uses a dictionary coding mechanism similar to the LZ77 algorithm, compresses the data stream, the repeated-sequence lengths, and the repeated-sequence positions separately, and supports several hash-chain variants, binary trees, and radix trees as the basis of its dictionary search algorithm.
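To make the comparison with general-purpose tools concrete, the following minimal sketch runs the three codec families named above on a toy base sequence, using Python's standard `zlib`, `bz2`, and `lzma` modules as stand-ins for the gzip, bzip2, and LZMA programs. The input is a hypothetical, seeded pseudo-random string of bases, not data from the patent, and the helper name `compressed_sizes` is an assumption for illustration only.

```python
import bz2
import lzma
import random
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for three general-purpose codecs."""
    return {
        "zlib (LZ77 variant + Huffman)": len(zlib.compress(data, 9)),
        "bz2 (BWT + MTF + Huffman)": len(bz2.compress(data, 9)),
        "lzma (LZ77-like dictionary coder)": len(lzma.compress(data)),
    }


# Hypothetical toy input: 100,000 pseudo-random bases (seeded for repeatability).
rng = random.Random(0)
bases = "".join(rng.choice("ACGT") for _ in range(100_000)).encode("ascii")

sizes = compressed_sizes(bases)
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({8 * size / len(bases):.2f} bits per base)")
```

Because the toy input is random, it only shows the gain from the small alphabet; real reads contain repeats and complementary palindromes, which is the structure these generic codecs are not designed around.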

However, these compression methods do not take into account the biological characteristics of DNA data, such as long repeated segments and complementary palindromic structures, so they compress DNA sequence data poorly and slowly.

Technical problem

In view of this, an object of the present invention is to provide a multi-threaded fast-storage lossless compression method and system for FASTQ data, which aims to solve the prior-art problems of poor compression ratio and slow compression speed on DNA sequence data.

Technical solution

The invention provides a multi-threaded fast-storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, characterized in that the method comprises:

Data classification step: inputting the original FASTQ data and dividing its short reads into three data streams: metadata, quality scores, and base sequences;

Data compression step: for the metadata, using incremental coding to detect and eliminate redundant information; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compressing with an improved fixed-order arithmetic coding;

Data output step: archiving and merging the compression results of the different data streams and outputting the final compressed data.

Preferably, the data compression step uses thread-level parallel programming with Pthreads to compress the three data streams simultaneously.

Preferably, the data compression step specifically comprises:

for the quality scores, first compressing the quality-score data stream with run-length encoding as a preprocessing step;

then compressing the preprocessed data again with the bit-level PPM prediction model and arithmetic coding.

Preferably, the data compression step specifically comprises:

for the base sequences, determining whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;

if it is the non-reference-based compression mode, compressing the base-sequence data stream with the improved fixed-order arithmetic coding;

if it is the reference-based compression mode, aligning the DNA short reads with a DNA data matching tool to remove redundancy, recording the corresponding match information in a SAM format file, and then extracting and compressing the saved SAM format file with the improved fixed-order arithmetic coding.

In another aspect, the present invention further provides a multi-threaded fast-storage lossless compression system for FASTQ data, the system comprising:

a data classification module, configured to input raw FASTQ data and divide the short reads of the raw FASTQ data into three data streams: metadata, quality scores and base sequences;

a data compression module, configured to: for the metadata, use delta encoding to detect and eliminate redundant information; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compress with an improved fixed-order arithmetic coding;

a data output module, configured to archive and merge the compression results of the different data streams and output the final compressed data.

Preferably, the data compression module uses thread-level parallel programming with Pthreads to compress the three data streams simultaneously.

Preferably, the data compression module is specifically configured to:

for the quality scores, first compress the quality-score data stream with run-length encoding as a preprocessing step;

then compress the preprocessed data again with the bit-level PPM prediction model and arithmetic coding.

Preferably, the data compression module is specifically configured to:

for the base sequences, determine whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;

if it is the non-reference-based compression mode, compress the base-sequence data stream with the improved fixed-order arithmetic coding;

if it is the reference-based compression mode, align the DNA short reads with a DNA data matching tool to remove redundancy, record the corresponding match information in a SAM format file, and then extract and compress the saved SAM format file with the improved fixed-order arithmetic coding.

Beneficial effects

The technical solution provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By splitting FASTQ data into three types of data streams and compressing each stream separately with an efficient coding scheme, it greatly improves the compression ratio. It also provides an interface to upstream DNA data matching tools, so that it can be used in the reference-based compression mode, which exploits the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data. In addition, thread-level parallel programming with Pthreads (POSIX threads) allows the intermediate files produced by the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability and thus improves compression efficiency.

Description of the drawings

FIG. 1 is a flowchart of a multi-threaded fast-storage lossless compression method for FASTQ data according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the internal structure of a multi-threaded fast-storage lossless compression system 10 for FASTQ data according to an embodiment of the present invention.

Embodiments of the invention

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

A specific embodiment of the present invention provides a multi-threaded fast-storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, the method mainly comprising the following steps:

S11. Data classification step: input raw FASTQ data and divide the short reads of the raw FASTQ data into three data streams: metadata, quality scores and base sequences;

S12. Data compression step: for the metadata, use delta encoding to detect and eliminate redundant information; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compress with an improved fixed-order arithmetic coding;

S13. Data output step: archive and merge the compression results of the different data streams, and output the final compressed data.

The multi-threaded fast-storage lossless compression method for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By splitting FASTQ data into three types of data streams and compressing each stream separately with an efficient coding scheme, it greatly improves the compression ratio. In addition, thread-level parallel programming with Pthreads (POSIX threads) allows the intermediate files produced by the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability and thus improves compression efficiency.

The multi-threaded fast-storage lossless compression method for FASTQ data provided by the present invention is described in detail below.

Refer to FIG. 1, which is a flowchart of the multi-threaded fast-storage lossless compression method for FASTQ data according to an embodiment of the present invention.

In step S11, the data classification step, raw FASTQ data are input and the short reads of the raw FASTQ data are divided into three data streams: metadata, quality scores and base sequences.

In this embodiment, DNA sequencing produces thousands upon thousands of short reads, which are stored in FASTQ format files containing all the information produced by sequencing. In the widely used FASTQ format, each short read occupies four lines separated by newline characters. The first line begins with the character '@' followed by the metadata, which uniquely identifies the short read. The second line holds the base data, a sequence consisting only of the five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may be any one of {'A', 'T', 'C', 'G'}. The third line begins with the character '+', followed by the same read identifier as in the first line. The last line is the quality-score line; it corresponds one-to-one with the bases and gives the confidence of the sequencing call at each base position.
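The four-line record layout can be illustrated with a minimal Python sketch of the classification step. This is only an illustration under stated assumptions, not the implementation of the invention: the function name `split_fastq` and the sample records are hypothetical, and the sketch also accepts a bare '+' on the third line, which many FASTQ files use.

```python
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FASTQ records into metadata, base and quality streams."""
    metadata, bases, qualities = [], [], []
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line and not record:
            continue  # tolerate blank lines between records
        record.append(line)
        if len(record) == 4:
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            if len(seq) != len(qual):
                raise ValueError("base and quality lines differ in length")
            metadata.append(header[1:])  # drop the leading '@'
            bases.append(seq)
            qualities.append(qual)
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
    return metadata, bases, qualities


# Hypothetical two-read input used only for illustration.
example = [
    "@read1 length=8\n", "ACGTNACG\n", "+\n", "IIIIHHH#\n",
    "@read2 length=8\n", "TTGCAACG\n", "+\n", "IIIIIIII\n",
]
meta, seqs, quals = split_fastq(example)
```

Each of the three returned lists can then be handed to its own compressor, which is what makes the per-stream coding of step S12 possible.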

In step S12, the data compression step: for the metadata, delta encoding is used to detect and eliminate redundant information; for the quality scores, a bit-level PPM (prediction by partial matching) prediction model and arithmetic coding are used for compression; for the base sequences, an improved fixed-order arithmetic coding is used for compression.

In this embodiment, the data compression step uses thread-level parallel programming with Pthreads to compress the three data streams simultaneously. The intermediate files produced by the three data streams can therefore be compressed at the same time, which greatly increases compression speed and broadens applicability.
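The idea of one worker per data stream can be sketched as follows. The text specifies Pthreads; this sketch swaps in Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in, and uses `zlib` as a placeholder for the per-stream codecs, so both choices are assumptions made for illustration only.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def compress_stream(name: str, payload: bytes) -> Tuple[str, bytes]:
    # zlib is only a placeholder for the stream-specific codec.
    return name, zlib.compress(payload)


def compress_streams_in_parallel(streams: Dict[str, bytes]) -> Dict[str, bytes]:
    """Compress every stream on its own worker thread and collect the results."""
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [pool.submit(compress_stream, n, p) for n, p in streams.items()]
        return dict(f.result() for f in futures)


# Hypothetical stream contents.
streams = {
    "metadata": b"read1\nread2\n",
    "bases": b"ACGT" * 100,
    "quality": b"I" * 400,
}
compressed = compress_streams_in_parallel(streams)
```

Because the three streams share no state, no locking is needed; the only synchronization point is collecting the three results before they are merged.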

In this embodiment, for the metadata, delta encoding is used to detect and eliminate redundant information, the result is then compressed with the improved fixed-order arithmetic coding, and multi-threaded parallel processing is used.
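One common way to realize delta encoding on read identifiers is to store, for each header, only the length of the prefix it shares with the previous header plus the differing suffix. The sketch below assumes that scheme; the text does not specify the exact delta format, and the header strings are hypothetical.

```python
from typing import List, Tuple


def delta_encode(headers: List[str]) -> List[Tuple[int, str]]:
    """Encode each header as (shared-prefix length, remaining suffix)."""
    encoded = []
    previous = ""
    for header in headers:
        prefix = 0
        limit = min(len(previous), len(header))
        while prefix < limit and previous[prefix] == header[prefix]:
            prefix += 1
        encoded.append((prefix, header[prefix:]))
        previous = header
    return encoded


def delta_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the headers by reusing the prefix of the previous header."""
    headers = []
    previous = ""
    for prefix, suffix in encoded:
        header = previous[:prefix] + suffix
        headers.append(header)
        previous = header
    return headers


# Hypothetical identifiers; real headers depend on the sequencing platform.
headers = ["SAMPLE.1 length=51", "SAMPLE.2 length=51", "SAMPLE.3 length=51"]
encoded = delta_encode(headers)
```

Consecutive identifiers usually differ in only a few characters, so the suffixes are short and the prefix lengths are highly repetitive, which leaves the subsequent arithmetic coder much less to encode.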

In this embodiment, the data compression step specifically comprises:

for the quality scores, first compressing the quality-score data stream with run-length encoding as a preprocessing step;

then compressing the preprocessed data again with the bit-level PPM prediction model and arithmetic coding.

In this embodiment, because the quality scores contain many runs of consecutive identical characters, run-length encoding is applied first as a preprocessing step to compress the quality-score data stream, and the preprocessed data are then compressed again with the bit-level PPM prediction model and arithmetic coding, using multi-threaded parallel processing. In this embodiment, the bit-level PPM prediction model and arithmetic coding achieve a high compression ratio and compression speed when losslessly compressing quality scores, which are drawn from about 40 distinct characters and follow a pseudo-random distribution.
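The run-length preprocessing of a quality line can be sketched in a few lines of Python. This is a minimal illustration of plain run-length encoding, assuming (character, run length) pairs as the output form; the text does not describe the actual byte layout passed to the PPM stage.

```python
from typing import List, Tuple


def rle_encode(quality: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical characters into (character, run length)."""
    runs: List[Tuple[str, int]] = []
    for ch in quality:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: List[Tuple[str, int]]) -> str:
    """Expand (character, run length) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)


runs = rle_encode("IIIIHHH#")
```

For the sample line the eight quality characters collapse into three runs, and decoding restores the line exactly, so the preprocessing stays lossless.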

In this embodiment, the data compression step specifically comprises:

for the base sequences, determining whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;

if it is the non-reference-based compression mode, compressing the base-sequence data stream with the improved fixed-order arithmetic coding;

if it is the reference-based compression mode, aligning the DNA short reads with a DNA data matching tool to remove redundancy, recording the corresponding match information in a SAM format file, and then extracting and compressing the saved SAM format file with the improved fixed-order arithmetic coding.

In this embodiment, because the performance of the DNA data matching tool directly affects the compression result, the SAM format file is used as the intermediate file and serves as the interface to upstream DNA data matching tools, which leaves room for later improvements in compression performance.

In this embodiment, for the base sequences, the present invention supports two DNA data compression modes, the non-reference-based compression mode and the reference-based compression mode; it determines which mode applies to the DNA sequence and processes the data accordingly. The DNA data matching tools in this embodiment, such as BWA (Burrows-Wheeler Aligner), Bowtie and CompMap, align DNA short reads and remove redundancy, perform fast parallel matching of DNA short reads, and are applied in the compressed storage of DNA sequences.

Non-reference-based compression mode: this mode targets base sequences composed only of the five characters {'A', 'T', 'C', 'G', 'N'}. Because the alphabet is small, a fixed-order strategy greatly reduces wasted spare bits and lowers, at the bit level, the space needed to store each character; the improved fixed-order arithmetic coding is therefore used. That is, in the non-reference-based compression mode, the base-sequence data stream is compressed with the improved fixed-order arithmetic coding.
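To make the fixed-order idea concrete, the sketch below models only the probability side of such a coder: an adaptive model that conditions each base on the previous `order` bases and sums the ideal code length, -log2(p), that an arithmetic coder driven by that model would approach. It is not the improved coder of the invention, whose modifications the text does not detail; the function name, the default order and the count initialization are assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGTN"


def estimated_bits(sequence: str, order: int = 2) -> float:
    """Ideal code length of `sequence` under an adaptive fixed-order model."""
    # Each context starts with a count of 1 per symbol so no probability is zero.
    counts = defaultdict(lambda: {s: 1 for s in ALPHABET})
    total_bits = 0.0
    context = ""
    for symbol in sequence:
        table = counts[context]
        probability = table[symbol] / sum(table.values())
        total_bits += -math.log2(probability)
        table[symbol] += 1  # adapt the model after coding the symbol
        context = (context + symbol)[-order:] if order > 0 else ""
    return total_bits


repetitive = "ACGT" * 50
bits = estimated_bits(repetitive, order=2)
```

With a five-symbol alphabet the first symbol costs log2(5) bits, and once the contexts have been seen a few times the cost per base on repetitive input drops well below the 2 bits of a plain fixed-width code, which is the saving the fixed-order strategy aims at.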

Reference-based compression mode: this mode compresses data on the basis of the DNA data matching tools above and provides a corresponding interface so that they can be used to preprocess the base sequences. The DNA short reads are aligned and redundancy is removed, and the corresponding match information is recorded in a SAM format file. The match information comprises the match position, the palindrome flag, the match length, the match type and the mismatched characters. This mode then splits these five kinds of information into separate files and compresses each with the improved fixed-order arithmetic coding. That is, in the reference-based compression mode, the DNA short reads are aligned with a DNA data matching tool to remove redundancy, the corresponding match information is recorded in a SAM format file, and the saved SAM format file is then extracted and compressed with the improved fixed-order arithmetic coding.
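Splitting the five kinds of match information into separate streams can be sketched as below. The `MatchRecord` class, its field names and the sample values are hypothetical; the sketch does not parse real SAM files and only shows the regrouping of per-read records into per-field streams.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MatchRecord:
    """Hypothetical container for the five kinds of match information."""
    position: int        # match position on the reference
    is_palindrome: bool  # palindrome flag
    length: int          # number of matched characters
    match_type: str      # match type (the label values here are made up)
    mismatches: str      # characters that did not match


def split_into_field_streams(records: List[MatchRecord]) -> Dict[str, list]:
    """Gather each field across all records into its own stream."""
    return {
        "position": [r.position for r in records],
        "palindrome": [int(r.is_palindrome) for r in records],
        "length": [r.length for r in records],
        "match_type": [r.match_type for r in records],
        "mismatches": [r.mismatches for r in records],
    }


records = [
    MatchRecord(1040, False, 51, "exact", ""),
    MatchRecord(1093, True, 49, "substitution", "GT"),
]
field_streams = split_into_field_streams(records)
```

Grouping values of the same kind puts similar symbols next to each other, so each stream has its own, more predictable statistics for the coder that follows.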

In step S13, the data output step, the compression results of the different data streams are archived and merged, and the final compressed data are output.
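The archiving and merging step can be pictured as writing the compressed streams into one container with enough framing to separate them again. The container layout below (a stream count, then a name length, payload length, name and payload per stream) is purely an assumption for illustration; the text does not define the archive format.

```python
import struct
from typing import Dict

ENTRY_HEADER = "<HQ"  # name length (uint16) and payload length (uint64), little-endian


def pack_archive(streams: Dict[str, bytes]) -> bytes:
    """Concatenate named streams into a single length-prefixed archive."""
    parts = [struct.pack("<I", len(streams))]
    for name, payload in streams.items():
        encoded_name = name.encode("utf-8")
        parts.append(struct.pack(ENTRY_HEADER, len(encoded_name), len(payload)))
        parts.append(encoded_name)
        parts.append(payload)
    return b"".join(parts)


def unpack_archive(blob: bytes) -> Dict[str, bytes]:
    """Recover the named streams from an archive built by pack_archive."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = struct.calcsize("<I")
    streams: Dict[str, bytes] = {}
    for _ in range(count):
        name_len, payload_len = struct.unpack_from(ENTRY_HEADER, blob, offset)
        offset += struct.calcsize(ENTRY_HEADER)
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        streams[name] = blob[offset:offset + payload_len]
        offset += payload_len
    return streams


# Hypothetical compressed payloads.
parts_in = {"metadata": b"\x01\x02", "quality": b"\x03", "bases": b""}
archive = pack_archive(parts_in)
```

Storing explicit lengths lets the decompressor locate each stream without scanning, so the three streams can also be decoded independently.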

In this embodiment, the technical solution improves FASTQ compression performance in two respects: compression ratio and compression speed. In a comparison at high-compression settings, for FASTQ data consisting mainly of variable-length long reads as well as for data consisting mainly of fixed-length short reads, the present invention achieves a higher compression ratio and a faster compression speed than mainstream general-purpose compression software in both modes (the reference-based compression mode and the non-reference-based compression mode). This advantage was demonstrated on two FASTQ data sets downloaded from the US National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/): ERR385912 (641 MB, read length 51, short reads) and ERR654984 (1164 MB, read lengths 64 to 502, long reads). In these tests, the compression ratio of the present invention was on average 10.7% and 15.2% higher than that of bzip2 and gzip, respectively, and its compression speed was on average 40.2% and 45.5% faster than that of bzip2 and gzip, respectively.
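A baseline measurement against gzip and bzip2 could be set up along the following lines. This is only a sketch of how such a comparison might be run with Python's standard `gzip` and `bz2` modules; it assumes that compression ratio means original size divided by compressed size, which the text does not state, and it does not reproduce the reported figures.

```python
import bz2
import gzip
import time
from typing import Dict


def benchmark(data: bytes) -> Dict[str, Dict[str, float]]:
    """Measure compression ratio and wall-clock time for gzip and bzip2."""
    results: Dict[str, Dict[str, float]] = {}
    for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {
            "ratio": len(data) / len(compressed),  # assumed definition of ratio
            "seconds": elapsed,
        }
    return results


# Hypothetical, highly repetitive input; real FASTQ data behave differently.
results = benchmark(b"ACGT" * 10000)
```

Running the same function on the raw FASTQ file gives the general-purpose baselines against which a specialized compressor's ratio and speed can be compared.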

The multi-threaded fast-storage lossless compression method for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By splitting FASTQ data into three types of data streams and compressing each stream separately with an efficient coding scheme, it greatly improves the compression ratio. It also provides an interface to upstream DNA data matching tools, so that it can be used in the reference-based compression mode, which exploits the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data. In addition, thread-level parallel programming with Pthreads allows the intermediate files produced by the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability and thus improves compression efficiency.

A specific embodiment of the present invention further provides a multi-threaded fast-storage lossless compression system 10 for FASTQ data, mainly comprising:

a data classification module 11, configured to input raw FASTQ data and divide the short reads of the raw FASTQ data into three data streams: metadata, quality scores and base sequences;

a data compression module 12, configured to: for the metadata, use delta encoding to detect and eliminate redundant information; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compress with an improved fixed-order arithmetic coding;

a data output module 13, configured to archive and merge the compression results of the different data streams and output the final compressed data.

The multi-threaded fast-storage lossless compression system 10 for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By splitting FASTQ data into three types of data streams and compressing each stream separately with an efficient coding scheme, it greatly improves the compression ratio. In addition, thread-level parallel programming with Pthreads allows the intermediate files produced by the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability and thus improves compression efficiency.

Refer to FIG. 2, which is a schematic structural diagram of the multi-threaded fast-storage lossless compression system 10 for FASTQ data according to an embodiment of the present invention.

In this embodiment, the multi-threaded fast-storage lossless compression system 10 for FASTQ data is applied to the compression of DNA sequences and mainly comprises a data classification module 11, a data compression module 12 and a data output module 13.

The data classification module 11 is configured to input raw FASTQ data and divide the short reads of the raw FASTQ data into three data streams: metadata, quality scores and base sequences.

In this embodiment, DNA sequencing produces thousands upon thousands of short reads, which are stored in FASTQ format files containing all the information produced by sequencing. In the widely used FASTQ format, each short read occupies four lines separated by newline characters. The first line begins with the character '@' followed by the metadata, which uniquely identifies the short read. The second line holds the base data, a sequence consisting only of the five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may be any one of {'A', 'T', 'C', 'G'}. The third line begins with the character '+', followed by the same read identifier as in the first line. The last line is the quality-score line; it corresponds one-to-one with the bases and gives the confidence of the sequencing call at each base position.

The data compression module 12 is configured to: for the metadata, use delta encoding to detect and eliminate redundant information; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compress with an improved fixed-order arithmetic coding.

In this embodiment, the data compression module 12 uses thread-level parallel programming with Pthreads to compress the three data streams simultaneously. The intermediate files produced by the three data streams can therefore be compressed at the same time, which greatly increases compression speed and broadens applicability.

For the metadata, the data compression module 12 uses delta encoding to detect and eliminate redundant information, then compresses the result with the improved fixed-order arithmetic coding, using multi-threaded parallel processing.

The data compression module 12 is specifically configured to:

for the quality scores, first compress the quality-score data stream with run-length encoding as a preprocessing step;

then compress the preprocessed data again with the bit-level PPM prediction model and arithmetic coding.

In this embodiment, because the quality scores contain many runs of consecutive identical characters, the data compression module 12 first applies run-length encoding as a preprocessing step to compress the quality-score data stream, and then compresses the preprocessed data again with the bit-level PPM prediction model and arithmetic coding, using multi-threaded parallel processing. In this embodiment, the bit-level PPM prediction model and arithmetic coding achieve a high compression ratio and compression speed when losslessly compressing quality scores, which are drawn from about 40 distinct characters and follow a pseudo-random distribution.

The data compression module 12 is specifically configured to:

for the base sequences, determine whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;

if it is the non-reference-based compression mode, compress the base-sequence data stream with the improved fixed-order arithmetic coding;

if it is the reference-based compression mode, align the DNA short reads with a DNA data matching tool to remove redundancy, record the corresponding match information in a SAM format file, and then extract and compress the saved SAM format file with the improved fixed-order arithmetic coding.

In this embodiment, because the performance of the DNA data matching tool directly affects the compression result, the SAM format file is used as the intermediate file and serves as the interface to upstream DNA data matching tools, which leaves room for later improvements in compression performance.

The DNA data matching tools in this embodiment, such as BWA (Burrows-Wheeler Aligner), Bowtie and CompMap, align DNA short reads and remove redundancy, perform fast parallel matching of DNA short reads, and are applied in the compressed storage of DNA sequences.

In this embodiment, for the base sequences, the present invention supports two DNA data compression modes, the non-reference-based compression mode and the reference-based compression mode; it determines which mode applies to the DNA sequence and processes the data accordingly. The specific processing for the two DNA data compression modes is the same as described for step S12 above and is not repeated here.

The data output module 13 is configured to archive and merge the compression results of the different data streams and output the final compressed data.

The multi-threaded fast-storage lossless compression system 10 for FASTQ data provided by the present invention makes full use of the biological characteristics of DNA data and achieves a very high compression ratio on both kinds of FASTQ data (long reads and short reads). By splitting FASTQ data into three types of data streams and compressing each stream separately with an efficient coding scheme, it greatly improves the compression ratio. It also provides an interface to upstream DNA data matching tools, so that it can be used in the reference-based compression mode, which exploits the high similarity between the genomes of homologous species to further improve the compression ratio of resequencing data. In addition, thread-level parallel programming with Pthreads allows the intermediate files produced by the three data streams to be compressed simultaneously, which greatly increases compression speed, broadens applicability and thus improves compression efficiency.

It should be noted that the units included in the above embodiments are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of the present invention.

In addition, a person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A multi-threaded fast-storage lossless compression method for FASTQ data, applied to the compression of DNA sequences, characterized in that the method comprises:
a data classification step: inputting raw FASTQ data, and dividing the short reads of the raw FASTQ data into three data streams: metadata, quality scores and base sequences;
a data compression step: for the metadata, using delta encoding to detect and eliminate redundant information in the metadata; for the quality scores, compressing with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compressing with modified fixed-order arithmetic coding;
a data output step: archiving and merging the compression results of the different data streams, and outputting the final compressed data.

2. The multi-threaded fast-storage lossless compression method for FASTQ data according to claim 1, characterized in that thread-level parallel programming with Pthreads is used in the data compression step to process the compression of the three data streams simultaneously.

3. The multi-threaded fast-storage lossless compression method for FASTQ data according to claim 2, characterized in that the data compression step specifically comprises:
for the quality scores, performing a first compression of the quality-score data stream with run-length encoding as preprocessing;
compressing the preprocessed data again with the bit-level PPM prediction model and arithmetic coding.

4. The multi-threaded fast-storage lossless compression method for FASTQ data according to claim 2, characterized in that the data compression step specifically comprises:
for the base sequences, determining whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;
if it is the non-reference-based compression mode, compressing the base-sequence data stream with the modified fixed-order arithmetic coding;
if it is the reference-based compression mode, aligning the DNA short reads with a DNA data matching tool and removing redundancy, recording the corresponding matching information and saving it as a SAM-format file, and then extracting and compressing the saved SAM-format file with the modified fixed-order arithmetic coding.

5. A multi-threaded fast-storage lossless compression system for FASTQ data, characterized in that the system comprises:
a data classification module, configured to input raw FASTQ data and divide the short reads of the raw FASTQ data into three data streams: metadata, quality scores and base sequences;
a data compression module, configured to: for the metadata, use delta encoding to detect and eliminate redundant information in the metadata; for the quality scores, compress with a bit-level PPM prediction model and arithmetic coding; for the base sequences, compress with modified fixed-order arithmetic coding;
a data output module, configured to archive and merge the compression results of the different data streams and output the final compressed data.

6. The multi-threaded fast-storage lossless compression system for FASTQ data according to claim 5, characterized in that the data compression module uses thread-level parallel programming with Pthreads to process the compression of the three data streams simultaneously.

7. The multi-threaded fast-storage lossless compression system for FASTQ data according to claim 6, characterized in that the data compression module is specifically configured to:
for the quality scores, perform a first compression of the quality-score data stream with run-length encoding as preprocessing;
compress the preprocessed data again with the bit-level PPM prediction model and arithmetic coding.

8. The multi-threaded fast-storage lossless compression system for FASTQ data according to claim 6, characterized in that the data compression module is specifically configured to:
for the base sequences, determine whether the compression mode of the DNA sequence is the non-reference-based compression mode or the reference-based compression mode;
if it is the non-reference-based compression mode, compress the base-sequence data stream with the modified fixed-order arithmetic coding;
if it is the reference-based compression mode, align the DNA short reads with a DNA data matching tool and remove redundancy, record the corresponding matching information and save it as a SAM-format file, and then extract and compress the saved SAM-format file with the modified fixed-order arithmetic coding.
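The two preprocessing techniques named in the claims, delta encoding of metadata and run-length encoding of quality scores, can be illustrated with a minimal sketch. The claims do not define the exact encodings, so the prefix-sharing form of delta encoding used here, the `(symbol, run length)` representation, and all function names are assumptions made for illustration.

```python
from itertools import groupby


def delta_encode(headers: list[str]) -> list[tuple[int, str]]:
    """Store each header as (length of prefix shared with the previous header, suffix)."""
    encoded, prev = [], ""
    for header in headers:
        n = 0
        while n < min(len(header), len(prev)) and header[n] == prev[n]:
            n += 1
        encoded.append((n, header[n:]))
        prev = header
    return encoded


def delta_decode(encoded: list[tuple[int, str]]) -> list[str]:
    """Rebuild the headers by re-attaching each suffix to the shared prefix."""
    headers, prev = [], ""
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        headers.append(prev)
    return headers


def rle_encode(quality: str) -> list[tuple[str, int]]:
    """Collapse runs of identical quality characters into (character, count) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(quality)]


def rle_decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in runs)
```

Both transforms are lossless and only reduce redundancy; in the claimed method their output would then be passed to the entropy-coding stage.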
PCT/CN2016/085426 2016-06-12 2016-06-12 Multi-thread fast storage lossless compression method and system for fastq data Ceased WO2017214765A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/085426 WO2017214765A1 (en) 2016-06-12 2016-06-12 Multi-thread fast storage lossless compression method and system for fastq data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/085426 WO2017214765A1 (en) 2016-06-12 2016-06-12 Multi-thread fast storage lossless compression method and system for fastq data

Publications (1)

Publication Number Publication Date
WO2017214765A1 true WO2017214765A1 (en) 2017-12-21

Family

ID=60662902

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/085426 Ceased WO2017214765A1 (en) 2016-06-12 2016-06-12 Multi-thread fast storage lossless compression method and system for fastq data

Country Status (1)

Country Link
WO (1) WO2017214765A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817277A (en) * 2018-12-29 2019-05-28 北京百迈客生物科技有限公司 Quality control method based on PacBio overall length transcript profile sequencing data
CN110111852A (en) * 2018-01-11 2019-08-09 广州明领基因科技有限公司 A kind of magnanimity DNA sequencing data lossless Fast Compression platform
CN110535846A (en) * 2019-08-22 2019-12-03 中国电力科学研究院有限公司 A data frame compression method and system based on DL/T698.45 protocol
US10554220B1 (en) 2019-01-30 2020-02-04 International Business Machines Corporation Managing compression and storage of genomic data
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 A kind of parallel compression and decompression method and system of FASTQ file
CN111640467A (en) * 2020-05-25 2020-09-08 西安电子科技大学 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence
CN111881324A (en) * 2020-07-30 2020-11-03 苏州工业园区服务外包职业学院 High-throughput sequencing data universal storage format structure, construction method and application thereof
CN112102883A (en) * 2020-08-20 2020-12-18 深圳华大生命科学研究院 Base sequence coding method and system in FASTQ file compression
CN114067910A (en) * 2021-11-15 2022-02-18 厦门大学 Single cell upstream big data processing method based on UMI-tools and Spark
WO2022073225A1 (en) * 2020-10-10 2022-04-14 中国科学院深圳先进技术研究院 Dna storage-based incremental information management method and device
CN117133365A (en) * 2023-08-14 2023-11-28 南开大学 A parallel compression method for high-throughput genome sequencing quality score data
CN118138055A (en) * 2024-03-25 2024-06-04 浙江大学 Gene data lossless compression system
CN119576881A (en) * 2024-12-06 2025-03-07 厦门大学 A method for processing astronomical data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data
WO2015120170A1 (en) * 2014-02-05 2015-08-13 Bigdatabio, Llc Methods and systems for biological sequence compression transfer and encryption
US20150227686A1 (en) * 2014-02-12 2015-08-13 International Business Machines Corporation Lossless compression of dna sequences

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AKGUN, METE ET AL.: "A new PPM model for quality score compression", SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU, 26 April 2013 (2013-04-26), XP032423184, ISSN: 2165-0608 *
BEHZADI, BEHSHAD ET AL.: "DNA Compression Challenge Revisited: A Dynamic Programming Approach", COMBINATORIAL PATTERN MATCHING, 22 June 2005 (2005-06-22), XP055297614, ISSN: 0302-9743, DOI: doi:10.1007/11496656_17 *
ZHANG, YONGPENG: "Lossless Comprssion of High-through DNA Sequence Data", CHINA MASTER'S THESES FULL-TEXT DATABASE BASIC SCIENCE, 15 December 2015 (2015-12-15), pages 3 - 5 , 12-15 and 20-25, ISSN: 1674-0246 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111852A (en) * 2018-01-11 2019-08-09 广州明领基因科技有限公司 A kind of magnanimity DNA sequencing data lossless Fast Compression platform
CN109817277B (en) * 2018-12-29 2022-03-18 北京百迈客生物科技有限公司 Quality control method based on PacBio full-length transcriptome sequencing data
CN109817277A (en) * 2018-12-29 2019-05-28 北京百迈客生物科技有限公司 Quality control method based on PacBio overall length transcript profile sequencing data
US10554220B1 (en) 2019-01-30 2020-02-04 International Business Machines Corporation Managing compression and storage of genomic data
US10778246B2 (en) 2019-01-30 2020-09-15 International Business Machines Corporation Managing compression and storage of genomic data
CN110535846B (en) * 2019-08-22 2022-03-04 中国电力科学研究院有限公司 A data frame compression method and system based on DL/T698.45 protocol
CN110535846A (en) * 2019-08-22 2019-12-03 中国电力科学研究院有限公司 A data frame compression method and system based on DL/T698.45 protocol
CN111640467A (en) * 2020-05-25 2020-09-08 西安电子科技大学 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence
CN111640467B (en) * 2020-05-25 2023-03-24 西安电子科技大学 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 A kind of parallel compression and decompression method and system of FASTQ file
CN111628779B (en) * 2020-05-29 2023-10-20 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN111881324A (en) * 2020-07-30 2020-11-03 苏州工业园区服务外包职业学院 High-throughput sequencing data universal storage format structure, construction method and application thereof
CN111881324B (en) * 2020-07-30 2023-12-15 苏州工业园区服务外包职业学院 High-throughput sequencing data general storage format structure, construction method and application thereof
CN112102883A (en) * 2020-08-20 2020-12-18 深圳华大生命科学研究院 Base sequence coding method and system in FASTQ file compression
CN112102883B (en) * 2020-08-20 2023-12-08 深圳华大生命科学研究院 Base sequence coding method and system in FASTQ file compression
WO2022073225A1 (en) * 2020-10-10 2022-04-14 中国科学院深圳先进技术研究院 Dna storage-based incremental information management method and device
CN114067910A (en) * 2021-11-15 2022-02-18 厦门大学 Single cell upstream big data processing method based on UMI-tools and Spark
CN117133365A (en) * 2023-08-14 2023-11-28 南开大学 A parallel compression method for high-throughput genome sequencing quality score data
CN118138055A (en) * 2024-03-25 2024-06-04 浙江大学 Gene data lossless compression system
CN119576881A (en) * 2024-12-06 2025-03-07 厦门大学 A method for processing astronomical data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16904884

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11/04/2019)

122 Ep: pct application non-entry in european phase

Ref document number: 16904884

Country of ref document: EP

Kind code of ref document: A1