
WO2024187428A1 - Assembly process for constructing high-quality microbial genomes on basis of stlfr metagenomic sequencing data - Google Patents


Info

Publication number
WO2024187428A1
WO2024187428A1 PCT/CN2023/081733 CN2023081733W WO2024187428A1 WO 2024187428 A1 WO2024187428 A1 WO 2024187428A1 CN 2023081733 W CN2023081733 W CN 2023081733W WO 2024187428 A1 WO2024187428 A1 WO 2024187428A1
Authority
WO
WIPO (PCT)
Prior art keywords
assembly
reads
contigs
sequence
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2023/081733
Other languages
French (fr)
Chinese (zh)
Inventor
张真苗
张璐
方晓东
黄玉芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Technology Solutions Co Ltd
Hong Kong Baptist University HKBU
Original Assignee
BGI Technology Solutions Co Ltd
Hong Kong Baptist University HKBU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Technology Solutions Co Ltd and Hong Kong Baptist University HKBU
Priority to PCT/CN2023/081733
Publication of WO2024187428A1
Legal status: Pending

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Definitions

  • the present invention relates to the field of bioinformatics. Specifically, the present invention relates to an assembly process for constructing high-quality microbial genomes based on stLFR metagenomic sequencing data. More specifically, the present invention relates to a method for clustering metagenomic data before assembly, a method for assembling metagenomic data, an electronic device, and a computer-readable storage medium.
  • the intestinal flora is closely related to human health and is expected to become a diagnostic indicator, therapeutic target and prognosis monitoring indicator for many diseases. Therefore, it is extremely important to study the intestinal flora.
  • the study of the composition, abundance, function and interaction of the intestinal flora with human immunity and metabolism based on metagenomics has important scientific significance and clinical guidance value.
  • de novo assembly based on metagenomic sequencing data and the construction of a complete genome of intestinal microorganisms is the premise and basis for accurately predicting microbial gene functions and quantifying the abundance of intestinal microorganisms and metabolic pathways.
  • due to the diversity and complexity of species contained in the intestinal microbial community it is difficult to solve the following problems widely existing in the community microorganisms: 1.
  • metagenomic sequencing is mainly divided into second-generation sequencing and third-generation sequencing technology.
  • the second-generation short-read sequencing uses sequencing by synthesis to produce high-quality short reads. It has the advantages of high throughput, low cost, and a low single-base error rate, and is now the mainstream sequencing method.
  • the third-generation long-fragment single-molecule sequencing technology such as PacBio and Oxford Nanopore has a huge advantage in the read length of the generated data (the average sequencing read length is 10-14Kbp, and the longest can reach 40Kbp), which can effectively solve the assembly difficulties caused by complex regions or repetitive sequences in the genome.
  • Short reads with the same barcode are very likely to come from the same long DNA fragment ( ⁇ 50kb).
  • Barcode information can be used to track short reads from the same long DNA fragment template, and use this information to obtain scaffolds with higher accuracy and longer length.
  • this technology is mostly used for the assembly of a single species, including humans, animals and plants.
  • for metagenomes, only one assembly tool (Athena) can perform metagenome assembly for stLFR data.
  • in Athena, when the barcode specificity of the linked reads is low in the 10x Genomics sequencing data, many short off-target contig sequences are generated. In addition, because local assembly requires a high sequencing depth, Athena's strategy cannot obtain local assembly results that connect two adjacent contigs for low-abundance species, so its assembly of low-abundance species is poor. Furthermore, because 10x Genomics technology requires the purchase of a specialized microfluidics system for sample preparation, the cost of data generation is greatly increased.
  • the metagenome assembly process MetaTrass of stLFR data classifies sequencing data based on the reference genome, and can only assemble species in the reference genome, ignoring the unique microorganisms in the sample. For low-depth species, the assembly effect is poor. Due to some technical defects, DNA sequences from the same fragment may have different barcodes, and this assembly process can only classify the same barcodes of the sequence into one category, reducing the availability of data.
  • the present invention aims to solve one of the technical problems in the related art at least to a certain extent.
  • the existing linked-reads assembly software has poor assembly effect on low-depth species, cannot take into account microorganisms with different abundances in samples, and the linked-reads sequencing technology is underdeveloped.
  • the present invention develops a metagenomic assembly method that clusters first and then assembles, and uses multiple thresholds to reassemble low-abundance species after assembly to improve the assembly effect of low-abundance species.
  • the present invention proposes a method for clustering and then assembling metagenomic data.
  • the method includes: clustering co-barcoded linked reads based on their sequence features in the metagenomic linked-reads sequencing data, and assembling each resulting cluster to generate contigs bins; and, based on the abundance information of the contigs bins, extracting reads of low-abundance species for reassembly.
  • Clustering co-barcoded-linked-reads and applying a multi-threshold reassembly strategy can improve the genome coverage of low-abundance species in the assembly results. For example, for the strain with the lowest abundance (0.02%) in the ATCC-MSA-1004 data set of the embodiment, the assembly method of the present invention shows a higher genome coverage than other assembly methods.
  • the above clustering method may further include at least one of the following additional technical features:
  • the sequence features of the linked-reads include at least one of a K-mer frequency and a tetranucleotide frequency.
  • the contigs bin in the present invention refers to the assembly result (first assembly) obtained by binning the linked-reads sequencing data from the metagenomics using the above two sequence characteristics and then assembling each bin separately.
  • the present invention provides a method for assembling metagenomic data.
  • the method comprises: clustering the metagenomic sequencing data using the method described in the first aspect of the present invention; and assembling the metagenomic sequencing data after the clustering process.
  • the metagenome can be spliced and assembled by the above method to obtain the genome sequence of low-abundance species in environmental samples.
  • the present invention provides a method for reassembling metagenomic sequencing data.
  • the method comprises: performing a first assembly process on the metagenomic sequencing data using the method described in the second aspect of the present invention; and performing a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.
  • the inventors have found that performing a second assembly process on the metagenomics can alleviate the negative impact of clustering first and then assembling on the assembly quality of low-abundance species, and comprehensively improve the quality of metagenomic assembly.
  • the above-mentioned method for metagenomic sequencing and assembly may also include at least one of the following additional technical features:
  • the second assembly process is performed in the following manner: the metagenomic sequencing data that has undergone the first assembly process is compared with the contigs bin sequence to obtain the sequencing depth of the contigs bin that has undergone the first assembly process, and based on the sequencing depth, the reads of the sequenced species are obtained for the second assembly process to obtain the contigs low sequence.
  • the contigs bin sequence is derived from the contigs bin sequence of the first assembly result.
  • the comparison process is performed by BWA software processing.
  • the sequencing depth being less than a predetermined threshold is an indication for performing the second assembly process.
  • the predetermined threshold is an integer not less than 1. According to an embodiment of the present invention, it is generally a multiple of 10, and is recommended to be less than 100.
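  • As a minimal sketch of the depth-based selection described above, the following Python snippet computes a mean depth per contig and lists the contigs whose depth falls below each of several thresholds. The function names, the input dictionary, and the example thresholds (10, 20, 30) are illustrative assumptions, not part of the disclosed process; in practice the aligned-base counts would be derived from the BWA alignments.

```python
def mean_depth(aligned_bases, contig_length):
    """Mean sequencing depth of a contig: aligned bases divided by contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return aligned_bases / contig_length


def low_depth_contigs(depths, threshold):
    """Return the names of contigs whose depth is below `threshold`.

    `depths` maps a contig name to its mean depth. The threshold must be an
    integer not less than 1, as stated in the text.
    """
    if not isinstance(threshold, int) or threshold < 1:
        raise ValueError("threshold must be an integer >= 1")
    return sorted(name for name, depth in depths.items() if depth < threshold)


def select_per_threshold(depths, thresholds=(10, 20, 30)):
    """Apply several thresholds; the default values are hypothetical examples."""
    return {t: low_depth_contigs(depths, t) for t in thresholds}
```

  • For example, with depths {"c1": 5.0, "c2": 15.0, "c3": 120.0} and thresholds (10, 30), contig c1 is selected at threshold 10 and contigs c1 and c2 at threshold 30. The reads mapped to the selected contigs would then be passed to the second assembly.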
  • the contigs low sequence is merged with the contigs bin sequence and the intermediate contig sequence of the local assembly respectively to obtain the final asm.fa assembly intermediate result; and the final asm.fa assembly intermediate result and the final contig sequence of the local assembly are subjected to a third assembly process using quickmerge software.
  • the contigs bin sequence is obtained by assembling the sequencing reads in several co-barcoded-linked-reads groups formed after clustering through MEGAHIT software.
  • the clustering method will cluster co-barcoded linked-reads with similar composition or abundance together, thereby reducing the complexity of metagenome assembly and obtaining contigs bins .
  • the local assembly contig sequence is an intermediate assembly result after the contigs ori sequence is processed by Athena software.
  • the contigs ori sequence is obtained by subjecting all linked-reads before clustering processing to assembly processing by metaSPAdes software.
  • the merging process is performed through the "--subassemblies" module process of the metaFlye software.
  • the Athena asm.fa assembly intermediate result is the final assembly result after the contigs ori sequence is processed by Athena software.
  • the Athena asm.fa and the final asm.fa assembly intermediate result after the merging process are merged and circularized using quickmerge software to obtain a high-quality microbial genome.
  • the present invention proposes a pre-assembly clustering device for metagenomic data.
  • the device comprises: a first clustering unit, which clusters co-barcoded linked reads based on the sequence features of co-barcoded linked reads in the linked-reads sequencing data of the metagenomics, and generates contigs bins for each generated class assembly; and a reassembly unit, which is connected to the first clustering unit, and extracts reads of low-abundance species for reassembly based on the abundance information of the contigs bin .
  • the apparatus of the present invention can be used to cluster and then assemble metagenomic sequencing data.
  • the sequencing read information includes at least one of a K-mer frequency and a tetranucleotide frequency.
  • the device further includes: a screening unit, which is connected to the first clustering unit and the reassembly unit, and is used to screen the sequencing reads in the co-barcoded-linked-reads to obtain sequencing reads derived from the same genomic fragment.
  • the present invention proposes a metagenomic data assembly device.
  • the device comprises: the metagenomic data pre-assembly clustering device described in the fourth aspect of the present invention, which is used to perform clustering processing on metagenomic sequencing data; and
  • An assembly device is connected to the metagenome data pre-assembly clustering device and is used to assemble the sequencing data of the clustered metagenome.
  • the device can be used for assembling metagenomic data.
  • the present invention proposes a metagenomic sequencing data assembly system.
  • the system comprises: the metagenomic data assembly device according to the fifth aspect of the present invention, which is used to perform a first assembly process on the metagenomic sequencing data; and
  • the second assembly device is connected to the metagenomic data assembly device and is used to perform a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.
  • the system can be used to assemble metagenomic sequencing data.
  • the above-mentioned metagenomic sequencing data assembly system may also include at least one of the following additional technical features:
  • the second assembly device includes: a comparison processing device, which is used to compare the metagenomic sequencing data after the first assembly processing with the contigs bin sequence, so as to obtain the sequencing depth of the contigs bin after the first assembly processing; an acquisition device, which is connected to the comparison processing device and is used to acquire the reads of the sequencing species based on the sequencing depth for second assembly processing to obtain contigs low sequences; a merging device, which is connected to the acquisition device and merges the contigs low sequence with the contigs bin sequence and the intermediate contig sequence of the local assembly, respectively, so as to obtain the final asm.fa assembly intermediate result; and a third assembly processing device, which is connected to the merging device and performs a third assembly processing on the final asm.fa assembly intermediate result and the final contig sequence of the local assembly using quickmerge software.
  • the comparison process is performed by BWA software processing.
  • the sequencing depth being less than a predetermined threshold is an indication for performing the acquisition process, wherein the acquisition process is based on the sequencing depth, and the reads of the sequenced species are acquired for a second assembly process to obtain contigs low sequences.
  • the predetermined threshold is an integer not less than 1. According to an embodiment of the present invention, it is generally a multiple of 10, and is recommended to be less than 100.
  • the third assembly process is performed by assembling the final asm.fa intermediate result and the local assembled final contig sequence using quickmerge software.
  • the contigs bin sequence is obtained by assembling the sequencing reads in several co-barcoded-linked-reads groups formed after clustering through MEGAHIT software.
  • the local assembly contig sequence is an intermediate assembly result after the contigs ori sequence is processed by Athena software.
  • the contigs ori sequence is obtained by subjecting all linked-reads before clustering processing to assembly processing by metaSPAdes software.
  • the merging process is performed through the "--subassemblies" module process of the metaFlye software.
  • the Athena asm.fa assembly intermediate result is the final assembly result after the contigs ori sequence is processed by Athena software.
  • the present invention provides an electronic device.
  • the electronic device comprises: a memory and a processor, the memory is used to store a computer program; the processor is used to execute the computer program to implement the method described in any one of the first aspect to the third aspect of the present invention.
  • the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory,
  • the present invention provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, and when the computer program instructions are run on a processor, the processor executes the method described in any one of the first aspect to the third aspect of the present invention.
  • FIG. 1 is a diagram of a clustering device before metagenomic data assembly according to a specific embodiment of the present invention.
  • FIG. 2 is a diagram of a clustering device (including a screening device) before metagenomic data assembly according to a specific embodiment of the present invention.
  • FIG. 3 is a diagram of a metagenomic data assembly device according to a specific embodiment of the present invention.
  • FIG. 4 is a diagram of a metagenomic sequencing data assembly system according to a specific embodiment of the present invention.
  • FIG. 5 is a system diagram of a second assembly device in a metagenomic sequencing data assembly system according to a specific embodiment of the present invention.
  • FIG. 6 is a Pangaea workflow diagram of stLFR linked-reads according to an embodiment of the present invention.
  • FIG. 7 is a comparison diagram of the results of binning stLFR linked-reads of the microbial community ATCC-MSA-1003 according to Example 1 of the present invention using PCA and VAE.
  • FIG. 8 is a comparison diagram of the results of binning stLFR linked-reads of the microbial community ATCC-MSA-1003 according to Example 1 of the present invention using RPH-kmeans, Gaussian mixture model (GMM) and K-means.
  • FIG. 9 shows the precision, recall, F1 and ARI values of the co-barcoded linked-reads clustering results using different numbers of classes (k) for stLFR linked-reads of the microbial community ATCC-MSA-1003 according to Example 1 of the present invention.
  • FIG. 11 is a comparison result of the number of nearly complete genomes generated using different assembly tools according to Example 2 of the present invention.
  • FIG. 12 shows, according to Example 2 of the present invention, a scatter plot (left) of the genome collinearity analysis between the nearly complete genome generated by Pangaea and its closest reference genome, and a circular plot (right) comparing the genomes of the same microorganism generated by different assembly software.
  • the terms “first”, “second”, “third”, “fourth”, etc. are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated; features specified as “first”, “second”, etc. may explicitly or implicitly include one or more of the said features.
  • nucleotide refers to four natural nucleotides (such as dATP, dCTP, dGTP and dTTP or ATP, CTP, GTP and UTP) or their derivatives, and is sometimes directly represented by the bases (A, T/U, C, G) contained therein.
  • sequencing refers to sequence determination, which is the same as “nucleic acid sequencing” or “gene sequencing”, and refers to the determination of the base order in a nucleic acid sequence; it includes synthesis sequencing (sequencing by synthesis, SBS) and/or ligation sequencing (sequencing by ligation, SBL); it includes DNA sequencing and/or RNA sequencing; it includes long fragment sequencing and/or short fragment sequencing, and the so-called long fragments and short fragments are relative, such as nucleic acid molecules longer than 1Kb, 2Kb, 5Kb or 10Kb can be called long fragments, and those shorter than 1Kb or 800bp can be called short fragments; it includes double-end sequencing, single-end sequencing and/or paired-end sequencing, etc., and the so-called double-end sequencing or paired-end sequencing can refer to the reading of any two segments or two parts of the same nucleic acid molecule that do not completely overlap.
  • the present application method is used for linked reads.
  • the sequencing results or sequencing data obtained through the sequencing reaction are called reads, and the length of the read is called the read length, which refers to the base sequence obtained by a single sequencing of the sequencer.
  • the metagenome refers to the sum of all biological genetic materials in a specific environment. Generally, total DNA is extracted directly from environmental samples, and new functional genes and bioactive substances are obtained by constructing and screening metagenome libraries. Metagenome libraries include both culturable and unculturable microbial genetic information. Through metagenome analysis, the diversity resources of microorganisms can be fully developed and utilized.
  • linked-read sequencing is described to improve metagenomic assembly by connecting the same barcode to the sequences of long DNA fragments (10-100kb) in order to eliminate some of the misreads.
  • the contigs sequence refers to a larger fragment assembled from multiple reads through the overlaps between them; such a group of overlapping reads is called a contig, that is, the contig is the sequence obtained by splicing overlapping reads together.
  • the barcode refers to the identity tag of each sample in NGS sequencing
  • the co-barcoded-linked-reads sequence refers to the linked-reads sequence derived from the same fragment, connected to a segment of the same barcode sequence, which is ultimately reflected in read2 of the sequencing result and is used to identify the source of the fragment.
  • Metagenomic data clustering and then assembly method
  • sequence characteristics of the linked-reads include at least one of the K-mer frequency and the tetranucleotide frequency; the K-mer refers to a DNA fragment of length K, which is obtained by cutting a portion of the sequencing reads.
  • K-mer has the following functions:
  • the K-mer frequency and TNF are extracted from co-barcoded linked-reads with a total length greater than 2Kb to ensure feature stability.
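  • The length filter above can be sketched as follows. This is a minimal illustration, assuming the reads are already available as (barcode, sequence) pairs and that "2Kb" means 2000 bases; the function names and the input layout are hypothetical and are not taken from the disclosed software.

```python
from collections import defaultdict

MIN_TOTAL_LENGTH = 2000  # assumption: "greater than 2Kb" read as more than 2000 bases


def group_by_barcode(reads):
    """Group read sequences by barcode.

    `reads` is an iterable of (barcode, sequence) pairs (hypothetical layout).
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


def filter_groups(groups, min_total_length=MIN_TOTAL_LENGTH):
    """Keep only barcodes whose reads sum to more than `min_total_length` bases."""
    return {
        barcode: seqs
        for barcode, seqs in groups.items()
        if sum(len(s) for s in seqs) > min_total_length
    }
```

  • Only the barcodes that pass this filter would go on to feature extraction; shorter groups carry too few k-mers for stable frequency estimates.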
  • the K-mer frequency is calculated based on a histogram of global K-mer occurrences, which follows a Poisson distribution with an average value equal to the abundance of the microorganism.
  • the inventors divided the frequency range 0-4000 into 400 bins of equal width (10 each). For each co-barcoded linked-read, its sequence was cut into 15-mers, and these 15-mers were assigned to the 400 bins according to their global frequency. By counting the number of 15-mers in each bin, a 400-dimensional count vector was generated as the K-mer frequency feature of this co-barcoded linked-read.
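  • The binning of 15-mers by global frequency can be sketched in Python as below. This is an illustration under stated assumptions: bins are treated as half-open intervals [0, 10), [10, 20), ..., [3990, 4000), k-mers with a global frequency of 4000 or more are skipped, and k-mers are counted as they appear on the read strand. The function names are hypothetical.

```python
from collections import Counter

K = 15
NUM_BINS = 400
BIN_WIDTH = 10
MAX_FREQ = NUM_BINS * BIN_WIDTH  # 4000


def iter_kmers(sequence, k=K):
    """Yield every k-mer of `sequence` with a sliding window of step 1."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def global_kmer_counts(all_sequences, k=K):
    """Count how often each k-mer occurs across all reads (global frequency)."""
    counts = Counter()
    for seq in all_sequences:
        counts.update(iter_kmers(seq, k))
    return counts


def kmer_frequency_feature(group_sequences, global_counts, k=K):
    """Build the 400-dimensional count vector for one co-barcoded read group."""
    feature = [0] * NUM_BINS
    for seq in group_sequences:
        for kmer in iter_kmers(seq, k):
            freq = global_counts[kmer]
            if freq >= MAX_FREQ:  # assumption: out-of-range k-mers are dropped
                continue
            feature[freq // BIN_WIDTH] += 1
    return feature
```

  • The `k` parameter is exposed only so that the sketch can be exercised on short toy sequences; the text uses k = 15.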
  • the TNF (tetranucleotide frequency) feature was constructed by extracting the frequencies of 136 non-redundant 4-mers for each co-barcoded linked-read.
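  • The count of 136 is consistent with collapsing each 4-mer with its reverse complement (256 4-mers, 16 of which are their own reverse complement, giving (256 + 16) / 2 = 136). The following Python sketch enumerates these canonical 4-mers and counts them for one read group. Treating "non-redundant" as reverse-complement collapsing is an assumption, as are the function names and the choice to skip windows containing ambiguous bases.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# All distinct canonical 4-mers, in a fixed (sorted) order.
CANONICAL_4MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_4MERS)}


def tnf_feature(group_sequences):
    """Count canonical 4-mers over all reads of one co-barcoded group."""
    feature = [0] * len(CANONICAL_4MERS)
    for seq in group_sequences:
        seq = seq.upper()
        for i in range(len(seq) - 3):
            kmer = seq[i:i + 4]
            if not set(kmer) <= VALID_BASES:  # skip windows with N or other symbols
                continue
            feature[INDEX[canonical(kmer)]] += 1
    return feature
```

  • Because AAAA and TTTT are reverse complements of each other, both contribute to the same entry of the vector.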
  • the K-mer frequency and TNF features were L1-normalized to eliminate the data skew introduced by co-barcoded linked-reads of different lengths.
  • the inventors concatenated the normalized K-mer frequency (XA) and TNF (XT) features into a 536-dimensional vector as the input of the VAE.
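  • The normalization and concatenation steps can be sketched as follows: each count vector is divided by the sum of its absolute values (L1 norm), and the 400-dimensional and 136-dimensional results are joined into one 536-dimensional input. The function names are hypothetical, and returning an all-zero vector for an empty count vector is an assumption made for this sketch.

```python
def l1_normalize(vector):
    """Scale a vector so that the sum of its absolute values is 1."""
    total = sum(abs(v) for v in vector)
    if total == 0:
        return [0.0] * len(vector)  # assumption: leave empty vectors as zeros
    return [v / total for v in vector]


def build_vae_input(kmer_counts, tnf_counts):
    """Concatenate the normalized K-mer (400) and TNF (136) features."""
    if len(kmer_counts) != 400 or len(tnf_counts) != 136:
        raise ValueError("expected 400 K-mer bins and 136 TNF entries")
    return l1_normalize(kmer_counts) + l1_normalize(tnf_counts)
```

  • Normalizing the two blocks separately keeps each on the same scale regardless of how many bases the co-barcoded read group contains.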
  • the output of the last layer of the encoder is fed to two hidden layers with 32 hidden neurons each, which output μ and σ, respectively, as the parameters of the Gaussian distribution N(μ, σ²).
  • the embedding Z of the VAE is sampled from this Gaussian distribution.
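  • Sampling the embedding from a Gaussian with a given mean and standard deviation is commonly written as mean plus standard deviation times standard normal noise (the reparameterization used in VAEs). The sketch below shows only this sampling step in plain Python; the 32-dimensional size follows the text, while the function name and the use of the standard `random` module are illustrative assumptions and do not reflect the training framework actually used.

```python
import random

LATENT_DIM = 32  # number of neurons in each of the two parameter layers


def sample_embedding(mu, sigma, rng=random):
    """Draw Z with Z[i] = mu[i] + sigma[i] * eps[i], where eps[i] ~ N(0, 1)."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

  • Passing a seeded `random.Random` instance as `rng` makes the draw reproducible; with a standard deviation of zero the sample reduces to the mean.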
  • the decoder of the VAE contains two fully connected hidden layers of the same size as the encoder layers to reconstruct the input features from the embedding Z.
  • VAE is trained with early stopping to reduce training time and avoid overfitting.
  • RPH-kmeans Xie, Huang et al. 2020
  • for each group of co-barcoded linked reads with a total length greater than 2 kb, the 15-mer and tetranucleotide features are calculated. The frequency of each 15-mer is counted, 15-mers with a frequency greater than 4000 are removed, and the remaining 15-mers are distributed into 400 bins according to their frequency of occurrence. At the same time, the frequencies of all 136 non-redundant 4-mers are computed for each group of co-barcoded linked reads to construct the tetranucleotide features. The 15-mer frequency and tetranucleotide features are L1-normalized.
  • the metagenomic data linked-reads are clustered before assembly, and then the clustered co-barcoded-linked-reads sequences are assembled for metagenomics.
  • step (3) using the barcode information to obtain reads at the junction of contigs, wherein the contigs are derived from adjacent contig sequences in the scaffold graph sequence in step (2);
  • MetaFlye software was used to merge the local assembly results and seed contig sequences to obtain the Athena asm.fa database.
  • Metagenomic sequencing data reassembly method
  • MetaSPAdes software was used to assemble the co-barcoded-linked-reads sequences with a sequencing depth less than t i (threshold);
  • the present invention proposes a pre-assembly clustering device for metagenomic data.
  • the device comprises a first clustering unit S100, which is used to cluster co-barcoded linked reads based on sequence features of co-barcoded linked reads in the linked-reads sequencing data of the metagenomics, and to assemble each resulting cluster to generate contigs bins; and a reassembly unit S200, which is connected to the first clustering unit S100, and extracts reads of low-abundance species for reassembly based on abundance information of the contigs bins.
  • a metagenomic data pre-assembly clustering device described in the present application further includes: a screening unit S101.
  • the screening unit S101 is connected to the first clustering unit S100 and the reassembly unit S200, and is used to screen the sequencing reads in the several co-barcoded-linked-reads groups to obtain sequencing reads derived from the same genome fragment.
  • the present invention proposes a metagenomic data assembly device.
  • the device includes the aforementioned metagenomic data pre-assembly clustering device 300, which is used to perform clustering processing on metagenomic sequencing data; and
  • the assembly device 400 is connected to the metagenome data pre-assembly clustering device 300 and is used to assemble the sequencing data of the clustered metagenome.
  • the present invention proposes a system for assembling metagenomic data.
  • the system includes the aforementioned metagenomic data assembly device 500, which is used to perform a first assembly process on metagenomic sequencing data; and
  • the second assembly device 600 is connected to the metagenomic data assembly device 500 and is used to perform a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.
  • the second assembly device 600 in the metagenomic data assembly system of the present invention further comprises: a comparison processing device 601, the comparison processing device 601 is used to compare the metagenomic sequencing data after the first assembly process with the contigs bin sequence, so as to obtain the sequencing depth of the contigs bin after the first assembly process;
  • An acquisition device 602 the acquisition device 602 is connected to the comparison processing device 601, and is used to acquire reads of the sequenced species based on the sequencing depth, perform a second assembly process, and obtain contigs low sequences;
  • a merging device 603 which is connected to the acquiring device 602, and merges the plurality of contigs low sequence groups with the contigs bin sequences and the intermediate contig sequences of the local assembly, so as to obtain the final asm.fa assembly intermediate result;
  • the third assembly processing device 604 is connected to the merging device 603, and the final asm.fa assembly intermediate result and the final contig sequence of the local assembly are subjected to the third assembly processing by using the quickmerge software.
  • the electronic device includes: a memory and a processor, the memory is used to store a computer program; the processor is used to execute the computer program to implement the metagenome pre-assembly clustering and metagenome reassembly method described in the present application.
  • the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory.
  • the computer-readable storage medium stores a computer program, and when the computer program instructions are run on a processor, the processor executes the metagenomic pre-assembly clustering and metagenomic reassembly method described in the present application.
  • a "computer-readable medium" may be any device that can contain, store, communicate, propagate or transmit a program for use by or in conjunction with an instruction execution system, device or apparatus. More specific examples of computer-readable media (a non-exhaustive list) include the following: an electrical connection with one or more wires (electronic device), a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CD-ROM).
  • a computer-readable medium may even be paper or other suitable medium on which the program may be printed, since the program may be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, deciphering or, if necessary, processing in another suitable manner, and then stored in a computer memory.
  • the various computer-readable storage media described in the present invention may represent one or more devices and/or other machine-readable storage media for storing information.
  • the term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.
  • the logic and/or steps represented in the device diagram or described in other ways herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be specifically implemented in any computer-readable medium for use by an instruction execution system, device or equipment (such as a computer-based system, a system including a processor, or other system that can fetch instructions from an instruction execution system, device or equipment and execute instructions), or used in combination with these instruction execution systems, devices or equipment.
  • the utilization rate of metagenomic sequencing data can be effectively improved, and a portion of high error rate sequences and impurity sequences that are not from the sample can be filtered out through the above-mentioned clustering method and reassembly method, thereby improving the speed of downstream bioinformatics analysis and increasing the assembly quality of low-abundance species.
  • This example uses the mock microbial community ATCC-MSA-1003 (20 bacterial strains, purchased from ATCC).
  • DNA was extracted and sequenced using the stLFR library construction technology to obtain 132.95 Gb of raw data.
  • the stLFR_read_demux program was then used to process the stLFR sequencing data set to obtain the original linked-reads.
  • the mock microbial community ATCC-MSA-1003 was reassembled with multiple thresholds using the Pangaea pipeline to improve the assembly quality of low-abundance species and obtain the final assembly results.
  • the sequencing data of the stLFR library was also assembled using metaSPAdes, Athena and Supernova to compare the assembly results across tools.
  • Table 1. Assembly results for the mock microbial community ATCC-MSA-1003. Note: N50 is an indicator of the contiguity of a genome assembly; NA50 is an indicator of the quality of a genome assembly.
  • the N50 produced by Pangaea is higher than that of the other three tools (1.44 times that of Athena, 1.06 times that of Supernova, and 4.50 times that of metaSPAdes in S1; 1.61 times that of Athena, 2.64 times that of Supernova, and 8.18 times that of metaSPAdes in S2; Table 2).
  • Table 2 Statistics of assembly results of two fecal samples using different assembly software
  • the Pangaea assembled genome was aligned with the closest reference genome to check their colinearity (Figure 12).
  • the NCMAG of Pangaea and its closest reference genome had high alignment consistency (average 98.04%), stability (average 87.17%), and strong colinearity, indicating that Pangaea generated assembly results with high accuracy.
  • the inventors also found some genomic variations in S1 and S2, such as inversions and genome rearrangements of the reference sequence, including Alistipes sp. (S2) and A. indistinctus (S2). Pangaea assembled Alistipes sp.
  • the genome circularization model was used to check whether there were complete and circular genomes in NCMAG from four assembly software. It was found that only Pangaea generated two circular nearly complete genomes, which were annotated as B. adolescentis and Myoviridae sp., respectively. For both microorganisms, Pangaea generated a contig sequence with no gaps and perfect collinearity with the closest reference genome. Athena generated three and two contig sequences for B. adolescentis and Myoviridae sp., respectively, and its contig N50 was significantly lower than that of the contigs generated by Pangaea (B.
  • the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include at least one of the features. In the description of the present invention, "plurality" means at least two, such as two, three, etc., unless otherwise clearly and specifically defined.
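By way of illustration only, the N50 statistic reported in Tables 1 and 2 above can be computed as in the following minimal sketch. The function name and the toy contig lengths are assumptions made for this example and are not taken from the Pangaea implementation. NA50 is, as commonly defined by assembly evaluation tools, the same statistic computed on aligned blocks after contigs are broken at misassembly breakpoints; that step needs a reference alignment and is not shown here.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        return 0
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths in bp, for illustration only.
lengths = [100, 80, 60, 40, 20]
print(n50(lengths))  # 100 + 80 = 180 covers at least half of 300, so N50 is 80
```

A higher N50 at the same total assembly size indicates that the assembled bases sit in fewer, longer contigs.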


Abstract

An assembly process for constructing high-quality microbial genomes on the basis of stLFR metagenomic sequencing data. The process comprises: on the basis of sequence features of co-barcoded linked-reads in metagenomic linked-reads sequencing data, clustering the co-barcoded linked-reads, and performing assembly on each generated cluster to generate contigs_bin; and on the basis of abundance information of the contigs_bin, extracting reads of low-abundance species, so as to perform reassembly.

Description

Assembly process for constructing high-quality microbial genomes based on stLFR metagenomic sequencing data

Technical Field

The present invention relates to the field of bioinformatics. Specifically, the present invention relates to an assembly process for constructing high-quality microbial genomes based on stLFR metagenomic sequencing data. More specifically, the present invention relates to a method for clustering metagenomic data before assembly, a method for assembling metagenomic data, an electronic device, and a computer-readable storage medium.

Background Art

The gut microbiota is closely related to human health and is expected to serve as a diagnostic indicator, therapeutic target and prognostic monitoring indicator for many diseases, so studying it is extremely important. Metagenomics-based study of the composition, abundance and function of the gut microbiota, and of its interactions with human immunity and metabolism, has important scientific significance and clinical value. In addition, de novo assembly of metagenomic sequencing data to construct complete genomes of gut microorganisms is the prerequisite and basis for accurately predicting microbial gene functions and quantifying the abundance of gut microorganisms and metabolic pathways. However, because of the diversity and complexity of the species in the gut microbial community, short reads (average read length of about 150 bp) produced by mainstream second-generation sequencing platforms such as Illumina can hardly resolve the assembly problems caused by three features that are widespread in microbial communities: (1) differences in species abundance; (2) repetitive and conserved sequences; and (3) genomic variation. It is therefore difficult to obtain relatively complete gut microbial genome sequences, and downstream analysis has to fall back to the gene level, which greatly limits the reliability and stability of functional analysis at the species level. If the relationships between genes are unknown, the complexity of subsequent association analysis increases significantly.

At present, metagenomic sequencing is mainly divided into second-generation and third-generation sequencing technologies. Second-generation short-read sequencing uses sequencing by synthesis to produce high-quality short DNA fragments; with its high throughput, low price and low single-base error rate, it is the current mainstream sequencing method. Compared with second-generation short-read sequencing, third-generation long-fragment single-molecule sequencing technologies such as PacBio and Oxford Nanopore have a large advantage in read length (average read length 10-14 Kbp, up to 40 Kbp), which can effectively resolve assembly difficulties caused by complex regions or repetitive sequences in the genome. However, DNA extracted from fecal samples is currently short and of low concentration, which makes it difficult to meet the DNA input requirements of third-generation sequencing; in addition, the high cost, poor platform stability and high single-base error rate of third-generation sequencing severely limit its adoption.

Long-fragment linked-reads library construction technologies that have emerged in recent years, namely 10x Chromium and stLFR (single-tube long fragment read) library construction combined with second-generation sequencing, can produce linked-reads spanning 30-100 Kbp and, in human genome assembly, yield results comparable to those of third-generation sequencing. Linked-reads sequencing essentially adds barcode sequences to second-generation short reads: short reads carrying the same barcode are very likely to come from the same long DNA fragment (~50 kb). The barcode information makes it possible to track short reads from the same long DNA template and to use that information to obtain longer, more accurate scaffolds. At present the technology is mostly used to assemble single species, including humans, animals and plants. For metagenomes, only one assembler (Athena) can perform metagenome assembly on stLFR data. On the one hand, when the barcode specificity of linked-reads in 10x Genomics sequencing data is low, Athena produces many short off-target contigs; on the other hand, because local assembly requires high sequencing depth, Athena's strategy cannot produce local assemblies that connect two adjacent contigs for low-abundance species, so it assembles low-abundance species poorly. In addition, 10x Genomics technology requires purchasing a dedicated microfluidic system for sample preparation, which greatly increases the cost of data generation.

Furthermore, MetaTrass, a metagenome assembly pipeline for stLFR data, classifies sequencing data against reference genomes, so it can only assemble species present in the reference set and ignores microorganisms unique to the sample; it also performs poorly on low-depth species. Because of technical limitations, DNA sequences from the same fragment may carry different barcodes, yet this pipeline can only group sequences with identical barcodes into one class, which reduces data utilization.

Therefore, new pipelines and methods need to be developed to solve the above problems.

Summary of the Invention

The present invention aims to solve, at least to a certain extent, one of the technical problems in the related art.

Existing linked-reads assemblers perform poorly on low-depth species and cannot handle microorganisms of different abundances in the same sample, and linked-reads sequencing technology is still underdeveloped. In view of these problems, the present invention develops a metagenome assembly method that clusters first and then assembles, and that reassembles low-abundance species with multiple thresholds after assembly to improve their assembly.

In a first aspect, the present invention proposes a method for clustering and then assembling metagenomic data. According to an embodiment of the present invention, the method includes: clustering co-barcoded linked-reads based on their sequence features in the metagenomic linked-reads sequencing data, and assembling each generated cluster to produce contigs_bin; and, based on the abundance information of the contigs_bin, extracting reads of low-abundance species for reassembly.

The inventors found that assembling the metagenome from clustered co-barcoded linked-reads significantly improves the contiguity of the assembly compared with direct assembly without clustering. Clustering co-barcoded linked-reads and applying a multi-threshold reassembly strategy can improve the genome coverage of low-abundance species in the assembly. For example, for the lowest-abundance strains (0.02%) in the ATCC-MSA-1004 data set of the embodiment, the assembly method of the present invention shows higher genome coverage than other assembly methods.

According to an embodiment of the present invention, the above clustering method may further include at least one of the following additional technical features:

According to an embodiment of the present invention, the sequence features of the linked-reads include at least one of K-mer frequency and tetranucleotide frequency. Clustering co-barcoded linked-reads by their K-mer frequency and tetranucleotide frequency features significantly reduces the complexity of metagenome assembly and improves the contiguity of the assembly.

It should be noted that the K-mer frequency and the tetranucleotide frequency are features of the co-barcoded linked-reads sequences. In the present invention, contigs_bin refers to the assembly result (the first assembly) obtained by binning the metagenomic linked-reads sequencing data using the above two sequence features and then assembling each resulting bin separately.
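The tetranucleotide frequency feature described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature extraction used by the invention: the function names and toy reads are hypothetical, reads are pooled per barcode before counting, reverse complements are not collapsed, and windows containing ambiguous bases are simply skipped.

```python
from collections import defaultdict
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetranucleotide_frequency(reads):
    """Return a normalized 256-dimensional tetranucleotide frequency vector
    pooled over all reads that share one barcode."""
    counts = [0] * len(TETRAMERS)
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - 3):
            idx = INDEX.get(seq[i:i + 4])  # windows containing N are skipped
            if idx is not None:
                counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def features_by_barcode(barcoded_reads):
    """Group (barcode, sequence) pairs by barcode and compute one vector each."""
    groups = defaultdict(list)
    for barcode, seq in barcoded_reads:
        groups[barcode].append(seq)
    return {bc: tetranucleotide_frequency(seqs) for bc, seqs in groups.items()}


# Hypothetical toy input: two barcodes with very short reads.
reads = [("bc1", "ACGTACGT"), ("bc1", "ACGTN"), ("bc2", "GGGGGG")]
features = features_by_barcode(reads)
print(features["bc2"][INDEX["GGGG"]])  # every window of bc2 is GGGG
```

Each barcode thus yields one fixed-length vector, which is the kind of input a clustering step can consume regardless of how many reads the barcode carries.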

In a second aspect, the present invention provides a method for assembling metagenomic data. According to an embodiment of the present invention, the method comprises: clustering the metagenomic sequencing data using the method described in the first aspect of the present invention; and assembling the clustered metagenomic sequencing data.

According to an embodiment of the present invention, the above method can be used to assemble a metagenome and obtain the genome sequences of low-abundance species in environmental samples.

In a third aspect, the present invention provides a method for reassembling metagenomic sequencing data. According to an embodiment of the present invention, the method comprises: performing a first assembly process on the metagenomic sequencing data using the method described in the second aspect of the present invention; and performing a second assembly process on the metagenomic sequencing data that has undergone the first assembly process. The inventors found that the second assembly process mitigates the negative impact of clustering before assembly on the assembly quality of low-abundance species, and improves overall metagenome assembly quality.

According to an embodiment of the present invention, the above method for assembling metagenomic sequencing data may also include at least one of the following additional technical features:

According to an embodiment of the present invention, the second assembly process is performed as follows: the metagenomic sequencing data that has undergone the first assembly process is aligned to the contigs_bin sequences to obtain the sequencing depth of the contigs_bin from the first assembly process; based on that sequencing depth, reads of the sequenced species are extracted and subjected to the second assembly process to obtain contigs_low sequences.

It should be noted that the contigs_bin sequences are those of the first assembly result.

According to an embodiment of the present invention, the alignment is performed with BWA software.

According to an embodiment of the present invention, a sequencing depth below a predetermined threshold indicates that the second assembly process is to be performed.

According to an embodiment of the present invention, the predetermined threshold is an integer not less than 1. According to an embodiment of the present invention, it is generally a multiple of 10, and a value below 100 is recommended.
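The depth-based selection described above can be sketched as follows. This is a minimal illustration only: the function names, the contig statistics and the read-to-contig mapping are hypothetical values made up for the example, and in practice the aligned-base counts would be derived from the alignment output rather than typed in. Mean depth is taken here as total aligned bases divided by contig length, which is one common definition and is an assumption of this sketch.

```python
def mean_depth(contig_length, aligned_bases):
    """Mean sequencing depth = total aligned bases / contig length."""
    return aligned_bases / contig_length if contig_length else 0.0


def low_depth_contigs(contig_stats, threshold):
    """Return the names of contigs whose mean depth is below `threshold`.

    contig_stats maps contig name -> (length, aligned_bases).
    """
    return sorted(
        name
        for name, (length, bases) in contig_stats.items()
        if mean_depth(length, bases) < threshold
    )


def reads_for_reassembly(read_to_contig, selected):
    """Collect the IDs of reads aligned to any selected contig."""
    selected = set(selected)
    return sorted(r for r, c in read_to_contig.items() if c in selected)


# Hypothetical values, for illustration only.
stats = {"c1": (1000, 5000), "c2": (2000, 60000), "c3": (500, 7500)}
alignments = {"r1": "c1", "r2": "c2", "r3": "c3", "r4": "c1"}

# Several thresholds, each a multiple of 10 below 100.
for t in (10, 20, 30):
    contigs = low_depth_contigs(stats, t)
    print(t, contigs, reads_for_reassembly(alignments, contigs))
```

With these toy numbers the contigs have mean depths of 5, 30 and 15, so raising the threshold from 10 to 20 adds the reads of `c3` to the reassembly set, while `c2` is never selected because its depth is not below 30.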

According to an embodiment of the present invention, the contigs_low sequences are merged with the contigs_bin sequences and with the intermediate contig sequences of the local assembly, respectively, to obtain the final asm.fa intermediate assembly result; and the final asm.fa intermediate assembly result and the final contig sequences of the local assembly are subjected to a third assembly process using quickmerge software.

The inventors found that, for the genomes of different microorganisms, binning and assembling the co-barcoded linked-reads sequences improves the quality of the assembled genomes.

According to an embodiment of the present invention, the contigs_bin sequences are obtained by assembling, with MEGAHIT software, the sequencing reads in the several groups of co-barcoded linked-reads formed by the clustering process.

It should be noted that the clustering method groups together co-barcoded linked-reads with similar composition or similar abundance, thereby reducing the complexity of metagenome assembly and yielding contigs_bin.
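The grouping of co-barcoded linked-reads by feature similarity can be sketched with plain k-means, as below. This is an illustration under stated assumptions rather than the clustering used by the invention, whose figure descriptions mention RPH-kmeans, Gaussian mixture models and K-means: the two-dimensional feature vectors are hypothetical stand-ins for per-barcode composition or abundance features, and the deterministic farthest-point initialization is a choice made here so that the example is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical feature vectors for six barcodes drawn from two genomes.
barcodes = ["bc1", "bc2", "bc3", "bc4", "bc5", "bc6"]
vectors = [(0.10, 0.90), (0.12, 0.88), (0.09, 0.91),
           (0.80, 0.20), (0.82, 0.18), (0.79, 0.21)]
labels = kmeans(vectors, k=2)
print(dict(zip(barcodes, labels)))
```

The reads of all barcodes that share a label would then be pooled and assembled separately, which is what keeps each assembly problem smaller than the full metagenome.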

According to an embodiment of the present invention, the locally assembled contig sequences are the intermediate assembly result obtained by processing the contigs_ori sequences with Athena software.

According to an embodiment of the present invention, the contigs_ori sequences are obtained by assembling all linked-reads, before clustering, with metaSPAdes software.

According to an embodiment of the present invention, the merging is performed with the "--subassemblies" module of the metaFlye software.

Merging the contigs_low sequences, the contigs_bin sequences and the locally assembled contig sequences yields the most complete and most contiguous microbial genomes.

According to an embodiment of the present invention, the Athena asm.fa intermediate assembly result is the final assembly result obtained by processing the contigs_ori sequences with Athena software. According to an embodiment of the present invention, high-quality microbial genomes can be obtained by merging Athena asm.fa with the merged final asm.fa intermediate assembly result using quickmerge software and then circularizing the result.

In a fourth aspect, the present invention proposes a pre-assembly clustering device for metagenomic data. According to an embodiment of the present invention, the device comprises: a first clustering unit, which clusters co-barcoded linked-reads based on their sequence features in the metagenomic linked-reads sequencing data and assembles each generated cluster to produce contigs_bin; and a reassembly unit, which is connected to the first clustering unit and which, based on the abundance information of the contigs_bin, extracts reads of low-abundance species for reassembly.

According to an embodiment of the present invention, the device of the present invention can be used to cluster and then assemble metagenomic sequencing data.

According to an embodiment of the present invention, the sequencing read information includes at least one of K-mer frequency and tetranucleotide frequency.

According to an embodiment of the present invention, the device further includes: a screening unit, which is connected to the first clustering unit and the reassembly unit and is used to screen the sequencing reads in the co-barcoded linked-reads to obtain sequencing reads derived from the same genomic fragment.

In a fifth aspect, the present invention proposes a metagenomic data assembly device. According to an embodiment of the present invention, the device comprises: the pre-assembly clustering device for metagenomic data described in the fourth aspect of the present invention, which is used to cluster the metagenomic sequencing data; and

an assembly device, which is connected to the pre-assembly clustering device and is used to assemble the clustered metagenomic sequencing data.

According to an embodiment of the present invention, the device can be used to assemble metagenomic data.

In a sixth aspect, the present invention proposes a metagenomic sequencing data assembly system. According to an embodiment of the present invention, the system comprises: the metagenomic data assembly device according to the fifth aspect of the present invention, which is used to perform a first assembly process on the metagenomic sequencing data; and

a second assembly device, which is connected to the metagenomic data assembly device and is used to perform a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.

According to an embodiment of the present invention, the system can be used to assemble metagenomic sequencing data.

According to an embodiment of the present invention, the above metagenomic sequencing data assembly system may also include at least one of the following additional technical features:

According to an embodiment of the present invention, the second assembly device includes: an alignment processing device, which is used to align the metagenomic sequencing data that has undergone the first assembly process to the contigs_bin sequences to obtain the sequencing depth of the contigs_bin from the first assembly process; an acquisition device, which is connected to the alignment processing device and is used to extract, based on the sequencing depth, reads of the sequenced species for the second assembly process to obtain contigs_low sequences; a merging device, which is connected to the acquisition device and merges the contigs_low sequences with the contigs_bin sequences and with the intermediate contig sequences of the local assembly, respectively, to obtain the final asm.fa intermediate assembly result; and a third assembly processing device, which is connected to the merging device and performs a third assembly process on the final asm.fa intermediate assembly result and the final contig sequences of the local assembly using quickmerge software.

According to an embodiment of the present invention, the alignment is performed with BWA software.

According to an embodiment of the present invention, a sequencing depth below a predetermined threshold indicates that the acquisition process is to be performed. The acquisition process extracts, based on the sequencing depth, reads of the sequenced species for the second assembly process to obtain contigs_low sequences.

According to an embodiment of the present invention, the predetermined threshold is an integer not less than 1. According to an embodiment of the present invention, it is generally a multiple of 10, and a value below 100 is recommended.

According to an embodiment of the present invention, the third assembly process is performed on the final asm.fa intermediate assembly result and the final contig sequences of the local assembly using quickmerge software.

According to an embodiment of the present invention, the contigs_bin sequences are obtained by assembling, with MEGAHIT software, the sequencing reads in the several groups of co-barcoded linked-reads formed by the clustering process.

According to an embodiment of the present invention, the locally assembled contig sequences are the intermediate assembly result obtained by processing the contigs_ori sequences with Athena software.

According to an embodiment of the present invention, the contigs_ori sequences are obtained by assembling all linked-reads, before clustering, with metaSPAdes software.

According to an embodiment of the present invention, the merging is performed with the "--subassemblies" module of the metaFlye software.

According to an embodiment of the present invention, the Athena asm.fa intermediate assembly result is the final assembly result obtained by processing the contigs_ori sequences with Athena software.

In a seventh aspect, the present invention provides an electronic device. According to an embodiment of the present invention, the electronic device comprises a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program to implement the method described in any one of the first to third aspects of the present invention. According to an embodiment of the present invention, the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory.

In an eighth aspect, the present invention provides a computer-readable storage medium. According to an embodiment of the present invention, the computer-readable storage medium stores a computer program, and when the computer program instructions are run on a processor, the processor executes the method described in any one of the first to third aspects of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a diagram of a pre-assembly clustering device for metagenomic data according to a specific embodiment of the present invention;

FIG. 2 is a diagram of a pre-assembly clustering device for metagenomic data (including a screening device) according to a specific embodiment of the present invention;

FIG. 3 is a diagram of a metagenomic data assembly device according to a specific embodiment of the present invention;

FIG. 4 is a diagram of a metagenomic sequencing data assembly system according to a specific embodiment of the present invention;

FIG. 5 is a system diagram of the second assembly device in the metagenomic sequencing data assembly system according to a specific embodiment of the present invention;

FIG. 6 is a diagram of the Pangaea workflow for stLFR linked-reads according to an embodiment of the present invention;

FIG. 7 is a comparison of the results of binning the stLFR linked-reads of the microbial community ATCC-MSA-1003 using PCA and VAE according to Example 1 of the present invention;

FIG. 8 is a comparison of the results of binning the stLFR linked-reads of the microbial community ATCC-MSA-1003 using RPH-kmeans, a Gaussian mixture model (GMM) and K-means according to Example 1 of the present invention;

FIG. 9 shows the precision, recall, F1 and ARI values of the co-barcoded linked-reads clustering results for different numbers of clusters (k) on the stLFR linked-reads of the microbial community ATCC-MSA-1003 according to Example 1 of the present invention;

FIG. 10 is a comparison of the summary statistics (a), Nx (b) and NAx (c) of the Pangaea assembly results obtained with different numbers of clusters (k = 10, 15 and 20) on the stLFR linked-reads of the microbial community ATCC-MSA-1003 according to Example 1 of the present invention;

FIG. 11 is a comparison of the number of nearly complete genomes generated by different assembly tools according to Example 2 of the present invention;

FIG. 12 shows, according to Example 2 of the present invention, a scatter plot (left) of the genome collinearity analysis between a nearly complete genome generated by Pangaea and its closest reference genome, and a circular plot (right) comparing the genomes of the same microorganism generated by different assembly software.

DETAILED DESCRIPTION

Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary and are intended to explain the present invention; they should not be construed as limiting it.

Definitions and Description

As used herein, unless otherwise indicated, the singular forms "a," "an," and the like include plural referents (more than one); "a set" or "a plurality" refers to two or more.

As used herein, unless otherwise indicated, the terms "first", "second", "third", "fourth", etc. are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to; a feature qualified by "first", "second", etc. may explicitly or implicitly include one or more of said features.

As used herein, unless otherwise indicated, a nucleotide refers to one of the four natural nucleotides (such as dATP, dCTP, dGTP and dTTP, or ATP, CTP, GTP and UTP) or a derivative thereof, and is sometimes denoted directly by the base it contains (A, T/U, C, G). A person of ordinary skill in the art can tell from the context what a nucleotide or base refers to in a specific embodiment.

As used herein, "sequencing" means sequence determination, synonymous with "nucleic acid sequencing" or "gene sequencing", and refers to determining the order of bases in a nucleic acid sequence. It includes sequencing by synthesis (SBS) and/or sequencing by ligation (SBL); DNA sequencing and/or RNA sequencing; and long-fragment sequencing and/or short-fragment sequencing. Long and short fragments are relative terms: for example, nucleic acid molecules longer than 1 kb, 2 kb, 5 kb or 10 kb may be called long fragments, and those shorter than 1 kb or 800 bp may be called short fragments. Sequencing also includes double-end sequencing, single-end sequencing and/or paired-end sequencing; double-end or paired-end sequencing may refer to reading any two segments or parts of the same nucleic acid molecule that do not completely overlap.

It should be noted that the method of the present application is applied to linked-reads. The sequencing results or sequencing data obtained from a sequencing reaction are called reads. A read is the base sequence obtained by the sequencer in a single sequencing pass, and the length of a read is called the read length.

As used herein, the metagenome refers to the sum of the genetic material of all organisms in a specific environment. In general, total DNA is extracted directly from environmental samples, and new functional genes and bioactive substances are obtained by constructing and screening metagenomic libraries. Metagenomic libraries contain the genetic information of both culturable and not-yet-culturable microorganisms, so metagenomic analysis allows microbial diversity to be explored and exploited comprehensively.

As used herein, linked-reads sequencing attaches the same barcode to the sequences derived from one long DNA fragment (10-100 kb), with the aim of eliminating some misreads and thereby improving metagenomic assembly.

As used herein, a contigs sequence refers to a larger fragment assembled from multiple reads through their overlaps; such a fragment is called a contig (a contiguous group of overlapping fragments). In other words, the sequence spliced together from the overlaps between different reads is a contig.

As used herein, the barcode refers to the identity tag of each sample in NGS sequencing. The co-barcoded-linked-reads sequences refer to linked-reads derived from the same fragment and ligated to the same barcode sequence; this barcode sequence ultimately appears in read2 of the sequencing result and is used to identify the source fragment.

As used herein, the term "Contig N50" is defined as follows. Assembling reads yields contigs of different lengths. The lengths of all contigs are summed to obtain the total contig length, and the contigs are sorted from longest to shortest, e.g. Contig 1, Contig 2, Contig 3, ..., Contig 25. The contig lengths are then accumulated in this order; when the running sum reaches half of the total contig length, the length of the last contig added is the Contig N50. For example, if Contig 1 + Contig 2 + Contig 3 + Contig 4 equals half of the total contig length, the length of Contig 4 is the Contig N50. Contig N50 can be used as a criterion for judging the quality of a genome assembly.
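The Contig N50 definition above can be sketched in a few lines of Python. This is only an illustration of the definition; the function name `contig_n50` and the example lengths are hypothetical and do not come from the application.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)   # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half of the
        # total length defines the N50.
        if running >= half_total:
            return length


# Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
print(contig_n50([80, 70, 50, 40, 30, 20]))  # 70
```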

Method of clustering metagenomic data before assembly:

(1) Based on the sequence features of the co-barcoded linked reads in the metagenomic linked-reads sequencing data, cluster the co-barcoded linked reads;

(2) Assemble each resulting cluster to generate contigs_bin. Based on the abundance information of contigs_bin, extract the reads of low-abundance species for reassembly.

It should be noted that the sequence features of the linked-reads include at least one of the K-mer frequency and the tetranucleotide frequency. A K-mer is a DNA fragment of length K obtained by cutting a substring out of a sequencing read. K-mers serve the following purposes:

(1) Assembling contig sequences from K-mers. The length of the contig sequences is closely related to the value of K.

(2) Identifying reads that contain sequencing errors, heterozygous alleles or repetitive sequences.

(3) Estimating genome size and heterozygosity.
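To make the notion of a K-mer concrete, the following minimal sketch cuts reads into K-mers and counts them across all reads. The function names are hypothetical, and skipping K-mers that contain ambiguous bases and counting each K-mer on the strand it is read from are assumptions made here for illustration; the application does not specify either choice.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every length-k substring of sequence made only of A, C, G, T."""
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= set("ACGT"):   # skip k-mers with ambiguous bases
            yield kmer


def global_kmer_table(reads, k):
    """Count how often each k-mer occurs across all reads."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


reads = ["ACGTAC", "CGTACG", "ACGNAC"]
table = global_kmer_table(reads, k=3)
print(table["CGT"], table["GTA"], table["ACG"])  # 2 2 3
```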

According to an embodiment of the present invention, the K-mer frequency and TNF (tetranucleotide frequency) features are extracted from co-barcoded linked-reads with a total length greater than 2 kb to ensure feature stability. The K-mer frequency is calculated from the histogram of global K-mer occurrences, which follows a Poisson distribution whose mean equals the abundance of the microorganism. The inventors adopted k = 15, as in previous studies (Ruan and Li 2020; Wickramarachchi, Mallawaarachchi et al. 2020), and built a global 15-mer frequency table from all linked-reads. 15-mers that occur more than 4,000 times are removed to avoid repetitive sequences. The frequency range 0-4,000 is divided into 400 bins of equal width, each of width 10. For each co-barcoded linked-read, the sequence is cut into 15-mers, and these 15-mers are assigned to the 400 bins according to their global frequency. Counting the 15-mers in each bin yields a 400-dimensional count vector, which serves as the K-mer frequency feature of that co-barcoded linked-read. In addition, the TNF feature is constructed by extracting the frequencies of the 136 non-redundant 4-mers from each co-barcoded linked-read. The K-mer frequency and TNF features are L1-normalized to eliminate the data skew introduced by co-barcoded linked-reads of different lengths.

The inventors concatenated the normalized K-mer frequency ($X_A$) and TNF ($X_T$) features into a 536-dimensional vector as the input of the VAE. The encoder of the VAE consists of two fully connected layers with 512 hidden neurons each; each layer is followed by batch normalization and dropout (P = 0.2). The output of the last encoder layer is fed to two hidden layers with 32 hidden neurons each, which output $\mu$ and $\sigma$, respectively, as the parameters of the Gaussian distribution $N(\mu, \sigma^2)$. The embedding $Z$ of the VAE is sampled from this Gaussian distribution. The decoder of the VAE contains two fully connected hidden layers of the same size as the encoder layers and reconstructs the input features ($\hat{X}_A$ and $\hat{X}_T$) from the embedding $Z$. Since the input features $X_A$ and $X_T$ are both L1-normalized, the inventors apply the softmax activation function to $\hat{X}_A$ and $\hat{X}_T$ so that the reconstructed features model probability distributions and match the input features.

The loss function of the model is defined as the weighted sum of three components: the reconstruction loss of the K-mer frequency ($L_A$), the reconstruction loss of the TNF vector ($L_T$), and the Kullback-Leibler divergence loss ($L_{KL}$) between the Gaussian distribution of the hidden layer and the standard Gaussian distribution:

$$\mathrm{Loss} = w_A L_A + w_T L_T + w_{KL} L_{KL}$$

where the weights of the three loss components are $w_A = \alpha/\ln(\dim(X_A))$, $w_T = (1-\alpha)/\ln(\dim(X_T))$ and $w_{KL} = \beta/\dim(Z)$. The parameters $\alpha$ and $\beta$ are set to 0.1 and 0.015, respectively. The VAE is trained with early stopping to reduce training time and avoid overfitting. After obtaining the embeddings of the input features, the inventors applied RPH-kmeans (Xie, Huang et al. 2020), which uses a random projection hashing algorithm, to cluster the embeddings.
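As a worked illustration of the weighting scheme, the sketch below evaluates the three weights for the dimensions given in the text (400 for the K-mer feature, 136 for TNF) and combines three loss values. It assumes that dim(Z) is 32, inferred from the 32-neuron layers that output μ and σ, and it uses the standard closed-form KL divergence between a diagonal Gaussian and the standard Gaussian. The function names are hypothetical, and the form of the reconstruction losses is left open because the text does not specify it.

```python
import math


def loss_weights(dim_kmer=400, dim_tnf=136, dim_z=32, alpha=0.1, beta=0.015):
    """Weights w_A, w_T, w_KL as defined in the text (dim_z = 32 is an assumption)."""
    w_a = alpha / math.log(dim_kmer)
    w_t = (1 - alpha) / math.log(dim_tnf)
    w_kl = beta / dim_z
    return w_a, w_t, w_kl


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def total_loss(l_a, l_t, l_kl, weights):
    """Weighted sum Loss = w_A * L_A + w_T * L_T + w_KL * L_KL."""
    w_a, w_t, w_kl = weights
    return w_a * l_a + w_t * l_t + w_kl * l_kl


weights = loss_weights()
# Placeholder reconstruction losses; only the KL term is computed here.
l_kl = kl_to_standard_normal(mu=[0.5, -0.2], sigma=[1.0, 0.8])
print(weights, total_loss(l_a=2.0, l_t=1.5, l_kl=l_kl, weights=weights))
```

With these settings the TNF reconstruction term carries most of the weight, since 1 - α = 0.9.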

Specifically, for ease of understanding, the technical solution of the present application is explained in detail below.

1. Identify the barcode sequences in the raw sequencing data of ATCC-MSA-1003 and merge the reads that share the same barcode sequence to generate co-barcoded linked-reads.

2. Take the co-barcoded linked-reads longer than 2 kb and compute their 15-mer and tetranucleotide features. Count the frequency of each 15-mer, remove the 15-mers with a frequency greater than 4,000, and assign the remaining 15-mers to 400 bins according to their frequency of occurrence. In addition, for each co-barcoded linked-read, compute the frequencies of all 136 non-redundant 4-mers to construct the tetranucleotide feature. The 15-mer frequency and tetranucleotide features are L1-normalized.

3. Concatenate the normalized 15-mer frequency and tetranucleotide features into a 536-dimensional vector as the input of the variational autoencoder to reduce the dimensionality of the data. Compared with PCA, the variational autoencoder gave better results (FIG. 7). After obtaining the embeddings of the input features, RPH-kmeans, which uses a random projection hashing algorithm, was applied to cluster the embeddings.

4. In a comparison with k-means and a Gaussian mixture model on binning of the simulated stLFR data, RPH-kmeans achieved a better overall F1-score and ARI (FIG. 8). A larger k may lead to higher binning precision and lower recall (FIG. 9), but in practice the value of k has almost no effect on the final assembly, indicating that the assembly performance of the metagenomic reassembly workflow (FIG. 6) is robust to the number of clusters (FIG. 10).
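Steps 1 to 3 above can be sketched as follows: group reads by barcode, build the 400-bin 15-mer histogram and the 136-dimensional tetranucleotide vector, L1-normalize both and concatenate them. This is a minimal sketch under several assumptions that the text does not fix: reads arrive as hypothetical `(barcode, sequence)` pairs in upper case, the 136 non-redundant 4-mers are obtained by merging each 4-mer with its reverse complement, and a 15-mer with global frequency f falls into bin (f - 1) // 10.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


# 256 tetranucleotides collapse to 136 when merged with reverse complements.
TETRAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
TETRA_INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def co_barcoded_groups(records, min_total_length=2000):
    """Merge reads sharing a barcode; keep groups longer than the cutoff."""
    groups = {}
    for barcode, seq in records:
        groups.setdefault(barcode, []).append(seq)
    return {b: s for b, s in groups.items()
            if sum(map(len, s)) > min_total_length}


def l1_normalize(vector):
    total = sum(vector)
    return [v / total for v in vector] if total else list(vector)


def abundance_feature(seqs, global_counts, k=15, n_bins=400, bin_width=10):
    """Histogram of the global frequencies of the k-mers in one barcode group."""
    hist = [0] * n_bins
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            freq = global_counts.get(seq[i:i + k], 0)
            if 0 < freq <= n_bins * bin_width:   # k-mers above 4,000 are dropped
                hist[(freq - 1) // bin_width] += 1
    return l1_normalize(hist)


def tnf_feature(seqs):
    """Frequencies of the 136 non-redundant tetranucleotides."""
    counts = [0] * len(TETRAMERS)
    for seq in seqs:
        for i in range(len(seq) - 3):
            idx = TETRA_INDEX.get(canonical(seq[i:i + 4]))
            if idx is not None:                  # skips 4-mers containing N
                counts[idx] += 1
    return l1_normalize(counts)


def barcode_feature(seqs, global_counts, k=15):
    """Concatenate both features into one 536-dimensional vector."""
    return abundance_feature(seqs, global_counts, k) + tnf_feature(seqs)


records = [("bc1", "ACGT" * 10), ("bc2", "ACGT")]
groups = co_barcoded_groups(records, min_total_length=30)  # toy cutoff; the text uses 2 kb
global_counts = Counter(s[i:i + 15] for _, s in records for i in range(len(s) - 14))
feature = barcode_feature(groups["bc1"], global_counts)
print(len(TETRAMERS), len(feature))  # 136 536
```

The resulting vector is what would be passed to the variational autoencoder; the autoencoder and RPH-kmeans themselves are outside the scope of this sketch.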

Metagenomic data assembly method:

First, the linked-reads of the metagenomic data are clustered before assembly; then metagenomic assembly is performed on the clustered co-barcoded-linked-reads sequences.

Specifically, for ease of understanding, the technical solution of the present application is explained in detail below.

(1) Use the metaSPAdes software to assemble the clustered co-barcoded-linked-reads sequences and obtain seed contig sequences;

(2) Align the reads obtained by paired-end sequencing to the seed contig sequences from step (1), and construct a scaffold graph from the alignment results;

(3) Use the barcode information to obtain the reads at the junctions between contigs, where the contigs are adjacent contig sequences in the scaffold graph from step (2);

(4) Use the IDBA-UD software to locally assemble the junction reads from step (3) and obtain local assemblies that connect the contigs;

(5) Use the metaFlye software to merge the local assembly results and the seed contig sequences and obtain the Athena asm.fa database.
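Step (3) can be read in more than one way; one plausible reading, sketched below, is to take the barcodes observed near the facing ends of two adjacent contigs, keep those seen at both ends, and recruit every read carrying such a barcode for local assembly. All names and the input layout (`end_barcodes`, `reads_by_barcode`) are hypothetical, and the actual selection rule used by the workflow may differ.

```python
def junction_reads(end_barcodes, edge, reads_by_barcode):
    """Collect reads whose barcode is seen at the facing ends of two adjacent contigs.

    end_barcodes     -- contig id -> set of barcodes of reads aligned near its end
    edge             -- (left contig id, right contig id) from the scaffold graph
    reads_by_barcode -- barcode -> list of read ids carrying that barcode
    """
    left, right = edge
    shared = end_barcodes[left] & end_barcodes[right]
    # Sorting keeps the output deterministic for a given input.
    return [read for bc in sorted(shared) for read in reads_by_barcode.get(bc, [])]


end_barcodes = {"c1": {"b1", "b2", "b3"}, "c2": {"b2", "b3", "b4"}}
reads_by_barcode = {"b1": ["r1"], "b2": ["r2", "r3"], "b3": ["r4"], "b4": ["r5"]}
print(junction_reads(end_barcodes, ("c1", "c2"), reads_by_barcode))  # ['r2', 'r3', 'r4']
```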

Metagenomic sequencing data reassembly method:

(1) Use the metaSPAdes software to process the co-barcoded-linked-reads sequences and obtain the contigs_ori sequences;

(2) Use the Athena software to process the contigs_ori sequences and obtain the intermediate result of the locally assembled contig sequences;

(3) Use the MEGAHIT software to process the clustered co-barcoded-linked-reads sequences and obtain the contigs_bin sequences;

(4) Use the BWA software to align the co-barcoded-linked-reads sequences before clustering to the contigs_bin sequences and calculate the sequencing depth of the contigs_bin sequences; then use the metaSPAdes software to assemble the co-barcoded-linked-reads sequences whose sequencing depth is below a threshold t_i;

(5) Set a series of thresholds {t_i | i = 1, 2, ...} for assembling the co-barcoded-linked-reads sequences and obtain several groups of contigs_low sequences;

(6) Use the "--subassemblies" module of the metaFlye software to merge the contigs_bin sequences, the contigs_low sequences and the locally assembled contig sequences, and obtain the final asm.fa intermediate assembly result;

(7) Use the Athena software to process the contigs_ori sequences and obtain the final result of the locally assembled contig sequences for the linked-reads;

(8) Use the quickmerge software to perform metagenomic assembly on the final asm.fa intermediate assembly result and the Athena asm.fa database, and obtain complete microbial genomes.
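The depth-based selection in steps (4) and (5) can be sketched as follows. The sketch assumes that the alignment has already been summarized into a hypothetical read-to-contig mapping and per-contig depths (aligned bases divided by contig length), and that a read counts as low-abundance when its contig's depth is below the threshold; the threshold values shown are placeholders, not values from the application.

```python
def contig_depth(aligned_bases, contig_length):
    """Average sequencing depth of a contig: aligned bases per contig base."""
    return aligned_bases / contig_length


def low_abundance_read_sets(read_to_contig, depths, thresholds):
    """For each threshold t_i, list the reads aligned to contigs with depth < t_i."""
    return {
        t: sorted(read for read, contig in read_to_contig.items()
                  if depths[contig] < t)
        for t in thresholds
    }


depths = {
    "bin1_c1": contig_depth(aligned_bases=120_000, contig_length=1_000),  # 120.0
    "bin2_c1": contig_depth(aligned_bases=8_000, contig_length=1_000),    # 8.0
    "bin3_c1": contig_depth(aligned_bases=25_000, contig_length=1_000),   # 25.0
}
read_to_contig = {"r1": "bin1_c1", "r2": "bin2_c1", "r3": "bin3_c1", "r4": "bin2_c1"}

# Each read set would be assembled separately to give one contigs_low group.
print(low_abundance_read_sets(read_to_contig, depths, thresholds=[10, 30]))
# {10: ['r2', 'r4'], 30: ['r2', 'r3', 'r4']}
```

Using several thresholds means each contigs_low group is assembled from a different abundance range, which is the idea behind the multi-threshold reassembly described in the steps above.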

It should be noted that, for convenience of description in the examples, the present application names the entire metagenomic reassembly workflow Pangaea.

Pre-assembly clustering device for metagenomic data

The present invention provides a pre-assembly clustering device for metagenomic data. According to an embodiment of the present invention, referring to FIG. 1, the device comprises a first clustering unit S100, configured to cluster co-barcoded linked reads based on their sequence features in the metagenomic linked-reads sequencing data and to assemble each resulting cluster to generate contigs_bin; and a reassembly unit S200, connected to the first clustering unit S100 and configured to extract the reads of low-abundance species for reassembly based on the abundance information of contigs_bin.

According to an embodiment of the present invention, the pre-assembly clustering device for metagenomic data further comprises a screening unit S101. Referring to FIG. 2, the screening unit S101 is connected to the first clustering unit S100 and the reassembly unit S200 and is configured to screen the sequencing reads in the several co-barcoded-linked-reads groups so as to obtain sequencing reads derived from the same genome fragment.

Metagenomic data assembly device

The present invention provides a metagenomic data assembly device. According to an embodiment of the present invention, referring to FIG. 3, the device comprises the aforementioned pre-assembly clustering device 300 for metagenomic data, configured to cluster the metagenomic sequencing data; and

an assembly device 400, connected to the pre-assembly clustering device 300 and configured to assemble the clustered metagenomic sequencing data.

Metagenomic sequencing data assembly system

The present invention provides a system for assembling metagenomic data. According to an embodiment of the present invention, referring to FIG. 4, the system comprises the aforementioned metagenomic data assembly device 500, configured to perform a first assembly process on the metagenomic sequencing data; and

a second assembly device 600, connected to the metagenomic data assembly device 500 and configured to perform a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.

The second assembly device 600 of the metagenomic data assembly system is described below. According to an embodiment of the present invention, referring to FIG. 5, the second assembly device 600 further comprises: an alignment processing device 601, configured to align the metagenomic sequencing data that has undergone the first assembly process to the contigs_bin sequences so as to obtain the sequencing depth of the contigs_bin produced by the first assembly process;

an acquisition device 602, connected to the alignment processing device 601 and configured to acquire, based on the sequencing depth, the reads of the sequenced species and perform the second assembly process on them to obtain contigs_low sequences;

a merging device 603, connected to the acquisition device 602 and configured to merge the several groups of contigs_low sequences with the contigs_bin sequences and the intermediate locally assembled contig sequences so as to obtain the final asm.fa intermediate assembly result; and

a third assembly processing device 604, connected to the merging device 603 and configured to perform a third assembly process, using the quickmerge software, on the final asm.fa intermediate assembly result and the final locally assembled contig sequences.

Electronic device

According to an embodiment of the present invention, the electronic device comprises a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program to implement the metagenome pre-assembly clustering and metagenome reassembly methods described in the present application. According to an embodiment of the present invention, the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory.

Computer-readable storage medium

According to an embodiment of the present invention, the computer-readable storage medium stores a computer program, and when the computer program instructions are run on a processor, they cause the processor to execute the metagenome pre-assembly clustering and metagenome reassembly methods described in the present application.

For the purposes of this specification, a "computer-readable medium" may be any device that can contain, store, communicate, propagate or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus or device. More specific examples of computer-readable media (a non-exhaustive list) include the following: an electrical connection with one or more wires (electronic device), a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

The various computer-readable storage media described in the present invention may represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.

It should be noted that, in the present application, the logic and/or steps represented in the device diagrams or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them).

According to the embodiments of the present application, adopting the technical solutions of these embodiments can effectively improve the utilization of metagenomic sequencing data. The clustering and reassembly methods described above filter out a portion of the high-error-rate sequences and of the contaminant sequences not derived from the sample, thereby speeding up downstream bioinformatics analysis and improving the assembly quality of low-abundance species.

Embodiments of the present invention are described in more detail below, examples of which are shown in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are exemplary and are intended to explain the present invention; they should not be construed as limiting it.

Example 1

1. This example uses the mock (simulated) microbial community ATCC-MSA-1003 (containing 20 bacterial strains, purchased from ATCC). First, DNA extraction, library construction and sequencing were performed using the stLFR library construction technology, yielding 132.95 Gb of raw data. The stLFR_read_demux program was then used to process the stLFR sequencing data set and obtain the raw linked-reads.

2. The mock microbial community ATCC-MSA-1003 was reassembled with multiple thresholds using the Pangaea workflow, which improves the assembly quality of low-abundance species, to obtain the final assembly. In parallel, the sequencing data of the stLFR library was assembled with the metaSPAdes, Athena and Supernova software so that the assemblies produced by the different tools could be compared side by side.

3. Comparison of the assembly metrics showed that the Pangaea workflow performed best: it produced the highest total assembly length, and its N50 (an indicator of assembly contiguity) was 2.09 times that of Athena, 7.54 times that of Supernova and 13.83 times that of metaSPAdes (Table 1).

Table 1: Assembly results for the mock microbial community ATCC-MSA-1003

Note: N50: an indicator of the contiguity of a genome assembly; NA50: an indicator of the quality of a genome assembly.
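The N50 statistic compared in Example 1 can be illustrated with a minimal Python sketch. It follows the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The function names `n50` and `fold_over` are illustrative and do not come from the application. NA50 is usually computed in the same way after contigs have been split at misassembly breakpoints found by aligning to a reference; that alignment step is not shown here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum
        # to at least half of the total assembly length.
        if 2 * running >= total:
            return length
    return 0


def fold_over(value: float, baseline: float) -> float:
    """Ratio of one N50 to another, e.g. Pangaea N50 / Athena N50."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline
```

Applied to the contig lengths of two assemblies, `fold_over(n50(a), n50(b))` gives the kind of ratio reported in the text (for example, 2.09 relative to Athena).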

Example 2

1. In this example, fecal samples from two human subjects were used. DNA was extracted, libraries were constructed with the stLFR library construction technology, and the libraries were sequenced, yielding 136.6 Gb and 131.6 Gb of raw data, respectively. The stLFR_read_demux program was used to process the stLFR sequencing data sets to obtain the raw linked-reads.

2. The sequencing data of the fecal samples were assembled following the procedure of steps 2-5 of Example 1.

3. The results show that, compared with the other software, the Pangaea technical solution produced the longest total assembly on both S1 (Pangaea = 488.79 Mb, Athena = 469.28 Mb, Supernova = 311.97 Mb, metaSPAdes = 452.60 Mb; Table 2) and S2 (Pangaea = 414.46 Mb, Athena = 393.69 Mb, Supernova = 290.60 Mb, metaSPAdes = 374.17 Mb; Table 2). In both S1 and S2, the N50 produced by Pangaea was higher than that of the other three software (in S1, 1.44 times that of Athena, 1.06 times that of Supernova and 4.50 times that of metaSPAdes; in S2, 1.61 times that of Athena, 2.64 times that of Supernova and 8.18 times that of metaSPAdes; Table 2).

Table 2: Assembly statistics for the two fecal samples using different assembly software

3. CheckM was used to evaluate the high-quality assemblies (completeness greater than 90%, contamination less than 5%) produced by the four assembly methods. Pangaea generated 24 and 18 near-complete genomes on samples S1 and S2, respectively, markedly more than the other software: Athena (S1: 13, S2: 12), Supernova (S1: 14, S2: 10) and metaSPAdes (S1: 0, S2: 1). Counting the near-complete genomes at different minimum N50 values showed that Pangaea obtained more near-complete genomes than the other three software at almost all N50 thresholds, which demonstrates the high contiguity of the near-complete genomes produced by Pangaea. Pangaea also clearly outperformed the other software in near-complete genomes under different maximum sequencing depth thresholds. In particular, for near-complete genomes with an N50 greater than 1 Mb, Pangaea assembled more such genomes at all abundance thresholds (S1: 8, S2: 4) than the other three software; Athena, for example, produced only 3 and 1 such genomes on S1 and S2, respectively (Figure 11).

4. The genomes assembled by Pangaea were aligned to their closest reference genomes to check collinearity (Figure 12). The NCMAGs (near-complete metagenome-assembled genomes) from Pangaea and their closest reference genomes showed high alignment identity (98.04% on average), high stability (87.17% on average) and strong collinearity, indicating that Pangaea generated highly accurate assemblies. The inventors also found some genomic variation in S1 and S2, such as inversions and genome rearrangements relative to the reference sequence, including in Alistipes sp. (S2) and A. indistinctus (S2). Pangaea assembled Alistipes sp. in both S1 and S2 with similar sequence lengths (S1: 2.84 Mb; S2: 2.75 Mb), but the assembly from S1 had a better N50 (S1 = 2344.71 Kb, S2 = 513.55 Kb). This difference may be due to the different abundance of the species in the two samples (sequencing depth: 210.87x in S1 and 69.82x in S2). For Sutterella wadsworthensis in S1 and P. copri in S2, Pangaea generated better assemblies with a higher N50 than the other software. In addition, by comparing sequencing depth and GC skew, the inventors found that Pangaea can assemble regions with low depth and high GC skew well, for example a region of approximately 1,100 Kb in R. hominis from S2. These results indicate that Pangaea has the potential to assemble genomic regions that are difficult to assemble.

5. A genome circularization model was used to check whether the NCMAGs from the four assembly software contained complete, circular genomes. Only Pangaea generated two circular near-complete genomes, which were annotated as B. adolescentis and Myoviridae sp. For both microorganisms, Pangaea generated a single gapless contig that was perfectly collinear with the closest reference genome. Athena generated three and two contigs for B. adolescentis and Myoviridae sp., respectively, with a contig N50 considerably lower than that of the contigs produced by Pangaea (B. adolescentis: Pangaea = 2167.94 Kb, Athena = 744.54 Kb; Myoviridae sp.: Pangaea = 2137.66 Kb, Athena = 1709.63 Kb). Supernova and metaSPAdes either generated only incomplete MAGs or failed to assemble these two species, and their assembly completeness was considerably lower than that of Pangaea (Figure 12).
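The near-complete genome criterion used in this example (CheckM completeness greater than 90% and contamination less than 5%) and the counting of such genomes at different minimum N50 values can be sketched in Python as follows. This is only an illustration under stated assumptions: the `GenomeBin` record and its field names are hypothetical and do not mirror CheckM's output format, and a minimum N50 is treated here as an inclusive lower bound.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GenomeBin:
    """Hypothetical per-genome summary (not CheckM's actual output schema)."""
    name: str
    completeness: float   # percent, as reported by a tool such as CheckM
    contamination: float  # percent
    n50: int              # contig N50 of the genome, in bp


def is_near_complete(b: GenomeBin) -> bool:
    """Criterion quoted in the example: completeness > 90%, contamination < 5%."""
    return b.completeness > 90.0 and b.contamination < 5.0


def count_near_complete(bins: Iterable[GenomeBin], min_n50: int = 0) -> int:
    """Count near-complete genomes whose N50 is at least min_n50."""
    return sum(1 for b in bins if is_near_complete(b) and b.n50 >= min_n50)
```

Calling `count_near_complete(bins, min_n50=t)` for a series of thresholds `t` gives one curve per assembler, which is the kind of comparison the example describes for Figure 11.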

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature defined as "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise clearly and specifically defined.

In this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided that they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. A person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.

Claims (18)

1. A method for clustering metagenomic data before assembly, characterized in that the method comprises: clustering co-barcoded linked reads based on the sequence features of the co-barcoded linked reads in linked-reads sequencing data of a metagenome, and assembling each resulting cluster to generate contigs_bin; and extracting, based on the abundance information of contigs_bin, the reads of low-abundance species for reassembly.

2. The method according to claim 1, characterized in that the sequence features of the linked-reads include at least one of k-mer frequency and tetranucleotide frequency.

3. A method for assembling metagenomic data, characterized by comprising: clustering metagenomic sequencing data using the method of claim 1 or 2; and assembling the clustered metagenomic sequencing data.

4. A method for reassembling metagenomic sequencing data, characterized by comprising: performing a first assembly process on metagenomic sequencing data using the method of claim 3; and performing a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.
5. The method according to claim 4, characterized in that the second assembly process is performed as follows: the metagenomic sequencing data that has undergone the first assembly process is aligned to the contigs_bin sequences to obtain the sequencing depth of the contigs_bin from the first assembly process, and, based on the sequencing depth, the reads of the sequenced species are obtained and subjected to the second assembly process to obtain contigs_low sequences; the alignment is performed with BWA software; a sequencing depth less than a predetermined threshold is an indication for performing the second assembly process; the predetermined threshold is an integer not less than 1; the contigs_low sequences are merged with the contigs_bin sequences and with the intermediate contig sequences of the local assembly, respectively, to obtain a final asm.fa intermediate assembly result; and the final asm.fa intermediate assembly result and the final contig sequences of the local assembly are subjected to a third assembly process using quickmerge software.

6. The method according to claim 5, characterized in that the contigs_bin sequences are obtained by assembling, with MEGAHIT software, the sequencing reads in the several co-barcoded-linked-reads groups formed by the clustering.
7. The method according to claim 5, characterized in that the locally assembled contig sequences are the intermediate assembly result obtained by processing the contigs_ori sequences with Athena software; and the contigs_ori sequences are obtained by assembling all linked-reads, before clustering, with metaSPAdes software.

8. The method according to claim 5, characterized in that the merging is performed with the "--subassemblies" module of metaFlye software.

9. The method according to claim 5, characterized in that the Athena asm.fa intermediate assembly result is the final assembly result obtained by processing the contigs_ori sequences with Athena software.

10. A device for clustering metagenomic data before assembly, characterized in that the device comprises: a first clustering unit, which clusters co-barcoded linked reads based on the sequence features of the co-barcoded linked reads in linked-reads sequencing data of a metagenome, and assembles each resulting cluster to generate contigs_bin; and a reassembly unit, which is connected to the first clustering unit and, based on the abundance information of contigs_bin, extracts the reads of low-abundance species for reassembly.

11. The device according to claim 10, characterized in that the sequencing read information includes at least one of k-mer and tetranucleotide frequency.
12. The device according to claim 10, characterized by further comprising: a screening unit, which is connected to the first clustering unit and the reassembly unit and is used to screen the sequencing reads in the co-barcoded-linked-reads so as to obtain sequencing reads derived from the same genomic fragment.

13. A metagenomic data assembly device, characterized by comprising: the device for clustering metagenomic data before assembly according to any one of claims 10 to 12, used to cluster the sequencing data of a metagenome; and an assembly device, which is connected to the device for clustering metagenomic data before assembly and is used to assemble the clustered metagenomic sequencing data.

14. A metagenomic sequencing data assembly system, characterized by comprising: the metagenomic data assembly device of claim 13, used to perform a first assembly process on metagenomic sequencing data; and a second assembly device, which is connected to the metagenomic data assembly device and is used to perform a second assembly process on the metagenomic sequencing data that has undergone the first assembly process.
15. The system according to claim 14, characterized in that the second assembly device comprises: an alignment processing device, used to align the metagenomic sequencing data that has undergone the first assembly process to the contigs_bin sequences so as to obtain the sequencing depth of the contigs_bin from the first assembly process; an acquisition device, which is connected to the alignment processing device and is used to obtain, based on the sequencing depth, the reads of the sequenced species and perform the second assembly process to obtain contigs_low sequences; a merging device, which is connected to the acquisition device and merges the contigs_low sequences with the contigs_bin sequences and with the intermediate contig sequences of the local assembly, respectively, to obtain a final asm.fa intermediate assembly result; and a third assembly processing device, which is connected to the merging device and performs a third assembly process on the final asm.fa intermediate assembly result and the final contig sequences of the local assembly using quickmerge software; the alignment is performed with BWA software; a sequencing depth less than a predetermined threshold is an indication for performing the acquisition; the predetermined threshold is an integer not less than 1; and the third assembly process is performed on the final asm.fa intermediate assembly result and the final contig sequences of the local assembly using quickmerge software.

16. The system according to claim 15, characterized in that the contigs_bin sequences are obtained by assembling, with MEGAHIT software, the sequencing reads in the several co-barcoded-linked-reads groups formed by the clustering; the locally assembled contig sequences are the intermediate assembly result obtained by processing the contigs_ori sequences with Athena software; the contigs_ori sequences are obtained by assembling all linked-reads, before clustering, with metaSPAdes software; the merging is performed with the "--subassemblies" module of metaFlye software; and the Athena asm.fa intermediate assembly result is the final assembly result obtained by processing the contigs_ori sequences with Athena software.

17. An electronic device, characterized by comprising a memory and a processor; the memory is used to store a computer program; and the processor is used to execute the computer program to implement the method according to any one of claims 1 to 9.

18. A computer-readable storage medium, characterized in that the storage medium stores computer program instructions which, when run on a processor, cause the processor to execute the method according to any one of claims 1 to 9.
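The depth-based selection recited in the claims above (a sequencing depth below a predetermined integer threshold of at least 1 marks the contigs whose reads go to a second assembly) can be illustrated with a small Python sketch. It is not the claimed implementation: the function names, the input dictionaries and the definition of depth as aligned bases divided by contig length are assumptions made for illustration, and the alignment itself (performed with BWA in the claims) is taken as already done.

```python
from typing import Dict, List, Set, Tuple


def mean_depth(aligned_bases: int, contig_length: int) -> float:
    """Average depth of one contig: aligned bases divided by contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return aligned_bases / contig_length


def low_depth_contigs(
    stats: Dict[str, Tuple[int, int]], threshold: int
) -> Set[str]:
    """Names of contigs whose mean depth is below the threshold.

    stats maps contig name -> (aligned_bases, contig_length).
    The claims require the threshold to be an integer not less than 1.
    """
    if threshold < 1:
        raise ValueError("threshold must be an integer not less than 1")
    return {
        name
        for name, (bases, length) in stats.items()
        if mean_depth(bases, length) < threshold
    }


def reads_to_reassemble(read_hits: Dict[str, str], low: Set[str]) -> List[str]:
    """Read IDs whose best-hit contig is low depth (read ID -> contig name)."""
    return sorted(read for read, contig in read_hits.items() if contig in low)
```

For the multi-threshold reassembly mentioned in Example 1, the same selection would presumably be repeated for several threshold values; the specific values are not given in this section and are left to the caller.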
PCT/CN2023/081733 2023-03-15 2023-03-15 Assembly process for constructing high-quality microbial genomes on basis of stlfr metagenomic sequencing data Pending WO2024187428A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/081733 WO2024187428A1 (en) 2023-03-15 2023-03-15 Assembly process for constructing high-quality microbial genomes on basis of stlfr metagenomic sequencing data

Publications (1)

Publication Number Publication Date
WO2024187428A1 true WO2024187428A1 (en) 2024-09-19

Family

ID=92754131

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081733 Pending WO2024187428A1 (en) 2023-03-15 2023-03-15 Assembly process for constructing high-quality microbial genomes on basis of stlfr metagenomic sequencing data

Country Status (1)

Country Link
WO (1) WO2024187428A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119513634A (en) * 2025-01-22 2025-02-25 吉林大学 Single metagenome contig sequence clustering method and system integrating semantic features

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102477460A (en) * 2010-11-24 2012-05-30 深圳华大基因科技有限公司 Method for sequencing and clustering analysis of metagenome 16S hypervariable region V6
CN109741790A (en) * 2018-11-12 2019-05-10 山东省医学科学院基础医学研究所 Method and system for metagenomic analysis of microbial next-generation sequencing data
US20210249102A1 (en) * 2018-05-31 2021-08-12 Arizona Board Of Regents On Behalf Of The University Of Arizona Methods for comparative metagenomic analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MALANOSKI ANTHONY P., LIN BAOCHUAN, EDDIE BRIAN J., WANG ZHENG, HERVEY W. JUDSON, GLAVEN SARAH M.: "Relative abundance of ‘ Candidatus Tenderia electrophaga’ is linked to cathodic current in an aerobic biocathode community", MICROBIAL BIOTECHNOLOGY, WILEY-BLACKWELL PUBLISHING LTD., GB, vol. 11, no. 1, 1 January 2018 (2018-01-01), GB , pages 98 - 111, XP093209443, ISSN: 1751-7915, DOI: 10.1111/1751-7915.12757 *
ZHANG ZHENMIAO, WANG HONGBO, YANG CHAO, HUANG YUFEN, YUE ZHEN, CHEN YANG, HAN LIJUAN, LYU AIPING, FANG XIAODONG, ZHANG LU: "Exploring high-quality microbial genomes by assembly of linked-reads with high barcode specificity using deep learning", BIORXIV, 9 September 2022 (2022-09-09), XP093209440, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/2022.09.07.506963v1.full.pdf> DOI: 10.1101/2022.09.07.506963 *
ZHANG, LU ET AL.: "A comprehensive investigation of metagenome assembly by linked-read sequencing", MICROBIOME, vol. 8, no. 1, 11 November 2020 (2020-11-11), XP021284032, DOI: 10.1186/s40168-020-00929-3 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23926783

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE