WO2016090583A1 - Sequencing data processing device and method - Google Patents

Sequencing data processing device and method

Info

Publication number
WO2016090583A1
WO2016090583A1 PCT/CN2014/093511 CN2014093511W WO2016090583A1 WO 2016090583 A1 WO2016090583 A1 WO 2016090583A1 CN 2014093511 W CN2014093511 W CN 2014093511W WO 2016090583 A1 WO2016090583 A1 WO 2016090583A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
sequence
result
read
reads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2014/093511
Other languages
English (en)
Chinese (zh)
Inventor
刘敬一
刘兴民
刘耿
赵鑫
杨明
侯勇
吴逵
李波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Priority to CN201480082793.4A (patent CN107077533B, zh)
Priority to PCT/CN2014/093511 (patent WO2016090583A1, fr)
Publication of WO2016090583A1
Legal status: Ceased

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection

Definitions

  • The present invention relates to the field of bioinformatics, and in particular to a sequencing data processing apparatus, a sequencing data processing system, and a method for processing sequencing data.
  • cfDNA (cell-free DNA), which is present in serum, plasma or other body fluids, is an effective biomarker that can be applied to the detection of a variety of variations, such as those associated with cancer, fetal chromosomal variation and other genetic mutations. Because highly sensitive and accurate quantitative analysis techniques have been lacking, previous studies have focused on a small number of known disease-related genes and conditions, such as the GNAQ gene in uveal melanoma (Metz, Stephan HD, et al. "Ultradeep sequencing detects GNAQ and GNA11 mutations in cell-free DNA from plasma of patients with uveal melanoma." Cancer Medicine 2.2 (2013): 208-215.) and trisomy 21 (Liao, Gary JW, et al. "Noninvasive prenatal diagnosis of fetal trisomy 21 by allelic ratio analysis using targeted massively parallel sequencing of maternal plasma DNA." PLoS One 7.5 (2012): e38154.).
  • MPS: massively parallel sequencing
  • CNV: copy number variation
  • Copy number variation is an important biomarker for many human diseases (such as cancer, hereditary diseases and cardiovascular diseases) and has become a research focus for many of them.
  • Detecting copy number variation in tumors can reveal the loss or duplication of tumor DNA across the genome.
  • CGH: comparative genomic hybridization
  • ROMA: representational oligonucleotide microarray analysis
  • These platforms (CGH and ROMA) detect small CNVs (below 20 kb) poorly, and are cumbersome to operate and costly.
  • The present invention aims to solve at least some of the above technical problems, or at least to provide a commercial alternative.
  • The present invention provides a sequencing data processing apparatus, comprising: a data receiving unit configured to receive the sequencing data, the sequencing data comprising a plurality of read pairs, each pair consisting of two reads that come from two positions of a chromosome fragment, where the two reads of a pair come from the positive and negative strands of the fragment respectively, or both from its positive strand or both from its negative strand, each read contains a gap, and the two reads of a pair are defined as a left arm and a right arm; and a processor for executing a data processing program, where executing the program includes aligning the sequencing data to a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result. The alignment result comprises the alignment results of a plurality of the read pairs, and/or the alignment results of a plurality of the left arms and of the right arms.
  • The pair of reads from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing it.
  • In one embodiment, multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform.
  • The distance between the two reads of a pair is controlled by the read length and by the distance between the enzyme's recognition site and its cleavage site.
  • The CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and sequences the resulting circular library with its combinatorial probe-anchor ligation (cPAL) technique.
  • cPAL: combinatorial probe-anchor ligation sequencing
  • The bases on both sides of each linker are read. The two parts of a linker are attached after restriction enzyme digestion, and each enzyme has a preferred cutting distance; in practice the cut is often one position longer or shorter than that distance, so reads often carry a gap, frequently of +1 or -1. In addition, or alternatively, if the same enzyme is used for several digestions during library construction, the cut position tends to shift, which also produces gapped reads.
  • For example, when constructing a multi-linker circular library, two digestions with the Alu enzyme are used to attach different linkers, and reading the bases adjacent to those linkers yields reads with gaps of +3/-3.
  • The size of the gap in the present invention may also be zero.
  • The 2-AD (two-linker) sequencing output has a total length of 60 bp and can be divided into two mate-paired read pairs. Within each pair, the reads have a small gap at the 10 bp position and an invalid sequencing site (N) at the 20 bp position, and the distance between the two reads of a pair is generally less than 2000 bp.
  • The terms "positive strand" and "negative strand" as used herein refer to the two complementary strands that make up a chromosome fragment, and are relative: if one strand is called the positive strand, its complementary strand is the negative strand.
  • In an embodiment of the present invention, the strand that matches the reference sequence is called the positive strand, and the other strand the negative strand.
  • The alignment can be performed with known alignment software such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform.
  • In one embodiment, the alignment is performed with TeraMap, and the resulting alignment result is in TeraMap format.
  • Eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the gap with as many N as its size, where N stands for A, T, C or G.
  • For example, a read with a negative gap of -2 nt can be divided at the gap into two parts whose ends overlap by 2 nt. If the two parts are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap (the overlapping AG) gives the read ATCGCTTAAGTACGATTC.
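  • The gap-elimination rule can be illustrated with a minimal Python sketch. The function name `eliminate_gap` and the overlap sanity check are assumptions made for illustration; the source only states the rule (drop overlapping bases for a negative gap, fill a positive gap with N).

```python
def eliminate_gap(left_part: str, right_part: str, gap: int) -> str:
    """Join the two parts of a gapped read into one gap-free read.

    gap < 0 : the parts overlap by -gap bases; the duplicated bases are
              dropped from the start of the right part.
    gap > 0 : the parts are separated by `gap` unread bases, filled with 'N'.
    gap == 0: the parts are simply concatenated.
    """
    if gap < 0:
        overlap = -gap
        # Sanity check (an assumption of this sketch): the overlapping bases agree.
        if left_part[-overlap:] != right_part[:overlap]:
            raise ValueError("parts do not overlap as the gap size claims")
        return left_part + right_part[overlap:]
    return left_part + "N" * gap + right_part


# The -2 nt example from the text: the overlapping "AG" is kept only once.
print(eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2))  # ATCGCTTAAGTACGATTC
```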
  • In one embodiment, the alignment comprises: aligning the left arm and the right arm of each read pair to the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; then, taking one of the first-level results as the baseline, aligning the other arm against it, to obtain a second-level left alignment result and a second-level right alignment result; and, based on the second-level results, obtaining the alignment results of the read pairs, or of the left arms and the right arms.
  • The first alignment is a global alignment against the reference sequence. The second alignment, which uses the left-arm (or right-arm) result as the baseline for the right arm (or left arm), is a local alignment. Alignments from the second-level left and right results that lie on the same chromosome, and whose separation matches the expected distance between the two reads of a pair, can then be paired into a read pair, giving the read-pair alignment result.
  • In one embodiment, the alignment comprises setting the gap size to several values and aligning each left arm or each right arm to the reference sequence once per value, to obtain an optimal alignment result.
  • For example, the gap of each left arm or right arm is set in turn to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt and 7 nt, giving a set of candidate reads. Each candidate is aligned to the reference sequence, and the best-aligning candidate is taken as the left arm/right arm. Which alignment is best can be judged by the default scoring of the alignment software used.
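  • The gap enumeration above can be sketched as follows. This is only an illustration: the mismatch-counting placement search stands in for the scoring of a real aligner (SOAP, BWA or TeraMap each use their own), and the names `build_candidate`, `best_mismatches` and `choose_gap`, as well as the tie-break toward the smaller gap, are assumptions.

```python
GAP_SIZES = range(-3, 8)  # -3 nt .. 7 nt, the gap values listed in the text


def build_candidate(left_part, right_part, gap):
    """Join two read parts under an assumed gap ('N' fills a positive gap)."""
    if gap < 0:
        return left_part + right_part[-gap:]
    return left_part + "N" * gap + right_part


def best_mismatches(candidate, reference):
    """Smallest mismatch count over all ungapped placements (N matches anything).

    Returns None when the candidate is longer than the reference.
    """
    best = None
    for start in range(len(reference) - len(candidate) + 1):
        window = reference[start:start + len(candidate)]
        mism = sum(1 for c, r in zip(candidate, window) if c != "N" and c != r)
        if best is None or mism < best:
            best = mism
    return best


def choose_gap(left_part, right_part, reference):
    """Return (gap, mismatches) for the candidate that aligns best."""
    scored = []
    for gap in GAP_SIZES:
        mism = best_mismatches(build_candidate(left_part, right_part, gap), reference)
        if mism is not None:
            # Ties are broken toward the smaller absolute gap (an assumption).
            scored.append((mism, abs(gap), gap))
    mism, _, gap = min(scored)
    return gap, mism


reference = "GGCA" + "ATCGCTTAAGTACGATTC" + "TGCC"
print(choose_gap("ATCGCTTAAG", "AGTACGATTC", reference))  # (-2, 0)
```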
  • In one embodiment, executing the data processing program further includes, before eliminating the gap of each read in the alignment result, extracting the unique alignment results and using them in place of the alignment result. The unique alignment results comprise the read pairs that align uniquely to the reference sequence, where both reads of each pair align to the same chromosome of the reference sequence and the distance between them matches the expected distance between the two positions of the chromosome fragment they came from.
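  • A minimal sketch of this filter is given below. The `ArmHit` record and its fields are hypothetical (the TeraMap output layout is not described here), and the expected distance is expressed as an assumed `[min_dist, max_dist]` range.

```python
from dataclasses import dataclass


@dataclass
class ArmHit:
    chrom: str   # chromosome the arm aligned to
    pos: int     # 1-based leftmost reference position
    n_hits: int  # number of places this arm aligned in the reference


def is_unique_pair(left, right, min_dist, max_dist):
    """Keep a pair only if both arms align uniquely, to the same chromosome,
    and their separation lies in the expected range."""
    if left.n_hits != 1 or right.n_hits != 1:
        return False
    if left.chrom != right.chrom:
        return False
    return min_dist <= abs(right.pos - left.pos) <= max_dist


def extract_unique(pairs, min_dist, max_dist):
    """Return the (left, right) pairs that pass is_unique_pair."""
    return [(left, right) for left, right in pairs
            if is_unique_pair(left, right, min_dist, max_dist)]
```

    For example, with an expected distance of at most 2000 bp, a pair whose arms align once each at chr1:100 and chr1:240 is kept, while a pair with a multiply aligned arm, arms on different chromosomes, or arms 8900 bp apart is dropped.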
  • In one embodiment, executing the data processing program further comprises a correction, so that both reads of each pair in the unique alignment results align to the positive strand of the same chromosome of the reference sequence. For example, for a pair whose reads align to the positive and the negative strand of a chromosome respectively, the read aligned to the negative strand is replaced with its reverse complement.
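  • The strand correction amounts to a reverse-complement operation, sketched below. The "+"/"-" strand labels and the function names are assumptions made for illustration.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_positive_strand(seq: str, strand: str) -> str:
    """Return the read as it would appear on the positive strand."""
    return reverse_complement(seq) if strand == "-" else seq


print(to_positive_strand("AAGC", "-"))  # GCTT
```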
  • executing the data processing program further comprises implementing a data format conversion, the data format conversion comprising converting the alignment result or the format of the unique alignment result.
  • The universal alignment result is required to be in SAM or BAM format, to facilitate subsequent analysis of the data based on the alignment result.
  • SAM is a common tab-delimited text format for alignments, and BAM is its compressed binary form.
  • Because different alignment software may be used, the format of the output alignment result or unique alignment result may not suit existing downstream data processing or analysis software. For example, an alignment result in the aforementioned TeraMap format does not meet the input format requirements of most existing variant detection software, such as SOAPsnp, GATK or SOAPindel. Converting the data format yields a universal alignment result in a common data format, which is convenient for further analysis and processing of the data.
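  • As an illustration of the target format, the sketch below writes one gap-free, unpaired alignment as a SAM record. It is not a TeraMap converter: the input arguments are a hypothetical stand-in for whatever the source alignment provides, and only the 11 mandatory SAM columns are filled in.

```python
def to_sam_line(name, chrom, pos, seq, reverse=False, mapq=255, qual="*"):
    """Format one unpaired, gap-free alignment as a SAM record.

    The 11 mandatory SAM columns are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR,
    RNEXT, PNEXT, TLEN, SEQ and QUAL. POS is 1-based; MAPQ 255 means the
    mapping quality is not available.
    """
    flag = 16 if reverse else 0   # 0x10: read is stored reverse-complemented
    cigar = f"{len(seq)}M"        # every base is an alignment match or mismatch
    fields = [name, flag, chrom, pos, mapq, cigar, "*", 0, 0, seq, qual]
    return "\t".join(str(field) for field in fields)


print(to_sam_line("read1", "chr1", 101, "ATCGCTTAAGTACGATTC"))
```

    A real conversion would also emit the SAM header and set the paired-end flags and mate fields (RNEXT, PNEXT, TLEN) for each read pair.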
  • According to another aspect, the invention provides a sequencing data processing system comprising a host and a display, and further comprising a sequencing data processing apparatus according to any embodiment of the present invention.
  • A method for processing sequencing data, comprising the steps of: acquiring sequencing data comprising a plurality of read pairs, each pair consisting of two reads that come from two positions of a chromosome fragment, where the two reads of a pair come from the positive and negative strands of the fragment respectively, or both from its positive strand or both from its negative strand, each read contains a gap, and the two reads of a pair are defined as a left arm and a right arm; aligning the sequencing data to a reference sequence to obtain an alignment result, which comprises the alignment results of a plurality of the read pairs and/or the alignment results of a plurality of the left arms and of the right arms; the gap of each of the
  • The pair of reads from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing it.
  • In one embodiment of the present invention, multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform, and the distance between the two reads of a pair is controlled by the read length and by the distance between the enzyme's recognition site and its cleavage site.
  • CG: Complete Genomics
  • The size of the gap in the present invention may also be zero. Among the reads obtained from a multi-linker library, any read can form a read pair with any other read.
  • Eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the gap with as many N as its size, where N stands for A, T, C or G. For example, a read with a negative gap of -2 nt can be divided at the gap into two parts whose ends overlap by 2 nt. If the two parts are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap (the overlapping AG) gives the read ATCGCTTAAGTACGATTC.
  • In one embodiment, obtaining the sequencing data comprises constructing a sequencing library, the sequencing library being a single-stranded circular DNA library built from one strand of the chromosome fragment and at least one predetermined DNA sequence.
  • The single-stranded circular library can be constructed by a known library construction method, for example by constructing a single-linker circular double-stranded library following the SOLiD paired-end library construction of Life Technologies, and then separating the two strands to obtain a single-stranded circular library.
  • In one embodiment, the single-stranded circular library is constructed using the CG library construction technique (see US7897344), yielding a multi-linker single-stranded circular library.
  • In one embodiment, the two reads of each pair come from the two ends of the chromosome fragment.
  • The two parts of a linker are ligated to the two ends of a chromosome fragment, and the product is made single-stranded and circularized to obtain a single-linker single-stranded circular library, which consists of one strand of the chromosome fragment and a predetermined DNA sequence joining the two ends of that strand.
  • The circular library is amplified by rolling-circle amplification to form DNA nanoballs (DNBs), which are loaded onto a chip and sequenced with CG's cPAL technology. For chip loading and cPAL, see US8278039B2 and US8518640B2 respectively.
  • The predetermined DNA sequence is a known sequence, namely the aforementioned linker or a concatenation of linkers.
  • In one embodiment, the improved CG library construction method builds a single-linker circular single-stranded library by the following steps: (1) extracting the nucleic acid to be tested; (2) terminally phosphorylating the nucleic acid to obtain a terminal phosphorylation product; (3) end-repairing the terminal phosphorylation product to obtain an end repair product; (4) ligating the first sequence and the second sequence to the two ends of the end repair product to obtain a first ligation product; (5) performing nick translation and amplification on the ligation product using the third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) using the biotin label to subject the amplification product to single-strand
  • The fourth sequence can join the first sequence and the second sequence to form the linker (adaptor). Nick translation eliminates the nick caused by the dideoxynucleotide at the other end of the first sequence and/or the second sequence attached to the two ends of the end repair product. Using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, so that a single-stranded product is easily separated on the basis of that label.
  • In another embodiment, the improved CG library construction method builds a single-linker circular single-stranded library by the following steps: (1) extracting the nucleic acid to be tested; (2) end-repairing the nucleic acid to obtain an end repair product; (3) terminally phosphorylating the end repair product to obtain a terminal phosphorylation product; (4) ligating the first sequence and the second sequence to the two ends of the terminal phosphorylation product to obtain a first ligation product; (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) using the biotin label to subject the amplification product to single-strand separation to obtain a single-stranded product; (7) circularizing the single-stranded product with a fourth sequence to obtain the sequencing library. The fourth sequence can join one end of the first sequence to one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  • End repair produces blunt-ended nucleic acid fragments to which other nucleotides or sequences can be attached.
  • Terminal phosphorylation reduces the ligation of sample nucleic acid fragments to one another, so that a library meeting sequencing requirements can be built even from samples with low nucleic acid content.
  • The single-linker circular single-stranded library is shown in Figure 1. When the constructed single-linker library (1-AD) is sequenced, the output read pair has a total length of about 30 bp, with one read of 12 bp and the other of 19 bp, and the median genomic distance between the two reads of a pair is about 140 bp.
  • Single-linker library construction requires little input material, which suits samples with low cfDNA content, and has the advantages of short construction time and low construction cost.
  • In one embodiment, the alignment in the method comprises: aligning the left arm and the right arm of each read pair to the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; then, taking one of the first-level results as the baseline, aligning the other arm against it, to obtain a second-level left alignment result and a second-level right alignment result; and, based on the second-level results, obtaining the alignment results of the read pairs, or of the left arms and the right arms.
  • In one embodiment, the alignment includes setting the gap size to several values, so that each left arm or each right arm is aligned to the reference sequence multiple times to obtain an optimal alignment result.
  • In one embodiment, the method further comprises a correction, so that both reads of each pair in the unique alignment results align to the positive strand of the same chromosome of the reference sequence. For example, for a pair whose reads align to the positive and the negative strand of a chromosome respectively, the read aligned to the negative strand is replaced with its reverse complement.
  • A computer-readable storage medium storing a program for execution by a computer, where executing the program comprises performing the sequencing data processing method of the foregoing aspect of the invention or any of its embodiments.
  • the foregoing description of the advantages and technical features of the sequencing data processing method of the present invention is also applicable to the computer readable storage medium, and details are not described herein again.
  • the storage medium may include: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • The present invention provides a method for detecting copy number variation (CNV), comprising: a. obtaining nucleic acid from a sample to be tested; b. sequencing the nucleic acid to obtain sequencing data; c. processing the sequencing data to obtain a universal alignment result; d. detecting CNV based on the universal alignment result. Step c is performed using the sequencing data processing apparatus and/or method of one aspect of the invention or any of its embodiments.
  • CNV copy number variation
  • In one embodiment, step b includes constructing a sequencing library from the nucleic acid, the sequencing library being a single-stranded circular DNA library whose construction comprises: terminally phosphorylating the nucleic acid to obtain a terminal phosphorylation product; end-repairing the terminal phosphorylation product to obtain an end repair product; ligating the first sequence and the second sequence to the two ends of the end repair product to obtain a first ligation product; performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; subjecting the amplification product to single-strand separation to obtain a single-stranded product; and circularizing the single-stranded product with a fourth sequence to obtain the sequencing library; wherein the fourth sequence can join one end of the first sequence to one end of the second sequence, the other end of the first sequence and
  • In another embodiment, end repair is performed first, followed by terminal phosphorylation.
  • The single-linker circular single-stranded library is shown in Figure 1. Single-linker library construction requires little input material, which suits samples with low cfDNA content, and also has the advantages of short construction time and low construction cost.
  • The fourth sequence can join the first sequence and the second sequence to form one linker. Nick translation eliminates the nick caused by the dideoxynucleotide at the other end of the first sequence and/or the second sequence attached to the two ends of the end repair product. Using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, so that the single-stranded product is easily separated on the basis of that label.
  • In one embodiment, the constructed library is sequenced using combinatorial probe-anchor ligation sequencing, for example on the CG sequencing platform.
  • Detecting CNV from the universal alignment result can use currently known CNV detection methods, such as hidden Markov models, circular binary segmentation, hierarchical segmentation or kernel smoothing algorithms.
  • the step d includes: setting a plurality of windows on the reference sequence, based on a general comparison result of the amount of the read segment matching the window and the comparison sample in the universal comparison result The difference in the amount of reads in the matching to the same window is significant, determining that the CNV is present in the sample nucleic acid to be tested, wherein the window is part of the reference sequence.
  • the size of the window can be adjusted according to the size of the pre-detected CNV, and the general comparison result of the comparison sample can be obtained by the method of one aspect of the present invention or the sequencing data processing method in any of the specific embodiments, whether the difference is
  • z-score standard score
  • the predetermined threshold is 3, that is, when the absolute value of z is greater than 3, it is determined that CNV occurs in the window.
  • the amount of the read segment may be a number or a ratio.
  • the z-score (standard score) is calculated on the sequencing depth of the window, where sequencing depth of a window = amount of reads aligned to the window / size of the window.
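  • A minimal Python sketch of the window depth calculation described above. The function name `window_depth`, the use of 0-based read start positions, and the rule that a read is assigned to the window containing its start position are illustrative assumptions, not details given in the text.

```python
from collections import Counter


def window_depth(read_starts, window_size):
    """Return {window index: depth}, with depth = read count / window size.

    Assumption: each read is assigned to the fixed-size window that
    contains its 0-based start position on the reference sequence.
    """
    counts = Counter(start // window_size for start in read_starts)
    return {idx: n / window_size for idx, n in sorted(counts.items())}


# Six hypothetical read start positions, 2000 bp windows.
depths = window_depth([5, 120, 1999, 2000, 4100, 4500], window_size=2000)
for idx, depth in depths.items():
    print(f"window {idx}: depth {depth}")
```

  • As stated in the text, the amount of reads could equally be expressed as a proportion of all reads; that would only change the numerator.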
  • because the GC content of the reads affects the sequencing depth during actual sequencing [Alkan, Can, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, et al. "Personalized Copy Number and Segmental Duplication Maps Using next-Generation Sequencing." Nature Genetics 41, no. 10 (October 2009): 1061–67], GC content correction is performed first to eliminate the effect of GC content on sequencing depth.
  • the GC content correction can use the sequencing data of multiple control samples: multiple windows are taken, the GC content and the average sequencing depth of each window are calculated, and two-dimensional regression analysis is performed on the GC content versus sequencing depth data, for example using locally weighted scatterplot smoothing (lowess regression), to establish the relationship between the two. The sequencing depth of each window is then corrected for GC content according to the regression relationship.
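  • The GC correction described above can be sketched as follows. This is an illustrative pure-Python locally weighted linear regression (tricube kernel, no robustness iterations), not the lowess implementation actually used by the inventors. The names `lowess_fit` and `gc_correct`, the `frac` parameter, and the choice to rescale each window's depth by the ratio of the overall mean depth to the fitted depth are assumptions.

```python
def lowess_fit(xs, ys, frac=0.5):
    """Fitted y at every x from a tricube-weighted local linear regression."""
    n = len(xs)
    k = max(2, int(round(frac * n)))        # points used in each local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12           # bandwidth = k-th smallest distance
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(gc, depth, frac=0.5):
    """Rescale each window's depth by (overall mean depth / depth expected at its GC)."""
    expected = lowess_fit(gc, depth, frac)
    mean_depth = sum(depth) / len(depth)
    return [d * mean_depth / e for d, e in zip(depth, expected)]


# Hypothetical windows whose depth rises linearly with GC content.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
depth = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0]
print([round(c, 3) for c in gc_correct(gc, depth)])
```

  • In this toy input the depth depends on GC content alone, so after correction every window lands close to the overall mean depth; real data would keep the residual variation that the CNV test looks at.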
  • the relationship between the sequencing depth and the GC content can be established by obtaining sequencing data of a plurality of control sample nucleic acids, the sequencing data being composed of a plurality of reading segments; setting a plurality of windows on the reference sequence, Sequencing data of the plurality of control samples are respectively compared with the window of the reference sequence, and each of the sequencing data of each control sample is calculated.
  • the number of the aforementioned control samples is not less than 30. With at least 30 samples, the sample data follow a distribution suitable for most statistical tests; for example, the t test and the z test generally require the sample data to conform to a normal distribution.
  • the sequencing data, the alignment result, and the like of the foregoing control samples can be obtained with the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments. They can be obtained at the same time as the sequencing data and alignment result of the sample to be tested, or obtained in advance and saved for later use.
  • the present invention provides a CNV detection device for performing all or part of the steps of the CNV detection method of one aspect of the present invention, the device comprising: a nucleic acid acquisition device for acquiring the nucleic acid of a sample to be tested; a sequencing device for sequencing the nucleic acid from the nucleic acid acquisition device to obtain sequencing data; a data processing device for processing the sequencing data from the sequencing device to obtain a universal alignment result; and a detection device for detecting the CNV based on the universal alignment result from the data processing device. The data processing device comprises: a data receiving unit for receiving sequencing data from the sequencing device, the sequencing data including multiple read pairs, each read pair consisting of two reads from two positions of one chromosome fragment, the two reads of each pair coming from the positive and negative strands of the chromosome fragment, respectively, or both from the positive strand or both from the negative strand, each read including a gap, and the two reads of a read pair being defined as the left arm and the right arm, respectively; a processor for executing a data processing program, wherein executing the data processing program includes aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result including alignment results of multiple read pairs and/or alignment results of multiple left arms and alignment results of multiple right arms; and at least one storage unit for storing data, including the data processing program.
  • the foregoing description of the advantages and technical features of the CNV detection method in any of its specific embodiments also applies to the CNV detection device of this aspect of the present invention and is not repeated here. Those skilled in the art will understand that all or some of the units of this device may optionally and detachably include one or more subunits to carry out the various embodiments of the aforementioned CNV detection method of the present invention.
  • sequencing data are obtained by single-linker sequencing on the CG platform, which is cheaper and faster.
  • the TeraMap2Sam conversion software was developed to convert the TeraMap alignment results of the CG platform into the common SAM format, so that many excellent open-source tools such as Samtools and GATK can be used directly for variant detection.
  • the CNV detection program developed from the CNV detection method and/or device of the present invention performs CNV analysis based on the standard score (z-score) method, with high speed and high resolution.
  • FIG. 1 is a schematic view showing the structure of a single-linker circular single-stranded library in one embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a sequencing data processing apparatus in an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a sequencing data processing system in an embodiment of the present invention.
  • FIG. 4 is a flow chart of a method for processing sequencing data in an embodiment of the present invention.
  • Figure 5 is a flow chart showing a method of processing sequencing data in an embodiment of the present invention.
  • FIG. 6 is a flow chart of a CNV detecting method in an embodiment of the present invention.
  • FIG. 7 is a flow chart of a CNV detecting method in an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a CNV detecting apparatus in an embodiment of the present invention.
  • Figure 9 is a flow diagram showing the construction and sequencing of a single linker library in one embodiment of the present invention.
  • Figure 10 is a flow chart of the algorithm of the Teramap2Sam software in one embodiment of the present invention.
  • the processing device 100 includes a data receiving unit 10, a processor 20, and a storage unit 30.
  • the processor 20 is connected to the data receiving unit 10 and the storage unit 30, and the storage unit 30 is connected to the data receiving unit 10.
  • the data receiving unit 10 is configured to receive sequencing data comprising multiple read pairs. Each read pair consists of two reads from two positions of one chromosome fragment; the two reads of each pair come from the positive and negative strands of the chromosome fragment, respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment. Each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm, respectively.
  • the read pairs from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing it.
  • multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform; the distance between the two reads of a pair is controlled by the read length and by the distance between the enzyme's recognition site and its cleavage site.
  • the CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and the constructed circular library is sequenced with its unique combinatorial probe-anchor ligation (cPAL) sequencing technique.
  • the two segments flanking a linker are used to construct the paired-end library. Each enzyme has a preferred cutting distance, but in actual digestion the cut often falls one position beyond or one position short of the preferred distance, so reads often carry a gap, typically +1 or -1. In addition, if the same enzyme is used for several digestions during library construction, the cleavage position tends to shift, and this shift also leaves gaps in the resulting reads. For example, when constructing a multi-linker circular library, the Alu enzyme is used for two digestions to attach different parts of the linkers, and reading the bases adjacent to these linkers produces reads with a gap of +3/-3.
  • the size of the gap in the present invention may also be zero.
  • the 2-AD sequencing output has a total length of 60 bp, which can be divided into two mate-paired read pairs. In each read pair, the reads have a small gap at the 10 bp position and an invalid sequencing site N at the 20 bp position, and the distance between the two reads of a pair is generally less than 2000 bp. Among the multiple reads of a multi-linker library, any read can form a read pair with any other read.
  • the positive strand and the negative strand are the two complementary strands constituting a chromosome fragment, and the terms are relative: when one strand is called the positive strand, its complementary strand is called the negative strand. In an embodiment of the present invention, the strand that matches the reference sequence is referred to as the positive strand, and the other strand is referred to as the negative strand.
  • the processor 20 is configured to execute a data processing program. Executing the data processing program comprises: aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result. The alignment result comprises alignment results of a plurality of read pairs and/or alignment results of a plurality of left arms and alignment results of a plurality of right arms.
  • the comparison can be performed by using known comparison software, such as SOAP, BWA, etc., or by using the comparison software TeraMap of the CG platform.
  • the alignment is performed using TeraMap, and the resulting alignment result is in TeraMap format.
  • eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the positive gap with a number of N equal to the gap size, where N is A, T, C or G; a read with a gap of 0 is not processed.
  • a read can be divided into two parts at the gap. For example, if the ends of the two parts overlap by 2 nt, the two parts being ATCGCTTAAG and AGTACGATTC, eliminating the negative gap (the overlapping AG) yields the read ATCGCTTAAGTACGATTC.
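  • The gap elimination rule above can be sketched in Python as follows. The function name `eliminate_gap` and its signature are illustrative assumptions; the sketch takes the two parts of a read and the signed gap between them, drops the overlapping bases for a negative gap, and pads a positive gap with the placeholder N.

```python
def eliminate_gap(left, right, gap, fill="N"):
    """Join the two parts of a read, removing the gap between them.

    gap < 0: the parts overlap by -gap bases; the overlap is kept once.
    gap > 0: the parts are separated by gap unknown bases, written as N.
    gap == 0: the parts are simply concatenated.
    """
    if gap < 0:
        overlap = -gap
        if left[-overlap:] != right[:overlap]:
            raise ValueError("the parts do not overlap as the gap value claims")
        return left + right[overlap:]
    return left + fill * gap + right


# The example from the text: a 2 nt overlap (AG), i.e. a gap of -2.
print(eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2))
# A hypothetical positive gap of 1 between two short parts.
print(eliminate_gap("ACGT", "TTGA", 1))
```

  • Checking that the overlapping bases really agree is an added safeguard in this sketch; the text itself only states that the overlapping bases are removed.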
  • the storage unit 30 is for storing data. The above-described data processing program is stored in the storage unit 30, as are the intermediate data and results produced when the data receiving unit 10 and the processor 20 process the sequencing data.
  • FIG. 3 is a block diagram showing the structure of a system in an embodiment of the sequencing data processing system of the present invention.
  • the sequencing data processing system 1000 includes a sequencing data processing device 100, a host 200, and a display device 300.
  • the host 200 can be an audio/video/signal source device, such as a computer host, mainframe, etc., for transmitting display data required by the display device 300.
  • the host 200 includes at least one interface electrically connected to the sequencing data processing device 100.
  • the sequencing data processing device 100 receives the sequencing data output from the host 200, processes the sequencing data, and then outputs the processed data or results to the display device 300.
  • the sequencing data processing method comprises the steps of: S1, acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads from two positions of one chromosome fragment, the two reads of each pair coming from the positive and negative strands of the chromosome fragment, respectively, or both from the positive strand or both from the negative strand of the chromosome fragment, each read including a gap, and the two reads of a read pair being defined as the left arm and the right arm, respectively; S2, aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result including alignment results of a plurality of the read pairs, and/or alignment results of a plurality of the left arms and alignment results of a plurality of the
  • the read pairs from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library.
  • multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform; the distance between the two reads of a pair is controlled by the read length and by the distance between the enzyme's recognition site and its cleavage site.
  • the CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and the constructed circular library is sequenced with its unique combinatorial probe-anchor ligation (cPAL) sequencing technique. The bases that are read are those on both sides of a linker, which were joined to it following restriction enzyme digestion.
  • the two segments flanking a linker are used to construct the paired-end library. Each enzyme has a preferred cutting distance, but in actual digestion the cut often falls one position beyond or one position short of the preferred distance, so reads often carry a gap, typically +1 or -1. In addition, if the same enzyme is used for several digestions during library construction, the cleavage position tends to shift, and this shift also leaves gaps in the resulting reads. For example, when constructing a multi-linker circular library, the Alu enzyme is used for two digestions to attach different parts of the linkers, and reading the bases next to these linkers produces reads with a gap of +3/-3.
  • the size of the gap in the present invention may also be zero. Among the multiple reads of a multi-linker library, any read can form a read pair with any other read.
  • the terms "positive strand" and "negative strand" as used herein denote the two complementary strands constituting a chromosome fragment, and they are relative: when one strand is called the positive strand, its complementary strand is called the negative strand. In an embodiment of the present invention, the strand that matches the reference sequence is referred to as the positive strand, and the other strand is referred to as the negative strand.
  • the comparison can be performed by using known comparison software, such as SOAP, BWA, etc., or by using the comparison software TeraMap of the CG platform.
  • the alignment is performed using TeraMap, and the resulting alignment result is in TeraMap format.
  • eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the positive gap with a number of N equal to the gap size, where N is A, T, C or G; a read with a gap of 0 is not processed.
  • a read can be divided into two parts at the gap. For example, if the ends of the two parts overlap by 2 nt, the two parts being ATCGCTTAAG and AGTACGATTC, eliminating the negative gap (the overlapping AG) yields the read ATCGCTTAAGTACGATTC.
  • FIG. 5 is a flow chart showing the data processing of one embodiment of the sequencing data processing method of the present invention.
  • the sequencing data processing method comprises: S10, acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads from two positions of one chromosome fragment, the two reads of each pair coming from the positive and negative strands of the chromosome fragment, respectively, or both from the positive strand or both from the negative strand of the chromosome fragment, each read including a gap, and the two reads of a read pair being defined as the left arm and the right arm, respectively; S20, aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result including alignment results of a plurality of the read pairs, and/or alignment results of a plurality of the left arms and alignment results of a plurality
  • Fig. 6 is a flow chart showing the detection of an embodiment of the CNV detecting method of the present invention.
  • the CNV detection method comprises the steps of: S11, acquiring the nucleic acid of a sample to be tested; S12, sequencing the nucleic acid to obtain sequencing data; S13, processing the sequencing data to obtain a universal alignment result; S14, detecting the CNV based on the universal alignment result; wherein S13 is performed using the sequencing data processing device and/or the sequencing data processing method of one aspect of the invention or of any of its embodiments.
  • detection of CNV based on the universal alignment result can use currently known CNV detection methods, such as a hidden Markov model, circular binary segmentation, hierarchical segmentation or a kernel smoothing algorithm.
  • Fig. 7 is a flow chart showing the detection of an embodiment of the CNV detecting method of the present invention.
  • the CNV detection method includes the steps of: S110 acquiring nucleic acid of a sample to be tested; S120 sequencing the nucleic acid to obtain sequencing data; S130 processing the sequencing data to obtain a general comparison result, and S130 is by the above-mentioned invention.
  • S150, correcting the sequencing depth of the window using the relationship between sequencing depth and GC content to obtain the corrected sequencing depth of the window;
  • S160, determining that the CNV is present in the nucleic acid of the sample to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window of the control sample, wherein the window is part of the reference sequence.
  • the number of the aforementioned control samples is not less than 30. With at least 30 samples, the sample data follow a distribution suitable for most statistical tests; for example, the t test and the z test generally require the sample data to conform to a normal distribution.
  • the sequencing data, the alignment result, and the like of the foregoing control samples can be obtained with the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments. They can be obtained at the same time as the sequencing data and alignment result of the sample to be tested, or obtained in advance and saved for later use.
  • the relationship between the sequencing depth and the GC content of the windows is determined by two-dimensional regression analysis, for example using lowess regression.
  • FIG. 8 is a block diagram showing the structure of an embodiment of a CNV detecting apparatus of the present invention.
  • the device 2000 includes: a nucleic acid acquisition device 200 for acquiring the nucleic acid of a sample to be tested; a sequencing device 400 for sequencing the nucleic acid from the nucleic acid acquisition device to obtain sequencing data; a data processing device 600 for processing the sequencing data from the sequencing device to obtain a universal alignment result; and a detection device 800 for detecting the CNV based on the universal alignment result from the data processing device 600. The data processing device 600 includes: a data receiving unit 610 for receiving sequencing data from the sequencing device, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads from two positions of one chromosome fragment, the two reads of each pair coming from the positive and negative strands of the chromosome fragment, respectively, or both from the positive strand or both from the negative strand, each read including a gap, and the two reads of a read pair being defined as the left arm and the right arm, respectively; a processor 630 for executing a data processing program, wherein executing the data processing program includes aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result including alignment results of a plurality of read pairs and/or alignment results of a plurality of left arms and alignment results of a plurality of right arms; and at least one storage unit 650 for storing data, including the data processing program.
  • peripheral blood plasma of lung cancer patients was taken as the test object.
  • the samples were from Southwest Hospital and tested as follows:
  • the above reaction product was purified by 60 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.
  • the above reaction product was purified by 40 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.
  • the two strands of the first sequence are: TTGGCCTCCGACT/3-ddT/(SEQ ID NO: 1), /5phos/AAGTCGGAGGCCAAGCGGTCGT/ddC/ (SEQ ID NO: 2).
  • the two strands of the second sequence are: /5Phos/GTCTCCAGTCGAAGCCCGACG/3ddC/(SEQ ID NO: 3), GCTTCGACTGGAGA/3ddC/(SEQ ID NO: 4).
  • the upstream primer in the third sequence is /5-bio/TCCTAAGACCGCTTGGCCTCCGACT (SEQ ID NO: 5),
  • the "x" in the middle is a variable tag sequence region and can be replaced by N, where N is A, T, C or G. When the library is not mixed with other sample libraries and only one sample library is loaded on the machine, no tag sequence is required, i.e. the third sequence can be
  • /5Phos/AGACAAGCTCGATCGGGCTTCGACTGGAGAC (SEQ ID NO: 7). In this example the sample is tumor cell-free nucleic acid, and the content of the target nucleic acid (ctDNA) in the mixed nucleic acid is low. If several such sample libraries were mixed on the machine, the mixed data would have to be split by sample, which loses part of the data. Moreover, the reads of the single-linker circular library are relatively short, so accurate mutation detection requires deep sequencing to obtain a relatively large amount of data. It is therefore preferable to load a single sample library on the machine.
  • the above reaction product was purified by 40 ul of Ampure XP beads and eluted with 37.4 ul of Elution buffer.
  • the above reaction product was purified by 50 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.
  • the PCR product was subjected to concentration determination using a Qubit dsDNA HS assay kit.
  • Tween20 0.5% Tween20, 1X BBB/Tween Mix, 1X BWB/Tween Mix, 0.1M NaOH.
  • the 0.5% Tween20 is prepared as described above, and the other three reagents are prepared as follows:
  • the product of this step can be stored frozen at -20 °C.
  • the ligase reaction mixture is shaken and thoroughly mixed. After centrifugation, 11 ul of the ligase reaction mixture is added to the EP tube that already contains the primer reaction mixture, shaken for 10 s, and briefly centrifuged.
  • the starting amount of sample used for DNB preparation was adjusted to 35.3-53 ng according to the concentration measured by single-strand quantification.
  • the corresponding volume of sample (≤60 ul) was transferred to a Biorad PCR plate and brought to a total volume of no more than 120 ul with 1X TE.
  • final concentration: 5.625-7.5 fmol/ul; volume: 120 ul; total amount: 35.3-53 ng.
  • DNB preparation for single-adapter sequencing requires 120 fmol, i.e. 16 ul at 7.5 fmol/ul. Therefore, the library needs to be diluted to 7.5 fmol/ul.
  • the sequencer output data of the first embodiment are processed.
  • with the sequencing data processing method and/or the CNV detection method of the present invention, based on the CG platform sequencing technology, enrichment, library construction, sequencing and data analysis of ultra-low-input cfDNA can be performed.
  • the sequencing reads are short, and re-sequenced (overlapping) bases and small gaps occur at specific positions, so it is difficult to align the sequencing results directly with ordinary alignment software.
  • the CG platform's proprietary TeraMap is therefore used for alignment. Its working principle is as follows. First, it aligns the two ends of the read (LeftArm, RightArm) separately, trying a variety of gap values when processing the read so as to obtain more alignment results. Then, taking the alignment result of each end as a reference, it aligns the other end locally (for example, for 4-AD the range of the local alignment is 0 to 700 bp). If both ends align well to the same chromosome and the insert size meets expectations (for example, for 4-AD the distance between the two reads of a read pair is 0-700 bp), only the best alignment result is output; otherwise, multiple alignment results for both ends are output.
  • TeraMap is the alignment software of the CG sequencing platform and can align CG-specific reads to the reference genome. Its output format consists of three parts, briefly described as follows: the first line is the read sequence information; the second and third lines describe the alignment status of the reads; the fourth and fifth lines give the details of the read alignment results.
  • the Teramap2Sam software is developed according to the method of the present invention, and the gap in the TeraMap comparison result is removed and converted into SAM (sequence alignment/map format).
  • Step 1: Extract the unique alignment results. Whether an alignment is unique is determined from the matchCount field of the TeraMap output; in addition, the insert length must meet the requirements and the reads of the two ends must align to the same reference sequence.
  • Step 2: Remove the gap. The gap position in the reads is determined according to the gaps field, and the read sequence is corrected.
  • Step 3: Calculate FLAG. The FLAG parameter of the SAM file is calculated from the alignment directions of the paired-end reads to obtain the alignment.
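  • Steps 1 and 3 above can be sketched in Python as follows. This is not the TeraMap2Sam source. The dictionary keys (`chrom`, `pos`, `match_count`), the function names, and the default `max_insert` of 700 bp (the 4-AD range mentioned in the text) are illustrative assumptions; the real TeraMap field layout is not reproduced here. The FLAG bit values are those defined by the SAM format, and marking every retained pair as properly paired is a choice of this sketch.

```python
# SAM FLAG bits (from the SAM format definition).
PAIRED, PROPER_PAIR = 0x1, 0x2
READ_REVERSE, MATE_REVERSE = 0x10, 0x20
FIRST_IN_PAIR, SECOND_IN_PAIR = 0x40, 0x80


def is_unique_pair(left, right, max_insert=700):
    """Step 1: keep a pair only if both arms align uniquely, on the same
    reference sequence, within the allowed insert length."""
    unique = left["match_count"] == 1 and right["match_count"] == 1
    same_ref = left["chrom"] == right["chrom"]
    insert_ok = abs(right["pos"] - left["pos"]) <= max_insert
    return unique and same_ref and insert_ok


def pair_flags(left_strand, right_strand):
    """Step 3: FLAG values for the left and right arm from their strands."""
    left = PAIRED | PROPER_PAIR | FIRST_IN_PAIR
    right = PAIRED | PROPER_PAIR | SECOND_IN_PAIR
    if left_strand == "-":
        left |= READ_REVERSE
        right |= MATE_REVERSE
    if right_strand == "-":
        right |= READ_REVERSE
        left |= MATE_REVERSE
    return left, right


# Hypothetical pair: left arm on the positive strand, right arm on the negative.
left_arm = {"chrom": "chr1", "pos": 1000, "match_count": 1}
right_arm = {"chrom": "chr1", "pos": 1300, "match_count": 1}
if is_unique_pair(left_arm, right_arm):
    print(pair_flags("+", "-"))
```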
  • SAM is a more general format for storing alignment information.
  • each line records the alignment of a read and consists mainly of eleven fields. Further optional fields can be appended to carry more information, such as XT:A:U, which means that the read is uniquely aligned.
  • BAM is the binary compressed form of SAM.
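  • To make the eleven-field layout concrete, the following Python sketch splits one SAM alignment line into its mandatory fields and optional tags. The function name `parse_sam_line` and the example record are hypothetical; the field names follow the SAM format definition.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split a tab-separated SAM alignment line into fields and tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, cols[:11])}
    tags = {}
    for item in cols[11:]:                  # optional fields look like TAG:TYPE:VALUE
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    record["tags"] = tags
    return record


# Hypothetical record for a uniquely aligned read (XT:A:U).
example = "\t".join(["read1", "99", "chr1", "1000", "60", "18M",
                     "=", "1300", "318", "ATCGCTTAAGTACGATTC", "*", "XT:A:U"])
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["FLAG"], rec["tags"])
```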
  • CG developed the Assembly Software for its read structure; it reassembles the reads and performs the follow-up analysis after the assembly is completed.
  • the reads here are short (12 bp).
  • the original CG mutation detection tool is therefore no longer applicable, or its detection results are poor.
  • BAM data are used to detect copy number variation.
  • the existing copy number variation detection methods include hidden Markov model, circular binary segmentation, hierarchical segmentation, and kernel smoothing algorithm.
  • We use the z-score (standard score) to obtain copy number variation results based on the read depth distribution of multiple windows with a total length of 1,000,000 bp.
  • because the GC content of the reads affects the sequencing depth during actual sequencing, multiple windows with a total length of 1,000,000 bp were taken, the GC content and the average sequencing depth of each window were calculated, lowess regression was performed on the GC content versus sequencing depth data, and the sequencing depth was corrected for GC content according to the regression curve.
  • the standard score, also called the z-score, is defined as $z = (x - \mu) / \sigma$, where $x$ is a specific score, $\mu$ is the mean, and $\sigma$ is the standard deviation.
  • the z value represents the distance between the original score and the population mean, measured in units of standard deviation. z is negative when the original score is lower than the mean and positive when it is higher.
  • copy number variation can be detected effectively by measuring, in units of standard deviation, the distance between the read count in a 2000 bp window (the original score) and the overall mean read count (from multiple normal control samples).
  • a positive z value indicates a copy number greater than 2 (a normal sample has 2 copies), such as a duplication; a negative z value indicates a copy number less than 2, such as a deletion.
  • the above CNV detection method in this embodiment is written as a program, and the program is named calcu_zscore_query, and the region where the absolute value of z is larger than 3 is judged to be CNV.
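  • A minimal Python sketch of z-score based calling with the threshold of 3 described above. It is not the `calcu_zscore_query` program itself; the function name `call_cnv`, the input layout (one list of per-window depths per sample), and the use of the sample standard deviation are assumptions. Only five control samples are used to keep the example short, whereas the text calls for at least 30.

```python
from statistics import mean, stdev


def call_cnv(sample_depths, control_depths, threshold=3.0):
    """Return (window index, z, 'gain' or 'loss') for windows with |z| > threshold.

    sample_depths:  corrected depth per window for the test sample.
    control_depths: one list of per-window depths for each control sample.
    """
    calls = []
    for window, x in enumerate(sample_depths):
        controls = [c[window] for c in control_depths]
        mu, sigma = mean(controls), stdev(controls)
        if sigma == 0:
            continue                        # z is undefined without spread
        z = (x - mu) / sigma
        if abs(z) > threshold:
            calls.append((window, z, "gain" if z > 0 else "loss"))
    return calls


# Hypothetical depths for three windows in five control samples.
controls = [[100, 98, 101], [102, 101, 99], [98, 100, 100],
            [101, 99, 102], [99, 102, 98]]
sample = [150, 100, 60]
for window, z, kind in call_cnv(sample, controls):
    print(f"window {window}: z = {z:.1f} ({kind})")
```

  • The first window is reported as a gain and the third as a loss, while the second window, whose depth equals the control mean, is not reported.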
  • CG single-linker sequencing method: compared with the traditional method, the CG single-linker sequencing method enables library construction and sequencing from ultra-low input. Only 1-10 ng of nucleic acid is needed for library construction, corresponding to 2-5 ml of peripheral blood, and the standardized CG process is simple and fast.
  • TeraMap alignment: after the result is converted to SAM format, it is more versatile than the closed TeraMap format and can be processed with software such as Samtools.
  • CNV can be detected quickly using the z-score (standard score): CNV analysis of 50× whole-genome data takes only 4 hours, whereas the CONTRA software [ http://sourceforge.net/projects/contra-cnv/ ] takes more than 1 day.
  • TeraMap is used for comparison.
  • the raw reads are obtained with the CG platform's integrated tool makeADF and are then aligned to the reference sequence with TeraMap.
  • the resulting alignment results are converted to the generic SAM format using TeraMap2Sam. Table 1 shows the results.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention relates to a sequencing data processing device (100), the device comprising: a data receiving unit (10) for receiving sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads originating respectively from two positions on one chromosome fragment, and each read containing a gap; a processor (20) for executing a data processing program, execution of the data processing program comprising aligning the sequencing data with a reference sequence to obtain an alignment result, and removing the gap from each read in the alignment result to obtain a generic alignment result; and at least one storage unit (30) for storing data, including the data processing program. The invention also relates to a sequencing data processing system and method, a computer-readable storage medium, and a CNV detection method and device.
PCT/CN2014/093511 2014-12-10 2014-12-10 Sequencing data processing device and method Ceased WO2016090583A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480082793.4A CN107077533B (zh) 2014-12-10 2014-12-10 Sequencing data processing device and method
PCT/CN2014/093511 WO2016090583A1 (fr) 2014-12-10 2014-12-10 Sequencing data processing device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/093511 WO2016090583A1 (fr) 2014-12-10 2014-12-10 Sequencing data processing device and method

Publications (1)

Publication Number Publication Date
WO2016090583A1 true WO2016090583A1 (fr) 2016-06-16

Family

ID=56106452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/093511 Ceased WO2016090583A1 (fr) 2014-12-10 2014-12-10 Sequencing data processing device and method

Country Status (2)

Country Link
CN (1) CN107077533B (fr)
WO (1) WO2016090583A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107077538B (zh) * 2014-12-10 2020-08-07 深圳华大生命科学研究院 测序数据处理装置和方法
CN116254320A (zh) * 2022-12-15 2023-06-13 纳昂达(南京)生物科技有限公司 平末端双链接头元件、试剂盒及平末端建库方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101914628A (zh) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 检测基因组目标区域多态性位点的方法及 系统
CN103525939A (zh) * 2013-10-28 2014-01-22 广州爱健生物技术有限公司 无创检测胎儿染色体非整倍体的方法和系统
CN103824001A (zh) * 2014-02-27 2014-05-28 北京诺禾致源生物信息科技有限公司 染色体的检测方法和装置
US20140272954A1 (en) * 2013-03-15 2014-09-18 Nabsys, Inc. Methods and systems for electronic karyotyping
CN104093858A (zh) * 2012-11-13 2014-10-08 深圳华大基因医学有限公司 确定生物样本中染色体数目异常的方法、系统和计算机可读介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2171088T3 (en) * 2007-06-19 2016-01-25 Stratos Genomics Inc Nucleic acid sequencing in a high yield by expansion
CA2707901C (fr) * 2007-12-05 2015-09-15 Complete Genomics, Inc. Determination efficace des bases dans les reactions de sequencage
WO2011143231A2 (fr) * 2010-05-10 2011-11-17 The Broad Institute Séquençage à haut rendement de banques à extrémités appariées de clones comportant de grands segments d'insertion
HUE047501T2 (hu) * 2013-05-15 2020-04-28 Bgi Genomics Co Ltd Eljárás kromoszómális szerkezeti abnormalitások kimutatására, és ennek eszköze
CN104156631B (zh) * 2014-07-14 2017-07-18 天津华大基因科技有限公司 染色体三倍体检验方法
CN104133914B (zh) * 2014-08-12 2017-03-08 厦门万基生物科技有限公司 一种消除高通量测序引入的gc偏差及对染色体拷贝数变异的检测方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462056A (zh) * 2017-05-19 2019-11-15 深圳华大生命科学研究院 基于dna测序数据的样本来源检测方法、装置和存储介质
CN110462056B (zh) * 2017-05-19 2023-08-29 深圳华大生命科学研究院 基于dna测序数据的样本来源检测方法、装置和存储介质
CN111383717A (zh) * 2018-12-29 2020-07-07 北京安诺优达医学检验实验室有限公司 一种构建生物信息分析参照数据集的方法及系统
CN115132271A (zh) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 一种基于批次内校正的cnv检测方法

Also Published As

Publication number Publication date
CN107077533B (zh) 2021-07-27
CN107077533A (zh) 2017-08-18

Similar Documents

Publication Publication Date Title
JP5972448B2 (ja) コピー数変異を検出する方法及びシステム
JP6585117B2 (ja) 胎児の染色体異数性の診断
TWI793586B (zh) 血漿dna之單分子定序
JP5938484B2 (ja) ゲノムのコピー数変異の有無を判断する方法、システム及びコンピューター読み取り可能な記憶媒体
US11041203B2 (en) Methods for assessing a genomic region of a subject
CN113362891A (zh) 用短读测序数据检测重复扩增
CN113832139A (zh) 使用具有独特分子索引(umi)的冗余读段在测序dna片段中抑制误差
KR20140050032A (ko) 샘플 중 상이한 이수성의 존재 또는 부재를 결정하는 방법
US20250037796A1 (en) Methods for detecting absence of heterozygosity by low-pass genome sequencing
WO2012068919A1 (fr) Bibliothèque d'adn et procédé de préparation de celle-ci, procédé et dispositif de détection de snp
Babarinde et al. Computational methods for mapping, assembly and quantification for coding and non-coding transcripts
US12416047B2 (en) Noninvasive prenatal diagnostic methods
WO2016090584A1 (fr) Procédé et dispositif de détermination de la concentration d'acide nucléique tumoral
WO2016090583A1 (fr) Dispositif et procédé de traitement de données de séquençage
CN106995851B (zh) 扩增pkd1外显子超长片段的pcr引物、检测pkd1基因突变的试剂盒及应用
KR20210021923A (ko) 핵산 단편간 거리 정보를 이용한 염색체 이상 검출 방법
CN107077538B (zh) 测序数据处理装置和方法
CN105765076B (zh) 一种染色体非整倍性检测方法及装置
US20230178182A1 (en) Method for detecting chromosomal abnormality by using information about distance between nucleic acid fragments
Qian et al. Noninvasive prenatal screening for common fetal aneuploidies using single-molecule sequencing
KR20220071122A (ko) 핵산 길이 비를 이용한 암 진단 및 예후예측 방법
WO2024245328A1 (fr) Procédés pour déterminer la maternité, la paternité ou la parenté et systèmes informatiques pour sa mise en œuvre
HK40058694A (en) Detecting repeat expansions with short read sequencing data
WO2018195878A1 (fr) Identificateur d'adn et utilisation correspondante
HK1245850B (en) Single-molecule sequencing of plasma dna

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14907893

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14907893

Country of ref document: EP

Kind code of ref document: A1