WO2016049929A1 - Method for constructing sequencing library and application thereof - Google Patents

Info

Publication number
WO2016049929A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing data
sequencing
strand
sequence
nucleic acid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2014/088059
Other languages
French (fr)
Chinese (zh)
Inventor
吕小星
钱朝阳
管彦芳
常连鹏
易鑫
朱红梅
杨玲
吴仁花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bgi Tianjin
BGI Shenzhen Co Ltd
Original Assignee
Bgi Tianjin
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bgi Tianjin, BGI Shenzhen Co Ltd
Priority to PCT/CN2014/088059
Publication of WO2016049929A1
Anticipated expiration
Ceased

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M: APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00: Apparatus for enzymology or microbiology
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00: Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06: Biochemical methods, e.g. using enzymes or whole viable microorganisms

Definitions

  • the invention relates to the field of biomedicine.
  • the invention relates to methods of constructing sequencing libraries, sequencing methods, methods of determining nucleic acid sequences, devices for constructing sequencing libraries, sequencing devices, and systems for determining nucleic acid sequences.
  • High-throughput sequencing is gaining increasing attention, but the detection of low-frequency mutations by high-throughput sequencing still needs to be improved.
  • the present invention aims to solve at least one of the technical problems existing in the prior art. To this end, in accordance with embodiments of the present invention, the present invention proposes methods for constructing sequencing libraries and means for detecting low frequency mutations.
  • the invention proposes a method of constructing a sequencing library.
  • the method comprises: (a) ligating a linker to each end of a double-stranded DNA fragment to obtain a ligation product, wherein the linker comprises a first strand and a second strand, the first strand partially matches the second strand, and the first strand comprises a first tag sequence, such that the linker defines a double-stranded region and two single-stranded tails, the sequence of one of the two single-stranded tails comprising the first tag sequence; (b) cleaving the ligation product into single-stranded DNA fragments; (c) performing a strand extension reaction on the single-stranded DNA fragments using a first primer to obtain a strand extension product, wherein the first primer comprises a second tag sequence, and the first primer is adapted to form a double-stranded structure with the first strand of the linker, except that there is a mis
  • in this way a sequencing library can be efficiently constructed, and in the constructed library each strand of the same double-stranded DNA fragment (also referred to herein as a "source sequence") yields an amplification product carrying the first tag sequence or the second tag sequence, respectively. In the analysis of subsequent sequencing results, the sequencing results for the two tags can therefore be used to correct each other, improving the reliability of the analysis results.
  • the double-stranded DNA fragment is obtained by subjecting a nucleic acid sample to end repair to obtain a repaired nucleic acid sample, and adding a base A at the 5' end of the repaired nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each end, which constitutes the double-stranded DNA fragment.
  • a linker can be conveniently added to both ends of the double-stranded DNA fragment in a subsequent operation. Thereby, the efficiency of constructing a sequencing library is improved.
  • the nucleic acid sample is at least a portion of human genomic DNA or a free nucleic acid.
  • the human free nucleic acid is extracted from the peripheral blood of the patient.
  • the patient has cancer, which is at least one selected from the group consisting of bladder cancer, prostate cancer, lung cancer, colorectal cancer, stomach cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer and liver cancer. Therefore, the method of the embodiment of the present invention can effectively analyze gene mutations in human patients, and can be effectively used for early diagnosis, individualized medication, and postoperative monitoring of common tumors.
  • At least a portion of the human genomic DNA is obtained by random disruption of human genomic DNA.
  • a linker can be conveniently added to both ends of the double-stranded DNA fragment in a subsequent operation. Thereby, the efficiency of constructing a sequencing library is improved.
  • the linker has a 3' base T sticky end.
  • a linker can be conveniently added to both ends of the double-stranded DNA fragment in a subsequent operation. Thereby, the efficiency of constructing a sequencing library is improved.
  • the single-stranded DNA fragment is obtained by subjecting the ligation product to denaturation treatment.
  • the denaturation treatment may be a heat denaturation treatment or an alkali denaturation treatment.
  • the single-stranded DNA fragment is screened using a probe prior to performing the strand extension, wherein the probe specifically recognizes a predetermined region.
  • the sequencing library can be efficiently constructed for the region of interest, and the efficiency of constructing the sequencing library and subsequent sequencing is improved.
  • the predetermined area comprises one of the following:
  • the probe is provided in the form of a chip. Thereby, the efficiency of probe screening can be improved.
  • the strand extension reaction is carried out in the presence of a UDG enzyme/FPG enzyme.
  • the first tag sequence and the second tag sequence are each independently 4 to 10 nt in length. According to an embodiment of the invention, the first tag sequence and the second tag sequence are both 8 nt in length. According to an embodiment of the invention, there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence. The inventors have surprisingly found that with such an arrangement, the efficiency of correction using the first tag sequence and the second tag sequence in subsequent analysis can be effectively improved.
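The tag constraints described above (length of 4 to 10 nt, at least 2 nt of mismatch between the two tags) can be illustrated with a small check. This is only a sketch: the function names and the example tag sequences are hypothetical (the sequences of SEQ ID NOs: 3-10 are not reproduced here), and it assumes both tags have the same length, as in the 8 nt embodiment, so that mismatches can be counted position by position.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_valid_tag_pair(first_tag: str, second_tag: str,
                      min_len: int = 4, max_len: int = 10,
                      min_mismatch: int = 2) -> bool:
    """Check a (first tag, second tag) pair against the stated constraints.

    Assumption: tags are compared position by position, so tags of
    different lengths are rejected by this sketch.
    """
    for tag in (first_tag, second_tag):
        if not (min_len <= len(tag) <= max_len):
            return False
        if set(tag) - set("ACGT"):
            return False
    if len(first_tag) != len(second_tag):
        return False
    return hamming(first_tag, second_tag) >= min_mismatch


# Hypothetical 8 nt tags, not taken from the sequence listing.
print(is_valid_tag_pair("ACGTACGT", "ACGTACCA"))
```

Requiring at least two mismatches means that a single sequencing error in a tag cannot turn a positive-strand tag into its paired negative-strand tag.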
  • the first strand of the linker has the sequence set forth in SEQ ID NO: 1
  • the second strand of the linker has the sequence set forth in SEQ ID NO: 2
  • the first tag has the sequence set forth in any one of SEQ ID NOs: 3-6
  • the second tag has the sequence set forth in at least one of SEQ ID NOs: 7-10
  • the first primer has the sequence set forth in SEQ ID NO: 11
  • the primers suitable for simultaneously amplifying the first tag sequence and the second tag sequence have the sequences set forth in SEQ ID NO: 12 and SEQ ID NO: 13.
  • the labels include, but are not limited to, the four pairs described above, and multiple pairs of labels can be designed as needed for simultaneous detection of multiple samples.
  • the invention proposes a sequencing method comprising: constructing a sequencing library according to the method described above; sequencing the sequencing library.
  • according to an embodiment of the invention, the sequencing is performed on Hiseq2000 or Hiseq 2500.
  • the invention provides a method of determining a nucleic acid sequence, comprising:
  • sequencing is performed according to the methods previously described in the claims to obtain sequencing results consisting of multiple sequencing data;
  • At least one subset of sequencing data is constructed, wherein all sequencing data in each subset of sequencing data corresponds to the same source sequence on the nucleic acid sample;
  • a sequence of the nucleic acid sample is determined based on the corrected sequencing data.
  • the calibration can be effectively performed based on the positive strand sequencing data and the negative strand sequencing data, thereby improving the reliability of the analysis result.
  • the sequencing is paired-end sequencing, and the sequencing result consists of a plurality of pairs of paired sequencing data.
  • constructing at least one subset of sequencing data based on the sequencing results is performed by the following steps:
  • a paired sequencing data index is determined for each pair of the plurality of pairs of paired sequencing data, the paired sequencing data index consisting of the initial N bases of each read of the paired sequencing data, wherein N is an integer between 10 and 20;
  • the at least one preliminary sequencing data subset is subdivided based on a Hamming distance between the sequencing data in the preliminary sequencing data subset to obtain a plurality of the sequencing data subsets.
  • N is 12.
  • the Hamming distance of any two pairs of paired sequencing data does not exceed 20.
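The subset construction described above (index from the first N bases of each read, then subdivision by Hamming distance) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, a read pair is assumed to be a `(read1, read2)` tuple of strings, the distance between two pairs is taken as the sum of mismatches over both reads (compared over their common length), and the greedy assignment to the first compatible cluster is an assumed strategy that keeps every pairwise distance within a subset at or below the limit.

```python
from collections import defaultdict


def pair_index(read1: str, read2: str, n: int = 12) -> str:
    """Index of a read pair: first n bases of read1 joined to first n bases of read2."""
    return read1[:n] + read2[:n]


def pair_distance(p, q) -> int:
    """Mismatches between two read pairs, summed over read1 and read2."""
    return sum(sum(a != b for a, b in zip(x, y)) for x, y in zip(p, q))


def build_subsets(pairs, n: int = 12, max_dist: int = 20):
    """Group read pairs by index, then split each group by Hamming distance."""
    preliminary = defaultdict(list)
    for pair in pairs:
        preliminary[pair_index(pair[0], pair[1], n)].append(pair)

    subsets = []
    for group in preliminary.values():
        clusters = []
        for pair in group:
            for cluster in clusters:
                # Join a cluster only if close to every member already in it.
                if all(pair_distance(pair, m) <= max_dist for m in cluster):
                    cluster.append(pair)
                    break
            else:
                clusters.append([pair])
        subsets.extend(clusters)
    return subsets


a = ("A" * 30, "C" * 30)
b = ("A" * 30, "C" * 29 + "G")                    # 1 mismatch from a
c = ("A" * 12 + "T" * 18, "C" * 12 + "G" * 18)    # same index as a, 36 mismatches
d = ("G" * 30, "C" * 30)                          # different index
print([len(s) for s in build_subsets([a, b, c, d])])
```

In the example, `a` and `b` end up in one subset, while `c` shares their index but is split off because it differs too much to come from the same template.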
  • each sequencing data subset contains at least two positive strand sequencing data and at least two negative strand sequencing data.
  • determining the corrected sequencing data based on the positive strand sequencing data and the negative strand sequencing data is based on the following principles:
  • Each base in the corrected sequencing data is simultaneously supported by at least 50% positive strand sequencing data and at least 50% negative strand sequencing data.
  • each base in the corrected sequencing data is simultaneously supported by at least 80% positive strand sequencing data and at least 80% negative strand sequencing data.
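The correction principle above can be sketched as a per-position vote that must pass the support threshold on both strands. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the negative-strand reads are assumed to have already been brought into the same orientation and coordinates as the positive-strand reads, and positions without a uniquely supported base are written as `N` (the source does not say how such positions are handled).

```python
from collections import Counter


def corrected_read(pos_reads, neg_reads, min_support: float = 0.8,
                   min_reads: int = 2):
    """Consensus of one subset, requiring support from both strands.

    Returns None when either strand has fewer than min_reads reads.
    """
    pos_reads, neg_reads = list(pos_reads), list(neg_reads)
    if len(pos_reads) < min_reads or len(neg_reads) < min_reads:
        return None
    length = min(len(r) for r in pos_reads + neg_reads)
    consensus = []
    for i in range(length):
        pos = Counter(r[i] for r in pos_reads)
        neg = Counter(r[i] for r in neg_reads)
        # Bases reaching the threshold on the positive AND the negative strand.
        supported = [
            base for base in pos
            if pos[base] / len(pos_reads) >= min_support
            and neg[base] / len(neg_reads) >= min_support
        ]
        consensus.append(supported[0] if len(supported) == 1 else "N")
    return "".join(consensus)


positive = ["ACGT", "ACGT", "ACGA"]
negative = ["ACGT", "ACGT"]
print(corrected_read(positive, negative, min_support=0.8))
print(corrected_read(positive, negative, min_support=0.5))
```

With the 80% threshold the last position is left uncalled, because only two of the three positive-strand reads agree; with the 50% threshold it is called as `T`.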
  • the method further includes:
  • the corrected sequencing data is aligned to a reference sequence and the sequencing data with a quality of less than 30 is deleted.
  • SNV analysis or Indel analysis is performed based on the sequence of the nucleic acid sample.
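The quality filter applied after alignment can be illustrated on SAM-formatted text. This sketch assumes that "quality" here means the mapping quality (MAPQ, the fifth column of a SAM record), which the source does not state explicitly; the function name and the example records are hypothetical.

```python
def filter_by_mapq(sam_lines, min_mapq: int = 30):
    """Keep SAM header lines and alignments whose MAPQ is at least min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # A SAM alignment line has at least 11 mandatory fields; MAPQ is field 5.
        if len(fields) >= 11 and int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept


example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t12\t4M\t*\t0\t0\tACGT\tIIII",
]
for record in filter_by_mapq(example):
    print(record.split("\t")[0])
```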
  • the invention proposes an apparatus for constructing a sequencing library.
  • the apparatus comprises:
  • a linking unit for ligating a linker to each end of the double-stranded DNA fragment to obtain a ligation product, wherein the linker includes a first strand and a second strand, the first strand partially matches the second strand, and the first strand comprises a first tag sequence, such that the linker defines a double-stranded region and two single-stranded tails, the sequence of one of the two single-stranded tails comprising the first tag sequence;
  • a cleavage unit for cleaving the ligation product into single-stranded DNA fragments;
  • a strand extension unit for performing a strand extension reaction on the single-stranded DNA fragments with a first primer to obtain a strand extension product, wherein the first primer includes a second tag sequence, and the first primer is adapted to form a double-stranded structure with the first strand of the linker, except that there is a mismatch between the first tag sequence and the second tag sequence;
  • an amplification unit for amplifying the strand extension product to obtain an amplification product, the amplification product constituting the sequencing library, wherein the amplification uses primers adapted to simultaneously amplify the first tag sequence and the second tag sequence.
  • the above apparatus can effectively implement the method for constructing a sequencing library described above and can efficiently construct a sequencing library. In the constructed library, each strand of the same double-stranded DNA fragment (also referred to herein as a "source sequence") yields an amplification product carrying the first tag sequence or the second tag sequence, respectively, so that in the analysis of subsequent sequencing results the sequencing results for the two tags can be mutually corrected to improve the reliability of the analysis results.
  • the apparatus further includes:
  • an end repair unit for end-repairing a nucleic acid sample to obtain a repaired nucleic acid sample;
  • a terminal modification unit for adding a base A at the 5' end of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each end, wherein the nucleic acid sample having a sticky-end base A at each end constitutes the double-stranded DNA fragment.
  • a screening unit for screening the single-stranded DNA fragment using a probe before the chain extension is performed, wherein the probe specifically recognizes a predetermined region.
  • the predetermined area comprises one of the following:
  • the probe is provided in the form of a chip.
  • the strand extension reaction is carried out in the presence of a UDG enzyme/FPG enzyme.
  • the first tag sequence and the second tag sequence are each independently 4 to 10 nt in length.
  • the first tag sequence and the second tag sequence are both 8 nt in length.
  • the first strand of the linker has the sequence set forth in SEQ ID NO: 1
  • the second strand of the linker has the sequence set forth in SEQ ID NO: 2
  • the first tag has the sequence set forth in any one of SEQ ID NOs: 3-6
  • the second tag has the sequence set forth in at least one of SEQ ID NOs: 7-10
  • the first primer has the sequence set forth in SEQ ID NO: 11
  • the primers suitable for simultaneously amplifying the first tag sequence and the second tag sequence have the sequences set forth in SEQ ID NO: 12 and SEQ ID NO: 13.
  • the labels include, but are not limited to, the four pairs described above, and multiple pairs of labels may be involved as needed for simultaneous detection of multiple samples.
  • the invention proposes a sequencing device.
  • the sequencing device comprises: a device for constructing a sequencing library according to the foregoing; a sequencing device for sequencing the sequencing library.
  • the sequencing device is Hiseq2000 or Hiseq 2500.
  • the invention proposes a system for determining a nucleic acid sequence.
  • the system comprises:
  • a sequencing data subset construction device for constructing at least one subset of sequencing data based on the sequencing result, wherein all sequencing data in each subset of sequencing data corresponds to the same source sequence on the nucleic acid sample;
  • a sequencing data classification device configured to determine, for each subset of the sequencing data, sequencing data corresponding to the first tag sequence as positive strand sequencing data, and sequencing data corresponding to the second tag sequence as negative strand sequencing data;
  • a sequencing data correction device for correcting the sequencing data for each of the sequencing data subsets based on the positive strand sequencing data and the negative strand sequencing data, respectively, to determine corrected sequencing data;
  • a sequence determining device for determining a sequence of the nucleic acid sample based on the corrected sequencing data.
  • the method of determining a nucleic acid sequence as described above can be efficiently carried out using a system for determining a nucleic acid sequence according to an embodiment of the present invention. Therefore, the calibration can be effectively performed based on the positive strand sequencing data and the negative strand sequencing data, thereby improving the reliability of the analysis result.
  • the sequencing is paired-end sequencing, and the sequencing result consists of a plurality of pairs of paired sequencing data.
  • the sequencing data subset construction device comprises:
  • a sequencing data index determining device for determining a paired sequencing data index for each pair of the plurality of pairs of paired sequencing data, the paired sequencing data index consisting of the first N bases of each read of the paired sequencing data, wherein N is an integer between 10 and 20;
  • a preliminary screening device for constructing at least one preliminary sequencing data subset based on the paired sequencing data index, wherein each of the sequencing data subsets has the same paired sequencing data index;
  • a secondary screening device for subdividing the at least one preliminary sequencing data subset based on a Hamming distance between the sequencing data in the preliminary sequencing data subset to obtain a plurality of the sequencing data subsets.
  • N is 12.
  • the Hamming distance of any two pairs of paired sequencing data does not exceed 20.
  • the positive strand sequencing data and the negative strand sequencing data are at least two, respectively.
  • determining the corrected sequencing data based on the positive strand sequencing data and the negative strand sequencing data is based on the following principles:
  • Each base in the corrected sequencing data is simultaneously supported by at least 50% positive strand sequencing data and at least 50% negative strand sequencing data.
  • each base in the corrected sequencing data is simultaneously supported by at least 80% positive strand sequencing data and at least 80% negative strand sequencing data.
  • the method further includes:
  • the corrected sequencing data is aligned to a reference sequence and the sequencing data with a quality of less than 30 is deleted.
  • sequence analysis device for performing SNV analysis or Indel analysis based on the sequence of the nucleic acid sample.
  • FIG. 1 shows a flow chart of a method of constructing a sequencing library in accordance with an embodiment of the present invention
  • Figure 3 shows the results of analysis of a mutation spectrum according to an embodiment of the present invention
  • Figure 5 shows the results of analysis of a mutation spectrum in accordance with one embodiment of the present invention
  • Figure 7 shows the results of analysis of a mutation spectrum according to an embodiment of the present invention.
  • Figure 8 shows an analysis result of the same index reads cluster in accordance with one embodiment of the present invention.
  • Figure 9 shows the results of analysis of a mutation spectrum according to an embodiment of the present invention.
  • Figure 10 shows the results of an analysis of the same indexed reads cluster in accordance with one embodiment of the present invention.
  • Figure 11 shows the results of analysis of a mutation spectrum according to one embodiment of the present invention.
  • the exon sequence of the relevant gene was retrieved.
  • the final chip only involved the CDS region of the above gene and extended the CDS region by 20 bp.
  • the chip is densely covered with capture probes, with 98% coverage of the target region; it enriches target DNA fragments from the complex genome and captures the genomic regions with high specificity and high coverage on a single chip.
  • The tag, named index1, not only distinguishes different samples but is also used subsequently to mark the positive strand.
  • the obtained ligation product was subjected to chip hybridization capture, and the eluted single-stranded template product was amplified for one round of a single cycle with a primer carrying index2, so that the reverse strand was labeled.
  • UDG/FPG enzyme was added and incubated during the PCR to eliminate DNA damage in the template strand and reduce the occurrence of false positives.
  • the product carrying the double index on the positive and negative strands is purified and then subjected to a second round of PCR enrichment to complete the preparation of the library.
  • the sequencing method adopts Hiseq2000 or Hiseq2500. According to the difference in the amount of sequencing and the number of samples, the appropriate sequencing platform can be flexibly selected.
  • the specific steps include:
  • the cfDNA extracted from the plasma was then subjected to a three-step enzymatic reaction according to the KAPA LTP Library Preparation Kit.
  • an early cancer screening chip designed by the inventors is used, and hybridization capture is performed according to the instructions provided by the chip manufacturer. Finally, the product is eluted and dissolved in 21 μL of ddH2O together with the hybridization elution magnetic beads.
  • Double-index positive and negative strand tagging and enrichment
  • PCR1 performs reverse strand labeling and template DNA damage repair
  • PCR2 performs amplification and enrichment to complete library preparation.
  • the hybridization elution magnetic beads were first removed, and then 40 μL of Agencourt AMPure XP reagent was added for magnetic bead purification; finally the product was dissolved in 20 μL of ddH2O, and the mixture with the magnetic beads was used for the next reaction.
  • the magnetic beads of the previous step were removed first, then 50 μL of Agencourt AMPure XP reagent was added again for magnetic bead purification; finally the product was dissolved in 25 μL of ddH2O, followed by QC and sequencing on the machine.
  • for each pair of paired reads (paired sequencing data), the first 12 bp of reads1 and the first 12 bp of reads2 are joined into a short 24 bp sequence, which is used as the index of the paired reads, and the positive and negative strands are marked according to the index.
  • the Hamming distance between any two pairs of paired reads in each small cluster is no more than 10, in order to distinguish reads that have the same index but come from different DNA templates.
  • Step 4: the duplicate clusters of the same DNA template obtained in step 3 are screened. If the positive strand and the negative strand each have at least 2 pairs of reads, subsequent analysis is performed.
  • the new reads are re-aligned to the genome using the bwa mem algorithm, and the reads with a quality less than 30 are screened out.
  • a base type that is inconsistent with the predominant base type is regarded as a mutated base type.
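The strand marking and cluster screening steps above can be sketched as follows. This is only an illustration: the record layout (a dict with a `tag` key holding the observed index1 or index2 sequence), the function names and the two example tag sequences are assumptions, and the screening threshold is read as at least 2 read pairs on each strand, matching the 2+/2- condition used in the examples.

```python
def split_by_strand(cluster, index1: str, index2: str):
    """Split a cluster into positive-strand (index1) and negative-strand (index2) read pairs."""
    positive = [r for r in cluster if r["tag"] == index1]
    negative = [r for r in cluster if r["tag"] == index2]
    return positive, negative


def screen_clusters(clusters, index1: str, index2: str, min_pairs: int = 2):
    """Keep only clusters with at least min_pairs read pairs on each strand."""
    kept = []
    for cluster in clusters:
        positive, negative = split_by_strand(cluster, index1, index2)
        if len(positive) >= min_pairs and len(negative) >= min_pairs:
            kept.append((positive, negative))
    return kept


# Hypothetical tags and read-pair records for illustration only.
INDEX1, INDEX2 = "AAAAAAAA", "AACCAAAA"
good = [{"tag": INDEX1}, {"tag": INDEX1}, {"tag": INDEX2}, {"tag": INDEX2}]
one_sided = [{"tag": INDEX1}, {"tag": INDEX1}, {"tag": INDEX1}, {"tag": INDEX2}]
print(len(screen_clusters([good, one_sided], INDEX1, INDEX2)))
```

A cluster such as `one_sided`, with plenty of reads but only one from the negative strand, is discarded because an error present on a single strand could not be cross-checked.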
  • Example 1 Early screening of gynecological reproductive tract tumors
  • the WCNpan chip includes: driver genes related to gynecological reproductive tract tumors (cervical cancer, endometrial cancer, ovarian cancer), high-frequency mutated genes, and important genes in 12 cancer signaling pathways, 42 genes in total, 300 kb.
  • the specific design process of the chip: the exon sequences of the above 42 genes are retrieved according to the human genome HG19. Considering the size and cost of the capture region, the final chip covers only the CDS regions of the above genes, each extended by 20 bp upstream and downstream, for a chip total of 300 kb. The chip is densely covered with capture probes, with 98% coverage of the target region; it enriches target DNA fragments from a complex genome and captures approximately 300 kb of genomic region with high specificity and high coverage on a single chip.
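The interval step of the chip design (CDS regions extended by 20 bp on each side, then totalled) can be sketched as below. The function names, the example coordinates and the 0-based half-open interval convention are assumptions for illustration; overlapping extended regions are merged so that no base is counted twice, a detail the source does not spell out.

```python
def extend_and_merge(intervals, flank: int = 20):
    """Extend (chrom, start, end) intervals by flank bp on each side and merge overlaps."""
    extended = sorted(
        (chrom, max(0, start - flank), end + flank)
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in extended:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval on the same chromosome.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def total_size(intervals) -> int:
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for _, start, end in intervals)


# Hypothetical CDS coordinates, not taken from the chip design.
cds = [("chr1", 100, 200), ("chr1", 230, 300), ("chr2", 10, 50)]
target = extend_and_merge(cds)
print(target, total_size(target))
```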
  • the positive and negative strand interoperability rate: the ratio of clusters of at least 3 reads that contain both positive and negative strands to all clusters of at least 3 reads, used to evaluate positive and negative strand interoperability in the available data; effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- to the total number of sequencing reads; average sequencing depth: the average coverage of bases in the target region after error correction based on the effective data.
  • The analysis result of the same-index reads clusters is shown in Fig. 2, in which the abscissa indicates the number of duplicates (dup) in a cluster, and the ordinate indicates the total number of reads in clusters with that number of dups. It can be seen from Fig. 2 that most clusters have a dup number of around 8, and most clusters satisfy the condition of 2 positive + 2 negative strands.
  • the effective utilization rate of the final data is 5.14%, and the average sequencing depth is 1153.6X.
  • The results of the mutation spectrum analysis are shown in Fig. 3, in which, for a double-stranded molecule (DNA), complementary mutation types have substantially the same theoretical mutation frequency.
  • the abscissa represents the type of base mutation; the ordinate represents the number of mutations.
  • the results in Fig. 3 show that the distribution of mutated base types is balanced, and the mutation frequency (mutations per nucleotide) is 1.7 × 10⁻⁶.
  • Example 2 Individualized medication for twelve common tumors
  • the CANPer-YY chip includes: oncogenes, tumor suppressor genes, 12 common cancer high-frequency genes, important genes in 12 signal pathways of cancer, target drugs and chemotherapeutic drugs, etc., a total of 524 genes, 750KB.
  • the main design process of the chip is divided into 4 steps:
  • the mutated samples of the two selected intervals are taken as the sample database, and a third interval is screened in the same way, until the sample database includes all the samples, which gives the set of exon regions; for genes from which no interval was selected, all intervals of the gene are added to the chip intervals.
  • the killing effect of chemotherapeutic drugs on tumor cells is significantly correlated with the expression and/or polymorphism of a specific (a group of) genes.
  • the detection of related genes predicts the efficacy of chemotherapeutic drugs and selects appropriate drugs for individualized chemotherapy. It has become a reasonable choice to improve efficacy and reduce ineffective treatment.
  • the PharmGKB database is used to integrate all the current chemotherapeutic drugs and the genes related to curative effect and predictive evaluation of therapeutic effects, and to form a database for interpretation of individualized drugs for chemotherapy.
  • the chemotherapy data was integrated into the individualized information flow of the tumor to complete the automated interpretation of the chemotherapy drug.
  • Targeted drugs show significant efficacy and few side effects in tumor therapy, but they depend on their targets (including proteins, DNA, etc.). Target analysis must be performed on a patient before it can be determined whether the patient can take the drug. Current FDA-approved targeted drugs, as well as drugs in phase III and IV clinical trials, are integrated. According to the NCCN clinical guidelines and clinical drug gene research, the relationships between drug target genes and targeted drugs are collated to form a database for interpreting individualized targeted drugs.
  • a patient with advanced gastric cancer (one of the 12 common tumors) is subjected to the individualized drug guidance test according to the steps of the above method, and the results are as follows:
  • the positive and negative strand interoperability rate: the ratio of clusters of at least 3 reads that contain both positive and negative strands to all clusters of at least 3 reads, used to evaluate positive and negative strand interoperability in the available data; effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- to the total number of sequencing reads; average sequencing depth: the average coverage of bases in the target region after error correction based on the effective data.
  • The analysis result of the same-index reads clusters is shown in Fig. 4, in which the abscissa represents the number of duplicates (dup) in a cluster, and the ordinate represents the total number of reads in clusters with that number of dups.
  • the results in Figure 4 show that most clusters have a dup number of around 5, and most clusters satisfy the condition of 2 positive + 2 negative strands.
  • the final effective data utilization rate is 3.5%, and the average sequencing depth is 667X.
  • The results of the mutation spectrum analysis are shown in Fig. 5, in which complementary mutation types are substantially the same for the double-stranded molecule (DNA), the abscissa represents the type of base mutation, and the ordinate represents the number of mutations.
  • the results in Figure 5 show that the distribution of mutated base types is basically balanced, and the mutation frequency (mutations per nucleotide) is 4.2 × 10⁻⁶.
  • the chemotherapy sites are shown in the following table:
  • the Colorectalpan chip includes: driver genes, high-frequency mutated genes, and important genes in 12 cancer signaling pathways, 60 genes in total, 123 kb.
  • the main design process of the chip is divided into 4 steps:
  • the mutated samples of the two selected intervals are taken as the sample database, and a third interval is screened in the same way, until the sample database includes all the samples, which gives the set of exon regions; for genes from which no interval was selected, all intervals of the gene are added to the chip intervals.
  • a colorectal cancer early screening test is performed on a patient with intestinal polyps according to the steps of the above method, and the results are as follows:
  • the positive and negative strand interoperability rate: the ratio of clusters of at least 3 reads that contain both positive and negative strands to all clusters of at least 3 reads, used to evaluate positive and negative strand interoperability in the available data; effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- to the total number of sequencing reads; average sequencing depth: the average coverage of bases in the target region after error correction based on the effective data.
  • The analysis result of the same-index reads clusters is shown in Fig. 6, where the abscissa represents the number of duplicates (dup) in a cluster, and the ordinate represents the total number of reads in clusters with that number of dups.
  • the results in Figure 6 show that most clusters have a dup number of around 6, and most clusters satisfy the condition of 2 positive + 2 negative strands.
  • the final effective data utilization rate is 5.12%, and the average sequencing depth is 1033X.
  • the results of the mutation spectrum analysis are shown in Figure 7, in which complementary mutation types are substantially the same for the double-stranded molecule (DNA), the abscissa represents the type of base mutation, and the ordinate represents the number of mutations.
  • the results of Fig. 7 show that the distribution of mutated base types is basically balanced, and the mutation frequency (mutations per nucleotide) is 2.2 × 10⁻⁶.
  • Mutation detection list details (based on exon area and non-synonymous mutation statistics):
  • the Lungpan chip includes lung cancer-related driver genes, high-frequency mutated genes, and important genes in 12 cancer signaling pathways, totaling 145 genes, 250 kb.
  • the main design process of the chip is divided into 4 steps:
  • the samples carrying mutations in the two selected intervals are used as the sample database, and a third interval is screened in the same way, until the sample database includes all the samples, which gives the exon region set; for genes from which no interval was selected, all intervals of the gene are added to the chip intervals.
  • a lung nodule patient is subjected to early screening of lung cancer according to the steps of the above method, and the results are as follows:
  • positive/negative strand complementarity rate: the ratio (clusters of 3 or more with both positive and negative strands) / (total reads of 3 or more), used to evaluate positive/negative strand complementarity in the usable data; effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- to the total number of sequencing reads; average sequencing depth: the average base coverage of the target region after error correction based on effective data.
  • the analysis of clusters of reads sharing the same index is shown in Fig. 8.
  • the abscissa represents the number of duplications (dup) of a cluster, and the ordinate represents the total number of reads in clusters with that number of dups.
  • the results in Fig. 8 show that most clusters have a dup number of around 10, and the larger part of the clusters satisfy the condition of 2 positive + 2 negative strands.
  • the final effective data utilization is 4.12%, and the average sequencing depth is 898X.
  • the results of the mutation spectrum analysis are shown in Fig. 9, in which complementary mutation types are treated as substantially equivalent for a double-stranded molecule (DNA), the abscissa represents the type of base mutation, and the ordinate represents the number of mutations.
  • the results in Fig. 9 show that the distribution of mutated base types is essentially balanced, with a mutation frequency (mutations per nucleotide) of 2.6 × 10^-6.
  • Mutation detection list details (based on exon area and non-synonymous mutation statistics):
  • the CANPer-JK chip includes driver genes related to 12 common cancers, high-frequency mutated genes, and important genes in 12 cancer signaling pathways, totaling 547 genes, 800 kb.
  • the main design process of the chip is divided into 4 steps:
  • the samples carrying mutations in the two selected intervals are used as the sample database, and a third interval is screened in the same way, until the sample database includes all the samples, which gives the exon region set; for genes from which no interval was selected, all intervals of the gene are added to the chip intervals.
  • a postoperative breast cancer patient (one of 12 common tumors) is subjected to postoperative monitoring and detection of breast cancer according to the steps of the above method, and the results are as follows:
  • positive/negative strand complementarity rate: the ratio (clusters of 3 or more with both positive and negative strands) / (total reads of 3 or more), used to evaluate positive/negative strand complementarity in the usable data; effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- to the total number of sequencing reads; average sequencing depth: the average base coverage of the target region after error correction based on effective data.
  • the analysis of clusters of reads sharing the same index is shown in Fig. 10, in which the abscissa represents the number of duplications (dup) of a cluster, and the ordinate represents the total number of reads in clusters with that number of dups.
  • the results in Fig. 10 show that most clusters have a dup number of around 6, and most clusters satisfy the condition of 2 positive + 2 negative strands.
  • the final effective data utilization is 4.74%, and the average sequencing depth is 1028.6X.
  • the results of the mutation spectrum analysis are shown in Fig. 11, in which complementary mutation types are treated as substantially equivalent for a double-stranded molecule (DNA), the abscissa represents the type of base mutation, and the ordinate represents the number of mutations.
  • the results in Fig. 11 show that the distribution of mutated base types is essentially balanced, with a mutation frequency (mutations per nucleotide) of 3.1 × 10^-6.
  • Mutation detection list details (based on exon area and non-synonymous mutation statistics):

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biochemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Microbiology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Medicinal Chemistry (AREA)
  • Molecular Biology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Chemical & Material Sciences (AREA)
  • Sustainable Development (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a method for constructing a sequencing library, a sequencing method, a method for determining a nucleic acid sequence, an apparatus for constructing a sequencing library, a sequencing device and a system for determining a nucleic acid sequence.

Description

Method for constructing a sequencing library and application thereof

Technical field

The invention relates to the field of biomedicine. In particular, the invention relates to a method of constructing a sequencing library, a sequencing method, a method of determining a nucleic acid sequence, an apparatus for constructing a sequencing library, a sequencing device, and a system for determining a nucleic acid sequence.

Background

High-throughput sequencing is attracting increasing attention, but its use for detecting low-frequency mutations still needs improvement.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, according to embodiments of the invention, the invention provides a method for constructing a sequencing library and means for detecting low-frequency mutations.

In a first aspect, the invention provides a method of constructing a sequencing library. According to an embodiment of the invention, the method comprises:

(a) ligating a linker to each of the two ends of a double-stranded DNA fragment to obtain a ligation product, wherein the linker comprises a first strand and a second strand, the first strand and the second strand are partially complementary, and the first strand comprises a first tag sequence, such that the linker defines a double-stranded region and two single-stranded tails, the sequence of one of the two single-stranded tails comprising the first tag;

(b) cleaving the ligation product into single-stranded DNA fragments;

(c) performing a strand extension reaction on the single-stranded DNA fragments using a first primer to obtain a strand extension product, wherein the first primer comprises a second tag sequence, and the first primer is adapted to form a double-stranded structure with the first strand of the linker, except that mismatches exist between the first tag sequence and the second tag sequence; and

(d) amplifying the strand extension product to obtain an amplification product, the amplification using primers adapted to simultaneously amplify the first tag sequence and the second tag sequence.

Thus, the method of constructing a sequencing library according to embodiments of the invention can effectively construct a sequencing library. In the constructed library, for each strand of the same double-stranded DNA fragment (also referred to herein as the "source sequence"), amplification products carrying the first tag sequence and the second tag sequence are respectively obtained. In the analysis of subsequent sequencing results, the sequencing results of the two tags can therefore be used to correct each other, improving the reliability of the analysis.

According to an embodiment of the invention, the double-stranded DNA fragment is obtained by the following steps: subjecting a nucleic acid sample to end repair to obtain a repaired nucleic acid sample; and adding a base A to the 5' end of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each of its two ends, which constitutes the double-stranded DNA fragment. Linkers can thus be conveniently added to both ends of the double-stranded DNA fragment in subsequent operations, improving the efficiency of constructing the sequencing library.

According to an embodiment of the invention, the nucleic acid sample is at least a portion of human genomic DNA or cell-free nucleic acid. According to an embodiment of the invention, the human cell-free nucleic acid is extracted from the peripheral blood of a patient. According to an embodiment of the invention, the patient has cancer, the cancer being at least one selected from: bladder cancer, prostate cancer, lung cancer, colorectal cancer, stomach cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer. The method of the embodiments of the invention can thus effectively analyze gene mutations in patients with human diseases, and can in turn be used for early diagnosis, individualized medication, and postoperative monitoring of common tumors.

According to an embodiment of the invention, the at least a portion of human genomic DNA is obtained by randomly fragmenting human genomic DNA. Linkers can thus be conveniently added to both ends of the double-stranded DNA fragment in subsequent operations, improving the efficiency of constructing the sequencing library.

According to an embodiment of the invention, the linker has a 3' T-base sticky end. Linkers can thus be conveniently added to both ends of the double-stranded DNA fragment in subsequent operations, improving the efficiency of constructing the sequencing library.

According to an embodiment of the invention, the single-stranded DNA fragments are obtained by subjecting the ligation product to denaturation. Single-stranded DNA fragments can thus be obtained quickly and efficiently. According to some embodiments of the invention, the denaturation may be heat denaturation or alkali denaturation.

According to an embodiment of the invention, before the strand extension is performed, the single-stranded DNA fragments are screened using probes, wherein the probes specifically recognize a predetermined region. A sequencing library can thus be constructed efficiently for the region of interest, improving the efficiency of library construction and subsequent sequencing.

According to an embodiment of the invention, the predetermined region comprises one of the following:

(1) at least one of the genes shown in Tables 1 to 5;

(2) the CDS region of (1); and

(3) a region of at least 10 bp upstream and downstream of (2).

Sequencing library construction and subsequent nucleic acid sequence analysis can thus be carried out efficiently for cancer-related genes.

According to an embodiment of the invention, the probes are provided in the form of a chip. The efficiency of probe screening can thus be improved.

According to an embodiment of the invention, the strand extension reaction is carried out in the presence of UDG enzyme/FPG enzyme. Damaged DNA can thus be effectively repaired during strand extension, reducing false positives and improving the quality of the constructed sequencing library.

According to an embodiment of the invention, the first tag sequence and the second tag sequence each independently have a length of 4 to 10 nt. According to an embodiment of the invention, the first tag sequence and the second tag sequence are both 8 nt in length. According to an embodiment of the invention, there are at least 2 nt of mismatches between the first tag sequence and the second tag sequence. The inventors have surprisingly found that this arrangement effectively improves the efficiency of correction using the first tag sequence and the second tag sequence in subsequent analysis.
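The tag constraints above (4 to 10 nt, at least 2 mismatched positions between the first and second tag) can be checked mechanically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the example tags are invented for illustration and are not the sequences of SEQ ID NOs: 3-10, and the two tags are assumed to be of equal length so that mismatches can be counted position by position.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_valid_tag_pair(first_tag: str, second_tag: str,
                      min_len: int = 4, max_len: int = 10,
                      min_mismatch: int = 2) -> bool:
    """Check the 4-10 nt length range and the minimum number of mismatches."""
    for tag in (first_tag, second_tag):
        if not min_len <= len(tag) <= max_len:
            return False
    if len(first_tag) != len(second_tag):
        return False  # assumption: tags are compared position by position
    return hamming(first_tag, second_tag) >= min_mismatch


# Hypothetical 8-nt tags, not taken from the patent's sequence listing.
print(is_valid_tag_pair("ACGTACGT", "ACGTTGGT"))  # two mismatched positions
print(is_valid_tag_pair("ACGTACGT", "ACGTACGA"))  # one mismatched position
```

Requiring at least two mismatches means that a single sequencing error inside a tag cannot convert a first tag into its paired second tag, which is one plausible reading of why the arrangement helps later correction.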

According to an embodiment of the invention, the first strand of the linker has the sequence shown in SEQ ID NO: 1, the second strand of the linker has the sequence shown in SEQ ID NO: 2, the first tag has the sequence shown in any one of SEQ ID NOs: 3-6, the second tag has the sequence shown in at least one of SEQ ID NOs: 7-10, the first primer has the sequence shown in SEQ ID NO: 11, and the primers adapted to simultaneously amplify the first tag sequence and the second tag sequence have the sequences shown in SEQ ID NO: 12 and SEQ ID NO: 13.

Figure PCTCN2014088059-appb-000001

In the sequence of the first strand of the linker, "XXXXXXXX" represents the first tag sequence; in the sequence of the first primer, "XXXXXXXX" represents the second tag sequence.

According to embodiments of the invention, the tags include but are not limited to the four pairs described above; multiple pairs of tags can be designed as needed for simultaneous detection of multiple samples.

In a second aspect, the invention provides a sequencing method, comprising: constructing a sequencing library according to the method described above; and sequencing the sequencing library.

According to an embodiment of the invention, the sequencing is performed on a Hiseq2000 or Hiseq2500.

The efficiency of sequencing can thus be effectively improved. In addition, the features and advantages described above for the method of constructing a sequencing library apply equally to this sequencing method and are not repeated here.

In a third aspect, the invention provides a method of determining a nucleic acid sequence, comprising:

sequencing a nucleic acid sample according to the sequencing method described above, to obtain a sequencing result consisting of a plurality of sequencing data;

constructing at least one sequencing data subset based on the sequencing result, wherein all sequencing data in each sequencing data subset correspond to the same source sequence on the nucleic acid sample;

for each sequencing data subset, determining the sequencing data corresponding to the first tag sequence to be positive-strand sequencing data, and the sequencing data corresponding to the second tag sequence to be negative-strand sequencing data;

for each sequencing data subset, correcting the sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data, to determine corrected sequencing data; and

determining the sequence of the nucleic acid sample based on the corrected sequencing data.

Correction can thus be performed effectively on the basis of the positive-strand and negative-strand sequencing data, improving the reliability of the analysis.
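The strand assignment step can be illustrated with a short sketch. It assumes that the tag has already been extracted from each read (the text here does not say at which read position the tag is found), and it uses hypothetical tag sets rather than the sequences of SEQ ID NOs: 3-10; the names `FIRST_TAGS`, `SECOND_TAGS`, and `classify_by_tag` are illustrative only.

```python
# Hypothetical tag sets; the real tags are given in the sequence listing.
FIRST_TAGS = {"ACGTACGT"}    # first tag sequence  -> positive-strand data
SECOND_TAGS = {"ACGTTGGT"}   # second tag sequence -> negative-strand data


def classify_by_tag(reads):
    """Split (tag, sequence) records of one subset into strand groups.

    Reads whose tag matches neither set are kept apart instead of guessed.
    """
    groups = {"positive": [], "negative": [], "unassigned": []}
    for tag, sequence in reads:
        if tag in FIRST_TAGS:
            groups["positive"].append(sequence)
        elif tag in SECOND_TAGS:
            groups["negative"].append(sequence)
        else:
            groups["unassigned"].append(sequence)
    return groups


subset = [
    ("ACGTACGT", "TTGACCA"),
    ("ACGTTGGT", "TTGACCA"),
    ("ACGTACGT", "TTGACCA"),
    ("GGGGGGGG", "TTGACCA"),  # tag error or unknown tag
]
groups = classify_by_tag(subset)
print({name: len(seqs) for name, seqs in groups.items()})
```

Keeping unmatched reads in a separate group is a design choice of this sketch; a real pipeline might instead tolerate one mismatch in the tag.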

According to an embodiment of the invention, the sequencing is paired-end sequencing, and the sequencing result consists of a plurality of pairs of sequencing data.

According to an embodiment of the invention, constructing at least one sequencing data subset based on the sequencing result is performed by the following steps:

for each pair of the plurality of pairs of sequencing data, determining a paired sequencing data index, the paired sequencing data index consisting of the first N bases of each read of the pair, where N is an integer between 10 and 20;

constructing at least one preliminary sequencing data subset based on the paired sequencing data index, wherein all sequencing data in a preliminary sequencing data subset have the same paired sequencing data index; and

subdividing the at least one preliminary sequencing data subset based on the Hamming distance between the sequencing data in the preliminary sequencing data subset, to obtain a plurality of the sequencing data subsets.

According to an embodiment of the invention, N is 12.

According to an embodiment of the invention, in each of the plurality of sequencing data subsets, the Hamming distance between any two pairs of sequencing data does not exceed 20.
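The two-stage grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the text does not specify the clustering procedure, so a simple greedy scheme is used in which a read pair joins a cluster only if it is within the distance limit of every member (which guarantees the "any two pairs" condition); the distance between two pairs is assumed to be the sum of the Hamming distances of read 1 and read 2; and all function names are hypothetical.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Mismatch count over the overlapping length (reads assumed equal length)."""
    return sum(x != y for x, y in zip(a, b))


def pair_index(read1: str, read2: str, n: int = 12) -> str:
    """Paired index: the first n bases of each read of the pair."""
    return read1[:n] + read2[:n]


def pair_distance(p, q) -> int:
    """Assumed distance between two read pairs: sum over both reads."""
    return hamming(p[0], q[0]) + hamming(p[1], q[1])


def build_subsets(pairs, n: int = 12, max_dist: int = 20):
    # Stage 1: preliminary subsets of pairs sharing the same paired index.
    preliminary = defaultdict(list)
    for read1, read2 in pairs:
        preliminary[pair_index(read1, read2, n)].append((read1, read2))

    # Stage 2: subdivide so that any two pairs in a subset are within max_dist.
    subsets = []
    for group in preliminary.values():
        clusters = []
        for pair in group:
            for cluster in clusters:
                if all(pair_distance(pair, other) <= max_dist for other in cluster):
                    cluster.append(pair)
                    break
            else:
                clusters.append([pair])
        subsets.extend(clusters)
    return subsets


# Toy data with short reads, so n and max_dist are scaled down.
p1 = ("AAAACCCC", "GGGGTTTT")
p2 = ("AAAACCCA", "GGGGTTTT")  # same index as p1, distance 1
p3 = ("AAAATTTT", "GGGGAAAA")  # same index as p1, distance 8
p4 = ("CCCCAAAA", "GGGGTTTT")  # different index
for subset in build_subsets([p1, p2, p3, p4], n=4, max_dist=1):
    print(subset)
```

With the values given in the text, the call would use `n=12` and `max_dist=20`.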

According to an embodiment of the invention, each of the plurality of sequencing data subsets contains at least two positive-strand sequencing data and at least two negative-strand sequencing data.

According to an embodiment of the invention, determining the corrected sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data follows this principle:

each base in the corrected sequencing data is simultaneously supported by at least 50% of the positive-strand sequencing data and at least 50% of the negative-strand sequencing data.

According to an embodiment of the invention, each base in the corrected sequencing data is simultaneously supported by at least 80% of the positive-strand sequencing data and at least 80% of the negative-strand sequencing data.
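The correction principle can be read as a per-position consensus that must be supported on both strands. The sketch below is one possible reading, with several assumptions flagged: negative-strand reads are assumed to have already been converted to the same orientation as the positive-strand reads; positions without a uniquely supported base are masked as "N" (the text does not say how such positions are handled); and the function name and parameters are hypothetical.

```python
from collections import Counter


def duplex_consensus(pos_reads, neg_reads, min_support=0.5, min_reads=2):
    """Return a consensus in which every called base is supported by at least
    min_support of the positive-strand reads AND of the negative-strand reads.

    Returns None if the subset has fewer than min_reads reads on either strand.
    """
    if len(pos_reads) < min_reads or len(neg_reads) < min_reads:
        return None
    length = min(len(r) for r in list(pos_reads) + list(neg_reads))
    consensus = []
    for i in range(length):
        pos_counts = Counter(r[i] for r in pos_reads)
        neg_counts = Counter(r[i] for r in neg_reads)
        candidates = [
            base for base in "ACGT"
            if pos_counts[base] >= min_support * len(pos_reads)
            and neg_counts[base] >= min_support * len(neg_reads)
        ]
        # Assumption: call a base only when exactly one base qualifies.
        consensus.append(candidates[0] if len(candidates) == 1 else "N")
    return "".join(consensus)


pos = ["ACGT", "ACGT", "ACTT"]   # one positive-strand read has an error
neg = ["ACGT", "ACGT"]
print(duplex_consensus(pos, neg, min_support=0.5))
print(duplex_consensus(pos, neg, min_support=0.8))
# A change seen on only one strand is not supported by both strands:
print(duplex_consensus(["ACGT", "ACGT"], ["ACAT", "ACAT"]))
```

A true mutation in the source molecule appears on both strands and survives this filter, whereas damage or amplification errors confined to one strand do not, which is the rationale for requiring support from both.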

According to an embodiment of the invention, the method further comprises:

aligning the corrected sequencing data to a reference sequence, and deleting sequencing data with an alignment quality of less than 30.
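The quality filter amounts to a simple threshold on each aligned record. The sketch below assumes the alignment quality is available as an integer per read (as a mapping quality would be in common alignment formats); the `Alignment` record type and `filter_by_quality` function are hypothetical and stand in for whatever an actual aligner produces.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal record for one aligned, corrected read."""
    name: str
    quality: int  # alignment quality reported by the aligner


def filter_by_quality(alignments: List[Alignment],
                      min_quality: int = 30) -> List[Alignment]:
    """Delete reads whose alignment quality is less than min_quality."""
    return [a for a in alignments if a.quality >= min_quality]


aligned = [Alignment("read1", 60), Alignment("read2", 29), Alignment("read3", 30)]
print([a.name for a in filter_by_quality(aligned)])
```

A read with quality exactly 30 is kept, since the text only removes data with quality below 30.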

According to an embodiment of the invention, SNV analysis or indel analysis is performed based on the sequence of the nucleic acid sample.

In a fourth aspect, the invention provides an apparatus for constructing a sequencing library. According to an embodiment of the invention, the apparatus comprises:

a ligation unit, for ligating a linker to each of the two ends of a double-stranded DNA fragment to obtain a ligation product, wherein the linker comprises a first strand and a second strand, the first strand and the second strand are partially complementary, and the first strand comprises a first tag sequence, such that the linker defines a double-stranded region and two single-stranded tails, the sequence of one of the two single-stranded tails comprising the first tag;

a cleavage unit, for cleaving the ligation product into single-stranded DNA fragments;

a strand extension unit, for performing a strand extension reaction on the single-stranded DNA fragments using a first primer to obtain a strand extension product, wherein the first primer comprises a second tag sequence, and the first primer is adapted to form a double-stranded structure with the first strand of the linker, except that mismatches exist between the first tag sequence and the second tag sequence; and

an amplification unit, for amplifying the strand extension product to obtain an amplification product, the amplification product constituting the sequencing library, wherein the amplification uses primers adapted to simultaneously amplify the first tag sequence and the second tag sequence.

According to embodiments of the invention, the above apparatus can effectively carry out the method of constructing a sequencing library described above and can effectively construct a sequencing library. In the constructed library, for each strand of the same double-stranded DNA fragment (also referred to herein as the "source sequence"), amplification products carrying the first tag sequence and the second tag sequence are respectively obtained. In the analysis of subsequent sequencing results, the sequencing results of the two tags can therefore be used to correct each other, improving the reliability of the analysis.

According to an embodiment of the invention, the apparatus further comprises:

an end repair unit, for subjecting a nucleic acid sample to end repair to obtain a repaired nucleic acid sample; and

an end modification unit, for adding a base A to the 5' end of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each of its two ends, which constitutes the double-stranded DNA fragment.

According to an embodiment of the invention, the apparatus further comprises a screening unit, for screening the single-stranded DNA fragments using probes before the strand extension is performed, wherein the probes specifically recognize a predetermined region.

According to an embodiment of the invention, the predetermined region comprises one of the following:

(1) at least one of the genes shown in Tables 1 to 5;

(2) the CDS region of (1); and

(3) a region of at least 10 bp upstream and downstream of (2).

According to an embodiment of the invention, the probes are provided in the form of a chip.

According to an embodiment of the invention, the strand extension reaction is carried out in the presence of UDG enzyme/FPG enzyme. Damaged DNA can thus be effectively repaired during strand extension, reducing false positives and improving the quality of the constructed sequencing library.

According to an embodiment of the invention, the first tag sequence and the second tag sequence each independently have a length of 4 to 10 nt.

According to an embodiment of the invention, the first tag sequence and the second tag sequence are both 8 nt in length.

According to an embodiment of the invention, there are at least 2 nt of mismatches between the first tag sequence and the second tag sequence.

According to an embodiment of the invention, the first strand of the linker has the sequence shown in SEQ ID NO: 1, the second strand of the linker has the sequence shown in SEQ ID NO: 2, the first tag has the sequence shown in any one of SEQ ID NOs: 3-6, the second tag has the sequence shown in at least one of SEQ ID NOs: 7-10, the first primer has the sequence shown in SEQ ID NO: 11, and the primers adapted to simultaneously amplify the first tag sequence and the second tag sequence have the sequences shown in SEQ ID NO: 12 and SEQ ID NO: 13.

Figure PCTCN2014088059-appb-000002

Figure PCTCN2014088059-appb-000003

According to embodiments of the invention, the tags include but are not limited to the four pairs described above; multiple pairs of tags can be designed as needed for simultaneous detection of multiple samples.

Those skilled in the art will understand that the features and advantages described above for the method of constructing a sequencing library apply equally to the apparatus for constructing a sequencing library and are not repeated here.

In a fifth aspect, the invention provides a sequencing device. According to an embodiment of the invention, the sequencing device comprises: the apparatus for constructing a sequencing library described above; and a sequencing apparatus, for sequencing the sequencing library.

The efficiency of sequencing can thus be effectively improved. In addition, the features and advantages described above for the method and apparatus for constructing a sequencing library apply equally to this sequencing device and are not repeated here.

According to an embodiment of the invention, the sequencing apparatus is a Hiseq2000 or Hiseq2500.

In a sixth aspect, the invention provides a system for determining a nucleic acid sequence. According to an embodiment of the invention, the system comprises:

the sequencing device described above, for sequencing a nucleic acid sample to obtain a sequencing result consisting of a plurality of sequencing data;

a sequencing data subset construction device, for constructing at least one sequencing data subset based on the sequencing result, wherein all sequencing data in each sequencing data subset correspond to the same source sequence on the nucleic acid sample;

a sequencing data classification device, for determining, for each sequencing data subset, the sequencing data corresponding to the first tag sequence to be positive-strand sequencing data and the sequencing data corresponding to the second tag sequence to be negative-strand sequencing data;

a sequencing data correction device, for correcting, for each sequencing data subset, the sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data, to determine corrected sequencing data; and

a sequence determining device, for determining the sequence of the nucleic acid sample based on the corrected sequencing data.

The system for determining a nucleic acid sequence according to embodiments of the invention can thus effectively carry out the method of determining a nucleic acid sequence described above, so that correction can be performed effectively on the basis of the positive-strand and negative-strand sequencing data, improving the reliability of the analysis.

根据本发明的实施例,所述测序为双末端测序,所述测序结果由多对成对的测序数据构成。According to an embodiment of the invention, the sequencing is a double-end sequencing, the sequencing result consisting of pairs of pairs of sequencing data.

According to an embodiment of the invention, the sequencing data subset construction device comprises:

a sequencing data index determining device, configured to determine a paired sequencing data index for each pair of the plurality of pairs of paired sequencing data, the index consisting of the first N bases of each read of the pair, wherein N is an integer between 10 and 20;

a preliminary screening device, configured to construct at least one preliminary sequencing data subset based on the paired sequencing data index, wherein all sequencing data in a preliminary subset share the same paired sequencing data index; and

a secondary screening device, configured to subdivide the at least one preliminary sequencing data subset based on the Hamming distance between the sequencing data within it, so as to obtain a plurality of the sequencing data subsets.

According to an embodiment of the invention, N is 12.

According to an embodiment of the invention, in each of the plurality of sequencing data subsets, the Hamming distance between any two pairs of paired sequencing data does not exceed 20.
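The index and Hamming-distance steps described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of the invention: the function names (`pair_index`, `hamming`, `build_subsets`) are hypothetical, reads are assumed to be plain strings of equal length, and the subdivision uses a simple greedy rule (a pair joins a cluster only if it is within the distance limit of every member already in it) rather than any particular clustering procedure.

```python
from collections import defaultdict


def pair_index(read1: str, read2: str, n: int = 12) -> str:
    """Concatenate the first n bases of each mate into a 2n-base index."""
    return read1[:n] + read2[:n]


def hamming(a: str, b: str) -> int:
    """Count mismatching positions over the shared length of two sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_subsets(pairs, n=12, max_dist=20):
    """Group read pairs by index, then split each group by Hamming distance.

    pairs: iterable of (read1, read2) sequence strings (assumed input format).
    Returns a list of subsets, each a list of (read1, read2) tuples.
    """
    # Preliminary screening: identical paired index.
    prelim = defaultdict(list)
    for r1, r2 in pairs:
        prelim[pair_index(r1, r2, n)].append((r1, r2))

    # Secondary screening: any two pairs in a subset stay within max_dist.
    subsets = []
    for group in prelim.values():
        clusters = []
        for r1, r2 in group:
            seq = r1 + r2
            for cluster in clusters:
                if all(hamming(seq, m1 + m2) <= max_dist for m1, m2 in cluster):
                    cluster.append((r1, r2))
                    break
            else:
                clusters.append([(r1, r2)])
        subsets.extend(clusters)
    return subsets
```

With N = 12 the index is 24 bases long; two pairs that share an index but differ at many later positions end up in different subsets.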

According to an embodiment of the invention, each of the plurality of sequencing data subsets contains at least two positive-strand sequencing data and at least two negative-strand sequencing data.

According to an embodiment of the invention, the corrected sequencing data are determined from the positive-strand sequencing data and the negative-strand sequencing data according to the following principle:

each base in the corrected sequencing data is supported by at least 50% of the positive-strand sequencing data and, at the same time, by at least 50% of the negative-strand sequencing data.

According to an embodiment of the invention, each base in the corrected sequencing data is supported by at least 80% of the positive-strand sequencing data and, at the same time, by at least 80% of the negative-strand sequencing data.
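The correction principle can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: the function name `strand_consensus` is hypothetical, the reads of one subset are assumed to be given as strings in a common orientation and compared column by column, and a position is written as N whenever no single base meets the threshold on both strands.

```python
from collections import Counter


def strand_consensus(plus_reads, minus_reads, min_frac=0.8, min_reads=2):
    """Build a consensus supported by both strands.

    Returns None if either strand has fewer than min_reads reads.
    A base is accepted only if its fraction is >= min_frac among the
    positive-strand reads and among the negative-strand reads.
    """
    plus_reads, minus_reads = list(plus_reads), list(minus_reads)
    if len(plus_reads) < min_reads or len(minus_reads) < min_reads:
        return None

    length = min(len(r) for r in plus_reads + minus_reads)
    consensus = []
    for i in range(length):
        plus = Counter(r[i] for r in plus_reads)
        minus = Counter(r[i] for r in minus_reads)
        candidates = [
            base for base in plus
            if plus[base] / len(plus_reads) >= min_frac
            and minus[base] / len(minus_reads) >= min_frac
        ]
        # Exactly one qualifying base is required; ties or none give N.
        consensus.append(candidates[0] if len(candidates) == 1 else "N")
    return "".join(consensus)
```

Setting `min_frac=0.5` corresponds to the 50% embodiment and `min_frac=0.8` to the 80% embodiment.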

According to an embodiment of the invention, the following is further included:

the corrected sequencing data are aligned to a reference sequence, and sequencing data with an alignment quality below 30 are deleted.
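As an illustration of the quality filter, the sketch below keeps only alignments whose mapping quality is at least 30. It assumes the alignments are available as text lines in SAM format, where header lines start with `@` and the fifth tab-separated column is MAPQ; the function name `filter_by_mapq` is hypothetical and the patent does not prescribe this file handling.

```python
def filter_by_mapq(sam_lines, min_mapq=30):
    """Yield SAM header lines and alignment lines with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 of a SAM alignment record is the mapping quality.
        if int(fields[4]) >= min_mapq:
            yield line
```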

According to an embodiment of the invention, the system further comprises a sequence analysis device configured to perform SNV analysis or Indel analysis based on the sequence of the nucleic acid sample.

Those skilled in the art will understand that the advantages and features described above for the method for determining a nucleic acid sequence apply equally to the system for determining a nucleic acid sequence and are not repeated here.

Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Figure 1 shows a flow chart of a method for constructing a sequencing library according to an embodiment of the invention;

Figure 2 shows the analysis of clusters of reads sharing the same index according to an embodiment of the invention;

Figure 3 shows the mutation spectrum analysis according to an embodiment of the invention;

Figure 4 shows the analysis of clusters of reads sharing the same index according to an embodiment of the invention;

Figure 5 shows the mutation spectrum analysis according to an embodiment of the invention;

Figure 6 shows the analysis of clusters of reads sharing the same index according to an embodiment of the invention;

Figure 7 shows the mutation spectrum analysis according to an embodiment of the invention;

Figure 8 shows the analysis of clusters of reads sharing the same index according to an embodiment of the invention;

Figure 9 shows the mutation spectrum analysis according to an embodiment of the invention;

Figure 10 shows the analysis of clusters of reads sharing the same index according to an embodiment of the invention; and

Figure 11 shows the mutation spectrum analysis according to an embodiment of the invention.

Detailed Description

The invention is described below by way of specific examples. These examples are for illustration only and are not to be construed as limiting the invention in any way.

General Method

Unless otherwise specified, the examples below follow this general method:

I. Probe design

Exon sequences of the relevant genes were retrieved from the human genome HG19. Given the size and cost of the capture region, the final chip covers only the CDS regions of these genes, extended by 20 bp on each side. The chip carries a dense set of capture probes covering 98% of the target region, so target DNA fragments can be enriched from a complex genome and genomic regions can be captured with high specificity and high coverage on a single chip.

II. Sequencing library construction and sequencing

Referring to Figure 1, the library is constructed and sequenced as follows:

1. Collect 5 ml of peripheral blood from the patient, separate plasma and white blood cells by centrifugation, and extract DNA from the plasma sample and the white blood cell sample separately. The DNA extracted from white blood cells later serves as the control for detecting somatic mutations.

2. The cell-free circulating DNA extracted from plasma averages about 170 bp. It goes directly through the three enzymatic reactions of conventional library construction: end repair, A-tailing, and ligation of a specially prepared sequencing adapter. The adapter carries an 8 bp tag, named index1, which both distinguishes samples and later marks the positive strand.

3. The ligation product is captured by chip hybridization. The eluted single-stranded template is then amplified for one round of a single cycle with a primer carrying the index2 tag, so that the reverse strand is labeled. UDG/FPG enzymes are added and incubated during this PCR to remove DNA damage carried by the template strand and reduce false positives.

4. The product carrying both indexes, one on the positive strand and one on the reverse strand, is purified and then enriched by a second round of PCR, which completes library preparation.

5. Sequencing is performed on Hiseq2000 or Hiseq2500; the platform can be chosen flexibly according to the required sequencing volume and the number of samples.

The specific steps are:

1. cfDNA extraction

About 2-3 ml of plasma separated from 5 ml of peripheral blood is used, and plasma cfDNA is extracted according to the instructions of the QIAamp Circulating Nucleic Acid Kit. The extracted DNA is quantified with Qubit (Invitrogen, Quant-iT™ dsDNA HS Assay Kit); the total amount is about 5-50 ng.

2. Sample library preparation:

The cfDNA extracted from plasma then undergoes the three enzymatic reactions according to the instructions of the KAPA LTP Library Preparation Kit.

1) End repair

Figure PCTCN2014088059-appb-000004

Then 120 μL of Agencourt AMPure XP reagent is added for magnetic bead purification, the product is eluted in 42 μL of ddH2O, and the next reaction is carried out with the beads retained.

2) A-tailing

Figure PCTCN2014088059-appb-000005

Then 90 μL of PEG/NaCl SPRI solution is added and mixed thoroughly for magnetic bead purification, the product is eluted in (35 minus adapter volume) μL of ddH2O, and the next reaction is carried out with the beads retained.

3) Adapter ligation

Figure PCTCN2014088059-appb-000006

Figure PCTCN2014088059-appb-000007

Then 50 μL of PEG/NaCl SPRI solution is added for each of two successive rounds of magnetic bead purification, and the product is finally eluted in 25 μL of ddH2O.

3. Chip hybridization capture

The invention uses an early cancer screening chip designed by the inventors, and hybridization capture is performed according to the instructions provided by the chip manufacturer. The product is finally eluted in 21 μL of ddH2O, with the hybridization elution beads retained.

4. Dual-index labeling of the positive and reverse strands, and enrichment:

Two rounds of PCR are performed in total: PCR1 labels the reverse strand and repairs DNA damage in the template, and PCR2 amplifies and enriches the product to complete library preparation.

1) PCR1

Figure PCTCN2014088059-appb-000008

PCR1 program:

Figure PCTCN2014088059-appb-000009

The hybridization elution beads are removed first, then 40 μL of Agencourt AMPure XP reagent is added for magnetic bead purification, the product is eluted in 20 μL of ddH2O, and the next reaction is carried out with the beads retained.

2) PCR2

Figure PCTCN2014088059-appb-000010

PCR2 program:

Figure PCTCN2014088059-appb-000011

The beads from the previous step are removed first, then 50 μL of fresh Agencourt AMPure XP reagent is added for magnetic bead purification, and the product is eluted in 25 μL of ddH2O for QC and loading onto the sequencer.

III. Analysis of sequencing results

1. For each pair of paired reads, the first 12 bases of read 1 and the first 12 bases of read 2 (that is, the breakpoint sequences) are concatenated into one 24 bp short sequence. This 24 bp sequence serves as the index of the read pair, and the pair is labeled as positive strand or reverse strand according to its index tag.

2. The indexes are sorted with an external sort, so that copies of the same DNA template are gathered together.

3. The gathered reads that share an index are clustered around centers: based on the Hamming distance between their sequences, each large cluster sharing an index is split into several small clusters, within which the Hamming distance between any two read pairs does not exceed 10. This separates reads that share an index but come from different DNA templates.

4. The clusters of copies of a single DNA template obtained in step 3 are filtered: a cluster proceeds to subsequent analysis only if it contains at least 2 read pairs from the positive strand and at least 2 read pairs from the reverse strand.

5. Clusters that satisfy the condition in step 4 are error-corrected to produce one new, error-free read pair. For each sequenced base of the DNA template, if a base type reaches a concordance of 80% among the positive-strand reads and also 80% among the reverse-strand reads, that base type is recorded at this position of the new reads; otherwise the position is recorded as N. The result is a new read pair representing the original DNA template sequence.

6. The new reads are realigned to the genome with the bwa mem algorithm, and reads with an alignment quality below 30 are removed.

7. SNV analysis:

1) The reads obtained in step 6 are tallied to give the base-type distribution at each site in the capture region. Base types that differ from the mainstream base types (those with a proportion greater than 15%) are regarded as mutant base types. The covered size of the target region, the average sequencing depth, the positive/reverse strand pairing rate, the low-frequency mutation rate, and similar statistics are also computed.

2) SNPs are annotated using CCDS, the human genome database (NCBI36.3), and dbSNP (v130) to determine the gene in which each mutation occurs, its coordinates, mRNA position, amino acid change, and SNP function (missense mutation / nonsense mutation / alternative splice site), together with the SIFT prediction of the SNP's effect on protein function;

3) Somatic mutations are called by comparing the patient sample with the control sample. SNPs that appear in dbSNP, HAPMAP, the 1000 Genomes Project, and other exome sequencing projects are removed from the candidate SNVs, and the remainder are taken as the final disease-related candidate SNVs.

8. Indel analysis:

1) Reads containing indels among those obtained in step 6 are tallied to list all indels, and indels supported by 2 or more reads are selected as reliable mutant indels;

2) Indels are annotated using CCDS, the human genome database (NCBI36.3), and dbSNP (v130) to determine the gene in which each mutation occurs, its coordinates, mRNA position, change in the coding sequence, effect on amino acids, and indel function (amino acid insertion / amino acid deletion / frameshift mutation);

3) Somatic mutations are called by comparing the patient sample with the control sample. Indels that appear in dbSNP and other exome sequencing projects are removed from the candidate indels, and the remainder are taken as the final disease-related candidate indels.
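The per-site tally in step 7 1) and the read-support rule in step 8 1) can be sketched as follows. This is an illustrative sketch only: the function names (`classify_bases`, `reliable_indels`) are hypothetical, a site is assumed to be represented as a string of the bases observed there, and indels are assumed to be represented by any hashable key such as a position and allele string.

```python
from collections import Counter


def classify_bases(site_bases, mainstream_frac=0.15):
    """Split the base types seen at one site into mainstream and mutant.

    A base type is mainstream if its proportion exceeds mainstream_frac;
    every other observed base type is treated as a mutant base type.
    """
    counts = Counter(b for b in site_bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return [], []
    mainstream = sorted(b for b, c in counts.items() if c / total > mainstream_frac)
    mutant = sorted(b for b in counts if b not in mainstream)
    return mainstream, mutant


def reliable_indels(observed_indels, min_support=2):
    """Return the indels seen in at least min_support reads."""
    counts = Counter(observed_indels)
    return {indel for indel, c in counts.items() if c >= min_support}
```

A heterozygous germline site with two bases near 50% yields two mainstream base types and no mutant type, while a base seen in 3% of reads is reported as a mutant base type.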

Example 1: Early screening for gynecological reproductive tract tumors

I. Chip design

With reference to TCGA, ICGC, COSMIC and other databases and the related literature, the inventors designed the gene chip WCNpan for early screening of gynecological reproductive tract tumors. The WCNpan chip covers driver genes associated with gynecological reproductive tract tumors (cervical cancer, endometrial cancer, ovarian cancer), frequently mutated genes, and important genes in 12 cancer signaling pathways, for a total of 42 genes and 300 kb.

Chip design procedure: exon sequences of the 42 genes were retrieved from the human genome HG19. Given the size and cost of the capture region, the final chip covers only the CDS regions of these genes, extended by 20 bp on each side, for a total of 300 kb. The chip carries a dense set of capture probes covering 98% of the target region, so target DNA fragments can be enriched from a complex genome and about 300 kb of genomic region can be captured with high specificity and high coverage on a single chip.

See Table 1 for the gene list.

Table 1

AFF3 BRCA2 FBXW7 MED12 PDE4DIP STK11
AKAP9 CDK12 FGFR2 MLL2 PIK3CA TP53
AKT1 CDKN2A FGFR3 MLL3 PIK3R1
APC CREBBP FOXL2 MSH6 PPP2R1A
ARID1A CSMD3 GNAS NF1 PTEN
BCOR CTNNB1 HRAS NFE2L2 RB1
BRAF EGFR KIT NRAS RNF213
BRCA1 FAT3 KRAS NSD1 RNF43

II. Sequencing analysis

One patient with cervical atypical hyperplasia was analyzed following the steps of the method above. The sequencing statistics are shown in the table below:

Figure PCTCN2014088059-appb-000012

Notes. Positive/reverse strand pairing rate: the ratio of clusters with 3 or more reads that contain both positive-strand and reverse-strand reads to all clusters with 3 or more reads, used to assess strand pairing in the usable data. Effective data utilization: the ratio of the number of error-corrected reads obtained from clusters that satisfy at least the 2+/2- condition to the total number of sequenced reads. Average sequencing depth: the average coverage of bases in the target region after error correction of the effective data.
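The two ratios defined in the notes can be computed as in the sketch below. It rests on assumptions that the text does not fix: each cluster is represented as a (positive-strand read count, reverse-strand read count) tuple, "3 or more reads" is read as a total of at least 3, and each cluster meeting the 2+/2- condition is counted as contributing one corrected read. The function names are hypothetical.

```python
def strand_pairing_rate(clusters, min_reads=3):
    """Fraction of clusters with >= min_reads reads that contain both strands."""
    large = [(p, m) for p, m in clusters if p + m >= min_reads]
    if not large:
        return 0.0
    both = sum(1 for p, m in large if p > 0 and m > 0)
    return both / len(large)


def effective_utilization(clusters, total_reads, min_per_strand=2):
    """Corrected reads from qualifying clusters divided by all sequenced reads."""
    corrected = sum(
        1 for p, m in clusters if p >= min_per_strand and m >= min_per_strand
    )
    return corrected / total_reads
```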

The analysis of clusters of reads sharing the same index is shown in Figure 2, where the abscissa is the number of duplications (dup) in a cluster and the ordinate is the total number of reads in clusters with that dup count. Figure 2 shows that most clusters have a dup count of about 8 and that a large proportion of clusters satisfy the condition of 2 positive-strand plus 2 reverse-strand reads. The final effective data utilization is 5.14%, and the average sequencing depth is 1153.6X.

The mutation spectrum analysis is shown in Figure 3. For a molecule derived from double-stranded DNA, complementary mutation types should in theory occur at essentially the same frequency. The abscissa is the type of base mutation and the ordinate is the number of mutations. Figure 3 shows that the distribution of mutant base types is balanced, with a mutation frequency (mutations per nucleotide) of 1.7 × 10^-6.

The detected variants (counting only exonic, non-synonymous mutations) are listed in the table below.

Gene    cHGVS      pHGVS        Mutation type      Mutation frequency
PIK3CA  c.2119G>A  p.Glu707Lys  Missense mutation  5.60%
TP53    c.817C>T   p.Arg273Cys  Missense mutation  2.40%
CDKN2A  c.217A>C   p.Ser73Arg   Missense mutation  1.80%
NRAS    c.182A>G   p.Gln61Arg   Missense mutation  1.40%

Interpretation: according to TCGA, COSMIC, ClinVar, HMGD and other databases and the literature, driver mutations including PIK3CA p.Glu707Lys and TP53 p.Arg273Cys were detected in the patient's plasma. This indicates an elevated cancer risk, and the patient is advised to visit a medical institution for more comprehensive testing and appropriate intervention.

Example 2: Individualized medication for twelve common tumors

I. Chip design

1) Design of the individualized tumor gene chip:

With reference to TCGA, ICGC, COSMIC and other databases and the related literature, an iterative algorithm was used to design CANPer-YY, an individualized tumor gene chip for 12 common cancers. The CANPer-YY chip covers oncogenes, tumor suppressor genes, frequently mutated genes in the 12 common cancers, important genes in 12 cancer signaling pathways, and genes related to targeted drugs and chemotherapy drugs, for a total of 524 genes and 750 kb.

The chip was designed in four main steps:

1. For each exon region of the driver genes related to the 12 cancers in the COSMIC database, count the number of mutated samples, the mutated samples themselves, the number of samples carrying the hottest variant, and the PI value (which assesses the level of patient recurrence frequency on each exon; PI = cumulative number of patients carrying mutations on the exon / exon length), and sort the exons by PI in descending order. An iterative algorithm is then applied. The samples mutated in the first exon region form the sample database. For every other interval, the number of its samples that are not yet in the sample database is counted, and the interval with the most such samples becomes the second interval selected for the chip. The mutated samples of the two selected intervals then form the sample database, and the third interval is selected in the same way, and so on until the sample database contains all samples; this gives the set of selected exon regions. For any gene from which no interval was selected, all of its intervals are added to the chip.

2. Based on TCGA, ICGC and other databases, intervals outside the driver gene intervals that contain a hotspot variant found in 5 or more samples (SNV >= 5) are taken as candidate intervals, and the iterative calculation of the previous step is repeated.

3. Based on TCGA, ICGC and other databases, among the intervals not yet selected, those with PI >= 30 and SNV >= 3, and then those with PI >= 20 and SNV >= 3, are taken as candidate intervals. The interval that reduces the number of samples in the single-sample database the most is selected as the first chip interval, and the process above is repeated iteratively.

4. Intervals for fusion genes and for chemotherapy-related test-site genes are added.
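The iterative selection in step 1 resembles a greedy coverage procedure, which can be sketched as below. This is a hedged illustration rather than the inventors' code: the function names (`pi_value`, `select_intervals`) and the input layout (a dictionary mapping each interval to the set of sample identifiers mutated in it, plus a dictionary of interval lengths) are assumptions, and ties are broken by input order.

```python
def pi_value(n_patients, exon_length):
    """PI = cumulative number of mutated patients on the exon / exon length."""
    return n_patients / exon_length


def select_intervals(interval_samples, lengths):
    """Greedily pick intervals until every sample is covered.

    interval_samples: dict mapping interval name -> set of sample ids.
    lengths: dict mapping interval name -> length in bp.
    The first interval is the one with the highest PI; each later pick is
    the interval that adds the most samples not yet in the sample database.
    """
    if not interval_samples:
        return []
    all_samples = set().union(*interval_samples.values())
    first = max(
        interval_samples,
        key=lambda k: pi_value(len(interval_samples[k]), lengths[k]),
    )
    chosen = [first]
    covered = set(interval_samples[first])
    while covered != all_samples:
        best = max(
            (k for k in interval_samples if k not in chosen),
            key=lambda k: len(interval_samples[k] - covered),
        )
        chosen.append(best)
        covered |= interval_samples[best]
    return chosen
```

Because the target set is the union of all intervals, every iteration adds at least one new sample, so the loop always terminates.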

See Table 2 for the gene list.

Table 2

ABL1 C1R DIS3 FGF19 HSPA4 MIR142 PAX5 RB1 SRSF2
ABL2 C1S DNMT1 FGF23 IDH1 MITF PBRM1 REL SSTR2
ACVR1B CARD11 DNMT3A FGF3 IDH2 MLH1 PCBP1 RET STAG2
ACVR2A CASP8 DOT1L FGF4 IFNAR1 MLH3 PCM1 RHEB STAT4
AJUBA CBFB DUSP6 FGF6 IFNAR2 MLL PDGFRA RICTOR STAT5B
AKT1 CBL EDNRA FGF7 IGF1 MLL2 PDGFRB RNASEL STK11
AKT2 CBLB EGFR FGFR1 IGF1R MLL3 PDK1 RNF43 SUFU
AKT3 CBR1 EGR3 FGFR2 IGF2 MLL4 PHF6 ROBO1 SUZ12
ALK CCND1 EIF4A2 FGFR3 IKBKB MPL PIGF ROBO2 SYK
ALOX12B CCND2 ELAC2 FGFR4 IKBKE MRE11A PIK3C2A ROS1 TAF1
ANGPT1 CCND3 ELF3 FH IKZF1 MS4A1 PIK3C2B RPA1 TBL1XR1
ANGPT2 CCNE1 EML4 FLCN IL7R MSH2 PIK3C2G RPL22 TBX3
APC CD79A EP300 FLT1 INHBA MSH3 PIK3C3 RPL5 TEK
APCDD1 CD79B EPCAM FLT3 IRF4 MSH4 PIK3CA RPS14 TERT
AR CDC25C EPHA2 FLT4 IRS2 MSH5 PIK3CB RPS6KB1 TET2
ARAF CDC42 EPHA3 FNTA ITGB2 MSH6 PIK3CG RPTOR TFG
ARFRP1 CDC73 EPHA5 FOXA1 JAK1 MSR1 PIK3R1 RUNX1 TGFBR2
ARHGAP35 CDH1 EPHB1 FOXA2 JAK2 MTOR PIK3R2 RUNX1T1 TIPARP
ARID1A CDK12 EPHB2 FOXL2 JAK3 MUC1 PLK1 RXRA TLR4
ARID1B CDK2 EPHB6 FPGS JUN MUTYH PML RXRB TMEM127
ARID2 CDK4 EPPK1 FUBP1 KAT6A MYC PMS1 RXRG TNFAIP3
ARID5B CDK6 ERBB2 FYN KDM5A MYCL1 PMS2 SDHAF2 TNFRSF14
ASXL1 CDK8 ERBB3 GAB2 KDM5C MYCN PNRC1 SDHB TNFRSF8
ATM CDKN1A ERBB4 GATA1 KDM6A MYD88 POLQ SDHC TNFSF11
ATR CDKN1B ERCC2 GATA2 KDR NAV3 PPP2R1A SDHD TNFSF13B
ATRX CDKN2A ERCC3 GATA3 KEAP1 NBN PRDM1 SEMA3A TOP1
AURKA CDKN2B ERG GID4 KIF1B NCOA1 PRKAA1 SEMA3E TOP2A
AURKB CDKN2C ESR1 GNA11 KIF5B NCOA2 PRKAR1A SETBP1 TOP2B
AXIN1 CDX2 ETV1 GNA13 KIT NCOR1 PRKCA SETD2 TP53
AXIN2 CEBPA ETV6 GNAQ KLF4 NEK11 PRKCB SF1 TRAF7
AXL CFLAR EWSR1 GNAS KLHL6 NF1 PRKCG SF3B1 TSC1
B2M CHD1 EXT1 GNRHR KRAS NF2 PRKDC SH2B3 TSC2
B4GALT3 CHD2 EXT2 GPR124 LCK NFE2L2 PRSS8 SIN3A TSHR
BACH1 CHD4 EZH2 GRIN2A LIMK1 NFE2L3 PSMB1 SLAMF7 TSHZ2
BAK1 CHEK1 FAM123B GRM3 LRRK2 NFKBIA PSMB2 SLC4A1 TSHZ3
BAP1 CHEK2 FAM46C GSK3B LYN NKX2-1 PSMB5 SLIT2 TUBA1A
BARD1 CHUK FANCA H3F3A MALAT1 NKX3-1 PTCH1 SMAD2 TUBB
BCL2 CIC FANCC H3F3C MAP2K1 NOTCH1 PTCH2 SMAD3 TUBD1
BCL2A1 CRBN FANCD2 HCK MAP2K2 NOTCH2 PTEN SMAD4 TUBE1
BCL2L1 CREBBP FANCE HDAC1 MAP2K4 NOTCH3 PTP4A3 SMARCA1 TUBG1
BCL2L11 CRIPAK FANCF HDAC2 MAP3K1 NOTCH4 PTPN11 SMARCA4 TYR
BCL2L2 CRKL FANCG HDAC3 MAP3K13 NPM1 PTPRD SMARCB1 U2AF1
BCL6 CRLF2 FANCI HDAC4 MAPK1 NR3C1 RAC1 SMARCD1 USP9X
BCOR CROT FANCL HDAC6 MAPK3 NRAS RAC2 SMC1A VEGFA
BCORL1 CSF1R FANCM HDAC8 MAPK8 NSD1 RAD21 SMC3 VEGFB
BCR CTCF FAT3 HGF MAPK8IP1 NTRK1 RAD50 SMO VEZF1
BLM CTLA4 FBXW7 HIF1A MAX NTRK2 RAD51 SOCS1 VHL
BMPR1A CTNNA1 FCGR1A HIST1H1C MC1R NTRK3 RAD51B SOX10 WHSC1L1
BRAF CTNNB1 FCGR2A HIST1H2BD MCL1 NUP93 RAD51C SOX17 WISP3
BRCA1 CUL4A FCGR2B HIST1H3B MDM2 PAK3 RAD51D SOX2 WWP1
BRCA2 CUL4B FCGR2C HNF1A MDM4 PAK7 RAD52 SOX9 XIAP
BRIP1 CYLD FCGR3A HRAS MECOM PALB2 RAD54L SPEN XPA
BTG1 CYP17A1 FCGR3B HRH2 MED12 PARP1 RAF1 SPOP XPC
BTK DAXX FGF10 HSD17B3 MEF2B PARP2 RARA SPRY4 XPO1
C11orf30 DDR1 FGF12 HSD3B2 MEN1 PARP3 RARB SRC XRCC3
C1QA DDR2 FGF14 HSP90AA1 MET PARP4 RARG SRD5A2 YES1
ZNF217 ZNF703 ZRSR2 WT1 XRCC1 GSTP1 ERCC1 MTHFR SOD2
CBR3 ATIC MTRR DPYD UMPS TPMT UGT1A1 MDR1 CDA
CYP19A1 CYP2D6

2)基因预测药物疗效数据库构建:2) Gene prediction drug efficacy database construction:

化疗药物对肿瘤细胞的杀伤效应与特定的一种(一组)基因的表达和/或多态性显著相关,通过相关基因的检测,预测化疗药物的疗效,选择合适的药物进行个体化化疗,已经成为提高疗效、减少无效治疗的合理选择。基于化疗药物以上特点,参考PharmGKB数据库,整合目前临床上所有的化疗药物以及与疗效相关的基因及疗效预测评判,形成化疗个体化用药解读数据库。并将化疗数据整合入肿瘤个体化信息流程,完成化疗药物的自动化解读。The killing effect of chemotherapeutic drugs on tumor cells is significantly correlated with the expression and/or polymorphism of a specific (a group of) genes. The detection of related genes predicts the efficacy of chemotherapeutic drugs and selects appropriate drugs for individualized chemotherapy. It has become a reasonable choice to improve efficacy and reduce ineffective treatment. Based on the above characteristics of chemotherapeutic drugs, the PharmGKB database is used to integrate all the current chemotherapeutic drugs and the genes related to curative effect and predictive evaluation of therapeutic effects, and to form a database for interpretation of individualized drugs for chemotherapy. The chemotherapy data was integrated into the individualized information flow of the tumor to complete the automated interpretation of the chemotherapy drug.

Targeted drugs are characterized by marked efficacy and few side effects in tumor therapy, but they depend on their targets (including proteins, DNA, etc.), so target analysis must be performed on a patient before it can be determined whether the patient can use the drug. Currently FDA-approved targeted drugs and drugs in phase III and IV clinical trials were integrated. Based on the NCCN clinical guidelines and clinical pharmacogenomic studies, the relationships between drug target genes and targeted-drug efficacy were collated to form an interpretation database for individualized targeted tumor therapy.

The variant data obtained from bioinformatics analysis are interpreted on an individual basis. With reference to the constructed tumor database and the relevant literature, the variants detected in the patient are analyzed to determine the pathogenic cause of each variant, the expected efficacy and toxic side effects of the various chemotherapeutic drugs, the most suitable beneficial targeted drugs, and the targeted drugs to which resistance is expected. This allows clinicians to treat tumor patients in a more targeted way, sparing the patient the valuable time lost to ineffective medication and the suffering caused by toxic side effects.

II. Sequencing analysis

Using the present invention, a patient with advanced gastric cancer (one of the 12 common tumors) was tested for individualized tumor medication guidance according to the steps of the above method. The results are as follows:

The statistics of the sequencing data are shown in the following table:

Figure PCTCN2014088059-appb-000013

Note: Forward/reverse strand pairing rate: the ratio of the number of clusters with 3 or more reads that contain both forward-strand and reverse-strand reads to the total number of clusters with 3 or more reads, used to evaluate forward/reverse strand pairing in the usable data. Effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- (at least 2 forward-strand and 2 reverse-strand reads) to the total number of sequencing reads. Average sequencing depth: the average coverage of the bases in the target region by the effective data after error correction.
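For illustration only, the three statistics defined in the note could be computed along the lines of the following Python sketch. The `Cluster` class, the function names, and the assumption that each qualifying 2+/2- cluster yields exactly one error-corrected read are hypothetical and are introduced here for the example; they are not stated in this specification.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Reads sharing one index; counts are kept per strand (hypothetical model)."""
    forward: int  # number of forward-strand (+) reads
    reverse: int  # number of reverse-strand (-) reads

    @property
    def size(self) -> int:
        return self.forward + self.reverse


def strand_pairing_rate(clusters, min_reads=3):
    """Clusters with >= min_reads reads and both strands / all clusters with >= min_reads reads."""
    eligible = [c for c in clusters if c.size >= min_reads]
    if not eligible:
        return 0.0
    paired = [c for c in eligible if c.forward > 0 and c.reverse > 0]
    return len(paired) / len(eligible)


def effective_data_utilization(clusters, total_reads, min_per_strand=2):
    """Error-corrected reads from 2+/2- clusters / total sequencing reads.

    Assumption: every qualifying cluster collapses to one corrected read.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    corrected = sum(
        1 for c in clusters
        if c.forward >= min_per_strand and c.reverse >= min_per_strand
    )
    return corrected / total_reads


def average_depth(corrected_bases_on_target, target_region_size):
    """Error-corrected bases covering the target region / target region size."""
    if target_region_size <= 0:
        raise ValueError("target_region_size must be positive")
    return corrected_bases_on_target / target_region_size
```

The threshold `min_reads=3` follows the wording of the note; whether "3 or more" or "more than 3" is intended is left as a parameter.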

Cluster analysis:

The analysis of clusters of reads sharing the same index is shown in Figure 4, where the abscissa represents the number of duplicates (dup) in a cluster and the ordinate represents the total number of reads in clusters having that dup count. Figure 4 shows that most clusters have a dup count of about 5 and that most clusters satisfy the condition of 2 forward + 2 reverse reads. The final effective data utilization is 3.5%, and the average sequencing depth is 667X.
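The quantity plotted in such a figure can be sketched as follows, assuming the cluster sizes (dup counts) have already been extracted from the sequencing data. The function names are hypothetical and the input values are illustrative, not data from this example.

```python
from collections import Counter


def reads_by_dup_count(cluster_sizes):
    """Map each dup count to the total number of reads in clusters of that size."""
    totals = Counter()
    for size in cluster_sizes:
        if size <= 0:
            raise ValueError("cluster size must be positive")
        totals[size] += size  # a cluster of n duplicates contributes n reads
    return dict(sorted(totals.items()))


def peak_dup_count(totals):
    """Dup count that contributes the most reads."""
    return max(totals, key=totals.get)


if __name__ == "__main__":
    histogram = reads_by_dup_count([5, 5, 4, 1, 5])
    print(histogram, peak_dup_count(histogram))
```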

Mutation spectrum analysis:

The mutation spectrum is shown in Figure 5, where the abscissa represents the type of base substitution and the ordinate represents the number of mutations. For molecules derived from double-stranded DNA, complementary mutation types are expected in theory to occur at essentially the same frequency. Figure 5 shows that the distribution of mutated base types is essentially balanced, with a mutation frequency (mutations per nucleotide) of 4.2 × 10^-6.
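As a hedged sketch, a mutation spectrum and a per-nucleotide mutation frequency could be tabulated as below. This specification does not state the denominator used for "mutations per nucleotide"; the sketch assumes it is the number of error-corrected bases sequenced. All names and example numbers are illustrative.

```python
from collections import Counter

BASES = frozenset("ACGT")


def mutation_spectrum(substitutions):
    """Count substitution types from (ref, alt) pairs, e.g. ('C', 'T') -> 'C>T'."""
    spectrum = Counter()
    for ref, alt in substitutions:
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-base substitution: {ref}>{alt}")
        spectrum[f"{ref}>{alt}"] += 1
    return spectrum


def mutations_per_nucleotide(n_mutations, n_bases_sequenced):
    """Assumed definition: mutations observed / error-corrected bases sequenced."""
    if n_bases_sequenced <= 0:
        raise ValueError("n_bases_sequenced must be positive")
    return n_mutations / n_bases_sequenced
```

For example, 42 mutations observed in 10,000,000 corrected bases would give 4.2 × 10^-6 under this assumed definition.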

Details of the detected variants (restricted to exonic regions and non-synonymous mutations) are shown in the following table:

Gene     Base mutation   Amino acid mutation   Mutation type        Mutation frequency
TP53     c.241C>T        p.R81X                Stop-gain mutation   10.83%
PIK3CA   c.2816A>G       p.D939G               Missense mutation    6.34%
KRAS     c.35G>A         p.G12D                Missense mutation    4.36%
ZNF678   c.1628G>C       p.R543P               Missense mutation    3.40%
ALMS1    c.3971T>G       p.V1324G              Missense mutation    3.20%
MLH1     c.1427A>T       p.E476V               Missense mutation    2.80%
ZNF721   c.2061C>G       p.H687Q               Missense mutation    2.76%
MUC17    c.392G>C        p.S131T               Missense mutation    2.73%
GNAQ     c.286A>T        p.T96S                Missense mutation    2.46%
CASC1    c.97C>A         p.R33S                Missense mutation    2.20%
ZNF20    c.1016G>A       p.R339K               Missense mutation    2.00%
CYP4F2   c.1448C>G       p.A483G               Missense mutation    2.00%

The chemotherapy-related sites are shown in the following table:

Figure PCTCN2014088059-appb-000014

Figure PCTCN2014088059-appb-000015

Drug prediction:

Based on the targeted-drug and chemotherapy interpretation databases and the above test results, the following conclusions are provided only as a reference for clinicians when formulating a treatment plan:

Figure PCTCN2014088059-appb-000016

Figure PCTCN2014088059-appb-000017

Figure PCTCN2014088059-appb-000018

Example 3: Early screening of colorectal cancer

I. Chip design

1) Design of the colorectal cancer early-screening chip:

Based on databases such as TCGA, ICGC and COSMIC and the relevant literature, an iterative algorithm was used to design Colorectalpan, a gene chip for early screening of colorectal cancer. The Colorectalpan chip covers colorectal cancer-related driver genes, frequently mutated genes, and important genes in 12 cancer signaling pathways, for a total of 60 genes and 123 kb.

The chip design process consists of four main steps:

1. For each exon region of the colorectal cancer driver genes in the COSMIC database, count the number of mutated samples, record the mutated samples, count the number of samples carrying the most frequent hotspot variant, and compute the PI value (which evaluates how frequently mutations recur among patients on each exon; PI = cumulative number of patients carrying a mutation on the exon / exon length). Sort the exon regions in descending order of PI. An iterative algorithm is then applied: the samples mutated in the first exon region form the initial sample database; for every other interval, the number of its samples not yet in the sample database is counted, and the interval with the most such new samples is selected as the second chip interval. The mutated samples of the two selected intervals then form the sample database, and the third interval is selected in the same way, and so on until the sample database contains all samples, which gives the set of selected exon regions. For any gene none of whose intervals has been selected, all of its intervals are added to the chip intervals.

2. Based on databases such as TCGA and ICGC, take as candidate intervals those intervals that lie outside the driver gene intervals and contain a hotspot variant found in at least 5 samples (SNV >= 5), and repeat the iterative calculation of the previous step.

3. Based on databases such as TCGA and ICGC, among the intervals remaining after removal of those already selected, take as candidate intervals first those with PI >= 30 and SNV >= 3 and then those with PI >= 20 and SNV >= 3. In each case, select as the first chip interval the interval that most reduces the number of samples in the single-sample database, and repeat the above process iteratively.

4. Add fusion gene intervals and other intervals.
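The iterative selection described in step 1 resembles a greedy coverage procedure. The following Python sketch is one possible reading of that step under stated assumptions: the first interval is the one with the highest PI, each later interval is the one adding the most samples not yet in the sample database, and ties are broken by PI rank. The function name and the input layout (`interval_samples`, `pi_values`) are hypothetical, and the final sub-step (adding all intervals of genes with no selected interval) is not shown.

```python
def select_intervals(interval_samples, pi_values):
    """Greedy interval selection.

    interval_samples: dict mapping interval name -> set of mutated sample IDs
    pi_values: dict mapping interval name -> PI value
    Returns the selected intervals in selection order.
    """
    all_samples = set().union(*interval_samples.values())
    # Rank by descending PI; the name is used only to make ties deterministic.
    remaining = sorted(interval_samples, key=lambda iv: (-pi_values[iv], iv))
    selected = []
    covered = set()  # the growing "sample database"
    while covered != all_samples:
        if not selected:
            best = remaining[0]  # highest-PI interval seeds the sample database
        else:
            # Interval contributing the most samples not yet covered.
            best = max(remaining, key=lambda iv: len(interval_samples[iv] - covered))
        selected.append(best)
        covered |= interval_samples[best]
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    samples = {
        "exonA": {"s1", "s2", "s3"},
        "exonB": {"s2", "s3"},
        "exonC": {"s4", "s5"},
        "exonD": {"s5"},
    }
    pi = {"exonA": 30.0, "exonB": 25.0, "exonC": 10.0, "exonD": 5.0}
    print(select_intervals(samples, pi))
```

In this toy input, exonB adds no new samples once exonA is chosen, so exonC is picked second and the loop stops because every sample is covered.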

See Table 3 for the detailed gene list.

Table 3

KRAS SRC TLR3 EP300 TMPRSS13 EPHA5
BRAF PTEN MC4R CYLD PHF2 EPHA3
APC AXIN1 MLH1 FBN2 OPRD1 PTPRD
TP53 FLG AKT1 NF1 LILRB5 NTRK3
PIK3CA LIG1 CASD1 ASXL1 COL18A1 NTRK1
CTNNB1 MAP2K1 PTCH1 SMAD4 LARP4B ALK
NRAS PIK3R1 ADAMTS18 IRF5 DMKN ROS1
EGFR ERBB2 MSH2 DOCK3 ROBO2 RET
FBXW7 STK11 BAP1 MYOM1 KCNN3 PDGFRA
ARID1A IL7R CTNNA1 NEFH INHBA FGFR1

II. Sequencing analysis

Using the present invention, a patient with intestinal polyps was tested for early screening of colorectal cancer according to the steps of the above method. The results are as follows:

The statistics of the sequencing data are shown in the following table:

Figure PCTCN2014088059-appb-000019

Note: Forward/reverse strand pairing rate: the ratio of the number of clusters with 3 or more reads that contain both forward-strand and reverse-strand reads to the total number of clusters with 3 or more reads, used to evaluate forward/reverse strand pairing in the usable data. Effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- (at least 2 forward-strand and 2 reverse-strand reads) to the total number of sequencing reads. Average sequencing depth: the average coverage of the bases in the target region by the effective data after error correction.
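To make the 2+/2- condition concrete, the following is a generic duplex-consensus sketch and not necessarily the error-correction procedure of the present method. It assumes that the reads of one cluster are already aligned, of equal length, and expressed on the same reference strand; the per-strand majority vote and the use of "N" for disagreements are illustrative choices, and all names are hypothetical.

```python
from collections import Counter


def strand_consensus(reads):
    """Majority base per position; 'N' where no base has a strict majority."""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads in a cluster must have equal length")
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count * 2 > len(column) else "N")
    return "".join(consensus)


def duplex_consensus(forward_reads, reverse_reads, min_per_strand=2):
    """Return one corrected read for a cluster, or None if it fails 2+/2-."""
    if len(forward_reads) < min_per_strand or len(reverse_reads) < min_per_strand:
        return None
    fwd = strand_consensus(forward_reads)
    rev = strand_consensus(reverse_reads)
    if len(fwd) != len(rev):
        raise ValueError("strand consensuses must have equal length")
    # Keep a base only where both strands agree; otherwise mask it.
    return "".join(f if f == r else "N" for f, r in zip(fwd, rev))
```

A base supported on one strand only is masked rather than called, which is the property that lets strand pairing suppress single-strand errors in schemes of this kind.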

Cluster analysis:

The analysis of clusters of reads sharing the same index is shown in Figure 6, where the abscissa represents the number of duplicates (dup) in a cluster and the ordinate represents the total number of reads in clusters having that dup count. Figure 6 shows that most clusters have a dup count of about 6 and that most clusters satisfy the condition of 2 forward + 2 reverse reads. The final effective data utilization is 5.12%, and the average sequencing depth is 1033X.

Mutation spectrum analysis:

The mutation spectrum is shown in Figure 7, where the abscissa represents the type of base substitution and the ordinate represents the number of mutations. For molecules derived from double-stranded DNA, complementary mutation types are expected in theory to occur at essentially the same frequency. Figure 7 shows that the distribution of mutated base types is essentially balanced, with a mutation frequency (mutations per nucleotide) of 2.2 × 10^-6.
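The expectation that complementary mutation types occur at similar frequencies can be checked by pairing each substitution type with its complement (for example C>T with G>A). The sketch below shows one way to tabulate the six pairs; the function names and the input counts are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementary_type(substitution):
    """Return the substitution seen on the opposite strand, e.g. 'C>T' -> 'G>A'."""
    ref, alt = substitution.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


def paired_counts(spectrum):
    """Map each of the 6 complementary pairs to its two observed counts."""
    pairs = {}
    for ref in "AC":  # A and C references enumerate each pair exactly once
        for alt in "ACGT":
            if alt == ref:
                continue
            sub = f"{ref}>{alt}"
            comp = complementary_type(sub)
            pairs[(sub, comp)] = (spectrum.get(sub, 0), spectrum.get(comp, 0))
    return pairs


if __name__ == "__main__":
    for pair, counts in paired_counts({"C>T": 10, "G>A": 9, "A>T": 2}).items():
        print(pair, counts)
```

Large imbalances within a pair would suggest strand-specific artifacts rather than true double-stranded mutations.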

Details of the detected variants (restricted to exonic regions and non-synonymous mutations):

Gene     Base mutation   Amino acid mutation   Mutation type       Mutation frequency
SMAD4    c.2119G>A       p.Y301F               Missense mutation   2.8%
ARID1A   c.817C>T        p.A1872T              Missense mutation   2.34%
APC      c.217A>C        p.A426T               Missense mutation   1.80%

Result analysis: Based on databases such as TCGA, COSMIC, ClinVar and HMGD and on the literature, the driver mutations SMAD4 p.Y301F and APC p.A426T were detected in the patient's plasma, indicating that the patient has a relatively high cancer risk. The patient is advised to undergo more comprehensive testing at an appropriate medical institution and to take the relevant intervention measures.

Example 4: Early screening of lung cancer

I. Chip design

1) Design of the lung cancer early-screening chip:

Based on databases such as TCGA, ICGC and COSMIC and the relevant literature, an iterative algorithm was used to design Lungpan, a gene chip for early screening of lung cancer. The Lungpan chip covers lung cancer-related driver genes, frequently mutated genes, and important genes in 12 cancer signaling pathways, for a total of 145 genes and 250 kb.

The chip design process consists of four main steps:

1. For each exon region of the lung cancer driver genes in the COSMIC database, count the number of mutated samples, record the mutated samples, count the number of samples carrying the most frequent hotspot variant, and compute the PI value (which evaluates how frequently mutations recur among patients on each exon; PI = cumulative number of patients carrying a mutation on the exon / exon length). Sort the exon regions in descending order of PI. An iterative algorithm is then applied: the samples mutated in the first exon region form the initial sample database; for every other interval, the number of its samples not yet in the sample database is counted, and the interval with the most such new samples is selected as the second chip interval. The mutated samples of the two selected intervals then form the sample database, and the third interval is selected in the same way, and so on until the sample database contains all samples, which gives the set of selected exon regions. For any gene none of whose intervals has been selected, all of its intervals are added to the chip intervals.

2. Based on databases such as TCGA and ICGC, take as candidate intervals those intervals that lie outside the driver gene intervals and contain a hotspot variant found in at least 5 samples (SNV >= 5), and repeat the iterative calculation of the previous step.

3. Based on databases such as TCGA and ICGC, among the intervals remaining after removal of those already selected, take as candidate intervals first those with PI >= 30 and SNV >= 3 and then those with PI >= 20 and SNV >= 3. In each case, select as the first chip interval the interval that most reduces the number of samples in the single-sample database, and repeat the above process iteratively.

4. Add fusion gene intervals and other intervals.
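The candidate filters of steps 2 and 3 can be expressed as a simple threshold test applied before each round of iterative selection. In the sketch below, `Interval`, its fields, and `candidates` are hypothetical names; "SNV" is interpreted as the number of samples carrying the interval's hotspot variant, which is an assumption based on the wording of step 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    name: str
    pi: float             # PI value of the interval
    hotspot_samples: int  # assumed meaning of "SNV": samples with the hotspot variant


def candidates(intervals, already_selected, min_pi, min_snv):
    """Intervals not yet selected that meet both thresholds, in input order."""
    return [
        iv for iv in intervals
        if iv.name not in already_selected
        and iv.pi >= min_pi
        and iv.hotspot_samples >= min_snv
    ]


if __name__ == "__main__":
    pool = [
        Interval("e1", 35, 4),
        Interval("e2", 25, 3),
        Interval("e3", 40, 2),
        Interval("e4", 31, 5),
    ]
    chosen = {"e4"}
    # Step 3 applies the stricter threshold pair first, then the looser one.
    for min_pi, min_snv in ((30, 3), (20, 3)):
        print(min_pi, [iv.name for iv in candidates(pool, chosen, min_pi, min_snv)])
```

Between the two passes, the intervals picked in the first pass would be added to `already_selected` before the looser thresholds are applied.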

See Table 4 for the detailed gene list.

Table 4

KRAS ALK ROS1 ADAM23 KIAA0907 KRTAP5-5 MAP1B
EGFR RB1 FGFR3 DNMT3B GAB1 TSHZ3 ZNF814
TP53 PDGFRA FGFR4 SDHAP2 OR10Z1 XIRP2 ZFHX4
BRAF KDR JAK3 DHX9 CNTNAP3B NYAP2 ZNF804A
PIK3CA FBXW7 APC CSNK2A1 IL32 NUDT11 OR5D18
ERBB2 HRAS FRG1B CNTN5 NAV3 SNAPC4 ZNF479
CDKN2A JAK2 CHEK2 ATXN3 TNRC6A ZNF598 OR51V1
NRAS ERBB4 KLK1 CLIP1 FAM135B KIAA2022 OR4N2
STK11 KIT NBPF10 OR4M2 VGLL3 DDX11L2 OR4C15
NFE2L2 SMAD4 PARG OR10G8 KRTAP4-11 MUC6 OR14C36
CTNNB1 FGFR2 FBN2 PAPPA2 ANAPC1 ATXN1 CROCC
MET DDR2 HSD17B7P2 OR8H2 FAM47C MUC16 OR2T2
PTEN ATM WASH2P PBX2 AKAP6 BEST3 PCDH11X
AKT1 RET POTEC POLDIP2 ZNF804B DSPP REG3A
KEAP1 NOTCH1 EEF1B2 SLC6A10P ZEB1 MB21D2 REG1B
DDX11 EPB41L4A TBX6 PRB2 OR2T34 NTRK3 LRRIQ3
DNAH8 OR2M2 WDR62 CNTNAP2 LPA NTRK1 EPHA5
OR2B11 OR4C16 DCAF4L2 CDH10 MMP27 NF1 OR5L2
OR4K2 KCNB2 EPHA3 CDH12 VAV3 INHBA OR2T33
FAM47A STAG3L2 PTPRD RALGAPB THSD4 FGFR1 GNA15
RYR2 KRTAP4-8 NOTCH2 FOLH1 OR4N4

II. Sequencing analysis

Using the present invention, a patient with a pulmonary nodule was tested for early screening of lung cancer according to the steps of the above method. The results are as follows:

The statistics of the sequencing data are shown in the following table:

Figure PCTCN2014088059-appb-000020

Note: Forward/reverse strand pairing rate: the ratio of the number of clusters with 3 or more reads that contain both forward-strand and reverse-strand reads to the total number of clusters with 3 or more reads, used to evaluate forward/reverse strand pairing in the usable data. Effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- (at least 2 forward-strand and 2 reverse-strand reads) to the total number of sequencing reads. Average sequencing depth: the average coverage of the bases in the target region by the effective data after error correction.

Cluster analysis:

The analysis of clusters of reads sharing the same index is shown in Figure 8, where the abscissa represents the number of duplicates (dup) in a cluster and the ordinate represents the total number of reads in clusters having that dup count. Figure 8 shows that most clusters have a dup count of about 10 and that a relatively large proportion of clusters satisfy the condition of 2 forward + 2 reverse reads. The final effective data utilization is 4.12%, and the average sequencing depth is 898X.

Mutation spectrum analysis:

The mutation spectrum is shown in Figure 9, where the abscissa represents the type of base substitution and the ordinate represents the number of mutations. For molecules derived from double-stranded DNA, complementary mutation types are expected in theory to occur at essentially the same frequency. Figure 9 shows that the distribution of mutated base types is essentially balanced, with a mutation frequency (mutations per nucleotide) of 2.6 × 10^-6.

Details of the detected variants (restricted to exonic regions and non-synonymous mutations):

Gene      Base mutation   Amino acid mutation   Mutation type       Mutation frequency
ZNF804A   c.126G>C        p.K42N                Missense mutation   2.6%
CDH10     c.2240C>T       p.S747F               Missense mutation   1.3%

Result analysis: Based on databases such as TCGA, COSMIC, ClinVar and HMGD and on the literature, no relevant driver mutations were detected in the patient's plasma, indicating that the patient has a relatively low cancer risk.

Example 5: Postoperative monitoring of twelve common cancers

1) Design of the gene chip for early screening and postoperative monitoring of 12 common tumors:

Based on databases such as TCGA, ICGC and COSMIC and the relevant literature, an iterative algorithm was used to design CANPer-JK, a gene chip for postoperative monitoring of 12 common cancers. The CANPer-JK chip covers driver genes related to the 12 common cancers, frequently mutated genes, important genes in 12 cancer signaling pathways, and others, for a total of 547 genes and 800 kb.

The chip design process consists of four main steps:

1. For each exon region of the driver genes related to the 12 cancers in the COSMIC database, count the number of mutated samples, record the mutated samples, count the number of samples carrying the most frequent hotspot variant, and compute the PI value (which evaluates how frequently mutations recur among patients on each exon; PI = cumulative number of patients carrying a mutation on the exon / exon length). Sort the exon regions in descending order of PI. An iterative algorithm is then applied: the samples mutated in the first exon region form the initial sample database; for every other interval, the number of its samples not yet in the sample database is counted, and the interval with the most such new samples is selected as the second chip interval. The mutated samples of the two selected intervals then form the sample database, and the third interval is selected in the same way, and so on until the sample database contains all samples, which gives the set of selected exon regions. For any gene none of whose intervals has been selected, all of its intervals are added to the chip intervals.

2. Based on databases such as TCGA and ICGC, take as candidate intervals those intervals that lie outside the driver gene intervals and contain a hotspot variant found in at least 5 samples (SNV >= 5), and repeat the iterative calculation of the previous step.

3. Based on databases such as TCGA and ICGC, among the intervals remaining after removal of those already selected, take as candidate intervals first those with PI >= 30 and SNV >= 3 and then those with PI >= 20 and SNV >= 3. In each case, select as the first chip interval the interval that most reduces the number of samples in the single-sample database, and repeat the above process iteratively.

4. Add fusion gene intervals and other intervals.
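The PI value used for ranking in step 1 follows directly from its definition (cumulative number of patients carrying a mutation on the exon divided by exon length). The sketch below computes and ranks PI for a few made-up exons. The unit of exon length is not stated in this specification, so it is left to the caller; the function names and the numbers are illustrative assumptions.

```python
def pi_value(n_patients_with_mutation, exon_length):
    """PI = cumulative patients carrying a mutation on the exon / exon length."""
    if exon_length <= 0:
        raise ValueError("exon_length must be positive")
    return n_patients_with_mutation / exon_length


def rank_exons(exons):
    """exons: iterable of (name, n_patients, exon_length).

    Returns (name, PI) pairs sorted by descending PI.
    """
    scored = [(name, pi_value(n, length)) for name, n, length in exons]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = [("exon1", 50, 100), ("exon2", 120, 150), ("exon3", 10, 200)]
    for name, pi in rank_exons(toy):
        print(name, round(pi, 3))
```

Normalizing by exon length favors short exons with dense recurrent mutations, which keeps the total chip size small for a given number of covered patients.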

See Table 5 for the detailed gene list.

Table 5

ABCB1 BRAF CHD2 ERBB4 FOXA2 IKBKE MECOM NTRK1 PTCH2 SF3A1 TIPARP
ABL1 BRCA1 CHD4 ERCC2 FOXL2 IKZF1 MED12 NTRK2 PTEN SF3B1 TLR4
ABL2 BRCA2 CHEK1 ERCC3 FPGS IL13RA2 MEF2B NTRK3 PTP4A3 SH2B3 TMEM127
ACVR1B BRIP1 CHEK2 ERG FUBP1 IL2RA MEN1 NUP93 PTPN11 SIK1 TNFAIP3
ACVR2A BTG1 CHUK ESR1 FYN IL2RB MET PAK3 PTPRD SIN3A TNFRSF14
AJUBA BTK CIC ETV1 GAB2 IL2RG MIR142 PAK7 RAC1 SLAMF7 TNFRSF8
AKT1 C11orf30 CRBN ETV6 GATA1 IL7R MITF PALB2 RAC2 SLC4A1 TNFSF11
AKT2 C1QA CREBBP EWSR1 GATA2 INHBA MLH1 PARP1 RAD21 SLIT2 TNFSF13B
AKT3 C1QB CRIPAK EXT1 GATA3 IRF4 MLH3 PARP2 RAD50 SMAD2 TOP1
ALK C1QC CRKL EXT2 GID4 IRS2 MLL PARP3 RAD51 SMAD3 TOP2A
ALOX12B C1R CRLF2 EZH2 GNA11 ITGB2 MLL2 PARP4 RAD51B SMAD4 TOP2B
ANGPT1 C1S CROT FAM123B GNA13 JAK1 MLL3 PAX5 RAD51C SMARCA1 TP53
ANGPT2 CAMK2G CSF1R FAM46C GNAQ JAK2 MLL4 PBRM1 RAD51D SMARCA4 TRAF7
APC CARD11 CTCF FANCA GNAS JAK3 MPL PCBP1 RAD52 SMARCB1 TRRAP
APCDD1 CASP8 CTLA4 FANCC GNRHR JUN MRE11A PCM1 RAD54L SMARCD1 TSC1
AR CBFB CTNNA1 FANCD2 GPR124 KAT6A MS4A1 PDGFRA RAF1 SMC1A TSC2
ARAF CBL CTNNB1 FANCE GRIN2A KCNH2 MSH2 PDGFRB RARA SMC3 TSHR
ARFRP1 CBLB CUL4A FANCF GRM3 KDM5A MSH3 PDK1 RARB SMO TSHZ2
ARHGAP35 CBR1 CUL4B FANCG GSK3B KDM5C MSH4 PHF6 RARG SOCS1 TSHZ3
ARID1A CCND1 CYLD FANCI H3F3A KDM6A MSH5 PIGF RB1 SOX10 TUBA1A
ARID1B CCND2 CYP17A1 FANCL H3F3C KDR MSH6 PIK3C2A REL SOX17 TUBB
ARID2 CCND3 DAXX FANCM HCK KEAP1 MSR1 PIK3C2B RET SOX2 TUBD1
ARID5B CCNE1 DDR1 FAT3 HDAC1 KIF1B MTOR PIK3C2G RFC1 SOX9 TUBE1
ASXL1 CD22 DDR2 FBXW7 HDAC2 KIF5B MUC1 PIK3C3 RHEB SPEN TUBG1
ATM CD33 DIS3 FCGR1A HDAC3 KIT MUTYH PIK3CA RICTOR SPOP TXNRD1
ATR CD3D DNMT1 FCGR2A HDAC4 KLF4 MYC PIK3CB RNASEL SPRY4 TYR
ATRX CD3E DNMT3A FCGR2B HDAC6 KLHL6 MYCL1 PIK3CG RNF43 SRC U2AF1
AURKA CD3G DOCK2 FCGR2C HDAC8 KRAS MYCN PIK3R1 ROBO1 SRD5A2 U2AF2
AURKB CD52 DOT1L FCGR3A HGF LCK MYD88 PIK3R2 ROBO2 SRSF1 USP9X
AXIN1 CD79A DUSP6 FCGR3B HIF1A LHCGR NAV3 PLK1 ROS1 SRSF2 VEGFA
AXIN2 CD79B EDNRA FGF10 HIST1H1C LIFR NBN PML RPA1 SRSF7 VEGFB
AXL CD80 EGFR FGF12 HIST1H2BD LIMK1 NCOA1 PMS1 RPL22 SSTR2 VEZF1
B2M CDC25C EGR3 FGF14 HIST1H3B LMO1 NCOA2 PMS2 RPL5 SSTR3 VHL
B4GALT3 CDC42 EIF4A2 FGF19 HLA-A LRRK2 NCOR1 PNRC1 RPS14 SSTR5 WHSC1L1
BACH1 CDC73 ELAC2 FGF23 HNF1A LYN NEK11 POLQ RPS6KB1 STAG2 WISP3
BAK1 CDH1 ELF3 FGF3 HRAS MALAT1 NF1 PPP2R1A RPTOR STAT4 WWP1
BAP1 CDK12 ELMO1 FGF4 HRH2 MAP2K1 NF2 PRDM1 RUNX1 STAT5B XBP1
BARD1 CDK2 EML4 FGF6 HSD17B3 MAP2K2 NFE2L2 PRKAA1 RUNX1T1 STK11 XIAP
BCL2 CDK4 EP300 FGF7 HSD3B2 MAP2K4 NFE2L3 PRKAR1A RXRA SUFU XPA
BCL2A1 CDK6 EPCAM FGFR1 HSH2D MAP3K1 NFKBIA PRKCA RXRB SUZ12 XPC
BCL2L1 CDK8 EPHA2 FGFR2 HSP90AA1 MAP3K13 NKX2-1 PRKCB RXRG SYK XPO1
BCL2L11 CDKN1A EPHA3 FGFR3 HSPA4 MAPK1 NKX3-1 PRKCG SDHAF2 TACR1 XRCC3
BCL2L2 CDKN1B EPHA5 FGFR4 IDH1 MAPK3 NOTCH1 PRKDC SDHB TAF1 YES1
BCL6 CDKN2A EPHB1 FH IDH2 MAPK8 NOTCH2 PRPF40B SDHC TBL1XR1 ZNF217
BCOR CDKN2B EPHB2 FLCN IFNAR1 MAPK8IP1 NOTCH3 PRSS8 SDHD TBX3 ZNF703
BCORL1 CDKN2C EPHB6 FLT1 IFNAR2 MAX NOTCH4 PRX SEMA3A TEK ZRSR2
BCR CDX2 EPOR FLT3 IGF1 MC1R NPM1 PSMB1 SEMA3E TERT WT1
BLM CEBPA EPPK1 FLT4 IGF1R MCL1 NR3C1 PSMB2 SETBP1 TET2
BMPR1A CFLAR ERBB2 FNTA IGF2 MDM2 NRAS PSMB5 SETD2 TFG
BRAF CHD1 ERBB3 FOXA1 IKBKB MDM4 NSD1 PTCH1 SF1 TGFBR2

II. Sequencing analysis

Using the present invention, a postoperative breast cancer patient (breast cancer being one of the 12 common tumors) was tested for postoperative monitoring of breast cancer according to the steps of the above method. The results are as follows:

The statistics of the sequencing data are shown in the following table:

Figure PCTCN2014088059-appb-000021

Note: Forward/reverse strand pairing rate: the ratio of the number of clusters with 3 or more reads that contain both forward-strand and reverse-strand reads to the total number of clusters with 3 or more reads, used to evaluate forward/reverse strand pairing in the usable data. Effective data utilization: the ratio of the number of error-corrected reads from clusters satisfying at least 2+/2- (at least 2 forward-strand and 2 reverse-strand reads) to the total number of sequencing reads. Average sequencing depth: the average coverage of the bases in the target region by the effective data after error correction.

Cluster analysis:

The analysis of clusters of reads sharing the same index is shown in Figure 10, where the abscissa represents the number of duplicates (dup) in a cluster and the ordinate represents the total number of reads in clusters having that dup count. Figure 10 shows that most clusters have a dup count of about 6 and that most clusters satisfy the condition of 2 forward + 2 reverse reads. The final effective data utilization is 4.74%, and the average sequencing depth is 1028.6X.

Mutation spectrum analysis:

The mutation spectrum is shown in Figure 11, where the abscissa represents the type of base substitution and the ordinate represents the number of mutations. For molecules derived from double-stranded DNA, complementary mutation types are expected in theory to occur at essentially the same frequency. Figure 11 shows that the distribution of mutated base types is essentially balanced, with a mutation frequency (mutations per nucleotide) of 3.1 × 10^-6.

Details of the detected variants (restricted to exonic regions and non-synonymous mutations):

Figure PCTCN2014088059-appb-000022

Figure PCTCN2014088059-appb-000023

Result analysis: The patient's postoperative plasma contained not only variants present in the patient's tumor, such as ROS1 p.A2106T, AR p.G457del, and HLA-A p.R138G, but also high-frequency variants such as PML p.R284P and IRF4 p.E11*. This indicates a poor postoperative outcome, and the patient is advised to visit an appropriate medical institution for more comprehensive testing and relevant interventions.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art will understand that the order of the steps in the solution proposed by the present invention may be adjusted, and such adjustments also fall within the scope of the present invention.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the present invention is defined by the claims and their equivalents.

Claims (50)

1. A method of constructing a sequencing library, comprising:
(a) ligating an adapter to each end of a double-stranded DNA fragment to obtain a ligation product, wherein the adapter comprises a first strand and a second strand, the first strand and the second strand are partially complementary, and the first strand contains a first tag sequence, such that the adapter defines a double-stranded region and two single-stranded tails, the sequence of one of the two single-stranded tails containing the first tag;
(b) dissociating the ligation product into single-stranded DNA fragments;
(c) performing a strand extension reaction on the single-stranded DNA fragments using a first primer to obtain a strand extension product, wherein the first primer comprises a second tag sequence and is adapted to form a double-stranded structure with the first strand of the adapter, except that there is a mismatch between the first tag sequence and the second tag sequence; and
(d) amplifying the strand extension product to obtain an amplification product that constitutes the sequencing library, wherein the amplification uses primers adapted to amplify the first tag sequence and the second tag sequence simultaneously.
2. The method according to claim 1, wherein the double-stranded DNA fragment is obtained by: end-repairing a nucleic acid sample to obtain a repaired nucleic acid sample; and adding a base A to the 5' end of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each end, the nucleic acid sample having a sticky-end base A at each end constituting the double-stranded DNA fragment.
3. The method according to claim 2, wherein the nucleic acid sample is at least a portion of human genomic DNA or a cell-free nucleic acid.
4. The method according to claim 3, wherein the human cell-free nucleic acid is extracted from the peripheral blood of a patient.
5. The method according to claim 4, wherein the patient has cancer, the cancer being at least one selected from: bladder cancer, prostate cancer, lung cancer, colorectal cancer, gastric cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer.
6. The method according to claim 5, wherein the at least a portion of human genomic DNA is obtained by randomly fragmenting human genomic DNA.
7. The method according to claim 2, wherein the adapter has a 3' base T sticky end.
8. The method according to claim 1, wherein the single-stranded DNA fragments are obtained by denaturing the ligation product.
9. The method according to claim 1, wherein, before the strand extension is performed, the single-stranded DNA fragments are screened with probes, the probes specifically recognizing a predetermined region.
10. The method according to claim 9, wherein the predetermined region comprises one of:
(1) at least one of the genes shown in Tables 1 to 5;
(2) the CDS region of (1); and
(3) a region of at least 10 bp upstream and downstream of (2).
11. The method according to claim 10, wherein the probes are provided in the form of a chip.
12. The method according to claim 1, wherein the strand extension reaction is carried out in the presence of UDG enzyme/FPG enzyme.
13. The method according to claim 1, wherein the first tag sequence and the second tag sequence are each independently 4 to 10 nt in length, preferably 8 nt.
14. The method according to claim 1, wherein the first tag sequence and the second tag sequence are both 8 nt in length.
15. The method according to claim 1, wherein there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence.
16. The method according to claim 1, wherein the first strand of the adapter has the sequence shown in SEQ ID NO: 1, the second strand of the adapter has the sequence shown in SEQ ID NO: 2, the first tag has the sequence shown in at least one of SEQ ID NOs: 3-6, the second tag has the sequence shown in at least one of SEQ ID NOs: 7-10, the first primer has the sequence shown in SEQ ID NO: 11, the second primer has the sequence shown in SEQ ID NO: 12, and the third primer has the sequence shown in SEQ ID NO: 13.
17. A sequencing method, comprising: constructing a sequencing library according to the method of any one of claims 1 to 16; and sequencing the sequencing library.
18. The method according to claim 17, wherein the sequencing is performed on a HiSeq 2000 or HiSeq 2500.
19. A method of determining a nucleic acid sequence, comprising:
sequencing a nucleic acid sample according to the method of claim 17 or 18 to obtain a sequencing result consisting of a plurality of sequencing data;
constructing, based on the sequencing result, at least one sequencing data subset, wherein all sequencing data in each sequencing data subset correspond to the same source sequence in the nucleic acid sample;
for each sequencing data subset, determining the sequencing data corresponding to the first tag sequence to be positive-strand sequencing data and the sequencing data corresponding to the second tag sequence to be negative-strand sequencing data;
for each sequencing data subset, correcting the sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data to determine corrected sequencing data; and
determining the sequence of the nucleic acid sample based on the corrected sequencing data.
20. The method according to claim 19, wherein the sequencing is paired-end sequencing and the sequencing result consists of a plurality of pairs of paired sequencing data.
21. The method according to claim 20, wherein constructing at least one sequencing data subset based on the sequencing result is performed by:
determining, for each pair of the plurality of pairs of paired sequencing data, a paired sequencing data index consisting of the first N bases of each member of the pair, wherein N is an integer between 10 and 20;
constructing, based on the paired sequencing data index, at least one preliminary sequencing data subset, wherein all sequencing data in a preliminary sequencing data subset have the same paired sequencing data index; and
subdividing the at least one preliminary sequencing data subset based on the Hamming distance between the sequencing data in the preliminary sequencing data subset to obtain a plurality of the sequencing data subsets.
22. The method according to claim 21, wherein N is 12.
23. The method according to claim 21, wherein, in each of the plurality of sequencing data subsets, the Hamming distance between any two pairs of paired sequencing data does not exceed 20.
24. The method according to claim 21, wherein, in each of the plurality of sequencing data subsets, there are at least two positive-strand sequencing data and at least two negative-strand sequencing data.
25. The method according to claim 20, wherein determining the corrected sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data follows the principle that each base in the corrected sequencing data is supported by at least 50% of the positive-strand sequencing data and at least 50% of the negative-strand sequencing data simultaneously.
26. The method according to claim 20, wherein each base in the corrected sequencing data is supported by at least 80% of the positive-strand sequencing data and at least 80% of the negative-strand sequencing data simultaneously.
27. The method according to claim 20, further comprising: aligning the corrected sequencing data to a reference sequence and removing sequencing data with an alignment quality below 30.
28. The method according to claim 20, wherein SNV analysis or Indel analysis is performed based on the sequence of the nucleic acid sample.
29. An apparatus for constructing a sequencing library, comprising:
a ligation unit for ligating an adapter to each end of a double-stranded DNA fragment to obtain a ligation product, wherein the adapter comprises a first strand and a second strand, the first strand and the second strand are partially complementary, and the first strand contains a first tag sequence, such that the adapter defines a double-stranded region and two single-stranded tails, the sequence of one of the two single-stranded tails containing the first tag;
a dissociation unit for dissociating the ligation product into single-stranded DNA fragments;
a strand extension unit for performing a strand extension reaction on the single-stranded DNA fragments using a first primer to obtain a strand extension product, wherein the first primer comprises a second tag sequence and is adapted to form a double-stranded structure with the first strand of the adapter, except that there is a mismatch between the first tag sequence and the second tag sequence; and
an amplification unit for amplifying the strand extension product to obtain an amplification product that constitutes the sequencing library, wherein the amplification uses a second primer and a third primer, the second primer recognizing the second strand of the adapter and the third primer being configured to amplify the first tag sequence and the second tag sequence simultaneously.
30. The apparatus according to claim 29, further comprising: an end repair unit for end-repairing a nucleic acid sample to obtain a repaired nucleic acid sample; and an end modification unit for adding a base A to the 5' end of the nucleic acid sample to obtain a nucleic acid sample having a sticky-end base A at each end, the nucleic acid sample having a sticky-end base A at each end constituting the double-stranded DNA fragment.
31. The apparatus according to claim 29, further comprising a screening unit for screening the single-stranded DNA fragments with probes before the strand extension is performed, the probes specifically recognizing a predetermined region.
32. The apparatus according to claim 31, wherein the predetermined region comprises one of:
(1) at least one of the genes shown in Tables 1 to 5;
(2) the CDS region of (1); and
(3) a region of at least 10 bp upstream and downstream of (2).
33. The apparatus according to claim 31, wherein the probes are provided in the form of a chip.
34. The apparatus according to claim 29, wherein the strand extension reaction is carried out in the presence of UDG enzyme/FPG enzyme.
35. The apparatus according to claim 29, wherein the first tag sequence and the second tag sequence are each independently 4 to 10 nt in length.
36. The apparatus according to claim 29, wherein the first tag sequence and the second tag sequence are both 8 nt in length.
37. The apparatus according to claim 29, wherein there is a mismatch of at least 2 nt between the first tag sequence and the second tag sequence.
38. The apparatus according to claim 29, wherein the first strand of the adapter has the sequence shown in SEQ ID NO: 1, the second strand of the adapter has the sequence shown in SEQ ID NO: 2, the first tag has the sequence shown in at least one of SEQ ID NOs: 3-6, the second tag has the sequence shown in at least one of SEQ ID NOs: 7-10, the first primer has the sequence shown in SEQ ID NO: 11, the second primer has the sequence shown in SEQ ID NO: 12, and the third primer has the sequence shown in SEQ ID NO: 13.
39. A sequencing device, comprising: the apparatus for constructing a sequencing library according to any one of claims 29 to 38; and a sequencing apparatus for sequencing the sequencing library.
40. The sequencing device according to claim 39, wherein the sequencing apparatus is a HiSeq 2000 or HiSeq 2500.
41. A system for determining a nucleic acid sequence, comprising:
the sequencing device according to claim 39 or 40, for sequencing a nucleic acid sample to obtain a sequencing result consisting of a plurality of sequencing data;
a sequencing data subset construction device for constructing, based on the sequencing result, at least one sequencing data subset, wherein all sequencing data in each sequencing data subset correspond to the same source sequence in the nucleic acid sample;
a sequencing data classification device for determining, for each sequencing data subset, the sequencing data corresponding to the first tag sequence to be positive-strand sequencing data and the sequencing data corresponding to the second tag sequence to be negative-strand sequencing data;
a sequencing data correction device for correcting, for each sequencing data subset, the sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data to determine corrected sequencing data; and
a sequence determination device for determining the sequence of the nucleic acid sample based on the corrected sequencing data.
42. The system according to claim 41, wherein the sequencing is paired-end sequencing and the sequencing result consists of a plurality of pairs of paired sequencing data.
43. The system according to claim 41, wherein the sequencing data subset construction device comprises:
a sequencing data index determination device for determining, for each pair of the plurality of pairs of paired sequencing data, a paired sequencing data index consisting of the first N bases of each member of the pair, wherein N is an integer between 10 and 20;
a preliminary screening apparatus for constructing, based on the paired sequencing data index, at least one preliminary sequencing data subset, wherein all sequencing data in a preliminary sequencing data subset have the same paired sequencing data index; and
a secondary screening apparatus for subdividing the at least one preliminary sequencing data subset based on the Hamming distance between the sequencing data in the preliminary sequencing data subset to obtain a plurality of the sequencing data subsets.
44. The system according to claim 43, wherein N is 12.
45. The system according to claim 43, wherein, in each of the plurality of sequencing data subsets, the Hamming distance between any two pairs of paired sequencing data does not exceed 20.
46. The system according to claim 43, wherein, in each of the plurality of sequencing data subsets, there are at least two positive-strand sequencing data and at least two negative-strand sequencing data.
47. The system according to claim 41, wherein determining the corrected sequencing data based on the positive-strand sequencing data and the negative-strand sequencing data follows the principle that each base in the corrected sequencing data is supported by at least 50% of the positive-strand sequencing data and at least 50% of the negative-strand sequencing data simultaneously.
48. The system according to claim 47, wherein each base in the corrected sequencing data is supported by at least 80% of the positive-strand sequencing data and at least 80% of the negative-strand sequencing data simultaneously.
49. The system according to claim 41, further comprising: aligning the corrected sequencing data to a reference sequence and removing sequencing data with an alignment quality below 30.
50. The system according to claim 41, further comprising a sequence analysis apparatus for performing SNV analysis or Indel analysis based on the sequence of the nucleic acid sample.
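The read-grouping and correction steps recited in the claims above (a paired index built from the first N bases of each mate, subdivision by Hamming distance, and per-base support from both strands) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the input format `(read1, read2, strand)`, the function names, the greedy subdivision strategy, the use of read 1 only for the consensus, and the masking of unsupported bases with `N` are all hypothetical choices. Reads are assumed to have equal length, and the strand label is assumed to have been assigned already from the tag sequence.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_subsets(read_pairs, n=12, max_dist=20):
    """Group read pairs that likely derive from the same source molecule.

    Step 1: bucket pairs by index = first n bases of each mate.
    Step 2: greedily split each bucket so that every pair in a subset is
            within max_dist mismatches of every other pair in that subset.
    """
    buckets = defaultdict(list)
    for pair in read_pairs:
        read1, read2, _strand = pair
        buckets[(read1[:n], read2[:n])].append(pair)

    subsets = []
    for members in buckets.values():
        groups = []
        for pair in members:
            seq = pair[0] + pair[1]
            for group in groups:
                if all(hamming(seq, m[0] + m[1]) <= max_dist for m in group):
                    group.append(pair)
                    break
            else:
                groups.append([pair])  # no compatible group: start a new one
        subsets.extend(groups)
    return subsets


def consensus(subset, min_support=0.5, min_per_strand=2):
    """Return a corrected read-1 sequence for one subset, or None if the
    subset lacks min_per_strand reads on either strand.

    A base is kept only if it is the single base supported by at least
    min_support of the positive-strand reads AND of the negative-strand
    reads; otherwise the position is masked with 'N' (an assumption).
    """
    plus = [p[0] for p in subset if p[2] == "+"]
    minus = [p[0] for p in subset if p[2] == "-"]
    if len(plus) < min_per_strand or len(minus) < min_per_strand:
        return None

    corrected = []
    for col_plus, col_minus in zip(zip(*plus), zip(*minus)):
        supported = [
            base for base in "ACGT"
            if col_plus.count(base) / len(plus) >= min_support
            and col_minus.count(base) / len(minus) >= min_support
        ]
        corrected.append(supported[0] if len(supported) == 1 else "N")
    return "".join(corrected)
```

With `min_support=0.8` the same function applies the stricter 80% threshold. A position where one of three negative-strand reads disagrees is kept at the 50% threshold but masked at 80%.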
PCT/CN2014/088059 2014-09-30 2014-09-30 Method for constructing sequencing library and application thereof Ceased WO2016049929A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/088059 WO2016049929A1 (en) 2014-09-30 2014-09-30 Method for constructing sequencing library and application thereof

Publications (1)

Publication Number Publication Date
WO2016049929A1 true WO2016049929A1 (en) 2016-04-07

Family

ID=55629352

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/088059 Ceased WO2016049929A1 (en) 2014-09-30 2014-09-30 Method for constructing sequencing library and application thereof

Country Status (1)

Country Link
WO (1) WO2016049929A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14903291

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14903291

Country of ref document: EP

Kind code of ref document: A1