
WO2017204414A1 - Method and apparatus for analyzing the degree of cross-contamination of a sample - Google Patents

Method and apparatus for analyzing the degree of cross-contamination of a sample

Info

Publication number
WO2017204414A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence information
allele
sample
alleles
target sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2016/009451
Other languages
English (en)
Korean (ko)
Inventor
박동현
손대순
박웅양
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Samsung Life Public Welfare Foundation
Original Assignee
Samsung Electronics Co Ltd
Samsung Life Public Welfare Foundation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd, Samsung Life Public Welfare Foundation

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6827 Hybridisation assays for detection of mutation or polymorphism
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6858 Allele-specific amplification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • a method for analyzing the degree of contamination between samples, a computer-readable recording medium having recorded thereon a program for executing the method, and an apparatus for analyzing the degree of contamination between samples.
  • a genome is all the genetic information of a living thing.
  • Techniques for sequencing a person's genome have been developed, such as DNA chips, next-generation sequencing (NGS) technology, and next-next-generation sequencing technology.
  • Next-generation sequencing can be used interchangeably with massively parallel sequencing or second-generation sequencing.
  • genetic information such as nucleotide sequences, proteins, etc. is widely used to find genes expressing diseases such as diabetes and cancer, or to identify correlations between genetic diversity and expression characteristics of individuals.
  • the genetic data collected from the individual is important in identifying the genetic characteristics of the individual associated with different symptoms or disease progression.
  • genetic data such as individual nucleotide sequences, proteins, etc. are essential data to identify current and future disease-related information to prevent disease or to select the optimal treatment method in the early stages of disease.
  • Techniques for accurately analyzing mutations such as single nucleotide variants (SNV), copy number variations (CNV), insertions and deletions (InDel), and translocations, and for using them to diagnose diseases, are being studied.
  • an apparatus for analyzing the degree of cross-contamination of samples with respect to a target sample is provided, the apparatus including: a sequence information obtaining unit for obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of the nucleic acid fragment from a mixed sample in which the target sample and the additional sample are mixed; an allele frequency calculating unit for calculating an allele frequency from each of the obtained first sequence information and second sequence information; and a calculation unit for comparing the calculated allele frequencies for a specific site of the chromosome.
  • a computer-readable recording medium having recorded thereon a program for executing the method.
  • the sample may be a biological sample or a compound of the subject, that is, a synthetic sample.
  • the subject may include mammals, such as humans, non-human primates, cattle, horses, pigs, sheep, goats, dogs, cats, or rodents.
  • the biological sample may be obtained from blood, plasma, serum, urine, saliva, mucosal secretions, sputum, feces, tears, or a combination thereof.
  • the biological sample of the subject may be a sample of eukaryotic cells, prokaryotic cells, viruses, bacteriophage, etc. derived from various species.
  • the sample may include a nucleic acid or synthetic nucleic acid of the subject.
  • the nucleic acid may be used interchangeably with a polynucleotide or oligonucleotide of any length.
  • the nucleic acid may be cell-free DNA (cfDNA) or isolated DNA.
  • the method of separating nucleic acid from the sample may be performed by a method known to those skilled in the art.
  • the length of the nucleic acid fragment may be about 10 bp (base pairs) to about 2000 bp, about 15 bp to about 1500 bp, about 20 bp to about 1000 bp, about 20 bp to about 500 bp, or about 20 bp to about 200 bp.
  • Obtaining sequence information of the nucleic acid fragment may include obtaining sequence information by performing next-generation sequencing (NGS) on the separated nucleic acid.
  • the "next-generation sequencing" may be used interchangeably with "massively parallel sequencing" or "second-generation sequencing".
  • Next-generation sequencing refers to a technique of fragmenting a full-length genome in chip-based and PCR-based paired end formats, and performing the sequencing of the fragments at high speed based on hybridization.
  • Next-generation sequencing is a technique for sequencing a large number of nucleic acid fragments in parallel, and targeted sequencing or panel sequencing may be performed based on next-generation sequencing.
  • Next-generation sequencing may be performed using, for example, the 454 platform (Roche), GS FLX Titanium, Illumina MiSeq, Illumina HiSeq, Illumina Genome Analyzer, Solexa platform, SOLiD System (Applied Biosystems), Ion Proton (Life Technologies), Complete Genomics, Helicos Biosciences HeliScope, single molecule real-time (SMRT™) technology from Pacific Biosciences, or a combination thereof.
  • the method may further comprise preparing a nucleic acid library to perform next generation sequencing.
  • the nucleic acid library can be prepared according to the next generation sequencing scheme.
  • Nucleic acid libraries can be constructed according to the instructions of the manufacturer that provides the next-generation sequencing.
  • the sequence information of the obtained nucleic acid fragments may be called a read.
  • Sequence information of the nucleic acid fragments may be stored in the system, and N masking may be performed.
  • N masking means treating individual bases of excessively low quality as missing.
  • a low-quality read filter can be applied.
  • the low-quality read filter means excluding the sequence information of nucleic acid fragments that were read with excessively low quality.
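
The two quality steps described above can be sketched as follows. This is a minimal illustration, not the procedure of the application: the function names, the use of Phred-scaled base qualities, and both thresholds (`min_base_q`, `min_mean_q`) are assumptions, since the text does not state how "excessively low quality" is measured.

```python
def n_mask(seq, quals, min_base_q=20):
    """Replace each base whose quality is below min_base_q (assumed cutoff) with 'N'."""
    return "".join(b if q >= min_base_q else "N" for b, q in zip(seq, quals))


def passes_read_filter(quals, min_mean_q=20.0):
    """Keep a read only if its mean base quality reaches min_mean_q (assumed cutoff)."""
    return bool(quals) and sum(quals) / len(quals) >= min_mean_q


# Hypothetical reads: (sequence, per-base quality scores).
reads = [
    ("ACGT", [30, 5, 30, 30]),  # one poor base: kept, with that base masked
    ("GGCA", [3, 4, 2, 5]),     # poor overall: excluded by the read filter
]

# Filter whole reads first, then mask the remaining low-quality bases.
kept = [n_mask(seq, quals) for seq, quals in reads if passes_read_filter(quals)]
```

Filtering whole reads before masking single bases is a design choice made here for brevity; the text does not fix the order.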
  • the method may include assigning sequence information of the nucleic acid fragment to a chromosome by mapping the obtained sequence information to a human reference genome.
  • the human reference genome may be hg18 or hg19. Sequence information mapped to only one genomic position in the human reference genome may be designated as unique sequence information.
  • the sequence information of the nucleic acid fragments can be assigned to a position on the chromosome based on the designated unique sequence information.
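
Selecting unique sequence information can be sketched as below, assuming the mapper's output has already been reduced to hypothetical `(read_id, chromosome, position)` tuples. Real aligners report this differently, so the data layout and the function name are illustrative only.

```python
from collections import Counter


def unique_reads(alignments):
    """Keep only reads that map to exactly one genomic position.

    alignments: iterable of (read_id, chromosome, position) tuples (assumed layout).
    Returns a dict mapping read_id to its single (chromosome, position).
    """
    alignments = list(alignments)
    hits = Counter(read_id for read_id, _, _ in alignments)
    return {
        read_id: (chrom, pos)
        for read_id, chrom, pos in alignments
        if hits[read_id] == 1
    }


# Hypothetical mapping result: r2 maps to two places and is therefore not unique.
alignments = [("r1", "chr1", 100), ("r2", "chr2", 500), ("r2", "chr7", 900)]
assigned = unique_reads(alignments)
```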
  • the locus of the chromosome may be a continuous range on a chromosome having a length of at least about 5 kb, about 10 kb, about 20 kb, about 50 kb, about 100 kb, about 1000 kb, or about 2000 kb.
  • the chromosomal locus may be a single chromosome.
  • a global alignment or a local alignment may be performed in parallel.
  • the global alignment refers to a method of placing the entire sequence information of a nucleic acid fragment in the most similar portion of the reference genome.
  • the local alignment refers to a method of placing a part of the sequence information of a nucleic acid fragment in the most similar portion of the reference genome sequence.
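
The difference between the two alignment modes can be illustrated with a small dynamic-programming scorer. This is a toy sketch and not a production aligner: the scoring values are assumptions, and "global" is implemented in the sense described above (the whole read must be placed, but it may start and end anywhere in the reference), which is sometimes called semi-global alignment.

```python
def best_score(read, ref, whole_read, match=1, mismatch=-1, gap=-1):
    """Best alignment score of read against ref.

    whole_read=True : every base of the read must be aligned (global in the read).
    whole_read=False: any part of the read may be aligned (local, Smith-Waterman style).
    """
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    if whole_read:
        for i in range(1, rows):
            H[i][0] = i * gap  # leaving read bases unaligned is penalized
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + step, H[i - 1][j] + gap, H[i][j - 1] + gap)
            H[i][j] = score if whole_read else max(score, 0)
    if whole_read:
        return max(H[-1])  # read fully consumed, ending anywhere in the reference
    return max(max(row) for row in H)


# The last two read bases have no counterpart in this short reference.
global_score = best_score("ACGTT", "ACG", whole_read=True)
local_score = best_score("ACGTT", "ACG", whole_read=False)
```

With these inputs the local score only reflects the matching part `ACG`, while the whole-read score also pays for the two bases that cannot be placed.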
  • the method may include identifying a variation in the DNA of the sample.
  • the variation may be identified using a known mutation detection program, for example, GATK, SAMtools, MoDIL, SegSeq, PEMer, VariationHunter, Pindel, BreakDancer, or MuTect, but is not limited thereto.
  • the first sequence information may be sequence information of a nucleic acid fragment obtained from each of a plurality of samples including a target sample and an additional sample.
  • the first sequence information may be a result of sequencing the target sample alone.
  • the first sequence information may be a result of sequencing each sample individually for one or more, two or more, or five or more additional samples.
  • the second sequence information may be sequence information of a nucleic acid fragment obtained from a mixed sample in which a target sample and an additional sample are mixed.
  • the sample loaded on a sequencer that performs the sequencing may be a mixed sample in which a plurality of samples are mixed. Using such a mixed sample has the advantages of reducing the cost of increasing the concentration of the target and of providing high throughput in a short time. The plurality of samples can then be distinguished from each other by tagging the library of each sample with a label unique to that sample.
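
Separating the reads of a mixed sample by label can be sketched as below. The label sequences, sample names, and the `(label, sequence)` read layout are hypothetical; the text only states that each library carries a unique label.

```python
from collections import defaultdict


def demultiplex(tagged_reads, label_to_sample):
    """Group reads by the sample that owns each read's label."""
    by_sample = defaultdict(list)
    for label, sequence in tagged_reads:
        sample = label_to_sample.get(label)
        if sample is not None:  # reads carrying an unknown label are skipped
            by_sample[sample].append(sequence)
    return dict(by_sample)


label_to_sample = {"AAGG": "S1", "CCTT": "S2"}  # hypothetical labels
tagged_reads = [("AAGG", "ACGTAC"), ("CCTT", "TTGACA"), ("AAGG", "GGCATT")]
reads_by_sample = demultiplex(tagged_reads, label_to_sample)
```

A read whose label was swapped or misread ends up in the wrong sample's list, which is the situation the contamination analysis is meant to quantify.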
  • the method may include calculating an allele frequency from each of the obtained first and second sequence information.
  • the allele frequency of each allele can be calculated.
  • the allele frequency may refer to a numerical value representing a composition ratio between different alleles constituting the same gene in one sample.
  • the allele frequency may be expressed as the frequency of sequence information for one or more of A, G, C, and T, or for all of A, G, C, and T.
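
As a minimal sketch of this definition, the allele frequency at one site can be computed from the bases that the reads of one sample show at that site. The input format (a string of observed bases) and the function name are assumptions made for illustration.

```python
from collections import Counter

BASES = "AGCT"


def allele_frequencies(observed_bases):
    """Fraction of A, G, C and T among the bases observed at one site.

    Characters other than A, G, C and T (for example masked 'N' bases) are ignored.
    """
    counts = Counter(b for b in observed_bases if b in BASES)
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in BASES}
    return {b: counts[b] / total for b in BASES}


# Hypothetical site covered by ten reads: nine show T, one shows C.
freqs = allele_frequencies("TTTTTTTTTC")
```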
  • the method may comprise comparing the calculated allele frequency for a particular site of the chromosome.
  • the specific site of the chromosome may be the same or a corresponding exon site or intron site among the plurality of samples, and may be the same base position on the same-numbered chromosome.
  • the specific site of the chromosome may be a part or all of a region, including the mutation prediction site and its surrounding sites, that is subjected to sequencing or targeted sequencing.
  • the allele frequency obtained from the first sequence information and the allele frequency obtained from the second sequence information can be compared.
  • the allele frequency of A from the first sequence information and the allele frequency of A from the second sequence information can be compared.
  • likewise, for the same target sample and the same specific site of the chromosome, the allele frequencies of G, C, and T from the first sequence information can be compared with the allele frequencies of G, C, and T, respectively, from the second sequence information.
  • the comparison may be made on the number of alleles having a given allele frequency, or on the ratio of that number to the total number of alleles.
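
A per-base comparison of the two sets of allele frequencies could look like the sketch below. The text does not define what counts as a significant difference, so the `tolerance` parameter and the example frequencies are purely illustrative assumptions.

```python
BASES = "AGCT"


def frequency_shift(first, second):
    """Per-base change in allele frequency: second sequence information minus first."""
    return {b: second.get(b, 0.0) - first.get(b, 0.0) for b in BASES}


def differs(first, second, tolerance=0.01):
    """True if any base changes by more than tolerance (an assumed, arbitrary cutoff)."""
    return any(abs(d) > tolerance for d in frequency_shift(first, second).values())


# Hypothetical frequencies at one site of the target sample.
first_info = {"T": 1.0}                # target sample sequenced alone
second_info = {"T": 0.95, "G": 0.05}   # same sample, read from the mixed run

shift = frequency_shift(first_info, second_info)
changed = differs(first_info, second_info)
```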
  • the "cross-contamination" of a sample means that sequence information of a nucleic acid fragment is tagged with an incorrect label, either because the label of one sample is attached to the sequence information of a nucleic acid fragment of another sample, or because labels are exchanged between the sequence information of nucleic acid fragments of different samples. Because of cross-contamination, a significant difference may be seen between the allele frequency analyzed from the first sequence information and the allele frequency analyzed from the second sequence information for a specific chromosomal site of the sample.
  • the method may include: selecting a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample in the obtained first sequence information, and selecting the positions excluding the mutation prediction site set as a control site set; calculating allele frequencies of genotype alleles and background alleles from each of the obtained first sequence information and second sequence information for the mutation prediction site set or the control site set; and comparing the calculated allele frequencies with respect to the mutation prediction site set or the control site set.
  • the variation may mean different characteristics of a plurality of samples appearing at specific sites of the chromosome.
  • the property may be a nucleic acid sequence or a nucleotide sequence.
  • the genotype allele of one sample obtained from the first sequence information may have a nucleic acid sequence or nucleotide sequence different from the genotype allele of another sample obtained from the first sequence information.
  • the mutation may be a single nucleotide polymorphism (SNP).
  • SNP refers to a single-nucleotide difference that appears between individuals of one species, and is a genetic change or variation showing a difference of a nucleotide (A, G, C, T) at a specific position in the nucleic acid sequence.
  • SNPs are genetic factors associated with disease, and subjects with different SNPs show different resistance to, sensitivity to, and severity of disease.
  • Each of the plurality of samples may have different or identical SNP sites from each other.
  • the variation may be a variation with respect to the reference genome.
  • the variation may include a variation of the nucleic acid sequence or the nucleotide sequence with respect to the reference genome.
  • Variation of the nucleic acid sequence or nucleotide sequence may comprise substitution, insertion, deletion, or translocation of one or more nucleotide sequences relative to a reference genome.
  • Substitution of one nucleotide may be, for example, a single nucleotide variation (SNV).
  • SNV refers to a single-nucleotide difference that appears in a small population within one sequence or species, and may be, for example, a difference from the nucleotide sequence of a reference genome appearing in sequencing data.
  • Each of the plurality of samples may have different or identical SNV sites from each other. Allele frequencies of a variation can be calculated by counting the number of alleles in next-generation sequencing data using existing programs such as SAMtools.
  • the method includes selecting a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selecting the positions other than the mutation prediction site set as the control site sets.
  • the "mutation prediction site” may mean a specific site of the chromosome having the above-described mutation.
  • such a site may be a mutation prediction site of the sample.
  • the SNP site may be included in the predictive site of the mutation of the target sample.
  • Each of the plurality of samples may have different or identical mutation prediction sites from each other. Referring to FIG. 3, among positions 1, 2, 3, 4, and 5, the mutation prediction sites of Sample 1 (S1) may be positions 2 to 4, and the mutation prediction sites of Sample 2 (S2) may be positions 2 to 5.
  • the "union variant set", that is, the mutation prediction site set, is a collection that combines the mutation prediction sites of each of the plurality of samples, that is, of the target sample and the additional sample, and may be the union of the mutation prediction sites of the plurality of samples. Referring to FIG. 3, among positions 1 to 5, the mutation prediction site set of Samples 1 and 2 may be positions 2 to 5.
  • the "control position set", that is, the control site set, means a set of sites at which no mutation is detected in any of the plurality of samples, because the background alleles of the plurality of samples obtained from the first sequence information are the same for a specific site of the chromosome.
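
The two site sets can be expressed with plain set operations. The sketch below uses the positions given for Samples 1 and 2 in the FIG. 3 description; the function and variable names are illustrative.

```python
def site_sets(variant_sites_by_sample, all_sites):
    """Return (mutation prediction site set, control site set).

    The first is the union of every sample's mutation prediction sites;
    the second is every remaining analyzed site.
    """
    union = set().union(*variant_sites_by_sample.values())
    control = set(all_sites) - union
    return union, control


# Positions 1 to 5: Sample 1 varies at 2 to 4, Sample 2 at 2 to 5.
variant_sites = {"S1": {2, 3, 4}, "S2": {2, 3, 4, 5}}
union_set, control_set = site_sets(variant_sites, range(1, 6))
```

Position 1 is the only site left in the control set, which agrees with the description of position 1 as a control site.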
  • the method may calculate the allele frequencies of the alleles, that is, of genotype alleles and/or background alleles, from each of the obtained first sequence information and second sequence information for the mutation prediction site set or the control site set.
  • from the allele frequencies calculated from the first sequence information and the second sequence information as described above, the allele frequencies of genotype alleles and/or background alleles for the mutation prediction site set or the control site set can be selected or derived.
  • the allele frequency of the target sample may be represented by the frequency of sequence information of one or more of A, G, C and T or all of A, G, C and T.
  • the method determines an allele obtained from the first sequence information as a background allele if the allele has an allele frequency of less than 10%, and as a genotype allele if the allele has an allele frequency of 10% or more.
  • the criterion for distinguishing the allele may be any criterion for genotyping.
  • the "background allele" may mean an allele having an allele frequency, obtained from sequence information, of less than 10%, 5% or less, 1% or less, 0.5% or less, or 0.1% or less.
  • the background allele can be understood as the meaning of the background allele used in the art.
  • the "genotype allele” may refer to an allele having an allele frequency of 10% or more obtained from sequence information.
  • the allele frequency of the genotype allele may be at least 10%, at least 30%, at least 50%, at least 90%, or 100%.
  • the genotype allele may be understood as meaning genotype alleles used in the art.
  • alleles can typically be A, G, C, or T; of these, bases having an allele frequency of 10% or more may be determined as genotype alleles, and bases having an allele frequency of 1% or less may be determined as background alleles. Referring to FIG. 3, the genotype allele at position 1 of Sample 1 is T, and the background alleles are A, G, and C. In addition, the genotype alleles at position 5 of Sample 1 are T and C, and the background alleles are A and G.
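
The 10% rule can be sketched as a simple threshold on per-base frequencies. The frequencies below are invented so that the result matches the description of position 5 of Sample 1; they are not values from the application.

```python
def classify_alleles(freqs, genotype_min=0.10):
    """Split bases into genotype alleles (frequency >= genotype_min) and background alleles."""
    genotype = {base for base, f in freqs.items() if f >= genotype_min}
    background = set(freqs) - genotype
    return genotype, background


# Made-up frequencies for a heterozygous T/C site.
position5_sample1 = {"A": 0.001, "G": 0.002, "C": 0.480, "T": 0.517}
genotype, background = classify_alleles(position5_sample1)
```

The threshold is a parameter because the text notes that any genotyping criterion may be used to distinguish the two kinds of allele.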
  • the method may include comparing the calculated allele frequencies with respect to the mutation prediction site set or the control site set.
  • allele frequencies can be compared in the first sequence information and the second sequence information. For example, for the same target sample and set of mutation prediction sites, the allele frequency of A in the second sequence information and the allele frequency of A in the first sequence information can be compared. Similarly, for the same target sample and the set of mutation prediction sites, allele frequencies of G, C and T in the second sequence information and allele frequencies of G, C and T in the first sequence information can be compared. As a result of the comparison, if there is a significant difference in the allele frequency of any one of A, G, C or T, it may be determined that the target sample is contaminated by another sample.
  • allele frequencies can be compared in the first sequence information and the second sequence information. For example, for the same target sample and control site set, the allele frequency of A in the second sequence information and the allele frequency of A in the first sequence information can be compared. Similarly, for the same target sample and the set of control sites, allele frequencies of G, C and T in the second sequence information and allele frequencies of G, C and T in the first sequence information can be compared. In this case, since the background alleles and genotype alleles of the plurality of samples obtained from the first sequence information are the same for the control site set, it may be determined that there is no cross contamination of the samples. Referring to FIG.
  • at position 1, the genotype allele of all samples is T and the background alleles of all samples are A, G, and C, so no mutation is detected.
  • accordingly, position 1 belongs to the control site set. At this position, it may be determined that the background allele of one sample is not interfered with by the genotype allele of another sample.
  • the method may include: with respect to the mutation prediction site set, selecting as a test group the alleles that, in the first sequence information, are background alleles of the target sample and genotype alleles of the additional sample; and, with respect to the mutation prediction site set and the control site set, selecting as a control group the alleles that, in the first sequence information, are background alleles of the target sample and background alleles of the additional sample.
  • the "control group" means the alleles that, in the first sequence information, are background alleles of the target sample and also background alleles of the additional sample, with respect to the mutation prediction site set and the control site set.
  • the method may include comparing allele frequencies of the control group obtained from the target sample in the first sequence information, and allele frequencies of the control group obtained from the target sample in the second sequence information.
  • for example, the alleles that are background alleles at position 1 of Sample 1 and, at the same time, background alleles of the additional samples are A, G, and C.
  • the allele frequencies of A, G, and C, the control group of Sample 1, in the first sequence information may be compared with the allele frequencies of A, G, and C, the control group of Sample 1, in the second sequence information, respectively.
  • the alleles that are background alleles at position 2 of Sample 1 and, at the same time, background alleles of the additional samples, Sample 2, Sample 3, and Sample 4, are G and C.
  • the allele frequencies of G and C, the control group of Sample 1, in the first sequence information may be compared with the allele frequencies of G and C, the control group of Sample 1, in the second sequence information, respectively.
  • the allele that is a background allele at position 3 of Sample 1 and, at the same time, a background allele of the additional samples, Sample 2, Sample 3, and Sample 4, is A.
  • the allele frequency of A, the control group of Sample 1, in the first sequence information may be compared with the allele frequency of A, the control group of Sample 1, in the second sequence information.
  • for the control group of the target sample, there may be little or no difference between the allele frequency in the first sequence information and the allele frequency in the second sequence information.
  • for the control group, it may be determined that there is no possibility of cross-contamination of the sample.
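
Selecting the control group at one site amounts to intersecting the background alleles of all samples. In the sketch below, the per-sample background sets are made up so that the result reproduces the position 2 example (G and C); they are not read from FIG. 3.

```python
def control_group(background_by_sample, target):
    """Alleles that are background in the target sample and in every additional sample."""
    group = set(background_by_sample[target])
    for sample, background in background_by_sample.items():
        if sample != target:
            group &= background
    return group


# Hypothetical background alleles at position 2 for four samples.
background_pos2 = {
    "S1": {"A", "G", "C"},
    "S2": {"G", "C", "T"},
    "S3": {"G", "C", "T"},
    "S4": {"G", "C", "T"},
}
controls_s1 = control_group(background_pos2, "S1")
```

Because no sample contributes these bases as a genotype allele, their frequency in the target sample should stay near the background level whether or not labels were swapped.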
  • the "test group" refers to the alleles that, in the first sequence information, are background alleles of the target sample and genotype alleles of the additional sample, with respect to the mutation prediction site set. Because cross-contamination of samples is considered possible for the test group at the corresponding specific chromosomal sites of the plurality of samples, the test group may be the object of the analysis of the degree of contamination.
  • the method may compare the allele frequency of the test group obtained from the target sample in the first sequence information, and the allele frequency of the test group obtained from the target sample in the second sequence information.
  • the method of analyzing the degree of contamination and the method of selecting a test group may vary depending on how and which samples are mixed, and contamination may occur differently by sample and by specific chromosomal site. If cross-contamination between samples occurs for a target sample, the allele frequency of a background allele in the mutation prediction site set of the target sample may be affected by genotype alleles of other samples.
  • the comparing step may analyze the number of alleles having a given allele frequency in the test group and/or the control group. The number of alleles may be compared for each allele frequency, or the ratio of the number of alleles having a given allele frequency to the total number of alleles may be compared for each group.
  • the allele that is a background allele at position 4 of Sample 1 and a genotype allele of the additional samples, Sample 2, Sample 3, and Sample 4, is T.
  • the allele frequency of T, the test group of Sample 1, in the first sequence information can be compared with the allele frequency of T, the test group of Sample 1, in the second sequence information.
  • the allele that is a background allele at position 4 of Sample 2 and a genotype allele of the additional samples, Sample 1, Sample 3, and Sample 4, is G.
  • the allele frequency of G, the test group of Sample 2, in the first sequence information can be compared with the allele frequency of G, the test group of Sample 2, in the second sequence information.
  • because the allele frequency of background allele G at position 4 of Sample 2 may be affected by genotype allele G of Sample 1, Sample 3, and Sample 4, the allele frequency of background allele G of Sample 2 may increase or otherwise change.
  • the allele that is a genotype allele of the additional samples, Sample 2, Sample 3, and Sample 4, is T.
  • the allele frequency of T, the test group of Sample 1, in the first sequence information can be compared with the allele frequency of T, the test group of Sample 1, in the second sequence information.
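
The test group at one site can be sketched in the same way. Two points here are assumptions: an allele is taken into the test group if it is a genotype allele in at least one additional sample, and the genotypes of Samples 3 and 4 are chosen as G/T so that both position 4 examples above hold. The frequencies at the end are illustrative numbers only.

```python
BASES = {"A", "G", "C", "T"}


def select_test_group(genotype_by_sample, target):
    """Alleles that are background in the target but genotype in some additional sample."""
    target_background = BASES - genotype_by_sample[target]
    genotype_elsewhere = set()
    for sample, genotype in genotype_by_sample.items():
        if sample != target:
            genotype_elsewhere |= genotype
    return target_background & genotype_elsewhere


# Hypothetical genotype alleles at position 4.
genotypes_pos4 = {"S1": {"G"}, "S2": {"T"}, "S3": {"G", "T"}, "S4": {"G", "T"}}
group_s1 = select_test_group(genotypes_pos4, "S1")
group_s2 = select_test_group(genotypes_pos4, "S2")

# Made-up frequencies of T in Sample 1: sequenced alone versus in the mixed run.
first_t, second_t = 0.001, 0.020
increase = second_t - first_t  # a rise suggests reads leaking in from other samples
```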
  • Another aspect provides a device 100 for analyzing the degree of cross-contamination of samples with respect to a target sample, the device including: a sequence information obtaining unit for obtaining first sequence information of a nucleic acid fragment from each of the target sample and an additional sample, and second sequence information of the nucleic acid fragment from a mixed sample of the target sample and the additional sample; an allele frequency calculating unit for calculating an allele frequency from each of the obtained first sequence information and second sequence information; and a calculation unit for comparing the calculated allele frequencies for a specific site of the chromosome.
  • the device may include a "... part" or "... module" that performs, in time series, the steps of the method of analyzing the degree of cross-contamination of the sample. Therefore, even if omitted below, the above description of the method for analyzing the degree of cross-contamination of a sample may be applied to the apparatus for analyzing the degree of cross-contamination of the sample.
  • the components may correspond to a processor.
  • a processor may be implemented as an array of multiple logic gates, or may be implemented as a combination of a general purpose microprocessor and a memory storing a program that may be executed on the microprocessor.
  • it will be understood by those skilled in the art that other types of hardware may be implemented.
  • the sequence information obtaining unit 110 obtains sequence information from a sequencing device.
  • The allele frequency calculating unit 120 calculates allele frequencies from the obtained first sequence information and second sequence information, respectively.
  • The calculation unit 130 compares the allele frequencies calculated from the first sequence information and the second sequence information at a specific site of the chromosome.
  • The calculation unit 130 may compare, for each allele frequency, the number of alleles having that allele frequency, or may compare the ratio of the number of alleles having that allele frequency to the total number of alleles.
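The calculation and comparison steps performed by these units can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the device: the function names `allele_frequencies` and `frequency_profile`, the bin width of 0.001, and all sample values are hypothetical and do not come from the source.

```python
from collections import Counter


def allele_frequencies(bases):
    """Map each allele observed at one site to its fraction of the read depth."""
    counts = Counter(bases)
    depth = sum(counts.values())
    return {allele: n / depth for allele, n in counts.items()} if depth else {}


def frequency_profile(freqs, bin_width=0.001):
    """For each allele-frequency bin, return (number of alleles, share of all alleles)."""
    # Frequencies are rounded to the nearest multiple of bin_width (an assumed binning rule).
    counts = Counter(round(f / bin_width) for f in freqs)
    total = len(freqs)
    return {b: (n, n / total) for b, n in sorted(counts.items())}


# Illustrative data only; none of these values are taken from the source.
site_reads = "A" * 990 + "G" * 10        # base calls covering one chromosome site
print(allele_frequencies(site_reads))

first = [0.0, 0.0, 0.001, 0.007]         # background-allele frequencies, target sample alone
second = [0.0, 0.001, 0.007, 0.007]      # the same alleles measured in the mixed sample
p1, p2 = frequency_profile(first), frequency_profile(second)
for b in sorted(set(p1) | set(p2)):
    # Print bin index, then (count, ratio) for the first and second sequence information.
    print(b, p1.get(b, (0, 0.0)), p2.get(b, (0, 0.0)))
```

Comparing either the raw counts or the ratios per bin corresponds to the two comparison options described for the calculation unit; which one is more informative depends on whether the two data sets contain the same total number of alleles.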
  • The apparatus may further include: a site selection unit that selects a mutation prediction site set by combining the mutation prediction sites obtained from the sequence information of each of the target sample and the additional sample, and selects the sites other than the mutation prediction site set as a control site set; and an allele frequency calculating unit configured to calculate the allele frequencies of genotype alleles and background alleles from the obtained first sequence information and second sequence information with respect to the mutation prediction site set or the control site set.
  • The site selection unit 140 selects the mutation prediction site set by combining the mutation prediction sites of each of a plurality of samples, and selects the control site set by combining the sites at which no mutation is detected in any of the plurality of samples.
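The site selection described above amounts to a set union followed by a set difference. The sketch below is a hedged illustration: `select_site_sets` is a hypothetical name, sites are represented as plain integers, and it assumes that every site at which a mutation is detected appears in that sample's mutation prediction sites.

```python
def select_site_sets(predicted_sites_per_sample, all_sites):
    """Return (mutation prediction site set, control site set).

    predicted_sites_per_sample: one set of predicted mutation sites per sample.
    all_sites: every site covered by the sequencing data.
    """
    # Union of the predicted sites across the target and additional samples.
    mutation_sites = set().union(*predicted_sites_per_sample)
    # Everything else: sites with no predicted mutation in any sample.
    control_sites = set(all_sites) - mutation_sites
    return mutation_sites, control_sites


# Illustrative positions only; not taken from the source.
per_sample = [{1, 4}, {4, 7}, set()]
mutation_sites, control_sites = select_site_sets(per_sample, range(1, 9))
print(sorted(mutation_sites))
print(sorted(control_sites))
```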
  • The device may include a group selection unit for selecting a test group and a control group based on the mutation prediction site set and the control site set.
  • The group selection unit 150 selects the test group and the control group.
  • The apparatus may include an allele frequency calculating unit for calculating the allele frequencies of genotype alleles and background alleles from the obtained first sequence information and second sequence information, respectively, with respect to the mutation prediction site set or the control site set.
  • The allele frequency calculating unit may calculate the allele frequency of alleles including genotype alleles and/or background alleles.
  • With respect to the mutation prediction site set, the group selection unit selects, as the test group, alleles that are background alleles of the target sample and genotype alleles of the additional sample. With respect to the mutation prediction site set and the control site set, it may select, as the control group, alleles that are background alleles of the target sample and background alleles of the additional sample in the first sequence information. If necessary, the test group and the control group may be selected simultaneously or sequentially.
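The test-group and control-group rule can be written out as a short sketch. It assumes a hypothetical data layout in which each sample is a dictionary mapping a site to its alleles and their labels. It also assumes that an allele qualifies for the test group when at least one additional sample carries it as a genotype allele, and for the control group only when every additional sample carries it as a background allele. The source does not state either quantifier, and the name `select_groups` is illustrative.

```python
GENOTYPE, BACKGROUND = "genotype", "background"


def select_groups(target, additional, mutation_sites, control_sites):
    """Return (test_group, control_group) as lists of (site, allele) pairs."""
    test_group, control_group = [], []
    for site in sorted(mutation_sites | control_sites):
        for allele, label in sorted(target.get(site, {}).items()):
            if label != BACKGROUND:
                continue  # only background alleles of the target sample are considered
            others = [sample.get(site, {}).get(allele) for sample in additional]
            if site in mutation_sites and GENOTYPE in others:
                # Background in the target, genotype in an additional sample.
                test_group.append((site, allele))
            elif others and all(o == BACKGROUND for o in others):
                # Background in the target and in every additional sample.
                control_group.append((site, allele))
    return test_group, control_group


# Illustrative labels only; not taken from the source.
target = {4: {"A": GENOTYPE, "G": BACKGROUND, "C": BACKGROUND},
          5: {"T": GENOTYPE, "C": BACKGROUND}}
additional = [
    {4: {"G": GENOTYPE, "C": BACKGROUND}, 5: {"T": GENOTYPE, "C": BACKGROUND}},
    {4: {"G": GENOTYPE, "C": BACKGROUND}, 5: {"T": GENOTYPE, "C": BACKGROUND}},
]
test_group, control_group = select_groups(target, additional, {4}, {5})
print(test_group)
print(control_group)
```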
  • The calculation unit may compare the allele frequency of the test group obtained from the target sample in the first sequence information with the allele frequency of the test group obtained from the target sample in the second sequence information, and may compare the allele frequency of the control group obtained from the target sample in the first sequence information with the allele frequency of the control group obtained from the target sample in the second sequence information.
  • The calculation unit may analyze the number of alleles having a given allele frequency in the test group and/or the control group. It may compare, for each allele frequency, the number of alleles having that allele frequency, or may compare, for each group, the ratio of the number of alleles having that allele frequency to the total number of alleles.
  • The apparatus may include an allele determining unit 160 that determines an allele obtained from the first sequence information to be a background allele when the allele has an allele frequency of less than 10%, and to be a genotype allele when the allele has an allele frequency of 10% or more.
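The 10% rule applied by the allele determining unit is a simple threshold. A minimal sketch follows, in which the function name `classify_allele` and the example frequencies are hypothetical; only the 10% cut-off comes from the text.

```python
def classify_allele(frequency, threshold=0.10):
    """Label an allele from the first sequence information by its allele frequency."""
    # Below the threshold: background allele; at or above it: genotype allele.
    return "genotype" if frequency >= threshold else "background"


# Illustrative frequencies at one site of a single sample (not from the source).
site = {"A": 0.520, "G": 0.475, "T": 0.005}
labels = {allele: classify_allele(f) for allele, f in site.items()}
print(labels)
```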
  • Another aspect provides a computer-readable recording medium having recorded thereon a program for executing the method of analyzing the degree of cross-contamination of a sample with respect to the target sample.
  • The method may be implemented in the form of software readable by various computer means and recorded on a computer-readable recording medium.
  • The recording medium may include program instructions, data files, data structures, and the like, alone or in combination.
  • The program instructions recorded on the recording medium may be those specially designed and constructed for the method described above, or may be those known and available to those skilled in the computer software arts.
  • Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as compact disc read-only memory (CD-ROM) and digital video discs (DVD); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as read-only memory (ROM), random access memory (RAM), and flash memory.
  • Program instructions may include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter.
  • Such a hardware device may be configured to operate as one or more software modules to perform the operations of the method described above, and vice versa.
  • Although the specification and drawings describe exemplary device configurations, the functional operations and subject matter implementations described herein may be embodied in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed herein and their structural equivalents, or in a combination of one or more of them. Implementations of the subject matter described herein may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program storage medium for controlling, or for execution by, an apparatus according to the method.
  • The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more thereof.
  • A computer program (also known as a program, software, software application, script, or code) that is installed on a device according to the method and executes the method may be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system.
  • A program may be stored in a single file dedicated to the program in question, in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code), or in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document).
  • The computer program may be deployed to run on a single computer or on multiple computers located at one site or distributed across multiple sites and interconnected by a communication network.
  • The contamination rate at the corresponding chromosome site can be accurately measured when samples are contaminated.
  • Whereas the effects of cross-contamination between samples have previously been ignored or estimated by comparison with known database values, the degree of contamination between samples can be measured using experimental results obtained on the same experimental platform. Therefore, reliability can be given to the variant detection results of individual samples.
  • FIG. 1 is a diagram for describing a method of selecting a mutation prediction site set.
  • FIG. 2 is a graph showing the ratio of the number of background alleles with allele frequencies of 0 to 0.01 in the test and control groups.
  • FIG. 3 is a view for explaining a method of selecting a control group and a test group between a plurality of samples.
  • FIG. 4 is a block diagram showing the configuration of an apparatus for analyzing the degree of cross contamination of a sample.
  • Agilent SureDesign was used to design custom RNA baits targeting approximately 0.5 Mb of the human genome.
  • The targeted region includes exons from 83 cancer-related genes and introns from five genes that are frequently rearranged in solid tumors.
  • The double-stranded DNA concentration was measured using a Qubit fluorometer (Life Technologies). The fragment size distribution was measured using a 2200 TapeStation instrument (Agilent Technologies). Each library was adjusted to a total of 750 ng of DNA for each hybridization selection reaction. SureSelect blocking oligonucleotides were used for hybridization selection.
  • Prior to capture hybridization, the libraries were labeled so that each of the plurality of samples could be distinguished. Based on DNA concentration and average fragment size, each library was normalized to the same concentration of 2 nM and pooled in equal volumes. After the library was denatured with 0.2 N NaOH, it was diluted to 20 pM. Cluster amplification of the denatured template was performed, and the flow cell was sequenced using the HiSeq 2500 v3 Sequencing-by-Synthesis kit (2 × 100 bp reads). Base calling was performed using RTA v1.12.4.2.
  • The reads obtained were aligned to the hg19 human reference genome using BWA v0.7.5a to obtain BAM files.
  • The number of background alleles having a given allele frequency differed between the single-sample analysis and the mixed-sample analysis.
  • The group having an allele frequency of 0.007 accounted for about 0.176% when the single sample was analyzed and about 0.427% when the eight mixed samples were analyzed.
  • The allele frequency of the background alleles changed even within the same HapMap cell line sample.
  • In the test group, the average allele frequency of the HapMap cell line sample was about 0.052% when the HapMap cell line sample was analyzed alone.
  • When the sample was analyzed in the mixed sample, the average allele frequency of the HapMap cell line sample was about 0.077%. Therefore, it can be seen that the test group of the HapMap cell line sample was contaminated by the other HapMap cell line samples by about 0.025% on average.
  • In the control group, the average allele frequency of the HapMap cell line sample was about 0.012% when the HapMap cell line sample was analyzed alone.
  • When the sample was analyzed in the mixed sample, the average allele frequency of the HapMap cell line sample was about 0.011%. Therefore, it was confirmed that the control group of the HapMap cell line sample was subject to no or minimal contamination by the other HapMap cell line samples.
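The contamination estimate reported above is the difference between the average allele frequency measured in the mixed sample and the one measured in the sample alone. The short sketch below reproduces that arithmetic with the reported percentages. It assumes that 0.077% and 0.011% are the mixed-sample values, which the text implies but does not state outright, and `contamination_shift` is a hypothetical helper name.

```python
def contamination_shift(single_pct, mixed_pct):
    """Change in average allele frequency (percentage points), mixed minus single."""
    return round(mixed_pct - single_pct, 3)


# Average allele frequencies (%) reported for the HapMap cell line sample.
test_shift = contamination_shift(single_pct=0.052, mixed_pct=0.077)
control_shift = contamination_shift(single_pct=0.012, mixed_pct=0.011)

print(f"test group shift:    {test_shift:+.3f} percentage points")
print(f"control group shift: {control_shift:+.3f} percentage points")
```

A clearly positive shift in the test group alongside a near-zero shift in the control group is the pattern the description interprets as contamination from the other samples rather than general background noise.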

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Organic Chemistry (AREA)
  • Molecular Biology (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Physiology (AREA)
  • Ecology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention relates to a method and an apparatus for analyzing the degree of cross-contamination of a sample with respect to a target sample, comprising the steps of: acquiring first sequence information of a nucleic acid fragment from a target sample and from an additional sample, and second sequence information of a nucleic acid fragment from a mixed sample of the target sample and the additional sample; calculating allele frequencies from the acquired first sequence information and second sequence information; and comparing the calculated allele frequencies at a specific chromosomal locus. By measuring the degree of cross-contamination between samples at a specific chromosomal locus, the method and apparatus according to the invention make it possible to ensure the reliability of variation detection results.
PCT/KR2016/009451 2016-05-25 2016-08-25 Method and apparatus for analyzing the degree of cross-contamination of a sample Ceased WO2017204414A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020160064067A KR101882866B1 (ko) 2016-05-25 2016-05-25 Method and apparatus for analyzing the degree of cross-contamination of a sample
KR10-2016-0064067 2016-05-25

Publications (1)

Publication Number Publication Date
WO2017204414A1 (fr) 2017-11-30

Family

ID=60411779

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2016/009451 Ceased WO2017204414A1 (fr) 2016-05-25 2016-08-25 Method and apparatus for analyzing the degree of cross-contamination of a sample

Country Status (2)

Country Link
KR (1) KR101882866B1 (fr)
WO (1) WO2017204414A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114730609A (zh) * 2019-11-21 2022-07-08 豪夫迈·罗氏有限公司 Systems and methods for contamination detection in next-generation sequencing samples
US20250293873A1 (en) * 2022-12-01 2025-09-18 The Broad Institute, Inc. Genomic cryptography

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4335928B1 (fr) * 2018-01-05 2025-10-29 BillionToOne, Inc. Quality control templates for ensuring the validity of sequencing-based assays
KR101913735B1 (ko) * 2018-05-03 2018-11-01 주식회사 셀레믹스 Internal control material for detecting cross-contamination between samples in next-generation sequencing
KR102192864B1 (ko) * 2019-03-29 2020-12-18 연세대학교 산학협력단 NGS sample verification method and device using the same
CN116705153B (zh) * 2022-09-16 2025-07-22 首都医科大学附属北京天坛医院 Method for determining SNP detection regions and method for correcting sequencing samples

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050048505A1 (en) * 2003-09-03 2005-03-03 Fredrick Joseph P. Methods to detect cross-contamination between samples contacted with a multi-array substrate
US20120046877A1 (en) * 2010-07-06 2012-02-23 Life Technologies Corporation Systems and methods to detect copy number variation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2891099A4 (fr) 2012-08-28 2016-04-20 Broad Inst Inc Detection of variants in sequencing data and benchmarking
US20170136085A1 (en) 2014-05-29 2017-05-18 Synta Pharmaceuticals Corp. Targeted therapeutics
CA2961179A1 (fr) 2014-09-14 2016-03-17 Washington University Personalized cancer vaccines and corresponding methods
US10537108B2 (en) 2015-02-08 2020-01-21 Argaman Technologies Ltd. Antimicrobial material comprising synergistic combinations of metal oxides

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CIBULSKIS ET AL.: "ContEst: Estimating Cross-contamination of Human Samples in Next-generation Sequencing Data", BIOINFORMATICS, vol. 27, no. 18, 2011, pages 2601 - 2602, XP055442350 *
JUN ET AL.: "Detecting and Estimating Contamination of Human DNA Samples in Sequencing and Array-based Genotype Data", THE AMERICAN JOURNAL OF HUMAN GENETICS, vol. 91, 2012, pages 839 - 848, XP055442346 *
KIM ET AL.: "Virmid: Accurate Detection of Somatic Mutations with Sample Impurity Inference", GENOME BIOLOGY, vol. 14, 2013, pages 1 - 17, XP021165712 *

Also Published As

Publication number Publication date
KR20170133079A (ko) 2017-12-05
KR101882866B1 (ko) 2018-08-24

Similar Documents

Publication Publication Date Title
Wang et al. CNVcaller: highly efficient and widely applicable software for detecting copy number variations in large populations
Turner et al. Genomic islands of speciation in Anopheles gambiae
Banovich et al. Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels
Tatsumoto et al. Direct estimation of de novo mutation rates in a chimpanzee parent-offspring trio by ultra-deep whole genome sequencing
WO2017204414A1 (fr) Method and apparatus for analyzing the degree of cross-contamination of a sample
Fujiki et al. Assessing the accuracy of variant detection in cost-effective gene panel testing by next-generation sequencing
DeRycke et al. Targeted sequencing of 36 known or putative colorectal cancer susceptibility genes
BR112015032031B1 (pt) Methods and processes for non-invasive assessment of genetic variations
US20250037796A1 (en) Methods for detecting absence of heterozygosity by low-pass genome sequencing
Corrales et al. High-throughput molecular diagnosis of von Willebrand disease by next generation sequencing methods
US12106825B2 (en) Computational modeling of loss of function based on allelic frequency
Keel et al. Genome‐wide copy number variation in the bovine genome detected using low coverage sequence of popular beef breeds
Pankratov et al. Prioritizing autoimmunity risk variants for functional analyses by fine-mapping mutations under natural selection
Trudsø et al. A comparative study of single nucleotide variant detection performance using three massively parallel sequencing methods
KR102347463B1 (ko) Method and apparatus for detecting false-positive variants in nucleic acid sequence analysis
KR102169699B1 (ko) Customized gene chip for genetic testing and method for producing the same
JP2023526441A (ja) Methods and systems for detection and phasing of complex genetic variants
WO2025167582A1 (fr) Method and apparatus for determining sample contamination, electronic device, and storage device
Viluma et al. Evaluation of whole-genome sequencing of four Chinese crested dogs for variant detection using the ion proton system
Billingsley et al. Genome-wide analysis of structural variants in Parkinson’s disease using short-read sequencing data
Fountain et al. Cross-species application of Illumina iScan microarrays for cost-effective, high-throughput SNP discovery
WO2019031867A1 (fr) Method for increasing analysis accuracy by removing primer sequences in amplicon-based next-generation sequencing
WO2019124629A1 (fr) Method for determining fetal fraction in a maternal sample
WO2016208827A1 (fr) Method and device for gene analysis
Gao et al. A systematic evaluation of hybridization-based mouse exome capture system

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16903262

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16903262

Country of ref document: EP

Kind code of ref document: A1