
WO2016068626A1 - Method for analyzing sample sequence information based on a single sample - Google Patents

Method for analyzing sample sequence information based on a single sample

Info

Publication number
WO2016068626A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
sequence information
purity
copy number
sample sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2015/011514
Other languages
English (en)
Korean (ko)
Inventor
이병철
박정선
윤태균
박동윤
이정호
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SK Telecom Co Ltd
Original Assignee
SK Telecom Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SK Telecom Co Ltd
Priority to KR1020157031735A (KR20160062748A)
Publication of WO2016068626A1


Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B99/00: Subject matter not provided for in other groups of this subclass

Definitions

  • the present invention relates to a method for analyzing sample sequence information based on a single sample, and more particularly to a method for analyzing sample sequence information, for example the purity and copy number of a sample, that can detect somatic mutations by measuring at least one parameter required when analyzing sequence information, including single nucleotide polymorphisms or copy number, for an experimental sample using only the experimental sample without a control sample.
  • Copy number variation (CNV) is a type of structural variation.
  • CNV refers to amplification or deletion of DNA fragments of 1 kb or more.
  • CNV is present at a very high frequency of more than 10 percent in the human population, and the average size of CNV in an individual's genome is 3.5 ⁇ 5 Mbp (0.1 percent).
  • Many studies have demonstrated that CNV is associated with complex diseases such as autism, schizophrenia, Alzheimer's disease and cancer.
  • FISH: fluorescence in situ hybridization
  • aCGH: array comparative genomic hybridization
  • NGS: Next Generation Sequencing
  • Non-Patent Document 1: Alkan C et al., Nature Genetics 41: 1061-1067; J.L. Hayes et al., Genomics, vol. 102, Issue 3, pp. 174-181, 2013; Chiang DY et al., Nature Methods 6: 677-681; Nicholas B. Larson and Brooke L. Fridley, Bioinformatics 29 (15): 1888-1889
  • One embodiment of the present invention provides a method for analyzing sample sequence information based on a single sample by measuring at least one parameter including the purity and average copy number of the experimental sample, which can improve the accuracy of somatic mutation discovery and can be useful for discovering somatic copy number variation even when no control sample is present.
  • Another example of the present invention provides a computer-readable method for analyzing sample sequence information based on a single sample, which measures at least one parameter including the purity and average copy number of the experimental sample in at least one target region.
  • Another example of the invention provides a computer program stored on a storage medium, or a computer-readable storage medium (or recording medium) containing computer-executable instructions, for implementing a computer-readable method of analyzing sample sequence information based on a single sample that measures at least one parameter including the purity and average copy number of an experimental sample in at least one target region.
  • Another example of the present invention provides a computer-readable storage medium (or recording medium) containing a computer program or computer-executable instructions for implementing a computer-readable method of analyzing sample sequence information based on a single sample that measures at least one parameter including the purity and average copy number of the test sample in at least one target region.
  • an embodiment of the present invention relates to a method for analyzing sample sequence information based on a single sample, comprising the following steps:
  • Calculating the B allele frequency (BAF), that is, the frequency ratio of the different allele;
  • Determining a purity and an average copy number of the sample by filtering a copy number candidate and a purity candidate of the sample using a filtering parameter.
  • the experimental sample sequence data generated by the genome decoder or sequencer is received and read-mapped to the standard reference sequence data for each chromosomal position.
  • the sample sequence data is segmented based on the frequency rate for the target region of the sample, the at least one segment is applied to the copy number model of the frequency rate with respect to sample purity to extract copy number candidates of the different allele and sample purity candidates, and the sample purity candidates and/or copy number candidates are filtered to estimate the purity and average copy number of the experimental sample.
  • By measuring at least one parameter including the purity and the average copy number of the experimental sample, the method for analyzing sample sequence information based on a single sample can improve the accuracy of somatic mutation discovery using the parameter, and can be useful for discovering somatic copy number variation even in the absence of a control sample.
  • the sample sequence information may be whole genome sequence information or sequence information of a selected target region, and may be obtained by performing next generation sequencing.
  • the copy number model of the frequency rate with respect to the sample purity may be a nm plot model.
  • the filtering parameter may be at least one of a ratio filter, a copy number filter, and a unit filter.
  • Another example of the present invention relates to a computer-readable method for analyzing sample sequence information based on a single sample, comprising performing a method for analyzing sample sequence information based on the single sample.
  • a further example of the invention is a computer program stored on a computer-readable storage medium, or computer-executable instructions, coupled to hardware to execute a method for analyzing sample sequence information based on the single sample.
  • a further example of the invention is a computer-readable storage medium (or recording medium) containing a computer program or computer-executable instructions for carrying out, in combination with hardware, the steps of the computer-readable method for analyzing sample sequence information based on the single sample.
  • according to any one of the means for solving the problem described above, measuring at least one parameter including the purity and the average copy number of the experimental sample not only improves the accuracy of somatic mutation discovery but is also useful for discovering somatic copy number variation even when no control sample is available.
  • FIG. 1 is a block diagram illustrating a single sample-based sample sequence information analysis system according to an embodiment of the present invention.
  • FIG. 2 is a block diagram for explaining an apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 3 is a block diagram illustrating a sample sequence information analysis method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 4 is a view for explaining a frequency rate calculation method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 5 is a diagram for explaining a division method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 6 is a diagram for describing a node definition method for candidate extraction performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 7 is a diagram for describing a filtering method performed in an apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 8 is a diagram for explaining an estimation method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 9 is a graph comparing simulated sample purity values with the values estimated by the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 10 is a flowchart illustrating a method for analyzing sample sequence information according to an embodiment of the present invention.
  • FIG. 11 is a diagram illustrating a computer-readable storage medium for executing a method for analyzing sample sequence information according to an embodiment of the present invention.
  • the sample sequence information analysis system 1 may include an apparatus 300 for analyzing sample sequence information and may further include a genome sequencer 100. Accordingly, the sequence information according to the present invention can be obtained by a conventional method: it may be analyzed through a sequencer or provided in a form stored in a storage medium.
  • the sample sequence information analysis system 1 of FIG. 1 is only one embodiment of the present invention, and the present invention is not to be interpreted as limited by FIG. 1.
  • each component of FIG. 1 is generally connected through a network 200.
  • the sequencer 100 and the apparatus 300 for analyzing sample sequence information may be connected through a network 200.
  • the sequencer 100 and the analysis device 300 of the sample sequence information may be directly connected.
  • since the control sample sequence information and the sample sequence information generated by the sequencer 100 only need to be received by the analysis apparatus 300 of the sample sequence information, the connection may be either direct or indirect.
  • the network 200 refers to a connection structure capable of exchanging information between nodes such as terminals and servers.
  • examples of such a network 200 include, but are not limited to, WCDMA, the Internet, a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and networks using ATM, 3G, 4G, LTE, and Wi-Fi.
  • the apparatus 300 is not limited to the ones shown in FIG. 1.
  • the sequencer 100 may amplify DNA sequences, photograph fluorescent labels, and the like by photographing means, and perform image processing to parallelize DNA genetic information.
  • the genome sequencer 100 may be applied to the field of determining genetic variation, DNA copy number and chromosomal rearrangement.
  • the genome sequencer 100 may read a single DNA several times.
  • the number of reads may be defined as a read count, and the read count may also be referred to as depth.
  • the apparatus 300 for analyzing sample sequence information can analyze, by calculation and estimation, parameters including the sample purity, the average copy number of the sample, the copy number of the allele for which the sample sequence information and the reference sequence information are the same, the copy number of the allele for which the sample sequence information and the reference sequence information differ (B allele), and the unit read count.
  • the aforementioned parameters can be used to derive the purity and absolute copy number of the experimental sample.
  • the apparatus 300 for analyzing sample sequence information may calculate a read count by read-mapping the sample sequence information to the reference sequence information.
  • the apparatus 300 for analyzing sample sequence information can calculate the frequency rate of the different allele based on the frequency of the allele for which the sample sequence information and the reference sequence information are the same and the frequency of the allele for which they differ.
  • the apparatus 300 for analyzing sample sequence information segments the sample sequence information based on the frequency rate for at least one target region of the sample sequence information, and can then extract copy number candidates of the different allele and sample purity candidates through the copy number model of the frequency rate with respect to sample purity.
  • the sample sequence information measuring device 300 may finally measure the parameter by selecting at least one of the candidates extracted using the at least one filter.
  • the apparatus 300 for analyzing sample sequence information may be implemented as a computer that can be connected to a server or a terminal at a remote location through the network 200.
  • the computer may include, for example, a notebook, a desktop, a laptop, and the like.
  • FIG. 2 is a block diagram illustrating the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 3 is a block diagram for explaining the sample sequence information analysis method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 4 is a diagram for explaining the method of calculating a frequency rate performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 5 is a diagram for explaining the division method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 6 is a diagram illustrating the node definition method for candidate extraction performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 7 is a diagram for explaining the filtering method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 8 is a diagram for explaining the estimation method performed in the apparatus for analyzing sample sequence information shown in FIG. 1.
  • FIG. 9 is a graph comparing simulated sample purity values with the values estimated by the apparatus for analyzing sample sequence information shown in FIG. 1.
  • the apparatus 300 for analyzing sample sequence information may include a mapping unit 310, a calculation unit 320, a division unit 330, an extraction unit 340, and an estimation unit 350.
  • the mapping unit 310 may read-map the sample sequence information, for example the sample sequence information obtained from the genome sequencer 100, to the reference sequence information for each chromosome position (S3100).
  • the sample sequence information may be data having a plurality of read counts by reading an experimental sample a plurality of times in the genome sequencer 100.
  • the test sample may be a cancer sample.
  • the number of read counts for each target region of the sample sequence information may be calculated while reading the sequence information of the 250 test samples and the control sample, respectively.
  • the read count may be calculated in at least one target region located in the sample sequence information.
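As a rough illustration of the read count per target region described above, the following sketch counts mapped reads that overlap each target region. The interval representation (half-open start and end coordinates on one chromosome) and the function name `read_counts` are assumptions made for this example; the text does not specify a data format.

```python
from typing import List, Tuple

# Hypothetical representation: a half-open interval [start, end) on one chromosome.
Interval = Tuple[int, int]


def read_counts(reads: List[Interval], targets: List[Interval]) -> List[int]:
    """Count, for each target region, the mapped reads that overlap it."""
    counts = []
    for t_start, t_end in targets:
        # Two half-open intervals overlap when each starts before the other ends.
        overlapping = sum(
            1 for r_start, r_end in reads if r_start < t_end and r_end > t_start
        )
        counts.append(overlapping)
    return counts


reads = [(0, 100), (50, 150), (120, 220), (300, 400)]
targets = [(0, 100), (100, 200), (250, 350)]
print(read_counts(reads, targets))  # [2, 2, 1]
```

A production pipeline would normally take these counts from an aligner's output instead of a plain list; the list form is used here only to keep the sketch self-contained.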
  • the calculation unit 320 may calculate the frequency ratio of the different allele (BAF: B Allele Frequency) based on the alleles for which the sample sequence information and the reference sequence information are the same and the alleles for which they differ (S3200).
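A minimal sketch of a measured BAF at one position, assuming the common definition in which BAF is the fraction of reads carrying the allele that differs from the reference. The function name and the per-position read counts are hypothetical and are not taken from the patent text.

```python
def b_allele_frequency(same_reads: int, different_reads: int) -> float:
    """Fraction of reads carrying the allele that differs from the reference.

    same_reads: reads matching the reference allele at this position.
    different_reads: reads carrying the different (B) allele.
    """
    total = same_reads + different_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return different_reads / total


print(b_allele_frequency(30, 70))
```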
  • the frequencies of the same allele (Fa) and the different allele (Fb) may be defined as in Equations 1 and 2, respectively.
  • BAF may be defined as in Equation 3 below.
  • the BAF is a measured value obtained from read mapping of the sequence information of an experimental sample.
  • n and m may each be 0 or a natural number.
  • the frequency rate of different alleles can be calculated based on the number of copies of the same allele, the number of copies of the different alleles, and the purity of the experimental sample.
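Equations 1 to 3 are not reproduced in this text, so the following is only a sketch under a stated assumption: a tumor fraction α carrying n copies of the same allele and m copies of the different allele, mixed with normal cells that carry one copy of each. Under that assumed mixture model, Fa = α·n + (1 − α), Fb = α·m + (1 − α), and BAF = Fb / (Fa + Fb). The function names are hypothetical.

```python
from typing import Tuple


def allele_frequencies(n: int, m: int, alpha: float) -> Tuple[float, float]:
    """Assumed mixture model: tumor cells (fraction alpha) carry n and m copies,
    normal cells (fraction 1 - alpha) carry one copy of each allele."""
    fa = alpha * n + (1.0 - alpha)
    fb = alpha * m + (1.0 - alpha)
    return fa, fb


def expected_baf(n: int, m: int, alpha: float) -> float:
    """Expected B allele frequency under the assumed mixture model."""
    fa, fb = allele_frequencies(n, m, alpha)
    if fa + fb == 0:
        raise ValueError("BAF is undefined when both alleles are absent")
    return fb / (fa + fb)


# With n = 0, m = 1 and purity 4/7 (about 0.57) the model gives a BAF of about 0.7.
print(expected_baf(0, 1, 4 / 7))
```

This assumed model is consistent with the purity of 0.57 quoted in the worked example for n = 0, m = 1 and a segment BAF of 0.7, but it should not be read as the patent's exact equations.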
  • the dividing unit 330 may segment the sample sequence information based on the frequency ratio of the at least one target region of the sample sequence information (S3300).
  • (a) is a BAF graph of the control sample which is a control of the test sample
  • (b) is a BAF graph of the test sample.
  • the divider 330 may divide the BAF graph for at least one target region, as shown in (c), by using circular binary segmentation (CBS) or another segmentation method.
  • the extractor 340 may apply the divided at least one segment to a copy number model having a frequency ratio for sample purity to extract copy numbers and sample purity candidates of different alleles.
  • the copy number model may be a nm plot model. That is, the extractor 340 may define each node (node1, node2, ..., node6) by applying the segment defined in the divider 330 to the nm plot model.
  • since the node includes the values (n, m, α, Fa, Fb), when the candidate node is selected, the copy number and sample purity candidates of different alleles can also be extracted.
  • Equation 3 can be defined as Equation 4 below.
  • an α candidate can be derived, which is defined as a node candidate or a sample purity candidate.
  • For example, assuming that n is 0, m is 1, and the BAF of the segment is 0.7, substituting each parameter into Equation 4 gives an α of 0.57, an Fa of 0.3, and an Fb of 1.0, so the node becomes (0, 1, 0.57, 0.3, 1.0).
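Equation 4 itself is not reproduced in this text. As a hedged sketch, the purity can be solved from an assumed mixture model in which BAF = (α·m + 1 − α) / (α·(n + m) + 2·(1 − α)), which rearranges to α = (1 − 2·BAF) / (BAF·(n + m − 2) − m + 1). The model and the function name `purity_from_baf` are assumptions; the check below only confirms that the assumed form reproduces the α of 0.57 quoted for n = 0, m = 1, BAF = 0.7.

```python
from typing import Optional


def purity_from_baf(n: int, m: int, baf: float) -> Optional[float]:
    """Solve the assumed mixture model for the purity alpha.

    Returns None when the model cannot determine alpha, for example when
    n == m == 1 and the BAF does not depend on purity.
    """
    denominator = baf * (n + m - 2) - m + 1
    if abs(denominator) < 1e-12:
        return None
    return (1.0 - 2.0 * baf) / denominator


alpha = purity_from_baf(0, 1, 0.7)
print(round(alpha, 2))  # 0.57
```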
  • the estimator 350 may estimate the purity and the average copy number of the experimental sample by using at least one filter (S3500 and S3600).
  • the at least one filter may include at least one of a ratio filter, a copy number filter, and a unit filter.
  • the purity and the average copy number of the test sample can be estimated by setting the sample purity extracted through at least one filter of the sample purity candidates as the sample purity of the test sample.
  • the ratio filter may be a filter that checks whether the Target Region Ratio (TRR), a ratio based on the read count in at least one target region relative to a target region having a predetermined read count, is matched, and may be defined as in Equation 5 below.
  • the copy number filter may be a filter capable of filtering whether the average copy number of the test sample is the same, and may be defined as in Equation 6 below.
  • the estimator 350 may filter all of the candidates extracted by the extraction unit 340 using only the copy number filter, leaving only the candidates having the same average copy number J in Equation 6.
  • the unit filter may be a filter for filtering whether the read counts of the unit areas are the same among at least one target area, and may be defined as in Equation 7 below.
  • d denotes a unit read count and may be the read count of a unit region in which at least one target region has a copy number of 1. That is, using the unit filter, the estimator 350 may filter all of the candidates extracted by the extractor 340, leaving only the candidates having the same unit read count d in Equation 7.
  • the candidates extracted by the extractor 340 are defined as nodes 1 to 6 (node1, ..., node6), and the estimator 350 may remove candidates, that is, nodes, in the process of applying at least one filter simultaneously or sequentially.
  • the three filtering processes illustrated in FIG. 7 do not mean that three filters are sequentially used since at least one filter may be used at the same time.
  • the remaining nodes may be identified through filtering. That is, when nodes 3 and 5 are finally selected, it can be seen that the segments correspond to a test sample purity of 0.7. Therefore, when performing the sample sequence information analysis method according to an embodiment of the present invention, parameters including the purity of the test sample, the number of copies of the same allele, the number of copies of the different alleles, the average number of copies of the experimental sample, and the unit read count can be calculated. By using these, an absolute copy number in at least one target region may be calculated.
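The idea that filtering leaves only nodes agreeing on one purity can be sketched as follows. This is a simplified stand-in and not the ratio, copy number, or unit filters of Equations 5 to 7, which are not reproduced in this text: it keeps purity candidates that are supported by every segment within a tolerance. The function name, the tolerance, and the sample values are hypothetical.

```python
from typing import List


def consensus_purity(candidates_per_segment: List[List[float]], tol: float = 0.02) -> List[float]:
    """Keep purity candidates of the first segment that every other segment
    also supports within the tolerance (a stand-in for the patent's filters)."""
    if not candidates_per_segment:
        return []
    first, *rest = candidates_per_segment
    kept = []
    for alpha in first:
        # A candidate survives only if each remaining segment has a nearby candidate.
        if all(any(abs(alpha - other) <= tol for other in segment) for segment in rest):
            kept.append(alpha)
    return kept


# Three segments, each with several purity candidates; only 0.7 is shared by all.
print(consensus_purity([[0.57, 0.7], [0.4, 0.7], [0.7, 0.9]]))  # [0.7]
```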
  • (a) is a purity graph according to a single sample-based sample sequence information analysis method according to an embodiment of the present invention
  • (b) shows a purity graph according to the prior art.
  • the X-axis is the purity measurement value of the test sample
  • the y-axis means the purity estimation value of the test sample.
  • (a) shows better accuracy than (b) across various copy number deletion or duplication situations. That is, an experimental sample such as a cancer sample may be mixed with normal cells, which makes copy number mutations difficult to detect, and because there has been no method of predicting the parameters that are the basic factors, accurate analysis has been difficult. According to the present invention, the detection accuracy of somatic mutations can be improved.
  • Another example of the present invention is a method of analyzing sample sequence information based on a single sample, and more particularly, from the sample sequence information based on a single sample.
  • the method for analyzing sample sequence information based on a single sample according to the present invention may include the following steps:
  • the sample sequence information and the reference sequence information can be obtained by a conventional sequencing method. For example, large-scale parallel sequencing such as next-generation sequencing may be performed on a test sample, or previously obtained sequence information may be provided in a form stored in a data storage medium or obtained through a network data transmission/reception apparatus. In one embodiment of the invention, it may be received using the genome sequencer 100 shown in the sequence information analysis system 1 of FIG. 1; however, the sample sequence information analysis system 1 of FIG. 1 is only one embodiment, and the present invention is not to be interpreted as limited by FIG. 1.
  • the sample sequence information refers to sequence information of a sample to be analyzed
  • the reference sequence information refers to a reference genome sequence, which is a genome sequence representing one species.
  • the human reference genome may currently be constructed based on published reference genome sequences (e.g., from UCSC, NCBI, etc.) such as build 37 (GRCh37), hg18, hg19, and hg38.
  • the sample sequence information may be whole genome sequence information or sequence information of a selected target region.
  • a target region and a target nucleotide sequence refer to a selection region (target region) and a nucleotide sequence (target base sequence) of the region to be analyzed in the genome or chromosome, respectively.
  • the target region and target base sequence may be present in one or more for one sample.
  • the target region is an arbitrary region to be analyzed in whole genome sequencing; in targeted sequencing, it may mean a region for which probes are designed and selected for sequencing at library preparation.
  • the sample sequence information or reference sequence information may be obtained by, for example, a large-scale parallel sequencing method in the next generation sequencing method, and sequence information, read depth, or read count number may be obtained using the next generation sequencing method.
  • the sequence information of the target region may be used as the sample sequencing information by performing the next generation sequencing method by selecting the entire genome sequence information or a specific selection region, that is, a target region.
  • the targeted sequencing method using the NGS method is, for example
  • the polynucleotide fragment may be a read used for next-generation sequencing, the polynucleotide fragment number may be a read count or a read depth, and the average polynucleotide fragment number may be an average read count.
  • sequencing means that a single genome is innumerable polynucleotides
  • the next-generation sequencing method is, for example, the 454 platform (Margulies, et al., Nature (2005) 437: 376-380), Illumina Genome Analyzer (or Solexa™ platform), Illumina HiSeq2000, HiSeq2500, MiSeq, NextSeq500, Life Tech Ion PGM, Ion Proton, Ion S5, Ion S5XL, SOLiD (Applied Biosystems), Helicos True Single Molecule DNA Sequencing Technology (Harris, et al., Science (2008) 320: 106-109), single molecule real-time (SMRT™) technology from Pacific Biosciences, or the like.
  • large-scale parallel sequencing that is possible on nanopore sequencing (Soni and Meller, Clin Chem (2007) 53: 1996-2001) allows the analysis of many nucleic acid molecules isolated from a sample.
  • Sequencing is possible with high-order multiplexing in a parallel fashion (Dear, Brief Funct Genomic Proteomic (2003) 1: 397-416). Each of these platforms sequences clonally expanded or non-amplified single molecules of nucleic acid fragments. Sequence information of polynucleotide fragments can be obtained using commercially available sequencing instruments. In addition, the sequencing may be performed by various other known sequencing methods and/or modifications thereof.
  • the mapping may map the sample sequence information, for example the sample sequence information obtained from the genome sequencer 100, to the reference sequence information for each position on a chromosome (S3100), for example in the mapping unit 310 of the analysis apparatus 300 of the sample sequence information of FIG. 2.
  • the sample sequence information may be data having a plurality of read counts by reading a plurality of test samples from the genome sequencer 100.
  • the test sample may be a cancer sample.
  • the number of read counts of sample sequence information may be calculated while reading sequence information of 250 test samples and a control sample, respectively. At this time, the read count can be calculated in at least one region in the sample sequence information.
  • step (2) calculates the frequency ratio of the different allele (B Allele Frequency, BAF) based on the frequency of the allele for which the sample sequence information and the reference sequence information are the same and the frequency of the allele for which they differ.
  • the calculating step may be performed by the calculation unit 320 of FIG. 2 to calculate the B Allele Frequency (BAF) of different alleles (S3200). The frequency rates of different alleles are measured values obtained using the sequence information of experimental samples.
  • when a copy number deletion, duplication, translocation, inversion, or the like occurs in a normal cell, the sample may be a cancer cell sample that has been transformed into cancer cells. If the number of copies of the allele that is the same between the cancer cell sample sequence information and the reference sequence information is n, the number of copies of the allele that differs between the sample sequence information and the reference sequence information is m, and the purity of the sample is α, the frequencies of the same allele (A) and the different allele (B) may be defined as in Equations 1 and 2, respectively.
  • In Equations 1 and 2, n is the number of copies of the same allele, m is the number of copies of the different allele, and m and n are each 0 or a natural number,
  • α is the purity of the sample,
  • Fa is the frequency of the same allele (A), and Fb is the frequency of the different allele (B).
  • the purity of a sample (tumor purity or tumor cellularity) can be expressed as the proportion of tumor cells among the total cells in the sample when the sample to be analyzed contains both tumor cells and normal cells.
  • for a biopsy of a cancer sample, it means the ratio of cancer-derived cells only, excluding the normal cells (stromal cells, white blood cells, etc.) contained in the sample.
  • BAF can be defined as shown in Equation 3 below.
  • the frequency of different alleles in the sample (BAF) is the frequency of alleles.
  • In Equation 3, n, m, α, Fa and Fb are as defined in Equations 1 and 2.
  • the frequency rate of different alleles can be calculated based on the number of copies of the same allele, the number of copies of the different alleles, and the purity of the experimental sample.
  • step (3) may segment the sample sequence information based on BAFs of different alleles of the sample sequence information.
  • the division of the sequence information finds and divides segments, that is, regions in which the average frequency of different alleles differs from each other, for example by taking a random region and t-testing the mean.
  • the division of the sequence information may be performed by various methods, and the division method includes, for example, a circular binary segmentation (CBS) method, but is not limited thereto.
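The following is a simplified sketch of mean-based segmentation, not an implementation of CBS: it recursively splits a run of BAF values at the point where a Welch t statistic between the two sides is largest, and stops when that statistic falls below a threshold. The function names, the minimum segment size, and the threshold of 5.0 are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Tuple


def welch_t(left: List[float], right: List[float]) -> float:
    """Absolute Welch t statistic for the difference between two means."""
    se = sqrt(variance(left) / len(left) + variance(right) / len(right))
    diff = abs(mean(left) - mean(right))
    if se == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / se


def segment(values: List[float], start: int = 0, min_size: int = 3,
            threshold: float = 5.0) -> List[Tuple[int, int]]:
    """Return half-open index ranges whose mean BAF differs from their neighbours."""
    best_t, best_k = 0.0, None
    for k in range(min_size, len(values) - min_size + 1):
        t = welch_t(values[:k], values[k:])
        if t > best_t:
            best_t, best_k = t, k
    if best_k is None or best_t < threshold:
        return [(start, start + len(values))]
    return (segment(values[:best_k], start, min_size, threshold)
            + segment(values[best_k:], start + best_k, min_size, threshold))


bafs = [0.5, 0.52, 0.48, 0.51, 0.49, 0.7, 0.72, 0.68, 0.71, 0.69]
print(segment(bafs))  # [(0, 5), (5, 10)]
```

CBS proper additionally considers splits that cut out an interior arc of the data and assesses significance by permutation, so a dedicated implementation is preferable for real data.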
  • the segment refers to a sequence information group having the same average of allele BAFs among the sequence information of the sample, and refers to the black bar portion shown in FIG. 5 (c).
  • Step (4) applies the at least one segment to a copy number model of the frequency ratio with respect to sample purity, whereby copy number and sample purity candidates of the sample can be extracted.
  • By applying the at least one segment to the copy number model of the frequency ratio with respect to sample purity, at least one node value can be obtained, and each node contains a value of (n, m, α, Fa, Fb). Therefore, sample purity candidates and the copy numbers of different alleles can be obtained from the node values.
  • The extracting unit 340 of FIGS. 2 and 3 may extract copy number candidates and sample purity candidates of different alleles (S3400).
  • the copy number model of the frequency rate with respect to the sample purity may be a nm plot model.
  • Since each node (node1, node2, ..., node6) includes a value of (n, m, α, Fa, Fb), selecting a candidate node makes it possible to extract the copy number and sample purity candidates of different alleles.
  • The values of n, m, α, Fa and Fb are as defined in Equations 1 and 2 above.
  • Equation 3 may be converted as shown in Equation 4 below.
  • An α candidate can thus be derived, which is defined as a node candidate or a sample purity candidate. Candidate values of the copy number (m, n) can also be obtained from the sample purity candidate values.
  • For example, suppose that n is 0, m is 1, and the BAF of the segment is 0.7.
  • When each parameter is substituted into Equation 4, α is 0.57, where n, m, and α are as defined above.
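The body of Equation 4 is not reproduced in this text. As a hedged sketch, assuming the forms Fa = αn + (1 − α), Fb = αm + (1 − α), and BAF = Fb / (Fa + Fb), solving for the purity gives α = (1 − 2·BAF) / (BAF·(n + m − 2) − (m − 1)), which reproduces the worked example above. The `max_copies` bound and the function names are hypothetical.

```python
def purity_from_baf(baf, n, m):
    """Solve the assumed BAF model for alpha; return None if no valid purity exists."""
    if n == m:
        # Balanced copy numbers give BAF = 0.5 at every purity, so alpha
        # cannot be identified from the BAF alone.
        return None
    denominator = baf * (n + m - 2) - (m - 1)
    if denominator == 0:
        return None
    alpha = (1.0 - 2.0 * baf) / denominator
    return alpha if 0.0 < alpha <= 1.0 else None


def candidate_nodes(baf, max_copies=3):
    """Enumerate (n, m, alpha) candidates compatible with one segment's mean BAF."""
    nodes = []
    for n in range(max_copies + 1):
        for m in range(max_copies + 1):
            alpha = purity_from_baf(baf, n, m)
            if alpha is not None:
                nodes.append((n, m, round(alpha, 2)))
    return nodes


# Worked example from the text: n = 0, m = 1, segment BAF = 0.7.
print(round(purity_from_baf(0.7, 0, 1), 2))
# All copy-number pairs up to max_copies that admit a purity in (0, 1].
print(candidate_nodes(0.7))
```

Each segment typically admits several (n, m, α) combinations, which is why a later filtering step is needed to decide among the candidates.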
  • In step (5), among the sample purity and copy number candidates extracted in step (4), the sample purity and copy number that pass at least one filter can be taken as the estimated sample purity and copy number of the experimental sample, respectively.
  • Step (5) may estimate the purity and the average copy number of the experimental sample using at least one filter in the estimator 350 of FIGS. 2 and 3 (S3500, S3600).
  • The at least one filter may include at least one filter selected from the group consisting of a ratio filter, a copy number filter, and a unit filter; preferably, all of the ratio filter, the copy number filter, and the unit filter are used for filtering.
  • The ratio filter may be a filter that tests whether the ratio based on the read count in at least one target region matches the target region ratio (TRR) having a predetermined number of read counts, and can be defined as in Equation 5.
  • That is, using the ratio filter, the estimating step may retain, among the sample purity candidates obtained in the extraction step, only the candidates having the same ratio (r) in Equation 5.
  • TRR is a measured value obtained from read mapping of the sequence information of the experimental sample.
  • the copy number filter may filter whether the average copy number of the test sample is the same and may be defined as in Equation 6 below.
  • That is, using the copy number filter, the estimating step may retain, among the sample candidates obtained in the extraction step, only the candidates having the same average copy number (J) in Equation 6.
  • the unit filter may be a filter for filtering whether the read count of the unit region is the same among at least one target region, and may be defined as in Equation 7 below.
  • Here, d denotes the unit read count, which may be the read count of a unit region, among the at least one target region, whose copy number is one. That is, using the unit filter, the estimating step may filter the sample candidates obtained in the extraction step, leaving only the candidates having the same unit read count (d) in Equation 7.
  • In the filtering process, the candidates extracted in the extraction step are defined as nodes 1 to 6 (node1, ..., node6), and, as the at least one filter is applied simultaneously or sequentially, sample candidates obtained in the extraction step may be removed, i.e., nodes may be removed.
  • three filters may be used.
  • The nodes remaining after filtering may then be identified. That is, when nodes 3 and 5 are finally selected, this case corresponds to a test sample purity of 0.7, at which the segments are seen to coincide.
  • Based on the information of the last remaining nodes, the sample purity, Fa, Fb, the number of copies of the same allele n, and the number of copies of different alleles m can be found.
  • In addition, parameters including the average copy number J of the sample and the unit read count d can be calculated.
  • the average copy number may be calculated from Equation 6 based on the purity of the test sample and the TRR and the absolute copy number calculated from the sample sequence information.
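The node-elimination idea can be illustrated with a deliberately simplified consistency check. This sketch does not implement Equations 5 to 7, whose bodies are not reproduced in this text; it only keeps nodes whose purity is supported by every segment, on the assumption that all segments of one sample share a single purity. The tolerance `tol`, the function name, and the candidate values are hypothetical.

```python
def filter_nodes(candidates_per_segment, tol=0.02):
    """Keep only nodes whose purity is matched, within tol, in every segment.

    candidates_per_segment: one list per segment, each holding (n, m, alpha) nodes.
    Returns the surviving nodes in the same per-segment layout.
    """
    def supported(alpha):
        # A purity value survives only if each segment offers a node close to it.
        return all(
            any(abs(alpha - other) <= tol for _, _, other in nodes)
            for nodes in candidates_per_segment
        )

    return [
        [node for node in nodes if supported(node[2])]
        for nodes in candidates_per_segment
    ]


# Illustrative candidates for two segments; the numbers are made up for the example.
segment_a = [(0, 1, 0.57), (1, 2, 0.70)]
segment_b = [(0, 2, 0.70), (1, 3, 0.40)]
print(filter_nodes([segment_a, segment_b]))
```

In this toy input only the nodes at purity 0.70 survive, mirroring the way the remaining nodes identify a single purity at which the segments coincide.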
  • FIG. 10 is a flowchart illustrating a method of analyzing sample sequence information according to an embodiment of the present invention.
  • The apparatus for analyzing sample sequence information receives sample sequence information generated by a genome sequencer and read-maps it to the reference sequence information for each chromosome position (S1100).
  • The apparatus calculates the frequency ratio of different alleles (BAF) based on the frequency of the allele for which the sample sequence information and the reference sequence information are the same (A allele) and the frequency of the allele for which they differ (B allele).
  • The apparatus segments the sample sequence information based on the BAF (S1300).
  • The apparatus applies the at least one resulting segment to a copy number model of the frequency ratio with respect to sample purity to extract copy number and sample purity candidates of different alleles (S1400).
  • The apparatus estimates the purity and average copy number of the experimental sample using at least one filter (S1500).
  • Another example of the present invention provides a computer-readable method for sample sequence analysis comprising the step of analyzing sample sequence information.
  • The methods and information described herein may be provided as a computer program, stored in a computer-readable storage medium, for executing the steps described above.
  • the computer program stored in the computer readable storage medium may be combined with hardware.
  • The computer program stored on the computer-readable storage medium is a program that causes a computer to execute the steps described above, and all of the above steps may be executed by one program or by two or more programs each executing one or more steps.
  • The programs or software stored on the computer-readable storage medium may be delivered to a computer device through any known delivery method, including, for example, over a communication channel such as a telephone line, the Internet, or a wireless connection, or through a portable medium such as a computer-readable disk or a flash drive.
  • Another example also provides a computer readable storage medium (or recording medium) containing computer executable instructions for executing the steps of the method.
  • the computer readable medium may include both computer storage media and communication media.
  • Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • For example, the computer storage medium may be one or more selected from RAM, ROM, EEPROM, flash memory (e.g., USB memory, SD memory, SSD, CF memory, xD memory, etc.), magnetic disks, laser disks, or other memory, CD-ROM, DVD (digital versatile disk) or other optical disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that is accessible by a computer, but is not limited thereto.
  • Communication media typically includes computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or other transmission mechanism, and includes any information delivery media.
  • The communication medium may be at least one selected from wired media, such as a wired network or a direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of one or more of the above may also be included within the scope of computer-readable media.
  • An example of a computer-readable medium according to one embodiment of the present invention is shown in FIG. 1, for example as one component of computer system 500.
  • the computer system can include one or more processors 510, one or more computer readable storage media 530, and a memory 520.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Organic Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for analyzing sample sequence information based on a single sample and, more specifically, to a method for analyzing sample sequence information, for example the purity or copy number of a sample sequence, that can detect somatic variation by measuring at least one parameter required when analyzing sequence information including single nucleotide polymorphisms or copy number for an experimental sample, using only the experimental sample without a control sample.
PCT/KR2015/011514 2014-10-29 2015-10-29 Method for analyzing sample sequence information based on a single sample Ceased WO2016068626A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020157031735A KR20160062748A (ko) 2014-10-29 2015-10-29 Method for analyzing sample sequence information based on a single sample

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR20140148412 2014-10-29
KR10-2014-0148412 2014-10-29

Publications (1)

Publication Number Publication Date
WO2016068626A1 true WO2016068626A1 (fr) 2016-05-06

Family

ID=55857851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2015/011514 Ceased WO2016068626A1 (fr) 2014-10-29 2015-10-29 Method for analyzing sample sequence information based on a single sample

Country Status (2)

Country Link
KR (1) KR20160062748A (fr)
WO (1) WO2016068626A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120046877A1 (en) * 2010-07-06 2012-02-23 Life Technologies Corporation Systems and methods to detect copy number variation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LEE ET AL.: "Poster-H30)Development of model-based tumor content estimator for accurate variant calling in next-generation sequencing (NGS) dataset", 22ND ANNUAL INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, July 2014 (2014-07-01) *
LIU ET AL.: "Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges", ONCOTARGET, vol. 4, no. 11, 2013, pages 1868 - 1881 *
WANG ET AL.: "Copy number variation detection using next generation sequencing read counts", BMC BIOINFORMATICS, vol. 15, no. 109, 14 April 2014 (2014-04-14), pages 1 - 14 *
YOON ET AL.: "Sensitive and accurate detection of copy number variants using read depth of coverage", GENOME RESEARCH, vol. 19, pages 1586 - 1592, XP055167321, DOI: doi:10.1101/gr.092981.109 *
ZHAO ET AL.: "Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives", BMC BIOINFORMATICS, vol. 14, no. 11, 2013, pages 1 - 16 *

Also Published As

Publication number Publication date
KR20160062748A (ko) 2016-06-02

Similar Documents

Publication Publication Date Title
Kumar et al. Next-generation sequencing and emerging technologies
US11043283B1 (en) Systems and methods for automating RNA expression calls in a cancer prediction pipeline
Yu et al. Statistical and bioinformatics analysis of data from bulk and single-cell RNA sequencing experiments
US20210265012A1 (en) Systems and methods for use of known alleles in read mapping
US10741270B2 (en) Size-based analysis of cell-free tumor DNA for classifying level of cancer
US9228233B2 (en) Analysis methods
US20150211054A1 (en) Haplotype resolved genome sequencing
JP2017500004A (ja) Methods and systems for genotyping genetic samples
US20160002717A1 (en) Determining mutation burden in circulating cell-free nucleic acid and associated risk of disease
US20210332354A1 (en) Systems and methods for identifying differential accessibility of gene regulatory elements at single cell resolution
EP3405573A1 (fr) Methods and systems for high-fidelity sequencing
EP3841583A1 (fr) Sensitive detection of copy number variations (CNVs) from circulating cell-free nucleic acid
JP6373827B2 (ja) Systems and methods for generating and using optimized nucleotide flow orderings
Kumari et al. Advances in long-read single-cell transcriptomics
US20190139628A1 (en) Machine learning techniques for analysis of structural variants
KR101839088B1 (ko) Method for analyzing absolute copy number variation based on a single sample
Du et al. Benchmarking spatial transcriptomics technologies with the multi-sample SpatialBenchVisium dataset
US20160132637A1 (en) Noise model to detect copy number alterations
Ma et al. The analysis of ChIP-Seq data
KR101977976B1 (ko) Method for improving analysis accuracy by removing primer sequences in amplicon-based next-generation sequencing
WO2016068626A1 (fr) Method for analyzing sample sequence information based on a single sample
WO2016068625A1 (fr) Method for removing bias in the analysis of nucleotide target sequences by NMF
Mitra et al. Statistical analyses of next generation sequencing data: an overview
US20230368863A1 (en) Multiplexed Screening Analysis of Peptides for Target Binding
SK802023A3 (sk) Method and system for identifying the tissue of origin of a tumor from sequenced cell-free circulating DNA

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 20157031735

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15853986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15853986

Country of ref document: EP

Kind code of ref document: A1