CN117935907B - Method and device for detecting copy number variation of true and false genes
- Publication number
- CN117935907B (application CN202410135524.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B20/50—Mutagenesis
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
The application relates to a method and a device for detecting copy number variation of true and false genes. By correcting high-throughput sequencing data and analyzing difference sites, the application largely eliminates both the influence of highly homologous pseudogenes on the results and the fluctuation of NGS sequencing data, so that exon-level copy number changes of the true gene and the pseudogene can be determined. The reads aligned to the true gene and the pseudogene on the reference genome are extracted and all realigned to the true gene, and the proportion of true-gene bases among all bases at consecutive target difference sites is then used to obtain the pseudo copy number of the true-gene bases. This resolves the ambiguous read alignment caused by the high homology between the true gene and the pseudogene and further improves the accuracy of detecting copy number variation of the true gene.
Description
Technical Field
The application relates to the technical field of bioinformatics, in particular to a method and a device for detecting copy number variation of true and false genes.
Background
Congenital adrenocortical hyperplasia (CAH) is a group of autosomal recessive genetic diseases. It is divided into 6 types according to the known defective enzyme and involves 9 genes; 21-hydroxylase deficiency caused by abnormalities of the CYP21A2 gene is the most common, accounting for 90%-95% of CAH patients. According to the severity of clinical manifestations, it is further divided into the non-classical type (NCAH) and the classical type. The non-classical type, also called the late-onset type, usually has mild or no symptoms; female patients may develop hirsutism, polycystic ovaries, oligomenorrhea and the like, but do not show virilization of the external genitalia. The classical type is divided into the salt-wasting type (SW) and the simple virilizing type (SV); female patients often show varying degrees of malformation of the external genitalia owing to excess androgens, and male patients may show signs of precocious puberty. In addition, SW patients may develop severe hyponatremia and hyperkalemia within 1-4 weeks after birth, and without timely intervention symptoms such as dehydration and shock can appear and may lead to death.
In conventional techniques, methods for detecting true and false genes include Southern blotting, multiplex ligation-dependent probe amplification (MLPA), site-specific PCR (polymerase chain reaction), high-throughput sequencing (NGS) and the like. These methods still have certain limitations, which reduce the accuracy of screening positive samples.
Therefore, a method capable of accurately and effectively detecting mutations of highly homologous true and false genes is needed.
Disclosure of Invention
Based on this, it is necessary to provide a method and apparatus for efficiently and accurately detecting copy number variation of true and false genes.
The application provides a method for detecting copy number variation of true and false genes, which comprises the following steps:
Obtaining high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a pseudogene, the regions comprise UTRs (untranslated regions), exons and introns, and the sample set comprises samples to be tested and control samples;
Obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the pseudogene of each sample in the sample set and the average coverage depth of each control gene;
Averaging the first correction factors of each control gene over the different samples to obtain a second correction factor of each control gene;
For each region, averaging over the control genes the ratio of the first correction factor to the second correction factor of each sample to be tested, to obtain the scale factor of each sample to be tested in each region;
Obtaining the total copy number of the true gene and the pseudogene of each sample to be tested in each region according to the scale factor of that sample in that region;
Extracting the reads of the sample to be tested that align to the true gene and the pseudogene on the reference genome, realigning all of them to the true gene, and calculating, at each target difference site, the ratio of true-gene bases to all bases at the site;
Correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the pseudogene at that site to obtain the base pseudo copy number of the true gene;
Analyzing the mutation status of the true gene at the target difference sites according to the base pseudo copy number of the true gene.
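The steps above give no formula for the scale factor or for the corrected base ratio, so the following is only one possible way to write them down. The symbols $F_{ei}$ (scale factor), $G$ (number of control genes), $\bar{R}_{eg}$ (second correction factor), $p_{ki}$ (true-gene base ratio at difference site $k$), $T_{ei}$ (total copy number in the region containing the site) and $N_{ki}$ (base pseudo copy number) are notation assumed here; $R_{eig}$ is the first correction factor of the embodiment that follows. The multiplicative form of the second line is an assumption about what "correcting" means.

```latex
\begin{align}
F_{ei} &= \frac{1}{G}\sum_{g=1}^{G}\frac{R_{eig}}{\bar{R}_{eg}}
  && \text{(scale factor: mean over control genes of the ratio)}\\
N_{ki} &= p_{ki}\, T_{ei}
  && \text{(assumed: base ratio scaled by the regional total copy number)}
\end{align}
```

Under this reading, a region whose depth matches the sample-set average has a scale factor of about 1, and a site at which half of the bases come from the true gene receives half of the regional total copy number.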
In one embodiment, the first correction factor is calculated according to the following formula:
$$R_{eig} = \frac{C_{ei1} + C_{ei2}}{C_{ig}},$$
wherein $e$ refers to a gene region;
$i$ refers to each sample in the sample set;
$R_{eig}$ is the first correction factor;
$C_{ei1}$ refers to the average coverage depth of the true gene;
$C_{ei2}$ refers to the average coverage depth of the pseudogene;
$C_{ig}$ refers to the average coverage depth of each gene $g$ among the control genes.
In one embodiment, the second correction factor is calculated according to the following formula:
$$\bar{R}_{eg} = \frac{1}{S}\sum_{i=1}^{S} R_{eig},$$
wherein $\bar{R}_{eg}$ refers to the second correction factor;
$S$ refers to the number of samples in the sample set.
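The correction steps can be illustrated with a minimal Python sketch. It assumes that the first correction factor is the summed true-gene and pseudogene depth divided by the control-gene depth, as the step description suggests. The nested-dictionary layout, the function names and the depth values are hypothetical, and the sketch stops at the scale factor because the text does not state how the scale factor is converted into a total copy number.

```python
from statistics import mean

def first_correction_factors(true_depth, pseudo_depth, control_depth):
    """R[e][i][g] = (C_ei1 + C_ei2) / C_ig for region e, sample i, control gene g."""
    return {
        region: {
            sample: {
                gene: (true_depth[region][sample] + pseudo_depth[region][sample]) / depth
                for gene, depth in control_depth[sample].items()
            }
            for sample in true_depth[region]
        }
        for region in true_depth
    }

def second_correction_factors(first):
    """Average R[e][i][g] over all samples i, giving one value per (region, control gene)."""
    second = {}
    for region, per_sample in first.items():
        genes = list(next(iter(per_sample.values())))
        second[region] = {g: mean(per_sample[s][g] for s in per_sample) for g in genes}
    return second

def scale_factor(first, second, region, sample):
    """Mean over control genes of first factor / second factor for one sample and region."""
    return mean(first[region][sample][g] / second[region][g] for g in second[region])

# Hypothetical depths: s3 has half the true-gene depth of s1 and s2 in exon1.
true_depth = {"exon1": {"s1": 50.0, "s2": 50.0, "s3": 25.0}}
pseudo_depth = {"exon1": {"s1": 50.0, "s2": 50.0, "s3": 50.0}}
control_depth = {s: {"PTEN": 100.0, "LMNA": 200.0} for s in ("s1", "s2", "s3")}

first = first_correction_factors(true_depth, pseudo_depth, control_depth)
second = second_correction_factors(first)
print(round(scale_factor(first, second, "exon1", "s3"), 3))  # 0.818
```

In this toy set the second correction factor is averaged over only three samples, one of which carries the loss, so the reduction is damped; with the 20 or more control samples mentioned in the embodiments the average is dominated by normal samples.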
In one embodiment, when the base pseudo copy number of the true gene is less than a preset threshold, it is determined that the true gene has a mutation at the target difference site.
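The threshold rule can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the correction is taken to be a simple multiplication of the base ratio by the regional total copy number, the threshold value of 1.5 is hypothetical (the text only says "preset"), and the function names and base counts are invented for the example.

```python
def base_ratio(base_counts, true_base):
    """Fraction of bases at one difference site that match the true-gene base."""
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("no reads cover this site")
    return base_counts.get(true_base, 0) / total

def pseudo_copy_number(base_counts, true_base, total_copy_number):
    """Assumed correction: base ratio multiplied by the regional total copy number."""
    return base_ratio(base_counts, true_base) * total_copy_number

def is_mutated(pseudo_cn, threshold=1.5):
    """Flag the site when the pseudo copy number falls below the preset threshold."""
    return pseudo_cn < threshold

# Hypothetical site where the true gene carries C and the pseudogene carries T.
normal = pseudo_copy_number({"C": 60, "T": 60}, "C", total_copy_number=4)
reduced = pseudo_copy_number({"C": 30, "T": 90}, "C", total_copy_number=4)
print(normal, is_mutated(normal))    # 2.0 False
print(reduced, is_mutated(reduced))  # 1.0 True
```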
In one embodiment, the true gene comprises the CYP21A2 gene; and/or
The control gene includes at least one of ABCF1, ACAD9, ACOX1, BDP1, DPP3, EDNRB, EHBP1, FASTKD2, FOXN1, HEXB, HPS1, IQCB1, LMNA, LRPPRC, PAF1, PTEN, RAPSN, SLC22A5, SLC35D1 and TRIQK.
In one embodiment, the number of control samples is 20 or more; and/or
The number of the control genes is 1 or more.
According to the application, in a first aspect, by correcting high-throughput sequencing data and analyzing difference sites, the influence of highly homologous pseudogenes on the results and the fluctuation of NGS sequencing data are largely eliminated, so that exon-level copy number variation of the true gene and the pseudogene can be determined. In a second aspect, the reads aligned to the true gene and the pseudogene on the reference genome are extracted and all realigned to the true gene to obtain, at each target difference site, the ratio of true-gene bases to all bases at the site; this base ratio is then corrected according to the total copy number of the true gene and the pseudogene to obtain the base pseudo copy number of the true gene, which eliminates the ambiguous read alignment caused by the high homology between the true gene and the pseudogene and further improves the accuracy of detecting copy number variation of the true gene. In a third aspect, copy number determination is extended to the introns and UTRs of the true gene and the pseudogene, which not only allows deletions and duplications to be determined continuously along the genes, but also allows potential fusion genes to be identified and their breakpoint regions to be given.
In addition, the whole-gene or regional copy number of the true and false genes can be readily judged from the visualized plots, with high accuracy.
In another aspect, the present application provides an apparatus for detecting copy number variation of true and false genes, comprising:
The data acquisition module is used for acquiring high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a pseudogene, the regions comprise UTRs (untranslated regions), exons and introns, and the sample set comprises samples to be tested and control samples;
The first correction module is used for obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the pseudogene of each sample in the sample set and the average coverage depth of each control gene;
The second correction module is used for averaging the first correction factors of each control gene over the different samples to obtain a second correction factor of each control gene;
The third correction module is used for averaging, for each region and over the control genes, the ratio of the first correction factor to the second correction factor of each sample to be tested, to obtain the scale factor of each sample to be tested in each region;
The first copy number calculation module is used for obtaining the total copy number of the true gene and the pseudogene of each sample to be tested in each region according to the scale factor of that sample in that region;
The alignment module is used for extracting the reads of the sample to be tested that align to the true gene and the pseudogene on the reference genome, realigning all of them to the true gene, and calculating, at each target difference site, the ratio of true-gene bases to all bases at the site;
The second copy number calculation module is used for correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the pseudogene at that site to obtain the base pseudo copy number of the true gene;
The analysis module is used for analyzing the mutation status of the true gene at the target difference sites according to the base pseudo copy number of the true gene.
The application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.
The application also provides a computer readable storage medium storing a computer program which when executed by a processor implements the steps of the method described above.
The present application also provides a computer program product comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for detecting copy number variation of true and false genes described above.
Drawings
FIG. 1 is a flow chart of a method for detecting copy number variation of true and false genes according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of an apparatus for detecting copy number variation of true and false genes according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an internal architecture of a computer device according to an embodiment of the present application;
FIG. 4 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region of the negative sample in Example 1 of the application. In panel A, the abscissa shows each region of the gene and the ordinate shows the scale factor. In panel B, the abscissa shows the difference sites between the true gene and the pseudogene, and the ordinate shows the base pseudo copy number of the true gene and the pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign; VUS denotes Uncertain significance; C denotes Conflicting interpretations of pathogenicity; P denotes Pathogenic or Likely pathogenic; base denotes bases; pseudo denotes pseudogene bases; other denotes other bases; functional denotes true gene bases;
FIG. 5 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region in Example 2 of the application. In panel A, the abscissa shows each region of the gene and the ordinate shows the scale factor. In panel B, the abscissa shows the difference sites between the true gene and the pseudogene, and the ordinate shows the base pseudo copy number of the true gene and the pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign; VUS denotes Uncertain significance; C denotes Conflicting interpretations of pathogenicity; P denotes Pathogenic or Likely pathogenic; base denotes bases; pseudo denotes pseudogene bases; other denotes other bases; functional denotes true gene bases;
FIG. 6 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region in Example 3 of the application. In panel A, the abscissa shows each region of the gene and the ordinate shows the scale factor. In panel B, the abscissa shows the difference sites between the true gene and the pseudogene, and the ordinate shows the base pseudo copy number of the true gene and the pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign; VUS denotes Uncertain significance; C denotes Conflicting interpretations of pathogenicity; P denotes Pathogenic or Likely pathogenic; base denotes bases; pseudo denotes pseudogene bases; other denotes other bases; functional denotes true gene bases;
FIG. 7 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region in Example 4 of the application. In panel A, the abscissa shows each region of the gene and the ordinate shows the scale factor. In panel B, the abscissa shows the difference sites between the true gene and the pseudogene, and the ordinate shows the base pseudo copy number of the true gene and the pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign; VUS denotes Uncertain significance; C denotes Conflicting interpretations of pathogenicity; P denotes Pathogenic or Likely pathogenic; base denotes bases; pseudo denotes pseudogene bases; other denotes other bases; functional denotes true gene bases.
Detailed Description
In order that the application may be readily understood, a more complete description of the application will be rendered by reference to the appended drawings. Preferred embodiments of the present application are shown in the drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The fusion gene refers to a chimeric gene formed by connecting coding regions of two or more genes end to end and placing the coding regions under the control of the same set of regulatory sequences (including promoters, enhancers, ribosome binding sequences, terminators and the like), and the expression form of the chimeric gene is gene A/gene B or gene A-gene B, such as BCR-ABL1, and the gene A and the gene B are fusion gene partners.
The UTR (Untranslated Regions), i.e., untranslated region, is a non-coding segment at both ends of a messenger RNA (mRNA) molecule. The 5'-UTR extends from the methylated guanine nucleotide cap at the beginning of the mRNA to the AUG start codon, and the 3' -UTR extends from the stop codon at the end of the coding region to the front of the Poly A tail (Poly-A).
The exon is the part of a eukaryotic gene that is retained after splicing and can be expressed as protein during protein biosynthesis. Exons are the gene sequences that finally appear in the mature RNA, and are also called expressed sequences.
The introns are intervening sequences in eukaryotic DNA. These sequences are transcribed into the precursor RNA, removed by splicing, and are ultimately absent from the mature RNA molecule. The alternating arrangement of introns and exons constitutes a split gene.
The pseudogene may be considered as a copy of nonfunctional genomic DNA in the genome that closely resembles the coding gene sequence.
The reads refer to sequence fragments obtained by high-throughput sequencing.
The average coverage depth refers to the coverage rate of each base, and is the average number of times the genomic base is sequenced. The depth of coverage of a genome is calculated by dividing the number of bases of all short reads that match the genome by the length of the genome.
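The definition above can be illustrated with a minimal sketch. The function name and the input values are hypothetical, and the sketch assumes that every counted base of each read falls inside the region of interest.

```python
def mean_coverage_depth(aligned_read_lengths, region_length):
    """Mean depth = total aligned bases / length of the region."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return sum(aligned_read_lengths) / region_length


# Hypothetical input: 40 reads, each contributing 150 aligned bases
# to a 3,000-base region.
depth = mean_coverage_depth([150] * 40, 3000)
print(depth)
```

The same calculation applies per gene region (UTR, exon or intron) when the region length and the bases aligned within it are used instead of the whole genome.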
The pseudo copy number is defined as follows: because the sequences surrounding each true and false gene difference site are similar, the base ratio at a target difference site, corrected by the total copy number of the true and false genes, is taken to represent the copy numbers of the true gene and the pseudogene in the gene region containing that site; this value is referred to as the "pseudo copy number".
CYP21A2 is located on chromosome 6p21.33 and has a full length of about 3.2 kb; a highly similar homologous pseudogene, CYP21A1P, lies about 30 kb upstream of it. About 70% of the CAH-related pathogenic variations of CYP21A2 are point mutations caused by small-scale true-false gene conversion, and 30% are deletions of about 30 kb caused by homologous recombination between the true and false genes, with the breakpoint usually located between exon 3 and exon 8; the few remaining cases are de novo mutations of CYP21A2.
CYP21 refers to the two genes CYP21A2 and CYP21A1P. In general, three genes, RP1, C4 and TNXB, and the pseudogenes RP2 and TNXA lie near CYP21A2 and its pseudogene CYP21A1P, and together they form two RCCX modules (RP-C4-CYP21-TNX). The bimodular haplotype accounts for about 69% of the human population, and the monomodular and trimodular haplotypes account for about 14% and 17%, respectively. Most trimodular haplotypes contain 2 copies of the CYP21A1P pseudogene; a few contain two copies of CYP21A2, one of which usually carries the pathogenic variation p. (Gln 319).
As described in the background art, among the conventional methods for detecting CYP21A2 gene mutations, PCR is very sensitive to annealing temperature and primer concentration and readily produces false negative or false positive results. Blot hybridization requires a large amount of high-concentration DNA and comprises many steps, such as DNA extraction, digestion, electrophoresis, transfer, labeling and probe hybridization; it is therefore complicated to operate, demands strict experimental conditions, has a long experimental period, and is time-consuming and labor-intensive, and the large amount of labeled hybridization probes also increases the detection cost. In addition, the reliability of its results is affected by many factors, including sample quality and hybridization technique, and may be affected by contamination during hybridization, so further analysis and verification are required to ensure accurate and reliable results. MLPA is only suitable for detecting CNVs near specific probes and cannot detect point mutations or small insertions/deletions. The common MLPA kit for CYP21A2 is based on 7 true-false gene differential sites and can only detect the regions of pseudogene exons 1, 3, 4 and 7 and true gene exons 1, 3, 4, 6 and 7; moreover, the probe for exon 1 is actually located in the UTR region and cannot cover exon 1 completely. MLPA is also costly relative to other detection methods and may still produce false negative or false positive results, so other techniques such as quantitative PCR or sequencing are often required to verify the results. NGS offers high throughput, short turnaround time, low cost, high accuracy and high sensitivity, and is one of the important means of genotype detection.
However, because the sequencing fragments are short, detection of CYP21A2 mutations is easily disturbed by the pseudogene and highly homologous sequence fragments are difficult to distinguish, so false detections and missed detections readily occur in clinical practice. In addition, the traditional NGS analysis workflow cannot accurately assess copy number changes of individual CYP21A2 exons, ignores the copy number of regions other than the CYP21A2 exons, and cannot identify potential fusion genes, which reduces the accuracy of screening positive samples.
As shown in fig. 1, an embodiment of the present application provides a method for detecting copy number variation of true and false genes, which includes step S110, step S120, step S130, step S140, step S150, step S160, step S170 and step S180.
Step S110: obtaining high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a false gene, the regions comprise UTRs, exons and introns, and the sample set comprises a sample to be tested and control samples.
In some of these embodiments, the sequencing data is whole-exome sequencing data.
In some embodiments, the number of control samples is 20 or more.
It can be appreciated that the control samples are other samples from the same batch as the sample to be tested, which eliminates sequencing batch effects.
In some of these embodiments, the number of control genes is 1 or greater.
Further, the control genes are selected according to each gene's own fluctuation in sequencing and its fluctuation relative to the true gene/false gene.
In one specific example, the control gene includes at least one of ABCF1, ACAD9, ACOX1, BDP1, DPP3, EDNRB, EHBP1, FASTKD2, FOXN1, HEXB, HPS1, IQCB1, LMNA, LRPPRC, PAF1, PTEN, RAPSN, SLC22A5, SLC35D1 and TRIQK.
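The text does not state how the fluctuation of a candidate control gene is measured. As a purely illustrative sketch, one could rank candidate genes by the coefficient of variation of their mean coverage depth across samples and keep the most stable ones. The metric, the function names, the placeholder gene names and the depths below are all assumptions, not part of the described method.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def most_stable_genes(depths_by_gene, n):
    """Return the n genes whose depth varies least across samples."""
    ranked = sorted(
        depths_by_gene,
        key=lambda gene: coefficient_of_variation(depths_by_gene[gene]),
    )
    return ranked[:n]


# Hypothetical per-sample mean depths for three placeholder genes.
depths = {
    "GENE_A": [100, 102, 98],
    "GENE_B": [100, 150, 50],
    "GENE_C": [100, 110, 90],
}
selected = most_stable_genes(depths, 2)
print(selected)
```

The same ranking could be applied to the ratio of each candidate gene's depth to the combined true gene and pseudogene depth to capture the fluctuation relative to the true gene/false gene.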
Step S120: obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the false gene of each sample in the sample set and the average coverage depth of each control gene.
In some of these embodiments, the first correction factor is calculated in step S120 according to the following formula:
$$R_{eig} = \frac{C_{ei1} + C_{ei2}}{C_{ig}},$$
wherein e refers to the gene region; i refers to each sample in the sample set; $R_{eig}$ is the first correction factor; $C_{ei1}$ refers to the average coverage depth of the true gene; $C_{ei2}$ refers to the average coverage depth of the pseudogene; $C_{ig}$ refers to the average coverage depth of each gene g in the control genes.
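As a minimal sketch of step S120, the first correction factor can be computed for one region of one sample by dividing the combined true gene and pseudogene depth by the depth of each control gene. The function name and the depth values are hypothetical; the control gene names are taken from the list above only as labels.

```python
def first_correction_factors(true_depth, pseudo_depth, control_depths):
    """Return {control gene: (C_ei1 + C_ei2) / C_ig} for one region and sample."""
    combined = true_depth + pseudo_depth
    return {gene: combined / depth for gene, depth in control_depths.items()}


# Hypothetical depths for one region e of one sample i.
r_first = first_correction_factors(
    true_depth=60.0,
    pseudo_depth=60.0,
    control_depths={"PTEN": 100.0, "LMNA": 120.0},
)
print(r_first)
```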
Step S130: averaging the first correction factors of each control gene over the different samples to obtain the second correction factor of each control gene.
In some of these embodiments, the second correction factor is calculated in step S130 according to the following formula:
$$\bar{R}_{eg} = \frac{1}{S}\sum_{i=1}^{S} R_{eig},$$
wherein $\bar{R}_{eg}$ refers to the second correction factor; S refers to the number of samples in the sample set.
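Step S130 can be sketched as a per-gene average of the first correction factor over all samples of the set, for one region. The function name and the values are hypothetical.

```python
def second_correction_factors(first_factors_per_sample):
    """Average the first correction factor over samples, per control gene."""
    n_samples = len(first_factors_per_sample)
    genes = first_factors_per_sample[0].keys()
    return {
        gene: sum(sample[gene] for sample in first_factors_per_sample) / n_samples
        for gene in genes
    }


# Hypothetical first correction factors of two samples in one region.
per_sample = [
    {"PTEN": 1.2, "LMNA": 1.0},
    {"PTEN": 0.8, "LMNA": 1.0},
]
r_second = second_correction_factors(per_sample)
print(r_second)
```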
Step S140: for each region, averaging the ratios of the first correction factor to the second correction factor over the control genes of each sample to be tested to obtain the scale factor of each sample to be tested in each region.
Step S150: obtaining the total copy number of the true gene and the false gene of each sample to be tested in each region according to the scale factor of each sample to be tested in each region.
Specifically, the scale factor of each sample to be tested in each region is multiplied by 4 to obtain the total copy number of the true gene and the false gene of each sample to be tested in each region.
It will be appreciated that for diploid, the number of copies of a gene is typically two, and that the sum of the true and pseudogenes is 4 copies herein, so that the scaling factor multiplied by 4 represents the predicted total copy number of the true and pseudogenes.
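Steps S140 and S150 can be sketched together as follows. The function names and input values are hypothetical; the baseline of 4 copies follows the diploid reasoning given above.

```python
def scale_factor(first_factors, second_factors):
    """Mean over control genes of first factor / second factor."""
    ratios = [first_factors[g] / second_factors[g] for g in second_factors]
    return sum(ratios) / len(ratios)


def total_copy_number(theta, baseline_copies=4):
    """Scale factor times the expected diploid total (2 true + 2 pseudo)."""
    return theta * baseline_copies


# Hypothetical factors for one sample in one region.
theta = scale_factor(
    first_factors={"PTEN": 0.9, "LMNA": 0.6},
    second_factors={"PTEN": 1.2, "LMNA": 0.8},
)
copies = total_copy_number(theta)
print(theta, copies)
```

In this hypothetical case both control genes give a ratio of about 0.75, so the region is predicted to carry about 3 copies in total, i.e. one copy fewer than the usual 4.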
Step S160: extracting the sequences of the sample to be tested that align to the true gene and the false gene on a reference genome, aligning all of them to the true gene, and calculating the ratio of true gene bases to all bases at each target difference site.
Step S170: correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the false gene at the target difference site to obtain the base pseudo copy number of the true gene.
Specifically, in step S170 the total copy number of the true gene and the pseudogene is multiplied by the base ratio of the true gene at the site, and the product is taken as the predicted copy number of the true gene in that gene region.
It is understood that copy number variation of a region of the true gene can be judged from the base ratios of the true gene at several consecutive difference sites.
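The correction in step S170 amounts to multiplying each base fraction at a difference site by the total copy number of the region. A minimal sketch follows; the function name, the labels and the read counts are hypothetical.

```python
def pseudo_copy_numbers(base_counts, total_copies):
    """Scale each base class's read fraction by the region's total copy number."""
    depth = sum(base_counts.values())
    if depth == 0:
        raise ValueError("no reads cover this site")
    return {label: total_copies * n / depth for label, n in base_counts.items()}


# Hypothetical read counts at one difference site in a 4-copy region.
site = pseudo_copy_numbers({"functional": 50, "pseudo": 150, "other": 0}, 4)
print(site)
```

Here a quarter of the reads carry the true gene base, so with 4 copies in total the true gene is estimated at 1 copy and the pseudogene at 3 copies at this site.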
Step S180: analyzing the mutation status of the true gene at the target difference sites according to the base pseudo copy number of the true gene.
In some of these embodiments, step S180 includes the step of analyzing whether a gene fragment is duplicated or deleted according to the base pseudo copy numbers of the true gene at several consecutive difference sites.
In some embodiments, in step S180, when the base pseudo copy number of the true gene is less than a preset threshold, it is determined that the true gene has a mutation at the target difference site.
In a specific example, the preset threshold is 1.25.
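The thresholding in step S180 can be sketched as follows. The threshold of 1.25 comes from the text; the minimum run length of 3 used to call "several consecutive" sites, the function names and the values are assumptions made only for illustration.

```python
def flag_sites(true_pseudo_copy_numbers, threshold=1.25):
    """True where the true gene pseudo copy number is below the threshold."""
    return [value < threshold for value in true_pseudo_copy_numbers]


def consecutive_runs(flags, min_len=3):
    """Return (start, end) index pairs of runs of flagged sites of at least min_len."""
    runs, start = [], None
    for idx, flagged in enumerate(flags):
        if flagged and start is None:
            start = idx
        elif not flagged and start is not None:
            if idx - start >= min_len:
                runs.append((start, idx - 1))
            start = None
    if start is not None and len(flags) - start >= min_len:
        runs.append((start, len(flags) - 1))
    return runs


# Hypothetical true gene pseudo copy numbers at eight ordered difference sites.
values = [2.0, 1.9, 1.0, 1.1, 0.9, 2.1, 1.0, 2.0]
flags = flag_sites(values)
runs = consecutive_runs(flags)
print(flags, runs)
```

An isolated flagged site (index 6 here) would point to a possible point mutation, while a run of flagged sites (indices 2 to 4) would suggest a deleted or converted fragment.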
In some of these embodiments, the true gene comprises the CYP21A2 gene and the pseudogene comprises the CYP21A1P.
In some of these embodiments, the regions include UTRs, exons, and introns, and specific partitions and genomic coordinates are shown in table 1.
TABLE 1
It will be appreciated that when the true and false genes are partitioned by exon/intron/UTR regions, the target region must cover not only the true and false genes themselves but also an extension beyond the 5' UTR, which facilitates the determination of long true gene deletions caused by fusion of the true and false genes.
Alternatively, the genomic reference sequence may come from a public database; for example, the human genomic sequence may be the human reference genome hs37d5, GRCh37, b37, hg18, hg17, hg16 or hg38 in the NCBI or UCSC database.
It will be appreciated that the above-mentioned true genes and pseudogenes are a pair of more common true and pseudogenes, and the method is not limited to this pair of genes, and any gene which cannot be analyzed using conventional mutation analysis because of high similarity in sequence can be suitably used for analysis by the analysis method of the present invention.
According to the application, in a first aspect, correcting the high-throughput sequencing data and analyzing the difference sites largely eliminates the influence of the highly homologous pseudogene on the results and largely removes the fluctuation of NGS sequencing data, so that copy number variation of the true and false genes can be judged at the exon level. In a second aspect, the reads of the sample to be tested that align to the true gene and the pseudogene on the reference genome are extracted and all realigned to the true gene to obtain the ratio of true gene bases to all bases at each target difference site, and this base ratio is corrected by the scale factor at each target difference site to obtain the pseudo copy number of the true gene bases; this removes the ambiguous read alignment caused by the high homology of the true gene and the pseudogene and further improves the accuracy of detecting copy number variation of the true gene. In a third aspect, the copy numbers of the intron and UTR regions of the true and false genes are also determined, which not only provides a continuous assessment of deletion and duplication of the true and false genes, but also allows potential fusion genes to be identified and their breakpoint regions to be given.
In addition, the whole/regional copy number of the true and false genes can be judged easily through the visualized graph results, and high accuracy exists.
In addition, as shown in fig. 2, an embodiment of the present application further provides an apparatus 200 for detecting copy number variation of true and false genes, where the apparatus includes a data acquisition module 210, a first correction module 220, a second correction module 230, a third correction module 240, a first copy number calculation module 250, a comparison module 260, a second copy number calculation module 270, and an analysis module 280.
Specifically, the data acquisition module 210 is configured to acquire high-throughput sequencing data of a sample set, where the high-throughput sequencing data includes reads of regions of true genes and pseudogenes, the regions include UTRs, exons, and introns, and the sample set includes a sample to be tested and a control sample.
The first correction module 220 is configured to obtain a first correction factor for each control gene according to the sum of the average coverage depths of the true genes and the false genes of each sample in the sample set and the average coverage depth of each control gene.
The second correction module 230 is configured to average the first correction factors of the corresponding different samples of each control gene to obtain a second correction factor of each control gene.
The third correction module 240 is configured to average, for each region, a ratio of the first correction factor to the second correction factor of each control gene of each sample to be tested, so as to obtain a scale factor of each sample to be tested in each region.
The first copy number calculation module 250 is configured to obtain the total copy number of the true gene and the false gene of each sample to be tested in each region according to the scale factor of each sample to be tested in each region.
The comparison module 260 is configured to extract the sequences of the sample to be tested that align to the true gene and the false gene on the reference genome, align all of them to the true gene, and calculate the ratio of true gene bases to all bases at each target difference site.
The second copy number calculation module 270 is configured to correct the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the false gene at the target difference site, so as to obtain the base pseudo copy number of the true gene.
The analysis module 280 is configured to analyze the mutation status of the true gene at the target difference sites according to the base pseudo copy number of the true gene.
The respective modules in the above-described apparatus 200 for detecting copy number variation of true and false genes may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In addition, an embodiment of the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method for detecting copy number variation of true and false genes in any of the above embodiments when executing the computer program.
It is to be understood that the above-mentioned computer device may be the server 104 or the terminal 102, and the internal structure thereof may be as shown in fig. 3. The computer device includes a processor, a memory, and a communication interface connected by a system bus. When the computer equipment is a terminal, the system also comprises a display screen and an input device which are connected with the system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for detecting copy number variation of true and false genes. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In addition, an embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the method for detecting copy number variation of true and false genes in any of the above embodiments.
Furthermore, an embodiment of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device performs the steps of the method for detecting copy number variation of true and false genes in any of the above embodiments.
The method for detecting copy number variation of true and false genes of the present application will be described in further detail with reference to specific examples.
Example 1
Whole-exome sequencing was performed on DNA from clinical samples, the raw WES data obtained were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of this example as follows:
1. Calculating the total copy number of the true and false genes for each exon and intron, with background correction by 20 control genes and samples of the same batch
(1) According to each gene's own fluctuation in sequencing and its fluctuation relative to the CYP21A2/CYP21A1P genes, the 10 genes with the smallest own fluctuation and the 10 genes with the smallest fluctuation relative to the CYP21A2/CYP21A1P genes were selected, giving a control gene set G of 20 genes; the specific genes are shown in Table 2.
TABLE 2 control Gene sets for predicting CYP21A2/CYP21A1P copy number variation
(2) 1055 samples were selected as sample set S, of which 1054 samples sequenced in the same batch served as control samples, to eliminate the influence of sequencing batch effects and errors on the determination of results.
(3) The true and pseudogenes were partitioned by individual exon/intron/UTR regions. In addition to the CYP21A2/CYP21A1P genes of the target region, an extension beyond the 5' UTR is required, which facilitates the determination of long true gene deletions caused by true and false gene fusion. The specific partitions and genomic coordinates, referenced to the human genomic reference sequence GRCh37, are shown in Table 1 above.
(4) Calculating the relative total average coverage depth $R_{eig}$ of each region e of the true gene and the pseudogene in the sample set
For each sample i in the sample set, the sum of the average coverage depths of CYP21A2 and CYP21A1P, $C_{ei1} + C_{ei2}$, is divided by the average coverage depth $C_{ig}$ of each gene g of the 20 control genes to obtain the first correction factor $R_{eig}$ for the CYP21 gene:
$$R_{eig} = \frac{C_{ei1} + C_{ei2}}{C_{ig}},$$
wherein e refers to the gene region; i refers to each sample in the sample set; $R_{eig}$ is the first correction factor; $C_{ei1}$ refers to the average coverage depth of the true gene; $C_{ei2}$ refers to the average coverage depth of the pseudogene; $C_{ig}$ refers to the average coverage depth of each gene g in the control genes.
(5) Calculating a second correction factor
For each control gene g, the average of the relative depth of coverage of all samples was calculated in the sample set.
$$\bar{R}_{eg} = \frac{1}{S}\sum_{i=1}^{S} R_{eig},$$
wherein $\bar{R}_{eg}$ refers to the second correction factor; S refers to the number of samples in the sample set.
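(6) Calculating the scale factor $\theta_{e}$ and determining the copy number of each region e
For the 20 control genes, the ratio of the first correction factor to the second correction factor is calculated and averaged to obtain the scale factor $\theta_{e}$, and the scale factor is multiplied by 4 to predict the total CYP21 copy number in the region. Generally, $\theta_{e}$ values of 0.25, 0.5, 0.75, 1 and 1.25 correspond to total true and false gene copy numbers of 1, 2, 3, 4 and 5, respectively.
$$\theta_{e} = \frac{1}{|G|}\sum_{g \in G}\frac{R_{eig}}{\bar{R}_{eg}},$$
where G is the control gene set, $R_{eig}$ is the first correction factor and $\bar{R}_{eg}$ is the second correction factor of control gene g in region e.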
2. Calculating the base ratio of true genes of target difference sites of samples to be detected
The true and false gene target difference sites, PSVs (paralogous sequence variants), are obtained by aligning the CYP21A2 gene and the pseudogene CYP21A1P on the human genome. The reads of the sample to be tested that align to CYP21A2 and its pseudogene CYP21A1P are extracted and all realigned to the true gene, and the base ratios of the true gene and the pseudogene at each target difference site are calculated, that is, the proportions of the reads carrying the true gene base, the pseudogene base and other bases, respectively, among the total reads at the difference site. The base ratio of the true gene at the site is then corrected by the total copy number of the true and false genes at each target difference site, that is, the total copy number of the true and false genes is multiplied by the base ratio of the true gene at the site to obtain the pseudo copy number of the true gene.
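The read counting at a single PSV can be sketched as follows, assuming the base observed in each realigned read at that position has already been collected (for example from a pileup). The function name, the labels and the observed bases are hypothetical.

```python
from collections import Counter


def classify_bases(observed_bases, true_base, pseudo_base):
    """Count reads carrying the true gene base, the pseudogene base, or any other base."""
    counts = Counter(observed_bases)
    functional = counts[true_base]
    pseudo = counts[pseudo_base]
    other = len(observed_bases) - functional - pseudo
    return {"functional": functional, "pseudo": pseudo, "other": other}


# Hypothetical site where the true gene carries C and the pseudogene carries T.
observed = list("C" * 5 + "T" * 14 + "A")
counts = classify_bases(observed, true_base="C", pseudo_base="T")
ratios = {label: n / len(observed) for label, n in counts.items()}
print(counts, ratios)
```

Multiplying the true gene ratio by the total copy number of the region then gives the pseudo copy number described above.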
3. Plotting the CYP21 scale factor of each region and the pseudo copy numbers of the true gene and pseudogene bases at the target difference sites, and judging the deletion or duplication of the CYP21 gene from the pseudo copy numbers of the true gene and pseudogene bases at several difference sites in each region.
The results are shown in FIG. 4. A in FIG. 4 shows the scale factor of the CYP21 true and false genes for each region of a sample with the most common double RCCX pattern, where each bin is a region of the CYP21 gene, from which the total CYP21 copy number of each region of the sample can be predicted. B in FIG. 4 shows the relative base copy numbers of the true gene and the pseudogene, where each bin is a PSV and the three colors represent, from bottom to top, the pseudo copy numbers of true gene bases, pseudogene bases and other bases; the copy number change of the true gene and pseudogene regions can be judged from the proportion of true gene bases at several difference sites. Dots of different lightness above the true gene bases indicate that the PSV is pathogenic; the darker the color, the higher the likelihood of pathogenicity. The threshold for the relative base copy number of the true gene is usually set to 1.25 (red dotted line); a PSV whose true gene value falls below it is considered likely to carry a point mutation caused by microconversion with the pseudogene. Furthermore, changes in the pseudo copy numbers of true gene and pseudogene bases at several consecutive PSVs suggest possible duplication or deletion of a gene fragment.
Example 2
The sample in Example 2 was from a patient with adrenal cortical hyperplasia, clitoral hypertrophy and infertility, who had undergone first- and second-generation in vitro fertilization procedures but with poor embryo quality. Whole-exome sequencing was performed on the DNA of the clinical sample, the raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of Example 1, with the following specific results:
The results are shown in FIG. 5. One CYP21A2 gene of the sample has been converted to the CYP21A1P gene, and a homozygous NM_000500.9(CYP21A2):c.1024C>T (p.Arg342Trp) mutation was found during mutation detection; there is evidence that this missense mutation has been observed in individuals with atypical (non-classic) and simple virilizing congenital adrenal hyperplasia.
As shown in A of FIG. 5, the scale factors of all regions are concentrated around 1, indicating that the sample conforms to the common 4-copy pattern. In B of FIG. 5, each bin is a PSV, and the copy number variation of the true gene and pseudogene regions can be judged from the base ratios at consecutive sites. Except at individual PSVs, the true gene bases in the figure are all below the threshold, suggesting that the 4 copies of the CYP21 gene comprise only 1 copy of CYP21A2 and 3 copies of CYP21A1P; it is therefore suspected that one CYP21A2 gene has been converted to CYP21A1P, that is, the sample lacks one true gene copy. Combined with the homozygous CYP21A2:c.1024C>T mutation, the genotype is consistent with the clinical phenotype.
Example 3
In Example 3, whole-exome sequencing was performed on the DNA of the clinical sample, the raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of Example 1, with the following specific results:
As shown in FIG. 6, the scale factors of all regions are concentrated around 0.75, indicating that the total copy number of CYP21 is 3, that is, the sample lacks one copy relative to the most common double RCCX pattern. In the analysis of the true and false gene difference sites, the true gene bases of the UTR1-exon3 region are mostly below the threshold, except at two PSVs in exon1, while the true gene bases of the exon4-exon10 region are concentrated near 2. This suggests the presence of a chimeric gene of chimera type CH-1: owing to the high similarity within the RCCX modules, one CYP21A1P gene has undergone homologous recombination with one CYP21A2 gene, specifically the head of the CYP21A1P gene (5' UTR to exon 3) with the tail of the CYP21A2 gene (exon 4 to 3' UTR), resulting in a deletion of about 30 kb between the two genes, with the breakpoint presumed to lie in intron 3.
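The reasoning used to place the breakpoint can be illustrated with a deliberately strict sketch: given one representative true gene pseudo copy number per ordered region, find the single position where the values switch from below the threshold to at or above it. Summarizing each region by one value, the function name and the numbers are all assumptions; real data, such as the two exon1 PSVs above the threshold in this example, would need a more tolerant rule.

```python
def locate_transition(regions, true_copy_numbers, threshold=1.25):
    """Return the adjacent pair of regions where values switch from low to high.

    Requires every region before the switch to be below the threshold and
    every region after it to be at or above the threshold; otherwise None.
    """
    low = [value < threshold for value in true_copy_numbers]
    for k in range(1, len(low)):
        if all(low[:k]) and not any(low[k:]):
            return regions[k - 1], regions[k]
    return None


# Hypothetical per-region values resembling a head-low, tail-high pattern.
regions = ["UTR1", "exon1", "intron1", "exon2", "intron2",
           "exon3", "intron3", "exon4", "intron4", "exon5"]
values = [1.0, 1.1, 1.0, 0.9, 1.0, 1.1, 1.6, 2.0, 1.9, 2.0]
transition = locate_transition(regions, values)
print(transition)
```

With these made-up values the switch falls between exon3 and intron3, which mirrors the kind of evidence used above to place the breakpoint in intron 3.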
Example 4
Example 4 is a negative sample. Whole-exome sequencing was performed on the DNA of the clinical sample, the raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of Example 1, with the following specific results:
As shown in FIG. 7, the scale factors are all distributed around 1, consistent with the common double RCCX pattern. As shown in B of FIG. 7, the true gene relative copy numbers of the sample are below the threshold at several consecutive PSVs in UTR1, intron2 and exon10, indicating heterozygous deletions in these regions. In the analysis of a large number of samples, this haplotype was found to be relatively common in CAH-negative samples, and the absence of these regions does not affect the function of CYP21A2.
In summary, compared with other detection techniques, the NGS-based analysis method is simple to operate, faster and cheaper, and it detects CYP21 gene regions and small insertions/deletions that the MLPA technique cannot cover. Compared with traditional NGS data analysis methods, the method corrects the CYP21 sequencing data of the sample to be tested using the sequencing depth of other samples in the same batch and of other genes of the sample itself, which removes the fluctuation of NGS sequencing data so that copy number changes of the CYP21 gene can be judged at the exon level.
Conventional NGS data analysis methods generally judge only copy number variation or large-fragment deletion/duplication of the CYP21A2 gene, so the resolution of copy number variation detection is low, and accuracy is greatly reduced by interference from reads of the highly homologous pseudogene CYP21A1P; for example, a deletion of the true gene can be masked when pseudogene reads align to the true gene. The application instead first determines the total copy number of the true gene and pseudogene regions rather than the copy number of the true gene alone. The reads of the sample to be tested that align to the true gene and the pseudogene on the reference genome are extracted and all realigned to the true gene to obtain the ratio of true gene bases to all bases at each target difference site, and this base ratio is then corrected by the scale factor at each target difference site to obtain the pseudo copy number of the true gene bases, which removes the ambiguous read alignment caused by the high homology of the true gene and the pseudogene.
In addition, determination of the copy number of the CYP21 intron and UTR regions is added. Because CAH-related pathogenic sites mostly lie in exon regions, traditional NGS data analysis methods consider only exon regions and ignore introns and UTR regions. By adding copy number determination for these regions, the present application not only provides a continuous assessment of deletion and duplication of the CYP21 gene, but also allows potential fusion genes to be identified and their breakpoint regions to be given, which has not been reported in previous NGS analysis methods. Further, the overall/regional copy number of the true and false genes can be readily judged from the visualized plots, with high accuracy.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the claims. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (10)
1. The method for detecting copy number variation of true and false genes is characterized by comprising the following steps:
Obtaining high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a false gene, the regions comprise UTR (untranslated region), exons and introns, and the sample set comprises a sample to be tested and a control sample;
obtaining a first correction factor of each sample for each control gene according to the sum of the average covering depths of the true genes and the false genes of each sample in the sample set and the average covering depth of each control gene;
Averaging the first correction factors of the corresponding different samples of each control gene to obtain a second correction factor of each control gene;
Averaging the ratio of the first correction factor to the second correction factor of each control gene of each sample to be tested according to each region to obtain the scale factor of each sample to be tested in each region;
Obtaining the total copy number of the true genes and the false genes of each sample to be detected in each region according to the scale factors of each sample to be detected in each region;
Extracting sequences of true genes and false genes of a sample to be detected and comparing the sequences to a reference genome, comparing all the sequences to the true genes, and calculating the base ratio of the base of the true genes in a target difference site to all the bases of the site;
Correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the false gene at the target difference site to obtain the base quasi-copy number of the true gene;
And analyzing mutation conditions of the true genes at target difference sites according to the base quasi-copy numbers of the true genes.
2. The method for detecting copy number variation of true and false genes according to claim 1, wherein the first correction factor is calculated according to the following formula:
$$R_{eig} = \frac{C_{ei1} + C_{ei2}}{C_{ig}},$$
wherein $e$ refers to a gene region;
$i$ refers to each sample in the sample set;
$R_{eig}$ is the first correction factor;
$C_{ei1}$ refers to the average coverage depth of the true gene;
$C_{ei2}$ refers to the average coverage depth of the pseudogene; and
$C_{ig}$ refers to the average coverage depth of each gene $g$ among the control genes.
3. The method for detecting copy number variation of true and false genes according to claim 2, wherein the total copy number of the true gene and the false gene of each sample to be tested in each region is obtained by multiplying the scale factor of that sample in that region by 4.
4. The method for detecting copy number variation of true and false genes according to any one of claims 1 to 3, wherein a mutation of the true gene is judged to exist at the target difference site when the base quasi-copy number of the true gene is smaller than a predetermined threshold.
5. The method for detecting copy number variation of true and false genes according to any one of claims 1 to 3, wherein the true gene includes the CYP21A2 gene; and/or
the control genes include at least one of ABCF1, ACAD9, ACOX1, BDP1, DPP3, EDNRB, EHBP1, FASTKD2, FOXN1, HEXB, HPS1, IQCB1, LMNA, LRPPRC, PAF1, PTEN, RAPSN, SLC22A5, SLC35D1 and TRIQK.
6. The method for detecting copy number variation of true and false genes according to claim 5, wherein the number of control samples is 20 or more; and/or
the number of control genes is 1 or more.
7. An apparatus for detecting copy number variation of true and false genes, comprising:
a data acquisition module, used for obtaining high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprise reads of each region of a true gene and a false gene, the regions comprise untranslated regions (UTRs), exons and introns, and the sample set comprises a sample to be tested and a control sample;
a first correction module, used for obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the false gene of each sample in the sample set and the average coverage depth of each control gene;
a second correction module, used for averaging, for each control gene, the first correction factors of the different samples to obtain a second correction factor of each control gene;
a third correction module, used for averaging over the control genes, for each region, the ratio of the first correction factor to the second correction factor of each sample to be tested, to obtain a scale factor of each sample to be tested in each region;
a first copy number calculation module, used for obtaining the total copy number of the true gene and the false gene of each sample to be tested in each region according to the scale factor of each sample to be tested in each region;
an alignment module, used for extracting the sequences of the sample to be tested that align to the true gene and the false gene in a reference genome, aligning all of these sequences to the true gene, and calculating, at each target difference site, the ratio of true-gene bases to all bases at that site;
a second copy number calculation module, used for correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the false gene at that site to obtain a base quasi-copy number of the true gene; and
an analysis module, used for analyzing the mutation status of the true gene at the target difference sites according to the base quasi-copy number of the true gene.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1-6 when the computer program is executed.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, which computer instructions are read from the computer-readable storage medium by a processor of a computer device, which computer instructions are executed by the processor such that the computer device performs the method for detecting copy number variations of true and false genes according to any one of claims 1 to 6.
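The calculation chain recited in claims 1 to 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: all function names and the example depths are hypothetical, the second correction factor is averaged over control samples only (the claims say only "different samples"), and the "correction" of the base ratio is assumed to be a multiplication by the total copy number, which the claims do not specify. The threshold value is likewise a placeholder.

```python
from statistics import mean
from typing import Dict, List


def first_correction_factors(
    true_depth: float, pseudo_depth: float, control_depths: Dict[str, float]
) -> Dict[str, float]:
    """R_eig = (C_ei1 + C_ei2) / C_ig for one sample i in one region e."""
    return {
        gene: (true_depth + pseudo_depth) / depth
        for gene, depth in control_depths.items()
    }


def second_correction_factors(
    first_factors: List[Dict[str, float]]
) -> Dict[str, float]:
    """Average the first correction factors over samples, per control gene."""
    genes = first_factors[0].keys()
    return {gene: mean(sample[gene] for sample in first_factors) for gene in genes}


def scale_factor(first: Dict[str, float], second: Dict[str, float]) -> float:
    """Average, over control genes, of first / second correction factor."""
    return mean(first[gene] / second[gene] for gene in first)


def total_copy_number(scale: float, baseline: float = 4.0) -> float:
    """Claim 3: total true + false gene copy number = scale factor * 4."""
    return scale * baseline


def base_quasi_copy_number(
    true_base_reads: int, all_reads: int, total_cn: float
) -> float:
    """ASSUMPTION: the base ratio is corrected by multiplying it by the
    total copy number at the site; the claims do not give the formula."""
    if all_reads <= 0:
        raise ValueError("all_reads must be positive")
    return true_base_reads / all_reads * total_cn


def true_gene_mutated(quasi_cn: float, threshold: float) -> bool:
    """Claim 4: a mutation is called when the quasi-copy number is below
    a predetermined threshold (the value itself is not given)."""
    return quasi_cn < threshold


# Hypothetical depths for one region: two control samples, one test sample.
control_first = [
    first_correction_factors(100.0, 100.0, {"PTEN": 100.0, "LMNA": 200.0}),
    first_correction_factors(50.0, 50.0, {"PTEN": 50.0, "LMNA": 100.0}),
]
second = second_correction_factors(control_first)
test_first = first_correction_factors(60.0, 90.0, {"PTEN": 100.0, "LMNA": 200.0})
scale = scale_factor(test_first, second)
total_cn = total_copy_number(scale)
quasi_cn = base_quasi_copy_number(30, 90, total_cn)
print(scale, total_cn, quasi_cn, true_gene_mutated(quasi_cn, threshold=1.5))
```

In this toy data the test sample's scale factor is 0.75, which gives a total of about 3 copies in the region. Dividing by the second correction factor removes the gene-specific capture bias of each control gene before the ratios are averaged.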
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410135524.5A CN117935907B (en) | 2024-01-31 | 2024-01-31 | Method and device for detecting copy number variation of true and false genes |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117935907A (en) | 2024-04-26 |
| CN117935907B (en) | 2024-09-03 |
Family
ID=90753470
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410135524.5A Active CN117935907B (en) | 2024-01-31 | 2024-01-31 | Method and device for detecting copy number variation of true and false genes |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117935907B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105574361A (en) * | 2015-11-05 | 2016-05-11 | 上海序康医疗科技有限公司 | Method for detecting variation of copy numbers of genomes |
| CN107491666A (en) * | 2017-09-01 | 2017-12-19 | 深圳裕策生物科技有限公司 | Single sample somatic mutation loci detection method, device and storage medium in abnormal structure |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8725422B2 (en) * | 2010-10-13 | 2014-05-13 | Complete Genomics, Inc. | Methods for estimating genome-wide copy number variations |
| RU2593708C2 (en) * | 2012-01-20 | 2016-08-10 | БГИ Диагносис Ко., Лтд. | Method and system for detecting variation of number of copies in genome |
| WO2018144449A1 (en) * | 2017-01-31 | 2018-08-09 | Counsyl, Inc. | Systems and methods for identifying and quantifying gene copy number variations |
| CN108427864B (en) * | 2018-02-14 | 2019-01-29 | 南京世和基因生物技术有限公司 | A kind of detection method, device and computer-readable medium copying number variation |
| CN111599407B (en) * | 2020-05-13 | 2021-10-15 | 北京橡鑫生物科技有限公司 | Method and device for detecting copy number variation |
| CN114645075A (en) * | 2020-12-17 | 2022-06-21 | 上海韦翰斯生物医药科技有限公司 | Detection method of true gene variation |
| EP4352729A1 (en) * | 2021-06-07 | 2024-04-17 | Illumina, Inc. | Methods and systems for identifying recombinant variants |
| CN115637288B (en) * | 2022-12-23 | 2023-04-28 | 苏州赛福医学检验有限公司 | Method for detecting copy number change of SMN1 and SMN2 genes and application thereof |
| CN116453588A (en) * | 2023-04-12 | 2023-07-18 | 深圳华大基因股份有限公司 | STRC gene copy number variation detection method based on whole genome sequencing |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117935907A (en) | 2024-04-26 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |