CN117935907B

CN117935907B - Method and device for detecting copy number variation of true and false genes

Info

Publication number: CN117935907B
Application number: CN202410135524.5A
Authority: CN
Inventors: 朱秋洁; 侯美灵; 张军; 孔令印; 梁波
Original assignee: Suzhou Basecare Medical Device Co ltd
Current assignee: Suzhou Basecare Medical Device Co ltd
Priority date: 2024-01-31
Filing date: 2024-01-31
Publication date: 2024-09-03
Anticipated expiration: 2044-01-31
Also published as: CN117935907A

Abstract

The application relates to a method and a device for detecting copy number variation of true genes and pseudogenes. By correcting high-throughput sequencing data and analyzing difference sites, the application largely eliminates both the influence of highly homologous pseudogenes on the results and the fluctuation of NGS sequencing data, so that exon-level copy number changes of the true gene and pseudogene can be determined. The reads of the true gene and pseudogene on the reference genome are extracted and realigned to the true gene, and the proportion of true-gene bases among all bases at consecutive target difference sites is then used to obtain the pseudo copy number of the true-gene base. This resolves the ambiguous read alignment caused by the high homology of the true gene and pseudogene and further improves the accuracy of detecting copy number variation of the true gene.

Description

Method and device for detecting copy number variation of true and false genes

Technical Field

The application relates to the technical field of bioinformatics, in particular to a method and a device for detecting copy number variation of true and false genes.

Background

Congenital adrenal hyperplasia (CAH) is a group of autosomal recessive genetic diseases that can be classified into 6 types according to the defective enzyme and involves 9 genes. 21-hydroxylase deficiency, caused by abnormalities of the CYP21A2 gene, is the most common and accounts for 90%-95% of CAH patients. It is further divided into non-classical (NCAH) and classical types according to the severity of clinical manifestations. The non-classical type, also called the late-onset type, usually has mild or no symptoms; female patients may develop hirsutism, polycystic ovaries, oligomenorrhea and the like, but without virilization of the external genitalia. The classical type is divided into the salt-wasting type (SW) and the simple virilizing type (SV); female patients often show varying degrees of external genital malformation due to androgen excess, and male patients may show signs of precocious puberty. In addition, SW patients can develop severe hyponatremia and hyperkalemia within 1-4 weeks after birth, and without timely intervention symptoms such as dehydration and shock can appear and may lead to death.

In the conventional art, methods for detecting true genes and pseudogenes include Southern blotting, multiplex ligation-dependent probe amplification (MLPA), site-specific PCR (polymerase chain reaction), high-throughput sequencing (NGS) and the like; however, these methods still have certain limitations, which reduce the accuracy of screening positive samples.

Therefore, a method capable of accurately and effectively detecting mutations of highly homologous true genes and pseudogenes is needed.

Disclosure of Invention

Based on this, it is necessary to provide a method and apparatus for efficiently and accurately detecting copy number variation of true and false genes.

The application provides a method for detecting copy number variation of true and false genes, which comprises the following steps:

obtaining high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a pseudogene, the regions comprise UTRs (untranslated regions), exons and introns, and the sample set comprises a sample to be tested and a control sample;

obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the pseudogene of each sample in the sample set and the average coverage depth of each control gene;

averaging, for each control gene, the first correction factors of the different samples to obtain a second correction factor of each control gene;

averaging, for each region, the ratios of the first correction factor to the second correction factor over the control genes of each sample to be tested to obtain the scale factor of each sample to be tested in each region;

obtaining the total copy number of the true gene and the pseudogene of each sample to be tested in each region according to the scale factor of each sample to be tested in each region;

extracting the reads of the sample to be tested that align to the true gene and the pseudogene on the reference genome, realigning all of these reads to the true gene, and calculating, at each target difference site, the proportion of true-gene bases among all bases at the site;

correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the pseudogene at the target difference site to obtain the base pseudo copy number of the true gene; and

analyzing the mutation status of the true gene at the target difference site according to the base pseudo copy number of the true gene.

In one embodiment, the first correction factor is calculated according to the following formula:

$$R_{eig} = \frac{C_{ei1} + C_{ei2}}{C_{ig}}$$

wherein e refers to a gene region;

i refers to each sample in the sample set;

$R_{eig}$ is the first correction factor;

$C_{ei1}$ refers to the average coverage depth of the true gene;

$C_{ei2}$ refers to the average coverage depth of the pseudogene;

$C_{ig}$ refers to the average coverage depth of each gene g among the control genes.

In one embodiment, the second correction factor is calculated according to the following formula:

$$\bar{R}_{eg} = \frac{1}{S}\sum_{i=1}^{S} R_{eig}$$

wherein $\bar{R}_{eg}$ refers to the second correction factor;

S refers to the number of samples in the sample set.

In one embodiment, when the base pseudo copy number of the true gene is less than a preset threshold, then it is determined that the true gene has a mutation at the target differential site.

In one embodiment, the true gene comprises the CYP21A2 gene; and/or

The control gene includes at least one of ABCF1, ACAD9, ACOX1, BDP1, DPP3, EDNRB, EHBP1, FASTKD2, FOXN1, HEXB, HPS1, IQCB1, LMNA, LRPPRC, PAF1, PTEN, RAPSN, SLC22A5, SLC35D1 and TRIQK.

In one embodiment, the number of control samples is 20 or more; and/or

The number of the control genes is 1 or more.

The application has the following advantages. In a first aspect, by correcting high-throughput sequencing data and analyzing difference sites, the influence of highly homologous pseudogenes on the results and the fluctuation of NGS sequencing data are largely eliminated, so that exon-level copy number variation of the true gene and pseudogene can be determined. In a second aspect, the reads of the true gene and pseudogene on the reference genome are extracted and all realigned to the true gene to obtain the proportion of true-gene bases among all bases at each target difference site; this base ratio is then corrected by the total copy number of the true gene and pseudogene to obtain the base pseudo copy number of the true gene, which eliminates the ambiguous read alignment caused by the high homology of the true gene and pseudogene and further improves the accuracy of detecting copy number variation of the true gene. In a third aspect, copy number determination is extended to the introns and UTR regions of the true gene and pseudogene, which not only provides a continuous determination of deletions and duplications of the true gene and pseudogene but also allows potential fusion genes to be identified and their breakpoint regions to be given.

In addition, the whole-gene or regional copy number of the true gene and pseudogene can be read easily and with high accuracy from the visualized plots.

In another aspect, the present application provides an apparatus for detecting copy number variation of true and false genes, comprising:

a data acquisition module, used for acquiring high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a pseudogene, the regions comprise UTRs (untranslated regions), exons and introns, and the sample set comprises a sample to be tested and a control sample;

a first correction module, used for obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the pseudogene of each sample in the sample set and the average coverage depth of each control gene;

a second correction module, used for averaging, for each control gene, the first correction factors of the different samples to obtain a second correction factor of each control gene;

a third correction module, used for averaging, for each region, the ratios of the first correction factor to the second correction factor over the control genes of each sample to be tested to obtain the scale factor of each sample to be tested in each region;

a first copy number calculation module, used for obtaining the total copy number of the true gene and the pseudogene of each sample to be tested in each region according to the scale factor of each sample to be tested in each region;

a comparison module, used for extracting the reads of the sample to be tested that align to the true gene and the pseudogene on the reference genome, realigning all of these reads to the true gene, and calculating, at each target difference site, the proportion of true-gene bases among all bases at the site;

a second copy number calculation module, used for correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the pseudogene at the target difference site to obtain the base pseudo copy number of the true gene; and

an analysis module, used for analyzing the mutation status of the true gene at the target difference site according to the base pseudo copy number of the true gene.

The application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.

The application also provides a computer readable storage medium storing a computer program which when executed by a processor implements the steps of the method described above.

The present application also provides a computer program product comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for detecting copy number variation of true and false genes described above.

Drawings

FIG. 1 is a flow chart of a method for detecting copy number variation of true and false genes according to an embodiment of the present application;

FIG. 2 is a schematic block diagram of an apparatus for detecting copy number variation of true and false genes according to an embodiment of the present application;

FIG. 3 is a diagram illustrating an internal architecture of a computer device according to an embodiment of the present application;

FIG. 4 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region of the negative sample in Example 1 of the application. In panel A of FIG. 4, the abscissa shows each region of the genes and the ordinate shows the scale factor. In panel B of FIG. 4, the abscissa shows the difference sites between the true gene and pseudogene and the ordinate shows the base pseudo copy number of the true gene and pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign, i.e. benign; VUS denotes Uncertain significance, i.e. unknown significance; C denotes Conflicting interpretations of pathogenicity, i.e. pathogenicity is disputed; P denotes Pathogenic or Likely pathogenic, i.e. pathogenic; base denotes the base type, where pseudo denotes pseudogene bases, other denotes other bases, and functional denotes true-gene bases;

FIG. 5 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region in Example 2 of the application. In panel A of FIG. 5, the abscissa shows each region of the genes and the ordinate shows the scale factor. In panel B of FIG. 5, the abscissa shows the difference sites between the true gene and pseudogene and the ordinate shows the base pseudo copy number of the true gene and pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign, i.e. benign; VUS denotes Uncertain significance, i.e. unknown significance; C denotes Conflicting interpretations of pathogenicity, i.e. pathogenicity is disputed; P denotes Pathogenic or Likely pathogenic, i.e. pathogenic; base denotes the base type, where pseudo denotes pseudogene bases, other denotes other bases, and functional denotes true-gene bases;

FIG. 6 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region in Example 3 of the application. In panel A of FIG. 6, the abscissa shows each region of the genes and the ordinate shows the scale factor. In panel B of FIG. 6, the abscissa shows the difference sites between the true gene and pseudogene and the ordinate shows the base pseudo copy number of the true gene and pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign, i.e. benign; VUS denotes Uncertain significance, i.e. unknown significance; C denotes Conflicting interpretations of pathogenicity, i.e. pathogenicity is disputed; P denotes Pathogenic or Likely pathogenic, i.e. pathogenic; base denotes the base type, where pseudo denotes pseudogene bases, other denotes other bases, and functional denotes true-gene bases;

FIG. 7 shows the copy number analysis results for the CYP21A2/CYP21A1P gene region in Example 4 of the application. In panel A of FIG. 7, the abscissa shows each region of the genes and the ordinate shows the scale factor. In panel B of FIG. 7, the abscissa shows the difference sites between the true gene and pseudogene and the ordinate shows the base pseudo copy number of the true gene and pseudogene at each difference site. P_LP denotes Pathogenic/Likely pathogenic; B denotes Likely benign and/or Benign, i.e. benign; VUS denotes Uncertain significance, i.e. unknown significance; C denotes Conflicting interpretations of pathogenicity, i.e. pathogenicity is disputed; P denotes Pathogenic or Likely pathogenic, i.e. pathogenic; base denotes the base type, where pseudo denotes pseudogene bases, other denotes other bases, and functional denotes true-gene bases.

Detailed Description

In order that the application may be readily understood, a more complete description of the application will be rendered by reference to the appended drawings. Preferred embodiments of the present application are shown in the drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

A fusion gene refers to a chimeric gene formed by joining the coding regions of two or more genes end to end under the control of the same set of regulatory sequences (including promoters, enhancers, ribosome binding sequences, terminators and the like). It is written as gene A/gene B or gene A-gene B, for example BCR-ABL1, where gene A and gene B are the fusion partners.

The UTR (untranslated region) is a non-coding segment at either end of a messenger RNA (mRNA) molecule. The 5'-UTR extends from the methylated guanine nucleotide cap at the start of the mRNA to the AUG start codon, and the 3'-UTR extends from the stop codon at the end of the coding region to just before the poly-A tail.

An exon is the part of a eukaryotic gene that is retained after splicing and can be expressed as protein during protein biosynthesis. Exons are the gene sequences that finally appear in the mature RNA and are also called expressed sequences.

Introns are intervening sequences in eukaryotic DNA. These sequences are transcribed into the precursor RNA, removed by splicing, and are ultimately absent from the mature RNA molecule. The alternating arrangement of introns and exons constitutes a split gene.

A pseudogene may be regarded as a nonfunctional copy of genomic DNA whose sequence closely resembles that of a coding gene.

The reads refer to sequence fragments obtained by high-throughput sequencing.

The average coverage depth is the average number of times each base of a genomic region is sequenced. It is calculated by dividing the total number of bases of all short reads aligned to the genome by the length of the genome.
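By way of illustration only, the calculation of average coverage depth described above may be sketched in Python as follows. The function name `average_coverage_depth` and the read lengths are hypothetical and are not part of the application; the sketch assumes that every counted read lies entirely within the region.

```python
def average_coverage_depth(aligned_read_lengths, region_length):
    """Average depth = total aligned bases / region length (in bases)."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return sum(aligned_read_lengths) / region_length


# Hypothetical example: three reads aligned to a 200-base region.
depth = average_coverage_depth([150, 150, 100], 200)
print(depth)
```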

The pseudo copy number is defined as follows: because the sequences surrounding each difference site between the true gene and pseudogene are similar, the base ratio at a target difference site, corrected by the total copy number of the true gene and pseudogene, is taken to represent the copy number of the true gene or pseudogene in the gene region where the site is located; this value is referred to as the "pseudo copy number".

CYP21A2 is located on chromosome 6p21.33 and has a full length of about 3.2 kb; a highly similar homologous pseudogene, CYP21A1P, lies about 30 kb upstream of it. About 70% of the CAH-related pathogenic variants of CYP21A2 are point mutations caused by small-scale gene conversion between the true gene and pseudogene, and 30% are deletions of about 30 kb caused by homologous recombination between the true gene and pseudogene, with the breakpoint usually located between exon 3 and exon 8; the few remaining cases are de novo mutations of CYP21A2.

CYP21 refers to the two genes CYP21A2 and CYP21A1P. In general, the three genes RP1, C4 and TNXB and the pseudogenes RP2 and TNXA lie near CYP21A2 and its pseudogene CYP21A1P, and together they form two RCCX modules (RP-C4-CYP21-TNX). The bimodular haplotype accounts for about 69% of the human population, and the monomodular and trimodular haplotypes account for about 14% and 17%, respectively. Most trimodular haplotypes contain 2 copies of the CYP21A1P pseudogene; a few contain two copies of CYP21A2, one of which usually carries the pathogenic variant p. (Gln 319).

As described in the Background, conventional methods for detecting CYP21A2 gene mutations have limitations. PCR is very sensitive to annealing temperature and primer concentration and readily produces false negative or false positive results. Southern blotting requires a large amount of high-concentration DNA and involves many steps, such as DNA extraction, digestion, electrophoresis, transfer, labeling and probe hybridization; it is complicated to operate, demands strict experimental conditions, has a long experimental cycle and is time- and labor-intensive, and the large quantity of labeled hybridization probes also raises the detection cost. In addition, the reliability of its results is affected by a number of factors, including sample quality and hybridization technique, and may be compromised by contamination during hybridization, so further analysis and verification are required to ensure accurate and reliable results.

MLPA is only suitable for detecting CNVs near specific probes and cannot detect point mutations or small insertions/deletions. The commonly used MLPA kit for CYP21A2 is based on 7 difference sites between the true gene and pseudogene and can only detect the regions of pseudogene exons 1, 3, 4 and 7 and true gene exons 1, 3, 4, 6 and 7; moreover, the probe for exon 1 is actually located in the UTR region and cannot fully cover exon 1. Furthermore, MLPA is costly relative to other detection methods and can still produce false negative or false positive results, so other techniques such as quantitative PCR or sequencing are often needed to verify the results.

NGS offers high throughput, short turnaround time, low cost, high accuracy and high sensitivity, and is one of the important means of genotyping. However, because the sequenced fragments are short, detection of CYP21A2 mutations is easily interfered with by the pseudogene, highly homologous sequence fragments are difficult to distinguish, and false or missed detections occur very easily in clinical practice. In addition, the conventional NGS analysis workflow cannot accurately assess copy number changes of individual CYP21A2 exons, ignores the copy number of regions other than the CYP21A2 exons, and cannot identify potential fusion genes, which reduces the accuracy of screening positive samples.

As shown in fig. 1, an embodiment of the present application provides a method for detecting copy number variation of true and false genes, which includes step S110, step S120, step S130, step S140, step S150, step S160, step S170 and step S180.

Step S110: obtaining high-throughput sequencing data of a sample set, wherein the high-throughput sequencing data comprises reads of each region of a true gene and a pseudogene, the regions comprise UTRs, exons and introns, and the sample set comprises a sample to be tested and a control sample.

In some of these embodiments, the sequencing data is whole-exome sequencing data.

In some embodiments, the number of control samples is 20 or more.

It can be appreciated that the control samples are other samples from the same batch as the sample to be tested, which eliminates sequencing batch effects.

In some of these embodiments, the number of control genes is 1 or greater.

Further, the control genes are selected based on the variation of each gene itself in sequencing and its variation relative to the true gene/pseudogene.
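The application does not state which statistic is used to measure the variation of a candidate control gene. As one possible illustration, and purely as an assumption, the coefficient of variation of a gene's average depth across the samples of a batch could be used to rank candidates, as in the following Python sketch. The function name, the gene labels and the depth values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(depths):
    """Population standard deviation divided by the mean (an assumed stability measure)."""
    m = mean(depths)
    if m == 0:
        raise ValueError("mean depth must be non-zero")
    return pstdev(depths) / m


# Hypothetical per-sample average depths for two candidate control genes.
candidate_depths = {
    "GENE_A": [100.0, 100.0, 100.0],
    "GENE_B": [50.0, 150.0, 100.0],
}

# Candidates with the smallest variation across samples come first.
ranked = sorted(
    candidate_depths,
    key=lambda gene: coefficient_of_variation(candidate_depths[gene]),
)
print(ranked)
```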

In one specific example, the control gene includes at least one of ABCF1, ACAD9, ACOX1, BDP1, DPP3, EDNRB, EHBP1, FASTKD2, FOXN1, HEXB, HPS1, IQCB1, LMNA, LRPPRC, PAF1, PTEN, RAPSN, SLC22A5, SLC35D1 and TRIQK.

Step S120: obtaining a first correction factor of each sample for each control gene according to the sum of the average coverage depths of the true gene and the pseudogene of each sample in the sample set and the average coverage depth of each control gene.

In some of these embodiments, the first correction factor is calculated in step S120 according to the following formula:

$$R_{eig} = \frac{C_{ei1} + C_{ei2}}{C_{ig}}$$

wherein e refers to the gene region; i refers to each sample in the sample set; $R_{eig}$ is the first correction factor; $C_{ei1}$ refers to the average coverage depth of the true gene; $C_{ei2}$ refers to the average coverage depth of the pseudogene; and $C_{ig}$ refers to the average coverage depth of each gene g among the control genes.
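As a minimal sketch of step S120, assuming that the first correction factor is the summed depth of the true gene and pseudogene divided by the depth of a control gene, the calculation for one region of one sample may be written in Python as follows. The function name and all depth values are hypothetical.

```python
def first_correction_factors(true_depth, pseudo_depth, control_depths):
    """Return {control gene: (C_ei1 + C_ei2) / C_ig} for one region of one sample."""
    summed = true_depth + pseudo_depth
    return {gene: summed / depth for gene, depth in control_depths.items()}


# Hypothetical average depths for one region e of one sample i.
factors = first_correction_factors(
    true_depth=60.0,
    pseudo_depth=40.0,
    control_depths={"PTEN": 50.0, "LMNA": 100.0},
)
print(factors)
```

One factor is produced per control gene, so that a single poorly covered control gene does not dominate the later averaging.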

Step S130: averaging, for each control gene, the first correction factors of the different samples to obtain the second correction factor of each control gene.

In some of these embodiments, the second correction factor is calculated in step S130 according to the following formula:

$$\bar{R}_{eg} = \frac{1}{S}\sum_{i=1}^{S} R_{eig}$$

wherein $\bar{R}_{eg}$ refers to the second correction factor; and S refers to the number of samples in the sample set.
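The averaging of step S130 may be illustrated by the following Python sketch, which takes the first correction factors of one region for every sample and returns their mean per control gene. The function name, sample labels and values are hypothetical, and every sample is assumed to report the same control genes.

```python
from statistics import mean


def second_correction_factors(first_factors_by_sample):
    """Average each control gene's first correction factor over all samples."""
    samples = list(first_factors_by_sample.values())
    genes = samples[0].keys()
    return {gene: mean(sample[gene] for sample in samples) for gene in genes}


# Hypothetical first correction factors for one region in two samples.
first_factors_by_sample = {
    "sample_1": {"PTEN": 2.0, "LMNA": 1.0},
    "sample_2": {"PTEN": 1.0, "LMNA": 1.0},
}
mean_factors = second_correction_factors(first_factors_by_sample)
print(mean_factors)
```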

Step S140: averaging, for each region, the ratios of the first correction factor to the second correction factor over the control genes of each sample to be tested to obtain the scale factor of each sample to be tested in each region.
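Step S140 may be sketched in Python as follows for a single region of a single sample to be tested: each control gene contributes the ratio of the sample's first correction factor to the batch-wide second correction factor, and the ratios are averaged. The function name and the values are hypothetical.

```python
def scale_factor(sample_first_factors, second_factors):
    """Mean over control genes of (first correction factor / second correction factor)."""
    ratios = [sample_first_factors[gene] / second_factors[gene] for gene in second_factors]
    return sum(ratios) / len(ratios)


# Hypothetical factors for one region of one sample to be tested.
factor = scale_factor(
    sample_first_factors={"PTEN": 1.0, "LMNA": 0.5},
    second_factors={"PTEN": 2.0, "LMNA": 1.0},
)
print(factor)
```

A scale factor near 1 indicates that the region is covered as in a typical sample of the batch; a value near 0.5 suggests that about half of the expected copies are present.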

Step S150: obtaining the total copy number of the true gene and the pseudogene of each sample to be tested in each region according to the scale factor of each sample to be tested in each region.

Specifically, the scale factor of each sample to be tested in each region is multiplied by 4 to obtain the total copy number of the true gene and the false gene of each sample to be tested in each region.

It will be appreciated that in a diploid genome a gene normally has two copies, so the true gene and the pseudogene together have 4 copies; the scale factor multiplied by 4 therefore represents the predicted total copy number of the true gene and the pseudogene.
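The multiplication by 4 in step S150 may be illustrated with the following Python sketch. The function name, the region labels and the scale factors are hypothetical; the baseline of 4 copies is the value stated above.

```python
def total_copy_numbers(scale_factors, baseline_copies=4):
    """Multiply each region's scale factor by the expected baseline copy number."""
    return {region: factor * baseline_copies for region, factor in scale_factors.items()}


# Hypothetical scale factors of one sample to be tested in two regions.
totals = total_copy_numbers({"exon_1": 1.0, "exon_3": 0.75})
print(totals)
```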

Step S160: extracting sequences of true genes and false genes of a sample to be detected and comparing the sequences to a reference genome, comparing all the sequences to the true genes, and calculating the base ratio of the base of the true genes in a target difference site to all the bases of the site.

Step S170: correcting the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the false gene at the target difference site to obtain the base quasi-copy number of the true gene.

Specifically, the total copy number of the true gene and the pseudogene multiplied by the base ratio of the true gene at that site is used as the predicted copy number of the true gene for that gene region in step S170.

It is understood that the copy number variation of a region of the true gene can be judged from the base ratio of the true gene at several consecutive difference sites.
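The correction of step S170 may be illustrated by the following Python sketch, in which the total copy number of the region containing each difference site is multiplied by the true-gene base ratio at that site. The function name, site labels, region labels and values are hypothetical.

```python
def base_pseudo_copy_numbers(total_copy_by_region, sites):
    """Return {site: total copy number of its region * true-gene base ratio}.

    sites maps a site label to a (region, true_base_ratio) pair.
    """
    return {
        site: total_copy_by_region[region] * ratio
        for site, (region, ratio) in sites.items()
    }


# Hypothetical totals per region and base ratios per difference site.
pseudo_copies = base_pseudo_copy_numbers(
    total_copy_by_region={"exon_3": 4.0, "exon_6": 3.0},
    sites={"site_1": ("exon_3", 0.5), "site_2": ("exon_6", 0.25)},
)
print(pseudo_copies)
```

In this hypothetical example, the first site carries the expected two true-gene copies, while the second site carries fewer than one.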

Step S180: analyzing the mutation status of the true gene at the target difference site according to the base pseudo copy number of the true gene.

In some of these embodiments, step S180 includes the step of analyzing whether a gene fragment is duplicated or deleted based on the base pseudo copy numbers of the true gene at several consecutive difference sites.
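The application does not specify how many consecutive difference sites are required or which cutoffs are applied. As a sketch under those stated assumptions, runs of consecutive sites whose pseudo copy number satisfies a caller-supplied condition may be located in Python as follows; the function name, the minimum run length of 3 and the cutoff used in the example are hypothetical.

```python
def find_runs(values, condition, min_run=3):
    """Return (start, end) index pairs of runs of at least min_run consecutive
    values for which condition(value) is true."""
    runs = []
    start = None
    for idx, value in enumerate(values):
        if condition(value):
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_run:
                runs.append((start, idx - 1))
            start = None
    # Close a run that extends to the last site.
    if start is not None and len(values) - start >= min_run:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical pseudo copy numbers at seven consecutive difference sites.
pseudo_copy_numbers = [2.0, 1.0, 0.9, 1.1, 2.1, 1.0, 2.0]
low_runs = find_runs(pseudo_copy_numbers, lambda cn: cn < 1.5)
print(low_runs)
```

A run of low values spanning several sites is more suggestive of a fragment deletion than a single low site, which may instead reflect a point mutation or noise.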

In some embodiments, when the base pseudo copy number of the true gene is less than the preset threshold in step S180, it is determined that the true gene has a mutation at the target differential site.

In a specific example, the preset threshold is 1.25.
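The threshold comparison may be sketched in Python as follows, using the preset threshold of 1.25 given in the specific example. The function name, the site labels and the pseudo copy numbers are hypothetical.

```python
def sites_below_threshold(pseudo_copy_numbers, threshold=1.25):
    """Return the sites whose true-gene base pseudo copy number is below the threshold."""
    return [site for site, cn in pseudo_copy_numbers.items() if cn < threshold]


# Hypothetical base pseudo copy numbers of the true gene at three difference sites.
flagged = sites_below_threshold({"site_a": 2.0, "site_b": 1.0, "site_c": 1.25})
print(flagged)
```

The comparison is strict, so a site whose pseudo copy number equals the threshold is not flagged.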

In some of these embodiments, the true gene comprises the CYP21A2 gene and the pseudogene comprises CYP21A1P.

In some of these embodiments, the regions include UTRs, exons, and introns, and specific partitions and genomic coordinates are shown in table 1.

TABLE 1

It will be appreciated that when the true gene and pseudogene are partitioned into their exon/intron/UTR regions, the target region needs to include, in addition to the true gene and pseudogene themselves, a stretch extending beyond the 5' UTR, which facilitates the determination of long-fragment deletions of the true gene resulting from fusion of the true gene and pseudogene.

Optionally, the reference genome sequence may come from a public database; for example, the human genome sequence may be the human reference genome hs37d5, GRCh37, b37, hg18, hg17, hg16 or hg38 in the NCBI or UCSC database.

It will be appreciated that the above true gene and pseudogene are one of the more common true gene/pseudogene pairs, and the method is not limited to this pair; any gene that cannot be analyzed by conventional mutation analysis because of high sequence similarity can suitably be analyzed by the analysis method of the present application.

The application has the following advantages. In a first aspect, by correcting high-throughput sequencing data and analyzing difference sites, the influence of highly homologous pseudogenes on the results and the fluctuation of NGS sequencing data are largely eliminated, so that exon-level copy number variation of the true gene and pseudogene can be determined. In a second aspect, the reads of the sample to be tested that align to the true gene and pseudogene on the reference genome are extracted and realigned to the true gene to obtain the proportion of true-gene bases among all bases at each target difference site; this base ratio is then corrected according to the scale factor at each target difference site to obtain the base pseudo copy number of the true gene, which eliminates the ambiguous read alignment caused by the high homology of the true gene and pseudogene and further improves the accuracy of detecting copy number variation of the true gene. In a third aspect, copy number determination is extended to the introns and UTR regions of the true gene and pseudogene, which not only provides a continuous determination of deletions and duplications of the true gene and pseudogene but also allows potential fusion genes to be identified and their breakpoint regions to be given.

In addition, as shown in fig. 2, an embodiment of the present application further provides an apparatus 200 for detecting copy number variation of true and false genes, where the apparatus includes a data acquisition module 210, a first correction module 220, a second correction module 230, a third correction module 240, a first copy number calculation module 250, a comparison module 260, a second copy number calculation module 270, and an analysis module 280.

Specifically, the data acquisition module 210 is configured to acquire high-throughput sequencing data of a sample set, where the high-throughput sequencing data includes reads of regions of true genes and pseudogenes, the regions include UTRs, exons, and introns, and the sample set includes a sample to be tested and a control sample.

The first correction module 220 is configured to obtain a first correction factor for each control gene according to the sum of the average coverage depths of the true genes and the false genes of each sample in the sample set and the average coverage depth of each control gene.

The second correction module 230 is configured to average the first correction factors of the corresponding different samples of each control gene to obtain a second correction factor of each control gene.

The third correction module 240 is configured to average, for each region, a ratio of the first correction factor to the second correction factor of each control gene of each sample to be tested, so as to obtain a scale factor of each sample to be tested in each region.

The first copy number calculation module 250 is configured to obtain the total copy number of the true genes and false genes of each sample to be tested in each region according to the scale factor of each sample to be tested in each region.

The comparison module 260 is configured to extract the reads of the true gene and the false gene of the sample to be tested, align them all to the true gene, and calculate the ratio of true-gene bases to all bases at each target difference site.

The second copy number calculation module 270 is configured to correct the base ratio of the true gene at each target difference site according to the total copy number of the true gene and the false gene at that site, so as to obtain the base pseudo copy number of the true gene.

The analysis module 280 is configured to analyze the variation of the true gene at the target difference site according to the base pseudo copy number of the true gene.

The modules in the above-described apparatus 200 for detecting copy number variation of true and false genes may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or be independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In addition, an embodiment of the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method for detecting copy number variation of true and false genes in any of the above embodiments when executing the computer program.

It is to be understood that the above computer device may be the server 104 or the terminal 102, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory and a communication interface connected by a system bus; when the computer device is a terminal, it further includes a display screen and an input device connected to the system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for detecting copy number variation of true and false genes. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen; keys, a trackball or a touchpad arranged on the housing of the computer device; or an external keyboard, touchpad or mouse.

In addition, an embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor implements the steps of the method for detecting copy number variation of a true or false gene in any of the above embodiments.

Furthermore, an embodiment of the present application provides a computer program product or a computer program that includes computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device performs the steps of the method for detecting copy number variation of true and false genes in any of the above embodiments.

The method for detecting copy number variation of true or false genes of the present application will be described in further detail with reference to specific examples of the method.

Example 1

DNA from clinical samples was subjected to whole-exome sequencing (WES), the resulting raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of this example as follows:

1. Calculating the total copy number of the true gene and pseudogene for each exon and intron, corrected against a background of 20 control genes and samples from the same batch

(1) Control genes were chosen according to each gene's own fluctuation in sequencing and its fluctuation relative to the CYP21A2/CYP21A1P genes. The 10 genes with the smallest self-fluctuation in sequencing and the 10 genes with the smallest fluctuation relative to CYP21A2/CYP21A1P were selected, giving a control gene set G of 20 genes. The specific genes are shown in Table 2.

TABLE 2 control Gene sets for predicting CYP21A2/CYP21A1P copy number variation
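The selection in step (1) can be illustrated with a short Python sketch. The source does not state how "fluctuation" is measured, so the coefficient of variation of per-sample coverage depth is used here purely as an assumption, as is the choice to keep the two groups of genes disjoint. All function names, gene names and depth values are hypothetical.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed fluctuation metric)."""
    mean = fmean(values)
    if mean == 0:
        raise ValueError("mean depth is zero")
    return pstdev(values) / mean


def pick_control_genes(depths_by_gene, target_depths, n_self=10, n_relative=10):
    """Pick candidate control genes from per-sample mean coverage depths.

    depths_by_gene -- {gene: [mean depth in sample 1, sample 2, ...]}
    target_depths  -- [CYP21A2 + CYP21A1P mean depth in sample 1, sample 2, ...]
    Returns (genes with the smallest self-fluctuation,
             genes with the smallest fluctuation relative to the target).
    """
    self_cv = {g: coefficient_of_variation(d) for g, d in depths_by_gene.items()}
    by_self = sorted(self_cv, key=self_cv.get)[:n_self]

    # Assumption: genes already chosen in the first group are not reused.
    relative_cv = {
        g: coefficient_of_variation([t / c for t, c in zip(target_depths, d)])
        for g, d in depths_by_gene.items()
        if g not in by_self
    }
    by_relative = sorted(relative_cv, key=relative_cv.get)[:n_relative]
    return by_self, by_relative


# Hypothetical depths for three candidate genes across three samples.
depths = {"G1": [100, 100, 100], "G2": [100, 150, 50], "G3": [50, 100, 150]}
print(pick_control_genes(depths, [50, 100, 150], n_self=1, n_relative=1))
```

With the default arguments the function returns 10 + 10 genes, matching the size of the control gene set described above.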

(2) A total of 1055 samples were selected as sample set S, of which 1054 samples sequenced in the same batch served as control samples, to eliminate sequencing batch effects and the influence of errors on result interpretation.

(3) The true gene and pseudogene were partitioned into individual exon, intron and UTR regions. In addition to the CYP21A2/CYP21A1P target region, an extension beyond the 5' UTR is required so that long-fragment deletions of the true gene caused by fusion of the true gene and pseudogene can be identified. The specific partitions and genomic coordinates, referenced to the human genome reference sequence GRCh37, are shown in Table 1 above.

(4) Calculating the relative total average coverage depth R_eig of the true gene and the pseudogene for each region e in the sample set

For each sample i in the sample set, the sum of the average coverage depths of CYP21A2 and CYP21A1P in region e, C_ei1 + C_ei2, is divided by the average coverage depth C_ig of each gene g among the 20 control genes to obtain the first correction factor R_eig for the CYP21 genes:

R_eig = (C_ei1 + C_ei2) / C_ig.
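As a minimal sketch of step (4), the following Python function evaluates R_eig = (C_ei1 + C_ei2) / C_ig for one sample and one region against every control gene. The function name, argument names, gene labels and depth values are hypothetical and serve only as illustration.

```python
def first_correction_factors(true_depth, pseudo_depth, control_depths):
    """Return {control gene g: R_eig} for one sample i and one region e.

    true_depth     -- C_ei1, mean coverage depth of the true gene in region e
    pseudo_depth   -- C_ei2, mean coverage depth of the pseudogene in region e
    control_depths -- {gene g: C_ig}, mean coverage depth of each control gene
    """
    total = true_depth + pseudo_depth
    factors = {}
    for gene, depth in control_depths.items():
        if depth <= 0:
            raise ValueError(f"control gene {gene} has no coverage")
        factors[gene] = total / depth
    return factors


# Hypothetical depths, for illustration only.
print(first_correction_factors(60.0, 60.0, {"GENE_A": 100.0, "GENE_B": 150.0}))
```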

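(5) Calculating the second correction factor

For each control gene g, the first correction factors (relative coverage depths) of all samples in the sample set are averaged. Writing the second correction factor as R̄_eg and the number of samples in the sample set as N:

R̄_eg = (1/N) Σ_i R_eig.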
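Step (5) can be sketched in Python as a per-gene mean of the first correction factors over all samples in the set. The nested-dictionary layout and all names below are assumptions made for illustration; the source only states that the average over samples is taken for each control gene.

```python
from statistics import fmean


def second_correction_factors(first_factors_by_sample):
    """Average the first correction factors over samples, per control gene.

    first_factors_by_sample -- {sample i: {control gene g: R_eig}} for one region e
    Returns {control gene g: mean of R_eig over all samples that report g}.
    """
    genes = set()
    for factors in first_factors_by_sample.values():
        genes.update(factors)
    return {
        gene: fmean(
            [f[gene] for f in first_factors_by_sample.values() if gene in f]
        )
        for gene in sorted(genes)
    }


# Hypothetical first correction factors for two samples and two control genes.
batch = {
    "sample_1": {"GENE_A": 1.0, "GENE_B": 0.8},
    "sample_2": {"GENE_A": 1.4, "GENE_B": 0.8},
}
print(second_correction_factors(batch))
```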

(6) Calculating the scale factor θ_e and determining the copy number of each region e

For the 20 control genes, the ratio of the first correction factor to the second correction factor is calculated for each gene, and the ratios are averaged to obtain the scale factor θ_e. Multiplying the scale factor by 4 gives the predicted total CYP21 copy number in the region. In general, θ_e values of 0.25, 0.5, 0.75, 1 and 1.25 correspond to total true gene plus pseudogene copy numbers of 1, 2, 3, 4 and 5, respectively.

θ_e = (1/20) Σ_g (R_eig / R̄_eg), where R̄_eg denotes the second correction factor of control gene g in region e; total copy number = 4 × θ_e.
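The calculation in step (6) can be sketched as follows: the per-gene ratios of first to second correction factor are averaged into the scale factor, which is then multiplied by 4 (the copy number of the common pattern) to give the total copy number. Function names, the dictionary layout and the example values are hypothetical; rounding to the nearest integer is an added convenience, not something the source specifies.

```python
from statistics import fmean


def scale_factor(first_factors, second_factors):
    """Mean over control genes of (first correction factor / second correction factor).

    first_factors  -- {gene g: R_eig} for the sample to be tested in region e
    second_factors -- {gene g: batch-average correction factor} for region e
    """
    ratios = [
        first_factors[gene] / second_factors[gene]
        for gene in second_factors
        if gene in first_factors
    ]
    if not ratios:
        raise ValueError("no control genes shared between the two inputs")
    return fmean(ratios)


def total_copy_number(theta, baseline_copies=4):
    """Total true gene + pseudogene copies: scale factor times 4."""
    return theta * baseline_copies


# Hypothetical values: the tested sample sits at 75% of the batch average.
theta = scale_factor({"GENE_A": 0.9, "GENE_B": 0.6}, {"GENE_A": 1.2, "GENE_B": 0.8})
print(theta, round(total_copy_number(theta)))
```

In this illustration the scale factor is close to 0.75, which corresponds to a total of 3 copies.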

2. Calculating the base ratio of true genes of target difference sites of samples to be detected

The target difference sites between the true gene and the pseudogene, i.e. paralogous sequence variants (PSVs), are obtained by aligning the CYP21A2 gene against its pseudogene CYP21A1P on the human reference genome. Reads of the sample to be tested that align to CYP21A2 or to its pseudogene CYP21A1P are extracted and all realigned to the true gene. The base ratios at each target difference site are then calculated, i.e. the proportions of reads carrying the true-gene base, the pseudogene base and any other base, respectively, relative to the total reads at the site. Finally, the true-gene base ratio at each site is corrected according to the total copy number of the true gene and pseudogene at that site: the total copy number is multiplied by the true-gene base ratio to obtain the base pseudo copy number of the true gene.
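The per-site arithmetic of step 2 can be sketched in Python. The sketch assumes the base counts at a PSV have already been tallied from reads realigned to the true gene; how those counts are obtained (pileup tool, filters) is outside its scope. All names and counts are hypothetical.

```python
def base_pseudo_copy_numbers(base_counts, true_base, pseudo_base, total_copies):
    """Split the regional total copy number by base proportions at one PSV.

    base_counts  -- {base: read count} at the site, e.g. {"C": 50, "T": 48}
    true_base    -- base expected from the true gene at this PSV
    pseudo_base  -- base expected from the pseudogene at this PSV
    total_copies -- total true gene + pseudogene copies in the region
    """
    depth = sum(base_counts.values())
    if depth == 0:
        raise ValueError("no reads cover this site")
    true_reads = base_counts.get(true_base, 0)
    pseudo_reads = base_counts.get(pseudo_base, 0)
    other_reads = depth - true_reads - pseudo_reads
    return {
        "true": total_copies * true_reads / depth,
        "pseudo": total_copies * pseudo_reads / depth,
        "other": total_copies * other_reads / depth,
    }


# Hypothetical counts at one PSV in a region with 4 total copies.
print(base_pseudo_copy_numbers({"C": 50, "T": 48, "A": 2}, "C", "T", 4))
```

The three values always sum to the total copy number of the region, which is what allows them to be drawn as a stacked bar per PSV.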

3. Plotting the CYP21 scale factor of each region and the true-gene and pseudogene base pseudo copy numbers at the target difference sites, and judging deletions and duplications of the CYP21 gene from the true-gene and pseudogene base pseudo copy numbers at multiple difference sites in each region.

The results are shown in FIG. 4. Panel A of FIG. 4 shows the scale factor of the CYP21 true and false genes for each region in a sample with the most common double-RCCX pattern; each bin is one region of the CYP21 gene, from which the total CYP21 copy number of the sample in that region can be predicted. Panel B of FIG. 4 shows the base pseudo copy numbers of the true gene and the pseudogene; each bin is one PSV, and the three colors represent, from bottom to top, the pseudo copy numbers of the true-gene base, the pseudogene base and other bases. Copy number changes of the true gene and pseudogene regions can be judged from the proportion of true-gene bases across multiple difference sites. A dot of varying lightness above the true-gene base indicates that the PSV is pathogenic; the darker the color, the higher the likelihood of pathogenicity. The threshold for the true-gene base pseudo copy number is usually set to 1.25 (red dotted line). A true-gene PSV site below this threshold is considered likely to carry a point mutation caused by microconversion with the pseudogene. Furthermore, a change in the true-gene and pseudogene base pseudo copy numbers across several consecutive PSVs suggests a possible duplication or deletion of a gene fragment.
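The threshold rule described above can be sketched in Python: each PSV whose true-gene base pseudo copy number falls below 1.25 is flagged, and runs of consecutive flagged PSVs are reported as candidate fragment-level events. The minimum run length of 3 is a hypothetical choice, since the source only says "several"; the function names and values are likewise illustrative.

```python
def flag_low_true_copy(true_copies, threshold=1.25):
    """True where the true-gene base pseudo copy number is below the threshold."""
    return [value < threshold for value in true_copies]


def consecutive_runs(flags, min_length=3):
    """Return (start, end) index pairs of runs of True at least min_length long."""
    runs, start = [], None
    for idx, flag in enumerate(flags):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start >= min_length:
                runs.append((start, idx - 1))
            start = None
    # Close a run that extends to the last PSV.
    if start is not None and len(flags) - start >= min_length:
        runs.append((start, len(flags) - 1))
    return runs


# Hypothetical true-gene values at seven PSVs listed in genomic order.
values = [2.0, 1.0, 2.1, 0.9, 1.0, 1.1, 2.0]
flags = flag_low_true_copy(values)
print(flags)
print(consecutive_runs(flags))
```

Under this reading, an isolated flagged PSV points to a possible microconversion at that site, whereas a reported run points to a possible deletion or conversion of a fragment.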

Example 2

The sample in Example 2 was from a patient with adrenal cortical hyperplasia, clitoral hypertrophy and infertility, who had undergone first- and second-generation in vitro fertilization with poor embryo quality. DNA from the clinical sample was subjected to whole-exome sequencing, the resulting raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of Example 1, with the following results:

The results are shown in FIG. 5. One CYP21A2 gene of the sample appears to have been converted to the CYP21A1P gene, and a homozygous NM_000500.9(CYP21A2):c.1024C>T (p.Arg342Trp) mutation was found during variant detection. There is evidence that this missense mutation has been observed in individuals with non-classical and simple virilizing congenital adrenal hyperplasia.

As shown in panel A of FIG. 5, the scale factors of all regions are concentrated around 1, indicating that the sample conforms to the common 4-copy pattern. In panel B of FIG. 5, each bin is a PSV, and copy number variation of the true gene and pseudogene regions can be judged from the base ratios across consecutive sites. Except for individual PSVs, the true-gene base values in the figure are all below the threshold, suggesting that the 4 CYP21 copies comprise only 1 copy of CYP21A2 and 3 copies of CYP21A1P. It is therefore suspected that one CYP21A2 gene has been converted to CYP21A1P, i.e. the sample has lost one true gene copy. Together with the homozygous CYP21A2:c.1024C>T mutation, the genotype is consistent with the clinical phenotype.

Example 3

In Example 3, DNA from the clinical sample was subjected to whole-exome sequencing, the resulting raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of Example 1, with the following results:

As shown in FIG. 6, the scale factors of all regions are concentrated around 0.75, indicating a total CYP21 copy number of 3, i.e. the sample has lost one copy relative to the most common double-RCCX pattern. In the analysis of the difference sites between the true gene and the pseudogene, the true-gene base values in the UTR1-exon3 region are mostly below the threshold, except for two PSVs in exon1, whereas those in the exon4-exon10 region are concentrated near 2. This suggests the presence of a chimeric gene of type CH-1: owing to the high sequence similarity within the RCCX modules, one CYP21A1P gene has undergone homologous recombination with one CYP21A2 gene. Specifically, the head of the CYP21A1P gene (5' UTR to exon 3) is joined to the tail of the CYP21A2 gene (exon 4 to 3' UTR), resulting in a deletion of about 30 kb between the two genes, with the breakpoint presumed to lie in intron 3.
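The reasoning used to place the breakpoint can be illustrated with a rough Python sketch: each region is labeled by whether the median true-gene base pseudo copy number of its PSVs lies below the 1.25 threshold, and a candidate breakpoint is reported wherever the label changes between adjacent regions. Using the median so that a few outlying PSVs do not flip a region is an assumption of this sketch, not a rule stated in the source; the region names and values are hypothetical.

```python
from statistics import median


def region_states(true_copies_by_region, threshold=1.25):
    """Label each region 'low' or 'normal' from its median true-gene value.

    true_copies_by_region -- {region: [true-gene base pseudo copy number per PSV]},
    with regions inserted in genomic order (dicts preserve insertion order).
    """
    return {
        region: "low" if median(values) < threshold else "normal"
        for region, values in true_copies_by_region.items()
        if values  # regions without any PSV are skipped
    }


def state_transitions(states):
    """Return adjacent region pairs whose labels differ (candidate breakpoints)."""
    regions = list(states)
    return [
        (left, right)
        for left, right in zip(regions, regions[1:])
        if states[left] != states[right]
    ]


# Hypothetical per-PSV values; exon3 contains one outlier that the median absorbs.
example = {
    "exon2": [1.0, 0.9],
    "exon3": [1.1, 1.0, 1.9],
    "exon4": [2.0, 1.9],
    "exon5": [2.1],
}
states = region_states(example)
print(states)
print(state_transitions(states))
```

A single transition between two adjacent regions suggests that the breakpoint lies in or between them.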

Example 4

Example 4 is a negative sample. DNA from the clinical sample was subjected to whole-exome sequencing, the resulting raw WES data were aligned to the GRCh37 human reference genome, and the alignment files were analyzed by the method of Example 1, with the following results:

As shown in FIG. 7, the scale factors are all distributed around 1, consistent with the common double-RCCX pattern. Panel B of FIG. 7 shows that the true-gene base pseudo copy numbers are below the threshold for several consecutive PSVs in UTR1, intron2 and exon10, indicating heterozygous deletions in these regions. In the analysis of a large number of samples, this haplotype was found to be relatively common in CAH-negative samples, and the loss of these regions does not affect the function of CYP21A2.

In summary, compared with other detection techniques, the NGS-based analysis method is simple to operate, faster and less expensive. It enables detection of CYP21 gene regions and small insertions and deletions that cannot be covered by the MLPA technique. Compared with traditional NGS data analysis methods, the method corrects the CYP21 sequencing data of the sample to be tested using the sequencing depth of other samples in the same batch and of other genes in the same sample, which eliminates the fluctuation of NGS sequencing data, so that copy number changes of the CYP21 gene at the exon level can be judged.

In general, conventional NGS data analysis methods only judge copy number variation or large-fragment deletion and duplication of the CYP21A2 gene itself. The resolution of copy number variation detection is therefore low, and accuracy is greatly reduced by interfering reads from the highly homologous pseudogene CYP21A1P; for example, a deletion of the true gene can be masked when pseudogene reads align to the true gene. The present application instead determines the total copy number of the true gene and pseudogene regions rather than the copy number of the true gene alone. Reads of the sample to be tested that align to the true gene or the pseudogene on the reference genome are extracted and all realigned to the true gene to obtain, at each target difference site, the ratio of true-gene bases to all bases at the site. This ratio is then corrected according to the scale factor at each target difference site to obtain the base pseudo copy number of the true gene, thereby eliminating the ambiguous read alignment caused by the high homology between the true gene and the pseudogene.

In addition, copy number determination is extended to the intron and UTR regions of the CYP21 gene. Because CAH-related pathogenic sites are mostly located in exons, traditional NGS data analysis methods consider only exon regions and ignore introns and UTRs. By adding copy number determination for these regions, the present application not only provides a continuous assessment of deletions and duplications across the CYP21 gene, but also allows potential fusion genes to be identified and their breakpoint regions reported, which has not previously been reported for NGS analysis methods. Furthermore, the whole-gene and regional copy numbers of the true and false genes can be readily judged from the visualized plots, with high accuracy.

The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

The above examples illustrate only a few embodiments of the application. They are described in detail but are not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. The method for detecting copy number variation of true and false genes is characterized by comprising the following steps:

2. The method for detecting copy number variation of true and false genes according to claim 1, wherein the first correction factor is calculated according to the following formula:

R_eig = (C_ei1 + C_ei2) / C_ig,

wherein e refers to a gene region;

i refers to each sample in the sample set;

R_eig is the first correction factor;

C_ei1 refers to the average coverage depth of the true gene;

C_ei2 refers to the average coverage depth of the pseudogene;

3. The method for detecting copy number variation of true and false genes according to claim 2, wherein the total copy number of the true genes and the false genes in each region of each sample to be detected is obtained by multiplying the scale factor of each sample to be detected in each region by 4.

4. The method for detecting copy number variation of true and false genes according to any one of claims 1 to 3, wherein a mutation in the true gene is judged to exist at the target difference site when the base pseudo copy number of the true gene is smaller than a predetermined threshold.

5. The method for detecting copy number variation of true and false genes according to any one of claims 1 to 3, wherein the true genes include CYP21A2 genes; and/or

6. The method for detecting copy number variation of true and false genes according to claim 5, wherein the number of the control samples is 20 or more; and/or

The number of the control genes is 1 or more.

7. An apparatus for detecting copy number variation of a true or false gene, comprising:

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1-6 when the computer program is executed.

9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, which computer instructions are read from the computer-readable storage medium by a processor of a computer device, which computer instructions are executed by the processor such that the computer device performs the method for detecting copy number variations of true and false genes according to any one of claims 1 to 6.