WO2025065483A1 - Genetic locus mutation prognostic risk assessment method, electronic device and storage medium - Google Patents
- Publication number
- WO2025065483A1 (application PCT/CN2023/122493)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- prognostic risk
- amino acid
- target
- mutation
- gene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
Definitions
- the present application relates to the fields of biomedical technology and artificial intelligence technology, and in particular to a prognostic risk assessment method for gene site mutations, an electronic device, and a storage medium.
- Assessing the prognostic risk caused by gene mutations can give doctors a reference for formulating clinical plans, but the prognostic risk assessment schemes in the related art cannot accurately assess that risk.
- the present disclosure provides a method for assessing the prognostic risk of gene site mutations, comprising: obtaining an initial data set and a control data set, wherein the initial data set is obtained based on a mutation in at least one gene site of a target gene, and the control data set is obtained based on the fact that none of the gene sites of the target gene have mutated; preprocessing the initial data set to obtain a target gene mutation data set; determining the prognostic risk value of each gene site in the target gene mutation data set using a proportional risk regression model; selecting a target gene site based on the prognostic risk value; and obtaining the prognostic risk type result of the target gene site using an integrated classification model.
- the above-mentioned initial data set includes multiple initial gene mutation data.
- the above-mentioned preprocessing of the above-mentioned initial data set to obtain the target gene mutation data set includes: based on the preset gene annotation requirements, data combing of the multiple initial gene mutation data to obtain multiple gene mutation data to be processed; using the preset gene annotation statement, annotating the multiple gene mutation data to be processed respectively to obtain multiple target gene mutation data, wherein each of the above-mentioned target gene mutation data includes multiple transcript information sets and multiple amino acid information corresponding to the above-mentioned gene site where the mutation occurs; and obtaining the above-mentioned target gene mutation data set based on the multiple above-mentioned target gene mutation data.
- obtaining the target gene mutation data set according to the plurality of target gene mutation data includes: acquiring target transcript information and target amino acid information associated with the target gene from a transcript database; filtering the plurality of transcript information and the plurality of amino acid information included in each of the target gene mutation data according to the target transcript information and the target amino acid information, so that only the target transcript information and the target amino acid information are retained among the plurality of transcript information and the plurality of amino acid information of each of the target gene mutation data; and generating the target gene mutation data set according to the plurality of target gene mutation data retaining the target transcript information and the target amino acid information.
- the above-mentioned use of the proportional risk regression model to determine the prognostic risk value of each gene site in the above-mentioned target gene mutation data set includes: based on the above-mentioned target gene mutation data set, determining the sub-mutation data corresponding to each of the above-mentioned gene sites; using the proportional risk regression model to perform a prognostic risk quantitative analysis between each of the above-mentioned sub-mutation data and the above-mentioned control data set to obtain the prognostic risk value of the above-mentioned gene site.
- the target gene is linked to the target gene mutation data set, the target gene includes m amino acid sites, each of the amino acid sites corresponds to at least one gene site where a mutation occurs, and m is a positive integer; based on the target gene mutation data set, determining the sub-mutation data corresponding to each of the gene sites includes: sliding a sliding window on the target gene, wherein the center point of the sliding window is any one of the m amino acid sites, the size of the sliding window along the extension direction of the target gene is associated with the distance between k adjacent amino acid sites along the extension direction of the target gene, and k is a positive integer; when the sliding window slides to any one of the amino acid sites, based on the size of the sliding window along the extension direction of the target gene, k amino acid sites are selected around the amino acid site as the center to obtain the target amino acid site; and the gene mutation data in the target gene mutation data set linked to the target amino acid site is used as the sub-mutation data.
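- The sliding-window selection described above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: it assumes that amino acid sites are numbered 1 to m, that the window covers k sites in total around the center, and that the window is clipped at the ends of the gene. The names `window_sites` and `sub_mutation_data` and the record layout are hypothetical.

```python
from typing import Dict, List


def window_sites(center: int, k: int, m: int) -> List[int]:
    """Return the k amino acid sites around `center`, clipped to the range 1..m."""
    if k < 1 or not 1 <= center <= m:
        raise ValueError("require k >= 1 and 1 <= center <= m")
    start = max(1, center - (k - 1) // 2)
    end = min(m, center + k // 2)
    return list(range(start, end + 1))


def sub_mutation_data(records: List[Dict], center: int, k: int, m: int) -> List[Dict]:
    """Collect the mutation records whose amino acid position falls in the window."""
    sites = set(window_sites(center, k, m))
    return [r for r in records if r["aa_pos"] in sites]


# Hypothetical records: one row per user with the mutated amino acid position.
records = [
    {"user": "P-1", "aa_pos": 273},
    {"user": "P-2", "aa_pos": 275},
    {"user": "P-3", "aa_pos": 248},
]
print(window_sites(273, 5, 393))  # [271, 272, 273, 274, 275]
print([r["user"] for r in sub_mutation_data(records, 273, 5, 393)])  # ['P-1', 'P-2']
```

Pooling the records of neighboring sites in this way gives each center site more samples for the subsequent regression than the site would have on its own.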
- the prognostic risk value of each of the above-mentioned gene sites is characterized by the prognostic risk value of the amino acid site associated with the above-mentioned gene site, and the above-mentioned sub-mutation data is associated with the above-mentioned amino acid site;
- the above-mentioned use of the proportional risk regression model to perform a quantitative prognostic risk analysis between each of the above-mentioned sub-mutation data and the above-mentioned control data set to obtain the prognostic risk value of the above-mentioned gene site includes: inputting the above-mentioned sub-mutation data into the prognostic risk function to obtain the prognostic risk function value corresponding to each of the above-mentioned amino acid sites; and taking the ratio between the prognostic risk function value of each of the above-mentioned amino acid sites and the benchmark prognostic risk function value as the prognostic risk value of each of the above-mentioned amino acid sites
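- Under the usual Cox proportional hazards formulation (an assumption here, since the text names only a proportional risk regression model and a prognostic risk function), the ratio described above can be written as follows. The symbols are illustrative: $h_0(t)$ is the benchmark (baseline) function estimated for the control data set, $x_j \in \{0, 1\}$ indicates whether a user carries a mutation in the sub-mutation data of amino acid site $j$, and $\beta_j$ is the fitted coefficient.

```latex
h_j(t \mid x_j) = h_0(t)\,\exp(\beta_j x_j),
\qquad
\mathrm{HR}_j = \frac{h_j(t \mid x_j = 1)}{h_0(t)} = \exp(\beta_j).
```

Under this reading, a value above 1 indicates a worse prognosis than the control group and a value below 1 a better one.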
- the method further includes: drawing a mutation prognosis risk value scatter plot based on the prognosis risk value of each of the amino acid sites; and displaying the mutation prognosis risk value scatter plot using a visualization component.
- the above-mentioned selecting the target gene site according to the above-mentioned prognostic risk value includes: according to the prognostic risk value of each of the above-mentioned amino acid sites, selecting the target amino acid site from the amino acid sites associated with the above-mentioned gene site, wherein the above-mentioned target amino acid site includes the above-mentioned target gene site.
- the above-mentioned selecting a target amino acid site from the amino acid sites associated with the above-mentioned gene site based on the prognostic risk value of each of the above-mentioned amino acid sites includes: calculating the credibility associated with the prognostic risk value of each of the above-mentioned amino acid sites; comparing that credibility with a credibility threshold to identify the prognostic risk values whose credibility is less than the credibility threshold; and determining, from the amino acid sites associated with the above-mentioned gene site, the amino acid sites associated with those prognostic risk values, to obtain the above-mentioned target amino acid site.
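- The threshold comparison can be illustrated with a short Python sketch. The text does not say how credibility is computed, so the sketch simply assumes a credibility score per site is already available; the function name, the data layout and the example numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def select_target_sites(
    site_stats: Dict[int, Tuple[float, float]], credibility_threshold: float
) -> List[int]:
    """Return the sites whose credibility is below the threshold, in site order."""
    return sorted(
        site
        for site, (_risk_value, credibility) in site_stats.items()
        if credibility < credibility_threshold
    )


# Hypothetical values: site -> (prognostic risk value, credibility score).
site_stats = {175: (1.8, 0.99), 248: (1.5, 0.97), 279: (2.4, 0.60)}
print(select_target_sites(site_stats, credibility_threshold=0.95))  # [279]
```

The selected sites are the ones passed on to the qualitative analysis, because their risk values alone are not reliable enough.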
- the above-mentioned prognostic risk type result of the above-mentioned target gene site obtained by using the integrated classification model includes: for the prognostic risk caused by the mutation of the above-mentioned target amino acid site, the above-mentioned integrated classification model is used to perform a qualitative analysis of the prognostic risk type to obtain the above-mentioned prognostic risk type result.
- the above-mentioned prognostic risk caused by the mutation of the above-mentioned target amino acid site is qualitatively analyzed by using the above-mentioned integrated classification model to obtain the above-mentioned prognostic risk type result, which includes: based on the above-mentioned target amino acid site, extracting the gene mutation data associated with the mutation of the above-mentioned target amino acid site from the above-mentioned target gene mutation data set to obtain the mutation data of the prognostic risk type to be analyzed; based on the mutation data of the above-mentioned prognostic risk type to be analyzed, using the above-mentioned integrated classification model to predict the degree of match between the prognostic risk caused by the mutation of the above-mentioned target amino acid site and each of the N prognostic risk types, to obtain N predicted matching values corresponding to the N above-mentioned prognostic risk types, N is a positive integer, and the prognostic risk degrees between the N
- the above-mentioned integrated classification model is used to predict the degree of match between the prognostic risk caused by the mutation at the above-mentioned target amino acid site and each of the N prognostic risk types, and N predicted matching values corresponding to the N above-mentioned prognostic risk types are obtained, including: for each of the N prognostic risk types: using the integrated classification model to analyze the prognostic risk caused by the mutation at the above-mentioned target amino acid site and the degree of match between the above-mentioned prognostic risk type, to obtain multiple prediction values, wherein the above-mentioned integrated classification model includes multiple classifiers, each classifier can obtain one of the above-mentioned prediction values; using the gradient boosting classifier to process the multiple above-mentioned prediction values to obtain the above-mentioned prediction matching value.
- determining the above-mentioned prognostic risk type result based on N predicted matching values includes: selecting the highest predicted matching value from the N predicted matching values by performing numerical comparison between the above-mentioned N predicted matching values; and taking the prognostic risk type corresponding to the above-mentioned highest predicted matching value as the above-mentioned prognostic risk type result.
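- Selecting the risk type with the highest predicted matching value amounts to an arg-max over the N values, as in the Python sketch below. The risk type labels and scores are hypothetical, and the tie-breaking rule (the first type listed wins) is a choice of this sketch, not something stated in the text.

```python
from typing import Dict


def pick_risk_type(predicted_matching: Dict[str, float]) -> str:
    """Return the risk type with the highest predicted matching value."""
    if not predicted_matching:
        raise ValueError("at least one prognostic risk type is required")
    return max(predicted_matching, key=predicted_matching.get)


# Hypothetical matching values for N = 2 prognostic risk types.
scores = {"high risk": 0.71, "low risk": 0.29}
print(pick_risk_type(scores))  # high risk
```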
- the method further includes: acquiring the prognostic risk type result; and pushing the prognostic risk type result to the target object, so that the target object generates a clinical decision according to the prognostic risk type result.
- the above-mentioned integrated classification model is trained in the following manner: obtaining a sample data set, wherein the above-mentioned sample data set includes first prognostic risk level data and second prognostic risk level data, and the above-mentioned first prognostic risk level is higher than the above-mentioned second prognostic risk level; randomly splitting the above-mentioned sample data set into an initial training sample set and a verification sample set, wherein the ratio of the number of data in the above-mentioned initial training sample set to the number of data in the above-mentioned test sample data set is a preset ratio; oversampling the above-mentioned initial training sample set so that the number of the above-mentioned first prognostic risk level data in the above-mentioned initial training sample set is the same as the number of the above-mentioned second prognostic risk level data, and obtaining a target training sample set; cross-validation training is performed on the initial integrated classification model
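- The random split and the oversampling step can be sketched in Python with the standard library only. This is an illustration under assumptions: labels are 1 for the first (higher) risk level and 0 for the second, oversampling is done by duplicating randomly chosen minority samples, and the names `split_samples` and `oversample` are hypothetical.

```python
import random
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (features, label); label 1 = first (higher) risk level


def split_samples(
    samples: List[Sample], train_fraction: float, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Randomly split samples into a training set and a held-out set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def oversample(train: List[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate random minority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    by_label = {0: [s for s in train if s[1] == 0], 1: [s for s in train if s[1] == 1]}
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        raise ValueError("both risk levels must be present in the training set")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra


# Hypothetical data: 8 lower-risk and 2 higher-risk samples.
samples = [([float(i)], 0) for i in range(8)] + [([100.0], 1), ([101.0], 1)]
train, held_out = split_samples(samples, train_fraction=0.8)
print(len(train), len(held_out))  # 8 2

# Oversampling is shown on the full list so that the class counts are predictable.
balanced = oversample(samples)
print(sum(y for _, y in balanced), sum(1 - y for _, y in balanced))  # 8 8
```

In the flow described above, oversampling is applied only to the training part of the split, so the held-out samples keep their original class balance.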
- the cross-validation training of the initial integrated classification model using the target training sample set includes: randomly dividing the target training sample set into Q training sample subsets, where Q is a positive integer; repeatedly performing the following operations until the preset training end condition is met: selecting Q-1 of the Q training sample subsets to train the initial integrated classification model to obtain an intermediate integrated classification model for the next round of training, wherein a different combination of Q-1 training sample subsets is selected in each training round; and using the remaining 1 of the Q training sample subsets to test the intermediate integrated classification model for the next round of training, wherein a different training sample subset is used in each test.
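- The way the Q subsets rotate between training and testing can be sketched in Python as below. Model training itself is omitted; the generator only produces the (Q-1 subsets, 1 subset) pairs, and the name `q_fold_splits` is hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def q_fold_splits(
    samples: Sequence[T], q: int, seed: int = 0
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training subset, test subset) pairs, holding out each fold once."""
    if not 2 <= q <= len(samples):
        raise ValueError("require 2 <= q <= number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::q] for i in range(q)]  # Q roughly equal subsets
    for held_out in range(q):
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield train, folds[held_out]


for train_part, test_part in q_fold_splits(range(10), q=5):
    print(len(train_part), len(test_part))  # 8 2 on every round
```

Each subset is used for testing exactly once, so every sample contributes to both training and testing over the Q rounds.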
- the sample data set includes multiple sample data
- the types of the sample data include at least one of the following: data on the impact of the mutated target amino acid site on protein function, data on the clinical impact of the mutated target amino acid site, and position data of the mutated target amino acid site.
- the above classifier includes at least one of the following: a random forest classifier, a Gini index random tree classifier, an entropy random tree classifier, and a gradient boosting classifier.
- the present disclosure also provides an electronic device, comprising: one or more processors; a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors execute the above-mentioned prognostic risk assessment method for gene site mutations.
- the present disclosure also provides a computer-readable storage medium having executable instructions stored thereon, which, when executed by a processor, causes the processor to execute the above-mentioned prognostic risk assessment method for gene site mutations.
- the present disclosure further provides a computer program product, including a computer program, which implements the above-mentioned prognostic risk assessment method for gene site mutation when executed by a processor.
- FIG. 1 is a flow chart of a method for assessing the prognostic risk of gene site mutations according to an embodiment of the present disclosure.
- FIG. 2 is a schematic diagram of an initial data set according to an embodiment of the present disclosure.
- FIG. 3 is a schematic diagram of a control data set according to an embodiment of the present disclosure.
- FIG. 4 is a schematic diagram of gene mutation data to be processed according to an embodiment of the present disclosure.
- FIG. 5 shows data after gene re-annotation according to an embodiment of the present disclosure.
- FIG. 6 is an example diagram of a target gene mutation dataset according to an embodiment of the present disclosure.
- FIG. 7 is a schematic diagram of collecting data using a sliding window according to an embodiment of the present disclosure.
- FIG. 8 is a scatter plot of mutation prognostic risk values according to an embodiment of the present disclosure.
- FIG. 9 is a schematic diagram of an architecture for obtaining a predicted matching value using an integrated classification model according to an embodiment of the present disclosure.
- FIG. 10A is a schematic diagram of data obtained by using some tools to predict amino acid site mutations according to an embodiment of the present disclosure.
- FIG. 10B is a schematic diagram of data obtained by using some tools to predict amino acid site mutations according to another embodiment of the present disclosure.
- FIG. 11A is a schematic diagram of clinical effects of amino acid site mutations according to an embodiment of the present disclosure.
- FIG. 11B is a schematic diagram of clinical effects of amino acid site mutations according to another embodiment of the present disclosure.
- FIG. 12A is a ROC curve of the integrated classification model of an embodiment of the present disclosure.
- FIG. 12B is a ROC curve of the related art model.
- FIG. 13 shows a method for assessing the prognostic risk of gene site mutations according to another embodiment of the present disclosure.
- FIG. 14 is a block diagram of an electronic device suitable for implementing an image processing method according to an embodiment of the present disclosure.
- the technical solution in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings in the embodiments of the present disclosure.
- the described embodiments are part of the embodiments of the present disclosure, not all of the embodiments. Based on the described embodiments of the present disclosure, all other embodiments obtained by ordinary technicians in this field without creative work belong to the protection of the present disclosure.
- It should be noted that throughout the accompanying drawings, the same elements are represented by the same or similar reference numerals.
- connection may refer to two components being directly connected, or may refer to two components being connected via one or more other components.
- the two components may be connected or coupled via a wired or wireless manner.
- the collection, storage, use, processing, transmission, provision, disclosure and application of the data involved (such as but not limited to user personal information), the process of obtaining the initial data set about the user, the process of obtaining the reference data set about the user, etc., all comply with the provisions of relevant laws and regulations, take necessary confidentiality measures, and do not violate public order and good morals.
- Somatic mutations in cancer are interpreted through the following scheme: 5234 somatic mutations from the user's clinical report database are obtained as training sets and validation sets, and 6226 mutations are obtained as test sets through literature retrieval; using functional and clinical evidence, the semi-supervised generative adversarial network (SGAN) method is used to predict the carcinogenicity of mutations; somatic mutations are classified into 4 categories: strong clinical significance, potential clinical significance, uncertain clinical significance, and benign/likely benign.
- SGAN semi-supervised generative adversarial network
- Prognosis refers to the prediction of the development process and final outcome of a disease after measures are taken.
- the embodiments of the present disclosure provide a method for assessing the prognostic risk of gene site mutations, which analyzes the impact of each gene site mutation on the prognostic risk of cancer based on the prognostic risk, and improves the accuracy of assessing the prognostic risk of gene mutations.
- FIG. 1 is a flow chart of a method for assessing the prognostic risk of gene site mutations according to an embodiment of the present disclosure.
- The method for assessing the prognostic risk of gene site mutation may include operations S110 to S150.
- In operation S110, an initial data set and a control data set are obtained, wherein the initial data set is obtained based on at least one gene site of the target gene being mutated, and the control data set is obtained based on no gene sites of the target gene being mutated.
- In operation S120, the initial data set is preprocessed to obtain a target gene mutation data set.
- In operation S130, a proportional hazard regression model is used to determine the prognostic risk value of each gene site in the target gene mutation data set.
- In operation S140, a target gene locus is selected according to the prognostic risk value.
- In operation S150, the prognostic risk type result of the target gene site is obtained using the integrated classification model.
- the database may include at least one of the following: the public ICGC (International Cancer Genome Consortium) database and the public MSK (Memorial Sloan Kettering Cancer Center) database.
- the TP53 gene may include 393 amino acid sites, each amino acid site corresponds to 3 bases, and a mutation in a gene site may refer to a mutation in a base, and the mutation in the gene site will cause the amino acid corresponding to the base to change, and the change in the amino acid will bring about prognostic risk. Therefore, it is necessary to perform a prognostic risk assessment on the prognostic risk caused by the mutation of each gene site on the gene, or on the mutation of each amino acid site.
- The initial data set and the control data set can be used to evaluate the prognostic risk value caused by each gene site mutation or each amino acid site mutation on the target gene (such as TP53). For a prognostic risk value with low credibility, the clinical significance, that is, the prognostic risk level, of the corresponding amino acid site mutation can then be predicted. It is understood that the prognostic risk assessment method according to the embodiment of the present disclosure can be used to assess the prognostic risk brought about by mutations at any other gene site or any amino acid site.
- FIG. 2 is a schematic diagram of an initial data set according to an embodiment of the present disclosure.
- the initial data set may include multiple initial gene mutation data, and the initial gene mutation data is obtained based on the mutation of at least one gene site of the target gene.
- the multiple initial gene mutation data may include clinical data and missense mutation data.
- Each row in Figure 2 can be regarded as the initial gene mutation data of a user.
- the clinical data may refer to the user's number, the user's survival status, the user's survival time, etc.
- the missense mutation data may include the mutant gene to be analyzed, the chromosome number, the starting position of the gene site, the ending position of the gene site, the reference gene, and the mutant gene.
- P- in the number column is used to represent the user ID of the initial data set, and n can represent the number of users, where n is a positive integer.
- The “+” in the chain column denotes the positive strand, that is, the DNA strand that carries the nucleotide sequence encoding the amino acid information of the protein.
- The positive strand is also called the coding strand, sense strand, or plus strand (+ strand).
- The reference base at gene site 7577121 should be G (guanine), but because of a gene mutation it has changed to the mutant base A (adenine); these are the values given in the reference gene and mutant gene columns.
- the units of measurement for survival time in Figure 2 can be years, months, days, hours, etc., and can be represented by the symbol t or time.
- the distribution of survival time is generally not a normal distribution. Survival time can be the time between the starting event and the ending event.
- the starting event can be diagnosis, medication, surgical resection, etc.
- the ending event can be recovery, death, remission, recurrence, etc.
- FIG. 3 is a schematic diagram of a control data set according to an embodiment of the present disclosure.
- The control data set refers to data obtained when no gene locus on the target gene (such as TP53) is mutated, and it serves as the control for the initial gene mutation data.
- The control data set can also be obtained from at least one of the public ICGC database and the public MSK database.
- Each row in Figure 3 can be regarded as the control data of a user.
- the control data may include the user's number, the user's survival status, and the user's survival time.
- D- is used for the user ID of the control data set.
- the process of preprocessing multiple initial gene mutation data may include data combing of the initial gene mutation data, gene re-annotation, and the like, and the target gene mutation data set obtained after preprocessing may be a data set including transcript and amino acid information.
- The target gene locus may be a gene locus for which a reliable clinical prognostic risk result cannot be obtained from the prognostic risk value alone, for example because the prognostic risk value was derived from a small amount of sample data and therefore has low credibility.
- Such gene loci need further qualitative prognostic risk analysis with the integrated classification model to obtain a reliable clinical prognostic risk result.
- Analysis of the target amino acid loci can be used in place of analysis of the target gene loci.
- the process of selecting a target gene locus from a gene locus according to a prognostic risk value may include selecting a target gene locus from a gene locus according to conditions such as the amount of sample data based on which the prognostic risk value is obtained is less than a preset sample data amount threshold, and/or the credibility is less than a credibility threshold.
- a target amino acid locus including a target gene locus may also be selected from an amino acid locus according to conditions such as the amount of sample data based on which the amino acid prognostic risk value is obtained is less than a preset sample data amount threshold, and/or the credibility is less than a credibility threshold.
- In the embodiments of the present disclosure, an initial data set and a control data set are obtained; the initial data set is preprocessed to obtain a target gene mutation data set; a proportional risk regression model is used to determine the prognostic risk value of each gene site in the target gene mutation data set; a target gene site is selected according to the prognostic risk value; and an integrated classification model is used to obtain the prognostic risk type result of the target gene site. In this way, the prognostic risk of mutated gene sites is assessed quantitatively, yielding a mutation prognostic risk value for each gene site, and target gene sites whose prognostic risk value has low credibility and whose clinical significance is unknown are assessed qualitatively to obtain their prognostic risk type.
- The prognostic risk assessment method for gene site mutations can therefore not only quantify the prognostic risk of gene mutations at a finer granularity, but also qualitatively analyze target gene sites of unknown clinical significance. This at least partially overcomes the inaccurate assessment of the prognostic risk of gene mutations in the related art, improves the accuracy of the assessment, and gives doctors more comprehensive reference information for formulating clinical plans.
- operation S120 may include the following operations: based on the preset gene annotation requirements, a plurality of initial gene mutation data are combed to obtain a plurality of gene mutation data to be processed; using a preset gene annotation statement, a plurality of gene mutation data to be processed are annotated and processed respectively to obtain a plurality of target gene mutation data, wherein each target gene mutation data includes a plurality of transcript information sets and a plurality of amino acid information corresponding to the gene site where the mutation occurs; and a target gene mutation data set is obtained based on a plurality of target gene mutation data.
- FIG. 4 is a schematic diagram of gene mutation data to be processed according to an embodiment of the present disclosure.
- FIG. 5 is data after gene re-annotation according to an embodiment of the present disclosure.
- the preset gene annotation requirements may refer to the need to retain the chromosome number column, gene site start position column, gene site end position column, reference gene column, and mutant gene column in the gene mutation data.
- The obtained multiple gene mutation data to be processed can be as shown in FIG. 4.
- Multiple custom columns can also be set according to the implementation situation.
- FIG. 3 only takes as an example the case where custom column 1 is the chain, with "1" in the chain column representing the positive chain, and custom column 2 is the user number.
- the annotation tool can be adaptively adjusted according to actual needs.
- the embodiments of the present disclosure are described by taking the Annovar annotation tool as an example.
- Multiple gene mutation data to be processed are input into the annotation tool, and the gene annotation statement annotate_variation.pl -geneanno -dbtype refGene -buildver hg19 example/ex1.avinput humandb/ is used to perform gene re-annotation on them.
- -buildver hg19 indicates that hg19 is used as the reference genome build.
- a transcript is one or more mature mRNAs (messenger Ribonucleic Acid, messenger RNA) that can encode proteins formed by transcription of a gene.
- N can represent a nonsynonymous mutation type.
- S can represent a stoploss mutation type.
- F can represent a frameshift mutation type.
- the third column in Figure 5 will contain multiple transcript information and corresponding amino acid information, and each row in Figure 5 may also contain multiple transcript information and corresponding amino acid information.
- the information included may be ENST00000642122.1_4: exon10: c.G1641A: p.M547I.
- p.M547I may be the amino acid information corresponding to the transcript.
- The format of the initial gene mutation data can thus be made tidy and uniform, which improves the efficiency and accuracy of the prognostic risk assessment of gene mutations.
- obtaining a target gene mutation data set based on multiple target gene mutation data may include the following operations: obtaining target transcript information and target amino acid information associated with the target gene from a transcript database; filtering the multiple transcript information and multiple amino acid information included in each target gene mutation data based on the target transcript information and the target amino acid information, so that only the target transcript information and the target amino acid information are retained among the multiple transcript information and multiple amino acid information of each target gene mutation data; and generating a target gene mutation data set based on the multiple target gene mutation data that retain the target transcript information and the target amino acid information.
- the target transcript information may refer to the transcript information containing the most information obtained from a specified database.
- the target transcript information is generally the most classic transcript information, generally the most correct transcript information with a reliable source and verified by multiple parties.
- the target amino acid information may be the amino acid information in the target transcript information.
- FIG. 6 is an example diagram of a target gene mutation dataset according to an embodiment of the present disclosure.
- the target transcript information and target amino acid information are used to filter and screen each row of information in the third column shown in Figure 5, so that each row only retains the target transcript information and target amino acid information.
- the transcript information can be simplified, and only the reference amino acid information, the amino acid information after mutation, and the mutant amino acid position are extracted, and the target gene mutation data set obtained can be shown in Figure 6.
- the reference amino acid information and the amino acid information after mutation can refer to the amino acid information in the fifth column;
- the mutant amino acid position can refer to the mutation position of the amino acid site in the sixth column.
- taking the mutant amino acid p.Y279C as an example, the reference amino acid tyrosine (Y) at amino acid position 279 is changed to the mutant amino acid cysteine (C).
- the MV of the consequence column in Figure 6 can represent a missense mutation (missense_variant).
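The extraction of the reference amino acid, the mutant amino acid position and the amino acid after mutation can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the function and class names are hypothetical, the annotation is assumed to be a colon-separated string like the example quoted earlier, and only simple one-letter substitutions such as p.Y279C are handled.

```python
import re
from typing import NamedTuple


class AminoAcidChange(NamedTuple):
    reference: str  # reference amino acid (one-letter code)
    position: int   # mutant amino acid position
    mutant: str     # amino acid after mutation (one-letter code)


# Assumption: only simple substitutions of the form "p.<ref><pos><alt>" are supported.
_PROTEIN_CHANGE = re.compile(r"p\.([A-Z])(\d+)([A-Z])$")


def parse_protein_change(protein: str) -> AminoAcidChange:
    """Parse a string such as 'p.Y279C' into (reference, position, mutant)."""
    match = _PROTEIN_CHANGE.match(protein.strip())
    if match is None:
        raise ValueError(f"unsupported protein change: {protein!r}")
    ref, pos, alt = match.groups()
    return AminoAcidChange(ref, int(pos), alt)


def parse_annotation(annotation: str) -> tuple[str, AminoAcidChange]:
    """Split 'transcript: exon: cDNA change: protein change' and parse the last field."""
    fields = [field.strip() for field in annotation.split(":")]
    return fields[0], parse_protein_change(fields[-1])


transcript, change = parse_annotation("ENST00000642122.1_4: exon10: c.G1641A: p.M547I")
```

Filtering by target transcript then reduces to keeping only the annotations whose first field equals the target transcript identifier.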
- the data set can be preprocessed to fill in missing values and delete abnormal values.
- the abnormal data in the survival status and survival time columns are processed: records with a negative survival time are deleted, as are garbled characters and non-numeric characters.
- for a missing survival time, it is also possible to fill in the duration obtained from the follow-up time, or to delete the user data corresponding to the missing value according to the survival status, etc.
- the specific situation can be adaptively adjusted according to actual needs.
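The cleaning rules above can be sketched as a small filter. The record layout and the field names `survival_time` and `follow_up_time` are assumptions made for illustration; the disclosure does not fix a schema, and filling from the follow-up time is only one of the options it mentions.

```python
import math


def clean_survival_records(records):
    """Keep records whose survival time is a finite, non-negative number.

    A missing survival time is filled from 'follow_up_time' when that is
    present; otherwise the record is dropped.
    """
    cleaned = []
    for row in records:
        raw = row.get("survival_time")
        if raw is None or str(raw).strip() == "":
            raw = row.get("follow_up_time")
            if raw is None:
                continue  # nothing to fill with: drop the record
        try:
            time = float(raw)
        except (TypeError, ValueError):
            continue  # garbled or non-numeric value
        if not math.isfinite(time) or time < 0:
            continue  # negative or otherwise abnormal survival time
        cleaned.append({**row, "survival_time": time})
    return cleaned
```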
- the multiple transcript information and the multiple amino acid information included in each target gene mutation data are filtered. Since the target transcript information can be the most correct transcript information verified by multiple parties, only the information associated with the target transcript information and the target amino acid information can be retained after filtering the multiple transcript information and the multiple amino acid information, thereby ensuring the accuracy of the information. The simplicity and accuracy of information can improve the efficiency and accuracy of prognostic risk assessment of gene mutations.
- a sliding window and a proportional-hazards model can be used to analyze the prognostic risk value after a single amino acid site mutation.
- a proportional-hazards model is also known as a Cox regression model.
- the above-mentioned data preprocessing stage is to obtain and process data with the gene site as the minimum granularity, but the mutation of the gene site will cause the amino acid corresponding to the gene site to change.
- the embodiments of the present disclosure analyze the prognostic risk value of the mutation in units of amino acid sites.
- operation S130 may include the following operations: based on the target gene mutation data set, determining the sub-mutation data corresponding to each gene site; using a proportional risk regression model to perform a quantitative prognostic risk analysis between each sub-mutation data and the control data set to obtain the prognostic risk value of the gene site.
- the sub-mutation data may be data corresponding to each gene site where a gene mutation occurs. Since the mutation of a gene site affects the corresponding amino acid, the sub-mutation data may also be data corresponding to an amino acid site where a gene site mutates. The sub-mutation data may be obtained by sliding a sliding window along the target gene chain.
- determining the sub-mutation data corresponding to each gene site may include the following operations: sliding a sliding window along the target gene, wherein the center point of the sliding window is any one of the m amino acid sites, the size of the sliding window along the extension direction of the target gene is associated with the distance spanned by k adjacent amino acid sites along that direction, and k is a positive integer; when the sliding window slides to any one amino acid site, selecting, based on the size of the sliding window, k amino acid sites centered on that amino acid site to obtain the target amino acid sites; and using the gene mutation data linked to the target amino acid sites in the target gene mutation data set as the sub-mutation data.
- a target gene and a target gene mutation dataset can be linked, the target gene includes m amino acid sites, each amino acid site corresponds to at least one gene site where a mutation occurs, and m is a positive integer.
- TP53 includes 393 amino acid sites, and m can be 393.
- FIG. 7 is a schematic diagram of collecting data using a sliding window according to an embodiment of the present disclosure.
- the size B of the sliding window 701 along the extension direction of the target gene can be the size including k adjacent amino acid sites, and FIG. 7 is shown by taking k equal to 3 as an example.
- B is k
- the area enclosed is a sliding window of $(i - k/2,\ i + k/2)$.
- the mutation data within the area enclosed by the sliding window are collected to obtain sub-mutation data.
- k can be adaptively adjusted according to actual needs.
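The window selection can be sketched as follows, assuming amino acid sites are numbered 1 to m, the window is centered on site i, and it covers the open interval (i - k/2, i + k/2), clipped at the ends of the protein. The function names and the `position` field are hypothetical.

```python
import math


def window_sites(i: int, k: int, m: int) -> list[int]:
    """Amino acid sites inside the open interval (i - k/2, i + k/2), clipped to 1..m."""
    low = max(1, math.floor(i - k / 2) + 1)   # smallest integer strictly above i - k/2
    high = min(m, math.ceil(i + k / 2) - 1)   # largest integer strictly below i + k/2
    return list(range(low, high + 1))


def sub_mutation_data(mutations, i: int, k: int, m: int):
    """Collect the mutation records whose amino acid position falls in the window at i."""
    sites = set(window_sites(i, k, m))
    return [record for record in mutations if record["position"] in sites]
```

With k equal to 3, as in FIG. 7, the window at site 279 covers sites 278, 279 and 280; near the ends of the sequence the window is simply truncated, which is one possible boundary convention among several.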
- the control group may be the control group data of the target gene (e.g., TP53) as a whole.
- the control group data of each site are the same, and all are data in which no mutation occurs at any gene site of the target gene (e.g., TP53).
- the control group data of each sub-mutation data can be the same, and the control data are all obtained when there is no mutation in the gene site on the target gene chain.
- a prognostic risk value corresponding to each sub-mutation data can be obtained.
- the prognostic risk value corresponding to the sub-mutation data can be used as a prognostic risk value corresponding to the gene site where the gene mutation occurs, or as a prognostic risk value corresponding to the amino acid site where the gene site has mutated.
- the prognostic risk value of each gene locus is characterized by the prognostic risk value of the amino acid site associated with the gene locus, and the sub-mutation data is associated with the amino acid site.
- the process of obtaining the prognostic risk value of the gene locus by using a proportional risk regression model to perform a quantitative prognostic risk analysis between each sub-mutation data and a control data set may include the following operations: inputting the sub-mutation data into a prognostic risk function to obtain a prognostic risk function value corresponding to each amino acid site; and using the ratio between the prognostic risk function value of each amino acid site and the benchmark prognostic risk function value as the prognostic risk value of each amino acid site, wherein the benchmark prognostic risk function value is obtained by inputting the control data set into the prognostic risk function.
- a proportional risk regression model is used to analyze the effect of the amino acid site mutation on the prognostic risk.
- the proportional risk regression model is modeled based on the prognostic risk function, which can use censored data to analyze the prognostic risk rate of each factor for survival, but does not consider the distribution of survival time.
- the prognostic risk function may be as shown in formula (1).
- $h(t, X) = h_0(t)\exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p) \quad (1)$
- $h(t, X)$ is the prognostic risk function
- $h_0(t)$ is the baseline prognostic risk function, i.e., the prognostic risk function obtained when all variables are 0 at time t.
- $X_1, X_2, \dots, X_p$ can be covariates, influencing factors or prognostic factors, and $\beta_1, \beta_2, \dots, \beta_p$ are regression coefficients.
- $X_1, X_2, \dots, X_p$ can be associated with the fifth column in FIG.
- when a mutation occurs, X can be assigned a value of 1, and when no mutation occurs, X can be assigned a value of 0.
- the data corresponding to $X_1, X_2, \dots, X_p$ can also be adaptively adjusted according to actual needs.
- the process of calculating the prognostic risk value may be as shown in formula (2).
- HR represents the mutation prognostic risk value of each amino acid site.
- when HR is greater than 1, the prognostic risk rate increases, indicating that variable X is a risk factor.
- $h_0(t)$ can be the baseline prognostic risk function value obtained based on the control data set, that is, the prognostic risk function value at time t when all variables are taken as 0. The mutation prognostic risk value of the amino acid site can roughly correspond to the mutation prognostic risk value of the gene site.
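Formula (2) itself is not reproduced in this text. Assuming it is the ratio described above between the prognostic risk function value of an amino acid site and the baseline value obtained from the control data set, one form consistent with formula (1) is:

```latex
\mathrm{HR} = \frac{h(t, X)}{h_0(t)}
            = \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p) \quad (2)
```

Under this assumption, for a single indicator variable $X_j$ (1 when the mutation occurs, 0 otherwise) with all other variables held at 0, the ratio reduces to $\exp(\beta_j)$, so a positive coefficient gives HR above 1 and a negative coefficient gives HR below 1. The baseline $h_0(t)$ cancels, which is why the distribution of survival time does not need to be specified.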
- FIG. 8 is a scatter plot of mutation prognostic risk values according to an embodiment of the present disclosure.
- a mutation prognosis risk value scatter plot can be drawn based on the prognosis risk value of each amino acid site; and the mutation prognosis risk value scatter plot can be displayed using a visualization component, and the displayed mutation prognosis risk value scatter plot can be shown in Figure 8.
- the visualization component can include components such as Echarts (data visualization chart library) and Highcharts (chart library).
- the upper part of Figure 8 is a scatter plot of the mutation prognostic risk value of the amino acid site of the target gene (e.g., TP53), and the lower part is a bar graph of the number of samples in the sliding window corresponding to the amino acid site of the target gene (e.g., TP53).
- the X-axis represents the 393 amino acid sites of the target gene (e.g., TP53), the left Y-axis represents the logarithmic prognostic risk, and the blue dots represent the mutated amino acid sites with high credibility of the prognostic risk value (P value ≤ 0.05); dots of other colors represent amino acid sites of unknown clinical significance with low credibility (P value > 0.05).
- the gray lines represent the upper and lower bounds of the 95% confidence interval. The wider the interval, the lower the credibility.
- the right Y-axis represents the number of samples, and the red and blue bars on the right represent the P value. P is between 0 and 1. The larger the P is, the darker the red is. The smaller the P is, the darker the blue is.
- the prognostic risk of point mutations obtained by the proportional risk regression model analysis can be divided into red dots and blue dots, where blue dots are points with a sufficient number of samples and high credibility, and red dots are points with an insufficient number of samples and insufficient credibility. Therefore, a machine learning method is also required to discriminate the prognostic risk of the red dots and determine whether they are protective factors or harmful factors. That is, for the amino acid sites of unknown clinical significance in FIG. 8, a qualitative analysis can be further performed to determine whether their prognostic risk is a high prognostic risk, a low prognostic risk, or no prognostic risk, etc. When discriminating the prognostic risk, the machine learning method can discriminate the prognostic risk type of point mutations of unknown clinical significance based on features such as the effect of amino acid point mutations on protein function.
- the data of the target gene mutation data set and the control data set are analyzed and a mutation prognostic risk value scatter plot to be displayed is obtained.
- the prognostic risk value, 95% confidence interval and P value of each amino acid site can be obtained, which not only realizes the quantitative analysis of the prognostic risk of gene mutations, but also characterizes the prognostic risk value of amino acid site mutations in a finer granularity and more accurately.
- the visual display also facilitates relevant personnel to make clinical decisions based on the mutation prognostic risk value scatter plot.
- since the prognostic risk value of some gene loci cannot accurately reflect the clinical prognostic risk (for example, the amount of sample data obtained for some gene loci is small and the credibility is low), the clinically significant prognostic risk cannot be accurately obtained from the prognostic risk value of those gene loci, and it is necessary to further analyze them.
- the process of selecting the gene loci from multiple gene loci according to the prognostic risk value can include the following operations: according to the prognostic risk value of each amino acid site, select the target amino acid site from the amino acid sites associated with the gene site, wherein the target amino acid site includes the target gene site.
- the prognostic risk assessment method for gene site mutations may also include the following operations: calculating the credibility associated with the prognostic risk value of each amino acid site; comparing the credibility associated with each prognostic risk value with the credibility threshold to obtain the prognostic risk values whose credibility is less than the credibility threshold; and determining, from the amino acid sites associated with the gene sites, the amino acid sites associated with those prognostic risk values to obtain the target amino acid sites.
- the credibility associated with the prognostic risk value of each gene site may be calculated by calculating the credibility associated with the prognostic risk value of each amino acid site.
- the credibility associated with the prognostic risk value of the amino acid site can be obtained by using a tool for drawing a scatter plot.
- the target amino acid site can be an amino acid site of unknown clinical significance represented by a red point in Figure 8.
- the prognostic risk type may include high prognostic risk and low prognostic risk, etc.
- credibility can characterize the reliability of the prognostic risk value.
- by using the credibility threshold to screen the amino acid sites and taking the amino acid sites whose credibility is less than the credibility threshold as the target amino acid sites, the amino acid sites with low reliability can be singled out, so that the integrated classification model does not need to qualitatively analyze the prognostic risk of every amino acid site, thereby improving the efficiency of its qualitative analysis of the prognostic risk.
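The screening step can be sketched as below. How credibility is computed is not specified here, so this sketch assumes it is judged by the P value of each site as in the description of Figure 8, with P greater than 0.05 treated as low credibility; the field names and the default threshold are illustrative assumptions.

```python
def select_target_sites(site_results, p_threshold=0.05):
    """Return the positions whose prognostic risk value has low credibility.

    Assumption: a site is of low credibility when its P value exceeds p_threshold.
    """
    return [r["position"] for r in site_results if r["p_value"] > p_threshold]


site_results = [
    {"position": 175, "hazard_ratio": 1.8, "p_value": 0.01},
    {"position": 279, "hazard_ratio": 1.2, "p_value": 0.40},
    {"position": 300, "hazard_ratio": 0.7, "p_value": 0.05},
]
targets = select_target_sites(site_results)
```

The numbers above are made-up inputs for illustration only, not results from the disclosure. Only the selected positions would be passed on to the integrated classification model.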
- operation S150 may include the following operations: for the prognostic risk caused by the mutation of the target amino acid site, the prognostic risk type may be qualitatively analyzed using an integrated classification model to obtain a prognostic risk type result.
- since the prognostic risk value of some gene loci cannot accurately reflect the clinical prognostic risk (for example, the sample data of some gene loci is small and the credibility is low), the clinically significant prognostic risk cannot be accurately obtained from the prognostic risk value of the gene locus, and it is necessary to use an integrated classification model to further qualitatively analyze the prognostic risk of this gene locus or of the amino acid site associated with it.
- a prognostic risk type result is obtained, which can provide reference information for clinical decision-making.
- FIG. 9 is an architecture diagram of obtaining a predicted matching value using an integrated classification model according to an embodiment of the present disclosure.
- the integrated classification model may include multiple classifiers, and the classifier may include at least one of the following: a random forest classifier, a Gini index random tree classifier, an entropy random tree classifier, and a gradient boosting classifier.
- the first extreme random tree classifier may be a Gini index random tree classifier or an entropy random tree classifier
- the second extreme random tree classifier may also be a Gini index random tree classifier or an entropy random tree classifier, but the first extreme random tree classifier and the second extreme random tree classifier preferably use different types of random tree classifiers.
- the Gini index random tree classifier refers to a classifier that uses the Gini index as the criterion for deciding whether a node continues to split;
- the entropy random tree classifier refers to a classifier that uses entropy as the criterion for deciding whether a node continues to split.
- the loss function of the random forest classifier is the Gini index, as shown in formula (3).
- $p_k$ can represent the probability that the prognostic risk caused by the mutation of the target amino acid site belongs to the k-th prognostic risk type; the larger the Gini index, the greater the uncertainty; the smaller the Gini index, the smaller the uncertainty and the cleaner the data segmentation.
- the entropy of the extreme random tree can be expressed as formula (4).
- $p_k$ can represent the probability that the prognostic risk caused by the mutation of the target amino acid site belongs to the k-th prognostic risk type.
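Formulas (3) and (4) are not reproduced in this text. The sketch below uses the standard textbook definitions, Gini index 1 - Σ p_k² and entropy -Σ p_k log p_k, which are assumed (not confirmed) to match the disclosure; the base-2 logarithm is likewise an assumption.

```python
import math


def gini_index(probabilities):
    """Gini index: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)


def entropy(probabilities):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

For two prognostic risk types, both measures are largest when the two probabilities are equal and fall to zero when one type has probability 1, which is the sense in which a smaller value means a cleaner split.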
- Gradient boosting is a machine learning technique for regression, classification, and ranking tasks, which can be used in the embodiments of the present disclosure to derive the degree of match between the prognostic risk caused by the mutation at the target amino acid site and each prognostic risk type.
- the mutation data 901 of the prognostic risk type to be analyzed can be respectively input into the random forest classifier 902, the first extreme random tree classifier 903, the second extreme random tree classifier 904, and the gradient boosting classifier 905, and the multiple predicted values obtained can be respectively output by the random forest classifier 902, the first extreme random tree classifier 903, the second extreme random tree classifier 904, and the gradient boosting classifier 905.
- the first prediction value 906, the second prediction value 907, the third prediction value 908 and the fourth prediction value 909 are processed by another gradient boosting classifier 910 to obtain the final required prediction matching value 911.
- multiple prediction values can be obtained by combining multiple classifiers to obtain an integrated classification model; and the final prediction matching value is obtained by combining multiple prediction values, thereby improving the accuracy of prognostic risk type prediction.
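The two-level data flow of FIG. 9 can be sketched with stand-in callables. This only illustrates how the four predicted values feed a second-level model: the constant lambdas are placeholders for trained classifiers 902 to 905, and plain averaging stands in for the gradient boosting classifier 910. None of these stand-ins is the disclosed implementation.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def stacked_prediction(
    features: Features,
    base_models: Sequence[Callable[[Features], float]],
    meta_model: Callable[[Sequence[float]], float],
) -> float:
    """Feed the same features to every base model, then combine their outputs."""
    predicted_values = [model(features) for model in base_models]  # values 906..909
    return meta_model(predicted_values)                            # matching value 911


# Hypothetical stand-ins; real base models would be trained classifiers.
base_models = [
    lambda x: 0.8,  # random forest classifier 902
    lambda x: 0.6,  # first extreme random tree classifier 903
    lambda x: 0.7,  # second extreme random tree classifier 904
    lambda x: 0.9,  # gradient boosting classifier 905
]


def mean_of_values(values: Sequence[float]) -> float:
    """Placeholder for the second-level gradient boosting classifier 910."""
    return sum(values) / len(values)


matching_value = stacked_prediction([0.02, 25.3], base_models, mean_of_values)
```

In a trained system the second-level model would learn how much weight each base prediction deserves rather than averaging them equally.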
- determining a prognostic risk type result based on N predicted matching values may include the following operations: selecting the highest predicted matching value from the N predicted matching values by performing numerical comparisons between the N predicted matching values; and taking the prognostic risk type corresponding to the highest predicted matching value as the prognostic risk type result.
- the prognostic risk type is divided into high prognostic risk and low prognostic risk, and N is equal to 2 as an example.
- the predicted matching value of the high prognostic risk type and the predicted matching value of the low prognostic risk type are obtained; by comparing the predicted matching value of the high prognostic risk type and the predicted matching value of the low prognostic risk type, when the predicted matching value of the high prognostic risk type is greater than or equal to the predicted matching value of the low prognostic risk type, the prognostic risk type result is that the prognostic risk caused by the mutation of the target amino acid site belongs to the high prognostic risk; when the predicted matching value of the high prognostic risk type is less than the predicted matching value of the low prognostic risk type, the prognostic risk type result is that the prognos
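Selecting the result from N predicted matching values can be sketched as a maximum over a mapping. The type labels and the function name are hypothetical; listing the high prognostic risk type first makes a tie resolve to high risk, matching the "greater than or equal to" rule stated above.

```python
def prognostic_risk_type(matching_values: dict[str, float]) -> str:
    """Return the risk type with the highest predicted matching value.

    max() returns the first maximal key in insertion order, so on a tie the
    type listed first wins.
    """
    if not matching_values:
        raise ValueError("at least one predicted matching value is required")
    return max(matching_values, key=matching_values.get)


result = prognostic_risk_type({"high": 0.35, "low": 0.65})
```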
- in the embodiments of the present disclosure, by using an integrated classification model to perform a qualitative analysis of the prognostic risk of low-credibility prognostic risk values, it is possible to accurately discriminate the prognostic risk of new point mutations with low mutation frequency.
- the final determined prognostic risk type result can be made more representative, providing a more accurate reference for clinical decision-making.
- the following operations can be performed: obtaining the prognostic risk type result; and pushing the prognostic risk type result to a target object, so that the target object generates a clinical decision according to the prognostic risk type result.
- the target object can be a person who makes a clinical decision, or a decision system for generating a clinical decision.
- the efficiency of generating clinical decisions can be improved and the accuracy of clinical decisions can be increased.
- the following operations may be adopted: obtaining a sample data set, wherein the sample data set includes first prognostic risk level data and second prognostic risk level data, and the first prognostic risk level is higher than the second prognostic risk level; randomly splitting the sample data set into an initial training sample set and a validation sample set, wherein the ratio of the number of data in the initial training sample set to the number of data in the validation sample set is a preset ratio; oversampling the initial training sample set so that the number of the first prognostic risk level data in the initial training sample set is the same as the number of the second prognostic risk level data, thereby obtaining a target training sample set; cross-validating the initial integrated classification model using the target training sample set to obtain an intermediate integrated classification model; and using the intermediate integrated classification model that meets a preset training end condition as the integrated classification model, wherein the preset training end condition includes the number of training times reaching a preset training times threshold.
- the problem of unbalanced data distribution caused by random splitting of the sample data set can be avoided.
- by using the target training sample set to perform cross-validation training on the initial integrated classification model, the problem of low model training accuracy caused by unbalanced division of the target training sample set into training sample subsets and validation sample subsets can be eliminated, thereby improving the accuracy of the final integrated classification model.
- a sample data set may include multiple sample data, and the types of sample data may include at least one of the following: data on the impact of the mutated target amino acid site on protein function, data on the clinical impact of the mutated target amino acid site, and position data of the mutated target amino acid site.
- Figure 10A is a schematic diagram of data obtained by using some tools to predict amino acid site mutations according to an embodiment of the present disclosure
- Figure 10B is a schematic diagram of data obtained by using some tools to predict amino acid site mutations according to another embodiment of the present disclosure.
- the data on the effect of the target amino acid site where the mutation occurs on the protein function can be obtained by using a test tool to analyze the effect of the amino acid point mutation on the protein function.
- the test tools for predicting whether the amino acid point mutation is a harmful mutation or a neutral mutation can include SIFT_score, PolyPhen2_HDIV_score, PolyPhen2_HVAR_score, LRT_score, MutationTaster_score, MutationAssessor_score, FATHMM_score, PROVEAN_score, VEST3_score, CADD_raw, CADD_phred, DANN_score, MetaLR_score, integrated_fitCons_score, integrated_confidence_value, GERP++_RS, fathmm-MKL_coding_score, MetaSVM_score, phyloP7way_vertebrate, phyloP20way_mammalian, phastCons7way_
- the results measured by each test tool can be analyzed according to the result analysis method in the relevant technology.
- the test results of the SIFT_score test tool are between 0 and 1, and can be divided into two ranges when analyzing the test results, 0 to 0.05 and 0.05 to 1.
- when the test result is between 0 and 0.05, the mutation at this amino acid site is considered harmful and will cause changes in protein function.
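The interpretation of a SIFT score can be sketched as a simple threshold check. The function name is hypothetical, and treating a score of exactly 0.05 as harmful is an assumption, since the two ranges given above share that boundary.

```python
def sift_is_harmful(score: float, threshold: float = 0.05) -> bool:
    """Classify a SIFT score in [0, 1]: at or below the threshold counts as harmful."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT score must lie between 0 and 1, got {score}")
    return score <= threshold
```

Other tools in the list use different scales and directions, so each score needs its own interpretation rule before it is used as a feature.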
- the SMOTE (Synthetic Minority Oversampling Technique)
- the ratio of class 1 to class 0 data in the initial training sample set is 177:14
- the ratio of class 1 to class 0 data is 177:177.
- Oversampling can increase the amount of class 0 data by duplicating class 0 data, ultimately making the ratio of class 1 to class 0 data 177:177.
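Balancing by duplication, as described in the preceding sentence, can be sketched as follows. Note that this is plain random oversampling; SMOTE proper synthesizes new minority samples by interpolating between neighbors instead of copying them. The function name and the fixed seed are illustrative assumptions.

```python
import random


def oversample_by_duplication(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    deficit = majority_count - len(minority)
    if deficit > 0 and not minority:
        raise ValueError("no minority samples available to duplicate")
    extra = [rng.choice(minority) for _ in range(max(0, deficit))]
    return list(samples) + extra, list(labels) + [minority_label] * len(extra)


labels = [1] * 177 + [0] * 14          # class 1 : class 0 = 177 : 14
samples = list(range(len(labels)))     # placeholder feature rows
balanced_samples, balanced_labels = oversample_by_duplication(samples, labels, minority_label=0)
```

Oversampling is applied only to the training split, so the validation data keep their original class distribution.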
- FIG. 12B is a ROC curve of the related art model.
- the initial integrated classification model is cross-validated by using the target training sample set. Since the target training sample set is divided multiple times, the training sample subset and the validation sample subset used in each training are different. This can eliminate the adverse effects caused by the unbalanced division of the target training sample set in a single division, thereby improving the accuracy of the final integrated classification model.
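The repeated division of the target training sample set can be sketched as a k-fold index generator. The number of folds is not stated in this text, so k = 5 below is an arbitrary choice, and the function name is hypothetical.

```python
import random


def k_fold_splits(n_samples: int, k: int, seed: int = 0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_splits(354, k=5))
```

Each sample serves as validation data exactly once, so no single unlucky division determines the measured accuracy.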
- FIG. 13 is a method for assessing the prognostic risk of gene site mutations according to another embodiment of the present disclosure.
- an integrated classification model method is used to predict the prognostic risk of the unknown clinical significance site.
- the technical solution of the disclosed embodiment may include two parts.
- a sliding window and proportional risk regression model analysis method is used to achieve quantitative characterization of the prognostic risk of gene point mutations.
- a plurality of testing tools and an integrated classification model are used to achieve qualitative discrimination of the prognostic risk of gene point mutations with unknown prognostic risk.
- the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
- a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method as described above.
- a computer program product includes a computer program, and when the computer program is executed by a processor, the computer program implements the method as described above.
- FIG. 14 is a block diagram of an electronic device suitable for implementing a prognostic risk assessment method for gene site mutation according to an embodiment of the present disclosure.
- Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- Electronic devices can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
- the electronic device 1400 includes a computing unit 1401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a random access memory (RAM) 1403.
- various programs and data required for the operation of the electronic device 1400 can also be stored.
- the computing unit 1401, the ROM 1402, and the RAM 1403 are connected to each other via a bus 1404.
- An input/output (I/O) interface 1405 is also connected to the bus 1404.
- part or all of the computer program may be loaded and/or installed on the electronic device 1400 via ROM 1402 and/or communication unit 1409.
- when the computer program is loaded into RAM 1403 and executed by the computing unit 1401, one or more steps of the prognostic risk assessment method for gene site mutations described above may be performed.
- the computing unit 1401 may be configured to perform the prognostic risk assessment method for gene site mutations in any other appropriate manner (e.g., by means of firmware).
- Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
- the program code for implementing the method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data mining device, so that the program code, when executed by the processor or controller, implements the functions/operations specified in the flow chart and/or block diagram.
- the program code may be executed entirely on the machine, partially on the machine, as a stand-alone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
- a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical signal medium based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
- Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
- the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components.
- the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), and the Internet.
- a computer system may include a client and a server.
- the client and the server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises through computer programs running on respective computers and having a client-server relationship to each other.
- the server may be a cloud server, a server in a distributed system, or a server combined with a blockchain.
Abstract
Description
The present application relates to the fields of biomedical technology and artificial intelligence technology, and in particular to a prognostic risk assessment method for gene site mutations, an electronic device, and a storage medium.
Assessing the prognostic risk caused by gene mutations can provide a reference for doctors formulating clinical plans, but the prognostic risk assessment schemes in the related art cannot accurately assess the prognostic risk caused by gene mutations.
Summary of the invention
According to a first aspect, the present disclosure provides a method for assessing the prognostic risk of gene site mutations, comprising: obtaining an initial data set and a control data set, wherein the initial data set is obtained based on a mutation in at least one gene site of a target gene, and the control data set is obtained based on none of the gene sites of the target gene having mutated; preprocessing the initial data set to obtain a target gene mutation data set; determining a prognostic risk value of each gene site in the target gene mutation data set using a proportional hazards regression model; selecting a target gene site based on the prognostic risk value; and obtaining a prognostic risk type result of the target gene site using an integrated classification model.
For example, the initial data set includes multiple pieces of initial gene mutation data; preprocessing the initial data set to obtain the target gene mutation data set includes: organizing the multiple pieces of initial gene mutation data based on a preset gene annotation requirement to obtain multiple pieces of gene mutation data to be processed; annotating each of the multiple pieces of gene mutation data to be processed using a preset gene annotation statement to obtain multiple pieces of target gene mutation data, wherein each piece of target gene mutation data includes multiple transcript information sets and multiple pieces of amino acid information corresponding to the gene site where the mutation occurs; and obtaining the target gene mutation data set based on the multiple pieces of target gene mutation data.
For example, obtaining the target gene mutation data set based on the multiple pieces of target gene mutation data includes: acquiring target transcript information and target amino acid information associated with the target gene from a transcript database; filtering the multiple pieces of transcript information and the multiple pieces of amino acid information included in each piece of target gene mutation data according to the target transcript information and the target amino acid information, so that only the target transcript information and the target amino acid information are retained in each piece of target gene mutation data; and generating the target gene mutation data set from the multiple pieces of target gene mutation data that retain the target transcript information and the target amino acid information.
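The filtering step above can be pictured with a minimal Python sketch. The record layout, the field names (`annotations`, `transcript_id`, `amino_acid_change`), the transcript identifiers, and the choice to drop records that have no annotation on the target transcript are all hypothetical illustrations, not the annotation format of the disclosure.

```python
def keep_target_transcript(records, target_transcript):
    """Keep, for each annotated record, only the annotation on the target transcript.

    Records without any annotation on the target transcript are dropped
    (an assumption made for this sketch).
    """
    filtered = []
    for record in records:
        kept = [a for a in record["annotations"]
                if a["transcript_id"] == target_transcript]
        if kept:
            filtered.append({**record, "annotations": kept})
    return filtered


# Toy annotated records; identifiers and changes are made up.
records = [
    {"sample": "P-1", "annotations": [
        {"transcript_id": "TX_A", "amino_acid_change": "p.A10V"},
        {"transcript_id": "TX_B", "amino_acid_change": "p.A4V"},
    ]},
    {"sample": "P-2", "annotations": [
        {"transcript_id": "TX_B", "amino_acid_change": "p.G20S"},
    ]},
]

target_dataset = keep_target_transcript(records, "TX_A")
```

After filtering, every remaining record carries exactly one transcript annotation and one amino acid change, so each mutation maps to a single amino acid site in later steps.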
For example, determining the prognostic risk value of each gene site in the target gene mutation data set using the proportional hazards regression model includes: determining, based on the target gene mutation data set, sub-mutation data corresponding to each gene site; and performing a quantitative prognostic risk analysis between each piece of sub-mutation data and the control data set using the proportional hazards regression model to obtain the prognostic risk value of the gene site.
For example, the target gene is linked to the target gene mutation data set, the target gene includes m amino acid sites, each amino acid site corresponds to at least one mutated gene site, and m is a positive integer; determining, based on the target gene mutation data set, the sub-mutation data corresponding to each gene site includes: sliding a sliding window along the target gene, wherein the center point of the sliding window is any one of the m amino acid sites, the size of the sliding window along the extension direction of the target gene is associated with the distance spanned by k adjacent amino acid sites along that direction, and k is a positive integer; when the sliding window slides to any one of the amino acid sites, selecting, based on the size of the sliding window along the extension direction of the target gene, k amino acid sites around the amino acid site serving as the center to obtain target amino acid sites; and taking the gene mutation data in the target gene mutation data set that is linked to the target amino acid sites as the sub-mutation data.
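The sliding-window selection can be sketched as follows. This is one possible reading, not the disclosed implementation: it assumes the window holds k consecutive amino acid sites including the center, that k is odd, and that the window is clipped at the ends of the gene. The `aa_site` field name and the toy mutation list are hypothetical.

```python
def window_sites(center, k, m):
    """Return the 1-based amino acid sites covered by a window of k sites centered on `center`."""
    if k < 1 or k % 2 == 0:
        raise ValueError("this sketch assumes an odd, positive window size k")
    if not 1 <= center <= m:
        raise ValueError("center must be one of the m amino acid sites")
    half = k // 2
    start = max(1, center - half)   # clip at the first site
    end = min(m, center + half)     # clip at the last site
    return list(range(start, end + 1))


def sub_mutation_data(mutations, center, k, m):
    """Collect the mutation records linked to any site inside the window."""
    sites = set(window_sites(center, k, m))
    return [mut for mut in mutations if mut["aa_site"] in sites]


# Toy mutation records on a gene with m = 10 amino acid sites.
mutations = [
    {"sample": "P-1", "aa_site": 1},
    {"sample": "P-2", "aa_site": 3},
    {"sample": "P-3", "aa_site": 4},
    {"sample": "P-4", "aa_site": 9},
]

around_site_3 = sub_mutation_data(mutations, center=3, k=3, m=10)
```

Pooling neighboring sites in this way gives each center site more mutation records to work with than the site would have alone, which matters when individual sites are rarely mutated.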
For example, the prognostic risk value of each gene site is characterized by the prognostic risk value of the amino acid site associated with that gene site, and the sub-mutation data is associated with the amino acid site; performing the quantitative prognostic risk analysis between each piece of sub-mutation data and the control data set using the proportional hazards regression model to obtain the prognostic risk value of the gene site includes: inputting the sub-mutation data into a prognostic risk function to obtain a prognostic risk function value corresponding to each amino acid site; and taking the ratio of the prognostic risk function value of each amino acid site to a benchmark prognostic risk function value as the prognostic risk value of that amino acid site, wherein the benchmark prognostic risk function value is obtained by inputting the control data set into the prognostic risk function.
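The ratio described above matches the usual hazard-ratio reading of a Cox proportional hazards model. As a sketch under that assumption (the symbols $h_0$, $\beta_i$, and $x$ are introduced here for illustration and do not appear in the disclosure), let $x = 1$ for users in the sub-mutation data of amino acid site $i$ and $x = 0$ for users in the control data set:

```latex
h_i(t \mid x) = h_0(t)\,\exp(\beta_i x),
\qquad
\mathrm{HR}_i = \frac{h_i(t \mid x = 1)}{h_i(t \mid x = 0)} = \exp(\beta_i)
```

Under this reading, a value above 1 would suggest that mutations around site $i$ are associated with a higher risk than the control group, and a value below 1 with a lower risk.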
For example, the method further includes: drawing a scatter plot of mutation prognostic risk values based on the prognostic risk value of each amino acid site; and displaying the scatter plot using a visualization component.
For example, selecting the target gene site based on the prognostic risk value includes: selecting, according to the prognostic risk value of each amino acid site, a target amino acid site from the amino acid sites associated with the gene sites, wherein the target amino acid site includes the target gene site.
For example, selecting, according to the prognostic risk value of each amino acid site, the target amino acid site from the amino acid sites associated with the gene sites includes: calculating a credibility associated with the prognostic risk value of each amino acid site; comparing the credibility associated with each prognostic risk value with a credibility threshold to identify the prognostic risk values whose credibility is less than the credibility threshold; and determining, from the amino acid sites associated with the gene sites, the amino acid sites associated with those prognostic risk values to obtain the target amino acid site.
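The threshold comparison can be illustrated with a short Python sketch. How the credibility is computed is not specified at this point in the text, so the sketch simply takes a precomputed `credibility` number per site; the dictionary layout, the field names, and all values are made up for illustration.

```python
def select_target_sites(site_stats, credibility_threshold):
    """Return the amino acid sites whose risk-value credibility is below the threshold."""
    return sorted(
        site
        for site, stats in site_stats.items()
        if stats["credibility"] < credibility_threshold
    )


# Toy per-site results: prognostic risk value plus an associated credibility.
site_stats = {
    12: {"risk_value": 1.8, "credibility": 0.95},
    45: {"risk_value": 0.7, "credibility": 0.40},
    78: {"risk_value": 1.1, "credibility": 0.10},
}

target_sites = select_target_sites(site_stats, credibility_threshold=0.5)
```

The selected sites are the ones passed on to the classification step, since their quantitative risk values alone are not considered reliable enough.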
For example, obtaining the prognostic risk type result of the target gene site using the integrated classification model includes: performing, using the integrated classification model, a qualitative analysis of the prognostic risk type for the prognostic risk caused by a mutation at the target amino acid site to obtain the prognostic risk type result.
For example, performing the qualitative analysis of the prognostic risk type includes: extracting, based on the target amino acid site, the gene mutation data associated with mutation of the target amino acid site from the target gene mutation data set to obtain mutation data whose prognostic risk type is to be analyzed; predicting, based on that mutation data and using the integrated classification model, the degree of match between the prognostic risk caused by the mutation at the target amino acid site and each of N prognostic risk types to obtain N predicted matching values corresponding to the N prognostic risk types, wherein N is a positive integer and the N prognostic risk types differ from one another in degree of prognostic risk; and determining the prognostic risk type result according to the N predicted matching values.
For example, predicting the degree of match to obtain the N predicted matching values includes, for each of the N prognostic risk types: analyzing, using the integrated classification model, the degree of match between the prognostic risk caused by the mutation at the target amino acid site and the prognostic risk type to obtain multiple prediction values, wherein the integrated classification model includes multiple classifiers and each classifier produces one prediction value; and processing the multiple prediction values using a gradient boosting classifier to obtain the predicted matching value.
For example, determining the prognostic risk type result according to the N predicted matching values includes: selecting the highest predicted matching value from the N predicted matching values by numerical comparison among them; and taking the prognostic risk type corresponding to the highest predicted matching value as the prognostic risk type result.
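Choosing the result from the N predicted matching values amounts to taking the type with the largest value. A minimal sketch, with hypothetical type names and made-up values:

```python
def pick_risk_type(match_values):
    """Return the prognostic risk type with the highest predicted matching value."""
    if not match_values:
        raise ValueError("at least one prognostic risk type is required")
    # max() over the keys, ranked by their matching value;
    # on ties, the type listed first is returned.
    return max(match_values, key=match_values.get)


# Toy matching values for N = 2 risk types.
match_values = {"high prognostic risk": 0.72, "low prognostic risk": 0.28}
result = pick_risk_type(match_values)
```

How ties should be broken is not stated in the text; returning the first listed type is only the default behavior of this sketch.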
For example, the method further includes: acquiring the prognostic risk type result; and pushing the prognostic risk type result to a target object, so that the target object generates a clinical decision according to the prognostic risk type result.
For example, the integrated classification model is trained as follows: obtaining a sample data set, wherein the sample data set includes first prognostic risk level data and second prognostic risk level data, and the first prognostic risk level is higher than the second prognostic risk level; randomly splitting the sample data set into an initial training sample set and a validation sample set, wherein the ratio of the number of data items in the initial training sample set to the number of data items in the test sample data set is a preset ratio; oversampling the initial training sample set so that the number of first prognostic risk level data items in the initial training sample set equals the number of second prognostic risk level data items, to obtain a target training sample set; performing cross-validation training on an initial integrated classification model using the target training sample set to obtain an intermediate integrated classification model; and taking the intermediate integrated classification model that satisfies a preset training end condition as the integrated classification model, wherein the preset training end condition includes the number of training rounds reaching a preset training count threshold.
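The split and oversampling steps can be sketched in plain Python. The text does not say which oversampling technique is used, so this sketch assumes simple random duplication of minority-class samples; the `label` field, the `train_fraction` parameter, and the toy data are hypothetical.

```python
import random


def split_samples(samples, train_fraction, seed=0):
    """Randomly split samples into a training part and a held-out part."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def oversample(train, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    groups = {}
    for sample in train:
        groups.setdefault(sample["label"], []).append(sample)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Toy data: 8 lower-risk and 2 higher-risk samples (made-up values).
samples = [{"id": i, "label": "low"} for i in range(8)]
samples += [{"id": i, "label": "high"} for i in range(8, 10)]

train, held_out = split_samples(samples, train_fraction=0.8)
balanced_train = oversample(train)
```

Oversampling is applied only to the training part, so the held-out samples keep their original class proportions when the model is evaluated.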
For example, performing cross-validation training on the initial integrated classification model using the target training sample set includes: randomly dividing the target training sample set into Q equal training sample subsets, Q being a positive integer; and repeating the following operations until the preset training end condition is met: selecting Q-1 of the Q training sample subsets to train the initial integrated classification model to obtain an intermediate integrated classification model for the next round of training, wherein the Q-1 training sample subsets selected differ from one training round to the next; and testing the intermediate integrated classification model for the next round of training using the remaining one of the Q training sample subsets, wherein the training sample subset used differs from one test to the next.
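The way the Q subsets rotate between training and testing can be sketched as a small generator. The function name and the seed parameter are illustrative; model fitting itself is left out.

```python
import random


def q_fold_splits(samples, q, seed=0):
    """Yield (train, test) pairs; each of the Q subsets is held out exactly once."""
    if q < 2 or q > len(samples):
        raise ValueError("q must be between 2 and the number of samples")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Strided slicing gives Q subsets whose sizes differ by at most one.
    folds = [shuffled[i::q] for i in range(q)]
    for held_out in range(q):
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield train, folds[held_out]


splits = list(q_fold_splits(list(range(10)), q=5))
```

Each round trains on Q-1 subsets and tests on the remaining one, so every sample is used for testing once and for training Q-1 times.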
For example, the sample data set includes multiple pieces of sample data, and the types of sample data include at least one of the following: data on the effect of the mutated target amino acid site on protein function, data on the clinical effect of the mutated target amino acid site, and position data of the mutated target amino acid site.
For example, the classifiers include at least one of the following: a random forest classifier, a Gini index random tree classifier, an entropy random tree classifier, and a gradient boosting classifier.
According to a second aspect, the present disclosure further provides an electronic device, comprising: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above method for assessing the prognostic risk of gene site mutations.
According to a third aspect, the present disclosure further provides a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform the above method for assessing the prognostic risk of gene site mutations.
According to a fourth aspect, the present disclosure further provides a computer program product comprising a computer program that, when executed by a processor, implements the above method for assessing the prognostic risk of gene site mutations.
The above and other features of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for assessing the prognostic risk of gene site mutations according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an initial data set according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a control data set according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of gene mutation data to be processed according to an embodiment of the present disclosure;
FIG. 5 shows data after gene re-annotation according to an embodiment of the present disclosure;
FIG. 6 is an example diagram of a target gene mutation data set according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of collecting data using a sliding window according to an embodiment of the present disclosure;
FIG. 8 is a scatter plot of mutation prognostic risk values according to an embodiment of the present disclosure;
FIG. 9 is an architecture diagram of obtaining a predicted matching value using an integrated classification model according to an embodiment of the present disclosure;
FIG. 10A is a schematic diagram of data obtained by using some tools to predict amino acid site mutations according to an embodiment of the present disclosure;
FIG. 10B is a schematic diagram of data obtained by using some tools to predict amino acid site mutations according to another embodiment of the present disclosure;
FIG. 11A is a schematic diagram of data on the clinical effect of amino acid site mutations according to an embodiment of the present disclosure;
FIG. 11B is a schematic diagram of data on the clinical effect of amino acid site mutations according to another embodiment of the present disclosure;
FIG. 12A is a ROC curve of the integrated classification model of an embodiment of the present disclosure;
FIG. 12B is a ROC curve of a related-art model;
FIG. 13 shows a method for assessing the prognostic risk of gene site mutations according to another embodiment of the present disclosure; and
FIG. 14 is a block diagram of an electronic device suitable for implementing an image processing method according to an embodiment of the present disclosure.
In the drawings, the same or similar structures are identified by the same or similar reference numerals.
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the described embodiments without creative effort fall within the scope of protection of the present disclosure. It should be noted that throughout the drawings, the same elements are denoted by the same or similar reference numerals. In the following description, some specific embodiments are for descriptive purposes only and should not be construed as limiting the present disclosure; they are merely examples of embodiments of the present disclosure. Conventional structures or configurations are omitted where they might obscure the understanding of the present disclosure. It should be noted that the shapes and sizes of the components in the figures do not reflect actual sizes and proportions and merely illustrate the content of the embodiments of the present disclosure.
It should be noted that in the drawings, the sizes and relative sizes of elements may be exaggerated for clarity and/or descriptive purposes. Accordingly, the sizes and relative sizes of the elements are not necessarily limited to those shown in the drawings. In the specification and drawings, the same or similar reference numerals indicate the same or similar parts.
Unless otherwise defined, technical or scientific terms used in the embodiments of the present disclosure have the ordinary meanings understood by a person of ordinary skill in the art. The words "first", "second", and the like used in the present disclosure do not denote any order, quantity, or importance, and are used only to distinguish different components. "Include", "comprise", and similar words mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects.
In this document, unless otherwise specified, directional terms such as "upper", "lower", "left", "right", "inner", and "outer" indicate orientations or positional relationships based on the drawings. They are used only to facilitate the description of the present disclosure and do not indicate or imply that the device, element, or component referred to must have a specific orientation or be constructed or operated in a specific orientation. It should be understood that when the absolute position of the described object changes, the relative positional relationships these terms express may change accordingly. Therefore, these directional terms should not be construed as limiting the present disclosure.
In addition, in the description of the embodiments of the present disclosure, the term "connected" or "connected to" may mean that two components are directly connected, or that two components are connected via one or more other components. The two components may be connected or coupled in a wired or wireless manner.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the data involved (including but not limited to users' personal information), as well as the processes of obtaining the initial data set and the control data set about users, all comply with relevant laws and regulations, are subject to the necessary confidentiality measures, and do not violate public order and good morals.
According to one example, the clinical significance of somatic mutations in cancer is interpreted as follows: 5234 somatic mutations from a user clinical report database are obtained as the training set and validation set, and 6226 variants are obtained through literature retrieval as the test set; the oncogenicity of the mutations is predicted from functional and clinical evidence using a semi-supervised generative adversarial network (SGAN) method; and the somatic mutations are classified into four categories: strong clinical significance, potential clinical significance, uncertain clinical significance, and benign/likely benign.
The above example only interprets the clinical significance of somatic mutations in cancer. However, because clinical significance is not equivalent to prognostic risk, a clinical interpretation cannot accurately determine whether a somatic mutation is protective or harmful with respect to prognostic risk, which lowers the accuracy of assessing the prognostic risk of gene mutations. Prognosis refers to the prediction of the course and final outcome of a disease after measures have been taken.
In view of this, the embodiments of the present disclosure provide a method for assessing the prognostic risk of gene site mutations that starts from prognostic risk, analyzes the effect of each gene site mutation on the prognostic risk of cancer, and improves the accuracy of assessing the prognostic risk of gene mutations.
FIG. 1 is a flow chart of a method for assessing the prognostic risk of gene site mutations according to an embodiment of the present disclosure.
As shown in FIG. 1, the method for assessing the prognostic risk of gene site mutations may include operations S110 to S150.
In operation S110, an initial data set and a control data set are obtained, wherein the initial data set is obtained based on a mutation in at least one gene site of the target gene, and the control data set is obtained based on none of the gene sites of the target gene having mutated.
In operation S120, the initial data set is preprocessed to obtain a target gene mutation data set.
In operation S130, a proportional hazards regression model is used to determine the prognostic risk value of each gene site in the target gene mutation data set.
In operation S140, a target gene site is selected according to the prognostic risk value.
In operation S150, the prognostic risk type result of the target gene site is obtained using the integrated classification model.
According to an embodiment of the present disclosure, the database may include at least one of the following: the public ICGC (International Cancer Genome Consortium) database and the public MSK (Memorial Sloan Kettering Cancer Center) database.
According to an embodiment of the present disclosure, taking the TP53 gene as the target gene by way of example, the TP53 gene may include 393 amino acid sites, each corresponding to 3 bases. A mutation at a gene site may mean that one base has mutated; such a mutation changes the amino acid corresponding to that base, and the amino acid change brings prognostic risk. Therefore, a prognostic risk assessment is needed for the prognostic risk brought by a mutation at each gene site of the gene, or by a mutation at each amino acid site.
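The three-bases-per-amino-acid relationship can be made concrete with a small sketch. It works on 1-based positions within the coding sequence only; mapping genomic coordinates to coding positions (introns, strand, stop codon) is outside this sketch, and the function name is illustrative.

```python
def amino_acid_site(coding_base_position):
    """Map a 1-based base position in the coding sequence to its 1-based amino acid site."""
    if coding_base_position < 1:
        raise ValueError("coding positions are 1-based")
    return (coding_base_position - 1) // 3 + 1


# 393 amino acid sites correspond to 393 * 3 = 1179 coding bases.
last_site = amino_acid_site(393 * 3)
```

Bases 1 to 3 map to site 1, bases 4 to 6 to site 2, and so on, so a mutation at any of the three bases of a codon is attributed to the same amino acid site.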
根据本公开的实施例,初始数据集和对照数据集可以用于评估目标基因(例如TP53)上由于每一个基因位点突变,或每一个氨基酸位点突变所带来的预后风险值。以及对于可信度低的预后风险值,可以预测与该预后风险值对应的氨基酸位点突变对临床意义的 预后风险性高低。可以理解,根据本公开实施例的预后风险评估方法可以用于评估其他任意基因位点突变或任意氨基酸位点突变带来的预后风险。According to the embodiments of the present disclosure, the initial data set and the control data set can be used to evaluate the prognostic risk value caused by each gene site mutation or each amino acid site mutation on the target gene (such as TP53). And for the prognostic risk value with low credibility, the clinical significance of the amino acid site mutation corresponding to the prognostic risk value can be predicted. Prognostic risk level. It is understood that the prognostic risk assessment method according to the embodiment of the present disclosure can be used to assess the prognostic risk brought about by mutations at any other gene site or any amino acid site.
FIG. 2 is a schematic diagram of an initial data set according to an embodiment of the present disclosure.
As shown in FIG. 2, the initial data set may include multiple pieces of initial gene mutation data, each obtained based on a mutation at at least one gene site of the target gene. Specifically, the initial gene mutation data may include clinical data and missense mutation data. Each row in FIG. 2 can be regarded as the initial gene mutation data of one user. In each row, the clinical data may include the user's number, survival status, survival time, and so on. The missense mutation data may include the mutated gene to be analyzed, the chromosome number, the start position of the gene site, the end position of the gene site, the reference gene, and the mutant gene.
Continuing with FIG. 2, "P-" in the number column is the user identifier prefix of the initial data set, and n may denote the number of users, where n is a positive integer. The "+" in the strand column denotes the sense strand, that is, the DNA strand that carries the nucleotide sequence encoding the amino acid information of the protein. The sense strand is also called the coding strand, the meaningful strand, or the plus strand (+ strand).
Regarding the reference gene and the mutant gene in FIG. 2, take the first row as an example: the reference gene at gene site 7577121 should be G (guanine), but as a result of a gene mutation it has changed to the mutant gene A (adenine).
The survival time in FIG. 2 may be measured in years, months, days, hours, and so on, and may be denoted by the symbol t or time. The distribution of survival time is generally not a normal distribution. The survival time may be the duration between a starting event and a terminating event. The starting event may be diagnosis, medication, surgical resection, and so on, and the terminating event may be recovery, death, remission, recurrence, and so on.
FIG. 3 is a schematic diagram of a control data set according to an embodiment of the present disclosure.
As shown in FIG. 3, the control data set refers to data obtained when no gene site of the target gene (for example, TP53) is mutated, and serves as the control for the initial gene mutation data. The control data set may likewise be obtained from at least one of the public ICGC database and the public MSK database. Each row in FIG. 3 can be regarded as the control data of one user. Specifically, the control data may include the user's number, survival status, and survival time. In the user number column, "D-" is the user identifier prefix of the control data set.
According to an embodiment of the present disclosure, preprocessing the multiple pieces of initial gene mutation data may include data combing, gene re-annotation, and similar steps. The target gene mutation data set obtained after preprocessing may be a data set that includes transcript and amino acid information.
According to an embodiment of the present disclosure, the target gene site may be a gene site for which no reliable clinical prognostic risk result can be derived from the prognostic risk value, for example because the prognostic risk value is based on a small amount of sample data or has low credibility. Such gene sites require a further qualitative prognostic risk analysis using the integrated classification model in order to obtain a reliable clinical prognostic risk result. Optionally, since a change at a gene site leads to a change at an amino acid site, the qualitative analysis may be performed on the target amino acid site instead of the target gene site.
According to an embodiment of the present disclosure, selecting the target gene site from the gene sites according to the prognostic risk value may include selecting the target gene site according to conditions such as the amount of sample data on which the prognostic risk value is based being less than a preset sample data amount threshold, and/or the credibility being less than a credibility threshold. Optionally, a target amino acid site that includes the target gene site may be selected from the amino acid sites according to conditions such as the amount of sample data on which the amino acid prognostic risk value is based being less than the preset sample data amount threshold, and/or the credibility being less than the credibility threshold.
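The selection condition can be illustrated with a minimal Python sketch. The field names, the threshold values, and the use of a P value as the credibility measure are assumptions made for illustration only; the disclosure does not fix them, and the three records are made-up values.

```python
from dataclasses import dataclass


@dataclass
class SiteRisk:
    position: int        # amino acid position (hypothetical field)
    hazard_ratio: float  # prognostic risk value of the site
    n_samples: int       # amount of sample data behind the estimate
    p_value: float       # stands in for "credibility" (assumption)


def select_target_sites(sites, min_samples=30, p_threshold=0.05):
    """Return sites whose risk value is unreliable: too few samples
    and/or a P value at or above the threshold."""
    return [
        s for s in sites
        if s.n_samples < min_samples or s.p_value >= p_threshold
    ]


# Made-up records, for illustration only.
sites = [
    SiteRisk(175, 1.8, 120, 0.001),
    SiteRisk(279, 1.2, 8, 0.40),
    SiteRisk(300, 0.9, 45, 0.30),
]
targets = select_target_sites(sites)
print([s.position for s in targets])  # [279, 300]
```

The sites returned here would be the ones passed on to the integrated classification model for qualitative analysis.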
According to an embodiment of the present disclosure, an initial data set and a control data set are obtained; the initial data set is preprocessed to obtain a target gene mutation data set; a proportional hazards regression model is used to determine the prognostic risk value of each gene site in the target gene mutation data set; a target gene site is selected according to the prognostic risk value; and an integrated classification model is used to obtain the prognostic risk type result of the target gene site. The embodiment thereby performs a quantitative prognostic risk assessment of mutated gene sites, yielding a mutation prognostic risk value for each gene site, and a qualitative prognostic risk assessment of target gene sites whose prognostic risk value has low credibility and whose clinical significance is unknown, yielding a prognostic risk type for each target gene site. The prognostic risk assessment method for gene site mutations provided by the embodiments of the present disclosure can therefore not only quantify the prognostic risk of gene mutations at a finer granularity, but also qualitatively analyze target gene sites of unknown clinical significance. This at least partially overcomes the inaccurate assessment of gene mutation prognostic risk in the related art, improves the accuracy of the assessment, and gives doctors more comprehensive reference information for formulating clinical plans.
According to an embodiment of the present disclosure, the ICGC database and the MSK database may use different reference genomes to annotate mutated genes, so that synonymous mutations of the same gene may have multiple annotation results. An annotation tool is therefore used to re-annotate the initial gene mutation data shown in FIG. 2 as a preprocessing step, hereinafter referred to as gene re-annotation. Specifically, operation S120 may include the following operations: based on preset gene annotation requirements, combing the multiple pieces of initial gene mutation data to obtain multiple pieces of gene mutation data to be processed; annotating each piece of gene mutation data to be processed using a preset gene annotation statement to obtain multiple pieces of target gene mutation data, wherein each piece of target gene mutation data includes multiple pieces of transcript information and multiple pieces of amino acid information corresponding to the mutated gene site; and obtaining the target gene mutation data set from the multiple pieces of target gene mutation data.
FIG. 4 is a schematic diagram of gene mutation data to be processed according to an embodiment of the present disclosure; FIG. 5 shows the data after gene re-annotation according to an embodiment of the present disclosure.
As shown in FIG. 4, the preset gene annotation requirements may specify that the gene mutation data must retain the chromosome number column, the gene site start position column, the gene site end position column, the reference gene column, and the mutant gene column. After the initial gene mutation data is combed according to these requirements, the resulting gene mutation data to be processed may be as shown in FIG. 4. In addition to the five columns above, multiple custom columns may be set according to the implementation. FIG. 4 merely takes as an example custom column 1 being the strand, where "1" in the strand column denotes the sense strand, and custom column 2 being the user number.
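As a rough illustration of this data combing step, the sketch below keeps the five required columns in a fixed order and appends the two custom columns. The dictionary keys are hypothetical names, not taken from the figures; the position and bases follow the first row described for FIG. 2, and the clinical values are made up.

```python
# Hypothetical column names for the five required columns.
REQUIRED = ["chrom", "start", "end", "ref", "alt"]


def comb_record(record, custom=("strand", "user_id")):
    """Keep the required columns, then any custom columns that are present."""
    missing = [c for c in REQUIRED if c not in record]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    kept = [str(record[c]) for c in REQUIRED]
    kept += [str(record[c]) for c in custom if c in record]
    return kept


# One raw row: clinical fields plus mutation fields (illustrative values).
raw = {
    "user_id": "P-1", "status": 1, "months": 14, "gene": "TP53",
    "chrom": 17, "start": 7577121, "end": 7577121,
    "strand": "1", "ref": "G", "alt": "A",
}
line = "\t".join(comb_record(raw))
print(line)
```

Tab-separated rows that begin with chromosome, start, end, reference, and alternate values match the five leading columns that ANNOVAR expects in its input file, with any further columns carried along as extra fields.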
According to an embodiment of the present disclosure, the annotation tool may be chosen according to actual needs; the embodiments of the present disclosure are described using the ANNOVAR annotation tool as an example. The multiple pieces of gene mutation data to be processed are input into the annotation tool and re-annotated using the gene annotation statement annotate_variation.pl -geneanno -dbtype refGene -buildver hg19 example/ex1.avinput humandb/, where -buildver hg19 indicates that hg19 is used as the reference genome. After the gene mutation data to be processed shown in FIG. 4 is re-annotated with this statement, the resulting data may be as shown in FIG. 5.
Continuing with FIG. 5, the gene mutation type shown in the second column and the transcript information and amino acid information shown in the third column are obtained by re-annotating the gene mutation data to be processed shown in FIG. 4. A transcript is one of the one or more mature mRNAs (messenger ribonucleic acids) that are formed from a gene by transcription and can encode a protein. In the second column, N may denote a nonsynonymous mutation, S may denote a stop-loss mutation (stoploss), and F may denote a frameshift mutation.
According to an embodiment of the present disclosure, a gene generally has multiple transcripts, so the third column in FIG. 5 may contain multiple pieces of transcript information with their corresponding amino acid information, and each row in FIG. 5 may likewise contain multiple pieces of transcript information with their corresponding amino acid information. For example, the information in one complete transcript entry may be ENST00000642122.1_4:exon10:c.G1641A:p.M547I, where p.M547I is the amino acid information corresponding to the transcript. In one embodiment, the more information a transcript includes, the more comprehensively it covers the gene mutation information. On this basis, the re-annotated data shown in FIG. 5 can be filtered by transcript information.
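The structure of such a transcript entry can be made concrete with a short sketch. The function names are hypothetical, and the sketch assumes the four colon-separated fields of the example above. It also checks the three-bases-per-amino-acid relationship: coding position 1641 falls in codon (1641 - 1) // 3 + 1 = 547, which agrees with p.M547I.

```python
def parse_annotation(entry):
    """Split 'transcript:exon:cDNA change:protein change' into named fields."""
    transcript, exon, cdna, protein = entry.split(":")
    return {"transcript": transcript, "exon": exon, "cdna": cdna, "protein": protein}


def codon_index(cdna_change):
    """Map a coding-sequence change such as 'c.G1641A' to its 1-based codon number."""
    position = int("".join(ch for ch in cdna_change if ch.isdigit()))
    return (position - 1) // 3 + 1


fields = parse_annotation("ENST00000642122.1_4:exon10:c.G1641A:p.M547I")
print(fields["protein"], codon_index(fields["cdna"]))  # p.M547I 547
```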
According to an embodiment of the present disclosure, preprocessing such as data combing and gene re-annotation gives the initial gene mutation data a neat and uniform format, which in turn improves the efficiency and accuracy of the prognostic risk assessment of gene mutations.
According to an embodiment of the present disclosure, obtaining the target gene mutation data set from the multiple pieces of target gene mutation data may include the following operations: obtaining target transcript information and target amino acid information associated with the target gene from a transcript database; filtering, according to the target transcript information and the target amino acid information, the multiple pieces of transcript information and amino acid information included in each piece of target gene mutation data, so that only the target transcript information and the target amino acid information are retained; and generating the target gene mutation data set from the multiple pieces of target gene mutation data that retain the target transcript information and the target amino acid information.
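A minimal sketch of the filtering operation follows. It assumes that the entries of one row are comma-separated and that each entry begins with the transcript identifier, as in the example entry format; the identifiers TX_A and TX_B and the annotation string are made up for illustration.

```python
def keep_target_transcript(annotations, target_transcript):
    """Return the single entry that belongs to the target transcript, or None."""
    for entry in annotations.split(","):
        entry = entry.strip()
        if entry.split(":")[0] == target_transcript:
            return entry
    return None


# Made-up row with two transcript entries for the same mutation.
row = "TX_A:exon8:c.A836G:p.Y279C,TX_B:exon7:c.A719G:p.Y240C"
print(keep_target_transcript(row, "TX_A"))  # TX_A:exon8:c.A836G:p.Y279C
```

Returning None for rows that lack the target transcript leaves the caller free to drop such rows or flag them for review.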
According to an embodiment of the present disclosure, the target transcript information may be the transcript information that contains the most information among those obtained from a specified database. It is generally the canonical transcript information, that is, the most accurate transcript information, from a reliable source and verified by multiple parties. The target amino acid information may be the amino acid information in the target transcript information.
FIG. 6 is an example diagram of a target gene mutation data set according to an embodiment of the present disclosure.
As shown in FIG. 6, the target transcript information and the target amino acid information are used to filter each row of the third column shown in FIG. 5, so that each row retains only the target transcript information and the target amino acid information. After filtering, the transcript information can be simplified by extracting only the reference amino acid, the mutated amino acid, and the mutated amino acid position; the resulting target gene mutation data set may be as shown in FIG. 6. The reference amino acid and the mutated amino acid are given in the fifth column, and the mutated amino acid position is given in the sixth column. For example, the mutated amino acid p.Y279C indicates that the reference amino acid tyrosine (Y) at position 279 has changed to the amino acid cysteine (C). MV in the consequence column of FIG. 6 may denote a missense mutation (missense_variant).
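The simplification step, extracting the reference amino acid, the position, and the mutated amino acid from a string such as p.Y279C, can be sketched with a regular expression. The function name is hypothetical, and the pattern covers only single-letter substitutions of the form shown in the example.

```python
import re

# p.<reference amino acid><position><mutated amino acid>, e.g. p.Y279C
PROTEIN_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")


def parse_protein_change(text):
    """Return (reference amino acid, position, mutated amino acid)."""
    match = PROTEIN_CHANGE.match(text)
    if match is None:
        raise ValueError(f"unsupported protein change: {text!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


print(parse_protein_change("p.Y279C"))  # ('Y', 279, 'C')
```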
According to an embodiment of the present disclosure, after the transcript information has been simplified by extracting the reference amino acid, the mutated amino acid, and the mutated amino acid position, and before the target gene mutation data set is obtained, the data set may be further preprocessed by filling in missing values and deleting abnormal values. For example, abnormal data in the survival status and survival time columns may be handled by deleting negative survival times, or by deleting garbled and non-numeric characters. Where the survival time is missing, the value may be filled with a duration derived from the follow-up time according to the survival status, or the user data corresponding to the missing value may be deleted. The specific handling can be adjusted according to actual needs.
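One possible form of this cleaning step is sketched below. It takes the simpler of the two options described, dropping rows with missing survival time instead of filling them from follow-up data; the row layout and key names are assumptions, and the sample rows are made up.

```python
import math


def clean_survival(rows):
    """Drop rows whose survival time is missing, non-numeric, or negative."""
    cleaned = []
    for row in rows:
        raw = str(row.get("time", "")).strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # garbled, non-numeric, or missing value
        if not math.isfinite(value) or value < 0:
            continue  # negative or non-finite survival time
        cleaned.append({**row, "time": value})
    return cleaned


# Made-up rows covering the abnormal cases mentioned in the text.
rows = [
    {"id": "P-1", "time": "14"},
    {"id": "P-2", "time": "-3"},
    {"id": "P-3", "time": "??"},
    {"id": "P-4", "time": None},
]
print([r["id"] for r in clean_survival(rows)])  # ['P-1']
```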
According to an embodiment of the present disclosure, the target transcript information and the target amino acid information are used to filter the multiple pieces of transcript information and amino acid information included in each piece of target gene mutation data. Because the target transcript information may be the most accurate transcript information, verified by multiple parties, only the information associated with the target transcript information and the target amino acid information is retained after filtering. This keeps the information concise and accurate, and improves the efficiency and accuracy of the prognostic risk assessment of gene mutations.
According to an embodiment of the present disclosure, after the target gene mutation data set is obtained, a sliding window and a proportional hazards regression model (proportional-hazards model, also known as the Cox regression model) may be used to analyze the prognostic risk value after a mutation at a single amino acid site. It should be noted that the data preprocessing stage described above obtains and processes data at the granularity of gene sites. However, a mutation at a gene site changes the amino acid corresponding to that gene site. Therefore, in order to reduce the assessment workload, improve the efficiency of the mutation prognostic risk assessment, and describe the mutation prognostic risk value more clearly and concisely, the embodiments of the present disclosure analyze the mutation prognostic risk value in units of amino acid sites.
According to an embodiment of the present disclosure, operation S130 may include the following operations: determining, based on the target gene mutation data set, the sub-mutation data corresponding to each gene site; and performing a quantitative prognostic risk analysis between each piece of sub-mutation data and the control data set using the proportional hazards regression model, to obtain the prognostic risk value of the gene site.
According to an embodiment of the present disclosure, the sub-mutation data may be the data corresponding to each mutated gene site. Since a mutation at a gene site affects the corresponding amino acid, the sub-mutation data may also be the data corresponding to an amino acid site at which a gene site has mutated. The sub-mutation data may be obtained by sliding a window along the target gene chain. Specifically, determining the sub-mutation data corresponding to each gene site based on the target gene mutation data set may include the following operations: sliding a sliding window along the target gene, wherein the center point of the sliding window is any one of the m amino acid sites, and the size of the sliding window along the extension direction of the target gene is associated with the distance, along that direction, spanned by k adjacent amino acid sites, k being a positive integer; when the sliding window has slid to any one amino acid site, selecting k amino acid sites around the central amino acid site based on the size of the sliding window along the extension direction of the target gene, to obtain the target amino acid sites; and taking the gene mutation data in the target gene mutation data set that is linked to the target amino acid sites as the sub-mutation data.
According to an embodiment of the present disclosure, the target gene and the target gene mutation data set may be linked. The target gene includes m amino acid sites, each amino acid site corresponds to at least one mutated gene site, and m is a positive integer. For example, when the target gene is TP53, which includes 393 amino acid sites, m may be 393.
FIG. 7 is a schematic diagram of collecting data using a sliding window according to an embodiment of the present disclosure.
As shown in FIG. 7, the size B of the sliding window 701 along the extension direction of the target gene may be the size that covers k adjacent amino acid sites; FIG. 7 takes k equal to 3 as an example. Sliding the sliding window 701 along the target gene chain 702 yields, for each amino acid site i (0≤i≤392), a sliding window centered on site i whose size B along the extension direction of the target gene is k and which encloses the region (i-k/2, i+k/2). Collecting the mutation data within the region enclosed by the sliding window gives the sub-mutation data. The value of k can be adjusted according to actual needs. When i=0, the region enclosed by the sliding window may be (0, k); when i=392, it may be (392-k, 392); when i=100, it may be (100-k/2, 100+k/2). In general, the size of the window along the extension direction of the target gene must be kept equal to the distance spanned by k adjacent amino acids.
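The window rule, including the two boundary cases, can be written as a short sketch. The function names and the `aa_pos` key are hypothetical. Positions use the 0-based index i of the text, and treating the window bounds as inclusive when collecting records is an assumption, since the text does not say whether the end points belong to the region.

```python
def window_bounds(i, k, m=393):
    """Region enclosed by a window of size k centered on site i (0 <= i <= m-1).

    Near either end of the gene the window is shifted inward so that its
    size stays k. Assumes k < m.
    """
    last = m - 1
    lo, hi = i - k / 2, i + k / 2
    if lo < 0:
        return (0, k)
    if hi > last:
        return (last - k, last)
    return (lo, hi)


def collect_sub_mutations(records, i, k, m=393):
    """Collect the mutation records whose amino acid position lies in the window."""
    lo, hi = window_bounds(i, k, m)
    return [r for r in records if lo <= r["aa_pos"] <= hi]


print(window_bounds(0, 3))    # (0, 3)
print(window_bounds(392, 3))  # (389, 392)
print(window_bounds(100, 3))  # (98.5, 101.5)
```

With k = 3 and i = 100, the records at positions 99, 100, and 101 fall inside the window, which matches the three-site example of FIG. 7.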
In one embodiment, the control group may be the control data for the target gene (for example, TP53) as a whole. That is, the control data is the same for every site and consists of data in which no gene site of the target gene (for example, TP53) is mutated.
According to an embodiment of the present disclosure, collecting the sub-mutation data corresponding to each gene site with a sliding window allows the sub-mutation data to be obtained efficiently, conveniently, and clearly, which improves the efficiency and accuracy of the prognostic risk assessment of gene mutations.
According to an embodiment of the present disclosure, the control data may be the same for every piece of sub-mutation data, namely the data obtained when no gene site on the target gene chain is mutated. By performing a quantitative prognostic risk regression analysis between each piece of sub-mutation data and the control data using the prognostic risk regression function of the proportional hazards regression model, the prognostic risk value corresponding to each piece of sub-mutation data can be obtained. The prognostic risk value corresponding to the sub-mutation data may serve as the prognostic risk value of the mutated gene site, or as the prognostic risk value of the amino acid site at which a gene site has mutated.
Specifically, after the sub-mutation data is obtained, a regression analysis can be performed to obtain the prognostic risk value. In the embodiments of the present disclosure, the prognostic risk value of each gene site is characterized by the prognostic risk value of the amino acid site associated with that gene site, and the sub-mutation data is associated with the amino acid site. Performing the quantitative prognostic risk analysis between each piece of sub-mutation data and the control data set using the proportional hazards regression model, to obtain the prognostic risk value of the gene site, may include the following operations: inputting the sub-mutation data into the prognostic risk function to obtain the prognostic risk function value corresponding to each amino acid site; and taking the ratio of the prognostic risk function value of each amino acid site to the baseline prognostic risk function value as the prognostic risk value of that amino acid site, wherein the baseline prognostic risk function value is obtained by inputting the control data set into the prognostic risk function.
According to an embodiment of the present disclosure, the proportional hazards regression model is applied to the sub-mutation data of each amino acid site and the control data set to analyze the effect of a mutation at that amino acid site on the prognostic risk. The proportional hazards regression model is built on the prognostic risk function. It can make use of censored data and analyze the effect of each factor on the prognostic risk rate for survival, without considering the distribution of survival time.
According to an embodiment of the present disclosure, the prognostic risk function may be as shown in formula (1).
$$h(t, X) = h_0(t)\,\exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p) \tag{1}$$
Here, h(t, X) is the prognostic risk function at time t, also called the prognostic risk rate or instantaneous mortality rate, and h_0(t) is the baseline prognostic risk function, that is, the prognostic risk function obtained at time t when all variables are 0. X_1, X_2, ..., X_p may be covariates, influencing factors, or prognostic factors, and β_1, β_2, ..., β_p are the regression coefficients. X_1, X_2, ..., X_p may be associated with the fifth column in FIG. 6: X may be assigned the value 1 when a mutation has occurred and the value 0 when no mutation has occurred. The data corresponding to X_1, X_2, ..., X_p can also be adjusted according to actual needs.
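Formula (1) can be evaluated directly once the baseline value and the regression coefficients are known. The sketch below is only an illustration of the formula: the function name is hypothetical, and it does not estimate the coefficients, which in practice come from fitting the model to the survival data.

```python
import math


def hazard(h0_t, betas, xs):
    """Evaluate h(t, X) = h0(t) * exp(sum of beta_j * X_j) for one time point."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    linear_predictor = sum(b * x for b, x in zip(betas, xs))
    return h0_t * math.exp(linear_predictor)


# With every covariate equal to 0 the hazard equals the baseline hazard.
print(hazard(0.1, [0.5], [0]))  # 0.1
```

Setting the single mutation indicator to 1 multiplies the baseline hazard by exp(β), so a positive coefficient raises the hazard and a negative one lowers it.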
According to an embodiment of the present disclosure, the process of calculating the prognostic risk value may be as shown in formula (2).
Here, HR denotes the mutation prognostic risk value of each amino acid site. When β>0 and HR>1, the prognostic risk rate increases, indicating that the variable X is a risk factor. When β<0 and HR<1, the prognostic risk rate decreases, indicating that the variable X is a protective factor. When β=0 and HR=1, the prognostic risk rate is unchanged, indicating that the variable X is unrelated to the risk. h_0(t) may be the baseline prognostic risk function value obtained from the control data set, that is, the prognostic risk function value at time t when all variables are 0. The mutation prognostic risk value of an amino acid site can be taken as a rough counterpart of the mutation prognostic risk value of the corresponding gene site.
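Formula (2) itself is not reproduced in this text. A form consistent with formula (1), and with the ratio between the prognostic risk function value of a site and the baseline prognostic risk function value described above, would be the following reconstruction, which should be read as an assumption and not as the formula of the original filing:

$$HR = \frac{h(t, X)}{h_0(t)} = \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$

For a single mutation indicator with X = 1 this reduces to HR = e^β, which agrees with the three cases listed: β>0 gives HR>1, β<0 gives HR<1, and β=0 gives HR=1.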
根据本公开的实施例,通过结合滑动窗采集子突变数据,以及利用比例风险回归模型输出氨基酸位点的预后风险值,可以实现对基因点突变预后风险的定量表征,提高对基因突变预后风险评估的准确率。According to the embodiments of the present disclosure, by combining the sliding window to collect sub-mutation data and using the proportional risk regression model to output the prognostic risk value of the amino acid site, it is possible to quantitatively characterize the prognostic risk of gene point mutations and improve the accuracy of gene mutation prognostic risk assessment.
FIG. 8 is a scatter plot of mutation prognostic risk values according to an embodiment of the present disclosure.
After the mutation prognostic risk value of each amino acid site has been obtained by applying the proportional risk regression model to the sub-mutation data of that site and the control data set, a scatter plot of mutation prognostic risk values can be drawn from the prognostic risk value of each amino acid site and displayed using a visualization component, as shown in FIG. 8. The visualization component may include components such as ECharts (a data visualization chart library) and Highcharts (a chart library).
Continuing to refer to FIG. 8, the upper part of FIG. 8 is the scatter plot of mutation prognostic risk values for the amino acid sites of the target gene (e.g., TP53), and the lower part is a bar chart of the number of samples in the sliding window corresponding to each amino acid site of the target gene. The X-axis represents the 393 amino acid sites of the target gene (e.g., TP53), and the left Y-axis represents the logarithmic prognostic risk. Blue dots represent mutated amino acid sites whose prognostic risk value has high credibility (P value < 0.05); dots of other colors represent amino acid sites of unknown clinical significance with low credibility (P value > 0.05). The gray lines represent the upper and lower bounds of the 95% confidence interval; the wider the interval, the lower the credibility. The right Y-axis represents the number of samples, and the red-and-blue color bar on the right represents the P value, which lies between 0 and 1: the further P is above 0.05, the darker the red, and the further P is below 0.05, the darker the blue.
According to an embodiment of the present disclosure, as shown in FIG. 8, the point mutation prognostic risks obtained from the proportional risk regression model fall into red dots and blue dots. Blue dots are sites with enough samples and high credibility; red dots are sites with too few samples and insufficient credibility. A machine learning method is therefore also used to determine whether the prognostic risk of each red dot corresponds to a protective factor or a harmful factor. That is, the amino acid sites of unknown clinical significance in FIG. 8 can be further analyzed qualitatively to determine whether their prognostic risk is high, low, or absent. When making this determination, the prognostic risk type of a point mutation of unknown clinical significance can be identified with a machine learning method based on features such as the effect of the amino acid point mutation on protein function.
According to an embodiment of the present disclosure, the data of the target gene mutation data set and the control data set are analyzed using a sliding window and the prognostic risk regression function to obtain the mutation prognostic risk value scatter plot to be displayed. The prognostic risk value, 95% confidence interval, and P value of each amino acid site can be read from this scatter plot. This enables quantitative analysis of the prognostic risk of gene mutations and characterizes the prognostic risk value of amino acid site mutations at a finer granularity and with greater precision. The visual display also helps relevant personnel make clinical decisions based on the scatter plot.
According to an embodiment of the present disclosure, the prognostic risk values of some gene sites do not accurately reflect the clinical prognostic risk. For example, when only a small amount of sample data is available for a gene site, its credibility is low and a clinically meaningful prognostic risk cannot be accurately derived from its prognostic risk value, so such gene sites require further analysis. Specifically, before such a gene site can be analyzed further, it must first be selected from the multiple gene sites according to the prognostic risk value. This selection may include the following operation: according to the prognostic risk value of each amino acid site, selecting a target amino acid site from the amino acid sites associated with the gene sites, wherein the target amino acid site includes the target gene site.
According to an embodiment of the present disclosure, a mutation at a gene site can change the amino acid corresponding to that gene site, and the number of amino acids is smaller than the number of gene sites. Using the target amino acid site in place of the target gene site for qualitative analysis of the prognostic risk type therefore reduces the analysis workload, improves the efficiency of analyzing the mutation prognostic risk type, and allows the prognostic risk type result of the mutation to be described more clearly and concisely. In analyzing the mutation prognostic risk type, the embodiments of the present disclosure take the amino acid site as the unit of analysis.
According to an embodiment of the present disclosure, the prognostic risk assessment method for gene site mutations may further include the following operations: calculating the credibility associated with the prognostic risk value of each amino acid site; comparing the credibility associated with the prognostic risk value of each amino acid site with a credibility threshold to obtain the prognostic risk values whose credibility is below the credibility threshold; and determining, from the amino acid sites associated with the gene sites, the amino acid sites associated with those prognostic risk values, to obtain the target amino acid sites.
According to an embodiment of the present disclosure, because a mutation at a gene site can affect the corresponding amino acid, the credibility associated with the prognostic risk value of each gene site can be represented by the credibility associated with the prognostic risk value of the corresponding amino acid site.
As shown in FIG. 8, the credibility associated with the prognostic risk value of an amino acid site can be obtained from the tool used to draw the scatter plot. The target amino acid sites may be the amino acid sites of unknown clinical significance represented by red dots in FIG. 8.
According to an embodiment of the present disclosure, the prognostic risk types may include high prognostic risk, low prognostic risk, and the like.
According to an embodiment of the present disclosure, the credibility characterizes the reliability of a prognostic risk value. Screening the amino acid sites against the credibility threshold and taking the sites whose credibility is below the threshold as target amino acid sites singles out the low-reliability sites, so that the integrated classification model does not need to analyze the prognostic risk of every amino acid site in its qualitative analysis, which improves the efficiency of that analysis.
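The screening step can be sketched as follows. The sketch assumes that credibility is expressed through the P value used in FIG. 8, so that a site counts as high-credibility only when P < 0.05; the record type, field names, and the treatment of P exactly equal to 0.05 are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteResult:
    position: int        # amino acid position in the target gene
    hazard_ratio: float  # prognostic risk value from the regression model
    p_value: float       # P value associated with the prognostic risk value


def select_target_sites(results: List[SiteResult], p_cutoff: float = 0.05) -> List[int]:
    """Return positions whose prognostic risk value has low credibility.

    Assumption: a site is high-credibility only when p_value < p_cutoff,
    so every other site becomes a target for the classification model.
    """
    return [r.position for r in results if not r.p_value < p_cutoff]


if __name__ == "__main__":
    demo = [
        SiteResult(position=175, hazard_ratio=1.8, p_value=0.01),
        SiteResult(position=42, hazard_ratio=0.9, p_value=0.40),
        SiteResult(position=300, hazard_ratio=1.2, p_value=0.20),
    ]
    print(select_target_sites(demo))
```

Only the selected positions are passed on to the qualitative analysis, which is where the efficiency gain described above comes from.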
According to an embodiment of the present disclosure, operation S150 may include the following operation: for the prognostic risk caused by a mutation at the target amino acid site, performing qualitative analysis of the prognostic risk type using the integrated classification model to obtain the prognostic risk type result. Specifically, because the prognostic risk values of some gene sites do not accurately reflect the clinical prognostic risk (for example, the amount of sample data for the site is small and its credibility is low), a clinically meaningful prognostic risk cannot be accurately derived from the prognostic risk value of such a site. The integrated classification model is therefore used to further analyze, qualitatively, the prognostic risk of such gene sites or of the amino acid sites associated with them.
Specifically, the qualitative analysis of the prognostic risk type using the integrated classification model may include the following operations: based on the target amino acid site, extracting from the target gene mutation data set the gene mutation data associated with a mutation at the target amino acid site, to obtain the mutation data whose prognostic risk type is to be analyzed; based on that mutation data, using the integrated classification model to predict the degree of match between the prognostic risk caused by the mutation at the target amino acid site and each of N prognostic risk types, to obtain N predicted matching values corresponding to the N prognostic risk types, where N is a positive integer and the N prognostic risk types differ from one another in degree of prognostic risk; and determining the prognostic risk type result according to the N predicted matching values.
According to an embodiment of the present disclosure, using the integrated classification model to qualitatively assess amino acid sites whose clinical prognostic risk significance is unknown yields a prognostic risk type result that can provide reference information for clinical decision-making.
According to an embodiment of the present disclosure, obtaining the N predicted matching values corresponding to the N prognostic risk types (that is, using the integrated classification model, based on the mutation data whose prognostic risk type is to be analyzed, to predict the degree of match between the prognostic risk caused by the mutation at the target amino acid site and each of the N prognostic risk types) may include the following operations, performed for each of the N prognostic risk types: using the integrated classification model to analyze the degree of match between the prognostic risk caused by the mutation at the target amino acid site and that prognostic risk type, to obtain multiple predicted values, wherein the integrated classification model includes multiple classifiers and each classifier produces one predicted value; and processing the multiple predicted values with a gradient boosting classifier to obtain the predicted matching value.
FIG. 9 is an architecture diagram of obtaining a predicted matching value using the integrated classification model according to an embodiment of the present disclosure.
As shown in FIG. 9, the integrated classification model may include multiple classifiers, and the classifiers may include at least one of the following: a random forest classifier, a Gini index random tree classifier, an entropy random tree classifier, and a gradient boosting classifier. The first extreme random tree classifier may be either a Gini index random tree classifier or an entropy random tree classifier, and so may the second extreme random tree classifier, but the two preferably use different types of random tree classifier. A Gini index random tree classifier uses the Gini index as the criterion for deciding whether a node continues to split; an entropy random tree classifier uses entropy as that criterion.
The loss function of the random forest classifier is the Gini index, as shown in formula (3).
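As a point of reference, the usual definition of the Gini index over N classes is given below. It is assumed, not confirmed by the source, that formula (3) takes this form, with p_k denoting the probability of the k-th prognostic risk type.

```latex
\[
\mathrm{Gini}(p) = \sum_{k=1}^{N} p_k \,(1 - p_k) = 1 - \sum_{k=1}^{N} p_k^{2}
\]
```

The value is 0 when a single class has probability 1 and reaches its maximum, 1 - 1/N, when all N classes are equally likely.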
Here, pk may denote the probability that the prognostic risk caused by the mutation at the target amino acid site belongs to the k-th of the N prognostic risk types. The larger the Gini index, the greater the uncertainty; the smaller the Gini index, the smaller the uncertainty and the cleaner the data split.
The entropy of the extreme random tree may be as shown in formula (4).
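For reference, the usual definition of entropy over N classes is given below. It is assumed that formula (4) takes this form; the base of the logarithm is not stated in the text, and base 2 is chosen here only for illustration.

```latex
\[
H(p) = -\sum_{k=1}^{N} p_k \log_2 p_k
\]
```

Terms with p_k = 0 are taken to contribute 0, so the entropy is 0 for a pure node and largest when all classes are equally likely.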
Here, pk may denote the probability that the prognostic risk caused by the mutation at the target amino acid site belongs to the k-th of the N prognostic risk types.
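A short numerical sketch may help compare the two split criteria. It assumes the standard definitions of the Gini index and of entropy (base 2), which the text does not spell out; the function names are illustrative.

```python
import math
from typing import Sequence


def gini_index(probabilities: Sequence[float]) -> float:
    """Gini index 1 - sum(p_k^2) for a class probability vector."""
    return 1.0 - sum(p * p for p in probabilities)


def entropy(probabilities: Sequence[float]) -> float:
    """Entropy -sum(p_k * log2(p_k)); zero-probability classes contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    for probs in ([1.0, 0.0], [0.75, 0.25], [0.5, 0.5]):
        print(probs, gini_index(probs), entropy(probs))
```

Both measures are smallest for a pure node and largest for an evenly mixed one, which is why either can serve as the node-splitting criterion.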
Gradient boosting is a machine learning technique for regression, classification, and ranking tasks. In the embodiments of the present disclosure, it can be used to derive the degree of match between the prognostic risk caused by the mutation at the target amino acid site and each prognostic risk type.
Continuing to refer to FIG. 9, when predicting the matching value, the mutation data 901 whose prognostic risk type is to be analyzed can be input separately into the random forest classifier 902, the first extreme random tree classifier 903, the second extreme random tree classifier 904, and the gradient boosting classifier 905. The resulting predicted values are the first predicted value 906 output by the random forest classifier 902, the second predicted value 907 output by the first extreme random tree classifier 903, the third predicted value 908 output by the second extreme random tree classifier 904, and the fourth predicted value 909 output by the gradient boosting classifier 905. Another gradient boosting classifier 910 then processes the first predicted value 906, the second predicted value 907, the third predicted value 908, and the fourth predicted value 909 to produce the final predicted matching value 911.
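The data flow of FIG. 9 is a two-level (stacked) arrangement, which can be sketched as below. The sketch only shows how the values move from 901 to 911: the four base classifiers and the second-level classifier are hypothetical stand-in functions returning fixed or averaged scores, not trained models, and every name is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

# A base classifier maps a feature vector to one predicted value.
BaseClassifier = Callable[[Sequence[float]], float]
# The second-level classifier maps the base predictions to the matching value.
MetaClassifier = Callable[[Sequence[float]], float]


def stacked_matching_value(
    features: Sequence[float],
    base_classifiers: List[BaseClassifier],
    meta_classifier: MetaClassifier,
) -> float:
    """Feed the features (901) to every base classifier (902-905), then pass
    their predicted values (906-909) to the second-level classifier (910)
    to obtain the predicted matching value (911)."""
    base_predictions = [clf(features) for clf in base_classifiers]
    return meta_classifier(base_predictions)


# Hypothetical stand-ins for trained models; real models would be fitted.
def random_forest_902(x: Sequence[float]) -> float:
    return 0.75


def extra_trees_gini_903(x: Sequence[float]) -> float:
    return 0.5


def extra_trees_entropy_904(x: Sequence[float]) -> float:
    return 0.25


def gradient_boosting_905(x: Sequence[float]) -> float:
    return 0.5


def gradient_boosting_910(predictions: Sequence[float]) -> float:
    # Stand-in combiner: a plain average instead of a fitted model.
    return sum(predictions) / len(predictions)


if __name__ == "__main__":
    bases = [random_forest_902, extra_trees_gini_903,
             extra_trees_entropy_904, gradient_boosting_905]
    print(stacked_matching_value([0.1, 0.2, 0.3], bases, gradient_boosting_910))
```

The source does not name a library. A machine learning library such as scikit-learn offers comparable building blocks, for example its `StackingClassifier`, and could supply the trained models in place of these stand-ins.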
According to an embodiment of the present disclosure, combining multiple classifiers into an integrated classification model yields multiple predicted values, and combining those predicted values yields the final predicted matching value, which improves the accuracy of prognostic risk type prediction.
According to an embodiment of the present disclosure, determining the prognostic risk type result according to the N predicted matching values may include the following operations: selecting the highest predicted matching value from the N predicted matching values by numerical comparison; and taking the prognostic risk type corresponding to the highest predicted matching value as the prognostic risk type result.
As an example, suppose the prognostic risk types are high prognostic risk and low prognostic risk, so that N equals 2. The degree of match between the prognostic risk caused by the mutation at the target amino acid site and each of the two types is calculated, giving a predicted matching value for the high prognostic risk type and one for the low prognostic risk type. The two values are then compared. If the predicted matching value of the high prognostic risk type is greater than or equal to that of the low prognostic risk type, the prognostic risk type result is that the prognostic risk caused by the mutation at the target amino acid site is a high prognostic risk; if it is smaller, the result is that the prognostic risk is a low prognostic risk. This example is for illustration only and does not limit the present invention; the prognostic risk types can be divided and adjusted according to actual needs.
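The selection rule in this example can be sketched as follows. For N = 2 the text resolves a tie in favor of the high prognostic risk type; extending that tie rule to N > 2 by listing the types from highest to lowest risk is an assumption made here, as are the function and label names.

```python
from typing import List, Tuple


def select_risk_type(matching_values: List[Tuple[str, float]]) -> str:
    """Return the risk type with the highest predicted matching value.

    `matching_values` must be ordered from the highest-risk type to the
    lowest-risk type. Python's max() returns the first maximal element,
    so a tie is resolved in favor of the higher-risk type.
    """
    if not matching_values:
        raise ValueError("at least one predicted matching value is required")
    risk_type, _ = max(matching_values, key=lambda item: item[1])
    return risk_type


if __name__ == "__main__":
    print(select_risk_type([("high", 0.62), ("low", 0.38)]))
    print(select_risk_type([("high", 0.5), ("low", 0.5)]))
    print(select_risk_type([("high", 0.2), ("low", 0.8)]))
```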
According to an embodiment of the present disclosure, using the integrated classification model to qualitatively analyze prognostic risk values of low credibility makes it possible to accurately determine the prognostic risk of new point mutations with low mutation frequency. Comparing the multiple predicted matching values and taking the prognostic risk type corresponding to the highest one as the result makes the final prognostic risk type result more representative and provides a more accurate reference for clinical decision-making.
According to an embodiment of the present disclosure, the following operations may be performed on the prognostic risk type result obtained by the above process: obtaining the prognostic risk type result; and pushing the prognostic risk type result to a target object so that the target object generates a clinical decision according to it. The target object may be a person who makes clinical decisions or a decision system for generating clinical decisions.
According to an embodiment of the present disclosure, using the prognostic risk type result obtained from the integrated classification model as a basis for clinical decisions can improve both the efficiency of generating clinical decisions and their accuracy.
In training the integrated classification model, the embodiments of the present disclosure may adopt the following operations: obtaining a sample data set, wherein the sample data set includes first prognostic risk level data and second prognostic risk level data, the first prognostic risk level being higher than the second; randomly splitting the sample data set into an initial training sample set and a validation sample set, wherein the ratio of the number of data items in the initial training sample set to the number in the validation sample set is a preset ratio; oversampling the initial training sample set so that it contains the same number of first prognostic risk level data items as second prognostic risk level data items, to obtain a target training sample set; performing cross-validation training on an initial integrated classification model using the target training sample set, to obtain an intermediate integrated classification model; and taking the intermediate integrated classification model that satisfies a preset training end condition as the integrated classification model, wherein the preset training end condition includes the number of training iterations reaching a preset threshold.
According to an embodiment of the present disclosure, oversampling the initial training sample set avoids the imbalanced data distribution that random splitting of the sample data set can cause. Performing cross-validation training on the initial integrated classification model with the target training sample set removes the loss of training accuracy that an unbalanced single division of the target training sample set into a training subset and a validation subset would cause, thereby improving the accuracy of the final integrated classification model.
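The data preparation steps above (random split, oversampling, cross-validation folds) can be sketched with the standard library alone. The text does not specify the oversampling method, the split ratio, or the number of folds, so random duplication of minority-class samples, the `train_fraction` parameter, and the fold count `k` are all assumptions; model fitting itself is omitted.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label: 1 = high risk, 0 = low risk)


def split_samples(samples: List[Sample], train_fraction: float,
                  rng: random.Random) -> Tuple[List[Sample], List[Sample]]:
    """Randomly split samples into an initial training set and a validation set."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def oversample(train: List[Sample], rng: random.Random) -> List[Sample]:
    """Duplicate randomly chosen minority-class samples until both classes
    have the same number of items (one simple form of oversampling)."""
    class_0 = [s for s in train if s[1] == 0]
    class_1 = [s for s in train if s[1] == 1]
    if not class_0 or not class_1:
        raise ValueError("both classes must be present in the training set")
    minority, majority = sorted((class_0, class_1), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield training, folds[i]
```

A typical use would split the sample data set once, oversample only the training part, and then iterate over `k_fold_indices(len(balanced), k)` to fit and validate the model on each fold until the preset number of training iterations is reached.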
According to an embodiment of the present disclosure, the sample data set may include multiple sample data items, and the types of sample data may include at least one of the following: data on the effect of the mutated target amino acid site on protein function, data on the clinical impact of the mutated target amino acid site, and position data of the mutated target amino acid site.
FIG. 10A is a schematic diagram of data obtained by predicting amino acid site mutations with some of the tools according to an embodiment of the present disclosure; FIG. 10B is a schematic diagram of data obtained by predicting amino acid site mutations with some of the tools according to another embodiment of the present disclosure.
As shown in FIGS. 10A and 10B, the data on the effect of the mutated target amino acid site on protein function can be obtained by using test tools to analyze the effect of the amino acid point mutation on protein function. The test tools for predicting whether an amino acid point mutation is harmful or neutral may include SIFT_score, PolyPhen2_HDIV_score, PolyPhen2_HVAR_score, LRT_score, Mutation Taster_score, Mutation Assessor_score, FATHMM_score, PROVEAN_score, VEST3_score, CADD_raw, CADD_phred, DANN_score, MetaLR_score, integrated_fitCons_score, integrated_confidence_value, GERP++_RS, fathmm-MKL_coding_score, MetaSVM_score, phyloP7way_vertebrate, phyloP20way_mammalian, phastCons7way_vertebrate, phastCons20way_mammalian, SiPhy_29way_logOdds, and the like. Running online prediction tests with these 23 test tools yields 23 impact values. FIGS. 10A and 10B show the prediction results of only some of the tools as an illustration, namely SIFT_score, PolyPhen2_HDIV_score, PolyPhen2_HVAR_score, LRT_score, Mutation Taster_score, Mutation Assessor_score, FATHMM_score, PROVEAN_score, VEST3_score, CADD_raw, CADD_phred, DANN_score, fathmm-MKL_coding_score, MetaSVM_score, MetaLR_score, integrated_fitCons_score, integrated_confidence_value, and GERP++_RS.
Continuing to refer to FIGS. 10A and 10B, the result produced by each test tool can be interpreted according to the result analysis methods in the related art. As an example, the results of the SIFT_score test tool lie between 0 and 1 and can be divided into two ranges, 0 to 0.05 and 0.05 to 1. When the result is between 0 and 0.05, the mutation at this amino acid site can be considered harmful and likely to change protein function; the smaller the value, the more likely a change in protein function. When the result is between 0.05 and 1, the mutation at this amino acid site can be considered benign; the closer the value is to 1, the smaller the effect on protein function. For the results of the PolyPhen2_HDIV_score, PolyPhen2_HVAR_score, LRT_score, Mutation Taster_score, Mutation Assessor_score, FATHMM_score, VEST3_score, CADD_raw, CADD_phred, DANN_score, MetaSVM_score, MetaLR_score, integrated_fitCons_score, integrated_confidence_value, and GERP++_RS test tools, the larger the value, the greater the harmful effect that the mutation at the amino acid site may have on protein function. The results of PROVEAN_score may lie between -14 and 14 and can be divided into two ranges, -14 to 2.5 and 2.5 to 14. When the result is between -14 and 2.5, the mutation at this amino acid site can be considered harmful; when the result is between 2.5 and 14, it can be considered neutral. For the fathmm-MKL_coding_score test tool, when the result is greater than or equal to 0.5, the mutation at this amino acid site can be considered harmful; when the result is less than 0.5, it can be considered neutral.
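The threshold rules for SIFT_score, PROVEAN_score, and fathmm-MKL_coding_score can be sketched as simple functions. The cutoffs are copied from the text as stated; how a score exactly at the SIFT or PROVEAN boundary is handled is an assumption, as are the function names. The PROVEAN cutoff in particular should be checked against the tool's own documentation before use, since a cutoff of -2.5 is commonly cited for it.

```python
SIFT_CUTOFF = 0.05        # from the text: 0 to 0.05 is harmful
PROVEAN_CUTOFF = 2.5      # as stated in the text; verify against PROVEAN docs
FATHMM_MKL_CUTOFF = 0.5   # from the text: >= 0.5 is harmful


def interpret_sift(score: float) -> str:
    """SIFT_score lies in [0, 1]; lower values suggest a harmful mutation."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT_score is expected to lie between 0 and 1")
    # Assumption: a score exactly at the cutoff is treated as benign.
    return "harmful" if score < SIFT_CUTOFF else "benign"


def interpret_provean(score: float) -> str:
    """PROVEAN_score lies in [-14, 14]; lower values suggest a harmful mutation."""
    if not -14.0 <= score <= 14.0:
        raise ValueError("PROVEAN_score is expected to lie between -14 and 14")
    # Assumption: a score exactly at the cutoff is treated as neutral.
    return "harmful" if score < PROVEAN_CUTOFF else "neutral"


def interpret_fathmm_mkl(score: float) -> str:
    """fathmm-MKL_coding_score: values at or above the cutoff are harmful."""
    return "harmful" if score >= FATHMM_MKL_CUTOFF else "neutral"


if __name__ == "__main__":
    print(interpret_sift(0.01), interpret_provean(-3.0), interpret_fathmm_mkl(0.9))
```

These labels are only an aid to reading the raw scores; the classification model described in this disclosure consumes the numeric scores themselves as features.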
FIG. 11A is a schematic diagram of data on the clinical impact of amino acid site mutations according to an embodiment of the present disclosure; FIG. 11B is a schematic diagram of data on the clinical impact of amino acid site mutations according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, the online tool VIC is used, following the AMP/ASCO/CAP 2017 clinical guidelines and drawing on clinical journal papers, academic papers, clinical databases, and the like, to evaluate impact data describing how amino acid site variants of the target gene (e.g., TP53) relate to clinical practice. There are 12 kinds of impact data, for example that certain mutations are markers for the diagnosis of a certain cancer, that certain mutations are markers of a good cancer prognosis, or that certain mutations occur in a particular pathway in cancer. Each kind of impact data can be represented by a 4-dimensional one-hot vector, giving 48-dimensional clinical impact data. FIGS. 11A and 11B show the 4-dimensional one-hot vectors of only some of the impact data.
Continuing to refer to FIGS. 11A to 11B, EVS_1 can represent treatment-related amino acid site mutations recognized by food and drug regulatory agencies, EVS_2 can represent diagnosis-related amino acid site mutations recognized by experts, EVS_3 can represent prognosis-related amino acid site mutations recognized by experts, and EVS_4 can represent the mutation type (missense mutation, nonsense mutation, insertion/deletion mutation, and alternative splicing). EVS_5, not shown, can represent the somatic mutation databases COSMIC (one of the largest databases of cancer-related somatic mutation sites) and ICGC. EVS_6 can represent the scores of prediction software. EVS_7 can represent germline mutations. EVS_8 can represent cancer-related pathways. EVS_9 can represent clinically relevant mutations in the published literature. EVS_10 can represent population prevalence data on the occurrence of a certain mutation in a certain cancer type. EVS_11 can represent the grade of the cancer. EVS_12 can represent the type of the cancer. {-1}, {0}, {1} and {2} can represent four levels, arranged from low to high, that indicate how strongly the evidence is associated with clinical practice: 2 is the strongest evidence and -1 is possibly relevant evidence. The values of EVS_1 to EVS_12 are one-hot: 1 means present and 0 means absent.
For example, EVS_1_{-1} = 1, EVS_1_{0} = 0, EVS_1_{1} = 0, EVS_1_{2} = 0 indicates that a mutation is a marker related to a cancer treatment approved by a food and drug regulatory agency, but is not the most strongly associated marker for treating that cancer.
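The encoding described above can be sketched as follows. This is a minimal illustration, not the VIC tool's own output format: the function names and the dictionary layout are hypothetical, and the sketch assumes that every one of EVS_1 to EVS_12 is assigned exactly one of the four levels.

```python
# Hypothetical helper names; only the EVS_1..EVS_12 labels and the four
# levels {-1, 0, 1, 2} come from the text.
LEVELS = (-1, 0, 1, 2)
NUM_EVS = 12

def one_hot_level(level):
    """Return a 4-dimensional one-hot vector for one evidence level."""
    if level not in LEVELS:
        raise ValueError(f"unknown evidence level: {level}")
    return [1 if level == candidate else 0 for candidate in LEVELS]

def encode_clinical_evidence(levels_by_evs):
    """Concatenate the one-hot vectors of EVS_1..EVS_12 into 48 dimensions."""
    vector = []
    for index in range(1, NUM_EVS + 1):
        vector.extend(one_hot_level(levels_by_evs[f"EVS_{index}"]))
    return vector

# Illustrative input (assumed): every item at level 0 except EVS_1 at level -1,
# which matches the EVS_1_{-1} = 1 example in the text.
example = {f"EVS_{i}": 0 for i in range(1, NUM_EVS + 1)}
example["EVS_1"] = -1
encoded = encode_clinical_evidence(example)
```

Here the first four entries of `encoded` correspond to EVS_1_{-1}, EVS_1_{0}, EVS_1_{1} and EVS_1_{2}, in that order.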
According to an embodiment of the present disclosure, the position data of the mutated target amino acid site can be represented by a real number.
According to an embodiment of the present disclosure, the data on the impact of the mutated target amino acid site on protein function, the data on its clinical impact, and its position data are concatenated to obtain the input data used as the initial features. The protein function impact data is 23-dimensional, the clinical impact data is 48-dimensional, and the position data is 1-dimensional, so the concatenated feature data can be 72-dimensional.
According to an embodiment of the present disclosure, the 72-dimensional features can be analyzed by their variance: features whose variance is less than 1 are filtered out and features whose variance is greater than 1 are retained. A variance below 1 indicates that a feature fluctuates little across the data, so, in order to train the model better, the features that fluctuate more are retained.
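The concatenation and variance filter can be sketched as below. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the population variance is used (the text does not say which estimator is meant), and the three rows are toy data rather than values from the disclosure.

```python
from statistics import pvariance

def concatenate_features(protein_function, clinical_evidence, position):
    """Join the 23-, 48- and 1-dimensional parts into one 72-dimensional row."""
    assert len(protein_function) == 23 and len(clinical_evidence) == 48
    return list(protein_function) + list(clinical_evidence) + [float(position)]

def variance_filter(rows, threshold=1.0):
    """Keep only the columns whose variance across samples exceeds threshold."""
    columns = list(zip(*rows))
    kept = [j for j, column in enumerate(columns) if pvariance(column) > threshold]
    return kept, [[row[j] for j in kept] for row in rows]

# Toy data (not from the source): three samples that differ only in position.
rows = [
    concatenate_features([0.0] * 23, [0] * 48, position)
    for position in (175, 248, 273)
]
kept_columns, filtered_rows = variance_filter(rows)
```

In this toy input only the position column varies, so it is the only column retained. An unscaled 0/1 column has a variance of at most 0.25, so under this sketch such columns would always be dropped; the text does not state whether features are rescaled before filtering.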
According to an embodiment of the present disclosure, the first prognostic risk level data may refer to high prognostic risk data, labeled as class 1 in the embodiments of the present disclosure; the second prognostic risk level data may refer to low prognostic risk data, labeled as class 0 in the embodiments of the present disclosure.
According to an embodiment of the present disclosure, the process of randomly splitting the sample data set into an initial training sample set and a validation sample set, and the process of oversampling the initial training sample set, can be illustrated by the following example. In one embodiment of the present disclosure, the sample data set contains 213 samples. Dividing it at a ratio of 9:1 between the initial training sample set and the validation sample set yields 191 initial training samples and 22 validation samples, and the 22 validation samples include 2 samples of class 0. Statistics show that the class distribution of the sample data set is moderately imbalanced: 197 samples belong to class 1 and 16 to class 0. Therefore, before model classification, the SMOTE (Synthetic Minority Oversampling Technique) method is used to oversample the initial training sample set. Before oversampling, the ratio of class 1 to class 0 data in the initial training sample set is 177:14; after oversampling, the ratio is 177:177. Oversampling can increase the amount of class 0 data by replicating class 0 data, so that the ratio of class 1 to class 0 data finally becomes 177:177.
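The oversampling step can be sketched as follows. This is a simplified, SMOTE-style illustration and not the implementation used in the disclosure: SMOTE as commonly described creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, whereas the text above also mentions replicating class 0 data. The function name, the neighbour count `k`, the seeds and the two-dimensional Gaussian toy features are all assumptions; only the counts 177 and 14 come from the text.

```python
import random

def smote_like_oversample(minority, target_count, k=5, seed=0):
    """Add synthetic minority rows by interpolating between near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding itself).
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return minority + synthetic

# Counts from the embodiment: 177 class-1 and 14 class-0 training samples.
# The feature values themselves are toy data.
rng = random.Random(1)
class_1 = [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(177)]
class_0 = [[rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(14)]
balanced_class_0 = smote_like_oversample(class_0, target_count=len(class_1))
```

Only the training portion is oversampled here, which mirrors the text: the 22 validation samples keep their original class distribution.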
According to an embodiment of the present disclosure, by using the data on the impact of the mutated target amino acid site on protein function, the data on its clinical impact, and its position data, the prognostic risk type of amino acid sites with unknown clinical prognostic risk can be analyzed from multiple dimensions, which improves the accuracy with which the integrated classification model analyzes the prognostic risk type.
According to an embodiment of the present disclosure, cross-validation training of the initial integrated classification model using the target training sample set may include the following operations: randomly dividing the target training sample set into Q equal training sample subsets, where Q is a positive integer; and repeating the following operations until a preset training end condition is met: selecting Q-1 of the Q training sample subsets to train the initial integrated classification model, obtaining an intermediate integrated classification model to be used in the next round of training, wherein the Q-1 training sample subsets selected differ from one round of training to the next; and using the remaining training sample subset of the Q training sample subsets to test the intermediate integrated classification model to be used in the next round of training, wherein the training sample subset used differs from one test to the next.
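The fold rotation described above can be sketched as a generator of (training, held-out) pairs. This is a minimal sketch: the function name and seed are hypothetical, Q = 5 is an illustrative choice that the text does not specify, and model fitting itself is omitted.

```python
import random

def q_fold_splits(samples, q, seed=0):
    """Yield (training, held_out) pairs, one per fold, over a shuffled copy."""
    if not 2 <= q <= len(samples):
        raise ValueError("q must be between 2 and the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::q] for i in range(q)]  # sizes differ by at most one
    for held_out_index in range(q):
        training = [
            item
            for index, fold in enumerate(folds)
            if index != held_out_index
            for item in fold
        ]
        yield training, folds[held_out_index]

# 354 = 177 + 177 samples after oversampling; q = 5 is an assumed value.
splits = list(q_fold_splits(range(354), q=5))
```

Each sample appears in exactly one held-out subset, so every round trains on a different combination of Q-1 subsets and tests on a different remaining subset.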
According to an embodiment of the present disclosure, after the integrated classification model has been trained by cross-validation, the trained model can be verified using the validation sample set. A ROC curve (receiver operating characteristic curve) is drawn from the output of the integrated classification model, and the area under the ROC curve (AUC, Area Under Curve) and the accuracy (ACC, Accuracy) are used to evaluate the performance of the model. The AUC is the area enclosed between the ROC curve and the coordinate axes. The X-axis is the false positive rate (FPR), expressed as FPR=FP/(FP+TN), where FP (False Positive) is the number of gene mutations that are false positives and TN (True Negative) is the number of gene mutations that are true negatives. The Y-axis is the true positive rate (TPR), expressed as TPR=TP/(TP+FN), where TP (True Positive) is the number of gene mutations that are true positives and FN (False Negative) is the number of gene mutations that are false negatives.
In a binary classification model, once a threshold has been fixed for the model output (for example, 0.5), instances scoring above the threshold can be assigned to the positive class and instances scoring below it to the negative class. Each point on the ROC curve corresponds to one threshold, taken from the predicted positive-class probabilities of the test samples sorted from largest to smallest. For a given classifier, each threshold yields one TPR and one FPR. The ACC of the model can therefore be determined as shown in formula (5).
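The threshold sweep described above can be sketched as follows. This is a minimal illustration rather than the code behind FIG. 12A: the function names are hypothetical, the labels and scores are toy values, and the AUC is computed with the trapezoidal rule, which is one common choice.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold downwards."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample sharing this score.
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Toy scores (not from the source): 1 = high risk, 0 = low risk.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

For this toy input, five of the six positive-negative pairs are ranked correctly, so the area equals 5/6.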
FIG. 12A is a ROC curve of the integrated classification model according to an embodiment of the present disclosure.
As shown in FIG. 12A, the output of the integrated classification model of the embodiment of the present disclosure may be as follows: at a threshold of 0.5, there are 19 TPs, 1 TN, 1 FP and 1 FN. From these counts, the FPR is 1/2 and the TPR is 19/20. Substituting the FP, FN, TP and TN counts and the FPR and TPR values into formula (5) gives an ACC of 0.909 on the validation sample set (test ACC). In FIG. 12A, the AUC on the validation sample set (test AUC) is 0.975, and the ACC on the target training sample set (train ACC) is 1.
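The arithmetic above can be checked with a short sketch. Formula (5) is not reproduced in this text, so the sketch assumes the standard definition ACC = (TP + TN) / (TP + TN + FP + FN), which reproduces the reported value of 0.909; the function name is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return FPR, TPR and accuracy from the four confusion-matrix counts."""
    return {
        "FPR": fp / (fp + tn),
        "TPR": tp / (tp + fn),
        # Assumed standard accuracy definition (formula (5) is not shown).
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts reported for the validation sample set at a threshold of 0.5.
metrics = confusion_metrics(tp=19, tn=1, fp=1, fn=1)
```

The counts are consistent with the split described earlier: 19 + 1 = 20 class 1 samples and 1 + 1 = 2 class 0 samples make up the 22 validation samples.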
FIG. 12B is a ROC curve of the related art model.
As shown in FIG. 12B, to verify that the integrated classification model of the present application achieves better output, an SVM (support vector machine) from the related art was also tested. The test used the same initial training sample set as was used to train the integrated classification model, oversampled in the same way with the SMOTE method to obtain the target training sample set, which was then used for training. After training, the trained SVM model was verified on the same validation sample set that was used to verify the integrated classification model. The resulting ROC curve can be as shown in FIG. 12B. The ACC on the target training sample set is 92.67%, the ACC on the validation sample set is 90.91%, and the AUC on the validation sample set is 0.88.
Continuing to refer to FIG. 12A and FIG. 12B, the ordinate values of the ROC line in FIG. 12A are higher than those of the solid ROC line in FIG. 12B, so the area under the ROC line (AUC) in FIG. 12A is greater than the AUC in FIG. 12B. The larger the AUC, that is, the closer it is to 1, the higher the correctness of the model; the integrated classification model of FIG. 12A therefore performs better than the SVM model of FIG. 12B.
The dashed lines in FIG. 12A and FIG. 12B are both y=x, which represents a false positive rate equal to the true positive rate. The ROC line generally lies above the line y=x, and the area under it (AUC) can range between 0.5 and 1. The closer the AUC is to 1.0, the more reliable the model output; when the AUC equals 0.5, reliability is lowest, meaning that the model prediction is roughly equivalent to random guessing and has no practical value. The dashed line can therefore serve as a threshold: if the ROC line falls below the blue dashed line, the model has no practical value.
According to an embodiment of the present disclosure, comparing the ROC curve, ACC and AUC of the integrated classification model of the embodiment of the present disclosure with those of the related art model shows that the integrated classification model achieves a higher validation AUC (0.975 versus 0.88) and a higher training ACC (1 versus 92.67%), with the same validation ACC (0.909 versus 90.91%). This indicates that the integrated classification model of the embodiment of the present disclosure can classify the prognostic risk level of point mutations more accurately.
According to an embodiment of the present disclosure, the initial integrated classification model is trained by cross-validation on the target training sample set. Because the target training sample set is divided multiple times, and the training sample subsets and validation sample subset differ in each round of training, the adverse effect of an unbalanced single division of the target training sample set can be eliminated, which improves the accuracy of the final integrated classification model.
FIG. 13 shows a prognostic risk assessment method for gene site mutations according to another embodiment of the present disclosure.
As shown in FIG. 13, the prognostic risk assessment method for gene site mutations of this embodiment may include operations S1310 to S1330.
In operation S1310, TP53 point mutation data and clinical data are obtained from the ICGC database and the MSK database, and preprocessing such as gene re-annotation is performed.
In operation S1320, based on the preprocessed data, a sliding window and a proportional risk regression model analysis method are used to analyze the prognostic risk value of each single site mutation.
In operation S1330, based on the protein function impact scores, the clinical evidence scores and the position information, an integrated classification model method is used to predict whether sites of unknown clinical significance carry a high or low prognostic risk.
The technical solution of the embodiments of the present disclosure may include two parts. In the first part, a sliding window and a proportional risk regression model analysis method provide a quantitative characterization of the prognostic risk of gene point mutations. In the second part, multiple testing tools and an integrated classification model provide a qualitative determination of the prognostic risk of gene point mutations whose prognostic risk is unknown. Combining quantitative characterization with qualitative prognostic risk analysis makes it possible to analyze the prognostic risk of gene mutation sites and provides reference information for doctors formulating clinical plans.
According to an embodiment of the present disclosure, a sliding window and a proportional risk regression model analysis method are used to analyze the magnitude of the prognostic risk of TP53 gene site mutations from gene mutation group data, composed of mutation data and clinical data, and control group data, which achieves quantitative analysis. Then, based on the data on the impact of gene site mutations or amino acid site mutations on protein function, the data on their clinical impact, and their position data, an integrated classification model is used to predict the prognostic risk of TP53 mutation sites of unknown clinical significance, which achieves qualitative analysis. The prognostic risk of TP53 gene site mutations is thus addressed both quantitatively and qualitatively, which refines the granularity of the assessment of gene mutation prognostic risk, improves its accuracy, and provides a basis for clinicians formulating clinical plans.
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.
FIG. 14 is a block diagram of an electronic device suitable for implementing the prognostic risk assessment method for gene site mutations according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 14, the electronic device 1400 includes a computing unit 1401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a random access memory (RAM) 1403. The RAM 1403 can also store various programs and data required for the operation of the electronic device 1400. The computing unit 1401, the ROM 1402 and the RAM 1403 are connected to each other via a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
Multiple components of the electronic device 1400 are connected to the I/O interface 1405, including: an input unit 1406, such as a keyboard or a mouse; an output unit 1407, such as various types of displays and speakers; a storage unit 1408, such as a magnetic disk or an optical disc; and a communication unit 1409, such as a network card, a modem or a wireless communication transceiver. The communication unit 1409 allows the electronic device 1400 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1401 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and so on. The computing unit 1401 performs the methods and processes described above, for example, the prognostic risk assessment method for gene site mutations. For example, in some embodiments, the prognostic risk assessment method for gene site mutations may be implemented as a computer software program that is tangibly contained in a machine-readable medium, such as the storage unit 1408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1400 via the ROM 1402 and/or the communication unit 1409. When the computer program is loaded into the RAM 1403 and executed by the computing unit 1401, one or more steps of the prognostic risk assessment method for gene site mutations described above may be performed. Alternatively, in other embodiments, the computing unit 1401 may be configured to perform the prognostic risk assessment method for gene site mutations in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
The program code for implementing the method of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data mining device, so that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual, auditory or tactile feedback), and input from the user can be received in any form (including acoustic, voice or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system can be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN) and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in a different order, as long as the results expected of the technical solutions disclosed herein can be achieved; no limitation is imposed here.
The specific implementations above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modification, equivalent substitution or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.
Claims (21)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202380010975.XA CN120113000A (en) | 2023-09-28 | 2023-09-28 | Prognostic risk assessment method for gene site mutation, electronic device and storage medium |
| PCT/CN2023/122493 WO2025065483A1 (en) | 2023-09-28 | 2023-09-28 | Genetic locus mutation prognostic risk assessment method, electronic device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/122493 WO2025065483A1 (en) | 2023-09-28 | 2023-09-28 | Genetic locus mutation prognostic risk assessment method, electronic device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025065483A1 (en) | 2025-04-03 |
Family
ID=95204138
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/122493 Pending WO2025065483A1 (en) | 2023-09-28 | 2023-09-28 | Genetic locus mutation prognostic risk assessment method, electronic device and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120113000A (en) |
| WO (1) | WO2025065483A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220349013A1 (en) * | 2019-06-25 | 2022-11-03 | The Translational Genomics Research Institute | Detection and treatment of residual disease using circulating tumor dna analysis |
| CN115547496A (en) * | 2022-09-27 | 2022-12-30 | 中国人民解放军中部战区总医院 | Method for establishing glioblastoma autophagy-related prognosis model |
| CN116189889A (en) * | 2022-12-14 | 2023-05-30 | 西南交通大学 | Construction method of prognosis risk assessment model of liver cancer patient |
Non-Patent Citations (3)
| Title |
|---|
| BI YANQING; HUANG RONG; SONG WEI; ZHANG ZIYING; TIAN ZIXUAN; LIU MIN; BAO HAN; YAN TAO; XIA YUAN; ZHANG NAN; ZHANG XINGGUANG: "Predictive value of TTN mutation-related gene characteristics on prognosis of hepatocellular carcinoma", JOURNAL OF MODERN ONCOLOGY, vol. 31, no. 12, 6 May 2023 (2023-05-06), pages 2275-2283, XP093296462, ISSN: 1672-4992, DOI: 10.3969/j.issn.1672-4992.2023.12.018 * |
| SHUN ZHANG; LIU CAIYAN; TAN GUICHUN: "Development of a prognostic model based on PIK3CA mutation and its related genes for endometrioid adenocarcinoma", PROGRESS IN OBSTETRICS AND GYNECOLOGY, vol. 29, no. 5, 21 April 2020 (2020-04-21), pages 329-334, XP093296466, ISSN: 1004-7379, DOI: 10.13283/j.cnki.xdfckjz.2020.05.002 * |
| ZHENG HAOTIAN; WANG GUANGHUI; ZHAO XIAOGANG; WANG YADONG; ZENG YUKAI; DU JIAJUN: "A prognostic risk model for LKB1 mutant lung adenocarcinoma based on aberrant DNA methylation sites", JOURNAL OF SHANDONG UNIVERSITY (HEALTH SCIENCES), 10 March 2022 (2022-03-10), CN, pages 51-58, XP093296457, ISSN: 1671-7554, DOI: 10.6040/j.issn.1671-7544.0.2021.0821 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120113000A (en) | 2025-06-06 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN106909806B (en) | Method and device for spot detection of variants | |
| US10354747B1 (en) | Deep learning analysis pipeline for next generation sequencing | |
| US20200303038A1 (en) | Variant calling in single molecule sequencing using a convolutional neural network | |
| CN111243753B (en) | Multi-factor correlation interactive analysis method for medical data | |
| KR102404947B1 (en) | Method and apparatus for machine learning based identification of structural variants in cancer genomes | |
| KR102812123B1 (en) | Method and apparatus for classifying variation candidates within whole genome sequence | |
| US12272431B2 (en) | Detecting false positive variant calls in next-generation sequencing | |
| CN111651753A (en) | User behavior analysis system and method | |
| CN120164524B (en) | Data analysis method, system and storage medium for genetic disease gene detection | |
| US9965584B2 (en) | Identifying interacting DNA loci using a contingency table, classification rules and statistical significance | |
| WO2025065483A1 (en) | Genetic locus mutation prognostic risk assessment method, electronic device and storage medium | |
| JP2003281156A (en) | Screen display system and medical diagnosis support system | |
| CN113380324B (en) | T cell receptor sequence motif combination recognition detection method, storage medium and equipment | |
| CN118609646A (en) | A risk stratification analysis method and electronic device for thyroid papillary microcarcinoma | |
| CN120435254A (en) | Variant processing method, system, device and storage medium | |
| CN108959853A (en) | A kind of analysis method, analytical equipment, equipment and storage medium copying number variation | |
| US20200105374A1 (en) | Mixture model for targeted sequencing | |
| US20250384960A1 (en) | Device for determining an indicator of presence of hrd in a genome of a subject | |
| CN118471340B (en) | A sequencing quality assessment method, device, equipment, medium and product | |
| CN119400244B (en) | Indel mutation data analysis method, device, system and readable storage medium | |
| WO2025180330A1 (en) | Drug resistance database construction method and apparatus, drug resistance detection method and apparatus, and device | |
| Aufiero et al. | inDAGO: a user-friendly interface for seamless dual and bulk RNA-Seq analysis | |
| Bonetti et al. | Accuracy of RENOVO Predictions on Genetic Variants Reclassified Over Time | |
| JP2025183943A (en) | Treatment drug selection and/or clinical trial entry decision support system | |
| Istighosah et al. | Machine Learning-Based Prediction Model for Thyroid Cancer Diagnosis Using Clinicopathologic Features |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23953632; Country of ref document: EP; Kind code of ref document: A1 |
| | WWP | WIPO information: published in national office | Ref document number: 202380010975.X; Country of ref document: CN |