
CN117637035A - Classification model and method for trusted multi-omics integration based on a graph neural network - Google Patents


Info

Publication number
CN117637035A
Authority
CN
China
Prior art keywords
omics
classification
data
confidence
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311702871.3A
Other languages
Chinese (zh)
Inventor
姚晓辉
丛山
罗昊燃
梁洪
贾淼
袁浚博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Harbin Engineering University Innovation Development Center
Original Assignee
Qingdao Harbin Engineering University Innovation Development Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Harbin Engineering University Innovation Development Center filed Critical Qingdao Harbin Engineering University Innovation Development Center
Publication of CN117637035A
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/257Belief theory, e.g. Dempster-Shafer
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a classification model and a classification method for trusted multi-omics integration based on a graph neural network. The method comprises the following steps: preparing, for a sample, the omics data of the sample; constructing a specific network for each type of omics data; performing aggregation and update on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics type; calculating the confidence of each omics type and enhancing the aggregated features; fusing the confidence-enhanced features of the multiple omics types to generate a final classification result; and outputting the medical analysis result of the target object. The model comprises: a multi-omics data preparation module, an omics data network construction module, a feature aggregation and classification module, a confidence calculation and enhancement module, a feature fusion and classification module, and an output module.

Description

Classification model and method for trusted multi-omics integration based on a graph neural network
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a classification model and a classification method for trusted multi-omics integration based on a graph neural network.
Background
With the continuous deepening of medical research, omics data (such as genomics, transcriptomics, proteomics, metabolomics and the like) provide valuable information resources and help us better understand the occurrence, development and treatment mechanisms of diseases. These data are of great value, especially in the diagnosis and treatment of complex diseases such as Alzheimer's disease and cancer.
With the rapid development of high-throughput sequencing technology and the reduction of its cost, more and more public databases containing high-quality omics data have become available. Accordingly, researchers in the field of bioinformatics have moved from using only a single omics data type to using multiple omics data types simultaneously. Meanwhile, the classification and subtyping of complex diseases, as complex traits with different clinical, pathological and molecular characteristics, are of prognostic and therapeutic significance. Studies on complex disease classification are therefore of great importance for precision medicine and prognosis prediction. Many related methods are based on traditional machine learning, and most of them learn from a single omics data type. Methods based on the integration of multi-omics data are comparatively few, and their results still need improvement.
For example, Wang et al [Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification [J]. Nature Communications, 2021, 12(1): 1-13] propose a Multi-Omics Graph cOnvolutional NETworks (MOGONET) integration method for biomedical classification. The method can be summarized in three parts: first, preprocessing and feature selection are performed on each omics data type; then omics-specific learning is carried out through GCN; finally, multi-omics integration is performed through VCDN. The advantage of the method is that the added VCDN model can classify the data better, and the experimental results are also well interpretable.
As another example, Althubaiti et al [Althubaiti S, Kulmanov M, Liu Y, et al. DeepMOCCA: A pan-cancer prognostic model identifies personalized prognostic markers through graph attention and multi-omics data integration [J]. BioRxiv, 2021] developed DeepMOCCA, a framework for multi-omics cancer analysis consisting of a graph convolutional neural network and a graph attention mechanism. It is able to predict the survival of samples across 33 cancer types, outperforms most existing methods, and can be used to identify driver genes and prognostic markers in patients; its shortcoming is that accurate prognostic markers are still lacking for many cancers.
The prior art schemes therefore have shortcomings in extracting omics representation information and in overcoming inter-omics heterogeneity. Technically, multi-omics integration can be divided, according to the timing of integration, into three types: early integration, mid-term integration and late integration. Early integration converts the data sets into a single feature-based table or graph-based representation, then uses different data combinations after original or dimension-reduced processing, and finally feeds them into a machine learning model to obtain a prediction result. Its disadvantage is that the unique distribution of each omics data type is ignored, the weights need to be normalized, and the dimension of the input data is increased; moreover, as the number of integrated omics types increases, the effect of integration tends to decrease. Mid-term integration retains the data structure of each data set and combines them only at the analysis stage; it fuses the data sets through a joint model and can thus handle the diversity of the data sets. Its disadvantages are the high preprocessing requirements on the features and the need to limit their number, so as to prevent dimension explosion while still expressing the characteristics of the omics data. Late integration means that each omics data type learns its features separately to form several first-stage training models, and the features obtained by the first-stage training are then integrated and used as the input of a classifier or regressor. Its disadvantages are low reliability and a large cost of feature mining: it integrates only the prediction results of each omics type and does not exploit the complementary information between omics types.
Therefore, there is a need in the art to develop a deep learning algorithm for multi-omics data integration based on a graph neural network, so as to predict the progression and subtype classification of complex diseases.
Furthermore, on the one hand there are differences in understanding among those skilled in the art; on the other hand, since the applicant studied a large number of documents and patents when making the present invention, the text does not recite all of their details and contents. This by no means implies that the present invention lacks these prior art features; on the contrary, the present invention may possess all of these prior art features, and the applicant reserves the right to add related prior art to the background art.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a classification model and a classification method for trusted multi-omics integration based on a graph neural network, so as to obtain the medical analysis results (disease course classification and disease subtype) of a target object by utilizing multi-omics data (including mRNA omics, methylation omics and miRNA omics) of complex diseases (Alzheimer's disease, cancer and the like). Traditional statistical methods require a great deal of manual intervention when processing omics data, and it is difficult for them to provide a clear classification or subtyping of disease. In addition, although existing machine learning methods can screen out disease-related biomarkers, the prediction results lack interpretability and the prediction accuracy needs to be improved.
The reasons for the problems in the prior art are mainly as follows:
(1) Each type of omics data has its own characteristics:
Multi-omics integration analysis requires the use of multiple omics data sets, e.g., metabolome, transcriptome, genome, etc. Their data structures differ, as do their data types. This characteristic causes the various omics types to interfere with each other during integration, affecting the integration effect and thus the final task goal.
(2) Algorithm models:
Omics data are high-dimensional, noisy, sparse and heterogeneous, and experiments also face the problem of imbalanced data sets, all of which can affect the accuracy of model prediction. Therefore, integrating different, complex and large-scale omics data places high demands on the analytical capability of the algorithm model and the computing platform. Early and mid-term integration strategies do address this by pre-integrating all data sets, but the large matrices produced by early integration are difficult for most machine learning models to use, while mid-term integration typically relies on unsupervised matrix factorization and makes it hard to incorporate the large amount of pre-existing biological knowledge. Existing methods and algorithm models for integrated analysis of omics data have achieved some success, but most of them integrate the results after analyzing each omics data type independently, and their integration and analysis capabilities are limited.
(3) Feature extraction capability:
Traditional integration methods often feed the preprocessed features directly into the model, an operation that cannot adequately extract the hidden information in the omics data. The hidden information of omics data can be further extracted by exploiting its natural topological properties, which facilitates the subsequent integration operations.
In order to integrate multi-omics data, the prior art has developed technical schemes that combine the advantages of various machine learning methods, such as ensemble learning, to achieve accurate data selection. For example, patent document CN115565610A discloses a method and system for building a recurrence and metastasis analysis model based on multi-omics data. The method performs normalization, comparative analysis and construction of relationships among different omics data to obtain multi-omics data and extract their feature data; performs dimension reduction on the omics feature data using principal component analysis; performs data augmentation on the dimension-reduced omics feature data so that it meets the sample-size requirement; and builds a recurrence and metastasis analysis model using an ensemble learning algorithm based on the omics feature data meeting the sample-size requirement. In that scheme, the multi-omics feature data are systematically selected and dimension-reduced so that data from different omics types can be effectively utilized and screened, quality control is performed on the multi-omics data used to build the recurrence and metastasis analysis model, and multiple classical machine learning models are finally combined to improve the accuracy of the recurrence and metastasis analysis model. However, that scheme requires all data sources of the different omics types to be normalized, and this treatment ignores the influence of outliers in the different omics data on the normalized data, so the influence of key data in each omics type on the overall effect cannot be evaluated, and the stability and accuracy of the normalization cannot be guaranteed. In contrast, the present invention first processes the different omics data separately in order to extract different omics features, thereby ensuring the accuracy of the analysis of each omics data type and avoiding mutual interference between them.
Therefore, aiming at the problem that mutual interference among omics types degrades the integration effect when multi-omics integration techniques are used for the prediction and classification of complex diseases, the invention provides a classification model and a classification method for trusted multi-omics integration based on a graph neural network, which combine a graph neural network with a trusted mechanism to perform multi-omics integration and thereby achieve predictive classification.
In a first aspect, the invention discloses a classification method for trusted multi-omics integration based on a graph neural network, which comprises the following steps:
preparing, for a sample, the omics data of the sample;
constructing a specific network for each type of omics data;
performing aggregation and update on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics type;
calculating the confidence of each omics type and enhancing the aggregated features;
fusing the confidence-enhanced features of the multiple omics types to generate a final classification result;
and outputting the medical analysis result of the target object.
Compared with the prior art, the method can aggregate and update each omics-specific network, calculate the confidence of each omics type, enhance the aggregated features, and fuse the confidence-enhanced features of multiple omics types to generate a final classification result. Based on the above distinguishing technical features, the problems to be solved by the present invention may include: how to improve the accuracy of medical analysis results that fuse multiple kinds of omics data. The prior art has developed technical schemes for cluster analysis of multi-omics data based on graph neural network models. For example, patent document CN113392894A discloses a cluster analysis method and system for multi-omics data, which segments MR image information using a neural network and extracts high-throughput image hyper-parameters from the segmented information of each part; processes the clinical data, demographic data and laboratory test data to generate vector representations of different dimensions; performs multi-source data fusion of the high-throughput image data and the vector representations of different dimensions to obtain fused multi-source heterogeneous data; constructs a multi-source heterogeneous data set and obtains an optimal model by training and testing a multi-source graph clustering model; and inputs the MR image information into the optimal model to analyze the differences between different categories and the similarity within the same category. That scheme uses a graph structure to express the associations between data intuitively and to capture different features, and the model has good robustness, thus realizing an efficient clustering algorithm based on a graph neural network model. However, the neural network in that scheme is mainly used to segment MR image information, which amounts to segmenting the information of a single omics data type to obtain hyper-parameters for automatic quantitative localization and to provide high-throughput image data for the multi-source heterogeneous data. That is, the neural network of that scheme aims to obtain more detailed data information from each individual omics data type, which is the opposite of the processing direction of the graph neural network of the present invention. Specifically, the present invention uses a graph neural network to aggregate and update each omics data type, then obtains an initial classification result through a neural network, obtains the uncertainty through subjective logic, and finally obtains the final classification result through evidence fusion theory, so as to extract more hidden information from each omics type and thereby improve the accuracy of classifying the course of complex diseases and predicting and classifying disease subtypes. The processing approach of that prior art is completely opposite to the processing method of the present invention and provides technical teaching contrary to the present invention, so neither that technical scheme nor its combinations can serve as the basis of the technical scheme of the present invention for those skilled in the art.
According to a preferred embodiment, the prepared omics data of the sample comprise a plurality of omics types, each consisting of a number of pre-screened features.
According to a preferred embodiment, when constructing the specific network of each type of omics data, an omics information network is constructed by weighted gene co-expression network analysis, and a graph network of the omics data is constructed using topological features so as to combine the expression data with the graph network. Compared with the prior art, the invention can combine the expression data with the graph network. Based on this distinguishing technical feature, the problems to be solved by the present invention may include: how to construct a graph network of omics data. Specifically, the invention constructs an omics information network through weighted gene co-expression network analysis and builds a graph network of the omics data using topological features; the combination of the expression data and the graph network can yield better classification performance and more interpretable biomarkers.
According to a preferred embodiment, for each type of omics data, the initial co-expression graph network is input to a graph attention network (GAT) layer to achieve weighted-sum aggregation of features, and the initial classification of each omics type is completed by a neural network comprising an input layer, an output layer and 3 intermediate layers.
According to a preferred embodiment, when performing aggregation and update on an omics-specific network, a multi-head attention mechanism is used to stabilize the self-attention learning process, and/or a multi-level graph feature complete fusion method is used to promote the aggregation of information at the molecular-module level by exploiting the relationships between internal features.
According to a preferred embodiment, a true class probability confidence criterion is used to obtain the prediction confidence of each omics type when calculating the confidence of each omics type and enhancing the aggregated features, wherein for the m-th omics data set a confidence network with parameter θ^(m) is introduced for estimating the True Class Probability (TCP) confidence over the training data. Here, m indexes the omics type, and θ^(m) parameterizes the confidence network that estimates the true class probability confidence for the m-th omics type. In contrast to the prior art described above, the present invention employs a trusted strategy to evaluate and adaptively adjust the prediction confidence associated with each type of omics data. Based on this distinguishing technical feature, the problems to be solved by the present invention may include: how to obtain a more reliable prediction confidence. Specifically, the traditional confidence inference method is the maximum class probability, in which the confidence MCP of the predicted class is the highest softmax probability, which leads to excessive confidence in incorrect predictions. To solve this problem, the present invention proposes the True Class Probability (TCP) confidence criterion, which is intended to assign low and high confidence levels to erroneous and successful predictions, respectively. By employing the TCP criterion in the model of the present invention, a more reliable prediction confidence is obtained for each omics type.
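For illustration only, the following small sketch contrasts the two confidence criteria on a single prediction; the class count and logit values are hypothetical and are not taken from the patent.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical 3-class example: the classifier is wrong but very confident.
logits = np.array([3.0, 0.5, 0.2])   # model strongly favors class 0
probs = softmax(logits)
true_class = 1                        # the sample actually belongs to class 1

mcp = probs.max()                     # Maximum Class Probability: largest softmax output
tcp = probs[true_class]               # True Class Probability: softmax prob. of the true label

print(f"MCP = {mcp:.3f}  (over-confident although the prediction is wrong)")
print(f"TCP = {tcp:.3f}  (low, correctly signalling an unreliable prediction)")
```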
According to a preferred embodiment, a joint late-fusion integration technique is employed when fusing the confidence-enhanced features of multiple omics types; it leverages an omics-level confidence mechanism to adjust the contribution of each omics data set to the inter-omics fusion, thereby addressing the complexity of inter-omics analysis.
In a second aspect, the invention discloses a classification model for trusted multi-omics integration based on a graph neural network, which comprises:
a multi-omics data preparation module for preparing, for a sample, the omics data of the sample;
an omics data network construction module for constructing a specific network for each type of omics data;
a feature aggregation and classification module for performing aggregation and update on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics type;
a confidence calculation and enhancement module for calculating the confidence of each omics type and enhancing the aggregated features;
a feature fusion and classification module for fusing the confidence-enhanced features of the multiple omics types to generate a final classification result;
and an output module for outputting the medical analysis result of the target object.
In a third aspect, the invention discloses an electronic device comprising a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the method described above.
In a fourth aspect, the present invention discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
The invention adopts a late-integration framework to perform an end-to-end graph classification task and exploits the combined advantages of the graph neural network and a trusted mechanism to achieve high-accuracy classification prediction. The method uses a graph neural network to aggregate and update each omics data type, obtains an initial classification result through a neural network, obtains the uncertainty through subjective logic, and finally obtains the final classification result through evidence fusion theory. When the invention is applied, more hidden information can be extracted from each omics type, and the mutual interference between omics types is overcome through integration at the decision level, so that a better integration effect is achieved and the task metrics for classifying the course of complex diseases (Alzheimer's disease, cancer and the like) and predicting and classifying disease subtypes are improved, where the task metrics include accuracy, F1 value and AUC. The F1 value is the harmonic mean of precision and recall and measures the performance of the model while maintaining a balance between precision and recall.
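The detailed description below does not spell out the exact evidence fusion rule; purely as an illustration of the kind of subjective-logic evidence fusion referred to here, the following sketch combines two per-omics opinions with the reduced Dempster combination commonly used in trusted multi-view classification. The evidence vectors and the fusion rule itself are assumptions for illustration, not the patented procedure.

```python
import numpy as np

def opinion_from_evidence(e):
    """Subjective-logic opinion (belief masses b, uncertainty u) from non-negative evidence over K classes."""
    K = len(e)
    S = e.sum() + K                       # Dirichlet strength (alpha = e + 1)
    return e / S, K / S

def combine(b1, u1, b2, u2):
    """Reduced Dempster combination of two opinions (illustrative)."""
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)   # mass on conflicting classes
    scale = 1.0 - conflict
    b = (b1 * b2 + b1 * u2 + b2 * u1) / scale
    u = (u1 * u2) / scale
    return b, u

# Hypothetical per-omics evidence for a 3-class problem.
b_m, u_m = opinion_from_evidence(np.array([8.0, 1.0, 1.0]))   # e.g. mRNA omics
b_d, u_d = opinion_from_evidence(np.array([4.0, 3.0, 1.0]))   # e.g. methylation omics
b, u = combine(b_m, u_m, b_d, u_d)
print("fused belief:", np.round(b, 3), "fused uncertainty:", round(u, 3))
```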
Drawings
Fig. 1 is a schematic diagram of the structure and flow of a classification model according to a preferred embodiment of the present invention.
List of reference numerals
10: multi-omics data preparation module; 20: omics data network construction module; 30: feature aggregation and classification module; 40: confidence calculation and enhancement module; 50: feature fusion and classification module; 60: output module.
Detailed Description
The following detailed description refers to the accompanying drawings.
Fig. 1 shows a schematic structure and a flow chart of a classification model according to a preferred embodiment of the present invention.
According to a preferred embodiment, the invention discloses a classification method for trusted multi-omics integration based on a graph neural network, which comprises the following steps:
preparing, for a sample, the omics data of the sample;
constructing a specific network for each type of omics data;
performing aggregation and update on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics type;
calculating the confidence of each omics type and enhancing the aggregated features;
fusing the confidence-enhanced features of the multiple omics types to generate a final classification result;
and outputting the medical analysis result of the target object.
Preferably, the sample to which the present invention is directed is typically a patient with a complex disease, and the prepared omics data include omics type 1, omics type 2, omics type 3, and so on. Each omics type consists of several features, which are typically screened by preprocessing and scaled to between 0 and 1 before being input to the model.
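A minimal sketch of this preprocessing step, assuming each omics type is already a samples-by-features matrix of pre-screened features; the array shapes and omics names below are made up for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature of an omics matrix (samples x features) to [0, 1]."""
    x_min = X.min(axis=0)
    x_rng = X.max(axis=0) - x_min
    x_rng[x_rng == 0] = 1.0          # guard against constant features
    return (X - x_min) / x_rng

# Hypothetical multi-omics input: 100 samples, three omics types with different feature counts.
rng = np.random.default_rng(0)
omics = {
    "mRNA": rng.normal(size=(100, 200)),
    "methylation": rng.normal(size=(100, 200)),
    "miRNA": rng.normal(size=(100, 100)),
}
omics_scaled = {name: min_max_scale(X) for name, X in omics.items()}
```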
Preferably, when constructing the specific network for each type of omics data, the following steps may be performed:
An omics information network is constructed by weighted gene co-expression network analysis (WGCNA), and a graph network of the omics data is built using topological features; the combination of this expression data and the graph network enables better classification performance and more interpretable biomarkers.
For each patient, the initial co-expression graph network may be represented by the following formula:
G_0 = G(V_{K×1}, E_{K×K}),
wherein G is the graph network, V is the node set of the graph network, E is the edge set of the graph network, K is the number of features, V_{K×1} represents the node features, and E_{K×K} represents the edge matrix.
The edge matrix E_{K×K} is obtained from the co-expression computation of the WGCNA analysis, which is performed with the R package "WGCNA". Specifically, for one sample a vector of dimension 1×K is generated, where K is the number of features. For the N samples belonging to the same omics data type, an N×K matrix is formed to compute the co-expression matrix A_{K×K}, where N is the number of samples. The co-expression entry A_{ij} describing the correlation between node v_i and node v_j is given by:
A_{ij} = |corr(x_i, x_j)|^β,
wherein x_i and x_j are the expression vectors of nodes v_i and v_j, and β is the soft threshold of WGCNA.
The edge matrix E_{K×K} is generated from the co-expression matrix A_{K×K} by binarization: E_{ij} = 1 if A_{ij} is not smaller than the threshold, and E_{ij} = 0 otherwise.
Preferably, in the present invention, the threshold for binarizing the matrix is set to 0.08.
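A small numpy sketch of the network construction just described: a WGCNA-style absolute co-expression adjacency raised to the soft threshold β and binarized at 0.08. The value β = 6 is an illustrative default only, since the text states that β is chosen by WGCNA rather than fixed.

```python
import numpy as np

def coexpression_edges(X, beta=6, threshold=0.08):
    """Build a binary edge matrix E (K x K) from an N x K omics matrix X."""
    A = np.abs(np.corrcoef(X, rowvar=False)) ** beta   # A_ij = |corr(x_i, x_j)|^beta
    np.fill_diagonal(A, 0.0)                           # no self-loops in the co-expression matrix
    E = (A >= threshold).astype(np.float32)            # binarize with the 0.08 threshold
    return A, E

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))        # hypothetical omics matrix: N=100 samples, K=200 features
A, E = coexpression_edges(X)
print(E.shape, int(E.sum()), "edges")
```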
Preferably, when the omics-specific network is aggregated and updated and the extracted omics features are dimension-reduced and classified, the initial classification may be completed by a graph attention network (also referred to as a graph neural network) and a neural network. For each type of omics data, the initial co-expression graph network is input to the graph attention network (GAT) layer. The graph attention network combines an attention mechanism with graph convolution, and its calculation comprises the following two steps:
First, the attention coefficients are calculated. For node i, the similarity coefficient between itself and each of its neighbors is calculated as follows:
e_{ij} = a([W h_i || W h_j]), j ∈ N_i,
where the node features are enhanced by increasing their dimension through a linear mapping with the shared parameter W, and the transformed features of nodes i and j are concatenated as [W h_i || W h_j]. The concatenated high-dimensional features are mapped to a real number by a(·), and the attention coefficients are normalized over the neighborhood:
α_{ij} = softmax_j(e_{ij}) = exp(LeakyReLU(e_{ij})) / Σ_{k∈N_i} exp(LeakyReLU(e_{ik})),
wherein e_{ij} is the similarity coefficient between nodes.
Next, the features are weighted and aggregated according to the calculated attention coefficients, as follows:
h'_i = σ(Σ_{j∈N_i} α_{ij} W h_j),
wherein h'_i denotes the output feature generated by the GAT layer for node i.
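A rough PyTorch sketch of these two steps with a dense adjacency matrix and a single attention head; the layer sizes, the ELU standing in for σ, and the random inputs are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadGAT(nn.Module):
    """One graph-attention head: e_ij = a([W h_i || W h_j]), softmax over neighbors, weighted sum."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared linear map W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention function a(.)

    def forward(self, h, adj):
        z = self.W(h)                                     # (K, out_dim)
        K = z.size(0)
        # pairwise concatenation [W h_i || W h_j] for every node pair
        pairs = torch.cat([z.unsqueeze(1).expand(K, K, -1),
                           z.unsqueeze(0).expand(K, K, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))       # similarity coefficients e_ij
        e = e.masked_fill(adj == 0, float("-inf"))        # keep only edges of the co-expression graph
        alpha = torch.softmax(e, dim=-1)                  # normalized attention coefficients
        return F.elu(alpha @ z)                           # h'_i = sigma(sum_j alpha_ij W h_j), ELU as sigma

# Hypothetical use: 200 nodes with 1-dim expression features and a binary edge matrix from WGCNA.
layer = SingleHeadGAT(in_dim=1, out_dim=8)
h = torch.randn(200, 1)
adj = (torch.rand(200, 200) > 0.9).float()
adj.fill_diagonal_(1.0)                  # add self-loops so every node attends to at least itself
out = layer(h, adj)                      # (200, 8)
```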
Preferably, the present invention employs a multi-head attention mechanism in order to stabilize the self-attention learning process. That is, the operation of the GAT layer is replicated independently T times, each copy having its own parameters, and the outputs are aggregated by concatenating the features as follows:
h'_i = ||_{t=1..T} σ(Σ_{j∈N_i} α_{ij}^t W^t h_j),
wherein || denotes feature concatenation, α_{ij}^t denotes the attention coefficient obtained from the t-th copy, and W^t denotes the weight matrix of the linear transformation of the t-th copy.
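Continuing the sketch above, a multi-head wrapper that runs T independent copies and concatenates their outputs; the head count of 4 is an arbitrary illustrative choice.

```python
class MultiHeadGAT(nn.Module):
    """T independent attention heads whose outputs are concatenated (feature series)."""
    def __init__(self, in_dim, out_dim, heads=4):
        super().__init__()
        self.heads = nn.ModuleList(SingleHeadGAT(in_dim, out_dim) for _ in range(heads))

    def forward(self, h, adj):
        return torch.cat([head(h, adj) for head in self.heads], dim=-1)   # (K, heads * out_dim)
```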
Further, in order to exploit the relationships between internal features, the present invention applies a multi-level graph feature complete fusion method. In addition to aggregating molecular features at the node level, this method is used to promote information aggregation at the molecular-module level. Specifically, a higher-level graph G_1 is generated by applying a multi-head GAT layer to G_0, and G_2 is derived from G_1 in a similar manner. Thereafter, the graph embeddings from the three levels are concatenated together to produce a richer multi-level representation. These representations are then input into a fully connected layer to produce an omics-specific embedding, denoted F_GAT. At the same time, for each omics type, the GAT classifier is trained to incorporate the intra-omics information into the predictions:
L_GAT^(m) = (1/N) Σ_{i=1..N} L_CE(ŷ_i^(m), y_i),
wherein L_CE is the cross-entropy loss function, N is the number of training samples, y_i is the true label, and ŷ_i^(m) is the predicted label of the m-th omics data type.
Preferably, in the present invention, the fully connected network has 3 intermediate layers in addition to the input layer and the output layer, wherein the size of the input layer equals the sum of the feature numbers of the multiple omics types, and the size of the final output layer equals the number of classes.
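As a sketch of such a fully connected classifier head with three intermediate layers; the hidden widths, dropout rate and class count are illustrative assumptions only.

```python
import torch.nn as nn

def omics_classifier(in_dim, num_classes, hidden=(512, 256, 128)):
    """Fully connected classifier: input layer, 3 intermediate layers, and an output layer."""
    dims = (in_dim,) + hidden
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(0.5)]
    layers.append(nn.Linear(dims[-1], num_classes))   # output layer: one logit per class
    return nn.Sequential(*layers)

# Hypothetical example: a 512-dim embedding classified into 5 disease subtypes.
clf = omics_classifier(in_dim=512, num_classes=5)
```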
Preferably, the inherent heterogeneity among multi-omics data and the external influences arising from differences in data acquisition and storage conditions pose significant challenges for data integration. Therefore, in addition to improving the predictive power of each type of omics data, the present invention employs a trusted strategy to evaluate and adaptively adjust the prediction confidence associated with each type of omics data.
The traditional confidence inference method is the Maximum Class Probability (MCP). For omics type m, given an input feature matrix X^(m), the classifier can be interpreted as a probabilistic model: using the softmax function, it assigns a predictive probability P(Y = k | X^(m)) to each class k = 1, …, K. The prediction confidence can then be inferred as:
MCP(X^(m)) = max_k P(Y = k | X^(m)).
It can be observed that MCP selects the highest softmax probability, which results in excessive confidence in incorrect predictions.
To address this problem, the present invention adopts the True Class Probability (TCP) confidence criterion, which aims to assign low and high confidence levels to erroneous and successful predictions, respectively. The TCP criterion is employed in the model of the present invention to obtain a more reliable prediction confidence for each omics type:
TCP(X, y*) = P(Y = y* | X),
wherein y* is the true label vector and P(Y = y* | X) is the probability of the true label computed by the softmax function. The formula shows that when samples are correctly classified, TCP and MCP are equal, while for misclassified samples the former yields lower values.
However, since the true labels are not available on the test set, the TCP confidence cannot be estimated directly. Therefore, for the m-th omics data set, a confidence network with parameter θ^(m) is introduced for estimating the TCP confidence on the training data:
ĉ^(m)(X^(m); θ^(m)) ≈ TCP(X^(m), y*).
Specifically, an L2 loss is used to train the confidence network:
L_Conf^(m) = (1/N) Σ_{i=1..N} (ĉ^(m)(x_i; θ^(m)) − TCP(x_i, y_i*))^2 + L_Cls^(m),
wherein L_Cls^(m) is the cross-entropy loss of the omics-specific classifier.
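A minimal sketch of such a confidence network and its L2 regression target; the network width, the sigmoid output and the batch shapes are illustrative assumptions, and in the full objective this term would be combined with the classifier's cross-entropy loss as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceNet(nn.Module):
    """Small regressor c_theta that predicts the TCP confidence from the omics-specific embedding."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, f_gat):
        return self.net(f_gat).squeeze(-1)                # estimated confidence in [0, 1]

def confidence_loss(conf_net, f_gat, logits, y_true):
    """L2 regression of the predicted confidence onto TCP = softmax probability of the true class."""
    probs = F.softmax(logits, dim=-1)
    tcp = probs.gather(1, y_true.unsqueeze(1)).squeeze(1).detach()   # true-class probability
    return F.mse_loss(conf_net(f_gat), tcp)

# Hypothetical shapes: 32 samples, 512-dim GAT embedding, 5 classes.
conf_net = ConfidenceNet(512)
f_gat, logits = torch.randn(32, 512), torch.randn(32, 5)
y_true = torch.randint(0, 5, (32,))
loss = confidence_loss(conf_net, f_gat, logits, y_true)
```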
In summary, the present invention constructs an omics-specific classifier and a confidence neural network on top of the rich GAT layer representation and generates a confidence score for each omics data type.
Preferably, the enhanced interactivity and informativeness produced by the multi-level GAT increase heterogeneity and make the inter-omics analysis more complex. To address these complexities, the present invention employs a joint late-fusion integration technique that leverages an omics-level confidence mechanism to adjust the contribution of each omics data set to the inter-omics fusion.
Preferably, the GAT-encoded omics-specific representation F_GAT^(m) is weighted by the confidence of the corresponding omics type and is thereby converted into a cognition-level feature F_Cog^(m) that expresses the informativeness of that omics type.
Further, the present invention introduces a selective attention mechanism to generate more discriminative features: the cognition-level feature F_Cog^(m) is passed through an attention gate with activation function σ to produce F_Cog_att^(m), the cognition-level feature after selective attention. This mechanism not only emphasizes salient features but also reduces the impact of non-informative attributes, focusing on useful information.
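The exact form of the enhancement step is not recoverable from the text; the following sketch shows only one plausible reading, in which the embedding is scaled by its omics-level confidence and then gated by a sigmoid attention layer. All layer names and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class ConfidenceEnhancement(nn.Module):
    """Scale the omics-specific embedding by its confidence, then apply a selective attention gate.
    One plausible reading of the enhancement step; the patent does not give the exact formula."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())   # sigma(.) attention gate

    def forward(self, f_gat, confidence):
        f_cog = f_gat * confidence.unsqueeze(-1)        # cognition-level feature, weighted by confidence
        return self.gate(f_cog) * f_cog                 # emphasize salient, informative dimensions

enhance = ConfidenceEnhancement(512)
f_att = enhance(torch.randn(32, 512), torch.rand(32))  # hypothetical batch of 32 samples
```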
Further, the features from the multiple omics types are concatenated for the final classification. In general, the overall loss can be expressed as:
L = L_Final + λ_g Σ_m L_GAT^(m) + λ_c Σ_m L_Conf^(m),
wherein λ_g and λ_c are hyper-parameters for weighting the different losses: λ_g adjusts the contribution of the feature aggregation and update module, λ_c adjusts the contribution of the confidence enhancement module, L_GAT^(m) is the loss of the GAT-encoded omics-specific representation, and L_Final is the cross-entropy loss of the final classification. Preferably, in the present invention, the hyper-parameters λ_g and λ_c are both set to 1.
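A short sketch of combining the three loss terms as just described, with λ_g = λ_c = 1 as the default; the function signature and argument names are assumptions for illustration.

```python
import torch.nn.functional as F

def total_loss(final_logits, y_true, gat_losses, conf_losses, lam_g=1.0, lam_c=1.0):
    """Overall objective: final cross-entropy plus weighted per-omics GAT and confidence losses."""
    l_final = F.cross_entropy(final_logits, y_true)
    return l_final + lam_g * sum(gat_losses) + lam_c * sum(conf_losses)
```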
Preferably, the medical analysis result of the target object can be obtained through the above steps.
According to a preferred embodiment, the invention discloses a classification model for trusted multi-omics integration based on a graph neural network, and also discloses a classification apparatus for trusted multi-omics integration based on a graph neural network, comprising:
a multi-omics data preparation module 10 for preparing, for a sample, the omics data of the sample;
an omics data network construction module 20 for constructing a specific network for each type of omics data;
a feature aggregation and classification module 30 for performing aggregation and update on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics type;
a confidence calculation and enhancement module 40 for calculating the confidence of each omics type and enhancing the aggregated features;
a feature fusion and classification module 50 for fusing the confidence-enhanced features of the multiple omics types to generate a final classification result;
an output module 60 for outputting the medical analysis result of the target object.
Preferably, the sample to which the multi-omics data preparation module 10 is directed is typically a patient with a complex disease, and the prepared omics data include omics type 1, omics type 2, omics type 3, and so on. Each omics type consists of several features, which are typically screened by preprocessing and scaled to between 0 and 1 before being input to the model.
Preferably, the omics data network construction module 20 may comprise a plurality of feature network construction modules to build the specific networks of the multiple omics data types and obtain the network features of each omics type.
Preferably, the omics data network construction module 20 can construct an omics information network by weighted gene co-expression network analysis (WGCNA) and use topological features to construct a graph network of the omics data; the combination of the two can yield better classification performance and more interpretable biomarkers.
For each patient, the initial co-expression graph network may be represented by the following formula:
G_0 = G(V_{K×1}, E_{K×K}),
wherein V_{K×1} represents the node features and E_{K×K} represents the edge matrix.
The edge matrix E_{K×K} is obtained from the co-expression computation of the WGCNA analysis, which is performed with the R package "WGCNA". Specifically, for one sample a vector of dimension 1×K is generated, where K is the number of features. For the N samples belonging to the same omics data type, an N×K matrix is formed to compute the co-expression matrix A_{K×K}. The correlation A_{ij} between node v_i and node v_j is calculated as follows:
A_{ij} = |corr(x_i, x_j)|^β,
wherein x_i and x_j are the expression vectors of nodes v_i and v_j, and β is the soft threshold automatically calculated by WGCNA.
The edge matrix E_{K×K} is generated from the co-expression matrix A_{K×K} by binarization: E_{ij} = 1 if A_{ij} is not smaller than the threshold, and E_{ij} = 0 otherwise.
Preferably, in the present invention, the threshold for binarizing the matrix is set to 0.08.
Preferably, the feature aggregation and classification module 30 may include a graph neural network feature aggregation module for performing aggregation and update on each omics-specific network, and a neural network classification module for performing dimension reduction and classification on the extracted omics features to generate the probability distribution of the initial classification of each omics type, wherein the neural network classification module may be a neural network initial classification module.
Preferably, when performing aggregation and update on the omics-specific network and performing dimension reduction and classification on the extracted omics features, the feature aggregation and classification module 30 may complete the initial classification through the graph attention network and the neural network. For each type of omics data, the initial co-expression graph network is input to the graph attention network (GAT) layer. The graph attention network combines an attention mechanism with graph convolution, and its calculation comprises the following two steps:
First, the attention coefficients are calculated. For node i, the similarity coefficient between itself and each of its neighbors is calculated as follows:
e_{ij} = a([W h_i || W h_j]), j ∈ N_i,
where the node features are enhanced by increasing their dimension through a linear mapping with the shared parameter W, and the transformed features of nodes i and j are concatenated as [W h_i || W h_j]. The concatenated high-dimensional features are mapped to a real number by a(·), and the attention coefficients are normalized over the neighborhood: α_{ij} = exp(LeakyReLU(e_{ij})) / Σ_{k∈N_i} exp(LeakyReLU(e_{ik})).
Next, the features are weighted and aggregated according to the calculated attention coefficients, as follows:
h'_i = σ(Σ_{j∈N_i} α_{ij} W h_j),
wherein h'_i denotes the output feature generated by the GAT layer for node i.
Preferably, in order to stabilize the self-attention learning process, the graph neural network feature aggregation module of the present invention adopts a multi-head attention mechanism. That is, the operation of the GAT layer is replicated independently T times, each copy having its own parameters, and the outputs are aggregated by concatenating the features as follows:
h'_i = ||_{t=1..T} σ(Σ_{j∈N_i} α_{ij}^t W^t h_j),
wherein || denotes feature concatenation, α_{ij}^t denotes the attention coefficient obtained from the t-th copy, and W^t denotes the weight matrix of the linear transformation of the t-th copy.
Further, in order to exploit the relationships between internal features, the graph neural network feature aggregation module of the present invention applies a multi-level graph feature complete fusion method. In addition to aggregating molecular features at the node level, this method is used to promote information aggregation at the molecular-module level. Specifically, a higher-level graph G_1 is generated by applying a multi-head GAT layer to G_0, and G_2 is derived from G_1 in a similar manner. Thereafter, the graph embeddings from the three levels are concatenated together to produce a richer multi-level representation. These representations are then input into a fully connected layer to produce an omics-specific embedding, denoted F_GAT. At the same time, for each omics type, the GAT classifier is trained to incorporate the intra-omics information into the predictions:
L_GAT^(m) = (1/N) Σ_{i=1..N} L_CE(ŷ_i^(m), y_i),
wherein L_CE is the cross-entropy loss function, N is the number of training samples, y_i is the true label, and ŷ_i^(m) is the predicted label of the m-th omics data type.
Preferably, in the present invention, the fully connected layers of the neural network classification module have 3 intermediate layers in addition to the input layer and the output layer, wherein the size of the input layer equals the sum of the feature numbers of the multiple omics types, and the size of the final output layer equals the number of classes.
Preferably, the confidence calculation and enhancement module 40 may include a plurality of true class probability confidence enhancement modules to calculate the confidence of each omics type and enhance the aggregated features.
The inherent heterogeneity among multi-omics data and the external influences arising from differences in data acquisition and storage conditions pose significant challenges for data integration. Therefore, in addition to improving the predictive power of each type of omics data, the confidence calculation and enhancement module 40 of the present invention employs a trusted strategy to evaluate and adaptively adjust the prediction confidence associated with each type of omics data.
The traditional confidence inference method is the Maximum Class Probability (MCP). For omics type m, given an input feature matrix X^(m), the classifier can be interpreted as a probabilistic model: using the softmax function, it assigns a predictive probability P(Y = k | X^(m)) to each class k = 1, …, K. The prediction confidence can then be inferred as:
MCP(X^(m)) = max_k P(Y = k | X^(m)).
It can be observed that the confidence MCP of the predicted class is the highest softmax probability, which results in excessive confidence in incorrect predictions.
To address this problem, the present invention adopts the True Class Probability (TCP) confidence criterion, which aims to assign low and high confidence levels to erroneous and successful predictions, respectively. The TCP criterion is employed in the model of the present invention to obtain a more reliable prediction confidence for each omics type:
TCP(X, y*) = P(Y = y* | X),
wherein y* is the true label vector. The formula shows that when samples are correctly classified, TCP and MCP are equal, while for misclassified samples the former yields lower values.
However, since the true labels are not available on the test set, the TCP confidence cannot be estimated directly. Therefore, for the m-th omics data set, a confidence network with parameter θ^(m) is introduced for estimating the TCP confidence on the training data:
ĉ^(m)(X^(m); θ^(m)) ≈ TCP(X^(m), y*).
Specifically, an L2 loss is used to train the confidence network:
L_Conf^(m) = (1/N) Σ_{i=1..N} (ĉ^(m)(x_i; θ^(m)) − TCP(x_i, y_i*))^2 + L_Cls^(m),
wherein L_Cls^(m) is the cross-entropy loss of the omics-specific classifier.
In summary, the true class probability confidence enhancement module constructs an omics-specific classifier and a confidence neural network on top of the rich GAT layer representation and generates a confidence score for each omics data type.
Preferably, the feature fusion and classification module 50, which fuses the confidence-enhanced features of the multiple omics types, may include a multi-feature fusion module and a neural network classification module.
The enhanced interactivity and informativeness produced by the multi-level GAT increase heterogeneity and make the inter-omics analysis more complex. To address these complexities, the feature fusion and classification module 50 of the present invention employs a joint late-fusion integration technique that leverages an omics-level confidence mechanism to adjust the contribution of each omics data set to the inter-omics fusion.
Preferably, the GAT-encoded omics-specific representation F_GAT^(m) is weighted by the confidence of the corresponding omics type and is thereby converted into a cognition-level feature that expresses the informativeness of that omics type.
Further, the feature fusion and classification module 50 of the present invention introduces a selective attention mechanism to generate more discriminative features: the cognition-level feature F_Cog^(m) is passed through an attention gate with activation function σ to produce F_Cog_att^(m).
This mechanism not only emphasizes salient features but also reduces the impact of non-informative attributes, focusing on useful information.
Further, the features from the multiple omics types are concatenated for the final classification. In general, the overall loss can be expressed as:
L = L_Final + λ_g Σ_m L_GAT^(m) + λ_c Σ_m L_Conf^(m),
wherein λ_g and λ_c are hyper-parameters for weighting the different losses, and L_Final is the cross-entropy loss of the final classification. Preferably, in the present invention, the hyper-parameters λ_g and λ_c are both set to 1.
Preferably, the final class probability distribution of the sample may be output by the output module 60 as the medical analysis result of the target object.
Illustratively, a series of omics data, including mRNA omics, methylation omics and miRNA omics, are collected for Alzheimer's disease patients, and the classification method and/or classification model of the present invention is used to achieve accurate disease subtype classification for these patients.
Preferably, in order to compare the model of the present invention fairly with models of the prior art, the experiments performed by the present invention follow the same experimental setup and evaluation criteria as MOGONET. Four benchmark data sets were chosen for performance evaluation, and six conventional single-omics classifiers (KNN, SVM, Lasso, Random Forest (RF), XGBoost and a fully connected Neural Network (NN)) as well as two advanced methods (MOGONET and Dynamics) were applied.
Preferably, the experimental data sets used comprise:
ROSMAP for Alzheimer's Disease (AD): the ROSMAP dataset is designed specifically for Alzheimer's Disease (AD) classification. Alzheimer's disease is a progressive neurodegenerative disease that results in memory loss and other cognitive dysfunction.
BRCA for PAM50-defined breast cancer subtypes: the BRCA dataset is used for PAM50-defined breast cancer subtype classification. PAM50 is a 50-gene test for determining the subtype of breast cancer in order to provide patients with more targeted therapeutic advice.
LGG for low-grade glioma (LGG) grade 2 vs. grade 3 classification: the LGG dataset is designed for grade 2 versus grade 3 classification of low-grade gliomas (LGGs). Gliomas are tumors that occur in the brain or spinal cord and are divided into a number of grades; the diagnostic and therapeutic strategies differ between grade 2 and grade 3.
KIPAN for renal cell carcinoma subtype classification: the KIPAN dataset is used for classification of renal cell carcinoma subtypes. Renal cell carcinoma is a kidney-derived cancer that has multiple subtypes, each of which differs in biological characteristics and therapeutic response.
Preferably, the evaluation metrics used comprise:
Binary classification: accuracy (ACC), F1 score (F1), and area under the receiver operating characteristic curve (AUC);
Multi-class classification: ACC, weighted-average F1 score (F1-weighted), and macro-average F1 score (F1-macro), as computed in the sketch below.
Preferably, the experimental results are shown in Table 1 and Table 2. The multi-omics integration technique based on the graph neural network of the present invention shows excellent performance on the various evaluation metrics; it not only successfully addresses the problem of omics heterogeneity but also greatly improves the prediction accuracy in binary and multi-class classification tasks. Compared with existing advanced methods, the model of the present invention achieves a notable improvement on several key metrics, which fully demonstrates its practical value in medical data classification. In addition, this technique provides a unique weight for each omics data type, further enhancing the reliability of the predictions. With this method, an accurate disease subtype classification can be provided for each patient, offering strong guidance for subsequent personalized treatment.
Table 1 results of comparison with other methods on the ROSMAP and BRCA datasets
Table 2 results of comparisons with other methods on LGG and KIPAN datasets
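By way of illustration only, a minimal PyTorch-style sketch of a confidence-weighted late fusion of per-omics features is given below; the softmax normalisation across omics and all tensor names are assumptions for this sketch and do not reproduce the exact fusion operator of the invention.

```python
import torch

def confidence_weighted_fusion(omics_features, omics_confidence):
    """Fuse per-omics feature vectors using per-omics confidence weights.

    omics_features   : list of M tensors, each (B, D) -- aggregated features per omics
    omics_confidence : list of M tensors, each (B,)   -- predicted confidence per omics
    """
    weights = torch.stack(omics_confidence, dim=1)        # (B, M)
    weights = torch.softmax(weights, dim=1)               # normalise across omics
    feats = torch.stack(omics_features, dim=1)            # (B, M, D)
    fused = (weights.unsqueeze(-1) * feats).sum(dim=1)    # (B, D) fused representation
    return fused
```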
According to a preferred embodiment, the invention discloses an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, the processor being capable of implementing the steps of the method as described above when executing the program.
According to a preferred embodiment, the present invention discloses a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, is capable of carrying out the steps of the method as described above.
It should be noted that the above-described embodiments are exemplary, and that a person skilled in the art, in light of the present disclosure, may devise various solutions that all fall within the scope of the present disclosure and the protection scope of the present invention. It should be understood by those skilled in the art that the present description and drawings are illustrative and do not limit the claims. The scope of the invention is defined by the claims and their equivalents. The description of the invention contains multiple inventive concepts, and expressions such as "preferably", "according to a preferred embodiment" or "optionally" all indicate that the corresponding paragraph discloses a separate concept; the applicant reserves the right to file a divisional application based on each inventive concept. Throughout this document, the word "preferably" merely denotes one alternative and is not to be construed as mandatory, so the applicant reserves the right to waive or delete the relevant preferred feature at any time.

Claims (10)

1. A classification method for trusted multi-omics integration based on a graph neural network, characterized by comprising the following steps:
preparing, for a sample, omics data of the sample;
constructing an omics-specific network for each type of omics data;
performing aggregation and updating on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics;
calculating the confidence of each omics, and enhancing the aggregated features;
fusing the confidence-enhanced features of the multiple omics to generate a final classification result;
and outputting the medical analysis result of the target object.
2. The classification method as claimed in claim 1, wherein the prepared omics data of the sample comprise a plurality of omics, each omics comprising a plurality of pre-screened features.
3. The classification method according to claim 1 or 2, wherein, in constructing the omics-specific network for each type of omics data, an omics information network is constructed by weighted gene co-expression network analysis, and a graph network of the omics data is constructed using topological features so as to combine the expression data with the graph network.
4. The classification method according to any one of claims 1 to 3, characterized in that, for each type of omics data, the initial co-expression graph network is input into a graph attention neural network layer to achieve weighted-sum aggregation of features, and the initial classification of each omics is performed by a neural network comprising an input layer, an output layer and 3 intermediate layers.
5. The classification method according to any one of claims 1 to 4, wherein, when the aggregation and updating are performed on an omics-specific network, a multi-head attention mechanism is used to stabilize the self-attention learning process, and/or a multi-level graph feature full-fusion method is used to promote the aggregation of molecular-module information by exploiting the relationships among internal features.
6. The classification method according to any one of claims 1 to 5, characterized in that, when calculating the confidence of each omics and enhancing the aggregated features, a true class probability confidence is used to obtain the predicted confidence of each omics, wherein, for the m-th omics dataset, a network with parameter θ^(m) is introduced for estimating the true class probability confidence on the training data.
7. The classification method according to any one of claims 1 to 6, characterized in that, in fusing the confidence-enhanced features of the multiple omics, a joint late-fusion integration technique is employed, which uses an omics-level confidence mechanism to adjust the contribution of each omics dataset to the inter-omics fusion, so as to address the complexity of inter-omics analysis.
8. A classification model for trusted multi-omics integration based on a graph neural network, comprising:
a multi-omics data preparation module (10) for preparing, for a sample, omics data of the sample;
an omics data network construction module (20) for constructing an omics-specific network for each type of omics data;
a feature aggregation and classification module (30) for performing aggregation and updating on each omics-specific network, and performing dimension reduction and classification on the extracted omics features to generate an initial classification for each omics;
a confidence calculation and enhancement module (40) for calculating the confidence of each omics and enhancing the aggregated features;
a feature fusion and classification module (50) for fusing the confidence-enhanced features of the multiple omics to generate a final classification result;
and an output module (60) for outputting the medical analysis result of the target object.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, characterized in that the processor is capable of implementing the steps of the method according to any one of claims 1-7 when executing the program.
10. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, is capable of implementing the steps of the method according to any of claims 1 to 7.
CN202311702871.3A 2023-10-20 2023-12-12 Classification model and method for multiple groups of credible integration of students based on graph neural network Pending CN117637035A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311366710 2023-10-20
CN2023113667101 2023-10-20

Publications (1)

Publication Number Publication Date
CN117637035A true CN117637035A (en) 2024-03-01

Family

ID=90018174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311702871.3A Pending CN117637035A (en) 2023-10-20 2023-12-12 Classification model and method for multiple groups of credible integration of students based on graph neural network

Country Status (1)

Country Link
CN (1) CN117637035A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118352007A (en) * 2024-04-30 2024-07-16 中国人民解放军总医院第一医学中心 Disease data analysis method and system based on crowd queue multiunit study data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination