CN118116600A

CN118116600A - Colorectal cancer prognosis method based on multi-omics and clinical test data

Info

Publication number: CN118116600A
Application number: CN202410532738.6A
Authority: CN
Inventors: 吴艳平; 王飞; 马韵洁; 王佐成
Original assignee: Data Space Research Institute
Current assignee: Data Space Research Institute
Priority date: 2024-04-30
Filing date: 2024-04-30
Publication date: 2024-05-31
Anticipated expiration: 2044-04-30
Also published as: CN118116600B

Abstract

The invention discloses a colorectal cancer prognosis method based on multi-omics and clinical test data, comprising: S1, collecting, from different data sources, databases, or experiments, the omics data of different colorectal cancer patients and their survival status two years after surgical resection of colorectal cancer, collecting the clinical data related to the patients, and preprocessing the omics data and clinical data; S2, constructing a patient omics similarity network from the preprocessed omics data, clinical data, and survival status; S3, encoding the node topological structure information and the clinical data information of each patient; S4, adding the node topological structure information and the clinical data information of the patient into the encoding of a graph attention network; S5, predicting the survival status of the patient two years after surgical resection of colorectal cancer through a prediction model; and S6, optimizing the prediction model using a binary cross-entropy loss function. The invention has better dynamic adaptability and is suited to the dynamic changes in cancer development.

Description

Colorectal cancer prognosis method based on multi-omics and clinical test data

Technical Field

The invention relates to the technical field of medical data analysis, and in particular to a colorectal cancer prognosis method based on multi-omics and clinical test data.

Background

In recent years, the rapid development of bioinformatics and computer science has brought unprecedented opportunities and challenges to cancer research. Cancer is a complex disease whose pathogenesis involves biological processes at multiple levels. To understand the molecular mechanisms of cancer more fully and to improve the accuracy of predictive models, researchers are actively exploring how to integrate multi-omics data. Colorectal cancer is one of the common malignant tumors, and research on it is important for a deeper understanding of cancer biology and for the formulation of treatment strategies.

The prior art has the following problems:

Data isolation: traditional prognostic models are typically built on a single type of data, such as clinical test data, so information exchange between data types is limited. This isolation prevents the model from fully considering the omics information of colorectal cancer patients and limits insight into their pathological state. Cancer development is reflected in many data types, such as multi-omics and proteomics data; using only a few data types may fail to capture the complexity of cancer pathogenesis, leading to inaccurate predictions.

Insufficient feature representation: some models are limited in how they learn patient feature representations and fail to mine the latent information in patient data in depth. As a result, the model may fail to capture changes in the disease state of colorectal cancer patients, reducing the sensitivity of prognostic analysis.

Complexity of data integration and analysis: multi-omics data fusion is increasingly common in bioinformatics and medical research, but a major challenge is how to efficiently integrate multi-source data such as gene expression, copy number variation, and DNA methylation, and to extract the key information from these complex data sets.

Cancer heterogeneity: cancer is a highly heterogeneous disease, with significant molecular and phenotypic differences between patients. Overcoming this heterogeneity to build a more accurate survival prediction model is a challenge for personalized treatment and for research on finer cancer subtypes.

Omics similarity between patients: traditional models often ignore the omics similarity between patients, which may be a critical factor in disease research. How to effectively consider and exploit the relationships among patients across multi-omics data such as gene expression, copy number variation, and DNA methylation, so as to improve the comprehensiveness and predictive performance of the model, remains a difficult problem.

Therefore, how to provide a colorectal cancer prognosis method based on multi-omics and clinical test data is a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

The invention aims to provide a colorectal cancer prognosis method based on multi-omics and clinical test data. By considering topological structure information, the method enables the model to perform cancer prognosis analysis from a more comprehensive perspective, and it has better dynamic adaptability to the dynamic changes in cancer development.

A colorectal cancer prognosis method based on multi-omics and clinical test data according to an embodiment of the present invention includes the following steps:

S1, collecting, from different data sources, databases, or experiments, the omics data of different colorectal cancer patients and their survival status two years after surgical resection of colorectal cancer, collecting the clinical data related to the patients, and preprocessing the omics data and clinical data;

S2, constructing a patient omics similarity network from the preprocessed omics data, clinical data, and survival status;

S3, encoding the node topological structure information and the clinical data information of each patient;

S4, adding the node topological structure information and the clinical data information of the patient into the encoding of a graph attention network;

S5, predicting the survival status of the patient two years after surgical resection of colorectal cancer through a prediction model;

and S6, optimizing the prediction model using a binary cross-entropy loss function.

Optionally, the omics data comprise lncRNA, miRNA, DNA copy number variation, and DNA methylation data, and the clinical data comprise age, sex, cancer stage, body mass index, tumor location, tumor size, systolic blood pressure, diastolic blood pressure, the tumor markers carcinoembryonic antigen and CA 19-9, and the inflammation indicators C-reactive protein, white blood cell count, and erythrocyte sedimentation rate.

Optionally, the S1 specifically includes:

S11, deleting the data of an omics type from the acquired data set if its proportion of missing values exceeds 20%;

S12, applying Z-score normalization to the omics data and clinical data;

S13, according to the features of the different types of omics data, projecting the original features of the different node types into a unified latent space through node feature transformation.
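Steps S11 and S12 can be illustrated with a minimal sketch. It assumes that the 20% rule is applied per feature column and that missing values are encoded as `None`; the function names `drop_sparse_features` and `zscore_columns` are hypothetical and are not taken from the patent, which does not say how missing entries that survive S11 are handled (here they are simply left as `None`).

```python
import math

def drop_sparse_features(rows, max_missing=0.20):
    """Keep only the columns whose fraction of missing (None) values is <= max_missing."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing]
    return [[r[j] for j in keep] for r in rows], keep

def zscore_columns(rows):
    """Z-score each column using the population standard deviation; None stays None."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        vals = [r[j] for r in rows if r[j] is not None]
        if not vals:
            continue  # nothing to normalize in an all-missing column
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        for r in out:
            if r[j] is not None:
                r[j] = 0.0 if std == 0 else (r[j] - mean) / std
    return out

# Toy data: the second column is 40% missing and is therefore dropped.
rows = [[1.0, None], [2.0, None], [3.0, 5.0], [4.0, 6.0], [5.0, 7.0]]
kept_rows, keep = drop_sparse_features(rows)
normalized = zscore_columns(kept_rows)
```

In practice the same normalization statistics would normally be estimated on the training split only and reused for the test split; the sketch omits that for brevity.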

Optionally, the S2 specifically includes:

S21, defining a similarity matrix $S^{(c)}$ for each type of omics data $c$:

[formula image not available in the source text]

where $\rho(\cdot,\cdot)$ denotes the Euclidean distance, $\mu$ is a hyperparameter, and $\varepsilon_{i,j}$ is given by:

[formula image not available in the source text]

where $N_i^{(c)}$ denotes the first $m$ neighbors of patient $i$ in the $c$-th omics data, obtained by computing the Euclidean distances between patient $i$ and all remaining samples of the $c$-th omics data, sorting them, and selecting the first $m$ samples as neighbors;

S22, introducing weights $w_c$ to integrate the similarities of the four types of omics data, where $w_c$ represents the importance of the $c$-th omics data; the final integrated similarity matrix $S$ is defined as:

$$S = \sum_{c=1}^{4} w_c\, S^{(c)}$$

S23, constructing a graph $G$ with node features:

$$G = (V, E)$$

where $V$ represents the set of vertices and $E$ the set of edges, and $|V|$ and $|E|$ are the number of vertices and the number of edges, respectively;

S24, constructing a relationship graph among patients according to their survival status two years after surgical resection of colorectal cancer, where death is denoted 0 and survival 1. Each patient is represented as a node in the graph; if two patients both survive for two years, an edge is created between them; and, for each patient, the $K$ most similar patients in the final integrated similarity matrix $S$ are selected as neighbors, with an edge created to each.
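The construction in S21 to S24 can be sketched as follows. Because the formula images are missing from this text, the kernel below is an assumption: it uses the scaled exponential similarity familiar from similarity network fusion, $\exp(-\rho^2 / (\mu\,\varepsilon_{i,j}))$, with $\varepsilon_{i,j}$ taken as the average of the two patients' mean distances to their $m$ nearest neighbors and their mutual distance. The names `scaled_exp_similarity`, `fuse`, and `top_k_edges`, and the parameter defaults, are hypothetical. The survival-based edges of S24 are not shown.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scaled_exp_similarity(X, m=2, mu=0.5):
    """Assumed per-omics kernel: exp(-dist^2 / (mu * eps)), eps built from m nearest neighbors."""
    n = len(X)
    dist = [[euclidean(X[i], X[j]) for j in range(n)] for i in range(n)]
    nbr_mean = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:m]
        nbr_mean.append(sum(nearest) / len(nearest))
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = (nbr_mean[i] + nbr_mean[j] + dist[i][j]) / 3.0
            S[i][j] = math.exp(-dist[i][j] ** 2 / (mu * eps)) if eps > 0 else 1.0
    return S

def fuse(similarities, weights):
    """Weighted sum of the per-omics similarity matrices (S22)."""
    n = len(similarities[0])
    return [[sum(w * S[i][j] for w, S in zip(weights, similarities))
             for j in range(n)] for i in range(n)]

def top_k_edges(S, k):
    """Undirected edges linking each patient to its k most similar patients (part of S24)."""
    edges = set()
    for i in range(len(S)):
        ranked = sorted((j for j in range(len(S)) if j != i), key=lambda j: -S[i][j])
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Two toy "omics" views of four patients forming two obvious pairs.
view_a = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
view_b = [[1.0], [1.2], [9.0], [9.1]]
S = fuse([scaled_exp_similarity(view_a, m=1), scaled_exp_similarity(view_b, m=1)], [0.5, 0.5])
edges = top_k_edges(S, k=1)
```

With four omics types, `fuse` would receive four matrices and four weights; how the weights are chosen is not specified in this text.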

Optionally, establishing the topological structure information of the nodes in graph $G$ comprises: generating, for each node $i$, node sequences of a specified length $D$ using a random walk algorithm; learning the node-sequence representation of each node $i$ using a bidirectional long short-term memory network; and fusing the sequence representations to obtain the topological structure representation of the node.

Optionally, establishing the topological structure information of the nodes in graph $G$ specifically comprises:

starting a random walk from a node $v_i$ on graph $G$; if the walk visits node $v_t$ at its $t$-th step, the next node of the walk is selected from the neighbor nodes of $v_t$ with the following probability:

$$P(v_{t+1} \mid v_t) = \begin{cases} \dfrac{1}{d(v_t)}, & v_{t+1} \in N(v_t) \\ 0, & \text{otherwise} \end{cases}$$

where $d(v_t)$ denotes the degree of node $v_t$, $N(v_t)$ denotes the set of neighbor nodes of $v_t$, and $v_{t+1}$ denotes the node selected next after $v_t$ during the random walk;

the nodes recorded by the random walk are connected in order to form a random walk sequence of length $D$, which presents a node path from a starting node to a target node; the random walk algorithm is used to generate $p$ random sequences for each node $v_i$, with $\mathrm{seq} = \{\mathrm{seq}_1, \dots, \mathrm{seq}_p\}$ denoting the $p$ random sequences;

learning the representations of the $p$ random sequences using a long short-term memory network: let the $y$-th sequence of node $i$ be denoted $\mathrm{seq}_y = (v_{y,1}, \dots, v_{y,D})$, where $v_{y,1}$ is the starting node; the node features in the random sequence are input step by step through the long short-term memory network to obtain the final representation $h_y$ of the $y$-th sequence;

weighting and fusing the $p$ random sequence representations to obtain the node topological structure representation $h_i^{\mathrm{topo}}$:

[formula image not available in the source text]

where $y$ ranges from 1 to $p$, and the fusion weight $\alpha_y$ is given by:

[formula image not available in the source text]
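The random walk sampling described above can be sketched in a few lines. The sketch assumes an adjacency-list graph and a uniform choice among the current node's neighbors, which matches the description of a probability depending only on the node's degree; `random_walk` and `walks_per_node` are hypothetical names. The long short-term memory encoding and the weighted fusion of the $p$ sequence representations are not shown.

```python
import random

def random_walk(adj, start, length, rng):
    """Walk `length` nodes from `start`, picking each next node uniformly among neighbors."""
    walk = [start]
    while len(walk) < length:
        nbrs = adj[walk[-1]]
        if not nbrs:
            break  # isolated node: the walk cannot continue
        walk.append(rng.choice(nbrs))
    return walk

def walks_per_node(adj, p, length, seed=0):
    """Generate p walks of the given length for every node in the graph."""
    rng = random.Random(seed)
    return {v: [random_walk(adj, v, length, rng) for _ in range(p)] for v in adj}

# Toy patient graph: node 0 is linked to nodes 1 and 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
seqs = walks_per_node(adj, p=3, length=4)
```

Each entry of `seqs[v]` is one sequence of length $D$ (here 4) starting at `v`; these sequences would then be fed, node feature by node feature, into the sequence encoder.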

optionally, the feature encoding of the patient clinical data comprises:

The original clinical data feature vector of the patient is denoted $x_i$; using a transformation matrix $W_{\mathrm{clin}}$, the feature vector $x_i$ is converted to a 64-dimensional vector:

$$h_i^{\mathrm{clin}} = W_{\mathrm{clin}}\, x_i$$

The node type features and the node structure features are fused by weighting, where the node type features are the original clinical data features and the node structure features are the topological structure representation of the node; the resulting clinical data feature representation of the patient is:

[formula image not available in the source text]

where $\lambda$ is a preset hyperparameter representing the weight.

Optionally, the topological structure information of the nodes and the patient clinical data information are added to the encoding of the graph attention network (GAT), forming an improved graph attention mechanism:

[formula image not available in the source text]

By adopting a multi-head attention mechanism, $M$ parameter matrices $W^{m}$ are used to compute the attention coefficients $\alpha_{ij}^{m}$ separately, and the calculated results are combined:

[formula image not available in the source text]

where $h_i^{(l)}$ denotes the encoded representation of node $i$ in the $l$-th layer of the GAT, with the initial $h_i^{(0)}$ being the multi-omics features of the patient; $N(i)$ denotes the set of neighbor nodes of node $i$; $\alpha_{ij}$ denotes the attention weight between node $i$ and its neighbor $j$; $a$ denotes a weight parameter used to calculate the attention score; $W^{(l)}$ denotes the weight matrix of the graph neural network at layer $l$; $\exp$ denotes the natural exponential function; $\mathrm{LeakyReLU}$ denotes the LeakyReLU activation function; and $\|$ denotes the concatenation of vectors.
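For orientation, the sketch below implements one attention head of a standard graph attention layer, using the quantities named above (a weight matrix, an attention vector, LeakyReLU, concatenation, and a softmax over neighbors). It is not the improved mechanism of the invention: how the topological and clinical features enter the encoding is given by formula images that are missing from this text. The function name `gat_layer`, the LeakyReLU slope of 0.2, and the requirement that every node lists at least one neighbor (for example itself) are assumptions.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def gat_layer(h, neighbors, W, a):
    """One standard GAT head: returns (aggregated features, attention weights per node)."""
    Wh = [matvec(W, x) for x in h]
    out, alpha = [], []
    for i in range(len(h)):
        nbrs = neighbors[i]
        # Score e_ij = LeakyReLU(a . [W h_i || W h_j]); list "+" is vector concatenation.
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = {j: e / total for j, e in zip(nbrs, exps)}
        alpha.append(weights)
        out.append([sum(weights[j] * Wh[j][k] for j in nbrs) for k in range(len(W))])
    return out, alpha

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.1, 0.2, 0.3, 0.4]
out, alpha = gat_layer(h, neighbors, W, a)
```

A multi-head layer would run this with $M$ different `W` and `a` and concatenate the per-head outputs; an output nonlinearity is usually applied afterwards and is omitted here.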

Optionally, the step S5 specifically includes:

Let $y \in \{0, 1\}$ denote patient survival. A two-layer MLP predictor is used to predict the survival of the patient after two years, calculated as follows:

$$\hat{y}_i = \sigma_2\big(W_2\, \sigma_1(W_1 z_i + b_1) + b_2\big)$$

where $z_i$ is the final encoded feature representation of patient $i$, $\sigma_1$ is the ReLU activation function, $\sigma_2$ is the Sigmoid activation function, $W = \{W_1, W_2\}$ is the set of trainable weight matrices, and $b = \{b_1, b_2\}$ is the set of bias vectors;

The higher the value, the lower the probability of survival, indicating that timely treatment is required; conversely, the lower the value, the higher the probability of survival.
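A two-layer MLP predictor of the kind described (ReLU hidden layer, Sigmoid output) can be sketched as below. The function name `mlp_predict`, the layer sizes, and the weight values are illustrative assumptions; in the method the weights are learned during training.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_predict(z, W1, b1, W2, b2):
    """Two-layer MLP: Sigmoid(W2 . ReLU(W1 z + b1) + b2), returning a value in (0, 1)."""
    hidden = relu([sum(w * x for w, x in zip(row, z)) + b for row, b in zip(W1, b1)])
    return sigmoid(sum(w * x for w, x in zip(W2, hidden)) + b2)

# Toy 2-dimensional patient encoding with hand-picked weights.
score = mlp_predict([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 1.0], 0.0)
```

The output is a single number between 0 and 1 per patient, which is compared with the 0/1 survival label during training.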

Optionally, the optimizing of the prediction model specifically includes:

The binary cross-entropy loss function is defined as follows:

$$\mathcal{L} = -\frac{1}{n} \sum_{q=1}^{n} \big[\, y_q \log \hat{y}_q + (1 - y_q) \log (1 - \hat{y}_q) \,\big]$$

where $q$ ranges from 1 to $n$; $n$ is the number of patients used in training the prediction model; $y_q$ is the actual label of the $q$-th patient sample, taking the value 0 or 1; $\hat{y}_q$ is the probability, between 0 and 1, that the prediction model assigns the $q$-th sample to the positive class; $\log \hat{y}_q$ and $\log(1 - \hat{y}_q)$ are the logarithms of the predicted probability and of its complement, respectively; and $\sum$ denotes summation of the losses over all samples;

The whole prediction model is trained end-to-end using the Adam optimizer to minimize the binary cross-entropy loss function.
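The loss can be checked numerically with a short sketch. It assumes the usual mean over samples and clips probabilities away from 0 and 1 to avoid taking the logarithm of zero; the clipping constant and the function name `binary_cross_entropy` are assumptions, not part of the patent.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals log(2)
confident = binary_cross_entropy([1, 0], [0.9, 0.1])      # smaller loss
```

A prediction of 0.5 for every patient gives a loss of $\log 2$, and the loss shrinks as predicted probabilities move toward the true labels, which is the quantity the optimizer drives down.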

The beneficial effects of the invention are as follows:

(1) By fully fusing the omics data and clinical data of patients, the invention effectively solves the problem of data isolation. Integrating multi-source data such as gene expression, copy number variation, and DNA methylation ensures that the model considers the many factors influencing cancer development more comprehensively.

(2) The invention introduces an improved graph attention mechanism that enables the model to mine the latent information in patient data more deeply. This increases sensitivity to changes in the disease state of colorectal cancer patients and addresses the problem of insufficient feature representation.

(3) By effectively integrating multi-omics data and accounting for molecular and phenotypic differences between patients, the invention better addresses the challenge of colorectal cancer as a highly heterogeneous disease. This helps build a more accurate survival prediction model and supports personalized treatment and fine-grained research on cancer subtypes.

(4) Through the improved graph attention mechanism, the invention effectively considers and exploits the relationships among patients across multi-omics data such as gene expression, copy number variation, and DNA methylation. This increases the model's sensitivity to the omics similarity between patients and enhances its comprehensiveness and predictive performance.

(5) The improved graph attention mechanism focuses on effectively fusing the type information and topological structure information of patient nodes. It captures the intricate relationships between patients more accurately, enabling the model to understand the interactions in cancer development more fully.

(6) By considering topological structure information, the invention enables the model to perform cancer prognosis analysis from a more comprehensive perspective, and it has better dynamic adaptability to the dynamic changes in cancer development.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flow chart of a colorectal cancer prognosis method based on multi-omics and clinical test data according to the present invention.

Detailed Description

The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.

Referring to FIG. 1, a colorectal cancer prognosis method based on multi-omics and clinical test data comprises steps S1 to S6 set out above.

In this embodiment, the omics data include lncRNA, miRNA, DNA copy number variation, and DNA methylation data, and the clinical data include age, sex, cancer stage, body mass index, tumor location, tumor size, systolic blood pressure, diastolic blood pressure, the tumor markers carcinoembryonic antigen and CA 19-9, and the inflammation indicators C-reactive protein, white blood cell count, and erythrocyte sedimentation rate, so as to understand the physiological status and cancer characteristics of the patient more fully.

In this embodiment, S1 specifically includes:

S12, applying Z-score normalization to the omics data and clinical data;

Here the data type is denoted $c$, where $c$ ranges over lncRNA, miRNA, CNV (DNA copy number variation), and DNA methylation data, respectively. For the data $x^{(c)}$ of a given type, we apply a linear transformation matrix specific to that data type to obtain the dimension-reduced feature vector, defined as follows:

$$h^{(c)} = W_c\, x^{(c)}$$

where the linear transformation matrix $W_c$ for omics type $c$ is a learnable parameter and $x^{(c)}$ is the original feature vector, whose dimension differs for each type of omics data. Using $W_c$, the omics data are reduced to 64 dimensions. The feature matrix corresponding to each type of omics data is denoted $X^{(c)} \in \mathbb{R}^{n \times 64}$, where $n$ is the number of patients and $c$ indexes the lncRNA, miRNA, CNV, and DNA methylation data, respectively.

In this embodiment, S2 specifically includes:

S21, defining a similarity matrix $S^{(c)}$ for each type of omics data $c$:

[formula image not available in the source text]

This integration method considers the similarity of each type of omics data and adjusts the weights $w_c$ to balance their influence.

S23, constructing a graph $G$ with node features:

$$G = (V, E)$$

In this embodiment, establishing the topological structure information of the nodes in graph $G$ comprises: generating, for each node $i$, node sequences of a specified length $D$ using a random walk algorithm; learning the node-sequence representation of each node $i$ using a bidirectional long short-term memory network; and fusing the sequence representations to obtain the topological structure representation of the node.

In this embodiment, establishing the topological structure information of the nodes in graph $G$ specifically comprises:

[formula images not available in the source text]

In this embodiment, the feature encoding of the patient clinical data includes:

[formula image not available in the source text]

where $\lambda$ is a preset hyperparameter representing the weight.

In this embodiment, the topological structure information of the nodes and the patient clinical data information are added to the encoding of the graph attention network (GAT), forming an improved graph attention mechanism:

[formula image not available in the source text]

We train on the data with a 5-layer GAT and denote the final encoded multi-omics features of patient $i$ as $z_i$.

In this embodiment, S5 specifically includes:

The survival status of a patient two years after surgical resection of colorectal cancer is denoted $y \in \{0, 1\}$ (death: 0, survival: 1), and a two-layer MLP predictor is then used to predict the survival of the patient after two years. Specifically, the calculation is as follows:

$$\hat{y}_i = \sigma_2\big(W_2\, \sigma_1(W_1 z_i + b_1) + b_2\big)$$

In this embodiment, the optimization of the prediction model specifically includes:

The binary cross-entropy loss function is defined as follows:

$$\mathcal{L} = -\frac{1}{n} \sum_{q=1}^{n} \big[\, y_q \log \hat{y}_q + (1 - y_q) \log (1 - \hat{y}_q) \,\big]$$

Example 1: experimental results of the prediction model

The performance of a cancer survival prediction model is assessed mainly with common classification metrics, including accuracy, precision, recall, the AUC value, and the ROC curve. These metrics quantify the model's ability to classify new samples: the model predicts the test data set, and the numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) are counted. Accuracy reflects the proportion of all correctly classified samples in the total samples, while precision and recall focus on how well the model predicts and recovers the positive class.

Accuracy: the proportion of all correctly classified samples in the total samples:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: the proportion of all samples predicted as positive by the model that are truly positive:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall: the proportion of all truly positive samples that the model correctly classifies as positive:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
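The three metrics can be computed directly from the confusion counts, as in the following sketch. The function name `classification_metrics` and the toy labels are illustrative only and are not results from the patent; the positive class is taken to be label 1.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy example with TP=2, TN=1, FP=1, FN=1.
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other conventions (such as reporting the metric as undefined) are equally possible.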

In the colorectal cancer prediction task, we compared four existing models, KNN (K-nearest neighbors), SVM (support vector machine), DNN (deep neural network), and LR (logistic regression), with our proposed COADGAT prediction model. The results show that COADGAT performs well on accuracy, precision, recall, and AUC and, compared with these four traditional models, predicts the survival of colorectal cancer patients more effectively. This highlights the usefulness of COADGAT in survival prediction for colorectal cancer patients. These findings support the selection of an appropriate prediction model for colorectal cancer and a more comprehensive understanding of patient survival.

We collected detailed data for 628 colorectal cancer patients from the TCGA database, including lncRNA, miRNA, DNA copy number variation, DNA methylation, and clinical data. TCGA, as a global oncology program, provides large-scale cancer patient data and is a valuable resource for our research into the molecular characteristics and clinical manifestations of colorectal cancer. The full data set is randomly divided into a training set and a test set at a ratio of approximately 8:2; the specific division is shown in Table 1. To ensure the fairness and robustness of the method, we perform 5 experiments for each data division following 5-fold cross-validation and average the results to obtain the overall evaluation scores of the method under that division, as shown in Tables 1 and 2 below:

TABLE 1 data distribution case

Table 2 comparison table of various performance evaluation indexes of five different models
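The evaluation protocol described above (random division at roughly 8:2, repeated over five folds, with scores averaged) can be sketched as follows. The generator name `k_fold_indices` and the fixed seed are assumptions; the patent does not state how the folds were drawn.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

# 628 patients: each fold holds out about one fifth, i.e. roughly an 8:2 split.
splits = list(k_fold_indices(628, k=5))
fold_sizes = [len(test) for _, test in splits]
```

A model would be trained on each `train` list and scored on the matching `test` list, and the five scores would then be averaged to give one value per metric.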

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A colorectal cancer prognosis method based on multi-omics and clinical test data, comprising the steps of:

2. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 1, wherein the omics data comprise lncRNA, miRNA, DNA copy number variation, and DNA methylation data, and the clinical data comprise age, sex, cancer stage, body mass index, tumor location, tumor size, systolic blood pressure, diastolic blood pressure, the tumor markers carcinoembryonic antigen and CA 19-9, and the inflammation indicators C-reactive protein, white blood cell count, and erythrocyte sedimentation rate.

3. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 2, wherein S1 specifically comprises:

S12, applying Z-score normalization to the omics data and clinical data;

4. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 3, wherein S2 specifically comprises:

S21, defining a similarity matrix $S^{(c)}$ for each type of omics data $c$:

[formula image not available in the source text]

where $\rho(\cdot,\cdot)$ denotes the Euclidean distance, $\mu$ is a hyperparameter, and $\varepsilon_{i,j}$ is given by:

[formula image not available in the source text]

where $N_i^{(c)}$ denotes the first $m$ neighbors of patient $i$ in the $c$-th omics data, obtained by computing the Euclidean distances between patient $i$ and all remaining samples of the $c$-th omics data, sorting them, and selecting the first $m$ samples as neighbors;

S22, introducing weights $w_c$ to integrate the similarities of the four types of omics data, where $w_c$ represents the importance of the $c$-th omics data; the final integrated similarity matrix $S$ is defined as:

$$S = \sum_{c=1}^{4} w_c\, S^{(c)}$$

S23, constructing a graph $G$ with node features:

$$G = (V, E)$$

5. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 4, wherein establishing the topological structure information of the nodes in graph $G$ comprises: generating, for each node $i$, node sequences of a specified length $D$ using a random walk algorithm; learning the node-sequence representation of each node $i$ using a bidirectional long short-term memory network; and fusing the sequence representations to obtain the topological structure representation of the node.

6. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 5, wherein establishing the topological structure information of the nodes in graph $G$ specifically comprises:

starting a random walk from a node $v_i$ on graph $G$; if the walk visits node $v_t$ at its $t$-th step, the next node of the walk is selected from the neighbor nodes of $v_t$ with the following probability:

$$P(v_{t+1} \mid v_t) = \begin{cases} \dfrac{1}{d(v_t)}, & v_{t+1} \in N(v_t) \\ 0, & \text{otherwise} \end{cases}$$

where $d(v_t)$ denotes the degree of node $v_t$, $N(v_t)$ denotes the set of neighbor nodes of $v_t$, and $v_{t+1}$ denotes the node selected next after $v_t$ during the random walk;

learning the representations of the $p$ random sequences using a long short-term memory network: let the $y$-th sequence of node $i$ be denoted $\mathrm{seq}_y = (v_{y,1}, \dots, v_{y,D})$, where $v_{y,1}$ is the starting node; the node features in the random sequence are input step by step through the long short-term memory network to obtain the final representation $h_y$ of the $y$-th sequence;

[formula image not available in the source text]

where $y$ ranges from 1 to $p$, and the fusion weight $\alpha_y$ is given by:

[formula image not available in the source text]

7. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 6, wherein the feature encoding of the patient clinical data comprises:

The original clinical data feature vector of the patient is denoted $x_i$; using a transformation matrix $W_{\mathrm{clin}}$, the feature vector $x_i$ is converted to a 64-dimensional vector:

$$h_i^{\mathrm{clin}} = W_{\mathrm{clin}}\, x_i$$

where $\lambda$ is a preset hyperparameter representing the weight.

8. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 7, wherein the topological structure information of the nodes and the patient clinical data information are added to the encoding of the graph attention network (GAT), forming an improved graph attention mechanism:

[formula image not available in the source text]

By adopting a multi-head attention mechanism, $M$ parameter matrices $W^{m}$ are used to compute the attention coefficients $\alpha_{ij}^{m}$ separately, and the calculated results are combined:

[formula image not available in the source text]

where $h_i^{(l)}$ denotes the encoded representation of node $i$ in the $l$-th layer of the GAT, with the initial $h_i^{(0)}$ being the multi-omics features of the patient; $N(i)$ denotes the set of neighbor nodes of node $i$; $\alpha_{ij}$ denotes the attention weight between node $i$ and its neighbor $j$; $a$ denotes a weight parameter used to calculate the attention score; $W^{(l)}$ denotes the weight matrix of the graph neural network at layer $l$; $\exp$ denotes the natural exponential function; $\mathrm{LeakyReLU}$ denotes the LeakyReLU activation function; and $\|$ denotes the concatenation of vectors.

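9. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 8, wherein S5 specifically comprises:

$$\hat{y}_i = \sigma_2\big(W_2\, \sigma_1(W_1 z_i + b_1) + b_2\big)$$

where $z_i$ is the final encoded feature representation of patient $i$, $\sigma_1$ is the ReLU activation function, $\sigma_2$ is the Sigmoid activation function, $W = \{W_1, W_2\}$ is the set of trainable weight matrices, and $b = \{b_1, b_2\}$ is the set of bias vectors;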

10. The colorectal cancer prognosis method based on multi-omics and clinical test data according to claim 9, wherein the optimization of the prediction model specifically comprises:

The binary cross-entropy loss function is defined as follows:

$$\mathcal{L} = -\frac{1}{n} \sum_{q=1}^{n} \big[\, y_q \log \hat{y}_q + (1 - y_q) \log (1 - \hat{y}_q) \,\big]$$

where $q$ ranges from 1 to $n$; $n$ is the number of patients used in training the prediction model; $y_q$ is the actual label of the $q$-th patient sample, taking the value 0 or 1; $\hat{y}_q$ is the probability, between 0 and 1, that the prediction model assigns the $q$-th sample to the positive class; $\log \hat{y}_q$ and $\log(1 - \hat{y}_q)$ are the logarithms of the predicted probability and of its complement, respectively; and $\sum$ denotes summation of the losses over all samples;