
CN118116600A - Colorectal cancer prognosis method based on multi-omics and clinical test data


Info

Publication number
CN118116600A
Authority
CN
China
Prior art keywords
data
node
patient
representing
clinical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410532738.6A
Other languages
Chinese (zh)
Other versions
CN118116600B (en)
Inventor
吴艳平
王飞
马韵洁
王佐成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data Space Research Institute
Original Assignee
Data Space Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Data Space Research Institute
Priority to CN202410532738.6A
Publication of CN118116600A
Application granted
Publication of CN118116600B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Epidemiology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a colorectal cancer prognosis method based on multi-omics and clinical test data, comprising the steps of: S1, collecting, from different data sources, databases or experiments, omics data of different colorectal cancer patients and the survival status of each patient two years after surgical resection of the colorectal cancer, collecting clinical data related to the patients, and preprocessing the omics data and the clinical data; S2, constructing a patient omics similarity network from the preprocessed omics data, the clinical data and the survival status; S3, encoding the node topological structure information of each patient and the clinical data information of each patient; S4, adding the node topological structure information and the clinical data information of the patient to the encoding of the graph attention network; S5, predicting, by a prediction model, the survival status of the patient two years after surgical resection of the colorectal cancer; and S6, optimizing the prediction model by using a binary cross entropy loss function. The invention has better dynamic adaptability and is suited to the dynamic changes that occur during cancer development.

Description

Colorectal cancer prognosis method based on multi-omics and clinical test data
Technical Field
The invention relates to the technical field of medical data analysis, in particular to a colorectal cancer prognosis method based on multi-omics and clinical test data.
Background
In recent years, the rapid development of bioinformatics and computer science has brought unprecedented opportunities and challenges to cancer research. Cancer is a complex disease whose pathogenesis involves biological processes at multiple levels. To understand the molecular mechanisms of cancer more fully and to improve the accuracy of predictive models, researchers are actively exploring how to integrate multi-omics data. Colorectal cancer is one of the common malignant tumors, and its study is important for a deeper understanding of cancer biology and for the formulation of treatment strategies.
The prior art has the following problems:
Data isolation: traditional prognostic models are typically based on a single type of data, such as clinical test data, so that information interaction between data types is limited. This isolation prevents the model from fully considering the omics information of colorectal cancer patients and limits insight into their pathological state. The development of cancer involves many types of data, such as multi-omics and proteomics data. Using only a small number of data types may fail to capture the complexity of cancer pathogenesis, resulting in inaccurate predictions.
Insufficient feature representation: some models are limited in learning feature representations of patients and fail to mine deeply the latent information in patient data. Such models may fail to capture changes in the disease state of colorectal cancer patients, which reduces the sensitivity of the prognostic analysis.
Complexity of data integration and analysis: multi-omics data fusion is becoming increasingly common in bioinformatics and medical research, but one of the major challenges is how to efficiently integrate multi-source data from gene expression, copy number variation and DNA methylation, and to extract the key information from these complex data sets.
Cancer heterogeneity: cancer is a highly heterogeneous disease, with significant molecular and phenotypic differences between patients. Overcoming this heterogeneity to build a more accurate survival prediction model remains a challenge for personalized treatment and for the study of finer cancer subtypes.
Omics similarity relationships between patients: traditional models often ignore the omics similarity between patients, which may be a critical factor in disease research. How to effectively consider and exploit the relationships between patients across multi-omics data such as gene expression, copy number variation and DNA methylation, so as to improve the comprehensiveness and predictive performance of the model, is a difficult problem to be solved.
Therefore, how to provide a colorectal cancer prognosis method based on multi-omics and clinical test data is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The invention aims to provide a colorectal cancer prognosis method based on multi-omics and clinical test data. By taking topological structure information into account, the method allows the model to perform cancer prognosis analysis from a more comprehensive perspective, and it has better dynamic adaptability, so that it adapts better to dynamic changes in cancer development.
A colorectal cancer prognosis method based on multi-omics and clinical test data according to an embodiment of the present invention includes the following steps:
S1, collecting, from different data sources, databases or experiments, omics data of different colorectal cancer patients and the survival status of each patient two years after surgical resection of the colorectal cancer, collecting clinical data related to the patients, and preprocessing the omics data and the clinical data;
S2, constructing a patient omics similarity network from the preprocessed omics data, the clinical data and the survival status;
S3, encoding the node topological structure information of each patient and the clinical data information of each patient;
S4, adding the node topological structure information and the clinical data information of the patient to the encoding of the graph attention network;
S5, predicting, by a prediction model, the survival status of the patient two years after surgical resection of the colorectal cancer;
and S6, optimizing the prediction model by using a binary cross entropy loss function.
Optionally, the omics data comprise lncRNA, miRNA, DNA copy number variation and DNA methylation data, and the clinical data comprise age, sex, cancer stage, body mass index, tumor location, tumor size, systolic blood pressure, diastolic blood pressure, the tumor markers carcinoembryonic antigen and CA 19-9, and the inflammation indices C-reactive protein, white blood cell count and erythrocyte sedimentation rate.
Optionally, S1 specifically includes:
S11, if the proportion of missing values in a given type of omics data in the acquired data set exceeds 20%, deleting the data of that omics type;
S12, applying Z-score normalization to the omics data and the clinical data;
S13, projecting the original features of the different types of nodes into a unified latent space through node feature transformation, according to the different types of omics data features.
Optionally, S2 specifically includes:
S21, defining a similarity matrix for each type of omics data. The definition uses the Euclidean distance between patients, a hyperparameter, and a further term that is itself defined from the first m neighbors of each patient in the given type of omics data. The first m neighbors of a patient are obtained by calculating the Euclidean distances between that patient and all remaining samples of the same omics type, sorting them, and selecting the first m samples as neighbors;
S22, introducing weights to integrate the similarities of the four types of omics data, each weight representing the importance of the corresponding type of omics data, to obtain the final integrated similarity matrix;
S23, constructing a graph G with node features, the graph consisting of a set of vertices and a set of edges E, the number of vertices and the number of edges being defined accordingly;
S24, constructing an association graph among patients according to the survival status of the patients two years after surgical resection of the colorectal cancer, wherein death is represented by 0 and survival by 1, and each patient is represented as a node in the graph; an edge is created if two patients both survive for two years, and, for each patient, the K most similar patients according to the final integrated similarity matrix are selected as neighbors, with an edge created to each.
Optionally, establishing the topological structure information of the nodes in the graph G includes: generating, for each node i, node sequences of a specified length D by using a random walk algorithm; learning the representation of the node sequences of each node i by using a bidirectional long short-term memory network; and fusing the sequence representations to obtain the topological structure representation of the node.
Optionally, establishing the topological structure information of the nodes in the graph G specifically includes:
starting a random walk from a node on the graph G; at each step of the walk a node is visited, and the next node of the walk is selected from the neighbor nodes of the currently visited node according to a probability defined in terms of the degree of the current node and its set of neighboring nodes;
connecting the nodes recorded by the random walk in order to form a random walk sequence of length D, which presents a node path from a starting node to a target node, and using the random walk algorithm to generate p random sequences for each node i, the p random sequences being denoted seq;
learning the representations of the p random sequences using a long short-term memory network: for the y-th sequence of node i, whose first node is the starting node, the node features in the random sequence are input step by step into the long short-term memory network to obtain the final representation of the y-th sequence;
performing weighted fusion of the p random sequence representations obtained, with the sequence index y ranging from 1 to p and each sequence representation being assigned a fusion weight, to obtain the node topological structure representation.
Optionally, the feature encoding of the patient clinical data includes:
converting the original clinical data feature vector of the patient into a 64-dimensional vector by means of a transformation matrix;
performing weighted fusion of the node type features and the node structure features, wherein the node type features are the original clinical data features and the node structure features are the topological structure representation of the node, to obtain the final clinical data feature representation of the patient,
wherein the weight is a preset hyperparameter.
Optionally, the topological structure information of the nodes and the patient clinical data information are added to the encoding of the graph attention network, forming an improved graph attention mechanism:
a multi-head attention mechanism is adopted, in which M parameter matrices are used to calculate attention coefficients separately and the calculated results are combined.
In this calculation, each node i has an encoded representation in the l-th layer of the GAT, whose initial value is the multi-omics features of the patient; the set of neighboring nodes of node i is used; an attention weight is computed between node i and each of its neighbors j; a weight parameter is used for calculating the attention score; a weight matrix of the graph neural network is applied at layer (l); exp denotes the natural exponential function; LeakyReLU denotes the Leaky ReLU activation function; and || denotes the concatenation of vectors.
Optionally, step S5 specifically includes:
representing patient survival by a label in {0,1}, and using a two-layer MLP predictor to predict patient survival after two years,
wherein the predictor uses a ReLU activation function and a Sigmoid activation function, together with a set of trainable weight matrices and a set of bias vectors;
The higher the predicted value, the lower the probability of survival, indicating that timely treatment is required; conversely, the lower the value, the higher the probability of survival.
Optionally, the optimizing of the prediction model specifically includes:
defining the binary cross entropy loss function over the n patients used during training of the prediction model: for the q-th patient sample, with q ranging from 1 to n, the actual label takes a value of 0 or 1, the prediction model outputs the probability, between 0 and 1, that the sample belongs to the positive class, the logarithms of this predicted probability and of its complement are taken, and the losses calculated for all samples are summed;
The whole prediction model is trained end-to-end using the Adam optimizer to minimize the binary cross entropy loss function.
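The loss described above can be illustrated with a short NumPy sketch. The formula itself is not reproduced in this text, so the standard binary cross entropy is assumed; averaging over the n samples (rather than plain summation) and the clipping constant `eps` are assumptions of this sketch, and the function name is hypothetical. Optimization with Adam is not shown.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Standard binary cross entropy, averaged over samples (assumed form)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

# Uninformative predictions versus nearly correct predictions.
loss_uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
loss_good = binary_cross_entropy([1, 0], [0.99, 0.01])
```

Predictions close to the true labels give a loss near zero, while a constant prediction of 0.5 gives a loss of ln 2 per sample.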
The beneficial effects of the invention are as follows:
(1) The invention effectively solves the problem of data isolation by fully fusing the omics data and the clinical data of the patient. By integrating multi-source data from gene expression, copy number variation and DNA methylation, the model takes fuller account of the multiple factors that influence cancer progression.
(2) The invention introduces an improved graph attention mechanism that enables the model to mine the latent information in patient data more deeply. This helps to increase sensitivity to changes in the disease state of colorectal cancer patients, addressing the problem of insufficient feature representation.
(3) By effectively integrating multi-omics data and accounting for molecular and phenotypic differences between patients, the invention better addresses the challenge posed by colorectal cancer as a highly heterogeneous disease. This helps to build a more accurate survival prediction model and supports personalized treatment and the detailed study of cancer subtypes.
(4) Through the improved graph attention mechanism, the invention effectively considers and exploits the relationships between patients across multi-omics data such as gene expression, copy number variation and DNA methylation. This helps to increase the sensitivity of the model to the omics similarity between patients, enhancing the comprehensiveness and predictive performance of the model.
(5) The improved graph attention mechanism focuses on effectively fusing the type information and the topological structure information of the patient nodes. This mechanism captures the intricate relationships between patients more accurately, enabling the model to understand the interactions in cancer progression more fully.
(6) By taking topological structure information into account, the invention allows the model to perform cancer prognosis analysis from a more comprehensive perspective, and it has better dynamic adaptability, so that it adapts better to dynamic changes in cancer development.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a colorectal cancer prognosis method based on multi-omics and clinical test data according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to FIG. 1, a colorectal cancer prognosis method based on multi-omics and clinical test data comprises the following steps:
S1, collecting, from different data sources, databases or experiments, omics data of different colorectal cancer patients and the survival status of each patient two years after surgical resection of the colorectal cancer, collecting clinical data related to the patients, and preprocessing the omics data and the clinical data;
S2, constructing a patient omics similarity network from the preprocessed omics data, the clinical data and the survival status;
S3, encoding the node topological structure information of each patient and the clinical data information of each patient;
S4, adding the node topological structure information and the clinical data information of the patient to the encoding of the graph attention network;
S5, predicting, by a prediction model, the survival status of the patient two years after surgical resection of the colorectal cancer;
and S6, optimizing the prediction model by using a binary cross entropy loss function.
In this embodiment, the omics data include lncRNA, miRNA, DNA copy number variation and DNA methylation data, and the clinical data include age, sex, cancer stage, body mass index, tumor location, tumor size, systolic blood pressure, diastolic blood pressure, the tumor markers carcinoembryonic antigen and CA 19-9, and the inflammation indices C-reactive protein, white blood cell count and erythrocyte sedimentation rate, so as to understand the physiological status and cancer characteristics of the patient more fully.
In this embodiment, S1 specifically includes:
S11, if the proportion of missing values in a given type of omics data in the acquired data set exceeds 20%, deleting the data of that omics type;
S12, applying Z-score normalization to the omics data and the clinical data;
S13, projecting the original features of the different types of nodes into a unified latent space through node feature transformation, according to the different types of omics data features.
Here the data type is denoted c, where c takes the values lncRNA, miRNA, CNV (DNA copy number variation) and DNA methylation data. For data of a given type c, a linear transformation matrix specific to that data type is applied to obtain the dimension-reduced feature vector of the data.
The linear transformation matrix is a learnable parameter; the original feature vector differs for each type of omics data; and the omics-specific linear transformation matrix for type c is used to reduce the omics data to 64 dimensions. A feature matrix is recorded for each type of omics data, with one row per patient, where n is the number of patients and the four types are lncRNA, miRNA, CNV and DNA methylation data.
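Steps S11 to S13 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: missing values are assumed to be encoded as NaN, the 20% threshold is applied to each omics block as a whole, any remaining missing entries are set to zero after normalization, and the projection matrix `W` is a random stand-in for the learnable omics-specific matrix. All function and variable names are hypothetical.

```python
import numpy as np

def drop_sparse_omics(omics, max_missing=0.20):
    """S11: keep only omics blocks whose fraction of NaN entries is <= max_missing."""
    return {name: X for name, X in omics.items()
            if np.isnan(X).mean() <= max_missing}

def zscore(X, eps=1e-8):
    """S12: column-wise Z-score normalization, ignoring NaN entries."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    return (X - mean) / (std + eps)

def project(X, W):
    """S13: omics-specific linear map into the shared 64-dimensional space."""
    return np.nan_to_num(X) @ W  # remaining NaNs become 0 (an assumption)

rng = np.random.default_rng(0)
n = 6  # number of patients in this toy example
omics = {
    "lncRNA": rng.normal(size=(n, 10)),
    "miRNA": rng.normal(size=(n, 8)),
}
cnv = rng.normal(size=(n, 5))
cnv[:, :2] = np.nan          # 40% of this block is missing
omics["CNV"] = cnv

kept = drop_sparse_omics(omics)
latent = {}
for name, X in kept.items():
    W = rng.normal(scale=0.1, size=(X.shape[1], 64))  # stand-in for the learnable matrix
    latent[name] = project(zscore(X), W)
```

In this toy data the CNV block exceeds the 20% threshold and is removed, and every remaining block ends up as an n x 64 matrix regardless of its original feature count.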
In this embodiment, S2 specifically includes:
S21, defining a similarity matrix for each type of omics data. The definition uses the Euclidean distance between patients, a hyperparameter, and a further term that is itself defined from the first m neighbors of each patient in the given type of omics data. The first m neighbors of a patient are obtained by calculating the Euclidean distances between that patient and all remaining samples of the same omics type, sorting them, and selecting the first m samples as neighbors;
S22, introducing weights to integrate the similarities of the four types of omics data, each weight representing the importance of the corresponding type of omics data, to obtain the final integrated similarity matrix.
This integration method considers the similarity of each type of omics data and adjusts the weights to balance the influence of the different types.
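Steps S21 and S22 can be illustrated as follows. The exact similarity formula is not reproduced in this text, so the sketch assumes a scaled exponential kernel of the kind used in similarity network fusion, in which the squared Euclidean distance is divided by a hyperparameter `mu` times a local scale built from each patient's m nearest neighbors. The integration is assumed to be a weighted sum with weights normalized to one. Function names, `mu` and the toy data are hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def scaled_similarity(X, m=3, mu=0.5):
    """Assumed kernel: exp(-dist^2 / (mu * eps)), eps from m-nearest-neighbor distances."""
    dist = pairwise_euclidean(X)
    # Mean distance from each patient to its m nearest neighbors (column 0 is the patient itself).
    local = np.sort(dist, axis=1)[:, 1:m + 1].mean(axis=1)
    eps = (local[:, None] + local[None, :] + dist) / 3.0
    return np.exp(-dist ** 2 / (mu * eps + 1e-12))

def integrate(sims, weights):
    """Weighted sum of per-omics similarity matrices (weights normalized to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * S for wi, S in zip(w, sims))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 64))   # toy 64-d features for one omics type
B = rng.normal(size=(5, 64))   # toy 64-d features for a second omics type
S = integrate([scaled_similarity(A), scaled_similarity(B)], [0.6, 0.4])
```

The local scale makes the kernel adapt to how densely each patient is surrounded in a given omics space, which is one common reason for preferring it over a fixed-bandwidth Gaussian kernel.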
S23, constructing a graph G with node features, the graph consisting of a set of vertices and a set of edges E, the number of vertices and the number of edges being defined accordingly;
S24, constructing an association graph among patients according to the survival status of the patients two years after surgical resection of the colorectal cancer, wherein death is represented by 0 and survival by 1, and each patient is represented as a node in the graph; an edge is created if two patients both survive for two years, and, for each patient, the K most similar patients according to the final integrated similarity matrix are selected as neighbors, with an edge created to each.
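The neighbor-selection part of S24 can be sketched as a K-nearest-neighbor graph built from the integrated similarity matrix. The sketch treats edges as undirected and excludes self-loops, both of which are assumptions; the survival-based edge rule is not included. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def knn_edges(S, K):
    """Link each patient to its K most similar patients; return sorted undirected edges."""
    S = np.array(S, dtype=float)      # copy so the caller's matrix is untouched
    np.fill_diagonal(S, -np.inf)      # a patient is not its own neighbor
    edges = set()
    for i in range(S.shape[0]):
        for j in np.argsort(-S[i])[:K]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

S = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
edges = knn_edges(S, K=1)
```

With K = 1 the toy matrix yields two edges, one joining patients 0 and 1 and one joining patients 2 and 3, matching the two visibly similar pairs.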
In this embodiment, establishing the topological structure information of the nodes in the graph G includes: generating, for each node i, node sequences of a specified length D by using a random walk algorithm; learning the representation of the node sequences of each node i by using a bidirectional long short-term memory network; and fusing the sequence representations to obtain the topological structure representation of the node.
In this embodiment, establishing the topological structure information of the nodes in the graph G specifically includes:
starting a random walk from a node on the graph G; at each step of the walk a node is visited, and the next node of the walk is selected from the neighbor nodes of the currently visited node according to a probability defined in terms of the degree of the current node and its set of neighboring nodes;
connecting the nodes recorded by the random walk in order to form a random walk sequence of length D, which presents a node path from a starting node to a target node, and using the random walk algorithm to generate p random sequences for each node i, the p random sequences being denoted seq;
learning the representations of the p random sequences using a long short-term memory network: for the y-th sequence of node i, whose first node is the starting node, the node features in the random sequence are input step by step into the long short-term memory network to obtain the final representation of the y-th sequence;
performing weighted fusion of the p random sequence representations obtained, with the sequence index y ranging from 1 to p and each sequence representation being assigned a fusion weight, to obtain the node topological structure representation.
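The sequence-generation step can be sketched with the Python standard library. The transition probability is not reproduced in this text; the sketch assumes a uniform walk, in which each neighbor of the current node is chosen with probability one over the node's degree. The sequence length D is taken to count nodes including the starting node, and an isolated node is assumed to stay in place. Encoding the sequences with a long short-term memory network and fusing them are not shown. All names are hypothetical.

```python
import random

def random_walk(adj, start, D, rng):
    """Uniform random walk visiting D nodes, beginning at `start`."""
    walk = [start]
    while len(walk) < D:
        neighbors = adj[walk[-1]]
        if not neighbors:                  # isolated node: stay in place
            walk.append(walk[-1])
        else:
            walk.append(rng.choice(neighbors))   # probability 1/deg for each neighbor
    return walk

def walks_per_node(adj, p, D, seed=0):
    """Generate p walks of length D for every node of the graph."""
    rng = random.Random(seed)
    return {v: [random_walk(adj, v, D, rng) for _ in range(p)] for v in adj}

# Toy patient graph as an adjacency list.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
seqs = walks_per_node(adj, p=3, D=5)
```

Each node thus receives p sequences that sample its local neighborhood, which is the input the sequence encoder described above would consume.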
In this embodiment, the feature encoding of the patient clinical data comprises:
The original clinical data feature vector of the patient is converted into a 64-dimensional vector using a transformation matrix;
The node type features and the node structure features are fused by weighting, where the node type features are the original clinical data features and the node structure features are the topological structure representations of the nodes. The fusion weight is a preset hyperparameter, and the fused result is the final clinical data feature representation of the patient.
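As a non-limiting illustration, the projection and weighted fusion may be sketched as follows. The exact fusion formula is not legible in this text, so the sketch assumes a convex combination controlled by a single weight. The input values, the random transformation matrix, and the names `project` and `fuse_features` are hypothetical.

```python
import random

def project(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def fuse_features(type_feat, struct_feat, weight):
    """Convex combination of type and structure features (assumed form)."""
    if len(type_feat) != len(struct_feat):
        raise ValueError("feature vectors must have the same length")
    return [weight * t + (1.0 - weight) * s
            for t, s in zip(type_feat, struct_feat)]

rng = random.Random(0)
clinical = [0.3, -1.2, 0.8, 0.1]  # hypothetical normalized clinical features
# Hypothetical 64 x 4 transformation matrix; in practice it would be learned.
transform = [[rng.gauss(0.0, 0.1) for _ in clinical] for _ in range(64)]

type_feat = project(transform, clinical)   # 64-dimensional node type feature
struct_feat = [0.0] * 64                   # placeholder topology representation
patient_feat = fuse_features(type_feat, struct_feat, weight=0.7)
```

Both inputs to `fuse_features` must share the same dimension, which is why the clinical vector is projected to 64 dimensions first.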
In this embodiment, the topological structure information of the nodes and the patient clinical data information are added to the encoding of the graph attention network (GAT), forming an improved graph attention mechanism:
A multi-head attention mechanism is adopted: M parameter matrices are used to calculate attention coefficients separately, and the calculated results are combined;
The following quantities appear in the calculation: the encoded representation of a node in a given GAT layer, whose initial value is the multi-omics feature of the patient; the set of neighbor nodes of the node; the attention weight between the node and each of its neighbor nodes; a weight parameter used to calculate the attention score; the weight matrix of the graph neural network at layer (l); the natural exponential function; the LeakyReLU activation function; and ||, the concatenation operation on vectors.
We train on the data with 5 GAT layers to obtain the final encoded omics feature representation of each patient.
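As a non-limiting illustration, one multi-head attention layer may be sketched as follows. The sketch follows the standard graph attention formulation: a LeakyReLU score over concatenated transformed features, normalized with a softmax over the neighbors, with the head outputs concatenated. It omits the output nonlinearity and the injection of topology and clinical features, so it is not the improved mechanism of this disclosure. All names and toy values are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_weights(h, neighbors, i, W, a):
    """Softmax-normalized attention of node i over its neighbors."""
    wh_i = matvec(W, h[i])
    scores = []
    for j in neighbors[i]:
        concat = wh_i + matvec(W, h[j])  # list concatenation plays the role of ||
        scores.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gat_head(h, neighbors, i, W, a):
    """Attention-weighted sum of the transformed neighbor features."""
    alphas = attention_weights(h, neighbors, i, W, a)
    out = [0.0] * len(W)
    for alpha, j in zip(alphas, neighbors[i]):
        for k, value in enumerate(matvec(W, h[j])):
            out[k] += alpha * value
    return out

def multi_head(h, neighbors, i, heads):
    """Concatenate the outputs of M heads, each given as a (W, a) pair."""
    out = []
    for W, a in heads:
        out.extend(gat_head(h, neighbors, i, W, a))
    return out

# Hypothetical toy data: three patient nodes with 2-dimensional features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [0.1, 0.2, 0.3, 0.4]), (identity, [0.0, 0.0, 0.0, 0.0])]
encoded = multi_head(features, neighbors, 0, heads)
```

Stacking five such layers, each consuming the previous layer's output, would correspond to the 5-layer configuration mentioned above.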
In this embodiment, S5 specifically comprises:
The survival status of the patient two years after surgical resection of colorectal cancer is represented by a label taking values in {0, 1} (death: 0, survival: 1), and a two-layer MLP predictor is then used to predict the survival status of the patient after two years. The calculation uses the ReLU activation function, the Sigmoid activation function, a set of trainable weight matrices, and a set of bias vectors;
The higher the predicted value, the lower the probability of survival, indicating that timely treatment is required; conversely, the lower the value, the higher the probability of survival.
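As a non-limiting illustration, a two-layer MLP predictor of this kind may be sketched as follows. The sketch assumes a ReLU hidden layer followed by a single Sigmoid output unit. The weights, the input vector, and the names `affine` and `mlp_predict` are hypothetical, and the interpretation of the output value is left to the surrounding text.

```python
import math

def affine(weights, bias, vector):
    """Compute weights @ vector + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def relu(vector):
    return [max(0.0, x) for x in vector]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_predict(h, w1, b1, w2, b2):
    """Two-layer MLP: ReLU hidden layer, then one Sigmoid output in (0, 1)."""
    hidden = relu(affine(w1, b1, h))
    logit = affine(w2, b2, hidden)[0]
    return sigmoid(logit)

# Hypothetical encoded patient representation and parameters.
patient = [1.0, 2.0]
w1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 3.0]], [-2.0]
score = mlp_predict(patient, w1, b1, w2, b2)
```

With these toy parameters the hidden layer evaluates to [1.0, 0.0] and the logit to 0, so the Sigmoid output sits exactly at the midpoint of its range.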
In this embodiment, the optimization of the prediction model specifically comprises:
The binary cross-entropy loss function is defined as follows:

$$L = -\frac{1}{n}\sum_{q=1}^{n}\left[y_q \log(\hat{y}_q) + (1 - y_q)\log(1 - \hat{y}_q)\right]$$

where q ranges from 1 to n; n is the number of patients used in training the prediction model; $y_q$ is the actual label of the q-th patient sample, taking the value 0 or 1; $\hat{y}_q$ is the probability, between 0 and 1, with which the prediction model predicts that the q-th sample belongs to the positive class; $\log(\hat{y}_q)$ and $\log(1 - \hat{y}_q)$ are the logarithms of the predicted probability and of its complement, respectively; and $\sum$ denotes the summation of the losses over all samples;
The whole prediction model is trained end to end using the Adam optimizer to minimize the binary cross-entropy loss function.
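As a non-limiting illustration, the binary cross-entropy loss may be computed as follows. The sketch assumes the loss is averaged over the n samples and clips the probabilities to avoid taking the logarithm of zero. The function name and the clipping constant are hypothetical, and the Adam update itself is not shown.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over paired labels and probabilities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities for four patients.
labels = [1, 0, 1, 0]
uncertain = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
```

Predictions that agree more closely with the labels give a smaller loss, which is the quantity the optimizer drives down during training.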
Example 1: experimental results of the prediction model
The performance of cancer survival prediction models is assessed mainly with common classification metrics, including accuracy, precision, recall, the AUC value, and the ROC curve. These metrics quantify the ability of the model to classify new samples: predictions are made on the test data set, and the numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) are counted. Accuracy reflects the proportion of all correctly classified samples among the total samples, while precision and recall focus on how well the model predicts the positive class.
Accuracy: the proportion of all correctly classified samples among the total samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: the proportion of samples predicted as positive by the model that are truly positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: the proportion of truly positive samples that the model correctly classifies as positive:

$$\text{Recall} = \frac{TP}{TP + FN}$$
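As a non-limiting illustration, the three metrics may be computed from predicted and actual labels as follows. The function names and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(num, den):
    return num / den if den else 0.0

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return safe_div(tp + tn, tp + fp + fn + tn)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return safe_div(tp, tp + fp)

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return safe_div(tp, tp + fn)

# Hypothetical test labels and thresholded predictions.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
counts = confusion_counts(actual, predicted)
```

For these toy labels the counts are TP = 2, FP = 1, FN = 1, and TN = 1.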
In the colorectal cancer prediction task, we compared the performance of four existing models, namely KNN (K-nearest neighbors), SVM (support vector machine), DNN (deep neural network), and LR (logistic regression), with the proposed COADGAT prediction model. The results show that the COADGAT prediction model performs well on accuracy, precision, recall, and AUC and, compared with these traditional models, predicts the survival of colorectal cancer patients more effectively. This highlights the usefulness of COADGAT for survival prediction in colorectal cancer patients. These findings support the selection of appropriate prediction models for colorectal cancer and a more comprehensive understanding of patient survival.
We collected detailed data for 628 colorectal cancer patients from the TCGA database, including lncRNA, miRNA, DNA copy number variation, DNA methylation, and clinical data. TCGA, as a global oncology program, provides large-scale cancer patient data and is a valuable resource for studying the molecular characteristics and clinical manifestations of colorectal cancer. The total data set is randomly divided into a training set and a test set at a ratio of approximately 8:2; the specific division is shown in Table 1. To ensure the fairness and robustness of the method, we run 5 experiments on each data division using 5-fold cross-validation and average the results to obtain the overall evaluation scores of the method under that division, as shown in Tables 1 and 2:
Table 1. Data distribution
Table 2. Comparison of the performance evaluation metrics of the five models
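As a non-limiting illustration, an approximately 8:2 split followed by 5-fold cross-validation may be sketched as follows. How the split and the folds were combined in the experiments is not fully specified here, so the sketch assumes the folds are drawn from the training portion. The function names, the seed, and the resulting set sizes are hypothetical and are not the values of Table 1.

```python
import random

def train_test_split(ids, test_ratio=0.2, seed=0):
    """Shuffle the ids and split them into (train, test) lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(ids, k=5):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

patient_ids = list(range(628))  # one id per patient in the data set
train_ids, test_ids = train_test_split(patient_ids)
folds = list(k_fold(train_ids, k=5))
```

Each fold serves once as the validation set, and the per-fold scores are then averaged to give the reported metric.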
The invention effectively solves the problem of data isolation by fully fusing the omics data and the clinical data of the patient. Integrating multi-source data from gene expression, copy number variation, and DNA methylation ensures that the model takes more comprehensive account of the multiple factors that influence cancer progression.
The present invention introduces an improved graph attention mechanism that enables the model to mine the latent information in patient data more deeply. This helps to increase sensitivity to changes in the disease state of colorectal cancer patients and addresses the problem of insufficient feature representation.
By effectively integrating multi-omics data and accounting for molecular and phenotypic differences between patients, the present invention better addresses the challenge posed by colorectal cancer as a highly heterogeneous disease. This helps to build a more accurate survival prediction model and supports personalized treatment and fine-grained research on cancer subtypes.
Through the improved graph attention mechanism, the invention effectively considers and exploits the relationships between patients across multi-omics data such as gene expression, copy number variation, and DNA methylation. This helps to increase the sensitivity of the model to the omics similarity between patients and enhances its comprehensiveness and predictive performance.
The improved graph attention mechanism focuses on effectively fusing the type information and the topological structure information of the patient nodes. This mechanism captures the intricate relationships between patients more accurately, enabling the model to understand the interactions in cancer progression more fully.
By taking the topological structure information into account, the invention enables the model to perform cancer prognosis analysis from a more comprehensive perspective and to adapt better to dynamic changes in cancer development.
The foregoing is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to the technical scheme and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (10)

1. A colorectal cancer prognosis method based on multiple sets of clinical test data, comprising the steps of:
S1, collecting, from different data sources, databases, or experiments, omics data of different colorectal cancer patients and the survival status of the patients two years after surgical resection of colorectal cancer, collecting clinical data related to the patients, and preprocessing the omics data and the clinical data;
S2, constructing a patient omics similarity network from the preprocessed omics data and clinical data and the survival status;
S3, encoding the node topological structure information of the patient and the clinical data information of the patient;
S4, adding the node topological structure information of the patient and the clinical data information of the patient to the encoding of the graph attention network;
S5, predicting, by means of a prediction model, the survival status of the patient two years after surgical resection of colorectal cancer;
S6, optimizing the prediction model using a binary cross-entropy loss function.
2. The method of claim 1, wherein the omics data comprise lncRNA, miRNA, DNA copy number variation, and DNA methylation data, and wherein the clinical data comprise age, sex, cancer stage, body mass index, tumor location, tumor size, systolic blood pressure, diastolic blood pressure, the tumor markers carcinoembryonic antigen and CA 19-9, the inflammation index C-reactive protein, white blood cell count, and sedimentation rate.
3. A method of colorectal cancer prognosis based on multiple sets of clinical test data according to claim 2, characterized in that S1 comprises in particular:
S11, deleting a type of omics data from the acquired data set if the proportion of missing values in that type of omics data exceeds 20%;
S12, applying Z-score normalization to the omics data and the clinical data;
S13, according to the different types of omics data features, projecting the original features of the different types of nodes into a unified latent space through node feature transformation.
4. A colorectal cancer prognosis method based on multiple sets of clinical test data according to claim 3, characterized in that S2 comprises in particular:
S21, defining a similarity matrix for each type of omics data;
wherein the definition uses the Euclidean distance between patients, a hyperparameter, and a term computed from the neighbors of each patient;
wherein the neighbor set of a patient denotes the first m neighbors of that patient in a given type of omics data, the m neighbors being obtained by calculating the Euclidean distances between the patient and all other samples in that omics data, sorting the distances, and selecting the first m samples as neighbors;
S22, introducing weights to integrate the similarities of the four types of omics data, each weight representing the importance of the corresponding type of omics data, and defining the final integrated similarity matrix;
S23, constructing a graph G with node features, comprising a set of vertices and a set of edges E, and defining the number of vertices and the number of edges;
S24, constructing a correlation graph among patients according to the survival status of the patients two years after surgical resection of colorectal cancer, wherein death is represented by 0 and survival by 1, and each patient is represented as a node in the graph; if two patients both survive for two years, an edge is created between them; and for each patient, the K most similar patients according to the final integrated similarity matrix are selected as neighbors, and edges are created to them.
5. The method for prognosis of colorectal cancer based on multiple sets of clinical test data according to claim 4, wherein establishing the topological structure information of the nodes in graph G comprises: generating, for each node i, node sequences of a specified length D using a random walk algorithm; learning a representation of each node sequence of node i using a bidirectional long short-term memory (LSTM) network; and fusing the sequence representations to obtain the topological structure representation of the node.
6. The method for colorectal cancer prognosis based on multiple sets of clinical test data according to claim 5, wherein establishing the topological structure information of the nodes in graph G specifically comprises the following steps:
A random walk starts from a node $v_i$ on graph G. If the walk visits node $v_t$ at step $t$, the next node $v_{t+1}$ is selected from the neighbor nodes of $v_t$ with the following probability:

$$P(v_{t+1} \mid v_t) = \begin{cases} \dfrac{1}{d(v_t)}, & v_{t+1} \in N(v_t) \\ 0, & \text{otherwise} \end{cases}$$

wherein $d(v_t)$ denotes the degree of node $v_t$, $N(v_t)$ denotes the set of neighbor nodes of $v_t$, and $v_{t+1}$ denotes the node selected next after node $v_t$ during the random walk;
The nodes recorded during the random walk are connected in order to form a random walk sequence of length D, which represents a node path from a starting node to a target node. The random walk algorithm is used to generate p random sequences for each node, and seq denotes the set of these p random sequences;
The representations of the p random sequences are learned using a long short-term memory (LSTM) network. For the y-th sequence of node i, whose first element is the starting node, the node features in the random sequence are input to the LSTM network step by step to obtain the final representation of the y-th sequence;
The p random-sequence representations obtained are weighted and fused to obtain the topological structure representation of the node, wherein the sequence index y ranges from 1 to p and each sequence representation has its own fusion weight.
7. A method of colorectal cancer prognosis based on multiple sets of clinical test data according to claim 6, characterized in that the feature encoding of the patient clinical data comprises:
The original clinical data feature vector of the patient is converted into a 64-dimensional vector using a transformation matrix;
The node type features and the node structure features are fused by weighting, wherein the node type features are the original clinical data features and the node structure features are the topological structure representations of the nodes; the fusion weight is a preset hyperparameter, and the fused result is the final clinical data feature representation of the patient.
8. A colorectal cancer prognosis method based on multiple sets of clinical test data according to claim 7, characterized in that the topological structure information of the nodes and the patient clinical data information are added to the encoding of the graph attention network (GAT), forming an improved graph attention mechanism:
A multi-head attention mechanism is adopted: M parameter matrices are used to calculate attention coefficients separately, and the calculated results are combined;
wherein the following quantities appear in the calculation: the encoded representation of a node in a given GAT layer, whose initial value is the multi-omics feature of the patient; the set of neighbor nodes of the node; the attention weight between the node and each of its neighbor nodes; a weight parameter used to calculate the attention score; the weight matrix of the graph neural network at layer (l); the natural exponential function; the LeakyReLU activation function; and ||, the concatenation operation on vectors.
9. A method of colorectal cancer prognosis based on multiple sets of clinical test data according to claim 8, characterized in that S5 specifically comprises:
The survival status of the patient is represented by a label taking values in {0, 1}, and a two-layer MLP predictor is used to predict the survival status of the patient after two years; the calculation uses the ReLU activation function, the Sigmoid activation function, a set of trainable weight matrices, and a set of bias vectors;
The higher the predicted value, the lower the probability of survival, indicating that timely treatment is required; conversely, the lower the value, the higher the probability of survival.
10. A method for colorectal cancer prognosis based on multiple sets of clinical test data according to claim 9, characterized in that the optimization of the prediction model specifically comprises:
The binary cross-entropy loss function is defined as follows:

$$L = -\frac{1}{n}\sum_{q=1}^{n}\left[y_q \log(\hat{y}_q) + (1 - y_q)\log(1 - \hat{y}_q)\right]$$

wherein q ranges from 1 to n; n is the number of patients used in training the prediction model; $y_q$ is the actual label of the q-th patient sample, taking the value 0 or 1; $\hat{y}_q$ is the probability, between 0 and 1, with which the prediction model predicts that the q-th sample belongs to the positive class; $\log(\hat{y}_q)$ and $\log(1 - \hat{y}_q)$ are the logarithms of the predicted probability and of its complement, respectively; and $\sum$ denotes the summation of the losses over all samples;
The whole prediction model is trained end to end using the Adam optimizer to minimize the binary cross-entropy loss function.
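As a non-limiting illustration of steps S12 and S24, Z-score normalization and the selection of the K most similar patients may be sketched as follows. The sketch covers only the similarity-based edges, not the survival-based edges, and it takes an already integrated similarity matrix as given. The function names and the toy matrix are hypothetical.

```python
import math

def z_score(column):
    """Normalize one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    return [(x - mean) / std if std > 0 else 0.0 for x in column]

def top_k_edges(similarity, k):
    """Connect each patient to its k most similar other patients."""
    edges = set()
    n = len(similarity)
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))  # store as an undirected edge
    return edges

normalized = z_score([1.0, 2.0, 3.0])

# Hypothetical integrated similarity matrix for three patients.
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]]
edges = top_k_edges(similarity, k=1)
```

With K = 1, patients 0 and 1 select each other and patient 2 selects patient 1, giving two undirected edges.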
CN202410532738.6A 2024-04-30 2024-04-30 Colorectal cancer prognosis method based on multiple sets of clinical test data Active CN118116600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410532738.6A CN118116600B (en) 2024-04-30 2024-04-30 Colorectal cancer prognosis method based on multiple sets of clinical test data

Publications (2)

Publication Number Publication Date
CN118116600A true CN118116600A (en) 2024-05-31
CN118116600B CN118116600B (en) 2024-07-09

Family

ID=91216354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410532738.6A Active CN118116600B (en) 2024-04-30 2024-04-30 Colorectal cancer prognosis method based on multiple sets of clinical test data

Country Status (1)

Country Link
CN (1) CN118116600B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118522467A (en) * 2024-07-22 2024-08-20 南通大学 Digestive tract health data analysis method and system

Also Published As

Publication number Publication date
CN118116600B (en) 2024-07-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant