CN118280453A - Cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network

Info

Publication number: CN118280453A
Application number: CN202410693066.7A
Authority: CN
Inventors: 韩聪聪; 周树森; 王庆军; 臧睦君; 刘通; 柳婵娟
Original assignee: Ludong University
Current assignee: Ludong University
Priority date: 2024-05-31
Filing date: 2024-05-31
Publication date: 2024-07-02
Anticipated expiration: 2044-05-31
Also published as: CN118280453B

Abstract

The present invention belongs to the field of bioinformatics and relates to a method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network. Accurate identification of cancer driver genes is crucial for formulating effective treatment plans. The present invention improves identification accuracy through multi-layer convolution. First, graph diffusion is used to generate an auxiliary network that augments the original graph data and better captures the association information in the biomolecular network. Second, a feature extraction module comprising a fully connected layer and four graph attention convolutional layers is constructed. Finally, the extracted feature matrices are input into a hierarchical attention classifier to obtain the prediction score of each cancer driver gene. Experiments show that the present invention not only identifies known driver genes effectively on different networks but also discovers new candidate cancer genes, supporting precision medicine and personalized treatment and furthering cancer research and treatment.

Description

A method for identifying cancer driver genes based on heterogeneous graph diffusion convolutional networks

Technical Field

The present invention belongs to the field of bioinformatics and relates to a method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network.

Background Art

Cancer driver genes confer a selective growth advantage on mutated cells, causing cellular abnormalities and uncontrolled proliferation and driving the onset and progression of cancer. In a biomolecular network, cancer driver genes and their interactions form a complex signaling network, and how well the information carried by this network can be processed and analyzed depends directly on the feature extraction module.

Accurate identification of cancer driver genes is of great significance for understanding the pathogenesis of cancer, advancing precision treatment, formulating personalized treatment plans, and discovering biomarkers.

At present, most cancer driver gene identification methods extract features from biomolecular networks using only simple graph convolution layers, while a few use multiple graph convolution layers and graph sample-and-aggregate convolution layers; neither approach extracts the features of biomolecular networks effectively. Therefore, adapting to the heterogeneity of biomolecular networks and effectively capturing the features of distant genes has become a major difficulty in this field.

Summary of the Invention

To overcome the above difficulties, the present invention proposes a cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network. The method uses graph attention convolution layers to capture the gene features of both the biomolecular network and the auxiliary biomolecular network, so that the complex interrelationships among genes across multiple biomolecular networks are learned effectively and the feature representation is enriched. The prediction score of each cancer driver gene is then obtained from the hidden features learned by the feature extraction module.

A cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network comprises three steps: graph data preprocessing and enhancement, construction of a feature extraction module, and construction of a multi-layer attention classifier. The specific steps are as follows:

Step 1: First, load the gene feature matrix and edge data of the biomolecular network of cancer driver genes. Then convert the gene feature matrix into a NumPy array, standardize the data, and convert the NumPy array into a PyTorch tensor. Finally, use a random-walk-based graph diffusion algorithm to enhance the original biomolecular network and obtain an auxiliary biomolecular network.

Step 2: During forward propagation, the gene feature matrix of the original biomolecular network is transformed by a fully connected layer. The transformed matrix then undergoes graph attention convolution with the edge data of the original graph and of the auxiliary graph separately, and the two outputs are fused to obtain the matrix after the first feature extraction. Next, the matrix after the first feature extraction undergoes graph attention convolution with the edge data of the original graph and of the auxiliary graph separately, and the two outputs are fused to obtain the matrix after the second feature extraction.

Step 3: The transformed feature matrix from Step 2 and the feature matrices after the two feature extractions are each passed into a fully connected layer, each producing a scalar; the three scalars are then each multiplied by a weight and summed to obtain the final prediction score of the cancer driver gene.

In the cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network, Step 1 is implemented as follows:

First, the biomolecular network graph data is loaded, including the feature matrix, labels, and edge data of the graph nodes. The feature matrix of the graph nodes is then standardized to obtain the gene feature matrix of the biomolecular network. Next, the gene feature matrix, labels, and edge data are input into the random-walk-based graph diffusion algorithm to obtain the edge data of the auxiliary network. Specifically, a state vector is first computed for each node, reflecting the importance of the node in the global graph structure; based on these state vectors, the nodes with higher weights are selected and the auxiliary network is rebuilt from them. The auxiliary network generated in this way provides a more accurate and more global feature representation for subsequent data analysis and machine learning tasks.
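The standardization step described above can be illustrated with a small dependency-free sketch. It is not the patent's code (the detailed description names scikit-learn's preprocessing.StandardScaler for this); the function name `standardize_columns` and the toy values are hypothetical, and the sketch only shows what "mean 0, variance 1 per feature column" means.

```python
from statistics import fmean, pstdev


def standardize_columns(matrix):
    """Scale each feature column to mean 0 and population variance 1."""
    columns = list(zip(*matrix))
    means = [fmean(col) for col in columns]
    # A constant column has standard deviation 0; divide by 1 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in matrix
    ]


# Toy gene feature matrix: 4 genes x 2 features (made-up values).
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
scaled = standardize_columns(features)
```

After scaling, both columns carry the same values even though the raw features differed by a factor of ten, which is the point of standardizing before the network sees the data.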

In the cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network, Step 2 is implemented as follows:

First, the gene feature matrix of the biomolecular network is passed through dropout with a rate of 0.5, and the result is fed into a fully connected layer with an input dimension of 58 and an output dimension of 80, followed by a ReLU activation, to obtain the feature transformation output. Then, the feature transformation output is input, together with each of the two sets of edge data in turn, into graph attention convolution layers with an input dimension of 80, an output dimension of 160, 2 attention heads, and the same dropout rate; the outputs are activated by ReLU and fused to obtain the output of the first feature extraction module. In the same way, the output of the first feature extraction module is input, together with each of the two sets of edge data, into graph attention convolution layers with an input dimension of 160, an output dimension of 160, and the same number of attention heads and dropout rate, to obtain the output of the second feature extraction module.
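As a rough illustration of the feature transformation at the start of Step 2 (dropout, then a 58-to-80 fully connected layer, then ReLU), the following pure-Python sketch processes a single gene's feature vector. The helper names, the random weights, and the random input are assumptions for illustration; a real model would learn the weights and would use PyTorch layers as the description states.

```python
import random


def dropout(vector, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def linear(vector, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


IN_DIM, OUT_DIM = 58, 80  # dimensions stated in the description
rng = random.Random(0)
# Hypothetical, untrained weights; training would set these.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
bias = [0.0] * OUT_DIM
gene = [rng.gauss(0.0, 1.0) for _ in range(IN_DIM)]  # one standardized gene row

hidden = relu(linear(dropout(gene, 0.5, rng), weight, bias))
```

The rescaling by 1 / (1 - rate) keeps the expected value of each entry unchanged, so dropout can simply be switched off at inference time.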

In the cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network, Step 3 is implemented as follows:

The feature transformation output, the output of the first feature extraction module, and the output of the second feature extraction module are each input into a separate fully connected layer; the input dimensions are 80, 160, and 160 respectively, and the output dimension of each is 1. Finally, each of the three outputs is multiplied by a learnable parameter and the products are summed to obtain the final prediction score of the cancer driver gene.

Brief Description of the Drawings

Figure 1 is a flowchart of the cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network.

Figure 2 is a flowchart of biomolecular network data preprocessing and enhancement.

Figure 3 is a flowchart of the feature extraction module.

Figure 4 is a flowchart of the multi-layer attention classifier.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and examples.

The present invention proposes a cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network, intended specifically for the identification of cancer driver genes.

Figure 1 is a flowchart of the method, which comprises three steps: graph data preprocessing and enhancement, construction of a feature extraction module, and construction of a multi-layer attention classifier. The specific implementation is as follows:

Step 1: Graph data preprocessing and enhancement. Figure 2 is a flowchart of graph data preprocessing and enhancement. This step includes the following:

First, the biomolecular network dataset is loaded, including the feature matrix, labels, and edge data of the graph nodes, and the feature matrix is converted into a NumPy array. Then, preprocessing.StandardScaler from scikit-learn is used to standardize the feature matrix to zero mean and unit variance, and the standardized feature matrix is converted into a PyTorch tensor. Next, data.Data from PyTorch Geometric is used to construct a Data object whose attributes include the gene feature matrix, labels, and edge data of the dataset, and transforms.GDC from PyTorch Geometric is used to create a geometric data transformer that performs graph diffusion, with the diffusion method set to ppr, the attenuation coefficient of the diffusion matrix set to 0.9, and the regularization term set to 0.0001. Here the geometric data transformer operates on the graph rather than on images: it computes a diffusion matrix from the original edges and sparsifies it to produce new edge data. The ppr (personalized PageRank) diffusion method is based on random walks and measures the proximity of nodes in a graph. Specifically, ppr defines an iterative process in which each gene walks randomly along an edge with probability $c$ and returns to itself with probability $1 - c$. The iteration for a gene converges when its state vector $\mathbf{r}_i$ no longer changes significantly. Each element of the final state vector $\mathbf{r}_i$ represents the connection strength between the corresponding gene and another gene. The update can be expressed by the following formula:

$$\mathbf{r}_i^{(t+1)} = c\,M\,\mathbf{r}_i^{(t)} + (1 - c)\,\mathbf{e}_i \tag{1}$$

Here $\mathbf{r}_i$ is an $N$-dimensional probability vector representing the state of the $i$-th gene (node), where $N$ is the number of genes. $\mathbf{e}_i$ is a one-hot vector, in which only the element for the $i$-th gene is 1 and all others are 0. $c$ is a parameter called the damping factor, which controls the probability of the random walk. The equation states that in each iteration the current state vector $\mathbf{r}_i$ is updated through the transition matrix $M$ and a bias vector $(1 - c)\,\mathbf{e}_i$ is added. Finally, the Data object is passed to the geometric data transformer to obtain the Data object of the biomolecular network with enhanced edge data.
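The iteration in formula (1) can be traced on a toy graph with the dependency-free sketch below. It is an illustration under stated assumptions, not the patent's implementation (which calls PyTorch Geometric's GDC transform): the helper names are hypothetical, the transition matrix is assumed to be column-normalized, the graph is assumed to have no isolated nodes, and the values c = 0.5 and eps = 0.05 are toy settings. How the patent's 0.9 and 0.0001 map onto these two knobs is not stated in the text.

```python
def transition_matrix(num_nodes, edges):
    """Column-stochastic transition matrix M for an undirected edge list."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    degree = [sum(adj[row][col] for row in range(num_nodes)) for col in range(num_nodes)]
    return [
        [adj[row][col] / degree[col] if degree[col] else 0.0 for col in range(num_nodes)]
        for row in range(num_nodes)
    ]


def personalized_pagerank(M, seed, c, tol=1e-10, max_iter=1000):
    """Iterate r <- c * M r + (1 - c) * e_seed until r stops changing."""
    n = len(M)
    e = [1.0 if k == seed else 0.0 for k in range(n)]
    r = e[:]
    for _ in range(max_iter):
        nxt = [
            c * sum(M[row][k] * r[k] for k in range(n)) + (1.0 - c) * e[row]
            for row in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


def diffusion_edges(M, c, eps):
    """Assumed sparsification: keep (i, j) when gene j scores at least eps for seed i."""
    n = len(M)
    edges = []
    for i in range(n):
        r_i = personalized_pagerank(M, i, c)
        edges.extend((i, j) for j in range(n) if j != i and r_i[j] >= eps)
    return edges


# Path graph 0 - 1 - 2 - 3 with toy parameters.
M = transition_matrix(4, [(0, 1), (1, 2), (2, 3)])
r0 = personalized_pagerank(M, seed=0, c=0.5)
aux = diffusion_edges(M, c=0.5, eps=0.05)
```

On this path graph the auxiliary edge set gains the pair (0, 2), which is not an edge of the original graph, while the more distant pair (0, 3) falls below the threshold. This is the sense in which diffusion lets the later convolution layers see genes beyond the immediate neighbours.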

Step 2: Construction of the feature extraction module. Figure 3 is a flowchart of the feature extraction module. This step includes the following:

First, utils.dropout_edge from PyTorch Geometric is used to apply dropout to the edge data of the original graph and the auxiliary graph with a rate of 0.5, and nn.functional.dropout from PyTorch is used to apply dropout to the feature matrix with a rate of 0.5. Then, nn.Linear from PyTorch is used to transform the feature matrix, with an input dimension of 58 and an output dimension of 80. Next, nn.GATConv from PyTorch Geometric is used to build four convolutional layers that propagate and transform the input features, with an output dimension of 160, 2 attention heads, and a dropout rate of 0.5; the input dimension is 80 for the first two layers and 160 for the last two, whose input is the 160-dimensional matrix from the first feature extraction. The first convolutional layer takes the original graph's edge data and the transformed feature matrix as input, and the second takes the auxiliary graph's edge data and the transformed feature matrix; the outputs of these two layers are activated by ReLU and fused to obtain the matrix after the first feature extraction. The third convolutional layer takes the original graph's edge data and the matrix after the first feature extraction as input, and the fourth takes the auxiliary graph's edge data and the matrix after the first feature extraction; the outputs of these two layers are activated by ReLU and fused to obtain the matrix after the second feature extraction.
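To make one "feature extraction" stage concrete, the sketch below runs a simplified single-head graph attention aggregation over the original edges and over the auxiliary edges, applies ReLU, and fuses the two results. Several things here are assumptions rather than the patent's method: the layer omits the learned linear projection and the two attention heads of GATConv, the attention vectors `a_src` and `a_dst` are made-up constants, and fusion is taken to be an element-wise sum because the text does not say how the two outputs are combined.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_layer(features, edges, a_src, a_dst):
    """Single-head attention over directed (source, target) edges plus self-loops."""
    n = len(features)
    neighbours = {i: [i] for i in range(n)}  # every node also attends to itself
    for src, dst in edges:
        neighbours[dst].append(src)
    output = []
    for i in range(n):
        logits = [
            leaky_relu(dot(a_dst, features[i]) + dot(a_src, features[j]))
            for j in neighbours[i]
        ]
        top = max(logits)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(value - top) for value in logits]
        total = sum(weights)
        row = [0.0] * len(features[i])
        for w, j in zip(weights, neighbours[i]):
            for k, value in enumerate(features[j]):
                row[k] += (w / total) * value
        output.append(row)
    return output


def relu_matrix(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


def fuse(a, b):
    """Assumed fusion rule: element-wise sum of the two branch outputs."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 genes, 2 toy features
original_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]    # original network
auxiliary_edges = [(0, 2), (2, 0)]                   # edges added by diffusion
a_src, a_dst = [0.3, -0.1], [0.2, 0.4]               # hypothetical attention vectors

branch_original = relu_matrix(attention_layer(features, original_edges, a_src, a_dst))
branch_auxiliary = relu_matrix(attention_layer(features, auxiliary_edges, a_src, a_dst))
first_extraction = fuse(branch_original, branch_auxiliary)
```

Running the same feature matrix through both edge sets is what lets gene 0 receive information from gene 2 in a single stage, even though the two are not adjacent in the original network.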

Step 3: Construction of the multi-layer attention classifier. Figure 4 is a flowchart of the multi-layer attention classifier. This step includes the following:

First, nn.functional.dropout from PyTorch is used to build three dropout layers that apply dropout, with a rate of 0.5, to the transformed feature matrix, the matrix after the first feature extraction, and the matrix after the second feature extraction. nn.Linear from PyTorch is used to build three fully connected layers with input dimensions of 80, 160, and 160 respectively and an output dimension of 1 each. The transformed feature matrix, the matrix after the first feature extraction, and the matrix after the second feature extraction are input into these three fully connected layers respectively. Finally, each of the three outputs is multiplied by a learnable parameter and the products are summed to obtain the final prediction score of the cancer driver gene.
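The scoring rule in Step 3 reduces to three linear heads followed by a weighted sum. The sketch below shows that arithmetic for one gene, using toy dimensions (2, 3, 3) in place of 80, 160, and 160; the function names and all numeric values are hypothetical, and in the actual model both the head weights and the three mixing weights would be learned.

```python
def linear_score(vector, weights, bias):
    """Fully connected layer with a single output."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def hierarchical_score(layer_features, heads, mix):
    """Weighted sum of one scalar score per representation level."""
    scores = [linear_score(x, w, b) for x, (w, b) in zip(layer_features, heads)]
    return sum(m * s for m, s in zip(mix, scores))


# One gene's representations after the transform and the two extractions (toy sizes).
layer_features = [[1.0, 2.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
heads = [
    ([0.5, 0.5], 0.0),        # head for the transformed features
    ([1.0, 1.0, 1.0], 0.0),   # head for the first extraction
    ([1.0, 0.0, -1.0], 1.0),  # head for the second extraction
]
mix = [0.2, 0.3, 0.5]  # stands in for the three learnable parameters

score = hierarchical_score(layer_features, heads, mix)
```

Because each level contributes its own scalar, the mixing weights indicate how much the shallow transform and each extraction stage matter for the final prediction.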

When the proposed method is applied to the identification of cancer driver genes, the AUCs obtained on the three datasets provided by HDGC are 0.8512, 0.8646, and 0.8675, and the AUPRs are 0.8014, 0.8527, and 0.8313, respectively. These results are better than those of HDGC and HDGICN on the same three datasets: HDGC achieves AUCs of 0.8278, 0.8405, and 0.8337 and AUPRs of 0.7770, 0.8286, and 0.7929, while HDGICN achieves AUCs of 0.8421, 0.8427, and 0.8418 and AUPRs of 0.7936, 0.8405, and 0.8115. Because the present invention performs thorough feature extraction and feature fusion on both the original biomolecular network and the auxiliary network produced by data enhancement, it outperforms these existing methods.
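For readers unfamiliar with the AUC metric reported above, the following sketch computes it from labels and prediction scores as the fraction of (driver, non-driver) pairs that are ranked correctly, counting ties as half. The labels and scores are made-up toy values, not data from the patent's experiments, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 marks a known driver gene, 0 a non-driver.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the values quoted in the paragraph above.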

The optimal model parameters are shown in the following table.

Table 1 Optimal model parameters

The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the technical field to which the present invention belongs may make simple deductions or substitutions without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims

1. A cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network, characterized in that an auxiliary network is generated by using a random walk-based graph diffusion to enhance the original biomolecular network data, a feature extraction module based on a graph attention convolutional layer is constructed to simultaneously extract the original biomolecular network data and the auxiliary biomolecular network data and perform feature fusion, and then the fused features are input into a multi-layer attention classifier to obtain the prediction score of the cancer driver gene, including three steps of graph data enhancement and preprocessing, construction of a feature extraction module and construction of a multi-layer attention classifier, and the specific steps are as follows:

Step 1: First, load the gene feature matrix and edge data of the biomolecular network of cancer driver genes; then, convert the gene feature matrix into a NumPy array, standardize the data, and then convert the NumPy array into a PyTorch tensor; use the random walk-based graph diffusion algorithm to enhance the original biomolecular network to obtain the auxiliary biomolecular network;

Step 2: During forward propagation, the gene feature matrix of the original biomolecular network is transformed by a fully connected layer; then, the transformed matrix undergoes graph attention convolution with the edge data of the original graph and of the auxiliary graph separately, and the outputs are fused to obtain the matrix after the first feature extraction; next, the matrix after the first feature extraction undergoes graph attention convolution with the edge data of the original graph and of the auxiliary graph separately, and the outputs are fused to obtain the matrix after the second feature extraction;

Step 3: Pass the feature matrix after feature transformation in step 2 and the feature matrix after two feature extractions into the fully connected layer respectively, output a scalar, and then multiply the three scalars by a weight respectively and sum them up to obtain the final prediction score of the cancer driver gene.

2. The cancer driver gene identification method based on a heterogeneous graph diffusion convolutional network according to claim 1, characterized in that the original biomolecular network data and the auxiliary network data generated by graph diffusion are used simultaneously to perform feature extraction and feature fusion through a feature extraction module based on graph attention convolutional layers; the feature extraction steps are as follows:

First, utils.dropout_edge from PyTorch Geometric is used to apply dropout to the edge data of the original graph and the auxiliary graph with a rate of 0.5, and nn.functional.dropout from PyTorch is used to apply dropout to the feature matrix with a rate of 0.5. Then, nn.Linear from PyTorch is used to transform the feature matrix, with an input dimension of 58 and an output dimension of 80. Next, nn.GATConv from PyTorch Geometric is used to build four convolutional layers that propagate and transform the input features, with an output dimension of 160, 2 attention heads, and a dropout rate of 0.5; the input dimension is 80 for the first two layers and 160 for the last two, whose input is the 160-dimensional matrix from the first feature extraction. The first convolutional layer takes the original graph's edge data and the transformed feature matrix as input, and the second takes the auxiliary graph's edge data and the transformed feature matrix; the outputs of these two layers are activated by ReLU and fused to obtain the matrix after the first feature extraction. The third convolutional layer takes the original graph's edge data and the matrix after the first feature extraction as input, and the fourth takes the auxiliary graph's edge data and the matrix after the first feature extraction; the outputs of these two layers are activated by ReLU and fused to obtain the matrix after the second feature extraction.