CN118280453A - Method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network - Google Patents

Info

Publication number
CN118280453A
Authority
CN
China
Prior art keywords
feature
matrix
graph
feature extraction
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410693066.7A
Other languages
Chinese (zh)
Other versions
CN118280453B (en)
Inventor
韩聪聪
周树森
王庆军
臧睦君
刘通
柳婵娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ludong University
Original Assignee
Ludong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ludong University
Priority to CN202410693066.7A
Publication of CN118280453A
Application granted
Publication of CN118280453B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention belongs to the field of bioinformatics and relates to a method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network. Accurate identification of cancer driver genes is crucial for formulating effective treatment plans. The invention improves identification accuracy through multi-layer convolution. First, graph diffusion is used to generate an auxiliary network, augmenting the original graph data so that the associations in the biomolecular network are captured more fully. Second, a feature extraction module comprising one fully connected layer and four graph attention convolutional layers extracts features. Finally, the extracted feature matrices are input into a hierarchical attention classifier to obtain the prediction score of each cancer driver gene. Experiments show that the invention not only identifies known driver genes on different networks but also discovers new candidate cancer genes, supporting precision medicine and personalized treatment and further advancing cancer research and therapy.

Description

A method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network

Technical Field

The present invention belongs to the field of bioinformatics and relates to a method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network.

Background Art

Cancer driver genes confer a selective growth advantage on mutant cells, causing cellular abnormalities and uncontrolled proliferation and driving the onset and progression of cancer. In a biomolecular network, cancer driver genes and their interactions form a complex signaling network; how well the information carried by that network can be processed and analyzed depends directly on the feature extraction module.

Accurate identification of cancer driver genes is of great significance for understanding the pathogenesis of cancer, advancing precision treatment, formulating personalized treatment plans, and discovering biomarkers.

At present, most cancer driver gene identification methods extract features from biomolecular networks using only simple graph convolution layers, and a few use multiple graph convolution layers together with graph sample-and-aggregate convolution layers; neither approach extracts the features of biomolecular networks effectively. How to adapt to the heterogeneity of biomolecular networks and effectively capture the features of distant genes has therefore become a major difficulty in this field.

Summary of the Invention

To overcome the above difficulties, the present invention proposes a method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network. Using graph attention convolution layers, the method captures gene features from both the biomolecular network and the auxiliary biomolecular network, so that the complex relationships among genes across multiple biomolecular networks are learned effectively and the feature representation is enriched. The prediction score of each cancer driver gene is then obtained from the hidden features learned by the feature extraction module.

The method comprises three steps: graph data preprocessing and augmentation, construction of the feature extraction module, and construction of the multi-layer attention classifier. The steps are as follows:

Step 1. Load the gene feature matrix and edge data of the biomolecular network of cancer driver genes. Convert the gene feature matrix into a NumPy array, standardize it, and convert the NumPy array into a PyTorch tensor. Augment the original biomolecular network with a random-walk-based graph diffusion algorithm to obtain the auxiliary biomolecular network.

Step 2. During forward propagation, the gene feature matrix of the original biomolecular network is transformed by a fully connected layer. Graph attention convolution is then applied to the transformed matrix twice, once with the edge data of the original graph and once with the edge data of the auxiliary graph, and the two outputs are fused to give the matrix after the first feature extraction. The same pair of graph attention convolutions is then applied to that matrix, and the two outputs are fused to give the matrix after the second feature extraction.

Step 3. The feature-transformed matrix from Step 2 and the matrices after the two feature extractions are each passed through a fully connected layer that outputs a scalar; the three scalars are each multiplied by a weight and summed to give the final prediction score of the cancer driver gene.

Step 1 is implemented as follows:

First, the biomolecular network graph data are loaded, including the node feature matrix, the node labels and the edge data. The node feature matrix is then standardized to give the gene feature matrix of the biomolecular network. Next, the gene feature matrix, labels and edge data are input to the random-walk-based graph diffusion algorithm to obtain the edge data of the auxiliary network. Specifically, a state vector is first computed for each node, reflecting the importance of that node in the global graph structure; nodes with higher weights are selected according to these state vectors, and the auxiliary network is rebuilt from them. The auxiliary network generated in this way provides a more accurate and more global feature representation for subsequent data analysis and machine learning tasks.
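The standardization in Step 1 can be illustrated with a minimal NumPy sketch. It is not the patent's code: the function name `standardize` and the toy matrix are hypothetical, and the arithmetic simply mirrors what a zero-mean, unit-variance scaler does per feature column (population standard deviation, constant columns left at zero).

```python
import numpy as np

def standardize(features: np.ndarray) -> np.ndarray:
    """Column-wise z-score: zero mean and unit variance for each feature."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)            # population std (ddof=0)
    std = np.where(std == 0.0, 1.0, std)  # constant columns: avoid dividing by zero
    return (features - mean) / std

# Toy gene-by-feature matrix; the values are illustrative only.
x = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])
z = standardize(x)
```

In the pipeline described above, the standardized array would then be converted to a PyTorch tensor before the graph diffusion step.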

Step 2 is implemented as follows:

First, the gene feature matrix of the biomolecular network is passed through dropout with a rate of 0.5, and then through a fully connected layer (input dimension 58, output dimension 80) and a ReLU activation to give the feature transformation output. Next, the feature transformation output is input, together with each of the two sets of edge data in turn, to a graph attention convolution layer with input dimension 80, output dimension 160, 2 attention heads and the same dropout rate; the outputs are activated by ReLU and fused to give the output of the first feature extraction module. In the same way, the output of the first feature extraction module is input, together with each of the two sets of edge data, to a graph attention convolution layer with input dimension 160, output dimension 160 and the same number of attention heads and dropout rate, giving the output of the second feature extraction module.

Step 3 is implemented as follows:

The feature transformation output, the output of the first feature extraction module and the output of the second feature extraction module are each input to their own fully connected layer, with input dimensions 80, 160 and 160 respectively and an output dimension of 1 in every case. Finally, the outputs of the three fully connected layers are each multiplied by a learnable parameter and summed to give the final prediction score of the cancer driver gene.

Brief Description of the Drawings

Figure 1 is a flowchart of the method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network.

Figure 2 is a flowchart of biomolecular network data preprocessing and augmentation.

Figure 3 is a flowchart of the feature extraction module.

Figure 4 is a flowchart of the multi-layer attention classifier.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and examples.

The present invention proposes a method based on a heterogeneous graph diffusion convolutional network for identifying cancer driver genes.

As shown in the flowchart of Figure 1, the method comprises three steps: graph data preprocessing and augmentation, construction of the feature extraction module, and construction of the multi-layer attention classifier. The specific implementation is as follows:

Step 1: Graph data preprocessing and augmentation (flowchart in Figure 2).

First, the biomolecular network dataset is loaded, including the node feature matrix, the node labels and the edge data, and the node feature matrix is converted into a NumPy array. Then preprocessing.StandardScaler from scikit-learn is used to standardize the node feature matrix to a mean of 0 and a variance of 1, and the standardized feature matrix is converted into a PyTorch tensor. Next, data.Data from PyTorch Geometric is used to construct a Data object whose attributes include the gene feature matrix, labels and edge data of the dataset, and the GDC transform from the PyTorch Geometric transforms module (conventionally imported as T) is used to create a transform that performs graph diffusion. Its diffusion method is ppr, the attenuation coefficient of the diffusion matrix is set to 0.9, and the regularization term is set to 0.0001. The transform acts on the graph structure: it computes a diffusion matrix from the original edges and derives a new set of edges from it.

The ppr (personalized PageRank) diffusion method is based on random walks and measures how close nodes are to one another in the graph. Specifically, ppr defines an iterative process in which each gene walks randomly along an edge with probability $c$ and returns to itself with probability $1 - c$. The iteration for a gene has converged when its state vector no longer changes significantly. Each element of the final state vector $\mathbf{p}_i$ represents the connection strength between the corresponding gene and another gene. The update can be expressed by the following formula:

$$\mathbf{p}_i^{(t+1)} = c\,M\,\mathbf{p}_i^{(t)} + (1 - c)\,\mathbf{e}_i \tag{1}$$

Here $\mathbf{p}_i$ is an N-dimensional probability vector representing the state of the i-th gene (node), where N is the number of genes. $\mathbf{e}_i$ is a one-hot vector, with a single element equal to 1 and the rest 0, that identifies the i-th gene. $c$ is a parameter called the damping factor, which controls the probability of the random walk. Equation (1) says that in each iteration the current state vector $\mathbf{p}_i^{(t)}$ is updated through the transition matrix $M$, and a bias vector $(1 - c)\,\mathbf{e}_i$ is added. Finally, the Data object is passed to the transform to obtain the Data object of the biomolecular network with augmented edge data.
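The iteration described above can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the transform used in the patent: the names `transition_matrix` and `ppr_vector`, the toy path graph, the column normalization of the adjacency matrix, and the reading of the 0.9 coefficient as the walk probability c are all assumptions made for illustration.

```python
import numpy as np

def transition_matrix(adj: np.ndarray) -> np.ndarray:
    """Column-normalize an adjacency matrix (assumes every node has an edge)."""
    return adj / adj.sum(axis=0)

def ppr_vector(M, seed, c=0.9, tol=1e-10, max_iter=10_000):
    """Iterate p <- c * M @ p + (1 - c) * e_seed until the update is negligible."""
    n = M.shape[0]
    e = np.zeros(n)
    e[seed] = 1.0                      # one-hot vector for the seed gene
    p = e.copy()
    for _ in range(max_iter):
        p_next = c * (M @ p) + (1.0 - c) * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy 4-node path graph 0-1-2-3 (illustrative only).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
M = transition_matrix(adj)
p0 = ppr_vector(M, seed=0)
```

At convergence the vector satisfies p = (1 - c)(I - cM)^{-1} e, which gives a direct way to check the iteration on small graphs. An auxiliary edge set could then be derived by keeping only the strongest entries of each such vector; the thresholding rule is not shown here.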

Step 2: Construction of the feature extraction module (flowchart in Figure 3).

First, utils.dropout_edge from PyTorch Geometric is applied to the edge data of the original graph and of the auxiliary graph with a drop rate of 0.5, and nn.functional.dropout from PyTorch (conventionally imported as F) is applied to the feature matrix with a dropout rate of 0.5. Then nn.Linear from PyTorch transforms the feature matrix from an input dimension of 58 to an output dimension of 80. Next, nn.GATConv from PyTorch Geometric is used to build four convolutional layers that propagate and transform the input features, with an input dimension of 80, an output dimension of 160, 2 attention heads and a dropout rate of 0.5; the third and fourth layers take the 160-dimensional output of the first feature extraction as input, so their input dimension is 160, as stated in the Summary. The first convolutional layer takes the original-graph edge data and the feature-transformed matrix as input, and the second takes the auxiliary-graph edge data and the feature-transformed matrix; the outputs of these two layers are activated by ReLU and fused to give the matrix after the first feature extraction. The third convolutional layer takes the original-graph edge data and the matrix after the first feature extraction as input, and the fourth takes the auxiliary-graph edge data and the matrix after the first feature extraction; the outputs of these two layers are activated by ReLU and fused to give the matrix after the second feature extraction.
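To make the two-branch extraction concrete, the following NumPy sketch runs one graph attention layer over the original edges and one over the auxiliary edges, then fuses the activated outputs. It is a simplified illustration, not the GATConv implementation: it uses a single attention head, no dropout, random weights, and toy edge lists, and fusing by summation is an assumption because the text does not name the fusion operator. All function and variable names are hypothetical.

```python
import numpy as np

def gat_layer(h, edges, W, a_src, a_dst, slope=0.2):
    """Single-head graph attention; edges are (source, target) pairs, self-loops added."""
    n = h.shape[0]
    z = h @ W                                      # projected features, (n, d_out)
    mask = np.eye(n, dtype=bool)                   # every node attends to itself
    for s, t in edges:
        mask[t, s] = True                          # target t receives from source s
    # logits[t, s] = LeakyReLU(a_dst . z_t + a_src . z_s)
    logits = (z @ a_dst)[:, None] + (z @ a_src)[None, :]
    logits = np.where(logits > 0, logits, slope * logits)
    logits = np.where(mask, logits, -np.inf)       # non-neighbours get zero weight
    logits -= logits.max(axis=1, keepdims=True)    # numerically stable softmax
    att = np.exp(logits)
    att /= att.sum(axis=1, keepdims=True)
    return att @ z, att

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)

def make_params(d_in, d_out):
    """Random stand-ins for trained weights."""
    return (rng.normal(scale=0.1, size=(d_in, d_out)),
            rng.normal(scale=0.1, size=d_out),
            rng.normal(scale=0.1, size=d_out))

n, d_in, d_out = 5, 80, 160
h0 = rng.normal(size=(n, d_in))                    # feature-transformed matrix (toy)
orig_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 4), (4, 3)]
aux_edges = [(0, 2), (2, 0), (1, 3), (3, 1), (2, 4), (4, 2)]

out_orig, att_orig = gat_layer(h0, orig_edges, *make_params(d_in, d_out))
out_aux, att_aux = gat_layer(h0, aux_edges, *make_params(d_in, d_out))
h1 = relu(out_orig) + relu(out_aux)                # fusion by summation (assumption)
```

Repeating the same pattern with `h1` as input and 160-dimensional weights would give the matrix after the second feature extraction.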

Step 3: Construction of the multi-layer attention classifier (flowchart in Figure 4).

First, nn.functional.dropout from PyTorch is used as three dropout layers, applied with a dropout rate of 0.5 to the feature-transformed matrix, the matrix after the first feature extraction and the matrix after the second feature extraction. nn.Linear from PyTorch is used to build three fully connected layers with input dimensions of 80, 160 and 160 respectively and an output dimension of 1 in every case. The feature-transformed matrix, the matrix after the first feature extraction and the matrix after the second feature extraction are input to these three fully connected layers respectively, and finally the outputs of the three fully connected layers are each multiplied by a learnable parameter and summed to give the final prediction score of the cancer driver gene.
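The scoring arithmetic of Step 3 can be sketched as follows. This NumPy sketch only illustrates the weighted sum of three per-layer scores; the random matrices stand in for the real layer outputs, the `mix` values stand in for the three learnable parameters, dropout is omitted, and all names are hypothetical.

```python
import numpy as np

def layer_score(h, w, b):
    """Linear head mapping an (n, d) matrix to one scalar per gene."""
    return h @ w + b

rng = np.random.default_rng(0)
n_genes = 4
dims = (80, 160, 160)                    # widths of the three representations

# Toy stand-ins for the feature-transformed matrix and the two extracted matrices.
reps = [rng.normal(size=(n_genes, d)) for d in dims]
# One linear head per representation (random stand-ins for trained weights).
heads = [(rng.normal(scale=0.1, size=d), 0.0) for d in dims]
# Stand-ins for the three learnable mixing parameters.
mix = np.array([0.2, 0.3, 0.5])

per_layer = np.stack(
    [layer_score(h, w, b) for h, (w, b) in zip(reps, heads)], axis=1
)                                        # shape (n_genes, 3)
scores = per_layer @ mix                 # weighted sum: one score per gene
```

Keeping a separate head for each depth lets the model weigh shallow and deep representations differently through the mixing parameters instead of relying on the last layer alone.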

When the proposed method is applied to cancer driver gene identification, it achieves AUCs of 0.8512, 0.8646 and 0.8675 and AUPRs of 0.8014, 0.8527 and 0.8313 on the three datasets provided by HDGC. This is better than HDGC and HDGICN on the same three datasets: HDGC achieves AUCs of 0.8278, 0.8405 and 0.8337 and AUPRs of 0.7770, 0.8286 and 0.7929, and HDGICN achieves AUCs of 0.8421, 0.8427 and 0.8418 and AUPRs of 0.7936, 0.8405 and 0.8115. Because the invention performs thorough feature extraction and feature fusion on both the original biomolecular network and the auxiliary network produced by data augmentation, its performance is higher than that of these existing methods.
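For readers unfamiliar with the two metrics, the sketch below computes them on a toy example. The labels and scores are invented for illustration and are unrelated to the reported results; average precision is used as the estimate of the area under the precision-recall curve, which is an assumption about how AUPR was computed, and tied scores are not given special treatment there.

```python
import numpy as np

def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]             # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")     # highest score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Toy labels (1 = driver gene) and predicted scores; illustrative only.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, y_score)
aupr = average_precision(y_true, y_score)
```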

The optimal model parameters are shown in the following table.

Table 1. Optimal model parameters

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the technical field to which the invention belongs may make simple deductions or substitutions without departing from the concept of the invention, and all of these should be regarded as falling within the scope of protection of the invention.

Claims (2)

1. A method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network, characterized in that an auxiliary network is generated by random-walk-based graph diffusion to augment the original biomolecular network data; a feature extraction module based on graph attention convolution layers is constructed to extract features from the original biomolecular network data and the auxiliary biomolecular network data simultaneously and to fuse them; and the fused features are then input to a multi-layer attention classifier to obtain the prediction score of the cancer driver gene; the method comprising three steps, namely graph data augmentation and preprocessing, construction of the feature extraction module, and construction of the multi-layer attention classifier, as follows:

Step 1. First, load the gene feature matrix and edge data of the biomolecular network of cancer driver genes; then convert the gene feature matrix into a NumPy array, standardize the data, and convert the NumPy array into a PyTorch tensor; augment the original biomolecular network with a random-walk-based graph diffusion algorithm to obtain the auxiliary biomolecular network;

Step 2. During forward propagation, the gene feature matrix of the original biomolecular network is transformed by a fully connected layer; then graph attention convolution is applied to the feature-transformed matrix with the edge data of the original graph and of the auxiliary graph respectively, and the outputs are fused to obtain the matrix after the first feature extraction; next, graph attention convolution is applied to the matrix after the first feature extraction with the edge data of the original graph and of the auxiliary graph respectively, and the outputs are fused to obtain the matrix after the second feature extraction;

Step 3. The feature-transformed matrix from Step 2 and the matrices after the two feature extractions are each passed into a fully connected layer that outputs a scalar, and the three scalars are then each multiplied by a weight and summed to obtain the final prediction score of the cancer driver gene.

2. The method for identifying cancer driver genes based on a heterogeneous graph diffusion convolutional network according to claim 1, characterized in that the original biomolecular network data and the auxiliary network data generated by graph diffusion are used together for feature extraction and feature fusion by the feature extraction module based on graph attention convolution layers; the feature extraction steps are as follows:

First, utils.dropout_edge from PyTorch Geometric is applied to the edge data of the original graph and of the auxiliary graph with a drop rate of 0.5, and nn.functional.dropout from PyTorch is applied to the feature matrix with a dropout rate of 0.5; then nn.Linear from PyTorch transforms the feature matrix, with an input dimension of 58 and an output dimension of 80; next, nn.GATConv from PyTorch Geometric is used to build four convolutional layers that propagate and transform the input features, with an input dimension of 80, an output dimension of 160, 2 attention heads and a dropout rate of 0.5; the first convolutional layer takes the original-graph edge data and the feature-transformed matrix as input, the second convolutional layer takes the auxiliary-graph edge data and the feature-transformed matrix as input, and the outputs of these two layers are activated by ReLU and fused to obtain the matrix after the first feature extraction; the third convolutional layer takes the original-graph edge data and the matrix after the first feature extraction as input, the fourth convolutional layer takes the auxiliary-graph edge data and the matrix after the first feature extraction as input, and the outputs of these two layers are activated by ReLU and fused to obtain the matrix after the second feature extraction.
CN202410693066.7A 2024-05-31 2024-05-31 A method for identifying cancer driver genes based on heterogeneous graph diffusion convolutional networks Active CN118280453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410693066.7A CN118280453B (en) 2024-05-31 2024-05-31 A method for identifying cancer driver genes based on heterogeneous graph diffusion convolutional networks

Publications (2)

Publication Number Publication Date
CN118280453A (en) 2024-07-02
CN118280453B (en) 2024-08-16

Family

ID=91640415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410693066.7A Active CN118280453B (en) 2024-05-31 2024-05-31 A method for identifying cancer driver genes based on heterogeneous graph diffusion convolutional networks

Country Status (1)

Country Link
CN (1) CN118280453B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170017749A1 (en) * 2015-07-15 2017-01-19 International Business Machines Corporation System and method for identifying cancer driver genes
CN115019883A (en) * 2022-02-13 2022-09-06 昆明理工大学 Cancer driver gene identification method based on multi-network graph convolution
CN115171779A (en) * 2022-07-13 2022-10-11 浙江大学 Cancer driver gene prediction device based on graph attention network and multi-omics fusion
US20220392579A1 (en) * 2019-11-13 2022-12-08 Memorial Sloan Kettering Cancer Center Classifier models to predict tissue of origin from targeted tumor dna sequencing
CN115910203A (en) * 2022-06-24 2023-04-04 山东大学 Cancer Synergy-Driven Pathway Identification System Based on Attribute Heterogeneous Network Embedding
CN116825204A (en) * 2023-08-30 2023-09-29 鲁东大学 Single-cell RNA sequence gene regulation inference method based on deep learning
CN117334252A (en) * 2023-10-11 2024-01-02 西南科技大学 Cancer driving gene identification method based on heterophilic graph information maximization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI PENG et al.: "Identifying cancer driver genes based on multi-view heterogeneous graph convolutional network and self-attention mechanism", BMC Bioinformatics, 13 January 2023 (2023-01-13) *
Wan Meihan; Xiong Yun; Zhu Yangyong: "Gene function prediction based on a heterogeneous network hierarchical attention mechanism", Computer Engineering, no. 07 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120412748A (en) * 2025-04-14 2025-08-01 云南财经大学 Identification method of specific and shared driver genes based on federated transfer learning and deep learning algorithms
CN120412748B (en) * 2025-04-14 2025-11-04 云南财经大学 Identification method of specific and shared driver genes based on federated transfer learning and deep learning algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant