CN119601090A - A gene co-expression network identification method and system based on graph convolutional neural network

Info

Publication number: CN119601090A
Application number: CN202411649910.2A
Authority: CN
Inventors: 徐斯文; 杨舒婷; 陆紫箫
Original assignee: Guangdong Pharmaceutical University
Current assignee: Guangdong Pharmaceutical University
Priority date: 2024-11-19
Filing date: 2024-11-19
Publication date: 2025-03-11
Anticipated expiration: 2044-11-19
Also published as: CN119601090B

Abstract

The invention relates to a gene co-expression network identification method and system based on a graph convolutional neural network. The method comprises the following steps: S1, preprocessing the data; S2, constructing a gene network graph based on gene expression correlation; S3, extracting features by capturing node information with the graph convolutional neural network GraphSAGE model; S4, performing cluster analysis with a Gaussian mixture model; and S5, dividing the genes into modules according to the clustering result. The method can help researchers extract valuable information from massive gene expression data more effectively, gain a deeper understanding of gene regulatory networks, discover new gene functions and regulatory mechanisms, and provide important guidance for disease treatment and prevention. The method is also scalable and flexible, and can meet the analysis requirements of gene expression data of different scales and types.

Description

Gene co-expression network identification method and system based on graph convolution neural network

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a gene co-expression network identification method and system based on a graph convolution neural network.

Background

In recent years, with the rapid development of bioinformatics, the analysis of gene expression data has become an important means of investigating open questions in the life sciences. Gene expression data reflect the expression levels of genes in organisms under different conditions and are important for understanding gene functions, regulatory mechanisms and the mechanisms by which diseases arise. However, given the enormous complexity of gene expression data, extracting valuable information from massive amounts of data remains one of the challenges facing bioinformatics. The gene co-expression network analysis methods in wide use today are mainly based on biostatistical techniques, and they suffer from problems such as high computational complexity and insufficient precision when processing large-scale genomic data.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention introduces GraphSAGE, a graph convolutional network from the field of deep learning, which samples and aggregates neighbors to extract features, and identifies gene co-expression modules on that basis. The approach is expected to reveal the regulatory relationships among genes more accurately and efficiently and to uncover gene functions that have not yet been recognized, which is of important guiding significance for the treatment and prevention of diseases.

The technical scheme of the invention is as follows:

A gene co-expression network identification method based on a graph convolutional neural network comprises the following steps:

S1, preprocessing the data to reduce data noise and scale differences.

S2, constructing a gene network graph based on gene expression correlation. Based on the preprocessed gene expression data, the gene network graph is constructed by calculating the correlation between genes (for example, the Pearson correlation coefficient). In the graph, genes serve as nodes and the correlations between genes serve as edge weights.

S3, extracting features by capturing node information with the graph convolutional neural network GraphSAGE model. The GraphSAGE model captures and extracts the node information in the gene network graph, and a sampler is used to sample and aggregate the neighbors of each node. GraphSAGE learns the features of each node together with those of its neighbor nodes, and thereby captures the complex regulatory relationships among genes.

S4, performing cluster analysis with a Gaussian mixture model. The features extracted by the graph convolutional network are clustered with a Gaussian mixture model (GMM). A GMM can fit complex data distributions and identify different gene modules. The clustering result divides the genes into different co-expression modules.

S5, dividing the genes into modules according to the clustering result. Principal component analysis is performed on each cluster module to obtain its first principal component, a correlation coefficient is calculated between the sample labels and the first principal component, and the genes of highly correlated modules are taken as markers of the gene network, which can be used for subsequent downstream analysis.

According to an embodiment of the present invention, in step S1, the preprocessing includes data cleaning, missing value filling, normalization and variance screening. Preferably, the data are first subjected to mean screening and then to variance screening, in which the genes are sorted by variance and the high-variance genes are retained, to ensure the accuracy and comparability of the data. In a preferred embodiment, the mean-screening threshold is an FPKM mean of 0.5, and the top 10000 high-variance genes are retained.

According to an embodiment of the present invention, in step S2, genes are taken as nodes and the Pearson correlation coefficient C is calculated for every pair of genes. To determine the edges, a P value is then computed from the correlation coefficient using a given soft threshold st, according to the formula $P = |C_{ij}|^{st}$, where $C_{ij}$ is the correlation coefficient between genes i and j. If the P value is greater than the hard threshold ht, the edge is added to the gene network graph with the P value as its weight. In a preferred embodiment, the soft threshold is st = 7 and the hard threshold ht is set to 0.01.

According to an embodiment of the present invention, in step S3, the GraphSAGE model includes an input layer, two SAGEConv convolution layers and an output layer. The input layer receives the node feature matrix x. Each SAGEConv convolution layer is followed by network layers that normalize the features, add nonlinearity to the model and prevent overfitting. The output layer reduces the feature dimension to yield the final graph embedding, which is a low-dimensional representation of the node features. In a preferred embodiment, each of the two convolution layers samples 10 neighbors, only 128 nodes are sampled at a time, the feature dimension output by the two SAGEConv layers is 128, and the output layer reduces the feature dimension to 5.

According to an embodiment of the present invention, in step S4, the features extracted by the graph convolutional network are clustered with a Gaussian mixture model, and the genes are divided into different co-expression modules. The number of clusters k is set to twice the square root of the number of graph embeddings. The covariance type is set to diagonal, so that the dimensions within each cluster are treated as uncorrelated. The weight concentration prior is set to 1/k, according to the maximum number of clusters, to control the weight distribution among the clusters. Soft clustering is used, so that each data point is assigned a probability of belonging to each cluster.

According to an embodiment of the present invention, in step S5, a significance score is calculated for each cluster module: principal component analysis is performed on the module to obtain its first principal component, and the Pearson correlation coefficient between the sample labels and the first principal component is then calculated. Finally, the genes of highly correlated modules are taken as markers of the gene network and can be used for subsequent downstream analysis.

In another aspect, the invention also provides a gene co-expression network identification system based on a graph convolutional neural network, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the gene co-expression network identification method based on a graph convolutional neural network described above.

Compared with the prior art, the invention has the following beneficial effects:

By the analysis method provided by the invention, researchers can more effectively extract valuable information from massive gene expression data, understand the gene regulation network in depth, discover new gene functions and regulation mechanisms, and provide important guidance for the treatment and prevention of diseases. Meanwhile, the gene co-expression network identification method has good expandability and flexibility, and can meet the analysis requirements of gene expression data of different scales and types.

Drawings

Fig. 1 is a schematic technical flow chart of an embodiment of the present invention.

Detailed Description

In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Unless otherwise specified, the reagents, materials and equipment used in the embodiments of the invention are all commercially available, and the test methods are all conventional in the field.

Referring to fig. 1, the invention provides a gene co-expression network identification method based on a graph convolution neural network, which comprises the following steps:

S1, preprocessing data.

Preprocessing may include data cleansing, missing value filling, normalization, variance screening, etc., with selections made as needed to reduce data noise and scale differences.

Before variance screening, the data is subjected to mean screening so as to ensure the accuracy and comparability of the data.

In one embodiment, the data are first subjected to mean screening with a threshold of an FPKM mean of 0.5. Variance screening is then performed: the genes are sorted by variance and the 10000 genes with the highest variances are retained, to ensure the accuracy and comparability of the data.
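
The two screening steps can be illustrated with a minimal Python sketch. It assumes that mean screening keeps genes whose mean FPKM is strictly above the threshold (the source gives only the threshold value, not the direction or strictness of the comparison) and that population variance is used. The function name `filter_genes` and the toy expression values are hypothetical.

```python
from statistics import mean, pvariance


def filter_genes(expr, mean_threshold=0.5, top_n=10000):
    """Return gene names that pass mean screening, ranked by variance.

    expr maps a gene name to its FPKM values across samples.
    """
    # Assumption: a gene passes when its mean FPKM is strictly above the threshold.
    passed = {g: v for g, v in expr.items() if mean(v) > mean_threshold}
    # Sort the remaining genes by variance, highest first, and keep the top_n.
    ranked = sorted(passed, key=lambda g: pvariance(passed[g]), reverse=True)
    return ranked[:top_n]


# Toy data (made-up numbers): gene -> FPKM across four samples.
toy = {
    "geneA": [0.1, 0.2, 0.1, 0.0],  # mean 0.1, removed by mean screening
    "geneB": [5.0, 5.0, 5.0, 5.0],  # variance 0
    "geneC": [1.0, 9.0, 1.0, 9.0],  # variance 16
    "geneD": [2.0, 4.0, 2.0, 4.0],  # variance 1
}
selected = filter_genes(toy, mean_threshold=0.5, top_n=2)
print(selected)
```

With the embodiment's values, the call would be `filter_genes(expr, mean_threshold=0.5, top_n=10000)`.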

S2, constructing a gene co-expression network.

Based on the preprocessed gene expression data, a gene network graph is constructed by calculating correlations between genes (e.g., Pearson correlation coefficients). In the graph, genes serve as nodes and the correlations between genes serve as edge weights.

Specifically, genes are taken as nodes and the Pearson correlation coefficient C is calculated for every pair of genes. To determine the edges, a P value is then computed from the correlation coefficient using a given soft threshold st, according to the formula $P = |C_{ij}|^{st}$, where $C_{ij}$ is the correlation coefficient between genes i and j. If the P value is greater than the hard threshold ht, the edge is added to the gene network graph with the P value as its weight. In a preferred embodiment, the soft threshold is st = 7 and the hard threshold ht is set to 0.01.
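
The edge rule can be sketched in plain Python as follows. The helper names `pearson` and `build_edges`, the dictionary representation of the graph and the toy expression vectors are assumptions made for illustration. The formula $P = |C_{ij}|^{st}$ and the default values st = 7 and ht = 0.01 are taken from the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_edges(expr, st=7, ht=0.01):
    """Return {(gene_i, gene_j): P} for every pair with P = |C_ij| ** st > ht."""
    edges = {}
    for gi, gj in combinations(sorted(expr), 2):
        p = abs(pearson(expr[gi], expr[gj])) ** st
        if p > ht:
            edges[(gi, gj)] = p  # P is stored as the edge weight
    return edges


# Toy data (made-up numbers).
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with g1
    "g3": [1.0, -1.0, -1.0, 1.0],  # zero correlation with g1 and g2
}
edges = build_edges(toy)
print(edges)
```

With st = 7 and ht = 0.01, the condition P > ht is equivalent to |C| > 0.01^(1/7), which is approximately 0.518, so only gene pairs with an absolute correlation above roughly 0.52 receive an edge.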

S3, low-dimensional feature representation.

The graph convolutional neural network GraphSAGE model captures and extracts the node information in the gene network graph, and a sampler is used to sample and aggregate the neighbors of each node. GraphSAGE learns the features of each node together with those of its neighbor nodes, and thereby captures the complex regulatory relationships among genes.

In one embodiment, the GraphSAGE model comprises an input layer, two SAGEConv convolution layers and an output layer. The input layer receives the node feature matrix x. Each SAGEConv convolution layer is followed by network layers that normalize the features, add nonlinearity to the model and prevent overfitting. Finally, the output layer reduces the feature dimension to yield the final graph embedding, which is a low-dimensional representation of the node features. In one embodiment, each of the two convolution layers samples 10 neighbors, only 128 nodes are sampled at a time, the feature dimension output by the two SAGEConv layers is 128, and the output layer reduces the feature dimension to 5.
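
The sample-and-aggregate idea behind a GraphSAGE layer can be illustrated with a simplified, untrained sketch in plain Python. Several things here are assumptions rather than details from the source: the mean aggregator, the update rule ReLU(W_self h_v + W_neigh mean(h_u)), the omission of edge weights, normalization and dropout, and all names and toy values. It is not the SAGEConv implementation referred to above.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Uniformly sample at most k neighbors of a node."""
    nbrs = adj[node]
    return list(nbrs) if len(nbrs) <= k else rng.sample(nbrs, k)


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def sage_layer(adj, feats, w_self, w_neigh, k, rng):
    """One mean-aggregator layer: ReLU(W_self h_v + W_neigh mean(h_u))."""
    out = {}
    for v in adj:
        sampled = sample_neighbors(adj, v, k, rng)
        dim = len(feats[v])
        if sampled:
            # Average the features of the sampled neighbors, dimension by dimension.
            agg = [sum(feats[u][d] for u in sampled) / len(sampled) for d in range(dim)]
        else:
            agg = [0.0] * dim  # isolated node: nothing to aggregate
        z = [a + b for a, b in zip(matvec(w_self, feats[v]), matvec(w_neigh, agg))]
        out[v] = [max(0.0, t) for t in z]  # ReLU nonlinearity
    return out


# Toy graph and features (made-up); identity weights keep the arithmetic visible.
adj = {"g1": ["g2", "g3"], "g2": ["g1"], "g3": ["g1"]}
feats = {"g1": [1.0, 0.0], "g2": [0.0, 2.0], "g3": [0.0, 4.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
emb = sage_layer(adj, feats, identity, identity, k=10, rng=random.Random(0))
print(emb)
```

Because every toy node has fewer than 10 neighbors, no sampling actually takes place and the result is deterministic. In a trained model the weight matrices would be learned, and two such layers would be stacked before the output layer.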

S4, performing cluster analysis by using a Gaussian mixture model.

The features extracted by the graph convolutional network are clustered with a Gaussian mixture model (GMM). A GMM can fit complex data distributions and identify different gene modules. The clustering result divides the genes into different co-expression modules.

In an embodiment, the number of clusters k is set to twice the square root of the number of graph embeddings. The covariance type is set to diagonal, so that the dimensions within each cluster are treated as uncorrelated. The weight concentration prior is set to 1/k, according to the maximum number of clusters, to control the weight distribution among the clusters. Soft clustering is used, so that each data point is assigned a probability of belonging to each cluster.
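
The choice of k and the soft assignment can be sketched in plain Python. The sketch assumes that the "number of graph embeddings" equals the number of embedded genes and that k is rounded to the nearest integer (the source gives only "twice the square root"). It evaluates cluster membership probabilities for a mixture whose parameters are already given and does not fit the model. All names and numbers are hypothetical.

```python
from math import exp, log, pi, sqrt


def n_clusters(n_embeddings):
    """k = 2 * sqrt(n); rounding to an integer is an assumption."""
    return max(1, round(2 * sqrt(n_embeddings)))


def log_diag_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (independent dimensions)."""
    return sum(-0.5 * (log(2 * pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def responsibilities(x, weights, means, variances):
    """Soft clustering: probability that point x belongs to each component."""
    logs = [log(w) + log_diag_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


k = n_clusters(10000)  # 10000 embedded genes give k = 200
probs = responsibilities(
    [0.0, 0.0],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [3.0, 3.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(k, probs)
```

The settings described above resemble the `n_components`, `covariance_type="diag"` and `weight_concentration_prior` parameters of scikit-learn's `BayesianGaussianMixture`, whose `predict_proba` method returns soft assignments. The source does not name the library, so this correspondence is an assumption.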

S5, dividing the gene modules according to the clustering result.

Principal component analysis is performed on each cluster module to obtain its first principal component, and a correlation coefficient, such as the Pearson correlation coefficient, is calculated between the sample labels and the first principal component. Finally, the genes of highly correlated modules are taken as markers of the gene network and used for subsequent downstream analysis.
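
The module scoring step can be sketched in plain Python. The sketch assumes that samples are the observations and the genes of one module are the variables, that the first principal component is obtained by power iteration, and that the absolute value of the correlation is used as the score because the sign of a principal component is arbitrary. The function names, the numeric sample labels and the toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def first_pc_scores(rows, n_iter=200):
    """Per-sample scores on the first principal component.

    rows holds one list per sample with one value per gene of the module.
    """
    n, d = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - col_means[j] for j in range(d)] for r in rows]
    v = [1.0 / sqrt(d)] * d  # starting direction for the power iteration
    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = sqrt(sum(t * t for t in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [t / norm for t in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


# Toy module (made-up numbers): 4 samples x 3 genes, with binary sample labels.
module_expr = [
    [1.0, 2.0, 1.5],
    [1.2, 2.1, 1.4],
    [3.0, 4.2, 3.1],
    [3.1, 4.0, 3.3],
]
labels = [0, 0, 1, 1]
score = abs(pearson(first_pc_scores(module_expr), labels))
print(score)
```

Modules would then be ranked by this score, and the genes of the highest-scoring modules taken as markers.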

The invention also provides a gene co-expression network identification system based on a graph convolutional neural network, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the gene co-expression network identification method based on a graph convolutional neural network described above.

The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. A gene co-expression network identification method based on a graph convolution neural network is characterized by comprising the following steps:

S1, preprocessing data;

S2, constructing a gene network graph based on the gene expression correlation;

S3, extracting features by capturing node information through a graph convolutional neural network GraphSAGE model;

S4, performing cluster analysis by using a Gaussian mixture model;

S5, dividing the gene modules according to the clustering result.

2. The method according to claim 1, wherein the preprocessing in step S1 includes data cleansing, missing value filling, normalization and variance screening.

3. The method for identifying a gene co-expression network according to claim 2, wherein the data are first subjected to mean screening and then to variance screening; preferably, the mean-screening threshold is an FPKM mean of 0.5, and the 10000 genes with the highest variances are selected by the variance screening.

4. The method of claim 1, wherein in step S2, genes are taken as nodes, the Pearson correlation coefficient C is calculated for every pair of genes, and a P value is then computed from the correlation coefficient using a given soft threshold st to determine the edges; if the P value is greater than the hard threshold ht, the edge is added to the gene network graph with the P value as its weight; the P value is calculated as $P = |C_{ij}|^{st}$.

5. The method according to claim 4, wherein the given soft threshold st=7 and the hard threshold ht is set to 0.01.

6. The method according to claim 1, wherein in step S3, the GraphSAGE model comprises an input layer, two SAGEConv convolution layers and an output layer, wherein the input layer receives a node feature matrix x, each SAGEConv convolution layer is followed by network layers that normalize the features, add nonlinearity to the model and prevent overfitting, and finally the output layer reduces the feature dimension to obtain a final graph embedding, the graph embedding being a low-dimensional representation of the node features.

7. The method for identifying the gene co-expression network according to claim 6, wherein each of the two convolution layers samples 10 neighbors, only 128 nodes are sampled at a time, the feature dimension after the two SAGEConv convolution layers is 128, and the output layer reduces the feature dimension to 5.

8. The method according to claim 1, wherein in step S4, the number of clusters k is set to twice the square root of the number of graph embeddings, the covariance type is set to diagonal, the weight concentration prior is set to 1/k according to the maximum number of clusters, and soft clustering is used to assign each data point a probability of belonging to each cluster.

9. The method according to claim 1, wherein in step S5, principal component analysis is performed on each cluster module to obtain a first principal component, the Pearson correlation coefficient between the sample labels and the first principal component is then calculated, and finally the genes of highly correlated modules are taken as markers of the gene network for subsequent downstream analysis.

10. A gene co-expression network identification system based on a graph convolutional neural network, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when loaded into the processor, implements the gene co-expression network identification method based on a graph convolutional neural network according to any one of claims 1-9.