
CN116386729A - scRNA-seq data dimension reduction method based on graph neural network - Google Patents


Info

Publication number
CN116386729A
Authority
CN
China
Prior art keywords
cell
data
neural network
scrna
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211716676.1A
Other languages
Chinese (zh)
Inventor
王树林
孙鸿福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202211716676.1A
Publication of CN116386729A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to data mining in bioinformatics, in particular to the mining of single-cell RNA sequencing (scRNA-seq) data. Specifically, it uses deep learning to reduce the dimensionality of scRNA-seq data and to cluster the result, so that cell populations can be identified effectively. The method comprises collecting and preprocessing scRNA-seq data; constructing a graph neural network model; using the constructed model to reduce the dimensionality of the preprocessed data; and performing cluster analysis on the reduced result. The model constrains the data structure, performs dimensionality reduction through graph neural network modules, and preserves both cell-cell and gene-gene relationships in the reduced representation. Experiments on five real scRNA-seq datasets, evaluated with normalized mutual information and the adjusted Rand index, show that the method performs well.

Description

A Dimensionality Reduction Method for scRNA-seq Data Based on a Graph Neural Network

Technical field

The invention relates to data mining in bioinformatics, in particular to the mining of single-cell RNA sequencing data. Specifically, it reduces the dimensionality of single-cell RNA sequencing data and clusters the result in order to identify cell populations effectively.

Background

The rapid growth of single-cell RNA sequencing (scRNA-seq) technologies in recent years has created unprecedented opportunities for single-cell transcriptional profiling. Traditional bulk RNA sequencing sequences a mixture of millions of cells, so the measured expression of a gene reflects its average over all cells and ignores heterogeneity between cells. Unlike bulk RNA-seq, scRNA-seq first isolates individual cells and then sequences thousands of genes in each cell. Depending on the sequencing protocol, millions of expression values are collected for each gene, which makes it possible to identify new cell types, determine gene regulatory mechanisms, and resolve the cellular dynamics of developmental processes.

scRNA-seq is an ideal method for studying cell-to-cell variation. Conventional dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) have been applied to scRNA-seq data for visualization and downstream analysis, and have significantly improved our understanding of cellular heterogeneity and developmental progression. The recent emergence of massively parallel scRNA-seq (for example, droplet platforms) enables the sequencing of millions of cells in complex biological systems. This offers great potential for dissecting tissues and cellular microenvironments, identifying rare or new cell types, inferring developmental lineages, and elucidating the mechanisms by which cells respond to stimuli. However, data generated by massively parallel scRNA-seq are characterized by high dropout, high noise, and complex structure, which poses a series of challenges for dimensionality reduction. Preserving the complex topology between cells is particularly difficult.

Over the past few years, many dimensionality reduction methods have been developed or adopted for scRNA-seq data analysis. Recently developed competing methods include DCA, scVI, scDeepCluster, PHATE, SAUCIE, scGNN, ZINB-WaVE, and Ivis. Among these, deep learning approaches show the greatest potential. For example, DCA, scDeepCluster, Ivis, and SAUCIE adapt autoencoders to denoise, visualize, and cluster scRNA-seq data. However, these deep learning models embed only the features of individual cells and ignore cell-to-cell relationships, which limits their ability to reveal complex topologies among cells and makes it difficult for them to elucidate developmental trajectories. The recently proposed graph autoencoder is very promising because it preserves long-distance relationships between data points in the latent space.

Studies have also shown that the gene interactions captured in gene regulatory networks or protein-protein interaction (PPI) networks are informative in different biological contexts. Furthermore, previous work has shown that jointly analyzing scRNA-seq data with prior gene interaction information can lead to a more meaningful understanding of the data. NetNMF-sc is a network-regularized non-negative matrix factorization designed for scRNA-seq analysis, which uses prior gene networks to obtain more meaningful low-dimensional representations of genes. Conversely, scRNA-seq data also contain rich information from which gene-gene interactions can be inferred.

Motivated by these observations, we propose scTPGAE, a computational method based on graph neural networks. It uses two graph neural networks to preserve cell-cell relationships and gene-gene relationships simultaneously in the dimensionality reduction result, with the aim of improving downstream analysis.

Summary of the invention

To address the problems of the above methods and the complexity of scRNA-seq data, the present invention proposes a dimensionality reduction method for scRNA-seq data based on graph neural networks. The method addresses the loss of important information and the insufficient feature extraction of existing dimensionality reduction methods, preserves both cell-cell and gene-gene relationships in the reduced representation, and obtains better clustering accuracy. The method comprises the following steps:

1. Data preprocessing

First, assume we have a raw scRNA-seq count matrix C from which genes with no counts in any cell have been filtered out. C is a P × N matrix, where P is the total number of genes, N is the total number of cells, and C_ij is the expression value of gene i in cell j.

We first preprocess the raw scRNA-seq count data by log transformation and z-score normalization. The normalized output X is given by the following formulas:

Figure SMS_1

X = zscore(X′)

where S_j is the size factor of cell j. This preprocessing retains the effect of differences in data scale and converts discrete values into continuous ones, which gives subsequent modeling more flexibility.
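The preprocessing step can be sketched as follows. The formula image for X′ is not reproduced in the text, so this sketch assumes the common form X′_ij = log(1 + C_ij / S_j) and assumes that the size factor S_j is the library size of cell j divided by the median library size. The function name `preprocess_counts` and the choice to z-score each gene across cells are likewise illustrative assumptions, not details confirmed by the description.

```python
import numpy as np

def preprocess_counts(C):
    """Log-transform and z-score a genes x cells count matrix (P x N)."""
    C = np.asarray(C, dtype=float)
    # Drop genes that have no counts in any cell, as the text assumes.
    C = C[C.sum(axis=1) > 0]
    # Assumed size factor: library size of each cell over the median library size.
    # This requires every cell to have at least one count.
    lib = C.sum(axis=0)
    S = lib / np.median(lib)
    # Assumed log transform: X'_ij = log(1 + C_ij / S_j).
    X_prime = np.log1p(C / S)
    # z-score each gene across cells; guard against zero variance.
    mu = X_prime.mean(axis=1, keepdims=True)
    sd = X_prime.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0
    return (X_prime - mu) / sd

# Small synthetic example: 50 genes, 20 cells.
rng = np.random.default_rng(0)
C = rng.poisson(2.0, size=(50, 20))
X = preprocess_counts(C)
```

The returned matrix keeps the genes × cells orientation of C; transpose it if a downstream step expects one row per cell.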

In addition to the gene-cell matrix described above, the graph neural networks require a cell-cell relationship graph and a gene-gene interaction network as input.

The cell-cell relationship graph is built with the K-nearest-neighbor (KNN) algorithm from the scikit-learn Python package. K defaults to 35 and is adjusted according to the dataset in our experiments. The resulting adjacency matrix is binary: 1 denotes connected and 0 denotes not connected.
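A minimal sketch of building such a binary KNN graph with scikit-learn's `kneighbors_graph` is shown below. The helper name `build_cell_graph`, the Euclidean default metric, and the symmetrization step are assumptions made for illustration; the description only states that KNN from scikit-learn is used with K = 35.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_cell_graph(cell_features, k=35):
    """Return a binary cells x cells adjacency matrix from a KNN search.

    cell_features: array with one row per cell.
    """
    A = kneighbors_graph(cell_features, n_neighbors=k,
                         mode="connectivity", include_self=False)
    A = A.toarray()
    # Assumption: treat the graph as undirected by keeping an edge
    # if either cell lists the other among its k nearest neighbors.
    return np.maximum(A, A.T)

# Synthetic example: 100 cells described by 5 features each.
rng = np.random.default_rng(0)
cells = rng.normal(size=(100, 5))
A = build_cell_graph(cells, k=35)
```

After symmetrization a cell can have more than K neighbors; drop that step if a strictly K-regular directed graph is wanted.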

The gene-gene interaction network can be taken from existing resources. We collected seven human gene interaction networks and one mouse gene interaction network to evaluate the performance of scTPGAE. One of the best-known gene interaction networks is the STRING database, a PPI network that collects and integrates protein-protein association information from sources such as the literature and experiments. HumanNet is a human functional gene network that integrates multiple types of omics data through a Bayesian statistical framework. HumanNet includes a hierarchy of human gene networks, namely human-derived PPIs, co-functional links, co-citations, and interologs from other species. Specifically, we use two versions of HumanNet, HumanNet-CF and HumanNet-PI, which consist of a co-functional network and a PPI network, respectively. FunCoup is a genome-wide functional association network that uses redundancy-weighted Bayesian integration to combine 10 different types of functional association data. GeneMANIA creates a combined gene network by weighting multiple functional genomic datasets. In addition, we collected two functional similarity matrices from pgWalk, derived from KEGG pathways and Gene Ontology biological processes, respectively. We converted these two similarity matrices into gene networks by filtering out gene pairs whose similarity value is below a threshold (0.9). These two networks are called pgWalk-kegg and pgWalk-gobp, respectively.
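The conversion of a similarity matrix into a gene network by thresholding can be sketched as below. The function name `similarity_to_edges`, the assumption that the matrix is symmetric, and the decision to keep pairs whose similarity is greater than or equal to the threshold are illustrative choices; the description only states that pairs below 0.9 are filtered out.

```python
import numpy as np

def similarity_to_edges(S, genes, threshold=0.9):
    """Keep gene pairs whose similarity is at least `threshold`.

    S: symmetric genes x genes similarity matrix (assumed).
    genes: list of gene names in the same order as the rows of S.
    """
    S = np.asarray(S, dtype=float)
    # Look only above the diagonal so each pair is reported once
    # and self-similarities are ignored.
    rows, cols = np.where(np.triu(S >= threshold, k=1))
    return [(genes[i], genes[j]) for i, j in zip(rows, cols)]

genes = ["g1", "g2", "g3"]
S = [[1.00, 0.95, 0.20],
     [0.95, 1.00, 0.90],
     [0.20, 0.90, 1.00]]
edges = similarity_to_edges(S, genes)
```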

2. Building the graph neural networks for dimensionality reduction

(1) Graph neural network G1, which preserves cell-cell relationships

A graph autoencoder is an artificial neural network for unsupervised representation learning on graph-structured data. Because it has a low-dimensional bottleneck layer, it can be used as a dimensionality reduction model. Suppose the input is a cell-cell relationship graph with node matrix X and adjacency matrix A. In our joint graph autoencoder, a single encoder E encodes the whole graph, and two decoders, D_X and D_A, decode the nodes and the edges, respectively. In practice, we first encode the input graph as a latent variable h = E(X, A), and then decode h into a reconstructed node matrix X_r = D_X(h) and a reconstructed adjacency matrix A_r = D_A(h). The goal of the learning process is to minimize the reconstruction loss

Figure SMS_2

where the weight is a hyperparameter, set to 0.6 in our experiments.
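Because the formula image for the loss is not reproduced in the text, its exact form is not known here. The sketch below assumes a weighted sum of squared reconstruction errors, with weight λ on the node term and 1 − λ on the edge term; the function name `joint_reconstruction_loss` and the assignment of λ to the node term are assumptions for illustration only.

```python
import numpy as np

def joint_reconstruction_loss(X, X_r, A, A_r, lam=0.6):
    """Weighted node and edge reconstruction error (assumed form).

    loss = lam * ||X - X_r||^2 + (1 - lam) * ||A - A_r||^2
    """
    node_err = np.sum((np.asarray(X) - np.asarray(X_r)) ** 2)
    edge_err = np.sum((np.asarray(A) - np.asarray(A_r)) ** 2)
    return lam * node_err + (1.0 - lam) * edge_err

# Worked example: node error is 4, edge error is 2.
X, X_r = np.zeros((2, 2)), np.ones((2, 2))
A, A_r = np.eye(2), np.zeros((2, 2))
loss = joint_reconstruction_loss(X, X_r, A, A_r, lam=0.6)
```

With these inputs the loss is 0.6 · 4 + 0.4 · 2 = 3.2, which shows how the hyperparameter trades off feature reconstruction against graph reconstruction.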

We implement our model with the Python package Spektral. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we use a graph attention layer in the encoder by default. Other graph neural networks such as GCN, GraphSAGE, and TAGCN can also be used as the encoder in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256, and 512 nodes in its hidden layers.

The edge decoder consists of a fully connected layer followed by the composition of a quadratic (inner-product) operation and an activation:

A_r = D_A(h) = σ(ZZ^T)

where Z = σ(Wh) is the output of a fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
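The edge decoder formula can be illustrated with a few lines of NumPy. This is a sketch of the forward computation only, not the Spektral implementation; it stores the latent variables as a matrix H with one row per node, so the fully connected layer is written as H @ W rather than Wh, and it omits any bias term (an assumption).

```python
import numpy as np

def relu(x):
    # Rectified linear unit: sigma(x) = max(0, x).
    return np.maximum(0.0, x)

def edge_decoder(H, W):
    """Reconstruct an adjacency matrix as A_r = relu(Z Z^T), Z = relu(H W).

    H: nodes x latent_dim matrix of latent variables.
    W: latent_dim x hidden_dim weight matrix of the fully connected layer.
    """
    Z = relu(H @ W)
    return relu(Z @ Z.T)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # 6 nodes, 4 latent dimensions
W = rng.normal(size=(4, 3))
A_r = edge_decoder(H, W)
```

Because Z is non-negative after the first activation, Z Zᵀ is already non-negative and symmetric, so the outer activation leaves it unchanged in this bias-free sketch.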

(2) Graph neural network G2, which preserves gene-gene relationships

When a gene interaction network is applied to a given dataset, only the interaction pairs whose two genes both appear in that dataset are kept; the remaining pairs are discarded. The number of interaction pairs in the gene interaction network may therefore differ from dataset to dataset. To capture both regulatory directions within a gene pair and their corresponding strengths, the gene interaction network is treated as a directed graph. An edge between gene A and gene B in an undirected gene network, such as the STRING PPI network, is therefore treated as a pair of edges (one from A to B and one from B to A).
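The two rules in this paragraph, restricting the network to genes present in the dataset and expanding each undirected edge into two directed edges, can be sketched as follows. The function name `restrict_and_direct` and the edge-list representation are illustrative assumptions.

```python
def restrict_and_direct(edges, dataset_genes):
    """Filter an undirected edge list and expand it into directed edges.

    edges: iterable of (gene_a, gene_b) pairs from an undirected network.
    dataset_genes: genes measured in the scRNA-seq dataset.
    """
    present = set(dataset_genes)
    directed = set()
    for a, b in edges:
        # Keep a pair only if both interacting genes appear in the dataset.
        if a in present and b in present and a != b:
            directed.add((a, b))  # edge from A to B
            directed.add((b, a))  # edge from B to A
    return sorted(directed)

network = [("A", "B"), ("B", "C"), ("A", "D")]
directed_edges = restrict_and_direct(network, ["A", "B", "D"])
```

Here the pair (B, C) is discarded because gene C is not in the dataset, and the two remaining undirected edges become four directed edges.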

This graph neural network is constructed in the same way as the one that preserves cell-cell relationships, except that its input is the gene-gene (PPI) interaction network instead of the cell-cell relationship graph. Interactions between genes are naturally represented as a graph, and a graph neural network is applied to model them. In the graph convolutional layer, each node represents a gene, and an edge between two nodes represents the relationship between the two corresponding genes. The graph representation module is designed as a graph convolutional layer that updates each node by aggregating information from its neighbors.

3. Dimensionality reduction of scRNA-seq data

The constructed graph neural networks are used to reduce the dimensionality of the preprocessed scRNA-seq data.

The gene-cell count matrix and the cell-cell relationship graph are input into graph neural network G1 to obtain the reduced cell features θ1.

The gene-cell count matrix and the gene-gene interaction network are input into graph neural network G2 to obtain the reduced cell features θ2.

The learned cell features are concatenated to form the dimensionality reduction result used in downstream analysis.
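The concatenation step can be sketched as below, assuming that both networks return one feature row per cell so that θ1 and θ2 have the same number of rows. The function name `combine_embeddings` and the embedding sizes in the example are hypothetical.

```python
import numpy as np

def combine_embeddings(theta1, theta2):
    """Concatenate two per-cell embeddings along the feature axis."""
    theta1 = np.asarray(theta1)
    theta2 = np.asarray(theta2)
    if theta1.shape[0] != theta2.shape[0]:
        raise ValueError("both embeddings must have one row per cell")
    return np.concatenate([theta1, theta2], axis=1)

# Hypothetical sizes: 100 cells, 16 features from G1 and 8 from G2.
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(100, 16))
theta2 = rng.normal(size=(100, 8))
embedding = combine_embeddings(theta1, theta2)
```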

4. K-means clustering

The method uses the zero-inflated negative binomial (ZINB) conditional likelihood to reconstruct the decoder output for scRNA-seq data. The ZINB distribution has been shown to describe scRNA-seq data well and is a generally accepted model of gene expression distributions.
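For reference, a ZINB negative log-likelihood can be written in a few lines with SciPy. This is a generic sketch of the distribution, not the patent's training code; the parameter names `mu` (mean), `theta` (dispersion), and `pi` (zero-inflation probability) follow a common convention and are assumptions here.

```python
import numpy as np
from scipy.stats import nbinom

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of counts x under a ZINB model.

    P(x) = pi * [x == 0] + (1 - pi) * NB(x; mean=mu, dispersion=theta)
    """
    x = np.asarray(x)
    # SciPy parameterizes NB by (n, p); mean mu corresponds to
    # n = theta and p = theta / (theta + mu).
    nb_logpmf = nbinom.logpmf(x, theta, theta / (theta + mu))
    # Zeros can come from the inflation component or from the NB itself.
    log_zero = np.log(pi + (1.0 - pi) * np.exp(nb_logpmf))
    log_nonzero = np.log1p(-pi) + nb_logpmf
    return -np.sum(np.where(x == 0, log_zero, log_nonzero))

# A single zero count with mu = 2, theta = 1, pi = 0.5.
nll_zero = zinb_nll([0], mu=2.0, theta=1.0, pi=0.5)
```

In this example the NB probability of a zero is 1/3, so the ZINB probability is 0.5 + 0.5 · (1/3) = 2/3 and the negative log-likelihood is −log(2/3).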

To evaluate the effectiveness of the method, we apply the k-means clustering algorithm to the dimensionality-reduced data and assess the result with normalized mutual information (NMI). Let X be the predicted clustering and Y the cell types given by the true labels. The NMI score is computed as follows:

Figure SMS_3

where MI is the mutual information between X and Y, and H is the Shannon entropy.
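The quantities named above can be computed directly from two label vectors, as in the following sketch. The formula image is not reproduced in the text, so the normalization is an assumption: this version divides MI by the arithmetic mean of H(X) and H(Y), which is one of several conventions in use. The function names are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in nats) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y):
    """Mutual information (in nats) between two label vectors."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(x == a) * np.mean(y == b)))
    return mi

def nmi(x, y):
    # Assumed normalization: arithmetic mean of the two entropies.
    denom = (entropy(x) + entropy(y)) / 2.0
    return mutual_information(x, y) / denom if denom > 0 else 1.0

perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, relabeled
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # unrelated partitions
```

NMI is invariant to how clusters are numbered, so the relabeled partition scores 1 while the unrelated partition scores 0.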

As can be seen from the above, the graph-neural-network-based dimensionality reduction method for scRNA-seq data provided by one or more embodiments of this specification preserves both cell-cell and gene-gene relationships in the reduced representation. The model constrains the data structure and performs dimensionality reduction through two graph neural network modules. Experiments on five real scRNA-seq datasets show that the method provides more accurate low-dimensional representations of scRNA-seq data.

Detailed description

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to experiments. The specific embodiments described here serve only to explain the invention and do not limit it.

1. Dataset overview

To evaluate the performance of scTPGAE, we focus on relatively large datasets and select five real scRNA-seq datasets with known cell types. The table below summarizes their basic information, and each dataset is described afterwards.

Figure SMS_4

(i) The 10X PBMC dataset, provided by the 10X scRNA-seq platform, with data collected from one healthy human.
(ii) The mouse embryonic stem cell dataset, which describes the transcriptome of heterogeneous mouse embryonic stem cell differentiation after withdrawal of leukemia inhibitory factor (LIF).
(iii) The mouse bladder cell dataset, from the Mouse Cell Atlas project (GSE108097). From the raw count matrix, we selected approximately 2,700 cells from bladder tissue.
(iv) The worm neuron cell dataset, profiled by single-cell combinatorial indexing RNA sequencing of Caenorhabditis elegans at the L2 larval stage.
(v) The Zeisel dataset, which contains 3,005 cells from mouse cortex and hippocampus (GSE60361).

2. Experimental environment and parameter settings

The hardware environment is a single PC with an 11th Gen Intel(R) Core(TM) i5-1135G7 CPU at 2.42 GHz, 16 GB of RAM, and a 64-bit operating system. The software runs on Windows 10 and is implemented in Python in the PyCharm environment, with Python 3.5.0 and TensorFlow 1.4.0.

We implement our model with the Python package Spektral. Many types of graph neural network can serve as the encoder or decoder. To extract the features of a node with the help of its neighbors, we use a graph attention layer in the encoder by default. Other graph neural networks such as GCN, GraphSAGE, and TAGCN can also be used as the encoder in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256, and 512 nodes in its hidden layers.

The edge decoder consists of a fully connected layer followed by the composition of a quadratic (inner-product) operation and an activation:

A_r = D_A(h) = σ(ZZ^T)

where Z = σ(Wh) is the output of a fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.

In addition to the gene-cell matrix described above, the graph neural networks require a cell-cell relationship graph and a gene-gene interaction network as input.

The cell-cell relationship graph is built with the K-nearest-neighbor (KNN) algorithm from the scikit-learn Python package. K defaults to 35 and is adjusted according to the dataset in our experiments. The resulting adjacency matrix is binary: 1 denotes connected and 0 denotes not connected.

The gene-gene interaction network can be taken from existing resources. We collected seven human gene interaction networks and one mouse gene interaction network to evaluate the performance of scTPGAE.

3. Evaluation metrics

To make the results of different methods easy to compare, we use K-means for cluster analysis and set the parameter K to the true number of clusters in each dataset. In our experiments, the scTPGAE model is evaluated with two metrics, normalized mutual information (NMI) and the adjusted Rand index (ARI), both of which are widely used to assess model performance in unsupervised learning.
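This evaluation protocol can be sketched with scikit-learn as follows. The helper name `evaluate_embedding`, the `n_init` and `random_state` settings, and the synthetic data are assumptions for illustration; `normalized_mutual_info_score` is used with its default normalization, which may differ from the one in the patent's formula.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_embedding(embedding, true_labels, seed=0):
    """Cluster an embedding with K-means and score it with NMI and ARI.

    K is set to the number of distinct true labels, as described in the text.
    """
    k = len(np.unique(true_labels))
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
    nmi = normalized_mutual_info_score(true_labels, pred)
    ari = adjusted_rand_score(true_labels, pred)
    return nmi, ari

# Synthetic check: two well-separated groups of 50 points each.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(50.0, 1.0, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)
nmi_score, ari_score = evaluate_embedding(points, labels)
```

Both scores are invariant to cluster numbering, so the arbitrary label order that K-means returns does not affect the comparison with the true cell types.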

4. Analysis of experimental results

The method was tested on five real datasets. The resulting normalized mutual information and adjusted Rand index values are shown in the tables below.

Normalized mutual information (NMI)

Figure SMS_5

Adjusted Rand index (ARI)

Figure SMS_6

These results indicate that scTPGAE, which is based on graph neural networks, is a promising new method. It performs well on all five real datasets, which suggests that it provides a more accurate low-dimensional representation of scRNA-seq data.

In summary, the proposed scTPGAE method performs dimensionality reduction and cluster analysis of single-cell RNA-seq data and has the following advantages. First, scTPGAE matches the latent space distribution to a chosen prior. Second, scTPGAE preserves cell-cell relationships in the dimensionality reduction result. Third, scTPGAE preserves gene-gene relationships at the same time as cell-cell relationships. Finally, the method takes advantage of the parallel and scalable properties of deep neural network frameworks. The model constrains the data structure and performs dimensionality reduction through graph neural network modules. Experiments on five real scRNA-seq datasets, evaluated with normalized mutual information and the adjusted Rand index, show that the method performs well.

Description of drawings

Figure 1: Flow chart of the scRNA-seq data dimensionality reduction method based on graph neural networks;

Figure 2: Experimental results measured by normalized mutual information (NMI);

Figure 3: Experimental results measured by the adjusted Rand index (ARI).

Claims (5)

1. A dimensionality reduction method for scRNA-seq data based on a graph neural network, characterized by comprising the following steps:
(1) Preprocessing data: collecting scRNA-seq datasets from different species, of different types, and with different cell numbers; preprocessing the collected raw scRNA-seq data by log transformation and z-score normalization, and reconstructing the input data using the zero-inflated negative binomial distribution to obtain denoised data;
(2) Constructing a graph neural network for dimensionality reduction, which is an autoencoder framework consisting of a deep encoder, an intermediate hidden layer, and a deep decoder, such that the topology between cells and the topology between genes are both preserved in the dimensionality reduction result;
(3) Reducing the dimensionality of the preprocessed scRNA-seq data with the constructed graph neural network: learning a hidden-layer feature vector with the intermediate hidden layer of the autoencoder, constraining the prior distribution of the hidden-layer feature vector, and matching the hidden-layer feature vector to the selected prior distribution; and concatenating the hidden-layer feature vectors learned by the two graph neural networks for subsequent downstream analysis;
(4) Clustering the dimensionality-reduced data with the k-means clustering algorithm and computing the normalized mutual information score and the adjusted Rand index.
2. The scRNA-seq data dimension reduction method based on a graph neural network according to claim 1, wherein the data are collected and the collected single-cell RNA sequencing data are preprocessed as follows:
we collected five scRNA-seq datasets from different species, of different types, and with different cell numbers, and then preprocessed them by logarithmic transformation and z-score normalization.
Specifically, we performed data preprocessing on the following five datasets:
(1) The 10X PBMC dataset, provided by the 10X scRNA-seq platform, with data collected from a healthy human;
(2) The mouse embryonic stem cell dataset, which describes the transcriptomes of mouse embryonic stem cells undergoing heterogeneous differentiation following withdrawal of leukemia inhibitory factor (LIF);
(3) The mouse bladder cell dataset, from the Mouse Cell Atlas project (GSE108097); from the original count matrix, we selected about 2700 cells from bladder tissue;
(4) The worm neuron cell dataset, profiled by single-cell combinatorial indexing RNA sequencing of Caenorhabditis elegans at the L2 larval stage;
(5) The Zeisel dataset, containing 3005 cells from the mouse cortex and hippocampus (GSE60361).
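The logarithmic transformation and z-score normalization named in this claim can be sketched as follows. This is a minimal illustration, not the applicants' code: the use of log(1 + x), the per-gene (column-wise) z-score, the `eps` guard, and the function name `preprocess_counts` are all assumptions, since the claim does not state the pseudocount or the normalization axis.

```python
import numpy as np

def preprocess_counts(counts, eps=1e-8):
    """Log-transform a cells x genes count matrix, then z-score each gene.

    Assumptions (not stated in the claim): log(1 + x) is used as the
    logarithmic transformation, and z-scores are computed per gene.
    """
    counts = np.asarray(counts, dtype=float)
    logged = np.log1p(counts)              # log(1 + x) keeps zero counts at zero
    mean = logged.mean(axis=0)             # per-gene mean across cells
    std = logged.std(axis=0)               # per-gene standard deviation
    return (logged - mean) / (std + eps)   # eps avoids division by zero

# Hypothetical toy data: 6 cells x 4 genes of Poisson counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(6, 4))
normalized = preprocess_counts(toy_counts)
```

After this step every gene column has approximately zero mean, which puts genes with very different expression levels on a comparable scale before they are passed to the encoder.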
3. The scRNA-seq data dimension reduction method based on a graph neural network according to claim 1, wherein the constructed graph neural network is an autoencoder framework composed of a deep encoder, an intermediate hidden layer and a deep decoder, and specifically comprises:
(1) Graph neural network G1, which retains the cell-cell relationship
A graph autoencoder is an artificial neural network for unsupervised representation learning on graph-structured data. It has a low-dimensional bottleneck layer and can therefore be used as a dimension reduction model. Assume that the input is a cell-cell relationship graph with node matrix X and adjacency matrix A. Our joint graph autoencoder has one encoder E for the whole graph and two decoders, D_X for the nodes and D_A for the edges. In practice, we first encode the input graph as the latent variable h = E(X, A), and then decode h into the reconstructed node matrix X_r = D_X(h) and the reconstructed adjacency matrix A_r = D_A(h). The goal of the learning process is to minimize the reconstruction loss
[Equation image FDA0004014338180000021, not reproduced in the extracted text]
where the weight is a hyperparameter; in our experiments it is set to 0.6.
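Because the loss equation survives only as an image reference, the sketch below shows one plausible form of a weighted joint reconstruction loss and should not be read as the claimed formula. It assumes a convex combination of the mean squared node error and the mean squared edge error, with the weight 0.6 applied to the node term; the function name and both of these choices are assumptions.

```python
import numpy as np

def joint_reconstruction_loss(X, X_r, A, A_r, weight=0.6):
    """Weighted sum of node and edge reconstruction errors.

    Assumption: the loss is weight * node_error + (1 - weight) * edge_error,
    with squared error for both terms. The patent text only says that a
    weight hyperparameter is set to 0.6.
    """
    node_loss = np.mean((np.asarray(X) - np.asarray(X_r)) ** 2)
    edge_loss = np.mean((np.asarray(A) - np.asarray(A_r)) ** 2)
    return weight * node_loss + (1.0 - weight) * edge_loss

# Toy check: nodes reconstructed with unit error, edges reconstructed exactly.
X = np.zeros((2, 2))
X_r = np.ones((2, 2))
A = np.eye(2)
loss = joint_reconstruction_loss(X, X_r, A, A.copy())
```

A single weight trades off how faithfully the embedding reproduces expression values against how faithfully it reproduces the graph topology.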
We use the Python package Spektral to implement our model. Many types of graph neural network can be used as the encoder or decoder. To extract the features of a node with the help of its neighbors, we apply the graph attention layer as the default in the encoder. Other graph neural networks such as GCN, GraphSAGE and TAGCN can also be used as the encoder in scTPGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256 and 512 nodes in its hidden layers.
The edge decoder consists of one fully connected layer followed by the composition of a quadratic form and an activation:
A_r = D_A(h) = σ(Z Z^T)
where Z = σ(Wh) is the output of the fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit (ReLU).
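The edge decoder formula above can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions rather than the Spektral implementation: the latent matrix `h` is stored with one row per cell, so the fully connected layer is written as `h @ W` (the row-wise equivalent of Wh), and the shapes and the function name `edge_decoder` are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, sigma(x) = max(0, x)."""
    return np.maximum(0.0, x)

def edge_decoder(h, W):
    """Reconstruct an adjacency matrix as sigma(Z Z^T) with Z = sigma(h W).

    h: (n_cells, latent_dim) latent variables, one row per cell (assumed layout).
    W: (latent_dim, out_dim) weights of the single fully connected layer.
    """
    Z = relu(h @ W)        # fully connected layer followed by ReLU
    return relu(Z @ Z.T)   # quadratic form followed by ReLU

# Hypothetical toy input: 5 cells, 3 latent dimensions, 4 output units.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 4))
A_r = edge_decoder(h, W)
```

Because Z Z^T is symmetric and Z is non-negative, the reconstructed adjacency matrix is symmetric with non-negative entries, which suits an undirected cell-cell graph.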
(2) Graph neural network G2, which retains the gene-gene relationship
When a gene interaction network is applied to a dataset, only those interaction pairs in which both interacting genes occur in the dataset are retained, and the remaining pairs are discarded. Consequently, the number of interaction pairs in the gene interaction network may differ from one dataset to another. To capture both regulatory directions and their corresponding strengths for a pair of genes, the gene interaction network is treated as a directed graph: an edge between genes A and B in an undirected gene network, e.g., the STRING PPI network, is treated as a pair of edges (i.e., the edge from A to B and the edge from B to A).
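The filtering and edge-doubling rule described above can be sketched in plain Python. The function name `build_directed_edges`, the tuple representation of edges, and the placeholder gene symbol `GENE_X` are illustrative assumptions; the patent does not specify a data structure.

```python
def build_directed_edges(undirected_pairs, dataset_genes):
    """Keep pairs whose genes both occur in the dataset, as two directed edges.

    undirected_pairs: iterable of (gene_a, gene_b) from an undirected network.
    dataset_genes: iterable of gene names present in the expression dataset.
    """
    present = set(dataset_genes)
    edges = set()
    for a, b in undirected_pairs:
        if a in present and b in present:   # discard pairs with a missing gene
            edges.add((a, b))               # edge from A to B
            edges.add((b, a))               # edge from B to A
    return sorted(edges)

# Hypothetical example: GENE_X is absent from the dataset, so its pair is dropped.
pairs = [("TP53", "MDM2"), ("TP53", "GENE_X")]
genes = ["TP53", "MDM2", "ACTB"]
directed = build_directed_edges(pairs, genes)
```

Here `directed` contains only the two directions of the TP53-MDM2 interaction, which shows why the number of retained pairs depends on the genes measured in each dataset.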
The graph neural network is constructed in the same way as the one that retains the cell-cell relationship, except that its input is changed from the cell-cell relationship graph to the PPI interaction network describing the gene-gene relationship. The interaction relationships between genes are naturally represented as a graph, and a graph neural network is applied to model them. In the graph convolution layer, each node represents one gene, and an edge between two nodes represents the relationship between the two corresponding genes. The graph representation module is designed as a graph convolution layer that updates each node by aggregating the information of its neighboring nodes.
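The neighbor aggregation performed by a graph convolution layer can be sketched as below. The text does not state the normalization, so this example assumes mean aggregation over each node and its neighbors (self-loops added, rows normalized by degree) followed by a linear map and ReLU; the function name and this particular propagation rule are assumptions, and other normalizations are common.

```python
import numpy as np

def graph_conv_layer(H, A, W):
    """One graph convolution step: average each node with its neighbors.

    H: (n_genes, n_features) node features, one row per gene.
    A: (n_genes, n_genes) adjacency matrix, A[i, j] = 1 for an edge i -> j.
    W: (n_features, n_out) trainable weights (fixed here for illustration).
    """
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)   # at least 1 because of self-loops
    aggregated = (A_hat / deg) @ H           # mean over the node and its neighbors
    return np.maximum(0.0, aggregated @ W)   # linear map followed by ReLU

# Two genes connected in both directions, with one-hot features.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.eye(2)
out = graph_conv_layer(H, A, np.eye(2))
```

With this toy graph each gene ends up with the average of its own features and its neighbor's, so both rows of `out` become identical.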
4. The scRNA-seq data dimension reduction method based on a graph neural network according to claim 1, wherein reducing the dimension of the preprocessed scRNA-seq data using the constructed graph neural network specifically comprises:
performing dimension reduction on the preprocessed scRNA-seq data using the constructed graph neural networks;
inputting the gene-cell count matrix and the cell-cell relationship graph into the graph neural network G1 to obtain the dimension-reduced cell features θ1;
inputting the gene-cell count matrix and the gene-gene interaction network into the graph neural network G2 to obtain the dimension-reduced cell features θ2;
concatenating the learned cell features to form the dimension reduction result used for subsequent downstream analysis.
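The final step of this claim, joining the two sets of learned cell features, can be sketched as follows. It assumes that both embeddings are stored with one row per cell and are joined along the feature axis; the function name `concatenate_embeddings` and the embedding sizes are hypothetical.

```python
import numpy as np

def concatenate_embeddings(theta1, theta2):
    """Join the G1 and G2 cell embeddings along the feature axis.

    Assumption: both arrays have shape (n_cells, n_features_i), with rows
    in the same cell order.
    """
    theta1 = np.asarray(theta1)
    theta2 = np.asarray(theta2)
    if theta1.shape[0] != theta2.shape[0]:
        raise ValueError("both embeddings must have one row per cell")
    return np.concatenate([theta1, theta2], axis=1)

# Hypothetical embeddings for 5 cells: 3 features from G1, 2 from G2.
theta1 = np.zeros((5, 3))
theta2 = np.ones((5, 2))
combined = concatenate_embeddings(theta1, theta2)
```

The combined matrix keeps one row per cell, so it can be passed directly to a clustering algorithm.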
5. The scRNA-seq data dimension reduction method based on a graph neural network according to claim 1, wherein the k-means clustering algorithm is applied to cluster the dimension-reduced data, specifically comprising:
The method uses the ZINB conditional likelihood to reconstruct the scRNA-seq data at the decoder output; the ZINB distribution has been shown to describe scRNA-seq data well and is a widely accepted model of gene expression distributions.
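A ZINB likelihood of the kind referred to above can be written down explicitly. The sketch below uses the standard parameterization with mean `mu`, dispersion `theta` and zero-inflation probability `pi`; these parameter names, the `eps` stabilizer, and the function names are assumptions, and the patent does not say how the decoder produces the three parameters.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)   # element-wise log-gamma without SciPy

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu, dispersion theta)."""
    return (_lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
            + theta * np.log(theta / (theta + mu))
            + x * np.log(mu / (theta + mu)))

def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Mean negative log-likelihood under a zero-inflated negative binomial.

    P(x) = pi * [x == 0] + (1 - pi) * NB(x; mu, theta)
    """
    x = np.asarray(x, dtype=float)
    nb = np.exp(nb_log_pmf(x, mu, theta))
    prob = np.where(x == 0, pi + (1.0 - pi) * nb, (1.0 - pi) * nb)
    return -np.mean(np.log(prob + eps))

# Zero counts are better explained when some zero inflation is allowed.
zeros = np.array([0, 0])
nll_inflated = zinb_nll(zeros, mu=2.0, theta=1.5, pi=0.5)
nll_plain = zinb_nll(zeros, mu=2.0, theta=1.5, pi=0.0)
```

The extra mass at zero is what lets the model account for dropout events, where an expressed gene is recorded with a count of zero.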
To evaluate the effectiveness of the method, the k-means clustering algorithm is applied to cluster the dimension-reduced data, with normalized mutual information (NMI) and the adjusted Rand index (ARI) as evaluation metrics. Experiments on five real scRNA-seq datasets indicate that the method can provide a more accurate low-dimensional representation of scRNA-seq data.
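The evaluation procedure described above can be sketched with scikit-learn. The helper name `evaluate_embedding`, the choice of `n_init=10`, and the synthetic two-cluster data are assumptions made for illustration; they are not the datasets or settings used in the patent.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_embedding(embedding, true_labels, n_clusters, seed=0):
    """Cluster a low-dimensional embedding with k-means and score it.

    Returns (NMI, ARI) computed against the known cell-type labels.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    predicted = km.fit_predict(embedding)
    nmi = normalized_mutual_info_score(true_labels, predicted)
    ari = adjusted_rand_score(true_labels, predicted)
    return nmi, ari

# Hypothetical toy embedding: two well-separated groups of 20 cells each.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
nmi, ari = evaluate_embedding(toy, labels, n_clusters=2)
```

Both scores are invariant to how the cluster labels are numbered, so they compare the grouping itself rather than the arbitrary cluster identifiers that k-means assigns.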
CN202211716676.1A 2022-12-23 2022-12-23 scRNA-seq data dimension reduction method based on graph neural network Pending CN116386729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211716676.1A CN116386729A (en) 2022-12-23 2022-12-23 scRNA-seq data dimension reduction method based on graph neural network

Publications (1)

Publication Number Publication Date
CN116386729A true CN116386729A (en) 2023-07-04

Family

ID=86975628

Country Status (1)

Country Link
CN (1) CN116386729A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025007301A1 (en) * 2023-07-05 2025-01-09 深圳理工大学(筹) Graph neural network construction method and apparatus, and electronic device and storage medium
CN116665786A (en) * 2023-07-21 2023-08-29 曲阜师范大学 A Hierarchical Embedding Clustering Method for RNA Based on Graph Convolutional Neural Networks
CN116825204A (en) * 2023-08-30 2023-09-29 鲁东大学 Single-cell RNA sequence gene regulation inference method based on deep learning
CN116825204B (en) * 2023-08-30 2023-11-07 鲁东大学 A method for inferring gene regulation from single-cell RNA sequences based on deep learning
CN117854597A (en) * 2024-01-15 2024-04-09 杭州电子科技大学 Track prediction method based on contrast learning feature dimension reduction
CN118335192A (en) * 2024-06-13 2024-07-12 杭州电子科技大学 Single-cell sequencing data clustering method based on self-attention network and contrast learning
CN118969078A (en) * 2024-07-09 2024-11-15 上海交通大学 A spatial omics tumor evolution prediction method and system based on graph neural network
CN118645154B (en) * 2024-08-12 2024-11-08 中国医学科学院基础医学研究所 Single-cell Hi-C map prediction method based on single-cell RNA expression data
CN118645154A (en) * 2024-08-12 2024-09-13 中国医学科学院基础医学研究所 A single-cell Hi-C profile prediction method based on single-cell RNA expression data
CN119132389A (en) * 2024-08-14 2024-12-13 东北林业大学 A method for generating single-cell sequencing data
CN119252341A (en) * 2024-09-10 2025-01-03 桂林电子科技大学 A PCA dimensionality reduction method for scRNA-seq sequencing data with added mask
CN119323992A (en) * 2024-09-20 2025-01-17 东北林业大学 Method for identifying cell communication based on multiple sets of chemical data
CN119323992B (en) * 2024-09-20 2025-07-22 东北林业大学 Method for identifying cell communication based on multiple sets of chemical data
CN119400249A (en) * 2024-10-12 2025-02-07 哈尔滨工程大学 A scRNA-seq data feature learning method based on graph autoencoder
CN119601090A (en) * 2024-11-19 2025-03-11 广东药科大学 A gene co-expression network identification method and system based on graph convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination