CN114121158A

CN114121158A - Deep network self-adaption based scRNA-seq cell type identification method

Info

Publication number: CN114121158A
Application number: CN202111471768.3A
Authority: CN
Inventors: 王树林; 刘孟林
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2022-03-01

Abstract

The present invention relates to data mining in bioinformatics, in particular to the mining of scRNA-seq data, and specifically to a scRNA-seq cell type identification method based on deep network adaptation. The method comprises: processing scRNA-seq data; constructing a neural network and training it on scRNA-seq data; adding an adaptation layer to optimize the neural network architecture and overcome the differences between data sets from different batches; and accurately identifying the cell types in scRNA-seq data sets whose type information is unknown. The method can be used to identify the cell types of an unknown scRNA-seq data set, and can effectively overcome the technical differences and batch effects between a data set with known type information and a data set with unknown type information.

Description

Deep network self-adaption based scRNA-seq cell type identification method

Technical Field

The invention relates to data mining in bioinformatics, in particular to the mining of scRNA-seq data, and specifically to a scRNA-seq cell type identification method based on deep network adaptation.

Background

Cells are considered to be the basic structural and functional units of an organism. Human cells contain about 20,000 genes, but each cell has its own specific gene expression pattern in which only some of the genes are expressed. This results in cell-specific protein components and biological functions. scRNA-seq takes the single cell as its unit and, by amplifying the whole genome or transcriptome before high-throughput sequencing, can reveal the gene structure and gene expression state of individual cells and reflect the heterogeneity among cells. scRNA-seq technology has progressed rapidly over the last decade: the scale of sequencing data has grown from tens of cells to thousands or even millions of cells, and a number of new sequencing platforms have emerged, such as 10x Genomics Chromium, inDrop and Drop-seq. Cell type identification plays an important role in the analysis of scRNA-seq data; well-annotated scRNA-seq data enable biologists to conduct further downstream analyses and improve our understanding of the cellular mechanisms of disease.

Current bioinformatics methods for identifying cell types in scRNA-seq data fall mainly into three categories. The first category clusters the cells, then finds specific marker genes for each cluster through differential expression analysis, and finally annotates the cells according to the biological functions of those genes. However, the generalization performance of such methods is generally poor. Furthermore, as the scale of sequencing data increases, annotating cells by looking for marker genes becomes increasingly burdensome and time-consuming. The second category uses the information of a well-annotated reference data set to assist cell type identification in new data. Representative methods project the cells of the target data set into a space defined by highly informative genes selected from a well-annotated source data set, and then assign cell identities to the cells of the target data set based on their correlation with the average cell-type-specific gene expression in the source data. However, such methods can only use the cell type information in the reference data and ignore useful information in the target data. The third category relieves the burden of type identification in large-scale scRNA-seq data with deep neural networks: the sequencing data are embedded into a low-dimensional space by a nonlinear autoencoder for subsequent clustering and classification. These methods likewise do not take into account the performance degradation that technical variation and batch effects can cause; in particular, when the target and reference data come from different sequencing platforms, the accuracy of cell classification can drop greatly.

In summary, existing methods do not fully consider the differences between data sets from different sequencing platforms, tissues and species, and rarely make full use of both the gene expression information and the data distribution information of the well-annotated reference data set and the unknown data set. How to design a robust method that accurately identifies unknown scRNA-seq cell types therefore remains a challenge.

Disclosure of Invention

In view of the problems of the existing methods and the importance of accurately identifying scRNA-seq cell types, the invention provides a scRNA-seq cell type identification method based on deep network adaptation. The method uses deep network adaptation to extract gene expression information and to align the data distributions of a well-annotated reference data set and an unknown target data set, and can identify the cell types of scRNA-seq data sets from different batches. The method comprises the following steps:

1. Data collection stage

The method collects data sets covering several situations from multiple data platforms. The first category is a universal reference data set generated by two sequencing protocols, 10x and CelSeq2. The second category is a human pancreas tissue data set generated by five sequencing protocols: CelSeq, CelSeq2, SmartSeq2, Fluidigm C1 and inDrop. The third category is a data set of different tissues within the same species: the mouse aging cell atlas (Tabula Muris Senis) data set downloaded from Figshare, which contains expression information for 23341 genes from 96307 cells across 22 tissues. Together, these data sets can be used to evaluate the accuracy of the method in identifying cell types across different tissues and species.

2. Data preprocessing stage

The different scRNA-seq data sets are randomly divided into a source domain and a target domain; the type information of the source domain is known and that of the target domain is unknown. The object to be processed is the gene expression matrix of the scRNA-seq data, in which the rows are named by cell and the columns by gene; an additional column holds the cell type information. Data preprocessing comprises three steps: quality control, data standardization and cell type conversion. Quality control checks the original data set for outliers and removes them according to a set threshold. Data standardization filters out low-quality cells with fewer than 5000 reads and 500 genes, as well as genes expressed in fewer than 10 cells; each cell is then normalized to 10000 read counts using SCANPY, and finally the data set is log-transformed and normalized. Cell type conversion converts the cell type information of the data set into numeric codes for subsequent cell classification.
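The filtering and normalization steps described above can be sketched in plain Python on a toy count matrix. This is only an illustration under stated assumptions: the function name `preprocess_counts` is hypothetical, a cell is assumed to be dropped when it fails either the read threshold or the gene threshold, and the final scaling step is omitted. The method itself performs these steps with SCANPY.

```python
import math

def preprocess_counts(counts, min_reads=5000, min_genes=500,
                      min_cells=10, target_sum=10000):
    """counts: list of rows (cells), each a list of raw gene counts."""
    # Keep cells with enough total reads and enough detected genes
    # (assumption: failing either threshold removes the cell).
    cells = [row for row in counts
             if sum(row) >= min_reads
             and sum(1 for c in row if c > 0) >= min_genes]
    if not cells:
        return [], []
    n_genes = len(cells[0])
    # Keep genes expressed in at least min_cells of the remaining cells.
    kept_genes = [g for g in range(n_genes)
                  if sum(1 for row in cells if row[g] > 0) >= min_cells]
    normalized = []
    for row in cells:
        sub = [row[g] for g in kept_genes]
        total = sum(sub)
        # Scale each cell to target_sum counts, then apply log(1 + x).
        normalized.append([math.log1p(c * target_sum / total) if total else 0.0
                           for c in sub])
    return normalized, kept_genes
```

The defaults mirror the thresholds stated in the text (5000 reads, 500 genes, 10 cells, 10000 counts per cell); much smaller values are convenient when trying the function on a toy matrix.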

3. Neural network construction stage

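The neural network used in the method consists of an input layer and two fully connected layers. The number of neurons in the input layer equals the number of genes remaining after data preprocessing; the first fully connected layer uses 1000 neurons and the second uses 100 neurons. The activity of the neurons in the fully connected layers is normalized by Layer Normalization, which in its standard form is defined as:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + b$$

where E[x] and Var[x] are the mean and variance over the neurons of the layer, ε is a small constant for numerical stability, and γ and b are learnable gain and bias parameters.

The nonlinear activation function in the fully connected layers is SELU, defined as:

$$\mathrm{SELU}(x) = \mathrm{scale} \cdot \left(\max(0, x) + \min\left(0, \alpha \cdot (e^{x} - 1)\right)\right)$$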
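As a minimal sketch of the two operations named above, the following plain-Python functions apply Layer Normalization and SELU to a single vector of activations. The gain of 1, bias of 0 and `eps` value in `layer_norm` are assumptions made for illustration; the `ALPHA` and `SCALE` constants are the standard SELU values, which the text does not state explicitly.

```python
import math

# Standard SELU constants (not given in the text).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def layer_norm(x, eps=1e-5):
    """Normalize one cell's activations to zero mean and unit variance
    (gain 1 and bias 0 are assumed)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def selu(x):
    # Same value as scale * (max(0, x) + min(0, alpha * (exp(x) - 1)));
    # branching on the sign avoids overflow in exp for large x.
    if x > 0:
        return SCALE * x
    return SCALE * ALPHA * (math.exp(x) - 1.0)

activations = [selu(v) for v in layer_norm([1.0, 2.0, 3.0, 4.0])]
```

In the network described here these operations would act on the 1000- and 100-dimensional outputs of the fully connected layers rather than on a four-element list.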

In the pre-training stage, a mirror image of the neural network is used as a decoder; together they form an autoencoder that pre-trains the target domain network, with the mean squared error (MSE) as the reconstruction loss. In the formal training stage, the source domain and the target domain both adopt the neural network as their basic network structure. The source domain network additionally contains a classification layer whose number of neurons equals the number of cell types, and cross-entropy is used as the classification loss of the source domain network, defined as:

$$H(y, y') = -\sum_{j} y[j] \log y'[j]$$

where y represents the true type label of the cell: y[j] is 1 if the cell belongs to the j-th cell type, and all other entries of y are 0. y' represents the output type label, and y'[j] is the posterior probability that the cell is of the j-th cell type.
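The cross-entropy loss for a single cell can be illustrated as follows. The sketch assumes that the posterior probabilities y' are obtained by applying a softmax to the outputs of the classification layer, which the text does not state; the function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_type):
    """-sum_j y[j] * log(y'[j]) with a one-hot y, which reduces to
    -log(y'[true_type])."""
    probs = softmax(logits)
    return -math.log(probs[true_type])
```

Because y is one-hot, only the posterior probability of the true cell type contributes to the loss, so the loss is small when the network assigns a high probability to the correct type.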

4. Neural network optimization stage

A domain adaptation layer is added after the second fully connected layer of both the source domain and the target domain networks. The adaptation layer brings the data distributions of the source domain and the target domain closer together and thereby reduces the influence of batch effects on the final classification result. The adaptation metric is the multi-kernel MMD (MK-MMD), whose square is defined as:

$$d_k^2(p, q) \triangleq \left\| \mathrm{E}_p[\phi(X^s)] - \mathrm{E}_q[\phi(X^t)] \right\|_{\mathcal{H}_k}^2$$

where p and q represent the probability distributions of the source and target domains, respectively, $\mathcal{H}_k$ represents the reproducing kernel Hilbert space (RKHS) with characteristic kernel k, and $d_k(p, q)$ represents the RKHS distance between the mean embeddings of p and q. The important property is that if

$$d_k^2(p, q) = 0$$

then p = q. The characteristic kernel associated with the feature map φ is defined as:

$$k(X^s, X^t) = \langle \phi(X^s), \phi(X^t) \rangle$$

The multi-kernel k is a convex combination of m PSD kernels $\{k_u\}$:

$$\mathcal{K} \triangleq \left\{ k = \sum_{u=1}^{m} \beta_u k_u : \sum_{u=1}^{m} \beta_u = 1,\ \beta_u \geq 0,\ \forall u \right\}$$

where the constraints on the coefficients $\{\beta_u\}$ ensure that the derived multi-kernel k is characteristic. Because MK-MMD is a weighted combination of several different kernels, its representational power is stronger than that of an MMD with a single kernel.
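To make the multi-kernel distance concrete, the sketch below computes a biased empirical estimate of the squared MK-MMD between two small samples. Several choices are assumptions made for illustration only: Gaussian kernels are used as the PSD kernels, the bandwidths are arbitrary, and the coefficients β are fixed to uniform weights rather than learned.

```python
import math

def gaussian_kernel(x, y, bandwidth):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def multi_kernel(x, y, bandwidths, betas):
    # Convex combination of kernels: betas are non-negative and sum to 1.
    return sum(b * gaussian_kernel(x, y, s) for b, s in zip(betas, bandwidths))

def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0), betas=None):
    """Biased estimate of the squared MK-MMD between two samples."""
    if betas is None:
        betas = [1.0 / len(bandwidths)] * len(bandwidths)  # assumed uniform
    assert all(b >= 0 for b in betas) and abs(sum(betas) - 1.0) < 1e-9

    def mean_k(a, b):
        return sum(multi_kernel(x, y, bandwidths, betas)
                   for x in a for y in b) / (len(a) * len(b))

    return mean_k(source, source) + mean_k(target, target) - 2.0 * mean_k(source, target)
```

The estimate is zero when the two samples coincide and grows as they move apart, which is the behaviour the adaptation layer exploits when it pulls the source and target representations together.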

The optimization objective of the final network model consists of two parts: the classification loss and the adaptation loss. It is achieved by minimizing the classification loss and the MK-MMD together, with the overall loss function defined as:

$$\min_{\Theta} \frac{1}{n_a} \sum_{i=1}^{n_a} J\left(\theta(x_i^a), y_i^a\right) + \lambda \sum_{l=l_1}^{l_2} d_k^2\left(\mathcal{D}_s^l, \mathcal{D}_t^l\right)$$

where Θ represents all weights and bias parameters of the network and is the target to be learned; λ > 0 is a penalty parameter; l₁ to l₂ are the indices of the layers to be adapted; $\mathcal{D}_s^l$ and $\mathcal{D}_t^l$ are the l-th layer hidden representations of the source domain and the target domain, respectively; $x_i^a$ and $n_a$ denote, respectively, the samples carrying type information in the source and target domains and their number; $y_i^a$ is the type label of $x_i^a$ and $\theta(x_i^a)$ is the network output for it; and J(·) is the classification loss function. The key task in the learning phase is to learn the network parameters Θ and the MK-MMD coefficients β.
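The way the two parts of the objective are combined can be sketched in a few lines. The function below is a hypothetical helper, not the patented implementation: it takes the per-cell classification losses and the squared MK-MMD of each adapted layer as already-computed numbers, and its default penalty of 10 follows the default value given later in the detailed description.

```python
def total_loss(classification_losses, mmd2_per_layer, lam=10.0):
    """Mean classification loss over labeled cells plus lam times the
    summed squared MK-MMD of the adapted layers."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    mean_class_loss = sum(classification_losses) / len(classification_losses)
    return mean_class_loss + lam * sum(mmd2_per_layer)
```

A larger `lam` puts more weight on aligning the source and target distributions, at the cost of weighting the classification term less.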

5. Accurate identification of cell types for unknown scRNA-seq datasets

The source domain network and the target domain network are updated and iteratively optimized using mini-batch stochastic gradient descent (SGD). The source data set and the target data set are divided into mini-batches by PyTorch's DataLoader and fed to the networks for training and optimization, and the finally trained target domain network is used as a classifier to accurately identify the types in the target data set whose type information is unknown.
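The division into mini-batches can be illustrated without PyTorch. The sketch below is a plain-Python stand-in for what a shuffling data loader does; the function names, the fixed seeds and the strategy of pairing one source batch with one target batch per step are assumptions for illustration.

```python
import random

def minibatch_indices(n_cells, batch_size=32, seed=0):
    """Yield shuffled index lists that cover every cell once per epoch."""
    order = list(range(n_cells))
    random.Random(seed).shuffle(order)
    for start in range(0, n_cells, batch_size):
        yield order[start:start + batch_size]

def paired_batches(n_source, n_target, batch_size=32, seed=0):
    """Pair source and target mini-batches; pairing stops when the
    smaller data set is exhausted."""
    return list(zip(minibatch_indices(n_source, batch_size, seed),
                    minibatch_indices(n_target, batch_size, seed + 1)))
```

Each pair supplies the source cells for the classification loss and the source and target cells for the adaptation loss in one SGD step.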

Drawings

FIG. 1: Framework diagram of the deep network adaptation model

FIG. 2: Flow chart of target data pre-training

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to experiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The hardware environment is a PC host with an Intel(R) Core(TM) i5-6400 CPU at 2.70 GHz, 16 GB of RAM and a 64-bit operating system. The software is implemented in Python 3.7.0 in the PyCharm 2021.1.3 environment on Windows 10.

1. Data collection and organization stage

The data used in the method fall into three categories. The first category is a reference data set generated by two sequencing protocols, 10x and CelSeq2. The second category is a human pancreas tissue data set generated by five sequencing protocols: CelSeq, CelSeq2, SmartSeq2, Fluidigm C1 and inDrop. The third category is the mouse aging cell atlas (Tabula Muris Senis) data set downloaded from Figshare, which contains expression information for 23341 genes from 96307 cells across 22 tissues, with complete sequencing data. These three categories of data allow the practicability of the method to be evaluated more thoroughly. All data are stored as AnnData objects. The initial scRNA-seq data consist of several parts, as shown in Table 1.

Table 1: the major components of AnnData

In the method, the matrix data is a matrix of cells by genes; the observed value data includes cell type information, batch information of sequencing data, and the like.

2. Data preprocessing stage

The different scRNA-seq data sets are randomly divided into a source data set and a target data set; the type information of the source data set is known and that of the target data set is unknown. The method first determines the genes detected in both the source data set and the target data set, and then merges the two data sets into one matrix based on these common genes. Preprocessing of the initial scRNA-seq data then begins and comprises three steps: quality control, data standardization and cell type conversion. Quality control mainly deletes cells without cell type information, i.e. cells whose type information is 'nan' or 'NA'. Data standardization is performed with the SCANPY package: low-quality cells with fewer than 5000 reads and 500 genes, and genes expressed in fewer than 10 cells, are filtered out; each cell is then normalized to 10000 read counts; finally, log transformation and normalization of the data set can be applied optionally, depending on the actual situation. In the experiments, the method applies log transformation and normalization to all data sets. Cell type conversion maps the cell type information from strings to numbers, which makes it convenient to classify the cells and to evaluate the type identification accuracy of the method. After preprocessing, the data are split back into a source data set and a target data set. Example data information for the source domain and the target domain is shown in Table 2.
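The gene intersection, the removal of unlabeled cells and the cell type conversion can be sketched as follows. The function names are hypothetical, the data are represented as plain lists rather than AnnData objects, and the choice of assigning integer codes in alphabetical order of the type names is an assumption.

```python
def common_genes(source_genes, target_genes):
    """Genes detected in both data sets, kept in source order."""
    target_set = set(target_genes)
    return [g for g in source_genes if g in target_set]

def subset_columns(matrix, genes, keep):
    """Restrict a cells-by-genes matrix to the genes listed in keep."""
    index = {g: i for i, g in enumerate(genes)}
    cols = [index[g] for g in keep]
    return [[row[c] for c in cols] for row in matrix]

def encode_cell_types(labels):
    """Drop cells labeled 'nan' or 'NA' and map type names to integers."""
    kept = [i for i, lab in enumerate(labels) if lab not in ("nan", "NA")]
    names = sorted({labels[i] for i in kept})
    mapping = {name: code for code, name in enumerate(names)}
    return kept, [mapping[labels[i]] for i in kept], mapping
```

After both matrices are restricted to the common genes they have identical columns, so they can be stacked into one matrix for joint preprocessing and split again afterwards.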

Table 2: data information of source domain and target domain

3. Neural network construction stage

The neural network used in the method consists of an input layer and two fully connected layers. The number of neurons in the input layer equals the number of genes remaining after data preprocessing; the first fully connected layer uses 1000 neurons and the second uses 100 neurons. Each fully connected layer performs four steps: (1) applying a linear transformation to the input data; (2) normalizing the activity of the neurons with Layer Normalization; (3) applying a nonlinear transformation to the activity of the neurons with the SELU activation function; (4) applying regularization with dropout.

In the pre-training stage, a mirror image of the neural network is used as a decoder; together they form an autoencoder that pre-trains the target domain network, with the mean squared error (MSE) as the reconstruction loss. The default parameter of the method is pretrain_epochs = 10. Whether to enable the pre-training step can be decided according to the actual situation of the data set and the training results.

In the formal training stage, the source domain and the target domain both adopt the neural network as their basic network structure. The source domain network additionally contains a classification layer whose number of neurons equals the number of cell types. The main parameter settings of the neural network are as follows: the initial learning rate is 0.001, and the learning rate decays exponentially in steps with a decay step of 20, meaning that after every 20 epochs the learning rate is multiplied by 0.95. The neural network is trained for 50 epochs with a mini-batch size of 32, i.e. the number of cells used in each training step.
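The learning rate settings above can be written as a short schedule function. This sketch assumes that the decay compounds, so that the factor 0.95 is applied once more after each block of 20 epochs; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay_rate=0.95, decay_step=20):
    """Staircase exponential decay: the rate is multiplied by decay_rate
    after every decay_step epochs."""
    return base_lr * decay_rate ** (epoch // decay_step)

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(50)]
```

Under this reading, epochs 0 to 19 use 0.001, epochs 20 to 39 use 0.001 × 0.95, and epochs 40 to 49 use 0.001 × 0.95².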

4. Optimization and training phase

A domain adaptation layer is added after the second fully connected layer of both the source domain and the target domain networks, and the adaptation loss between the source domain and the target domain is measured with multi-kernel MMD (MK-MMD). During training, the MMD distance between the source domain and the target domain is computed at the adaptation layer: the source domain and the target domain are mapped into a reproducing kernel Hilbert space (RKHS) with characteristic kernel k, and the distance between their data distributions is then computed in that high-dimensional space. The method trains the adaptation layer with n_iter_per_epoch = 40, meaning that 40 MMD iterations are performed in each global training step.

The optimization objective of the final network model consists of two parts: the classification loss and the adaptation loss. It is achieved by minimizing the classification loss and the MK-MMD together. The penalty parameter of the adaptation loss term in the total loss function is set to 10 by default. The source domain network and the target domain network are updated and iteratively optimized using mini-batch stochastic gradient descent (SGD). The source data set and the target data set are divided into mini-batches by PyTorch's DataLoader and fed to the networks for training and optimization, and the finally trained source domain network is used as a classifier to accurately identify the types in the target data set whose type information is unknown.

5. Result analysis verification

The scRNA-seq data generated by 10x and CelSeq2 sequencing in the reference data set are used as the source data set and the target data set, respectively, and the corresponding accuracy is computed; the two are then exchanged and the accuracy is computed again. The results are shown in Table 3.

Table 3: accuracy of cell type identification in reference dataset

For the human pancreatic tissue data sets generated by different sequencing platforms, the method uses CelSeq and CelSeq2 as the source data set and the target data set, respectively, and computes the corresponding accuracy; the two are then exchanged and the accuracy is computed again. The experimental results are shown in Table 4.

Table 4: accuracy of cell type identification in human pancreatic tissue data set

As can be seen from Tables 3 and 4, although the source and target data sets were generated by different sequencing platforms, the method identifies the types in the target data set without type information with high accuracy for both the reference data set and the human pancreatic tissue data set: close to 100% for the former and 92% for the latter. The method can thus largely overcome the differences between batches in the reference data set and in the human pancreatic tissue data, and accurately identify the types of unknown data using existing data.

The mouse aging cell atlas (Tabula Muris Senis) data set comprises 22 tissues, of which 4 tissues with rich cell types (Heart, Limb_Muscle, Brain_Non-Myeloid, Liver) are selected for the experiment. The data sequenced by 10x Genomics are taken as the source data set and the data sequenced by SmartSeq2 as the target data set, and the corresponding accuracy is computed. The experimental results are shown in Table 5.

Table 5: Accuracy of cell type identification in the mouse aging cell atlas data set

As can be seen from Table 5, in the mouse aging cell atlas (Tabula Muris Senis), which has a large data volume and rich cell types, the accuracy of the method in identifying the types of the target data set without type information remains high across several different tissues, which further confirms the reliability of the method for identifying scRNA-seq cell types in the presence of batch effects.

Claims

1. A deep network self-adaptive scRNA-seq cell type identification method is characterized by comprising the following implementation steps:

(1) collecting data, including a universal reference data set, human pancreas tissue data sets generated by different sequencing modes, and data sets of different tissues in the same species;

(2) preprocessing scRNA-seq data, wherein different scRNA-seq data sets are randomly divided into a source data set and a target data set, the type information of the source data set is known, the type information of the target data set is unknown, and the preprocessing comprises three steps of quality control, data standardization and cell type conversion;

(3) building a neural network architecture, in which the neural network parameters of the target domain are first initialized by an autoencoder, and the same neural network is then adopted as the basic network structure of the source domain and the target domain;

(4) the optimization framework is characterized in that a domain self-adaptive layer is added in the network structures of the source domain and the target domain, the self-adaptive layer can enable the data distribution of the source domain and the data distribution of the target domain to be closer, and the influence of batch effect on the final classification result is reduced;

(5) accurately identifying the cell type of an unknown scRNA-seq data set, performing parameter updating and iterative optimization on a source domain network and a target network by using a small-batch stochastic gradient descent (mini-batch SGD) method, and enabling a final model to have the capability of accurately identifying the type of the target data set with unknown type information.

2. The deep network adaptation-based scRNA-seq cell type identification method according to claim 1, characterized in that the data collection stage:

(1) the reference data set was generated by two sequencing modes, 10x and CelSeq2 respectively;

(2) the human pancreatic tissue data set was generated by five sequencing modes, CelSeq, CelSeq2, SmartSeq2, Fluidigm C1 and inDrop respectively;

(3) a mouse aging cell atlas (Tabula Muris Senis) data set downloaded from Figshare, containing expression information for 23341 genes from 96307 cells across 22 tissues.

3. The deep network adaptation-based scRNA-seq cell type identification method according to claim 1, characterized in that the data preprocessing stage:

(1) checking whether an abnormal value exists in the original data set and setting a threshold value for removal;

(2) filtering low-quality cells with fewer than 5000 reads and 500 genes, and genes expressed in fewer than 10 cells, normalizing each cell to 10000 read counts using SCANPY, and finally log-transforming and normalizing the data set;

(3) converting the cell type information of the data set to a numerical number facilitates subsequent cell classification.

4. The deep network adaptation-based scRNA-seq cell type identification method according to claim 1, which is characterized in that a neural network architecture stage is built:

(1) the neural network consists of an input layer and two full-connection layers, the number of neurons of the input layer is the number of genes after data preprocessing, 1000 neurons are used in the first layer of the full-connection layers, and 100 neurons are used in the second layer;

(2) in the pre-training stage, a mirror image of a neural network is used as a decoder, an auto-encoder is integrally formed to pre-train a target domain, and Mean Square Error (MSE) is used as a reconstruction loss function of the auto-encoder;

(3) in the formal training stage, the source domain and the target domain both adopt the neural network as a basic network structure, the source domain network further comprises a classification layer, the number of neurons of the classification layer is the number of cell types, and cross-entropy (cross-entropy) is used as a classification loss function of the source domain network.

5. The deep network adaptation-based scRNA-seq cell type identification method according to claim 1, characterized by an optimization network framework stage:

(1) adding a self-adaptive layer after a second full connection layer of the source domain network and the target network;

(2) the adaptation metric is multi-kernel MMD (MK-MMD), which measures the distance between the data distributions of the source domain and the target domain in a reproducing kernel Hilbert space (RKHS), and the square of the MK-MMD is defined as:

$$d_k^2(p, q) \triangleq \left\| \mathrm{E}_p[\phi(X^s)] - \mathrm{E}_q[\phi(X^t)] \right\|_{\mathcal{H}_k}^2$$

the characteristic kernel associated with the feature map φ is defined as:

$$k(X^s, X^t) = \langle \phi(X^s), \phi(X^t) \rangle$$

(3) alignment of data distribution of the source domain and the target domain is achieved by minimizing MK-MMD.

6. The deep network adaptation-based scRNA-seq cell type identification method according to claim 1, characterized in that the cell types of an unknown scRNA-seq data set can be accurately identified: the source data set and the target data set are divided into a plurality of mini-batches by PyTorch's DataLoader and fed to the networks for training and optimization, the optimization objective consists of two parts, the classification loss and the adaptation loss, and finally the trained target domain network is used as a classifier to accurately identify the types of the target data set.