Single-cell cross-species cell type identification method
Technical Field
The invention belongs to the field of single-cell data analysis, relates to processing and analysis of single-cell biological information data, and particularly relates to a single-cell cross-species cell type identification method.
Background
Currently, large-scale single-cell sequencing has been widely used in various studies, and related single-cell analysis procedures and tools are rapidly developed. The first step of single cell data analysis is to perform original processing on sequencing machine data, and process a sequencing sequence file into a matrix of digital gene Expression (DIGITAL GENE Expression, DGE). After the DGE matrix file is obtained, downstream analysis can be performed on single cell sequencing data.
Cell type identification and characterization is one of the most important issues in single cell transcriptome sequencing (scRNA-seq) data analysis. Currently, a variety of analytical tools such as Seurat, monocle and scanpy perform unsupervised clustering of cells and calculate the genes specifically expressed by each cell population, and manually identify cell types by combinations of marker genes. However, this artificial cell type identification and characterization is somewhat subjective. Although studies have reported the identification of engineered cell types based on gene expression of group transcriptome sequencing (bulk RNA-seq), these studies ignore cell heterogeneity within the group, and their reference cell types and gene expression profiles are also imperfect and difficult to apply to accurate identification at the single cell level. Thus, tools for cell type identification at the single cell level have yet to be developed.
Although some cell type identification and query tools have been developed successively, such as CellBlast, scmap, cellfisheing. Jl, CELLATLASSEARCH, etc. These tools can be divided into two broad categories based on computational methods, simply finding similar cell types directly from single cell gene expression profiles, or relying on machine learning methods to generate models to predict intercellular similarity. However, both of these methods rely heavily on reference databases, and there is currently no large-scale reference database with uniform and comparable annotations, and users often need to search for appropriate reference data themselves, which creates challenges for accuracy of the comparison results. In addition, these methods are usually developed and tested based on smaller samples, and calculation accuracy and efficiency are difficult to guarantee for single cell data of the order of hundreds of thousands or even millions at the present stage. Meanwhile, some methods lack an online tool and visually output results, which is unfavorable for users to use the tool.
Cell types are considered as evolutionary units with independent evolution potential, and research on cell type comparison and evolution is still limited to specific tissues, and systematic comparison analysis in species whole maps has not been performed yet. With the development of single cell technology, the drawing of single cell maps of more and more species provides an unprecedented opportunity to overcome this biological problem. However, current cell type identification tools can only identify cell types of the same species based on a reference database, and cannot perform cross-species identification and analysis. This creates challenges for analytical comparisons across species cell types, as well as for identification of certain single cell dataset cell types lacking a reference database.
Disclosure of Invention
In view of the above-described deficiencies in the prior art, the present invention provides a method for identifying a single cell cross-species cell type. The method collects high-quality data resources which are non-preferential and cover the whole cell types from single-cell map data of multiple species, overcomes the defects of cell identification based on group gene expression data and the trouble of a standard reference database, and provides data support for cross-species analysis. Through homologous gene comparison, extraction of characteristic genes and comparison of similarity of gene expression patterns, cross-species cell type identification is realized for the first time. The feature extraction mode greatly reduces the computational complexity, so that the method is suitable for cell type identification of millions of large-scale single-cell data. An online use platform is developed, and the popularization and the use of the invention are utilized.
The technical scheme of the invention is as follows:
The invention provides a single-cell cross-species cell type identification method, which comprises the following steps:
(1) Collecting single-cell transcriptome map data sets of different species, and sorting cell types according to the data set annotation information, and unifying naming modes and rules;
(2) Carrying out RPKM standardization treatment on gene expression matrixes of single-cell map data of different species to obtain standardized expression values of each gene in different cells; for each cell type, carrying out differential gene expression analysis according to the standardized expression value of the gene to obtain a characteristic gene which is specifically and highly expressed in each cell type compared with other cell types;
(3) For each species, selecting a plurality of differential expression genes in front of each cell type, and combining the differential expression genes as a characteristic gene set used as a reference data set, namely a cell type reference database;
(4) Identification of the cell type of the data set to be tested:
(4.1) collecting annotated homologous genes among the species to be detected, downloading annotated gene coding regions of the species to be detected or predicting coding regions of unknown homologous genes from a public database for unknown homologous genes among the species to be detected, inputting a gene coding region file of the species to be detected, and predicting homologous genes of the species to be detected;
(4.2) calculation of cell type similarity score:
Carrying out standardization processing and logarithmic conversion on the data set to be tested according to the step (2);
Carrying out homologous gene conversion on the expression matrix of the data set to be tested according to homologous genes among species in the step (4.1) to obtain a new gene expression matrix;
The method comprises the steps of extracting an expression value of a characteristic gene set from a cell type reference database of a query target species, calculating similarity scores of cell types between an expression matrix of a data set to be tested and the query target of the cell type reference database, counting the similarity scores of the cell types in the data set to be tested and the cell types of the cell type reference database according to cell information contained in each cell type, and taking the cell type with the highest similarity score as an identification result of the cell type of the data set to be tested.
Preferably, the species in step (1) is at least 15 species and ranges from lower to higher species. The invention collects more than six million single-cell data, covers 15 species from lower to higher representatively on the evolution of multi-cell animals, and is the most comprehensive single-cell reference database so far. The whole cell map of these species covers most of the tissue and organ and more comprehensive cell types of the species, and the same species data is derived from the same laboratory and uses the same sequencing methods, so that the collected cell types are unbiased and more comprehensive.
Specifically, the single cell transcriptome map dataset in step (1) is disclosed, and the dataset is classified into 8 main lineages according to the cell major categories, namely nerve cells, muscle cells, immune cells, endothelial cells, epithelial cells, secretory cells, stromal cells and proliferation cells, and the uncovered markers are other.
Preferably, the calculation formula of the gene expression matrix in step (2) is as follows:
wherein: Normalized expression values of the jth gene in the ith cell in the gene expression matrix;
Number of unique identifiers of the jth gene in the ith cell in the gene expression matrix.
In the step (2), single-cell map data from different sequencing platforms are subjected to RPKM (Reads Per Kilobase per Million MAPPED READS) standardization treatment, 100 cells are randomly extracted for three times to obtain an average value, and about 10000 genes are expressed in three sampling average value data of each cell type. Taking the sequencing depth of different platforms into consideration, capturing the difference of gene numbers of different cell types of the same platform, drawing a single cell data sequencing depth and cumulative base factor curve, and finding that the base factors reach balance at about 10000. The normalization and random extraction steps can eliminate the effect of single cell data sequencing depth between different sequencing platforms, different species, and even different cell types.
Specifically, randomly sampling 100 cells from each cell type without replacement, repeating for 3 times, and obtaining average expression value of genes to obtain average value data of 3 samples;
Using Seurat software to take the average data of 3 samples of each cell type as an input gene expression matrix, generating seurat file by CreateSeuratObject function, log2 (TPM/100+1) conversion of the data by NormalizeData function, linear transformation of the data by SCALEDATA function, findAllCluster function operation and Wilcoxon rank sum test method to calculate the differential expression genes of each cell type, and finally sorting the generated gene list by cell type, p_val_adj and avg_log2FC from high to low.
Preferably, in step (3), when obtaining the reference expression value of gene expression of each cell type, the average value of 3 samples of each cell type is rounded, the rounded values are averaged again, logarithmic conversion is performed to further reduce the difference of sequencing depth, and finally, the average standardized expression value after logarithmic conversion of the characteristic gene set is reserved as the cell type reference database.
The comparison analysis shows that when the first 10 differential expression genes are taken from each cell type to establish a reference database, the comparison efficiency reaches a higher and stable level, and the balance between high accuracy and quick operation is realized, so that the invention is suitable for the identification of the cell type of millions of large-scale single cell data.
Preferably, the predicted dataset is of the type of unique identifier, reads per million map reads per kilobase of transcription, or fragments per million map reads per kilobase of transcription.
Preferably, in the step (4.1), the homologous genes are aligned in a one-to-one orthologous manner. In order to further improve the comparison accuracy and comparison efficiency of homologous genes, the invention only considers one-to-one homologous genes for further analysis, and discards one-to-many or many-to-many homologous genes.
Preferably, in step (4.2), when the test data set does not detect the expression of some of the characteristic genes, the expression values of these characteristic genes in the sample are counted as zero, so that the test data set and the reference data set have the same number of lines.
Preferably, in step (4.2), the similarity score is Pearson correlation coefficient.
The method can be realized in an online platform of http:// bis.zju.edu.cn/cellatlas/animation.html.
The invention integrates single cell transcriptome total maps of a plurality of representative species, and constructs the largest single cell reference database so far, which comprises a plurality of cell types with no preference and more comprehensive coverage. The invention overcomes the defects of cell identification based on group gene expression data and the trouble of no standard reference database, and provides data support for cross-species analysis. Through homologous gene comparison, characteristic gene extraction and gene expression pattern similarity calculation, the invention eliminates the influence of single cell data sequencing depth among different sequencing platforms, different species and even different cell types, balances high accuracy and quick operation, and realizes identification of cell types of cross species for the first time. The method provided by the invention has important guiding significance for identification of unknown cell types and cross-species analysis.
Drawings
FIG. 1 is a flow chart of the technical route of the present invention.
FIG. 2 is a diagram showing the comparison result of the human mouse cell map predicted by the invention.
FIG. 3 is a display of the results of the identification of human adult kidney profiles across species cell types predicted by the present invention.
FIG. 4 is a display of the identification of the species-crossing cell types of the predicted zebra fish intestinal epithelial cells of the present invention.
Detailed Description
The invention constructs a single cell reference database based on the whole atlas of single cell transcriptomes of multiple species, and predicts the cell types of the cross species through homologous gene comparison, extraction of characteristic genes and calculation of similarity of gene expression patterns. The method comprises the following steps, as shown in fig. 1:
(1) Single cell profile data collection
The published single cell transcriptome map data sets are collected from published literature, and include a human cell map (59 ten thousand cells), a mouse development and aging cell map (112 ten thousand cells), a cynomolgus monkey cell map (100 ten thousand cells), a rat aging cell map (16 ten thousand cells), a zebra fish development cell map (103 ten thousand cells), a drosophila cell map (43 ten thousand cells), a salamander cell map (107 ten thousand cells), a xenopus metamorphosis development cell map (50 ten thousand cells), a glass sea squirt cell map (9 ten thousand cells), an earthworm cell map (9 ten thousand cells), a nematode cell map (5 ten thousand cells), a vortex worm cell map (6 ten thousand cells), a star sea anemone cell map (1 ten thousand cells), a hydroid cell map (2 ten thousand cells) and a sponge cell map (1 ten thousand cells). In total, over six million single-cell data were collected, covering 15 species from lower to higher that are representative of multi-cell animal evolution. Cell types are organized according to data set annotation information, named and ordered uniformly, and are classified into 8 major lineages according to cell major categories including neural cells, muscle cells, immune cells, endothelial cells, epithelial cells, secretory cells, stromal cells and proliferative cells, with the non-covered labels being others. At the same time, statistics are taken of the information of the period, sex, tissue, etc. of each cell-derived sample.
(2) Atlas data feature extraction
RPKM (Reads Per Kilobase per Million MAPPED READS) standardization processing is carried out on single-cell map data of each species from different sequencing platforms, and a gene expression matrix is processed according to the following calculation formula:
wherein: Normalized expression values of the jth gene in the ith cell in the gene expression matrix;
Number of unique identifiers of the jth gene in the ith cell in the gene expression matrix.
100 Cells were randomly extracted and averaged to give an average of three samples of each cell type with approximately 10000 gene expressions. The procedure was to randomly sample 100 cells from each cell type without replacement (all cells were selected if the total number of cells in one cell type was less than 100), repeat 3 times, and to determine the average expression value of the gene, and to obtain the average data of 3 samples for subsequent analysis.
Differential gene expression analysis was performed using 3 data of sampling average for each cell type, and characteristic genes of high expression specific to each cell type were obtained. Using Seurat software to take the three sampling mean values of each cell type as an input gene expression matrix, generating seurat files by using CreateSeuratObject functions, performing log2 (TPM/100+1) conversion on the data by NormalizeData functions, performing linear conversion on the data by SCALEDATA functions, running FindAllCluster functions, calculating the differential expression genes of each cell type by using Wilcoxon rank sum test method, and finally sorting the generated gene list from high to low according to cell types and avg_log2FC (average expression log difference multiple).
(3) Cell type reference database creation
For each species, the top 10 differentially expressed genes per cell type (avg_log2fc top 10) were selected and the signature gene sets used as reference data were pooled. The comparison analysis shows that when the first 10 differential expression genes are taken from each cell type to establish a reference database, the comparison efficiency reaches a higher and stable level, and the balance between high accuracy and quick operation is realized, so that the invention is suitable for the identification of the cell type of millions of large-scale single cell data. In order to better acquire the reference expression value of the gene expression of each cell type, we round up the average value of 3 samples of each cell type, then calculate the average value again after rounding up, make logarithmic conversion to further reduce the difference of sequencing depth, and finally keep the average standardized expression value after logarithmic conversion of the characteristic gene set as the cell type reference database.
(4) Homologous gene alignment
The annotated homologous genes between species were collected from BioMart of Ensembl database. For homologous genes unknown between species, annotated gene coding regions are downloaded from Ensembl database or coding regions of predicted genes are used TransDecoder, then gene coding region files of two species are input as OrthoFinder, and homologous genes of two species are predicted. In order to further improve the comparison accuracy and comparison efficiency of homologous genes, the invention only considers one-to-one homologous genes for further analysis, and discards one-to-many or many-to-many homologous genes.
(5) Cell type similarity score calculation.
For the dataset to be measured, it may be a unique identifier (Unique Molecular Identifiers, UMI), the number of reads per million map reads per kilobase of transcripts (Reads Per Kilo base per Million MAPPED READS, RPKM), or the fragments per million map reads per kilobase of transcripts (FRAGMENTS PER Kilo base per Million MAPPED FRAGMENTS, FPKM) type.
And (3) carrying out standardization processing and logarithmic conversion on the data set to be tested according to the step (1). And (3) carrying out homologous gene conversion on the expression matrix of the data set to be tested according to the orthologous genes between every two species in the step (4), obtaining a new gene expression matrix, converting the behavior into the genes of the query target species, and listing the genes as cells of the data set to be tested.
And extracting the expression value of the characteristic gene set based on the cell type reference database of the target species. If the test data set does not detect the expression of certain characteristic genes, the expression values of the characteristic genes in the sample are counted as zero, so that the test data set and the reference data set have the same number of lines.
And calculating a Pearson correlation coefficient between the data expression matrix to be detected and the cell type reference database query target, wherein the correlation coefficient is the cell type similarity score. According to the cell grouping information, namely the cell information contained in each cell type, the similarity score between the cell type in the data set to be tested and the cell type in the cell type reference database is counted, and the cell type with the highest similarity score is used as the identification result of the cell type in the data set to be tested.
The process according to the invention is described in more detail below by way of examples.
Example 1 alignment of human murine cytograms
By one-to-one ortholog transformation, human cell patterns were aligned to mouse developmental and senescent cell patterns, and similar cell types in both humans and mice were found to exhibit the highest similarity (left in fig. 2). Eight cell lineages of secretion, epithelium, immunity, nerves, content, muscle, stroma, erythroid, and proliferation essentially correspond in both humans and mice. For example, C33, C42, and C79 are annotated as smooth muscle cells in the human cell profile, which are highly correlated with smooth muscle cell types in the mouse developmental and senescent cell profile (right in fig. 2). Notably, C74 is similar to ventricular cardiomyocytes of fetal and neonatal hearts in the mouse developmental and senescent cell patterns (right in fig. 2). This cell type is annotated in the human cytogram as ventricular cardiomyocytes, of which 99.6% are from fetal heart. The accuracy of cell types and cell phases further demonstrates the effectiveness of the present invention in identifying cell types across species.
Example 2 identification of human adult kidney maps across species cell types
By one-to-one ortholog gene transformation, human adult kidney maps were aligned to mouse developmental and senescent cell maps, and various parenchymal cells in the kidneys were well annotated for release, including henle's ring, leap cells, master cells, proximal tubules, etc. (fig. 3).
Example 3 identification of the Cross-species cell type of Zebra fish intestinal epithelial cells
The zebra fish intestinal epithelial cells were aligned to the mouse developmental and senescent cell patterns by one-to-one ortholog gene transformation, and the cell types common in intestinal epithelium were well aligned to the corresponding cell lineages in the mouse patterns, including intestinal secretory cells, ionic cells, intestinal epithelial cells, epidermal cells, goblet cells, mesenchymal cells, and the like (fig. 4).