[go: up one dir, main page]

CN115064220B - A single-cell cross-species cell type identification method - Google Patents

A single-cell cross-species cell type identification method Download PDF

Info

Publication number
CN115064220B
CN115064220B CN202210677411.9A CN202210677411A CN115064220B CN 115064220 B CN115064220 B CN 115064220B CN 202210677411 A CN202210677411 A CN 202210677411A CN 115064220 B CN115064220 B CN 115064220B
Authority
CN
China
Prior art keywords
cell
species
gene
cell type
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210677411.9A
Other languages
Chinese (zh)
Other versions
CN115064220A (en
Inventor
郭国骥
张霈婧
孙慧宇
汪仁英
马立枫
鄂唯高
王晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210677411.9A priority Critical patent/CN115064220B/en
Publication of CN115064220A publication Critical patent/CN115064220A/en
Application granted granted Critical
Publication of CN115064220B publication Critical patent/CN115064220B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

本发明公开了一种单细胞的跨物种细胞类型鉴定方法,属于单细胞数据分析领域。本发明整合了多个代表性物种的单细胞转录组全图谱,构建了迄今为止最大的单细胞参考数据库,包含无偏好性且覆盖较为全面的多种细胞类型。本发明弥补了以群体基因表达数据为基础的细胞鉴定的缺陷和没有标准参考数据库的困扰,为跨物种分析提供数据支撑。通过同源基因比对、特征基因的提取和基因表达模式相似性的计算,本发明消除了不同测序平台、不同物种乃至不同细胞类型间的单细胞数据测序深度的影响,在高正确率和快速运算之间取得了平衡,并首次实现跨物种细胞类型的鉴定。本发明提供的方法对未知细胞类型鉴定及跨物种分析具有重要指导意义。

The present invention discloses a method for cross-species cell type identification of a single cell, which belongs to the field of single-cell data analysis. The present invention integrates the full map of single-cell transcriptomes of multiple representative species, and constructs the largest single-cell reference database to date, which contains a variety of cell types without bias and with relatively comprehensive coverage. The present invention makes up for the defects of cell identification based on population gene expression data and the trouble of no standard reference database, and provides data support for cross-species analysis. Through homologous gene alignment, characteristic gene extraction and calculation of gene expression pattern similarity, the present invention eliminates the influence of single-cell data sequencing depth between different sequencing platforms, different species and even different cell types, and strikes a balance between high accuracy and fast calculation, and realizes the identification of cross-species cell types for the first time. The method provided by the present invention has important guiding significance for the identification of unknown cell types and cross-species analysis.

Description

Single-cell cross-species cell type identification method
Technical Field
The invention belongs to the field of single-cell data analysis, relates to processing and analysis of single-cell biological information data, and particularly relates to a single-cell cross-species cell type identification method.
Background
Currently, large-scale single-cell sequencing has been widely used in various studies, and related single-cell analysis procedures and tools are rapidly developed. The first step of single cell data analysis is to perform original processing on sequencing machine data, and process a sequencing sequence file into a matrix of digital gene Expression (DIGITAL GENE Expression, DGE). After the DGE matrix file is obtained, downstream analysis can be performed on single cell sequencing data.
Cell type identification and characterization is one of the most important issues in single cell transcriptome sequencing (scRNA-seq) data analysis. Currently, a variety of analytical tools such as Seurat, monocle and scanpy perform unsupervised clustering of cells and calculate the genes specifically expressed by each cell population, and manually identify cell types by combinations of marker genes. However, this artificial cell type identification and characterization is somewhat subjective. Although studies have reported the identification of engineered cell types based on gene expression of group transcriptome sequencing (bulk RNA-seq), these studies ignore cell heterogeneity within the group, and their reference cell types and gene expression profiles are also imperfect and difficult to apply to accurate identification at the single cell level. Thus, tools for cell type identification at the single cell level have yet to be developed.
Although some cell type identification and query tools have been developed successively, such as CellBlast, scmap, cellfisheing. Jl, CELLATLASSEARCH, etc. These tools can be divided into two broad categories based on computational methods, simply finding similar cell types directly from single cell gene expression profiles, or relying on machine learning methods to generate models to predict intercellular similarity. However, both of these methods rely heavily on reference databases, and there is currently no large-scale reference database with uniform and comparable annotations, and users often need to search for appropriate reference data themselves, which creates challenges for accuracy of the comparison results. In addition, these methods are usually developed and tested based on smaller samples, and calculation accuracy and efficiency are difficult to guarantee for single cell data of the order of hundreds of thousands or even millions at the present stage. Meanwhile, some methods lack an online tool and visually output results, which is unfavorable for users to use the tool.
Cell types are considered as evolutionary units with independent evolution potential, and research on cell type comparison and evolution is still limited to specific tissues, and systematic comparison analysis in species whole maps has not been performed yet. With the development of single cell technology, the drawing of single cell maps of more and more species provides an unprecedented opportunity to overcome this biological problem. However, current cell type identification tools can only identify cell types of the same species based on a reference database, and cannot perform cross-species identification and analysis. This creates challenges for analytical comparisons across species cell types, as well as for identification of certain single cell dataset cell types lacking a reference database.
Disclosure of Invention
In view of the above-described deficiencies in the prior art, the present invention provides a method for identifying a single cell cross-species cell type. The method collects high-quality data resources which are non-preferential and cover the whole cell types from single-cell map data of multiple species, overcomes the defects of cell identification based on group gene expression data and the trouble of a standard reference database, and provides data support for cross-species analysis. Through homologous gene comparison, extraction of characteristic genes and comparison of similarity of gene expression patterns, cross-species cell type identification is realized for the first time. The feature extraction mode greatly reduces the computational complexity, so that the method is suitable for cell type identification of millions of large-scale single-cell data. An online use platform is developed, and the popularization and the use of the invention are utilized.
The technical scheme of the invention is as follows:
The invention provides a single-cell cross-species cell type identification method, which comprises the following steps:
(1) Collecting single-cell transcriptome map data sets of different species, and sorting cell types according to the data set annotation information, and unifying naming modes and rules;
(2) Carrying out RPKM standardization treatment on gene expression matrixes of single-cell map data of different species to obtain standardized expression values of each gene in different cells; for each cell type, carrying out differential gene expression analysis according to the standardized expression value of the gene to obtain a characteristic gene which is specifically and highly expressed in each cell type compared with other cell types;
(3) For each species, selecting a plurality of differential expression genes in front of each cell type, and combining the differential expression genes as a characteristic gene set used as a reference data set, namely a cell type reference database;
(4) Identification of the cell type of the data set to be tested:
(4.1) collecting annotated homologous genes among the species to be detected, downloading annotated gene coding regions of the species to be detected or predicting coding regions of unknown homologous genes from a public database for unknown homologous genes among the species to be detected, inputting a gene coding region file of the species to be detected, and predicting homologous genes of the species to be detected;
(4.2) calculation of cell type similarity score:
Carrying out standardization processing and logarithmic conversion on the data set to be tested according to the step (2);
Carrying out homologous gene conversion on the expression matrix of the data set to be tested according to homologous genes among species in the step (4.1) to obtain a new gene expression matrix;
The method comprises the steps of extracting an expression value of a characteristic gene set from a cell type reference database of a query target species, calculating similarity scores of cell types between an expression matrix of a data set to be tested and the query target of the cell type reference database, counting the similarity scores of the cell types in the data set to be tested and the cell types of the cell type reference database according to cell information contained in each cell type, and taking the cell type with the highest similarity score as an identification result of the cell type of the data set to be tested.
Preferably, the species in step (1) is at least 15 species and ranges from lower to higher species. The invention collects more than six million single-cell data, covers 15 species from lower to higher representatively on the evolution of multi-cell animals, and is the most comprehensive single-cell reference database so far. The whole cell map of these species covers most of the tissue and organ and more comprehensive cell types of the species, and the same species data is derived from the same laboratory and uses the same sequencing methods, so that the collected cell types are unbiased and more comprehensive.
Specifically, the single cell transcriptome map dataset in step (1) is disclosed, and the dataset is classified into 8 main lineages according to the cell major categories, namely nerve cells, muscle cells, immune cells, endothelial cells, epithelial cells, secretory cells, stromal cells and proliferation cells, and the uncovered markers are other.
Preferably, the calculation formula of the gene expression matrix in step (2) is as follows:
wherein: Normalized expression values of the jth gene in the ith cell in the gene expression matrix;
Number of unique identifiers of the jth gene in the ith cell in the gene expression matrix.
In the step (2), single-cell map data from different sequencing platforms are subjected to RPKM (Reads Per Kilobase per Million MAPPED READS) standardization treatment, 100 cells are randomly extracted for three times to obtain an average value, and about 10000 genes are expressed in three sampling average value data of each cell type. Taking the sequencing depth of different platforms into consideration, capturing the difference of gene numbers of different cell types of the same platform, drawing a single cell data sequencing depth and cumulative base factor curve, and finding that the base factors reach balance at about 10000. The normalization and random extraction steps can eliminate the effect of single cell data sequencing depth between different sequencing platforms, different species, and even different cell types.
Specifically, randomly sampling 100 cells from each cell type without replacement, repeating for 3 times, and obtaining average expression value of genes to obtain average value data of 3 samples;
Using Seurat software to take the average data of 3 samples of each cell type as an input gene expression matrix, generating seurat file by CreateSeuratObject function, log2 (TPM/100+1) conversion of the data by NormalizeData function, linear transformation of the data by SCALEDATA function, findAllCluster function operation and Wilcoxon rank sum test method to calculate the differential expression genes of each cell type, and finally sorting the generated gene list by cell type, p_val_adj and avg_log2FC from high to low.
Preferably, in step (3), when obtaining the reference expression value of gene expression of each cell type, the average value of 3 samples of each cell type is rounded, the rounded values are averaged again, logarithmic conversion is performed to further reduce the difference of sequencing depth, and finally, the average standardized expression value after logarithmic conversion of the characteristic gene set is reserved as the cell type reference database.
The comparison analysis shows that when the first 10 differential expression genes are taken from each cell type to establish a reference database, the comparison efficiency reaches a higher and stable level, and the balance between high accuracy and quick operation is realized, so that the invention is suitable for the identification of the cell type of millions of large-scale single cell data.
Preferably, the predicted dataset is of the type of unique identifier, reads per million map reads per kilobase of transcription, or fragments per million map reads per kilobase of transcription.
Preferably, in the step (4.1), the homologous genes are aligned in a one-to-one orthologous manner. In order to further improve the comparison accuracy and comparison efficiency of homologous genes, the invention only considers one-to-one homologous genes for further analysis, and discards one-to-many or many-to-many homologous genes.
Preferably, in step (4.2), when the test data set does not detect the expression of some of the characteristic genes, the expression values of these characteristic genes in the sample are counted as zero, so that the test data set and the reference data set have the same number of lines.
Preferably, in step (4.2), the similarity score is Pearson correlation coefficient.
The method can be realized in an online platform of http:// bis.zju.edu.cn/cellatlas/animation.html.
The invention integrates single cell transcriptome total maps of a plurality of representative species, and constructs the largest single cell reference database so far, which comprises a plurality of cell types with no preference and more comprehensive coverage. The invention overcomes the defects of cell identification based on group gene expression data and the trouble of no standard reference database, and provides data support for cross-species analysis. Through homologous gene comparison, characteristic gene extraction and gene expression pattern similarity calculation, the invention eliminates the influence of single cell data sequencing depth among different sequencing platforms, different species and even different cell types, balances high accuracy and quick operation, and realizes identification of cell types of cross species for the first time. The method provided by the invention has important guiding significance for identification of unknown cell types and cross-species analysis.
Drawings
FIG. 1 is a flow chart of the technical route of the present invention.
FIG. 2 is a diagram showing the comparison result of the human mouse cell map predicted by the invention.
FIG. 3 is a display of the results of the identification of human adult kidney profiles across species cell types predicted by the present invention.
FIG. 4 is a display of the identification of the species-crossing cell types of the predicted zebra fish intestinal epithelial cells of the present invention.
Detailed Description
The invention constructs a single cell reference database based on the whole atlas of single cell transcriptomes of multiple species, and predicts the cell types of the cross species through homologous gene comparison, extraction of characteristic genes and calculation of similarity of gene expression patterns. The method comprises the following steps, as shown in fig. 1:
(1) Single cell profile data collection
The published single cell transcriptome map data sets are collected from published literature, and include a human cell map (59 ten thousand cells), a mouse development and aging cell map (112 ten thousand cells), a cynomolgus monkey cell map (100 ten thousand cells), a rat aging cell map (16 ten thousand cells), a zebra fish development cell map (103 ten thousand cells), a drosophila cell map (43 ten thousand cells), a salamander cell map (107 ten thousand cells), a xenopus metamorphosis development cell map (50 ten thousand cells), a glass sea squirt cell map (9 ten thousand cells), an earthworm cell map (9 ten thousand cells), a nematode cell map (5 ten thousand cells), a vortex worm cell map (6 ten thousand cells), a star sea anemone cell map (1 ten thousand cells), a hydroid cell map (2 ten thousand cells) and a sponge cell map (1 ten thousand cells). In total, over six million single-cell data were collected, covering 15 species from lower to higher that are representative of multi-cell animal evolution. Cell types are organized according to data set annotation information, named and ordered uniformly, and are classified into 8 major lineages according to cell major categories including neural cells, muscle cells, immune cells, endothelial cells, epithelial cells, secretory cells, stromal cells and proliferative cells, with the non-covered labels being others. At the same time, statistics are taken of the information of the period, sex, tissue, etc. of each cell-derived sample.
(2) Atlas data feature extraction
RPKM (Reads Per Kilobase per Million MAPPED READS) standardization processing is carried out on single-cell map data of each species from different sequencing platforms, and a gene expression matrix is processed according to the following calculation formula:
wherein: Normalized expression values of the jth gene in the ith cell in the gene expression matrix;
Number of unique identifiers of the jth gene in the ith cell in the gene expression matrix.
100 Cells were randomly extracted and averaged to give an average of three samples of each cell type with approximately 10000 gene expressions. The procedure was to randomly sample 100 cells from each cell type without replacement (all cells were selected if the total number of cells in one cell type was less than 100), repeat 3 times, and to determine the average expression value of the gene, and to obtain the average data of 3 samples for subsequent analysis.
Differential gene expression analysis was performed using 3 data of sampling average for each cell type, and characteristic genes of high expression specific to each cell type were obtained. Using Seurat software to take the three sampling mean values of each cell type as an input gene expression matrix, generating seurat files by using CreateSeuratObject functions, performing log2 (TPM/100+1) conversion on the data by NormalizeData functions, performing linear conversion on the data by SCALEDATA functions, running FindAllCluster functions, calculating the differential expression genes of each cell type by using Wilcoxon rank sum test method, and finally sorting the generated gene list from high to low according to cell types and avg_log2FC (average expression log difference multiple).
(3) Cell type reference database creation
For each species, the top 10 differentially expressed genes per cell type (avg_log2fc top 10) were selected and the signature gene sets used as reference data were pooled. The comparison analysis shows that when the first 10 differential expression genes are taken from each cell type to establish a reference database, the comparison efficiency reaches a higher and stable level, and the balance between high accuracy and quick operation is realized, so that the invention is suitable for the identification of the cell type of millions of large-scale single cell data. In order to better acquire the reference expression value of the gene expression of each cell type, we round up the average value of 3 samples of each cell type, then calculate the average value again after rounding up, make logarithmic conversion to further reduce the difference of sequencing depth, and finally keep the average standardized expression value after logarithmic conversion of the characteristic gene set as the cell type reference database.
(4) Homologous gene alignment
The annotated homologous genes between species were collected from BioMart of Ensembl database. For homologous genes unknown between species, annotated gene coding regions are downloaded from Ensembl database or coding regions of predicted genes are used TransDecoder, then gene coding region files of two species are input as OrthoFinder, and homologous genes of two species are predicted. In order to further improve the comparison accuracy and comparison efficiency of homologous genes, the invention only considers one-to-one homologous genes for further analysis, and discards one-to-many or many-to-many homologous genes.
(5) Cell type similarity score calculation.
For the dataset to be measured, it may be a unique identifier (Unique Molecular Identifiers, UMI), the number of reads per million map reads per kilobase of transcripts (Reads Per Kilo base per Million MAPPED READS, RPKM), or the fragments per million map reads per kilobase of transcripts (FRAGMENTS PER Kilo base per Million MAPPED FRAGMENTS, FPKM) type.
And (3) carrying out standardization processing and logarithmic conversion on the data set to be tested according to the step (1). And (3) carrying out homologous gene conversion on the expression matrix of the data set to be tested according to the orthologous genes between every two species in the step (4), obtaining a new gene expression matrix, converting the behavior into the genes of the query target species, and listing the genes as cells of the data set to be tested.
And extracting the expression value of the characteristic gene set based on the cell type reference database of the target species. If the test data set does not detect the expression of certain characteristic genes, the expression values of the characteristic genes in the sample are counted as zero, so that the test data set and the reference data set have the same number of lines.
And calculating a Pearson correlation coefficient between the data expression matrix to be detected and the cell type reference database query target, wherein the correlation coefficient is the cell type similarity score. According to the cell grouping information, namely the cell information contained in each cell type, the similarity score between the cell type in the data set to be tested and the cell type in the cell type reference database is counted, and the cell type with the highest similarity score is used as the identification result of the cell type in the data set to be tested.
The process according to the invention is described in more detail below by way of examples.
Example 1 alignment of human murine cytograms
By one-to-one ortholog transformation, human cell patterns were aligned to mouse developmental and senescent cell patterns, and similar cell types in both humans and mice were found to exhibit the highest similarity (left in fig. 2). Eight cell lineages of secretion, epithelium, immunity, nerves, content, muscle, stroma, erythroid, and proliferation essentially correspond in both humans and mice. For example, C33, C42, and C79 are annotated as smooth muscle cells in the human cell profile, which are highly correlated with smooth muscle cell types in the mouse developmental and senescent cell profile (right in fig. 2). Notably, C74 is similar to ventricular cardiomyocytes of fetal and neonatal hearts in the mouse developmental and senescent cell patterns (right in fig. 2). This cell type is annotated in the human cytogram as ventricular cardiomyocytes, of which 99.6% are from fetal heart. The accuracy of cell types and cell phases further demonstrates the effectiveness of the present invention in identifying cell types across species.
Example 2 identification of human adult kidney maps across species cell types
By one-to-one ortholog gene transformation, human adult kidney maps were aligned to mouse developmental and senescent cell maps, and various parenchymal cells in the kidneys were well annotated for release, including henle's ring, leap cells, master cells, proximal tubules, etc. (fig. 3).
Example 3 identification of the Cross-species cell type of Zebra fish intestinal epithelial cells
The zebra fish intestinal epithelial cells were aligned to the mouse developmental and senescent cell patterns by one-to-one ortholog gene transformation, and the cell types common in intestinal epithelium were well aligned to the corresponding cell lineages in the mouse patterns, including intestinal secretory cells, ionic cells, intestinal epithelial cells, epidermal cells, goblet cells, mesenchymal cells, and the like (fig. 4).

Claims (10)

1. A method for identifying a single cell cross-species cell type, comprising the steps of:
(1) Collecting single-cell transcriptome map data sets of different species, and sorting cell types according to the data set annotation information, and unifying naming modes and rules;
(2) Carrying out RPKM standardization treatment on gene expression matrixes of single-cell map data of different species to obtain standardized expression values of each gene in different cells; for each cell type, carrying out differential gene expression analysis according to the standardized expression value of the gene to obtain a characteristic gene which is specifically and highly expressed in each cell type compared with other cell types;
(3) For each species, selecting a plurality of differential expression genes in front of each cell type, and combining the differential expression genes as a characteristic gene set used as a reference data set, namely a cell type reference database;
(4) Identification of the cell type of the data set to be tested:
(4.1) collecting annotated homologous genes among the species to be detected, downloading annotated gene coding regions of the species to be detected or predicting coding regions of unknown homologous genes from a public database for unknown homologous genes among the species to be detected, inputting a gene coding region file of the species to be detected, and predicting homologous genes of the species to be detected;
(4.2) calculation of cell type similarity score:
carrying out standardization processing on the data set to be tested according to the step (2);
Carrying out homologous gene conversion on the expression matrix of the data set to be tested according to homologous genes among species in the step (4.1) to obtain a new gene expression matrix;
The method comprises the steps of extracting an expression value of a characteristic gene set from a cell type reference database of a query target species, calculating similarity scores of cell types between an expression matrix of a data set to be tested and the query target of the cell type reference database, counting the similarity scores of the cell types in the data set to be tested and the cell types of the cell type reference database according to cell information contained in each cell type, and taking the cell type with the highest similarity score as an identification result of the cell type of the data set to be tested.
2. The method of claim 1, wherein the species in step (1) is at least 15 species and ranges from lower to higher species.
3. The method of claim 1, wherein the single cell transcriptome map dataset of step (1) is disclosed and the dataset is classified into 8 major lineages by cell major categories, neural cells, muscle cells, immune cells, endothelial cells, epithelial cells, secretory cells, stromal cells and proliferative cells, and the uncovered labels are others.
4. The identification method of claim 1, wherein the calculation formula of the gene expression matrix in step (2) is as follows:
wherein: Normalized expression values of the jth gene in the ith cell in the gene expression matrix;
Number of unique identifiers of the jth gene in the ith cell in the gene expression matrix.
5. The method of claim 4, wherein 100 cells are randomly sampled from each cell type without replacement, and the average expression value of the gene is obtained by repeating the sampling 3 times to obtain the average value data of 3 samples;
Using Seurat software to take the average data of 3 samples of each cell type as an input gene expression matrix, generating seurat file by CreateSeuratObject function, log2 (TPM/100+1) conversion of the data by NormalizeData function, linear transformation of the data by SCALEDATA function, findAllCluster function operation and Wilcoxon rank sum test method to calculate the differential expression genes of each cell type, and finally sorting the generated gene list by cell type, p_val_adj and avg_log2FC from high to low.
6. The method of claim 1, wherein in step (3), when obtaining the reference expression value of each cell type gene, the average value of 3 samples of each cell type is rounded, the rounded values are averaged again, logarithmic conversion is performed to further reduce the difference in sequencing depth, and finally the average normalized expression value after logarithmic conversion of the characteristic gene set is retained as the cell type reference database.
7. The assay method of claim 1, wherein in step (4), the predicted dataset is of the unique identifier, reads per million map reads per kilobase of transcription, or fragments types per million map reads per kilobase of transcription.
8. The method of claim 1, wherein the homologous gene alignment in step (4.1) is performed using a one-to-one orthologous gene alignment.
9. The assay method of claim 1, wherein in step (4.2), when the test dataset does not detect the expression of certain signature genes, the expression values of the signature genes in the test dataset are set to zero so that the test dataset and the reference dataset have the same number of rows.
10. The assay of claim 1 wherein in step (4.2), the similarity score is a Pearson correlation coefficient.
CN202210677411.9A 2022-06-14 2022-06-14 A single-cell cross-species cell type identification method Active CN115064220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210677411.9A CN115064220B (en) 2022-06-14 2022-06-14 A single-cell cross-species cell type identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210677411.9A CN115064220B (en) 2022-06-14 2022-06-14 A single-cell cross-species cell type identification method

Publications (2)

Publication Number Publication Date
CN115064220A CN115064220A (en) 2022-09-16
CN115064220B true CN115064220B (en) 2025-05-16

Family

ID=83199669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210677411.9A Active CN115064220B (en) 2022-06-14 2022-06-14 A single-cell cross-species cell type identification method

Country Status (1)

Country Link
CN (1) CN115064220B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486914A (en) * 2023-01-06 2023-07-25 上海欧易生物医学科技有限公司 Single-cell transcriptome sequencing-based inter-species homologous gene conversion analysis method and system
CN116153401B (en) * 2023-01-07 2024-01-05 杭州链康医学检验实验室有限公司 PDX-based single cell transcriptome data analysis methods, systems, equipment and media
CN116434842B (en) * 2023-04-18 2025-09-26 上海市东方医院(同济大学附属东方医院) A biological sample information processing method, system and storage medium
CN116564418B (en) * 2023-04-20 2024-06-11 深圳湾实验室 Cell group correlation network construction method, device, equipment and storage medium
CN116525010A (en) * 2023-04-26 2023-08-01 杭州联川生物技术股份有限公司 Single-cell transcriptome double-source multi-cell filtering method, medium and equipment
CN116705175B (en) * 2023-06-08 2023-12-29 南京农业大学 A cross-species comparative genomics database and its construction and analysis methods
CN117116364B (en) * 2023-10-25 2024-02-20 智泽童康(广州)生物科技有限公司 Single cell database and associated cell subgroup automatic recommendation method thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542414A (en) * 2023-11-21 2024-02-09 浙江大学 A method, device and medium for cross-species analysis of single-cell chromatin accessibility maps

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150072876A1 (en) * 2013-07-19 2015-03-12 The Board Of Trustees Of The Leland Stanford Junior University Self-directed method for cell-type identification and separation of gene expression microarrays
US20170002319A1 (en) * 2015-05-13 2017-01-05 Whitehead Institute For Biomedical Research Master Transcription Factors Identification and Use Thereof
WO2020068731A1 (en) * 2018-09-25 2020-04-02 The General Hospital Corporation Methods for integrating cell gene expression data from multiple single-cell data sets and uses thereof
BR112021019640A2 (en) * 2019-12-19 2022-06-21 Illumina Inc High-throughput single cell libraries and methods of preparation and use
CN112700820B (en) * 2021-01-07 2021-11-19 广州华银健康医疗集团股份有限公司 Cell subset annotation method based on single cell transcriptome sequencing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542414A (en) * 2023-11-21 2024-02-09 浙江大学 A method, device and medium for cross-species analysis of single-cell chromatin accessibility maps

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于单细胞测序数据的免疫细胞类型识别方法的开发及应用;蒋英;中国博士学位论文全文数据库(电子期刊) 医药卫生科技辑;20240515;E059-27 *

Also Published As

Publication number Publication date
CN115064220A (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN115064220B (en) A single-cell cross-species cell type identification method
Li et al. Rooting the animal tree of life
CN110060729B (en) Method for annotating cell identity based on single cell transcriptome clustering result
CN106202999B (en) Microorganism high-pass sequencing data based on different scale tuple word frequency analyzes agreement
Adebali et al. Phylogenetic analysis of SARS-CoV-2 genomes in Turkey
CN118366547B (en) Application of gene markers in early screening of multiple cancers, early screening model construction method and detection device
CN118248210A (en) Pedigree tracing method based on whole genome resequencing SNP big data and deep learning
CN118072825B (en) A method for identifying and analyzing microorganisms in soil
CN118155721A (en) A method and system for evaluating aquatic germplasm resources based on deep learning
CN111445991A (en) Method for clinical immune monitoring based on cell transcriptome data
CN113889274B (en) Method and device for constructing risk prediction model of autism spectrum disorder
CN110570901A (en) A method and system for SSR typing based on sequencing data
CN118782149B (en) A Hi-C-based microbial metagenomic sequencing analysis method and system
CN106709273B (en) Rapid detection method and system for matching microalgal protein signature sequence tags
CN118571324B (en) Data processing method and device, storage medium and electronic device
CN118098376B (en) Intelligent non-exosome particle identification method and system applied to exosome separation
CN109308934A (en) A kind of gene regulatory network construction method based on integration characteristic importance and chicken group's algorithm
CN117059180B (en) A method, device, equipment and storage medium for screening donor intestinal flora
WO2012096016A1 (en) Nucleic acid information processing device and processing method thereof
CN115547407B (en) lncRNA-protein interaction prediction method based on depth automatic encoder
CN114496089B (en) Pathogenic microorganism identification method
KR101046689B1 (en) Apparatus and method for visualizing and analyzing gene expression pattern of biological sample using gene ontology tree
CN118064602A (en) Gene related to Longshan sheldrake egg yield traits and screening method and application thereof
CN117457200A (en) Hypertension risk prediction method, device, system and storage medium
Marić et al. Approaches to metagenomic classification and assembly

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant