WO2023142041A1 - Methods for processing sequencing data and uses thereof - Google Patents
Methods for processing sequencing data and uses thereof Download PDFInfo
- Publication number
- WO2023142041A1 WO2023142041A1 PCT/CN2022/074991 CN2022074991W WO2023142041A1 WO 2023142041 A1 WO2023142041 A1 WO 2023142041A1 CN 2022074991 W CN2022074991 W CN 2022074991W WO 2023142041 A1 WO2023142041 A1 WO 2023142041A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gene
- cells
- cellular components
- expression
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/40—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- G16B5/20—Probabilistic models
Definitions
- the present invention belongs to the field of bioinformatics, and in particular relates to methods and systems for processing bulk sequencing data to determine expression of a gene of interest in individual cellular components.
- TME tumor microenvironment
- various immune cells play important roles in overcoming tumorigenesis including detecting, infiltrating, inhibiting, and destroying tumors.
- Studying gene expression profiling in bulk tumor samples has been successful for the discovery of cancer cell specific target genes.
- TME is a complex mixture of cancer, stromal and immune cells, pinpointing the expression status of various cellular sub-components would elevate mechanistic research, diagnosis, and treatment to a finer resolution.
- scRNASeq single cell RNA sequencing
- the present disclosure provides an indication-or population-based computational estimation of gene expression or its equivalent in individual cell types per tumor indication or per cohort.
- the present disclosure provides a method for processing data to determine expression of a gene of interest in individual cellular components in multiple samples from a cohort, the method comprising:
- the step (b) comprises:
- the Gene-SELECT algorithm is configured to quantify the expression of the gene of interest in the individual cellular components by estimating expression coefficient of each of the cellular components.
- the Gene-SELECT algorithm estimates the expression of coefficient of each of the cellular components by a regression approach, e.g., by a multivariate linear regression approach.
- expression of the gene of interest in the individual cellular components is determined by multiplying the expression coefficient by the fraction of each of the cellular components.
- step (c) of the method comprises generating a Gene-SELECT Score (GSS) of the gene by combining expressions of the gene in the individual cellular components.
- GSS Gene-SELECT Score
- the method further comprises visualizing the generated data matrix by a heatmap.
- the multiple samples comprise at least 30 samples, at least 50 samples, or at least 80 samples.
- the multiple samples comprise a number of samples that is at least three times, at least four times, or at least five times of the number of the individual cellular components.
- the cohort comprises subjects with a cancer, and the samples are tumor tissue samples from the subjects.
- the individual cellular components comprise cancer cells, stromal cells and leukocytes.
- the leukocytes are further divided into neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes.
- the mast cells are further divided into activated mast cells and resting mast cells; the dendritic cells are further divided into activated dendritic cells and resting dendritic cells; the macrophages are further divided into M0 macrophages, M1 macrophages, M2 macrophages, and monocytes; and/or the lymphocytes are further divided into NK cells, T cells, and B cells.
- the NK cells are further divided into activated NK cells and resting NK cells; the T cells are further divided into gamma delta T cells, regulatory T cells (Tregs) , follicular helper T cells, CD4+ T cells, and CD8+ T cells; and/or the B cells are further divided into plasma cells, memory B cells and B cells.
- Tregs regulatory T cells
- follicular helper T cells CD4+ T cells
- CD8+ T cells CD8+ T cells
- the B cells are further divided into plasma cells, memory B cells and B cells.
- the CD4+ T cells are further divided into activated memory CD4+ T cells, resting memory CD4+ T cells, and CD4+ T cells.
- the individual cellular components comprise those listed in Table 1 below:
- the cohort comprises subjects with the same cancer type.
- the Gene-SELECT algorithm is configured to quantify expression of the gene of interest in the individual cellular components averaged for the samples in terms of the cancer type.
- the cohort is comprised of subjects with different cancer types, and expression of the gene of interest in individual cellular components for each of the cancer types is determined.
- the Gene-SELECT algorithm is configured to quantify expression of the gene of interest in the individual cellular components averaged for each of the cancer types.
- step (c) of the method comprises generating a Gene-SELECT Score (GGS) of the gene by combining expressions of the gene in the individual cellular components for each of the cancer types.
- the method further comprises generating a data matrix of GGS of the gene by combining GGSs for the different cancer types.
- the method further comprises selecting the cell component (s) with high (e.g., highest, or top 3, or top 5) expression of the gene of interest across samples from multiple or all of the cancer types evaluated.
- the method further comprises ranking the expression of the gene of interest in the selected cell component (s) by cancer types.
- the expression of one or two genes of interest are determined in the method.
- the expressions of more than one gene of interest are determined in the method.
- the gene of interest encodes a tumor antigen or a tumor associated antigen (TAA) .
- TAA tumor associated antigen
- the present disclosure provides a system for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort, the system comprising:
- the input module is further configured to receive cell deconvolution data including the fraction of each of the cellular components, and/or wherein the analyzing module is further configured to apply a cell deconvolution algorithm to deconvolve the fraction of each of the cellular components.
- the present disclosure provides a device for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort, the device comprising:
- a memory for storing computer program instructions
- the device executes the method of the first aspect of the present disclosure.
- the present disclosure provides a computer-readable medium, which stores computer program instructions, wherein when the computer program instructions are executed by a processor, the method of the first aspect or the second aspect of the present disclosure is executed.
- Figure 1 illustrates the framework for high-throughput and high-resolution mRNA expression level estimation.
- Figure 2 illustrates that the bulk expression level of a particular gene is a summation of expression levels of its individual cellular components, wherein the tumor tissue comprises cancer cells, stromal cells, and leukocytes, while leukocytes are part of the stromal cells.
- the bulk expression level of a gene could be deconvoluted into expression levels in individual cellular components.
- Figure 3 depicts visualization of the Gene-SELECT Scores for LAG3: (A) indications ordered by pattern similarity; (B) Indications ordered by alphabetical and specific subtypes.
- Figure 4 illustrates validation of LAG3 Single-cell expression in melanoma.
- a single-cell RNA-seq dataset in melanoma contains CD8 T cell, CD4 T cell, B cell, macrophage, CAF, endothelial cell, NK cell and malignant cell, with LAG-3 expressed cell amounts as well as total cell amounts listed under the X-axis corresponding to each cell component.
- Figure 5 illustrates validation of LAG3 Single-cell expression in colorectal cancer.
- a single-cell RNA-seq dataset in colorectal cancer contains CD8 T cell, CD4 T cell, myeloid cell, B cell, ILC, CAF, epithelial cell, fibroblast and malignant cell, with LAG-3 expressed cell amounts as well as total cell amounts listed under the X-axis corresponding to each cell component.
- Figure 6 illustrates visualization of large categories of cellular components, including cancer cells, neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes.
- Figure 8 shows dual visualization of ERBB2 (HER2) and LAG3, wherein the first panel is the label of genes, the second and third panel demonstrate the averaged expression levels of two genes respectively per indication, and the column of heatmap is ordered by indication names, with two genes together per indication.
- ⁇ SELECT Gene S pecific E xpression L oading by E stimating C ellular T ranscripts
- the challenges may occur when specific signals should be precisely distinguished and extracted from a mixture of signals. Accordingly, disclosed herein are algorithms and methods enabling the determination of the status of sources that correspond to the signals of interest.
- the Gene-SELECT algorithm is an algorithm utilizing the bulk RNA-seq data and cellular components fraction data of samples (e.g., cancer tissue samples) to calculate the expression patterns of one or more genes in individual cellular components (e.g. cancer cells, stromal cells and immune cells) by a regression approach.
- the Gene-SELECT algorithm disclosed herein uses multivariate linear regression to estimate each of the expression coefficients of individual cellular components in a population of samples, e.g., per indication, given a particular gene of interest. Sufficient sample size is desired to give statistically reliable estimates. For example, if 20 cellular components are given, at least 60 or preferably at least 80 samples would be suggested. The composition of cellular components could be adjusted, integrated, or pruned.
- the Gene-SELECT scores are generated for each gene by multiplication between the average fraction in a cellular component across the indication and the corresponding expression coefficient in the same cellular component. In this way, the score reflects the relative expression levels of that gene in each cellular component in the same indication. Gene-SELECT scores can be visualized in specialized ways such as single-gene heatmaps and multiple-gene heatmaps.
- a computer-implemented method for the determination of cell-based gene expression signals from the complex mixture of gene expression signals for designated datasets based on samples e.g. clinical patient samples
- the method is comprised of: (a) receiving the dataset comprising a group of signals, wherein the set of signals correspond to designated signal sources (including the levels of samples/patients, population/indication and gene expression) ; (b) utilizing cell deconvolution algorithm based cell fraction results, to deconvolve the set of signals to identify and quantify the independent signal sources (including samples/patients, population/indication and cell types/cellular components) ; (c) analyzing the designated dataset using the Gene-SELECT algorithm configured to deconvolve gene expression that correspond to the distinct cellular components and generate a data matrix for the Gene-SELECT Scores (GSSs) by combining parameters with corresponding cell fractions; and (d) visualizing the generated data matrix by using the tool of heatmap whether for a single gene or for two genes.
- GSSs Gene-SELECT Scores
- the samples comprise at least 20 samples, at least 30 samples, at least 50 samples, at least 80 samples, at least 100 samples, at least 200 samples, at least 300 samples, at least 500 samples, at least 800 samples, at least 1000 samples, at least 2000 samples, at least 3000 samples, or at least 5000 samples.
- the number of samples is at least two times, at least three times, at least four times, at least five times, at least 10 times, at least 20 times, at least 30 times, at least 50 times, or at least 100 times of the number of the individual cell components.
- the Gene-SELECT algorithm is configured to determine parameters based on designated cellular components in the unit of single gene and single population/indication.
- the Gene-SELECT algorithm identifies and quantifies the parameters for distinct cellular components using multivariate linear regression (MLR) model for gene expression deconvolution.
- MLR multivariate linear regression
- the cell-based gene expression is scored by multiplying cell-based estimated parameters by corresponding cell fractions.
- the normalization method for visualization of GSS data matrix is default to scaling by each indication.
- setting trim to a number between 0 and 1 uses equidistant classes between the (trim) -and (1-trim) -quantile, and lumps the values below and above this range into separate open-ended classes. If the data comes from a heavy-tailed distribution, this can save the display from putting too many values into too few classes (See, Servant, Nicolas, et al. "Package ‘EMA’ . " (2020) ) .
- the trimmed value can be set to a value between 0 and 1, for example, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 or 0.9.
- the trimmed value cutoff of scaling is default to 0.3.
- the datasets comprise bulk RNA-seq data and cell fraction data.
- the dataset of bulk RNA-seq data is derived from a group of cancer tissue samples.
- the dataset of cell fraction signals corresponds to a series of cancer, stromal and immune cell types.
- Another aspect of the present disclosure provides a prediction of Gene-SELECT in cancer cells, stromal cells, and various immune cells, the varieties comprising: (a) the most precise classification of leukocytes (Table 1) ; (b) the large categories of leukocytes, including neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes; (c) leukocyte as an entire class.
- Another aspect of the present disclosure provides a method of helping optimizing the indication selection approach, the method comprising: (a) obtaining the dataset of a designated objective from the designated data source; (b) using Gene-SELECT algorithm to generate GSS data matrix; (c) visualizing GSS data matrix by heatmap; (d) selecting the cell component with most highly expression across most of indications; (e) ranking the GSSs by indications, and listing the rank of that GSS in the corresponding cell component for each indication.
- an objective is defined as one protein-coding gene.
- the designated data source is the bulk RNA-seq data matrix of the objective with labels of sample IDs and indications.
- Another aspect of the present disclosure provides a method of visualizing GSSs for two objectives simultaneously, the method comprising: (a) obtaining the dataset of two designated objectives from the designated data source; (b) using Gene-SELECT algorithm to generate GSS data matrix for two objectives, respectively; (c) combining GSS data matrices of two objectives to an integrated form; (d) visualizing the integrated GSS data matrix by a heatmap.
- an objective is defined as one protein-coding gene.
- the designated data source is the bulk RNA-seq data matrix of the objective with labels of sample IDs and indications.
- the framework of the methodology is depicted in Figure 1. There are totally four elements of input information utilized in this study: sample information (patient level) , population information (indication level) , individual gene expression (gene level) , and cell type information (cell level) .
- the methodology is comprised of three steps: (1) downloading bulk RNA-seq data from TCGA primary tumors for a given designated gene name, which data include patient level, indication level and gene level, without cell level; (2) obtaining the cell fraction data from TCGA primary tumors, which include patient level, indication level and cell level, without gene level; (3) estimating cell-based Gene-SELECT Scores (GSSs) according to both bulk RNA-seq data and cell fraction data and visualize GSSs.
- GSSs Gene-SELECT Scores
- TCGA Cancer Genome Atlas Database
- RNA-seq data were downloaded from Qiagen Omicsoft TCGA Land, which contains 9862 primary tumor samples per gene across 32 indications (Table 2) .
- cell component classifications 1) cancer cell, leukocyte, and stromal cell; 2) large categories of cellular components including cancer, neutrophils, eosinophils, mast cells, dendritic cells, macrophages, lymphocytes, and stroma; 3) small categories with comprehensively 23 cellular components (Table 1, see above) .
- Any other classifications of cellular components are highly flexible to be designed.
- Cell deconvolution is a method to estimate specific cell fractions/proportions based on bulk RNA-seq data.
- the original cell deconvolution method was utilized (Abbas et al. PLoS ONE, 2009) for providing cell fraction data of the present disclosure, which is summarized as follows.
- Expression deconvolution was performed on linear, untransformed data as follows: in each mixture sample, the total expression signal of each microarray probe was modeled as the sum of the expression signals of its constituent parts, each of which is described as a product of the expression signal of that probe in that purified cell type times the fractional abundance of that cell type in the mixture.
- a ij is an expression signal measurement in a purified cell
- B i is an expression signal measurement in a mixture of cells
- X j is a fractional abundance, for each of i probesets and j cell types.
- A is the basis matrix of the expression levels of all probesets in all cell types
- B is the vector of expression levels of all probesets in one mixture
- X is the vector of the relative levels of cell types comprising B.
- the equation was solved for X with the R function ‘lsfit’ (a linear least squares algorithm) followed by removal of the lowest negative coefficient from the equation and iteration of the solution if necessary until all coefficients were nonnegative. Fractions of the cell types were determined by dividing the coefficients by the yield of mRNA per cell input.
- the X which is the predicted fraction vector of cellular components, is the output of cell deconvolution procedure, and the input of the Gene-SELECT analysis.
- the data matrix of cell deconvolution result is originated from Supplemental Information (Thorsson et al. Immunity, 2018: Table S1. PanImmune Feature Matrix of Immune Characteristics https: //www. cell. com/cms/10.1016/j. immuni. 2018.03.023/attachment/1b63d4bc-af31-4a23-99bb-7ca23c7b4e0a/mmc2.
- xlsx for multiple reasons: 1) it is based on RNA-seq data of 11080 primary tumor samples within 33 indications from TCGA; 2) for each sample, it estimates cancer cell fractions, leukocyte fractions as well as stromal cell fractions (1 –cancer cell fractions) ; 3) cancer cell fractions are produced from bulk RNA-seq and tumor purity, while leukocyte fractions are produced from DNA methylation data; 4) it comprehensively includes 22 leukocyte components (Table 1, see above) . Only 23 indications with sample size > 75 remains for the following analysis (Table 3) .
- the bulk expression level of a particular gene is a summation of expression levels of its individual cellular components.
- the tumor tissue is composed of cancer cells, stromal cells, and leukocytes, while leukocytes are part of the stromal cells. Therefore, the bulk expression level of a gene could be deconvolute into expression levels in individual cellular components, and the resolution of leukocytes could be further split into various subtypes (Figure 2) . The ultimate resolution depends on the resolution of cell fraction data.
- the Gene-SELECT algorithm is based on the multivariate linear regression, which estimates the expression coefficients of individual cellular components in a population/indication-based strategy. This population-based strategy, scarifying sample-wise resolution of expression, could highly keep the coefficients robust enough to reflect the cell-based expression in an entire indication, which ensures the strong power in indication selection.
- theExp ij is the bulk RNA-seq expression level of a particular gene i in samplej
- ⁇ Cn_j are expression coefficients in samplej
- X Cn_j are cellular components in samplej with known proportions estimated from bulk RNA-seq, tumor purity or methylation data independent of specific genes.
- E is the vector of the expression levels of gene i from bulk RNA-seq data inj samples
- B is the vector of expression coefficients to be estimated
- X is the matrix of the relative fractions of cellular components comprising tumor microenvironment (TME) .
- Exp ij is the bulk RNA-Seq expression level in samplej
- X jk is the fraction (proportion) vector for thekth cell component in samplej
- ⁇ i is the population-based intercept term
- ⁇ i is the population-based error term assumed to follow Gaussian distribution
- k 1, 2, ..., n as cellular components.
- composition of cellular components could be pruned or simplified, since cell fractions per sample are additive.
- the most simplified model (cancer vs. leukocyte) is defined as
- ⁇ CC_i and ⁇ LC_i are the expression coefficients to be estimated corresponding to cancer cell and leukocyte components for gene i, respectively.
- GSS ik the Gene-SELECT Score of gene i in the kth cell component
- GSS ik is the Gene-SELECT Score of gene i
- ⁇ ik is the estimated coefficient of gene i in the kth cell component, and is the averaged fraction in m samples of a given population in the kth cell component.
- a score matrix is produced with each row a cell component and each column an indication.
- heatmap is utilized as the tool for visualizing the ultimate GSSs by genes, cellular components and indications ( Figure 3, 6-8) .
- single gene heatmap and double gene heatmap are applicable.
- the default setting of single gene heatmap is as below: 1) each heatmap represents one gene; 2) each row is a cell component, and each column is an indication; 3) row order is fixed, and columns are either sorted by alphabetic order or clustered by pattern similarity; 4) the normalization is applied to all cellular components per indication; 5) trimmed cutoff values are used to smoothen the color scale potentially distorted by outliers.
- double gene heatmap is utilized for the combinatorial visualization of two genes ( Figure 8) .
- the default setting of single gene heatmap is as below: 1) each heatmap represents two genes together; 2) each row is a cell component, and each two column is an indication by genes; 3) row order is fixed, and columns are either sorted by alphabetic order or clustered by pattern similarity; 4) considering the different expression baseline between two genes across indications, the baseline expression values are also mapped to the double gene heatmap.
- RNA-seq data of LAG3 gene expression together with cell fraction data about 23 cellular components in 9862 samples across 25 indications (3 of which are customized indications) , are selected for Gene-SELECT analysis.
- Gene-SELECT algorithm is applied, the GSSs in the form of data matrix is generated (Table 4) , which is then used to deliver heatmap visualization.
- Table 4 There are two options for heatmap visualization: indications ordered by pattern similarity ( Figure 3A) , and indications ordered by alphabetical and specific subtypes ( Figure 3B) .
- GSS-based indication selection is manipulated by a two-step strategy: first, cellular components with high expression are picked up, for which the ranked list of Gene-SELECT Scores is then extracted. Particularly, LAG3 is highly expressed in CD8 + T cell widely across most indications. Then, the GLSs of LAG3 in CD8 + T cell are specifically pulled out and ranked from high to low (Table 5) .
- melanoma indications including UVM and SKCM are top 2 indications, while UVM, SKCM, and KIRC are indications with GSS > 1.
- the GSS cutoff is highly dependent on the case of individual genes, suggesting consideration of the top indications for the selected cell component would be a better solution for the further studies.
- LAG3 is expressed in more than half of CD8 + T cells (1430/2405, Figure 5) , which is also the highest among all cellular components. Both datasets well confirmed the highest expression of LAG3 in CD8 + T cell, demonstrating that Gene-SELECT performs robust estimation of gene expression in cellular components.
- the present disclosure provides a simplified option of GSS estimation, with large categories of cellular components including cancer cells, neutrophils, eosinophils, mast cells, dendritic cells, macrophages, lymphocytes, and stromal cells (shown in Method, Cellular Components) .
- stromal cells are very important for the tumor function. Taking stromal cell component as a central role is a relatively rare case, which would change the logic of the linear model estimation, particularly setting stroma component to k ⁇ n (shown in Methods) . This will make another cell component not to be estimated.
- FAP is a gene in TME reported to be specifically expressed in stromal cells but not cancer cells. However, the expression pattern of FAP gene in the TME is still unclear.
- FAP is also found to be highly expressed in M2 macrophages, including PRAD, SARC, LUAD, LIHC, STAD, READ and COAD (Figure 7) , which provides an alternative option to select indications based on FAP expression in M2 macrophages besides stroma. Notice that stroma and M2 macrophages do not share many indications with high expression of FAP, suggesting that the stroma-based GSS estimation could largely and effectively expand the applicable indications.
- the present disclosure provides a solution to help select indications for both gene targets by visualizing GSSs about two genes in the same heatmap, simultaneously.
- the GSSs about ERBB2 and LAG3 are generated individually and are visualized as an entire heatmap in Figure 8.
- the indication selection could be manipulated for either individual genes or both genes. For instance, if cancer cells for ERBB2 and CD8 + T cells for LAG3 are focused, the indications such as KIRC and PAAD are good candidates, of which the GSSs are high in cancer cells for ERBB2 and high in CD8 + T cells for LAG3.
- the present disclosure provides a solution to help select indications for both gene targets by visualizing GSSs about two genes in the same heatmap, simultaneously.
- the GSSs about ERBB2 and LAG3 are generated individually and are visualized as an entire heatmap in Figure 8.
- the indication selection could be manipulated for either individual genes or both genes. For instance, if cancer cells for ERBB2 and CD8 + T cells for LAG3 are focused, the indications such as KIRC and PAAD are good candidates, of which the GSSs are high in cancer cells for ERBB2 and high in CD8 + T cells for LAG3.
Landscapes
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Biotechnology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Physiology (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Methods and systems for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort.
Description
The present invention belongs to the field of bioinformatics, and in particular relates to methods and systems for processing bulk sequencing data to determine expression of a gene of interest in individual cellular components.
Cancers are complex diseases featuring abnormal and malignant cell growth and eventually, if left uncontrolled, spreading to distant organs in the body. In solid tumor, the interactions among different cellular components including tumor cells, infiltrating immune cells as well as stromal cells form a dynamic morphological structure called tumor microenvironment (TME) . Within TME, various immune cells play important roles in overcoming tumorigenesis including detecting, infiltrating, inhibiting, and destroying tumors. Studying gene expression profiling in bulk tumor samples has been successful for the discovery of cancer cell specific target genes. However, since TME is a complex mixture of cancer, stromal and immune cells, pinpointing the expression status of various cellular sub-components would elevate mechanistic research, diagnosis, and treatment to a finer resolution.
In drug discovery, an important question to evaluate drug targets, understand molecular mechanisms and select indications for clinical development is to understand where the gene (s) of interest express, in cancer or TME, and in what tumor indications. To address this question, tens of thousands of bulk RNASeq data under repositories measuring human tumor biopsy and peripheral blood (e.g. TCGA and ICGC) were used to provide the overall expression level. However, bulk RNA-Seq data cannot provide the information of expression levels in cancer and TME individually, nor solving the resolution in major immune cell populations. In recent years, cell deconvolution algorithms have been widely used to computationally estimate the proportion of individual cell types from a bulk RNA-seq sample. However, individual gene expression in particular cell types cannot be directly inferred by cell deconvolution methods as it estimates the cellular proportion but not specific gene expressions. With the advancement of single cell RNA sequencing (scRNASeq) technologies, specific gene expression levels can be directly measured at single cell level. However, balancing cost-effectiveness and feasibility, single cell data in tumor typically have far fewer samples per indication, with a typical single cell cohort covering less than 20 patients. Therefore, scRNASeq data can provide gene expression patterns in a small cohort, and suitable for deep understanding of expression in known or new cell populations in indications of interest but not a broad representation in multiple indications for which tens of thousands of patient samples are needed.
To overcome these challenges and given the fact that bulk RNASeq data are widely available in large patient cohorts, computational approaches to address the aforementioned drug discovery question asking the gene expression pattern in tumor cells and TME are needed.
SUMMARY OF THE INVENTION
The present disclosure provides an indication-or population-based computational estimation of gene expression or its equivalent in individual cell types per tumor indication or per cohort.
In a first aspect, the present disclosure provides a method for processing data to determine expression of a gene of interest in individual cellular components in multiple samples from a cohort, the method comprising:
(a) receiving bulk RNASeq data of each of the samples including total expression of the gene of interest in the sample;
(b) obtaining the fraction of each of the cellular components in each of the samples; and
(c) analyzing the bulk RNASeq data and the fractions using a Gene-SELECT algorithm configured to quantify expression of the gene of interest of the individual cellular components in the samples.
In some embodiments of the method, the step (b) comprises:
(i) applying a cell deconvolution algorithm to deconvolve the fraction of each of the cellular components; or
(ii) receiving cell deconvolution data including the fraction of each of the cellular components.
In some embodiments of the method, the Gene-SELECT algorithm is configured to quantify the expression of the gene of interest in the individual cellular components by estimating expression coefficient of each of the cellular components.
In some embodiments, the Gene-SELECT algorithm estimates the expression of coefficient of each of the cellular components by a regression approach, e.g., by a multivariate linear regression approach.
In some embodiments of the method, expression of the gene of interest in the individual cellular components is determined by multiplying the expression coefficient by the fraction of each of the cellular components.
In some embodiments of the method, step (c) of the method comprises generating a Gene-SELECT Score (GSS) of the gene by combining expressions of the gene in the individual cellular components.
In some embodiments, the method further comprises visualizing the generated data matrix by a heatmap.
In some embodiments of the method, the multiple samples comprise at least 30 samples, at least 50 samples, or at least 80 samples.
In some embodiments of the method, the multiple samples comprise a number of samples that is at least three times, at least four times, or at least five times of the number of the individual cellular components.
In some embodiments of the method, the cohort comprises subjects with a cancer, and the samples are tumor tissue samples from the subjects.
In some embodiments, the individual cellular components comprise cancer cells, stromal cells and leukocytes.
In some embodiments, the leukocytes are further divided into neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes.
In some embodiments, the mast cells are further divided into activated mast cells and resting mast cells; the dendritic cells are further divided into activated dendritic cells and resting dendritic cells; the macrophages are further divided into M0 macrophages, M1 macrophages, M2 macrophages, and monocytes; and/or the lymphocytes are further divided into NK cells, T cells, and B cells.
In some embodiments, the NK cells are further divided into activated NK cells and resting NK cells; the T cells are further divided into gamma delta T cells, regulatory T cells (Tregs) , follicular helper T cells, CD4+ T cells, and CD8+ T cells; and/or the B cells are further divided into plasma cells, memory B cells and
B cells.
In some embodiments, the CD4+ T cells are further divided into activated memory CD4+ T cells, resting memory CD4+ T cells, and
CD4+ T cells.
In some embodiments, the individual cellular components comprise those listed in Table 1 below:
Table 1. Leukocytes components
| Leukocyte Type/Component |
| Neutrophils |
| Eosinophils |
| Mast Cells Activated |
| Mast Cells Resting |
| Dendritic Cells Activated |
| Dendritic Cells Resting |
| Macrophages M0 |
| Macrophages M1 |
| Macrophages M2 |
| Monocytes |
| NK Cells Activated |
| NK Cells Resting |
| T Cells gamma delta |
| T Cells Regulatory Tregs |
| T Cells Follicular Helper |
| T Cells CD4 Memory Activated |
| T Cells CD4 Memory Resting |
| T Cells CD4 Naive |
| T Cells CD8 |
| Plasma Cells |
| B Cells Memory |
| B Cells Naive |
In some embodiments of the method of the present disclosure, the cohort comprises subjects with the same cancer type. In some embodiments, the Gene-SELECT algorithm is configured to quantify expression of the gene of interest in the individual cellular components averaged for the samples in terms of the cancer type.
In other embodiments, the cohort is comprised of subjects with different cancer types, and expression of the gene of interest in individual cellular components for each of the cancer types is determined. In some embodiments, the Gene-SELECT algorithm is configured to quantify expression of the gene of interest in the individual cellular components averaged for each of the cancer types.
In some embodiments of the method where the cohort is comprised of subjects with different cancer types, step (c) of the method comprises generating a Gene-SELECT Score (GGS) of the gene by combining expressions of the gene in the individual cellular components for each of the cancer types. In some embodiments, the method further comprises generating a data matrix of GGS of the gene by combining GGSs for the different cancer types.
In some embodiments, the method further comprises selecting the cell component (s) with high (e.g., highest, or top 3, or top 5) expression of the gene of interest across samples from multiple or all of the cancer types evaluated.
In some embodiments, the method further comprises ranking the expression of the gene of interest in the selected cell component (s) by cancer types.
In some embodiments of the method of the present disclosure, the expression of one or two genes of interest are determined in the method.
In some embodiments of the method of the present disclosure, the expressions of more than one gene of interest are determined in the method.
In some embodiments, the gene of interest encodes a tumor antigen or a tumor associated antigen (TAA) .
In a second aspect, the present disclosure provides a system for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort, the system comprising:
(a) an input module configured to receive bulk RNASeq data of each of the samples comprising total expression of the gene of interest in the sample; and
(b) an analyzing module configured to quantify expression of the gene of interest in the individual cellular components with a Gene-SELECT algorithm according to the method of the first aspect of the present disclosure,
wherein the input module is further configured to receive cell deconvolution data including the fraction of each of the cellular components, and/or wherein the analyzing module is further configured to apply a cell deconvolution algorithm to deconvolve the fraction of each of the cellular components.
In a third aspect, the present disclosure provides a device for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort, the device comprising:
a memory for storing computer program instructions; and
a processor for executing computer program instructions,
wherein when the computer program instructions are executed by the processor, the device executes the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides a computer-readable medium, which stores computer program instructions, wherein when the computer program instructions are executed by a processor, the method of the first aspect or the second aspect of the present disclosure is executed.
An understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
Figure 1 illustrates the framework for high-throughput and high-resolution mRNA expression level estimation.
Figure 2 illustrates that the bulk expression level of a particular gene is a summation of expression levels of its individual cellular components, wherein the tumor tissue comprises cancer cells, stromal cells, and leukocytes, while leukocytes are part of the stromal cells. The bulk expression level of a gene could be deconvoluted into expression levels in individual cellular components.
Figure 3 depicts visualization of the Gene-SELECT Scores for LAG3: (A) indications ordered by pattern similarity; (B) Indications ordered by alphabetical and specific subtypes.
Figure 4 illustrates validation of LAG3 Single-cell expression in melanoma. A single-cell RNA-seq dataset in melanoma contains CD8 T cell, CD4 T cell, B cell, macrophage, CAF, endothelial cell, NK cell and malignant cell, with LAG-3 expressed cell amounts as well as total cell amounts listed under the X-axis corresponding to each cell component.
Figure 5 illustrates validation of LAG3 Single-cell expression in colorectal cancer. A single-cell RNA-seq dataset in colorectal cancer contains CD8 T cell, CD4 T cell, myeloid cell, B cell, ILC, CAF, epithelial cell, fibroblast and malignant cell, with LAG-3 expressed cell amounts as well as total cell amounts listed under the X-axis corresponding to each cell component.
Figure 6 illustrates visualization of large categories of cellular components, including cancer cells, neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes. The stromal cell component is set to k=n and not evaluated in Gene-SELECT algorithm.
Figure 7 shows stromal-based estimation and visualization of FAP, wherein the cancer cell component is set to k=n and not evaluated in Gene-SELECT algorithm.
Figure 8 shows dual visualization of ERBB2 (HER2) and LAG3, wherein the first panel is the label of genes, the second and third panel demonstrate the averaged expression levels of two genes respectively per indication, and the column of heatmap is ordered by indication names, with two genes together per indication.
Disclosed herein, in some embodiments, are methods and systems for analyzing complex data signals using the Gene-SELECT (
Gene
Specific
Expression
Loading by
Estimating
Cellular
Transcripts) algorithm to determine output pertaining to the state or status of one or more parameters. The challenges may occur when specific signals should be precisely distinguished and extracted from a mixture of signals. Accordingly, disclosed herein are algorithms and methods enabling the determination of the status of sources that correspond to the signals of interest.
The Gene-SELECT algorithm is an algorithm utilizing the bulk RNA-seq data and cellular components fraction data of samples (e.g., cancer tissue samples) to calculate the expression patterns of one or more genes in individual cellular components (e.g. cancer cells, stromal cells and immune cells) by a regression approach.
Specifically, the Gene-SELECT algorithm disclosed herein uses multivariate linear regression to estimate each of the expression coefficients of individual cellular components in a population of samples, e.g., per indication, given a particular gene of interest. Sufficient sample size is desired to give statistically reliable estimates. For example, if 20 cellular components are given, at least 60 or preferably at least 80 samples would be suggested. The composition of cellular components could be adjusted, integrated, or pruned. The Gene-SELECT scores are generated for each gene by multiplication between the average fraction in a cellular component across the indication and the corresponding expression coefficient in the same cellular component. In this way, the score reflects the relative expression levels of that gene in each cellular component in the same indication. Gene-SELECT scores can be visualized in specialized ways such as single-gene heatmaps and multiple-gene heatmaps.
Disclosed herein, in another aspect, is a computer-implemented method for the determination of cell-based gene expression signals from the complex mixture of gene expression signals for designated datasets based on samples (e.g. clinical patient samples) (Figure 1) . The method is comprised of: (a) receiving the dataset comprising a group of signals, wherein the set of signals correspond to designated signal sources (including the levels of samples/patients, population/indication and gene expression) ; (b) utilizing cell deconvolution algorithm based cell fraction results, to deconvolve the set of signals to identify and quantify the independent signal sources (including samples/patients, population/indication and cell types/cellular components) ; (c) analyzing the designated dataset using the Gene-SELECT algorithm configured to deconvolve gene expression that correspond to the distinct cellular components and generate a data matrix for the Gene-SELECT Scores (GSSs) by combining parameters with corresponding cell fractions; and (d) visualizing the generated data matrix by using the tool of heatmap whether for a single gene or for two genes.
The number of samples to be used in the method is not particularly limited. In some embodiments, the samples comprise at least 20 samples, at least 30 samples, at least 50 samples, at least 80 samples, at least 100 samples, at least 200 samples, at least 300 samples, at least 500 samples, at least 800 samples, at least 1000 samples, at least 2000 samples, at least 3000 samples, or at least 5000 samples.
In some embodiments, the number of samples is at least two times, at least three times, at least four times, at least five times, at least 10 times, at least 20 times, at least 30 times, at least 50 times, or at least 100 times of the number of the individual cell components.
In some embodiments, the Gene-SELECT algorithm is configured to determine parameters based on designated cellular components in the unit of single gene and single population/indication.
In some embodiments, the Gene-SELECT algorithm identifies and quantifies the parameters for distinct cellular components using multivariate linear regression (MLR) model for gene expression deconvolution.
In some embodiments, the cell-based gene expression is scored by multiplying cell-based estimated parameters by corresponding cell fractions.
In some embodiments, the normalization method for visualization of GSS data matrix is default to scaling by each indication. According to Servant et al., setting trim to a number between 0 and 1 uses equidistant classes between the (trim) -and (1-trim) -quantile, and lumps the values below and above this range into separate open-ended classes. If the data comes from a heavy-tailed distribution, this can save the display from putting too many values into too few classes (See, Servant, Nicolas, et al. "Package ‘EMA’ . " (2020) ) . Accordingly, the trimmed value can be set to a value between 0 and 1, for example, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 or 0.9. In some embodiments, the trimmed value cutoff of scaling is default to 0.3.
In some embodiments, the datasets comprise bulk RNA-seq data and cell fraction data. In some embodiments, the dataset of bulk RNA-seq data is derived from a group of cancer tissue samples.
In some embodiments, the dataset of cell fraction signals corresponds to a series of cancer, stromal and immune cell types.
Another aspect of the present disclosure provides a prediction of Gene-SELECT in cancer cells, stromal cells, and various immune cells, the varieties comprising: (a) the most precise classification of leukocytes (Table 1) ; (b) the large categories of leukocytes, including neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes; (c) leukocyte as an entire class.
Another aspect of the present disclosure provides a method of helping optimizing the indication selection approach, the method comprising: (a) obtaining the dataset of a designated objective from the designated data source; (b) using Gene-SELECT algorithm to generate GSS data matrix; (c) visualizing GSS data matrix by heatmap; (d) selecting the cell component with most highly expression across most of indications; (e) ranking the GSSs by indications, and listing the rank of that GSS in the corresponding cell component for each indication.
In some embodiments, an objective is defined as one protein-coding gene.
In some embodiments, the designated data source is the bulk RNA-seq data matrix of the objective with labels of sample IDs and indications.
Another aspect of the present disclosure provides a method of visualizing GSSs for two objectives simultaneously, the method comprising: (a) obtaining the dataset of two designated objectives from the designated data source; (b) using Gene-SELECT algorithm to generate GSS data matrix for two objectives, respectively; (c) combining GSS data matrices of two objectives to an integrated form; (d) visualizing the integrated GSS data matrix by a heatmap.
In some embodiments, an objective is defined as one protein-coding gene.
In some embodiments, the designated data source is the bulk RNA-seq data matrix of the objective with labels of sample IDs and indications.
Methods
The framework of the methodology is depicted in Figure 1. There are totally four elements of input information utilized in this study: sample information (patient level) , population information (indication level) , individual gene expression (gene level) , and cell type information (cell level) . The methodology is comprised of three steps: (1) downloading bulk RNA-seq data from TCGA primary tumors for a given designated gene name, which data include patient level, indication level and gene level, without cell level; (2) obtaining the cell fraction data from TCGA primary tumors, which include patient level, indication level and cell level, without gene level; (3) estimating cell-based Gene-SELECT Scores (GSSs) according to both bulk RNA-seq data and cell fraction data and visualize GSSs.
1. Input Information
Bulk RNA-Seq Data
The Cancer Genome Atlas Database (TCGA) contains primary tumor samples with wide range of tumor indications and comprehensive bulk RNA-seq data, which has been reported to generate specific cell fractions for each patient sample.
The single-gene bulk RNA-seq data were downloaded from Qiagen Omicsoft TCGA Land, which contains 9862 primary tumor samples per gene across 32 indications (Table 2) .
Table 2. Summarization of indications and sample sizes of primary tumor samples downloaded from Qiagen Omicsoft TCGA Land
Cellular Components
Currently three layers of complexity have been defined for the cell component classifications: 1) cancer cell, leukocyte, and stromal cell; 2) large categories of cellular components including cancer, neutrophils, eosinophils, mast cells, dendritic cells, macrophages, lymphocytes, and stroma; 3) small categories with comprehensively 23 cellular components (Table 1, see above) . Any other classifications of cellular components are highly flexible to be designed.
2. Cell Deconvolution Algorithm
Cell deconvolution is a method to estimate specific cell fractions/proportions based on bulk RNA-seq data. In the following examples, the original cell deconvolution method was utilized (Abbas et al. PLoS ONE, 2009) for providing cell fraction data of the present disclosure, which is summarized as follows.
Expression deconvolution was performed on linear, untransformed data as follows: in each mixture sample, the total expression signal of each microarray probe was modeled as the sum of the expression signals of its constituent parts, each of which is described as a product of the expression signal of that probe in that purified cell type times the fractional abundance of that cell type in the mixture.
A
11X
1 + A
12X
2 ... A
1jX
j = B
1
A
21X
1 + A
22X
2 ... A
2jX
j = B
2
............
A
i1X
1 + A
i2X
2 ... A
ijX
j = B
i
where A
ij is an expression signal measurement in a purified cell, B
i is an expression signal measurement in a mixture of cells, and X
j is a fractional abundance, for each of i probesets and j cell types. These equations may be rewritten as a matrix equation:
AX = B
where A is the basis matrix of the expression levels of all probesets in all cell types, B is the vector of expression levels of all probesets in one mixture, and X is the vector of the relative levels of cell types comprising B. The equation was solved for X with the R function ‘lsfit’ (a linear least squares algorithm) followed by removal of the lowest negative coefficient from the equation and iteration of the solution if necessary until all coefficients were nonnegative. Fractions of the cell types were determined by dividing the coefficients by the yield of mRNA per cell input.
Here the X, which is the predicted fraction vector of cellular components, is the output of cell deconvolution procedure, and the input of the Gene-SELECT analysis. The data matrix of cell deconvolution result is originated from Supplemental Information (Thorsson et al. Immunity, 2018: Table S1. PanImmune Feature Matrix of Immune Characteristics https: //www. cell. com/cms/10.1016/j. immuni. 2018.03.023/attachment/1b63d4bc-af31-4a23-99bb-7ca23c7b4e0a/mmc2. xlsx) for multiple reasons: 1) it is based on RNA-seq data of 11080 primary tumor samples within 33 indications from TCGA; 2) for each sample, it estimates cancer cell fractions, leukocyte fractions as well as stromal cell fractions (1 –cancer cell fractions) ; 3) cancer cell fractions are produced from bulk RNA-seq and tumor purity, while leukocyte fractions are produced from DNA methylation data; 4) it comprehensively includes 22 leukocyte components (Table 1, see above) . Only 23 indications with sample size > 75 remains for the following analysis (Table 3) .
Table 3. Indications and sample sizes of primary tumor samples filtered by cell fraction data
3. Gene-SELECT Algorithm
The basic overview of Gene-SELECT algorithm is depicted in Figure 2. The bulk expression level of a particular gene is a summation of expression levels of its individual cellular components. The tumor tissue is composed of cancer cells, stromal cells, and leukocytes, while leukocytes are part of the stromal cells. Therefore, the bulk expression level of a gene could be deconvolute into expression levels in individual cellular components, and the resolution of leukocytes could be further split into various subtypes (Figure 2) . The ultimate resolution depends on the resolution of cell fraction data.
The Gene-SELECT algorithm is based on the multivariate linear regression, which estimates the expression coefficients of individual cellular components in a population/indication-based strategy. This population-based strategy, scarifying sample-wise resolution of expression, could highly keep the coefficients robust enough to reflect the cell-based expression in an entire indication, which ensures the strong power in indication selection.
Generally, given one certain indication, the expression coefficients of multivariate linear regression model for gene i are fitted as an equation solving approach:
Exp
i1~β
C1_i×X
C1_1+β
C2_i×X
C2_1+…+β
Cn_i×X
Cn_1
Exp
i2~β
C1_i×X
C1_2+β
C2_i×X
C2_2+…+β
Cn_i×X
Cn_2
............
Exp
ij~β
C1_i×X
C1_j+β
C2_i×X
C2_j+…+β
Cn_i×X
Cn_j
where theExp
ij is the bulk RNA-seq expression level of a particular gene i in samplej, β
Cn_j are expression coefficients in samplej, X
Cn_j are cellular components in samplej with known proportions estimated from bulk RNA-seq, tumor purity or methylation data independent of specific genes.
These equations may be rewritten as a matrix equation:
E = X·B
where E is the vector of the expression levels of gene i from bulk RNA-seq data inj samples, B is the vector of expression coefficients to be estimated, and X is the matrix of the relative fractions of cellular components comprising tumor microenvironment (TME) .
More formally, given one certain population of samples j = 1, 2, …, m, the expression coefficient vector β
ik of a series of cellular components to be estimated for a particular gene i is estimated as:
which is constrained to
where Exp
i and X
k are both m-length vectors respectively as
For each of gene i, Exp
ij is the bulk RNA-Seq expression level in samplej, X
jk is the fraction (proportion) vector for thekth cell component in samplej, α
i is the population-based intercept term, ε
i is the population-based error term assumed to follow Gaussian distribution, and k = 1, 2, …, n as cellular components. Particularly, k =n is default to stromal cell component, of which β
i, n does not need to be estimated, since one degree of freedom is decreased by the constraint condition. In some special situations, while gene expression in stromal cell needs to be estimated, cellular components other than stromal cell would be set as k =n (see Example 3) .
The composition of cellular components could be pruned or simplified, since cell fractions per sample are additive. For example, the most simplified model (cancer vs. leukocyte) is defined as
Exp
ij=α
ij+β
CC_i×X
CC_ij+β
LC_i×X
LC_ij+ε
ij forj = 1, 2, …, m
whereβ
CC_i and β
LC_i are the expression coefficients to be estimated corresponding to cancer cell and leukocyte components for gene i, respectively.
All the β values are produced for the next analysis.
Generation of Gene-SELECT Scores (GSSs)
For a given population (e.g., a tumor type or an indication) , individual “averaged” expression level for cellular components, defined as the “Gene-SELECT Score” , is calculated by
More formally, GSS
ik, the Gene-SELECT Score of gene i in the kth cell component, is defined as
where GSS
ik is the Gene-SELECT Score of gene i, β
ik is the estimated coefficient of gene i in the kth cell component, and
is the averaged fraction in m samples of a given population in the kth cell component.
Ultimately, for each gene, a score matrix is produced with each row a cell component and each column an indication.
Visualization of Gene-SELECT Scores
Based on data matrix of GSSs for each gene, heatmap is utilized as the tool for visualizing the ultimate GSSs by genes, cellular components and indications (Figure 3, 6-8) . Currently, single gene heatmap and double gene heatmap are applicable. The default setting of single gene heatmap is as below: 1) each heatmap represents one gene; 2) each row is a cell component, and each column is an indication; 3) row order is fixed, and columns are either sorted by alphabetic order or clustered by pattern similarity; 4) the normalization is applied to all cellular components per indication; 5) trimmed cutoff values are used to smoothen the color scale potentially distorted by outliers.
After data matrices of GSSs for two genes are merged, double gene heatmap is utilized for the combinatorial visualization of two genes (Figure 8) . The default setting of single gene heatmap is as below: 1) each heatmap represents two genes together; 2) each row is a cell component, and each two column is an indication by genes; 3) row order is fixed, and columns are either sorted by alphabetic order or clustered by pattern similarity; 4) considering the different expression baseline between two genes across indications, the baseline expression values are also mapped to the double gene heatmap.
Example 1. Gene-SELECT Estimation for Indication Selection
The bulk RNA-seq data of LAG3 gene expression, together with cell fraction data about 23 cellular components in 9862 samples across 25 indications (3 of which are customized indications) , are selected for Gene-SELECT analysis. After Gene-SELECT algorithm is applied, the GSSs in the form of data matrix is generated (Table 4) , which is then used to deliver heatmap visualization. There are two options for heatmap visualization: indications ordered by pattern similarity (Figure 3A) , and indications ordered by alphabetical and specific subtypes (Figure 3B) .
Table 4. Data matrix of Gene-SELECT Scores of LAG3 gene
GSS-based indication selection is manipulated by a two-step strategy: first, cellular components with high expression are picked up, for which the ranked list of Gene-SELECT Scores is then extracted. Particularly, LAG3 is highly expressed in CD8
+ T cell widely across most indications. Then, the GLSs of LAG3 in CD8
+ T cell are specifically pulled out and ranked from high to low (Table 5) .
Table 5. GSSs of LAG3 gene in CD8+ T cells
From the ranked list, melanoma indications including UVM and SKCM are top 2 indications, while UVM, SKCM, and KIRC are indications with GSS > 1. The GSS cutoff is highly dependent on the case of individual genes, suggesting consideration of the top indications for the selected cell component would be a better solution for the further studies.
Single cell RNA-seq datasets in melanoma and colorectal cancer were utilized to cross-validate the Gene-SELECT estimation of LAG3, since both indications have high GSSs in CD8
+T cell that is the top cell component in both indications (Table 5, the fourth column) . In melanoma (Jerby-Arnon L et al. Cell, 2018) , LAG3 is expressed in about half of CD8
+ T cells (851/1759, Figure 4) , which is the highest among all cellular components. Similarly, in colorectal cancer (Zhang L et al. Cell, 2020) , LAG3 is expressed in more than half of CD8
+ T cells (1430/2405, Figure 5) , which is also the highest among all cellular components. Both datasets well confirmed the highest expression of LAG3 in CD8
+ T cell, demonstrating that Gene-SELECT performs robust estimation of gene expression in cellular components.
Example 2: Large Categories of Cellular Components
The present disclosure provides a simplified option of GSS estimation, with large categories of cellular components including cancer cells, neutrophils, eosinophils, mast cells, dendritic cells, macrophages, lymphocytes, and stromal cells (shown in Method, Cellular Components) . By setting stromal cell component to k=n, large categories of cellular components could be visualized, as shown in Figure 6. Notice that the definition of the cell categories could be highly customized based on the specific scenario.
Example 3: Stromal-based Gene-SELECT estimation
Sometimes stromal cells are very important for the tumor function. Taking stromal cell component as a central role is a relatively rare case, which would change the logic of the linear model estimation, particularly setting stroma component to k≠n (shown in Methods) . This will make another cell component not to be estimated.
FAP is a gene in TME reported to be specifically expressed in stromal cells but not cancer cells. However, the expression pattern of FAP gene in the TME is still unclear. The Gene-SELECT estimation of FAP in TME is constructed by both setting stroma component to k≠n and setting cancer component to k=n, simultaneously. The expression pattern is shown in Figure 7. As expected, FAP is highly expressed in stroma component across most of indications, including MESO, KIRP, KIRC, UCEC, BRCA, CESC, BLCA, and HNSC. Interestingly, FAP is also found to be highly expressed in M2 macrophages, including PRAD, SARC, LUAD, LIHC, STAD, READ and COAD (Figure 7) , which provides an alternative option to select indications based on FAP expression in M2 macrophages besides stroma. Notice that stroma and M2 macrophages do not share many indications with high expression of FAP, suggesting that the stroma-based GSS estimation could largely and effectively expand the applicable indications.
Example 4: Dual visualization of two genes
The present disclosure provides a solution to help select indications for both gene targets by visualizing GSSs about two genes in the same heatmap, simultaneously. As an example, the GSSs about ERBB2 and LAG3 are generated individually and are visualized as an entire heatmap in Figure 8. In this kind of figure, the indication selection could be manipulated for either individual genes or both genes. For instance, if cancer cells for ERBB2 and CD8
+ T cells for LAG3 are focused, the indications such as KIRC and PAAD are good candidates, of which the GSSs are high in cancer cells for ERBB2 and high in CD8
+ T cells for LAG3.
The present disclosure provides a solution to help select indications for both gene targets by visualizing GSSs about two genes in the same heatmap, simultaneously. As an example, the GSSs about ERBB2 and LAG3 are generated individually and are visualized as an entire heatmap in Figure 8. In this kind of figure, the indication selection could be manipulated for either individual genes or both genes. For instance, if cancer cells for ERBB2 and CD8
+ T cells for LAG3 are focused, the indications such as KIRC and PAAD are good candidates, of which the GSSs are high in cancer cells for ERBB2 and high in CD8
+ T cells for LAG3.
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments described herein may be employed. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims (27)
- A method for processing data to determine expression of a gene of interest in individual cellular components in multiple samples from a cohort, the method comprising:(a) receiving bulk RNASeq data of each of the samples including total expression of the gene of interest in the sample;(b) obtaining the fraction of each of the cellular components in each of the samples; and(c) analyzing the bulk RNASeq data and the fractions using a Gene-SELECT algorithm configured to quantify expression of the gene of interest of the individual cellular components in the samples.
- The method of claim 1, wherein the step (b) comprises:(i) applying a cell deconvolution algorithm to deconvolve the fraction of each of the cellular components; or(ii) receiving cell deconvolution data including the fraction of each of the cellular components.
- The method of claim 1 or 2, wherein the Gene-SELECT algorithm is configured to quantify expression of the gene of interest in the individual cellular components by estimating expression coefficient of each of the cellular components.
- The method of claim 3, wherein the Gene-SELECT algorithm estimates the expression of coefficient of each of the cellular components by a regression approach, e.g. by a multivariate linear regression approach.
- The method of claim 3 or 4, wherein the expression of the gene of interest in the individual cellular components is determined by multiplying the expression coefficient by the fraction of each of the cellular components.
- The method of any of claims 1-5, wherein the step (c) comprises generating a Gene-SELECT Score (GSS) of the gene by combining expressions of the gene in the individual cellular components.
- The method of claim 6, wherein the method further comprises visualizing the generated data matrix by a heatmap.
- The method of any of claims 1-7, wherein the multiple samples comprise at least 30 samples, at least 50 samples, or at least 80 samples.
- The method of any of claims 1-7, wherein the multiple samples comprise a number of samples that is at least three times, at least four times or at least five times of the number of the individual cellular components.
- The method of any of claims 1-9, wherein the cohort comprises subjects with a cancer, and the samples are tumor tissue samples from the subjects.
- The method of claim 10, wherein the individual cellular components comprise cancer cells, stromal cells and leukocytes.
- The method of claim 11, wherein the leukocytes are further divided into neutrophils, eosinophils, mast cells, dendritic cells, macrophages, and lymphocytes.
- The method of claim 12, wherein:the mast cells are further divided into activated mast cells and resting mast cells;the dendritic cells are further divided into activated dendritic cells and resting dendritic cells;the macrophages are further divided into M0 macrophages, M1 macrophages, M2 macrophages, and monocytes; and/orthe lymphocytes are further divided into NK cells, T cells, and B cells.
- The method of claim 13, wherein:the NK cells are further divided into activated NK cells and resting NK cells;the T cells are further divided into gamma delta T cells, regulatory T cells (Tregs) , follicular helper T cells, CD4+ T cells, and CD8+ T cells; and/or
- The method of any of claims 10-15, wherein the cohort comprises subjects with the same cancer type.
- The method of any of claims 10-15, wherein the cohort comprises subjects with different cancer types, and the expression of the gene of interest in individual cellular components for each of the cancer types is determined.
- The method of claim 17, wherein the step (c) comprises generating a Gene-SELECT Score (GGS) of the gene by combining expressions of the gene in the individual cellular components for each of the cancer types.
- The method of claim 18, wherein the method further comprises generating a data matrix of GGS of the gene by combining GGSs for the different cancer types.
- The method of claim 19, wherein the method further comprises selecting the cell component (s) with high expression of the gene of interest across samples from multiple or all of the cancer types.
- The method of claim 20, wherein the method further comprises ranking the expression of the gene of interest in the selected cell component (s) by cancer types.
- The method of any of claims 1-21, wherein the expression of one or two genes of interest are determined.
- The method of any of claims 1-21, wherein the expressions of more than one genes of interest are determined.
- The method of any of claims 1-23, wherein the gene of interest encodes a tumor antigen or a tumor associated antigen.
- A system for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort, the system comprising:(a) an input module configured to receive bulk RNASeq data of each of the samples comprising total expression of the gene of interest in the sample; and(b) an analyzing module configured to quantify expression of the gene of interest in the individual cellular components with a Gene-SELECT algorithm according to the method of any of claims 1-24,wherein the input module is further configured to receive cell deconvolution data including the fraction of each of the cellular components, and/or wherein the analyzing module is further configured to apply a cell deconvolution algorithm to deconvolve the fraction of each of the cellular components.
- A device for processing data to determine expression of a gene of interest in individual cellular components in samples from a cohort, the device comprising:a memory for storing computerprogram instructions; anda processor for executing computer program instructions,wherein when the computer program instructions are executed by the processor, the device executes the method of any of claims 1-24.
- A computer-readable medium, which stores computer program instructions, wherein when the computer program instructions are executed by a processor, the method according to any of claims 1-24 is executed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/074991 WO2023142041A1 (en) | 2022-01-29 | 2022-01-29 | Methods for processing sequencing data and uses thereof |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/074991 WO2023142041A1 (en) | 2022-01-29 | 2022-01-29 | Methods for processing sequencing data and uses thereof |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023142041A1 true WO2023142041A1 (en) | 2023-08-03 |
Family
ID=87470171
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/074991 Ceased WO2023142041A1 (en) | 2022-01-29 | 2022-01-29 | Methods for processing sequencing data and uses thereof |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023142041A1 (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017004153A1 (en) * | 2015-06-29 | 2017-01-05 | The Broad Institute Inc. | Tumor and microenvironment gene expression, compositions of matter and methods of use thereof |
| WO2020068731A1 (en) * | 2018-09-25 | 2020-04-02 | The General Hospital Corporation | Methods for integrating cell gene expression data from multiple single-cell data sets and uses thereof |
| CN112635063A (en) * | 2020-12-30 | 2021-04-09 | 华南理工大学 | Lung cancer prognosis comprehensive prediction model, construction method and device |
| WO2021092236A1 (en) * | 2019-11-05 | 2021-05-14 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for deconvoluting tumor ecosystems for personalized cancer therapy |
| WO2021092387A1 (en) * | 2019-11-08 | 2021-05-14 | Regeneron Pharmaceuticals, Inc. | Accurate and robust information-deconvolution from bulk tissue transcriptomes |
| WO2021108556A1 (en) * | 2019-11-26 | 2021-06-03 | The United States Of America, As Represented By The Secretary, Department Of Health And Human Services | Methods of identifying cell-type-specific gene expression levels by deconvolving bulk gene expression |
-
2022
- 2022-01-29 WO PCT/CN2022/074991 patent/WO2023142041A1/en not_active Ceased
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017004153A1 (en) * | 2015-06-29 | 2017-01-05 | The Broad Institute Inc. | Tumor and microenvironment gene expression, compositions of matter and methods of use thereof |
| WO2020068731A1 (en) * | 2018-09-25 | 2020-04-02 | The General Hospital Corporation | Methods for integrating cell gene expression data from multiple single-cell data sets and uses thereof |
| WO2021092236A1 (en) * | 2019-11-05 | 2021-05-14 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for deconvoluting tumor ecosystems for personalized cancer therapy |
| WO2021092387A1 (en) * | 2019-11-08 | 2021-05-14 | Regeneron Pharmaceuticals, Inc. | Accurate and robust information-deconvolution from bulk tissue transcriptomes |
| WO2021108556A1 (en) * | 2019-11-26 | 2021-06-03 | The United States Of America, As Represented By The Secretary, Department Of Health And Human Services | Methods of identifying cell-type-specific gene expression levels by deconvolving bulk gene expression |
| CN112635063A (en) * | 2020-12-30 | 2021-04-09 | 华南理工大学 | Lung cancer prognosis comprehensive prediction model, construction method and device |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Woo et al. | Conservation of copy number profiles during engraftment and passaging of patient-derived cancer xenografts | |
| Vlachavas et al. | Radiogenomic analysis of F-18-fluorodeoxyglucose positron emission tomography and gene expression data elucidates the epidemiological complexity of colorectal cancer landscape | |
| JP7545891B2 (en) | Systems and methods for analyzing mixed cell populations - Patents.com | |
| Aran et al. | xCell: digitally portraying the tissue cellular heterogeneity landscape | |
| EP4073805B1 (en) | Systems and methods for predicting homologous recombination deficiency status of a specimen | |
| JP2024016039A (en) | An integrated machine learning framework for estimating homologous recombination defects | |
| Peng et al. | Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer | |
| US20210381056A1 (en) | Systems and methods for joint interactive visualization of gene expression and dna chromatin accessibility | |
| White et al. | Community assessment of methods to deconvolve cellular composition from bulk gene expression | |
| JP2015526816A (en) | Population classification of genetic datasets using tree-type spatial data structures | |
| JP7357023B2 (en) | Method and system for generating non-coding-coding gene co-expression networks | |
| Xiong et al. | Construction and validation of a risk scoring model for diffuse large B-cell lymphoma based on ferroptosis-related genes and its association with immune infiltration | |
| Skok Gibbs et al. | PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization | |
| Saha et al. | Bayesian inference of sample-specific coexpression networks | |
| Bhattacharya et al. | DeCompress: tissue compartment deconvolution of targeted mRNA expression panels using compressed sensing | |
| Ramazzotti et al. | Longitudinal cancer evolution from single cells | |
| WO2023142041A1 (en) | Methods for processing sequencing data and uses thereof | |
| US20160357906A1 (en) | Biological data annotation and visualization | |
| WO2022190495A1 (en) | Mechanical detection of breakpoint candidate of copy number variant on genome sequence | |
| Kulm et al. | Benchmarking the accuracy of polygenic risk scores and their generative methods | |
| CN117766024B (en) | Ovarian cancer CD8+T cell related prognosis evaluation method, system and application thereof | |
| Shah et al. | Model-based clustering of array CGH data | |
| Zhang et al. | Deconer: An evaluation toolkit for reference-based deconvolution methods using gene expression data | |
| Subramanian et al. | Novel multisample scheme for inferring phylogenetic markers from whole genome tumor profiles | |
| Kuznetsov et al. | Statistically weighted voting analysis of microarrays for molecular pattern selection and discovery cancer genotypes |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22922872 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 22922872 Country of ref document: EP Kind code of ref document: A1 |