WO2025104702A1 - Perturbation-related genetic profile analysis - Google Patents

Perturbation-related genetic profile analysis

Info

Publication number
WO2025104702A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
cell
genes
property
perturbed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/IB2024/061436
Other languages
English (en)
Inventor
Wenmin ZHAO
Sera Aylin CAKIROGLU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cosyne Therapeutics Ltd
Original Assignee
Cosyne Therapeutics Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cosyne Therapeutics Ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • Cell properties such as gene expression, cell viability, structure and function are widely variable between individual cells within any given cell population. Each cellular property is influenced by the cellular transcriptome, the protein-coding portion of an organism’s genome, and transcriptomic regulation.
  • Described herein are embodiments of techniques for generative and/or predictive artificial intelligence (AI)-driven modeling for determining the genetic profile, gene expression values, targets for therapeutic interventions, and cell properties for a cell or a cell population of interest.
  • AI generative and/or predictive artificial intelligence
  • methods comprising simulating a result of a perturbation of a gene of a target cell, wherein the simulating comprises generating an adjusted genetic profile for each cell of a population of cells of a simulation and each cell of the population of cells corresponds to the target cell, the generating comprising: receiving an input that identifies (a) a polynucleotide sequence from at least a first gene of the target cell, and/or (b) a perturbed first gene of the target cell, and generating, based on the input and using one or more trained models, the adjusted genetic profile for each cell of the population of cells, wherein generating the adjusted genetic profiles comprises simulating variation in gene expression between cells of the population of cells; and outputting a first adjusted genetic profile of a first cell from among the population of cells.
  • methods comprising: evaluating a genetic profile for a cell to identify a perturbation that had been made to at least one gene of a target cell and yielded the genetic profile, the cell corresponding to the target cell, the evaluating comprising: receiving input that identifies a set of sequences for the cell; determining, based on the input and using the one or more trained models, a set of genes altered in the cell with respect to a set of reference sequences for the target cell; determining, using the one or more trained models, the perturbation made to the at least one gene of the set of genes; outputting an identification of the perturbation.
  • the set of sequences are RNA sequences.
  • the set of sequences are DNA sequences.
  • the set of sequences encode for a set of proteins.
  • methods comprising: predicting at least one gene to perturb in a target cell to yield an adjusted genetic profile for the target cell, the predicting comprising: receiving input identifying a desired property of a first cell, the first cell corresponding to the target cell; identifying, using one or more trained models, one or more target genes associated with the cell property of the first cell corresponding to the target cell, wherein the one or more trained models is trained based on adjusted genetic profiles of single cells in response to one or more perturbations, wherein the adjusted genetic profiles comprise a changed expression level of one or more genes affected by the one or more perturbations; identifying, using the one or more trained models, one or more perturbations that when made to the one or more target genes yield the desired property in the first cell corresponding to the target cell; and outputting an identification of the one or more target genes.
  • methods comprising: receiving an input that identifies a perturbed first gene in a cell and a first gene expression value in the cell; generating, using one or more trained models, a genetic profile for the cell, wherein the genetic profile of the cell comprises a changed expression level of a second gene or a second set of genes that are impacted by the perturbed first gene, wherein the perturbed first gene is different from the second gene or the second set of genes; and outputting the genetic profile for the cell.
  • methods comprising: receiving a gene expression value of a first gene from a cell or a population of cells; generating, using one or more trained models, one or more perturbed genes associated with a change in the gene expression value of the first gene; and outputting the one or more perturbed genes.
  • storage media that have encoded thereon executable instructions that, when executed by at least one processor, cause the at least one processor to perform any one, or any combination of, the methods or method acts described above or otherwise provided herein.
  • apparatuses including systems
  • processors that comprise at least one processor and storage media that have encoded thereon executable instructions that, when executed by the processor(s), cause the processor(s) to perform any one, or any combination of, the methods or method acts described above or otherwise provided herein.
  • such systems may be or include at least one computer-readable storage medium having encoded thereon executable instructions that, when executed by at least one processor, cause the at least one processor to carry out any one or any combination of the methods described above or otherwise described herein, or any combination of acts described herein.
  • such systems may be or include at least one circuit configured to perform any one or any combination of the methods described above or otherwise described herein, or any combination of acts described herein.
  • a circuit may be or include at least one processor and at least one storage medium having encoded thereon executable instructions that, when executed by the at least one processor, cause the at least one processor to carry out such a method, combination of methods, or combination of acts.
  • FIG. 1 shows a schematic of an illustrative single-cell RNA (scRNA) sequencing model with self-supervised learning that represents an in silico cellular genetic profile, with which some embodiments may operate.
  • scRNA single-cell RNA
  • FIG. 2 shows a schematic of another illustrative model that receives as input a knockdown target and produces as an output a genetic profile of a perturbed cell, with which some embodiments may operate.
  • FIG. 3 shows a schematic of an illustrative model arranged to predict a genetic profile in response to receiving as input a polynucleotide sequence for a target cell, with which some embodiments may operate.
  • FIG. 4 shows another schematic of an illustrative model arranged to predict a genetic profile in response to receiving as input a polynucleotide sequence or gene expression values of a gene for a target cell or a set of target cells, with which some embodiments may operate.
  • FIG. 5 shows a schematic diagram of examples of fine-tuning tasks that may be used with a scRNA-sequencing model to further train such a model to predict one or more genetic profiles for a cell, with which some embodiments may operate.
  • FIG. 6 shows a schematic diagram of two illustrative models trained on a batch of cells to generate an in silico sequence-based genetic profile and a cell-level genetic profile for a set of cells, with which some embodiments may operate.
  • FIG. 7 shows a schematic diagram of further examples of fine-tuning tasks that may be used in some embodiments to further train one or more models for predicting one or more gene expression values for a cell population.
  • FIG. 8 shows a schematic diagram of an illustrative process for training one or more models using information regarding a perturbed target and for outputting a genetic profile of a cell or a cell population, with which some embodiments may operate.
  • FIG. 9 shows a schematic diagram of an illustrative computing device with which some embodiments of the methods described herein may be implemented.
  • FIGS. 10A-10B depict an example implementation of some techniques described herein, showing an illustrative model (sometimes labeled CosySeq2Cell herein) together with an illustrative use case and outputs for that use case, in which the model predicts Poisson distributions per gene in each cell.
  • FIG. 10A shows a schematic overview of the illustrative CosySeq2Cell model and sampling of single cell transcriptomes from its predictions.
  • FIG. 10B shows graphs of the Pearson correlation coefficients of predicted vs. actual gene expression values per gene across all cells (left) and for all genes in a cell for all cells (right), including expression dropouts (e.g., zero observed expression) (left) and excluding dropouts (right).
  • FIGS. 11A-11D show an example of the illustrative model discussed above (CosySeq2Cell) using sequence information to accurately predict gene expression values.
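The comparison described for FIG. 10B, a Pearson correlation computed once with dropouts included and once with them excluded, can be sketched as follows. This is a minimal illustration only: the function names and the toy values are assumptions made here and are not taken from the application, and "dropout" is taken to mean an observed expression value of zero, as the figure description suggests.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def pearson_excluding_dropouts(predicted, observed):
    """Pearson correlation over positions whose observed value is non-zero."""
    kept = [(p, o) for p, o in zip(predicted, observed) if o != 0]
    if len(kept) < 2:
        return float("nan")
    return pearson([p for p, _ in kept], [o for _, o in kept])


# Hypothetical values for one gene across four cells (illustration only).
predicted = [0.2, 1.0, 2.0, 3.0]
observed = [0, 1, 2, 3]

r_with_dropouts = pearson(predicted, observed)
r_without_dropouts = pearson_excluding_dropouts(predicted, observed)
```

The same two functions apply whether the vectors run over cells for one gene or over genes for one cell; only the slicing of the expression matrix changes.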
  • FIG. 11A shows model aggregation weights for the embedding layers 200kb around the TSSs. The dotted line in the middle represents the TSS position.
  • FIG. 11B shows the expression change of HHEX under random replacement of the enhancer region (left) and 500bp downstream of the TSS (right).
  • FIG. 11C shows the CosySeq2Cell in silico perturbation. Shown are the Wasserstein distances between the predicted gene expression distributions given the original DNA sequence and when 100 bp are replaced with a random sequence.
  • FIG. 11D shows CosySeq2Cell vs Enformer enhancer prioritization performance on data. Results were generated by replacing the enhancer sequence with a random sequence, taking only the high confidence enhancers as ground truth and taking both high and low confidence enhancers as ground truth. The number of validated (p) and negative (n) enhancer-gene pairs within each window are shown under each bar.
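The Wasserstein distance mentioned for FIG. 11C can be illustrated with a short sketch. It assumes, for illustration, that each predicted gene expression distribution is a Poisson distribution (as the FIG. 10 description indicates) and that the 1-Wasserstein distance is meant; the function names, the truncation at `max_count`, and the example means are hypothetical.

```python
import math


def poisson_pmf(mean, max_count):
    """Poisson probabilities for counts 0..max_count, computed iteratively."""
    probs = [math.exp(-mean)]
    for k in range(1, max_count + 1):
        probs.append(probs[-1] * mean / k)
    return probs


def wasserstein_1d(pmf_a, pmf_b):
    """1-Wasserstein distance between two distributions on counts 0, 1, 2, ...

    On the integer line this equals the summed absolute difference of the
    two cumulative distribution functions.
    """
    distance, cdf_a, cdf_b = 0.0, 0.0, 0.0
    for p_a, p_b in zip(pmf_a, pmf_b):
        cdf_a += p_a
        cdf_b += p_b
        distance += abs(cdf_a - cdf_b)
    return distance


# Hypothetical predicted Poisson means for one gene, before and after a
# 100 bp window of the input sequence is replaced with random sequence.
original_mean, perturbed_mean = 5.0, 2.0
effect = wasserstein_1d(
    poisson_pmf(original_mean, 100), poisson_pmf(perturbed_mean, 100)
)
```

Scanning the replaced window along the sequence and recording `effect` at each position would give a per-position sensitivity profile of the kind the figure describes; how the application actually computes the distance is not stated in this section.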
  • FIGS. 12A-12B show graphs of in silico scRNA-seq data for cell populations with CosySeq2Cell in an example.
  • FIG. 12A shows the predicted Poisson means of gene expression distributions, sampled values from the Poisson distribution, target prediction of co-regulation model and actual gene expression values across 8k genes for a single cell sampled from the test set.
  • FIG. 12B shows correlations between the target and the predicted Poisson means, sampled values from the Poisson distribution and the predicted values of the co-regulation model across 900 cells. All the correlations here were calculated including dropouts (zeros).
  • FIGS. 13A-13D show another example implementation of some techniques described herein and how an illustrative model (sometimes labeled CosyFormer herein) in one illustrative use case predicts gene rankings accurately even at lower expression levels.
  • FIG. 13A shows an example of a schematic of the CosyFormer model, its training input and prediction task.
  • FIG. 13C shows mean Spearman correlations at different thresholds between ground-truth and predicted gene rankings for the top 100, 500 and 2000 expressed genes per cell in ~66k PBMCs.
  • FIG. 13D shows density plots of gene rankings: ground truth ranks for each of the 2000 highest expressed genes across all ~66k PBMCs are plotted along the x-axis and predicted ranking position along the y-axis. Density of the points in the area is indicated. 15% of the input genes were masked at random and the model was asked to generate full ranking outputs for all positions. Ranking positions were compared when the predicted genes were in the input ranking; white/empty areas indicate that the models were not able to correctly predict genes at the lower expression levels most of the time.
  • FIG. 14A shows, in an example of some of the techniques described herein, the average AUC of a 5-fold cross-validation fine-tuning of the illustrative CosyFormer-012 model and another illustrative model (sometimes labeled Geneformer herein) to predict gene labels for the gene in 15k ESC single cell transcriptomes.
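The ranking comparison described for FIGS. 13C-13D can be sketched as a Spearman correlation restricted to the top-k genes of the ground-truth ranking, skipping genes that do not appear in the predicted ranking. The helper names and gene labels below are hypothetical, and ties are assumed not to occur because the inputs are ranking positions.

```python
def to_ranks(values):
    """Rank positions (1 = smallest value); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    ranks_x, ranks_y = to_ranks(xs), to_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def top_k_rank_agreement(true_ranking, predicted_ranking, k):
    """Spearman correlation over the k highest-ranked ground-truth genes.

    Both arguments are gene lists ordered from highest to lowest expression.
    Genes absent from the predicted ranking are left out of the comparison.
    """
    predicted_position = {gene: i for i, gene in enumerate(predicted_ranking)}
    pairs = [
        (i, predicted_position[gene])
        for i, gene in enumerate(true_ranking[:k])
        if gene in predicted_position
    ]
    if len(pairs) < 2:
        return float("nan")
    return spearman([t for t, _ in pairs], [p for _, p in pairs])


# Hypothetical gene labels for one cell (illustration only).
truth = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
prediction = ["GENE_A", "GENE_C", "GENE_B", "GENE_D"]
agreement = top_k_rank_agreement(truth, prediction, k=4)
```

Averaging `agreement` over cells for k = 100, 500 and 2000 would give one mean correlation per threshold, which is the quantity the FIG. 13C description names.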
  • FIG. 14B shows the results for 4 different finetuning tasks: Bivalently marked vs lys4 only marked promoters, Bivalently marked vs not methylated promoters, and dosage sensitive vs insensitive TFs.
  • FIGS. 15A-15B show an example of some of the techniques described herein, in which another illustrative model (sometimes labeled PerturbFormer herein) in an illustrative use case is able to predict gene expression changes in perturbed single cell transcriptomes given a CRISPRi perturbation and an unperturbed sc transcriptome.
  • FIG. 15A shows a schematic of this example of PerturbFormer’s architecture as well as its training input and prediction task.
  • FIG. 15B shows boxplots of, for each of the methods, the number of predicted candidate targets of a perturbation that were amongst the top 20 differentially expressed genes between perturbed and unperturbed cells.
  • candidates were the top 300 genes that changed absolute rank between predicted and input transcriptomes (“perturbGPT”).
  • Predictions were compared with three baselines: calculating highly variable genes amongst the unperturbed cells (“highly variable”), taking the top 300 expressed genes (median-normalized) of a size matched random sample of unperturbed cells (“highly ranked”), and identifying the top 300 genes influencing the prediction of the perturbation target with GRNBoost (“grnboost”).
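The candidate selection described for FIG. 15B, taking the genes whose rank changes most between the input and the predicted transcriptome and counting how many fall among the top differentially expressed genes, can be sketched as below. The function names, the dictionary representation of a transcriptome, and the toy values are assumptions for illustration; the defaults of 300 candidates and 20 differentially expressed genes follow the figure description.

```python
def rank_positions(expression):
    """Map each gene to its position when sorted by descending expression."""
    ordered = sorted(expression, key=expression.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered)}


def candidate_targets(input_expression, predicted_expression, top_n=300):
    """Genes with the largest absolute rank change between two transcriptomes."""
    before = rank_positions(input_expression)
    after = rank_positions(predicted_expression)
    shared = [gene for gene in before if gene in after]
    shared.sort(key=lambda gene: abs(after[gene] - before[gene]), reverse=True)
    return shared[:top_n]


def count_hits(candidates, differentially_expressed, top_de=20):
    """Number of candidates found among the top differentially expressed genes."""
    return len(set(candidates) & set(differentially_expressed[:top_de]))


# Hypothetical toy transcriptomes (gene -> expression value).
unperturbed = {"g1": 10.0, "g2": 5.0, "g3": 1.0, "g4": 0.5}
predicted = {"g1": 0.2, "g2": 5.0, "g3": 6.0, "g4": 0.5}

candidates = candidate_targets(unperturbed, predicted, top_n=2)
hits = count_hits(candidates, ["g3", "g9"], top_de=20)
```

The three baselines named above would each produce their own 300-gene list, and `count_hits` could be applied to every list in the same way to give comparable numbers per method.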
  • FIGS. 16A-16B show an example in which in silico mutagenesis reveals learned connections in an illustrative model operating in accordance with some techniques described herein (sometimes labeled GE-CosyFormer herein).
  • FIG. 16A shows a schematic of GE-CosyFormer. Token ID embeddings of CosyFormer are replaced with a linear layer that projects aggregated Enformer embeddings of the gene’s DNA-sequence into the token embedding space.
  • FIG. 16B shows barplots of precision (top) and recall (bottom) of GATA4 and TBX5 targets amongst the 50 genes with the biggest changes in their contextualized gene embeddings before and after in silico mutagenesis of 100 bp around the TSS of GATA4’s underlying DNA sequence in ~130 cardiomyocytes.
  • precision and recall of a random sample of 50 genes are shown for each target group. Transcription factor targets were grouped as indicated.
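The evaluation described for FIG. 16B can be sketched in two steps: select the genes whose embeddings change most after in silico mutagenesis, then score that selection against a list of known targets. The source does not say how embedding change is measured, so Euclidean distance is used here purely as an assumption; the function names and toy embeddings are likewise hypothetical.

```python
import math


def top_changed_genes(embeddings_before, embeddings_after, top_n=50):
    """Genes with the largest embedding shift (Euclidean distance, assumed)."""
    shifts = {
        gene: math.dist(embeddings_before[gene], embeddings_after[gene])
        for gene in embeddings_before
        if gene in embeddings_after
    }
    return sorted(shifts, key=shifts.get, reverse=True)[:top_n]


def precision_recall(selected_genes, known_targets):
    """Precision and recall of a gene selection against known targets."""
    selected, targets = set(selected_genes), set(known_targets)
    true_positives = len(selected & targets)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(targets) if targets else 0.0
    return precision, recall


# Hypothetical two-dimensional embeddings (illustration only).
before = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [2.0, 2.0]}
after = {"a": [3.0, 4.0], "b": [1.0, 1.0], "c": [2.0, 3.0]}

selected = top_changed_genes(before, after, top_n=2)
precision, recall = precision_recall(selected, ["a", "z"])
```

Running `precision_recall` on a random sample of the same size gives the kind of random baseline that the figure description compares against.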
  • FIG. 17 shows data from an example, and in particular depicts a comparison of CosySeq2Cell vs Enformer enhancer prioritization performance on data. Results were generated by replacing the enhancer sequence with a random sequence, taking only the high confidence enhancers as ground truth and taking both high and low confidence enhancers as ground truth.
  • Enformer_TSS grey
  • Enformer_All black
  • FIG. 18 shows, from an example, a heatmap shaded by the number of cells in the scRNA seq dataset grouped by technology (rows) and tissue type (columns) across the full dataset.
  • FIG. 21 shows an example implementation of techniques described herein, in which CosySeq2Cell, used for predicting gene expression, CosyFormer models, used to replace the autoencoder models, and PerturbFormer, a model to predict transcriptomic states of cell populations under CRISPRi perturbation in different genetic contexts, are implemented together.
  • the inventors have recognized and appreciated that variability among individual cells within a cell population has hampered identification of new therapeutic targets in conventional drug development pipelines. Such variability has not been reflected in conventional approaches to modeling. As a result, conventional modeling approaches have not captured in a realistic way the environment and/or biological system to be modeled, meaning the model is flawed. The incomplete or inconsistent modeling of cellular systems and gene networks within a simulated population of cells prevents the simulation from surfacing relevant information and prevents successful drug candidates from being identified. To counter the underperformance, larger models and more computing resources have been devoted to analysis of potential drug candidates, meaning more processing unit time, memory, power, bandwidth, and other resources are consumed at higher rates, thereby increasing costs astronomically.
  • the inventors have recognized and appreciated, however, that this continual investment of resources has not overcome, and cannot overcome, the fact that the models themselves are flawed because they do not capture in a realistic way the environment/system to be modeled.
  • the inventors have further recognized and appreciated that, as an example of such variability, there can be variability in genetic profiles between cells, even two cells that are the same cell target.
  • a cell target may be a cell of a particular tissue, such as a cell that has a particular anatomical function or position in an organism.
  • Such a genetic profile may include information regarding cellular genotypes, gene sequence variation, gene expression values for one or more genes, or changes in gene expression values in response to a perturbation.
  • Described herein are embodiments of techniques for generative and/or predictive artificial intelligence (AI)-driven modeling for determining a genetic profile for a target cell.
  • a genetic profile may include a gene expression profile and/or cell properties for a target cell and/or for a cell population.
  • models provided herein may receive natural language inputs from a user and output an answer related to molecular relationships between genes and proteins in a cell.
  • Some techniques described herein may include and enable generating an in silico cell or cell population and identifying gene co-expression relationships for a set of genes based on the input received by the one or more trained models and simulating the variation in gene expression based on in-vitro and in-vivo datasets.
  • systems can be used to create, maintain, supplement, analyze, and otherwise interact with single cell RNA sequencing data extending into the millions of different genes and cell types or beyond in some cases, and including information on an array of cell properties.
  • single cell RNA sequencing data may be processed alone or in combination with other single cell or population cell level data such as genotype, DNA-methylation, chromatin accessibility (e.g., ATAC-seq), and other information.
  • Some embodiments may enable artificial intelligence (AI)-driven generative processing of information on previously-unknown, understudied, or previously deemed ‘undruggable’ properties of molecular targets.
  • Some embodiments may enable identification of target genes and proteins for the production of therapeutic agents or gene therapies for any disease.
  • using some of the techniques provided herein may enable a transformation of the lead target discovery in the pharmaceutical industry to a more efficient, comprehensive, and precise process that takes hours rather than years or days, and takes into account (alone, or together with genetic profile information) relationships between genes that may result in negative interactions in mammalian or non-mammalian subjects.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • a sample includes a plurality of samples, including mixtures thereof.
  • determining means determining if an element may be present or not (for example, detection). These terms can include quantitative, qualitative, or both quantitative and qualitative determinations. Assessing can be alternatively relative or absolute. “Detecting the presence of” or “identifying the presence of” are used interchangeably to include determining the amount of something present (e.g., a genetic perturbation, a genetic alteration, or a barcode provided herein), as well as determining whether it may be present or absent.
  • the term “about” a number refers to that number plus or minus 1%, 2%, 5%, or 10% of that number.
  • the term “about” a range can refer to that range minus 1%, 2%, 5%, or 10% of its lowest value and plus 10% of its greatest value.
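As a worked illustration of the two “about” definitions above, the following sketch computes the interval implied for a number and for a range. The function names are hypothetical, and the upper extension of a range is fixed at 10% because that is the only value the text gives for the greatest value.

```python
def about_number(value, tolerance):
    """Interval for "about" a number: value plus or minus tolerance * value."""
    delta = abs(value) * tolerance
    return value - delta, value + delta


def about_range(low, high, low_tolerance):
    """Interval for "about" a range, following the wording above.

    The lower bound is reduced by low_tolerance of the lowest value; the
    upper bound is raised by 10% of the greatest value, as the text states.
    """
    return low - abs(low) * low_tolerance, high + abs(high) * 0.10


# "about 100" with a 5% tolerance; "about 10 to 20" with a 10% lower tolerance.
number_interval = about_number(100.0, 0.05)
range_interval = about_range(10.0, 20.0, 0.10)
```

The permitted tolerances in the text are 1%, 2%, 5%, and 10%, so `tolerance` and `low_tolerance` would take the values 0.01, 0.02, 0.05, or 0.10.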
  • An alteration can be any change in a polynucleotide.
  • a change in a polynucleotide can occur when a molecule associates with, hybridizes to, or intercalates into a polynucleotide or a double stranded polynucleotide.
  • a non-covalent binding e.g., van der Waals interactions, hydrogen bonding, base pairing, wobble base pairing, base stacking
  • an agent such as: an interfering RNA to a polynucleotide, a protein, or a fragment thereof, an antibody or a fragment thereof, an aptamer, an enzyme, or a de-activated enzyme, is an example of an alteration.
  • a protein which can be a CRISPRi protein can bind to the polynucleotide.
  • the introduction of a CRISPR system can result in sequential alterations.
  • the first can be binding of a guide nucleic acid to the polynucleotide.
  • the second can be, when capable of being facilitated by the guide nucleic acid and a biologically active protein (e.g., Cas9), covalent modification of the polynucleotide following binding of the guide to the polynucleotide.
  • an alteration results in a chemical modification of a polynucleotide or a double stranded polynucleotide.
  • Alterations can include a chemically modified base of a nucleotide of a polynucleotide (e.g., adding a methyl, hydroxymethyl, formyl, or carboxyl group to the base of the polynucleotide, or hydrolyzing an amino group of an adenosine to form an inosine), a chemically modified sugar of a base of a polynucleotide (e.g., replacing a hydroxyl group with a hydrogen or a methoxy group or a fluorine atom, or forming a locked nucleic acid), or altering a phosphate group or a phosphodiester bond (e.g., to replace phosphorus with sulfur or to replace an oxygen with sulfur).
  • Chemical modification is meant to illustrate changes that can occur on one or more nucleotides of the polynucleotide and is not limited in the way the modification is enacted.
  • a chemical modification such as addition of a methyl epigenetic mark can be effected by an enzyme or a biologically active fragment thereof (e.g., Dnmt1).
  • Individual nucleotides of a polynucleotide can be independently chemically modified.
  • chemical modification and covalent modification are used interchangeably.
  • Other exemplary agents that can cause alterations include ethidium bromide, radiation, and enzymes that introduce, alter, or remove epigenetic marks.
  • An alteration to a polynucleotide can include a genetic alteration, for example, an alteration in a coding region of the polynucleotide (e.g., an alteration in an exon). Alterations in a polynucleotide can occur in non-coding regions, including, for example, introns and promoters. In some instances, an alteration to a polynucleotide results in a chemical modification to the polynucleotide.
  • perturbation refers to any intentionally introduced or targeted alteration of a polynucleotide.
  • alterations encompass intentional changes (perturbations) to a polynucleotide. For example, using a CRISPR-Cas system to add or delete a nucleotide at a pre-determined nucleic acid of a polynucleotide, or binding or hybridizing an interfering RNA to a target RNA, are examples of perturbations because these make targeted changes to a target nucleotide or a fragment thereof.
  • a perturbation, an alteration, or both can independently occur on the same polynucleotide or on two different polynucleotides.
  • a targeted edit of a polynucleotide by a CRISPR-Cas system is a perturbation and an alteration, but an off-target edit resulting from the same CRISPR-Cas system is an alteration but not a perturbation.
  • a perturbation is introduced by a small molecule that binds to protein, protein complexes, or transcriptional elements to change gene expression of a gene in a cell.
  • a gene expression level (also referred to herein as a gene expression value) is the level of RNA transcripts relative to a reference level of RNA transcripts for a population of cells.
  • the gene expression level (or gene expression value) is the raw data value from an in vitro assay.
  • a change to a genetic profile can include a change to, for example, an exon, an intron, a non-coding region, a junk DNA region, DNA in a cell, and/or an RNA expressed by the cell.
  • the change can be, for example, the binding of a protein, hybridization of a polynucleotide to the polynucleotide of interest, a mutation in a polynucleotide, a missense mutation, a nonsense mutation, an siRNA, and/or an edit (an indel edit).
  • a gene expression level provided herein can be used to determine various parameters of the genetic profile in response to a stimulus or perturbation.
  • the gene expression level can be influenced, for example, by the amount of RNA expressed for a given gene, a change in the amount of an RNA transcript expressed due to alternative splicing, a change in the composition of a protein translated from an RNA, and/or a change in an amount of protein translated from an RNA relative to a reference gene expression level.
  • target cell can be used to describe any origin cell with an original genetic profile, gene expression levels, or cell properties that are characteristic of a cell before perturbation of a target gene.
  • a cell that has had a genetic perturbation may be, for example, a perturbed target cell.
  • a population may be a population of target cells (with one or more perturbations or not).
  • a gene of the target cell is perturbed or modified.
  • the target cell is characterized as having a modification in a genetic profile, one or more gene expression levels, or one or more cell properties after a perturbation has been introduced to a target gene of the target cell.
  • the target cell comprises a cell from a subject or a cell line.
  • the target cell comprises a simulation of a cell from a subject or a cell line.
  • the target cell is a prokaryotic cell or a eukaryotic cell.
  • the cell is a mammalian cell, an insect cell, or other animal cell.
  • Non-limiting examples of target cells include: a neuron, a tumor cell or a cancer cell, a glial cell, an astrocyte, a dendritic cell, an immune cell, a stem cell, an in vitro-differentiated cell, a heart cell, a muscle cell, an epithelial cell, an endothelial cell, a liver cell, a hepatocyte, a pancreatic cell, a skin cell, a fibroblast, or any cell.
  • reference sequence can be used to refer to a known nucleotide sequence, e.g., a chromosomal region whose sequence is deposited at NCBI’s GenBank database or other databases.
  • a reference sequence can be a wild-type sequence.
  • a reference sequence can be a nucleotide sequence obtained from a healthy individual or a group of healthy individuals without a disease or condition.
  • exemplary is used herein to mean serving as an example, instance, or illustration. Any embodiment, implementation, process, feature, etc. described herein as exemplary should therefore be understood to be an illustrative example and should not be understood to be a preferred or advantageous example unless otherwise indicated.
  • Some embodiments provided herein include methods comprising: simulating a result of a perturbation of a gene of a target cell, wherein the simulating comprises generating an adjusted genetic profile for each cell of a population of cells of a simulation and each cell of the population of cells corresponds to the target cell, the generating comprising: receiving an input that identifies (a) a polynucleotide sequence from at least a first gene of the target cell, and/or (b) a perturbed first gene of the target cell, and generating, based on the input and using one or more trained models, the adjusted genetic profile for each cell of the population of cells, wherein generating the adjusted genetic profiles comprises simulating variation in gene expression between cells of the population of cells; and outputting a first adjusted genetic profile of a first cell from among the population of cells.
  • the input identifies a set of genes.
  • the input identifies the sequence of each set of genes.
  • one or more trained models described herein are implemented to generate an adjusted genetic profile for a cell or a population of cells in silico based on a perturbation in a polynucleotide sequence generated in a target cell that changes the cell’s genetic context.
  • the perturbation made to the genetic profile of the cell or the population of cells can comprise an increase in the expression level of the one or more first genes, a decrease in the expression level of the one or more first genes, a perturbation in the one or more first genes, and/or a mutation in the one or more first genes.
  • the perturbation can include but is not limited to an insertion, a deletion, a change in copy number, a point mutation, a frameshift mutation, a missense mutation, a nonsense mutation, a mutation in a stop codon, an epigenetic mark, a reduction in gene expression, an overexpression of a gene, or any combination thereof.
  • the output adjusted genetic profile can include but is not limited to: a set of genes; one or more gene expression values; a transcriptomic profile for a set of genes; proteins that are altered by the input perturbation, protein information, and/or protein sequences.
  • the trained model(s) may be used to identify a set of phenotypes present in the population of cells based on the adjusted genetic profiles generated for each cell of the population of cells.
  • the trained model(s) may simulate variation in gene expression between cells of the population of cells to generate gene expression values with variation between cells of the population of cells.
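  • As a minimal sketch of what "gene expression values with variation between cells" could look like, the following assumes that a trained model has already produced a mean expression value per gene and then scatters each simulated cell around that mean. The function name `simulate_population`, the Gaussian noise model, the coefficient of variation `cv`, and the gene names are all hypothetical and are not taken from this disclosure.

```python
import random


def simulate_population(mean_profile, n_cells, cv=0.2, seed=0):
    """Return n_cells expression profiles scattered around mean_profile.

    mean_profile: dict of gene name -> mean expression value. Here it is a
    hypothetical stand-in for the prediction of a trained model.
    cv: coefficient of variation of the per-cell noise (an assumption).
    """
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    population = []
    for _ in range(n_cells):
        cell = {}
        for gene, mean in mean_profile.items():
            # Gaussian noise scaled to the mean, clipped at zero because
            # expression values cannot be negative.
            cell[gene] = max(0.0, rng.gauss(mean, cv * mean))
        population.append(cell)
    return population


# Hypothetical predicted means for a perturbed target cell.
predicted = {"GENE_A": 10.0, "GENE_B": 0.0, "GENE_C": 250.0}
cells = simulate_population(predicted, n_cells=5)
first_cell = cells[0]  # the "first adjusted genetic profile" to output
```

Each entry of `cells` plays the role of one adjusted genetic profile; a real system would presumably draw the variation from the trained model itself rather than from a fixed noise distribution.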
  • the trained model(s) provided herein may also determine the presence or absence of sequence variation in a polynucleotide sequence input relative to a reference sequence.
  • the one or more trained models generate the gene expression values of an adjusted genetic profile, at least in part, by processing an interdependency between inputs and one or more quantitative trait loci, a sequence alignment relative to a reference sequence, a sequence variation that is associated with a disease or a disorder, a sequence variation that is associated with an alteration or a perturbation, a chromosomal position, a variation in chromosomal position, or any combination thereof.
  • the interdependency between inputs can be determined for each gene of the set of genes input into at least one model.
  • the interdependencies between inputs and outputs are determined by data obtained from in-vitro or in-vivo assays.
  • a single cell RNA sequencing assay provides gene expression values for a set of genes in response to a predetermined perturbation.
  • Large single cell RNA sequence data sets can be used to determine relationships between gene knockdown and the gene expression values of different sets of genes in a cell.
  • the one or more trained models of these examples can also be fine-tuned based on a series of training sets.
  • the one or more trained models tune the output(s) based on genetic profiles for each of a set of training cells.
  • the one or more trained models further comprise tuning the one or more trained models based on an identification of a perturbation made to each training cell of one or more cells of the set of training cells. Table 1 shows exemplary inputs and output prompts for providing an adjusted genetic profile.
  • Table 1. Exemplary Inputs and Outputs to Determine the Genetic Profile for a Cell with a Perturbation.
  • the adjusted genetic profile output by the one or more trained models provided herein can be for a single cell or correspond to each cell in a population of cells. Additional outputs of the one or more trained models provided herein can include the cell state, viability or nonviability of a cell, and a cell property.
  • Non-limiting examples of cell properties include: nonviability of a cell; viability of a cell; a variation in a sequence of the perturbed first gene or a variation in the sequence of the polynucleotide sequence; an alteration in a gene product that is correlated with the perturbed first gene; an alteration in a second gene that is correlated with the perturbed first gene; a disease or a condition that is associated with the perturbed first gene; an environmental change that is associated with the perturbed first gene; or any combination of cell properties.
  • the output from the one or more trained models can include information that is in ranked order. For example, gene expression values of a set of genes can be provided to the user in ascending or descending order according to the gene expression level and/or the change in gene expression level relative to a reference level.
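  • The ranked output described above can be illustrated with a short sketch. The helper `rank_genes`, the gene names, and the expression values are hypothetical; the only behavior taken from the text is ordering genes by expression level or by change relative to a reference level, in ascending or descending order.

```python
def rank_genes(expression, reference=None, descending=True):
    """Rank genes by expression level, or by change from a reference level.

    expression: dict of gene name -> expression value in the perturbed cell.
    reference:  optional dict of gene name -> reference expression value.
    """
    if reference is None:
        score = dict(expression)
    else:
        # Signed change relative to the reference level.
        score = {g: expression[g] - reference[g] for g in expression}
    return sorted(score, key=lambda g: score[g], reverse=descending)


# Hypothetical values for illustration only.
perturbed = {"GENE_A": 2.0, "GENE_B": 9.0, "GENE_C": 5.0}
baseline = {"GENE_A": 8.0, "GENE_B": 8.5, "GENE_C": 1.0}

by_level = rank_genes(perturbed)                       # highest level first
by_change = rank_genes(perturbed, reference=baseline)  # largest increase first
lowest_first = rank_genes(perturbed, descending=False)
```

Ranking by signed change puts the most strongly upregulated genes first; ranking by absolute change would be an equally reasonable choice if the direction of the change does not matter.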
  • methods comprise: evaluating a genetic profile for a cell to identify a perturbation that had been made to at least one gene of a target cell and yielded the genetic profile, the cell corresponding to the target cell, the evaluating comprising: receiving input that identifies a set of sequences for the cell; determining, based on the input and using at least one trained model, a set of genes altered in the cell with respect to a set of reference sequences for the target cell; determining, using the at least one trained model, the perturbation made to the at least one gene of the set of genes; outputting an identification of the perturbation.
  • the set of sequences are RNA sequences.
  • the set of sequences are DNA sequences.
  • the set of sequences encode for a set of proteins.
  • the one or more trained models identify polynucleotide sequences that are associated with a disease, a condition, a perturbation, an alteration, a cell property, a cell viability, a cell non-viability, or a combination thereof and correlate gene sequences with the gene expression values for the gene or the set of genes.
  • the one or more trained models identify gene sequences that are associated with the cell property that is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, a physical property, and any combination thereof.
  • the at least one trained model outputs gene sequence information.
  • the gene sequence information comprises a single polynucleotide polymorphism, a gene sequence variation, a gene sequence structural variant, a codon sequence, an exon or a gene sequence encoding for the exon, an intron or the gene sequence encoding for the intron, a non-coding region sequence, a promoter region sequence, a sequence alignment relative to a reference sequence, a sequence variation that is associated with a disease or a disorder, a sequence variation that is associated with an alteration or a perturbation, a chromosomal position, a variation in chromosomal position, a quantitative trait locus, a methylation pattern of a gene sequence, or any combination thereof.
  • FIG. 2 shows an example of how the one or more trained models can be used to determine the identity of a perturbed target gene.
  • the one or more trained models and systems provided herein can also generate sequence information for a cell.
  • the input is a genetic profile of a cell with a given genetic context.
  • the trained model processes relationships between genes and their respective expression values and compares sequencing data across genes in the database to identify the perturbation that gave rise to the genetic profile. This input and output process is useful for identifying a previously unknown perturbation.
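  • One simple way to picture the identification of a previously unknown perturbation is nearest-signature matching: the observed change in expression is compared against stored expression-change signatures of known perturbations. This is a stand-in for illustration, not the trained model of this disclosure; the signature database, the gene names, the numeric values, and the use of cosine similarity are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identify_perturbation(observed, reference, signatures):
    """Return the best-matching perturbation name and all similarity scores."""
    genes = sorted(reference)
    # Expression change of the observed cell relative to the reference.
    delta = [observed[g] - reference[g] for g in genes]
    scores = {
        name: cosine(delta, [sig.get(g, 0.0) for g in genes])
        for name, sig in signatures.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical reference profile and signature database.
reference = {"GENE_A": 10.0, "GENE_B": 10.0, "GENE_C": 10.0}
signatures = {
    "GENE_A knockdown": {"GENE_A": -8.0, "GENE_B": -2.0, "GENE_C": 1.0},
    "GENE_C overexpression": {"GENE_A": 0.0, "GENE_B": 1.0, "GENE_C": 12.0},
}
observed = {"GENE_A": 3.0, "GENE_B": 8.5, "GENE_C": 10.5}

best, scores = identify_perturbation(observed, reference, signatures)
```

In this toy example the observed profile most closely resembles the hypothetical "GENE_A knockdown" signature. A learned model could capture non-linear relationships between genes that a fixed similarity measure cannot.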
  • Table 2 shows exemplary inputs and outputs for the at least one trained model. Table 2. Sequence Information.
  • some methods may comprise: receiving an input that identifies a perturbed first gene in a cell and a first gene expression value in the cell; generating, using one or more trained models, a genetic profile for the cell, wherein the genetic profile of the cell comprises a changed expression level of a second gene or a second set of genes that are impacted by the perturbed first gene, wherein the perturbed first gene is different from the second gene or the second set of genes; and outputting the genetic profile for the cell.
  • the one or more trained models generate the genetic profile for the cell, at least in part, by processing an interdependency between the perturbed first gene and a property of the cell.
  • the interdependency between the perturbed first gene and a second gene or a second set of genes is determined, at least in part, by a set of data obtained from an in-vitro assay.
  • the methods further comprise receiving an input that identifies a cell property that is not present in a cell in an absence of the perturbation, and using the one or more trained models, fine-tuning the genetic profile based on a set of training cells.
  • Some methods and systems provided herein can also include predictive modules with generative models to predict single-cell expression profiles and cell states for cells with a perturbation.
  • Table 3 shows the exemplary inputs and outputs for determining the effect of gene expression of sets of genes in response to a perturbation in a gene of interest.
  • Table 3. Gene expression of non-perturbed genes.
  • methods may comprise: predicting at least one gene to perturb in a target cell to yield an adjusted genetic profile for the target cell, the predicting comprising: receiving input identifying a desired property of a first cell, the first cell corresponding to the target cell; identifying, using at least one trained model, one or more target genes associated with the cell property of the first cell corresponding to the target cell, wherein the at least one trained model is trained based on adjusted genetic profiles of single cells in response to one or more perturbations, wherein the adjusted genetic profiles comprise a changed expression level of one or more genes affected by the one or more perturbations; identifying, using the at least one trained model, one or more perturbations that when made to the one or more target genes yield the desired property in the first cell corresponding to the target cell; and outputting an identification of the one or more target genes.
  • methods comprising: receiving a gene expression value of a first gene from a cell or a population of cells; generating, using one or more trained models, one or more perturbed genes associated with a change in the gene expression value of the first gene; and outputting the one or more perturbed genes.
  • a cell property can include any characterized property of a cell of interest to the user of the methods and systems provided herein.
  • the cell property is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, and a physical property.
  • the cell property is cell viability.
  • the cell property is cell non-viability.
  • the cell property and cell type can be a cell type or a cell property of interest to the user of the at least one trained model or can be a cell property or cell type identified by an in-vitro and/or in-vivo assay.
  • Methods of determining, identifying, characterizing, or quantifying a transcriptomic property of a cell include, but are not limited to, RNA sequencing, polymerase chain reaction (PCR), and gene expression microarrays.
  • Methods of determining, identifying, characterizing, or quantifying a structural property can include, but are not limited to, microscopy, confocal microscopy, and electromagnetic microscopy.
  • Methods of determining, identifying, characterizing, or quantifying an electrical property of a cell can include, but are not limited to, electrophysiological techniques, calcium imaging, or optogenetic techniques.
  • Methods of determining, identifying, characterizing, or quantifying a mechanical property of a cell can include, but are not limited to, cellular impedance assays, atomic force microscopy, cellular stress, and cellular strain assays.
  • Methods of determining, identifying, characterizing, or quantifying a biochemical property of a cell can include, but are not limited to, mass spectrometry, liquid chromatography, immunosorbent assays, chemiluminescent assays, and reporter assays.
  • Methods of determining, identifying, characterizing, or quantifying a metabolic property of a cell can include, for example, live cell oxygen consumption rate (OCR) and extracellular acidification rate (ECAR) assays, or a mitochondrial stress test.
  • Methods of determining, identifying, characterizing, or quantifying a physical property can include, but are not limited to, microscopy, confocal microscopy, electromagnetic microscopy, cellular impedance assays, and atomic force microscopy.
  • the gene expression level of a gene in a given cell can be relative to a reference gene expression level or be associated with a cell property. Exemplary inputs and outputs for the method provided above are shown in Table 4.
  • systems may perform any one or any combination of methods described herein or any step or combination of steps of the methods provided herein.
  • Techniques operating according to the principles described herein may be implemented in any suitable manner. Included in the discussion provided herein are a series of flow charts showing the steps and acts of various processes for perturbation-linked genetic profile analysis. The processing and decision blocks of the flow charts provided herein represent steps and acts that may be included in algorithms that carry out these various processes. Algorithms derived from these processes may be implemented as software integrated with and directing the operation of one or more single- or multi-purpose processors, may be implemented as functionally-equivalent circuits such as a Digital Signal Processing (DSP) circuit or an Application-Specific Integrated Circuit (ASIC), or may be implemented in any other suitable manner.
  • the techniques described herein may be embodied in computer-executable instructions implemented as software, including as application software, system software, firmware, middleware, embedded code, or any other suitable type of computer code.
  • Such computer-executable instructions may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
  • these computer-executable instructions may be implemented in any suitable manner, including as a number of functional facilities, each providing one or more operations to complete execution of algorithms operating according to these techniques.
  • a “functional facility,” however instantiated, is a structural component of a computer system that, when integrated with and executed by one or more computers, causes the one or more computers to perform a specific operational role.
  • a functional facility may be a portion of or an entire software element.
  • a functional facility may be implemented as a function of a process, or as a discrete process, or as any other suitable unit of processing.
  • each functional facility may be implemented in its own way; all need not be implemented the same way.
  • these functional facilities may be executed in parallel and/or serially, as appropriate, and may pass information between one another using a shared memory on the computer(s) on which they are executing, using a message passing protocol, or in any other suitable way.
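  • The idea of functional facilities that run in parallel and exchange information by message passing can be sketched with two threads connected by queues. The facility names (`expression_facility`, `report_facility`), the message format, and the placeholder profile are hypothetical; the sketch only illustrates the communication pattern, not any particular analysis.

```python
import queue
import threading


def expression_facility(inbox, outbox):
    """Hypothetical facility: turns a perturbation request into a profile."""
    while True:
        msg = inbox.get()
        if msg is None:        # sentinel: stop and notify the next facility
            outbox.put(None)
            break
        # Placeholder profile; a real facility would call a trained model.
        outbox.put({"gene": msg, "profile": {msg: 0.0}})


def report_facility(inbox, results):
    """Hypothetical facility: summarizes each profile it receives."""
    while True:
        msg = inbox.get()
        if msg is None:
            break
        results.append(f"{msg['gene']}: {len(msg['profile'])} gene(s) in profile")


requests, profiles, results = queue.Queue(), queue.Queue(), []
workers = [
    threading.Thread(target=expression_facility, args=(requests, profiles)),
    threading.Thread(target=report_facility, args=(profiles, results)),
]
for worker in workers:
    worker.start()
for gene in ["GENE_A", "GENE_B"]:
    requests.put(gene)
requests.put(None)  # ask the pipeline to shut down
for worker in workers:
    worker.join()
```

The same two functions could instead be called one after another in a single thread, which reflects the point that facilities may be executed serially or in parallel as appropriate.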
  • functional facilities include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • functionality of the functional facilities may be combined or distributed as desired in the systems in which they operate.
  • one or more functional facilities carrying out techniques herein may together form a complete software package.
  • These functional facilities may, in alternative embodiments, be adapted to interact with other, unrelated functional facilities and/or processes, to implement a software program application.
  • Some exemplary functional facilities have been described herein for carrying out one or more tasks. It should be appreciated, though, that the functional facilities and division of tasks described is merely illustrative of the type of functional facilities that may implement the exemplary techniques described herein, and that embodiments are not limited to being implemented in any specific number, division, or type of functional facilities. In some implementations, all functionalities may be implemented in a single functional facility. It should also be appreciated that, in some implementations, some of the functional facilities described herein may be implemented together with or separately from others (i.e., as a single unit or separate units), or some of these functional facilities may not be implemented.
  • Computer-executable instructions implementing the techniques described herein may, in some embodiments, be encoded on one or more computer-readable media to provide functionality to the media.
  • Computer-readable media include magnetic media such as a hard disk drive, optical media such as a Compact Disk (CD) or a Digital Versatile Disk (DVD), a persistent or non-persistent solid-state memory (e.g., Flash memory, Magnetic RAM, etc.), or any other suitable storage media.
  • Such a computer-readable medium may be implemented in any suitable manner, including as computer-readable storage media described below (i.e., as a portion of a computing device 700 in FIG. 9) or as a stand-alone, separate storage medium.
  • “computer-readable media” refers to tangible storage media. Tangible storage media are non-transitory and have at least one physical, structural component.
  • in a “computer-readable medium,” as used herein, at least one physical, structural component has at least one physical property that may be altered in some way during a process of creating the medium with embedded information, a process of recording information thereon, or any other process of encoding the medium with information. For example, a magnetization state of a portion of a physical structure of a computer-readable medium may be altered during a recording process.
  • Functional facilities comprising these computer-executable instructions may be integrated with and direct the operation of a single multi-purpose programmable digital computing device, a coordinated system of two or more multi-purpose computing devices sharing processing power and jointly carrying out the techniques described herein, a single computing device or coordinated system of computing devices (co-located or geographically distributed) dedicated to executing the techniques described herein, one or more Field-Programmable Gate Arrays (FPGAs) for carrying out the techniques described herein, or any other suitable system.
  • FIG. 9 illustrates one exemplary implementation of a computing device in the form of a computing device 700 that may be used in a system implementing techniques described herein, although others are possible. It should be appreciated that FIG. 9 is intended neither to be a depiction of necessary components for a computing device to execute a perturbation analysis facility 122 in accordance with the principles described herein, nor a comprehensive depiction.
  • Computing device 700 may comprise at least one processor 702, a network adapter 704, and computer-readable storage media 706.
  • Computing device 700 may be, for example, a desktop or laptop personal computer, a personal digital assistant (PDA), a smart mobile phone, a server, a wireless access point or other networking element, or any other suitable computing device.
  • Network adapter 704 may be any suitable hardware and/or software to enable the computing device 700 to communicate wired and/or wirelessly with any other suitable computing device over any suitable computing network.
  • the computing network may include wireless access points, switches, routers, gateways, and/or other networking equipment as well as any suitable wired and/or wireless communication medium or media for exchanging data between two or more computers, including the Internet.
  • Computer-readable media 706 may be adapted to store data to be processed and/or instructions to be executed by processor 702.
  • Processor 702 enables processing of data and execution of instructions.
  • the data and instructions may be stored on the computer-readable storage media 706.
  • the data and instructions stored on computer-readable storage media 706 may comprise computer-executable instructions implementing techniques which operate according to the principles described herein.
  • computer-readable storage media 706 stores computer-executable instructions implementing various facilities and storing various information as described above.
  • Computer-readable storage media 706 may store perturbation analysis facility 122, which may implement any one or any combination of methods described herein, or a combination of acts described herein.
  • a computing device may additionally have one or more components and peripherals, including input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computing device may receive input information through speech recognition or in other audible format.
  • the systems provided herein further comprise devices for use in sequencing, cell sorting, cell counting, cell imaging, and gene expression analysis.
  • Devices provided herein can include, for example, microfluidic devices, arrays, microscopes, imagers, cellular metabolism analyzers, electrophysiological devices, or impedance devices. These devices can be used to measure, analyze, and/or quantify a cell property.
  • Embodiments have been described where the techniques are implemented in circuitry and/or computer-executable instructions. It should be appreciated that some embodiments may be in the form of a method, of which at least one example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
  • methods may comprise: simulating a result of a perturbation of a gene of a target cell, wherein the simulating comprises: generating an adjusted genetic profile for each cell of a population of cells of a simulation and each cell of the population of cells corresponds to the target cell, the generating comprising: receiving an input that identifies: (a) a polynucleotide sequence from a first gene of the target cell; and/or (b) a perturbed first gene of the target cell, and generating, based on the input and using one or more trained models, the adjusted genetic profile for each cell of the population of cells, wherein generating the adjusted genetic profiles comprises simulating variation in gene expression between cells of the population of cells; and outputting a first adjusted genetic profile of a first cell from among the population of cells.
  • the outputting further comprises outputting the adjusted genetic profile for each cell of the population of cells. Further provided herein are methods, wherein the methods further comprise: identifying a set of phenotypes present in the population of cells based on the adjusted genetic profiles generated for each cell of the population of cells.
  • receiving the input further comprises receiving an input identifying the target cell.
  • generating the adjusted genetic profile for the population of cells comprises generating, using the one or more trained models and based on the input polynucleotide sequence and/or perturbed first gene, gene expression values for each cell of the population of cells corresponding to the target cell.
  • simulating variation in gene expression between cells of the population of cells comprises generating gene expression values with variation between cells of the population of cells.
  • the methods further comprise: determining viability or nonviability for each cell of the population of cells.
  • the methods further comprise: determining viability or non-viability of the population of cells.
  • the methods further comprise: receiving the input that identifies a polynucleotide sequence from the first gene; and generating the adjusted genetic profile comprises generating gene expression values for a set of genes, for each cell of the population of cells.
  • receiving the input comprises receiving a natural language input identifying the polynucleotide sequence from the first gene of the target cell, and/or the perturbed first gene of the target cell.
  • receiving the natural language input comprises receiving a natural language input identifying the target cell.
  • receiving the natural language input identifying the perturbed first gene of the target cell comprises receiving a natural language description of the perturbation that yielded the perturbed first gene of the target cell.
  • receiving the natural language input comprises receiving a natural language input identifying a cell type of the target cell, a tissue type from which the target cell is derived, and/or a disease state of a target cell, tissue, or subject.
  • the methods further comprise receiving an input comprising a cell property of one or more cells of the population of cells.
  • the one or more trained models generate one or more gene expression values, at least in part, by processing an interdependency between input (a) or input (b); and a cell property of one or more cells of a population of cells.
  • the cell property is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, and a physical property.
  • the cell property comprises: (i) non-viability of a cell; (ii) viability of a cell; (iii) a variation in a sequence of the perturbed first gene or a variation in the sequence of the polynucleotide sequence; (iv) an alteration in a gene product that is correlated with the perturbed first gene; (v) an alteration in a second gene that is correlated with the perturbed first gene; (vi) a disease or a condition that is associated with the perturbed first gene; (vii) an environmental change that is associated with the perturbed first gene; or (viii) any combination of (i)-(vii).
  • the one or more trained models determine the presence or absence of a sequence variation in the polynucleotide sequence relative to a reference polynucleotide sequence.
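  • For intuition, the presence or absence of a sequence variation relative to a reference can be checked, in the simplest case, by a position-by-position comparison of two sequences of equal length. The function `find_substitutions` and the example sequences are hypothetical; this sketch handles substitutions only and does not attempt the alignment that insertions or deletions would require, nor does it represent how the trained models make this determination.

```python
def find_substitutions(sequence, reference):
    """List (position, reference_base, observed_base) for each mismatch.

    Positions are 1-based. Only sequences of equal length are supported,
    so insertions and deletions are outside the scope of this sketch.
    """
    if len(sequence) != len(reference):
        raise ValueError("this sketch only compares sequences of equal length")
    pairs = zip(sequence.upper(), reference.upper())
    return [
        (i + 1, ref, obs)
        for i, (obs, ref) in enumerate(pairs)
        if obs != ref
    ]


# Hypothetical nine-base example with one substitution at position 5.
reference = "ATGGCCTAA"
sample = "ATGGACTAA"

variants = find_substitutions(sample, reference)
has_variation = bool(variants)  # presence or absence of sequence variation
```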
  • the receiving further comprises identifying a perturbation of the perturbed first gene.
  • the perturbation comprises an insertion, a deletion, a change in copy number, a point mutation, a frameshift mutation, a missense mutation, a nonsense mutation, a mutation in a stop codon, an epigenetic mark, a reduction in gene expression, an overexpression of a gene, or any combination thereof.
  • the inputting further comprises a gene expression value for at least one gene in the set of genes.
  • the one or more trained models generate the gene expression values, at least in part, by processing an interdependency between input (a) or input (b); and one or more quantitative trait loci, a sequence alignment relative to a reference sequence, a sequence variation that is associated with a disease or a disorder, a sequence variation that is associated with an alteration or a perturbation, a chromosomal position, a variation in chromosomal position, or any combination thereof.
  • the interdependency between the input (a) and each gene of the set of genes is determined, at least in part, by a set of data obtained from an in-vitro assay.
  • the in-vitro assay comprises single cell RNA sequencing (scRNA seq).
  • the gene expression values comprise a set of genes that are ranked in ascending or descending order according to: (i) gene expression level; and/or (ii) a change in gene expression level relative to a reference level.
  • the reference level is a gene expression level of a second gene in an absence of a perturbation in the perturbed first gene.
  • the reference level is a gene expression level of the second gene when the second gene has been perturbed. Further provided herein are methods, wherein the method further comprises receiving an input that identifies a property of a cell that is not present in a cell in an absence of the perturbation. Further provided herein are methods, wherein the adjusted genetic profile reflects an adjusted expression level of at least one second gene impacted by the change impacting an expression level of the first gene or the perturbed first gene, the adjusted expression level of the at least one second gene being adjusted with respect to a reference expression level for the at least one second gene for the cell type, the first gene or the perturbed first gene being different from the second gene.
  • the perturbation made to the genetic profile for the cell or the population of cells yields an increase in the expression level of the one or more first genes or a decrease in the expression level of the one or more first genes. Further provided herein are methods, wherein the perturbation comprises a perturbation made to the first gene and/or a mutation in the first gene.
  • methods may comprise: evaluating a genetic profile for a cell to identify a perturbation that had been made to at least one gene of a target cell and yielded the genetic profile, the cell corresponding to the target cell, the evaluating comprising: receiving input that identifies a set of sequences for the cell; determining, based on the input and using at least one trained model, a set of genes altered in the cell with respect to a set of reference sequences for the target cell; determining, using the at least one trained model, the perturbation made to the at least one gene of the set of genes; outputting an identification of the perturbation.
  • evaluating the genetic profile for the cell to identify the perturbation comprises evaluating the genetic profile for the cell to identify an unspecified perturbation that had been made to the at least one gene of the target cell. Further provided herein are methods, wherein evaluating the genetic profile comprises evaluating gene expression values for genes corresponding to the set of sequences for the cell. Further provided herein are methods, wherein the methods further comprise: outputting gene sequence information for the set of genes altered by the perturbation.
  • the gene sequence information comprises a single polynucleotide polymorphism, a gene sequence variation, a gene sequence structural variant, a codon sequence, an exon or a gene sequence encoding for the exon, an intron or the gene sequence encoding for the intron, a non-coding region sequence, a promoter region sequence, a sequence alignment relative to a reference sequence, a sequence variation that is associated with a disease or a disorder, a sequence variation that is associated with an alteration or a perturbation, a chromosomal position, a variation in chromosomal position, a quantitative trait locus, a methylation pattern of a gene sequence, or any combination thereof.
  • the methods further comprise: tuning the at least one trained model based on genetic profiles for each of a set of training cells.
  • tuning the at least one trained model further comprises tuning the at least one trained model based on an identification of a perturbation made to each training cell of one or more cells of the set of training cells.
  • the at least one trained model identifies gene sequences that are associated with a disease, a condition, a perturbation, an alteration, a cell property, a cell viability, a cell non-viability, or a combination thereof and correlate gene sequences with the gene expression values for the gene or the set of genes.
  • the at least one trained model identifies gene sequences that are associated with the cell property that is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, a physical property, and any combination thereof.
  • the methods comprise: predicting at least one gene to perturb in a target cell to yield an adjusted genetic profile for the target cell, the predicting comprising: receiving input identifying a desired property of a first cell, the first cell corresponding to the target cell; identifying, using at least one trained model, one or more target genes associated with the cell property of the first cell corresponding to the target cell, wherein the at least one trained model is trained based on adjusted genetic profiles of single cells in response to one or more perturbations, wherein the adjusted genetic profiles comprise a changed expression level of one or more genes affected by the one or more perturbations; identifying, using the at least one trained model, one or more perturbations that when made to the one or more target genes yield the desired property in the first cell corresponding to the target cell; and outputting an identification of the one or more target genes.
  • the outputting further comprises outputting the one or more perturbations.
  • the target cell does not exhibit the cell property of the cell in the absence of a perturbation.
  • the one or more perturbations are performed in silico.
  • the one or more perturbations are performed in-vitro.
  • the at least one trained model generates the one or more target genes, at least in part, by processing an interdependency between a property of a cell, a cell type, and/or genetic profile for one or more genes in response to a perturbation.
  • the desired property of the first cell is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, and a physical property.
  • methods may comprise: receiving an input that identifies a perturbed first gene in a cell and a first gene expression value in the cell; generating, using one or more trained models, a genetic profile for the cell, wherein the genetic profile of the cell comprises a changed expression level of a second gene or a second set of genes that are impacted by the perturbed first gene, wherein the perturbed first gene is different from the second gene or the second set of genes; and outputting the genetic profile for the cell.
  • the receiving further comprises identifying a type of perturbation for the perturbed first gene.
  • the type of perturbation comprises an insertion, a deletion, a change in copy number, a point mutation, a frameshift mutation, a missense mutation, a nonsense mutation, a mutation in a stop codon, an addition, a deletion, and/or a change to an epigenetic mark, a reduction in gene expression, an overexpression of a gene, or any combination thereof.
  • the one or more trained models generate the genetic profile for the cell, at least in part, by processing an interdependency between the perturbed first gene and a property of the cell.
  • the methods further comprise: generating, using the one or more models, a first set of genes for the cell that reflect the cell property of the cell; and outputting a second set of genes that, at least in part, are correlated to the cell property of the cell in the presence of the perturbation.
  • the cell property of the cell is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, and a physical property.
  • the cell property of the cell comprises: (i) non-viability of a cell; (ii) viability of a cell; (iii) a variation in a sequence of the perturbed first gene or a variation in the sequence of a second gene; (iv) an alteration in a gene product that is correlated with the perturbed first gene; (v) an alteration in a second gene that is correlated with the perturbed first gene; (vi) a disease or a condition that is associated with the perturbed first gene; (vii) an environmental change that is associated with the perturbed first gene; or (viii) any combination of (i)-(vii).
  • the interdependency between the perturbed first gene and a second gene or a second set of genes is determined, at least in part, by a set of data obtained from an in-vitro assay.
  • the in-vitro assay comprises single cell RNA sequencing (scRNA seq).
  • the second set of genes comprises two or more genes and a gene expression value for each gene in the second set of genes, and wherein each gene in the second set of genes is different from the one or more first genes.
  • the second set of genes are ranked in ascending or descending order according to: (i) gene expression level; and/or (ii) a change in gene expression level relative to a reference gene expression level.
  • the reference gene expression level is a gene expression level of a second gene in an absence of a perturbation in the first gene.
  • the reference gene expression level is a gene expression level of the second gene when the second gene has been perturbed.
  • the method further comprises receiving an input that identifies a cell property that is not present in a cell in an absence of the perturbation, and using the one or more trained models, fine-tuning the genetic profile based on a set of training cells.
  • methods may comprise: receiving a gene expression value of a first gene from a cell or a population of cells; generating, using one or more trained models, one or more perturbed genes associated with a change in the gene expression value of the first gene; and outputting the one or more perturbed genes.
  • the cell or the population of cells do not exhibit the change in the gene expression value in an absence of the one or more perturbed genes.
  • the generating further comprises identifying a type of perturbation associated with the change in the gene expression value of the first gene.
  • the type of perturbation comprises an insertion, a deletion, a change in copy number, a point mutation, a frameshift mutation, a missense mutation, a nonsense mutation, a mutation in a stop codon, an addition, a deletion, and/or a change to an epigenetic mark, a reduction in gene expression, an overexpression of a gene, or any combination thereof.
  • the one or more trained models generate the one or more perturbed genes, at least in part, by processing an interdependency between the first gene and a property of the cell.
  • the cell property of the cell is selected from the group consisting of: a transcriptomic property, a structural property, an electrical property, a mechanical property, a biochemical property, a metabolic property, and a physical property.
  • the cell property of the cell comprises: (i) non-viability of a cell; (ii) viability of a cell; (iii) a variation in a sequence of the perturbed first gene or a variation in the sequence of the polynucleotide sequence; (iv) an alteration in a gene product that is correlated with the perturbed first gene; (v) an alteration in a second gene that is correlated with the perturbed first gene; (vi) a disease or a condition that is associated with the perturbed first gene; (vii) an environmental change that is associated with the perturbed first gene; or (viii) any combination of (i)-(vii).
  • the one or more trained models generate the one or more perturbed genes, at least in part, by identifying an interdependency between the gene expression value of the first gene and the one or more perturbed genes.
  • the interdependency between the gene expression value of the first gene and one or more perturbed genes is determined, at least in part, by a set of data obtained from an in-vitro assay.
  • the in-vitro assay comprises single cell RNA sequencing (scRNA seq).
  • the methods further comprise: receiving a gene expression value for one or more additional genes from the cell or a population of cells; generating, using one or more trained models, one or more perturbed genes associated with a change in the gene expression value of the one or more additional genes; and outputting the one or more perturbed genes.
  • EXAMPLE 1 AN IN SILICO MODEL OF GENE EXPRESSION PROFILES OF A CELL POPULATION FOR A GENOTYPE OF INTEREST.
  • Described herein are examples of a model (or a combination of models, if functionality described herein is divided between two or more models) that receives a gene name or polynucleotide sequence of the gene or RNA transcript as input and predicts its gene expression across one or more cells (FIGS. 1-3).
  • cell level gene expression value predictions are formulated as fine tuning tasks for a sequence foundation model that had been trained, for example, with self-supervision on polynucleotide sequences (e.g., DNA sequences).
  • the schematic of FIG. 3 depicts one sequence per gene, but in other embodiments may be extended to include allele sequences when modeling cells in silico, for example, tumor cells or brain cells.
  • models that operate in accordance with techniques provided herein take a gene-specific DNA sequence as input to a sequence foundation model and generate gene expression values for a gene or a set of genes across a set of single cells.
  • a pretrained sequence embedding model for example, HyenaDNA, Enformer
  • the DNA sequence of a gene for example Xbp around the transcriptional start site (TSS)
  • the DNA-sequence of the allele of the gene is also input into the embedding.
  • a loss of an allele (or gain) is encoded where a loss of an allele is a value of 0 and a gain of an allele is an additional input sequence.
  • the at least one trained model provided herein is at cell line level and the embeddings can all be pre-computed for the cell line of interest. Additional multi-omic data shared across the cell line at sequence level, such as methylation, can also be included, with corresponding embedding models producing multiple, parallel embedding vectors in addition to the sequence embedding(s). The context of these embeddings does not have to be the same as that of the sequence embedding, but can span a larger window of the genome. If the additional signal is at cell level, cell level embeddings (for example scBasset to embed scATAC-seq) can be added to the module before the cell-level gene expression generation in silico.
  • One or more models are trained to take the embedding(s) as input and return the gene expression value for a given gene across a set of cells.
  • each training instance may be a gene and its readout in the training cells, that is, each gene may encode in the same model repeated across the gene-dimension (FIG. 3).
  • a dataset can be sub-sampled into sets of cells that form the target values and increase the number of training samples while reducing the complexity of the at least one trained model (FIG. 4 and FIG. 5). With the growing number of cells per dataset that may be needed for target identification, the output layer of a model provided herein can scale to millions of cells.
  • Subsampled sets vary in length allowing the at least one trained model to predict the gene expression patterns for any number of cells per dataset.
  • the at least one trained model may in some cases be run in parallel on all genes for cell-level gene expression.
  • a second model is trained to learn co-regulatory networks.
  • the predicted gene expression values may form the input to a pre-trained scRNA-Seq foundation model that embeds the gene expression series via an encoder.
  • the same pre-trained model is repeated across all cells (FIG. 6).
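The sub-sampling of a dataset into variable-length sets of cells, described in this example, can be illustrated with a minimal Python sketch. The function names, the size bounds, and the nested-dictionary layout of the expression data are assumptions made for illustration; the source does not specify how the sets are drawn.

```python
import random


def subsample_cell_sets(cell_ids, n_sets, min_size, max_size, seed=0):
    """Draw n_sets random subsets of cells; subset sizes vary between the bounds."""
    rng = random.Random(seed)
    upper = min(max_size, len(cell_ids))
    cell_sets = []
    for _ in range(n_sets):
        size = rng.randint(min_size, upper)  # variable-length sets
        cell_sets.append(rng.sample(cell_ids, size))  # sampling without replacement
    return cell_sets


def build_training_samples(expression, cell_sets):
    """Create one training sample per (gene, cell set) pair.

    expression: dict mapping gene -> dict mapping cell id -> expression value
    (a hypothetical layout chosen for this sketch).
    """
    samples = []
    for gene, per_cell in expression.items():
        for cells in cell_sets:
            targets = [per_cell[c] for c in cells]
            samples.append((gene, cells, targets))
    return samples
```

Under these assumptions, each gene contributes as many training samples as there are cell sets, so the number of samples grows with the number of sets rather than with the number of cells.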
  • EXAMPLE 2 TRAINING OF A FIRST MODEL AND SECOND MODEL TO PRODUCE GENE EXPRESSION PROFILES AT THE CELL LEVEL.
  • the expression values of all genes for the cell can be generated from a first model and then fed into a second model as a gene expression profile (FIG. 5 and FIG. 6).
  • a small number of epochs are trained to receive the input gene sequence and produce the corresponding observed profile.
  • For training, cells from one cell line are split into test and training sets.
  • all of the cells per dataset are predicted at the same time with the first model in a given sequence or train models individually trained on the outputs from the first model.
  • the first model is trained on subsets of cells by sampling random sets of cells from the training set to form training runs for each gene.
  • the first and second models can be trained on individual cells being one training sample or after pretraining the sequence-based model to generate in silico single cell RNA sequence datasets of cell populations sharing the same genotype, for example, clonal cells within glioblastoma cell lines. Training on one cell line and validating on a second cell line can be performed initially with smaller datasets that can be extended to larger and larger datasets for fine-tuning the one or more models.
  • Fine-tuned models that predict cell states learned from the RNA-sequencing data are also provided herein.
  • the fine-tuned models are trained for a small number of epochs to adjust for the difference between the input gene expression predicted from the first sequence-to-expression model and the actual observed profiles from in-vitro data sets.
  • EXAMPLE 3 PREDICTING PERTURBATION EFFECTS ON CELL FATES FOR ALL CELLS IN THE POPULATION AND IDENTIFYING GENETIC BIOMARKERS.
  • a model implemented in accordance with some techniques described herein can be used to generate cell-level expression values that are similar to a given cell line.
  • One or more models might be implemented.
  • the model(s) may receive as input a perturbation to a gene and output the genetic profile for a cell.
  • a model or a system of some such embodiments can determine from the genetic profiles identified from RNA sequencing and cellular data how a perturbation alters cell fate or expresses a gene or biomarker (FIGS. 7 and 8).
  • One or more trained models operating in accordance with some techniques described herein may receive natural language inputs that may ask for the changes in gene expression that occur upon perturbation for a given gene knock-down on each cell expression profile from real RNA-Seq.
  • importance scores or ablation studies can be used to identify the genetic markers that are most relevant to make an accurate prediction of the gene expression of top gene candidates (e.g., ranked from highest relative change to lowest) for a given set of cells.
  • the trained model(s) can include an extension where at least one model is trained in which the embedding of cell A is updated with the current embeddings of adjoining cells B, C at each training step.
  • the resulting embeddings are concatenated and used as (additional) input to identify additional biomarkers or genes.
  • EXAMPLE 4 APPLICATIONS OF THE CELL-LINE RESPONSE MODEL.
  • a cell-line level measure of how a given perturbation affects the proliferation of cells can be used to fine-tune any of the models provided herein.
  • the perturbation is taken into account before passing the predicted gene expression values from the first model into a second model to mimic the CRISPRi knock down (or by removal of the gene’s DNA sequence from the input to mimic CRISPR knock out).
  • This method can downscale the gene’s predicted expression values across all perturbed cells by an efficiency factor to mimic the CRISPRi effect.
  • a cell line level model is trained where the target value is the dropout rate for cells given a perturbation.
  • the model(s) can also use all dropout rates at the same time. Nevertheless, all cells in a given population of cells can be monitored for whether they have been perturbed with the same target or different target. The effects are linked via common pathways or co-regulation patterns. This requires only a shallow neural network after the scRNA sequencing foundation models.
  • the process for the cell line level model to predict a target gene expression value at cell line from the cell population given the genotype is shown in FIG. 8.
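The CRISPRi knock-down step of this example, in which a gene's predicted expression is downscaled across perturbed cells, might be sketched as follows. This sketch assumes that an efficiency of 0.8 means 80% of the predicted expression is removed; the source states only that an efficiency factor is applied, so the scaling rule, the function name, and the data layout are illustrative.

```python
def apply_crispri(expression, target_gene, perturbed_cells, efficiency):
    """Return a copy of the predicted expression with the target gene downscaled.

    expression: dict mapping gene -> dict mapping cell id -> predicted value.
    efficiency: assumed fraction of expression removed by the knock-down (0..1).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    # Copy every per-gene dictionary so the input prediction is left untouched.
    perturbed = {gene: dict(per_cell) for gene, per_cell in expression.items()}
    for cell in perturbed_cells:
        if cell in perturbed[target_gene]:
            perturbed[target_gene][cell] *= 1.0 - efficiency
    return perturbed
```

The perturbed values would then be passed to the second model in place of the unperturbed predictions. Cells outside `perturbed_cells` keep their original values.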
  • EXAMPLE 5 GENERATING TARGET GENES FOR THERAPEUTIC INTERVENTION.
  • One or more models operating in accordance with some techniques provided herein can identify specific target genes from a series of inputs with a training step.
  • the trained model(s) of this example can be trained on large data sets of single-cell RNA sequencing data that is produced by high-throughput in-vitro assays for various cell types.
  • the in-vitro assay can be a genomic integration assay.
  • a genetic perturbation is made in a population of cells using CRISPRi techniques and a guide RNA with a unique molecular identifier (UMI) or a cellular barcode (CB) that is introduced to the perturbed gene of a cell.
  • the method can further include cell viability assays, microscopy for cellular structural and mechanical phenotype identification, or functional assays for electrical, mechanical, or metabolic changes that are identified with the scRNA sequencing data profile.
  • Training and Outputs: The datasets obtained from the in-vitro assays are used for a training sequence in which one or more models use transfer learning to track and identify relationships between cell type, cell properties, and genes associated with the cell property in a particular type of cell, which is a major component of lead target discovery for developing new therapeutics.
  • EXAMPLE 6 LEARNING INDIVIDUAL GENE’S SCRNA-SEQ DISTRIBUTIONS FOR DIFFERENT GENETIC CONTEXTS.
  • Some embodiments may include one or more models that may be labeled herein “CosySeq2Cell,” which may be trained to predict distribution of a gene’s expression value in a population of cells.
  • a sample of these distributions may, in some embodiments, be used as an input to an auto-encoder model to predict single cell transcriptomes.
  • Some methods of this example fine-tune layers of a model (e.g., a model as provided by Enformer) and learn an aggregation layer that gives different weights to different positions using embeddings for each of the positions in the full 200kb sequence with Enformer instead of only one that is representing the transcriptional start site.
  • the model can thereby learn more subtle contexts across the full sequence, while the learnable aggregation layer provides interpretability by identifying which genomic positions are needed for a model prediction.
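As an illustration of an aggregation layer that weights genomic positions, the following sketch pools per-position embeddings using softmax-normalized scores. The use of a softmax and the names below are assumptions; the source says only that the layer gives different weights to different positions. In a trained model the scores would be learned parameters, while here they are passed in directly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_positions(position_embeddings, position_scores):
    """Weighted sum of per-position embeddings.

    position_embeddings: list of equal-length vectors, one per genomic position.
    position_scores: one score per position (learned in practice).
    Returns the pooled vector and the weights, which can be inspected
    to see which positions contribute most to the prediction.
    """
    weights = softmax(position_scores)
    dim = len(position_embeddings[0])
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, position_embeddings))
        for d in range(dim)
    ]
    return pooled, weights
```

Returning the weights alongside the pooled vector reflects the interpretability point made above: positions with larger weights are the ones the aggregation relies on.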
  • expression can be modelled using a Poisson regression, using a negative Poisson log likelihood loss to optimize, for a given cell, the probability of sampling the observed (normalized) number of RNA-molecules given the gene’s underlying DNA sequence.
  • a read count read-out per gene in a cell is provided as an observation/sample of a Poisson distribution in order to model the distribution rather than the data point (see schematics in FIG. 10A).
  • the prediction captures information on the genotype and can carry this information onwards when it forms an input to a model downstream that models the joint probability distribution of all gene expression values within a cell. Distributions are sampled to generate as many training samples as necessary that all capture the genetic context for the downstream model, and the samples are not limited by the number of cells in the dataset.
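The Poisson negative log likelihood used in this example can be written out in a few lines. For a predicted rate \lambda and an observed count k, the loss is \lambda - k \log \lambda + \log k!. The sketch below is a plain-Python illustration with hypothetical function names, not the training code of the source.

```python
import math


def poisson_nll(rate, count):
    """Negative log likelihood of observing `count` under Poisson(rate).

    lgamma(count + 1) equals log(count!) and also accepts the non-integer
    values that arise when counts are normalized.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return rate - count * math.log(rate) + math.lgamma(count + 1)


def mean_poisson_nll(rates, counts):
    """Average loss over paired predictions and observations (e.g. over cells)."""
    losses = [poisson_nll(r, c) for r, c in zip(rates, counts)]
    return sum(losses) / len(losses)
```

For a fixed set of observed counts, a single shared rate minimizes this loss when it equals the mean of the counts, which is why optimizing it amounts to learning the mean of the distribution rather than fitting individual data points.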
  • EXAMPLE 7 COSYSEQ2CELL USES SEQUENCE INFORMATION FOR ACCURATE GENE EXPRESSION PREDICTION.
  • a model was warm-started with weights of relevant Enformer layers and then fine-tuned on scRNA-seq data from a set of ~10k lymphoblast (K562) cells with ~8k expressed genes in total sampled from unperturbed cells.
  • One training sample included the DNA-sequence of 200kb around the transcriptional start site of one gene.
  • the model was asked to predict the observed gene expression values of all 10k cells simultaneously.
  • Using the Poisson log likelihood loss means that the model is learning to predict the mean of a Poisson distribution from which the observed gene expression values were sampled, and so individual gene expression value distributions per cell were modeled rather than the individual data points (FIG. 10A).
  • EXAMPLE 8 COSYSEQ2CELL CAN ACCURATELY PREDICT EXPERIMENTALLY VALIDATED ENHANCER ELEMENTS
  • FIG. 11B shows the changes in predicted gene expression distributions for the gene HHEX under the perturbation: disrupting the enhancer sequence shifts the full distribution to the lower end. Perturbation of a sequence much closer to the TSS results in an even more dramatic shift (FIG. 11B, right).
  • FIG. 11C shows for the genes HHEX and LCOR large changes in expression prediction upon perturbation around the TSS and in the enhancer region. While gene expression prediction may increase or decrease upon perturbation on the TSS, perturbation on the enhancer led to a decreased expression prediction.
  • the annotated enhancer regions were replaced with random sequences, and the change in prediction was computed for both CosySeq2Cell and Enformer.
  • CosySeq2Cell outperforms Enformer for both high and high+low confidence enhancer-gene pairs where the enhancer is within 3kb of the TSS (FIG. 11D).
  • CosySeq2Cell can identify changes in the DNA sequence that alter gene expression, such as changes in regulatory regions like enhancers.
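The in silico enhancer perturbation of this example, in which an annotated region is replaced with a random sequence and the change in prediction is recorded, can be sketched as below. The `predict_fn` argument stands in for any sequence-to-expression model and is hypothetical, as are the function names and the half-open coordinate convention.

```python
import random


def randomize_region(sequence, start, end, seed=0):
    """Replace sequence[start:end] with random nucleotides of the same length."""
    rng = random.Random(seed)
    replacement = "".join(rng.choice("ACGT") for _ in range(end - start))
    return sequence[:start] + replacement + sequence[end:]


def perturbation_effect(predict_fn, sequence, start, end, seed=0):
    """Change in predicted expression after scrambling one region.

    predict_fn: callable mapping a DNA string to a scalar prediction
    (a placeholder for a trained model in this sketch).
    A negative return value means the perturbation lowered the prediction.
    """
    baseline = predict_fn(sequence)
    perturbed = predict_fn(randomize_region(sequence, start, end, seed))
    return perturbed - baseline
```

Because a single random replacement can recreate a functional motif by chance, one would typically average the effect over several seeds; that averaging is left out to keep the sketch short.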
  • EXAMPLE 9 GENERATING IN SILICO SCRNA-SEQ DATA FOR CELL POPULATIONS WITH COSYSEQ2CELL.
  • CosySeq2Cell predicts expression value distributions that are based on the underlying DNA-sequence; in some applications, the model does not have (or learn) any information about co-regulatory networks.
  • an auto-encoder model was used to calibrate the predicted expression values into full single cell transcriptomes by taking into account co-regulation patterns. As inputs to the auto-encoder, samples were generated from the Poisson distributions per gene and cells learned from CosySeq2Cell (FIG. 10A, middle and right boxes). The model was asked to reconstruct the actual gene expression profile of the cell across all the expressed genes from these samples as shown in FIG. 12A.
  • a Poisson mean was predicted using CosySeq2Cell (first row) and a sample value (second row). This forms the input for the auto-encoder to predict the actual gene expression profile (last row).
  • the predicted values from the autoencoder (third row) can use information from all the 8000 genes to make a prediction which is closer to the target values.
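Generating auto-encoder inputs by sampling from per-gene Poisson distributions, as described in this example, can be illustrated with the standard library alone. The sampler below uses Knuth's multiplication method, which is adequate for the small means typical of single-cell counts; the function names and the dictionary layout are assumptions made for this sketch.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson-distributed integer using Knuth's method."""
    if mean <= 0:
        return 0
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def sample_profiles(poisson_means, n_samples, seed=0):
    """Sample n_samples expression profiles from per-gene Poisson means.

    poisson_means: dict mapping gene -> predicted Poisson mean for one cell.
    Each returned profile is a dict mapping gene -> sampled count.
    """
    rng = random.Random(seed)
    return [
        {gene: sample_poisson(mean, rng) for gene, mean in poisson_means.items()}
        for _ in range(n_samples)
    ]
```

Since `n_samples` is a free parameter, the number of training inputs produced this way is not tied to the number of cells in the original dataset.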
  • EXAMPLE 10 COSYFORMER: A FAMILY OF FOUNDATIONAL MODELS TRAINED ON LARGE SCRNA-SEQ DATASETS
  • an auto-encoder model to process outputs that are samples from distributions of gene expression values, to predict single cell transcriptomes.
  • another model (or multiple models) may be used instead of an auto-encoder model.
  • one or more models may be trained on a corpus of scRNA-seq training data. Some such models may be labeled “CosyFormer” in examples herein.
  • the non-zero median expression value of each gene across the training corpus was used in this example to normalize the per-cell expression value for the purpose of prioritizing genes that uniquely distinguished cell state.
  • Three different training datasets were accessed: the first one containing normal cells; the second normal and non-cancer cells; and the last containing all cells.
  • a BERT-based architecture was used with six transformer layers with input size of 2,048, embedding dimensions 256, four attention heads per layer and feed forward size of 512. All models were trained for 3 full epochs during which the test loss for overfitting was monitored. Early stopping was not necessary.
  • a schematic of the model and the input and prediction task is shown in FIG. 13A.
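The normalization by non-zero median described in this example can be sketched as follows. The sketch also converts the normalized values into a ranked list of genes truncated to the input size of 2,048; the rank encoding step, the function names, and the tie-breaking by gene name are assumptions made for illustration.

```python
from statistics import median


def nonzero_medians(corpus):
    """Median of the non-zero expression values of each gene across a corpus.

    corpus: list of cells, each a dict mapping gene -> expression value.
    """
    values = {}
    for cell in corpus:
        for gene, value in cell.items():
            if value > 0:
                values.setdefault(gene, []).append(value)
    return {gene: median(vals) for gene, vals in values.items()}


def rank_encode(cell, medians, max_len=2048):
    """Rank the expressed genes of one cell by median-normalized expression."""
    scaled = {
        gene: value / medians[gene]
        for gene, value in cell.items()
        if value > 0 and gene in medians
    }
    # Highest normalized value first; ties broken by gene name for determinism.
    ranked = sorted(scaled, key=lambda gene: (-scaled[gene], gene))
    return ranked[:max_len]
```

Dividing by the corpus-wide median lowers the rank of genes that are highly expressed everywhere, so genes that distinguish a particular cell state move towards the front of the ranking.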
  • EXAMPLE 11 COSYFORMER MODEL FUNCTION ON MASKED TOKEN PREDICTION METRICS.
  • PBMCs peripheral blood mononuclear cells
  • micro- and macro-averaged hits@k metrics on masked tokens were evaluated at different thresholds in the 2000 highest expressed genes in 10k randomly sampled PBMCs (FIG. 13B and FIG. 19).
  • Macro averaging gives equal weight to each gene when computing the accuracy, whereas the microaverage gives equal weight to each instance.
  • Micro-average is a measure of the model’s overall performance on the most frequently occurring genes, while macro-averaging gives a sense of the model’s performance by giving equal weight to each gene.
  • an instance is one prediction instance, e.g., one of the 15% masked genes in the 2000 genes that the model was instructed to fill in.
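The difference between micro- and macro-averaged hits@k can be made concrete with a short sketch. Each prediction instance is represented here as a pair of the true masked gene and the model's ranked candidates; this representation and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def hits_at_k(predictions, k):
    """Compute micro- and macro-averaged hits@k.

    predictions: list of (true_gene, ranked_candidates) pairs, one per
    masked-token prediction instance.
    Returns (micro, macro).
    """
    per_gene = defaultdict(list)
    for true_gene, ranked in predictions:
        hit = 1.0 if true_gene in ranked[:k] else 0.0
        per_gene[true_gene].append(hit)
    # Micro: every instance counts equally, so frequent genes dominate.
    all_hits = [h for hits in per_gene.values() for h in hits]
    micro = sum(all_hits) / len(all_hits)
    # Macro: average the per-gene accuracies, so every gene counts equally.
    macro = sum(sum(h) / len(h) for h in per_gene.values()) / len(per_gene)
    return micro, macro
```

For example, if gene A is masked three times and always recovered at k=1 while gene B is masked once and missed, the micro average is 0.75 and the macro average is 0.5.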
  • EXAMPLE 12 COSYFORMER MODELS PREDICT OVERALL GENE RANKINGS MORE ACCURATELY THAN GENEFORMER.
  • the model’s performance was evaluated by finetuning it on a range of cell and gene classification tasks. Fine-tuning tasks were replicated in those cases where the data was publicly available as one dataset: transcription factor dosage sensitivity and chromatin dynamics (bivalently marked promoters) as gene classification tasks, and cell type annotation as cell classification task.
  • Gene dosage sensitivity is commonly predicted using conservation and allele frequency to interpret copy number variants in genetic diagnoses. However, these features do not vary across cell states and do not capture transcriptional dynamics that may inform contextual dosage sensitivity. Gene sets that were either dosage sensitive or insensitive were used to label the genes in 10k single-cell transcriptomes randomly selected from the training dataset as either dosage sensitive or insensitive. The fine-tuned CosyFormer-012 performed similarly to Geneformer (FIG.
  • ESCs embryonic stem cells
  • or those that are bivalently marked in ~15k transcriptomes of ESCs were also evaluated.
  • CosyFormer-012 also predicted the state of genes that were excluded from the fine-tuning training data comprising 56 labelled genes. For the cell classification task of predicting the cell type per tissue, the fine-tuned CosyFormer-012 achieved comparable performance accuracy relative to Geneformer.
  • Embodiments operating in accordance with some techniques described herein may also include a model to predict transcriptomic states of cell populations under CRISPRi perturbation in different genetic contexts.
  • a model operating in accordance with some techniques discussed above in connection with CosySeq2Cell models may be implemented, which provide samples of gene expression distributions in a cell population with a given genetic context. Such samples that are received from a CosySeq2Cell model may be input to a fine-tuned model operating in accordance with some techniques described above in connection with CosyFormer models, which would predict perturbed transcriptomic states for a given CRISPRi target.
  • This combination of models may connect genetic changes and effects of CRISPRi perturbations at a single cell level. This may allow, in some embodiments, identification of genetic markers together with CRISPRi targets to promote desired transcriptomic phenotypes. Such a combination of models, or a model performing these tasks, may be labeled “PerturbFormer” in some examples herein.
  • a CosyFormer model is finetuned to predict effect of a knock-down of a gene on the gene expression ranking as measured by scPerturb-seq.
  • This problem was phrased as a Question-Answering (QA) problem, where the “question” is the name of the gene that is perturbed by CRISPRi and the model learns to produce the perturbed gene expression ranking of the cell as the “answer”.
  • the model was also provided with a context for the answer, for example, a gene expression ranking from a randomly sampled cell from the unperturbed population. The results were drawn from the distribution predicted by CosySeq2Cell to account for the effect of genetic variation on gene expression profiles.
  • CosyFormer-012 was finetuned using PerturbSeq data in a supervised manner.
  • Encoder models are suited to the task of learning contextualized gene embeddings that can encode information from the full sequence. Any down-stream task requires fine tuning to make a prediction based on these contextualized gene embeddings.
  • decoder models learn gene rankings in an autoregressive manner, always only looking at the first N genes to make a prediction for position N+1. In the case of predicting genes ranked by their expression, this means the model may not have important information from the lower ranked genes available to make its prediction.
  • CosyFormer was fine-tuned into a generative model by retraining the pretrained model weights.
  • These encoder-decoder models can take any pre-trained encoder and/or decoder model and use the weights for its encoder/decoder modules where appropriate, while other weights are initialized randomly and learnt during training.
  • available perturb-seq data was processed in lymphoblast cells, resulting in 1.9M cells, of which 1.8M cells contained targeting guides. For each cell with a targeting guide, the output gene ranking was encoded as a sequence of token IDs corresponding to the gene names via CosyFormer’s tokenizer, framed by a [CLS] and a [SEP] token.
  • the target gene name of the perturbation was encoded, framed by a [CLS] and a [SEP] token, followed by an encoded gene ranking sampled from the cell population with non-targeting guides, followed by a [SEP] token.
  • a schematic of the model’s architecture, the input data and prediction task are shown in FIG. 15A.
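The input and output token layouts described above can be sketched with a toy vocabulary. The token IDs, the `[PAD]` entry, and the function names are assumptions; CosyFormer's actual tokenizer is not described in the source.

```python
def build_vocab(genes):
    """Toy vocabulary: special tokens first, then one ID per gene name."""
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
    for gene in genes:
        vocab.setdefault(gene, len(vocab))
    return vocab


def encode_input(target_gene, context_ranking, vocab):
    """[CLS] target [SEP] unperturbed ranking ... [SEP]

    The target gene is the "question"; the ranking sampled from a cell
    with a non-targeting guide is the context.
    """
    question = [vocab["[CLS]"], vocab[target_gene], vocab["[SEP]"]]
    context = [vocab[gene] for gene in context_ranking]
    return question + context + [vocab["[SEP]"]]


def encode_output(perturbed_ranking, vocab):
    """[CLS] perturbed ranking ... [SEP], the "answer" the model learns to produce."""
    body = [vocab[gene] for gene in perturbed_ranking]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
```

With genes G1, G2 and G3 assigned IDs 3, 4 and 5, a knock-down of G2 with the context ranking G1, G3 is encoded as `[1, 4, 2, 3, 5, 2]`.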
  • the perturbed genes were split into training, test and validation perturbations and the processed data was split into different datasets accordingly.
  • the model was trained for 3 epochs on the training data set with similar hyperparameters as were chosen to train the CosyFormer models.
  • the model was evaluated by asking it to predict single cell transcriptomes for each of the cells whose perturbation target was set aside for validation.
  • Per perturbation target around 300 perturbed transcriptomes were predicted from an unperturbed transcriptome and perturbation information.
  • the top 100 genes with the highest change in absolute rank between the predicted and input (e.g., the unperturbed) transcriptomes were recorded.
  • the top 20 differentially expressed genes (DEGs) between the perturbed and the unperturbed cell populations for each perturbation in the validation set were calculated.
  • the top variable genes amongst the unperturbed cell population were calculated. These genes are useful for differentiating transcriptomic states in the cell population and therefore are likely to change expression upon perturbation.
  • the 100 highest average ranked genes across the set of unperturbed input transcriptomes per perturbation e.g., a size-matched random sample of unperturbed cells per perturbation
  • GRNboost was used as a first step for inferring gene regulatory networks (GRN) from gene expression data. GRNboost trains XGBoost models to predict the expression of each gene (child gene) in the dataset, using the input of all the other genes (e.g., parent genes), using the feature importance scores to identify the most predictive genes.
  • PerturbFormer recovered significantly more differentially expressed genes (DEGs) in its validation data than GRNBoost, and was comparable to the two baselines (FIG. 15B).
  • EXAMPLE 15 GENETICALLY ENHANCED-COSYFORMER.
  • a drawback of a Geneformer model is that Geneformer models cannot directly take genetic information into account.
  • some CosySeq2Cell models operating in accordance with some techniques provided herein can accurately predict experimentally validated enhancer elements, showing that a gene’s underlying DNA sequence harbors enough information to predict its expression across several cells in a cell population.
  • the first step in CosyFormer implementations that are BERT-based models may be passing input token IDs (e.g., integers corresponding to gene names) through an embedding layer which acts as a lookup table.
  • each token ID is mapped to a fixed-sized vector according to pre-learned embeddings.
  • These embeddings form the input to the transformer encoder and are passed through multiple layers with self-attention.
  • the model can then be (pre-)trained end-to-end, such as in the same way as described for CosyFormer above on the dataset v0.1.2 containing all cells from non-cancer-related studies.
  • a schematic of this Genetically Enhanced-CosyFormer (labeled herein GE-Cosyformer) is shown in FIG. 16A.
  • EXAMPLE 16 GE-COSYFORMER ENCODES THE IMPACT OF DNA-SEQUENCE CHANGES ON CO-REGULATION PATTERNS.
  • This example includes a trained PerturbFormer model based on a pre-trained CosyFormer model.
  • the inputs to the transformer encoder were the projected Enformer embeddings of the input DNA sequence. These embeddings were passed through multiple layers with self-attention. The output is the same number of now-contextualized embeddings, one for each input token.
  • the BERT encoder allows each contextualized token embedding to capture information from other tokens in the input. In this example, these embeddings are refined through multiple layers of self-attention and feedforward layers through aggregation of contextual information from the entire input sequence.
  • the contextualized embeddings were used to understand how changes in the input DNA sequence (e.g., in silico mutagenesis) affect the embeddings of the other genes in the transcriptome of a single cell, and ultimately whether the model has learned co-expression patterns that are dependent on genetic changes.
  • the inventors assessed the impact of in silico mutagenesis of the underlying DNA sequence of transcription factors on the gene embeddings of their targets and, as control, the remainder of the genome.
  • the transcription factor GATA4 in fetal cardiomyocytes was used as a target. In silico mutagenesis of 100 bp around the transcription start site (TSS) of GATA4 was performed by replacing the original sequence with a random sequence.
  • the contextualized gene embeddings of genes that are downstream targets were compared before and after perturbation. Changes in the gene embeddings of the targets were significantly larger than for the control, as measured by the cosine similarity before and after perturbation. In addition, the 50 genes whose embeddings displayed the biggest changes were returned as putative targets.
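By way of illustration only, the before/after comparison of gene embeddings can be sketched as follows. The function names and the dictionary-based data layout (gene name mapped to embedding vector) are assumptions made for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_putative_targets(before, after, top_n=50):
    """Return the top_n genes whose embeddings changed most after perturbation.

    `before` and `after` map gene names to contextualized embeddings computed
    without and with the in silico mutation, respectively. A lower cosine
    similarity is read as a larger change.
    """
    sims = {g: cosine_similarity(before[g], after[g]) for g in before if g in after}
    return sorted(sims, key=lambda g: (sims[g], g))[:top_n]
```

The similarity values of a known target set can then be compared against those of a control gene set with any suitable statistical test.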
  • GATA4 and TBX5 are two known congenital heart disease genes which are co-expressed during cardiac morphogenesis, physically interact and have co-bound targets. A GATA4 mutation was shown to disrupt recruitment of TBX5 to cardiac enhancers.
  • the contextualized embeddings of the targets of GATA4 and TBX5 changed more than those of a random gene set, and many of the top 50 putative targets were true GATA4 and TBX5 targets (FIG. 16B).
  • EXAMPLE 17 MODELLING SINGLE-CELL RNA-SEQ OF A CELL POPULATION WITHIN A GIVEN GENETIC CONTEXT.
  • This example describes techniques that may be used to implement a CosySeq2Cell model in some embodiments. Examples of CosySeq2Cell models are described above.
  • CosySeq2Cell was trained to learn the expected number of RNA molecules λ given the DNA sequence by optimizing the log likelihood:
  • CosySeq2Cell models genes and cells independently, and so the joint probability distribution it learns over all gene expression values in a cell is the product of all the individual gene expression distributions, e.g.,
  • N is the total number of expressed genes in a cell.
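By way of illustration only, and assuming Poisson-distributed counts (consistent with the Poisson negative log-likelihood loss described for CosySeq2Cell training), the log-likelihood and the factorized joint distribution could take the following form. The symbols $x_i^j$ (expression of gene $i$ in cell $j$), $\lambda_i^j$ (predicted mean), $s_i$ (DNA sequence of gene $i$) and $f_\theta$ (the model) are notation assumed for this sketch rather than taken from the source.

```latex
\begin{align}
\log \mathcal{L}(\theta)
  &= \sum_{i}\sum_{j}\left( x_i^j \log \lambda_i^j - \lambda_i^j - \log\!\left(x_i^j!\right) \right),
  \qquad \lambda_i^j = \left[f_\theta(s_i)\right]_j \\
p\!\left(x_1^j, \ldots, x_N^j\right)
  &= \prod_{i=1}^{N} p\!\left(x_i^j \mid s_i\right)
\end{align}
```

The second line makes the independence assumption explicit: no factor for gene $i$ depends on the expression of any other gene in the same cell.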
  • Because genes can regulate the expression of other genes (e.g., transcription factors and their targets), this is not a good representation of a single-cell transcriptome.
  • CosyFormer learns a distribution of gene rankings for individual cells from scRNA-seq data (Wang and Cho, 2019), which is implicitly an approximation of the joint probability distribution of genes
  • rank is a ranking operator returning the gene names ranked in descending order by their values $x_i^j$.
  • The conditional distribution was approximated when modifying the expression of a gene in silico. For example, when forcing the expression of gene i in cell j to be zero, CosyFormer was used to approximate $p(\mathrm{rank}(x_1^j, \ldots, x_N^j) \mid x_i^j = 0)$.
  • EXAMPLE 18 COSYSEQ2CELL ARCHITECTURE, TRAINING AND EVALUATION.
  • CosySeq2Cell was trained on 10k cells sampled from cells with non-targeting guides to represent cells without a perturbation. After filtering genes to have a minimum total read count of 10 and to be expressed in at least 500 cells, a count matrix of 10k cells and 8k genes was obtained. The counts were normalized by dividing by the total count per cell and then log-transformed. For each of the 8k genes, the transcriptional start site (TSS) was obtained from Ensembl GRCh38.98, as well as the DNA sequence per gene of length 196,608 around the TSS, where any shorter sequence was padded with N to full length.
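By way of illustration only, the gene filtering and normalization steps can be sketched as follows. The function name, the use of `log1p` (to keep zero counts finite) and the computation of per-cell totals over the retained genes are assumptions made for this sketch; the source states only that counts were divided by the total count per cell and log-transformed.

```python
import math


def preprocess_counts(counts, min_total=10, min_cells=500):
    """Filter genes and normalize a cells-by-genes matrix of raw read counts.

    Returns (kept_gene_indices, normalized_matrix).
    """
    n_genes = len(counts[0])
    keep = [
        g for g in range(n_genes)
        if sum(cell[g] for cell in counts) >= min_total          # total read count
        and sum(1 for cell in counts if cell[g] > 0) >= min_cells  # cells expressing
    ]
    normalized = []
    for cell in counts:
        total = sum(cell[g] for g in keep)  # assumption: total over kept genes
        normalized.append(
            [math.log1p(cell[g] / total) if total else 0.0 for g in keep]
        )
    return keep, normalized
```

The thresholds default to the values given above (10 reads, 500 cells) and can be lowered for small test matrices.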
  • CosySeq2Cell Model Architecture: A schematic overview of CosySeq2Cell’s architecture can be found in FIG. 10A.
  • the first layers in CosySeq2Cell’s architecture are 7 convolutional blocks with pooling, followed by 11 transformer blocks and a cropping layer to trim the positions at the far ends.
  • the inputs are one-hot-encoded DNA sequences of length 196,608 centred around the TSS.
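By way of illustration only, one-hot encoding with N-padding can be sketched as follows. The channel order A, C, G, T, the symmetric placement of the padding and the all-zero encoding of N are assumptions made for this sketch.

```python
BASES = "ACGT"  # assumed channel order


def one_hot_encode(sequence, length):
    """Pad a DNA sequence with N to `length` and one-hot encode it.

    Returns a list of `length` rows with four channels each. Any base other
    than A, C, G or T (including the padding symbol N) becomes an all-zero row.
    """
    sequence = sequence.upper()
    if len(sequence) > length:
        raise ValueError("sequence is longer than the requested length")
    pad = length - len(sequence)
    left = pad // 2  # assumption: padding is split evenly around the sequence
    padded = "N" * left + sequence + "N" * (pad - left)
    return [[1 if base == channel else 0 for channel in BASES] for base in padded]
```

For the model described here, `length` would be 196,608, giving an input of shape (196608, 4) per gene.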
  • each input sequence is represented as an embedding of length 896.
  • CosySeq2Cell has an aggregation layer, which is a learnable vector of dimensions (896, 1), that aggregates the embeddings of size (896, 3072) to (1, 3072), forming the gene embedding. This is followed by 2 MLP layers that finally predict the log mean of the Poisson distribution for each of the 10k cells.
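By way of illustration only, the aggregation step can be read as a weighted sum over the 896 positions, one learnable weight per position. The function name and the plain-list data layout are assumptions made for this sketch; a trained model would hold the weights as a learnable parameter.

```python
def aggregate_positions(embeddings, weights):
    """Collapse a (positions x channels) embedding into one channel-length vector.

    `embeddings` has one row per genomic position (e.g., 896 rows of 3072
    channels) and `weights` holds one value per position.
    """
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per position")
    n_channels = len(embeddings[0])
    return [
        sum(w * row[c] for w, row in zip(weights, embeddings))
        for c in range(n_channels)
    ]
```

With uniform weights of 1/896 this reduces to the mean over positions, which gives a simple baseline against which a learned aggregation can be compared.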
  • CosySeq2Cell Model Training and Evaluation: CosySeq2Cell was trained and evaluated using the Poisson negative log-likelihood loss function. The genes were split into training, validation and test sets by chromosome, with chromosomes 1-14 in the training set (~5,700 genes), 15-18 in the validation set (~1,200 genes) and 19-23 in the test set (~1,000 genes).
  • When training Enformer, the authors trained across species (human and mouse) and grouped sequences to avoid data leakage due to sequence conservation.
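By way of illustration only, the chromosome-based split and a per-observation Poisson negative log-likelihood can be sketched as follows. The function names, the optional `chr` prefix handling and the inclusion of the constant log(count!) term are assumptions made for this sketch.

```python
import math

TRAIN = {str(c) for c in range(1, 15)}   # chromosomes 1-14
VALID = {str(c) for c in range(15, 19)}  # chromosomes 15-18
TEST = {str(c) for c in range(19, 24)}   # chromosomes 19-23


def split_of(chromosome):
    """Return 'train', 'validation' or 'test' for a chromosome label, else None."""
    label = str(chromosome)
    if label.lower().startswith("chr"):
        label = label[3:]
    if label in TRAIN:
        return "train"
    if label in VALID:
        return "validation"
    if label in TEST:
        return "test"
    return None  # labels outside 1-23 are left unassigned in this sketch


def poisson_nll(log_mean, count):
    """Negative log-likelihood of `count` under a Poisson with mean exp(log_mean)."""
    return math.exp(log_mean) - count * log_mean + math.lgamma(count + 1)
```

Because every gene on a chromosome lands in the same split, no input window in the test set overlaps a training window.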
  • training was performed on human data and so splitting by chromosomes is sufficient to avoid potential data leakage due to overlapping sequences.
  • the PyTorch Enformer checkpoint was obtained from HuggingFace (Wolf et al., HuggingFace’s Transformers: State-of-the-art Natural Language Processing. 2020), which was implemented and trained by the Enformer Pytorch project available on the world wide web at <https://github.com/lucidrains/enformer-pytorch>.
  • the model was trained by unfreezing the last 5 Enformer-layers together with the aggregation and MLP layers.
  • the model was trained for 50 epochs, and the negative log-likelihood loss was monitored for convergence and overfitting.
  • the model was trained with the AdamW optimiser in PyTorch with 0.1 weight decay.
  • CosySeq2Cell Enhancer Prioritization: A set of experimentally derived enhancer-gene pairs was obtained to test whether CosySeq2Cell has learnt to use DNA elements to predict gene expression. The coordinates of the 5779 enhancer candidates were converted from hg19 to hg38 with the UCSC liftOver webtool (Kent et al., The Human Genome Browser at UCSC. Genome Res. 12, 996-1006, 2002).
  • the genes (from the set of 8k genes in the K562 RNA-seq data) that were within 90 kb were used to form a set of 6527 enhancer-gene candidates, of which 351 pairs were experimentally validated and 273 were of high confidence (FIG. 17).
  • the inventors replaced the exact enhancer region with random sequences and measured the absolute change of expression predicted by CosySeq2Cell. This process was repeated 90 times for each candidate pair and the mean of these was reported as the score. The scores for all the candidates were obtained and the area under the precision-recall curve (auPRC) was calculated. The same random replacement of sequences was repeated for Enformer, measuring (1) the absolute change of CAGE values at the major TSS, where the major TSS is defined as the position where the maximum CAGE value was predicted by Enformer, plus the 1 flanking position on each side, and (2) the sum of absolute changes of CAGE at all the positions.
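By way of illustration only, the candidate score and its evaluation can be sketched as follows. The function names are assumptions made for this sketch, and average precision is used here as a common step-wise estimate of the area under the precision-recall curve; the source does not state which estimator was used.

```python
def enhancer_score(reference_prediction, perturbed_predictions):
    """Mean absolute change in predicted expression over repeated replacements."""
    changes = [abs(p - reference_prediction) for p in perturbed_predictions]
    return sum(changes) / len(changes)


def average_precision(scores, labels):
    """Average precision of `scores` against binary `labels` (1 = validated pair)."""
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / n_positive
```

In the setting described above, `perturbed_predictions` would hold the 90 predictions obtained with random replacement sequences for one enhancer-gene candidate.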
  • the whole model consists of 4 MLP layers of dimension 2000, 1000, 1000, 2000, each followed by a ReLU activation.
  • the inputs are the expression values of 8k genes of a cell sampled from CosySeq2Cell, and the model is trained to reconstruct the actual expression profile of that cell for all 8k genes.
  • For CosyFormer models, the inventors downloaded all processed data from the CellxGene data portal available in the census dataset (version 25th July 2023) (CZI Single-Cell Biology Program et al., 2023). In total, the inventors downloaded 33,364,242 cells across 265 datasets (duplicate cells across datasets, for example from meta-analyses, were excluded).
  • the inventors extracted all available datasets from the Human Cell Atlas that were not already contained in the CellxGene census data, resulting in an additional ~12M cells.
  • the inventors restricted the download and processing to 4,441,544 cells across 459 datasets that were identified as “normal” to reduce time-consuming manual labelling.
  • the inventors excluded any datasets that did not have readily available Ensembl IDs.
  • the gene expression profile of a cell was encoded as an ordered sequence of gene names from left to right ordered by their (normalized) expression.
  • the inventors retained cells with total read counts within 3 standard deviations of the mean within that dataset and used the non-zero median expression value of each gene across the training corpus to normalize the per-cell expression value. This prioritizes genes that uniquely distinguish cell state. While Geneformer was trained on cells with as few as 7 expressed genes, the inventors prioritized data quality over quantity and only retained cells with at least 500 expressed genes.
  • the inventors restricted the vocabulary to the Ensembl IDs of 30,401 genes with EntrezIDs. Table 1 shows the number of genes and cells in each dataset.
  • Supplementary table ST1 compares the numbers of cells before and after filtering in the training datasets for the CosyFormer models with those of Geneformer and scGPT.
  • each cell’s ranked genes were framed by a [CLS] token at the beginning and a [SEP] token at the end. This full ranking was subsequently tokenized and encoded into token ids, e.g.
  • Each gene name (and the special tokens [CLS], [SEP]) was assigned an integer value so that the gene ranking can form a numerical input vector to the model.
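By way of illustration only, the rank encoding of a single cell can be sketched as follows. The function name, the dictionary-based inputs, the alphabetical tie-breaking and the truncation so that special tokens count toward the 2,048 input size are assumptions made for this sketch.

```python
def encode_cell(expression, gene_medians, vocab, max_len=2048):
    """Encode one cell as token ids: [CLS], genes by descending rank, [SEP].

    expression:   gene name -> expression value in this cell
    gene_medians: gene name -> non-zero median expression across the corpus
    vocab:        token -> integer id (must contain "[CLS]" and "[SEP]")
    """
    normalized = {
        gene: value / gene_medians[gene]
        for gene, value in expression.items()
        if value > 0 and gene in vocab and gene_medians.get(gene)
    }
    # Highest normalized expression first; ties are broken by gene name.
    ranked = sorted(normalized, key=lambda g: (-normalized[g], g))
    tokens = ["[CLS]"] + ranked[: max_len - 2] + ["[SEP]"]
    return [vocab[token] for token in tokens]
```

Dividing by the corpus-wide median pushes ubiquitously high genes down the ranking, so that genes characteristic of a cell state appear earlier in the sequence.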
  • a schematic of the model and the input and prediction task is shown in FIG. 13A.
  • Datasets were split into training, test and validation sets (80/10/10) resulting in:
  • CosyFormer architecture and pre-training: In this first iteration, CosyFormer has the same architecture as Geneformer, that is, six transformer encoder units, each composed of a self-attention layer and a feed-forward neural network layer, with input size of 2,048, embedding dimension of 256, four attention heads per layer and feed-forward size of 512. Like Geneformer, CosyFormer uses full dense self-attention across the input size of 2,048. Further parameters were as follows: nonlinear activation function, rectified linear unit (ReLU); dropout probability for all fully connected layers, 0.02; dropout ratio for attention probabilities, 0.02; standard deviation of the initializer for weight matrices, 0.02; epsilon for layer normalization layers, 1 × 10^-12. Modelling was implemented in PyTorch, using the Hugging Face Transformers library for model configuration, data loading and training.
  • CosyFormer was pre-trained like Geneformer with a masked language modelling task, i.e., predicting a set of randomly masked tokens in a sequence.
  • the model is trained to predict which token should be in the masked position using as context the other unmasked gene names in the ranking.
  • the inventors chose to mask 15% of input tokens (e.g., gene names) with 80% of these replaced by the mask-token, 10% with a random token and 10% with the true token - this masking strategy is widely used in NLP modelling.
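By way of illustration only, the 15% selection with the 80/10/10 replacement rule can be sketched as follows. The function name, the exclusion of special tokens from masking and the use of -100 as the ignored-label value are assumptions made for this sketch.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_ids, special_ids, mask_prob=0.15, rng=None):
    """Apply masked-language-modelling corruption to one token sequence.

    Returns (inputs, labels). Labels are -100 at positions that do not
    contribute to the loss and the original token id at selected positions.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, token in enumerate(token_ids):
        if token in special_ids or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        draw = rng.random()
        if draw < 0.8:
            inputs[i] = mask_id                # 80%: replace with the mask token
        elif draw < 0.9:
            inputs[i] = rng.choice(vocab_ids)  # 10%: replace with a random token
        # remaining 10%: keep the true token unchanged
    return inputs, labels
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible across runs.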
  • Pretraining hyperparameters were chosen to be similar to the ones used to pretrain Geneformer, with some adjustments for distributed learning: max learning rate, 1 × 10^-3 scaled by the number of GPUs; learning scheduler, linear with warmup and with linear decay after 10,000 warmup steps; Adam optimizer with weight decay parameter 0.001. Training was distributed over 4 GPUs in one node with minibatch size 11 and 2 gradient accumulation steps, leading to an effective batch size of 88. The models were trained for three epochs.
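By way of illustration only, the effective batch size and a linear warmup and decay schedule can be sketched as follows. The function names and the `total_steps` parameter are assumptions made for this sketch; the source does not state the total number of optimizer steps.

```python
def effective_batch_size(n_gpus, per_device_batch, grad_accum_steps):
    """Number of samples contributing to each optimizer step."""
    return n_gpus * per_device_batch * grad_accum_steps


def linear_warmup_decay(step, max_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp to max_lr, then linear decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, `effective_batch_size(4, 11, 2)` reproduces the effective batch size of 88 stated above.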
  • hits@k is the number of correct genes predicted (i.e., 0 or 1) in the top k predicted genes per masked token, and the macro-averaged hits@k metric is calculated as
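By way of illustration only, one plausible reading of the macro-averaged hits@k metric is sketched below: hits are first averaged per true gene and the per-gene averages are then averaged with equal weight. This grouping, and the function name, are assumptions made for this sketch.

```python
from collections import defaultdict


def macro_hits_at_k(true_genes, predicted_rankings, k):
    """Macro-averaged hits@k over masked positions.

    true_genes:         the true gene at each masked position
    predicted_rankings: for each masked position, predicted genes, best first
    """
    hits_per_gene = defaultdict(list)
    for truth, ranking in zip(true_genes, predicted_rankings):
        hits_per_gene[truth].append(1 if truth in ranking[:k] else 0)
    per_gene_mean = [sum(hits) / len(hits) for hits in hits_per_gene.values()]
    return sum(per_gene_mean) / len(per_gene_mean)
```

Under this reading, frequently masked genes do not dominate the metric, since each distinct gene contributes one term to the final average.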
  • the inventors compared the rankings by computing the Spearman correlation coefficient between them.
  • Gene embeddings: The models embed each gene of an input sequence into a 256-dimensional space that captures characteristics of the gene and the context of the other genes in the input. Contextual embeddings like these can be extracted as the hidden state weights of the embedding layers for each gene in the input sequence computed by a forward pass of the input through the model. The inventors extracted and analyzed embeddings from the second-to-last layer of the models, as this is a more generalizable representation of the input than the final layer, which learns features more directly related to the learning objective prediction.
  • Cell embeddings are generated by averaging the embeddings of each gene detected in that cell, resulting in a 256-dimensional embedding.
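By way of illustration only, the averaging of gene embeddings into a cell embedding can be sketched as follows. The function name and the list-of-lists layout are assumptions made for this sketch.

```python
def cell_embedding(gene_embeddings):
    """Average per-gene embeddings (one row per detected gene) into one vector."""
    n_genes = len(gene_embeddings)
    if n_genes == 0:
        raise ValueError("a cell needs at least one detected gene")
    dim = len(gene_embeddings[0])
    return [sum(row[d] for row in gene_embeddings) / n_genes for d in range(dim)]
```

With 256-dimensional gene embeddings as input, the result is the 256-dimensional cell embedding described above.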
  • Fine-tuning tasks: The inventors followed the Geneformer publication for the following fine-tuning tasks. The authors used the same fine-tuning hyperparameters for all applications, but noted that hyperparameter optimization generally significantly enhances learning for deep learning models and the reported results are likely to be easily improved on. Hyperparameters used for fine-tuning were as follows: max learning rate, 5 × 10^-5; learning scheduler, linear with warmup; optimizer, Adam with weight decay fix; warmup steps, 500; weight decay, 0.001; batch size, 12.
  • Fine-tuning results for gene classification applications were reported as AUC ± standard deviation and F1 score calculated on the basis of a fivefold cross-validation strategy, for which training was performed on 80% of the gene training labels and performance was tested on the 20% held-out gene training labels, repeating for five folds. All fine-tuning for gene classification tasks was performed with a single training epoch.
  • a Cosyformer model was initialized with its pretrained weights, adding a final task-specific transformer layer, and fine-tuning all non-frozen layers with the task-specific data. The numbers of frozen layers of Cosyformer were 4 (dosage sensitivity), 0 (chromatin dynamics), and 0 (cell type classification).
  • Transcription factor dosage sensitivity: 10k cells were sampled from the pretraining corpus, and the genes were labelled with their dosage sensitivity labels obtained from Theodoris et al., Nature (2023).
  • Chromatin dynamics (bivalently marked promoters): The raw scRNA-Seq data for embryonic stem cells was obtained and processed with the Cosyformer preprocessing pipelines requiring at least 100 expressed genes per cell, resulting in a dataset of 26,785 cells. Genes were labelled with the respective labels. Data was split into 5 cross-validation folds of training and test data. The inventors also tested whether, without any further training, the model fine-tuned to distinguish bivalent versus single Lys4-marked genes by training on the 56 highly conserved loci would generalize to the genome-wide setting.
  • Cell type classification: The raw scRNA-Seq data and cell type annotations were downloaded from the GitHub release for scDeepsort. Both adult and fetal tissues were used within each category. Large intestine samples included those labelled as fetal intestine, ascending colon, transverse colon, sigmoid colon, and rectum; and blood samples included those labelled as peripheral blood and bone marrow. Cells accounting for at least 5% of the total cells in each tissue were included. Data was shuffled and randomly divided into training and evaluation data at a ratio of 80:20.
  • the pretrained Geneformer was fine-tuned with zero frozen layers for 10 epochs, otherwise using the learning hyperparameters described above.
  • a model was trained for each organ to predict cell type annotation classes. Predictive performance was evaluated using accuracy and macro F1 score, which averages the F1 score for each of the classes such that each class is equally weighted for multiclass predictions.
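By way of illustration only, the macro F1 score can be sketched as follows. The function name and the convention of scoring a class with no true or predicted members as 0.0 are assumptions made for this sketch.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class is empty.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Because each class contributes one term regardless of its size, rare cell types weigh as much as abundant ones in the final score.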
  • PerturbFormer implementation and training: The inventors used pre-trained checkpoints for both encoder and decoder for training the model. For both encoder and decoder, the inventors loaded the final checkpoint of CosyFormer-012 into the model.
  • target name and probe type were retained before tokenization and data splitting into train/test and validation data.
  • the inventors encoded the output gene ranking as before as a sequence of token ids corresponding to the gene names via CosyFormer-012’s tokenizer, framed by a [CLS] and a [SEP] token.
  • the inventors encoded the target gene name of the perturbation, framed by a [CLS] and a [SEP] token, followed by an encoded gene ranking sampled from the cell population with non-targeting guides, followed by a [SEP] token.
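By way of illustration only, the encoder input layout described above can be sketched as follows. The function name and the dictionary-based vocabulary are assumptions made for this sketch; in practice the tokenizer of the pre-trained model would supply the ids.

```python
def build_perturbation_input(target_gene, unperturbed_ranking, vocab):
    """Token ids for: [CLS] target [SEP] unperturbed gene ranking [SEP]."""
    tokens = ["[CLS]", target_gene, "[SEP]"] + list(unperturbed_ranking) + ["[SEP]"]
    return [vocab[token] for token in tokens]
```

The first [SEP] separates the perturbation target from the unperturbed context, so the same gene can appear both as the target and within the ranking.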
  • a schematic of the model and its input and prediction task is shown in FIG. 15A.
  • the inventors trained the model for 3 epochs on the training data set. After each epoch, the inventors evaluated the performance on the test set by calculating the per-sample BLEU score and averaging over the full dataset.
  • Pretraining hyperparameters were chosen to be similar to the ones used to pretrain CosyFormer: max learning rate, 1 x 10 ’ scaled by the number of GPUs; learning scheduler, linear with warmup and with linear decay after 5000 warmup steps; Adam optimizer with weight decay parameter 0.001. Training was distributed over 4 GPUs in one node with minibatch size 5 and 2 gradient accumulation steps, leading to an effective batch size of 40. To speed up training, the inventors also used dynamic padding combined with a length-grouped sampler to minimize computation on padding, as for CosyFormer. Overall, pretraining took 4-5 days distributed across one node with four Nvidia L4 16GB GPUs.
  • Model architecture and training: GE-Cosyformer is a multi-modal foundation model that can learn from scRNA-seq and DNA sequences jointly. Instead of the input being a ranked list of gene names and learning the embeddings for each gene like Cosyformer and Geneformer, the inventors used the genetic embedding from Enformer and applied a simple MLP layer that maps it to the BERT embedding space.
  • the model provided herein can learn co-expression patterns under genetic context by perturbing the DNA sequences of a gene in-silico at the Enformer input level and observing changes in embeddings of other related genes.
  • the inventors used Ensembl (GRCh38.108) to obtain the transcription start site (TSS) for 25k protein-coding genes. Using the reference genome, sequences of length 196,608 bp centered on the TSS were used as input for Enformer to obtain embeddings of dimension (896, 3072), where each of the 896 positions represents a genomic location. The inventors took the mean across the genomic locations to get an embedding vector of size (1, 3072) per gene, forming the genetic embeddings. The Enformer checkpoint was obtained from the world wide web at <github.com/lucidrains/enformer-pytorch>.
  • each gene is represented by a word embedding which is randomly initialized and trained.
  • the inventors replaced the word embedding with the genetic embeddings and applied an MLP layer followed by a Softplus activation, which maps 3072 to 256, the dimension of the original word embeddings.
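By way of illustration only, the projection of a genetic embedding into the token embedding space can be sketched as a single linear layer followed by Softplus. The function names, the single-layer reading of "MLP layer" and the plain-list weights are assumptions made for this sketch.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def project_genetic_embedding(embedding, weight, bias):
    """Map an input vector to the output space: Softplus(W x + b).

    `weight` has one row per output dimension (e.g., 256 rows of 3072 values)
    and `bias` has one value per output dimension.
    """
    return [
        softplus(sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(weight, bias)
    ]
```

Softplus keeps every projected component strictly positive, which distinguishes these inputs from freely learned word embeddings that can take either sign.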
  • targets were determined by ChIP-seq, with direct targets defined as those genes that are differentially expressed in the TF-deficient cells and that also have the corresponding TF's ChIP-seq peak assigned to that gene. Peaks were assigned to the closest gene within 20 kb. Indirect targets were defined as those genes that are differentially expressed in the TF-deficient cells but do not have the corresponding TF's ChIP-seq peak assigned to that gene. The effect of the in silico mutagenesis perturbation was measured by comparing the gene embeddings before and after perturbation.
  • EXAMPLE 19 ASSESSING PERTURBFORMER’S ABILITY TO PREDICT THE EFFECT OF CRISPRI PERTURBATIONS ON CELL SURVIVAL.
  • Paired scPerturb-seq and WGS data of cell lines unseen during training are used to evaluate PerturbFormer’s ability to predict perturbed gene rankings when presented with unperturbed gene rankings generated by GE-CosyFormer or CosySeq2Cell when given the DNA sequences of a different genetic context (i.e., of the unseen cell lines).
  • the models provided herein are being applied to glioblastoma cells to provide more specific and nuanced information on the regulatory DNA content.

Abstract

Various embodiments of systems and methods are provided for determining and/or analyzing gene co-expression networks for accurate and precise target identification, in order to generate reliable therapeutic interventions in the treatment of any disease. Some methods described herein generate an adjusted genetic profile for a cell or a cell population based on a perturbation of a gene in a target cell in a specific genetic context.
PCT/IB2024/061436 2023-11-17 2024-11-15 Perturbation-related genetic profile analysis Pending WO2025104702A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363600225P 2023-11-17 2023-11-17
US63/600,225 2023-11-17

Publications (1)

Publication Number Publication Date
WO2025104702A1 true WO2025104702A1 (fr) 2025-05-22

Family

ID=93648107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2024/061436 Pending WO2025104702A1 (fr) 2023-11-17 2024-11-15 Perturbation-related genetic profile analysis

Country Status (1)

Country Link
WO (1) WO2025104702A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190085324A1 (en) * 2015-10-28 2019-03-21 The Broad Institute Inc. Assays for massively combinatorial perturbation profiling and cellular circuit reconstruction
US20210366577A1 (en) * 2020-05-22 2021-11-25 Insitro, Inc. Predicting disease outcomes using machine learned models
WO2023193935A1 (fr) * 2022-04-08 2023-10-12 NEC Laboratories Europe GmbH Method and system for predicting gene expression perturbations

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ARSHAM GHAHRAMANI ET AL: "Generative adversarial networks simulate gene expression and predict perturbations in single cells", BIORXIV, 30 July 2018 (2018-07-30), XP055593091, Retrieved from the Internet <URL:https://www.biorxiv.org/content/early/2018/02/08/262501.full.pdf> [retrieved on 20250207], DOI: 10.1101/262501 *
CUI HAOTIAN ET AL: "scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI", BIORXIV, 2 July 2023 (2023-07-02), XP093247452, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/2023.04.30.538439v2.full.pdf> [retrieved on 20250206], DOI: 10.1101/2023.04.30.538439 *
Kent et al., The Human Genome Browser at UCSC. Genome Res., vol. 12, 2002, pages 996-1006
Theodoris et al., Nature, 2023
Wolf et al., HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2020

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24812246

Country of ref document: EP

Kind code of ref document: A1