
WO2024153239A1 - Prediction model training method, gene expression data correction method, and downstream task execution method - Google Patents


Info

Publication number
WO2024153239A1
WO2024153239A1 (PCT/CN2024/073344)
Authority
WO
WIPO (PCT)
Prior art keywords
gene expression
expression data
gene
masked
counts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2024/073344
Other languages
French (fr)
Chinese (zh)
Inventor
龚警
曾信
郝敏升
刘迟明
王太峰
成幸毅
宋乐
马剑竹
张学工
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Biomap Beijing Intelligence Technology Ltd
Original Assignee
Biomap Beijing Intelligence Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202310097546.2A (published as CN116403634A)
Priority claimed from CN202310630156.7A (published as CN119068985A)
Application filed by Biomap Beijing Intelligence Technology Ltd filed Critical Biomap Beijing Intelligence Technology Ltd
Publication of WO2024153239A1
Anticipated expiration legal-status: Critical
Current legal-status: Ceased


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Definitions

  • the present disclosure relates to the field of artificial intelligence, and in particular to a prediction model training method, a gene expression data correction method, a downstream task execution method, an electronic device, a computer-readable storage medium, and a computer program product.
  • Single-cell sequencing technology performs sequencing analysis of the genome, transcriptome, and epigenome at the level of a single cell. Due to the interference of technical noise, the sequencing results of similar cells at the same sequencing depth show certain differences, which causes difficulties for downstream applications that use single-cell sequencing results directly. Therefore, single-cell sequencing results need to be processed in a certain way to better serve downstream applications.
  • a method for training a prediction model, comprising: obtaining a plurality of samples, wherein each of the plurality of samples comprises first gene expression data and masked gene expression data, the first gene expression data comprises counts of different genes in a single cell, the masked gene expression data is obtained by processing counts of some genes in the first gene expression data, and the processing for obtaining the masked gene expression data comprises masking; for each of the plurality of samples: processing the masked gene expression data in the sample using the prediction model to be trained to obtain a predicted value of the counts of the some genes corresponding to the sample, and determining a loss value corresponding to the sample based on the predicted value and the counts corresponding to the some genes in the first gene expression data; and updating the prediction model to be trained according to the loss value corresponding to each sample in the plurality of samples.
  • a method for correcting gene expression data comprising: obtaining current gene expression data, wherein the current gene expression data includes respective counts of different genes measured at an actual sequencing depth; and processing the current gene expression data using at least part of the network layers in a prediction model trained by the training method of the first aspect of the present disclosure to obtain a correction value of the current gene expression data or an intermediate processing result of the correction value.
  • a downstream task execution method comprising: obtaining input data, wherein the input data includes i) a correction value obtained by the method according to the second aspect of the present disclosure; or ii) an intermediate processing result of the correction value according to the second aspect of the present disclosure; or iii) a preprocessing result obtained by preprocessing the correction value obtained by the method according to the second aspect of the present disclosure; or iv) a preprocessing result obtained by preprocessing the intermediate processing result of the correction value obtained by the method according to the second aspect of the present disclosure; and processing the input data using a downstream task algorithm to obtain a downstream task result, wherein the downstream task includes a cell classification task, a perturbation prediction task or a drug response prediction task.
  • an electronic device including: a processor; and a memory, wherein the memory stores instructions executable by the processor, and when the instructions are executed by the processor, the processor executes any one of the above methods.
  • a non-transitory computer-readable storage medium storing instructions, wherein when the instructions are executed by a processor, the processor executes any one of the above methods.
  • a computer program product comprising: instructions, wherein when the instructions are executed by a processor, the processor is caused to execute any of the above methods.
  • FIG. 1 is a flow chart of a method for training a prediction model according to an embodiment of the present disclosure.
  • FIG. 1A is a schematic diagram of a model structure of a prediction model according to an embodiment of the present disclosure.
  • FIG. 1B is a flowchart of a method for training a prediction model according to another embodiment of the present disclosure.
  • FIG. 1C is a flowchart of a method for training a prediction model according to another embodiment of the present disclosure.
  • FIG. 2 is a flow chart of a method for correcting gene expression data according to an embodiment of the present disclosure.
  • FIG. 2A is a flow chart of a method for correcting gene expression data according to another embodiment of the present disclosure.
  • FIG. 2B is a flow chart of a method for correcting gene expression data according to another embodiment of the present disclosure.
  • FIG. 3 is a flow chart of a downstream task execution method according to an embodiment of the present disclosure.
  • FIG. 4 is a block diagram of an exemplary electronic device that can be used to implement an embodiment of the present disclosure.
  • The terms "first", "second", etc. used to describe various elements are not intended to limit the positional relationship, timing relationship, or importance relationship of these elements; such terms are only used to distinguish one element from another.
  • The first element and the second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
  • In the field of genomics, genes have expression levels in cells, and each cell has its true gene expression level.
  • By sequencing cells, the number of reads mapped to each gene at a certain sequencing depth can be obtained.
  • The number of reads mapped to a gene can be called the count of the gene, and the count can be understood as the gene expression level observed at that sequencing depth.
  • the corrected relative relationship or the intermediate representation of the corrected relative relationship can be predicted through the prediction model, and the corrected relative relationship can be used as the true relative relationship, or the intermediate representation of the corrected relative relationship can be used as the intermediate representation of the true relative relationship for various downstream tasks.
  • Fig. 1 is a flow chart of a prediction model training method 100 according to an embodiment of the present disclosure. As shown in Fig. 1 , the method 100 includes steps 110 to 130.
  • a plurality of samples are obtained.
  • Each of the plurality of samples includes first gene expression data and masked gene expression data
  • the first gene expression data includes counts of different genes in a single cell
  • the masked gene expression data is obtained by processing counts of some genes in the first gene expression data
  • the processing for obtaining the masked gene expression data includes masking.
  • a single cell refers to a single cell of a human body, an animal body or other organisms.
  • the single cells in multiple samples can be different types of cells, such as T cells, B cells, etc.
  • a single cell expresses multiple genes.
  • the first gene expression data is the count of each of the different genes in a single cell measured at a certain sequencing depth.
  • Table 1 is an example of the first gene expression data.
  • the first gene expression data may include counts of N (about 20,000) genes, and the first gene expression data of each sample is composed of counts corresponding to the same set of genes.
  • the genes in the first gene expression data include both highly variable genes (genes that are more likely to be regulated by other genes, where being regulated may mean being activated or inhibited by other genes) and non-highly variable genes.
  • the first gene expression data includes genes of the entire genome or genes of the entire transcriptome.
  • the single-cell sequencing results whose non-zero gene counts exceed a non-zero count threshold can be screened out from the collected single-cell sequencing results as the first gene expression data to filter out extremely low-quality or damaged single-cell sequencing results.
  • obtaining the first gene expression data included in the plurality of samples may be obtaining an initial gene expression matrix, where each row or column of the initial gene expression matrix corresponds to the first gene expression data in one sample.
  • the matrix size may be C*N (where C is the number of samples, C ≥ 1, and N is the number of genes), with the cell identifier (or sample identifier) of a single cell and the gene identifier of each gene as the first and second dimensions of the matrix respectively, and the expression amount of each gene in each single cell (hereinafter also referred to as the gene count) as the value of the corresponding element in the matrix.
  • the cell identifiers of the single cells corresponding to the first gene expression data in the three samples are Cell 1, Cell 2, and Cell 3, respectively.
  • the gene identifiers of the six genes in these three single cells are Gene1, Gene2, Gene3, Gene4, Gene5, and Gene6, respectively.
  • the initial gene expression matrix is shown in Table 2. Taking the value of 2.7 in the last cell of the first row in Table 2 as an example, it means that the count of gene Gene6 in cell Cell 1 is 2.7.
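  • As a concrete illustration, a minimal sketch of such an initial gene expression matrix is given below (Python/pandas). Only the value 2.7 at (Cell 1, Gene6) comes from the text above; all other values are invented for illustration:

```python
import pandas as pd

# Cell identifiers as rows, gene identifiers as columns, gene counts as values.
matrix = pd.DataFrame(
    [[4.0, 1.0, 0.0, 2.0, 0.0, 2.7],
     [0.0, 3.0, 1.0, 0.0, 5.0, 0.0],
     [2.0, 0.0, 0.0, 1.0, 0.0, 4.0]],
    index=["Cell 1", "Cell 2", "Cell 3"],
    columns=["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"],
)
assert matrix.loc["Cell 1", "Gene6"] == 2.7  # the example value from Table 2
```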
  • the prediction model is trained using the MAE (Masked Autoencoders) method, that is, the prediction model is used to predict the masked gene counts based on the partially masked gene counts (the unmasked counts are used as the context of the masked counts). Therefore, when constructing the training data, the first gene expression data obtained by actual sequencing is used as the true value, and the masked gene expression data obtained by masking the counts of some genes in the first gene expression data is used as the input of the prediction model.
  • the masking operation (hereinafter also referred to as "mask") can be implemented by replacing the counts of the masked part of the genes in the first gene expression data with mask symbols.
  • the masked gene data can be [4, 1, M, M, 0, M, ..., M], wherein the gene counts of the masked part are replaced with the mask symbol "M", while the unmasked gene counts retain the original counts.
  • the gene ID corresponding to the masked gene count is known, and the gene count is unknown.
  • the number of masked gene counts in the masked gene expression data can be controlled by setting the mask ratio. The larger the mask ratio, the more masked gene counts.
  • the masked counts are also referred to as the first type of elements selected from the initial gene expression matrix, and the unmasked counts (i.e., the elements in the initial gene expression matrix other than the first type of elements) are referred to as the second type of elements selected from the initial gene expression matrix.
  • the mask ratio (i.e., the proportion of masked gene counts) can be set as needed.
  • the number of masked gene counts can be determined first from the mask ratio.
  • for example, the total number of masked gene counts can be set to less than 70% of the total number of gene counts.
  • the total number of gene counts in Table 2 is 18.
  • the number of masked gene counts can be determined to be 12.
  • counts equal in number to the determined number of masked gene counts are then randomly selected as masked counts. For example, 12 counts are randomly selected from the first gene expression data as masked counts, and the remaining 6 counts are used as unmasked counts.
  • the selection should be carried out uniformly at random, for example: generate multiple random numbers, one for each count, drawn from a uniform distribution with a mean of 0 and a variance of 1; select the 12 largest of these random numbers as the random numbers corresponding to the masked counts; and determine the counts corresponding to these 12 random numbers as the masked counts.
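  • The following sketch illustrates this selection scheme (Python/NumPy). The text specifies random numbers from a uniform distribution with mean 0 and variance 1; since only the relative order of the scores matters for top-k selection, any continuous i.i.d. distribution (U(0,1) here) gives an equivalent result, and the count values are illustrative:

```python
import numpy as np

def select_masked_positions(counts, num_masked, rng=None):
    # Draw one random score per count and mask the positions with the largest
    # scores, which selects num_masked positions uniformly at random.
    rng = rng or np.random.default_rng()
    scores = rng.random(len(counts))
    return np.argsort(scores)[-num_masked:]

counts = np.array([4.0, 1.0, 3.0, 2.0, 0.0, 2.7] * 3)       # 18 counts, as in the example
masked_idx = select_masked_positions(counts, num_masked=12)  # 12 < 70% of 18
masked = counts.astype(object)
masked[masked_idx] = "M"                                     # replace masked counts with the mask symbol
```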
  • the mask ratio of each sample may be different.
  • the specific mask ratio and mask mode may be set according to actual needs and algorithm design. The above is only an example.
  • the first gene expression data may be original sequencing data, or may be sequencing data after normalization of the original sequencing data. It is understandable that if the first gene expression data is normalized, the influence of factors such as sequencing depth on the absolute value of counts can be eliminated, so that the model focuses on the relative relationship between gene counts.
  • in some training tasks, the original sequencing data is not needed and only the normalized sequencing data is needed: the first gene expression data in the sample is normalized, the masked gene expression data is also normalized, and the predicted value of each gene count learned by the prediction model is also normalized.
  • in some training tasks, the normalized sequencing data is not needed and only the original sequencing data is needed: the first gene expression data in the sample is not normalized, the masked gene expression data is also not normalized, and the predicted value of each gene count learned by the prediction model is also not normalized.
  • in some training tasks, both the original sequencing data and the normalized sequencing data are needed: the sample includes the normalized first gene expression data, the normalized first gene expression data corresponds to the first gene expression data (i.e., the unnormalized sequencing data), and the masked gene expression data is obtained by processing the first gene expression data.
  • the scRNA-seq results of cells are stored in databases such as the Gene Expression Omnibus (GEO), the Human Cell Atlas, and EMBL-EBI. scRNA-seq data is manually collected from these databases, and datasets with duplicate IDs are removed.
  • Some of the sequencing data in the databases are original sequencing data, and some are normalized. Among the normalized sequencing data, some can be restored to the original sequencing data and some cannot. For example, to facilitate unified processing of the collected data in subsequent operations, data can be collected as original sequencing data.
  • step 120 for each sample in the plurality of samples, the operations of step 121 and step 122 are performed. It can be understood that step 121 and step 122 can be sub-steps of step 120.
  • step 121 the masked gene expression data in the sample is processed using the prediction model to be trained to obtain the predicted value of the partial gene count corresponding to the sample.
  • the prediction model to be trained (hereinafter also referred to as the gene regulation relationship model) may refer to an untrained model.
  • the partial gene counts refer to the masked partial gene counts.
  • the architecture of the prediction model to be trained can adopt an encoder network-decoder network structure, or a decoder network alone.
  • the prediction model may include more than 1 million parameters, for example, 3 million, 10 million, or 100 million parameters.
  • the model parameters can be increased by increasing the number of layers, the number of latent vectors in each layer, and/or the dimension of the latent vectors.
  • the decoder network or the encoder network-decoder network outputs an intermediate representation of the relative relationship between the counts of each gene after correction (hereinafter also referred to as "gene regulatory relationship representation").
  • the prediction model to be trained also includes a network layer (for example, a multi-layer perceptron MLP) located after the decoder network or the encoder network-decoder network, which is used to project the intermediate representation of the relative relationship between the counts of each gene after correction into a predicted value of the gene count.
  • the predicted value includes the predicted values of at least some genes (i.e., some genes that are masked), and may also include the predicted values of all genes.
  • the predicted values mentioned in the embodiments of the present disclosure can be understood in this way.
  • in the case where the first gene expression data included in the sample is the initial gene expression matrix, the masked gene expression data obtained by masking the first gene expression data and the predicted values of the gene counts corresponding to the sample are matrices of the same size as the initial gene expression matrix.
  • a loss value corresponding to the sample is determined according to the predicted value and the counts corresponding to the part of the genes in the first gene expression data
  • the predicted value may include only the predicted values of the masked partial genes, or may include the predicted values of all genes.
  • in the latter case, the predicted values of the masked partial genes may be selected therefrom, and the loss value corresponding to the sample may be determined based on the predicted values of the masked partial genes and the counts corresponding to the masked partial genes in the first gene expression data.
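  • A minimal sketch of such a loss computed only over the masked positions is shown below; mean squared error is an assumption here, since the embodiments do not fix a particular loss function:

```python
import torch

def masked_mse_loss(predicted, target, mask):
    # predicted, target: tensors of (normalized) gene counts; mask: bool tensor,
    # True where the count was masked. Only masked positions enter the loss.
    return ((predicted[mask] - target[mask]) ** 2).mean()
```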
  • step 130 the prediction model to be trained is updated according to the loss value corresponding to each sample in the plurality of samples.
  • a batch of training examples is fed into the model, the batch loss is calculated, and then the parameters are updated through backpropagation.
  • the updates can be done one by one or simultaneously.
  • the preset training end condition may include any of the following: the loss value is less than the loss value threshold, the number of training times reaches a preset number of times, and the training time reaches a preset duration.
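  • The update procedure and end conditions described above might look like the following sketch, reusing the masked_mse_loss sketch above; the dataloader format and the hyperparameter values are assumptions:

```python
import torch

def train(model, optimizer, dataloader, loss_threshold=1e-3, max_steps=100_000):
    step = 0
    for masked_data, target, mask in dataloader:         # one batch of samples per iteration
        predicted = model(masked_data)                   # predicted values of the gene counts
        loss = masked_mse_loss(predicted, target, mask)  # loss over masked positions only
        optimizer.zero_grad()
        loss.backward()                                  # backpropagation
        optimizer.step()                                 # update the model parameters
        step += 1
        # preset end conditions: loss below threshold, or training-step budget reached
        if loss.item() < loss_threshold or step >= max_steps:
            break
```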
  • Method 100 uses the first gene expression data and the masked gene expression data obtained by masking the first gene expression data as samples to train a prediction model. By predicting the masked counts as a training task, the prediction model learns to capture the relationship between gene counts in a single cell, so that the gene expression data input into the prediction model can be corrected in the inference stage to minimize technical noise interference in sequencing, thereby improving the accuracy of the output results of downstream tasks.
  • step 120 the operation of step 123 is also performed, and step 123 may be a sub-step of step 120.
  • step 123 the feature tensor corresponding to the masked gene expression data is determined. It is understandable that step 123 is performed before step 121, and accordingly, after step 123, step 121 is step 121A, and the feature tensor corresponding to the masked gene expression data in the sample is processed using the prediction model to be trained.
  • step 123 includes: determining the feature tensor of each count in the masked gene expression data (hereinafter also referred to as the "embedding vector").
  • the masked gene expression data includes three types of counts: unmasked and non-zero counts, zero and unmasked counts (this case is also referred to as zero-value counts in this article), and masked counts (the feature vectors corresponding to these three types of counts are also referred to as second gene features, zero-value gene features, and first gene features in the following text, respectively).
  • the feature tensor of each count can be determined according to the gene embedding vector representing the gene ID (also referred to as "gene identification feature” in this article) and the count embedding vector representing the gene count (also referred to as “gene expression feature” in this article), for example, by adding the two.
  • These three types of counts all have gene IDs (also referred to as gene identifiers), and their corresponding gene embedding vectors can be determined according to the gene ID (for example, N different gene embedding vectors corresponding to N gene IDs are set one by one, and the gene embedding vector corresponding to the gene ID can be uniquely determined according to the gene ID).
  • the count embedding vector is related to the size of the count, and the count embedding vectors corresponding to non-zero counts of the same size are the same.
  • for counts that are 0 and not masked, their count embedding vectors can be randomly generated or pre-specified (i.e., a uniform count embedding vector is specified for all counts that are 0).
  • for masked counts, their count embedding vectors can be pre-specified (i.e., a uniform count embedding vector is specified for all masked counts).
  • the count embedding vector can be determined by a lookup table.
  • the lookup table can record the correspondence between the count value range and the count embedding vector, and count values in different ranges correspond to different count embedding vectors.
  • the corresponding count embedding vector can be determined after rounding or classification of the gene expression value. For example, if the gene expression value rounds to 1, it corresponds to the count embedding vector for bin 1; alternatively, if a gene expression value in the range 1-1.99 is classified into category 1, it corresponds to the count embedding vector for category 1.
  • however, this pre-rounding and classification rule is not flexible enough and will limit the model's ability to learn the mapping.
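  • For contrast, a minimal sketch of this inflexible pre-binning rule is shown below; the helper and the per-bin lookup list are hypothetical:

```python
import math

def naive_count_embedding(value, lookup):
    # Inflexible pre-binning: bucket the expression value (e.g. 1-1.99 -> bin 1)
    # and return a fixed embedding vector; the binning rule itself is not learnable.
    bin_id = min(int(math.floor(value)), len(lookup) - 1)
    return lookup[bin_id]
```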
  • the method for determining the count embedding vector corresponding to the unmasked and non-zero count in step 123 can be: for the unmasked and non-zero count, it can be input into the module for determining the embedding vector (also called "gene expression feature extraction model") to obtain the count embedding vector corresponding to the unmasked and non-zero count.
  • the module for determining the embedding vector includes learnable parameters, which can be updated according to the loss value corresponding to each sample in a plurality of samples, and can be trained together with the prediction model.
  • the operations performed in the module for determining the embedding vector include: randomly initializing a lookup table, which includes 100 vectors of size 1*768.
  • the count is converted into an intermediate vector v1 (e.g., by a first linear layer); v1 is then processed with another linear layer (whose parameter is w2) and a scaling factor α to obtain the intermediate vector v2.
  • v2 = v1*w2 + α*v1.
  • the size of w2 is 100*100.
  • w2 and α are learnable parameters.
  • v2 is normalized with SoftMax to obtain the weight vector v3; the size of v3 is 1*100.
  • the weighted sum of the 100 vectors in the lookup table, weighted by v3, is computed to obtain the count embedding vector Ex.
  • in this way, the model can learn that gene expression values 1.1 and 1.2 are closer expression values while gene expression values 1.1 and 1.9 are more different, so that the count embedding vectors can better characterize the relationships between different gene expression values.
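  • A sketch of this learnable count-embedding module is given below (PyTorch). The initial projection from the scalar count to v1 is an assumption, since the text only says that v1 is processed by "another" linear layer:

```python
import torch
import torch.nn as nn

class CountEmbedding(nn.Module):
    """Learnable count-embedding sketch based on the description above."""

    def __init__(self, num_bins: int = 100, dim: int = 768):
        super().__init__()
        self.lookup = nn.Parameter(torch.randn(num_bins, dim))  # lookup table: 100 vectors of size 1*768
        self.w1 = nn.Linear(1, num_bins)                        # assumed: scalar count -> v1 (size 1*100)
        self.w2 = nn.Linear(num_bins, num_bins, bias=False)     # w2, size 100*100
        self.alpha = nn.Parameter(torch.tensor(1.0))            # learnable scaling factor

    def forward(self, count: torch.Tensor) -> torch.Tensor:
        v1 = self.w1(count.unsqueeze(-1))       # intermediate vector v1
        v2 = self.w2(v1) + self.alpha * v1      # v2 = v1*w2 + alpha*v1
        v3 = torch.softmax(v2, dim=-1)          # weight vector v3, size 1*100
        return v3 @ self.lookup                 # weighted sum of the lookup vectors -> Ex

emb = CountEmbedding()
ex = emb(torch.tensor([1.1, 1.2, 1.9]))         # three counts -> three 768-d embeddings
```

  • Because w1, w2, α, and the lookup table are all trained together with the prediction model, nearby expression values such as 1.1 and 1.2 can end up with similar embeddings without a hand-written binning rule.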
  • step 121A includes: inputting the feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data into the first network layer of the prediction model to be trained; inputting the feature tensor corresponding to the masked counts in the masked gene expression data and the feature tensor corresponding to the counts of 0 in the masked gene expression data into a network layer of the prediction model to be trained that is different from the first network layer.
  • the feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data is input into the first network layer of the prediction model to be trained to obtain the processing result of the first network layer, and then the processing result of the first network layer and the feature tensor corresponding to the masked counts in the masked gene expression data and the feature tensor corresponding to the counts of 0 in the masked gene expression data are spliced and input into the second network layer of the prediction model to be trained, wherein the second network layer is located downstream of the first network layer.
  • the zero-value counts are input into the second network layer, so that the model takes into account the influence of the zero-value counts when learning the relative relationship between genes, allowing the model to learn the potential gene regulatory relationship at the whole genome level.
  • the prediction model to be trained includes an encoder network and a decoder network.
  • Step 121A includes:
  • Step 1211 using the encoder network to encode the input of the encoder network to obtain the output of the encoder network; the input of the encoder network includes a feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data.
  • the input of the encoder network is also referred to as input features in this article
  • the feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data is also referred to as the first tensor in this article
  • the output of the encoder network is also referred to as the initial encoded feature tensor or encoded tensor in this article.
  • the feature tensors corresponding to the three types of counts in the masked gene expression data are obtained in step 123, the feature tensors corresponding to the counts that are 0 and not masked and the feature tensors corresponding to the masked counts are not input into the encoder network. That is to say, the feature tensors corresponding to the counts that are 0 and not masked and the feature tensors corresponding to the masked counts are filtered out before being input into the encoder network, which can significantly reduce the amount of data input into the encoder network.
  • the feature tensors corresponding to the unmasked and non-zero counts are arranged according to the positions of those counts in the first gene expression data, for example, the counts corresponding to gene 2, gene 5, etc.
  • the number of masked counts can be determined based on the model learning effect (it is understandable that when the proportion of masked counts is too high, accurate prediction cannot be made, and when the proportion of masked counts is too low, the prediction task is too simple), the sparsity of the first gene expression data (i.e., the proportion of 0-value counts) and the desired model scale.
  • the sparser the first gene expression data, the more counts are filtered out because they are 0 values, and the number of masked counts can be appropriately reduced.
  • different mask ratios can be set for 0-value counts and non-0-value counts in the first gene data to achieve a suitable total number of masked counts, while retaining more information on non-0-value counts.
  • the input of the encoder network usually contains feature tensors corresponding to multiple samples.
  • the encoder network can batch process them only when the feature tensors corresponding to each sample have the same size.
  • the number of unmasked and non-zero counts included in the masked gene expression data may differ between samples, which requires padding them to the same size so that the feature tensor size corresponding to each sample is the same.
  • the number of non-zero and unmasked gene counts included in the masked gene expression data in each sample fluctuates between 300 and 1000, and can be uniformly padded to 1000 (so that the length of the input feature tensor after padding is on the order of 10% of the total number of genes), and the padding result is shown in Table 3.
  • in Table 3, in the feature vector corresponding to sample 2 in the input of the encoder network, the 1st-999th of the 1000 elements correspond to non-zero and unmasked counts (corresponding to gene 11, gene 32, gene 42, gene 54, ..., gene 18900, respectively), and the 1000th element is a padded element.
  • Each element in Table 3 corresponds to a feature tensor, and these feature tensors constitute the input of the encoder network with a size of C*L*D (where C is the number of samples, i.e., the number of single cells, L is the size after uniform padding, and D is the length of the feature tensor corresponding to the element). That is to say, the input of the encoder network includes not only the feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data, but also the feature vector corresponding to the filling element.
  • a filling vector can be specified for the filling element as its corresponding feature tensor (that is, the feature tensors corresponding to all filling elements are the same), and there is no need to distinguish between the gene embedding vector and the count embedding vector.
  • the feature tensor corresponding to the filling element may not participate in the calculation and only serve as a placeholder.
  • the values of the feature tensors specified by elements of different types are different from each other, for example, the feature tensors specified for the filling elements and the count embedding vectors specified for the masked counts are different from each other.
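  • A minimal padding sketch under these assumptions (a single shared filling vector, here all zeros, and a uniform target length L=1000):

```python
import torch

def pad_sample(tensors, target_len, pad_vector):
    # tensors: (K, D) feature tensors for one sample's unmasked, non-zero counts.
    # Append copies of the shared filling vector until the length is target_len.
    num_pad = target_len - tensors.shape[0]
    padding = pad_vector.unsqueeze(0).expand(num_pad, -1)
    return torch.cat([tensors, padding], dim=0)    # (target_len, D)

pad_vector = torch.zeros(768)                       # assumed shared filling vector
sample = torch.randn(420, 768)                      # e.g. 420 non-zero, unmasked counts
padded = pad_sample(sample, target_len=1000, pad_vector=pad_vector)
```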
  • Step 1212 obtaining the input of the decoder network according to the output of the encoder network, the feature tensor corresponding to the masked counts in the masked gene expression data, and the feature tensor corresponding to the counts of zero values.
  • the feature tensor corresponding to the unmasked count of 0 the feature tensor corresponding to the masked count, and the output of the encoder network (understandably, the feature tensor corresponding to the filling elements in the output of the encoder network needs to be removed at this time, otherwise the size of the tensors corresponding to each sample in the input of the decoder network will be inconsistent) are merged to obtain the input of the decoder network.
  • the feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data is replaced with the encoding tensor to obtain the input of the decoder network.
  • the size of the input of the decoder network can be C*N*D, where C is the number of samples, i.e., the number of single cells, N is the number of genes in the first gene expression data, and D is the length of the feature tensor corresponding to each count.
  • Step 1213 using the decoder network to decode the input of the decoder network.
  • the input to the decoder network is also referred to in this article as the target encoded feature tensor or the encoded input feature vector.
  • the output of the decoder network is an intermediate representation of the relative relationship between the counts of each gene after correction, and the size of the output of the decoder network can be C*N*F, where C is the number of samples, i.e., the number of single cells, N is the number of genes in the first gene expression data, and F is the length of the feature tensor corresponding to a single count in the decoder output.
  • step 121A also includes step 1214, using a multilayer perceptron to project the output of the decoder network into a predicted value of the count of a portion of the genes (i.e., the masked portion of the genes).
  • the output of the decoder network is processed by the multilayer perceptron in the prediction model to obtain the predicted value of the count of each gene corresponding to the sample.
  • the size of the predicted value is C*N.
  • the multilayer perceptron can be trained together with other parts in the prediction model.
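  • Putting steps 1211 to 1214 together, a single-sample forward pass might look like the following sketch; the module names are hypothetical placeholders and the batch dimension C is omitted for brevity:

```python
import torch
import torch.nn as nn

def forward_pass(encoder, decoder, mlp, features, nonzero_unmasked):
    # features: (N, D) feature tensors for all N gene counts of one sample;
    # nonzero_unmasked: (N,) bool, True for unmasked, non-zero counts.
    # Step 1211: encode only the unmasked, non-zero counts (~10% of the genome).
    encoded = encoder(features[nonzero_unmasked])
    # Step 1212: merge the encoder output back with the feature tensors of the
    # masked counts and the zero-value counts to form the decoder input.
    decoder_input = features.clone()
    decoder_input[nonzero_unmasked] = encoded
    # Step 1213: decode over all N positions.
    decoded = decoder(decoder_input)
    # Step 1214: project each position to a predicted gene count.
    return mlp(decoded).squeeze(-1)                 # (N,) predicted counts

# Toy usage with placeholder modules (a real model would use Transformer blocks):
D = 768
features = torch.randn(2000, D)
nonzero_unmasked = torch.rand(2000) < 0.1
pred = forward_pass(nn.Identity(), nn.Identity(), nn.Linear(D, 1), features, nonzero_unmasked)
```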
  • the encoder network is used to determine the expression information and regulatory relationship information of the unmasked and non-zero counts
  • the decoder network is used to restore the expression information and regulatory relationship information to the expression information of the masked counts.
  • the model including the encoder network and the decoder network learns and mines the correlation and interaction between gene expression.
  • the output of the encoder network and the output of the decoder network are both intermediate representations of the relative relationship between the counts of each gene after correction, which contains a large amount of information that can characterize the regulatory relationship of cellular genes.
  • the output of the encoder network and the output of the decoder network can be used for downstream tasks, or the output of the encoder network and the output of the decoder network can be preprocessed and used for downstream tasks.
  • the computational complexity of the model grows exponentially with the length of the input data; as the input length increases, the number of model parameters required also increases, which raises the computational complexity and cost of the model and makes model training difficult.
  • the amount of input data can be reduced by only selecting the expression of highly variable genes in cells as input, but ignoring non-highly variable genes will cause the regulatory relationship map learned by the model to have systematic omissions.
  • the encoder network focuses on the feature tensors corresponding to the unmasked and non-zero counts in the masked gene data, while the decoder network receives the feature tensors corresponding to all counts (including the feature tensors of the masked counts and of the counts that are 0).
  • the decoder network integrates information from all positions by processing the feature tensors corresponding to the zero-valued and masked counts together with the feature tensors of the unmasked and non-zero counts processed by the encoder network. After the zero-valued counts and the masked counts are filtered out, the length of the input sequence to the encoder network is approximately 10% of the length of the whole genome (or transcriptome).
  • This design greatly reduces the required computing resources, allowing the encoder network to use a series of ordinary Transformer blocks to capture gene dependencies, greatly improving training efficiency and training effects. Since the encoder network module only processes unmasked and non-zero counts, the model can effectively focus on the most informative non-zero expressed genes, while allowing zero-valued genes to participate in model training in the decoder network stage, so that the model can make more comprehensive and accurate predictions based on genes with zero-valued and non-zero-valued counts. At the same time, the encoder network only processes unmasked and non-zero counts, so that the gene regulatory relationship model can have a smaller scale.
  • the encoder network includes M layers of encoding units.
  • the decoder network includes N layers of decoding units.
  • the value of M is greater than the value of N.
  • the encoder network often requires more layers to extract high-order features of the data, so as to better represent the original data.
  • the decoder network can use fewer layers to complete the decoding task, because it only needs to restore the low-order feature representation to the original data, and does not need to perform feature extraction and abstraction like the encoder network. Therefore, when setting the number of layers of the encoder network and the decoder network, a deeper encoder network and a shallower decoder network can be selected. While retaining the original data information, the number of parameters of the model can be reduced, and the training efficiency and generalization ability of the model can be improved.
  • the encoder network and the decoder network can adopt the structure in the existing Transformer model.
  • the encoder network adopts the structure of the encoder in the existing Transformer model, and the decoder network can adopt the architecture of Performer.
  • each layer of encoding units of the encoder network includes at least one multi-head attention unit and at least one forward propagation unit.
  • Each layer of decoding units of the decoder network includes at least one forward propagation unit and at least one linear attention unit or sparse attention unit.
  • each layer of encoding units includes multiple units, such as a multi-head attention unit and a feed-forward unit.
  • each layer of decoding units includes multiple units, such as a feed-forward unit and a linear attention unit, where the linear attention unit can also be replaced by a sparse attention unit.
  • the input of the encoder network is input into the multi-head attention unit of the encoder network, and then the residual connection and layer normalization operations, forward propagation unit, residual connection and layer normalization operations are performed in the encoder network in sequence to obtain the output of the encoder network.
  • multi-head attention units and forward propagation units as encoding units in each layer of the encoder network can help the model better capture the dependencies and semantic information between different genes, further improving the performance of the model.
  • gene expression data is usually high-dimensional and sparse, with high noise and complexity, so the multi-head attention units of the encoder network can help the model better mine the associations and interactions between genes.
  • the use of forward propagation units and linear attention units or sparse attention units can help the model better predict the first gene expression data gene by gene.
  • Linear attention units can help the model better control the distribution of attention weights and improve the performance and stability of the model.
  • Sparse attention units can further reduce the amount of calculation of the model and improve the efficiency of the model. It can be seen that the decoder uses lightweight units, whose parameter volume and computational complexity are lower than those of the encoder. Compared with multi-head attention units, linear attention units or sparse attention units can further reduce the algorithm time and space complexity, so that the model parameter volume can be increased to the billion level.
  • the encoder network-decoder network model using multi-head attention units, forward propagation units, and attention mechanisms can achieve better performance and results in predicting gene expression data.
  • This model can automatically learn complex patterns and associations in the data and accurately predict and analyze the first gene expression data.
  • the masked gene expression data is normalized; the first gene expression data included in each sample of the multiple samples is the normalized first gene expression data; step 121 includes: using the prediction model to be trained to process the masked gene expression data in the sample to obtain the predicted value of the normalized value of each gene count corresponding to the sample; step 122 includes step 1220: determining the loss value corresponding to the sample based on the predicted value and the counts corresponding to some genes in the normalized first gene expression data.
  • the method 100 includes:
  • Step 110: obtaining a plurality of samples, wherein each of the plurality of samples comprises normalized first gene expression data and masked gene expression data, the normalized first gene expression data comprises the counts of different genes in a single cell, the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, and the processing for obtaining the masked gene expression data includes masking.
  • both the first gene expression data (unnormalized) and the normalized first gene expression data include the counts of different genes in a single cell that are actually measured, except that the normalized first gene expression data is normalized.
  • the masked gene expression data is obtained by processing the counts of some genes in the normalized first gene expression data, and the processing for obtaining the masked gene expression data includes masking; exemplarily, the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data (unnormalized original sequencing data), and the processing for obtaining the masked gene expression data includes normalization and masking. In either case, the masked gene expression data is normalized.
  • Step 120 for each sample in the plurality of samples, execute steps 121 and 122:
  • Step 121 using the prediction model to be trained to process the masked gene expression data in the sample to obtain the predicted value of the normalized value of the count of the partial gene (i.e., the masked partial gene) corresponding to the sample;
  • Step 122 determining the loss value corresponding to the sample according to the predicted value and the counts corresponding to the part of the genes in the normalized first gene expression data (ie, step 1220);
  • Step 130 update the prediction model to be trained according to the loss value corresponding to each sample in the multiple samples.
  • the normalized first gene expression data can eliminate the influence of factors such as sequencing depth on the absolute value of the count, and make the model focus on the relative relationship between gene counts.
  • the predicted value output by the prediction model is also the normalized predicted value.
  • the masked gene expression data is normalized, and each sample in the multiple samples includes normalized first gene expression data and masked gene expression data
  • the first gene expression data (i.e., the original gene expression data before normalization, corresponding to the normalized first gene expression data) includes the counts of different genes in a single cell measured at a first sequencing depth.
  • the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data
  • the processing for obtaining the masked gene expression data includes downsampling, normalization, and masking; the downsampled first gene expression data simulates the counts measured at a second sequencing depth lower than the first sequencing depth.
  • the method further comprises obtaining the first gene expression data, which includes the counts of different genes in a single cell; each sample in the multiple samples further comprises auxiliary information, the auxiliary information comprising a first total count and a second total count, the first total count being the sum of the counts of each gene in the first gene expression data, and the second total count being the sum of the counts of each gene in the downsampled first gene expression data; step 121 comprises step 121B: processing the masked gene expression data and the auxiliary information in the sample using the prediction model to be trained to obtain a predicted value of the normalized value of the counts of some genes (i.e., the masked genes) at the first sequencing depth corresponding to the sample.
  • the method 100 includes:
  • Step 110 obtaining multiple samples, wherein each of the multiple samples includes normalized first gene expression data, masked gene expression data and auxiliary information, the first gene expression data includes counts of different genes in a single cell measured at a first sequencing depth, the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, the processing for obtaining the masked gene expression data includes downsampling, normalization, and masking, the masked gene expression data is normalized, the downsampled first gene expression data simulates the counts of different genes in a single cell measured at a second sequencing depth lower than the first sequencing depth, the auxiliary information includes a first total count and a second total count, the first total count is the sum of the counts of each gene in the first gene expression data; the second total count is the sum of the counts of each gene in the downsampled first gene expression data.
  • the first gene expression data (original sequencing data) needs to be used when calculating the first total count and when downsampling, and the normalized first gene expression data needs to be used as the true value corresponding to the masked count, that is, in the training task of this embodiment, both the original sequencing data and the normalized sequencing data need to be used.
  • the original sequencing data is obtained from the database, it can be normalized to obtain the normalized sequencing data; if the normalized sequencing data is obtained from the database, it can be restored (the inverse of the normalization process) to obtain the original sequencing data.
  • first sequencing depth refers to the higher sequencing depth actually used when measuring the first gene expression data
  • second sequencing depth refers to the lower sequencing depth simulated by the downsampled first gene expression data.
  • the first sequencing depth and the second sequencing depth are used to indicate the sequencing depth corresponding to the first gene expression data and the mask expression data in the sample, respectively, and do not limit the first sequencing depth or the second sequencing depth of all samples to be the same.
  • the relative relationship between the counts of each gene measured under different sequencing methods and sequencing depths may be different. It is understandable that the higher the sequencing depth, the higher the probability that the relative relationship between the observed counts of each gene is close to the relative relationship between the actual expression levels of each gene in the cell. For example, when the sequencing depth is too low, the observed counts of some genes are 0, but when the sequencing depth is high enough, those observed counts are no longer 0. However, due to various restrictions, sometimes only a low sequencing depth is used in actual sequencing rather than the desired high sequencing depth. At the same time, due to technical noise, the accuracy of the relative relationship between the measured counts of each gene may not meet the requirements.
  • a prediction model is therefore required to correctly predict the gene counts at a high sequencing depth based on the gene counts at a low sequencing depth. Since a single cell is destroyed after one sequencing, it is impossible to sequence the same cell repeatedly to obtain gene expression data at both high and low sequencing depths, and thus it is impossible to construct, by actual sequencing, paired (low sequencing depth, high sequencing depth) gene expression data for each cell as training samples.
  • the only way is to simulate the gene expression data of the cell at low sequencing depth by an algorithm based on the gene expression data at high sequencing depth, i.e., by downsampling the first gene expression data, so as to construct, for each cell, a (downsampled first gene expression data, first gene expression data) pair, equivalent to a (low sequencing depth, high sequencing depth) gene expression data pair, for training.
  • the two can then be normalized, the gene expression data at low sequencing depth can be masked, and the (masked gene expression data, normalized first gene expression data) pair can be used as a training sample. That is, the normalized first gene expression data is used as the true value of the gene counts at high sequencing depth, and the model is trained by MAE, so that the prediction model can learn to capture the relationship between the gene expression of similar cells at different sequencing depths.
  • the prediction model can be explicitly informed of the current input low sequencing depth and the high sequencing depth that is expected to be predicted, so that the sample includes auxiliary information reflecting the low sequencing depth and the high sequencing depth, allowing the prediction model to process the masked gene expression data and the auxiliary information.
  • the auxiliary information can include a first total count T and a second total count S, the first total count being the sum of the counts of each gene in the first gene expression data, which is used to characterize a high sequencing depth, and the second total count being the sum of the counts of each gene in the downsampled first gene expression data, which is used to characterize a low sequencing depth.
  • the first gene expression data is obtained by transcript sequencing (such as RNA-Seq technology).
  • the first total count may also be referred to as a first total transcript count, that is, the sum of the transcript counts of each gene in the first gene expression data.
  • masked gene expression data can be obtained from the first gene expression data in the following manner: first, downsampling the first gene expression data; second, normalizing the downsampled first gene expression data to obtain normalized downsampled first gene expression data; and third, masking the counts of some genes in the normalized downsampled first gene expression data to obtain masked gene expression data.
  • the downsampling method is described below.
  • the first gene expression data is measured at a sufficiently high sequencing depth, downsampling it can obtain a more credible gene count at a low sequencing depth.
  • the first gene expression data is originally measured at a low sequencing depth, the gene count at a low sequencing depth obtained by downsampling it is less credible.
  • therefore, only the first gene expression data whose sum of gene counts is greater than a preset threshold is downsampled (the sequencing depth can be characterized by the sum of the counts of each gene, so first gene expression data whose count sum exceeds the threshold is considered to be measured at a sufficiently high sequencing depth); when the sum of the counts of each gene in the first gene expression data is not greater than the preset threshold, the normalized first gene expression data can be directly masked to obtain masked gene expression data without downsampling. By including samples of masked gene expression data obtained without downsampling, the model can still learn to capture the relationship between genes in a single cell, although such samples do not teach the association between gene expression amounts at different sequencing depths.
  • the preset threshold may be, for example, 1000.
  • a downsampling probability may be set, that is, the first gene expression data may be downsampled with a certain probability. For example, for the first gene expression data whose sum of counts of each gene is greater than a preset threshold, a 50% probability of downsampling the data may be set, and a 50% probability of not downsampling the data may be set.
  • a sampling ratio (i.e., sampling factor) can be set when downsampling.
  • the sampling ratio represents the degree of downsampling of the first gene expression data.
  • the sampling ratio can be set by using Poisson Distribution, Bayesian Distribution, etc.
  • the sampling ratio of the first gene expression data is determined by a statistical sampling algorithm such as Beta-Binomial Distribution or Binomial Distribution.
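  • A minimal downsampling sketch along these lines (binomial thinning with a Beta-distributed sampling ratio, i.e., a beta-binomial scheme; the Beta hyperparameters a and b are assumptions):

```python
import numpy as np

def downsample(counts, rng=None, a=2.0, b=2.0):
    # Binomial thinning of raw counts to simulate a lower sequencing depth;
    # the per-cell sampling ratio is drawn from a Beta prior (beta-binomial).
    rng = rng or np.random.default_rng()
    ratio = rng.beta(a, b)                          # sampling ratio in (0, 1)
    return rng.binomial(counts.astype(np.int64), ratio)

raw = np.array([40, 0, 12, 7, 0, 27])               # raw counts (first gene expression data)
low_depth = downsample(raw)                         # simulated low-depth counts
```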
  • the normalization method can be applied to the process of obtaining normalized first gene expression data from first gene expression data, and can also be applied to the process of normalizing downsampled first gene expression data.
  • the sequencing data to be normalized is normalized by dividing each count in the sequencing data to be normalized (such as each count in the first gene expression data) by the sum of each count in the sequencing data to be normalized (such as the sum of each gene count in the first gene expression data).
  • a normalization method such as TPM or RPKM may be used to normalize the first gene expression data or the downsampled first gene expression data.
  • the log(TPM+1) normalization method can also be used to normalize the first gene expression data and the downsampled first gene expression data.
  • the first gene expression data is a vector X1
  • the normalized first gene expression data is log[X1/sum(X1)*10000+1].
  • the downsampled first gene expression data is a vector X2, and the downsampled and normalized first gene expression data is log[X2/sum(X2)*10000+1].
  • the log(TPM+1) normalization method performs a logarithmic transformation based on TPM. Logarithmic transformation can reduce the skewness in the original data and improve the comparability between variables.
  • the log(TPM+1) normalization method is a variant method of logarithmic transformation of the standard TPM normalization method, which can eliminate skewness and improve the comparability between variables.
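  • The normalization formulas above translate directly into code, for example:

```python
import numpy as np

def normalize_log_tpm1(x, scale=10000.0):
    # log(TPM+1)-style normalization as given above: log[X / sum(X) * 10000 + 1]
    return np.log(x / x.sum() * scale + 1.0)

x1 = np.array([40.0, 0.0, 12.0, 7.0, 0.0, 27.0])    # first gene expression data X1
normalized = normalize_log_tpm1(x1)                 # normalized first gene expression data
```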
  • Step 120 for each sample in the plurality of samples, execute steps 121 and 122:
  • Step 121 use the prediction model to be trained to process the masked gene expression data and auxiliary information in the sample to obtain the predicted value of the normalized value of the count of some genes (i.e., the masked part of the genes) at the first sequencing depth corresponding to the sample (i.e., step 121B).
  • the feature tensor corresponding to the masked gene expression data and the feature tensor corresponding to the auxiliary information in the sample may be processed using the prediction model to be trained.
  • Step 122 determining the loss value corresponding to the sample according to the predicted value and the counts corresponding to that part of the genes in the normalized first gene expression data (i.e., step 1220);
  • Step 130 update the prediction model to be trained according to the loss value corresponding to each sample in the multiple samples.
  • the prediction model to be trained includes an encoder network and a decoder network; step 121B includes:
  • Step A Encode the input of the encoder network using the encoder network to obtain the output of the encoder network; the input of the encoder network includes a feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data, and the input of the encoder network also includes a feature tensor corresponding to the auxiliary information.
  • method 100 further includes step 123 for determining a feature tensor corresponding to the masked gene expression data and step 125 for determining a feature tensor corresponding to the auxiliary information.
  • the feature tensor corresponding to the auxiliary information and the feature tensor corresponding to the unmasked and non-zero counts in the masked gene expression data are used as the input of the encoder network together, and the two are input to the prediction model from the first network layer of the prediction model.
  • the feature tensor corresponding to the auxiliary information can be spliced after the feature tensor corresponding to the unmasked and non-zero counts.
  • the input of the encoder network is [the feature tensor corresponding to gene 1, the feature tensor corresponding to gene 3, the feature tensor corresponding to gene 4, ... the feature tensor corresponding to the filler element, the feature tensor corresponding to the first total count T, and the feature tensor corresponding to the second total count S].
  • Step B obtaining an input to a decoder network based on the output of the encoder network, the feature tensor corresponding to the masked counts in the masked gene expression data, and the feature tensor corresponding to the counts of 0 values in the masked gene expression data;
  • Step C Decode the input of the decoder network using the decoder network.
  • step 121B may further include step D: using a multilayer perceptron to project the output of the decoder network into a predicted value of the normalized value of each gene count at the first sequencing depth.
  • for steps B to D, please refer to the description of steps 1212 to 1214; the details are not repeated here.
  • the prediction model includes an encoder network and a decoder network
  • the first gene expression data is subjected to Bayesian downsampling to obtain downsampled first gene expression data
  • the count sum of the first gene expression data is calculated as the first total count T
  • the count sum of the downsampled first gene expression data is calculated as the second total count S
  • the downsampled first gene expression data is normalized (not shown in FIG1A)
  • masked to obtain masked gene expression data
  • the feature tensors corresponding to the 0-value counts and the masked counts are removed from the feature tensor corresponding to the masked gene expression data.
  • the feature tensor corresponding to the remaining counts is spliced with the feature tensors corresponding to T and S (the feature tensors used are not shown in FIG1A), and the spliced result is input into the encoder network (Encoder in FIG1A) to obtain the output of the encoder network. The output of the encoder network is combined with the feature tensor corresponding to the masked counts and the feature tensor corresponding to the 0-value counts to obtain the input of the decoder network, which is processed by the decoder network to obtain the output of the decoder network. The output of the decoder network enters the MLP to obtain the predicted value of each gene count at the first sequencing depth, and the loss value (reconstruction loss in FIG1A) is calculated from the predicted value and the normalized first gene expression data (the difference between the normalized and unnormalized first gene expression data is not shown in FIG1A). Pooling the output of the encoder network yields a cell representation (cellular embedding in FIG1A). A minimal sketch of this flow is given below.
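  • the following PyTorch-style sketch mirrors the FIG1A flow for a single cell. It is a minimal illustration under assumptions of ours — transformer stacks for both encoder and decoder, a learnable mask token, an MSE reconstruction loss, and all dimensions — not the disclosed implementation:

```python
import torch
import torch.nn as nn

class DepthAwareMAE(nn.Module):
    """Single-cell sketch of the FIG1A flow (all sizes illustrative)."""

    def __init__(self, n_genes: int, d: int = 64):
        super().__init__()
        self.T_ID, self.S_ID = n_genes, n_genes + 1            # extra symbols for T and S
        self.count_embed = nn.Linear(1, d)                     # learnable count embedding
        self.gene_embed = nn.Embedding(n_genes + 2, d)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(d))         # stands in for masked counts
        self.head = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, 1))  # the MLP

    def embed(self, counts, gene_ids):
        return self.count_embed(counts.unsqueeze(-1)) + self.gene_embed(gene_ids)

    def forward(self, counts, masked, T, S):
        # counts: (G,) normalized masked expression; masked: (G,) bool; T, S: 0-dim tensors.
        genes = torch.arange(counts.numel())
        visible = (~masked) & (counts > 0)                     # unmasked, non-zero counts
        aux = self.embed(torch.stack([T, S]), torch.tensor([self.T_ID, self.S_ID]))
        enc_in = torch.cat([self.embed(counts[visible], genes[visible]), aux])
        enc_out = self.encoder(enc_in.unsqueeze(0)).squeeze(0)

        full = self.embed(counts, genes)                       # 0-value tokens stay embedded
        full[masked] = self.mask_token                         # masked positions: learned token
        n_vis = int(visible.sum())
        full[visible] = enc_out[:n_vis]                        # splice encoder outputs back in
        dec_out = self.decoder(full.unsqueeze(0)).squeeze(0)
        return self.head(dec_out).squeeze(-1)                  # predicted normalized counts (G,)

# reconstruction loss on the masked positions only (MSE is our assumption):
# loss = ((model(x_masked, mask, T, S)[mask] - x_true_norm[mask]) ** 2).mean()
```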
  • the prediction model to be trained includes an input layer, an output layer, and a plurality of intermediate layers between the input layer and the output layer; using the prediction model to be trained to process the masked gene expression data and auxiliary information in the sample includes:
  • Step 1 inputting the feature tensor corresponding to the masked gene expression data in the sample into the input layer and the first predetermined layer of the plurality of intermediate layers to obtain an intermediate feature tensor, wherein the number of the first predetermined layers is greater than or equal to 0;
  • method 100 further includes step 123 for determining a feature tensor corresponding to the masked gene expression data.
  • Step 2 Concatenate the feature tensor corresponding to the auxiliary information with the intermediate feature tensor to obtain a concatenated intermediate feature tensor;
  • method 100 further includes step 125 for determining a feature tensor corresponding to the auxiliary information.
  • Step 3 Input the concatenated intermediate feature tensor into a second predetermined layer among the multiple intermediate layers and an output layer, where the second predetermined layer is different from the first predetermined layer.
  • step 121B may also include step 4: using a multilayer perceptron to project the output of the output layer into a predicted value of the normalized value of each gene count at the first sequencing depth.
  • the masked gene expression data can be input from the first network layer of the prediction model, and the auxiliary information can be input from a network layer of the prediction model that is different from the first network layer.
  • the prediction model includes a decoder network
  • the feature tensor corresponding to the masked gene expression data is input to the first layer of the decoder network
  • the feature tensor corresponding to the auxiliary information is input before the last several layers of the decoder network
  • the prediction model includes an encoder network and a decoder network
  • the masked gene expression data is input to the first layer of the encoder network.
  • the feature tensor corresponding to the auxiliary information is input after a certain network layer of the encoder network, or the feature tensor corresponding to the auxiliary information is input before the last several layers of the decoder network.
  • the specific input method of the feature tensor corresponding to the auxiliary information can be to splice it onto the intermediate feature tensor produced from the feature tensor corresponding to the masked gene expression data by the preceding network layers (the first network layer plus the first predetermined layers among the multiple intermediate layers), for example at the tail of the intermediate feature tensor, and then to input the spliced intermediate feature tensor into the subsequent network layers (the second predetermined layers among the multiple intermediate layers plus the output layer).
  • the model can use auxiliary information to extract more meaningful features or make more accurate predictions in the process of processing input data.
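  • a minimal sketch of this injection pattern (layer counts, dimensions, and names are our assumptions, not the disclosed architecture):

```python
import torch
import torch.nn as nn

class MidInjectionNet(nn.Module):
    """Masked-expression features enter at the first layer; auxiliary features are
    spliced onto the intermediate tensor before the remaining layers."""

    def __init__(self, d: int = 64, n_first: int = 2, n_second: int = 2):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.first_layers = nn.ModuleList(make() for _ in range(n_first))
        self.second_layers = nn.ModuleList(make() for _ in range(n_second))

    def forward(self, expr_tokens, aux_tokens):
        h = expr_tokens                           # (B, L, d): masked gene expression features
        for layer in self.first_layers:           # first network layer + first predetermined layers
            h = layer(h)
        h = torch.cat([h, aux_tokens], dim=1)     # splice auxiliary features at the tail
        for layer in self.second_layers:          # second predetermined layers + output layer
            h = layer(h)
        return h

net = MidInjectionNet()
out = net(torch.randn(1, 10, 64), torch.randn(1, 2, 64))   # 2 auxiliary tokens (e.g. T and S)
```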
  • step 125 specifically includes:
  • Step 1251 inputting the auxiliary information into a module for determining an embedding vector to obtain a count embedding vector corresponding to the auxiliary information; wherein the module for determining the embedding vector includes learnable parameters;
  • the first total count T and the second total count S are regarded as ordinary counts, and a module for determining embedding vectors is used to determine a count embedding vector corresponding to the first total count and a count embedding vector corresponding to the second total count.
  • Step 1252 obtaining the gene embedding vector corresponding to the auxiliary information.
  • the first total count T and the second total count S are assigned symbols that are different from each other and from other gene IDs, and the corresponding gene embedding vectors are determined using the symbols.
  • Step 1253 obtaining a feature tensor corresponding to the auxiliary information according to the count embedding vector corresponding to the auxiliary information and the gene embedding vector corresponding to the auxiliary information.
  • the count embedding vector corresponding to the auxiliary information and the gene embedding vector corresponding to the auxiliary information are added element-wise to obtain the feature tensor corresponding to the auxiliary information.
  • the module for determining the embedding vector in step 125 and the module for determining the embedding vector in step 123 may be the same module and have the same parameters.
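  • a sketch of steps 1251 to 1253 for the auxiliary information, in which T and S pass through the same learnable count-embedding module as ordinary counts and receive dedicated gene-ID symbols; the module choice, sizes, and count values are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_genes, d = 20000, 64
count_embed = nn.Linear(1, d)                # module with learnable parameters (step 1251)
gene_embed = nn.Embedding(n_genes + 2, d)    # two extra symbols reserved for T and S
T_ID, S_ID = n_genes, n_genes + 1            # distinct from each other and from gene IDs

T = torch.tensor([9.2])                      # first total count (illustrative value)
S = torch.tensor([6.9])                      # second total count (illustrative value)

aux_counts = torch.stack([T, S])                           # treated as ordinary counts
aux_ids = torch.tensor([T_ID, S_ID])
aux_feat = count_embed(aux_counts) + gene_embed(aux_ids)   # element-wise add (step 1253), (2, d)
```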
  • the present disclosure embodiment further provides a gene regulation relationship model generation method 100B, which includes steps 110B to 150B.
  • Step 110B Obtaining an initial gene expression matrix corresponding to the target cells, wherein the initial gene expression matrix includes the expression levels of the highly variable genes and the expression levels of the non-hypervariable genes in the cells.
  • the target cell may refer to a cell in a human or animal body, and the target cell may contain multiple genes.
  • the target cell may include multiple different types of cells, such as T cells, B cells, etc.
  • the initial gene expression matrix may refer to a matrix used to characterize the expression of different genes in different cells.
  • the initial gene expression matrix includes the expression of highly variable genes and non-hypervariable genes in a single cell, that is, the expression of the whole genome of a single cell.
  • highly variable genes refer to genes whose expression varies greatly in different cells. For example, if the expression of a gene is extremely large in some cells and extremely small in other cells, then the gene is a highly variable gene.
  • a highly variable gene is a gene that is more likely to be regulated by other genes.
  • being regulated by other genes may refer to being activated or inhibited by other genes.
  • the non-zero proportion of gene expression in cells is about 10%, so the initial gene expression matrix is a very sparse matrix. The more elements with a value of 0 in the matrix, the sparser the matrix is, and the fewer elements with a value of 0 in the matrix, the denser the matrix is.
  • step 110B includes taking the cell identifier of the target cell and the gene identifier of each gene in the target cell as the first dimension and the second dimension of the matrix respectively, and taking the gene expression value of each gene as the value of the corresponding element in the matrix to construct an initial gene expression matrix; wherein the gene expression value of each gene is obtained by gene sequencing.
  • Step 120B randomly selecting multiple elements in the initial gene expression matrix as first-category elements, and taking elements in the gene expression matrix other than the first-category elements as second-category elements.
  • the first type of elements may refer to the elements in the initial gene expression matrix that are filtered out for the first time, and the role of the first type of elements is to reduce the amount of input data of the gene regulation relationship model to be trained at the whole genome level.
  • the second type of elements may refer to the elements in the initial gene expression matrix that are not filtered after the first filtering.
  • the selection method of the first type of elements is as described in the above method of selecting masked counts.
  • Step 130B Determine input features according to the positions of the elements in the initial gene expression matrix, the input features including second gene features corresponding to elements of the second category whose values are non-zero, and the input features do not include first gene features corresponding to the first category elements and zero-value gene features corresponding to elements of the second category whose values are zero, the second gene features being determined according to the expression of the gene and the gene identifier corresponding to the gene;
  • step 130B includes: generating a second gene expression feature according to the element values of the non-zero elements among the second-category elements; generating a second gene identification feature according to the gene identifiers corresponding to those non-zero elements; and determining the second gene feature according to the second gene expression feature and the second gene identification feature (specifically, the element values of the non-zero elements among the second-category elements are input into the gene expression feature extraction model to generate the second gene expression feature; the parameter values of the gene expression feature extraction model are adjusted according to the loss value of step 150B below).
  • the method 100B also includes: determining the first gene feature according to the first gene expression feature corresponding to the first-category element and the first gene identification feature corresponding to the first-category element; wherein the first gene expression features corresponding to all the first-category elements are the same, and the first gene expression feature is different from the second gene expression feature; determining the zero-value gene feature according to the third gene expression feature corresponding to the element with a value of 0 in the second-category element and the third gene identification feature corresponding to the element with a value of 0 in the second-category element.
  • the input features include the second gene features corresponding to the elements in the second category whose values are not 0, and the input features do not include the first gene features corresponding to the first category elements and the zero-value gene features corresponding to the elements in the second category whose values are 0, and the second gene features are determined based on the expression level of the gene and the gene identifier corresponding to the gene.
  • the input feature may refer to the gene feature corresponding to the remaining elements after the first filtering and the second filtering.
  • the second filtering may refer to the filtering of the elements with a value of 0 in the second class elements.
  • the second gene feature is determined according to the gene expression feature and the gene identification feature, and the second gene features corresponding to the elements with different gene expression amounts and/or different gene identifications are different.
  • the first-category elements, the zero-value elements among the second-category elements, and the non-zero elements among the second-category elements all correspond to gene features, which are respectively called the first gene feature, the zero-value gene feature, and the second gene feature.
  • the input feature can be a tensor of C × L × D, where C is the number of cells, L is the maximum expected number of genes with non-zero expression in a single cell, and D is the length of a gene feature. If the number of unmasked, non-zero genes of a cell is less than L, it is padded to L.
  • the fill-in elements correspond to fill-in gene features, for example, a unified fill-in gene feature is specified for all fill-in elements.
  • the input feature can be a tensor of the number of non-zero elements of the second category ⁇ D, where the second category non-zero elements in the input feature are arranged in the order of row number and column number (for example, first in order from small to large according to the row number of the element, and then in order from small to large according to the column number).
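  • as a sketch of assembling such a C × L × D input tensor (the zero padding feature and the per-cell loop are illustrative simplifications of ours):

```python
import torch

def build_input_features(gene_feats: torch.Tensor, keep: torch.Tensor, L: int) -> torch.Tensor:
    """gene_feats: (C, G, D) per-element gene features; keep: (C, G) bool, True for
    non-zero second-category elements. Gathers kept features and pads each cell to L."""
    C, G, D = gene_feats.shape
    out = torch.zeros(C, L, D)               # unified fill-in gene feature (zeros, illustrative)
    for c in range(C):
        kept = gene_feats[c, keep[c]][:L]    # kept features, truncated to L if necessary
        out[c, : kept.shape[0]] = kept
    return out                               # (C, L, D) input feature tensor
```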
  • gene features can be determined by gene expression features and gene identification features.
  • Gene expression features are related to the gene expression levels corresponding to the elements. If the gene expression levels corresponding to the elements are the same, then the corresponding gene expression features are the same.
  • Gene identification features are related to the gene identifications corresponding to the elements. If the gene identifications corresponding to the elements are the same, then the corresponding gene identification features are the same.
  • Gene expression features and gene identification features can be determined by looking up a mapping relationship table, or by other methods. For example, a mapping relationship between a gene identifier and a gene identification feature is preset (for example, gene identifier 1 corresponds to gene identification feature 1...), so that the gene identification feature can be determined based on the gene identifier.
  • a mapping relationship between gene expression and gene expression features can likewise be preset, so that the gene expression features can be determined based on the gene expression. It is understandable that the gene expression is usually a decimal; the gene expression can be rounded or binned to determine the corresponding gene expression feature. For example, if the gene expression rounds to 1, it corresponds to gene expression feature 1; alternatively, if the gene expression lies in 1-1.99 and is classified into category 1, it corresponds to gene expression feature 1.
  • the gene feature corresponding to the Pad (padding) element can be a designated value that differs from the gene features of all other elements.
  • the gene expression corresponding to the first type of element is unknown, so a unified first gene expression feature can be assigned to the first type of element.
  • the position of the first-type element is known, so the gene identification feature corresponding to the first-type element can be determined.
  • the gene expression features of the zero-value elements among the second-category elements are all the same, and the gene identification feature can be determined according to the gene identifier corresponding to each element's position.
  • determining the second gene feature according to the second gene expression feature and the second gene identification feature may be element-wise adding the second gene expression feature and the second gene identification feature to obtain the second gene feature.
  • the first gene feature and the zero-value gene feature may also be determined in this manner.
  • the first gene expression feature may be different from any second gene expression feature.
  • the second gene expression features corresponding to gene expression levels of categories 1-10 are gene expression features 1-10, respectively, and the first gene expression feature may be gene expression feature 11.
  • the third gene expression feature corresponding to the element with a value of 0 in the second category element may correspond to gene expression feature 0.
  • generating the second gene expression quantity feature from the element values of the non-zero elements among the second-category elements includes: inputting those element values into the gene expression quantity feature extraction model to generate the second gene expression quantity feature.
  • the mapping relationship between gene expression quantity and gene expression quantity feature can be preset, and the gene expression quantity feature is determined according to the gene expression quantity.
  • this preset mapping is not flexible enough and limits the model's ability to learn the mapping. For example, the model cannot see the difference between gene expression quantities 1.1 and 1.9, because after rounding/classification both correspond to gene expression quantity category 1.
  • a gene expression quantity feature extraction model can be set, and the element value of the element whose element value is not 0 in the second class of elements is input into the gene expression quantity feature extraction model to generate the second gene expression quantity feature.
  • this model can be updated together with the gene regulation relationship model to be trained; that is, when the training termination condition is not met, the model parameters of the gene regulation relationship model and of the gene expression feature extraction model are adjusted according to the loss value determined by the differences between the value of each element in the predicted gene expression matrix and the value of the element at the corresponding position in the initial gene expression matrix. In this way, the model can learn that gene expression levels 1.1 and 1.2 are closer while gene expression levels 1.1 and 1.9 differ more, so that the gene features better characterize the relationships between different gene expression levels and different genes. A sketch of such an extraction model follows.
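  • a minimal sketch of a gene expression quantity feature extraction model of this kind, as a tiny MLP on the raw scalar value (the architecture is our assumption); trained jointly, nearby expression values map to nearby features:

```python
import torch
import torch.nn as nn

class ExprQuantityFeature(nn.Module):
    """Maps a scalar gene expression quantity to a D-dim feature, with no rounding/binning."""

    def __init__(self, d: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, expr: torch.Tensor) -> torch.Tensor:
        return self.net(expr.unsqueeze(-1))          # (...,) -> (..., d)

extractor = ExprQuantityFeature()
feats = extractor(torch.tensor([1.1, 1.2, 1.9]))     # distinct inputs yield distinct features
```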
  • Step 140B inputting the input features into the gene regulation relationship model to be trained to obtain a gene regulation relationship representation, converting the gene regulation relationship representation to generate a predicted gene expression matrix, wherein the elements in the predicted gene expression matrix correspond to the elements in the initial gene expression matrix;
  • the gene regulation relationship model to be trained includes an encoder and a decoder; step 140B includes: inputting the input feature to the encoder to obtain an initial coding feature tensor; according to the position of the element in the initial gene expression matrix, the first gene feature corresponding to the first type of element and the zero-value gene feature corresponding to the element with a value of 0 in the second type of element are merged with the initial coding feature tensor to obtain a target coding feature tensor; the target coding feature tensor is input to the decoder to obtain a gene regulation relationship representation.
  • the encoder includes M layers of encoding units
  • the decoder includes N layers of decoding units, and the value of M is greater than the value of N.
  • each layer of encoding units of the encoder includes a multi-head attention unit and a forward propagation unit; each layer of decoding units of the decoder includes a forward propagation unit, and also includes a linear attention unit or a sparse attention unit.
  • the gene regulatory relationship representation may refer to the expression tensor of the regulatory relationship between each gene.
  • the gene regulatory relationship representation is a three-dimensional tensor, which includes not only the regulatory relationship between each gene, but also the expression level information corresponding to each gene in the cell.
  • the gene regulatory relationship representation is transformed through one network layer to obtain the predicted gene expression matrix, which is a prediction matrix for predicting the values of the first-category elements in the initial gene expression matrix.
  • the elements in the predicted gene expression matrix correspond to the elements in the initial gene expression matrix, that is, if the initial gene expression matrix is A ⁇ B dimensional, then the predicted gene expression matrix is also A ⁇ B dimensional.
  • the encoder is used to determine feature information of the second type of element, wherein the feature information includes expression information and regulatory relationship information.
  • the decoder is used to restore the expression information of the first type of element based on the expression information and the regulatory relationship information.
  • the gene regulation relationship model to be trained consists of two parts, namely an encoder and a decoder.
  • the input feature is input into the encoder to obtain an initial coding feature tensor (for example, a tensor of C ⁇ L ⁇ D), which includes the expression information and regulatory relationship information of the second-class elements.
  • the first gene feature corresponding to the first-class element and the zero-value gene feature corresponding to the element with a value of 0 in the second-class element are merged with the initial coding feature tensor to obtain a target coding feature tensor (for example, a tensor of C ⁇ G ⁇ D).
  • the target coding feature tensor is input into the decoder to obtain a gene regulation relationship representation (for example, a tensor of C ⁇ G ⁇ D) to predict the updated first gene feature corresponding to each first-class element.
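  • a sketch of the merge step that rebuilds the decoder input (per-position gene identification features are omitted for brevity, and all names are our assumptions):

```python
import torch

def merge_for_decoder(enc_out, keep, zero_mask, first_feat, zero_feat):
    """enc_out: (C, L, D) encoder output; keep: (C, G) bool for non-zero second-category
    elements; zero_mask: (C, G) bool for zero-valued second-category elements.
    Returns the (C, G, D) target coding feature tensor fed to the decoder."""
    C, G = keep.shape
    D = enc_out.shape[-1]
    target = first_feat.expand(C, G, D).clone()   # default: unified first gene feature
    target[zero_mask] = zero_feat                 # zero-value gene feature
    for c in range(C):
        n = int(keep[c].sum())
        target[c, keep[c]] = enc_out[c, :n]       # restore encoded features to their positions
    return target
```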
  • Step 150B updating the model parameters in the gene regulation relationship model to be trained based on the initial gene expression matrix and the predicted gene expression matrix to generate a trained gene regulation relationship model.
  • step 150B includes: calculating the difference between the value of each element in the predicted gene expression matrix and the value of the element at the corresponding position in the initial gene expression matrix; and determining the loss value according to the difference in the values.
  • if the training termination condition is not met, the values of the model parameters are adjusted according to the loss value, and the process returns to the step of randomly selecting multiple elements in the initial gene expression matrix as the first-category elements.
  • model parameters may include parameters of an encoder and parameters of a decoder, and may also include parameters of a network used to convert a gene regulatory relationship representation into a predicted gene expression matrix.
  • the first-class elements and elements with a value of 0 are removed from the initial gene expression matrix including the expression of highly variable genes and non-hypervariable genes, and only the input features corresponding to the elements with non-0 values in the second-class elements are used as the input of the gene regulation relationship model to be trained.
  • there is no need to compute over a large amount of data, which reduces the computational complexity and cost, improves training efficiency, and enables the model to learn potential gene regulation relationships at the whole-genome level.
  • compared with prior-art gene regulation relationship model generation methods, this solves the problems that the regulation relationship model is difficult to train and that the regulation relationship map learned by the model has systematic omissions.
  • the present disclosure embodiment further provides a prediction model training method 100C, and the method 100C includes steps 110C to 130C.
  • Step 110C obtaining a plurality of samples, wherein each of the plurality of samples comprises first gene expression data, masked gene expression data, and auxiliary information; the first gene expression data comprises counts of different genes in a single cell measured at a first sequencing depth; the masked gene expression data is obtained by downsampling the first gene expression data and masking the counts of some genes in the downsampled first gene expression data; the downsampled first gene expression data simulates the respective counts of different genes in a single cell measured at a second sequencing depth lower than the first sequencing depth; and the auxiliary information includes a first total count, which is the sum of the counts of each gene in the first gene expression data;
  • the first gene expression data obtained by actual sequencing is used as the true value of the gene count at a high sequencing depth
  • the first gene expression data after downsampling is used to simulate the gene count at a low sequencing depth
  • the model is used to predict the gene count at a high sequencing depth based on the gene count at a low sequencing depth
  • the loss is calculated based on the predicted value and the true value of the gene count at a high sequencing depth
  • the model is updated with the loss value. In this way, the prediction model can learn to capture the relationship between the gene expressions of similar cells at different sequencing depths.
  • the MAE (masked autoencoder) method is used for training. That is, the downsampled first gene expression data is used as the gene counts at the low sequencing depth and is partially masked; the model predicts the gene counts at the high sequencing depth from the partially masked low-depth counts; the loss is calculated from the predicted gene counts at the high sequencing depth and the corresponding masked elements of the true gene counts at the high sequencing depth; and the model is updated with the loss value. Therefore, when constructing the training data, it is necessary to take the first gene expression data obtained by actual sequencing, downsample it, and mask the counts of some genes in the downsampled first gene expression data to obtain the masked gene expression data.
  • in order for the prediction model to correctly predict the gene counts at a high sequencing depth based on the gene counts at a low sequencing depth, the prediction model needs to understand what the low and high sequencing depths are. It is understandable that the higher the sequencing depth, the greater the sum of the counts of each gene in the sequencing result, so the sequencing depth can be characterized by the sum of the counts of each gene at that depth.
  • the auxiliary information includes the first total count, but does not necessarily include the sum of the gene counts at the low sequencing depth, that is, the sum of the counts of each gene in the downsampled first gene expression data. Because even if the auxiliary information does not include the sum of the counts of each gene in the downsampled first gene expression data, the model can guess the gene counts of the masked part based on the masked gene expression data in the sample, and then estimate the sum of the counts of each gene in the downsampled first gene expression data.
  • the auxiliary information may also include the sum of the counts of each gene in the downsampled first gene expression data, so that the model can be explicitly informed of the low sequencing depth, thereby reducing the learning cost and accelerating model convergence.
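  • putting these pieces together, a sketch of constructing one training sample (the mask ratio, Beta parameters, and the NaN stand-in for the mask token are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_sample(x1: np.ndarray, mask_ratio: float = 0.15) -> dict:
    """x1: first gene expression data (raw counts measured at the first sequencing depth)."""
    x2 = rng.binomial(x1, rng.beta(2.0, 2.0))    # downsampled first gene expression data
    T, S = int(x1.sum()), int(x2.sum())          # first total count / second total count
    norm_x1 = np.log1p(x1 / T * 1e4)             # normalized prediction target
    norm_x2 = np.log1p(x2 / S * 1e4)
    mask = rng.random(x1.size) < mask_ratio      # positions whose counts are masked
    masked = np.where(mask, np.nan, norm_x2)     # NaN stands in for the mask token
    return {"target": norm_x1, "masked": masked, "mask": mask, "aux": (T, S)}
```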
  • Step 120C for each sample in the plurality of samples, executing step PB1 and step PB2, wherein step PB1 and step PB2 are sub-steps of step 120C;
  • Step PB1 using the prediction model to be trained to process the masked gene expression data and auxiliary information in the sample to obtain the predicted value of each gene count at the first sequencing depth corresponding to the sample;
  • the masked gene expression data and the auxiliary information can both be input from the first network layer of the prediction model.
  • the masked gene expression data in the sample and the auxiliary information are concatenated as the input feature tensor corresponding to the sample; or the feature tensor corresponding to the masked gene expression data in the sample and the feature tensor corresponding to the auxiliary information are concatenated as the input feature tensor corresponding to the sample; and the input feature tensor is processed using the prediction model to be trained.
  • the masked gene expression data can also be input from the first network layer of the prediction model, and the auxiliary information can be input from the network layer behind the prediction model.
  • the prediction model to be trained includes an input layer, an output layer, and multiple intermediate layers between the input layer and the output layer.
  • step PB1 includes: using the masked gene expression data in the sample (or the feature tensor corresponding to it) as the input feature tensor, inputting it into the input layer and the first predetermined layers among the multiple intermediate layers to obtain an intermediate feature tensor, where the number of first predetermined layers is greater than or equal to 0; splicing the auxiliary information with the intermediate feature tensor to obtain a spliced intermediate feature tensor; and inputting the spliced intermediate feature tensor into the second predetermined layers among the multiple intermediate layers and the output layer, where the second predetermined layers are different from the first predetermined layers.
  • for example, when the prediction model includes a decoder network, the auxiliary information is input before the last several layers of the decoder network; for another example, when the prediction model includes an encoder network and a decoder network, the auxiliary information can be input before a certain layer of the encoder network or before the last several layers of the decoder network. In this way, the model can use the auxiliary information to extract more meaningful features or make more accurate predictions while processing the input data.
  • the input to the prediction model may be the masked gene expression data and auxiliary information themselves, or the masked gene expression data and auxiliary information may be converted into corresponding embedding vectors and then input into the prediction model;
  • the input to the first network layer of the prediction model may be the counts corresponding to all genes in the masked gene expression data, or the counts corresponding to some genes (for example, the counts corresponding to genes whose counts are not 0 and are not masked).
  • the masked gene expression data, the feature tensor corresponding to the masked gene expression data, the concatenation of the masked gene expression data with the auxiliary information, or the concatenation of the feature tensor corresponding to the masked gene expression data with the feature tensor corresponding to the auxiliary information can all be interpreted broadly as the input feature tensor. Likewise, "input" refers not only to input into the first network layer of the prediction model but to input into any network layer of the prediction model; for example, part of the input feature tensor may be used as the input of the first network layer of the prediction model, and part as the input of an intermediate network layer of the prediction model.
  • the input feature tensor may include the parts corresponding to all elements of the masked gene expression data, or only the parts corresponding to some elements of the masked gene expression data (such as the non-zero, unmasked parts), and may also include parts corresponding to elements outside the masked gene expression data (such as filler elements, explained later).
  • the size of the object processed by a network layer of the prediction model is fixed, for example, 2000*768.
  • when the counts corresponding to the genes with non-zero, unmasked counts in the masked gene expression data are input into a network layer of the prediction model, the varying numbers of non-zero gene counts and the varying proportions of masked elements in different masked gene expression data may lead to objects of different sizes. In this case, they can be padded to the same size to ensure that the network layer always processes objects of the same size.
  • the number of non-zero and unmasked gene counts included in the masked gene expression data in each sample fluctuates between 300-1000, and the size of the input feature tensor can be padded to 1000.
  • the size of the input feature tensor can be padded to 10% of the total number of genes.
  • Step PB2 determining a loss value corresponding to the sample according to the predicted value and the counts corresponding to the part of the genes in the first gene expression data
  • Step 130C update the prediction model to be trained according to the loss value corresponding to each sample in the multiple samples.
  • this specific implementation downsamples sequencing results obtained at an actually high sequencing depth to simulate sequencing results at a low sequencing depth, uses the high-depth and low-depth results as samples to train the prediction model, and explicitly introduces auxiliary information characterizing the expected sequencing depth. The prediction model thereby learns to capture the relationship between the gene counts of similar cells at different sequencing depths and the relationships between gene counts within a single cell, minimizing the interference of technical noise in sequencing and increasing the sequencing depth by computational means, which improves the accuracy of downstream task outputs.
  • each sample among the multiple samples includes normalized first gene expression data, normalized masked gene expression data and auxiliary information
  • the normalized first gene expression data is obtained by normalizing the first gene expression data
  • the normalized masked gene expression data is obtained by downsampling the first gene expression data, normalizing the downsampled first gene expression data, and masking the counts of some genes in the normalized result
  • the auxiliary information also includes a second total count, which is the sum of the counts of each gene in the downsampled first gene expression data
  • step PB1 includes: using the prediction model to be trained to process the normalized masked gene expression data and the auxiliary information in the sample to obtain a predicted value of the normalized counts of each gene at the first sequencing depth corresponding to the sample; and step PB2 includes: determining the loss value corresponding to the sample based on the predicted value and the counts corresponding to that part of the genes in the normalized first gene expression data.
  • the relative relationship between the counts of different genes is more important for downstream tasks.
  • the sequencing results need to be standardized and normalized to obtain the samples used for training.
  • the normalized first gene expression data and the normalized masked gene expression data can be used to replace the first gene expression data and the masked gene expression data in step 110, step 120, step 121, and step 122.
  • the sequencing depth is represented by the total counts.
  • the total counts in the gene expression data can be used for normalization. The specific normalization method is described above.
  • the downsampled first gene expression data simulates the counts of different genes in a single cell measured at a second sequencing depth lower than the first sequencing depth, and the second total count can be used as a representation of the second sequencing depth.
  • while the first sequencing depth can be obtained through actual sequencing and is therefore known, the second sequencing depth is unknown.
  • the first total count and the second total count are used to characterize the first sequencing depth and the second sequencing depth.
  • the first total count T and the second total count S can both determine their corresponding embedding vectors in the manner described above, and the sign of the first total count is different from the sign of the second total count and is also different from other gene IDs.
  • without it, the model cannot estimate the sum of the counts of each gene in the downsampled but unnormalized first gene expression data, that is, it cannot estimate how low the lower sequencing depth corresponding to the normalized masked gene expression data is. Therefore, the second total count must be included in the auxiliary information. In this way, the prediction model can explicitly know from the second total count how low the lower sequencing depth corresponding to the normalized masked gene expression data is, and from the first total count how high the higher sequencing depth corresponding to the first gene expression data is, and can then predict the gene counts at the higher sequencing depth based on the normalized masked gene expression data.
  • the predicted value of each gene count at the first sequencing depth output by the model is also normalized.
  • the predicted value can be restored to an unnormalized predicted gene count by multiplying it by the first total count, thereby remapping the normalized predicted value to the original count space; see the sketch below.
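  • a sketch of this restoration; for the plain division normalization, the inverse is multiplication by the first total count T as stated above, while a log[X/sum(X)*scale+1] normalization would additionally need an expm1 and rescaling step (that branch is our assumption):

```python
import numpy as np
from typing import Optional

def denormalize(pred: np.ndarray, T: float, log1p_scale: Optional[float] = None) -> np.ndarray:
    """Map normalized predictions back to counts at the first sequencing depth."""
    if log1p_scale is None:
        return pred * T                           # inverse of x / sum(x)
    return np.expm1(pred) / log1p_scale * T       # inverse of log[x / sum(x) * scale + 1]
```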
  • the normalized masked gene expression data can be obtained from the first gene expression data in the following manner: Step 1, downsampling the first gene expression data. Step 2, normalizing the downsampled first gene expression data. Step 3, masking the counts of some genes in the result of Step 2 to obtain the masked gene expression data.
  • the prediction model to be trained includes an encoder network and a decoder network, wherein processing the input feature tensor using the prediction model to be trained includes: encoding the first tensor corresponding to the unmasked and non-zero counts in the input feature tensor using the encoder network to obtain an encoded tensor; replacing the first tensor in the input feature tensor with the encoded tensor to obtain an encoded input feature tensor; and decoding the encoded input feature tensor using the decoder network.
  • step PB1 includes:
  • Step PB11 using the encoder network to encode the first tensor corresponding to the unmasked and non-zero counts in the input feature tensor to obtain an encoded tensor.
  • the size of the first tensor can be 2000*768, and the size of the encoded tensor can be 2000*768.
  • the 1st, 2nd, 4th, 10th, 30th, etc. elements of the 2000 elements are non-zero and unmasked elements, and the 1561st to 2000th elements are filled elements.
  • Step PB12 replace the first tensor in the input feature tensor with the encoding tensor to obtain the encoded input feature tensor.
  • Step PB13 using a decoder network to decode the encoded input feature tensor to obtain a decoded tensor.
  • the size of the decoded tensor can be N*768.
  • the prediction model further includes a multilayer perceptron.
  • step PB1 further includes step PB14: using the multilayer perceptron in the prediction model to process the decoded tensor to obtain the predicted value of each gene count at the first sequencing depth corresponding to the sample.
  • the size of the predicted value is N*1. It is understandable that the multilayer perceptron is trained together with other parts in the prediction model.
  • the masked gene expression data and auxiliary information can be input into a module that determines count embedding vectors based on counts using the method described above to obtain an input feature tensor.
  • the size of the masked gene expression data is N*1
  • the auxiliary information includes T and S
  • the size of the input feature tensor is (N+2)*768.
  • the encoder part focuses on the embedding vector corresponding to the non-zero count in the input feature tensor, while the decoder part receives the embedding vectors of all genes (i.e., the embedding processed by the encoder and the embedding of other genes), integrating the information of all positions.
  • the input sequence length of the encoder is approximately 10% of the full gene length.
  • the prediction model to be trained is one of an encoder-decoder network, a decoder network, and a multi-layer perceptron.
  • the downsampled first gene expression data is obtained by downsampling the first gene expression data using a statistical sampling algorithm. For details, see the above description of downsampling.
  • FIG. 2 is a flow chart of a method for correcting gene expression data according to an embodiment of the present disclosure.
  • a prediction model has been trained, and the trained prediction model can correct gene expression data.
  • the trained prediction model is used for reasoning. As shown in FIG. 2, the method 200 includes steps 210 and 220.
  • the method for correcting gene expression data is also referred to as a method for determining gene regulatory relationships.
  • Step 210 obtaining current gene expression data, wherein the current gene expression data includes respective counts of different genes measured at an actual sequencing depth.
  • the prediction model trained in method 100 is used by method 200, and the current gene expression data should be in the same form as the masked gene expression data, except that no mask is required. If the masked gene expression data in the training sample is normalized, the current gene expression data should also be normalized; if the masked gene expression data in the training sample is not normalized, the current gene expression data should also be not normalized.
  • method 200 uses a prediction model that provides auxiliary information during training, then the auxiliary information corresponding to the current gene expression data should also be obtained in step 210; if method 200 uses a prediction model that does not provide auxiliary information during training, then there is no need to obtain the auxiliary information corresponding to the current gene expression data in step 210.
  • the current gene expression data may be single-cell gene expression data or gene expression data obtained by sequencing a large number of cells (bulk).
  • the current gene expression data may be a gene expression matrix, and the gene expression matrix may be obtained through single-cell sequencing. If the current gene expression data is derived from a single cell, the gene expression matrix may be a vector; if the current gene expression data is derived from multiple single cells, the gene expression matrix may be a matrix, and each row or column of the matrix corresponds to the gene expression level of a single cell.
  • the genes in the current gene expression data are a subset of the genes contained in the first gene expression data in the sample used when training the prediction model. Otherwise, the expression level relationship between the genes in the current gene expression data is not learned by the prediction model, and the corresponding gene expression level relationship cannot be predicted by the prediction model.
  • Step 220 using at least part of the network layers in the prediction model to process the current gene expression data to obtain a correction value of the current gene expression data or an intermediate processing result of the correction value.
  • the prediction model may be trained according to method 100 .
  • all network layers in the prediction model may be used to process the current gene expression data, in which case the corrected value of the current gene expression data is obtained; or only some network layers may be used (e.g., the first M layers, or all network layers except the one that projects the corrected intermediate representation of the relative relationships between gene counts into predicted gene counts), in which case an intermediate processing result of the corrected value is obtained. The choice depends on the needs of the downstream task.
  • the correction value is the output of the multilayer perceptron in FIG. 1A
  • the intermediate processing result of the correction value refers to the intermediate processing result obtained to obtain the correction value, and does not mean that the correction value is finally obtained.
  • the encoding vector output by the encoder network and the decoding vector output by the decoder network in Figure 1A are all intermediate processing results.
  • the at least some network layers of the prediction model used in step 220 include an encoder network and may also include a decoder network.
  • Features characterizing cells can be obtained based on the output of the encoder network, and features representing genes can be obtained based on the output of the decoder network.
  • when the downstream task needs to use features characterizing cells, in step 220 only the encoder network can be used to process the current gene expression data, and in subsequent steps the cell-characterizing features obtained from the output of the encoder network are used for the downstream task; when the downstream task needs to use features characterizing genes, in step 220 the encoder network and the decoder network can both be used to process the current gene expression data, and in subsequent steps the gene-characterizing features obtained from the output of the decoder network are used for the downstream task.
  • the method of obtaining the characteristics of the cell from the output of the encoder network may be that the output of the encoder network passes through a pooling layer (e.g., a maximum pooling layer) to obtain the characteristics of the cell.
  • the method of obtaining the characteristics of the gene from the output of the decoder network may be that the output of the decoder network passes through a multi-layer perceptron to obtain the characteristics of the gene.
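  • a sketch of both feature extractions (the pooling choice follows the maximum-pooling example above; the gene head's shape is our assumption):

```python
import torch
import torch.nn as nn

def cell_feature(enc_out: torch.Tensor, pad: torch.Tensor) -> torch.Tensor:
    """enc_out: (L, D) encoder output for one cell; pad: (L,) bool marking filler tokens.
    Max-pools the non-filler outputs into a single cell-characterizing embedding."""
    return enc_out[~pad].max(dim=0).values            # (D,)

gene_head = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

def gene_features(dec_out: torch.Tensor) -> torch.Tensor:
    """dec_out: (G, D) decoder output; returns (G, D) gene-characterizing features."""
    return gene_head(dec_out)
```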
  • step 220 also uses at least part of the network layer in the prediction model to process the auxiliary information of the current gene expression data.
  • in this case, what is obtained in step 220 is the corrected value of the current gene expression data at the expected sequencing depth, or an intermediate processing result of that corrected value;
  • if method 200 uses a prediction model that was trained without auxiliary information, what is obtained in step 220 is the corrected value (or an intermediate processing result of the corrected value) of the current gene expression data at a sequencing depth that remains substantially unchanged, and there is no need to process auxiliary information corresponding to the current gene expression data in step 220.
  • method 200 further includes step 230, determining a feature tensor corresponding to the current gene expression data, and step 220 includes processing the feature tensor corresponding to the current gene expression data using at least part of the network layers in the prediction model.
  • step 230 can refer to the method for determining the feature tensors corresponding to unmasked and non-0 counts and unmasked and 0 counts in step 123.
  • Step 220 includes: Step 2201, encoding the input of the encoder network using the encoder network to obtain the output of the encoder network; the input of the encoder network includes the feature tensor corresponding to the non-zero counts in the current gene expression data (no counts in the current gene expression data are masked); exemplarily, the input of the encoder network also includes the feature tensor corresponding to the filler elements.
  • step 220 includes step 2202, according to the output of the encoder network and the feature tensor corresponding to the counts of 0 values in the current gene expression data (exemplarily, the two are spliced according to the position of the counts), the input of the decoder network is obtained; the input of the decoder network is decoded by the decoder network to obtain the output of the decoder network.
  • step 220 includes 2203, projecting the output of the decoder network as a correction value of the current gene expression data.
  • step 2201 can be executed to obtain the output of the encoder network (optionally with the feature tensors at positions corresponding to filler elements removed), which is then input into a pooling layer to obtain features characterizing cells, and those features are used for downstream tasks.
  • steps 2201 and 2202 can be executed to obtain the output of the decoder network, and then the output of the decoder network is input into a multilayer perceptron to obtain features that characterize genes, and the features are used for downstream tasks.
  • steps 2201, 2202 and 2203 can be executed to use the correction value for the downstream task.
  • Method 200 can improve the accuracy of the relative relationship between each gene count in the current gene expression data through the prediction model, obtain the correction value of the current gene expression data that can be used for different downstream tasks or the intermediate processing result of the correction value, and apply it to the downstream task to improve the accuracy of the downstream task.
  • the downstream task in the disclosed embodiment can be an existing downstream task algorithm or model, such as a cell classification model, a disturbance prediction model, etc., and these algorithms or models use the characteristics of characterizing cells or the characteristics of characterizing genes as input.
  • the characteristics of characterizing cells and the characteristics of characterizing genes provided by the disclosed embodiment replace the characteristics of characterizing cells and the characteristics of characterizing genes used in the downstream task algorithm or model of the prior art. That is, the characteristics of characterizing cells or the characteristics of characterizing genes obtained according to method 200 can replace the characteristics of characterizing cells or the characteristics of characterizing genes originally used in the downstream task.
  • step 210 includes: obtaining normalized current gene expression data and current auxiliary information; wherein the current gene expression data includes the counts of different genes measured at the actual sequencing depth, and the normalized current gene expression data is obtained by normalizing the current gene expression data; the current auxiliary information includes an expected first total count and a current second total count; the current second total count is the sum of the counts of each gene in the current gene expression data; the expected first total count is used for characterizing the expected sequencing depth, and the expected first total count is greater than or equal to the sum of the counts of each gene in the current gene expression data.
  • the sample includes normalized first gene expression data, masked gene expression data, and auxiliary information, so that the prediction model learns to capture the relationship between gene counts of similar cells at different sequencing depths and the relationship between gene counts in a single cell, and is thus able to predict gene counts at a high sequencing depth based on gene counts at a low sequencing depth; accordingly, when using the prediction model trained using the sample, the input of the prediction model includes normalized current gene expression data and current auxiliary information characterizing low sequencing depth and high sequencing depth, so that the prediction model is able to predict the expected gene counts at a high sequencing depth based on the gene counts at a low sequencing depth.
  • the expected first total count may be greater than or equal to the current second total count.
  • when the expected first total count is greater than the current second total count, the prediction model predicts the expected gene counts at a higher sequencing depth based on the gene counts actually measured at a lower sequencing depth.
  • when the expected first total count is equal to the current second total count, the prediction model predicts corrected gene counts, based on the actually measured gene counts, under a sequencing depth that remains substantially unchanged.
  • the expected sequencing depth can be set according to the sequencing method.
  • for example, when the sequencing method is 10X Genomics Chromium sequencing technology ("10X sequencing technology" for short), the sum of the counts of each gene in the current gene expression data is about 1000.
  • the expected sequencing depth can be set to 10000.
  • the current gene expression data is normalized to obtain the normalized current gene expression data, so that at least part of the network layers in the prediction model can subsequently process the normalized current gene expression data. It is understandable that if the gene expression data obtained at the low sequencing depth has already been normalized, it can first be restored to the unnormalized current gene expression data.
  • Step 220 includes: using at least part of the network layer in the prediction model to process the normalized current gene expression data and the current auxiliary information to obtain a corrected value of the normalized value of the current gene expression data at the expected sequencing depth or an intermediate processing result of the corrected value of the normalized value.
  • at least part of the network layers in the prediction model process the feature tensor corresponding to the normalized current gene expression data and the feature tensor corresponding to the current auxiliary information.
  • accordingly, the feature tensor corresponding to the normalized current gene expression data and the feature tensor corresponding to the current auxiliary information need to be determined first.
  • the method for determining the feature tensor corresponding to the normalized current gene expression data is described in the description of step 230, and the method for determining the feature tensor corresponding to the current auxiliary information is described in the description of step 125.
  • the current auxiliary information can be input into the first network layer together with the current gene expression data, or only the current gene expression data can be input into the first network layer while the auxiliary information is input into a network layer different from the first network layer.
  • the specific input method and input form must be consistent with the training stage and will not be repeated here.
  • a method 200A for determining a gene regulation relationship including: step 210A, obtaining a gene expression matrix corresponding to the cells to be treated; wherein the genes expressed by the cells to be treated are a subset of the genes contained in the initial gene expression matrix; step 220A, inputting the gene expression matrix into the gene regulation relationship model, obtaining a gene regulation relationship representation, and the gene regulation relationship representation is a three-dimensional tensor; wherein the gene regulation relationship model is generated according to the gene regulation relationship model generation method of method 100B.
  • the gene expression matrix can be obtained, for example, by single-cell sequencing.
  • when there is a single cell to be treated, the gene expression matrix can be a vector; when there are multiple cells to be treated, the gene expression matrix can be a matrix in which each row or column corresponds to the gene expression of one cell.
  • the obtained gene expression matrix is input into the trained gene regulation relationship model, and the gene regulation relationship model outputs the gene regulation relationship representation corresponding to the cells to be treated.
  • the gene regulation relationship is expressed as a three-dimensional tensor (for example, a three-dimensional tensor of C ⁇ G ⁇ D), and the gene regulation relationship corresponding to each cell to be treated is characterized by the matrix corresponding to the cell to be treated in the three-dimensional tensor.
  • the gene regulation relationship model is generated according to the gene regulation relationship model generation method. It is understandable that when C is 1, the three-dimensional tensor of C ⁇ G ⁇ D becomes 1 ⁇ G ⁇ D. It is understandable that the genes expressed by the cells to be treated are a subset of the genes contained in the initial gene expression matrix used when training the gene regulation relationship model. If the genes expressed by the cells to be treated are not in the genes contained in the initial gene expression matrix, it means that the regulatory relationship between its genes has not been learned by the gene regulation relationship model, and its corresponding gene regulation relationship cannot be predicted by the gene regulation relationship model.
  • a method 200B for correcting gene expression data is provided, as shown in FIG2B , and the method 200B includes: step 210B, obtaining current gene expression data and current auxiliary information, wherein the current gene expression data includes the counts of different genes measured at the actual sequencing depth, and the current auxiliary information includes the expected first total count, the expected first total count is used to characterize the expected sequencing depth, and the expected first total count is greater than or equal to the sum of the counts of each gene in the current gene expression data; step 220B, using at least part of the network layer in the prediction model trained according to method 100C to process the current gene expression data and the current auxiliary information, so as to obtain the correction value of the current gene expression data at the expected sequencing depth or the intermediate processing result of the correction value.
  • the auxiliary information may only include the first total count but not the second total count, and accordingly, in the inference stage, the current auxiliary information may only include the expected first total count but not the current second total count.
  • the first gene expression data in the training sample in the training stage and the current gene expression data in the inference stage are both unnormalized.
  • FIG. 3 is a flow chart of a downstream task execution method according to an embodiment of the present disclosure. As shown in FIG. 3, the method 300 includes step 310 and step 320.
  • in step 310, input data is obtained, wherein the input data includes i) a correction value obtained according to method 200; or ii) an intermediate processing result of the correction value obtained according to method 200; or iii) a preprocessing result obtained by preprocessing the correction value obtained by method 200; or iv) a preprocessing result obtained by preprocessing the intermediate processing result of the correction value obtained by method 200.
  • preprocessing can be pooling, processing using a linear layer, processing using a specific model, etc.
  • preprocessing can be the method described above for obtaining features characterizing cells from the output of the encoder, or it can be the method described above for obtaining features characterizing genes from the output of the decoder.
  • in step 320, the input data is processed using a downstream task algorithm to obtain a downstream task result, wherein the downstream task includes a cell classification task, a perturbation prediction task, or a drug response prediction task.
  • downstream tasks include cell classification tasks (further subdivided into classification tasks and clustering tasks), perturbation prediction tasks, and drug response prediction tasks.
  • features characterizing cells are used for cell classification tasks, cell clustering tasks, and drug response prediction tasks, and features characterizing genes are used for perturbation prediction tasks.
  • the classification model can use, for example, a Support Vector Machine (SVM), a Multilayer Perceptron (MLP), or a Decision Tree; a minimal scikit-learn sketch applying such a classifier and a clustering algorithm is given after this list.
  • a clustering algorithm can be used to cluster the features representing the cells to obtain a clustering result.
  • the clustering algorithm can be K-means clustering or the like.
  • the cell classification results may be, for example, classification results of cancer cells and normal cells, clustering results of different types of immune cells, or classification results of cell types in different organs.
  • a graph neural network can be used to extract drug features, and the prediction model can be used to determine the features characterizing cells.
  • the drug features and features that characterize cells are input into the drug response prediction model to obtain a prediction result on whether the cells are sensitive to the drug.
  • the features characterizing the genes determined by the prediction model can be used as the gene features corresponding to the gene nodes of the gene co-expression graph neural network.
  • the perturbation features are applied to the gene co-expression graph neural network to predict the expression of the perturbed genes.
  • models other than the prediction model used in downstream tasks can be trained together with the prediction model.
  • an electronic device comprising: at least one processor; and at least one memory communicatively connected to the at least one processor, wherein the at least one memory stores instructions, and when the instructions are executed by the at least one processor, the at least one processor executes the above method.
  • a non-transitory computer-readable storage medium storing instructions.
  • when the instructions are executed by at least one processor of a computer, the computer executes the above method.
  • a computer program product including a computer program, and the computer program implements the above method when executed by a processor.
  • FIG. 4 is a block diagram of an exemplary electronic device that can be used to implement an embodiment of the present disclosure.
  • the electronic device 400 can be a variety of different types of devices. Examples of the electronic device 400 include, but are not limited to, desktop computers, server computers, laptop or netbook computers, mobile devices (e.g., tablet computers, cellular or other wireless phones (e.g., smart phones), notepad computers, mobile stations), wearable devices (e.g., glasses, watches), entertainment devices (e.g., entertainment appliances, set-top boxes communicatively coupled to a display device, game consoles), televisions or other display devices, automotive computers, and the like.
  • the electronic device 400 may include at least one processor 402, memory 404, communication interface(s) 406, a display device 408, other input/output (I/O) devices 410, and one or more mass storage devices 412 that can communicate with each other, such as via a system bus 414 or other appropriate connection.
  • Processor 402 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores.
  • Processor 402 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and/or any device that manipulates signals based on operating instructions.
  • processor 402 may be configured to obtain and execute computer-readable instructions stored in memory 404, mass storage device 412, or other computer-readable media, such as program code for operating system 416, program code for application program 418, program code for other programs 420, and the like.
  • the memory 404 and the mass storage device 412 are examples of computer-readable storage media for storing instructions that are executed by the processor 402 to implement the various functions described above.
  • the memory 404 may generally include both volatile memory and non-volatile memory (e.g., RAM, ROM, etc.).
  • the mass storage device 412 may generally include a hard drive, a solid-state drive, a removable medium, including external and removable drives, a memory card, a flash memory, a floppy disk, an optical disk (e.g., a CD, a DVD), a storage array, a network attached storage, a storage area network, etc.
  • the memory 404 and the mass storage device 412 may all be collectively referred to herein as memory or computer-readable storage media, and may be a non-transitory medium capable of storing computer-readable, processor-executable program instructions as computer program code, which may be executed by the processor 402 as a specific machine configured to implement the operations and functions described in the examples herein.
  • a plurality of programs may be stored on the mass storage device 412. These programs include an operating system 416, one or more application programs 418, other programs 420, and program data 422, and they may be loaded into the memory 404 for execution. Examples of such applications or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: method 100 (including any suitable steps of method 100), method 100B (including any suitable steps of method 100B), method 100C (including any suitable steps of method 100C), method 200 (including any suitable steps of method 200), method 200A (including any suitable steps of method 200A), method 200B (including any suitable steps of method 200B), method 300 (including any suitable steps of method 300), and/or other embodiments described herein.
  • computer-readable media includes at least two types of computer-readable media, namely, computer-readable storage media and communication media.
  • Computer-readable storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information, such as computer-readable instructions, data structures, program modules or other data.
  • Computer-readable storage media include but are not limited to RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD), or other optical storage device, magnetic cassette, magnetic tape, magnetic disk storage device or other magnetic storage device, or any other non-transmission medium that can be used to store information for access by electronic devices.
  • communication media can embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transmission mechanism.
  • Computer-readable storage media defined herein do not include communication media.
  • One or more communication interfaces 406 are used to exchange data with other devices, such as through a network, direct connection, etc.
  • Such communication interfaces can be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc.
  • the communication interface 406 can facilitate communication within a variety of network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, etc.
  • the communication interface 406 can also provide communication with external storage devices (not shown) such as storage arrays, network attached storage, storage area networks, etc.
  • a display device 408 such as a monitor may be included for displaying information and images to the user.
  • Other I/O devices 410 may be devices that receive various inputs from the user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and the like.
  • the techniques described herein may be supported by these various configurations of electronic device 400 and are not limited to the specific examples of the techniques described herein.
  • the functionality may also be implemented in whole or in part on a "cloud" using a distributed system.
  • the cloud includes and/or represents a platform for resources.
  • the platform abstracts the underlying functionality of the hardware (e.g., servers) and software resources of the cloud.
  • Resources may include applications and/or data that can be used while performing computing processing on servers remote from the electronic device 400.
  • Resources can also include services provided through the Internet and/or through a subscriber network such as a cellular or Wi-Fi network.
  • the platform can abstract resources and functions to connect the electronic device 400 with other electronic devices. Therefore, the implementation of the functions described herein can be distributed throughout the cloud. For example, the functions can be implemented partially on the electronic device 400 and partially through a platform that abstracts the functions of the cloud.
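To make the step 210 bullet above concrete, here is a minimal sketch of assembling the normalized current gene expression data and the current auxiliary information, as promised in that bullet; the total-count normalization, the default target of 10000, and the function name are illustrative assumptions rather than an implementation prescribed by the disclosure.

```python
import numpy as np

def build_correction_input(raw_counts: np.ndarray, expected_first_total: float = 10000.0):
    """Assemble the inputs described in step 210: normalized current gene
    expression data plus auxiliary information consisting of the expected
    first total count and the current second total count.

    The simple total-count normalization below is an assumption for
    illustration; the disclosure only requires that normalization be
    consistent with the training stage.
    """
    current_second_total = float(raw_counts.sum())  # sum of the gene counts
    assert expected_first_total >= current_second_total, (
        "the expected first total count must be >= the current second total count"
    )
    normalized = raw_counts / current_second_total  # assumed normalization
    auxiliary = {
        "expected_first_total": expected_first_total,  # characterizes the expected depth
        "current_second_total": current_second_total,  # characterizes the actual depth
    }
    return normalized, auxiliary
```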
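And for the classification and clustering bullets above, the following sketch feeds features characterizing cells (e.g., pooled encoder outputs) into off-the-shelf scikit-learn models; the feature matrix, labels, and hyperparameters are placeholders, not values from the disclosure.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Features characterizing cells, one row per cell; random placeholders here.
cell_features = np.random.rand(300, 64)
cell_labels = np.random.randint(0, 2, size=300)  # e.g. cancer vs. normal cells

# Cell classification task with an SVM (an MLP or decision tree is used analogously).
classifier = SVC().fit(cell_features[:200], cell_labels[:200])
predicted_labels = classifier.predict(cell_features[200:])

# Cell clustering task with K-means.
cluster_ids = KMeans(n_clusters=5, n_init=10).fit_predict(cell_features)
```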

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A prediction model training method, a gene expression data correction method, a downstream task execution method, an electronic device, and a medium. The training method comprises: acquiring a plurality of samples, wherein each of the plurality of samples comprises first gene expression data and masked gene expression data, the first gene expression data comprises respective counts of different genes in a single cell, the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, and the processing for obtaining the masked gene expression data comprises masking; for each of the plurality of samples: processing the masked gene expression data in the sample by using a prediction model to be trained, so as to obtain predicted values of the counts of some genes corresponding to the sample; according to the predicted values, and the counts corresponding to some genes in the first gene expression data, determining a loss value corresponding to the sample; and updating said prediction model according to the loss value corresponding to each of the plurality of samples.

Description

Methods for training prediction models, correcting gene expression data, and performing downstream tasks

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to application CN202310097546.2 filed on January 19, 2023 and application CN202310630156.7 filed on May 30, 2023. The prior applications are deemed to be part of the disclosure of this application and are incorporated herein in their entirety.

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular to a prediction model training method, a gene expression data correction method, a downstream task execution method, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Single-cell sequencing technology is a technology that performs sequencing analysis on the genome, transcriptome, and epigenome at the level of a single cell. Due to the interference of technical noise, the sequencing results of similar cells at the same sequencing depth differ to some extent. This makes it difficult to use single-cell sequencing results directly in downstream applications; therefore, the results of single-cell sequencing need to be processed in a certain way to better serve downstream applications.

The methods described in this section are not necessarily methods that have been previously conceived or employed. Unless otherwise indicated, it should not be assumed that any method described in this section is considered to be prior art simply because it is included in this section. Similarly, unless otherwise indicated, the issues mentioned in this section should not be considered to have been recognized in any prior art.

Summary of the Invention

According to a first aspect of the present disclosure, a method for training a prediction model is provided, comprising: obtaining a plurality of samples, wherein each of the plurality of samples includes first gene expression data and masked gene expression data, the first gene expression data includes the respective counts of different genes in a single cell, the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, and the processing for obtaining the masked gene expression data includes masking; for each of the plurality of samples: processing the masked gene expression data in the sample using the prediction model to be trained, to obtain predicted values of the counts of the some genes corresponding to the sample; determining a loss value corresponding to the sample according to the predicted values and the counts corresponding to the some genes in the first gene expression data; and updating the prediction model to be trained according to the loss value corresponding to each of the plurality of samples.

According to a second aspect of the present disclosure, a method for correcting gene expression data is provided, comprising: obtaining current gene expression data, wherein the current gene expression data includes the respective counts of different genes measured at an actual sequencing depth; and processing the current gene expression data using at least part of the network layers in a prediction model trained by the training method of the first aspect of the present disclosure, to obtain a correction value of the current gene expression data or an intermediate processing result of the correction value.

According to a third aspect of the present disclosure, a downstream task execution method is provided, comprising: obtaining input data, wherein the input data includes i) a correction value obtained by the method of the second aspect of the present disclosure; or ii) an intermediate processing result of the correction value according to the second aspect of the present disclosure; or iii) a preprocessing result obtained by preprocessing the correction value obtained by the method of the second aspect of the present disclosure; or iv) a preprocessing result obtained by preprocessing the intermediate processing result of the correction value obtained by the method of the second aspect of the present disclosure; and processing the input data using a downstream task algorithm to obtain a downstream task result, wherein the downstream task includes a cell classification task, a perturbation prediction task, or a drug response prediction task.

According to another aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory storing instructions executable by the processor, wherein the instructions, when executed by the processor, cause the processor to execute the method of any of the above aspects.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing instructions is provided, wherein the instructions, when executed by a processor, cause the processor to execute the method of any of the above aspects.

According to another aspect of the present disclosure, a computer program product is provided, comprising instructions, wherein the instructions, when executed by a processor, cause the processor to execute the method of any of the above aspects.

These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings exemplarily illustrate embodiments and constitute a part of the specification, and together with the textual description of the specification serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.

FIG. 1 is a flow chart of a prediction model training method according to an embodiment of the present disclosure;

FIG. 1A is a schematic diagram of the model structure of a prediction model according to an embodiment of the present disclosure;

FIG. 1B is a flow chart of a prediction model training method according to another embodiment of the present disclosure;

FIG. 1C is a flow chart of a prediction model training method according to another embodiment of the present disclosure;

FIG. 2 is a flow chart of a gene expression data correction method according to an embodiment of the present disclosure;

FIG. 2A is a flow chart of a gene expression data correction method according to another embodiment of the present disclosure;

FIG. 2B is a flow chart of a gene expression data correction method according to another embodiment of the present disclosure;

FIG. 3 is a flow chart of a downstream task execution method according to an embodiment of the present disclosure;

FIG. 4 is a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc., to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of these elements; such terms are only used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of the element, while in some cases, based on the context, they may refer to different instances.

The terms used in describing the various examples in this disclosure are for the purpose of describing particular examples only and are not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, there may be one or more of that element. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based at least in part on". In addition, the terms "and/or" and "at least one of..." cover any one of the listed items and all possible combinations thereof.

In the field of genomics, genes have expression levels in cells, and each cell has its true gene expression levels. By sequencing a cell, the number of reads mapped to each gene at a certain sequencing depth can be obtained. The number of reads mapped to a gene can be called the count of that gene, and this count can be understood as the gene expression level observed at that sequencing depth.

Due to the existence of technical noise, the accuracy of the relative relationships between the measured counts of the genes does not meet requirements. It is therefore desirable to predict, via a prediction model and based on the relative relationships between the gene counts observed at the actual sequencing depth, a corrected relative relationship or an intermediate representation of the corrected relative relationship, and to apply the corrected relative relationship as the true relative relationship, or the intermediate representation of the corrected relative relationship as an intermediate representation of the true relative relationship, to various downstream tasks.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. It will be further understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and/or this specification, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

FIG. 1 is a flow chart of a prediction model training method 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the method 100 includes steps 110 to 130.

In step 110, a plurality of samples are obtained. Each of the plurality of samples includes first gene expression data and masked gene expression data, the first gene expression data includes the respective counts of different genes in a single cell, the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, and the processing for obtaining the masked gene expression data includes masking.

A single cell refers to a single cell of a human body, an animal body, or another organism, and the single cells in the plurality of samples may be cells of different types, such as T cells and B cells. A single cell expresses multiple genes. The first gene expression data is the respective counts of different genes in a single cell measured at a certain sequencing depth. Table 1 shows an example of first gene expression data. Exemplarily, the first gene expression data may include the counts of N (about 20000) genes, and the first gene expression data of each sample consists of counts corresponding to the same genes. Exemplarily, the genes in the first gene expression data include both highly variable genes (highly variable genes are genes that are more likely to be regulated by other genes; here, being regulated by other genes may mean being activated or inhibited by other genes) and non-highly-variable genes. Exemplarily, the first gene expression data includes the genes of the whole genome or the genes of the whole transcriptome.

Table 1 (example of first gene expression data; table content not reproduced here)

Single-cell sequencing results usually contain about 500 non-zero gene counts. Exemplarily, single-cell sequencing results in which the number of genes with non-zero counts exceeds a non-zero count threshold (e.g., 200) can be selected from the collected single-cell sequencing results as the first gene expression data, so as to filter out extremely low-quality or damaged single-cell sequencing results.
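A minimal sketch of the quality filter just described, assuming the collected sequencing results are held in a cells-by-genes count matrix; the threshold of 200 comes from the example above.

```python
import numpy as np

def filter_low_quality_cells(counts: np.ndarray, nonzero_threshold: int = 200) -> np.ndarray:
    """Keep only cells (rows) whose number of genes with non-zero counts
    exceeds the non-zero count threshold, filtering out extremely
    low-quality or damaged single-cell sequencing results."""
    nonzero_genes_per_cell = (counts > 0).sum(axis=1)
    return counts[nonzero_genes_per_cell > nonzero_threshold]
```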

Exemplarily, obtaining the first gene expression data included in the plurality of samples may be obtaining an initial gene expression matrix, where each row or each column of the initial gene expression matrix corresponds to the first gene expression data of one sample. The matrix size may be C*N (C, the number of samples, is greater than or equal to 1), with the cell identifiers (or sample identifiers) of the single cells and the gene identifiers of the genes as the first and second dimensions of the matrix, respectively, and the expression level of each gene in each single cell (hereinafter also referred to as the count of the gene) as the value of the corresponding element in the matrix. For example, the cell identifiers of the single cells corresponding to the first gene expression data in three samples are Cell 1, Cell 2, and Cell 3, and the gene identifiers of six genes in these three single cells are Gene1, Gene2, Gene3, Gene4, Gene5, and Gene6; the initial gene expression matrix is shown in Table 2. Taking the value 2.7 in the last cell of the first row of Table 2 as an example, it indicates that the count of gene Gene6 in cell Cell 1 is 2.7.

Table 2 (example initial gene expression matrix; table content not reproduced here)

In order for the prediction model to learn to capture the relationships between gene expression levels within a single cell, the prediction model is trained in the MAE (Masked Autoencoders) manner, i.e., the prediction model predicts the masked gene counts from the partially masked gene counts (the unmasked counts serve as context for the masked counts). Therefore, when constructing the training data, the first gene expression data obtained by actual sequencing is used as the ground truth, and the masked gene expression data obtained by masking the counts of some genes in the first gene expression data is used as the input of the prediction model.

The masking (hereinafter also referred to as "covering") operation can be implemented by replacing the counts of the masked genes in the first gene expression data with a mask symbol. For example, the masked gene data can be [4, 1, M, M, 0, M, ..., M], where the masked gene counts are replaced with the mask symbol "M" while the unmasked gene counts retain their original values. In the resulting masked gene expression data, the gene IDs corresponding to the masked gene counts are known, but the counts of those genes are unknown.

The number of masked gene counts in the masked gene expression data can be controlled by setting a mask ratio: the larger the mask ratio, the more gene counts are masked.

In the following, the masked counts are also referred to as the first type of elements selected from the initial gene expression matrix, and the unmasked counts (i.e., the elements of the initial gene expression matrix other than the first type of elements) are referred to as the second type of elements selected from the initial gene expression matrix. Exemplarily, a plurality of elements of the initial gene expression matrix can be randomly selected as the first type of elements, and the remaining elements of the gene expression matrix can be used as the second type of elements. Exemplarily, when selecting the counts to be masked, the mask ratio, i.e., the number of masked gene counts, can be determined first. For example, the total number of masked gene counts can be set to be less than 70% of the total number of gene counts. Taking Table 2 as an example, the total number of gene counts in Table 2 is 18; when the total number of masked gene counts is set to be less than 70% of the total number of gene counts, the number of masked gene counts can be determined to be 12. Then, that number of gene counts is selected at random to be masked. For example, 12 counts are randomly selected from the first gene expression data as the masked counts, and the remaining 6 elements are used as the unmasked counts. In selecting the 12 masked counts, the selection should follow a uniform-random principle, for example: generate multiple random numbers, one corresponding to each count, with the random numbers following a uniform distribution with a mean of 0 and a variance of 1; select the 12 random numbers with the largest values as the random numbers corresponding to the masked counts; and determine the counts corresponding to these 12 random numbers as the masked counts.

It should be noted that the mask ratio of each sample may be different; the specific mask ratio and masking manner can be set according to actual needs and algorithm design, and the above is only an example.
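The uniform-random masking just described can be sketched as follows; representing the mask symbol "M" in an object array is an illustrative choice, and the 70% cap follows the example above.

```python
import numpy as np

def mask_expression(first_expr: np.ndarray, mask_ratio: float = 0.7, rng=None):
    """Mask a fraction of the gene counts of one sample by drawing one
    uniform random number per count and masking the positions with the
    largest values, which selects positions uniformly at random."""
    rng = rng if rng is not None else np.random.default_rng()
    num_masked = int(first_expr.size * mask_ratio)  # e.g. 12 of 18 counts
    scores = rng.uniform(size=first_expr.size)      # one random number per count
    masked_idx = np.argsort(-scores)[:num_masked]   # counts with the largest scores
    masked_expr = first_expr.astype(object)         # copy that can hold the mask symbol
    masked_expr[masked_idx] = "M"                   # replace counts with the mask symbol
    return masked_expr, masked_idx
```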

Exemplarily, the first gene expression data may be raw sequencing data, or sequencing data obtained by normalizing the raw sequencing data. It is understandable that if the first gene expression data is normalized, the influence of factors such as sequencing depth on the absolute values of the counts can be eliminated, allowing the model to focus on the relative relationships between gene counts.

In some training tasks, the raw sequencing data is not needed and only normalized sequencing data is used; in this case, the first gene expression data in the sample is normalized, the masked gene expression data obtained by masking it is also normalized, and the predicted values of the gene counts learned by the prediction model are also normalized. In some training tasks, normalized sequencing data is not needed and only the raw sequencing data is used; in this case, the first gene expression data in the sample is unnormalized, the masked gene expression data obtained by masking it is also unnormalized, and the predicted values of the gene counts learned by the prediction model are also unnormalized. In some training tasks, both the raw sequencing data and the normalized sequencing data are needed; for example, the sample includes normalized first gene expression data, the normalized first gene expression data corresponds to first gene expression data (i.e., unnormalized sequencing data), and the masked gene expression data is obtained by processing the first gene expression data.
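The disclosure does not fix a normalization formula in this passage; the sketch below uses total-count scaling followed by log1p, a common choice in single-cell analysis, purely as an assumed example.

```python
import numpy as np

def normalize_counts(raw: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Assumed normalization: scale a cell's counts to a common total and
    apply log1p, removing the effect of sequencing depth on absolute counts
    so the model can focus on relative relationships between gene counts."""
    scaled = raw / raw.sum() * target_sum
    return np.log1p(scaled)
```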

The following exemplarily describes how the first gene expression data is collected, taking transcript sequencing results as an example. The scRNA-seq results of cells are stored in databases such as the Gene Expression Omnibus, the Human Cell Atlas, and EMBL-EBI. scRNA-seq data are manually collected from these databases, and datasets with duplicate IDs are deleted. Some of the sequencing data in the databases are raw sequencing data, and some are normalized. Among the normalized sequencing data, some can be restored to the raw sequencing data, while some cannot. Exemplarily, to facilitate uniform processing of the collected data in subsequent operations, the data can be uniformly collected as raw sequencing data. In fact, most datasets provide raw count matrices (i.e., raw sequencing data). Datasets with standardized expression profiles ("standardized" meaning "normalized"; "expression profile" meaning "sequencing data") are converted back to raw-count form: the smallest non-zero value in the matrix is regarded as the raw count value 1, all remaining non-zero values are divided by this minimum value, and the integer part is taken (this process restores the normalized sequencing data to raw sequencing data). Datasets with TPM or FPKM expression profiles that cannot be converted back to raw counts are kept unchanged. In this way, transcript sequencing results of more than 50 million single cells were collected across multiple organs (such as heart, kidney, and brain) and tissues (i.e., connective tissue, epithelial tissue, muscle tissue, and neural tissue).
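The conversion back to raw-count form described above (treat the smallest non-zero value as count 1, divide all non-zero values by it, and keep the integer part) maps directly to a short function; the array representation is an assumption.

```python
import numpy as np

def restore_raw_counts(normalized: np.ndarray) -> np.ndarray:
    """Convert a standardized expression profile back to approximate raw counts."""
    nonzero = normalized[normalized > 0]
    if nonzero.size == 0:
        return normalized.astype(np.int64)
    min_nonzero = nonzero.min()  # treated as raw count value 1
    return np.floor(normalized / min_nonzero).astype(np.int64)  # keep the integer part
```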

In step 120, for each sample in the plurality of samples, the operations of step 121 and step 122 are performed. It can be understood that step 121 and step 122 can be sub-steps of step 120.

In step 121, the masked gene expression data in the sample is processed using the prediction model to be trained, to obtain the predicted values of the counts of the some genes corresponding to the sample.

In this step, the prediction model to be trained (hereinafter the prediction model is also referred to as the gene regulation relationship model) may refer to a model whose training has not been completed. Exemplarily, the some gene counts refer to the counts of the masked genes.

In one example, the architecture of the prediction model to be trained can be an encoder network-decoder network, or a decoder network. The prediction model may include more than 1 million parameters, for example 3 million, 10 million, or 100 million parameters; the number of model parameters can be increased by increasing the number of layers, the number of hidden vectors per layer, and/or the dimension of the hidden vectors. Exemplarily, the decoder network or the encoder network-decoder network outputs an intermediate representation of the corrected relative relationships between the gene counts (hereinafter also referred to as the "gene regulation relationship representation"). Exemplarily, the prediction model to be trained further includes a network layer (for example, a multilayer perceptron, MLP) located after the decoder network or the encoder network-decoder network, which projects the intermediate representation of the corrected relative relationships between the gene counts into predicted values of the gene counts. It is understandable that all genes may be projected to obtain predicted values for all gene counts, or only the masked genes may be projected to obtain predicted values for the masked genes; the latter reduces the amount of computation. That is, the predicted values include the predicted values of at least some genes (i.e., the masked genes) and may also include the predicted values of all genes; the predicted values mentioned in the embodiments of the present disclosure can all be understood in this way. Exemplarily, the first gene expression data included in a sample is the initial gene expression matrix, and the masked gene expression data obtained by masking the first gene expression data, as well as the predicted values of the gene counts corresponding to the sample, are matrices of the same size as the initial gene expression matrix.
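A minimal PyTorch sketch of one of the architectures named above (an encoder network-decoder network followed by an MLP projection head); the use of Transformer blocks, the layer sizes, and projecting only the masked genes are illustrative assumptions within the options the text allows.

```python
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    """Sketch: encoder network -> decoder network -> MLP head projecting the
    intermediate representation (gene regulation relationship representation)
    to predicted gene counts. All dimensions are placeholders."""
    def __init__(self, dim: int = 768, n_layers: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=n_layers)
        self.decoder = nn.TransformerEncoder(block, num_layers=n_layers)  # decoder-style stack
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        h = self.decoder(self.encoder(x))       # (batch, genes, dim) intermediate representation
        h_masked = h[:, masked_idx, :]          # project only the masked genes (less compute)
        return self.head(h_masked).squeeze(-1)  # (batch, num_masked) predicted counts
```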

In step 122, a loss value corresponding to the sample is determined according to the predicted values and the counts corresponding to the some genes in the first gene expression data.

As mentioned above, the predicted values may include only the predicted values of the masked genes, or may include the predicted values of all genes. When the predicted values of all genes are included, the masked genes can be selected from them, and the loss value corresponding to the sample is determined according to the predicted values of the masked genes and the counts corresponding to the masked genes in the first gene expression data.
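Step 122 can then be sketched as a loss over the masked positions; mean-squared error is an assumed choice, since this passage does not pin down the loss function.

```python
import torch
import torch.nn.functional as F

def masked_loss(pred_masked: torch.Tensor, first_expr: torch.Tensor,
                masked_idx: torch.Tensor) -> torch.Tensor:
    """Compare the predicted values of the masked genes against the counts
    of those genes in the first gene expression data (the ground truth)."""
    target = first_expr[:, masked_idx]      # true counts of the masked genes
    return F.mse_loss(pred_masked, target)  # assumed loss function
```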

In step 130, the prediction model to be trained is updated according to the loss value corresponding to each sample in the plurality of samples.

In one example, in batch training, a batch of training samples is input into the model, the batch loss is calculated, and the parameters are then updated through backpropagation. The parameters may be updated one by one or simultaneously.
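One batch update matching this description, using the hypothetical model and loss sketched above:

```python
def train_step(model, optimizer, batch_inputs, batch_first_expr, masked_idx):
    """One batch update: forward pass, batch loss, backpropagation, parameter update."""
    optimizer.zero_grad()
    pred = model(batch_inputs, masked_idx)                  # predicted masked counts
    loss = masked_loss(pred, batch_first_expr, masked_idx)  # loss vs. ground truth
    loss.backward()                                         # backpropagate
    optimizer.step()                                        # update the prediction model
    return loss.item()
```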

In one example, the preset training end condition may include any of the following: the loss value is less than a loss value threshold, the number of training iterations reaches a preset number, or the training time reaches a preset duration.

Method 100 trains the prediction model using, as samples, the first gene expression data and the masked gene expression data obtained by masking the first gene expression data. Through the training task of predicting the masked counts, the prediction model learns to capture the relationships between gene counts within a single cell, so that in the inference stage it can correct the gene expression data input into the prediction model, minimizing the interference of technical noise in sequencing and thereby improving the accuracy of the output results of downstream tasks.

In a specific embodiment, in step 120, the operation of step 123 is also performed; step 123 may be a sub-step of step 120. In step 123, the feature tensor corresponding to the masked gene expression data is determined. It is understandable that step 123 is performed before step 121; accordingly, after step 123 is introduced, step 121 becomes step 121A: processing the feature tensor corresponding to the masked gene expression data in the sample using the prediction model to be trained.

It is understandable that, before entering the prediction model to be trained, the masked gene expression data needs to be converted into its corresponding feature tensor. Exemplarily, step 123 includes: determining the feature tensor of each count in the masked gene expression data (hereinafter also referred to as the "embedding vector"). Exemplarily, the masked gene expression data includes three types of counts: counts that are unmasked and non-zero; counts that are 0 and unmasked (herein also referred to simply as zero-valued counts); and masked counts (the feature vectors corresponding to these three types of counts are hereinafter also referred to as the second gene features, the zero-value gene features, and the first gene features, respectively). The feature tensor of each count can be determined from the gene embedding vector characterizing the gene ID (herein also referred to as the "gene identification feature") and the count embedding vector characterizing the gene count (herein also referred to as the "gene expression feature"), for example by element-wise addition of the two. All three types of counts have gene IDs (also referred to as gene identifiers), and the corresponding gene embedding vector can be determined from the gene ID (for example, N different gene embedding vectors in one-to-one correspondence with the N gene IDs are set, so that the gene embedding vector corresponding to a gene ID is uniquely determined by that ID). Exemplarily, the count embedding vector is related to the magnitude of the count, and non-zero counts of the same magnitude correspond to the same count embedding vector. For counts that are 0 and unmasked, the count embedding vector can be randomly generated or pre-specified (i.e., a uniform count embedding vector is specified for all counts that are 0). For masked counts, whose values are unknown, the count embedding vector can be pre-specified (i.e., a uniform count embedding vector is specified for all masked counts). For counts that are unmasked and non-zero, the count embedding vector can be determined through a lookup table.

For example, the lookup table can record the correspondence between ranges of count values and count embedding vectors, with counts in different ranges corresponding to different count embedding vectors. For example, when the gene expression level is a decimal, it can be rounded or binned before determining its corresponding count embedding vector: if the expression level rounds to 1, it corresponds to count embedding vector 1; or expression levels of 1-1.99 are assigned to category 1, which corresponds to count embedding vector 1. However, such fixed rounding and binning rules are not flexible enough and limit the model's ability to learn the mapping; for example, the model cannot distinguish expression levels 1.1 and 1.9, since both correspond to expression level 1 after rounding/binning. To solve this problem, exemplarily, the count embedding vector of an unmasked, non-zero count in step 123 can be determined as follows: the unmasked, non-zero count is input into a module for determining embedding vectors (also referred to as the "gene expression feature extraction model"), which outputs the corresponding count embedding vector. The module for determining embedding vectors includes learnable parameters, which can be updated according to the loss value corresponding to each of the plurality of samples and can be trained together with the prediction model. Exemplarily, the operations performed in the module for determining embedding vectors include: randomly initializing a lookup table containing 100 vectors of size 1*768. For an unmasked, non-zero count x, an intermediate vector v1 = leaky ReLU(x*w1) is first obtained through a linear layer and a leaky ReLU layer, where w1 is a learnable parameter of size 1*100. Then, the intermediate vector v1 is processed with another linear layer (with parameter w2) and a scale factor α to obtain the intermediate vector v2 = v1*w2 + α*v1, where w2 has size 100*100 and both w2 and α are learnable parameters. Then, v2 is normalized with SoftMax to obtain a weight vector v3 of size 1*100. Using the entries of v3 as the weights of the 100 vectors in the lookup table, the weighted sum of the 100 lookup-table vectors is computed to obtain the count vector Ex. In this way, through adaptive discretization, the model can learn that expression levels 1.1 and 1.2 are closer to each other while expression levels 1.1 and 1.9 differ more, so that the gene features can better characterize the relationship between different gene expression levels and different count embedding vectors.

In a specific embodiment, step 121A includes: inputting the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data into the first network layer of the prediction model to be trained; and inputting the feature tensors corresponding to the masked counts and the feature tensors corresponding to the zero-value counts in the masked gene expression data into a network layer of the prediction model other than the first network layer. That is, the feature tensors corresponding to the unmasked, non-zero counts are input into the first network layer of the prediction model to be trained to obtain the processing result of the first network layer; this result is then concatenated with the feature tensors corresponding to the masked counts and those corresponding to the zero-value counts, and the concatenation is input into a second network layer of the prediction model, where the second network layer is located downstream of the first network layer. In this way, the first network layer only needs to process the unmasked, non-zero counts, greatly reducing the number of parameters it requires, while feeding the zero-value counts into the second network layer lets the model account for their influence when learning the relative relationships among genes, so that the model can learn potential gene regulatory relationships at the whole-genome level.
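As a hedged sketch of this two-stage routing, the following assumes generic Transformer layers for the two network layers and illustrative sequence lengths; none of the names come from the original disclosure.

```python
import torch
import torch.nn as nn

dim = 64
layer1 = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)  # first network layer
layer2 = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)  # downstream second layer

nonzero_unmasked = torch.randn(1, 120, dim)   # feature tensors of unmasked, non-zero counts
masked_or_zero = torch.randn(1, 880, dim)     # feature tensors of masked and zero-value counts

h = layer1(nonzero_unmasked)                  # the first layer sees only the informative ~10% of genes
full = torch.cat([h, masked_or_zero], dim=1)  # splice in the masked / zero-value tensors
out = layer2(full)                            # the downstream layer sees all genes
print(out.shape)                              # torch.Size([1, 1000, 64])
```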

In a specific embodiment, the prediction model to be trained includes an encoder network and a decoder network. Step 121A includes:

Step 1211, encoding the input of the encoder network using the encoder network to obtain the output of the encoder network; the input of the encoder network includes the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data.

The input of the encoder network is also referred to herein as the input features; the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data are also referred to herein as the first tensor; and the output of the encoder network is also referred to herein as the initial encoded feature tensor or the encoded tensor.

Exemplarily, although the feature tensors corresponding to all three types of counts in the masked gene expression data are obtained in step 123, the feature tensors corresponding to the zero-value, unmasked counts and those corresponding to the masked counts are not input into the encoder network; that is, they are filtered out before the input reaches the encoder network, which significantly reduces the amount of data input into the encoder network.
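A minimal sketch of this filtering step, assuming per-count mask flags and toy shapes; all names are illustrative.

```python
import torch

feat = torch.randn(20000, 768)           # step-123 feature tensors for all counts of one cell
counts = torch.randint(0, 5, (20000,))   # observed counts (in practice most are zero)
is_masked = torch.rand(20000) < 0.15     # which counts were masked

keep = (~is_masked) & (counts != 0)      # unmasked AND non-zero
encoder_input = feat[keep]               # zero-value and masked tensors are filtered out
print(encoder_input.shape)               # far shorter than the full 20000-gene sequence
```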

Exemplarily, in the input of the encoder network, after the feature tensors corresponding to the zero-value, unmasked counts and to the masked counts are filtered out, the feature tensors corresponding to the unmasked, non-zero counts are arranged according to the positions of those counts in the first gene expression data. For example, if the counts corresponding to gene 2, gene 5, etc. of single cell 1 (or sample 1) are filtered out because they are zero-valued or masked, then in the part of the encoder input corresponding to single cell 1, the counts corresponding to the non-zero, unmasked genes such as gene 1, gene 3, gene 4, etc. are arranged in order of gene number, as shown in Table 3.

It is understandable that the number of masked counts can be determined based on the model's learning behavior (if the proportion of masked counts is too high, accurate prediction is impossible; if it is too low, the prediction task is too easy), the sparsity of the first gene expression data (i.e., the proportion of zero-value counts), and the desired model scale. The smaller the desired model scale, the higher the number of masked counts can be set. The sparser the first gene expression data, the more counts are filtered out for being zero-valued, and the number of masked counts can be reduced accordingly. Exemplarily, different mask ratios can be set for the zero-value and non-zero-value counts in the first gene expression data, so as to reach a suitable total number of masked counts while retaining more of the information carried by the non-zero counts.

Table 3 (table contents not reproduced in this text)

It is understandable that, to batch-process multiple samples, the input of the encoder network usually contains feature tensors corresponding to multiple samples, and the encoder network can batch them only when the feature tensors of every sample have the same size. However, the number of unmasked, non-zero counts in the masked gene expression data may differ from sample to sample, so the sequences must be padded to the same size to make the feature tensors of all samples match. For example, if the number of non-zero, unmasked gene counts in the masked gene expression data of each sample fluctuates between 300 and 1000, they can be uniformly padded to 1000 (so that the length of the padded input feature sequence is on the order of 10% of the full gene set); the padded result is shown in Table 3. In Table 3, in the feature sequence corresponding to sample 2 in the encoder input, elements 1 through 999 of the 1000 elements correspond to non-zero, unmasked counts (gene 11, gene 32, gene 42, gene 54, ..., gene 18900, respectively), and element 1000 is a padding element. Each element in Table 3 (both the elements corresponding to counts and the padding elements) corresponds to a feature tensor, and these feature tensors form the encoder input of size C*L*D (where C is the number of samples, i.e., the number of single cells, L is the uniform padded length, and D is the length of the feature tensor of each element). That is, besides the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data, the encoder input also includes the feature tensors corresponding to the padding elements. Exemplarily, a padding vector can be designated as the feature tensor of every padding element (i.e., all padding elements share the same feature tensor), with no need to distinguish a gene embedding vector from a count embedding vector. In the encoder network, the feature tensors of padding elements may not participate in the computation and serve only as placeholders. Exemplarily, the feature tensors designated for the different element types take distinct values; for example, the feature tensor designated for padding elements and the count embedding vector designated for masked counts differ from each other.
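The padding step might look like the following sketch, which assumes a single shared padding vector and a boolean mask marking placeholder positions; shapes and names are illustrative.

```python
import torch

D, L = 768, 1000
pad_vec = torch.zeros(D)                              # one shared feature tensor for all padding elements
cells = [torch.randn(n, D) for n in (300, 999, 640)]  # per-cell unmasked, non-zero count tensors

batch, pad_mask = [], []
for x in cells:
    n = x.shape[0]
    batch.append(torch.cat([x, pad_vec.expand(L - n, D)], dim=0))
    pad_mask.append(torch.cat([torch.zeros(n, dtype=torch.bool),
                               torch.ones(L - n, dtype=torch.bool)]))
batch = torch.stack(batch)        # C*L*D encoder input
pad_mask = torch.stack(pad_mask)  # True marks placeholder positions to exclude from computation
print(batch.shape)                # torch.Size([3, 1000, 768])
```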

Step 1212, obtaining the input of the decoder network according to the output of the encoder network, the feature tensors corresponding to the masked counts in the masked gene expression data, and the feature tensors corresponding to the zero-value counts.

To establish gene regulatory relationships at the whole-genome (or whole-transcriptome) scale, the influence of the zero-value genes must also be considered when recovering the masked counts. Therefore, the information of the zero-value genes needs to be fed into the decoder network.

Exemplarily, according to the position of each count in the first gene expression data, the feature tensors corresponding to the unmasked zero-value counts and those corresponding to the masked counts are merged with the output of the encoder network to obtain the decoder input (understandably, the feature tensors corresponding to the padding elements must first be removed from the encoder output, otherwise the tensors corresponding to different samples in the decoder input would have inconsistent sizes).

Exemplarily, the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data are replaced with the encoded tensors to obtain the input of the decoder network.
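A sketch of assembling the decoder input for one cell under assumed shapes: the encoder outputs are written back to the positions of the unmasked, non-zero counts, while the masked and zero-value positions keep their step-123 tensors.

```python
import torch

N, D = 20000, 768
all_feat = torch.randn(N, D)          # step-123 feature tensors for all counts of one cell
keep = torch.zeros(N, dtype=torch.bool)
keep[:2000] = True                    # pretend these positions were unmasked and non-zero
encoded = torch.randn(int(keep.sum()), D)  # encoder output for those positions (padding removed)

decoder_input = all_feat.clone()
decoder_input[keep] = encoded         # replace unmasked non-zero tensors with encoded tensors
print(decoder_input.shape)            # one cell's slice of the C*N*D decoder input
```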

Exemplarily, the size of the decoder input may be C*N*D, where C is the number of samples, i.e., the number of single cells, N is the number of genes in the first gene expression data, and D is the length of the feature tensor corresponding to each count.

Step 1213, decoding the input of the decoder network using the decoder network.

The input of the decoder network is also referred to herein as the target encoded feature tensor or the encoded input feature vector.

Exemplarily, the output of the decoder network is an intermediate representation of the corrected relative relationships among the gene counts; its size may be C*N*F, where C is the number of samples, i.e., the number of single cells, N is the number of genes in the first gene expression data, and F is the length of the feature tensor corresponding to a single count in the decoder output.

Exemplarily, step 121A further includes step 1214: using a multilayer perceptron to project the output of the decoder network into predicted values of the counts of the partial genes (i.e., the masked genes). The multilayer perceptron in the prediction model processes the decoder output to obtain the predicted value of each gene count for the sample; the size of the prediction is C*N. Exemplarily, the multilayer perceptron can be trained together with the other parts of the prediction model.
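A minimal sketch of this projection head; the hidden width and activation are assumptions, and only the C*N*F to C*N mapping follows the description.

```python
import torch
import torch.nn as nn

C, N, F_dim = 4, 20000, 512
decoder_out = torch.randn(C, N, F_dim)                 # decoder output, size C*N*F
mlp = nn.Sequential(nn.Linear(F_dim, 256), nn.ReLU(), nn.Linear(256, 1))
pred = mlp(decoder_out).squeeze(-1)                    # predicted counts, size C*N
print(pred.shape)                                      # torch.Size([4, 20000])
```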

Exemplarily, the encoder network is used to determine the expression-level information and regulatory-relationship information of the unmasked, non-zero counts, and the decoder network is used to restore that expression-level and regulatory-relationship information into the expression-level information of the masked counts. It is understandable that, in the training stage, the model comprising the encoder network and the decoder network learns and mines the correlations and interactions among gene expression levels; in the inference stage, the output of the encoder network and the output of the decoder network are both intermediate representations of the corrected relative relationships among the gene counts and contain a large amount of information that characterizes cellular gene regulatory relationships. The encoder and decoder outputs can be used for downstream tasks directly, or after preprocessing.

In the model architecture of the prediction model, the computational complexity grows steeply with the length of the input data: as the amount of input data grows, the number of model parameters required grows with it, so both the computational complexity and the amount of computation increase, making the model difficult to train. In practice, the amount of input data can be reduced by selecting only the expression levels of the highly variable genes in a cell as input, but ignoring the non-highly-variable genes causes systematic omissions in the regulatory-relationship map learned by the model.

In the disclosed embodiment, the encoder network focuses on the feature tensors corresponding to the non-zero counts in the masked gene data, while the decoder network receives the feature tensors corresponding to all counts (i.e., the feature tensors corresponding to the unmasked zero-value counts, the feature tensors corresponding to the masked counts, and the encoder-processed feature tensors corresponding to the unmasked, non-zero counts), integrating the information from all positions. After the zero-value counts and the masked counts are filtered out, the length of the encoder input sequence is roughly 10% of the length of the whole genome (or transcriptome). This design greatly reduces the required computing resources, allowing the encoder network to use a stack of ordinary Transformer blocks to capture gene dependencies, which greatly improves training efficiency and training quality. Because the encoder network processes only the unmasked, non-zero counts, the model can focus effectively on the most informative, non-zero expressed genes, while the decoder stage allows zero-value genes to participate in training, so the model can make more comprehensive and accurate predictions based on genes with both zero and non-zero counts. At the same time, since the encoder handles only the unmasked, non-zero counts, the gene-regulatory-relationship model can be smaller in scale, which allows the whole-genome expression profile of a cell to be used as input during training, so that the model to be trained can learn whole-genome (or whole-transcriptome) regulatory relationships. This solves both the difficulty of training regulatory-relationship models and the systematic omissions in the learned regulatory-relationship map.

In a specific embodiment, the encoder network includes M layers of encoding units, the decoder network includes N layers of decoding units, and M is greater than N.

An encoder network often needs more layers to extract high-order features of the data and thereby represent the original data better, whereas the decoder network can complete the decoding task with fewer layers, because it only needs to restore the low-order feature representation to the original data and does not perform feature extraction and abstraction the way the encoder does. Therefore, when setting the numbers of layers, a deeper encoder network and a shallower decoder network can be chosen, which reduces the number of model parameters and improves training efficiency and generalization while preserving the information of the original data. Exemplarily, one may set M+N=8: when M=5, N=3; when M=6, N=2. Exemplarily, the encoder network and the decoder network may adopt structures from existing Transformer models; for instance, the encoder network adopts the encoder structure of an existing Transformer model, and the decoder network may adopt the Performer architecture.

In a specific embodiment, each encoding-unit layer of the encoder network includes at least one multi-head attention unit and at least one feed-forward unit. Each decoding-unit layer of the decoder network includes at least one feed-forward unit, plus at least one linear attention unit or sparse attention unit.

Exemplarily, each encoding-unit layer includes multiple units, for example a multi-head attention unit and a feed-forward unit. Each decoding-unit layer likewise includes multiple units, for example a feed-forward unit and a linear attention unit, where the linear attention unit can also be replaced by a sparse attention unit.

The encoder input is fed into the multi-head attention unit of the encoder network, and the encoder output is then obtained by applying, in sequence, a residual connection and layer normalization, the feed-forward unit, and another residual connection and layer normalization.
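One encoding-unit layer as just described (multi-head attention, residual connection plus layer normalization, feed-forward, then another residual connection plus layer normalization) might be sketched as follows; the dimensions and the post-norm arrangement are assumptions consistent with the text.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative encoding unit: MHA -> add & norm -> FFN -> add & norm."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, key_padding_mask=None):
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + a)            # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))   # feed-forward, then residual + normalization
        return x

layer = EncoderLayer()
print(layer(torch.randn(2, 1000, 768)).shape)  # torch.Size([2, 1000, 768])
```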

Using multi-head attention units and feed-forward units as the per-layer encoding units of the encoder network helps the model better capture the dependencies and semantic information among different genes, further improving performance. The first gene expression data is usually high-dimensional and sparse, with high noise and complexity, so the encoder's multi-head attention units help the model better mine the associations and interactions among genes.

In the decoder network, using feed-forward units together with linear attention units or sparse attention units helps the model make gene-by-gene predictions on the first gene expression data. Linear attention units help the model better control the allocation of attention weights, improving performance and stability; sparse attention units further reduce the amount of computation, improving efficiency. The decoder thus uses only lightweight units whose parameter counts and computational complexity are lower than the encoder's; compared with multi-head attention, linear or sparse attention further reduces the time and space complexity of the algorithm, allowing the model's parameter count to grow to the order of hundreds of millions.

In general, an encoder-decoder model using multi-head attention units, feed-forward units, and attention mechanisms can achieve better performance in predicting gene expression data. Such a model can automatically learn the complex patterns and associations in the data and make accurate predictions and analyses of the first gene expression data.

In a specific embodiment, the masked gene expression data is normalized, and the first gene expression data included in each of the multiple samples is normalized first gene expression data. Step 121 includes: processing the masked gene expression data in the sample with the prediction model to be trained, to obtain predicted normalized values of the gene counts corresponding to the sample. Step 122 includes step 1220: determining the loss value corresponding to the sample according to the predicted values and the counts corresponding to the partial genes in the normalized first gene expression data.

In this embodiment, the method 100 includes:

Step 110, obtaining multiple samples, where each of the multiple samples includes normalized first gene expression data and masked gene expression data; the normalized first gene expression data includes the respective counts of different genes in a single cell, and the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, the processing including masking.

It is understandable that both the (un-normalized) first gene expression data and the normalized first gene expression data include the actually measured counts of the different genes in a single cell; the normalized first gene expression data has simply been normalized.

It is understandable that, exemplarily, the masked gene expression data may be obtained by processing the counts of some genes in the normalized first gene expression data, the processing including masking; or, exemplarily, it may be obtained by processing the counts of some genes in the first gene expression data (the un-normalized raw sequencing data), the processing including normalization and masking. In either case, the masked gene expression data has been normalized.

Step 120, for each of the multiple samples, executing step 121 and step 122:

Step 121, processing the masked gene expression data in the sample with the prediction model to be trained, to obtain predicted normalized values of the counts of the partial genes (i.e., the masked genes) corresponding to the sample;

Step 122, determining the loss value corresponding to the sample according to the predicted values and the counts corresponding to the partial genes in the normalized first gene expression data (i.e., step 1220);

Step 130, updating the prediction model to be trained according to the loss value corresponding to each of the multiple samples.

Using the normalized first gene expression data as the ground-truth sequencing data in the sample eliminates the influence of factors such as sequencing depth on the absolute values of the counts and makes the model focus on the relative relationships among gene counts; the values output by the prediction model are then also normalized values.

In a specific embodiment, the masked gene expression data is normalized; each of the multiple samples includes normalized first gene expression data and masked gene expression data; the first gene expression data (i.e., the raw, pre-normalization gene expression data corresponding to the normalized first gene expression data) includes the respective counts of different genes in a single cell measured at a first sequencing depth; the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, the processing including downsampling, normalization, and masking; the downsampled first gene expression data simulates the respective counts of the different genes in the single cell as if measured at a second sequencing depth lower than the first sequencing depth; each of the multiple samples further includes auxiliary information, the auxiliary information including a first total count and a second total count, the first total count being the sum of the counts of the genes in the first gene expression data and the second total count being the sum of the counts of the genes in the downsampled first gene expression data. Step 121 includes step 121B: processing the masked gene expression data and the auxiliary information in the sample with the prediction model to be trained, to obtain predicted normalized values, at the first sequencing depth, of the counts of the partial genes (i.e., the masked genes) corresponding to the sample.

In this embodiment, the method 100 includes:

Step 110, obtaining multiple samples, where each of the multiple samples includes normalized first gene expression data, masked gene expression data, and auxiliary information; the first gene expression data includes the respective counts of different genes in a single cell measured at a first sequencing depth; the masked gene expression data is obtained by processing the counts of some genes in the first gene expression data, the processing including downsampling, normalization, and masking, so that the masked gene expression data is normalized; the downsampled first gene expression data simulates the respective counts of the different genes in the single cell as if measured at a second sequencing depth lower than the first sequencing depth; the auxiliary information includes a first total count and a second total count, the first total count being the sum of the counts of the genes in the first gene expression data and the second total count being the sum of the counts of the genes in the downsampled first gene expression data.

It is understandable that, in this embodiment, the first gene expression data (the raw sequencing data) is needed when computing the first total count and when downsampling, while the normalized first gene expression data is needed as the ground truth corresponding to the masked counts; that is, the training task of this embodiment uses both the raw sequencing data and the normalized sequencing data. As described above, if raw sequencing data is obtained from a database, it can be normalized to obtain normalized sequencing data; if normalized sequencing data is obtained from a database, it can be restored (the inverse of the normalization) to obtain the raw sequencing data.

It is understandable that the first sequencing depth refers to the higher sequencing depth actually used when the first gene expression data was measured, and the second sequencing depth refers to the lower sequencing depth simulated by the downsampled first gene expression data. The first and second sequencing depths indicate, respectively, the sequencing depths corresponding to the first gene expression data and the masked expression data in a sample; they do not require the first sequencing depth (or the second) to be the same across all samples.

The relative relationships among the gene counts measured under different sequencing methods and sequencing depths may differ. It is understandable that the higher the sequencing depth, the more likely the relative relationships among the observed gene counts are to approach the relative relationships among the genes' true expression levels in the cell. For example, when the sequencing depth is too low, the observed counts of some genes are zero, whereas when the sequencing depth is high enough, those observed counts are no longer zero. However, due to various constraints, actual sequencing sometimes uses only a low sequencing depth rather than the desired high depth, and, because of technical noise, the accuracy of the measured relative relationships among gene counts may not meet requirements. The prediction model is therefore required to correctly predict the gene counts at a high sequencing depth from the gene counts at a low sequencing depth. Since a single cell is destroyed after being sequenced once, the same cell cannot be sequenced repeatedly to obtain gene expression data at both high and low sequencing depths, so low-depth/high-depth training pairs for each cell cannot be constructed by actual sequencing. Instead, an algorithm must simulate the cell's gene expression data at a low sequencing depth from its gene expression data at a high sequencing depth, i.e., the downsampled first gene expression data, thereby constructing for each cell a pair consisting of the downsampled first gene expression data (equivalent to gene expression data at a low sequencing depth) and the first gene expression data (i.e., gene expression data at a high sequencing depth) for training. In actual training, both can be normalized, the low-depth gene expression data can then be masked, and the resulting masked gene expression data together with the normalized first gene expression data is used as the training sample. That is, the normalized first gene expression data serves as the ground truth for the gene counts at the high sequencing depth, and the model is trained in the MAE (masked autoencoder) style, so that the prediction model learns to capture the relationships between the gene expression of similar cells at different sequencing depths.

To make it easier for the prediction model to predict high-depth gene counts from low-depth gene counts, thereby lowering the learning cost and accelerating convergence, the model can be told explicitly what the low sequencing depth of the current input and the high sequencing depth to be predicted are. The sample is therefore made to include auxiliary information reflecting the low and high sequencing depths, and the prediction model processes the masked gene expression data together with this auxiliary information.

Although the first sequencing depth is known, since it can be obtained through actual sequencing, the second sequencing depth is unknown. The higher the sequencing depth, the larger the sum of the gene counts in the sequencing result, so a sequencing depth can be characterized by the sum of the gene counts measured at that depth, allowing the first and second sequencing depths to be characterized by the same measure. The auxiliary information can therefore include a first total count T and a second total count S: the first total count is the sum of the gene counts in the first gene expression data and characterizes the high sequencing depth, and the second total count is the sum of the gene counts in the downsampled first gene expression data and characterizes the low sequencing depth. When the first gene expression data is obtained by transcript sequencing (for example, RNA-Seq), the first total count may also be called the first total transcript count, i.e., the sum of the transcript counts of the genes in the first gene expression data.

In one example, the masked gene expression data can be obtained from the first gene expression data as follows. First, downsample the first gene expression data. Second, normalize the downsampled first gene expression data. Third, mask the counts of some genes in the normalized data to obtain the masked gene expression data.
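A toy end-to-end sketch of these three steps, assuming binomial thinning for the downsampling, a log(TPM+1)-style normalization, and random masking with a sentinel value; every name and constant here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(x, ratio):
    # step 1: simulate a lower sequencing depth by binomial thinning
    return rng.binomial(x.astype(int), ratio)

def normalize(x):
    # step 2: log(TPM+1)-style normalization
    return np.log(x / x.sum() * 10000 + 1)

def mask(x, frac=0.15, sentinel=-1.0):
    # step 3: mask a fraction of the counts with a sentinel value
    m = rng.random(x.size) < frac
    out = x.copy()
    out[m] = sentinel
    return out, m

x1 = rng.poisson(2.0, size=20000).astype(float)  # toy first gene expression data
masked, mask_positions = mask(normalize(downsample(x1, 0.3)))
```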

The downsampling method is described below.

It is understandable that if the first gene expression data was measured at a sufficiently high sequencing depth, downsampling it yields fairly credible gene counts at a low sequencing depth; but if the first gene expression data was itself measured at a low sequencing depth, the low-depth gene counts obtained by downsampling it are less credible. Therefore, in one example, the first gene expression data is downsampled only when the sum of its gene counts is greater than a preset threshold (e.g., 1000) (as noted above, a sequencing depth can be characterized by the sum of the gene counts measured at that depth, so first gene expression data whose count sum exceeds the threshold is considered to have been measured at a sufficiently high depth). When the sum of the gene counts in the first gene expression data does not exceed the preset threshold, the normalized first gene expression data can be masked directly to obtain the masked gene expression data, without downsampling. From samples whose masked gene expression data was obtained without downsampling, the model learns to capture the relationships among genes within a single cell, though such samples do not teach the associations between gene expression levels at different sequencing depths.

In one example, for first gene expression data whose gene-count sum exceeds the preset threshold, a downsampling probability can be set, i.e., the first gene expression data is downsampled with a certain probability. For example, such data may be downsampled with a probability of 50% and left un-downsampled with a probability of 50%.

In one example, for first gene expression data that has been selected for downsampling, a sampling ratio (i.e., a sampling factor) can be set. The sampling ratio indicates the degree to which the first gene expression data is downsampled; setting multiple different sampling ratios yields, from the same first gene expression data, versions downsampled to different degrees, so that gene counts at several different low sequencing depths can be simulated from the gene counts at the same high sequencing depth. This expands the training data, achieving data augmentation. Exemplarily, the sampling ratio of the first gene expression data can be determined by statistical sampling algorithms such as the Poisson distribution, the Beta-Binomial distribution, or the Binomial distribution.
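A sketch of ratio-based downsampling under assumptions: the sampling ratio is drawn from a Beta distribution and applied by binomial thinning; the 50% application probability, the 1000 threshold, and the Beta(2, 2) prior are illustrative choices, not from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(1)

def maybe_downsample(x, threshold=1000, p_apply=0.5):
    if x.sum() <= threshold:        # too shallow: skip downsampling entirely
        return x
    if rng.random() >= p_apply:     # apply downsampling only with probability p_apply
        return x
    ratio = rng.beta(2.0, 2.0)      # sampling ratio drawn from a Beta distribution
    return rng.binomial(x, ratio)   # each transcript kept independently with prob. ratio

x1 = rng.poisson(3.0, size=20000)   # toy raw counts at the first sequencing depth
x2 = maybe_downsample(x1)
print(x1.sum(), x2.sum())           # second total count S <= first total count T
```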

The normalization method is described below.

This normalization method is applicable both to the process of obtaining the normalized first gene expression data from the first gene expression data and to the process of normalizing the downsampled first gene expression data.

In one example, the sequencing data to be normalized is normalized by dividing each count in it (e.g., each count in the first gene expression data) by the sum of all counts in it (e.g., the sum of the gene counts in the first gene expression data).

In one example, a normalization method such as TPM or RPKM can be used to normalize the first gene expression data or the downsampled first gene expression data.

In one example, the log(TPM+1) normalization method can also be used to normalize the first gene expression data and the downsampled first gene expression data. For example, if the first gene expression data is a vector X1, the normalized first gene expression data is log[X1/sum(X1)*10000+1]; if the downsampled first gene expression data is a vector X2, the downsampled, normalized first gene expression data is log[X2/sum(X2)*10000+1]. The log(TPM+1) method applies a logarithmic transformation on top of TPM. The logarithmic transformation reduces the skewness of the raw data and improves comparability between variables: a constant 1 is added to each TPM value and the logarithm is then taken, giving a real number that represents the log-transformed TPM value. In short, log(TPM+1) is a log-transformed variant of the standard TPM normalization method that removes skewness and improves comparability between variables.
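The formulas above can be written out directly; this sketch simply mirrors log[X/sum(X)*10000+1] on toy vectors.

```python
import numpy as np

def log_tpm1(x):
    # log(TPM+1)-style normalization: log[x / sum(x) * 10000 + 1]
    return np.log(x / x.sum() * 10000 + 1)

x1 = np.array([0., 3., 7., 0., 12.])  # toy first gene expression data X1
x2 = np.array([0., 1., 2., 0., 4.])   # toy downsampled counterpart X2
print(log_tpm1(x1))                   # normalized first gene expression data
print(log_tpm1(x2))                   # downsampled, normalized data
```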

The masking method has been described above and is not repeated here.

Step 120, for each of the multiple samples, executing step 121 and step 122:

Step 121, processing the masked gene expression data and the auxiliary information in the sample with the prediction model to be trained, to obtain predicted normalized values, at the first sequencing depth, of the counts of the partial genes (i.e., the masked genes) corresponding to the sample (i.e., step 121B).

Exemplarily, in step 121B, what the prediction model to be trained processes may be the feature tensors corresponding to the masked gene expression data in the sample and the feature tensors corresponding to the auxiliary information.

Step 122, determining the loss value corresponding to the sample according to the predicted values and the counts corresponding to the partial genes in the normalized first gene expression data (i.e., step 1220);

Step 130, updating the prediction model to be trained according to the loss value corresponding to each of the multiple samples.

In a specific embodiment, the prediction model to be trained includes an encoder network and a decoder network, and step 121B includes:

Step A: encoding the input of the encoder network using the encoder network to obtain the output of the encoder network; the input of the encoder network includes the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data, and further includes the feature tensors corresponding to the auxiliary information.

It is understandable that, before step A, the method 100 further includes step 123 for determining the feature tensors corresponding to the masked gene expression data and step 125 for determining the feature tensors corresponding to the auxiliary information.

Exemplarily, the feature tensors corresponding to the auxiliary information and the feature tensors corresponding to the unmasked, non-zero counts in the masked gene expression data together serve as the encoder input, and both are fed into the prediction model through its first network layer. For example, the feature tensors corresponding to the auxiliary information can be concatenated after the feature tensors corresponding to the unmasked, non-zero counts, in which case the encoder input is [feature tensor of gene 1, feature tensor of gene 3, feature tensor of gene 4, ..., feature tensors of the padding elements, feature tensor of the first total count T, feature tensor of the second total count S].
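A sketch of this concatenation with illustrative shapes: the T and S embeddings are appended after the (padded) gene-count tensors.

```python
import torch

D = 768
gene_tokens = torch.randn(1, 1000, D)  # padded feature tensors of unmasked, non-zero counts
t_embed = torch.randn(1, 1, D)         # feature tensor of the first total count T
s_embed = torch.randn(1, 1, D)         # feature tensor of the second total count S

encoder_input = torch.cat([gene_tokens, t_embed, s_embed], dim=1)
print(encoder_input.shape)             # torch.Size([1, 1002, 768]): T and S appended at the end
```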

Step B: obtaining the input of the decoder network according to the output of the encoder network, the feature tensors corresponding to the masked counts in the masked gene expression data, and the feature tensors corresponding to the zero-value counts in the masked gene expression data; and

Step C: decoding the input of the decoder network using the decoder network.

Similar to step 121A, step 121B may further include step D: using a multilayer perceptron to project the output of the decoder network into predicted normalized values of the gene counts at the first sequencing depth.

For the description of steps B through D, see the description of steps 1212 through 1214, which is not repeated here.

A specific example corresponding to this embodiment is shown in FIG. 1A. In this example, the prediction model includes an encoder network and a decoder network. The first gene expression data is downsampled by Bayesian downsampling to obtain the downsampled first gene expression data; the count sum of the first gene expression data is computed as the first total count T, and the count sum of the downsampled first gene expression data is computed as the second total count S. The downsampled first gene expression data is normalized (not shown in FIG. 1A) and masked to obtain the masked gene expression data. The feature tensors corresponding to the zero-value counts and to the masked counts are removed from the feature tensors of the masked gene expression data, and the remaining feature tensors are concatenated with the feature tensors corresponding to T and S (FIG. 1A does not show that feature tensors are used). The concatenated result is input into the encoder network (Encoder in FIG. 1A) to obtain the encoder output, which is then combined with the feature tensors corresponding to the masked counts and to the zero-value counts to obtain the decoder input. The decoder network processes this input to obtain the decoder output, which enters the MLP to obtain the predicted values of the gene counts at the first sequencing depth; the loss value (reconstruction loss in FIG. 1A) is computed between these predictions and the normalized first gene expression data (FIG. 1A does not show the distinction between normalized and un-normalized first gene expression data). Pooling the encoder output yields the cell representation (cellular embedding in FIG. 1A).

In a specific embodiment, the prediction model to be trained includes an input layer, an output layer, and multiple intermediate layers between the input layer and the output layer; processing the masked gene expression data and the auxiliary information in the sample with the prediction model to be trained includes:

Step 1: inputting the feature tensors corresponding to the masked gene expression data in the sample into the input layer and a first predetermined set of the intermediate layers to obtain an intermediate feature tensor, where the number of first predetermined layers is greater than or equal to 0;

It is understandable that, before step 1, the method 100 further includes step 123 for determining the feature tensors corresponding to the masked gene expression data.

Step 2: concatenating the feature tensors corresponding to the auxiliary information with the intermediate feature tensor to obtain a concatenated intermediate feature tensor;

It is understandable that, before step 2, the method 100 further includes step 125 for determining the feature tensors corresponding to the auxiliary information.

Step 3: inputting the concatenated intermediate feature tensor into a second predetermined set of the intermediate layers and the output layer, where the second predetermined layers are different from the first predetermined layers.

Similar to step 121A, step 121B may further include step 4: using a multilayer perceptron to project the output of the output layer into predicted normalized values of the gene counts at the first sequencing depth.

In this embodiment, the masked gene expression data can be input through the first network layer of the prediction model while the auxiliary information is input through a network layer other than the first. For example, when the prediction model includes a decoder network, the feature tensors corresponding to the masked gene expression data are input at the first layer of the decoder network and the feature tensors corresponding to the auxiliary information are input before the last several layers of the decoder network. As another example, when the prediction model includes an encoder network and a decoder network, the feature tensors corresponding to the masked gene expression data are input at the first layer of the encoder network, and the feature tensors corresponding to the auxiliary information are input after some layer of the encoder network, or before the last several layers of the decoder network. Exemplarily, the auxiliary-information feature tensors can be input by concatenating them with the intermediate feature tensor produced by the preceding layers (the first network layer plus the first predetermined intermediate layers) from the masked-gene-expression feature tensors (for example, appending the auxiliary tensors at the tail of the intermediate feature tensor), and feeding the concatenated intermediate feature tensor into the subsequent layers (the second predetermined intermediate layers plus the output layer). In this way, the model can use the auxiliary information while processing the input data to extract more meaningful features or make more accurate predictions.
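A sketch of this mid-network injection, assuming generic Transformer layers for the predetermined layers; the layer counts and shapes are illustrative.

```python
import torch
import torch.nn as nn

D = 64
early = nn.ModuleList([nn.TransformerEncoderLayer(D, 4, batch_first=True) for _ in range(2)])
late = nn.ModuleList([nn.TransformerEncoderLayer(D, 4, batch_first=True) for _ in range(2)])

x = torch.randn(1, 200, D)      # feature tensors of the masked gene expression data
aux = torch.randn(1, 2, D)      # feature tensors of the auxiliary information (T and S)

for layer in early:             # input layer + first predetermined layers
    x = layer(x)
x = torch.cat([x, aux], dim=1)  # splice auxiliary tensors onto the tail of the intermediate tensor
for layer in late:              # second predetermined layers + output layer
    x = layer(x)
print(x.shape)                  # torch.Size([1, 202, 64])
```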

In a specific embodiment, step 125 specifically includes:

Step 1251, inputting the auxiliary information into the module for determining embedding vectors to obtain the count embedding vectors corresponding to the auxiliary information, where the module for determining embedding vectors includes learnable parameters;

Exemplarily, the first total count T and the second total count S are treated as ordinary counts, and the module for determining embedding vectors is used to determine the count embedding vector corresponding to the first total count and the count embedding vector corresponding to the second total count.

Step 1252, obtaining the gene embedding vectors corresponding to the auxiliary information.

Exemplarily, the first total count T and the second total count S are assigned tokens that differ from each other and from all other gene IDs, and these tokens are used to determine their corresponding gene embedding vectors.

Step 1253, obtaining the feature tensors corresponding to the auxiliary information according to the count embedding vectors corresponding to the auxiliary information and the gene embedding vectors corresponding to the auxiliary information.

Exemplarily, the count embedding vector corresponding to the auxiliary information and the gene embedding vector corresponding to the auxiliary information are added element-wise to obtain the feature tensor corresponding to the auxiliary information.

Exemplarily, the module for determining embedding vectors in step 125 and the module for determining embedding vectors in step 123 may be the same module, with the same parameters.

In a specific embodiment of the present disclosure, as shown in FIG. 1B, an embodiment of the present disclosure further provides a gene-regulatory-relationship model generation method 100B, which includes steps 110B through 150B.

Step 110B: obtain an initial gene expression matrix corresponding to target cells, the initial gene expression matrix including the expression levels of both highly variable genes and non-highly-variable genes in the cells.

In this step, a target cell may be a cell in a human or animal body, and a target cell contains multiple genes. The target cells may include cells of several different types, for example T cells and B cells. The initial gene expression matrix is a matrix characterizing the expression levels of different genes in different cells; it includes the expression levels of both the highly variable genes and the non-highly-variable genes in a single cell, that is, the whole-genome expression of a single cell. A highly variable gene is a gene whose expression level varies greatly across cells: for example, a gene expressed at a very high level in some cells and at a very low level in others is a highly variable gene. Highly variable genes are the genes most likely to be regulated by other genes, where being regulated may mean being activated or inhibited by them. The fraction of non-zero gene expression values in a cell is on the order of 10%, so the initial gene expression matrix is very sparse: the more zero-valued elements the matrix contains, the sparser it is, and the fewer it contains, the denser it is.

Exemplarily, step 110B includes: taking the cell identifiers of the target cells and the gene identifiers of the genes in the target cells as the first and second dimensions of a matrix, respectively, and taking the expression value of each gene as the value of the corresponding matrix element, thereby constructing the initial gene expression matrix; the expression value of each gene is obtained by gene sequencing.
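
As an illustration only, with invented toy values, the construction of step 110B might look as follows: cell identifiers index the rows, gene identifiers index the columns, and each element holds the sequenced expression value, leaving a very sparse matrix.

```python
import numpy as np

cells = ["cell_0", "cell_1", "cell_2"]                  # first dimension: cell identifiers
genes = [f"gene_{i}" for i in range(6)]                 # second dimension: gene identifiers
expr = np.zeros((len(cells), len(genes)))               # initial gene expression matrix
expr[0, 1], expr[1, 4], expr[2, 0] = 3.0, 1.2, 7.0      # values obtained by sequencing
print(f"non-zero fraction: {np.count_nonzero(expr) / expr.size:.0%}")
```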

Step 120B: randomly select multiple elements of the initial gene expression matrix as first-category elements, and take the remaining elements of the matrix as second-category elements.

In this step, the first-category elements are the elements of the initial gene expression matrix that are filtered out in the first pass; their purpose is to reduce, at the whole-genome level, the amount of data input to the gene regulation relationship model to be trained. The second-category elements are the elements of the initial gene expression matrix that remain unfiltered after the first filtering. Exemplarily, the first-category elements are selected in the same way as the masked counts described above.
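
A minimal sketch of step 120B follows (the 15% selection ratio is an assumption for illustration; the disclosure only requires random selection):

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.poisson(0.3, size=(3, 6)).astype(float)    # stand-in initial gene expression matrix
select_ratio = 0.15                                   # assumed ratio, not fixed by the source
first_class = rng.random(expr.shape) < select_ratio   # True -> first-category element
second_class = ~first_class                           # everything else is second-category
```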

Step 130B: determine input features according to the positions of the elements in the initial gene expression matrix. The input features include the second gene features corresponding to the non-zero second-category elements, but include neither the first gene features corresponding to the first-category elements nor the zero-value gene features corresponding to the zero-valued second-category elements; a second gene feature is determined from a gene's expression level and the gene identifier of that gene.

Exemplarily, step 130B includes: generating second gene expression features from the values of the non-zero second-category elements; generating second gene identifier features from the gene identifiers corresponding to the non-zero second-category elements; and determining the second gene features from the second gene expression features and the second gene identifier features (specifically, the values of the non-zero second-category elements may be input into a gene expression feature extraction model to generate the second gene expression features, the parameters of which are adjusted according to the loss value of step 150B described below). Method 100B further includes: determining the first gene features from the first gene expression features and the first gene identifier features corresponding to the first-category elements, where all first-category elements share the same first gene expression feature, which differs from any second gene expression feature; and determining the zero-value gene features from the third gene expression features and the third gene identifier features corresponding to the zero-valued second-category elements.

In this step, the input features include the second gene features corresponding to the non-zero second-category elements and exclude both the first gene features corresponding to the first-category elements and the zero-value gene features corresponding to the zero-valued second-category elements; the second gene features are determined from the genes' expression levels and their gene identifiers.

In this step, the input features are the gene features of the elements that survive both the first and the second filtering, where the second filtering removes the zero-valued second-category elements. A second gene feature is determined from a gene expression feature and a gene identifier feature; elements with different expression levels and/or different gene identifiers have different second gene features. The first-category elements, the zero-valued second-category elements, and the non-zero second-category elements each have gene features, called the first gene features, the zero-value gene features, and the second gene features, respectively.

Exemplarily, in this step the input features may be a C×L×D tensor, where C is the number of cells, L is the expected maximum number of genes with non-zero expression in a single cell, and D is the length of a gene feature. If a cell has fewer than L unmasked non-zero genes, the tensor is padded to L. Padding elements have padding gene features; for example, a single shared padding gene feature is assigned to all padding elements. Alternatively, the input features may be a tensor of size (number of non-zero second-category elements)×D, in which the non-zero second-category elements are arranged by row and column number (for example, first in ascending row order, then in ascending column order).
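
A minimal sketch of the C×L×D layout follows (toy sizes; the shared zero-valued padding feature is an assumption): each cell's unmasked non-zero gene features are padded along the gene axis to length L.

```python
import torch

C, L, D = 2, 4, 8                                    # toy sizes for illustration
pad_feat = torch.zeros(D)                            # one shared padding gene feature (assumption)
per_cell = [torch.randn(3, D), torch.randn(2, D)]    # unmasked non-zero gene features per cell

rows = []
for feats in per_cell:
    n_pad = L - feats.shape[0]                       # fill up to L if fewer than L genes
    rows.append(torch.cat([feats, pad_feat.expand(n_pad, D)], dim=0))
inputs = torch.stack(rows)                           # input features, shape (C, L, D)
```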

Exemplarily, a gene feature may be determined from a gene expression feature and a gene identifier feature. The gene expression feature depends on the element's expression value: elements with the same expression value have the same gene expression feature. The gene identifier feature depends on the element's gene identifier: elements with the same gene identifier have the same gene identifier feature. Both features may be determined by looking up mapping tables or in other ways. For example, a mapping from gene identifiers to gene identifier features can be preset (gene identifier 1 maps to gene identifier feature 1, and so on), so that the gene identifier feature can be determined from the gene identifier. Likewise, a mapping from expression values to gene expression features can be preset, so that the gene expression feature can be determined from the expression value. Understandably, expression values are usually decimals, so they may be rounded or binned before the corresponding feature is determined: for example, an expression value that rounds to 1 maps to gene expression feature 1, or expression values in the range 1 to 1.99 are assigned to category 1 and map to gene expression feature 1. The gene feature of a padding element may be a designated value distinct from the gene features of all other elements.

Understandably, the expression values of the first-category elements are unknown, so a single shared first gene expression feature can be assigned to them. Their positions are known, however, so the gene identifier feature of each first-category element can be determined. The zero-valued second-category elements share one gene expression feature, and their gene identifier features are determined from the gene identifiers at their positions.

Exemplarily, determining a second gene feature from the second gene expression feature and the second gene identifier feature may consist of adding the two element-wise to obtain the second gene feature. The first gene features and the zero-value gene features may be determined in the same way.

Exemplarily, the first gene expression feature may differ from every second gene expression feature. For example, if the second gene expression features for expression categories 1 to 10 are gene expression features 1 to 10, the first gene expression feature may be gene expression feature 11. The third gene expression feature, corresponding to the zero-valued second-category elements, may be gene expression feature 0.

Exemplarily, generating the second gene expression features from the values of the non-zero second-category elements includes: inputting those values into a gene expression feature extraction model to generate the second gene expression features. As mentioned above, a mapping from expression values to gene expression features can be preset and the feature determined from the expression value by lookup. Such a preset mapping, however, is inflexible and limits the model's ability to learn the mapping: for example, the model cannot distinguish expression values 1.1 and 1.9, since both correspond to expression value 1 after rounding or binning. A gene expression feature extraction model can therefore be provided, into which the values of the non-zero second-category elements are input to generate the second gene expression features. This model can be updated together with the gene regulation relationship model to be trained: while the training termination condition is not met, the parameters of both the gene regulation relationship model and the gene expression feature extraction model are adjusted according to the loss value determined from the differences between the value of each element of the predicted gene expression matrix and the value of the element at the corresponding position of the initial gene expression matrix. In this way, the model can learn that expression values 1.1 and 1.2 are closer than 1.1 and 1.9, so that the gene features better characterize the relationships between different expression levels and different genes.
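
The contrast can be sketched as follows (the bin table and the small MLP are assumed concrete forms of the fixed mapping and of the gene expression feature extraction model, respectively): the fixed bin lookup collapses 1.1 and 1.9 into the same feature, while the learnable extractor keeps them apart and can be trained jointly with the regulation model.

```python
import torch
import torch.nn as nn

D = 8
bin_table = nn.Embedding(12, D)                          # fixed lookup over bins 0-11
def binned_feature(x: torch.Tensor) -> torch.Tensor:
    return bin_table(x.clamp(0, 11).long())              # 1.1 and 1.9 both truncate to bin 1

extractor = nn.Sequential(nn.Linear(1, D), nn.GELU(), nn.Linear(D, D))  # assumed MLP form
def learned_feature(x: torch.Tensor) -> torch.Tensor:
    return extractor(x.unsqueeze(-1))                    # continuous input: 1.1 and 1.9 differ

x = torch.tensor([1.1, 1.2, 1.9])
print(binned_feature(x)[0].allclose(binned_feature(x)[2]))  # True: binning collapses them
```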

Step 140B: input the input features into the gene regulation relationship model to be trained to obtain a gene regulation relationship representation, and transform the representation to generate a predicted gene expression matrix whose elements correspond to the elements of the initial gene expression matrix.

Exemplarily, the gene regulation relationship model to be trained includes an encoder and a decoder, and step 140B includes: inputting the input features into the encoder to obtain an initial encoded feature tensor; merging, according to the positions of the elements in the initial gene expression matrix, the first gene features of the first-category elements and the zero-value gene features of the zero-valued second-category elements with the initial encoded feature tensor to obtain a target encoded feature tensor; and inputting the target encoded feature tensor into the decoder to obtain the gene regulation relationship representation. Exemplarily, the encoder includes M layers of encoding units and the decoder includes N layers of decoding units, with M greater than N. Exemplarily, each encoding unit of the encoder includes a multi-head attention unit and a feed-forward unit; each decoding unit of the decoder includes a feed-forward unit together with a linear attention unit or a sparse attention unit.

Exemplarily, the gene regulation relationship representation is an expression tensor of the regulatory relationships among the genes. It is a three-dimensional tensor that encodes not only the regulatory relationships among the genes but also the expression level information of each gene in each cell.

Exemplarily, the gene regulation relationship representation is a three-dimensional tensor of size C×G×D, where C is the number of cells (C=3 in the example above), G is the number of genes (G=6 in the example above), and D is the feature dimension, typically a power of two such as 128.

Exemplarily, after the gene regulation relationship representation is obtained, it is transformed by one network layer into the predicted gene expression matrix, a prediction matrix that predicts the values of the first-category elements of the initial gene expression matrix.

Exemplarily, the elements of the predicted gene expression matrix correspond to those of the initial gene expression matrix: if the initial gene expression matrix is A×B, the predicted gene expression matrix is also A×B.

Exemplarily, the encoder determines feature information of the second-category elements, the feature information including expression level information and regulatory relationship information. The decoder recovers the expression level information of the first-category elements from the expression level information and the regulatory relationship information.

Exemplarily, the gene regulation relationship model to be trained consists of two parts, an encoder and a decoder. The input features are first fed into the encoder to obtain an initial encoded feature tensor (for example a C×L×D tensor) that contains the expression level and regulatory relationship information of the second-category elements. Then, according to the positions of all elements (the first-category elements and the zero-valued and non-zero second-category elements) in the initial gene expression matrix, the first gene features of the first-category elements and the zero-value gene features of the zero-valued second-category elements are merged with the initial encoded feature tensor to obtain a target encoded feature tensor (for example a C×G×D tensor). Finally, the target encoded feature tensor is fed into the decoder to obtain the gene regulation relationship representation (for example a C×G×D tensor), from which the updated first gene features of the first-category elements are predicted.
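
A minimal sketch of this encode-merge-decode flow follows (toy sizes; ordinary transformer encoder layers stand in for the encoding and decoding units, whose exact attention variants the text leaves open, and the scatter by column index is one assumed way to merge by matrix position):

```python
import torch
import torch.nn as nn

C, L, G, D = 2, 4, 6, 16
block = lambda: nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
enc = nn.TransformerEncoder(block(), num_layers=2)   # encoder: M layers
dec = nn.TransformerEncoder(block(), num_layers=1)   # decoder: N layers, N < M
head = nn.Linear(D, 1)                               # one-layer transform to expression values

inputs = torch.randn(C, L, D)                        # non-zero second-category gene features
encoded = enc(inputs)                                # initial encoded feature tensor (C, L, D)

target = torch.randn(C, G, D)                        # first gene features + zero-value gene features
nonzero_cols = torch.tensor([[0, 2, 3, 5], [1, 2, 4, 5]])  # matrix positions of non-zero elements
target.scatter_(1, nonzero_cols[..., None].expand(-1, -1, D), encoded)  # merge by position

relation = dec(target)                               # gene regulation relationship representation
pred = head(relation).squeeze(-1)                    # predicted gene expression matrix (C, G)
```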

Step 150B: update the model parameters of the gene regulation relationship model to be trained based on the initial gene expression matrix and the predicted gene expression matrix, generating the trained gene regulation relationship model.

Exemplarily, step 150B includes: computing the difference between the value of each element of the predicted gene expression matrix and the value of the element at the corresponding position of the initial gene expression matrix, and determining the loss value from these differences.

If the training termination condition is not met, the values of the model parameters are adjusted according to the loss value, and execution returns to the step of randomly selecting multiple elements of the initial gene expression matrix as first-category elements.

Exemplarily, the model parameters may include the encoder parameters and the decoder parameters, and may also include the parameters of the network that transforms the gene regulation relationship representation into the predicted gene expression matrix.

In this specific implementation, the first-category elements and the zero-valued elements are removed from the initial gene expression matrix, which covers the expression of both highly variable and non-highly-variable genes, and only the input features corresponding to the non-zero second-category elements are fed into the gene regulation relationship model to be trained. Large amounts of data need not be processed, which lowers the computational complexity and cost, improves training efficiency, and allows the model to learn potential gene regulatory relationships at the whole-genome level. Compared with existing gene regulation relationship model generation methods, this resolves both the difficulty of training regulation relationship models and the systematic gaps in the regulatory relationship maps such models learn.

In a specific embodiment of the present disclosure, as shown in FIG. 1C, an embodiment of the present disclosure further provides a prediction model training method 100C, which includes steps 110C to 130C.

Step 110C: obtain multiple samples, each of which includes first gene expression data, masked gene expression data, and auxiliary information. The first gene expression data includes the counts of different genes in a single cell measured at a first sequencing depth. The masked gene expression data is obtained by downsampling the first gene expression data and masking the counts of some genes in the downsampled first gene expression data; the downsampled first gene expression data simulates the counts of different genes in a single cell measured at a second sequencing depth lower than the first sequencing depth. The auxiliary information includes a first total count, which is the sum of the gene counts in the first gene expression data.

For explanations of the terms and descriptions of downsampling and masking, see the foregoing; they are not repeated here.

When training the prediction model, the first gene expression data obtained by actual sequencing serves as the ground-truth gene counts at the high sequencing depth, and the downsampled first gene expression data simulates the gene counts at a low sequencing depth. The model predicts the high-depth gene counts from the low-depth gene counts, a loss is computed from the predicted values and the ground-truth high-depth counts, and the model is updated with the loss value. This teaches the prediction model to capture the relationships between the gene expression of similar cells at different sequencing depths. To further improve training, an MAE-style scheme is used: the downsampled first gene expression data serves as the low-depth gene counts and is partially masked; the model predicts the high-depth gene counts from the partially masked low-depth gene counts; the loss is computed only over the portion of the ground-truth high-depth gene counts corresponding to the masked elements; and the model is updated with that loss. Constructing the training data therefore requires both the first gene expression data obtained by actual sequencing and the masked gene expression data obtained by downsampling the first gene expression data and masking the counts of some genes in the downsampled result.
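
A minimal sketch of the masked-position loss follows (MSE is an assumed choice; the disclosure only requires that the loss be computed from the predictions and the ground-truth counts at the masked elements):

```python
import torch

pred = torch.randn(1, 200)                  # predicted counts at the first (high) depth
truth = torch.randn(1, 200)                 # first gene expression data (ground truth)
mask = torch.rand(1, 200) < 0.15            # positions that were masked in the model input
loss = ((pred - truth)[mask] ** 2).mean()   # loss over masked positions only
```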

For the prediction model to correctly predict the high-depth gene counts from the low-depth gene counts, it must understand what the low and high sequencing depths are. Understandably, the higher the sequencing depth, the larger the sum of the gene counts in the sequencing result, so a sequencing depth can be characterized by the sum of the gene counts measured at that depth.

In one example, the auxiliary information includes the first total count but need not include the sum of the gene counts at the low sequencing depth, that is, the sum of the gene counts in the downsampled first gene expression data. Even without that sum, the model can guess the masked gene counts from the masked gene expression data in the sample and thereby estimate the sum of the gene counts in the downsampled first gene expression data. Of course, the auxiliary information may also include that sum, which explicitly tells the model how low the low sequencing depth is, reducing the learning cost and accelerating model convergence.

Step 120C: for each of the multiple samples, execute step PB1 and step PB2, which are sub-steps of step 120C.

Step PB1: process the masked gene expression data and the auxiliary information of the sample with the prediction model to be trained, obtaining predicted values of the gene counts at the first sequencing depth for that sample.

Exemplarily, when the prediction model to be trained processes the masked gene expression data and the auxiliary information of the sample, both may be input at the first network layer of the model: concatenate the masked gene expression data of the sample with the auxiliary information as the sample's input feature tensor, or concatenate the feature tensor of the masked gene expression data with the feature tensor of the auxiliary information as the input feature tensor, and process the input feature tensor with the prediction model to be trained.

Exemplarily, the masked gene expression data may instead be input at the first network layer while the auxiliary information is input at a later layer; the prediction model to be trained includes an input layer, an output layer, and multiple intermediate layers between them. In this case, step PB1 includes: taking the masked gene expression data of the sample, or its corresponding feature tensor, as the input feature tensor and feeding it through the input layer and a first predetermined subset of the intermediate layers (which may be empty) to obtain an intermediate feature tensor; concatenating the auxiliary information with the intermediate feature tensor to obtain a concatenated intermediate feature tensor; and feeding the concatenated intermediate feature tensor through a second predetermined subset of the intermediate layers, different from the first, and the output layer. For example, when the prediction model includes a decoder network, the auxiliary information is input before the last several layers of the decoder network; when it includes an encoder network and a decoder network, the auxiliary information may be input before some layer of the encoder network or before the last several layers of the decoder network. In this way, the model can exploit the auxiliary information while processing the input data, extracting more meaningful features or making more accurate predictions.

Exemplarily, the inputs to the prediction model may be the masked gene expression data and the auxiliary information themselves, or the masked gene expression data and the auxiliary information converted into their corresponding embedding vectors. The input to the first network layer of the prediction model may be the counts of all genes in the masked gene expression data, or only the counts of some genes (for example, the counts of genes whose counts are non-zero and unmasked).

Taking the masked gene expression data, its feature tensor, its concatenation with the auxiliary information, or the concatenation of its feature tensor with the auxiliary information's feature tensor as the input feature tensor should be interpreted broadly: it refers not only to input at the first network layer of the prediction model but to input at any of its network layers. For example, part of the input feature tensor may feed the first network layer while another part feeds an intermediate layer. Furthermore, the input feature tensor may include portions corresponding to all elements of the masked gene expression data, only to some elements (for example, the non-zero unmasked portion), and also to elements outside the masked gene expression data (for example, padding elements, explained later).

It should be noted that the network layers of the prediction model expect inputs of a fixed size, for example 2000×768. When the counts of the non-zero unmasked genes in the masked gene expression data are fed into a network layer of the prediction model, different samples may contain different numbers of non-zero gene counts and different proportions of masked elements, producing inputs of different sizes; these can be padded to a common size so that all inputs to the layer match. For example, if the number of non-zero unmasked gene counts per sample fluctuates between 300 and 1000, the input feature tensor can be padded to size 1000; alternatively, it can be padded to 10% of the full gene count.

The way the masked gene expression data is converted into its corresponding embedding vectors is described above and not repeated here.

Step PB2: determine the loss value of the sample from the predicted values and the counts in the first gene expression data corresponding to the masked portion of genes; and

Step 130C: update the prediction model to be trained according to the loss value corresponding to each of the multiple samples.

This specific implementation downsamples actual high-depth sequencing results to simulate low-depth sequencing results, trains the prediction model on sample pairs of high-depth and low-depth results, and explicitly introduces auxiliary information characterizing the desired sequencing depth. The prediction model thus learns to capture both the relationships between the gene counts of similar cells at different sequencing depths and the relationships among the gene counts within a single cell, minimizing technical noise in sequencing and computationally increasing the sequencing depth, which in turn improves the accuracy of downstream task outputs.

Exemplarily, each of the multiple samples includes normalized first gene expression data, normalized masked gene expression data, and auxiliary information. The normalized first gene expression data is obtained by normalizing the first gene expression data; the normalized masked gene expression data is obtained by downsampling the first gene expression data, normalizing the downsampled first gene expression data, and masking the counts of some genes in the normalized result. The auxiliary information further includes a second total count, which is the sum of the gene counts in the downsampled first gene expression data. Step PB1 then includes: processing the normalized masked gene expression data and the auxiliary information of the sample with the prediction model to be trained to obtain predicted normalized values of the gene counts at the first sequencing depth for that sample; and step PB2 includes: determining the loss value of the sample from the predicted values and the counts corresponding to the masked portion of genes in the normalized first gene expression data.

Compared with absolute gene counts, the relative relationships among the counts of different genes matter more for downstream tasks. To eliminate count differences caused by sequencing depth, so that sequencing results from different samples and different depths are comparable, the sequencing results are standardized and normalized to produce the training samples. That is, the normalized first gene expression data and the normalized masked gene expression data can replace the first gene expression data and the masked gene expression data in steps 110, 120, 121, and 122.

Sequencing depth is represented by the total count, so normalization aimed at removing the effect of sequencing depth can use the total count of the gene expression data. The specific normalization method is described above.
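
As one common convention (the 1e4 target scale and the log1p transform are assumptions borrowed from standard single-cell practice, not details fixed by this disclosure), total-count normalization can look like this:

```python
import numpy as np

counts = np.array([4.0, 0.0, 6.0, 10.0])     # gene counts for one cell
total = counts.sum()                         # total count characterizes sequencing depth
normalized = np.log1p(counts / total * 1e4)  # depth-independent relative expression
```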

The downsampled first gene expression data simulates the counts of different genes in a single cell measured at a second sequencing depth lower than the first, and the second total count can characterize the second sequencing depth. While the first sequencing depth is known from the actual sequencing, the second sequencing depth is not; to characterize both depths with the same measure, the first total count and the second total count are used to represent the first and second sequencing depths, respectively.

Exemplarily, the embedding vectors of the first total count T and the second total count S can each be determined in the manner described above, with the token of the first total count differing from the token of the second total count and from all gene IDs.

In this example, because a sample includes only the normalized masked gene expression data and not the downsampled (pre-normalization) first gene expression data, the model cannot estimate the sum of the gene counts in the downsampled, un-normalized first gene expression data, that is, it cannot estimate how low the lower sequencing depth corresponding to the normalized masked gene expression data is. The second total count must therefore be included in the auxiliary information. The prediction model can then explicitly learn from the second total count how low the lower sequencing depth of the normalized masked gene expression data is and how high the higher sequencing depth of the first gene expression data is, and can thus predict the gene counts at the higher sequencing depth from the normalized masked gene expression data.

Understandably, since the normalized first gene expression data serves as the ground truth, the predicted gene counts at the first sequencing depth output by the model are also normalized. Multiplying the predicted values by the first total count restores them to un-normalized count predictions, remapping the normalized predictions back to the original count space.

In one example, the normalized masked gene expression data can be obtained from the first gene expression data as follows. First, downsample the first gene expression data. Second, normalize the downsampled first gene expression data. Third, mask the counts of some genes in the result of the second step to obtain the masked gene expression data.
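
A minimal sketch of these three steps follows (binomial thinning is one statistical sampling algorithm that fits the description; the retention probability, mask ratio, and -1 sentinel are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
first = rng.poisson(5.0, size=200)                # first gene expression data (high depth)

down = rng.binomial(first, 0.3)                   # 1) downsample to a lower depth
second_total = int(down.sum())                    # second total count S
norm = down / max(second_total, 1)                # 2) normalize by the total count
masked = norm.copy()
mask = rng.random(norm.shape) < 0.15              # 3) mask the counts of some genes
masked[mask] = -1.0                               # sentinel marking masked positions (assumption)
```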

Exemplarily, the prediction model to be trained includes an encoder network and a decoder network, and processing the input feature tensor with the prediction model to be trained includes: encoding, with the encoder network, the first tensor of the input feature tensor, which corresponds to the unmasked non-zero counts, to obtain an encoded tensor; replacing the first tensor in the input feature tensor with the encoded tensor to obtain an encoded input feature tensor; and decoding the encoded input feature tensor with the decoder network.

In one example, step PB1 includes:

Step PB11: encode, with the encoder network, the first tensor of the input feature tensor corresponding to the unmasked non-zero counts, obtaining an encoded tensor. Exemplarily, the first tensor may have size 2000×768 and the encoded tensor size 2000×768; among the 2000 positions, elements 1, 2, 4, 10, 30, ... hold non-zero unmasked counts, and elements 1561 to 2000 are padding elements.

Step PB12: replace the first tensor in the input feature tensor with the encoded tensor to obtain the encoded input feature tensor, that is, replace elements 1, 2, 4, 10, 30, ... of the N×768 tensor with the corresponding tensors of the encoded tensor.

Step PB13: decode the encoded input feature tensor with the decoder network to obtain a decoded tensor, which may have size N×768.

Exemplarily, the prediction model further includes a multilayer perceptron, and after step PB13, step PB1 further includes step PB14: process the decoded tensor with the multilayer perceptron of the prediction model to obtain the predicted gene counts of the sample at the first sequencing depth, the predictions having size N×1. Understandably, the multilayer perceptron is trained together with the rest of the prediction model.

Exemplarily, before step PB11, the masked gene expression data and the auxiliary information can be input, in the manner described above, into the module that determines count embedding vectors from counts, yielding the input feature tensor. For example, if the masked gene expression data has size N×1 and the auxiliary information includes T and S, the input feature tensor has size (N+2)×768.

In this specific implementation, the encoder attends only to the embedding vectors corresponding to the non-zero counts in the input feature tensor, while the decoder receives the embedding vectors of all genes (the embeddings processed by the encoder plus the embeddings of the other genes) and integrates information from all positions. After the zero-valued and masked positions are filtered out before being fed to the encoder, the encoder's input sequence length is roughly 10% of the full gene length. This design greatly reduces the required computational resources, lets the encoder use a stack of ordinary Transformer blocks to capture gene dependencies, and substantially improves training efficiency and quality. Because the encoder module processes only non-zero unmasked counts, the model can focus on the most informative non-zero expressed genes, while the decoder stage still allows zero-valued genes to participate in model training.

Exemplarily, the prediction model to be trained is one of an encoder-decoder network, a decoder network, and a multilayer perceptron.

Exemplarily, the downsampled first gene expression data is obtained by downsampling the first gene expression data with a statistical sampling algorithm; see the description of downsampling above.

According to another aspect of the present disclosure, a method for correcting gene expression data is provided. FIG. 2 is a flowchart of a method for correcting gene expression data according to an embodiment of the present disclosure. In method 100 the prediction model has been trained, and the trained prediction model can correct gene expression data; in method 200, the trained prediction model is used for inference. As shown in FIG. 2, method 200 includes steps 210 and 220.

In the present disclosure, the method for correcting gene expression data is also referred to as a method for determining gene regulatory relationships.

Step 210: obtain current gene expression data, the current gene expression data including the counts of different genes measured at an actual sequencing depth.

Understandably, since method 200 uses the prediction model trained in method 100, the current gene expression data should have the same form as the masked gene expression data, except that no masking is needed. If the masked gene expression data in the training samples was normalized, the current gene expression data should also be normalized; if it was not normalized, the current gene expression data should not be normalized either.

Understandably, if method 200 uses a prediction model that was given auxiliary information during training, step 210 should also obtain the auxiliary information corresponding to the current gene expression data; if the model was trained without auxiliary information, step 210 need not obtain it.

Understandably, the current gene expression data may be single-cell gene expression data or gene expression data obtained by bulk sequencing of a large number of cells.

Exemplarily, the current gene expression data may be a gene expression matrix obtained by single-cell sequencing. If the current gene expression data comes from one single cell, the gene expression matrix may be a vector; if it comes from multiple single cells, it may be a matrix in which each row or column holds the gene expression of one single cell.

Understandably, the genes in the current gene expression data should be a subset of the genes contained in the first gene expression data of the samples used to train the prediction model; otherwise the expression relationships among the genes in the current gene expression data were never learned by the prediction model and cannot be predicted by it.

Step 220: process the current gene expression data with at least some network layers of the prediction model to obtain a corrected value of the current gene expression data or an intermediate processing result of the corrected value. The prediction model may be one trained according to method 100.

Exemplarily, in step 220 all network layers of the prediction model may process the current gene expression data, yielding the corrected value; or only some layers may be used (for example the first M layers, such as all layers except the one that projects the intermediate representation of the corrected inter-gene count relationships into predicted gene counts), yielding an intermediate processing result of the corrected value. The choice depends on the needs of the downstream task.

Exemplarily, the corrected value is the output of the multilayer perceptron in FIG. 1A; an intermediate processing result of the corrected value is an intermediate result obtained on the way to the corrected value and does not imply that the corrected value itself is ultimately produced. In FIG. 1A, the encoding vectors output by the encoder network and the decoding vectors output by the decoder network are both intermediate processing results.

Exemplarily, still referring to FIG. 1A, the at least some network layers of the prediction model used in step 220 include an encoder network and may further include a decoder network. Features characterizing a cell can be derived from the encoder network's output, and features characterizing genes from the decoder network's output. When the downstream task needs cell-characterizing features, step 220 may use only the encoder network to process the current gene expression data, and subsequent steps use the cell-characterizing features derived from the encoder output for the downstream task. When the downstream task needs gene-characterizing features, step 220 may use both the encoder network and the decoder network, and subsequent steps use the gene-characterizing features derived from the decoder output for the downstream task.

Exemplarily, when the prediction model includes an encoder network, the cell-characterizing features may be obtained by passing the encoder output through a pooling layer (for example a max pooling layer). Exemplarily, when the prediction model includes an encoder network and a decoder network, the gene-characterizing features may be obtained by passing the decoder output through a multilayer perceptron.
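
A minimal sketch of both heads follows (max pooling and a two-layer MLP are assumed concrete forms):

```python
import torch
import torch.nn as nn

D, G = 16, 100
encoder_out = torch.randn(1, G, D)              # per-gene features from the encoder network
cell_feature = encoder_out.max(dim=1).values    # (1, D): pooled feature characterizing the cell

decoder_out = torch.randn(1, G, D)              # per-gene features from the decoder network
mlp = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
gene_features = mlp(decoder_out)                # (1, G, D): feature characterizing each gene
```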

Understandably, if method 200 uses a prediction model trained with auxiliary information, step 220 also processes the auxiliary information of the current gene expression data with at least some network layers of the prediction model; in this case step 220 yields the corrected value of the current gene expression data at a desired sequencing depth, or an intermediate processing result of that corrected value. If the model was trained without auxiliary information, step 220 yields the corrected value, or an intermediate processing result of it, at a sequencing depth at which the current gene expression data remains essentially unchanged, and step 220 need not process any auxiliary information.

Exemplarily, method 200 further includes step 230: determine the feature tensor corresponding to the current gene expression data; step 220 then includes processing that feature tensor with at least some network layers of the prediction model. Understandably, since the current gene expression data contains only unmasked counts (zero-valued and non-zero), step 230 can follow the method of step 123 for determining the feature tensors corresponding to unmasked non-zero counts and unmasked zero counts.

Exemplarily, at least some of the network layers of the prediction model include an encoder network and may further include a decoder network. Step 220 includes: step 2201, encoding the input of the encoder network using the encoder network to obtain the output of the encoder network, where the input of the encoder network includes the feature tensors corresponding to the non-zero counts in the current gene expression data (i.e., counts that are not masked in the current gene expression data); exemplarily, the input of the encoder network may further include the feature vectors corresponding to padding elements. Optionally, after step 2201, step 220 includes step 2202: obtaining the input of the decoder network from the output of the encoder network and the feature tensors corresponding to the zero-valued counts in the current gene expression data (exemplarily, by splicing the two according to the positions of the counts), and decoding the input of the decoder network using the decoder network to obtain the output of the decoder network. Optionally, after step 2202, step 220 includes step 2203: projecting the output of the decoder network into the corrected values of the current gene expression data.
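By way of illustration only, steps 2201-2203 may be sketched as follows; here the encoder and decoder are assumed to be self-attention stacks that each take a single (batch, genes, dim) input, the splice-by-position of step 2202 is realized with a boolean mask, and padding elements are omitted for brevity.

    import torch
    import torch.nn as nn

    def forward_correction(encoder: nn.Module, decoder: nn.Module, proj: nn.Linear,
                           feats: torch.Tensor, nonzero_mask: torch.Tensor) -> torch.Tensor:
        # feats: (G, dim) feature tensors for all counts of one cell;
        # nonzero_mask: (G,) boolean, True where the count is non-zero.
        enc_out = encoder(feats[nonzero_mask].unsqueeze(0)).squeeze(0)  # step 2201
        dec_in = feats.clone()                                          # step 2202:
        dec_in[nonzero_mask] = enc_out                                  # splice encoder
        dec_out = decoder(dec_in.unsqueeze(0)).squeeze(0)               # output by position
        return proj(dec_out).squeeze(-1)                                # step 2203: (G,)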

It will be appreciated that when a downstream task requires cell-characterizing features, step 2201 may be executed to obtain the output of the encoder network, after which the encoder output (optionally with the feature tensors at positions corresponding to padding elements removed) is fed into a pooling layer to obtain the features characterizing the cell, which are then used for the downstream task. When a downstream task requires gene-characterizing features, steps 2201 and 2202 may be executed to obtain the output of the decoder network, after which the decoder output is fed into a multilayer perceptron to obtain the features characterizing the genes, which are then used for the downstream task. When a downstream task requires the corrected values of the current gene expression data, steps 2201, 2202, and 2203 may be executed, and the corrected values are used for the downstream task.

Through the prediction model, method 200 improves the accuracy of the relative relationships among the gene counts in the current gene expression data, yielding corrected values of the current gene expression data, or intermediate processing results of those corrected values, that can be used for different downstream tasks; applying them to downstream tasks improves the accuracy of those tasks. It will be appreciated that the downstream tasks in the embodiments of the present disclosure may be existing downstream-task algorithms or models, such as cell classification models and perturbation prediction models, which take features characterizing cells or features characterizing genes as input. In the embodiments of the present disclosure, the cell-characterizing and gene-characterizing features provided herein replace the corresponding features used in prior-art downstream-task algorithms or models. That is, the features characterizing cells or genes obtained according to method 200 may be substituted for the features originally used in the downstream task.

In a specific embodiment, step 210 includes: obtaining normalized current gene expression data and current auxiliary information; wherein the current gene expression data includes the counts of different genes measured at the actual sequencing depth, and the normalized current gene expression data is obtained by normalizing the current gene expression data; the current auxiliary information includes an expected first total count and a current second total count; the current second total count is the sum of the gene counts in the current gene expression data; the expected first total count characterizes the expected sequencing depth, and the expected first total count is greater than or equal to the sum of the gene counts in the current gene expression data.
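By way of illustration only, the following sketch assembles the normalized current gene expression data and the current auxiliary information for one cell; the log1p-of-scaled-counts normalization is an assumption made for the example, since the disclosure requires only that inference use the same normalization as the training stage.

    import numpy as np

    def prepare_inputs(counts: np.ndarray, expected_total: float):
        # counts: raw per-gene counts of one cell at the actual sequencing depth.
        current_total = float(counts.sum())            # current second total count
        assert expected_total >= current_total         # expected first total count
        normalized = np.log1p(counts / current_total * 1e4)  # assumed scheme
        aux = {"expected_first_total": expected_total,
               "current_second_total": current_total}
        return normalized, aux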

In some embodiments of method 100, the sample includes normalized first gene expression data, masked gene expression data, and auxiliary information, so that the prediction model learns to capture both the relationships among gene counts of similar cells at different sequencing depths and the relationships among gene counts within a single cell, and can therefore predict gene counts at a high sequencing depth from gene counts at a low sequencing depth. Correspondingly, when the prediction model trained on such samples is used, its input includes the normalized current gene expression data and current auxiliary information characterizing the low and high sequencing depths, so that the prediction model can predict the expected gene counts at a high sequencing depth from the gene counts at a low sequencing depth.

The expected first total count may be greater than or equal to the current second total count. When the expected first total count is greater than the sum of the gene counts in the current gene expression data, the prediction model is expected to predict the gene counts at a desired higher sequencing depth from the gene counts actually measured at a lower sequencing depth. When the expected first total count equals the sum of the gene counts in the current gene expression data, the prediction model is expected to predict corrected gene counts under the condition that the sequencing depth remains substantially unchanged.

Since the room for increasing the sequencing depth is limited, and to ensure the accuracy of the predicted counts of each gene of the target single cell at the expected sequencing depth, the expected first total count should not be much larger than the sum of the gene counts in the current gene expression data. Exemplarily, the expected sequencing depth may be set according to the sequencing method. For example, if the sequencing method is the 10X Genomics Chromium sequencing technology (abbreviated as "10X sequencing technology") and the sum of the gene counts in the current gene expression data is about 1,000, the expected sequencing depth may be set to 10,000.

Exemplarily, after current gene expression data at a low sequencing depth is obtained (the current gene expression data being unnormalized), the sum of the gene counts in the current gene expression data is computed as the current second total count on the one hand, and the current gene expression data is normalized to obtain the normalized current gene expression data on the other hand, so that at least some of the network layers of the prediction model can subsequently process the normalized current gene expression data. It will be appreciated that if the gene expression data obtained at the low sequencing depth is already normalized, it may first be restored to unnormalized current gene expression data.

Step 220 includes: processing the normalized current gene expression data and the current auxiliary information using at least some of the network layers of the prediction model, to obtain corrected values of the normalized values of the current gene expression data at the expected sequencing depth, or an intermediate processing result of those corrected values.

Exemplarily, at least some of the network layers of the prediction model process the feature tensor corresponding to the normalized current gene expression data and the feature tensor corresponding to the current auxiliary information; before this, these feature tensors need to be determined. For the method of determining the feature tensor corresponding to the normalized current gene expression data, see the description of step 230; for the method of determining the feature tensor corresponding to the current auxiliary information, see the description of step 125.

Exemplarily, which network layers of the prediction model are used to process the normalized current gene expression data and the current auxiliary information is determined according to the requirements of the downstream task, as described above.

Exemplarily, the current auxiliary information may be fed into the first network layer together with the current gene expression data, or only the current gene expression data may be fed into the first network layer while the auxiliary information is fed into a network layer other than the first; the specific input manner and form must be consistent with the training stage and are not repeated here.

In a specific embodiment of the present disclosure, as shown in Figure 2A, a gene regulation relationship determination method 200A is provided, including: step 210A, obtaining the gene expression matrix corresponding to the cells to be processed, where the genes expressed by the cells to be processed are a subset of the genes contained in the initial gene expression matrix; and step 220A, inputting the gene expression matrix into a gene regulation relationship model to obtain a gene regulation relationship representation, the representation being a three-dimensional tensor, where the gene regulation relationship model is generated according to the gene regulation relationship model generation method of method 100B. The gene expression matrix may be obtained, for example, by single-cell sequencing. When the cells to be processed comprise a single cell, the gene expression matrix may be a vector; when the cells to be processed comprise multiple cells, the gene expression matrix may be a matrix in which each row or column corresponds to the gene expression of one cell. The obtained gene expression matrix is input into the trained gene regulation relationship model, which outputs the gene regulation relationship representation corresponding to the cells to be processed. The gene regulation relationship representation is a three-dimensional tensor (for example, a C×G×D tensor), and the gene regulation relationship corresponding to each cell to be processed is characterized by the matrix corresponding to that cell in the three-dimensional tensor. It will be appreciated that when C is 1, the C×G×D tensor becomes a 1×G×D tensor. It will be appreciated that the genes expressed by the cells to be processed are a subset of the genes contained in the initial gene expression matrix used in training the gene regulation relationship model. If a gene expressed by the cells to be processed is not among the genes contained in the initial gene expression matrix, the regulatory relationships involving that gene were not learned by the gene regulation relationship model, and the corresponding gene regulation relationships cannot be predicted by the model.
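By way of illustration only, inference with a trained gene regulation relationship model may be sketched as follows; the model is the one produced by method 100B, and its internals are not assumed here.

    import torch

    def infer_regulation(model: torch.nn.Module, expr: torch.Tensor) -> torch.Tensor:
        # expr: C x G gene expression matrix (or a G-vector for a single cell).
        if expr.dim() == 1:           # a single cell arrives as a G-vector
            expr = expr.unsqueeze(0)  # -> 1 x G
        with torch.no_grad():
            rel = model(expr)         # expected output shape: C x G x D
        return rel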

In a specific embodiment of the present disclosure, a gene expression data correction method 200B is provided. As shown in Figure 2B, method 200B includes: step 210B, obtaining current gene expression data and current auxiliary information, where the current gene expression data includes the counts of different genes measured at the actual sequencing depth, the current auxiliary information includes an expected first total count, the expected first total count characterizes the expected sequencing depth, and the expected first total count is greater than or equal to the sum of the gene counts in the current gene expression data; and step 220B, processing the current gene expression data and the current auxiliary information using at least some of the network layers of a prediction model trained according to method 100C, to obtain corrected values of the current gene expression data at the expected sequencing depth, or an intermediate processing result of those corrected values. For the reasons described above, in the training stage the auxiliary information may include only the first total count and not the second total count; correspondingly, in the inference stage the current auxiliary information may include only the expected first total count and not the current second total count. In this case, both the first gene expression data in the training samples and the current gene expression data in the inference stage are unnormalized. For the explanation of terms and the description of steps in this embodiment, see above.

Figure 3 is a flowchart of a downstream task execution method according to an embodiment of the present disclosure. As shown in Figure 3, the method 300 includes step 310 and step 320.

In step 310, input data is obtained, where the input data includes i) corrected values obtained according to method 200; or ii) an intermediate processing result of the corrected values obtained according to method 200; or iii) a preprocessing result obtained by preprocessing the corrected values obtained by method 200; or iv) a preprocessing result obtained by preprocessing the intermediate processing result of the corrected values obtained by method 200.

Exemplarily, the preprocessing may be pooling, processing with a linear layer, processing with a specific model, and so on. For example, the preprocessing may be the method described above for obtaining cell-characterizing features from the encoder output, or the method described above for obtaining gene-characterizing features from the decoder output.

In step 320, the input data is processed using a downstream task algorithm to obtain a downstream task result, where the downstream task includes a cell categorization task, a perturbation prediction task, or a drug response prediction task.

Exemplarily, the downstream tasks include cell categorization tasks (further subdivided into cell classification tasks and cell clustering tasks), perturbation prediction tasks, and drug response prediction tasks. Exemplarily, the features characterizing cells are used for cell classification, cell clustering, and drug response prediction tasks, and the features characterizing genes are used for perturbation prediction tasks.

Exemplarily, when the downstream task is a cell classification task, the features characterizing the cell are fed into a cell classification model to obtain a cell classification result. In an example, the classification model may be, for example, a support vector machine (SVM), a multilayer perceptron (MLP), a decision tree, or the like.
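By way of illustration only, a cell classification step using one of the named classifiers might look as follows; the feature array and labels are random stand-ins for the cell-characterizing features produced by the prediction model and for known cell-type annotations.

    import numpy as np
    from sklearn.svm import SVC

    # Stand-in data: in practice, cell_feats would come from method 200's
    # encoder read-out, and labels from annotated cell types.
    cell_feats = np.random.rand(200, 64)
    labels = np.random.randint(0, 3, size=200)

    clf = SVC(kernel="rbf")                 # one of the classifiers named above
    clf.fit(cell_feats[:150], labels[:150])
    predicted_types = clf.predict(cell_feats[150:])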

Exemplarily, when the downstream task is a cell clustering task, a clustering algorithm may be used to cluster the features characterizing the cells to obtain a clustering result. The clustering algorithm may be K-means clustering or the like.
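By way of illustration only, the corresponding clustering step with K-means might look as follows; the number of clusters is an assumption made for the example.

    import numpy as np
    from sklearn.cluster import KMeans

    cell_feats = np.random.rand(200, 64)   # stand-in for model-derived cell features
    kmeans = KMeans(n_clusters=3, n_init=10)
    cluster_ids = kmeans.fit_predict(cell_feats)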

The cell categorization result may be, for example, a classification result distinguishing cancer cells from normal cells, a clustering result of different types of immune cells, or a classification result of cell types in different organs.

When the downstream task is a drug response prediction task, a graph neural network may be used to extract drug features, the prediction model may be used to determine the features characterizing the cell, and the drug features and the cell-characterizing features are fed into a drug response prediction model to obtain a prediction of whether the cell is sensitive to the drug.
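By way of illustration only, the fusion of a GNN-derived drug feature with a cell-characterizing feature might be realized as below; fusing by concatenation followed by an MLP, and the hidden width, are assumptions made for the example.

    import torch
    import torch.nn as nn

    class DrugResponseHead(nn.Module):
        # Concatenates a drug embedding (from a graph neural network) with a
        # cell-characterizing feature and predicts drug sensitivity.
        def __init__(self, drug_dim: int, cell_dim: int):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(drug_dim + cell_dim, 128), nn.ReLU(), nn.Linear(128, 1))

        def forward(self, drug_feat: torch.Tensor, cell_feat: torch.Tensor) -> torch.Tensor:
            x = torch.cat([drug_feat, cell_feat], dim=-1)
            return torch.sigmoid(self.mlp(x))  # probability the cell is drug-sensitive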

When the downstream task is a perturbation prediction task, the gene-characterizing features determined by the prediction model may be used as the gene features of the gene nodes of a gene co-expression graph neural network, and perturbation features are applied to the gene co-expression graph neural network to predict the expression of the perturbed genes.
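By way of illustration only, one round of message passing over a gene co-expression graph with an injected perturbation might look as below; how the perturbation feature is applied (here, added to the perturbed gene's node feature) and the single-layer update are assumptions made for the example.

    import torch
    import torch.nn as nn

    class CoexpressionGNN(nn.Module):
        # One round of message passing over a gene co-expression graph; gene-node
        # features come from the prediction model's decoder read-out.
        def __init__(self, dim: int):
            super().__init__()
            self.update = nn.Linear(2 * dim, dim)
            self.readout = nn.Linear(dim, 1)  # predicted expression per gene

        def forward(self, gene_feats, adj, perturb_vec, perturbed_idx):
            # gene_feats: (G, dim); adj: (G, G) row-normalized co-expression adjacency
            h = gene_feats.clone()
            h[perturbed_idx] = h[perturbed_idx] + perturb_vec   # inject perturbation
            msg = adj @ h                                       # aggregate neighbors
            h = torch.relu(self.update(torch.cat([h, msg], dim=-1)))
            return self.readout(h).squeeze(-1)                  # (G,) predicted expression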

Exemplarily, the models other than the prediction model used in a downstream task may be trained together with the prediction model.

According to one aspect of the present disclosure, an electronic device is provided, including: at least one processor; and at least one memory communicatively connected to the at least one processor, the at least one memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the above method.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing instructions is provided; when the instructions are executed by at least one processor of a computer, they cause the computer to perform the above method.

According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the above method.

Figure 4 is a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

The electronic device 400 may be any of various types of devices. Examples of the electronic device 400 include, but are not limited to: desktop computers, server computers, notebook or netbook computers, mobile devices (e.g., tablets, cellular or other wireless phones (e.g., smartphones), notepad computers, mobile stations), wearable devices (e.g., glasses, watches), entertainment devices (e.g., entertainment appliances, set-top boxes communicatively coupled to a display device, game consoles), televisions or other display devices, automotive computers, and the like.

The electronic device 400 may include at least one processor 402, a memory 404, communication interface(s) 406, a display device 408, other input/output (I/O) devices 410, and one or more mass storage devices 412, which can communicate with each other, such as via a system bus 414 or other appropriate connection.

The processor 402 may be a single processing unit or multiple processing units, each of which may include a single or multiple computing units or multiple cores. The processor 402 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and/or any device that manipulates signals based on operating instructions. Among other capabilities, the processor 402 may be configured to fetch and execute computer-readable instructions stored in the memory 404, the mass storage device 412, or other computer-readable media, such as program code of an operating system 416, program code of application programs 418, program code of other programs 420, and the like.

The memory 404 and the mass storage device 412 are examples of computer-readable storage media for storing instructions that are executed by the processor 402 to implement the various functions described above. For example, the memory 404 may generally include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). In addition, the mass storage device 412 may generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network-attached storage, storage area networks, and the like. The memory 404 and the mass storage device 412 may both be collectively referred to herein as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 402 as a particular machine configured to carry out the operations and functions described in the examples herein.

A plurality of programs may be stored on the mass storage device 412. These programs include an operating system 416, one or more application programs 418, other programs 420, and program data 422, and they may be loaded into the memory 404 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: method 100 (including any suitable steps of method 100), method 100B (including any suitable steps of method 100B), method 100C (including any suitable steps of method 100C), method 200 (including any suitable steps of method 200), method 200A (including any suitable steps of method 200A), method 200B (including any suitable steps of method 200B), method 300 (including any suitable steps of method 300), and/or further embodiments described herein.

Although illustrated in Figure 4 as being stored in the memory 404 of the electronic device 400, the modules 416, 418, 420, and 422, or portions thereof, may be implemented using any form of computer-readable media accessible by the electronic device 400. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer-readable storage media and communication media.

Computer-readable storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by an electronic device. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer-readable storage media as defined herein do not include communication media.

One or more communication interfaces 406 are used to exchange data with other devices, such as over a network or a direct connection. Such a communication interface may be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless (such as IEEE 802.11 wireless LAN (WLAN)) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, and the like. The communication interface 406 can facilitate communication within a variety of network and protocol types, including wired networks (e.g., LAN, cable, etc.), wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and the like. The communication interface 406 can also provide communication with external storage (not shown), such as in a storage array, network-attached storage, storage area network, and the like.

In some examples, a display device 408, such as a monitor, may be included for displaying information and images to users. Other I/O devices 410 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and the like.

The techniques described herein may be supported by these various configurations of the electronic device 400 and are not limited to the specific examples of the techniques described herein. For example, the functionality may also be implemented, in whole or in part, on a "cloud" using a distributed system. The cloud includes and/or is representative of a platform for resources. The platform abstracts the underlying functionality of hardware (e.g., servers) and software resources of the cloud. The resources may include applications and/or data that can be used while computing processing is executed on servers remote from the electronic device 400. The resources may also include services provided over the Internet and/or through a subscriber network, such as a cellular or Wi-Fi network. The platform may abstract resources and functions to connect the electronic device 400 with other electronic devices. Accordingly, implementation of the functionality described herein may be distributed throughout the cloud. For example, the functionality may be implemented in part on the electronic device 400 and in part via a platform that abstracts the functionality of the cloud.

Although the present disclosure has been illustrated and described in detail in the drawings and the foregoing description, such illustration and description are to be considered illustrative and exemplary rather than restrictive; the present disclosure is not limited to the disclosed embodiments. Variations of the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps not listed, the indefinite article "a" or "an" does not exclude a plurality, and the term "plurality" means two or more. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (18)

1. A training method for a prediction model, comprising:
obtaining a plurality of samples, wherein each of the plurality of samples includes first gene expression data and masked gene expression data, the first gene expression data includes respective counts of different genes in a single cell, the masked gene expression data is obtained by processing the counts of part of the genes in the first gene expression data, and the processing used to obtain the masked gene expression data includes masking;
for each of the plurality of samples:
processing the masked gene expression data in the sample using the prediction model to be trained, to obtain predicted values of the counts of said part of the genes corresponding to the sample; and
determining a loss value corresponding to the sample according to the predicted values and the counts corresponding to said part of the genes in the first gene expression data; and
updating the prediction model to be trained according to the loss value corresponding to each of the plurality of samples.

2. The method according to claim 1, wherein the method further comprises: determining feature tensors corresponding to the masked gene expression data; and
the processing the masked gene expression data in the sample using the prediction model to be trained comprises: processing the feature tensors corresponding to the masked gene expression data in the sample using the prediction model to be trained.

3. The method according to claim 2, wherein the processing the feature tensors corresponding to the masked gene expression data in the sample using the prediction model to be trained comprises:
inputting the feature tensors corresponding to unmasked and non-zero counts in the masked gene expression data into the first network layer of the prediction model to be trained; and
inputting the feature tensors corresponding to masked counts in the masked gene expression data and the feature tensors corresponding to zero-valued counts in the masked gene expression data into a network layer of the prediction model to be trained other than the first network layer.

4. The method according to claim 3, wherein the prediction model to be trained comprises an encoder network and a decoder network, and the processing the feature tensors corresponding to the masked gene expression data in the sample using the prediction model to be trained comprises:
encoding the input of the encoder network using the encoder network to obtain the output of the encoder network, the input of the encoder network including the feature tensors corresponding to unmasked and non-zero counts in the masked gene expression data;
obtaining the input of the decoder network according to the output of the encoder network, the feature tensors corresponding to masked counts in the masked gene expression data, and the feature tensors corresponding to zero-valued counts in the masked gene expression data; and
decoding the input of the decoder network using the decoder network.

5. The method according to claim 4, wherein the encoder network comprises M layers of encoding units and the decoder network comprises N layers of decoding units, the value of M being greater than the value of N.

6. The method according to claim 4 or 5, wherein each layer of encoding units of the encoder network comprises at least one multi-head attention unit and at least one feed-forward unit, and each layer of decoding units of the decoder network comprises at least one feed-forward unit and further comprises at least one linear attention unit or sparse attention unit.

7. The method according to any one of claims 2-6, wherein the determining the feature tensors corresponding to the masked gene expression data comprises:
inputting the unmasked and non-zero counts in the masked gene expression data into a module for determining embedding vectors, to obtain count embedding vectors corresponding to the unmasked and non-zero counts, the module for determining embedding vectors including learnable parameters;
determining, according to the gene identifiers corresponding to the unmasked and non-zero counts in the masked gene expression data, gene embedding vectors corresponding to the unmasked and non-zero counts; and
obtaining the feature tensors corresponding to the unmasked and non-zero counts according to the count embedding vectors corresponding to the unmasked and non-zero counts and the gene embedding vectors corresponding to the unmasked and non-zero counts;
the method further comprising: updating the learnable parameters according to the loss value corresponding to each of the plurality of samples.

8. The method according to any one of claims 1-7, wherein the masked gene expression data is normalized, and the first gene expression data included in each of the plurality of samples is normalized first gene expression data;
the processing the masked gene expression data in the sample using the prediction model to be trained to obtain the predicted values of the counts of said part of the genes corresponding to the sample comprises: processing the masked gene expression data in the sample using the prediction model to be trained, to obtain predicted values of the normalized values of the counts of said part of the genes corresponding to the sample; and
the determining the loss value corresponding to the sample according to the predicted values and the counts corresponding to said part of the genes in the first gene expression data comprises: determining the loss value corresponding to the sample according to the predicted values of the normalized values and the counts corresponding to said part of the genes in the normalized first gene expression data.

9. The method according to claim 8, wherein the masked gene expression data is obtained by processing the counts of part of the genes in the first gene expression data; the processing used to obtain the masked gene expression data includes downsampling, normalization, and masking; the first gene expression data includes respective counts of different genes in a single cell measured at a first sequencing depth, and the downsampled first gene expression data simulates respective counts of different genes in a single cell measured at a second sequencing depth lower than the first sequencing depth; each of the plurality of samples further includes auxiliary information, the auxiliary information including a first total count and a second total count, the first total count being the sum of the counts of the genes in the first gene expression data, and the second total count being the sum of the counts of the genes in the downsampled first gene expression data; and
the processing the masked gene expression data in the sample using the prediction model to be trained to obtain the predicted values of the counts of said part of the genes corresponding to the sample comprises:
processing the masked gene expression data in the sample and the auxiliary information using the prediction model to be trained, to obtain predicted values of the normalized values of the gene counts at the first sequencing depth corresponding to the sample.

10. The method according to claim 9, wherein the prediction model to be trained comprises an encoder network and a decoder network; the method further comprises: determining a feature tensor corresponding to the auxiliary information; and
the processing the masked gene expression data in the sample and the auxiliary information using the prediction model to be trained comprises:
encoding the input of the encoder network using the encoder network to obtain the output of the encoder network, the input of the encoder network including the feature tensors corresponding to unmasked and non-zero counts in the masked gene expression data, and the input of the encoder network further including the feature tensor corresponding to the auxiliary information;
obtaining the input of the decoder network according to the output of the encoder network, the feature tensors corresponding to masked counts in the masked gene expression data, and the feature tensors corresponding to zero-valued counts in the masked gene expression data; and
decoding the input of the decoder network using the decoder network.

11. The method according to claim 9, wherein the prediction model to be trained comprises an input layer, an output layer, and a plurality of intermediate layers between the input layer and the output layer; the method further comprises: determining a feature tensor corresponding to the auxiliary information; and
the processing the masked gene expression data in the sample and the auxiliary information using the prediction model to be trained comprises:
inputting the feature tensors corresponding to the masked gene expression data in the sample into the input layer and first predetermined layers among the plurality of intermediate layers to obtain an intermediate feature tensor, the number of first predetermined layers being greater than or equal to 0;
splicing the feature tensor corresponding to the auxiliary information with the intermediate feature tensor to obtain a spliced intermediate feature tensor; and
inputting the spliced intermediate feature tensor into second predetermined layers among the plurality of intermediate layers and into the output layer, the second predetermined layers being different from the first predetermined layers.

12. The method according to claim 10 or 11, wherein the determining the feature tensor corresponding to the auxiliary information comprises:
inputting the auxiliary information into a module for determining embedding vectors, to obtain a count embedding vector corresponding to the auxiliary information, the module for determining embedding vectors including learnable parameters;
obtaining a gene embedding vector corresponding to the auxiliary information; and
obtaining the feature tensor corresponding to the auxiliary information according to the count embedding vector corresponding to the auxiliary information and the gene embedding vector corresponding to the auxiliary information;
wherein the learnable parameters are updated according to the loss value corresponding to each of the plurality of samples.

13. A method for correcting gene expression data, comprising:
obtaining current gene expression data, wherein the current gene expression data includes respective counts of different genes measured at an actual sequencing depth; and
processing the current gene expression data using at least some network layers of a prediction model trained according to the training method of any one of claims 1-12, to obtain corrected values of the current gene expression data or an intermediate processing result of the corrected values.

14. The method according to claim 13, wherein the obtaining the current gene expression data comprises: obtaining normalized current gene expression data and current auxiliary information, wherein the normalized current gene expression data is obtained by normalizing the current gene expression data; the current auxiliary information includes an expected first total count and a current second total count; the current second total count is the sum of the gene counts in the current gene expression data; and the expected first total count characterizes an expected sequencing depth and is greater than or equal to the sum of the gene counts in the current gene expression data; and
the processing the current gene expression data using at least some network layers of the prediction model trained according to the training method of any one of claims 1-12 to obtain corrected values of the current gene expression data or an intermediate processing result of the corrected values comprises:
processing the normalized current gene expression data and the current auxiliary information using at least some network layers of a prediction model trained according to the training method of any one of claims 9-12, to obtain corrected values of the normalized values of the current gene expression data at the expected sequencing depth, or an intermediate processing result of the corrected values of the normalized values.

15. A downstream task execution method, comprising:
obtaining input data, wherein the input data includes i) corrected values obtained by the method according to claim 13 or 14; or ii) an intermediate processing result of the corrected values in the method according to claim 13 or 14; or iii) a preprocessing result obtained by preprocessing the corrected values obtained by the method according to claim 13 or 14; or iv) a preprocessing result obtained by preprocessing the intermediate processing result of the corrected values obtained by the method according to claim 13 or 14; and
processing the input data using a downstream task algorithm to obtain a downstream task result, wherein the downstream task includes a cell categorization task, a perturbation prediction task, or a drug response prediction task.

16. A computing device, comprising: a memory, a processor, and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the method according to any one of claims 1-15.

17. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-15.

18. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-15.