WO2025123187A1

WO2025123187A1 - Feature extraction method and apparatus for cell grouping, and cell grouping method and apparatus

Info

Publication number: WO2025123187A1
Application number: PCT/CN2023/137958
Authority: WO
Inventors: 任亚亭; 杨超; 方双桑
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2023-12-11
Filing date: 2023-12-11
Publication date: 2025-06-19
Anticipated expiration: 2026-06-11

Abstract

The present application relates to a feature extraction method and apparatus for cell grouping, and a cell grouping method and apparatus. The feature extraction method for cell grouping comprises: perturbing a cell feature matrix of sample cells to obtain a first feature matrix and a second feature matrix; perturbing an adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix; by means of a feature encoder to be trained, encoding a first graph to obtain a first embedding matrix and encoding a second graph to obtain a second embedding matrix; fusing the first embedding matrix with the second embedding matrix to obtain a target embedding matrix; decoding the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix; and, on the basis of at least one of a first difference and a second difference, iteratively optimizing parameters of the feature encoder until an iteration stop condition is met, so as to obtain a trained feature encoder, wherein the feature encoder is used for extracting features required for cell grouping.

Description

Cell clustering feature extraction method, cell clustering method and device

Technical Field

The present application relates to the fields of computer technology and artificial intelligence technology, and in particular to a cell clustering feature extraction method, a cell clustering method, and corresponding devices.

Background Art

With the development of artificial intelligence technology, automatic clustering of cells has become possible. The purpose of cell clustering is to divide cells into different clusters according to the similarity and dissimilarity of their features, so that cells within a cluster are as similar as possible and cells in different clusters are as different as possible.

In traditional methods, principal component analysis (PCA) is generally used to represent cell features, reducing the feature dimension while minimizing information loss. However, this approach cannot extract deep-level features of the cells' gene expression, so cell clustering based on the resulting representation is inaccurate.

Summary of the Invention

In view of the above technical problem, it is necessary to provide a cell clustering feature extraction method and device, a cell clustering method and device, a computer device, and a computer-readable storage medium that can accurately extract the features required for cell clustering.

In a first aspect, the present application provides a cell clustering feature extraction method. The method comprises:

perturbing a cell feature matrix of sample cells to obtain a first feature matrix and a second feature matrix;

perturbing an adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix;

encoding a first graph by means of a feature encoder to be trained to obtain a first embedding matrix, and encoding a second graph to obtain a second embedding matrix; the first graph includes the first feature matrix and the first adjacency matrix; the second graph includes the second feature matrix and the second adjacency matrix;

fusing the first embedding matrix and the second embedding matrix to obtain a target embedding matrix;

decoding the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix;

iteratively optimizing the parameters of the feature encoder to be trained according to at least one of a first difference and a second difference until an iteration stop condition is met, thereby obtaining a trained feature encoder; the feature encoder is used to extract the features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

In a second aspect, the present application also provides a cell clustering method. The method comprises:

inputting a graph composed of a cell feature matrix and an adjacency matrix of cells to be clustered into a pre-trained feature encoder to obtain an extracted embedding matrix; the pre-trained feature encoder is trained in advance according to a first graph and a second graph; the first graph includes a first feature matrix and a first adjacency matrix; the second graph includes a second feature matrix and a second adjacency matrix; the first feature matrix and the second feature matrix are obtained by perturbing the cell feature matrix of sample cells; the first adjacency matrix and the second adjacency matrix are obtained by perturbing the adjacency matrix of the sample cells;

clustering the embedding vectors in the extracted embedding matrix that respectively correspond to the cells to be clustered, to obtain a cell clustering result for the cells to be clustered.

In a third aspect, the present application also provides a cell clustering feature extraction device. The device comprises:

a first perturbation module, configured to perturb a cell feature matrix of sample cells to obtain a first feature matrix and a second feature matrix;

a second perturbation module, configured to perturb an adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix;

an encoding module, configured to encode a first graph by means of a feature encoder to be trained to obtain a first embedding matrix, and to encode a second graph to obtain a second embedding matrix; the first graph includes the first feature matrix and the first adjacency matrix; the second graph includes the second feature matrix and the second adjacency matrix;

a fusion module, configured to fuse the first embedding matrix and the second embedding matrix to obtain a target embedding matrix;

a decoding module, configured to decode the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix;

a parameter optimization module, configured to iteratively optimize the parameters of the feature encoder to be trained according to at least one of a first difference and a second difference until an iteration stop condition is met, thereby obtaining a trained feature encoder; the feature encoder is used to extract the features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

In a fourth aspect, the present application also provides a cell clustering device. The device comprises:

an embedding matrix determination module, configured to input a graph composed of a cell feature matrix and an adjacency matrix of cells to be clustered into a pre-trained feature encoder to obtain an extracted embedding matrix; the pre-trained feature encoder is trained in advance according to a first graph and a second graph; the first graph includes a first feature matrix and a first adjacency matrix; the second graph includes a second feature matrix and a second adjacency matrix; the first feature matrix and the second feature matrix are obtained by perturbing the cell feature matrix of sample cells; the first adjacency matrix and the second adjacency matrix are obtained by perturbing the adjacency matrix of the sample cells;

a clustering module, configured to cluster the embedding vectors in the extracted embedding matrix that respectively correspond to the cells to be clustered, to obtain a cell clustering result for the cells to be clustered.

In a fifth aspect, the present application further provides a computer device. The computer device includes a memory and one or more processors, wherein a computer program is stored in the memory, and when the computer program is executed by the one or more processors, the one or more processors are caused to execute the steps of the cell clustering feature extraction method or the cell clustering method described in the embodiments of the present application.

In a sixth aspect, the present application further provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and when the computer program is executed by one or more processors, the one or more processors are caused to execute the steps of the cell clustering feature extraction method or the cell clustering method described in the embodiments of the present application.

In the above cell clustering feature extraction method and device, cell clustering method and device, computer device, and computer-readable storage medium, the cell feature matrix of the sample cells is perturbed to obtain the first feature matrix and the second feature matrix, and the adjacency matrix of the sample cells is perturbed to obtain the first adjacency matrix and the second adjacency matrix, so that richer features are obtained under different views of the graph. The feature encoder to be trained then encodes the first graph and the second graph, which are information-rich different views, and the encoding results of the first graph and the second graph are fused to obtain the target embedding matrix, so that the target embedding matrix carries richer and deeper features. The target embedding matrix is then decoded to obtain the reconstructed cell feature matrix and the reconstructed adjacency matrix, from which the reconstruction loss is determined. The resulting feature encoder, being able to extract deep-level features, can accurately extract the features required for cell clustering.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to illustrate the technical solutions in the embodiments of the present application or in the conventional technology more clearly, the drawings needed in the description of the embodiments or the conventional technology are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from the disclosed drawings without creative effort.

FIG. 1 is a schematic flowchart of the cell clustering feature extraction method in one embodiment;

FIG. 2 is a schematic diagram of the overall process of the cell clustering feature extraction method in one embodiment;

FIG. 3 is a schematic flowchart of the cell clustering method in one embodiment;

FIG. 4 is a structural block diagram of the cell clustering feature extraction device in one embodiment;

FIG. 5 is a structural block diagram of the cell clustering feature extraction device in another embodiment;

FIG. 6 is a structural block diagram of the cell clustering device in one embodiment;

FIG. 7 is a structural block diagram of the cell clustering device in another embodiment;

FIG. 8 is a diagram of the internal structure of a computer device in one embodiment;

FIG. 9 is a diagram of the internal structure of a computer device in another embodiment.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.

In an exemplary embodiment, as shown in FIG. 1, a cell clustering feature extraction method is provided. This embodiment is described with the method applied to a computer device, which may be a terminal or a server. The terminal may be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet computer, an Internet of Things device, or a portable wearable device. The Internet of Things device may be a smart speaker, a smart TV, a smart air conditioner, a smart in-vehicle device, or the like. The portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, or the like. The server may be implemented as an independent server or as a server cluster consisting of multiple servers. In this embodiment, the method comprises the following steps:

Step 102: perturb the cell feature matrix of the sample cells to obtain a first feature matrix and a second feature matrix.

The cell feature matrix is a matrix used to characterize the features of each sample cell.

In an exemplary embodiment, the features of the sample cells characterized by the cell feature matrix may include gene features and adjacency relationship features of the sample cells.

In an exemplary embodiment, the cell feature matrix may be obtained from the gene expression matrix and the adjacency matrix of the sample cells. The gene expression matrix of the sample cells is a matrix used to characterize the genes of each sample cell. The adjacency matrix of the sample cells is a matrix used to characterize the adjacency relationships between the sample cells.

In an exemplary embodiment, the adjacency matrix of the sample cells may be determined from the spatial position matrix of the sample cells. The adjacency matrix of the sample cells characterizes the distance similarity between the sample cells. The computer device may determine the distance similarity between the sample cells from the spatial position matrix of the sample cells, and obtain the adjacency matrix of the sample cells from the distance similarity. The spatial position matrix of the sample cells is a matrix used to characterize the spatial position of each sample cell.

In an exemplary embodiment, the rows of the gene expression matrix of the sample cells represent sample cells and the columns represent genes; the rows of the spatial position matrix of the sample cells represent sample cells and the columns represent the coordinate values of the sample cells. In other embodiments, the layout may be reversed: the columns of the gene expression matrix represent sample cells and the rows represent genes, and the columns of the spatial position matrix represent sample cells and the rows represent the coordinate values of the sample cells.

In an exemplary embodiment, each row of the adjacency matrix of the sample cells represents the distance similarity between one sample cell node and every sample cell node. For example, the i-th row represents the distance similarity between sample cell node i and every sample cell node, and the entry in the i-th row and j-th column represents the distance similarity between sample cell node i and sample cell node j.
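The derivation of an adjacency matrix of distance similarities from a spatial position matrix can be sketched as follows. This is a minimal illustration only: the application does not specify the similarity function, so the Gaussian kernel on Euclidean distance, the function name `adjacency_from_positions`, and the `sigma` parameter are all assumptions introduced here.

```python
import numpy as np

def adjacency_from_positions(coords: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Return an N x N matrix of distance similarities from N x d coordinates."""
    # Pairwise Euclidean distances between all cells (rows are cells).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumed Gaussian kernel: nearer cells get a similarity closer to 1.
    sim = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(sim, 0.0)  # no self-loops in this sketch
    return sim

# Three cells: two close together, one far away.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
A = adjacency_from_positions(coords)
```

Entry `A[i, j]` then plays the role of the distance similarity between sample cell node i and sample cell node j; under this kernel the two nearby cells receive a much larger similarity than either does with the distant cell.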

In an exemplary embodiment, the computer device may apply two different perturbations to the cell feature matrix of the sample cells to obtain the first feature matrix and the second feature matrix, respectively.

Step 104: perturb the adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix.

In an exemplary embodiment, the computer device may apply two different perturbations to the adjacency matrix of the sample cells to obtain the first adjacency matrix and the second adjacency matrix, respectively.

Step 106: encode a first graph by means of the feature encoder to be trained to obtain a first embedding matrix, and encode a second graph to obtain a second embedding matrix; the first graph includes the first feature matrix and the first adjacency matrix; the second graph includes the second feature matrix and the second adjacency matrix.

In an exemplary embodiment, in the first graph, X₁ denotes the first feature matrix and A_m denotes the first adjacency matrix; in the second graph, X₂ denotes the second feature matrix and A_d denotes the second adjacency matrix.

In an exemplary embodiment, the computer device may input the first graph and the second graph into their respective feature encoders to be trained for encoding, obtaining the first embedding matrix and the second embedding matrix, respectively. The feature encoders to be trained that correspond to the first graph and the second graph share parameters.

Step 108: fuse the first embedding matrix and the second embedding matrix to obtain a target embedding matrix.

In an exemplary embodiment, the computer device may linearly add the first embedding matrix and the second embedding matrix to obtain the target embedding matrix.
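Encoding the two graph views with shared parameters and fusing the embeddings by addition can be sketched as follows. The encoder architecture is not given at this point in the text, so the single graph-convolution layer, the symmetric normalization with self-loops, and the names `normalize_adj`, `encode`, and `W` are assumptions made purely for illustration; the placeholder inputs are random.

```python
import numpy as np

def normalize_adj(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalize A after adding self-loops (an assumed choice)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def encode(X: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One graph-convolution layer: ReLU(norm(A) @ X @ W)."""
    return np.maximum(normalize_adj(A) @ X @ W, 0.0)

rng = np.random.default_rng(0)
N, F, D = 5, 8, 3                      # cells, input features, embedding size
X1, X2 = rng.random((N, F)), rng.random((N, F))   # placeholder perturbed feature matrices
A_m = np.ones((N, N)) - np.eye(N)      # placeholder first adjacency matrix
A_d = rng.random((N, N))
A_d = (A_d + A_d.T) / 2                # placeholder second adjacency matrix

W = rng.normal(size=(F, D))            # one weight matrix shared by both views
Z1 = encode(X1, A_m, W)                # first embedding matrix
Z2 = encode(X2, A_d, W)                # second embedding matrix
Z = Z1 + Z2                            # target embedding matrix via linear addition
```

Using the same `W` for both calls is what parameter sharing between the two encoders amounts to in this sketch.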

Step 110: decode the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix.

In an exemplary embodiment, the computer device may input the target embedding matrix into a decoder for decoding to obtain at least one of the reconstructed cell feature matrix and the reconstructed adjacency matrix.

In an exemplary embodiment, the computer device may input the target embedding matrix into the decoder for decoding to obtain the reconstructed cell feature matrix, and then determine the reconstructed adjacency matrix from the reconstructed cell feature matrix and its transpose. In an exemplary embodiment, the computer device may take the inner product of the reconstructed cell feature matrix and its transpose to obtain the reconstructed adjacency matrix.
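The decoding step, including the inner product that yields the reconstructed adjacency matrix, can be sketched as follows. A plain linear map stands in for the decoder here as an assumption (the names `decode` and `W_dec` are hypothetical); only the inner product of the reconstructed feature matrix with its transpose is taken from the text.

```python
import numpy as np

def decode(Z: np.ndarray, W_dec: np.ndarray):
    """Map embeddings back to feature space, then rebuild the adjacency."""
    X_hat = Z @ W_dec          # reconstructed cell feature matrix (assumed linear decoder)
    A_hat = X_hat @ X_hat.T    # inner product of X_hat with its transpose
    return X_hat, A_hat

rng = np.random.default_rng(0)
Z = rng.random((5, 3))             # target embedding matrix: 5 cells, 3 dimensions
W_dec = rng.normal(size=(3, 8))    # hypothetical decoder weights
X_hat, A_hat = decode(Z, W_dec)
```

Because `A_hat` is a product of a matrix with its own transpose, it is always an N x N symmetric matrix, matching the shape of the adjacency matrix it is compared against.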

In an exemplary embodiment, the decoder may be a decoder based on a graph convolutional neural network.

Step 112: iteratively optimize the parameters of the feature encoder to be trained according to at least one of a first difference and a second difference until an iteration stop condition is met, thereby obtaining a trained feature encoder; the feature encoder is used to extract the features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

In an exemplary embodiment, the computer device may determine a reconstruction loss value according to at least one of the first difference and the second difference, and iteratively optimize the parameters of the feature encoder to be trained according to the reconstruction loss value until the iteration stop condition is met, thereby obtaining the trained feature encoder.

In an exemplary embodiment, the computer device may determine a feature matrix reconstruction loss value according to the first difference, determine an adjacency matrix reconstruction loss value according to the second difference, and determine the reconstruction loss value as the sum of the feature matrix reconstruction loss value and the adjacency matrix reconstruction loss value.

In an exemplary embodiment, the computer device may determine the feature matrix reconstruction loss value according to the 2-norm of the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells. In an exemplary embodiment, the feature matrix reconstruction loss value L_REC-F may be calculated by the following formula:

[formula not reproduced in this text]

where N is the number of sample cells and X is the cell feature matrix of the sample cells; the remaining term in the formula is the reconstructed cell feature matrix.

In an exemplary embodiment, the computer device may determine the adjacency matrix reconstruction loss value according to the 2-norm of the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells. In an exemplary embodiment, the adjacency matrix reconstruction loss value L_REC-A may be calculated by the following formula:

[formula not reproduced in this text]

where N is the number of sample cells and A is the adjacency matrix of the sample cells; the remaining term in the formula is the reconstructed adjacency matrix.

In an exemplary embodiment, the reconstruction loss value L_REC may be calculated using the following formula:

L_REC = L_REC-F + L_REC-A
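The combined reconstruction loss can be sketched as follows. Only the use of a 2-norm of the difference, the count N, and the sum L_REC = L_REC-F + L_REC-A come from the text; the exact formulas for the two terms are not available here, so the squared Frobenius norm and the division by N below are assumptions, and the function name is hypothetical.

```python
import numpy as np

def reconstruction_loss(X, X_hat, A, A_hat):
    """Return (L_REC, L_REC_F, L_REC_A) under an assumed 1/N squared-norm form."""
    n = X.shape[0]                                  # N: number of sample cells
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    loss_f = np.linalg.norm(X - X_hat) ** 2 / n     # feature matrix term (assumed form)
    loss_a = np.linalg.norm(A - A_hat) ** 2 / n     # adjacency matrix term (assumed form)
    return loss_f + loss_a, loss_f, loss_a

# Tiny example: features are entirely wrong, adjacency is reconstructed exactly.
X, X_hat = np.zeros((2, 2)), np.ones((2, 2))
A, A_hat = np.eye(2), np.eye(2)
l_rec, l_f, l_a = reconstruction_loss(X, X_hat, A, A_hat)
```

In this example the adjacency term vanishes, so the total loss equals the feature term alone.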

In an exemplary embodiment, the iteration stop condition may include the reconstruction loss value being less than or equal to a first preset loss threshold. In other embodiments, the iteration stop condition may include the number of iterations being greater than or equal to a preset iteration threshold.
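The two stop conditions can be sketched in a single training loop. Combining them in one loop is a choice made here for illustration (the text presents them as alternative embodiments), and `train`, `step`, `loss_threshold`, and `max_iters` are hypothetical names; `step` stands for one parameter update that returns the current reconstruction loss value.

```python
def train(step, loss_threshold=1e-3, max_iters=1000):
    """Call step() until the loss threshold or the iteration cap is reached."""
    iterations, loss = 0, float("inf")
    while iterations < max_iters:          # iteration-count stop condition
        loss = step()                      # one update; returns current L_REC
        iterations += 1
        if loss <= loss_threshold:         # loss-threshold stop condition
            break
    return iterations, loss

# Stand-in for real training: a fixed sequence of decreasing loss values.
losses = iter([1.0, 0.5, 0.0005, 0.0001])
iterations, final_loss = train(lambda: next(losses))
```

With the stand-in sequence above, the loop stops at the first loss value that does not exceed the threshold rather than consuming the whole sequence.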

In the above cell clustering feature extraction method, the cell feature matrix of the sample cells is perturbed to obtain the first feature matrix and the second feature matrix, and the adjacency matrix of the sample cells is perturbed to obtain the first adjacency matrix and the second adjacency matrix, so that richer features are obtained under different views of the graph. Twin feature encoders to be trained then encode the first graph and the second graph, which are information-rich different views, and the encoding results of the first graph and the second graph are fused to obtain the target embedding matrix, so that the target embedding matrix carries richer and deeper features. The target embedding matrix is then decoded to obtain the reconstructed cell feature matrix and the reconstructed adjacency matrix, from which the reconstruction loss is determined. In this way the feature encoder can be trained accurately, and the trained feature encoder, being able to extract deep-level features, can accurately extract the features required for cell clustering.

In an exemplary embodiment, perturbing the cell feature matrix of the sample cells to obtain the first feature matrix and the second feature matrix includes: acquiring a first noise matrix and a second noise matrix; and perturbing the cell feature matrix of the sample cells according to the first noise matrix and the second noise matrix, respectively, to obtain the first feature matrix and the second feature matrix.

In an exemplary embodiment, the computer device may obtain the first noise matrix and the second noise matrix by random sampling from a Gaussian distribution.

In an exemplary embodiment, the computer device may multiply the first noise matrix by the cell feature matrix of the sample cells to obtain the first feature matrix, and multiply the second noise matrix by the cell feature matrix of the sample cells to obtain the second feature matrix.
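The noise-based feature perturbation can be sketched as follows. The text states only that the noise matrices are sampled from a Gaussian distribution and multiplied with the cell feature matrix; the element-wise interpretation of the multiplication, the mean of 1.0, the standard deviation of 0.1, and the function name `perturb_features` are assumptions made here.

```python
import numpy as np

def perturb_features(X: np.ndarray, rng: np.random.Generator,
                     mean: float = 1.0, std: float = 0.1) -> np.ndarray:
    """Multiply X element-wise by a freshly sampled Gaussian noise matrix."""
    noise = rng.normal(loc=mean, scale=std, size=X.shape)  # noise matrix, same shape as X
    return X * noise

rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(4, 3)   # placeholder cell feature matrix
X1 = perturb_features(X, rng)                  # first feature matrix
X2 = perturb_features(X, rng)                  # second feature matrix (independent noise)
```

Drawing the two noise matrices independently is what makes the two views differ; with multiplicative noise, entries of X that are zero remain zero in both views.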

In the above embodiment, perturbing the cell feature matrix of the sample cells according to the first noise matrix and the second noise matrix, respectively, to obtain the first feature matrix and the second feature matrix yields richer feature matrices.

In an exemplary embodiment, perturbing the adjacency matrix of the sample cells to obtain the first adjacency matrix and the second adjacency matrix includes: determining the distance similarity between the sample cell nodes according to the adjacency matrix of the sample cells; deleting, from the adjacency matrix of the sample cells, the edges between sample cell nodes whose distance similarity meets a preset condition, to obtain the first adjacency matrix; and determining the importance of each sample cell node according to the adjacency matrix of the sample cells and adjusting the distance similarity between the sample cell nodes according to the importance, to obtain the second adjacency matrix.

Here, one sample cell node corresponds to one sample cell.

In an exemplary embodiment, the edges between sample cell nodes that meet the preset condition may be the edges between sample cells whose distance similarity is less than or equal to a preset similarity threshold; or the edges corresponding to a preset number of the smallest distance similarities; or the edges corresponding to a preset proportion of the smallest distance similarities. For example, the edges corresponding to the smallest 10% of distance similarities may be deleted.

In an exemplary embodiment, the computer device may use the personalized PageRank (PPR) algorithm to determine the importance of each sample cell node according to the adjacency matrix of the sample cells, and adjust the distance similarity between the sample cell nodes according to the importance, to obtain the second adjacency matrix.
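The two adjacency perturbations can be sketched as follows. The edge deletion follows the "smallest proportion of distance similarities" variant described above. For the PPR step, the text does not give a formula, so the closed-form PPR graph diffusion alpha * (I - (1 - alpha) * T)^-1 with a symmetrically normalized T is a commonly used formulation substituted here as an assumption; the function names and the `alpha` and `fraction` parameters are hypothetical, and a symmetric, non-negative adjacency matrix is assumed.

```python
import numpy as np

def drop_weak_edges(A: np.ndarray, fraction: float = 0.1) -> np.ndarray:
    """Zero out the given fraction of existing edges with the smallest similarity."""
    A = A.copy()
    iu = np.triu_indices_from(A, k=1)          # each undirected edge once
    weights = A[iu]
    edge_idx = np.flatnonzero(weights > 0)     # positions of existing edges
    n_drop = int(len(edge_idx) * fraction)
    weakest = edge_idx[np.argsort(weights[edge_idx])[:n_drop]]
    rows, cols = iu[0][weakest], iu[1][weakest]
    A[rows, cols] = 0.0                        # remove both directions
    A[cols, rows] = 0.0
    return A

def ppr_diffusion(A: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Closed-form personalized PageRank diffusion of A (assumed formulation)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # self-loops keep every degree positive
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    T = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * T)

# A 4-cell chain-like graph with one weak edge between cells 0 and 2.
A = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.8, 0.0],
              [0.1, 0.8, 0.0, 0.7],
              [0.0, 0.0, 0.7, 0.0]])
A_m = drop_weak_edges(A, fraction=0.25)   # first adjacency matrix (weakest edge removed)
A_d = ppr_diffusion(A)                    # second adjacency matrix (diffused)
```

The diffused matrix assigns a nonzero weight even to cells 0 and 3, which share no direct edge in `A`; this illustrates how a PPR-based view can expose longer-range relationships to a shallow encoder.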

In the above embodiment, the edges between sample cell nodes whose distance similarity meets the preset condition are deleted from the adjacency matrix of the sample cells to obtain the first adjacency matrix, and the importance of each sample cell node is determined according to the adjacency matrix of the sample cells and the distance similarity between the sample cell nodes is adjusted according to the importance to obtain the second adjacency matrix, so that richer adjacency matrices are obtained. Generating the second adjacency matrix with the personalized PageRank (PPR) algorithm improves the ability of a shallow network model to capture long-range information, which further improves clustering performance.

In an exemplary embodiment, before perturbing the cell feature matrix of the sample cells to obtain the first feature matrix and the second feature matrix, the method further includes: acquiring a spatial position matrix and a gene expression matrix of the sample cells, where the spatial position matrix characterizes the spatial position of each sample cell and the gene expression matrix characterizes the genes of each sample cell; determining the adjacency matrix of the sample cells according to the spatial position matrix; and combining the adjacency matrix and the gene expression matrix to obtain the cell feature matrix of the sample cells.

In an exemplary embodiment, the computer device may concatenate the adjacency matrix and the gene expression matrix to obtain the cell feature matrix of the sample cells.

For example, assume that the gene expression matrix is an $N \times M$ matrix and the adjacency matrix is an $N \times N$ matrix, where $N$ is the number of sample cells and $M$ is the number of genes. The concatenation is then [gene expression matrix | adjacency matrix], that is, the adjacency matrix is appended to the right of the gene expression matrix; alternatively, it may be placed to the left. In other embodiments, if the gene expression matrix is an $M \times N$ matrix and the adjacency matrix is an $N \times N$ matrix, the adjacency matrix may be placed below or above the gene expression matrix.
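
The $N \times M$ case can be sketched as follows; the function name `concat_features` and the list-of-lists layout are assumptions made for the example.

```python
def concat_features(expression, adjacency):
    """Build [expression | adjacency]: an N x (M + N) cell feature matrix.

    expression: N x M list of lists (one row per cell, one column per gene).
    adjacency:  N x N list of lists.
    """
    if len(expression) != len(adjacency):
        raise ValueError("both matrices need one row per cell")
    # Append each cell's adjacency row to the right of its expression row.
    return [list(x_row) + list(a_row)
            for x_row, a_row in zip(expression, adjacency)]
```

Swapping the two operands in the list concatenation gives the left-side variant.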

In the above embodiment, the adjacency matrix of the sample cells is determined according to the spatial position matrix, and the adjacency matrix and the gene expression matrix are combined to obtain the cell feature matrix of the sample cells, so that feature information covering multiple aspects of the sample cells is obtained.

In an exemplary embodiment, encoding the first graph with the feature encoder to be trained to obtain the first embedding matrix, and encoding the second graph to obtain the second embedding matrix, includes: taking the first graph and the second graph each as a graph to be encoded, and inputting the feature matrix and the adjacency matrix of the graph to be encoded into the corresponding feature encoder to be trained, so as to encode the graph layer by layer, where the feature encoders corresponding to the first graph and the second graph share parameters; during layer-by-layer encoding, taking the first encoding layer of the feature encoder as the current layer and the feature matrix input to the first encoding layer as the embedding matrix input to the current layer; determining the fusion matrix of the current layer from the product of the embedding matrix input to the current layer and the normalized matrix of the adjacency matrix; performing a weighted fusion of the fusion matrix of the current layer and the embedding matrix input to the current layer according to the weights of the current layer to obtain the embedding matrix output by the current layer; taking the embedding matrix output by the current layer as the input of the next layer, taking the next layer as the new current layer, and returning to the step of determining the fusion matrix of the current layer and the subsequent steps; and taking the embedding matrix output by the last layer of the feature encoder as the embedding matrix corresponding to the graph to be encoded. When the graph to be encoded is the first graph, the corresponding embedding matrix is the first embedding matrix; when it is the second graph, the corresponding embedding matrix is the second embedding matrix.

In an exemplary embodiment, the computer device may weight the product of the embedding matrix input to the current layer and the normalized matrix of the adjacency matrix with a weight matrix of the current layer to obtain a first result, weight the embedding matrix input to the current layer with a weight matrix of the current layer to obtain a second result, and then add the nonlinear activation of the first result to the nonlinear activation of the second result to obtain the embedding matrix output by the current layer.

In an exemplary embodiment, the formula for layer $l$ of the feature encoder is as follows:

where $A_m$ and $A_d$ are the first adjacency matrix and the second adjacency matrix, respectively; $D_m$ and $D_d$ are the degree matrices of $A_m$ and $A_d$; $I$ is the identity matrix; $\hat{A}_m$ and $\hat{A}_d$ are the normalized matrices of $A_m$ and $A_d$; $W_1^{(l)}$ and $W_2^{(l)}$ are the weight matrices of layer $l$ of the feature encoder; $b^{(l)}$ is the bias vector of layer $l$; and $\sigma$ is the nonlinear activation function, for example ReLU or Tanh. $H_1^{(l)}$ and $H_1^{(l-1)}$ are the encoding results of layers $l$ and $l-1$ of the feature encoder for the first graph, and $H_2^{(l)}$ and $H_2^{(l-1)}$ are the encoding results of layers $l$ and $l-1$ for the second graph. When $l = 1$, $H_1^{(0)}$ is the input first graph and $H_2^{(0)}$ is the input second graph.
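
One form of the layer update consistent with the verbal description (a weighted propagation term and a weighted skip term, each passed through the activation and then summed) is sketched below. The symbols $\hat{A}_m$, $\hat{A}_d$, $W_1^{(l)}$, $W_2^{(l)}$, $H_1^{(l)}$ and $H_2^{(l)}$ are names chosen here for the normalized adjacency matrices, the two weight matrices of layer $l$, and the per-graph encoding results. The symmetric normalization and the placement of the bias $b^{(l)}$ inside both activations are assumptions, not taken from the source.

```latex
\hat{A}_m = D_m^{-1/2}\,(A_m + I)\,D_m^{-1/2}, \qquad
\hat{A}_d = D_d^{-1/2}\,(A_d + I)\,D_d^{-1/2}

H_1^{(l)} = \sigma\!\left(\hat{A}_m H_1^{(l-1)} W_1^{(l)} + b^{(l)}\right)
          + \sigma\!\left(H_1^{(l-1)} W_2^{(l)} + b^{(l)}\right)

H_2^{(l)} = \sigma\!\left(\hat{A}_d H_2^{(l-1)} W_1^{(l)} + b^{(l)}\right)
          + \sigma\!\left(H_2^{(l-1)} W_2^{(l)} + b^{(l)}\right)
```

Both branches use the same $W_1^{(l)}$, $W_2^{(l)}$ and $b^{(l)}$, which is what parameter sharing between the two encoders amounts to.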

In the above embodiment, negative samples do not need to be mined; only positive samples are encoded, by an encoder with a Siamese network structure. Compared with contrastive learning methods, this reduces space usage, so that richer feature representations are learned while less space is used.

In an exemplary embodiment, decoding the target embedding matrix to obtain at least one of the reconstructed cell feature matrix and the reconstructed adjacency matrix includes: inputting the adjacency matrix of the sample cells and the target embedding matrix into a decoder, so as to decode the target embedding matrix layer by layer; during layer-by-layer decoding, taking the first decoding layer of the decoder as the current layer and the target embedding matrix input to the first decoding layer as the input of the current layer; weighting the product of the input of the current layer and the normalized adjacency matrix of the sample cells to obtain the decoding result output by the current layer; taking the decoding result output by the current layer as the input of the next layer, taking the next layer as the new current layer, and returning to the weighting step and the subsequent steps; taking the decoding result output by the last layer of the decoder as the reconstructed cell feature matrix; and determining the reconstructed adjacency matrix from the reconstructed cell feature matrix and its transpose.

In an exemplary embodiment, during decoding, the computer device may weight the product of the input of the current layer and the normalized adjacency matrix of the sample cells with the weight matrix of the current layer, and then apply the nonlinear activation function to the weighted result to obtain the decoding result output by the current layer.

In an exemplary embodiment, the decoding result of layer $k$ of the decoder is calculated by the following formula:

where $H^{(k)}$ is the decoding result of layer $k$ of the decoder; $H^{(k-1)}$ is the decoding result of layer $k-1$; $\sigma$ is the nonlinear activation function; $A$ is the adjacency matrix of the sample cells; $I$ is the identity matrix; $D$ is the degree matrix of $A$; $\hat{A}$ is the normalized form of $A$; and $W^{(k)}$ is the parameter matrix of layer $k$ of the decoder.

In an exemplary embodiment, the computer device may take the inner product of the reconstructed cell feature matrix and its transpose to obtain the reconstructed adjacency matrix.
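
The decoding steps above can be sketched in plain Python. The sketch assumes the layer update $H^{(k)} = \sigma(\hat{A} H^{(k-1)} W^{(k)})$ with $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$, ReLU as the activation, and degrees taken after adding self-loops; these choices and all function names are assumptions made for the example, not details confirmed by the text.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with degrees taken from A + I."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def decode(h, adj, weights):
    """Run the layer-by-layer decoder, then rebuild the adjacency matrix.

    h:       N x d target embedding matrix.
    adj:     N x N adjacency matrix of the sample cells.
    weights: one parameter matrix per decoding layer.
    """
    a_hat = normalize_adjacency(adj)
    for w in weights:
        h = relu(matmul(matmul(a_hat, h), w))  # sigma(A_hat @ H @ W)
    x_rec = h
    x_rec_t = [list(col) for col in zip(*x_rec)]
    a_rec = matmul(x_rec, x_rec_t)  # inner product X_rec @ X_rec^T
    return x_rec, a_rec
```

The two returned matrices are what the first difference and the second difference are measured against.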

In the above embodiment, the decoder performs layer-by-layer decoding to obtain the reconstructed cell feature matrix and the reconstructed adjacency matrix, so that the reconstruction loss can be calculated accurately. The reconstruction loss value makes the low-dimensional feature embedding account for both gene expression information and spatial position information, thereby improving the accuracy of the trained feature encoder.

In an exemplary embodiment, iteratively optimizing the parameters of the feature encoder to be trained according to at least one of the first difference and the second difference includes: determining a reconstruction loss value according to at least one of the first difference and the second difference; determining a target loss value according to the reconstruction loss value and at least one of a clustering guidance loss value and a de-redundancy loss value; and iteratively optimizing the parameters of the feature encoder to be trained according to the target loss value. The clustering guidance loss value is determined from the difference between a soft assignment matrix and a target distribution matrix; the soft assignment matrix is determined from the clustering result obtained by clustering based on the target embedding matrix; the target distribution matrix is obtained by normalizing the soft assignment matrix; and the de-redundancy loss value is determined from the difference between the first embedding matrix and the second embedding matrix.

In an exemplary embodiment, the computer device may determine the target loss value as the sum of the reconstruction loss value and at least one of the clustering guidance loss value and the de-redundancy loss value.

In an exemplary embodiment, the target loss value $L$ may be calculated as:

$$L = L_{REC} + L_{C} + L_{RR}$$

where $L_{REC}$ is the reconstruction loss value, $L_{C}$ is the clustering guidance loss value, and $L_{RR}$ is the de-redundancy loss value.

In an exemplary embodiment, the iteration stop condition may be that the target loss value is less than or equal to a second preset loss threshold. In other embodiments, the iteration stop condition may include the number of iterations being greater than or equal to a preset count threshold.

In the above embodiment, the target loss value is determined according to the reconstruction loss value and at least one of the clustering guidance loss value and the de-redundancy loss value. The clustering guidance loss value allows feature embeddings relevant to the clustering task to be learned effectively; the de-redundancy loss value removes redundant information from the embeddings and yields a distinguishable embedding for each sample cell node; and the reconstruction loss value makes the low-dimensional feature embedding account for both gene expression information and spatial position information. Together these further improve the accuracy of the trained feature encoder.

In an exemplary embodiment, determining the clustering guidance loss value includes: clustering according to the target embedding matrix to obtain reference cluster centers; determining a soft assignment matrix between the vectors corresponding to the individual sample cells in the target embedding matrix and the reference cluster centers, where the soft assignment matrix characterizes the probability that each vector is assigned to each reference cluster center; normalizing the soft assignment matrix to generate the target distribution matrix; and determining the clustering guidance loss value from the difference between the distributions of the soft assignment matrix and the target distribution matrix.

Each row of the target embedding matrix is the vector corresponding to one sample cell.

In an exemplary embodiment, the computer device may cluster the vectors corresponding to the individual sample cells in the target embedding matrix to obtain the reference cluster centers.

In an exemplary embodiment, the computer device may use Student's t-distribution to calculate the soft assignment matrix between the vectors corresponding to the individual sample cells in the target embedding matrix and the reference cluster centers.

In an exemplary embodiment, the element $p_{ij}$ in row $i$, column $j$ of the target distribution matrix $P$ may be calculated by the following formula:

where $q_{ij}$ is the element in row $i$, column $j$ of the soft assignment matrix $Q$, and $j'$ is a column index of the soft assignment matrix.
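
A target distribution consistent with these definitions is the one used in deep embedded clustering, which squares each soft assignment and renormalizes it by the per-cluster frequency. Treating that as the intended form is an assumption:

```latex
p_{ij} = \frac{q_{ij}^{2} \,/\, \sum_{i} q_{ij}}
              {\sum_{j'} \left( q_{ij'}^{2} \,/\, \sum_{i} q_{ij'} \right)}
```

Squaring sharpens confident assignments, and dividing by the column sum $\sum_i q_{ij}$ keeps large clusters from dominating.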

In an exemplary embodiment, the computer device may calculate the clustering guidance loss value from the soft assignment matrix and the target distribution matrix using the KL divergence. The formula is as follows:
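
The whole clustering guidance loss can be sketched in plain Python. The sketch assumes a Student's t kernel with one degree of freedom for $q_{ij}$, the squared-and-renormalized target distribution for $p_{ij}$, and the divergence direction $L_C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log (p_{ij} / q_{ij})$. None of these specifics is confirmed by the text, and the function names are introduced here.

```python
import math


def soft_assignment(embeddings, centers):
    """Student's t (1 degree of freedom) soft assignment Q, one row per cell."""
    q = []
    for h in embeddings:
        w = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(h, c)))
             for c in centers]
        total = sum(w)
        q.append([v / total for v in w])  # each row sums to 1
    return q


def target_distribution(q):
    """Sharpened target P: square q, divide by column sums, renormalize rows."""
    col = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        w = [row[j] ** 2 / col[j] for j in range(len(row))]
        total = sum(w)
        p.append([v / total for v in w])
    return p


def kl_loss(p, q):
    """KL(P || Q) summed over all cells and clusters."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)
```

Minimizing this loss pulls the soft assignments toward the sharper target, which is the alignment of the two distributions that guides training.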

In the above embodiment, a soft assignment distribution and a target distribution are generated from the clustering result of the target embedding matrix, and the two distributions are then aligned by means of the clustering guidance loss value to guide network learning. This improves the accuracy of the feature encoder and allows feature embeddings relevant to the clustering task to be learned effectively.

In an exemplary embodiment, the de-redundancy loss value includes a first de-redundancy loss value, and determining the de-redundancy loss value includes: determining a node similarity matrix according to the similarity between the first embedding matrix and the second embedding matrix; and determining the first de-redundancy loss value according to the difference between the node similarity matrix and the identity matrix.

In an exemplary embodiment, the similarity may be cosine similarity. The node similarity matrix $S_N$ may be calculated by the following formula:

where $H_1$ is the first embedding matrix, $H_2$ is the second embedding matrix, $H_2^{T}$ denotes the transpose of $H_2$, and $\lVert \cdot \rVert$ denotes the norm of a matrix.

In an exemplary embodiment, the computer device may calculate the mean squared error (MSE) between the node similarity matrix and the identity matrix to determine the first de-redundancy loss value. The first de-redundancy loss value $L_{RR\text{-}N}$ may be calculated by the following formula:

where $N$ is the number of sample cells, $S_N$ is the node similarity matrix, and $I$ is the identity matrix.
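
Forms consistent with these definitions are $S_N = H_1 H_2^{T} / (\lVert H_1 \rVert \, \lVert H_2 \rVert)$ and $L_{RR\text{-}N} = \frac{1}{N^2} \sum_{i,j} (S_N - I)_{ij}^2$. The sketch below reads the norm row by row, so that entry $(i, j)$ is the cosine similarity between cell $i$ in the first view and cell $j$ in the second view. That reading and the function names are assumptions.

```python
import math


def cosine_similarity_matrix(h1, h2):
    """S[i][j] = cosine similarity between row i of h1 and row j of h2."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [[sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
             for v in h2] for u in h1]


def redundancy_loss(s):
    """Mean squared difference between a square matrix and the identity."""
    n = len(s)
    return sum((s[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n)) / n ** 2
```

The loss is zero when each cell matches only itself across the two views, and it grows when different cells share the same direction in the embedding space.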

In the above embodiment, the node similarity matrix is determined according to the similarity between the first embedding matrix and the second embedding matrix, and the first de-redundancy loss value is determined according to the difference between the node similarity matrix and the identity matrix. This decorrelation operation lets the network reduce redundant information between sample cell nodes in the latent space, which makes the learned embeddings more discriminative, strengthens the robustness of the feature encoder, and addresses the representation collapse problem found in traditional methods.

In an exemplary embodiment, the de-redundancy loss value further includes a second de-redundancy loss value, and determining the de-redundancy loss value further includes: mapping the first embedding matrix and the second embedding matrix to cluster-level embeddings to obtain a first cluster-level embedding matrix and a second cluster-level embedding matrix; determining a cluster-level similarity matrix according to the similarity between the first cluster-level embedding matrix and the second cluster-level embedding matrix; and determining the second de-redundancy loss value according to the difference between the cluster-level similarity matrix and the identity matrix.

A cluster-level embedding is obtained by clustering the vectors of the individual sample cell nodes in an embedding matrix and averaging the vectors that fall in the same cluster.

In an exemplary embodiment, the computer device may use a readout function to map the first embedding matrix and the second embedding matrix to cluster-level embeddings, obtaining the first cluster-level embedding matrix and the second cluster-level embedding matrix.

In an exemplary embodiment, the similarity may be cosine similarity. The cluster-level similarity matrix $S_F$ may be calculated by the following formula:

where $Z_1$ is the first cluster-level embedding matrix and $Z_2$ is the second cluster-level embedding matrix.

In an exemplary embodiment, the computer device may calculate the MSE between the cluster-level similarity matrix and the identity matrix to determine the second de-redundancy loss value. The second de-redundancy loss value $L_{RR\text{-}F}$ may be calculated by the following formula:

where $D$ is the dimension of the target embedding matrix $H$, $S_F$ is the cluster-level similarity matrix, and $I$ is the identity matrix.
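
Forms consistent with these definitions, mirroring the node-level case, are sketched below. They assume that each cluster-level embedding matrix is arranged as $Z \in \mathbb{R}^{D \times K}$, with one row per embedding dimension and one column per cluster ($K$ is a hypothetical symbol for the number of clusters), so that $S_F$ is $D \times D$ and $1/D^2$ is the MSE normalization. The exact normalization is not recoverable from the text.

```latex
S_F = \frac{Z_1 Z_2^{T}}{\lVert Z_1 \rVert \, \lVert Z_2 \rVert},
\qquad
L_{RR\text{-}F} = \frac{1}{D^{2}} \sum_{i=1}^{D} \sum_{j=1}^{D}
                  \left( S_F - I \right)_{ij}^{2}
```

Under this reading the loss decorrelates embedding dimensions rather than cells, which matches the statement that redundancy is reduced at the feature level.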

In an exemplary embodiment, the computer device may determine the de-redundancy loss value as the sum of the first de-redundancy loss value and the second de-redundancy loss value. The de-redundancy loss value $L_{RR}$ is calculated as:

$$L_{RR} = L_{RR\text{-}N} + L_{RR\text{-}F}$$

In the above embodiment, mapping the node embeddings to cluster-level embeddings reduces redundant information at the feature level, which makes the learned embeddings more discriminative.

In an exemplary embodiment, the method further includes: inputting a graph composed of the cell feature matrix and the adjacency matrix of the cells to be clustered into the trained feature encoder to obtain an extracted embedding matrix; and clustering the embedding vectors corresponding to the individual cells to be clustered in the extracted embedding matrix to obtain a cell clustering result for the cells to be clustered.

The extracted embedding matrix is the embedding matrix output by the feature encoder after encoding the input graph. The cell clustering result of the cells to be clustered is the result of dividing those cells into multiple cell clusters.

In an exemplary embodiment, the cells to be clustered may be cells other than the sample cells. In other embodiments, the cells to be clustered may be the sample cells themselves.

In an exemplary embodiment, each row of the extracted embedding matrix is the embedding vector corresponding to one cell to be clustered. The computer device may cluster the embedding vectors corresponding to the individual cells to be clustered in the extracted embedding matrix, and assign the cells whose vectors fall in the same cluster to the same cell cluster.
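
The clustering algorithm is not named in the text. As an illustration only, the sketch below assumes k-means (Lloyd's algorithm) with caller-supplied initial centers, then groups cell identifiers by cluster label; the function names and the choice of k-means are assumptions.

```python
def kmeans(vectors, centers, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per embedding row.

    vectors: list of embedding vectors (one per cell to be clustered).
    centers: initial cluster centers, supplied by the caller.
    """
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(len(centers)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(v, centers[k])))
                  for v in vectors]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


def group_cells(cell_ids, labels):
    """Put cells whose vectors share a cluster label into the same cell cluster."""
    groups = {}
    for cid, lab in zip(cell_ids, labels):
        groups.setdefault(lab, []).append(cid)
    return groups
```

Any other clustering method that returns one label per embedding row could be used with `group_cells` in the same way.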

In the above embodiment, the embedding matrix of the cells to be clustered is extracted by a trained feature encoder capable of extracting deep features, and clustering is then performed on the extracted embedding matrix, which has captured those deep features. This improves the accuracy of cell clustering.

In an exemplary embodiment, the method further includes: clustering the embedding vectors corresponding to the individual sample cells in the target embedding matrix to obtain a cell clustering result for the sample cells.

The cell clustering result of the sample cells is the result of dividing the sample cells into multiple cell clusters.

In the above embodiment, when a set of sample cells needs to be clustered, a feature encoder capable of extracting deep features can first be trained on the cell feature matrix and the adjacency matrix of the sample cells, so that the resulting target embedding matrix reflects deep features. Clustering can then be performed directly on the target embedding matrix to obtain an accurate cell clustering result for the sample cells.

FIG. 2 is a schematic diagram of the overall flow of the cell clustering feature extraction method in the embodiments of the present application. First, the cell feature matrix of the sample cells is perturbed with a first noise matrix $N_1$ and a second noise matrix $N_2$ to obtain the first feature matrix $X_1$ and the second feature matrix $X_2$, and the adjacency matrix $A$ is perturbed in two different ways to obtain the first adjacency matrix $A_m$ and the second adjacency matrix $A_d$. The first graph and the second graph are then each input into the parameter-sharing feature encoder to be trained to obtain the first embedding matrix $H_1$ and the second embedding matrix $H_2$, which are linearly added to obtain the target embedding matrix $H$. Next, the target embedding matrix $H$ is input into the decoder to obtain the reconstructed cell feature matrix and the reconstructed adjacency matrix, from which the reconstruction loss value $L_{REC}$ is determined. The de-redundancy loss value $L_{RR}$ is determined according to the similarity between $H_1$ and $H_2$. The clustering guidance loss value $L_C$ is determined according to the result of clustering $H$. The parameters of the feature encoder are optimized according to $L_{REC}$, $L_{RR}$ and $L_C$, finally yielding a trained feature encoder that can be used to extract the features required for cell clustering.

In an exemplary embodiment, as shown in FIG. 3, a cell clustering method is provided. This embodiment is described using the example of the method being applied to a computer device. The computer device may be a terminal or a server. The terminal may be, but is not limited to, a personal computer, laptop, smartphone, tablet computer, Internet of Things device or portable wearable device. The Internet of Things device may be a smart speaker, smart TV, smart air conditioner, smart in-vehicle device, or the like. The portable wearable device may be a smart watch, smart bracelet, head-mounted device, or the like. The server may be implemented as an independent server or as a server cluster composed of multiple servers. In this embodiment, the method includes the following steps:

Step 302: input a graph composed of the cell feature matrix and the adjacency matrix of the cells to be clustered into a pre-trained feature encoder to obtain an extracted embedding matrix. The pre-trained feature encoder is trained in advance on a first graph and a second graph; the first graph includes a first feature matrix and a first adjacency matrix; the second graph includes a second feature matrix and a second adjacency matrix; the first feature matrix and the second feature matrix are obtained by perturbing the cell feature matrix of the sample cells; and the first adjacency matrix and the second adjacency matrix are obtained by perturbing the adjacency matrix of the sample cells.

Step 304: cluster the embedding vectors corresponding to the individual cells to be clustered in the extracted embedding matrix to obtain a cell clustering result for the cells to be clustered.

In the above cell clustering method, the embedding matrix of the cells to be clustered is extracted by a trained feature encoder capable of extracting deep features, and clustering is then performed on the extracted embedding matrix, which has captured those deep features. This improves the accuracy of cell clustering.

In an exemplary embodiment, training the feature encoder includes: perturbing the cell feature matrix of the sample cells to obtain the first feature matrix and the second feature matrix; perturbing the adjacency matrix of the sample cells to obtain the first adjacency matrix and the second adjacency matrix; encoding the first graph with the feature encoder to be trained to obtain the first embedding matrix, and encoding the second graph to obtain the second embedding matrix, where the first graph includes the first feature matrix and the first adjacency matrix and the second graph includes the second feature matrix and the second adjacency matrix; fusing the first embedding matrix and the second embedding matrix to obtain the target embedding matrix; decoding the target embedding matrix to obtain at least one of the reconstructed cell feature matrix and the reconstructed adjacency matrix; and iteratively optimizing the parameters of the feature encoder to be trained according to at least one of the first difference and the second difference until the iteration stop condition is met, to obtain the trained feature encoder. The feature encoder is used to extract the features required for cell clustering. The first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

In the above embodiment, the cell feature matrix of the sample cells is perturbed to obtain the first feature matrix and the second feature matrix, and the adjacency matrix of the sample cells is perturbed to obtain the first adjacency matrix and the second adjacency matrix, which yields richer features under different views of the graph. The first graph and the second graph, as two information-rich views, are then each encoded by the Siamese feature encoder to be trained, and the two encoding results are fused into the target embedding matrix, so that the target embedding matrix carries richer and deeper features. The target embedding matrix is then decoded to obtain the reconstructed cell feature matrix and the reconstructed adjacency matrix, from which the reconstruction loss is determined. The feature encoder can thus be trained accurately, and a feature encoder capable of extracting deep features can accurately extract the features required for cell clustering.

The feature encoder was evaluated on data from four stages of mouse embryonic development (E9.5 E1S1, E9.5 E2S2, E9.5 E2S3 and E9.5 E2S4) and on human breast cancer (10x Visium) data. Three widely used evaluation indicators were adopted: the Adjusted Rand Index (ARI), with a range of [-1, 1], where values closer to 1 are better; Normalized Mutual Information (NMI), with a range of [0, 1], where values closer to 1 are better; and the Fowlkes-Mallows Index (FMI), with a range of [0, 1], where values closer to 1 are better. All three indicators measure how well two label assignments agree, here the predicted clusters and the ground-truth annotation: the larger the value, the more closely the clustering result matches the ground truth.
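To make the three indicators concrete, the following is a minimal sketch that computes them from two label lists with the Python standard library only. It is an illustration, not the evaluation code used for the tables; the function names are hypothetical, and the NMI variant shown normalises by the arithmetic mean of the two entropies, which is one common convention that the source does not specify.

```python
from collections import Counter
from math import comb, log, sqrt


def pair_counts(labels_true, labels_pred):
    """Count cell pairs placed together in both, in the true, and in the predicted partition."""
    cells = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(n, 2) for n in cells.values())
    same_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    same_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    return same_both, same_true, same_pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    same_both, same_true, same_pred, total = pair_counts(labels_true, labels_pred)
    if total == 0:
        return 1.0
    expected = same_true * same_pred / total      # chance-level agreement
    max_index = (same_true + same_pred) / 2
    if max_index == expected:                     # degenerate partitions
        return 1.0
    return (same_both - expected) / (max_index - expected)


def fowlkes_mallows_index(labels_true, labels_pred):
    same_both, same_true, same_pred, _ = pair_counts(labels_true, labels_pred)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / sqrt(same_true * same_pred)


def normalized_mutual_information(labels_true, labels_pred):
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows, cols = Counter(labels_true), Counter(labels_pred)
    mi = sum((c / n) * log(c * n / (rows[a] * cols[b])) for (a, b), c in cells.items())
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    if h_true == 0 or h_pred == 0:
        return 1.0 if h_true == h_pred else 0.0
    return mi / ((h_true + h_pred) / 2)           # arithmetic-mean normalisation
```

All three functions depend only on which cells are grouped together, so renaming the cluster labels does not change the score.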

Table 1. Test results of the model on the E9.5 E1S1 stage data of mouse embryo

As can be seen from Table 1, the algorithm proposed in this application far outperforms existing models on most indicators.

Table 2. Test results of the model on the E9.5 E2S2 stage data of mouse embryo

As can be seen from Table 2, the algorithm proposed in this application far outperforms existing models on all indicators.

Table 3. Test results of the model on the E9.5 E2S3 stage data of mouse embryo

As can be seen from Table 3, the algorithm proposed in this application far outperforms existing models on all indicators.

Table 4. Test results of the model on the E9.5 E2S4 stage data of mouse embryo

As can be seen from Table 4, the algorithm proposed in this application far outperforms existing models on all indicators.

Table 5. Test results of the model on human breast cancer (10x Visium) data

As can be seen from Table 5, the algorithm proposed in this application far outperforms existing models on all indicators.

It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless expressly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor are they necessarily executed sequentially, but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiments of the present application also provide a cell clustering feature extraction device for implementing the cell clustering feature extraction method described above. The solution provided by the device is similar to the one described for the method, so for the specific limitations of the one or more device embodiments below, reference may be made to the limitations of the cell clustering feature extraction method above; they are not repeated here.

In an exemplary embodiment, as shown in FIG. 4, a cell clustering feature extraction device 400 is provided, comprising: a first perturbation module 402, a second perturbation module 404, an encoding module 406, a fusion module 408, a decoding module 410 and a parameter optimization module 412, wherein:

The first perturbation module 402 is used to perturb the cell feature matrix of the sample cells to obtain a first feature matrix and a second feature matrix.

The second perturbation module 404 is used to perturb the adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix.

The encoding module 406 is used to encode a first graph with the feature encoder to be trained to obtain a first embedding matrix, and to encode a second graph to obtain a second embedding matrix; the first graph includes the first feature matrix and the first adjacency matrix; the second graph includes the second feature matrix and the second adjacency matrix.

The fusion module 408 is used to fuse the first embedding matrix and the second embedding matrix to obtain a target embedding matrix.

The decoding module 410 is used to decode the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix.

The parameter optimization module 412 is used to iteratively optimize the parameters of the feature encoder to be trained according to at least one of a first difference and a second difference until an iteration stop condition is met, thereby obtaining a trained feature encoder; the feature encoder is used to extract the features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

In an exemplary embodiment, the first perturbation module 402 is further used to obtain a first noise matrix and a second noise matrix, and to perturb the cell feature matrix of the sample cells with the first noise matrix and the second noise matrix respectively, to obtain the first feature matrix and the second feature matrix.
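The noise-based perturbation can be illustrated with a short NumPy sketch. It assumes that each noise matrix is drawn from a Gaussian centred at 1 and applied by element-wise multiplication; the passage above only states that two noise matrices are used, so the distribution, the multiplicative form, and the names `perturb_features`, `noise_std` and `seed` are illustrative assumptions.

```python
import numpy as np


def perturb_features(x, noise_std=0.1, seed=0):
    """Return two noisy views of the cell feature matrix x (cells x features).

    Assumption: each view multiplies x element-wise by an independent
    Gaussian noise matrix with mean 1 and standard deviation noise_std.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    noise_1 = rng.normal(loc=1.0, scale=noise_std, size=x.shape)  # first noise matrix
    noise_2 = rng.normal(loc=1.0, scale=noise_std, size=x.shape)  # second noise matrix
    return x * noise_1, x * noise_2
```

Because the two noise matrices are drawn independently, the two views differ from each other while staying close to the original matrix when `noise_std` is small.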

In an exemplary embodiment, the second perturbation module 404 is further used to determine the distance similarity between the sample cell nodes according to the adjacency matrix of the sample cells; delete, from the adjacency matrix of the sample cells, the edges between sample cell nodes whose distance similarity meets a preset condition, to obtain the first adjacency matrix; and determine the importance of each sample cell node according to the adjacency matrix of the sample cells, and adjust the distance similarity between the sample cell nodes according to the importance, to obtain the second adjacency matrix.
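A hedged sketch of the two adjacency perturbations follows. The passage does not state the preset condition or how importance is measured, so this example assumes that the adjacency matrix already stores pairwise similarities as edge weights, that edges with similarity below a threshold are deleted, and that node importance is the weighted degree scaled to at most 1. The names `perturb_adjacency` and `drop_below` are hypothetical.

```python
import numpy as np


def perturb_adjacency(sim, drop_below=0.3):
    """Build two views of a symmetric, similarity-weighted adjacency matrix.

    Assumptions (not specified in the source): the preset condition is
    "similarity below drop_below", and importance is the scaled weighted degree.
    """
    sim = np.asarray(sim, dtype=float)
    # View 1: delete the edges whose similarity meets the (assumed) condition.
    adj_1 = np.where(sim >= drop_below, sim, 0.0)
    # View 2: rescale each edge by the mean importance of its two endpoints.
    degree = sim.sum(axis=1)
    top = degree.max()
    importance = degree / top if top > 0 else np.ones_like(degree)
    adj_2 = sim * (importance[:, None] + importance[None, :]) / 2.0
    return adj_1, adj_2
```

The first view changes the graph topology, while the second keeps every edge and only changes its weight, so the two views perturb the graph in complementary ways.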

In an exemplary embodiment, the first perturbation module 402 is further used to obtain a spatial position matrix and a gene expression matrix of the sample cells, where the spatial position matrix characterizes the spatial position of each sample cell and the gene expression matrix characterizes the genes of each sample cell; determine the adjacency matrix of the sample cells according to the spatial position matrix; and combine the adjacency matrix and the gene expression matrix to obtain the cell feature matrix of the sample cells.
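As an illustration of this preprocessing, the sketch below builds an adjacency matrix from spatial coordinates and combines it with an expression matrix. Both choices are assumptions: the passage does not say how neighbours are selected (a k-nearest-neighbour rule is used here) or what "combining" means (neighbourhood averaging with a row-normalised A + I is used here). The function names are hypothetical.

```python
import numpy as np


def knn_adjacency(coords, k=2):
    """Symmetric 0/1 adjacency linking each cell to its k nearest neighbours."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)             # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape)
    rows = np.repeat(np.arange(len(coords)), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)              # symmetrise


def cell_features(adj, expr):
    """Combine adjacency and expression by averaging over each neighbourhood.

    Assumption: "combining" is taken to mean row-normalised (A + I) @ X.
    """
    a_hat = adj + np.eye(adj.shape[0])
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)
    return a_hat @ np.asarray(expr, dtype=float)
```

With this choice, each row of the cell feature matrix mixes a cell's own expression profile with those of its spatial neighbours.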

In an exemplary embodiment, the encoding module 406 is further used to take the first graph and the second graph in turn as the graph to be encoded, and input the feature matrix and the adjacency matrix of the graph to be encoded into the corresponding feature encoder to be trained, so as to encode the graph layer by layer; the feature encoders to be trained that correspond to the first graph and the second graph share parameters. During the layer-by-layer encoding, the first encoding layer of the feature encoder to be trained is taken as the current layer, and the feature matrix input to the first encoding layer is taken as the embedding matrix input to the current layer. The fusion matrix of the current layer is determined as the product of the normalized adjacency matrix and the embedding matrix input to the current layer. The fusion matrix of the current layer and the embedding matrix input to the current layer are then fused by weighting them with the weights of the current layer, giving the embedding matrix output by the current layer. That output is used as the input of the next layer, the next layer becomes the new current layer, and the process returns to the step of determining the fusion matrix of the current layer and the steps that follow it. The embedding matrix output by the last layer of the feature encoder to be trained is taken as the embedding matrix of the graph to be encoded: when the graph to be encoded is the first graph, it is the first embedding matrix; when the graph to be encoded is the second graph, it is the second embedding matrix.
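The layer-by-layer encoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the adjacency matrix is normalised symmetrically with self-loops, and each layer holds a single scalar weight w that mixes the fusion matrix with the layer input as w * fused + (1 - w) * input. The names `normalize_adjacency`, `encode` and `encode_two_views` are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalisation D^{-1/2} (A + I) D^{-1/2} (an assumed choice)."""
    a_tilde = np.asarray(adj, dtype=float) + np.eye(len(adj))
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def encode(x, adj, layer_weights):
    """Encode one graph layer by layer, with one scalar weight per layer."""
    a_norm = normalize_adjacency(adj)
    h = np.asarray(x, dtype=float)
    for w in layer_weights:
        fused = a_norm @ h                 # fusion matrix of the current layer
        h = w * fused + (1.0 - w) * h      # weighted fusion with the layer input
    return h                               # embedding output by the last layer


def encode_two_views(x1, adj1, x2, adj2, layer_weights):
    """Both graphs pass through the same weights, i.e. the parameters are shared."""
    return encode(x1, adj1, layer_weights), encode(x2, adj2, layer_weights)
```

Passing the same `layer_weights` to both calls is what makes the two encoders parameter-sharing in this sketch; in a trained model these weights would be learned rather than supplied by hand.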

In an exemplary embodiment, the decoding module 410 is further used to input the adjacency matrix of the sample cells and the target embedding matrix into a decoder, so as to decode the target embedding matrix layer by layer. During the layer-by-layer decoding, the first decoding layer of the decoder is taken as the current layer, and the target embedding matrix input to the first decoding layer is taken as the input of the current layer. The product of the normalized adjacency matrix of the sample cells and the input of the current layer is weighted to obtain the decoding result output by the current layer. That result is used as the input of the next layer, the next layer becomes the new current layer, and the process returns to the weighting step and the steps that follow it. The decoding result output by the last layer of the decoder is taken as the reconstructed cell feature matrix, and the reconstructed adjacency matrix is determined from the reconstructed cell feature matrix and its transpose.
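The decoding steps can be sketched in NumPy as below. The sketch assumes that "weighting" means right-multiplying by a per-layer weight matrix, that the reconstructed adjacency matrix is the sigmoid of the product of the reconstructed feature matrix and its transpose, and that both differences are measured by mean squared error. None of these choices is fixed by the passage above, and all function names are hypothetical.

```python
import numpy as np


def decode(z, a_norm, weights):
    """Layer-by-layer decoding: each layer computes a_norm @ h @ W."""
    h = np.asarray(z, dtype=float)
    for w in weights:
        h = a_norm @ h @ w
    return h                               # reconstructed cell feature matrix


def reconstruct_adjacency(x_rec):
    """Assumption: squash X_rec @ X_rec.T through a sigmoid to get edge scores."""
    return 1.0 / (1.0 + np.exp(-(x_rec @ x_rec.T)))


def reconstruction_loss(x, x_rec, adj, adj_rec):
    """Assumption: the first and second differences are mean squared errors."""
    first_difference = np.mean((x - x_rec) ** 2)
    second_difference = np.mean((adj - adj_rec) ** 2)
    return float(first_difference + second_difference)
```

Here `a_norm` is the normalized adjacency matrix of the sample cells and `weights` is a list of weight matrices whose shapes chain from the embedding dimension back to the feature dimension.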

In an exemplary embodiment, the parameter optimization module 412 is further used to determine a reconstruction loss value according to at least one of the first difference and the second difference; determine a target loss value according to the reconstruction loss value and at least one of a clustering guidance loss value and a de-redundancy loss value; and iteratively optimize the parameters of the feature encoder to be trained according to the target loss value. The clustering guidance loss value is determined from the difference between a soft assignment matrix and a target distribution matrix; the soft assignment matrix is determined from the clustering result obtained by clustering on the target embedding matrix; the target distribution matrix is obtained by normalizing the soft assignment matrix; and the de-redundancy loss value is determined from the difference between the first embedding matrix and the second embedding matrix.
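One way to write the target loss is as a weighted sum. The symbols below, including the trade-off weights \alpha and \beta, are assumptions introduced for illustration; the passage above does not state how the loss terms are combined.

```latex
\mathcal{L}_{\text{target}}
  = \mathcal{L}_{\text{rec}}
  + \alpha\,\mathcal{L}_{\text{clu}}
  + \beta\,\mathcal{L}_{\text{red}},
\qquad \alpha \ge 0,\ \beta \ge 0,\ \alpha + \beta > 0
```

Here \mathcal{L}_{\text{rec}} is the reconstruction loss value, \mathcal{L}_{\text{clu}} the clustering guidance loss value and \mathcal{L}_{\text{red}} the de-redundancy loss value. Setting \alpha = 0 or \beta = 0 corresponds to using only one of the two auxiliary losses, which matches the "at least one of" wording.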

In an exemplary embodiment, the parameter optimization module 412 is further used to cluster on the target embedding matrix to obtain reference cluster centers; determine a soft assignment matrix between the vectors corresponding to the individual sample cells in the target embedding matrix and the reference cluster centers, where the soft assignment matrix characterizes the probability that each vector is assigned to each reference cluster center; normalize the soft assignment matrix to generate a target distribution matrix; and determine the clustering guidance loss value according to the difference between the distributions of the soft assignment matrix and the target distribution matrix.
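A common way to realise this kind of loss is the scheme used in deep embedded clustering, sketched below as an assumption rather than as the method of the application: a Student's t kernel gives the soft assignment, squaring and renormalising gives the target distribution, and the Kullback-Leibler divergence measures the difference. The function names and the `eps` parameter are hypothetical.

```python
import numpy as np


def soft_assignment(z, centers):
    """Student's t kernel between embeddings and cluster centres; rows sum to 1."""
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = 1.0 / (1.0 + sq_dist)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpen q: square, divide by the soft cluster size, renormalise each row."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def clustering_guidance_loss(q, p, eps=1e-12):
    """KL(P || Q) averaged over cells; eps guards against log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))) / q.shape[0])
```

In this sketch, `z` is the target embedding matrix and `centers` holds the reference cluster centers, for example the centroids returned by a k-means run on `z`.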

In an exemplary embodiment, the parameter optimization module 412 is further used to determine a node similarity matrix according to the similarity between the first embedding matrix and the second embedding matrix, and to determine a first de-redundancy loss value according to the difference between the node similarity matrix and the identity matrix.
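The node-level de-redundancy loss can be sketched as follows, assuming cosine similarity between the rows of the two embedding matrices and a mean squared difference from the identity matrix. Both choices, and the function names, are illustrative assumptions.

```python
import numpy as np


def cross_view_similarity(z1, z2, eps=1e-12):
    """Cosine similarity between every row of z1 and every row of z2 (N x N)."""
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + eps)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + eps)
    return z1 @ z2.T


def node_de_redundancy_loss(z1, z2):
    """Mean squared difference between the node similarity matrix and the identity."""
    s = cross_view_similarity(z1, z2)
    return float(np.mean((s - np.eye(s.shape[0])) ** 2))
```

Pushing the similarity matrix towards the identity asks each cell to resemble itself across the two views (diagonal entries near 1) and to differ from other cells (off-diagonal entries near 0).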

In an exemplary embodiment, the parameter optimization module 412 is further used to map the first embedding matrix and the second embedding matrix to cluster-level embeddings, to obtain a first cluster-level embedding matrix and a second cluster-level embedding matrix; determine a cluster-level similarity matrix according to the similarity between the first cluster-level embedding matrix and the second cluster-level embedding matrix; and determine a second de-redundancy loss value according to the difference between the cluster-level similarity matrix and the identity matrix.
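A cluster-level variant can be sketched in the same spirit. The mapping to cluster-level embeddings is not described in the passage above; this example assumes a hard cluster label per cell and pools each cluster by averaging the embeddings of its members. The function names and the `labels` argument are hypothetical.

```python
import numpy as np


def cluster_level_embedding(z, labels, n_clusters):
    """Average the embeddings of the cells assigned to each cluster (K x d)."""
    labels = np.asarray(labels)
    pooled = np.zeros((n_clusters, z.shape[1]))
    for k in range(n_clusters):
        members = z[labels == k]
        if len(members) > 0:
            pooled[k] = members.mean(axis=0)
    return pooled


def cluster_de_redundancy_loss(z1, z2, labels, n_clusters, eps=1e-12):
    """Mean squared difference between the K x K similarity matrix and the identity."""
    c1 = cluster_level_embedding(z1, labels, n_clusters)
    c2 = cluster_level_embedding(z2, labels, n_clusters)
    c1 = c1 / (np.linalg.norm(c1, axis=1, keepdims=True) + eps)
    c2 = c2 / (np.linalg.norm(c2, axis=1, keepdims=True) + eps)
    s = c1 @ c2.T                          # cluster-level similarity matrix
    return float(np.mean((s - np.eye(n_clusters)) ** 2))
```

The loss is small when each cluster looks the same in both views and different clusters stay dissimilar.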

In an exemplary embodiment, as shown in FIG. 5, the cell clustering feature extraction device 400 further includes a clustering module 414.

The encoding module 406 is further used to input the graph composed of the cell feature matrix and the adjacency matrix of the cells to be clustered into the trained feature encoder to obtain an extracted embedding matrix.

The clustering module 414 is used to cluster the embedding vectors corresponding to the individual cells to be clustered in the extracted embedding matrix, to obtain the cell clustering result of the cells to be clustered.

In an exemplary embodiment, the clustering module 414 is further used to cluster the embedding vectors corresponding to the individual sample cells in the target embedding matrix, to obtain the cell clustering result of the sample cells.

In an exemplary embodiment, as shown in FIG. 6, a cell clustering device 600 is provided, comprising an embedding matrix determination module 602 and a clustering module 604, wherein:

The embedding matrix determination module 602 is used to input the graph composed of the cell feature matrix and the adjacency matrix of the cells to be clustered into a pre-trained feature encoder to obtain an extracted embedding matrix. The pre-trained feature encoder is trained in advance on a first graph and a second graph; the first graph includes a first feature matrix and a first adjacency matrix; the second graph includes a second feature matrix and a second adjacency matrix; the first feature matrix and the second feature matrix are obtained by perturbing the cell feature matrix of the sample cells; the first adjacency matrix and the second adjacency matrix are obtained by perturbing the adjacency matrix of the sample cells.

The clustering module 604 is used to cluster the embedding vectors corresponding to the individual cells to be clustered in the extracted embedding matrix, to obtain the cell clustering result of the cells to be clustered.
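The clustering step at inference time can be illustrated with a plain k-means loop over the rows of the extracted embedding matrix. The choice of k-means is an assumption made for this sketch; the passage above does not name a clustering algorithm, and the function name, `n_iter` and `seed` are hypothetical.

```python
import numpy as np


def kmeans(z, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of z."""
    z = np.asarray(z, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct rows chosen at random (fancy indexing copies).
    centers = z[rng.choice(len(z), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = sq_dist.argmin(axis=1)    # assign each cell to its nearest centre
        for k in range(n_clusters):
            if np.any(labels == k):        # keep the old centre if a cluster is empty
                centers[k] = z[labels == k].mean(axis=0)
    return labels
```

Each row of `z` is the embedding vector of one cell to be clustered, and the returned labels are the cell clustering result in this sketch.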

In an exemplary embodiment, the cell clustering device 600 further includes a model training module 606.

The model training module 606 is used to perturb the cell feature matrix of the sample cells to obtain a first feature matrix and a second feature matrix; perturb the adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix; encode a first graph with the feature encoder to be trained to obtain a first embedding matrix, and encode a second graph to obtain a second embedding matrix, where the first graph includes the first feature matrix and the first adjacency matrix, and the second graph includes the second feature matrix and the second adjacency matrix; fuse the first embedding matrix and the second embedding matrix to obtain a target embedding matrix; decode the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix; and iteratively optimize the parameters of the feature encoder to be trained according to at least one of a first difference and a second difference until an iteration stop condition is met, thereby obtaining a trained feature encoder. The feature encoder is used to extract the features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

Each module in the above cell clustering feature extraction device and cell clustering device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a cell clustering feature extraction method or a cell clustering method.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a communication interface, a display screen and an input device connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. When executed by the processor, the computer program implements a cell clustering feature extraction method or a cell clustering method. The display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen; a key, trackball or touchpad provided on the housing of the computer device; or an external keyboard, touchpad, mouse or the like.

Those skilled in the art will understand that the structures shown in FIG. 7 and FIG. 8 are merely block diagrams of the parts of the structure relevant to the solution of the present application, and do not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor. A computer program is stored in the memory, and the processor implements the steps of the above method embodiments when executing the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the above method embodiments.

It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data and displayed data) involved in this application are all information and data authorized by the user or fully authorized by all parties.

A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to a memory, database or other medium in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments represent only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent application. It should be noted that a person of ordinary skill in the art can make variations and improvements without departing from the concept of the present application, and all of these fall within the protection scope of the present application. The protection scope of this patent application is therefore defined by the appended claims.

Claims

A method for extracting cell clustering features, characterized in that the method comprises:

Perturbing the cell feature matrix of the sample cells to obtain a first feature matrix and a second feature matrix;

Perturbing the adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix;

Encoding a first graph with a feature encoder to be trained to obtain a first embedding matrix, and encoding a second graph to obtain a second embedding matrix; the first graph includes the first feature matrix and the first adjacency matrix; the second graph includes the second feature matrix and the second adjacency matrix;

Fusing the first embedding matrix and the second embedding matrix to obtain a target embedding matrix;

Decoding the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix;

Iteratively optimizing the parameters of the feature encoder to be trained according to at least one of a first difference and a second difference until an iteration stop condition is met, thereby obtaining a trained feature encoder; the feature encoder is used to extract features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

The method according to claim 1, characterized in that the perturbing the cell feature matrix of the sample cells to obtain the first feature matrix and the second feature matrix comprises:

Obtaining a first noise matrix and a second noise matrix;

Perturbing the cell feature matrix of the sample cells according to the first noise matrix and the second noise matrix respectively to obtain the first feature matrix and the second feature matrix.

The method according to claim 1 or 2, characterized in that the perturbing the adjacency matrix of the sample cells to obtain the first adjacency matrix and the second adjacency matrix comprises:

Determining the distance similarity between the sample cell nodes according to the adjacency matrix of the sample cells;

Deleting the edges connecting the sample cell nodes whose distance similarity meets the preset conditions from the adjacency matrix of the sample cells to obtain a first adjacency matrix;

Determining the importance of each of the sample cell nodes according to the adjacency matrix of the sample cells, and adjusting the distance similarity between the sample cell nodes according to the importance to obtain the second adjacency matrix.

The method according to any one of claims 1 to 3, characterized in that before perturbing the cell feature matrix of the sample cells to obtain the first feature matrix and the second feature matrix, the method further comprises:

Obtaining a spatial position matrix and a gene expression matrix of the sample cells; the spatial position matrix is used to characterize the spatial position of each of the sample cells; the gene expression matrix is used to characterize the genes of each of the sample cells;

Determining an adjacency matrix of the sample cells according to the spatial position matrix;

The adjacency matrix and the gene expression matrix are combined to obtain a cell feature matrix of the sample cells.
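One common way to derive an adjacency matrix from spatial positions is a symmetrized k-nearest-neighbour graph. The sketch below assumes that construction; the claim does not specify it, and the function name `knn_adjacency` and parameter `k` are hypothetical. How the adjacency matrix and the gene expression matrix are then combined is not specified in the claim and is not shown here.

```python
import numpy as np

def knn_adjacency(coords, k=1):
    """Build a symmetric 0/1 adjacency matrix from spatial coordinates (cells x dims)."""
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)               # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]    # indices of the k closest cells
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)                # keep an edge if either end selects it

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
adj = knn_adjacency(coords, k=1)
```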

The method according to any one of claims 1 to 4, characterized in that encoding the first graph by the feature encoder to be trained to obtain the first embedding matrix, and encoding the second graph to obtain the second embedding matrix comprises:

The first graph and the second graph are each used as a graph to be encoded, and the feature matrix and the adjacency matrix in the graph to be encoded are input into the feature encoder to be trained corresponding to the graph to be encoded, so as to encode the graph to be encoded layer by layer; the parameters of the feature encoders to be trained corresponding to the first graph and the second graph are shared;

In the process of encoding the to-be-encoded graph layer by layer, a first encoding layer of the to-be-trained feature encoder is used as a current layer, and a feature matrix input to the first encoding layer is used as an embedding matrix input to the current layer, and a fusion matrix of the current layer is determined according to a product of the embedding matrix input to the current layer and a normalized matrix of the adjacency matrix;

The fusion matrix of the current layer and the embedding matrix input to the current layer are fused by weighting according to the weights in the current layer to obtain the embedding matrix output by the current layer; the embedding matrix output by the current layer is used as the input of the next layer, the next layer is used as the new current layer, and the process returns to the step of determining the fusion matrix of the current layer according to the product of the embedding matrix input to the current layer and the normalized matrix of the adjacency matrix and continues with the subsequent steps; the embedding matrix output by the last layer in the feature encoder to be trained is used as the embedding matrix corresponding to the graph to be encoded;

Wherein, when the graph to be encoded is the first graph, the corresponding embedding matrix is the first embedding matrix; when the graph to be encoded is the second graph, the corresponding embedding matrix is the second embedding matrix.
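The layer-by-layer encoding can be sketched roughly as below. The symmetric normalization with self-loops, the `tanh` activation, and the use of two weight matrices per layer (`w_fused`, `w_self`) to realize the weighted fusion are assumptions of this sketch, not details stated in the claim.

```python
import numpy as np

def normalize_adjacency(adj):
    """Assumed normalization: D^(-1/2) (A + I) D^(-1/2)."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(x, adj, layers):
    """Encode one graph (feature matrix x, adjacency adj) layer by layer."""
    a_norm = normalize_adjacency(adj)
    h = x                                   # embedding input to the first layer
    for w_fused, w_self in layers:
        fused = a_norm @ h                  # fusion matrix of the current layer
        # Weighted fusion of the fusion matrix and the layer input (assumed form).
        h = np.tanh(fused @ w_fused + h @ w_self)
    return h                                # output of the last layer

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = rng.normal(size=(3, 4))
layers = [(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))),
          (rng.normal(size=(8, 2)), rng.normal(size=(8, 2)))]
# Parameter sharing: the same `layers` encode both perturbed views.
z1 = encode(x, adj, layers)
z2 = encode(x + 0.1, adj, layers)
```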

The method according to any one of claims 1 to 5, characterized in that the decoding of the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix comprises:

Inputting the adjacency matrix of the sample cells and the target embedding matrix into a decoder to decode the target embedding matrix layer by layer;

In a process of decoding the target embedding matrix layer by layer, taking a first decoding layer of the decoder as a current layer, and taking the target embedding matrix input to the first decoding layer as an input of the current layer;

The normalized product of the input of the current layer and the adjacency matrix of the sample cells is weighted to obtain a decoding result output by the current layer; the decoding result output by the current layer is used as the input of the next layer, the next layer is used as the new current layer, and the process returns to the step of weighting the normalized product of the input of the current layer and the adjacency matrix of the sample cells to obtain the decoding result output by the current layer and continues with the subsequent steps; the decoding result output by the last layer in the decoder is used as the reconstructed cell feature matrix;

Determining the reconstructed adjacency matrix according to the reconstructed cell feature matrix and the transpose of the reconstructed cell feature matrix.
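A rough sketch of the decoding steps follows. It assumes linear decoding layers of the form (normalized adjacency) x (input) x (weights), and it assumes that the reconstructed adjacency matrix is the element-wise sigmoid of the product of the reconstructed feature matrix with its transpose; the claim states only that both matrices are used.

```python
import numpy as np

def normalize_adjacency(adj):
    """Assumed normalization: D^(-1/2) (A + I) D^(-1/2)."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def decode(z, adj, weights):
    """Decode the target embedding z layer by layer."""
    a_norm = normalize_adjacency(adj)
    h = z
    for w in weights:
        h = a_norm @ h @ w                  # weighted, adjacency-normalized product
    x_rec = h                               # reconstructed cell feature matrix
    # Assumption: sigmoid of X_rec X_rec^T gives edge probabilities.
    a_rec = 1.0 / (1.0 + np.exp(-(x_rec @ x_rec.T)))
    return x_rec, a_rec

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
z = rng.normal(size=(3, 2))
weights = [0.1 * rng.normal(size=(2, 4)), 0.1 * rng.normal(size=(4, 5))]
x_rec, a_rec = decode(z, adj, weights)
```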

The method according to any one of claims 1 to 6, characterized in that the iterative optimization of the parameters of the feature encoder to be trained according to at least one of the first difference and the second difference comprises:

determining a reconstruction loss value based on at least one of the first difference and the second difference;

Determining a target loss value according to at least one of a clustering guidance loss value and a de-redundancy loss value, and the reconstruction loss value;

Iteratively optimizing the parameters of the feature encoder to be trained according to the target loss value;

The clustering guidance loss value is determined according to the difference between the soft assignment matrix and the target distribution matrix; the soft assignment matrix is determined according to the clustering result obtained by clustering based on the target embedding matrix; the target distribution matrix is obtained by normalizing the soft assignment matrix;

The de-redundancy loss value is determined according to a difference between the first embedding matrix and the second embedding matrix.
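The way the loss terms combine can be illustrated with a short sketch. The mean squared error used for both differences and the weighting coefficients `lam`, `alpha`, and `beta` are assumptions; the claim does not name a specific distance or weighting.

```python
import numpy as np

def reconstruction_loss(x, x_rec, adj, a_rec, lam=1.0):
    """Reconstruction loss from the first and second differences (assumed MSE)."""
    first_diff = np.mean((x_rec - x) ** 2)       # feature reconstruction error
    second_diff = np.mean((a_rec - adj) ** 2)    # adjacency reconstruction error
    return float(first_diff + lam * second_diff)

def target_loss(rec, cluster=0.0, dedup=0.0, alpha=1.0, beta=1.0):
    """Target loss: reconstruction plus optional clustering and de-redundancy terms."""
    return rec + alpha * cluster + beta * dedup

x = np.ones((2, 3))
adj = np.eye(2)
perfect = reconstruction_loss(x, x, adj, adj)        # identical inputs: no error
shifted = reconstruction_loss(x, x + 1.0, adj, adj)  # every feature off by one
total = target_loss(1.0, cluster=2.0, dedup=3.0, alpha=0.5, beta=2.0)
```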

The method according to claim 7, characterized in that the step of determining the clustering guidance loss value comprises:

Performing clustering according to the target embedding matrix to obtain reference cluster centers;

Determine a soft assignment matrix between the vectors corresponding to each sample cell in the target embedding matrix and the reference cluster center; the soft assignment matrix is used to represent the probability that the vectors are assigned to each reference cluster center;

Normalizing the soft assignment matrix to generate a target distribution matrix;

The clustering guidance loss value is determined according to a difference between the distributions of the soft assignment matrix and the target distribution matrix.
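These steps resemble the self-training objective used in deep embedded clustering. The sketch below assumes that formulation: a Student's t kernel for the soft assignment, a squared and frequency-normalized target distribution, and a KL divergence as the difference between distributions. None of these specific choices is stated in the claim, and the reference cluster centers are passed in directly rather than computed.

```python
import numpy as np

def soft_assignment(z, centers):
    """Soft assignment matrix q (cells x clusters) using a Student's t kernel (assumed)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target distribution p obtained by normalizing q (assumed form)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_guidance_loss(q, p):
    """KL(P || Q), summed over cells and clusters (assumed difference measure)."""
    return float((p * np.log(p / q)).sum())

z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # stand-in reference cluster centers
q = soft_assignment(z, centers)
p = target_distribution(q)
loss = clustering_guidance_loss(q, p)
```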

The method according to claim 7 or 8, characterized in that the de-redundancy loss value includes a first de-redundancy loss value; and the step of determining the de-redundancy loss value includes:

Determining a node similarity matrix according to the similarity between the first embedding matrix and the second embedding matrix;

The first de-redundancy loss value is determined according to a difference between the node similarity matrix and an identity matrix.

The method according to claim 9, characterized in that the de-redundancy loss value further includes a second de-redundancy loss value; and the step of determining the de-redundancy loss value further includes:

Mapping the first embedding matrix and the second embedding matrix into cluster-level embeddings respectively to obtain a first cluster-level embedding matrix and a second cluster-level embedding matrix;

Determining a cluster-level similarity matrix according to the similarity between the first cluster-level embedding matrix and the second cluster-level embedding matrix;

The second de-redundancy loss value is determined according to the difference between the cluster-level similarity matrix and the identity matrix.
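Both de-redundancy terms compare a cross-view similarity matrix with the identity matrix. The sketch below assumes cosine similarity, a mean squared difference, and a hypothetical linear projection `proj` that maps node embeddings to cluster-level embeddings; the claims do not specify any of these.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Row-wise cosine similarity between two matrices with equal row length."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def identity_gap(sim):
    """Mean squared difference between a square similarity matrix and the identity."""
    return float(np.mean((sim - np.eye(sim.shape[0])) ** 2))

def de_redundancy_losses(z1, z2, proj):
    """Return (node-level loss, cluster-level loss) for two embedding views."""
    node_loss = identity_gap(cosine_similarity_matrix(z1, z2))
    # Assumption: cluster-level embeddings are the columns of z @ proj,
    # so each row of c1 / c2 describes one cluster across all cells.
    c1 = (z1 @ proj).T
    c2 = (z2 @ proj).T
    cluster_loss = identity_gap(cosine_similarity_matrix(c1, c2))
    return node_loss, cluster_loss

z = np.eye(3)
same_node, same_cluster = de_redundancy_losses(z, z, np.eye(3))
z_other = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
diff_node, diff_cluster = de_redundancy_losses(z, z_other, np.eye(3))
```

When the two views agree and their rows are mutually orthogonal, both similarity matrices equal the identity and both losses vanish; correlated or mismatched views give positive values.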

The method according to any one of claims 1 to 10, characterized in that the method further comprises:

Inputting a graph composed of the cell feature matrix and the adjacency matrix of cells to be grouped into the trained feature encoder to obtain an extracted embedding matrix;

Clustering is performed on the embedding vectors corresponding to each of the cells to be grouped in the extracted embedding matrix to obtain a cell grouping result of the cells to be grouped.

Clustering is performed on the embedding vectors corresponding to each of the sample cells in the target embedding matrix to obtain a cell clustering result of the sample cells.
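The final clustering step can be illustrated with a plain k-means over the rows of an embedding matrix. K-means is an assumption here, since the claims do not name a clustering algorithm; the function name and parameters are hypothetical.

```python
import numpy as np

def kmeans(z, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means over the rows of z; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct rows of z (fancy indexing copies the rows).
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)               # assign each cell to its nearest center
        for j in range(k):
            if np.any(labels == j):              # leave empty clusters unchanged
                centers[j] = z[labels == j].mean(axis=0)
    return labels, centers

# Toy embedding matrix with two well-separated groups of cells.
z = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])
labels, centers = kmeans(z, k=2)
```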

A cell clustering method, characterized in that the method comprises:

Inputting a graph composed of a cell feature matrix and an adjacency matrix of cells to be grouped into a pre-trained feature encoder to obtain an extracted embedding matrix; the pre-trained feature encoder is pre-trained according to a first graph and a second graph; the first graph includes a first feature matrix and a first adjacency matrix; the second graph includes a second feature matrix and a second adjacency matrix; the first feature matrix and the second feature matrix are obtained by perturbing the cell feature matrix of sample cells; the first adjacency matrix and the second adjacency matrix are obtained by perturbing the adjacency matrix of the sample cells;

The method according to claim 13, characterized in that the step of training the feature encoder comprises:

A cell clustering feature extraction device, characterized in that the device comprises:

A first perturbation module is used to perturb the cell feature matrix of the sample cells to obtain a first feature matrix and a second feature matrix;

A second perturbation module is used to perturb the adjacency matrix of the sample cells to obtain a first adjacency matrix and a second adjacency matrix;

An encoding module, configured to encode a first graph through a feature encoder to be trained to obtain a first embedding matrix, and encode a second graph to obtain a second embedding matrix; the first graph includes the first feature matrix and the first adjacency matrix; the second graph includes the second feature matrix and the second adjacency matrix;

A fusion module, configured to fuse the first embedding matrix and the second embedding matrix to obtain a target embedding matrix;

A decoding module, used for decoding the target embedding matrix to obtain at least one of a reconstructed cell feature matrix and a reconstructed adjacency matrix;

A parameter optimization module is used to iteratively optimize the parameters of the feature encoder to be trained according to at least one of the first difference and the second difference until the iteration stop condition is met, thereby obtaining a trained feature encoder; the feature encoder is used to extract features required for cell clustering; the first difference is the difference between the reconstructed cell feature matrix and the cell feature matrix of the sample cells; the second difference is the difference between the reconstructed adjacency matrix and the adjacency matrix of the sample cells.

A cell clustering device, characterized in that the device comprises:

The embedding matrix determination module is used to input a graph composed of a cell feature matrix and an adjacency matrix of cells to be grouped into a pre-trained feature encoder to obtain an extracted embedding matrix; the pre-trained feature encoder is pre-trained according to a first graph and a second graph; the first graph includes a first feature matrix and a first adjacency matrix; the second graph includes a second feature matrix and a second adjacency matrix; the first feature matrix and the second feature matrix are obtained by perturbing the cell feature matrix of sample cells; the first adjacency matrix and the second adjacency matrix are obtained by perturbing the adjacency matrix of the sample cells;

The clustering module is used to cluster the embedding vectors corresponding to each of the cells to be grouped in the extracted embedding matrix to obtain the cell grouping results of the cells to be grouped.

A computer device comprises a memory and one or more processors, wherein the memory stores computer-readable instructions, and wherein the one or more processors implement the steps of the method described in any one of claims 1 to 14 when executing the computer-readable instructions.

A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions are suitable for being loaded by one or more processors and executing the method according to any one of claims 1-14.