
WO2025189874A1 - Representation sequence compression method and apparatus, and related device - Google Patents

Representation sequence compression method and apparatus, and related device

Info

Publication number
WO2025189874A1
Authority
WO
WIPO (PCT)
Prior art keywords
representation
representations
sequence
network layer
token
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/139141
Other languages
French (fr)
Chinese (zh)
Inventor
Zheng Pengfei
Wang Yingchun
Ji Chenpeng
Zhu Qi
Wu Shusen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a representation sequence compression method, apparatus, and related device.
  • the length of the representation (token) sequence used as input to the AI model is on an increasing trend.
  • in some scenarios, the length of the token sequence input to the AI model may reach 2,000 or more, which results in excessive hardware resource consumption when the AI model is trained or performs inference based on the token sequence.
  • the representation sequence used as input to the AI model can usually be compressed to reduce the AI model's demand for hardware resources during training or reasoning.
  • the importance of each token can be identified in the shallow layers of the AI model (such as the first 5 network layers), and the input token sequence can be pruned according to the importance of each token, so that only the tokens with higher importance are retained and passed to the subsequent network layers for further calculation.
  • the AI model can perform a forward calculation process based on a smaller number of tokens, thereby reducing the demand for hardware resources.
  • This application provides a method for compressing representation sequences, aiming to reduce the number of resources required by AI models during inference (or training) while improving the inference accuracy of AI models. Furthermore, this application also provides a representation sequence compression apparatus, a computing device, a computer-readable storage medium, and a computer program product.
  • the present application provides a method for compressing a representation sequence, which can be performed by a corresponding representation sequence compression device.
  • the representation sequence compression device obtains a first representation sequence, which is a sequence obtained based on the input data of an AI (artificial intelligence) model.
  • a representation (token) in the first representation sequence can be a symbol, character, or word in a paragraph of text.
  • the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded front and back, such as the output data of the first network layer set is the input data of the second network layer set.
  • the representation sequence compression device evaluates the importance of different representations in the first representation sequence in the first network layer set, and the importance can be measured, for example, by an importance score. Then, the representation sequence compression device divides the representations in the first representation sequence into a first representation set and a second representation set according to the importance of each representation in the first representation sequence. Generally, the importance of the representations in the second representation set is higher than the importance of the representations in the first representation set. Finally, the representation sequence compression device determines at least one first virtual representation based on the representations in the first representation set, and uses the compressed representation sequence composed of the at least one first virtual representation and the representations in the second representation set as the input sequence for the second network layer set. The number of the determined at least one virtual representation is less than the number of representations in the first representation set. In this case, the representations in the first representation set are the pruned representations from the first representation sequence.
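As a minimal, hypothetical sketch (not the claimed implementation), the split-and-summarize step described above can be illustrated in Python. The `compress_sequence` name, the mean-pooled virtual token, and the 50% keep ratio are all illustrative assumptions:

```python
# Hypothetical sketch: split a token sequence by importance score, keep the
# high-importance set, and replace the pruned set with fewer "virtual" tokens.

def compress_sequence(tokens, scores, keep_ratio=0.5, num_virtual=1):
    """tokens: list of token values; scores: one importance score per token."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    num_keep = max(1, int(len(tokens) * keep_ratio))
    kept_idx = sorted(order[:num_keep])     # second representation set (retained)
    pruned_idx = sorted(order[num_keep:])   # first representation set (pruned)
    kept = [tokens[i] for i in kept_idx]
    pruned = [tokens[i] for i in pruned_idx]
    # Each virtual token recovers (summarizes) part of the pruned information;
    # a simple strided mean pool stands in for the learned aggregation here.
    virtual = []
    for v in range(num_virtual):
        chunk = pruned[v::num_virtual]
        if chunk:
            virtual.append(sum(chunk) / len(chunk))
    return virtual + kept  # compressed sequence fed to the second layer set

compressed = compress_sequence([0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
                               scores=[5, 1, 4, 2, 6, 3],
                               keep_ratio=0.5, num_virtual=1)
```

Here three low-score tokens are pruned and folded into one virtual token, so the next layer set sees 4 tokens instead of 6.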
  • the virtual representations are used to recover the information of representations pruned in the first network layer set because of their low importance. Even if representations that are important to the second network layer set are mistakenly pruned in the first network layer set, the information recovered by the virtual representations preserves the reasoning accuracy of the AI model; this avoids the accuracy loss that would result from completely discarding the pruned representations.
  • after part of the representations in the first representation sequence are pruned, the AI model performs forward calculation on a smaller number of representations, so the amount of resources required can be effectively reduced.
  • moreover, the virtual representations and the retained representations (that is, the representations in the second representation set) can be independent of each other, which avoids mutual interference between them that would affect the reasoning accuracy of the AI model.
  • when the representation sequence compression device determines at least one first virtual representation based on the representations in the first representation set, it can specifically determine the value of the at least one first virtual representation based on the values of the representations in the first representation set.
  • the number of first virtual representations can be a pre-configured fixed number. In this way, using the first virtual representations to recover the information (that is, the values) of the pruned representations in the first representation set avoids the decrease in reasoning accuracy that complete pruning would cause.
  • when the representation sequence compression device determines at least one first virtual representation based on the representations in the first representation set, it can first determine the number of first virtual representations based on the number of representations in the first representation set, and then determine their values based on the values of those representations. In this case, the number of first virtual representations is determined by the number of pruned representations: the more representations are pruned, the more first virtual representations are used. The representation sequence compression device can thus dynamically configure the number of virtual representations based on the actual number of pruned representations, improving the flexibility of using virtual representations to recover the information of pruned representations.
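The dynamic sizing described above can be sketched as follows. The one-virtual-token-per-4-pruned-tokens ratio and the strided mean pooling are illustrative assumptions, not the patented rule:

```python
# Hypothetical sketch: choose the number of virtual tokens from the number of
# pruned tokens (here, one virtual token per 4 pruned tokens), then pool the
# pruned values into that many virtual tokens.

def dynamic_virtual_tokens(pruned_values, tokens_per_virtual=4):
    if not pruned_values:
        return []
    # The larger the number of pruned tokens, the more virtual tokens are used.
    num_virtual = max(1, -(-len(pruned_values) // tokens_per_virtual))  # ceil
    virtual = []
    for v in range(num_virtual):
        chunk = pruned_values[v::num_virtual]
        virtual.append(sum(chunk) / len(chunk))
    return virtual
```

With 8 pruned values this yields 2 virtual tokens; with 1 pruned value it yields a single virtual token.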
  • the first network layer set in the AI model includes a target parameter, which is used to calculate the importance of the representation in the first representation sequence in the first network layer set.
  • the representation sequence compression device can also calculate a loss value according to a loss function during the training of the AI model.
  • the loss function includes a variance regularization term corresponding to the target parameter, and updates the target parameter in the first network layer set according to the loss value.
  • by updating the value of the target parameter, the representation sequence compression device can increase the separation between the importance scores calculated for different representations, which helps improve the accuracy of the importance measurement.
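One hedged way to picture the variance regularization term: rewarding high variance among importance scores spreads them apart, sharpening the distinction between important and unimportant tokens. The toy `importance_scores` function, the subtractive sign, and the `reg_weight` value below are assumptions for illustration only:

```python
# Hypothetical sketch: a loss with a variance regularization term on the
# importance scores produced by the target parameter.

def importance_scores(values, target_param):
    # Toy scoring: scale each token value by the learnable target parameter.
    return [v * target_param for v in values]

def loss_with_variance_reg(task_loss, scores, reg_weight=0.1):
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    # Subtracting the variance term rewards well-separated scores (assumed sign).
    return task_loss - reg_weight * variance

reg_loss = loss_with_variance_reg(1.0, importance_scores([1.0, 2.0, 3.0], 2.0))
```

During training, the gradient of this loss with respect to the target parameter would push the scores apart while still minimizing the task loss.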
  • the representations in the first representation sequence include words (including characters or phrases) or image blocks.
  • a sequence of characters or phrases in a text input by a user can be used as the first representation sequence; or a sequence of image blocks in an image provided by a user can be used as the first representation sequence.
  • the representations in the first representation sequence include words.
  • at least one first virtual representation used to recover the information of the pruned representations is located at the beginning of the compressed representation sequence.
  • setting the position of the virtual representations in the compressed representation sequence based on the importance that a word's position carries in the text can help improve the accuracy of the AI model's reasoning on the compressed representation sequence.
  • the representations in the first representation sequence include image blocks.
  • at least one first virtual representation used to recover information from the pruned representations is located in the middle of the compressed representation sequence. Positioning the virtual representations in the compressed representation sequence based on the importance of an image block's position in the image can help improve the accuracy of the AI model's reasoning on the compressed representation sequence.
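The modality-dependent placement described in the preceding bullets might be sketched as follows; the `place_virtual` function name and the plain-list representation are hypothetical:

```python
# Hypothetical sketch: place virtual tokens according to modality. For text,
# virtual tokens go at the starting position; for images, they go at an
# intermediate (middle) position of the compressed sequence.

def place_virtual(kept, virtual, modality):
    if modality == "text":
        return virtual + kept                     # starting position
    elif modality == "image":
        mid = len(kept) // 2                      # intermediate position
        return kept[:mid] + virtual + kept[mid:]
    raise ValueError(f"unknown modality: {modality}")
```

The same retained tokens thus produce differently ordered compressed sequences depending on whether the input was text or an image.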
  • the AI model also includes a third network layer set, wherein the second network layer set and the third network layer set are cascaded front and back, that is, the output data of the second network layer set can be used as the input data of the third network layer set.
  • the representation sequence compression device can also evaluate the importance of different representations in the second representation set in the second network layer set, and divide the representations in the second representation set into a first subset and a second subset according to that importance, where the importance of the representations in the second subset is higher than the importance of the representations in the first subset. The representation sequence compression device can then determine at least one second virtual representation based on the representations in the first subset, where the number of second virtual representations is less than the number of representations in the first subset, and use the compressed representation sequence composed of the at least one first virtual representation, the at least one second virtual representation, and the representations in the second subset as the input sequence of the third network layer set.
  • by pruning the representation sequence multiple times across different network layer sets, the representation sequence compression device can further reduce the amount of resources required in the forward calculation stage. Moreover, it can use different virtual representations to recover the information of representations pruned in different network layer sets, and have those virtual representations participate in the calculation of subsequent network layers, which can improve the reasoning accuracy of the AI model.
  • the AI model also includes a third network layer set, wherein the second network layer set and the third network layer set are cascaded.
  • the representation sequence compression device can then evaluate the importance of different representations in the second representation set in the second network layer set, and based on the importance of different representations in the second representation set in the second network layer set, divide the representations in the second representation set into a first subset and a second subset, wherein the importance of the representations in the second subset is higher than the importance of the representations in the first subset.
  • the representation sequence compression device can then determine the value of at least one first virtual representation in the second network layer set based on the representations in the first subset, and use the compressed representation sequence composed of the at least one first virtual representation and the representations in the second subset as the input sequence for the third network layer set. In this way, the representation sequence compression device can further reduce the amount of resources required in the forward computation phase by pruning the representation sequence multiple times in different network layer sets. Furthermore, because the same virtual representation is reused to recover information about representations pruned in different network layer sets, the amount of resources consumed in the forward computation process can be reduced further still.
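Reusing the same virtual representation across pruning stages can be pictured as a running-mean update of its value; the `update_virtual` helper below is a hypothetical illustration, not the claimed computation:

```python
# Hypothetical sketch: at a deeper network layer set, fold newly pruned token
# values into the SAME virtual token (updating its value) rather than
# appending a second virtual token.

def update_virtual(virtual_value, virtual_count, newly_pruned):
    """Running mean: merge newly pruned token values into one virtual token."""
    total = virtual_value * virtual_count + sum(newly_pruned)
    new_count = virtual_count + len(newly_pruned)
    return total / new_count, new_count

# A virtual token summarizing 3 pruned tokens (mean 0.8) absorbs 2 more.
value, count = update_virtual(0.8, 3, [0.2, 0.4])
```

Because no second virtual token is created, the sequence length stays constant across further pruning stages.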
  • the method flow by which the representation sequence compression device compresses the first representation sequence can be applied in the deployment stage of the AI model. The representation sequence compression device can then update the model computation graph corresponding to the AI model according to the compressed representation sequence, so that when performing forward calculation according to the updated model computation graph, it reduces the amount of resources required by compressing the representation sequence while using virtual representations to recover the information of the pruned representations, ensuring the reasoning accuracy of the AI model.
  • the method flow for compressing the first representation sequence by the representation sequence compression device can be applied during the development phase of the AI model.
  • when the representation sequence compression device runs the developed AI model, it can compress the representation sequence to reduce the amount of resources consumed in the forward calculation process, while using virtual representations to recover the information of the pruned representations to ensure the reasoning accuracy of the AI model.
  • the present application provides a representation sequence compression device, which includes: an acquisition module for acquiring a first representation sequence, where the first representation sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded front and back; an evaluation module for evaluating the importance of different representations in the first representation sequence in the first network layer set; a division module for dividing the representations in the first representation sequence into a first representation set and a second representation set according to the importance; and a determination module for determining at least one first virtual representation based on the representations in the first representation set, where the number of first virtual representations is less than the number of representations in the first representation set, and for using a compressed representation sequence composed of the at least one first virtual representation and the representations in the second representation set as the input sequence of the second network layer set.
  • the determination module is configured to determine a value of at least one first virtual representation according to values of representations in the first representation set.
  • the determination module is configured to: determine the number of at least one first virtual representation based on the number of representations in the first representation set; and determine the value of at least one first virtual representation based on the value of the representation in the first representation set.
  • the first network layer set includes a target parameter, which is used to calculate the importance of the representation in the first representation sequence in the first network layer set;
  • the representation sequence compression device also includes a training module, which is used to: in the process of training the AI model, calculate the loss value according to the loss function, and the loss function includes a variance regularization term corresponding to the target parameter; update the target parameter according to the loss value.
  • the representations in the first representation sequence include words or image blocks.
  • the representation includes words, and the at least one first virtual representation is located at a starting position in the compressed representation sequence.
  • the representation comprises image blocks, and the at least one first virtual representation is located at an intermediate position in a sequence of compressed representations.
  • the AI model also includes a third network layer set, and the second network layer set and the third network layer set are cascaded front and back;
  • the evaluation module is also used to evaluate the importance of different representations in the second representation set in the second network layer set;
  • the division module is also used to divide the representations in the second representation set into a first subset and a second subset according to the importance of different representations in the second representation set in the second network layer set;
  • the determination module is also used to determine at least one second virtual representation based on the representations in the first subset, where the number of second virtual representations is less than the number of representations in the first subset, and to use the compressed representation sequence composed of the at least one first virtual representation, the at least one second virtual representation, and the representations in the second subset as the input sequence of the third network layer set.
  • the AI model also includes a third network layer set, and the second network layer set and the third network layer set are cascaded front and back;
  • the evaluation module is also used to evaluate the importance of different representations in the second representation set in the second network layer set;
  • the division module is also used to divide the representations in the second representation set into a first subset and a second subset according to the importance of different representations in the second representation set in the second network layer set;
  • the determination module is also used to determine the value of at least one first virtual representation in the second network layer set based on the representation in the first subset, and use the compressed representation sequence composed of at least one first virtual representation and the representation in the second subset as the input sequence of the third network layer set.
  • the representation sequence compression device is applied to the deployment stage of the AI model, and the representation sequence compression device also includes an update module for updating the model calculation graph corresponding to the AI model according to the compressed representation sequence.
  • the representation sequence compression device is applied to the development stage of the AI model.
  • the representation sequence compression device provided in the second aspect corresponds to the representation sequence compression method provided in the first aspect. Therefore, the technical effects of the second aspect and any implementation method in the second aspect can be found in the relevant description of the technical effects of the first aspect and the corresponding implementation method in the first aspect, and will not be repeated here.
  • the present application provides a computing device comprising a processor and a memory; wherein the memory is used to store instructions, and the processor executes the instructions stored in the memory to perform the operating steps of the representation sequence compression method described in the first aspect and any one of the implementations of the first aspect.
  • the present application provides a computer-readable storage medium storing instructions; when the instructions are run on a computing device, the computing device executes the operating steps of the representation sequence compression method described in the first aspect or any implementation of the first aspect.
  • the present application provides a computer program product comprising instructions, which, when executed on a computing device, enables the computing device to execute the operational steps of the representation sequence compression method described in the first aspect or any one of the implementations of the first aspect.
  • Figure 1 is a schematic diagram of the reasoning process of the AI model
  • FIG2 is a schematic diagram of the reasoning process of an exemplary AI model provided in this application.
  • FIG3 is a flow chart of a method for compressing a sequence provided by the present application.
  • FIG4 is a schematic diagram of inserting m virtual tokens into token sequence 1 provided by the present application.
  • FIG5 is a schematic diagram of pruning tokens and recovering token information provided by this application.
  • Figure 6 is a schematic diagram of reducing the amount of data calculation after token pruning
  • FIG7 is a flow chart of another method for sequence compression provided by the present application.
  • FIG8 is a schematic structural diagram of a sequence compression device provided by the present application.
  • FIG9 is a schematic diagram of the hardware structure of a computing device provided in this application.
  • FIG1 shows a schematic diagram of the reasoning process of an AI model, where the AI model can be run on one or more computing nodes.
  • the AI model may be, for example, a generative pre-trained transformer model (GPT), a bidirectional encoder representations from transformer (BERT) model, a vision transformer (ViT) model, a contrastive language-image pre-training (CLIP) model, etc., or may be other types of AI models, without limitation.
  • the computing node that supports the operation of the AI model can be an accelerator card, which can be, for example, a deep-learning processing unit (DPU), a data processing unit (DPU), a graphics processing unit (GPU), a neural network processing unit (NPU), or other types of accelerator cards.
  • the computing node can be a general-purpose processor, such as a central processing unit (CPU).
  • the computing node can also include a computing device of a CPU and an accelerator card. This application does not limit the specific implementation of the computing node.
  • the input data of the AI model can be an image, and the AI model can perform corresponding reasoning based on the image.
  • the AI model can generate text information based on the input image data, and the text information can be text information that can describe the content of the image.
  • the computing node can first decompose the image into multiple image blocks (each image block can be of the same size) and then expand the multiple image blocks in a specified order, such as from the upper left corner to the lower right corner of the image, to obtain a sequence of image blocks.
  • the computing node can then use each image block as a token to obtain a corresponding token sequence, as shown in Figure 1.
  • the computing node can use the AI model to perform the corresponding reasoning process based on the token sequence.
  • at the shallow layers of the AI model, the computing node can calculate the importance of each token in the input token sequence and perform a mask calculation based on each token's importance score. The mask result identifies the less important tokens in the token sequence, so the computing node can eliminate those tokens according to the mask result and use the remaining tokens to continue participating in the calculation of the subsequent network layers in the AI model.
  • the AI model includes multiple network layers; in the order of forward calculation, the network layers executed first (such as the first 5 network layers) can be described as shallow layers, and the network layers executed last (such as the last 10 network layers) can be described as deep layers.
  • the more tokens a computing node eliminates in the shallow layers of an AI model, the fewer resources the AI model uses to perform reasoning based on the input token sequence, but the greater the reduction in the AI model's reasoning accuracy, as shown in Figure 1. Conversely, the fewer tokens a computing node eliminates in the shallow layers, the more resources the AI model uses, and accordingly the higher its reasoning accuracy. Furthermore, tokens that are less important in the shallow layers of an AI model may be more important in its deeper layers, so eliminating tokens in the shallow layers that matter to the deeper layers can also significantly reduce the AI model's reasoning accuracy.
  • the present application provides a token sequence compression method, which aims to reduce the resource requirements of the AI model during the reasoning (or training) process while improving the reasoning accuracy of the AI model.
  • in the forward calculation stage of the AI model, the computing node can calculate the importance of each token in the input token sequence at the shallow layers of the AI model. Specifically, it can evaluate the importance of each token in the shallow layers (measured, for example, by an importance score) and divide the tokens in the token sequence into two token sets, token set 1 and token set 2, each including at least one token. The importance of the tokens in token set 2 in the shallow layers of the AI model is higher than the importance of the tokens in token set 1.
  • the computing node can calculate the importance of each token and perform mask calculation based on the importance score of each token, so as to use the mask result to identify tokens with higher importance and tokens with lower importance, thereby realizing the division of the token sequence into two token sets.
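The mask calculation can be sketched as a top-k selection; `importance_mask` is an illustrative name, and the 0/1 list stands in for the actual mask tensor:

```python
# Hypothetical sketch: keep the top-k importance scores, producing a 0/1 mask
# that marks retained (1) versus pruned (0) tokens.

def importance_mask(scores, num_keep):
    top_idx = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[:num_keep]
    keep = set(top_idx)
    return [1 if i in keep else 0 for i in range(len(scores))]
```

Applying the mask then splits the sequence into token set 2 (mask 1) and token set 1 (mask 0).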
  • the computing node will then determine at least one virtual token based on the tokens in token set 1, so that it can use this virtual token to recover the token information in token set 1.
  • the number of these virtual tokens is less than the number of tokens in token set 1. Therefore, the computing node will use the compressed token sequence composed of the at least one virtual token and the tokens in token set 2 as the input sequence for the deeper layer of the AI model, that is, pass the virtual token and the retained tokens in token set 2 to the deeper layer of the AI model for calculation.
  • the tokens in token set 1 are the pruned tokens in the token sequence, and after the computing node prunes the less important tokens at the shallow layer, it will use the virtual tokens to recover the information of the pruned tokens and participate in the deeper calculation.
  • virtual tokens are used to recover information about tokens that were pruned at the shallow layers due to their low importance. This allows the AI model's reasoning accuracy to be maintained even if tokens of high importance to the deeper layers are mistakenly pruned at the shallow layers, preventing the accuracy loss that completely discarding pruned tokens would cause, as shown in Figure 2.
  • the AI model performs forward calculations based on a smaller number of tokens, effectively reducing the number of resources required.
  • using virtual tokens to recover information about pruned tokens can improve the AI model's reasoning accuracy to 94.65%, while also reducing the amount of memory resources required by 63%.
  • the virtual token and the retained token can be independent of each other, which can avoid mutual interference between the virtual token and the retained token and affect the reasoning accuracy of the AI model.
  • computing nodes can use configured plug-ins to trim the token sequence input to the AI model; alternatively, computing nodes can trim the token sequence input to the AI model based on the code logic of the software development kit (SDK) built into the model file corresponding to the AI model.
  • a target plug-in can be configured in the compute node, which can be used to trim token sequences. Then, when training or inference is performed based on the AI model, the compute node can activate and run the target plug-in and use it to update the model computation graph (forward computation phase) corresponding to the AI model. For example, the compute node can use the target plug-in to call the dynamic graph engine in the compute node to analyze and obtain the model computation graph corresponding to the AI model.
  • the model computation graph can be used to indicate the computational logic executed by the AI model during the forward computation phase.
  • the compute node can then use the target plug-in and dynamic graph instrumentation technology (such as Torch.FX tools) to insert computational logic for trimming token sequences into the model computation graph.
  • This computational logic includes calculating token importance scores, generating masks to indicate trimmed and retained tokens, and configuring and inserting virtual tokens. In this way, the compute node can execute the forward computation process based on the updated model computation graph, achieving corresponding trimming of the input token sequence in the AI model.
  • the computing node can execute the model file corresponding to the AI model, thereby trimming the input token sequence accordingly by running the SDK in the model file.
  • Figure 3 is a flow chart of a method for compressing a representation sequence provided by an embodiment of the present application.
  • the method for compressing a representation sequence may specifically include:
  • S301 The computing node obtains token sequence 1, which is a sequence obtained based on the input data of the AI model.
  • token sequence 1 may include multiple tokens.
  • each token can be a word.
  • the computing node can use each word in the word sequence as a token, thereby obtaining token sequence 1.
  • a user can provide the computing node with a text message: "The weather is great today, what activities are suitable for going out with friends?" The computing node can then use each word and each punctuation mark in the text message as a token.
  • the computing node can segment the text message and use each segmented word as a token, etc., without limitation.
  • each token can be an image block.
  • the computing node can decompose the image into multiple image blocks in a specified order and use each image block as a token. Based on the decomposition order, the computing node can obtain a corresponding token sequence 1. For example, a user can provide the computing node with an image of 2560 × 1440 pixels. The computing node can then decompose the image into a 4 × 4 grid of 16 image blocks of 640 × 360 pixels each, in order from the upper left corner to the lower right corner of the image, obtaining a sequence of 16 tokens.
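  The patch-decomposition step above can be sketched as follows. This is a minimal illustration assuming a numpy image array; the function name `image_to_tokens` and the 4 × 4 grid are assumptions for the example, not part of the application:

```python
import numpy as np

def image_to_tokens(image, grid):
    """Split an image array (height x width) into a row-major sequence of
    equal-sized image blocks; each block becomes one token, ordered from
    the upper-left corner to the lower-right corner."""
    h, w = image.shape[:2]
    rows, cols = grid
    assert h % rows == 0 and w % cols == 0, "image must divide evenly"
    ph, pw = h // rows, w // cols          # block height / width
    blocks = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
              for r in range(rows) for c in range(cols)]
    return np.stack(blocks)                # shape: (rows * cols, ph, pw)

# a 2560 x 1440 image split on a 4 x 4 grid yields a sequence of 16 tokens
img = np.zeros((1440, 2560))               # height x width
token_seq = image_to_tokens(img, grid=(4, 4))
```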
  • the computing node can provide a client, which can be, for example, an application running on a user's device or a web browser.
  • the user can provide text or images to the client, which then forwards the text or images to the computing node.
  • the computing node can then use the received text or images as input data for the AI model and generate a corresponding token sequence 1 based on the input data in the manner described above.
  • the computing node can perform forward computing based on the token sequence 1.
  • the forward computing process performed by the computing node based on the input token sequence 1 can be a forward computing process in an inference scenario or a forward computing process in a training scenario, and this is not limited.
  • S302 The computing node evaluates the importance of different tokens in token sequence 1 in the first network layer set, wherein the AI model includes the first network layer set and the second network layer set, and the first network layer set and the second network layer set are cascaded front and back.
  • the AI model may include multiple network layers, and the multiple network layers may be divided into a first network layer set and a second network layer set according to their depth in the AI model.
  • the first network layer set and the second network layer set each include at least one network layer.
  • the network layers in the first network layer set may also be referred to as shallow layers in the AI model
  • the network layers in the second network layer set may also be referred to as deep layers in the AI model.
  • the first network layer set and the second network layer set are cascaded front and back, that is, in the forward calculation stage, the input data of the second network layer set can be obtained based on the output data of the first network layer set.
  • the computing node can perform forward calculations based on token sequence 1 at each network layer of the AI model. Furthermore, when token sequence 1 is passed to the first network layer set of the AI model, the computing node can calculate the importance score of each token in token sequence 1 within the first network layer set. This importance score is used to measure the importance of the token within the first network layer set. Generally, a larger importance score indicates a higher importance.
  • the computing node can first calculate, in the first network layer set, the norm of the feature vector of each token at its spatial position in token sequence 1, and compute importance score 1 of the token based on that norm. The computing node then calculates the attention score between the token and the token at the starting position of token sequence 1, and computes importance score 2 of the token based on that attention score. (Both score calculations have established implementations in practice and are not elaborated here.) Finally, the computing node computes the token's final importance score in the first network layer set as a weighted sum of importance score 1 and importance score 2.
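  The two-score scheme can be sketched roughly as below. This is not the application's exact formulation — the normalization step and the weights `w1`, `w2` are illustrative assumptions:

```python
import numpy as np

def importance_scores(features, attention, w1=0.5, w2=0.5):
    """Score each token's importance within a (shallow) network layer set.

    features:  (N, d) token feature vectors at this layer set.
    attention: (N, N) attention matrix; attention[0, j] is the attention
               score between the starting-position token and token j.
    The final score is a weighted sum of a norm-based score (importance
    score 1) and an attention-based score (importance score 2).
    """
    score1 = np.linalg.norm(features, axis=1)   # L2 norm of each feature vector
    score2 = attention[0]                       # attention w.r.t. the start token
    # normalize both scores so the weighted sum is balanced
    score1 = score1 / (score1.max() + 1e-9)
    score2 = score2 / (score2.max() + 1e-9)
    return w1 * score1 + w2 * score2

feats = np.array([[3.0, 4.0], [0.1, 0.1], [6.0, 8.0]])
attn = np.array([[1.0, 0.1, 0.8],
                 [0.1, 1.0, 0.2],
                 [0.8, 0.2, 1.0]])
scores = importance_scores(feats, attn)         # highest: token 2, lowest: token 1
```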
  • the computing node may also calculate the importance score of each token in the first network layer set based on a learned token pruning (LTP) algorithm, or may calculate the importance score of each token in the first network layer set based on the auto-scaling vision transformer (AS-ViT) framework. Furthermore, the importance of a token in the first network layer set may also be measured by other methods, which are not limited to this.
  • LTP learned token pruning
  • AS-ViT auto-scaling vision transformer
  • S303 The computing node divides the tokens in the token sequence 1 into a token set 1 and a token set 2 according to the importance of each token.
  • the importance of the tokens in token set 2 in the first network layer set is higher than the importance of the tokens in token set 1 in the first network layer set of the AI model.
  • when the importance is measured by an importance score, the computing node can compare the importance score of each token in the first network layer set with threshold 1, classifying tokens with importance scores less than threshold 1 into token set 1 and tokens with importance scores greater than or equal to threshold 1 into token set 2. In this way, the computing node divides the multiple tokens included in token sequence 1 into token set 1 and token set 2, where the importance of the tokens in token set 2 is higher than that of the tokens in token set 1.
  • Threshold 1 can be a value pre-set by a technician, or can be determined by the AI model through self-learning during the training process.
  • the tokens in the token set 1 with lower importance are the tokens that need to be pruned in the AI model. Accordingly, in the forward calculation process, the computing node can retain the tokens in the token set 2 to participate in the calculation of the subsequent network layer.
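  The threshold split of step S303 can be illustrated with a minimal sketch; the scores and the value of threshold 1 are made up for the example:

```python
def partition_tokens(scores, threshold):
    """Split token indices into token set 1 (pruned) and token set 2 (retained).

    Tokens whose importance score is below the threshold are pruned; the
    rest are retained, as in step S303."""
    set1 = [i for i, s in enumerate(scores) if s < threshold]   # pruned
    set2 = [i for i, s in enumerate(scores) if s >= threshold]  # retained
    return set1, set2

pruned, retained = partition_tokens([0.75, 0.06, 0.90, 0.40], threshold=0.5)
```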
  • the first token in token sequence 1 corresponds to the first word in the word sequence, and the first word is usually more important for natural language processing of the word sequence. Accordingly, the first token in token sequence 1 is also more important in the AI model's forward calculation process, so the computing node can usually retain this token in token set 2.
  • when each token is specifically an image block, the tokens in the middle of token sequence 1 are usually more important. Therefore, the computing node can usually retain the middle tokens in token set 2.
  • the computing node may also divide the multiple tokens in the token sequence 1 in other ways, and there is no limitation on this.
  • S304 The computing node determines at least one virtual token based on the tokens in token set 1, wherein the number of the at least one virtual token is less than the number of tokens in token set 1.
  • the computing node can use virtual tokens to recycle the pruned tokens' information, so that the recycled information can compensate for the loss in the AI model's reasoning accuracy.
  • the computing node may pre-configure a target number of virtual tokens for token sequence 1 (e.g., 4 virtual tokens) and insert this target number of virtual tokens into token sequence 1 obtained by the computing node.
  • the inserted virtual tokens are used to reclaim token information from token set 1 that was pruned from the first network layer set.
  • the target number of virtual tokens may be pre-set by a technician, for example, through a limited number of experiments.
  • token sequence 1 can include a token marked "cls" (also known as a classification character), which marks the beginning of token sequence 1 (and can also be used to indicate the category of the text).
  • token sequence 1 can also include tokens corresponding to n+1 words (n is a positive integer), which are identified by x0 to xn in Figure 4.
  • the token following the token marked "cls", i.e., x0, is the token corresponding to the first word in the word sequence.
  • the computing node can first insert m virtual tokens (m is a positive integer) before the token marked "cls" in token sequence 1 to obtain a new token sequence. In Figure 4, v1 to vm identify the inserted virtual tokens.
  • in the new token sequence, the first m tokens are the virtual tokens, the (m+1)-th token is the token marked "cls", and the (m+2)-th token is the token corresponding to the first word in the word sequence. That is, the absolute position of each token of token sequence 1 in the newly generated sequence is pushed back by m positions.
  • the computing node can then partition the new token sequence, including the virtual tokens, to obtain the relatively important token set 2.
  • the token labeled "cls" will also be retained in this token set 2.
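  A minimal sketch of this insertion, assuming the sequence is an (N, d) array with the "cls" token at index 0; initializing the virtual tokens to all ones follows the example initialization mentioned later in this description:

```python
import numpy as np

def insert_virtual_tokens(sequence, m):
    """Prepend m virtual tokens (initialized to all ones) to a token sequence.

    sequence has shape (N, d), with the "cls" token at index 0 followed by
    the word tokens x0..xn.  After insertion, indices 0..m-1 hold the
    virtual tokens v1..vm, index m holds "cls", and every original token's
    absolute position is pushed back by m.
    """
    virtual = np.ones((m, sequence.shape[1]))
    return np.concatenate([virtual, sequence], axis=0)

seq = np.arange(8.0).reshape(4, 2)       # "cls" + 3 word tokens, d = 2
new_seq = insert_virtual_tokens(seq, m=2)
```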
  • alternatively, the computing node can insert the virtual tokens at a middle position of token sequence 1. For example, assuming token sequence 1 includes N tokens (N is a positive integer greater than 1), the computing node can insert the m virtual tokens in sequence starting from a token near the middle of the sequence.
  • the computing node can insert the virtual token at any other location within token sequence 1, such as at the very end of token sequence 1, and there is no limitation on this. Furthermore, after inserting the virtual token, the computing node can initialize the value of the virtual token, such as by setting all values of the virtual token to 1.
  • when the computing node divides the multiple tokens in token sequence 1, it may specifically divide the token sequence obtained after inserting the virtual tokens. In this case, the computing node may by default assign all virtual tokens in the sequence to token set 2 (the virtual tokens need not participate in the importance score calculation).
  • the computing node can use the inserted virtual token (i.e., the virtual token in token set 2) to recover the token information in the pruned token set 1.
  • the computing node can use a mapping algorithm to calculate the value of each virtual token based on the values of the tokens in token set 1. For example, as shown in Figure 5, given the set of tokens pruned from token sequence 1, the value of each virtual token can be calculated based on the values of those pruned tokens.
  • the mapping algorithm used to calculate the value of the virtual token can be, for example, a facility location problem (FLP) algorithm.
  • FLP facility location problem
  • in the FLP formulation, the virtual token values are treated as facility locations, the cost of constructing a virtual token as the facility-opening cost, and the difference between the virtual token values and the pruned token values as the objective to be minimized (i.e., the recovery-cost objective). The virtual token values are then solved quickly by establishing a dual program that relaxes the original problem, thereby completing the mapping from the pruned tokens to the virtual tokens.
  • the computing node may also adopt other mapping algorithms to calculate the value of the virtual token according to the value in the token set 1, and there is no limitation on this.
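  The application solves the mapping as a facility location problem via a relaxed dual program; as an easier-to-show stand-in, the sketch below condenses pruned-token values into m virtual tokens with a few k-means-style iterations, which likewise minimizes a recovery-cost-like objective. This is illustrative only, not the FLP solver described above:

```python
import numpy as np

def map_pruned_to_virtual(pruned, m, iters=10):
    """Condense the values of pruned tokens into m virtual-token values.

    Illustrative stand-in: each iteration assigns every pruned token to its
    nearest virtual token, then moves each virtual token to the mean of its
    assigned tokens, shrinking the total recovery cost."""
    seed_idx = np.linspace(0, len(pruned) - 1, m).astype(int)
    virtual = pruned[seed_idx].astype(float).copy()
    for _ in range(iters):
        dists = np.linalg.norm(pruned[:, None, :] - virtual[None, :, :], axis=2)
        assign = dists.argmin(axis=1)      # nearest virtual token per pruned token
        for j in range(m):
            members = pruned[assign == j]
            if len(members):
                virtual[j] = members.mean(axis=0)
    return virtual

pruned = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
virtual = map_pruned_to_virtual(pruned, m=2)   # two clusters of pruned values
```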
  • the computing node first configures a fixed number of virtual tokens and then divides the sequence containing the virtual tokens.
  • the computing node may also first divide token sequence 1 and then insert virtual tokens into the token sequence corresponding to the more important token set 2.
  • the computing node can first determine a second number of virtual tokens based on the first number of tokens in token set 1, the second number being less than the first. For example, when the number of tokens in token set 1 that need to be pruned is 20, the number of virtual tokens can be 4. The computing node can then configure virtual tokens based on the second number and insert them into the token sequence corresponding to token set 2, for example at its starting position or middle position.
  • the ratio between the number of pruned tokens and the number of virtual tokens can be configured in advance by technical personnel.
  • the computing node can determine the number of virtual tokens in any suitable manner, and there is no limitation on this.
  • the computing node can determine the value of the inserted virtual token based on the token value in the pruned token set 1, such as by using the FLP algorithm to determine the value of the virtual token, so as to use the virtual token to recover the information of the pruned token.
  • the computing node can dynamically configure the number of virtual tokens based on the actual number of pruned tokens, thereby improving the flexibility of using virtual tokens to recover the pruned token information. For example, when the number of pruned tokens is large, the computing node will insert more virtual tokens in the token sequence corresponding to token set 2, so as to use more virtual tokens for information recovery.
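  The ratio-based dynamic configuration can be sketched as follows; the 5:1 ratio mirrors the 20 → 4 example above and stands in for a ratio configured in advance by technical personnel:

```python
def virtual_token_count(num_pruned, ratio=5):
    """Derive the number of virtual tokens from the number of pruned tokens.

    More pruned tokens yield more virtual tokens for information recovery;
    at least one virtual token is always configured."""
    return max(1, num_pruned // ratio)

counts = [virtual_token_count(n) for n in (20, 10, 3)]
```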
  • the above-mentioned implementation methods of configuring virtual tokens and determining virtual token values are only some exemplary explanations.
  • the computing node may also use other methods to configure the number and value of virtual tokens, and this is not limited.
  • S305 The computing node uses a compressed token sequence composed of the at least one virtual token and the tokens in token set 2 as the input sequence of the second network layer set.
  • after the computing node prunes the less important tokens in the first network layer set, it can obtain a new sequence based on the inserted virtual tokens and the retained, more important tokens.
  • the number of tokens included in this new sequence is less than the number of tokens included in token sequence 1, thus compressing token sequence 1.
  • the newly obtained sequence is referred to as a compressed token sequence below, so that the computing node can use this compressed token sequence as the input sequence of the second network layer set, so as to pass this compressed token sequence to the second network layer set in the AI model for continued calculation.
  • when the token is a word, the computing node can generate token sequence 2 (that is, the compressed token sequence mentioned above) based on the inserted virtual tokens and the tokens in token set 2, with the virtual tokens located at the starting position of token sequence 2; the computing node can then pass token sequence 2 to the second network layer set for calculation.
  • in the second network layer set, the computing node can perform the corresponding calculations based on the values of the virtual tokens and the values of the (non-virtual) tokens, so that the AI model continues the forward calculation process. In this way, the amount of data calculated by the computing node in the second network layer set is effectively reduced, which in turn reduces the resource consumption required for the calculation.
  • the activation value that the computing node would generate in the second network layer set using an activation function (such as the Sigmoid function) based on token sequence 1 is of dimension N × d (token sequence 1 includes N tokens).
  • by contrast, when the computing node calculates in the second network layer set based on token sequence 2 using the activation function, it generates an activation value of dimension k × d (token sequence 2 includes k tokens, k being a positive integer less than N). In this way, the computing node avoids (N − k) × d dimensions of data calculation, that is, it reduces the resource consumption that computing that amount of data would generate.
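  The dimension reduction can be checked numerically; the Sigmoid activation matches the example in the text, while the sizes N = 64, k = 24, d = 16 are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    """Example activation function from the text."""
    return 1.0 / (1.0 + np.exp(-x))

N, k, d = 64, 24, 16                         # original tokens, kept tokens, width
rng = np.random.default_rng(0)
weights = rng.standard_normal((d, d))
full_seq = rng.standard_normal((N, d))       # stand-in for token sequence 1
compressed_seq = full_seq[:k]                # stand-in for token sequence 2

act_full = sigmoid(full_seq @ weights)       # N x d activation values
act_comp = sigmoid(compressed_seq @ weights) # k x d activation values
saved_entries = (N - k) * d                  # activation entries no longer computed
```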
  • when the token is an image block, the computing node can generate token sequence 3 (i.e., the compressed token sequence described above) based on the inserted virtual tokens and the tokens in token set 2, with the virtual tokens located in the middle of token sequence 3; the computing node can then pass token sequence 3 to the second network layer set for calculation. Similarly, reducing the number of tokens (virtual and non-virtual) passed to the second network layer set reduces the resource consumption required for the computing node's calculations there.
  • meanwhile, by performing calculations in the second network layer set based on both the virtual tokens and the tokens in token set 2, the computing node can maintain the AI model's reasoning accuracy, approximately matching the accuracy the AI model would achieve if the second network layer set computed over the original token sequence 1.
  • at the same time, reducing the total number of tokens involved in the calculation in the second network layer set effectively reduces the resource consumption generated during compression of the representation sequence.
  • the computing node may perform a forward calculation process according to the method flow described in steps S301 to S305 above, and update the parameter values in each network layer of the AI model based on the difference between the result calculated in the forward calculation phase and the actual result.
  • the updated parameter values in the first network layer set include the value of the target parameter, which is a parameter for calculating the importance of each token in token sequence 1 in the first network layer set.
  • when the computing node updates the value of the target parameter, it can improve the discrimination among the importance scores that the target parameter calculates for different tokens.
  • the computing node may include a variance regularization term corresponding to the target parameter in the loss function used to train the AI model.
  • the loss function is used to calculate the loss value, which can be used to measure the difference between the inference result calculated by the AI model in the forward calculation phase and the actual result corresponding to token sequence 1. Normally, when the loss value is small enough (such as less than a preset value), the AI model training ends.
  • the variance regularization term is used to enhance the discrimination of the importance of the target parameter calculated for different tokens.
  • after the computing node calculates the importance score of each token using the target parameter, it can calculate the variance of the importance scores corresponding to the first network layer set, substitute that variance into the loss function to calculate the loss value, and update the value of the target parameter in the first network layer set according to the loss value.
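  One way to realize the variance regularization term is sketched below. The sign convention (subtracting the variance so that better-separated scores lower the loss) and the regularization weight are assumptions, since the description does not fix them:

```python
def variance(xs):
    """Population variance of a list of importance scores."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def loss_with_variance_reg(task_loss, scores, reg_weight=0.1):
    """Loss with a variance regularization term on the importance scores.

    Subtracting reg_weight * Var(scores) rewards parameter values that
    spread the scores apart, sharpening token discrimination."""
    return task_loss - reg_weight * variance(scores)

loss_separated = loss_with_variance_reg(1.0, [0.1, 0.9, 0.2, 0.8])
loss_collapsed = loss_with_variance_reg(1.0, [0.5, 0.5, 0.5, 0.5])
```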
  • the above describes, by way of example, a computing node performing a single pruning of the token sequence during the forward computation phase. In practice, the computing node may prune the token sequence multiple times at different network layer sets to further reduce the amount of resources required during the forward computation phase. This is exemplified below with reference to FIG. 7.
  • the method may specifically include:
  • S701 The computing node obtains token sequence 1, which is a sequence obtained based on the input data of the AI model.
  • S702 The computing node evaluates the importance of different tokens in token sequence 1 in the first network layer set, wherein the AI model includes the first network layer set and the second network layer set, and the first network layer set and the second network layer set are cascaded front and back.
  • S703 The computing node divides the tokens in the token sequence 1 into a token set 1 and a token set 2 according to the importance of each token.
  • the importance of the tokens in token set 2 in the first network layer set is higher than the importance of the tokens in token set 1 in the first network layer set of the AI model.
  • S704 The computing node determines m virtual tokens according to the tokens in token set 1, wherein the number of the m virtual tokens is less than the number of tokens in token set 1.
  • S705 The computing node uses the compressed token sequence composed of the m virtual tokens and the tokens in token set 2 as the input sequence of the second network layer set.
  • steps S701 to S705 can be found in the relevant description of steps S301 to S305 above, and will not be repeated here.
  • S706 The computing node evaluates the importance of different tokens in token set 2 in the second network layer set, wherein the AI model also includes a third network layer set, and the second network layer set and the third network layer set are cascaded front and back.
  • S707 The computing node divides the tokens in token set 2 into subset 1 and subset 2 according to the importance of each token.
  • the importance of the tokens in subset 2 in the second network layer set is higher than the importance of the tokens in subset 1 in the second network layer set.
  • the computing node may calculate the importance score of each token in token set 2 within the second network layer set. This importance score indicates the importance of the token within the second network layer set, with a higher importance score indicating a higher importance. The computing node may then compare the importance score of each token within the second network layer set with threshold 2. For tokens with an importance score less than threshold 2, the computing node may assign the token to subset 1; for tokens with an importance score greater than or equal to threshold 2, the computing node may assign the token to subset 2.
  • the manner in which the computing node divides the multiple tokens in token set 2 can be described in the description of the division of token sequence 1 in the embodiment of FIG. 3 above and is not further elaborated here.
  • Threshold 2 can be a value pre-set by a technician or a value learned by the AI model during training, and this is not limited to this. Furthermore, the thresholds used to measure token importance in different network layer sets can be the same or different, such as threshold 1 being greater than threshold 2.
  • the tokens in subset 1, with lower importance, are the tokens that need to be pruned at the second network layer set, while the tokens in subset 2, with higher importance, are passed to subsequent network layers in the AI model for calculation.
  • S708 The computing node determines p virtual tokens based on the tokens in subset 1, where the number of the p virtual tokens is less than the number of tokens in subset 1.
  • the value of p can be a preset value.
  • before performing inference based on the input token sequence 1, the computing node can pre-configure m virtual tokens (such as 4) for the first network layer set and p virtual tokens (such as 2) for the second network layer set. The computing node can then insert the configured (m+p) virtual tokens into token sequence 1.
  • the specific implementation method can refer to the description of the relevant parts about inserting virtual tokens in the embodiment shown in Figure 3 above, which will not be repeated here.
  • the configured (m+p) virtual tokens in the first network layer set and the second network layer set may not participate in the importance calculation.
  • the value of p can be dynamically determined by the computing node. For example, after the computing node divides the multiple tokens in token set 2 and obtains subsets 1 and 2, it can determine the value of p based on the number of tokens included in subset 1. For example, when the number of tokens included in subset 1 is 10, the value of p determined by the computing node can be 2. The computing node can then insert the p virtual tokens into the token sequence corresponding to the multiple tokens in subset 1. The positions of the p virtual tokens inserted by the computing node in the token sequence are not limited.
  • the computing node can determine the values of the p virtual tokens based on the values of the tokens in subset 1. For example, the computing node can map the token values in subset 1 to the values of the p virtual tokens using the FLP algorithm or other algorithm.
  • the specific implementation method can be found in the description of step S304 in the embodiment of FIG. 3 above, and is not further described here.
  • S709 The computing node uses the compressed token sequence composed of the m virtual tokens, the p virtual tokens and the tokens in subset 2 as the input sequence of the third network layer set.
  • the computing node can concatenate (m+p) virtual tokens and the tokens in subset 2 into a new compressed token sequence, that is, further compress the token sequence, and use the new compressed token sequence as the input sequence of the third network layer set, so as to pass the compressed token sequence to the third network layer set for further calculation.
  • the computing node may prune the token sequence multiple times at different network layers, and each time the token sequence is pruned, a new virtual token is used to retrieve the pruned token information.
  • the computing node may use the same virtual token to retrieve the information of different tokens pruned each time the token sequence is pruned multiple times.
  • after the computing node divides token sequence 1 into token set 1 and token set 2, it can use the m virtual tokens to recover token information from the pruned token set 1 and pass the m virtual tokens and the tokens of token set 2 to the second network layer set for calculation. Then, in the second network layer set, the computing node divides the multiple tokens of token set 2 into subset 1 and subset 2, and still uses the same m virtual tokens to recover token information from the pruned subset 1. For example, the computing node can first use the FLP algorithm to map the values of the tokens in the pruned subset 1 to the values of the m virtual tokens.
  • the computing node can then sum the mapped values with the current values of the m virtual tokens to obtain the new values of the m virtual tokens. Finally, the computing node passes the m virtual tokens and the tokens of subset 2 to the third network layer set in the AI model for calculation. In this way, during multiple rounds of token pruning in the forward calculation phase, the computing node uses the same m virtual tokens to recycle the information of the pruned tokens each time, thereby further reducing the number of tokens participating in the calculation at each network layer set and the amount of resources required in the forward calculation phase.
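  Reusing the same m virtual tokens across pruning rounds reduces, per the scheme above, to an element-wise sum of the newly mapped values and the current virtual-token values; a minimal list-based sketch with made-up values:

```python
def accumulate_into_virtual(virtual, mapped):
    """Recycle a further round of pruned-token information into the same
    m virtual tokens by element-wise summing the newly mapped values
    with the current virtual-token values."""
    return [[v + w for v, w in zip(vrow, wrow)]
            for vrow, wrow in zip(virtual, mapped)]

virtual = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]   # values after the first round
mapped = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]    # values mapped from subset 1
virtual = accumulate_into_virtual(virtual, mapped)
```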
  • the computing nodes can update the parameter values used to calculate token importance scores across multiple network layers.
  • the computing nodes can add a variance regularization term to the AI model's corresponding loss function for these parameters to enhance the discriminability of the importance scores of different tokens calculated using these parameters.
  • in the forward computation phase, after calculating the importance score of each token using the parameters in the first network layer set, the computing node can calculate variance 1 of the importance scores corresponding to the first network layer set. Then, using the parameters in the second network layer set, the computing node calculates the importance score of each token (excluding virtual tokens) passed to the second network layer set, and calculates variance 2 of the importance scores corresponding to the second network layer set. Next, the computing node can calculate the target variance based on variance 1 and variance 2.
  • the computing node can perform a weighted sum of variance 1 and variance 2, and the resulting sum is the target variance; the weight of variance 1 is greater than the weight of variance 2.
  • the computing node can substitute the target variance into the loss function to calculate the loss value, and based on this loss value, update the values of the parameters used to calculate importance scores in the first and second network layer sets.
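  The weighted combination of the two variances can be sketched as follows; the weights 0.6 and 0.4 are illustrative assumptions that merely satisfy the stated constraint that variance 1 is weighted more heavily than variance 2:

```python
def target_variance(var1, var2, w1=0.6, w2=0.4):
    """Weighted sum of the score variances of the two network layer sets.

    Per the text, the weight of variance 1 (shallow layers) must be greater
    than that of variance 2; the exact values here are assumptions."""
    assert w1 > w2, "variance 1 must be weighted more heavily"
    return w1 * var1 + w2 * var2

tv = target_variance(0.125, 0.05)
```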
  • the representation sequence compression device 800 includes:
  • An acquisition module 801 is configured to acquire a first token sequence, where the first token sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, wherein the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded.
  • AI artificial intelligence
  • An evaluation module 802 is configured to evaluate the importance of different tokens in the first token sequence in the first network layer set
  • a division module 803 is configured to divide the tokens in the first token sequence into a first token set and a second token set according to their importance;
  • Determination module 804 is used to determine at least one first virtual token based on the tokens in the first token set, where the number of the at least one virtual token is less than the number of tokens in the first token set, and use a compressed token sequence composed of the at least one first virtual token and the tokens in the second token set as an input sequence of the second network layer set.
  • the determination module 804 is configured to determine a value of at least one first virtual token according to values of tokens in the first token set.
  • the first network layer set includes a target parameter, and the target parameter is used to calculate the importance of a token in the first token sequence in the first network layer set;
  • the representation sequence compression apparatus 800 further includes a training module 805, configured to:
  • the loss value is calculated according to the loss function, which includes the variance regularization term corresponding to the target parameter;
  • the tokens in the first token sequence include words or image blocks.
  • the tokens in the first token sequence include words, and at least one first virtual token is located at a starting position in the compressed token sequence.
  • the tokens in the first token sequence include image blocks, and at least one first virtual token is located in the middle of the compressed token sequence.
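The two placement rules above (virtual tokens at the start for word sequences, in the middle for image-block sequences) can be sketched as follows; the exact midpoint rule for the image case is an illustrative assumption:

```python
def place_virtual_tokens(kept, virtual, modality):
    """Position the virtual tokens in the compressed sequence.

    For word tokens the virtual tokens go at the starting position; for
    image-block tokens they go in the middle of the kept tokens.
    """
    if modality == "text":
        return list(virtual) + list(kept)
    if modality == "image":
        mid = len(kept) // 2  # assumed midpoint rule
        return list(kept[:mid]) + list(virtual) + list(kept[mid:])
    raise ValueError(f"unknown modality: {modality!r}")
```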
  • the AI model further includes a third network layer set, and the second network layer set and the third network layer set are cascaded in sequence;
  • the evaluation module 802 is further configured to evaluate the importance of different tokens in the second token set in the second network layer set;
  • the division module 803 is further configured to divide the tokens in the second token set into a first subset and a second subset according to the importance of different tokens in the second token set in the second network layer set;
  • the determination module 804 is further configured to determine at least one second virtual token based on the tokens in the first subset, where the number of the at least one second virtual token is less than the number of tokens in the first subset, and to use a compressed token sequence composed of the at least one first virtual token, the at least one second virtual token, and the tokens in the second subset as an input sequence of the third network layer set.
  • the AI model further includes a third network layer set, and the second network layer set and the third network layer set are cascaded in sequence;
  • the evaluation module 802 is further configured to evaluate the importance of different tokens in the second token set in the second network layer set;
  • the division module 803 is further configured to divide the tokens in the second token set into a first subset and a second subset according to the importance of different tokens in the second token set in the second network layer set;
  • the determination module 804 is further configured to determine the value of the at least one first virtual token in the second network layer set based on the tokens in the first subset, and to use the compressed token sequence composed of the at least one first virtual token and the tokens in the second subset as the input sequence of the third network layer set.
  • the representation sequence compression device 800 is applied to the deployment stage of the AI model, and the representation sequence compression device 800 further includes an update module 806 for updating the model calculation graph corresponding to the AI model according to the compressed token sequence.
  • the representation sequence compression apparatus 800 is applied to the development phase of an AI model.
  • since the representation sequence compression device 800 shown in Figure 8 corresponds to the computing node in the embodiment shown in Figure 3 or Figure 7 above, the specific implementation of the representation sequence compression device 800 and its technical effects can be found in the relevant descriptions of the embodiment shown in Figure 3 or Figure 7 above, and will not be repeated here.
  • FIG9 is a schematic diagram of the hardware structure of a computing device 900 provided in the present application.
  • the computing device 900 can, for example, implement the computing nodes in the embodiments shown in FIG3 or FIG7 .
  • the computing device 900 includes a processor 901, a memory 902, and a communication interface 903.
  • the processor 901, the memory 902, and the communication interface 903 communicate via a bus 904, and may also communicate via other means such as wireless transmission.
  • the memory 902 is used to store instructions, and the processor 901 is used to execute the instructions stored in the memory 902.
  • the computing device 900 may also include a storage unit 905, which may be connected to the processor 901, the memory 902, and the communication interface 903 via the bus 904.
  • the memory 902 stores program code, and the processor 901 may call the program code stored in the memory 902 to perform the following operations:
  • acquire a first representation sequence, where the first representation sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, wherein the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded in sequence;
  • evaluate the importance of different representations in the first representation sequence in the first network layer set;
  • divide the representations in the first representation sequence into a first representation set and a second representation set according to the importance;
  • determine at least one first virtual representation based on the representations in the first representation set, where the number of the at least one first virtual representation is less than the number of representations in the first representation set; and
  • use a compressed representation sequence formed according to the at least one first virtual representation and representations in the second representation set as an input sequence of the second network layer set.
  • the processor 901 may be a CPU, or may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete device components, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the memory 902 may include a read-only memory and a random access memory, and provides instructions and data to the processor 901.
  • the memory 902 may also include a nonvolatile random access memory.
  • the memory 902 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the nonvolatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).
  • the communication interface 903 is used to communicate with other devices connected to the computing device 900.
  • in addition to a data bus, the bus 904 may also include a power bus, a control bus, and a status signal bus; however, for clarity of illustration, the various buses are all labeled as bus 904 in the figure.
  • the computing device 900 may correspond to the representation sequence compression device 800 in the embodiment of the present application, and may correspond to executing the method executed by the computing node in the method shown in Figure 3 or Figure 7 in the embodiment of the present application.
  • the above-mentioned and other operations and/or functions implemented by the computing device 900 are respectively for implementing the process of the corresponding method in Figure 3 or Figure 7. For the sake of brevity, they will not be repeated here.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium can be any available medium that a computing device can access, or a data storage device, such as a data center, that contains one or more available media.
  • the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive).
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the above-mentioned representation sequence compression method.
  • the present application also provides a computer program product.
  • the computer program product includes one or more computer instructions; when the computer instructions are loaded and executed on a computing device, the processes or functions described in the present application are generated in whole or in part.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer program product may be a software installation package, which may be downloaded and executed on a computing device whenever any of the aforementioned representation sequence compression methods needs to be used.
  • the above embodiments can be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above embodiments can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) method.
  • the computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that contains one or more available media sets.
  • the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium.
  • the semiconductor medium can be a solid-state drive.
  • references to "one embodiment" or "some embodiments" in this specification mean that a particular feature, structure, or characteristic described in conjunction with that embodiment is included in one or more embodiments of the present application.
  • phrases such as "in one embodiment," "in some embodiments," "in other embodiments," and "in yet other embodiments" appearing in various places in this specification do not necessarily refer to the same embodiment, but rather mean "one or more but not all embodiments," unless otherwise specifically emphasized.
  • the terms "including," "comprising," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

Abstract

A representation sequence compression method and apparatus, and a related device, relating to the technical field of artificial intelligence. The method comprises: acquiring a representation sequence; evaluating the importance of different representations (tokens) in the representation sequence in a first network layer set in an AI model, and dividing the representations in the representation sequence into a first representation set and a second representation set on the basis of the importance; and determining at least one virtual representation on the basis of the representations in the first representation set, and using a compressed representation sequence consisting of the virtual representation and the representations in the second representation set as an input sequence of a second network layer set in the AI model, wherein the number of determined virtual representations is less than the number of representations in the first representation set. In this way, by pruning some of the representations in the representation sequence, the amount of resources occupied by the forward computation of the AI model can be reduced; and the information of the pruned representations is recycled via the virtual representation and participates in the subsequent inference process, so that the inference accuracy of the AI model remains high.

Description

Representation sequence compression method and apparatus, and related device

This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on March 14, 2024, with application number 202410295573.5 and entitled "Representation sequence compression method and apparatus, and related device", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular, to a representation sequence compression method and apparatus, and a related device.

Background Art

With the development of artificial intelligence (AI) technology, AI models such as the generative pre-trained transformer (GPT) and the vision transformer (ViT) are widely used in fields such as natural language processing and visual processing.

In actual application scenarios, the length of the representation (token) sequence input to an AI model keeps growing; for example, the number of tokens in an input sequence may reach 2000, so the hardware resources required for the AI model to train or infer on such a sequence become excessive. For this reason, the representation sequence input to the AI model is usually compressed to reduce the model's demand for hardware resources during training or inference. Specifically, in the forward computation stage of training or inference, the importance of each token can be identified in the shallow layers of the AI model (for example, the first 5 network layers), and the input token sequence can be pruned according to each token's importance, so that only the more important tokens are retained and passed to subsequent network layers for further computation. In this way, the AI model performs forward computation on a smaller number of tokens, thereby reducing the demand for hardware resources.
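The conventional pruning scheme described above can be sketched as follows (a minimal NumPy sketch for illustration; how the importance scores are produced in the shallow layers is left abstract here):

```python
import numpy as np

def prune_topk(tokens, scores, k):
    """Conventional pruning: keep only the k tokens with the highest
    importance scores (in their original order) and discard the rest
    outright -- the discarded tokens' information is lost, which is the
    accuracy problem this application targets.
    tokens: (n, d) array; scores: (n,) array."""
    keep = np.sort(np.argsort(-scores)[:k])  # top-k indices, original order
    return tokens[keep]

tokens = np.array([[10.0], [20.0], [30.0]])
scores = np.array([0.1, 0.9, 0.5])
kept = prune_topk(tokens, scores, k=2)
```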

However, in actual application scenarios, this way of compressing the representation sequence tends to reduce the inference accuracy of the AI model.

Summary of the Invention

The present application provides a representation sequence compression method, aiming to reduce the amount of resources an AI model occupies during inference (or training) while improving the model's inference accuracy. The present application further provides a representation sequence compression apparatus, a computing device, a computer-readable storage medium, and a computer program product.

In a first aspect, the present application provides a representation sequence compression method, which may be performed by a corresponding representation sequence compression apparatus. Specifically, the apparatus acquires a first representation sequence, which is a sequence obtained from the input data of an artificial intelligence (AI) model; for example, when the input data is a passage of text, the representations (tokens) in the first representation sequence may be the symbols, characters, or words in that text. The AI model includes a first network layer set and a second network layer set that are cascaded, i.e., the output data of the first network layer set is the input data of the second network layer set. The apparatus then evaluates the importance of the different representations in the first representation sequence in the first network layer set, where the importance may be measured, for example, by an importance score. Next, the apparatus divides the representations in the first representation sequence into a first representation set and a second representation set according to the importance of each representation; typically, the representations in the second set are more important than those in the first set. Finally, the apparatus determines at least one first virtual representation based on the representations in the first representation set, and uses the compressed representation sequence formed from the at least one first virtual representation and the representations in the second representation set as the input sequence of the second network layer set, where the number of determined virtual representations is less than the number of representations in the first representation set. The representations in the first representation set are thus the pruned representations of the first representation sequence.
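The first-aspect flow above can be sketched as follows (a minimal NumPy sketch; the averaging merge rule, the keep ratio, and placing the virtual representations at the head of the sequence are illustrative assumptions, not details fixed by the application):

```python
import numpy as np

def compress_sequence(tokens, scores, keep_ratio=0.5, n_virtual=1):
    """One compression step over a representation sequence.

    tokens: (n, d) array of representations; scores: (n,) importance scores.
    Splits the sequence by importance, keeps the high-score set unchanged,
    and replaces the low-score set with n_virtual virtual representations
    that average its values.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-scores)              # highest scores first
    kept = tokens[order[:n_keep]]            # second representation set
    pruned = tokens[order[n_keep:]]          # first representation set
    chunks = np.array_split(pruned, n_virtual)
    virtual = np.stack([c.mean(axis=0) for c in chunks])
    # virtual representations first, then the retained representations
    return np.concatenate([virtual, kept])

tokens = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
scores = np.array([0.9, 0.1, 0.8, 0.2])
compressed = compress_sequence(tokens, scores)
```

The compressed output, shorter than the input, is what the second network layer set would consume instead of the full sequence.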

In this way, the virtual representations recycle the information of the representations pruned in the first network layer set for being of low importance. Even if a representation that is in fact important to the second network layer set is mistakenly pruned in the first network layer set, the information recovered by the virtual representations preserves the AI model's inference accuracy, avoiding the accuracy loss that fully discarding that representation would cause. Meanwhile, after part of the first representation sequence is pruned, the AI model performs forward computation on fewer representations, so the amount of resources occupied is effectively reduced. Moreover, during forward computation the virtual representations and the retained representations (i.e., those in the second representation set) can remain independent of each other, which prevents mutual interference between them from affecting the AI model's inference accuracy.

In a possible implementation, when determining the at least one first virtual representation based on the representations in the first representation set, the representation sequence compression apparatus may determine the value of the at least one first virtual representation according to the values of the representations in the first representation set. In this case, the number of first virtual representations may be a preconfigured fixed number. Using the first virtual representations to recycle the information (i.e., the values) of the pruned representations in the first representation set avoids the inference-accuracy loss that fully discarding those representations would cause.

In a possible implementation, when determining the at least one first virtual representation based on the representations in the first representation set, the apparatus may first determine the number of first virtual representations according to the number of representations in the first representation set, and then determine their values according to the values of those representations. In this case, the number of first virtual representations may depend on the number of pruned representations; for example, the more representations are pruned, the more first virtual representations are determined. The apparatus can thus dynamically configure the number of virtual representations according to the number of representations actually pruned, improving the flexibility of recycling pruned-representation information through virtual representations.
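The dynamic sizing rule above can be sketched as follows (the 1/4 ratio and the floor of one virtual representation are illustrative assumptions; the application only requires that more pruned representations yield more virtual representations):

```python
def virtual_token_count(n_pruned, ratio=0.25, min_count=1):
    """Size the virtual-representation set from the number of pruned
    representations: more pruned representations -> more virtual
    representations, with at least min_count of them."""
    return max(min_count, round(n_pruned * ratio))
```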

In a possible implementation, the first network layer set of the AI model includes a target parameter used to calculate the importance of the representations in the first representation sequence in the first network layer set. During training of the AI model, the apparatus may calculate a loss value according to a loss function that includes a variance regularization term corresponding to the target parameter, and update the target parameter in the first network layer set according to the loss value. By updating the value of the target parameter, the apparatus can increase the separation between the importance values computed for different representations, which helps improve the accuracy of the importance measurement.

In a possible implementation, the representations in the first representation sequence include words (characters or phrases) or image blocks. For example, the sequence of characters or words in a passage of text input by a user may serve as the first representation sequence, or the sequence formed by the image blocks of an image provided by a user may serve as the first representation sequence.

In a possible implementation, the representations in the first representation sequence include words; in this case, the at least one first virtual representation used to recycle the information of the pruned representations is located at the starting position of the compressed representation sequence. Positioning the virtual representations according to how important a word's position in a text is to that text helps improve the accuracy of the AI model's inference on the compressed representation sequence.

In a possible implementation, the representations in the first representation sequence include image blocks; in this case, the at least one first virtual representation used to recycle the information of the pruned representations is located in the middle of the compressed representation sequence. Positioning the virtual representations according to how important an image block's position in an image is to that image helps improve the accuracy of the AI model's inference on the compressed representation sequence.

In a possible implementation, the AI model further includes a third network layer set, where the second network layer set and the third network layer set are cascaded, i.e., the output data of the second network layer set serves as the input data of the third network layer set. The apparatus may then evaluate the importance of the different representations in the second representation set in the second network layer set and, according to that importance, divide the representations in the second representation set into a first subset and a second subset, where the representations in the second subset are more important than those in the first subset. The apparatus may then determine at least one second virtual representation based on the representations in the first subset, the number of which is less than the number of representations in the first subset, and use the compressed representation sequence formed from the at least one first virtual representation, the at least one second virtual representation, and the representations in the second subset as the input sequence of the third network layer set. By pruning the representation sequence several times in different network layer sets, the apparatus further reduces the amount of resources occupied during the forward computation stage; and by using distinct virtual representations to recycle the information of representations pruned in different network layer sets, and involving those virtual representations in the computation of subsequent network layers, it improves the AI model's inference accuracy.
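The two-stage variant above, where each stage contributes its own virtual representations, can be sketched as follows (the averaging merge rule and the fixed keep counts are illustrative assumptions):

```python
import numpy as np

def merge_to_virtual(pruned, n_virtual=1):
    # Average the pruned representations into n_virtual virtual ones
    # (the averaging merge rule is an illustrative assumption).
    return np.stack([c.mean(axis=0) for c in np.array_split(pruned, n_virtual)])

def two_stage_compress(tokens, scores1, score_fn2, keep1, keep2):
    """Two pruning stages: stage 1 yields virtual set V1 and retained set S;
    stage 2 re-scores S in the second network layer set, prunes within it to
    yield V2, and forwards [V1, V2, retained subset] to the third layer set."""
    order1 = np.argsort(-scores1)
    kept1, pruned1 = tokens[order1[:keep1]], tokens[order1[keep1:]]
    v1 = merge_to_virtual(pruned1)
    scores2 = score_fn2(kept1)              # importance in second layer set
    order2 = np.argsort(-scores2)
    kept2, pruned2 = kept1[order2[:keep2]], kept1[order2[keep2:]]
    v2 = merge_to_virtual(pruned2)
    return np.concatenate([v1, v2, kept2])

tokens = np.array([[1.0], [2.0], [3.0], [4.0]])
scores1 = np.array([4.0, 3.0, 2.0, 1.0])
out = two_stage_compress(tokens, scores1, lambda t: t[:, 0], keep1=3, keep2=2)
```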

In a possible implementation, the AI model further includes a third network layer set, where the second network layer set and the third network layer set are cascaded. The apparatus may then evaluate the importance of the different representations in the second representation set in the second network layer set and, according to that importance, divide the representations in the second representation set into a first subset and a second subset, where the representations in the second subset are more important than those in the first subset. The apparatus may then determine the value of the at least one first virtual representation in the second network layer set based on the representations in the first subset, and use the compressed representation sequence formed from the at least one first virtual representation and the representations in the second subset as the input sequence of the third network layer set. By pruning the representation sequence several times in different network layer sets, the apparatus further reduces the amount of resources occupied during the forward computation stage; and by reusing the same virtual representation to recycle the information of representations pruned in different network layer sets, it further reduces the amount of resources consumed during forward computation.
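The reuse variant above, where later stages update the existing virtual representation instead of adding new ones, can be sketched as follows (folding the new prunings in by averaging is an illustrative assumption about the update rule):

```python
import numpy as np

def update_virtual(virtual, newly_pruned):
    """Reuse one virtual representation across layer sets: fold the newly
    pruned representations into the existing virtual representation, so
    later pruning stages add no new virtual representations and the
    sequence stays the same length.
    virtual: (d,) array; newly_pruned: (m, d) array."""
    pool = np.concatenate([virtual[None, :], newly_pruned])
    return pool.mean(axis=0)

v = np.array([2.0])                       # virtual representation from stage 1
v_next = update_virtual(v, np.array([[4.0], [6.0]]))  # stage-2 prunings
```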

In a possible implementation, the method by which the apparatus compresses the first representation sequence may be applied in the deployment stage of the AI model. The apparatus may then update the model computation graph corresponding to the AI model according to the compressed representation sequence, so that when performing forward computation according to the updated computation graph, it can reduce the amount of resources consumed during forward computation by compressing the representation sequence, while using virtual representations to recycle the information of the pruned representations to preserve the AI model's inference accuracy.

In a possible implementation, the method by which the apparatus compresses the first representation sequence may be applied in the development stage of the AI model; for example, an SDK (software development kit) implementing representation sequence compression may be added to the model file corresponding to the AI model during development. When running the developed AI model, the apparatus can then compress the representation sequence to reduce the amount of resources consumed during forward computation, while using virtual representations to recycle the information of the pruned representations to preserve the AI model's inference accuracy.

In a second aspect, the present application provides a representation sequence compression device, including: an acquisition module, configured to acquire a first representation sequence, where the first representation sequence is a sequence obtained from input data of an artificial intelligence (AI) model, the AI model includes a first network layer set and a second network layer set, and the second network layer set is cascaded after the first network layer set; an evaluation module, configured to evaluate the importance of different representations in the first representation sequence to the first network layer set; a division module, configured to divide the representations in the first representation sequence into a first representation set and a second representation set according to the importance; and a determination module, configured to determine at least one first virtual representation according to the representations in the first representation set, where the number of the at least one first virtual representation is less than the number of representations in the first representation set, and to use a compressed representation sequence formed from the at least one first virtual representation and the representations in the second representation set as an input sequence of the second network layer set.

In a possible implementation, the determination module is configured to determine the value of the at least one first virtual representation according to the values of the representations in the first representation set.

In a possible implementation, the determination module is configured to: determine the number of the at least one first virtual representation according to the number of representations in the first representation set; and determine the value of the at least one first virtual representation according to the values of the representations in the first representation set.

In a possible implementation, the first network layer set includes a target parameter, and the target parameter is used to compute the importance of the representations in the first representation sequence to the first network layer set. The representation sequence compression device further includes a training module, configured to: during training of the AI model, compute a loss value according to a loss function, where the loss function includes a variance regularization term corresponding to the target parameter; and update the target parameter according to the loss value.

In a possible implementation, the representations in the first representation sequence include words or image blocks.

In a possible implementation, the representations include words, and the at least one first virtual representation is located at the starting position of the compressed representation sequence.

In a possible implementation, the representations include image blocks, and the at least one first virtual representation is located at an intermediate position of the compressed representation sequence.

In a possible implementation, the AI model further includes a third network layer set, and the third network layer set is cascaded after the second network layer set. The evaluation module is further configured to evaluate the importance of different representations in the second representation set to the second network layer set; the division module is further configured to divide the representations in the second representation set into a first subset and a second subset according to that importance; and the determination module is further configured to determine at least one second virtual representation according to the representations in the first subset, where the number of the at least one second virtual representation is less than the number of representations in the first subset, and to use a compressed representation sequence formed from the at least one first virtual representation, the at least one second virtual representation, and the representations in the second subset as an input sequence of the third network layer set.

In a possible implementation, the AI model further includes a third network layer set, and the third network layer set is cascaded after the second network layer set. The evaluation module is further configured to evaluate the importance of different representations in the second representation set to the second network layer set; the division module is further configured to divide the representations in the second representation set into a first subset and a second subset according to that importance; and the determination module is further configured to determine the value of the at least one first virtual representation in the second network layer set according to the representations in the first subset, and to use a compressed representation sequence formed from the at least one first virtual representation and the representations in the second subset as an input sequence of the third network layer set.

In a possible implementation, the representation sequence compression device is applied in the deployment stage of the AI model, and the device further includes an update module, configured to update the model computation graph corresponding to the AI model according to the compressed representation sequence.

In a possible implementation, the representation sequence compression device is applied in the development stage of the AI model.

The representation sequence compression device provided in the second aspect corresponds to the representation sequence compression method provided in the first aspect. Therefore, for the technical effects of the second aspect and of any implementation of the second aspect, reference may be made to the description of the technical effects of the first aspect and its corresponding implementations, which are not repeated here.

In a third aspect, the present application provides a computing device including a processor and a memory, where the memory is configured to store instructions, and the processor executes the instructions stored in the memory to perform the operation steps of the representation sequence compression method described in the first aspect or any implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium storing instructions that, when run on a computing device, cause the computing device to perform the operation steps of the representation sequence compression method described in the first aspect or any implementation of the first aspect.

In a fifth aspect, the present application provides a computer program product containing instructions that, when run on a computing device, cause the computing device to perform the operation steps of the representation sequence compression method described in the first aspect or any implementation of the first aspect.

On the basis of the implementations provided in the above aspects, this application may further combine them to provide additional implementations.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the inference process of an AI model;

FIG. 2 is a schematic diagram of the inference process of an exemplary AI model provided in this application;

FIG. 3 is a schematic flowchart of a representation sequence compression method provided in this application;

FIG. 4 is a schematic diagram of inserting m virtual tokens into token sequence 1 as provided in this application;

FIG. 5 is a schematic diagram of pruning tokens and recycling token information as provided in this application;

FIG. 6 is a schematic diagram of the reduced computation after token pruning;

FIG. 7 is a schematic flowchart of another representation sequence compression method provided in this application;

FIG. 8 is a schematic structural diagram of a representation sequence compression device provided in this application;

FIG. 9 is a schematic diagram of the hardware structure of a computing device provided in this application.

DETAILED DESCRIPTION

To make the above objects, features, and advantages of the present application clearer and easier to understand, various non-limiting embodiments of the present application are described below by way of example with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained based on the embodiments in this application fall within the scope of protection of this application.

Referring to FIG. 1, a schematic diagram of the inference process of an AI model is shown; the AI model can be run on one or more computing nodes.

For example, the AI model may be a generative pre-trained transformer (GPT) model, a bidirectional encoder representations from transformers (BERT) model, a vision transformer (ViT) model, or a contrastive language-image pre-training (CLIP) model, or it may be another type of AI model; this is not limited here.

The computing node that runs the AI model may be an accelerator card, such as a deep learning processor, a data processing unit (DPU), a graphics processing unit (GPU), or a neural-network processing unit (NPU), or another type of accelerator card. Alternatively, the computing node may be a general-purpose processor such as a central processing unit (CPU), or a computing device including both a CPU and an accelerator card. This application does not limit the specific implementation of the computing node.

For ease of description, running the AI model on a single computing node is used as an example below.

As shown in FIG. 1, the input data of the AI model can be an image, and the AI model can perform inference based on that image. For example, the AI model can generate text information from the input image, such as text describing the content of the image.

When the AI model performs inference on image data, the computing node can first decompose the image into multiple image blocks (each of which can have the same size) and expand the blocks in a specified order, such as from the upper-left corner to the lower-right corner of the image, to obtain a sequence of image blocks. The computing node can then treat each image block as a token, thereby obtaining the corresponding token sequence, as shown in FIG. 1.

The computing node can then use the AI model to perform inference on the token sequence. In this process, the longer the input token sequence, the more computing and memory resources the forward computation consumes. Therefore, the computing node can prune the input tokens. Specifically, as shown in FIG. 1, in the shallow layers of the AI model, the computing node can compute an importance score for each token in the input token sequence and derive a mask from those scores, where the mask identifies the less important tokens in the sequence. The computing node can then remove the less important tokens according to the mask and let the remaining tokens continue to participate in the computation of the subsequent network layers of the AI model. The AI model includes multiple network layers; in forward-computation order, the network layers executed earlier can be described as shallow layers (e.g., the first 5 network layers), and those executed later as deep layers (e.g., the last 10 network layers). In this way, the subsequent network layers operate on fewer tokens, which reduces the overall resources the AI model consumes for inference. Although pruning some tokens from the token sequence lowers the inference accuracy of the AI model, the retained tokens are the more important ones, so the inference accuracy can remain at a relatively high level.
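The mask-based pruning described above can be sketched in plain Python as follows. This is a minimal illustration only, not the actual implementation; the token labels, scores, and threshold are assumed values for the example:

```python
def prune_tokens(tokens, scores, threshold):
    """Drop tokens whose importance score falls below the threshold.

    Returns the retained tokens and the boolean mask (True = keep),
    mirroring the mask computed in the shallow layers.
    """
    mask = [s >= threshold for s in scores]
    kept = [t for t, keep in zip(tokens, mask) if keep]
    return kept, mask

tokens = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.9, 0.1, 0.6, 0.2, 0.8]
kept, mask = prune_tokens(tokens, scores, threshold=0.5)
# kept == ["t0", "t2", "t4"]: only the more important tokens
# are passed on to the deep layers
```

As the background above notes, pruning alone discards the information of the removed tokens entirely, which is what the virtual-token scheme later in this application addresses.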

Generally, the more tokens the computing node removes in the shallow layers of the AI model, the fewer resources the AI model consumes when performing inference on the input token sequence, but the greater the drop in inference accuracy, as shown in FIG. 1. Conversely, the fewer tokens the computing node removes in the shallow layers, the more resources inference consumes and, correspondingly, the higher the inference accuracy, as shown in FIG. 1. Moreover, a token that is less important in the shallow layers of the AI model may be more important in the deep layers, so removing, in the shallow layers, a token that matters to the deep layers can also significantly degrade the inference accuracy of the AI model.

On this basis, the present application provides a token sequence compression method that aims to reduce the resource requirements of the AI model during inference (or training) while improving its inference accuracy.

In a specific implementation, as shown in FIG. 2, during the forward computation stage, the computing node can compute, in the shallow layers of the AI model, the importance of each token in the input token sequence, that is, evaluate how important each token is to the shallow layers (this importance can be measured by an importance score), and divide the tokens in the sequence into two token sets, token set 1 and token set 2, according to their importance, with each set containing at least one token. The tokens in token set 2 are more important to the shallow layers of the AI model than the tokens in token set 1. For example, the computing node can compute an importance score for each token, derive a mask from the scores, and use the mask to distinguish the more important tokens from the less important ones, thereby dividing the token sequence into the two sets. The computing node then determines at least one virtual token from the tokens in token set 1, so that the virtual token can recycle the information of those tokens; the number of virtual tokens is less than the number of tokens in token set 1. The computing node uses the compressed token sequence formed from the at least one virtual token and the tokens in token set 2 as the input sequence of the deep layers of the AI model, that is, it passes the virtual token together with the retained tokens of token set 2 to the deep layers for computation. Here, the tokens in token set 1 are the pruned tokens of the token sequence; after pruning the less important tokens in the shallow layers, the computing node uses the virtual token to recycle the information of the pruned tokens so that it participates in the deep-layer computation.
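The compression scheme above can be sketched as follows. This is a minimal illustration in which the pruned tokens of token set 1 are recycled into a single virtual token formed as an importance-weighted average of their feature vectors; the merging rule, the use of exactly one virtual token, and its position at the front of the sequence are assumptions made for the sketch, not the definitive method:

```python
def compress_sequence(tokens, scores, threshold):
    """Split tokens by importance, then recycle the pruned ones into one
    virtual token (an importance-weighted average of their feature
    vectors). Returns the compressed sequence for the deep layers.
    """
    set1 = [(t, s) for t, s in zip(tokens, scores) if s < threshold]   # pruned
    set2 = [t for t, s in zip(tokens, scores) if s >= threshold]       # retained
    if not set1:
        return set2
    total = sum(s for _, s in set1)
    dim = len(set1[0][0])
    virtual = [sum(v[i] * s for v, s in set1) / total for i in range(dim)]
    # One virtual token replaces all pruned tokens, so the deep layers
    # see a shorter sequence but still receive the recycled information.
    return [virtual] + set2

compressed = compress_sequence(
    [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 2.0]],  # token feature vectors
    [0.1, 0.2, 0.9, 0.8],                               # importance scores
    threshold=0.5,
)
# compressed has 3 entries: the virtual token plus the 2 retained tokens
```

Under this sketch, four input tokens are reduced to three deep-layer inputs while the two pruned tokens still contribute through the virtual token.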

In this way, virtual tokens recycle the information of tokens pruned in the shallow layers for being less important there. Even if a token that is highly important to the deep layers is mistakenly pruned in the shallow layers, the information recycled by the virtual tokens preserves the inference accuracy of the AI model, avoiding the accuracy loss that would result from discarding the token entirely, as shown in FIG. 2. Meanwhile, after part of the token sequence is pruned, the AI model performs forward computation on fewer tokens, so the resources it consumes are effectively reduced. In an actual test scenario, using virtual tokens to recycle the information of pruned tokens raised the inference accuracy of the AI model to 94.65% while reducing the memory resources consumed by 63%.

Moreover, during forward computation, the virtual tokens and the retained tokens can be kept independent of each other, which prevents mutual interference between them from affecting the inference accuracy of the AI model.

In practical application scenarios, the computing node can use a configured plug-in to prune the token sequence input to the AI model; alternatively, the computing node can prune the input token sequence based on the code logic of a software development kit (SDK) built into the model file corresponding to the AI model.

In the first implementation, during the deployment stage of the AI model, a target plug-in for pruning token sequences can be configured on the computing node. When training or inference is performed with the AI model, the computing node can activate and run the target plug-in and use it to update the model computation graph (for the forward computation stage) corresponding to the AI model. For example, the computing node can use the target plug-in to invoke the dynamic graph engine on the computing node and obtain, by analysis, the model computation graph of the AI model, which indicates the computational logic the AI model executes during forward computation. The computing node can then use the target plug-in, with dynamic graph instrumentation technology (such as the Torch.FX toolkit), to insert into the computation graph the computational logic for pruning token sequences, including computing token importance scores, generating the mask that indicates pruned and retained tokens, and configuring and inserting virtual tokens. In this way, the computing node can execute the forward computation process according to the updated computation graph, pruning the input token sequence within the AI model accordingly.

In the second implementation, during the development stage of the AI model, for example when developing the model with a framework such as PyTorch or MindSpore, the user can call an SDK interface to add the SDK that implements token sequence pruning to the model file corresponding to the AI model. Then, during training and inference with the AI model, the computing node can execute the model file and, by running the SDK in it, prune the input token sequence accordingly.

For ease of understanding, embodiments of the representation sequence compression method provided in this application are described below with reference to the accompanying drawings.

Referring to FIG. 3, FIG. 3 is a schematic flowchart of a representation sequence compression method provided in an embodiment of this application. As shown in FIG. 3, the representation sequence compression method may specifically include:

S301: The computing node acquires token sequence 1, which is a sequence obtained from the input data of the AI model.

Typically, token sequence 1 can include multiple tokens.

For example, each token can be a word. When the AI model performs forward computation on a word sequence, the computing node can treat each word in the sequence as a token, thereby obtaining token sequence 1. For instance, a user may provide the computing node with a passage of text such as "The weather is nice today; what activities would be good to do with friends?". The computing node can treat each character and each punctuation mark in the passage as a token, or it can segment the passage into words and treat each resulting word as a token; this is not limited here.

Alternatively, each token can be an image block. When the AI model performs forward computation on one or more images, the computing node can decompose each image into multiple image blocks in a specified order and treat each block as a token, so that the computing node obtains the corresponding token sequence 1 according to the decomposition order. For example, a user may provide the computing node with an image of 2560×1440 pixels; the computing node can then, proceeding from the upper-left corner to the lower-right corner of the image, decompose it into 256 image blocks of 160×90 pixels each, obtaining a sequence of 256 tokens.
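The decomposition in this example can be sketched as follows; the function name and the row-major (upper-left to lower-right) traversal are illustrative:

```python
def image_to_patch_tokens(width, height, pw, ph):
    """Enumerate patch (token) origins in row-major order, i.e., from the
    upper-left corner to the lower-right corner of the image."""
    assert width % pw == 0 and height % ph == 0
    return [(x, y) for y in range(0, height, ph) for x in range(0, width, pw)]

tokens = image_to_patch_tokens(2560, 1440, 160, 90)
# a 2560x1440 image split into 160x90 patches yields a 16 x 16 grid,
# i.e., a sequence of 256 tokens
```

Each token in practice would carry the pixel block at its origin; only the ordering and count are shown here.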

In a possible implementation, the computing node can provide a client externally, which can be, for example, an application running on a user device or a web browser. The user can provide text or an image to the client, and the client forwards the text or image to the computing node. The computing node then uses the received text or image as the input data of the AI model and generates the corresponding token sequence 1 from that input data in the manner described above.

After acquiring token sequence 1, the computing node can perform forward computation based on it. The forward computation the computing node performs on the input token sequence 1 can be forward computation in an inference scenario or in a training scenario; this is not limited here.

S302: The computing node evaluates the importance of different tokens in token sequence 1 to the first network layer set, where the AI model includes a first network layer set and a second network layer set, and the second network layer set is cascaded after the first network layer set.

In this embodiment, the AI model can include multiple network layers, which can be divided, according to their depth in the AI model, into a first network layer set and a second network layer set, each including at least one network layer. In practice, the network layers in the first network layer set can also be called the shallow layers of the AI model, and the network layers in the second network layer set the deep layers. The first and second network layer sets are cascaded in sequence, that is, during forward computation, the input data of the second network layer set can be obtained from the output data of the first network layer set.

After acquiring token sequence 1, the computing node can perform forward computation with it through each network layer of the AI model. When token sequence 1 reaches the first network layer set, the computing node can compute an importance score for each token in the sequence with respect to the first network layer set; the importance score measures how important that token is to the first network layer set. Generally, a higher importance score indicates higher importance.

As an implementation example, the computing node can first compute the norm of the feature vector, in the first network layer set, of each token at its spatial position in token sequence 1, and compute importance score 1 of the token from that norm. The computing node then computes the attention score between the token and the token at the starting position of token sequence 1, and computes importance score 2 of the token from that attention score. Methods of computing importance score 1 and importance score 2 have existing applications in practice and are not detailed here. Finally, the computing node computes the token's final importance score in the first network layer set as a weighted sum of importance score 1 and importance score 2.
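The two-part score in this example can be sketched as follows, assuming the token's feature vector and its attention score to the starting token are already available; the L2 norm and the weighting coefficients are illustrative assumptions, since the text leaves both unspecified:

```python
import math

def importance_score(feature, attn_to_first, alpha=0.5, beta=0.5):
    """Importance score 1 is derived from the norm of the token's feature
    vector (L2 here); importance score 2 is the token's attention score to
    the starting token; the final score is their weighted sum."""
    score1 = math.sqrt(sum(v * v for v in feature))  # feature-norm term
    score2 = attn_to_first                           # attention term
    return alpha * score1 + beta * score2

s = importance_score([3.0, 4.0], attn_to_first=0.2)
# the L2 norm of [3, 4] is 5, so s == 0.5 * 5 + 0.5 * 0.2 == 2.6
```

In practice the weights alpha and beta could themselves be learned during training, consistent with the learnable target parameter described in the second aspect above.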

In other embodiments, the computing node can also compute each token's importance score in the first network layer set based on the learned token pruning (LTP) algorithm, or based on the token-sparsity framework of auto-scaling vision transformers (AS-ViT). In addition, the importance of a token to the first network layer set can be measured in other ways, which are not limited here.

S303:计算节点根据各个token的重要程度,将token序列1中的token分成token集合1以及token集合2。S303: The computing node divides the tokens in the token sequence 1 into a token set 1 and a token set 2 according to the importance of each token.

本实施例中,token集合2中的token在第一网络层集合中的重要程度,高于token集合1中的token在AI模型的第一网络层集合中的重要程度。In this embodiment, the importance of the tokens in token set 2 in the first network layer set is higher than the importance of the tokens in token set 1 in the first network layer set of the AI model.

作为一种实现示例,当重要程度通过重要性分数进行度量时,计算节点可以将每个token在第一网络层集合中的重要性分数与阈值1进行比较,并将重要性分数小于阈值1的token划入token集合1,将重要性分数大于或者等于阈值1的token划入token集合2。从而,计算节点能够将token序列1包括的多个token划分成token集合1以及token集合2,并且,token集合2中的token的重要程度高于token集合1中的token的重要程度。其中,阈值1可以是由技术人员预先设定的值,或者可以是由AI模型在训练过程中通过自学习进行确定。As an implementation example, when the importance is measured by an importance score, the computing node can compare the importance score of each token in the first network layer set with threshold 1, and classify tokens with importance scores less than threshold 1 into token set 1, and tokens with importance scores greater than or equal to threshold 1 into token set 2. Thus, the computing node can divide the multiple tokens included in token sequence 1 into token set 1 and token set 2, and the importance of the tokens in token set 2 is higher than the importance of the tokens in token set 1. Threshold 1 can be a value pre-set by a technician, or can be determined by the AI model through self-learning during the training process.
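The threshold comparison described above amounts to a simple partition; the following Python sketch (the function name and the index-list representation are illustrative, not from the text) shows how tokens fall into token set 1 or token set 2 around threshold 1:

```python
def partition_tokens(scores, threshold):
    """Split token indices by importance score: scores below the threshold
    go to set 1 (to be pruned); scores at or above it go to set 2 (retained)."""
    set1 = [i for i, s in enumerate(scores) if s < threshold]
    set2 = [i for i, s in enumerate(scores) if s >= threshold]
    return set1, set2
```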

其中,重要程度较低的token集合1中的token,即为在AI模型中需要被裁剪的token。相应地,在前向计算过程中,计算节点可以保留token集合2中的token参与后续网络层的计算。实际应用场景中,当token具体为单词时,token序列1中的第一个token为单词序列中的首个单词,并且,首个单词对于针对该单词序列进行自然语言处理过程通常较为重要。相应地,token序列1中的第一个token在AI模型的前向计算过程中的重要程度也较高,从而计算节点通常可以将该token保留在token集合2中。当token具体为图像块时,token序列1中位于中间的token的重要程度通常较高,因此,计算节点通常可以将位于中间的token保留在token集合2中。Among them, the tokens in the token set 1 with lower importance are the tokens that need to be pruned in the AI model. Accordingly, in the forward calculation process, the computing node can retain the tokens in the token set 2 to participate in the calculation of the subsequent network layer. In actual application scenarios, when the token is specifically a word, the first token in the token sequence 1 is the first word in the word sequence, and the first word is usually more important for the natural language processing process of the word sequence. Accordingly, the first token in the token sequence 1 is also more important in the forward calculation process of the AI model, so the computing node can usually retain the token in the token set 2. When the token is specifically an image block, the importance of the token in the middle of the token sequence 1 is usually higher. Therefore, the computing node can usually retain the token in the middle in the token set 2.

实际应用时,计算节点也可以通过其他方式对token序列1中的多个token进行划分,对此并不进行限定。In actual applications, the computing node may also divide the multiple tokens in the token sequence 1 in other ways, and there is no limitation on this.

S304:计算节点根据token集合1中的token确定至少一个虚拟token,其中,该至少一个虚拟token的数量少于token集合1中的token数量。S304: The computing node determines at least one virtual token based on the tokens in token set 1, wherein the number of the at least one virtual token is less than the number of tokens in token set 1.

本实施例中,如果直接将在第一网络层集合中重要程度较低的token进行裁剪,则可能会因为该部分token在后续的第二网络层中的重要程度较高而导致AI模型的推理精度受到影响。因此,计算节点可以利用虚拟token来回收被裁剪的token的信息,以便利用回收的token信息来实现对AI模型的推理精度进行补偿。In this embodiment, if tokens with lower importance in the first network layer are directly pruned, the AI model's reasoning accuracy may be affected because these tokens are more important in the subsequent second network layer. Therefore, the computing node can use virtual tokens to recycle the pruned token information, so that the recycled token information can be used to compensate for the AI model's reasoning accuracy.

在第一种可能的实施方式中,在对token序列1进行划分之前,计算节点可以预先为该token序列1配置目标数量的虚拟token(如配置4个虚拟token等),并将该目标数量的虚拟token插入至计算节点所获取的token序列1中。所插入的虚拟token,用于回收在第一网络层集合中被裁剪的token集合1中的token信息。其中,虚拟token的目标数量,可以预先由技术人员进行设定,如可以是由技术人员通过有限次数的实验设定虚拟token的目标数量等。In a first possible implementation, before partitioning token sequence 1, the computing node may pre-configure a target number of virtual tokens for token sequence 1 (e.g., 4 virtual tokens) and insert this target number of virtual tokens into token sequence 1 obtained by the computing node. The inserted virtual tokens are used to reclaim token information from token set 1 that was pruned from the first network layer set. The target number of virtual tokens may be pre-set by a technician, for example, through a limited number of experiments.

当token为单词时,计算节点可以将虚拟token插入至token序列1的起始位置。比如,如图4所示,token序列1中可以包括"cls"标记的token(也可以称之为分类字符),该token用于标记token序列1的开始(并且能够用于指示文本的类别)。同时,token序列1中还可以包括n+1个单词对应的token(n为正整数),图4中用x0至xn分别进行标识。在token序列1中,"cls"标记的token的下一个token(即x0)即为单词序列中的第一个单词对应的token。因此,在对token序列1进行划分之前,计算节点可以先在token序列1中,在"cls"标记的token之前依次插入m个虚拟token(m为正整数),得到新的token序列,图4中用v1至vm依次标识所插入的虚拟token。此时,如图4所示,新生成的token序列中,第一个token为虚拟token,第m+1个token为"cls"标记对应的token,第m+2个token为单词序列中的首个单词对应的token。即,token序列1中的每个token在新生成的token序列中的绝对位置均推后m个token对应的位置。然后,计算节点可以对包括虚拟token的新的token序列进行划分,得到重要程度相对较高的token集合2。通常情况下,计算节点在对划分得到token集合2时,该"cls"标记的token也会保留在该token集合2中。When the token is a word, the computing node can insert virtual tokens at the starting position of token sequence 1. For example, as shown in Figure 4, token sequence 1 can include a token marked with "cls" (also known as a classification character), which is used to mark the beginning of token sequence 1 (and can also be used to indicate the category of the text). At the same time, token sequence 1 can also include tokens corresponding to n+1 words (n is a positive integer), which are identified by x0 to xn in Figure 4. In token sequence 1, the token following the token marked with "cls" (i.e., x0) is the token corresponding to the first word in the word sequence. Therefore, before partitioning token sequence 1, the computing node can first insert m virtual tokens (m is a positive integer) into token sequence 1 before the token marked with "cls" to obtain a new token sequence. In Figure 4, v1 to vm identify the inserted virtual tokens in order. At this point, as shown in Figure 4, in the newly generated token sequence, the first token is a virtual token, the (m+1)-th token is the token corresponding to the "cls" mark, and the (m+2)-th token is the token corresponding to the first word in the word sequence. That is, the absolute position of each token in token sequence 1 is pushed back by m token positions in the newly generated token sequence.
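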
The computing node can then partition the new token sequence, which includes the virtual tokens, to obtain token set 2 with relatively higher importance. Typically, when the computing node performs this partition to obtain token set 2, the token marked with "cls" is also retained in token set 2.

当token为图像块时,计算节点可以将虚拟token插入至token序列1的中间位置。比如,假设token序列1包括N个token(N为大于1的正整数),计算节点可以从位于序列中间位置的token处开始依次插入m个虚拟token。When the token is an image block, the computing node can insert the virtual tokens at the middle position of token sequence 1. For example, assuming that token sequence 1 includes N tokens (N is a positive integer greater than 1), the computing node can insert m virtual tokens in sequence starting from the token located at the middle position of the sequence.

实际应用时,计算节点在token序列1中插入虚拟token的位置,也可以是其他位置,如可以从token序列1的最末端插入虚拟token等,对此并不进行限定。并且,计算节点在插入虚拟token后,可以对虚拟token的取值进行初始化,如将该虚拟token的取值全部设为1等。In actual applications, the computing node can insert the virtual token at any other location within token sequence 1, such as at the very end of token sequence 1, and there is no limitation on this. Furthermore, after inserting the virtual token, the computing node can initialize the value of the virtual token, such as by setting all values of the virtual token to 1.

相应地,计算节点在对token序列1中的多个token进行划分时,具体可以是对插入虚拟token所得到的token序列进行划分,此时,计算节点可以默认将该token序列中的全部虚拟token均划分至token集合2中(虚拟token可以不参与重要性分数的计算)。Accordingly, when the computing node divides multiple tokens in the token sequence 1, it may specifically divide the token sequence obtained by inserting the virtual token. At this time, the computing node may divide all the virtual tokens in the token sequence into the token set 2 by default (the virtual token may not participate in the calculation of the importance score).
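A sketch of the default handling described above, assuming the virtual tokens occupy the first m positions of the sequence and that only the real tokens receive importance scores (the layout and names are assumptions for illustration):

```python
def partition_with_virtuals(m, real_scores, threshold):
    """Partition a sequence whose first m positions are virtual tokens.

    Virtual tokens skip importance scoring and are placed in the retained
    set (token set 2) by default; real_scores holds the scores of the real
    tokens, whose absolute positions are shifted back by m."""
    retained = list(range(m))  # all virtual tokens retained by default
    pruned = []
    for i, score in enumerate(real_scores):
        (retained if score >= threshold else pruned).append(m + i)
    return pruned, retained
```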

在得到token集合2后,计算节点可以利用插入的虚拟token(也即token集合2中的虚拟token),回收被裁剪的token集合1中的token信息。具体实现时,计算节点可以采用映射算法,根据token集合1中的token的取值,计算每个虚拟token的取值。比如,如图5所示,假设被裁剪的token为x1至xi,则,每个虚拟token的取值,均可以是根据被裁剪多个token的取值进行计算得到。示例性地,计算虚拟token的取值所采用的映射算法,例如可以是设施选址问题(facility location problem,FLP)算法等,具体是将虚拟token的取值作为地址,将虚拟token的建造开销、虚拟token与被裁剪的token的取值差异作为目标(也即最小化回收成本目标),通过建立原始问题松弛的对偶规划快速求解得到虚拟token的取值,以此完成被裁剪的token至虚拟token的映射。实际应用时,计算节点也可以采用其他映射算法,根据token集合1中的取值计算得到虚拟token的取值,对此并不进行限定。After obtaining token set 2, the computing node can use the inserted virtual tokens (i.e., the virtual tokens in token set 2) to recycle the token information of the pruned token set 1. In a specific implementation, the computing node can use a mapping algorithm to calculate the value of each virtual token based on the values of the tokens in token set 1. For example, as shown in Figure 5, assuming that the pruned tokens are x1 to xi, the value of each virtual token can be calculated based on the values of the multiple pruned tokens. Exemplarily, the mapping algorithm used to calculate the values of the virtual tokens can be, for example, a facility location problem (FLP) algorithm: the values of the virtual tokens are treated as facility locations, and the opening cost of the virtual tokens together with the value difference between the virtual tokens and the pruned tokens is treated as the objective (i.e., the objective of minimizing the recycling cost); the values of the virtual tokens are then quickly solved by establishing the dual program of a relaxation of the original problem, thereby completing the mapping from the pruned tokens to the virtual tokens. In actual applications, the computing node may also use other mapping algorithms to calculate the values of the virtual tokens based on the values in token set 1, and this is not limited.
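The FLP dual-relaxation solver mentioned above is not spelled out in the text; as a toy stand-in that only illustrates the many-to-few mapping (the interleaved grouping and mean aggregation are assumptions, not the FLP algorithm itself), the pruned tokens' values can be condensed into fewer virtual-token values like this:

```python
def map_pruned_to_virtual(pruned_values, num_virtual):
    """Condense the values of the pruned tokens (token set 1) into
    num_virtual virtual-token values. Here each virtual token simply takes
    the mean vector of one interleaved group of pruned tokens; the patent
    text instead solves a facility location problem (FLP) via the dual of
    a relaxed formulation."""
    groups = [pruned_values[i::num_virtual] for i in range(num_virtual)]
    virtuals = []
    for group in groups:
        dim = len(group[0])
        virtuals.append([sum(vec[d] for vec in group) / len(group)
                         for d in range(dim)])
    return virtuals
```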

上述第一种实现方式中,计算节点是先配置固定数量的虚拟token,再对包括虚拟token的序列进行划分。在第二种可能的实施方式中,计算节点也可以先对token序列1进行划分,再向重要程度较高的token集合2对应的token序列插入虚拟token。In the first implementation described above, the computing node first configures a fixed number of virtual tokens and then divides the sequence containing the virtual tokens. In a second possible implementation, the computing node may also first divide token sequence 1 and then insert virtual tokens into the token sequence corresponding to the more important token set 2.

具体实现时,计算节点在得到token集合1后,可以先根据token集合1中的token数量1,确定虚拟token的数量2,并且,数量2小于数量1。比如,当需要裁剪的token集合1中的token数量为20时,虚拟token的数量可以是4。然后,计算节点可以根据该数量2配置虚拟token,并将该虚拟token插入至token集合2对应的token序列中,如可以将该虚拟token插入该token序列的起始位置或者中间位置等。其中,被裁剪的token数量与虚拟token的数量之间的比值,可以预先由技术人员进行配置。其中,计算节点可以基于任意方式确定虚拟token的数量2,对此并不进行限定。In a specific implementation, after obtaining token set 1, the computing node can first determine quantity 2 of virtual tokens based on quantity 1 of tokens in token set 1, where quantity 2 is less than quantity 1. For example, when the number of tokens in token set 1 that need to be pruned is 20, the number of virtual tokens can be 4. Then, the computing node can configure virtual tokens based on quantity 2 and insert the virtual tokens into the token sequence corresponding to token set 2, for example, at the starting position or the middle position of the token sequence. The ratio between the number of pruned tokens and the number of virtual tokens can be configured in advance by a technician. The computing node can determine quantity 2 of virtual tokens in any manner, and this is not limited.

接着,计算节点可以根据被裁剪的token集合1中的token取值,确定所插入的虚拟token的值,如可以利用FLP算法确定虚拟token的值等,以便利用该虚拟token回收被裁剪的token的信息。如此,计算节点能够根据实际被裁剪的token数量,动态配置虚拟token的数量,以此可以提高利用虚拟token回收被裁剪的token信息的灵活性。比如,当被裁剪的token数量较多时,计算节点在token集合2对应的token序列中所插入的虚拟token的数量也越多,以便利用更多的虚拟token进行信息回收。Next, the computing node can determine the value of the inserted virtual token based on the token value in the pruned token set 1, such as by using the FLP algorithm to determine the value of the virtual token, so as to use the virtual token to recover the information of the pruned token. In this way, the computing node can dynamically configure the number of virtual tokens based on the actual number of pruned tokens, thereby improving the flexibility of using virtual tokens to recover the pruned token information. For example, when the number of pruned tokens is large, the computing node will insert more virtual tokens in the token sequence corresponding to token set 2, so as to use more virtual tokens for information recovery.
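The dynamic sizing described above can be sketched as a simple ratio rule (the ratio value and the floor of one virtual token are assumptions; the text only requires the virtual-token count to be smaller than the pruned-token count):

```python
def num_virtual_tokens(num_pruned, ratio=5):
    """Derive the virtual-token count from the pruned-token count using a
    pre-configured pruned-to-virtual ratio; e.g. 20 pruned tokens with
    ratio 5 give 4 virtual tokens, matching the example in the text."""
    return max(1, num_pruned // ratio)
```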

可以理解,上述配置虚拟token以及确定虚拟token取值的实现方式仅作为一些示例性说明,在其他实施例中,计算节点也可以采用其他方式配置虚拟token的数量以及取值,对此并不进行限定。It is understandable that the above-mentioned implementation methods of configuring virtual tokens and determining virtual token values are only some exemplary explanations. In other embodiments, the computing node may also use other methods to configure the number and value of virtual tokens, and this is not limited.

S305:计算节点将根据至少一个虚拟token以及token集合2中的token构成的压缩token序列,作为第二网络层集合的输入序列。S305: The computing node uses a compressed token sequence composed of at least one virtual token and tokens in token set 2 as an input sequence of the second network layer set.

本实施例中,计算节点在第一网络层集合中裁剪重要程度较低的部分token后,可以基于插入的虚拟token以及所保留的重要程度较高的token得到新的序列,该新的序列中所包括的token数量少于token序列1中包括的token数量,也即实现对token序列1的压缩。为便于区分,以下将新得到的序列称之为压缩token序列,从而计算节点可以将该压缩token序列作为第二网络层集合的输入序列,以便将该压缩token序列传递至AI模型中的第二网络层集合继续进行计算。In this embodiment, after the computing node prunes the less important tokens in the first network layer set, it can obtain a new sequence based on the inserted virtual tokens and the retained more important tokens. The number of tokens included in this new sequence is less than the number of tokens included in token sequence 1, thus compressing token sequence 1. For ease of distinction, the newly obtained sequence is referred to as a compressed token sequence below, so that the computing node can use this compressed token sequence as the input sequence of the second network layer set, so as to pass this compressed token sequence to the second network layer set in the AI model for continued calculation.

具体实现时,当token为单词时,计算节点可以基于插入的虚拟token以及token集合2中的token,生成token序列2(也即上述压缩token序列),并且,虚拟token位于该token序列2的起始位置,从而计算节点可以将该token序列2传递至第二网络层集合中进行计算。在第二网络层集合中,计算节点可以根据虚拟token的取值以及(非虚拟)token的值执行相应的计算,以便AI模型继续完成前向计算过程。如此,计算节点在第二网络层集合中进行计算的数据量能够得到有效减少,从而能够减少计算所需的资源消耗。比如,如图6所示,在对输入至AI模型的token序列1进行裁剪之前,计算节点在第二网络层集合中利用激活函数(如Sigmoid函数等)基于该token序列1进行计算所生成的激活值,为N×d维度的值(token序列1中包括N个token)。而在对token序列1进行裁剪并得到token序列2之后,计算节点在第二网络层集合中利用激活函数基于该token序列2进行计算所生成的激活值,为k×d维度的值(token序列2中包括k个token,k为小于N的正整数),以此计算节点可以减少(N-k)×d维度的数据量计算,也即能够减少该部分数据量计算所产生的资源消耗。In a specific implementation, when the token is a word, the computing node can generate a token sequence 2 (that is, the compressed token sequence mentioned above) based on the inserted virtual tokens and the tokens in token set 2, where the virtual tokens are located at the starting position of token sequence 2, so that the computing node can pass token sequence 2 to the second network layer set for calculation. In the second network layer set, the computing node can perform corresponding calculations based on the values of the virtual tokens and the values of the (non-virtual) tokens, so that the AI model continues to complete the forward calculation process. In this way, the amount of data calculated by the computing node in the second network layer set can be effectively reduced, thereby reducing the resource consumption required for the calculation. For example, as shown in Figure 6, before token sequence 1 input to the AI model is pruned, the activation value generated by the computing node in the second network layer set using an activation function (such as the Sigmoid function) based on token sequence 1 is a value of N×d dimensions (token sequence 1 includes N tokens).
After pruning token sequence 1 and obtaining token sequence 2, the activation value generated by the computing node in the second network layer set using the activation function based on token sequence 2 is a value of k×d dimensions (token sequence 2 includes k tokens, where k is a positive integer less than N). In this way, the computing node can save the computation of (N-k)×d dimensions of data, that is, the resource consumption generated by computing that part of the data can be reduced.
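The saving described above is straightforward to quantify; the following sketch (the dimension values in the example are illustrative) computes how many activation elements the second network layer set no longer has to produce when the input shrinks from N tokens to k tokens:

```python
def activation_elements_saved(n_tokens, k_tokens, hidden_dim):
    """The activation tensor shrinks from N x d to k x d, so
    (N - k) x d elements per activation no longer need to be computed."""
    assert k_tokens < n_tokens
    return (n_tokens - k_tokens) * hidden_dim
```

For example, compressing a 1024-token sequence down to 256 tokens at d = 768 avoids computing 589,824 activation elements per activation.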

当token为图像块时,计算节点可以基于插入的虚拟token以及token集合2中的token,生成token序列3(也即上述压缩token序列),并且,虚拟token位于该token序列3的中间位置,从而计算节点可以将该token序列3传递至第二网络层集合中进行计算。类似地,通过减少传递至第二网络层集合的token(包括虚拟token以及非虚拟的token)数量,能够减少计算节点在第二网络层集合进行计算所需的资源消耗。When the token is an image block, the computing node can generate a token sequence 3 (i.e., the compressed token sequence described above) based on the inserted virtual token and the tokens in token set 2. Furthermore, the virtual token is located in the middle of token sequence 3, so that the computing node can pass token sequence 3 to the second network layer set for calculation. Similarly, by reducing the number of tokens (including virtual tokens and non-virtual tokens) passed to the second network layer set, the resource consumption required by the computing node for calculation in the second network layer set can be reduced.

需要说明的是,即使token序列1中对于第二网络层集合的重要程度较高的一个或者多个token,因为在第一网络层集合的重要程度较低而被误裁剪,由于该一个或者多个token的信息被回收至虚拟token的取值中,因此,计算节点也能通过在第二网络层集合中根据虚拟token进行计算,来保证AI模型根据token集合2中的token进行推理的精度,能够近似达到AI模型在第二网络层集合中基于原始的token序列1进行推理的精度。同时,由于在第二网络层集合中参与计算的token总数(包括虚拟token以及非虚拟的token)少于token序列1中的token数量,因此能够在压缩表示序列的同时,有效减少前向计算过程中所产生的资源消耗。It should be noted that even if one or more tokens in token sequence 1 that are of higher importance to the second network layer set are mistakenly pruned because they are of lower importance in the first network layer set, since the information of the one or more tokens is recycled into the values of the virtual tokens, the computing node can still, by performing calculations based on the virtual tokens in the second network layer set, ensure that the accuracy of the AI model's inference based on the tokens in token set 2 approximately reaches the accuracy of the AI model's inference based on the original token sequence 1 in the second network layer set. Meanwhile, since the total number of tokens involved in the calculation in the second network layer set (including virtual tokens and non-virtual tokens) is smaller than the number of tokens in token sequence 1, the resource consumption generated during the forward calculation can be effectively reduced while the representation sequence is compressed.

进一步的,在训练AI模型的过程中,计算节点可以根据上述步骤S301至步骤S305所描述的方法流程执行前向计算过程,并根据前向计算阶段所计算得到的结果与真实结果之间的差异,对AI模型中各个网络层中的参数值进行更新。其中,所更新的第一网络层集合中的参数值,包括目标参数的值,该目标参数为计算token序列1中的每个token在第一网络层集合中的重要程度的参数。Furthermore, during the training of the AI model, the computing node may perform a forward calculation process according to the method flow described in steps S301 to S305 above, and update the parameter values in each network layer of the AI model based on the difference between the result calculated in the forward calculation phase and the actual result. The updated parameter values in the first network layer set include the value of the target parameter, which is a parameter for calculating the importance of each token in token sequence 1 in the first network layer set.

本实施例中,计算节点在更新目标参数的值时,可以提高该目标参数所计算出的不同token的重要程度的区分度。In this embodiment, when the computing node updates the value of the target parameter, it can improve the differentiation of the importance of different tokens calculated by the target parameter.

具体实现时,计算节点在训练AI模型所采用的损失函数中,可以包括该目标参数对应的方差正则项。其中,损失函数,用于计算损失值,该损失值能够用于度量AI模型在前向计算阶段所计算得到的推理结果与token序列1对应的真实结果之间的差异;通常情况下,当损失值足够小时(如小于预设值),AI模型训练结束。方差正则项,用于强化目标参数针对不同token所计算出的重要程度的区分度。In specific implementation, the computing node may include a variance regularization term corresponding to the target parameter in the loss function used to train the AI model. The loss function is used to calculate the loss value, which can be used to measure the difference between the inference result calculated by the AI model in the forward calculation phase and the actual result corresponding to token sequence 1. Normally, when the loss value is small enough (such as less than a preset value), the AI model training ends. The variance regularization term is used to enhance the discrimination of the importance of the target parameter calculated for different tokens.

这样,在前向计算阶段,计算节点在利用目标参数计算出每个token的重要性分数后,可以根据各个token的重要性分数,计算出第一网络层集合对应的重要性分数的方差,并将该方差带入损失函数中计算损失值,并根据该损失值对第一网络层集合中的目标参数的取值进行更新。In this way, in the forward calculation stage, after the computing node calculates the importance score of each token using the target parameters, it can calculate the variance of the importance scores corresponding to the first network layer set based on the importance scores of each token, and bring the variance into the loss function to calculate the loss value, and update the value of the target parameter in the first network layer set according to the loss value.
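A minimal sketch of bringing the score variance into the loss, under the assumption that a larger variance (sharper separation of token scores) should be rewarded and is therefore subtracted with an assumed coefficient; the exact form of the variance regularization term is not specified in the text:

```python
def loss_with_variance_reg(task_loss, importance_scores, reg_weight=0.1):
    """Combine the task loss with a variance regularizer over the per-token
    importance scores of a network layer set. Subtracting the variance
    encourages the target parameters to spread scores apart (assumed sign
    and coefficient)."""
    mean = sum(importance_scores) / len(importance_scores)
    variance = sum((s - mean) ** 2 for s in importance_scores) / len(importance_scores)
    return task_loss - reg_weight * variance
```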

上述图2所示的实施例中,是以在前向计算阶段,计算节点对token序列进行一次裁剪为例进行说明。在其他实施例中,计算节点也可以在不同的网络层集合对token序列进行多次裁剪,以此实现进一步减少前向计算阶段所需占用的资源数量。下面结合图7对此进行示例性说明。In the embodiment shown in Figure 2 above, the case where the computing node prunes the token sequence once during the forward computation phase is used as an example for illustration. In other embodiments, the computing node may also prune the token sequence multiple times at different network layer sets, so as to further reduce the amount of resources required during the forward computation phase. This is exemplified below with reference to Figure 7.

参见图7,示出了另一种表示序列压缩方法的流程示意图,如图7所示,该方法具体可以包括:Referring to Figure 7, a schematic flowchart of another representation sequence compression method is shown. As shown in Figure 7, the method may specifically include:

S701:计算节点获取token序列1,该token序列1为根据AI模型的输入数据获得的序列。S701: The computing node obtains token sequence 1, which is a sequence obtained based on the input data of the AI model.

S702:计算节点评估token序列1中不同的token在第一网络层集合的重要程度,其中,AI模型包括第一网络层集合以及第二网络层集合,并且,第一网络层集合与第二网络层集合前后级联。S702: The computing node evaluates the importance of different tokens in token sequence 1 in the first network layer set, wherein the AI model includes the first network layer set and the second network layer set, and the first network layer set and the second network layer set are cascaded front and back.

S703:计算节点根据各个token的重要程度,将token序列1中的token分成token集合1以及token集合2。S703: The computing node divides the tokens in the token sequence 1 into a token set 1 and a token set 2 according to the importance of each token.

其中,token集合2中的token在第一网络层集合中的重要程度,高于token集合1中的token在AI模型的第一网络层集合中的重要程度。Among them, the importance of the tokens in token set 2 in the first network layer set is higher than the importance of the tokens in token set 1 in the first network layer set of the AI model.

S704:计算节点根据token集合1中的token确定m个虚拟token,其中,该m个虚拟token的数量少于token集合1中的token数量。S704: The computing node determines m virtual tokens according to the tokens in token set 1, wherein the number of the m virtual tokens is less than the number of tokens in token set 1.

S705:计算节点将根据m个虚拟token以及token集合2中的token构成的压缩token序列,作为第二网络层集合的输入序列。S705: The computing node uses the compressed token sequence composed of the m virtual tokens and the tokens in the token set 2 as the input sequence of the second network layer set.

其中,步骤S701至步骤S705的具体实现过程,可参见前述步骤S301至步骤S305的相关之处描述,在此不做赘述。The specific implementation process of steps S701 to S705 can be found in the relevant description of steps S301 to S305 above, and will not be repeated here.

S706:计算节点评估token集合2中不同的token在第二网络层集合的重要程度,其中,AI模型还包括第三网络层集合,并且,第二网络层集合与第三网络层集合前后级联。S706: The computing node evaluates the importance of different tokens in token set 2 in the second network layer set, wherein the AI model also includes a third network layer set, and the second network layer set and the third network layer set are cascaded front and back.

S707:计算节点根据各个token的重要程度,将token集合2中的token分成子集合1和子集合2。S707: The computing node divides the tokens in token set 2 into subset 1 and subset 2 according to the importance of each token.

其中,子集合2中的token在第二网络层集合中的重要程度高于子集合1中的token在第二网络层集合中的重要程度。Among them, the importance of the tokens in subset 2 in the second network layer set is higher than the importance of the tokens in subset 1 in the second network layer set.

作为一种实现示例,当重要程度通过重要性分数度量时,计算节点可以计算token集合2中的各个token在第二网络层集合中的重要性分数,该重要性分数用于指示token在第二网络层集合的重要程度,并且,重要性分数越大,重要程度越高。然后,计算节点可以将各个token在第二网络层集合中的重要性分数与阈值2进行比较。对于重要性分数小于该阈值2的token,计算节点可以将该token划入子集合1中;对于重要性分数大于或者等于该阈值2的token,计算节点可以将该token划入子集合2中。本实施例中,计算节点对token集合2中的多个token进行划分的方式,可参见上述图3实施例中关于划分token序列1的相关之处描述,在此不做赘述。其中,阈值2,可以是由技术人员预先设定的值,或者可以是由AI模型在训练过程中自学习得到的值,对此并不进行限定。并且,不同网络层集合中用于度量token重要程度的阈值,可以相同,或者可以存在差异,如阈值1可以大于阈值2等。As an implementation example, when importance is measured using an importance score, the computing node may calculate the importance score of each token in token set 2 within the second network layer set. This importance score indicates the importance of the token within the second network layer set, with a higher importance score indicating a higher importance. The computing node may then compare the importance score of each token within the second network layer set with threshold 2. For tokens with an importance score less than threshold 2, the computing node may assign the token to subset 1; for tokens with an importance score greater than or equal to threshold 2, the computing node may assign the token to subset 2. In this embodiment, the manner in which the computing node divides the multiple tokens in token set 2 can be described in the description of the division of token sequence 1 in the embodiment of FIG. 3 above and is not further elaborated here. Threshold 2 can be a value pre-set by a technician or a value learned by the AI model during training, and this is not limited to this. Furthermore, the thresholds used to measure token importance in different network layer sets can be the same or different, such as threshold 1 being greater than threshold 2.

本实施例中,对于重要程度较低的子集合1中的token,即为需要在第二网络层中需要被裁剪的token,而重要程度较高的子集合2中的token,会在AI模型中被传递至后续的网络层中进行计算。In this embodiment, the tokens in the subset 1 with lower importance are the tokens that need to be pruned in the second network layer, while the tokens in the subset 2 with higher importance will be passed to the subsequent network layers in the AI model for calculation.

S708:计算节点根据子集合1中的token,确定p个虚拟token,其中,该p个虚拟token的数量少于子集合1中的token数量。S708: The computing node determines p virtual tokens based on the tokens in subset 1, where the number of the p virtual tokens is less than the number of tokens in subset 1.

其中,p的取值,可以是预设值。比如,在根据输入的token序列1进行推理之前,计算节点可以预先针对第一网络层集合配置m个虚拟token(如4个虚拟token)、以及预先为第二网络层集合配置p个虚拟token(如2个虚拟token)。然后,计算节点可以将配置的(m+p)个虚拟token插入token序列1中,其具体实现方式,可参见上述图3所示实施例中的关于插入虚拟token的相关之处描述,在此不做赘述。其中,所配置的(m+p)个虚拟token在第一网络层集合以及第二网络层集合中,均可以不参与重要性计算。The value of p can be a preset value. For example, before performing inference based on the input token sequence 1, the computing node can pre-configure m virtual tokens (e.g., 4 virtual tokens) for the first network layer set, and pre-configure p virtual tokens (e.g., 2 virtual tokens) for the second network layer set. Then, the computing node can insert the configured (m+p) virtual tokens into token sequence 1. For the specific implementation, reference can be made to the description of inserting virtual tokens in the embodiment shown in Figure 3 above, which will not be repeated here. The configured (m+p) virtual tokens may not participate in the importance calculation in either the first network layer set or the second network layer set.

或者,p的取值,可以是由计算节点动态确定。比如,计算节点在对token集合2中的多个token进行划分并得到子集合1和子集合2后,可以根据子集合1中包括的token数量,确定p值。比如,当子集合1中包括的token的数量为10时,计算节点所确定的p值可以为2。然后,计算节点可以将p个虚拟token插入子集合1中的多个token所对应的token序列中。其中,计算节点所插入的p个虚拟token在该token序列中的位置并不进行限定。Alternatively, the value of p can be dynamically determined by the computing node. For example, after the computing node divides the multiple tokens in token set 2 and obtains subsets 1 and 2, it can determine the value of p based on the number of tokens included in subset 1. For example, when the number of tokens included in subset 1 is 10, the value of p determined by the computing node can be 2. The computing node can then insert the p virtual tokens into the token sequence corresponding to the multiple tokens in subset 1. The positions of the p virtual tokens inserted by the computing node in the token sequence are not limited.

对于所配置的p个虚拟token,计算节点可以根据子集合1中的token的取值,确定该p个虚拟token的取值。示例性地,计算节点可以通过FLP算法或者其他算法,将子集合1中的token取值映射至该p个虚拟token的取值,其具体实现方式,可参见上述图3实施例中关于步骤S304的相关之处描述,在此不做赘述。For the configured p virtual tokens, the computing node can determine the values of the p virtual tokens based on the values of the tokens in subset 1. For example, the computing node can map the token values in subset 1 to the values of the p virtual tokens using the FLP algorithm or other algorithm. The specific implementation method can be found in the description of step S304 in the embodiment of FIG. 3 above, and is not further described here.

S709:计算节点将根据m个虚拟token、p个虚拟token以及子集合2中的token构成的压缩token序列,作为第三网络层集合的输入序列。S709: The computing node uses the compressed token sequence composed of the m virtual tokens, the p virtual tokens and the tokens in the subset 2 as the input sequence of the third network layer set.

具体地,计算节点可以将(m+p)个虚拟token以及子集合2中的token,拼接成新的压缩token序列,也即实现对token序列的进一步压缩,并将该新的压缩token序列作为第三网络层集合的输入序列,以便将该压缩token序列传递至第三网络层集合中继续进行计算。Specifically, the computing node can concatenate (m+p) virtual tokens and the tokens in subset 2 into a new compressed token sequence, that is, further compress the token sequence, and use the new compressed token sequence as the input sequence of the third network layer set, so as to pass the compressed token sequence to the third network layer set for further calculation.

值得注意的是,本实施例中,计算节点会在不同的网络层集合多次裁剪token序列,并且,每次裁剪token序列时,都会利用新的虚拟token来回收被裁剪的token的信息。在其他实施例中,计算节点在多次裁剪token序列的过程中,可以利用同一虚拟token来回收每次被裁剪的不同token的信息。It is worth noting that in this embodiment, the computing node may prune the token sequence multiple times at different network layers, and each time the token sequence is pruned, a new virtual token is used to retrieve the pruned token information. In other embodiments, the computing node may use the same virtual token to retrieve the information of different tokens pruned each time the token sequence is pruned multiple times.

以在第一网络层集合以及第二网络层集合依次对token序列进行裁剪为例,计算节点在将token序列1划分成token集合1以及token集合2后,可以利用m个虚拟token回收被裁剪的token集合1中token信息,并将m个虚拟token以及token集合2中的token传递至第二网络层集合中进行计算。然后,计算节点在第二网络层集合中将token集合2中的多个token划分成子集合1以及子集合2,并仍然利用该m个虚拟token来回收被裁剪的子集合1中的token的信息。比如,计算节点可以先利用FLP算法将被裁剪的子集合1中的token的值映射至该m个虚拟token的值,然后,计算节点可以将该映射得到的m个虚拟token的值与该m个虚拟token当前的取值进行求和,计算得到该m个虚拟token的新的取值。最后,计算节点将m个虚拟token以及子集合2中的token传递至AI模型中的第三网络层集合进行计算。这样,在前向计算阶段,在多次裁剪token的过程中,计算节点利用m个虚拟token回收每次被裁剪的token的信息,从而可以进一步减少在网络层集合中参与计算的token数量,减少前向计算阶段所需占用的资源数量。Taking the example of token sequence pruning in the first and second network layer sets, after the computing node divides token sequence 1 into token set 1 and token set 2, it can use m virtual tokens to recover token information from the pruned token set 1 and pass the m virtual tokens and tokens from token set 2 to the second network layer set for calculation. Then, in the second network layer set, the computing node divides the multiple tokens from token set 2 into subset 1 and subset 2, and still uses the m virtual tokens to recover token information from the pruned subset 1. For example, the computing node can first use the FLP algorithm to map the values of the tokens in the pruned subset 1 to the values of the m virtual tokens. Then, the computing node can sum the values of the m virtual tokens obtained by mapping with the current values of the m virtual tokens to calculate the new values of the m virtual tokens. Finally, the computing node passes the m virtual tokens and tokens from subset 2 to the third network layer set in the AI model for calculation. In this way, in the forward calculation phase, during the process of multiple token pruning, the computing node uses m virtual tokens to recycle the information of the pruned tokens each time, thereby further reducing the number of tokens participating in the calculation in the network layer set and reducing the number of resources required in the forward calculation phase.
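The reuse step described above — mapping the newly pruned tokens and summing the result into the existing virtual-token values — can be sketched as follows (element-wise summation per the text; the vector representation is illustrative):

```python
def accumulate_recycled(current_virtuals, newly_mapped):
    """Second and later pruning rounds: instead of allocating new virtual
    tokens, sum the values mapped from the newly pruned tokens into the
    current values of the same m virtual tokens."""
    return [[cur + new for cur, new in zip(v_cur, v_new)]
            for v_cur, v_new in zip(current_virtuals, newly_mapped)]
```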

进一步的,在训练AI模型的过程中,计算节点可以对多个网络层集合中用于计算token的重要性分数的参数取值进行更新。此时,计算节点可以在AI模型对应的损失函数中针对该部分参数添加方差正则项,用于强化该部分参数所计算出的不同token的重要程度的区分度。Furthermore, during AI model training, the computing nodes can update the parameter values used to calculate token importance scores across multiple network layers. In this case, the computing nodes can add a variance regularization term to the AI model's corresponding loss function for these parameters to enhance the discriminability of the importance scores of different tokens calculated using these parameters.

具体实现时,以对第一网络层集合以及第二网络层集合中用于计算重要性分数的参数取值进行更新为例,在前向计算阶段,计算节点在利用第一网络层集合中的参数计算出每个token的重要性分数后,可以根据各个token的重要性分数,计算出第一网络层集合对应的重要性分数的方差1。然后,在利用第二网络层集合中的参数计算出传递至第二网络层集合中的各个token(不包括虚拟token)的重要性分数,并根据传递至第二网络层集合中的各个token的重要性分数,计算出第二网络层集合对应的重要性分数的方差2。接着,计算节点可以根据方差1以及方差2,计算出目标方差。比如,计算节点可以对方差1以及方差2进行加权求和,所得到的和值即为目标方差;并且,方差1的权重值大于方差2的权重值。最后,计算节点可以将目标方差带入损失函数中计算损失值,并基于该损失值对第一网络层集合、第二网络层集合中用于计算重要性分数的参数的取值进行更新。In a specific implementation, taking the updating of the parameter values used to calculate importance scores in the first and second network layer sets as an example, in the forward computation phase, after calculating the importance score of each token using the parameters in the first network layer set, the computing node can calculate the variance 1 of the importance score corresponding to the first network layer set based on the importance scores of each token. Then, using the parameters in the second network layer set, the computing node calculates the importance score of each token (excluding virtual tokens) passed to the second network layer set, and calculates the variance 2 of the importance score corresponding to the second network layer set based on the importance scores of each token passed to the second network layer set. Next, the computing node can calculate the target variance based on variance 1 and variance 2. For example, the computing node can perform a weighted sum of variance 1 and variance 2, and the resulting sum is the target variance; the weight of variance 1 is greater than the weight of variance 2. Finally, the computing node can substitute the target variance into the loss function to calculate the loss value, and based on this loss value, update the values of the parameters used to calculate importance scores in the first and second network layer sets.
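As a sketch of the loss construction just described (the weighted variance target and its use in the loss), the following Python fragment illustrates one plausible reading. The sign of the regularization term is an assumption: subtraction rewards larger variance, i.e. sharper separation between important and unimportant tokens, which is consistent with "strengthening discriminability"; the weights and coefficient are likewise hypothetical, subject only to the stated constraint that variance 1 is weighted more heavily than variance 2.

```python
import numpy as np

def variance_regularizer(scores_set1, scores_set2, w1=0.7, w2=0.3):
    # Weighted sum of the per-network-layer-set score variances; the
    # passage specifies that the weight of variance 1 exceeds that of
    # variance 2. The concrete values 0.7 / 0.3 are assumptions.
    assert w1 > w2
    return w1 * np.var(scores_set1) + w2 * np.var(scores_set2)

def total_loss(task_loss, target_variance, lam=0.1):
    # Hypothetical sign convention: subtracting the variance term means
    # gradient descent is encouraged to spread the importance scores apart.
    return task_loss - lam * target_variance
```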

值得注意的是，本领域的技术人员根据以上描述的内容，能够想到的其他合理的步骤组合，也属于本申请的保护范围内。其次，本领域技术人员也应该熟悉，说明书中所描述的实施例均属于优选实施例，所涉及的动作并不一定是本申请所必须的。It is worth noting that other reasonable combinations of steps that those skilled in the art can conceive of based on the above description also fall within the scope of protection of this application. In addition, those skilled in the art should also recognize that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by this application.

以上结合图1至图7对本申请实施例提供的表示序列压缩方法进行介绍,接下来结合附图对本申请实施例提供的表示序列压缩装置以及计算设备的结构进行介绍。The above describes the representation sequence compression method provided in the embodiment of the present application in conjunction with Figures 1 to 7. Next, the structure of the representation sequence compression device and the computing device provided in the embodiment of the present application will be described in conjunction with the accompanying drawings.

参见图8,示出了一种表示序列压缩装置的结构示意图,该表示序列压缩装置800包括:Referring to FIG. 8, which shows a schematic structural diagram of a representation sequence compression apparatus, the representation sequence compression apparatus 800 includes:

获取模块801,用于获取第一token序列,第一token序列为根据人工智能AI模型的输入数据获得的序列,其中,AI模型包括第一网络层集合和第二网络层集合,第一网络层集合与第二网络层集合前后级联;An acquisition module 801 is configured to acquire a first token sequence, where the first token sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, wherein the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded.

评估模块802,用于评估第一token序列中不同的token在第一网络层集合的重要程度;An evaluation module 802 is configured to evaluate the importance of different tokens in the first token sequence in the first network layer set;

划分模块803,用于根据重要程度,将第一token序列中的token分为第一token集合和第二token集合;A division module 803 is configured to divide the tokens in the first token sequence into a first token set and a second token set according to their importance;

确定模块804,用于根据第一token集合中的token确定至少一个第一虚拟token,至少一个虚拟token的数量少于第一token集合中的token的数量,并将根据至少一个第一虚拟token以及第二token集合中的token构成的压缩token序列,作为第二网络层集合的输入序列。Determination module 804 is used to determine at least one first virtual token based on the tokens in the first token set, where the number of the at least one virtual token is less than the number of tokens in the first token set, and use a compressed token sequence composed of the at least one first virtual token and the tokens in the second token set as an input sequence of the second network layer set.

在一种可能的实施方式中,确定模块804,用于根据第一token集合中的token的值,确定至少一个第一虚拟token的值。In a possible implementation, the determination module 804 is configured to determine a value of at least one first virtual token according to values of tokens in the first token set.

在一种可能的实施方式中,确定模块804,用于:In a possible implementation, the determining module 804 is configured to:

根据第一token集合中的token的数量,确定至少一个第一虚拟token的数量;Determining the number of at least one first virtual token based on the number of tokens in the first token set;

根据第一token集合中的token的值,确定至少一个第一虚拟token的值。The value of at least one first virtual token is determined according to the value of the token in the first token set.
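The two determination steps above (the number of virtual tokens from the number of pruned tokens, and their values from the pruned tokens' values) admit many concrete rules; the text only requires the result to be smaller than the pruned count. The following Python sketch shows one hypothetical rule for the count, assuming the pruned set holds at least two tokens so that the result can stay strictly smaller.

```python
import math

def virtual_count(num_pruned, ratio=0.25, cap=8):
    # Hypothetical rule (not specified by the text): keep roughly a quarter
    # of the pruned count, at least 1 and at most `cap`, while staying
    # strictly smaller than the pruned count (assumes num_pruned >= 2).
    m = max(1, math.ceil(num_pruned * ratio))
    return min(m, cap, num_pruned - 1)
```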

在一种可能的实施方式中,第一网络层集合包括目标参数,目标参数用于计算第一token序列中的token在第一网络层集合中的重要程度;In one possible implementation, the first network layer set includes a target parameter, and the target parameter is used to calculate the importance of a token in the first token sequence in the first network layer set;

表示序列压缩装置800还包括训练模块805,用于:The representation sequence compression apparatus 800 further includes a training module 805, configured to:

在训练AI模型的过程中,根据损失函数计算得到损失值,损失函数包括目标参数对应的方差正则项;During the training of the AI model, the loss value is calculated according to the loss function, which includes the variance regularization term corresponding to the target parameter;

根据损失值,更新目标参数。According to the loss value, update the target parameters.

在一种可能的实施方式中,第一token序列中的token,包括单词、或者图像块。In a possible implementation, the tokens in the first token sequence include words or image blocks.

在一种可能的实施方式中,第一token序列中的token包括单词,至少一个第一虚拟token位于压缩token序列中的起始位置。In a possible implementation, the tokens in the first token sequence include words, and at least one first virtual token is located at a starting position in the compressed token sequence.

在一种可能的实施方式中,第一token序列中的token包括图像块,至少一个第一虚拟token位于压缩token序列中的中间位置。In a possible implementation, the tokens in the first token sequence include image blocks, and at least one first virtual token is located in the middle of the compressed token sequence.
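The placement conventions described in the two implementations above (virtual tokens at the start of the compressed sequence for word tokens, in the middle for image-block tokens) can be sketched as follows; lists of token ids are used purely for illustration, and taking "middle" as the halfway point is an assumption.

```python
def assemble_compressed(virtual, kept, modality):
    # Place the virtual tokens at the start of the compressed sequence for
    # text, or in the middle for image blocks; "middle" is interpreted here
    # as the halfway point of the kept tokens, which is an assumption.
    if modality == "text":
        return virtual + kept
    mid = len(kept) // 2
    return kept[:mid] + virtual + kept[mid:]
```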

在一种可能的实施方式中,AI模型还包括第三网络层集合,第二网络层集合与第三网络层集合前后级联;In one possible implementation, the AI model further includes a third network layer set, and the second network layer set and the third network layer set are cascaded front and back;

评估模块802,还用于评估第二token集合中不同的token在第二网络层集合的重要程度;The evaluation module 802 is further configured to evaluate the importance of different tokens in the second token set in the second network layer set;

划分模块803,还用于根据第二token集合中不同的token在第二网络层集合的重要程度,将第二token集合中的token分为第一子集合以及第二子集合;The division module 803 is further configured to divide the tokens in the second token set into a first subset and a second subset according to the importance of different tokens in the second token set in the second network layer set;

确定模块804,还用于根据第一子集合中的token确定至少一个第二虚拟token,至少一个虚拟token的数量少于第一子集合中的token的数量,并将根据至少一个第一虚拟token、至少一个第二虚拟token以及第二子集合中的token构成的压缩token序列,作为第三网络层集合的输入序列。The determination module 804 is also used to determine at least one second virtual token based on the tokens in the first subset, where the number of at least one virtual token is less than the number of tokens in the first subset, and use a compressed token sequence composed of at least one first virtual token, at least one second virtual token and the tokens in the second subset as an input sequence of the third network layer set.

在一种可能的实施方式中,AI模型还包括第三网络层集合,第二网络层集合与第三网络层集合前后级联;In one possible implementation, the AI model further includes a third network layer set, and the second network layer set and the third network layer set are cascaded front and back;

评估模块802,还用于评估第二token集合中不同的token在第二网络层集合的重要程度;The evaluation module 802 is further configured to evaluate the importance of different tokens in the second token set in the second network layer set;

划分模块803,还用于根据第二token集合中不同的token在第二网络层集合的重要程度,将第二token集合中的token分为第一子集合以及第二子集合;The division module 803 is further configured to divide the tokens in the second token set into a first subset and a second subset according to the importance of different tokens in the second token set in the second network layer set;

确定模块804,还用于根据第一子集合中的token确定至少一个第一虚拟token在第二网络层集合中的值,并将根据至少一个第一虚拟token以及第二子集合中的token构成的压缩token序列,作为第三网络层集合的输入序列。The determination module 804 is also used to determine the value of at least one first virtual token in the second network layer set based on the token in the first subset, and use the compressed token sequence composed of at least one first virtual token and the token in the second subset as the input sequence of the third network layer set.

在一种可能的实施方式中,表示序列压缩装置800应用于AI模型的部署阶段,表示序列压缩装置800还包括更新模块806,用于根据压缩token序列更新AI模型对应的模型计算图。In one possible implementation, the representation sequence compression device 800 is applied to the deployment stage of the AI model, and the representation sequence compression device 800 further includes an update module 806 for updating the model calculation graph corresponding to the AI model according to the compressed token sequence.

在一种可能的实施方式中,表示序列压缩装置800应用于AI模型的开发阶段。In a possible implementation, the representation sequence compression apparatus 800 is applied to the development phase of an AI model.

由于图8所示的表示序列压缩装置800,对应于上述图3或者图7所示实施例中的计算节点,故图8所示的表示序列压缩装置800的具体实现方式及其所具有的技术效果,参见上述图3或者图7所示实施例中的相关之处描述,在此不做赘述。Since the representation sequence compression device 800 shown in Figure 8 corresponds to the computing node in the embodiment shown in Figure 3 or Figure 7 above, the specific implementation method of the representation sequence compression device 800 shown in Figure 8 and the technical effects thereof can be found in the relevant descriptions in the embodiment shown in Figure 3 or Figure 7 above, and will not be repeated here.

图9为本申请提供的一种计算设备900的硬件结构示意图,该计算设备900例如可以实现上述图3或者图7所示实施例中的计算节点等。FIG9 is a schematic diagram of the hardware structure of a computing device 900 provided in the present application. The computing device 900 can, for example, implement the computing nodes in the embodiments shown in FIG3 or FIG7 .

如图9所示，所述计算设备900包括处理器901、存储器902、通信接口903。其中，处理器901、存储器902、通信接口903通过总线904进行通信，也可以通过无线传输等其他手段实现通信。该存储器902用于存储指令，该处理器901用于执行该存储器902存储的指令。进一步的，计算设备900还可以包括内存单元905，该内存单元905可以通过总线904与处理器901、存储器902以及通信接口903连接。其中，该存储器902存储程序代码，且处理器901可以调用存储器902中存储的程序代码执行以下操作：As shown in FIG. 9, the computing device 900 includes a processor 901, a memory 902, and a communication interface 903. The processor 901, the memory 902, and the communication interface 903 communicate via a bus 904, and may also communicate via other means such as wireless transmission. The memory 902 is used to store instructions, and the processor 901 is used to execute the instructions stored in the memory 902. Furthermore, the computing device 900 may also include a memory unit 905, which may be connected to the processor 901, the memory 902, and the communication interface 903 via the bus 904. The memory 902 stores program code, and the processor 901 may call the program code stored in the memory 902 to perform the following operations:

获取第一表示序列,所述第一表示序列为根据人工智能AI模型的输入数据获得的序列,其中,所述AI模型包括第一网络层集合和第二网络层集合,所述第一网络层集合与所述第二网络层集合前后级联;Obtaining a first representation sequence, where the first representation sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, wherein the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded in a front-to-back manner;

评估所述第一表示序列中不同的表示在所述第一网络层集合的重要程度;evaluating the importance of different representations in the first representation sequence in the first network layer set;

根据所述重要程度,将所述第一表示序列中的表示分为第一表示集合和第二表示集合;dividing the representations in the first representation sequence into a first representation set and a second representation set according to the importance;

根据所述第一表示集合中的表示确定至少一个第一虚拟表示,所述至少一个虚拟表示的数量少于所述第一表示集合中的表示的数量;determining at least one first virtual representation from the representations in the first representation set, the number of the at least one virtual representation being less than the number of representations in the first representation set;

将根据所述至少一个第一虚拟表示以及所述第二表示集合中的表示构成的压缩表示序列,作为所述第二网络层集合的输入序列。A compressed representation sequence formed according to the at least one first virtual representation and representations in the second representation set is used as an input sequence of the second network layer set.

应理解,在本实施例中,该处理器901可以是CPU,该处理器901还可以是其他通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立器件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。It should be understood that in this embodiment, the processor 901 may be a CPU, or may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete device components, etc. A general-purpose processor may be a microprocessor or any conventional processor, etc.

该存储器902可以包括只读存储器和随机存取存储器,并向处理器901提供指令和数据。存储器902还可以包括非易失性随机存取存储器。The memory 902 may include a read-only memory and a random access memory, and provides instructions and data to the processor 901. The memory 902 may also include a nonvolatile random access memory.

该存储器902可以是易失性存储器或非易失性存储器，或可包括易失性和非易失性存储器两者。其中，非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM)，其用作外部高速缓存。通过示例性但不是限制性说明，许多形式的RAM可用，例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。The memory 902 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).

该通信接口903用于与计算设备900连接的其它设备进行通信。该总线904除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线904。The communication interface 903 is used to communicate with other devices connected to the computing device 900. In addition to the data bus, the bus 904 may also include a power bus, a control bus, and a status signal bus. However, for the sake of clarity, various buses are labeled as bus 904 in the figure.

应理解,根据本申请实施例的计算设备900可对应于本申请实施例中的表示序列压缩装置800,并可以对应于执行根据本申请实施例中图3或者图7所示方法中的计算节点所执行的方法,计算设备900所实现的上述和其它操作和/或功能分别为了实现图3或者图7中的相应方法的流程,为了简洁,在此不再赘述。It should be understood that the computing device 900 according to the embodiment of the present application may correspond to the representation sequence compression device 800 in the embodiment of the present application, and may correspond to executing the method executed by the computing node in the method shown in Figure 3 or Figure 7 in the embodiment of the present application. The above-mentioned and other operations and/or functions implemented by the computing device 900 are respectively for implementing the process of the corresponding method in Figure 3 or Figure 7. For the sake of brevity, they will not be repeated here.

本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质，(例如，软盘、硬盘、磁带)、光介质(例如，DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令，所述指令指示计算设备执行上述表示序列压缩方法。The embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to execute the above-mentioned representation sequence compression method.

本申请实施例还提供了一种计算机程序产品。所述计算机程序产品包括一个或多个计算机指令。在计算设备上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。The present application also provides a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the computer program product generates, in whole or in part, the process or function described in the present application.

所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机或数据中心进行传输。The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.

所述计算机程序产品可以为一个软件安装包,在需要使用前述表示序列压缩方法的任一方法的情况下,可以下载该计算机程序产品并在计算设备上执行该计算机程序产品。The computer program product may be a software installation package, which may be downloaded and executed on a computing device when any of the aforementioned methods for representing sequence compression is required.

上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载或执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘。The above embodiments can be implemented in whole or in part by software, hardware, firmware or any other combination. When implemented using software, the above embodiments can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the process or function described in the embodiment of the present application is generated in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) method. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that contains one or more available media sets. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium can be a solid-state drive.

上述实施例中所使用的术语只是为了描述特定实施例的目的，而并非旨在作为对本申请的限制。如在本申请的说明书和所附权利要求书中所使用的那样，单数表达形式“一个”、“一种”、“所述”、“上述”、“该”和“这一”旨在也包括例如“一个或多个”这种表达形式，除非其上下文中明确地有相反指示。还应当理解，在本申请实施例中，“一个或多个”是指一个、两个或两个以上；字符“/”一般表示前后关联对象是一种“或”的关系。在本申请实施例中，“同时”是指在同一时间段内，包括处于同一时刻的情况。本申请的说明书和权利要求书及附图中的术语“第一”、“第二”等是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解，这样使用的术语在适当情况下可以互换，这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。The terms used in the above embodiments are only for the purpose of describing specific embodiments and are not intended to limit the present application. As used in the specification and claims of this application, the singular expressions "one", "a kind of", "said", "above", "the" and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" refers to one, two or more; the character "/" generally indicates that the objects associated with each other are in an "or" relationship. In the embodiments of the present application, "simultaneously" means within the same time period, including situations at the same moment. The terms "first", "second", etc. in the specification, claims and drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchangeable where appropriate, and this is merely a way of distinguishing objects with the same properties when describing them in the embodiments of the present application.

在本说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。References to "one embodiment" or "some embodiments" in this specification mean that a particular feature, structure, or characteristic described in conjunction with that embodiment is included in one or more embodiments of the present application. Thus, phrases such as "in one embodiment," "in some embodiments," "in other embodiments," and "in yet other embodiments" appearing in various places in this specification do not necessarily refer to the same embodiment, but rather mean "one or more but not all embodiments," unless otherwise specifically emphasized. The terms "including," "comprising," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。The above description is merely a specific embodiment of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions should be included in the scope of protection of the present application. Therefore, the scope of protection of the present application should be based on the scope of protection of the claims.

Claims (25)

一种表示序列压缩的方法,其特征在于,所述方法包括:A method for compressing a representation sequence, characterized in that the method comprises: 获取第一表示序列,所述第一表示序列为根据人工智能AI模型的输入数据获得的序列,其中,所述AI模型包括第一网络层集合和第二网络层集合,所述第一网络层集合与所述第二网络层集合前后级联;Obtaining a first representation sequence, where the first representation sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, wherein the AI model includes a first network layer set and a second network layer set, and the first network layer set and the second network layer set are cascaded in a front-to-back manner; 评估所述第一表示序列中不同的表示在所述第一网络层集合的重要程度;evaluating the importance of different representations in the first representation sequence in the first network layer set; 根据所述重要程度,将所述第一表示序列中的表示分为第一表示集合和第二表示集合;dividing the representations in the first representation sequence into a first representation set and a second representation set according to the importance; 根据所述第一表示集合中的表示确定至少一个第一虚拟表示,所述至少一个虚拟表示的数量少于所述第一表示集合中的表示的数量;determining at least one first virtual representation from the representations in the first representation set, the number of the at least one virtual representation being less than the number of representations in the first representation set; 将根据所述至少一个第一虚拟表示以及所述第二表示集合中的表示构成的压缩表示序列,作为所述第二网络层集合的输入序列。A compressed representation sequence formed according to the at least one first virtual representation and representations in the second representation set is used as an input sequence of the second network layer set. 根据权利要求1所述的方法,其特征在于,所述根据所述第一表示集合中的表示确定至少一个第一虚拟表示,包括:The method according to claim 1, wherein determining at least one first virtual representation based on representations in the first representation set comprises: 根据所述第一表示集合中的表示的值,确定所述至少一个第一虚拟表示的值。A value of the at least one first virtual representation is determined based on the values of representations in the first set of representations. 
根据权利要求1所述的方法,其特征在于,所述根据所述第一表示集合中的表示确定至少一个第一虚拟表示,包括:The method according to claim 1, wherein determining at least one first virtual representation based on representations in the first representation set comprises: 根据所述第一表示集合中的表示的数量,确定所述至少一个第一虚拟表示的数量;determining a number of the at least one first virtual representation based on the number of representations in the first representation set; 根据所述第一表示集合中的表示的值,确定所述至少一个第一虚拟表示的值。A value of the at least one first virtual representation is determined based on the values of representations in the first set of representations. 根据权利要求1至3任一项所述的方法,其特征在于,所述第一网络层集合包括目标参数,所述目标参数用于计算所述第一表示序列中的表示在所述第一网络层集合中的重要程度;The method according to any one of claims 1 to 3, characterized in that the first network layer set includes a target parameter, and the target parameter is used to calculate the importance of the representations in the first representation sequence in the first network layer set; 所述方法还包括:The method further comprises: 在训练所述AI模型的过程中,根据损失函数计算得到损失值,所述损失函数包括所述目标参数对应的方差正则项;During the training of the AI model, a loss value is calculated according to a loss function, where the loss function includes a variance regularization term corresponding to the target parameter; 根据所述损失值,更新所述目标参数。The target parameter is updated according to the loss value. 根据权利要求1至4任一项所述的方法,其特征在于,所述第一表示序列中的表示,包括单词、或者图像块。The method according to any one of claims 1 to 4, characterized in that the representations in the first representation sequence include words or image blocks. 根据权利要求5所述的方法,其特征在于,所述表示包括单词,所述至少一个第一虚拟表示位于所述压缩表示序列中的起始位置。The method according to claim 5, characterized in that the representation includes words and the at least one first virtual representation is located at a starting position in the compressed representation sequence. 
根据权利要求5所述的方法,其特征在于,所述表示包括图像块,所述至少一个第一虚拟表示位于所述压缩表示序列中的中间位置。The method according to claim 5, characterized in that the representation includes image blocks and the at least one first virtual representation is located at an intermediate position in the sequence of compressed representations. 根据权利要求1至7任一项所述的方法,其特征在于,所述AI模型还包括第三网络层集合,所述第二网络层集合与所述第三网络层集合前后级联,所述方法还包括:The method according to any one of claims 1 to 7, characterized in that the AI model further includes a third network layer set, the second network layer set and the third network layer set are cascaded front and back, and the method further includes: 评估所述第二表示集合中不同的表示在所述第二网络层集合的重要程度;evaluating the importance of different representations in the second representation set in the second network layer set; 根据所述第二表示集合中不同的表示在所述第二网络层集合的重要程度,将所述第二表示集合中的表示分为第一子集合以及第二子集合;Dividing the representations in the second representation set into a first subset and a second subset according to importance of different representations in the second network layer set; 根据所述第一子集合中的表示确定至少一个第二虚拟表示,所述至少一个虚拟表示的数量少于所述第一子集合中的表示的数量;determining at least one second virtual representation from the representations in the first subset, the number of the at least one virtual representation being less than the number of representations in the first subset; 将根据所述至少一个第一虚拟表示、所述至少一个第二虚拟表示以及所述第二子集合中的表示构成的压缩表示序列,作为所述第三网络层集合的输入序列。A compressed representation sequence formed according to the at least one first virtual representation, the at least one second virtual representation and the representations in the second subset is used as an input sequence of the third network layer set. 
根据权利要求1至7任一项所述的方法,其特征在于,所述AI模型还包括第三网络层集合,所述第二网络层集合与所述第三网络层集合前后级联,所述方法还包括:The method according to any one of claims 1 to 7, characterized in that the AI model further includes a third network layer set, the second network layer set and the third network layer set are cascaded front and back, and the method further includes: 评估所述第二表示集合中不同的表示在所述第二网络层集合的重要程度;evaluating the importance of different representations in the second representation set in the second network layer set; 根据所述第二表示集合中不同的表示在所述第二网络层集合的重要程度,将所述第二表示集合中的表示分为第一子集合以及第二子集合;Dividing the representations in the second representation set into a first subset and a second subset according to importance of different representations in the second network layer set; 根据所述第一子集合中的表示确定所述至少一个第一虚拟表示在所述第二网络层集合中的值;determining a value of the at least one first virtual representation in the second set of network layers based on the representations in the first subset; 将根据所述至少一个第一虚拟表示以及所述第二子集合中的表示构成的压缩表示序列,作为所述第三网络层集合的输入序列。A compressed representation sequence formed according to the at least one first virtual representation and the representations in the second subset is used as an input sequence of the third network layer set. 根据权利要求1至9任一项所述的方法,其特征在于,所述方法应用于所述AI模型的部署阶段,所述方法还包括:根据所述压缩表示序列更新所述AI模型对应的模型计算图。The method according to any one of claims 1 to 9 is characterized in that the method is applied to the deployment stage of the AI model, and the method further includes: updating the model computation graph corresponding to the AI model according to the compressed representation sequence. 根据权利要求1至9任一项所述的方法,其特征在于,所述方法应用于所述AI模型的开发阶段。The method according to any one of claims 1 to 9, characterized in that the method is applied to the development stage of the AI model. 
一种表示序列压缩装置,其特征在于,所述装置包括:A representation sequence compression device, characterized in that the device comprises: 获取模块,用于获取第一表示序列,所述第一表示序列为根据人工智能AI模型的输入数据获得的序列,其中,所述AI模型包括第一网络层集合和第二网络层集合,所述第一网络层集合与所述第二网络层集合前后级联;an acquisition module, configured to acquire a first representation sequence, where the first representation sequence is a sequence obtained based on input data of an artificial intelligence (AI) model, wherein the AI model includes a first set of network layers and a second set of network layers, wherein the first set of network layers and the second set of network layers are cascaded in series; 评估模块,用于评估所述第一表示序列中不同的表示在所述第一网络层集合的重要程度;an evaluation module, configured to evaluate the importance of different representations in the first representation sequence in the first network layer set; 划分模块,用于根据所述重要程度,将所述第一表示序列中的表示分为第一表示集合和第二表示集合;a division module, configured to divide the representations in the first representation sequence into a first representation set and a second representation set according to the importance; 确定模块,用于根据所述第一表示集合中的表示确定至少一个第一虚拟表示,所述至少一个虚拟表示的数量少于所述第一表示集合中的表示的数量,并将根据所述至少一个第一虚拟表示以及所述第二表示集合中的表示构成的压缩表示序列,作为所述第二网络层集合的输入序列。A determination module is used to determine at least one first virtual representation based on the representations in the first representation set, where the number of the at least one virtual representation is less than the number of representations in the first representation set, and to use a compressed representation sequence composed of the at least one first virtual representation and the representations in the second representation set as an input sequence of the second network layer set. 根据权利要求12所述的装置,其特征在于,所述确定模块,用于根据所述第一表示集合中的表示的值,确定所述至少一个第一虚拟表示的值。The device according to claim 12 is characterized in that the determination module is used to determine the value of the at least one first virtual representation based on the values of the representations in the first representation set. 
14. The apparatus according to claim 12, wherein the determination module is configured to:
determine the number of the at least one first virtual representation based on the number of representations in the first representation set; and
determine a value of the at least one first virtual representation based on values of the representations in the first representation set.

15. The apparatus according to any one of claims 12 to 14, wherein the first network layer set includes a target parameter, and the target parameter is used to calculate the importance of the representations in the first representation sequence in the first network layer set; and
the apparatus further comprises a training module, configured to:
in a process of training the AI model, calculate a loss value according to a loss function, the loss function including a variance regularization term corresponding to the target parameter; and
update the target parameter according to the loss value.

16. The apparatus according to any one of claims 12 to 15, wherein the representations in the first representation sequence include words or image blocks.

17. The apparatus according to claim 16, wherein the representations include words, and the at least one first virtual representation is located at a starting position of the compressed representation sequence.

18. The apparatus according to claim 16, wherein the representations include image blocks, and the at least one first virtual representation is located at an intermediate position of the compressed representation sequence.
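The training step above only states that the loss function contains a variance regularization term corresponding to the target (importance) parameter. One plausible reading, sketched below, subtracts the variance of the importance scores so that the scores spread apart and the split between the two representation sets becomes sharper; the sign of the term and the weight `lam` are assumptions, as is the function name.

```python
import numpy as np

def loss_with_variance_reg(task_loss, importance_scores, lam=0.1):
    """Total training loss = task loss plus a variance regularization term
    over the learned importance scores (the 'target parameter' above).
    Subtracting the variance rewards well-separated scores; the direction
    and weighting are illustrative, not taken from the claims.
    """
    return task_loss - lam * np.var(importance_scores)
```

During training, the gradient of this loss with respect to the importance parameter pushes the scores apart, while the task loss keeps the compressed model accurate.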
19. The apparatus according to any one of claims 12 to 18, wherein the AI model further includes a third network layer set, the second network layer set and the third network layer set being cascaded in series;
the evaluation module is further configured to evaluate the importance of different representations in the second representation set in the second network layer set;
the division module is further configured to divide the representations in the second representation set into a first subset and a second subset according to the importance of the different representations in the second representation set in the second network layer set; and
the determination module is further configured to determine at least one second virtual representation based on the representations in the first subset, the number of the at least one second virtual representation being less than the number of representations in the first subset, and to use a compressed representation sequence formed from the at least one first virtual representation, the at least one second virtual representation and the representations in the second subset as an input sequence of the third network layer set.
20. The apparatus according to any one of claims 12 to 18, wherein the AI model further includes a third network layer set, the second network layer set and the third network layer set being cascaded in series;
the evaluation module is further configured to evaluate the importance of different representations in the second representation set in the second network layer set;
the division module is further configured to divide the representations in the second representation set into a first subset and a second subset according to the importance of the different representations in the second representation set in the second network layer set; and
the determination module is further configured to determine a value of the at least one first virtual representation in the second network layer set based on the representations in the first subset, and to use a compressed representation sequence formed from the at least one first virtual representation and the representations in the second subset as an input sequence of the third network layer set.

21. The apparatus according to any one of claims 12 to 20, wherein the apparatus is applied in a deployment stage of the AI model, and the apparatus further comprises an update module, configured to update a model computation graph corresponding to the AI model according to the compressed representation sequence.

22. The apparatus according to any one of claims 12 to 20, wherein the apparatus is applied in a development stage of the AI model.
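The two second-stage variants above repeat the compression between the second and third network layer sets: re-evaluate importance, re-split the previously kept representations, and either add a new (second) virtual representation or update the value of the existing one. The variant that appends a second virtual representation can be sketched as follows; the mean-merge rule, the keep ratio, and the function name are illustrative assumptions, not taken from the claims.

```python
import numpy as np

def cascade_compress(seq, importance, num_virtual_prev, keep_ratio=0.5):
    """Second compression stage: `seq` already begins with `num_virtual_prev`
    virtual tokens produced by the first stage. The remaining tokens are
    re-split by their importance in the second network layer set, and the
    low-importance subset is merged into one additional virtual token that
    is inserted after the stage-one virtual tokens.
    """
    virt_prev, rest = seq[:num_virtual_prev], seq[num_virtual_prev:]
    imp = importance[num_virtual_prev:]
    num_keep = int(len(rest) * keep_ratio)
    order = np.argsort(-imp)
    kept = rest[np.sort(order[:num_keep])]          # second subset, original order
    new_virtual = rest[order[num_keep:]].mean(axis=0, keepdims=True)
    # Input sequence of the third network layer set.
    return np.concatenate([virt_prev, new_virtual, kept], axis=0)
```

Applied layer set by layer set, this progressively shortens the sequence, so each deeper group of layers runs over fewer positions than the one before it.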
23. A computing device, wherein the computing device comprises a processor and a memory; the memory is configured to store instructions, and the processor executes the instructions stored in the memory to cause the computing device to perform the method according to any one of claims 1 to 11.

24. A computer-readable storage medium, comprising instructions which, when run on a computing device, cause the computing device to perform the method according to any one of claims 1 to 11.

25. A computer program product comprising instructions which, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 11.
PCT/CN2024/139141 2024-03-14 2024-12-13 Representation sequence compression method and apparatus, and related device Pending WO2025189874A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410295573.5 2024-03-14
CN202410295573.5A CN120654758A (en) 2024-03-14 2024-03-14 Method, device and related equipment for representing sequence compression

Publications (1)

Publication Number Publication Date
WO2025189874A1 2025-09-18

Family

ID=96998391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/139141 Pending WO2025189874A1 (en) 2024-03-14 2024-12-13 Representation sequence compression method and apparatus, and related device

Country Status (2)

Country Link
CN (1) CN120654758A (en)
WO (1) WO2025189874A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254724A (en) * 2020-09-11 2022-03-29 华为技术有限公司 Data processing method, neural network training method and related equipment
US11455524B1 (en) * 2017-08-29 2022-09-27 BlueOwl, LLC System and method of improving compression of predictive models
CN116306879A (en) * 2023-01-16 2023-06-23 Oppo广东移动通信有限公司 Data processing method, device, electronic equipment and storage medium
CN116415144A (en) * 2022-11-25 2023-07-11 北京工业大学 A Model Compression and Acceleration Method Based on Recurrent Neural Network
CN117197598A (en) * 2023-09-22 2023-12-08 武汉精立电子技术有限公司 A hyperspectral data preprocessing method and system based on deep learning
WO2023236365A1 (en) * 2022-06-10 2023-12-14 成都登临科技有限公司 Data processing method and apparatus, and ai chip, electronic device and storage medium


Also Published As

Publication number Publication date
CN120654758A (en) 2025-09-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 24929317
Country of ref document: EP
Kind code of ref document: A1