
CN114327630A - High-performance operator generation method suitable for Huawei Ascend chips - Google Patents

High-performance operator generation method suitable for Huawei Ascend chips

Info

Publication number
CN114327630A
Authority
CN
China
Prior art keywords
data
target operation
function
result
target
Prior art date
Legal status
Granted
Application number
CN202210009738.9A
Other languages
Chinese (zh)
Other versions
CN114327630B (en)
Inventor
龙汀汀
樊春
马银萍
董昊森
李若淼
杨宏辉
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN202210009738.9A
Publication of CN114327630A
Application granted
Publication of CN114327630B
Legal status: Active
Anticipated expiration

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The present invention discloses a high-performance operator generation method suitable for Huawei Ascend chips. The method includes the following steps: generating a plurality of candidate operation functions in a target development mode, where the target development mode is the Tensor Iterator Kernel development mode determined based on the Tensor Boost Engine operator development framework of the Ascend artificial intelligence processor; selecting a target operation function to be used from the plurality of candidate operation functions; and performing a target operation using the target operation function and target operation data to obtain a target operation result. The invention solves the technical problem in the related art of low development efficiency for high-performance operators.

Description

A High-Performance Operator Generation Method Suitable for Huawei Ascend Chips

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a high-performance operator generation method suitable for Huawei Ascend chips.

Background

High-performance operators are the computational functions involved in deep learning models; common operators include convolution, matrix multiplication, and the Rectified Linear Unit (ReLU). As a basic component of artificial intelligence (AI) computing frameworks, high-performance operators call down to AI chips such as the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and the Neural-network Processing Unit (NPU), and provide operation interfaces upward to a variety of computing frameworks. High-performance operators are an important foundation for fully exploiting the computing potential of a chip and improving the efficiency of training and inference.

In the related art, the Ascend technology stack provides the Tensor Boost Engine (TBE) operator development framework, and developers can choose either the Domain-Specific Language (DSL) or the Tensor Iterator Kernel (TIK) development mode for operator development. DSL offers poor flexibility and performance but high development efficiency; TIK offers high flexibility and performance but low development efficiency and a large development workload, requiring developers to be familiar with the underlying hardware architecture and to schedule operators manually. With the existing technology it is therefore difficult to obtain good operator performance and good development efficiency at the same time.

No effective solution to the above problems has yet been proposed.

Summary of the Invention

Embodiments of the present application provide a high-performance operator generation method suitable for Huawei Ascend chips, so as to at least solve the technical problem in the related art of low development efficiency for high-performance operators.

According to one embodiment of the present application, a high-performance operator generation method suitable for Huawei Ascend chips is provided, including: generating a plurality of candidate operation functions in a target development mode, where the target development mode is the Tensor Iterator Kernel development mode determined based on the Tensor Boost Engine operator development framework of the Ascend artificial intelligence processor; selecting a target operation function to be used from the plurality of candidate operation functions; and performing a target operation using the target operation function and target operation data to obtain a target operation result.

Optionally, selecting the target operation function from the plurality of candidate operation functions includes: selecting a data moving function from the plurality of candidate operation functions.

Optionally, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes: acquiring the target operation data, where the target operation data includes a first source operand, a first destination operand, and a data length; and performing a data moving operation using the data moving function, the first source operand, the first destination operand, and the data length to obtain a data moving result.

Optionally, performing the data moving operation using the data moving function, the first source operand, the first destination operand, and the data length to obtain the data moving result includes: partitioning the first source operand into blocks according to the data length to obtain a blocking result; when it is determined from the blocking result that no tail block exists, moving the first source operand to the first destination operand through the data moving function and the blocking result to obtain the data moving result; and when it is determined from the blocking result that a tail block exists, determining a target moving mode according to the storage locations of the first source operand and the first destination operand, and moving the first source operand to the first destination operand through the data moving function and the target moving mode to obtain the data moving result.

Optionally, selecting the target operation function from the plurality of candidate operation functions includes: selecting a precision vector calculation function from the plurality of candidate operation functions.

Optionally, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes: acquiring the target operation data, where the target operation data includes a second source operand, a second destination operand, and a first instruction name; and performing a precision vector calculation operation using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain a precision vector calculation result.

Optionally, performing the precision vector calculation operation using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain the precision vector calculation result includes: when the second source operand is not located in global memory, determining the number of precision vector calculations, performing the precision vector calculation operation on the second source operand using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and moving the precision vector calculation result to the second destination operand; and when the second source operand is located in global memory, performing multi-core optimization on the second source operand to obtain an optimization result, determining the number of precision vector calculations, performing the precision vector calculation operation on the optimization result using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and moving the precision vector calculation result to the second destination operand.
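For intuition, the branch structure described above can be outlined as follows. This is a hypothetical Python-style sketch, not actual TIK code: in_global_memory, split_across_cores, vector_repeat_count, issue_vector_instr, and move_to are stand-in names for the storage-location check, multi-core tiling, repeat-count calculation, vector instruction dispatch, and data-move steps.

```python
def precision_vector_calc(instr_name, src, dst):
    """Outline of the precision vector calculation flow described above (sketch)."""
    if not in_global_memory(src):
        # Source already on-chip: determine the repeat count, issue the vector
        # instruction directly, and move the result to the destination operand.
        repeats = vector_repeat_count(src)
        result = issue_vector_instr(instr_name, src, repeats)
        move_to(dst, result)
    else:
        # Source in global memory: first tile the work across AI Cores
        # (multi-core optimization), then run the same repeat-and-issue
        # sequence on each tile and write each partial result back.
        for tile in split_across_cores(src):
            repeats = vector_repeat_count(tile)
            result = issue_vector_instr(instr_name, tile, repeats)
            move_to(dst, result)
```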

Optionally, selecting the target operation function from the plurality of candidate operation functions includes: selecting a reduction vector calculation function from the plurality of candidate operation functions.

Optionally, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes: acquiring the target operation data, where the target operation data includes a third source operand, a third destination operand, and a first data amount; and performing multiple rounds of iterative processing using the reduction vector calculation function, the third source operand, the third destination operand, and the first data amount to obtain a reduction vector calculation result.
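The multi-round iteration of a reduction can be pictured as repeatedly folding the data in half until a single value remains. The following NumPy sketch only illustrates that idea; it is not the patent's device code, and NumPy addition stands in for the Ascend vector instructions (padding and 32-byte alignment are omitted).

```python
import numpy as np

def reduce_sum_multiround(src: np.ndarray) -> float:
    """Sum `src` by folding it in half once per round (illustrative only)."""
    data = src.astype(np.float32).copy()
    while data.size > 1:
        if data.size % 2:  # odd length: pad with a zero so the fold is even
            data = np.concatenate([data, np.zeros(1, dtype=data.dtype)])
        half = data.size // 2
        data = data[:half] + data[half:]  # one vector addition per round
    return float(data[0])

print(reduce_sum_multiround(np.arange(10)))  # 45.0
```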

Optionally, selecting the target operation function from the plurality of candidate operation functions includes: selecting a data filling function from the plurality of candidate operation functions.

Optionally, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes: acquiring the target operation data, where the target operation data includes a fourth source operand, a fourth destination operand, and a second instruction name; and performing a data filling operation using the data filling function, the fourth source operand, the fourth destination operand, and the second instruction name to obtain a data filling result.

Optionally, selecting the target operation function from the plurality of candidate operation functions includes: selecting a floating-point scalar comparison function from the plurality of candidate operation functions.

Optionally, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes: acquiring the target operation data, where the target operation data includes a first floating-point scalar and a second floating-point scalar; and performing a floating-point scalar comparison operation using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain a floating-point scalar comparison result.

Optionally, performing the floating-point scalar comparison operation using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain the floating-point scalar comparison result includes: writing the first floating-point scalar into a first tensor and writing the second floating-point scalar into a second tensor; using a vector subtraction instruction to obtain the difference between the first tensor and the second tensor, and converting the difference result into an integer scalar; and performing the floating-point scalar comparison operation based on the integer scalar to obtain the floating-point scalar comparison result.
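The detour through tensors exists because the comparison is carried out by the vector unit rather than directly on floating-point scalars. The following NumPy sketch only illustrates the logic; on the device the three steps correspond to writing the scalars into one-element tensors, one vector subtraction instruction, and a conversion of the difference to an integer scalar (sign extraction here stands in for that conversion).

```python
import numpy as np

def fp_scalar_compare(a: float, b: float) -> int:
    """Return a negative, zero, or positive integer as a <, ==, or > b."""
    t1 = np.full((1,), a, dtype=np.float32)  # first scalar written into a tensor
    t2 = np.full((1,), b, dtype=np.float32)  # second scalar written into a tensor
    diff = t1 - t2                           # vector subtraction instruction
    return int(np.sign(diff[0]))             # difference reduced to an integer scalar

assert fp_scalar_compare(1.5, 2.0) < 0
assert fp_scalar_compare(3.0, 3.0) == 0
assert fp_scalar_compare(4.0, 3.0) > 0
```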

Optionally, selecting the target operation function from the plurality of candidate operation functions includes: selecting a division calculation function from the plurality of candidate operation functions.

Optionally, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes: acquiring the target operation data, where the target operation data includes a fifth source operand, a fifth destination operand, and a second data amount; and performing a division calculation operation using the division calculation function, the fifth source operand, the fifth destination operand, and the second data amount to obtain a division calculation result.

According to one embodiment of the present application, a high-performance operator generation apparatus suitable for Huawei Ascend chips is also provided, including: a generation module configured to generate a plurality of candidate operation functions in a target development mode, where the target development mode is the Tensor Iterator Kernel development mode determined based on the Tensor Boost Engine operator development framework of the Ascend artificial intelligence processor; a selection module configured to select a target operation function to be used from the plurality of candidate operation functions; and a processing module configured to perform a target operation using the target operation function and target operation data to obtain a target operation result.

Optionally, the selection module is further configured to select a data moving function from the plurality of candidate operation functions.

Optionally, the processing module is further configured to: acquire the target operation data, where the target operation data includes a first source operand, a first destination operand, and a data length; and perform a data moving operation using the data moving function, the first source operand, the first destination operand, and the data length to obtain a data moving result.

Optionally, the processing module is further configured to: partition the first source operand into blocks according to the data length to obtain a blocking result; when it is determined from the blocking result that no tail block exists, move the first source operand to the first destination operand through the data moving function and the blocking result to obtain the data moving result; and when it is determined from the blocking result that a tail block exists, determine a target moving mode according to the storage locations of the first source operand and the first destination operand, and move the first source operand to the first destination operand through the data moving function and the target moving mode to obtain the data moving result.

Optionally, the selection module is further configured to select a precision vector calculation function from the plurality of candidate operation functions.

Optionally, the processing module is further configured to: acquire the target operation data, where the target operation data includes a second source operand, a second destination operand, and a first instruction name; and perform a precision vector calculation operation using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain a precision vector calculation result.

Optionally, the processing module is further configured to: when the second source operand is not located in global memory, determine the number of precision vector calculations, perform the precision vector calculation operation on the second source operand using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and move the precision vector calculation result to the second destination operand; and when the second source operand is located in global memory, perform multi-core optimization on the second source operand to obtain an optimization result, determine the number of precision vector calculations, perform the precision vector calculation operation on the optimization result using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and move the precision vector calculation result to the second destination operand.

Optionally, the selection module is further configured to select a reduction vector calculation function from the plurality of candidate operation functions.

Optionally, the processing module is further configured to: acquire the target operation data, where the target operation data includes a third source operand, a third destination operand, and a first data amount; and perform multiple rounds of iterative processing using the reduction vector calculation function, the third source operand, the third destination operand, and the first data amount to obtain a reduction vector calculation result.

Optionally, the selection module is further configured to select a data filling function from the plurality of candidate operation functions.

Optionally, the processing module is further configured to: acquire the target operation data, where the target operation data includes a fourth source operand, a fourth destination operand, and a second instruction name; and perform a data filling operation using the data filling function, the fourth source operand, the fourth destination operand, and the second instruction name to obtain a data filling result.

Optionally, the selection module is further configured to select a floating-point scalar comparison function from the plurality of candidate operation functions.

Optionally, the processing module is further configured to: acquire the target operation data, where the target operation data includes a first floating-point scalar and a second floating-point scalar; and perform a floating-point scalar comparison operation using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain a floating-point scalar comparison result.

Optionally, the processing module is further configured to: write the first floating-point scalar into a first tensor and write the second floating-point scalar into a second tensor; use a vector subtraction instruction to obtain the difference between the first tensor and the second tensor, and convert the difference result into an integer scalar; and perform the floating-point scalar comparison operation based on the integer scalar to obtain the floating-point scalar comparison result.

Optionally, the selection module is further configured to select a division calculation function from the plurality of candidate operation functions.

Optionally, the processing module is further configured to: acquire the target operation data, where the target operation data includes a fifth source operand, a fifth destination operand, and a second data amount; and perform a division calculation operation using the division calculation function, the fifth source operand, the fifth destination operand, and the second data amount to obtain a division calculation result.

According to one embodiment of the present application, a non-volatile storage medium is also provided, in which a computer program is stored, where the computer program is configured to execute, when running, the high-performance operator generation method suitable for Huawei Ascend chips described in any one of the above.

According to one embodiment of the present application, a processor is also provided, where the processor is configured to run a program, and the program is configured to execute, when running, the high-performance operator generation method suitable for Huawei Ascend chips described in any one of the above.

According to one embodiment of the present application, an electronic device is also provided, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the high-performance operator generation method suitable for Huawei Ascend chips described in any one of the above.

In the embodiments of the present application, a plurality of candidate operation functions are generated in a target development mode, where the target development mode is the Tensor Iterator Kernel development mode determined based on the Tensor Boost Engine operator development framework of the Ascend artificial intelligence processor; a target operation function to be used is then selected from the plurality of candidate operation functions; and finally a target operation is performed using the target operation function and target operation data to obtain a target operation result. This achieves the purpose of quickly performing the target operation on the target operation data with the target operation function to obtain the target operation result, thereby realizing the technical effect of improving operator development efficiency and reducing the development workload, and thus solving the technical problem in the related art of low development efficiency for high-performance operators.

Brief Description of the Drawings

The drawings described here are provided to give a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a schematic diagram of the internal architecture of an AI Core according to the prior art;

Fig. 2 is a hardware structural block diagram of a computer terminal for implementing the high-performance operator generation method suitable for Huawei Ascend chips according to an embodiment of the present application;

Fig. 3 is a flowchart of a high-performance operator generation method suitable for Huawei Ascend chips according to an embodiment of the present application;

Fig. 4 is a schematic diagram of an intermediate result matrix according to an embodiment of the present application;

Fig. 5 is a schematic diagram of another intermediate result matrix according to an embodiment of the present application;

Fig. 6 is a structural block diagram of a high-performance operator generation apparatus suitable for Huawei Ascend chips according to an embodiment of the present application.

Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second", and the like in the description and claims of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such processes, methods, products, or devices.

First, some of the nouns and terms that appear in the description of the embodiments of the present application are explained as follows:

The Ascend series processors are NPUs based on the Da Vinci architecture. In the training and inference of neural networks, Ascend chips offer higher performance and lower power consumption than traditional CPU/GPU chips.

The Da Vinci architecture is a new computing architecture oriented toward the characteristics of AI computing. It offers high computing power, high energy efficiency, and flexible tailorability, can efficiently perform vector and tensor operations, and improves the efficiency of neural network computation. To improve the completeness of AI computing and the computing efficiency in different scenarios, a variety of computing units can be integrated into the Da Vinci architecture.

The Ascend AI processor contains several AI computing cores (AI Cores), which are responsible for executing computation-intensive vector and tensor operators. The internal architecture of an AI Core is shown in Fig. 1, a schematic diagram of the internal architecture of an AI Core according to the prior art. An AI Core includes: a system control module, an L1 buffer, an L1 input buffer controller, an L0A buffer, an L0C buffer, an output buffer, a matrix computation unit (cube unit), a vector computation unit (vector unit), a scalar computation unit (scalar unit), special-purpose registers, general-purpose registers, a bus interface unit, a scalar instruction processing queue, an instruction dispatch module, an event synchronization module, and so on.

Specifically, the input buffer (L1 buffer) is a data staging area inside the AI Core used to hold data that the AI Core needs to use repeatedly, which reduces the number of times the AI Core reads and writes this data over the bus. The output buffer (Unified Buffer, UB) is a data staging area inside the AI Core used to hold the input and output data of the vector computation unit and the scalar computation unit; before a vector calculation can be performed, the data must first be moved into the output buffer so that it can enter the vector computation unit. Global Memory (GM) is the external storage of the AI Core, used to store data waiting for AI computation; before AI computation can be performed, the data must be moved into the AI Core.

The related scheme provides two operator development modes: DSL and TIK. In the DSL development mode, the developer only needs to call a few highly encapsulated interfaces to express the computation process, and the compiler then completes the subsequent automatic scheduling and operator code generation. The disadvantages of the DSL development mode are that it is not flexible enough to develop some complex operators, and the performance of the developed operators is generally poor. The TIK development mode is more flexible and can be used to develop more complex operators; TIK also performs better, and for many operators such as matrix multiplication and tensor slicing it can be dozens or even thousands of times faster than DSL. However, the TIK development mode demands a higher level of developer skill and is more time- and labor-consuming.

When developing operators in the TIK development mode, developers need to spend a great deal of time and effort writing code for data partitioning, data alignment, multi-core and multi-thread optimization, buffer management, and so on, whereas in the DSL development mode these operations are generated automatically by the compiler. To achieve the same functionality, the workload of the TIK development mode is usually more than ten times that of the DSL development mode, so development efficiency is low when developing operators with TIK.

The DSL development mode achieves good development efficiency, but the operators developed with it perform poorly; the TIK development mode allows developers to hand-craft high-performance operators, but its development efficiency is low. The related art therefore suffers from the technical problem of low development efficiency for high-performance operators.

According to one embodiment of the present application, an embodiment of a high-performance operator generation method suitable for Huawei Ascend chips is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one here.

The method embodiments provided in the embodiments of the present application may be executed in a computer terminal or a similar computing device. Fig. 2 is a hardware structural block diagram of a computer terminal for implementing the high-performance operator generation method suitable for Huawei Ascend chips according to an embodiment of the present application. As shown in Fig. 2, the computer terminal 20 may include one or more processors 202 (shown as 202a, 202b, ..., 202n in the figure; the processor 202 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 204 for storing data, and a transmission device 206 for communication functions. In addition, it may also include a display, an input/output (I/O) interface, a Universal Serial Bus (USB) port (which may be included as one of the ports of the BUS bus), a network interface, a power supply, and/or a camera. Those of ordinary skill in the art will understand that the structure shown in Fig. 2 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 20 may also include more or fewer components than shown in Fig. 2, or have a configuration different from that shown in Fig. 2.

It should be noted that the one or more processors 202 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits". The data processing circuit may be embodied, in whole or in part, as software, hardware, firmware, or any other combination. Furthermore, the data processing circuit may be a single independent processing module, or incorporated, in whole or in part, into any one of the other elements in the computer terminal 20. As involved in the embodiments of the present application, the data processing circuit acts as a kind of processor control (for example, the selection of a variable-resistance termination path connected to an interface).

The memory 204 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the high-performance operator generation method suitable for Huawei Ascend chips in the embodiments of the present application. By running the software programs and modules stored in the memory 204, the processor 202 executes various functional applications and data processing, that is, implements the above high-performance operator generation method suitable for Huawei Ascend chips. The memory 204 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 204 may further include memory located remotely from the processor 202, and such remote memory may be connected to the computer terminal 20 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The transmission device 206 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication operator of the computer terminal 20. In one example, the transmission device 206 includes a Network Interface Controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 206 may be a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.

The display may be, for example, a touch-screen liquid crystal display (LCD), which enables the user to interact with the user interface of the computer terminal 20 (or mobile device).

It should be noted here that, in some optional embodiments, the computer device shown in Fig. 2 may include hardware elements (including circuits), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be pointed out that Fig. 2 is only one example of a specific instance and is intended to illustrate the types of components that may be present in the above computer device.

In the above operating environment, the present application provides a high-performance operator generation method suitable for Huawei Ascend chips as shown in Fig. 3, and the method may be executed by the computer terminal shown in Fig. 2 or a similar computing device. Fig. 3 is a flowchart of a high-performance operator generation method suitable for Huawei Ascend chips according to one embodiment of the present application. As shown in Fig. 3, the process includes the following steps:

Step S31: in a target development mode, generate a plurality of candidate operation functions, where the target development mode is the Tensor Iterator Kernel development mode determined based on the Tensor Boost Engine operator development framework of the Ascend artificial intelligence processor.

The plurality of candidate operation functions includes: a data moving function, a precision vector calculation function, a reduction vector calculation function, a data filling function, a floating-point scalar comparison function, and a division calculation function. Different functions can operate on different operation data to obtain the corresponding operation results.

Step S32: select a target operation function to be used from the plurality of candidate operation functions.

Specifically, for the implementation of selecting the target operation function to be used from the plurality of candidate operation functions, reference may be made to the further description of the embodiments of the present application below, which is not repeated here.

Step S33: perform a target operation using the target operation function and target operation data to obtain a target operation result.

Optionally, different target operation functions can perform different target operations on the target operation data to obtain the corresponding target operation results.

In an optional implementation, when the target operation function is a data moving function, a data moving operation is performed using the data moving function and the target operation data to obtain a data moving result.

In an optional implementation, when the target operation function is a precision vector calculation function, a precision vector calculation operation is performed using the precision vector calculation function and the target operation data to obtain a precision vector calculation result.

In an optional implementation, when the target operation function is a reduction vector calculation function, multiple rounds of iterative processing are performed using the reduction vector calculation function and the target operation data to obtain a reduction vector calculation result.

In an optional implementation, when the target operation function is a data filling function, a data filling operation is performed using the data filling function and the target operation data to obtain a data filling result.

In an optional implementation, when the target operation function is a floating-point scalar comparison function, a floating-point scalar comparison operation is performed using the floating-point scalar comparison function and the target operation data to obtain a floating-point scalar comparison result.

In an optional implementation, when the target operation function is a division calculation function, a division calculation operation is performed using the division calculation function and the target operation data to obtain a division calculation result.

Specifically, for the implementation of performing the target operation using the target operation function and the target operation data to obtain the target operation result, reference may be made to the further description of the embodiments of the present application below, which is not repeated here.

Through the above steps S31 to S33, a plurality of candidate operation functions are generated in a target development mode, where the target development mode is the Tensor Iterator Kernel development mode determined based on the Tensor Boost Engine operator development framework of the Ascend artificial intelligence processor; a target operation function to be used is then selected from the plurality of candidate operation functions; and finally a target operation is performed using the target operation function and target operation data to obtain a target operation result. This achieves the purpose of quickly performing the target operation on the target operation data with the target operation function to obtain the target operation result, thereby realizing the technical effect of improving operator development efficiency and reducing the development workload, and thus solving the technical problem in the related art of low development efficiency for high-performance operators.

The high-performance operator generation method suitable for Huawei Ascend chips described in the above embodiment is further described below.

Optionally, in step S32, selecting the target operation function from the plurality of candidate operation functions includes:

Step S321: select a data moving function from the plurality of candidate operation functions.

Specifically, the data moving function includes algorithms for automatic data partitioning and data staging.

Optionally, in step S33, performing the target operation using the target operation function and the target operation data to obtain the target operation result includes:

Step S41: acquire the target operation data, where the target operation data includes a first source operand, a first destination operand, and a data length.

The first source operand includes the data blocks to be moved; by performing the data moving operation, the data blocks to be moved in the first source operand can be moved to the first destination operand.

Step S42: perform a data moving operation using the data moving function, the first source operand, the first destination operand, and the data length to obtain a data moving result.

Specifically, for the implementation of performing the data moving operation using the data moving function, the first source operand, the first destination operand, and the data length to obtain the data moving result, reference may be made to the further description in the embodiments below, which is not repeated here.

Based on the above steps S41 to S42, the data moving operation can be performed using the data moving function, the first source operand, the first destination operand, and the data length, and the data moving result can be obtained quickly, thereby improving the efficiency of data moving operations when developing high-performance operators.

Optionally, in step S42, performing the data moving operation using the data moving function, the first source operand, the first destination operand, and the data length to obtain the data moving result includes:

Step S421: partition the first source operand into blocks according to the data length to obtain a blocking result.

In the Ascend AI processor, the smallest unit of data moving is a block, and one block equals 32 bytes. If the data length is an integer multiple of 32 bytes, the blocking result contains only the part of the data that can be divided into whole blocks; if the data length is not an integer multiple of 32 bytes, the blocking result contains not only the part of the data that can be divided into whole blocks but also the part that cannot, namely the tail block.
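The blocking arithmetic itself is simple. The following Python sketch (a helper of our own, assuming the data length is given in bytes) shows how the whole-block count and the tail block fall out of the 32-byte block size.

```python
BLOCK_BYTES = 32  # smallest data-moving unit on the Ascend AI Core

def split_into_blocks(length_bytes: int) -> tuple[int, int]:
    """Return (number of whole blocks, size of the tail block in bytes)."""
    whole_blocks = length_bytes // BLOCK_BYTES
    tail_bytes = length_bytes % BLOCK_BYTES  # 0 means there is no tail block
    return whole_blocks, tail_bytes

print(split_into_blocks(128))  # (4, 0) -- no tail block
print(split_into_blocks(100))  # (3, 4) -- 3 whole blocks plus a 4-byte tail block
```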

Step S422: when it is determined from the blocking result that no tail block exists, move the first source operand to the first destination operand through the data moving function and the blocking result to obtain the data moving result.

Step S423: when it is determined from the blocking result that a tail block exists, determine a target moving mode according to the storage locations of the first source operand and the first destination operand, and move the first source operand to the first destination operand through the data moving function and the target moving mode to obtain the data moving result.

Specifically, when it is determined from the blocking result that a tail block exists, it is first determined whether the first source operand and the first destination operand are stored in global memory or in a buffer, and the target moving mode is then determined accordingly.

As an optional implementation, when both the first source operand and the first destination operand are located in a buffer, a move instruction (set_as) is used to move the tail-block data of the first source operand element by element into the first destination operand.

As an optional implementation, when the first source operand is located in a buffer and the first destination operand is located in global memory, a buffer staging space with a capacity of one block is requested, the set_as instruction is used to move the data of the tail block of the first source operand element by element into the buffer staging space, and this block is then moved in a single transfer to the first destination operand.

As an optional implementation, when the first source operand is located in global memory and the first destination operand is in a buffer, a buffer staging space with a capacity of one block is requested, the tail block of the first source operand is moved in a single transfer into the buffer staging space, and the set_as instruction is then used to move the data in the buffer staging space element by element to the first destination operand.

As an optional implementation, when both the first source operand and the first destination operand are located in global memory, a buffer staging space whose capacity is divisible by 32 bytes is requested, the number of required data staging rounds is calculated, and a loop instruction (for_range) is then used to stage the data. In the last iteration, the moving of the tail block needs to be considered; the moving process at this point may refer to the tail-block handling in the above implementations and is not repeated here. If the amount of data in the last iteration is less than one block, the last block of the second-to-last iteration is moved together with it, ensuring that at least one block of data is transferred.
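To make the global-memory-to-global-memory case concrete, the following is a minimal Python-style sketch of the staging loop under stated assumptions: ub_alloc, burst_copy, and tail_copy are hypothetical stand-ins for the TIK scratch-buffer allocation, the block-granular data moves, and the set_as-based tail handling named above, and the total data length is assumed to be at least one block. It is an outline of the control flow, not the actual implementation.

```python
BLOCK_BYTES = 32

def copy_gm_to_gm(src_gm, dst_gm, length_bytes, scratch_capacity):
    """Stage a GM-to-GM copy through a unified-buffer scratch area (sketch)."""
    # The scratch area must hold a whole number of 32-byte blocks.
    scratch_bytes = scratch_capacity - scratch_capacity % BLOCK_BYTES
    scratch = ub_alloc(scratch_bytes)              # hypothetical UB allocation

    rounds = -(-length_bytes // scratch_bytes)     # ceiling division: staging rounds
    for i in range(rounds):                        # for_range(...) loop in TIK
        start = i * scratch_bytes
        chunk = min(scratch_bytes, length_bytes - start)
        if i == rounds - 1 and chunk < BLOCK_BYTES:
            # Last round holds less than one block: widen the window backwards so
            # it also covers the previous round's final block (a harmless re-copy).
            start = length_bytes - BLOCK_BYTES
            chunk = BLOCK_BYTES
        whole, tail = divmod(chunk, BLOCK_BYTES)
        # Whole blocks go GM -> UB -> GM in bursts.
        burst_copy(dst=scratch, dst_off=0, src=src_gm, src_off=start, blocks=whole)
        burst_copy(dst=dst_gm, dst_off=start, src=scratch, src_off=0, blocks=whole)
        if tail:  # only possible in the last round: set_as-based tail handling
            tail_off = start + whole * BLOCK_BYTES
            tail_copy(dst=dst_gm, src=src_gm, scratch=scratch,
                      offset=tail_off, nbytes=tail)
```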

Based on the above steps S421 to S423, the first source operand is partitioned into blocks according to the data length to obtain a blocking result; when it is determined from the blocking result that no tail block exists, the first source operand is moved to the first destination operand through the data moving function and the blocking result to obtain the data moving result; and when it is determined from the blocking result that a tail block exists, a target moving mode is determined according to the storage locations of the first source operand and the first destination operand, and the first source operand is moved to the first destination operand through the data moving function and the target moving mode to obtain the data moving result. Data partitioning and data moving between different storage locations can thus be performed automatically. Compared with the related art, in which data cannot be moved directly from global memory to global memory and developers must manually stage the data through a buffer, the high-performance operator generation method suitable for Huawei Ascend chips in the embodiments of the present application can perform data partitioning and data staging automatically, thereby effectively improving data moving efficiency when developing high-performance operators.

Optionally, in step S32, selecting the target operation function from the plurality of candidate operation functions includes:

Step S322, selecting a precision vector calculation function from the plurality of candidate operation functions.

The precision vector calculation function covers both general vector calculation and high-precision vector calculation. General vector calculation includes the following six calculation types: unary, binary, scalar-binary, scalar-ternary, compare/select, and precision conversion. Compared with general vector calculation, high-precision vector calculation requires additional buffer scratch space. In the related art, the developer has to calculate the size of this scratch space; in the embodiments of the present application, the precision vector calculation function calculates the size of the scratch space automatically, which improves the efficiency of precision vector calculation when developing operators.

Optionally, in step S33, performing the target operation by using the target operation function and the target operation data to obtain the target operation result includes:

Step S51, acquiring the target operation data, wherein the target operation data includes: a second source operand, a second destination operand, and a first instruction name;

The second source operand contains the data on which the precision vector calculation is to be performed, and the precision vector calculation result may be moved to the second destination operand. The first instruction includes one or more of a unary vector calculation instruction, a binary vector calculation instruction, a scalar-binary vector calculation instruction, a scalar-ternary vector calculation instruction, a compare/select instruction, and a precision conversion instruction.

Step S52, performing a precision vector calculation operation by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain a precision vector calculation result.

Specifically, the implementation of performing the precision vector calculation operation by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain the precision vector calculation result may refer to the further description in the embodiments below and is not repeated here.

Based on the above steps S51 to S52, the precision vector calculation operation can be performed by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name, and the precision vector calculation result can be obtained quickly, thereby improving the efficiency of precision vector calculation when developing high-performance operators.

Optionally, in step S52, performing the precision vector calculation operation by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain the precision vector calculation result includes:

Step S521, when the second source operand is not located in the global memory, determining the number of precision vector calculations, performing the precision vector calculation operation on the second source operand by using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and moving the precision vector calculation result to the second destination operand;

Specifically, taking the first instruction being a unary vector calculation instruction as an example: because the vector processing unit on the AI Core can only process a limited amount of data per instruction, the number of calculations performed by the vector processing unit, i.e. the number of precision vector calculations, needs to be determined according to the data length on the AI Core. After the number of precision vector calculations is determined, the second source operand is divided into blocks in a loop, the TIK instruction is invoked to perform the precision vector calculation, and the precision vector calculation result is written to the second destination operand. Multi-thread optimization is enabled during this process to improve execution performance.
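
A minimal sketch of how the number of calculations can be derived, assuming (as for the reduction instruction described below) that one repeat of the vector unit covers 256 bytes, i.e. 128 float16 elements:

    def vector_repeat_plan(num_elems, elem_bytes):
        per_repeat = 256 // elem_bytes          # elements handled by one repeat of the vector unit
        repeats, remainder = divmod(num_elems, per_repeat)
        # remainder elements are handled by one extra, partially masked repeat
        return repeats, remainder

    vector_repeat_plan(1000, 2)                 # -> (7, 104) for float16 data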

As an optional implementation, if the second destination operand is located in the buffer, the second destination operand is specified directly in the first instruction; if the second destination operand is not located in the buffer, the precision vector calculation result is first stored in a scratch space in the buffer, and the data transfer function of the above embodiments is then invoked to move the precision vector calculation result from the scratch space to the second destination operand.

Step S522, when the second source operand is located in the global memory, performing multi-core optimization processing on the second source operand to obtain an optimization processing result, determining the number of precision vector calculations, performing the precision vector calculation operation on the optimization processing result by using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and moving the precision vector calculation result to the second destination operand.

Specifically, continuing with the example in which the first instruction is a unary vector calculation instruction, when the second source operand is located in the global memory, multi-core optimization processing is performed on the second source operand to obtain the optimization processing result.

First, a core-splitting scheme is formulated: the application programming interface (API) of the TBE platform is called to obtain the number of AI Cores of the current processor (CORE_NUM); letting the data amount of the second source operand be n, the n data elements are distributed as evenly as possible across the AI Cores. Since the smallest unit of data movement in the Ascend AI processor is a block, the data length n is converted into a number of blocks and rounded down, which avoids frequent tail-block handling, giving the block count (n_block). If n_block equals 0, the amount of data is less than one block; in this case multiple cores are unnecessary and a single AI Core is used for processing. If n_block is greater than 0, the n_block blocks are distributed evenly across the AI Cores. If they cannot be divided exactly evenly, the AI Cores with smaller indices take on more of the work. For example, when n_block equals 34 and CORE_NUM equals 32, core 0 and core 1 each process two blocks, and the remaining 30 cores each process one block.
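
The core-splitting scheme can be sketched as follows (plain Python, assuming only the 32-byte block size and the even-split rule described above):

    def split_blocks_across_cores(n, elem_bytes, core_num):
        n_block = (n * elem_bytes) // 32          # data length converted to whole blocks, rounded down
        if n_block == 0:
            return None                           # less than one block: use a single AI Core
        base, extra = divmod(n_block, core_num)
        # cores with a smaller index take one extra block when the split is uneven
        return [base + 1 if i < extra else base for i in range(core_num)]

    split_blocks_across_cores(34 * 16, 2, 32)     # 34 float16 blocks -> [2, 2, 1, 1, ..., 1]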

Second, after the core-splitting scheme is formulated, the data is distributed to the different AI Cores, the number of precision vector calculations is determined, the precision vector calculation operation is performed on the optimization processing result by using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and the precision vector calculation result is moved to the second destination operand.

Based on the above steps S521 to S522, the precision vector calculation operation can be performed by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name, and the precision vector calculation result can be obtained quickly, further improving the efficiency of precision vector calculation when developing high-performance operators.

Optionally, in step S32, selecting the target operation function from the plurality of candidate operation functions includes:

Step S323, selecting a reduction vector calculation function from the plurality of candidate operation functions.

The functions of the reduction vector calculation function include finding the maximum value and its index, finding the minimum value and its index, finding the sum of all values, and finding the product of all values.

Because the reduction instruction in the TIK development mode can only reduce 256 bytes of data at a time, larger amounts of data require the developer to carry out multiple rounds of iteration manually. With the high-performance operator generation method for Huawei Ascend chips in this embodiment, the developer only needs to call the reduction vector calculation function and pass in the source operand, the destination operand, and the data amount; the reduction vector calculation function carries out the multiple rounds of iteration automatically, which improves the efficiency of reduction vector calculation when developing high-performance operators.

Optionally, in step S33, performing the target operation by using the target operation function and the target operation data to obtain the target operation result includes:

Step S61, acquiring the target operation data, wherein the target operation data includes: a third source operand, a third destination operand, and a first data amount;

The first data amount is the length of the third source operand.

Step S62, performing multiple rounds of iterative processing by using the reduction vector calculation function, the third source operand, the third destination operand, and the first data amount to obtain a reduction vector calculation result.

Based on the above steps S61 to S62, multiple rounds of iterative processing can be performed by using the reduction vector calculation function, the third source operand, the third destination operand, and the first data amount, and the reduction vector calculation result can be obtained quickly, thereby improving the efficiency of reduction vector calculation when developing high-performance operators.

Specifically, the multi-round iterative processing is described below by taking the reduction vector calculation that finds the maximum value and its index as an example.

First, the number of rounds to be executed and the length of the result produced by the first round of iteration are calculated. The result length depends on the length of the third source operand and its data type. For example, when the length of the third source operand equals 1024 and the data type is the 16-bit floating-point type (float16), the vector processing unit can reduce 128 float16 values at a time and produce a result of length 2; the result of the first round of iteration therefore has length 16. In the second round, a result of length 16 can be processed in a single pass, so the whole calculation requires two rounds.
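
A minimal sketch of this planning step (plain Python; per_reduce and pair_len encode the figures given in the text, namely that one reduce instruction handles 128 float16 values and emits a value/index pair of length 2):

    def reduction_plan(n, per_reduce=128, pair_len=2):
        rounds, first_round_len = 1, n            # at least one (final) round
        while n > per_reduce:
            n = -(-n // per_reduce) * pair_len    # ceil(n / per_reduce) value/index pairs
            if rounds == 1:
                first_round_len = n
            rounds += 1
        return rounds, first_round_len

    reduction_plan(1024)                          # -> (2, 16), as in the float16 example above
    reduction_plan(16, per_reduce=4)              # -> (3, 8), as in the example of Figure 5 below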

Second, based on this plan, an intermediate result matrix whose width equals the length of the result produced by the first round and whose height equals the number of rounds is allocated in the buffer. Several rounds of TIK reduction instructions are then executed, and the results are stored in the intermediate result matrix. Fig. 4 is a schematic diagram of an intermediate result matrix according to an embodiment of the present application; as shown in Fig. 4, n rounds of iterative processing are performed on the third source operand vector in the buffer, and the result of each round is stored in the corresponding row of the intermediate result matrix.

Further, the final index is computed recursively from the bottom up. In the result produced by the last iteration, the maximum value is the final maximum value and the minimum value is the final minimum value; however, the index obtained there is an index into the result produced by the second-to-last iteration, so the final index has to be computed recursively from the bottom up.

For example, Fig. 5 is a schematic diagram of another intermediate result matrix according to an embodiment of the present application. As shown in Fig. 5, the third source operand is [13, 3, 5, 9, 4, 8, 1, 6, 7, 2, 15, 0, 11, 14, 10, 12], and the vector calculation unit can only reduce 4 numbers at a time. The length of the first-round result is calculated to be 8 and 3 rounds of iteration are needed, so a 3×8 intermediate result matrix is allocated.

In the first round of iteration, the first four numbers of the source operand, 13, 3, 5, 9, are reduced first; the maximum value is 13, located at position 0, so 13, 0 is filled into the first row of the intermediate result matrix. The next four numbers, 4, 8, 1, 6, are then processed; the maximum value is 8, located at position 1, so 8, 1 is filled into the first row of the intermediate result matrix. Continuing in this way, the first round of calculation is completed after four reductions.

In the second round of iteration, the first four entries of the first row of the intermediate result matrix, 13, 0, 8, 1, are reduced first. The entries "0" and "1" are indices, so they are masked out when the TIK instruction is called and do not take part in the actual reduction. The result obtained in this round of iteration is 13, 0, 15, 0, which is written into the second row of the intermediate result matrix.

In the third round of iteration, the result obtained is 15, 2, which is written into the third row of the intermediate result matrix.

The final index is then computed recursively from the bottom up. The index obtained in the last round, i.e. the third round of iteration, equals 2. Using index 2, looking up the result of the second round of iteration gives "15, 0", i.e. the index of the second round equals 0. Using index 0, looking up the result of the first round of iteration gives "15, 2". Using index 2, looking up the source operand gives index 10 in the source operand. The final result is therefore "15, 10": the maximum value equals 15 and its index equals 10.
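
The bottom-up lookup can be written as the following sketch (plain Python; the layout of the intermediate rows as flat value/index pairs and the chunk size of 4 entries are inferred from the example of Figure 5, and the function only illustrates the recursion, not the TIK implementation):

    def recover_max_index(rows, per_reduce):
        # rows[r]: the r-th row of the intermediate result matrix, stored as a flat
        # list of value/index pairs; per_reduce: entries consumed by one reduce instruction
        p = rows[-1][1]                           # entry position reported by the final round
        for r in range(len(rows) - 2, -1, -1):
            local = rows[r][p + 1]                # index stored next to the maximum value at entry p
            p = (p // 2) * per_reduce + local     # position one level further down
        return p                                  # index of the maximum in the source operand

    rows = [[13, 0, 8, 1, 15, 2, 14, 1],          # round 1
            [13, 0, 15, 0],                       # round 2
            [15, 2]]                              # round 3
    recover_max_index(rows, 4)                    # -> 10, matching the example above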

It should be noted that the implementation of the reduction calculations for finding the minimum value and its index, the sum of all values, and the product of all values may refer to the implementation of finding the maximum value described above and is not repeated here.

When the reduction calculation finds the sum of all values or the product of all values, one reduction instruction produces a single number rather than two, because there is no index value; accordingly, there is no need to compute an index recursively, and the result of the last round of iteration is the final result.

Optionally, in step S32, selecting the target operation function from the plurality of candidate operation functions includes:

Step S324, selecting a data filling function from the plurality of candidate operation functions.

Optionally, in step S33, performing the target operation by using the target operation function and the target operation data to obtain the target operation result includes:

Step S71, acquiring the target operation data, wherein the target operation data includes: a fourth source operand, a fourth destination operand, and a second instruction name;

Step S72, performing a data filling operation by using the data filling function, the fourth source operand, the fourth destination operand, and the second instruction name to obtain a data filling result.

It should be noted that the implementation of performing the data filling operation by using the data filling function, the fourth source operand, the fourth destination operand, and the second instruction name to obtain the data filling result may refer to the implementation of performing the precision vector calculation operation by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name in the above embodiments, and is not repeated here.

Based on the above steps S71 to S72, the data filling operation can be performed by using the data filling function, the fourth source operand, the fourth destination operand, and the second instruction name, and the data filling result can be obtained quickly, thereby improving data filling efficiency when developing high-performance operators.

Optionally, in step S32, selecting the target operation function from the plurality of candidate operation functions includes:

Step S325, selecting a floating-point scalar comparison function from the plurality of candidate operation functions.

Optionally, in step S33, performing the target operation by using the target operation function and the target operation data to obtain the target operation result includes:

Step S81, acquiring the target operation data, wherein the target operation data includes: a first floating-point scalar and a second floating-point scalar;

Step S82, performing a floating-point scalar comparison operation by using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain a floating-point scalar comparison result.

Specifically, the implementation of performing the floating-point scalar comparison operation by using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain the floating-point scalar comparison result may refer to the further description in the embodiments below.

Based on the above steps S81 to S82, the floating-point scalar comparison operation is performed by using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar, and the floating-point scalar comparison result is obtained quickly, thereby improving floating-point scalar comparison efficiency when developing high-performance operators.

Optionally, in step S82, performing the floating-point scalar comparison operation by using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain the floating-point scalar comparison result includes:

Step S821, writing the first floating-point scalar into a first tensor, and writing the second floating-point scalar into a second tensor;

Step S822, obtaining the result of the difference operation between the first tensor and the second tensor by using a vector subtraction instruction, and converting the difference result into an integer scalar;

Step S823, performing the floating-point scalar comparison operation based on the integer scalar to obtain the floating-point scalar comparison result.

Specifically, floating-point subtraction and static data-type conversion are used to turn the comparison of floating-point scalars into a comparison of integer scalars.

For example, the two floating-point scalars to be compared (fp1 and fp2) are written into two tensors: fp1 is written into the first tensor and fp2 into the second tensor.

The vector subtraction instruction is used to obtain the result of the difference operation between the first tensor and the second tensor (i.e. fp1-fp2), and the result is then written into a scalar and force-cast to an integer scalar. The content of this integer scalar is still the floating-point bit pattern, but it is interpreted as an integer.

The relative order of the two floating-point numbers is judged from the sign of the converted integer scalar. Because both floating-point numbers and integers use the highest bit of their binary representation as the sign bit, the meaning of the sign bit is preserved when the difference of the two floating-point numbers is converted into an integer scalar, and the relative order of the original floating-point numbers can be determined from it. The specific judgment process is as follows:

Judge whether fp1-fp2 equals 0; if so, fp1 equals fp2 and the algorithm ends. Otherwise, judge the highest bit of fp1-fp2: if the highest bit equals 0, fp1 is greater than fp2; if the highest bit equals 1, fp1 is less than fp2.

For example, to judge the relative order of two single-precision floating-point numbers fp1 and fp2, where fp1 equals 3.7 with binary representation "01000000011011001100110011001101" and fp2 equals 4.1 with binary representation "01000000100000110011001100110011": fp1-fp2 = -0.4, whose binary representation is "10111110110011001100110011001000". The highest bit of this representation equals "1", indicating that the result of fp1-fp2 is negative, from which it can be concluded that fp1 is less than fp2.
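
The same judgment can be reproduced on the host side with the following sketch (plain Python using the standard struct module; it only illustrates reinterpreting the float32 difference as a signed integer, not the on-chip tensor operations):

    import struct

    def float_compare(fp1, fp2):
        # pack the float32 difference and unpack it as a signed 32-bit integer,
        # so the sign bit of the difference decides the comparison
        diff_bits = struct.unpack('<i', struct.pack('<f', fp1 - fp2))[0]
        if diff_bits == 0:
            return 'equal'
        return 'greater' if diff_bits > 0 else 'less'

    float_compare(3.7, 4.1)   # -> 'less', consistent with the bit patterns shown above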

Based on the above steps S821 to S823, by writing the first floating-point scalar into the first tensor and the second floating-point scalar into the second tensor, obtaining the result of the difference operation between the first tensor and the second tensor with the vector subtraction instruction, converting the difference result into an integer scalar, and finally performing the floating-point scalar comparison operation based on the integer scalar, the floating-point scalar comparison result is obtained quickly, further improving floating-point scalar comparison efficiency when developing high-performance operators.

Optionally, in step S32, selecting the target operation function from the plurality of candidate operation functions includes:

Step S326, selecting a division calculation function from the plurality of candidate operation functions.

Optionally, in step S33, performing the target operation by using the target operation function and the target operation data to obtain the target operation result includes:

Step S91, acquiring the target operation data, wherein the target operation data includes: a fifth source operand, a fifth destination operand, and a second data amount;

Step S92, performing a division calculation operation by using the division calculation function, the fifth source operand, the fifth destination operand, and the second data amount to obtain a division calculation result.

Specifically, the precision vector calculation function is called, and high-precision vector calculation is used to perform a reciprocal operation followed by a multiplication operation to obtain the division calculation result: the reciprocal of the divisor is computed first, and the reciprocal of the divisor is then multiplied by the dividend, thereby implementing the division calculation operation and obtaining the division calculation result.
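
A minimal element-wise illustration of this decomposition (plain Python; the on-chip version would map the two steps onto the high-precision reciprocal and multiplication primitives described above):

    def divide(dividend, divisor):
        # division expressed as reciprocal followed by multiplication
        return [a * (1.0 / b) for a, b in zip(dividend, divisor)]

    divide([6.0, 9.0], [2.0, 3.0])   # -> [3.0, 3.0]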

Based on the above steps S91 to S92, the division calculation operation can be performed by using the division calculation function, the fifth source operand, the fifth destination operand, and the second data amount, and the division calculation result can be obtained quickly, thereby improving division calculation efficiency when developing high-performance operators.

The high-performance operator generation method for Huawei Ascend chips in the embodiments of the present application is based on the TIK development mode and provides developers with a series of operation functions. These functions implement operations such as data transfer, general vector calculation, high-precision vector calculation, reduction vector calculation, data filling, floating-point scalar comparison, and division, and automatically handle data blocking, data alignment, multi-core and multi-thread optimization, and buffer management. Developers call these operation functions to implement the computation logic of an operator, and the operation functions perform optimization and code generation on the developer's behalf, so that developers do not need to concern themselves with details such as data blocking, data alignment, multi-core and multi-thread optimization, and buffer management. Development time is shortened by more than 50%, which effectively improves development efficiency in the process of developing high-performance operators.

In an optional embodiment, the algorithms in the embodiments of the present application are encapsulated into the TIK compiler as new TIK instructions, and corresponding interfaces are provided to developers, thereby effectively improving development efficiency in the process of developing high-performance operators.

From the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.

This embodiment also provides a high-performance operator generation apparatus for Huawei Ascend chips. The apparatus is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 6 is a structural block diagram of a high-performance operator generation apparatus for Huawei Ascend chips according to an embodiment of the present application. As shown in Fig. 6, the high-performance operator generation apparatus 600 for Huawei Ascend chips includes:

a generation module 601, configured to generate a plurality of candidate operation functions in a target development mode, wherein the target development mode is a tensor iterator kernel (TIK) development mode determined based on the tensor acceleration engine operator development framework of the Ascend artificial intelligence processor;

a selection module 602, configured to select a target operation function to be used from the plurality of candidate operation functions;

a processing module 603, configured to perform a target operation by using the target operation function and target operation data to obtain a target operation result.

Optionally, the selection module 602 is further configured to select a data transfer function from the plurality of candidate operation functions.

Optionally, the processing module 603 is further configured to: acquire the target operation data, wherein the target operation data includes a first source operand, a first destination operand, and a data length; and perform a data transfer operation by using the data transfer function, the first source operand, the first destination operand, and the data length to obtain a data transfer result.

Optionally, the processing module 603 is further configured to: divide the first source operand into blocks by using the data length to obtain a blocking result; when it is determined from the blocking result that no tail block exists, move the first source operand to the first destination operand through the data transfer function and the blocking result to obtain the data transfer result; and when it is determined from the blocking result that a tail block exists, determine a target transfer mode according to the storage locations of the first source operand and the first destination operand, and move the first source operand to the first destination operand through the data transfer function and the target transfer mode to obtain the data transfer result.

Optionally, the selection module 602 is further configured to select a precision vector calculation function from the plurality of candidate operation functions.

Optionally, the processing module 603 is further configured to: acquire the target operation data, wherein the target operation data includes a second source operand, a second destination operand, and a first instruction name; and perform a precision vector calculation operation by using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain a precision vector calculation result.

Optionally, the processing module 603 is further configured to: when the second source operand is not located in the global memory, determine the number of precision vector calculations, perform the precision vector calculation operation on the second source operand by using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and move the precision vector calculation result to the second destination operand; and when the second source operand is located in the global memory, perform multi-core optimization processing on the second source operand to obtain an optimization processing result, determine the number of precision vector calculations, perform the precision vector calculation operation on the optimization processing result by using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and move the precision vector calculation result to the second destination operand.

Optionally, the selection module 602 is further configured to select a reduction vector calculation function from the plurality of candidate operation functions.

Optionally, the processing module 603 is further configured to: acquire the target operation data, wherein the target operation data includes a third source operand, a third destination operand, and a first data amount; and perform multiple rounds of iterative processing by using the reduction vector calculation function, the third source operand, the third destination operand, and the first data amount to obtain a reduction vector calculation result.

Optionally, the selection module 602 is further configured to select a data filling function from the plurality of candidate operation functions.

Optionally, the processing module 603 is further configured to: acquire the target operation data, wherein the target operation data includes a fourth source operand, a fourth destination operand, and a second instruction name; and perform a data filling operation by using the data filling function, the fourth source operand, the fourth destination operand, and the second instruction name to obtain a data filling result.

Optionally, the selection module 602 is further configured to select a floating-point scalar comparison function from the plurality of candidate operation functions.

Optionally, the processing module 603 is further configured to: acquire the target operation data, wherein the target operation data includes a first floating-point scalar and a second floating-point scalar; and perform a floating-point scalar comparison operation by using the floating-point scalar comparison function, the first floating-point scalar, and the second floating-point scalar to obtain a floating-point scalar comparison result.

Optionally, the processing module 603 is further configured to: write the first floating-point scalar into a first tensor and write the second floating-point scalar into a second tensor; obtain the result of the difference operation between the first tensor and the second tensor by using a vector subtraction instruction and convert the difference result into an integer scalar; and perform the floating-point scalar comparison operation based on the integer scalar to obtain the floating-point scalar comparison result.

Optionally, the selection module 602 is further configured to select a division calculation function from the plurality of candidate operation functions.

Optionally, the processing module 603 is further configured to: acquire the target operation data, wherein the target operation data includes a fifth source operand, a fifth destination operand, and a second data amount; and perform a division calculation operation by using the division calculation function, the fifth source operand, the fifth destination operand, and the second data amount to obtain a division calculation result.

It should be noted that the above modules can be implemented by software or hardware. For the latter, this can be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.

Embodiments of the present application further provide a non-volatile storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps of any of the above method embodiments when run.

Optionally, in this embodiment, the non-volatile storage medium may be configured to store a computer program for executing the following steps:

S1, generating a plurality of candidate operation functions in a target development mode, wherein the target development mode is a tensor iterator kernel (TIK) development mode determined based on the tensor acceleration engine operator development framework of the Ascend artificial intelligence processor;

S2, selecting a target operation function to be used from the plurality of candidate operation functions;

S3, performing a target operation by using the target operation function and target operation data to obtain a target operation result.

Optionally, in this embodiment, the non-volatile storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing a computer program.

Embodiments of the present application further provide an electronic apparatus including a memory, a processor, a first wireless network chip, and a second wireless network chip, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.

Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:

S1, generating a plurality of candidate operation functions in a target development mode, wherein the target development mode is a tensor iterator kernel (TIK) development mode determined based on the tensor acceleration engine operator development framework of the Ascend artificial intelligence processor;

S2, selecting a target operation function to be used from the plurality of candidate operation functions;

S3, performing a target operation by using the target operation function and target operation data to obtain a target operation result.

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.

The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the present application, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units may be a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.

The above are only preferred implementations of the present application. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (20)

1. A method for generating high-performance operators for a Huawei Ascend chip, comprising:
generating a plurality of candidate operation functions in a target development mode, wherein the target development mode is a tensor iterator kernel development mode determined by a tensor acceleration engine operator development framework based on the Ascend artificial intelligence processor;
selecting a target operation function to be used from the plurality of candidate operation functions;
and performing a target operation by using the target operation function and target operation data to obtain a target operation result.
2. The method of claim 1, wherein selecting the target operation function from the plurality of candidate operation functions comprises:
a data-handling function is selected from the plurality of candidate operation functions.
3. The method as claimed in claim 2, wherein the performing a target operation using the target operation function and the target operation data to obtain the target operation result comprises:
obtaining the target operation data, wherein the target operation data comprises: a first source operand, a first destination operand, and a data length;
and executing data carrying operation by using the data carrying function, the first source operand, the first destination operand and the data length to obtain a data carrying result.
4. The method as claimed in claim 3, wherein the performing data-handling operations using the data-handling function, the first source operand, the first destination operand, and the data length to obtain the data-handling result comprises:
carrying out blocking processing on the first source operand by using the data length to obtain a blocking result;
when it is determined that no tail block exists based on the blocking result, carrying the first source operand to the first destination operand through the data carrying function and the blocking result to obtain the data carrying result;
and when the tail block is determined to exist based on the blocking result, determining a target carrying mode according to the storage positions of the first source operand and the first destination operand, and carrying the first source operand to the first destination operand through the data carrying function and the target carrying mode to obtain the data carrying result.
5. The method of claim 1, wherein selecting the target operation function from the plurality of candidate operation functions comprises:
a precision vector calculation function is selected from the plurality of candidate operation functions.
6. The method as claimed in claim 5, wherein the performing the target operation using the target operation function and the target operation data to obtain the target operation result comprises:
obtaining the target operation data, wherein the target operation data comprises: a second source operand, a second destination operand, and a first instruction name;
and executing precision vector calculation operation by using the precision vector calculation function, the second source operand, the second destination operand and the first instruction name to obtain a precision vector calculation result.
7. The method of claim 6, wherein performing a precision vector calculation operation using the precision vector calculation function, the second source operand, the second destination operand, and the first instruction name to obtain the precision vector calculation result comprises:
when the second source operand is not located in the global memory, determining the number of precision vector calculation, performing precision vector calculation operation on the second source operand by using the precision vector calculation function and the first instruction name to obtain a precision vector calculation result, and carrying the precision vector calculation result to the second destination operand;
when the second source operand is located in the global memory, performing multi-core optimization processing on the second source operand to obtain an optimization processing result, determining the precision vector calculation times, performing precision vector calculation operation on the optimization processing result by using the precision vector calculation function and the first instruction name to obtain the precision vector calculation result, and carrying the precision vector calculation result to the second destination operand.
8. The method of claim 1, wherein selecting the target operation function from the plurality of candidate operation functions comprises:
selecting a reduced vector calculation function from the plurality of candidate operation functions.
9. The method as claimed in claim 8, wherein the performing a target operation using the target operation function and the target operation data to obtain the target operation result comprises:
obtaining the target operation data, wherein the target operation data comprises: a third source operand, a third destination operand, and a first data volume;
and executing multiple rounds of iterative processing by using the reduced vector calculation function, the third source operand, the third destination operand and the first data volume to obtain a reduced vector calculation result.
10. The method of claim 1, wherein selecting the target operation function from the plurality of candidate operation functions comprises:
a data fill function is selected from the plurality of candidate operation functions.
11. The method as claimed in claim 10, wherein the performing a target operation using the target operation function and the target operation data to obtain the target operation result comprises:
obtaining the target operation data, wherein the target operation data comprises: a fourth source operand, a fourth destination operand, and a second instruction name;
and executing data filling operation by using the data filling function, the fourth source operand, the fourth destination operand and the second instruction name to obtain a data filling result.
12. The method of claim 1, wherein selecting the target operation function from the plurality of candidate operation functions comprises:
selecting a floating point number scalar compare function from the plurality of candidate operation functions.
13. The method as claimed in claim 12, wherein the performing a target operation using the target operation function and the target operation data to obtain the target operation result comprises:
obtaining the target operation data, wherein the target operation data comprises: a first floating point scalar and a second floating point scalar;
and executing a floating point scalar comparison operation by using the floating point scalar comparison function, the first floating point scalar and the second floating point scalar to obtain a floating point scalar comparison result.
14. The method of claim 13, wherein performing a floating point scalar compare operation using the floating point scalar compare function, the first floating point scalar and the second floating point scalar to obtain the floating point scalar compare result comprises:
writing the first floating point scalar to a first tensor, and writing the second floating point scalar to a second tensor;
obtaining a difference operation result of the first tensor and the second tensor by using a vector subtraction instruction, and converting the difference operation result into an integer scalar;
and executing a floating point scalar comparison operation based on the integer scalar to obtain a floating point scalar comparison result.
15. The method of claim 1, wherein selecting the target operation function from the plurality of candidate operation functions comprises:
selecting a division calculation function from the plurality of candidate operation functions.
16. The method as claimed in claim 15, wherein performing the target operation by using the target operation function and the target operation data to obtain the target operation result comprises:
obtaining the target operation data, wherein the target operation data comprises: a fifth source operand, a fifth destination operand, and a second data volume;
and executing a division calculation operation by using the division calculation function, the fifth source operand, the fifth destination operand and the second data volume to obtain a division calculation result.
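A hedged sketch of the division step of claim 16, assuming the destination operand already holds the dividend and the source operand holds the divisor; only the restriction to the second data volume is taken from the claim:

    import numpy as np

    def division_calc(src, dst, data_volume):
        # src and dst are assumed to be float NumPy arrays with at least data_volume elements
        dst[:data_volume] = dst[:data_volume] / src[:data_volume]  # element-wise divide over the stated volume
        return dst                                                 # division calculation result

    # example: division_calc(np.full(8, 2.0), np.arange(8, dtype=np.float32), 8)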
17. A high performance operator generation apparatus for Huawei Ascend chips, comprising:
a generation module for generating a plurality of candidate operation functions in a target development mode, wherein the target development mode is a tensor iterator kernel development mode determined by a tensor acceleration engine operator development framework based on the Ascend artificial intelligence processor;
a selection module for selecting a target operation function to be used from the plurality of candidate operation functions;
and a processing module for executing the target operation by using the target operation function and the target operation data to obtain a target operation result.
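The three modules of claim 17 map naturally onto a small class; the registry-by-name selection rule below is an assumption introduced only to keep the sketch self-contained:

    class HighPerformanceOperatorGenerator:
        """Structural sketch of the generation / selection / processing modules of claim 17."""

        def __init__(self, candidate_functions):
            # generation module: holds the candidate operation functions produced in the
            # tensor iterator kernel development mode
            self.candidates = dict(candidate_functions)

        def select(self, op_name):
            # selection module: pick the target operation function to be used
            return self.candidates[op_name]

        def run(self, op_name, *operand_data):
            # processing module: execute the target operation on the target operation data
            target_fn = self.select(op_name)
            return target_fn(*operand_data)       # target operation result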
18. A non-volatile storage medium, wherein a computer program is stored in the storage medium, and wherein the computer program is configured, when run, to execute the high performance operator generation method for Huawei Ascend chips of any one of claims 1 to 17.
19. A processor for running a program, wherein the program is configured to execute the high performance operator generation method for Huawei Ascend chips of any one of claims 1 to 17.
20. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the high performance operator generation method for Huawei Ascend chips as claimed in any one of claims 1 to 17.
CN202210009738.9A 2022-01-05 2022-01-05 A high-performance operator generation method suitable for Huawei Ascend chips Active CN114327630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210009738.9A CN114327630B (en) 2022-01-05 2022-01-05 A high-performance operator generation method suitable for Huawei Ascend chips

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210009738.9A CN114327630B (en) 2022-01-05 2022-01-05 A high-performance operator generation method suitable for Huawei Ascend chips

Publications (2)

Publication Number Publication Date
CN114327630A true CN114327630A (en) 2022-04-12
CN114327630B CN114327630B (en) 2023-02-10

Family

ID=81024935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210009738.9A Active CN114327630B (en) 2022-01-05 2022-01-05 A high-performance operator generation method suitable for Huawei Ascend chips

Country Status (1)

Country Link
CN (1) CN114327630B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170103304A1 (en) * 2015-10-08 2017-04-13 Via Alliance Semiconductor Co., Ltd. Neural network unit with plurality of selectable output functions
WO2021057746A1 (en) * 2019-09-24 2021-04-01 安徽寒武纪信息科技有限公司 Neural network processing method and apparatus, computer device and storage medium
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
US20210397935A1 (en) * 2020-06-18 2021-12-23 Samsung Electronics Co., Ltd. Method, accelerator, and electronic device with tensor processing
CN113704689A (en) * 2021-08-25 2021-11-26 北京大学 Matrix multiplier processing method and device based on soar AI processor
CN113722269A (en) * 2021-08-26 2021-11-30 北京大学 Stride slice operator processing method and device based on soaring AI processor
CN113885941A (en) * 2021-09-06 2022-01-04 鹏城实验室 A method, device and related equipment for realizing singular value decomposition operation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A. Reuther et al.: "Survey of Machine Learning Accelerators", 2020 IEEE High Performance Extreme Computing Conference *
Wang Qinglin et al.: "Optimization of Winograd Fast Convolution Algorithm for Phytium Multi-core Processors", Journal of Computer Research and Development *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119295293A (en) * 2024-12-10 2025-01-10 合肥中科类脑智能技术有限公司 Optimization method and related equipment of attention operator based on Ascend AI processor
CN119295293B (en) * 2024-12-10 2025-03-18 合肥中科类脑智能技术有限公司 Optimization method and related equipment of attention operator based on Ascend AI processor

Also Published As

Publication number Publication date
CN114327630B (en) 2023-02-10

Similar Documents

Publication Publication Date Title
Liu et al. Lightweight deep learning for resource-constrained environments: A survey
JP7087079B2 (en) Robust gradient weight compression scheme for deep learning applications
US12373678B2 (en) Method and device for generating operation data and related product
CN112292667B (en) Method and apparatus for selecting a processor
CN114862656A (en) Method for acquiring training cost of distributed deep learning model based on multiple GPUs
CN113704689B (en) Matrix multiplier processing method and device based on soar AI processor
CN117492766A (en) Compiling method, compiler, neural network accelerator, chip and electronic equipment
CN108520300A (en) A method and device for implementing a deep learning network
CN118093203A (en) Data handling method, distributed training system, electronic device, and storage medium
CN118170347A (en) Precision conversion method and device, data processing method, processor, and electronic device
CN114327630B (en) A high-performance operator generation method suitable for Huawei Ascend chips
Wang et al. Reconfigurable CNN accelerator embedded in instruction extended RISC-V core
CN113885941A (en) A method, device and related equipment for realizing singular value decomposition operation
CN116402091A (en) Hybrid engine intelligent computing method and device for artificial intelligence chip
CN110956252B (en) Method and computing device for performing computation of multiple neural networks
CN112633502B (en) Cross-platform execution method and device of deep learning model and electronic equipment
CN119271274A (en) A method, device, equipment and medium for processing multi-dimensional data
CN119396592A (en) Data computing method, device and medium
CN118312130A (en) Data processing method and device, processor, electronic equipment and storage medium
CN117873394A (en) Data compression method, device, electronic device and readable storage medium
CN117348931A (en) Command devices, integrated circuit devices and boards
CN116468087A (en) Hardware accelerator for performing calculations of deep neural network and electronic device including same
CN119156617A (en) System and method for hardware acceleration for masking and normalizing data using triangle input mask
JP2023024960A (en) Optimization of memory usage for efficiently executing neural network
Gupta et al. Hardware based AI and ML

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant