WO2025091335A1

WO2025091335A1 - Multi-precision tensor multiplication in neural network

Info

Publication number: WO2025091335A1
Application number: PCT/CN2023/129107
Authority: WO
Inventors: Chen MENG; Pujiang HE; Bin Wang; Shan ZHOU
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2025-05-08
Anticipated expiration: 2026-05-01

Abstract

Tensor multiplication operations in a deep neural network (DNN) may be performed at multiple precision levels. Various precision levels may be selected for various DNN layers to achieve a desired accuracy of the DNN with minimized consumption of computational resource. A precision level may be selected from a plurality of predetermined precision levels for a layer. The layer may include a tensor multiplication operation on an activation tensor and a weight tensor. An activation in the activation tensor may be converted to lower-precision activations. A weight in the weight tensor may be converted to lower-precision weights. An algorithm determined based on the selected precision level may be used to compute an approximate product of an activation and weight using the lower-precision activations and lower-precision weights. A different precision level may be selected for a different layer, and a different algorithm may be used to compute products in the different layer.

Description

MULTI-PRECISION TENSOR MULTIPLICATION IN NEURAL NETWORK

Technical Field

This disclosure relates generally to neural networks (also referred to as “deep neural networks” or “DNNs”), and more specifically, to multi-precision tensor multiplication operations in DNNs.

Background

DNNs are used extensively for a variety of artificial intelligence applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands as each inference can require hundreds of millions of tensor operations as well as a large amount of data to read and write. Many tensor operations (e.g., MAC (multiply-accumulate) operations in convolutional layers, linear transformations in fully-connected layers, etc.) include tensor multiplications.

Brief Description of the Drawings

Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

FIG. 1 illustrates an example DNN, in accordance with various embodiments.

FIG. 2 illustrates an example convolution, in accordance with various embodiments.

FIG. 3 is a block diagram of a DNN system, in accordance with various embodiments.

FIG. 4 is a block diagram of a DNN module, in accordance with various embodiments.

FIG. 5 is a block diagram of a multi-precision tuning module, in accordance with various embodiments.

FIG. 6 illustrates an example graph, in accordance with various embodiments.

FIG. 7 illustrates data elements of different precisions, in accordance with various embodiments.

FIG. 8 illustrates an example multi-precision tensor multiplication operator, in accordance with various embodiments.

FIG. 9 illustrates locality and reuse of data in a multi-precision tensor multiplication operation, in accordance with various embodiments.

FIG. 10 is a flowchart showing a method of executing a tensor multiplication operation in a DNN, in accordance with various embodiments.

FIG. 11 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

Overview

The last decade has witnessed a rapid rise in AI (artificial intelligence) based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, image, and video processing mainly due to their ability to achieve beyond human-level accuracy. The significant improvements in DNN model size and accuracy coupled with the rapid increase in computing power of execution platforms have led to the adoption of DNN applications even within resource constrained mobile and edge devices that have limited energy availability.

Many DNN layers include tensor multiplication operations. A DNN layer may include one or more deep learning operations (also referred to as “neural network operations”), such as convolution, linear transformation, pooling, elementwise operation, linear operation, nonlinear operation, and so on. An essential part of convolution or linear transformation is tensor multiplication. A deep learning operation in a DNN may be performed on one or more internal parameters of the DNN (e.g., weights), which are determined during the training phase, and one or more activations. An activation may be a data point (also referred to as a “data element” or “element”). Activations or weights of a DNN layer may be elements of a tensor of the DNN layer. A tensor is a data structure having multiple elements across one or more dimensions. Example tensors include a vector, which may be a one-dimensional tensor, and a matrix, which may be a two-dimensional tensor. There can also be three-dimensional tensors and even higher dimensional tensors. A DNN layer may have an input tensor (also referred to as “input feature map (IFM)”) including one or more input activations (also referred to as “input elements”) and a weight tensor including one or more weights. A weight is an element in the weight tensor. A weight tensor may be a kernel, a filter, or a group of filters. The output data of the DNN layer may be an output tensor (also referred to as “output feature map (OFM)”) that includes one or more output activations (also referred to as “output elements”).

Data used in tensor multiplication operations may have different precisions. Tensor multiplications with higher precisions may provide better accuracy but usually require more computational resources, such as power, time, memory, and so on. Tensor multiplications with lower precisions may have lower consumption of computational resources but may lead to lower accuracy. With the increasing demand for reduced inference latency in AI applications, it can be important to enhance support for low-precision data structures in tensor multiplication operations.

Currently available approaches for improving performance using low-precision formats include employing a mixture of precision schemes based on empirical and heuristic approaches. Some approaches convert all tensor multiplication operators to BF16 (where BF stands for brain floating point) data format by default. When accuracy requirements are not met, heuristic methods are employed to roll back precision-sensitive layers to the FP32 (where FP stands for floating point) data format. Some other approaches make FP32 and BF16 data types available for every tensor multiplication operator and use a tuning algorithm for automatic search. For instance, a tuning process is used to determine whether the operator should adopt single-precision or low-precision formats. This approach helps avoid unacceptable accuracy losses resulting from mixed-precision optimizations. However, it suffers from the disadvantage that not many tensor multiplication operators can be selected to utilize low-precision formats in the final solution, which can significantly limit the utilization. Also, both heuristic and automatic tuning algorithms are limited to operating at the level of distinct DNN layers. In the case of precision-sensitive models (e.g., recommendation systems, etc.), certain layers would need to be rolled back to the FP32 format, which leads to suboptimal utilization of the acceleration capabilities and reduced harnessing of computational power.

Embodiments of the present disclosure may improve on at least some of the challenges and issues described above by facilitating multi-precision tensor multiplication in DNNs. Optimization of multi-precision tensor multiplication operations can be automated to achieve a desired accuracy of the DNN with minimized consumption of computational resources (e.g., time, power, memory, etc.). Tuning can be done not only between different tensor multiplication operators but also within a single tensor multiplication operator, so that the acceleration capability can be maximized.

In various embodiments of the present disclosure, various precision levels may be selected for various layers in a DNN. For instance, the precision level of a DNN layer may be selected from a plurality of predetermined precision levels that are supported by tensor multiplication operators configured to execute the tensor multiplication operations in the DNN. The tensor multiplication operation in a layer may be performed on an activation tensor and a weight tensor. An activation in the activation tensor may be converted to lower-precision activations. A weight in the weight tensor may also be converted to lower-precision weights. An algorithm determined based on the precision level selected for the layer may be used to compute an approximate product of an activation and weight using the lower-precision activations and lower-precision weights. A different precision level may be selected for another layer in the DNN. The tensor multiplication operator may perform in a different operation mode and use a different algorithm determined based on the different precision level to compute approximate products of activations and weights.
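The conversion and approximate-product steps described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that an FP32 value is split into up to three BF16 values by truncation and that a hypothetical precision level k keeps the cross terms whose indices sum to at most k. The helper names (`to_bf16`, `split_bf16`, `approx_product`) and the level numbering are assumptions of this sketch, not the algorithm of the disclosure.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 by keeping the top 16 bits of its encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split_bf16(x: float, parts: int = 3) -> list:
    """Convert one FP32 value into `parts` BF16 values whose sum approximates it."""
    pieces, residual = [], to_fp32(x)
    for _ in range(parts):
        piece = to_bf16(residual)
        pieces.append(piece)
        residual -= piece  # what the pieces so far fail to represent
    return pieces

def approx_product(a: float, w: float, level: int) -> float:
    """Assumed scheme: level k keeps the cross terms a_i * w_j with i + j <= k."""
    a_parts, w_parts = split_bf16(a), split_bf16(w)
    return sum(a_parts[i] * w_parts[j]
               for i in range(3) for j in range(3) if i + j <= level)

a, w = 1.2345678, -0.8765432
exact = to_fp32(a) * to_fp32(w)
# Absolute error of the approximate product at each hypothetical level.
errors = [abs(exact - approx_product(a, w, level)) for level in range(5)]
```

In this sketch, level 0 uses a single low-precision multiplication, and each higher level adds more low-precision multiplications in exchange for a smaller error, which mirrors the accuracy and cost trade-off between precision levels.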

A graph may be used to determine optimal precision settings of a DNN. The graph may include nodes representing layers in the DNN and one or more edges between the nodes. An edge may indicate a relationship between layers, e.g., data flow between two layers. The nodes may encode precision levels of the layers. The initial states of the nodes may be set to a random precision level. The states of the nodes may be iteratively updated to find optimal precision levels. The optimal precision levels may be the precision levels of the layers that can meet a requirement on the accuracy of the DNN and a requirement on the consumption of computational resources. The requirement on the consumption of computational resources may be maximized inference speed, for example. Tensor multiplication operators may be configured based on the precision levels selected for the layers. While executing a layer, the tensor multiplication operators may operate in the operation mode that matches the selected precision level of the layer.
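The iterative update of node states can be sketched as a simple local search. Everything in this sketch is a hypothetical stand-in: the three precision levels, the toy `accuracy` and `cost` models, the per-layer `sensitivity` values, and the `tune` function are assumptions chosen only to make the example runnable, and the search is a plain coordinate-wise update rather than the tuning procedure of the disclosure.

```python
import random

LEVELS = (0, 1, 2)  # hypothetical levels: 0 = lowest precision (fastest), 2 = highest

def accuracy(state, sensitivity):
    """Toy accuracy model: each step below the top level costs the layer's sensitivity."""
    return 1.0 - sum(sensitivity[n] * (LEVELS[-1] - lvl) for n, lvl in state.items())

def cost(state):
    """Toy cost model: a higher precision level takes longer to execute."""
    return sum(1 + lvl for lvl in state.values())

def tune(graph, sensitivity, min_accuracy, seed=0, max_sweeps=20):
    rng = random.Random(seed)
    # Random initial precision level for every node (layer).
    state = {n: rng.choice(LEVELS) for n in graph["nodes"]}
    for _ in range(max_sweeps):
        changed = False
        for n in graph["nodes"]:
            # Levels for this node that keep the whole network accurate enough.
            feasible = [lvl for lvl in LEVELS
                        if accuracy({**state, n: lvl}, sensitivity) >= min_accuracy]
            # Cheapest feasible level, or the highest level to recover accuracy.
            new_level = min(feasible) if feasible else LEVELS[-1]
            changed = changed or new_level != state[n]
            state[n] = new_level
        if not changed:
            break
    return state

# Edges record data flow between layers; the toy models above do not use them.
graph = {"nodes": ["conv1", "conv2", "fc"],
         "edges": [("conv1", "conv2"), ("conv2", "fc")]}
sensitivity = {"conv1": 0.001, "conv2": 0.02, "fc": 0.004}
levels = tune(graph, sensitivity, min_accuracy=0.985)
```

With these assumed numbers, the precision-sensitive layer `conv2` stays at the highest level while the other two layers drop to the lowest one. A local search of this kind is not guaranteed to find a globally optimal assignment in general.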

The present disclosure provides an approach to accelerate tensor multiplication operations in DNNs. Various precision levels may be used for various tensor multiplication operations in a DNN to maximize inference speed but still achieve the desired accuracy of the DNN. Many data formats can be supported, such as FP32, FP16, BF16, FP8, INT8, and so on.

For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.

Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.

Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.

For the purposes of the present disclosure, the phrase “A or B” or the phrase “A and/or B” means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase “A, B, or C” or the phrase “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term “between,” when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.

The description uses the phrases “in an embodiment” or “in embodiments,” which may each refer to one or more of the same or different embodiments. The terms “comprising,” “including,” “having,” and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as “above,” “below,” “top,” “bottom,” and “side” to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives “first,” “second,” and “third,” etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.

In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

The terms “substantially,” “close,” “approximately,” “near,” and “about,” generally refer to being within +/-20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., “coplanar,” “perpendicular,” “orthogonal,” “parallel,” or any other angle between the elements, generally refer to being within +/-5-20% of a target value as described herein or as known in the art.

In addition, the terms “comprise,” “comprising,” “include,” “including,” “have,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerators. Also, the term “or” refers to an inclusive “or” and not to an exclusive “or.”

The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.

Example DNN

FIG. 1 illustrates an example DNN 100, in accordance with various embodiments. For the purpose of illustration, the DNN 100 in FIG. 1 is a CNN. In other embodiments, the DNN 100 may be other types of DNNs. The DNN 100 is trained to receive images and output classifications of objects in the images. In the embodiments of FIG. 1, the DNN 100 receives an input image 105 that includes objects 115, 125, and 135. The DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”), and a plurality of fully-connected layers 130 (individually referred to as “fully-connected layer 130”). In other embodiments, the DNN 100 may include fewer, more, or different layers. In an inference of the DNN 100, the layers of the DNN 100 execute tensor computation that includes many tensor operations, such as convolution (e.g., multiply-accumulate (MAC) operations, etc.), pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc.), other types of tensor operations, or some combination thereof.

The convolutional layers 110 summarize the presence of features in the input image 105. The convolutional layers 110 function as feature extractors. The first layer of the DNN 100 is a convolutional layer 110. In an example, a convolutional layer 110 performs a convolution on an input tensor 140 (also referred to as IFM 140) and a filter 150. As shown in FIG. 1, the IFM 140 is represented by a 7×7×3 three-dimensional (3D) matrix. The IFM 140 includes 3 input channels, each of which is represented by a 7×7 two-dimensional (2D) matrix. The 7×7 2D matrix includes 7 input elements (also referred to as input points) in each row and 7 input elements in each column. The filter 150 is represented by a 3×3×3 3D matrix. The filter 150 includes 3 kernels, each of which may correspond to a different input channel of the IFM 140. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. A kernel can be smaller than the IFM. In the embodiments of FIG. 1, each kernel is represented by a 3×3 2D matrix. The 3×3 kernel includes 3 weights in each row and 3 weights in each column. Weights can be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter 150 in extracting features from the IFM 140.

The convolution includes MAC operations with the input elements in the IFM 140 and the weights in the filter 150. The convolution may be a standard convolution 163 or a depthwise convolution 183. In the standard convolution 163, the whole filter 150 slides across the IFM 140. All the input channels are combined to produce an output tensor 160 (also referred to as output feature map (OFM) 160). The OFM 160 is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements (also referred to as output points) in each row and 5 output elements in each column. For the purpose of illustration, the standard convolution includes one filter in the embodiments of FIG. 1. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 160.

The multiplication applied between a kernel-sized patch of the IFM 140 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 140 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product.” Using a kernel smaller than the IFM 140 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 140 multiple times at different points on the IFM 140. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 140, left to right, top to bottom. The result from multiplying the kernel with the IFM 140 one time is a single value. As the kernel is applied multiple times to the IFM 140, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix (i.e., the OFM 160) from the standard convolution 163 is referred to as an OFM.
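The sliding dot product described above can be sketched in a few lines. The sketch assumes a stride of one and no padding, stores tensors as nested lists indexed `[channel][row][column]`, and uses made-up values; the function name `standard_conv` is an assumption of this example.

```python
def standard_conv(ifm, filt):
    """Standard convolution of one filter over an IFM (stride 1, no padding).

    ifm[c][y][x] holds input elements and filt[c][ky][kx] holds weights.
    All input channels are combined into a single 2D output matrix.
    """
    channels, height, width = len(ifm), len(ifm[0]), len(ifm[0][0])
    k_height, k_width = len(filt[0]), len(filt[0][0])
    ofm = []
    for y in range(height - k_height + 1):        # top to bottom
        row = []
        for x in range(width - k_width + 1):      # left to right
            acc = 0
            for c in range(channels):
                for ky in range(k_height):
                    for kx in range(k_width):
                        # Multiply-accumulate over the kernel-sized patch.
                        acc += ifm[c][y + ky][x + kx] * filt[c][ky][kx]
            row.append(acc)                       # one dot product, one output element
        ofm.append(row)
    return ofm

# 7x7x3 IFM with illustrative values and a 3x3x3 filter of ones.
ifm = [[[c + y + x for x in range(7)] for y in range(7)] for c in range(3)]
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]
ofm = standard_conv(ifm, filt)
```

For the 7×7×3 IFM and 3×3×3 filter sizes used in FIG. 1, the result is a 5×5 matrix, matching the size of the OFM 160.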

In the depthwise convolution 183, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 1, the depthwise convolution 183 produces a depthwise output tensor 180. The depthwise output tensor 180 is represented by a 5×5×3 3D matrix. The depthwise output tensor 180 includes 3 output channels, each of which is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements in each row and 5 output elements in each column. Each output channel is a result of MAC operations of an input channel of the IFM 140 and a kernel of the filter 150. For instance, the first output channel (patterned with dots) is a result of MAC operations of the first input channel (patterned with dots) and the first kernel (patterned with dots), the second output channel (patterned with horizontal stripes) is a result of MAC operations of the second input channel (patterned with horizontal stripes) and the second kernel (patterned with horizontal stripes), and the third output channel (patterned with diagonal stripes) is a result of MAC operations of the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes). In such a depthwise convolution, the number of input channels equals the number of output channels, and each output channel corresponds to a different input channel. The input channels and output channels are referred to collectively as depthwise channels. After the depthwise convolution, a pointwise convolution 193 is then performed on the depthwise output tensor 180 and a 1×1×3 tensor 190 to produce the OFM 160.
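A minimal sketch of the depthwise convolution followed by the pointwise convolution is given below, assuming a stride of one, no padding, nested-list tensors indexed `[channel][row][column]`, and made-up values. The function names are assumptions of this example.

```python
def conv2d(channel, kernel):
    """Convolve one 2D input channel with one 2D kernel (stride 1, no padding)."""
    k_height, k_width = len(kernel), len(kernel[0])
    return [[sum(channel[y + ky][x + kx] * kernel[ky][kx]
                 for ky in range(k_height) for kx in range(k_width))
             for x in range(len(channel[0]) - k_width + 1)]
            for y in range(len(channel) - k_height + 1)]

def depthwise_conv(ifm, filt):
    """Each input channel is paired with its own kernel; channels are not combined."""
    return [conv2d(channel, kernel) for channel, kernel in zip(ifm, filt)]

def pointwise_conv(tensor, weights):
    """1x1 convolution: weighted sum across channels at every (x, y) position."""
    height, width = len(tensor[0]), len(tensor[0][0])
    return [[sum(w * channel[y][x] for w, channel in zip(weights, tensor))
             for x in range(width)]
            for y in range(height)]

ifm = [[[c + y + x for x in range(7)] for y in range(7)] for c in range(3)]
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]

depthwise_out = depthwise_conv(ifm, filt)          # 3 channels of 5x5
ofm = pointwise_conv(depthwise_out, [1, 1, 1])     # one 5x5 matrix
```

With the pointwise weights all set to one, as assumed here, the result equals the sum over channels that a standard convolution with the same filter would produce. Other pointwise weights would mix the depthwise channels differently.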

The OFM 160 is then passed to the next layer in the sequence. In some embodiments, the OFM 160 is passed through an activation function. An example activation function is rectified linear unit (ReLU). ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. The convolutional layer 110 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 160 is passed to the subsequent convolutional layer 110 (i.e., the convolutional layer 110 following the convolutional layer 110 generating the OFM 160 in the sequence). The subsequent convolutional layer 110 performs a convolution on the OFM 160 with new kernels and generates a new feature map. The new feature map may also be normalized and resized. The new feature map can be kernelled again by a further subsequent convolutional layer 110, and so on.

In some embodiments, a convolutional layer 110 has four hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel is of dimensions F×F×D pixels), the step S with which the window corresponding to the kernel is dragged on the image (e.g., a step of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour of P pixels thickness to the input image of the convolutional layer 110). The convolutional layers 110 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on. The DNN 100 includes 16 convolutional layers 110. In other embodiments, the DNN 100 may include a different number of convolutional layers.
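The effect of the kernel size F, the step S, and the zero-padding P on the output size follows the usual convolution arithmetic, which can be written as a short helper. The function name and argument names are assumptions of this sketch; the formula itself is the standard one for a convolution without dilation.

```python
def conv_output_size(n: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output length along one spatial dimension.

    n: input length, f: kernel size F, s: step S, p: zero-padding P per side.
    """
    return (n + 2 * p - f) // s + 1

# FIG. 1 sizes: a 7-wide input and a 3-wide kernel, step 1, no padding.
fig1_size = conv_output_size(7, 3)
# A padding of one pixel with the same kernel keeps the input size.
same_size = conv_output_size(7, 3, s=1, p=1)
# A step of two roughly halves the output.
strided_size = conv_output_size(7, 3, s=2, p=1)
```

The first call reproduces the 5×5 OFM obtained from the 7×7 input channels and 3×3 kernels of FIG. 1.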

The pooling layers 120 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 120 is placed between two convolutional layers 110: a preceding convolutional layer 110 (the convolutional layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolutional layer 110 subsequent to the pooling layer 120 in the sequence of layers). In some embodiments, a pooling layer 120 is added after a convolutional layer 110, e.g., after an activation function (e.g., ReLU, etc.) has been applied to the OFM 160.

A pooling layer 120 receives feature maps generated by the preceding convolutional layer 110 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the DNN and helps avoid over-learning. The pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of two pixels, so that the pooling operation reduces each spatial dimension of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter of the original. In an example, a pooling layer 120 applied to a feature map of 6×6 results in an output pooled feature map of 3×3. The output of the pooling layer 120 is inputted into the subsequent convolutional layer 110 for further feature extraction. In some embodiments, the pooling layer 120 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
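The 2×2 pooling with a stride of two can be sketched as follows for a single feature map stored as a nested list. The function name `pool2d`, its arguments, and the feature-map values are assumptions of this example.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Apply max or average pooling to one 2D feature map."""
    reduce_patch = max if mode == "max" else (lambda values: sum(values) / len(values))
    pooled = []
    for y in range(0, len(fmap) - size + 1, stride):
        row = []
        for x in range(0, len(fmap[0]) - size + 1, stride):
            # Collect the size x size patch and summarize it as one value.
            patch = [fmap[y + dy][x + dx] for dy in range(size) for dx in range(size)]
            row.append(reduce_patch(patch))
        pooled.append(row)
    return pooled

# 6x6 feature map with illustrative values 0..35.
fmap = [[6 * y + x for x in range(6)] for y in range(6)]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
```

Both results are 3×3, so the 36 values of the 6×6 feature map are reduced to 9, one quarter of the original count.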

The fully-connected layers 130 are the last layers of the DNN. The fully-connected layers 130 may be convolutional or not. The fully-connected layers 130 may also be referred to as linear layers. In some embodiments, a fully-connected layer 130 (e.g., the first fully-connected layer in the DNN 100) may receive an input operand. The input operand may define the output of the convolutional layers 110 and pooling layers 120 and includes the values of the last feature map generated by the last pooling layer 120 in the sequence. The fully-connected layer 130 may apply a linear transformation to the input operand through a weight matrix. The weight matrix may be a kernel of the fully-connected layer 130. The linear transformation may include a tensor multiplication between the input operand and the weight matrix. The result of the linear transformation may be an output operand. In some embodiments, the fully-connected layer may further apply a non-linear transformation (e.g., by using a non-linear activation function) on the result of the linear transformation to generate an output operand. The output operand may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the elements sum to one. These probabilities are calculated by the last fully-connected layer 130 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function.

In some embodiments, the fully-connected layers 130 classify the input image 105 and return an operand of size N, where N is the number of classes in the image classification problem. In the embodiments of FIG. 1, N equals 3, as there are 3 objects 115, 125, and 135 in the input image. Each element of the operand indicates the probability for the input image 105 to belong to a class. To calculate the probabilities, the fully-connected layers 130 multiply each input element by a weight, sum the products, and then apply an activation function (e.g., logistic if N=2, SoftMax if N>2). This is equivalent to multiplying the input operand by the matrix containing the weights. In an example, the vector includes 3 probabilities: a first probability indicating the object 115 being a tree, a second probability indicating the object 125 being a car, and a third probability indicating the object 135 being a person. In other embodiments where the input image 105 includes different objects or a different number of objects, the individual values can be different.
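The weighted sum followed by a SoftMax activation can be sketched as below for N = 3 classes. The input operand, the weight matrix, and the function names are made-up values and names chosen for illustration only.

```python
import math

def fully_connected(input_operand, weight_matrix):
    """Linear transformation: one weighted sum of the inputs per output element."""
    return [sum(w * x for w, x in zip(row, input_operand)) for row in weight_matrix]

def softmax(values):
    """Map raw scores to probabilities that lie between 0 and 1 and sum to one."""
    peak = max(values)                      # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

input_operand = [1.0, 2.0, 0.5, -1.0]
weight_matrix = [                           # one row of weights per class (N = 3)
    [0.2, -0.1, 0.4, 0.0],
    [0.5, 0.3, -0.2, 0.1],
    [-0.3, 0.2, 0.1, 0.6],
]
scores = fully_connected(input_operand, weight_matrix)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The list comprehension in `fully_connected` is the matrix-vector product mentioned above, written out element by element.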

Example Convolution

FIG. 2 illustrates an example convolution, in accordance with various embodiments. The convolution may be a deep learning operation in a convolutional layer of a DNN, e.g., a convolutional layer 110 in FIG. 1. The convolution can be executed on an input tensor 210 and filters 220 (individually referred to as “filter 220”). The result of the convolution is an output tensor 230. In some embodiments, the convolution is performed by a DNN accelerator. An example of the DNN accelerator may be the DNN accelerator 302 in FIG. 3.

In the embodiments of FIG. 2, the input tensor 210 includes activations (also referred to as “input activations,” “elements,” or “input elements”) arranged in a 3D matrix. An input element is a data point in the input tensor 210. The input tensor 210 has a spatial size H_in×W_in×C_in, where H_in is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 2D matrix of each input channel), W_in is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 2D matrix of each input channel), and C_in is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels). For the purpose of simplicity and illustration, the input tensor 210 has a spatial size of 7×7×3, i.e., the input tensor 210 includes three input channels and each input channel has a 7×7 2D matrix. Each input element in the input tensor 210 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 210 may be different.

Each filter 220 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN. A filter 220 has a spatial size H_f×W_f×C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, C_f equals C_in. For purpose of simplicity and illustration, each filter 220 in FIG. 2 has a spatial size of 2×3×3, i.e., the filter 220 includes 2 convolutional kernels with a spatial size of 2×3. In other embodiments, the height, width, or depth of the filter 220 may be different. The spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 210.

An activation or weight may take one or more bytes in a memory. The number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has an INT8 format, the activation or weight takes one byte. When the activation or weight has an FP16 format, the activation or weight takes two bytes. Other data formats may be used for activations or weights.
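The dependence of memory footprint on data format can be made concrete with a small lookup. The table below lists the standard element widths of the formats named in this disclosure; the names `BYTES_PER_ELEMENT` and `tensor_bytes` are assumptions of this sketch.

```python
from math import prod

# Storage width of one element in each data format.
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1, "INT8": 1}

def tensor_bytes(shape, data_format):
    """Bytes needed to store a dense tensor of the given shape and format."""
    return prod(shape) * BYTES_PER_ELEMENT[data_format]

# The 7x7x3 input tensor of FIG. 2 holds 147 activations.
footprint = {fmt: tensor_bytes((7, 7, 3), fmt) for fmt in BYTES_PER_ELEMENT}
```

For the 7×7×3 input tensor, moving from FP32 to a one-byte format cuts the stored size to one quarter, which is one reason lower-precision formats reduce the data that must be read and written.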

In the convolution, each filter 220 slides across the input tensor 210 and generates a 2D matrix for an output channel in the output tensor 230. In the embodiments of FIG. 2, the 2D matrix has a spatial size of 5×5. The output tensor 230 includes activations (also referred to as “output activations,” “elements,” or “output elements”) arranged in a 3D matrix. An output activation is a data point in the output tensor 230. The output tensor 230 has a spatial size H_out×W_out×C_out, where H_out is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of output activations in a column in the 2D matrix of each output channel), W_out is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of output activations in a row in the 2D matrix of each output channel), and C_out is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of output channels). C_out may equal the number of filters 220 in the convolution. H_out and W_out may depend on the heights and widths of the input tensor 210 and each filter 220.

As a part of the convolution, MAC operations can be performed on a 2×3×3 subtensor 215 (which is highlighted with a dotted pattern in FIG. 2) in the input tensor 210 and each filter 220. The result of the MAC operations on the subtensor 215 and one filter 220 is an output activation. In some embodiments (e.g., embodiments where the convolution is an integral convolution) , an output activation may include 8 bits, e.g., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution) , an output activation may include more than one byte. For instance, an output element may include two bytes.

After the MAC operations on the subtensor 215 and all the filters 220 are finished, a vector 235 is produced. The vector 235 is highlighted with slashes in FIG. 2. The vector 235 includes a sequence of output activations, which are arranged along the Z axis. The output activations in the vector 235 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates. The dimension of the vector 235 along the Z axis may equal the total number of output channels in the output tensor 230. After the vector 235 is produced, further MAC operations are performed to produce additional vectors till the output tensor 230 is produced.

In some embodiments, the MAC operations on a 2×3×3 subtensor (e.g., the subtensor 215) and a filter 220 may be performed by a plurality of PEs. One or more PEs may receive an input operand (e.g., an input operand 217 shown in FIG. 2) and a weight operand (e.g., the weight operand 227 shown in FIG. 2). The input operand 217 includes a sequence of activations having the same (X, Y) coordinate but different Z coordinates. The input operand 217 includes an activation from each of the input channels in the input tensor 210. The weight operand 227 includes a sequence of weights having the same (X, Y) coordinate but different Z coordinates. The weight operand 227 includes a weight from each of the channels in the filter 220. Activations in the input operand 217 and weights in the weight operand 227 may be sequentially fed into a PE. The PE may receive an activation and a weight (“an activation-weight pair”) at a time and multiply the activation and the weight. The position of the activation in the input operand 217 may match the position of the weight in the weight operand 227. The activation and weight may correspond to the same channel.
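The pairing of activations and weights by channel position can be sketched as a simple multiply-accumulate loop. This is an illustrative software model only, not the PE hardware; the function name and the sample values are hypothetical.

```python
def mac(input_operand, weight_operand):
    """Multiply-accumulate over activation-weight pairs at matching positions."""
    if len(input_operand) != len(weight_operand):
        raise ValueError("operands must cover the same channels")
    acc = 0.0
    # Pairs are consumed one at a time, as if fed sequentially into a PE.
    for activation, weight in zip(input_operand, weight_operand):
        acc += activation * weight
    return acc


# Hypothetical operands spanning three channels at one (X, Y) position.
partial_sum = mac([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
```

Summing such partial results over every (X, Y) position covered by the filter would give one output activation.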

Activations or weights may be floating-point numbers. Floating-point numbers may have various data formats, such as FP32, FP16, BF16, and so on. A floating-point number may be a positive or negative number with a decimal point. A floating-point number may be represented by a sequence of bits that includes one or more bits representing the sign of the floating-point number (e.g., positive or negative) , bits representing an exponent of the floating-point number, and bits representing a mantissa of the floating-point number. The mantissa is the part of a floating-point number that represents the significant digits of that number. The mantissa is multiplied by the base raised to the exponent to give the actual value of the floating-point number.
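As an illustration of the sign, exponent, and mantissa fields, the sketch below unpacks an FP32 value. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 mantissa bits), which this description does not spell out; the function names are hypothetical.

```python
import struct


def fp32_fields(value):
    """Return (sign, biased exponent, mantissa) of value stored as FP32."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def fp32_from_fields(sign, exponent, mantissa):
    """Rebuild a normal (non-zero, non-subnormal) FP32 value from its fields."""
    significand = 1 + mantissa / 2**23  # implicit leading 1
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)


fields = fp32_fields(-6.5)  # -6.5 = -1.625 * 2**2
```

BF16 keeps the same sign and exponent widths as FP32 but only the top 7 mantissa bits, which is why a BF16 value can be obtained from an FP32 value by dropping the low 16 bits.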

In some embodiments, the output activations in the output tensor 230 may be further processed based on one or more activation functions before they are stored or inputted into the next layer of the DNN. The processing based on the one or more activation functions may be at least part of the post processing of the convolution. In some embodiments, the post processing may include one or more other computations, such as offset computation, bias computation, and so on. The results of the post processing may be stored in a local memory of the compute block and be used as input to the next DNN layer. In some embodiments, the input activations in the input tensor 210 may be results of post processing of the previous DNN layer.

Example DNN System

[Rectified under Rule 91, 20.11.2023]
FIG. 3 is a block diagram of a DNN system 300, in accordance with various embodiments. The whole DNN system 300 or a part of the DNN system 300 may be implemented in one or more computing devices, such as the computing device 1100 in FIG. 11. The DNN system 300 can generate and execute DNNs, such as the DNN 100 in FIG. 1. As shown in FIG. 3, the DNN system 300 includes a DNN module 301 and a DNN accelerator 302. In other embodiments, alternative configurations, different or additional components may be included in the DNN system 300. For instance, the DNN system 300 may include multiple DNN modules or multiple DNN accelerators. Further, functionality attributed to a component of the DNN system 300 may be accomplished by a different component included in the DNN system 300 or a different system. In some embodiments, the DNN module 301 and DNN accelerator 302 may include different types of processing units. The DNN module 301 and DNN accelerator 302 may be implemented in the same chip or separate chips.

The DNN module 301 facilitates generation and deployment of DNNs. In some embodiments, the DNN module 301 may generate and train DNNs. For instance, the DNN module 301 can define the layered architecture of a DNN. The DNN module 301 can also determine the internal parameters of the DNN through a DNN training process. The DNN module 301 may also determine one or more hyperparameters that define how the DNN is trained. An example hyperparameter is a sparsity ratio that defines the sparsity level of one or more deep learning tensors for the DNN.

After DNNs are trained, the DNN module 301 may compress DNNs to reduce the amount of computation resources (e.g., power, processing element, time, memory, etc.) that would be needed for executing the DNNs. For instance, the DNN module 301 may determine precision levels of tensor multiplication operations in the DNN layers. One or more layers may be assigned to lower-precision levels so that the time needed to execute these layers can be reduced. Other layers may have higher precision levels to make sure the accuracy of the DNN meets a desired accuracy. The DNN module 301 may provide configuration parameters to configure the compute blocks 330 (e.g., the multi-precision tensor multiplication operators 350) to operate in modes that match the selected precision levels.

The DNN module 301 may deploy trained, compressed, or validated DNNs for use in deep learning applications. The DNN module 301 may control inference processes of trained, compressed, or validated DNNs. In some embodiments, the DNN module 301 may distribute trained, compressed, or validated DNNs to devices or systems which may use the DNNs to perform tasks (e.g., image classification, motion planning, etc. ) for which the DNNs were trained. In other embodiments, the DNN module 301 may facilitate deployment of the DNNs using the DNN accelerator 302. For instance, the DNN module 301 may receive data from a device or system coupled with the DNN system 300 and input the received data (or data generated by the DNN module 301, e.g., based on the received data) into a DNN. The DNN module 301 may generate instructions (e.g., configuration files) that control the operation of the DNN accelerator 302 during the DNN inference. The DNN module 301 may receive an output of the DNN from the DNN accelerator 302. The DNN module 301 may transmit the output of the DNN (or a result of processing the output of the DNN by the DNN module 301) to the device or system. Certain aspects of the DNN module 301 are provided below in conjunction with FIG. 4.

The DNN accelerator 302 executes DNNs provided by the DNN module 301. For instance, the DNN accelerator 302 can perform DNN inference, e.g., by running deep learning operations in the DNNs, for training DNNs or for using the trained/compressed/validated DNNs to perform tasks. As shown in FIG. 3, the DNN accelerator 302 includes a memory 310, a DMA (direct memory access) engine 320, and compute blocks 330 (individually referred to as “compute block 330” ) . In other embodiments, alternative configurations, different or additional components may be included in the DNN accelerator 302. For example, the DNN accelerator 302 may include more than one memory 310 or DMA engine 320. As another example, the DNN accelerator 302 may include a single compute block 330. Further, functionality attributed to a component of the DNN accelerator 302 may be accomplished by a different component included in the DNN accelerator 302 or by a different system. A component of the DNN accelerator 302 may be implemented in hardware, software, firmware, or some combination thereof.

The memory 310 stores data associated with deep learning operations performed by the DNN accelerator. In some embodiments, the memory 310 may store data to be used by the compute blocks 330 for DNN inference. For example, the memory 310 may store weights, such as weights of convolutional layers, which are determined by training DNNs. As another example, the memory 310 may store inputs to DNNs or outputs of DNNs. The memory 310 may also store data generated by the compute blocks 330 from performing deep learning operations in DNNs. Example deep learning operations include convolutions (also referred to as “convolutional operations” ) , pooling operations, elementwise operations, activation functions, other types of deep learning operations, or some combination thereof. The memory 310 may be a main memory of the DNN accelerator 302. In some embodiments, the memory 310 includes one or more dynamic random-access memories (DRAMs) .

The DMA engine 320 facilitates data transfer between the memory 310 and local memories of the compute blocks 330. For example, the DMA engine 320 can read data from the memory 310 and write data into a local memory of a compute block 330. As another example, the DMA engine 320 can read data from a local memory of a compute block 330 and write data into the memory 310. The DMA engine 320 provides a DMA feature that allows the compute block 330 to initiate data transfer between the memory 310 and the local memories of the compute blocks 330 and to perform other operations while the data transfer is being conducted. In some embodiments, the DMA engine 320 may read tensors from the memory 310 and modify the tensors in a way that is optimized for the compute block 330 before it writes the tensors into the local memories of the compute blocks 330.

The compute blocks 330 can perform deep learning operations in DNNs. For instance, a compute block 330 may run a deep learning operation in a DNN layer, or a portion of the deep learning operation, at a time. The compute blocks 330 may be capable of running various types of deep learning operations, such as convolution, pooling, elementwise operation, linear operation, nonlinear operation, and so on. In an example, a compute block 330 may perform convolutions, e.g., standard convolution or depthwise convolution. In some embodiments, the compute block 330 receives an input tensor and one or more convolutional kernels and performs a convolution with the input tensor and convolutional kernels. The result of the convolution may be an output tensor, which can be further computed, e.g., by the compute block 330 or another compute block 330. In some embodiments, the operations of the DNN layers may be run by multiple compute blocks 330 in parallel. For instance, multiple compute blocks 330 may each perform a portion of a workload for a convolution. Data may be shared between the compute blocks 330. A compute block 330 may also be referred to as a compute tile. In some embodiments, each compute block 330 may be a processing unit.

In the embodiments of FIG. 3, each compute block 330 includes a local memory 340 and a multi-precision tensor multiplication operator 350. Some or all the components of the compute block 330 can be implemented on the same chip. In other embodiments, alternative configurations, different or additional components may be included in the compute block 330. Further, functionality attributed to a component of the compute block 330 may be accomplished by a different component included in the compute block 330, a different compute block 330, another component of the DNN accelerator 302, or a different system. A component of the compute block 330 may be implemented in hardware, software, firmware, or some combination thereof.

The local memory 340 is local to the corresponding compute block 330. In the embodiments of FIG. 3, the local memory 340 is inside the compute block 330. In other embodiments, the local memory 340 may be outside the compute block 330. Data in the local memory 340 may be transferred to or from the memory 310, e.g., through the DMA engine 320. In some embodiments, data in the local memory 340 may be transferred to or from the local memory of another compute block 330. The local memory 340 may store data received, used, or generated by the multi-precision tensor multiplication operator 350. Examples of the data may include input activations, weights, output activations, sparsity bitmaps, and so on.

The local memory 340 may include different levels of memories. For instance, the local memory 340 may include registers, L1 cache memory, and so on. Different memories in the local memory 340 may be arranged at different locations in the compute block 330. In some embodiments, the local memory 340 includes one or more static random-access memories (SRAMs). The local memory 340 may be byte-addressable, and each memory address identifies a single byte (eight bits) of storage. In some embodiments, the local memory 340 may include memory banks. The number of memory banks in the local memory 340 may be 16, 64, 128, 256, 512, 1024, 2048, or other numbers. A memory bank may include a plurality of storage units. In an example, a memory bank may include 8, 16, 64, or a different number of storage units. A memory bank or a storage unit in a memory bank may have a memory address. In an example, a storage unit may store a single byte, and data larger than a single byte may be stored in storage units with consecutive memory addresses, i.e., adjacent storage units.

Data elements of different precisions may require different amounts of memory. For instance, a storage unit of 8 bits can store a data element in the FP8 format. Two storage units may be needed to store a data element in the FP16 or BF16 format, which has 16 bits. Four storage units may be needed to store a data element in the FP32 format, which has 32 bits. Data elements of different precisions may also require different memory bandwidth. In some embodiments, a single read cycle may read a predetermined number of bits from the local memory 340. In an example where 16 bits may be transferred from the local memory 340 in a single read cycle, a read cycle can transfer two data elements in the FP8 format but a single data element in the FP16 or BF16 format, while the transfer of a single data element in the FP32 format would require two read cycles. Therefore, reducing data precision may reduce memory storage space and memory bandwidth.
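The storage and bandwidth arithmetic can be checked with a short calculation. This is a minimal sketch assuming 8-bit storage units and a 16-bit read cycle as in the example; the table and function names are hypothetical.

```python
# Bit widths of the formats named in this description.
FORMAT_BITS = {"FP8": 8, "FP16": 16, "BF16": 16, "FP32": 32}


def storage_units(fmt, unit_bits=8):
    """Storage units needed for one data element (ceiling division)."""
    return -(-FORMAT_BITS[fmt] // unit_bits)


def read_cycles(fmt, count, bits_per_cycle=16):
    """Read cycles needed to transfer `count` data elements."""
    total_bits = FORMAT_BITS[fmt] * count
    return -(-total_bits // bits_per_cycle)


units = {fmt: storage_units(fmt) for fmt in FORMAT_BITS}
```

Under these assumptions an FP32 element occupies four storage units and needs two read cycles, while two FP8 elements fit in a single read cycle.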

The multi-precision tensor multiplication operator 350 may receive data from the local memory 340 and perform multi-precision tensor multiplication operation on the data. As shown in FIG. 3, the multi-precision tensor multiplication operator 350 includes a precision reduction operator 360 and a multiplication operator 370. In other embodiments, the multi-precision tensor multiplication operator 350 may include more than one precision reduction operator 360 or multiplication operator 370.

The precision reduction operator 360 may reduce precision of data elements, such as data elements stored in the memory 310 or the local memory 340. The data elements may include activations and weights. In some embodiments, the precision reduction operator 360 may convert a data element of a higher precision to one or more data elements of a lower precision. For instance, the precision reduction operator 360 may remove one or more bits in a data element to reduce the precision of the data element. In some embodiments (e.g., embodiments where the precision reduction operator 360 converts one data element to multiple data elements of a lower precision) , the precision reduction operator 360 may generate a first lower-precision data element by removing a certain number of bits from the original data element. The number of removed bits may be determined based on the difference between the two precisions. The precision reduction operator 360 may generate a second lower-precision data element based on the first lower-precision data element.

In an example where the precision reduction operator 360 converts data elements in the FP32 format to data elements in the BF16 format, the precision reduction operator 360 may reduce precisions of weights using a conversion function denoted as:

W0_BF16 = (BF16) W_FP32

W1_BF16 = (BF16) (W_FP32 - (FP32) W0_BF16)

where W_FP32 is a weight in the FP32 format, which may be in a weight tensor that is determined from training the DNN; W0_BF16 is a lower-precision weight in the BF16 format; W1_BF16 is another lower-precision weight in the BF16 format; (BF16) W_FP32 denotes an operation that removes 16 bits from W_FP32 to convert W_FP32 to W0_BF16; and (FP32) W0_BF16 denotes an operation that adds 16 bits (e.g., 16 zero bits) to W0_BF16. W0_BF16 and W1_BF16 may be used, in lieu of W_FP32, in a tensor multiplication operation. In some embodiments, reduction of weight precision may be performed offline, e.g., before DNN inference. Offline reduction of weight precision may be performed by the DNN module 301. For instance, after the multi-precision tuning module 450 selects a precision level for a DNN layer, the multi-precision tuning module 450 may reduce the precision of weights in the DNN layer based on the selected precision level.

Similarly, the precision reduction operator 360 may reduce precisions of activations using a conversion function denoted as:

A0_BF16 = (BF16) A_FP32

A1_BF16 = (BF16) (A_FP32 - (FP32) A0_BF16)

where A_FP32 is an activation in the FP32 format, which may be in an activation tensor that is input into the DNN or generated by a previous layer in the DNN; A0_BF16 is a lower-precision activation in the BF16 format; A1_BF16 is another lower-precision activation in the BF16 format; (BF16) A_FP32 denotes an operation that removes 16 bits from A_FP32 to convert A_FP32 to A0_BF16; and (FP32) A0_BF16 denotes an operation that adds 16 bits (e.g., 16 zero bits) to A0_BF16. A0_BF16 and A1_BF16 may be used, in lieu of A_FP32, in a tensor multiplication operation. Even though FP32 and BF16 are described above, the precision reduction operator 360 may convert or generate data elements in other formats.
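The two conversion steps can be sketched in software by manipulating the FP32 bit pattern. This is a minimal illustration, assuming that the removed 16 bits are the low-order bits of the FP32 pattern (truncation rather than rounding); the helper names are hypothetical.

```python
import struct


def fp32_bits(x):
    """Bit pattern of x after storing it as FP32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_fp32(bits):
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_bf16(x):
    """(BF16) x: drop the low 16 bits of the FP32 pattern (assumed truncation)."""
    return fp32_bits(x) >> 16


def to_fp32(bf16_bits):
    """(FP32) b: append 16 zero bits to a BF16 pattern."""
    return bits_to_fp32(bf16_bits << 16)


def split_fp32(x):
    """Return (X0_BF16, X1_BF16) bit patterns for an FP32 value x."""
    x32 = bits_to_fp32(fp32_bits(x))
    x0 = to_bf16(x32)
    x1 = to_bf16(x32 - to_fp32(x0))  # BF16 of the residual
    return x0, x1


a0, a1 = split_fp32(3.14159274)
```

In this sketch the first element carries the leading bits of the original value and the second carries most of what truncation discarded, so their sum is closer to the FP32 value than the first element alone.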

The multiplication operator 370 is configurable to perform tensor multiplication operations at different precision levels. The multiplication operator 370 may have multiple operation modes that correspond to different precision levels. For instance, the multiplication operator 370 may receive a configuration parameter from the DNN module 301. The configuration parameter may configure the multiplication operator 370 to operate in the corresponding mode.

The multiplication operator 370 may receive inputs from the precision reduction operator 360 and compute outputs. The outputs of the multiplication operator 370 may be used as the products of activations and weights in subsequent deep learning operations in the DNN. In some embodiments, the multiplication operator 370 includes one or more multipliers and one or more accumulators (also referred to as “adders” ) for performing multi-precision multiplications. The multipliers and accumulators may allow the multiplication operator 370 to compute outputs using different algorithms or functions and therefore facilitate different precision levels. More details about the multi-precision tensor multiplication operator 350 are provided below in conjunction with FIG. 8.
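One way multipliers and accumulators could yield different precision levels is to sum different numbers of partial products of the lower-precision elements. The sketch below is an assumption for illustration: the mapping of levels 1, 2, and 3 to one, three, and four partial products is hypothetical, as are the function names, and BF16 conversion is modeled as truncation.

```python
import struct


def bf16(x):
    """Truncate x to BF16 precision, returned as a Python float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def split(x):
    """Return (x0, x1): the BF16 value of x and the BF16 value of the residual."""
    x0 = bf16(x)
    return x0, bf16(x - x0)


def approx_product(a, w, level):
    """Approximate a * w from BF16 pieces; higher level sums more terms."""
    a0, a1 = split(a)
    w0, w1 = split(w)
    # Partial products ordered from most to least significant.
    terms = [a0 * w0, a0 * w1, a1 * w0, a1 * w1]
    terms_per_level = {1: 1, 2: 3, 3: 4}  # hypothetical level mapping
    return sum(terms[: terms_per_level[level]])


a, w = 3.14159274, 2.71828175
errors = [abs(approx_product(a, w, lvl) - a * w) for lvl in (1, 2, 3)]
```

For these sample values the error shrinks as more partial products are accumulated, at the cost of more multiplications per activation-weight pair.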

Example DNN Module

FIG. 4 is a block diagram of a DNN module 400, in accordance with various embodiments. The DNN module 400 may be an embodiment of the DNN module 301 in FIG. 3. As shown in FIG. 4, the DNN module 400 includes an interface module 410, a training module 420, a compressing module 430, a validating module 440, a multi-precision tuning module 450, and a datastore 460. In other embodiments, alternative configurations, different or additional components may be included in the DNN module 400. Further, functionality attributed to a component of the DNN module 400 may be accomplished by a different component included in the DNN module 400 or a different module or system.

The interface module 410 facilitates communications of the DNN module 400 with other modules or systems. For example, the interface module 410 establishes communications between the DNN module 400 and an external database to receive data that can be used to train DNNs or input into DNNs to perform tasks. As another example, the interface module 410 supports the DNN module 400 to distribute DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.

The training module 420 trains DNNs by using a training dataset. The training module 420 forms the training dataset. In an embodiment where the training module 420 trains a DNN to recognize objects in images, the training dataset includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in a training image. In some embodiments, a part of the training dataset may be used to initially train the DNN, and the rest of the training dataset may be held back as a validation subset used by the validating module 440 to validate performance of a trained DNN. The portion of the training dataset not including the tuning subset and the validation subset may be used to train the DNN.

The training module 420 also determines hyperparameters for training the DNN. Hyperparameters are variables specifying the DNN training process. Hyperparameters are different from parameters inside the DNN (e.g., weights of filters) . In some embodiments, hyperparameters include variables determining the architecture of the DNN, such as number of hidden layers, etc. Hyperparameters also include variables which determine how the DNN is trained, such as batch size, number of epochs, etc. A batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backwards through the entire network. The number of epochs defines the number of times that the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN. An epoch may include one or more batches. The number of epochs may be 1, 5, 10, 50, 100, 500, 1000, or even larger.
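The relationship between batch size, epochs, and parameter updates can be worked through with a small calculation. This sketch assumes one parameter update per batch and a final, smaller batch when the dataset size is not a multiple of the batch size; the function name and sample numbers are hypothetical.

```python
import math


def parameter_updates(num_samples, batch_size, num_epochs):
    """Total parameter updates, assuming one update per batch."""
    if not 1 <= batch_size <= num_samples:
        raise ValueError("batch size must be between 1 and the dataset size")
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * num_epochs


# Hypothetical run: 1000 samples, batches of 32, 10 epochs.
updates = parameter_updates(1000, 32, 10)
```

Here each epoch contains 32 batches (31 full batches and one partial batch), so 10 epochs produce 320 updates.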

The training module 420 defines the architecture of the DNN, e.g., based on some of the hyperparameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of a DNN may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, SoftMax or logistic layers, and so on. The convolutional layers of the DNN abstract the input image to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels (e.g., red, green, blue images include 3 channels). A pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between two convolutional layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images into different categories by training.

In the process of defining the architecture of the DNN, the training module 420 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a ReLU activation function, a tangent activation function, or other types of activation functions.

After the training module 420 defines the architecture of the DNN, the training module 420 inputs a training dataset into the DNN. The training dataset includes a plurality of training samples. An example of a training sample includes an object in an image and a ground-truth label of the object. The training module 420 modifies the parameters inside the DNN ( “internal parameters of the DNN” ) to minimize the error between labels of the training objects that are generated by the DNN and the ground-truth labels of the objects. The internal parameters include weights of filters in the convolutional layers of the DNN. In some embodiments, the training module 420 uses a cost function to minimize the error.

The training module 420 may train the DNN for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the DNN. After the training module 420 finishes the predetermined number of epochs, the training module 420 may stop updating the parameters in the DNN. The DNN having the updated parameters is referred to as a trained DNN.

The compressing module 430 compresses DNNs. For instance, the compressing module 430 may add pruning operations to DNN layers to reduce computational complexity or memory usage. A pruning operation may prune weight tensors of a DNN layer by changing one or more nonzero valued weights of the layer to zeros. The modification may be done before, during, or after training. Weights may be pruned during training, during inference, or a combination of both. The compressing module 430 may determine a sparsity ratio for a DNN layer. The sparsity ratio may be a ratio of the number of zero-valued weights to the total number of weights in the layer. The compressing module 430 may perform the pruning operation till the sparsity ratio of the DNN layer meets a target sparsity ratio, such as 10%, 20%, 30%, 40%, 50%, and so on.

In some embodiments, the compressing module 430 may select one or more layers in a DNN and modify each selected layer with a pruning operation. For instance, the compressing module 430 may select computationally complex layers, such as layers with large filters. For a pruning operation of a layer or of a type of layer, the compressing module 430 may determine a weight threshold that would not cause a loss of the accuracy of the DNN to exceed an accuracy loss constraint. A pruning operation may modify weights having absolute values below the weight threshold to zeros and leave the other weights unchanged. The weight pruning can reduce memory storage as zero-valued weights may not be stored. Also, the number of operations in the layer can be reduced as computations on zero-valued weights can be skipped without impacting the output of the layer. In some embodiments, the compressing module 430 may also measure energy saving, final DNN accuracy, or layer-wise sparsity caused by pruning operations.
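Pruning toward a target sparsity ratio can be sketched as follows. This is a minimal illustration that assumes magnitude-based pruning, i.e., the weights with the smallest absolute values are the ones set to zero; the function names and sample weights are hypothetical, and the accuracy loss constraint is not modeled.

```python
def sparsity_ratio(weights):
    """Ratio of zero-valued weights to the total number of weights."""
    return sum(1 for w in weights if w == 0) / len(weights)


def prune_to_sparsity(weights, target_ratio):
    """Zero the smallest-magnitude weights until the target ratio is reached."""
    num_to_zero = round(target_ratio * len(weights))
    # Indices ordered by increasing absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_zero]:
        pruned[i] = 0.0
    return pruned


layer = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.8, -0.6, 0.01]
pruned = prune_to_sparsity(layer, 0.5)
```

The largest absolute value among the zeroed weights acts as the effective weight threshold for the layer.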

After compressing a DNN, the compressing module 430 may fine-tune the DNN, e.g., through a retraining process. The compressing module 430 may fine-tune DNNs after weights are pruned. In some embodiments, the fine-tuning process is a retraining or further training process. For instance, after weights in a DNN are pruned, the compressing module 430 may further train the DNN by inputting a training dataset into the DNN. The values of the unpruned weights in the DNN may be modified based on outputs of the DNN and ground-truth labels of the training samples in the training dataset. In some embodiments, the values of the pruned weights (i.e., zero) are not changed during the fine-tuning process. For instance, the compressing module 430 may place a mask over a pruned weight block and the mask can prevent values in the pruned weight blocks from being changed during the fine-tuning process. In other embodiments, the values of all weights, including the pruned weights, may be changed during the fine-tuning process. After one or more cycles of retraining and weight changing by the compressing module 430, the compressing module 430 may perform a new pruning process, e.g., by selecting weight blocks and pruning the selected weight blocks. In some embodiments, the weight pruning process may be repeated multiple times before the fine-tuning process is done.
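The effect of a mask during fine-tuning can be sketched as a gradient step that skips pruned positions. This is an illustrative assumption: the description does not specify the update rule, so plain gradient descent is used here, and the function name, learning rate, and values are hypothetical.

```python
def masked_update(weights, grads, mask, learning_rate):
    """One gradient-descent step that leaves masked (pruned) weights unchanged.

    mask[i] is 0 for a pruned weight and 1 for an unpruned weight.
    """
    return [w - learning_rate * g * m for w, g, m in zip(weights, grads, mask)]


weights = [0.0, 1.0]  # first weight was pruned to zero
mask = [0, 1]
updated = masked_update(weights, [2.0, 1.0], mask, learning_rate=0.5)
```

The pruned weight stays at zero regardless of its gradient, so the sparsity ratio of the layer is preserved across the step.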

In some embodiments, the number of epochs in the fine-tuning process may be different from the number of epochs in the training process in which the pre-pruning values of the weights are determined. For instance, the fine-tuning process may have fewer epochs than the training process. In an example, the number of epochs in the fine-tuning process may be relatively small, such as 2, 3, 4, 5, and so on.

The validating module 440 verifies accuracy of trained or compressed DNNs. In some embodiments, the validating module 440 inputs samples in a validation dataset into a trained DNN and uses the outputs of the DNN to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validating module 440 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validating module 440 may use the following metrics to determine the accuracy score: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where precision may be how many predictions the model made correctly (TP, or true positives) out of the total it predicted (TP + FP, where FP denotes false positives), and recall may be how many predictions the model made correctly (TP) out of the total number of objects that did have the property in question (TP + FN, where FN denotes false negatives). The F-score (F-score = 2 * P * R / (P + R), where P is the precision and R is the recall) unifies precision and recall into a single measure.
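The three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The function names and the sample counts below are hypothetical.

```python
def precision(tp, fp):
    """Correct positive predictions out of all positive predictions."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Correct positive predictions out of all actual positives."""
    return tp / (tp + fn)


def f_score(p, r):
    """F-score = 2 * P * R / (P + R)."""
    return 2 * p * r / (p + r)


# Hypothetical validation counts: 8 true positives, 2 false positives,
# 8 false negatives.
p = precision(8, 2)
r = recall(8, 8)
score = f_score(p, r)
```

With these counts the precision is 0.8 and the recall is 0.5, so the F-score falls between the two at about 0.615.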

The validating module 440 may compare the accuracy score with a threshold score. In an example where the validating module 440 determines that the accuracy score of the DNN is less than the threshold score, the validating module 440 instructs the training module 420 to re-train the DNN. In one embodiment, the training module 420 may iteratively re-train the DNN until the occurrence of a stopping condition, such as the accuracy measurement indicating that the DNN may be sufficiently accurate, or a number of training rounds having taken place.

The multi-precision tuning module 450 may determine precision levels of deep learning operations in DNN layers. In some embodiments, the multi-precision tuning module 450 may select a precision level from a plurality of predetermined precision levels for a DNN layer. The multi-precision tuning module 450 may also generate a configuration parameter that indicates the selected precision level. The multi-precision tuning module 450 may further send the configuration parameter to a compute block (e.g., the compute block 330) so that the compute block would execute the DNN layer at the selected precision level. For instance, the compute block may execute the DNN layer in an operation mode configured by the configuration parameter. In the process of executing the DNN layer at the selected precision level, one or more tensor multiplication operations in the DNN layer are performed at the selected precision level. For instance, the product of two data elements (e.g., a product of an activation and a weight) may be approximated at the selected precision level during a tensor multiplication operation.

In some embodiments, the multi-precision tuning module 450 may determine precision levels for DNN layers based on one or more criteria. The one or more criteria may include a target accuracy of the output of a DNN layer, a target accuracy of the output of the DNN, estimated consumption of one or more computational resources (e.g., power, time, memory storage, memory bandwidth, computation, etc.) for executing a DNN layer, estimated consumption of one or more computational resources for executing the DNN, other criteria, or some combination thereof. In some embodiments, the multi-precision tuning module 450 may perform a search using a graph representing the DNN to determine precision levels of layers in the DNN. The graph may include nodes representing the layers, and links between the nodes (also referred to as “edges”) may represent relationships between the layers. For instance, a layer is connected to another layer that receives the output of the layer. A node may encode the corresponding layer’s precision level. For instance, the state of the node may indicate the precision level of the layer. The node may also encode other attributes of the layer, such as a target accuracy of the layer, amount of computation in the layer, estimated consumption of computational resources for executing the layer, and so on. The multi-precision tuning module 450 may update the states of the nodes till the corresponding search criteria are met. The states of the nodes that result in the satisfaction of the search criteria may encode the optimal precision levels of the layers. By using graph search, the multi-precision tuning module 450 can achieve automated optimization of multi-precision tensor multiplication operations. Certain aspects of the multi-precision tuning module 450 are described below in conjunction with FIG. 5.

The datastore 460 stores data received, generated, used, or otherwise associated with the DNN module 400. For example, the datastore 460 stores the datasets used by the training module 420 and validating module 440. The datastore 460 may also store data generated by the training module 420 and validating module 440, such as the hyperparameters for training DNNs, internal parameters of trained DNNs (e.g., weights, etc. ) , data for sparsity acceleration (e.g., sparsity bitmap, etc. ) , and so on. The datastore 460 may store configuration parameters and graphs generated by the multi-precision tuning module 450. In the embodiment of FIG. 4, the datastore 460 is a component of the DNN module 400. In other embodiments, the datastore 460 may be external to the DNN module 400 and communicate with the DNN module 400 through a network.

FIG. 5 is a block diagram of the multi-precision tuning module 450, in accordance with various embodiments. The multi-precision tuning module 450 includes a graph generator 510, a graph compiler 520, a search criteria module 530, a search module 540, and an output module 550. In other embodiments, alternative configurations, different or additional components may be included in the multi-precision tuning module 450. Further, functionality attributed to a component of the multi-precision tuning module 450 may be accomplished by a different module, device, or system.

The graph generator 510 generates graphs representing DNNs. A graph is a data structure comprising a collection of nodes and one or more edges. An edge is a connection of two nodes. Each node may encode a layer of a DNN represented by the graph. Each edge linking two or more nodes may encode the connection between the layers encoded by the nodes. For instance, an edge between two nodes may encode the data flow between the two layers encoded by the two nodes.

In some embodiments, the graph generator 510 may generate a graph for a DNN that includes nodes representing all the layers in the DNN. In other embodiments, the graph generator 510 may select a subset of layers in the DNN and generate a graph that includes nodes representing the selected layers. The graph may have no nodes representing the unselected layers in the DNN. For instance, the graph generator 510 may select layers that include tensor multiplication operations. A layer that includes no tensor multiplication operation may not be selected. In yet other embodiments, the graph generator 510 may generate a graph that includes nodes representing layers without tensor multiplication operations. However, the nodes representing layers without tensor multiplication operations may encode different information from nodes representing layers with tensor multiplication operations. For instance, the nodes representing layers without tensor multiplication operations may not encode precision levels of the corresponding layers.
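A minimal sketch of such a graph is given below for illustration. The class names `LayerNode` and `DnnGraph`, the integer encoding of precision levels, and the three-layer example network are all hypothetical; the disclosure does not prescribe a particular data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class LayerNode:
    name: str
    has_tensor_mul: bool
    # State of the node: precision level of the layer, or None when the
    # layer has no tensor multiplication operation (assumed encoding).
    precision_level: Optional[int] = None


@dataclass
class DnnGraph:
    nodes: Dict[str, LayerNode] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (producer, consumer)

    def add_layer(self, name: str, has_tensor_mul: bool, precision_level: int = 0) -> None:
        level = precision_level if has_tensor_mul else None
        self.nodes[name] = LayerNode(name, has_tensor_mul, level)

    def connect(self, src: str, dst: str) -> None:
        # An edge encodes data flow from one layer to the layer that consumes its output.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both layers must be added before connecting them")
        self.edges.append((src, dst))

    def consumers(self, name: str) -> List[str]:
        return [dst for src, dst in self.edges if src == name]


g = DnnGraph()
g.add_layer("conv1", has_tensor_mul=True, precision_level=2)
g.add_layer("relu1", has_tensor_mul=False)
g.add_layer("fc1", has_tensor_mul=True, precision_level=0)
g.connect("conv1", "relu1")
g.connect("relu1", "fc1")
```

Keeping the precision level on the node rather than on the edge matches the description above, in which the state of a node indicates the precision level of its layer.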

The graph compiler 520 may replace tensor multiplication operations in graphs with multi-precision tensor multiplication operators. The graphs may be generated by the graph generator 510. An example multi-precision tensor multiplication operator is the multi-precision tensor multiplication operator 800 in FIG. 8. With the replacement, the graph compiler 520 may transform a graph representing a DNN into a temporary graph based on multi-precision tensor multiplications. The graph compiler 520 may assign an initial precision level to each tensor multiplication operation in the temporary graph. The temporary graph defines a search space for searching for desirable precision levels of the tensor multiplication operations in the temporary graph.

The search criteria module 530 determines criteria for performing searches to find desirable precision levels of tensor multiplication operations. Such criteria may be referred to as search criteria or search thresholds. In some embodiments, the search criteria module 530 determines a search threshold based on an accuracy of one or more tensor multiplication operations or an accuracy of the DNN. The accuracy may be a target accuracy, such as a desired accuracy. The search criteria module 530 may determine a search threshold further based on other factors, such as estimated consumption of computational resources for executing the tensor multiplication operations, etc.

The search module 540 performs searches to find desirable precision levels of tensor multiplication operations based on search criteria determined by the search criteria module 530. In some embodiments, the search module 540 performs such searches in search spaces defined by graphs compiled by the graph compiler 520. In some embodiments, the search module 540 may determine a search algorithm. Examples of search algorithms include a particle swarm optimization (PSO) algorithm, a genetic algorithm, a differential evolution algorithm, and so on. The search module 540 may iteratively explore different combinations of precision levels (e.g., by iteratively updating the states of nodes in the graph) with the goal of finding the optimal precision-level settings that can satisfy the search criterion while minimizing consumption of computational resources, e.g., minimizing inference time (i.e., maximizing inference speed).
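The following sketch illustrates the shape of such a search on a toy problem. It uses plain exhaustive enumeration rather than PSO, a genetic algorithm, or differential evolution, purely to keep the example short. The per-level cost and error tables, the per-layer sensitivity values, and the additive error model are all assumptions invented for this example and do not come from the disclosure.

```python
from itertools import product

# Hypothetical precision levels: 0 = cheapest/least precise, 2 = most precise.
LEVELS = (0, 1, 2)
COST = {0: 1.0, 1: 2.0, 2: 3.0}       # assumed relative cost (e.g., multiplies per product)
ERROR = {0: 4e-3, 1: 2e-3, 2: 1e-5}   # assumed error contributed at each level


def search_precision_levels(sensitivities, max_error):
    """Return (assignment, cost) for the cheapest assignment meeting max_error, or None.

    sensitivities[i] scales how strongly layer i's error affects the output
    (an assumed, purely additive error model).
    """
    best = None
    for assignment in product(LEVELS, repeat=len(sensitivities)):
        error = sum(s * ERROR[lvl] for s, lvl in zip(sensitivities, assignment))
        if error > max_error:
            continue  # search criterion (target accuracy) not satisfied
        cost = sum(COST[lvl] for lvl in assignment)
        if best is None or cost < best[1]:
            best = (assignment, cost)
    return best


# Three layers; the third is assumed to be far more sensitive than the others.
result = search_precision_levels([1.0, 0.1, 5.0], max_error=5e-3)
print(result)
```

Exhaustive enumeration grows exponentially with the number of layers, which is one reason a heuristic search algorithm such as those named above may be preferred for real networks.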

The output module 550 may output results of searches performed by the search module 540. In some embodiments, the output module 550 may generate configuration parameters based on the optimal precision levels of the layers. The output module 550 may provide the configuration parameters to a compiler that compiles the DNN or provide the configuration parameters to a DNN accelerator (e.g., the DNN accelerator 302) that executes DNN inference. In some embodiments, the result of a search may be the computation graph with the optimal precision-level settings. The graph may be used for DNN inference. With the optimal precision-level settings, the DNN inference can be more efficient and therefore, the performance of the DNN accelerator (e.g., the DNN accelerator 302) that executes the DNN can be enhanced.

Example Graph Representing DNN

FIG. 6 illustrates an example graph 600, in accordance with various embodiments. The graph 600 is a data structure including a collection of nodes 610A-610P (collectively referred to as “nodes 610” or “node 610” ) . The lines linking the nodes 610 indicate connections between the nodes 610. A connection in the graph 600 is referred to as an edge. The nodes 610 and edges in FIG. 6 are shown for the purpose of illustration. In other embodiments, the graph 600 may include a different number of nodes or different edges.

The graph 600 may be used to represent DNNs. In an example, a node 610 may represent a layer or a deep learning operation in a DNN. The deep learning operation may be a tensor multiplication operation or may include a tensor multiplication operation. The edges may indicate relationships between the layers or deep learning operations in the DNN. A node 610 may be associated with an embedding that encodes information about the layer or deep learning operation, such as the precision level. The graph may be generated by the graph generator 510. The graph may also be compiled by the graph compiler 520. In some embodiments, the graph may represent an optimal precision-level setting of the DNN, e.g., an optimal combination of various precision levels for various tensor multiplication operations in the DNN. The graph may be used to run inference of the DNN.

Example Data Elements of Different Precisions

FIG. 7 illustrates data elements 710 and 720 of different precisions, in accordance with various embodiments. The data element 710 or 720 may be an activation in an input tensor of a DNN layer or a weight in a weight tensor of a DNN layer. The data elements 710 and 720 have different precision and therefore require different numbers of bits for storage.

In the embodiments of FIG. 7, the data element 710 has a lower precision, e.g., the BF16 precision, and has 16 bits. The first bit (represented by “S” in FIG. 7) indicates the sign of the data element 710. The next eight bits (represented by “E” in FIG. 7) indicate the exponent of the data element 710. The next seven bits (represented by “M” in FIG. 7) indicate the mantissa of the data element 710. In contrast, the data element 720 has a higher precision, e.g., the FP32 precision, and has 32 bits. The first bit (represented by “S” in FIG. 7) indicates the sign of the data element 720. The next eight bits (represented by “E” in FIG. 7) indicate the exponent of the data element 720. The next 23 bits (represented by “M” in FIG. 7) indicate the mantissa of the data element 720. Compared with the data element 710, the data element 720 has 16 more bits for mantissa.
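For illustration, the sign, exponent, and mantissa fields described above can be extracted with a few bit operations. This is a sketch using Python's standard `struct` module; the helper names are hypothetical, and the BF16 fields are read from the upper 16 bits of the FP32 bit pattern.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after conversion to FP32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fp32_fields(x: float):
    """Return (sign, exponent, mantissa) with 1, 8, and 23 bits respectively."""
    b = fp32_bits(x)
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF


def bf16_fields(x: float):
    """Return (sign, exponent, mantissa) with 1, 8, and 7 bits respectively."""
    b = fp32_bits(x) >> 16  # keep the upper 16 bits of the FP32 pattern
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F


print(fp32_fields(-1.5))  # sign, biased exponent, 23-bit mantissa
print(bf16_fields(-1.5))  # sign, biased exponent, 7-bit mantissa
```

Both formats share the same sign bit and 8-bit exponent, so only the mantissa width differs between the two tuples.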

The data element 720 may be converted to one or more data elements having a lower precision to increase the speed of DNN inference and to reduce memory space and memory bandwidth. In an example, the data element 720 may be converted to one or more data elements having the lower precision of the data element 710. For instance, the 16 extra bits for mantissa, which are enclosed in the dashed oval in FIG. 7, may be removed. Also, the data element 710 may be converted to the data element 720, e.g., by adding 16 extra bits for mantissa. The 16 extra bits may be zeros. In some embodiments, the conversion from the data element 710 to the data element 720 may be used for computing another data element of the lower precision in the process of reducing the precision of the data element 720.
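One possible reading of this conversion is sketched below. The first lower-precision element is obtained by removing the 16 low mantissa bits, and the second is obtained by widening the first element back to FP32 (zero-filled mantissa), subtracting it from the original value, and truncating the residual. The residual step is an assumption of this sketch, since the text above only states that the widened element "may be used for computing another data element"; the function names are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def bf16_truncate(x: float) -> float:
    """Remove the 16 low mantissa bits of the FP32 pattern of x.

    The result is returned as a float whose FP32 pattern has 16 zero bits
    appended, i.e., the BF16 value widened back to FP32.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0] & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def split_bf16_pair(x: float):
    """Split an FP32 value into (hi, lo), both representable in BF16."""
    x = to_fp32(x)
    hi = bf16_truncate(x)        # bit removal, as described in the text
    lo = bf16_truncate(x - hi)   # assumed: residual captures the removed bits
    return hi, lo


x = to_fp32(3.14159265)
hi, lo = split_bf16_pair(x)
print(hi, lo, x - (hi + lo))
```

Under these assumptions, `hi + lo` is closer to the original value than `hi` alone, which is what allows a pair of lower-precision elements to recover part of the precision lost by truncation.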

Example Multi-precision Tensor Multiplication Operator

FIG. 8 illustrates an example multi-precision tensor multiplication operator 800, in accordance with various embodiments. The multi-precision tensor multiplication operator 800 may be an operator in a compute block, e.g., the compute block 330 in FIG. 3. The multi-precision tensor multiplication operator 800 may perform tensor multiplications at various precision levels. In the embodiment of FIG. 8, the multi-precision tensor multiplication operator 800 includes a precision reduction operator 810A, another precision reduction operator 810W, and three multiplication operators 820A-820C (collectively referred to as “multiplication operators 820” or “multiplication operator 820”).

In other embodiments, alternative configurations, different or additional components may be included in the multi-precision tensor multiplication operator 800. For instance, the multi-precision tensor multiplication operator 800 may include a different number of precision reduction operators or a different number of multiplication operators. Further, functionality attributed to a component of the multi-precision tensor multiplication operator 800 may be accomplished by a different component included in the multi-precision tensor multiplication operator 800 or a different module, device, or system.

In a round of computation, the precision reduction operator 810W receives a data element 801, and the precision reduction operator 810A receives a data element 802. The data element 801 may be a weight. The data element 802 may be an activation. The data elements 801 and 802 may have the same precision. For the purpose of illustration, the data elements 801 and 802 are both in the FP32 format in FIG. 8. The precision reduction operator 810W converts the data element 801 into two new data elements 803 and 804. The two data elements 803 and 804 have a lower precision than the data element 801. In FIG. 8, the data elements 803 and 804 are in the BF16 format. Similarly, the precision reduction operator 810A converts the data element 802 into two new data elements 805 and 806. The two data elements 805 and 806 have a lower precision than the data element 802. In FIG. 8, the data elements 805 and 806 are in the BF16 format.

The precision reduction operators 810W and 810A may be examples of the precision reduction operator 360 in FIG. 3. Even though FIG. 8 shows two precision reduction operators, the multi-precision tensor multiplication operator 800 may include a single precision reduction operator in other embodiments. For instance, the multi-precision tensor multiplication operator 800 may include a precision reduction operator for precision reduction of activations during DNN inference. Precision reduction of weights may be performed by a different module or system, e.g., by the DNN module 301 before DNN inference.

The data elements 803, 804, 805, and 806 are input into one of the multiplication operators 820 for computing a product 807 of the data element 801 and the data element 802. The product 807 may not be the real product of the data element 801 and the data element 802. Rather, the product 807 may be an approximation of the real product of the data element 801 and the data element 802. The three multiplication operators 820 may correspond to different precision levels. For instance, the multiplication operator 820A may compute approximate products having a higher precision than approximate products computed by the other two multiplication operators 820. The multiplication operator 820B may compute approximate products having a higher precision than approximate products computed by the multiplication operator 820C.

The three multiplication operators 820 may use different algorithms to compute the product 807. In an example, the multiplication operator 820A may use an algorithm or function denoted as:

P0 = A0_BF16 × W0_BF16 + A0_BF16 × W1_BF16 + A1_BF16 × W0_BF16

The multiplication operator 820B may use an algorithm or function denoted as:

P0 = A0_BF16 × W0_BF16 + A0_BF16 × W1_BF16

The multiplication operator 820C may use an algorithm or function denoted as:

P0 = A0_BF16 × W0_BF16

That way, the product 807 computed by the three multiplication operators 820 would have different precisions. In other embodiments, the multiplication operators 820 may use different algorithms or functions to compute products. Each multiplication operator 820 may include one or more multipliers and one or more adders for computing the product 807. In some embodiments, a multiplier or adder may be shared by some of or all the multiplication operators 820.
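The three functions can be compared numerically with a short sketch. It assumes that each FP32 input is split into a truncated BF16 element and a truncated residual, and it accumulates in Python double precision, whereas hardware would likely accumulate in FP32. The helper names and input values are hypothetical.

```python
import struct


def _fp32(x: float) -> float:
    return struct.unpack(">f", struct.pack(">f", x))[0]


def _bf16(x: float) -> float:
    # Truncate to BF16 by clearing the 16 low bits of the FP32 pattern.
    bits = struct.unpack(">I", struct.pack(">f", x))[0] & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def split(x: float):
    """Assumed split: x is approximated by x0 + x1, both BF16."""
    x0 = _bf16(x)
    return x0, _bf16(x - x0)


def product_820a(a0, a1, w0, w1):
    return a0 * w0 + a0 * w1 + a1 * w0


def product_820b(a0, a1, w0, w1):
    return a0 * w0 + a0 * w1


def product_820c(a0, a1, w0, w1):
    return a0 * w0


a, w = _fp32(1.2345678), _fp32(7.654321)   # hypothetical activation and weight
a0, a1 = split(a)
w0, w1 = split(w)
exact = a * w
err_a = abs(exact - product_820a(a0, a1, w0, w1))
err_b = abs(exact - product_820b(a0, a1, w0, w1))
err_c = abs(exact - product_820c(a0, a1, w0, w1))
print(err_a, err_b, err_c)
```

For these inputs the error shrinks as more terms are included, at the cost of one additional lower-precision multiplication per term. The term A1 × W1 is omitted from all three functions because it is the smallest of the four cross terms.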

Example Data Locality and Reuse

FIG. 9 illustrates locality and reuse of data in a multi-precision tensor multiplication operation, in accordance with various embodiments. The multi-precision tensor multiplication operation in FIG. 9 may be part of a convolution or other types of deep learning operations. The multi-precision tensor multiplication operation has an activation tensor 910 and a weight tensor 920 as input. For the purpose of simplicity and illustration, each of the activation tensor 910 and the weight tensor 920 is a 4×4 matrix. The activation tensor 910 includes 16 activations, each of which is converted to two lower-precision activations (A0 and A1). The weight tensor 920 includes 16 weights, each of which is converted to two lower-precision weights (W0 and W1). The precision of the lower-precision activations may be the same as the precision of the lower-precision weights.

In some embodiments, the lower-precision activations converted from some or all of the activations in the activation tensor 910 may be stored in a memory, e.g., a register file. The register file may be in the local memory 340. The lower-precision weights converted from some or all of the weights in the weight tensor 920 may be stored in another memory, such as another register file in the local memory 340. The storage of lower-precision activations converted from multiple activations in the same register file at the same time or the storage of lower-precision weights converted from multiple weights in the same register file at the same time can increase data locality and data reuse. As represented by the arrows in FIG. 9, each activation in the activation tensor 910 may be multiplied by multiple (or even all) the weights in the weight tensor 920. Similarly, each weight in the weight tensor 920 may be multiplied by multiple (or even all) the activations in the activation tensor 910. When multiple activations (or weights) are stored in the same register file at the same time, the weights (or activations) to be multiplied with these activations (or weights) can be used more than once. The data locality and data reuse can improve efficiency of the DNN inference as fewer data transfer operations would be needed.
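The reuse pattern can be illustrated in software with a sketch in which every element of two 4×4 matrices is converted once and the resulting pairs are reused across all products of a matrix multiplication. The Python lists merely stand in for register files. The split helper, the three-term product formula chosen, and the input values are assumptions made for this example.

```python
import struct


def _bf16(x: float) -> float:
    # Truncate to BF16 by clearing the 16 low bits of the FP32 pattern.
    bits = struct.unpack(">I", struct.pack(">f", x))[0] & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def split(x: float):
    x0 = _bf16(x)
    return x0, _bf16(x - x0)


def split_tensor(t):
    """Convert each element once; the result is kept for reuse."""
    return [[split(v) for v in row] for row in t]


def matmul_reuse(acts, weights):
    a_pairs = split_tensor(acts)      # 16 conversions for a 4x4 tensor
    w_pairs = split_tensor(weights)   # 16 conversions for a 4x4 tensor
    n = len(acts)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                a0, a1 = a_pairs[i][k]   # reused for every output column j
                w0, w1 = w_pairs[k][j]   # reused for every output row i
                out[i][j] += a0 * w0 + a0 * w1 + a1 * w0
    return out


def matmul_exact(acts, weights):
    n = len(acts)
    return [[sum(acts[i][k] * weights[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


acts = [[0.1 * (i + 1) + 0.01 * k for k in range(4)] for i in range(4)]
weights = [[0.2 * (k + 1) - 0.03 * j for j in range(4)] for k in range(4)]
approx = matmul_reuse(acts, weights)
exact = matmul_exact(acts, weights)
```

Each converted pair participates in four products here, so converting up front requires 32 conversions in total rather than one conversion per operand per product.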

Example Method of Executing Tensor Multiplication Operations

FIG. 10 is a flowchart showing a method 1000 of executing a tensor multiplication operation in a DNN, in accordance with various embodiments. The method 1000 may be performed by the DNN system 300 in FIG. 3. Although the method 1000 is described with reference to the flowchart illustrated in FIG. 10, many other methods for executing tensor multiplication operations may alternatively be used. For example, the order of execution of the steps in FIG. 10 may be changed. As another example, some of the steps may be changed, eliminated, or combined.

The DNN system 300 selects 1010 a precision level for a layer in the DNN based on a target accuracy of the DNN. The layer comprises a multiplication operation on a first tensor and a second tensor of the neural network. The first tensor comprises a first data element. The second tensor comprises a second data element. In some embodiments, the layer is a convolutional layer, the first tensor is an input tensor of the convolutional layer, the second tensor comprises a kernel of the convolutional layer, and the tensor multiplication operation is part of a convolution. In other embodiments, the layer is a linear layer.

In some embodiments, the DNN system 300 selects the precision level for the layer by generating a graph representing at least part of the DNN. The graph comprises nodes and one or more links between the nodes. The nodes represent layers in the DNN. The one or more links represent one or more connections between the layers. The DNN system 300 performs a search using the graph to select the precision level for the layer. In some embodiments (e.g., embodiments where the DNN system 300 selects the precision level from a plurality of predetermined precision levels), the nodes in the graph have states encoding the plurality of predetermined precision levels.

In some embodiments, the DNN system 300 determines one or more criteria for the search based on the target accuracy of the DNN. The DNN system 300 updates the states of the nodes in the graph till the one or more criteria are satisfied. The one or more criteria for the search are determined further based on an estimated consumption of one or more computational resources for executing the DNN.

The DNN system 300 converts 1020 the first data element to a first pair of data elements. The first data element has a first precision. The first pair of data elements have a second precision. In some embodiments, the first precision is higher than the second precision. The DNN system 300 converts the first data element to the first pair of data elements by removing one or more bits in the first data element. In some embodiments, a data element of the first pair of data elements is computed by removing the one or more bits in the first data element. Another data element of the first pair of data elements is computed based on the data element of the first pair of data elements.

The DNN system 300 converts 1030 the second data element to a second pair of data elements. The second data element has the first precision. The second pair of data elements have the second precision. In some embodiments, the DNN system 300 converts the second data element to the second pair of data elements by removing one or more bits in the second data element. In some embodiments, a data element of the second pair of data elements is computed by removing the one or more bits in the second data element. Another data element of the second pair of data elements is computed based on the data element of the second pair of data elements.

The DNN system 300 identifies 1040 a function from a plurality of functions based on the selected precision level of the layer. In some embodiments, the DNN system 300 selects the precision level from a plurality of predetermined precision levels. The plurality of predetermined precision levels correspond to different functions for computing approximate products of the first data element and the second data element. The approximate products have different precisions. In some embodiments, the DNN system 300 selects a different one of the plurality of predetermined precision levels as a precision level for another layer in the DNN.

The DNN system 300 computes 1050 an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements. The output data element has the first precision. In some embodiments, the DNN system 300 uses a different function to compute the output data element when the selected precision level is different.

The DNN system 300 executes 1060 another layer in the DNN by using the output data element. In some embodiments, the output data element is an approximation of the product of the first data element and the second data element.

Example Computing Device

FIG. 11 is a block diagram of an example computing device 1100, in accordance with various embodiments. In some embodiments, the computing device 1100 can be used as at least part of the DNN system 300. A number of components are illustrated in FIG. 11 as included in the computing device 1100, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in the computing device 1100 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 1100 may not include one or more of the components illustrated in FIG. 11, but the computing device 1100 may include interface circuitry for coupling to the one or more components. For example, the computing device 1100 may not include a display device 1106, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1106 may be coupled. In another set of examples, the computing device 1100 may not include an audio input device 1118 or an audio output device 1108, but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1118 or audio output device 1108 may be coupled.

The computing device 1100 may include a processing device 1102 (e.g., one or more processing devices) . The processing device 1102 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The computing device 1100 may include a memory 1104, which may itself include one or more memory devices such as volatile memory (e.g., DRAM) , nonvolatile memory (e.g., read-only memory (ROM) ) , high bandwidth memory (HBM) , flash memory, solid state memory, and/or a hard drive. In some embodiments, the memory 1104 may include memory that shares a die with the processing device 1102. In some embodiments, the memory 1104 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for accelerating deep learning operations, e.g., the method 1000 described above in conjunction with FIG. 10 or some operations performed by the DNN system 300 described above in conjunction with FIGS. 3-5. The instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1102.

In some embodiments, the computing device 1100 may include a communication chip 1112 (e.g., one or more communication chips) . For example, the communication chip 1112 may be configured for managing wireless communications for the transfer of data to and from the computing device 1100. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.

The communication chip 1112 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 1112 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 1112 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 1112 may operate in accordance with Code-division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 1112 may operate in accordance with other wireless protocols in other embodiments. The computing device 1100 may include an antenna 1122 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).

In some embodiments, the communication chip 1112 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet) . As noted above, the communication chip 1112 may include multiple communication chips. For instance, a first communication chip 1112 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1112 may be dedicated to longer-range wireless communications such as global positioning system (GPS) , EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 1112 may be dedicated to wireless communications, and a second communication chip 1112 may be dedicated to wired communications.

The computing device 1100 may include battery/power circuitry 1114. The battery/power circuitry 1114 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1100 to an energy source separate from the computing device 1100 (e.g., AC line power) .

The computing device 1100 may include a display device 1106 (or corresponding interface circuitry, as discussed above) . The display device 1106 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD) , a light-emitting diode display, or a flat panel display, for example.

The computing device 1100 may include an audio output device 1108 (or corresponding interface circuitry, as discussed above) . The audio output device 1108 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.

The computing device 1100 may include an audio input device 1118 (or corresponding interface circuitry, as discussed above) . The audio input device 1118 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output) .

The computing device 1100 may include a GPS device 1116 (or corresponding interface circuitry, as discussed above) . The GPS device 1116 may be in communication with a satellite-based system and may receive a location of the computing device 1100, as known in the art.

The computing device 1100 may include another output device 1110 (or corresponding interface circuitry, as discussed above) . Examples of the other output device 1110 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.

The computing device 1100 may include another input device 1120 (or corresponding interface circuitry, as discussed above) . Examples of the other input device 1120 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.

The computing device 1100 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA) , an ultramobile personal computer, etc. ) , a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 1100 may be any other electronic device that processes data.

Select Examples

The following paragraphs provide various examples of the embodiments disclosed herein.

Example 1 provides a method for executing a tensor multiplication operation in a neural network, including selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer including a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor including a first data element, the second tensor including a second data element, the first data element and the second data element having a first precision; converting the first data element to a first pair of data elements; converting the second data element to a second pair of data elements, the first pair of data elements and second pair of data elements having a second precision; identifying a function from a plurality of functions based on the selected precision level of the layer; computing an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements, the output data element having the first precision; and executing another layer in the neural network using the output data element.

Example 2 provides the method of example 1, in which selecting the precision level includes selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different ones of the plurality of functions.

Example 3 provides the method of example 2, further including selecting a different one of the plurality of predetermined precision levels as a precision level for the another layer in the neural network.

Example 4 provides the method of any one of examples 1-3, in which the first precision is higher than the second precision, and converting the first data element to the first pair of data elements includes removing one or more bits in the first data element.

Example 5 provides the method of example 4, in which a data element in the first pair of data elements is computed by removing the one or more bits in the first data element, and another data element in the first pair of data elements is computed based on the data element in the first pair of data elements.
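Examples 1 to 5 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the first precision is 32-bit floating point and that the second precision is a bfloat16-style format obtained by removing the low 16 bits of the 32-bit encoding; the examples themselves do not fix these formats. The names `to_f32`, `truncate_low_bits`, `split_pair`, `PRODUCT_FUNCTIONS`, and `approx_product`, the three precision levels, and the choice of which partial products each level keeps are all hypothetical.

```python
import struct
from typing import Tuple

def to_f32(x: float) -> float:
    """Round a Python float to the nearest 32-bit float (assumed first precision)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def truncate_low_bits(x: float, drop: int = 16) -> float:
    """Remove the `drop` lowest bits of the 32-bit encoding of x (assumed second precision)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def split_pair(x: float) -> Tuple[float, float]:
    """Convert one data element into a pair of lower-precision data elements.

    The first element is obtained by removing bits from x; the second is
    computed from the first, here as the truncated residual x - hi.
    """
    x32 = to_f32(x)
    hi = truncate_low_bits(x32)
    lo = truncate_low_bits(x32 - hi)
    return hi, lo

# Hypothetical mapping from precision level to product function.
# Each function takes an activation pair a = (hi, lo) and a weight pair w = (hi, lo).
PRODUCT_FUNCTIONS = {
    1: lambda a, w: a[0] * w[0],
    2: lambda a, w: a[0] * w[0] + a[0] * w[1] + a[1] * w[0],
    3: lambda a, w: a[0] * w[0] + a[0] * w[1] + a[1] * w[0] + a[1] * w[1],
}

def approx_product(activation: float, weight: float, level: int) -> float:
    """Approximate activation * weight using the function identified by `level`."""
    function = PRODUCT_FUNCTIONS[level]
    return function(split_pair(activation), split_pair(weight))

if __name__ == "__main__":
    a, w = 3.14159, 2.71828
    exact = to_f32(a) * to_f32(w)
    for level in sorted(PRODUCT_FUNCTIONS):
        approx = approx_product(a, w, level)
        print(level, approx, abs(exact - approx))
```

In this sketch the partial products are evaluated in Python's native double precision for simplicity, so only the operands, not the accumulation, are reduced in precision. A higher level keeps more partial products and therefore tracks the full-precision product more closely, at the cost of more lower-precision multiplications.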

Example 6 provides the method of any one of examples 1-5, in which selecting the precision level for the layer includes generating a graph representing at least part of the neural network, the graph including nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers; and performing a search using the graph to select the precision level for the layer.

Example 7 provides the method of example 6, in which the precision level is selected from a plurality of predetermined precision levels, and the nodes in the graph have states encoding the plurality of predetermined precision levels.

Example 8 provides the method of example 7, in which performing the search using the graph includes determining one or more criteria for the search based on the target accuracy of the neural network; and updating the states of the nodes in the graph until the one or more criteria are satisfied.

Example 9 provides the method of example 8, in which the one or more criteria for the search are determined further based on an estimated consumption of one or more computational resources for executing the neural network.
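Examples 6 to 9 leave the search strategy open. One possible reading, sketched below in Python, is a greedy search: every node starts at the lowest predetermined precision level, and node states are updated one at a time until the accuracy criterion is met, preferring the update with the best accuracy gain per unit of estimated cost. The greedy strategy, the function `search_precision_levels`, the `LEVELS` tuple, and the callbacks `evaluate_accuracy` and `estimate_cost` are assumptions made for illustration, not details taken from the examples.

```python
from typing import Callable, Dict, List, Tuple

LEVELS = (1, 2, 3)  # hypothetical predetermined precision levels, lowest first

State = Dict[str, int]  # node name -> precision level encoded as the node's state

def search_precision_levels(
    nodes: List[str],
    links: List[Tuple[str, str]],
    evaluate_accuracy: Callable[[State], float],
    estimate_cost: Callable[[State], float],
    target_accuracy: float,
) -> State:
    """Greedy search over node states until the accuracy criterion is satisfied.

    `links` describes the connections between layers. This toy search keeps
    them only as part of the graph description and does not traverse them.
    """
    state: State = {name: LEVELS[0] for name in nodes}
    while evaluate_accuracy(state) < target_accuracy:
        upgradable = [n for n in nodes if state[n] != LEVELS[-1]]
        if not upgradable:
            break  # every node is already at the highest level
        base_accuracy = evaluate_accuracy(state)
        base_cost = estimate_cost(state)

        def score(name: str) -> float:
            # Accuracy gained per unit of extra cost if this node is promoted.
            trial = dict(state)
            trial[name] = LEVELS[LEVELS.index(state[name]) + 1]
            gain = evaluate_accuracy(trial) - base_accuracy
            extra = estimate_cost(trial) - base_cost
            return gain / extra if extra > 0 else gain

        best = max(upgradable, key=score)
        state[best] = LEVELS[LEVELS.index(state[best]) + 1]
    return state

if __name__ == "__main__":
    # Toy stand-ins for real accuracy measurement and cost estimation.
    sensitivity = {"conv1": 0.03, "conv2": 0.01, "fc": 0.005}
    accuracy = lambda s: 0.90 + sum(sensitivity[n] * (s[n] - 1) for n in s)
    cost = lambda s: float(sum(s.values()))
    links = [("conv1", "conv2"), ("conv2", "fc")]
    print(search_precision_levels(list(sensitivity), links, accuracy, cost, 0.95))
```

In practice, `evaluate_accuracy` would run the network on validation data and `estimate_cost` would model compute, memory, or power use; both are placeholders here.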

Example 10 provides the method of any one of examples 1-9, in which the layer is a convolutional layer or a linear layer, the first tensor is an input tensor of the convolutional layer or the linear layer, the second tensor includes a kernel of the convolutional layer or the linear layer, and the tensor multiplication operation is part of a convolution or a linear transformation.
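For a linear layer, the per-element decomposition carries over to whole tensors. As a worked sketch, assume each activation $a_i$ and each weight $w_{ij}$ is converted to a high part and a low part; the subscripts $h$ and $l$ are notation introduced here and do not appear in the examples.

```latex
\begin{aligned}
a_i &\approx a_{i,h} + a_{i,l}, \qquad w_{ij} \approx w_{ij,h} + w_{ij,l}, \\
y_j &= \sum_i a_i\, w_{ij}
     \approx \sum_i a_{i,h}\, w_{ij,h}
     + \sum_i a_{i,h}\, w_{ij,l}
     + \sum_i a_{i,l}\, w_{ij,h}
     + \sum_i a_{i,l}\, w_{ij,l}.
\end{aligned}
```

Under this assumption, each of the four sums is itself a lower-precision tensor multiplication, so a function that keeps fewer of them trades accuracy for fewer multiplications. The same expansion applies to the multiply-accumulate operations of a convolution.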

Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations for executing a tensor multiplication operation in a neural network, the operations including selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer including a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor including a first data element, the second tensor including a second data element, the first data element and the second data element having a first precision; converting the first data element to a first pair of data elements; converting the second data element to a second pair of data elements, the first pair of data elements and second pair of data elements having a second precision; identifying a function from a plurality of functions based on the selected precision level of the layer; computing an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements, the output data element having the first precision; and executing another layer in the neural network using the output data element.

Example 12 provides the one or more non-transitory computer-readable media of example 11, in which selecting the precision level includes selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different ones of the plurality of functions.

Example 13 provides the one or more non-transitory computer-readable media of example 12, in which the operations further include selecting a different one of the plurality of predetermined precision levels as a precision level for the another layer in the neural network.

Example 14 provides the one or more non-transitory computer-readable media of any one of examples 11-13, in which the first precision is higher than the second precision, and converting the first data element to the first pair of data elements includes removing one or more bits in the first data element.

Example 15 provides the one or more non-transitory computer-readable media of example 14, in which a data element in the first pair of data elements is computed by removing the one or more bits in the first data element, and another data element in the first pair of data elements is computed based on the data element in the first pair of data elements.

Example 16 provides the one or more non-transitory computer-readable media of any one of examples 11-15, in which selecting the precision level for the layer includes generating a graph representing at least part of the neural network, the graph including nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers; and performing a search using the graph to select the precision level for the layer.

Example 17 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations for executing a tensor multiplication operation in a neural network, the operations including selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer including a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor including a first data element, the second tensor including a second data element, the first data element and the second data element having a first precision, converting the first data element to a first pair of data elements, converting the second data element to a second pair of data elements, the first pair of data elements and second pair of data elements having a second precision, identifying a function from a plurality of functions based on the selected precision level of the layer, computing an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements, the output data element having the first precision, and executing another layer in the neural network using the output data element.

Example 18 provides the apparatus of example 17, in which selecting the precision level includes selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different ones of the plurality of functions.

Example 19 provides the apparatus of example 17 or 18, in which the first precision is higher than the second precision, and converting the first data element to the first pair of data elements includes removing one or more bits in the first data element.

Example 20 provides the apparatus of any one of examples 17-19, in which selecting the precision level for the layer includes generating a graph representing at least part of the neural network, the graph including nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers; and performing a search using the graph to select the precision level for the layer.

The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.

Claims

1. A method for executing a tensor multiplication operation in a neural network, comprising:

selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, the first data element and the second data element having a first precision;

converting the first data element to a first pair of data elements;

converting the second data element to a second pair of data elements, the first pair of data elements and second pair of data elements having a second precision;

identifying a function from a plurality of functions based on the selected precision level of the layer;

computing an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements, the output data element having the first precision; and

executing another layer in the neural network using the output data element.

2. The method of claim 1, wherein selecting the precision level comprises:

selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different ones of the plurality of functions.

3. The method of claim 2, further comprising:

selecting a different one of the plurality of predetermined precision levels as a precision level for the another layer in the neural network.

4. The method of claim 1, wherein the first precision is higher than the second precision, and converting the first data element to the first pair of data elements comprises removing one or more bits in the first data element.

5. The method of claim 4, wherein a data element in the first pair of data elements is computed by removing the one or more bits in the first data element, and another data element in the first pair of data elements is computed based on the data element in the first pair of data elements.

6. The method of claim 1, wherein selecting the precision level for the layer comprises:

generating a graph representing at least part of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers; and

performing a search using the graph to select the precision level for the layer.

7. The method of claim 6, wherein the precision level is selected from a plurality of predetermined precision levels, and the nodes in the graph have states encoding the plurality of predetermined precision levels.

8. The method of claim 7, wherein performing the search using the graph comprises:

determining one or more criteria for the search based on the target accuracy of the neural network; and

updating the states of the nodes in the graph until the one or more criteria are satisfied.

9. The method of claim 8, wherein the one or more criteria for the search are determined further based on an estimated consumption of one or more computational resources for executing the neural network.

10. The method of claim 1, wherein the layer is a convolutional layer or a linear layer, the first tensor is an input tensor of the convolutional layer or the linear layer, the second tensor comprises a kernel of the convolutional layer or the linear layer, and the tensor multiplication operation is part of a convolution or a linear transformation.

11. One or more non-transitory computer-readable media storing instructions executable to perform operations for executing a tensor multiplication operation in a neural network, the operations comprising:

selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, the first data element and the second data element having a first precision;

converting the first data element to a first pair of data elements;

converting the second data element to a second pair of data elements, the first pair of data elements and second pair of data elements having a second precision;

identifying a function from a plurality of functions based on the selected precision level of the layer;

computing an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements, the output data element having the first precision; and

executing another layer in the neural network using the output data element.

12. The one or more non-transitory computer-readable media of claim 11, wherein selecting the precision level comprises:

selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different ones of the plurality of functions.

13. The one or more non-transitory computer-readable media of claim 12, wherein the operations further comprise:

selecting a different one of the plurality of predetermined precision levels as a precision level for the another layer in the neural network.

14. The one or more non-transitory computer-readable media of claim 11, wherein the first precision is higher than the second precision, and converting the first data element to the first pair of data elements comprises removing one or more bits in the first data element.

15. The one or more non-transitory computer-readable media of claim 14, wherein a data element in the first pair of data elements is computed by removing the one or more bits in the first data element, and another data element in the first pair of data elements is computed based on the data element in the first pair of data elements.

16. The one or more non-transitory computer-readable media of claim 11, wherein selecting the precision level for the layer comprises:

generating a graph representing at least part of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers; and

performing a search using the graph to select the precision level for the layer.

17. An apparatus, comprising:

a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations for executing a tensor multiplication operation in a neural network, the operations comprising:

selecting a precision level for a layer in the neural network based on a target accuracy of the neural network, the layer comprising a multiplication operation on a first tensor and a second tensor of the neural network, the first tensor comprising a first data element, the second tensor comprising a second data element, the first data element and the second data element having a first precision,

converting the first data element to a first pair of data elements,

converting the second data element to a second pair of data elements, the first pair of data elements and second pair of data elements having a second precision,

identifying a function from a plurality of functions based on the selected precision level of the layer,

computing an output data element of the layer by applying the identified function on the first pair of data elements and the second pair of data elements, the output data element having the first precision, and

executing another layer in the neural network using the output data element.

18. The apparatus of claim 17, wherein selecting the precision level comprises:

selecting the precision level from a plurality of predetermined precision levels, the plurality of predetermined precision levels corresponding to different ones of the plurality of functions.

19. The apparatus of claim 17, wherein the first precision is higher than the second precision, and converting the first data element to the first pair of data elements comprises removing one or more bits in the first data element.

20. The apparatus of claim 17, wherein selecting the precision level for the layer comprises:

generating a graph representing at least part of the neural network, the graph comprising nodes and one or more links between the nodes, the nodes representing layers in the neural network, the one or more links representing one or more connections between the layers; and

performing a search using the graph to select the precision level for the layer.