WO2025184850A1

WO2025184850A1 - Executing matrix multiplication by performing convolution with deep neural network accelerator

Info

Publication number: WO2025184850A1
Application number: PCT/CN2024/080471
Authority: WO
Inventors: Peiqing Jiang; Yuanyuan Li; Xu QIAN; Haiyun HONG; Yinghao Li
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2024-03-07
Filing date: 2024-03-07
Publication date: 2025-09-12
Anticipated expiration: 2026-09-07
Also published as: WO2025184850A8

Abstract

Matrix multiplications in deep neural networks (DNNs) may be converted to convolutions executed by DNN accelerators. A matrix multiplication may have a first input tensor and a second input tensor. The first input tensor may be converted to an activation tensor of a convolutional operation. The second input tensor may be converted to a weight tensor of a convolutional operation. The conversion of the first input tensor to the activation tensor or the second input tensor to the weight tensor may include tensor transposing followed by tensor reshaping. The DNN accelerator may perform the convolutional operation on the activation tensor and weight tensor. An output tensor of the convolutional operation may be converted to an output tensor of the matrix multiplication. The conversion from the output tensor of the convolutional operation to the output tensor of the matrix multiplication may include tensor reshaping followed by tensor transposing.

Description

EXECUTING MATRIX MULTIPLICATION BY PERFORMING CONVOLUTION WITH DEEP NEURAL NETWORK ACCELERATOR

Technical Field

This disclosure relates generally to neural networks (also referred to as “deep neural networks” or “DNNs”), and more specifically, to executing matrix multiplications by performing convolutions with DNN accelerators.

Background

DNNs are used extensively for a variety of artificial intelligence (AI) applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. Transformer-based large language models (LLMs) are a type of DNN that has attracted considerable attention due to its ability to achieve general-purpose language generation and understanding. A typical Transformer-based LLM includes a significant number of matrix multiplication operations.

Brief Description of the Drawings

Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

FIG. 1 illustrates an example DNN, in accordance with various embodiments.

FIG. 2 illustrates an example convolution, in accordance with various embodiments.

FIG. 3 is a block diagram of a DNN system, in accordance with various embodiments.

FIG. 4 illustrates an example matrix multiplication, in accordance with various embodiments.

FIGS. 5A-5D illustrate broadcasting in matrix multiplication operations, in accordance with various embodiments.

FIG. 6 is a block diagram of a DNN module, in accordance with various embodiments.

FIG. 7 illustrates conversion of a matrix multiplication operation to a convolutional operation, in accordance with various embodiments.

FIG. 8 illustrates an example of computing an output tensor of a matrix multiplication through a convolutional operation, in accordance with various embodiments.

FIG. 9 illustrates a sparse cell, in accordance with various embodiments.

FIG. 10 illustrates an example data processing unit, in accordance with various embodiments.

FIG. 11 is a flowchart showing a method of executing a matrix multiplication operation, in accordance with various embodiments.

FIG. 12 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

Overview

The last decade has witnessed a rapid rise in AI based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, image, and video processing mainly due to their ability to achieve beyond human-level accuracy. The significant improvements in DNN model size and accuracy coupled with the rapid increase in computing power of execution platforms have led to the adoption of DNN applications even within resource constrained mobile and edge devices that have limited energy availability.

A DNN layer may include one or more deep learning operations (also referred to as “neural network operations”), such as matrix multiplication, convolution, pooling, elementwise operation, linear operation, nonlinear operation, and so on. Input or output data of deep learning operations may be arranged in data structures called tensors. A tensor is a data structure having multiple elements across one or more dimensions. Example tensors include a vector (which is a one-dimensional (1D) tensor) and a matrix (which is a two-dimensional (2D) tensor). There can also be three-dimensional (3D) tensors, four-dimensional (4D) tensors, and even higher dimensional tensors. A dimension of a tensor may correspond to an axis, e.g., an axis in a coordinate system. The size of a dimension may correspond to or equal the number of data points along the axis. The dimensions and sizes of the dimensions of a tensor may define the shape of the tensor. The shape of a tensor may be represented by a sequence of numbers, where every number indicates the size of a different dimension. The order of the numbers may indicate the order of the dimensions, i.e., the order of the axes, in the tensor. A DNN layer may receive one or more input tensors and compute an output tensor from the one or more input tensors. For a convolution layer, the input tensors include an activation tensor (also referred to as “input feature map (IFM)”) including one or more activations (also referred to as “input elements”) and a weight tensor. The weight tensor may be a kernel (a 2D weight tensor), a filter (a 3D weight tensor), or a group of filters (a 4D weight tensor).
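The tensor terminology above can be illustrated with a short sketch. It assumes NumPy as the array library, and the specific sizes are chosen only for illustration; neither comes from the disclosure.

```python
import numpy as np

# Illustrative tensors; the sizes are arbitrary assumptions.
vector = np.zeros(5)                   # 1D tensor, shape (5,)
matrix = np.zeros((2, 3))              # 2D tensor, shape (2, 3)
activation = np.zeros((7, 7, 3))       # 3D tensor, e.g. height x width x channels
filter_group = np.zeros((4, 3, 3, 3))  # 4D tensor, e.g. a group of four 3D filters

for name, tensor in [("vector", vector), ("matrix", matrix),
                     ("activation", activation), ("filter_group", filter_group)]:
    # ndim is the number of dimensions; shape lists the size of each dimension in order.
    print(name, tensor.ndim, tensor.shape)

# Reordering the axes changes the order of the numbers in the shape,
# but not the number of elements in the tensor.
transposed = np.transpose(matrix, (1, 0))
print(transposed.shape, transposed.size == matrix.size)
```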

The recent progression in the AI field has seen a shift from convolution-based models to Transformer-based models, such as LLMs. However, currently available DNN accelerators (also known as AI accelerators) are usually designed to enhance processing speed and efficiency for convolution computing but not for computations in Transformer-based models, such as matrix multiplication operations. Convolutional operations typically take inputs with fixed rank and order. However, matrix multiplication, which is the fundamental component of Transformer-based models, has a flexible input scheme allowing for any number of combinations of input ranks and orders. In comparison to convolutional operations, matrix multiplication operations allow significantly more flexibility.

To integrate Transformer-based models with convolution-based accelerators, modifications are needed on either the hardware or software side. Redesigning the hardware architecture may result in the best performance and power optimization. However, hardware redesign would typically require considerable time and financial resources. An alternative approach would be to alter the software, specifically the AI model compiler, to allow matrix multiplication operations to be compiled in the form of convolutional operations. This would enable the leveraging of hardware acceleration capabilities on existing devices.

Currently available AI model compilers can introduce a series of compiler passes specifically designed for different matrix multiplication patterns. Each pass may detect a unique pattern and carry out a particular transformation. For example, different matrix multiplications may be treated as completely different patterns, which results in different and separate compilation passes, even though the different patterns should follow the same rule of transformation. Currently available solutions usually lack a general, efficient, and scalable method to transform matrix multiplication operations. As more matrix multiplication patterns emerge (e.g., due to ever-changing future models), the compiler's scalability can be severely limited. Furthermore, the necessity to continuously support new patterns can adversely affect development efficiency.

Embodiments of the present disclosure may improve on at least some of the challenges and issues described above by converting matrix multiplication operations, which may have arbitrary inputs, to convolutional operations, which may have fixed input ranks and orders, in a general manner despite differences in patterns of different matrix multiplication operations. The conversion of matrix multiplication operations to convolutional operations can be performed based on tensor transformation. For instance, a matrix multiplication operation may be converted to a convolutional operation by transforming (e.g., transposing and reshaping) the two input tensors of the matrix multiplication operation into the activation tensor and weight tensor, respectively, of the convolutional operation. After the convolutional operation is performed on the activation tensor and weight tensor, the output tensor of the convolutional operation (“convolutional output tensor”) may be transformed (e.g., reshaped and transposed) into the output tensor of the matrix multiplication operation (“MatMul output tensor”). MatMul refers to matrix multiplication.

In various embodiments of the present disclosure, matrix multiplications in DNNs (e.g., Transformer-based models) may be converted to convolutions executed by DNN accelerators. A DNN accelerator may be designed for performing convolution with desirable or optimal efficiency. For example, the DNN accelerator may accelerate executions of convolutions based on sparsity in activation tensors and weight tensors. Additionally or alternatively, the DNN accelerator may accelerate execution of convolutions with other mechanisms. The conversion of matrix multiplications may be facilitated by a DNN module that is coupled with the DNN accelerator. A matrix multiplication may have a first input tensor and a second input tensor. The DNN module may transform the first input tensor into an activation tensor of a convolutional operation, e.g., by transposing the first input tensor, then reshaping the transposed tensor. The DNN module may also transform the second input tensor into a weight tensor of the convolutional operation, e.g., by transposing the second input tensor, then reshaping the transposed tensor. The DNN module may also generate a configuration parameter that indicates how the order of the dimensions in the two input tensors is changed by the transpose. The DNN accelerator may perform the convolutional operation on the activation tensor and weight tensor and compute a convolutional output tensor. The DNN module may then transform the convolutional output tensor into a MatMul output tensor based on the configuration parameter. For instance, the DNN module may reshape the convolutional output tensor and transpose the reshaped tensor with the configuration parameter to restore the index mapping to the original order. The result of the transposing may be the MatMul output tensor.
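The flow described above can be sketched for the simplest case, a 2D matrix multiplication mapped to a 1×1 convolution. This is a minimal sketch under stated assumptions, not the mapping defined by the disclosure: the channel-first layouts, the function names conv2d_1x1 and matmul_via_conv, and the use of a permutation tuple perm as a stand-in for the configuration parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def conv2d_1x1(activation, weight):
    """Reference 1x1 convolution (assumed layouts, stride 1, no padding).

    activation: (C_in, H, W); weight: (C_out, C_in, 1, 1); returns (C_out, H, W).
    """
    c_out = weight.shape[0]
    _, height, width = activation.shape
    out = np.zeros((c_out, height, width), dtype=np.result_type(activation, weight))
    for o in range(c_out):
        for y in range(height):
            for x in range(width):
                # MAC over the input channels at one spatial position.
                out[o, y, x] = np.sum(activation[:, y, x] * weight[o, :, 0, 0])
    return out

def matmul_via_conv(a, b):
    """Compute a @ b for 2D inputs by running a 1x1 convolution."""
    m, k = a.shape
    k_b, n = b.shape
    if k != k_b:
        raise ValueError("inner dimensions must match")
    perm = (1, 0)  # hypothetical stand-in for the configuration parameter
    # First input: transpose, then reshape into an activation tensor (C_in=K, H=M, W=1).
    activation = a.transpose(perm).reshape(k, m, 1)
    # Second input: transpose, then reshape into a weight tensor (C_out=N, C_in=K, 1, 1).
    weight = b.transpose(perm).reshape(n, k, 1, 1)
    conv_out = conv2d_1x1(activation, weight)  # shape (N, M, 1)
    # Output: reshape, then transpose with the inverse permutation to restore the order.
    inverse_perm = tuple(np.argsort(perm))
    return conv_out.reshape(n, m).transpose(inverse_perm)

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(3, 4)
print(np.allclose(matmul_via_conv(a, b), a @ b))
```

In this sketch the contraction dimension K becomes the input-channel dimension, so the MAC operations of the convolution compute the same dot products as the matrix multiplication. Higher-rank or broadcast inputs would need a more general permutation and reshape than shown here.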

The present disclosure provides a MatMul mapping approach that can efficiently map matrix multiplication operations to convolutional operations. It can simplify the AI compiler design process. Existing redundant passes can be superseded by a single pass. The approach addresses the MatMul mapping issue for various kinds of Transformer-based models, eliminating it once and for all. This is crucial for currently available DNN accelerators to achieve desirable performance while executing Transformer-based models. The MatMul mapping approach can facilitate performance optimization in subsequent processes such as implicit reshaping.

For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.

Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.

Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.

For the purposes of the present disclosure, the phrase “A or B” or the phrase “A and/or B” means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase “A, B, or C” or the phrase “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term “between,” when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.

The description uses the phrases “in an embodiment” or “in embodiments,” which may each refer to one or more of the same or different embodiments. The terms “comprising,” “including,” “having,” and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as “above,” “below,” “top,” “bottom,” and “side” to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives “first,” “second,” and “third,” etc., to describe a common object, merely indicates that different instances of like objects are being referred to and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.

In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

The terms “substantially,” “close,” “approximately,” “near,” and “about,” generally refer to being within +/-20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., “coplanar,” “perpendicular,” “orthogonal,” “parallel,” or any other angle between the elements, generally refer to being within +/-5-20% of a target value as described herein or as known in the art.

In addition, the terms “comprise,” “comprising,” “include,” “including,” “have,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerator. Also, the term “or” refers to an inclusive “or” and not to an exclusive “or.”

The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.

Example DNN

FIG. 1 illustrates an example DNN 100, in accordance with various embodiments. The DNN 100 may be executed by a DNN accelerator, e.g., the DNN accelerator 302 in FIG. 3. In an example, the DNN 100 may be a convolution-based DNN. In other examples, the DNN 100 may be other types of DNNs. For the purpose of illustration, the DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”), and a plurality of fully-connected layers 130 (individually referred to as “fully-connected layer 130”). In other embodiments, the DNN 100 may include fewer, more, or different layers. In an execution of the DNN 100, the layers of the DNN 100 execute tensor computation that includes many tensor operations, such as matrix multiplications, convolutions (e.g., multiply-accumulate (MAC) operations, etc.), pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc.), other types of tensor operations, or some combination thereof.

The convolutional layers 110 summarize the presence of features in inputs to the DNN 100. The convolutional layers 110 function as feature extractors. The first layer of the DNN 100 is a convolutional layer 110. In an example, a convolutional layer 110 performs a convolution on an input tensor 140 (also referred to as IFM 140) and a filter 150. As shown in FIG. 1, the IFM 140 is represented by a 7×7×3 three-dimensional (3D) matrix. The IFM 140 includes 3 input channels, each of which is represented by a 7×7 two-dimensional (2D) matrix. The 7×7 2D matrix includes 7 input elements (also referred to as input points) in each row and 7 input elements in each column. The filter 150 is represented by a 3×3×3 3D matrix. The filter 150 includes 3 kernels, each of which may correspond to a different input channel of the IFM 140. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. A kernel can be smaller than the IFM. In the embodiments of FIG. 1, each kernel is represented by a 3×3 2D matrix. The 3×3 kernel includes 3 weights in each row and 3 weights in each column. Weights can be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter 150 in extracting features from the IFM 140.

The convolution includes MAC operations with the input elements in the IFM 140 and the weights in the filter 150. The convolution may be a standard convolution 163 or a depthwise convolution 183. In the standard convolution 163, the whole filter 150 slides across the IFM 140. All the input channels are combined to produce an output tensor 160 (also referred to as OFM 160). The OFM 160 is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements (also referred to as output points) in each row and 5 output elements in each column. For the purpose of illustration, the standard convolution includes one filter in the embodiments of FIG. 1. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 160.

The multiplication applied between a kernel-sized patch of the IFM 140 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 140 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product.” Using a kernel smaller than the IFM 140 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 140 multiple times at different points on the IFM 140. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 140, left to right, top to bottom. The result from multiplying the kernel with the IFM 140 one time is a single value. As the kernel is applied multiple times to the IFM 140, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix (i.e., the OFM 160) from the standard convolution 163 is referred to as an OFM.
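The sliding dot product of the standard convolution can be sketched as follows. This is an illustrative reference loop, not the accelerator's implementation; it assumes a height × width × channels layout, a stride of one, and no padding, which is consistent with a 7×7×3 IFM and a 3×3×3 filter producing a 5×5 OFM. The function name standard_conv and the input values are hypothetical.

```python
import numpy as np

def standard_conv(ifm, filt):
    """Standard convolution of one filter over an IFM (stride 1, no padding).

    ifm: (H, W, C); filt: (F_h, F_w, C); returns a 2D OFM of shape
    (H - F_h + 1, W - F_w + 1).
    """
    height, width, channels = ifm.shape
    f_h, f_w, f_c = filt.shape
    if channels != f_c:
        raise ValueError("filter depth must match the number of input channels")
    ofm = np.zeros((height - f_h + 1, width - f_w + 1))
    for y in range(ofm.shape[0]):        # top to bottom
        for x in range(ofm.shape[1]):    # left to right
            patch = ifm[y:y + f_h, x:x + f_w, :]
            # Elementwise multiply and sum: all input channels combine into one value.
            ofm[y, x] = np.sum(patch * filt)
    return ofm

ifm = np.arange(7 * 7 * 3, dtype=float).reshape(7, 7, 3)
filt = np.ones((3, 3, 3))
ofm = standard_conv(ifm, filt)
print(ofm.shape)
```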

In the depthwise convolution 183, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 1, the depthwise convolution 183 produces a depthwise output tensor 180. The depthwise output tensor 180 is represented by a 5×5×3 3D matrix. The depthwise output tensor 180 includes 3 output channels, each of which is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements in each row and 5 output elements in each column. Each output channel is a result of MAC operations of an input channel of the IFM 140 and a kernel of the filter 150. For instance, the first output channel (patterned with dots) is a result of MAC operations of the first input channel (patterned with dots) and the first kernel (patterned with dots), the second output channel (patterned with horizontal stripes) is a result of MAC operations of the second input channel (patterned with horizontal stripes) and the second kernel (patterned with horizontal stripes), and the third output channel (patterned with diagonal stripes) is a result of MAC operations of the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes). In such a depthwise convolution, the number of input channels equals the number of output channels, and each output channel corresponds to a different input channel. The input channels and output channels are referred to collectively as depthwise channels. After the depthwise convolution, a pointwise convolution 193 is performed on the depthwise output tensor 180 and a 1×1×3 tensor 190 to produce the OFM 160.
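The depthwise convolution followed by the pointwise convolution can be sketched in the same style. As before, the height × width × channels layout, stride of one, absence of padding, function names, and random input values are assumptions made for illustration only.

```python
import numpy as np

def depthwise_conv(ifm, filt):
    """Depthwise convolution: each kernel is applied to its own input channel.

    ifm: (H, W, C); filt: (F_h, F_w, C); returns (H - F_h + 1, W - F_w + 1, C).
    """
    height, width, channels = ifm.shape
    f_h, f_w, _ = filt.shape
    out = np.zeros((height - f_h + 1, width - f_w + 1, channels))
    for c in range(channels):
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                # MAC over one channel only; channels are not combined here.
                out[y, x, c] = np.sum(ifm[y:y + f_h, x:x + f_w, c] * filt[:, :, c])
    return out

def pointwise_conv(tensor, weights):
    """Pointwise (1x1) convolution: combine the channels at each position.

    tensor: (H, W, C); weights: (C,), i.e. a flattened 1x1xC tensor; returns (H, W).
    """
    return tensor @ weights

rng = np.random.default_rng(0)
ifm = rng.normal(size=(7, 7, 3))
filt = rng.normal(size=(3, 3, 3))
depthwise_out = depthwise_conv(ifm, filt)           # shape (5, 5, 3)
ofm = pointwise_conv(depthwise_out, np.ones(3))     # shape (5, 5)
print(depthwise_out.shape, ofm.shape)
```

With pointwise weights that are all one, as chosen in this sketch, the two-step result matches a standard convolution with the same filter; other pointwise weights would mix the channels differently.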

The OFM 160 is then passed to the next layer in the sequence. In some embodiments, the OFM 160 is passed through an activation function. An example activation function is rectified linear unit (ReLU). ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. The convolutional layer 110 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 160 is passed to the subsequent convolutional layer 110 (i.e., the convolutional layer 110 following the convolutional layer 110 generating the OFM 160 in the sequence). The subsequent convolutional layer 110 performs a convolution on the OFM 160 with new kernels and generates a new feature map. The new feature map may also be normalized and resized. The new feature map can be convolved again by a further subsequent convolutional layer 110, and so on.
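The ReLU behavior described above amounts to an elementwise maximum with zero. A minimal sketch, assuming NumPy and an arbitrary example feature map:

```python
import numpy as np

def relu(x):
    """Return the input where it is positive, and zero where it is zero or less."""
    return np.maximum(x, 0)

feature_map = np.array([[-2.0, 0.0], [1.5, 3.0]])
activated = relu(feature_map)
print(activated)
```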

In some embodiments, a convolutional layer 110 has four hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel is of dimensions F×F×D pixels), the step S with which the window corresponding to the kernel is dragged on the image (e.g., a step of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour of P pixels thickness to the input image of the convolutional layer 110). The convolutional layers 110 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on. The DNN 100 includes 16 convolutional layers 110. In other embodiments, the DNN 100 may include a different number of convolutional layers.
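The kernel size F, step S, and zero-padding P together determine the spatial size of the output. The commonly used relation, which is not stated in this disclosure and is given here as an assumption for a non-dilated convolution, is floor((W − F + 2P) / S) + 1 for an input of width W. The function name below is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis for a non-dilated convolution."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    padded = input_size + 2 * padding
    if kernel_size > padded:
        raise ValueError("kernel does not fit in the padded input")
    return (padded - kernel_size) // stride + 1

# A 7-wide input with a 3-wide kernel, step 1, no padding gives 5 outputs,
# matching the 7x7 IFM and 5x5 OFM sizes used in FIG. 1.
print(conv_output_size(7, 3, stride=1, padding=0))
```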

The pooling layers 120 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 120 is placed between two convolutional layers 110: a preceding convolutional layer 110 (the convolutional layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolutional layer 110 subsequent to the pooling layer 120 in the sequence of layers). In some embodiments, a pooling layer 120 is added after a convolutional layer 110, e.g., after an activation function (e.g., ReLU, etc.) has been applied to the OFM 160.

A pooling layer 120 receives feature maps generated by the preceding convolutional layer 110 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the DNN and avoids over-learning. The pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of two pixels, so that the pooling operation reduces each spatial dimension of a feature map by a factor of 2, i.e., the number of pixels or values in the feature map is reduced to one quarter. In an example, a pooling layer 120 applied to a feature map of 6×6 results in an output pooled feature map of 3×3. The output of the pooling layer 120 is inputted into the subsequent convolutional layer 110 for further feature extraction. In some embodiments, the pooling layer 120 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
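The 2×2, stride-two pooling example can be sketched as below. The function name pool2d, its mode argument, and the input values are hypothetical; the sketch covers a single 2D feature map with no padding.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over a single 2D feature map (no padding)."""
    height, width = feature_map.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    pooled = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = feature_map[y * stride:y * stride + size,
                                x * stride:x * stride + size]
            pooled[y, x] = reduce_fn(patch)
    return pooled

feature_map = np.arange(36.0).reshape(6, 6)
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
# A 6x6 map becomes 3x3: 36 values are reduced to 9, one quarter of the original.
print(max_pooled.shape, avg_pooled.shape)
```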

The fully-connected layers 130 are the last layers of the DNN. The fully-connected layers 130 may be convolutional or not. The fully-connected layers 130 receive an input operand. The input operand defines the output of the convolutional layers 110 and pooling layers 120 and includes the values of the last feature map generated by the last pooling layer 120 in the sequence. The fully-connected layers 130 apply a linear combination and an activation function to the input operand and generate a vector. The vector may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the elements sum to one. These probabilities are calculated by the last fully-connected layer 130 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function. In some embodiments, the fully-connected layers 130 multiply each input element by a weight, sum the products, and then apply an activation function (e.g., logistic if N=2, SoftMax if N>2, where N is the number of classes). This is equivalent to multiplying the input operand by the matrix containing the weights.
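The equivalence between the weighted sums of a fully-connected layer and a matrix product can be sketched as follows. The function names, the layer sizes (8 inputs, 3 classes), the bias term, and the random values are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to probabilities that sum to one."""
    shifted = z - np.max(z)  # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def fully_connected(x, weights, bias):
    """Weighted sum of the inputs for every output, written as a matrix product."""
    return weights @ x + bias

rng = np.random.default_rng(0)
x = rng.normal(size=8)             # flattened input operand (assumed length 8)
weights = rng.normal(size=(3, 8))  # one row of weights per class (assumed 3 classes)
bias = np.zeros(3)

probabilities = softmax(fully_connected(x, weights, bias))
print(probabilities, probabilities.sum())
```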

Example Convolution

FIG. 2 illustrates an example convolution, in accordance with various embodiments. The convolution may be a deep learning operation in a convolutional layer of a DNN, e.g., a convolutional layer 110 in FIG. 1. The convolution can be executed on an activation tensor 210 and filters 220 (individually referred to as “filter 220”). A filter, a portion of a filter, or a combination of multiple filters may be referred to as a weight tensor of the convolution. The result of the convolution is an output tensor 230. In some embodiments, the convolution is performed by a DNN accelerator. An example of the DNN accelerator may be the DNN accelerator 302 in FIG. 3. For instance, the convolution may be performed by the sparse cell array 370 in the DNN accelerator 302.

In the embodiments of FIG. 2, the activation tensor 210 includes activations (also referred to as “input activations,” “elements,” or “input elements”) arranged in a 3D matrix. The activation tensor 210 may also be referred to as an input tensor of the convolution. An input element is a data point in the activation tensor 210. The activation tensor 210 has a spatial size H_in×W_in×C_in, where H_in is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 2D matrix of each input channel), W_in is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 2D matrix of each input channel), and C_in is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels). For the purpose of simplicity and illustration, the activation tensor 210 has a spatial size of 7×7×3, i.e., the activation tensor 210 includes three input channels and each input channel has a 7×7 2D matrix. Each input element in the activation tensor 210 may be represented by an (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the activation tensor 210 may be different.

Each filter 220 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN. A filter 220 has a spatial size H_f×W_f×C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, C_f equals C_in. For the purpose of simplicity and illustration, each filter 220 in FIG. 2 has a spatial size of 2×3×3, i.e., the filter 220 includes 3 convolutional kernels, one per channel, each with a spatial size of 2×3. In other embodiments, the height, width, or depth of the filter 220 may be different. The spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the activation tensor 210.

An activation or weight may take one or more bytes in a memory. The number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has an INT8 format, the activation or weight takes one byte. When the activation or weight has an FP16 format, the activation or weight takes two bytes. Other data formats may be used for activations or weights.
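The effect of the data format on memory footprint can be checked with a short sketch. It assumes NumPy's int8 and float16 types as stand-ins for the INT8 and FP16 formats and reuses the 7×7×3 activation tensor size from FIG. 2.

```python
import numpy as np

shape = (7, 7, 3)  # activation tensor size used in FIG. 2

int8_tensor = np.zeros(shape, dtype=np.int8)
fp16_tensor = np.zeros(shape, dtype=np.float16)

# itemsize is the number of bytes per element; nbytes is the total for the tensor.
print(int8_tensor.itemsize, int8_tensor.nbytes)  # 1 byte per element, 147 bytes total
print(fp16_tensor.itemsize, fp16_tensor.nbytes)  # 2 bytes per element, 294 bytes total
```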

In the convolution, each filter 220 slides across the activation tensor 210 and generates a 2D matrix for an output channel in the output tensor 230. In the embodiments of FIG. 2, the 2D matrix has a spatial size of 5×5. The output tensor 230 includes activations (also referred to as “output activations,” “elements,” or “output elements”) arranged in a 3D matrix. An output activation is a data point in the output tensor 230. The output tensor 230 has a spatial size H_out×W_out×C_out, where H_out is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of output activations in a column in the 2D matrix of each output channel), W_out is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of output activations in a row in the 2D matrix of each output channel), and C_out is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of output channels). C_out may equal the number of filters 220 in the convolution. H_out and W_out may depend on the heights and widths of the activation tensor 210 and each filter 220.

As a part of the convolution, MAC operations can be performed on a 2×3×3 subtensor 215 (which is highlighted with a dotted pattern in FIG. 2) in the activation tensor 210 and each filter 220. The result of the MAC operations on the subtensor 215 and one filter 220 is an output activation. In some embodiments (e.g., embodiments where the convolution is an integer convolution), an output activation may include 8 bits, i.e., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution), an output activation may include more than one byte. For instance, an output activation may include two bytes.

After the MAC operations on the subtensor 215 and all the filters 220 are finished, a vector 235 is produced. The vector 235 is highlighted with slashes in FIG. 2. The vector 235 includes a sequence of output activations, which are arranged along the Z axis. The output activations in the vector 235 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates. The dimension of the vector 235 along the Z axis may equal the total number of output channels in the output tensor 230. After the vector 235 is produced, further MAC operations are performed to produce additional vectors till the output tensor 230 is produced.

In some embodiments, the MAC operations on a 2×3×3 subtensor (e.g., the subtensor 215) and a filter 220 may be performed by a plurality of MAC units. One or more MAC units may receive an input operand (e.g., an activation operand 217 shown in FIG. 2) and a weight operand (e.g., the weight operand 227 shown in FIG. 2). The activation operand 217 includes a sequence of activations having the same (x, y) coordinate but different z coordinates. The activation operand 217 includes an activation from each of the input channels in the activation tensor 210. The weight operand 227 includes a sequence of weights having the same (x, y) coordinate but different z coordinates. The weight operand 227 includes a weight from each of the channels in the filter 220. Activations in the activation operand 217 and weights in the weight operand 227 may be sequentially fed into a MAC unit. The MAC unit may receive an activation and a weight (“an activation-weight pair”) at a time and multiply the activation and the weight. The position of the activation in the activation operand 217 may match the position of the weight in the weight operand 227. The activation and weight may correspond to the same channel.
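As a non-limiting sketch, the MAC operations on one subtensor and a set of filters can be modeled in software as shown below. The nested-list layout (indexed as [y][x][channel]), the function names, and the example values are assumptions made for illustration; the sketch does not model how operands are distributed across MAC units in hardware.

```python
def mac(activation_operand, weight_operand):
    """Multiply matching activation-weight pairs and accumulate the products."""
    acc = 0
    for activation, weight in zip(activation_operand, weight_operand):
        acc += activation * weight
    return acc


def output_vector(subtensor, filters):
    """Compute one output activation per filter for a single subtensor.

    subtensor[y][x] is an activation operand (one activation per channel);
    filters[k][y][x] is the weight operand of filter k at the same (x, y).
    """
    vector = []
    for weights in filters:
        total = 0
        for y in range(len(subtensor)):
            for x in range(len(subtensor[0])):
                total += mac(subtensor[y][x], weights[y][x])
        vector.append(total)
    return vector


# Hypothetical 2x3x3 subtensor and two 2x3x3 filters.
ones = [[[1, 1, 1] for _ in range(3)] for _ in range(2)]
twos = [[[2, 2, 2] for _ in range(3)] for _ in range(2)]
print(output_vector(ones, [ones, twos]))  # [18, 36]
```

Each entry of the returned list corresponds to a different output channel at the same (X, Y) position, in the manner of the vector 235.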

Activations or weights may be floating-point numbers. Floating-point numbers may have various data formats, such as FP32, FP16, BF16, and so on. A floating-point number may be a positive or negative number with a decimal point. A floating-point number may be represented by a sequence of bits that includes one or more bits representing the sign of the floating-point number (e.g., positive or negative) , bits representing an exponent of the floating-point number, and bits representing a mantissa of the floating-point number. The mantissa is the part of a floating-point number that represents the significant digits of that number. The mantissa is multiplied by the base raised to the exponent to give the actual value of the floating-point number.
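For illustration, the sign, exponent, and mantissa fields of an FP16 number can be inspected with the sketch below. It assumes the IEEE 754 half-precision layout (1 sign bit, 5 exponent bits with a bias of 15, and 10 mantissa bits with an implicit leading one for normal numbers), which is a property of that format rather than something stated in this disclosure. The function names are hypothetical.

```python
import struct


def decode_fp16(value):
    """Split a number, rounded to FP16, into (sign, exponent, mantissa) fields."""
    bits = struct.unpack("<H", struct.pack("<e", value))[0]
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return sign, exponent, mantissa


def fp16_value(sign, exponent, mantissa):
    """Rebuild the value of a normal FP16 number (exponent field 1 to 30)."""
    return (-1) ** sign * (1 + mantissa / 1024) * 2 ** (exponent - 15)


print(decode_fp16(-6.5))        # (1, 17, 640)
print(fp16_value(1, 17, 640))   # -6.5
```

Here -6.5 equals -1.625×2^2, so the stored exponent field is 2+15=17 and the mantissa field is 0.625×1024=640.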

In some embodiments, the output activations in the output tensor 230 may be further processed based on one or more activation functions before they are stored or inputted into the next layer of the DNN. The processing based on the one or more activation functions may be at least part of the post processing of the convolution. In some embodiments, the post processing may include one or more other computations, such as offset computation, bias computation, and so on. The results of the post processing may be stored in a local memory of the compute block and be used as input to the next DNN layer. In some embodiments, the input activations in the activation tensor 210 may be results of post processing of the previous DNN layer.

Example DNN System

FIG. 3 is a block diagram of a DNN system 300, in accordance with various embodiments. The whole DNN system 300 or a part of the DNN system 300 may be implemented in one or more computing devices, such as the computing device 1200 in FIG. 12. The DNN system 300 can generate and execute DNNs, such as Transformer-based models, convolution-based models, and so on. As shown in FIG. 3, the DNN system 300 includes a DNN module 301 and a DNN accelerator 302. In other embodiments, alternative configurations, different or additional components may be included in the DNN system 300. For instance, the DNN system 300 may include multiple DNN modules or multiple DNN accelerators. Further, functionality attributed to a component of the DNN system 300 may be accomplished by a different component included in the DNN system 300 or a different system. In some embodiments, the DNN module 301 and DNN accelerator 302 may include different types of processing units. In an example, the DNN module 301 may be implemented by one or more central processing units (CPUs) . The DNN accelerator 302 may also be referred to as an AI accelerator or an AI processor. The DNN module 301 and DNN accelerator 302 may be implemented in the same chip or separate chips.

The DNN module 301 facilitates generation and deployment of DNNs. In some embodiments, the DNN module 301 may generate and train DNNs. For instance, the DNN module 301 can define the layered architecture of a DNN. The DNN module 301 can also determine the internal parameters of the DNN through a DNN training process. The DNN module 301 may also determine one or more hyperparameters that define how the DNN is trained. An example hyperparameter is a sparsity ratio that defines the sparsity level of one or more deep learning tensors for the DNN.

The DNN module 301 may also compress DNNs, e.g., during or after training. In some embodiments, the DNN module 301 may prune weights in one or more layers of a DNN by changing nonzero-valued weights to zeros. The DNN module 301 may prune weights based on a target weight sparsity ratio. A weight sparsity ratio may be the ratio of the number of zero-valued weights to the total number of weights. In an example where the DNN module 301 prunes weights during DNN training, the DNN module 301 may prune weights of a layer to achieve a target sparsity ratio after one or more epochs. The DNN module 301 may prevent the pruned weights from changing values during the rest of the training process. Alternatively, the DNN module 301 may allow the pruned weights to change values so that a pruned, zero-valued weight may have a nonzero value after further training. The DNN module 301 may prune weights of the layer again after one or more additional epochs.
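A minimal sketch of pruning toward a target weight sparsity ratio is given below. The disclosure does not specify which weights are selected for pruning; the sketch assumes a magnitude-based criterion (the smallest absolute values are zeroed first), and the function names and example weights are hypothetical.

```python
def sparsity_ratio(weights):
    """Ratio of the number of zero-valued weights to the total number of weights."""
    return sum(1 for w in weights if w == 0) / len(weights)


def prune_to_target(weights, target_ratio):
    """Zero the smallest-magnitude weights until the target ratio is reached.

    The magnitude criterion is an assumption made for this example.
    """
    num_zeros = int(round(target_ratio * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_zeros]:
        pruned[i] = 0.0
    return pruned


weights = [0.5, -0.1, 0.0, 0.9, -0.3, 0.05, 0.7, -0.02]
pruned = prune_to_target(weights, target_ratio=0.5)
print(pruned)                  # [0.5, 0.0, 0.0, 0.9, -0.3, 0.0, 0.7, 0.0]
print(sparsity_ratio(pruned))  # 0.5
```

Weights that are already zero count toward the target, so repeating the pruning after further training only zeroes as many additional weights as needed.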

The DNN module 301 may deploy trained, compressed, or validated DNNs for use in deep learning applications. In some embodiments, the DNN module 301 may distribute trained, compressed, or validated DNNs to devices or systems which may use the DNNs to perform tasks (e.g., image classification, motion planning, etc. ) for which the DNNs were trained. In other embodiments, the DNN module 301 may facilitate deployment of the DNNs using the DNN accelerator 302. For instance, the DNN module 301 may receive data from a device or system coupled with the DNN system 300 and input the received data (or data generated by the DNN module 301, e.g., based on the received data) into a DNN. The DNN module 301 may generate instructions (e.g., configuration files) that control the operation of the DNN accelerator 302 during the DNN execution. The DNN module 301 may receive an output of the DNN from the DNN accelerator 302. The DNN module 301 may transmit the output of the DNN (or a result of processing the output of the DNN by the DNN module 301) to the device or system.

The DNN module 301 may control execution processes of trained, compressed, or validated DNNs. In some embodiments, the DNN module 301 facilitates mapping matrix multiplications to convolutions that can be executed by the DNN accelerator 302 so that the outputs of the matrix multiplications (i.e., MatMul output tensors) can be computed by carrying out the convolutions as opposed to directly carrying out the matrix multiplications. For instance, the DNN module 301 may convert input tensors of matrix multiplication to activation tensors and weight tensors of convolutions. The DNN module 301 may cause the DNN accelerator 302 to execute convolutions on the activation tensors and weight tensors and compute convolution output tensors. The DNN module 301 may receive convolutional output tensors from the DNN accelerator and convert convolutional output tensors to MatMul output tensors. Certain aspects of the DNN module 301 are provided below in conjunction with FIG. 6.

The DNN accelerator 302 executes DNNs provided by the DNN module 301. For instance, the DNN accelerator 302 can execute a DNN by running deep learning operations in the DNN. The process of carrying out a deep learning operation is also referred to as a process of executing the deep learning operation or performing the deep learning operation. The execution of the DNN may be for training the DNN or for using the DNN to perform AI tasks. In some embodiments, the DNN accelerator 302 includes components designed for optimal efficiency in running convolution-based DNNs. As shown in FIG. 3, the DNN accelerator 302 includes a memory 310, a DMA (direct memory access) engine 320, and compute blocks 330 (individually referred to as “compute block 330” ) . In other embodiments, alternative configurations, different or additional components may be included in the DNN accelerator 302. For example, the DNN accelerator 302 may include more than one memory 310 or DMA engine 320. As another example, the DNN accelerator 302 may include a single compute block 330. Further, functionality attributed to a component of the DNN accelerator 302 may be accomplished by a different component included in the DNN accelerator 302 or by a different system. A component of the DNN accelerator 302 may be implemented in hardware, software, firmware, or some combination thereof.

The memory 310 stores data associated with deep learning operations performed by the DNN accelerator 302. In some embodiments, the memory 310 may store data to be used by the compute blocks 330 for DNN execution. The memory 310 may store weights, such as weights of convolutional layers, which are determined by training DNNs. The memory 310 may further store inputs to DNN layers or outputs of DNN layers, such as data generated by the compute blocks 330 from performing deep learning operations in DNNs. Example deep learning operations include convolutional operations, matrix multiplication operations, pooling operations, elementwise operations, activation functions, other types of deep learning operations, or some combination thereof. The memory 310 may be a main memory of the DNN accelerator 302. In some embodiments, the memory 310 includes one or more dynamic random-access memories (DRAMs) .

The DMA engine 320 facilitates data transfer between the memory 310 and local memories of the compute blocks 330. For example, the DMA engine 320 can read data from the memory 310 and write data into a local memory of a compute block 330. As another example, the DMA engine 320 can read data from a local memory of a compute block 330 and write data into the memory 310. The DMA engine 320 provides a DMA feature that allows the compute block 330 to initiate data transfer between the memory 310 and the local memories of the compute blocks 330 and to perform other operations while the data transfer is being conducted. In some embodiments, the DMA engine 320 may read tensors from the memory 310 and modify the tensors in a way that is optimized for the compute block 330 before writing the tensors into the local memories of the compute blocks 330.

The compute blocks 330 can perform deep learning operations in DNNs. For instance, a compute block 330 may execute a DNN layer by running one or more deep learning operations in the DNN layer. A compute block 330 may execute a layer, or a portion of a layer, at a time. The compute blocks 330 may be capable of running various types of deep learning operations, such as convolution, pooling, elementwise operation, linear operation, nonlinear operation, and so on. In an example, a compute block 330 may perform convolutions, e.g., standard convolution or depthwise convolution. In some embodiments, the compute block 330 receives an input tensor and one or more convolutional kernels and performs a convolution with the input tensor and convolutional kernels. The result of the convolution may be an output tensor, which can be further computed, e.g., by the compute block 330 or another compute block 330. In some embodiments, the operations of the DNN layers may be run by multiple compute blocks 330 in parallel. For instance, multiple compute blocks 330 may each perform a portion of a workload for a convolution. Data may be shared between the compute blocks 330. A compute block 330 may also be referred to as a compute tile. In some embodiments, each compute block 330 may be a processing unit.

In the embodiments of FIG. 3, each compute block 330 includes a local memory 340, a sparsity mode module 350, a load module 360, a sparse cell array 370 (also referred to as a data processing unit) , and a drain module 380. Some or all the components of the compute block 330 can be implemented on the same chip. In other embodiments, alternative configurations, different or additional components may be included in the compute block 330. Further, functionality attributed to a component of the compute block 330 may be accomplished by a different component included in the compute block 330, a different compute block 330, another component of the DNN accelerator 302, or a different system. A component of the compute block 330 may be implemented in hardware, software, firmware, or some combination thereof.

The local memory 340 is local to the corresponding compute block 330. In the embodiments of FIG. 3, the local memory 340 is inside the compute block 330. In other embodiments, the local memory 340 may be outside the compute block 330. Data in the local memory 340 may be transferred to or from the memory 310, e.g., through the DMA engine 320. In some embodiments, data in the local memory 340 may be transferred to or from the local memory of another compute block 330. The local memory 340 may store data received, used, or generated by the sparsity mode module 350, the load module 360, the sparse cell array 370, or the drain module 380. Examples of the data may include input activations, weights, output activations, sparsity bitmaps, and so on.

In some embodiments, the local memory 340 may store dense tensors (e.g., dense activation tensors, dense weight tensors, etc. ) , sparse tensors (e.g., sparse activation tensors, sparse weight tensors, etc. ) , and so on. A dense tensor may be a tensor from which zero-valued elements (if any) are not removed. A dense tensor may be converted to a sparse tensor by removing one or more zero-valued elements in the dense tensor. A sparse tensor may also be referred to as a compressed tensor or packed tensor. The process of converting a dense tensor to a sparse tensor may be referred to as sparsity encoding. Sparsity encoding may also generate a sparsity tensor. Each element in the sparsity tensor may correspond to a different element in the dense tensor and indicate whether the element in the dense tensor is zero or not. The sparsity tensor may indicate positions of elements of the sparse tensor in the dense tensor. The sparsity tensor may be a sparsity bitmap, each element of which is a bit. A sparse tensor may be converted to a dense tensor through a densifying process, in which one or more zeros may be added to the sparse tensor based on the sparsity tensor.
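The sparsity encoding and densifying processes described above can be sketched as follows. The sketch removes every zero-valued element (the disclosure permits removing only some of them) and uses a bitmap in which a 1 marks a nonzero element. The flat-list representation and the function names are assumptions made for illustration.

```python
def sparsity_encode(dense):
    """Convert a dense tensor to a sparse tensor plus a sparsity bitmap."""
    bitmap = [1 if value != 0 else 0 for value in dense]
    sparse = [value for value in dense if value != 0]
    return sparse, bitmap


def densify(sparse, bitmap):
    """Re-insert zeros into a sparse tensor at the positions the bitmap marks."""
    values = iter(sparse)
    return [next(values) if bit else 0 for bit in bitmap]


dense = [0, 3, 0, 0, 7, 1, 0, 2]
sparse, bitmap = sparsity_encode(dense)
print(sparse)                   # [3, 7, 1, 2]
print(bitmap)                   # [0, 1, 0, 0, 1, 1, 0, 1]
print(densify(sparse, bitmap))  # [0, 3, 0, 0, 7, 1, 0, 2]
```

In this example the sparse tensor stores four elements instead of eight, at the cost of one bitmap bit per element of the dense tensor.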

In some embodiments, the local memory 340 includes one or more static random-access memories (SRAMs). The local memory 340 may be byte-addressable, and each memory address identifies a single byte (eight bits) of storage. In some embodiments, the local memory 340 may include memory banks. The number of memory banks in the local memory 340 may be 16, 64, 128, 256, 512, 1024, 2048, or other numbers. A memory bank may include a plurality of storage units. In an example, a memory bank may include 8, 16, 64, or a different number of storage units. A memory bank or a storage unit in a memory bank may have a memory address. In an example, a storage unit may store a single byte, and data larger than a single byte may be stored in storage units with consecutive memory addresses, i.e., adjacent storage units. For instance, a single storage unit can store an integer number in the INT8 format, whereas two storage units may be needed to store a number in the FP16 or BF16 format, which has 16 bits. In some embodiments, 16 bits can be transferred from the local memory 340 in a single read cycle. In other embodiments, 16 bits can be transferred from the local memory 340 in multiple read cycles, such as two cycles.

The sparsity mode module 350 determines sparsity modes in which the compute block 330 operates to execute DNN layers. For instance, the sparsity mode module 350 may determine whether to accelerate a layer based on weight sparsity, activation sparsity, or both. The sparsity mode module 350 may select the sparsity mode for a layer from a group of sparsity modes that includes, for example, a combined sparsity mode in which the layer is accelerated based on both weight sparsity and activation sparsity, an activation sparsity mode in which the layer is accelerated based on activation sparsity but not based on weight sparsity, a weight sparsity mode in which the layer is accelerated based on weight sparsity but not based on activation sparsity, and a dense mode in which the layer is not accelerated based on sparsity. In some embodiments (e.g., embodiments where a layer is executed by multiple compute blocks 330), the sparsity mode module 350 may determine the sparsity mode for all the compute blocks 330 that execute the layer. In some embodiments, the sparsity mode module 350 may receive configuration parameters from the DNN module 301. A configuration parameter may correspond to a layer and indicate whether to accelerate the layer based on weight sparsity. The sparsity mode module 350 may determine the sparsity mode of the layer based on the configuration parameter.

The load module 360 loads data from the local memory 340 to the sparse cell array 370. The load module 360 may read tensors from the local memory 340. The tensors may include sparse activation tensors, sparse weight tensors, activation sparsity tensors, weight sparsity tensors, and so on. In some embodiments, the load module 360 may load data based on the sparsity mode determined by the sparsity mode module 350. The load module 360 may select different data to transmit to the sparse cell array 370 in different sparsity modes. For instance, the load module 360 may transmit an activation sparsity tensor and a weight sparsity tensor of a layer to the sparse cell array 370 in the combined sparsity mode, may transmit the activation sparsity tensor but not the weight sparsity tensor to the sparse cell array 370 in the activation sparsity mode, and may transmit the weight sparsity tensor but not the activation sparsity tensor to the sparse cell array 370 in the weight sparsity mode. In the dense mode, the load module 360 transmits neither the activation sparsity tensor nor the weight sparsity tensor to the sparse cell array 370.

In some embodiments, the load module 360 may process (e.g., densify) data stored in the local memory 340 before providing the data to the sparse cell array 370. In an example, the load module 360, while operating in the weight sparsity mode, may densify sparse activation tensors to generate dense activation tensors based on corresponding activation sparsity tensors. For instance, the load module 360 may add one or more zeros into a sparse activation tensor based on an activation sparsity tensor associated with the sparse activation tensor to generate the dense activation tensor. The dense activation tensor includes one or more elements more than the sparse activation tensor. The additional element(s) are zero-valued. The load module 360 may identify one or more elements in the activation sparsity tensor that correspond to the zero-valued element(s), determine the position of each of the zero-valued element(s) in the dense activation tensor, and insert the zero-valued element(s) into the sparse activation tensor based on the determined positions. After the densification, the load module 360 may transmit the dense activation tensors to the sparse cell array 370. The load module 360 may also transmit corresponding sparse weight tensors and weight sparsity tensors to the sparse cell array 370. The activation sparsity tensors of the dense activation tensors may not be loaded to the sparse cell array 370.

In another example, the load module 360, while operating in the activation sparsity mode, may densify sparse weight tensors to generate dense weight tensors based on corresponding weight sparsity tensors by inserting zeros into the sparse weight tensors. The densification of sparse weight tensors may be similar to the densification of sparse activation tensors described above. After the densification, the load module 360 may transmit the dense weight tensors to the sparse cell array 370. The load module 360 may also transmit corresponding sparse activation tensors and activation sparsity tensors to the sparse cell array 370. The weight sparsity tensors of the dense weight tensors may not be loaded to the sparse cell array 370.

In yet another example, the load module 360, while operating in the dense mode, may densify both sparse weight tensors and sparse activation tensors. The load module 360 may generate the input tensor and weight tensor of the layer and transmit the tensors to the sparse cell array 370 for executing the layer without sparsity acceleration.
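The behavior of the load module 360 in the four sparsity modes can be summarized by the lookup sketched below. The mode names, the LoadPlan type, and its field names are hypothetical labels introduced for this illustration. The entry for the combined sparsity mode assumes that neither tensor is densified, which the description implies but does not state outright.

```python
from typing import NamedTuple


class LoadPlan(NamedTuple):
    densify_activations: bool       # insert zeros into sparse activation tensors
    densify_weights: bool           # insert zeros into sparse weight tensors
    send_activation_sparsity: bool  # transmit the activation sparsity tensor
    send_weight_sparsity: bool      # transmit the weight sparsity tensor


LOAD_PLANS = {
    "combined":   LoadPlan(False, False, True,  True),
    "activation": LoadPlan(False, True,  True,  False),
    "weight":     LoadPlan(True,  False, False, True),
    "dense":      LoadPlan(True,  True,  False, False),
}

for mode, plan in LOAD_PLANS.items():
    print(mode, plan)
```

In every mode of this sketch, a tensor is densified exactly when its sparsity tensor is not transmitted to the sparse cell array.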

The sparse cell array 370 may include one or more sparse cells. Each sparse cell may include one or more MAC units that can perform MAC operations. The MAC units in a sparse cell may be arranged in an array that includes rows and columns. The sparse cells may be arranged in one or more rows and one or more columns in the sparse cell array 370. All the MAC units in the sparse cell array 370 may constitute a bigger array that includes more rows and columns. In some embodiments (e.g., embodiments where the compute block 330 executes a convolutional layer) , a computation in an MAC unit may be an MAC operation on an activation operand and a weight operand. The activation operand may be an activation tensor that may include one or more activations in the input tensor of the convolution. Different activations may be in different input channels. The weight operand may be a weight tensor that may include one or more weights in the filter of the convolution. The values of the weights are determined through training the DNN. The weights in the weight operand may be in different input channels.

In some embodiments, an MAC unit includes one or more multipliers for performing multiplications. An MAC unit may also include one or more accumulators ( “adders” ) for performing accumulations. A column of MAC units is referred to as an MAC column. An MAC column may be associated with one or more MAC lanes. An MAC lane is a path for loading data e.g., by the load module 360, into an MAC column. An MAC lane may be also referred to as a data transmission lane or data loading lane. An MAC column may have multiple MAC lanes. The loading bandwidth of the MAC column is an aggregation of the loading bandwidths of all the MAC lanes associated with the MAC column. With a certain number of MAC lanes, data can be fed into the same number of independent MAC units simultaneously. In some embodiments where an MAC column has four MAC lanes for feeding activations or weights into the MAC column and each MAC lane may have a bandwidth of 16 bytes, the four MAC lanes can have a total loading bandwidth of 64 bytes.

In some embodiments, the sparse cell array 370 may be capable of depthwise convolution, standard convolution, or both. In a depthwise convolution, an MAC unit may perform an MAC operation that includes a sequence of multiplications for an input operand and a weight operand. Each multiplication in the sequence (also referred to as a cycle) is a multiplication of a different activation in the input operand with a different weight in the weight operand. The activation and weight in the same cycle may correspond to the same channel. The sequence of multiplication produces a product operand that includes a sequence of products. The MAC operation may also include accumulations in which multiple product operands are accumulated to produce an output operand of the MAC unit. The sparse cell array 370 may output multiple output operands at a time, each of which is generated by a different MAC unit. In a standard convolution, MAC operations may include accumulations across the channels. For instance, as opposed to generating an output operand, a MAC unit may accumulate products across different channels to generate a single output point.
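The difference between the two accumulation patterns can be sketched as follows. Each operand is modeled as a list with one value per channel; this layout, the function names, and the example values are assumptions made for illustration and do not describe the hardware data path.

```python
def depthwise_mac(activation_operands, weight_operands):
    """Accumulate product operands channel by channel.

    Returns an output operand with one value per channel.
    """
    num_channels = len(activation_operands[0])
    output_operand = [0] * num_channels
    for activations, weights in zip(activation_operands, weight_operands):
        for channel in range(num_channels):
            output_operand[channel] += activations[channel] * weights[channel]
    return output_operand


def standard_mac(activation_operands, weight_operands):
    """Accumulate across channels as well, producing a single output point."""
    return sum(depthwise_mac(activation_operands, weight_operands))


activations = [[1, 2, 3], [4, 5, 6]]
weights = [[1, 0, 2], [2, 1, 0]]
print(depthwise_mac(activations, weights))  # [9, 5, 6]
print(standard_mac(activations, weights))   # 20
```

The depthwise result keeps the channels separate, whereas the standard result collapses them into one output point.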

In some embodiments, the sparse cell array 370 may perform MAC operations in quantized deep learning operations, such as MAC operations in a quantized convolution. In some embodiments, an MAC unit in the sparse cell array 370 may receive quantized activations and quantized weights and compute a quantized MAC result. The quantized MAC result may be a quantized value in an integer format and may be the output of the MAC unit. In some embodiments, the MAC unit may also include a quantization multiplier that can multiply a quantization scale with the quantized MAC result, and the output of the MAC unit may be a real value in a floating-point format. The MAC unit may include no quantization subtractors as zero-point offsetting is not needed for the MAC operations in quantized deep learning operations.
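A minimal sketch of a quantized MAC operation without zero-point offsetting is shown below. It assumes that the quantization scale applied by the quantization multiplier is the product of a per-tensor activation scale and a per-tensor weight scale, which is a common convention and not a detail taken from this disclosure. The function names and values are hypothetical.

```python
def quantized_mac(q_activations, q_weights):
    """Integer multiply-accumulate on quantized values (no zero-point subtraction)."""
    return sum(a * w for a, w in zip(q_activations, q_weights))


def apply_scale(q_result, activation_scale, weight_scale):
    """Multiply the integer MAC result by a quantization scale to get a real value."""
    return q_result * (activation_scale * weight_scale)


q_result = quantized_mac([10, -3, 7], [2, 5, -1])
print(q_result)                           # -2
print(apply_scale(q_result, 0.5, 0.25))   # -0.25
```

Because no zero point is subtracted, the integer accumulation needs only multipliers and accumulators, and a single multiplication by the scale recovers the real-valued result.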

In some embodiments, the sparse cell array 370 may include sparsity acceleration logic for facilitating sparsity acceleration. For instance, each sparse cell in the sparse cell array 370 may include one or more sparsity modules. In an example, each MAC column or each MAC row may have a corresponding sparsity module that accelerates MAC operations in the MAC column or MAC row. In some embodiments, a sparsity module accelerates computations in the sparse cell array 370 based on sparsity in activations, sparsity in weights, or both. The sparsity module may include a storage unit that stores a sparsity tensor, which may be loaded to the storage unit by the load module 360. The sparsity tensor may be an activation sparsity tensor, a weight sparsity tensor, or a combined sparsity tensor.

An activation sparsity tensor may be the sparsity tensor of an activation tensor and has the same number of elements as the activation tensor. An element in the activation sparsity tensor may indicate whether the corresponding element in the activation tensor is zero or not. For instance, a zero-valued element in the activation sparsity tensor may indicate that the corresponding element in the activation tensor is zero. A one-valued element in the activation sparsity tensor may indicate that the corresponding element in the activation tensor is nonzero. A weight sparsity tensor may be the sparsity tensor of a weight tensor and has the same number of elements as the weight tensor. An element in the weight sparsity tensor may indicate whether the corresponding element in the weight tensor is zero or not. For instance, a zero-valued element in the weight sparsity tensor may indicate that the corresponding element in the weight tensor is zero. A one-valued element in the weight sparsity tensor may indicate that the corresponding element in the weight tensor is nonzero. The sparsity module may generate a combined sparsity tensor using an activation sparsity tensor and a weight sparsity tensor. For instance, the sparsity module may multiply an element of the activation sparsity tensor with a corresponding element of the weight sparsity tensor to compute an element of the combined sparsity tensor. The positions of the three elements in their corresponding sparsity tensors may match. In some embodiments, each element in a sparsity tensor may be a bit, and the sparsity tensor may be referred to as a sparsity bitmap.

The sparsity module may use the sparsity tensor to identify activations and weights to be used in MAC operations by the MAC units. In an embodiment where the sparse cell array 370 operates in the combined sparsity mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of a combined sparsity tensor. In an embodiment where the sparse cell array 370 operates in the activation sparsity mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of an activation sparsity tensor. In an embodiment where the sparse cell array 370 operates in the weight sparsity mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of a weight sparsity tensor. The sparsity module may be bypassed in the dense mode as no sparsity acceleration would be conducted.
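The use of sparsity bitmaps to skip multiplications can be sketched as follows. For simplicity the activations and weights are held in dense lists and the bitmap selects which activation-weight pairs are multiplied; the function names and example values are assumptions, and the sketch does not model how hardware fetches operands from sparse tensors.

```python
def combined_bitmap(activation_bitmap, weight_bitmap):
    """Bitwise product: a pair is kept only if both elements are nonzero."""
    return [a & w for a, w in zip(activation_bitmap, weight_bitmap)]


def sparse_mac(activations, weights, bitmap):
    """Accumulate only the pairs flagged by the bitmap.

    Returns the MAC result and the number of multiplications performed.
    """
    acc = 0
    multiplications = 0
    for activation, weight, bit in zip(activations, weights, bitmap):
        if bit:
            acc += activation * weight
            multiplications += 1
    return acc, multiplications


activations = [0, 3, 0, 2, 5, 0]
weights = [4, 0, 0, 1, 2, 6]
activation_bitmap = [1 if a != 0 else 0 for a in activations]
weight_bitmap = [1 if w != 0 else 0 for w in weights]
both = combined_bitmap(activation_bitmap, weight_bitmap)

print(both)                                                # [0, 0, 0, 1, 1, 0]
print(sparse_mac(activations, weights, both))              # (12, 2)
print(sparse_mac(activations, weights, activation_bitmap)) # (12, 3)
print(sparse_mac(activations, weights, [1] * 6))           # (12, 6)
```

All three modes give the same result in this example; they differ only in how many multiplications are skipped.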

The drain module 380 drains data from the sparse cell array 370 and writes the data to the local memory 340. The data may be outputs of MAC operations performed by MAC units in the sparse cell array 370. In some embodiments, the drain module 380 may drain data on a cell level. For each sparse cell, the drain module 380 may drain outputs of MAC units in the sparse cell based on a row index or column index of each MAC unit. For instance, the drain module 380 may use a sequence of cycles to drain data from a sparse cell. The drain module 380 may drain the output of some of the MAC units in each cycle. The sequence of the cycles may be configured based on a configuration parameter indicating the operation mode of the load module 360.

In some embodiments, the drain module 380 may determine whether to drain the output of an MAC unit based on the column index of the MAC unit when the load module 360 operates in the activation sparsity mode versus based on the row index of the MAC unit when the load module 360 operates in the weight sparsity mode. For instance, for MAC operations where the load module 360 operates in the activation sparsity mode, the drain module 380 may drain the output of a different MAC column in each cycle. The sequence of cycles may start with the first MAC column (e.g., the MAC column on the left side of the sparse cell) and end with the last MAC column (e.g., the MAC column on the right side of the sparse cell). For MAC operations where the load module 360 operates in the weight sparsity mode, the drain module 380 may drain the output of a different MAC row in each cycle. The sequence of cycles may start with the first MAC row (e.g., the MAC row at the top of the sparse cell) and end with the last MAC row (e.g., the MAC row at the bottom of the sparse cell). In other embodiments, the drain module 380 may determine whether to drain the output of an MAC unit based on the row index of the MAC unit when the load module 360 operates in the activation sparsity mode versus based on the column index of the MAC unit when the load module 360 operates in the weight sparsity mode.

The drain module 380 may also include sparsity encoding logic that can convert outputs of the sparse cell array 370 from a dense format to a sparse format. For instance, the drain module 380 may be implemented with one or more sparsity encoders. A sparsity encoder converts dense data to compressed data based on sparsity in the dense data. For instance, the sparsity encoder may remove zeros in an activation tensor computed by the sparse cell array 370 to convert the activation tensor to a compressed activation tensor. The sparsity encoder may also generate sparsity tensors, including activation sparsity tensors.

In some embodiments, the data drained from the sparse cell array 370 may be at least part of an output tensor (e.g., the output tensor 230 in FIG. 2) of a deep learning operation. The sparsity encoder may generate a compressed version of the output tensor. The sparsity encoder may identify every zero-valued activation in the output tensor and remove these activations from the output tensor to generate a compressed activation tensor (aka “sparse activation tensor” ) . The sparsity encoder may also generate one or more sparsity tensors for the output tensor. A sparsity tensor may correspond to a portion of the output tensor (e.g., the vector 235 in FIG. 2) . The sparsity tensor may include sparsity elements (e.g., bits) , each of which corresponds to a different activation in the vector and indicates whether the corresponding activation is zeroed or not.
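The encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the accelerator's logic: the function names `sparsity_encode` and `sparsity_decode` are hypothetical, and the polarity of the sparsity bits (True meaning nonzero) is an assumption.

```python
import numpy as np

def sparsity_encode(dense):
    """Split a dense vector into its nonzero values and a sparsity bitmap."""
    dense = np.asarray(dense)
    bitmap = dense != 0            # assumption: True marks a nonzero activation
    compressed = dense[bitmap]     # zero-valued activations are dropped
    return compressed, bitmap

def sparsity_decode(compressed, bitmap):
    """Rebuild the dense vector from the compressed values and the bitmap."""
    dense = np.zeros(bitmap.shape, dtype=compressed.dtype)
    dense[bitmap] = compressed
    return dense

vector = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 3.5])
compressed, bitmap = sparsity_encode(vector)
restored = sparsity_decode(compressed, bitmap)
```

The bitmap has one element per activation in the original vector, so the dense vector can be recovered exactly from the compressed values.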

The drain module 380 may write the compressed activation tensor and the one or more sparsity tensors into the local memory 340. The sparse activation tensor and the one or more sparsity tensors may be further loaded to the memory 310, e.g., through the DMA engine 320. Additionally or alternatively, the sparse activation tensor and the one or more sparsity tensors may be loaded by the load module 360 to the sparse cell array for further computation, e.g., for performing a deep learning operation in the next layer.

Example Matrix Multiplication

FIG. 4 illustrates an example matrix multiplication, in accordance with various embodiments. For the purpose of illustration, the matrix multiplication is performed on an input matrix 410 and another input matrix 420. The result of the matrix multiplication is an output matrix 430. The input matrix 410 has a spatial size of 2×4, i.e., the input matrix 410 has eight elements arranged in two rows and four columns. The input matrix 420 has a spatial size of 4×3, i.e., the input matrix 420 has twelve elements arranged in four rows and three columns. The output matrix 430 has a spatial size of 2×3, i.e., the output matrix 430 has six elements arranged in two rows and three columns. In other embodiments, the input matrix 410, input matrix 420, or output matrix 430 may have a different size or shape.

An element in the output matrix 430 may be computed by multiplying a vector in the input matrix 410 and a vector in the input matrix 420. For instance, the first element in the output matrix 430 is computed by multiplying the first row in the input matrix 410 and the first column in the input matrix 420. The computation of the elements in the output matrix 430 may be denoted as:

c₁₁=a₁₁b₁₁+a₁₂b₂₁+a₁₃b₃₁+a₁₄b₄₁,

c₁₂=a₁₁b₁₂+a₁₂b₂₂+a₁₃b₃₂+a₁₄b₄₂,

c₁₃=a₁₁b₁₃+a₁₂b₂₃+a₁₃b₃₃+a₁₄b₄₃,

c₂₁=a₂₁b₁₁+a₂₂b₂₁+a₂₃b₃₁+a₂₄b₄₁,

c₂₂=a₂₁b₁₂+a₂₂b₂₂+a₂₃b₃₂+a₂₄b₄₂, and

c₂₃=a₂₁b₁₃+a₂₂b₂₃+a₂₃b₃₃+a₂₄b₄₃.
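As a quick check of the six sums above, the following sketch computes each output element with an explicit loop. The element values are arbitrary placeholders, since FIG. 4 does not specify them.

```python
import numpy as np

A = np.arange(1, 9).reshape(2, 4)      # stands in for input matrix 410 (2x4)
B = np.arange(1, 13).reshape(4, 3)     # stands in for input matrix 420 (4x3)

C = np.zeros((2, 3), dtype=A.dtype)    # stands in for output matrix 430 (2x3)
for i in range(2):
    for j in range(3):
        # c_ij = a_i1*b_1j + a_i2*b_2j + a_i3*b_3j + a_i4*b_4j
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(4))
```

The loop result agrees with NumPy's built-in `A @ B`.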

In the embodiments of FIG. 4, the input matrix 410, input matrix 420, and output matrix 430 each have two dimensions. In some embodiments, the two dimensions of the input matrix 410 may be the last two dimensions of a first input tensor of a matrix multiplication operation, the two dimensions of the input matrix 420 may be the last two dimensions of a second input tensor of the matrix multiplication operation, and the two dimensions of the output matrix 430 may be the last two dimensions of the output tensor of the matrix multiplication operation. The input matrix 410, input matrix 420, or output matrix 430 may have one or more other dimensions. For the other dimensions, one or more broadcasting operations may be performed to facilitate the matrix multiplication operation.

FIGS. 5A-5D illustrate broadcasting in matrix multiplication operations, in accordance with various embodiments. Broadcasting (e.g., NumPy broadcasting) may be applied when input tensors have different shapes. In some embodiments, a broadcasting operation may match the dimensions of differently shaped input tensors in order to perform matrix multiplication on the input tensors. For instance, the smaller tensor may be broadcast across the larger tensor so that they have compatible shapes. To match the shapes of multiple tensors, broadcasting may be applied on at least one of the tensors. For instance, broadcasting may be applied on one or both tensors to make the shape of one tensor match the shape of the other tensor.

In some embodiments, a condition for applying broadcasting to match the shapes of two tensors is that the trailing dimensions of the two tensors are compatible. The trailing dimension of a tensor may be the rightmost dimension of the tensor. Multiple trailing dimensions may be compatible when the trailing dimensions are equal or when one of the trailing dimensions is one-sized. In an example where the first tensor has a size of M×N, and the second tensor has a size of m×n, the trailing dimensions of the two tensors are N and n, respectively. When N=n or when either N or n is one, broadcasting may be applied on at least one of the tensors to match the shapes of the tensors.
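The compatibility condition can be written as a short shape check. The sketch below follows the NumPy rule, which applies the same test to every dimension from right to left rather than to the rightmost dimension alone; the helper name `broadcast_shape` is hypothetical.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape, or None if the shapes are incompatible."""
    result = []
    # Walk both shapes from the trailing (rightmost) dimension to the left;
    # a missing leading dimension is treated as one-sized.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            return None            # sizes differ and neither is one
    return tuple(reversed(result))
```

For example, `broadcast_shape((4, 1), (1, 3))` yields `(4, 3)`, while `broadcast_shape((4, 3), (1, 4))` yields `None`.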

FIG. 5A shows a tensor 510A and a tensor 520A. For the purpose of illustration and simplicity, FIG. 5A shows one dimension of each of the tensors 510A and 520A. The size of the tensor 510A in the dimension is three, as the tensor 510A has three elements in the dimension. The size of the tensor 520A in the dimension is one, as the tensor 520A has one element in the dimension. The shape of the tensor 510A may be denoted as 1×3. The shape of the tensor 520A may be denoted as 1×1. To match the shapes of the tensors 510A and 520A, a broadcasting operation may be performed on the tensor 520A to increase the size of the tensor 520A from one to three. As shown in FIG. 5A, the element is duplicated to generate two new elements that are represented by the dashed boxes in FIG. 5A. After the broadcasting, the size of the tensor 520A in the dimension becomes three so that the post-broadcasting shape of the tensor 520A can match the shape of the tensor 510A.

FIG. 5B shows a tensor 510B and a tensor 520B. For the purpose of illustration and simplicity, FIG. 5B shows two dimensions of each of the tensors 510B and 520B. The shape of the tensor 510B is 4×3, as the tensor 510B has four rows and three columns. The shape of the tensor 520B is 1×3, as the tensor 520B has one row and three columns. To match the shapes of the tensors 510B and 520B, a broadcasting operation may be performed on the tensor 520B to change the shape of the tensor 520B from 1×3 to 4×3. As shown in FIG. 5B, the three elements in the tensor 520B are duplicated to generate three new rows that are represented by the dashed boxes in FIG. 5B. After the broadcasting, the shape of the tensor 520B matches the shape of the tensor 510B.

FIG. 5C shows a tensor 510C and a tensor 520C. For the purpose of illustration and simplicity, FIG. 5C shows two dimensions of each of the tensors 510C and 520C. The shape of the tensor 510C is 4×1, as the tensor 510C has four rows and one column. The shape of the tensor 520C is 1×3, as the tensor 520C has one row and three columns. To match the shapes of the tensors 510C and 520C, a broadcasting operation may be performed on the tensor 510C to change the shape of the tensor 510C from 4×1 to 4×3. Another broadcasting operation may be performed on the tensor 520C to change the shape of the tensor 520C from 1×3 to 4×3. As shown in FIG. 5C, the four elements in the tensor 510C are duplicated to generate two new columns that are represented by the dashed boxes in FIG. 5C. And the three elements in the tensor 520C are duplicated to generate three new rows that are represented by the dashed boxes in FIG. 5C. After the broadcasting, both tensors have a size of 4×3 so that the shape of the tensor 520C matches the shape of the tensor 510C.

FIG. 5D shows a tensor 510D and a tensor 520D. For the purpose of illustration and simplicity, FIG. 5D shows two dimensions of each of the tensors 510D and 520D. The shape of the tensor 510D is 4×3, as the tensor 510D has four rows and three columns. The shape of the tensor 520D is 1×4, as the tensor 520D has one row and four columns. Broadcasting on either tensor may fail as the trailing dimensions of the tensors 510D and 520D are not compatible. The trailing dimension of the tensor 510D is three, while the trailing dimension of the tensor 520D is four. They are not equal, and neither is one.
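The four cases of FIGS. 5A-5D can be reproduced with NumPy, which the description names as one example of broadcasting. The element values below are placeholders, and the variable names merely echo the reference numerals.

```python
import numpy as np

# FIG. 5A: a one-element tensor is stretched to three elements.
t520a = np.broadcast_to(np.array([7]), (3,))

# FIG. 5B: a 1x3 tensor is repeated along the rows to become 4x3.
t520b = np.broadcast_to(np.array([[1, 2, 3]]), (4, 3))

# FIG. 5C: a 4x1 tensor and a 1x3 tensor are both broadcast to 4x3.
t510c, t520c = np.broadcast_arrays(np.arange(4).reshape(4, 1),
                                   np.arange(3).reshape(1, 3))

# FIG. 5D: 4x3 and 1x4 have trailing sizes 3 and 4, so broadcasting fails.
try:
    np.broadcast_shapes((4, 3), (1, 4))
    fig_5d_ok = True
except ValueError:
    fig_5d_ok = False
```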

Transforming Matrix Multiplication to Convolution

FIG. 6 is a block diagram of a DNN module 600, in accordance with various embodiments. The DNN module 600 facilitates transformation of matrix multiplications to convolutions. The DNN module 600 may be an embodiment of the DNN module 301 in FIG. 3. As shown in FIG. 6, the DNN module 600 includes an interface module 610, a training module 620, a compressing module 630, a validating module 640, a transforming module 650, and a datastore 660. In other embodiments, alternative configurations, different or additional components may be included in the DNN module 600. Further, functionality attributed to a component of the DNN module 600 may be accomplished by a different component included in the DNN module 600 or a different module or system.

The interface module 610 facilitates communications of the DNN module 600 with other modules or systems. For example, the interface module 610 establishes communications between the DNN module 600 and an external database to receive data that can be used to train DNNs or input into DNNs to perform tasks. As another example, the interface module 610 supports the DNN module 600 to distribute DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.

The training module 620 trains DNNs by using a training dataset. The training module 620 forms the training dataset. In an embodiment where the training module 620 trains a DNN to recognize objects in images, the training dataset includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in a training image. In some embodiments, a part of the training dataset may be used to initially train the DNN, and the rest of the training dataset may be held back as a validation subset used by the validating module 640 to validate performance of a trained DNN. The portion of the training dataset not including the tuning subset and the validation subset may be used to train the DNN.

The training module 620 also determines hyperparameters for training the DNN. Hyperparameters are variables specifying the DNN training process. Hyperparameters are different from parameters inside the DNN (e.g., weights of filters) . In some embodiments, hyperparameters include variables determining the architecture of the DNN, such as number of hidden layers, etc. Hyperparameters also include variables which determine how the DNN is trained, such as batch size, number of epochs, etc. A batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backwards through the entire network. The number of epochs defines the number of times that the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN. An epoch may include one or more batches. The number of epochs may be 1, 5, 10, 50, 100, 500, 1000, or even larger.

The training module 620 defines the architecture of the DNN, e.g., based on some of the hyperparameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of a DNN may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, SoftMax or logistic layers, and so on. The convolutional layers of the DNN abstract the input image to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels (e.g., red, green, blue images include 3 channels). A pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between two convolution layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images between different categories by training.

In the process of defining the architecture of the DNN, the training module 620 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a ReLU activation function, a tangent activation function, or other types of activation functions.

After the training module 620 defines the architecture of the DNN, the training module 620 inputs a training dataset into the DNN. The training dataset includes a plurality of training samples. An example of a training sample includes an object in an image and a ground-truth label of the object. The training module 620 modifies the parameters inside the DNN ( “internal parameters of the DNN” ) to minimize the error between labels of the training objects that are generated by the DNN and the ground-truth labels of the objects. The internal parameters include weights of filters in the convolutional layers of the DNN. In some embodiments, the training module 620 uses a cost function to minimize the error.

The training module 620 may train the DNN for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the DNN. After the training module 620 finishes the predetermined number of epochs, the training module 620 may stop updating the parameters in the DNN. The DNN having the updated parameters is referred to as a trained DNN.

The compressing module 630 compresses DNNs. For instance, the compressing module 630 may add pruning operations to DNN layers to reduce computational complexity or memory usage. A pruning operation may prune weight tensors of a DNN layer by changing one or more nonzero valued weights of the layer to zeros. The modification may be done before, during, or after training. Weights may be pruned during training, during inference, or a combination of both. The compressing module 630 may determine a sparsity ratio for a DNN layer. The sparsity ratio may be a ratio of the number of zero-valued weights to the total number of weights in the layer. The compressing module 630 may perform the pruning operation until the sparsity ratio of the DNN layer meets a target sparsity ratio, such as 10%, 20%, 30%, 50%, 60%, and so on.

In some embodiments, the compressing module 630 may select one or more layers in a DNN and modify each selected layer with a pruning operation. For instance, the compressing module 630 may select computationally complex layers, such as layers with large filters. For a pruning operation of a layer or of a type of layer, the compressing module 630 may determine a weight threshold that would not cause a loss of the accuracy of the DNN to exceed an accuracy loss constraint. A pruning operation may modify weights having absolute values below the weight threshold to zeros and leave the other weights unchanged. The weight pruning can reduce memory storage as zero-valued weights may not be stored. Also, the number of operations in the layer can be reduced as computations on zero-valued weights can be skipped without impacting the output of the layer. In some embodiments, the compressing module 630 may also measure energy saving, final DNN accuracy, or layer-wise sparsity caused by pruning operations.
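A minimal sketch of threshold-based pruning and the resulting sparsity ratio follows. It assumes the common magnitude-pruning convention in which weights whose absolute values fall below the threshold are set to zero; the function names and the sample weights are hypothetical.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero every weight whose absolute value is below the threshold."""
    mask = np.abs(weights) >= threshold        # True for weights that are kept
    return np.where(mask, weights, 0.0), mask

def sparsity_ratio(weights):
    """Number of zero-valued weights divided by the total number of weights."""
    return np.count_nonzero(weights == 0) / weights.size

weights = np.array([[0.8, -0.05, 0.3],
                    [0.02, -0.6, 0.1]])
pruned, mask = prune_by_magnitude(weights, threshold=0.2)
ratio = sparsity_ratio(pruned)
```

With a threshold of 0.2, three of the six sample weights are zeroed, which gives a sparsity ratio of 50%.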

After compressing a DNN, the compressing module 630 may fine-tune the DNN, e.g., through a retraining process. The compressing module 630 may fine-tune DNNs after weights are pruned. In some embodiments, the fine-tuning process is a retraining or further training process. For instance, after weights in a DNN are pruned, the compressing module 630 may further train the DNN by inputting a training dataset into the DNN. The values of the unpruned weights in the DNN may be modified based on outputs of the DNN and ground-truth labels of the training samples in the training dataset. In some embodiments, the values of the pruned weights (i.e., zero) are not changed during the fine-tuning process. For instance, the compressing module 630 may place a mask over a pruned weight block and the mask can prevent values in the pruned weight blocks from being changed during the fine-tuning process. In other embodiments, the values of all weights, including the pruned weights, may be changed during the fine-tuning process. After one or more cycles of retraining and weight changing by the compressing module 630, the compressing module 630 may perform a new pruning process, e.g., by selecting weight blocks and pruning the selected weight blocks. In some embodiments, the weight pruning process may be repeated multiple times before the fine-tuning process is done.
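One way to picture the mask is as an elementwise multiplier on the weight update. The sketch below assumes plain gradient descent; the update rule, learning rate, and gradient values are illustrative assumptions, not details taken from this disclosure.

```python
import numpy as np

def masked_update(weights, grads, mask, lr=0.1):
    """One gradient-descent step that leaves pruned (masked-out) weights at zero."""
    return weights - lr * grads * mask

weights = np.array([0.8, 0.0, 0.3, 0.0])      # zeros are pruned weights
mask = np.array([1.0, 0.0, 1.0, 0.0])         # 1 = trainable, 0 = pruned
grads = np.array([0.5, -2.0, -1.0, 4.0])      # hypothetical gradients
updated = masked_update(weights, grads, mask)
```

Only the unpruned weights move; the pruned positions stay at zero regardless of their gradients.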

In some embodiments, the number of epochs in the fine-tuning process may be different from the number of epochs in the training process in which the pre-pruning values of the weights are determined. For instance, the fine-tuning process may have fewer epochs than the training process. In an example, the number of epochs in the fine-tuning process may be relatively small, such as 2, 3, 5, 6, and so on.

The validating module 640 verifies accuracy of trained or compressed DNNs. In some embodiments, the validating module 640 inputs samples in a validation dataset into a trained DNN and uses the outputs of the DNN to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validating module 640 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validating module 640 may use the following metrics to determine the accuracy score: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where precision may be how many predictions the DNN got correct (TP, or true positives) out of the total it predicted (TP + FP, where FP denotes false positives), and recall may be how many predictions the DNN got correct (TP) out of the total number of objects that did have the property in question (TP + FN, where FN denotes false negatives). The F-score (F-score = 2 * P * R / (P + R), where P is precision and R is recall) unifies precision and recall into a single measure.
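The three metrics can be computed directly from the counts. In the sketch below the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption made only to keep the example total.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) from raw prediction counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Example counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
```

For these counts the precision is 0.8 and the recall is 0.5, and the F-score lies between the two.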

The validating module 640 may compare the accuracy score with a threshold score. In an example where the validating module 640 determines that the accuracy score of the DNN is less than the threshold score, the validating module 640 instructs the training module 620 to re-train the DNN. In one embodiment, the training module 620 may iteratively re-train the DNN until the occurrence of a stopping condition, such as the accuracy measurement indicating that the DNN may be sufficiently accurate, or a number of training rounds having taken place.

The transforming module 650 transforms matrix multiplication operations to convolutional operations that can be executed by DNN accelerators, e.g., the DNN accelerator 302. For instance, the transforming module 650 may transform the input tensors of a matrix multiplication operation into an activation tensor and a weight tensor of a convolutional operation. A DNN accelerator may perform the convolutional operation on the activation tensor and weight tensor and compute a convolutional output tensor, i.e., the output tensor of the convolutional operation. The transforming module 650 may then transform the convolutional output tensor into a MatMul output tensor, i.e., the output tensor of the matrix multiplication operation. This way, the matrix multiplication operation can be executed by performing the convolutional operation.

In some embodiments, the transforming module 650 may transform matrix multiplication operations to convolutional operations with predetermined parameters. For instance, the transforming module 650 may transform matrix multiplication operations to convolutional operations with predetermined stride, dilation rate, or padding. Stride is a convolutional operation parameter that refers to the number of pixels (e.g., data elements) by which the filter moves across the input matrix. In an example, the stride of a convolutional operation transformed from a matrix multiplication operation may be set to [1, 1], i.e., the stride for both dimensions (e.g., height and width) is 1. That means during the convolutional operation, the filter moves across one pixel for both dimensions of the input matrix. Dilation rate refers to the number of holes inserted between weights in the filter to dilate the kernel. Dilation rate may define a spacing between the values in a kernel. In an example, the dilation rate of a convolutional operation transformed from a matrix multiplication operation may be [1, 1], i.e., the dilation rate for both dimensions (e.g., height and width) is 1. That means the kernel is dilated by inserting one hole between weights for both dimensions (e.g., height and width) of the kernel. Padding refers to the addition of extra pixels around the borders of the input matrix. The extra pixels may be zero-valued data elements. For a convolutional operation transformed from a matrix multiplication operation, padding may be deactivated. No extra pixels may be added around the borders of the input matrix.

In some embodiments, the transforming module 650 may transform a matrix multiplication operation to a group convolution. A group convolution includes multiple convolutions. The output tensor of the group convolution may be a result of concatenating the output tensors of the convolutions in the group. For an example group convolution, the shape representation of the activation tensor may be [g, N, H, W, C_IN], which indicates the sizes of the five dimensions of the activation tensor. The size of a dimension may indicate the number of pixels (e.g., data elements) present in the dimension. The shape representation of the weight tensor may be [g, C_OUT, C_IN, K_H, K_W], which indicates the sizes of the five dimensions of the weight tensor. g denotes the number of convolutions in the group convolution. N denotes the number of batches. H denotes the height of the input matrix in a single input channel. W denotes the width of the input matrix in a single input channel. C_IN denotes the number of input channels. C_OUT denotes the number of output channels. K_H denotes the height of the kernel. K_W denotes the width of the kernel. N, K_H, and K_W may be set to 1 to facilitate the transformation of the matrix multiplication operation to the group convolution. As the kernel has a size of 1×1, the height and width of the output matrix in a single output channel may equal the height and width, respectively, of the input matrix in a single input channel. The shape representation of the output tensor is [g, N, H, W, C_OUT], which indicates the sizes of the five dimensions of the output tensor.
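With a 1×1 kernel, stride 1, and no padding, each group's convolution reduces to multiplying the C_IN values of every pixel by a C_OUT×C_IN matrix. The sketch below emulates this on the host with `numpy.einsum`; it is an illustration of the shapes involved, not the accelerator's implementation, and the function name is hypothetical.

```python
import numpy as np

def group_conv_1x1(activation, weight):
    """Group convolution with a 1x1 kernel, stride 1, and no padding.

    activation: [g, N, H, W, C_IN]
    weight:     [g, C_OUT, C_IN, 1, 1]
    returns:    [g, N, H, W, C_OUT]
    """
    kernel = weight[..., 0, 0]                       # [g, C_OUT, C_IN]
    # Within each group, reduce over the input channels at every pixel.
    return np.einsum("gnhwc,goc->gnhwo", activation, kernel)

rng = np.random.default_rng(0)
g, N, H, W, C_IN, C_OUT = 2, 1, 3, 5, 4, 6
activation = rng.standard_normal((g, N, H, W, C_IN))
weight = rng.standard_normal((g, C_OUT, C_IN, 1, 1))
output = group_conv_1x1(activation, weight)
```

The output keeps the input height and width, as expected for a 1×1 kernel.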

In some embodiments, the transforming module 650 may transform the input tensors of a matrix multiplication operation by transposing and reshaping the input tensors. A matrix multiplication operation may be denoted as [B₀, B₁, …, B_n, L, M] × [B₀, B₁, …, B_n, M, N] = [B₀, B₁, …, B_n, L, N] . [B₀, B₁, …, B_n, L, M] is the shape representation of the first input tensor. [B₀, B₁, …, B_n, M, N] is the shape representation of the second input tensor. [B₀, B₁, …, B_n, L, N] is the shape representation of the output tensor.

The transforming module 650 may generalize the shape of the first input tensor to [B₀, …, B_i, B_i+1, …, B_j, 1, …, 1, L, M] and transpose the first input tensor to [B₀, …, B_i, B_i+1, …, B_j, L, 1, …, 1, M] . B_i is the ith batch dimension. The L-sized dimension and a one-sized dimension are transposed. Also, the transforming module 650 may generalize the shape of the second input tensor to [B₀, …, B_i, 1, …, 1, B_j+1, …, B_n, M, N] and transpose the second input tensor to [B₀, …, B_i, 1, …, 1, B_j+1, …, B_n, N, M] . The last two dimensions are transposed.

The transforming module 650 may divide the dimensions of the two transposed input tensors into groups. In an example, the transforming module 650 divides the dimensions of the two transposed input tensors into four groups: S_ab, S_a, S_b, and S_mul. S_ab includes [B₀, …, B_i] , which are the dimensions shared by the two input tensors. S_a includes [B_i+1, …, B_j, L] , which are the dimensions present in the first input tensor but not present in the second input tensor. S_b includes [B_j+1, …, B_n, N] , which are the dimensions present in the second input tensor but not present in the first input tensor. S_mul includes [M] , which is the dimension where multiplication and reduce-sum happen. The shape of the first input tensor after transpose becomes [S_ab, S_a, …, S_mul] . The shape of the second input tensor after transpose becomes [S_ab, …, S_b, S_mul] .

After the input tensors are transposed, the transforming module 650 may reshape the transposed input tensors. In an example, the transforming module 650 may change the shape of the first input tensor from [S_ab, S_a, …, S_mul] to [S_ab′, 1, 1, S_a′, S_mul]. The transforming module 650 may change the shape of the second input tensor from [S_ab, …, S_b, S_mul] to [S_ab′, S_b′, 1, 1, S_mul]. In some embodiments, S′ = ∏_{i∈S} S(i), i.e., the size of a reshaped dimension S′ is the product of the sizes S(i) of the dimensions in the corresponding group S. Through the transposing and reshaping process, the first input tensor is converted into the activation tensor of the convolutional operation, and the second input tensor is converted into the weight tensor of the convolutional operation. The transposing and reshaping process enables a mapping for each dimension from the matrix multiplication operation to a dimension in the convolution operation. For instance, [S_ab′] may be mapped to [g], [S_a′] may be mapped to [H, W], [S_b′] may be mapped to [C_OUT], and [S_mul] may be mapped to [C_IN].
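The reshaping step amounts to collapsing each dimension group into a single size. The following sketch computes the two target shapes from the groups; the helper name `conv_shapes` and the example sizes are hypothetical.

```python
from math import prod

def conv_shapes(s_ab, s_a, s_b, s_mul):
    """Collapse each dimension group into one size and build the conv shapes."""
    g, a, b = prod(s_ab), prod(s_a), prod(s_b)   # S' = product of sizes in S
    activation_shape = (g, 1, 1, a, s_mul)       # [S_ab', 1, 1, S_a', S_mul]
    weight_shape = (g, b, 1, 1, s_mul)           # [S_ab', S_b', 1, 1, S_mul]
    return activation_shape, weight_shape

# First input [2, 3, 1, 5, 4] and second input [2, 1, 6, 4, 7]:
# S_ab = [2], S_a = [3, 5], S_b = [6, 7], S_mul = 4.
act_shape, wt_shape = conv_shapes([2], [3, 5], [6, 7], 4)
```

Here the activation tensor has shape (2, 1, 1, 15, 4) and the weight tensor has shape (2, 42, 1, 1, 4). An empty group collapses to a size of one, because the product of no factors is 1.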

In some embodiments, the transforming module 650 generates one or more configuration parameters based on the transpose of the input tensors. Such a configuration parameter may be referred to as an input-order parameter. An input-order parameter indicates how the order of the dimensions in the input tensors is changed by the transpose. In an example, the input-order parameter may include a sequence of indices. Each index may correspond to a dimension and indicate the position of the dimension in the shape representation of the input tensor. As the input tensor is transposed, the order of the indices is changed accordingly. In an example where the shape of a tensor is [3, 5, 6, 4], the original sequence of indices is [0, 1, 2, 3], indicating that 3 is the size of the first dimension, 5 is the size of the second dimension, 6 is the size of the third dimension, and 4 is the size of the fourth dimension. As the tensor is transposed and the shape of the tensor is changed to [3, 6, 5, 4], the sequence of indices becomes [0, 2, 1, 3], indicating that the second dimension and the third dimension are transposed. The new sequence of indices indicates the new dimension order (or new axes order) after the transpose. The new sequence of indices may be used as an input-order parameter or used to determine an input-order parameter. In some embodiments, the transforming module 650 may generate an input-order parameter based on the shapes of the two input tensors and a comparison of the dimension sizes of the two input tensors at the same position. To compare the dimension sizes, the transforming module 650 may identify a first dimension from the first input tensor and a second dimension from the second input tensor such that the index of the first dimension in the shape representation of the first input tensor is the same as the index of the second dimension in the shape representation of the second input tensor. The transforming module 650 may then compare the size of the first dimension with the size of the second dimension and determine the input-order parameter based on the broadcasting rule. The input-order parameter may be used later (e.g., after the convolutional operation is complete) to reverse the index mapping to the original order.
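The [3, 5, 6, 4] example can be reproduced with NumPy, treating the sequence of indices as the axes argument of a transpose. Using `numpy.argsort` to obtain the inverse permutation is one common way to reverse the mapping; it is shown here as an assumption, since the description does not state how the reversal is computed.

```python
import numpy as np

tensor = np.zeros((3, 5, 6, 4))
input_order = [0, 2, 1, 3]                 # swap the second and third dimensions

transposed = np.transpose(tensor, input_order)

# The inverse permutation restores the original dimension order.
inverse_order = np.argsort(input_order).tolist()
restored = np.transpose(transposed, inverse_order)
```

A swap of two dimensions is its own inverse, so `inverse_order` equals `input_order` here; for a general permutation the two sequences differ.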

The transforming module 650 may provide the activation tensor and the weight tensor to the DNN accelerator. The DNN accelerator may perform the convolution operation on the two tensors and compute a convolutional output tensor. The transforming module 650 may receive the convolutional output tensor and transform the convolutional output tensor to a MatMul output tensor. In some embodiments, the transforming module 650 may reshape the convolutional output tensor. For instance, the transforming module 650 may determine the target shape (i.e., the shape after reshaping) by replacing [g] with [S_ab], which includes [S₀, …, S_i], replacing [W] with [S_a], which includes [S_i+1, …, S_j], and replacing [C_OUT] with [S_b], which includes [S_j+1, …, S_n]. After the reshaping, the transforming module 650 may transpose the reshaped tensor based on the input-order parameter that was generated from the transpose of the input tensors. The transforming module 650 may use the input-order parameter to reverse the dimension order mapping of the transpose of the input tensors.
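The whole round trip can be checked numerically. The sketch below is a host-side illustration under stated assumptions: the tensor sizes are hypothetical, the 1×1 group convolution is emulated with `numpy.einsum` rather than run on an accelerator, and the final permutation is written out by hand for this example instead of being derived from an input-order parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 1, 5, 4))    # [S_ab=2, B=3, 1, L=5, M=4]
B = rng.standard_normal((2, 1, 6, 4, 7))    # [S_ab=2, 1, B=6, M=4, N=7]

# 1. Transpose: move L ahead of the one-sized dimension; swap M and N.
A_t = A.transpose(0, 1, 3, 2, 4)            # [2, 3, 5, 1, 4]
B_t = B.transpose(0, 1, 2, 4, 3)            # [2, 1, 6, 7, 4]

# 2. Reshape into the activation and weight tensors.
activation = A_t.reshape(2, 1, 1, 3 * 5, 4)  # [S_ab', 1, 1, S_a', S_mul]
weight = B_t.reshape(2, 6 * 7, 1, 1, 4)      # [S_ab', S_b', 1, 1, S_mul]

# 3. 1x1 group convolution: reduce over S_mul within each group.
conv_out = np.einsum("gnhwc,goc->gnhwo", activation, weight[:, :, 0, 0, :])

# 4. Reshape the convolutional output, then transpose to the MatMul layout.
reshaped = conv_out.reshape(2, 3, 5, 6, 7)       # [S_ab, S_a..., S_b...]
matmul_out = reshaped.transpose(0, 1, 3, 2, 4)   # [2, 3, 6, 5, 7]
```

The result matches `np.matmul(A, B)`, which broadcasts the batch dimensions [2, 3, 1] and [2, 1, 6] to [2, 3, 6] before multiplying the last two dimensions.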

The datastore 660 stores data received, generated, used, or otherwise associated with the DNN module 600. For example, the datastore 660 stores the datasets used by the training module 620 and validating module 640. The datastore 660 may also store data generated by the training module 620 and validating module 640, such as the hyperparameters for training DNNs, internal parameters of trained DNNs (e.g., weights, etc. ) , data for sparsity acceleration (e.g., sparsity bitmap, etc. ) , and so on. The datastore 660 may store tensors (e.g., transposed tensors, reshaped tensors, etc. ) , shape representations of tensors, configuration parameters (e.g., input-order parameters) , or other data generated by the transforming module 650. In the embodiment of FIG. 6, the datastore 660 is a component of the DNN module 600. In other embodiments, the datastore 660 may be external to the DNN module 600 and communicate with the DNN module 600 through a network.

FIG. 7 illustrates conversion of a matrix multiplication operation to a convolutional operation, in accordance with various embodiments. The matrix multiplication operation has an input tensor 701 and an input tensor 702, each of which may have one or more dimensions. The matrix multiplication operation may be performed either by a MatMul operator 705 or a transformed MatMul operator 700 to compute an output tensor 703. MatMul stands for matrix multiplication.

In an example, the input tensor 701 and the input tensor 702 may each have one dimension. The matrix multiplication operation may be a 1Dx1D multiplication (where D stands for dimension) , which may be denoted as [M] × [M] , where M represents the size of the input tensor 701 and the input tensor 702. In another example, the input tensor 701 may have one dimension and the input tensor 702 may have multiple dimensions. The matrix multiplication operation may be a 1DxND multiplication (where N is an integer that is greater than 1) , which may be denoted as [M] × [B₀, B₁, …, B_n, M, N] = [B₀, B₁, …, B_n, N] . In yet another example, the input tensor 702 may have one dimension and the input tensor 701 may have multiple dimensions. The matrix multiplication operation may be a NDx1D multiplication, which may be denoted as [B₀, B₁, …, B_n, L, M] × [M] = [B₀, B₁, …, B_n, L] . In yet another example, the input tensor 701 and the input tensor 702 may each have multiple dimensions. The matrix multiplication operation may be a NDxND multiplication, which may be denoted as [B₀, B₁, …, B_n, L, M] × [B₀, B₁, …, B_n, M, N] = [B₀, B₁, …, B_n, L, N] .
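The four cases above can be illustrated with a minimal sketch. It assumes NumPy's `np.matmul` as a stand-in for a generalized matrix multiplication operator; the sizes M, L, N and the two batch dimensions are arbitrary values chosen for illustration and do not come from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, N = 5, 4, 6          # illustrative sizes (assumed)
batch = (2, 3)             # stands in for [B0, ..., Bn] (assumed)

v = rng.standard_normal(M)                 # 1D tensor [M]
a = rng.standard_normal(batch + (L, M))    # ND tensor [B0, B1, L, M]
b = rng.standard_normal(batch + (M, N))    # ND tensor [B0, B1, M, N]

shape_1d_1d = np.matmul(v, v).shape   # [M] x [M] gives a scalar
shape_1d_nd = np.matmul(v, b).shape   # [M] x [B0, B1, M, N] gives [B0, B1, N]
shape_nd_1d = np.matmul(a, v).shape   # [B0, B1, L, M] x [M] gives [B0, B1, L]
shape_nd_nd = np.matmul(a, b).shape   # [B0, B1, L, M] x [B0, B1, M, N] gives [B0, B1, L, N]
```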

In embodiments where the matrix multiplication operation is performed by the MatMul operator 705, the MatMul operator 705 may compute an output tensor 703 from the input tensor 701 and the input tensor 702. The input tensor 701 may have a shape representation [B₀, B₁, …, B_k, L, M]. The input tensor 702 may have a shape representation [B₀, B₁, …, B_k, M, N]. The MatMul operator 705 may be a generalized matrix multiplication operator that may apply matrix multiplication on the last two dimensions of the input tensor 701 and the input tensor 702 and apply broadcasting (e.g., NumPy broadcasting) on the rest of the dimensions. Matrix multiplication may be based on elementwise multiplication and reduce-sum as fundamental computation. The MatMul operator 705 may perform matrix multiplication on two input matrices [L, M] and [M, N] to compute an output matrix [L, N]. M indicates the size of the dimension where elementwise multiplication and reduce-sum happen. L and N are the sizes of the other dimensions of the two input matrices. The MatMul operator 705 may apply broadcasting to the batch dimensions [B₀, B₁, …, B_k]. To broadcast the two tensors, the MatMul operator 705 may compare their shapes elementwise, starting with the trailing (i.e., rightmost) dimension and working its way left. The output tensor may have a shape representation [B₀, B₁, …, B_k, L, N].

In an example, the input tensor 701 may be represented by [22, 1, 44, 4, 5] , which indicates that the input tensor 701 has five dimensions. The sizes of the input tensor 701 in the five dimensions are 22, 1, 44, 4, and 5, respectively. The input tensor 702 may be represented by [1, 33, 1, 5, 6] , which indicates that the input tensor 702 has five dimensions. The sizes of the input tensor 702 in the five dimensions are 1, 33, 1, 5, and 6, respectively. The MatMul operator 705 may apply matrix multiplication on the last two dimensions of the input tensor 701 (i.e., [4, 5] ) and the last two dimensions of the input tensor 702 (i.e., [5, 6] ) and compute the last two dimensions of the output tensor 703 (i.e., [4, 6] ) . The MatMul operator 705 may apply broadcasting on the first three dimensions of the input tensor 701 (i.e., [22, 1, 44] ) and the first three dimensions of the input tensor 702 (i.e., [1, 33, 1] ) to compute the first three dimensions of the output tensor 703 (i.e., [22, 33, 44] ) . The output tensor 703 may therefore be represented by [22, 33, 44, 4, 6] .
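The shapes in this example can be checked with a short sketch, assuming NumPy's `np.matmul` and `np.broadcast_shapes` as stand-ins for the MatMul operator 705 and its broadcasting rule. The tensors are filled with ones purely for illustration.

```python
import numpy as np

a = np.ones((22, 1, 44, 4, 5), dtype=np.float32)   # input tensor 701
b = np.ones((1, 33, 1, 5, 6), dtype=np.float32)    # input tensor 702

# Broadcasting applies to every dimension except the last two.
batch_shape = np.broadcast_shapes(a.shape[:-2], b.shape[:-2])

# Matrix multiplication applies to the last two dimensions: [4, 5] x [5, 6].
out = np.matmul(a, b)
```

Here `batch_shape` is (22, 33, 44) and `out.shape` is (22, 33, 44, 4, 6), matching the shape representation of the output tensor 703.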

In embodiments where the matrix multiplication operation may be performed by the transformed MatMul operator 700, the transformed MatMul operator 700 may convert the matrix multiplication operation to a convolutional operation. The transformed MatMul operator 700 may perform the convolutional operation and compute the output tensor 703 from the output of the convolutional operation. In some embodiments, the convolutional operation may be a group convolution. As shown in FIG. 7, the transformed MatMul operator 700 includes a transposing module 710, another transposing module 720, a reshaping module 730, another reshaping module 740, a convolution operator 750, a reshaping module 760, and a transposing module 770. In other embodiments, the transformed MatMul operator 700 may include fewer, more, or different components. Also, functionality attributed to a component of the transformed MatMul operator 700 may be accomplished by a different component included in the transformed MatMul operator 700 or a different system or device. For instance, functionality attributed to the transposing module 710 and transposing module 720 may be accomplished by a single transposing module. Functionality attributed to the reshaping module 730 and reshaping module 740 may be accomplished by a single reshaping module. The transposing module 710, transposing module 720, reshaping module 730, reshaping module 740, reshaping module 760, and transposing module 770 may be components of a transforming module, e.g., the transforming module 650 in FIG. 6.

In some embodiments, the transposing modules 710 and 720 may receive the input tensors 701 and 702, respectively, and transpose the two tensors. In an example, the transposing module 710 may transpose the input tensor 701 to generate a transposed tensor with a shape represented by [B₀, …, B_i, B_i+1, …, B_j, L, 1, …, 1, M]. The transposing module 720 may transpose the input tensor 702 to generate a transposed tensor with a shape represented by [B₀, …, B_i, 1, …, 1, B_j+1, …, B_k, N, M]. In some embodiments, a configuration parameter may be generated. The configuration parameter indicates the new order of the dimensions in the transposed tensors. The configuration parameter may be determined based on the shapes and dimension sizes of the input tensors 701 and 702. The dimensions of the two transposed tensors may be divided into four groups: S_ab: [B₀, …, B_i], S_a: [B_i+1, …, B_j, L], S_b: [B_j+1, …, B_k, N], and S_mul: [M]. The transposed tensor generated by the transposing module 710 may be represented by [S_ab, S_a, …, M]. The transposed tensor generated by the transposing module 720 may be represented by [S_ab, …, S_b, M].

The reshaping module 730 may reshape the transposed tensor generated by the transposing module 710. For instance, the reshaping module 730 may reshape the transposed tensor [S_ab, S_a, …, M] to form a reshaped tensor [S_ab′, 1, 1, S_a′, M]. The reshaping module 740 may reshape the transposed tensor [S_ab, …, S_b, M] to form a reshaped tensor [S_ab′, S_b′, 1, 1, M]. In some embodiments, the reshaping module 730 or 740 may compute the target shape of the corresponding transposed tensor by replacing each of the groups S_ab, S_a, and S_b with the product of the sizes of its dimensions, at the position of that group.
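One possible way to form the groups and their products is sketched below. The classification rule is an assumption made for illustration (a batch dimension with equal sizes in both tensors goes to S_ab, a dimension that is one-sized only in the second tensor goes to S_a, and a dimension that is one-sized only in the first tensor goes to S_b); the helper name `group_dims` is hypothetical, and both shapes are assumed to already have the same number of dimensions.

```python
from math import prod

def group_dims(shape_a, shape_b):
    """Split [..., L, M] and [..., M, N] into the size groups S_ab, S_a, S_b.

    Hypothetical helper: the grouping rule is an assumption, not the
    disclosure's exact procedure.
    """
    if len(shape_a) != len(shape_b):
        raise ValueError("shapes must have the same number of dimensions")
    if shape_a[-1] != shape_b[-2]:
        raise ValueError("the M dimensions must match")
    s_ab, s_a, s_b = [], [], []
    for da, db in zip(shape_a[:-2], shape_b[:-2]):
        if da == db:
            s_ab.append(da)          # shared batch dimension
        elif db == 1:
            s_a.append(da)           # present only in the first tensor
        elif da == 1:
            s_b.append(db)           # present only in the second tensor
        else:
            raise ValueError("batch dimensions are not broadcastable")
    s_a.append(shape_a[-2])          # L joins S_a
    s_b.append(shape_b[-1])          # N joins S_b
    return s_ab, s_a, s_b

# Shapes taken from the FIG. 8 example, with a leading one-sized dimension
# added to the first tensor so that both shapes have six dimensions.
s_ab, s_a, s_b = group_dims((1, 5, 2, 1, 11, 7), (3, 1, 2, 4, 7, 10))
g, hw, c_out = prod(s_ab), prod(s_a), prod(s_b)   # S_ab', S_a', S_b'
```

With these inputs the products are 2, 55, and 120, which agree with the reshaped tensors [2, 1, 1, 55, 7] and [2, 120, 1, 1, 7] described for FIG. 8.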

The two reshaped tensors are input into the convolution operator 750. The convolution operator 750 performs a convolutional operation on the two reshaped tensors. An example of the convolution operator 750 may be the DNN accelerator 302 or part of the DNN accelerator 302, such as the sparse cell array 370. In some embodiments, the convolution operator 750 may be the data processing unit 1000 in FIG. 10. The convolution operator 750 may include one or more sparse cells, such as the sparse cell 900 in FIG. 9. The convolutional operation may be a group convolution on an activation tensor (which may be the tensor computed by the reshaping module 730) and a weight tensor (which may be the tensor computed by the reshaping module 740) to compute an output tensor. The convolutional operation may have a stride of [1, 1], a dilation rate of [1, 1], and no padding.

The shape of the activation tensor may be denoted as [g, N, H, W, C_IN], where g denotes the number of convolutions in the group convolution, N denotes the number of batches, H and W denote the height and width of the spatial dimensions of the activation tensor, and C_IN denotes the number of input channels. The shape of the weight tensor may be denoted as [g, C_OUT, C_IN, K_H, K_W], where K_H and K_W are the kernel sizes on the H and W dimensions, and C_OUT denotes the number of output channels. The kernel may be a 1×1 kernel so that the convolutional operation may not change the height and width. The shape of the output tensor may be denoted as [g, N, H, W, C_OUT]. In embodiments where g equals one, the group convolution is equivalent to a single convolution. The dimensions in the shape representations of the tensors may vary. For instance, the activation tensor may be represented by [N, g×C_IN, H, W]. The weight tensor may be represented by [C_OUT, g×C_IN, K_H, K_W]. The output tensor may be represented by [N, g×C_OUT, H, W]. In other embodiments, the activation tensor, weight tensor, or output tensor may have other shape representations.
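A group convolution with a 1×1 kernel, unit stride, and no padding reduces to a per-group contraction over the input channels. The following sketch emulates it in software with `np.einsum`, using the [g, N, H, W, C_IN] and [g, C_OUT, C_IN, K_H, K_W] layouts described above. The function name `group_conv_1x1` and the sizes are assumptions for illustration; this is not a model of the accelerator hardware.

```python
import numpy as np

def group_conv_1x1(activation, weight):
    """Emulate a 1x1 group convolution (stride 1, dilation 1, no padding).

    activation: [g, N, H, W, C_IN]
    weight:     [g, C_OUT, C_IN, 1, 1]
    returns:    [g, N, H, W, C_OUT]
    """
    if weight.shape[-2:] != (1, 1):
        raise ValueError("only 1x1 kernels are supported in this sketch")
    if activation.shape[0] != weight.shape[0]:
        raise ValueError("activation and weight must have the same group count")
    if activation.shape[-1] != weight.shape[2]:
        raise ValueError("input channel counts must match")
    kernel = weight[..., 0, 0]                      # [g, C_OUT, C_IN]
    # Multiply and reduce-sum over the input channels, separately per group.
    return np.einsum("gnhwc,goc->gnhwo", activation, kernel)

rng = np.random.default_rng(0)
act = rng.standard_normal((2, 1, 3, 4, 5))          # g=2, N=1, H=3, W=4, C_IN=5
wgt = rng.standard_normal((2, 6, 5, 1, 1))          # g=2, C_OUT=6, C_IN=5
conv_out = group_conv_1x1(act, wgt)                 # shape [2, 1, 3, 4, 6]
```

Because the kernel is 1×1, the height and width of the output equal those of the activation tensor.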

The two reshaped tensors generated by the reshaping modules 730 and 740 are mapped to be the activation tensor and the weight tensor, respectively, of the group convolution. For the purpose of illustration and simplicity, the batch size N, kernel height K_H, and kernel width K_W may equal one in some examples. S_ab′ may be mapped to g. S_a′ may be mapped to [H, W]. S_b′ may be mapped to [C_OUT]. S_mul may be mapped to [C_IN]. The convolutional output tensor may be a result of concatenating the results of the convolutions in the group convolution. In some embodiments, the shape of the convolutional output tensor may be [S_ab′, 1, 1, S_a′, S_b′].

The reshaping module 760 may reshape the convolutional output tensor and change the shape from [S_ab′, 1, 1, S_a′, S_b′] to [B₀, …, B_i, 1, 1, B_i+1, …, B_j, L, B_j+1, …, B_k, N]. The transposing module 770 may transpose the reshaped output tensor from [B₀, …, B_i, 1, 1, B_i+1, …, B_j, L, B_j+1, …, B_k, N] to [B₀, …, B_k, L, N]. In some embodiments, the transposing module 770 may transpose the reshaped output tensor based on the configuration parameter. For instance, the transposing module 770 may transpose the reshaped output tensor by reversing the transposes done by the transposing modules 710 and 720. The output of the transposing module 770 is the output tensor 703.

FIG. 8 illustrates an example of computing an output tensor of a matrix multiplication through a convolutional operation, in accordance with various embodiments. In the example of FIG. 8, the matrix multiplication has two input tensors with shape representations [5, 2, 1, 11, 7] and [3, 1, 2, 4, 7, 10] , respectively. The first input tensor has five dimensions, while the second input tensor has six dimensions.

The first input tensor is transposed to generate a first transposed tensor with a shape representation [2, 5, 11, 1, 1, 7] . The transpose may be performed by the transposing module 710 in FIG. 7. Before the transpose, the indices of the five dimensions may be represented as [0, 1, 2, 3, 4] , respectively. The index of a dimension indicates the position of the dimension in the first input tensor. After the transpose, the indices of the five dimensions are changed to [1, 0, 3, 2, 4] , which indicates the new order of the dimensions in the first input tensor.

The second input tensor is transposed to generate a second transposed tensor with a shape representation [2, 1, 3, 4, 10, 7]. The transposing may be performed by the transposing module 720 in FIG. 7. Before the transpose, the indices of the six dimensions may be represented as [0, 1, 2, 3, 4, 5], respectively. The index of a dimension indicates the position of the dimension in the second input tensor. After the transpose, the indices of the six dimensions are changed to [2, 1, 0, 3, 5, 4], which indicates the new order of the dimensions in the second input tensor. A configuration parameter may be computed based on the indices of the dimensions of the first input tensor (i.e., [1, 0, 3, 2, 4]) and the indices of the dimensions of the second input tensor (i.e., [2, 1, 0, 3, 5, 4]). The computation of the configuration parameter may also be based on a comparison of the sizes of the dimensions at the same position, which relates to broadcasting in matrix multiplication. In some embodiments, the configuration parameter is [3, 1, 0, 4, 2, 5].

Further, the first transposed tensor is reshaped to generate a first reshaped tensor with a different shape representation [2, 1, 1, 55, 7] . The reshaping may be performed by the reshaping module 730 in FIG. 7. The second transposed tensor is reshaped to generate a second reshaped tensor with a different shape representation [2, 120, 1, 1, 7] . The reshaping may be performed by the reshaping module 740 in FIG. 7.

The first reshaped tensor may be used as an activation tensor of a convolutional operation, while the second reshaped tensor may be used as a weight tensor of the convolutional operation. The convolutional operation may include a group of convolutions. The total number of convolutions in the convolutional operation may equal 2, as 2 is the size of the dimension shared by the two input tensors. To facilitate the transformation of the matrix multiplication operation to the convolutional operation, the total number of batches is made 1, the width of the weight tensor is made 1, and the height of the weight tensor is made 1.

The number of input channels may be 7, as 7 is the size of the dimension where multiplication and reduce-sum happen, i.e., the last dimension of the first input tensor and the second last dimension of the second input tensor. The number of output channels may be 120, which is the multiplication product of 3, 4, and 10 that are the sizes of the dimensions present in the second input tensor but not present in the first input tensor. The size of the dimension that is present in the first input tensor but not present in the second input tensor is 5, meaning the multiplication product of the height and width of the activation tensor equals 5. In the embodiments of FIG. 8, the height of the activation tensor is 5 and the width of the activation tensor is 1. The convolutional output tensor may have a shape representation [2, 1, 1, 55, 120]. As described above, the number of convolutions is 2, the number of batches is 1, and the number of output channels may equal 120. The dimensions left are the dimensions with sizes 1 and 55, which represent the height and width, respectively, of the output tensor.

The convolutional output tensor is reshaped to generate a reshaped output tensor with a shape representation [2, 5, 11, 3, 4, 10] . The reshaping may be performed by the reshaping module 760 in FIG. 7. The multiplication product of 5 and 11 is 55, which is present in the shape representation of the output tensor. The one-sized dimensions in the output tensor may be disregarded.

Further, the reshaped output tensor is transposed to generate the MatMul output tensor. The order of the dimensions in the reshaped output tensor may be changed based on the configuration parameter [3, 1, 0, 4, 2, 5] so that the shape representation [2, 5, 11, 3, 4, 10] is changed to a new shape representation [3, 5, 2, 4, 11, 10]. The transposing may be performed by the transposing module 770 in FIG. 7. The transpose can ensure that the last two dimensions match the last two dimensions of the MatMul output tensor as if the MatMul output tensor were generated by performing matrix multiplication. For instance, the last dimension of the output tensor should be the last dimension of the second input tensor, and the second last dimension of the output tensor should be the second last dimension of the first input tensor.

Also, the transpose can ensure that the other dimensions follow the broadcasting rule in matrix multiplication, as the configuration parameter is generated by comparing the sizes of the dimensions of the two input tensors at the same position. As the first input tensor has fewer dimensions, a new dimension may be added to the first input tensor. The new dimension may be one-sized and may be placed as the first dimension of the first input tensor. Accordingly, the first four dimensions of the first input tensor become [1, 5, 2, 1]. The first four dimensions of the second input tensor are [3, 1, 2, 4]. The four dimensions of the first input tensor and the second input tensor are therefore compatible and, after broadcasting, the four dimensions become [3, 5, 2, 4].
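The FIG. 8 example can be traced end to end with the following sketch, which uses the dimension orders and shapes stated above and compares the result against `np.matmul`. The group convolution is emulated with `np.einsum` rather than run on an accelerator, the input values are random, and the position at which the extra one-sized dimension is inserted into the first transposed tensor is an assumption chosen to reproduce the shape [2, 5, 11, 1, 1, 7].

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 2, 1, 11, 7))       # first input tensor
b = rng.standard_normal((3, 1, 2, 4, 7, 10))    # second input tensor

# Transpose the inputs with the dimension orders given in the text.
a_t = a.transpose(1, 0, 3, 2, 4)                # [2, 5, 11, 1, 7]
a_t = a_t[:, :, :, :, np.newaxis, :]            # [2, 5, 11, 1, 1, 7] (assumed placement)
b_t = b.transpose(2, 1, 0, 3, 5, 4)             # [2, 1, 3, 4, 10, 7]

# Reshape to the activation and weight tensors of the group convolution.
activation = a_t.reshape(2, 1, 1, 55, 7)        # [g, N, H, W, C_IN]
weight = b_t.reshape(2, 120, 1, 1, 7)           # [g, C_OUT, K_H, K_W, C_IN]

# Emulated group convolution: 1x1 kernel, stride 1, no padding.
conv_out = np.einsum("gnhwc,goc->gnhwo", activation, weight[:, :, 0, 0, :])

# Reshape, then transpose with the configuration parameter [3, 1, 0, 4, 2, 5].
reshaped = conv_out.reshape(2, 5, 11, 3, 4, 10)
result = reshaped.transpose(3, 1, 0, 4, 2, 5)   # [3, 5, 2, 4, 11, 10]

reference = np.matmul(a, b)                     # broadcasting MatMul for comparison
matches = np.allclose(result, reference)
```

In this sketch `conv_out` has the shape [2, 1, 1, 55, 120], and `result` agrees with the directly computed matrix multiplication in both shape and values.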

Example Data Processing Unit

FIG. 9 illustrates an example sparse cell 900, in accordance with various embodiments. The sparse cell 900 may be in a sparse cell array, e.g., the sparse cell array 370 in FIG. 3. The sparse cell 900 includes 16 MAC units 910 (individually referred to as “MAC unit 910” ) arranged in four rows and four columns, 16 weight register files 920 (individually referred to as “weight register file 920” ) , 16 activation register files 930 (individually referred to as “activation register file 930” ) , four row buffers 940 (individually referred to as “row buffer 940” ) , and sparsity modules 960 (individually referred to as “sparsity module 960” ) . In other embodiments, the sparse cell 900 may include fewer, more, or different components. For example, the sparse cell 900 may include a different number of MAC units 910, weight register files 920, activation register files 930, row buffers 940, or sparsity modules 960. As another example, the sparse cell 900 may include column buffers in lieu of or in addition to the row buffers 940.

The MAC units 910 are configured to perform MAC operations. Each MAC unit 910 may include one or more multipliers and one or more adders. A multiplier may multiply an activation with a weight at a time to compute a product. In some embodiments (e.g., embodiments where the MAC unit 910 includes multiple multipliers) , the multipliers may operate simultaneously to process multiple activation-weight pairs and compute multiple products in one cycle. An adder may accumulate products computed by the multipliers. Even though not shown in FIG. 9, the sparse cell may include an adder tree including a plurality of adder tiers. The first tier may receive outputs of a plurality of MAC units 910. The number of adders in the first tier may be half of the number of the MAC units 910, and each adder may accumulate the outputs of two MAC units 910. The second tier may receive outputs of adders in the first tier. The number of adders in the second tier may be half of the number of adders in the first tier, and each adder in the second tier may accumulate the outputs of two adders in the first tier. The adder tree may include one or more other tiers. The last tier may include a single adder that accumulates outputs of adders in the second last tier to compute a partial sum of the sparse cell 900.
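The tiered accumulation can be modeled in software as a pairwise reduction. The sketch below assumes that the number of inputs is a power of two (e.g., the outputs of 16 MAC units), so each tier has half as many adders as it has inputs; the function name `adder_tree` is hypothetical.

```python
def adder_tree(values):
    """Reduce values tier by tier, each adder summing two inputs.

    Returns the final sum and the number of tiers used. Assumes the number
    of inputs is a power of two, which is an illustrative simplification.
    """
    tier = list(values)
    count = len(tier)
    if count == 0 or count & (count - 1):
        raise ValueError("number of inputs must be a power of two")
    tiers = 0
    while len(tier) > 1:
        # Each adder in this tier accumulates two outputs of the previous tier.
        tier = [tier[i] + tier[i + 1] for i in range(0, len(tier), 2)]
        tiers += 1
    return tier[0], tiers

# Sixteen MAC outputs pass through tiers of 8, 4, 2, and 1 adders.
partial_sum, num_tiers = adder_tree(range(16))
```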

The weight register files 920 store weights to be processed in MAC operations. In the embodiments of FIG. 9, four weight register files 920 are grouped into a storage set that stores data to be used by a column of MAC units 910. There are four storage sets corresponding to the four columns of MAC units 910. In some embodiments, a weight register file 920 may correspond to a MAC unit 910 and store data to be processed by the MAC unit. In some embodiments, all the 16 weight register files 920 constitute a weight storage unit.

The activation register files 930 store activations to be processed in MAC operations. In the embodiments of FIG. 9, four activation register files 930 are grouped into a storage set that stores data to be used by a row of MAC units 910. There are four storage sets corresponding to the four rows of MAC units 910. In some embodiments, an activation register file 930 may correspond to a MAC unit 910 and store data to be processed by the MAC unit. In some embodiments, all the 16 activation register files 930 constitute an activation storage unit. The row buffers 940 store outputs of the MAC units 910. Each row buffer 940 may drain outputs of a single row of MAC units 910.

The sparsity module 960 facilitates dynamic sparsity-based acceleration in the sparse cell 900. In the embodiments of FIG. 9, each sparsity module 960 includes a sparsity tensor storage unit 965 and a control logic 967. The sparsity tensor storage unit 965 stores combined sparsity tensors. A combined sparsity tensor stored in the sparsity tensor storage unit 965 may correspond to an activation tensor and a weight tensor. A nonzero element in the combined sparsity tensor may correspond to a nonzero activation-weight pair that includes a nonzero activation and a nonzero weight. The position of the nonzero activation in the activation tensor may match the position of the nonzero weight in the weight tensor. The product of the nonzero activation and nonzero weight would be nonzero.

The control logic 967 may control transmission of weights and activations from the weight register files 920 and the activation register files 930 to the MAC units 910 based on sparsity tensors. For instance, the control logic 967 may select a subset of the weights stored in the weight register files 920 and select a subset of activations stored in the activation register files 930 based on a combined sparsity tensor. The selected weights and activations constitute nonzero activation-weight pairs. The control logic 967 may transmit the selected weights and activations to the MAC units 910 for performing MAC operations. The other weights stored in the weight register files 920 and the other activations stored in the activation register files 930 are skipped from computation. In the embodiments of FIG. 9, each sparsity module 960 controls sparsity acceleration in a respective MAC unit 910. As the sparsity acceleration is based on both weight sparsity and activation sparsity, 16 sparsity modules 960 are used for accelerating computations in the 16 MAC units 910.
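The selection of nonzero activation-weight pairs can be modeled in software as follows. The sketch assumes that the combined sparsity tensor is a bitmap obtained as the elementwise AND of an activation bitmap and a weight bitmap; how the hardware actually derives and stores it is not specified here, and the function name `sparse_mac` and the sample values are illustrative.

```python
import numpy as np

def sparse_mac(activations, weights):
    """Accumulate only the pairs where both the activation and weight are nonzero.

    Returns the accumulated sum and the number of MAC operations performed.
    """
    act_bitmap = activations != 0
    wgt_bitmap = weights != 0
    combined = act_bitmap & wgt_bitmap      # assumed form of the combined sparsity tensor
    selected_act = activations[combined]    # activations sent to the MAC unit
    selected_wgt = weights[combined]        # weights sent to the MAC unit
    return float(np.dot(selected_act, selected_wgt)), int(combined.sum())

acts = np.array([0, 2, 3, 0, 5, 0, 1, 4], dtype=np.float32)
wgts = np.array([1, 0, 2, 0, 3, 6, 0, 2], dtype=np.float32)
mac_sum, mac_count = sparse_mac(acts, wgts)
dense_sum = float(np.dot(acts, wgts))
```

Only three of the eight pairs are processed in this example, and the accumulated value equals the dense dot product because every skipped pair would have contributed zero.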

As shown in FIG. 9, the sparse cell 900 is associated with multiplexers (MUXs) 903, 904, 905, and 906. In other embodiments, the sparse cell 900 may be associated with a different number of MUXs or other devices. The MUX 903 facilitates loading weights, e.g., from the local memory 340, into the weight register files 920. The MUX 904 facilitates loading activations, e.g., from the local memory 340, into the activation register files 930. The MUX 905 facilitates loading sparsity tensors into the sparsity tensor storage unit 965. The MUX 906 may be a drain MUX that can facilitate draining outputs of the MAC units 910, e.g., to the local memory 340.

In some embodiments, the sparse cell 900 may also execute matrix multiplications converted from Fourier transform operations. For an example Fourier transform operation, the MAC units 910 may perform MAC operations in the two sequences of matrix multiplications converted from the Fourier transform operation. The weight register files 920 may be used to store data points in the transformation tensor of the Fourier transform operation. The activation register files 930 may be used to store data points in the input tensor of the Fourier transform operation. The row buffers 940 may store data points in the output tensor of the Fourier transform operation.

FIG. 10 illustrates a data processing unit 1000, in accordance with various embodiments. The data processing unit 1000 may be an example of the sparse cell array 370 in FIG. 3. In FIG. 10, the data processing unit 1000 includes sparse cells 1010 (individually referred to as “sparse cell 1010” ) arranged in four columns and four rows, an activation memory 1020, and a weight memory 1030. The data processing unit 1000 may also be referred to as a data processing unit. In other embodiments, the data processing unit 1000 may include fewer, more, or different components. For instance, the data processing unit 1000 may include a different number of columns, rows, or sparse cells 1010.

Each sparse cell 1010 may perform sparsity accelerated MAC operations. The sparse cells 1010 may facilitate dynamic sparsity mode. For instance, the sparsity modes of a sparse cell 1010 may be dynamically changed between a combined sparsity mode, an activation sparsity mode, a weight sparsity mode, and a dense mode. An embodiment of a sparse cell 1010 may be the sparse cell 900 in FIG. 9. The activation memory 1020 stores activations, such as activations in input tensors of deep learning operations. Activations may be loaded from the activation memory 1020 to sparse cells 1010. The weight memory 1030 stores weights, such as weights in filters of deep learning operations. Weights may be loaded from the weight memory 1030 to sparse cells 1010. The activation memory 1020 or weight memory 1030 may be a buffer. In other embodiments, the data processing unit 1000 may include a dense data memory and a sparse data memory in lieu of the activation memory 1020 and weight memory 1030. The dense data memory may store dense tensors, e.g., dense tensors generated by the load module 360. The sparse data memory may store sparse tensors.

The data processing unit 1000 may also execute matrix multiplications in Fourier transform operations. The activation memory 1020 may be used to store input tensors of the Fourier transform operations. The weight memory 1030 may be used to store transformation matrices of the Fourier transform operations.

Example Method of Executing Matrix Multiplication Operations

FIG. 11 is a flowchart showing a method 1100 of executing a matrix multiplication operation, in accordance with various embodiments. The method 1100 may be performed by the DNN system 300 in FIG. 3. Although the method 1100 is described with reference to the flowchart illustrated in FIG. 11, many other methods for executing matrix multiplication operations may alternatively be used. For example, the order of execution of the steps in FIG. 11 may be changed. As another example, some of the steps may be changed, eliminated, or combined.

The DNN system 300 receives 1110 a first input tensor and a second input tensor of a matrix multiplication operation. In some embodiments, the matrix multiplication operation is an operation in a Transformer-based DNN. In some embodiments, the first input tensor or the second input tensor has a plurality of dimensions. The dimensions, the order of the dimensions, and the sizes of the dimensions define the shape of the first input tensor or the second input tensor.

The DNN system 300 converts 1120 the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation. In some embodiments, the DNN system 300 generates a transposed tensor by transposing one or more dimensions of the first input tensor. The DNN system 300 modifies the shape of the transposed tensor. In some embodiments, the DNN system 300 determines whether the first input tensor has fewer dimensions than the second input tensor. In response to determining that the first input tensor has fewer dimensions than the second input tensor, the DNN system 300 adds a dimension to the first input tensor. In some embodiments, the DNN system 300 adds the dimension to the first input tensor by adding the dimension as the first dimension that is arranged before one or more dimensions of the first input tensor. In some embodiments, the dimension is one-sized.

In some embodiments, the convolutional operation comprises a group of convolutions. The DNN system 300 generates the output tensor of the matrix multiplication operation by concatenating output tensors of the convolutions. In some embodiments, the total number of convolutions in the convolutional operation is determined based on a size of a dimension shared by the first input tensor and the second input tensor of the matrix multiplication operation. In some embodiments, a total number of channels in the activation tensor equals a size of a last dimension of the first input tensor of the matrix multiplication operation.

The DNN system 300 converts 1130 the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation. In some embodiments, at least one dimension of the weight tensor is one-sized. In some embodiments, the DNN system 300 generates a transposed tensor by transposing one or more dimensions of the second input tensor. The DNN system 300 modifies the shape of the transposed tensor. In some embodiments, the DNN system 300 determines whether the second input tensor has fewer dimensions than the first input tensor. In response to determining that the second input tensor has fewer dimensions than the first input tensor, the DNN system 300 adds a dimension to the second input tensor. In some embodiments, the DNN system 300 adds the dimension to the second input tensor by adding the dimension as the first dimension that is arranged before one or more dimensions of the second input tensor. In some embodiments, the dimension is one-sized.
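The dimension-adding step described for both input tensors can be sketched as a small shape helper. The name `pad_leading_dims` is hypothetical, and the sketch generalizes the text by prepending as many one-sized dimensions as are needed for the two shapes to have the same number of dimensions.

```python
def pad_leading_dims(shape_a, shape_b):
    """Prepend one-sized dimensions to the shorter shape so both ranks match."""
    rank = max(len(shape_a), len(shape_b))

    def pad(shape):
        # New dimensions are one-sized and placed before the existing ones.
        return (1,) * (rank - len(shape)) + tuple(shape)

    return pad(shape_a), pad(shape_b)

# Shapes from the FIG. 8 example: the first tensor has five dimensions,
# the second has six, so one dimension is added to the first.
padded_a, padded_b = pad_leading_dims((5, 2, 1, 11, 7), (3, 1, 2, 4, 7, 10))
```

After padding, the first shape becomes (1, 5, 2, 1, 11, 7) while the second is unchanged, so the sizes at each position can be compared for broadcasting.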

The DNN system 300 generates 1140 an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor. In some embodiments, the DNN system 300 generates an output tensor of the convolutional operation from the activation tensor and the weight tensor. The DNN system 300 generates a reshaped tensor by modifying a shape of the output tensor of the convolutional operation. The DNN system 300 transposes one or more dimensions of the reshaped tensor to generate the output tensor of the matrix multiplication operation.

Example Computing Device

FIG. 12 is a block diagram of an example computing device 1200, in accordance with various embodiments. In some embodiments, the computing device 1200 can be used as at least part of the DNN system 300. A number of components are illustrated in FIG. 12 as included in the computing device 1200, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in the computing device 1200 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 1200 may not include one or more of the components illustrated in FIG. 12, but the computing device 1200 may include interface circuitry for coupling to the one or more components. For example, the computing device 1200 may not include a display device 1206, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1206 may be coupled. In another set of examples, the computing device 1200 may not include an audio input device 1218 or an audio output device 1208 but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1218 or audio output device 1208 may be coupled.

The computing device 1200 may include a processing device 1202 (e.g., one or more processing devices) . The processing device 1202 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The computing device 1200 may include a memory 1204, which may itself include one or more memory devices such as volatile memory (e.g., DRAM) , nonvolatile memory (e.g., read-only memory (ROM) ) , high bandwidth memory (HBM) , flash memory, solid state memory, and/or a hard drive. In some embodiments, the memory 1204 may include memory that shares a die with the processing device 1202. In some embodiments, the memory 1204 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for executing matrix multiplication operations (e.g., the method 1100 described in conjunction with FIG. 11) or some operations performed by the DNN system 300. The instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1202.

In some embodiments, the computing device 1200 may include a communication chip 1212 (e.g., one or more communication chips). For example, the communication chip 1212 may be configured for managing wireless communications for the transfer of data to and from the computing device 1200. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.

The communication chip 1212 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 1212 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 1212 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 1212 may operate in accordance with Code-division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 1212 may operate in accordance with other wireless protocols in other embodiments. The computing device 1200 may include an antenna 1222 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).

In some embodiments, the communication chip 1212 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., Ethernet). As noted above, the communication chip 1212 may include multiple communication chips. For instance, a first communication chip 1212 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1212 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 1212 may be dedicated to wireless communications, and a second communication chip 1212 may be dedicated to wired communications.

The computing device 1200 may include battery/power circuitry 1214. The battery/power circuitry 1214 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1200 to an energy source separate from the computing device 1200 (e.g., AC line power).

The computing device 1200 may include a display device 1206 (or corresponding interface circuitry, as discussed above). The display device 1206 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.

The computing device 1200 may include an audio output device 1208 (or corresponding interface circuitry, as discussed above). The audio output device 1208 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.

The computing device 1200 may include an audio input device 1218 (or corresponding interface circuitry, as discussed above). The audio input device 1218 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).

The computing device 1200 may include a GPS device 1216 (or corresponding interface circuitry, as discussed above). The GPS device 1216 may be in communication with a satellite-based system and may receive a location of the computing device 1200, as known in the art.

The computing device 1200 may include another output device 1210 (or corresponding interface circuitry, as discussed above). Examples of the other output device 1210 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.

The computing device 1200 may include another input device 1220 (or corresponding interface circuitry, as discussed above). Examples of the other input device 1220 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.

The computing device 1200 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA), an ultramobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 1200 may be any other electronic device that processes data.

Select Examples

The following paragraphs provide various examples of the embodiments disclosed herein.

Example 1 provides a method, including receiving a first input tensor and a second input tensor of a matrix multiplication operation; converting the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation; converting the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation; and generating an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor.

Example 2 provides the method of example 1, in which converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation includes generating a transposed tensor by transposing one or more dimensions of the first input tensor; and modifying a shape of the transposed tensor.

Example 3 provides the method of example 1 or 2, in which converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation includes determining whether the first input tensor has fewer dimensions than the second input tensor; and in response to determining that the first input tensor has fewer dimensions than the second input tensor, adding a dimension to the first input tensor.

Example 4 provides the method of example 3, in which adding the dimension to the first input tensor includes adding the dimension as a first dimension that is arranged before one or more dimensions of the first input tensor, in which the dimension is one-sized.

Example 5 provides the method of any one of examples 1-4, in which the convolutional operation includes a group of convolutions, and generating the output tensor of the matrix multiplication operation includes concatenating output tensors of the convolutions.

Example 6 provides the method of example 5, in which a total number of convolutions in the convolutional operation is determined based on a size of a dimension shared by the first input tensor and the second input tensor of the matrix multiplication operation.

Example 7 provides the method of any one of examples 1-6, in which a total number of channels in the activation tensor is equal to a size of a last dimension of the first input tensor of the matrix multiplication operation.

Example 8 provides the method of any one of examples 1-7, in which at least one dimension of the weight tensor is one-sized.

Example 9 provides the method of any one of examples 1-8, in which generating the output tensor of the matrix multiplication operation includes generating an output tensor of the convolutional operation from the activation tensor and the weight tensor; generating a reshaped tensor by modifying a shape of the output tensor of the convolutional operation; and transposing one or more dimensions of the reshaped tensor.

Example 10 provides the method of example 9, in which a total number of channels in the output tensor of the convolutional operation is determined based on a size of a dimension of the second input tensor of the matrix multiplication operation.

Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations, the operations including receiving a first input tensor and a second input tensor of a matrix multiplication operation; converting the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation; converting the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation; and generating an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor.

Example 12 provides the one or more non-transitory computer-readable media of example 11, in which converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation includes generating a transposed tensor by transposing one or more dimensions of the first input tensor; and modifying a shape of the transposed tensor.

Example 13 provides the one or more non-transitory computer-readable media of example 11 or 12, in which converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation includes determining whether the first input tensor has fewer dimensions than the second input tensor; and in response to determining that the first input tensor has fewer dimensions than the second input tensor, adding a dimension to the first input tensor.

Example 14 provides the one or more non-transitory computer-readable media of any one of examples 11-13, in which the convolutional operation includes a group of convolutions, and generating the output tensor of the matrix multiplication operation includes concatenating output tensors of the convolutions.

Example 15 provides the one or more non-transitory computer-readable media of any one of examples 11-14, in which a total number of channels in the activation tensor is equal to a size of a last dimension of the first input tensor of the matrix multiplication operation.

Example 16 provides the one or more non-transitory computer-readable media of any one of examples 11-15, in which generating the output tensor of the matrix multiplication operation includes generating an output tensor of the convolutional operation from the activation tensor and the weight tensor; generating a reshaped tensor by modifying a shape of the output tensor of the convolutional operation; and transposing one or more dimensions of the reshaped tensor.

Example 17 provides the one or more non-transitory computer-readable media of example 16, in which a total number of channels in the output tensor of the convolutional operation is determined based on a size of a dimension of the second input tensor of the matrix multiplication operation.

Example 18 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations including receiving a first input tensor and a second input tensor of a matrix multiplication operation, converting the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation, converting the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation, and generating an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor.

Example 19 provides the apparatus of example 18, in which converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation includes generating a transposed tensor by transposing one or more dimensions of the first input tensor; and modifying a shape of the transposed tensor.

Example 20 provides the apparatus of example 18 or 19, in which generating the output tensor of the matrix multiplication operation includes generating an output tensor of the convolutional operation from the activation tensor and the weight tensor; generating a reshaped tensor by modifying a shape of the output tensor of the convolutional operation; and transposing one or more dimensions of the reshaped tensor.
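
For illustration only, the conversion recited in Examples 1-10 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the function names `conv2d_pointwise` and `matmul_via_conv`, the NCHW activation layout, the OIHW weight layout, the per-batch loop, and the restriction to a three-dimensional second input tensor are all assumptions chosen for the example, and `conv2d_pointwise` is a software stand-in for the convolution that a DNN accelerator would execute.

```python
import numpy as np


def conv2d_pointwise(activation, weight):
    """Reference 1x1, stride-1 convolution (hypothetical helper).

    activation: (N, C, H, W); weight: (O, C, 1, 1); returns (N, O, H, W).
    """
    n, c, h, w = activation.shape
    out_c, in_c, kh, kw = weight.shape
    assert (in_c, kh, kw) == (c, 1, 1), "expects a 1x1 kernel over all input channels"
    # With a 1x1 kernel, every output pixel is a dot product over input channels.
    return np.einsum("nchw,oc->nohw", activation, weight[:, :, 0, 0])


def matmul_via_conv(a, b):
    """Compute a @ b, with b of shape (B, K, N) and a of shape (B, M, K) or (M, K)."""
    assert b.ndim == 3 and a.ndim in (2, 3)
    if a.ndim < b.ndim:
        # Fewer dimensions than the second input: add a leading one-sized dimension.
        a = a[np.newaxis, ...]
    batch = b.shape[0]
    assert a.shape[0] in (1, batch) and a.shape[-1] == b.shape[-2]
    m, k = a.shape[-2:]
    n = b.shape[-1]

    outputs = []
    for i in range(batch):  # one convolution per index of the shared batch dimension
        a_i = a[i] if a.shape[0] == batch else a[0]
        # Transpose, then reshape: K becomes the channel count of the activation.
        act = a_i.T.reshape(1, k, 1, m)
        # Transpose, then reshape: N output channels, K input channels, 1x1 kernel.
        wgt = b[i].T.reshape(n, k, 1, 1)
        out = conv2d_pointwise(act, wgt)      # shape (1, N, 1, M)
        # Reshape, then transpose, to recover the (M, N) matmul result.
        outputs.append(out.reshape(n, m).T)
    # Combine the per-convolution outputs along a new leading batch axis.
    return np.stack(outputs, axis=0)
```

In this sketch the activation tensor has K channels (the size of the last dimension of the first input tensor), the weight tensor has one-sized kernel dimensions, and the convolution output has N channels (a dimension of the second input tensor). For example, calling `matmul_via_conv(a, b)` with `a` of shape (3, 4, 5) and `b` of shape (3, 5, 6) yields a tensor of shape (3, 4, 6) that can be compared against `a @ b`.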

The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.

Claims

1. A method, comprising:

receiving a first input tensor and a second input tensor of a matrix multiplication operation;

converting the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation;

converting the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation; and

generating an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor.
2. The method of claim 1, wherein converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation comprises:

generating a transposed tensor by transposing one or more dimensions of the first input tensor; and

modifying a shape of the transposed tensor.
3. The method of claim 1, wherein converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation comprises:

determining whether the first input tensor has fewer dimensions than the second input tensor; and

in response to determining that the first input tensor has fewer dimensions than the second input tensor, adding a dimension to the first input tensor.
4. The method of claim 3, wherein adding the dimension to the first input tensor comprises:

adding the dimension as a first dimension that is arranged before one or more dimensions of the first input tensor,

wherein the dimension is one-sized.
5. The method of claim 1, wherein the convolutional operation comprises a group of convolutions, and generating the output tensor of the matrix multiplication operation comprises concatenating output tensors of the convolutions.
6. The method of claim 5, wherein a total number of convolutions in the convolutional operation is determined based on a size of a dimension shared by the first input tensor and the second input tensor of the matrix multiplication operation.
7. The method of claim 1, wherein a total number of channels in the activation tensor is equal to a size of a last dimension of the first input tensor of the matrix multiplication operation.
8. The method of claim 1, wherein at least one dimension of the weight tensor is one-sized.
9. The method of claim 1, wherein generating the output tensor of the matrix multiplication operation comprises:

generating an output tensor of the convolutional operation from the activation tensor and the weight tensor;

generating a reshaped tensor by modifying a shape of the output tensor of the convolutional operation; and

transposing one or more dimensions of the reshaped tensor.
10. The method of claim 9, wherein a total number of channels in the output tensor of the convolutional operation is determined based on a size of a dimension of the second input tensor of the matrix multiplication operation.
11. One or more non-transitory computer-readable media storing instructions executable to perform operations, the operations comprising:

receiving a first input tensor and a second input tensor of a matrix multiplication operation;

converting the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation;

converting the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation; and

generating an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor.
12. The one or more non-transitory computer-readable media of claim 11, wherein converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation comprises:

generating a transposed tensor by transposing one or more dimensions of the first input tensor; and

modifying a shape of the transposed tensor.
13. The one or more non-transitory computer-readable media of claim 11, wherein converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation comprises:

determining whether the first input tensor has fewer dimensions than the second input tensor; and

in response to determining that the first input tensor has fewer dimensions than the second input tensor, adding a dimension to the first input tensor.
14. The one or more non-transitory computer-readable media of claim 11, wherein the convolutional operation comprises a group of convolutions, and generating the output tensor of the matrix multiplication operation comprises concatenating output tensors of the convolutions.
15. The one or more non-transitory computer-readable media of claim 11, wherein a total number of channels in the activation tensor is equal to a size of a last dimension of the first input tensor of the matrix multiplication operation.
16. The one or more non-transitory computer-readable media of claim 11, wherein generating the output tensor of the matrix multiplication operation comprises:

generating an output tensor of the convolutional operation from the activation tensor and the weight tensor;

generating a reshaped tensor by modifying a shape of the output tensor of the convolutional operation; and

transposing one or more dimensions of the reshaped tensor.
17. The one or more non-transitory computer-readable media of claim 16, wherein a total number of channels in the output tensor of the convolutional operation is determined based on a size of a dimension of the second input tensor of the matrix multiplication operation.
18. An apparatus, comprising:

a computer processor for executing computer program instructions; and

a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations comprising:

receiving a first input tensor and a second input tensor of a matrix multiplication operation,

converting the first input tensor of the matrix multiplication operation into an activation tensor of a convolutional operation,

converting the second input tensor of the matrix multiplication operation into a weight tensor of the convolutional operation, and

generating an output tensor of the matrix multiplication operation by performing the convolutional operation on the activation tensor and the weight tensor.
19. The apparatus of claim 18, wherein converting the first input tensor of the matrix multiplication operation into the activation tensor of the convolutional operation comprises:

generating a transposed tensor by transposing one or more dimensions of the first input tensor; and

modifying a shape of the transposed tensor.
20. The apparatus of claim 18, wherein generating the output tensor of the matrix multiplication operation comprises:

generating an output tensor of the convolutional operation from the activation tensor and the weight tensor;

generating a reshaped tensor by modifying a shape of the output tensor of the convolutional operation; and

transposing one or more dimensions of the reshaped tensor.