WO2025230559A1

WO2025230559A1 - Neural network accelerator with configurable data storage

Info

Publication number: WO2025230559A1
Application number: PCT/US2024/044731
Authority: WO
Inventors: Dinakar Kondru; Deepak Abraham Mathaikutty; Martin-Thomas Grymel; Arnab RAHA; Umer Iftikhar Cheema
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2024-04-29
Filing date: 2024-08-30
Publication date: 2025-11-06
Anticipated expiration: 2026-10-29

Abstract

A data processing unit (DPU) in a neural network accelerator may include a compute unit, a memory, and a data delivery unit. The compute unit may perform computations in neural network layers. The memory stores data used and generated by the compute unit. The data delivery unit may transfer data between the memory and the compute unit and may include one or more configurable data storages that may be configured to store different types of data for different operational modes of the DPU. An example configurable data storage may store sparsity data when the DPU operates in one mode but store input or output data of a neural network layer when the DPU operates in another mode. The sparsity data may indicate sparsity in the input or output data of the neural network layer and may be used to accelerate the execution of the neural network layer or the next layer.

Description

NEURAL NETWORK ACCELERATOR WITH CONFIGURABLE DATA STORAGE

Cross-Reference to Related Application

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/640,074, filed April 29, 2024, and entitled "RECONFIGURABLE STORAGE FOR CROSS-MODAL FUNCTIONALITY AND PERFORMANCE UPLIFT IN DEEP NEURAL NETWORK," which is incorporated in its entirety by reference.

Technical Field

[0002] This disclosure relates generally to neural networks (also referred to as "deep neural networks" or "DNNs"), and more specifically, to DNN accelerators with configurable data storages.

Background

[0003] DNNs are used extensively for a variety of artificial intelligence applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands as there can be a large number of operations as well as a large amount of data to read and write. Therefore, techniques to improve efficiency of DNNs are needed.

Brief Description of the Drawings

[0004] Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

[0005] FIG. 1 illustrates an example DNN, in accordance with various embodiments.

[0006] FIG. 2 illustrates an example convolution, in accordance with various embodiments.

[0007] FIGS. 3A-3F illustrate various storage formats of a tensor, in accordance with various embodiments.

[0008] FIG. 4 is a block diagram of a DNN system, in accordance with various embodiments.

[0009] FIG. 5 illustrates an example sparse cell, in accordance with various embodiments.

[0010] FIG. 6 illustrates an example sparse cell array, in accordance with various embodiments.

[0011] FIG. 7 illustrates an example processing element (PE), in accordance with various embodiments.

[0012] FIG. 8 illustrates an example input delivery unit (IDU), in accordance with various embodiments.

[0013] FIG. 9 illustrates an example output delivery unit (ODU), in accordance with various embodiments.

[0014] FIG. 10 illustrates that sparsity storages in the ODU can be reconfigured for staging, in accordance with various embodiments.

[0015] FIGS. 11A-11C illustrate that sparsity storages in the ODU can be reconfigured for reordering, in accordance with various embodiments.

[0016] FIG. 12 illustrates that sparsity storages in the ODU can be reconfigured for reordering and transposing, in accordance with various embodiments.

[0017] FIG. 13 illustrates that permutation storages in the ODU can be reconfigured for staging, in accordance with various embodiments.

[0018] FIG. 14 illustrates that write storages in the ODU can be reconfigured as a transaction first-in-first-out (FIFO) buffer, in accordance with various embodiments.

[0019] FIGS. 15A-15L illustrate a process of permuting a tensor in an intermediate storage unit, in accordance with various embodiments.

[0020] FIG. 16 is a block diagram of a DNN module, in accordance with various embodiments.

[0021] FIG. 17 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

Overview

[0022] The last decade has witnessed a rapid rise in artificial intelligence (AI) based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, and image and video processing, mainly due to their ability to achieve beyond human-level accuracy. A DNN typically includes a sequence of layers. A DNN layer may include one or more deep learning operations (also referred to as "neural network operations"), such as convolution, matrix multiplication, layer normalization, batch normalization, SoftMax operation, pooling, elementwise operation, linear operation, nonlinear operation, and so on.

[0023] Input or output data of deep learning operations may be arranged in data structures called tensors. A tensor is a data structure having multiple elements across one or more dimensions. Examples of tensors include vectors (which are one-dimensional (1D) tensors), matrices (which are two-dimensional (2D) tensors), three-dimensional (3D) tensors, four-dimensional (4D) tensors, and even higher dimensional tensors. A dimension of a tensor may correspond to an axis, e.g., an axis in a coordinate system. A dimension may be measured by the number of data points along the axis. The dimensions of a tensor may define the shape of the tensor. A DNN layer may receive one or more input tensors and compute an output tensor from the one or more input tensors. Taking a convolutional layer for example, the input tensors include an activation tensor (also referred to as "input feature map (IFM)" or "input activation tensor") including one or more activations (also referred to as "input elements") and a weight tensor. The weight tensor may be a kernel (a 2D weight tensor), a filter (a 3D weight tensor), or a group of filters (a 4D weight tensor). A convolution may be performed on the input activation tensor and weight tensor to compute an output activation tensor in the convolutional layer.

[0024] The significant improvements in DNN model size and accuracy coupled with the rapid increase in computing power of execution platforms have led to the adoption of DNN applications even within resource constrained mobile and edge devices that have limited energy availability. DNN models may be executed, e.g., for training or inference, by DNN accelerators. A DNN accelerator may be or include one or more data processing units (DPUs). A DPU may also be referred to as a neural processing unit or compute block. A DPU may include PEs that can carry out neural network operations.

[0025] The execution of many DNN models can be facilitated by tensor permutation. Tensor permutation can be used to increase the utilization of a DPU. In an example where the DPU is designed to be fully utilized when the minimum dimensions of the input tensor are 4 x 4 x 1, the utilization would be 25% when the input tensor is 1 x 4 x 4. By permuting the input tensor to 4 x 4 x 1, 100% utilization can be achieved, which can significantly improve the efficiency and performance of the DPU. Alternatively or additionally, tensor permutation may be required for some DNN models, such as Transformer-based networks. In an example, tensor permutation is used to convert a 2D tensor (e.g., a 2D tensor having a spatial size of batch size x sequence length) to a 3D tensor (e.g., a 3D tensor having a spatial size of batch size x sequence length x embedding dimension) in an input embedding block of a Transformer-based network. Tensor permutation is also used in encoder blocks of Transformer-based networks.
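The utilization figures in the example above can be reproduced with a small Python sketch. The utilization model used here (the fraction of tile slots holding real data once each dimension is padded up to a whole 4 x 4 x 1 tile) is an assumption made for illustration and is not stated in this disclosure; the names `TILE`, `utilization`, and `permute_shape` are likewise hypothetical.

```python
from math import ceil

# Hypothetical minimum (X, Y, Z) tile that keeps the DPU fully utilized.
TILE = (4, 4, 1)

def utilization(shape, tile=TILE):
    """Fraction of tile slots that hold real data after padding `shape` to whole tiles."""
    elements = 1
    slots = 1
    for dim, t in zip(shape, tile):
        elements *= dim
        slots *= ceil(dim / t) * t  # pad this dimension up to a multiple of the tile size
    return elements / slots

def permute_shape(shape, order):
    """Return the shape after reordering the axes, e.g. order (1, 2, 0) maps (X, Y, Z) to (Y, Z, X)."""
    return tuple(shape[axis] for axis in order)

before = (1, 4, 4)
after = permute_shape(before, (1, 2, 0))
print(before, utilization(before))
print(after, utilization(after))
```

Under this assumed model, the 1 x 4 x 4 tensor fills 16 of 64 slots, while the permuted 4 x 4 x 1 tensor fills all 16 slots of a single tile.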

[0026] A 3D tensor may have three dimensions corresponding to X-, Y-, and Z-axes, respectively, in a 3D space. Data elements (e.g., activations, weights, etc.) in the 3D tensor may be arranged along these axes. Each data element may be represented by an (x, y, z) coordinate, which may indicate the position of the data element in the 3D tensor. The shape and spatial size of the 3D tensor may be defined by lengths in the three dimensions, which may be the number of data elements along the three axes. In an example, a 3D tensor having a spatial size of 3 x 4 x 5 may have a total of three elements along the X axis, a total of four elements along the Y axis, and a total of five elements along the Z axis. In some embodiments, the Z axis may correspond to channels (e.g., input channels, output channels, etc.). The length along the Z axis may indicate a total number of channels in the tensor. An (x, y) pair may represent a spatial point in the 3D tensor.

[0027] Tensors in DNNs can be saved or transferred in various formats, such as X-major formats (e.g., XYZ or XZY format), Y-major formats (e.g., YXZ or YZX format), and Z-major formats (e.g., ZXY or ZYX format). The format of a tensor may define the order in which the data points in the tensor are stored, written, or read. The first character may represent the dimension in which data points are contiguous in memory. The second character may represent the dimension in which data points can be accessed after the contiguous data points are accessed in memory. The third character may represent the dimension in which data points are accessed after the data points in the dimension represented by the second character are exhausted. Taking the ZXY format for example, the access order first starts in the Z dimension, then moves to the X dimension, and finally moves to the Y dimension. Data points in the tensor are contiguous in memory in the Z dimension, meaning data points having the same (x, y) coordinates are contiguous in memory. Using tensor permutation, the tensor may be read from memory in a different format. Output tensors of neural network operations may be saved in Z-major formats because the DPU may accumulate partial products along the channel or Z dimension. Z-major formats of tensors may be converted to X-major formats through X-major permutation or converted to Y-major formats through Y-major permutation.
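To make the format naming concrete, the following Python sketch computes where an element lands in memory for a given three-character format. It assumes a dense, unpadded, element-granular layout, which this disclosure does not specify; `linear_offset` and `storage_order` are hypothetical helper names.

```python
AXIS = {"X": 0, "Y": 1, "Z": 2}

def linear_offset(coord, shape, fmt):
    """Offset of element `coord` = (x, y, z) in a tensor of `shape` = (X, Y, Z) stored in `fmt`.

    The first character of `fmt` names the dimension with stride 1 (contiguous in memory),
    the second character the next-fastest dimension, and the third the slowest.
    """
    offset, stride = 0, 1
    for ch in fmt:
        axis = AXIS[ch]
        offset += coord[axis] * stride
        stride *= shape[axis]
    return offset

def storage_order(shape, fmt):
    """List all (x, y, z) coordinates in the order they appear in memory for `fmt`."""
    coords = [(x, y, z)
              for x in range(shape[0])
              for y in range(shape[1])
              for z in range(shape[2])]
    return sorted(coords, key=lambda c: linear_offset(c, shape, fmt))

# In ZXY format, elements sharing the same (x, y) sit next to each other.
print(storage_order((2, 2, 2), "ZXY"))
```

For a 2 x 2 x 2 tensor in ZXY format, the first four entries are (0, 0, 0), (0, 0, 1), (1, 0, 0), and (1, 0, 1): Z advances first, then X, and Y advances last.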

[0028] A DPU in a DNN accelerator may include one or more data delivery units in addition to one or more compute units. The data delivery unit(s) can transfer data between the compute unit(s) and a memory (e.g., a static random-access memory (SRAM)). For instance, a DPU may include a load unit for loading data into the compute unit(s) and a drain unit for draining data from the compute unit(s). A data delivery unit may include one or more data storage units used for tensor computation. For instance, tensor permutation can be done by writing the incoming tensor to a storage at desired location(s), followed by reading from the desired location(s) of the storage in the desired order and writing back to the memory. There may also be one or more data storage units that can store sparsity data, which may indicate sparsity in input or output of neural network layers and may be used for accelerating computations in the compute unit(s). The sparsity data may include sparsity tensors, such as sparsity bitmaps. These data storage units may be used in some functional modes (also referred to as "operational modes") of the DPU but unused in other functional modes. For instance, a storage for permutation may be unused in a Z-major functional mode of the DPU when the output of the compute unit(s) is already in a Z-major format. Also, a sparsity storage may be unused in a dense mode of the DPU when there is no sparsity acceleration for the computations performed by the compute unit(s). In the functional modes where these storages are not used, the silicon area they occupy is wasted, which can impair the area efficiency and performance of the DNN accelerator.
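The write-then-read permutation described above can be sketched in Python as follows. This is a simplified software model under stated assumptions: the storage is a flat list large enough to hold the whole tensor, and elements arrive one at a time in source-format order. Real hardware would likely operate on banks and multi-element words. All function names are hypothetical.

```python
AXIS = {"X": 0, "Y": 1, "Z": 2}

def offset(coord, shape, fmt):
    """Linear offset of `coord` when the tensor is laid out in `fmt` (first character is contiguous)."""
    off, stride = 0, 1
    for ch in fmt:
        off += coord[AXIS[ch]] * stride
        stride *= shape[AXIS[ch]]
    return off

def coords_in_order(shape, fmt):
    """Yield (x, y, z) coordinates in the order a stream in format `fmt` delivers them."""
    a0, a1, a2 = (AXIS[ch] for ch in fmt)  # a0 is the fastest-changing axis
    for i2 in range(shape[a2]):
        for i1 in range(shape[a1]):
            for i0 in range(shape[a0]):
                coord = [0, 0, 0]
                coord[a0], coord[a1], coord[a2] = i0, i1, i2
                yield tuple(coord)

def permute_via_storage(stream, shape, src_fmt, dst_fmt):
    """Permute a tensor by scattering it into a storage, then reading the storage sequentially."""
    storage = [None] * len(stream)  # stands in for the permutation storage
    # Write phase: place each incoming element at its location in the destination format.
    for value, coord in zip(stream, coords_in_order(shape, src_fmt)):
        storage[offset(coord, shape, dst_fmt)] = value
    # Read phase: a sequential read now returns the tensor in destination-format order.
    return list(storage)

# A 2 x 1 x 3 tensor arriving in ZXY order, labeled by its x and z coordinates.
zxy_stream = ["x0z0", "x0z1", "x0z2", "x1z0", "x1z1", "x1z2"]
print(permute_via_storage(zxy_stream, (2, 1, 3), "ZXY", "XYZ"))
```

In this model the reordering happens entirely in the write addresses, so the read side stays a plain sequential scan. The same result could be obtained with sequential writes and permuted reads.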

[0029] Embodiments of the present disclosure may improve on at least some of the challenges and issues described above by providing DNN accelerators with configurable data storages. An example configurable data storage may be configured to have different storage functions for different operational modes of the DNN accelerator. For instance, a configurable data storage may be configured to store different types of data for different operational modes, which can improve area efficiency and performance of the DNN accelerator.

[0030] In various embodiments of the present disclosure, a DNN accelerator includes one or more DPUs that can execute operations in DNNs. A DPU may include a compute unit, a local memory, an IDU, and an ODU. The compute unit may include PEs for performing computations in neural network operations. The local memory stores data used and generated by the compute unit. The IDU transfers data (e.g., input of neural network layers) from the local memory to the compute unit. The IDU may also be referred to as an input delivery module or load module. The ODU transfers data (e.g., output of neural network layers) from the compute unit to the local memory. The ODU may also be referred to as an output delivery module or drain module. The DPU may have different operational modes. For instance, the DPU may have a sparse mode, in which computations in neural network operations may be accelerated based on sparsity in the inputs of the neural network operations by skipping computations performed on zero-valued data elements in the inputs, and a dense mode, in which there is no sparsity acceleration and all the data elements in the inputs of the neural network operations may be processed by the compute unit. Additionally or alternatively, the DPU may have operational modes corresponding to different data formats, such as Z-major mode, X-major mode, and Y-major mode.
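The difference between the dense mode and the sparse mode can be illustrated with a short Python sketch of a single multiply-accumulate (MAC) chain. The sketch assumes a one-bit-per-element sparsity bitmap over the activations only; how the actual hardware encodes and consumes sparsity data is described with the sparse cell figures, and the function names here are hypothetical.

```python
def make_bitmap(values):
    """Sparsity bitmap: 1 marks a non-zero element, 0 marks a zero-valued element."""
    return [1 if v != 0 else 0 for v in values]

def dense_mac(activations, weights):
    """Dense mode: every activation/weight pair is multiplied and accumulated."""
    acc, macs = 0, 0
    for a, w in zip(activations, weights):
        acc += a * w
        macs += 1
    return acc, macs

def sparse_mac(activations, weights):
    """Sparse mode: pairs whose activation bit is 0 are skipped, since they contribute nothing."""
    bitmap = make_bitmap(activations)
    acc, macs = 0, 0
    for bit, a, w in zip(bitmap, activations, weights):
        if bit:
            acc += a * w
            macs += 1
    return acc, macs

activations = [0, 3, 0, 0, 2, 0, 0, 1]
weights = [5, 1, 7, 2, 4, 9, 6, 3]
print(dense_mac(activations, weights))   # same accumulated value, 8 MAC operations
print(sparse_mac(activations, weights))  # same accumulated value, 3 MAC operations
```

Both modes produce the same accumulated value; the sparse path simply performs fewer MAC operations, which is the source of the acceleration.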

[0031] The IDU or ODU may include one or more configurable data storages. The IDU may include a sparsity storage (e.g., a data random-access memory (RAM)) that is configured to store sparsity data (e.g., sparsity bitmaps) when the DPU operates in the sparse mode but configured to store input data of neural network operations in the dense mode. The sparsity storage in the IDU may function as a response FIFO buffer in the dense mode. The ODU may include a sparsity storage (e.g., a data RAM) that is configured to store sparsity data (e.g., sparsity bitmaps) when the DPU operates in the sparse mode but configured to store output data of neural network operations in the dense mode. The sparsity storage in the ODU may be used as a staging buffer in a dense, Z-major mode. In a dense, X-major or Y-major mode, the sparsity storage may be used to facilitate tensor permutation. The ODU may also include a permutation storage that performs tensor permutation in the X-major or Y-major mode but is used as a staging buffer in the Z-major mode. The ODU may also include a write combine buffer that can be configured to combine small write transactions at one time while configured to function as a FIFO buffer at another time.
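The mode-dependent roles listed in this paragraph can be summarized as a small lookup, sketched here in Python. The enum and key names are hypothetical, and the sketch only encodes the roles stated above. The write combine buffer is omitted because the paragraph does not tie its two roles to a specific operational mode.

```python
from enum import Enum

class Sparsity(Enum):
    SPARSE = "sparse"
    DENSE = "dense"

class Layout(Enum):
    Z_MAJOR = "Z-major"
    X_MAJOR = "X-major"
    Y_MAJOR = "Y-major"

def storage_roles(sparsity, layout):
    """Return the role each configurable storage plays for a given operational mode."""
    roles = {}
    if sparsity is Sparsity.SPARSE:
        # In the sparse mode both sparsity storages hold sparsity data (e.g., bitmaps).
        roles["idu_sparsity_storage"] = "sparsity data"
        roles["odu_sparsity_storage"] = "sparsity data"
    else:
        # In the dense mode the otherwise idle sparsity storages are repurposed.
        roles["idu_sparsity_storage"] = "response FIFO for input data"
        roles["odu_sparsity_storage"] = (
            "staging buffer" if layout is Layout.Z_MAJOR else "tensor permutation support"
        )
    # The permutation storage is only needed for permutation in X-major or Y-major mode.
    roles["odu_permutation_storage"] = (
        "staging buffer" if layout is Layout.Z_MAJOR else "tensor permutation"
    )
    return roles

for s in Sparsity:
    for l in Layout:
        print(s.value, l.value, storage_roles(s, l))
```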

[0032] The present disclosure provides data storages in DNN accelerators that can be reconfigured to augment existing data storage or provide new data storage functionalities by taking advantage of data storages that are unused in certain operational modes of the DNN accelerators. With such reconfigurable and multifunctional data storages, the DNN accelerators in the present disclosure can have better area efficiency and performance, compared with currently available DNN accelerators.

[0033] For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.

[0034] Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.

[0035] Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.

[0036] For the purposes of the present disclosure, the phrase "A or B" or the phrase "A and/or B" means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase "A, B, or C" or the phrase "A, B, and/or C" means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term "between," when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.

[0037] The description uses the phrases "in an embodiment" or "in embodiments," which may each refer to one or more of the same or different embodiments. The terms "comprising," "including," "having," and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as "above," "below," "top," "bottom," and "side" to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives "first," "second," and "third," etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.

[0038] In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

[0039] The terms "substantially," "close," "approximately," "near," and "about," generally refer to being within +/- 20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., "coplanar," "perpendicular," "orthogonal," "parallel," or any other angle between the elements, generally refer to being within +/- 5-20% of a target value as described herein or as known in the art.

[0040] In addition, the terms "comprise," "comprising," "include," "including," "have," "having" or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerator. Also, the term "or" refers to an inclusive "or" and not to an exclusive "or."

[0041] The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.

Example DNN

[0042] FIG. 1 illustrates an example DNN 100, in accordance with various embodiments. The DNN 100 may be executed by a DNN accelerator, e.g., the DNN accelerator 402 in FIG. 4. In an example, the DNN 100 may be a convolution-based DNN. In other examples, the DNN 100 may be other types of DNNs. For the purpose of illustration, the DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as "convolutional layer 110"), a plurality of pooling layers 120 (individually referred to as "pooling layer 120"), and a plurality of fully-connected layers 130 (individually referred to as "fully-connected layer 130"). In other embodiments, the DNN 100 may include fewer, more, or different layers. In an execution of the DNN 100, the layers of the DNN 100 execute tensor computation that includes many tensor operations, such as matrix multiplications, convolutions (e.g., multiply-accumulate (MAC) operations, etc.), pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc.), other types of tensor operations, or some combination thereof.

[0043] The convolutional layers 110 summarize the presence of features in inputs to the DNN 100. The convolutional layers 110 function as feature extractors. The first layer of the DNN 100 is a convolutional layer 110. In an example, a convolutional layer 110 performs a convolution on an input tensor 140 (also referred to as IFM 140) and a filter 150. As shown in FIG. 1, the IFM 140 is represented by a 7x7x3 three-dimensional (3D) matrix. The IFM 140 includes 3 input channels, each of which is represented by a 7x7 two-dimensional (2D) matrix. The 7x7 2D matrix includes 7 input elements (also referred to as input points) in each row and 7 input elements in each column. The filter 150 is represented by a 3x3x3 3D matrix. The filter 150 includes 3 kernels, each of which may correspond to a different input channel of the IFM 140. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. A kernel can be smaller than the IFM. In the embodiments of FIG. 1, each kernel is represented by a 3x3 2D matrix. The 3x3 kernel includes 3 weights in each row and 3 weights in each column. Weights can be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter 150 in extracting features from the IFM 140.

[0044] The convolution includes MAC operations with the input elements in the IFM 140 and the weights in the filter 150. The convolution may be a standard convolution 163 or a depthwise convolution 183. In the standard convolution 163, the whole filter 150 slides across the IFM 140. All the input channels are combined to produce an output tensor 160 (also referred to as OFM 160). The OFM 160 is represented by a 5x5 2D matrix. The 5x5 2D matrix includes 5 output elements (also referred to as output points) in each row and 5 output elements in each column. For the purpose of illustration, the standard convolution includes one filter in the embodiments of FIG. 1. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 160.

[0045] The multiplication applied between a kernel-sized patch of the IFM 140 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 140 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the "scalar product." Using a kernel smaller than the IFM 140 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 140 multiple times at different points on the IFM 140. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 140, left to right, top to bottom. The result from multiplying the kernel with the IFM 140 one time is a single value. As the kernel is applied multiple times to the IFM 140, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix (i.e., the OFM 160) from the standard convolution 163 is referred to as an OFM.
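The standard convolution of FIG. 1 can be sketched in plain Python as nested MAC loops. The sketch assumes a stride of one and no padding, which is what a 7x7 input and a 3x3 kernel need to yield the 5x5 output described above. The channel-first indexing `ifm[c][y][x]` and the function name are illustrative choices, not taken from this disclosure.

```python
def standard_conv(ifm, filt):
    """Standard convolution with one filter: ifm[c][y][x] and filt[c][ky][kx], stride 1, no padding.

    All input channels are combined, so the result is a single 2D output feature map.
    """
    channels, height, width = len(ifm), len(ifm[0]), len(ifm[0][0])
    k_h, k_w = len(filt[0]), len(filt[0][0])
    ofm = []
    for oy in range(height - k_h + 1):          # slide top to bottom
        row = []
        for ox in range(width - k_w + 1):       # slide left to right
            acc = 0
            for c in range(channels):           # combine all input channels
                for ky in range(k_h):
                    for kx in range(k_w):
                        acc += ifm[c][oy + ky][ox + kx] * filt[c][ky][kx]  # one MAC
            row.append(acc)
        ofm.append(row)
    return ofm

# 7x7x3 IFM and 3x3x3 filter, filled with ones so each output is easy to check.
ifm = [[[1] * 7 for _ in range(7)] for _ in range(3)]
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]
ofm = standard_conv(ifm, filt)
print(len(ofm), len(ofm[0]), ofm[0][0])
```

With all-ones inputs, each output element is the sum of 3 x 3 x 3 = 27 products, and the output has 5 rows and 5 columns.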

[0046] In the depthwise convolution 183, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 1, the depthwise convolution 183 produces a depthwise output tensor 180. The depthwise output tensor 180 is represented by a 5x5x3 3D matrix. The depthwise output tensor 180 includes 3 output channels, each of which is represented by a 5x5 2D matrix. The 5x5 2D matrix includes 5 output elements in each row and 5 output elements in each column. Each output channel is a result of MAC operations of an input channel of the IFM 140 and a kernel of the filter 150. For instance, the first output channel (patterned with dots) is a result of MAC operations of the first input channel (patterned with dots) and the first kernel (patterned with dots), the second output channel (patterned with horizontal stripes) is a result of MAC operations of the second input channel (patterned with horizontal stripes) and the second kernel (patterned with horizontal stripes), and the third output channel (patterned with diagonal stripes) is a result of MAC operations of the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes). In such a depthwise convolution, the number of input channels equals the number of output channels, and each output channel corresponds to a different input channel. The input channels and output channels are referred to collectively as depthwise channels. After the depthwise convolution, a pointwise convolution 193 is then performed on the depthwise output tensor 180 and a 1x1x3 tensor 190 to produce the OFM 160.
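A depthwise convolution followed by a pointwise convolution can be sketched in Python as below, again assuming a stride of one, no padding, and channel-first indexing; the function names are hypothetical. Each input channel is convolved with its own kernel, and the 1x1x3 pointwise weights then combine the three resulting channels into one output map.

```python
def conv2d(channel, kernel):
    """Single-channel 2D convolution (stride 1, no padding) of channel[y][x] with kernel[ky][kx]."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(channel[oy + ky][ox + kx] * kernel[ky][kx]
                for ky in range(k_h) for kx in range(k_w))
            for ox in range(len(channel[0]) - k_w + 1)
        ]
        for oy in range(len(channel) - k_h + 1)
    ]

def depthwise_conv(ifm, filt):
    """Convolve each input channel with its own kernel; channels are not combined."""
    return [conv2d(channel, kernel) for channel, kernel in zip(ifm, filt)]

def pointwise_conv(tensor, weights):
    """1x1xC convolution: a weighted sum across channels at every spatial position."""
    height, width = len(tensor[0]), len(tensor[0][0])
    return [
        [sum(w * channel[y][x] for w, channel in zip(weights, tensor)) for x in range(width)]
        for y in range(height)
    ]

ifm = [[[1] * 7 for _ in range(7)] for _ in range(3)]   # 7x7x3 IFM
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]  # three 3x3 kernels
depthwise_out = depthwise_conv(ifm, filt)               # 5x5x3 depthwise output tensor
ofm = pointwise_conv(depthwise_out, [1, 1, 1])          # 5x5 OFM
print(len(depthwise_out), len(depthwise_out[0]), depthwise_out[0][0][0], ofm[0][0])
```

With all-ones data and pointwise weights of one, each depthwise output element is 9 and each final output element is 27, which matches a standard convolution over the same all-ones inputs.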

[0047] The OFM 160 is then passed to the next layer in the sequence. In some embodiments, the OFM 160 is passed through an activation function. An example activation function is rectified linear unit (ReLU). ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. The convolutional layer 110 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 160 is passed to the subsequent convolutional layer 110 (i.e., the convolutional layer 110 following the convolutional layer 110 generating the OFM 160 in the sequence). The subsequent convolutional layer 110 performs a convolution on the OFM 160 with new kernels and generates a new feature map. The new feature map may also be normalized and resized. The new feature map can be convolved again by a further subsequent convolutional layer 110, and so on.

[0048] In some embodiments, a convolutional layer 110 has four hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel is of dimensions FxFxD pixels), the step S with which the window corresponding to the kernel is dragged on the image (e.g., a step of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour of P pixels thickness to the input image of the convolutional layer 110). The convolutional layers 110 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on. The DNN 100 includes 16 convolutional layers 110. In other embodiments, the DNN 100 may include a different number of convolutional layers.
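The hyperparameters F, S, and P determine the spatial size of the output through the standard relation (W - F + 2P) / S + 1, rounded down, where W is the input width or height. This relation is general convolution arithmetic rather than something stated in this disclosure. A minimal Python sketch, with a hypothetical function name:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension: floor((W - F + 2P) / S) + 1."""
    padded = input_size + 2 * padding
    if kernel_size > padded:
        raise ValueError("kernel does not fit inside the padded input")
    return (padded - kernel_size) // stride + 1

# The 7x7 IFM and 3x3 kernel of FIG. 1 with step 1 and no padding give a 5x5 OFM.
print(conv_output_size(7, 3, stride=1, padding=0))
# Adding one pixel of zero-padding keeps the output the same size as the input.
print(conv_output_size(7, 3, stride=1, padding=1))
```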

[0049] The pooling layers 120 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 120 is placed between two convolutional layers 110: a preceding convolutional layer 110 (the convolutional layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolutional layer 110 subsequent to the pooling layer 120 in the sequence of layers). In some embodiments, a pooling layer 120 is added after a convolutional layer 110, e.g., after an activation function (e.g., ReLU, etc.) has been applied to the OFM 160.

[0050] A pooling layer 120 receives feature maps generated by the preceding convolutional layer 110 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the DNN and avoids over-learning. The pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2x2 pixels applied with a stride of two pixels, so that the pooling operation reduces the size of a feature map by a factor of 2 in each dimension, i.e., the number of pixels or values in the feature map is reduced to one quarter. In an example, a pooling layer 120 applied to a feature map of 6x6 results in an output pooled feature map of 3x3. The output of the pooling layer 120 is inputted into the subsequent convolutional layer 110 for further feature extraction. In some embodiments, the pooling layer 120 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
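The 2x2, stride-two pooling operation described above can be sketched in Python as follows. The function name and the `mode` parameter are hypothetical, and the sketch handles a single feature map whose sides are multiples of the stride.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a single 2D feature map fmap[y][x]."""
    height, width = len(fmap), len(fmap[0])
    pooled = []
    for y in range(0, height - size + 1, stride):
        row = []
        for x in range(0, width - size + 1, stride):
            patch = [fmap[y + dy][x + dx] for dy in range(size) for dx in range(size)]
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        pooled.append(row)
    return pooled

# A 6x6 feature map whose values count up row by row: 0, 1, 2, ..., 35.
fmap = [[y * 6 + x for x in range(6)] for y in range(6)]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(len(max_pooled), len(max_pooled[0]))  # 3 3
print(max_pooled[0][0], avg_pooled[0][0])   # 7 3.5
```

The 36 input values are reduced to 9 output values, i.e., one quarter of the original count.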

[0051] The fully-connected layers 130 are the last layers of the DNN. The fully-connected layers 130 may be convolutional or not. The fully-connected layers 130 receive an input operand. The input operand defines the output of the convolutional layers 110 and pooling layers 120 and includes the values of the last feature map generated by the last pooling layer 120 in the sequence. The fully-connected layers 130 apply a linear combination and an activation function to the input operand and generate a vector. The vector may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the elements sum to one. These probabilities are calculated by the last fully-connected layer 130 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function. In some embodiments, the fully-connected layers 130 multiply each input element by a weight, sum the products, and then apply an activation function (e.g., logistic if N=2, SoftMax if N>2, where N is the number of classes). This is equivalent to multiplying the input operand by the matrix containing the weights.
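The final classification step can be sketched in Python as a matrix-vector product followed by a SoftMax. The sketch omits bias terms because the paragraph above does not mention them, and it subtracts the maximum logit before exponentiation, a common numerical-stability measure added here as an implementation choice. The function names and example values are hypothetical.

```python
import math

def fully_connected(inputs, weights):
    """Multiply the input operand by the weight matrix: one weighted sum per output class."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def softmax(logits):
    """Turn raw scores into probabilities that lie between 0 and 1 and sum to one."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

inputs = [1.0, 2.0]                              # flattened values of the last feature map
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row of weights per class (N = 3)
probabilities = softmax(fully_connected(inputs, weights))
print(probabilities)
print(probabilities.index(max(probabilities)))   # index of the most likely class
```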

[0052] FIG. 2 illustrates an example convolution, in accordance with various embodiments. The convolution may be a deep learning operation in a convolutional layer of a DNN, e.g., a convolutional layer 110 in FIG. 1. The convolution can be executed on an activation tensor 210 and filters 220 (individually referred to as "filter 220"). The filters may constitute a weight tensor of the convolution. The result of the convolution is an output tensor 230. In some embodiments, the convolution is performed by a DNN accelerator. An example of the DNN accelerator may be the DNN accelerator 402 in FIG. 4. For instance, the convolution may be performed by one or more DPUs 430 in the DNN accelerator 402.

[0053] The activation tensor 210 may be computed in a previous layer of the DNN. In some embodiments (e.g., embodiments where the convolutional layer is the first layer of the DNN), the activation tensor 210 may be an image. In the embodiments of FIG. 2, the activation tensor 210 includes activations (also referred to as "input activations," "elements," or "input elements") arranged in a 3D matrix. The activation tensor 210 may also be referred to as an input tensor of the convolution. An input element is a data point in the activation tensor 210. The activation tensor 210 has a spatial size H_in X W_in X C_in, where H_in is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 2D matrix of each input channel), W_in is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 2D matrix of each input channel), and C_in is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels). For the purpose of simplicity and illustration, the activation tensor 210 has a spatial size of 7x7x3, i.e., the activation tensor 210 includes three input channels and each input channel has a 7x7 2D matrix. Each input element in the activation tensor 210 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the activation tensor 210 may be different.

[0054] Each filter 220 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN. A filter 220 has a spatial size H_f X W_f X C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, C_f equals C_in. For purpose of simplicity and illustration, each filter 220 in FIG. 2 has a spatial size of 3x3x3, i.e., the filter 220 includes 3 convolutional kernels with a spatial size of 3x3. In other embodiments, the height, width, or depth of the filter 220 may be different. The spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the activation tensor 210.

[0055] An activation or weight may take one or more bytes in a memory. The number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has an INT8 format, the activation or weight takes one byte. When the activation or weight has a FP16 format, the activation or weight takes two bytes. Other data formats may be used for activations or weights.

[0056] In the convolution, each filter 220 slides across the activation tensor 210 and generates a 2D matrix for an output channel in the output tensor 230. In the embodiments of FIG. 2, the 2D matrix has a spatial size of 5x5. The output tensor 230 includes activations (also referred to as "output activations," "elements," or "output elements") arranged in a 3D matrix. An output activation is a data point in the output tensor 230. The output tensor 230 has a spatial size H_out X W_out X C_out, where H_out is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of output activations in a column in the 2D matrix of each output channel), W_out is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of output activations in a row in the 2D matrix of each output channel), and C_out is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of output channels). C_out may equal the number of filters 220 in the convolution. H_out and W_out may depend on the heights and widths of the activation tensor 210 and each filter 220. In an example where the kernel size is 1x1, H_out and W_out may equal H_in and W_in, respectively.

[0057] As a part of the convolution, MAC operations can be performed on a 3x3x3 subtensor 215 (which is highlighted with a dotted pattern in FIG. 2) in the activation tensor 210 and each filter 220. The result of the MAC operations on the subtensor 215 and one filter 220 is an output activation. In some embodiments (e.g., embodiments where the convolution is an integral convolution), an output activation may include 8 bits, e.g., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution), an output activation may include more than one byte. For instance, an output element may include two bytes.

[0058] After the MAC operations on the subtensor 215 and all the filters 220 are finished, a vector 235 is produced. The vector 235 is highlighted with a dotted pattern in FIG. 2. The vector 235 includes a sequence of output activations, which are arranged along the Z axis. The output activations in the vector 235 have the same (x, y) coordinate, but the output activations correspond to different output channels and have different Z coordinates. The dimension of the vector 235 along the Z axis may equal the total number of output channels in the output tensor 230. After the vector 235 is produced, further MAC operations are performed to produce additional vectors till the output tensor 230 is produced. In the embodiments of FIG. 2, the output tensor 230 is computed in a Z-major format. When the output tensor 230 is computed in the ZXY format, the vector that is adjacent to the vector 235 along the X axis may be computed right after the vector 235. When the output tensor 230 is computed in the ZYX format, the vector that is adjacent to the vector 235 along the Y axis may be computed right after the vector 235. The output tensor 230 may be permuted, e.g., by the drain module 390, and stored in a memory (e.g., the local memory 440) in an X-major format or Y-major format.

[0059] In some embodiments, the MAC operations on a 3x3x3 subtensor (e.g., the subtensor 215) and a filter 220 may be performed by a plurality of MAC units. One or more MAC units may receive an input operand (e.g., an activation operand 217 shown in FIG. 2) and a weight operand (e.g., the weight operand 227 shown in FIG. 2). The activation operand 217 includes a sequence of activations having the same (x, y) coordinate but different z coordinates. The activation operand 217 includes an activation from each of the input channels in the activation tensor 210. The weight operand 227 includes a sequence of weights having the same (x, y) coordinate but different z coordinates. The weight operand 227 includes a weight from each of the channels in the filter 220. Activations in the activation operand 217 and weights in the weight operand 227 may be sequentially fed into a MAC unit. The MAC unit may receive an activation and a weight ("an activation-weight pair") at a time and multiply the activation and the weight. The position of the activation in the activation operand 217 may match the position of the weight in the weight operand 227. The activation and weight may correspond to the same channel.
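By way of a non-limiting sketch, the convolution of FIG. 2 may be modeled in Python as follows. The sketch assumes a stride of one and no padding, which is consistent with a 7x7 input channel and a 5x5 output channel for a 3x3 kernel but is not stated in the description. The function names, the index order `activations[y][x][z]` and `filters[k][fy][fx][z]`, and the sample values are hypothetical.

```python
def output_vector(activations, filters, x0, y0):
    """Compute the output activations along the Z axis at output position (x0, y0)."""
    vector = []
    for filt in filters:  # one output channel per filter
        acc = 0
        for fy in range(len(filt)):
            for fx in range(len(filt[0])):
                # Operands share an (x, y) position and span all channels (z).
                activation_operand = activations[y0 + fy][x0 + fx]
                weight_operand = filt[fy][fx]
                for activation, weight in zip(activation_operand, weight_operand):
                    acc += activation * weight  # one multiply-accumulate
        vector.append(acc)
    return vector


def convolve(activations, filters):
    """Assumed stride of one and no padding; result is indexed [y][x][z]."""
    h_out = len(activations) - len(filters[0]) + 1
    w_out = len(activations[0]) - len(filters[0][0]) + 1
    return [[output_vector(activations, filters, x, y) for x in range(w_out)]
            for y in range(h_out)]


# Hypothetical data: a 7x7x3 activation tensor of ones and two 3x3x3 filters.
activations = [[[1] * 3 for _ in range(7)] for _ in range(7)]
filters = [[[[1] * 3 for _ in range(3)] for _ in range(3)],
           [[[2] * 3 for _ in range(3)] for _ in range(3)]]
output = convolve(activations, filters)
print(len(output), len(output[0]), len(output[0][0]))  # 5 5 2
```

In this sketch, each call to `output_vector` performs 27 multiply-accumulate operations per filter, and the length of the returned vector equals the number of filters.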

[0060] Activations or weights may be floating-point numbers. Floating-point numbers may have various data formats, such as FP32, FP16, BF16, and so on. A floating-point number may be a positive or negative number with a decimal point. A floating-point number may be represented by a sequence of bits that includes one or more bits representing the sign of the floating-point number (e.g., positive or negative), bits representing an exponent of the floating-point number, and bits representing a mantissa of the floating-point number. The mantissa is the part of a floating-point number that represents the significant digits of that number. The mantissa is multiplied by the base raised to the exponent to give the actual value of the floating-point number.
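As an illustrative sketch, the sign, exponent, and mantissa bits of an FP32 number may be extracted in Python as follows. The sketch assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, and 23 mantissa bits with an implicit leading one for normal numbers); these field widths and the function names are assumptions of the sketch and are not recited above.

```python
import struct


def decompose_fp32(value):
    """Return the (sign, exponent, mantissa) bit fields of an FP32 number."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def rebuild_fp32(sign, exponent, mantissa):
    """Rebuild the value from its fields (normal numbers only)."""
    significand = 1 + mantissa / 2 ** 23  # implicit leading one
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


fields = decompose_fp32(-6.5)
print(fields)                 # (1, 129, 5242880)
print(rebuild_fp32(*fields))  # -6.5
```

Here -6.5 equals -1.625 multiplied by 2 raised to the power of 2, so the stored exponent is 2 + 127 = 129.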

[0061] In some embodiments, the output activations in the output tensor 230 may be further processed based on one or more activation functions before they are written into the memory or inputted into the next layer of the DNN. The processing based on the one or more activation functions may be at least part of the post processing of the convolution. In some embodiments, the post processing may include one or more other computations, such as offset computation, bias computation, and so on. The results of the post processing may be stored in a local memory of the compute block and be used as input to the next DNN layer. In some embodiments, the input activations in the activation tensor 210 may be results of post processing of the previous DNN layer.

[0062] FIGS. 3A-3F illustrate various storage formats of a tensor, in accordance with various embodiments. For the purpose of illustration, the tensor has a spatial size of 8 X 8 X 8, i.e., there are eight data points in each vector along the X axis (e.g., each row), eight data points in each vector along the Y axis (e.g., each column), and eight data points in each vector along the Z axis. In other embodiments, the tensor may have a different shape or spatial size.

[0063] FIG. 3A shows the tensor in the ZYX format. The eight data points with (0, 0, 0-7) coordinates, which are highlighted with the denser dotted pattern, may be stored in the memory first. For instance, these data points are stored as contiguous bytes in the local memory 440. These eight data points are followed by the data points with (0, 1, 0-7) coordinates, then the data points with (0, 2, 0-7) coordinates, then the data points with (0, 3, 0-7) coordinates, all the way to the data points with (0, 7, 0-7) coordinates. The 56 data points with (0, 1-7, 0-7) coordinates are highlighted with the less dense dotted pattern. 0-7 represents all the integer numbers from 0 to 7. 1-7 represents all the integer numbers from 1 to 7. After that, the 64 data points with (1, 0-7, 0-7) coordinates are stored. This continues till all the data points of the tensor are stored.

[0064] FIG. 3B shows the tensor in the ZXY format. With the ZXY format, the eight data points with (0, 0, 0-7) coordinates, which are highlighted with the denser dotted pattern, may be stored in the memory first, followed by the data points with (1, 0, 0-7) coordinates, then the data points with (2, 0, 0-7) coordinates, then the data points with (3, 0, 0-7) coordinates, all the way to the data points with (7, 0, 0-7) coordinates. The 56 data points with (1-7, 0, 0-7) coordinates are highlighted with the less dense dotted pattern. After that, the 64 data points with (0-7, 1, 0-7) coordinates are stored. This continues till all the data points of the tensor are stored.

[0065] FIG. 3C shows the tensor in the XYZ format. The eight data points with (0-7, 0, 0) coordinates, which are highlighted with the denser dotted pattern, may be stored in the memory first, followed by the data points with (0-7, 1, 0) coordinates, then the data points with (0-7, 2, 0) coordinates, then the data points with (0-7, 3, 0) coordinates, all the way to the data points with (0-7, 7, 0) coordinates. The 56 data points with (0-7, 1-7, 0) coordinates are highlighted with the less dense dotted pattern. After that, the 64 data points with (0-7, 0-7, 1) coordinates are stored. This continues till all the data points of the tensor are stored.

[0066] FIG. 3D shows the tensor in the XZY format. With the XZY format, the eight data points with (0-7, 0, 0) coordinates, which are highlighted with the denser dotted pattern, may be stored in the memory first, followed by the data points with (0-7, 0, 1) coordinates, then the data points with (0-7, 0, 2) coordinates, then the data points with (0-7, 0, 3) coordinates, all the way to the data points with (0-7, 0, 7) coordinates. The 56 data points with (0-7, 0, 1-7) coordinates are highlighted with the less dense dotted pattern. After that, the 64 data points with (0-7, 1, 0-7) coordinates are stored. This continues till all the data points of the tensor are stored.

[0067] FIG. 3E shows the tensor in the YZX format. The eight data points with (0, 0-7, 0) coordinates, which are highlighted with the denser dotted pattern, may be stored in the memory first, followed by the data points with (0, 0-7, 1) coordinates, then the data points with (0, 0-7, 2) coordinates, then the data points with (0, 0-7, 3) coordinates, all the way to the data points with (0, 0-7, 7) coordinates. The 56 data points with (0, 0-7, 1-7) coordinates are highlighted with the less dense dotted pattern. After that, the 64 data points with (1, 0-7, 0-7) coordinates are stored. This continues till all the data points of the tensor are stored.

[0068] FIG. 3F shows the tensor in the YXZ format. The eight data points with (0, 0-7, 0) coordinates, which are highlighted with the denser dotted pattern, may be stored in the memory first, followed by the data points with (1, 0-7, 0) coordinates, then the data points with (2, 0-7, 0) coordinates, then the data points with (3, 0-7, 0) coordinates, all the way to the data points with (7, 0-7, 0) coordinates. The 56 data points with (1-7, 0-7, 0) coordinates are highlighted with the less dense dotted pattern. After that, the 64 data points with (0-7, 0-7, 1) coordinates are stored. This continues till all the data points of the tensor are stored.

[0069] In some embodiments, one of the formats shown in FIGS. 3A-3F may be changed to another one of the formats through tensor permutation. The tensor permutation would change the layout of the 512 data points in the memory, e.g., the order in which the data points are arranged in the memory but would not change the value of any data point. The tensor permutation may be performed by the ODU 490 before the ODU 490 writes the data points to the memory.
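For illustration, the storage formats of FIGS. 3A-3F and a tensor permutation between them may be sketched in Python as follows. The sketch assumes that the first letter of a format name identifies the dimension that varies fastest in memory, as in the examples above. The function names `storage_order` and `permute` and the dictionary-based shape are hypothetical and do not describe how the ODU 490 is implemented.

```python
from itertools import product


def storage_order(fmt, shape):
    """List (x, y, z) coordinates in the order they are laid out in memory.

    fmt is a three-letter format name such as "ZXY"; shape maps each
    dimension name to its length.
    """
    inner, middle, outer = fmt
    order = []
    # product() varies its last argument fastest, i.e., the inner dimension.
    for o, m, i in product(range(shape[outer]), range(shape[middle]),
                           range(shape[inner])):
        point = {outer: o, middle: m, inner: i}
        order.append((point["X"], point["Y"], point["Z"]))
    return order


def permute(data, src_fmt, dst_fmt, shape):
    """Reorder stored data points from one format to another."""
    value_at = dict(zip(storage_order(src_fmt, shape), data))
    return [value_at[coord] for coord in storage_order(dst_fmt, shape)]


shape = {"X": 8, "Y": 8, "Z": 8}
zyx = storage_order("ZYX", shape)
print(zyx[:2], zyx[8], zyx[64])  # [(0, 0, 0), (0, 0, 1)] (0, 1, 0) (1, 0, 0)

data_zyx = list(range(len(zyx)))  # hypothetical values stored in the ZYX format
data_xyz = permute(data_zyx, "ZYX", "XYZ", shape)
```

The permutation in this sketch changes only the order of the stored values; permuting `data_xyz` back to the ZYX format returns the original list.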

Example DNN System

[0070] FIG. 4 is a block diagram of a DNN system 400, in accordance with various embodiments. The whole DNN system 400 or a part of the DNN system 400 may be implemented in one or more computing devices, such as the computing device 1700 in FIG. 17. The DNN system 400 can generate and execute DNNs, such as Transformer-based models, convolution-based models, and so on. As shown in FIG. 4, the DNN system 400 includes a DNN module 401 and a DNN accelerator 402. In other embodiments, alternative configurations, different or additional components may be included in the DNN system 400. For instance, the DNN system 400 may include multiple DNN modules or multiple DNN accelerators. Further, functionality attributed to a component of the DNN system 400 may be accomplished by a different component included in the DNN system 400 or a different system. In some embodiments, the DNN module 401 and DNN accelerator 402 may include different types of processing units. In an example, the DNN module 401 may be implemented by one or more central processing units (CPUs). The DNN accelerator 402 may also be referred to as an AI accelerator or an AI processor. The DNN module 401 and DNN accelerator 402 may be implemented in the same chip or separate chips.

[0071] The DNN module 401 facilitates generation and deployment of DNNs. In some embodiments, the DNN module 401 may generate and train DNNs. For instance, the DNN module 401 can define the layered architecture of a DNN. The DNN module 401 can also determine the internal parameters of the DNN through a DNN training process. The DNN module 401 may also determine one or more hyperparameters that define how the DNN is trained. An example hyperparameter is a sparsity ratio that defines the sparsity level of one or more deep learning tensors for the DNN.

[0072] The DNN module 401 may also compress DNNs, e.g., during or after training. In some embodiments, the DNN module 401 may prune weights in one or more layers of a DNN by changing nonzero-valued weights to zeros. The DNN module 401 may prune weights based on a target weight sparsity ratio. A weight sparsity ratio may be the ratio of the number of zero-valued weights to the total number of weights. In an example where the DNN module 401 prunes weights during DNN training, the DNN module 401 may prune weights of a layer to achieve a target sparsity ratio after one or more epochs. The DNN module 401 may prevent the pruned weights from changing values during the rest of the training process. Alternatively, the DNN module 401 may allow the pruned weights to change values so that a pruned, zero-valued weight may have a nonzero value after further training. The DNN module 401 may prune weights of the layer again after one or more additional epochs.
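As a hedged sketch, pruning weights toward a target weight sparsity ratio may be illustrated in Python as follows. The description above does not specify how the weights to prune are selected; the sketch assumes magnitude-based selection (the smallest-magnitude weights are zeroed first), and the function names and sample weights are hypothetical.

```python
def sparsity_ratio(weights):
    """Ratio of the number of zero-valued weights to the total number of weights."""
    return sum(1 for w in weights if w == 0) / len(weights)


def prune_to_sparsity(weights, target_ratio):
    """Zero the smallest-magnitude weights until the target ratio is reached.

    Magnitude-based selection is an assumption of this sketch.
    """
    zeros_needed = round(target_ratio * len(weights))
    # Indices ordered from smallest to largest magnitude; existing zeros come first.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for index in by_magnitude[:zeros_needed]:
        pruned[index] = 0.0
    return pruned


# Hypothetical weights of one layer.
weights = [0.9, -0.05, 0.3, 0.0, -0.7, 0.02, 0.4, -0.1]
pruned = prune_to_sparsity(weights, target_ratio=0.5)
print(pruned)                  # [0.9, 0.0, 0.3, 0.0, -0.7, 0.0, 0.4, 0.0]
print(sparsity_ratio(pruned))  # 0.5
```

A training loop could call such a function after one or more epochs and either keep the pruned weights at zero or allow them to change during further training, in line with the alternatives described above.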

[0073] The DNN module 401 may deploy trained, compressed, or validated DNNs for use in deep learning applications. In some embodiments, the DNN module 401 may distribute trained, compressed, or validated DNNs to devices or systems which may use the DNNs to perform tasks (e.g., image classification, motion planning, etc.) for which the DNNs were trained. In other embodiments, the DNN module 401 may facilitate deployment of the DNNs using the DNN accelerator 402. For instance, the DNN module 401 may receive data from a device or system coupled with the DNN system 400 and input the received data (or data generated by the DNN module 401, e.g., based on the received data) into a DNN. The DNN module 401 may generate instructions (e.g., configuration descriptors) that control the operation of the DNN accelerator 402 during the DNN execution. For instance, the DNN module 401 may generate configuration descriptors that can be used to configure data storages in the DNN accelerator 402. The DNN module 401 may receive an output of the DNN from the DNN accelerator 402. The DNN module 401 may transmit the output of the DNN (or a result of processing the output of the DNN by the DNN module 401) to the device or system. In some embodiments, the DNN module 401 may control execution processes of trained, compressed, or validated DNNs. The DNN module 401 may function as a compiler for DNNs executed by the DNN accelerator 402. The DNN module 401 may perform compilation of DNNs and generate configuration descriptors or configuration parameters, based on which the DNNs may be executed. Certain aspects of the DNN module 401 are provided below in conjunction with FIG. 16.

[0074] The DNN accelerator 402 executes DNNs provided by the DNN module 401. For instance, the DNN accelerator 402 can execute a DNN by running neural network operations in the DNN. The process of carrying out a neural network operation is also referred to as a process of executing the neural network operation or performing the neural network operation. The execution of the DNN may be for training the DNN or for using the DNN to perform AI tasks. As shown in FIG. 4, the DNN accelerator 402 includes a memory 410, a DMA (direct memory access) engine 420, and DPUs 430 (individually referred to as "DPU 430"). In other embodiments, alternative configurations, different or additional components may be included in the DNN accelerator 402. For example, the DNN accelerator 402 may include more than one memory 410 or DMA engine 420. As another example, the DNN accelerator 402 may include a single DPU 430. Further, functionality attributed to a component of the DNN accelerator 402 may be accomplished by a different component included in the DNN accelerator 402 or by a different system. A component of the DNN accelerator 402 may be implemented in hardware, software, firmware, or some combination thereof.

[0075] The memory 410 stores data associated with neural network operations performed by the DNN accelerator 402. In some embodiments, the memory 410 may store data to be used by the DPUs 430 for DNN execution. The memory 410 may store weights, such as weights of convolutional layers, which are determined by training DNNs. The memory 410 may further store inputs to DNN layers or outputs of DNN layers, such as data generated by the DPUs 430 from performing deep learning operations in DNNs. Example neural network operations include convolutions (also referred to as "convolutional operations"), layer normalization operations, SoftMax operations, matrix multiplication operations, pooling operations, elementwise operations, activation functions, other types of deep learning operations, or some combination thereof. The memory 410 may also store configuration descriptors, including configuration descriptors that configure data storages in the DNN accelerator 402. The memory 410 may be a main memory of the DNN accelerator 402. In some embodiments, the memory 410 includes one or more dynamic random-access memories (DRAMs).

[0076] The DMA engine 420 facilitates data transfer between the memory 410 and local memories of the DPUs 430. For example, the DMA engine 420 can read data from the memory 410 and write data into a local memory of a DPU 430. As another example, the DMA engine 420 can read data from a local memory of a DPU 430 and write data into the memory 410. The DMA engine 420 provides a DMA feature that allows the DPU 430 to initiate data transfer between the memory 410 and the local memories of the DPUs 430 and to perform other operations while the data transfer is being conducted. In some embodiments, the DMA engine 420 may read tensors from the memory 410 and modify the tensors in a way that is optimized for the DPU 430 before it writes the tensors into the local memories of the DPUs 430.

[0077] The DPUs 430 perform neural network operations in DNNs. For instance, a DPU 430 may execute a DNN layer by running one or more neural network operations in the DNN layer. The DPU 430 may compute an output of the DNN layer from an input of the DNN layer. In some embodiments, the DPU 430 may also use internal parameters (e.g., weights) of the DNN layer to compute the output. A DPU 430 may execute a layer, or a portion of a layer, at a time. In some embodiments, the operations of the DNN layers may be run by multiple DPUs 430 in parallel. For instance, multiple DPUs 430 may each perform a portion of a workload for a deep learning operation. Data may be shared between the DPUs 430. A DPU 430 may also be referred to as a neural processing unit, a compute block, or a compute tile.

[0078] The DPUs 430 may be capable of running various types of neural network operations, such as convolution, layer normalization, SoftMax operation, pooling, elementwise operation, linear operation, nonlinear operation, and so on. Deep learning operations performed by the DPUs 430 include tensor operations, i.e., operations whose inputs are tensors or operations whose outputs are tensors. In an example, the DPU 430 receives an input tensor and one or more convolutional kernels and performs a convolution with the input tensor and convolutional kernels. The result of the convolution may be an output tensor, which can be further computed, e.g., by the DPU 430 or another DPU 430.

[0079] In the embodiments of FIG. 4, each DPU 430 includes a local memory 440, an IDU 450, a compute unit 460 including a processing engine 470 and a post-processing engine 480, and an ODU 490. Some or all the components of the DPU 430 can be implemented on the same chip. In other embodiments, alternative configurations, different or additional components may be included in the DPU 430. Further, functionality attributed to a component of the DPU 430 may be accomplished by a different component included in the DPU 430, a different DPU 430, another component of the DNN accelerator 402, or a different system. A component of the DPU 430 may be implemented in hardware, software, firmware, or some combination thereof.

[0080] The local memory 440 is local to the corresponding DPU 430. In the embodiments of FIG. 4, the local memory 440 is inside the DPU 430. In other embodiments, the local memory 440 may be outside the DPU 430. Data in the local memory 440 may be transferred to or from the memory 410, e.g., through the DMA engine 420. In some embodiments, data in the local memory 440 may be transferred to or from the local memory of another DPU 430. The local memory 440 may store data received, used, or generated by the IDU 450, the processing engine 470, the post-processing engine 480, or the ODU 490. Examples of the data may include input activations, weights, output activations, sparsity bitmaps, and so on.

[0081] In some embodiments, the local memory 440 may store tensors to be processed by the processing engine 470 or the post-processing engine 480. The tensors may be input tensors of deep learning operations. The local memory 440 may also store tensors generated by the processing engine 470 or the post-processing engine 480. The tensors may be output tensors of deep learning operations. The layout of data points of a tensor in the local memory 440 may depend on the format in which the tensor is stored. In some embodiments, the local memory 440 may store tensors in various formats, including Z-major format, X-major format, and Y-major format. For a tensor with Z-major format (e.g., the ZXY format or ZYX format), the local memory 440 may store data points having the same (x, y) coordinate contiguously. For instance, the data points having the same (x, y) coordinate may be stored at a sequence of memory addresses in the local memory 440. For a tensor with X-major format, the local memory 440 may store data points having the same (y, z) coordinate contiguously. For a tensor with Y-major format, the local memory 440 may store data points having the same (x, z) coordinate contiguously.

[0082] In an example where the tensor is in the ZXY format, the data points with (0, 0, 0-C) coordinates may be stored in the memory first, followed by the data points with (1, 0, 0-C) coordinates, then the data points with (2, 0, 0-C) coordinates, and so on. 0-C represents all the integer numbers from 0 to C, and C is the maximum coordinate index in the Z dimension of the tensor. After all the coordinates in the X dimension are exhausted, the layout would continue to y coordinate of 1. In an example where the tensor is in the ZYX format, the data points with (0, 0, 0-C) coordinates may be stored in the memory first, followed by the data points with (0, 1, 0-C) coordinates, then the data points with (0, 2, 0-C) coordinates, and so on. After all the coordinates in the Y dimension are exhausted, the layout would continue to x coordinate of 1.

[0083] In an example where the tensor is in the XYZ format, the data points with (0-W, 0, 0) coordinates may be stored in the memory first, followed by the data points with (0-W, 1, 0) coordinates, then the data points with (0-W, 2, 0) coordinates, and so on. 0-W represents all the integer numbers from 0 to W, and W represents the maximum coordinate index in the X dimension of the tensor. After all the coordinates in the Y dimension are exhausted, the layout would continue to z coordinate of 1. In an example where the tensor is in the XZY format, the data points with (0-W, 0, 0) coordinates may be stored in the memory first, followed by the data points with (0-W, 0, 1) coordinates, then the data points with (0-W, 0, 2) coordinates, and so on. After all the coordinates in the Z dimension are exhausted, the layout would continue to y coordinate of 1.

[0084] In an example where the tensor is in the YXZ format, the data points with (0, 0-H, 0) coordinates may be stored in the memory first, followed by the data points with (1, 0-H, 0) coordinates, then the data points with (2, 0-H, 0) coordinates, and so on. 0-H represents all the integer numbers from 0 to H, and H represents the maximum coordinate index in the Y dimension of the tensor. After all the coordinates in the X dimension are exhausted, the layout would continue to z coordinate of 1. In an example where the tensor is in the YZX format, the data points with (0, 0-H, 0) coordinates may be stored in the memory first, followed by the data points with (0, 0-H, 1) coordinates, then the data points with (0, 0-H, 2) coordinates, and so on. After all the coordinates in the Z dimension are exhausted, the layout would continue to x coordinate of 1.

[0085] In some embodiments, the local memory 440 may store dense tensors (e.g., dense activation tensors, dense weight tensors, etc.), sparse tensors (e.g., sparse activation tensors, sparse weight tensors, etc.), and so on. A dense tensor may be a tensor from which zero-valued elements (if any) are not removed. A dense tensor may be converted to a sparse tensor by removing one or more zero-valued elements in the dense tensor. A sparse tensor may also be referred to as a compressed tensor or packed tensor. The process of converting a dense tensor to a sparse tensor may be referred to as sparsity encoding. Sparsity encoding may also generate a sparsity tensor. Each element in the sparsity tensor may correspond to a different element in the dense tensor and indicate whether the element in the dense tensor is zero or not. The sparsity tensor may indicate positions of elements of the sparse tensor in the dense tensor. The sparsity tensor may be a sparsity bitmap, each element of which is a bit. A sparse tensor may be converted to a dense tensor through a densifying process, in which one or more zeros may be added to the sparse tensor based on the sparsity tensor.

[0086] In some embodiments, the local memory 440 includes one or more SRAMs. The local memory 440 may be byte-addressable, and each memory address identifies a single byte (eight bits) of storage. In some embodiments, the local memory 440 may include memory banks. The number of memory banks in the local memory 440 may be 16, 64, 128, 456, 512, 1024, 2048, or other numbers. A memory bank may include a plurality of storage units. In an example, a memory bank may include 8, 16, 64, or a different number of storage units. A memory bank or a storage unit in a memory bank may have a memory address. In an example, a storage unit may store a single byte, and data larger than a single byte may be stored in storage units with consecutive memory addresses, i.e., adjacent storage units. For instance, a single storage unit can store an integer number in the INT8 format, whereas two storage units may be needed to store a number in the FP16 or BF16 format, which has 16 bits. In some embodiments, 16 bits can be transferred from the local memory 440 in a single read cycle. In other embodiments, 16 bits can be transferred from the local memory 440 in multiple read cycles, such as two cycles.

[0087] The IDU 450 loads data from the local memory 440 to the compute unit 460. The IDU 450 may read tensors from the local memory 440. The tensors may include activation tensors, weight tensors, activation sparsity tensors, weight sparsity tensors, combined sparsity tensors, and so on. In some embodiments, the IDU 450 may load data based on the operational mode of the DPU 430. The IDU 450 may select different data to transmit to the compute unit 460 in different sparse modes. For instance, the IDU 450 may transmit an activation sparsity tensor and a weight sparsity tensor of a layer to the compute unit 460 in the combined sparse mode, transmit the activation sparsity tensor but not the weight sparsity tensor to the compute unit 460 in the activation sparse mode, and transmit the weight sparsity tensor but not the activation sparsity tensor to the compute unit 460 in the weight sparse mode. In the dense mode, the IDU 450 does not transmit either the activation sparsity tensor or the weight sparsity tensor to the compute unit 460.
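The mode-dependent selection described above may be summarized by the following Python sketch. The enumeration `SparseMode`, the function `sparsity_tensors_to_transmit`, and the dictionary keys are hypothetical names introduced for illustration; they do not correspond to any configuration descriptor or register of the IDU 450.

```python
from enum import Enum


class SparseMode(Enum):
    DENSE = "dense"
    ACTIVATION = "activation sparse"
    WEIGHT = "weight sparse"
    COMBINED = "combined sparse"


def sparsity_tensors_to_transmit(mode):
    """Report which sparsity tensors are sent to the compute unit in a given mode."""
    return {
        "activation_sparsity": mode in (SparseMode.ACTIVATION, SparseMode.COMBINED),
        "weight_sparsity": mode in (SparseMode.WEIGHT, SparseMode.COMBINED),
    }


for mode in SparseMode:
    print(mode.value, sparsity_tensors_to_transmit(mode))
```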

[0088] In some embodiments, the IDU 450 may process (e.g., densify) data stored in the local memory 440 before providing the data to the compute unit 460. In an example, the IDU 450, while operating in the weight sparse mode, may densify sparse activation tensors to generate dense activation tensors based on corresponding activation sparsity tensors. For instance, the IDU 450 may add one or more zeros into a sparse activation tensor based on an activation sparsity tensor associated with the sparse activation tensor to generate the dense activation tensor. The dense activation tensor includes one or more elements more than the sparse activation tensor. The additional element(s) are zero-valued. The IDU 450 may identify one or more elements in the activation sparsity tensor that correspond to the zero-valued element(s), determine the position of each of the zero-valued element(s) in the dense activation tensor, and insert the zero-valued element(s) into the sparse activation tensor based on the determined positions. After the densification, the IDU 450 may transmit the dense activation tensors to the compute unit 460. The IDU 450 may also transmit corresponding sparse weight tensors and weight sparsity tensors to the compute unit 460. The activation sparsity tensors of the dense activation tensors may not be loaded to the compute unit 460.
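For illustration, the densification described above, together with the sparsity encoding that it reverses, may be sketched in Python as follows. The sketch assumes a flattened tensor and a sparsity bitmap in which a bit of 1 marks a nonzero element and a bit of 0 marks a zero-valued element; this bit convention, the function names, and the sample values are assumptions and are not taken from the description.

```python
def sparsity_encode(dense):
    """Remove zero-valued elements and record their positions in a bitmap."""
    bitmap = [1 if value != 0 else 0 for value in dense]  # assumed: 1 = nonzero
    sparse = [value for value in dense if value != 0]
    return sparse, bitmap


def densify(sparse, bitmap):
    """Insert zeros into the sparse tensor at the positions marked by 0 bits."""
    values = iter(sparse)
    return [next(values) if bit else 0 for bit in bitmap]


# Hypothetical flattened activation tensor.
dense = [0, 3, 0, 0, 5, 1, 0, 2]
sparse, bitmap = sparsity_encode(dense)
print(sparse)                   # [3, 5, 1, 2]
print(bitmap)                   # [0, 1, 0, 0, 1, 1, 0, 1]
print(densify(sparse, bitmap))  # [0, 3, 0, 0, 5, 1, 0, 2]
```

In this sketch, the densified tensor has four elements more than the sparse tensor, and all four additional elements are zero-valued.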

[0089] In another example, the IDU 450, while operating in the activation sparse mode, may densify sparse weight tensors to generate dense weight tensors based on corresponding weight sparsity tensors by inserting zeros into the sparse weight tensors. The densification of sparse weight tensors may be similar to the densification of sparse activation tensors described above. After the densification, the IDU 450 may transmit the dense weight tensors to the compute unit 460. The IDU 450 may also transmit corresponding sparse activation tensors and activation sparsity tensors to the compute unit 460. The weight sparsity tensors of the dense weight tensors may not be loaded to the compute unit 460.

[0090] In yet another example, the IDU 450, while operating in the dense mode, may densify both sparse weight tensors and sparse activation tensors. The IDU 450 may generate the input tensor and weight tensor of the layer and transmit the tensors to the processing engine 470 for executing the layer without sparsity acceleration.

[0091] The IDU 450 may include one or more sparsity data storages for storing sparsity data, such as sparsity tensors. A sparsity data storage may be configurable to have different functionalities for different operational modes of the DPU 430. For instance, there may be no sparsity data to store in a sparsity data storage when the DPU 430 operates in a dense mode. In that case, the sparsity data storage may be configured to store other types of data, such as activations or weights. Certain aspects of the IDU 450 are described below in conjunction with FIG. 8.

[0092] The processing engine 470 performs operations in DNNs. The processing engine 470 may accelerate neural network operations based on sparsity in data. In some embodiments, the processing engine 470 may operate in a dense mode in which sparsity acceleration is not performed. The processing engine 470 may include one or more processing cells. In some embodiments, the processing cells may be arranged in one or more rows and one or more columns in the processing engine 470. Each processing cell may include PEs that may be arranged in an array that includes rows and columns. All the PEs in the processing engine 470 may constitute a bigger array that includes more rows and columns.

[0093] An example PE may be or may include one or more MAC units that can perform MAC operations. In some embodiments (e.g., embodiments where the DPU 430 executes a convolutional layer), a computation in an MAC unit may be an MAC operation on an activation operand and a weight operand. The activation operand may be an activation tensor that may include one or more activations in the input tensor of the convolution. Different activations may be in different input channels. The weight operand may be a weight tensor that may include one or more weights in the filter of the convolution. The values of the weights are determined through training the DNN. The weights in the weight operand may be in different input channels.

[0094] In some embodiments, an MAC unit includes one or more multipliers for performing multiplications. An MAC unit may also include one or more accumulators ("adders") for performing accumulations. A column of MAC units is referred to as an MAC column. An MAC column may be associated with one or more MAC lanes. An MAC lane is a path for loading data, e.g., by the IDU 450, into an MAC column. An MAC lane may also be referred to as a data transmission lane or data loading lane. An MAC column may have multiple MAC lanes. The loading bandwidth of the MAC column is an aggregation of the loading bandwidths of all the MAC lanes associated with the MAC column. With a certain number of MAC lanes, data can be fed into the same number of independent MAC units simultaneously. In some embodiments where an MAC column has four MAC lanes for feeding activations or weights into the MAC column and each MAC lane has a bandwidth of 16 bytes, the four MAC lanes have a total loading bandwidth of 64 bytes.

[0095] In some embodiments, the processing engine 470 may be capable of depthwise convolution, standard convolution, or both. In a depthwise convolution, an MAC unit may perform an MAC operation that includes a sequence of multiplications for an input operand and a weight operand. Each multiplication in the sequence (also referred to as a cycle) is a multiplication of a different activation in the input operand with a different weight in the weight operand. The activation and weight in the same cycle may correspond to the same channel. The sequence of multiplication produces a product operand that includes a sequence of products. The MAC operation may also include accumulations in which multiple product operands are accumulated to produce an output operand of the MAC unit. The processing engine 470 may output multiple output operands at a time, each of which is generated by a different MAC unit. In a standard convolution, MAC operations may include accumulations across the channels. For instance, as opposed to generating an output operand, a MAC unit may accumulate products across different channels to generate a single output point.

[0096] In some embodiments, the processing engine 470 may perform MAC operations in quantized deep learning operations, such as MAC operations in a quantized convolution. In some embodiments, an MAC unit in the processing engine 470 may receive quantized activation and quantized weights and compute a quantized MAC result. The quantized MAC result may be a quantized value in an integer format and may be the output of the MAC unit. In some embodiments, the MAC unit may also include a quantization multiplier that can multiply a quantization scale with the quantized MAC result, and the output of the MAC unit may be a real value in a floating-point format. The MAC unit may include no quantization subtractors as zero-point offsetting is not needed for the MAC operations in quantized deep learning operations.
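
The quantized MAC described above may be sketched as follows, under the assumption of integer operands and a single floating-point quantization scale. The function name quantized_mac and its parameters are hypothetical. Consistent with the text, no zero-point subtraction is modeled.

```python
def quantized_mac(quantized_activations, quantized_weights, scale=None):
    """Multiply-accumulate on integer operands.

    Without a scale, the result stays in integer format. With a scale,
    a quantization multiplier converts it to a floating-point value.
    """
    accumulator = 0
    for activation, weight in zip(quantized_activations, quantized_weights):
        accumulator += activation * weight
    if scale is None:
        return accumulator
    return accumulator * scale


print(quantized_mac([3, -2, 5], [4, 1, -1]))       # 5
print(quantized_mac([3, -2, 5], [4, 1, -1], 0.5))  # 2.5
```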

[0097] In some embodiments, the processing engine 470 may include sparsity acceleration logic for facilitating sparsity acceleration. For instance, each processing cell in the processing engine 470 may include one or more sparsity modules. In an example, each MAC column or each MAC row may have a corresponding sparsity module that accelerates MAC operations in the MAC column or MAC row. In some embodiments, a sparsity module accelerates computations in the processing engine 470 based on sparsity in activations, sparsity in weights, or both. The sparsity module may include a storage unit that stores a sparsity tensor, which may be loaded to the storage unit by the IDU 450. The sparsity tensor may be an activation sparsity tensor, a weight sparsity tensor, or a combined sparsity tensor.

[0098] An activation sparsity tensor may be the sparsity tensor of an activation tensor and has the same number of elements as the activation tensor. An element in the activation sparsity tensor may indicate whether the corresponding element in the activation tensor is zero or not. For instance, a zero-valued element in the activation sparsity tensor may indicate that the corresponding element in the activation tensor is zero. A one-valued element in the activation sparsity tensor may indicate that the corresponding element in the activation tensor is nonzero. A weight sparsity tensor may be the sparsity tensor of a weight tensor and has the same number of elements as the weight tensor. An element in the weight sparsity tensor may indicate whether the corresponding element in the weight tensor is zero or not. For instance, a zero-valued element in the weight sparsity tensor may indicate that the corresponding element in the weight tensor is zero. A one-valued element in the weight sparsity tensor may indicate that the corresponding element in the weight tensor is nonzero. The sparsity module may generate a combined sparsity tensor using an activation sparsity tensor and a weight sparsity tensor. For instance, the sparsity module may multiply an element of the activation sparsity tensor with a corresponding element of the weight sparsity tensor to compute an element of the combined sparsity tensor. The positions of the three elements in their corresponding sparsity tensors may match. In some embodiments, each element in a sparsity tensor may be a bit, and the sparsity tensor may be referred to as a sparsity bitmap.

[0099] The sparsity module may use the sparsity tensor to identify activations and weights to be used in MAC operations by the MAC units. In an embodiment where the processing engine 470 operates in the combined sparse mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of a combined sparsity tensor. In an embodiment where the processing engine 470 operates in the activation sparse mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of an activation sparsity tensor. In an embodiment where the processing engine 470 operates in the weight sparse mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of a weight sparsity tensor. The sparsity module may be bypassed in the dense mode as no sparsity acceleration would be conducted.
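
As a sketch under the assumption that each sparsity element is a single bit, the element-wise multiplication described above reduces to a bitwise AND, and the resulting bitmap can be used to pick the activation-weight pairs to compute. The function names combined_bitmap and select_pairs are hypothetical.

```python
def combined_bitmap(activation_bitmap, weight_bitmap):
    """Element-wise product of two bitmaps; for single bits this is an AND."""
    if len(activation_bitmap) != len(weight_bitmap):
        raise ValueError("bitmaps must have the same number of elements")
    return [a & w for a, w in zip(activation_bitmap, weight_bitmap)]


def select_pairs(activations, weights, bitmap):
    """Keep only the activation-weight pairs at positions where the bit is 1."""
    return [(a, w) for a, w, bit in zip(activations, weights, bitmap) if bit]


bitmap = combined_bitmap([1, 0, 1, 1], [1, 1, 0, 1])
print(bitmap)                                               # [1, 0, 0, 1]
print(select_pairs([2, 0, 3, 4], [5, 6, 0, 7], bitmap))     # [(2, 5), (4, 7)]
```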

[0100] The post-processing engine 480 processes outputs of the processing engine 470. The post-processing engine 480 may include one or more post-processing elements. In some embodiments, the post-processing elements in the post-processing engine 480 may be arranged in an array that has rows and columns. In some embodiments, the post-processing engine 480 computes activation functions. The post-processing engine 480 may receive outputs of the processing engine 470 as inputs to the activation functions. In addition or alternative to activation functions, the post-processing engine 480 may perform other types of post processing on outputs of the processing engine 470. For instance, the post-processing engine 480 may apply a bias on an output of the processing engine 470. In some embodiments, the post-processing engine 480 may be bypassed for certain DNN layers.

[0101] The ODU 490 drains data from the compute unit 460 and writes data to the local memory 440. In some embodiments, the ODU 490 may drain data on a cell level. For each processing cell, the ODU 490 may drain outputs of PEs in the processing cell based on a row index or column index of each PE. For instance, the ODU 490 may use a sequence of cycles to drain data from a processing cell. The ODU 490 may drain the output of some of the PEs in each cycle. The sequence of the cycles may be configured based on a configuration descriptor indicating the operational mode of the IDU 450. In some embodiments, the ODU 490 may perform parallel reads or parallel writes. For instance, the ODU 490 may write data into multiple memory banks in parallel or read data from multiple memory banks in parallel.

[0102] The ODU 490 may process the drained data before writing the data into the local memory 440. In some embodiments, the ODU 490 may compress data, e.g., based on sparsity in the data. The ODU 490 may also generate sparsity data that indicates the sparsity in the data. The ODU 490 may include a sparsity encoding logic that can convert data from a dense format to a sparse format. For instance, the ODU 490 may be implemented with one or more sparsity encoders. A sparsity encoder converts dense data to compressed data based on sparsity in the dense data. The data received by the ODU 490 from the compute unit 460 may be at least part of an output tensor (e.g., the output tensor 230 in FIG. 2) of a DNN layer. The sparsity encoder may remove one or more zeros in the output tensor to convert the output tensor to a compressed activation tensor ("sparse activation tensor"). The sparsity encoder may also generate at least one sparsity tensor that indicates the sparsity in the output tensor. The sparsity tensor may also be referred to as an activation sparsity tensor. The activation sparsity tensor may correspond to at least part of the output tensor (e.g., the vector 235 in FIG. 2) of the DNN layer. The activation sparsity tensor may include sparsity elements (e.g., bits), each of which corresponds to a different activation in the output tensor and indicates whether the corresponding activation is zeroed or not. The ODU 490 may have data storage units that store sparsity data.
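
The sparsity encoding described above can be sketched in a few lines, assuming a one-dimensional output tensor and a one-bit sparsity element per activation. The function name sparsity_encode is hypothetical.

```python
def sparsity_encode(dense_values):
    """Split a dense tensor into its nonzero values and a sparsity bitmap.

    Returns (compressed, bitmap), where bitmap[i] is 1 if dense_values[i]
    is nonzero and 0 otherwise.
    """
    bitmap = [1 if value != 0 else 0 for value in dense_values]
    compressed = [value for value in dense_values if value != 0]
    return compressed, bitmap


compressed, bitmap = sparsity_encode([0, 7, 0, 0, 4, 1])
print(compressed)  # [7, 4, 1]
print(bitmap)      # [0, 1, 0, 0, 1, 1]
```

The pair (compressed, bitmap) carries the same information as the dense tensor, so inserting zeros at the positions where the bitmap holds 0 recovers the original.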

[0103] In addition or alternative to compression, the ODU 490 may perform tensor permutation to change storage formats of tensors. For instance, the ODU 490 may permute tensors before writing the tensors to the local memory 440 so that the ODU 490 may write the tensors to the local memory 440 in the new formats. In some embodiments, the ODU 490 may perform tensor permutation to change Z-major formats to X-major formats or Y-major formats. For instance, a tensor drained by the ODU 490 from the processing engine 470 or from the post-processing engine 480 may be in a Z-major format. The ODU 490 may change the Z-major format to an X-major or Y-major format and write the tensor to the local memory 440 in the X-major or Y-major format. In some embodiments, the ODU 490 includes intermediate storage that facilitates tensor permutations. For instance, the ODU 490 may include one or more memory banks. A memory bank may have a matrix structure with rows and columns. The ODU 490 may write data to a memory bank in a different manner from reading the data from the memory bank for tensor permutation. For example, the ODU 490 may write a tensor to a memory bank in a column-wise manner but read the tensor from the memory bank in a row-wise manner. As another example, the ODU 490 may write a tensor to a memory bank in a row-wise manner but read the tensor from the memory bank in a column-wise manner. The ODU 490 may read a tensor while writing another tensor. For instance, the ODU 490 may read data points of a tensor from a row of the memory bank and write data points of another tensor to the row of the memory bank in the same cycle.

[0104] The ODU 490 may include one or more write modules that write data into the local memory 440. An example write module may include a write storage, e.g., a write combine buffer (WCB) that can combine smaller write transactions to bigger write transactions. Data storages in the ODU 490, such as sparsity storages, permutation storages, or write storages, may be configurable to perform different functions for different operational modes of the DPU 430. Certain aspects of the ODU 490 are described below in conjunction with FIGS. 9-14.
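
The column-wise write followed by a row-wise read described above can be modeled as below. This is a simplified sketch: the bank is a plain two-dimensional list, each incoming vector fills one column, and the function name permute_through_bank is hypothetical. How the rows and columns map to Z-major, X-major, or Y-major layouts is not modeled.

```python
def permute_through_bank(vectors):
    """Write each vector into one column of a bank, then read row by row."""
    rows = len(vectors[0])
    cols = len(vectors)
    if any(len(vector) != rows for vector in vectors):
        raise ValueError("all vectors must have the same length")
    bank = [[None] * cols for _ in range(rows)]
    # Column-wise write: vector c occupies column c of the bank.
    for c, vector in enumerate(vectors):
        for r, value in enumerate(vector):
            bank[r][c] = value
    # Row-wise read: each row now holds one element from every vector.
    return [list(row) for row in bank]


print(permute_through_bank([[1, 2, 3], [4, 5, 6]]))
# [[1, 4], [2, 5], [3, 6]]
```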

[0105] FIG. 5 illustrates an example sparse cell 500, in accordance with various embodiments. The sparse cell 500 may be a processing cell in a processing engine, e.g., the processing engine 470 in FIG. 4. The sparse cell 500 includes 16 MAC units 510 (individually referred to as "MAC unit 510"), which constitute a MAC array having four rows and four columns. The MAC array has a spatial shape of 4x4, meaning the height of the MAC array is four and the width of the MAC array is also four. The sparse cell 500 also includes 16 weight register files 520 (individually referred to as "weight register file 520"), 16 activation register files 530 (individually referred to as "activation register file 530"), four row buffers 540 (individually referred to as "row buffer 540"), and sparsity modules 560 (individually referred to as "sparsity module 560"). In other embodiments, the sparse cell 500 may include fewer, more, or different components. For example, the sparse cell 500 may include a different number of MAC units 510, weight register files 520, activation register files 530, row buffers 540, or sparsity modules 560. As another example, the sparse cell 500 may include column buffers in lieu of or in addition to the row buffers 540. Also, the shape (e.g., the height or width) of the MAC array may be different.

[0106] The MAC units 510 are configured to perform MAC operations. Each MAC unit 510 may include one or more multipliers and one or more adders. A multiplier may multiply an activation with a weight at a time to compute a product. In some embodiments (e.g., embodiments where the MAC unit 510 includes multiple multipliers), the multipliers may operate simultaneously to process multiple activation-weight pairs and compute multiple products in one cycle. An adder may accumulate products computed by the multipliers. Even though not shown in FIG. 5, the sparse cell may include an adder tree including a plurality of adder tiers. The first tier may receive outputs of a plurality of MAC units 510. The number of adders in the first tier may be half of the number of the MAC units 510, and each adder may accumulate the outputs of two MAC units 510. The second tier may receive outputs of adders in the first tier. The number of adders in the second tier may be half of the number of adders in the first tier, and each adder in the second tier may accumulate the outputs of two adders in the first tier. The adder tree may include one or more other tiers. The last tier may include a single adder that accumulates outputs of adders in the second last tier to compute a partial sum of the sparse cell 500.
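
The adder tree described above halves the number of values at each tier until a single partial sum remains. A minimal sketch, assuming the number of MAC outputs is a power of two (16 in the embodiments of FIG. 5), is given below; the function name adder_tree is hypothetical.

```python
def adder_tree(mac_outputs):
    """Reduce MAC outputs pairwise, tier by tier.

    Returns (partial_sum, number_of_tiers).
    """
    count = len(mac_outputs)
    if count == 0 or count & (count - 1) != 0:
        raise ValueError("number of MAC outputs must be a power of two")
    tier = list(mac_outputs)
    tiers = 0
    while len(tier) > 1:
        # Each adder in this tier accumulates two outputs of the previous tier.
        tier = [tier[i] + tier[i + 1] for i in range(0, len(tier), 2)]
        tiers += 1
    return tier[0], tiers


print(adder_tree(list(range(16))))  # (120, 4)
```

With 16 MAC outputs, the tiers contain 8, 4, 2, and 1 adders, respectively.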

[0107] The weight register files 520 store weights to be processed in MAC operations. In the embodiments of FIG. 5, four weight register files 520 are grouped into a storage set that stores data to be used by a column of MAC units 510. There are four storage sets corresponding to the four columns of MAC units 510. In some embodiments, a weight register file 520 may correspond to a MAC unit 510 and store data to be processed by the MAC unit. In some embodiments, all the 16 weight register files 520 constitute a weight storage unit.

[0108] The activation register files 530 store activations to be processed in MAC operations. In the embodiments of FIG. 5, four activation register files 530 are grouped into a storage set that stores data to be used by a row of MAC units 510. There are four storage sets corresponding to the four rows of MAC units 510. In some embodiments, an activation register file 530 may correspond to a MAC unit 510 and store data to be processed by the MAC unit. In some embodiments, all the 16 activation register files 530 constitute an activation storage unit. The row buffers 540 store outputs of the MAC units 510. Each row buffer 540 may drain outputs of a single row of MAC units 510.

[0109] The sparsity module 560 facilitates dynamic sparsity-based acceleration in the sparse cell 500. In the embodiments of FIG. 5, each sparsity module 560 includes a sparsity tensor storage unit 565 and a control logic 567. The sparsity tensor storage unit 565 stores combined sparsity tensors. A combined sparsity tensor stored in the sparsity tensor storage unit 565 may correspond to an activation tensor and a weight tensor. A nonzero element in the combined sparsity tensor may correspond to a nonzero activation-weight pair that includes a nonzero activation and a nonzero weight. The position of the nonzero activation in the activation tensor may match the position of the nonzero weight in the weight tensor. The product of the nonzero activation and nonzero weight would be nonzero.

[0110] The control logic 567 may control transmission of weights and activations from the weight register files 520 and the activation register files 530 to the MAC units 510 based on sparsity tensors. For instance, the control logic 567 may select a subset of the weights stored in the weight register files 520 and select a subset of activations stored in the activation register files 530 based on a combined sparsity tensor. The selected weights and activations constitute nonzero activation-weight pairs. The control logic 567 may transmit the selected weights and activations to the MAC units 510 for performing MAC operations. The other weights stored in the weight register files 520 and the other activations stored in the activation register files 530 are skipped from computation. In the embodiments of FIG. 5, each sparsity module 560 controls sparsity acceleration in a respective MAC unit 510. As the sparsity acceleration is based on both weight sparsity and activation sparsity, 16 sparsity modules 560 are used for accelerating computations in the 16 MAC units 510.

[0111] As shown in FIG. 5, the sparse cell 500 is associated with multiplexers (MUXs) 503, 504, 505, and 506. In other embodiments, the sparse cell 500 may be associated with a different number of MUXs or other devices. The MUX 503 facilitates loading weights, e.g., from the local memory 440, into the weight register files 520. The MUX 504 facilitates loading activations, e.g., from the local memory 440, into the activation register files 530. The MUX 505 facilitates loading sparsity tensors into the sparsity tensor storage unit 565. The MUX 506 may be a drain MUX that can facilitate draining outputs of the MAC units 510, e.g., to the local memory 440.

[0112] In some embodiments, the sparse cell 500 may also execute matrix multiplications converted from Fourier transform operations. For an example Fourier transform operation, the MAC units 510 may perform MAC operations in the two sequences of matrix multiplications converted from the Fourier transform operation. The weight register files 520 may be used to store data points in the transformation tensor of the Fourier transform operation. The activation register files 530 may be used to store data points in the input tensor of the Fourier transform operation. The row buffers 540 may store data points in the output tensor of the Fourier transform operation.

[0113] FIG. 6 illustrates a sparse cell array 600, in accordance with various embodiments. The sparse cell array 600 may be an example of the processing engine 470 in FIG. 4. In FIG. 6, the sparse cell array 600 includes sparse cells 610 (individually referred to as "sparse cell 610") arranged in four columns and four rows, an activation memory 620, and a weight memory 630. In other embodiments, the sparse cell array 600 may include fewer, more, or different components. For instance, the sparse cell array 600 may include a different number of columns, rows, or sparse cells 610.

[0114] Each sparse cell 610 may perform sparsity accelerated MAC operations. The sparse cells 610 may facilitate dynamic sparse mode. For instance, the sparse modes of a sparse cell 610 may be dynamically changed between a combined sparse mode, an activation sparse mode, a weight sparse mode, and a dense mode. An embodiment of a sparse cell 610 may be the sparse cell 500 in FIG. 5. The activation memory 620 stores activations, such as activations in input tensors of deep learning operations. Activations may be loaded from the activation memory 620 to sparse cells 610. The weight memory 630 stores weights, such as weights in filters of deep learning operations. Weights may be loaded from the weight memory 630 to sparse cells 610. The activation memory 620 or weight memory 630 may be a buffer. In other embodiments, the sparse cell array 600 may include a dense data memory and a sparse data memory in lieu of the activation memory 620 and weight memory 630. The dense data memory may store dense tensors, e.g., dense tensors generated by the load module 360. The sparse data memory may store sparse tensors.

[0115] The sparse cell array 600 may also execute matrix multiplications in Fourier transform operations. The activation memory 620 may be used to store input tensors of the Fourier transform operations. The weight memory 630 may be used to store transformation matrices of the Fourier transform operations.

[0116] FIG. 7 illustrates an example PE 700, in accordance with various embodiments. The PE 700 may be a unit component of a processing cell, e.g., a processing cell in the processing engine 370. In the embodiments of FIG. 7, the PE 700 includes an MAC unit 705, an activation register file 710, a weight register file 720, an output register file 750, and a sparsity accelerator 760. The MAC unit 705 includes a multiplier 730 and an adder 740. In other embodiments, the PE 700 may include fewer, more, or different components.

[0117] The activation register file 710 stores an activation operand, which may be a context. The activation register file 710 may be an example of the activation register files 530 in FIG. 5. The weight register file 720 stores a weight operand. The weight register file 720 may be an example of the weight register files 520 in FIG. 5. The activation operand and weight operand may be loaded from a memory (e.g., the memory 340) into the activation register file 710 and the weight register file 720, respectively. The sparsity accelerator 760 receives a sparsity bitmap 715 that corresponds to the sparse tensor in the weight register file 720. The sparsity bitmap 715 may be a combined sparsity bitmap when the MAC unit 705 operates in a combined sparse mode. The sparsity bitmap 715 may be an activation sparsity bitmap when the MAC unit 705 operates in an activation sparse mode. The sparsity bitmap 715 may be a weight sparsity bitmap when the MAC unit 705 operates in a weight sparse mode. The sparsity bitmap 715 may have the same size (e.g., the same number of elements) as or a larger size than the activation operand or the weight operand.

[0118] Using the sparsity bitmap 715, the sparsity accelerator 760 selects four activations from the activation register file 710 and selects four weights from the weight register file 720. The sparsity accelerator 760 transmits the selected activations and weights to the multiplier 730. These selected data elements correspond to the nonzero valued elements of the sparsity bitmap 715. The four selected activations and the four selected weights may constitute four activation-weight pairs. The multiplier 730 may compute a product based on each activation-weight pair and therefore, compute four products in total. The four products may be provided to the adder 740. Even though FIG. 7 shows a single multiplier 730, the MAC unit 705 may include multiple multipliers that can perform multiple multiplication operations at the same time.

[0119] The adder 740 accumulates the four products and computes a unit-level internal partial sum. The four unselected elements of the dense tensor are not processed to save power and time, which would not impact the value of the unit-level internal partial sum. For instance, when the dense tensor is a dense activation tensor, the weights corresponding to the unselected activations are zeros so the products of the unselected activations and the weights would all be zero and have no contribution to the unit-level internal partial sum or other partial sums computed by the sparse cell. Similarly, when the dense tensor is a dense weight tensor, the activations corresponding to the unselected weights are zeros so the products of the unselected weights and the activations would all be zero and have no contribution to the unit-level internal partial sum or other partial sums computed by the sparse cell. In other embodiments, the MAC unit 705 may operate in a dense mode in which the sparsity bitmap 715 is not used and the sparsity accelerator 760 is inactive. The MAC unit 705 may process all the activations in the activation operand and all the weights in the weight operand.
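
The claim that skipping the unselected elements does not change the partial sum can be checked numerically. The sketch below assumes eight-element operands and a bitmap with four nonzero bits, mirroring the four selected pairs in the embodiments of FIG. 7. The function names sparse_mac and dense_mac and the operand values are illustrative only.

```python
def sparse_mac(activations, weights, bitmap):
    """Accumulate products only at positions where the bitmap bit is 1."""
    return sum(a * w for a, w, bit in zip(activations, weights, bitmap) if bit)


def dense_mac(activations, weights):
    """Accumulate the products of all activation-weight pairs."""
    return sum(a * w for a, w in zip(activations, weights))


activations = [2, 0, 3, 0, 1, 4, 0, 5]
weights = [1, 7, 0, 0, 6, 2, 9, 3]
# Combined bitmap: 1 only where both the activation and the weight are nonzero.
combined = [1, 0, 0, 0, 1, 1, 0, 1]
# Weight bitmap: 1 wherever the weight is nonzero.
weight_bitmap = [1, 1, 0, 0, 1, 1, 1, 1]

print(sparse_mac(activations, weights, combined))       # 31
print(sparse_mac(activations, weights, weight_bitmap))  # 31
print(dense_mac(activations, weights))                  # 31
```

Every skipped pair contains at least one zero, so its product would have been zero and the three results agree.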

[0120] The unit-level internal partial sum may be stored in the output register file 750. In some embodiments, the unit-level internal partial sum may be used multiple times. For instance, the activation operand may represent N data blocks in the input tensor of the convolution, where N is an integer greater than 1. Instead of processing all the N data blocks to compute N unit-level internal partial sums, the unit-level internal partial sum is computed once and used N times in the convolutional layers as N unit-level internal partial sums.

[0121] In some embodiments, the PE 700 receives one or more PE-level internal partial sums from one or more other PEs. The adder 740 or an accumulator (not shown in FIG. 7) can accumulate the one or more PE-level internal partial sums with the PE-level internal partial sum of the PE 700 and store the result of the accumulation (i.e., a multi-PE internal partial sum) in the output register file 750. The one or more other PEs may be in the same column as the PE 700 in a sparse cell. The multi-PE internal partial sum may be a column-level internal partial sum. In some embodiments, the PE-level internal partial sum of the PE 700 or the multi-PE internal partial sum may be sent to one or more other PEs for further accumulation.

[0122] FIG. 8 illustrates an example IDU 800, in accordance with various embodiments. The IDU 800 may be a data delivery unit in a DPU. The IDU 800 may be an example of the IDU 450 in FIG. 4. The IDU 800 is communicatively coupled to a memory port 801 and a compute unit 802. The memory port 801 may provide an interface over which a memory may be accessed for data read or write. In some embodiments, the memory port 801 may be a common memory port (CMX) that can be shared by the IDU 800 with the ODU in the DPU or other components of the DPU. The IDU 800 can load data from the memory to the compute unit 802 through the memory port 801. In some embodiments, the memory port 801 may be a component of the memory. The memory may be an example of the local memory 440 in FIG. 4. The compute unit 802 may be an example of the compute unit 460 in FIG. 4. As shown in FIG. 8, the IDU 800 includes an arbiter 810, a reader 820, a storage unit 830, a response FIFO 840, a response data FIFO 845, a command FIFO 850, a configurable storage unit 860, a MUX 870, a control logic 880, and a data loader 890. In other embodiments, the IDU 800 may include fewer, more, or different components. For instance, the IDU 800 may include multiple response data FIFOs.

[0123] The arbiter 810 may arbitrate multiple data read requests ("read requests"). A read request is a request to read data (e.g., activations, weights, sparsity bitmaps, etc.) stored in the memory. The read request may include information indicating the memory address where the data is stored. The memory port 801 may be shared among multiple data requests, which can result in bank conflicts, e.g., there may be multiple data requests requesting data stored in the same bank in the memory. Such conflicts may be resolved by the arbiter 810. The arbiter 810 can schedule these read requests, e.g., by determining the order in which the read requests should be processed or transmitted to the memory port 801. The reader 820 makes read requests.

[0124] The storage unit 830, response FIFO 840, response data FIFO 845, command FIFO 850, and configurable storage unit 860 may constitute the internal storage of the IDU 800. The internal storage may be required to meet DNN performance goals. For instance, without internal storage, the IDU 800 may fail to keep the pipeline to the memory busy or fail to absorb stalls, and may therefore fall short of utilizing the available bandwidth. In some embodiments, the IDU 800 may use independent ports for activations, weights, and sparsity bitmaps. That can cause activations, weights, and sparsity bitmaps to arrive at different times. The compute unit 802 would need activations to be aligned with weights for performing computations. In the sparse mode, the compute unit 802 would need the activations, weights, and sparsity bitmaps to be aligned. With the internal storage, the IDU 800 can store the already-received data in its internal storage while waiting for the other data and, after all the data is received, send all the data to the compute unit 802. In some embodiments, when the compute unit 802 is performing computation on the previously loaded data, the internal storage of the IDU 800 stores data for the next computation to be performed by the compute unit 802.

[0125] The storage unit 830 may store read requests. In some embodiments, the storage unit 830 is a RAM. When the arbiter 810 determines that a read request may be processed, the reader 820 may retrieve the read request from the storage unit 830 and transmit the read request to the arbiter 810. The command FIFO 850 may store commands from the reader 820, e.g., control signals, metadata, and so on. The command FIFO 850 may include a sparsity storage that may store commands for sparsity data.

[0126] The response FIFO 840 and response data FIFO 845 receive responses to read requests from the arbiter 810. A response to a read request may include one or more response signals indicating that the corresponding read request was a valid request and that a valid response was issued. The response signals may be stored in the response FIFO 840. The response may also include data read from the memory, and the data may be stored in the response data FIFO 845. In some embodiments, the response data FIFO 845 stores activations and weights.

[0127] The configurable storage unit 860 is configurable to adopt different functions for different operational modes of the DPU. The configurable storage unit 860 may be configured by a configuration descriptor, which may be generated by a DNN module, e.g., the DNN module 401 in FIG. 4. In some embodiments, e.g., embodiments in which the DPU operates in a sparse mode, data received by the IDU 800 from the memory port 801 may be sparsity data. The configurable storage unit 860 is configured to store sparsity data. The sparsity data may include one or more sparsity tensors, such as sparsity bitmaps. In some embodiments, e.g., embodiments in which the DPU operates in a dense mode, the configurable storage unit 860 may have a different function. The IDU 800 may not request any sparsity data from the memory. The configurable storage unit 860, instead of storing sparsity data or staying empty, may be used as an additional response data FIFO to augment the response data FIFO 845, as shown by the dotted line in FIG. 8. That way, the IDU 800 can store more response data at a time. For instance, the configurable storage unit 860 may increase the FIFO depth for storing response data, e.g., from a depth of 20 to a depth of 52. In some embodiments, the response data FIFO depth may be further augmented by the sparsity storage in the command FIFO 850, e.g., from the depth of 52 to a depth of 68. The increase in depth can help absorb additional stalls. The configurable storage unit 860 may receive no sparsity data in dense modes. In some embodiments, the configurable storage unit 860 is a RAM.
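
The example depths above (20, 52, and 68) can be summarized in a minimal sketch. The names `Mode` and `response_fifo_depth` are hypothetical, and the per-storage contributions of 32 and 16 entries are only inferred from the differences between the example depths; an actual implementation may size the storages differently.

```python
from enum import Enum

class Mode(Enum):
    SPARSE = "sparse"
    DENSE = "dense"

# Example depths taken from the description: 20 -> 52 -> 68.
BASE_RESPONSE_DEPTH = 20         # response data FIFO 845 alone
CONFIGURABLE_STORAGE_DEPTH = 32  # assumption: implied by 52 - 20
COMMAND_SPARSITY_DEPTH = 16      # assumption: implied by 68 - 52

def response_fifo_depth(mode, use_command_sparsity_storage=False):
    """Return the effective response-data FIFO depth for an operational mode."""
    depth = BASE_RESPONSE_DEPTH
    if mode is Mode.DENSE:
        # In dense mode no sparsity data is requested, so the configurable
        # storage can be appended to the response data FIFO.
        depth += CONFIGURABLE_STORAGE_DEPTH
        if use_command_sparsity_storage:
            depth += COMMAND_SPARSITY_DEPTH
    return depth
```

In the sparse mode the sketch returns the base depth, since the configurable storage is then occupied by sparsity data.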

[0128] The MUX 865 receives data from the response data FIFO 845 and data from the configurable storage unit 860 as input signals and selects one of the input signals as an output of the MUX 865. The control logic 880 receives commands stored in the command FIFO 850 and uses the commands to control data transfer to the compute unit 802. The data loader 890 receives output data of the MUX 865, which may either be data stored in the response data FIFO 845 or data stored in the configurable storage unit 860. The data loader 890 may perform one or more operations on the data. The operations may include, for example, generating a data table, data conversion, data unpacking, or other types of operations. The data loader 890 sends data to the compute unit 802.

[0129] FIG. 9 illustrates an example ODU 900, in accordance with various embodiments. The ODU 900 may be a data delivery unit in a DPU. The ODU 900 facilitates transfer of data computed by a compute unit from the compute unit to a memory. The ODU 900 may transform the data by performing data compression, tensor permutation, and so on. The ODU 900 may be an example of the ODU 490 in FIG. 4. As shown in FIG. 9, the ODU 900 includes a staging buffer 910, transposable register files (TRFs) 920 (individually referred to as "TRF 920"), a compression module 930, a data shifter 940, a data memory 945, a data buffer 947, a sparsity shifter 950, a sparsity memory 955, a sparsity buffer 957, a WCB 960, a data memory 965, a CMX data buffer 967, another WCB 970, another sparsity memory 975, and a CMX sparsity buffer 977. In other embodiments, the ODU 900 may include fewer, more, or different components.

[0130] The staging buffer 910 temporarily stores data received from the compute unit. For instance, the staging buffer 910 may store data while the TRFs 920 are performing permutation on previously-received data. In an example, the data may be output activations of a neural network layer, which may be used as input activations for the next neural network layer. Data may be read from the staging buffer 910 and written into the TRFs 920. The TRFs 920 are storages that can facilitate tensor permutation. For example, the TRFs 920 may transpose X and Z coordinates and write out data in X-major format even though they receive data in Z-major format. As another example, the TRFs 920 may transpose Y and Z coordinates and write out data in Y-major format even though they receive data in Z-major format.

[0131] In some embodiments, each of the TRFs 920 may include storage elements arranged in an array with rows and columns. Each storage element may store a single entry at a time. An entry may be a data point in a tensor. The TRF 920 may alternate orientations in which data points of a tensor are written and read to change the storage format of the tensor. The tensor may be a matrix that has rows and columns. The number of data points in each row of the tensor may be no greater than the number of storage elements in a row or column of the TRF 920. Also, the number of data points in each column of the tensor may be no greater than the number of storage elements in a row or column of the TRF 920. In an example, data may be written into the TRF 920 in a column-wise manner but read from the TRF 920 in a row-wise manner to permute a tensor. In another example, data may be written into the TRF 920 in a row-wise manner but read from the TRF 920 in a column-wise manner to permute a tensor.

[0132] The compression module 930 may compress data read from the TRFs 920. For instance, the compression module 930 compresses data based on sparsity. In some embodiments, the compression module 930 may compress data by removing data elements having values not greater than a threshold. The threshold may be zero, for example. The compression module 930 may include one or more comparators, each of which may compare the value of a data element with the threshold. After determining that the data element is not greater than the threshold, the compression module 930 may remove the data element. The compression module 930 may also generate sparsity data that indicates sparsity in the data received from the TRFs 920. For example, the compression module 930 may generate sparsity tensors. A sparsity tensor may include a sequence of sparsity data elements, each of which corresponds to a data element received from the TRFs 920 and indicates whether the data element is zero or not. The sparsity tensor may correspond to an activation tensor or weight tensor, and the positions of the sparsity data elements in the sparsity tensor may match the positions of the corresponding data elements in the activation tensor or weight tensor. An example sparsity tensor is a sparsity bitmap, in which each sparsity data element is a single bit. A zero bit may indicate that the corresponding data element is zero, while a one bit may indicate that the corresponding data element is not zero.
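
The relationship between the compressed data and the sparsity bitmap can be sketched as follows. This is a minimal illustration and not the disclosed hardware: the function names `compress` and `decompress` are hypothetical, the threshold is assumed to be zero, and the activations are assumed to be non-negative (e.g., after a ReLU) so that "not greater than the threshold" coincides with "zero".

```python
def compress(activations, threshold=0):
    """Remove elements not greater than `threshold` and record a sparsity bitmap.

    Returns (kept, bitmap): bitmap[i] is 1 if activations[i] was kept, else 0.
    """
    kept, bitmap = [], []
    for value in activations:
        keep = value > threshold  # one comparator per data element
        bitmap.append(1 if keep else 0)
        if keep:
            kept.append(value)
    return kept, bitmap

def decompress(kept, bitmap, fill=0):
    """Rebuild the dense sequence from the kept values and the bitmap."""
    values = iter(kept)
    return [next(values) if bit else fill for bit in bitmap]

activations = [0, 3, 0, 0, 7, 1, 0, 2]
kept, bitmap = compress(activations)
restored = decompress(kept, bitmap)
```

Because the bitmap preserves the position of every element, the dense sequence can be recovered from the compressed values whenever the removed elements were zero.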

[0133] The compressed data (e.g., compressed activation tensors) is transmitted to the data shifter 940. The data shifter 940 may shift data elements received from the compression module 930. In some embodiments, the data elements all have nonzero values. The compressed data, before or after the shifting, may be stored in the data memory 945. The compressed and shifted data is transmitted to the data buffer 947. The sparsity data is transmitted to the sparsity shifter 950. The sparsity shifter 950 may shift sparsity data elements received from the compression module 930. The sparsity data, before or after the shifting, may be stored in the sparsity memory 955. The shifted sparsity data is transmitted to the sparsity buffer 957.

[0134] The WCB 960 may combine small write transactions for writing data stored in the data buffer 947 into bigger write transactions. In an example, the WCB 960 may combine two 16-byte transactions into a 32-byte transaction. The small or bigger write transactions may be stored in the data memory 965. The bigger write transactions generated by the WCB 960 are stored in the CMX data buffer 967, from which the write transactions are to be sent to the memory. The WCB 970 may combine small write transactions for writing sparsity data stored in the sparsity buffer 957 into bigger write transactions. The small or bigger write transactions may be stored in the sparsity memory 975. The bigger write transactions generated by the WCB 970 are stored in the CMX sparsity buffer 977, from which the write transactions are to be sent to the memory.
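
The write combining described above can be illustrated with a minimal sketch. The function name `combine_writes`, the representation of a transaction as an (address, payload) pair, and the rule that only address-contiguous transactions are merged are assumptions made for illustration; the 16-byte and 32-byte sizes come from the example in the text.

```python
def combine_writes(transactions, max_bytes=32):
    """Merge consecutive, address-contiguous writes into bigger transactions.

    transactions: list of (address, payload) pairs in issue order, where
    payload is a bytes object. Contiguity as the merge criterion is an
    assumption, not a statement of how the WCB decides.
    """
    combined = []
    for address, payload in transactions:
        if combined:
            last_address, last_payload = combined[-1]
            contiguous = last_address + len(last_payload) == address
            fits = len(last_payload) + len(payload) <= max_bytes
            if contiguous and fits:
                combined[-1] = (last_address, last_payload + payload)
                continue
        combined.append((address, payload))
    return combined

# Two contiguous 16-byte writes become one 32-byte write; the third stays as is.
writes = [(0, bytes(16)), (16, bytes(16)), (64, bytes(16))]
merged = combine_writes(writes)
```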

[0135] In some embodiments, one or more of the TRFs 920, the sparsity memory 955, the sparsity memory 975, the WCB 960, or the WCB 970 may be configurable so that it can have different functions in different operational modes of the DPU. Certain aspects of configurable storages in the ODU are described below in conjunction with FIGS. 10-14.

[0136] FIG. 10 illustrates that sparsity storages in the ODU 900 can be reconfigured for staging, in accordance with various embodiments. In the embodiments of FIG. 10, the sparsity memory 955 and sparsity memory 975 are configurable. For instance, the sparsity memory 955 and sparsity memory 975 are configured to store sparsity data when the compute unit (or another compute unit) is to operate in a sparse mode for executing the next neural network layer. When the compute unit (or another compute unit) is to operate in a dense mode for executing the next neural network layer, the sparsity memory 955 and sparsity memory 975 are configured to adopt the functionality of the staging buffer 910, as shown by the dashed lines in FIG. 10. The sparsity memory 955 and sparsity memory 975 can also store data received from the compute unit and augment the storage capacity of the staging buffer 910.

[0137] The sparsity memory 955 and sparsity memory 975 may be configured by one or more configuration descriptors generated for the DNN. The configuration descriptors may be generated offline, e.g., before the execution of the DNN starts, by the DNN module 401. Even though both the sparsity memory 955 and sparsity memory 975 are configurable in FIG. 10, the sparsity memory 955 or sparsity memory 975 may not be configurable in other embodiments.

[0138] FIGS. 11A-11C illustrate that sparsity storages in the ODU 900 can be reconfigured for reordering, in accordance with various embodiments. In the embodiments of FIGS. 11A-11C, the sparsity memory 955 and sparsity memory 975 are configurable. For instance, the sparsity memory 955 and sparsity memory 975 are configured to store sparsity data when the compute unit (or another compute unit) is to operate in a sparse mode for executing the next neural network layer. When the compute unit (or another compute unit) is to operate in a dense mode for executing the next neural network layer, the sparsity memory 955 and sparsity memory 975 are configured to function as a reordering buffer 1100, as shown by the dashed lines in FIG. 11A. In an example, the dense mode is a dense, X-major mode, e.g., the next neural network layer is to be executed with input activations in X-major format and without sparsity acceleration.

[0139] The reordering buffer 1100 is placed between the staging buffer 910 and the TRFs 920 on the data path in the ODU 900. The sparsity memory 955 and sparsity memory 975 may change the order in which data elements (e.g., output activations) are stored or transferred to facilitate tensor permutation by the TRFs 920. In an example, the data elements may be stored in the staging buffer 910 in a first order, which may be the same order in which the data elements are output from the compute unit. The data elements may be written into the reordering buffer 1100 in the first order but read from the reordering buffer 1100 in a second order. The second order is different from the first order. For instance, the first order may follow a Z-major format, while the second order may follow an X-major format.

[0140] FIG. 11B shows data read and write transactions in the reordering buffer 1100 and a TRF 920 for the example in which the first order follows a Z-major format and the second order follows an X-major format. "W" in FIG. 11B stands for write, and "R" in FIG. 11B stands for read. Each column in FIG. 11B represents four cycles. Each cycle may be used to transfer a fixed number of bits, such as 8 bits, i.e., 1 byte. The ODU 900 may receive 4 X-bytes (represented by a cell with the dotted pattern in FIG. 11B) in Z=0:15 (i.e., 16 consecutive Z-bytes), followed by 4 X-bytes (represented by a cell with the horizontal stripe pattern in FIG. 11B) in Z=16:31 (i.e., 16 consecutive Z-bytes), further followed by 4 X-bytes (represented by a cell with the diamond grid pattern in FIG. 11B) in Z=32:47 (i.e., 16 consecutive Z-bytes), further followed by 4 X-bytes (represented by a cell with the diagonal stripe pattern in FIG. 11B) in Z=48:63 (i.e., 16 consecutive Z-bytes). These 16 X-bytes are represented by four different patterns in FIG. 11B: a dotted pattern, a horizontal stripe pattern, a diamond grid pattern, and a diagonal stripe pattern. These bytes are written to the staging buffer 910 and then written into the reordering buffer 1100 in the same order. FIG. 11B shows four groups of such 16 bytes.

[0141] The bytes are read from the reordering buffer 1100 in a different order. The reordering buffer 1100 may collect 16 consecutive X-bytes per Z-set and present them to the TRF 920. As shown in FIG. 11B, 16 consecutive X-bytes are read from the reordering buffer 1100 and written into the TRF 920. With the reordering functionality of the sparsity memory 955 and sparsity memory 975, fewer TRFs may be required. As shown in FIG. 11C, without the reordering buffer 1100, 4 TRFs 1110A-1110D would be needed for the 64 consecutive Z-bytes, as each group of 4 X-bytes in 16 consecutive Z-bytes would need one TRF. With the reordering buffer 1100, one TRF is sufficient, as the 16 consecutive X-bytes can be read from the reordering buffer 1100 and written into a single TRF.
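
The difference between the write order and the read order of the reordering buffer can be sketched as below. This is one possible reading of the example, stated as an assumption: chunks of 4 X-bytes are taken to arrive with all four Z-sets of one X group first, followed by the next X group, and the buffer is taken to release chunks grouped by Z-set. The names `arrival_order`, `read_order`, and `x_bytes_per_z_set` are hypothetical.

```python
X_BYTES_PER_CHUNK = 4  # 4 X-bytes per chunk (from the example)
NUM_Z_SETS = 4         # Z=0:15, 16:31, 32:47, 48:63 (from the example)
NUM_X_GROUPS = 4       # four groups of chunks (assumed arrival pattern)

def arrival_order():
    """Chunks as (x_group, z_set) in the assumed Z-major arrival order."""
    return [(xg, zs) for xg in range(NUM_X_GROUPS) for zs in range(NUM_Z_SETS)]

def read_order(chunks):
    """Read chunks grouped by Z-set so consecutive X-bytes come out together."""
    return sorted(chunks, key=lambda chunk: (chunk[1], chunk[0]))

def x_bytes_per_z_set(chunks, z_set):
    """Number of consecutive X-bytes available for one Z-set."""
    return X_BYTES_PER_CHUNK * sum(1 for _, zs in chunks if zs == z_set)

written = arrival_order()
read = read_order(written)
```

Under these assumptions, the first four chunks read belong to the same Z-set and together supply 16 consecutive X-bytes, which is what allows a single TRF to be filled.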

[0142] The sparsity memory 955 and sparsity memory 975 may be configured by one or more configuration descriptors generated for the DNN. The configuration descriptors may be generated offline, e.g., before the execution of the DNN starts, by the DNN module 401. Even though both the sparsity memory 955 and sparsity memory 975 are reconfigured to function as the reordering buffer 1100 in the embodiments of FIGS. 11A-11C, one of the sparsity memory 955 and sparsity memory 975 may not be reconfigured in other embodiments.

[0143] FIG. 12 illustrates that sparsity storages in the ODU 900 can be reconfigured for reordering and transposing, in accordance with various embodiments. In the embodiments of FIG. 12, the sparsity memory 955 and sparsity memory 975 are configurable. For instance, the sparsity memory 955 and sparsity memory 975 are configured to store sparsity data when the compute unit (or another compute unit) is to operate in a sparse mode for executing the next neural network layer. When the compute unit (or another compute unit) is to operate in a dense mode for executing the next neural network layer, the sparsity memory 955 and sparsity memory 975 are configured to function as a buffer 1200, as shown by the dashed lines in FIG. 12.

[0144] The buffer 1200 can reorder and transpose data. In an example, the dense mode is a dense, Y-major mode, e.g., the next neural network layer is to be executed with input activations in Y-major format and without sparsity acceleration. The ODU 900 may need to perform Y-permutation on data received from the compute unit. The Y-permutation may be done by using the buffer 1200 and the TRFs 920 in the ODU 900. For Y-permutation, consecutive bytes in Y need to be collected while the compute unit sends consecutive bytes in Z. The buffer 1200 may save the incoming 1x4x16 tensors until there are 16 consecutive bytes in the Y dimension. For a 16x16x16 tensor, the entire tensor needs to be stored before any data can be written out. That may be because in Z-X-Y/Z-Y-X order (i.e., Z-major), Y is not the fastest changing dimension. The entire tensor may be stored in the buffer 1200. By reading the tensor and using the TRFs 920 to transpose, the ODU 900 can get close to a Z-major performance for a Y-major permutation.

[0145] FIG. 13 illustrates that permutation storages in the ODU 900 can be reconfigured for staging, in accordance with various embodiments. In the embodiments of FIG. 13, the TRFs 920 are configurable. For instance, the TRFs 920 are configured to permute tensors when the compute unit (or another compute unit) is to operate in an X-major mode or Y-major mode for executing the next neural network layer. When the compute unit (or another compute unit) is to operate in a Z-major mode for executing the next neural network layer, the TRFs 920 are configured to augment the staging buffer 910 and form an augmented staging buffer 1300 with the staging buffer 910. The TRFs 920 may be configured by one or more configuration descriptors generated for the DNN. The configuration descriptors may be generated offline, e.g., before the execution of the DNN starts, by the DNN module 401. Even though all the TRFs 920 are configurable in FIG. 13, one or more of the TRFs 920 may not be configurable in other embodiments.

[0146] FIG. 14 illustrates that write storages in the ODU 900 can be reconfigured as a transaction FIFO 1400, in accordance with various embodiments. In the embodiments of FIG. 14, the WCB 960 and WCB 970 are configurable. For instance, the WCB 960 and WCB 970 are configured to buffer write transactions before the write transactions are sent to the memory. As described above, the WCB 960 and WCB 970 can combine smaller write transactions into bigger write transactions. In some embodiments, there may not be much of an advantage in combining smaller transactions. In these cases, the WCB 960 and WCB 970 are repurposed as a transaction FIFO 1400. The transaction FIFO 1400 may buffer transactions before sending them out to the memory port, such as the CMX. This can be advantageous because the CMX may not be ready to accept transactions at times, for instance due to other accesses from the read side. To avoid letting these stalls propagate through the entire DPU pipeline and decrease overall performance, the transaction FIFO 1400 may absorb some or even the entire effect of the stalls.
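
The stall-absorbing effect of a transaction FIFO can be illustrated with a minimal cycle-level sketch. The model is hypothetical and deliberately simple: the producer attempts to push one transaction per cycle, the memory port accepts at most one transaction per cycle when it is ready, and the function name `producer_stall_cycles` and the readiness pattern are assumptions for illustration.

```python
from collections import deque

def producer_stall_cycles(port_ready, fifo_depth):
    """Count the cycles in which the producer cannot push a transaction.

    port_ready: sequence of booleans, one per cycle, telling whether the
    memory port accepts a transaction in that cycle.
    """
    fifo = deque()
    stalls = 0
    for ready in port_ready:
        if ready and fifo:
            fifo.popleft()      # the port drains the oldest transaction
        if len(fifo) < fifo_depth:
            fifo.append("txn")  # the producer pushes a new transaction
        else:
            stalls += 1         # FIFO full: the stall reaches the producer
    return stalls

# The port is busy with other accesses for three consecutive cycles.
pattern = [True, True, False, False, False, True, True, True]
shallow = producer_stall_cycles(pattern, fifo_depth=1)
deep = producer_stall_cycles(pattern, fifo_depth=4)
```

In this toy model the shallow FIFO passes all three busy cycles on to the producer, while the deeper FIFO absorbs them entirely.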

[0147] The WCB 960 and WCB 970 may be configured by one or more configuration descriptors generated for the DNN. The configuration descriptors may be generated offline, e.g., before the execution of the DNN starts, by the DNN module 401. Even though both the WCB 960 and WCB 970 are configurable in FIG. 14, the WCB 960 or WCB 970 may not be configurable in other embodiments.

[0148] FIGS. 15A-15L illustrate a process of permuting a tensor in an intermediate storage unit, in accordance with various embodiments. The tensor permutation process may be performed by an ODU in a DNN accelerator, such as the ODU 490 in FIG. 4 or the ODU 900 in FIG. 9. The intermediate storage unit may be a TRF. The tensor permutation process may be used to change a Z-major format to an X-major format. For the purpose of illustration, the tensor has a length of 8 in the X dimension and a length of 8 in the Z dimension. In other embodiments, the tensor may have a different shape or size.

[0149] FIGS. 15A-15D show the process of writing the tensor to the intermediate storage unit. FIG. 15A shows the intermediate storage unit before any data is written to the intermediate storage unit. The intermediate storage unit has a matrix structure or grid structure that includes 64 storage elements arranged in 8 rows and 8 columns. Each storage element in the intermediate storage unit is represented by a square. In some embodiments, each storage element may include a fixed number of flip-flops for storing a fixed number of bits. The fixed number may be 4, 8, 16, 32, and so on. Examples of the storage elements include the flip-flop groups in FIG. 14. In some embodiments, each storage element may store an entry at a time. An entry may include a single data point. The intermediate storage unit may be used to store data of various data types, such as INT8, FP16, BF16, FP32, and so on. The intermediate storage unit may be an example of the TRFs 920 in FIG. 9. In some embodiments, the intermediate storage unit is a component of the drain module performing the tensor permutation.

[0150] In FIG. 15B, a vector of 8 data points is written to a column of the intermediate storage unit. The 8 data points are represented by dotted squares in FIG. 15B. The vector may be in the Z dimension of the tensor. For instance, the 8 data points may have the same (x, y) coordinate but different z coordinates. In some embodiments, the 8 data points may be computed by performing a neural network operation or part of a neural network operation. The vector is written to the rightmost column of the intermediate storage unit. After the vector is written, the intermediate storage unit has one column storing data and seven columns with no data. The 8 data points may be written in one clock cycle of the intermediate storage unit. The clock cycle may be the first clock cycle of a first sequence of clock cycles.

[0151] In FIG. 15C, another vector of 8 data points is written to another column of the intermediate storage unit. The vector is written to the second rightmost column of the intermediate storage unit. After this vector is written, the intermediate storage unit has two columns storing data and six columns with no data. The 8 data points may be written in the second clock cycle of the first sequence of clock cycles.

[0152] Even though not shown in FIG. 15C, six additional vectors, each of which has 8 data points, are written to other columns of the intermediate storage unit until the intermediate storage unit is full. Each additional vector may be written in a subsequent clock cycle of the first sequence of clock cycles. As shown in FIG. 15D, all the columns of the intermediate storage unit have data stored. The writes of the 64 data points are in a column-wise manner and take 8 clock cycles of the first sequence of clock cycles. Each column of the intermediate storage unit stores a vector in the Z dimension, and each row of the intermediate storage unit stores a vector in the X dimension.

[0153] In FIG. 15E, the data stored in the top row of the intermediate storage unit (i.e., a vector in the X dimension) is read from the intermediate storage unit in the first clock cycle of a second sequence of clock cycles. The vector may be written to a memory, e.g., the local memory 440, by the drain module. In the same clock cycle, a new vector of 8 data points, which are represented by squares with diamond grids, is written to the top row of the intermediate storage unit. The new vector may be a portion of a new tensor, which is different from the tensor written to the intermediate storage unit in the first sequence of clock cycles. The new vector may be in the Z dimension. For the purpose of illustration, the tensor written to the intermediate storage unit in the first sequence of clock cycles is referred to as the first tensor, and the new tensor is referred to as the second tensor.

[0154] In FIG. 15F, the data stored in the second top row of the intermediate storage unit (i.e., another vector of the first tensor in the X dimension) is read from the intermediate storage unit in a second clock cycle in the second sequence of clock cycles. This vector may be written to the local memory 440. In the same cycle, another vector in the second tensor is written to the second top row of the intermediate storage unit. This row-wise read/write continues.

[0155] FIG. 15G shows the layout of the intermediate storage unit after the seventh clock cycle in the second sequence of clock cycles. The data stored in the top seven rows of the intermediate storage unit (i.e., data in the first tensor) is read, and data in the second tensor is written to these rows.

[0156] In FIG. 15H, the data stored in the bottom row of the intermediate storage unit is read from the intermediate storage unit in the last clock cycle of the second sequence of clock cycles. In the same cycle, another vector in the second tensor is written to the bottom row of the intermediate storage unit. The second sequence of clock cycles is complete, i.e., the read of the first tensor and the write of the second tensor are complete.

[0157] As the first tensor is written in the column-wise manner (each column stores a vector in the Z dimension) but read in the row-wise manner (each row stores a vector in the X dimension), the alternating pattern of the write/read of the first tensor can change the Z-major format of the first tensor to an X-major format. Also, as the second tensor is written in the same clock cycles in which the first tensor is read, the draining process can be more efficient compared with currently available DNN accelerators that cannot write a tensor until the read of the previous tensor is complete.

[0158] FIGS. 15I-15L show a third sequence of clock cycles. In FIG. 15I, the data in the rightmost column of the intermediate storage unit (i.e., a vector of the second tensor in the X dimension) is read in the first cycle of the third sequence of clock cycles. In the same clock cycle, a vector of 8 data points in a third tensor is written to the rightmost column of the intermediate storage unit. Data points of the third tensor are represented by dotted squares.

[0159] In FIG. 15J, the data in the second rightmost column of the intermediate storage unit (i.e., another vector of the second tensor in the X dimension) is read in the second cycle of the third sequence of clock cycles. In the same clock cycle, another vector of 8 data points in the third tensor is written to the second rightmost column of the intermediate storage unit. This column-wise read/write process continues.

[0160] In FIG. 15K, the data in seven columns of the intermediate storage unit is read. Also, seven vectors of the third tensor are written into these seven columns of the intermediate storage unit. In FIG. 15L, the data in the last column of the intermediate storage unit is read in the last clock cycle of the third sequence of clock cycles. Also, an eighth vector of the third tensor is written into the last column of the intermediate storage unit in the same cycle.

[0161] As the second tensor is written in the row-wise manner (each row stores a vector in the Z dimension) but read in the column-wise manner (each column stores a vector in the X dimension), the alternating pattern of the write/read of the second tensor can change the Z-major format of the second tensor to an X-major format. Also, as the third tensor is written in the same clock cycles in which the second tensor is read, the draining process can be more efficient compared with currently available DNN accelerators that cannot write a tensor until the read of the previous tensor is complete.
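
The alternating write/read pattern of FIGS. 15A-15L can be modeled in software as a minimal sketch. It is an illustration under stated assumptions rather than the disclosed circuit: the function name `drain_with_alternation` is hypothetical, each input tensor is a square list of lists indexed as `tensor[x][z]` (one Z-vector per X position), and index 0 simply stands for whichever row or column is accessed first.

```python
def drain_with_alternation(tensors):
    """Permute square tensors from Z-major to X-major through one shared grid.

    tensors: list of n x n tensors indexed tensor[x][z].
    Returns one output per tensor, indexed output[z][x].
    Each loop iteration models one clock cycle: a line holding data of the
    previous tensor is read and, in the same cycle, reused for the next one.
    """
    n = len(tensors[0])
    grid = [[None] * n for _ in range(n)]
    outputs = []
    column_wise = True  # orientation used for this sequence of cycles
    for step, tensor in enumerate(tensors + [None]):
        drained = []
        for i in range(n):
            if step > 0:  # read an X-vector of the previous tensor
                if column_wise:
                    drained.append([grid[r][i] for r in range(n)])
                else:
                    drained.append(list(grid[i]))
            if tensor is not None:  # write a Z-vector of the next tensor
                for j, value in enumerate(tensor[i]):
                    if column_wise:
                        grid[j][i] = value
                    else:
                        grid[i][j] = value
        if step > 0:
            outputs.append(drained)
        column_wise = not column_wise  # alternate orientation per tensor
    return outputs

# Three 3 x 3 tensors whose entries encode (tensor id, x, z).
tensors = [[[(t, x, z) for z in range(3)] for x in range(3)] for t in range(3)]
permuted = drain_with_alternation(tensors)
```

Because each line is read and rewritten in the same modeled cycle, draining k tensors of size n x n takes (k + 1) * n cycles in this sketch, rather than the 2 * k * n cycles that separate write and read phases would take.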

[0162] Even though not shown in FIGS. 15A-15L, read/write in the alternating pattern may continue. In some embodiments, the first tensor, second tensor, or third tensor may be a subtensor of an output tensor of a neural network operation that is computed by a compute engine, e.g., the processing engine 370 or post-processing engine 380. The output tensor may be an input tensor of another neural network operation, e.g., a neural network operation in the next layer of the DNN. The first tensor, second tensor, and third tensor may be written to the local memory 440 from the intermediate storage unit. The process in FIGS. 15A-15L can be used to permute the output tensor so that the order in which the data points are stored in the local memory 440 is different from the order in which the data points are output from the compute engine. The process may be used to change a Z-major format to an X-major format, change an X-major format to a Z-major format, or perform other types of tensor permutation. The tensor permutation may be part of a neural network operation. Additionally or alternatively, the tensor permutation may improve utilization of components in the compute engine for executing the next layer of the DNN which uses the first tensor, second tensor, or third tensor as input data.

[0163] FIG. 16 is a block diagram of a DNN module 1600, in accordance with various embodiments. The DNN module 1600 may be an embodiment of the DNN module 401 in FIG. 4. As shown in FIG. 16, the DNN module 1600 includes an interface module 1610, a training module 1620, a compressing module 1630, a validating module 1640, a compiler 1650, and a datastore 1660. In other embodiments, alternative configurations, different or additional components may be included in the DNN module 1600. Further, functionality attributed to a component of the DNN module 1600 may be accomplished by a different component included in the DNN module 1600 or a different module or system.

[0164] The interface module 1610 facilitates communications of the DNN module 1600 with other modules or systems. For example, the interface module 1610 establishes communications between the DNN module 1600 and an external database to receive data that can be used to train DNNs or input into DNNs to perform tasks. As another example, the interface module 1610 transmits configuration descriptors to the DNN accelerator 402 for configuring components of the DNN accelerator 402 for DNN execution. As yet another example, the interface module 1610 supports the DNN module 1600 in distributing DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.

[0165] The training module 1620 trains DNNs by using a training dataset. The training module 1620 forms the training dataset. In an example where the training module 1620 trains a DNN to recognize objects in images, the training dataset includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in a training image. In some embodiments, a part of the training dataset may be used to initially train the DNN, and the rest of the training dataset may be held back as a validation subset used by the validating module 1640 to validate performance of a trained DNN. The portion of the training dataset not including the tuning subset and the validation subset may be used to train the DNN.

[0166] The training module 1620 also determines hyperparameters for training the DNN. Hyperparameters are variables specifying the DNN training process. Hyperparameters are different from parameters inside the DNN (e.g., weights of filters). In some embodiments, hyperparameters include variables determining the architecture of the DNN, such as number of hidden layers, etc. Hyperparameters also include variables which determine how the DNN is trained, such as batch size, number of epochs, etc. A batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backwards through the entire network. The number of epochs defines the number of times that the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN. An epoch may include one or more batches. The number of epochs may be 1, 5, 10, 50, 100, 500, 1000, or even larger.
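
The relationship between batch size, number of epochs, and the number of parameter updates can be made concrete with a short sketch. The function name `parameter_updates` is hypothetical, and the sketch assumes that a final, partially filled batch still triggers an update.

```python
import math

def parameter_updates(num_samples, batch_size, num_epochs):
    """Return the total number of parameter updates over all epochs.

    One update is performed per batch; the last batch of an epoch may be
    smaller than batch_size (an assumption of this sketch).
    """
    if not 1 <= batch_size <= num_samples:
        raise ValueError("batch_size must be between 1 and num_samples")
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * num_epochs

# 1000 samples in batches of 32 give 32 batches per epoch (the last one partial).
updates = parameter_updates(num_samples=1000, batch_size=32, num_epochs=10)
```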

[0167] The training module 1620 defines the architecture of the DNN, e.g., based on some of the hyperparameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of a DNN may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, SoftMax or logistic layers, and so on. The convolutional layers of the DNN abstract the input image to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels (e.g., red, green, blue images include 3 channels). A pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between two convolution layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images between different categories by training.

[0168] In the process of defining the architecture of the DNN, the training module 1620 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a ReLU activation function, a tangent activation function, or other types of activation functions.

[0169] After the training module 1620 defines the architecture of the DNN, the training module 1620 inputs a training dataset into the DNN. The training dataset includes a plurality of training samples. An example of a training sample includes an object in an image and a ground-truth label of the object. The training module 1620 modifies the parameters inside the DNN ("internal parameters of the DNN") to minimize the error between labels of the training objects that are generated by the DNN and the ground-truth labels of the objects. The internal parameters include weights of filters in the convolutional layers of the DNN. In some embodiments, the training module 1620 uses a cost function to minimize the error.

[0170] The training module 1620 may train the DNN for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the DNN. After the training module 1620 finishes the predetermined number of epochs, the training module 1620 may stop updating the parameters in the DNN. The DNN having the updated parameters is referred to as a trained DNN.

[0171] The compressing module 1630 compresses DNNs. For instance, the compressing module 1630 may add pruning operations to DNN layers to reduce computational complexity or memory usage. A pruning operation may prune weight tensors of a DNN layer by changing one or more nonzero valued weights of the layer to zeros. The modification may be done before, during, or after training. Weights may be pruned during training, during inference, or a combination of both. The compressing module 1630 may determine a sparsity ratio for a DNN layer. The sparsity ratio may be a ratio of the number of zero-valued weights to the total number of weights in the layer. The compressing module 1630 may perform the pruning operation until the sparsity ratio of the DNN layer meets a target sparsity ratio, such as 10%, 20%, 30%, 40%, 50%, and so on.
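As an illustration of the sparsity ratio and of pruning until a target sparsity ratio is met, the following minimal sketch operates on a flat list of weights. The function names are hypothetical, and the choice to zero the smallest-magnitude weights first is an assumption made for the example; the disclosure does not prescribe a particular selection order.

```python
def sparsity_ratio(weights):
    """Ratio of the number of zero-valued weights to the total number of weights."""
    return sum(1 for w in weights if w == 0) / len(weights)


def prune_to_target(weights, target):
    """Zero nonzero weights until the sparsity ratio meets the target.

    Assumption: weights with the smallest absolute values are pruned first.
    """
    pruned = list(weights)
    # Indices of nonzero weights, smallest magnitude first.
    order = sorted(
        (i for i, w in enumerate(pruned) if w != 0),
        key=lambda i: abs(pruned[i]),
    )
    for i in order:
        if sparsity_ratio(pruned) >= target:
            break
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    layer = [0.9, -0.05, 0.0, 0.4, -0.7, 0.01, 0.3, -0.2, 0.6, 0.0]
    pruned = prune_to_target(layer, target=0.5)
    print(sparsity_ratio(layer), sparsity_ratio(pruned))
    print(pruned)
```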

[0172] In some embodiments, the compressing module 1630 may select one or more layers in a DNN and modify each selected layer with a pruning operation. For instance, the compressing module 1630 may select computationally complex layers, such as layers with large filters. For a pruning operation of a layer or of a type of layer, the compressing module 1630 may determine a weight threshold that would not cause a loss of the accuracy of the DNN to exceed an accuracy loss constraint. A pruning operation may modify weights having absolute values below the weight threshold to zeros and leave the other weights unchanged. The weight pruning can reduce memory storage as zero-valued weights may not be stored. Also, the number of operations in the layer can be reduced as computations on zero-valued weights can be skipped without impacting the output of the layer. In some embodiments, the compressing module 1630 may also measure energy saving, final DNN accuracy, or layer-wise sparsity caused by pruning operations.

[0173] After compressing a DNN, the compressing module 1630 may fine-tune the DNN, e.g., through a retraining process. The compressing module 1630 may fine-tune DNNs after weights are pruned. In some embodiments, the fine-tuning process is a retraining or further training process. For instance, after weights in a DNN are pruned, the compressing module 1630 may further train the DNN by inputting a training dataset into the DNN. The values of the unpruned weights in the DNN may be modified based on outputs of the DNN and ground-truth labels of the training samples in the training dataset. In some embodiments, the values of the pruned weights (i.e., zero) are not changed during the fine-tuning process. For instance, the compressing module 1630 may place a mask over a pruned weight block and the mask can prevent values in the pruned weight blocks from being changed during the fine-tuning process. In other embodiments, the values of all weights, including the pruned weights, may be changed during the fine-tuning process. After one or more cycles of retraining and weight changing by the compressing module 1630, the compressing module 1630 may perform a new pruning process, e.g., by selecting weight blocks and pruning the selected weight blocks. In some embodiments, the weight pruning process may be repeated multiple times before the fine-tuning process is done. In some embodiments, the number of epochs in the fine-tuning process may be different from the number of epochs in the training process in which the pre-pruning values of the weights are determined. For instance, the fine-tuning process may have fewer epochs than the training process. In an example, the number of epochs in the fine-tuning process may be relatively small, such as 2, 3, 4, 5, and so on.
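The masking of pruned weights during fine-tuning may be sketched as follows. This is an illustrative example only: the function names, the plain gradient-descent update, and the learning rate are assumptions, and the mask is applied per weight rather than per weight block for brevity.

```python
def make_mask(weights):
    """Mask that is 0 for pruned (zero-valued) weights and 1 otherwise."""
    return [0.0 if w == 0 else 1.0 for w in weights]


def masked_update(weights, grads, mask, lr=0.1):
    """One gradient-descent step in which masked weights are left unchanged."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


if __name__ == "__main__":
    weights = [0.5, 0.0, -0.3, 0.0]   # two weights were pruned to zero
    grads = [1.0, 2.0, -1.0, 0.5]     # hypothetical gradients
    mask = make_mask(weights)         # computed once, right after pruning
    print(masked_update(weights, grads, mask))
```

Computing the mask once after pruning, rather than on every step, keeps an unpruned weight trainable even if its value happens to pass through zero during fine-tuning.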

[0174] The validating module 1640 verifies accuracy of trained or compressed DNNs. In some embodiments, the validating module 1640 inputs samples in a validation dataset into a trained DNN and uses the outputs of the DNN to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validating module 1640 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validating module 1640 may use the following metrics to determine the accuracy score: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where precision may be how many the DNN correctly predicted (TP or true positives) out of the total it predicted (TP + FP or false positives), and recall may be how many the DNN correctly predicted (TP) out of the total number of objects that did have the property in question (TP + FN or false negatives). The F-score (F-score = 2 * P * R / (P + R), where P is the precision and R is the recall) unifies precision and recall into a single measure.
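The accuracy metrics above can be computed directly from counts of true positives (TP), false positives (FP), and false negatives (FN). The following is a minimal sketch; the function names and the example counts are hypothetical, and returning zero for an empty denominator is an assumption made to keep the example total.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """F-score = 2 * P * R / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    tp, fp, fn = 80, 20, 10   # hypothetical validation counts
    p, r = precision(tp, fp), recall(tp, fn)
    print(p, r, f_score(p, r))
```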

[0175] The validating module 1640 may compare the accuracy score with a threshold score. In an example where the validating module 1640 determines that the accuracy score of the DNN is less than the threshold score, the validating module 1640 instructs the training module 1620 to re-train the DNN. In one embodiment, the training module 1620 may iteratively re-train the DNN until the occurrence of a stopping condition, such as the accuracy measurement indicating that the DNN may be sufficiently accurate, or a number of training rounds having taken place.

[0176] The compiler 1650 compiles information of DNNs to executable instructions that can be executed, e.g., by the DNN accelerator 402, to carry out neural network operations in DNNs. In some embodiments, the compiler 1650 may generate a graph representing a DNN. The graph may include nodes and edges. A node may represent a specific neural network operation in the DNN. An edge may connect two nodes and represent a connection between the two corresponding neural network operations. In an example, an edge may encode a tensor that flows from one of the neural network operations to the other neural network operation. The tensor may be an output tensor of the first neural network operation and an input tensor of the second neural network operation. The edge may encode one or more attributes of the tensor, such as size, shape, storage format, and so on. The compiler 1650 may use the graph to generate executable DNNs. For instance, the compiler may generate computer program instructions (e.g., compilation descriptors) for executing DNNs. The instructions may be stored in registers associated with components of the DNN accelerator 402.
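A graph of this kind may be represented, for illustration, with a few small data structures. The sketch below is not the compiler's actual representation; the class names, field names, operation names, and the example tensor shape are all hypothetical.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass(frozen=True)
class TensorAttrs:
    """Attributes of the tensor encoded by an edge (hypothetical fields)."""
    shape: tuple
    storage_format: str  # e.g., "Z-major"

    @property
    def size(self) -> int:
        # Number of data elements in the tensor.
        return prod(self.shape)


@dataclass(frozen=True)
class Edge:
    """Tensor flowing from the operation `src` to the operation `dst`."""
    src: str
    dst: str
    tensor: TensorAttrs


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node name -> operation type
    edges: list = field(default_factory=list)

    def add_node(self, name: str, op: str) -> None:
        self.nodes[name] = op

    def add_edge(self, src: str, dst: str, tensor: TensorAttrs) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append(Edge(src, dst, tensor))

    def inputs_of(self, name: str) -> list:
        """Edges whose tensors are inputs of the named operation."""
        return [e for e in self.edges if e.dst == name]


if __name__ == "__main__":
    g = Graph()
    g.add_node("conv1", "convolution")
    g.add_node("relu1", "activation")
    g.add_edge("conv1", "relu1", TensorAttrs((1, 64, 56, 56), "Z-major"))
    print(g.inputs_of("relu1"))
```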

[0177] The compiler 1650 may also generate configuration descriptors that specify operation modes of components in the DNN accelerator 402, such as one or more configuration descriptors specifying whether a DPU 430 operates in dense or sparse modes. In dense modes, the DPU 430 may execute a DNN layer without any sparsity acceleration. For instance, the DPU 430 may not skip computations of zero-valued data elements in dense modes. In sparse modes, the DPU 430 may accelerate the execution of a DNN layer based on sparsity by skipping computations of zero-valued data elements. The compiler 1650 may determine whether to accelerate the layer based on weight sparsity, activation sparsity, or both. The compiler 1650 may select the sparse mode for a layer from a group of sparse modes that includes, for example, combined sparse mode in which the layer is accelerated based on both weight sparsity and activation sparsity, activation sparse mode in which the layer is accelerated based on activation sparsity but not based on weight sparsity, weight sparse mode in which the layer is accelerated based on weight sparsity but not based on activation sparsity, and a dense mode in which the layer is not accelerated based on sparsity. In some embodiments (e.g., embodiments where a layer is executed by multiple DPUs 430), the compiler 1650 may determine the sparse mode for all the DPUs 430 that execute the layer.
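The selection among the dense mode and the three sparse modes may be sketched as a simple decision on the weight sparsity and activation sparsity of a layer. This is an assumption-laden illustration: the enumeration, the function select_mode, and the 0.3 threshold are hypothetical, and the disclosure does not state the criterion the compiler 1650 uses.

```python
from enum import Enum


class SparseMode(Enum):
    DENSE = "dense"
    WEIGHT = "weight sparse"
    ACTIVATION = "activation sparse"
    COMBINED = "combined sparse"


def select_mode(weight_sparsity: float, activation_sparsity: float,
                threshold: float = 0.3) -> SparseMode:
    """Pick a mode for a layer from its sparsity ratios.

    Assumption: a kind of sparsity is used for acceleration only when its
    ratio reaches the (hypothetical) threshold.
    """
    use_weights = weight_sparsity >= threshold
    use_activations = activation_sparsity >= threshold
    if use_weights and use_activations:
        return SparseMode.COMBINED
    if use_weights:
        return SparseMode.WEIGHT
    if use_activations:
        return SparseMode.ACTIVATION
    return SparseMode.DENSE


if __name__ == "__main__":
    print(select_mode(weight_sparsity=0.6, activation_sparsity=0.1))
```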

[0178] Additionally or alternatively, the compiler 1650 may determine whether a DPU 430 operates in Z-major mode, X-major mode, or Y-major mode for executing a DNN layer. The compiler 1650 may determine a tensor storage format for the DNN layer, which may be the format of the input tensor of the DNN layer, and select the corresponding mode as the operational mode of the DPU 430. In some embodiments, after selecting an operation mode of the DPU 430, the compiler 1650 may generate configuration descriptors that configure components of the DPU 430 to operate in the selected operational mode. The configuration descriptors include configuration descriptors that can configure data storages in the DNN accelerator 402. For instance, the compiler 1650 may generate configuration descriptors that configure a data storage in the IDU 450 or ODU 490 to adopt a new functionality when the DPU 430 operates in a certain mode that may be different from its original functionality. The new functionality may be the functionality of another data storage in the IDU 450 or ODU 490. In some embodiments, the compiler 1650 may generate the configuration descriptors for configuring data storages based on operational modes of the DPU 430, e.g., based on configuration descriptors that specify the operational modes. The configuration descriptors may be provided to the DPU 430, e.g., through the interface module 1610.

[0179] The compiler 1650 may also generate configuration parameters that facilitate data read or data write, such as a configuration parameter that indicates the number of data elements to be processed (e.g., the number of data elements in a tile), a configuration parameter that indicates the memory address where an input data element may be fetched, a configuration parameter that indicates the memory address where an output data element may be stored, a configuration parameter that indicates the memory address where other configuration parameters may be stored, and so on.

[0180] The datastore 1660 stores data received, generated, used, or otherwise associated with the DNN module 1600. For example, the datastore 1660 stores the datasets used by the training module 1620 and validating module 1640. The datastore 1660 may also store data generated by the training module 1620 and validating module 1640, such as the hyperparameters for training DNNs, internal parameters of trained DNNs (e.g., weights, etc.), data for sparsity acceleration (e.g., sparsity bitmap, etc.), and so on. The datastore 1660 may store configuration descriptors, configuration parameters, or other data generated by the compiler 1650. The datastore 1660 may include one or more memories. In the embodiment of FIG. 16, the datastore 1660 is a component of the DNN module 1600. In other embodiments, the datastore 1660 may be external to the DNN module 1600 and communicate with the DNN module 1600 through a network.

Example Computing Device

[0181] FIG. 17 is a block diagram of an example computing device 1700, in accordance with various embodiments. In some embodiments, the computing device 1700 can be used as at least part of the DNN system 400. A number of components are illustrated in FIG. 17 as included in the computing device 1700, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in the computing device 1700 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 1700 may not include one or more of the components illustrated in FIG. 17, but the computing device 1700 may include interface circuitry for coupling to the one or more components. For example, the computing device 1700 may not include a display device 1706, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1706 may be coupled. In another set of examples, the computing device 1700 may not include an audio input device 1718 or an audio output device 1708 but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1718 or audio output device 1708 may be coupled.

[0182] The computing device 1700 may include a processing device 1702 (e.g., one or more processing devices). The processing device 1702 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The computing device 1700 may include a memory 1704, which may itself include one or more memory devices such as volatile memory (e.g., DRAM), nonvolatile memory (e.g., read-only memory (ROM)), high bandwidth memory (HBM), flash memory, solid state memory, and/or a hard drive. In some embodiments, the memory 1704 may include memory that shares a die with the processing device 1702. In some embodiments, the memory 1704 includes one or more non-transitory computer-readable media storing instructions executable to perform operations performed by one or more components of the DNN system 400 (e.g., the DNN module 401, etc.). The instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1702.

[0183] In some embodiments, the computing device 1700 may include a communication chip 1712 (e.g., one or more communication chips). For example, the communication chip 1712 may be configured for managing wireless communications for the transfer of data to and from the computing device 1700. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.

[0184] The communication chip 1712 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 1712 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 1712 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 1712 may operate in accordance with Code-division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 1712 may operate in accordance with other wireless protocols in other embodiments. The computing device 1700 may include an antenna 1722 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions).

[0185] In some embodiments, the communication chip 1712 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet). As noted above, the communication chip 1712 may include multiple communication chips. For instance, a first communication chip 1712 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1712 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 1712 may be dedicated to wireless communications, and a second communication chip 1712 may be dedicated to wired communications.

[0186] The computing device 1700 may include battery/power circuitry 1714. The battery/power circuitry 1714 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1700 to an energy source separate from the computing device 1700 (e.g., AC line power).

[0187] The computing device 1700 may include a display device 1706 (or corresponding interface circuitry, as discussed above). The display device 1706 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.

[0188] The computing device 1700 may include an audio output device 1708 (or corresponding interface circuitry, as discussed above). The audio output device 1708 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.

[0189] The computing device 1700 may include an audio input device 1718 (or corresponding interface circuitry, as discussed above). The audio input device 1718 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).

[0190] The computing device 1700 may include a GPS device 1716 (or corresponding interface circuitry, as discussed above). The GPS device 1716 may be in communication with a satellite-based system and may receive a location of the computing device 1700, as known in the art.

[0191] The computing device 1700 may include another output device 1710 (or corresponding interface circuitry, as discussed above). Examples of the other output device 1710 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.

[0192] The computing device 1700 may include another input device 1720 (or corresponding interface circuitry, as discussed above). Examples of the other input device 1720 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.

[0193] The computing device 1700 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA), an ultramobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 1700 may be any other electronic device that processes data.

Select Examples

[0194] The following paragraphs provide various examples of the embodiments disclosed herein.

[0195] Example 1 provides an apparatus, including a write module configured to write data into a memory, in which the data is at least part of output data of a neural network layer that is computed by a compute unit performing one or more computations in the neural network layer; and a data storage unit configured to: store, in a first operational mode of the apparatus, sparsity data indicating sparsity in the output data, and store, in a second operational mode of the apparatus, at least part of the data, in which the second operational mode is different from the first operational mode.

[0196] Example 2 provides the apparatus of example 1, in which the data includes all data elements in the output data in the first operational mode, and the data is generated by removing one or more data elements from the output data in the second operational mode.

[0197] Example 3 provides the apparatus of example 2, in which one or more computations performed by the compute unit for a next neural network layer are accelerated by skipping at least part of the one or more computations based on the sparsity data in the first operational mode.

[0198] Example 4 provides the apparatus of example 2 or 3, further including a compression module configured, in the second operational mode, to: remove the one or more data elements from the output data to generate the data; and generate the sparsity data.

[0199] Example 5 provides the apparatus of any one of examples 1-4, in which the apparatus further includes an intermediate storage unit configured to change an order of data elements in the data before the data is written into the memory.

[0200] Example 6 provides the apparatus of example 5, in which the data storage unit, in the second operational mode of the apparatus, is configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; and after storing the data elements in the first order, transmit the data elements to the intermediate storage unit in the first order, in which the intermediate storage unit is configured to change the first order to a second order that is different from the first order.

[0201] Example 7 provides the apparatus of example 5, in which the data storage unit, in the second operational mode of the apparatus, is further configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; after storing the data elements in the first order, change the first order of the data elements to a second order; and transmit the data elements to the intermediate storage unit in the second order, in which the intermediate storage unit is configured to change the second order to a third order that is different from the first order and from the second order.

[0202] Example 8 provides the apparatus of any one of examples 5-7, in which the intermediate storage unit includes a transposable register file and is configured to change the order of the data elements by: writing the data elements into the transposable register file in one of a column-wise manner and a row-wise manner, the transposable register file including storage elements arranged in columns and rows; and reading the data elements from the transposable register file in another one of the column-wise manner and the row-wise manner.

[0203] Example 9 provides the apparatus of any one of examples 1-8, in which the write module includes a WCB configured to: receive data transfer transactions for writing the data into the memory; combine, in a first operational mode of the write module, the data transfer transactions to generate a combined data transfer transaction and write the data into the memory by performing the combined data transfer transaction; and store, in a second operational mode of the write module, the data transfer transactions and write the data into the memory by performing the data transfer transactions in an order in which the data transfer transactions are received.

[0204] Example 10 provides the apparatus of any one of examples 1-9, in which the sparsity data includes a sequence of sparsity data elements, a sparsity data element corresponding to a data element in the output data and indicating whether the data element is zero.

[0205] Example 11 provides an apparatus, including a read module configured to read data from a memory, in which the data is at least part of an input tensor of a neural network layer and is to be used to perform one or more computations in the neural network layer by a compute unit; and a data storage unit configured to: store, in a first operational mode of the apparatus, sparsity data to be used by the compute unit to accelerate the one or more computations, and store, in a second operational mode of the apparatus, one or more data elements in the data before the one or more data elements are provided to the compute unit, in which the second operational mode is different from the first operational mode.

[0206] Example 12 provides the apparatus of example 11, in which the data storage unit, in the second operational mode of the apparatus, is further configured to: receive the one or more data elements by receiving, in an order, responses to one or more data read requests made by the read module; and transmit out the one or more data elements in the order.

[0207] Example 13 provides the apparatus of example 11 or 12, in which all data elements in the data are nonzero data elements in the first operational mode, and a data element in the data is zero in the second operational mode.

[0208] Example 14 provides the apparatus of any one of examples 11-13, in which the sparsity data includes a sequence of sparsity data elements, a sparsity data element corresponding to a data element in the input tensor and indicating whether the data element is zero.

[0209] Example 15 provides an apparatus, including a compute unit configured to perform one or more computations in a neural network layer, the neural network layer having an input and output; and a data delivery unit configured to transfer data in the input or output of the neural network layer between a memory and the compute unit, the data delivery unit including a data storage unit, the data storage unit configured to: store, in a first operational mode of the apparatus, sparsity data indicating sparsity in at least part of the input or output of the neural network layer, and store, in a second operational mode of the apparatus, at least part of the data, in which the second operational mode is different from the first operational mode.

[0210] Example 16 provides the apparatus of example 15, in which the first operational mode is a sparse mode in which the one or more computations performed by the compute unit are accelerated by skipping at least part of the one or more computations based on the sparsity data.

[0211] Example 17 provides the apparatus of example 15 or 16, in which the data delivery unit is configured to transfer the data from the memory to the compute unit, the data is at least part of the input of the neural network layer, and the apparatus further includes an additional data delivery unit configured to transfer at least part of the output of the neural network layer from the compute unit to the memory.

[0212] Example 18 provides the apparatus of example 17, in which the additional data delivery unit includes an additional data storage unit configured to: store, in the first operational mode of the apparatus, sparsity data indicating sparsity in at least part of the output of the neural network layer; and store, in the second operational mode of the apparatus, at least part of the output of the neural network layer.

[0213] Example 19 provides the apparatus of example 15 or 16, in which the data delivery unit is configured to transfer the data from the compute unit to the memory, the data is at least part of the output of the neural network layer, and the apparatus further includes an additional data delivery unit configured to transfer at least part of the input of the neural network layer from the memory to the compute unit.

[0214] Example 20 provides the apparatus of example 19, in which the additional data delivery unit includes an additional data storage unit configured to: store, in the first operational mode of the apparatus, sparsity data indicating sparsity in at least part of the input of the neural network layer; and store, in the second operational mode of the apparatus, at least part of the input of the neural network layer.

[0215] Example 21 provides the apparatus of example 19 or 20, in which the data delivery unit further includes another storage unit configured to, in the first operational mode and second operational mode of the apparatus, change an order of data elements in the output of the neural network layer before the data elements are written into the memory.

[0216] Example 22 provides the apparatus of example 21, in which the data storage unit, in the second operational mode of the apparatus, is configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; and after storing the data elements in the first order, transmit the data elements to the another storage unit in the first order, in which the another storage unit is configured to change the first order to a second order that is different from the first order.

[0217] Example 23 provides the apparatus of example 21, in which the data storage unit, in the second operational mode of the apparatus, is further configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; after storing the data elements in the first order, change the first order of the data elements to a second order; and transmit the data elements to the another storage unit in the second order, in which the another storage unit is configured to change the second order to a third order that is different from the first order and from the second order.

[0218] Example 24 provides the apparatus of any one of examples 19-23, in which the data delivery unit further includes a WCB configured to: receive data transfer transactions for writing the data into the memory; combine, in a first operational mode of the WCB, the data transfer transactions to generate a combined data transfer transaction and write the data into the memory by performing the combined data transfer transaction; and store, in a second operational mode of the WCB, the data transfer transactions and write the data into the memory by performing the data transfer transactions in an order in which the data transfer transactions are received.
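
The two WCB modes of Example 24 may be illustrated by the following minimal Python sketch. It is not the disclosed hardware: the names `WriteTxn` and `WriteCombineBuffer`, the byte-addressed `bytearray` standing in for the memory, and the rule that only address-contiguous neighboring transactions are merged are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WriteTxn:
    addr: int    # starting byte address (hypothetical addressing scheme)
    data: bytes  # payload to be written


@dataclass
class WriteCombineBuffer:
    combine: bool                      # True: combining mode; False: in-order mode
    pending: list = field(default_factory=list)

    def receive(self, txn: WriteTxn) -> None:
        """Buffer an incoming data transfer transaction."""
        self.pending.append(txn)

    def flush(self, memory: bytearray) -> list:
        """Write buffered data into memory; return the transactions performed."""
        txns = self._combined() if self.combine else list(self.pending)
        for t in txns:
            memory[t.addr:t.addr + len(t.data)] = t.data
        self.pending.clear()
        return txns

    def _combined(self) -> list:
        """Merge neighboring transactions whose address ranges are contiguous."""
        merged = []
        for t in self.pending:
            if merged and merged[-1].addr + len(merged[-1].data) == t.addr:
                last = merged[-1]
                merged[-1] = WriteTxn(last.addr, last.data + t.data)
            else:
                merged.append(WriteTxn(t.addr, t.data))
        return merged


def run(combine: bool):
    """Feed the same three transactions through the buffer in the chosen mode."""
    memory = bytearray(8)
    wcb = WriteCombineBuffer(combine=combine)
    wcb.receive(WriteTxn(0, b"\x01\x02"))
    wcb.receive(WriteTxn(2, b"\x03\x04"))
    wcb.receive(WriteTxn(6, b"\x09"))
    performed = wcb.flush(memory)
    return memory, performed
```

In this sketch both modes leave the memory with the same contents; the modes differ only in how many transactions reach the memory and in whether the received order is preserved one-for-one.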

[0219] Example 25 provides the apparatus of any one of examples 15-24, in which the apparatus further includes the memory.

[0220] The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.

Claims

1. An apparatus, comprising: a write module configured to write data into a memory, wherein the data is at least part of output data of a neural network layer that is computed by a compute unit performing one or more computations in the neural network layer; and a data storage unit configured to: store, in a first operational mode of the apparatus, sparsity data indicating sparsity in the output data, and store, in a second operational mode of the apparatus, at least part of the data, wherein the second operational mode is different from the first operational mode.

2. The apparatus of claim 1, wherein the data comprises all data elements in the output data in the first operational mode, and the data is generated by removing one or more data elements from the output data in the second operational mode.

3. The apparatus of claim 2, wherein one or more computations performed by the compute unit for a next neural network layer are accelerated by skipping at least part of the one or more computations based on the sparsity data in the first operational mode.

4. The apparatus of claim 2 or 3, further comprising a compression module configured, in the second operational mode, to remove the one or more data elements from the output data to generate the data; and generate the sparsity data based on the output data.

5. The apparatus of any one of claims 1-4, wherein the apparatus further comprises an intermediate storage unit configured to change an order of data elements in the data before the data is written into the memory.

6. The apparatus of claim 5, wherein the data storage unit, in the second operational mode of the apparatus, is configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; and after storing the data elements in the first order, transmit the data elements to the intermediate storage unit in the first order, wherein the intermediate storage unit is configured to change the first order to a second order that is different from the first order.

7. The apparatus of claim 5, wherein the data storage unit, in the second operational mode of the apparatus, is further configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; after storing the data elements in the first order, change the first order of the data elements to a second order; and transmit the data elements to the intermediate storage unit in the second order, wherein the intermediate storage unit is configured to change the second order to a third order that is different from the first order and from the second order.

8. The apparatus of any one of claims 5-7, wherein the intermediate storage unit comprises a transposable register file and is configured to change the order of the data elements by: writing the data elements into the transposable register file in one of a column-wise manner and a row-wise manner, the transposable register file comprising storage elements arranged in columns and rows; and reading the data elements from the transposable register file in another one of the column-wise manner and the row-wise manner.
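
By way of illustration only, the reordering recited in claim 8 may be sketched in Python as follows. The class name `TransposableRegisterFile`, its method names, and the 2 x 3 size used in the usage note are hypothetical; the sketch models only the write-row-wise, read-column-wise case.

```python
class TransposableRegisterFile:
    """Storage elements arranged in rows and columns (illustrative model)."""

    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols
        self.cells = [[None] * cols for _ in range(rows)]

    def write_row_wise(self, elements):
        """Fill the file one row at a time, in the order elements arrive."""
        if len(elements) != self.rows * self.cols:
            raise ValueError("element count must match the register file size")
        for i, value in enumerate(elements):
            self.cells[i // self.cols][i % self.cols] = value

    def read_column_wise(self):
        """Drain the file one column at a time, which transposes the order."""
        return [self.cells[r][c]
                for c in range(self.cols)
                for r in range(self.rows)]


trf = TransposableRegisterFile(rows=2, cols=3)
trf.write_row_wise([0, 1, 2, 3, 4, 5])
reordered = trf.read_column_wise()  # [0, 3, 1, 4, 2, 5]
```

Writing column-wise and reading row-wise would be the mirror image of this sketch.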

9. The apparatus of any one of claims 1-8, wherein the write module comprises a write combine buffer configured to: receive data transfer transactions for writing the data into the memory; combine, in a first operational mode of the write module, the data transfer transactions to generate a combined data transfer transaction and write the data into the memory by performing the combined data transfer transaction; and store, in a second operational mode of the write module, the data transfer transactions and write the data into the memory by performing the data transfer transactions in an order in which the data transfer transactions are received.

10. The apparatus of any one of claims 1-9, wherein the sparsity data comprises a sequence of sparsity data elements, a sparsity data element corresponding to a data element in the output data and indicating whether the data element is zero.
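
The relationship between the output data, the data with zero-valued elements removed, and the sequence of sparsity data elements may be illustrated by the following Python sketch. The function names `compress` and `decompress` are hypothetical, and the bit polarity (1 marks a nonzero data element, 0 marks a zero) is an assumption; the claims require only that each sparsity data element indicate whether its data element is zero.

```python
def compress(output):
    """Return (packed nonzero elements, sparsity bitmap) for a flat sequence."""
    # Assumed polarity: 1 means the corresponding data element is nonzero.
    bitmap = [0 if x == 0 else 1 for x in output]
    packed = [x for x in output if x != 0]
    return packed, bitmap


def decompress(packed, bitmap):
    """Rebuild the original sequence from the packed data and the bitmap."""
    it = iter(packed)
    return [next(it) if bit else 0 for bit in bitmap]


packed, bitmap = compress([0, 5, 0, 0, 7, 2])
# packed == [5, 7, 2]; bitmap == [0, 1, 0, 0, 1, 1]
restored = decompress(packed, bitmap)
```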

11. An apparatus, comprising: a read module configured to read data from a memory, wherein the data is at least part of an input tensor of a neural network layer and is to be used to perform one or more computations in the neural network layer by a compute unit; and a data storage unit configured to: store, in a first operational mode of the apparatus, sparsity data to be used by the compute unit to accelerate the one or more computations, and store, in a second operational mode of the apparatus, one or more data elements in the data before the one or more data elements are provided to the compute unit, wherein the second operational mode is different from the first operational mode.

12. The apparatus of claim 11, wherein the data storage unit, in the second operational mode of the apparatus, is further configured to: receive the one or more data elements by receiving, in an order, responses to one or more data read requests made by the read module; and transmit out the one or more data elements in the order.
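
The dual role of the data storage unit in claims 11 and 12 may be sketched in Python as follows. This is an illustrative model under stated assumptions: the names `Mode` and `ConfigurableStorage`, the `kind` tag, and the fixed `capacity` are hypothetical and do not come from the disclosure.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    SPARSE = 1  # first operational mode: the unit holds sparsity data
    DENSE = 2   # second operational mode: the unit holds data elements


class ConfigurableStorage:
    """One physical buffer whose contents depend on the operational mode."""

    def __init__(self, mode: Mode, capacity: int):
        self.mode = mode
        self.capacity = capacity
        self.entries = deque()

    def push(self, kind: str, payload) -> None:
        """Accept only the kind of payload that matches the current mode."""
        expected = "sparsity" if self.mode is Mode.SPARSE else "data"
        if kind != expected:
            raise ValueError(f"{kind!r} not accepted in {self.mode.name} mode")
        if len(self.entries) >= self.capacity:
            raise OverflowError("storage full")
        self.entries.append(payload)

    def pop(self):
        """Transmit entries out in the same order in which they were received."""
        return self.entries.popleft()


fifo = ConfigurableStorage(Mode.DENSE, capacity=4)
for response in ("r0", "r1", "r2"):  # read responses arriving in order
    fifo.push("data", response)
drained = [fifo.pop() for _ in range(3)]  # ['r0', 'r1', 'r2']
```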

13. The apparatus of claim 11 or 12, wherein all data elements in the data are nonzero data elements in the first operational mode, and a data element in the data is zero in the second operational mode.

14. The apparatus of any one of claims 11-13, wherein the sparsity data comprises a sequence of sparsity data elements, a sparsity data element corresponding to a data element in the input tensor and indicating whether the data element is zero.

15. An apparatus, comprising: a compute unit configured to perform one or more computations in a neural network layer, the neural network layer having an input and output; and a data delivery unit configured to transfer data in the input or output of the neural network layer between a memory and the compute unit, the data delivery unit comprising a data storage unit, the data storage unit configured to: store, in a first operational mode of the apparatus, sparsity data indicating sparsity in at least part of the input or output of the neural network layer, and store, in a second operational mode of the apparatus, at least part of the data, wherein the second operational mode is different from the first operational mode.

16. The apparatus of claim 15, wherein the first operational mode is a sparse mode in which the one or more computations performed by the compute unit are accelerated by skipping at least part of the one or more computations based on the sparsity data.
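
Skipping computations based on sparsity data, as in the sparse mode of claim 16, may be illustrated by the following Python sketch of a dot product. The function name `sparse_dot`, the packed activation format, and the bit polarity (1 marks a nonzero activation) are assumptions for illustration and do not describe the compute unit's actual datapath.

```python
def sparse_dot(packed_acts, act_bitmap, weights):
    """Dot product that skips multiply-accumulates for zero activations.

    packed_acts holds only the nonzero activations; act_bitmap has one bit
    per original activation position (assumed: 1 = nonzero, 0 = zero).
    Returns (result, number of multiply-accumulates actually performed).
    """
    acc = 0
    macs = 0
    nonzero = iter(packed_acts)
    for bit, w in zip(act_bitmap, weights):
        if not bit:
            continue  # activation is zero: the product is zero, so skip it
        acc += next(nonzero) * w
        macs += 1
    return acc, macs


result, macs = sparse_dot([5, 7, 2], [0, 1, 0, 0, 1, 1], [1, 2, 3, 4, 5, 6])
# Three of six multiply-accumulates are performed; result is 57.
```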

17. The apparatus of claim 15 or 16, wherein the data delivery unit is configured to transfer the data from the memory to the compute unit, the data is at least part of the input of the neural network layer, and the apparatus further comprises an additional data delivery unit configured to transfer at least part of the output of the neural network layer from the compute unit to the memory.

18. The apparatus of claim 17, wherein the additional data delivery unit comprises an additional data storage unit configured to: store, in the first operational mode of the apparatus, sparsity data indicating sparsity in at least part of the output of the neural network layer; and store, in the second operational mode of the apparatus, at least part of the output of the neural network layer.

19. The apparatus of claim 15 or 16, wherein the data delivery unit is configured to transfer the data from the compute unit to the memory, the data is at least part of the output of the neural network layer, and the apparatus further comprises an additional data delivery unit configured to transfer at least part of the input of the neural network layer from the memory to the compute unit.

20. The apparatus of claim 19, wherein the additional data delivery unit comprises an additional data storage unit configured to: store, in the first operational mode of the apparatus, sparsity data indicating sparsity in at least part of the input of the neural network layer; and store, in the second operational mode of the apparatus, at least part of the input of the neural network layer.

21. The apparatus of claim 19 or 20, wherein the data delivery unit further comprises another storage unit configured to, in the first operational mode and the second operational mode of the apparatus, change an order of data elements in the output of the neural network layer before the data elements are written into the memory.

22. The apparatus of claim 21, wherein the data storage unit, in the second operational mode of the apparatus, is configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; and after storing the data elements in the first order, transmit the data elements to the another storage unit in the first order, wherein the another storage unit is configured to change the first order to a second order that is different from the first order.

23. The apparatus of claim 21, wherein the data storage unit, in the second operational mode of the apparatus, is further configured to: store the data elements in a first order in which the data elements are received by the apparatus from the compute unit; after storing the data elements in the first order, change the first order of the data elements to a second order; and transmit the data elements to the another storage unit in the second order, wherein the another storage unit is configured to change the second order to a third order that is different from the first order and from the second order.

24. The apparatus of any one of claims 19-23, wherein the data delivery unit further comprises a write combine buffer configured to: receive data transfer transactions for writing the data into the memory; combine, in a first operational mode of the write combine buffer, the data transfer transactions to generate a combined data transfer transaction and write the data into the memory by performing the combined data transfer transaction; and store, in a second operational mode of the write combine buffer, the data transfer transactions and write the data into the memory by performing the data transfer transactions in an order in which the data transfer transactions are received.

25. The apparatus of any one of claims 15-24, wherein the apparatus further comprises the memory.