WO2021054990A1 - Systems and methods for sparse code generation for convolutional neural networks
- Publication number: WO2021054990A1 (PCT application PCT/US2019/063678)
- Authority: WIPO (PCT)
- Prior art keywords: code, tensor, convolutional, FMA, kernel
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06F12/0897—Caches characterised by their organisation or structure with two or more cache hierarchy levels
- G06N3/045—Combinations of networks
Definitions
- the invention relates generally to compiling and executing computer code for operating neural networks, and specifically to code for convolutional neural networks (CNNs).
- neural networks (NNs) or connectionist systems are computing systems inspired by biological computing systems, but operating using manufactured digital computing technology.
- NNs are made up of computing units typically called neurons (which are artificial neurons, as opposed to biological neurons) communicating with each other via connections, links or edges.
- the signal at the link between artificial neurons may be, for example, a real number, and the output of each neuron may be computed by a function of the (typically weighted) sum of its inputs, such as the ReLU rectifier function.
- NN links or edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection.
- NN neurons are divided or arranged into layers, where different layers may perform different kinds of transformations on their inputs and may have different patterns of connections with other layers.
- a higher or upper layer, or a layer “above” another layer is a layer more towards the input layer, and a lower layer is a layer towards the output layer.
- Such systems may learn to perform tasks by considering example input data, generally without being programmed with any task-specific rules, being presented with the correct output for the data, and self-correcting.
- the NN may execute a forward-backward pass where in the forward pass the NN is presented with an input and produces an output, and in the backward pass (backpropagation) the NN is presented with the correct output, generates an error (e.g., a “loss”), and generates update gradients which are used to alter the weights at the links or edges.
- a convolutional neural network is a deep, feed-forward network, which includes for example one or more convolutional layers, fully connected layers, and pooling layers. CNNs are particularly useful for visual and speech applications.
- Other NNs include for example long short-term memory (LSTM) networks.
- a NN may be modelled as an abstract mathematical object, such as a function. Thus the NN may be “virtual” and no actual physical neurons, links, etc. may exist, these existing rather as data executed by processors.
- a NN may be translated physically to a CPU (e.g. a traditionally architected computer such as a PC) or to graphics processing units (GPUs) or specialized processors, where entries in the matrix or tensor represent neurons and/or links (e.g. artificial neurons connected by edges or links) or other NN parameters, and functions carried out on the tensor or matrix represent functions of the NN.
- GPUs and similar massively parallel hardware devices may be used to provide the large amounts of compute typically needed to train and/or perform inference (e.g. operate or execute at run-time) in NNs.
- GPUs can have thousands of relatively weak compute cores, small caches, but high memory bandwidth.
- Input data to a CNN layer may have a number of channels, e.g. RGB (red, green, blue) or more abstract channels such as the output of filters.
- the output of a CNN layer may be features.
- a CNN may iterate a patch or tile smaller than the size of the input data across the input data to produce outputs, performing mathematical convolutional operations using the input data and kernel data describing a filter.
- a CNN layer may include filters which are applied repeatedly to patches or tiles of input, and which may be represented as “kernels”, or tensors.
- a set of data (e.g. input data to the NN, or intermediate data produced by a NN layer) is multiplied by numbers that are determined by the NN structure, links and weights, and filter or kernel structure of a NN.
- in the context of a convolutional NN layer, a filter may be an operation performed on a subset of input data (e.g. a tile), and a kernel may be the set of fixed data multiplied against input data to execute a NN.
- the neural network weights and inputs may be represented as tensors or matrices, and the computation of the network (e.g. the inference or run time operation) may include a sequence of convolutional calculations of these tensors or matrices.
- the efficiency of NN convolutional calculation algorithms is key to the performance of NNs.
- Properties of the data structures such as matrices representing the convolutional layers can enable faster convolutional algorithms.
- One such property is sparsity - a matrix, tensor or data structure is said to be sparse if it contains a lot of entries that are zero (0).
- the weights of the neural network can be made sparse using a technique called pruning.
- Among the many parameters in a neural network, a fair fraction are redundant and do not contribute much to the network’s output.
- One may, for example, rank the neurons in the network according to how much they contribute, and then remove the low ranking neurons from the network by setting their matrix or tensor entries to 0. If the process of pruning is done properly, the resulting sparse network, the one where some of the weights are zero, can actually have the same or improved accuracy relative to the original network.
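As a minimal illustration of magnitude-based pruning (a sketch only; the threshold rule and flat weight layout are assumptions, not a method prescribed by this description):

```cpp
#include <cmath>
#include <vector>

// Zero out every weight whose magnitude falls below a threshold, producing
// the kind of sparse kernel that later drives sparse code generation.
void prune_by_magnitude(std::vector<float>& weights, float threshold) {
    for (float& w : weights) {
        if (std::fabs(w) < threshold) {
            w = 0.0f;  // pruned entry: no code will be emitted for it later
        }
    }
}
```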
- a system and method may compile or generate code to be used when executing neural networks (NNs), for example convolutional neural networks (CNNs) which may include one or more convolutional layers.
- for each non-zero element of a sparse kernel, a fused multiply-add (FMA) instruction may be generated: the portion of the output for the convolutional layer may be represented as an FMA input or parameter as a register (in some embodiments two parameters may be used to represent this: the previously accumulated values and the output); the input data for the convolutional layer may be represented as an FMA input or parameter as a register; and the non-zero element may be represented as an FMA input or parameter as a register or a reference to a memory location.
- the non-zero kernel element may be represented as an FMA input or parameter as a broadcast instruction taking kernel data from, for example, an external memory location.
- the non-zero kernel element may be represented as an FMA input or parameter as a register which is filled by a vector broadcast instruction. For each non-zero element, a vector broadcast instruction to a register may be generated.
- an FMA or any other suitable processing instruction (e.g. add and multiply, or multiply-accumulate), or set of instructions, may be generated for each non-zero element of the sparse kernel, filter or tensor.
- the software or code produced may be executed during convolutional operations, for example as part of a larger application such as a NN inference application.
- NN may execute the multiplication if the data entry is non-zero, and do nothing if it is zero.
- the data or kernel size may also be reduced, as in some embodiments zero elements of a kernel may not be stored associated with code. While code size may be increased as compared with traditional convolutional operation code, when the number of zero elements is large, the increase in instructions may be greatly offset by the decrease in compute time, and the decrease in stored data size.
- embodiments of the present invention include a sparse convolution process to automatically generate ahead of the computation or inference, once the sparse kernels are known, the multiplication code for the computations of only the non-zero entries in the sparse kernels, and not generate any code for the zero entries.
- Embodiments may detect non-zero elements, store only those non-zero elements in the .data section of the compiled code, and encode offsets to .data into the compiled instructions.
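As a hedged sketch of this idea (hypothetical names and layout, not the patent's actual data section), a compile-time pass might collect the surviving kernel values into a constant pool that plays the role of .data, recording the offsets that would later be encoded into the emitted instructions:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical compile-time helper: keep only the non-zero kernel values in a
// constant pool (standing in for the .data section) and remember each value's
// byte offset, which would be encoded into the corresponding FMA instruction.
struct SparseKernelPool {
    std::vector<float>       values;   // non-zero constants only
    std::vector<std::size_t> offsets;  // byte offset of each constant in .data
};

SparseKernelPool build_pool(const float* kernel, std::size_t n) {
    SparseKernelPool pool;
    for (std::size_t i = 0; i < n; ++i) {
        if (kernel[i] == 0.0f) continue;                    // zero: leaves no trace
        pool.offsets.push_back(pool.values.size() * sizeof(float));
        pool.values.push_back(kernel[i]);                   // stored exactly once
    }
    return pool;
}
```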
- convolutional kernels may include a spatial component such that each output value depends on input values which are spatially nearby each other, typically in a small square (for example, 3x3).
- Embodiments may compute sparse-dense convolution when the zero and non-zero entries are stable and pre-determined, and where the same sparse kernels are repeatedly used (e.g. during inference across large sets of different input), without using (during inference) a branch instruction to detect zero or non-zero kernel elements, and without using indexing.
- only the code for the non-zero entries is generated, and no code is executed at inference time for the pre-determined zero entries.
- the resulting code may be “sparse” in that only non-zero kernel entries result in code instructions being generated.
- the same sparse kernels may be repeatedly used, by generating the code executed for the non-zero entries and no code for the zero entries, and then reusing this code many times.
- the sparsity (e.g. the zero entries) may be a result of a pruning process of the neural network.
- the instruction representation may be shortened relative to standard instruction length in order to increase arithmetic intensity. If the resulting code is executed on a CPU it may utilize its registers and caches by storing and executing only the instructions corresponding to non-zero entries.
- Such methods may be used with a "pyramid” method of inference on a NN.
- Fig. 1A is a block diagram of a neural network according to an embodiment of the present invention.
- Fig. 1B is a block diagram of a neural network according to an embodiment of the present invention.
- FIG. 2 is a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention.
- FIG. 3 is a high-level block diagram of an exemplary computing system which may be used with embodiments of the present invention.
- Fig. 4 is a flowchart of a method according to embodiments of the present invention.
- Fig. 5 is a simplified schematic diagram of a CNN having a number of sub-computations (e.g., tasks) spanning more than one layer of the CNN, according to some embodiments of the invention.
- Embodiments of the present invention may automatically create code or software for use by a NN execution computer for efficiently executing convolutional calculations, for example in NN inference, after for example pruning data structures such as the tensors or matrices of NNs.
- in a convolutional calculation, one multi-dimensional tensor A is iterated over windows of input B to produce an output C.
- Each of A, B and C may be tensors (although where tensors are operated on, such operations may in some embodiments be considered to take place over matrices).
- Tensors may be considered to be multi-dimensional data structures; for example the special case of a matrix can be considered to be a two-dimensional tensor.
- Tensors may be stored as for example a list of numbers organized by one index per dimension. Dimensions may be given in "major order" (e.g. the first dimension changes the slowest and the last dimension changes the fastest).
- the channel dimension is treated herein as the last dimension since in some calculations the channel changes the fastest; in some notations, channels may be the first dimension rather than the last.
- while data is described herein as being organized as tensors, other forms of data input and output of a convolutional calculation may be used, such as matrices.
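For example, under the row-major ("major order") convention just described, a three-dimensional tensor with assumed dimensions C x H x W can be stored flat with the last dimension changing fastest:

```cpp
// Flat index of element (c, h, w) in a C x H x W tensor stored in major order:
// the first dimension changes the slowest, the last dimension the fastest.
inline int flat_index(int c, int h, int w, int H, int W) {
    return (c * H + h) * W + w;
}
```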
- Embodiments of the invention may automatically generate ahead of the computation, once sparse convolutional data is known, the multiplication code only for the computations of the non-zero entries in the convolutional data, and not generate any code for the zero entries.
- Producing this code from a compiler may include analyzing (e.g. by a processor) the convolutional data ahead of the computation and properly laying out the sequences of registers and FMA instructions that need to be executed.
- a process may generate machine code directed to a specific architecture. Code, instructions or software may be generated or emitted, which then may be used when executing a CNN. For each non-zero element in a kernel tensor or matrix part of the convolutional layer, instructions may be generated or issued.
- a series of nested loops may fix output entries (e.g. by having the outer loops iterate over entries of an output tensor having dimensions OC x OH x OW), and iterate over inputs (e.g. by having inner loops iterate over some or all dimensions of a kernel having dimensions IC (input channel) x OC (output channel) x KH (kernel height) x KW (kernel width)) to calculate each output entry.
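For reference, the loop nest just described can be written out as a plain dense convolution (a sketch assuming stride 1 and no padding, with the IC x OC x KH x KW kernel and IC x IH x IW input layouts used elsewhere in this description); the sparse code generator effectively replaces the inner kernel loops with pre-generated straight-line instructions for the non-zero entries only:

```cpp
// Dense reference convolution: outer loops fix each output entry of the
// OC x OH x OW tensor, inner loops walk the kernel dimensions to compute it.
// All tensors are assumed flattened in row-major ("major") order.
void dense_conv(const float* A,   // kernel, IC x OC x KH x KW
                const float* B,   // input,  IC x IH x IW
                float* C,         // output, OC x OH x OW
                int IC, int OC, int KH, int KW, int IH, int IW) {
    const int OH = IH - KH + 1, OW = IW - KW + 1;
    for (int oc = 0; oc < OC; ++oc)
      for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow) {
            float acc = 0.0f;
            for (int ic = 0; ic < IC; ++ic)
              for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw)
                    acc += A[((ic * OC + oc) * KH + kh) * KW + kw]
                         * B[(ic * IH + (oh + kh)) * IW + (ow + kw)];
            C[(oc * OH + oh) * OW + ow] = acc;
        }
}
```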
- the input may be for example an input tensor representing a feature map or input to a NN.
- a series of discrete values of the tensor may be loaded to the register.
- a vector broadcast instruction may place the same value at multiple places in a register. Such a broadcast instruction may be integrated with an FMA in some target architectures, and thus a broadcast instruction separate from an FMA need not be generated.
- an FMA instruction, or other multiply-accumulate or vector instruction, may be generated having as parameters the output (and possibly previously accumulated output), the input data, and the non-zero kernel element.
- a store instruction saving the output register may be generated (thus, during execution or inference, after a series of instructions generated by an inner loop completes, a store instruction may execute).
- the software or code produced may be executed during convolutional operations, for example as part of a larger application such as a NN inference application.
- the emitted or produced instructions may be strung together to produce a code segment or block for the convolutional calculation of the layer using the specific sparse kernel, having within or associated with the code the non-zero kernel information, the code referencing and accessing a memory block holding the input to the layer or NN.
- a method may iterate across a portion of a kernel tensor to access each element of the kernel tensor, and if an accessed element is not zero generate a multiply-accumulate instruction taking as input at least the accessed element.
- Performing compute using registers (e.g. quickly accessible locations available to a CPU or core) typically speeds processing. Registers are typically small storage internal to a processor storing one or more words of data.
- Code may be designed for a specific target architecture, e.g. a specific type or model of processor or CPU on which the produced code is to execute. Parameters may be determined or calculated affecting the production of the code, e.g. describing the dimensions of a produced output tensor. Such parameters may depend on characteristics of a target architecture, e.g. on the number of registers in a target architecture executing the computer instruction code. In one embodiment, for each row of values which are outputs of the convolution with a fixed spatial coordinate (e.g. an output coordinate may be held constant in the loops), O_1 ... O_OC,
- a process may iterate over every input in the pre-image of O (e.g. those in the range B[1 ... IC][OH ... OH + KH][OW ... OW + KW]).
- a pre-image of an output may be the set of inputs whose value affects those output values.
- For each such input value I a load instruction of I to a register from a circular queue may be emitted.
- a process may then examine each of the OC kernel values which map I to some O_j.
- a process may store the non-zero kernel element, e, in .data and emit an FMA multiplying I with e and accumulating to O_j (or one FMA handling multiple such multiply-accumulate operations).
- Auxiliary code may be emitted or produced around this core sequence of input Load and FMA instructions to iterate over each spatial output coordinate.
- Each non-zero kernel value may be used to compute each spatial output row, but embodiments of an algorithm may only write each non-zero value to .data once.
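A hedged sketch of this generation loop is shown below (the printed mnemonics and helper layout are illustrative only, not the patent's Table 1; a real generator would also reuse each pooled constant across spatial rows rather than re-writing it, as noted above):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// For one output row at a fixed spatial coordinate: assign each pre-image
// input vector a slot from a small circular queue of line-registers, and emit
// an FMA only where the corresponding kernel value is non-zero.
void generate_row_code(const float* A,          // kernel, IC x OC x KH x KW
                       int IC, int OC, int KH, int KW,
                       std::vector<float>& data_section) {
    const int QUEUE = 5;   // illustrative number of input line-registers
    int slot = 0;
    for (int ic = 0; ic < IC; ++ic)
      for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw) {
            std::printf("vmovups lreg%d, [B + off(ic=%d, kh=%d, kw=%d)]\n",
                        slot, ic, kh, kw);                 // load input vector I
            for (int oc = 0; oc < OC; ++oc) {
                float e = A[((ic * OC + oc) * KH + kh) * KW + kw];
                if (e == 0.0f) continue;                   // zero: emit nothing
                std::size_t off = data_section.size() * sizeof(float);
                data_section.push_back(e);                 // e stored in .data
                std::printf("vfmadd231ps oreg%d, lreg%d, [.data + %zu]{1to8}\n",
                            oc, slot, off);                // accumulate into O_oc
            }
            slot = (slot + 1) % QUEUE;                     // retire register to queue
        }
}
```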
- B and C may be considered feature maps, e.g. a set of values organized spatially (e.g. via height and width dimensions) and into channels (via a channel dimension; e.g. OC and IC may be channel dimensions).
- Feature maps might be inputs to the neural network as a whole, including as a representation of a human-viewable image, or they might be intermediate data in a NN, in which case they may have no human-understandable meaning.
- while typically code produced by embodiments of the invention is used during NN inference (e.g. run time), in some embodiments code produced may be used during certain training portions.
- the code that is produced may be executed for example on multiprocessor architectures.
- the multiplication in the context of convolutional neural network layers is typically between kernel tensors of values, which may be sparse as a result of being pruned extensively, and the tensor or matrix of inputs (or modified inputs) to a NN at a first layer or inputs to an intermediate layer of a NN.
- the inputs may be dense (e.g. the input data matrix is not assumed to be sparse, in other words, it is not assumed to have many entries that are zero).
- Inputs to a NN may be for example an image to be categorized, speech data to be analyzed, etc.; inputs to layers may also be intermediate values sent from a previous NN layer which is to be processed using convolutional operations.
- Embodiments may include methods to avoid performing, at inference, unnecessary and costly operations associated with zero elements of data used in convolutional operations, such as branching instructions. This may be accomplished by developing from a sparse convolutional kernel data (e.g. represented as a tensor) executable code including only necessary multiplication operations on non-zero elements of the convolutional operations data.
- the zero and non-zero values in sparse kernel data are stable over time, and may be pre-determined as of compile or code-generation time.
- the pattern (e.g. which multiplications are avoided by not being included in the compiled code because the corresponding element of the convolutional kernel data is zero) typically remains the same.
- Subsequent training typically does not change this pattern because the pruned weights remain zero, and their corresponding multiplications are avoided. Other weights may change, but their corresponding multiplications are performed.
- the same generated code may be used many times (e.g. many NN inferences), saving much time as each time the generated code is executed, less code is executed than if zeros were branched on or otherwise considered at execution time.
- the code generated may include an instruction representation which is shortened relative to standard instruction length and also which may increase arithmetic intensity (AI).
- the AI of code or an algorithm may be the number of compute operations it performs per byte fetched from, or stored to, a certain level of the memory hierarchy (e.g. shared cache or main memory).
- Compiling or creating computer instructions or code corresponding to or based on the sparse data may be performed in such a way that it can be executed efficiently on modern multiprocessors.
- the maximum amount of computation that can be performed per cycle is fixed, hence efficiency is measured by compute utilization, e.g. a fraction of the time spent on actual computation on a given piece of data as opposed to the time spent accessing that piece of data from memory.
- vector processing instructions may be used, such as fused multiply-add (FMA), that operate on multiple items or words of data simultaneously.
- An embodiment may convert the convolutional kernel data into executable code, which contains a floating-point multiply-add instruction such as an FMA instruction for each non-zero element, and typically no FMA instruction or other instructions for zero elements in the convolutional kernel data.
- embodiments may store each value of the non-zero element of the convolutional kernel or other data within the code itself, or associated with the code, as a constant (for example in a dedicated .data section for such constants); typically embodiments do not store in produced code the zero elements of the convolutional kernel.
- Each FMA instruction then multiplies some non-zero value of the convolutional kernel by a set of consecutive elements of the input data.
- Each value that is read and broadcast is multiplied by more than one subsequent set of elements in the input data. This way the relative inefficiency of reading/broadcasting is amortized among many efficient FMA instructions, which provide the highest throughput possible on modern multiprocessor architectures.
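The sketch below illustrates this amortization with AVX2/FMA intrinsics (an assumed target instruction set; the factor of four is arbitrary): the kernel value is broadcast once and then reused by several FMAs over consecutive slices of the dense input.

```cpp
#include <immintrin.h>

// One broadcast of a non-zero kernel value feeds several FMA instructions,
// each consuming 8 consecutive floats of input and updating 8 accumulators.
void broadcast_amortized(const float* input, float kernel_value,
                         float* acc /* at least 32 floats */) {
    __m256 k = _mm256_broadcast_ss(&kernel_value);        // broadcast once
    for (int i = 0; i < 4; ++i) {                         // reused by 4 FMAs
        __m256 in  = _mm256_loadu_ps(input + 8 * i);
        __m256 out = _mm256_loadu_ps(acc + 8 * i);        // previously accumulated
        out = _mm256_fmadd_ps(k, in, out);                // out += k * in
        _mm256_storeu_ps(acc + 8 * i, out);
    }
}
```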
- indexing may be avoided: to review an input convolutional kernel for data, indexing would normally need to be used; by compiling code that includes, within or associated with the code itself, certain input convolutional kernel data, the indexing for that data is not performed at execute time.
- Embodiments may work on CPUs, GPUs, or other multicore computing devices. Embodiments may produce code such that NNs execute efficiently, e.g. such that the compute resources of the CPUs have a high utilization.
- While FMA instructions may provide high throughput in terms of the number of operations performed per cycle, they also have substantial latency (e.g. time between invoking the instruction and having its result available). Therefore, to achieve high utilization embodiments may incorporate enough independent FMA instructions that will be pipelined by the target processor, to utilize the high throughput of the FMA units.
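A small sketch of this latency-hiding pattern (the count of four independent accumulator chains is illustrative, not a recommendation from the text):

```cpp
#include <immintrin.h>

// Several independent FMA chains keep the FMA pipeline busy despite the
// multi-cycle latency of each individual instruction.
void pipelined_fma(const float* a, const float* b, float* acc /* 32 floats */) {
    __m256 c0 = _mm256_loadu_ps(acc +  0);
    __m256 c1 = _mm256_loadu_ps(acc +  8);
    __m256 c2 = _mm256_loadu_ps(acc + 16);
    __m256 c3 = _mm256_loadu_ps(acc + 24);
    c0 = _mm256_fmadd_ps(_mm256_loadu_ps(a +  0), _mm256_loadu_ps(b +  0), c0);
    c1 = _mm256_fmadd_ps(_mm256_loadu_ps(a +  8), _mm256_loadu_ps(b +  8), c1);
    c2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + 16), _mm256_loadu_ps(b + 16), c2);
    c3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + 24), _mm256_loadu_ps(b + 24), c3);
    _mm256_storeu_ps(acc +  0, c0);
    _mm256_storeu_ps(acc +  8, c1);
    _mm256_storeu_ps(acc + 16, c2);
    _mm256_storeu_ps(acc + 24, c3);
}
```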
- Embodiments may improve processing and lead to significant speedups when used on multiprocessors for applications performing sparse-dense operations where the sparse data has a fixed or mostly fixed sparsity pattern.
- Some multicore CPU processors include several layers of cache.
- a third level of cache, known as the L3 cache, is typically shared among several cores.
- Other levels of cache such as L1 and L2 are faster and private to a specific core.
- Some embodiments of the invention may take advantage of this relationship. While some specific embodiments are described in terms of L1, L2, and L3 cache levels as in Intel architectures, embodiments may work with other architectures with a hierarchy of shared and core-exclusive cache levels.
- the compute- to-memory ratio may be defined as the ratio between the system’s maximal number of compute operations per second, and its memory bandwidth.
- X86 family CPUs may have an FMA instruction set that is an extension to the “Streaming SIMD Extensions” instructions which may perform fused multiply-add (FMA) operations that operate on long bit vectors, for example 128 bits, 256 bits, or even 512 bits.
- a single 256-bit FMA instruction can be used to multiply 8 pairs of 32-bit numbers (and a single 512-bit FMA instruction, 16 pairs), and add another accumulator number to each product. Embodiments of the invention may use such instructions.
- if the AI of an algorithm executed on some architecture is not higher than the CMR of some memory level on that architecture, the execution will be memory bound, e.g. bottlenecked on bringing the data in or out of that memory level.
- the algorithm may behave as if it were being executed on a processor capable of fewer compute operations per unit time.
- the ratio between the theoretical, and effectively achieved number of compute operations per unit time is equal to AI/CMR. This ratio may be described as the processor’s utilization (typically at the given memory level).
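Written out as formulas (a restatement of the definitions above; the utilization expression applies when execution is memory bound, i.e. AI < CMR):

$$\mathrm{AI} = \frac{\text{compute operations}}{\text{bytes moved at the given memory level}}, \qquad \mathrm{CMR} = \frac{\text{peak compute operations per second}}{\text{memory bandwidth (bytes per second)}}$$

$$\text{utilization} = \frac{\text{achieved operations per second}}{\text{peak operations per second}} = \frac{\mathrm{AI}}{\mathrm{CMR}}$$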
- the number of compute operations of an algorithm per item brought from or stored to a given memory level should exceed the system’s CMR.
- One way to increase the AI on a modern CPU includes reducing or eliminating branching, conditionals, and indirection, and to execute the majority of instructions as FMAs.
- Convolutions or convolutional components such as tensors may be partitioned (e.g. blocked), e.g. into subconvolutions.
- a process may partition an output tensor C with dimension OC x OH x OW into subtensors each with dimension OC’ x OH’ x OW’, perform subconvolutions to fully compute each subtensor, and combine the results into the original output C. It may be important to observe that each compiled subconvolution operating on the same range of output channels j ... j + OC’ will access the exact same pattern of non-zero kernel elements, regardless of which spatial block is being computed. Thus, in one embodiment, a process should only perform such a compilation algorithm once per block of OC’ output channels and re-use the compiled code for each spatial block, in order to avoid increasing the memory footprint of the compiled code, which can be quite large.
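A minimal sketch of this blocking scheme (the callback and block sizes are hypothetical; the point is that one compiled subconvolution per range of output channels is reused for every spatial block):

```cpp
#include <functional>

// Cover the OC x OH x OW output with OCb x OHb x OWb sub-tensors; the same
// compiled subconvolution for a given output-channel range is reused for
// every spatial block it covers.
void for_each_block(int OC, int OH, int OW, int OCb, int OHb, int OWb,
                    const std::function<void(int, int, int)>& run_subconv) {
    for (int oc0 = 0; oc0 < OC; oc0 += OCb)        // one compiled kernel per channel block
        for (int oh0 = 0; oh0 < OH; oh0 += OHb)
            for (int ow0 = 0; ow0 < OW; ow0 += OWb)
                run_subconv(oc0, oh0, ow0);        // fully computes one sub-tensor
}
```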
- Embodiments may consider the effects of particular blocking schemes, e.g. sizes of sub-tensors in each subtask, on the efficiency of the multiplication.
- Convolution may be efficiently parallelizable among many processors using or adapting a standard blocking technique, in which the sparse-dense convolutional calculation task is split into multiple sparse-dense sub-tensor multiplication subtasks.
- Each of these sub-tensor computations may then be performed (e.g. to fully determine an output) for example on a single processor, core within a processor, or thread.
- the sub-tensors may then later be combined into a larger tensor, the output of a convolutional calculation of the original tensor, using known methods.
- tensor A (representing or including kernel parameters, weights or parameters) and tensor B (representing NN or layer input data) may each be divided into sub-tensors, which may be used for a convolutional calculation, and the results combined into an output C representing A in a convolutional operation with tensor B.
- if a pyramid scheme is used, recombination of tensors may be delayed until near the final NN layer, and the tensors of the initial and intermediate layers of the pyramid may never be combined.
- Code generated for, and code generation for, divided tensors may be affected and optimized as a result of blocking, e.g. parameters of the example code in Table 1 (e.g. OC, OH, OW etc.) may be selected as the sizes of sub-tensors which can be part of a larger tensor.
- Some embodiments may optimize tensor subdivision to work with a particular computer configuration that will execute the NN operations, taking into account for example register or vector configurations, instruction latency and/or other processor details.
- Blocking may include or cause determining parameters describing the dimensions of a produced output tensor, for example some of such parameters dependent on the number of registers in a target architecture executing the computer instruction code.
- a tensor calculation can be divided into sub-tensor calculations, according to a known blocking process, and blocking may be according to parameters created based on a target processor architecture.
- the tensor calculation of sparse tensor A and input tensor B may be divided into multiple tasks where tensor A’ (part of A) is multiplied by tensor B’ (part of B). Each sub-tensor calculation may compute a convolutional calculation which may be denoted A’ ⊛ B’, the output being tensor C’. Multiple outputs C’ may be combined according to known methods into a final result tensor C which is A ⊛ B.
- a processor for which code is generated may include multiple registers or vector registers each including a number of values such as floating point values, the number of values of a register being its size which may be termed S.
- A may be divided prior to code generation, B may be divided at run-time, and outputs C’ may be combined to C. The division of B and the combination of C’ are effected at run time (e.g. inference), although the code may be structured, prior to code generation, to effectively perform this division and combination.
- parameters or settings may be optimized based on the target architecture, the processor type on which it is intended that the code will be executed.
- a set of target architecture values or parameters may be used to determine blocking dimensions, which in turn may define the generation of code. For example, when blocking or dividing the output tensor for a CNN layer having dimensions OC x OH x OW into smaller tensors each having dimensions OC’ x OH’ x OW’, OC’, OH’ and OW’ may each be determined based at least in part on processor parameters.
- the size of A’ may be derived to have dimensions IC x OC x KH x KW and the size of B’ may be derived to be IC x IH x IW.
- a target architecture may have a certain number of vector registers (as opposed to scalar registers) available for the processor, or available for each core of a processor (e.g. in one type of target architecture each core has an L1 and L2 cache private to that core), e.g. 32 registers in one target architecture.
- a certain number of the vector registers (e.g. 5) may be allocated to a circular queue of registers, and the remainder, e.g. 27, may be allocated to output tensor values.
- a number of the vector registers may be allocated to processor or core tasks separate from the convolutional calculation, and thus a smaller number than the total vector registers may be allocated to the queue and tensor outputs.
- OC’ may be derived as being not more than the number of registers in the circular queue.
- OH’ and OW’ may each be set equal to the square root of half of cache size for the processor.
- the KH, KW, and IC dimensions are not divided or blocked, and IH’ and IW’ may be derived from OC’, OH’ and OW’.
- a target processor (e.g. the processor intended to execute the produced code) has limited registers and cache size, and a convolutional process is typically performed using a processor’s internal registers and cache for speed and efficiency. Parameters governing code creation may be created with this in mind.
- Optimizing or creating parameters which are, in turn, used to partition tensors used in generating convolutional code may improve prior convolutional processes by spending the resources to determine an optimal tensor partition or division, or blocking, at compile time, rather than at compute, inference or execute time, improving inference speed.
- the non-zero elements of a sparse tensor may be moved (e.g. partitioned) to a data store within the code (e.g. .data) at compile time.
- data input and output tensors may be only "logically" partitioned, such that during runtime (e.g. NN execution or inference) logical sub-tensors are accessed according to the partitioning, but data for these tensors is typically not moved to partition the tensor during runtime.
- NN execution or inference is performed many more times than the compilation of NN code, and thus one compilation producing improved code results in a large amount of saved resources.
- while embodiments describe subdividing (e.g. blocking) convolutional calculation operations, such subdivision need not be used.
- a process following the example pseudocode in Table 1 may be used as the core of a process to produce convolutional layer calculation code (e.g. convolutional layer calculation code or software 302, described herein), typically machine code for the relevant architecture, such as an X86 architecture.
- Other processor types may be used, such as for example processors supporting AVX2 or AVX-512 (e.g. Intel Haswell, Broadwell, Skylake and Cascade Lake architectures).
- a specific processor may be designed, or specific instructions may be designed, to be used with an embodiment of the present invention such that the benefits of the invention are increased. While specific types of FMA, load and other instructions are used in Table 1, other instructions such as other floating-point multiply-add instructions or multiply-accumulate instructions may be used.
- the example code in Table 1 may produce code to perform a convolutional operation on input kernel A (with dimensions IC x OC x KH x KW), input (e.g. input to a NN, or input from a layer) B (with dimensions IC x IH x IW) to produce output C (with dimensions OC x OH x OW).
- Such produced code may perform such an operation on sub-inputs, using blocking, e.g. using as data A’ and B’ of larger tensors A and B.
- Other dimensions may be used for tensors or matrices, other blocking or division schemes may be used, and blocking need not be used.
- Table 1 assumes the existence of a circular queue of size R_1 of vector “line-registers”.
- the code produced by the pseudocode of Table 1 may use a processor’s registers (e.g. vector registers), in a queue or circular queue fashion.
- a process may increment in circular fashion each register chosen from a circular queue, with registers retired back to the queue when their associated FMA or other instruction completes.
- Registers in a processor may be used for multiple purposes by such pseudocode or other code, such as broadcasting kernel values and floating point operations.
- a vector broadcast instruction may be integrated with an FMA or other multiply-accumulate instruction.
- Other specific operations to compile or generate code, different from those shown in Table 1, may be used.
- Table 1 is used to produce inference or execution code, which has a structure, at its core, of a series of FMA and associated instructions surrounded by the loops defining the beginning of Table 1.
- Compiled code may have two nested loops (e.g. oh in 1 to OH and ow in 1 to OW by S) controlling and repeating the instructions.
- Other looping (e.g. for ic in 1 to IC, kh in 1 to KH, kw in 1 to KW) in the code in Table 1 occurs only during code production: the code produced simply performs the load, FMA and other instructions in the order it is created or emitted.
- a different section of code according to Table 1 may be created for each blocked output tensor, and the code may be combined with other blocks; certain blocks of compiled code may be repeated or re-used for different sub-tensors.
- some code sections may complete execution, and execution of another NN layer may proceed in part, before other CNN code sections start execution.
- the loops of oh and ow in combination with the production of code to initialize OC registers may be an outer loop sequence over entries in an output tensor C, where the code fixes individual entries in C, and then produces values of the fixed individual entries.
- These outer loops may be outside inner loops iterating over dimensions of an input and a kernel or filter, each associated with (e.g. used for the computation of) a convolutional layer.
- the loops “For ic in 1 to IC, kh in 1 to KH, kw in 1 to KW” and “For oc in 1 to OC” are inner loops producing code moving over tensors A and B.
- the outer loops (e.g. to OH and OW) may execute during inference, and the inner loops (e.g. to IC, KH and KW) may produce code which, during inference, executes.
- Instructions may be generated such that data from input (e.g. a feature map or input to a NN) is loaded using a “LOAD” instruction or other suitable instruction to a register from a queue, e.g. a circular queue, of registers. If a non-zero element is found in an input kernel tensor, an FMA is generated for that element, and if an examined kernel element is zero, no FMA is generated for that zero element (and no reserving of a place for e in .data is performed, saving memory footprint of the resulting code). In some embodiments, more than one multiply-accumulate instruction or FMA may be generated for each non-zero element.
- the FMA may include a broadcast or other suitable instruction loading kernel data to a register which is a parameter of the FMA; such a broadcast instruction may be generated so that it is executed separately from the FMA.
- a vector broadcast instruction may be to a register, the register input to an FMA instruction.
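For illustration, a short stretch of the emitted code might look roughly like the following at inference time (intrinsics stand in for the generated machine instructions, and the two kernel constants are made-up values): straight-line loads, broadcasts and FMAs with the non-zero constants baked in, and no branches or kernel indexing.

```cpp
#include <immintrin.h>

// What generated code for two surviving (non-zero) kernel entries might look
// like: loads from the input, broadcast constants, FMAs into an accumulator,
// then a store once the inner sequence completes. No runtime zero checks.
void emitted_fragment(const float* in, float* out) {
    __m256 acc = _mm256_loadu_ps(out);                         // previously accumulated
    __m256 i0  = _mm256_loadu_ps(in);                          // LOAD, queue slot 0
    acc = _mm256_fmadd_ps(_mm256_set1_ps(0.5f), i0, acc);      // non-zero entry #1
    __m256 i1  = _mm256_loadu_ps(in + 8);                      // LOAD, queue slot 1
    acc = _mm256_fmadd_ps(_mm256_set1_ps(-1.25f), i1, acc);    // non-zero entry #2
    _mm256_storeu_ps(out, acc);                                // store output register
}
```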
- the code that is produced may have .data be part of it or associated with it.
- a dynamically sized vector is created storing each non-zero element; at the end of the compile process this may be saved to a data block.
- the data in A is known at compile or code generation time but the data in B is only known at run (e.g. inference) time.
- B is typically a dense matrix.
- the code generated may be code that executes a sub-tensor calculation as part of a blocking or partitioning scheme.
- a different code segment may be generated for each convolutional calculation of A’ and B’ producing output C’, and code may also be generated or integrated to, after the execution of the convolutional code, combine the results C’ into one larger output C.
- Partitioning in the sense of dividing inputs may be done for a sparse tensor by appropriately placing data and instructions in or associated with generated code, and for run-time input data block, by generating instructions to access that input appropriately.
- Tensor A may be the tensor of a trained kernel, representing for example weights.
- A’ may be a subtensor of such a kernel in the cases where blocking is used, or may be the full kernel tensor. If subtensors are used, the products of the sub-convolutional operations may be combined using known blocking methods into a larger output.
- no code is created for zero elements of A’, obviating the need for a conditional or other instruction based on a zero entry in a tensor, and resulting at run-time (e.g. inference or learning in NN applications) of no code being executed or existing for the zero entries.
- the determination of whether a value is zero (e.g. a conditional or branch on a kernel entry) may thus be performed at compile or code generation time rather than at run-time (e.g. inference or learning).
- Fig. 1A is a simplified block diagram of a NN which may be operated on or computed according to an embodiment of the present invention; in typical use thousands of neurons and links are used.
- software or code generated simulates the operation of NN 1000.
- NN 1000 may input data as for example an input vector or tensor 1010 of values (representing, e.g. a photograph, voice recording, or any sort of data).
- NN 1000 may have neurons arranged into layers 1030, each including neurons 1040 connected to other neurons by links or edges 1050.
- NN 1000 may in one example have layers such as convolution, pooling, output layers, an FC layer, a softmax layer, etc. Each layer may connect to other layers by links or edges.
- Fig. 1B shows an example of a CNN with a sequence of layers including convolutional layers.
- NN 20 includes direct convolutional layer 30, pool layer 32, and convolutional layer 34.
- Layer 35 may be a pool layer
- layer 36 may be a fully connected layer, or for example a convolutional layer
- layer 37 may be a softmax layer (softmax being a function that may be used in some NN layers).
- One or more cores or processors may process the NN during inference (e.g. run-time) by, e.g. simulating the activity and data flow of the nodes or neurons and layers, which may include tensor or matrix multiply and convolutional calculations.
- NNs in Figs. 1A and 1B are typically simulated, and represented as data, for example by systems such as shown in Figs. 2 and 3, using code such as generated by methods described herein. While specific numbers and types of layers are shown, Fig. 1A is merely a highly generalized example, and Fig. 1B is merely an example: NNs used with embodiments of the present invention may vary widely as known in the art.
- Fig. 2 shows a high-level block diagram of an exemplary computing device which may be used with embodiments of the present invention.
- Computing device 100 may include a controller or processor 105 that may be or include, for example, one or more central processing unit processor(s) (CPU), one or more Graphics Processing Unit(s) (GPU or GPGPU), a chip or any suitable computing or computational device, an operating system 115, a memory 120, a storage 130, input devices 135 and output devices 140.
- modules and equipment such as code production computer 300, NN computer 320, NNs as shown in Fig. 1 and other modules or equipment mentioned herein may be or include, or may be executed by, a computing device such as included in Fig. 2, although various units among these entities may be combined into one computing device.
- Operating system 115 may be or may include any code segment designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 100, for example, scheduling execution of programs.
- Memory 120 may be or may include, for example, a Random Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units.
- Memory 120 may be or may include a plurality of, possibly different memory units.
- Memory 120 may store for example, instructions to carry out a method (e.g. code 125), and/or data such as user responses, interruptions, etc.
- Executable code 125 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 125 may be executed by controller 105 possibly under control of operating system 115. For example, executable code 125 may when executed cause the production or compilation of computer code, or application execution such as NN execution or inference, according to embodiments of the present invention. Executable code 125 may be code produced by methods described herein. For the various modules and functions described herein, one or more computing devices 100 or components of computing device 100 may be used. Devices that include components similar or different to those included in computing device 100 may be used, and may be connected to a network and used as a system. One or more processor(s) 105 may be configured to carry out embodiments of the present invention by for example executing software or code.
- Storage 130 may be or may include, for example, a hard disk drive, a floppy disk drive, a Compact Disk (CD) drive, a CD-Recordable (CD-R) drive, a universal serial bus (USB) device or other suitable removable and/or fixed storage unit.
- Data such as instructions, code, NN model data, parameters, etc. may be stored in a storage 130 and may be loaded from storage 130 into a memory 120 where it may be processed by controller 105. In some embodiments, some of the components shown in Fig. 2 may be omitted.
- Input devices 135 may be or may include for example a mouse, a keyboard, a touch screen or pad or any suitable input device. It will be recognized that any suitable number of input devices may be operatively connected to computing device 100 as shown by block 135.
- Output devices 140 may include one or more displays, speakers and/or any other suitable output devices. It will be recognized that any suitable number of output devices may be operatively connected to computing device 100 as shown by block 140.
- Any applicable input/output (I/O) devices may be connected to computing device 100, for example, a wired or wireless network interface card (NIC), a modem, printer or facsimile machine, a universal serial bus (USB) device or external hard drive may be included in input devices 135 and/or output devices 140.
- Embodiments of the invention may include one or more article(s) (e.g. memory 120 or storage 130) such as a computer or processor non-transitory readable medium, or a computer or processor non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory, encoding, including or storing instructions, e.g., computer-executable instructions, which, when executed by a processor or controller, carry out methods disclosed herein.
- Fig. 3 is a high-level block diagram of an exemplary computing system which may be used with embodiments of the present invention.
- Code production computer 300 may produce convolutional code or software 302 for use in inference for NNs, or may produce code for other applications using tensor or matrix multiplication.
- Convolutional code or software 302 may include or be associated with sparse data 303, e.g. the non-zero entries from an input dataset, tensor or matrix.
- Code 302 may execute sparse-dense multiplication as described herein, and may be executed in conjunction with, or may be combined with, application or NN code 304 such as TensorFlow or PyTorch open-source code, which may perform known operations for inference (e.g. run-time).
- Application code 304 may interface with code 302, where code 302 may when executed handle certain tensor or matrix operations, and code 304 may handle other aspects of NN inference as known in the art.
- Application code 304 may interface with code 302 to for example execute or train NNs such as shown in Fig. 1.
- Code 302 if used for another application may include other code managing such an application; in such a case code 304 may manage such other non-NN applications.
- Code production computer 300 may communicate, e.g. via a network 310 (e.g. the internet) with NN computer 320.
- NN computer 320 may train and/or execute (e.g. perform inference with) a NN by using (e.g. executing) code 302 and/or code 304.
- NN computer 320 may manage execution of a NN in cooperation with other computers, e.g. a cloud server, or a set (e.g. pod) of servers or computers, such that part of code 302 and/or code 304 may be executed by another computer separate from or remote from computer 320.
- code production computer 300 may produce code 302 and deliver it to another computer such as NN computer 320, which may integrate or use code 302 with code 304, where code 304 is not produced at code production computer 300.
- Code production computer 300 and NN computer 320 may be, be executed by, or include, elements of a computer such as shown in Fig. 2.
- Fig. 4 is a flowchart of a method for producing or compiling code and/or executing an application such as NN inference according to embodiments of the present invention. While in one embodiment the operations of Fig. 4 are carried out using systems as shown in Figs. 1-3, in other embodiments other systems and equipment can be used.
- a NN may be trained to produce a set of parameters, e.g. link weights.
- the NN may be sparsified through known methods such as pruning, such that many non-zero link weights are converted to zero.
- parameters of the NN may be sparsified.
- Data that has been produced and sparsified, e.g. NN parameters, may be placed in a sparse tensor or matrix. Alternately, a produced tensor or matrix may be sparsified.
- optimization parameters may be produced using as input parameters or factors specifications of a target processor, e.g. the size of caches local to cores, the number and division of available vector registers, or other specifications.
- Output of optimization may be for example parameters to determine a blocking operation, dimensions of input or output tensors or sub-tensors used, the number of registers, typically vector registers, used for convolutional operations, and/or the size of a pool or a queue that includes registers used for convolutional operations.
- code or software may be produced using as input a tensor or matrix produced in operation 410 and possibly optimization or optimized parameters.
- Such code may be produced for at least one convolutional layer of the NN.
- a process as in example Table 1 above may be used; however other embodiments may use other specific processes.
- the code produced may be tensor operations or matrix multiply operations designed for a specific processor (e.g. target architecture) with specific parameters.
- in operation 432, each element of the kernel tensor may be examined; if the element is non-zero a process may be performed, and if the element is zero the process may be avoided or not performed.
- Operation 432 may use an if-then or other conditional operation such that such an operation is performed at code generation time rather than inference time.
- This process may include, in operation 434, emitting or generating a vector broadcast instruction broadcasting the non-zero kernel entry to a register, and in operation 436, emitting or generating a vector or multiply-accumulate instruction such as a floating-point multiply-add instruction.
- Such an FMA instruction may have for example three or four parameters, e.g. an output (e.g. a portion of an output for a convolutional layer), previously accumulated output, non-zero kernel data, and input data. Output and previously accumulated output may be combined to one parameter.
- Non-zero kernel data may be for example a register to which data was broadcast, a memory reference for kernel data, or a broadcast element itself.
- Input data (e.g. feature, image, or other data) may be a memory reference or register.
- the FMA may be generated with a parameter being a register containing the kernel data, a parameter being a reference to external memory containing the kernel data, or a broadcast instruction for the kernel data.
- Other suitable instructions or groups of instructions for calculating a convolutional layer, other than an FMA may be generated based on a determination that a kernel entry is non-zero.
- Operations 432-436 may be iterated a number of times until for example no unvisited (e.g. unprocessed) non-zero elements remain.
- Operation 432 may use nested loops, iterating over elements in input and output tensors.
- Generating or compiling code as in operation 430 may partition or divide a sparse tensor or matrix in the sense that the non-zero elements of the sparse tensor or matrix are distributed appropriately in a data store in the resulting code, or that instructions may access the data in a manner effectively partitioning the data.
- the sparse tensor A may be blocked, partitioned or divided in this virtual sense by, for example, generating a separate set of convolutional code for each partition of tensors A’, B’ and C’; code may also be generated to combine the results C’ tensors after calculations are performed. However typically these tensors are not actually partitioned in the sense that data is not moved to divide the tensors.
- a separate section or block of convolution code may be created for each certain different section of blocked tensors.
- code produced in operation 430 may be combined with or integrated with other code used to operate an application, e.g. code to operate a NN, to produce a complete or combined set of code (e.g. tensor code 302 and application code 304) which may be used to, e.g. perform inference on a NN.
- code may be generated or added to control the execution of the tensor code a number of times, for each of the sub-tensor/sub-input pairs, and after the execution of the convolutional code, combine the results C’ (e.g. into one larger output tensor C).
- code produced in operation 430 and/or combined code may be distributed or transmitted from a production computer to a computer used to execute the code.
- the code produced may include or be associated with the sparse kernel data - e.g. only non-zero entries of the original kernel, which included many zero entries, may be associated with the code.
- code or software produced, combined and/or distributed may be executed, possibly in combination with or while communicating with other code used to operate an application, e.g. code to operate a NN (e.g. application code 304).
- such execution may be performed by one or more computers conducting NN inference (e.g. computer 320).
- Such execution may include, for example, tensor or convolutional operations or multiplication of matrix data, such as data in data 303, by NN input data which is also represented by a tensor.
- Execution of such code may include controlling the execution of the tensor multiplication code, and combining the tensor results into one output tensor.
- Data used as input may be input to a NN, or intermediate values sent from one NN layer to another.
- Executing the NN may include, for at least one convolutional layer of the NN, executing the series of instructions (e.g. including FMA instructions) created, each FMA instruction taking as input a non-zero element of the filter or kernel tensor associated with the convolutional layer, the series of FMA instructions associated with the data store holding only non-zero elements of the kernel tensor.
- Executing the NN may include executing a vector broadcast instruction to a register, the register input to an FMA instruction.
- a set of instructions may be executed to combine more than one produced output tensor into a final output tensor.
- at least a first code section may be executed in conjunction with code executing at least a portion of a lower CNN layer before the execution of a second code section.
- a result or output may be produced by the executed code, e.g. via a NN, e.g. a classification of data.
- Embodiments may apply methods as described herein (e.g. producing code for sparse convolutional operations) with a pyramidal processing approach.
- a pyramidal processing approach is described in embodiments of US Patent Application no. 16/426,609, filed on May 30, 2019, entitled “SYSTEMS AND METHODS FOR IMPROVED NEURAL NETWORK EXECUTION”, incorporated herein by reference in its entirety.
- Pyramid processing embodiments may execute the layered graphs defining CNNs on multicore CPUs that typically have large caches and low memory bandwidth.
- Pyramid embodiments may divide computation into tasks in a manner that can be agnostic to the layered structure of the network, breaking the whole network (or substantially the whole network) down into asynchronously executable “pyramid” shaped tasks that can cross layer boundaries.
- Each task can be executed by a single compute core, encompassing a part of the computation that can optimize the cache and/or compute capabilities of this individual core.
- One benefit of this approach can be to reduce the amount of data brought to memory at any given time so that it can be proportional to the total amount of core cache memory, rather than the size of a given network layer.
- Another benefit of this approach can be that it can also turn computations that are memory bound into ones that are compute bound for the most complex convolutional computations.
- Element-wise computations in a CNN can be ones in which the inputs needed to compute a given output value typically do not overlap with the inputs required to compute any other output value.
- a pyramid embodiment (and its resulting savings) may be applied to all (or substantially all) types of computation in a CNN, and in particular to non-element-wise operations such as those in convolutional layers.
- Embodiments may execute the tasks efficiently even though they may overlap with one another in the computations they perform. This can allow an asynchronous execution of a complete CNN, in training and/or inference modes.
- Pyramid embodiment processing may break with traditional or prior art GPU style executions of such networks, which, apart from fusing the computation of a few types of element-wise layers (such as pooling and ReLU), are typically based on executing the computation layer after layer with multiple cores executing a complete layer in a synchronous or bulk-synchronous fashion.
- the computation within each pyramid task of some embodiments can allow maximizing cache buffer reuse and/or reducing memory bandwidth traffic, which can allow great savings in the amount of overall memory that needs to be used at any given point in the computation (e.g. a process may not need to store a whole layer’s data in memory at the same time). This property can be a critical component enabling efficient execution of sparse CNNs.
- a pyramid embodiment can execute a CNN computation graph as a collection of “pyramid” tasks, each executing a subset of the neurons or nodes across several layers, rather than just a subset of the nodes or neurons of a given layer.
- the subset of network nodes can form an abstract shape of a pyramid; hence the name.
- the choice of neurons executed in each pyramid task can be designed to (1) fit the computation of the task, e.g.
- Each task can be a set of compute instructions with a set of inputs that are dependent on prior tasks and an output that will feed into subsequent tasks.
- the nodes or neurons in a task can be chosen so as to minimize the amount of data moved during the computation, allowing, for example, data to be repeatedly swapped within the same regions of cache, in order to make a task execution completely compute bound, that is, to spend most of its time computing on the same data rather than bringing new data from memory.
- This non-layered pyramidal approach can differ from prior art approaches where one waits for the completion of the computation of all the nodes or neurons of a given network layer before proceeding to compute the following layer, typically incurring large penalties because of memory traffic.
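- A 1-D, single-channel toy sketch of one such cross-layer task is shown below; it is not the patent's task construction, but it shows the shape of the idea: a small tile of convolution, ReLU and max-pool is computed end to end so the working set stays small:

```python
import numpy as np

def pyramid_task(x, w, tile_start, tile_len):
    """One 'pyramid'-shaped task for a 1-D toy network: compute a tile of
    conv(3-tap, stride 1, no padding) -> ReLU -> max-pool(2) end to end,
    touching only the inputs that tile needs.

    Several layers are computed back to back on a small slice, so the working
    set can stay in one core's cache instead of materializing each full layer.
    """
    conv_start = 2 * tile_start                            # pooled index -> conv index
    needed = x[conv_start:conv_start + 2 * tile_len + 2]   # only the inputs this tile needs
    conv = np.array([np.dot(needed[i:i + 3], w) for i in range(2 * tile_len)])
    relu = np.maximum(conv, 0.0)                           # element-wise layer, fused in
    return relu.reshape(-1, 2).max(axis=1)                 # max-pool of size 2

x = np.arange(32, dtype=np.float32)
w = np.array([0.25, 0.5, 0.25], dtype=np.float32)
tiles = [pyramid_task(x, w, s, 4) for s in (0, 4, 8)]   # three independent tasks
out = np.concatenate(tiles)                             # 12 of the 15 pooled outputs
```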
- a component of a pyramid task may be a block of code which executes a part of a convolutional layer (e.g. as produced by the example Table 1 process), the block of code executed in sequence with other components of the pyramid task before other blocks of code for the CNN layer are executed.
- a first set of tasks may output to a second set of tasks which may output to a third set of tasks.
- an algorithm may recursively move back in the layers, aggregating the sub-computations necessary to compute a single output.
- An algorithm may determine the inputs required for an ultimate output by taking the union of the inputs needed for each sub-computation.
- all the computations needed to create the inputs to a certain task (which are the outputs of other tasks), may be aggregated into a new set of pyramid tasks that each are calculated backwards to include re-shuffle, pooling, and convolutional computation.
- each of the inputs to a task may be computed via a new set of pyramidal tasks that span the pooling and convolution layers.
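- A 1-D toy sketch of this backward aggregation follows; the layer kinds, sizes and half-open ranges are assumptions of the sketch, and real tensors add channel and batch dimensions, but the layer-by-layer walk is the same:

```python
def required_input_ranges(layers, out_range):
    """Walk back from a requested output range of the last layer and aggregate,
    layer by layer, the (half-open) input range needed to produce it.

    layers: (kind, size) pairs from input side to output side; 'conv' layers
    are stride-1, unpadded with the given kernel size, 'pool' layers are
    non-overlapping windows of the given size.
    """
    start, stop = out_range
    ranges = [(start, stop)]                 # range at the output of the last layer
    for kind, size in reversed(layers):
        if kind == "conv":                   # output i needs inputs i .. i+size-1
            stop = stop + size - 1
        elif kind == "pool":                 # output i needs inputs i*size .. (i+1)*size-1
            start, stop = start * size, stop * size
        ranges.append((start, stop))
    return list(reversed(ranges))            # outermost input range first

# which slice of the original input a task needs to produce outputs 10..19
net = [("conv", 3), ("conv", 3), ("pool", 2), ("conv", 3)]
print(required_input_ranges(net, (10, 20)))
```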
- the first convolutional layer of a VGG NN may take a two dimensional input image with dimensions 224 x 224 and with three input channels (RGB).
- the output of the layer has 64 channels, and hence, there may be 64 x 3 convolutional kernels each of dimensions 3 x 3. With padding 1 and stride 1 along each dimension this may lead to an output of the same dimensionality 224 x 224 but with 64 channels.
- the output of the first convolutional layer may be passed through a ReLU (rectified linear unit) before a second convolutional layer.
- the output of the second convolutional layer is again 224 x 224 with 64 channels. This is then passed through a ReLU and a max-pooling of 2 x 2, reducing the dimensionality to 112 x 112.
- the next, third, convolutional layer has 128 output channels, but the same padding and strides as before, which leaves the dimensionality of the output at 112 x 112 with 128 channels.
- the fourth convolutional layer follows after a ReLU, with same padding and strides and 128 output channels. Finally, a second max-pooling occurs after another ReLU.
- an embodiment may be able to create a pyramid task that starts at the beginning of the network and ends after the third convolutional layer. This follows from the above calculations, as a process can afford 64 input and output channels in operations of the second convolutional layer. Even though it does not make a difference in this particular example, the amount of computation in this context is actually increased substantially, giving even more leeway in terms of the CMR requirements.
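- The arithmetic behind this example can be checked with a few lines of Python; the 4-byte float size and the 16 x 16 tile used for the working-set estimate are illustrative assumptions, not values taken from the example above:

```python
def activation_mib(h, w, channels, bytes_per_value=4):
    """Size of one activation tensor in MiB, assuming float32 values."""
    return h * w * channels * bytes_per_value / 2**20

# the VGG prefix walked through above (padding 1, stride 1 convs; 2x2 pool)
for name, h, w, c in [("input", 224, 224, 3),
                      ("conv1 out", 224, 224, 64),
                      ("conv2 out", 224, 224, 64),
                      ("pool1 out", 112, 112, 64),
                      ("conv3 out", 112, 112, 128)]:
    print(f"{name:10s} {h}x{w}x{c:<4d} {activation_mib(h, w, c):6.2f} MiB")

# a pyramid task never holds a full layer: e.g. a 16x16 spatial tile of conv2
# with all 64 input and 64 output channels (plus a 1-pixel halo for the 3x3 conv)
tile_kib = (18 * 18 * 64 + 16 * 16 * 64) * 4 / 2**10
print(f"16x16 cross-layer tile working set: {tile_kib:.0f} KiB")
```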
- FIG. 5 is a simplified schematic diagram of a CNN having a number of sub-computations (e.g., tasks) spanning more than one layer of the CNN, according to some embodiments of the invention.
- Task1, Task2, ... TaskN can be executed in parallel or asynchronously.
- Taskm1 can execute as soon as the output of Task1 and Task2 is ready.
- Taskm2 can be executed as soon as the output of TaskN is ready.
- Taskm1 and Taskm2 can execute at different times. In this manner, execution of the CNN layer by layer can be avoided; combinations of portions of layer executions, combined across layers, can be executed in conjunction, one portion of a layer execution providing output to another portion in another layer being executed in conjunction with the first portion.
- the plurality of sub-computations Task1, Task2, ... TaskN, Taskm1, Taskm2 and Taskl can be determined prior to execution.
- the plurality of sub-computations can be determined recursively for example by moving back from a portion of the Softmax layer output 510 and aggregating the sub-computations that are required to produce the Softmax layer output 510.
- Taskl traverses back and aggregates all outputs/computations from the Softmax layer output 510, through the Fully-Connected re-shuffle layer, and stops in the Fully-Connected matrix multiply layer, which indicates that, in this example, the Fully-Connected matrix multiply layer is where the outputs reach a memory size that meets the memory threshold.
- All of the outputs/computations needed to create the inputs for Taskl, which are the outputs of Taskm1 and Taskm2, are considered.
- Taskm1 and Taskm2 each traverse back and aggregate all outputs/computations from the Fully-Connected re-shuffle, pooling, and convolutional computation. All of the outputs/computations needed to create the inputs for Taskm1 and Taskm2, which are the outputs of Task1, Task2, ... TaskN, are considered.
- Task1, Task2, ... TaskN each traverse back and aggregate all outputs/computations from a portion of the pooling and convolution layers, which indicates that the memory threshold has not been met and all computations from all layers have been added to the sub-computations.
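- A toy scheduler showing this ordering (a task starts as soon as its inputs exist, with no layer barrier) is sketched below using Python threads; the task bodies are placeholders, not NN computations:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_graph(tasks, deps):
    """Run tasks asynchronously: each task starts as soon as the outputs it
    depends on exist, with no per-layer barrier.

    tasks: name -> callable taking the outputs of its dependencies in order;
    deps:  name -> list of dependency names.
    A toy scheduler (one worker thread per task, each blocking only on its own
    inputs); a real runtime would use a fixed pool and a ready queue.
    """
    futures = {}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        def submit(name):
            if name in futures:
                return
            for d in deps[name]:
                submit(d)                                 # schedule inputs first
            futures[name] = pool.submit(
                lambda n=name: tasks[n](*[futures[d].result() for d in deps[n]]))
        for name in tasks:
            submit(name)
        return {name: f.result() for name, f in futures.items()}

# the FIG. 5 shape: Task1..Task3 feed Taskm1/Taskm2, which feed Taskl
tasks = {"Task1": lambda: 1, "Task2": lambda: 2, "Task3": lambda: 3,
         "Taskm1": lambda a, b: a + b,      # starts as soon as Task1, Task2 finish
         "Taskm2": lambda c: c * 10,        # starts as soon as Task3 finishes
         "Taskl": lambda m1, m2: (m1, m2)}  # final task
deps = {"Task1": [], "Task2": [], "Task3": [],
        "Taskm1": ["Task1", "Task2"], "Taskm2": ["Task3"],
        "Taskl": ["Taskm1", "Taskm2"]}
print(run_task_graph(tasks, deps))          # ..., 'Taskl': (3, 30)
```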
- One embodiment may analyze a NN to determine a plurality of sub-computations from the total computations of the neural network, where determining each sub-computation includes determining a group of outputs for each sub-computation based on a layer of the NN; and, for each group of outputs, determining a largest number of inputs in some layer preceding that layer that are necessary to compute the respective group of outputs and that result in a memory requirement less than a memory threshold. At least two of the largest number of inputs for the respective sub-computations may overlap and span at least two layers of the plurality of layers of the neural network.
- the outputs of the neural network may be computed by executing each sub-computation.
- a “blocked” convolutional calculation, where different portions of a convolutional layer are divided based on the division of tensors into tasks, each task including one or more code segments (such as code generated by a process such as that in Table 1), may be integrated with a “pyramid” process.
- Different portions of the convolutional calculation may be integrated into a pyramid task, such that some portions of the convolutional calculation may be calculated with an associated pyramid task before other convolutional portions are calculated.
- layers lower than the convolutional layer (e.g. layers accepting input from a prior layer such as the convolutional layer) may be partially calculated before all convolutional portions for the convolutional layer are calculated: some convolutional portion calculations may be delayed while pyramid calculations including other convolutional portions are performed.
- a number of code sections for the at least one convolutional layer may be produced, such that during execution or inference of the CNN, at least a first code section is executed in conjunction with code executing at least a portion of a lower CNN layer before the execution of a second code section of the convolutional layer.
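- A sketch of such an interleaved schedule on a 1-D toy (hypothetical helper names; a 3-tap convolution followed by a max-pool of size 2): the pooled portion fed by the first convolutional code section is produced before the second code section of the same convolutional layer runs:

```python
import numpy as np

def conv_section(x, w, start, length):
    """One code section of a blocked 3-tap convolution: outputs [start, start+length)."""
    return np.array([np.dot(x[i:i + 3], w) for i in range(start, start + length)],
                    dtype=np.float32)

def pool_section(conv_out):
    """A portion of the following (max-pool, size 2) layer, consuming one conv section."""
    return conv_out.reshape(-1, 2).max(axis=1)

x = np.arange(20, dtype=np.float32)
w = np.array([1.0, 0.0, -1.0], dtype=np.float32)

# interleaved schedule: the pooled portion fed by the first conv section is
# produced before the second section of the *same* convolutional layer runs
pooled_parts = []
for start in (0, 8):                        # two code sections of one conv layer
    c = conv_section(x, w, start, 8)        # part of the convolutional layer
    pooled_parts.append(pool_section(c))    # part of the next layer, run in between
out = np.concatenate(pooled_parts)
```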
- Embodiments of the present invention may improve prior NN inference by, for example, completely avoiding both certain convolutional layer operations involving zero parameters and, at inference, branch operations (e.g. “if zero then do not multiply”) which may attempt to avoid such zero operations.
- a kernel (e.g. describing a filter, a set of weights or other parameters) may be combined with NN code such as TensorFlow or PyTorch open-source code, which may take input data and perform inference on the NN; such inference may include convolutional kernel calculations, or tensor or matrix multiplication, of zero entries in the kernel, or branch operations to avoid multiplication on zero items.
- In contrast, embodiments of the present invention may include only non-zero kernel values in compiled code, and only the operations that perform multiplications on those non-zero kernel values in the compiled code.
- code specific to a NN may be compiled specific to a kernel or set of parameters.
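- The difference can be illustrated with a small Python sketch (not the claimed compiled code): a generic loop that branches on every weight versus a function specialized to one kernel that contains only the non-zero multiplications:

```python
import numpy as np

def dot_with_branches(weights, x):
    """Generic loop: every weight is visited and a branch skips the zeros."""
    acc = 0.0
    for i, w in enumerate(weights):
        if w != 0.0:          # branch executed for every weight, zero or not
            acc += w * x[i]
    return acc

def specialize(weights):
    """Kernel-specific 'compilation': bake the non-zero weights and their
    positions into a closure, so the returned function contains no zero
    checks and never touches the zero entries."""
    nz = [(i, float(w)) for i, w in enumerate(weights) if w != 0.0]
    def compiled_dot(x):
        acc = 0.0
        for i, w in nz:       # only non-zero weights remain
            acc += w * x[i]
        return acc
    return compiled_dot

w = np.array([0.0, 0.0, 2.0, 0.0, -1.0, 0.0], dtype=np.float32)
x = np.arange(6, dtype=np.float32)
f = specialize(w)
assert abs(dot_with_branches(w, x) - f(x)) < 1e-6
```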
- Embodiments of the invention may be applicable to NNs computed with any sort of nodes, e.g. CPUs, GPUs, or other types of processors.
- “plurality” and “a plurality” as used herein can include, for example, “multiple” or “two or more”.
- the terms “plurality” or “a plurality” can be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like.
- the term set when used herein can include one or more items. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
A system and method may generate code to be used when executing neural networks (NNs), for example convolutional neural networks (CNNs), which may include one or more convolutional layers. For at least one convolutional layer, for each non-zero element in a kernel tensor or matrix associated with the convolutional layer, instructions may be generated or emitted. For example, for each non-zero element, a vector broadcast instruction may be generated, and a fused multiply-add (FMA) instruction may be generated having as parameters a register representing a portion of the output for the convolutional layer, a register storing input data for the convolutional layer, and a register or memory reference storing the non-zero element. The software or code produced may be executed during convolution operations, for example as part of a larger application such as a NN inference application.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/751,953 US10963787B2 (en) | 2018-05-31 | 2020-01-24 | Systems and methods for generation of sparse code for convolutional neural networks |
| US17/178,603 US11216732B2 (en) | 2018-05-31 | 2021-02-18 | Systems and methods for generation of sparse code for convolutional neural networks |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962900933P | 2019-09-16 | 2019-09-16 | |
| US62/900,933 | 2019-09-16 |
Related Parent Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/426,609 Continuation-In-Part US11449363B2 (en) | 2018-05-31 | 2019-05-30 | Systems and methods for improved neural network execution |
| USPCT/US2019/040537 Continuation-In-Part | 2018-05-31 | 2019-07-03 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/751,953 Continuation US10963787B2 (en) | 2018-05-31 | 2020-01-24 | Systems and methods for generation of sparse code for convolutional neural networks |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021054990A1 true WO2021054990A1 (fr) | 2021-03-25 |
Family
ID=74884304
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2019/063678 Ceased WO2021054990A1 (fr) | 2018-05-31 | 2019-11-27 | Systèmes et procédés de génération de code parcimonieux pour réseaux neuronaux convolutifs |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2021054990A1 (fr) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113469337A (zh) * | 2021-06-29 | 2021-10-01 | 上海寒武纪信息科技有限公司 | 用于优化神经网络模型的编译方法及其相关产品 |
| CN113850385A (zh) * | 2021-10-12 | 2021-12-28 | 北京航空航天大学 | 一种粗细粒度联合的神经网络剪枝方法 |
| CN115221102A (zh) * | 2021-04-16 | 2022-10-21 | 中科寒武纪科技股份有限公司 | 用于优化片上系统的卷积运算操作的方法和相关产品 |
| CN119514606A (zh) * | 2024-11-03 | 2025-02-25 | 重庆大学 | 一种基于编码分布式计算的卷积神经网络卷积层加速方法 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9818059B1 (en) * | 2016-10-27 | 2017-11-14 | Google Inc. | Exploiting input data sparsity in neural network compute units |
| US20190138902A1 (en) * | 2017-11-06 | 2019-05-09 | Neuralmagic Inc. | Methods and systems for improved transforms in convolutional neural networks |
- 2019-11-27: WO PCT/US2019/063678 patent/WO2021054990A1/fr not_active Ceased
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9818059B1 (en) * | 2016-10-27 | 2017-11-14 | Google Inc. | Exploiting input data sparsity in neural network compute units |
| US20190138902A1 (en) * | 2017-11-06 | 2019-05-09 | Neuralmagic Inc. | Methods and systems for improved transforms in convolutional neural networks |
Non-Patent Citations (4)
| Title |
|---|
| CHEN XUHAO: "Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs", ARXIV.ORG, - 20180228, 3 April 2019 (2019-04-03), XP081199008, Retrieved from the Internet <URL:https://arxiv.org/pdf/1802.10280.pdf> [retrieved on 20200117] * |
| GEORGANAS EVANGELOS; AVANCHA SASIKANTH; BANERJEE KUNAL; KALAMKAR DHIRAJ; HENRY GREG; PABST HANS; HEINECKE ALEXANDER: "Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures", SC18: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 20 August 2018 (2018-08-20), pages 830 - 841, XP033529907, Retrieved from the Internet <URL:https://arxiv.org/pdf/1808.05567.pdf> [retrieved on 20200117], DOI: 10.1109/SC.2018.00069 * |
| LIU BAOYUAN, MIN WANG; FOROOSH HASSAN; TAPPEN MARSHALL; PENKSY MARIANNA: "Sparse convolutional neural networks", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 6 December 2015 (2015-12-06), XP032793491, Retrieved from the Internet <URL:http://openaccess.thecvf.com/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf> [retrieved on 20200117], DOI: 10.1109/CVPR.2015.7298681 * |
| PAPYAN VARDAN, ROMANO YANIV, ELAD MICHAEL: "Convolutional neural networks analyzed via convolutional sparse coding", THE JOURNAL OF MACHINE LEARNING RESEARCH, 17 July 2017 (2017-07-17), XP055805506, Retrieved from the Internet <URL:http://www.jmlr.org/papers/volume18/16-505/16-505.pdf> [retrieved on 20200117] * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115221102A (zh) * | 2021-04-16 | 2022-10-21 | 中科寒武纪科技股份有限公司 | 用于优化片上系统的卷积运算操作的方法和相关产品 |
| CN115221102B (zh) * | 2021-04-16 | 2024-01-19 | 中科寒武纪科技股份有限公司 | 用于优化片上系统的卷积运算操作的方法和相关产品 |
| CN113469337A (zh) * | 2021-06-29 | 2021-10-01 | 上海寒武纪信息科技有限公司 | 用于优化神经网络模型的编译方法及其相关产品 |
| CN113469337B (zh) * | 2021-06-29 | 2024-04-05 | 上海寒武纪信息科技有限公司 | 用于优化神经网络模型的编译方法及其相关产品 |
| CN113850385A (zh) * | 2021-10-12 | 2021-12-28 | 北京航空航天大学 | 一种粗细粒度联合的神经网络剪枝方法 |
| CN119514606A (zh) * | 2024-11-03 | 2025-02-25 | 重庆大学 | 一种基于编码分布式计算的卷积神经网络卷积层加速方法 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10963787B2 (en) | Systems and methods for generation of sparse code for convolutional neural networks | |
| US11216732B2 (en) | Systems and methods for generation of sparse code for convolutional neural networks | |
| US10915816B2 (en) | System and method of executing neural networks | |
| Lu et al. | Optimizing depthwise separable convolution operations on gpus | |
| US20230038061A1 (en) | Convergence among concurrently executing threads | |
| Wahib et al. | Scalable kernel fusion for memory-bound GPU applications | |
| US8676874B2 (en) | Data structure for tiling and packetizing a sparse matrix | |
| US8762655B2 (en) | Optimizing output vector data generation using a formatted matrix data structure | |
| WO2021054990A1 (fr) | Systèmes et procédés de génération de code parcimonieux pour réseaux neuronaux convolutifs | |
| WO2015150342A1 (fr) | Exécution de programme sur plateforme hétérogène | |
| Falch et al. | Register caching for stencil computations on GPUs | |
| Dong et al. | Characterizing the microarchitectural implications of a convolutional neural network (cnn) execution on gpus | |
| Phillips et al. | A cuda implementation of the high performance conjugate gradient benchmark | |
| Spector et al. | Thunderkittens: Simple, fast, and adorable ai kernels | |
| US20230418673A1 (en) | Method and apparatus for gnn-acceleration for efficient parallel processing of massive datasets | |
| Liu | Parallel and scalable sparse basic linear algebra subprograms | |
| Forster | Louvain community detection with parallel heuristics on GPUs | |
| CN117076098A (zh) | 一种动态张量编译优化方法、装置、电子设备及介质 | |
| Ho et al. | Tensorox: Accelerating gpu applications via neural approximation on unused tensor cores | |
| WO2021061172A1 (fr) | Système et procédé d'exécution de réseaux neuronaux | |
| US12190086B1 (en) | Method and apparatus for ML graphs by a compiler | |
| Siddiqui et al. | Design space exploration of embedded applications on heterogeneous CPU-GPU platforms | |
| JP7718014B2 (ja) | コンパイラ装置、演算処理装置、命令生成方法、プログラム、コンパイル方法及びコンパイラプログラム | |
| Gurgel et al. | Parallel implementation of feedforward neural networks on gpus | |
| 정우근 | A Deep Learning Optimization Framework for Versatile GPU Workloads |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19945571 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 19945571 Country of ref document: EP Kind code of ref document: A1 |