WO2023249762A1

WO2023249762A1 - Max-pool prediction for efficient convolutional nuerual network for resource-constrained devices

Info

Publication number: WO2023249762A1
Application number: PCT/US2023/022507
Authority: WO
Inventors: Joydeep Kumar Devnath; Sri Hari NEMANI
Original assignee: Ceremorphic Inc
Current assignee: Ceremorphic Inc
Priority date: 2022-06-21
Filing date: 2023-05-17
Publication date: 2023-12-28
Anticipated expiration: 2024-12-21

Abstract

In some embodiments computing device is configured to perform at least (1) accepting as inputs at least an input feature map having one or more first elements and at least one or more filters having one or more second elements (2) computing at least a first prediction value for a first stride (3) computing at least a second prediction value associated with the at least a second stride, wherein the first prediction value and the at least a second prediction value define a group of prediction values associated with the stride tensor (4) determining the greatest prediction value, the greatest prediction value is associated with one selected stride of the stride tensor and (5) executing a convolution operation only for the selected stride of the stride tensor.

Description

MAX-POOL PREDICTION FOR EFFICIENT CONVOLUTIONAL NUERUAL NETWORK FOR RESOURCE-CONSTRAINED DEVICES

FIELD OF TECHNOLOGY

[0001] Some embodiments pertain to systems and methods for predicting strides whose output will be dropped during a max-pooling operation. In particular some embodiments pertain to systems and methods for predicting strides whose output will be dropped during a max -pooling operation and then disregarding those strides during convolution operations.

BACKGROUND

[0002] Neural networks are a branch of artificial intelligence that are inspired by human neural networks. In particular, neural networks are a type of deep learning model. The use of neural networks includes two stages: 1) training; and 2) inference. Training a neural network usually includes providing substantial amounts of training data to a neural network during a training phase. Inference is putting a trained neural network to work to perform a task.

[0003] One type of neural network is a convolution neural network (CNN). A CNN is useful for recognizing patterns in image and audio data. In the context of images, CNN’s are used for image recognition and classification. In particular, CNN’s are often used for face recognition and object recognition. Practical applications include, for example, self-driving automobiles, airport security, and surveillance. In the context of audio data, CNN’s are used for speech recognition and other similar applications.

[0004] CNN’s are organized in stages, often called “layers.” There are a variety of different types of layers. Different types of layers involve different structures and/or different types of operations. Often there are different views regarding what constitutes a given type of layer. Including whether a given operation is included within a given layer or is its own separate layer.

[0005] One type of layer is a convolution layer. In one view a convolution layer includes several operations: 1) a convolution operation, 2) an activation operation, 3) a batch normalization layer and a pooling layer. A brief, very simplified discussion of these four operations follows.

[0006] A convolution operation applies one or more filters to one or more input feature maps to generate one or more output maps. An example of an input feature map is an image. In the context of images, the purpose of the filters is to extract features or patterns from the and to present the extracted features and patterns in the one or more output maps. [0007] An activation operation receives input from one or more convolution operations and applies an activation function. The activation function is almost always a non-linear function. It adds a non-linearity to the one or more output maps. In the context of a CNN working with image data, the activation function is often a rectified linear unit function (ReLU). Other types of activation functions are also possible, such as for example, a Tanh or Sigmoid function. A detailed discussion of activation functions is beyond the scope of this document.

[0008] A batch-normalization operation may take the output of an activation operation and standardize the output to speed learning in the next layer. Batch normalization works on “minibatches.” Batch normalization is a trained process. That is, in many implementations it uses two parameters that are trained in a training phase. These two parameters are referred to as beta (0) and gamma (T). Details of how a batch-normalization operation functions is beyond the scope of this document.

[0009] A pooling operation may take the output of a batch-normalization operation and reduce the size of the data. There are several types of pooling operations. One type of pooling operation is average pooling Another type of pooling operation is max-pooling. Max pooling receives output data and reduces the size of the output data by discarding data. For example, a max pooling operation may discard Ws of the output data and retain only 14 of the output data. A maxpooling operation does this by selecting from a group of data elements only the maximum element. For example, a group of data elements may include 4, 14, 12, and 27. In max pooling, element 27 is selected for retention and the elements 4, 14, and 12 are discarded.

[0010] The discarding of data by a max-pooling operation is significant because the operations that produce the discarded data are resource-intensive. In particular, convolution operations utilize frequent multiply-and-accumulate (MAC) operations. When MAC operations are used to create elements that are subsequently discarded by a max-pooling operation, then those MAC operations are wasted, creating an inefficiency.

[0001] There is a category of computing devices which may be referred to as resource- constrained devices. Resource-constrained devices include devices, such as for example, an Internet of Things device (IOT device), an embedded computing device, or other device with at least one of limited processing capacity, limited power supply, or limited storage capabilities. Other specific examples include smart telephones, hand-held computing devices, smart cameras, and the like. [0002] Components of devices may also be resource constrained. One example is a chiplet. Recently chiplets are part of a trend toward modular design of processors and other large integrated circuits. One reason for this is reduced cost if there is a defect. If there is a defect in a monolithic computer processor, the entire computer processor may have to be discarded. In contrast if there is a defect is a chiplet that forms part of a modular computer processor, only that one defective chiplet need be discarded.

[0011] The utilization of CNN’s in resource-constrained devices presents various challenges. In particular, there is a need for methods and systems for implementing efficient CNN’s that conserve computing resources.

SUMMARY

[0012] Some embodiments provide a computing device configured to implement at least a portion of a convolutional neural network. In some embodiments the computing device includes at least one or more processing devices and one or more machine-readable mediums bearing one or more executable instructions that when executed by the one or more processing devices cause the computing device to perform one or more operations that include at least accepting as inputs at least an input feature map having one or more first elements and at least one or more filters having one or more second elements.

[0013] The one or more operations further include at least for a first stride associated with a stride tensor, computing at least a first prediction value associated with the first stride, wherein a stride includes at least a mapping of the one or more second elements onto a group of one or more first elements and a stride tensor includes at least a group of strides and wherein a prediction value predicts a magnitude of an output value for a convolution operation associated with the first stride.

[0014] The one or more operations further include at least for at least a second stride of the stride tensor, computing at least a second prediction value associated with the at least a second stride, wherein the first prediction value and the at least a second prediction value define a group of prediction values associated with the stride tensor.

[0015] The one or more operations further include at least determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor. [0016] And the one or more operations further include at least executing a convolution operation only for the selected stride of the stride tensor, the executing performed by utilizing at least the input feature map and the one or more filters.

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] Representative embodiments are illustrated by way of example and not by limitation in the accompanying figures, in which:

[0018] Fig. 1 is a simplified schematic diagram showing MAC operations leading to activation output elements that are either selected or discarded in a max-pooling operation.

[0019] Fig. 2 is a simplified schematic diagram of a sample convolution layer showing an input feature map, filters, output an activation map, and the effect of max-pooling as shown in a max-pooled output map.

[0020] Fig. 3 is a simplified block diagram of a device - a smart camera - in which some embodiments may be practiced.

[0021] Fig. 4 is a simplified block diagram of an exemplary computing device in which some embodiments may be practiced. Shown are, consistent with some embodiments, a processing device and the memory storing data and executable circuitry.

[0022] Fig. 5 is a simplified block diagram of two modified convolution layers, consistent with some embodiments, showing operations that include prediction, modified convolution, activation (ReLU), and batch normalization.

[0023] Fig. 6A is a simplified schematic diagram of a modified convolution layer, consistent with some embodiments, showing at least a counter output map.

[0024] Fig. 6B is a simplified schematic diagram, consistent with some embodiments, showing an input feature map with associated P’s, trained batch normalization parameters.

[0025] Fig. 6C is a simplified block diagram of a set of filters, consistent with some embodiments, showing a mean value pf.

[0026] Fig. 6D is a schematic diagram showing a comparison of an activation element from an input feature map and a weight element from a filter, consistent with some embodiments.

[0027] Fig. 6E is a simplified block diagram of a counter list structure, consistent with some embodiments. [0028] Fig. 6F is a simplified block diagram of a counter list structure, showing a tie between two final counter values, consistent with some embodiments.

[0029] Fig. 6G is a simplified schematic diagram showing a sequence of strides of a first stride tensor, consistent with some embodiments.

[0030] Fig. 6H is a simplified two-dimensional diagram showing the strides of Fig. 6G overlayed to define the first stride tensor, consistent with some embodiments.

[0031] Fig. 61 is a simplified two-dimensional diagram showing a group of strides overlayed to define a second stride tensor, consistent with some embodiments.

[0032] Figs. 7A and 7B are method flow charts that together show an exemplary method, consistent with some embodiments.

[0033] Fig. 8A is a method flow chart, showing update option 1 for implementing a operation of the method of Fig. 7.

[0034] Fig. 8B is a method flow chart, showing update option 2 for implementing a operation of the method of Fig. 7.

[0035] Fig. 9 is a method flow chart showing an exemplary method, consistent with some embodiments.

[0036] Fig. 10A is a method flow chart showing optional process blocks for the method of Fig- 9.

[0037] Fig. 10B is a method flow chart showing optional process blocks for the method of Fig. 9.

[0038] Figs. 11A - 1 ID are charts that summarize some experimental data.

[0039] Fig. 12 is a simplified diagram of a data structure showing alternative notation for stride position, consistent with some embodiments.

DETAILED DESCRIPTION

[0040] In the above-described drawing, certain features are simplified to avoid obscuring the pertinent features with extraneous details. The above drawings are not necessarily to scale

[0041] It is to be understood that the disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms. It is also to be understood that multiple references to “some embodiments” are not necessarily referring to the same embodiments. [0042] As used in this document, the term “resource-constrained device” includes at least one of an Internet of Things device (IOT device), an embedded computing device, or a device with at least one of limited processing capability, limited power supply, or limited storage capabilities Some resource-constrained devices utilize what is known by those of skill in the art as a microcontroller (MCU). Further examples of resource-constrained devices include, without limitation, smart telephones, hand-held computing devices, smart cameras, and the like.

[0043] As used in this document, the term “convolution layer” includes a convolution operation plus one or more additional operations, such as for example, an activation operation, a batch normalization operation, and/or a max-pooling operation. The term activation operation refers to an operation utilizing an activation function. The term batch normalization operation refers to an operation that includes batch normalization as that phrase is used in the art. The term max-pooling operation refers to an operation that includes max pooling as that phrase is used in the art.

[0044] This document, consistent with some embodiments, also refers to a “modified convolution layer” that includes one or more of a prediction operation, a modified convolution operation, an activation operation, and a batch-normalization operation, but not a max-pooling operation. Further details of a modified convolution layer are provided below.

[0045] As used in this document the phrase “input feature map” refers to the input to a convolution operation. For example, the input feature map could be an image or audio data. Other terms for the input feature map that are used in the art include “input map” or “activation map.” An input feature map includes multiple elements (which could be, for example pixels, or other data elements) and may include multiple channels. Elements of input feature maps are referred to as “activation elements.”

[0046] The term “filters” is also used with its usual meaning in the art. Filters are also sometimes referred to as “weights.” Filters also include multiple elements and may include multiple channels. Elements of filters are referred to as “weight elements.”

[0047] As used in this document the phrase “output activation map” refers to the output of a convolution operation. The elements of an output activation map are referred to as “activation output elements”. Additionally, the phrase “max-pooled output map” is used to refer to the output of a max-pooling operation. [0048] The term “stride” has its ordinary meaning in the art refers to a mapping of one or more filters to one or more portions of one or more input feature maps. The phrase “stride tensor” is used to refer to a collection of consecutive strides. For example, a “stride tensor” may include a first stride, a second stride, a third stride, and a fourth stride The concept of a stride tensor is further discussed and explained below.

[0049] As used in this document, the phrase “counter output map” refers to a map that informs a compiler of which strides of a filter over an input feature map are to be computed. A counter output map is an example of a maximum prediction value position map which provides a positions of strides associated with maximum prediction values (e g. a maximum counter values, discussed below).

[0050] A max-pooling operation in deep neural networks (DNN) can be seen in most state- of-the art DNN’s whose function is to filter out the maximum value from a max-pool window of X by X elements (where “X” is the number of pixels or activation elements in an output activation map). Max pooling has various purposes, such as for example down-sampling tensor size, reducing complexity of a DNN, promoting scale/rotation invariance, reducing the problem of over-fitting, and other uses. The larger the max-pool window, the larger the data reduction. For example, if a max-pool window has X elements and X is 16, then one maximum element is selected and 15 are discarded. However, if X is 4, then when one maximum element is selected 3 are discarded.

[0051] Referencing Fig. 1, a max-pooling operation in a portion of a convolution layer 100 is illustrated. An output activation map 101 includes a max-pool window 105 over activation output elements 104A, 104B, 104C, and 104D. Activation output elements 104A, 104B, 104C, and 104D were produced via multiply and accumulate (MAC) operations 103A, 103B, 103C, and 103D. In a max-pooling operation, the maximum of activation output elements 104A - 104D is selected. And although Fig. 1 does not indicate the numerical values of activation output elements 104A - 104D, it is depicted that activation output element 104A is the maximum element that is transmitted to the next layer. And activation output elements 104B - 104D are discarded. The computational resources used to perform MAC operations 103B - 103D are wasted. As can be seen, max-pool window 104C has 4 elements

[0052] For resource-constrained devices, MAC operations are computationally expensive. The wasting of MAC operations 103B - 103C is a waste of computational resources that is not desirable for resource-constrained devices That is, when performing max-pooling, the computing device transmits only a maximum value of a max-pooling window to the next layer, while other values inside the max-pooling window are dropped. Thus, some of the MAC operations were, in hindsight, unnecessary. Therefore, there is a need for a convolution layer that avoids computing unnecessary MAC operations.

[0053] Another view of a conventional max-pooling operation is provided with reference to Fig. 2. Fig. 2 depicts a conventional convolution layer 200 that uses max-pooling. Convolution layer 200 includes input feature map 202 (e.g. an image or audio data) Input feature map 202 includes 3 channels, that is channel 203 A in back, channel 203B in the middle, and channel 203 C in front. Thus, input feature map 202 has 3 channels, a height, and a width, which are indicated as 3 x H x W.

[0054] Convolution layer 200 further includes two sets of filters 213A, 213B. A given filter, e g. 213B, includes channels 215A, 215B, 215C. The filters also have 3 channels, a height, and a width, although this height and width are different than the height and width of the input feature map 202 For these particular features, the number of channels and the height and width are indicated as 3 x 3 x 3. That is 3 channels, a height of 3 elements (not shown), and a width of 3 elements (not shown). A given filter, e.g. 213B, includes, for example, channels 215A, 215B, 215C.

[0055] A convolution operation 218 applies filters 213 A, 213B to input feature map 202 resulting in output activation map 101. Because there were two sets of filters 213A, 213B, output activation map 101 has two channels 205A, 205B. With the use of zero padding (e g. to maintain same height and width) for convolution operation 218, output activation map 101 has the same height and width as input feature map 202. Output activation map 101 is shown with activation output element 104A with a value of 24, activation output element 104B with a value of 2, activation output element 104C with a value of 0, and activation output element 104d with a value of 12.

[0056] A max-pooling operation is indicated via window 105A, shown applied to activation output elements 104A - 104D of output activation map 101. The same window is as shown window 105B separate from output activation map 101. Referencing window 105A, shown are values 24 for activation output element 104 A, 2 for activation output element 104B, 0 for activation output element 104C, and 12 for activation output element 104D. Of these the maximum value is 24 of activation output element 104A which is carried forward as max-pooled element 21 1 of max-pooled output map 209. The remaining activation output elements 104B, 104C, and 104D are discarded. The computational resources that were used for the MAC operations to generate activation output elements 104B - 104C are wasted. Once again it is seen that there is a need for a convolution layer that avoids computing unnecessary MAC operations.

[0057] Some embodiments, use a prediction algorithm to avoid computing unnecessary MAC operations whose output will be discarded at the max-pooling operation. For example, referencing window 105A, in these embodiments it is predicted that activation output element 104A would be selected in the max-pooling operation and based on this prediction, activation output elements 104B - 104C (and their attendant values) are never computed. In some embodiments the prediction takes place before the convolution operation, for example before convolution operation 218 of Fig. 2. The prediction algorithm enables resource-constrained devices to utilize CNN’s while conserving their limited computing resources. Thus, devices such as smart phones and smart camera’s are enabled to perform image recognition, such as for example facial recognition, while not exceeding their computing capabilities. These same device are further enabled to perform audio recognition, such as for example speech recognition. As discussed in more detail below, in some experiments, computational power consumption (as distinguished from memory read/write power) was reduced by a factor of between three and four.

[0058] In some particular embodiments, the above prediction makes use of a P (beta) parameter from batch normalization. As previously discussed, the P parameter is a learnable parameter that is learned during a training phase of a CNN The parameter is thus learned (and hence calculated) during the training of the CNN and are not modified during inference.

[0059] In some embodiments, the prediction algorithm is applied during inference and not during training. In these embodiments, the prediction algorithm therefore does not bear any training load involved with calculating the parameter. In some embodiments however, the prediction algorithm is applied during training.

[0060] A first version of the prediction algorithm is presented first. This first version makes use of the P parameter, while a second version discussed below does not. In particular this first version utilizes Pi which is the beta parameter for the ith channel of the input feature map and therefore, the p parameter without an i subscript is the beta parameter for the entire input feature map. For example, referencing Fig. 6B, an input feature map 202 has channel 203A associated with Pl (682A), channel 203B associated with 2 (682B), and channel 203 C associated with P3 (682C). In this example “i” could be either 1, 2, or 3 referring to Pl (682A), 2 (682B), or P3 (682C).

[0061] There are at least two types of parameters: (1) P parameter trained on the convolution output of a prior convolution layer; or (2) P parameter trained on the convolution output of the current convolution layer (this option requires that the input feature map channel size and the number of filters in the current convolution operation be the same), p parameters and their association with either a prior convolution layer or a current convolution layer is discussed further below with reference to Fig. 5.

[0062] Another variable is uf which is the mean of a filter, such as for example, the mean of all elements of filter 213. For example, referencing Fig. 6C, pf (646) is the mean of the three channels 215A, 215B, and 215C of filter 213.

[0063] In some embodiments, the number of strides in a stride tensor is equal to the number of elements in a max-pooling window. That is, if a prediction algorithm is predicting the effects of a max-pooling operation that uses a max-pooling window of 4 elements, then a stride tensor would have 4 strides. Stated more generally, if a prediction algorithm is predicting the effects of a max-pooling operation that uses a max-pooling window of X number of elements, then a stride tensor would have X number of strides.

[0064] Preliminarily, both the First Prediction Algorithm and the Second Prediction Algorithm are specific embodiments and are not intended to be limiting. Once having become familiar with these algorithms, those or ordinary skill in the art may appreciate imagine ways in which these algorithms may be modified and still serve their purpose.

[0065] The first prediction algorithm is as follows:

FIRST PREDICTION ALGORITM

Prior to convolution -

1. For each filter: a. Set output counter map =0_SiZe(ouiput_map)[stores the position of the predicted strides for conv MAC ops] b. For " stride tensor (4 strides at a time incase of 2x 2 Max Pool window) G set of strides (during 2D-con volution): i. Initialize counterlist forthat stride_tensor [] [list to save and compare counter values of 4 elements (incase of 2x 2 Max Pool window) in order to get max element candidate position] ii For each stride in stride tensor:

A. Set counter_stnde =0

B. For each element G stride:

If sign(Activation_cicmcnt) == sign(Weightcicmcnt) :

counterstride +=2 else: counterstride +=1

C. Append counterii_St for that stride_tensor with counterstnde lii. output counter map = (x,y) position of Max (counter list for that stride)

2. ONLY evaluate (MAC) those stride positions that are pointed to by the output counter map tensor.

[0066] As can be seen a computing device (e.g. 300 or 400) performs the entire First Prediction Algorithm prior to a convolution operation. The First Prediction Algorithm begins in a filter loop that is performed for each filter. Initially the filter loop initializes an counter output map, output_counter_map, which stores the position of strides to be computed in one or more stride tensors, is initialized to zero. Then, for each stride tensor the computing device executes a stride tensor loop inside the filter loop. The stride tensor loop includes initializing a counter list structure, counter list. An inner stride loop for each stride of the stride tensor executes inside the stride tensor loop. This document discusses strides and their position in more detail relative to Figs. 6G - 61 below.

[0067] In some embodiments, the inner stride loop first initializes a counter, counter stride, and then tests whether an element (an activation element) that is a member of the current stride (an element of channel i of the input feature map) has the same sign (i.e., positive or negative) as a weight element of channel i of the filter used to perform the current stride. If the signs are the same, the algorithm proceeds to the next test. If the signs are not the same, the next test is not performed and the counter is not incremented. An example of this test is shown in Fig. 6D, in which activation element 668 of input feature map 202 is compared to weight element 643 of filter 213 to determine if they are of the same sign.

[0068] The inner stride loop next tests (1) whether the activation element of the input feature map is greater than pi, the beta parameter for the current channel i and (2) whether the filter element is greater than pf (discussed above). If both of the above evaluate to true, the counter is incremented by two. Otherwise the counter is incremented by one. In some embodiments, the pf comparisons with weight elements are precomputed and stored

[0069] The inner stride loop repeats until all activation elements of the stride are processed as described above. When the inner stride loop exits, then the in the outer stride tensor loop for the final counter value (e.g. final value for counter_stride) for that stride just processed is added to a counter list structure which stores all of the final counter values for each stride in the stride tensor. Each final counter value in the counter list structure has a position. These position are shown relative to Fig. 6E which shows counter list structure 646 which indicates positions for four final counter values, which are [1, 1], [1, 2], [2, 1], and [2, 2], When the above is completed, the outer stride loop processes the next stride and a new inner stride loop begins. This continues until all strides of the current stride tensor are processed.

[0070] Once all the strides of the current stride tensor are processed, the outer loop completes and the output_counter_map is updated using the counter list for the current stride tensor. Part of that update is computing the maximum counter value in the counter list structure. The position of the stride corresponding to the maximum counter value is then stored in the output_counter-map. For example, moving forward to reference Fig. 6A, a counter list structure 646A for a first stride tensor has final counter values (e.g. elements) 27 at position [1, 1], 10 at position [1, 2], 2 at position [2, 1], and 14 at position [2, 2], The maximum of the final counter values is 27 at position [1, 1], Therefore, the position [1, 1] (e.g. in (x, y) format) is stored in positional element 661A of counter output map 688.

[0071] In cases in which two or more final counter values in the counterlist have same maximum counter value, then all of the positions of the strides corresponding to the tying countervalues are entered into output_counter_map. For example, referencing Fig. 6F, a counter list structure 646 has positions [1, 1] and [1, 2] which we take as being associated with equal final counter values. That is [1 , 1 ] = [1 , 2], The result is that initially both are stored in an counter output map as a set: ([1, 1], [1, 2]). Then during convolution, both of these two positions are convoluted with MAC operations to yield results SI and S2. The greater of these is retained and the lesser one is then discarded.

[0072] The algorithm then moves to the outermost filter loop which repeats as long as there is an unprocessed filter. Once the filter loop is completed, the algorithm then moves to a modified convolution operation which uses the counter output map, output_counter_map, to determine which strides of each stride tensor to convolute and which to ignore.

[0073] For the above-discussed tying strides, in the convolution operation the MAC operations for those tying strides are computed and the value for the stride associated with the greatest computed value are then kept and the others discarded.

[0074] The second prediction algorithm is similar except that it does not include a comparison with the p parameter. The second prediction algorithm is as follows:

SECOND PREDICTION ALGORITHM

Prior to convolution -

1. For each filter: a. Set output counter map =0_Size(output_map)[ stores the position of the predicted strides for conv MAC ops] b. For each stride tensor (4 strides at a time incase of 2x 2 Max Pool window) 6 set of strides (during 2D-convolution): i. Initialize counter list for that stride_tensor = [] [list to save and compare counter values of 4 elements (incase of 2x 2 Max Pool window) in order to get max element candidate position] ii. For each stride in stride tensor:

A. Set counter_stnde =0

B. For each element e stride:

If sign(Activationelement) == sign(Weighteiement) :

else: counterstride +=1 C. Append counter list for that stridc_tcnsor with eounterstridc iii. output counter map = (x,y) position of Max (counter hst for that stride)

[0075] As discussed above, the Second Prediction Algorithm is the same as the first except that there are no comparisons with the 0 parameter in the inner stride loop.

[0076] Referencing Fig. 3, a resource-constrained device, such as smart camera 301 provides an environment in which some embodiments may be at least partially practiced. Smart camera 300 includes lens 302 coupled or integral with housing 310. Housing 310 includes a processing device such as CPU 308. Housing 310 further includes artificial intelligence resources such as for example Al engines 3O5A - 305D, ethernet interface 309, compression circuitry 307, and image processor 306, each of which is communicably linked with CPU 308. An optical filter 303 and an image sensor 304 are provided and are positioned to capture and process light entering lens 302. As has been discussed above, smart camera 301 is just one example of a resource- constrained device.

[0077] Referencing Fig. 4, controlling circuitry 400 for a computing device, such as for example, resource-constrained device 301, includes one or more processing devices 406, links for receiving inputs 402 and for transmitting outputs 404. Controlling circuitry further includes memory 408.

[0078] In some embodiments, one or more processing devices 406 include, for example, at least one of a central processing units (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a memory storing executable code, another form of logic, or any combination of the foregoing.

[0079] In some embodiments, memory 408 includes at least one of a random-access memory, a volatile memory, a non-volatile memory, a hard drive, a flash memory, a portable memory, or other memory.

[0080] Memory includes data 410 and executable circuitry 450. In some embodiments, executable circuitry 450 includes a variety of circuits. Even though Fig. 4 shows these circuits as being within memory 208 as part of executable circuitry 450, in some embodiments, at least part of the functionality may be outside of executable circuitry 450. Tn some embodiments, one or more circuitries includes data. When the words “circuitry” or “circuitries” are used, they may refer to any of the above, all of the above, or any combination of the above, without limitation.

[0081] Further referencing Fig 4, in some embodiments, data includes at least one or more of a input feature map 412, fdters 414, a counter list structure 416, a counter value 418, weight elements 420, activation elements 422, p0 424 (from a previous convolution layer), pl 426 (from the current convolution layer), pf (fdter mean) 428, a counter output map 430, stride tensor data 432 (e g. data defining and/or indicating stride position, constituent strides, elements, etc.), stride data 434 (e.g. data defining and/or indicating stride position, constituent elements, etc.), prediction values 436 (e g. final counter values).

[0082] In some embodiments, executable circuitry 450 includes at least one or more of prediction value determination circuitry 452 (e.g. for testing if activation elements and filter elements of a stride are the same sign, for comparing activation elements and P, for comparing filter elements to pf, and for updating counter), maximum prediction value determination circuitry 464, counter output map update circuitry 466, modified convolution operation circuitry 468 and counter 470. prediction value determination circuitry 452 optionally includes one or more of sign equality determination circuitry 454, P compare to first element circuitry 456, mean (pf) compare to second element circuitry 458, and counter update circuitry 460. Modified convolution operation circuitry 466 optionally includes counter output map circuitry 468.

[0083] Referencing Fig. 5, a schematic diagram shows a portion of a convolutional neural network 550, consistent with some embodiments. CNN 550 includes a previous modified convolution layer X-l (502A) and a current modified convolution layer X (502B). Previous modified convolution layer X-l (502A) appears before current modified convolution layer X (502B) in the flow of CNN 550.

[0084] Previous modified convolution layer X-l (502A) includes, in some embodiments, at least a prediction operation 552A, which is an operation that implements steps la and lb of First Prediction Algorithm and/or steps la and lb of Second Prediction Algorithm. An output of prediction operation 552A is a counter output map (see e.g. counter output map 688 of Fig. 6A below).

[0085] Previous modified convolution layer X-l (502A) further includes at least a modified convolution operation 554A. Modified convolution operation 554A, in some embodiments, accepts as input an input feature map (e.g. 202), one or more filters (e.g., 213 A, 213B), and the counter output map generated in the prediction operation 552A. In some embodiments, modified convolution operation 554A implements step 2 of First Prediction Algorithm and/or Second Prediction Algorithm and outputs a modified output activation map (see, e g. modified output activation map 687 of Fig. 6A below).

[0086] Previous modified convolution layer X-l (502A) further includes at least a batchnormalization operation 556A that accepts as input the output of modified convolution operation 554A. When the CNN 550 is in training mode, batch normalization operation 556A computes and trains at least a P parameter for each channel (e.g. for each channel i). With respect to prediction operation 552B discussed below, the P parameter is a P parameter associated with a previous convolution layer - that is previous modified convolution layer X - l (502A).

[0087] Previous modified convolution layer X-l (502A) further includes at least an activation (ReLU) operation 558A. Activation (ReLU) operation 558A accepts as input at least the output of batch normalization operation 556A. Although illustrated as rectified linear unit (ReLU), in some embodiments the activation function used is another type of activation function, such as a sigmoid activation function or a tanh activation function.

[0088] Previous modified convolution layer X-l (502A) does not include a max -pooling operation because a reduced modified output activation map (e g. 687) output by modified convolution operation 554A is at least similar in data size to a maxed-pooled output map 209.

[0089] Current modified convolution layer X (502B) includes at least a prediction operation 552B, which is an operation that, in some embodiments, implements steps la and lb of First Prediction Algorithm and/or steps la and lb of Second Prediction Algorithm. An output of prediction operation 552B is a input counter map (see e.g. counter output map 688 of Fig. 6A below).

[0090] Current modified convolution layer X (502B) further includes at least a modified convolution operation 554B Modified convolution operation 554B accepts as input an input feature map (e.g. 202), one or more filters (e.g., 213A, 213B), and the counter output map generated in the prediction operation 552B. In some embodiments, modified convolution operation 554B implements step 2 of First Prediction Algorithm and/or Second Prediction Algorithm and outputs a modified output activation map (see, e.g. modified output activation map 687 of Fig. 6A below). [0091 ] Current modified convolution layer X (502B) further includes at least a batchnormalization operation 556B that accepts as input at least a modified output activation map (e.g. 687) output from modified convolution operation 554B. When the CNN 550 is in training mode, batch normalization operation 556B computes and trains a P parameter. With respect to prediction operation 552B discussed below, P parameter is a P parameter associated with the output of a current convolution layer X (502B).

[0092] Current modified convolution layer X (502B) further includes at least a activation (ReLU) operation 558B. Activation (ReLU) operation 558B accepts as input at least the output of batch-normalization operation 556B. Although illustrated as rectified linear unit (ReLU), in some embodiments the activation function used is another type of activation function, such as a sigmoid activation function or a tanh activation function.

[0093] Current modified convolution layer X (502B) does not include a max-pooling operation because a reduced modified output activation map (e g. 687) output by modified convolution operation 554B is at least similar in data size to a maxed-pooled output map (e.g. 209).

[0094] Referencing Fig. 6A, a modified convolution layer 522 includes input feature map 202 and filters 213A, 213B and their attendant structures which have been described relative to Fig. 2. That discussion is not repeated. As discussed relative to Fig. 5, modified convolution layer includes a prediction operation which, in some embodiments executes steps la and lb of First Prediction Algorithm and/or Second Prediction Algorithm, and which produces counter output map 688.

[0095] Counter output map 688 contains two channels 604 A, 604B because of the two sets of filters 213A, 213B. Counter output map 688 includes positional information identifying stride positions (e.g. positions associated with strides) to be convoluted via MAC operations in a modified convolution operation. This positional information is included in positional elements 661A, 661B, which indicated a position of a stride to be computed.

[0096] Further referencing Fig 6 A, inputs to counter output map 688 are now discussed The inputs include data from a first counter list structure 646A associated with a first stride tensor (e.g. 672A of Fig. 6H below) and from a second counter list structure 646 B associated with a second stride tensor (e g. 672B of Fig. 61 below) First counter list structure 646A includes final counter values 27 (647A) at position [1, 1], 10 (647B) at position [1, 2], 2 at position [2, 1], and 14 at position [2, 2], Second counter list structure 646B includes final counter values 12 (647C) at position [1, 1], 5 at position [1, 2], 30 at position [2, 1], and 10 at position [2, 2],

[0097] For each stride counter list structure a maximum counter value is computed. These maximum counter values are 27 at position [1, 1] for first counter list structure 646A and 30 at position [2, 1] for second counter list structure 646B. The position for the maximum value 27 at position [1, 1] is contained in positional element 661A as a max value position 670A which indicates the position [1, 1], The position for the maximum value 30 at position [2, 1] is contained in positional element 661B as a max value position 670B which indicates the position [2, 1],

[0098] Further referencing Fig. 6A, a modified convolution operation (e.g. such as modified convolution operation 554A or 554B) applies counter output map 688 to input feature map 202 to identify the stride positions of input feature map 202 to be convoluted via MAC operations. The stride positions not identified in counter output map 688 are not convoluted, thus saving unnecessary MAC operations. The result is modified output activation map 687 which includes channels 606A, 606B. Because unnecessary stride positions are not convoluted, modified output activation map 687 has fewer activation output elements and has dimensions 2 x H/2 x W/2, which indicates that modified output activation map 687 has half of the height and half of the width and is therefore % of the size of input feature map 202. Modified output activation map is the same size as max-pooled output map 209, thus the same result of reducing the data is achieved without calculating unnecessary MAC operations.

[0099] With modified output activation map 687 as input, an activation operation and a batch normalization operation are performed which results in convolution layer output map 690 which also has two channels 608A, 608B and dimensions 2 x H/2 x W/2.

[0100] Moving forward with reference to Fig. 6G, a group 649A of four strides are shown with filter 213 (with 3 channels 215A, 215B, 215C) is deployed against a input feature map 202 and its channels 203 A, 203B, and 203C. More specifically, channel 215A is deployed against channel 203 A, channel 215B is deployed against channel 203B, and channel 215C is deployed against channel 203 C.

[0101] In stride 1 (648), the filter 213 is deployed against the left edge and top edge of input feature map 202. In stride 2 (650), the filter 213 is deployed one activation element (shown in Fig. 6H) rightward of stride 1. In stride 3 (652), the filter 213 is deployed down one activation element (shown in Fig. 6H) from the top edge but against the left edge. In stride 4 (654), the filter is shifted rightward by one activation element (shown in Fig. 6H - see directional symbol) from stride 3, still one activation element below the top. Thus, each stride has a different position. The combination of these four strides is a stride tensor.

[0102] Referencing Fig 6H, shown are (1) input feature map 202 with stride tensor 672A (2) a max-pool window 105, and a filter 213. Turning first to stride tensor 672A, this stride tensor is defined by the strides 1 - 4 of Fig. 6G are shown in a two-dimensional view overlayed with activation elements shown. That is, for simplicity and because of the two-dimensional layout of Fig. 6H, the strides which occur in all three channels of input feature map 202 (see Fig. 6G), are only shown in one channel (channel 203C) of input feature map 202. The activation elements are shown in grid format, but for clarity only activation element 668 is indicated with a reference number. Activation element 668 is representative of the other unnumbered activation elements.

[0103] The combination of stride 1 (648), stride 2 (650), stride 3 (652), and stride 4 (654) form the first stride tensor 672A, which occupies a square of composed sixteen activation elements. First stride tensor 672A borders the left edge and the top edge of input feature map 202. Each stride of stride tensor 672a occupies 9 adjacent activation elements that define a square. As indicated in the legend, stride 1 (648) occupies the activation elements with l ’s, stride 2 (650) occupies the activation elements with 2’s, stride 3 (652) occupies the activation elements with 3’s, and stride 4 (654) occupies the activation elements with 4’s. That is, stride 1 occupies a square that is defined by nine activation elements that border the left edge and the top edge of input feature map 202. Stride 2 is shifted leftward (relative to stride 1) by the width of an activation element. Stride 3 is shifted downward (relative to stride 1) the height of an activation element. And stride 4 is shifted leftward (relative to stride 3) the width of an activation element. In general, a stride has the same height and width as the filter that is being applied to an input feature map.

[0104] Referencing Fig. 6A, and the first counter list structure 646A, the greatest final counter value is final counter value 647A (a greatest prediction value) with a value 27 and a position [1, 1] corresponds to stride 1 (648) of first stride tensor 672A of Fig. 6H. A second final counter value 10 (647B) at position [1, 2] corresponds to stride 2 (650) of the first stride tensor (672A), a third final counter value 2 at position [2, 1] corresponds to stride 3 (652) of the first stride tensor (672A), and a fourth final counter value 14 at position [2, 2] corresponds to stride 4 (654) of the first stride tensor (672A). [0105] Returning to reference Fig 6H, stride tensor 672A is shaped as a square with a height and a width of T, which in this example is the size of 4 activation elements. Filter 213 defines channels 215A, 215B, and 215C. Max-pool window 105 is also shaped as a square with a height and a width of P, which in this example is 2. In some embodiments, a prediction algorithm is being employed to predict the effect of using max-pool window 105 in a max-pool operation. Filter 213 is also shaped as a square with a height and a width of n, which in this example is the size of 3 weight elements (e.g. weight element 668C, which is representative). The variable c is indicated and c indicates the number of channels defined by the filter 213. In this example, c is 3 because there are three channels (213A - 213C). Assuming normal convolution, there is a relationship among T, P, and n, which is: T = [n + (p - 1)] x [n + (p - 1)] x c. For example, if P = 2, n = 3, and c =3, then T = 4 with 3 channels. As indicated above, because of the two-dimensional drawing of input feature map 202 and stride tensor 672A, only one channel of stride tensor 672A is shown in Fig. 6H.

[0106] Referencing Fig. 61, a second stride tensor 672B occupies a square defined by sixteen activation elements, the second stride tensor 672B being the same size and shape as the first stride tensor 672A, but shifted two activation elements rightward (see directional symbol). That is, the second stride tensor 672B is shifted a distance P rightward of the first stride tensor 672A. As discussed above, the distance P is measured in activation units. And the distance between the first stride tensor 672A and the second stride tensor 672B is P. This occurs where a prediction algorithm is being employed which is emulating a max-pool operation using a max-pool window that is P x P in size, such as max-pool window 105 of Fig. 6H.

[0107] Second stride tensor 672B includes stride 1 (656) occupying activation elements with l’s, stride 2 (658) occupying activation elements with 2’s, stride 3 (660) occupying activation elements with 3’s, and stride 4 (662) occupying activation elements with 4’s. That is, stride 1 occupies the nine activation elements beginning at the top edge of input feature map 202 and beginning the width of two activation elements from the left edge of input feature map 202. Stride 2 is shifted rightward (relative to stride 1) the width of an activation element. Stride 3 is shifted downward (relative to stride 1) the height of an activation element. And stride 4 is shifted leftward (relative to stride 3) the width of an activation element.

[0108] Referencing Fig. 6A, and the second counter list structure 646B, the greatest final counter value 647D (a greatest prediction value) with a value 30 and a position [2, 1] corresponds to stride 3 (660) of second stride tensor 672B of Fig. 6T. Another final counter value 12 (647C) at position [1, 1] corresponds to stride 1 (656) of the second stride tensor (672B), an additional final counter value 5 at position [1, 2] corresponds to stride 2 (658) of the second stride tensor (672B), and a fourth final counter value 10 at position [2, 2] corresponds to stride 4 (662) of the second stride tensor (672B).

[0109] Therefore, for each stride tensor a counter list structure is generated with a final counter value for each stride in the stride tensor. That is, each stride tensor is associated with a counter list structure identifying the final counter values for each associated stride.

[0110] Referencing Fig. 12 a data structure 1246 includes notations s(l,l), s(l,2), s(2, 1), s(2,2) which represent respectively, positions of prediction values associated with a first stride, a second stride, a third stride, and a fourth stride of a stride tensor. That is, the above notations represents stride positions. That is, the strides of a stride tensor are identified by position. For example, the strides 1, 2, 3, and 4 of stride tensor 672A of Fig. 6H, may be referred to, respectively as s(l, 1) for stride 1, s(l,2) for stride 2, s(2, l) for stride 3, and s(2,2) for stride 4. These notations are alternatives for other notation used in this document. For example s(l , 1) = [1,1]; s(l,2) = [1,2], s(2, 1) = [2, 1], and s(2, 2) = [2, 2], Other notations may also be used. For example, for identifying, each stride position with a number. For example, s(l , 1) = [1,1] = 1; s(l,2) = [1,2] = 2, s(2,l) = [2, 1] = 3, and s(2, 2) = [2, 2] = 4.

[0111] Regardless of the notation, the above stride positions may be converted to memory addresses and stored from and retrieved from memory according to these memory addresses. For example, in some embodiments, the stride positions may be stored in row major format. And in other embodiments, may be stored in column major format.

[0112] Figs. 7 - 10B illustrate exemplary methods that are capable of being performed in one or more of the physical environments illustrated in other drawings. However, the exemplary methods are not limited to the disclosed physical environments and may be performed in a variety of other physical environments. In addition, although the exemplary methods have steps or operations that are illustrated as being performed in certain orders or sequences, it should be understood that at least some of the illustrated steps and orders may be performed in different orders or sequences or may be performed concurrently. Additionally, not all embodiments require all steps or operations, as will be apparent to those of skill in the art, some steps or operations are optional in some embodiments. [01 13] Referencing Figs 7A and 7B, a method 700 of predicting maximum stride positions (e.g. those that would be selected in a hypothetical max-pooling operation) and then only computing MAC operations for the predicted maximum stride positions. Method 700 starts at operation 702 which accepts inputs. In some embodiments the inputs includes at least one of a input feature map or one or more fdters. In some embodiments, the inputs further include at least one of a p parameter or a mean (pf) of at least one of the one or more fdters (e.g. a mean of all channels of a fdter). If there are multiple fdters, then a separate mean (pf) for the respective fdters.

[0114] Control moves to operation 704 which begins a fdter loop that is performed for each fdter until there are no unprocessed fdters. This loop extends to operation 726, where control returns to operation 704 as long as there is at least one unprocessed fdter. In some embodiments, the fdter loop begins by initializing an counter output map.

[0115] Control moves to operation 706 which begins a stride tensor loop that is performed for each stride tensor until there are no unprocessed stride tensors. This stride tensor loop extends to operation 724, where control returns to operation 706 as long as there is at least one unprocessed stride tensor. In some embodiments, the stride tensor loop begins by initializing a counter list structure for the current stride tensor.

[0116] Control moves to operation 708 which begins a stride loop that is performed for each stride of the current stride tensor (e.g. the current stride tensor in the loop between operations 706 and 724) until there are no unprocessed strides of the current stride tensor. This loop extends to operation 718, where control returns to operation 708 as long as there is at least one unprocessed stride of the cunent stride tensor. In some embodiments, the stride loop begins by initializing a counter.

[0117] Control moves to operation 710 which begins an activation element loop that is performed for each activation element of the current stride (e.g. the current stride in the loop between operations 708 and 718) until there are no unprocessed activation elements of the current stride. This loop extends to operation 714, where control returns to operation 710 as long as there is at least one unprocessed activation element of the current stride.

[0118] Control moves to operation 712 which tests the current activation element and determines whether to update a counter (e.g. counter 470), and if so, the amount of the counter update. In some embodiments process, operation 712 determines if the current activation element and a corresponding weight element in the one or more fdters are of the same mathematical sign (e g both positive or both negative) The “corresponding” weight element is a weight element of the current filter that would be applied against the current activation element in a convolution operation. The “current filter” is the filter currently being processed by the filter loop. In some embodiments there is a filter loop counter (not shown) that provides indexes for the filters as they are processed by the filter loop and the current filter would be addressable by one of those indexes. If the current activation element and the corresponding weight element are not of the same mathematical sign, then operation 712 does not perform a counter update and control moves to operation 714.

[0119] If the current activation element and the corresponding weight element are of the same mathematical sign, then in some embodiments further testing is performed to determine the amount of a counter update. And then operation 712 updates the counter accordingly. Further specific embodiments are discussed below relative to Figs. 8A and 8B.

[0120] Control moves to operation 714 which determines if an additional activation element remains for processing If yes, then control returns to operation 710 to process an additional activation element. Each iteration of the activation element loop uses the same counter. And if the counter is updated, it is updated relative to the last time the counter was updated in this loop. For example, the counter may initially be initialed to zero. In a first loop for a first activation element it may be determined to update the counter by incrementing it to 1 . In a second loop with a second activation element, it may be determined to update the counter by incrementing the counter by 2, which results in the counter being a 3. The loop may then be executed 15 more times with the counter being incremented numerous times with these increments totaling 24. Because the counter was previously at 3, the counter is then at 27 Afterward, if the loop is exited because there are no further activation elements to process, then the final counter value is 27.

[0121] When there are no further activation elements to process, then operation 714 evaluates to no, and control moves to operation 716 which updates a counter list structure with the final counter value, which in the above example is 27. For example, referencing counter list structure 646A of Fig. 6A, the final counter value 27 (647A) is entered in the upper left position in counter list structure 646A That is, at position [1, 1],

[0122] Control moves to operation 718 which determines if there is at least one stride remaining to process. If yes, the counter (e.g. 470) is reinitialized and control returns to operation 708 to process another stride. If no, the stride loop is exited and control moves to operation 720. Exiting this loop means that all strides associated with the current stride tensor have been processed.

[0123] Operation 720 determines the maximum value in the counter list structure for the current stride tensor. For example, referencing counter list structure 646A of Fig. 6A, there are four final counter values (e.g. one for each of the strides in a first stride tensor). The maximum is the maximum value position (670A) is the number 27 in position [1, 1], which is final counter value 647 A of counter list structure 646 A.

[0124] Control moves to operation 722 which updates a counter output map with the maximum value from operation 720. For example, referencing out counter map 688 of Fig. 6A, positional element 661 A is updated with the position of the maximum value position 670A which (as discussed above relative to operation 720) is position [1, 1], In some embodiments, completion of operation 722 means that the current stride tensor has been completely processed.

[0125] Control moves to operation 724 which determines if there is at least one stride tensor that still needs to be processed. If yes, control returns to operation 706 for another iteration of the stride tensor loop. If no, then the stride loop is exited and control moves to operation 726. In some embodiments, exiting from the stride tensor loop means that processing of the current filter is completed and that the current counter output map is completed and ready to use.

[0126] Operation 726 determines if there is at least one more filter that still needs to be processed. If yes, control moves to operation 704 and another iteration of the filter loop. If no, the filter loop is exited and control moves to operation 728. Completion of the filter loop completes a prediction operation, such as for example, prediction operation 552A or 552B of Fig. 5.

[0127] Operation 728 executes a modified convolution operation. A modified convolution operation 554A or 554B is discussed above relative to Fig. 5 and that discussion is not repeated.

[0128] Control moves to operation 730 which performs a batch normalization operation, such as for example, batch normalization operation 556A or 556B which is discussed above relative to Fig. 5 That discussion is not repeated.

[0129] Control moves to operation 732 which performs an activation operation, such as for example, activation operation 558A or 558B which is discussed above relative to Fig. 5 That discussion is not repeated. The method 700 then terminates.

[0130] Referencing Fig. 8A, a particular implementation of operation 712 of method 700 is described. In particular, operation 810 includes update option 1, which starts with operation 812 which determines if the current activation element and the corresponding weight element have equal mathematical signs. If the signs are not equal, control moves to operation 814 and there is no counter update and control moves to operation 714 of Fig. 7, which is discussed above. If the signs are equal, then control moves to operation 816.

[0131] Operation 816 determines if the corresponding weight element (the second element) is greater than a mean, such as for example, mean (pf) which is discussed above relative to Fig. 6C. If yes, control moves to operation 818 and the counter is updated a greater amount, such as for example, 2. If no, control moves to operation 820 and the counter is updated a lesser amount (relative to the greater amount), such as for example 1.

[0132] Referencing Fig. 8B, a particular implementation of operation 712 of method 700 is described. In particular, operation 850 includes update option 2, which starts with operation 812 which determines if the current activation element and the corresponding weight element have equal mathematical signs. If the signs are not equal, control moves to operation 814 and there is no counter update and control moves to operation 714 of Fig. 7, which is discussed above. If the signs are equal, then control moves to operation 852.

[0133] Operation 852 determines (1) if the current activation element (the 1st element) is greater than a Pi parameter (discussed above in connection with the First Prediction Algorithm) and (2) if the corresponding weight element (the 2nd element) is greater than a mean, such as for example, mean (pf) which is discussed above relative to Fig. 6C. If yes, control moves to operation 818 and the counter is updated a greater amount, such as for example, 2. If no, control moves to operation 820 and the counter is updated a lesser amount (relative to the greater amount), such as for example 1.

[0134] Referencing Fig. 9, a method 900 begins with operation 902 which accepts inputs. In some embodiments, operation 902 includes at least accepting as inputs at least an input feature map (e.g. 202) having one or more first elements (e.g. activation elements 668) and at least one or more filters (e.g. 213) having one or more second elements (e g. weight elements 643). In some embodiments, operation 902 is performed at least in part with input acceptance circuitry 451 .

[0135] Control moves to operation 904 which computes a first prediction value (e.g. final counter value 647A) for a first stride (e.g. 648). In some embodiments, operation 904 includes at least for a first stride (e.g. 648) associated with a stride tensor (e.g. first stride tensor 672A), computing at least a first prediction value (e.g. final counter value 647 A) associated with the first stride, wherein a stride includes at least a mapping of the one or more second elements onto a group of one or more first elements and a stride tensor (e.g. first stride tensor 672A) includes at least a group of strides (e.g. strides 648, 650, 652, 654) and wherein a prediction value predicts a magnitude of an output value (e.g. such as value of activation output element 104A) for a convolution operation associated with the first stride. In some embodiments, operation 904 is performed at least in part with prediction value determination circuitry 451 .

[0136] Control moves to operation 906 which computes at least a second prediction value for at least a second stride (e.g. 650). In some embodiments, operation 906 includes at least for at least a second stride (e.g. 650) of the stride tensor (e.g. first stride tensor 672A), computing at least a second prediction value (e.g. another final counter value 647B) associated with the at least a second stride, wherein the first prediction value and the at least a second prediction value define a group of prediction values (e.g. final counter values 27, 10, 2, 14 of counter list structure 646A) associated with the stride tensor. In some embodiments, operation 906 is performed at least in part with prediction value determination circuitry 451 .

[0137] Control moves to operation 908 which determines a greatest prediction value (e.g. final counter value 647A of Fig. 6A with value of 27) of a group of prediction values (final counter values 27, 10, 2, and 14 of counter list structure 646A). In some embodiments, operation 908 includes at least determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor (see. e g. Fig. 6A, 6H, wherein the greatest final counter value 647A (a greatest prediction value) with a value 27 corresponds to stride 1 of first stride tensor 672A of Fig. 6H). In some embodiments, operation 908 is performed at least in part with maximum prediction value determination circuitry 464.

[0138] Control moves to operation 910 which executes a modified convolution operation. In some embodiments, operation 910 includes at least executing a convolution operation (e.g. modified convolution operation (e.g. 554A or 554B) only for the selected stride of the stride tensor, the executing performed by utilizing at least the input feature map and the one or more filters. In some embodiments, operation 910 is performed at least in part with modified convolution circuitry 466.

[0139] Referencing Fig. 10A, operation 902 optionally includes operation 912 which accepts as the input feature map an input feature map that includes one or more first channels (e g. channels 203A - 203C of input feature map 202) and further accepts as the one or more filters, one or more filters that define one or more second channels (e.g. channels 215A - 215C of filter 213B), wherein the one or more first channels include at least the one or more first elements (e.g. activation elements such as for example activation element 668) and the one or more second channels include at least the one or more second elements (e.g. weight elements such as for example weight element 643). In some embodiments, operation 912 is performed at least in part with input acceptance circuitry 451.

[0140] Further referencing Fig. 10A, operation 904 optionally includes operation 914 in which at least some of the one or more first elements (e.g. activation elements, such as for example 668) are associated with respective numerical values (e.g. activation element 668 of Fig. 6D with value of +12) and at least some of the one or more second elements are associated with respective numerical values (e.g. weight element 643 of Fig. 6D with value of +13). And in which computing at least the first prediction value includes at least determining if a first sign and a second sign are equal (e.g. activation element 668 of Fig. 6D is a positive 12 and weight element 643 of Fig. 6D is also positive - a positive 13), wherein the first sign is associated with a first number that is associated with a given first element of the group of first elements and the second sign is associated with a second number associated with a given second element of the one or more second elements. In some embodiments, operation 914 is performed at least in part with sign equality determination circuitry 454.

[0141 ] Further referencing Fig. 10A, operation 914 optionally includes operation 916 in which computing the first prediction value further includes at least (1) incrementing a counter value (e.g. counter 470 of Fig. 4) responsive to a determination that the first sign and the second sign are equal and (2) otherwise not incrementing the counter value. In some embodiments, operation 916 is performed at least in part with counter update circuitry 460.

[0142] Further referencing Fig. 10A, operation 916 optionally further includes at least operation 918 in which incrementing a counter value responsive to a determination that the first sign and the second sign are equal includes a least determining if the second number (e.g. weight element 643 associated with number +13) is greater than a mean numerical value (e g. mean pf) 646 of Fig. 6C) that is a mean of all numerical values associated with the one or more second elements. In some embodiments, operation 918 is performed at least in part with mean compare to second element circuitry 458. [0143] Operation 918 optionally includes operation 920 in which incrementing a counter value responsive to a determination that the first sign and the second sign are equal further includes at least incrementing the counter value by a first value (e.g. +2) responsive to a determination that the second number is greater that the mean numerical value or incrementing the counter value by a second value (e.g. +1) responsive to a determination that the second number is not greater than the mean numerical value, wherein the second value is less than the first value. In some embodiments, operation 920 is performed at least in part with counter update circuitry 460.

[0144] Further referencing Fig. 10A, operation 916 optionally further includes at least operation 924 in which incrementing a counter value responsive to a determination that the first sign and the second sign are equal includes at least (1) determining if the first number (e.g. +12 associated with activation element 668 if Fig. 6D) is greater than a batch normalization number (e g any of [31 682A, [32 682B, |33 682A of Fig. 6B) associated with batch normalization and (2) determining if the second number (e.g. +13 associated with weight element 643) is greater than a mean numerical value (e g. mean pf) 646 of Fig. 6C) that is a mean of all numerical values associated with the one or more second elements. In some embodiments, the batch normalization number is associated with output of either a current convolution layer or a previous convolution layer. And the input feature map is associated with at least the current convolution layer. In some embodiments, operation 924 is performed at least in part with [3 compare to first element circuitry 456 and mean compare to second element circuitry 458.

[0145] Operation 924 optionally includes operation 926 in which incrementing a counter value responsive to a determination that the first sign and the second sign are equal further includes at least (1) incrementing the counter value by a first value (e.g. +2) responsive to determining at least one of (1) that the first number is greater than the batch normalization number or (2) that the second number is greater that the mean numerical value, or (2) otherwise incrementing the counter value by a second value (e.g. +1), wherein the second value is less than the first value. In some embodiments, operation 916 is performed at least in part with counter update circuitry 460

[0146] Further referencing Fig. 10A, process 904 optionally includes operation 930 in which the first prediction value is a counter value and in which computing at least a first prediction value associated with the first stride includes at least (A) computing a first counter value that is associated with the first stride of the stride tensor at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing a previous counter value, (2) incrementing the previous counter value by a first amount, or (3) incrementing the previous counter value by a second amount that is less than the first amount, (B) computing a second counter value that is associated with the first stride of the stride tensor and that is computed at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing the first counter value, (2) incrementing the first counter value by a first amount, or (3) incrementing the first counter value by a second amount that is less than the first amount, (C) computing a third counter value that is associated with the first stride of the stride tensor and that is computed at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing the second counter value, (2) incrementing the second counter value by the first amount, or (3) incrementing the second counter value by the second amount, and (D) computing at least a fourth counter value that is associated with the first stride of the stride tensor and that is computed at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing the third counter value, (2) incrementing the third counter value by the first amount, or (3) incrementing the third counter value by the second amount. In some embodiments operation 930 is performed at least in part with the prediction value determination circuitry 452 and the counter update circuitry 460.

[0147] Referencing Fig. 10B, operation 906 optionally includes operation 931 in which computing at least a second prediction value includes at least (1) computing a second prediction value associated with a second stride of the stride tensor (e.g. counter value 10 (647B) associated with stride 2 (650) of the first stride (672A)), (2) computing a third prediction value associated with a third stride of the stride tensor, and (3) computing at least a fourth prediction value associated with at least a fourth stride of the stride tensor. In some embodiments, the group of prediction values includes at least the first prediction value, the second prediction value, the third prediction value, and the at least a fourth prediction value. In some embodiments, operation 931 is performed at least in part with prediction determination circuitry 452.

[0148] Again referencing Fig. 10B, operation 908 optionally includes operation 932 in which determining the greatest prediction value of the group of prediction values, with the greatest prediction value being associated with one selected stride of the stride tensor includes at least determining the greatest prediction value from among a first prediction value associated the first stride, a second prediction value associated with a second stride of the stride tensor, a third prediction value associated with a third stride of the stride tensor, and the at least a fourth prediction value associated with at least a fourth stride of the stride tensor (see e.g. greatest prediction value 27 associated with stride 1 and prediction values 10, 2, and 14, associated with strides 2, 3, and 4 respectively of first stride tensor 672A). In some embodiments, operation 931 is performed at least in part with prediction determination circuitry 452 In some embodiments, operation 932 is performed at least in part with maximum prediction value determination circuitry 464.

[0149] Operation 908 optionally includes operation 934 in which determining the greatest prediction value of the group of prediction values, with the greatest prediction value being associated with one selected stride of the stride tensor includes at least if there is a tie for the greatest prediction value among two or more prediction values of the group of prediction values, then selecting the two or more prediction values and including positional information regarding the selected two or more prediction values in a maximum prediction value position map (e.g. a counter output map) For example, if in Fig. 6F [1.1] and [1, 2] are tied and are both selected for counter output map, then the tie is broken based at least in part on the output of the convolution operation where S I < S2). In some embodiments, operation 934 is performed at least in part with maximum prediction value determination circuitry 464 and at least in part with modified convolution circuitry 466.

[0150] Operation 908 optionally includes operation 936 in which the greatest prediction value of the group of prediction values is a first greatest prediction value, with the selected stride being a first selected stride, and the stride tensor is a first stride tensor and in which determining the greatest prediction value of the group of prediction values, with the greatest prediction value being associated with one selected stride of the stride tensor further includes at least (1) calculating a greatest prediction value associated with a selected stride of a second stride tensor (2) calculating a greatest prediction value associated with a selected stride of a third stride tensor and (3) calculating at least a greatest prediction value associated with a selected stride at least a fourth stride tensor. Operation 936 optionally further includes at least generating at least an maximum prediction value position map (e.g. a counter output map) that indicates positional information regarding at least the first selected stride, the second selected stride, the third selected stride and the fourth selected stride. In some embodiments, operation 936 is performed at least in part with maximum prediction value determination circuitry 464.

[0151] Continuing with reference to Fig. 10B, operation 910 optionally includes operation 940 in which executing a convolution operation only for the selected stride of the stride tensor, the executing performed by utilizing at least the input map and the one or more filters includes at least (1) generating a maximum prediction value position map (e.g. counter output map 688) with positional information (e.g. maximum value position [1, 1] (670A)) regarding a first selected stride (648) associated with a first stride tensor (672A), a second selected stride associated with a second stride tensor, a third selected stride associated with a third stride tensor, and at least a fourth selected stride associated with at least a fourth stride tensor and (2) utilizing the maximum prediction value position map to restrict convolution operations with respect to the first selected stride, the second selected stride, the third selected stride, and the fourth selected stride. In some embodiments, operation 910 is performed at least in part with modified convolution circuitry 466 and with counter output map circuitry 468.

[0152] Experimental results utilizing some embodiments are now described. A VGG-10 CNN architecture with 10 layers was used for these experiments. The dataset used was Cifar-10 with a batch size of 128 images. One version is referred to as a “Normal VGG-10” model which used max-pooling in a conventional way. Other versions use VGG-10 modified to use either the First Prediction Algorithm or the Second Prediction Algorithm. The First Prediction Algorithm was used with either a previous layer 0 or current layer 0.

[0153] These experiments were carried out using an NVIDIA DGX server with 20 physical cores consisting of Intel Xeon (CPU E5 - 2698 v4 @2.2GHz) with 188 GB ram. The above server also had 4 GPU cores consisting of Tesla V100 with a 16 GB ram on each GPU core.

[0154] Referencing Fig 11A, row 1102 of chart 1100A contains comparative data regarding the accuracy of the Normal VGG-10, the First Prediction Algorithm, and the Second Prediction Algorithm in performing a classification task. The classification task included selecting photographs out of a plurality of photographs that contained a type of animal depicted. For example, a task included determining which photograph of a plurality of photographs contained, for example, a cat. The indicated Top-1 Test Accuracies was determined with the following formula: Top-1 Accuracy = (Total number of correct predictions in the batch) / (Total number of photos in the batch) x 100. As noted above, the dataset used for these classification tasks was the Cifar-10.

[0155] Normal VGG-10 was found to have a baseline accuracy of 95.15% on these classification tasks. The accuracy on these classification tasks for predictions was 87.76% for the First Prediction Algorithm with a previous layer 0 and 88.06% for the First Prediction Algorithm with a current layer p. For the VGG-10 modified for the Second Prediction Algorithm, the accuracy on these classification tasks was 88.06%. Thus, the accuracy for all systems was similar.

[0156] While the experiment included 100 data sets, the test datasets were shuffled and then broken into batches. Statistically, the accuracy of the Top — 1 predictions (above in Fig. 11 A) was very similar for the different batches. The results shown in Figs. 1 IB - 1 ID represents data from a batch that is statistically similar to data from all the batches.

[0157] Fig. 11B via chart 1100B provides data regarding the number of multiplication, addition, and comparison operations required for the (1) Normal VGG-10 versus (2) First Prediction Algorithm, Previous Layer p.

[0158] Fig. 11C via chart 1100C provides data regarding the number of multiplication, addition, and comparison operations required for the (1) Normal VGG-10 versus (2) First Prediction Algorithm, Current Layer p.

[0159] Fig. 11D via chart HOOD provides data regarding the number of multiplication, addition, and comparison operations required for the (1) Normal VGG-10 versus (2) Second Prediction Algorithm, which does not use a p comparison.

[0160] The results contained in Figs. 11B - 11C, show that the embodiments described above reduced MAC operations as compared to Normal VGG-10 by a factor of 3 - 4x. That is, multiplication and addition operations are reduced by a significant number using the abovedescribed embodiments. While the above-described embodiments utilize additional comparison operations, in the worst case, these are comparable to addition operations. The reduction in multiplication operations yields significant savings in the number of computations, power usage, and memory usage. The result in the above-described embodiments result in substantial energy efficiency for computing hardware.

[0161] It will be understood by those skilled in the art that the terminology used in this specification and in the claims is “open” in the sense that the terminology is open to additional elements not enumerated. For example, the word “includes” should be interpreted to mean “including at least” and so on. Even if “includes at least” is used sometimes and “includes” is used other times, the meaning is the same: includes at least. The word “comprises” is also “open” regardless of where in a claim it is used. In addition, articles such as “a” or “the” should be interpreted as not referring to a specific number, such as one, unless explicitly indicated. At times a convention of “at least one of A, B, or C” is used, the intent is that this language includes any combination of A, B, C, including, without limitation, any of A alone, B alone, C alone, A and B, B and C, A and C, all of A, B, and C or any combination of the foregoing, such as for example AABBC, or ABBBCC. The same is indicated by the conventions “one of more of A, B, or C” and “and/or”.

[0162] In addition, references to circuits, refer to circuitry for causing a device to perform the respective functions of these circuits. In some embodiments, one or more of these circuits are implemented as sub-chiplets. In some embodiments, these circuits include at least one of executable code stored on a machine-readable medium, an application, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), memories storing executable code, other forms of logic or any combination of the foregoing. In some embodiments, these circuits include are part of and/or include a link to dedicated processor. In embodiments in which one or more circuits includes executable code stored in a machine-readable medium, this executable code, when executed, would cause the dedicated processor to perform the respective functions of the circuit. Consistent with the above discussion, in some embodiments, these circuits may be part of a processing device, such as a central processing unit (CPU), a processor, a controller, a field-programmable gate array, a graphics accelerator, a graphics processing unit (GPU), or hard-wired logic, such as for example a application-specific integrated circuit (ASIC). In some embodiments one or more circuits may contain memory, may be configured to access stored memory, may be configured to access remote memory, or may not contain or access memory, dependent on their function. In some embodiments, one or more circuits contain or are linked to one or more machine-readable mediums.

[0163] Various functional logic blocks, such as for example, an Al engine or a ML engine may, in some embodiments, be implemented as circuits. And the above discussion of circuits would be fully applicable.

[0164] Although embodiments have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention as defined by the appended claims and equivalents thereof.

Claims

We claim:

1. A computing device configured to implement at least a portion of a convolutional neural network, the computing device comprising: one or more processing devices; one or more machine-readable mediums bearing one or more executable instructions that when executed by the one or more processing devices cause the computing device to perform one or more operations that include at least: accepting as inputs at least an input feature map having one or more first elements and at least one or more filters having one or more second elements; for a first stride associated with a stride tensor, computing at least a first prediction value associated with the first stride, wherein a stride includes at least a mapping of the one or more second elements onto a group of one or more first elements and a stride tensor includes at least a group of strides and wherein a prediction value predicts a magnitude of an output value for a convolution operation associated with the first stride; for at least a second stride of the stride tensor, computing at least a second prediction value associated with the at least a second stride, wherein the first prediction value and the at least a second prediction value define a group of prediction values associated with the stride tensor; determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor; and executing a convolution operation only for the selected stride of the stride tensor, the executing performed by utilizing at least the input feature map and the one or more filters

2. The computing device of claim 1, wherein the input feature map includes one or more first channels and wherein the one or more filters define one or more second channels, wherein the one or more first channels include at least the one or more first elements and the one or more second channels include at least the one or more second elements.

3. The computing device of claim 1, wherein: at least some of the one or more first elements are associated with respective numerical values; at least some of the one or more second elements are associated with respective numerical values; and computing at least the first prediction value comprises: least determining if a first sign and a second sign are equal, wherein the first sign is associated with a first number that is associated with a given first element of the group of first elements and the second sign is associated with a second number associated with a given second element of the one or more second elements.

4. The computing device of claim 3, wherein computing the first prediction value further comprises: incrementing a counter value responsive to a determination that the first sign and the second sign are equal; and otherwise not incrementing the counter value.

5. The computing device of claim 4, wherein incrementing a counter value responsive to a determination that the first sign and the second sign are equal comprises: determining if the second number is greater than a mean numerical value that is a mean of all numerical values associated with the one or more second elements.

6. The computing device of claim 5, wherein incrementing a counter value responsive to a determination that the first sign and the second sign are equal further comprises: incrementing the counter value by a first value responsive to a determination that the second number is greater that the mean numerical value; or incrementing the counter value by a second value responsive to a determination that the second number is not greater than the mean numerical value, wherein the second value is less than the first value

7. The computing device of claim 4, wherein incrementing a counter value responsive to a determination that the first sign and the second sign are equal comprises: determining if the first number is greater than a batch normalization number associated with batch normalization.

8. The computing device of claim 7, wherein the batch normalization number is associated with output of either a current convolution layer or a previous convolution layer, wherein the input feature map is associated with at least the cunent convolution layer

9. The computing device of claim 7, wherein incrementing a counter value responsive to a determination that the first sign and the second sign are equal further comprises: determining if the second number is greater than a mean numerical value that is a mean of all numerical values associated with the one or more second elements.

10. The computing device of claim 9, wherein incrementing a counter value responsive to a determination that the first sign and the second sign are equal further comprises: incrementing the counter value by a first value responsive to determining at least one of (1) that the first number is greater than the batch normalization number or (2) that the second number is greater that the mean numerical value; or otherwise incrementing the counter value by a second value, wherein the second value is less than the first value.

11. The computing device of claim 1, wherein the first prediction value is a counter value and wherein computing at least a first prediction value associated with the first stride comprises: computing a first counter value that is associated with the first stride of the stride tensor at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing a previous counter value, (2) incrementing the previous counter value by a first amount, or (3) incrementing the previous counter value by a second amount that is less than the first amount computing a second counter value that is associated with the first stride of the stride tensor and that is computed at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing the first counter value, (2) incrementing the first counter value by a first amount, or (3) incrementing the first counter value by a second amount that is less than the first amount; computing a third counter value that is associated with the first stride of the stride tensor and that is computed at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing the second counter value, (2) incrementing the second counter value by the first amount, or (3) incrementing the second counter value by the second amount; and computing at least a fourth counter value that is associated with the first stride of the stride tensor and that is computed at least in part by determining based on one or more inputs to perform at least one of (1) not incrementing the third counter value, (2) incrementing the third counter value by the first amount, or (3) incrementing the third counter value by the second amount.

12. The computing device of claim 1, wherein computing at least a second prediction value comprises: computing a second prediction value associated with a second stride of the stride tensor; computing a third prediction value associated with a third stride of the stride tensor; and computing at least a fourth prediction value associated with at least a fourth stride of the stride tensor.

13. The computing device of claim 12, wherein the group of prediction values includes at least the first prediction value, the second prediction value, the third prediction value, and the at least a fourth prediction value.

14. The computing device of claim 1, wherein determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor comprises: determining the greatest prediction value from among a first prediction value associated the first stride, a second prediction value associated with a second stride of the stride tensor, a third prediction value associated with a third stride of the stride tensor, and the at least a fourth prediction value associated with at least a fourth stride of the stride tensor.

15. The computing device of claim 1, wherein determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor comprises: if there is a tie for the greatest prediction value among two or more prediction values of the group of prediction values, then selecting the two or more prediction values and including positional information regarding the selected two or more prediction values in a maximum prediction value position map.

16. The computing device of claim 1, wherein the greatest prediction value of the group of prediction values is a first greatest prediction value, the selected stride is a first selected stride, and the stride tensor is a first stride tensor; and wherein determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor further includes at least: determining a greatest prediction value associated with a selected stride of a second stride tensor; determining a greatest prediction value associated with a selected stride of a third stride tensor; and determining at least a greatest prediction value associated with a selected stride of at least a fourth stride tensor.

17. The computing device of claim 16, further comprising: generating at least an maximum prediction value position map that indicates positional information regarding at least the first selected stride, the second selected stride, the third selected stride and the fourth selected stride.

18. The computing device of claim 1, wherein executing a convolution operation only for the selected stride of the stride tensor, the executing performed by utilizing at least the input feature map and the one or more filters comprises: generating a maximum prediction value position map with positional information regarding at least a first selected stride associated with a first stride tensor, a second selected stride associated with a second stride tensor, a third selected stride associated with a third stride tensor, and at least a fourth selected stride associated with at least a fourth stride tensor; and utilizing the maximum prediction value position map to restrict convolution operations with respect to the first selected stride, the second selected stride, the third selected stride, and the fourth selected stride.

19. A method that implements at least a portion of a convolutional neural network and that is performed at least in part with a computing device, the method comprising: accepting as inputs at least an input feature map having one or more first elements and at least one or more filters having one or more second elements; for a first stride associated with a stride tensor, computing at least a first prediction value associated with the first stride, wherein a stride includes at least a mapping of the one or more second elements onto a group of one or more first elements and a stride tensor includes at least a group of strides and wherein a prediction value predicts a magnitude of an output value for a convolution operation associated with the first stride; for at least a second stride of the stride tensor, computing at least a second prediction value associated with the at least a second stride, wherein the first prediction value and the at least a second prediction value define a group of prediction values associated with the stride tensor; determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor; and executing a convolution operation only for the selected stride of the stride tensor, the executing performed by utilizing at least the input feature map and the one or more filters

20. A computing device that implements at least a portion of a convolutional neural network and that is performed at least in part with a computing device, the computing comprising: circuitry for accepting as inputs at least an input feature map having one or more first elements and at least one or more filters having one or more second elements; circuitry for computing at least a first prediction value associated with a first stride of a stride tensor, wherein a stride includes at least a mapping of the one or more second elements onto a group of one or more first elements and a stride tensor includes at least a group of strides and wherein a prediction value predicts a magnitude of an output value for a convolution operation associated with the first stride; circuitry for computing at least a second prediction value associated with the at least a second stride of the stride tensor, wherein the first prediction value and the at least a second prediction value define a group of prediction values associated with the stride tensor; circuitry for determining the greatest prediction value of the group of prediction values, wherein the greatest prediction value is associated with one selected stride of the stride tensor; and circuitry for executing a convolution operation only for the selected stride of the stride tensor, the executing performed by utilizing at least the input feature map and the one or more filters.