WO2018158293A1

WO2018158293A1 - Allocation of computational units in object classification

Info

Publication number: WO2018158293A1
Application number: PCT/EP2018/054891
Authority: WO
Inventors: Rastislav STRUHARIK; Bogdan VUKOBRATOVIC; Mihajlo KATONA
Original assignee: FROBAS GmbH
Current assignee: FROBAS GmbH
Priority date: 2017-02-28
Filing date: 2018-02-28
Publication date: 2018-09-07
Anticipated expiration: 2019-08-28

Abstract

At least one memory (301, 341) and a plurality of computational units (321-323) perform a plurality of filter operations between an input feature map and a filter map book for classification of at least one object represented by the input feature map. Each filter operation of the plurality of filter operations includes a plurality of combinational operations. The control logic is configured to sequentially assign at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.

Description

Allocation of Computational Units in Object Classification

TECHNICAL FIELD Various examples of the invention relate to techniques of controlling at least one memory and a plurality of computational units to perform a plurality of filter operations between an input feature map and a filter map for classification of at least one object. Furthermore, various examples of the invention relate to techniques of selecting a filter geometry from a plurality of filter geometry and using filters of the filter map for object classification which have the selected filter geometry.

BACKGROUND

In recent years, automated object recognition and identification of the type of the recognized objects (object classification) is applied in various fields. Examples include assisted driving and autonomous driving, voice recognition, image analysis, etc. In view of the wide variety of applications, the objects to be classified can widely vary. Generally, an object includes a set of features. The features are typically arranged in a certain inter-relationship with respect to each other. For example, typical objects in the field of image analysis for assisted and autonomous driving may include: neighboring vehicles; lane markings; traffic signs; pedestrians; etc.. Features may include: edges; colors; geometrical shapes; etc..

One technique employed for object classification are neural networks. Neural networks are typically hierarchical graph-based algorithms, wherein the structure of the graph correlates with previously trained recognition capabilities. The neural networks break down the problem of classification of an object into sequential recognition of the various features and their interrelationship. In detail, typically, the neural network is initially trained by inputting information (machine learning or training), e.g., using techniques of backpropagation; here, supervised and semi-supervised training or even fully automatic training helps to configure the neural network to accurately recognize objects. Then, the trained neural network can be applied to a classification task. When processing input data (sometimes also called input instance) - such as image data, audio data, sensor data, video data, etc. -, the data is transformed into an input feature map of a given dimensionality, e.g., 2-D, 3-D, or 4-D. Then, a typical neural network includes a plurality of layers which are arranged sequentially. Each layer receives a corresponding input feature map which has been processed by a preceding layer. Each layer processes the respective input feature map based on a layer-specific filter map including filter coefficients. The filter map defines a strength of connection (weights) between data points of subsequent layers (neurons). Different layers correspond to different features of the object. Each layer outputs a processed input feature map (output feature map) to the next layer. The last layer then provides - as the respective output feature map - output data (sometimes also referred to as classification vector) which is indicative of the recognized objects, e.g., their position, orientation, count, and/or type/class.

Neural network can be implemented in software and/or hardware. For example, according to reference implementations, neural network can be implemented in software executed on a general-purpose central processing unit (CPU). It is also possible to implement neural network algorithms on a graphical processor unit (GPU). In other examples, it is also possible to implement neural network algorithms at least partly in hardware, e.g., using a field- programmable gate array (FPGA) integrated circuit or an application specific integrated circuit (ASIC). See, for example: Dundar, Aysegul, et al. "Embedded Streaming Deep Neural Networks Accelerator With Applications." IEEE transactions on neural networks and learning systems (2016).

Generally, processing of unclassified data by means of a neural network (forward pass) requires significant computational resources, typically both in terms of processing power and memory access requirements. One bottleneck in terms of computational resources is that, typically, on-chip memory is insufficient for storing the various intermediate feature maps and filter maps that occur during processing a particular input data. This typically results in a need for significant memory read/write operations (data movement) with respect to an off- chip/external memory. This data movement can be, both, energy inefficient, as well as time consuming.

To mitigate this issue, techniques are known which provide for a particular hardware architecture with respect to external memory and internal memory. See, for example: Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." IEEE Journal of Solid-State Circuits (2016).

However, also such techniques face certain restrictions and drawbacks. For example, it has been observed that the time-averaged allocation of computational units performing the atomic arithmetic operations of the neural network algorithm can be limited. Then, a certain fraction of the computational units is idling. This typically slows down the processing of data for object recognition and can, furthermore, reduce the energy efficiency.

SUMMARY Therefore, a need exists for advanced techniques of classification of objects. In particular, need exists for techniques which overcome or mitigate at least some of the above-identified drawbacks and limitations.

This need is met by the features of the independent claims. The features of the dependent claims define embodiments.

A circuit includes at least one memory. The at least one memory is configured to store an input feature map and a filter map. The input feature map represents at least one object. The circuit further includes a plurality of computational units. The circuit further includes a control logic. The control logic is configured to control the at least one memory and the plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map. Said performing of the filter operations of the plurality of filter operations is for classification of the at least one object. Each filter operation of the plurality of filter operations includes a plurality of combinational operations. The control logic is configured to sequentially assign at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.

A method includes storing an input feature map and a filter map. The input feature map represents at least one object. The method further includes controlling a plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of the at least one object. Each filter operation of the plurality of filter operations includes a plurality of combinational operations. The method further includes sequentially assigning at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.

A computer program product or computer program includes program code that can be executed by at least one computer. Executing the program code causes the at least one computer to perform a method. The method includes storing an input feature map and a filter map. The input feature map represents at least one object. The method further includes controlling a plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of the at least one object. Each filter operation of the plurality of filter operations includes a plurality of combinational operations. The method further includes sequentially assigning at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units. A method includes loading an input feature map. The input feature map represents at least one object. The method further includes selecting at least one filter geometry from a plurality of filter geometries. The method further includes performing a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one objects. The filters have the selected at least one filter geometry.

A circuit includes at least one memory configured to store an input feature map and a filter map, the input feature map representing at least one object. The circuit further includes a control logic configured to select at least one filter geometry from a plurality of filter geometry. The circuit further includes a plurality of computational units configured to perform a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one objects. The filters have the selected at least one filter geometry.

A computer program product includes program code that can be executed by at least one computer. Executing the program code causes the at least one computer to perform a method. The method includes loading an input feature map. The input feature map represents at least one object. The method further includes selecting at least one filter geometry from a plurality of filter geometries. The method further includes performing a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one objects. The filters have the selected at least one filter geometry.

A computer program includes program code that can be executed by at least one computer. Executing the program code causes the at least one computer to perform a method. The method includes loading an input feature map. The input feature map represents at least one object. The method further includes selecting at least one filter geometry from a plurality of filter geometries. The method further includes performing a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one objects. The filters have the selected at least one filter geometry.

A circuit includes a plurality of computational units; and a first cache memory associated with the plurality of computational units; and a second cache memory associated with the plurality of computational units. The circuit also includes an interface configured to connect to an off- chip random-access memory for storing an input feature map and a filter map. The circuit also includes a control logic configured to select allocations of the first cache memory and the second cache memory to the input feature map and to the filter map, respectively. The control logic is further configured to route the input feature map and the filter map to the plurality of computational units via the first cache memory or the second cache memory, respectively, and to control the plurality of computational units perform a plurality of filter operations between the input feature map and the filter map for classification of at least one object represented by the input feature map.

The circuit also may include at least one router configured to dynamically route data stored by the first cache memory to computational units of the plurality of computational units. The second cache memory may comprise a plurality of blocks, wherein different blocks of the plurality of blocks are statically connected with different computational units of the plurality of computational units.

The blocks of the plurality of blocks of the second cache memory may all have the same size.

The control logic may be configured to select a first allocation of the first cache memory and the second cache memory to a first input feature map and a first filter map, respectively. The control logic may be configured to select a second allocation of the first cache memory and the second cache memory to a second input feature map and a second filter map, respectively. The first feature map and the first filter map may be associated with a first layer of a multi-layer neural network. The second feature map and the second filter map may be associated with a second layer of a multi-layer neural network.

The control logic may be configured to select the allocation of the first cache memory and the second cache memory based on at least one of a size of the input feature map, a size of the filter map, a relation of the size of the input feature map with respect to the size of the filter map.

The control logic may be configured to select the allocation of the first cache memory to the input feature map if the size of the input feature map is larger than the size of the kernel map. The control logic may be configured to select the allocation of the first cache memory to the filter map if the size of the input feature map is not larger than the size of the kernel map.

Each block of the plurality of blocks may be dimensioned in size to store an entire receptive field of the input feature map and/or is dimensioned in size to store an entire filter of the filter map. Data written to the first cache memory by a single refresh event may be routed to a multiple computational units of the plurality of computational units.

Data written to the second cache memory by a single refresh event may be routed to a single computational unit of the plurality of computational units.

A rate of refresh events of the first cache memory may be larger than a rate of refresh events of the second cache memory. The circuit may further comprise at least one cache memory providing level-2 cache functionality to the plurality of computational units, and optionally at least one cache memory providing level-3 cache functionality to the plurality of computational units.

The control logic may be configured to, depending on a size of receptive fields of the input feature map and a stride size associated with filters of the filter map: controlling data written to a given cache memory of the at least one cache memory providing level-2 cache functionality to the plurality of computational units.

The circuit may further include a first cache memory providing level-2 cache functionality to the plurality of computational units and being allocated to the input feature map and not to the filter map; and a second cache memory providing level-2 cache functionality and level-3 cache functionality to the plurality of computational units and being allocated to the input feature map and to the filter map.

The control logic may be configured to allocate the first cache memory to a first one of the input feature map and the filter map and to allocate the second cache memory to a second one of the input feature map and the filter map.

The first cache memory and the second cache memory are arranged at the same hierarchy with respect to the plurality of computation units. The first cache memory and the second cache memory may be at level-1 hierarchy with respect to the plurality of computational units.

The plurality of filter operations may comprise convolutions of a convolutional layer of a convolutional neural network, the convolutions being between a respective kernel of the filter map and a respective receptive field of the input feature map.

A method includes selecting allocations of a first cache memory associated with a plurality of computational units and of a second level-2 cache memory associated with the plurality of computational units to an input feature map and a filter map, respectively; and routing the input feature map and the filter map to the plurality of computational units via the first cache memory or the second cache memory, respectively; and controlling the plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of at least one object represented by the input feature map.

It is to be understood that the features mentioned above and those yet to be explained below may be used not only in the respective combinations indicated, but also in other combinations or in isolation without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically illustrates a circuit including an external memory and a computer including an internal memory.

FIG. 2 is a flowchart of a method of processing data using a multi-layer neural network according to various examples.

FIG. 3 schematically illustrates the various layers of the multi-layer neural network, as well as receptive fields of neurons of the neural network arranged with respect to respective feature maps according to various examples.

FIG. 4 schematically illustrates a convolutional layer of the layers of the multi-layer network according to various examples.

FIG. 5 schematically illustrates a stride between of a convolution of an input feature map with a kernel positioned at various positions throughout the feature map according to various examples, wherein the different positions correspond to different receptive fields. FIG. 6 schematically illustrates arithmetic operations associated with a convolution according to various examples.

FIG. 7 schematically illustrates a cubic kernel having a large kernel size according to various examples.

FIG. 8 schematically illustrates a cubic kernel having a small kernel size according to various examples. FIG. 9 schematically illustrates a spherical kernel having a large kernel size according to various examples. FIG. 10 schematically illustrates a pooling layer of the layers of the multi-layer network according to various examples.

FIG. 1 1 schematically illustrates an adding layer of the layers of the multi-layer network according to various examples.

FIG. 12 schematically illustrates a concatenation layer of the layers of the multi-layer network according to various examples.

FIG. 13 schematically illustrates a fully-connected layer of the layers of the multi-layer network according to various examples, wherein the fully-connected layer is connected to a not-fully- connected layer.

FIG. 14 schematically illustrates a fully-connected layer of the layers of the multi-layer network according to various examples, wherein the fully-connected layer is connected to a fully- connected layer.

FIG. 15 schematically illustrates a circuit including an external memory and a computer according to various examples, wherein the computer includes a plurality of calculation modules.

FIG. 16 schematically illustrates a circuit including an external memory and a computer according to various examples, wherein the computer includes a single calculation module.

FIG. 17 schematically illustrates a circuit including an external memory and a computer according to various examples, wherein the computer includes a plurality of calculation modules.

FIG. 18 schematically illustrates details of a calculation module, the calculation module including a plurality of computational units according to various examples.

FIG. 19 schematically illustrates assigning multiple convolutions to multiple computational units. FIG. 20 schematically illustrates assigning multiple convolutions to multiple computational units according to various examples. FIG. 21 is a flowchart of a method according to various examples.

FIG. 22 is a flowchart of a method according to various examples.

FIG. 23 schematically illustrates details of a calculation module, the calculation module including a plurality of computational units according to various examples.

FIG. 24 is a flowchart of a method according to various examples.

FIG. 25 is a flowchart of a method according to various examples.

FIG. 26 schematically illustrates techniques of dynamic memory allocation between input feature maps and filter maps according to various examples, and further illustrates refresh events of level-1 cache memories according to various examples. DETAILED DESCRIPTION OF EMBODIMENTS

In the following, embodiments of the invention will be described in detail with reference to the accompanying drawings. It is to be understood that the following description of embodiments is not to be taken in a limiting sense. The scope of the invention is not intended to be limited by the embodiments described hereinafter or by the drawings, which are taken to be illustrative only.

The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art. Any connection or coupling between functional blocks, devices, components, or other physical or functional units shown in the drawings or described herein may also be implemented by an indirect connection or coupling. A coupling between components may also be established over a wireless connection. Functional blocks may be implemented in hardware, firmware, software, or a combination thereof. Hereinafter, techniques of object classification are described. For example, objects can be classified in input data such as sensor data, audio data, video data, image data, etc.. Classifications of objects can yield output data which is indicative of one or more properties of the objects, e.g., of the position of the objects within the input data, of the orientation of the objects within the input data, a type of the objects, etc.

The techniques described herein can facilitate object classification based on graph-based algorithms. According to some examples, neural networks are employed for the object classification. A particular form of neural networks that can be employed according to examples are convolutional neural networks (CNN).

CNNs are a type of feed-forward neural networks in which the connectivity between the neurons is inspired by the connectivity found in the animal visual cortex. Individual neurons from the visual cortex respond to stimuli from a restricted region of space, known as receptive field. In other words, the receptive field of a neuron may designate a 3-D region within the respective input feature map to which said neuron is directly connected to. The receptive fields of neighboring neurons may partially overlap. The receptive fields may span the entire visual field, i.e., the entire input feature map. It was shown that the response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution, so CNNs extensively make use of convolution. A convolution includes a plurality of combinational operations which can be denoted as in the products of vectors. A convolution may be defined with respect to a certain kernel. According to examples, a convolution may be between a 3-D kernel - or, generally, a 3-D filter - and a 3-D input feature map. Hence, a convolution includes a plurality of combinational operations, i.e., applying 2-D channels of the 3-D kernel - or, generally, 2-D filter coefficients - to 2-D sections of a 3-D receptive field associated with a certain neuron; such applying of 2-D channels to 2-D sections may include multiple arithmetic operations, e.g., multiplication and adding operations. In particular, such applying of 2-D channels to 2-D sections may correspond to an inner product of two vectors. A CNN is formed by stacking multiple layers that transform the input data into an appropriate output data, e.g., holding the class scores. The CNN may include layers which are selected from the group of layer types including: Convolutional Layer, Pooling Layer, Non-Linear Activation Layer, Adding Layer, Concatenation Layer, Fully-connected Layer. As opposed to conventional multi-layer perceptron neural networks, CNN are typically characterized by the following features: (i) 3-D volume of neurons: the layers of a CNN have neurons arranged in 3-D: width, height and depth. The neurons inside a layer are selectively connected to a sub- region of the input feature map obtained from the previous layer, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN. (ii) Local connectivity: following the concept of receptive fields, CNNs typically exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learnt filters produce the strongest response to a spatially local input pattern. Stacking many such layers leads to non-linear filters that become increasingly global, i.e. responsive to a larger regions. This helps to take into-account interrelationships between different low-level features: This allows the network to first create good representations of small parts of the input, and then assemble representations of larger areas from them, (iii) Shared weights: in CNNs, each filter is replicated across the entire input feature map. Such replications - separated by a stride and forming a set of combinational operations - thus share the same filter coefficients (weight vector and bias) and form an output feature map. This results in the neurons of a given convolutional layer detecting the same feature - defined by the filter coefficients. Replicating units in this way allows for features to be detected regardless of their position in the visual field, thus constituting the property of translational invariance. Such properties typically allow CNNs to achieve good generalization. CNNs are widely applied for object recognition in image data (vision).

Various techniques described herein are based on the finding that the computational resources associated with implementing the CNN can vary from layer to layer, in particular, depending on a layer type. For example, it has been found that weight sharing as implemented by the convolutional layer can significantly reduce the number of free parameters being learnt, such that the memory access requirements for running the network are reduced. In other words, a filter map of the convolutional layers can be comparably small. On the other hand, the convolutional layers may require significant processing power, because a large number of convolutions may have to be performed. Moreover, different convolutional layers may rely on different kernels: in particular, the kernel geometry may vary from layer to layer. Hence, computational resources in terms of processing power and memory access requirements can change from convolutional layer to convolutional layer. Convolutional layers, in other words, are often characterized by a relatively small number of weights since kernels are shared; but because input feature maps and output feature maps of convolutional layers are large, there is often a large number of combinational operations that need to be performed. So, in convolutional layers, often memory access requirements are comparably limited, but the required processing power is large. Often, in fully-connected layers, the situation is the opposite: here, the memory access requirements can be high since there is no weight sharing between the neurons, but the number of combinational operations is small. According to various examples it is possible to provide a circuit which flexibly provides efficient usage of available computational units for the various layers encountered in a CNN - even in view of different requirements in terms of memory acces requirements and/or processing power, as described above. This is facilitated by a large degree of freedom and flexibility provided when assigning combinational operations to available computational units, i.e., when allocating computational units for certain combinational operations.

While, hereinafter, reference is primarily made to CNNs for sake of simplicity, generally, such techniques may be readily applied to different types of neural networks. The techniques described herein are based on the finding that various types of neural networks rely on filter operations - formed by a plurality of combinational operations - between an input feature map and a filter map. In convolutional layers of a CNN, these filter operations are implemented by convolutions. However, generally, different types and kinds of filter operations may benefit from the techniques described herein. Other examples of filter operations include operations associated with fully-connected layers, pooling layers, adding layers, and/or concatenation layers.

The techniques described herein may be of particular use for multi-layer filter networks which iteratively employ multiple filters, wherein different iterations are associated with a different balance between computational resources in terms of processing power on the one hand side and computational resources in terms of memory access requirements on the other hand side. Hence, while reference is made primarily to convolutional layers of CNNs - requiring significant processing power - hereinafter, such techniques may be applied to other kinds and types of layers of CNNs, e.g., fully-connected layers - having significant memory access requirements.

FIG. 1 schematically illustrates aspects with respect to the circuit 100 that can be configured to implement a neural network. For example, the circuit 100 could be implemented by an ASIC or FPGA.

The circuit 100 includes a computer 121 that may be integrated on a single chip/die which includes an on-chip/internal memory 122. For example, the internal memory 122 could be implemented by cache or buffer memory. The circuit 100 also includes external memory 1 1 1 , e.g., DDR3 RAM.

FIG. 1 schematically illustrates input data 201 which is re-presenting an object 285. The circuit 100 is configured to recognize and classify the object 285. For this, the input data 201 is processed. In particular, a set of filter maps 280 is stored in the external memory 1 1 1. Each filter map 280 includes a plurality of filters, e.g., kernels for the convolutional layers of a CNN. Each filter map 280 is associated with a corresponding layer of a multi-layer neural network, e.g., a CNN.

FIG. 2 is a flowchart of a method. The method of FIG. 2 illustrates aspects with respect to processing of the input data 201 . FIG. 2 illustrates aspects with respect to iteratively processing the input data 201 using multiple filters. First, in 1001 , the input data 201 is read as a current input feature map. For example, the input data may be read from the external memory 1 1 1. For example, the input data 201 may be retrieved from a sensor.

Next, in 1002, layer processing is performed based on the current input feature map. Each layer, i.e., each execution of 1002, corresponds to an iteration of the analysis of the data. Different iterations of 1002 may be associated with different requirements for computational resources, e.g., in terms of processing power vs. memory access requirements.

Depending on the particular layer associated with the current iteration of 1002 such layer processing may include one or more filter operations be between the current input feature map and the filters of the respective filter map 280.

Next, in 1003, an output feature map is written. For example, the output feature map may be written to the external memory 1 1 1.

In 1004 it is checked whether the CNN includes a further layer. If this is not the case, then the current output feature map of the current iteration of 1003 is output; the current output feature map then provides classification of the object 285. Otherwise, in 1005 the current output feature map is read as the current input feature map, e.g., from the external memory 1 1 1 . Then, 1002 - 1004 are re-executed in a next iteration.

From the method according to the example of FIG. 2 it becomes apparent that processing of the input data requires multiple read and multiple write operations to the external memory 1 1 1 , e.g., for different iterations of 1002 or even multiple times per iteration 1002. Such data movement can be energy inefficient and may require significant time. According to examples described herein, it is possible to reduce such data movement. Furthermore, from the method according to the example of FIG. 2, it becomes apparent that multiple input feature maps are subsequently processed in the multiple iterations of 1002. This can be time-consuming. Various techniques described herein enable to efficiently implement the layer processing of 1002. In particular, according to examples, it is possible to avoid idling of computational units during execution of 1002.

FIG. 3 illustrates aspects with respect to a CNN 200. The CNN includes a count of sixteen layers 260. FIG. 3 illustrates the input data 201 converted to a respective input feature map. The first layer 260 which receives the input data is typically called an input layer.

The feature maps 202, 203, 205, 206, 208, 209, 21 1 - 213, 215 - 217 are associated with convolutional layers 260. The feature maps 204, 207, 210, 214 are associated with pooling layers 260.

The feature maps 219, 220 are associated with fully-connected layers 260.

The output data 221 corresponds to the output feature map of the last fully connected layer 260. The last layer which outputs the output data is typically called an output layer. Layers 260 between the input layer and the exit layer are sometimes referred to as hidden layers 260.

The output feature maps of every convolutional layer 260 and of at least some of the fully- connected layers are post-processed using a non-linear post-processing function (not shown in FIG. 3), e.g., a rectified linear activation function and/or a softmax activation function. Sometimes, dedicated layers can be provided for non-linear post-processing (not shown in FIG. 3).

FIG. 3 also illustrates the receptive fields 251 of neurons 255 of the various layers 260. The lateral size (xy-plane) of the receptive fields 251 - and thus of the corresponding kernels - is the same for all layers 200, e.g., 3x3 neurons. Generally, different layers 260 could rely on kernels and receptive fields having different lateral sizes.

In the example of FIG. 3, the various convolutional layers 260 employ receptive fields and kernels having different depth dimensions (z-axis). For example, in FIG. 3, the smallest depth dimensions equals to 3 neurons while the largest depth dimensions equals to 512 neurons. The pooling layers 260 employ 2x2 pooling kernels of different depths. Similar to convolutional layers, different pooling layers may use different sized pooling kernels and / or stride size. The size of 2x2 is an example, only. For example, the CNN 200 according to the example of FIG. 3 and also of the various further examples described herein may have about 15,000,000 neurons, 138,000,000 network parameters and may require more than 15,000,000,000 our arithmetic operations.

FIG. 4 illustrates aspects with respect to a convolutional layer 260. In the example of FIG. 4, the input feature map 208 is processed to obtain the output feature map 209 of the convolutional layer 260.

Convolutional layers 260 can be seen as the core building blocks of a CNN 200. The convolutional layers 260 are associated with a set of learnable 3-D filters - also called kernels 261 , 262 - stored in a filter map. Each filter has limited lateral dimensions (xy-plane) - associated with the small receptive field 251 , 252 typical for the convolutional layers -, but typically extend through the full depth of the input feature map (in FIG. 4, only 2-D slices 261 - 1 , 262-1 of the kernels 261 , 262 are illustrated for sake of simplicity, but the arrows along z- axis indicate that the kernels 261 , 262 are, in fact, 3-D structures; also cf. FIG. 5 which illustrates multiple slices 261 -1 - 261 -3 of the kernel 261 being applied to different slices 251 - 1 - 251 -3 of the receptive field 251 ). The different kernels 261 , 262 are each associated with a plurality of combinational operations 201 1 , 2012; the various combinational operations 201 1 , 2012 of a kernel 261 , 262 correspond to different receptive fields 251 (in FIG. 5, per kernel 261 , 262 a single combinational operation 201 1 , 2012 corresponding to a given receptive field 251 , 252 is illustrated). In other words, each kernel 261 , 262 will be applied to different receptive fields 251 , 252; each such application of a kernel 261 , 262 to a certain receptive field defines a respective combinational operation 201 1 , 2012 between the respective kernel 261 , 262 and the respective receptive field 251 , 252. As illustrated in FIG. 4, each kernel 261 , 262 is convolved across the width and height of the input feature map 208 (in FIG. 4, only a single position of the kernels 261 , 262 is illustrated for sake of simplicity), computing the inner vector product (sometimes also referred to as dot product) between the slices 261 -1 , 262-1 of the kernels 261 , 262 and the slices of the respective receptive fields 251 , 262 of the input feature map. Each kernel 261 , 262 defines a corresponding convolution 2001 , 2002, but each convolution 2001 , 2002 includes multiple combinational operations corresponding to the different receptive fields. This produces a 2-D activation map of that kernel 261 , 262. As a result, kernels 261 , 262 activate when detecting some specific type of feature at some spatial position in the input feature map.

Stacking such activation maps for all kernels 261 , 262 along the depth dimension (z-axis) forms the full output feature map 209 of the convolution layer 260. Every entry in the output feature map 209 can thus also be interpreted as a neuron 255 that perceives a small receptive field of the input feature map 208 and shares parameters with neurons 255 in the same slice of the output feature map. Often, when dealing with high-dimensional inputs such as images, it may be undesirable to connect neurons 255 of the current convolutional layer to all neurons 255 of the previous layer, because such network architecture does not take the spatial structure of the data into account. CNNs exploit spatially local correlation by enforcing a local connectivity pattern between the neurons 255 of adjacent layers 260: each neuron is connected to only a small region of the input feature map. The extent of this connectivity is a parameter called the receptive field of the neuron. The connections are local in space (along width and height of the input feature map), but typically extend along the entire depth of the input feature map. Such architecture ensures that the learnt filters produce the strongest response to a spatially local input pattern. In some examples, three parameters control the size of the output feature map of the convolutional layer: the (i) depth, (ii) stride, and (iii) zero-padding: (i) Depth (D) parameter of the output feature map controls the number of neurons 255 in the current layer 260 that connect to the same receptive field 251 , 252 of the input feature map 208. All of these neurons 255 activate for different features in the input feature map 208 by relying on different kernels 261 , 262. Different convolutions 2001 , 2002 are implemented for different kernels, (ii) The Stride (S) parameter controls how receptive fields 251 , 252 of different neurons 255 slide around the lateral dimensions (width and height; xy-plane) of the input feature map 208. When the stride is set to 1 , the receptive fields 251 , 252 of adjacent neurons are located at spatial positions only 1 spatial unit apart (horizontally, vertically or both). This leads to heavily overlapping receptive fields 251 , 252 between the neurons, and also to large output feature maps 209. Conversely, if higher strides are used then the receptive fields will overlap less and the resulting output feature map 209 will have smaller lateral dimensions (cf. FIG. 5 which illustrates a stride size 269 of two), (iii) Sometimes it is convenient to pad the input feature map 208 with zeros on the border of the input volume. The size of this zero-padding (P) is a third parameter. Zero padding provides control of the output volume spatial size. In particular, sometimes it is desirable to exactly preserve the spatial size of the input feature map in the output feature map. The spatial size of the output feature map 209 can be computed as a function of the input feature map 208 whose width is W and height is H, the kernel field size of the convolutional layer neurons is KWKKH, the stride with which they are applied S, and the amount of zero padding P used on the border. The number of neurons that "fit" a given output feature map is given by

{{W - K_W + 2P)I S + l)x{{H -K_H + 2P)I S + l) (1 ) If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled to fit across the input feature map in a symmetric way. In general, setting zero padding to be P=(K-1 )/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially. This is a very common situation in most of the currently used CNNs. Weight sharing scheme is used in convolutional layers to control the number of free parameters. It relies on one reasonable assumption: That if one kernel feature is useful to compute at some spatial position, then it should also be useful to compute at a different position. In other words, denoting a single 2-dimensional slice of depth one as a depth slice of the output feature map, it is possible to constrain the neurons 255 in each depth slice of the output feature map 209 to use the same weights and bias, i.e., the same kernel.

Since all neurons 255 in a single depth slice of the output feature map 209 are sharing the same kernel 261 , 262, then the forward pass in each depth slice of the convolutional layer can be computed as a 3D convolution 2001 , 2002 of the neurons' 255 weights (kernel coefficients) with the section of the input volume including its receptive field 251 , 252:

OFM[z][x][y] = B[z] + £∑∑ IFM[k] [Sx + i][Sy + j]^■ Kernel[z][k][i] [j],

k=0 i=0 7=0 (2) where

0 < z < D_O , 0 < x < W_o = ^{Wl ~ Kw + 2P} , 0 < y < H_o = ^{Hl ~ KH + 2P} (cf. FIG. 6 which

S S

illustrates the pairwise multiplication 2091 and summation 2092 between weights 261 A of the kernel slice 261 -1 and elements 251 A of the slice 251 -1 receptive field 251 as atomic arithmetic operations). Here, IFM and OFM are 3-D input feature map 208 and output feature map 209, respectively; Wi, Hi, DI, WO, HO and Do are the width, height and depth of the input and output feature map 208, 209, respectively; B is the bias value for each kernel 261 , 262 from the kernel map; Kernel is the kernel map; Kw, KH and KD are the width, height and depth of every kernel 261 , 262 respectively.

From Eq. 2 it is apparent that each convolution 2001 , 2002 - defined by a certain value of parameter z - includes a plurality of combinational operations 201 1 , 2012, i.e., the sums defining the inner vector product. The various combinational operations 201 1 , 2012 correspond to different neurons 255 of the output feature map 209, i.e., different values of x and y in Eq. 2. Each combinational operation 201 1 , 2012 can be broken down into a plurality of arithmetic operations, i.e., multiplications and sums (cf. FIG. 5).

FIG. 4 illustrates how the process of calculating 3-D convolutions 2001 , 2002 is performed. For each neuron 255 in the output feature map 209, located at the same depth slice, the same 3-D kernel 261 , 261 is being used when calculating the 3-D convolution. What differ between the neurons 255 from the same depth slice are the 3-D receptive fields 251 , 252 used in the 3-D convolution. For example, the neuron 255 (di, xi, yi) uses 3-D kernel 261 Kernel(di) and receptive field 251 . All neurons 255 of the output feature map 209 located in the depth slice di use the same kernel 261 Kernel(di), i.e., are associated with the same 3-D convolution 2001. In a different slice of the output feature map 209, for example the depth slice d∑, a different 3- D kernel 262 Kernel(d2) is used. In other words: neurons 255 of the output feature map 209 with identical spatial coordinates, (x, y), located in different depth slices, use identical receptive fields 251 , 252 of the input feature map 208, but different 3-D kernels 261 , 262. So, for example, output feature map neurons OFM(di, xi, yi) and OFM(c/2, xi, yi) would use the same 3-D receptive field 251 , 262 from the input feature map, but different 3-D kernels 261 , 262 , Kernel{di) and Kernel{d2) respectively.

It is common to refer to the sets of weights as a kernel 261 , 262 - or generally as a filter -, which is convolved with the different receptive fields 251 , 252 of the input feature map 208. The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output feature map 209 of the convolution layer. Above, various examples have been described with the kernels 261 , 262 have a certain kernel geometry. FIGs. 7 - 9 illustrate aspects with respect to different kernel geometries. For example, FIG. 7 illustrates a slice of a kernel 261 , 262 having a cuboid shape. From a comparison of FIGs. 7 and 8, it can be seen that different kernel geometries may be achieved by varying the size 265 of the respective kernel 261 , 262. Furthermore, from a comparison of FIGs. 7 and 8 versus FIG. 9 it can be seen that different kernel geometries may be achieved by varying the 3-D shape of the kernel 261 , 262, i.e., cuboid in FIGs. 7 and 8 and spherical in FIG. 9. Generally, it would be possible to achieve different kernel geometries by varying the lateral cross-section in xy-plane only.

FIG. 10 illustrates aspects with respect to a pooling layer 260 which performs a filter operation in the form of pooling. Pooling is generally a form of non-linear down-sampling. The intuition is that once a feature has been found, its exact location may not be as important as its rough location relative to other features, i.e., its spatial inter-relationship to other features. The pooling layer 260 operates independently on every depth slice of the input feature map 209 and resizes it spatially. The function of the pooling layer 260 is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the CNN 200, and hence to also control overfitting. Pooling layers 260 may be inserted in-between successive convolutional layers 260. The pooling operation provides a form of translation invariance.

There are several non-linear functions to implement pooling:

(i) Max pooling partitions the input feature map 209 into a set of non-overlapping pooling regions 671 , 672 of size PWXPH (these sub-regions are usually rectangles) and, for each such pooling region 671 , 672, outputs the maximum value of the input feature map points located within it

OFM[z][x][y] = max {lFM[z][Sx + i][Sy + j]},

0≤i<P_w -l

(3)

W - P H - P

0 < z < D, 0 < x < W = ^, 0 < y < H =

This is illustrated in FIG. 10 for two neurons 255 associated with different pooling regions 671 , 672.

An example is a pooling layer with filters of size 2x2 applied with a stride of 2 which downsamples at every depth slice the spatial size of input feature map 209 by a factor of 2 along both width and height, discarding in the process 75% of the input feature map values within every 2x2 sub-region. Every max operation would in this case be taking a max over 4 numbers. The depth dimension remains unchanged. (ii) Average pooling partitions the input feature map 209 into a non-overlapping sub-regions and for each sub-region outputs the average value of input feature map points located within it.

OFM[z][x][y] : ∑∑IFM[z-\[Sx + i\[Sy + j

P - P

(4)

W - P H -

0 < z < D,0 < x < W = ^, 0 < y < H = -

S S

(iii) L²-norm pooling partitions the input feature map 209 into a non-overlapping sub-regions and for each sub-region outputs the L2-norm values of input feature map points located within it, defined by

W -P_m H

0 < z < D,0 < x < W ,0 < y < H -

Pooling is performed on depth slice by depth slice basis within the pooling layer 260, as can be seen from the FIG. 10. Each neuron 255 of the output feature map 210, irrelevant of its lateral position in the xy-plane, uses identical 2D pooling region 671 , 672, but applied to different slices of the input feature map 209, because similar to the convolutional layer each neuron from the pooling layer has its own unique region of interest. By applying selected nonlinear pooling function to the values from the current 2-D pooling region 671 , 672 of the input feature map 209, a resulting value from the output feature map 210 is calculated.

In general, every pooling layer 260: (i) Accepts an input feature map of size

(ii) Requires two parameters: spatial extent of the pooling sub-region 671 , 672 PWXPH and the stride S; Produces an output feature map of size WoxHo*Do where: Wo={WrPw)IS+ , Ho= {Hr PH)/S+1 , Do=Di; and introduces zero parameters, since it computes a fixed function of the input feature map 209.

FIG. 1 1 illustrates aspects with respect to an adding layer 260 which performs a filter operation in the form of an point-wise addition of two or more input feature maps 225-227 to yield a corresponding output feature map 228. In general, every adding layer 260 accepts two or more input feature maps 225-227, each of size WlxHIxDI, Produces an output feature map 228 of size WOxHOxDO where: WO=WI, HO=HI, DO=DI; and introduces zero parameters, since it computes a fixed function of the input feature map. As can be seen from the FIG. 1 1 , each neuron 255 of the output feature map 228, OFM(c/, x, y) actually represents a sum of neurons from all input feature maps 225-227 located at the same location within the input feature map 225-228. For example, in FIG. 1 1 the neuron 255 OFM(c/i, xi, yi) of the output feature map 228 is calculated as the sum of neurons of the input feature maps 225 - 227 located at the same coordinates, i.e., as

I FMi(di, i, yi) + I FM₂(di, xi, yi)+ I FM₃(df, xi, yi). (6) FIG. 12 illustrates aspects with respect to a concatenation layer 260. The concatenation layer concatenates two or more input feature maps 225-227, usually along the depth axis and, thereby, implements a corresponding filter operation. In general, every concatenation layer accepts N input feature maps 225-227, each of them with identical spatial size WI*HI, but with possibly different depths Dn, D , DIN. produces an output feature map 228 of size Wo*Ho*Do where:

+ D_!2 + . . . + DIN, and introduces zero parameters, since it computes a fixed function of the input feature maps 225-227.

From FIG. 12 it is apparent that within the concatenation layer 260 of the CNN 200, individual input feature maps 225-227 are stacked together to form a single output feature map 228. In the example of FIG. 12, three input feature maps 225-227 are concatenated, resulting in an output feature map 228 OFM which is composed of the three input feature maps 225-227, I FMi , I FM₂ and I FM₃. The input feature map 225 IFM1 is located at depth slices 1 :D within the output feature map 228, then comes the input feature map 226 IFM2, located at depth slices (D +1 ):(D +Dl2) within the output feature map 228, while input feature map 227 IFM3 is located at depth slices (Dh+DI₂+1 ):(Dli+DI₂+DI₃) of the output feature map 228.

FIGs. 13 and 14 illustrate aspects with respect to fully-connected layers. Typically the high- level reasoning in the CNN 200 is done by means of fully connected layers 260. Neurons 255 in a fully connected layer 260 have full connections to all neurons 255 in the respective input feature map 218, 219. Their activations can hence be computed with a filter operation implemented by a matrix multiplication followed by a bias offset:

or

where Eq. 7 applies to a scenario where the input feature map 218 is associated with a not- fully connected layer 260 (cf. FIG. 13) while Eq. 8 applies to a scenario where the input feature map 219 is associated with a fully connected layer 260 (cf. FIG. 14).

Furthermore, because there is no weight sharing the memory access requirements can be comparably high, in particular if compared to executing Eq. 2 for a convolutional layer. Typically, the accumulated weighted sums for all neurons 255 of the fully-connected layer 260 are passed through some non-linear activation function, Af. Activation functions which are most commonly used within the fully-connected layer are the ReLU, Sigmoid and Softmax functions.

FIGs. 13 and 14 illustrate that every neuron 255 of the respective output feature map 219, 220 is determined based on all values of the respective input feature map 218, 219. What is different between neurons 255 from the fully-connected layer are the weight values that are used to modify input feature map values 218, 219. Each neuron 255 of the output feature map 219, 220 uses its own unique set of weights. In other words, the receptive fields of all neurons 255 of an output feature map 219, 220 of a fully-connected layer 260 are identically spanning entire input feature map 218, 219, but different neurons 255 us different kernels.

In general, every fully-connected layer: Accepts an input feature map 219, 220 of size WlxHIxDI, if the input feature map is the product of the convolutional, pooling, non-linear activation, adding or concatenation layer. If the input feature map is the product of the fully- connected layer, than its size is equal to Nl neurons; Produces an output feature map 219, 220of size NO; introduces a total of WlxHlxDlxNO weights and NxO biases, in case of non- fully-connected input feature maps, or NlxNO weights and Ν-Ό biases, in case of fully- connected input feature maps. FIG. 15 illustrates aspects with respect to the architecture of the circuit 100. In the example of FIG. 15, the circuit 100 includes a memory controller 1 12 which controls data movement from and to the external memory 1 1 1 . Furthermore, a memory access arbiter 1 13 is provided which distributes data between multiple calculation modules (CMs) 123. Each CM 122 may include internal memory 122. Each CM 122 may include one or more computational units (not illustrated in FIG. 15; sometimes also referred to functional units, FUs), e.g., an array of FU units. The FU array can be re-configured to perform processing for different types of layers 260, e.g., convolutional layer, pooling layer, etc.. This is why the FU array is sometimes also referred to as reconfigurable computing unit (RCU).

In the example of FIG. 15 the circuit 100 includes a plurality of CMs aligned in parallel. In other examples, the circuit 100 could include a plurality of CMs arranged in a network geometry, e.g., a 2-D mesh network (not illustrated in FIG. 15).

By means of multiple CMs, different instances of the input data can be processed in parallel. For example, different frames of the video could be assigned to different CMs 123. Pipelined processing may be employed.

FIG. 16 illustrates aspects with respect to the CMs 123. FIG. 16 illustrates an example where the circuit 100 only includes a single CM 123. However, it would be possible that the circuit 100 includes a larger number of CMs 123.

In the example of FIG. 16, a - generally optional - feature map memory is provided on the hierarchy of the computer 121 . The feature map memory 164 may be referred to as level 2 cache. The feature map memory is configured to cache at least parts of input feature maps and/or output feature maps that are currently processed by the computer 121 . This may facilitate power reduction, because read/write to the external memory 1 1 1 can be reduced. For example, the feature map memory 164 could store all intermediate feature maps 201 -220. Here, a tradeoff between reduced energy consumption and increased on-chip memory may be found. Furthermore, the CM 123 includes an input stream manager for controlling data movement from the feature map memory 164 and/or the external memory 1 1 1 to a computing unit array 161 . The CM 123 also includes an output stream manager 163 for controlling data movement to the feature map memory 164 and/or the external memory 1 1 1 from the computing unit array 161 . The input stream manager 162 is configured to supply all data to be processed such as configuration data, the kernel map, and the input feature map, coming from the external memory, to the proper FU units. The output stream manager 163 is configured to format and stream processed data to the external memory 1 1 1 .

FIG. 17 illustrates aspects with respect to the CMs 123. The example of FIG. 17 generally corresponds to the example of FIG. 16. However, in the example of FIG. 17, a plurality of CMs 123 is provided. This facilitates pipelined or parallel processing of different instances of the input data. For example, kernel maps may be shared between multiple CMs 123. For example, if a kernel map is unloaded from a first CM 123, the kernel map may be loaded by a second CM 123. For this, the feature map cache is employed. This helps to avoid frequent read/write operations to the external memory 1 1 1 with respect to the kernel maps. This may refer to a pipelined processing of the kernel maps for different instances of input data 200, e.g., relating to different frames of a video etc.. Here, kernel maps 280 are handed from CM 123 to CM 123 instead of moving kernel maps 280 back and forth the memory external memory 1 1 1 . Such sharing of kernel maps between the CMs 123 can relate to parallel processing of different instances of the input data. Here, different CMs 123 use the same kernel map - which is thus shared between multiple CMs 123; and different CMs 123 process different instances of the input data or different feature maps of the input data. Differently, If CMs 123 use pipelined processing, then every CM 123 uses a different kernel map, because each CM 123 evaluates different CNN layer; here, feature maps will move along the pipeline, sliding one CM mode at a time. In other words, in pipelined processing, different input instances (e.g., different video frames or different images) are at a different stage of processing by the CNN depending where the respective input instance is currently located in the pipeline.

FIG. 18 illustrates aspects with respect to the FU array 161 , the input stream manager 162, and the output stream manager 163. Typically, these elements are integrated on a single chip or die.

As illustrated in FIG. 18, the FU array 161 includes a plurality of FU units 321 - 323. While in the example of FIG. 18, a count FU units 321 - 323 is illustrated, in other examples, it would be possible that the FU array 161 includes a larger count of FU units 321 - 323. The various FU units 321 - 323 can be implemented alike or can be identical to each other. The various FU units 321 - 323 may be configured to perform basic arithmetic operations such as multiplication or summation. For example, the FU array 161 may include a count of at least 200 FU units 321 - 323, optionally at least 1000 FU units 321 - 323, further optionally at least 5000 FU units 321 - 323. In the example of FIG. 18, the FU array 161 also includes shared memory 301. Different sections of the shared memory can be associated with data for different ones of the FU units 321 - 323 (in FIG. 18 the three illustrated partitions are associated with different FU units 321 - 323). In other words, different sections of the shared memory may be allocated to different FU units 321 -323. In order to move data from the shared memory 301 to a given FU unit 321 - 323, routing elements 319, e.g., multiplexers, are employed. An encoder 352 may encode the output data. An inter-related decoder 342 is provided in the input stream manager 162.

In the example of FIG. 18, the input stream manager 162 includes a stick buffer 341 ; here, it is possible to pre-buffer certain data later on provided to the shared memories 301 . Likewise, the output stream manager 163 includes an output buffer 351 . These buffers 341 , 351 are optional.

Further illustrated in FIG. 18 are output registers 329 associated with the various FU units 321 - 323. The registers 329 can be used to buffer data that has been processed by the FU units 321 - 323.

The FU array 161 also includes a postprocessing unit 330. The postprocessing unit 330 can be configured to modify the data processed by the FU units 321 - 323 based on linear or nonlinear functions. While in the example of FIG. 18 a dedicated postprocessing unit 330 non- linear postprocessing is illustrated, in other examples, it would also be possible that non-linear postprocessing is associated with a dedicated layer of the CNN 200.

Examples of non-linear postprocessing functions include:

(i) Rectified Linear Function (ReLU) - non-saturating activation function defined, by

(ii) Hyperbolic Tangent Function - saturation activation function, defined by

A_f {x) = tanh( ) (10) e^x + e

(iii) Sigmoid Function - saturation activation function, defined by

(1 1 ) (iv) Softmax Function - saturation activation function, which "squashes" a N-dimensional vector x of arbitrary values to a N-dimensional vector f(x) of real values in the range (0, 1 ) that add up to 1 , defined by

Compared to other functions the usage of ReLU is sometimes preferred, because it results in the CNN 200 training several times faster, without making a significant difference to generalisation accuracy. However, Softmax function is usually used in the final layer of the CNN 200 to generate output classification predictions in terms of class membership probabilities.

FIG. 19 illustrates aspects with respect to the assignment (arrows in FIG. 19) of convolutions 2001 -2003 to FU units 321 -323. In the example of FIG. 19, different combinational operations 201 1 , 2012 of each convolution 2001 -2003 are assigned to different FU units 321 -323. The various convolutions 2001 , 2002 are processed sequentially. Because the number of combinational operations 201 1 , 2012 of a given convolution 2001 -2003 may not match the number of FU units 321 -323, this may result in idling FU units 323. This reduces the efficiency.

FIG. 20 illustrates aspects with respect to the assignment of convolutions 2001 -2003 using different kernels 261 , 262 to FU units 321 -323. According to various examples, the combinational operations required to complete processing of an input feature map 201 - 220, 225 - 228 are flexibly assigned to the various FU units 321 - 323. This is based on the finding that such a flexible assignment of the combinational operations can reduce idling of the FU units 321 - 323 if compared to a static assignment, i.e., a predefined assignment which does not vary - e.g., from layer to layer 260 of the CNN 200. If a static, predefined assignment is used it may not be possible to flexibly adjust the assignment depending on properties of the respective input feature map 201 - 220, 225 - 228 and/or the respective kernel map 280.

The flexible assignment enables to tailor allocation of the FU units 321 - 323. In particular, the assignment can take into account properties such as the size 265 of the used kernel 261 , 262 or the shape of the used kernel 261 , 262 - or generally the kernel geometry. The assignment can take into account the stride 269. In the example of FIG. 20, a control logic - e.g., implemented by a control 343 in the input stream manager 162 and/or a control 353 in the output stream manager 163 or another control of the computer 121 - is configured to sequentially assign at least two combinational operations 201 1 , 2012 of the same convolution 2001 , 2002 to the same FU unit 321 - 323. This avoids idling of FU units 321 -323. As can be seen from a comparison of FIGs. 19 and 20, the processing time is reduced.

As will be appreciated from FIG. 20, the control logic 343, 353 is configured to sequentially assign all combinational operations 201 1 , 2012 of a given convolution 2001 - 2003 to the same FU unit 321 - 323. In other words, the values of all neurons 255 of a given slice of the output feature map of the respective convolution or layer 260 are determined by the same FU unit 321 - 323. In still other words, all arithmetic operations of a given value of z in equation 2 are performed by the same FU unit 321 - 323. As will be appreciated from FIG. 20, this results in a scenario where the FU units 321 - 323 perform at least some of the convolutions 2001 , 2002 in parallel. This helps to reduce the overall processing time.

In particular, idling of FU units 321 - 323 is avoided for a scenario where the count of FU units 321 - 323 is different from a count of convolutions 2001 , 2002 - or generally the count of filter operations. The count of convolutions 2001 , 2002 can depend on various parameters such as the kernel geometry; the stride size; etc. The count of filter operations may be different for convolutional layers 260 if compared to fully-connected layers. It is thus generally possible that the control logic 343, 353 is configured to selectively assign at least two combinational operations 201 1 , 2012 of the same convolution 2001 , 2002 to the same FU unit 321 - 323 depending on at least one of the following: a kernel geometry; a stride size; a count of the FU units 321 - 323; a count of the kernels 261 , 262 of the kernel map 280 or, generally, a count of the filter operations of the filter map; a size of the on-chip memory 301 (which may limit the number of combinational operations 201 1 , 2012 that can be possibly executed in parallel); and generally layer 260 of the CNN 200 which is currently processed.

In FIG. 20, a scenario is illustrated where combinational operations 201 1 , 2012 associated with different convolutions 2001 - 2003 are completed at the same point in time. However, generally, it would be possible that combinational operations 201 1 , 2012 associated with different convolutions 2001 - 2003 are not time aligned. Then, it the control logic 343, 353 may be configured to monitor completion of the first combinational operation by the respective FU unit 321 - 323; and then trigger a second combinational operation of the same convolution 2001 - 2003 to be performed by the respective FU unit 321 - 323. The trigger time points are generally not required to be synchronized for different filter operations.

FIG. 21 is a flowchart of a method according to various examples.

In 101 1 , a plurality of filter operations is performed. Each filter operation includes a plurality of combinational operations. For example, in 101 1 the filter operations may correspond to 3-D convolutions between an input feature map and a kernel map. For example, the plurality of combinational operations in 101 1 may correspond to arithmetic operations such as multiplications and summations between a plurality of two-dimensional slices of 3-D receptive fields of the feature map and associated 2-D filter coefficients of a 3-D kernel of the kernel map. For example, the filter operations in 101 1 may be part of processing of a not-fully- connected layer or of a fully-connected layer. For example, 101 1 may be re-executed for various layers of a multi-layer neural network (cf. FIG. 2, 1002).

Next, in 1012, at least two combinational operations of the same filter operation are assigned to the same FU unit. Thereby, the same FU unit sequentially calculates at least parts of the filter operation. This facilitates efficient utilization of the available FU units. For example, for at least one filter operation, it would be possible to assign all respective combinational operations to the same FU unit.

Based on such techniques it is possible to flexibly utilize the available FU units - even if the number, size, and/or complexity of the combinational operation changes, e.g., from iteration to iteration. Hence, it is possible to flexibly utilize the available FU units even in view of different requirements imposed by different layers of the multi-layer neural network such as a CNN.

Based on such techniques it may be possible to relieve some constraints conventionally imposed on the selection of parameters of the various layers of the multi-layer neural network. FIG. 22 illustrates a method which could enable such flexible selection of the parameters of the various layers of the multi-layer neural network.

First, in 1021 , an input feature map is loaded. For example, the input feature map may be associated with an input layer or an output layer or a hidden layer. For example, the input feature map may be loaded from an external memory or from some on-chip memory, e.g., a level-2 cache, etc. The input feature map may also correspond to the output feature map of a previous layer, e.g., in case of a hidden layer. Next in 1022, a filter geometry is selected from a plurality of filter geometries. It is possible that different layers of the multi-layer neural network are associated with different filter geometries. For example, different filter geometries may refer to different filter sizes in the lateral plane (xy - plane) and/or different filter shapes. Possible filter shapes may correspond to: cuboid; spherical; cubic; and/or cylindric. The filter geometries may be selected in view of the feature recognition task. Selecting the appropriate filter geometry may increase an accuracy with which the features can be recognized.

In some examples, it would be possible that one and the same filter geometry is used throughout receptive fields across the entire input feature map of the respective layer of the multi-layer neural network. In other examples, it would be possible that different filter geometries are selected for different receptive fields of the respective input feature map, i.e., that different filter geometries are used for one and the same layer of the multi-layer neural network.

For example, it would be possible that all layers, e.g., all convolutional layers, use filters having the same geometry (for example 3x3), but with different depth. Alternatively, it would be possible that different convolutional layers use different filter geometries, e.g., different lateral filter shapes. Alternatively or additionally, it would be possible that within one or more convolutional layers different filter geometries are used at different depths: e.g., neurons from depth 1 of the output feature map use cubical kernel, neurons from depth 2 use spherical, neurons from depth 3 again use cubical kernel but with different xy size, etc..

Such a flexible selection of the filter geometry can break the translational invariance and, thus, help to accurately identify objects, e.g., based on a-priori knowledge. For example, it would be possible that the filter geometry is selected based on a-priori knowledge on objects represented by the input filter map. For example, the a-priori knowledge may correspond to distance information for one or more objects represented by the input filter map. For example, it would be possible that a-priori knowledge is obtained by sensor fusion between a sensor providing the input data and one or more further sensors providing the a-priori knowledge. For example, it would be conceivable that distance information on one or more objects represented by the input data is obtained from a distance sensor such as RADAR or LIDAR or a stereoscopic camera. Then, for example, it could be assumed that such objects which have corresponding distance information indicating a shorter distance (larger distance) to the sensor providing the input data are represented in a larger (smaller) region of the input data; then, a larger (smaller) filter size in terms of the lateral dimensions of the filter could be selected. Similar considerations may also apply with respect to the filter geometry.

Next, in 1023, a plurality of filter operations are performed. Examples of filter operations include convolutions. Here, it is possible that the corresponding layer is a convolutional layer of a CNN, e.g., a CNN as described in further examples disclosed herein. The filter operations rely on one or more filters having the selected filter geometry.

It is then possible that an assignment between combinational operations of each filter operation and FU units is flexibly adapted depending on the selected filter geometry. This facilitates efficient utilization of the available FU units even in view of different selected filter geometries.

In the example of FIG. 22, 1022 is executed after 1021. In other examples, it would also be possible that 1022 is executed prior to executing 1021 . For example, it would be possible to select the appropriate filter geometries for all respective layers of a multi-layer neural network prior to the start of processing of the input data. In other examples, it would be possible to select the appropriate filter geometry for a given layer only once the input feature map of that given layer is available, i.e., after processing of the preceding layer of the multi-layer network has concluded. Here, selecting of the filter geometry from the plurality of filter geometries can be in response to said loading of the input feature map.

FIG. 23 illustrates aspects with respect to the RCU 161 , the input stream manager 162, and the output stream manager 163. Typically, these elements are integrated on a single chip or die. The RCU 161 of FIG. 23 generally corresponds to the RCU 161 of FIG. 18. Hence, the techniques of assigning at least some or all combinational operations of the same filter operation to the same FU unit 321 -323, as explained above, can also be implemented for the RCU 161 of FIG. 23.

As illustrated in FIG. 23, the RCU 161 includes multiple instances of L1 cache memory 301 , 302 associated with the FUs 321 -323. As can be seen from FIG. 23, the L1 cache memory 301 and the L1 cache memory 302 are arranged on the same level of hierarchy with respect to the FUs 321 -323, because access of the FUs 321 -323 to the L1 cache memory 301 is not via the L1 cache memory 302, and vice versa. According to examples, an input stream router 344 implements a control functionality configured for selecting an allocation of the cache memory 301 and an allocation of the cache memory 302 to the respective input feature map and to the respective kernel map of the active layer of the CNN 200, respectively. Then, the input stream router 344 is configured to route the input feature map and the kernel map to the FUs 321 - 323 via the cache memory 301 or the cache memory 302, respectively. By providing different instances of the L1 cache memory 301 , 302, it is possible to tailor the allocation to the input feature maps and the kernel maps, respectively. As illustrated in FIG. 23, the L1 cache memory 301 is connected to the FUs 321 - 323 via routers 319. In other words, the shared L1 cache memory 301 is shared between the multiple FUs 321 - 323. As illustrated in FIG. 23, different sections of the shared L1 cache memory 301 can be allocated for data associated with different parts of the allocated map. This may be helpful, in particular, if the map allocated to the L1 cache memory 302 is replicated more than once across different blocks 31 1 -313 of the L1 cache memory 302.

In various examples, it is possible that the shared L1 cache memory 301 providing data to multiple FUs 321 - 323 is implemented in a single memory entity in terms of a geometrical arrangement on the respective chip and/or in terms of an address space. For example, it would be possible that different blocks of the shared L1 cache memory 301 use the same address space. For example, the routers 319 could be configured by the input stream router 344 to access the appropriate address space of the shared L1 cache memory 301 .

In the example of FIG. 23, the input stream manager 162 also includes L2 cache memory 341 (labeled stick buffer in FIG. 23). Data is provided to the shared L1 cache memory 301 by the input stream router 344 via the L2 cache memory 341.

The cache memory 341 may be used to buffer sticks of data - e.g., receptive fields 251 , 252 - from the respective map, e.g., the input feature map. In some examples, the cache memory 341 may be allocated to storing data of the input feature map, but not allocated to store data of the kernel map.

These sticks of data are re-used in the adjacent convolutions 2001 , 2002 in case the stride 269 is less than width or height of the respective kernel 261 , 262. In this case, when moving horizontally or vertically by a single stride increment, some parts of the input feature map that were used in the previous convolution 2001 , 2002 can be re-used in the subsequent calculation, as well - due to the overlap in the receptive fields 251 , 252 in view of the small stride 269. The cache memory 341 allows on-chip buffer stores those parts of the input feature map that will be needed in upcoming convolutions 2001 , 2002. Hence, it is possible to control data written to the cache memory 341 depending on a size of the receptive fields 251 , 252 of the input feature map and the stride size 269 of the kernels 261 , 262 of the kernel map. For example, the refresh events of the cache memory 341 may be controlled depending on the size of the receptive fields 251 , 252 of the input feature map and the stride size 269 of the kernels 261 , 262 of the kernel map. A sequence of processing the convolutions 2001 , 2002 may be set appropriately. By using the cache memory 241 , data movement from DDR can be significantly reduced. For example, if a 3x3 kernel with a stride of 1 is employed, every data point of the input feature map is re-used in 9 different convolutions 2001 , 2002. Thus, if a given data point is stored in the cache memory 241 until is used said 9 times, the data movement to the external memory 101 can be reduced by a factor of 9.

The cache memory 341 is arranged up-stream of the input stream router 344, i.e., closer to the external memory 101 . Hence, it is possible to store data of the input feature map irrespective of the allocation of the input feature map to the local L1 cache memory 302 or to the shared L1 cache memory 301 .

As will be appreciated, the cache memory 341 provides L2 cache memory functionality for the input feature map - but not for the kernel map. As such, the cache memory 164 provides L3 cache memory functionality for the input feature map - because of the intermediate cache memory 341 - and, at the same time, the cache memory 164 provides L2 cache memory functionality for the kernel map. Also illustrated in FIG. 23 is an implementation of the L1 cache memory 302 including a plurality of blocks 31 1 - 313. Different blocks 31 1 - 313 are statically connected with different FUs 321 - 323. Hence, there are no intermediate routers required between the blocks 31 1 - 313 and the FUs 321 - 323. For this reason, the L1 cache memory 302 is local memory associated with the FUs 321 -323. For example, it would be possible that the blocks 31 1 - 313 of the local L1 cache memory 302 are separately implemented, i.e., on different positions on the respective chip. Alternatively or additionally, it would be possible that the blocks 31 1 - 313 use different address spaces.

By implementing L1 cache memory, both, as shared RAM and local RAM, as illustrated in FIG. 23, data movement can be significantly reduced when processing a layer 260 of the CNN 200. For example, refresh events - where the content of at least parts of the respective L1 cache memory 301 , 302 is flushed and new content is written to the respective L1 cache memory 301 , 302 - may occur less frequently. This is reflected by the following finding: for example, for processing a convolutional layer 260 of the CNN 200, convolutions are performed between multiple kernels 261 , 262 and one and the same receptive field 251 , 252. Likewise, it is also required to perform convolutions between one and the same kernel 261 , 262 and multiple receptive fields 251 , 252. Therefore, if each block 31 1 - 313 of the local L1 cache memory 302 stores at least parts or all of a receptive field 251 , 252 - i.e., in a scenario where the input feature map is allocated to the local L1 cache memory 302 -, it is possible to reuse that data for multiple convolutions with different kernels 261 , 262 that are stored in the shared L1 cache memory 301 . Here, by means of the routers 319, different sections of the shared L1 cache memory 301 can be flexibly routed to the FUs 321 - 323. This is because the shared L1 cache memory 301 stores, at a given point in time / in response to a given refresh event, data which is being processed by each FU 321 -323, i.e., being routed to multiple FUs 321 -323. Differently, the data written to the local L1 cache memory 302 by a single refresh event is routed to a single FU 321 -323.

It is thus possible to distinguish between two modes of operation for allocation between the shared L1 cache memory 301 and the local L1 cache memory 302: (I) Shared Input Feature Map (SIFM) - in this mode of operation, data of the kernel map is stored locally using the local L1 cache memory 302, while data of the input feature map is stored centrally using the shared L1 cache memory 301. The data stored using the shared L1 cache memory 301 is shared between the FUs 321 -323. In this mode of operation, at a given point in time, multiple FUs 321 -323 calculate different 3-D convolutions using their locally stored convolution coefficients or kernels 261 , 262 on the same data of the input feature map, e.g., the same receptive field 251 , 252 currently stored in the shared L1 cache memory 301 . If the ratio of available FUs 321 -323 and number of different kernels 261 , 262 is such that it allows simultaneous calculation of all required convolutions (e.g., more FUs 321— 323 than kernels 261 ,262), this is achieved by storing data of different receptive fields 251 , 252 of the input feature map at a given point in time in the shared L1 cache memory 301 . Hence, in other words, when operating in the SIFM mode of operation, the kernel map is stationary stored in the local L1 cache memory 301 ; while different data of the input feature map associated with one or more receptive fields 252, 152 is sequentially stored in the shared L1 cache memory 301 . (II) Shared Kernel Map (SKM) - in this mode of operation input data of the input feature map is stored locally using the blocks 31 1 -313 of the local L1 cache memory 302; while the data of the kernel map is stored in the shared L1 cache memory 301 and is shared between the FUs 321 -323. In the SKM mode of operation, the situation is reversed from the one found in the SIFM mode. In other words, the input feature map is now kept stationary using the local L1 cache memory 302 and the data of the kernel map is sequentially / iteratively written to the shared L1 cache memory 301 , i.e., storing one or more kernels 261 , 262 at a time in the shared L1 cache memory 301 . Thus, as will be appreciated, SIFM and SKM implement different allocations of the shared L1 cache memory 301 to a first one of the input feature map and the kernel map of the local L1 cache memory 302 to a second one of the input feature map and the kernel map. From the above, it is apparent that data structures of the same size are stored in the blocks 31 1 -313. Therefore, in some examples, it is possible that the blocks 31 1 - 313 of the L1 cache memory 302 are all of the same size. Then, refresh events may occur in a correlated manner - e.g., within a threshold time or synchronously - for all blocks 31 1 -313 of the local L1 cache memory 302.

In order to facilitate a reduced number of refresh events for the blocks 31 1 - 313 throughout the processing of a particular layer 260 of the CNN 200, it can be possible to implement the L1 cache memory 302 with a particularly large size. For example, it would be possible that the size of the local L1 cache memory 302 is larger than the size of the shared L1 cache memory 301 . For example, the time-alignment between multiple convolutions that re-use certain data between the FUs 321 -323 can require a high rate of refresh events for the shared L1 cache memory 301 . Differently, the local L1 cache memory 302 may have a comparably low rate of refresh event, e.g., only a single refresh event at the beginning of processing a particular layer 260. This may be achieved if each block 31 1 -313 of the local L1 cache memory 302 is dimensioned to store an entire receptive field 251 , 252 or an entire kernel 261 , 262; i.e., if the local L1 cache memory 302 is dimensioned to store the entire input feature map or the entire kernel map.

FIG. 24 is a flowchart of a method according to various examples. For example, the method of FIG. 24 may be executed by the input stream router 344. First, in 701 1 , and allocation of first L1 cache memory and second L1 cache memory is selected. For example, a first one of an input feature map and a filter map - e.g., a kernel map - may be allocated to the first L1 cache memory; in the second one of the input feature map and the kernel map may be allocated to the second L1 cache memory.

It is possible that the first L1 cache memory is shared memory associated with multiple computational units of a plurality of computational units; while the second L1 cache memory is local memory, wherein each block of the local memory is associated with the respective computational unit of the plurality of computational units. Such a configuration appropriately reflects filter operations such as convolutions typically required for processing of a layer of a multi-layer neural network, where a given block of data - e.g., a receptive field or a kernel - is combined with many other blocks of data - e.g., kernels or receptive fields.

In 7012, the input feature map and the kernel map are routed to the computational units, in accordance with the allocation selected in 701 1. In 7013, the computational units are controlled to perform multiple filter operations. For this, data from the first and second L1 cache memory is provided to the computational units. FIG. 25 is a flowchart of a method according to various examples. In particular, FIG. 25 illustrates aspects with respect to selecting an allocation for the first and second L1 cache memory. For example, the method according to FIG. 25 could be executed as part of 701 1 (cf. FIG. 24). First, in 7021 , it is checked whether a further layer exists in a multi-layer neural network for which an allocation of L1 cache memories is required. The further layer is selected as the current layer, if applicable.

If a further layer exists, in 7022, it is checked whether the size of the input feature map of the current layer is larger than the size of the kernel map of the current layer.

If the size of the input feature map is smaller than the size of the kernel map, then, in 7023, the local L1 cache memory is allocated to the input feature map. If, however, the size of the input feature map is not smaller than the size of the kernel map, then, in 7024, the local L1 cache memory is allocated to the kernel map. By allocating the smaller one of the input feature map and the kernel map to the local L1 cache memory, it is possible to reduce the size of the local L1 cache memory.

Thus, if the input feature map is larger than the kernel map, then the kernel map is allocated to the local L1 cache memory. If, however, the input feature map is smaller than the kernel map, then the input feature map is allocated to the local L1 cache memory. In other words, the smaller of the maps is allocated to the local L1 cache memory. Thereby, the size of the local L1 cache memory 302 can be significantly reduced. This is based on the finding that, in a typical CNN, the size of input feature maps gets smaller towards deeper layers. The opposite is true for kernel maps. Without the possibility to select, on a per-layer basis, what map (feature or kernel) is stored in local L1 cache memory, it would be required to make local L1 cache memory big enough to either store the largest input feature map, max{IFMSize(l)}, or kernel map, max{KMSize(l)}, depending on the mapping mode fixed across all layers. In the previous formulas, / stands for the /-th CNN layer and maximum operator goes over all CNN layers. For example, in the case of VGG-16 CNN, if the input feature maps are fixedly stored in the local L1 cache memory 302 for all layers, then the total size of local L1 cache memory 302 must be roughly at least 2^*3MB=6MB. If the kernel maps are fixedly stored in the local L1 cache memory 302 for all layers, then the total size of local RAM modules must be roughly at least 2^*2MB=4MB.

If flexible selection of the allocation is available for every CNN layer, i.e., if it can be selected what map will go into local L1 cache memory 302, than the local L1 cache memory 302 can have a total size which is equal to max{min{IFMSize(l), KMSize(l)}}, where / stands for the /-th CNN layer, and maximum operator is taken over all CNN layers. In case of VGG-16 CNN, using this approach, this would require that the total size of local RAM memories is roughly at least 2^*550KB=1.1 MB, which is 4-6 times less from the sizes required if one cannot select allocation of input feature map and kernel map to the local L1 cache memory 302 on a per- layer basis. Reducing the size of the local L1 cache memory 302 helps to simplify the hardware requirements.

With respect to the above-given example: the equations for calculating the required sizes of the local L1 cache memory 302 are examples only an may be more complex if the stride size is taken into account as well.

As will be appreciated from FIG. 25, it is possible that different allocations are selected for different layers of the multi-layer neural network. The method according to FIG. 25 can be executed prior to the forward pass or during the forward pass of the respective multi-layer neural network.

While in the example of FIG. 25, the allocation is selected based on a relation of the size of the input feature map with respect to the size of the kernel map, in other examples, it would also be possible to select the allocation based on other decision criteria such as the (absolute) size of the input feature map and/or the (absolute) size of the kernel map. For example, one decision criterion that may be taken into account is whether the size of the local L1 cache memory is sufficient to store the entire input feature map and/or the entire kernel map.

FIG. 26 schematically illustrates aspects with respect to refresh events 701 -710. In particular, FIG. 26 schematically illustrates a timeline of processing different input feature maps 209 - 21 1 . In the example of FIG. 26, first, processing of the input feature map 209 associated with the respective layer 260 of the CNN 200 commences. For example, it would be possible that when processing the input feature map 209, the input feature map 209 is allocated to the shared L1 cache memory while the respective kernel map is allocated to the local L1 cache memory. At a refresh event 701 , the input feature map 209 is partly loaded into the shared L1 cache memory and the kernel map is fully loaded into the local L1 cache memory. Then, different receptive fields 251 , 252 of the input feature map 209 - the data of which is stored in the shared L1 cache memory 301 in response to the refresh event 701 - are convoluted with each one of the kernels 261 , 262 stored in the local L1 cache memory 302. At subsequent refresh events 702 - 704, different data of the input feature map 209 are subsequently written to the shared L1 cache memory 301 and, subsequently, convoluted with the various kernels 261 , 262 which have already been written to the local L1 cache memory 302 at the refresh events 701. Therefore, as will be appreciated from FIG. 26, a rate of refresh events 701 - 704 is larger for the shared L1 cache memory 301 than for the local L1 cache memory 302.

In some cases, it will be possible that the map allocated to the locale L1 cache memory 302 is replicated over a number of FUs 321 -323 / blocks 31 1 -313. This may be the case because there is a correspondingly large count of FUs 321-323 and blocks 31 1 -313. This is facilitated by the shared L1 cache memory 301 being able to store different sections of the respective other map which makes it possible to keep busy all FUs 321 -323 even if they are allocated to the same sections of the respective map. Here, the number of, e.g., input feature map bundles allocated to the shared L1 cache memory 301 that can be processed in parallel is equal to the replication factor of the kernel map allocated to the local L1 cache memory 302 (or vice versa).

Next, processing of the input feature map 210 commences. For example, here, it would be possible that the size of the input feature map 210 is significantly smaller than the size of the input feature map 209. For this reason, while processing the input feature map 210, the input feature map 210 is allocated to the local L1 cache memory 302 while the respective kernel map is allocated to the shared L1 cache memory 301 . A similar allocation is also used for processing the input feature map 21 1 which may have the same size as the input feature map 210. Then, with respect to processing of the input feature maps 210, 21 1 , a rate of refresh events 705 - 710 is larger for data associated with the respective kernel maps - now stored in the shared L1 cache memory 310 - than for data associated with the respective input feature maps 210, 21 1 .

Although the invention has been shown and described with respect to certain preferred embodiments, equivalents and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalents and modifications and is limited only by the scope of the appended claims. For example, above various examples have been described with respect to a filter operation being implemented by a convolution between an input feature map and a kernel map. Different combinational operations of the filter operations could in this context relate to convolutions between a given kernel and different receptive fields. However, the techniques described herein may be used for other kinds and types of filter operations, e.g., filter operations associated with an adding layer, a pooling layer, a concatenation layer, or a fully connected layer. Here, different combinational operations may, generally, relate to the input provided for different neurons of the output feature map.

Furthermore, while above various examples have been described with respect to multi-layer neural networks being implemented by CNNs, in other examples, it would also be possible to employ such techniques for other kinds and types of multi-layer neural networks, e.g., to conventional multi-layer perceptron neural networks. Furthermore, generally, various graph- based filtering algorithms can benefit from the techniques disclosed herein.

Claims

1. Acircuit (100, 121, 123, 161), comprising:

- at least one memory (111, 164, 301) configured to store an input feature map (201- 220, 225-228) and a filter map (280), the input feature map (201-220, 225-228) representing at least one object (285),

- a plurality of computational units (321-323), and

- a control logic (121 , 343, 353) configured to control the at least one memory (111, 164, 301) and the plurality of computational units (321-323) to perform a plurality of filter operations (2001-2003) between the input feature map (201-220, 225-228) and the filter map (280) for classification of the at least one object (285), each filter operation (2001-2003) of the plurality of filter operations (2001-2003) comprising a plurality of combinational operations (2011, 2012),

wherein the control logic (121, 343, 353) is configured to sequentially assign all combinational operations (2011, 2012) of a filter operation (2001-2003) of the plurality of filter operations (2001-2003) to a same computational unit of the plurality of computational units (321-323).

2. The circuit (100, 121, 123, 161) of claim 1,

wherein the plurality of combinational operations (2011 , 2012) of a given filter operation (2001-2003) are multiplications and summations between a plurality of two- dimensional slices (251-1) of three-dimensional receptive fields (251, 252) of the feature map (201-220, 225-228) and associated two-dimensional slices (261-1 , 261-2, 261-3, 262-1) of a three-dimensional filter (261 , 262) of the filter map (280).

3. The circuit (100, 121, 123, 161) of claims 1 or 2,

wherein the control logic (121, 343, 353) is configured to control the plurality of computational units (321-323) to perform at least some filter operations (2001-2003) in parallel.

4. The circuit (100, 121, 123, 161) of any one of the preceding claims,

wherein a count of the plurality of computational units (321-323) is different from a count of the filter operations (2001-2003). 5. The circuit (100, 121, 123, 161) of any one of the preceding claims,

wherein the control logic (121, 343, 353) is configured to selectively assign the at least two combinational operations (2011 , 2012) of the same filter operation (2001-2003) to the same computational unit (321 -323) depending on at least one of the following: a filter geometry of filters (261 , 262) of the filter map (280); a stride size associated with the filters (261 , 262) of the filter map (280); a count of the plurality of computational units (321 -323); a count of the filters (261 , 262) of the filter map (280); a size of the at least one memory (1 1 1 , 164, 301 ); and a layer (260) of a multi-layer neural network (200) associated with the feature map (201 -220, 225-228) and the filter map (280).

6. The circuit (100, 121 , 123, 161 ) of any one of the preceding claims,

wherein the control logic (121 , 343, 353) is configured to monitor completion of a first combinational operation of the at least two combinational operations (201 1 , 2012),

wherein the control logic (121 , 343, 353) is further configured to trigger a second combinational operation of the at least two combinational based on said monitoring.

7. The circuit (100, 121 , 123, 161 ) of any one of the preceding claims,

wherein the input feature map (201 -220, 225-228) and the filter map (280) are defined with respect to a layer (260) of a plurality of layers (260) of a convolutional neural network (200).

8. A method, comprising:

- loading an input feature map (201 -220, 225-228) representing at least one object

(285),

- selecting at least one filter geometry from a plurality of filter geometries, and

- performing a plurality of filter operations (2001 -2003) between receptive fields (251 , 252) of the input feature map (201 -220, 225-228) and filters (261 , 262) of the filter map (280) for classification the at least one object (285), the filters (261 , 262) having the selected at least one filter geometry.

9. The method of claim 8,

wherein different filter geometries are selected for different receptive fields (251 , 252) or for different slices (251 -1 ) of receptive fields (251 , 252) of the input feature map (201 -220, 225-228).

10. The method of claims 8 or 9,

wherein the at least one filter geometry comprises at least one of a three-dimensional lateral filter shape and a filter size (265).

1 1 . The method of any one of claims 8-10, wherein said selecting of the at least one filter geometry is based on a-priori knowledge on the at least one object (285), the a-priori knowledge optionally comprising a distance information for the at least one object (285).

12. The method of any one of claims 8-1 1 ,

wherein said selecting of the at least one filter geometry from the plurality of filter geometries is in response to said loading of the input feature map (201 -220, 225-228).

13. A method, comprising:

- storing an input feature map (201 -220, 225-228) and a filter map (280), the input feature map (201 -220, 225-228) representing at least one object (285),

- controlling a plurality of computational units (321-323) to perform a plurality of filter operations (2001 -2003) between the input feature map (201 -220, 225-228) and the filter map (280) for classification of the at least one object (285), each filter operation (2001 -2003) of the plurality of filter operations (2001 -2003) comprising a plurality of combinational operations (201 1 , 2012), and

- sequentially assigning all combinational operations (201 1 , 2012) of the same filter operation (2001 -2003) to the same computational unit (321 -323) of the plurality of computational units (321 -323).

14. The method of claim 13,

wherein the method is executed by the circuit (100, 121 , 123, 161 ) of any one of claims 1 -7. 15. A circuit (100, 121 , 123, 161 ), comprising:

- at least one memory configured to store an input feature map and a filter map, the input feature map representing at least one object,

- a control logic configured to select at least one filter geometry from a plurality of filter geometry, and

- a plurality of computational units configured to perform a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one objects., the filters having the selected at least one filter geometry.

16. The circuit (100, 121 , 123, 161 ) of claim 15,

wherein the circuit is configured to perform the method of any one of claims 8-12.