WO2018145308A1

WO2018145308A1 - Filter reusing mechanism for constructing robust deep convolutional neural network

Info

Publication number: WO2018145308A1
Application number: PCT/CN2017/073342
Authority: WO
Inventors: Xiaoheng JIANG
Original assignee: Nokia Technologies Beijing Co Ltd; Nokia Technologies Oy
Current assignee: Nokia Technologies Beijing Co Ltd; Nokia Technologies Oy
Priority date: 2017-02-13
Filing date: 2017-02-13
Publication date: 2018-08-16
Anticipated expiration: 2019-08-13
Also published as: CN110506277B; CN110506277A

Abstract

An apparatus and method, and the method includes: generating feature maps for a first convolutional layer of a convolutional neural network based on a region of an image to be evaluated and a learned filter from the first convolutional layer (406); generating feature maps for one or more subsequent convolutional layers of the convolutional neural network based on the feature maps of a prior convolutional layer, a learned filter for the prior convolutional layer and a learned filter for the subsequent convolutional layer (408); detecting a presence of an object of interest in the region of the image based on the generated feature maps of the first and one or more subsequent convolutional layers (410).

Description

[Title established by the ISA under Rule 37.2] FILTER REUSING MECHANISM FOR CONSTRUCTING ROBUST DEEP CONVOLUTIONAL NEURAL NETWORK

FIELD

The present disclosure is related to a neural network， and more particularly， to a filtering mechanism for a convolutional neural network.

BACKGROUND

Object recognition is an important component in the field of computer vision. In the past few years， deep convolutional neural networks (CNNs) have been used to advance object recognition. The power of deep convolntional neural networks lies in the fact that they are able to learn a hierarchy of features. An example of CNN architecture is described in G. Huang， Z. Liu， Q. Weinberge： Densely Connected Convolutional Networks， CoRR， abs/1608. 06993 (2016) (hereinafter “Huang” ) . In Huang， a CNN architecture is proposed that introduces direct connections within all layers of a block in the neural network. That is， each layer is directly connected to every other layer in one block in a feed-forward fashion. One block typically consists of several layers without a down-sampling operation. For each layer， the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. The core idea is to reuse the feature maps generated in the previous layers. However， these feature maps themselves do not bring in new information to the neural network.

SUMMARY

Accordingly， the present disclosure provides an apparatus and method to generate feature maps for a first convolutional layer of a convolutional neural network based on a region of an image to be evaluated and a learned filter from the first convolutional layer， to generate feature maps for one or more subsequent convolutional layers of the convolutional neural network after the first convolutional layer， and to detect a presence of an object of interest in the region of the image based on the generated feature maps of the first and one or more subsequent convolutional layers. Each subsequent convolutional layer is generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a learned filter for the subsequent convolutional layer.

The apparatus and method can be further configured to receive the image which is captured from an image sensing device， and/or to initiate an alarm if the object is detected. Furthermore， the convolutional neural network can be applied to each region of the image to detect whether the object is present in any of the regions of the image.

The apparatus and method can also be configured to learn a filter for each convolutional layer of the convolutional neural network during a training stage (or phase) using one or more training images. To learn a filter， the apparatus and method can be configured to initialize filters for the convolutional layers of the convolutional neural network， to generate feature maps for each convolutional layer using forward-propagation， to calculate a loss using a loss function based on the generated feature maps and a score for each category and corresponding label， and to update the filters for the convolutional layers using back-propagation if the calculated loss has decreased. Each subsequent convolutional layer after the first convolutional layer is generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a filter for the subsequent convolutional layer. The apparatus and method can be configured to repeat the operations of calculating feature maps， calculating a loss， and updating the filters， until the convolutional neural network converges when the calculated loss is no longer decreasing.

In the apparatus and method， two map features can be generated for each of the one or more subsequent convolutional layers. Furthermore， the operations of generating feature maps for a first convolutional layer， generating feature maps for one or more subsequent convolutional layers， and detecting a presence of an object are performed in a testing stage.

To detect a presence of an object in a region of an image， the apparatus and method can be configured to obtain a score for the region from application of the convolutional neural network， and to compare the score for the region to a threshold value. The object is detected in the region if the score for the region is larger than the threshold value.

DESCRIPTION OF THE FIGURES

The description of the various example embodiments is explained in conjunction with the appended drawings， in which：

Fig. 1 illustrates a block diagram of an example system for detecting a presence or absence of an object using a convolutional neural network (CNN) with filter reuse (or sharing) in accordance with an embodiment of the present disclosure；

Fig. 2 illustrates a block diagram of an example system for detecting a presence or absence of an object using a convolutional neural network (CNN) with filter reuse (or sharing) in accordance with another embodiment of the present disclosure；

Fig. 3 is an example architecture of a convolutional neural network which re-uses a filter from a prior convolutional layer in a subsequent convolutional layer in accordance with an embodiment of the present disclosure；

Fig. 4 is a flow diagram showing an example process by which a system， such as for example in Fig. 1 or 2， is configured to implement training and/or testing stages using a convolutional neural network， in accordance with an embodiment of the present disclosure；

Fig. 5 is a flow diagram showing an example process by which a system， such as for example in Fig. 1 or 2， is configured to implement a training stage for training a convolutional neural network， in accordance with an embodiment of the present disclosure；

Fig. 6 is a flow diagram showing an example process by which a system， such as for example in Fig. 1 or 2， is configured to implement a testing stage for evaluating an image or regions thereof using a trained convolutional neural network， in accordance with an embodiment of the present disclosure； and

Fig. 7 is a flow diagram showing an example detection process by which a system， such as for example in Fig. 1 or 2， is configured to detect a presence (or absence) of a feature， such as an object， using a convolutional neural network， in accordance with an example embodiment of the present disclosure.

DISCUSSION OF EXAMPLE EMBODIMENTS

In accordance with various example embodiments， there is provided an apparatus and method， which employ a deep convolutional neural network (CNN) with a filtering reuse mechanism to analyze an image or region thereof， and to detect a presence (or absence) of an object (s) of interest. The CNN is configured to re-use filters from a prior (e.g.， a previous or earlier) convolutional layer to compute map features in a subsequent convolutional layer. In this way， the filters can be fully used or shared so that the ability of feature representation is significantly enhanced， thereby significantly improving the recognition accuracy of the resulting deep CNN. Compared to other approaches that simply reuse prior feature maps， the present CNN approach with filter reuse also can take advantage of information (e.g.， filters) obtained from the prior convolutional layer， as well as generate new information (e.g.， feature maps) in a current convolutional layer. Furthermore， the architecture of such a CNN can reduce the number of parameters because each current convolutional layer reuses the filter of a prior convolutional layer. Such a configuration， thus， can address the over-fitting problem that is caused by using too many parameters.

The apparatus and method of the present disclosure can be employed in object recognition systems， such as for example a video surveillance system that employs a camera or other sensor. For example， the camera can capture several multi-view images of the same scenario such as 360-degree images. The task of the video surveillance is to detect one or more objects of interest (e.g.， pedestrians， animals， or other objects) from the multi-view images， and then provide an alert or notification (e.g.， an alarm or warning) to the user. Because a camera system can be provided to capture 360-degree images， a video surveillance system can potentially detect all objects of interest appearing in a scenario or environment. In such a surveillance system， each camera (or camera sub-system) can be configured to perform object detection. For example， the operations of the video surveillance system using CNN with filter reuse can involve the following. Each camera of the system captures an image. For each region of a captured image， the CNN with filter reuse can， for example， be employed to classify the region as an object of interest if the response of the CNN is larger than a pre-defined threshold， and to classify the region as background (e.g.， non-object) if the response of the CNN is equal to or less than the threshold.

The object detection process， as described herein， can involve a training stage and testing stage. The goal of the training stage is to design or configure the structure of the CNN with filter reuse， and to learn the parameters (i.e.， the filters) of the CNN. In the training stage， the CNN is trained to detect a presence (or absence) of a particular object (s) using training images as input. Back-propagation can， for example， be used to learn or configure the parameters， such as the filters， of the CNN to detect the presence (or absence) of an object. The training images can include example images of the object (s) of interest， of background (s) ， and other aspects that may be present in an image. In the testing stage， the trained CNN with filter reuse is applied to an image to be tested (e.g.， input image or testing image) to detect a presence (or absence) of the particular object (s) . With the structure and parameters of the trained deep CNN， the goal of the testing stage is to classify each region of the image by taking the region as the input of the trained CNN. The region is classified as either an object of interest or background. If the classification decision is an object of interest， the system generates， for example， an alert or notification (e.g.， an alert signal in the form of voice or message) which can be immediately sent to the user via a network connection (e.g.， the Internet) or other media. These operations implemented in the process of object detection can be performed in each camera or camera-subsystem of the surveillance system. An alert can be generated once one of the cameras in the system detects an object of interest. The object detection processes may be implemented in or with each camera or each camera subsystem. Examples of a CNN with filter reuse， and an objection detection system are described in further detail below with reference to the figures.

Fig. 1 illustrates a block diagram of example components of a system 100 for detecting a presence (or absence) of an object of interest using a convolutional neural network (CNN) that reuses or shares filters. As shown in Fig. 1， the system 100 includes one or more processor (s) 110， a plurality of sensors 120， a user interface (s) 130， a memory 140， a communication interface (s) 150， a power supply 160 and output device (s) 170. The power supply 160 can include a battery power unit， which can be rechargeable， or a unit that provides connection to an external power source.

The sensors 120 are configured to sense or monitor activities， e.g.， an object (s) ， in a geographical area or an environment， such as around a vehicle， around or inside a building， and so forth. The sensors 120 can include one or more image sensing device (s) or sensor (s) . The sensor 120 can for example be a camera with one or more lenses (e.g.， a camera， a web camera， a camera system to capture panoramic or 360 degree images， a camera with a wide lens or multiple lenses， etc. ) . The image sensing device is configured to capture images or image data， which can be analyzed using the CNN to detect a presence (or absence) of an object of interest. The captured images or image data can include image frames， video， pictures， and/or the like. The sensor 120 may also comprise a millimeter wave radar， an infrared camera， Lidar (Light Detection And Ranging) sensor and/or other types of sensors.

The user interface (s) 130 may include a plurality of user input devices through which a user can input information or commands to the system 100. The user interface (s) 130 may include a keypad， a touch-screen display， a microphone， or other user input devices through which a user can input information or commands.

The output devices 170 can include a display， a speaker or other devices which are able to convey information to a user. The communication interface (s) 150 can include communication circuitry (e.g.， transmitter (TX) ， receiver (RX) ， transceiver such as a radio frequency transceiver， etc. ) for conducting line-based communications with an external device such as a USB or Ethernet cable interface， or for conducting wireless communications with an external device， such as for example through a wireless personal area network， a wireless local area network， a cellular network or wireless wide area network. The communication interface (s) 150 can， for example， be used to receive a CNN and its parameters or updates thereof (e.g.， learned filters for an object of interest) from an external computing device 180 (e.g.， server， data center， etc. ) ， to transmit an alarm or other notification to an external computing device 180 (e.g.， a user’s device such as a computer， etc. ) ， and/or to interact with external computing devices 180 in to implement in a distributed manner the various operations described herein， such as the training stage， the testing stage， the alarm notification and/or other operations as described herein.

The memory 140 is a data storage device that can store computer executable code or programs， which when executed by the processor 110， controls the operations of the system 100. The memory 140 also can store configuration information for a CNN 142 and its parameters 144 (e.g.， learned filters) ， images 146 (e.g.， training images， captured images， etc. ) ， and a detection algorithm 148 for implementing the various operations described herein， such as the training stage， the testing stage， the alarm notification， and other operations as described herein.

The processor 110 is in communication with the memory 140. The processor 110 is a processing system， which can include one or more processors， such as CPU， GPU， controller， dedicated circuitry or other processing unit， which controls the operations of the system 100， including the detection operations (e.g.， training stage， testing stage， alarm notification， etc. ) described herein in the present disclosure. For example， the processor 110 is configured to train the CNN 142 to detect a presence or absence of objects of interest (e.g.， detect an object (s) of interest， background (s) ， etc. ) by configuring or learning the parameters (e.g.， learning the filters) using training images or the like， category/label information， and so forth. The processor 110 is also configured to test captured image (s) or regions thereof using the trained CNN 142 with the learned parameters in order to detect a presence (or absence) of an object in an image or region thereof. The object of interest may include a person such as a pedestrian， an animal， vehicles， traffic signs， road hazards， and/or the like， or other objects of interest according to the intended application. The processor 110 is also configured to initiate an alarm or other notification when a presence of the object is detected， such as notifying a user by outputting the notification using the output device 170 or by transmitting the notification to an external computing device 180 (e.g.， user’s device， data center， server， etc. ) via the communication interface 150. The external computing device 180 can include components similar to those in the system 100， such as shown and described above with reference to Fig. 1.

Fig. 2 depicts an example system 200 including processor (s) 210， and sensor (s) 220 in accordance with some example embodiments. The system 200 may also include a radio frequency transceiver 250. Moreover， the system 200 may be mounted in a vehicle 20， such as a car or truck， although the system may be used without the vehicles 20 as well. The system 200 may include the same or similar components and functionality， such as provided in the system 100 of Fig. 1.

For example， the sensor (s) 220 may comprise one or more image sensors configured to provide image data， such as image frames， video， pictures， and/or the like. In the case of advanced driver assistance systems/autonomous vehicles for example， the sensor 220 may comprise a camera， millimeter wave radar， an infrared camera， Lidar (Light Detection And Ranging) sensor and/or other types of sensors.

The processor 210 may comprise of CNN circuitry， which may represent dedicated CNN circuitry configured to implement the convolutional neural network and other operations as described herein. Alternatively or additionally， the CNN circuitry may be implemented in other ways such as， using at least one memory including program code which when executed by at least one processing device (e.g.， CPU， GPU， controller， etc. ) .

In some example embodiments， the system 200 may have a training stage. The training stage may configure the CNN circuitry to learn to detect and/or classify one or more objects of interest. The processor 210 may be trained with images including objects such as people， other vehicles， road hazards， and/or the like. Once trained， when an image includes the object (s) ， the trained CNN implementable via the processor 210 may detect the object (s) and provide an indication of the detection/classification of the object (s) . In the training stage， the CNN may learn its configuration (e.g.， parameters， weights， and/or the like) . Once trained， the configured CNN can be used in a test or operational stage to detect and/or classify regions (e.g.， patches or portions) of an unknown， input image and thus determine whether that input image includes an object of interest or just background (i.e.， not having an object of interest) .

In some example embodiments， the system 200 may be trained to detect objects， such as people， animals， other vehicles， traffic signs， road hazards， and/or the like. In the advanced driver assistance system (ADAS) ， when an object is detected， such as a vehicle/person， an output such as a warning sound， haptic feedback， indication of recognized object， or other indication may be generated to for example warn or notify a driver. In the case of an autonomous vehicle including system 200， the detected objects may signal control circuitry to take additional action in the vehicle (e.g.， initiate breaking， acceleration/deceleration， steering and/or some other action) . Moreover， the indication may be transmitted to other vehicles， IoT devices or cloud， mobile edge computing (MEC) platform and/or the like via radio transceiver 250.

Fig. 3 is an example of a convolutional neural network (CNN) architecture 300， which includes a plurality of convolutional layers (e.g.， Layer 1 ... Layer L or l) ， and a decision layer. The CNN architecture 200 is configured to re-use or share filters from a prior convolutional layer in a subsequent convolutional layer. For example， in layer 1， an N₁ feature maps C₁ are obtained by a filter W₁. The spatial width and height of C₁ are w₁ and h₁， respectively. In layer 2， feature maps C₂ are obtained not only by a new filter W₂ but also by the filter W₁ of prior layer 1. With the filter W₂， N₂₁ feature maps are obtained. With the existing filter W₁， N₂₂ feature maps are obtained. The N₂₁ feature maps and the N₂₂ feature maps are concatenated to form the feature maps C₂ in layer 2. Therefore， as shown in Fig. 2， the filter W₁ of prior layer 1 is reused in layer 2. Similarly， a new filter W₃ is used to generate the N₃₁ feature maps of layer 3， and the filter W₂ obtained in prior layer 2 is used to produce the N₃₂ feature maps of layer 3. The N₃₁ feature maps and N₃₂ feature maps are concatenated to form feature maps C₃ of layer 3. In the same way， the rest of the feature maps C₄， C₅ ... C_L are computed. The CNN architecture 300 can be employed in a detection process to detect a presence (or absence) of an object (s) of interest in a region of an image， or to classify regions of interest of an image. As described herein， the detection process can include a training stage to learn the parameters for the CNN using training images， and a testing stage to apply the trained CNN to classify regions of an image and to detect a presence (or absence) of an object of interest. Examples of the training stage and testing stage are described below with reference to the figures.

Fig. 4 is a flow diagram showing an example process 400 by which a system， such as for example in Fig. 1 or 2， is configured to implement training and/or testing stages of the convolutional neural network， such as for example shown in Fig. 3. For the purpose of explanation， the process 400 is discussed below with reference to the processor 110 and other components of the system 100 in Fig 1， and describes high level operations that are performed in relations to a training stage， and a testing stage.

At reference 402， the processor 110 is configured to provide a convolutional neural network during a training stage.

At reference 404， the processor 110 is configured to learn a parameter (s) ， such as a filter， for each convolutional layer of the convolutional neural network during a training stage.

At reference 406， the processor 110 is configured to generate feature maps for a first convolutional layer of the convolutional neural network based on a region of an image to be evaluated and a learned filter from the first convolutional layer during a testing stage.

At reference 408， the processor 110 is configured to generate feature maps for one or more subsequent convolutional layers of the convolutional neural network based on feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer， and a learned filter for the subsequent convolutional layer during the testing stage.

At reference 410， the processor 110 is configured to detect a presence (or an absence) of an object of interest in the region of the image based on the generated feature maps of the first and one or more subsequent convolutional layers during the testing stage. In the event that an object is detected， the processor can be configured to initiate an alarm or other notification to a user or other entity.

Fig. 5 is a flow diagram showing an example process 500 by which a system， such as for example in Fig. 1 or 2， is configured to implement a training stage for training a CNN with filter reuse (see， e.g.， Fig. 3) . For the purpose of explanation， the process 400 is discussed below with reference to the processor 110 and other components of the system 100 in Fig 1， and describes operations that are performed during the training stage.

At reference 502， a set of training images and their corresponding labels are prepared. For example， if the training image contains an object of interest， then the label is set to a number (e.g.， 1) . If the training image does not contain the object of interest， then the label of the image set to another number (e.g.， -1) . The set of training images and their corresponding labels are used during the training stage in the design and configuration of a CNN for detecting the object of interest.

At reference 504， the processor 110 implements an initialization operation of the parameters， such as the filters， for the CNN. For example， the processor 110 initializes the filters (e.g.， W₁ ... W_L) for the convolutional layers (e.g.， Layers 1 ... L or l) of the CNN， such as in Fig. 3. The filters can be initialized by using a Gaussian distribution with zero mean and a small variation (e.g.， 0.01) .

At reference 506， the processor 110 generates (e.g.， calculates or computes) the feature maps on a convolutional layer-by-layer basis， such as for example using forward-propagation with a training image or region thereof as an input from the set of training images. For example， this operation can involve calculating feature maps using two filters， such as shown in the CNN architecture 300 of Fig. 3. One filter comes from the prior convolutional layer and the other filter comes from the current convolutional layer. For instance， given W_i of layer l and W_i+1 of layer l+1， the feature maps generated in layer l are denoted by N_l. When computing the feature maps of layer l+1， the convolution operation is carried out twice. First， the feature maps are computed

where “°” represents a convolution operation. Second， the feature maps are computed as follows：

Thereafter， the feature maps

and

are concatenated to generate the final output N_l+1 of layer l+1. It is noted that W_l is used in layer l to calculate the feature maps

Therefore， the filter W_l used in layer l is reused in layer l+1 to generate new feature maps.

At reference 508， the processor 110 implements a decision layer in which a loss calculation is performed. For example， the processor 110 performs a loss calculation， such as by calculating the loss according to the final score for each category and the corresponding label. The loss calculation can be performed using a softmax loss function. An example of a softmax loss function is represented by equation (1) as follows：

where：

y is the vector representing the scores for all classes， and

y_c is the score of class c.

Instead of the softmax loss function， other functions can also be adopted in the decision layer， such as Support Vector Machine (SVM) loss function or other suitable loss functions for use with a CNN. By way of example， softmax loss function calculates the cross- entropy loss， whereas SVM loss function calculates the hinge loss. As to the classification task， these two functions perform almost the same.

At reference 510， the processor 110 determines whether the filters of the CNN should be updated based on the calculated loss， e.g.， a change of calculated loss. For example， the processor 110 determines if the loss has stopped decreasing or changing， or in other words， if the CNN is converging. If the loss has stopped decreasing， the processor 110 outputs the filters (e.g.， the learned filters) for use in the CNN during the testing stage at reference 514. The outputted filters can be stored in memory for use with the CNN.

Otherwise， if the loss has not stopped decreasing， the processor 110 updates the filters of the CNN at reference 512. For example， the processor 110 can implement back-propagation (e.g.， standard back-propagation or other variants thereof) to update all of the filters of the CNN. The filters can be updated through the chain rule during back-propagation， for example， according to equation (2) as follows：

where，

ε represents the loss function， and

is the gradient propagated from the deep layer.

Thereafter， the filters are updated as follows：

where，

η represents the updating coefficient (e.g.， learning rate) .

The process 500 then continues by repeating the operations in

references

506， 508 and 510， until the calculated loss stops decreasing or changing， or in other words， the CNN converges.

Fig. 6 is a flow diagram showing an example process 600 by which a system， such as for example in Fig. 1 or 2， is configured to implement a testing stage for evaluating an image or region thereof using a trained CNN with filter reuse (see， e.g.， Fig. 3) . The test stage can differ from the training stage in that it does not need to update the filters. Instead， the test stage can adopt the filters learned from the training stage to classify or detect objects. Furthermore， there is no need to calculate the loss for the decision layer. The decision layer simply decides which class has the highest score. For the purpose of explanation， the process 600 is discussed below with reference to the processor 110 and other components of the system 100 in Fig 1， and describes operations that are performed during the testing stage.

At reference 602， the processor 110 implements a region proposal operation by determining the region (of an image) that is likely to contain the object of interest， e.g.， a targeted object. For example， one simple approach to identify a region of interest for evaluation is to adopt the sliding window technique that scans an input image exhaustively. Other methods can also be adopted.

At reference 604， the processor 110 implements map feature generation using the CNN with filter reuse. For example， the processor 110 applies the region of interest of the image to the CNN， and generates the feature maps on a convolutional layer-by-layer basis using the learned parameters， e.g.， the filters， such as from the training stage. The map feature generation procedure in the test stage can be similar to that performed in the training stages， such as described above with reference to Fig. 5.

At reference 606， the processor 110 implements a decision layer to perform classification or object detection of the region. For example， in the decision layer， the processor 110 can take the score vectory as input and determine which one (e.g.， y_c) has the highest score. This operation outputs the label (e.g.， pedestrian) corresponding to the highest score.

As previously discussed， the decision layer can use the softmax loss function， or other loss functions such as the SVM loss function. The softmax loss function calculates the cross-entropy loss， whereas the SVM loss function calculates the hinge loss. As to the classification task， these two functions perform almost the same.

Fig. 7 is a flow diagram showing an example detection process 700 by which a system， such as for example in Fig. 1 or 2， is configured to detect a presence (or absence) of an object of interest， using a trained CNN with filter reuse (see， e.g.， Fig. 3) . For the purpose of explanation， the process 700 is discussed below with reference to the processor 110 and other components of the system 100 in Fig 1.

At reference 702， the sensor (s) 120 captures image (s) . The images can be captured for different scenarios depending on the application for the detection process 700. For example， the sensor (s) 120 may be positioned， installed or mounted to capture images for fixed locations (e.g.， different locations in or around a building or other location) or for movable locations (e.g.， locations around a moving vehicle， person or other system) . By way of example， a camera system， such as a single or multi-lens camera or camera system to capture panoramic or 360 degree images， can be installed on a vehicle.

At reference 704， the processor 110 scans each region of an image， such as from the captured image (s) .

At reference 706， the processor 110 applies the CNN to each region of the image， such as by implementing a testing stage. An example of a testing stage is described by the process 600 which is described with reference to Fig. 6. As explained above， the application of the CNN provides a score for the tested region of the image.

At reference 708， the processor 110 determines if the score from the CNN is larger than a threshold (e.g.， a threshold value) .

If the score is not larger than the threshold， the processor 110 does not initiate an alarm or notification， at reference 710. The process 700 continues to capture and evaluate images. Otherwise， ifthe score is larger than the threshold， the processor 110， at reference 712， initiates an alarm or notification reflecting a detection of an object of interest or classification of such an object. As previously discussed， examples of an object of interest can include a pedestrian， an animal， a vehicle， a traffic sign， a road hazard or other pertinent objects depending on the intended application for the detection process. The alarm or notification may be initiated locally at the system 100 via one of the output devices 170 or transmitted to an external computing device 180. The alarm may be provided to the user in the form of a visual or audio notification or other suitable medium (e.g.， vibrational， etc. ) .

Experimental Example

Experimental results on the KITTI dataset show the effectiveness of the present method and system which employs a CNN with filter reuse. The KITTI dataset were captured by a pair of cameras. The subset of the KITTI dataset used for pedestrian detection consists of 7481 training images and 7518 test images. In the present method and system， the deep CNN can， for example， be composed of L＝13 layers. The sizes of the filters W₁， W₂， W₃， W₄， W₅， W₆， W₇， W_s， W₉， W₁₀， W₁₁， W₁₂， and W₁₃ are 3×3×3， 3×3×32， 3×3×64， 3×3×64， 3×3×128， 3×3×128， 3×3×128， 3×3×256， 3×3×256， 3×3×256， 3×3×256， 3×3×256， and 3×3×256， respectively. The traditional VGG neural network is compared with an example of the present method and system which employs a filter reusing mechanism with the CNN. The average precision (AP) of the present CNN with filter reuse is 60.43％whereas the average precision of the traditional VGG neural network is 56.74％ (see， e.g.， Simonyan K， Zisserman A.： Very deep convolutional networks for large-scale image recognition， arXiv preprint arXiv： 1409.1556 (2014) ) . It is observed that the present CNN method with filter reuse significantly outperforms the traditional VGG method. That is， the introduction of filter reuse or sharing plays an important role in improving the performance of object detection. As such， the present method and system， which t employs a filtering reuse mechanism in a CNN， can provide significant improvements to the field of object detection， and thus， video surveillance.

It should be understood that systems and methods described above are provided as examples. Although the

system

100 or 200， as described herein， can be used to implement among other things operations including the training stage， the testing stage and the alarm notification， these operations may be distributed and performed across a plurality of systems over a communication network (s) . Furthermore， in addition to standard back-propagation， the training stage may instead employ other variants of back-propagation that may be aimed at improving the performance of back-propagation. The training and testing stages may also adopt other suitable loss functions or training strategies. The CNN approach with reuse or shared filters， as described herein， may be utilized in various applications， including but not limited to object detection/recognition in video surveillance systems， in autonomous or semi-autonomous vehicles， or in ADAS implementations.

It should also be understood that the example embodiments disclosed and taught herein are susceptible to numerous and various modifications and alternative forms. Thus， the use of a singular term， such as， but not limited to， “a” and the like， is not intended as limiting of the number of items.

It will be appreciated that the development of an actual， real commercial application incorporating aspects of the disclosed embodiments will require many implementation specific decisions to achieve the developer’s ultimate goal for the commercial embodiment. Such implementation specific decisions may include， and likely are not limited to， compliance with system related， business related， government related and other constraints， which may vary by specific implementation， location and from time to time. While a developer’s efforts might be complex and time consuming in an absolute sense， such efforts would nevertheless be a routine undertaking for those of skill in this art having the benefit of this disclosure.

Using the description provided herein， the example embodiments may be implemented as a machine， process， or article of manufacture by using standard programming and/or engineering techniques to produce programming software， firmware， hardware or any combination thereof.

Any resulting program (s) ， having computer-readable program code， may be embodied on one or more computer-usable media such as resident memory devices， smart cards or other removable memory devices， or transmitting devices， thereby making a computer program product or article of manufacture according to the embodiments. As such， the terms “article of manufacture” and “computer program product” as used herein are intended to encompass a computer program that exists permanently or temporarily on any computer-usable medium or in any transmitting medium which transmits such a program.

As indicated above， memory/storage devices can include， but are not limited to， disks， solid state drives， optical disks， removable memory devices such as smart cards， SIMs， WIMs， semiconductor memories such as RAM， ROM， PROMS， etc. Transmitting mediums include， but are not limited to， transmissions via wireless communication networks， the Intemet， intranets， telephone/modem-based network communication， hard-wired/cabled communication network， satellite communication， and other stationary or mobile network systems/communication links.

While particular embodiments and applications of the present disclosure have been illustrated and described， it is to be understood that the present disclosure is not limited to the precise construction and compositions disclosed herein and that various modifications， changes， and variations can be apparent from the foregoing descriptions without departing from the invention as defined in the appended claims.

Claims

A computer-implemented method， comprising：

configuring to generate feature maps for a first convolutional layer of a convolutional neural network based on a region of an image to be evaluated and a learned filter from the first convolutional layer；

configuring to generate feature maps for one or more subsequent convolutional layers of the convolutional neural network， each subsequent convolutional layer being generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a learned filter for the subsequent convolutional layer； and

configuring to detect a presence of an object of interest in the region of the image based on the generated feature maps of the first and one or more subsequent convolutional layers.
The method according to claim 1， further comprising：

configuring to receive the image which is captured from an image sensing device.
The method according to any one of claims 1 and 2， further comprising：

configuring to learn a filter for each convolutional layer of the convolutional neural network during a training stage using one or more training images.
The method according to claim 3， wherein the configuring to learn a filter comprises：

configuring to initialize filters for the convolutional layers of the convolutional neural network；

configuring to generate feature maps for each convolutional layer using forward-propagation， each subsequent convolutional layer after the first convolutional layer being generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a filter for the subsequent convolutional layer；

configuring to calculate a loss using a loss function based on the generated feature maps and a score for each category and corresponding label； and

configuring to update the filters for the convolutional layers using back-propagation if the calculated loss has decreased，

wherein the configuring to calculate feature maps， the configuring to calculate a loss， and the configuring to update the filters are repeated until the convolutional neural network converges when the calculated loss is no longer decreasing.
The method according to any one of claims 1 through 4， wherein two map features are generated for each of the one or more subsequent convolutional layers.
The method according to claims 5， further comprising：

concatenating the two map features for each of the one or more subsequent convolutional layers.
The method according to any one of claims 1 through 6， wherein the configuring to generate feature maps for a first convolutional layer， the configuring to generate feature maps for one or more subsequent convolutional layers， and the configuring to detect a presence of an object are performed in a testing stage.
The method according to any one of claims 1 through 7， wherein the configuring to detect comprises：

configuring to obtain a score for the region from application of the convolutional neural network； and

configuring to compare the score for the region to a threshold value，

wherein the object is detected in the region if the score for the region is larger than the threshold value.
The method according to any one of claims 1 through 8， further comprising：

configuring to initiate an alarm ifthe object is detected.
The method according to any one of claims 1 through 9， wherein the convolutional neural network is applied to each region of the image to detect whether the object is present in any of the regions of the image.
An apparatus comprising means for performing the method of any one of claims 1 through 10.
A computer program product， comprising computer code instructions， when executed by at least one processor， cause an apparatus to perform at least the method of any one of claims 1 through 10.
An apparatus， comprising：

a memory； and

one or more processors configured：

to generate feature maps for a first convolutional layer of a convolutional neural network based on a region of an image to be evaluated and a learned filter from a first convolutional layer；

to generate feature maps for one or more subsequent convolutional layers of the convolutional neural network， each subsequent convolutional layer being generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a learned filter for the subsequent convolutional layer； and

to detect a presence of an object of interest in the region of the image based on the generated feature maps of the first and one or more subsequent convolutional layers.
The apparatus according to claim 13， wherein the one or more processors are further configured to receive the image which is captured from an image sensing device.
The apparatus according to any one of claims 13 and 14， wherein the one or more processors are further configured to learn a filter for each convolutional layer of the convolutional neural network during a training stage using one or more training images.
The apparatus according to claim 15， wherein， to learn a filter， the one or more processors are configured：

to initialize filters for the convolutional layers of the convolutional neural network；

to generate feature maps for each convolutional layer using forward-propagation， each subsequent convolutional layer after the first convolutional layer being generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a filter for the subsequent convolutional layer；

to calculate a loss using a loss function based on the generated feature maps and a score for each category and corresponding label； and

to update the filters for the convolutional layers using back-propagation if the calculated loss has decreased，

wherein the one or more processors are configured to repeat the operations of calculating feature maps， calculating a loss， and updating the filters until the convolutional neural network converges when the calculated loss is no longer decreasing.
The apparatus according to any one of claims 13 through 16， wherein two map features are generated for each of the one or more subsequent convolutional layers.
The apparatus according to any one of claims 13 through 17， wherein the one or more processors are configured to concatenate the two map features for each of the one or more subsequent convolutional layers.
The apparatus according to any one of claims 13 through 18， wherein the one or more processors are configured to generate feature maps for a first convolutional layer， to generate feature maps for one or more subsequent convolutional layers， and to detect a presence of an object in a testing stage.
The apparatus according to any one of claims 13 through 19， wherein， to detect the presence of an object， the one or more processors are configured：

to obtain a score for the region from application of the convolutional neural network； and

to compare the score for the region to a threshold value，

wherein the object is detected in the region if the score for the region is larger than the threshold value.
The apparatus according to any one of claims 13 through 20， wherein the one or more processors are further configured to initiate an alarm if the object is detected.
The apparatus according to any one of claims 13 through 21， wherein the convolutional neural network is applied to each region of the image to detect whether the object is present in any of the regions of the image.
A computer-implemented method， comprising：

configuring to initialize filters for convolutional layers of a convolutional neural network；

configuring to generate feature maps for each convolutional layer using forward-propagation， each subsequent convolutional layer after the first convolutional layer being generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a filter for the subsequent convolutional layer；

configuring to calculate a loss using a loss function based on the generated feature maps and a score for each category and corresponding label； and

configuring to update the filters for the convolutional layers using back-propagation if the calculated loss has decreased，

wherein the configuring to calculate feature maps， the configuring to calculate a loss， and the configuring to update the filters are repeated until the convolutional neural network converges when the calculated loss is no longer decreasing.
An apparatus comprising means for performing the method of claims 23.
A computer program product， comprising computer code instructions， when executed by at least one processor， cause an apparatus to perform at least the method of claim 23.
An apparatus， comprising：

a memory； and

one or more processors configured：

to initialize filters for convolutional layers of a convolutional neural network；

to generate feature maps for each convolutional layer using forward-propagation， each subsequent convolutional layer after the first convolutional layer being generated based on the feature maps of a prior convolutional layer， a learned filter for the prior convolutional layer and a filter for the subsequent convolutional layer；

to calculate a loss using a loss function based on the generated feature maps and a score for each category and corresponding label； and

to update the filters for the convolutional layers using back-propagation if the calculated loss has decreased，

wherein the one or more processors are configured to repeat the operations of calculating feature maps， calculating a loss， and updating the filters until the convolutional neural network converges when the calculated loss is no longer decreasing.