
WO2018132961A1 - Apparatus, method and computer program product for object detection - Google Patents

Apparatus, method and computer program product for object detection

Info

Publication number
WO2018132961A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
factor
sample
classification
factors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/071477
Other languages
French (fr)
Inventor
Jiale CAO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Beijing Co Ltd
Nokia Technologies Oy
Original Assignee
Nokia Technologies Beijing Co Ltd
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Beijing Co Ltd and Nokia Technologies Oy
Priority to PCT/CN2017/071477
Publication of WO2018132961A1
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • Embodiments of the disclosure generally relate to information technologies, and, more particularly, to object detection.
  • object detection plays an important role in many applications.
  • object detection systems are broadly used in computer vision, automatic speech recognition, natural language processing, drug discovery and toxicology, customer relationship management, recommendation systems, and biomedical informatics.
  • the object detection systems can be used in video surveillance, traffic surveillance, driver assistant systems, autonomous vehicles, traffic monitoring, human identification, human-computer interaction, public security, event detection, tracking, frontier guards and customs, scenario analysis and classification, image indexing and retrieval, etc.
  • the input/sample of object detection systems may be degraded by at least two factors which may greatly influence the performance of the object detection systems.
  • an image captured by the driver assistant system may be degraded by at least two of haze, rain, fog, sand, dust, sand storm, hailstone, dark light, etc.
  • haze and dark light are two common sources of image quality degradation. They hamper the visibility of the scene and its objects. The intensity, hue and saturation of the scene and its objects are also altered by haze and dark light. The performance of the driver assistant system degrades drastically in complex and challenging weather.
  • the apparatus may comprise at least one processor; and at least one memory including computer program code, the memory and the computer program code configured to, working with the at least one processor, cause the apparatus to receive a sample degraded by at least two factors; perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; stack the feature extracted by each feature extraction neural network to input to a classification neural network; and output the result of classification neural network as a detection result.
  • the method may comprise receiving a sample degraded by at least two factors; performing the following operations for each factor of the at least two factors: removing a factor of the at least two factors from the sample by a factor removal neural network; computing residual information corresponding to the factor based on the sample and the output of the factor removal neural network; computing a difference between the output and a sum of residual information for all the other factor (s) except the factor; extracting a feature from the difference by a feature extraction neural network; stacking the feature extracted by each feature extraction neural network to input to a classification neural network; and outputting the result of classification neural network as a detection result.
  • a computer program product embodied on a distribution medium readable by a computer and comprising program instructions which, when loaded into a computer, cause a processor to receive a sample degraded by at least two factors; perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; stack the feature extracted by each feature extraction neural network to input to a classification neural network; and output the result of the classification neural network as a detection result.
  • a non-transitory computer readable medium having encoded thereon statements and instructions to cause a processor to receive a sample degraded by at least two factors; perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; stack the feature extracted by each feature extraction neural network to input to a classification neural network; and output the result of classification neural network as a detection result.
  • an apparatus comprising means configured to receive a sample degraded by at least two factors; means configured to perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; means configured to stack the feature extracted by each feature extraction neural network to input to a classification neural network; and means configured to output the result of classification neural network as a detection result.
  • Figure 1 is a simplified block diagram showing an apparatus according to an embodiment
  • Figure 2 is a flow chart depicting a process of a training stage of a neural network according to an embodiment of the present disclosure
  • Figure 3 is a flow chart depicting a process of a testing stage of a neural network according to embodiments of the present disclosure
  • Figure 4 schematically shows a neural network used for the training stage according to an embodiment of the disclosure.
  • Figure 5 schematically shows a neural network used for the testing stage according to an embodiment of the disclosure.
  • the term 'circuitry' refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry) ; (b) combinations of circuits and computer program product (s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor (s) or a portion of a microprocessor (s) , that require software or firmware for operation even if the software or firmware is not physically present.
  • This definition of 'circuitry' applies to all uses of this term herein, including in any claims.
  • the term 'circuitry' also includes an implementation comprising one or more processors and/or portion (s) thereof and accompanying software and/or firmware.
  • the term 'circuitry' as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network apparatus, other network apparatus, and/or other computing apparatus.
  • non-transitory computer-readable medium which refers to a physical medium (e.g., volatile or non-volatile memory device)
  • though the embodiments are mainly described in the context of a deep convolutional neural network, they are not limited to this but can be applied to any suitable neural network. Moreover, the embodiments of the disclosure can be applied to automatic speech recognition, natural language processing, drug discovery and toxicology, customer relationship management, recommendation systems, and biomedical informatics, etc., though they are mainly discussed in the context of image recognition.
  • FIG. 1 is a simplified block diagram showing an apparatus, such as an electronic apparatus 10, in which various embodiments of the disclosure may be applied. It should be understood, however, that the electronic apparatus as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the disclosure and, therefore, should not be taken to limit the scope of the disclosure. While the electronic apparatus 10 is illustrated and will be hereinafter described for purposes of example, other types of apparatuses may readily employ embodiments of the disclosure.
  • the electronic apparatus 10 may be a portable digital assistant (PDA) , a user equipment, a mobile computer, a desktop computer, a smart television, an intelligent glass, a gaming apparatus, a laptop computer, a media player, a camera, a video recorder, a mobile phone, a global positioning system (GPS) apparatus, a smart phone, a tablet, a server, a thin client, a cloud computer, a virtual server, a set-top box, a computing device, a distributed system, a smart glass, a vehicle navigation system, an advanced driver assistance system (ADAS) , a self-driving apparatus, a video surveillance apparatus, an intelligent robot, a virtual reality apparatus and/or any other type of electronic system.
  • the electronic apparatus 10 may run with any kind of operating system including, but not limited to, Windows, Linux, UNIX, Android, iOS and their variants. Moreover, the apparatus of at least one example embodiment need not be the entire electronic apparatus, but may be a component or group of components of the electronic apparatus in other example embodiments.
  • the electronic apparatus may readily employ embodiments of the disclosure regardless of their intent to provide mobility.
  • embodiments of the disclosure may be utilized in conjunction with a variety of applications.
  • the electronic apparatus 10 may comprise processor 11 and memory 12.
  • Processor 11 may be any type of processor, controller, embedded controller, processor core, graphics processing unit (GPU) and/or the like.
  • processor 11 utilizes computer program code to cause an apparatus to perform one or more actions.
  • Memory 12 may comprise volatile memory, such as volatile Random Access Memory (RAM) including a cache area for the temporary storage of data and/or other memory, for example, non-volatile memory, which may be embedded and/or may be removable.
  • non-volatile memory may comprise an EEPROM, flash memory and/or the like.
  • Memory 12 may store any of a number of pieces of information, and data.
  • memory 12 includes computer program code such that the memory and the computer program code are configured to, working with the processor, cause the apparatus to perform one or more actions described herein.
  • the electronic apparatus 10 may further comprise a communication device 15.
  • communication device 15 comprises an antenna (or multiple antennae) , a wired connector, and/or the like in operable communication with a transmitter and/or a receiver.
  • processor 11 provides signals to a transmitter and/or receives signals from a receiver.
  • the signals may comprise signaling information in accordance with a communications interface standard, user speech, received data, user generated data, and/or the like.
  • Communication device 15 may operate with one or more air interface standards, communication protocols, modulation types, and access types.
  • the electronic communication device 15 may operate in accordance with second-generation (2G) wireless communication protocols IS-136 (time division multiple access (TDMA) ) , Global System for Mobile communications (GSM) , and IS-95 (code division multiple access (CDMA) ) , with third-generation (3G) wireless communication protocols, such as Universal Mobile Telecommunications System (UMTS) , CDMA2000, wideband CDMA (WCDMA) and time division-synchronous CDMA (TD-SCDMA) , and/or with fourth-generation (4G) wireless communication protocols, wireless networking protocols, such as 802.11, short-range wireless protocols, such as Bluetooth, and/or the like.
  • Communication device 15 may operate in accordance with wireline protocols, such as Ethernet, digital subscriber line (DSL) , and/or the like.
  • Processor 11 may comprise means, such as circuitry, for implementing audio, video, communication, navigation, logic functions, and/or the like, as well as for implementing embodiments of the disclosure including, for example, one or more of the functions described herein.
  • processor 11 may comprise means, such as a digital signal processor device, a microprocessor device, various analog to digital converters, digital to analog converters, processing circuitry and other support circuits, for performing various functions including, for example, one or more of the functions described herein.
  • the apparatus may perform control and signal processing functions of the electronic apparatus 10 among these devices according to their respective capabilities.
  • the processor 11 thus may comprise the functionality to encode and interleave messages and data prior to modulation and transmission.
  • the processor 11 may additionally comprise an internal voice coder, and may comprise an internal data modem. Further, the processor 11 may comprise functionality to operate one or more software programs, which may be stored in memory and which may, among other things, cause the processor 11 to implement at least one embodiment including, for example, one or more of the functions described herein. For example, the processor 11 may operate a connectivity program, such as a conventional internet browser.
  • the connectivity program may allow the electronic apparatus 10 to transmit and receive internet content, such as location-based content and/or other web page content, according to a Transmission Control Protocol (TCP) , Internet Protocol (IP) , User Datagram Protocol (UDP) , Internet Message Access Protocol (IMAP) , Post Office Protocol (POP) , Simple Mail Transfer Protocol (SMTP) , Wireless Application Protocol (WAP) , Hypertext Transfer Protocol (HTTP) , and/or the like, for example.
  • the electronic apparatus 10 may comprise a user interface for providing output and/or receiving input.
  • the electronic apparatus 10 may comprise an output device 14.
  • Output device 14 may comprise an audio output device, such as a ringer, an earphone, a speaker, and/or the like.
  • Output device 14 may comprise a tactile output device, such as a vibration transducer, an electronically deformable surface, an electronically deformable structure, and/or the like.
  • Output Device 14 may comprise a visual output device, such as a display, a light, and/or the like.
  • the electronic apparatus may comprise an input device 13.
  • Input device 13 may comprise a light sensor, a proximity sensor, a microphone, a touch sensor, a force sensor, a button, a keypad, a motion sensor, a magnetic field sensor, a camera, a removable storage device and/or the like.
  • a touch sensor and a display may be characterized as a touch display.
  • the touch display may be configured to receive input from a single point of contact, multiple points of contact, and/or the like.
  • the touch display and/or the processor may determine input based, at least in part, on position, motion, speed, contact area, and/or the like.
  • the electronic apparatus 10 may include any of a variety of touch displays including those that are configured to enable touch recognition by any of resistive, capacitive, infrared, strain gauge, surface wave, optical imaging, dispersive signal technology, acoustic pulse recognition or other techniques, and to then provide signals indicative of the location and other parameters associated with the touch. Additionally, the touch display may be configured to receive an indication of an input in the form of a touch event which may be defined as an actual physical contact between a selection object (e.g., a finger, stylus, pen, pencil, or other pointing device) and the touch display.
  • a touch event may be defined as bringing the selection object in proximity to the touch display, hovering over a displayed object or approaching an object within a predefined distance, even though physical contact is not made with the touch display.
  • a touch input may comprise any input that is detected by a touch display including touch events that involve actual physical contact and touch events that do not involve physical contact but that are otherwise detected by the touch display, such as a result of the proximity of the selection object to the touch display.
  • a touch display may be capable of receiving information associated with force applied to the touch screen in relation to the touch input.
  • the touch screen may differentiate between a heavy press touch input and a light press touch input.
  • a display may display two-dimensional information, three-dimensional information and/or the like.
  • the media capturing element may be any means for capturing an image, video, and/or audio for storage, display or transmission.
  • the camera module may comprise a digital camera which may form a digital image file from a captured image.
  • the camera module may comprise hardware, such as a lens or other optical component (s) , and/or software necessary for creating a digital image file from a captured image.
  • the camera module may comprise only the hardware for viewing an image, while a memory device of the electronic apparatus 10 stores instructions for execution by the processor 11 in the form of software for creating a digital image file from a captured image.
  • the camera module may further comprise a processing element such as a co-processor that assists the processor 11 in processing image data and an encoder and/or decoder for compressing and/or decompressing image data.
  • the encoder and/or decoder may encode and/or decode according to a standard format, for example, a Joint Photographic Experts Group (JPEG) standard format, a moving picture expert group (MPEG) standard format, a Video Coding Experts Group (VCEG) standard format or any other suitable standard formats.
  • Figure 2 is a flow chart depicting a process 200 of a training stage of a neural network according to an embodiment of the present disclosure, which may be performed at an apparatus such as the electronic apparatus 10 (for example a distributed system or cloud computing) of Figure 1.
  • the electronic apparatus 10 may provide means for accomplishing various parts of the process 200 as well as means for accomplishing other processes in conjunction with other components.
  • the neural network may comprise a factor removal neural network, a feature extraction neural network and a classification neural network.
  • the factor removal neural network may be used to remove a factor from the input/sample of the neural network.
  • the factor removal neural network may be any suitable factor removal neural network, for example depending on the factor to be removed. In general, there may be a specific factor removal neural network for each factor. In other words, there may be n factor removal neural networks if there are n factors to be removed.
  • the feature extraction neural network may be used to extract features.
  • the classification neural network may be used for classification.
  • the feature extraction neural network may be any suitable feature extraction neural network for example depending on the feature to be extracted.
  • the classification neural network may be any suitable classification neural network for example depending on the feature to be classified.
  • Each of the factor removal neural network, the feature extraction neural network and the classification neural network may comprise k layers, wherein k ≥ 3.
  • FIG 4 schematically shows a neural network 400 used for the training stage according to an embodiment of the disclosure, wherein the neural network 400 can be used to process a sample degraded by 2 factors.
  • the neural network 400 may comprise three parts: a factor removal part 402, a feature extraction part 404 and a classification part 406. It is noted that the neural network 400 can be easily expanded to any other neural network which can process a sample degraded by more than 2 factors. The process 200 will be described in detail with reference to Figures 2 and 4.
  • the process 200 may start at block 202 where the parameters/weights of the neural network 400 (the factor removal neural network, the feature extraction neural network and the classification neural network) are initialized with for example random values. Parameters like the number of filters, filter sizes, architecture of the network etc. have all been fixed before block 202 and do not change during the training stage.
  • the electronic apparatus 10 receives a set of pairs of training samples with labels, wherein each pair of training samples contains a first training sample degraded by n ≥ 2 factors and a second training sample where a factor j ∈ [1, ..., n] of the n factors does not degrade the first training sample.
  • the training sample may be any suitable sample which can be processed by the neural network 400, such as image, audio or text.
  • the label may indicate the classification of the training sample.
  • the set of pairs of training samples with labels may be pre-stored in a memory of the electronic apparatus 10, or retrieved from a network location or a local location.
  • the factors may be any factors which can degrade the sample such as image, audio, text or any other suitable sample.
  • the training sample may be an image, and the factors may comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball and pollen.
  • a pair of training samples contains the first training sample degraded by the 2 factors and the second training sample where a factor j ∈ [1, 2] does not degrade the first training sample.
  • the first training sample (such as an image) shown by Input is input respectively to the factor removal neural networks 410 and 408 and the second training sample shown by Ground truth1 and Ground truth2 may be stored in the neural network 400, wherein the Ground truth1 stands for a second training sample where factor 1 does not degrade the first training sample, and Ground truth2 stands for a second training sample where factor 2 does not degrade the first training sample.
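  • As an illustration of how such pairs might be obtained, the following sketch synthesizes a two-factor pair from a clean, labelled image. The degradation functions, their parameters and the Python/NumPy formulation are assumptions made purely for illustration; the disclosure does not prescribe how the pairs are produced.

```python
# A minimal sketch (not from the patent) of synthesizing paired training samples
# for two factors from a clean, labelled image. "add_haze" and "add_dark_light"
# and all parameter values below are assumptions, not the patent's procedure.
import numpy as np

def add_haze(img, t=0.6, airlight=0.9):
    # Simple atmospheric-scattering-style haze: observed = clean*t + A*(1-t)
    return img * t + airlight * (1.0 - t)

def add_dark_light(img, gain=0.4):
    # Global illumination drop imitating a dark-light condition
    return img * gain

def make_training_pair(clean_img, label):
    """Return (first_sample, ground_truth1, ground_truth2, label).

    first_sample  : degraded by both factor 1 (haze) and factor 2 (dark light)
    ground_truth1 : the same scene without factor 1 (only dark light remains)
    ground_truth2 : the same scene without factor 2 (only haze remains)
    """
    first_sample = add_dark_light(add_haze(clean_img))
    ground_truth1 = add_dark_light(clean_img)   # factor 1 absent
    ground_truth2 = add_haze(clean_img)         # factor 2 absent
    return first_sample, ground_truth1, ground_truth2, label

# Example: a random "clean image" stands in for real labelled data.
clean = np.random.rand(3, 64, 64).astype(np.float32)
sample, gt1, gt2, y = make_training_pair(clean, label=1)
```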
  • the electronic apparatus 10 may perform the following operations for the factor j of each pair of training samples: remove the factor j from the first training sample by the factor removal neural network; compute residual information R j corresponding to the factor j based on the first training sample and the output C j of the factor removal neural network; compute a loss L j of the factor removal neural network based on the difference between the output C j and the second training sample; compute a difference D j between the output C j and a sum of residual information for the n-1 factor (s) except j; extract a feature from the difference D j by the feature extraction neural network.
  • factors 1 and 2 may be removed respectively from the first training sample by the factor removal neural networks 408 and 410.
  • the factor removal neural networks 408 and 410 may be different neural networks each of which is suitable for removing a specific factor from the first training sample.
  • the factor removal neural networks 408 and 410 may each contain a plurality of layers.
  • the number of layers of the factor removal neural networks 408 and 410 may be different though the same number of layers m is shown in Figure 4.
  • the layers of the factor removal neural networks 408 and 410 are shown in Figure 4. The output C 1 of the factor removal neural network 408 stands for the estimated sample where factor 1 has been removed from the first training sample, and the output C 2 of the factor removal neural network 410 stands for the estimated sample where factor 2 has been removed from the first training sample.
  • residual information R 1 corresponding to the factor 1 is computed by subtracting the output of the factor removal neural network 408 from the first training sample.
  • residual information R 2 corresponding to the factor 2 is computed by subtracting the output of the factor removal neural network 410 from the first training sample.
  • the loss Loss1 of the factor removal neural network 408 may be computed by subtracting the output of the factor removal neural network 408 from the second training sample Ground truth1.
  • the loss Loss2 of the factor removal neural network 410 may be computed by subtracting the output of the factor removal neural network 410 from the second training sample Ground truth2.
  • the differences D 1 = C 1 - R 2 and D 2 = C 2 - R 1 may each stand for a sample where both factor 1 and factor 2 are removed from the first training sample.
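  • The following worked equations (not part of the disclosure) indicate why these cross-residual differences may approximate a doubly restored sample, under the simplifying assumption that the two degradations combine additively with the clean sample S:

```latex
% Illustration only: assume the degraded input is I = S + r_1 + r_2, where S is
% the clean sample and r_1, r_2 are the corruptions caused by factors 1 and 2.
\begin{aligned}
C_1 &\approx S + r_2,       & C_2 &\approx S + r_1,\\
R_1 &= I - C_1 \approx r_1, & R_2 &= I - C_2 \approx r_2,\\
D_1 &= C_1 - R_2 \approx S, & D_2 &= C_2 - R_1 \approx S.
\end{aligned}
```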
  • the differences D 1 and D 2 may be input to the feature extraction neural networks 412 and 414, respectively, to extract features from them.
  • the output of the feature extraction neural network 412 may stand for the feature extracted from the difference D 1 , and the output of the feature extraction neural network 414 may stand for the feature extracted from the difference D 2 .
  • the feature extraction neural network 412 and 414 may be the same feature extraction neural network.
  • the feature extraction neural network 412 and 414 may each contain a plurality of layers.
  • the layers of the feature extraction neural networks 412 and 414 are shown in Figure 4.
  • the feature extraction neural network 412 and 414 may be any suitable feature extraction neural network for example depending on the features to be extracted.
  • the electronic apparatus 10 may stack n features to input to the classification neural network.
  • the feature layer output by the feature extraction neural network 412 and the feature layer output by the feature extraction neural network 414 are stacked to form the first layer E 1 of the classification neural network 416.
  • the stacking operation may further comprise convolution operation, activation operation and pooling operation.
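  • A minimal PyTorch sketch of this stacking step is given below; the feature map shapes, channel counts and kernel sizes are illustrative assumptions, not values taken from the disclosure.

```python
# Stacking two extracted feature maps along the channel axis, then applying a
# convolution, an activation and a pooling step to form the first layer E1 of
# the classification part. All sizes here are illustrative guesses.
import torch
import torch.nn as nn

feat1 = torch.randn(1, 32, 16, 16)   # feature from difference D1 (assumed shape)
feat2 = torch.randn(1, 32, 16, 16)   # feature from difference D2 (assumed shape)

stacked = torch.cat([feat1, feat2], dim=1)          # stack: (1, 64, 16, 16)
e1 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),    # convolution operation
    nn.ReLU(inplace=True),                          # activation operation
    nn.MaxPool2d(2),                                # pooling operation
)(stacked)                                          # E1: (1, 64, 8, 8)
print(e1.shape)
```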
  • the classification neural network 416 may comprise k layers denoted by E 1 , E 2 , ..., E k , wherein k ≥ 3.
  • the classification neural network 416 may be any suitable classification neural network for example depending on the features to be classified.
  • the last layer E k of the classification neural network 416 is the classification/detection result.
  • the electronic apparatus 10 may compute classification loss based on the result of classification neural network and the label of the first training sample. For example, as shown in Figure 4, the electronic apparatus 10 may compute the classification loss at block 418.
  • the electronic apparatus 10 may add n losses L j and the classification loss to form joint loss.
  • the electronic apparatus 10 may add Loss1, Loss2 and the classification loss to form the joint loss at block 420.
  • the electronic apparatus 10 may learn the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network by minimizing the joint loss with the standard back-propagation algorithm. It is noted that the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network may be learned by minimizing the classification loss with the standard back-propagation algorithm in other embodiments, and in this case, the computation of Loss1 and Loss2, and the adding operation may be omitted.
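  • The following self-contained PyTorch sketch puts the training stage of Figure 4 together for the two-factor case. The layer configurations, the L1 form of the restoration losses Loss1/Loss2, the cross-entropy classification loss and all hyper-parameters are assumptions made for illustration; the disclosure leaves these choices open.

```python
# A sketch (assumptions throughout) of one training step for the two-factor
# network of Figure 4: two factor removal networks, two feature extraction
# networks, a classification network, and a joint loss formed from the two
# restoration losses plus the classification loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_stack(cin, cout, layers=3):
    # Small convolutional stack; depth and width are illustrative, not the patent's.
    mods = []
    for i in range(layers):
        mods += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1), nn.ReLU()]
    return nn.Sequential(*mods)

class TwoFactorDetector(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.removal = nn.ModuleList([conv_stack(3, 3) for _ in range(2)])    # 408, 410
        self.feature = nn.ModuleList([conv_stack(3, 32) for _ in range(2)])   # 412, 414
        self.classifier = nn.Sequential(                                      # 416
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, x):
        c = [net(x) for net in self.removal]          # C1, C2: factor removed
        r = [x - ci for ci in c]                      # R1, R2: residual information
        d = [c[0] - r[1], c[1] - r[0]]                # D1 = C1 - R2, D2 = C2 - R1
        feats = [net(di) for net, di in zip(self.feature, d)]
        logits = self.classifier(torch.cat(feats, dim=1))   # stack then classify
        return logits, c

model = TwoFactorDetector()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

# Dummy batch standing in for (Input, Ground truth1, Ground truth2, label).
x = torch.rand(4, 3, 64, 64)
gt1, gt2 = torch.rand_like(x), torch.rand_like(x)
labels = torch.randint(0, 2, (4,))

logits, (c1, c2) = model(x)
loss1 = F.l1_loss(c1, gt1)                  # Loss1 (restoration loss, assumed L1)
loss2 = F.l1_loss(c2, gt2)                  # Loss2
cls_loss = F.cross_entropy(logits, labels)  # classification loss
joint_loss = loss1 + loss2 + cls_loss       # joint loss minimized by back-propagation
opt.zero_grad()
joint_loss.backward()
opt.step()
```

  • In this sketch the joint loss is simply the unweighted sum of the two restoration losses and the classification loss, mirroring the adding operation at block 420; a practical implementation might weight the terms differently.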
  • FIG 3 is a flow chart depicting a process 300 of a testing stage of a neural network according to embodiments of the present disclosure, which may be performed at an apparatus such as the electronic apparatus 10 (for example an advanced driver assistance system (ADAS) or a self-driving apparatus) of Figure 1.
  • the electronic apparatus 10 may provide means for accomplishing various parts of the process 300 as well as means for accomplishing other processes in conjunction with other components.
  • the neural network has been trained by using the process 200 of Figure 2.
  • the neural network may comprise a factor removal neural network, a feature extraction neural network and a classification neural network.
  • Figure 5 schematically shows a neural network 500 used for the testing stage according to an embodiment of the disclosure, wherein the neural network 500 can be used to process a sample degraded by 2 factors.
  • the neural network 500 may comprise three parts: a factor removal part 502, a feature extraction part 504 and a classification part 506. It is noted that the neural network 500 can be easily expanded to any other neural network which can process a sample degraded by more than 2 factors.
  • the process 300 will be described in detail with reference to Figures 3 and 5.
  • the process 300 may start at block 302 where the parameters/weights of the neural network 500 (the factor removal neural network, the feature extraction neural network and the classification neural network) are initialized with the values obtained in the training stage.
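  • A brief sketch of carrying the learned parameters over from the training stage to the testing stage is shown below; it reuses the hypothetical TwoFactorDetector class from the training sketch above, and the checkpoint file name is likewise an assumption.

```python
# Persisting the parameters learned in the training stage (Figure 4) and
# initializing the testing-stage network (Figure 5) with them.
# TwoFactorDetector is the hypothetical class defined in the training sketch.
import torch

trained = TwoFactorDetector()            # in practice, the model trained as above
torch.save(trained.state_dict(), "two_factor_detector.pt")   # hypothetical file name

test_model = TwoFactorDetector()         # the testing-stage network shares the weights
test_model.load_state_dict(torch.load("two_factor_detector.pt"))
test_model.eval()                        # parameters are fixed during testing
```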
  • the electronic apparatus 10 receives a sample degraded by n factors, wherein n ≥ 2.
  • the sample may be any suitable sample which can be processed by the neural network 500, such as image, audio or text.
  • the sample may be pre-stored in a memory of the electronic apparatus 10, retrieved from a network location or a local location, or captured in real time for example by the ADAS/autonomous vehicle.
  • the sample may be an image, and the factors may comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball and pollen.
  • the sample (such as an image) shown by Input is input respectively to the factor removal neural networks 510 and 508.
  • the electronic apparatus 10 may perform the following operations for each factor i ∈ [1, ..., n] of the n factors: remove the factor i from the sample by a factor removal neural network; compute residual information R i corresponding to the factor i based on the sample and the output C i of the factor removal neural network; compute a difference D i between the output C i and a sum of residual information for the n-1 factor (s) except the factor i; extract a feature from the difference D i by the feature extraction neural network.
  • factors 1 and 2 may be removed respectively from the sample by the factor removal neural networks 508 and 510.
  • the factor removal neural networks 508 and 510 may be different neural networks each of which is suitable for removing a specific factor from the sample.
  • the factor removal neural networks 508 and 510 may each contain a plurality of layers.
  • the number of layers of the factor removal neural networks 508 and 510 may be different though the same number of layers m is shown in Figure 5.
  • the layers of the factor removal neural networks 508 and 510 are shown in Figure 5. The output C 1 of the factor removal neural network 508 stands for the estimated sample where factor 1 has been removed from the sample, and the output C 2 of the factor removal neural network 510 stands for the estimated sample where factor 2 has been removed from the sample.
  • residual information R 1 corresponding to the factor 1 is computed by subtracting the output of the factor removal neural network 508 from the sample.
  • Residual information R 2 corresponding to the factor 2 is computed by subtracting the output of the factor removal neural network 510 from the sample.
  • the differences D 1 = C 1 - R 2 and D 2 = C 2 - R 1 may each stand for a processed sample where both factor 1 and factor 2 are removed from the sample.
  • the differences D 1 and D 2 may be input to the feature extraction neural networks 512 and 514, respectively, to extract features from them.
  • the output of the feature extraction neural network 512 may stand for the feature extracted from the difference D 1 , and the output of the feature extraction neural network 514 may stand for the feature extracted from the difference D 2 .
  • the feature extraction neural network 512 and 514 may be the same feature extraction neural network.
  • the feature extraction neural networks 512 and 514 may each contain a plurality of layers, as shown in Figure 5. In another embodiment, there may be a single feature extraction neural network in the neural network 500, and each difference may be sequentially input to that feature extraction neural network.
  • the feature extraction neural network 512 and 514 may be any suitable feature extraction neural network for example depending on the features to be extracted.
  • the electronic apparatus 10 may stack n features to input to the classification neural network.
  • the feature layer output by the feature extraction neural network 512 and the feature layer output by the feature extraction neural network 514 are stacked to form the first layer E 1 of the classification neural network 516.
  • the stacking operation may further comprise convolution operation, activation operation and pooling operation.
  • the classification neural network 516 may comprise k layers denoted by E 1 , E 2 , ..., E k , wherein k ≥ 3.
  • the classification neural network 516 may be any suitable classification neural network for example depending on the features to be classified.
  • the last layer E k of the classification neural network 516 may be outputted as a detection/classification result at block 310.
  • the process 300 may be used in the ADAS/autonomous vehicle, such as for object detection.
  • a vision system is equipped with the ADAS or autonomous vehicle.
  • the process 300 can be integrated into the vision system.
  • an image is captured by a camera and the important objects such as pedestrians and bicycles are detected from the image by the process 300.
  • some form of warning (e.g., a warning voice) may be generated if important objects (e.g., pedestrians) are detected, so that the driver in the vehicle can pay attention to the objects and try to avoid a traffic accident.
  • the detected objects may be used as inputs of a control module and the control module takes proper action according to the objects.
  • the method constructs a neural network (such as a deep convolutional neural network) which greatly improves the performance of object detection systems when the sample input to the object detection systems is degraded by at least two factors.
  • the restoration residual corresponding to one factor, such as R 1 in Figure 4, is used to deal with the sample degraded by another factor, which greatly weakens the negative influence of the factors on the sample.
  • the adverse factor removal, feature extraction, and classification are jointly performed under the framework of a neural network such as the deep convolutional neural network.
  • an apparatus for object detection may comprise means configured to carry out the processes described above.
  • the apparatus comprises means configured to receive a sample degraded by at least two factors; means configured to perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; means configured to stack the feature extracted by each feature extraction neural network to input to a classification neural network; and means configured to output the result of classification neural network as a detection result.
  • the apparatus further comprises means configured to train the factor removal neural network, the feature extraction neural network and the classification neural network.
  • the apparatus further comprises means configured to receive a set of pairs of training samples with labels, wherein each pair of training samples contains a first training sample degraded by the at least two factors and a second training sample where a factor of the at least two factors does not degrade the first training sample; means configured to perform the following operations for the factor of each pair of training samples: remove the factor from the first training sample by the factor removal neural network; compute residual information corresponding to the factor based on the first training sample and the output of the factor removal neural network; compute a loss of the factor removal neural network based on the difference between the output and the second training sample; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by the feature extraction neural network; means configured to stack the feature extracted by each feature extraction neural network to input to the classification neural network; means configured to compute a classification loss based on the result of the classification neural network and the label of the first training sample; means configured to add the loss of each factor removal neural network and the classification loss to form a joint loss; and means configured to learn the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network by minimizing the joint loss.
  • the sample and the training sample comprise one of image, audio and text.
  • the sample and the training sample are images, and the factors comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball and pollen.
  • the apparatus is used in an advanced driver assistance system/autonomous vehicle.
  • the neural network comprises a convolutional neural network.
  • any of the components of the apparatus described above can be implemented as hardware or software modules.
  • if implemented as software modules, they can be embodied on a tangible computer-readable recordable storage medium. All of the software modules (or any subset thereof) can be on the same medium, or each can be on a different medium, for example.
  • the software modules can run, for example, on a hardware processor. The method steps can then be carried out using the distinct software modules, as described above, executing on a hardware processor.
  • an aspect of the disclosure can make use of software running on a general purpose computer or workstation.
  • Such an implementation might employ, for example, a processor, a memory, and an input/output interface formed, for example, by a display and a keyboard.
  • the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other forms of processing circuitry. Further, the term “processor” may refer to more than one individual processor.
  • the term “memory” is intended to include memory associated with a processor or CPU, such as, for example, RAM (random access memory) , ROM (read only memory) , a fixed memory device (for example, hard drive) , a removable memory device (for example, diskette) , a flash memory and the like.
  • the processor, memory, and input/output interface such as display and keyboard can be interconnected, for example, via bus as part of a data processing unit. Suitable interconnections, for example via bus, can also be provided to a network interface, such as a network card, which can be provided to interface with a computer network, and to a media interface, such as a diskette or CD-ROM drive, which can be provided to interface with media.
  • computer software including instructions or code for performing the methodologies of the disclosure, as described herein, may be stored in associated memory devices (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented by a CPU.
  • Such software could include, but is not limited to, firmware, resident software, microcode, and the like.
  • aspects of the disclosure may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon.
  • computer readable media may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of at least one programming language, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • each block in the flowchart or block diagrams may represent a module, component, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function (s) .
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • connection or coupling means any connection or coupling, either direct or indirect, between two or more elements, and may encompass the presence of one or more intermediate elements between two elements that are “connected” or “coupled” together.
  • the coupling or connection between the elements can be physical, logical, or a combination thereof.
  • two elements may be considered to be “connected” or “coupled” together by the use of one or more wires, cables and/or printed electrical connections, as well as by the use of electromagnetic energy, such as electromagnetic energy having wavelengths in the radio frequency region, the microwave region and the optical region (both visible and invisible) , as several non-limiting and non-exhaustive examples.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Apparatus, method, computer program product and computer readable medium are disclosed for object detection. The method comprises: receiving a sample degraded by at least two factors (304); performing the following operations for each factor of the at least two factors: removing a factor of the at least two factors from the sample by a factor removal neural network; computing residual information corresponding to the factor based on the sample and the output of the factor removal neural network; computing a difference between the output and a sum of residual information for all the other factor (s) except the factor; extracting a feature from the difference by a feature extraction neural network; stacking the feature extracted by each feature extraction neural network to input to a classification neural network (308); and outputting the result of the classification neural network as a detection result (310).

Description

APPARATUS, METHOD AND COMPUTER PROGRAM PRODUCT FOR OBJECT DETECTION

Field of the Invention
Embodiments of the disclosure generally relate to information technologies, and, more particularly, to object detection.
Background
Object detection plays an important role in many applications. For example, object detection systems are broadly used in computer vision, automatic speech recognition, natural language processing, drug discovery and toxicology, customer relationship management, recommendation systems, and biomedical informatics. As an example, in computer vision, object detection systems can be used in video surveillance, traffic surveillance, driver assistant systems, autonomous vehicles, traffic monitoring, human identification, human-computer interaction, public security, event detection, tracking, frontier guards and customs, scenario analysis and classification, image indexing and retrieval, etc.
However, the input/sample of object detection systems may be degraded by at least two factors which may greatly influence the performance of the object detection systems. For example, in bad weather caused by several factors, an image captured by the driver assistant system may be degraded by at least two of haze, rain, fog, sand, dust, sand storm, hailstone, dark light, etc. As an example, haze and dark light are two common sources of image quality degradation. They hamper the visibility of the scene and its objects. The intensity, hue and saturation of the scene and its objects are also altered by haze and dark light. The performance of the driver assistant system degrades drastically in complex and challenging weather.
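For background only (this model is not part of the present disclosure), hazy images are often described by the standard atmospheric scattering model, which makes explicit why haze reduces contrast and shifts the observed intensity and colour:

```latex
% I(x): observed intensity at pixel x, J(x): scene radiance, A: global airlight,
% t(x): transmission along the line of sight (smaller t means denser haze).
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)
```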
Therefore, a solution is required for improving the performance of object detection/recognition systems when the input of the object detection/recognition systems is degraded by at least two factors.
Summary
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to one aspect of the disclosure, there is provided an apparatus. The apparatus may comprise at least one processor; and at least one memory including computer program code, the memory and the computer program code configured to, working with the at least one processor, cause the apparatus to receive a sample degraded by at least two factors; perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; stack the feature extracted by each feature extraction neural network to input to a classification neural network; and output the result of the classification neural network as a detection result.
According to another aspect of the present disclosure, there is provided a method. The method may comprise receiving a sample degraded by at least two factors; performing the following operations for each factor of the at least two factors: removing a factor of the at least two factors from the sample by a factor removal neural network; computing residual information corresponding to the factor based on the sample and the output of the factor removal neural network; computing a difference between the output and a sum of residual information for all the other factor (s) except the factor; extracting a feature from the difference by a feature extraction neural network; stacking the feature extracted by each feature extraction neural network to input to a classification neural network; and outputting the result of the classification neural network as a detection result.
According to still another aspect of the present disclosure, there is provided a computer program product embodied on a distribution medium readable by a computer and comprising program instructions which, when loaded into a computer, cause a processor to receive a sample degraded by at least two factors; perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; stack the feature extracted by each feature extraction neural network to input to a classification neural network; and output the result of the classification neural network as a detection result.
According to still another aspect of the present disclosure, there is provided a non-transitory computer readable medium having encoded thereon statements and instructions to cause a processor to receive a sample degraded by at least two factors; perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; stack the feature extracted by each feature extraction neural network to input to a classification neural network; and output the result of the classification neural network as a detection result.
According to still another aspect of the present disclosure, there is provided an apparatus comprising means configured to receive a sample degraded by at least two factors; means configured to perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; means configured to stack the feature extracted by each feature extraction neural network to input to a classification neural network; and means configured to output the result of the classification neural network as a detection result.
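The per-factor loop common to all of the above aspects can be summarized by the following sketch, written here in Python/PyTorch purely for illustration. The factor removal networks, feature extraction networks and classifier are passed in as callables because the disclosure does not fix their architectures, and the function name and signature are assumptions.

```python
# A compact sketch of the claimed detection method for n >= 2 factors.
from typing import Callable, Sequence
import torch

def detect(sample: torch.Tensor,
           removal_nets: Sequence[Callable[[torch.Tensor], torch.Tensor]],
           feature_nets: Sequence[Callable[[torch.Tensor], torch.Tensor]],
           classifier: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    # Remove each factor and compute the corresponding residual information.
    outputs = [net(sample) for net in removal_nets]      # C_i
    residuals = [sample - c for c in outputs]            # R_i = sample - C_i
    features = []
    for i, c in enumerate(outputs):
        # Difference between C_i and the sum of the residuals of the other factors.
        others = sum(r for j, r in enumerate(residuals) if j != i)
        d = c - others                                   # D_i
        features.append(feature_nets[i](d))              # feature extracted from D_i
    stacked = torch.cat(features, dim=1)                 # stack the features
    return classifier(stacked)                           # detection result
```

For n = 2 factors this reduces to D 1 = C 1 - R 2 and D 2 = C 2 - R 1, matching the two-factor networks of Figures 4 and 5.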
These and other objects, features and advantages of the disclosure will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
Brief Description of the Drawings
Figure 1 is a simplified block diagram showing an apparatus according to an embodiment;
Figure 2 is a flow chart depicting a process of a training stage of a neural network according to an embodiment of the present disclosure;
Figure 3 is a flow chart depicting a process of a testing stage of a neural network according to embodiments of the present disclosure;
Figure 4 schematically shows a neural network used for the training stage according to an embodiment of the disclosure; and
Figure 5 schematically shows a neural network used for the testing stage according to an embodiment of the disclosure.
Detailed Description
For the purpose of explanation, details are set forth in the following description in order to provide a thorough understanding of the embodiments disclosed. It is apparent, however, to those skilled in the art that the embodiments may be implemented without these specific details or with an equivalent arrangement. Various embodiments of the disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms "data, " "content, " "information, " and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in  accordance with embodiments of the present disclosure. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present disclosure.
Additionally, as used herein, the term 'circuitry' refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of 'circuitry' applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term 'circuitry' also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term 'circuitry' as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network apparatus, other network apparatus, and/or other computing apparatus.
As defined herein, a "non-transitory computer-readable medium, " which refers to a physical medium (e.g., volatile or non-volatile memory device) , can be differentiated from a "transitory computer-readable medium, " which refers to an electromagnetic signal.
It is noted that though the embodiments are mainly described in the context of a deep convolutional neural network, they are not limited to this but can be applied to any suitable neural network. Moreover, the embodiments of the disclosure can be applied to automatic speech recognition, natural language processing, drug discovery and toxicology, customer relationship management, recommendation systems, biomedical informatics, etc., though they are mainly discussed in the context of image recognition.
Figure 1 is a simplified block diagram showing an apparatus, such as an electronic apparatus 10, in which various embodiments of the disclosure may be applied. It should be understood, however, that the electronic apparatus as illustrated and hereinafter described is merely illustrative of an apparatus that could benefit from embodiments of the disclosure and, therefore, should not be taken to limit the scope of the disclosure. While the electronic apparatus 10 is illustrated and will be hereinafter described for purposes of example, other types of apparatuses may readily employ embodiments of the disclosure. The electronic apparatus 10 may be a portable digital assistant (PDA), a user equipment, a mobile computer, a desktop computer, a smart television, an intelligent glass, a gaming apparatus, a laptop computer, a media player, a camera, a video recorder, a mobile phone, a global positioning system (GPS) apparatus, a smart phone, a tablet, a server, a thin client, a cloud computer, a virtual server, a set-top box, a computing device, a distributed system, a smart glass, a vehicle navigation system, an advanced driver assistance system (ADAS), a self-driving apparatus, a video surveillance apparatus, an intelligent robot, a virtual reality apparatus and/or any other type of electronic system. The electronic apparatus 10 may run with any kind of operating system including, but not limited to, Windows, Linux, UNIX, Android, iOS and their variants. Moreover, the apparatus of at least one example embodiment need not be the entire electronic apparatus, but may be a component or group of components of the electronic apparatus in other example embodiments.
Furthermore, the electronic apparatus may readily employ embodiments of the disclosure regardless of their intent to provide mobility. In this regard, it should be understood that embodiments of the disclosure may be utilized in conjunction with a variety of applications.
In at least one example embodiment, the electronic apparatus 10 may comprise processor 11 and memory 12. Processor 11 may be any type of processor, controller, embedded controller, processor core, graphics processing unit (GPU) and/or the like. In at least one example embodiment, processor 11 utilizes computer program code to cause an apparatus to perform one or more actions. Memory 12 may comprise volatile memory, such as volatile Random Access Memory (RAM) including a cache area for the temporary storage of data and/or other memory, for example, non-volatile memory, which may be embedded and/or may be removable. The non-volatile memory may comprise an EEPROM, flash memory and/or the like. Memory 12 may store any of a number of pieces of information, and data. The information and data may be used by the electronic apparatus 10 to implement one or more functions of the electronic apparatus 10, such as the functions described herein. In at least one example embodiment, memory 12 includes computer program code such that the memory and the computer program code are configured to, working with the processor, cause the apparatus to perform one or more actions described herein.
The electronic apparatus 10 may further comprise a communication device 15. In at least one example embodiment, communication device 15 comprises an antenna, (or multiple antennae) , a wired connector, and/or the like in operable communication with a transmitter and/or a receiver. In at least one example embodiment, processor 11 provides signals to a transmitter and/or receives signals from a receiver. The signals may comprise signaling information in accordance with a communications interface  standard, user speech, received data, user generated data, and/or the like. Communication device 15 may operate with one or more air interface standards, communication protocols, modulation types, and access types. By way of illustration, the electronic communication device 15 may operate in accordance with second-generation (2G) wireless communication protocols IS-136 (time division multiple access (TDMA) ) , Global System for Mobile communications (GSM) , and IS-95 (code division multiple access (CDMA) ) , with third-generation (3G) wireless communication protocols, such as Universal Mobile Telecommunications System (UMTS) , CDMA2000, wideband CDMA (WCDMA) and time division-synchronous CDMA (TD-SCDMA) , and/or with fourth-generation (4G) wireless communication protocols, wireless networking protocols, such as 802.11, short-range wireless protocols, such as Bluetooth, and/or the like. Communication device 15 may operate in accordance with wireline protocols, such as Ethernet, digital subscriber line (DSL) , and/or the like.
Processor 11 may comprise means, such as circuitry, for implementing audio, video, communication, navigation, logic functions, and/or the like, as well as for implementing embodiments of the disclosure including, for example, one or more of the functions described herein. For example, processor 11 may comprise means, such as a digital signal processor device, a microprocessor device, various analog to digital converters, digital to analog converters, processing circuitry and other support circuits, for performing various functions including, for example, one or more of the functions described herein. The apparatus may perform control and signal processing functions of the electronic apparatus 10 among these devices according to their respective capabilities. The processor 11 thus may comprise the functionality to encode and interleave message and data prior to modulation and transmission. The processor 11 may additionally comprise an internal voice coder, and may comprise an  internal data modem. Further, the processor 11 may comprise functionality to operate one or more software programs, which may be stored in memory and which may, among other things, cause the processor 11 to implement at least one embodiment including, for example, one or more of the functions described herein. For example, the processor 11 may operate a connectivity program, such as a conventional internet browser. The connectivity program may allow the electronic apparatus 10 to transmit and receive internet content, such as location-based content and/or other web page content, according to a Transmission Control Protocol (TCP) , Internet Protocol (IP) , User Datagram Protocol (UDP) , Internet Message Access Protocol (IMAP) , Post Office Protocol (POP) , Simple Mail Transfer Protocol (SMTP) , Wireless Application Protocol (WAP) , Hypertext Transfer Protocol (HTTP) , and/or the like, for example.
The electronic apparatus 10 may comprise a user interface for providing output and/or receiving input. The electronic apparatus 10 may comprise an output device 14. Output device 14 may comprise an audio output device, such as a ringer, an earphone, a speaker, and/or the like. Output device 14 may comprise a tactile output device, such as a vibration transducer, an electronically deformable surface, an electronically deformable structure, and/or the like. Output Device 14 may comprise a visual output device, such as a display, a light, and/or the like. The electronic apparatus may comprise an input device 13. Input device 13 may comprise a light sensor, a proximity sensor, a microphone, a touch sensor, a force sensor, a button, a keypad, a motion sensor, a magnetic field sensor, a camera, a removable storage device and/or the like. A touch sensor and a display may be characterized as a touch display. In an embodiment comprising a touch display, the touch display may be configured to receive input from a single point of contact, multiple points of contact, and/or the like. In such an embodiment, the touch display and/or the processor may  determine input based, at least in part, on position, motion, speed, contact area, and/or the like.
The electronic apparatus 10 may include any of a variety of touch displays including those that are configured to enable touch recognition by any of resistive, capacitive, infrared, strain gauge, surface wave, optical imaging, dispersive signal technology, acoustic pulse recognition or other techniques, and to then provide signals indicative of the location and other parameters associated with the touch. Additionally, the touch display may be configured to receive an indication of an input in the form of a touch event which may be defined as an actual physical contact between a selection object (e.g., a finger, stylus, pen, pencil, or other pointing device) and the touch display. Alternatively, a touch event may be defined as bringing the selection object in proximity to the touch display, hovering over a displayed object or approaching an object within a predefined distance, even though physical contact is not made with the touch display. As such, a touch input may comprise any input that is detected by a touch display including touch events that involve actual physical contact and touch events that do not involve physical contact but that are otherwise detected by the touch display, such as a result of the proximity of the selection object to the touch display. A touch display may be capable of receiving information associated with force applied to the touch screen in relation to the touch input. For example, the touch screen may differentiate between a heavy press touch input and a light press touch input. In at least one example embodiment, a display may display two-dimensional information, three-dimensional information and/or the like.
Input device 13 may comprise a media capturing element. The media capturing element may be any means for capturing an image, video, and/or audio for storage, display or transmission. For example, in at least one example embodiment in  which the media capturing element is a camera module, the camera module may comprise a digital camera which may form a digital image file from a captured image. As such, the camera module may comprise hardware, such as a lens or other optical component (s) , and/or software necessary for creating a digital image file from a captured image. Alternatively, the camera module may comprise only the hardware for viewing an image, while a memory device of the electronic apparatus 10 stores instructions for execution by the processor 11 in the form of software for creating a digital image file from a captured image. In at least one example embodiment, the camera module may further comprise a processing element such as a co-processor that assists the processor 11 in processing image data and an encoder and/or decoder for compressing and/or decompressing image data. The encoder and/or decoder may encode and/or decode according to a standard format, for example, a Joint Photographic Experts Group (JPEG) standard format, a moving picture expert group (MPEG) standard format, a Video Coding Experts Group (VCEG) standard format or any other suitable standard formats.
Figure 2 is a flow chart depicting a process 200 of a training stage of a neural network according to an embodiment of the present disclosure, which may be performed at an apparatus such as the electronic apparatus 10 (for example a distributed system or cloud computing) of Figure 1. As such, the electronic apparatus 10 may provide means for accomplishing various parts of the process 200 as well as means for accomplishing other processes in conjunction with other components.
The neural network may comprise a factor removal neural network, a feature extraction neural network and a classification neural network. The factor removal neural network may be used to remove a factor from the input/sample of the neural network. The factor removal neural network may be any suitable factor removal neural network, for example depending on the factor to be removed. In general, there may be a specific factor removal neural network for each factor. In other words, there may be n factor removal neural networks if there are n factors to be removed. The feature extraction neural network may be used to extract features. The classification neural network may be used for classification. The feature extraction neural network may be any suitable feature extraction neural network, for example depending on the feature to be extracted. Similarly, the classification neural network may be any suitable classification neural network, for example depending on the feature to be classified. Each of the factor removal neural network, the feature extraction neural network and the classification neural network may comprise k layers, wherein k≥3.
Figure 4 schematically shows a neural network 400 used for the training stage according to an embodiment of the disclosure, wherein the neural network 400 can be used to process a sample degraded by 2 factors. As shown in Figure 4, the neural network 400 may comprise three parts: a factor removal part 402, a feature extraction part 404 and a classification part 406. It is noted that the neural network 400 can be easily expanded to any other neural network which can process a sample degraded by more than 2 factors. The process 200 will be described in detail with reference to Figures 2 and 4.
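For illustration only, the three parts of the neural network 400 could be sketched in Python with the PyTorch library roughly as follows. The layer counts, channel widths and module names (FactorRemovalNet, FeatureExtractionNet, ClassificationNet) are assumptions made for this sketch and are not specified by the disclosure; the sketch only mirrors the roles of the parts 402, 404 and 406.

import torch
import torch.nn as nn

class FactorRemovalNet(nn.Module):
    """Estimates a version of the input in which one specific factor is removed (output Cj)."""
    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=3, padding=1),
        )
    def forward(self, x):
        return self.body(x)

class FeatureExtractionNet(nn.Module):
    """Extracts a feature map from a difference Dj."""
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
    def forward(self, d):
        return self.body(d)

class ClassificationNet(nn.Module):
    """Maps the stacked feature layer E1 to class scores (the last layer Ek)."""
    def __init__(self, in_channels=128, num_classes=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )
    def forward(self, e1):
        return self.body(e1)

Under these assumptions, two FactorRemovalNet instances (one per factor), two FeatureExtractionNet instances and one ClassificationNet instance would correspond to the parts 402, 404 and 406 of Figure 4, respectively.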
As shown in Figure 2, the process 200 may start at block 202 where the parameters/weights of the neural network 400 (the factor removal neural network, the feature extraction neural network and the classification neural network) are initialized with, for example, random values. Parameters such as the number of filters, the filter sizes and the architecture of the network have all been fixed before block 202 and do not change during the training stage.
At block 204, the electronic apparatus 10 receives a set of pairs of training samples with labels, wherein each pair of training samples contains a first training sample degraded by n (n≥2) factors and a second training sample where a factor j∈ [1, ..., n] of the n factors does not degrade the first training sample. The training sample may be any suitable sample which can be processed by the neural network 400, such as an image, audio or text. The label may indicate the classification of the training sample. The set of pairs of training samples with labels may be pre-stored in a memory of the electronic apparatus 10, or retrieved from a network location or a local location.
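As an illustration only (the container type is an assumption of this sketch, not part of the disclosure), one pair of training samples with its label could be held as:

from collections import namedtuple

# degraded: the first training sample, degraded by all n factors
# ground_truths: n second training samples, the j-th one not degraded by factor j
# label: classification label of the pair
TrainingPair = namedtuple("TrainingPair", ["degraded", "ground_truths", "label"])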
The factors may be any factors which can degrade the sample, whether the sample is an image, audio, text or any other suitable sample. In an embodiment, the training sample is an image, and the factors may comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball and pollen.
Turning to Figure 4, suppose that a pair of training samples contains the first training sample degraded by the 2 factors and the second training sample where a factor j∈ [1, 2] does not degrade the first training sample. The first training sample (such as an image) shown by Input is input respectively to the factor removal neural networks 410 and 408, and the second training sample shown by Ground truth1 and Ground truth2 may be stored in the neural network 400, wherein Ground truth1 stands for a second training sample where factor 1 does not degrade the first training sample, and Ground truth2 stands for a second training sample where factor 2 does not degrade the first training sample.
At block 206, the electronic apparatus 10 may perform the following operations for the factor j of each pair of training samples: remove the factor j from the first training sample by the factor removal neural network; compute residual information Rj corresponding to the factor j based on the first training sample and the output Cj of the factor removal neural network; compute a loss Lj of the factor removal neural network based on the difference between the output Cj and the second training sample; compute a difference Dj between the output Cj and a sum of residual information for the n-1 factor(s) except j; and extract a feature from the difference Dj by the feature extraction neural network.
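Continuing the hedged sketch above (all names remain assumptions of the sketch), the per-factor quantities Cj, Rj and Dj of block 206 might be computed as follows for an arbitrary number n of factors:

def per_factor_terms(sample, removal_nets):
    # sample: tensor degraded by n factors; removal_nets: one factor removal network per factor
    outputs = [net(sample) for net in removal_nets]           # C_1 ... C_n
    residuals = [sample - c for c in outputs]                 # R_j = sample - C_j
    diffs = []
    for j, c in enumerate(outputs):
        other_residuals = sum(r for i, r in enumerate(residuals) if i != j)
        diffs.append(c - other_residuals)                     # D_j = C_j - sum of the other residuals
    return outputs, residuals, diffs

The loss Lj of block 206 would then compare each Cj with the corresponding second training sample (Ground truth j); a concrete loss function is not fixed by the disclosure and is chosen only in the later training sketch.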
As shown in Figure 4, factors 1 and 2 may be removed respectively from the first training sample by the factor removal neural networks 408 and 410. In general, the factor removal neural networks 408 and 410 may be different neural networks, each of which is suitable for removing a specific factor from the first training sample. The factor removal neural networks 408 and 410 may each contain a plurality of layers, which are denoted by the corresponding layer symbols shown in Figure 4; the number of layers of the two networks may be different, though the same number of layers m is shown in Figure 4. The output C1 of the factor removal neural network 408 stands for the estimated sample where factor 1 has been removed from the first training sample, and the output C2 of the factor removal neural network 410 stands for the estimated sample where factor 2 has been removed from the first training sample.
Then residual information R1 corresponding to factor 1 is computed by subtracting the output C1 of the factor removal neural network 408 from the first training sample. Similarly, residual information R2 corresponding to factor 2 is computed by subtracting the output C2 of the factor removal neural network 410 from the first training sample.
The loss Loss1 of the factor removal neural network 408 may be computed by subtracting the output C1 of the factor removal neural network 408 from the second training sample Ground truth1. Similarly, the loss Loss2 of the factor removal neural network 410 may be computed by subtracting the output C2 of the factor removal neural network 410 from the second training sample Ground truth2.
A difference D1 is computed between the output C1 of the factor removal neural network 408 and a sum of residual information for the n-1 factor(s) except factor 1, wherein in this embodiment the sum of residual information is R2, so that D1 = C1 - R2. Similarly, a difference D2 is computed between the output C2 of the factor removal neural network 410 and a sum of residual information for the n-1 factor(s) except factor 2, wherein in this embodiment the sum of residual information is R1, so that D2 = C2 - R1. The differences D1 and D2 may each stand for a sample where both factor 1 and factor 2 are removed from the first training sample.
Then the differences D1 and D2 may be input to the feature extraction neural networks 412 and 414 respectively to extract features from the differences D1 and D2. The output of the feature extraction neural network 412 may stand for the feature extracted from the difference D1, and the output of the feature extraction neural network 414 may stand for the feature extracted from the difference D2. In general, the feature extraction neural networks 412 and 414 may be the same feature extraction neural network. The feature extraction neural networks 412 and 414 may each contain a plurality of layers, which are denoted by the corresponding layer symbols shown in Figure 4. In other embodiments, there may be one feature extraction neural network in the neural network 400 and each difference may be sequentially input to the feature extraction neural network. In addition, the feature extraction neural networks 412 and 414 may be any suitable feature extraction neural networks, for example depending on the features to be extracted.
Turning to Figure 2, at block 208, the electronic apparatus 10 may stack the n features to input to the classification neural network. For example, as shown in Figure 4, the feature layer output by the feature extraction neural network 412 and the feature layer output by the feature extraction neural network 414 are stacked to form the first layer E1 of the classification neural network 416. The stacking operation may further comprise a convolution operation, an activation operation and a pooling operation. The classification neural network 416 may comprise k layers denoted by E1, E2, …, Ek, wherein k≥3. The classification neural network 416 may be any suitable classification neural network, for example depending on the features to be classified. The last layer Ek of the classification neural network 416 is the classification/detection result.
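As a continuation of the same illustrative sketch, the stacking of block 208 can be expressed as a channel-wise concatenation of the extracted features to form E1, followed by the classification network; concatenation along the channel axis is one plausible realization of the stacking operation and is an assumption of the sketch.

def classify(diffs, extraction_nets, classifier):
    feats = [net(d) for net, d in zip(extraction_nets, diffs)]   # one feature per difference D_j
    e1 = torch.cat(feats, dim=1)                                 # stack along the channel axis to form E1
    return classifier(e1)                                        # output of the last layer Ek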
At block 210, the electronic apparatus 10 may compute classification loss based on the result of classification neural network and the label of the first training sample. For example, as shown in Figure 4, the electronic apparatus 10 may compute the classification loss at block 418.
At block 212, the electronic apparatus 10 may add n losses Lj and the classification loss to form joint loss. For example, as shown in Figure 4, the electronic  apparatus 10 may add Loss1, Loss2 and the classification loss to form the joint loss at block 420.
At block 214, the electronic apparatus 10 may learn the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network by minimizing the joint loss with the standard back-propagation algorithm. It is noted that the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network may be learned by minimizing the classification loss with the standard back-propagation algorithm in other embodiments, and in this case, the computation of Loss1 and Loss2, and the adding operation may be omitted.
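For blocks 210 to 214, a minimal training step under the same assumptions could look as follows. The use of mean-squared error for each per-factor loss Lossj and cross-entropy for the classification loss is an assumption of this sketch; the disclosure only states that the losses are based on the respective differences and that the joint loss is minimized with the standard back-propagation algorithm.

mse = nn.MSELoss()
cross_entropy = nn.CrossEntropyLoss()

def training_step(sample, ground_truths, label, removal_nets, extraction_nets, classifier, optimizer):
    # label: tensor of class indices for the batch
    outputs, residuals, diffs = per_factor_terms(sample, removal_nets)
    removal_losses = [mse(c, gt) for c, gt in zip(outputs, ground_truths)]   # Loss_1, Loss_2, ...
    logits = classify(diffs, extraction_nets, classifier)
    joint_loss = sum(removal_losses) + cross_entropy(logits, label)          # joint loss of block 212
    optimizer.zero_grad()
    joint_loss.backward()                                                    # standard back-propagation
    optimizer.step()
    return joint_loss.item()

An optimizer such as torch.optim.SGD over the parameters of all the factor removal, feature extraction and classification networks would then learn them jointly, as described at block 214.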
The trained neural network can then be used for classifying a sample such as image. Figure 3 is a flow chart depicting a process 300 of a testing stage of a neural network according to embodiments of the present disclosure, which may be performed at an apparatus such as the electronic apparatus 10 (for example an advanced driver assistance system (ADAS) or a self-driving apparatus) of Figure 1. As such, the electronic apparatus 10 may provide means for accomplishing various parts of the process 300 as well as means for accomplishing other processes in conjunction with other components. Moreover, the neural network has been trained by using the process 200 of Figure 2.
As described with reference to Figures 2 and 4, the neural network may comprise a factor removal neural network, a feature extraction neural network and a classification neural network. Figure 5 schematically shows a neural network 500 used for the testing stage according to an embodiment of the disclosure, wherein the neural network 500 can be used to process a sample degraded by 2 factors. As shown in Figure 5, the neural network 500 may comprise three parts: a factor removal part 502, a feature extraction part 504 and a classification part 506. It is noted that the neural network 500 can be easily expanded to any other neural network which can process a sample degraded by more than 2 factors. The process 300 will be described in detail with reference to Figures 3 and 5.
As shown in Figure 3, the process 300 may start at block 302 where the parameters/weights of the neural network 500 (the factor removal neural network, the feature extraction neural network and the classification neural network) are initialized with the values obtained in the training stage.
At block 304, the electronic apparatus 10 receives a sample degraded by n factors, wherein n≥2. The sample may be any suitable sample which can be processed by the neural network 500, such as an image, audio or text. The sample may be pre-stored in a memory of the electronic apparatus 10, retrieved from a network location or a local location, or captured in real time, for example by the ADAS/autonomous vehicle. In an embodiment, the sample is an image, and the factors may comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball and pollen. Turning to Figure 5, the sample (such as an image) shown by Input is input respectively to the factor removal neural networks 510 and 508.
At block 306, the electronic apparatus 10 may perform the following operations for each factor i∈ [1, ..., n] of the n factors: remove the factor i from the sample by a factor removal neural network; compute residual information Ri corresponding to the factor i based on the sample and the output Ci of the factor removal neural network; compute a difference Di between the output Ci and a sum of residual information for the n-1 factor(s) except the factor i; and extract a feature from the difference Di by the feature extraction neural network.
As shown in Figure 5, factors 1 and 2 may be removed respectively from the sample by the factor removal neural networks 508 and 510. In general, the factor removal neural networks 508 and 510 may be different neural networks, each of which is suitable for removing a specific factor from the sample. The factor removal neural networks 508 and 510 may each contain a plurality of layers, which are denoted by the corresponding layer symbols shown in Figure 5; the number of layers of the two networks may be different, though the same number of layers m is shown in Figure 5. The output C1 of the factor removal neural network 508 stands for the estimated sample where factor 1 has been removed from the sample, and the output C2 of the factor removal neural network 510 stands for the estimated sample where factor 2 has been removed from the sample.
Then residual information R1 corresponding to factor 1 is computed by subtracting the output C1 of the factor removal neural network 508 from the sample. Residual information R2 corresponding to factor 2 is computed by subtracting the output C2 of the factor removal neural network 510 from the sample.
A difference D1 is computed between the output C1 of the factor removal neural network 508 and a sum of residual information for the n-1 factor(s) except factor 1, wherein in this embodiment the sum of residual information is R2, so that D1 = C1 - R2. Similarly, a difference D2 is computed between the output C2 of the factor removal neural network 510 and a sum of residual information for the n-1 factor(s) except factor 2, wherein in this embodiment the sum of residual information is R1, so that D2 = C2 - R1. The differences D1 and D2 may each stand for a processed sample where both factor 1 and factor 2 are removed from the sample.
Then the differences D1 and D2 may be input to the feature extraction neural networks 512 and 514 respectively to extract features from the differences D1 and D2. The output of the feature extraction neural network 512 may stand for the feature extracted from the difference D1, and the output of the feature extraction neural network 514 may stand for the feature extracted from the difference D2. In general, the feature extraction neural networks 512 and 514 may be the same feature extraction neural network. The feature extraction neural networks 512 and 514 may each contain a plurality of layers, which are denoted by the corresponding layer symbols shown in Figure 5. In other embodiments, there may be one feature extraction neural network in the neural network 500 and each difference may be sequentially input to the feature extraction neural network. In addition, the feature extraction neural networks 512 and 514 may be any suitable feature extraction neural networks, for example depending on the features to be extracted.
Turning to Figure 3, at block 308, the electronic apparatus 10 may stack the n features to input to the classification neural network. For example, as shown in Figure 5, the feature layer output by the feature extraction neural network 512 and the feature layer output by the feature extraction neural network 514 are stacked to form the first layer E1 of the classification neural network 516. The stacking operation may further comprise a convolution operation, an activation operation and a pooling operation. The classification neural network 516 may comprise k layers denoted by E1, E2, …, Ek, wherein k≥3. The classification neural network 516 may be any suitable classification neural network, for example depending on the features to be classified. The last layer Ek of the classification neural network 516 may be outputted as a detection/classification result at block 310.
In an embodiment, the process 300 may be used in the ADAS/autonomous vehicle, for example for object detection. For example, the ADAS or autonomous vehicle is equipped with a vision system, and the process 300 can be integrated into the vision system. In the vision system, an image is captured by a camera and important objects such as pedestrians and bicycles are detected from the image by the process 300. In the ADAS, some form of warning (e.g., a warning voice) may be generated if important objects (e.g., pedestrians) are detected, so that the driver of the vehicle can pay attention to the objects and try to avoid a traffic accident. In the autonomous vehicle, the detected objects may be used as inputs of a control module, and the control module takes proper action according to the objects.
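Purely as an illustration of how the detection result might be consumed in the ADAS case, and under the assumptions of the sketches above, a hypothetical integration could look as follows; frame, PEDESTRIAN_CLASS and issue_warning are names invented for this sketch and are not part of the disclosure.

PEDESTRIAN_CLASS = 1                                      # hypothetical class index for pedestrians

def issue_warning(message):
    # hypothetical warning hook; a real ADAS might trigger, e.g., a warning voice instead
    print("WARNING:", message)

def on_new_frame(frame, removal_nets, extraction_nets, classifier):
    prediction = detect(frame, removal_nets, extraction_nets, classifier)
    if (prediction == PEDESTRIAN_CLASS).any():
        issue_warning("pedestrian detected")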
Some advantages of the method of the embodiments of the disclosure are as follows. (1) The method constructs a neural network (such as a deep convolutional neural network) which greatly improves the performance of object detection systems when the sample input to the object detection system is degraded by at least two factors. (2) The restoration residual corresponding to one factor, such as R1 in Figure 4, is used to deal with the sample degraded by another factor, which greatly weakens the negative influence of the factors on the sample. (3) The adverse factor removal, feature extraction and classification are jointly performed under the framework of a neural network such as the deep convolutional neural network.
According to an aspect of the disclosure it is provided an apparatus for object detection. For same parts as in the previous embodiments, the description thereof may be omitted as appropriate. The apparatus may comprise means configured to carry out the processes described above. In an embodiment, the apparatus comprises means configured to receive a sample degraded by at least two factors; means configured to perform the following operations for each factor of the at least two factors: remove a factor of the at least two factors from the sample by a factor removal neural network; compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by a feature extraction neural network; means configured to stack the feature extracted by each feature extraction neural network to input to a classification neural network; and means configured to output the result of classification neural network as a detection result.
In an embodiment, the apparatus further comprises means configured to train the factor removal neural network, the feature extraction neural network and the classification neural network.
In an embodiment, the apparatus further comprises means configured to receive a set of pairs of training samples with labels, wherein each pair of training samples contains first training sample degraded by the at least two factors and second training sample where a factor of the at least two factors does not degrade the first training sample; means configured to perform the following operations for the factor of each pair of training samples: remove the factor from the first training sample by the factor removal neural network; compute residual information corresponding to the factor based on the first training sample and the output of the factor removal neural  network; compute a loss of the factor removal neural network based on the difference between the output and the second training sample; compute a difference between the output and a sum of residual information for all the other factor (s) except the factor; extract a feature from the difference by the feature extraction neural network; means configured to stack the feature extracted by each feature extraction neural network to input to the classification neural network; means configured to compute a classification loss based on the result of classification neural network and the label of the first training sample; means configured to add the loss of each factor removal neural network and the classification loss to form joint loss; and means configured to learn the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network by minimizing the joint loss with the standard back-propagation algorithm.
In an embodiment, the sample and the training sample comprise one of image, audio and text.
In an embodiment, the sample and the training sample is the image, and the factors comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball, pollen.
In an embodiment, the apparatus is used in an advanced driver assistance system/autonomous vehicle.
In an embodiment, the neural network comprises a convolutional neural network.
It is noted that any of the components of the apparatus described above can be implemented as hardware or software modules. In the case of software modules, they  can be embodied on a tangible computer-readable recordable storage medium. All of the software modules (or any subset thereof) can be on the same medium, or each can be on a different medium, for example. The software modules can run, for example, on a hardware processor. The method steps can then be carried out using the distinct software modules, as described above, executing on a hardware processor.
Additionally, an aspect of the disclosure can make use of software running on a general purpose computer or workstation. Such an implementation might employ, for example, a processor, a memory, and an input/output interface formed, for example, by a display and a keyboard. The term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other forms of processing circuitry. Further, the term “processor” may refer to more than one individual processor. The term “memory” is intended to include memory associated with a processor or CPU, such as, for example, RAM (random access memory) , ROM (read only memory) , a fixed memory device (for example, hard drive) , a removable memory device (for example, diskette) , a flash memory and the like. The processor, memory, and input/output interface such as display and keyboard can be interconnected, for example, via bus as part of a data processing unit. Suitable interconnections, for example via bus, can also be provided to a network interface, such as a network card, which can be provided to interface with a computer network, and to a media interface, such as a diskette or CD-ROM drive, which can be provided to interface with media.
Accordingly, computer software including instructions or code for performing the methodologies of the disclosure, as described herein, may be stored in associated memory devices (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented  by a CPU. Such software could include, but is not limited to, firmware, resident software, microcode, and the like.
As noted, aspects of the disclosure may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. Also, any combination of computer readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of at least one programming language, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, component, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function (s) . It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that the terms "connected, " "coupled, " or any variant thereof, mean any connection or coupling, either direct or indirect, between two or more elements, and may encompass the presence of one or more intermediate elements between two elements that are "connected" or "coupled" together. The coupling or connection between the elements can be physical, logical, or a combination thereof. As employed herein, two elements may be considered to be "connected" or "coupled" together by the use of one or more wires, cables and/or printed electrical connections, as well as by the use of electromagnetic energy, such as  electromagnetic energy having wavelengths in the radio frequency region, the microwave region and the optical region (both visible and invisible) , as several non-limiting and non-exhaustive examples.
In any case, it should be understood that the components illustrated in this disclosure may be implemented in various forms of hardware, software, or combinations thereof, for example, application specific integrated circuit (s) (ASICS) , a functional circuitry, a graphics processing unit, an appropriately programmed general purpose digital computer with associated memory, and the like. Given the teachings of the disclosure provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the components of the disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a, ” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising, ” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of another feature, integer, step, operation, element, component, and/or group thereof.
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.

Claims (16)

  1. An apparatus, comprising:
    at least one processor;
    at least one memory including computer program code, the memory and the computer program code configured to, working with the at least one processor, cause the apparatus to
    receive a sample degraded by at least two factors;
    perform the following operations for each factor of the at least two factors:
    remove a factor of the at least two factors from the sample by a factor removal neural network;
    compute residual information corresponding to the factor based on the sample and the output of the factor removal neural network;
    compute a difference between the output and a sum of residual information for all the other factor (s) except the factor;
    extract a feature from the difference by a feature extraction neural network;
    stack the feature extracted by each feature extraction neural network to input to a classification neural network; and
    output the result of classification neural network as a detection result.
  2. The apparatus according to claim 1, wherein the memory and the computer program code is further configured to, working with the at least one processor, cause the apparatus to
    train the factor removal neural network, the feature extraction neural network and the classification neural network.
  3. The apparatus according to claim 2, wherein the memory and the computer program code is further configured to, working with the at least one processor, cause the apparatus to
    receive a set of pairs of training samples with labels, wherein each pair of training samples contains first training sample degraded by the at least two factors and second training sample where a factor of the at least two factors does not degrade the first training sample;
    perform the following operations for the factor of each pair of training samples:
    remove the factor from the first training sample by the factor removal neural network;
    compute residual information corresponding to the factor based on the first training sample and the output of the factor removal neural network;
    compute a loss of the factor removal neural network based on the difference between the output and the second training sample;
    compute a difference between the output and a sum of residual information for all the other factor (s) except the factor;
    extract a feature from the difference by the feature extraction neural network;
    stack the feature extracted by each feature extraction neural network to input to the classification neural network;
    compute a classification loss based on the result of classification neural network and the label of the first training sample;
    add the loss of each factor removal neural network and the classification loss to form joint loss; and
    learn the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network by minimizing the joint loss with the standard back-propagation algorithm.
  4. The apparatus according to any one of claims 1-3, wherein the sample and the training sample comprise one of image, audio and text.
  5. The apparatus according to any one of claims 1-4, wherein the sample and the training sample is the image, and the factors comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball, pollen.
  6. The apparatus according to any one of claims 1-5, wherein the apparatus is used in an advanced driver assistance system/autonomous vehicle.
  7. The apparatus according to any one of claims 1-6, wherein the neural network comprises a convolutional neural network.
  8. A method comprising:
    receiving a sample degraded by at least two factors;
    performing the following operations for each factor of the at least two factors:
    removing a factor of the at least two factors from the sample by a factor removal neural network;
    computing residual information corresponding to the factor based on the sample and the output of the factor removal neural network;
    computing a difference between the output and a sum of residual information for all the other factor (s) except the factor;
    extracting a feature from the difference by a feature extraction neural network;
    stacking the feature extracted by each feature extraction neural network to input to a classification neural network; and
    outputting the result of classification neural network as a detection result.
  9. The method according to claim 8, further comprising
    training the factor removal neural network, the feature extraction neural network and the classification neural network.
  10. The method according to claim 9, wherein the training comprises
    receiving a set of pairs of training samples with labels, wherein each pair of training samples contains first training sample degraded by the at least two factors and second training sample where a factor of the at least two factors does not degrade the first training sample;
    performing the following operations for the factor of each pair of training samples:
    removing the factor from the first training sample by the factor removal neural network;
    computing residual information corresponding to the factor based on the first training sample and the output of the factor removal neural network;
    computing a loss of the factor removal neural network based on the difference between the output and the second training sample;
    computing a difference between the output and a sum of residual information for all the other factor (s) except the factor;
    extracting a feature from the difference by the feature extraction neural network;
    stacking the feature extracted by each feature extraction neural network to input to the classification neural network;
    computing a classification loss based on the result of classification neural network and the label of the first training sample;
    adding the loss of each factor removal neural network and the classification loss to form joint loss; and
    learning the parameters of the factor removal neural network, the feature extraction neural network and the classification neural network by minimizing the joint loss with the standard back-propagation algorithm.
  11. The method according to any one of claims 8-10, wherein the sample and the training sample comprise one of image, audio and text.
  12. The method according to any one of claims 8-11, wherein the sample and the training sample is the image, and the factors comprise at least two of haze, fog, dark light, dust storm, sand storm, snow, hailstone, blowball, pollen.
  13. The method according to any one of claims 8-12, wherein the method is used in an advanced driver assistance system/autonomous vehicle.
  14. An apparatus, comprising means configured to carry out the method according to any one of claims 8 to 13.
  15. A computer program product embodied on a distribution medium readable by a computer and comprising program instructions which, when loaded into a computer, execute the method according to any one of claims 8 to 13.
  16. A non-transitory computer readable medium having encoded thereon statements and instructions to cause a processor to execute a method according to any one of claims 8 to 13.
PCT/CN2017/071477 2017-01-18 2017-01-18 Apparatus, method and computer program product for object detection Ceased WO2018132961A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/071477 WO2018132961A1 (en) 2017-01-18 2017-01-18 Apparatus, method and computer program product for object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/071477 WO2018132961A1 (en) 2017-01-18 2017-01-18 Apparatus, method and computer program product for object detection

Publications (1)

Publication Number Publication Date
WO2018132961A1 true WO2018132961A1 (en) 2018-07-26

Family

ID=62907723

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/071477 Ceased WO2018132961A1 (en) 2017-01-18 2017-01-18 Apparatus, method and computer program product for object detection

Country Status (1)

Country Link
WO (1) WO2018132961A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697464A (en) * 2018-12-17 2019-04-30 环球智达科技(北京)有限公司 Method and system based on the identification of the precision target of object detection and signature search
CN110570371A (en) * 2019-08-28 2019-12-13 天津大学 An image defogging method based on multi-scale residual learning
CN112132169A (en) * 2019-06-25 2020-12-25 富士通株式会社 Information processing apparatus and information processing method
CN112184590A (en) * 2020-09-30 2021-01-05 西安理工大学 A Single Dust Image Restoration Method Based on Grayscale World Self-Guided Network
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914813A (en) * 2014-04-10 2014-07-09 西安电子科技大学 Colorful haze image defogging and illumination compensation restoration method
US20160078605A1 (en) * 2014-09-16 2016-03-17 National Taipei University Of Technology Image restoration method and image processing apparatus using the same
CN105844257A (en) * 2016-04-11 2016-08-10 吉林大学 Early warning system based on machine vision driving-in-fog road denoter missing and early warning method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914813A (en) * 2014-04-10 2014-07-09 西安电子科技大学 Colorful haze image defogging and illumination compensation restoration method
US20160078605A1 (en) * 2014-09-16 2016-03-17 National Taipei University Of Technology Image restoration method and image processing apparatus using the same
CN105844257A (en) * 2016-04-11 2016-08-10 吉林大学 Early warning system based on machine vision driving-in-fog road denoter missing and early warning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE KAIMING ET AL.: "Deep Residual Learning for Image Recognition", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 31 December 2016 (2016-12-31), XP055353100 *
TIAN YU ET AL.: "Adaptive Optics Images Restoration Based on Frame Selection and Multi- Frame Blind Deconvolution", ACTA ASTRONOMICA SINICA, vol. 49, no. 4, 31 October 2008 (2008-10-31), XP026091076, ISSN: 0001-5245, Retrieved from the Internet <URL:https://doi.org/10.1016/j.chinastron.2009.03.004> *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697464A (en) * 2018-12-17 2019-04-30 环球智达科技(北京)有限公司 Method and system for precise target recognition based on object detection and signature search
CN112132169A (en) * 2019-06-25 2020-12-25 富士通株式会社 Information processing apparatus and information processing method
CN112132169B (en) * 2019-06-25 2023-08-04 富士通株式会社 Information processing apparatus and information processing method
CN110570371A (en) * 2019-08-28 2019-12-13 天津大学 An image defogging method based on multi-scale residual learning
CN110570371B (en) * 2019-08-28 2023-08-29 天津大学 Image defogging method based on multi-scale residual learning
CN112184590A (en) * 2020-09-30 2021-01-05 西安理工大学 A Single Dust Image Restoration Method Based on Grayscale World Self-Guided Network
CN112184590B (en) * 2020-09-30 2024-03-26 西安理工大学 Single dust image restoration method based on gray-world self-guided network
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium
CN114283350B (en) * 2021-09-17 2024-06-07 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109635621B (en) System and method for recognizing gestures based on deep learning in first-person perspective
WO2019222951A1 (en) Method and apparatus for computer vision
US11443438B2 (en) Network module and distribution method and apparatus, electronic device, and storage medium
US8823798B2 (en) Obscuring identification information in an image of a vehicle
CN108229324B (en) Gesture tracking method and device, electronic equipment and computer storage medium
WO2018132961A1 (en) Apparatus, method and computer program product for object detection
US20200380263A1 (en) Detecting key frames in video compression in an artificial intelligence semiconductor solution
US11386287B2 (en) Method and apparatus for computer vision
WO2017074786A1 (en) System and method for automatic detection of spherical video content
WO2022166625A1 (en) Method for information pushing in vehicle travel scenario, and related apparatus
WO2018002436A1 (en) Method and apparatus for removing turbid objects in an image
WO2017197593A1 (en) Apparatus, method and computer program product for recovering editable slide
CN112396060B (en) Identification card recognition method based on identification card segmentation model and related equipment thereof
CN112287945A (en) Screen fragmentation determination method and device, computer equipment and computer readable storage medium
CN107886110A Face detection method and device, and electronic equipment
CN110121719A (en) Device, method and computer program product for deep learning
CN107516295A (en) Method and device for removing noise in an image
CN108304840B (en) Image data processing method and device
CN116434173B (en) Road image detection method, device, electronic device and storage medium
CN118451417A (en) Traffic robbery identification method, device and system based on large model and storage medium
CN117557930A (en) Fire disaster identification method and fire disaster identification device based on aerial image
CN117113231A Multi-modal dangerous environment perception and early warning method for head-down smartphone users, based on mobile terminals
CN113627243B (en) A text recognition method and related device
CN113628148A (en) Infrared image noise reduction method and device
CN119027908B (en) Multi-scale traffic signal lamp detection and identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17893456

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17893456

Country of ref document: EP

Kind code of ref document: A1