
US12299968B2 - Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium - Google Patents


Info

Publication number
US12299968B2
Authority
US
United States
Prior art keywords
neural network
situation
inattention
region
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US17/631,083
Other versions
US20220277558A1 (en)
Inventor
Xiaohui Li
Gang Peng
Nan Nan
Liping Ye
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Allwinner Technology Co Ltd
Original Assignee
Allwinner Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Allwinner Technology Co Ltd filed Critical Allwinner Technology Co Ltd
Assigned to ALLWINNER TECHNOLOGY CO., LTD. reassignment ALLWINNER TECHNOLOGY CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: LI, XIAOHUI, NAN, Nan, PENG, GANG, YE, LIPING
Publication of US20220277558A1
Application granted
Publication of US12299968B2
Legal status: Active
Adjusted expiration

Classifications

    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/809: Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06V 40/161: Human faces; detection, localisation, normalisation
    • G06V 40/18: Eye characteristics, e.g. of the iris

Definitions

  • the present invention relates to the field of image recognition processing, and in particular, to an attention detection method based on a cascade neural network, a computer apparatus and a computer-readable storage medium for implementing the method.
  • Human attention detection has always been one of the focuses and hotspots in the field of machine learning, and is mainly used in areas such as security protection and assisted driving. Due to a large number of uncertain factors in a real environment, such as the influence of different illumination conditions such as day and night, the diversity of human head postures and expressions, differences in race, gender and age, and the wearing of glasses, detecting a human attention state in the real environment is quite challenging.
  • remote eye tracking is a classical algorithm for detecting human attention.
  • this method needs to rely on a near-infrared lighting device to produce a bright pupil effect, so as to capture eye information.
  • the near-infrared lighting device used in this method is easily damaged due to vibration and bumps, and needs long-term maintenance, which is relatively high in cost.
  • a method of attention estimation based on a convolutional neural network may be used to automatically learn information of head posture features and eye features from data samples, without manually designing a feature extraction algorithm, and has good robustness.
  • a model of the convolutional neural network used is large in size and high in computational complexity, which is not suitable for an embedded device, resulting in great limitations on the use of this method.
  • a main objective of the present invention is to provide an attention detection method based on a cascade neural network, which has low computational complexity and good computational performance.
  • Another objective of the present invention is to provide a computer apparatus for implementing the foregoing attention detection method based on a cascade neural network.
  • Still another objective of the present invention is to provide a computer-readable storage medium for implementing the foregoing attention detection method based on a cascade neural network.
  • an attention detection method based on a cascade neural network includes: obtaining video data, recognizing a plurality of image frames, and extracting a face region of the plurality of image frames; recognizing the face region by using a first convolutional neural network to judge whether a first situation of inattention occurs; and recognizing, if it is confirmed that no first situation of inattention occurs, the face region by using a second convolutional neural network to judge whether a second situation of inattention occurs, where computational complexity of the first convolutional neural network is less than computational complexity of the second convolutional neural network.
  • the recognizing the face region by using a second convolutional neural network includes: capturing a plurality of kinds of regions of interest from the face region, and judging whether a second situation of inattention occurs according to results of recognition of two or more kinds of regions of interest.
  • the plurality of kinds of regions of interest include a face frame region and a face supplement region; and the judging whether a second situation of inattention occurs according to results of recognition of two or more kinds of regions of interest includes: judging whether a second situation of inattention occurs according to results of image recognition of the face frame region and the face supplement region.
  • the plurality of kinds of regions of interest include a face frame region and an eye region; and the judging whether a second situation of inattention occurs according to results of recognition of two or more kinds of regions of interest includes: judging whether a second situation of inattention occurs according to results of image recognition of the face frame region and the eye region.
  • the judging whether a first situation of inattention occurs includes: recognizing the face region by using the first convolutional neural network, judging whether a rotation angle of the head in a preset direction is greater than a preset angle, and if so, confirming that the first situation of inattention occurs.
  • the second convolutional neural network includes a first convolution layer, a depthwise convolution layer, a plurality of bottleneck residual layers, a second convolution layer, a linear global depthwise convolution layer, a linear convolution layer, a fully connected layer and a classification layer that are sequentially cascaded.
  • the recognizing a plurality of image frames after obtaining the video data includes: selecting one image frame from every preset number of consecutive image frames of the video data for recognition.
  • a computer apparatus includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, each step of the foregoing attention detection method based on a cascade neural network is implemented.
  • a computer-readable storage medium stores a computer program, where when the computer program is executed by a processor, each step of the foregoing attention detection method based on a cascade neural network is implemented.
  • after the face region of a plurality of image frames is extracted, recognition is first performed by using a first convolutional neural network to judge whether a first situation of inattention occurs, and only when it is confirmed that no first situation of inattention occurs, recognition is performed by using a second convolutional neural network to judge whether a second situation of inattention occurs. This avoids using a convolutional neural network with relatively high computational complexity for all situations, thereby reducing the overall complexity of the attention detection.
  • the face region is divided into a plurality of kinds of regions of interest, the kinds of regions of interest are separately recognized, and then fusion analysis is performed on the recognition results of the kinds of regions of interest to judge whether the second situation of inattention occurs. This may improve the accuracy of the analysis and gives a better effect in recognizing inattention.
  • by recognizing a face frame region and a face supplement region separately, a plurality of attention situations of people, such as a driver looking at the left rearview mirror, looking at the front, looking at the vehicle interior rearview mirror, looking at the right rearview mirror, looking at the dashboard, looking at the central control region, or closing the eyes, may be recognized.
  • by combining the results of recognition of the face frame region and the face supplement region, it can be judged whether the driver is distracted while driving or wants to change lanes.
  • whether the driver is fatigued, is distracted or wants to change lanes can be judged by using the results of recognition of the face frame region and the face supplement region. When several consecutive face images fall into the eyes-closed situation, it can be judged that the driver is fatigued; when several consecutive face images fall into a situation of the head being deviated to the left or right side and the gaze being directed to the left or right side, it can be considered that the second situation of inattention occurs.
  • the design of the first convolutional neural network is therefore very simple, and its computation amount is small.
  • the second convolutional neural network is relatively complex and can be used to accurately recognize whether other situations of inattention occur to the driver, so its judgment is more accurate.
  • one image frame is selected from every set number of consecutive image frames to be recognized, which can greatly reduce the computation amount of inattention recognition, and can ensure the accuracy of recognition results.
  • one image frame may be selected from every six image frames for recognition.
  • FIG. 1 is a flowchart of an embodiment of an attention detection method based on a cascade neural network according to the present invention
  • FIG. 2 shows a formula for calculating a softmax probability value of a value in an embodiment of an attention detection method based on a cascade neural network according to the present invention
  • FIG. 3 is a structural block diagram of a first convolutional neural network in an embodiment of an attention detection method based on a cascade neural network according to the present invention
  • FIG. 4 is a schematic diagram of four regions of interest for image recognition by using an embodiment of an attention detection method based on a cascade neural network according to the present invention
  • FIG. 5 is a structural block diagram of a second convolutional neural network in an embodiment of an attention detection method based on a cascade neural network according to the present invention
  • FIG. 6 is a structural block diagram when a step size of a bottleneck residual layer of a second convolutional neural network is 1 in an embodiment of an attention detection method based on a cascade neural network according to the present invention.
  • FIG. 7 is a structural block diagram when a step size of a bottleneck residual layer of a second convolutional neural network is 2 in an embodiment of an attention detection method based on a cascade neural network according to the present invention.
  • an attention detection method based on a cascade neural network is applied to an intelligent device.
  • the intelligent device is provided with a photographing apparatus, such as a camera.
  • the intelligent device uses video data obtained by the photographing apparatus to perform image analysis, so as to judge whether a situation of inattention occurs to a specific person.
  • the intelligent device is provided with a processor and a memory storing a computer program, where the processor implements the attention detection method based on a cascade neural network by executing the computer program.
  • This embodiment is mainly based on head postures and eye information, and a cascade convolutional neural network is applied to detect attention of a specific person.
  • the entire method mainly includes three steps: video collection, image processing, and attention detection.
  • in the video collection step, video data is photographed by the photographing apparatus. In this embodiment, recognition can be performed on the basis of video data from different scenarios (including different photographing angles, external illumination conditions, target positions, and the like). Therefore, the photographing apparatus can obtain target video data in various different postures.
  • in the image processing step, a plurality of image frames are obtained from the video data, the frames are detected by using a face detection algorithm, and images of a face region are captured.
  • in the attention detection step, a first convolutional neural network with low computational complexity is first used to judge head postures of a detected object, so as to achieve a preliminary judgment of attention detection; then the detected face region of interest is further captured and expanded, a second convolutional neural network with high computational complexity is used to extract information of head posture features and eye features, and human behavior is judged by analyzing the human gaze direction.
  • the cascade convolutional neural network adopted in this embodiment has good generalization performance and low computational complexity, and is suitable for an embedded device.
  • Step S1 is performed first to obtain video data, that is, the photographing apparatus of the intelligent device obtains consecutive video data.
  • the intelligent device may be a device disposed in a vehicle to detect whether the driver is inattentive, and the photographing apparatus may be disposed in front of the driver's seat or diagonally in front of the driver's seat, for example, below the sun visor of the driver's seat or above the center console.
  • the photographing apparatus may start recording a video after a vehicle engine is started, and transmit obtained consecutive video data to the processor, and the processor processes the video data.
  • step S2 is performed to recognize an image and extract a face region in the image. Because the video data obtained in step S1 includes a plurality of consecutive image frames, step S2 is to recognize the plurality of received image frames. However, because the images of consecutive frames are very similar, recognizing every frame would lead to a huge computation amount. In addition, the recognition results of adjacent frames are often the same. Therefore, in this embodiment, one image frame may be selected from every set number of consecutive image frames for recognition. For example, one image frame is selected from every six or eight image frames for recognition, that is, face detection is performed on this image frame, and the detected face region is captured.
  • the process of face detection is a process of confirming positions, sizes and postures of all faces in an image under the assumption that one or more faces exist in the input image. This process may be implemented by using the currently known face detection algorithm, and will not be described in detail herein.
  • a cascade convolutional neural network is used to recognize images.
  • the cascade convolutional neural network includes a first convolutional neural network and a second convolutional neural network, where the first convolutional neural network is used to judge head postures of the detected object, so as to achieve the preliminary judgment of attention, that is, to judge whether the first situation of inattention occurs.
  • the second convolutional neural network is used to extract the information of the head posture features and the eye features, and then judge human behavior by analyzing the human gaze direction to implement attention detection.
  • step S3 is performed first to recognize the face region by using the first convolutional neural network.
  • an attention concentration state of the detected object is judged by using a detection model which is trained in advance, for example, a situation in which the head of the detected object is rotated by more than a certain angle may be set to be a first situation of inattention.
  • a situation in which the driver turns the head leftward (by more than 60°), turns the head rightward (by more than 60°), raises the head (by more than 60°) or lowers the head (by more than 60°) is a situation of inattention.
  • the first convolutional neural network is a convolutional neural network with a small size and low computational complexity.
  • the first convolutional neural network of this embodiment includes several convolution layers, several pooling layers, a fully connected layer 16 and a classification layer 17.
  • Each pooling layer is located between two adjacent convolution layers.
  • a dashed box 11 includes units formed by combining a plurality of convolution layers and pooling layers. Each unit includes one convolution layer and one pooling layer, and the output of the last pooling layer is input to the convolution layer 15, so that the number of convolution layers is greater than the number of pooling layers by one.
  • two kinds of parameters of a plurality of convolution layers are provided.
  • One kind of parameters of the convolution layers is as follows: m filters are provided, the convolution kernel size is k1×k1, and the step size is S1 pixels; the other kind of parameters is as follows: n filters are provided, the convolution kernel size is k2×k2, and the step size is S2 pixels.
  • Each pooling layer samples the output of the previous convolution layer.
  • the fully connected layer 16 is configured to implement the process of transforming a two-dimensional feature matrix output by the convolution layer 15 into a one-dimensional feature vector.
  • the classification layer 17 uses a softmax function to map the outputs of a plurality of neurons into the (0, 1) interval, which may be understood as a probability distribution. Assuming that the probability distribution vector is P, and Pi denotes the i-th value in P, the definition of the softmax probability value of this value is shown in the formula of FIG. 2.
  • the maximum value is found in P, and the category corresponding to the i with the highest probability is used as the detection result.
  • the detection result is whether the rotation angle of the driver's head exceeds the preset angle.
  • step S4 is performed to judge whether the detection result of step S3 is that the rotation angle of the driver's head exceeds the preset angle. If so, it is confirmed that the first situation of inattention occurs to the driver. In this case, step S9 is performed to send warning information, such as voice warning information.
  • step S5 is performed first to capture a plurality of kinds of regions of interest from the face region.
  • a first kind of region of interest is obtained directly from the embedded camera's field of view; the corresponding human image information is the part between the rearview mirror and the mirror on the left of the driver, i.e., the image part in a dashed box 21.
  • in this case, the human attention can be judged directly by using the image information, without face detection operations.
  • a second kind of region of interest is obtained by using the known face detection algorithm to detect and capture a face frame as the input image, i.e., the image region within a solid box 22.
  • the second kind of region of interest may be referred to as the face frame region.
  • a third kind of region of interest is obtained by expanding the detected face frame in four directions, up, down, left and right, adding information of an additional face part to the second kind of region of interest, such as the image region in a solid box 23 in the figure.
  • the third kind of region of interest may be referred to as the face supplement region.
  • in this way, additional auxiliary features are added, which not only confirm the position of the human head but also provide very good robustness.
  • a fourth kind of region of interest is obtained by capturing only the upper half of the face in the second kind of region of interest, such as the image region in a solid box 24 in FIG. 4. The fourth kind of region of interest is therefore the eye region, and is mainly used to judge the driver's attention from eye information.
  • step S6 is performed to recognize the plurality of kinds of regions of interest by using the second convolutional neural network.
  • an attention detection model which is trained in advance is used to recognize the plurality of kinds of regions of interest and classify the recognition results.
  • the attention of the detected object falls into seven categories, namely, looking at a left rearview mirror, looking at the front, looking at a vehicle interior rearview mirror, looking at a right rearview mirror, looking at a dashboard, looking at a central control region, and closing eyes.
  • the attention of the detected object may fall into six categories, namely, looking at the left side, looking at the right side, looking at the front, looking above, looking down, and closing eyes.
  • the second convolutional neural network in this embodiment includes a first convolution layer 31, a depthwise convolution layer 32, several bottleneck residual layers, a second convolution layer 35, a linear global depthwise convolution (linear GDConv) layer 36, a linear convolution (linear Conv) layer 37, a fully connected layer 38, and a classification layer 39 that are sequentially cascaded.
  • a dashed box in FIG. 5 denotes a unit composed of a plurality of bottleneck residual layers.
  • the plurality of bottleneck residual layers include bottleneck residual layers 33, 34, and the like.
  • each bottleneck residual layer is repeated n_i times, the number of channel expansions at each layer is t_i, and the step size is s_i.
  • parameters of the first convolution layer 31 and the second convolution layer 35 may be different.
  • Parameters of one convolution layer are as follows: m filters are provided, the convolution kernel size is k1×k1, and the step size is S1 pixels; parameters of the other convolution layer are as follows: n filters are provided, the convolution kernel size is k2×k2, and the step size is S2 pixels.
  • the depthwise convolution layer 32 performs a convolution operation on each input channel with the convolution kernel of the corresponding channel. Assuming that the input dimension is m and the size is w×h, m filters are provided for the convolution layer, the convolution kernel size is k×k, and a depthwise convolution operation is used. In this case, the output dimension is m, and the size is w′×h′.
  • Each bottleneck residual layer includes a convolution unit, a depthwise convolution unit, and a residual unit.
  • the depthwise convolution unit is configured to receive an output of the convolution unit.
  • when the step size of the convolution unit is 1, residual computation of the bottleneck residual layer is implemented by the residual unit.
  • as shown in FIG. 6, when the step size at the convolution unit is 1 and c′ channels are provided, the values on corresponding channels of the inputs and outputs are added to implement residual computation; that is, input data passes through a first convolution unit 41, a depthwise convolution unit 42, a second convolution unit 43 and a residual unit 44 that are sequentially cascaded, and the cumulative calculation of inputs and outputs is implemented at the residual unit 44.
  • when the step size is 2, the input dimension is [w, h] and the output dimension is [w′, h′]. Because the input dimension is not equal to the output dimension, residual computation is not performed in this case.
  • the structural block diagram in this case is shown in FIG. 7 .
  • the input passes through a first convolution unit 51, a depthwise convolution unit 52 and a second convolution unit 53 in sequence and is then output.
  • the convolution kernel size of the linear global depthwise convolution layer 36 is the same as the input size: with n input channels of size k×k and m filters of convolution kernel size k×k, m output channels of size 1×1 are produced (see the shape-check sketch at the end of this section).
  • the linear convolution layer 37 is a convolution layer in a special form, and uses a linear function as its activation function.
  • a calculation process of the fully connected layer 38 is a process of converting a two-dimensional feature matrix output by an upper layer into a one-dimensional feature vector, and the output dimension is the same as the number of classifications.
  • the calculation method for the classification layer 39 is the same as the calculation method for the classification layer 17 of the first convolutional neural network, and is not described in detail herein.
  • step S7 is performed to carry out fusion analysis on the results of recognition of the four kinds of regions of interest obtained in step S6.
  • the face detection algorithm is first used to detect the face region, and the corresponding face frame image is captured to achieve the classification of the face frame; then the captured face frame is expanded in four directions to obtain a new image and achieve the classification of the new image, and whether the driver is distracted while driving or wants to change lanes can be judged by using the results of the two classifications.
  • the face detection algorithm may instead first be used to detect the face region and capture the corresponding face frame image to achieve the classification of the face frame; then the upper half of the face frame is retained, that is, information of the eye region is obtained, and the eye information is classified. Fusion analysis is performed by using the two classification results, so that it can be judged whether the driver is fatigued, is distracted or wants to change lanes. When several consecutive face images fall into the eyes-closed situation, it can be judged that the driver is fatigued.
  • this method may also be applied to other scenarios, such as the detection of students' attention in class.
  • step S8 is performed to judge whether a second situation of inattention, such as fatigued driving or distracted driving, occurs according to the result of the fusion analysis of step S7. If so, step S9 is performed to send warning information; otherwise, step S10 is performed to predict the driver's behavior according to the analysis result of step S7, such as an intention to change lanes to the left. The result of the prediction may be provided to other algorithms for use.
  • a situation of vehicles approaching from the left rear, such as whether there is a moving vehicle within a certain distance behind on the left side, can then be detected, so as to send indication information to the driver.
  • the second convolutional neural network may be replaced with a more lightweight network architecture with strong computing power, such as ShuffleNet, or the number of bottleneck residual layers of the convolutional neural network may be reduced and the model retrained.
  • the computer apparatus of this embodiment may be an intelligent device, such as a vehicle-mounted monitoring instrument with an image processing capability.
  • the computer apparatus includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, each step of the foregoing attention detection method based on a cascade neural network is implemented.
  • the computer program may be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to complete each module of the present invention.
  • the one or more modules may be a series of computer instruction segments capable of performing a particular function, and the instruction segments are used to describe the execution process of the computer program in a terminal device.
  • the processor in the present invention may be a central processing unit (CPU), or may be other general-purpose processors, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor, or the like.
  • the processor is a control center of the terminal device, and uses various interfaces and lines to connect various parts of the entire terminal device.
  • the memory may be configured to store a computer program and/or a module, and the processor implements various functions of the terminal device by running or executing the computer program and/or the module stored in the memory and invoking data stored in the memory.
  • the memory may mainly include a program storage region and a data storage region, where the program storage region may store an operating system, application programs required for at least one function (such as a sound playing function and an image playing function), and the like; and the data storage region may store data (such as audio data and an address book) and the like created according to the use of a mobile phone.
  • the memory may include a high-speed random access memory, and may further include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk memory device, a flash memory device or other non-volatile solid-state memory devices.
  • the computer program stored in the computer apparatus may be stored in a computer-readable storage medium if implemented in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or some processes in the method of the foregoing embodiment of the present invention are implemented, or may be completed by instructing related hardware through a computer program.
  • the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, each step of the foregoing attention detection method based on a cascade neural network can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, in the form of object code, an executable file or in some intermediate forms, or the like.
  • the computer-readable storage medium may include any entity or apparatus capable of carrying computer program code, a recording medium, a USB flash disk, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunications signals, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased according to requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer-readable storage medium does not include electrical carrier signals or telecommunications signals.
  • a cascaded convolutional neural network is used for recognition.
  • a first convolutional neural network has relatively low computational complexity and may be used to analyze a simple scenario and judge whether a first situation of inattention occurs to a driver. This can reduce the computation amount of the entire convolutional neural network, and the entire convolutional neural network model is small in size and low in computational complexity.
  • the head posture information is first used to make a preliminary judgment on whether the attention is concentrated, and then the head posture and eye information is used to further detect the driver's attention.
  • when the driver's attention is detected, four methods are adopted to process the original image to obtain four kinds of regions of interest, and the classification results are fused to analyze human behavior and intention. Therefore, the cascaded convolutional neural network of the present invention has good generalization performance and low computational complexity, and is suitable for an embedded device.
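As referenced above for the linear global depthwise convolution layer, its shape behaviour can be checked with a depthwise convolution whose kernel equals the spatial input size. A small PyTorch sketch follows; the channel count and spatial size are illustrative, and for a depthwise layer the number of filters equals the number of channels:

```python
import torch
import torch.nn as nn

# Linear global depthwise convolution: the kernel covers the whole
# input feature map, so each channel collapses to a single value.
# n = 512 channels and k = 7 are illustrative assumptions.
n, k = 512, 7
gdconv = nn.Conv2d(n, n, kernel_size=k, groups=n, bias=False)  # depthwise

x = torch.randn(1, n, k, k)   # n input channels of size k x k
y = gdconv(x)
print(y.shape)                # torch.Size([1, 512, 1, 1])
```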


Abstract

The present invention provides an attention detection method based on a cascade neural network, a computer apparatus, and a computer-readable storage medium. The method includes: obtaining video data, recognizing a plurality of image frames, and extracting a face region of the plurality of image frames; recognizing the face region by using a first convolutional neural network to judge whether a first situation of inattention occurs; and recognizing, if it is confirmed that no first situation of inattention occurs, the face region by using a second convolutional neural network to judge whether a second situation of inattention occurs, where computational complexity of the first convolutional neural network is less than computational complexity of the second convolutional neural network. The present invention further provides the computer apparatus for implementing the foregoing method and the computer-readable storage medium.

Description

PRIORITY CLAIM
This is a U.S. national stage of application No. PCT/CN2019/098407, filed on Jul. 30, 2019, the content of which is incorporated herein by reference.
TECHNICAL FIELD
The present invention relates to the field of image recognition processing, and in particular, to an attention detection method based on a cascade neural network, a computer apparatus and a computer-readable storage medium for implementing the method.
BACKGROUND
With the development of intelligent technology, detecting human attention by image recognition has become a cutting-edge emerging technology. Human attention detection has always been one of the focuses and hotspots in the field of machine learning, and is mainly used in areas such as security protection and assisted driving. Due to a large number of uncertain factors in a real environment, such as the influence of different illumination conditions such as day and night, the diversity of human head postures and expressions, differences in race, gender and age, and the wearing of glasses, detecting a human attention state in the real environment is quite challenging.
In view of this, how to improve human attention detection performance has become a hotspot in the field of artificial intelligence research, and many algorithms have been proposed to this end. For example, remote eye tracking is a classical algorithm for detecting human attention. In an outdoor environment, this method needs to rely on a near-infrared lighting device to produce a bright pupil effect, so as to capture eye information. However, the near-infrared lighting device used in this method is easily damaged by vibration and bumps, and needs long-term maintenance, which is relatively costly.
Therefore, some researchers have proposed to rely on head postures and eye information to detect human attention. One solution is to fuse the head postures, eye features and geometric features of a vehicle, and then classify the regions concerned by human eyes to implement the detection of human attention. This method has achieved a good effect. Another method combines head postures with eye features to classify the regions concerned by human eyes. However, the foregoing two methods have two main problems. The first is that a series of complex operations such as face detection, face calibration, eyeball detection and feature extraction are required, and if a submodule of the algorithm performs poorly, it will inevitably affect the overall effect. The second is that the conventional machine learning methods and conventional feature extraction algorithms used for feature extraction have poor generalization capability. For example, when the photographing angle of the camera, the external illumination conditions or the position of the target changes, the performance of these methods drops sharply.
Therefore, some researchers have proposed a method of attention estimation based on a convolutional neural network. The method may be used to automatically learn information of head posture features and eye features from data samples, without manually designing a feature extraction algorithm, and has good robustness. However, a model of the convolutional neural network used is large in size and high in computational complexity, which is not suitable for an embedded device, resulting in great limitations on the use of this method.
SUMMARY
A main objective of the present invention is to provide an attention detection method based on a cascade neural network, which has low computational complexity and good computational performance.
Another objective of the present invention is to provide a computer apparatus for implementing the foregoing attention detection method based on a cascade neural network.
Still another objective of the present invention is to provide a computer-readable storage medium for implementing the foregoing attention detection method based on a cascade neural network.
Technical Solutions
To achieve the main objective of the present invention, an attention detection method based on a cascade neural network provided in the present invention includes: obtaining video data, recognizing a plurality of image frames, and extracting a face region of the plurality of image frames; recognizing the face region by using a first convolutional neural network to judge whether a first situation of inattention occurs; and recognizing, if it is confirmed that no first situation of inattention occurs, the face region by using a second convolutional neural network to judge whether a second situation of inattention occurs, where computational complexity of the first convolutional neural network is less than computational complexity of the second convolutional neural network.
In a preferred solution, the recognizing the face region by using a second convolutional neural network includes: capturing a plurality of kinds of regions of interest from the face region, and judging whether a second situation of inattention occurs according to results of recognition of two or more kinds of regions of interest.
In a further solution, the plurality of kinds of regions of interest include a face frame region and a face supplement region; and the judging whether a second situation of inattention occurs according to results of recognition of two or more kinds of regions of interest includes: judging whether a second situation of inattention occurs according to results of image recognition of the face frame region and the face supplement region.
In a still further solution, the plurality of kinds of regions of interest include a face frame region and an eye region; and the judging whether a second situation of inattention occurs according to results of recognition of two or more kinds of regions of interest includes: judging whether a second situation of inattention occurs according to results of image recognition of the face frame region and the eye region.
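The patent gives no code for these croppings. As a minimal sketch, assuming a face box in (x, y, w, h) pixel form from any face detector and an illustrative 25% expansion margin (neither fixed by the patent), the three face-derived regions of interest could be cut out as follows:

```python
import numpy as np

def crop_rois(image: np.ndarray, face_box, expand: float = 0.25):
    """Cut the face frame region, face supplement region and eye region
    out of one video frame. `face_box` is (x, y, w, h); the expansion
    ratio is an illustrative assumption, not a value from the patent."""
    x, y, w, h = face_box
    H, W = image.shape[:2]
    dx, dy = int(w * expand), int(h * expand)
    face_frame = image[y:y + h, x:x + w]
    face_supplement = image[max(0, y - dy):min(H, y + h + dy),
                            max(0, x - dx):min(W, x + w + dx)]
    eye_region = image[y:y + h // 2, x:x + w]  # upper half of the face
    return face_frame, face_supplement, eye_region
```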
In a still further solution, the judging whether a first situation of inattention occurs includes: recognizing the face region by using the first convolutional neural network, judging whether a rotation angle of the head in a preset direction is greater than a preset angle, and if so, confirming that the first situation of inattention occurs.
In a still further solution, the second convolutional neural network includes a first convolution layer, a depthwise convolution layer, a plurality of bottleneck residual layers, a second convolution layer, a linear global depthwise convolution layer, a linear convolution layer, a fully connected layer and a classification layer that are sequentially cascaded.
In a still further solution, the bottleneck residual layer includes a convolution unit and a depthwise convolution unit for receiving an output of the convolution unit, and is further provided with a residual unit, and the residual unit implements residual computation of the bottleneck residual layer when a step size of the convolution unit is 1.
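This layer stack closely matches a MobileNetV2-style inverted residual design. A minimal PyTorch sketch of one bottleneck residual layer follows, with the skip connection active only at step size 1; the batch normalization, ReLU6 activations and expansion factor are conventional assumptions, not values stated in the patent:

```python
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """One bottleneck residual layer: 1x1 expansion convolution ->
    depthwise convolution -> 1x1 linear projection, with the residual
    skip applied only when the step size is 1 and the channel counts
    match (FIG. 6), and omitted otherwise (FIG. 7)."""
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        hidden = c_in * t
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride,
                      padding=1, groups=hidden, bias=False),   # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # project
            nn.BatchNorm2d(c_out),                             # linear output
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```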
In a still further solution, the recognizing a plurality of image frames after obtaining the video data includes: selecting one image frame from every preset number of consecutive image frames of the video data for recognition.
To achieve another objective, a computer apparatus provided in the present invention includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, each step of the foregoing attention detection method based on a cascade neural network is implemented.
To achieve still another objective, a computer-readable storage medium provided in the present invention stores a computer program, where when the computer program is executed by a processor, each step of the foregoing attention detection method based on a cascade neural network is implemented.
Beneficial Effects
In the solution of the present invention, after a face region of a plurality of image frames is extracted, recognition is first performed by using a first convolutional neural network to judge whether a first situation of inattention occurs, and only when it is confirmed that no first situation of inattention occurs, recognition is performed by using a second convolutional neural network to judge whether a second situation of inattention occurs. This avoids using a convolutional neural network with relatively high computational complexity for all situations, thereby reducing the overall complexity of the attention detection.
In addition, in the solution of the present invention, when whether the second situation of inattention occurs is judged by using the second convolutional neural network, the face region is divided into a plurality of kinds of regions of interest, the kinds of regions of interest are separately recognized, and then fusion analysis is performed on the recognition results to judge whether the second situation of inattention occurs. This may improve the accuracy of the analysis and gives a better effect in recognizing inattention.
Specifically, by recognizing a face frame region and a face supplement region separately, a plurality of attention situations of people, such as a driver looking at the left rearview mirror, looking at the front, looking at the vehicle interior rearview mirror, looking at the right rearview mirror, looking at the dashboard, looking at the central control region, or closing the eyes, may be recognized. By combining the results of recognition of the face frame region and the face supplement region, it can be judged whether the driver is distracted while driving or wants to change lanes. When several consecutive face images fall into the situations of looking at the vehicle interior rearview mirror, looking at the dashboard or looking at the central control region, it can be judged that the driver is distracted; and when several consecutive face images fall into the situations of looking at the left rearview mirror and looking at the front, it can be judged that the driver wants to change lanes.
Whether the driver is fatigued, is distracted or wants to change lanes can be judged by using the results of recognition of the face frame region and the face supplement region. When several consecutive face images fall into the eyes-closed situation, it can be judged that the driver is fatigued. When several consecutive face images fall into a situation of the head being deviated to the left or right side and the gaze being directed to the left or right side, it can be considered that the second situation of inattention occurs.
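A hedged sketch of this consecutive-frame fusion logic follows; the five-frame window and the category labels are illustrative assumptions, not values fixed by the patent:

```python
from collections import deque

# Per-frame gaze categories that, sustained together, indicate distraction.
DISTRACTED = {"interior_mirror", "dashboard", "central_control"}

class FusionJudge:
    """Rolling judgment over the last `n` recognized frames, following
    the consecutive-frame rules described above."""
    def __init__(self, n=5):
        self.history = deque(maxlen=n)

    def update(self, category):
        self.history.append(category)
        full = len(self.history) == self.history.maxlen
        if full and all(c == "closing_eyes" for c in self.history):
            return "fatigued"
        if full and all(c in DISTRACTED for c in self.history):
            return "distracted"
        if full and all(c in {"left_mirror", "front"} for c in self.history):
            return "lane_change_intent"
        return None  # no second situation of inattention yet
```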
To judge whether the first situation of inattention occurs to the driver, it is only required to judge whether the rotation angle of the driver's head in a preset direction is greater than a preset angle; for example, when the head rotates upward, downward, leftward or rightward by more than 60°, it can be considered that the first situation of inattention occurs. In this way, the design of the first convolutional neural network is very simple, and its computation amount is small. Once it is judged that the first situation of inattention occurs, it is not required to judge the second situation, which saves overall computation in inattention judging.
However, the second convolutional neural network is relatively complex and can be used to accurately recognize whether other situations of inattention occur to the driver, so its judgment is more accurate.
In addition, because images of successive frames in the obtained video data are very similar, the recognition of all frames will lead to a huge computation amount, and a large number of similar calculations will result in substantially the same calculation results. Therefore, only one image frame is selected from every set number of consecutive image frames to be recognized, which can greatly reduce the computation amount of inattention recognition, and can ensure the accuracy of recognition results. Preferably, one image frame may be selected from every six image frames for recognition.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a flowchart of an embodiment of an attention detection method based on a cascade neural network according to the present invention;
FIG. 2 shows a formula for calculating a softmax probability value of a value in an embodiment of an attention detection method based on a cascade neural network according to the present invention;
FIG. 3 is a structural block diagram of a first convolutional neural network in an embodiment of an attention detection method based on a cascade neural network according to the present invention;
FIG. 4 is a schematic diagram of four regions of interest for image recognition by using an embodiment of an attention detection method based on a cascade neural network according to the present invention;
FIG. 5 is a structural block diagram of a second convolutional neural network in an embodiment of an attention detection method based on a cascade neural network according to the present invention;
FIG. 6 is a structural block diagram when a step size of a bottleneck residual layer of a second convolutional neural network is 1 in an embodiment of an attention detection method based on a cascade neural network according to the present invention; and
FIG. 7 is a structural block diagram when a step size of a bottleneck residual layer of a second convolutional neural network is 2 in an embodiment of an attention detection method based on a cascade neural network according to the present invention.
The present invention is further described below in conjunction with the accompanying drawings and embodiments.
DESCRIPTION OF EMBODIMENTS
An attention detection method based on a cascade neural network according to the present invention is applied to an intelligent device. Preferably, the intelligent device is provided with a photographing apparatus, such as a camera. The intelligent device uses video data obtained by the photographing apparatus to perform image analysis, so as to judge whether a situation of inattention occurs to a specific person. Preferably, the intelligent device is provided with a processor and a memory storing a computer program, where the processor implements the attention detection method based on a cascade neural network by executing the computer program.
Embodiment of an attention detection method based on a cascade neural network
This embodiment is mainly based on head postures and eye information, and a cascade convolutional neural network is applied to detect attention of a specific person. The entire method mainly includes three steps: video collection, image processing, and attention detection.
In the video collection step, video data is photographed by the photographing apparatus. In this embodiment, recognition can be performed on the basis of the video data of different scenarios (including different photographing angles, external illumination conditions, a position of a target, and the like). Therefore, the photographing apparatus can obtain target video data in various different postures. In the image processing step, a plurality of image frames are obtained from the video data, the frames are detected by using a face detection algorithm, and images of a face region are captured. In the attention detection step, a first convolutional neural network with low computational complexity is first used to judge head postures of a detected object, so as to achieve a preliminary judgment of attention detection; then, the detected face region of interest is further captured and expanded, a second convolutional neural network with high computational complexity is used to extract information of head posture features and eye features, and human behavior is judged by analyzing the human gaze direction. The cascade convolutional neural network adopted in this embodiment has good generalization performance and low computational complexity, and is suitable for an embedded device.
A specific operating method of this embodiment is described below in conjunction with FIG. 1. Step S1 is performed first to obtain video data, that is, the photographing apparatus of the intelligent device obtains consecutive video data. Specifically, the intelligent device may be a device disposed in a vehicle to detect whether the driver is inattentive, and the photographing apparatus may be disposed in front of the driver's seat or diagonally in front of the driver's seat, for example, below the sun visor of the driver's seat or above the center console. The photographing apparatus may start recording a video after the vehicle engine is started, and transmit the obtained consecutive video data to the processor, and the processor processes the video data.
Then, step S2 is performed to recognize images and extract the face region in each image. Because the video data obtained in step S1 includes a plurality of consecutive image frames, step S2 recognizes the plurality of received image frames. However, because consecutive image frames are very similar, recognizing every frame would require an enormous amount of computation, and the recognition results of adjacent frames are often identical. Therefore, in this embodiment, one image frame may be selected from every set number of consecutive image frames for recognition; for example, one frame is selected from every six or eight frames, face detection is performed on that frame, and the detected face region is captured. Specifically, face detection is the process of confirming the positions, sizes and postures of all faces in an image under the assumption that one or more faces exist in the input image. This process may be implemented by a currently known face detection algorithm and is not described in detail herein.
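As a non-limiting sketch of this sampling-and-capture logic, assuming OpenCV's bundled Haar cascade stands in for "the currently known face detection algorithm" and an illustrative sampling interval of six frames:

```python
import cv2

FRAME_INTERVAL = 6  # assumed: recognize one frame out of every six

# OpenCV's bundled Haar cascade is used here only as a stand-in face detector.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def sample_face_regions(video_path):
    """Yield cropped face regions from every FRAME_INTERVAL-th frame."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % FRAME_INTERVAL == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
                yield frame[y:y + h, x:x + w], (x, y, w, h)
        index += 1
    cap.release()
```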
Then, attention detection is performed on the extracted face region. Specifically, step S3 to step S10 are performed. In this embodiment, a cascade convolutional neural network is used to recognize images. Specifically, the cascade convolutional neural network includes a first convolutional neural network and a second convolutional neural network, where the first convolutional neural network is used to judge head postures of the detected object, so as to achieve the preliminary judgment of attention, that is, to judge whether the first situation of inattention occurs. The second convolutional neural network is used to extract the information of the head posture features and the eye features, and then judge human behavior by analyzing the human gaze direction to implement attention detection.
Specifically, step S3 is performed first to recognize the face region by using the first convolutional neural network. In this embodiment, the attention concentration state of the detected object is judged by a pre-trained detection model; for example, a situation in which the head of the detected object is rotated by more than a certain angle may be defined as the first situation of inattention. For example, a situation in which the driver turns the head leftward (by more than 60°), turns the head rightward (by more than 60°), raises the head (by more than 60°) or lowers the head (by more than 60°) is a situation of inattention.
Therefore, the recognition task of the first convolutional neural network is relatively simple and its categories are easy to distinguish, so the first convolutional neural network is a convolutional neural network with a small size and low computational complexity. Referring to FIG. 3, the first convolutional neural network of this embodiment includes several convolution layers, several pooling layers, a fully connected layer 16 and a classification layer 17. Each pooling layer is located between two adjacent convolution layers. As shown in FIG. 3, dashed box 11 contains units formed by combining convolution layers and pooling layers: each unit includes one convolution layer and one pooling layer, and the output of the last pooling layer is input to convolution layer 15, so the number of convolution layers is greater than the number of pooling layers by one.
In this embodiment, two sets of parameters are provided for the convolution layers. One set is: m filters, a convolution kernel size of k1×k1, and a step size of S1 pixels; the other set is: n filters, a convolution kernel size of k2×k2, and a step size of S2 pixels. Each pooling layer downsamples the output of the preceding convolution layer. The fully connected layer 16 transforms the two-dimensional feature matrix output by convolution layer 15 into a one-dimensional feature vector. The classification layer 17, as the last layer of the first convolutional neural network, uses a softmax function to map the outputs of a plurality of neurons into the (0, 1) interval, which may be understood as a probability distribution. Assuming the probability distribution vector is P and P_i denotes the ith value in P, the softmax probability value of this element is defined by the formula of FIG. 2.
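Assuming the formula of FIG. 2 takes the standard form, the softmax probability of the ith element may be written as:

```latex
\mathrm{softmax}(P)_i = \frac{e^{P_i}}{\sum_{j} e^{P_j}}
```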
The maximum value in P is found, and the category corresponding to the index i with the highest probability is taken as the detection result, namely whether the rotation angle of the driver's head exceeds the preset angle.
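For illustration only, a network of this overall shape might be sketched in PyTorch as follows; the channel counts, kernel sizes, input resolution, and the five output categories are placeholder assumptions, since the embodiment leaves them unspecified:

```python
import torch
import torch.nn as nn

class FirstCNN(nn.Module):
    """Lightweight head-posture classifier: alternating conv/pool units,
    a final convolution, a fully connected layer, and a softmax classifier."""
    def __init__(self, num_classes=5, m=16, n=32, k1=3, k2=3, s1=1, s2=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, m, kernel_size=k1, stride=s1, padding=k1 // 2),  # conv
            nn.MaxPool2d(2),                                              # pool
            nn.Conv2d(m, n, kernel_size=k2, stride=s2, padding=k2 // 2),  # conv
            nn.MaxPool2d(2),                                              # pool
            nn.Conv2d(n, n, kernel_size=k2, stride=s2, padding=k2 // 2),  # final conv (layer 15)
        )
        self.fc = nn.LazyLinear(num_classes)  # fully connected layer 16

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)                    # 2-D feature maps -> 1-D vector
        return torch.softmax(self.fc(x), dim=1)    # classification layer 17

# Detection result: the category with the highest softmax probability.
# probs = FirstCNN()(torch.randn(1, 3, 64, 64)); pred = probs.argmax(dim=1)
```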
Then, step S4 is performed to judge whether the detection result of step S3 is that the rotation angle of the driver's head exceeds the preset angle. If so, it is confirmed that the first situation of inattention occurs to the driver. In this case, step S9 is performed to send warning information, such as voice warning information.
If it is confirmed that no first situation of inattention occurs, the second convolutional neural network is applied to judge whether a second situation of inattention occurs. Specifically, step S5 is performed first to capture a plurality of kinds of regions of interest from the face region. Referring to FIG. 4, with the driver sitting in the driving position as an example, the first kind of region of interest is obtained directly from the embedded camera's field of view; the corresponding image information is the part between the interior rearview mirror and the mirror on the left of the driver, i.e., the image part in dashed box 21. In the first kind of region of interest, human attention can be judged directly from the image information without face detection operations. The second kind of region of interest is obtained by using a known face detection algorithm to detect and capture the face frame as the input image, i.e., the image region within solid box 22; it may be referred to as the face frame region. The third kind of region of interest is obtained by expanding the detected face frame in four directions (up, down, left and right), adding information of additional face parts to the second kind of region of interest, such as the image region in solid box 23; it may be referred to as the face supplement region. Capturing the third kind of region of interest adds auxiliary features that not only confirm the position of the head but also provide very good robustness. The fourth kind of region of interest is obtained by capturing only the upper half of the face in the second kind of region of interest, such as the image region in solid box 24 in FIG. 4; it is the eye region, and is mainly used to judge the driver's attention from eye information.
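As a rough illustration of step S5, the four kinds of regions of interest might be cropped as follows; the expansion margin and the half-face split are assumed values, and the function and its names are hypothetical:

```python
def capture_rois(frame, face_box, margin=0.25):
    """Return the four kinds of regions of interest described above.
    face_box is (x, y, w, h) from the face detector; margin is an
    assumed expansion ratio for the face supplement region."""
    x, y, w, h = face_box
    H, W = frame.shape[:2]
    # First kind: the full camera field of view, no face detection needed.
    roi_full = frame
    # Second kind: the detected face frame itself (face frame region).
    roi_face = frame[y:y + h, x:x + w]
    # Third kind: face frame expanded up/down/left/right (face supplement region).
    dx, dy = int(w * margin), int(h * margin)
    roi_supp = frame[max(0, y - dy):min(H, y + h + dy),
                     max(0, x - dx):min(W, x + w + dx)]
    # Fourth kind: upper half of the face frame (eye region).
    roi_eyes = frame[y:y + h // 2, x:x + w]
    return roi_full, roi_face, roi_supp, roi_eyes
```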
Then, step S6 is performed to recognize the plurality of kinds of regions of interest by using the second convolutional neural network. For example, a pre-trained attention detection model is used to recognize the regions of interest and classify the recognition results. With the driver sitting in the driving position as an example, the attention of the detected object falls into seven categories: looking at the left rearview mirror, looking at the front, looking at the vehicle interior rearview mirror, looking at the right rearview mirror, looking at the dashboard, looking at the central control region, and closing the eyes. In other application scenarios, the attention of the detected object may fall into six categories: looking to the left, looking to the right, looking at the front, looking up, looking down, and closing the eyes.
The recognition and classification tasks in step S6 are relatively complex, and it is especially difficult to distinguish between adjacent regions such as the front and the dashboard. Therefore, in this embodiment, a convolutional neural network with a strong learning capability and a fast calculation speed is used for recognition, that is, the second convolutional neural network implements the foregoing recognition operation. Referring to FIG. 5, the second convolutional neural network in this embodiment includes a first convolution layer 31, a depthwise convolution layer 32, several bottleneck residual layers, a second convolution layer 35, a linear global depthwise convolution (linear GDConv) layer 36, a linear convolution (linear Conv) layer 37, a fully connected layer 38, and a classification layer 39 that are sequentially cascaded. The dashed box in FIG. 5 denotes a unit composed of a plurality of bottleneck residual layers, for example bottleneck residual layers 33 and 34. The ith bottleneck residual layer is repeated n_i times, with a channel expansion factor of t_i and a step size of s_i at each layer.
In this embodiment, the parameters of the first convolution layer 31 and the second convolution layer 35 may be different. The parameters of one convolution layer are: m filters, a convolution kernel size of k1×k1, and a step size of S1 pixels; the parameters of the other are: n filters, a convolution kernel size of k2×k2, and a step size of S2 pixels.
The depthwise convolution layer 32 performs a convolution operation on each input channel with the convolution kernel of the corresponding channel. Assuming the input has m channels of size w×h, the layer provides m corresponding filters with a convolution kernel size of k×k and applies a depthwise convolution operation; the output then has m channels of size w′×h′.
Each bottleneck residual layer includes a convolution unit, a depthwise convolution unit, and a residual unit, where the depthwise convolution unit receives the output of the convolution unit. When the step size of the convolution unit is 1, the residual unit implements the residual computation of the bottleneck residual layer. As shown in FIG. 6, when the step size is 1 and c′ channels are provided, values on corresponding channels of the input and output are added to implement the residual computation; that is, input data passes through a first convolution unit 41, a depthwise convolution unit 42, a second convolution unit 43 and a residual unit 44 that are sequentially cascaded, and the addition of input and output is performed at the residual unit 44.
When the step size of the convolution unit is 2 and c′ channels are provided, the input dimension is [w, h] and the output dimension is [w′, h′]. Because the input dimension is not equal to the output dimension, no residual computation is performed in this case. The structural block diagram for this case is shown in FIG. 7: the input passes through a first convolution unit 51, a depthwise convolution unit 52 and a second convolution unit 53 in sequence and is then output.
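FIGS. 6 and 7 correspond to the familiar inverted-residual ("bottleneck") pattern; a minimal PyTorch sketch under that assumption is given below. The batch normalization and ReLU6 activations are illustrative choices not specified in the text:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual layer: 1x1 expansion convolution, 3x3 depthwise
    convolution, 1x1 linear projection. The residual addition is applied
    only when the step size is 1 (FIG. 6); with step size 2 the input and
    output sizes differ, so it is skipped (FIG. 7)."""
    def __init__(self, c_in, c_out, stride, t):
        super().__init__()
        hidden = c_in * t  # t is the channel expansion factor
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),       # first convolution unit
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),          # depthwise convolution unit
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),       # second convolution unit (linear)
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out       # residual unit
```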
The convolution kernel size of the linear global depthwise convolution layer 36 is the same as the input size. Suppose the input has n channels each of size k×k and the layer provides m filters with a convolution kernel size of k×k; after the linear global depthwise convolution, the output has m channels of size 1×1.
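In PyTorch terms, such a layer is a depthwise convolution whose kernel spans the whole feature map; a sketch assuming, purely for illustration, 512 channels on a 7×7 input:

```python
import torch
import torch.nn as nn

# groups=channels makes the convolution depthwise, and a kernel size equal
# to the input size collapses each channel's feature map to a single value.
gdconv = nn.Conv2d(512, 512, kernel_size=7, groups=512, bias=False)
out = gdconv(torch.randn(1, 512, 7, 7))  # -> shape (1, 512, 1, 1)
```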
The linear convolution layer 37 is a convolution layer of a special form that uses a linear function as its activation function. The fully connected layer 38 converts the two-dimensional feature matrix output by the preceding layer into a one-dimensional feature vector, and its output dimension equals the number of classes. The calculation method of the classification layer 39 is the same as that of classification layer 17 of the first convolutional neural network and is not described in detail here.
Then, step S7 is performed to carry out fusion analysis on the results of recognizing the four kinds of regions of interest in step S6. Specifically, with the driver sitting in the driving position as an example, when the recognition results of the second and third kinds of regions of interest are fused, the face detection algorithm is first used to detect the face region and capture the corresponding face frame image for classification; the captured face frame is then expanded in four directions to obtain a new image, which is also classified; and whether the driver is distracted or intends to change lanes can be judged from the two classification results. When several consecutive face images fall into the situations of looking at the interior rearview mirror, looking at the dashboard, or looking at the central control region, it can be judged that the driver is driving while distracted. When several consecutive face images fall into the situations of looking at the left rearview mirror and looking at the front, it can be judged that the driver intends to change lanes.
For another example, when the recognition results of the second and fourth kinds of regions of interest are fused, the face detection algorithm is first used to detect the face region and capture the corresponding face frame image for classification; the upper half of the face frame is then retained to obtain the eye region, and the eye information is classified. Fusing the two classification results makes it possible to judge whether the driver is fatigued, distracted, or intends to change lanes: when several consecutive face images fall into the situation of closed eyes, it can be judged that the driver is driving while fatigued. Optionally, this method may also be applied to other scenarios, such as detecting students' attention in class. Combining the recognition results of the second and fourth kinds of regions of interest, when several consecutive face images show the head deviated to the left or right and the gaze directed to the left or right side, it can be considered that a situation of inattention occurs.
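A simplified sketch of this kind of temporal fusion rule follows; the window length, label names, and decision sets are illustrative assumptions rather than the claimed procedure:

```python
from collections import deque

DISTRACTED = {"interior_mirror", "dashboard", "central_control"}
LANE_CHANGE = {"left_mirror", "front"}
WINDOW = 8  # assumed number of consecutive frames to fuse

recent = deque(maxlen=WINDOW)

def fuse(face_label, eye_label):
    """Combine per-frame face-frame and eye-region classifications
    over a sliding window, in the spirit of the fusion analysis of step S7."""
    recent.append((face_label, eye_label))
    if len(recent) < WINDOW:
        return "undecided"
    faces = {f for f, _ in recent}
    eyes = {e for _, e in recent}
    if eyes == {"eyes_closed"}:
        return "fatigued"            # several consecutive closed-eye frames
    if faces <= DISTRACTED:
        return "distracted"          # mirror/dashboard/console gazes only
    if faces <= LANE_CHANGE:
        return "lane_change_intent"  # alternating left mirror and front
    return "attentive"
```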
Then, step S8 is performed to judge, according to the fusion analysis result of step S7, whether a second situation of inattention occurs, such as fatigued driving or distracted driving. If so, step S9 is performed to send warning information; otherwise, step S10 is performed to predict the driver's behavior from the analysis result of step S7, such as intending to change lanes to the left. The prediction result may be provided to other algorithms. For example, in the field of driver assistance, when it is judged from the result of step S7 that the driver intends to change lanes to the left, the presence of vehicles approaching from the left rear, such as whether there is a moving vehicle within a certain distance behind on the left, can be detected, and corresponding indication information can be sent to the driver.
Optionally, the second convolutional neural network may be replaced with a lighter-weight network architecture that retains strong computing capability, such as ShuffleNet, or the number of bottleneck residual layers of the convolutional neural network may be reduced and the model retrained.
Embodiment of a computer apparatus:
The computer apparatus of this embodiment may be an intelligent device, such as a vehicle-mounted monitoring instrument with an image processing capability. The computer apparatus includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, each step of the foregoing attention detection method based on a cascade neural network is implemented.
For example, the computer program may be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to implement the present invention. The one or more modules may be a series of computer instruction segments capable of performing a particular function, and the instruction segments are used to describe the execution process of the computer program in the terminal device.
The processor in the present invention may be a central processing unit (CPU), or may be other general-purpose processors, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like. The general-purpose processor may be a microprocessor or the processor may be any conventional processor, or the like. The processor is a control center of the terminal device, and uses various interfaces and lines to connect various parts of the entire terminal device.
The memory may be configured to store a computer program and/or a module, and the processor implements various functions of the terminal device by running or executing the computer program and/or the module stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage region and a data storage region, where the program storage region may store an operating system, application programs required for at least one function (such as a sound playing function and an image playing function), and the like; and the data storage region may store data created according to the use of a mobile phone (such as audio data and an address book) and the like. In addition, the memory may include a high-speed random access memory, and may further include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk memory device, a flash memory device or other non-volatile solid-state memory devices.
Computer-readable storage medium:
The computer program stored in the computer apparatus may be stored in a computer-readable storage medium if implemented in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or some of the processes in the method of the foregoing embodiment of the present invention may be implemented by instructing related hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements each step of the foregoing attention detection method based on a cascade neural network.
The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium may include any entity or apparatus capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunications signals, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer-readable storage medium does not include electrical carrier signals or telecommunications signals.
Finally, it should be noted that the present invention is not limited to the foregoing implementations. For example, changes in the methods of dividing a plurality of regions of interest or changes in specific processes and results of fusion analysis according to the results of recognition of a plurality of regions of interest should also fall within the protection scope of the claims of the present invention.
INDUSTRIAL APPLICABILITY
In the method of the present invention, a cascaded convolutional neural network is used for recognition. A first convolutional neural network has relatively low computational complexity and may be used to analyze a simple scenario and judge whether a first situation of inattention occurs to a driver. This can reduce the computation amount of the entire convolutional neural network, and the entire convolutional neural network model is small in size and low in computational complexity.
In addition, in the method of the present invention, head posture information is first used to make a preliminary judgment on whether attention is concentrated, and then head posture and eye information are used to further detect the driver's attention. Before the driver's attention is detected, four methods are adopted to process the original image to obtain four kinds of regions of interest, and the classification results are fused to analyze human behavior and intention. Therefore, the cascaded convolutional neural network of the present invention has good generalization performance and low computational complexity, and is suitable for an embedded device.

Claims (9)

What is claimed is:
1. An attention detection method based on a cascade neural network, comprising:
obtaining video data, recognizing a plurality of image frames, and extracting a face region of the plurality of image frames;
wherein,
recognizing the face region by using a first convolutional neural network to judge whether a first situation of inattention occurs; and
recognizing, if it is confirmed that no first situation of inattention occurs, the face region by using a second convolutional neural network to judge whether a second situation of inattention occurs, wherein
computational complexity of the first convolutional neural network is less than computational complexity of the second convolutional neural network.
2. The attention detection method based on a cascade neural network according to claim 1, wherein
the recognizing the face region by using a second convolutional neural network comprises: capturing a plurality of kinds of regions of interest from the face region, and judging whether a second situation of inattention occurs according to a result of recognition of two or more kinds of regions of interest.
3. The attention detection method based on a cascade neural network according to claim 2, wherein
the plurality of kinds of regions of interest comprise a face frame region and a face supplement region; and
the judging whether a second situation of inattention occurs according to a result of recognition of two or more kinds of regions of interest comprises: judging whether a second situation of inattention occurs according to a result of image recognition of the face frame region and the face supplement region.
4. The attention detection method based on a cascade neural network according to claim 2, wherein
the plurality of kinds of regions of interest comprise a face frame region and an eye region; and
the judging whether a second situation of inattention occurs according to a result of recognition of two or more kinds of regions of interest comprises: judging whether a second situation of inattention occurs according to a result of image recognition of the face frame region and the eye region.
5. The attention detection method based on a cascade neural network according to claim 1, wherein
the judging whether a first situation of inattention occurs comprises: recognizing the face region by using the first convolutional neural network, judging whether a rotation angle of the head in a preset direction is greater than a preset angle, and if so, confirming that the first situation of inattention occurs.
6. The attention detection method based on a cascade neural network according to claim 1, wherein
the second convolutional neural network comprises a first convolution layer, a depthwise convolution layer, a plurality of bottleneck residual layers, a second convolution layer, a linear global depthwise convolution layer, a linear convolution layer, a fully connected layer and a classification layer that are sequentially cascaded.
7. The attention detection method based on a cascade neural network according to claim 6, wherein
the bottleneck residual layer comprises a convolution unit and a depthwise convolution unit for receiving an output of the convolution unit, and is further provided with a residual unit, and the residual unit implements residual computation of the bottleneck residual layer when a step size of the convolution unit is 1.
8. The attention detection method based on a cascade neural network according to claim 1, wherein
the recognizing a plurality of image frames after obtaining the video data comprises: selecting one image frame from every preset number of consecutive image frames of the video data for recognition.
9. A computer apparatus, comprising a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, each step of the attention detection method based on a cascade neural network according to claim 1 is implemented.
US17/631,083 2019-07-30 2019-07-30 Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium Active 2041-06-10 US12299968B2 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/098407 WO2021016873A1 (en) 2019-07-30 2019-07-30 Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
US20220277558A1 US20220277558A1 (en) 2022-09-01
US12299968B2 true US12299968B2 (en) 2025-05-13

Family

ID=69088299

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/631,083 Active 2041-06-10 US12299968B2 (en) 2019-07-30 2019-07-30 Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium

Country Status (3)

Country Link
US (1) US12299968B2 (en)
CN (1) CN110678873A (en)
WO (1) WO2021016873A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12299968B2 (en) * 2019-07-30 2025-05-13 Allwinner Technology Co., Ltd. Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium
CN111310705A (en) * 2020-02-28 2020-06-19 深圳壹账通智能科技有限公司 Image recognition method and device, computer equipment and storage medium
CN111563468B (en) * 2020-05-13 2023-04-07 电子科技大学 Driver abnormal behavior detection method based on attention of neural network
CN111739027B (en) * 2020-07-24 2024-04-26 腾讯科技(深圳)有限公司 Image processing method, device, equipment and readable storage medium
US20230290134A1 (en) * 2020-09-25 2023-09-14 Intel Corporation Method and system of multiple facial attributes recognition using highly efficient neural networks
CN112580458B (en) * 2020-12-10 2023-06-20 中国地质大学(武汉) Facial expression recognition method, device, equipment and storage medium
CN113076884B (en) * 2021-04-08 2023-03-24 华南理工大学 Cross-mode eye state identification method from near infrared light to visible light
CN113408466A (en) * 2021-06-30 2021-09-17 东风越野车有限公司 Method and device for detecting bad driving behavior of vehicle driver
CN113869225A (en) * 2021-09-29 2021-12-31 深圳市优必选科技股份有限公司 A face detection method, device and electronic device
CN114112984B (en) * 2021-10-25 2022-09-20 上海布眼人工智能科技有限公司 Fabric fiber component qualitative method based on self-attention
CN114067440B (en) * 2022-01-13 2022-04-26 深圳佑驾创新科技有限公司 Pedestrian detection method, device, equipment and medium of cascade neural network model
CN114581438B (en) * 2022-04-15 2023-01-17 深圳市海清视讯科技有限公司 MRI image classification method, device, electronic device and storage medium
CN117197415B (en) * 2023-11-08 2024-01-30 四川泓宝润业工程技术有限公司 Method, device and storage medium for detecting target in inspection area of natural gas long-distance pipeline

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4967559B2 (en) 2006-09-19 2012-07-04 株式会社豊田中央研究所 Doze driving prevention device and program
CN108664947A (en) 2018-05-21 2018-10-16 五邑大学 A kind of fatigue driving method for early warning based on Expression Recognition
US20190065873A1 (en) * 2017-08-10 2019-02-28 Beijing Sensetime Technology Development Co., Ltd. Driving state monitoring methods and apparatuses, driver monitoring systems, and vehicles
CN109598174A (en) 2017-09-29 2019-04-09 厦门歌乐电子企业有限公司 The detection method and its device and system of driver status
US20190138268A1 (en) * 2017-11-08 2019-05-09 International Business Machines Corporation Sensor Fusion Service to Enhance Human Computer Interactions
CN109740477A (en) 2018-12-26 2019-05-10 联创汽车电子有限公司 Study in Driver Fatigue State Surveillance System and its fatigue detection method
US20190283762A1 (en) * 2010-06-07 2019-09-19 Affectiva, Inc. Vehicle manipulation using cognitive state engineering
US20210012128A1 (en) * 2019-03-18 2021-01-14 Beijing Sensetime Technology Development Co., Ltd. Driver attention monitoring method and apparatus and electronic device
US20220129664A1 (en) * 2020-10-27 2022-04-28 National Cheng Kung University Deepfake video detection system and method
US20220180110A1 (en) * 2020-12-03 2022-06-09 Shenzhen Horizon Robotics Technology Co., Ltd. Fatigue State Detection Method and Apparatus, Medium, and Electronic Device
US20220277558A1 (en) * 2019-07-30 2022-09-01 Allwinner Technology Co., Ltd. Cascaded Neural Network-Based Attention Detection Method, Computer Device, And Computer-Readable Storage Medium
US20220327845A1 (en) * 2021-04-09 2022-10-13 Stmicroelectronics S.R.L. Method of processing signals indicative of a level of attention of a human individual, corresponding system, vehicle and computer program product
US20230154207A1 (en) * 2020-06-10 2023-05-18 Nanjing University Of Science And Technology Driver fatigue detection method and system based on combining a pseudo-3d convolutional neural network and an attention mechanism

Also Published As

Publication number Publication date
CN110678873A (en) 2020-01-10
US20220277558A1 (en) 2022-09-01
WO2021016873A1 (en) 2021-02-04

Similar Documents

Publication Publication Date Title
US12299968B2 (en) Cascaded neural network-based attention detection method, computer device, and computer-readable storage medium
Wang et al. A survey on driver behavior analysis from in-vehicle cameras
Vora et al. Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis
CN109584507B (en) Driving behavior monitoring method, device, system, vehicle and storage medium
CN105354986B (en) Driver's driving condition supervision system and method
CN107533754B (en) Reducing Image Resolution in Deep Convolutional Networks
EP3910507B1 (en) Method and apparatus for waking up screen
CN110705392A (en) A face image detection method and device, and storage medium
EP4024270A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN113283338A (en) Method, device and equipment for identifying driving behavior of driver and readable storage medium
Ragab et al. A visual-based driver distraction recognition and detection using random forest
CN111860316B (en) Driving behavior recognition method, device and storage medium
CN117786520B (en) Training method and application of target perception model, unmanned vehicle and storage medium
CN114299473A (en) Driver behavior identification method based on multi-source information fusion
CN113537176A (en) Method, device and equipment for determining fatigue state of driver
Ai et al. Double attention convolutional neural network for driver action recognition
Du et al. A visual recognition method for the automatic detection of distracted driving behavior based on an attention mechanism
CN110837760A (en) Target detection method, training method and device for target detection
Thornton et al. Machine learning techniques for vehicle matching with non-overlapping visual features
CN119502686B (en) Emotion recognition-based head-up display adjustment method, system, equipment and medium
TW202326624A (en) Embedded deep learning multi-scale object detection model using real-time distant region locating device and method thereof
Yazici et al. System-on-chip based driver drowsiness detection and warning system
CN119888693A (en) Neural network model-based distraction driving behavior detection method and related device
Srivastava et al. Driver’s Face Detection in Poor Illumination for ADAS Applications
CN115496977B (en) Target detection method and device based on multi-mode sequence data fusion

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

AS Assignment

Owner name: ALLWINNER TECHNOLOGY CO., LTD., CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LI, XIAOHUI;PENG, GANG;NAN, NAN;AND OTHERS;REEL/FRAME:059498/0978

Effective date: 20220317

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STCF Information on status: patent grant

Free format text: PATENTED CASE