
WO2022141114A1 - Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium - Google Patents

Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium

Info

Publication number
WO2022141114A1
WO2022141114A1, PCT/CN2020/141074, CN2020141074W
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
user
target
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/141074
Other languages
English (en)
Chinese (zh)
Inventor
徐吉睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SZ DJI Technology Co Ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd filed Critical SZ DJI Technology Co Ltd
Priority to PCT/CN2020/141074 priority Critical patent/WO2022141114A1/fr
Publication of WO2022141114A1 publication Critical patent/WO2022141114A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • The present application relates to the technical field of intelligent driving, and in particular to a line-of-sight estimation method, device, vehicle, and computer-readable storage medium.
  • Gaze is a behavior that reflects the user's attention.
  • the present application provides a line-of-sight estimation method, apparatus, device, and computer-readable storage medium to solve the technical problem in the related art that the user's line-of-sight direction cannot be accurately estimated.
  • a line-of-sight estimation method comprising:
  • acquiring at least one frame of an image to be recognized, where the image to be recognized includes an image obtained by capturing a user;
  • estimating, by using a first neural network, a target area corresponding to the image to be recognized, where the target area is at least one of a plurality of set areas and represents the area that the user's head is facing in the image to be recognized;
  • acquiring a second neural network corresponding to the target area, and estimating the user's target gaze direction information by using the second neural network.
  • a line-of-sight estimation method comprising:
  • acquiring at least one frame of an image to be recognized, where the image to be recognized includes an image obtained by capturing a user;
  • estimating the user's target line-of-sight direction information by using a neural network, where the neural network is used to extract at least one type of target feature representing the orientation of the user's head in the image to be recognized, fuse the extracted target features, and estimate the user's target gaze direction information according to the feature fusion result.
  • a line-of-sight estimation device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, where the processor implements the following steps when executing the computer program:
  • acquiring at least one frame of an image to be recognized, where the image to be recognized includes an image obtained by capturing a user;
  • estimating, by using a first neural network, a target area corresponding to the image to be recognized, where the target area is at least one of a plurality of set areas and represents the area that the user's head is facing in the image to be recognized;
  • acquiring a second neural network corresponding to the target area, and estimating the user's target gaze direction information by using the second neural network.
  • a line-of-sight estimation device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, where the processor implements the following steps when executing the computer program:
  • acquiring at least one frame of an image to be recognized, where the image to be recognized includes an image obtained by capturing a user;
  • estimating the user's target line-of-sight direction information by using a neural network, where the neural network is used to extract at least one type of target feature representing the orientation of the user's head in the image to be recognized, fuse the extracted target features, and estimate the user's target gaze direction information according to the feature fusion result.
  • a vehicle comprising the device of the aforementioned third aspect or the device of the aforementioned fourth aspect.
  • a computer-readable storage medium having several computer instructions stored thereon, where the line-of-sight estimation method described in the foregoing first aspect is implemented when the computer instructions are executed.
  • a computer-readable storage medium having several computer instructions stored thereon, where the line-of-sight estimation method described in the foregoing second aspect is implemented when the computer instructions are executed.
  • In the above scheme, the first neural network first identifies, from a large range, the target area that the user is facing, and the second neural network corresponding to the target area is then used to identify the target line-of-sight direction information of the user's line of sight within the target area. Since the first neural network is first used to narrow the estimation range to the target area, and the second neural network adapted to the target area is then used to estimate the target line-of-sight direction information within the target area, the estimation results can achieve relatively high accuracy.
  • FIG. 1A is a schematic diagram of a line-of-sight estimation method according to an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.
  • FIG. 1C is a schematic structural diagram of a first neural network/second neural network according to an embodiment of the present application.
  • FIG. 1D is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.
  • FIG. 1E is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.
  • FIG. 2 is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an apparatus for implementing the line-of-sight estimation method of this embodiment in the present application.
  • FIG. 4 is a schematic diagram of a vehicle according to an embodiment of the present application.
  • One goal of line-of-sight estimation is to estimate line-of-sight direction information, that is, the direction of the user's line of sight; when the user pays attention to a certain position, the user's eyes are directed toward that position, thereby generating the corresponding line-of-sight direction information.
  • That direction corresponds to the position the user is focusing on.
  • When the user pays attention to another position, the user's line of sight correspondingly faces that position, and the line-of-sight direction information corresponding to it is generated.
  • The user's eyes cover a large range of sight: in a space, the user's eyes can be directed to any position.
  • The target to be estimated (that is, the line-of-sight direction information corresponding to the specific position the user's eyes are facing) covers a very small range relative to the entire range covered by the line of sight, so it is difficult for a machine model to estimate the user's line-of-sight direction information from such a large coverage area, and the result of line-of-sight estimation is difficult to make highly accurate.
  • the inventor of the present application also found that the existing line-of-sight estimation methods are rarely optimized for intelligent driving scenarios.
  • the traditional line-of-sight estimation method is based on image processing technology, which has poor robustness and cannot adapt to changes in illumination;
  • Some methods estimate sight lines through neural networks, but the shortcomings of the existing technologies are: failure to fully utilize facial information and state; failure to consider timing information; and failure to combine with intelligent driving scenarios, so that the accuracy of line-of-sight estimation cannot be improved by certain prior knowledge.
  • The above deficiencies mean that line-of-sight estimation cannot accurately estimate the driver's line of sight in an intelligent driving scenario, and it is difficult to make a correct judgment on whether the driver is driving while distracted.
  • To this end, the embodiment of the present application proposes a line-of-sight estimation scheme that pre-sets multiple set areas and designs a two-stage estimation process using two types of neural networks with different tasks: the first neural network first identifies the target area that the user is facing, and the second neural network corresponding to the target area is then used to identify the target line-of-sight direction information of the user's line of sight within the target area. Since the first neural network is first used to narrow the estimation range to the target area, and the second neural network adapted to the target area is then used to estimate the target line-of-sight direction information within the target area, the estimation results can achieve relatively high accuracy.
  • FIG. 1A is a flowchart of a line-of-sight estimation method according to an exemplary embodiment of the present application, including the following steps:
  • In step 102, at least one frame of an image to be recognized is acquired, where the image to be recognized includes an image obtained by capturing a user;
  • In step 104, a first neural network is used to estimate a target area corresponding to the image to be recognized, where the target area is at least one of a plurality of set areas and represents the area that the user's head is facing in the image to be recognized;
  • In step 106, a second neural network corresponding to the target area is acquired, and the target gaze direction information of the user is estimated by using the second neural network.
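  • To make the two-stage flow of steps 102-106 concrete, the following is a minimal Python sketch of the inference pipeline. The names zone_net and gaze_nets, and the use of PyTorch, are illustrative assumptions rather than part of the disclosed method; both models are assumed to be already trained.

    import torch

    # Hypothetical pre-trained models (illustrative names, not from this application):
    # zone_net  -- the first neural network, a gaze-zone classifier
    # gaze_nets -- dict mapping each set-area index to its second neural network
    def estimate_gaze(frame, zone_net, gaze_nets):
        """Two-stage line-of-sight estimation for one image to be recognized.

        frame: preprocessed image tensor of shape (1, 3, H, W).
        Returns the target area index and the (pitch, yaw) gaze angles.
        """
        with torch.no_grad():
            # Stage 1: coarse estimation -- which set area is the head facing?
            zone_logits = zone_net(frame)              # shape (1, num_zones)
            target_zone = int(zone_logits.argmax(dim=1))

            # Stage 2: fine estimation with the network adapted to that zone.
            pitch_yaw = gaze_nets[target_zone](frame)  # shape (1, 2)
        return target_zone, pitch_yaw.squeeze(0)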
  • The description will be made with reference to the schematic diagram of another line-of-sight estimation method shown in FIG. 1B.
  • In this embodiment, it is considered that the user's line of sight covers a large area in an actual scene, and the goal of line-of-sight estimation is to accurately identify the line-of-sight direction information when the user's eyes look at a specific position within the entire covered area.
  • This embodiment therefore pre-configures multiple set areas, which are the targets of line-of-sight estimation.
  • These set areas can be configured flexibly according to the needs of the actual application scenario.
  • The set area i, the set area j, and the set area k are taken as examples for illustration, which is not limited in this embodiment.
  • One application scenario of line of sight estimation is an assisted driving scenario, in which the result of line of sight estimation can be used to detect whether the driver is distracted driving.
  • the multiple set areas may include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area.
  • Other application scenarios of gaze estimation include game scenarios, such as using the results of gaze estimation for game interaction; medical scenarios, such as using eye trackers to estimate the user's gaze direction; and offline retail scenarios, such as using the line-of-sight estimation result to predict the user's degree of attention to items for sale.
  • In any of these scenarios, the user's line-of-sight coverage can be divided into multiple set areas as required, which is not limited in this embodiment.
  • the image to be recognized refers to an image obtained by capturing an image of a user.
  • The image may be obtained by capturing the user with a camera device.
  • Each set area may correspond to one second neural network; in other examples, the set areas and the second neural networks may also have a many-to-one relationship, that is, one second neural network may be applicable to two or more set areas.
  • FIG. 1B shows the set areas and the second neural networks in a one-to-one manner; this may be configured as required in practical applications, which is not limited in this embodiment.
  • The areas that the user's head faces in the training images cover the multiple set areas, so that the trained first neural network can accurately identify the area that the user's head is facing in an image; for example, if the set areas include setting area A, setting area B, and setting area C, the training data of the first neural network includes training images in which the user's line-of-sight direction belongs to setting area A, setting area B, and setting area C respectively.
  • On this basis, the first neural network can be trained, and the trained first neural network is used to estimate the target area corresponding to the image to be recognized.
  • In this way, the estimation of line-of-sight direction information is narrowed from a large range to the range of the target area. Since the first neural network only needs to identify a small area within a large range, without accurately identifying specific line-of-sight direction information, the output result of the first neural network has high reliability.
  • For each set area, the training image set used to train the corresponding second neural network uses images in which the area the user's head is facing is that set area; for example, if the set areas include setting area A, setting area B, and setting area C, then the training data of the second neural network corresponding to setting area A consists of training images in which the user's line-of-sight direction information belongs to setting area A, the training data of the second neural network corresponding to setting area B consists of training images in which the user's line-of-sight direction information belongs to setting area B, and the training data of the second neural network corresponding to setting area C consists of training images in which the user's line-of-sight direction information belongs to setting area C.
  • In this way, the second neural network corresponding to a set area focuses on the line-of-sight features of user line-of-sight directions within that set area, so that it can accurately identify the line-of-sight direction information.
  • During estimation, the second neural network is adapted to the target area and identifies specific line-of-sight direction information within the smaller target area, so the output of the second neural network also has high reliability.
  • As a result, the line-of-sight estimation results have higher accuracy.
  • the training process can be supervised.
  • The aforementioned training images can be calibrated with user gaze direction information; the calibrated user gaze direction information corresponds to one of the multiple set areas, and each training image is also calibrated with its corresponding set area.
  • The training images may be virtual datasets, i.e., images obtained in a simulated manner, which makes it efficient to obtain training images.
  • The training images can also be real datasets: real-world disturbances are mixed into the images during actual collection, and these disturbances give the neural network better performance during training, so that it can output more accurate estimation results when handling the actual environment in the estimation stage.
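  • As an illustration of the supervised training described above, the following is a minimal sketch of a training loop for the first neural network, assuming a PyTorch data loader that yields images paired with calibrated set-area labels; the function and parameter names are hypothetical.

    import torch
    import torch.nn as nn

    def train_zone_net(zone_net, loader, epochs=10, lr=1e-4):
        """Minimal supervised training loop for the first neural network.

        loader is assumed to yield (images, zone_labels) batches, where each
        zone label is the set-area index calibrated for that training image.
        """
        optimizer = torch.optim.Adam(zone_net.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()  # zone estimation as classification
        zone_net.train()
        for _ in range(epochs):
            for images, zone_labels in loader:
                optimizer.zero_grad()
                loss = criterion(zone_net(images), zone_labels)
                loss.backward()
                optimizer.step()
        return zone_net

  • Each second neural network could be trained analogously on the subset of training images belonging to its set area, with a regression loss (for example, mean squared error on the pitch and yaw angles) in place of the classification loss.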
  • the input data of the first neural network and the second neural network are images to be recognized.
  • The first neural network is used to estimate the target area corresponding to the image to be recognized, and its estimation process may be: extracting image features of the image to be recognized, and using the image features to estimate the target area corresponding to the image to be recognized.
  • The second neural network is used to estimate the line-of-sight direction information of the image to be recognized, and its estimation process may be: extracting image features of the image to be recognized, and using the image features to estimate the user's line-of-sight direction information.
  • the specific network structure design may be flexibly configured as required.
  • The estimation processes of the first neural network and the second neural network are the same overall, but their output results are different.
  • In some examples, the structure of the first neural network is the same as that of the second neural network; that is, the two can adopt the same network structure design.
  • However, the parameters of the first neural network obtained by training are different from the parameters of the second neural network.
  • Since the input of the first neural network and the second neural network is an image, and a convolutional neural network is well suited to image processing, the first neural network can be a convolutional neural network, and the second neural network can also be a convolutional neural network.
  • other forms of the first neural network and the second neural network can also be flexibly designed in practical applications, which are not limited in this embodiment.
  • Based on the overall consistent estimation process of the first neural network and the second neural network, the first neural network/second neural network identifies image features from the input image to be recognized and uses the image features to estimate the target area/target line-of-sight direction information corresponding to the image to be recognized. Considering the goal of line-of-sight estimation, in order to enable the first neural network and the second neural network to output accurate results, the first neural network/second neural network is also innovatively designed in this embodiment:
  • Considering the target of gaze estimation, the image features include any of the following: face region features, eye region features, or head pose features.
  • the face region feature, eye region feature or head pose feature are correspondingly extracted from the face region, eye region or head region segmented from the to-be-recognized image.
  • the eye region features include any one of the following: binocular features, left eye features or right eye features.
  • the left-eye feature and the right-eye feature are extracted by using the Siamese network.
  • The Siamese network is used to perform dilated convolution operations on the left eye region and the right eye region respectively, so as to enlarge the network's receptive field over the image.
  • FIG. 1C is a schematic diagram of the first neural network/second neural network structure shown in this embodiment; the network structure in FIG. 1C is applicable to both the first neural network and the second neural network.
  • INPUT is the image to be recognized.
  • The network structure includes an image segmentation layer (face detection), which divides the image to be recognized into three parts: the face area, the eye area (landmark), and the head area (headpose).
  • For the face area, facial key points can be used, such as 68 facial key points; features of the entire face can also be extracted through the face region feature extraction layer (CNN).
  • For the eye area, the binocular region, the left eye region, and the right eye region can be segmented through the crop operation.
  • The binocular features can be extracted through the binocular feature extraction layer.
  • The twin (Siamese) network layer can be used to extract the left eye feature and the right eye feature from the left eye area and the right eye area.
  • The twin network shares parameters between the left eye area and the right eye area, and extracts features from their differences and connections; it can also perform dilated convolution operations on the left eye region and the right eye region respectively, so as to enlarge the network's receptive field over the image, thereby extracting accurate left-eye and right-eye features.
  • For the head area, head pose features can be identified.
  • All the extracted features can be fused through the fully connected layer (FC), and the estimation result is then output; for the first neural network, the output estimation result is the target area (gaze zone); for the second neural network, the output estimation result is the target line-of-sight direction information, specifically the angle information of the line-of-sight direction (gaze pitch & yaw).
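  • The following is a minimal PyTorch sketch of a multi-branch backbone in the spirit of FIG. 1C. All layer sizes are illustrative assumptions, and the separate binocular-feature branch is omitted for brevity.

    import torch
    import torch.nn as nn

    class GazeBackbone(nn.Module):
        """Sketch of the FIG. 1C structure: a face branch, a weight-sharing
        (Siamese) eye branch with dilated convolutions, and a head-pose
        branch, fused by a fully connected layer."""

        def __init__(self, out_dim):
            super().__init__()
            self.face_cnn = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # One eye branch applied to both eyes -> shared (Siamese) weights;
            # dilation=2 enlarges the receptive field over each eye patch.
            self.eye_cnn = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=2, dilation=2), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=2, dilation=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head_pose_fc = nn.Linear(3, 16)  # e.g. pitch/yaw/roll input
            self.fusion = nn.Linear(32 + 32 + 32 + 16, out_dim)

        def forward(self, face, left_eye, right_eye, head_pose):
            feats = torch.cat([
                self.face_cnn(face),
                self.eye_cnn(left_eye),   # same weights for both eyes
                self.eye_cnn(right_eye),
                self.head_pose_fc(head_pose)], dim=1)
            return self.fusion(feats)

  • With out_dim set to the number of set areas, such a backbone plays the role of the first neural network (outputting the gaze zone); with out_dim = 2, it plays the role of the second neural network (outputting gaze pitch & yaw).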
  • the first neural network can estimate the target area corresponding to the image to be recognized by recognizing the user's gaze direction information in the image to be recognized.
  • the first neural network can also be implemented in other ways, for example, it can be a neural network with other structures.
  • A neural network with a simple structure can also be used to recognize the user's head pose information from the input image to be recognized, and
  • the head pose information is used to estimate the target area. This method does not need to consider other information such as eyes and key points.
  • Because a neural network with a relatively simple structure is used, the implementation cost can be reduced.
  • The number of frames of the image to be recognized may be one: after that frame is input to the first neural network, the first neural network identifies the target area corresponding to the image to be recognized; the second neural network corresponding to the target area is then acquired, the frame is input to the second neural network, and the second neural network is used to estimate the user's target line-of-sight direction information.
  • There may also be more than one frame of the image to be recognized, and line-of-sight estimation can be performed on multiple frames of images to be recognized.
  • The specific number of frames can be flexibly configured as required; for example, a fixed number of frames can be set, such as 3 frames, 5 frames, etc.
  • The number of image frames for each line-of-sight estimation can also be changed dynamically, or determined by setting a time interval for line-of-sight estimation, in which case one or more frames of images captured within that time period are used for estimation.
  • the multiple frames of images to be identified include: multiple frames of images photographed by the same photographing device on the user at different times; or multiple frames of images photographed by different photographing devices on the user at different locations.
  • The line-of-sight estimation scheme for multiple frames of images to be recognized can be implemented in various ways.
  • FIG. 1D shows the processing of multiple frames of images to be recognized.
  • The n-1th frame, the nth frame, and the n+1th frame are taken as examples to illustrate one embodiment of line-of-sight estimation.
  • For each frame of the image to be recognized, the corresponding target area can be identified by the first neural network, and each frame of the image to be recognized is then input to the corresponding second neural network.
  • In this embodiment, using the second neural network to estimate the user's target gaze direction information includes: acquiring the user's line-of-sight direction information in each frame of the image to be recognized, where the line-of-sight direction information in each frame is estimated from that frame by the second neural network corresponding to the target area of that frame; and estimating the user's target gaze direction information after fusing the line-of-sight direction information corresponding to each frame of the image to be recognized. A simple sketch of one possible fusion is given below.
  • the multiple frames of images to be identified shown in the schematic diagram of FIG. 1D all correspond to the same target area. In practical applications, there may also be situations where multiple frames of images to be identified correspond to different target areas, which is not limited in this embodiment.
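  • As an illustration of the decision-level fusion just described, the following is a minimal sketch; the embodiment does not fix the fusion operator, so a plain mean over frames is used here as one simple, assumed choice.

    import torch

    def fuse_gaze_directions(per_frame_angles):
        """Decision-level fusion of per-frame line-of-sight estimates.

        per_frame_angles: list of (pitch, yaw) tensors, one per frame of the
        image to be recognized, each produced by the second neural network.
        """
        stacked = torch.stack(per_frame_angles)  # shape (num_frames, 2)
        return stacked.mean(dim=0)               # fused (pitch, yaw)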
  • In another embodiment, using the second neural network to estimate the user's target line-of-sight direction information includes: acquiring the image features of each frame of the image to be recognized, where the image features of each frame are extracted from that frame by the second neural network corresponding to the target area of that frame.
  • Different from the previous embodiment, this embodiment estimates the user's target line-of-sight direction information after fusing the image features of each frame of the image to be recognized.
  • For example, a third neural network may be used to fuse the image features of each frame of the image to be recognized and then estimate the user's target line-of-sight direction information.
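  • Since a later example notes that the third neural network can be a recurrent neural network, the following is a minimal sketch of such feature-level fusion under that assumption; the class name and the feature/hidden sizes are illustrative.

    import torch
    import torch.nn as nn

    class FeatureFusionHead(nn.Module):
        """Sketch of feature-level fusion: per-frame image features (for
        example, of the n-1th, nth and n+1th frames) are fused by a
        recurrent third neural network -- a GRU here -- which then
        regresses the target pitch & yaw."""

        def __init__(self, feat_dim=112, hidden_dim=64):
            super().__init__()
            self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, 2)  # pitch, yaw

        def forward(self, frame_features):
            # frame_features: (batch, num_frames, feat_dim)
            _, last_hidden = self.gru(frame_features)
            return self.out(last_hidden.squeeze(0))  # (batch, 2)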
  • the line-of-sight estimation solution in this embodiment can be applied to various business scenarios, such as assisted driving, games, VR (Virtual Reality, virtual reality), or medical treatment.
  • Taking the assisted driving scene as an example, the image to be recognized may include an image taken of the driver in the vehicle cabin; based on the assisted driving scene, a plurality of set areas may be divided according to the driver's line-of-sight coverage in the vehicle cabin.
  • The plurality of set areas may include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area; of course, in practical applications, other forms of set areas may be divided as required, which is not limited in this embodiment.
  • FIG. 1E shows an embodiment of the line-of-sight estimation method in the assisted driving scene.
  • The image to be recognized (INPUT) is input into the first neural network CNN, and the first neural network CNN outputs the target area (gaze zone) corresponding to the image to be recognized.
  • The set areas divided in this embodiment are shown in FIG. 1E, and each set area corresponds to a second neural network, namely CNN-1, CNN-2 through CNN-7 in FIG. 1E.
  • After the target area (gaze zone) corresponding to the image to be recognized is output, the corresponding second neural network is acquired, the image to be recognized is input to that second neural network, and the second neural network outputs the gaze direction information (gaze).
  • the estimated target line-of-sight direction information may be used to assess whether the driver is distracted driving.
  • For example, when it is determined from the target line-of-sight direction information that the user's line-of-sight direction deviates from the set direction, prompt information is output to prompt the driver.
  • the output mode of the prompt information includes any of the following modes: audio, text, video or lighting.
  • the prompt message can be played through audio, or the prompt message can be output in the form of text or video through the electronic screen in the car, or the light module in the car can be controlled to output the prompt message in a specific light color.
  • A combination of at least two of the above ways can also be used to output the prompt information.
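  • As an illustration of this deviation check, the following is a minimal sketch; the set direction, the 15-degree threshold, and the printed audio-style prompt are illustrative assumptions, not values taken from the embodiment.

    def check_distraction(pitch, yaw, set_pitch=0.0, set_yaw=0.0,
                          threshold_deg=15.0):
        """Assumed deviation check: compare the estimated gaze angles (in
        degrees) against a set direction and output prompt information when
        the deviation exceeds a threshold."""
        deviation = max(abs(pitch - set_pitch), abs(yaw - set_yaw))
        if deviation > threshold_deg:
            # Any of the disclosed output modes could be used here:
            # audio, text, video, or lighting.
            print("Prompt: please keep your eyes on the road")
            return True
        return False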
  • In summary, the first neural network is used to estimate the target area corresponding to the image to be recognized, where the target area is at least one of a plurality of set areas and represents the area that the user's head is facing in the image to be recognized.
  • Each set area corresponds to a second neural network, and the second neural network is used to estimate the user's target line-of-sight direction information, so that, on the basis of the first neural network identifying the current area, the second neural network performs a more detailed and accurate line-of-sight estimation.
  • the embodiment of this specification also provides another line-of-sight estimation method.
  • The method of this embodiment uses a single neural network, and this embodiment also innovates on the single neural network, as shown in FIG. 2.
  • In step 202, at least one frame of an image to be recognized is acquired.
  • The image to be recognized includes an image obtained by capturing a user.
  • In step 204, the user's target gaze direction information is estimated by using a neural network.
  • The neural network is used to extract at least one type of target feature representing the orientation of the user's head in the image to be recognized, fuse the extracted target features, and estimate the user's target line-of-sight direction information according to the feature fusion result.
  • In this way, the neural network extracts at least one type of target feature representing the orientation of the user's head in the image to be recognized and fuses the extracted target features, so that the user's target gaze direction information can be accurately estimated according to the feature fusion result.
  • the target features include any of the following: face region features, eye region features, or head pose features.
  • the face region feature, eye region feature or head pose feature are correspondingly extracted from the face region, eye region or head region segmented from the to-be-recognized image.
  • the eye region features include any one of the following: binocular features, left eye features or right eye features.
  • the left-eye feature and the right-eye feature are extracted by using the Siamese network.
  • the Siamese network is used to perform dilated convolution operations on the left eye region and the right eye region respectively, so as to enlarge the network's receptive field over the image.
  • the target line-of-sight direction information of the user includes: a target area, where the target area represents an area in the to-be-recognized image to which the user's head faces.
  • the training data of the neural network uses images in which the area that the user's head is facing is the target area.
  • the target gaze direction information of the user includes: angle information of the user's target gaze direction.
  • the to-be-identified image has multiple frames
  • using the neural network to estimate the user's target gaze direction information includes:
  • acquiring the user's line-of-sight direction information in each frame of the image to be recognized, where the line-of-sight direction information in each frame is estimated by the neural network from that frame;
  • fusing the line-of-sight direction information corresponding to each frame of the image to be recognized to estimate the user's target line-of-sight direction information.
  • the to-be-recognized image has multiple frames;
  • the neural network includes an initial sub-neural network and a fusion sub-neural network;
  • using the neural network to estimate the user's target gaze direction information includes:
  • for each frame of the image to be recognized, using the initial sub-neural network to obtain the feature fusion result of that frame, where the initial sub-neural network is used to extract at least one type of target feature representing the orientation of the user's head in the image to be recognized and then fuse the extracted target features;
  • using the fusion sub-neural network to fuse the feature fusion results of each frame of the image to be recognized and estimate the user's target line-of-sight direction information.
  • the neural network is a convolutional neural network.
  • the initial sub-neural network is a convolutional neural network.
  • the fusion sub-neural network is a recurrent neural network.
  • the image to be identified includes an image taken of a driver in the vehicle.
  • the target gaze direction information is used to assess whether the driver is distracted driving.
  • the plurality of setting areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area .
  • the method further includes:
  • the target gaze direction information it is determined that the user's gaze direction deviates from the set direction, and prompt information is output.
  • the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.
  • the foregoing method embodiments may be implemented by software, and may also be implemented by hardware or a combination of software and hardware.
  • Taking software implementation as an example, the apparatus, as a device in a logical sense, is formed by the processor of the line-of-sight estimation device where it is located reading corresponding computer program instructions from a non-volatile memory into memory for execution.
  • FIG. 3 is a hardware structure diagram of the line-of-sight estimation apparatus 300 in this embodiment; in addition to the processor 301 and the memory 302 shown in FIG. 3, the apparatus may also include other hardware according to its actual function, which will not be repeated here.
  • the line-of-sight estimation apparatus 300 includes a processor, a memory, and a computer program stored on the memory and executable by the processor, and the processor implements the following steps when executing the computer program:
  • acquiring at least one frame of an image to be recognized, where the image to be recognized includes an image obtained by capturing a user;
  • estimating, by using a first neural network, a target area corresponding to the image to be recognized, where the target area is at least one of a plurality of set areas and represents the area that the user's head is facing in the image to be recognized;
  • acquiring a second neural network corresponding to the target area, and estimating the user's target gaze direction information by using the second neural network.
  • For each set area, the training image set used to train the corresponding second neural network uses images in which the area that the user's head is facing is that set area.
  • the area to which the user's head faces in each training image covers the multiple set areas.
  • the to-be-identified image has multiple frames
  • the user's gaze direction information in each frame of the image to be recognized is estimated from that frame by the second neural network corresponding to the target area of that frame;
  • the line-of-sight direction information corresponding to each frame of the image to be recognized is fused to estimate the user's target line-of-sight direction information.
  • the second neural network is used for: extracting image features of the to-be-recognized image, and estimating the user's gaze direction information by using the image features.
  • the to-be-identified image has multiple frames
  • the image feature of each frame of the image to be recognized is extracted from the image to be recognized by the second neural network corresponding to the target area corresponding to the image to be recognized in the frame;
  • the target gaze direction information of the user is estimated after the image features of each frame of the image to be recognized are fused.
  • when estimating the user's target gaze direction information after fusing the image features of each frame of the image to be recognized, the processor implements the following step:
  • using a third neural network to fuse the image features of each frame of the image to be recognized and estimate the user's target line-of-sight direction information.
  • the first neural network is used to: extract image features of the to-be-recognized image, and use the image features to estimate a target area corresponding to the to-be-recognized image.
  • the image features include any of the following: face region features, eye region features, or head pose features.
  • the face region feature, eye region feature or head pose feature are correspondingly extracted from the face region, eye region or head region segmented from the to-be-recognized image.
  • the eye region features include any one of the following: binocular features, left eye features or right eye features.
  • the left-eye feature and the right-eye feature are extracted by using the Siamese network.
  • the Siamese network is used to perform atrous convolution operations on the left eye region and the right eye region respectively, so as to enhance the network's receptive field of the image.
  • the multiple frames of images to be identified include: multiple frames of images captured by the same photographing device on the user at different times; or multiple frames of images photographed by different photographing devices on the user at different locations.
  • the target area is estimated by the first neural network by recognizing the user's gaze direction information and/or the user's head posture information in the to-be-recognized image.
  • the structure of the first neural network is the same as the structure of the second neural network.
  • the parameters of the first neural network are different from the parameters of the second neural network.
  • the first neural network is a convolutional neural network and/or the second neural network is a convolutional neural network.
  • the third neural network is a recurrent neural network.
  • the target area is one of a plurality of set areas.
  • the to-be-identified image includes an image taken of a driver in a vehicle cabin.
  • the target gaze direction information is used to assess whether the driver is distracted driving.
  • the plurality of set areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area .
  • the processor further implements the following steps when executing the computer program:
  • the target gaze direction information it is determined that the user's gaze direction deviates from the set direction, and prompt information is output.
  • the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.
  • This embodiment also provides another apparatus for line-of-sight estimation.
  • the structure of the apparatus may be as shown in FIG. 3 .
  • the apparatus includes a processor, a memory, and a computer program stored in the memory and executable by the processor.
  • the processor implements the following steps when executing the computer program:
  • acquiring at least one frame of an image to be recognized, where the image to be recognized includes an image obtained by capturing a user;
  • estimating the user's target line-of-sight direction information by using a neural network, where the neural network is used to extract at least one type of target feature representing the orientation of the user's head in the image to be recognized, fuse the extracted target features, and estimate the user's target gaze direction information according to the feature fusion result.
  • the target features include any of the following: face region features, eye region features, or head pose features.
  • the face region feature, eye region feature or head pose feature are correspondingly extracted from the face region, eye region or head region segmented from the to-be-recognized image.
  • the eye region features include any one of the following: binocular features, left eye features or right eye features.
  • the left-eye feature and the right-eye feature are extracted by using the Siamese network.
  • the Siamese network is used to perform dilated convolution operations on the left eye region and the right eye region respectively, so as to enlarge the network's receptive field over the image.
  • the target line-of-sight direction information of the user includes: a target area, where the target area represents an area in the to-be-recognized image to which the user's head faces.
  • the training data of the neural network uses images in which the area that the user's head is facing is the target area.
  • the target gaze direction information of the user includes: angle information of the user's target gaze direction.
  • the to-be-identified image has multiple frames
  • the user's line-of-sight direction information in each frame of the image to be recognized is estimated by the neural network from that frame;
  • the line-of-sight direction information corresponding to each frame of the image to be recognized is fused to estimate the user's target line-of-sight direction information.
  • the to-be-recognized image has multiple frames;
  • the neural network includes an initial sub-neural network and a fusion sub-neural network;
  • For each frame of the image to be recognized, the initial sub-neural network is used to obtain the feature fusion result of that frame; the initial sub-neural network is used to extract at least one type of target feature representing the orientation of the user's head in the image to be recognized and then fuse the extracted target features;
  • the feature fusion results of each frame of the image to be recognized are fused to estimate the user's target line-of-sight direction information.
  • the neural network is a convolutional neural network.
  • the initial sub-neural network is a convolutional neural network.
  • the fusion sub-neural network is a recurrent neural network.
  • the image to be identified includes an image taken of a driver in the vehicle.
  • the target gaze direction information is used to assess whether the driver is distracted driving.
  • the plurality of set areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area .
  • the processor further implements the following steps when executing the computer program:
  • the target gaze direction information it is determined that the user's gaze direction deviates from the set direction, and prompt information is output.
  • the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.
  • an embodiment of the application further provides a vehicle 400 , and the vehicle may include any of the aforementioned line-of-sight estimation devices.
  • This embodiment further provides a computer-readable storage medium, where several computer instructions are stored thereon, and when the computer instructions are executed, the steps of the line-of-sight estimation method shown in FIG. 1A are implemented.
  • This embodiment also provides a computer-readable storage medium, characterized in that, the readable storage medium stores several computer instructions, and when the computer instructions are executed, the steps of the line-of-sight estimation method shown in FIG. 2 are implemented.
  • Embodiments of the present specification may take the form of a computer program product embodied on one or more storage media having program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
  • Computer-usable storage media includes permanent and non-permanent, removable and non-removable media, and storage of information can be accomplished by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridges, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Eye Examination Apparatus (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a line-of-sight estimation method and apparatus, a vehicle, and a computer-readable storage medium. In embodiments of the present application, a plurality of set areas are predefined, and a two-stage estimation process is designed using two types of neural networks with different tasks: a first neural network first identifies, from a large range, a target area that a user is facing, and a second neural network corresponding to the target area is then used to identify target line-of-sight direction information of the user's line of sight within the target area. Since the first neural network is first used to narrow the estimation range to the target area, and the second neural network adapted to the target area is then used to estimate the target line-of-sight direction information within the target area, the estimation result can achieve relatively high accuracy.
PCT/CN2020/141074 2020-12-29 2020-12-29 Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium Ceased WO2022141114A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141074 WO2022141114A1 (fr) 2020-12-29 2020-12-29 Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141074 WO2022141114A1 (fr) 2020-12-29 2020-12-29 Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022141114A1 true WO2022141114A1 (fr) 2022-07-07

Family

ID=82259932

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141074 Ceased WO2022141114A1 (fr) 2020-12-29 2020-12-29 Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2022141114A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370580A1 (en) * 2017-03-14 2019-12-05 Omron Corporation Driver monitoring apparatus, driver monitoring method, learning apparatus, and learning method
CN110765807A (zh) * 2018-07-25 2020-02-07 阿里巴巴集团控股有限公司 驾驶行为分析、处理方法、装置、设备和存储介质
CN111723828A (zh) * 2019-03-18 2020-09-29 北京市商汤科技开发有限公司 注视区域检测方法、装置及电子设备
CN111325736A (zh) * 2020-02-27 2020-06-23 成都航空职业技术学院 一种基于人眼差分图像的视线角度估计方法
CN111680546A (zh) * 2020-04-26 2020-09-18 北京三快在线科技有限公司 注意力检测方法、装置、电子设备及存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424318A (zh) * 2022-08-09 2022-12-02 华为技术有限公司 一种图像识别方法及设备
CN115376114A (zh) * 2022-09-05 2022-11-22 润芯微科技(江苏)有限公司 一种汽车摄像的图像多模态取景方法及系统
CN115376114B (zh) * 2022-09-05 2023-06-30 润芯微科技(江苏)有限公司 一种汽车摄像的图像多模态取景方法及系统
CN119851342A (zh) * 2024-12-17 2025-04-18 科大讯飞股份有限公司 行为识别方法、装置及车辆

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967462

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967462

Country of ref document: EP

Kind code of ref document: A1