WO2022141114A1

WO2022141114A1 - Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium

Info

Publication number: WO2022141114A1
Application number: PCT/CN2020/141074
Authority: WO
Inventors: 徐吉睿
Original assignee: SZ DJI Technology Co Ltd
Current assignee: SZ DJI Technology Co Ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2022-07-07
Anticipated expiration: 2023-06-29

Abstract

The present application provides a line-of-sight estimation method and apparatus, a vehicle, and a computer-readable storage medium. In embodiments of the present application, because a plurality of set areas are preset, and a process of secondary estimation is designed using two types of neural networks having different tasks, a first neural network first recognizes a target area that a user is facing from a large range, and then a second neural network corresponding to the target area is utilized to recognize target line-of-sight direction information of the user's line of sight in the target area. Hence, because the first neural network is first utilized to narrow the estimation range into the target area, and then the second neural network adapted to the target area is utilized to estimate the target line-of-sight direction information in the target area, the estimation result can have higher accuracy.

Description

Line-of-sight estimation method, apparatus, vehicle, and computer-readable storage medium

technical field

本申请涉及智能驾驶技术领域，具体而言，涉及一种视线估计方法、装置、车辆、及计算机可读存储介质。The present application relates to the technical field of intelligent driving, and in particular, to a line of sight estimation method, device, vehicle, and computer-readable storage medium.

Background technique

视线作为反映用户注意力的行为，在先技术中很少有将视线估计方法应用在智能驾驶场景中并进行针对性优化的方法，如何准确地估计用户的视线方向是亟待解决的技术问题。Gaze is a behavior that reflects the user's attention. In the prior art, there are few methods of applying the gaze estimation method to intelligent driving scenarios and performing targeted optimization. How to accurately estimate the user's gaze direction is a technical problem that needs to be solved urgently.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本申请提供一种视线估计方法、装置、设备、及计算机可读存储介质，以解决相关技术中无法准确估计用户的视线方向的技术问题。In view of this, the present application provides a line-of-sight estimation method, apparatus, device, and computer-readable storage medium to solve the technical problem in the related art that the user's line-of-sight direction cannot be accurately estimated.

第一方面，提供一种视线估计方法，所述方法包括：In a first aspect, a line-of-sight estimation method is provided, the method comprising:

获取至少一帧待识别图像，所述待识别图像包括对用户进行图像采集得到的图像；Acquiring at least one frame of an image to be identified, the image to be identified includes an image obtained by performing image collection on a user;

利用第一神经网络估计出所述待识别图像对应的目标区域，所述目标区域是多个设定区域中的至少一个，所述目标区域表征所述待识别图像中用户头部所朝向的区域；The target area corresponding to the image to be recognized is estimated by using the first neural network, the target area is at least one of a plurality of set areas, and the target area represents the area in the image to be recognized that the user's head is facing ;

获取与所述目标区域对应的第二神经网络，利用所述第二神经网络估计所述用户的目标视线方向信息。A second neural network corresponding to the target area is acquired, and the target gaze direction information of the user is estimated by using the second neural network.

第二方面，提供一种视线估计方法，所述方法包括：In a second aspect, a line-of-sight estimation method is provided, the method comprising:

利用神经网络估计所述用户的目标视线方向信息；其中，所述神经网络用于：提取所述待识别图像中至少一类表征用户头部朝向的目标特征后，将提取的目标特征进行融合，根据特征融合结果估计所述用户的目标视线方向信息。Use a neural network to estimate the user's target line of sight direction information; wherein, the neural network is used to: after extracting at least one type of target features representing the orientation of the user's head in the to-be-recognized image, fuse the extracted target features, The target gaze direction information of the user is estimated according to the feature fusion result.

第三方面，提供一种视线估计装置，所述装置包括处理器、存储器、存储在所述存储器上可被所述处理器执行的计算机程序，所述处理器执行所述计算机程序时实现以下步骤：In a third aspect, a line-of-sight estimation device is provided, the device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, the processor implements the following steps when executing the computer program :

第四方面，提供一种视线估计装置，所述装置包括处理器、存储器、存储在所述存储器上可被所述处理器执行的计算机程序，所述处理器执行所述计算机程序时实现以下步骤：In a fourth aspect, a line-of-sight estimation device is provided, the device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, the processor implements the following steps when executing the computer program :

第五方面，提供一种车辆，所述车辆包括前述第三方面所述的装置或前述第四方面所述的所述的装置。In a fifth aspect, a vehicle is provided, the vehicle comprising the device of the aforementioned third aspect or the aforementioned device of the aforementioned fourth aspect.

第六方面，提供一种计算机可读存储介质，所述可读存储介质上存储有若干计算机指令，所述计算机指令被执行时实现前述第一方面所述的视线估计方法。In a sixth aspect, a computer-readable storage medium is provided, where several computer instructions are stored thereon, and when the computer instructions are executed, the line-of-sight estimation method described in the foregoing first aspect is implemented.

第七方面，提供一种计算机可读存储介质，所述可读存储介质上存储有若干计算机指令，所述计算机指令被执行时实现前述第二方面所述的视线估计方法。In a seventh aspect, a computer-readable storage medium is provided, where several computer instructions are stored thereon, and when the computer instructions are executed, the line-of-sight estimation method described in the foregoing second aspect is implemented.

应用本申请提供的方案，由于预先设置了多个设定区域，并且利用两类不同任务的神经网络设计了二次估计的流程，第一神经网络首先从大范围上先识别出用户所朝向的目标区域，接着再利用目标区域对应的第二神经网络识别出用户视线在目标区域中的目标视线方向信息。由此可见，由于利用第一神经网络先将估计范围缩小至目标区域内，再由适配于目标区域的第二神经网络在目标区域内进行目标视线方向信息的估计，使得估计结果能够获得较高的准确度。Applying the solution provided by this application, since multiple setting areas are preset, and the secondary estimation process is designed by using two types of neural networks with different tasks, the first neural network first identifies the direction the user is facing from a large range. target area, and then use the second neural network corresponding to the target area to identify the target sight line direction information of the user's sight line in the target area. It can be seen that since the first neural network is used to first narrow the estimation range to the target area, and then the second neural network adapted to the target area is used to estimate the target line-of-sight direction information in the target area, so that the estimation results can be obtained relatively high accuracy.

Description of drawings

为了更清楚地说明本申请实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative labor.

图1A是本申请一个实施例的视线估计方法的示意图。FIG. 1A is a schematic diagram of a line-of-sight estimation method according to an embodiment of the present application.

图1B是本申请另一个实施例的视线估计方法的示意图。FIG. 1B is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.

图1C是本申请一个实施例的第一神经网络/第二神经网络的结构示意图。FIG. 1C is a schematic structural diagram of a first neural network/second neural network according to an embodiment of the present application.

图1D是本申请另一个实施例的视线估计方法的示意图。FIG. 1D is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.

图1E是本申请另一个实施例的视线估计方法的示意图。FIG. 1E is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.

图2是本申请另一个实施例的视线估计方法的示意图。FIG. 2 is a schematic diagram of a line-of-sight estimation method according to another embodiment of the present application.

图3是本申请中用于实施本实施例的视线估计方法的一种装置的结构示意图。FIG. 3 is a schematic structural diagram of an apparatus for implementing the line-of-sight estimation method of this embodiment in the present application.

图4是本申请一个实施例的车辆的示意图。FIG. 4 is a schematic diagram of a vehicle according to an embodiment of the present application.

Detailed ways

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments.

随着数据和技术的发展，由于视线信息能够反映用户的注意力，有关用户视线的关注和需求越来越多，需要提供能够准确估计用户视线的技术方案。With the development of data and technology, since the sight line information can reflect the user's attention, there are more and more concerns and demands on the user's sight line. It is necessary to provide a technical solution that can accurately estimate the user's sight line.

视线估计方案大多采用机器学习的方式实现，在机器学习领域，通常是先通过建模表示出一个模型，再通过构建评价函数对模型进行评价，最后根据样本数据及最优化方法对评价函数进行优化，把模型调整到最优。整个阶段涉及非常多的环节，例如样本数据的选择与处理、数据特征的设计、模型的设计、损失函数的设计或优化方法的设计等等，任一环节的细微差别都是导致估计准确度细微缺陷的因素。Most of the line-of-sight estimation schemes are implemented by machine learning. In the field of machine learning, a model is usually first expressed by modeling, then the model is evaluated by constructing an evaluation function, and finally the evaluation function is optimized according to sample data and optimization methods. , to adjust the model to the optimum. The whole stage involves a lot of links, such as the selection and processing of sample data, the design of data features, the design of models, the design of loss functions or the design of optimization methods, etc. The subtle differences in any link lead to subtle estimation accuracy. defect factor.

本申请发明人研究发现，视线估计的一个目标是估计出视线方向信息，即用户视线的朝向；当用户注意力在某个位置，用户眼睛朝向该位置，从而产生视线方向信息，即用户视线的朝向会对应到用户所关注的那个位置上。用户关注其他位置，相应的用户视线的朝向其他位置，产生对应于其他位置的视线方向信息。而用户眼睛所覆盖的视线范围较大，在一个空间里用户的眼睛可以朝向任何位置，想要估计的目标(即用户眼睛朝向的具体的位置对应的视线方向信息)相对于视线所覆盖的整个范围是很小的，因此让机器模型从较大的覆盖范围中估计出用户视线所朝向的视线方向信息，视线估计的结果难以达到非常高的准确度。The inventor of the present application has found that one goal of line of sight estimation is to estimate line-of-sight direction information, that is, the direction of the user's line of sight; when the user pays attention to a certain position, the user's eyes are directed toward the position, thereby generating line-of-sight direction information, that is, the direction of the user's line of sight. The orientation will correspond to the position the user is focusing on. The user pays attention to other positions, and the corresponding user's line of sight faces other positions, and the line-of-sight direction information corresponding to the other positions is generated. The user's eyes cover a large range of sight. In a space, the user's eyes can be directed to any position. The target to be estimated (that is, the line-of-sight direction information corresponding to the specific position the user's eyes are facing) is relative to the entire line of sight covered by the line of sight. The range is very small, so it is difficult for the machine model to estimate the line-of-sight direction information of the user's line of sight from a large coverage area, and the result of line-of-sight estimation is difficult to achieve very high accuracy.

本申请的发明人还发现，现有的视线估计方法，鲜有针对智能驾驶场景进行针对性优化，传统的视线估计方法基于图像处理技术，鲁棒性较差，无法适应光照变化；随着深度学习相关技术的发展，已经有部分方法采用通过神经网络的方法对视线进行估计，但现有技术的缺点在于：未能充分利用面部信息及状态；未能考虑时序信息；未能与智能驾驶场景结合，未能通过一定的先验知识来提高视线估计准确度。以上不足导致视线估计在智能驾驶场景中不能准确的估计驾驶员的视线，进而难以对驾驶员是否分心驾驶有正确的判断。The inventor of the present application also found that the existing line-of-sight estimation methods are rarely optimized for intelligent driving scenarios. The traditional line-of-sight estimation method is based on image processing technology, which has poor robustness and cannot adapt to changes in illumination; With the development of learning-related technologies, some methods have been used to estimate sight lines through neural networks, but the shortcomings of the existing technologies are: failure to fully utilize facial information and status; failure to consider timing information; failure to match intelligent driving scenarios Combined, the accuracy of line-of-sight estimation cannot be improved by certain prior knowledge. The above deficiencies result in that the line of sight estimation cannot accurately estimate the driver's line of sight in an intelligent driving scenario, and it is difficult to make a correct judgment on whether the driver is distracted driving.

因此，本申请实施例提出了一种视线估计方案，预先设置了多个设定区域，并且利用两类不同任务的神经网络设计了二次估计的流程，第一神经网络首先从大范围上先识别出用户所朝向的目标区域，接着再利用目标区域对应的第二神经网络识别出用户视线在目标区域中的目标视线方向信息。由此可见，由于利用第一神经网络先将估计范围缩小至目标区域内，再由适配于目标区域的第二神经网络在目标区域内进行目标视线方向信息的估计，使得估计结果能够获得较高的准确度。通过先验区域判别及多信息融合的方式，在智能驾驶的应用场景中，视线的估计可以更加准确，从而进一步有助于驾驶员是否分心驾驶的判断，提高驾驶安全性。接下来对本实施例方案进行详细说明。Therefore, the embodiment of the present application proposes a line-of-sight estimation scheme, which pre-sets multiple setting regions, and uses two types of neural networks with different tasks to design a secondary estimation process. Identify the target area that the user is facing, and then use the second neural network corresponding to the target area to identify the target line of sight direction information of the user's line of sight in the target area. It can be seen that since the first neural network is used to first narrow the estimation range to the target area, and then the second neural network adapted to the target area is used to estimate the target line-of-sight direction information in the target area, so that the estimation results can be obtained relatively high accuracy. Through a priori region discrimination and multi-information fusion, in the application scenario of intelligent driving, the line of sight can be estimated more accurately, which further helps the driver to judge whether the driver is distracted driving and improves driving safety. Next, the solution of this embodiment will be described in detail.

如图1A所示，是本申请根据一示例性实施例示出的视线估计方法的流程图，包括如下步骤：As shown in FIG. 1A , it is a flowchart of a line-of-sight estimation method according to an exemplary embodiment of the present application, including the following steps:

在步骤102中，获取至少一帧待识别图像，所述待识别图像包括对用户进行图像采集得到的图像；In step 102, at least one frame of to-be-recognized image is acquired, and the to-be-recognized image includes an image obtained by performing image acquisition on a user;

在步骤104中，利用第一神经网络估计出所述待识别图像对应的目标区域，所述目标区域是多个设定区域中的至少一个，所述目标区域表征所述待识别图像中用户头部所朝向的区域；In step 104, a first neural network is used to estimate a target area corresponding to the to-be-recognized image, the target area is at least one of a plurality of set areas, and the target area represents the user's head in the to-be-recognized image the area towards which the part is directed;

在步骤106中，获取与所述目标区域对应的第二神经网络，利用所述第二神经网络估计所述用户的目标视线方向信息。In step 106, a second neural network corresponding to the target area is acquired, and the target gaze direction information of the user is estimated by using the second neural network.

结合图1B示出的另一种视线估计方法的示意图进行说明。本实施例中考虑到实际场景中用户的视线范围较大会覆盖到很大的区域，而视线估计的目标是为了准确地识别出用户的眼睛看向整个覆盖的区域中某个具体的位置时的视线方向信息。基于此，本实施例预先配置了多个设定区域，这多个设定区域是视线估计的目标，这些设定区域的配置可以根据实际应用场景的需要灵活配置，图1B中以设定区域i、设定区域j 和设定区域k为例进行示意，本实施例对此不作限定。The description will be made with reference to the schematic diagram of another line-of-sight estimation method shown in FIG. 1B . In this embodiment, it is considered that the user's line of sight will cover a large area in the actual scene, and the goal of line-of-sight estimation is to accurately identify when the user's eyes are looking at a specific position in the entire covered area. Line-of-sight information. Based on this, this embodiment pre-configures multiple setting areas, which are the targets of line of sight estimation. The configuration of these setting areas can be flexibly configured according to the needs of actual application scenarios. In FIG. 1B, the setting areas are i. The setting area j and the setting area k are taken as an example for illustration, which is not limited in this embodiment.

视线估计的一个应用场景是辅助驾驶场景，在此场景中可以利用视线估计结果检测驾驶员是否分心驾驶，作为例子，所述多个设定区域包括：前挡风玻璃区域、左后视镜区域、右后视镜区域、左车窗区域、右车窗区域、仪表盘区域或中控台区域。One application scenario of line of sight estimation is an assisted driving scenario, in which the result of line of sight estimation can be used to detect whether the driver is distracted driving. As an example, the multiple set areas include: front windshield area, left rearview mirror area, right mirror area, left window area, right window area, instrument panel area or center console area.

视线估计的其他应用场景可以是游戏场景，例如利用视线估计结果进行游戏的交互；还可以是医疗场景，例如应用于眼动仪中估计用户的视线方向；还可以是线下零售场景，例如利用视线估计结果预测用户对售卖物品的关注程度等；在不同应用场景下，可以根据需要将用户视线覆盖范围划分出多个设定区域，本实施例对此不作限定。Other application scenarios of gaze estimation can be game scenarios, such as using the results of gaze estimation for game interaction; medical scenarios, such as using eye trackers to estimate the user's gaze direction; offline retail scenarios, such as using The line-of-sight estimation result predicts the user's degree of attention to the sold item, etc. In different application scenarios, the user's line-of-sight coverage can be divided into multiple set areas as required, which is not limited in this embodiment.

本实施例中，待识别图像是指对用户进行图像采集得到的图像，作为例子，图像采集方式可以是摄像设备对用户拍摄得到的。In this embodiment, the image to be recognized refers to an image obtained by capturing an image of a user. As an example, the image capturing manner may be obtained by capturing a user by a camera device.

本实施例中，第一神经网络的任务是区分出待识别图像中用户头部所朝向的区域，第二神经网络的任务是估计用户的目标视线方向信息。在一些例子中，可以是每个设定区域对应一个第二神经网络；在另一些例子中，设定区域与第二神经网络也可以是多对一的关系，即一个神经网络可以适用于两个或以上的设定区域。图1B中为了示意方便，以一对一的方式示出了设定区域与第二神经网络，实际应用中可以根据需要进行配置，本实施例对此不作限定。In this embodiment, the task of the first neural network is to distinguish the area in which the user's head is facing in the image to be recognized, and the task of the second neural network is to estimate the target gaze direction information of the user. In some examples, each set area may correspond to a second neural network; in other examples, the set area and the second neural network may also have a many-to-one relationship, that is, one neural network may be applicable to two one or more setting fields. For convenience of illustration, FIG. 1B shows the setting area and the second neural network in a one-to-one manner, which may be configured as required in practical applications, which is not limited in this embodiment.

基于第一神经网络的任务，用于训练第一神经网络的训练图像集中，各训练图像中用户头部所朝向的区域涵盖所述多个设定区域，使得训练出的第一神经网络能够准确地识别出图像中用户头部所朝向的区域；例如，设定区域包括：设定区域A、设定区域B和设定区域C，则第一神经网络的训练数据中包括：用户的视线方向信息属于设定区域A的训练图像、用户的视线方向信息属于设定区域B的训练图像以及用户的视线方向信息属于设定区域C的训练图像。基于此，可以训练得到第一神经网络，第一神经网络用于估计出待识别图像对应的目标区域，此处第一神经网络是在大范围中识别出一个小的目标区域，将后续的视线方向信息的估计从大范围缩小至目标区域的范围。由于第一神经网络可以无需精确地识别出具体的视线方向信息，而是在大范围中识别出一个小的区域，因此第一神经网络的输出结果具有较高的可靠性。Based on the task of the first neural network, in the training image set used to train the first neural network, the area facing the user's head in each training image covers the multiple set areas, so that the trained first neural network can accurately Identify the area that the user's head is facing in the image; for example, if the setting area includes: setting area A, setting area B and setting area C, the training data of the first neural network includes: the user's line of sight direction The information belongs to the training image of the set area A, the user's gaze direction information belongs to the training image of the set area B, and the user's gaze direction information belongs to the training image of the set area C. Based on this, the first neural network can be trained, and the first neural network is used to estimate the target area corresponding to the image to be recognized. The estimation of orientation information is narrowed from a large range to the range of the target area. Since the first neural network can identify a small area in a large range without accurately identifying specific line-of-sight direction information, the output result of the first neural network has high reliability.

基于第二神经网络的任务，用于训练设定区域对应的第二神经网络的训练图像集，采用的是用户头部所朝向的区域均为该设定区域的图像；例如，设定区域包括：设定区域A、设定区域B和设定区域C，设定区域A对应的第二神经网络的训练数据，是用户的视线方向信息属于设定区域A的训练图像，设定区域B对应的第二神经网络的训练数据，是用户的视线方向信息属于设定区域B的训练图像，设定区域C对应的第二神经网络的训练数据，是用户的视线方向信息属于设定区域C的训练图像。因此，设定区域对应的第二神经网络关注的是用户视线方向为设定区域的视线特征，使得第二神经网络能够准确地识别出视线方向信息。此处第二神经网络与该目标区域相适应，并且第二神经网络是在较小的目标区域内识别出具体的视线方向信息，因此第二神经网络的输出结果同样具有较高的可靠性，最终使得视线估计结果具备较高的准确度。Based on the task of the second neural network, the training image set used to train the second neural network corresponding to the set area adopts the image of the area facing the user's head as the set area; for example, the set area includes : Setting area A, setting area B, and setting area C, the training data of the second neural network corresponding to setting area A is the training image whose line of sight direction information of the user belongs to setting area A, and setting area B corresponds to The training data of the second neural network is the training image of the user's line of sight direction information belonging to the set area B, and the training data of the second neural network corresponding to the set area C is the user's line of sight direction information belongs to the set area C. training images. Therefore, the second neural network corresponding to the set area focuses on the line-of-sight feature of the user's line-of-sight direction of the set area, so that the second neural network can accurately identify the line-of-sight direction information. Here, the second neural network is adapted to the target area, and the second neural network identifies specific line-of-sight direction information in a smaller target area, so the output of the second neural network also has high reliability. Finally, the line-of-sight estimation results have higher accuracy.

可选的，训练过程可以采用有监督的方式，前述的训练图像可以标定有用户视线方向信息，而标定出的用户视线方向信息会对应多个设定区域中的其中一个，训练图像也标定了其对应的设定区域。在一些例子中，训练图像可以采用虚拟数据集，即采用仿真方式获得的图像，此种方式可以高效地获得训练图像。在另一些例子中，训练图像可以采用真实数据集，由于在真实采集用户图像的过程中会使图像夹杂了真实环境的干扰，而这些干扰在神经网络的训练过程中会使神经网络具有更好的抗干扰能力，从而在估计阶段应对实际环境时能够输出更准确的估计结果。Optionally, the training process can be supervised. The aforementioned training images can be calibrated with user gaze direction information, and the calibrated user gaze direction information will correspond to one of the multiple setting areas, and the training images are also calibrated. its corresponding setting area. In some examples, the training images may be virtual datasets, ie, images obtained in a simulation manner, which can efficiently obtain training images. In other examples, the training images can be real datasets, because the real-world disturbances will be mixed in the images during the actual collection of user images, and these disturbances will make the neural network have better performance during the training of the neural network. Therefore, it can output more accurate estimation results when dealing with the actual environment in the estimation stage.

本实施例中，第一神经网络和第二神经网络的输入数据都是待识别图像，作为例子，第一神经网络用于估计出所述待识别图像对应的目标区域，估计处理过程可以是：提取所述待识别图像的图像特征，利用所述图像特征估计所述待识别图像对应的目标区域。第二神经网络用于估计出所述待识别图像的视线方向信息，估计处理过程可以是：提取所述待识别图像的图像特征，利用所述图像特征估计用户的视线方向信息。In this embodiment, the input data of the first neural network and the second neural network are images to be recognized. As an example, the first neural network is used to estimate the target area corresponding to the image to be recognized, and the estimation process may be: Extracting image features of the to-be-recognized image, and using the image features to estimate a target area corresponding to the to-be-recognized image. The second neural network is used for estimating the line of sight direction information of the image to be recognized, and the estimation processing process may be: extracting image features of the to-be-recognized image, and using the image features to estimate the line of sight direction information of the user.

在一些例子中，在对第一神经网络和第二神经网络的设计时，可以采用不同的网络结构，具体的网络结构设计可以根据需要灵活配置。在另一些例子中，第一神经网络和第二神经网络的估计过程整体一致，只是输出的结果有所不同，为了提高效率，第一神经网络的结构与所述第二神经模型的结构相同，即两者可以采用设计相同的网络结构。可选的，由于两者输出的结果不同，训练得到的第一神经网络的参数与所述第二神经模型的参数不同。由于第一神经网络和第二神经网络的输入是图像，可选的，卷积神经网络可很好地适用于图像处理，第一神经网络可以是卷积神经网络，第二神经网络也可以是卷积神经网络。当然，实际应用中也可以根据需要灵活设计其他形式的第一神经网络和第二神经网络，本实施例对此不作限定。In some examples, when designing the first neural network and the second neural network, different network structures may be used, and the specific network structure design may be flexibly configured as required. In other examples, the estimation process of the first neural network and the second neural network are the same as a whole, but the output results are different. In order to improve the efficiency, the structure of the first neural network is the same as that of the second neural model. That is, the two can adopt the same network structure design. Optionally, because the output results of the two are different, the parameters of the first neural network obtained by training are different from the parameters of the second neural model. Since the input of the first neural network and the second neural network is an image, optionally, a convolutional neural network can be well suited for image processing, the first neural network can be a convolutional neural network, and the second neural network can also be Convolutional Neural Networks. Of course, other forms of the first neural network and the second neural network can also be flexibly designed in practical applications, which are not limited in this embodiment.

基于第一神经网络和第二神经网络整体一致的估计过程，对于第一神经网络/第二神经网络，其是从输入的待识别图像中识别出图像特征，利用图像特征估计所述待识别图像对应的目标区域/目标视线方向信息。考虑到视线估计的目标，为了使第一神经网络和第二神经网络能够输出准确的结果，本实施例中还对第一神经网络/第二神经网络进行创新设计：Based on the overall consistent estimation process of the first neural network and the second neural network, for the first neural network/second neural network, image features are identified from the input image to be recognized, and image features are used to estimate the image to be recognized. Corresponding target area/target line-of-sight direction information. Considering the goal of line-of-sight estimation, in order to enable the first neural network and the second neural network to output accurate results, the first neural network/second neural network is also innovatively designed in this embodiment:

在一些例子中，考虑到视线估计的目标，所述图像特征包括如下任一：人脸区域特征、眼部区域特征或头部姿态特征。在一些例子中，所述人脸区域特征、眼部区域特征或头部姿态特征，分别是从所述待识别图像分割出的人脸区域、眼部区域或头部区域中对应提取的。In some examples, the image features include any of the following: face region features, eye region features, or head pose features, considering the target of gaze estimation. In some examples, the face region feature, eye region feature or head pose feature are correspondingly extracted from the face region, eye region or head region segmented from the to-be-recognized image.

在一些例子中，所述眼部区域特征，包括如下任一：双眼特征、左眼特征或右眼特征。In some examples, the eye region features include any one of the following: binocular features, left eye features or right eye features.

在一些例子中，所述左眼特征和右眼特征，是利用孪生网络提取得到的。在一些例子中，所述孪生网络用于：分别对左眼区域和右眼区域进行空洞卷积操作，以增强网络对图像的感受野。In some examples, the left-eye feature and the right-eye feature are extracted by using the Siamese network. In some examples, the Siamese network is used to perform dilated convolution operations on the left eye region and the right eye region respectively, so as to enhance the network's receptive field of the image.

如图1C所示，是本实施例示出的第一神经网络/第二神经网络结构示意图，图1C的网络结构适用于第一神经网络和第二神经网络。其中，INPUT即待识别图像，在该网络结构中包括一图像分割层face detection，用于将待识别图像分割为三部分：人脸区域、眼部区域landmark和头部区域headpose。As shown in FIG. 1C , it is a schematic diagram of the first neural network/second neural network structure shown in this embodiment, and the network structure in FIG. 1C is applicable to the first neural network and the second neural network. Among them, INPUT is the image to be recognized, and the network structure includes an image segmentation layer face detection, which is used to divide the image to be recognized into three parts: face area, eye area landmark and head area headpose.

对于人脸区域，识别出人脸区域特征，可以人脸关键点，如68个人脸关键点；还可以通过一人脸区域特征提取层CNN来提取整个人脸的特征。For the face region, to identify the features of the face region, the key points of the face can be used, such as 68 key points of the face; the features of the entire face can also be extracted through the face region feature extraction layer CNN.

对于眼部区域，可以通过分割操作crop分割出双眼区域pitch、左眼区域和右眼区域。对于双眼区域pitch，可以通过双眼特征提取层提取出双眼特征。对于左眼区域和右眼区域，可以采用孪生网络层从左眼区域和右眼区域中提取出左眼特征和右眼特征，孪生网络可以通过共享参数的方式，基于左眼区域和右眼区域的差异与联系提取特征。可选的，孪生网络还可用于分别对左眼区域和右眼区域进行空洞卷积操作，以增强网络对图像的感受野，从而提取出准确地左眼特征和右眼特征。For the eye region, the binocular region pitch, the left eye region and the right eye region can be segmented through the segmentation operation crop. For the binocular area pitch, the binocular features can be extracted through the binocular feature extraction layer. For the left eye area and the right eye area, the twin network layer can be used to extract the left eye feature and the right eye feature from the left eye area and the right eye area. The twin network can share parameters based on the left eye area and the right eye area. The difference and connection extraction features. Optionally, the Siamese network can also be used to perform dilated convolution operations on the left-eye region and the right-eye region respectively, so as to enhance the network's receptive field of the image, thereby extracting accurate left-eye and right-eye features.

对于头部区域headpose，可识别出头部姿态特征。For the head region headpose, head pose features can be identified.

提取出的所有特征可以通过全连接层FC进行融合处理，进而输出估计结果；对于第一神经网络，输出的估计结果为目标区域gaze zone；对于第二神经网络，输出的估计为目标视线方向信息，具体可以是视线方向的角度信息gaze pitch&yaw。All the extracted features can be fused through the fully connected layer FC, and then output the estimation result; for the first neural network, the output estimation result is the target area gaze zone; for the second neural network, the output estimation is the target line of sight direction information , specifically the angle information gaze pitch&yaw of the line of sight direction.

除了图1C所示的实施例中，第一神经网络可以通过识别所述待识别图像中用户的视线方向信息来估计出待识别图像对应的目标区域。在其他实施例中，第一神经网络还可以采用其他方式实现，例如可以是其他结构的神经网络，为了提高效率，还可以采用一个简易结构的神经网络，通过输入的待识别图像中识别用户的头部姿态信息来估计目标区域，此种方式无需考虑眼部和关键点等其它信息，采用了一个结构相对简易的神经网络，可以减少实现成本。In addition to the embodiment shown in FIG. 1C , the first neural network can estimate the target area corresponding to the image to be recognized by recognizing the user's gaze direction information in the image to be recognized. In other embodiments, the first neural network can also be implemented in other ways, for example, it can be a neural network with other structures. In order to improve efficiency, a simple-structured neural network can also be used to recognize the user's identity from the input image to be recognized. The head pose information is used to estimate the target area. This method does not need to consider other information such as eyes and key points. A neural network with a relatively simple structure is used, which can reduce the implementation cost.

在一些例子中，待识别图像的帧数可以是一帧，该帧待识别图像输入至第一神经网络后，由第一神经网络识别出该帧待识别图像对应的目标区域；接着，可以获取与所述目标区域对应的第二神经网络，将该帧待识别图像输入至第二神经网络，利用所述第二神经网络估计所述用户的目标视线方向信息。In some examples, the number of frames of the to-be-recognized image may be one frame, and after the to-be-recognized image of the frame is input to the first neural network, the first neural network identifies the target area corresponding to the to-be-recognized image; The second neural network corresponding to the target area inputs the to-be-recognized image of the frame into the second neural network, and uses the second neural network to estimate the target line-of-sight direction information of the user.

在另一些例子中，待识别图像可以不止一帧，可以针对多帧待识别图像进行视线估计，具体的帧数可以根据需要灵活配置，例如可以是设置固定的帧数数量，如3帧、5帧等。在其他例子中，每次视线估计的图像帧数还可以的动态变化的，还可以是通过设置视线估计的时间间隔来确定图像帧数，例如每隔1秒进行一次视线估计，每次视线估计可以是采取该时间段内的一帧或多帧图像进行估计。In other examples, there may be more than one frame of the image to be recognized, and sight line estimation can be performed for multiple frames of the image to be recognized. The specific number of frames can be flexibly configured as required, for example, a fixed number of frames can be set, such as 3 frames, 5 frame etc. In other examples, the number of image frames for each line-of-sight estimation can be dynamically changed, and the number of image frames can also be determined by setting the time interval for line-of-sight estimation. It can be estimated by taking one or more frames of images in this time period.

作为例子，所述多帧待识别图像包括：同一拍摄设备在不同时刻对用户拍摄的多帧图像；或者，不同拍摄设备在不同位置对用户拍摄的多帧图像。对于多帧待识别图像的视线估计方案，可以有多种方式实现。As an example, the multiple frames of images to be identified include: multiple frames of images photographed by the same photographing device on the user at different times; or multiple frames of images photographed by different photographing devices on the user at different locations. For the line-of-sight estimation scheme of the multi-frame to-be-recognized images, there can be implemented in various ways.

作为例子，可以是针对多帧待识别图像估计出多个视线方向信息后进行融合；如图1D所示，示出了对于多帧待识别图像的处理过程，图1D中以三帧待识别图像(第n-1帧、第n帧及第n+1帧)为例示出了视线估计的一个实施例，针对每帧待识别图像可以分别利用第一神经网络识别出对应的目标区域，再分别将每帧待识别图像输入至对应的第二神经网络。作为例子，所述利用所述第二神经网络估计所述用户的目标视线方向信息，包括：获取每帧待识别图像中用户的视线方向信息；每帧待识别图像中用户的视线方向信息，是该帧待识别图像对应的目标区域所对应的第二神经网络对该帧待识别图像估计得到的；将各帧待识别图像对应的视线方向信息融合后估计用户的目标视线方向信息。其中，图1D的示意图示出的多帧待识别图像都对应同一个目标区域，实际应用中也有可能出现多帧待识别图像对应不同目标区域的情况，本实施例对此不作限定。As an example, it is possible to estimate multiple line-of-sight direction information for multiple frames of images to be identified and then fuse; as shown in FIG. 1D , the processing process for multiple frames of images to be identified is shown. (The n-1th frame, the nth frame, and the n+1th frame) are examples to illustrate an embodiment of the line of sight estimation. For each frame of the image to be identified, the corresponding target area can be identified by the first neural network, and then Each frame of the to-be-recognized image is input to the corresponding second neural network. As an example, using the second neural network to estimate the user's target gaze direction information includes: acquiring the user's gaze direction information in each frame of the image to be recognized; the user's gaze direction information in each frame of the to-be-recognized image is The second neural network corresponding to the target area corresponding to the frame to be recognized image is obtained by estimating the frame to be recognized image; the user's target gaze direction information is estimated after fusing the line of sight direction information corresponding to each frame of the to-be-recognized image. The multiple frames of images to be identified shown in the schematic diagram of FIG. 1D all correspond to the same target area. In practical applications, there may also be situations where multiple frames of images to be identified correspond to different target areas, which is not limited in this embodiment.

在另一些例子中，可以是针对多帧待识别图像，分别提取出图像特征后，对图像特征进行融合，进而根据特征融合结果估计出多个视线方向信息。具体的，所述利用所述第二神经网络估计所述用户的目标视线方向信息，包括：获取每帧待识别图像的图像特征；每帧待识别图像的图像特征，是该帧待识别图像对应的目标区域所对应的第二神经网络从该帧待识别图像提取得到的；与前述实施例不同，本实施例是将各帧待识别图像的图像特征融合后估计用户的目标视线方向信息。可选的，为了进一步提高估计结果的准确性，可以利用所述第三神经网络对各帧待识别图像的图像特征融合后估计用户的目标视线方向信息。In other examples, for multiple frames of images to be identified, after the image features are extracted respectively, the image features are fused, and then a plurality of line-of-sight direction information is estimated according to the feature fusion result. Specifically, using the second neural network to estimate the target line of sight direction information of the user includes: acquiring the image features of each frame of the image to be recognized; the image features of each frame of the image to be recognized correspond to the image to be recognized The second neural network corresponding to the target area of the frame is extracted from the image to be recognized in the frame; different from the previous embodiment, this embodiment estimates the user's target line of sight direction information after fusing the image features of each frame of the image to be recognized. Optionally, in order to further improve the accuracy of the estimation result, the third neural network may be used to estimate the user's target line-of-sight direction information after fusing the image features of each frame of the image to be recognized.

本实施例的视线估计方案可应用于多种业务场景中，例如辅助驾驶、游戏、VR(Virtual Reality，虚拟现实)或医疗等。The line-of-sight estimation solution in this embodiment can be applied to various business scenarios, such as assisted driving, games, VR (Virtual Reality, virtual reality), or medical treatment.

接下来以辅助驾驶场景为例，对本实施例的视线估计方案进行说明。在辅助驾驶场景中，所述待识别图像可以包括：对车辆轿厢内驾驶员拍摄的图像；基于辅助驾驶的场景，可以根据驾驶员在车辆轿厢的视线覆盖范围划分多个设定区域，作为例子，多个设定区域可以包括：前挡风玻璃区域、左后视镜区域、右后视镜区域、左车窗区域、右车窗区域、仪表盘区域或中控台区域；当然，实际应用中可以根据需要划分其他形式的设定区域，本实施例对此不作限定。在一些例子中，还可以根据需要划分出其他区域，这些其他区域未对应有第二神经网络，其并非视线估计方案的估计目标，在第一神经网络识别出待识别图像中对应为其他区域时，可以输出识别结果，结束视线估计流程。Next, taking an assisted driving scenario as an example, the line-of-sight estimation solution of this embodiment will be described. In the assisted driving scene, the to-be-identified image may include: an image taken by the driver in the vehicle car; based on the assisted driving scene, a plurality of set areas may be divided according to the driver's line of sight coverage in the vehicle car, As an example, the plurality of setting areas may include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area or a center console area; of course, In practical applications, other forms of setting areas may be divided according to requirements, which are not limited in this embodiment. In some examples, other regions can also be divided as required. These other regions do not correspond to the second neural network, and are not the estimation targets of the line-of-sight estimation scheme. When the first neural network identifies that the image to be recognized corresponds to other regions , you can output the recognition result, and end the line of sight estimation process.

如图1E所示，是本实施例示出的辅助驾驶场景下的视线估计方法的实施例，作为INPUT的待识别图像输入至第一神经网络CNN中，由第一神经网络CNN输出该待识别图像对应的目标区域gaze zone；本实施例中划分的设定区域如图1E所示，各个设定区域对应有第二神经网络，即图1E中的CNN-1、CNN-2至CNN-7，根据第一神经网络CNN输出该待识别图像对应的目标区域gaze zone，获取到对应的第二神经网络，将待识别图像输入至第二神经网络，利用第二神经网络输出视线方向信息gaze。As shown in FIG. 1E, it is an embodiment of the line of sight estimation method in the assisted driving scene shown in this embodiment. The image to be recognized as an INPUT is input into the first neural network CNN, and the image to be recognized is output by the first neural network CNN. The corresponding target area gaze zone; the setting area divided in this embodiment is shown in Figure 1E, and each setting area corresponds to a second neural network, namely CNN-1, CNN-2 to CNN-7 in Figure 1E, According to the first neural network CNN, the target area gaze zone corresponding to the image to be recognized is output, the corresponding second neural network is obtained, the image to be recognized is input to the second neural network, and the gaze direction information gaze is output by the second neural network.

可选的，估计出的目标视线方向信息可以用于评估所述驾驶员是否分心驾驶。作为例子，可以利用所述目标视线方向信息确定出所述用户的视线方向偏离设定方向，输出提示信息，以提示驾驶员。可选的，所述提示信息的输出方式，包括如下任一方式：音频、文字、视频或灯光。例如，通过音频播放出提示消息，或者是通过汽车内的电子屏幕以文字或视频的方式输出提示消息，还可以是控制汽车内的灯光模块以特定的灯光颜色输出提示消息，当然还可以是以上述至少两种方式的组合输出提示消息。Optionally, the estimated target line-of-sight direction information may be used to assess whether the driver is distracted driving. As an example, the target line of sight direction information may be used to determine that the user's line of sight direction deviates from the set direction, and prompt information is output to prompt the driver. Optionally, the output mode of the prompt information includes any of the following modes: audio, text, video or lighting. For example, the prompt message can be played through audio, or the prompt message can be output in the form of text or video through the electronic screen in the car, or the light module in the car can be controlled to output the prompt message in a specific light color. The combination of the above at least two ways outputs a prompt message.

本实施例方案中，首先通过第一神经网络估计出待识别图像对应的目标区域，目标区域是多个设定区域中的至少一个，所述目标区域表征所述待识别图像中用户头部所朝向的区域。其中，设定区域对应有第二网络，利用第二神经网络估计所述用户的目标视线方向信息，从而在第一神经网络识别出目前区域的基础上，利用第二神经网络进行更为细致准确的视线估计。In the solution of this embodiment, the first neural network is used to estimate the target area corresponding to the to-be-recognized image, the target area is at least one of a plurality of set areas, and the target area represents the location of the user's head in the to-be-recognized image towards the area. The set area corresponds to a second network, and the second neural network is used to estimate the target line of sight direction information of the user, so that on the basis of identifying the current area by the first neural network, the second neural network is used to perform more detailed and accurate line-of-sight estimate.

本说明书实施例还提供了另一种视线估计方法，本实施例方法是针对单个神经网络而言的，正如前述所言，本实施例在单个神经网络上也实现了创新，如图2所示，是该视线估计方法的流程示意图，包括如下步骤：The embodiment of this specification also provides another line-of-sight estimation method. The method of this embodiment is aimed at a single neural network. As mentioned above, this embodiment also achieves innovation on a single neural network, as shown in FIG. 2 . , is a schematic flowchart of the line-of-sight estimation method, including the following steps:

在步骤202中，获取至少一帧待识别图像。所述待识别图像包括对用户进行图像采集得到的图像；In step 202, at least one frame of images to be recognized is acquired. The to-be-recognized image includes an image obtained by performing image acquisition on a user;

在步骤204中，利用神经网络估计所述用户的目标视线方向信息。其中，所述神经网络用于：提取所述待识别图像中至少一类表征用户头部朝向的目标特征后，将提取的目标特征进行融合，根据特征融合结果估计所述用户的目标视线方向信息。In step 204, the target gaze direction information of the user is estimated by using a neural network. The neural network is used for: after extracting at least one type of target features representing the orientation of the user's head in the to-be-recognized image, fuse the extracted target features, and estimate the user's target line-of-sight direction information according to the feature fusion result .

本实施例中，利用神经网络提取待识别图像中至少一类表征用户头部朝向的目标特征，将提取的目标特征进行融合，从而可以根据特征融合结果准确地估计出所述用户的目标视线方向信息。In this embodiment, the neural network is used to extract at least one type of target features representing the orientation of the user's head in the image to be recognized, and the extracted target features are fused, so that the target gaze direction of the user can be accurately estimated according to the feature fusion result. information.

在一些例子中，所述目标特征包括如下任一：人脸区域特征、眼部区域特征或头部姿态特征。In some examples, the target features include any of the following: face region features, eye region features, or head pose features.

在一些例子中，所述人脸区域特征、眼部区域特征或头部姿态特征，分别是从所述待识别图像分割出的人脸区域、眼部区域或头部区域中对应提取的。In some examples, the face region feature, eye region feature or head pose feature are correspondingly extracted from the face region, eye region or head region segmented from the to-be-recognized image.

在一些例子中，所述左眼特征和右眼特征，是利用孪生网络提取得到的。In some examples, the left-eye feature and the right-eye feature are extracted by using the Siamese network.

在一些例子中，所述孪生网络用于：分别对左眼区域和右眼区域进行空洞卷积操作，以增强网络对图像的感受野。In some examples, the Siamese network is used to perform dilated convolution operations on the left eye region and the right eye region respectively, so as to enhance the network's receptive field of the image.

在一些例子中，所述用户的目标视线方向信息包括：目标区域，所述目标区域表征所述待识别图像中用户头部所朝向的区域。In some examples, the target line-of-sight direction information of the user includes: a target area, where the target area represents an area in the to-be-recognized image to which the user's head faces.

在一些例子中，所述神经网络的训练数据，采用的是用户头部所朝向的区域均为所述目标区域的图像。In some examples, the training data of the neural network adopts an image in which the region facing the user's head is the target region.

在一些例子中，所述用户的目标视线方向信息，包括：用户的目标视线方向的角度信息。In some examples, the target gaze direction information of the user includes: angle information of the user's target gaze direction.

在一些例子中，所述待识别图像有多帧；In some examples, the to-be-identified image has multiple frames;

所述利用所述神经网络估计所述用户的目标视线方向信息，包括：The using the neural network to estimate the target gaze direction information of the user includes:

获取每帧待识别图像中用户的视线方向信息；每帧待识别图像中用户的视线方向信息，是该帧待识别图像对应的神经网络对该帧待识别图像估计得到的；Obtaining the line-of-sight direction information of the user in each frame of the to-be-recognized image; the user's line-of-sight direction information in each frame of the to-be-recognized image is estimated by the neural network corresponding to the to-be-recognized image of the frame of the to-be-recognized image;

将各帧待识别图像对应的视线方向信息融合后估计用户的目标视线方向信息。The line-of-sight direction information corresponding to each frame of the image to be recognized is fused to estimate the user's target line-of-sight direction information.

在一些例子中，所述待识别图像有多帧；所述神经网络包括初始子神经网络和融合子神经网络；In some examples, the to-be-recognized image has multiple frames; the neural network includes an initial sub-neural network and a fusion sub-neural network;

对每帧待识别图像，分别利用初始子神经网络获取每帧待识别图像的特征融合结果；所述初始子神经网络用于：提取所述待识别图像中至少一类表征用户头部朝向的目标特征后，将提取的目标特征进行融合；For each frame of the to-be-recognized image, the initial sub-neural network is used to obtain the feature fusion result of each frame of the to-be-recognized image; the initial sub-neural network is used to: extract at least one type of target representing the user's head orientation in the to-be-recognized image After the feature, the extracted target features are fused;

利用所述融合子神经网络将各帧待识别图像的特征融合结果融合后估计用户的目标视线方向信息。Using the fusion sub-neural network, the feature fusion results of each frame of the image to be recognized are fused to estimate the user's target line-of-sight direction information.

在一些例子中，所述神经网络为卷积神经网络。In some examples, the neural network is a convolutional neural network.

在一些例子中，所述初始子神经网络为卷积神经网络。In some examples, the initial sub-neural network is a convolutional neural network.

在一些例子中，所述融合子神经网络为循环神经网络。In some examples, the fusion sub-neural network is a recurrent neural network.

在一些例子中，所述待识别图像包括：对车辆内驾驶员拍摄的图像。In some examples, the image to be identified includes an image taken of a driver in the vehicle.

在一些例子中，所述目标视线方向信息用于评估所述驾驶员是否分心驾驶。In some examples, the target gaze direction information is used to assess whether the driver is distracted driving.

在一些例子中，所述多个设定区域包括：前挡风玻璃区域、左后视镜区域、右后视镜区域、左车窗区域、右车窗区域、仪表盘区域或中控台区域。In some examples, the plurality of setting areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area .

在一些例子中，所述方法还包括：In some examples, the method further includes:

利用所述目标视线方向信息确定出所述用户的视线方向偏离设定方向，输出提示信息。Using the target gaze direction information, it is determined that the user's gaze direction deviates from the set direction, and prompt information is output.

在一些例子中，所述提示信息的输出方式，包括如下任一方式：音频、文字、视频或灯光。In some examples, the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.

本实施例的视线估计方法的实现过程，具体可参考前述图1A至图1E所示实施例的描述，在此不再赘述。For the implementation process of the line-of-sight estimation method in this embodiment, reference may be made to the foregoing descriptions of the embodiments shown in FIG. 1A to FIG. 1E , which will not be repeated here.

上述方法实施例可以通过软件实现，也可以通过硬件或者软硬件结合的方式实现。以软件实现为例，作为一个逻辑意义上的装置，是通过其所在视线估计的处理器将非易失性存储器中对应的计算机程序指令读取到内存中运行形成的。从硬件层面而言，如图3所示，为实施本实施例视线估计装置300的一种硬件结构图，除了图3所示的处理器301以及存储器302之外，实施例中用于实施本视线估计方法的处理设备，通常根据该处理设备的实际功能，还可以包括其他硬件，对此不再赘述。The foregoing method embodiments may be implemented by software, and may also be implemented by hardware or a combination of software and hardware. Taking software implementation as an example, as a device in a logical sense, it is formed by reading the corresponding computer program instructions in the non-volatile memory into the memory through the processor that estimates the line of sight where it is located. From the perspective of hardware, as shown in FIG. 3 , which is a hardware structure diagram for implementing the line-of-sight estimation apparatus 300 in this embodiment, except for the processor 301 and the memory 302 shown in FIG. The processing device of the line-of-sight estimation method, usually according to the actual function of the processing device, may also include other hardware, which will not be repeated here.

本实施例中，视线估计装置300包括处理器、存储器、存储在所述存储器上可被所述处理器执行的计算机程序，所述处理器执行所述计算机程序时实现以下步骤：In this embodiment, the line-of-sight estimation apparatus 300 includes a processor, a memory, and a computer program stored on the memory and executable by the processor, and the processor implements the following steps when executing the computer program:

在一些例子中，用于训练设定区域对应的第二神经网络的训练图像集，采用的是用户头部所朝向的区域均为该设定区域的图像。In some examples, the training image set used for training the second neural network corresponding to the set area adopts the images of the area to which the user's head is facing is the set area.

在一些例子中，用于训练第一神经网络的训练图像集中，各训练图像中用户头部所朝向的区域涵盖所述多个设定区域。In some examples, in the training image set used for training the first neural network, the area to which the user's head faces in each training image covers the multiple set areas.

所述处理器执行所述利用所述第二神经网络估计所述用户的目标视线方向信息时，实现以下步骤：When the processor executes the estimation of the target gaze direction information of the user by using the second neural network, the following steps are implemented:

获取每帧待识别图像中用户的视线方向信息；每帧待识别图像中用户的视线方向信息，是该帧待识别图像对应的目标区域所对应的第二神经网络对该帧待识别图像估计得到的；Obtain the user's gaze direction information in each frame of the image to be recognized; the user's gaze direction information in each frame of the to-be-recognized image is estimated by the second neural network corresponding to the target area corresponding to the to-be-recognized image for the frame to be recognized. of;

在一些例子中，所述第二神经网络用于：提取所述待识别图像的图像特征，利用所述图像特征估计用户的视线方向信息。In some examples, the second neural network is used for: extracting image features of the to-be-recognized image, and estimating the user's gaze direction information by using the image features.

所述处理器执行所述利用所述第二神经网络估计所述用户的目标视线方向信息时实现以下步骤：When the processor executes the estimation of the target gaze direction information of the user by using the second neural network, the following steps are implemented:

获取每帧待识别图像的图像特征；每帧待识别图像的图像特征，是该帧待识别图像对应的目标区域所对应的第二神经网络从该帧待识别图像提取得到的；Obtaining the image features of each frame of the image to be recognized; the image feature of each frame of the image to be recognized is extracted from the image to be recognized by the second neural network corresponding to the target area corresponding to the image to be recognized in the frame;

将各帧待识别图像的图像特征融合后估计用户的目标视线方向信息。The target gaze direction information of the user is estimated after the image features of each frame of the image to be recognized are fused.

在一些例子中，所述处理器执行所述将各帧待识别图像的图像特征融合后估计用户的目标视线方向信息时实现以下步骤：In some examples, the processor implements the following steps when estimating the target gaze direction information of the user after merging the image features of each frame of the image to be recognized:

利用所述第三神经网络对各帧待识别图像的图像特征融合后估计用户的目标视线方向信息。Using the third neural network to fuse the image features of the images to be identified in each frame, the user's target line-of-sight direction information is estimated.

在一些例子中，所述第一神经网络用于：提取所述待识别图像的图像特征，利用所述图像特征估计所述待识别图像对应的目标区域。In some examples, the first neural network is used to: extract image features of the to-be-recognized image, and use the image features to estimate a target area corresponding to the to-be-recognized image.

在一些例子中，所述图像特征包括如下任一：人脸区域特征、眼部区域特征或头部姿态特征。In some examples, the image features include any of the following: face region features, eye region features, or head pose features.

在一些例子中，所述孪生网络用于：分别对左眼区域和右眼区域进行空洞卷积操作，以增强网络对图像的感受野。In some examples, the Siamese network is used to perform atrous convolution operations on the left eye region and the right eye region respectively, so as to enhance the network's receptive field of the image.

在一些例子中，所述多帧待识别图像包括：同一拍摄设备在不同时刻对用户拍摄的多帧图像；或者，不同拍摄设备在不同位置对用户拍摄的多帧图像。In some examples, the multiple frames of images to be identified include: multiple frames of images captured by the same photographing device on the user at different times; or multiple frames of images photographed by different photographing devices on the user at different locations.

在一些例子中，所述目标区域，是所述第一神经网络通过识别所述待识别图像中用户的视线方向信息和/或用户的头部姿态信息估计出的。In some examples, the target area is estimated by the first neural network by recognizing the user's gaze direction information and/or the user's head posture information in the to-be-recognized image.

在一些例子中，第一神经网络的结构与所述第二神经模型的结构相同。In some examples, the structure of the first neural network is the same as the structure of the second neural model.

在一些例子中，第一神经网络的参数与所述第二神经模型的参数不同。In some examples, the parameters of the first neural network are different from the parameters of the second neural model.

在一些例子中，第一神经网络为卷积神经网络和/或第二神经网络为卷积神经网络。In some examples, the first neural network is a convolutional neural network and/or the second neural network is a convolutional neural network.

在一些例子中，第三神经网络为循环神经网络。In some examples, the third neural network is a recurrent neural network.

在一些例子中，所述目标区域是多个设定区域中的其中一个。In some examples, the target area is one of a plurality of set areas.

在一些例子中，所述待识别图像包括：对车辆轿厢内驾驶员拍摄的图像。In some examples, the to-be-identified image includes an image taken of a driver in a vehicle cabin.

在一些例子中，所述多个设定区域包括：前挡风玻璃区域、左后视镜区域、右后视镜区域、左车窗区域、右车窗区域、仪表盘区域或中控台区域。In some examples, the plurality of set areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, a dashboard area, or a center console area .

在一些例子中，所述处理器执行所述计算机程序时还实现以下步骤：In some instances, the processor further implements the following steps when executing the computer program:

本实施例还提供了另一种视线估计装置，该装置的结构可如图3所示，所述装置包括处理器、存储器、存储在所述存储器上可被所述处理器执行的计算机程序，所述处理器执行所述计算机程序时实现以下步骤：This embodiment also provides another apparatus for line-of-sight estimation. The structure of the apparatus may be as shown in FIG. 3 . The apparatus includes a processor, a memory, and a computer program stored in the memory and executed by the processor. The processor implements the following steps when executing the computer program:

利用神经网络估计所述用户的目标视线方向信息；其中，所述神经网络用于：提取所述待识别图像中至少一类表征用户头部朝向的目标特征后，将提取的目标特征进行融合，根据特征融合结果估计所述用户的目标视线方向信息。A neural network is used to estimate the target line of sight direction information of the user; wherein, the neural network is used to: after extracting at least one type of target features representing the orientation of the user's head in the to-be-recognized image, fuse the extracted target features, The target gaze direction information of the user is estimated according to the feature fusion result.

所述处理器执行所述利用所述神经网络估计所述用户的目标视线方向信息时，实现以下步骤：When the processor executes the estimation of the target gaze direction information of the user by using the neural network, the following steps are implemented:

对每帧待识别图像，分别利用初始子神经网络获取每帧待识别图像的特征融合结果；所述初始子神经网络用于：提取所述待识别图像中至少一类表征用户头部朝向的目标特征后，将提取的目标特征进行融合；For each frame of the to-be-recognized image, the initial sub-neural network is used to obtain the feature fusion result of each frame of the to-be-recognized image; the initial sub-neural network is used for: extracting at least one type of target representing the orientation of the user's head in the to-be-recognized image After the feature, the extracted target features are fused;

如图4所示，是申请实施例还提供一种车辆400，所述车辆可以包括前述任一种视线估计装置。As shown in FIG. 4 , an embodiment of the application further provides a vehicle 400 , and the vehicle may include any of the aforementioned line-of-sight estimation devices.

本实施例还提供一种计算机可读存储介质，所述可读存储介质上存储有若干计算机指令，所述计算机指令被执行时实现前述图1A所示视线估计方法的步骤。This embodiment further provides a computer-readable storage medium, where several computer instructions are stored thereon, and when the computer instructions are executed, the steps of the line-of-sight estimation method shown in FIG. 1A are implemented.

本实施例还提供一种计算机可读存储介质，其特征在于，所述可读存储介质上存储有若干计算机指令，所述计算机指令被执行时实现前述图2所示视线估计方法的步骤。This embodiment also provides a computer-readable storage medium, characterized in that, the readable storage medium stores several computer instructions, and when the computer instructions are executed, the steps of the line-of-sight estimation method shown in FIG. 2 are implemented.

本说明书实施例可采用在一个或多个其中包含有程序代码的存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。计算机可用存储介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括但不限于：相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带，磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。Embodiments of the present specification may take the form of a computer program product embodied on one or more storage media having program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like. Computer-usable storage media includes permanent and non-permanent, removable and non-removable media, and storage of information can be accomplished by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridges, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

对于装置实施例而言，由于其基本对应于方法实施例，所以相关之处参见方法实施例的部分说明即可。以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。For the apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the partial descriptions of the method embodiments for related parts. The device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in One place, or it can be distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any relationship between these entities or operations. any such actual relationship or sequence exists. The terms "comprising", "comprising" or any other variation thereof are intended to encompass non-exclusive inclusion such that a process, method, article or device comprising a list of elements includes not only those elements, but also other not expressly listed elements, or also include elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element.

以上对本发明实施例所提供的方法和装置进行了详细介绍，本文中应用了具体个例对本发明的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本发明的方法及其核心思想；同时，对于本领域的一般技术人员，依据本发明的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本发明的限制。The methods and devices provided by the embodiments of the present invention have been described in detail above. The principles and implementations of the present invention are described in this paper by using specific examples. At the same time, for those of ordinary skill in the art, according to the idea of the present invention, there will be changes in the specific implementation and application scope. In summary, the content of this description should not be construed as a limitation of the present invention. .

Claims

A line-of-sight estimation method comprising:

Acquiring at least one frame of an image to be identified, the image to be identified includes an image obtained by performing image collection on a user;

The target area corresponding to the image to be recognized is estimated by using the first neural network, the target area is at least one of a plurality of set areas, and the target area represents the area in the image to be recognized that the user's head is facing ;

A second neural network corresponding to the target area is acquired, and the target gaze direction information of the user is estimated by using the second neural network.

The method according to claim 1, wherein the training image set used for training the second neural network corresponding to the set area adopts the image of the set area where the area facing the user's head is.

The method according to claim 1, wherein, in the training image set used for training the first neural network, the area to which the user's head faces in each training image covers the plurality of set areas.

The method according to claim 1, wherein the to-be-identified image has multiple frames;

The using the second neural network to estimate the target gaze direction information of the user includes:

Obtain the line of sight direction information of the user in each frame of the to-be-recognized image; the user's line-of-sight direction information in each frame of the to-be-recognized image is estimated by the second neural network corresponding to the target area corresponding to the to-be-recognized image of the frame of the to-be-recognized image. of;

The line-of-sight direction information corresponding to each frame of the image to be recognized is fused to estimate the user's target line-of-sight direction information.

The method according to claim 1 or 4, wherein the second neural network is used for: extracting image features of the to-be-recognized image, and estimating the user's gaze direction information by using the image features.

The method according to claim 1 or 5, wherein the to-be-identified image has multiple frames;

Obtaining the image features of each frame of the image to be recognized; the image feature of each frame of the image to be recognized is extracted from the image to be recognized by the second neural network corresponding to the target area corresponding to the image to be recognized in the frame;

The target gaze direction information of the user is estimated after the image features of each frame of the image to be recognized are fused.

The method according to claim 6, wherein the estimating the user's target line-of-sight direction information after fusing the image features of each frame of the image to be recognized comprises:

The user's target line of sight direction information is estimated by using the third neural network to fuse the image features of each frame of the image to be recognized.

The method according to claim 1, wherein the first neural network is used to: extract image features of the to-be-recognized image, and use the image features to estimate a target area corresponding to the to-be-recognized image.

The method according to claim 6 or 8, wherein the image features include any one of the following: a face region feature, an eye region feature, or a head pose feature.

The method according to claim 9, wherein the face region feature, the eye region feature or the head pose feature are respectively the face region, the eye region or the head segmented from the to-be-recognized image Correspondingly extracted from the partial area.

The method according to claim 10, wherein the eye region feature includes any one of the following: a binocular feature, a left eye feature or a right eye feature.

The method according to claim 11, wherein the left-eye feature and the right-eye feature are extracted by using a Siamese network.

The method according to claim 12, wherein the twin network is used for: performing hole convolution operations on the left eye region and the right eye region respectively, so as to enhance the receptive field of the network to the image.

The method according to claim 4 or 6, wherein the multiple frames of images to be identified comprise: multiple frames of images captured by the same shooting device at different times of the user; or, images captured by different shooting devices at different locations of the user multi-frame images.

The method according to claim 1 or 8, wherein the target area is estimated by the first neural network by identifying the user's gaze direction information and/or the user's head posture information in the to-be-recognized image out.

The method of claim 1, wherein the structure of the first neural network is the same as the structure of the second neural model.

The method of claim 16, wherein the parameters of the first neural network are different from the parameters of the second neural model.

The method of claim 16, wherein the first neural network is a convolutional neural network. The second neural network is a convolutional neural network.

The method of claim 16, wherein the second neural network is a convolutional neural network.

The method according to claim 7, wherein the third neural network is a recurrent neural network.

The method of claim 1, wherein the target area is one of a plurality of setting areas.

The method according to claim 1, wherein the to-be-identified image comprises: an image taken of a driver in a car of a vehicle.

The method of claim 22, wherein the target gaze direction information is used to assess whether the driver is distracted driving.

The method according to claim 22, wherein the plurality of setting areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, Dashboard area or center console area.

The method according to claim 1, wherein the method further comprises:

Using the target gaze direction information, it is determined that the user's gaze direction deviates from the set direction, and prompt information is output.

The method according to claim 25, wherein the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.

A line-of-sight estimation method, characterized in that the method comprises:

A neural network is used to estimate the target line of sight direction information of the user; wherein, the neural network is used to: after extracting at least one type of target features representing the orientation of the user's head in the to-be-recognized image, fuse the extracted target features, The target gaze direction information of the user is estimated according to the feature fusion result.

The method according to claim 27, wherein the target feature comprises any one of the following: a face region feature, an eye region feature or a head pose feature.

The method according to claim 28, wherein the face region feature, eye region feature or head pose feature are respectively a face region, an eye region or a head region segmented from the to-be-recognized image Correspondingly extracted from the partial area.

The method according to claim 29, wherein the eye region feature comprises any one of the following: a binocular feature, a left eye feature or a right eye feature.

The method according to claim 30, wherein the left-eye feature and the right-eye feature are extracted by using a Siamese network.

The method according to claim 31, wherein the twin network is used for: performing hole convolution operations on the left eye region and the right eye region respectively, so as to enhance the receptive field of the network to the image.

The method according to claim 27, wherein the target line-of-sight direction information of the user comprises: a target area, and the target area represents an area in the to-be-recognized image to which the user's head is facing.

The method according to claim 27, wherein the training data of the neural network adopts the image of the target area in which the area to which the user's head is facing is used.

The method according to claim 27, wherein the information on the target gaze direction of the user comprises: angle information of the user's target gaze direction.

The method according to claim 27, wherein the to-be-identified image has multiple frames;

The using the neural network to estimate the target gaze direction information of the user includes:

Obtaining the line-of-sight direction information of the user in each frame of the to-be-recognized image; the user's line-of-sight direction information in each frame of the to-be-recognized image is estimated by the neural network corresponding to the to-be-recognized image of the frame of the to-be-recognized image;

The method according to claim 27, wherein the image to be recognized has multiple frames; the neural network comprises an initial sub-neural network and a fusion sub-neural network;

For each frame of the to-be-recognized image, the initial sub-neural network is used to obtain the feature fusion result of each frame of the to-be-recognized image; the initial sub-neural network is used for: extracting at least one type of target representing the orientation of the user's head in the to-be-recognized image After the feature, the extracted target features are fused;

Using the fusion sub-neural network, the feature fusion results of each frame of the image to be recognized are fused to estimate the user's target line-of-sight direction information.

The method of claim 36, wherein the neural network is a convolutional neural network.

The method according to claim 37, wherein the initial sub-neural network is a convolutional neural network.

The method of claim 37, wherein the fusion sub-neural network is a recurrent neural network.

The method according to claim 27, wherein the to-be-identified image comprises: an image taken of a driver in the vehicle.

The method of claim 41, wherein the target gaze direction information is used to assess whether the driver is distracted driving.

The method according to claim 41, wherein the plurality of setting areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, Dashboard area or center console area.

The method of claim 42, wherein the method further comprises:

The method according to claim 44, wherein the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.

A line-of-sight estimation device, characterized in that the device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, and the processor implements the following steps when executing the computer program:

The target area corresponding to the image to be recognized is estimated by using the first neural network, the target area is at least one of a plurality of set areas, and the target area represents the area in the image to be recognized toward which the user's head is facing ;

The device according to claim 46, wherein the training image set used for training the second neural network corresponding to the set area adopts the image of the set area where the area facing the user's head is.

The apparatus according to claim 46, wherein, in the training image set used for training the first neural network, the area to which the user's head faces in each training image covers the plurality of set areas.

The device according to claim 46, wherein the to-be-identified image has multiple frames;

When the processor executes the estimation of the target gaze direction information of the user by using the second neural network, the following steps are implemented:

The apparatus according to claim 46 or 49, wherein the second neural network is used for: extracting image features of the to-be-recognized image, and estimating the user's gaze direction information by using the image features.

The device according to claim 46 or 50, wherein the to-be-identified image has multiple frames;

After fusing the image features of the images to be recognized in each frame, the user's target gaze direction information is estimated.

The device according to claim 51, wherein the processor implements the following steps when estimating the target line-of-sight direction information of the user after fusing the image features of each frame of the image to be recognized:

The apparatus according to claim 46, wherein the first neural network is used to: extract image features of the to-be-recognized image, and use the image features to estimate a target area corresponding to the to-be-recognized image.

The device according to claim 50 or 53, wherein the image features include any one of the following: a face region feature, an eye region feature, or a head pose feature.

The device according to claim 54, wherein the face region feature, eye region feature or head pose feature are respectively the face region, eye region or head segmented from the to-be-recognized image Correspondingly extracted from the partial area.

The device according to claim 55, wherein the eye region features include any one of the following: binocular features, left eye features or right eye features.

The device according to claim 56, wherein the left-eye feature and the right-eye feature are extracted by using a Siamese network.

The device according to claim 57, wherein the twin network is used for: performing dilated convolution operations on the left eye region and the right eye region respectively, so as to enhance the receptive field of the network to the image.

The device according to claim 49 or 51, wherein the multiple frames of images to be identified include: multiple frames of images captured by the same shooting device at different times of the user; or, images captured by different shooting devices at different positions of the user multi-frame images.

The device according to claim 46 or 53, wherein the target area is estimated by the first neural network by identifying the user's gaze direction information and/or the user's head posture information in the to-be-recognized image out.

The apparatus of claim 46, wherein the structure of the first neural network is the same as the structure of the second neural model.

61. The apparatus of claim 61, wherein the parameters of the first neural network are different from the parameters of the second neural model.

The apparatus according to claim 61, wherein the first neural network is a convolutional neural network and/or the second neural network is a convolutional neural network.

The apparatus of claim 52, wherein the third neural network is a recurrent neural network.

The apparatus of claim 46, wherein the target area is one of a plurality of set areas.

The device according to claim 46, wherein the to-be-identified image comprises: an image taken of a driver in a car of a vehicle.

66. The apparatus of claim 66, wherein the target gaze direction information is used to assess whether the driver is distracted driving.

The device according to claim 66, wherein the plurality of setting areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, Dashboard area or center console area.

The apparatus according to claim 67, wherein the processor further implements the following steps when executing the computer program:

The device according to claim 69, wherein the output mode of the prompt information includes any one of the following modes: audio, text, video, or lighting.

Use a neural network to estimate the user's target line of sight direction information; wherein, the neural network is used to: after extracting at least one type of target features representing the orientation of the user's head in the to-be-recognized image, fuse the extracted target features, The target gaze direction information of the user is estimated according to the feature fusion result.

The device according to claim 71, wherein the target feature comprises any one of the following: a face region feature, an eye region feature, or a head pose feature.

The device according to claim 72, wherein the face region feature, eye region feature or head pose feature are respectively a face region, an eye region or a head segmented from the to-be-recognized image Correspondingly extracted from the partial area.

The device according to claim 73, wherein the eye region feature comprises any one of the following: a binocular feature, a left eye feature or a right eye feature.

The device according to claim 74, wherein the left-eye feature and the right-eye feature are extracted by using a Siamese network.

The device according to claim 75, wherein the twin network is used for: performing a dilated convolution operation on the left eye region and the right eye region respectively, so as to enhance the receptive field of the network to the image.

The device according to claim 71, wherein the target line-of-sight direction information of the user comprises: a target area, and the target area represents an area in the to-be-recognized image to which the user's head is facing.

The device according to claim 71, wherein the training data of the neural network adopts an image of the target area in which the area to which the user's head is facing is used.

The device according to claim 71, wherein the target gaze direction information of the user comprises: angle information of the user's target gaze direction.

The device according to claim 71, wherein the to-be-identified image has multiple frames;

When the processor executes the estimation of the target gaze direction information of the user by using the neural network, the following steps are implemented:

The device according to claim 71, wherein the image to be recognized has multiple frames; the neural network comprises an initial sub-neural network and a fusion sub-neural network;

For each frame of the to-be-recognized image, the initial sub-neural network is used to obtain the feature fusion result of each frame of the to-be-recognized image; the initial sub-neural network is used to: extract at least one type of target representing the user's head orientation in the to-be-recognized image After the feature, the extracted target features are fused;

The fusion sub-neural network is used to fuse the feature fusion results of the images to be identified in each frame to estimate the user's target line of sight direction information.

The apparatus of claim 71, wherein the neural network is a convolutional neural network.

The apparatus according to claim 81, wherein the initial sub-neural network is a convolutional neural network.

The apparatus according to claim 81, wherein the fusion sub-neural network is a recurrent neural network.

The device according to claim 71, wherein the to-be-identified image comprises: an image taken of a driver in a vehicle.

84. The apparatus of claim 84, wherein the target gaze direction information is used to assess whether the driver is distracted driving.

The device according to claim 84, wherein the plurality of setting areas include: a front windshield area, a left rearview mirror area, a right rearview mirror area, a left window area, a right window area, Dashboard area or center console area.

The apparatus according to claim 71, wherein the processor further implements the following steps when executing the computer program:

The device according to claim 88, wherein the output mode of the prompt information includes any one of the following modes: audio, text, video or lighting.

A vehicle, characterized in that the vehicle comprises the line-of-sight estimation device according to any one of claims 46 to 70 or the line-of-sight estimation device according to any one of claims 71 to 89 .

A computer-readable storage medium, characterized in that, several computer instructions are stored on the readable storage medium, and when the computer instructions are executed, the steps of the method of any one of claims 1 to 26 are implemented.

A computer-readable storage medium, characterized in that, a plurality of computer instructions are stored on the readable storage medium, and when the computer instructions are executed, the steps of the method according to any one of claims 27 to 45 are implemented.