CN114495193A - Head posture estimation method, device and equipment and readable storage medium - Google Patents
Head posture estimation method, device and equipment and readable storage medium
- Publication number
- CN114495193A (application CN202111521665.3A)
- Authority
- CN
- China
- Prior art keywords
- attitude
- attitude angle
- head
- neural network
- prediction result
- Prior art date
- Legal status: Pending
Classifications
- G06N3/045: Combinations of networks (hierarchy: G Physics › G06 Computing or calculating; counting › G06N Computing arrangements based on specific computational models › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology)
- G06N3/08: Learning methods (same hierarchy through G06N3/02 Neural networks)
Abstract
The invention discloses a head attitude estimation method. A convolutional neural network is called to predict the head attitude of the current image frame, and a recurrent neural network is called to predict the attitude according to the time sequence change rule of historical predicted attitude information, deriving the temporal characteristics of the current head attitude angle from the change rule of the historical head attitude angle sequence so as to obtain a relatively reliable head attitude prediction result; an attitude angle prediction result is finally generated from the first attitude angle prediction result and the second attitude angle prediction result. Therefore, when the head attitude prediction of the convolutional neural network deviates, the method can still generate a relatively accurate head attitude prediction value by analyzing the continuous change rule of the head attitude, and the recurrent neural network can correct the prediction value of the convolutional neural network, improving the robustness and reliability of the whole estimation process. The invention also discloses a head attitude estimation device, equipment and a readable storage medium, which have corresponding technical effects.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for estimating a head pose.
Background
With the development of computer vision technology, head pose estimation plays an important role in vision applications such as three-dimensional reconstruction, multimedia content manipulation, and user interaction. The head pose estimation task is to predict the three Euler angles of the head in three-dimensional space, namely the pitch angle (Pitch), the roll angle (Roll) and the yaw angle (Yaw), as shown in fig. 1.
Existing head pose estimation methods are mainly divided into head pose estimation based on face key points and head pose estimation based on two-dimensional images.
In the method based on face key points, two-dimensional face key points are first obtained through deep learning, and the head pose is then recovered from the correspondence between the face key points and a three-dimensional head model. Although the performance of face key point detection algorithms has improved greatly with progress in deep learning research, key point predictions still suffer from errors, missing points and the like, which directly degrades the subsequent head pose estimation against the three-dimensional model. Furthermore, the quality of the three-dimensional model itself also largely affects head pose estimation performance. This two-step strategy therefore increases the probability of error in head pose estimation, making the method less reliable.
The method based on two-dimensional images adopts a single-step strategy: image features are extracted directly by a deep learning network, and the head pose is then obtained through a final classification network or regression network. This reduces the probability of error and is slightly more reliable than the two-step strategy. However, in a continuous head pose estimation task, this method inevitably produces large prediction deviations on some frames, and these abnormal estimates adversely affect downstream tasks that depend on the head attitude angle.
In summary, how to improve the robustness and reliability of head pose estimation is a technical problem that needs to be solved urgently by those skilled in the art at present.
Disclosure of Invention
The object of the invention is to provide a head pose estimation method, apparatus and device and a readable storage medium, so as to improve the robustness and reliability of head pose estimation.
In order to solve the technical problems, the invention provides the following technical scheme:
a head pose estimation method, comprising:
receiving a current image frame to be detected, and calling a convolutional neural network to predict the head attitude of the image frame to be detected as a first attitude angle prediction result;
receiving historical predicted attitude information, and calling a recurrent neural network to predict an attitude angle according to an attitude time sequence change rule of the historical predicted attitude information, to serve as a second attitude angle prediction result; wherein the images from which the historical predicted attitude information was predicted and the image frame to be detected come from a continuous motion sequence captured for the same object;
and generating an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result.
Optionally, the invoking of the convolutional neural network to perform head pose prediction on the image frame includes:
calling the convolutional neural network to extract deep spatial features of the image, obtaining spatial features;
mapping the spatial features to a dimensional space corresponding to each head attitude angle to obtain angle features; wherein the head attitude angles comprise: the pitch angle, the roll angle and the yaw angle;
and performing angle classification according to the angle features to obtain a first attitude angle prediction result.
Optionally, the invoking of the convolutional neural network to extract deep spatial features of the image includes: calling ResNet to extract the deep spatial features of the image.
Optionally, the invoking of the recurrent neural network to perform attitude angle prediction according to the attitude time sequence change rule of the historical predicted attitude information includes:
receiving historical predicted attitude information, and calling a recurrent neural network to extract time sequence features of the historical head attitude angles to obtain time sequence features;
and mapping the time sequence features to an attitude angle label space to obtain a second attitude angle prediction result.
Optionally, the invoking of a recurrent neural network to extract time sequence features of the historical head attitude angles includes:
calling an LSTM network to extract the time sequence features of the historical head attitude angles.
Optionally, the generating of an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result includes:
performing weighted summation on the first attitude angle prediction result and the second attitude angle prediction result, and taking the weighted sum as the attitude angle prediction result.
A head pose estimation apparatus, comprising:
the first prediction module is used for receiving the current image frame to be detected and calling a convolutional neural network to predict the head attitude of the image frame to be detected as a first attitude angle prediction result;
the second prediction module is used for receiving historical predicted attitude information, and calling a recurrent neural network to predict an attitude angle according to the attitude time sequence change rule of the historical predicted attitude information, as a second attitude angle prediction result; wherein the images from which the historical predicted attitude information was predicted and the image frame to be detected come from a continuous motion sequence captured for the same object;
and the result generation module is used for generating an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result.
Optionally, the second prediction module comprises:
the characteristic extraction submodule is used for receiving historical predicted attitude information and calling a recurrent neural network to extract time sequence characteristics of the historical head attitude angle to obtain time sequence characteristics;
and the characteristic mapping submodule is used for mapping the time sequence characteristics to an attitude angle label space to obtain a second attitude angle prediction result.
A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the above-described head pose estimation method when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned head pose estimation method.
The method provided by the embodiment of the invention calls a convolutional neural network to predict the head attitude of the current image frame and calls a recurrent neural network to predict the attitude angle according to the time sequence change rule of historical predicted attitude information, deriving the temporal characteristics of the current head attitude angle from the change rule of the historical head attitude angle sequence so as to obtain a relatively reliable head attitude prediction result, and finally generates an attitude angle prediction result from the first attitude angle prediction result and the second attitude angle prediction result. Therefore, when the head attitude prediction of the convolutional neural network deviates, the method can still generate a relatively accurate head attitude prediction value by analyzing the continuous change rule of the head attitude, and the recurrent neural network can correct the prediction value of the convolutional neural network, improving the robustness and reliability of the whole estimation process and effectively overcoming the abnormal-prediction-value problem of existing head pose estimation methods.
Accordingly, embodiments of the present invention further provide a head pose estimation apparatus, a device and a readable storage medium corresponding to the head pose estimation method, which have the above technical effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings used in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a head pose angle according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for estimating a head pose according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a head pose estimation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of data flow of a head pose estimation apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a head pose estimation method to improve the robustness and reliability of head pose estimation.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 2, fig. 2 is a flowchart of a head pose estimation method according to an embodiment of the present invention, the method includes the following steps:
s101, receiving a current image frame to be detected, and calling a convolutional neural network to predict the head attitude of the image frame to be detected to be used as a first attitude angle prediction result;
the input current image frame to be detected is a single frame image in a video, and the video is a head continuous motion sequence image collected and generated aiming at a certain object.
After the current image frame to be detected is received, the convolutional neural network is called to predict the head attitude of the frame: the network extracts the deep spatial features of the input image, predicts the head attitude according to those features, and the generated prediction result is taken as the first attitude angle prediction result. This embodiment does not limit the specific implementation steps of calling the convolutional neural network for head attitude prediction; reference may be made to existing head pose estimation methods based on two-dimensional images, which are not repeated here.
In addition, this step does not limit the type of convolutional neural network that is called; any convolutional neural network capable of extracting features may be used, including but not limited to ResNet (Residual Network), VGG, GoogLeNet, DenseNet (Densely Connected Convolutional Network), MobileNet (a lightweight deep neural network), ShuffleNet (an efficient lightweight network) and other feature extraction networks.
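For illustration only, a minimal sketch of such an interchangeable feature-extraction backbone is given below; the use of PyTorch with torchvision, the ResNet-18 and MobileNetV2 variants, and the 224×224 input size are assumptions of the example rather than requirements of the method.

```python
# Hypothetical sketch of an interchangeable CNN backbone (assumptions: PyTorch
# and torchvision; any network yielding a flat feature vector would serve).
import torch
import torch.nn as nn
import torchvision.models as models

def build_backbone(name: str = "resnet18") -> tuple[nn.Module, int]:
    """Return a feature extractor and the dimensionality of its output."""
    if name == "resnet18":
        net = models.resnet18(weights=None)
        dim = net.fc.in_features           # 512 for ResNet-18
        net.fc = nn.Identity()             # strip the classifier, keep features
    elif name == "mobilenet_v2":
        net = models.mobilenet_v2(weights=None)
        dim = net.classifier[1].in_features
        net.classifier = nn.Identity()
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return net, dim

backbone, feat_dim = build_backbone("resnet18")
frame = torch.randn(1, 3, 224, 224)        # the current frame F_n
features = backbone(frame)                 # spatial features, shape (1, feat_dim)
```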
S102, receiving historical predicted attitude information, and calling a recurrent neural network to predict an attitude angle according to an attitude time sequence change rule of the historical predicted attitude information to serve as a second attitude angle prediction result;
The images from which the historical predicted attitude information was predicted and the image frame to be detected come from a continuous motion sequence captured for the same object.
the historical predicted attitude information may be an attitude angle prediction result (for example, a specific angle predicted by a convolutional neural network), or may also be a historical image feature (an image feature value at any stage before a convolutional neural network classifier). In addition, the historical predicted posture information may be a historical predicted result of the convolutional neural network, or may be historical posture information that can be obtained by any other means, and the generation mode of the historical predicted posture information is not limited in this embodiment, as long as the posture information used for the training of the convolutional network and the actual test is the same data type. If the recurrent neural network is trained with specific angle values, any other means, including but not limited to convolutional neural networks, can obtain the specific angle values as input.
After the historical predicted attitude information is obtained, the recurrent neural network is called to predict the attitude angle according to the time sequence change rule of that information. In a video stream or continuous image sequence, the head angle changes continuously; a recurrent neural network can learn the change rule of time series data, analyze the time sequence features of the historical attitude information, and derive the temporal characteristics of the current head attitude angle from the change rule of the historical head attitude angle sequence, thereby predicting the head attitude angle. The prediction result is taken as the second attitude angle prediction result. The specific implementation steps of calling the recurrent neural network for this prediction are not limited in this embodiment; reference may be made to related techniques for time sequence feature analysis with recurrent neural networks, which are not repeated here.
This step calls a recurrent neural network for predictive analysis. The specific network type is not limited in this embodiment and includes, but is not limited to, the LSTM network (Long Short-Term Memory network), the RNN (Recurrent Neural Network), the GRU (Gated Recurrent Unit) and other sequence generation networks; the network model may be selected according to specific needs, which is not repeated here.
It should be noted that this embodiment does not limit the execution order of step S101 and step S102: step S101 may be executed first, step S102 may be executed first, or the two may be executed simultaneously, and the order may be set according to actual use requirements. Fig. 2 takes executing step S101 before step S102 as an example; for other execution orders, reference may be made to the description of this embodiment, which is not repeated here.
S103, generating an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result.
After the first attitude angle prediction result from the convolutional neural network and the second attitude angle prediction result from the recurrent neural network are obtained, the two are fused to generate the attitude angle prediction result. With this arrangement, when the head attitude prediction of the convolutional neural network deviates significantly, the recurrent neural network can still generate a relatively accurate head attitude prediction value from the continuous change rule of the head attitude and thereby correct the first attitude angle prediction result generated by the convolutional neural network. The combination of the two therefore effectively overcomes the abnormal-prediction-value problem of existing head pose estimation methods.
The result fusion manner for generating the attitude angle prediction result from the first attitude angle prediction result and the second attitude angle prediction result is not limited in this embodiment: the results may be fused by weighting, or by K-T transform, wavelet transform and the like, and the fusion ratio and fusion manner may be determined according to the actual detection accuracy, which is not repeated here.
Based on the above introduction, the technical scheme provided by the embodiment of the invention calls a convolutional neural network to predict the head attitude of the current image frame and calls a recurrent neural network to predict the attitude angle according to the time sequence change rule of historical predicted attitude information, deriving the temporal characteristics of the current head attitude angle from the change rule of the historical head attitude angle sequence so as to obtain a relatively reliable head attitude prediction result, and finally generates the attitude angle prediction result from the first and second attitude angle prediction results. Therefore, when the head attitude prediction of the convolutional neural network deviates, the method can still generate a relatively accurate head attitude prediction value by analyzing the continuous change rule of the head attitude, and the recurrent neural network can correct the prediction value of the convolutional neural network, improving the robustness and reliability of the whole estimation process and effectively overcoming the abnormal-prediction-value problem of existing head pose estimation methods.
It should be noted that, based on the above embodiments, the embodiments of the present invention also provide corresponding improvements. In the preferred/improved embodiment, the same steps as those in the above embodiment or corresponding steps may be referred to each other, and corresponding advantageous effects may also be referred to each other, which are not described in detail in the preferred/improved embodiment herein.
In the above embodiment, the specific implementation step of calling the convolutional neural network to perform head pose prediction on the image frame is not limited, and a prediction method is introduced in this embodiment, which specifically includes the following steps:
(1) calling the convolutional neural network to extract deep spatial features of the image, obtaining spatial features;
For the process of extracting deep spatial features with a convolutional neural network, reference may be made to feature extraction with convolutional neural networks in the related art, which is not repeated here.
The specific network type of the convolutional neural network is not limited; optionally, the deep spatial features of the input image may be extracted by ResNet. The residual structure of ResNet allows the network layers to be deeper, so that the network can extract richer image semantic information and achieve a more accurate evaluation. It should be noted that ResNet in this embodiment may be replaced by other convolutional neural networks capable of extracting features, including but not limited to VGG, GoogLeNet, DenseNet, MobileNet, ShuffleNet and other feature extraction networks.
(2) Mapping the spatial features to a dimensional space corresponding to each head attitude angle to obtain angle features;
mapping the image features extracted by the CNN to a dimensional space corresponding to three head pose angles, wherein the head pose angles comprise: pitch angle, roll angle, and yaw angle. This step can be implemented by calling three parallel fully-connected neural network layers, which is not limited herein.
(3) performing angle classification according to the angle features to obtain a first attitude angle prediction result.
Angle prediction is solved as a classification problem: every n degrees is treated as one class, the continuous head attitude angle range (-90° to 90°) is discretized into several classes, and the corresponding attitude angle is determined by determining the class to which the angle belongs. The three angle features corresponding to image F_n are passed through a Softmax classifier to obtain the three head attitude angles (Y_n^{cnn}, P_n^{cnn}, R_n^{cnn}). Here the bin width n may be set according to actual classification requirements; for example, every 5° may be set as one class, discretizing the continuous head attitude angles (-90° to 90°) into 36 classes.
It should be noted that in this embodiment the head attitude angles of the convolutional neural network are predicted with a classifier; a regressor may be used instead to generate the head attitude angles directly.
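As an illustration of the classification head just described, the sketch below uses three parallel fully-connected layers, 5-degree bins over the range -90° to 90° (36 classes), and decodes the winning class to its bin centre; the PyTorch implementation and the 512-dimensional input are assumptions of the example. A regressor variant would simply replace the softmax classification with a single linear output per angle.

```python
# Hypothetical sketch of the angle classification head (assumptions: PyTorch,
# a 512-dimensional feature vector, 5-degree bins over [-90, 90)).
import torch
import torch.nn as nn

NUM_BINS, BIN_WIDTH, ANGLE_MIN = 36, 5.0, -90.0

class AngleHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Three parallel fully-connected layers, one per head attitude angle.
        self.fc_yaw = nn.Linear(feat_dim, NUM_BINS)
        self.fc_pitch = nn.Linear(feat_dim, NUM_BINS)
        self.fc_roll = nn.Linear(feat_dim, NUM_BINS)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        logits = torch.stack(
            [self.fc_yaw(feats), self.fc_pitch(feats), self.fc_roll(feats)],
            dim=1,
        )                                        # (batch, 3, NUM_BINS)
        probs = torch.softmax(logits, dim=-1)    # Softmax classifier
        bins = probs.argmax(dim=-1).float()      # winning class per angle
        # Decode each class index to the centre of its 5-degree bin.
        return ANGLE_MIN + (bins + 0.5) * BIN_WIDTH   # (batch, 3) = (Y, P, R)

head = AngleHead(512)
angles_cnn = head(torch.randn(1, 512))   # first attitude angle prediction, degrees
```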
The convolutional-neural-network-based head attitude prediction method introduced in this embodiment takes the single frame image at the current time as input and maps spatial features to angle features for angle classification, thereby predicting the three head attitude angles at the current time, namely the pitch angle (P), the roll angle (R) and the yaw angle (Y). The calculation is simple, easy to implement, and ensures complete analysis of the extracted feature information.
In this embodiment, only the above head pose prediction method based on the convolutional neural network is described, and other implementation manners can refer to the description of this embodiment, which is not described herein again.
In addition, the specific implementation steps of the attitude angle prediction method based on the recurrent neural network in the foregoing embodiment are not limited, and for deepening understanding, the prediction method described in this embodiment includes the following steps:
(1) receiving historical predicted attitude information, and calling a recurrent neural network to extract time sequence characteristics of historical head attitude angles to obtain time sequence characteristics;
assume that the overall prediction result obtained by predicting the head attitude angle at the historical time (time 1 to n-1) in the video stream is:
{α_1, …, α_i, …, α_{n-1}} = {(Y_1, P_1, R_1), …, (Y_i, P_i, R_i), …, (Y_{n-1}, P_{n-1}, R_{n-1})}
and calling a recurrent neural network to extract the time sequence characteristics of the historical prediction result, wherein the process of extracting the time sequence characteristics can refer to the time sequence characteristic extraction mode of the recurrent neural network in the related technology, and the details are not repeated. It should be noted that, in this embodiment, only the prediction result is taken as an example for description, and other prediction information types can refer to the description of this embodiment, which is not described herein again.
Alternatively, an LSTM network may be employed to extract temporal features of the head pose angle. In addition, an attention mechanism can be introduced into the LSTM network to fully utilize the output characteristics of the LSTM network at each moment and strengthen the characteristic extraction result.
It should be noted that, in this embodiment, the LSTM network may be replaced by other recurrent neural networks, such as sequence generation networks like RNN and GRU, which is not limited herein.
(2) mapping the time sequence features to the attitude angle label space to obtain a second attitude angle prediction result.
The time sequence features extracted by the recurrent neural network are mapped to the attitude angle label space, yielding the head attitude angle prediction result (Y_n^{rnn}, P_n^{rnn}, R_n^{rnn}). The spatial mapping of the features may be implemented by calling a fully-connected neural network layer; the implementation manner is not limited in this embodiment.
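A minimal sketch of this recurrent branch is given below, assuming PyTorch, a single-layer LSTM with hidden size 64, and historical poses supplied as (Y, P, R) triples; only the features at the last time step are mapped to the label space, matching step (2) above.

```python
# Hypothetical sketch of the recurrent prediction branch (assumptions: PyTorch,
# hidden size 64, and history given as a sequence of (Y, P, R) angle triples).
import torch
import torch.nn as nn

class PoseSequencePredictor(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 3)   # map temporal features to (Y, P, R)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, n-1, 3) -- attitude angles at times 1 .. n-1
        out, _ = self.lstm(history)      # temporal features at every time step
        return self.fc(out[:, -1])       # second attitude angle prediction

rnn = PoseSequencePredictor()
history = torch.randn(1, 10, 3)          # ten past (Y, P, R) predictions
angles_rnn = rnn(history)                # predicted pose for the current frame
```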
The recurrent-neural-network-based head attitude prediction method introduced in this embodiment directly performs the mapping to the attitude angle space after the time sequence features are extracted, and the implementation steps are simple. Only the above method is described in this embodiment; other implementation manners can refer to this description and are not repeated here.
In the foregoing embodiment, the result fusion manner for generating the attitude angle prediction result from the first and second attitude angle prediction results is not limited. To deepen understanding, this embodiment introduces one result fusion method: the first attitude angle prediction result and the second attitude angle prediction result are weighted and summed, and the weighted sum is taken as the attitude angle prediction result.
Assume that for the image F_n at the current time, the head attitude angle predicted by the convolutional neural network is (Y_n^{cnn}, P_n^{cnn}, R_n^{cnn}) and the head attitude angle predicted by the recurrent neural network is (Y_n^{rnn}, P_n^{rnn}, R_n^{rnn}). The two are weighted to obtain the overall head attitude angle prediction result α_n = (Y_n, P_n, R_n):

Y_n = λ_Y · Y_n^{cnn} + (1 - λ_Y) · Y_n^{rnn}
P_n = λ_P · P_n^{cnn} + (1 - λ_P) · P_n^{rnn}
R_n = λ_R · R_n^{cnn} + (1 - λ_R) · R_n^{rnn}

In the formulas, λ_Y, λ_P and λ_R are the weight coefficients corresponding to the yaw angle Y, the pitch angle P and the roll angle R.
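A minimal sketch of this weighted summation, following the per-angle form above; the example weights of 0.7 are placeholders that would in practice be tuned against the convolutional network's observed prediction error.

```python
# Minimal sketch of per-angle weighted fusion (assumption: the 0.7 weights are
# illustrative placeholders, one coefficient per attitude angle).
def fuse(cnn_pred, rnn_pred, weights=(0.7, 0.7, 0.7)):
    """Weighted sum of two (Y, P, R) predictions, one weight per angle."""
    return tuple(
        w * c + (1.0 - w) * r
        for w, c, r in zip(weights, cnn_pred, rnn_pred)
    )

alpha_n = fuse((10.2, -3.5, 1.0), (9.6, -2.8, 0.7))  # fused (Y_n, P_n, R_n)
```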
The result fusion method introduced in this embodiment adopts a weighted summation method, in which the weights of the two parts can be adjusted correspondingly according to the prediction error of the convolutional neural network, and the method is flexible. In this embodiment, only the data fusion method is described, and other implementation methods can refer to the description of this embodiment, which is not described herein again.
Corresponding to the above method embodiments, the present invention further provides a head pose estimation device, and the head pose estimation device described below and the head pose estimation method described above may be referred to in correspondence with each other.
Referring to fig. 3, the apparatus includes the following modules:
the first prediction module 110 is mainly configured to receive an image frame to be detected currently, and call a convolutional neural network to perform head pose prediction on the image frame as a first pose angle prediction result;
the second prediction module 120 is mainly configured to receive historical predicted attitude information and invoke a recurrent neural network to perform attitude angle prediction according to the attitude time sequence change rule of that information, obtaining a second attitude angle prediction result; the images from which the historical predicted attitude information was predicted and the image frame come from a continuous motion sequence captured for the same object;
the result generation module 130 is mainly configured to generate an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result.
In one embodiment of the present invention, the second prediction module comprises:
the characteristic extraction submodule is used for receiving historical predicted attitude information and calling a recurrent neural network to extract time sequence characteristics of historical head attitude angles to obtain time sequence characteristics;
and the characteristic mapping submodule is used for mapping the time sequence characteristics to an attitude angle label space to obtain a second attitude angle prediction result.
In the foregoing embodiment, the module structures of the first prediction module and the second prediction module are not limited, and a specific module construction form is described in this embodiment for the purpose of enhancing understanding.
The head pose estimation device proposed by this embodiment mainly includes three parts: a convolutional neural network prediction part (the first prediction module), a recurrent neural network prediction part (the second prediction module) and a weighted fusion part (the result generation module), as shown in fig. 4. The convolutional neural network prediction part can be divided into four modules: an input module, an image feature extraction module, an angle feature generation module and a classifier module; the recurrent neural network prediction part can be divided into an input module, a sequence feature extraction module and a regressor module. Each part is described in detail below.
1. The convolutional neural network prediction part mainly comprises an input module, an image feature extraction module, an angle feature generation module and a classifier module.
The three head attitude angles at the current time, namely the pitch angle (P), the roll angle (R) and the yaw angle (Y), are predicted from the single frame image input at the current time. Specifically:
(1) Input module.
The input of the convolutional neural network prediction part is a single frame image of the video; let the input image at the current time (time n) be F_n.
(2) Image spatial feature extraction module.
This module is a deep convolutional neural network (CNN) module, which extracts the deep spatial features of the input F_n by ResNet. The residual structure of ResNet allows the network layers to be deeper, so that the network can extract richer image semantic information.
It should be noted that in this technical solution the image feature extraction module of the convolutional neural network may be replaced by other convolutional neural networks capable of extracting features, including but not limited to VGG, GoogLeNet, DenseNet, MobileNet, ShuffleNet and other feature extraction networks.
(3) Angle feature generation module.
This module consists of three parallel fully-connected neural network layers, which map the image features extracted by the CNN to the dimensional spaces of the three angle categories.
(4) Classifier module.
Here, angle prediction is solved as a classification problem: with every 5° as one class, the continuous head attitude angle range (-90° to 90°) is discretized into 36 classes. The three angle features corresponding to image F_n are passed through a Softmax classifier to obtain the three head attitude angles (Y_n^{cnn}, P_n^{cnn}, R_n^{cnn}).
It should be noted that, in the present technical solution, the head pose angle of the convolutional neural network is predicted by using a classifier, or may be directly generated by using a regressor instead.
2. The recurrent neural network prediction part consists of an input module, a sequence feature extraction module and a regressor module.
Assume that the overall prediction results of the head attitude angles at the historical times (times 1 to n-1) in the video stream are {α_1, …, α_i, …, α_{n-1}} = {(Y_1, P_1, R_1), …, (Y_i, P_i, R_i), …, (Y_{n-1}, P_{n-1}, R_{n-1})}. The head attitude angle prediction flow of the recurrent neural network is then as follows:
(1) and an input module.
The input of the recurrent neural network is the sequence of historical head attitude angle predictions {α_1, …, α_i, …, α_{n-1}} = {(Y_1, P_1, R_1), …, (Y_i, P_i, R_i), …, (Y_{n-1}, P_{n-1}, R_{n-1})}, where α_i is obtained by weighting the convolutional neural network prediction (Y_i^{cnn}, P_i^{cnn}, R_i^{cnn}) and the recurrent neural network prediction (Y_i^{rnn}, P_i^{rnn}, R_i^{rnn}) at time i.
(2) Time sequence feature extraction module.
The module extracts the time sequence characteristics of the head attitude angle by adopting an LSTM network. In addition, an attention mechanism can be introduced into the LSTM network to fully utilize the output characteristics of the LSTM network at each moment and strengthen the characteristic extraction result.
It should be noted that the LSTM network in the present technical solution may be replaced by other recurrent neural networks, such as RNN, GRU, and other sequence generation networks.
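One possible form of the attention mechanism mentioned above is sketched below, assuming PyTorch and a single learned scoring vector over the LSTM outputs at every moment; the patent does not fix a particular attention formulation, so this is an illustrative choice only.

```python
# Hypothetical sketch of attention over the LSTM outputs at all time steps
# (assumption: a simple learned per-step relevance score).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.score = nn.Linear(hidden, 1)   # one relevance score per time step

    def forward(self, lstm_out: torch.Tensor) -> torch.Tensor:
        # lstm_out: (batch, steps, hidden) -- LSTM output at every moment
        weights = torch.softmax(self.score(lstm_out), dim=1)  # (batch, steps, 1)
        return (weights * lstm_out).sum(dim=1)  # weighted sum over time steps
```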
(3) Regressor module.
This module adopts a fully-connected neural network layer, which maps the time sequence features extracted by the LSTM network to the attitude angle label space, yielding the head attitude angle prediction result (Y_n^{rnn}, P_n^{rnn}, R_n^{rnn}).
3. The weighted fusion part weights the prediction results of the two networks to obtain the final head attitude prediction result.
As shown in fig. 4, the image F_n at the current time yields the head attitude angle prediction (Y_n^{cnn}, P_n^{cnn}, R_n^{cnn}) through the convolutional neural network, and the recurrent neural network yields the prediction (Y_n^{rnn}, P_n^{rnn}, R_n^{rnn}). The two are weighted to obtain the overall head attitude angle prediction result α_n = (Y_n, P_n, R_n):

Y_n = λ_Y · Y_n^{cnn} + (1 - λ_Y) · Y_n^{rnn}
P_n = λ_P · P_n^{cnn} + (1 - λ_P) · P_n^{rnn}
R_n = λ_R · R_n^{cnn} + (1 - λ_R) · R_n^{rnn}

In the formulas, λ_Y, λ_P and λ_R are the weight coefficients corresponding to the yaw angle Y, the pitch angle P and the roll angle R, respectively.
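To illustrate how the three parts cooperate at run time, the loop below chains the hypothetical components sketched earlier (backbone, AngleHead, PoseSequencePredictor, fuse); the warm-up rule of falling back to the convolutional prediction while the history is still short is an assumption of the example.

```python
# Illustrative per-frame loop combining the sketches above (backbone, head,
# rnn and fuse are the hypothetical components defined earlier; the warm-up
# fallback for a short history is an assumption, not part of the patent).
import torch

history: list[tuple[float, float, float]] = []   # fused (Y, P, R) per frame

def estimate_pose(frame: torch.Tensor) -> tuple[float, float, float]:
    feats = backbone(frame)                      # spatial features of F_n
    cnn_pred = tuple(head(feats)[0].tolist())    # first prediction
    if len(history) < 2:                         # not enough history yet
        fused = cnn_pred                         # fall back to the CNN alone
    else:
        seq = torch.tensor([history], dtype=torch.float32)
        rnn_pred = tuple(rnn(seq)[0].tolist())   # second prediction
        fused = fuse(cnn_pred, rnn_pred)         # weighted fusion
    history.append(fused)                        # feed back for time n + 1
    return fused
```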
In this embodiment, only the above device setting manner is taken as an example for description, and the module setting manner based on other prediction principles can refer to the description of this embodiment, and will not be described herein again.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a computer device, and a computer device described below and a head pose estimation method described above may be referred to in correspondence with each other.
The computer device includes:
a memory for storing a computer program;
a processor for implementing the steps of the head pose estimation method of the above method embodiments when executing the computer program.
Specifically, referring to fig. 5, which shows the structure of the computer device provided by this embodiment: the computer device may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 322 and a memory 332 in which one or more computer applications 342 or data 344 are stored. The memory 332 may be transient or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may comprise a series of instruction operations on a data processing device. Further, the central processor 322 may be configured to communicate with the memory 332 and execute, on the computer device 301, the series of instruction operations in the memory 332.
The computer device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input-output interfaces 358, and/or one or more operating systems 341.
The steps in the head pose estimation method described above may be implemented by the structure of a computer device.
Corresponding to the above method embodiment, the present invention further provides a readable storage medium, and a readable storage medium described below and a head pose estimation method described above may be referred to with respect to each other.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the head pose estimation method of the above-mentioned method embodiments.
The readable storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other readable storage media capable of storing program codes.
Those of skill would further appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Claims (10)
1. A method of head pose estimation, comprising:
receiving a current image frame to be detected, and calling a convolutional neural network to predict the head attitude of the image frame to be detected as a first attitude angle prediction result;
receiving historical predicted attitude information, and calling a recurrent neural network to predict an attitude angle according to an attitude time sequence change rule of the historical predicted attitude information, to serve as a second attitude angle prediction result; wherein the images from which the historical predicted attitude information was predicted and the image frame to be detected come from a continuous motion sequence captured for the same object;
and generating an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result.
2. The head pose estimation method of claim 1, wherein said invoking a convolutional neural network for head pose prediction of the image frame comprises:
calling a convolutional neural network to extract deep spatial features of the image, obtaining spatial features;
mapping the spatial features to a dimensional space corresponding to each head attitude angle to obtain angle features; wherein the head pose angle comprises: pitch, roll and yaw angles;
and carrying out angle classification according to the angle characteristics to obtain a first attitude angle prediction result.
3. The head pose estimation method of claim 2, wherein said invoking of a convolutional neural network to extract deep spatial features of the image comprises: calling ResNet to extract the deep spatial features of the image.
4. The head pose estimation method of claim 1, wherein said invoking of a recurrent neural network to perform attitude angle prediction according to the attitude time sequence change rule of the historical predicted attitude information comprises:
receiving historical predicted attitude information, and calling a recurrent neural network to extract time sequence characteristics of the historical head attitude angle to obtain time sequence characteristics;
and mapping the time sequence characteristics to an attitude angle label space to obtain a second attitude angle prediction result.
5. The head pose estimation method of claim 1, wherein said invoking a recurrent neural network for temporal feature extraction of the historical head pose angles comprises:
and calling an LSTM network to extract the time sequence characteristics of the historical head attitude angle.
6. The head pose estimation method of claim 1, wherein said generating pose angle predictions from said first pose angle predictions and said second pose angle predictions, comprises:
and performing weighted summation on the first attitude angle prediction result and the second attitude angle prediction result, and taking the weighted sum as the attitude angle prediction result.
7. A head pose estimation apparatus, comprising:
the first prediction module is used for receiving the current image frame to be detected and calling a convolutional neural network to predict the head attitude of the image frame to be detected as a first attitude angle prediction result;
the second prediction module is used for receiving historical predicted attitude information, and calling a recurrent neural network to predict an attitude angle according to an attitude time sequence change rule of the historical predicted attitude information, as a second attitude angle prediction result; wherein the images from which the historical predicted attitude information was predicted and the image frame come from a continuous motion sequence captured for the same object;
and the result generation module is used for generating an attitude angle prediction result according to the first attitude angle prediction result and the second attitude angle prediction result.
8. The head pose estimation apparatus of claim 7, wherein the second prediction module comprises:
the characteristic extraction submodule is used for receiving historical predicted attitude information and calling a recurrent neural network to extract time sequence characteristics of the historical head attitude angle to obtain time sequence characteristics;
and the characteristic mapping submodule is used for mapping the time sequence characteristics to an attitude angle label space to obtain a second attitude angle prediction result.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the head pose estimation method according to any of claims 1 to 6 when executing said computer program.
10. A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the head pose estimation method according to any one of claims 1 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111521665.3A | 2021-12-13 | 2021-12-13 | Head posture estimation method, device and equipment and readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114495193A | 2022-05-13 |
Family
ID=81494674
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111521665.3A (publication CN114495193A, pending) | Head posture estimation method, device and equipment and readable storage medium | 2021-12-13 | 2021-12-13 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114495193A (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030123754A1 (en) * | 2001-12-31 | 2003-07-03 | Microsoft Corporation | Machine vision system and method for estimating and tracking facial pose |
| US20190180469A1 (en) * | 2017-12-08 | 2019-06-13 | Nvidia Corporation | Systems and methods for dynamic facial analysis using a recurrent neural network |
| WO2019229524A2 (en) * | 2018-05-31 | 2019-12-05 | 赛灵思公司 | Neural network calculation method and system, and corresponding dual neural network implementation |
| CN109886241A (en) * | 2019-03-05 | 2019-06-14 | 天津工业大学 | Driver fatigue detection based on long short-term memory network |
| CN111160237A (en) * | 2019-12-27 | 2020-05-15 | 智车优行科技(北京)有限公司 | Head pose estimation method and device, electronic device and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| ZHENG YI; LI FENG; ZHANG LI; LIU SHOUYIN: "Human body pose detection method based on long short-term memory networks" (基于长短时记忆网络的人体姿态检测方法), Journal of Computer Applications (计算机应用), no. 06, 10 June 2018 (2018-06-10) * |
Similar Documents
| Publication | Title |
|---|---|
| CN111079601A | Video content description method, system and device based on multi-mode attention mechanism |
| CN110852256B | Method, device and equipment for generating time sequence action nomination and storage medium |
| CN111488807A | Video description generation system based on graph convolution network |
| CN112507990A | Video time-space feature learning and extracting method, device, equipment and storage medium |
| CN111105017A | Neural network quantization method and device and electronic equipment |
| CN112115744A | Point cloud data processing method and device, computer storage medium and electronic equipment |
| CN112183544A | Double-channel fused three-layer architecture mathematical formula identification method, system and storage device |
| CN117219101A | Speech encoder training method, device, equipment, medium and program product |
| CN118247810A | Human body posture recognition method and recognition system based on TensorFlow.js |
| CN116934796A | Visual target tracking method based on twin residual attention aggregation network |
| Huang et al. | DeformMLP: Dynamic large-scale receptive field MLP networks for human motion prediction |
| CN119255028B | Audio generation method, device, electronic device and storage medium |
| CN118734947B | Knowledge graph completion method and device based on attention penalty and noise sampling |
| CN119810702A | Video data processing method, device, electronic device and readable storage medium |
| CN119407789A | A method and system for detecting a target grasped by a robotic arm |
| CN119028019A | Sign language recognition and translation method based on lightweight neural network |
| CN112989952A | Crowd density estimation method and device based on mask guidance |
| CN116740134B | Image target tracking method, device and equipment based on hierarchical attention strategy |
| CN116071825B | Action behavior recognition method, system, electronic equipment and storage medium |
| CN114495193A | Head posture estimation method, device and equipment and readable storage medium |
| CN114120245B | Crowd image analysis method, device and equipment based on deep neural network |
| Wang et al. | Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Generation |
| Vyshnivskyi et al. | Human Pose Estimation system using Deep Learning algorithms |
| CN115082430A | Image analysis method and device and electronic equipment |
| CN114821776B | Skeleton key point detection method utilizing cavity transpose convolution hourglass structure |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |