
CN119181126A - Face information acquisition method and electronic equipment - Google Patents


Info

Publication number
CN119181126A
Authority
CN
China
Prior art keywords: face, facial, audio data, data, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411483367.3A
Other languages
Chinese (zh)
Other versions
CN119181126B (en)
Inventor
张�成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202411483367.3A priority Critical patent/CN119181126B/en
Publication of CN119181126A publication Critical patent/CN119181126A/en
Application granted granted Critical
Publication of CN119181126B publication Critical patent/CN119181126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2131Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on a transform domain processing, e.g. wavelet transform
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract


The embodiments of the present application provide a facial information acquisition method and an electronic device. The method is applied to electronic devices and, in particular, can be applied to electronic devices that are not equipped with a camera. In the method, the electronic device detects facial state features through ultrasound and generates facial information that matches the actual facial state. The facial information here includes a facial state and/or a simulated face picture. By implementing the method, even an electronic device that is not equipped with a camera can obtain the user's facial state and/or a simulated face picture, thereby enriching the usage scenarios of the electronic device.

Description

Face information acquisition method and electronic equipment
Technical Field
The present application relates to the field of vision and terminals, and more particularly, to a face information acquisition method and an electronic device.
Background
In daily life, electronic devices come in a wide variety, including devices equipped with a camera (e.g., mobile phones, computers, etc.) and devices not equipped with a camera (e.g., smart watches, smart glasses, etc.). Although electronic devices without a camera are more limited in functionality and interactivity than those with a camera, they can offer advantages in specific usage scenarios. For example, smart glasses are an electronic device without a camera: they can be worn as ordinary glasses to help the wearer see, and they can also provide other functions, including but not limited to navigation cues, real-time translation, and displaying email notifications or upcoming schedule items directly in the wearer's line of sight.
However, an electronic device without a camera cannot acquire face information, which makes it difficult to support scenarios that require face information, such as video communication and face detection and recognition.
Disclosure of Invention
The application provides a face information acquisition method and an electronic device, which are used for detecting facial state features through ultrasonic waves and generating face information that matches the actual face state.
In a first aspect, the application provides a face information acquisition method, which is applied to a first electronic device, wherein the first electronic device comprises a loudspeaker and a microphone. The method comprises: acquiring first ultrasonic audio data through the microphone, wherein the first ultrasonic audio data indicates an ultrasonic signal which is emitted by the loudspeaker and reflected by a face; extracting facial state features of the face from the first ultrasonic audio data; and adjusting a first preset face picture based on the facial state features to generate a simulated face picture.
In the above embodiment, the facial state features are obtained based on the first ultrasonic audio data, yet they can indicate the face state shown in the simulated face picture, because the first ultrasonic audio data is derived from an ultrasonic signal reflected while the face is in the face state corresponding to the simulated face picture. Accordingly, the electronic device can detect the facial state features based on the ultrasonic signal and thereby generate a simulated face picture (a kind of face information) that matches the actual face state. When acquiring the face picture, the electronic device therefore does not depend on a camera: it can acquire ultrasonic audio data using its own loudspeaker and microphone and generate the face picture from that data.
With reference to the first aspect, in some embodiments, adjusting the first preset face picture based on the facial state features to generate a simulated face picture specifically comprises: extracting face features from the first preset face picture, adjusting the face features through the facial state features, and mapping the adjusted face features into the simulated face picture.
In the above embodiment, the face features are data that are obtained from the first preset face picture and can represent it. Obtaining the simulated face picture by adjusting the face features can therefore also be understood as adjusting the first preset face picture itself; because the adjustment is performed on the face features corresponding to the first preset face picture rather than on the picture directly, the adjustment efficiency is improved.
With reference to the first aspect, in some embodiments, the first electronic device further includes a gesture sensor, and the method further includes: obtaining first gesture data through the gesture sensor, wherein the first gesture data indicates a gesture of the face; and extracting gesture features of the face from the first gesture data. Adjusting the face features through the facial state features specifically includes: adjusting the face features through the facial state features and the gesture features to generate the simulated face picture, wherein the simulated face picture has the gesture indicated by the gesture features.
In the above embodiment, the gesture data is pose data for the head. Introducing pose data for the head can reduce the negative influence of head movement on facial expression recognition and produce a more accurate simulated face picture. If pose data for the head were not introduced, both the pose change of the head and the expression change of the face would affect the reflection of the ultrasonic signals, and both would therefore be reflected in the ultrasonic audio data. When the electronic device analyzes the face information from the ultrasonic audio data alone, the result may not be accurate enough; for example, a change in head pose may be mistaken for a change in facial expression, resulting in an error in the generated face information.
With reference to the first aspect, in some embodiments, the facial state features are determined through a first feature extraction network in a first face picture generation model, and the face features are determined through a second feature extraction network in the first face picture generation model. The first face picture generation model further comprises a first face generation network, and the first face generation network comprises a first convolution network, a first deconvolution network and a second convolution network. The functions of the first convolution network and the first deconvolution network include adjusting the face features based on the facial state features to obtain adjusted face features, and the second convolution network is used for mapping the adjusted face features into the simulated face picture. The first face picture generation model is trained by taking second ultrasonic audio data and a second preset face picture as input data, a predicted face picture as output data and a sample face picture as reference data, and the sample face picture is matched with the facial state features extracted based on the second ultrasonic audio data.
With reference to the first aspect, in some embodiments, the facial state features are determined through a first feature extraction network in a second face picture generation model, the face features are determined through a second feature extraction network in the second face picture generation model, and the gesture features are determined through a third feature extraction network in the second face picture generation model. The second face picture generation model further comprises a second face generation network, and the second face generation network comprises a third convolution network, a second deconvolution network and a fourth convolution network. The functions of the third convolution network and the second deconvolution network include adjusting the face features based on the facial state features and the gesture features to obtain adjusted face features, and the fourth convolution network is used for mapping the adjusted face features into the simulated face picture. The second face picture generation model is trained by taking second ultrasonic audio data, second gesture data and a second preset face picture as input data, a predicted face picture as output data and a sample face picture as reference data, and the sample face picture is matched with the facial state features extracted based on the second ultrasonic audio data and the gesture features extracted based on the second gesture data.
With reference to the first aspect, in some embodiments, before the first ultrasonic audio data is acquired through the microphone, the method further includes establishing a video call connection with a second electronic device, and after the simulated face picture is generated, the method further includes sending the simulated face picture to the second electronic device through the video call connection.
In the above embodiment, the generated simulated face picture can be applied in a video call scenario. In this way, the video call does not depend on a camera: the face picture can be generated from the ultrasonic audio data acquired by the loudspeaker and the microphone carried by the electronic device, and the video call can then be carried out.
With reference to the first aspect, in some embodiments, acquiring the first ultrasonic audio data through the microphone specifically includes: collecting a sound signal through the microphone, performing high-pass filtering on the sound signal to obtain the ultrasonic signal reflected by the face, and processing the ultrasonic signal reflected by the face to obtain the first ultrasonic audio data.
In the above embodiment, the high-pass filtering is performed because the sound signal may include, in addition to the ultrasonic signal, sound from the environment in which the wearer is located. Removing this noise by high-pass filtering allows the generated first ultrasonic audio data to more accurately indicate the ultrasonic signal emitted by the loudspeaker and reflected by the face, which in turn yields more accurate facial state features and, finally, a more accurate simulated face picture.
With reference to the first aspect, in some embodiments, the method further includes: performing low-pass filtering on the sound signal to obtain an audible sound signal, the audible sound signal having a frequency lower than that of the ultrasonic signal; processing the audible sound signal to obtain audible audio data; and transmitting the audible audio data to the second electronic device over the video call connection.
In the above embodiment, the low-pass filtering is performed because, during a video call, the microphone collects not only the wearer's voice but also the ultrasonic signal reflected by the face, whereas the audio transmitted to the peer electronic device only needs to contain the wearer's voice.
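As an illustration of this band splitting, the following sketch separates one raw microphone capture into an ultrasonic component (used for facial-state analysis) and an audible component (sent over the call). The 48 kHz sample rate and the 20 kHz cut-off are illustrative assumptions; the patent does not specify concrete values.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 48_000      # assumed microphone sampling rate (Hz); not specified in the patent
CUTOFF = 20_000  # assumed boundary between audible sound and ultrasound (Hz)

def split_bands(raw: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a raw microphone capture into ultrasonic and audible parts.

    The high-pass branch keeps the ultrasonic signal emitted by the speaker
    and reflected by the face; the low-pass branch keeps the wearer's voice
    for the video call.
    """
    hp = butter(8, CUTOFF, btype="highpass", fs=FS, output="sos")
    lp = butter(8, CUTOFF, btype="lowpass", fs=FS, output="sos")
    ultrasonic = sosfiltfilt(hp, raw)  # source of the first ultrasonic audio data
    audible = sosfiltfilt(lp, raw)     # source of the audible audio data
    return ultrasonic, audible
```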
With reference to the first aspect, in some embodiments, the first electronic device is a wearable device worn on the head, and at least one speaker, at least one microphone, and at least one attitude sensor are disposed on both sides of a central axis of the first electronic device.
In the above embodiment, the microphones, the speakers and the gesture sensors are arranged in a bilaterally symmetric manner, so that the acquired ultrasonic audio data and gesture data reflect the face information more accurately, yielding a more accurate simulated face picture.
With reference to the first aspect, in some embodiments, the first electronic device is an eyeglass device comprising a first frame, a second frame, a first frame arm connected to the first frame, and a second frame arm connected to the second frame. One speaker is disposed at a first position in each of the first frame and the second frame, one microphone is disposed at a second position in each of the first frame and the second frame, and one attitude sensor is disposed at a third position in each of the first frame and the second frame. One speaker is disposed at a first position in each of the first frame arm and the second frame arm, and one microphone is disposed at a second position in each of the first frame arm and the second frame arm.
In the above embodiment, the microphones, the speakers and the gesture sensors are arranged symmetrically and evenly in the frames and the frame arms, so that the ultrasonic audio data and gesture data acquired by the smart glasses reflect the face information even more accurately, yielding a more accurate simulated face picture.
In a second aspect, an embodiment of the present application provides a face information obtaining method, which is applied to a first electronic device, where the first electronic device includes a speaker and a microphone, and the method includes obtaining, by the microphone, first ultrasonic audio data, where the first ultrasonic audio data indicates an ultrasonic signal emitted by the speaker and reflected by a face, extracting, by the first ultrasonic audio data, a face state feature of the face, and mapping, based on the face state feature, to obtain a face state of the face.
In the above embodiment, the electronic device can detect the facial state features based on the ultrasonic signal and then obtain the actual face state (a kind of face information) by mapping. Therefore, the electronic device can analyze the face state without depending on a camera: it acquires ultrasonic audio data with its own loudspeaker and microphone and analyzes the face state from that data.
With reference to the second aspect, in some embodiments, the first electronic device further includes a gesture sensor, and the method further includes acquiring first gesture data by the gesture sensor, wherein the first gesture data indicates a gesture of the face, extracting gesture features of the face by the first gesture data, mapping to obtain a face state of the face based on the face state features, specifically including stitching the face state features and the gesture features to obtain stitched features, and mapping the stitched features to the face state.
In the above embodiment, the gesture data is pose data for the head. Introducing pose data for the head can reduce the negative impact of head movement on facial expression recognition and yield a more accurate analysis of the face state. If pose data for the head were not introduced, both the pose change of the head and the expression change of the face would affect the reflection of the ultrasonic signals, and both would therefore be reflected in the ultrasonic audio data. When the electronic device analyzes the face state from the ultrasonic audio data alone, the result may not be accurate enough; for example, a change in head pose may be mistaken for a change in facial expression, resulting in an error in the determined face state.
With reference to the second aspect, in some embodiments, the facial state features are determined by a first feature extraction network in a first face state analysis model, and the first face state analysis model further comprises a first face state analysis network, which is used for mapping the facial state features into the face state of the face;
The first face state analysis model is trained by taking second ultrasonic audio data as input data, a predicted face state as output data and a sample face state as reference data, and the sample face state is matched with the facial state features extracted from the second ultrasonic audio data.
With reference to the second aspect, in some embodiments, the facial state features are determined by a first feature extraction network in a second face state analysis model, the gesture features are determined by a second feature extraction network in the second face state analysis model, and the second face state analysis model further includes a second face state analysis network, which is used for mapping stitching features into the face state of the face, the stitching features including the gesture features and the facial state features. The second face state analysis model is trained by taking second ultrasonic audio data and second gesture data as input data, a predicted face state as output data and a sample face state as reference data, and the sample face state is matched with the facial state features extracted from the second ultrasonic audio data.
In a third aspect, an embodiment of the application provides a first electronic device comprising one or more processors and a memory coupled to the one or more processors, wherein the memory is configured to store computer program code comprising computer instructions, and the one or more processors invoke the computer instructions to cause the first electronic device to perform the method as implemented in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform a method as implemented in the first aspect.
In a fifth aspect, embodiments of the present application provide a chip system for application to an electronic device, the chip system comprising one or more processors for invoking computer instructions to cause the electronic device to perform the method as implemented in the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method as implemented in the first aspect.
It will be appreciated that the first electronic device provided in the third aspect, the computer storage medium provided in the fourth aspect, the chip system provided in the fifth aspect and the computer program product provided in the sixth aspect are all configured to perform the method provided by the embodiments of the present application. Therefore, for the advantages they achieve, reference may be made to the advantages of the corresponding method, which are not repeated here.
Drawings
Fig. 1 is a schematic diagram of smart glasses provided in an embodiment of the present application;
FIG. 2 illustrates an exemplary scenario involved in acquiring training data in an embodiment of the present application;
FIG. 3 shows a schematic diagram of different facial states;
FIG. 4A shows a schematic diagram involved in training the face picture generation model a;
FIG. 4B shows a schematic diagram involved in acquiring a preset face picture;
FIG. 4C shows a schematic diagram involved in using the trained face picture generation model a;
FIG. 5A shows a schematic diagram involved in training the face picture generation model 1;
FIG. 5B shows a schematic diagram involved in using the trained face picture generation model 1;
FIG. 6A shows a schematic diagram involved in training a facial state analysis model a;
FIG. 6B shows a schematic diagram involved in using a trained facial state analysis model a;
FIG. 7A shows a schematic diagram involved in training a face state analysis model 1;
FIG. 7B shows a schematic diagram involved in using the trained facial state analysis model 1;
FIG. 8 is a schematic diagram of a system for implementing a method for acquiring facial information according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a module for implementing a method for acquiring facial information according to an embodiment of the present application;
FIG. 10 is an interactive schematic diagram for implementing a facial information acquisition method according to an embodiment of the present application;
FIG. 11A shows a schematic diagram involved in training a face information generation model;
FIG. 11B shows a schematic diagram involved in generating a model using facial information;
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a facial information acquisition method which is applied to electronic devices, in particular to electronic devices without a camera. Implementing the method enables the electronic device to perform facial state feature detection by ultrasound (also referred to as ultrasonic signals) and to generate facial information that matches the actual facial state. The face information here includes a face state and/or a simulated face picture. The face state comprises an expression of the face, or an expression of the face and a direction of the face. The expression of a face may be determined by facial actions such as blinking, mouth opening, mouth closing, etc. The direction of the face is determined by the pose of the head.
The scheme for acquiring the simulated face picture can be used in scenarios such as video calls. The scheme for acquiring the face state can be used in scenarios requiring detection of the user's face state, such as face authentication and fatigue-driving detection. In fatigue-driving detection, whether the user's eyes are open can be determined from the face state; if the eyes are not open, fatigue driving is determined. In face authentication, whether the user's face meets the requirements can be determined from the face state; if so, the authentication passes, otherwise it does not.
In the face information acquisition method, a speaker and a microphone are provided in the electronic device. The electronic device emits an ultrasonic signal through the speaker, and the signal reaches the microphone after being reflected by the face. By collecting the ultrasonic signal reflected by the face through the microphone, the electronic device obtains ultrasonic audio data. The ultrasonic audio data carries, or indicates, the ultrasonic signal sent by the loudspeaker and reflected by the face. In different facial states the ultrasonic signal is reflected differently, so the obtained ultrasonic audio data differ as well, for example in their frequency distribution. Therefore, by analyzing the ultrasonic audio data, the facial state features can be acquired, and the facial information can then be obtained. For example, when the face is in a mouth-open state and in a mouth-closed state, the same ultrasonic signal is reflected differently; when the microphone collects the ultrasonic signal reflected by the face in these two different facial states (mouth open and mouth closed), different ultrasonic audio data are obtained. The ultrasonic audio data acquired when the face is in the mouth-open state reflect how the face reflects the ultrasonic signal in that state, and can be used to extract the facial state features of the mouth-open state and thus map out that the face is in the mouth-open state. Likewise, the ultrasonic audio data acquired when the mouth is closed reflect how the face reflects the ultrasonic signal in the mouth-closed state, and can be used to extract the facial state features of the mouth-closed state and thus map out that the face is in the mouth-closed state.
The facial state features may be regarded as data carrying a face state, and the face state can be mapped out from the facial state features; different facial state features map to different face states. With the facial state features, the preset face picture can be adjusted to obtain the simulated face picture. The simulated face picture is obtained by adjusting the face state of the face in the preset face picture to the target face state. For example, if the facial state features indicate that the face state includes a mouth-open state, the mouth shape of the face in the preset face picture may be adjusted to be open. The target face state is obtained from the facial state feature mapping. The preset face picture is stored in the electronic device and is denoted as preset face picture 1.
The simulated face picture may be obtained through the face picture generation model 1 (trained). The electronic device may input the preset face picture 1 and the ultrasonic audio data 1 into the face picture generation model 1 (trained), which outputs a simulated face picture based on the preset face picture 1 and the ultrasonic audio data 1. The ultrasonic audio data 1 is the ultrasonic audio data acquired by the microphone when the electronic device acquires facial information.
The face picture generation model 1 may include a face state feature extraction network, a face feature extraction network, and a face generation network 1. Wherein the facial state feature extraction network is used to extract facial state features (noted as facial state features 1) from the ultrasonic audio data 1. The face feature extraction network is used for extracting face features (denoted as face features 1) from the preset face picture 1. The face generation network 1 is configured to adjust the face features 1 through the face state features 1, and map the adjusted face features into a simulated face picture. In some possible cases, the face generation network 1 may include a convolution network 11, a deconvolution network 11, and a convolution network 12. The convolution network 11 and the deconvolution network 11 have the functions of adjusting the face feature 1 through the face state feature 1 to obtain the adjusted face feature. The convolutional network 12 is used to map the adjusted face features into the simulated face frames.
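For orientation, a minimal PyTorch sketch of how these three parts of the face picture generation model 1 could be composed is given below. All layer shapes, channel counts and layer choices are illustrative assumptions; the patent only specifies the roles of the networks, not their internals.

```python
import torch
import torch.nn as nn

class FacePictureGenerationModel1(nn.Module):
    """Hypothetical composition of the three networks described above."""

    def __init__(self):
        super().__init__()
        # Facial state feature extraction network: ultrasonic audio data 1 -> facial state feature 1.
        self.state_net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((56, 56)),
        )
        # Face feature extraction network: preset face picture 1 -> face feature 1.
        self.face_net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((56, 56)),
        )
        # Face generation network 1: convolution network 11 and deconvolution network 11
        # adjust the face feature; convolution network 12 maps it to an image.
        self.adjust = nn.Sequential(
            nn.Conv2d(128 + 64, 128, 3, padding=1), nn.ReLU(),               # convolution network 11
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # deconvolution network 11
        )
        self.to_image = nn.Conv2d(64, 3, 3, padding=1)                       # convolution network 12

    def forward(self, ultrasonic_frame, preset_face):
        state_feat = self.state_net(ultrasonic_frame)      # facial state feature 1
        face_feat = self.face_net(preset_face)             # face feature 1
        fused = torch.cat([state_feat, face_feat], dim=1)  # combine before adjustment
        adjusted = self.adjust(fused)                      # adjusted face feature
        return torch.sigmoid(self.to_image(adjusted))      # simulated face picture
```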
The face picture generation model 1 is trained by taking ultrasonic audio data a and a preset face picture a as input data, a predicted face picture as output data and a sample face picture as reference data. The sample face picture is matched with the facial state features extracted based on the ultrasonic audio data a. For the relevant content of training the face picture generation model 1, reference may be made to the following description of FIG. 5A, which is not repeated here.
The face state can be obtained by the face state analysis model 1 (after training). The electronic device may input the ultrasonic audio data 1 into a face state analysis model 1 (trained), through which face state analysis model 1 (trained) a face state is output based on the ultrasonic audio data 1.
The face state analysis model 1 may include therein a face state feature extraction network and a face state analysis network 1. Wherein the face state analysis network 1 is used to map the face state features 1 into a face state. The face state features 1 are extracted based on the ultrasonic audio data 1 through a face state feature extraction network.
The face state analysis model 1 is trained with ultrasonic audio data a as input data, a predicted face state as output data, and a sample face state as reference data. The sample face state matches the face state features extracted by the ultrasonic audio data a. The description of fig. 7A below may be referred to for the relevant content of training the face state analysis model 1, and is not repeated here.
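A minimal sketch of how the face state analysis model 1 could be organized is given below, assuming the face state is predicted as one of a small set of discrete states; the label set, the input resolution and all layer sizes are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

FACE_STATES = ["neutral", "mouth_open", "eyes_closed", "blinking"]  # assumed label set

class FaceStateAnalysisModel1(nn.Module):
    """Facial state feature extraction network + face state analysis network 1."""

    def __init__(self, num_states: int = len(FACE_STATES)):
        super().__init__()
        # Facial state feature extraction network: ultrasonic audio data frame -> feature vector.
        self.feature_net = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Face state analysis network 1: maps facial state feature 1 to a face state.
        self.analysis_net = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, num_states),
        )

    def forward(self, ultrasonic_frame: torch.Tensor) -> torch.Tensor:
        features = self.feature_net(ultrasonic_frame)  # facial state feature 1
        return self.analysis_net(features)             # logits over the face states

# Example with a downsampled 4-microphone frame (full frames would be 4 x 2048 x 2048).
logits = FaceStateAnalysisModel1()(torch.randn(1, 4, 256, 256))
```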
Here, when the face information is determined by the ultrasonic signal, both the posture change of the head and the expression change of the face affect the reflection condition of the ultrasonic signal, and are reflected in the ultrasonic audio data 1. When the electronic device analyzes face information by the ultrasonic audio data 1, there is a problem that it is not accurate enough. For example, the change in the posture of the head may be mistaken for the change in the expression of the face, resulting in an error in the generated face information.
In order to improve the accuracy of generating face information, data for measuring face information may be introduced into pose data for the head in addition to ultrasonic audio data acquired by an ultrasonic signal. The gesture data may be acquired by a gesture sensor in the electronic device. The pose data may be used to indicate the pose of the head, reflecting the direction of the face or the motion of the face.
Under the condition of introducing the gesture data aiming at the head, the electronic equipment adjusts the preset face picture through the facial state characteristics to obtain the simulated face picture, and the method can comprise the steps that the electronic equipment adjusts the preset face picture through the facial state characteristics and the gesture characteristics to obtain the simulated face picture, wherein the simulated face picture has the gesture indicated by the gesture characteristics. Wherein the gesture features are extracted by gesture data. The gesture feature can be used for determining the facial direction change caused by the gesture change of the head, reducing the negative influence of the head movement on the facial expression recognition, and generating a more accurate simulated face picture.
At this time, the simulated face may be obtained by the face generation model a (after training). The electronic device may input the preset face picture 1, the ultrasonic audio data 1, and the pose data 1 into a face picture generation model a (trained) by which an artificial face picture is output based on the preset face picture 1, the ultrasonic audio data 1, and the pose data 1.
The face picture generation model a may include a face state feature extraction network, a face feature extraction network, a pose feature extraction network, and a face generation network a. Wherein the facial state feature extraction network is used to extract facial state features (noted as facial state features 1) from the ultrasonic audio data 1. The face feature extraction network is used for extracting face features (denoted as face features 1) from the preset face picture 1. The gesture feature extraction network is used to extract gesture features (denoted as gesture features 1) from the gesture data 1. The face generation network a is used for adjusting the face characteristics 1 through the face state characteristics 1 and the gesture characteristics 1, and mapping the adjusted face characteristics into an imitated face picture. In some possible cases, the face generation network a may include a convolution network 21, a deconvolution network 21, and a convolution network 22. The convolution network 21 and the deconvolution network 21 have the functions of adjusting the face feature 1 through the face state feature 1 and the gesture feature 1 to obtain an adjusted face feature. The convolutional network 22 is used for mapping the adjusted face features into the simulated face frames.
The face picture generation model a is obtained by training with ultrasonic audio data a, preset face picture a and gesture data a as input data, predicted face picture as output data and sample face picture as reference data. For the relevant content of the training face picture generation model a, reference may be made to the following descriptions of fig. 4A and fig. 4B, which are not repeated here.
In the case of introducing pose data for the head, the aforementioned electronic device obtains a face state based on the face state feature map. The method can comprise the steps that the electronic equipment splices facial state characteristics and gesture characteristics to obtain splicing characteristics. The stitching features are mapped to facial states.
At this time, the face state may be obtained by the face state analysis model a (after training). The electronic device may input the ultrasonic audio data 1 and the posture data 1 into a face state analysis model a (trained) by which a face state is output based on the ultrasonic audio data 1 and the posture data 1.
The face state analysis model a may include a face state feature extraction network, a posture feature extraction network, and a face state analysis network a. Wherein the face state analysis network a is used to map the stitching features into a face state. The stitching features include a face state feature 1 and a gesture feature 1. The face state features 1 are extracted based on the ultrasonic audio data 1 through a face state feature extraction network. The gesture feature 1 is extracted based on the gesture data 1 through a gesture feature extraction network.
The face state analysis model a is trained with ultrasonic audio data a and gesture data a as input data, a predicted face state as output data, and a sample face state as reference data. The sample face state matches the face state features extracted by the ultrasonic audio data a. For the relevant content of training the face state analysis model a, reference may be made to the following description of fig. 6A, which is not repeated here.
Here, it should be noted that, in the scheme in which pose data for the head is not introduced, the electronic device may be a wearable device worn on the head or an electronic device that is not worn on the head; when the face information acquisition method is implemented on a device that is not worn on the head, the method may further include prompting the user to aim the device at the face. In the scheme in which pose data for the head is introduced, the electronic device is a wearable device worn on the head. Wearable devices worn on the head may include smart glasses, mixed reality (MR) helmets, AR glasses, VR glasses, and the like. Electronic devices not worn on the head may include smart watches, smart phones, etc. The embodiment of the present application is not limited thereto.
It should be further noted that, if the electronic device is a wearable device, one condition for executing the face information acquisition method is that the electronic device confirms that the user has worn the wearable device on the head. In addition, while the face information acquisition method is being executed, whether the wearable device is still worn on the head needs to be checked at a preset period. If it is no longer worn on the head, the face information acquisition method is stopped; otherwise, execution may continue.
Taking smart glasses as an example of the electronic device, the following describes the training and application of the face picture generation model a and the face state analysis model a in the case where pose data for the head is introduced, and the training and application of the face picture generation model 1 and the face state analysis model 1 in the case where pose data for the head is not introduced.
Based on the foregoing, in the case of introducing pose data for the head, a microphone and a speaker need to be configured in the smart glasses to acquire ultrasonic audio data. It is also necessary to configure the attitude sensor to acquire attitude data. Then, the ultrasonic audio data and the gesture data are transmitted to a face picture generation model a to generate an imitated face picture, or the ultrasonic audio data and the gesture data are transmitted to a face state analysis model a to obtain a face state.
In order that the acquired ultrasonic audio data and pose data feed back the face information more accurately, the microphones, the speakers and the attitude sensors can be arranged on the smart glasses in a bilaterally symmetric manner; that is, at least one speaker, at least one microphone and at least one attitude sensor are distributed on each side of the central axis of the smart glasses.
As shown in fig. 1, the smart glasses include a frame 1, a frame 2, a frame arm 1 connected to the frame 1, and a frame arm 2 connected to the frame 2. The intelligent glasses are provided with 4 microphones, 4 loudspeakers and 2 attitude sensors, and are symmetrically arranged in the intelligent glasses. One speaker is arranged at each of the positions 1 of the frame 1 and the frame 2, and one microphone is arranged at each of the positions 2 of the frame 1 and the frame 2. One speaker is arranged at each of the positions 1 of the frame arms 1 and 2, and one microphone is arranged at each of the positions 2 of the frame arms 1 and 2. The 2 attitude sensors may be symmetrically arranged in the mirror frame 1 and the mirror frame 2, or the 2 attitude sensors may be symmetrically arranged in the frame arm 1 and the frame arm 2.
In some possible cases, the attitude sensor may be an inertial measurement unit (inertial measurement unit, IMU). The glasses frame carries lenses, and the lenses of the intelligent glasses can be used as a display of the intelligent glasses to provide display contents for a wearer.
Here, the following description takes the case where 4 microphones, 4 speakers and 2 attitude sensors are provided in the smart glasses as an example. Other numbers of microphones, speakers and attitude sensors may also be used; the 4 microphones, 4 speakers and 2 attitude sensors do not limit the embodiments of the application. If the gesture data is not introduced, the attitude sensors may be omitted from the smart glasses.
It should be further noted that the description in fig. 1 and the description of arranging the microphone, the speaker, and the attitude sensor in a bilateral symmetry form are not limited to smart glasses, and other electronic devices may be employed. For example, other wearable devices besides smart glasses, such as VR glasses, etc.
The following describes the relevant contents of the training face screen generation model a in the case of introducing pose data for the head based on fig. 2, 3, 4A, and 4B.
Training the face picture generation model a requires a large amount of training data. The training data comprises the gesture data required during training (the gesture data a), the ultrasonic audio data required during training (the ultrasonic audio data a), the preset face picture required during training (the preset face picture a) and sample face pictures. Training the face picture generation model a once requires at least one frame of gesture data a, one frame of ultrasonic audio data a, the preset face picture a and one frame of sample face picture. The sample face picture frame used in each training step is matched with the gesture data a and the ultrasonic audio data a: the pose data acquired when the face is in the pose indicated in the sample face picture frame is the gesture data a, and the ultrasonic audio data acquired when the face is in that pose is the ultrasonic audio data a.
The process of acquiring training data may be described with reference to fig. 2 and 3 below.
As shown in FIG. 2, at time Tc, the subject wears the smart glasses and imitates the presenter displayed on the acquisition device to adjust his or her facial state. The acquisition device captures a face picture of the subject (the user wearing the smart glasses) at time Tc through a camera connected to it, obtaining one frame of sample face picture. Meanwhile, the smart glasses emit ultrasonic signals through the 4 speakers at time Tc and collect the ultrasonic signals reflected by the face through the 4 microphones, obtaining one frame of ultrasonic audio data a. In addition, the smart glasses acquire one frame of gesture data a at time Tc through the 2 attitude sensors, indicating the pose of the subject's head at time Tc. One frame of ultrasonic audio data a may comprise the ultrasonic audio data acquired by the 4 microphones at time Tc, and one frame of gesture data a may comprise the gesture data acquired by the 2 attitude sensors at time Tc. After a number of such times Tc, more training data can be obtained.
Here, it should be noted that the acquisition device may display presenter pictures in different face states, and when the subject follows the presenter to adjust the facial state, richer training data can be obtained. Meanwhile, for any one presenter picture in a given face state, the acquisition device may also display that picture several times so as to acquire more representative training data. The differences between presenter pictures in different face states may include any one or more of the following: as shown in (1) of FIG. 3, differences in facial expression; as shown in (2) of FIG. 3, differences in head pose.
In some possible cases, the training data may be in the form of a data stream. The system comprises a plurality of frames of continuous ultrasonic audio data a, a plurality of frames of continuous gesture data a and a plurality of frames of continuous sample face picture frames. Any frame of ultrasonic audio data a in the multi-frame continuous ultrasonic audio data a corresponds to one frame of gesture data a and one frame of sample face picture with the same time stamp. The time stamp of a datum indicates the time when the datum was acquired.
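The frame-level correspondence by timestamp described above can be sketched as follows; the data structures and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Frame:
    timestamp: float  # acquisition time (the time Tc of this frame)
    data: Any         # frame payload: audio matrix, pose matrix, or face image

def align_streams(ultrasonic: list[Frame], pose: list[Frame], faces: list[Frame]):
    """Pair ultrasonic, pose and sample-face frames that share a timestamp,
    as needed for one training step of the face picture generation model a."""
    pose_by_ts = {f.timestamp: f.data for f in pose}
    face_by_ts = {f.timestamp: f.data for f in faces}
    return [
        (u.data, pose_by_ts[u.timestamp], face_by_ts[u.timestamp])
        for u in ultrasonic
        if u.timestamp in pose_by_ts and u.timestamp in face_by_ts
    ]
```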
After the training data is acquired, the face picture generation model a may be trained based on the training data, and the process may be described with reference to fig. 4A described below.
As shown in fig. 4A, the training face picture generation model a may include a pose feature extraction network to be trained, a face state feature extraction network, a face feature extraction network, and a face generation network a. The face generation network a includes a convolutional network 21 to be trained, a deconvolution network 21, and a convolutional network 22. The function of each network may be referred to in the foregoing description, and will not be described in detail herein.
The gesture data a, the ultrasonic audio data a and the sample face picture are input into the face picture generation model a to be trained frame by frame, and the time stamps of the training data related to the process of performing one training in the face picture generation model a are the same. As shown in fig. 4A, one frame of pose data a input into the face generation model a may also be referred to as a target pose data frame a, one frame of ultrasonic audio data a input into the face generation model a may also be referred to as an ultrasonic audio data frame a, and one frame of sample face input into the face generation model a may be referred to as a sample face frame.
In the process of training the face picture generation model a, as shown in (1) in FIG. 4A, the gesture feature extraction network performs feature extraction on the target gesture data frame a to obtain a gesture feature a. The gesture feature a may be represented as a matrix of 1 x 56 x 56. As shown in (2) in FIG. 4A, the facial state feature extraction network performs feature extraction on the ultrasonic audio data frame a to obtain a facial state feature a. The facial state feature a may be represented as a matrix of 128 x 56 x 56. As shown in (3) in FIG. 4A, the face feature extraction network performs feature extraction on the preset face picture a to obtain a face feature a. The face feature a may be represented as a matrix of 64 x 56 x 56. The gesture feature a, the facial state feature a and the face feature a are then stitched through the face picture generation model a, and the stitched features are transmitted to the face generation network a. Here, feature stitching directly combines the gesture feature a, the facial state feature a and the face feature a into a longer feature (the stitching feature) without changing the values of the elements in each feature. The stitching feature may be represented as a matrix of 193 (1+128+64) x 56 x 56.
Subsequently, the stitched features are transmitted to the face generation network a. In the face generation network a, the facial state feature a and the gesture feature a included in the stitched features are used to adjust the face feature a included in the stitched features through the convolution network 21 and the deconvolution network 21, obtaining the adjusted face features. The convolution network 22 then maps the adjusted face features into one frame of predicted face picture (predicted face picture frame). The predicted face picture frame may be represented as a matrix of 3 x 112 x 112. The predicted face picture frame may be understood as the simulated face picture output by the face picture generation model a during training.
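The stitching step and the shape of the data flowing through the face generation network a can be illustrated with the tensor shapes stated above; the kernel sizes, strides and activations below are assumptions, and only the channel counts and the 56 x 56 / 112 x 112 sizes come from the description.

```python
import torch
import torch.nn as nn

# The three features described above (a batch dimension is added for PyTorch):
pose_feat = torch.randn(1, 1, 56, 56)     # gesture feature a: 1 x 56 x 56
state_feat = torch.randn(1, 128, 56, 56)  # facial state feature a: 128 x 56 x 56
face_feat = torch.randn(1, 64, 56, 56)    # face feature a: 64 x 56 x 56

# Feature stitching: concatenate along the channel axis without changing any element values.
stitched = torch.cat([pose_feat, state_feat, face_feat], dim=1)
assert stitched.shape == (1, 193, 56, 56)  # 1 + 128 + 64 = 193 channels

# Face generation network a (illustrative layer choices):
conv_21 = nn.Conv2d(193, 128, kernel_size=3, padding=1)                      # convolution network 21
deconv_21 = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)  # deconvolution network 21
conv_22 = nn.Conv2d(64, 3, kernel_size=3, padding=1)                         # convolution network 22

adjusted = torch.relu(deconv_21(torch.relu(conv_21(stitched))))  # adjusted face features
predicted = torch.sigmoid(conv_22(adjusted))                     # predicted face picture frame
assert predicted.shape == (1, 3, 112, 112)
```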
Subsequently, the difference between the predicted face picture frame and the sample face picture frame is calculated through the face picture generation model a, and the loss function 1 is obtained. Parameters in each network in the face picture generation model a are adjusted based on the loss function 1. For example, in the case where the loss function 1 is greater than a preset threshold, parameters in each network in the face picture generation model a are adjusted. Under the condition that the loss function 1 is continuously smaller than a preset threshold value, parameters in each network in the face picture generation model a are not adjusted in the training.
The condition for finishing the training of the face picture generation model a includes, but is not limited to, one or more of the following. Item 1: the loss function 1 is smaller than the preset threshold for N consecutive times, where N is an integer greater than or equal to 1. Item 2: the number of training iterations of the face picture generation model a reaches a preset number 1. Item 3: the number of times the parameters in the face picture generation model a are updated reaches a preset number 2. The preset number 1 and the preset number 2 may be equal or unequal, and both are usually large, for example more than 100, which is not limited in the embodiment of the present application. Other conditions for model training completion in the embodiments of the present application also include, but are not limited to, at least one of the above 3 items.
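A hedged sketch of how loss function 1 and the stopping conditions above might be combined in a training loop is shown below; the choice of an L1 loss, the threshold, N and the maximum step count are assumptions, and the data loader is assumed to yield aligned (pose, audio, preset face, sample face) frames.

```python
import torch
import torch.nn.functional as F

def train_model_a(model, optimizer, data_loader,
                  loss_threshold=0.05, n_consecutive=10, max_steps=10_000):
    """Train until loss function 1 stays below the threshold N times in a row,
    or a preset number of training steps is reached (all values assumed)."""
    below_count = 0
    for step, (pose_a, audio_a, preset_face_a, sample_face) in enumerate(data_loader, 1):
        predicted_face = model(pose_a, audio_a, preset_face_a)
        loss_1 = F.l1_loss(predicted_face, sample_face)  # difference between predicted and sample frame

        if loss_1.item() > loss_threshold:
            optimizer.zero_grad()
            loss_1.backward()        # adjust the parameters in every network of the model
            optimizer.step()
            below_count = 0
        else:
            below_count += 1         # loss stayed below the threshold this step; no update

        if below_count >= n_consecutive or step >= max_steps:
            break                    # a training-completion condition is reached
```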
In the foregoing description of FIG. 4A, the case where one frame each of the gesture data a, the ultrasonic audio data a and the sample face picture is input into the face picture generation model a to be trained is taken as an example. In practice, multiple frames of gesture data a, ultrasonic audio data a and sample face pictures may be input into the face picture generation model a to be trained, and the model correspondingly outputs multiple frames of predicted face pictures. The embodiment of the application is not limited in this respect; it depends on the training mode of the model. It can also be understood that the gesture data a may include at least one frame of gesture data, and the ultrasonic audio data a may include at least one frame of ultrasonic audio data.
Here, the target pose data frame a may be understood as one of the pose data input to the face frame generation model a to be trained as described above. The target pose data frame a may be represented as a matrix of 12 x 100. Wherein 12 refers to the target attitude data frame a including the absolute gravitational acceleration on the three axes and the linear force acceleration on the three axes determined based on the initial attitude data frame a1, and further including the absolute gravitational acceleration on the three axes and the linear force acceleration on the three axes determined based on the initial attitude data frame a 2. The absolute gravitational acceleration on either axis and the linear force acceleration on either axis can be represented in 100 elements. The acquisition process of the target attitude data frame a may be described below.
The target pose data frame a is acquired by a pose sensor. As shown in fig. 4A, the acquisition process includes that 2 attitude sensors (attitude sensor 1 and attitude sensor 2) in the smart glasses continuously acquire initial attitude data in the process of acquiring training data. Wherein, the initial posture data collected by the posture sensor 1 may be referred to as initial posture data a1, and the initial posture data collected by the posture sensor 2 may be referred to as initial posture data a2. The target attitude data frame a is acquired once every time Tc, which comprises the steps of carrying out six-axis fusion processing on one frame of initial attitude data a1 (for example, the initial attitude data frame a1 in FIG. 4A) and one frame of initial attitude data a2 (for example, the initial attitude data frame a2 in FIG. 4A) in the time Tc to obtain the target attitude data frame a. The initial pose data frame a1 and the initial pose data frame a2 may include accelerometer data on three axes of the smart glasses and gyroscope data on three axes for a time Tc. The three axes include an X axis, a Y axis, and a Z axis. Wherein the X-axis generally represents one axis in the horizontal direction. The Y-axis is also typically another axis in the horizontal direction, perpendicular to the X-axis. The Z-axis represents the vertical direction, perpendicular to the earth's surface. Since the positions of the two attitude sensors in the smart glasses are different, the accelerometer data on the three axes and the gyroscope data on the three axes included in the initial attitude data frame a1 and the initial attitude data frame a2 may be different, and a more detailed attitude may be reflected.
The six-axis fusion processing includes fusing the accelerometer data on the three axes and the gyroscope data on the three axes in the initial pose data frame a1 to obtain more comprehensive target pose data. For example, the target pose data obtained after the six-axis fusion processing of the initial pose data frame a1 may be the absolute gravitational acceleration on the three axes and the linear acceleration on the three axes indicated by the initial pose data frame a1. The six-axis fusion processing further includes fusing, in the same way, the accelerometer data on the three axes and the gyroscope data on the three axes in the initial pose data frame a2. For example, the target pose data obtained after the six-axis fusion processing of the initial pose data frame a2 may be the absolute gravitational acceleration on the three axes and the linear acceleration on the three axes indicated by the initial pose data frame a2.
In addition to the absolute gravitational acceleration on the three axes and the linear acceleration on the three axes, other data may also be used to comprehensively represent the initial pose data frame a1, such as the pitch angle, yaw angle, and roll angle about the three axes. The embodiment of the present application is not limited in this respect.
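The application does not specify which six-axis fusion algorithm is used. A hedged sketch of one common choice, a simple complementary low-pass estimate of gravity that separates gravitational and linear acceleration, is given below; the filter constant and array layout are assumptions.

```python
import numpy as np

# Hedged sketch of one possible six-axis fusion step. acc and gyro are (T, 3)
# arrays of accelerometer and gyroscope samples collected within time Tc by one
# pose sensor. With 2 sensors, the 6 fused axes per sensor supply the 12 rows of
# the 12 x 100 target pose data frame (resampling each axis to 100 elements is omitted).
def six_axis_fusion(acc, gyro, alpha=0.98):
    gravity = np.zeros_like(acc)
    g = acc[0]
    for t in range(len(acc)):
        # Low-pass the accelerometer to track the gravity direction; in a fuller
        # implementation the gyro would be integrated to rotate g between samples.
        g = alpha * g + (1.0 - alpha) * acc[t]
        gravity[t] = g
    linear_acc = acc - gravity      # linear acceleration on the three axes
    return gravity, linear_acc      # each of shape (T, 3)
```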
The ultrasonic audio data frame a may be understood as one frame of the ultrasonic audio data a input into the face picture generation model a to be trained as referred to above. The acquisition process of the ultrasonic audio data frame a is described below.
The ultrasonic audio data frame a is acquired by microphones. As shown in fig. 4A, during the acquisition of training data, the 4 microphones in the smart glasses (microphone 1-microphone 4) continuously collect initial audio data. The initial audio data collected by microphone 1, microphone 2, microphone 3 and microphone 4 are referred to as initial audio data a1, initial audio data a2, initial audio data a3 and initial audio data a4, respectively. One ultrasonic audio data frame a is acquired every time Tc: one frame of initial audio data a1 within the time Tc (for example, the initial audio data frame a1 in fig. 4A), one frame of initial audio data a2 (for example, the initial audio data frame a2 in fig. 4A), one frame of initial audio data a3 (for example, the initial audio data frame a3 in fig. 4A) and one frame of initial audio data a4 (for example, the initial audio data frame a4 in fig. 4A) are filtered and then subjected to Fourier transform to obtain the ultrasonic audio data frame a.
The initial audio data collected by each microphone includes not only ultrasonic audio data indicating the (reflected) ultrasonic signal, but possibly also audible audio data indicating an audible sound signal. Therefore, the initial audio data a1 to the initial audio data a4 need to be filtered to remove the audible audio data and retain only the ultrasonic audio data. Then, the filtered initial audio data frame a1 to the filtered initial audio data frame a4 are Fourier transformed to obtain the ultrasonic audio data frame a. It should be noted that the Fourier transform converts the filtered initial audio data frame a1 to the filtered initial audio data frame a4 from the time domain into the ultrasonic audio data frame a in the frequency domain, which is easier to process.
The ultrasonic audio data frame a may be represented as a matrix of 4 x 2048 x 2048. The 4 means that the ultrasonic audio data frame a includes frequency-domain ultrasonic audio data determined from the 4 paths of initial audio data frames (initial audio data frame a1-initial audio data frame a4). The frequency-domain ultrasonic audio data determined from any one of the initial audio data frames a1-a4 may be represented by a matrix of 2048 x 2048. One 2048 means that any one initial audio data frame is divided into 2048 time points. The other 2048 means that, at one time point, the initial audio data frame can be resolved into 2048 audio data components of different frequencies.
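A hedged sketch of how one such 4 x 2048 x 2048 frame could be built from raw microphone samples is shown below. The sample rate, the 20 kHz cut-off and the window parameters are assumptions, and the audible band is removed here by zeroing frequency bins after the transform, which is only one way to approximate the filter-then-transform pipeline described above.

```python
import numpy as np

# Hedged sketch: turn 4 channels of raw microphone samples collected within
# time Tc into a 4 x 2048 x 2048 frequency-domain frame.
def to_ultrasonic_frame(initial_audio, fs=192_000, cutoff_hz=20_000,
                        n_time=2048, n_freq=2048):
    channels = []
    for ch in initial_audio:                      # ch: 1-D array of samples
        # Split the frame into n_time short windows (the 2048 "time points").
        win_len = len(ch) // n_time
        windows = ch[:win_len * n_time].reshape(n_time, win_len)
        # Fourier transform each window and keep n_freq frequency components.
        spec = np.fft.rfft(windows, n=2 * n_freq, axis=1)[:, :n_freq]
        # Remove the audible band: zero out bins below the cut-off frequency.
        freqs = np.fft.rfftfreq(2 * n_freq, d=1.0 / fs)[:n_freq]
        spec[:, freqs < cutoff_hz] = 0.0
        channels.append(np.abs(spec))             # (n_time, n_freq)
    return np.stack(channels)                     # (4, 2048, 2048)
```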
The preset face picture a and the sample face picture are both face pictures of the subject. The sample face picture is collected by the camera when the training data is acquired, while the preset face picture a can be prepared in advance: it can be obtained by performing image preprocessing on a training picture a containing a face. As shown in fig. 4B, the training picture a containing a face (from the subject) is subjected to image preprocessing; after the face position is confirmed, the face is segmented from the picture to obtain the subject's face. The edges of the face are then expanded to obtain a rectangular area including the face, yielding the preset face picture a. The content of the preset face picture a other than the face may be set as a background, and the pixels in the background all share the same pixel value, for example gray or white. In order to make the obtained predicted face picture frames more realistic and reliable, the face in the training picture a should be the frontal face of the subject wearing the smart glasses.
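A hedged sketch of this preprocessing is given below. The face detector is not specified in the application, so detect_face() is a hypothetical helper returning a bounding box; the margin, output size and background value are likewise assumptions.

```python
import cv2
import numpy as np

# Hedged sketch of the image preprocessing described above: locate the face,
# expand its edges to a rectangular region, and place it on a uniform grey
# background. detect_face() is a hypothetical helper returning (x, y, w, h).
def make_preset_face_picture(frame, detect_face, margin=0.2, out_size=112,
                             background=128):
    x, y, w, h = detect_face(frame)
    dx, dy = int(w * margin), int(h * margin)          # expand the face edges
    x0, y0 = max(x - dx, 0), max(y - dy, 0)
    x1 = min(x + w + dx, frame.shape[1])
    y1 = min(y + h + dy, frame.shape[0])
    face = frame[y0:y1, x0:x1]
    # Scale the face region to fit the model input and paste it onto a canvas
    # whose remaining pixels all share one background value (grey here).
    fh, fw = face.shape[:2]
    scale = out_size / max(fh, fw)
    resized = cv2.resize(face, (int(fw * scale), int(fh * scale)))
    canvas = np.full((out_size, out_size, 3), background, dtype=np.uint8)
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return canvas        # corresponds to the 3 x 112 x 112 input after transposition
```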
Here, the foregoing describes a training process, and in actual situations, the face image generation model a needs to be trained multiple times. As shown in fig. 4A, the training of the face image generation model a is continued by acquiring the subsequent initial pose data, the initial audio data and the sample face image from front to back according to the time (t), and the training process may refer to the foregoing related content, which is not described herein.
It should be further noted that, in the embodiment of the present application, the time Tc is a period of time, which indicates a time (which may be understood as a duration) from the beginning to the end of audio corresponding to one frame of ultrasonic audio data (which may also be referred to as an ultrasonic audio data frame). The ultrasonic audio data of duration Tc may also be referred to as ultrasonic audio data frames.
When Fourier transforming the ultrasonic audio data frame, the time Tc may be understood as the duration of the observation window in the Fourier transform. The length of the time Tc determines the frequency resolution of the observation window for the ultrasonic audio data: an observation window of duration Tc can resolve audio data whose frequency interval exceeds 1/Tc (Hz). For example, when the frequency interval between audio data 1 and audio data 2 in an ultrasonic audio data frame is greater than 1/Tc (Hz), the two can be distinguished by the Fourier transform; otherwise they are treated as the same audio.
According to the Doppler effect, within a time Tc, after the speaker emits the ultrasonic signal, the signal is reflected by the face and the reflected ultrasonic signal is collected by the microphones to obtain an ultrasonic audio data frame. If the state of the face changes during this process, some or all of the positions on the face undergo a relative displacement with respect to the speaker and the microphones. This relative displacement changes the frequency of the reflected ultrasonic signal, and the change is reflected in the ultrasonic audio data frame. As shown in fig. 4A, the ultrasonic audio data frame can be considered to change linearly over a short period of time, and the frequency change slope S of the ultrasonic audio data frame can be expressed as S = Δf/Δt, where S indicates the amount by which the frequency of the ultrasonic signal emitted by the speaker changes per unit time, and Δf indicates the frequency interval within the ultrasonic audio data frame over the time Δt. For the Fourier transform, the bandwidth B of the ultrasonic signal emitted by the speaker needs to satisfy B = S × Tc. B determines the observable effective distance d_min of the ultrasonic signal emitted by the speaker. When the relative displacement Δd of a position on the face is larger than d_min, the change of that position can be captured by the reflected ultrasonic signal, reflected in the ultrasonic audio data frame, and then converted into a simulated face picture by the face picture generation model. The effective distance d_min is determined as shown in formula (1) to formula (3):

Δf = 2 × S × Δd / v    (1)

2 × S × Δd / v > 1/Tc    (2)

Δd > v / (2 × S × Tc) = v / (2 × B) = d_min    (3)

In formula (1), v represents the propagation speed of the ultrasonic signal in air, which is about 340 m/s (meters per second), Δd represents the relative displacement of a position on the face, and the other parameters are as described above. The inequality in formula (2) is obtained from formula (1) together with the condition that the frequency interval Δf of different audio data in an ultrasonic audio data frame must be greater than 1/Tc to be distinguishable by the Fourier transform. Formula (3) is obtained by rearranging formula (2) and, using B = S × Tc, gives the relationship between the effective distance d_min and the bandwidth B of the ultrasonic signal emitted by the speaker, namely d_min = v / (2 × B). For example, when the bandwidth of the ultrasonic signal emitted by the speaker is 80 kHz, the effective distance d_min may be 2.125 mm. It can be seen that the greater the bandwidth B of the ultrasonic signal emitted by the speaker, the smaller the effective distance d_min, so the relative displacement of a position on the face more easily exceeds d_min and the change of that position is more easily captured by the reflected ultrasonic signal.
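To make the numbers above concrete, the short Python check below evaluates the reconstructed relationship d_min = v / (2 × B) with the example values from the text; the variable names are illustrative only.

```python
# Worked check of the effective-distance relationship d_min = v / (2 * B)
# reconstructed above, using the example values given in the text.
v = 340.0        # approximate propagation speed of ultrasound in air, m/s
B = 80_000.0     # example bandwidth of the emitted ultrasonic signal, Hz
d_min = v / (2 * B)
print(f"effective distance d_min = {d_min * 1000:.3f} mm")   # 2.125 mm
```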
Wherein the bandwidth of the ultrasound signal represents the frequency range of the ultrasound signal, i.e. the difference between the maximum frequency and the minimum frequency of the ultrasound signal. For example, when the bandwidth of the ultrasonic signal is 80kHz, the frequency of the ultrasonic signal may range from 20kHz to 100kHz.
The foregoing description of the observation window and the correlation related to the doppler effect applies not only to the ultrasonic audio data related to fig. 4A, but also to other ultrasonic audio data.
Because the computing power of the intelligent glasses is limited, the process of training the face picture generation model a can be performed on other electronic equipment with better computing power. If the electronic device implementing the face information acquisition method has a high computing power, training of the face image generation model a may be performed in the electronic device.
After the training of the face picture generation model a is completed, the face picture generation model a after the training is completed can be placed in the intelligent glasses, so that the intelligent glasses can use the face picture generation model a to acquire the simulated face picture through the gesture data 1, the ultrasonic audio data 1 and the preset face picture 1. The process of generating the model a by the smart glasses using the face picture may refer to the following description of fig. 4C.
As shown in fig. 4C, the trained face image generation model a may include a trained pose feature extraction network, a face state feature extraction network, a face feature extraction network, and a face generation network a. The face generation network a includes a convolution network 21 after training, a deconvolution network 21, and a convolution network 22. The function of each network may be referred to in the foregoing description, and will not be described in detail herein.
In a scene in which a simulated face picture is generated using the trained face picture generation model a, the gesture sensor 1 and the gesture sensor 2 in the smart glasses continuously collect the initial gesture data 11 and the initial gesture data 12, respectively, while the microphone 1-microphone 4 in the smart glasses continuously collect the initial audio data 11-initial audio data 14, respectively. One target gesture data frame 1 and one ultrasonic audio data frame 1 are acquired every time Tc. The target gesture data frame 1, the ultrasonic audio data frame 1 and the preset face picture 1 are transmitted into the trained face picture generation model a to obtain one frame of simulated face picture. This process is performed cyclically over time t, and multiple frames of simulated face pictures can be obtained.
In the scene of generating multi-frame simulated face pictures, the preset face picture 1 input into the face picture generation model a can remain unchanged, while the gesture data 1 and the ultrasonic audio data 1 are input frame by frame into the trained face picture generation model a. The time stamps of the one frame of gesture data 1 (also called the target gesture data frame 1) and the one frame of ultrasonic audio data 1 (also called the ultrasonic audio data frame 1) used when one frame of simulated face picture (also called a simulated face picture frame) is generated in the face picture generation model a are the same. The process of generating one frame of simulated face picture may refer to the following descriptions of (1) to (3) in fig. 4C.
At (1) in fig. 4C, the smart glasses perform feature extraction on the target gesture data frame 1 through the gesture feature extraction network to obtain the gesture feature 1. The gesture feature 1 may be represented as a matrix of 1 x 56 x 56. At (2) in fig. 4C, the smart glasses perform feature extraction on the ultrasonic audio data frame 1 through the face state feature extraction network to obtain the face state feature 1. The face state feature 1 may be represented as a matrix of 128 x 56 x 56. At (3) in fig. 4C, the smart glasses perform feature extraction on the preset face picture 1 through the face feature extraction network to obtain the face feature 1. The face feature 1 may be represented as a matrix of 64 x 56 x 56. Then, the gesture feature 1, the face state feature 1 and the face feature 1 are spliced through the face picture generation model a. Here, feature splicing directly combines the gesture feature 1, the face state feature 1 and the face feature 1 into a longer feature (the spliced feature) without changing the values of the elements in each feature. The spliced feature may be represented as a matrix of 193 x 56 x 56, where 193 = 1 + 128 + 64.
Subsequently, the smart glasses transmit the spliced feature to the face generation network a. In the face generation network a, the convolution network 21 and the deconvolution network 21 use the face state feature 1 and the gesture feature 1 included in the spliced feature to adjust the face feature 1 included in the spliced feature, so as to obtain the adjusted face feature. The convolution network 22 then maps the adjusted face feature into a simulated face picture frame. The simulated face picture frame may be represented as a matrix of 3 x 112 x 112.
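The application only states the tensor shapes at each stage, not the layer configuration. The sketch below is therefore a minimal stand-in that matches those shapes (spliced feature 193 x 56 x 56 in, simulated face picture 3 x 112 x 112 out); the channel widths, kernel sizes and activations are assumptions and not the patented design.

```python
import torch
import torch.nn as nn

# Minimal sketch of the face generation network a forward pass, assuming only
# the stated tensor shapes. Layer widths and kernel sizes are assumptions.
class FaceGenerationNetA(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv21 = nn.Conv2d(193, 256, kernel_size=3, padding=1)     # "convolution network 21"
        self.deconv21 = nn.ConvTranspose2d(256, 128, kernel_size=4,
                                           stride=2, padding=1)         # 56 -> 112
        self.conv22 = nn.Conv2d(128, 3, kernel_size=3, padding=1)       # "convolution network 22"

    def forward(self, splice):                  # splice: (N, 193, 56, 56)
        x = torch.relu(self.conv21(splice))     # adjust the face feature using the
        x = torch.relu(self.deconv21(x))        # pose and face state features
        return torch.sigmoid(self.conv22(x))    # (N, 3, 112, 112)

# Usage: concatenate the three features along the channel dimension.
pose_feat = torch.randn(1, 1, 56, 56)       # gesture feature 1
state_feat = torch.randn(1, 128, 56, 56)    # face state feature 1
face_feat = torch.randn(1, 64, 56, 56)      # face feature 1
splice = torch.cat([pose_feat, state_feat, face_feat], dim=1)   # (1, 193, 56, 56)
picture = FaceGenerationNetA()(splice)                          # (1, 3, 112, 112)
```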
Wherein the target pose data frame 1 is acquired by a pose sensor. As shown in fig. 4C, the acquisition process includes subjecting one frame of initial pose data 11 (e.g., initial pose data frame 11 in fig. 4C) and one frame of initial pose data 12 (initial pose data frame 12 in fig. 4C) within a time Tc to six-axis fusion processing to obtain a target pose data frame 1. Here, the process of the six-axis fusion process and the process of obtaining the target attitude data frame 1 are the same as those of the six-axis fusion process and the process of obtaining the target attitude data frame a in fig. 4A, and the adaptive modification may be performed with reference to the description of the related content. For example, the initial pose data a1 and the initial pose data a2 are replaced with the initial pose data 11 and the initial pose data 12, respectively, and will not be described here. The target pose data frame 1 and the target pose data frame a may each be represented as a matrix of 12×100.
The frame of ultrasonic audio data 1 is acquired by a microphone. As shown in fig. 4C, the acquisition process includes subjecting a frame of initial audio data 11 (e.g., initial audio data frame 11 from microphone 1 in fig. 4C), a frame of initial audio data 12 (initial audio data frame 12 from microphone 2 in fig. 4C), a frame of initial audio data 13 (initial audio data frame 13 from microphone 3 in fig. 4C), and a frame of initial audio data 14 (initial audio data frame 14 from microphone 4 in fig. 4C) to filtering and fourier transformation to obtain ultrasonic audio data frame 1. Here, the process of filtering and fourier transforming and obtaining the ultrasonic audio data frame 1 is the same as that of the ultrasonic audio data frame a in fig. 4A, and the above description of the related content can be referred to for adaptive modification. For example, the initial audio data a1 to the initial audio data a4 are replaced with the initial audio data 11 to the initial audio data 14, respectively, and will not be described here. The ultrasonic audio data frame 1 and ultrasonic audio data frame a may each be represented as a matrix of 4×2048×2048.
The preset face picture 1 is a face picture of a wearer. The image can be obtained by performing image preprocessing on a preset picture 1 with a human face. In order to make the generated simulated face picture frame more real and reliable, the face in the preset picture 1 with the face should be the front face of the wearer after wearing the intelligent glasses. The process of obtaining the preset face picture 1 is the same as the process of obtaining the preset face picture a in fig. 4A, and the adaptation modification is performed. For example, the training frame a of the face is modified to be the preset frame 1 of the face, which is not described in detail in the embodiment of the present application. The preset face picture 1 and the preset face picture a can be represented as a matrix of 3×112×112.
In the above description with respect to fig. 4C, the posture data 1 and the ultrasonic audio data 1 input to the training-completed face-picture generation model a are described as an example of one frame. The gesture data 1 and the ultrasonic audio data 1 actually input into the trained face picture generation model a can be multi-frame, and the face picture generation model a correspondingly outputs multi-frame simulated face picture frames. It is also understood that the pose data 1 may comprise at least one frame of pose data. The ultrasonic audio data 1 may include at least one frame of ultrasonic audio data.
The following describes the relevant contents of the training face picture generation model 1 without introducing pose data for the head based on fig. 5A.
It should be noted that, compared with the face picture generation model a, the training data used by the face picture generation model 1 includes the ultrasonic audio data a, the preset face picture a and the sample face picture, but does not include the gesture data a. The training data required for training the face picture generation model 1 may be obtained with reference to the description of obtaining training data in fig. 2 and fig. 3, except that the gesture data does not need to be obtained. After the training data is acquired, the face picture generation model 1 may be trained based on the training data, and the process may be described with reference to fig. 5A described below.
As shown in fig. 5A, the face picture generation model 1 in training may include a face state feature extraction network to be trained, a face feature extraction network, and a face generation network 1. The face generating network 1 includes a convolutional network 11 to be trained, a deconvolution network 11, and a convolutional network 12. The function of each network may be referred to in the foregoing description, and will not be described in detail herein.
The ultrasonic audio data a and the sample face picture are input into the face picture generation model 1 to be trained frame by frame, and the time stamps of the training data related to the process of one training input into the face picture generation model 1 are the same. As shown in fig. 5A, one frame of ultrasonic audio data a input into the face-picture generation model 1 may also be referred to as an ultrasonic audio data frame a and one frame of sample face picture input into the face-picture generation model 1 may be referred to as a sample face-picture frame.
In the process of training the face picture generation model 1, see (1) in fig. 5A, the face state feature extraction network performs feature extraction on the ultrasonic audio data frame a to obtain the face state feature a. The face state feature a may be represented as a matrix of 128 x 56 x 56. Referring to (2) in fig. 5A, the face feature extraction network may perform feature extraction on the preset face picture a to obtain the face feature a. The face feature a may be represented as a matrix of 64 x 56 x 56. Then, the face state feature a and the face feature a are spliced through the face picture generation model 1. Here, feature splicing directly combines the face state feature a and the face feature a into a longer feature (the spliced feature) without changing the values of the elements in each feature. The spliced feature may be represented as a matrix of 192 x 56 x 56, where 192 = 128 + 64.
Subsequently, the spliced feature is transmitted to the face generation network 1. In the face generation network 1, the convolution network 11 and the deconvolution network 11 use the face state feature a included in the spliced feature to adjust the face feature a included in the spliced feature, so as to obtain the adjusted face feature. The convolution network 12 then maps the adjusted face feature into one frame of predicted face picture (a predicted face picture frame). The predicted face picture frame may be represented as a matrix of 3 x 112 x 112. The predicted face picture frame may be understood as the simulated face picture output by the face picture generation model 1 during training.
Subsequently, the difference between the predicted face picture frame and the sample face picture frame is calculated through the face picture generation model 1, and the loss function 2 is obtained. Parameters in each network in the face picture generation model 1 are adjusted based on the loss function 2. For example, in the case where the loss function 2 is greater than a preset threshold, parameters in each network in the face picture generation model 1 are adjusted. Under the condition that the loss function 2 is continuously smaller than a preset threshold value, parameters in each network in the face picture generation model 1 are not adjusted in the training.
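A hedged sketch of one training step following this description is shown below. The application does not specify the form of loss function 2, so mean squared error is used as an assumed stand-in, and the model's call signature (ultrasonic frame plus preset face picture in, predicted frame out) is likewise an assumption.

```python
import torch
import torch.nn as nn

# Hedged sketch of one training step for the face picture generation model 1:
# the gap between the predicted and sample frames drives the parameter update
# only while the loss stays above the preset threshold.
def train_step(model, optimizer, ultrasonic_frame, preset_face, sample_frame,
               loss_threshold=1e-3):
    predicted_frame = model(ultrasonic_frame, preset_face)       # (N, 3, 112, 112)
    loss = nn.functional.mse_loss(predicted_frame, sample_frame) # stand-in for loss function 2
    if loss.item() > loss_threshold:
        optimizer.zero_grad()
        loss.backward()          # adjust parameters in each network
        optimizer.step()
    return loss.item()           # parameters left unchanged once below the threshold
```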
Here, the foregoing describes a training process, and in actual cases, the face image generation model 1 needs to be trained multiple times. As shown in fig. 5A, the training of the face image generation model 1 is continued by acquiring the subsequent initial audio data and the sample face image from front to back according to the time (t), and the training process may refer to the foregoing related content, which is not described herein.
It should be further noted that, for data in fig. 5A having the same name as in fig. 4A, the definition and acquisition process may refer to the description of the related data in fig. 4A, which is not repeated here. The identically named data includes the sample face picture frame, the ultrasonic audio data frame a, each initial data frame (including the initial audio data frame a1-the initial audio data frame a4), and the like. The identically named processing includes filtering, image preprocessing, and the like.
After the training of the face picture generation model 1 is completed, the face picture generation model 1 after the training is completed can be placed in the intelligent glasses, so that the intelligent glasses can use the face picture generation model 1 to acquire the simulated face picture through the ultrasonic audio data 1 and the preset face picture 1. The process of generating the model 1 using the face picture by the smart glasses may refer to the following description of fig. 5B.
As shown in fig. 5B, the trained face picture generation model 1 may include a trained face state feature extraction network, a face feature extraction network, and a face generation network 1. The face generation network 1 includes the trained convolution network 11, deconvolution network 11, and convolution network 12. The function of each network may be referred to in the foregoing description and is not described in detail herein.
In a scene in which a simulated face picture is generated using the trained face picture generation model 1, the microphone 1-microphone 4 in the smart glasses may continuously collect the initial audio data 11-initial audio data 14, respectively. One ultrasonic audio data frame 1 is acquired every time Tc. The ultrasonic audio data frame 1 and the preset face picture 1 are transmitted into the trained face picture generation model 1 to obtain one frame of simulated face picture. This process is performed cyclically over time t, and multiple frames of simulated face pictures can be obtained.
In the scene of generating the multi-frame simulated face picture, the preset face picture 1 input into the face picture generation model 1 can be unchanged, and the ultrasonic audio data 1 are input into the face picture generation model 1 after training frame by frame. The process of generating a frame of the simulated face picture may refer to the following description of (1) in fig. 5B and (2) in fig. 5B.
At (1) in fig. 5B, the smart glasses perform feature extraction on the ultrasonic audio data frame 1 through the face state feature extraction network to obtain the face state feature 1. The face state feature 1 may be represented as a matrix of 128 x 56 x 56. The face feature extraction network may be used to perform feature extraction on the preset face picture 1 to obtain the face feature 1. The face feature 1 may be represented as a matrix of 64 x 56 x 56. Then, the face state feature 1 and the face feature 1 are spliced through the face picture generation model 1. Here, feature splicing directly combines the face state feature 1 and the face feature 1 into a longer feature (the spliced feature) without changing the values of the elements in each feature. The spliced feature may be represented as a matrix of 192 x 56 x 56, where 192 = 128 + 64.
Subsequently, the smart glasses transmit the spliced feature to the face generation network 1. In the face generation network 1, the convolution network 11 and the deconvolution network 11 use the face state feature 1 included in the spliced feature to adjust the face feature 1 included in the spliced feature, so as to obtain the adjusted face feature. The convolution network 12 then maps the adjusted face feature into a simulated face picture frame. The simulated face picture frame may be represented as a matrix of 3 x 112 x 112.
It should be noted that, for data in fig. 5B having the same name as in fig. 4C, the definition and acquisition process may refer to the description of the related data in fig. 4C, which is not repeated here. The identically named data includes the ultrasonic audio data frame 1, the preset picture 1 containing a face, each initial data frame (including the initial audio data frame 11-the initial audio data frame 14), and the like. The identically named processing includes filtering, Fourier transform, and the like.
The following describes the relevant contents of the training face state analysis model a in the case of introducing pose data for the head based on fig. 6A.
As with training the face picture generation model a described above, training the face state analysis model a also requires a large amount of training data. The training data includes the pose data required for training (the aforementioned pose data a), the ultrasonic audio data required for training (the aforementioned ultrasonic audio data a), and the sample face states required for training. A sample face state is obtained by performing face state recognition on a sample face picture frame; each sample face picture frame yields a corresponding sample face state.
At least one frame of gesture data a, one frame of ultrasonic audio data a and one frame of sample face image are needed for training the face state analysis model a once. The sample face state used at each training time is matched with the pose data a and the ultrasonic audio data a. It is also understood that the pose data acquired when the face is in the face state indicated in the sample face picture frame is the pose data a. Meanwhile, the ultrasonic audio data acquired when the face is in the face state indicated in the sample face picture frame is the ultrasonic audio data a.
The contents involved in acquiring the pose data a, the ultrasonic audio data a and the sample face frame in the training data are the same as those in fig. 2 and 3, and reference may be made to the descriptions of fig. 2 and 3, which are not repeated here.
After the training data is acquired, the facial state analysis model a may be trained based on the training data, and the process may be described with reference to fig. 6A described below.
As shown in fig. 6A, a posture feature extraction network, a face state feature extraction network, and a face state analysis network a to be trained may be included in the face state analysis model a under training. The face state analysis network a comprises a convolution network a1 to be trained, a deconvolution network a1 and a convolution network a2. The function of each network may be referred to in the foregoing description, and will not be described in detail herein.
The pose data a, the ultrasonic audio data a and the sample face states corresponding to the sample face picture are input into the face state analysis model a to be trained frame by frame, and the time stamps of the training data related to the process of training once in the face state analysis model a are the same. As shown in fig. 6A, one frame of pose data a input into the face state analysis model a may also be referred to as a target pose data frame a, one frame of ultrasonic audio data a input into the face state analysis model a may also be referred to as an ultrasonic audio data frame a, and a sample face state corresponding to one frame of sample face picture input into the face state analysis model a may be simply referred to as a sample face state, which is represented by a matrix of 1×112×112.
In the process of training the face state analysis model a, as shown at (1) in fig. 6A, the gesture feature extraction network performs feature extraction on the target gesture data frame a to obtain the gesture feature a. The gesture feature a may be represented as a matrix of 1 x 56 x 56. See (2) in fig. 6A: the face state feature extraction network performs feature extraction on the ultrasonic audio data frame a to obtain the face state feature a. The face state feature a may be represented as a matrix of 128 x 56 x 56. Then, the gesture feature a and the face state feature a are spliced through the face state analysis model a, and the spliced feature is transmitted to the face state analysis network a. Here, feature splicing directly combines the gesture feature a and the face state feature a into a longer feature (the spliced feature) without changing the values of the elements in each feature. The spliced feature may be represented as a matrix of 129 x 56 x 56, where 129 = 1 + 128.
Subsequently, the spliced feature is transmitted to the face state analysis network a, where it is mapped into a predicted face state through the convolution network a1, the deconvolution network a1, and the convolution network a2. The convolution network a1 may be used to up-sample the input spliced feature, the deconvolution network a1 is used to deconvolute the up-sampled spliced feature output by the convolution network a1, and the deconvoluted spliced feature is then input into the convolution network a2. The convolution network a2 performs down-sampling based on the deconvoluted spliced feature to obtain the final face state.
The predicted face state is represented by a matrix of 1×112×112. The predicted face state can be understood as the face state output by the face state analysis model a in training.
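As with the generation networks above, the exact layers of the face state analysis network a are not disclosed, so the following is only a minimal sketch that matches the stated shapes (spliced feature 129 x 56 x 56 in, predicted face state 1 x 112 x 112 out); the channel widths are assumptions and the up/down-sampling split follows the text only loosely.

```python
import torch
import torch.nn as nn

# Minimal sketch of the face state analysis network a, assuming only the
# stated tensor shapes. Layer widths and kernel sizes are assumptions.
class FaceStateAnalysisNetA(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_a1 = nn.Conv2d(129, 128, kernel_size=3, padding=1)    # "convolution network a1"
        self.deconv_a1 = nn.ConvTranspose2d(128, 64, kernel_size=4,
                                            stride=2, padding=1)        # 56 -> 112
        self.conv_a2 = nn.Conv2d(64, 1, kernel_size=3, padding=1)       # "convolution network a2"

    def forward(self, splice):                 # (N, 129, 56, 56)
        x = torch.relu(self.conv_a1(splice))
        x = torch.relu(self.deconv_a1(x))
        return self.conv_a2(x)                 # predicted face state, (N, 1, 112, 112)
```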
Subsequently, the difference between the predicted face state and the sample face state is calculated by the face state analysis model a, resulting in a loss function a. Parameters in each network in the face state analysis model a are adjusted based on the loss function a. For example, in the case where the loss function a is greater than a preset threshold, parameters in each network in the face state analysis model a are adjusted. Under the condition that the loss function a is continuously smaller than a preset threshold value, the training does not adjust parameters in each network in the face state analysis model a.
Here, the foregoing describes a training process, and in practice, the facial state analysis model a needs to be trained multiple times. As shown in fig. 6A, the training of the facial state analysis model a is continued by acquiring subsequent initial audio data and sample face images from front to back according to time (t), and the training process may refer to the foregoing related content, which is not described herein.
It should be further noted that, for data in fig. 6A having the same name as in fig. 4A, the definition and acquisition process may refer to the description of the related data in fig. 4A, which is not repeated here. The identically named data includes the target pose data frame a, the ultrasonic audio data frame a, each initial data frame (including the initial pose data frame a1, the initial pose data frame a2, and the initial audio data frames a1-a4), the sample face picture frame, and the like. The identically named processing includes six-axis fusion processing, filtering, and the like.
After the training of the face state analysis model a is completed, the trained face state analysis model a can be placed in the intelligent glasses, so that the intelligent glasses can acquire the face state through the gesture data 1 and the ultrasonic audio data 1 by using the face state analysis model a. The process of using the face state analysis model a by the smart glasses may refer to the following description of fig. 6B.
As shown in fig. 6B, the training-completed face state analysis model a may include a training-completed posture feature extraction network, a face state feature extraction network, and a face state analysis network a. The face state analysis network a includes a convolution network a1, a deconvolution network a1, and a convolution network a2 after training is completed. The function of each network may be referred to in the foregoing description, and will not be described in detail herein.
In a scene in which a face state is generated using the trained face state analysis model a, the pose sensor 1 and the pose sensor 2 in the smart glasses continuously collect the initial pose data 11 and the initial pose data 12, respectively, while the microphone 1-microphone 4 in the smart glasses continuously collect the initial audio data 11-initial audio data 14, respectively. One target pose data frame 1 and one ultrasonic audio data frame 1 are acquired every time Tc. The target pose data frame 1 and the ultrasonic audio data frame 1 are transmitted into the trained face state analysis model a to obtain one face state. This process is performed cyclically over time t, and a plurality of face states can be obtained.
In a scene where a plurality of face states are generated, the pose data 1 and the ultrasonic audio data 1 are input into the face state analysis model a after training, frame by frame, and the time stamps of one frame of the pose data 1 (which may also be referred to as a target pose data frame 1) and one frame of the ultrasonic audio data 1 (which may also be referred to as an ultrasonic audio data frame 1) used when one face state is acquired in the face state analysis model a are the same. The process of generating one face state may refer to the following description at (1) in fig. 6B and (2) in fig. 6B.
At (1) in fig. 6B, the smart glasses perform feature extraction on the target gesture data frame 1 through the gesture feature extraction network to obtain the gesture feature 1. The gesture feature 1 may be represented as a matrix of 1 x 56 x 56. At (2) in fig. 6B, the smart glasses perform feature extraction on the ultrasonic audio data frame 1 through the face state feature extraction network to obtain the face state feature 1. The face state feature 1 may be represented as a matrix of 128 x 56 x 56. Then, the gesture feature 1 and the face state feature 1 are spliced through the face state analysis model a. Here, feature splicing directly combines the gesture feature 1 and the face state feature 1 into a longer feature (the spliced feature) without changing the values of the elements in each feature. The spliced feature may be represented as a matrix of 129 x 56 x 56, where 129 = 1 + 128.
Subsequently, the smart glasses transmit the spliced feature to the face state analysis network a, where the face state is obtained from the spliced feature through the convolution network a1, the deconvolution network a1 and the convolution network a2. The face state may be represented as a matrix of 1 x 112 x 112.
It should be noted that, for data in fig. 6B having the same name as in fig. 4C, the definition and acquisition process may refer to the description of the related data in fig. 4C, which is not repeated here. The identically named data includes the target pose data frame 1, the ultrasonic audio data frame 1, each initial data frame (including the initial pose data frame 11, the initial pose data frame 12, and the initial audio data frames 11-14), and the like. The identically named processing includes six-axis fusion processing, filtering, and the like.
The following describes the relevant contents of the training face state analysis model 1 without introducing pose data for the head based on fig. 7A.
It should be noted that, compared with the face state analysis model a, the training data used by the face state analysis model 1 includes the ultrasonic audio data a and the sample face states, but does not include the pose data a. The training data required for training the face state analysis model 1 may be obtained with reference to the description of obtaining training data for the face state analysis model a above, except that the pose data does not need to be obtained, which is not repeated here. After acquiring the training data, the face state analysis model 1 may be trained based on the training data, and the process may be described with reference to fig. 7A described below.
Training the face state analysis model 1 once requires at least one frame of ultrasonic audio data a and the sample face state corresponding to one frame of sample face picture. The sample face state used in each training iteration is matched with the ultrasonic audio data a. It can also be understood that the ultrasonic audio data collected while the face is in the face state indicated by the sample face picture frame is the ultrasonic audio data a.
As shown in fig. 7A, the face state analysis model 1 under training may include a face state feature extraction network to be trained and a face state analysis network 1. The face state analysis network 1 includes a convolution network b1 to be trained, a deconvolution network b1, and a convolution network b2. The function of each network may be referred to in the foregoing description and is not described in detail herein.
The ultrasonic audio data a and the sample face states corresponding to the sample face pictures are input frame by frame into the face state analysis model 1 to be trained, and the time stamps of the training data involved in one training iteration of the face state analysis model 1 are the same. As shown in fig. 7A, one frame of ultrasonic audio data a input into the face state analysis model 1 may also be referred to as an ultrasonic audio data frame a, and the sample face state corresponding to one frame of sample face picture input into the face state analysis model 1 may simply be referred to as a sample face state, which is represented by a matrix of 1 x 112 x 112.
In the process of training the face state analysis model 1, the face state feature extraction network performs feature extraction on the ultrasonic audio data frame a to obtain the face state feature a. The face state feature a may be represented as a matrix of 128 x 56 x 56. The face state feature a is transmitted to the face state analysis network 1 and mapped into a predicted face state by the convolution network b1, the deconvolution network b1 and the convolution network b2 in the face state analysis network 1. The convolution network b1 may be used to up-sample the input face state feature a, the deconvolution network b1 is used to deconvolute the up-sampled face state feature a output by the convolution network b1, and the deconvoluted face state feature a is then input into the convolution network b2. The convolution network b2 performs down-sampling based on the deconvoluted face state feature a to obtain the final face state.
The predicted face state is represented by a matrix of 1×112×112. The predicted face state can be understood as the face state output by the face state analysis model 1 in training.
Subsequently, the gap between the predicted face state and the sample face state is calculated by the face state analysis model 1, resulting in the loss function b. Parameters in each network in the face state analysis model 1 are adjusted based on the loss function b. For example, in the case where the loss function b is larger than a preset threshold, parameters in each network in the face state analysis model 1 are adjusted. Under the condition that the loss function b is continuously smaller than a preset threshold value, the training does not adjust parameters in each network in the face state analysis model 1.
Here, the foregoing describes one training iteration; in practice, the face state analysis model 1 needs to be trained multiple times. As shown in fig. 7A, the subsequent initial audio data and sample face pictures are acquired from front to back according to time (t) to continue training the face state analysis model 1, and the training process may refer to the foregoing related content, which is not repeated here.
It should be further noted that, for data in fig. 7A having the same name as in fig. 6A, the definition and acquisition process may refer to the description of the related data in fig. 6A, which is not repeated here. The identically named data includes the sample face picture frame, the sample face state, each initial data frame (including the initial audio data frame a1-the initial audio data frame a4), and the like. The identically named processing includes filtering, Fourier transform, face state recognition, and the like.
After the training of the face state analysis model 1 is completed, the trained face state analysis model 1 can be placed in the intelligent glasses, so that the intelligent glasses can acquire the face state through the ultrasonic audio data 1 by using the face state analysis model 1. The process of using the face state analysis model 1 by the smart glasses may refer to the following description of fig. 7B.
As shown in fig. 7B, the training-completed face state analysis model 1 may include a training-completed face state feature extraction network and a face state analysis network 1. The face state analysis network 1 includes a convolution network b1 after training, a deconvolution network b1, and a convolution network b2. The function of each network may be referred to in the foregoing description, and will not be described in detail herein.
In a scene in which a face state is generated using the trained face state analysis model 1, the microphone 1-microphone 4 in the smart glasses may continuously collect the initial audio data 11-initial audio data 14, respectively. One frame of ultrasonic audio data 1 (also referred to as the ultrasonic audio data frame 1) is acquired every time Tc. The ultrasonic audio data frame 1 is transmitted into the trained face state analysis model 1 to obtain one face state. This process is performed cyclically over time t, and a plurality of face states can be obtained.
Referring to fig. 7B, the smart glasses perform feature extraction on the ultrasonic audio data frame 1 through the face state feature extraction network to obtain the face state feature 1. The face state feature 1 may be represented as a matrix of 128 x 56 x 56. The face state feature 1 is transmitted to the face state analysis network 1, where it is mapped into the face state through the convolution network b1, the deconvolution network b1 and the convolution network b2. The face state may be represented as a matrix of 1 x 112 x 112.
It should be noted that, for data in fig. 7B having the same name as in fig. 6B, the definition and acquisition process may refer to the description of the related data in fig. 6B, which is not repeated here. The identically named data includes the ultrasonic audio data frame 1, each initial data frame (the initial audio data frame 11-the initial audio data frame 14), and the like. The identically named processing includes filtering, Fourier transform, and the like.
Fig. 8 is a schematic diagram of a system configuration for implementing the face information acquiring method according to the embodiment of the present application.
As shown in fig. 8, the system architecture is divided into several layers, each with distinct roles and responsibilities, and the layers communicate with each other through software interfaces. In some embodiments, the system architecture is divided into 4 layers, which are, from top to bottom, an application layer, an application framework (application framework) layer, a hardware abstraction layer (hardware abstraction layer, HAL), and a hardware layer.
The application layer may include a series of application packages, among other things. Such as communication, music, etc.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer.
The application framework layer may include an audio framework layer and a face information framework layer. The audio framework layer may be used to process audio data, including audio data received from other electronic devices and audio data sent to other electronic devices. The face information framework layer is used to manage the generation of face information. For example, the face information framework layer may invoke the face information generation module to generate face information and feed the face information back to the application that invoked the face information framework layer.
The hardware abstraction layer is an interface layer located between the operating system kernel layer and other layers of the electronic device (e.g., a local service layer) that aims at abstracting hardware to provide a virtual hardware platform for the operating system.
The hardware abstraction layer may include a facial information generation module therein.
The face information generation module may include a sensor data processing module, a timestamp alignment module, an image preprocessing module and a face picture generation model. The face picture generation model may be the face picture generation model a or the face picture generation model 1 referred to above. When the face picture generation model is the face picture generation model a, a pose sensor may be included in the hardware layer of the electronic device. Regarding the content related to the face picture generation model a and the face picture generation model 1, reference may be made to the descriptions of the related content in fig. 4C and fig. 5B, respectively, which are not repeated here.
The sensor data processing module is used for processing the data uploaded by the microphone, the loudspeaker and the gesture sensor to obtain data which can be input into a face picture generation model (after training is completed). The sensor data processing module may be used to implement the six-axis fusion process, filtering, and fourier transform referred to in fig. 4C, previously described. The sensor data processing module may acquire target pose data 1 (at least one frame of target pose data) based on initial pose data (e.g., initial pose data 11 and initial pose data 12), and acquire ultrasonic audio data 1 (at least one frame of ultrasonic audio data) based on initial audio data (e.g., initial audio data 11-initial audio data 14).
The timestamp alignment module is used for controlling the same timestamps of the target gesture data 1 and the ultrasonic audio data 1 which are input into the face picture generation model. For example, the time stamps controlling the target pose data frame 1 and the ultrasonic audio data frame 1 referred to previously are the same.
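The application only states that the two inputs must carry the same timestamp, without giving the alignment rule. The sketch below assumes nearest-timestamp pairing with a small tolerance; both the pairing strategy and the tolerance value are assumptions.

```python
# Hedged sketch of the timestamp alignment module: pair each ultrasonic audio
# frame with the pose frame whose timestamp is closest. pose_frames and
# audio_frames are lists of (timestamp_in_seconds, frame) pairs.
def align_by_timestamp(pose_frames, audio_frames, tolerance_s=0.005):
    aligned = []
    for ts_audio, audio in audio_frames:
        ts_pose, pose = min(pose_frames, key=lambda p: abs(p[0] - ts_audio))
        if abs(ts_pose - ts_audio) <= tolerance_s:
            aligned.append((ts_audio, pose, audio))   # same-timestamp pair for the model
    return aligned
```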
The image preprocessing module is used to perform preprocessing such as image preprocessing on the preset picture 1 containing a face to obtain the preset face picture 1. The description of this process may refer to the process in fig. 4B of preprocessing the training picture a containing a face to obtain the preset face picture a, with the preset face picture a replaced by the preset face picture 1 and the training picture a containing a face replaced by the preset picture 1 containing a face, which is not repeated in the embodiment of the present application.
The hardware layer may include at least one microphone (e.g., microphone 1-microphone m), at least one speaker (e.g., speaker 1-speaker h), a pose sensor, and a display. Here, m is an integer greater than or equal to 1, for example 4, and h is an integer greater than or equal to 1, for example 4. The number of pose sensors is greater than or equal to 1, for example 2. The pose sensor is optional for the electronic device.
The display is used for displaying the generated simulated face picture. The display is also used for displaying a peer-to-peer screen sent by other electronic devices (also referred to as peer-to-peer electronic devices) when a video call is being made.
Here, fig. 8 is illustrated with an example in which the face information generation module includes a face picture generation model. In practical applications, the model included in the face information generation module may instead be a face state analysis model (the face state analysis model 1 or the face state analysis model a) rather than a face picture generation model. In the case of a face state analysis model, the face information generation module is used to generate a face state, and the image preprocessing module may not be included in the face information generation module.
The layers cooperate with each other to complete the face information acquisition method related to the embodiment of the application. The face information acquisition method related to the embodiment of the application can be applied to a scene (video call scene) of video call between the electronic equipment and other electronic equipment (also called as opposite-end electronic equipment). Fig. 9 shows an interactive schematic diagram when a video call is conducted.
As shown in fig. 9, the electronic device (e.g., smart glasses) is in the process of making a video call with the opposite-end electronic device. The smart glasses may activate at least one microphone to collect initial audio data (e.g., the initial audio data 11-initial audio data 14 referred to previously), and simultaneously activate at least one pose sensor to collect initial pose data (e.g., the initial pose data 11 and initial pose data 12 referred to previously).
The initial audio data collected by the at least one microphone and the initial pose data collected by the at least one pose sensor are transmitted to a timestamp alignment module for timestamp alignment. Then, the initial audio data and the initial pose data after the time stamps are aligned are transmitted to a sensor data processing module for processing, so as to obtain pose data 1 (also referred to as target pose data 1) and ultrasonic audio data 1 which can be input into a face generation model a (after training). For example, the aforementioned target pose data frame 1 and ultrasonic audio data frame 1. The image preprocessing module performs image preprocessing on a preset picture 1 with a human face to obtain the preset human face picture 1.
Subsequently, the gesture data 1, the ultrasonic audio data 1 and the preset face picture 1 are input into the face generation model a to obtain the simulated face picture. The related operation of generating the simulated face picture may be performed in the face information generation module.
Then, the intelligent glasses send the simulated face picture to the opposite-end electronic equipment. The peer electronic device may display the received simulated face image with a display (e.g., screen) of the peer electronic device.
The intelligent glasses can also receive the opposite-end picture sent by the opposite-end electronic equipment. Then, the opposite-end screen is displayed on a display (e.g., a lens) of the smart glasses. The opposite-end picture can be acquired by the opposite-end electronic equipment through the camera.
In the video call scenario, the microphone and the gesture sensor collect data in real time, and the target gesture data 1 and the ultrasonic audio data 1 can be generated in real time. And the target gesture data 1 and the ultrasonic audio data 1 can be input into the face picture generation model a frame by frame, so that the face picture generation model can generate the simulated face picture in real time.
Here, fig. 9 is illustrated by taking the face generation model a as an example; the model used in the actual video call process may also be the face picture generation model 1 instead of the face picture generation model a. The description of that process may refer to the description of fig. 9 and is not repeated here. When the face picture generation model 1 is used, a pose sensor is not required for data acquisition.
An exemplary interaction procedure for implementing the face information acquisition method in a video call scenario is shown in fig. 10. The description of this process may refer to the following descriptions of step S101 to step S104, step S201a, and step S201b.
S101, the electronic equipment (such as intelligent glasses) establishes video call connection with the opposite-end electronic equipment through video call service.
After the video call connection is established, the electronic device may start to generate the simulated face picture and transmit the simulated face picture to the opposite-end electronic device. The description of this process may refer to the descriptions of the following steps S102a-S104.
S102a, the electronic equipment starts at least one loudspeaker to continuously emit ultrasonic signals, and starts at least one microphone to collect sound signals, so that initial audio data carrying at least the ultrasonic signals are obtained.
In some possible cases, the relevant content involved in the collection of the initial audio data may refer to the process of collecting the initial audio data 11-14 referred to in fig. 4C.
The ultrasonic signal carried in the initial audio data is the ultrasonic signal emitted by the speaker and reflected by the face. During the video call, the microphone is required to collect the audible sound signal of the user (e.g., the wearer of the smart glasses) in addition to the ultrasonic signal. An audible sound signal here refers to a sound signal discernible by the human ear, i.e., a non-ultrasonic signal. The frequency of a sound signal discernible by the human ear is typically in the range of 20 Hz to 20 kHz, whereas the frequency of the ultrasonic signal is typically greater than 20 kHz.
The electronic device sends the initial audio data collected by the microphone to the facial information generation module to generate the simulated face picture, and also sends the initial audio data to the video call service so that the video call service obtains audible audio data from it.
S102b, the electronic device starts at least one pose sensor to continuously collect initial pose data.
In some possible cases, the relevant content involved in acquiring the pose data may refer to the process of acquiring the initial pose data 11 and acquiring the initial pose data 12 as previously described in fig. 4C.
The electronic device transmits the initial pose data collected by the at least one pose sensor to the facial information generation module to generate the simulated face picture.
S103a, the video call service acquires audible audio data from the initial audio data.
The electronic device performs low-pass filtering and Fourier transform on the initial audio data through the video call service, filtering out the ultrasonic signal in the initial audio data to obtain the audible audio data.
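A minimal sketch of the audible-band extraction, assuming a Butterworth low-pass filter and a 48 kHz sample rate; the filter order, cutoff, and the omission of the explicit Fourier-domain step are simplifications, not the disclosed implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_audible(initial_audio: np.ndarray, fs: int = 48_000) -> np.ndarray:
    """Suppress the ultrasonic band (> 20 kHz) and keep the audible band."""
    b, a = butter(4, 18_000, btype="low", fs=fs)   # assumed 4th order, 18 kHz cutoff
    return filtfilt(b, a, initial_audio)           # zero-phase filtering
```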
S103b, the facial information generation module processes the initial pose data and the initial audio data to obtain target pose data 1 and ultrasonic audio data 1.
The facial information generation module may perform timestamp alignment on the initial pose data and the initial audio data through the timestamp alignment module therein. Then, the sensor data processing module processes the timestamp-aligned initial audio data and initial pose data to obtain the target pose data 1 and the ultrasonic audio data 1 that can be input into the trained face picture generation model a, for example the aforementioned target pose data frame 1 and ultrasonic audio data frame 1.
The sensor data processing module performs filtering (here, high-pass filtering) and Fourier transform on the timestamp-aligned initial audio data to obtain the ultrasonic audio data 1; the process of generating one frame of ultrasonic audio data may refer to the description of obtaining the ultrasonic audio data frame 1 in fig. 4C. The sensor data processing module performs six-axis fusion processing on the timestamp-aligned initial pose data to obtain the target pose data 1; the process of generating one frame of target pose data may refer to the description of the target pose data frame 1 in fig. 4C.
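For the ultrasonic branch, one conceivable preprocessing sketch is shown below: a high-pass filter followed by a short-time Fourier transform whose magnitude serves as the ultrasonic audio data frame. The filter order, STFT window, and output layout are assumptions and do not reproduce the frame format of the figures.

```python
import numpy as np
from scipy.signal import butter, filtfilt, stft

def ultrasonic_frame(aligned_audio: np.ndarray, fs: int = 48_000) -> np.ndarray:
    """High-pass the timestamp-aligned audio and convert it to a spectral frame."""
    b, a = butter(4, 20_000, btype="high", fs=fs)
    ultra = filtfilt(b, a, aligned_audio)
    _, _, spec = stft(ultra, fs=fs, nperseg=512, noverlap=256)
    return np.abs(spec).astype(np.float32)   # magnitude spectrogram as model input
```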
S104, the facial information generation module inputs the target pose data 1, the ultrasonic audio data 1, and the preset face picture 1 into the face picture generation model a to obtain a simulated face picture.
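A minimal PyTorch-style inference sketch for this step; the model interface, tensor shapes, and value range are placeholders rather than the disclosed architecture.

```python
import torch

@torch.no_grad()
def generate_simulated_frame(model, ultra_frame, pose_frame, preset_face):
    """Run one frame through a trained face picture generation model that is
    assumed to take (ultrasonic, pose, preset face) tensors and return an image."""
    ultra = torch.as_tensor(ultra_frame).unsqueeze(0).float()
    pose = torch.as_tensor(pose_frame).unsqueeze(0).float()
    face = torch.as_tensor(preset_face).unsqueeze(0).float()   # e.g. 1 x 3 x 112 x 112
    return model(ultra, pose, face).squeeze(0).clamp(0.0, 1.0)
```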
The process of obtaining the simulated face picture based on the face picture generation model a may refer to the foregoing content related to fig. 4C and is not described again here.
Then, the electronic device transmits the simulated face picture and the audible audio data to the opposite-end electronic device through the video call service.
After the video call connection is established, the electronic device may also receive the opposite-end picture and the opposite-end audio data sent by the opposite-end electronic device; the description of this process may refer to the descriptions of steps S201a and S201b below.
S201a, the electronic device processes, through the video call service, the opposite-end picture sent by the opposite-end electronic device for realizing the video call, and displays the opposite-end picture on the display.
S201b, the electronic device processes, through the video call service, the opposite-end audio data sent by the opposite-end electronic device for realizing the video call, and plays the opposite-end audio through at least one speaker.
In step S201b, the speaker playing the opposite-end audio may also need to continuously emit the ultrasonic signal, or the speaker playing the opposite-end audio may be a different speaker from the one continuously emitting the ultrasonic signal, depending on the number of speakers.
In fig. 10, step S102a and step S102b may be performed simultaneously. Step S103a and step S103b may also be performed simultaneously, or step S103a may be allowed to be performed later than step S103b, because the process of generating the simulated face picture (step S103b and its subsequent steps) takes longer. However, the time at which step S103a ends should be controlled to be consistent with the time at which step S103b ends. The aforementioned step S201a and step S201b may be performed simultaneously.
In some cases, the face picture generation model a and the face state analysis model a may be integrated into the same model (the facial information generation model); this can also be understood as constructing a branch of the face state analysis model a inside the face picture generation model a to obtain the facial information generation model. The training process of the facial information generation model may refer to fig. 11A described below.
As shown in fig. 11A, the facial information generation model in training may include a face state analysis model a to be trained and a face picture generation model a to be trained. The training process of the facial information generation model trains the face state analysis model a and the face picture generation model a at the same time. The training data used may refer to the training data involved in training the face state analysis model a and the face picture generation model a, and the training process may refer to the training process of the face state analysis model a in fig. 4A and the face picture generation model a in fig. 6A, which are not repeated here. The spliced feature 1 in fig. 11A can be regarded as the spliced feature in fig. 4A, and the spliced feature 2 in fig. 11A can be regarded as the spliced feature in fig. 6A.
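A hedged sketch of what one joint optimisation step could look like when both branches are trained together; the model interface, batch keys, and the use of Huber losses (mentioned later as one possible choice) are assumptions.

```python
import torch

def joint_training_step(model, batch, optimizer,
                        pic_loss_fn=torch.nn.HuberLoss(),
                        state_loss_fn=torch.nn.HuberLoss()):
    """One optimisation step for a facial information generation model with a
    picture branch and a state branch trained simultaneously."""
    pred_pic, pred_state = model(batch["ultrasonic"], batch["pose"], batch["preset_face"])
    loss = (pic_loss_fn(pred_pic, batch["sample_face"])
            + state_loss_fn(pred_state, batch["sample_state"]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```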
It should be noted that the definitions of the data with the same names and the manners of obtaining them in fig. 11A and fig. 4A or fig. 6A may refer to the descriptions of the related data in fig. 4A or fig. 6A, and are not repeated here. The data with the same names include the ultrasonic audio data frame a, the initial data frames (initial pose data 11, initial pose data 12, and initial audio data frames a1 to a4), and the like. The manners with the same names include filtering, Fourier transform, and the like.
The trained facial information generation model may be deployed in the smart glasses, so that the smart glasses can use it to generate a face state and/or a simulated face picture. The process of using the trained facial information generation model may refer to fig. 11B.
As shown in fig. 11B, the trained facial information generation model includes the trained face picture generation model a and the trained face state analysis model a, and may output a face state and a simulated face picture. The process by which the trained facial information generation model outputs the simulated face picture may refer to the related content on the trained face picture generation model a outputting the simulated face picture in fig. 4C. The process by which it outputs the face state may refer to the related content on the trained face state analysis model a outputting the face state in fig. 6B, and is not described again here. The spliced feature 1 in fig. 11B can be regarded as the spliced feature in fig. 4C, and the spliced feature 2 in fig. 11B can be regarded as the spliced feature in fig. 6B.
Fig. 11A and fig. 11B take the facial information generation model with pose data introduced as an example. The facial information generation model without pose data may refer to the descriptions of fig. 11A and fig. 11B with the content related to pose data removed and adaptively modified. The embodiments of the present application are not limited in this regard.
It should be noted that the facial state features involved in the embodiments of the present application are extracted based on the ultrasonic audio data and may also be referred to as ultrasonic audio features.
In the embodiments of the present application, in the scheme in which pose data is not introduced and the simulated face picture is generated based on the ultrasonic audio data alone, the obtained simulated face picture only adjusts the facial expression relative to the preset face picture and does not adjust the facial orientation. In this way, the scheme can focus on the expression of the face and avoid the larger errors that may be introduced by estimating the facial orientation.
As some possible cases, the feature extraction networks mentioned above (the face state feature extraction network, the pose feature extraction network, and the face feature extraction network) and the convolution networks (including the convolution network 11, the convolution network 12, the convolution network a1, the convolution network a2, and the like) may be models based on a residual structure, such as a convolutional neural network (CNN) or a Transformer. The deconvolution networks (including the deconvolution network 11, the deconvolution network a1, and the like) may be deep convolutional neural networks (DCNN). The loss functions mentioned above (loss function 1, loss function 2, and the like) may be Huber loss functions.
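For illustration, a residual convolutional block of the kind such feature extraction networks are often built from is sketched below; it is a generic example, not the specific networks of the embodiments.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Generic residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual (skip) connection
```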
In some possible cases, the pose features mentioned above are pose features of the face and may include orientation information of the face (head up, head down, facing left, facing right, etc.), which may be described by the rotation angles of the head about three axes, typically including a pitch angle, a yaw angle, and a roll angle.
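As a small illustration of how such angles could be obtained from fused sensor output (the quaternion source and axis convention are assumptions):

```python
from scipy.spatial.transform import Rotation as R

def head_angles_from_quaternion(qx, qy, qz, qw):
    """Convert an orientation quaternion (e.g., produced by six-axis fusion of
    accelerometer and gyroscope data) into three angles in degrees.
    The mapping of the 'xyz' axes to pitch/yaw/roll depends on the chosen convention."""
    pitch, yaw, roll = R.from_quat([qx, qy, qz, qw]).as_euler("xyz", degrees=True)
    return pitch, yaw, roll
```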
Face features generally refer to a set of features used to identify a face, including but not limited to the shape of the face, the positions and sizes of the eyes, nose, and mouth, and the texture of the skin. These features are relatively stable and unique to each individual, which is why they are often used for face recognition, for example, to determine whether a facial image matches a particular individual.
Facial state features are better suited than face features to describing facial expressions, such as the open/closed state of the eyes (wide open or closed), the moving direction of the eyeballs, the position of the eyebrows (raised or furrowed), and the shape of the mouth (smiling or sad). These features can be used to understand and interpret the facial state on a shorter time scale and are often used for expression recognition or fatigue detection tasks.
Facial state features (facial state feature 1, facial state feature a, etc.) are more primitive, abstract data than facial states. The facial state features are processed to obtain facial states that are easy for the user to understand, for example, eyes closed, mouth open, and the like.
The matrix sizes of the data shown in fig. 4A to fig. 7B, fig. 11A, and fig. 11B are exemplary descriptions and may be other values. For example, the value 112 in the 3×112×112 matrices representing the preset face picture 1 and the preset face picture a is only an example and may be another value, such as 114 or 116. The value 3 indicates that the preset face picture 1 and the preset face picture a are color images in which a pixel can be described by 3 color channels, such as RGB color channels. The value 3 in the matrices representing other pictures (e.g., the simulated face picture) likewise represents 3 color channels.
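A short sketch of shaping a preset face picture into the 3×112×112 layout used as an example above; the normalisation to [0, 1] is an assumption.

```python
import numpy as np
from PIL import Image

def preprocess_preset_face(path: str, size: int = 112) -> np.ndarray:
    """Load an RGB picture and arrange it as 3 x size x size (channels first)."""
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0   # H x W x 3, values in [0, 1]
    return arr.transpose(2, 0, 1)                     # 3 x H x W
```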
An exemplary electronic device provided by an embodiment of the present application is described below.
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
It should be understood that the electronic device may have more or fewer components than shown in fig. 12, may combine two or more components, or may have a different configuration of components. The various components shown in fig. 12 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The electronic device may include a processor 110, a memory 120, a universal serial bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, a communication module 150, a display 160, an audio module 170, and a pose sensor 180.
In an embodiment of the present application, the processor 110 may include one or more processing units, and the memory 120 may store computer instructions related to implementing the facial information acquisition method and a model (e.g., a face image generation model), so that the electronic device performs the method in the embodiment of the present application.
The USB interface 130 may be used to connect a charger to charge an electronic device, or may be used to transfer data between the electronic device and a peripheral device.
The charge management module 140 is configured to receive a charge input from a charger. The power management module 141 is used to connect the battery 142, the charge management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to supply power to the processor 110, the memory 120, the display 160, and the like.
The communication module 150 is used to implement the communication function of the electronic device. For example, the video call between the aforementioned electronic device and other electronic devices may be completed based on the communication module 150.
The display 160 may be used to display content such as images or video. The electronic device may include at least one display; for example, when the electronic device is smart glasses, two displays may be included. In some possible cases, reference may be made to the smart glasses described in fig. 1, where the display 160 includes a display in frame 1 and a display in frame 2. A display in the smart glasses also serves as a lens.
The audio module 170 may include a speaker 170A and a microphone 170B. The speaker 170A is used to play audio signals, including ultrasonic signals and audible audio signals. The microphone 170B is used to collect sound signals to obtain ultrasonic audio data, and may also obtain audible audio data. The electronic device includes at least one speaker and at least one microphone. For example, when the electronic device is smart glasses, the speaker 170A may include 4 speakers and the microphone 170B may include 4 microphones; in this case, the layout of the speakers and microphones in the electronic device may refer to the description of fig. 1.
The pose sensor 180 may be used to collect initial pose data to obtain target pose data (pose data for short).
The application also provides a chip system comprising at least one processor for implementing the functions involved in the method performed by the electronic device in any of the above embodiments.
In one possible design, the system on a chip further includes a memory to hold program instructions and data, the memory being located either within the processor or external to the processor.
The chip system may be formed of a chip or may include a chip and other discrete devices.
Alternatively, the processor in the system-on-chip may be one or more. The processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general purpose processor, implemented by reading software code stored in a memory.
Alternatively, the memory in the system-on-chip may be one or more. The memory may be integral with the processor or separate from the processor, and embodiments of the present application are not limited.
The memory may be a non-transitory memory, such as a ROM, which may be integrated on the same chip as the processor or provided separately on a different chip; the embodiments of the present application do not specifically limit the type of the memory or the manner in which the memory and the processor are arranged.
The present application also provides a computer program product comprising a computer program (which may also be referred to as code, or instructions) which, when executed, causes a computer to perform the method performed by the electronic device in any of the embodiments described above.
The present application also provides a computer-readable storage medium storing a computer program (which may also be referred to as code, or instructions). The computer program, when executed, causes a computer to perform the method performed by the electronic device in any of the embodiments described above.
While the application has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit of the embodiments.
As used in the above embodiments, the term "when" may be interpreted to mean "if", "after", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "when it is determined" or "if (a stated condition or event) is detected" may be interpreted to mean "if it is determined", "in response to determining", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)", depending on the context.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this application refers to and encompasses any or all possible combinations of one or more of the listed items.
The terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as implying or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature, and in the description of embodiments of the application, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above-described method embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium, and when the program is executed, the flows of the above-described method embodiments may be performed. The storage medium includes a ROM, a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Claims (18)

1. A facial information acquisition method, characterized in that the method is applied to a first electronic device, the first electronic device comprises a speaker and a microphone, and the method comprises: acquiring first ultrasonic audio data through the microphone, wherein the first ultrasonic audio data indicates an ultrasonic signal emitted by the speaker and reflected by a face; extracting a facial state feature of the face from the first ultrasonic audio data; and adjusting a first preset face picture based on the facial state feature to generate a simulated face picture.
2. The method according to claim 1, characterized in that adjusting the first preset face picture based on the facial state feature to generate the simulated face picture specifically comprises: extracting the face feature in the first preset face picture through the first preset face; adjusting the face feature according to the facial state feature; and mapping the adjusted face feature into the simulated face picture.
3. The method according to claim 2, characterized in that the first electronic device further comprises a pose sensor, and the method further comprises: acquiring first pose data through the pose sensor, wherein the first pose data indicates a pose of the face; and extracting a pose feature of the face from the first pose data; and adjusting the face feature according to the facial state feature specifically comprises: adjusting the face feature according to the facial state feature and the pose feature to generate the simulated face picture, wherein the simulated face picture has the pose indicated by the pose feature.
4. The method according to claim 2, characterized in that the facial state feature is determined by a first feature extraction network in a first face picture generation model, and the face feature is determined by a second feature extraction network in the first face picture generation model; the first face picture generation model further comprises a first face generation network; the first face generation network comprises a first convolution network, a first deconvolution network, and a second convolution network; the first convolution network and the first deconvolution network are configured to adjust the face feature based on the facial state feature to obtain the adjusted face feature; and the second convolution network is configured to map the adjusted face feature into the simulated face picture; wherein the first face picture generation model is obtained by training with second ultrasonic audio data and a second preset face picture as input data, a predicted face picture as output data, and a sample face picture as reference data, and the sample face picture matches the facial state feature extracted based on the second ultrasonic audio data.
5. The method according to claim 3, characterized in that the facial state feature is determined by a first feature extraction network in a second face picture generation model, the face feature is determined by a second feature extraction network in the second face picture generation model, and the pose feature is determined by a third feature extraction network in the second face picture generation model; the second face picture generation model further comprises a second face generation network; the second face generation network comprises a third convolution network, a second deconvolution network, and a fourth convolution network; the third convolution network and the second deconvolution network are configured to adjust the face feature based on the facial state feature and the pose feature to obtain the adjusted face feature; and the fourth convolution network is configured to map the adjusted face feature into the simulated face picture; wherein the second face picture generation model is obtained by training with second ultrasonic audio data, second pose data, and a second preset face picture as input data, a predicted face picture as output data, and a sample face picture as reference data, and the sample face picture matches the facial state feature extracted based on the second ultrasonic audio data.
6. The method according to any one of claims 1 to 5, characterized in that before acquiring the first ultrasonic audio data through the microphone, the method further comprises: establishing a video call connection with a second electronic device; and after generating the simulated face picture, the method further comprises: sending the simulated face picture to the second electronic device through the video call connection.
7. The method according to claim 6, characterized in that acquiring the first ultrasonic audio data through the microphone specifically comprises: collecting a sound signal through the microphone; performing high-pass filtering on the sound signal to obtain the ultrasonic signal reflected by the face; and processing the ultrasonic signal reflected by the face to obtain the first ultrasonic audio data.
8. The method according to claim 7, characterized in that the method further comprises: performing low-pass filtering on the sound signal to obtain an audible sound signal, wherein the frequency of the audible sound signal is lower than the frequency of the ultrasonic signal; processing the audible sound signal to obtain audible audio data; and sending the audible audio data to the second electronic device through the video call connection.
9. The method according to claim 3 or 5, characterized in that the first electronic device is a wearable device worn on the head, and at least one speaker, at least one microphone, and at least one pose sensor are arranged on each of the two sides of a central axis of the first electronic device.
10. The method according to claim 9, characterized in that the first electronic device is an eyeglass device, and the first electronic device comprises a first frame, a second frame, a first frame arm connected to the first frame, and a second frame arm connected to the second frame; a speaker is arranged at a first position in each of the first frame and the second frame, a microphone is arranged at a second position in each of the first frame and the second frame, and a pose sensor is arranged at a third position in each of the first frame and the second frame; and a speaker is arranged at a first position in each of the first frame arm and the second frame arm, and a microphone is arranged at a second position in each of the first frame arm and the second frame arm.
11. A facial information acquisition method, characterized in that the method is applied to a first electronic device, the first electronic device comprises a speaker and a microphone, and the method comprises: acquiring first ultrasonic audio data through the microphone, wherein the first ultrasonic audio data indicates an ultrasonic signal emitted by the speaker and reflected by a face; extracting a facial state feature of the face from the first ultrasonic audio data; and obtaining a facial state of the face by mapping based on the facial state feature.
12. The method according to claim 11, characterized in that the first electronic device further comprises a pose sensor, and the method further comprises: acquiring first pose data through the pose sensor, wherein the first pose data indicates a pose of the face; and extracting a pose feature of the face from the first pose data; and obtaining the facial state of the face by mapping based on the facial state feature specifically comprises: splicing the facial state feature and the pose feature to obtain a spliced feature; and mapping the spliced feature into the facial state.
13. The method according to claim 11, characterized in that the facial state feature is determined by a first feature extraction network in a first facial state analysis model, the first facial state analysis model further comprises a first facial analysis network, and the first facial analysis network is configured to map the facial state feature into a first facial state of the face; wherein the first facial state analysis model is obtained by training with second ultrasonic audio data as input data, a predicted facial state as output data, and a sample facial state as reference data, and the sample facial state matches the facial state feature extracted from the second ultrasonic audio data.
14. The method according to claim 12, characterized in that the facial state feature is determined by a first feature extraction network in a second facial state analysis model, and the pose feature is determined by a second feature extraction network in the second facial state analysis model; the second facial state analysis model further comprises a second facial state analysis network; the second facial state analysis network is configured to map a spliced feature into the facial state of the face, and the spliced feature comprises the pose feature and the facial state feature; wherein the second facial state analysis model is obtained by training with second ultrasonic audio data and second pose data as input data, a predicted facial state as output data, and a sample facial state as reference data, and the sample facial state matches the facial state feature extracted from the second ultrasonic audio data.
15. A first electronic device, characterized in that the first electronic device comprises one or more processors and a memory; the memory is coupled to the one or more processors and is configured to store computer program code, the computer program code comprises computer instructions, and the one or more processors invoke the computer instructions to cause the first electronic device to perform the method according to any one of claims 1 to 12.
16. A computer-readable storage medium comprising computer instructions, characterized in that, when the computer instructions are run on a first electronic device, the first electronic device is caused to perform the method according to any one of claims 1 to 12.
17. A chip system, applied to a first electronic device, characterized in that the chip system comprises one or more processors, and the processors are configured to invoke computer instructions to cause the first electronic device to perform the method according to any one of claims 1 to 12.
18. A computer program product comprising instructions, characterized in that, when the computer program product is run on a first electronic device, the first electronic device is caused to perform the method according to any one of claims 1 to 12.
CN202411483367.3A 2024-10-23 2024-10-23 Face information acquisition method and electronic equipment Active CN119181126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411483367.3A CN119181126B (en) 2024-10-23 2024-10-23 Face information acquisition method and electronic equipment

Publications (2)

Publication Number Publication Date
CN119181126A true CN119181126A (en) 2024-12-24
CN119181126B CN119181126B (en) 2025-08-15

Family

ID=93900963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411483367.3A Active CN119181126B (en) 2024-10-23 2024-10-23 Face information acquisition method and electronic equipment

Country Status (1)

Country Link
CN (1) CN119181126B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120279951A (en) * 2025-06-09 2025-07-08 山东大学 Facial expression recognition method and system based on sound perception

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150062321A1 (en) * 2013-08-27 2015-03-05 Blackberry Limited Device control by facial feature recognition
US20160085298A1 (en) * 2014-09-19 2016-03-24 Sony Corporation Ultrasound-based facial and modal touch sensing with head worn device
US20190138796A1 (en) * 2017-11-03 2019-05-09 Sony Interactive Entertainment Inc. Information processing device, information processing system, facial image output method, and program
CN112166350A (en) * 2018-06-05 2021-01-01 谷歌有限责任公司 System and method of ultrasonic sensing in smart devices
CN114710640A (en) * 2020-12-29 2022-07-05 华为技术有限公司 Video call method, device and terminal based on virtual image
CN114814800A (en) * 2021-01-19 2022-07-29 腾讯科技(深圳)有限公司 Object recognition method, device and storage medium based on ultrasonic echo
CN115116109A (en) * 2022-04-27 2022-09-27 平安科技(深圳)有限公司 Virtual character speaking video synthesis method, device, equipment and storage medium
US20230368453A1 (en) * 2022-05-12 2023-11-16 Microsoft Technology Licensing, Llc Controlling computer-generated facial expressions

Also Published As

Publication number Publication date
CN119181126B (en) 2025-08-15

Similar Documents

Publication Publication Date Title
CN112906604B (en) A behavior recognition method, device and system based on skeleton and RGB frame fusion
CN109309796A (en) Electronic device for acquiring images using multiple cameras and method for processing images therewith
CN107479699A (en) Virtual reality interaction method, device and system
US12101557B2 (en) Pose tracking for rolling shutter camera
KR102406878B1 (en) Visual search refinement for computer generated rendering environments
CN112400148B (en) Method and system for performing eye tracking using an off-axis camera
US12469234B2 (en) Robotic learning of assembly tasks using augmented reality
CN117425889A (en) Curvature estimation as a biometric signal
CN114967937B (en) Virtual human motion generation method and system
CN119181126A (en) Face information acquisition method and electronic equipment
CN112069863A (en) Face feature validity determination method and electronic equipment
US20230401673A1 (en) Systems and methods of automated imaging domain transfer
CN108537878B (en) Environment model generation method, device, storage medium and electronic device
US20250341636A1 (en) Expressions from transducers and camera
US20240087232A1 (en) Systems and methods of three-dimensional modeling based on object tracking
CN117095096A (en) Personalized human body digital twin model creation method, online driving method and device
US11282228B2 (en) Information processing device, information processing method, and program
CN115131635A (en) Training method of image generation model, image generation method, device and equipment
US20250329429A1 (en) Systems and methods for generating cosmetic medical treatment recommendations via machine learning
US12475696B2 (en) Personalized online learning for artificial reality applications
US20250080904A1 (en) Capturing Spatial Sound on Unmodified Mobile Devices with the Aid of an Inertial Measurement Unit
US20250378572A1 (en) Shared coordinate space
CN210295507U (en) Internet of things system for piano
WO2025014594A1 (en) Managing devices for virtual telepresence
TW202503485A (en) Managing devices for virtual telepresence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040

Applicant after: Honor Terminal Co.,Ltd.

Address before: 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong

Applicant before: Honor Device Co.,Ltd.

Country or region before: China

GR01 Patent grant